=== Assistant ===

Below is a self-contained LaTeX note that proves a finite active-set identification result for standard accelerated proximal gradient (APG / FISTA-style momentum) on the ℓ1-regularized PageRank objective (3) from your COLT’22 open problem note (open-problem-fountoulakis22a), under a strict-complementarity “margin” assumption. The proof shows that once the accelerated iterates are close enough to the optimum (with a radius proportional to the margin), no coordinates outside the optimal support can ever be activated again; it also gives an explicit bound on the identification time in terms of the margin and the accelerated convergence rate.

<syntaxhighlight lang="latex">
\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsthm,mathtools}
\usepackage[margin=1in]{geometry}
\usepackage{hyperref}

% --------------------
% Theorem environments
% --------------------
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{assumption}{Assumption}
\theoremstyle{remark}
\newtheorem{remark}{Remark}

% --------------------
% Macros
% --------------------
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\ip}[2]{\left\langle #1,#2 \right\rangle}
\newcommand{\prox}{\operatorname{prox}}
\newcommand{\supp}{\operatorname{supp}}
\newcommand{\diag}{\operatorname{diag}}

\title{Finite Active-Set Identification for Accelerated Proximal Gradient on $\ell_1$-Regularized PageRank Under a Margin Condition}
\author{}
\date{}

\begin{document}
\maketitle

\begin{abstract}
We consider the $\ell_1$-regularized PageRank objective
\[
\min_{x\in\R^n}\; F(x) \;:=\; f(x) + g(x),
\qquad
g(x) := \rho\alpha \norm{D^{1/2}x}_1,
\]
where $f$ is the smooth quadratic arising from personalized PageRank. We prove a \emph{finite identification} statement for standard accelerated proximal-gradient methods: under a strict-complementarity \emph{margin} condition at the unique minimizer $x^\star$, all coordinates outside $\supp(x^\star)$ become and remain identically zero after finitely many iterations. An explicit identification-time bound is given in terms of the margin and the accelerated convergence rate.
\end{abstract}

\section{Problem setup: $\ell_1$-regularized PageRank as a composite strongly convex problem}

Let $G=(V,E)$ be an undirected graph with $n=|V|$ nodes. Let $A\in\R^{n\times n}$ denote the adjacency matrix and $D=\diag(d_1,\dots,d_n)$ the degree matrix. Fix $\alpha\in(0,1]$ and a seed vector $s\in\R^n$ (e.g.\ an indicator of a seed node). Define the (symmetric) normalized Laplacian
\[
L := I - D^{-1/2} A D^{-1/2}.
\]
It is standard that $L$ is positive semidefinite and $\lambda(L)\subseteq[0,2]$ for undirected graphs. Following the $\ell_1$-regularized PageRank formulation, define the matrix
\[
Q := \alpha I + \frac{1-\alpha}{2}L,
\]
and the smooth function
\begin{equation}\label{eq:f-def}
f(x) := \frac12 x^\top Q x - \alpha\, x^\top D^{-1/2}s.
\end{equation}
Let the nonsmooth regularizer be
\begin{equation}\label{eq:g-def}
g(x) := \rho\alpha\, \norm{D^{1/2}x}_1 = \sum_{i=1}^n \lambda_i |x_i|,
\qquad
\lambda_i := \rho\alpha\sqrt{d_i}.
\end{equation}
We study
\begin{equation}\label{eq:main-problem}
\min_{x\in\R^n} F(x) := f(x)+g(x).
\end{equation}

\subsection{Smoothness and strong convexity constants for $f$ and $F$}

Since $\nabla^2 f(x)=Q$ is constant, we can read smoothness and strong convexity off the spectrum of $Q$. Because $\lambda(L)\subseteq[0,2]$, we have
\[
\lambda(Q) = \left\{\alpha + \frac{1-\alpha}{2}\lambda: \lambda\in\lambda(L)\right\}
\subseteq [\alpha,\, \alpha+(1-\alpha)] = [\alpha,1].
\]
Therefore:
\begin{itemize}
\item $\nabla f$ is $L_f$-Lipschitz with $L_f=\norm{Q}_2\le 1$.
\item $f$ is $\mu$-strongly convex with $\mu=\lambda_{\min}(Q)\ge \alpha$.
\end{itemize}
In what follows we write $L:=L_f$ and $\mu:=\alpha$ for simplicity (reusing the symbol $L$, since the Laplacian itself is no longer needed); the argument works for general $L,\mu$. Since $g$ is convex, the sum $F=f+g$ is also $\mu$-strongly convex. Hence $F$ has a unique minimizer $x^\star$.

\section{Prox operator and optimality conditions}

\subsection{Weighted soft-thresholding proximal operator}

Because $g(x)=\sum_i \lambda_i|x_i|$ is separable, its proximal operator acts coordinatewise.

\begin{lemma}[Weighted soft-thresholding]\label{lem:prox}
Let $t>0$. For any $u\in\R^n$,
\[
\prox_{t g}(u)_i \;=\; \operatorname{sign}(u_i)\,\max\{ |u_i| - t\lambda_i,\; 0\},
\qquad i=1,\dots,n.
\]
In particular, $\prox_{t g}(u)_i=0$ if and only if $|u_i|\le t\lambda_i$.
\end{lemma}

\begin{proof}
This is standard: minimize $\frac{1}{2t}\norm{x-u}_2^2 + \sum_i \lambda_i|x_i|$, which separates over coordinates. For each coordinate one obtains the soft-thresholding solution with threshold $t\lambda_i$.
\end{proof}

\subsection{KKT conditions and the active/inactive sets}

Let $x^\star$ denote the unique minimizer of \eqref{eq:main-problem}. The first-order optimality condition is
\[
0 \in \nabla f(x^\star) + \partial g(x^\star).
\]
Coordinatewise, since $\partial|x_i|=\{\operatorname{sign}(x_i)\}$ for $x_i\neq 0$ and $\partial|0|=[-1,1]$, we obtain:

\begin{lemma}[Coordinatewise optimality conditions]\label{lem:kkt}
For each $i=1,\dots,n$:
\begin{itemize}
\item If $x_i^\star\neq 0$, then $\nabla f(x^\star)_i + \lambda_i\,\operatorname{sign}(x_i^\star)=0$. Equivalently, $|\nabla f(x^\star)_i|=\lambda_i$.
\item If $x_i^\star=0$, then $|\nabla f(x^\star)_i|\le \lambda_i$.
\end{itemize}
\end{lemma}

Define the \emph{active set} $S^\star := \supp(x^\star)=\{i:x_i^\star\neq 0\}$ and the \emph{inactive set} $I^\star := \{1,\dots,n\}\setminus S^\star = \{i:x_i^\star=0\}$.

\section{A margin (strict complementarity) assumption}

The key idea is to assume that the inactive coordinates satisfy \emph{strict} inequality in the KKT condition.

\begin{assumption}[Margin / strict complementarity]\label{ass:margin}
Define the margin parameter
\begin{equation}\label{eq:margin}
\gamma \;:=\; \min_{i\in I^\star}\bigl(\lambda_i - |\nabla f(x^\star)_i|\bigr).
\end{equation}
Assume $\gamma>0$. Equivalently, for every inactive coordinate $i\in I^\star$,
\[
|\nabla f(x^\star)_i|\le \lambda_i - \gamma.
\]
\end{assumption}

\begin{remark}
This assumption is standard in manifold-identification results for $\ell_1$-type regularizers. It may fail (i.e.\ $\gamma=0$) in degenerate instances; without a positive margin, one cannot in general guarantee finite-time identification.
\end{remark}

\section{Accelerated proximal-gradient method}

Fix a step size $t\in(0,1/L]$.
Consider the standard accelerated proximal-gradient iteration (Nesterov / FISTA-style momentum):
\begin{equation}\label{eq:apg}
\begin{aligned}
y_k &= x_k + \beta_k (x_k - x_{k-1}),\\
x_{k+1} &= \prox_{t g}\bigl( y_k - t \nabla f(y_k)\bigr),
\end{aligned}
\qquad k\ge 0,
\end{equation}
with initialization $x_{-1}=x_0$ (so $y_0=x_0$). For strongly convex objectives one often uses a constant $\beta_k\equiv\beta\in[0,1)$, e.g.\ $\beta=\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$, but the identification argument below only needs that $y_k\to x^\star$ and (for explicit bounds) a known convergence rate.

\section{Key stability lemma: the margin prevents activation of inactive coordinates}

\begin{lemma}[Inactivity is stable in a neighborhood]\label{lem:stable-zero}
Assume $f$ has $L$-Lipschitz gradient in $\ell_2$, and Assumption~\ref{ass:margin} holds. Let $t\in(0,1/L]$ and define the radius
\begin{equation}\label{eq:radius}
r \;:=\; \frac{t\gamma}{1+tL}.
\end{equation}
Then for any $y\in\R^n$ satisfying $\norm{y-x^\star}_2 \le r$, the proximal-gradient update
\[
x^+ := \prox_{t g}\bigl(y - t\nabla f(y)\bigr)
\]
satisfies $x^+_i = 0$ for all $i\in I^\star$.
\end{lemma}

\begin{proof}
Fix an inactive coordinate $i\in I^\star$, so $x_i^\star=0$. Let
\[
u := y - t\nabla f(y).
\]
By Lemma~\ref{lem:prox}, it suffices to show $|u_i|\le t\lambda_i$. Write
\[
u_i = (y_i-x_i^\star) - t\bigl(\nabla f(y)_i - \nabla f(x^\star)_i\bigr) \;-\; t\nabla f(x^\star)_i.
\]
Using the triangle inequality and $x_i^\star=0$,
\begin{align*}
|u_i| &\le |y_i-x_i^\star| + t|\nabla f(y)_i - \nabla f(x^\star)_i| + t|\nabla f(x^\star)_i| \\
&\le \norm{y-x^\star}_2 + t\norm{\nabla f(y)-\nabla f(x^\star)}_2 + t|\nabla f(x^\star)_i|.
\end{align*}
Since $\nabla f$ is $L$-Lipschitz in $\ell_2$, $\norm{\nabla f(y)-\nabla f(x^\star)}_2 \le L\norm{y-x^\star}_2$, hence
\[
|u_i|\le (1+tL)\norm{y-x^\star}_2 + t|\nabla f(x^\star)_i|.
\]
By the margin assumption, $|\nabla f(x^\star)_i|\le \lambda_i - \gamma$, so
\[
|u_i|\le (1+tL)\norm{y-x^\star}_2 + t(\lambda_i-\gamma).
\]
If $\norm{y-x^\star}_2\le r=\frac{t\gamma}{1+tL}$, then
\[
|u_i|\le (1+tL)\cdot \frac{t\gamma}{1+tL} + t(\lambda_i-\gamma) = t\lambda_i.
\]
Therefore $|u_i|\le t\lambda_i$, and Lemma~\ref{lem:prox} gives $x^+_i=0$. Since $i\in I^\star$ was arbitrary, $x^+_{I^\star}=0$.
\end{proof}

\section{Main result: finite identification for accelerated proximal gradient}

\subsection{Existence of a finite identification time}

\begin{theorem}[Finite active-set identification (qualitative)]\label{thm:finite-id-qual}
Suppose Assumption~\ref{ass:margin} holds (i.e.\ $\gamma>0$). Consider the accelerated proximal-gradient iteration \eqref{eq:apg} with any step $t\in(0,1/L]$. If $y_k\to x^\star$ in $\ell_2$, then there exists a finite index $K$ such that for all $k\ge K$,
\[
x_{k+1,i}=0 \quad \text{for every } i\in I^\star.
\]
Equivalently, after finitely many iterations, all iterates satisfy $\supp(x_k)\subseteq S^\star$.
\end{theorem}

\begin{proof}
By Lemma~\ref{lem:stable-zero}, any iterate $y$ in the ball $\{y:\norm{y-x^\star}_2\le r\}$ (with $r$ as in \eqref{eq:radius}) produces a proximal-gradient update $x^+$ with $x^+_{I^\star}=0$. Since $y_k\to x^\star$, there exists $K$ such that $\norm{y_k-x^\star}_2\le r$ for all $k\ge K$. Applying Lemma~\ref{lem:stable-zero} at each such $k$ yields $x_{k+1,I^\star}=0$ for all $k\ge K$.
\end{proof}

\subsection{An explicit identification-time bound using a standard accelerated rate}

To make Theorem~\ref{thm:finite-id-qual} quantitative, we combine it with a standard accelerated convergence rate for strongly convex composite minimization. We state the rate as a lemma without proof; it is standard content on accelerated algorithms.

\begin{lemma}[Standard accelerated rate (strongly convex composite)]\label{lem:apg-rate}
Assume $F=f+g$ is $\mu$-strongly convex and $\nabla f$ is $L$-Lipschitz. Consider an accelerated proximal-gradient scheme \eqref{eq:apg} with step $t=1/L$ and a strongly-convex momentum choice (e.g.\ constant $\beta=\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$). Then there exist constants $C>0$ and $q\in(0,1)$ such that
\begin{equation}\label{eq:rate-gap}
F(x_k)-F(x^\star) \le C q^k \qquad \text{for all }k\ge 0,
\end{equation}
where one may take $q=1-\sqrt{\mu/L}$ (up to standard constant-factor variants).
\end{lemma}

\begin{remark}
Lemma~\ref{lem:apg-rate} is a standard result for accelerated proximal-gradient methods on $\mu$-strongly convex composite objectives. It is proved, for example, via estimate sequences or by analyzing a suitable Lyapunov/potential function. The precise constant $C$ is not important for the identification argument; only geometric decay is used.
\end{remark}

\begin{theorem}[Identification time bound]\label{thm:finite-id-quant}
Suppose Assumption~\ref{ass:margin} holds with margin $\gamma>0$, and suppose the accelerated method satisfies Lemma~\ref{lem:apg-rate} with parameters $(C,q)$ and step $t=1/L$. Let $\mu$ be the strong convexity parameter of $F$. Define
\[
r := \frac{t\gamma}{1+tL} = \frac{\gamma}{2L}.
\]
Then for every $k\ge 1$,
\[
\norm{y_k-x^\star}_2 \le \sqrt{\frac{2C}{\mu}}\; q^{\frac{k-1}{2}} \left( (1+\beta)\sqrt{q} + \beta \right),
\]
and therefore it suffices that
\begin{equation}\label{eq:K-bound}
k \;\ge\; 1 + \frac{2\log\!\left(\displaystyle \sqrt{\frac{2C}{\mu}}\, \frac{( (1+\beta)\sqrt{q} + \beta )}{r} \right)}{\log(1/q)}
\end{equation}
to guarantee $x_{k+1,i}=0$ for all $i\in I^\star$. In particular, the identification time is $O\!\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\gamma}\right)$ up to problem-dependent constants.
\end{theorem}

\begin{proof}
From $\mu$-strong convexity of $F$ we have, for all $x$,
\[
F(x)-F(x^\star) \ge \frac{\mu}{2}\norm{x-x^\star}_2^2,
\]
hence \eqref{eq:rate-gap} implies
\[
\norm{x_k-x^\star}_2 \le \sqrt{\frac{2C}{\mu}}\, q^{k/2}.
\]
Now bound $y_k$ using the definition $y_k=x_k+\beta(x_k-x_{k-1})$:
\begin{align*}
\norm{y_k-x^\star}_2
&= \norm{(1+\beta)(x_k-x^\star) - \beta(x_{k-1}-x^\star)}_2 \\
&\le (1+\beta)\norm{x_k-x^\star}_2 + \beta\norm{x_{k-1}-x^\star}_2 \\
&\le \sqrt{\frac{2C}{\mu}}\left( (1+\beta)q^{k/2} + \beta q^{(k-1)/2}\right) \\
&= \sqrt{\frac{2C}{\mu}}\; q^{(k-1)/2} \left( (1+\beta)\sqrt{q} + \beta\right).
\end{align*}
If the right-hand side is at most $r$, then Lemma~\ref{lem:stable-zero} applies and gives $x_{k+1,I^\star}=0$. Solving for $k$ yields \eqref{eq:K-bound}. The stated big-$O$ form follows from the standard choice $q\approx 1-\sqrt{\mu/L}$, for which $\log(1/q) \asymp \sqrt{\mu/L}$.
\end{proof}

\section{Specialization to $\ell_1$-regularized PageRank}

For the PageRank objective \eqref{eq:f-def}--\eqref{eq:g-def} we have $L\le 1$ and $\mu=\alpha$. Taking $t=1$ (valid since $L\le 1$), the stability radius in Lemma~\ref{lem:stable-zero} becomes
\[
r = \frac{\gamma}{1+L}\ge \frac{\gamma}{2}.
\]
Thus, once the extrapolated points $y_k$ of an accelerated method stay within distance $\gamma/2$ of $x^\star$, the inactive coordinates $I^\star=\{i:x_i^\star=0\}$ are never activated again.

\begin{remark}[What this does and does not prove]
Theorems~\ref{thm:finite-id-qual} and~\ref{thm:finite-id-quant} provide a clean way to control \emph{eventual} sparsity using a \emph{margin parameter} $\gamma$. However:
\begin{itemize}
\item The margin $\gamma$ can be arbitrarily small (or zero) in worst-case instances, so this does not by itself yield a worst-case runtime bound depending only on $(\alpha,\rho)$.
\item The results do not bound how many coordinates might become active during an initial transient phase (before the iterates enter the $\gamma$-neighborhood).
\end{itemize}
Nonetheless, under a positive margin, the accelerated method identifies the optimal zero pattern in finite time.
\end{remark}

\section*{Acknowledgment of standard ingredients}

The only nontrivial ingredients beyond basic convex analysis are: (i) the proximal map of the weighted $\ell_1$ norm (Lemma~\ref{lem:prox}), (ii) Lipschitz continuity of $\nabla f$, and (iii) a standard accelerated linear convergence rate for strongly convex composite minimization (Lemma~\ref{lem:apg-rate}). The identification argument itself (Lemma~\ref{lem:stable-zero} and Theorems~\ref{thm:finite-id-qual}--\ref{thm:finite-id-quant}) is self-contained.

\end{document}
</syntaxhighlight>
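To see the statement concretely, here is a minimal NumPy sketch (not part of the note above, and only one possible implementation): it assembles Q, b, and the weights λ_i for a small hand-coded graph, runs the accelerated proximal-gradient iteration with constant momentum and the weighted soft-thresholding prox, estimates the margin γ at a high-accuracy reference solution, and reports the first index after which every iterate's support is contained in supp(x*). The toy graph, the parameters α = 0.25 and ρ = 0.25, and all function names are illustrative choices, not taken from the note.

<syntaxhighlight lang="python">
# Illustrative sketch (assumptions, not the note's code): APG / FISTA-style momentum
# on F(x) = 1/2 x'Qx - b'x + sum_i lam_i |x_i| with Q = alpha*I + (1-alpha)/2 * L_sym,
# b = alpha * D^{-1/2} s, lam_i = rho*alpha*sqrt(d_i). Graph and parameters are arbitrary.
import numpy as np

def build_problem(A, alpha, rho, seed):
    d = A.sum(axis=1)                                         # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_inv_sqrt @ A @ D_inv_sqrt      # normalized Laplacian
    Q = alpha * np.eye(len(d)) + 0.5 * (1.0 - alpha) * L_sym  # Hessian of f
    b = alpha * (D_inv_sqrt @ seed)                           # linear term of f
    lam = rho * alpha * np.sqrt(d)                            # per-coordinate l1 weights
    return Q, b, lam

def prox_weighted_l1(u, t, lam):
    """Weighted soft-thresholding: prox of x -> t * sum_i lam_i |x_i|."""
    return np.sign(u) * np.maximum(np.abs(u) - t * lam, 0.0)

def apg(Q, b, lam, alpha, iters):
    """Accelerated proximal gradient with the constant strongly-convex momentum choice."""
    Lf, mu, t = 1.0, alpha, 1.0                               # L <= 1, mu = alpha, step t = 1/L
    beta = (np.sqrt(Lf) - np.sqrt(mu)) / (np.sqrt(Lf) + np.sqrt(mu))
    x_prev = x = np.zeros(len(b))
    supports = []
    for _ in range(iters):
        y = x + beta * (x - x_prev)                           # extrapolation step
        x_prev, x = x, prox_weighted_l1(y - t * (Q @ y - b), t, lam)
        supports.append(frozenset(np.flatnonzero(x)))
    return x, supports

if __name__ == "__main__":
    # Toy graph: a 5-cycle with one chord plus a pendant node; seed mass on node 0.
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (4, 5)]:
        A[i, j] = A[j, i] = 1.0
    alpha, rho = 0.25, 0.25
    seed = np.zeros(6); seed[0] = 1.0
    Q, b, lam = build_problem(A, alpha, rho, seed)

    x_star, _ = apg(Q, b, lam, alpha, iters=20000)            # high-accuracy reference solution
    S_star = frozenset(np.flatnonzero(np.abs(x_star) > 1e-12))
    inactive = [i for i in range(len(b)) if i not in S_star]
    grad_star = Q @ x_star - b
    gamma = min(lam[i] - abs(grad_star[i]) for i in inactive) if inactive else float("inf")
    print("supp(x*):", sorted(S_star), "  margin gamma:", gamma)

    _, supports = apg(Q, b, lam, alpha, iters=200)
    ident = next((k for k in range(len(supports))
                  if all(s <= S_star for s in supports[k:])), None)
    print("supp(x_k) contained in supp(x*) for all k >=",
          None if ident is None else ident + 1)
</syntaxhighlight>

On instances where the computed margin comes out positive, one expects the reported index to be finite, in line with the qualitative theorem; with a degenerate (zero) margin the note makes no such guarantee.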