
=== Assistant: From your COLT’22 note, the “dream” guarantee is: ===
From your COLT’22 note, the “dream” guarantee is:
* use a standard accelerated proximal-gradient method (FISTA / linear coupling) on the ℓ1–regularized PPR objective (your eq. (3)),
* keep the per-iteration work local/sparse (only ≈ <math>O(1/\rho)</math> coordinates touched each iteration, as in the ISTA/PGD analysis),
* and thereby upgrade the total runtime from <math>\tilde O((\alpha\rho)^{-1})</math> to <math>\tilde O((\sqrt{\alpha}\,\rho)^{-1})</math>. open-problem-fountoulakis22a

You also point out the exact bottleneck: we already know accelerated methods reduce iteration complexity to <math>O(\sqrt{1/\alpha}\log(1/\varepsilon))</math>, but we don’t have a worst‑case upper bound on how many coordinates get activated/updated per iteration, so the only safe bound is the pessimistic <math>O(n\sqrt{1/\alpha}\log(1/\varepsilon))</math>. open-problem-fountoulakis22a
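As a rough illustration of how these three bounds compare, the sketch below evaluates their leading terms for given parameters. It assumes all constants and logarithmic factors are dropped (including the <math>\log(1/\varepsilon)</math> factor in the pessimistic bound), and the function name and dictionary keys are hypothetical.

```python
def runtime_bounds(alpha, rho, n):
    """Leading-order terms only; constants and log factors are dropped (assumption)."""
    return {
        # known local ISTA/PGD bound: 1 / (alpha * rho)
        "ista_local": 1.0 / (alpha * rho),
        # the hoped-for accelerated bound: 1 / (sqrt(alpha) * rho)
        "accelerated_dream": 1.0 / (alpha ** 0.5 * rho),
        # safe accelerated bound when every coordinate may be touched: n * sqrt(1/alpha)
        "accelerated_pessimistic": n / alpha ** 0.5,
    }
```

For example, with <code>alpha = 0.01</code>, <code>rho = 1e-4</code>, and <code>n = 10**6</code> the three terms are about 1e6, 1e5, and 1e7, so on such an instance the pessimistic accelerated bound is worse than the local ISTA bound. More generally, under these simplifications the pessimistic bound loses whenever <math>n > 1/(\sqrt{\alpha}\,\rho)</math>.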

==== If you mean a theorem of the form ====

: 

then my assessment is: very unlikely without adding extra safeguards or extra assumptions.

The ISTA/PGD sparsity proof in this setting hinges on very special monotonicity/order properties of the non‑accelerated proximal-gradient map for these PageRank/M‑matrix quadratics (this is what lets you argue you never need to “touch” most coordinates). Your note explicitly says that analysis “does not apply to the accelerated version.” open-problem-fountoulakis22a

Acceleration introduces an extrapolated point (schematically <math>y_k=x_k+\beta_k(x_k-x_{k-1})</math>), which breaks exactly the kind of order-preserving / “push”-style monotonicity that local PageRank proofs exploit. Once you lose that, it becomes extremely difficult to rule out “spurious activations” of many coordinates in a single iteration, because the prox step can turn on any coordinate whose pre-prox value crosses the threshold.
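To experiment with this mechanism, here is a minimal sketch that runs proximal gradient with and without extrapolation and records how many coordinates are nonzero after each iteration. It is not the algorithm from the note: it assumes a generic strongly convex quadratic with a uniform ℓ1 weight <code>lam</code> rather than the PPR objective, and a constant momentum <math>\beta=(\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})</math>. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau*||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_supports(Q, b, lam, iters, momentum):
    """Minimize 0.5*x^T Q x - b^T x + lam*||x||_1 for symmetric positive definite Q.

    Returns the final iterate and the number of nonzero coordinates
    after each iteration. momentum=False gives plain ISTA with step 1/L.
    """
    eigs = np.linalg.eigvalsh(Q)
    mu, L = eigs[0], eigs[-1]
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)) if momentum else 0.0
    x_prev = x = np.zeros(len(b))
    counts = []
    for _ in range(iters):
        y = x + beta * (x - x_prev)   # extrapolated point y_k
        grad = Q @ y - b              # gradient of the smooth part at y_k
        # a coordinate turns on whenever |y - grad/L| exceeds lam/L
        x_prev, x = x, soft_threshold(y - grad / L, lam / L)
        counts.append(int(np.count_nonzero(x)))
    return x, counts
```

Comparing the two <code>counts</code> lists on a given instance shows whether the extrapolated run ever touches more coordinates than the plain run. That is only an empirical observation about that instance, not evidence for or against a worst-case bound.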

There’s also a broader, well-known phenomenon in composite/proximal methods:
* Finite-time support (manifold) identification is typically provable only under a nondegeneracy / strict complementarity margin at the solution.
* Without such a margin (degenerate cases), the active set can be “almost active” for many coordinates, and accelerated methods can oscillate around the threshold.

This is not specific to PageRank; it’s a general limitation noted in the manifold-identification literature. Optimization Online<ref>{{cite web|title=Optimization Online|url=https://optimization-online.org/wp-content/uploads/2019/03/7109.pdf|publisher=Optimization Online|access-date=2025-12-19}}</ref>

That matters here because your open problem is explicitly worst-case (“it is not even clear if acceleration would not lead to a worse running time complexity in the worst case”). open-problem-fountoulakis22a

Worst-case families can easily include symmetric/degenerate graphs where many coordinates sit exactly on (or arbitrarily close to) the activation boundary.

==== The Martínez‑Rubio–Wirth–Pokutta COLT’23 paper essentially answers your question in the affirmative by changing the algorithmic object: they build an active-set accelerated method that maintains the invariants that vanilla acceleration breaks. ====

Concretely, their Algorithm 4 (ASPR) repeatedly:
# defines the current active set <math>S^{(t)}</math> from negative gradient coordinates,
# runs an accelerated projected GD subroutine (APGD) on the restricted subspace <math>C^{(t)}=\mathrm{span}\{e_i:i\in S^{(t)}\}\cap\mathbb{R}^n_{\ge 0}</math>,
# applies a small “shrink” step <math>x^{(t+1)}\leftarrow \max\{0,\bar x^{(t+1)}-\delta_t \mathbf{1}\}</math>,
# and expands <math>S^{(t)}</math> again. Proceedings of Machine Learning Research<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v195/martinez-rubio23b/martinez-rubio23b.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>
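The two elementary operations in steps 1 and 3 can be sketched as follows. This is only an illustration of those steps as described above, not an implementation of ASPR: the APGD subroutine, the choice of <math>\delta_t</math>, and the exact point at which the gradient is evaluated are specified in the cited paper and are not reproduced here. The function names are hypothetical.

```python
import numpy as np

def active_set_from_gradient(grad):
    """Indices whose gradient coordinate is strictly negative (step 1)."""
    return np.flatnonzero(grad < 0)

def shrink(x_bar, delta):
    """Coordinatewise max{0, x_bar - delta} (step 3)."""
    return np.maximum(0.0, x_bar - delta)
```

Both operations only read or write the coordinates they are given, which is consistent with the restricted-space structure of the method.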

That APGD call is exactly the “expensive subproblem” you’re referring to. It’s not “vanilla FISTA on the original full problem”; it’s accelerated optimization on an evolving restricted space, with additional machinery to guarantee <math>S^{(t)}\subseteq S^*</math> and <math>x^{(t)}\le x^*</math> coordinatewise. They explicitly highlight that this is what makes provable sparse updates possible and that this answers your COLT’22 open question. Proceedings of Machine Learning Research<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v195/martinez-rubio23b/martinez-rubio23b.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>

So the existence of the Martínez‑Rubio et al. solution is also evidence that the missing ingredient is not a standard acceleration rate proof; it is the active-set control that keeps iterates from “lighting up” irrelevant coordinates.

==== If you relax your goal slightly, there is a more “standard” line of argument you could potentially formalize: ====

Assume a margin at the optimum that separates active from inactive coordinates (a strict complementarity / nondegeneracy condition). In ℓ1 language this is the usual requirement that, for inactive coordinates, the optimality condition is strict (not tight).

Then the standard identification results imply that after finitely many iterations the algorithm identifies the correct manifold/support, after which the method behaves like an accelerated method on a fixed low-dimensional subspace (size <math>|S^*|</math>, which is ≈ <math>O(1/\rho)</math> in your application regime). The catch is:
* the identification time bound typically depends on the margin parameter (often called something like <math>\delta_{\min}</math>), and in degenerate instances <math>\delta_{\min}</math> can be arbitrarily small or zero. Optimization Online<ref>{{cite web|title=Optimization Online|url=https://optimization-online.org/wp-content/uploads/2019/03/7109.pdf|publisher=Optimization Online|access-date=2025-12-19}}</ref>
* so you don’t get a clean worst-case <math>\tilde O((\sqrt{\alpha}\rho)^{-1})</math> bound purely in terms of <math>\alpha,\rho</math>.

In other words, you can likely get something like

<math display="block">\text{time} \;\lesssim\; \underbrace{\frac{1}{\rho}\,\mathrm{polylog}\!\Big(\frac{1}{\delta_{\min}}\Big)}_{\text{identification phase}} \;+\; \underbrace{\frac{1}{\rho\sqrt{\alpha}}\log\!\Big(\frac{1}{\varepsilon}\Big)}_{\text{accelerated phase on }S^*},</math>
but the <math>\delta_{\min}</math>-dependence is the part that prevents a clean worst-case theorem.
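To make the margin concrete, the sketch below computes one common version of it for a problem of the form <math>\min_x f(x)+\sum_i \lambda_i |x_i|</math>: the smallest slack <math>\lambda_i-|\nabla_i f(x^*)|</math> over coordinates with <math>x^*_i=0</math>. This definition, the function name, and the zero tolerance are assumptions for illustration; the cited identification results may define and scale their margin parameter differently.

```python
import numpy as np

def complementarity_margin(grad_at_opt, x_opt, lam, tol=1e-12):
    """Smallest slack lam_i - |grad_i| over inactive coordinates (x_opt_i == 0).

    A value of 0 means some inactive coordinate sits exactly on the
    activation boundary (degenerate case). lam may be a scalar or a
    per-coordinate weight vector.
    """
    x_opt = np.asarray(x_opt, dtype=float)
    grad_at_opt = np.asarray(grad_at_opt, dtype=float)
    lam = np.broadcast_to(np.asarray(lam, dtype=float), x_opt.shape)
    inactive = np.abs(x_opt) <= tol
    if not inactive.any():
        return np.inf  # no inactive coordinates, nothing to separate
    return float(np.min(lam[inactive] - np.abs(grad_at_opt[inactive])))
```

On symmetric instances one would expect this quantity to be zero or very small, which is the situation in which a bound depending on <math>1/\delta_{\min}</math> becomes uninformative.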

* If you want a worst‑case guarantee over all graphs/instances of the kind stated in your open problem (i.e., a universal <math>\delta</math> such that plain FISTA/linear coupling updates <math>\le \delta/\rho</math> coordinates each iteration), then I don’t see a path via “standard acceleration analysis” alone, because the whole difficulty is controlling the extrapolated iterates’ active set, and generic proximal identification theory tells you you cannot control that uniformly without a nondegeneracy margin.
* The only provable acceleration-with-sparsity result I’m aware of in this line is essentially active-set acceleration (Martínez‑Rubio et al.), which introduces exactly the kind of “subproblem” / restricted-space solves you want to avoid. Proceedings of Machine Learning Research<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v195/martinez-rubio23b/martinez-rubio23b.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>

===== The most plausible compromise (theorem-wise) is: =====
# prove a finite identification result for your PageRank instance class (needs a margin condition or a probabilistic model), then
# run a single accelerated method on the identified support.

This avoids repeated expensive restricted solves, but it won’t give a clean worst-case <math>\tilde O((\sqrt{\alpha}\rho)^{-1})</math> unless you can lower-bound the margin in terms of <math>\rho</math> for all graphs, which is exactly the kind of statement that seems false in the worst case (symmetries/degeneracy).

If you tell me which “standard accelerated algorithm” you have in mind (plain FISTA, strongly-convex APG with fixed <math>\beta</math>, linear coupling variant, etc.), I can be more concrete about which invariants one would need to establish (e.g., order-bounds <math>0\le y_k\le c\,x^*</math>, or a per-iteration certificate that prevents activation outside a small frontier), and where the known theory breaks.