
==== The Martínez‑Rubio–Wirth–Pokutta COLT’23 paper essentially answers your question in the affirmative by changing the algorithmic object: they build an active-set accelerated method that maintains the invariants that vanilla acceleration breaks. ====

Concretely, their Algorithm 4 (ASPR) repeatedly:
# defines the current active set <math>S^{(t)}</math> from negative gradient coordinates,
# runs an accelerated projected GD subroutine (APGD) on the restricted subspace <math>C^{(t)}=\mathrm{span}\{e_i : i\in S^{(t)}\}\cap\mathbb{R}^n_{\ge 0}</math>,
# applies a small “shrink” step <math>x^{(t+1)}\leftarrow \max\{0,\bar x^{(t+1)}-\delta_t \mathbf{1}\}</math>,
# and expands <math>S^{(t)}</math> again.<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v195/martinez-rubio23b/martinez-rubio23b.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>
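
The four steps above can be sketched in a few lines of Python. This is only an illustrative toy and not the paper's Algorithm 4: it assumes a generic nonnegative quadratic <math>f(x)=\tfrac12 x^\top Q x - b^\top x</math> over <math>x\ge 0</math>, uses a plain FISTA-style loop as a stand-in for APGD, and replaces the paper's <math>\delta_t</math> schedule and inner iteration counts with fixed placeholder values. The names <code>apgd_restricted</code> and <code>active_set_sketch</code> are hypothetical.

```python
import numpy as np


def apgd_restricted(Q, b, x0, S, iters):
    """FISTA-style projected accelerated GD on the coordinates in S only.

    Stand-in for the APGD subroutine (assumption: the objective is
    f(x) = 0.5 * x^T Q x - b^T x with x >= 0, and x0 is zero outside S).
    """
    idx = np.flatnonzero(S)
    Qs, bs = Q[np.ix_(idx, idx)], b[idx]
    L = np.linalg.norm(Qs, 2)           # smoothness constant of the restricted quadratic
    z = x0[idx].copy()                  # current iterate on the restricted space
    y = z.copy()                        # extrapolated (momentum) point
    t = 1.0
    for _ in range(iters):
        z_next = np.maximum(0.0, y - (Qs @ y - bs) / L)   # projected gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)  # momentum extrapolation
        z, t = z_next, t_next
    x = np.zeros_like(x0)
    x[idx] = z                          # coordinates outside S stay exactly zero
    return x


def active_set_sketch(Q, b, outer_iters=5, inner_iters=200, delta=1e-3):
    """Toy active-set loop: expand S, solve on S, shrink, repeat.

    outer_iters, inner_iters and the constant delta are placeholders,
    not the schedule used in the paper.
    """
    n = len(b)
    x = np.zeros(n)
    S = np.zeros(n, dtype=bool)
    for _ in range(outer_iters):
        S |= (Q @ x - b) < 0            # steps 1 and 4: add negative-gradient coordinates
        if not S.any():
            break                       # x = 0 already has no descent coordinate
        x_bar = apgd_restricted(Q, b, x, S, inner_iters)  # step 2
        x = np.maximum(0.0, x_bar - delta)                # step 3: shrink
    return x, S
```

For instance, calling <code>active_set_sketch</code> with <code>Q = [[2,-1,0],[-1,2,0],[0,0,1]]</code> and <code>b = [1,0,-1]</code> never activates the third coordinate, because its gradient stays positive throughout. In this toy instance the work is therefore confined to the two coordinates that carry the solution, which is the behaviour the active-set construction is designed to guarantee.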

That APGD call is exactly the “expensive subproblem” you’re referring to. It’s not “vanilla FISTA on the original full problem”; it’s accelerated optimization on an evolving restricted space, with additional machinery to guarantee <math>S^{(t)}\subseteq S^*</math> and <math>x^{(t)}\le x^*</math> coordinatewise. They explicitly highlight that this is what makes provable sparse updates possible and that this answers your COLT’22 open question.<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v195/martinez-rubio23b/martinez-rubio23b.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>

So the existence of the Martínez-Rubio et al. solution is also evidence that the missing ingredient is not a standard acceleration rate proof; it is the active-set control that keeps iterates from “lighting up” irrelevant coordinates.