=== Assistant ===
You’re right to flag that: for the local running time story in your open problem, what matters is not just that the “wrong” coordinates eventually die, but how big the support can get before identification (open-problem-fountoulakis22a).

There are two different answers, depending on what kind of bound you mean:
# A bound purely from the margin <math>\gamma</math>, independent of how far you are from <math>x^\star</math>.
# A bound that uses <math>\gamma</math> ''and'' a measure of how far <math>y_k</math> is from <math>x^\star</math> (distance or objective gap).

===== Not in the strong sense you would want (e.g., <math>\le c/\rho</math> active coordinates for all <math>k</math>) =====
Reason: the margin is a local property at the solution:
: <math>|\nabla f(x^\star)_i| \le \lambda_i - \gamma \qquad (i\in I^\star),</math>
so it only says those coordinates are safely inside the “dead zone” near <math>x^\star</math>. Far from <math>x^\star</math>, nothing stops the prox input from crossing the threshold on many coordinates.

Also, the literature on identification for inertial/accelerated forward–backward methods explicitly discusses that FISTA-type inertial schemes can oscillate around the active manifold before settling, so one should not expect a clean, monotone “the active set only grows nicely” picture without extra structure.<ref>{{cite web|title=arXiv|url=https://arxiv.org/pdf/1503.03703|publisher=arxiv.org|access-date=2025-12-19}}</ref>

So: a uniform, instance-independent “small support for all transient iterates” bound does not follow from <math>\gamma>0</math> alone.

==== Even though <math>\gamma</math> is local, it does give a very clean quantitative bound of the form ====
: <math>|A(y)| \ \le\ \left(\frac{(1+tL)\,\|y-x^\star\|_2}{t\gamma}\right)^2.</math>
This is already useful because accelerated methods give you an explicit decay of <math>\|y_k-x^\star\|</math> (or of the objective gap), so you can turn it into a bound on the transient “wrong support” size that shrinks over time.
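To fix notation, a minimal sketch of the prox-gradient step in question, assuming NumPy; the helper names are illustrative, not from any specific codebase:

```python
import numpy as np

def prox_weighted_l1(u, t, lam):
    """prox_{t g}(u) for g(x) = sum_i lam_i*|x_i|: soft-thresholding at level t*lam_i."""
    return np.sign(u) * np.maximum(np.abs(u) - t * lam, 0.0)

def prox_grad_step(y, t, grad_f, lam):
    """One forward-backward step x^+(y) = prox_{t g}(y - t*grad f(y))."""
    return prox_weighted_l1(y - t * grad_f(y), t, lam)

# Coordinates whose prox input lies inside the dead zone |u_i| <= t*lam_i land exactly at 0.
u = np.array([3.0, -0.5, 1.0])
out = prox_weighted_l1(u, 1.0, np.array([1.0, 1.0, 1.0]))  # -> [2., 0., 0.]
```

A spurious activation in the sense above is exactly a coordinate of <math>I^\star</math> whose prox input escapes this dead zone.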
===== Consider the composite problem =====
: <math>\min_x\; F(x)=f(x)+g(x),\qquad g(x)=\sum_{i=1}^n \lambda_i |x_i|,</math>
with <math>\nabla f</math> being <math>L</math>-Lipschitz (in <math>\ell_2</math>). Let <math>x^\star</math> be the minimizer and define the inactive set <math>I^\star=\{i: x_i^\star=0\}</math>. Assume the margin
: <math>\gamma:=\min_{i\in I^\star}\bigl(\lambda_i-|\nabla f(x^\star)_i|\bigr)>0.</math>
Fix a step <math>t>0</math>, and for any point <math>y</math> define the prox-gradient input
: <math>u(y):=y-t\nabla f(y),\qquad x^+(y):=\operatorname{prox}_{t g}(u(y)).</math>
Let the “spurious active set” at <math>y</math> be
: <math>A(y):=\{i\in I^\star:\ x^+(y)_i\neq 0\}.</math>
Then
: <math>|A(y)|\ \le\ \left(\frac{(1+tL)\,\|y-x^\star\|_2}{t\gamma}\right)^2.</math>

====== For weighted soft-thresholding, <math>x^+(y)_i\neq 0</math> iff <math>|u(y)_i|>t\lambda_i</math>. ======
For <math>i\in I^\star</math>, we have <math>x^\star_i=0</math> and <math>|\nabla f(x^\star)_i|\le \lambda_i-\gamma</math>, so
: <math>|u(x^\star)_i|=|x^\star_i-t\nabla f(x^\star)_i|=t|\nabla f(x^\star)_i|\le t(\lambda_i-\gamma).</math>
Hence, if <math>i\in A(y)</math>, then <math>|u(y)_i|>t\lambda_i</math>, and therefore
: <math>|u(y)_i|-|u(x^\star)_i|>t\lambda_i - t(\lambda_i-\gamma)=t\gamma.</math>
Using <math>\bigl|\,|a|-|b|\,\bigr|\le |a-b|</math>, we get <math>|u(y)_i-u(x^\star)_i|>t\gamma</math>. Thus
: <math>\|u(y)-u(x^\star)\|_2^2\ \ge\ \sum_{i\in A(y)} |u(y)_i-u(x^\star)_i|^2\ >\ |A(y)|\,(t\gamma)^2.</math>
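The lemma can be sanity-checked numerically. A hedged sketch on a toy instance where the minimizer is known in closed form, namely <math>f(x)=\tfrac12\|x-b\|_2^2</math> (so <math>L=1</math> and <math>x^\star = \operatorname{prox}_{g}(b)</math>); the data and names below are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0                                   # uniform weights lambda_i = 1 (illustrative)
b = np.concatenate([[2.0, -1.8, 1.6],       # |b_i| > lam  -> nonzero in x*
                    rng.uniform(-0.6, 0.6, 47)])  # |b_i| <= 0.6 -> margin >= 0.4
t, L = 0.5, 1.0                             # f(x) = 0.5*||x - b||^2: grad f(x) = x - b, L = 1

soft = lambda u, thr: np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)
x_star = soft(b, lam)                       # exact minimizer for this particular f
I_star = np.flatnonzero(x_star == 0.0)
gamma = float(np.min(lam - np.abs(b[I_star])))  # grad f(x*)_i = -b_i on I*

def n_spurious(y):
    """|A(y)|: coordinates of I* activated by one prox-gradient step from y."""
    x_plus = soft(y - t * (y - b), t * lam)
    return int(np.count_nonzero(x_plus[I_star]))

worst_ratio = 0.0
for _ in range(200):
    y = x_star + rng.normal(scale=0.8, size=b.size)
    bound = ((1 + t * L) * np.linalg.norm(y - x_star) / (t * gamma)) ** 2
    worst_ratio = max(worst_ratio, n_spurious(y) / bound)
print(worst_ratio)  # stays <= 1: the lemma bound holds on every sample
```

The bound is far from tight on such random perturbations, which is consistent with its role here: it is a worst-case envelope from the margin alone.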
On the other hand, since <math>\nabla f</math> is <math>L</math>-Lipschitz,
: <math>\|u(y)-u(x^\star)\|_2 =\|(y-x^\star)-t(\nabla f(y)-\nabla f(x^\star))\|_2 \le (1+tL)\|y-x^\star\|_2.</math>
Combine the two inequalities and rearrange to obtain the claim. ∎

==== In an accelerated prox-gradient method, you compute ====
: <math>x_{k+1}=\operatorname{prox}_{t g}(y_k-t\nabla f(y_k)).</math>
Apply the lemma with <math>y=y_k</math>:
: <math>|A(y_k)|\ \le\ \left(\frac{(1+tL)\,\|y_k-x^\star\|_2}{t\gamma}\right)^2.</math>
So you can bound the number of “wrong” active coordinates at iteration <math>k</math> as soon as you have any bound on <math>\|y_k-x^\star\|_2</math> (or on an objective gap, via strong convexity). And the identification literature you’re implicitly leaning on (active-set / manifold identification) does exactly this: it bounds the “active-set complexity” in terms of a margin-like parameter (often called <math>\delta_{\min}</math>) and a convergence bound <math>\varepsilon_x(k)</math>.<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v89/sun19a/sun19a.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>

==== Does this give the local-algorithm guarantee you want? Not by itself. ====
* The bound is quadratic in <math>1/\gamma</math>, not linear in <math>1/\rho</math>.
* To turn it into a “local algorithm” guarantee, you would need to relate <math>\gamma</math> to <math>\rho</math> (and control <math>\|y_k-x^\star\|</math> by something that doesn’t hide an <math>n</math>-dependence). In the worst case, <math>\gamma</math> can be arbitrarily small even when <math>|S^\star|</math> is <math>O(1/\rho)</math>, so the lemma still allows very large transient supports.
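To see the per-iteration bound in action, here is a hedged FISTA sketch on the same kind of closed-form toy instance (<math>f(x)=\tfrac12\|x-b\|_2^2</math>, so <math>L=1</math>), tracking the spurious support <math>|A(y_k)|</math> against the lemma's bound at each extrapolated point <math>y_k</math>; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t = 1.0, 0.5                           # weights lambda_i = 1; step t (L = 1 below)
b = np.concatenate([[2.0, -1.8, 1.6], rng.uniform(-0.6, 0.6, 47)])
soft = lambda u, thr: np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)
grad_f = lambda x: x - b                    # f(x) = 0.5*||x - b||^2
x_star = soft(b, lam)                       # exact minimizer for this f
I_star = np.flatnonzero(x_star == 0.0)
gamma = float(np.min(lam - np.abs(b[I_star])))

# FISTA with Nesterov momentum; record |A(y_k)| and the lemma bound at each y_k.
x = np.zeros_like(b); x_prev = x.copy(); theta = 1.0
spurious_counts, bounds = [], []
for k in range(60):
    theta_next = 0.5 * (1 + np.sqrt(1 + 4 * theta**2))
    y = x + ((theta - 1) / theta_next) * (x - x_prev)
    bounds.append(((1 + t * 1.0) * np.linalg.norm(y - x_star) / (t * gamma)) ** 2)
    x_prev, x = x, soft(y - t * grad_f(y), t * lam)   # x_{k+1} = prox step at y_k
    spurious_counts.append(int(np.count_nonzero(x[I_star])))
    theta = theta_next

print(all(s <= bnd for s, bnd in zip(spurious_counts, bounds)))
print(spurious_counts[-1])   # spurious support once y_k is close to x*
```

As <math>\|y_k - x^\star\|</math> decays, the bound (and the observed spurious support) shrinks to zero, which is exactly the “transient wrong support shrinks over time” statement, not a uniform small-support invariant.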
So the best honest summary is:
* Yes, you can bound transient spurious activations using a margin parameter, but the natural bound you can prove from standard smoothness + margin is of the form <math>|A_k| \lesssim \|y_k-x^\star\|^2/\gamma^2</math>, which is not the <math>\tilde O(1/\rho)</math> sparsity you ultimately want.
* Getting a uniform <math>\tilde O(1/\rho)</math> per-iteration support bound for acceleration typically needs extra invariants/controls (active-set control, monotonicity/restart tricks, etc.), which is exactly where the “nonstandard” machinery comes in, consistent with why the identification papers emphasize oscillation/inertial effects.<ref>{{cite web|title=arXiv|url=https://arxiv.org/pdf/1503.03703|publisher=arxiv.org|access-date=2025-12-19}}</ref>