=== Assistant ===
You’re right to flag that: for the local running time story in your open problem, what matters is not just that the “wrong” coordinates eventually die, but how big the support can get before identification (open-problem-fountoulakis22a).

There are two different answers, depending on what kind of bound you mean:
# A bound purely from the margin <math>\gamma</math>, independent of how far you are from <math>x^\star</math>.
# A bound that uses <math>\gamma</math> ''and'' a measure of how far <math>y_k</math> is from <math>x^\star</math> (distance or objective gap).

===== Not in the strong sense you would want (e.g., <math>\le c/\rho</math> active coordinates for all <math>k</math>) =====
Reason: the margin is a local property at the solution:
: <math>|\nabla f(x^\star)_i| \le \lambda_i - \gamma \qquad (i\in I^\star),</math>
so it only says those coordinates are safely inside the “dead zone” near <math>x^\star</math>. Far from <math>x^\star</math>, nothing stops the prox input from crossing the threshold on many coordinates.

Also, the literature on identification for inertial/accelerated forward–backward methods explicitly discusses that FISTA-type inertial schemes can oscillate around the active manifold before settling, so one should not expect a clean, monotone “the active set only grows nicely” picture without extra structure.<ref>{{cite web|title=arXiv|url=https://arxiv.org/pdf/1503.03703|publisher=arxiv.org|access-date=2025-12-19}}</ref>

So: a uniform, instance-independent “small support for all transient iterates” bound does not follow from <math>\gamma>0</math> alone.

==== Even though <math>\gamma</math> is local, it does give a very clean quantitative bound of the form ====
: <math>|A(y)| \ \le\ \left(\frac{(1+tL)\,\|y-x^\star\|_2}{t\gamma}\right)^2.</math>
This is already useful because accelerated methods give you an explicit decay of <math>\|y_k-x^\star\|</math> (or of the objective gap), so you can turn it into a bound on the transient “wrong support” size that shrinks over time.
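To fix notation, a minimal sketch of the prox-gradient step in question, assuming NumPy; the helper names are illustrative, not from any specific codebase:

```python
import numpy as np

def prox_weighted_l1(u, t, lam):
    """prox_{t g}(u) for g(x) = sum_i lam_i*|x_i|: soft-thresholding at level t*lam_i."""
    return np.sign(u) * np.maximum(np.abs(u) - t * lam, 0.0)

def prox_grad_step(y, t, grad_f, lam):
    """One forward-backward step x^+(y) = prox_{t g}(y - t*grad f(y))."""
    return prox_weighted_l1(y - t * grad_f(y), t, lam)

# Coordinates whose prox input lies inside the dead zone |u_i| <= t*lam_i land exactly at 0.
u = np.array([3.0, -0.5, 1.0])
out = prox_weighted_l1(u, 1.0, np.array([1.0, 1.0, 1.0]))  # -> [2., 0., 0.]
```

A spurious activation in the sense above is exactly a coordinate of <math>I^\star</math> whose prox input escapes this dead zone.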
===== Consider the composite problem =====
: <math>\min_x\; F(x)=f(x)+g(x),\qquad g(x)=\sum_{i=1}^n \lambda_i |x_i|,</math>
with <math>\nabla f</math> being <math>L</math>-Lipschitz (in <math>\ell_2</math>). Let <math>x^\star</math> be the minimizer and define the inactive set <math>I^\star=\{i: x_i^\star=0\}</math>. Assume the margin
: <math>\gamma:=\min_{i\in I^\star}\bigl(\lambda_i-|\nabla f(x^\star)_i|\bigr)>0.</math>
Fix a step <math>t>0</math>, and for any point <math>y</math> define the prox-gradient input
: <math>u(y):=y-t\nabla f(y),\qquad x^+(y):=\operatorname{prox}_{t g}(u(y)).</math>
Let the “spurious active set” at <math>y</math> be
: <math>A(y):=\{i\in I^\star:\ x^+(y)_i\neq 0\}.</math>
Then
: <math>|A(y)|\ \le\ \left(\frac{(1+tL)\,\|y-x^\star\|_2}{t\gamma}\right)^2.</math>

====== For weighted soft-thresholding, <math>x^+(y)_i\neq 0</math> iff <math>|u(y)_i|>t\lambda_i</math>. ======
For <math>i\in I^\star</math>, we have <math>x^\star_i=0</math> and <math>|\nabla f(x^\star)_i|\le \lambda_i-\gamma</math>, so
: <math>|u(x^\star)_i|=|x^\star_i-t\nabla f(x^\star)_i|=t|\nabla f(x^\star)_i|\le t(\lambda_i-\gamma).</math>
Hence, if <math>i\in A(y)</math>, then <math>|u(y)_i|>t\lambda_i</math>, and therefore
: <math>|u(y)_i|-|u(x^\star)_i|>t\lambda_i - t(\lambda_i-\gamma)=t\gamma.</math>
Using <math>\bigl|\,|a|-|b|\,\bigr|\le |a-b|</math>, we get <math>|u(y)_i-u(x^\star)_i|>t\gamma</math>. Thus
: <math>\|u(y)-u(x^\star)\|_2^2\ \ge\ \sum_{i\in A(y)} |u(y)_i-u(x^\star)_i|^2\ >\ |A(y)|\,(t\gamma)^2.</math>
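The lemma can be sanity-checked numerically. A hedged sketch on a toy instance where the minimizer is known in closed form, namely <math>f(x)=\tfrac12\|x-b\|_2^2</math> (so <math>L=1</math> and <math>x^\star = \operatorname{prox}_{g}(b)</math>); the data and names below are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0                                   # uniform weights lambda_i = 1 (illustrative)
b = np.concatenate([[2.0, -1.8, 1.6],       # |b_i| > lam  -> nonzero in x*
                    rng.uniform(-0.6, 0.6, 47)])  # |b_i| <= 0.6 -> margin >= 0.4
t, L = 0.5, 1.0                             # f(x) = 0.5*||x - b||^2: grad f(x) = x - b, L = 1

soft = lambda u, thr: np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)
x_star = soft(b, lam)                       # exact minimizer for this particular f
I_star = np.flatnonzero(x_star == 0.0)
gamma = float(np.min(lam - np.abs(b[I_star])))  # grad f(x*)_i = -b_i on I*

def n_spurious(y):
    """|A(y)|: coordinates of I* activated by one prox-gradient step from y."""
    x_plus = soft(y - t * (y - b), t * lam)
    return int(np.count_nonzero(x_plus[I_star]))

worst_ratio = 0.0
for _ in range(200):
    y = x_star + rng.normal(scale=0.8, size=b.size)
    bound = ((1 + t * L) * np.linalg.norm(y - x_star) / (t * gamma)) ** 2
    worst_ratio = max(worst_ratio, n_spurious(y) / bound)
print(worst_ratio)  # stays <= 1: the lemma bound holds on every sample
```

The bound is far from tight on such random perturbations, which is consistent with its role here: it is a worst-case envelope from the margin alone.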
On the other hand, since <math>\nabla f</math> is <math>L</math>-Lipschitz,
: <math>\|u(y)-u(x^\star)\|_2 =\|(y-x^\star)-t(\nabla f(y)-\nabla f(x^\star))\|_2 \le (1+tL)\|y-x^\star\|_2.</math>
Combine the two inequalities and rearrange to obtain the claim. ∎

==== In an accelerated prox-gradient method, you compute ====
: <math>x_{k+1}=\operatorname{prox}_{t g}(y_k-t\nabla f(y_k)).</math>
Apply the lemma with <math>y=y_k</math>:
: <math>|A(y_k)|\ \le\ \left(\frac{(1+tL)\,\|y_k-x^\star\|_2}{t\gamma}\right)^2.</math>
So you can bound the number of “wrong” active coordinates at iteration <math>k</math> as soon as you have any bound on <math>\|y_k-x^\star\|_2</math> (or on an objective gap, via strong convexity). And the identification literature you’re implicitly leaning on (active-set / manifold identification) does exactly this: it bounds the “active-set complexity” in terms of a margin-like parameter (often called <math>\delta_{\min}</math>) and a convergence bound <math>\varepsilon_x(k)</math>.<ref>{{cite web|title=Proceedings of Machine Learning Research|url=https://proceedings.mlr.press/v89/sun19a/sun19a.pdf|publisher=Proceedings of Machine Learning Research|access-date=2025-12-19}}</ref>

==== Does this give the local-algorithm guarantee you want? Not by itself. ====
* The bound is quadratic in <math>1/\gamma</math>, not linear in <math>1/\rho</math>.
* To turn it into a “local algorithm” guarantee, you would need to relate <math>\gamma</math> to <math>\rho</math> (and control <math>\|y_k-x^\star\|</math> by something that doesn’t hide an <math>n</math>-dependence). In the worst case, <math>\gamma</math> can be arbitrarily small even when <math>|S^\star|</math> is <math>O(1/\rho)</math>, so the lemma still allows very large transient supports.
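To see the per-iteration bound in action, here is a hedged FISTA sketch on the same kind of closed-form toy instance (<math>f(x)=\tfrac12\|x-b\|_2^2</math>, so <math>L=1</math>), tracking the spurious support <math>|A(y_k)|</math> against the lemma's bound at each extrapolated point <math>y_k</math>; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t = 1.0, 0.5                           # weights lambda_i = 1; step t (L = 1 below)
b = np.concatenate([[2.0, -1.8, 1.6], rng.uniform(-0.6, 0.6, 47)])
soft = lambda u, thr: np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)
grad_f = lambda x: x - b                    # f(x) = 0.5*||x - b||^2
x_star = soft(b, lam)                       # exact minimizer for this f
I_star = np.flatnonzero(x_star == 0.0)
gamma = float(np.min(lam - np.abs(b[I_star])))

# FISTA with Nesterov momentum; record |A(y_k)| and the lemma bound at each y_k.
x = np.zeros_like(b); x_prev = x.copy(); theta = 1.0
spurious_counts, bounds = [], []
for k in range(60):
    theta_next = 0.5 * (1 + np.sqrt(1 + 4 * theta**2))
    y = x + ((theta - 1) / theta_next) * (x - x_prev)
    bounds.append(((1 + t * 1.0) * np.linalg.norm(y - x_star) / (t * gamma)) ** 2)
    x_prev, x = x, soft(y - t * grad_f(y), t * lam)   # x_{k+1} = prox step at y_k
    spurious_counts.append(int(np.count_nonzero(x[I_star])))
    theta = theta_next

print(all(s <= bnd for s, bnd in zip(spurious_counts, bounds)))
print(spurious_counts[-1])   # spurious support once y_k is close to x*
```

As <math>\|y_k - x^\star\|</math> decays, the bound (and the observed spurious support) shrinks to zero, which is exactly the “transient wrong support shrinks over time” statement, not a uniform small-support invariant.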
So the best honest summary is:
* Yes, you can bound transient spurious activations using a margin parameter, but the natural bound you can prove from standard smoothness + margin is of the form <math>|A_k| \lesssim \|y_k-x^\star\|^2/\gamma^2</math>, which is not the <math>\tilde O(1/\rho)</math> sparsity you ultimately want.
* Getting a uniform <math>\tilde O(1/\rho)</math> per-iteration support bound for acceleration typically needs extra invariants/controls (active-set control, monotonicity/restart tricks, etc.), which is exactly where the “nonstandard” machinery comes in, consistent with why the identification papers emphasize oscillation/inertial effects.<ref>{{cite web|title=arXiv|url=https://arxiv.org/pdf/1503.03703|publisher=arxiv.org|access-date=2025-12-19}}</ref>