
==== They rewrite Eq. (4.1) as an optimization problem: ====

$$z^* = \arg\min_z \; \big\|x - \hat f(z)\big\|^2 \tag{4.3}$$

So, for a given (possibly OOD) image $x$:
# Define the reconstruction loss: $L(z) = \big\|x - \hat f(z)\big\|^2$.
# Start from some initial guess $z^{(0)}$.
# Do gradient descent: $z^{(t+1)} = z^{(t)} - \eta \,\nabla_z L(z^{(t)})$.

Because $\hat f$ is differentiable, you can compute $\nabla_z L$ via backprop.
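As a rough illustration of the three steps above, here is a minimal pure-Python sketch. It assumes a toy linear decoder (`W`, `decode`) standing in for $\hat f$, and uses the closed-form gradient of the squared error in place of backprop. The names `decode`, `loss`, `grad`, `search`, the step size `eta` and the step count are all hypothetical choices for this example, not taken from the paper.

```python
# Toy stand-in for the learned decoder f_hat: a fixed linear map from a
# 2-dimensional latent z to a 3-dimensional "image" x. (Assumption: the
# real decoder is a neural network; this is only for illustration.)
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]


def decode(z):
    """f_hat(z) = W z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def loss(z, x):
    """Reconstruction loss L(z) = ||x - f_hat(z)||^2."""
    return sum((xi - fi) ** 2 for xi, fi in zip(x, decode(z)))


def grad(z, x):
    """Gradient of L for the linear decoder: -2 W^T (x - W z).

    With a neural decoder this would be obtained by backprop instead.
    """
    r = [xi - fi for xi, fi in zip(x, decode(z))]
    return [-2.0 * sum(W[i][j] * r[i] for i in range(len(W)))
            for j in range(len(z))]


def search(x, z0, eta=0.1, steps=200):
    """Gradient descent z <- z - eta * grad L(z), starting from z0."""
    z = list(z0)
    for _ in range(steps):
        g = grad(z, x)
        z = [zj - eta * gj for zj, gj in zip(z, g)]
    return z


z_true = [0.5, -1.5]          # latent we pretend generated the image
x = decode(z_true)            # observed image
z_star = search(x, z0=[0.0, 0.0])
print(z_star, loss(z_star, x))
```

In this toy setting the loss is convex, so the search recovers `z_true` from any starting point; with a real nonlinear decoder there is no such guarantee.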

The efficiency depends heavily on initialization:
* If you start from a random $z^{(0)}$, you may need many steps or get stuck.
* Their trick: use the encoder as a “System 1” guess: $z^{(0)} = \hat g(x)$.
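To make the effect of initialization concrete, the sketch below counts gradient steps until the reconstruction loss drops below a tolerance, once from an uninformed starting point and once from an encoder guess. Everything here is an assumption for illustration: the linear `decode`, the deliberately biased `encode` standing in for $\hat g$, the fixed "cold" starting point, and the values of `eta` and `tol`.

```python
# Toy linear decoder standing in for f_hat (hypothetical, as is everything
# else in this sketch): z in R^2 -> x in R^3.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]


def decode(z):
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def encode(x):
    """Imperfect stand-in for g_hat: reads z off the first two pixels but
    shrinks it by 10%, so the guess is close yet biased."""
    return [0.9 * x[0], 0.9 * x[1]]


def loss(z, x):
    return sum((xi - fi) ** 2 for xi, fi in zip(x, decode(z)))


def grad(z, x):
    # Closed-form gradient -2 W^T (x - W z); backprop would replace this
    # for a neural decoder.
    r = [xi - fi for xi, fi in zip(x, decode(z))]
    return [-2.0 * sum(W[i][j] * r[i] for i in range(len(W)))
            for j in range(len(z))]


def steps_to_fit(x, z0, eta=0.1, tol=1e-8, max_steps=10_000):
    """Number of gradient steps until L(z) < tol (max_steps if never)."""
    z = list(z0)
    for t in range(max_steps):
        if loss(z, x) < tol:
            return t
        g = grad(z, x)
        z = [zj - eta * gj for zj, gj in zip(z, g)]
    return max_steps


x = decode([0.5, -1.5])

cold_steps = steps_to_fit(x, z0=[10.0, -10.0])   # uninformed guess
warm_steps = steps_to_fit(x, z0=encode(x))       # encoder ("System 1") guess
print("cold start:", cold_steps, "steps; warm start:", warm_steps, "steps")
```

Because this toy loss is convex, it only shows the "many steps" half of the problem; getting stuck in a poor solution requires a nonlinear decoder.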

So the pipeline for Search is:

: $x \;\to\; z^{(0)} = \hat g(x) \;\to\;$ gradient descent on $L(z)$ $\;\to\; z^*$

Intuition:
* $\hat g$ is trained only on $X_{\text{ID}}$, so on OOD it might be biased or slightly wrong.
* But it’s often close to a good solution.
* Gradient-based search corrects $\hat g$’s mistakes by explicitly enforcing that the decoded image matches $x$ under $\hat f$.

This is what they show in Fig. 4 (left): the encoder gives an initial slot decomposition; search refines it to better match $x$.

Pros:
* In principle, if $\hat f$ truly identifies $f$, then solving (4.3) correctly recovers the “right” slots for any $x \in X$, including OOD.
* You don’t need extra training data.

Cons:
* It runs online and is expensive per image (hundreds of gradient steps in the experiments).
* For large models or many images, this is slow.