
==== From Sec. 2: ====
* There is a ground-truth generator $f: Z \to X$, $x = f(z)$, where:
** $Z = \mathbb{R}^{d_z}$ is the latent space, decomposed into slots $z = (z_1, \dots, z_K)$ with $z_k \in \mathbb{R}^m$, e.g. “animal 1 slot”, “animal 2 slot”, “background slot”.
** $X \subset \mathbb{R}^{d_x}$ is the image space (the data manifold).
* A representation $\hat z = \varphi(x)$ is “good perception” if it inverts $f$ up to a slot-wise reparametrization and a permutation: $\forall z \in Z_S: \quad \varphi(f(z)) = h_\pi(z)$ (Eq. 2.1). Here $h_\pi$ just relabels and reparametrizes slots.
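To make Eq. 2.1 concrete, here is a minimal sketch with K = 3 scalar slots (m = 1). The generator `f`, the permutation `PERM` and the per-slot maps `SLOT_MAPS` are all hypothetical choices made for illustration, not taken from the paper, and the toy uses $d_x = d_z$ so that the inverse can be written by hand. Note that Eq. 2.1 only asks for some $h_\pi$ to exist, while the checker below tests against one fixed $h_\pi$.

```python
import math

# Hypothetical toy setup: K = 3 slots, each a single real number (m = 1).
PERM = (2, 0, 1)  # output slot k is built from input slot PERM[k]

# One invertible map per output slot (the slot-wise reparametrization).
SLOT_MAPS = (
    lambda s: 2.0 * s + 1.0,
    lambda s: s ** 3,
    lambda s: math.tanh(s),
)

def h_pi(z):
    """Permute the slots, then reparametrize each slot independently."""
    return tuple(SLOT_MAPS[k](z[PERM[k]]) for k in range(len(z)))

def f(z):
    """Toy ground-truth generator: an invertible linear mix of the slots."""
    z1, z2, z3 = z
    return (z1 + z2, z2 + z3, z3)

def f_inverse(x):
    """Exact inverse of the toy generator above."""
    x1, x2, x3 = x
    z3 = x3
    z2 = x2 - z3
    z1 = x1 - z2
    return (z1, z2, z3)

def phi_good(x):
    """Recovers the slots up to h_pi, so it satisfies Eq. 2.1."""
    return h_pi(f_inverse(x))

def phi_entangled(x):
    """Returns the raw observation, in which slots are still mixed."""
    return tuple(x)

def satisfies_eq_2_1(phi, points, tol=1e-9):
    """Check phi(f(z)) == h_pi(z) on the given latent points."""
    return all(
        all(abs(a - b) <= tol for a, b in zip(phi(f(z)), h_pi(z)))
        for z in points
    )

POINTS = [(0.5, -1.0, 2.0), (0.0, 0.0, 0.0), (1.0, 2.0, -0.5)]
print(satisfies_eq_2_1(phi_good, POINTS))       # good representation
print(satisfies_eq_2_1(phi_entangled, POINTS))  # entangled representation
```

The first check passes because `phi_good` undoes the mixing before relabeling; the second fails because each output coordinate of `phi_entangled` still depends on more than one slot.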

===== They define two regimes: =====
* Generative approach:
** Learn a decoder $\hat f : Z \to \mathbb{R}^{d_x}$.
** Use its inverse to represent images: $\varphi(x) = \hat f^{-1}(x)$.
** For this to match the truth, $\hat f$ must identify $f$: $\hat f(h_\pi(z)) = f(z)$ (Eq. 2.2).
* Non-generative approach:
** Learn an encoder directly: $\varphi(x) = \hat g(x)$.
** For this to match the truth, $\hat g$ must approximate the inverse: $\hat g(x) = h_\pi(g(x))$, where $g := f^{-1}$ (Eq. 2.3).

The difference is not “does the architecture have an encoder or decoder”, but:
* Are you constraining a decoder $\hat f$ and then inverting it (generative)?
* Or are you only constraining an encoder $\hat g$ (non-generative)?
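The two routes can be contrasted in a toy setting. The sketch below assumes a hypothetical one-dimensional latent, a hand-written decoder `decoder(z) = (z, z*z)` and a hand-written encoder; none of these come from the paper, and both functions merely stand in for learned models. The generative route obtains $\varphi(x)$ by numerically inverting the decoder, while the non-generative route applies the encoder directly.

```python
def decoder(z):
    """Hypothetical decoder f_hat: R -> R^2 (stands in for a learned model)."""
    return (z, z * z)

def phi_generative(x, steps=200, lr=0.05, z0=0.0):
    """Generative route: invert the decoder by gradient descent
    on the reconstruction error ||f_hat(z) - x||^2."""
    z = z0
    for _ in range(steps):
        r1 = z - x[0]          # residual in the first coordinate
        r2 = z * z - x[1]      # residual in the second coordinate
        grad = 2.0 * r1 + 4.0 * z * r2
        z -= lr * grad
    return z

def encoder(x):
    """Hypothetical encoder g_hat: R^2 -> R (stands in for a learned model)."""
    return x[0]

def phi_non_generative(x):
    """Non-generative route: the encoder output is the representation."""
    return encoder(x)

x = decoder(1.5)
print(phi_generative(x), phi_non_generative(x))
```

On points produced by the decoder both routes return the same latent here. The difference is where constraints can be placed: in the first route any structural assumption lives in `decoder`, in the second it would have to live in `encoder`.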

* ID region: $X_{\text{ID}} = f(Z_{\text{ID}})$, generated from some subset $Z_{\text{ID}} \subset Z$ of concept combinations.
* OOD region: $Z_{\text{OOD}}$ is all other combinations of slot values; $X_{\text{OOD}} = f(Z_{\text{OOD}})$.

The goal: if Eq. (2.1) holds on $Z_{\text{ID}}$, does it generalize to all combinations in $Z_{\text{OOD}}$? That’s compositional generalization.
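A small discrete example may help picture the ID/OOD split. The sketch below assumes two slots that each take one of four values, and chooses $Z_{\text{ID}}$ as the combinations in which at least one slot sits at a reference value. Both the discretization and this particular choice of $Z_{\text{ID}}$ are illustrative assumptions, as is the toy generator `f`.

```python
from itertools import product

# Hypothetical discrete toy: K = 2 slots, each taking one of 4 values.
SLOT_VALUES = (0, 1, 2, 3)
Z = set(product(SLOT_VALUES, repeat=2))

# ID: combinations where at least one slot is at the reference value 0.
# Every individual slot value is seen, but most pairings are not.
Z_ID = {z for z in Z if 0 in z}
Z_OOD = Z - Z_ID

def f(z):
    """Toy injective generator standing in for the image renderer."""
    z1, z2 = z
    return (z1, z2, z1 * z2)

X_ID = {f(z) for z in Z_ID}
X_OOD = {f(z) for z in Z_OOD}

def slot_values_seen(latents, k):
    """Values that slot k takes across a set of latent combinations."""
    return {z[k] for z in latents}

print(len(Z_ID), len(Z_OOD))  # 7 9
print(slot_values_seen(Z_ID, 0) == set(SLOT_VALUES))  # True
```

Each slot value appears somewhere in the ID set, yet 9 of the 16 combinations are never observed. Compositional generalization asks whether a representation that is correct on the 7 ID combinations stays correct on those 9.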

* They define a generator class $F_{\text{int}}$ where concepts interact in a structured way (polynomial interactions across slots).
* The inverse class $G_{\text{int}} = \{ f^{-1} \mid f \in F_{\text{int}} \}$ has strong constraints on its Jacobian and Hessian (Eq. 3.3).

Key asymmetry:
* It’s feasible to constrain a decoder to lie in $F_{\text{int}}$ (by architecture or regularization).
* It’s basically infeasible to constrain an encoder to lie in $G_{\text{int}}$ when images live on a low-dimensional manifold in a high-dimensional ambient space ($d_x \gg d_z$). The Hessian/Jacobian structure “disappears”; almost any encoder is compatible with some $f \in F_{\text{int}}$.

So: you can enforce the right inductive bias on $\hat f$, but not on $\hat g$. This is why they say: to guarantee compositional generalization, you need a generative approach.
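One way to build intuition for the asymmetry is to note that an encoder is only pinned down on the data manifold. The sketch below is not the paper's argument; it is a hand-built illustration with a hypothetical generator that embeds a 2-D latent into 3-D space as the plane $x_3 = 0$. The two encoders `g_a` and `g_b` both invert `f` exactly on that plane, yet their Jacobians differ in the off-manifold direction.

```python
import math

def f(z):
    """Toy generator: embeds a 2-D latent into 3-D space (d_x > d_z).
    The data manifold is the plane x3 = 0."""
    z1, z2 = z
    return (z1, z2, 0.0)

def g_a(x):
    """One exact inverse of f on the manifold."""
    return (x[0], x[1])

def g_b(x):
    """Another exact inverse on the manifold that behaves differently off it."""
    return (x[0] + 5.0 * x[2], x[1] + math.sin(x[2]))

def jacobian(g, x, eps=1e-6):
    """Central finite-difference Jacobian of g at x
    (rows index outputs, columns index inputs)."""
    cols = []
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        cols.append([(a - b) / (2.0 * eps) for a, b in zip(g(hi), g(lo))])
    return [list(row) for row in zip(*cols)]

z = (0.3, -1.2)
x = f(z)
print(g_a(x) == z, g_b(x) == z)  # both encoders recover z on the manifold
print(jacobian(g_a, x))
print(jacobian(g_b, x))
```

The first two Jacobian columns agree, because they are fixed by the requirement that the encoder invert `f` on the manifold. The third column is free, so data alone cannot distinguish an encoder with the desired derivative structure from one without it.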