
===== To turn your idea into something that could in principle guarantee compositional generalization, you’d need more than: =====
* small latent dim, and
* classification/pseudo-label objectives.

You’d need constraints that approximate the derivative structure they identify for good inverses (block-diagonal-ish behavior on the tangent space of the data manifold). But:
* those constraints depend on the geometry of the manifold $X$, especially in OOD regions, which you don’t observe;
* that’s exactly why they argue encoder-side inductive biases are fundamentally weaker than decoder-side ones.
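
One hedged way to write down what "block-diagonal-ish on the tangent space" could mean. The notation here is an assumption for illustration, not taken from the paper: a generator $f$ maps slot latents $z = (z_1, \dots, z_K)$ to data, and an encoder $g = (g_1, \dots, g_K)$ inverts it on an open set of latents. Differentiating the inversion identity with the chain rule gives:

```latex
g(f(z)) = z
\;\Longrightarrow\;
Dg\big(f(z)\big)\, Df(z) = I
\;\Longrightarrow\;
Dg_i\big(f(z)\big)\, \frac{\partial f}{\partial z_j}(z) = \delta_{ij}\, I
\quad \text{for all slots } i, j .
```

The columns of $\partial f / \partial z_j$ span the tangent directions at $x = f(z)$, so the condition only constrains $Dg$ along those directions. Checking it at an OOD point requires $\partial f / \partial z_j$ at a latent combination that was never observed, which is the dependence on unseen geometry described above.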

So, even if you add pseudo-labels + bottleneck, you’re still in the “encoder can fit ID, OOD is unconstrained” regime.

TL;DR on your proposal
* It’s a reasonable heuristic (and quite close to things tried in SSL already).
* But it does not escape the paper’s core critique of non-generative methods:
** You’re still only constraining the encoder on $X_{\text{ID}}$.
** The bottleneck isn’t structured enough to enforce the slotwise/polynomial structure required for compositional generalization.
** There remain many encoders that fit ID pseudo-labels perfectly and behave arbitrarily OOD.

If you want, we can sketch a concrete toy construction where two different low-dim encoders get 100% pseudo-label accuracy on the same dataset but give totally different decompositions—and opposite predictions—on OOD combos. That’s a nice way to see the ambiguity very explicitly.
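
A minimal sketch of such a toy construction, under strong simplifying assumptions: two binary factors, an observation that is just the factor pair itself, and pseudo-labels obtained by thresholding each latent at 0.5. The names `encoder_a`, `encoder_b`, `ID_COMBOS`, and `OOD_COMBOS` are hypothetical and chosen only for this illustration.

```python
# Toy setup (assumed): observation x = (z1, z2) with binary factors.
ID_COMBOS = [(0, 0), (0, 1), (1, 0)]   # factor combinations seen in training
OOD_COMBOS = [(1, 1)]                  # unseen combination of seen factor values


def encoder_a(x):
    """Slot-wise encoder: each latent dimension reads out one factor."""
    x1, x2 = x
    return (float(x1), float(x2))


def encoder_b(x):
    """Entangled encoder: identical to encoder_a whenever x1 * x2 == 0."""
    x1, x2 = x
    interaction = x1 * x2  # zero on every ID combination
    return (float(x1 - interaction), float(x2 - interaction))


def pseudo_labels(latent):
    """Threshold each latent dimension to get one binary label per factor."""
    return tuple(int(v > 0.5) for v in latent)


def id_accuracy(encoder):
    """Fraction of ID points whose pseudo-labels match the true factors."""
    hits = sum(pseudo_labels(encoder(x)) == x for x in ID_COMBOS)
    return hits / len(ID_COMBOS)


acc_a = id_accuracy(encoder_a)
acc_b = id_accuracy(encoder_b)
ood_a = [pseudo_labels(encoder_a(x)) for x in OOD_COMBOS]
ood_b = [pseudo_labels(encoder_b(x)) for x in OOD_COMBOS]

print("ID accuracy:", acc_a, acc_b)
print("OOD predictions:", ood_a, ood_b)
```

Both encoders use a 2-dimensional latent and are indistinguishable on the ID data, yet they assign opposite labels to the unseen combination. Nothing in an objective evaluated only on ID points prefers one over the other.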