=== Assistant ===
Nice idea, and you're definitely poking in the right place (capacity + bottleneck vs structure), but this particular trick doesn't actually get you what the paper is worried about.

Let me restate what you're proposing in my words:
* You have a dataset with $N$ images.
* You assign one unique pseudo-label per image (so $N$ classes).
* You train an encoder $g_\theta(x) \in \mathbb{R}^d$ with small latent dimension $d \ll N$.
* A classifier head on top of $g_\theta(x)$ predicts the pseudo-label.
* Hope: because the representation is low-dimensional but must separate many "classes", it will be forced to learn something like the underlying compositional structure (slots).

This feels like an "information bottleneck + self-supervised instance discrimination" idea.

===== The paper's impossibility result is not about the number of classes or the supervision; it's about the space of encoders that fit the ID data and how unconstrained they are OOD =====
Roughly, they show (Theorem 3.2) that when $d_x \gg d_z$, for almost any Jacobian/Hessian you want for an encoder $g$ at a point on the manifold, there exists some generator $f \in F_{\text{int}}$ such that $g$ is a valid inverse of $f$ on the data manifold, but its behavior off the manifold (i.e., on unseen combinations) can be arbitrarily bad. The structure that characterizes the "good" inverses $G_{\text{int}}$ only shows up when restricted to the data-manifold geometry, which you don't know OOD. So you can't encode the right inductive bias with architecture/regularization alone. [2512.08854v1]

Your pseudo-label idea still only constrains $g$ on the training points (ID images), not on the whole manifold or its OOD regions. So there are many encoders that:
* give perfect pseudo-label accuracy on all training images,
* but implement completely different, and arbitrarily bad, extensions to any new combination (OOD).

Nothing in the pseudo-label objective forces you towards the particular inverse $g \in G_{\text{int}}$ that corresponds to the "true slots".

===== Intuitively it's tempting to think: "but $d \ll N$, so the representation can't just memorize!" =====
Unfortunately, it absolutely can. A deep-net encoder with a small output dimension can still:
* map each training image $x_i$ to some distinct vector $z_i \in \mathbb{R}^d$,
* while a classifier head memorizes a mapping $\{z_i\} \to \{1, \dots, N\}$.

Key points:
# Continuous codes carry lots of information. Even a 32-dim float vector can encode absurdly many distinct codes. You are not limited to "one dimension per class"; that would only hold if you forced something like one-hot codes or linear-separability constraints at low numeric precision.
# Memorization in low dimensions is easy for finite data. For a finite set of $N$ points, you can always construct an injective map $x_i \mapsto z_i \in \mathbb{R}^d$ for very small $d$ (even $d = 2$ for reasonable $N$). The classifier head (linear or a shallow MLP) then just learns decision regions around those codes. Geometrically, each training point only needs to sit in a little "cell" of representation space that maps to its label; nothing requires those cells to be arranged in a way that reflects compositional structure (see the sketch just after this list).
# No pressure to share factors. The bottleneck doesn't force the encoder to separate an "animal slot" from a "background slot"; it just forces some kind of compression. The easiest compression for SGD might be something like "two big clusters corresponding to (camel, desert) images vs (penguin, snow) images", not "4 independent factors".

So the encoder can happily learn a highly entangled representation that just supports "memorize N instance labels", with no notion of compositionality.
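To make point 2 concrete, here's a minimal self-contained sketch (toy data and a hand-built encoder of my own, not anything from the paper): an injective map from $N$ "images" into $\mathbb{R}^2$ plus a nearest-centroid "classifier head" already gives 100% pseudo-label accuracy while encoding no factor structure at all.

```python
import numpy as np

# Toy demonstration (hypothetical data, not the paper's setup): a hand-built
# injective map from N "images" into R^2, plus a nearest-centroid "classifier
# head", gives perfect pseudo-label accuracy with no compositional structure.
rng = np.random.default_rng(0)
N, dx, d = 1000, 3 * 32 * 32, 2
X = rng.normal(size=(N, dx))          # stand-in "images"
labels = np.arange(N)                 # one unique pseudo-label per image

# "Encoder": rank each image by an arbitrary 1-D projection, then place it on a
# sqrt(N) x sqrt(N) grid in R^2. Any rule that separates the N points works.
ranks = np.argsort(np.argsort(X @ rng.normal(size=dx)))   # rank of each image
side = int(np.ceil(np.sqrt(N)))
codes = np.stack([ranks // side, ranks % side], axis=1).astype(float)  # (N, 2)

# "Classifier head": nearest code wins.
def predict(z):
    return labels[np.argmin(((codes - z) ** 2).sum(axis=1))]

train_acc = np.mean([predict(codes[i]) == i for i in range(N)])
print(train_acc)   # 1.0 -- perfect ID memorization inside a 2-D bottleneck
```

Nothing about this construction knows or cares what the underlying factors are; it only needs the $N$ training points to land in distinct cells.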
===== Take the simple case: two binary factors =====
* animal: {camel, penguin}
* background: {desert, snow}

There are 4 possible combos, but ID only contains:
* (camel, desert) images
* (penguin, snow) images

Your scheme:
* give each training image its own label,
* train an encoder $g(x) \in \mathbb{R}^d$, $d$ small.

What can $g$ do?
* It can map all (camel, desert) images onto some complicated 1-D curve in latent space, and all (penguin, snow) images onto a different curve.
* The classifier head just cuts that space up to separate each individual sample.

When you show an OOD (penguin, desert) image:
* There is zero incentive from training for its representation to land near "penguin slot + desert slot".
* It will land wherever the network's entangled features send it (likely near one of the existing ID clusters), so you misinterpret it.

This is exactly the phenomenon they show empirically: even with proper supervised labels for animal/background, encoders trained non-generatively get near-perfect ID accuracy but often fail badly OOD unless you add huge pretraining or extra supervision. [2512.08854v1] Your pseudo-labels are actually weaker supervision than that (they don't tell the model which factors are shared across images).

===== What you're proposing is very close to: =====
* instance discrimination / contrastive learning (each image as its own "class"),
* with a small embedding dimension.

People already do something like this (SimCLR, MoCo, etc.), and:
* it does give good general-purpose representations in practice if you use tons of data and strong augmentations;
* but it still falls under their "non-generative" umbrella: you're just learning an encoder whose OOD behavior is determined by the training distribution and the optimization, not by inverting a structured generator.

And it doesn't touch the paper's main theoretical obstruction: you still have a huge space of encoders consistent with the ID data, and nothing that forces you into the specific inverse class $G_{\text{int}}$.

===== To turn your idea into something that could in principle guarantee compositional generalization, you'd need more than: =====
* a small latent dimension, and
* classification/pseudo-label objectives.

You'd need constraints that approximate the derivative structure they identify for good inverses (block-diagonal-ish behavior on the tangent space of the data manifold). But:
* those constraints depend on the geometry of the manifold $X$, especially in OOD regions, which you don't observe;
* that's exactly why they argue encoder-side inductive biases are fundamentally weaker than decoder-side ones.

So, even if you add pseudo-labels + a bottleneck, you're still in the "encoder can fit ID, OOD is unconstrained" regime.
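Just to make that last point tangible, here is a rough sketch of what such an encoder-side derivative constraint might look like. Everything in it is hypothetical: the slot partition (`slots`) and the per-factor tangent directions (`tangents`) are assumed given, and the whole problem is that you don't actually have those tangent directions away from the training data, which is exactly the obstruction above.

```python
import torch

# Hypothetical sketch, not the paper's algorithm: IF we somehow knew, at a point
# x, tangent directions t_k of the data manifold associated with each factor k,
# and a partition of latent dimensions into slots S_k, we could penalise the
# encoder Jacobian so that slot S_k responds only to factor k's direction.
def slotwise_jacobian_penalty(encoder, x, tangents, slots):
    """encoder: maps a (dx,) tensor to a (dz,) tensor.
    tangents: list of (dx,) tensors, one tangent direction per factor (assumed known).
    slots: list of index tensors partitioning range(dz), one per factor (assumed known)."""
    J = torch.autograd.functional.jacobian(encoder, x, create_graph=True)  # (dz, dx)
    penalty = x.new_zeros(())
    for k, t_k in enumerate(tangents):
        response = J @ t_k                                   # how each latent dim moves along factor k
        off_slot = torch.ones(response.shape[0], dtype=torch.bool)
        off_slot[slots[k]] = False                           # mask out factor k's own slot
        penalty = penalty + (response[off_slot] ** 2).sum()  # cross-slot "leakage"
    return penalty
```

Even granting this form, you could only evaluate it where you know the manifold geometry, i.e. on ID data, so it does not pin down the encoder in the unseen-combination regions.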
TL;DR on your proposal:
* It's a reasonable heuristic (and quite close to things already tried in SSL).
* But it does not escape the paper's core critique of non-generative methods:
** You're still only constraining the encoder on $X_{\text{ID}}$.
** The bottleneck isn't structured enough to enforce the slotwise/polynomial structure required for compositional generalization.
** There remain many encoders that fit the ID pseudo-labels perfectly and behave arbitrarily OOD.

If you want, we can sketch a concrete toy construction where two different low-dim encoders get 100% pseudo-label accuracy on the same dataset but give totally different decompositions, and opposite predictions, on OOD combos. That's a nice way to see the ambiguity very explicitly.
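Here is a minimal version of that construction (my own toy, with ±1 codes standing in for the images, not an example from the paper): two encoders that are literally identical on every ID example, hence get identical pseudo-label accuracy, but read the OOD combination in opposite ways.

```python
import numpy as np

# Toy construction (hypothetical encoding of the factors, not from the paper):
# animal in {camel=+1, penguin=-1}, background in {desert=+1, snow=-1};
# the "image" is simply x = (animal, background).
X_id = np.array([[+1.0, +1.0],    # camel on desert
                 [-1.0, -1.0]])   # penguin on snow   (the only ID combos)
x_ood = np.array([-1.0, +1.0])    # penguin on desert (never seen in training)

def g_true(x):       # "true slots": slot 0 = animal, slot 1 = background
    return np.array([x[0], x[1]])

def g_swapped(x):    # entangled alternative: reads the coordinates swapped
    return np.array([x[1], x[0]])

# On every ID image the two encoders agree exactly (the factors are perfectly
# correlated there), so any pseudo-label head on top gets the same 100% accuracy.
assert all(np.allclose(g_true(x), g_swapped(x)) for x in X_id)

print(g_true(x_ood))     # [-1.  1.]  -> decoded as "penguin, desert"
print(g_swapped(x_ood))  # [ 1. -1.]  -> decoded as "camel, snow": opposite on both factors
```

Nothing in the ID data (or in any pseudo-label objective defined on it) distinguishes these two encoders; only their behavior on the unseen combination differs.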