===== A more behaviorally principled variant: =====
# Prompts <math>P_t</math> → embedding <math>e_t</math>
# Generator <math>G(e_t) = \hat{\Delta W}_t</math>
# Use two behavioral losses instead of relying heavily on raw weight MSE:
#* Delta loss: <math>\|\text{Delta}(\hat{\Delta W}_t) - \delta_t\|^2</math> (behavior on a fixed probe set)
#* Output distillation loss: on a small set of task-specific prompts <math>Q_t</math>, minimize, e.g., <math>\text{KL}\big(p_{\text{base}+\hat{\Delta W}_t}(\cdot \mid q) \,\|\, p_{\text{base}+\Delta W_t}(\cdot \mid q)\big)</math> so that the generated LoRA matches the real LoRA’s outputs.
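
As a rough illustration of how the two behavioral losses and an optional weight regularizer could be combined, here is a minimal pure-Python sketch. The function names, the weighting coefficients (`lam_delta`, `lam_kd`, `lam_w`), and the representation of model outputs as plain lists of logits are all assumptions made for this example. A real setup would use an autodiff framework and actual model forward passes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def delta_loss(delta_hat, delta_true):
    """Squared L2 distance between the generated and target delta vectors."""
    return sum((a - b) ** 2 for a, b in zip(delta_hat, delta_true))

def distill_loss(gen_logits, true_logits):
    """Mean over prompts q of KL(p_gen(.|q) || p_true(.|q)).

    Each argument is a list with one logit vector per prompt in Q_t.
    The KL direction (generated first) follows the formula in the text.
    """
    kls = [kl_divergence(softmax(g), softmax(t))
           for g, t in zip(gen_logits, true_logits)]
    return sum(kls) / len(kls)

def weight_mse(w_hat, w_true):
    """Mean squared error over flattened weights (optional regularizer)."""
    return sum((a - b) ** 2 for a, b in zip(w_hat, w_true)) / len(w_hat)

def behavioral_loss(delta_hat, delta_true, gen_logits, true_logits,
                    w_hat=None, w_true=None,
                    lam_delta=1.0, lam_kd=1.0, lam_w=0.0):
    """Weighted sum of the losses; lam_* values are hypothetical hyperparameters."""
    loss = lam_delta * delta_loss(delta_hat, delta_true)
    loss += lam_kd * distill_loss(gen_logits, true_logits)
    if lam_w > 0.0 and w_hat is not None and w_true is not None:
        loss += lam_w * weight_mse(w_hat, w_true)
    return loss
```

Setting `lam_w` to zero drops weight-level supervision entirely, while a small positive value keeps it as a mild regularizer.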

Now weight-level supervision becomes optional or just a small regularizer. Most of the signal is:
* “your LoRA should behave like this known one on probes and on some task prompts”.

This is much closer to what we actually want (function match), and δ is a cheap, global summarization of that behavior.

Weaknesses:
* Needs two model evaluations (base + true LoRA and base + generated LoRA) during training. That is expensive, but may be acceptable if you restrict both losses to small probe and prompt sets.
* Still doesn’t fully solve the identifiability problem: many different LoRAs can match the same δ and the same distillation targets on a small <math>Q_t</math>, yet differ elsewhere. That might be fine in practice.
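
The identifiability point can be seen in a toy linear setting. The sketch below is only an illustration: the 2×2 matrices stand in for weight updates, and a single input vector stands in for the probe set. Two different updates agree exactly on the probe but disagree on an input outside it.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical probe input; only the first column of each matrix affects it.
probe = [1.0, 0.0]

delta_w_a = [[1.0, 2.0], [3.0, 4.0]]
delta_w_b = [[1.0, 5.0], [3.0, -1.0]]  # same first column, different second

# Both updates produce the same output on the probe...
same_on_probe = matvec(delta_w_a, probe) == matvec(delta_w_b, probe)

# ...but differ on an input the probe set does not cover.
off_probe = [0.0, 1.0]
differs_elsewhere = matvec(delta_w_a, off_probe) != matvec(delta_w_b, off_probe)
```

Any supervision that only observes behavior on the probe cannot tell these two updates apart.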