Add BD3LM architecture adapter#1479
Conversation
jlarson4
left a comment
There was a problem hiding this comment.
Great work so far @puranikyashaswin! Just a couple small comments and one code hygiene point below.
-
hook_attn_outis a dead hook –DelegatedAttentionBlockBridgepops the four input aliases (hook_q_input/hook_k_input/hook_v_input/hook_attn_in), but leaveshook_attn_outtoattn.hook_out. In this situation,attnis aSymbolicBridgewhoseforwardraises, so it never runs under delegation. Due to thisblocks.{i}.hook_attn_outappears in the registry but silently never fires (no activation, no error). Please either redirect it toattn.o.hook_out(the realattn_outprojection's output, which does fire) or pop it alongside the input aliases. -
No committed end-to-end correctness gate – Nothing in CI exercises the real
DDiTBlock.forward, the only test buildsnn.Linear/nn.LayerNormmocks and never runs a forward pass, there's no integration test, andverify_modelscan't reach BD3LM (kuleshov-group/isn't in_BRIDGE_REMOTE_CODE_PREFIXES). So neither the dead hook above nor any numerical drift is currently detectable, and the block-by-block parity in the description isn't reproducible.test_nemotron_h_adapter.pyalready establishes the pattern to copy: an opt-in, env-var-skippable real-HF test assertingmax_diff == 0.0, plus a hook-firing check on the real model. Addingkuleshov-group/to the remote-code allowlist would also restore the standard verification path.
| tl_config.n_layers = source_config.n_layer | ||
| elif hasattr(source_config, "num_hidden_layers"): | ||
| tl_config.n_layers = source_config.num_hidden_layers | ||
| elif hasattr(source_config, "n_blocks"): |
There was a problem hiding this comment.
Move the n_blocks elif to the end of the n_layers chain in map_default_transformer_lens_config. It currently sits before num_transformer_layers/num_layers. Since this new case only serves one architecture, it should have lowest precedence.
…n test and verify_models support
…om/puranikyashaswin/TransformerLens into feature/bd3lm-architecture-adapter # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
|
Thanks for the detailed review @jlarson4 fixed both:
Also fixed the One CI note: |
|
@puranikyashaswin Thank you for the updates! Great work! The demo notebooks can false fail in CI due to API limits, I am rerunning it now. Assuming it passes, I will merge |
|
Thank you @jlarson4. I appreciate the review and for rerunning the notebook checks. Glad the fixes addressed the concerns. Let me know if there's anything else you'd like me to adjust before merge. |
Description
Adds a TransformerBridge Architecture Adapter for BD3LM (Kuleshov Group's Block Diffusion Language Model, ICLR 2025), enabling
TransformerBridge.boot_transformers("kuleshov-group/bd3lm-owt-block_size4")with full hook support.Fixes #1473
BD3LM is a discrete diffusion LM with a single
block_sizeknob that interpolates between autoregressive and full diffusion behavior. It differs structurally from standard causal LMs: adaLN conditioning on the diffusion timestep, a custom Rotary embedding, joint QKV projection, and non-causal block-diffusion attention masking.Adapter design:
DelegatedAttentionBlockBridgeto delegateDDiTBlock.forward()wholesale to the original HF module. adaLN modulation varies per-timestep, so it can't be folded into weights the way LayerNorm folding works for standard transformers wrapping rather than reimplementing avoids getting this subtly wrong.TestRegistrySyncedWithFactorypasses.sources/transformers.pygainshidden_dim→d_modelandn_blocks→n_layersfallback aliasing, since BD3LM's HF config uses non-standard attribute names._HF_PASSTHROUGH_ATTRS(in bothsources/transformers.pyandsources/_bridge_builder.py) gainsmodel_length,block_size,cond_dim,adaln,cross_attn. Without this,model_lengthsilently falls back to the wrong default, producing an incorrectly-shaped attention mask and small nonzero logit divergence caught and root-caused during development.Verification:
sample_mode=False(default forward path, seq_len=2048, real block-diffusion mask) andsample_mode=True(generation-time path) exact match once the passthrough-attrs fix above was in place.run_with_cacheconfirmed to populate real per-hook activations (28 hooks per block: norms, QKV, adaLN modulation output, MLP) in both modes, not just pass-through logits.model_bridgeunit suite (2209 tests) passes with no regressions from the shared_HF_PASSTHROUGH_ATTRSchange.Open item feedback welcome:
verify_modelscan't currently run BD3LM it assumesAutoModelForCausalLMand doesn't havetrust_remote_codeallowlisted for this model prefix. Fixing that touches shared infra (verify_models.py) rather than just this adapter, so it's scoped out of this PR; parity/cache correctness was verified directly instead (see above). Happy to take this on separately or fold it in here if preferred.supports_generation = Falsesince BD3LM uses its own diffusion sampling loop, not HF'sgenerate()Phase 4 doesn't apply, but Phases 1–3 do and were manually verified as above.Test coverage note: this PR includes unit tests (
tests/unit/model_bridge/supported_architectures/test_bd3lm_adapter.py, 17 tests) but not yet a committed integration test attests/integration/model_bridge/test_bd3lm_adapter.py. Parity was verified via ad-hoc scripts during development rather than a committed test. Happy to add one before merge if preferred.Type of change
Checklist: