
Add BD3LM architecture adapter #1479

Open
puranikyashaswin wants to merge 4 commits into TransformerLensOrg:dev from puranikyashaswin:feature/bd3lm-architecture-adapter

@puranikyashaswin

Contributor

Description

Adds a TransformerBridge Architecture Adapter for BD3LM (Kuleshov Group's Block Diffusion Language Model, ICLR 2025), enabling TransformerBridge.boot_transformers("kuleshov-group/bd3lm-owt-block_size4") with full hook support.

Fixes #1473

BD3LM is a discrete diffusion LM with a single block_size knob that interpolates between autoregressive and full diffusion behavior. It differs structurally from standard causal LMs: adaLN conditioning on the diffusion timestep, a custom Rotary embedding, joint QKV projection, and non-causal block-diffusion attention masking.
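To make the role of block_size concrete, here is a minimal sketch of a block-causal attention mask. It is a simplified picture, not code taken from BD3LM: the function name and the rule "attend within your own block and to all earlier blocks" are assumptions for illustration, and the real model's mask may differ in its details.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Simplified rule (an assumption, not the BD3LM implementation): a token
    sees every token in its own block and every token in earlier blocks.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


causal = block_attention_mask(4, block_size=1)   # reduces to a causal mask
blocked = block_attention_mask(4, block_size=2)  # two blocks of two tokens
full = block_attention_mask(4, block_size=4)     # every token sees every token
```

With block_size=1 the rule collapses to ordinary causal masking, and with block_size equal to the sequence length it becomes fully bidirectional, which is the interpolation described above.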

Adapter design:

  • Uses DelegatedAttentionBlockBridge to delegate DDiTBlock.forward() wholesale to the original HF module. adaLN modulation varies per timestep, so it can't be folded into weights the way LayerNorm is folded for standard transformers. Wrapping rather than reimplementing avoids getting this subtly wrong.
  • Registered at all four required sites (adapter package, factory, model registry, report generation); TestRegistrySyncedWithFactory passes.
  • sources/transformers.py gains hidden_dim → d_model and n_blocks → n_layers fallback aliasing, since BD3LM's HF config uses non-standard attribute names.
  • _HF_PASSTHROUGH_ATTRS (in both sources/transformers.py and sources/_bridge_builder.py) gains model_length, block_size, cond_dim, adaln, cross_attn. Without this, model_length silently falls back to the wrong default, producing an incorrectly shaped attention mask and a small nonzero logit divergence (caught and root-caused during development).
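The passthrough failure mode in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the library's code: copy_passthrough, resolve_model_length, the default of 1024, and the config values are all hypothetical; only the attribute names come from the description above.

```python
from types import SimpleNamespace

# Attribute names are from the PR description; everything else is illustrative.
PASSTHROUGH_ATTRS = ("model_length", "block_size", "cond_dim", "adaln", "cross_attn")
DEFAULT_MODEL_LENGTH = 1024  # assumed default, not taken from the library


def copy_passthrough(hf_config, allowlist):
    """Copy only allowlisted attributes that actually exist on the HF config."""
    return {
        name: getattr(hf_config, name)
        for name in allowlist
        if hasattr(hf_config, name)
    }


def resolve_model_length(bridge_attrs):
    """Fall back silently to the default when model_length was not copied."""
    return bridge_attrs.get("model_length", DEFAULT_MODEL_LENGTH)


# Made-up config values, for illustration only.
hf_config = SimpleNamespace(model_length=2048, block_size=4)

without_fix = copy_passthrough(hf_config, allowlist=("block_size",))
with_fix = copy_passthrough(hf_config, allowlist=PASSTHROUGH_ATTRS)
```

In this toy version, resolve_model_length(without_fix) returns the default rather than the configured value and raises no error, which is the kind of silent mismatch that would then feed into the attention-mask shape.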

Verification:

  • Logit parity vs. the raw HF model confirmed block by block (all 12 blocks + embeddings + final logits) in both sample_mode=False (default forward path, seq_len=2048, real block-diffusion mask) and sample_mode=True (generation-time path): exact match once the passthrough-attrs fix above was in place.
  • run_with_cache confirmed to populate real per-hook activations (28 hooks per block: norms, QKV, adaLN modulation output, MLP) in both modes, not just pass-through logits.
  • Full existing model_bridge unit suite (2209 tests) passes with no regressions from the shared _HF_PASSTHROUGH_ATTRS change.
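A block-by-block parity check of the kind described in the first bullet might look roughly like this. It is a sketch with plain Python lists standing in for tensors; max_abs_diff, first_divergence, and the stage names are hypothetical helpers, not part of TransformerLens or the ad-hoc scripts used for this PR.

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("shape mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergence(reference, candidate):
    """Return the name of the first stage whose activations differ, else None."""
    for name, ref in reference.items():
        if max_abs_diff(ref, candidate[name]) != 0.0:
            return name
    return None


# Toy activations keyed by stage, in forward order (values are made up).
reference = {"embed": [0.1, 0.2], "blocks.0": [0.3, 0.4], "logits": [1.0, -1.0]}
matching = {name: list(values) for name, values in reference.items()}
drifted = {**matching, "blocks.0": [0.3, 0.4000001]}
```

Walking the stages in forward order is what lets a divergence be attributed to the earliest block where it appears, rather than only being observed at the final logits.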

Open item (feedback welcome): verify_models can't currently run BD3LM because it assumes AutoModelForCausalLM and doesn't have trust_remote_code allowlisted for this model prefix. Fixing that touches shared infra (verify_models.py) rather than just this adapter, so it's scoped out of this PR; parity/cache correctness was verified directly instead (see above). Happy to take this on separately or fold it in here if preferred.

supports_generation = False, since BD3LM uses its own diffusion sampling loop rather than HF's generate(). Phase 4 therefore doesn't apply, but Phases 1–3 do and were manually verified as above.

Test coverage note: this PR includes unit tests (tests/unit/model_bridge/supported_architectures/test_bd3lm_adapter.py, 17 tests) but not yet a committed integration test at tests/integration/model_bridge/test_bd3lm_adapter.py. Parity was verified via ad-hoc scripts during development rather than a committed test. Happy to add one before merge if preferred.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@jlarson4 jlarson4 left a comment

Collaborator

Great work so far @puranikyashaswin! Just a couple small comments and one code hygiene point below.

  1. hook_attn_out is a dead hook – DelegatedAttentionBlockBridge pops the four input aliases (hook_q_input/hook_k_input/hook_v_input/hook_attn_in), but leaves hook_attn_out aliased to attn.hook_out. Here attn is a SymbolicBridge whose forward raises, so it never runs under delegation. As a result, blocks.{i}.hook_attn_out appears in the registry but silently never fires (no activation, no error). Please either redirect it to attn.o.hook_out (the real attn_out projection's output, which does fire) or pop it alongside the input aliases.

  2. No committed end-to-end correctness gate – Nothing in CI exercises the real DDiTBlock.forward: the only test builds nn.Linear/nn.LayerNorm mocks and never runs a forward pass, there is no integration test, and verify_models can't reach BD3LM (kuleshov-group/ isn't in _BRIDGE_REMOTE_CODE_PREFIXES). So neither the dead hook above nor any numerical drift is currently detectable, and the block-by-block parity in the description isn't reproducible. test_nemotron_h_adapter.py already establishes the pattern to copy: an opt-in, env-var-skippable real-HF test asserting max_diff == 0.0, plus a hook-firing check on the real model. Adding kuleshov-group/ to the remote-code allowlist would also restore the standard verification path.
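Both points come down to a hook that is registered but never reached. The toy sketch below illustrates that failure mode and the kind of hook-firing check that would catch it. HookPoint and ToyDelegatedBlock here are stand-ins written for this example, not the TransformerLens classes, and the attribute names are hypothetical.

```python
class HookPoint:
    """Minimal stand-in that records whether it was ever called."""

    def __init__(self):
        self.fired = False

    def __call__(self, value):
        self.fired = True
        return value


class ToyDelegatedBlock:
    """Toy block: the delegated forward runs, the symbolic attn wrapper does not."""

    def __init__(self):
        self.attn_hook_out = HookPoint()    # sits on a wrapper that never executes
        self.attn_o_hook_out = HookPoint()  # sits on the projection that does

    def forward(self, x):
        # Under delegation only the output projection's hook is reached.
        return self.attn_o_hook_out(x)


def dead_aliases(aliases, x=1.0):
    """Run one forward pass and list aliases whose target hook never fired."""
    block = ToyDelegatedBlock()
    block.forward(x)
    return sorted(
        alias for alias, target in aliases.items()
        if not getattr(block, target).fired
    )


before = dead_aliases({"hook_attn_out": "attn_hook_out"})
after = dead_aliases({"hook_attn_out": "attn_o_hook_out"})
```

A check of this shape only detects the problem if a real forward pass is executed, which is why mock-only unit tests cannot surface it.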

    tl_config.n_layers = source_config.n_layer
elif hasattr(source_config, "num_hidden_layers"):
    tl_config.n_layers = source_config.num_hidden_layers
elif hasattr(source_config, "n_blocks"):

Collaborator

Move the n_blocks elif to the end of the n_layers chain in map_default_transformer_lens_config. It currently sits before num_transformer_layers/num_layers. Since this new case only serves one architecture, it should have lowest precedence.
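The precedence argument can be sketched as an ordered lookup. This is an illustration only: resolve_n_layers and the tuple below are hypothetical, the tuple contains just the attribute names mentioned in this review, and the real chain in map_default_transformer_lens_config may include others.

```python
from types import SimpleNamespace

# Order encodes precedence: widely used names first, the
# single-architecture alias (n_blocks) last.
LAYER_COUNT_ATTRS = (
    "n_layer",
    "num_hidden_layers",
    "num_transformer_layers",
    "num_layers",
    "n_blocks",
)


def resolve_n_layers(source_config, attrs=LAYER_COUNT_ATTRS):
    """Return the first matching layer-count attribute, or None if none exist."""
    for name in attrs:
        if hasattr(source_config, name):
            return getattr(source_config, name)
    return None


bd3lm_like = SimpleNamespace(n_blocks=12)
# A made-up config that happens to define both names.
ambiguous = SimpleNamespace(num_layers=24, n_blocks=6)
```

With n_blocks last, a config that defines both a standard name and n_blocks resolves to the standard one, so the new alias cannot change behavior for existing architectures.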

@jlarson4 jlarson4 linked an issue Jul 2, 2026 that may be closed by this pull request
1 task
…om/puranikyashaswin/TransformerLens into feature/bd3lm-architecture-adapter

@puranikyashaswin

Contributor Author

Thanks for the detailed review, @jlarson4. Fixed both:

  1. hook_attn_out: redirected to attn.o.hook_out via hook_alias_overrides. Worth noting explicitly: this captures the raw attn_out projection output, not the gate_msa-scaled value actually added to the residual stream. The gating happens inside a torch.jit.script-fused function with no hookable module boundary, so this is the closest available hook point, not a fully equivalent one. I documented this in a comment at the override site so it's clear to anyone using the hook for activation patching.

  2. Correctness gate: added tests/integration/model_bridge/test_bd3lm_adapter.py following test_nemotron_h_adapter.py's pattern: it loads the real model, asserts max_diff == 0.0 against HF, and verifies hook firing on real activations (confirming that the fix for item 1 actually works, not just structurally). Also added kuleshov-group/ to _BRIDGE_REMOTE_CODE_PREFIXES so verify_models can reach it now.
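To illustrate the caveat in item 1, here is a small numeric sketch of the difference between what the redirected hook sees and what reaches the residual stream. It assumes the gating is an elementwise multiply followed by a residual add, and it omits bias and dropout; the function name and all values are made up for the example and are not taken from the BD3LM code.

```python
def gated_residual_update(residual, attn_out, gate_msa):
    """Assumed form of the fused step: residual + gate_msa * attn_out."""
    return [r + g * a for r, a, g in zip(residual, attn_out, gate_msa)]


residual = [1.0, 1.0]
attn_out = [2.0, 4.0]    # raw projection output: what hook_attn_out captures
gate_msa = [0.5, 0.25]   # timestep-dependent gate (made-up values)

hooked = attn_out
added = [g * a for g, a in zip(gate_msa, attn_out)]  # actual contribution
new_residual = gated_residual_update(residual, attn_out, gate_msa)
```

Under these assumptions, a value patched in at the hook is still scaled by gate_msa before it is added, so patching at this point is not the same as patching the residual contribution directly.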

Also fixed the n_blocks elif ordering per your inline comment. Full unit suite (3043 tests) passes clean with no regressions.

One CI note: Notebook Checks (Activation_Patching_in_TL_Demo) is failing, but I verified it passes cleanly both on this branch and on a fresh checkout of dev locally (same pre-existing jupyter_client deprecation warnings either way). This looks like CI-runner flakiness unrelated to this PR. Let me know if a re-run would help, since I don't have permission to trigger one from a fork.

@jlarson4

jlarson4 commented Jul 3, 2026

Collaborator

@puranikyashaswin Thank you for the updates! Great work! The demo notebooks can fail spuriously in CI due to API limits, so I am rerunning it now. Assuming it passes, I will merge.

@puranikyashaswin

Contributor Author

Thank you @jlarson4. I appreciate the review and for rerunning the notebook checks. Glad the fixes addressed the concerns. Let me know if there's anything else you'd like me to adjust before merge.



Development

Successfully merging this pull request may close these issues.

[Proposal] Add BD3LM block-diffusion adapter (BD3LM)

2 participants