Add AutoEP + AutoTP parallel folding #8064
Allow tensor parallelism (AutoTP) for the dense/attention path to coexist with expert parallelism (AutoEP) for routed experts on the same rank set, without requiring EP to be a subset of DP.
- Treat dense and MoE as independent partitionings: dense view tp*dp, expert view ep*etp*edp, with dp/edp derived so tp*dp == ep*etp*edp == stage_size. expert_tensor_parallel_size is reserved (must currently be 1).
- Express folding via the existing tensor_parallel/expert_parallel config sections, with divisibility, TP/sequence-parallel exclusivity, and preset_model consistency validation.
- Add the route-full / partition-dispatch MoE path and AutoTP skipping of AutoEP subtrees; derive folded process groups via the generalized expert/data-parallel group creation.
- Reduce TP-replicated router/gate gradients mode-aware (sum when tokens are partitioned, average when replicated); record per-parameter-family ZeRO checkpoint metadata and handle folded ZeRO-1/2 optimizer state.
- Add folding unit tests (config, groups, dispatch, runtime, gradient parity, checkpoint), including multi-rank GPU-gated cases.
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 278c919489
    chunks = torch.split(grad_output, ctx.counts, dim=0)
    grad_padded = grad_output.new_zeros((ctx.max_rows, *grad_output.shape[1:]))
    if local_count:
        grad_padded[:local_count].copy_(chunks[ctx.group_rank])
    return grad_padded[:local_count].contiguous(), None, None, None
Sum gathered-row gradients across TP lanes
When folded MoE output is consumed differently on each TP lane (for example by a row-parallel/lm-head layer that slices the hidden dimension), every gathered row participates in the loss on every lane. This backward path only returns chunks[ctx.group_rank] from the local rank's grad_output, so contributions from peer lanes to this rank's local expert outputs and routing weights are dropped; the padded local gradient needs to be accumulated across ctx.group before returning.
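The effect described in this comment can be illustrated without any distributed setup. The following is a minimal sketch in plain Python, not the PR's code: the names `own_lane_only` and `summed_over_lanes` are hypothetical, each gathered row carries a single scalar gradient for brevity, and `lane_grads[l]` stands for the `grad_output` that TP lane `l` sees over all gathered rows. It assumes the forward pass gathers every rank's rows onto every lane, so the matching backward is a sum over lanes of the slice belonging to the local rank.

```python
def _local_slice(counts, rank):
    """Return the [start, end) row range that `rank` contributed to the gather."""
    start = sum(counts[:rank])
    return start, start + counts[rank]


def own_lane_only(lane_grads, counts, rank):
    """Mimic the reviewed backward: keep only this lane's own chunk."""
    start, end = _local_slice(counts, rank)
    return lane_grads[rank][start:end]


def summed_over_lanes(lane_grads, counts, rank):
    """Accumulate the local rows' gradients from every lane in the group."""
    start, end = _local_slice(counts, rank)
    return [sum(lane[i] for lane in lane_grads) for i in range(start, end)]


# Two lanes; rank 0 contributed 2 rows, rank 1 contributed 1 row.
counts = [2, 1]
lane_grads = [
    [1.0, 2.0, 3.0],     # gradients of the gathered rows as seen on lane 0
    [10.0, 20.0, 30.0],  # gradients of the same rows as seen on lane 1
]

print(own_lane_only(lane_grads, counts, 0))      # [1.0, 2.0]
print(summed_over_lanes(lane_grads, counts, 0))  # [11.0, 22.0]
```

In this toy case the contribution from lane 1 to rank 0's rows (10.0 and 20.0) is lost unless the gradients are summed across the group. In a real run that accumulation would be a collective over `ctx.group`; which collective fits best is left open here.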
    grad_reduc = self.get_gradient_for_reduction(param)
    self._maybe_reduce_autoep_folding_tp_gradient(param, grad_reduc)
Honor ds_grad_is_ready before TP reduction
In ZeRO-2 folded runs, parameters with ds_grad_is_ready=False are intentionally skipped until their transient/tiled gradient is complete, as the guard immediately below documents. Calling the new TP reduction before that guard mutates and all-reduces incomplete gradients for those parameters, which can corrupt the final accumulated gradient once the ready shard is eventually reduced.
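The ordering this comment asks for can be sketched as follows. This is a hypothetical stand-alone illustration, not DeepSpeed's implementation: `process_gradient`, `tp_reduce`, and `partition_reduce` are made-up names, and only the attribute `ds_grad_is_ready` comes from the review text. The point is that the readiness guard runs before any reduction touches the gradient.

```python
from types import SimpleNamespace


def process_gradient(param, tp_reduce, partition_reduce):
    """Reduce a parameter's gradient only once it is marked ready.

    `tp_reduce` and `partition_reduce` are placeholder callables standing in
    for the TP-lane reduction and the ZeRO partition reduction.
    """
    # Guard first: an incomplete (transient/tiled) gradient must not be reduced.
    if not getattr(param, "ds_grad_is_ready", True):
        return False
    tp_reduce(param)
    partition_reduce(param)
    return True


calls = []
ready = SimpleNamespace(name="w_ready", ds_grad_is_ready=True)
pending = SimpleNamespace(name="w_pending", ds_grad_is_ready=False)

for p in (ready, pending):
    process_gradient(
        p,
        tp_reduce=lambda q: calls.append(("tp", q.name)),
        partition_reduce=lambda q: calls.append(("zero", q.name)),
    )

print(calls)  # [('tp', 'w_ready'), ('zero', 'w_ready')]
```

The pending parameter triggers no reduction at all, which is the behaviour the existing guard is described as protecting.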
This PR adds parallel folding for AutoEP: tensor parallelism (AutoTP) for the dense/attention path can now coexist with expert parallelism (AutoEP) for the routed-expert path on the same set of ranks, without forcing EP to be a subset of DP.
(This PR should be adjusted for ZeRO3 support after #8060 is merged)
Design
Attention/dense and MoE are treated as two independent partitionings of the same rank set, parameterized per parameter family:
- dense view: stage_size = tp * dp
- expert view: stage_size = ep * etp * edp

dp and edp are always derived, never user-configured, so the invariant tp * dp == ep * etp * edp == stage_size cannot be broken from config.
Configuration
No new config section. Folding is expressed by the coexistence of the existing tensor_parallel and expert_parallel sections:

    {
      "tensor_parallel": { "autotp_size": 2 },
      "expert_parallel": { "enabled": true, "autoep_size": 4, "expert_tensor_parallel_size": 1 }
    }

expert_tensor_parallel_size is carried as a config field but currently must be 1 (expert-internal TP is reserved as follow-up and rejected fail-fast). Validation enforces divisibility, TP/sequence-parallel exclusivity, and preset_model consistency between the two sections.
What's included
- […] mp_mode TP-strided vs SP-consecutive ordering).
- […] deepspeed/moe/ep_tp_dispatch.py), with AutoTP skipping AutoEP subtrees.
Correctness & validation
- […] aws-torch-latest-full) on H100 GPUs.
Scope / follow-ups
- Expert-internal TP (expert_tensor_parallel_size > 1) is reserved for a follow-up.
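As an illustration of the Design invariant and the fail-fast rule for expert_tensor_parallel_size, here is a minimal sketch in Python. It is not the PR's validation code: the function name `derive_folded_sizes` and the exact error messages are hypothetical, and the divisibility checks shown are an assumed reading of the "divisibility" validation mentioned in the description.

```python
def derive_folded_sizes(stage_size, tp, ep, etp=1):
    """Derive dp and edp so that tp * dp == ep * etp * edp == stage_size.

    Hypothetical helper: dp and edp are never taken from the user,
    only computed from stage_size and the configured tp / ep / etp.
    """
    if min(stage_size, tp, ep, etp) < 1:
        raise ValueError("all sizes must be positive integers")
    if etp != 1:
        # Expert-internal TP is reserved for a follow-up, so reject it early.
        raise ValueError("expert_tensor_parallel_size must currently be 1")
    if stage_size % tp != 0:
        raise ValueError(f"stage_size={stage_size} is not divisible by tp={tp}")
    if stage_size % (ep * etp) != 0:
        raise ValueError(
            f"stage_size={stage_size} is not divisible by ep*etp={ep * etp}"
        )
    dp = stage_size // tp
    edp = stage_size // (ep * etp)
    # Both views must cover the same rank set.
    assert tp * dp == ep * etp * edp == stage_size
    return {"dp": dp, "edp": edp}


# The sample config (autotp_size=2, autoep_size=4) on an assumed 8-rank stage.
print(derive_folded_sizes(stage_size=8, tp=2, ep=4))  # {'dp': 4, 'edp': 2}
```

With 8 ranks, the dense view is 2-way TP times 4-way DP while the expert view is 4-way EP times 2-way expert DP, so EP is not a subset of DP. The 8-rank stage size is an assumption made for the example only.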