perf(groupby): unify scatter kernel over numpy and dask via apply_ufunc#802
perf(groupby): unify scatter kernel over numpy and dask via apply_ufunc#802FBumann wants to merge 6 commits into
Conversation
The fast path of LinearExpression.groupby(...).sum() used ds.unstack(group_dim, fill_value=...) followed by a stack, which materializes 2-3 intermediate copies of the padded result (n_groups x max_group_size x nterm) and goes through pandas MultiIndex machinery sized by the number of elements. Instead, factorize the groups and scatter coeffs/vars directly into the preallocated padded result arrays; constants are group-summed with np.add.at. Peak memory drops to input + result (the minimum for the padded layout) and the grouping itself gets considerably faster. The result is unchanged: same dims, coords, term ordering and padding. The unstack-based implementation is kept as _sum_by_unstack and still used for chunked (dask-backed) data, which cannot be scattered into numpy arrays. NaN group labels now raise an informative ValueError instead of failing inside unstack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a test for grouping over an empty group dimension, which the scatter fast path handles cleanly but the unstack fallback cannot. Trim comments that duplicated the helper docstrings.
Relax the groupby-sum scatter gate to a pure numpy/dask check: auxiliary coordinates on the grouped dimension no longer force the slow unstack path. Summing over groups collapses that dimension, so both kernels drop every coordinate tied to it — the scatter result is identical, just cheaper. The unstack kernel now serves only chunked (dask) data, and a debug log records when that fallback is taken. Inline the now-trivial predicate into the dispatch and consolidate the kernel tests into a TestGroupbySumScatterKernel class: a one-line case table over a shared fixture, with added coverage for combined structures, auxiliary coords, and a MultiIndex grouped dimension. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merging this PR will improve performance by ×2.1
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Memory | test_to_lp[nodal_balance-severity=100] |
17.9 MB | 6 MB | ×3 |
| ⚡ | Memory | test_to_lp[nodal_balance-severity=50] |
9.2 MB | 3.1 MB | ×3 |
| ⚡ | Memory | test_to_lp[nodal_balance-severity=0] |
385.3 KB | 138.4 KB | ×2.8 |
| ⚡ | Memory | test_build[nodal_balance-severity=100] |
32 MB | 12.8 MB | ×2.5 |
| ⚡ | Memory | test_build[nodal_balance-severity=50] |
16.8 MB | 7 MB | ×2.4 |
| ⚡ | Memory | test_to_solver[highs-nodal_balance-severity=100] |
24.9 MB | 13.3 MB | +87.47% |
| ⚡ | Memory | test_to_solver[gurobi-nodal_balance-severity=100] |
25.1 MB | 13.5 MB | +86.1% |
| ⚡ | Memory | test_to_solver[highs-nodal_balance-severity=50] |
12.9 MB | 7.1 MB | +81.68% |
| ⚡ | Memory | test_to_solver[gurobi-nodal_balance-severity=50] |
13.1 MB | 7.3 MB | +79.32% |
| ⚡ | Memory | test_build[nodal_balance-severity=0] |
1.4 MB | 1.2 MB | +19.61% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing fluxopt:perf/groupby-sum-apply-ufunc (ac8ec47) with master (fe798b1)
Footnotes
-
138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
06ab9af to
2abf407
Compare
Replace the previous numpy-scatter / dask-unstack split with a single kernel (`_grouped_sum`) wrapped in `xarray.apply_ufunc`. It scatters terms into the padded result arrays for numpy-backed data and runs the same scatter lazily on chunked (dask) data via `dask="parallelized"`, after gathering the grouped and term dimensions (the scatter's core dims) into single chunks. This removes the last `pd.MultiIndex`/`unstack` usage in groupby-sum, drops the numpy-vs-dask branch in `sum()`, and keeps peak memory at input + result on both backends. Multi-key / DataFrame grouping and its `MultiIndex` result are unaffected — that logic sits above the kernel. Tests verify the kernel from first principles (each group's terms and constant must match its members) across every case shape on both numpy and dask, plus explicit anchors pinning the exact padded layout — member order, fill position, term interleaving and the factor axis — for the linear, multidim and quadratic cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2abf407 to
ac8ec47
Compare
|
I ran the benchmark locally with pytets-benchmem to also check timing. Small improvement there too! Note The following content was generated by AI. Local verification of the performance claim on the
Both peak memory (~2.4–2.5× lower) and build time (~5–13% faster) improve, and the gain grows with group skew — the pathological case the scatter targets. Consistent with the CodSpeed report (×2–3 memory on build/to_lp/to_solver). MethodEach version's A |
I use apply_ufunc to make this dask capable. As we dont have the reference unstack implementation anymore, i introduced quite a heavy testing part (fast though), as I find the apply_ufunc version harder to understand personally. Happy to strip it down.
Note
The following content was generated by AI.
Stacked on #793. Until #793 merges, this PR's diff includes its commit too — review only the top commit
perf(groupby): unify scatter kernel ....What this does
#793 split groupby-sum into a numpy kernel and a dask
unstackfallback. Thiscollapses them into a single kernel (
_grouped_sum) wrapped inxarray.apply_ufunc:dask="parallelized",after gathering the grouped dimension into a single chunk (which unstacking
required too).
This removes the last
pd.MultiIndex/unstackusage in groupby-sum, drops thenumpy-vs-dask branch in
sum(), and keeps peak memory at input + result on bothbackends. Multi-key / DataFrame grouping and its
MultiIndexresult areunaffected — that logic sits above the kernel (existing tests cover it).
Tests
The kernel is verified from first principles — for every group and every
slice over the non-grouped dims, the result's live terms must equal the multiset
of its members' terms and the constant their NaN-skipping sum — across every
case shape on both numpy and dask backends. Three explicit anchors pin the
exact padded layout (member order, fill position,
(nterm, max_size)interleaving, and the
_factoraxis) for the linear, multidim and quadraticcases.
Benchmark (300k elem × 8 dim × 1000 groups, numpy)
_sum_by_scatter(#793)_sum_by_unstack(#793 dask path)The unified kernel matches the scatter kernel's memory and time; the old dask
path cost 2.2× peak.
Notes
linopy/expressions.pyand the groupby kernel tests; fulltest_linear_expression+test_quadratic_expressionpass (366), broadersuite green.
group_diminto one chunk is unavoidable for a scatter (a group'smembers can sit in any chunk) and is exactly what the old unstack path forced.