perf(groupby): unify scatter kernel over numpy and dask via apply_ufunc by FBumann · Pull Request #802 · PyPSA/linopy

FBumann · 2026-06-30T20:50:51Z

I use apply_ufunc to make this dask capable. As we dont have the reference unstack implementation anymore, i introduced quite a heavy testing part (fast though), as I find the apply_ufunc version harder to understand personally. Happy to strip it down.

Note

The following content was generated by AI.

Stacked on #793. Until #793 merges, this PR's diff includes its commit too — review only the top commit perf(groupby): unify scatter kernel ....

What this does

#793 split groupby-sum into a numpy kernel and a dask unstack fallback. This
collapses them into a single kernel (_grouped_sum) wrapped in
xarray.apply_ufunc:

numpy-backed data scatters terms into the padded result arrays as before;
chunked (dask) data runs the same scatter lazily via dask="parallelized",
after gathering the grouped dimension into a single chunk (which unstacking
required too).

This removes the last pd.MultiIndex/unstack usage in groupby-sum, drops the
numpy-vs-dask branch in sum(), and keeps peak memory at input + result on both
backends. Multi-key / DataFrame grouping and its MultiIndex result are
unaffected — that logic sits above the kernel (existing tests cover it).

Tests

The kernel is verified from first principles — for every group and every
slice over the non-grouped dims, the result's live terms must equal the multiset
of its members' terms and the constant their NaN-skipping sum — across every
case shape on both numpy and dask backends. Three explicit anchors pin the
exact padded layout (member order, fill position, (nterm, max_size)
interleaving, and the _factor axis) for the linear, multidim and quadratic
cases.

Benchmark (300k elem × 8 dim × 1000 groups, numpy)

kernel	mean time	peak memory
`_sum_by_scatter` (#793)	46.5 ms	114.6 MiB
`_sum_by_unstack` (#793 dask path)	49.9 ms	254.8 MiB
apply_ufunc (this PR)	48.3 ms	114.6 MiB

The unified kernel matches the scatter kernel's memory and time; the old dask
path cost 2.2× peak.

Notes

Touches only linopy/expressions.py and the groupby kernel tests; full
test_linear_expression + test_quadratic_expression pass (366), broader
suite green.
Gathering group_dim into one chunk is unavoidable for a scatter (a group's
members can sit in any chunk) and is exactly what the old unstack path forced.

The fast path of LinearExpression.groupby(...).sum() used ds.unstack(group_dim, fill_value=...) followed by a stack, which materializes 2-3 intermediate copies of the padded result (n_groups x max_group_size x nterm) and goes through pandas MultiIndex machinery sized by the number of elements. Instead, factorize the groups and scatter coeffs/vars directly into the preallocated padded result arrays; constants are group-summed with np.add.at. Peak memory drops to input + result (the minimum for the padded layout) and the grouping itself gets considerably faster. The result is unchanged: same dims, coords, term ordering and padding. The unstack-based implementation is kept as _sum_by_unstack and still used for chunked (dask-backed) data, which cannot be scattered into numpy arrays. NaN group labels now raise an informative ValueError instead of failing inside unstack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a test for grouping over an empty group dimension, which the scatter fast path handles cleanly but the unstack fallback cannot. Trim comments that duplicated the helper docstrings.

Relax the groupby-sum scatter gate to a pure numpy/dask check: auxiliary coordinates on the grouped dimension no longer force the slow unstack path. Summing over groups collapses that dimension, so both kernels drop every coordinate tied to it — the scatter result is identical, just cheaper. The unstack kernel now serves only chunked (dask) data, and a debug log records when that fallback is taken. Inline the now-trivial predicate into the dispatch and consolidate the kernel tests into a TestGroupbySumScatterKernel class: a one-line case table over a shared fixture, with added coverage for combined structures, auxiliary coords, and a MultiIndex grouped dimension. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codspeed-hq · 2026-06-30T20:54:09Z

Merging this PR will improve performance by ×2.1

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 10 improved benchmarks
✅ 128 untouched benchmarks
⏩ 138 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Memory	`test_to_lp[nodal_balance-severity=100]`	17.9 MB	6 MB	×3
⚡	Memory	`test_to_lp[nodal_balance-severity=50]`	9.2 MB	3.1 MB	×3
⚡	Memory	`test_to_lp[nodal_balance-severity=0]`	385.3 KB	138.4 KB	×2.8
⚡	Memory	`test_build[nodal_balance-severity=100]`	32 MB	12.8 MB	×2.5
⚡	Memory	`test_build[nodal_balance-severity=50]`	16.8 MB	7 MB	×2.4
⚡	Memory	`test_to_solver[highs-nodal_balance-severity=100]`	24.9 MB	13.3 MB	+87.47%
⚡	Memory	`test_to_solver[gurobi-nodal_balance-severity=100]`	25.1 MB	13.5 MB	+86.1%
⚡	Memory	`test_to_solver[highs-nodal_balance-severity=50]`	12.9 MB	7.1 MB	+81.68%
⚡	Memory	`test_to_solver[gurobi-nodal_balance-severity=50]`	13.1 MB	7.3 MB	+79.32%
⚡	Memory	`test_build[nodal_balance-severity=0]`	1.4 MB	1.2 MB	+19.61%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing fluxopt:perf/groupby-sum-apply-ufunc (ac8ec47) with master (fe798b1)}

138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

Replace the previous numpy-scatter / dask-unstack split with a single kernel (`_grouped_sum`) wrapped in `xarray.apply_ufunc`. It scatters terms into the padded result arrays for numpy-backed data and runs the same scatter lazily on chunked (dask) data via `dask="parallelized"`, after gathering the grouped and term dimensions (the scatter's core dims) into single chunks. This removes the last `pd.MultiIndex`/`unstack` usage in groupby-sum, drops the numpy-vs-dask branch in `sum()`, and keeps peak memory at input + result on both backends. Multi-key / DataFrame grouping and its `MultiIndex` result are unaffected — that logic sits above the kernel. Tests verify the kernel from first principles (each group's terms and constant must match its members) across every case shape on both numpy and dask, plus explicit anchors pinning the exact padded layout — member order, fill position, term interleaving and the factor axis — for the linear, multidim and quadratic cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FBumann · 2026-07-01T07:09:28Z

I ran the benchmark locally with pytets-benchmem to also check timing. Small improvement there too!

Note

The following content was generated by AI.

Local verification of the performance claim on the nodal_balance reference model from benchmarks/ (the groupby(bus).sum() idiom), comparing master (fe798b1, pre-#793) against this branch. Measured with pytest-benchmark + pytest-benchmem (--benchmark-memory).

`test_build[nodal_balance]`	time: master → branch	peak mem: master → branch
severity=0	10.24 → 9.78 ms (−5%)	1.43 → 1.30 MiB
severity=50	11.16 → 10.14 ms (−9%)	16.60 → 6.92 MiB (×2.4)
severity=100	11.70 → 10.39 ms (−13%)	31.82 → 12.72 MiB (×2.5)

Both peak memory (~2.4–2.5× lower) and build time (~5–13% faster) improve, and the gain grows with group skew — the pathological case the scatter targets. Consistent with the CodSpeed report (×2–3 memory on build/to_lp/to_solver).

Method

Each version's linopy selected via PYTHONPATH (master checked out in a throwaway worktree; branch from this checkout), same benchmarks/ suite, then:

benchmem compare bench_master.json bench_branch.json --columns time,peak --stat mean

A benchmem sweep linopy <master-ref> <branch-ref> --suite benchmarks/ --memory reproduces this in isolated uv venvs.

FabianHofmann · 2026-07-01T07:09:31Z

apply_ufunc is of course master class :) this implementation is quite complex and I would run it on pypsa tests first. I would say this should have priority over #793 and should directly go into master. @coroa is it realistic for you to take a look?

FBumann and others added 5 commits June 29, 2026 09:58

Merge branch 'master' into perf/groupby-sum-scatter-upstream

34d1332

test: cover empty group dim in scatter groupby-sum

7f3edea

Add a test for grouping over an empty group dimension, which the scatter fast path handles cleanly but the unstack fallback cannot. Trim comments that duplicated the helper docstrings.

docs: add release note for scatter groupby-sum

7598180

FBumann force-pushed the perf/groupby-sum-apply-ufunc branch 3 times, most recently from 06ab9af to 2abf407 Compare July 1, 2026 06:16

FBumann force-pushed the perf/groupby-sum-apply-ufunc branch from 2abf407 to ac8ec47 Compare July 1, 2026 06:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(groupby): unify scatter kernel over numpy and dask via apply_ufunc#802

perf(groupby): unify scatter kernel over numpy and dask via apply_ufunc#802
FBumann wants to merge 6 commits into
PyPSA:masterfrom
fluxopt:perf/groupby-sum-apply-ufunc

FBumann commented Jun 30, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

FBumann commented Jul 1, 2026 •

edited

Loading

Uh oh!

FabianHofmann commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

FBumann commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

Tests

Notes

Uh oh!

codspeed-hq Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will improve performance by ×2.1

Performance Changes

Footnotes

Uh oh!

FBumann commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FabianHofmann commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FBumann commented Jun 30, 2026 •

edited

Loading

codspeed-hq Bot commented Jun 30, 2026 •

edited

Loading

FBumann commented Jul 1, 2026 •

edited

Loading