
sketch out improved performance by refactoring codec pipeline logic #3719

Closed
d-v-b wants to merge 38 commits into zarr-developers:main from d-v-b:perf/smarter-codecs

@d-v-b d-v-b commented Feb 25, 2026

Contributor

This builds on top of #3715 and achieves even more perf improvements by refactoring the basic logic of codec encoding / decoding. The design document behind these changes is here.

I do not think this is merge-worthy, as it's far too big. But I'm going to post the performance gains and start figuring out how to break this into pieces.

A big feature this adds is the ability to write individual subchunks to uncompressed shards on storage backends that support range writes (local and memory).
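To illustrate why uncompressed shards make this possible, here is a minimal sketch that is not taken from this PR. It assumes every subchunk in the shard has the same fixed byte size, so the byte range of subchunk `i` can be computed without reading anything. The names `subchunk_byte_range` and `write_subchunk` are hypothetical, a `bytearray` stands in for a store that supports range writes, and the shard index that a real shard also carries is left out.

```python
import numpy as np


def subchunk_byte_range(index: int, chunk_nbytes: int) -> tuple[int, int]:
    """Return the (start, stop) byte range of subchunk `index`.

    Assumes an uncompressed shard in which every subchunk occupies exactly
    `chunk_nbytes` bytes and subchunks are stored back to back.
    """
    start = index * chunk_nbytes
    return start, start + chunk_nbytes


def write_subchunk(shard: bytearray, index: int, chunk: np.ndarray) -> None:
    """Overwrite one subchunk in place, leaving the rest of the shard untouched."""
    start, stop = subchunk_byte_range(index, chunk.nbytes)
    shard[start:stop] = chunk.tobytes()


# Hypothetical layout: 4 subchunks of 100 uint8 values per shard.
chunks_per_shard = 4
chunk_shape = (100,)
dtype = np.dtype("uint8")
chunk_nbytes = chunk_shape[0] * dtype.itemsize

# Stand-in for a store with range-write support.
shard = bytearray(chunks_per_shard * chunk_nbytes)

# Write only subchunk 2; no read-modify-write of the full shard is needed.
write_subchunk(shard, 2, np.full(chunk_shape, 7, dtype=dtype))
```

With compressed subchunks the encoded size changes from write to write, so offsets cannot be computed up front like this, which is presumably why the feature is limited to uncompressed shards.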

Benchmark comparison: perf/smarter-codecs vs main

| Test Name | perf/smarter-codecs (ms) | main (ms) | Speedup |
|---|---:|---:|---:|
| `test_write_array[memory-chunks=100,shards=1M-None]` | 46.85 | 1018.60 | 21.74× |
| `test_write_array[local-chunks=100,shards=1M-None]` | 68.45 | 1006.27 | 14.70× |
| `test_sharded_morton_indexing[(32,32,32)]` | 22.45 | 247.49 | 11.02× |
| `test_slice_indexing[None-(0,0,0)]` | 0.03 | 0.28 | 10.05× |
| `test_sharded_morton_indexing_large[(33,33,33)]` | 254.85 | 2521.32 | 9.89× |
| `test_slice_indexing[(50,50,50)-full_slice]` | 9.12 | 89.80 | 9.85× |
| `test_sharded_morton_indexing_large[(32,32,32)]` | 226.80 | 2153.47 | 9.50× |
| `test_sharded_morton_indexing_large[(30,30,30)]` | 181.86 | 1725.29 | 9.49× |
| `test_slice_indexing[(50,50,50)-(0,0,0)]` | 0.06 | 0.60 | 9.45× |
| `test_write_array[memory-chunks=100,shards=1M-gzip]` | 211.85 | 1978.29 | 9.34× |
| `test_slice_indexing[None-(slice(None,10,None))*3]` | 0.03 | 0.28 | 9.28× |
| `test_write_array[local-chunks=100,shards=1M-gzip]` | 217.00 | 1965.37 | 9.06× |
| `test_slice_indexing[(50,50,50)-strided_4]` | 8.96 | 80.44 | 8.98× |
| `test_slice_indexing[(50,50,50)-strided_4_offset]` | 4.83 | 43.21 | 8.95× |
| `test_sharded_morton_indexing[(16,16,16)]` | 2.78 | 23.26 | 8.37× |
| `test_slice_indexing[(50,50,50)-(slice(None,10,None))*3]` | 0.07 | 0.60 | 8.36× |
| `test_slice_indexing[None-full_slice]` | 11.07 | 85.12 | 7.69× |
| `test_read_array[memory-chunks=100,shards=1M-gzip]` | 181.44 | 1372.13 | 7.56× |
| `test_read_array[memory-chunks=100,shards=1M-None]` | 80.73 | 609.53 | 7.55× |
| `test_read_array[memory-chunks=1K,no_shards-None]` | 7.21 | 53.52 | 7.43× |
| `test_read_array[local-chunks=100,shards=1M-None]` | 83.79 | 612.18 | 7.31× |
| `test_slice_indexing[None-strided_4]` | 12.19 | 84.98 | 6.97× |
| `test_read_array[memory-chunks=1K,no_shards-gzip]` | 16.79 | 115.69 | 6.89× |
| `test_write_array[memory-chunks=1K,no_shards-gzip]` | 32.54 | 219.93 | 6.76× |
| `test_read_array[local-chunks=100,shards=1M-gzip]` | 190.20 | 1277.12 | 6.71× |
| `test_read_array[local-chunks=1K,no_shards-None]` | 24.23 | 142.33 | 5.87× |
| `test_slice_indexing[None-mixed_slice]` | 0.12 | 0.71 | 5.84× |
| `test_read_array[local-chunks=1K,no_shards-gzip]` | 37.83 | 216.45 | 5.72× |
| `test_slice_indexing[None-strided_4_offset]` | 5.77 | 32.55 | 5.64× |
| `test_write_array[memory-chunks=1K,no_shards-None]` | 19.16 | 106.12 | 5.54× |
| `test_slice_indexing[(50,50,50)-mixed_slice]` | 0.23 | 1.21 | 5.29× |
| `test_slice_indexing[(50,50,50)-strided_4-get_latency]` | 15.82 | 82.14 | 5.19× |
| `test_write_array[memory-chunks=1K,shards=1K-gzip]` | 100.60 | 451.93 | 4.49× |
| `test_read_array[local-chunks=1K,shards=1K-None]` | 66.94 | 298.83 | 4.46× |
| `test_write_array[memory-chunks=1K,shards=1K-None]` | 71.06 | 315.12 | 4.43× |
| `test_read_array[memory-chunks=1K,shards=1K-None]` | 43.90 | 192.21 | 4.38× |
| `test_sharded_morton_single_chunk[(32,32,32)]` | 0.18 | 0.74 | 4.12× |
| `test_read_array[memory-chunks=1K,shards=1K-gzip]` | 68.96 | 283.15 | 4.11× |
| `test_read_array[local-chunks=1K,shards=1K-gzip]` | 90.06 | 368.79 | 4.09× |
| `test_sharded_morton_single_chunk[(33,33,33)]` | 0.19 | 0.76 | 4.03× |
| `test_slice_indexing[(50,50,50)-full_slice-get_latency]` | 20.53 | 79.62 | 3.88× |
| `test_sharded_morton_single_chunk[(30,30,30)]` | 0.20 | 0.72 | 3.69× |
| `test_slice_indexing[(50,50,50)-strided_4_offset-get_latency]` | 17.73 | 50.27 | 2.84× |
| `test_slice_indexing[None-strided_4_offset-get_latency]` | 19.29 | 48.98 | 2.54× |
| `test_slice_indexing[None-strided_4-get_latency]` | 43.55 | 102.36 | 2.35× |
| `test_slice_indexing[None-full_slice-get_latency]` | 46.77 | 101.82 | 2.18× |
| `test_write_array[local-chunks=1K,shards=1K-gzip]` | 367.61 | 733.76 | 2.00× |
| `test_slice_indexing[(50,50,50)-(0,0,0)-get_latency]` | 0.49 | 0.87 | 1.79× |
| `test_slice_indexing[(50,50,50)-(slice(None,10,None))*3-get_latency]` | 0.50 | 0.88 | 1.77× |
| `test_slice_indexing[None-mixed_slice-get_latency]` | 0.58 | 1.00 | 1.74× |
| `test_slice_indexing[(50,50,50)-mixed_slice-get_latency]` | 1.12 | 1.92 | 1.72× |
| `test_write_array[local-chunks=1K,shards=1K-None]` | 390.76 | 619.05 | 1.58× |
| `test_write_array[local-chunks=1K,no_shards-None]` | 225.86 | 340.42 | 1.51× |
| `test_write_array[local-chunks=1K,no_shards-gzip]` | 284.20 | 400.62 | 1.41× |
| `test_slice_indexing[None-(slice(None,10,None))*3-get_latency]` | 0.31 | 0.43 | 1.39× |
| `test_slice_indexing[None-(0,0,0)-get_latency]` | 0.33 | 0.43 | 1.28× |
| `test_morton_order_iter[(30,30,30)]` | 104.17 | 123.84 | 1.19× |
| `test_morton_order_iter[(16,16,16)]` | 2.56 | 3.01 | 1.18× |
| `test_morton_order_iter[(10,10,10)]` | 7.44 | 8.61 | 1.16× |
| `test_morton_order_iter[(8,8,8)]` | 0.31 | 0.36 | 1.15× |
| `test_morton_order_iter[(33,33,33)]` | 644.86 | 740.97 | 1.15× |
| `test_morton_order_iter[(20,20,20)]` | 78.57 | 88.98 | 1.13× |
| `test_sharded_morton_write_single_chunk[(33,33,33)]` | 668.32 | 754.39 | 1.13× |
| `test_morton_order_iter[(32,32,32)]` | 22.80 | 24.85 | 1.09× |
| `test_sharded_morton_write_single_chunk[(30,30,30)]` | 130.08 | 129.83 | 1.00× |
| `test_sharded_morton_write_single_chunk[(32,32,32)]` | 48.33 | 32.43 | 0.67× |
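
For readers checking the Speedup column: it appears to be the `main` time divided by the `perf/smarter-codecs` time, so a value below 1 is a regression. The sketch below recomputes three rows from the table under that assumption; the helper name `speedup` is hypothetical. Rows with sub-millisecond timings will not reproduce exactly, presumably because the displayed times are rounded to two decimals.

```python
def speedup(main_ms: float, branch_ms: float) -> float:
    """Ratio of baseline time to branch time; > 1 means the branch is faster."""
    return main_ms / branch_ms


# (test name, perf/smarter-codecs ms, main ms) copied from the table above.
rows = [
    ("test_write_array[memory-chunks=100,shards=1M-None]", 46.85, 1018.60),
    ("test_write_array[local-chunks=100,shards=1M-None]", 68.45, 1006.27),
    ("test_sharded_morton_write_single_chunk[(32,32,32)]", 48.33, 32.43),
]

for name, branch_ms, main_ms in rows:
    ratio = speedup(main_ms, branch_ms)
    verdict = "faster" if ratio > 1 else "slower"
    print(f"{name}: {ratio:.2f}x ({verdict} than main)")
```

The last row is the only one in the table where the branch is slower than `main`.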

d-v-b added 30 commits February 18, 2026 21:48
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Mar 10, 2026
codspeed-hq Bot commented Mar 10, 2026

Merging this PR will improve performance by ×260

⚡ 59 improved benchmarks
✅ 7 untouched benchmarks
⏩ 6 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---:|---:|---:|
| WallTime | `test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]` | 2,196.4 ms | 704.2 ms | ×3.1 |
| WallTime | `test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]` | 556.8 ms | 193.8 ms | ×2.9 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory]` | 1,742.5 µs | 549.1 µs | ×3.2 |
| WallTime | `test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory]` | 3.8 ms | 1.1 ms | ×3.5 |
| WallTime | `test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]` | 1,031.3 ms | 283.1 ms | ×3.6 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency]` | 436.2 ms | 117.8 ms | ×3.7 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory]` | 419.6 ms | 83.5 ms | ×5 |
| WallTime | `test_sharded_morton_indexing_large[(33, 33, 33)-memory]` | 10.3 s | 1.8 s | ×5.8 |
| WallTime | `test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory_get_latency]` | 4.2 ms | 2.2 ms | +84.66% |
| WallTime | `test_sharded_morton_single_chunk[(32, 32, 32)-memory]` | 2,018.4 µs | 681 µs | ×3 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory_get_latency]` | 4.2 ms | 3.2 ms | +28.54% |
| WallTime | `test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None]` | 1,622.2 ms | 590.1 ms | ×2.7 |
| WallTime | `test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None]` | 5,367.9 ms | 252.9 ms | ×21 |
| WallTime | `test_sharded_morton_single_chunk[(33, 33, 33)-memory]` | 1,978.3 µs | 703.4 µs | ×2.8 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory_get_latency]` | 4.2 ms | 3.3 ms | +27.81% |
| WallTime | `test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip]` | 9.5 s | 1.2 s | ×7.7 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory]` | 414.6 ms | 78.9 ms | ×5.3 |
| WallTime | `test_sharded_morton_single_chunk[(30, 30, 30)-memory]` | 1,958 µs | 659.6 µs | ×3 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency]` | 431.3 ms | 113.2 ms | ×3.8 |
| WallTime | `test_slice_indexing[(50, 50, 50)-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory_get_latency]` | 243.5 ms | 70.6 ms | ×3.4 |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing d-v-b:perf/smarter-codecs (d5c712c) with main (a02d996)

Open in CodSpeed

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

d-v-b commented Mar 10, 2026

Contributor Author

@zarr-developers/python-core-devs please look at these numbers. Getting this PR into main is probably not practical, but I am confident that breaking it into pieces will work.

@normanrz

Member

The numbers are certainly impressive! Great work.
Breaking it up makes a lot of sense. The changes in tier 1 and probably also tier 2 are not really controversial. For tier 3, I think we'll need some discussions and probably extended benchmarking.

d-v-b commented Jun 18, 2026

Contributor Author

closing this as superseded by #3885

@d-v-b d-v-b closed this Jun 18, 2026