Optimize Parquet row-filter struct schema pruning #22960

Open
shehab-ali wants to merge 1 commit into apache:main from shehab-ali:shehab/parquet-schema-pruning
@shehab-ali shehab-ali commented Jun 15, 2026

Rationale for this change

build_filter_schema and prune_struct_type repeatedly scanned the same struct-field access paths while constructing Parquet row-filter projection schemas. That caused avoidable iterator work and temporary allocations when filters referenced multiple struct fields, especially across nested struct paths.

This change groups struct field access paths once per schema level, so schema pruning can reuse lookups instead of repeatedly filtering the full access-path list.

What changes are included in this PR?

  • Group struct field access paths by root column in build_filter_schema.
  • Reuse the grouped root-path lookup when deciding whether to keep a full field or prune a struct field.
  • Group recursive struct pruning paths by field name in prune_struct_type.
  • Avoid rebuilding per-field temporary vectors while walking struct fields.
  • Preserve whole-field output when an access path terminates at that field.
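
The grouping idea in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Ty`, `AccessPath`, `prune`, and `build_schema` are hypothetical stand-ins for the Arrow types and the real `build_filter_schema` / `prune_struct_type` functions, and the sketch only handles columns reached through an access path.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a field's data type (not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Struct(Vec<(String, Ty)>),
}

/// Hypothetical access path: a root column plus nested field names below it.
struct AccessPath {
    root: String,
    fields: Vec<String>,
}

/// Keep only the parts of `ty` reachable through `paths`, where each path
/// lists the field names remaining below the current level.
fn prune(ty: &Ty, paths: &[&[String]]) -> Ty {
    // A path that ends at this field (empty remainder) keeps the whole field.
    // Non-struct types cannot be pruned further, so they are kept as-is too.
    let fields = match ty {
        Ty::Struct(fields) if !paths.iter().any(|p| p.is_empty()) => fields,
        _ => return ty.clone(),
    };

    // Group the paths once per level by their next field name, instead of
    // re-filtering the full path list for every child field.
    let mut by_name: HashMap<&str, Vec<&[String]>> = HashMap::new();
    for p in paths {
        by_name.entry(p[0].as_str()).or_default().push(&p[1..]);
    }

    let kept = fields
        .iter()
        .filter_map(|(name, child)| {
            by_name
                .get(name.as_str())
                .map(|sub| (name.clone(), prune(child, sub)))
        })
        .collect();
    Ty::Struct(kept)
}

/// Group access paths by root column once, then prune each referenced column.
fn build_schema(schema: &[(String, Ty)], paths: &[AccessPath]) -> Vec<(String, Ty)> {
    let mut by_root: HashMap<&str, Vec<&[String]>> = HashMap::new();
    for p in paths {
        by_root
            .entry(p.root.as_str())
            .or_default()
            .push(p.fields.as_slice());
    }

    schema
        .iter()
        .filter_map(|(name, ty)| {
            by_root
                .get(name.as_str())
                .map(|sub| (name.clone(), prune(ty, sub)))
        })
        .collect()
}
```

With paths `s.a` and `s.b.x` against a column `s: {a, b: {x, y}, c}`, this sketch would keep `s: {a, b: {x}}`. Adding a path that ends at `s.b` would keep `b` whole, which mirrors the last bullet above.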

Benchmark Results

These results compare each benchmark’s with_pushdown case against its matching no_pushdown case in the current code.

| Benchmark | No pushdown median | With pushdown median | Runtime improvement | Speedup |
| --- | --- | --- | --- | --- |
| parquet_struct_filter_pushdown/select_star | 5.0117 s | 483.70 ms | 90.35% faster | 10.36× |
| parquet_struct_filter_pushdown/select_star_cross_col | 5.0457 s | 4.8399 s | 4.08% faster | 1.04× |
| parquet_struct_filter_pushdown/select_id | 4.7709 s | 369.30 µs | 99.99% faster | 12,919× |
| parquet_nested_filter_pushdown | 35.332 ms | 5.8611 ms | 83.41% faster | 6.03× |
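
For reference, the "Runtime improvement" and "Speedup" columns appear to follow the usual definitions, sketched below. The helper names `speedup` and `pct_faster` are illustrative only and are not part of the benchmark code.

```rust
/// Speedup factor: baseline median divided by the new median (same units).
fn speedup(baseline_s: f64, new_s: f64) -> f64 {
    baseline_s / new_s
}

/// Runtime improvement: percentage of the baseline runtime that was removed.
fn pct_faster(baseline_s: f64, new_s: f64) -> f64 {
    (1.0 - new_s / baseline_s) * 100.0
}
```

For the select_star row, `speedup(5.0117, 0.48370)` and `pct_faster(5.0117, 0.48370)` round to the 10.36× and 90.35% reported above.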

Throughput comparison

| Benchmark | No pushdown median throughput | With pushdown median throughput | Throughput improvement |
| --- | --- | --- | --- |
| parquet_struct_filter_pushdown/select_star | 19.953 Kelem/s | 206.74 Kelem/s | 936.16% higher |
| parquet_struct_filter_pushdown/select_star_cross_col | 19.819 Kelem/s | 20.662 Kelem/s | 4.25% higher |
| parquet_struct_filter_pushdown/select_id | 20.960 Kelem/s | 270.78 Melem/s | ~1,291,885% higher |
| parquet_nested_filter_pushdown | 2.8303 Melem/s | 17.062 Melem/s | 502.83% higher |

Are these changes tested?

  • cargo fmt --all
  • cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings
  • cargo test -p datafusion-datasource-parquet row_filter

Are there any user-facing changes?

No

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Jun 15, 2026
@shehab-ali shehab-ali marked this pull request as ready for review June 15, 2026 20:37