feat(physical-plan): add GroupColumn support for FixedSizeList<primitive> in multi-column GROUP BY#23128
Open
zhuqi-lucas wants to merge 1 commit into
Open
Conversation
…ive> in multi-column GROUP BY PR 2 of the EPIC apache#22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (apache#22751) to bring the first nested type into `GroupValuesColumn`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map). ## Which issue does this PR close? Part of apache#22682 (nested type coverage) and apache#22715 (EPIC). ## Rationale for this change Today a single `FixedSizeList<primitive>` column in a GROUP BY drags the whole grouping onto the byte-encoded `GroupValuesRows` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in `GroupValuesColumn`. With the recursive `make_group_column` factory + `group_column_supported_type` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries. ## What changes are included in this PR? - `FixedSizeListGroupValueBuilder<T: ArrowPrimitiveType>` in a new `fixed_size_list` submodule. Storage: outer null bitmap + a child `PrimitiveGroupValueBuilder<T, true>` that holds every element flat (length = `outer_len * list_len`). Element `j` of outer row `i` lives at child index `i * list_len + j`. - `group_column_supported_type` accepts `FixedSizeList<primitive>` for the primitive subset wired through the dispatcher (Int8..Int64, UInt8..UInt64, Float32, Float64, Date32, Date64). Composite children (Struct / List inside FSL) are deferred to a follow-up after their respective builders land. - `make_group_column` dispatches `DataType::FixedSizeList(...)` to the appropriate `FixedSizeListGroupValueBuilder<T>` via an `instantiate_fsl!` macro mirroring the existing `instantiate_primitive!` pattern. ## Are these changes tested? Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus extensions to the existing `group_column_supported_type` ⇔ `make_group_column` consistency fuzz. Per-builder: - append / build round trip with mixed outer-null and inner-null rows - `equal_to` for identical / different / outer-null / inner-null rows - `take_n` boundary cases (`n=0`, `n=len`, prefix containing null rows) - sliced input array (offset != 0) - `vectorized_equal_to` / `vectorized_append` match per-row reference - `size()` grows with appends - `build` on empty builder returns empty array - end-to-end dispatcher → `GroupValuesColumn` routing Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR. 127/127 aggregates tests pass, clippy clean, fmt clean. ## What follows in the EPIC - PR 3: `Struct<...>` builder + dispatcher. - PR 4: `List<T>` builder + dispatcher (recursive child via factory). - PR 5: `LargeList<T>` builder + dispatcher. - PR 6: Relax FSL child restriction to allow `FSL<Struct>` / `FSL<List>` once the prerequisite child builders are in. - PR 7: `Map<K,V>` (`List<Struct<keys, values>>` Arrow representation).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR 2 of the EPIC #22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (#22751) to bring the first nested type into `GroupValuesColumn`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map).
Which issue does this PR close?
Part of #22682 (nested type coverage) and #22715 (EPIC).
Rationale for this change
Today a single `FixedSizeList` column in a GROUP BY drags the whole grouping onto the byte-encoded `GroupValuesRows` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in `GroupValuesColumn`. With the recursive `make_group_column` factory + `group_column_supported_type` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries.
What changes are included in this PR?
Are these changes tested?
Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus extensions to the existing `group_column_supported_type` ⇔ `make_group_column` consistency fuzz.
Per-builder:
Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR.
127/127 aggregates tests pass, clippy clean, fmt clean.
What follows in the EPIC
Are there any user-facing changes?
No behavior change for users whose GROUP BY did not previously contain a `FixedSizeList` column. For users who did, the grouping now uses the column-native fast path instead of falling back to `GroupValuesRows` — same results, less memory and CPU.