feat(physical-plan): add GroupColumn support for FixedSizeList<primitive> in multi-column GROUP BY#23128

Open
zhuqi-lucas wants to merge 1 commit into apache:main from zhuqi-lucas:qizhu/group-column-fixed-size-list-primitive

@zhuqi-lucas

PR 2 of the EPIC #22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (#22751) to bring the first nested type into `GroupValuesColumn`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map).

Which issue does this PR close?

Part of #22682 (nested type coverage) and #22715 (EPIC).

Rationale for this change

Today a single `FixedSizeList<primitive>` column in a GROUP BY drags the whole grouping onto the byte-encoded `GroupValuesRows` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in `GroupValuesColumn`. With the recursive `make_group_column` factory + `group_column_supported_type` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries.

What changes are included in this PR?

  • `FixedSizeListGroupValueBuilder<T: ArrowPrimitiveType>` in a new `fixed_size_list` submodule. Storage: outer null bitmap + a child `PrimitiveGroupValueBuilder<T, true>` that holds every element flat (length = `outer_len * list_len`). Element `j` of outer row `i` lives at child index `i * list_len + j`.
  • `group_column_supported_type` accepts `FixedSizeList<primitive>` for the primitive subset wired through the dispatcher (Int8..Int64, UInt8..UInt64, Float32, Float64, Date32, Date64). Composite children (Struct / List inside FSL) are deferred to a follow-up after their respective builders land.
  • `make_group_column` dispatches `DataType::FixedSizeList(...)` to the appropriate `FixedSizeListGroupValueBuilder<T>` via an `instantiate_fsl!` macro mirroring the existing `instantiate_primitive!` pattern.
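To make the storage layout concrete, here is a minimal, std-only sketch of the flat indexing scheme described in the first bullet. It is not the code in this PR: `FlatFslBuilder` and its methods are hypothetical names, it is specialized to `i32` instead of being generic over `ArrowPrimitiveType`, and it uses `Vec<bool>` / `Vec<Option<i32>>` in place of Arrow null buffers and the child `PrimitiveGroupValueBuilder`. It also assumes an outer-null row still reserves `list_len` child slots so that the `i * list_len + j` arithmetic holds for every row.

```rust
/// Hypothetical, simplified stand-in for the builder described above.
/// Specialized to `i32`; plain `Vec`s replace Arrow buffers.
#[derive(Debug)]
pub struct FlatFslBuilder {
    /// Fixed number of elements per outer row.
    list_len: usize,
    /// One validity flag per outer row (`true` = row is non-null).
    outer_valid: Vec<bool>,
    /// Every element of every row, stored flat; `None` is an inner null.
    /// Invariant: `child.len() == outer_valid.len() * list_len`.
    child: Vec<Option<i32>>,
}

impl FlatFslBuilder {
    pub fn new(list_len: usize) -> Self {
        Self { list_len, outer_valid: Vec::new(), child: Vec::new() }
    }

    /// Number of outer rows appended so far.
    pub fn len(&self) -> usize {
        self.outer_valid.len()
    }

    /// Append one row. `None` is an outer null; it still reserves
    /// `list_len` child slots (assumption) to keep the indexing uniform.
    pub fn append(&mut self, row: Option<&[Option<i32>]>) {
        match row {
            Some(values) => {
                assert_eq!(values.len(), self.list_len, "row has wrong length");
                self.outer_valid.push(true);
                self.child.extend_from_slice(values);
            }
            None => {
                self.outer_valid.push(false);
                self.child.extend(std::iter::repeat(None).take(self.list_len));
            }
        }
    }

    /// Element `j` of outer row `i` lives at child index `i * list_len + j`.
    pub fn element(&self, i: usize, j: usize) -> Option<i32> {
        assert!(j < self.list_len, "element index out of range");
        self.child[i * self.list_len + j]
    }

    /// Grouping-style equality: null equals null, at both levels.
    pub fn equal_to(&self, i: usize, row: Option<&[Option<i32>]>) -> bool {
        match (self.outer_valid[i], row) {
            (false, None) => true,
            (true, Some(values)) => {
                let start = i * self.list_len;
                values.len() == self.list_len
                    && &self.child[start..start + self.list_len] == values
            }
            _ => false,
        }
    }
}
```

With `list_len = 2`, appending `[1, null]`, an outer null, and `[3, 4]` yields six child slots, and `element(2, 1)` reads child index `2 * 2 + 1 = 5`.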

Are these changes tested?

Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus extensions to the existing `group_column_supported_type` ⇔ `make_group_column` consistency fuzz.

Per-builder:

  • append / build round trip with mixed outer-null and inner-null rows
  • `equal_to` for identical / different / outer-null / inner-null rows
  • `take_n` boundary cases (`n=0`, `n=len`, prefix containing null rows)
  • sliced input array (offset != 0)
  • `vectorized_equal_to` / `vectorized_append` match per-row reference
  • `size()` grows with appends
  • `build` on empty builder returns empty array
  • end-to-end dispatcher → `GroupValuesColumn` routing

Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR.
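The invariant the consistency fuzz protects can be sketched in a few lines. This is a toy model, not the DataFusion code: `Ty` is a hypothetical stand-in for Arrow's `DataType` with only a handful of variants, `supported` and `make` stand in for `group_column_supported_type` and `make_group_column`, and `make` returns a label instead of a real builder. Top-level non-primitive types are left out of the toy entirely, so its `false` for a bare `Utf8` says nothing about the real allow-list.

```rust
/// Toy stand-in for Arrow's `DataType`; only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
pub enum Ty {
    Int32,
    UInt64,
    Float64,
    Date32,
    Utf8,
    Decimal128,
    Boolean,
    FixedSizeList(Box<Ty>, usize),
}

/// Primitive subset modeled by this toy (one per family named above).
fn is_wired_primitive(t: &Ty) -> bool {
    matches!(t, Ty::Int32 | Ty::UInt64 | Ty::Float64 | Ty::Date32)
}

/// Allow-list: an FSL is accepted only when its child is a wired primitive.
pub fn supported(t: &Ty) -> bool {
    match t {
        Ty::FixedSizeList(child, _) => is_wired_primitive(child),
        other => is_wired_primitive(other),
    }
}

/// Factory: returns a label for the builder it would construct, or `None`.
pub fn make(t: &Ty) -> Option<String> {
    match t {
        Ty::Int32 | Ty::UInt64 | Ty::Float64 | Ty::Date32 => {
            Some(format!("Primitive<{t:?}>"))
        }
        Ty::FixedSizeList(child, n) if is_wired_primitive(child) => {
            Some(format!("FixedSizeList<{child:?}; {n}>"))
        }
        _ => None,
    }
}

/// The property under test: the allow-list and the factory never disagree.
pub fn consistent(types: &[Ty]) -> bool {
    types.iter().all(|t| supported(t) == make(t).is_some())
}
```

Checking `consistent` over both accepted children (signed, unsigned, float, date) and rejected ones (Utf8, Decimal128, Boolean) catches the case where one function is updated and the other is not.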

127/127 aggregates tests pass, clippy clean, fmt clean.

What follows in the EPIC

  • PR 3: `Struct<...>` builder + dispatcher.
  • PR 4: `List<T>` builder + dispatcher (recursive child via factory).
  • PR 5: `LargeList<T>` builder + dispatcher.
  • PR 6: Relax FSL child restriction to allow `FSL<Struct>` / `FSL<List>` once the prerequisite child builders are in.
  • PR 7: `Map<K,V>` (`List<Struct<keys, values>>` Arrow representation).

Are there any user-facing changes?

No behavior change for users whose GROUP BY did not previously contain a `FixedSizeList` column. For users who did, the grouping now uses the column-native fast path instead of falling back to `GroupValuesRows` — same results, less memory and CPU.

Copilot AI review requested due to automatic review settings June 23, 2026 15:32
github-actions bot added the physical-plan label (Changes to the physical-plan crate) on Jun 23, 2026

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
