
feat: support file-level parquet row selections #22940

Open
haohuaijin wants to merge 6 commits into apache:main from haohuaijin:row-selection-access-plan


@haohuaijin

Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  • Add public ParquetRowSelection.
  • Add ParquetAccessPlan::try_new_from_overall_row_selection.
  • Allow Parquet opener setup to read either ParquetAccessPlan or ParquetRowSelection.
  • Reject using both extension types on the same file.
  • Validate that the selection row count matches the file row count.
  • Document the new extension path in ParquetSource.

Are these changes tested?

Yes. This PR adds tests for:

  • converting a file-level selection into row-group access
  • rejecting invalid selection row counts
  • creating an initial plan from ParquetRowSelection
  • rejecting both ParquetAccessPlan and ParquetRowSelection on the same file

Are there any user-facing changes?

Yes. This adds a new public ParquetRowSelection type for callers that want to attach a file-level Parquet RowSelection to a PartitionedFile.

@github-actions (bot) added the datasource label (Changes to the datasource crate) on Jun 13, 2026

@kosiew left a comment

Contributor


@haohuaijin
Thanks for the PR. I do not see any blocking issues, just a few suggestions that could make the implementation a bit simpler and help protect the new extension behavior.

selection: RowSelection,
row_group_meta_data: &[RowGroupMetaData],
) -> Result<Self> {
let selectors: Vec<RowSelector> = selection.into();


Nice work here. I think this could be simplified a bit by leaning on RowSelection::split_off, since it already handles the boundary splitting invariant.

One possible shape would be to keep let mut remaining_selection = selection;, then for each row group call let group_selection = remaining_selection.split_off(rg_rows);. From there, derive selected = group_selection.row_count() and skipped = group_selection.skipped_row_count(), then map (selected, skipped) to Skip, Scan, or Selection(group_selection).

After the loop, the total row count check could also be more direct by validating remaining_selection.row_count() + remaining_selection.skipped_row_count() == 0 along with the accumulated file count. That would remove the manual current / leading / mixed cursor state machine.
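
For illustration, the loop described above can be sketched without any external crates. The Selector, Selection, and RowGroupAccess types and the plan_from_file_selection function below are hypothetical stand-ins written for this sketch; they only model the split_off, row_count, and skipped_row_count behavior the comment relies on and are not the parquet or DataFusion definitions. Treating an empty row group as Skip is also an assumption of the sketch.

```rust
// Hypothetical stand-ins for parquet's RowSelector / RowSelection,
// kept minimal so the sketch compiles with the standard library only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Selection {
    selectors: Vec<Selector>,
}

impl Selection {
    /// Number of rows that are selected (not skipped).
    fn row_count(&self) -> usize {
        self.selectors.iter().filter(|s| !s.skip).map(|s| s.row_count).sum()
    }

    /// Number of rows that are skipped.
    fn skipped_row_count(&self) -> usize {
        self.selectors.iter().filter(|s| s.skip).map(|s| s.row_count).sum()
    }

    /// Return the first `n` rows as a new selection and keep the rest in `self`.
    /// A selector that straddles the boundary is cut in two.
    fn split_off(&mut self, n: usize) -> Selection {
        let mut head = Vec::new();
        let mut rest = Vec::new();
        let mut taken = 0;
        for s in std::mem::take(&mut self.selectors) {
            if taken >= n {
                rest.push(s);
            } else if taken + s.row_count <= n {
                taken += s.row_count;
                head.push(s);
            } else {
                let first = n - taken;
                head.push(Selector { row_count: first, skip: s.skip });
                rest.push(Selector { row_count: s.row_count - first, skip: s.skip });
                taken = n;
            }
        }
        self.selectors = rest;
        Selection { selectors: head }
    }
}

// Hypothetical per-row-group outcome, mirroring Skip / Scan / Selection.
#[derive(Debug, PartialEq, Eq)]
enum RowGroupAccess {
    Skip,
    Scan,
    Selection(Selection),
}

/// Convert a file-level selection into one access entry per row group.
/// `row_group_rows` holds the row count of each row group in file order.
fn plan_from_file_selection(
    selection: Selection,
    row_group_rows: &[usize],
) -> Result<Vec<RowGroupAccess>, String> {
    let mut remaining = selection;
    let mut plan = Vec::with_capacity(row_group_rows.len());
    for &rg_rows in row_group_rows {
        let group = remaining.split_off(rg_rows);
        let selected = group.row_count();
        let skipped = group.skipped_row_count();
        if selected + skipped != rg_rows {
            return Err("selection covers fewer rows than the file".to_string());
        }
        plan.push(match (selected, skipped) {
            (0, _) => RowGroupAccess::Skip, // nothing selected (or empty group)
            (_, 0) => RowGroupAccess::Scan, // every row selected
            _ => RowGroupAccess::Selection(group),
        });
    }
    // Anything left over means the selection describes more rows than the file has.
    if remaining.row_count() + remaining.skipped_row_count() != 0 {
        return Err("selection covers more rows than the file".to_string());
    }
    Ok(plan)
}
```

With row groups of 5, 5, and 10 rows and a selection of "select 5, skip 10, select 5", this sketch yields Scan, Skip, and a mixed Selection for the last group, with no cursor state carried between iterations beyond the remaining selection itself.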

let plan_len = access_plan.len();
if plan_len != row_group_count {
let row_group_count = rg_metadata.len();
match (


This match might read a little more cleanly if it returned the Result directly instead of using early returns plus a trailing default branch.

Something like:

match (...) {
    (Some(_), Some(_)) => exec_err!(...),
    (Some(access_plan), None) => {
        ...
        Ok(access_plan.clone())
    }
    (None, Some(row_selection)) => ParquetAccessPlan::try_new_from_overall_row_selection(...),
    (None, None) => Ok(ParquetAccessPlan::new_all(row_group_count)),
}

That keeps all extension cases in one expression and avoids the extra fallthrough branch.
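
A compilable version of that shape, as a rough sketch only: AccessPlan, FileRowSelection, plan_from_selection, and initial_plan are hypothetical stand-ins invented here, and the length and row-count checks are simplified placeholders for whatever validation the real code performs.

```rust
// Hypothetical stand-ins; not DataFusion's ParquetAccessPlan or ParquetRowSelection.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AccessPlan {
    row_groups: usize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct FileRowSelection {
    total_rows: usize,
}

/// Placeholder for the selection-to-plan conversion: only checks the row count.
fn plan_from_selection(
    sel: &FileRowSelection,
    row_group_count: usize,
    file_rows: usize,
) -> Result<AccessPlan, String> {
    if sel.total_rows == file_rows {
        Ok(AccessPlan { row_groups: row_group_count })
    } else {
        Err(format!(
            "selection has {} rows but the file has {}",
            sel.total_rows, file_rows
        ))
    }
}

/// Every extension combination is handled by one arm, and the match
/// itself is the function's return value (no early returns).
fn initial_plan(
    access_plan: Option<&AccessPlan>,
    row_selection: Option<&FileRowSelection>,
    row_group_count: usize,
    file_rows: usize,
) -> Result<AccessPlan, String> {
    match (access_plan, row_selection) {
        (Some(_), Some(_)) => {
            Err("both an access plan and a row selection were provided".to_string())
        }
        (Some(plan), None) if plan.row_groups == row_group_count => Ok(plan.clone()),
        (Some(plan), None) => Err(format!(
            "access plan has {} row groups but the file has {}",
            plan.row_groups, row_group_count
        )),
        (None, Some(sel)) => plan_from_selection(sel, row_group_count, file_rows),
        (None, None) => Ok(AccessPlan { row_groups: row_group_count }),
    }
}
```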

}

#[test]
fn create_initial_plan_from_parquet_row_selection_extension() {


Could we add one end-to-end scan test that attaches ParquetRowSelection to a PartitionedFile and asserts the returned rows?

The unit tests cover conversion and mutual exclusion well, but since this is a public extension contract, it would be great to have one test proving the extension survives the path from PartitionedFile into the parquet opener and reader.


Labels

datasource Changes to the datasource crate


Development

Successfully merging this pull request may close these issues.

Support file-level Parquet RowSelection

2 participants