feat: support file-level parquet row selections#22940
Conversation
There was a problem hiding this comment.
@haohuaijin
Thanks for the PR. I do not see any blocking issues, just a few suggestions that could make the implementation a bit simpler and help protect the new extension behavior.
| selection: RowSelection, | ||
| row_group_meta_data: &[RowGroupMetaData], | ||
| ) -> Result<Self> { | ||
| let selectors: Vec<RowSelector> = selection.into(); |
There was a problem hiding this comment.
Nice work here. I think this could be simplified a bit by leaning on RowSelection::split_off, since it already handles the boundary splitting invariant.
One possible shape would be to keep let mut remaining_selection = selection;, then for each row group call let group_selection = remaining_selection.split_off(rg_rows);. From there, derive selected = group_selection.row_count() and skipped = group_selection.skipped_row_count(), then map (selected, skipped) to Skip, Scan, or Selection(group_selection).
After the loop, the total row count check could also be more direct by validating remaining_selection.row_count() + remaining_selection.skipped_row_count() == 0 along with the accumulated file count. That would remove the manual current / leading / mixed cursor state machine.
| let plan_len = access_plan.len(); | ||
| if plan_len != row_group_count { | ||
| let row_group_count = rg_metadata.len(); | ||
| match ( |
There was a problem hiding this comment.
This match might read a little more cleanly if it returned the Result directly instead of using early returns plus a trailing default branch.
Something like:
match (...) {
(Some(_), Some(_)) => exec_err!(...),
(Some(access_plan), None) => {
...
Ok(access_plan.clone())
}
(None, Some(row_selection)) => ParquetAccessPlan::try_new_from_overall_row_selection(...),
(None, None) => Ok(ParquetAccessPlan::new_all(row_group_count)),
}That keeps all extension cases in one expression and avoids the extra fallthrough branch.
| } | ||
|
|
||
| #[test] | ||
| fn create_initial_plan_from_parquet_row_selection_extension() { |
There was a problem hiding this comment.
Could we add one end-to-end scan test that attaches ParquetRowSelection to a PartitionedFile and asserts the returned rows?
The unit tests cover conversion and mutual exclusion well, but since this is a public extension contract, it would be great to have one test proving the extension survives the path from PartitionedFile into the parquet opener and reader.
Which issue does this PR close?
RowSelection#22939Rationale for this change
RowSelection#22939What changes are included in this PR?
ParquetRowSelection.ParquetAccessPlan::try_new_from_overall_row_selection.ParquetAccessPlanorParquetRowSelection.ParquetSource.Are these changes tested?
Yes. This PR adds tests for:
ParquetRowSelectionParquetAccessPlanandParquetRowSelectionon the same fileAre there any user-facing changes?
Yes. This adds a new public
ParquetRowSelectiontype for callers that want to attach a file-level ParquetRowSelectionto aPartitionedFile.