Support null aware hash mark-joins #21585
|
@AdamGS are you planning to continue on this soon? |
|
yes! I have some more changes locally that I haven't pushed. Codex also came up with a test case that I'm trying to figure out if it should be covered by this change or is a different correctness corner that DataFusion currently misses. |
|
LMK when you are ready, I can allocate some time to review |
|
pushed some stuff I have locally, mostly to show a regression test that I think is relevant. |
|
@Dandandan I got distracted by a lot of other stuff, but I think it's getting pretty close. I'm going to spend more time this weekend trying to review it and think through the change with the paper. |
…ntations (apache#22913)

## Which issue does this PR close?

- Closes apache#22912.

## Rationale for this change

This change makes testing null_aware behavior easier, and also makes the performance of various joins clearer, since null_aware joins do extra work. This was originally part of apache#21585, but there seems to be a lot of activity around null-aware joins, so I figured it's worth splitting out.

## What changes are included in this PR?

Add a `null_aware` indication to the relevant Display implementations when appropriate.

## Are these changes tested?

SLT tests

## Are there any user-facing changes?

Display only

---------

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
|
This PR is too big, I'm going to start splitting it off, probably starting with SLT tests. |
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). |
Which issue does this PR close?
Rationale for this change
This change is about correctness and SQL completeness, but it is also a step towards better subquery de-correlation.
Before this change, mark joins only produced boolean `true`/`false` results, so queries such as `(id NOT IN (...)) IS NULL` could return incorrect results, especially for correlated scalar subqueries.
What changes are included in this PR?
`NOT IN` mark joins use null-aware semantics when the predicate can be represented as hash join keys.
I pulled tests from existing tests, a DuckDB issue I ran into, the Story of Joins (in HyPer) paper and some AI generated cases.
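
For readers less familiar with why a two-valued mark is not enough, here is a minimal sketch of SQL's three-valued `NOT IN` semantics. It is not code from this PR or from DataFusion: `not_in`, `is_null`, and `boolean_mark` are hypothetical names, and collapsing NULL to `false` in `boolean_mark` is an assumption made only to illustrate how a boolean-only mark loses information.

```rust
/// SQL three-valued result: Some(true), Some(false), or None for NULL.
type Tri = Option<bool>;

/// Evaluate `probe NOT IN (build...)` under SQL three-valued logic.
/// Hypothetical helper for illustration only.
fn not_in(probe: Option<i64>, build: &[Option<i64>]) -> Tri {
    // An empty subquery result makes NOT IN true, even for a NULL probe.
    if build.is_empty() {
        return Some(true);
    }
    // A NULL probe against a non-empty set is unknown.
    let Some(p) = probe else {
        return None;
    };
    // A definite match makes NOT IN false.
    if build.iter().any(|b| *b == Some(p)) {
        return Some(false);
    }
    // No match, but a NULL on the build side: the answer is unknown.
    if build.iter().any(|b| b.is_none()) {
        return None;
    }
    Some(true)
}

/// `expr IS NULL` applied to a three-valued result.
fn is_null(value: Tri) -> bool {
    value.is_none()
}

/// Assumption for illustration: a mark column that can only hold
/// true/false, with the unknown case collapsed to false.
fn boolean_mark(probe: Option<i64>, build: &[Option<i64>]) -> bool {
    not_in(probe, build).unwrap_or(false)
}
```

With a build side of `[Some(2), None]` and a probe of `Some(1)`, `not_in` yields `None`, so `(id NOT IN (...)) IS NULL` should be true for that row. A plain `bool` mark has no way to carry that state, which is the gap the description above refers to.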
Are these changes tested?
Are there any user-facing changes?
This PR changes planning behavior and introduces more public API around hash joins. The only new public function I could find is the helper function `build_join_schema_with_null_aware`.
AI Usage
AI was used in the process of developing this PR, mostly around testing and planning.