Skip to content

Support null aware hash mark-joins#21585

Draft
AdamGS wants to merge 9 commits into
apache:mainfrom
AdamGS:adamg/mark-join-null-marking
Draft

Support null aware hash mark-joins#21585
AdamGS wants to merge 9 commits into
apache:mainfrom
AdamGS:adamg/mark-join-null-marking

Conversation

@AdamGS

@AdamGS AdamGS commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

This change is about correctness/sql completeness, but is also a step towards better subquery de-correlation.

Before this change, mark joins only produced boolean true / false results, so queries such as (id NOT IN (...)) IS NULL could return incorrect results, especially for correlated scalar subqueries.

What changes are included in this PR?

  1. Adds support for null-aware mark joins.
  2. Make sure queries that joins that require null awareness go through a join implementation that supports that.
  3. Extends decorrelation so scalar NOT IN mark joins use null-aware semantics when the predicate can be represented as hash join keys.

I pulled tests from existing tests, a DuckDB issue I ran into, the Story of Joins (in HyPer) paper and some AI generated cases.

Are these changes tested?

  1. Existing SLT tests that explicitly showed bad results.
  2. New dedicated SLT tests.
  3. New unit tests.

Are there any user-facing changes?

This PR changes planning behavior and introduces more public API around hash joins, the only new functionality is public function I could find is the new helper function build_join_schema_with_null_aware

AI Usage

AI was used in the process of developing this PR, mostly around testing and planning

@AdamGS AdamGS changed the title Adamg/mark join null marking Null aware hash mark joins Apr 13, 2026
@github-actions github-actions Bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Apr 13, 2026
@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from 41c562a to 97fbdcf Compare April 15, 2026 22:49
@Dandandan

Copy link
Copy Markdown
Contributor

@AdamGS are you planning to continue on this soon?

@AdamGS

AdamGS commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

yes! I have some more changes locally that I haven't pushed. Codex also came up with a test case that I'm trying to figure out if it should be covered by this change or is a different correctness corner that DataFusion currently misses.
I'll try and clean up what I have over the weekend and make progress on this.

@Dandandan

Copy link
Copy Markdown
Contributor

LMK when you are ready, I can allocate some time to review

@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch 2 times, most recently from 0a42875 to 359f140 Compare May 9, 2026 21:05
@AdamGS

AdamGS commented May 9, 2026

Copy link
Copy Markdown
Contributor Author

pushed some stuff I have locally, mostly to show a regression test that I think is relevant.
I have more work locally, hoping to have tomorrow to mostly push on this.

@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from 359f140 to 26eec2d Compare May 20, 2026 20:18
@github-actions github-actions Bot added the common Related to common crate label May 20, 2026
@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from 26eec2d to f0811c6 Compare May 29, 2026 23:15
@AdamGS AdamGS changed the title Null aware hash mark joins Support null aware hash mark-joins May 29, 2026
@AdamGS

AdamGS commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

@Dandandan got distracted by a lot of other stuff, but I think its getting pretty close. I'm going to spend more time this weekend trying to review it and think through the change with the paper.

@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from f0811c6 to 4d3f2a2 Compare June 8, 2026 10:06
@AdamGS AdamGS marked this pull request as ready for review June 8, 2026 10:06
@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from 5e5e7b7 to faa515c Compare June 11, 2026 19:03
marvelshan pushed a commit to marvelshan/datafusion that referenced this pull request Jun 16, 2026
…ntations (apache#22913)

## Which issue does this PR close?

- Closes apache#22912.

## Rationale for this change

This change makes testing null_aware behavior easier, and also makes the
performance of various joins clearer - null_aware joins do extra work.

This was originally part apache#21585, but it seems like there is a bunch of
activity around null-aware joins, so I figured its worth splitting out.

## What changes are included in this PR?

Add a `null_aware` indication to relevant Display implementations when
appropriate.

## Are these changes tested?

SLT tests

## Are there any user-facing changes?

Display only

---------

Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
@AdamGS

AdamGS commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

This PR is too big, I'm going to start splitting it off, probably starting with SLT tests.

AdamGS added 2 commits June 23, 2026 12:41
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
AdamGS added 6 commits June 23, 2026 12:41
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
Signed-off-by: Adam Gutglick <adamgsal@gmail.com>
@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from 58d6a7b to 2ed8628 Compare June 23, 2026 12:29
@AdamGS AdamGS force-pushed the adamg/mark-join-null-marking branch from 2ed8628 to 7dceb8a Compare June 23, 2026 12:42
@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 102.568s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.036s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 101.469s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.037s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.960s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 206.544s] datafusion
    Building datafusion-common v54.0.0 (current)
       Built [  31.738s] (current)
     Parsing datafusion-common v54.0.0 (current)
      Parsed [   0.061s] (current)
    Building datafusion-common v54.0.0 (baseline)
       Built [  31.661s] (baseline)
     Parsing datafusion-common v54.0.0 (baseline)
      Parsed [   0.064s] (baseline)
    Checking datafusion-common v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.045s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  65.584s] datafusion-common
    Building datafusion-expr v54.0.0 (current)
       Built [  25.234s] (current)
     Parsing datafusion-expr v54.0.0 (current)
      Parsed [   0.078s] (current)
    Building datafusion-expr v54.0.0 (baseline)
       Built [  25.515s] (baseline)
     Parsing datafusion-expr v54.0.0 (baseline)
      Parsed [   0.078s] (baseline)
    Checking datafusion-expr v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.985s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure function_parameter_count_changed: pub fn parameter count changed ---

Description:
A publicly-visible function now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_parameter_count_changed.ron

Failed in:
  datafusion_expr::logical_plan::builder::build_join_schema now takes 4 parameters instead of 3, in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/builder.rs:1683
  datafusion_expr::builder::build_join_schema now takes 4 parameters instead of 3, in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/builder.rs:1683
  datafusion_expr::logical_plan::build_join_schema now takes 4 parameters instead of 3, in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/builder.rs:1683
  datafusion_expr::build_join_schema now takes 4 parameters instead of 3, in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/builder.rs:1683

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  53.844s] datafusion-expr
    Building datafusion-optimizer v54.0.0 (current)
       Built [  25.589s] (current)
     Parsing datafusion-optimizer v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-optimizer v54.0.0 (baseline)
       Built [  25.540s] (baseline)
     Parsing datafusion-optimizer v54.0.0 (baseline)
      Parsed [   0.033s] (baseline)
    Checking datafusion-optimizer v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.258s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  52.314s] datafusion-optimizer
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  34.395s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.135s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  34.264s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.136s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.925s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure function_parameter_count_changed: pub fn parameter count changed ---

Description:
A publicly-visible function now takes a different number of parameters.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_parameter_count_changed.ron

Failed in:
  datafusion_physical_plan::joins::utils::build_join_schema now takes 4 parameters instead of 3, in /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/joins/utils.rs:265

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  71.055s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 181.721s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 177.129s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.119s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 362.015s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change common Related to common crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mark joins don't support null mark columns

2 participants