PARQUET-2249: Add IEEE-754 total order and nan count for floating types #3393

Open
wgtmac wants to merge 3 commits into apache:master from wgtmac:PARQUET-2249

@wgtmac wgtmac commented Feb 12, 2026

Member

Rationale for this change

PoC implementation for the spec change: apache/parquet-format#514

What changes are included in this PR?

  • Added a new ColumnOrder for IEEE 754 total order.
  • Added comparators for Float, Double and Float16.
  • Added nan_count to statistics and column index for floating types.
  • Added predicate pushdown support for statistics and column index filtering.
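
As an illustration of the comparator idea in the list above, here is a minimal sketch of an IEEE 754 total-order comparison for `float`. It is not taken from this PR: the class and method names (`TotalOrderSketch`, `toOrderedBits`, `compare`) are hypothetical, and the bit trick shown is one common way to implement totalOrder, assumed here for illustration only.

```java
final class TotalOrderSketch {
  private TotalOrderSketch() {}

  // Maps the raw IEEE 754 bits to an int whose signed ordering matches
  // totalOrder: -NaN < -Inf < ... < -0 < +0 < ... < +Inf < +NaN.
  static int toOrderedBits(float value) {
    int bits = Float.floatToRawIntBits(value);
    // When the sign bit is set, flip the 31 magnitude bits so that larger
    // magnitudes sort lower. Non-negative values are left unchanged.
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // Hypothetical comparator: orders floats by IEEE 754 total order.
  static int compare(float a, float b) {
    return Integer.compare(toOrderedBits(a), toOrderedBits(b));
  }
}
```

Under this ordering `compare(-0.0f, 0.0f)` is negative, and a NaN with the sign bit set sorts below negative infinity, which is why min/max statistics no longer need any special handling of zeros.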

Are these changes tested?

Added various test cases for both new metadata and filtering.

Are there any user-facing changes?

Yes. Users can now set the new column order, but it is not used by default.

Closes #406

@wgtmac wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from c01b3f3 to 4b7e86b Compare March 6, 2026 15:27
@wgtmac wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from 133fb4b to 07a4d77 Compare March 14, 2026 15:43

@shangxinli shangxinli left a comment

Contributor

Thanks Gang for picking this up and driving it forward!

+1 on the approach. This combined solution addresses both the ordering ambiguity and the NaN pollution concern pragmatically. Looking forward to seeing the arrow-cpp PoC as well.

etseidl commented Apr 2, 2026

Contributor

@wgtmac thank you for adding the interop test! 🙏 In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0. The latter is to make sure that the old rules, under which a 0 min is written as -0 and a 0 max is written as +0, are no longer followed with the new ordering.

wgtmac commented Apr 3, 2026

Member Author

> In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Did you mean arrow-rs will no longer write floats with the legacy TypeDefinedOrder? From the perspective of interoperability testing, I think this is fine as long as it does not fail when reading files produced by other writers.

> Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0

That's a good suggestion! I have updated the floating-point interop coverage to add explicit ZERO_MIN and ZERO_MAX cases, so we now verify that IEEE-754 total order no longer rewrites +0 min to -0 or -0 max to +0. I also expanded the NaN coverage to include both negative and positive NaN patterns.

While debugging the test, I found that the Java implementation uses Float.floatToIntBits instead of Float.floatToRawIntBits (same for double) which canonicalizes NaN bits and pollutes both values and stats. I fixed the float/double write paths to preserve raw NaN bits instead of canonicalizing them.
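
The difference between the two JDK methods mentioned above can be seen in a small sketch. This is an illustration only, not the PR's write path: the class name, method names, and the sample NaN bit pattern are hypothetical. It assumes a platform that passes quiet NaN payloads through unchanged, which is typical but not guaranteed by the Java specification.

```java
final class NaNBitsSketch {
  private NaNBitsSketch() {}

  // Illustrative quiet NaN with the sign bit set and a non-zero payload.
  static final int NEGATIVE_NAN_BITS = 0xffc00001;

  // Float.floatToIntBits collapses every NaN to the canonical 0x7fc00000.
  static int canonicalBits(float value) {
    return Float.floatToIntBits(value);
  }

  // Float.floatToRawIntBits keeps the sign and payload of the NaN as stored.
  static int rawBits(float value) {
    return Float.floatToRawIntBits(value);
  }
}
```

For `Float.intBitsToFloat(NaNBitsSketch.NEGATIVE_NAN_BITS)`, `canonicalBits` returns `0x7fc00000`, so the sign and payload are lost, while `rawBits` is expected to return the original pattern.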

etseidl commented Apr 3, 2026

Contributor

Thank you. I hope to have the rust tests done today.

wgtmac commented Jun 15, 2026

Member Author

@shangxinli @etseidl Do you want to take a look again? I think now everything is ready on my end. cc @gszadovszky @Fokko

@gszadovszky gszadovszky left a comment

Contributor

Added a couple of comments.

Also, it would be nice to add coverage for predicate filtering with signed NaN values. Since NaN values are excluded from the statistics, this applies only at the value level (ValueInspector). It can be covered at either the unit or the integration level.

Comment thread parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java Outdated

wgtmac commented Jun 17, 2026

Member Author

Thanks @gszadovszky for the review! I think I've addressed all your comments. Let me know what you think.

@gszadovszky

Contributor

Thank you, @wgtmac. One more thing.

This PR also changes how NaN values are encoded in dictionaries. I'm not sure this is tightly related to IEEE-754, and it is a breaking change. Previously, we lost the original NaN value during dictionary encoding and encoded every NaN as the same bit pattern (e.g. 0x7ff8000000000000L for doubles). With this change, we preserve the actual bit pattern, whatever it means. We could call this a bug fix, but maybe we should discuss this behavioral change with the community first.

If we move in that direction, we also need to fix the dictionary filter. We use a boxed HashSet there, so even if the dictionary itself preserves the original bit patterns, the filter does not.
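
To illustrate the boxed HashSet concern, the following sketch shows how `HashSet<Double>` treats NaNs with different bit patterns. It is not the dictionary filter code: the class and method names are hypothetical. It relies on the documented behavior that `Double.equals` and `Double.hashCode` are based on `Double.doubleToLongBits`, which canonicalizes NaN. The raw-bits variant assumes the platform keeps quiet NaN payloads intact.

```java
import java.util.HashSet;
import java.util.Set;

final class BoxedNaNSetSketch {
  private BoxedNaNSetSketch() {}

  // Counts distinct values when doubles are stored boxed. All NaNs compare
  // equal under Double.equals, so their bit patterns collapse to one entry.
  static int distinctBoxed(long... rawBitPatterns) {
    Set<Double> values = new HashSet<>();
    for (long bits : rawBitPatterns) {
      values.add(Double.longBitsToDouble(bits));
    }
    return values.size();
  }

  // Counts distinct values when the set is keyed by raw bits instead, which
  // keeps NaNs with different signs or payloads apart.
  static int distinctRaw(long... rawBitPatterns) {
    Set<Long> values = new HashSet<>();
    for (long bits : rawBitPatterns) {
      values.add(Double.doubleToRawLongBits(Double.longBitsToDouble(bits)));
    }
    return values.size();
  }
}
```

Given the two patterns `0x7ff8000000000000L` and `0xfff8000000000001L`, `distinctBoxed` sees a single element, while `distinctRaw` is expected to see two. Keying by raw bits is one possible way to make a filter bit-pattern aware. Whether that is the right fix here is a separate design question.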

wgtmac commented Jun 17, 2026

Member Author

@gszadovszky That's a good question! I think the main behavior change is that the original NaN bits are now preserved not only in the dictionary but also in the encoded values and the bloom filter. I would regard this as a benign bug fix, but I need some time to think about the current read path, especially the boxed HashSet that you've mentioned...

The dictionary filter and bloom filter on the read path are not aware of column order. Introducing IEEE 754 total order is a breaking change for them anyway, because the raw bits of NaNs must be preserved as is.
