PARQUET-2249: Add IEEE-754 total order and nan count for floating types #3393
wgtmac wants to merge 3 commits
|
@wgtmac thank you for adding the interop test! 🙏 In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns. Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0. The latter is to make sure that the old rules, under which a 0 min is set to -0 and a 0 max is set to +0, are no longer followed with the new ordering.
Did you mean arrow-rs will no longer write floats with the legacy TypeDefinedOrder? From the perspective of the interoperability test, I think this is fine as long as it does not fail when reading files produced by other writers.
That's a good suggestion! I have updated the floating-point interop coverage to add explicit ZERO_MIN and ZERO_MAX cases, so we now verify that IEEE-754 total order no longer rewrites +0 min to -0 or -0 max to +0. I also expanded the NaN coverage to include both negative and positive NaN patterns. While debugging the test, I found that the Java implementation uses |
|
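For readers following the ZERO_MIN / ZERO_MAX and signed-NaN cases discussed above, here is a minimal sketch of an IEEE-754 total order comparison for `float` in plain Java. It is not the code from this PR; the class and method names (`TotalOrderSketch`, `totalOrderKey`, `compare`) are hypothetical and only illustrate the ordering: negative NaN < -Infinity < ... < -0 < +0 < ... < +Infinity < positive NaN.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the parquet-java implementation.
final class TotalOrderSketch {
    // Maps a float to an int whose signed integer order matches IEEE-754 totalOrder.
    static int totalOrderKey(float value) {
        // Raw bits keep the NaN sign and payload (no canonicalization).
        int bits = Float.floatToRawIntBits(value);
        // If the sign bit is set, flip the 31 non-sign bits so that larger
        // magnitudes sort lower; non-negative values are left unchanged.
        return bits ^ ((bits >> 31) >>> 1);
    }

    static int compare(float a, float b) {
        return Integer.compare(totalOrderKey(a), totalOrderKey(b));
    }

    public static void main(String[] args) {
        float negNaN = Float.intBitsToFloat(0xFFC00000); // quiet NaN, sign bit set
        float posNaN = Float.intBitsToFloat(0x7FC00000); // quiet NaN, sign bit clear
        Float[] values = {
            posNaN, 1.0f, 0.0f, -0.0f,
            Float.NEGATIVE_INFINITY, negNaN, Float.POSITIVE_INFINITY, -1.0f
        };
        Arrays.sort(values, TotalOrderSketch::compare);
        for (Float v : values) {
            // Print the raw bits next to the value so the two NaNs and the two zeros are distinguishable.
            System.out.println(Integer.toHexString(Float.floatToRawIntBits(v)) + " " + v);
        }
    }
}
```

Under this ordering -0 and +0 are distinct and ordered, so a min of +0 or a max of -0 can be stored as written instead of being rewritten to the other zero.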
Thank you. I hope to have the Rust tests done today.
|
@shangxinli @etseidl Do you want to take a look again? I think now everything is ready on my end. cc @gszadovszky @Fokko |
gszadovszky
left a comment
Added a couple of comments.
Also, it would be nice to add coverage for predicate filtering with signed NaN values. Since NaN values are excluded from the statistics, this applies only at the value level (ValueInspector). It can be covered at either the unit or the integration level.
|
Thanks @gszadovszky for the review! I think I've addressed all your comments. Let me know what you think. |
|
Thank you, @wgtmac. One more thing. This PR also changes how If we move to that direction, we also need to fix the dictionary filter. We use a boxed |
|
@gszadovszky That's a good question! I think the main behavior change is that the original NaN bits are now preserved not only in the dictionary but also in encoded values and the bloom filter. I would regard this as a benign bug fix, but I need some time to think about the current read path, especially the For dictionary filter and bloom filter on the read path, they are not aware of column order. Introducing IEEE 754 total order is a breaking change for them anyway, because the raw bits of NaNs must be preserved as is.
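To make the "raw bits of NaNs must be preserved" point concrete, the sketch below uses only standard JDK behavior. It does not reproduce the dictionary or bloom filter code in parquet-java, and the class and helper names (`NaNBitsSketch`, `distinctBoxed`, `distinctRaw`) are hypothetical. It shows that boxed `Float` equality and `Float.floatToIntBits` collapse every NaN into one canonical value, while `Float.floatToRawIntBits` keeps the sign and payload.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; it relies solely on documented JDK behavior.
final class NaNBitsSketch {
    static final float NEG_NAN = Float.intBitsToFloat(0xFFC00000); // quiet NaN, sign bit set
    static final float POS_NAN = Float.intBitsToFloat(0x7FC00000); // quiet NaN, sign bit clear

    // Counts distinct values using boxed Float equality, which is defined
    // in terms of floatToIntBits and therefore treats all NaNs as equal.
    static int distinctBoxed(float... values) {
        Set<Float> seen = new HashSet<>();
        for (float v : values) {
            seen.add(v);
        }
        return seen.size();
    }

    // Counts distinct values by raw bit pattern, which keeps NaN sign and payload.
    static int distinctRaw(float... values) {
        Set<Integer> seen = new HashSet<>();
        for (float v : values) {
            seen.add(Float.floatToRawIntBits(v));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("boxed distinct: " + distinctBoxed(NEG_NAN, POS_NAN));
        System.out.println("raw distinct:   " + distinctRaw(NEG_NAN, POS_NAN));
        System.out.println("canonical bits: " + Integer.toHexString(Float.floatToIntBits(NEG_NAN)));
        System.out.println("raw bits:       " + Integer.toHexString(Float.floatToRawIntBits(NEG_NAN)));
    }
}
```

Any read-path structure keyed on boxed values would therefore be unable to tell a negative NaN from a positive one, which is the kind of mismatch the comments above are concerned with.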
Rationale for this change
PoC implementation for the spec change: apache/parquet-format#514
What changes are included in this PR?
Are these changes tested?
Added various test cases for both new metadata and filtering.
Are there any user-facing changes?
Yes. Users can now set the new column order, but it is not used by default.
Closes #406