GH-50184: [C++][Parquet] Avoid reading past truncated statistics values in FormatStatValue#50025
Conversation
| break; | ||
| case Type::BYTE_ARRAY: | ||
| case Type::FIXED_LEN_BYTE_ARRAY: | ||
| if (logical_type != nullptr && logical_type->is_float16()) { |
There was a problem hiding this comment.
Why not dealing with the length of FIXED_LEN_BYTE_ARRAY?
There was a problem hiding this comment.
FormatStatValue doesn't receive the column's type_length, so it can't validate the full FLBA width here. The only fixed-size read for FLBA is the 2-byte float16 load, which is what this case guards. The decimal, string and hex paths all key off val.size() and stay in bounds.
There was a problem hiding this comment.
I admit that it is better than nothing to check types with known sizes. It still looks confusing with this partial support. If we do want to add the check, I think we need to support all types. We can either pass the type instance instead of only the enum or pass the type length to FormatStatValue. The challenge is that FormatStatValue is a public API so we cannot changing its signature without breaking backward compatibility.
There was a problem hiding this comment.
BTW, we need to create an issue for changes like this.
There was a problem hiding this comment.
Fair. Full per-length validation needs the column type_length, and since FormatStatValue is public I left the signature alone, so the guard only covers the types that do an unconditional fixed-width load (BOOLEAN/INT32/INT64/INT96/FLOAT/DOUBLE and the float16 FLBA), which are the ones that actually over-read. The variable-length FLBA (decimal/string) and BYTE_ARRAY paths are already size-driven and stay in bounds. If you'd rather thread the type_length through a new overload to cover everything, I'm happy to go that way.
| case Type::INT96: | ||
| required = 3 * sizeof(int32_t); | ||
| break; | ||
| case Type::BYTE_ARRAY: |
There was a problem hiding this comment.
It is better to put BYTE_ARRAY and UNDEFINED together since they are skipped. In case you don't know, BYTE_ARRAY cannot have float16 logical type.
There was a problem hiding this comment.
Good point. Moved BYTE_ARRAY down with UNDEFINED since it can't carry a float16 logical type and has no fixed-width load here anyway.
| default: | ||
| break; | ||
| } | ||
| if (val.size() < required) { |
There was a problem hiding this comment.
I'm a little hesitant to complicate FormatStatValue by introducing this check. FormatStatValue is usually called in the printer when we want to print some information of a parquet file for debugging purpuse.
There was a problem hiding this comment.
Fair, it's mostly the debug/printer path. The thing is val is taken verbatim from the file's Thrift stats, so a truncated min/max makes the BOOLEAN/INT96/numeric memcpys read past the end of the buffer even there. The guard is just a size check in front of the existing loads, so I tried to keep it minimal. Happy to trim it further if it still feels too heavy.
There was a problem hiding this comment.
I admit that it is better than nothing to check types with known sizes. It still looks confusing with this partial support. If we do want to add the check, I think we need to support all types. We can either pass the type instance instead of only the enum or pass the type length to FormatStatValue. The challenge is that FormatStatValue is a public API so we cannot changing its signature without breaking backward compatibility.
There was a problem hiding this comment.
(I thought I have posted the above comment but it was a draft before I posted it just now)
There was a problem hiding this comment.
switched to clamping each fixed-width read to val.size() rather than the required-length table, so every over-reading path is covered without touching the public signature, and there's no throw since truncated stats are valid.
|
|
|
Filed #50184 for this and retitled the PR to reference it. Also reworked the description into the template sections. |
|
The Parquet spec allows for truncated statistics, so raising an exception sounds wrong here. Just make sure that we're not copying too many bytes from the |
…ngth The Parquet spec allows truncated statistics, so a too-short min/max is valid and must not throw. Bound every fixed-width memcpy/Float16 load in FormatStatValue to val.size() and zero-pad the rest instead of reading past the buffer.
|
@pitrou makes sense, truncated stats are spec-legal so throwing was wrong. reworked it to clamp every fixed-width memcpy/Float16 load in FormatStatValue to val.size(), so a short min/max is zero-padded instead of read past the buffer and nothing throws. the variable-length FLBA (decimal/string) and BYTE_ARRAY paths were already size-driven, so they're untouched. @wgtmac this also drops the partial required-length table you flagged. every read now bounds itself, so there's no half-covered enum check and the public signature stays the same. updated the test and the title/description to match. |
Rationale for this change
FormatStatValueincpp/src/parquet/types.ccdoes fixed-widthmemcpy/byte loads on a statistics value that comes straight out of the file's Thrift-encoded metadata, so its length is attacker controlled. The Parquet spec allows truncated statistics, so amin_value/max_valueshorter than the column's physical type (for example a zero-byte stat on anINT96column) is valid input, but the fixed-width loads still read past the end of the source buffer. It is reachable from the printer/debug path that formats a file's column statistics.What changes are included in this PR?
Each fixed-width read in
FormatStatValueand the helpers it delegates to now clamps its copy toval.size()instead of assuming the full type width is present: theBOOLEAN/INT32/INT64/INT96/FLOAT/DOUBLEnumeric loads, the int32/int64-backed decimal loads, and the float16FIXED_LEN_BYTE_ARRAYload. A truncated value is zero-padded rather than over-read, and no exception is raised since truncated statistics are spec-legal. The variable-lengthFIXED_LEN_BYTE_ARRAY(decimal/string) andBYTE_ARRAYpaths were already keyed offval.size()and are unchanged.Are these changes tested?
Yes.
TypePrinter.StatisticsTypesShortValueintypes_test.ccformats a too-short value for each path and checks it returns without reading past the buffer (caught under ASan).Are there any user-facing changes?
No API change. A truncated statistics value is formatted from the bytes that are present instead of reading out of bounds.
This PR contains a "Critical Fix". It stops an out-of-bounds read on a buffer whose length is controlled by the input Parquet file.