Skip to content

GH-50184: [C++][Parquet] Avoid reading past truncated statistics values in FormatStatValue#50025

Merged
pitrou merged 3 commits into
apache:mainfrom
jmestwa-coder:stat-value-short-bound
Jun 17, 2026
Merged

GH-50184: [C++][Parquet] Avoid reading past truncated statistics values in FormatStatValue#50025
pitrou merged 3 commits into
apache:mainfrom
jmestwa-coder:stat-value-short-bound

Conversation

@jmestwa-coder

@jmestwa-coder jmestwa-coder commented May 23, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

FormatStatValue in cpp/src/parquet/types.cc does fixed-width memcpy/byte loads on a statistics value that comes straight out of the file's Thrift-encoded metadata, so its length is attacker controlled. The Parquet spec allows truncated statistics, so a min_value/max_value shorter than the column's physical type (for example a zero-byte stat on an INT96 column) is valid input, but the fixed-width loads still read past the end of the source buffer. It is reachable from the printer/debug path that formats a file's column statistics.

What changes are included in this PR?

Each fixed-width read in FormatStatValue and the helpers it delegates to now clamps its copy to val.size() instead of assuming the full type width is present: the BOOLEAN/INT32/INT64/INT96/FLOAT/DOUBLE numeric loads, the int32/int64-backed decimal loads, and the float16 FIXED_LEN_BYTE_ARRAY load. A truncated value is zero-padded rather than over-read, and no exception is raised since truncated statistics are spec-legal. The variable-length FIXED_LEN_BYTE_ARRAY (decimal/string) and BYTE_ARRAY paths were already keyed off val.size() and are unchanged.

Are these changes tested?

Yes. TypePrinter.StatisticsTypesShortValue in types_test.cc formats a too-short value for each path and checks it returns without reading past the buffer (caught under ASan).

Are there any user-facing changes?

No API change. A truncated statistics value is formatted from the bytes that are present instead of reading out of bounds.

This PR contains a "Critical Fix". It stops an out-of-bounds read on a buffer whose length is controlled by the input Parquet file.

Comment thread cpp/src/parquet/types.cc Outdated
break;
case Type::BYTE_ARRAY:
case Type::FIXED_LEN_BYTE_ARRAY:
if (logical_type != nullptr && logical_type->is_float16()) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not dealing with the length of FIXED_LEN_BYTE_ARRAY?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FormatStatValue doesn't receive the column's type_length, so it can't validate the full FLBA width here. The only fixed-size read for FLBA is the 2-byte float16 load, which is what this case guards. The decimal, string and hex paths all key off val.size() and stay in bounds.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that it is better than nothing to check types with known sizes. It still looks confusing with this partial support. If we do want to add the check, I think we need to support all types. We can either pass the type instance instead of only the enum or pass the type length to FormatStatValue. The challenge is that FormatStatValue is a public API so we cannot changing its signature without breaking backward compatibility.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we need to create an issue for changes like this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. Full per-length validation needs the column type_length, and since FormatStatValue is public I left the signature alone, so the guard only covers the types that do an unconditional fixed-width load (BOOLEAN/INT32/INT64/INT96/FLOAT/DOUBLE and the float16 FLBA), which are the ones that actually over-read. The variable-length FLBA (decimal/string) and BYTE_ARRAY paths are already size-driven and stay in bounds. If you'd rather thread the type_length through a new overload to cover everything, I'm happy to go that way.

Comment thread cpp/src/parquet/types.cc Outdated
case Type::INT96:
required = 3 * sizeof(int32_t);
break;
case Type::BYTE_ARRAY:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to put BYTE_ARRAY and UNDEFINED together since they are skipped. In case you don't know, BYTE_ARRAY cannot have float16 logical type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Moved BYTE_ARRAY down with UNDEFINED since it can't carry a float16 logical type and has no fixed-width load here anyway.

Comment thread cpp/src/parquet/types.cc Outdated
default:
break;
}
if (val.size() < required) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant to complicate FormatStatValue by introducing this check. FormatStatValue is usually called in the printer when we want to print some information of a parquet file for debugging purpuse.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, it's mostly the debug/printer path. The thing is val is taken verbatim from the file's Thrift stats, so a truncated min/max makes the BOOLEAN/INT96/numeric memcpys read past the end of the buffer even there. The guard is just a size check in front of the existing loads, so I tried to keep it minimal. Happy to trim it further if it still feels too heavy.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that it is better than nothing to check types with known sizes. It still looks confusing with this partial support. If we do want to add the check, I think we need to support all types. We can either pass the type instance instead of only the enum or pass the type length to FormatStatValue. The challenge is that FormatStatValue is a public API so we cannot changing its signature without breaking backward compatibility.

@wgtmac wgtmac Jun 16, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I thought I have posted the above comment but it was a draft before I posted it just now)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou Do you have any suggestion?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to clamping each fixed-width read to val.size() rather than the required-length table, so every over-reading path is covered without touching the public signature, and there's no throw since truncated stats are valid.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 27, 2026
@jmestwa-coder jmestwa-coder changed the title MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue GH-50184: [C++][Parquet] Reject too-short statistics values in FormatStatValue Jun 15, 2026
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50184 has been automatically assigned in GitHub to PR creator.

@jmestwa-coder

Copy link
Copy Markdown
Contributor Author

Filed #50184 for this and retitled the PR to reference it. Also reworked the description into the template sections.

@pitrou

pitrou commented Jun 16, 2026

Copy link
Copy Markdown
Member

The Parquet spec allows for truncated statistics, so raising an exception sounds wrong here. Just make sure that we're not copying too many bytes from the std::string_view .

…ngth

The Parquet spec allows truncated statistics, so a too-short min/max is
valid and must not throw. Bound every fixed-width memcpy/Float16 load in
FormatStatValue to val.size() and zero-pad the rest instead of reading
past the buffer.
@jmestwa-coder jmestwa-coder requested a review from pitrou as a code owner June 16, 2026 15:26
@jmestwa-coder jmestwa-coder changed the title GH-50184: [C++][Parquet] Reject too-short statistics values in FormatStatValue GH-50184: [C++][Parquet] Avoid reading past truncated statistics values in FormatStatValue Jun 16, 2026
@jmestwa-coder

Copy link
Copy Markdown
Contributor Author

@pitrou makes sense, truncated stats are spec-legal so throwing was wrong. reworked it to clamp every fixed-width memcpy/Float16 load in FormatStatValue to val.size(), so a short min/max is zero-padded instead of read past the buffer and nothing throws. the variable-length FLBA (decimal/string) and BYTE_ARRAY paths were already size-driven, so they're untouched.

@wgtmac this also drops the partial required-length table you flagged. every read now bounds itself, so there's no half-covered enum check and the public signature stays the same. updated the test and the title/description to match.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @jmestwa-coder

@pitrou pitrou merged commit 546d42c into apache:main Jun 17, 2026
50 of 53 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants