Skip to content

fix: isolate anonymous file statistics cache#22950

Merged
kosiew merged 2 commits into
apache:mainfrom
kumarUjjawal:fix/project_statistics_bounds
Jun 20, 2026
Merged

fix: isolate anonymous file statistics cache#22950
kosiew merged 2 commits into
apache:mainfrom
kumarUjjawal:fix/project_statistics_bounds

Conversation

@kumarUjjawal

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Anonymous file reads can read the same path with different explicit schemas in the same session. The shared file statistics cache was keyed by table/path metadata, but did not validate that cached statistics matched the schema used to compute them.

This could reuse narrower cached statistics for a later wider schema read and panic during statistics projection.

What changes are included in this PR?

This PR routes anonymous listing table statistics through a per-table cache instead of the shared session cache.

Named tables still use the shared session cache, since their table reference gives the cache a stable identity.

It also adds a regression test that first warms statistics with the physical schema, then reads the same Parquet file with a wider explicit schema.

Are these changes tested?

Yes

Are there any user-facing changes?

No API Change

@github-actions github-actions Bot added core Core DataFusion crate catalog Related to the catalog crate labels Jun 15, 2026

@nuno-faria nuno-faria left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kumarUjjawal, I confirmed that this solves the issue.

However, I'm not sure about having a new file statistics cache for each new anonymous listing table. Right now we will be stuck with the DefaultFileStatisticsCache for anonymous tables, and the file_statistics_cache_limit can be exceeded since we can have multiple caches. I think it would be better to simply not cache anonymous tables that have Specified schemas. If these tables are constantly queried it's probably better to register them anyway.

Maybe @adriangb has some thoughts on this.

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumarUjjawal
Thanks for working on this fix. I think the current approach solves the immediate cache collision issue, but it introduces a broader caching behavior change that can bypass the intended runtime cache limits.

Comment thread datafusion/catalog-listing/src/table.rs Outdated
/// Anonymous tables do not have a stable table id in the shared cache key
/// and may read the same path with different explicit schemas. Use this
/// cache for those tables rather than populating the shared session cache.
local_statistics_cache: Arc<dyn FileStatisticsCache>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this fixes the anonymous ListingTable statistics reuse issue, but it does so by giving each anonymous table its own DefaultFileStatisticsCache.

That seems to bypass the intended global datafusion.runtime.file_statistics_cache_limit, since every anonymous table gets a separate cache and with_cache copies the full shared cache limit into each instance.

The invariant that was failing appears to be narrower: anonymous reads with SchemaSource::Specified should not reuse statistics that were computed for the same path under a different schema.

Could we avoid caching entirely for anonymous specified-schema tables instead? Registered tables could continue using the shared cache through their table reference, and anonymous inferred-schema reads could still share statistics by path when the schema is inferred consistently.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good Thanks for the feedback.

@adriangb

Copy link
Copy Markdown
Contributor

I agree that between giving all anonymous tables their own limit and not caching them not caching them is preferable behavior today.

If the anonymous scan is only used for 1 query, in what scenario are the cached statistics useful?

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumarUjjawal

Thanks for the iteration.

Looks 👍 to me

@nuno-faria nuno-faria left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kumarUjjawal, I just leave some additional minor suggestions below.

self
}

fn statistics_cache(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a small comment explaining why we discard the cache for future reference?

}

#[tokio::test]
async fn anonymous_specified_schema_skips_session_statistics_cache() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test needed? Seems very similar to anonymous_parquet_stats_cache_with_explicit_wider_schema.

@kumarUjjawal kumarUjjawal force-pushed the fix/project_statistics_bounds branch from 912789a to b62294f Compare June 19, 2026 05:34
@kosiew kosiew added this pull request to the merge queue Jun 20, 2026
Merged via the queue into apache:main with commit 88273eb Jun 20, 2026
57 of 58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

catalog Related to the catalog crate core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

panic: ProjectionExprs::project_statistics index out of bounds

4 participants