fix: isolate anonymous file statistics cache by kumarUjjawal · Pull Request #22950 · apache/datafusion

kumarUjjawal · 2026-06-15T05:01:40Z

Which issue does this PR close?

Closes #22935 (panic: ProjectionExprs::project_statistics index out of bounds).

Rationale for this change

Anonymous file reads can read the same path with different explicit schemas in the same session. The shared file statistics cache was keyed by table/path metadata, but did not validate that cached statistics matched the schema used to compute them.

This could reuse narrower cached statistics for a later wider schema read and panic during statistics projection.
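The failure mode can be illustrated with a toy model. This is only a sketch, not DataFusion code: `ToyStatistics`, `PathKeyedCache`, and `project` are hypothetical names, and column statistics are reduced to a vector of null counts. It assumes the cache key carries the path but not the schema, as described above.

```rust
use std::collections::HashMap;

/// Toy stand-in for per-file statistics: one null count per column of the
/// schema that was in effect when the statistics were computed.
#[derive(Clone, Debug, PartialEq)]
struct ToyStatistics {
    null_counts: Vec<usize>,
}

/// Hypothetical cache keyed by path only. Nothing records which schema
/// produced a cached entry, so a later read cannot detect a mismatch.
#[derive(Default)]
struct PathKeyedCache {
    entries: HashMap<String, ToyStatistics>,
}

impl PathKeyedCache {
    /// Returns the cached statistics for `path`, computing them for a schema
    /// of `schema_width` columns only on the first request.
    fn get_or_compute(&mut self, path: &str, schema_width: usize) -> ToyStatistics {
        self.entries
            .entry(path.to_string())
            .or_insert_with(|| ToyStatistics {
                null_counts: vec![0; schema_width],
            })
            .clone()
    }
}

/// Projects statistics onto the given column indices. Returns `None` where an
/// unchecked `stats.null_counts[i]` would panic with an out-of-bounds index.
fn project(stats: &ToyStatistics, columns: &[usize]) -> Option<Vec<usize>> {
    columns
        .iter()
        .map(|&i| stats.null_counts.get(i).copied())
        .collect()
}
```

Warming the cache with a two-column schema and then requesting the same path with a three-column schema returns the two-column entry, so projecting column index 2 has nothing to read.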

What changes are included in this PR?

This PR routes anonymous listing table statistics through a per-table cache instead of the shared session cache.

Named tables still use the shared session cache, since their table reference gives the cache a stable identity.

It also adds a regression test that first warms statistics with the physical schema, then reads the same Parquet file with a wider explicit schema.

Are these changes tested?

Yes

Are there any user-facing changes?

No API Change

nuno-faria

Thanks @kumarUjjawal, I confirmed that this solves the issue.

However, I'm not sure about having a new file statistics cache for each new anonymous listing table. Right now we will be stuck with the DefaultFileStatisticsCache for anonymous tables, and the file_statistics_cache_limit can be exceeded since we can have multiple caches. I think it would be better to simply not cache anonymous tables that have Specified schemas. If these tables are constantly queried it's probably better to register them anyway.

Maybe @adriangb has some thoughts on this.

kosiew

@kumarUjjawal
Thanks for working on this fix. I think the current approach solves the immediate cache collision issue, but it introduces a broader caching behavior change that can bypass the intended runtime cache limits.

kosiew · 2026-06-17T11:01:59Z

+    /// Anonymous tables do not have a stable table id in the shared cache key
+    /// and may read the same path with different explicit schemas. Use this
+    /// cache for those tables rather than populating the shared session cache.
+    local_statistics_cache: Arc<dyn FileStatisticsCache>,


I think this fixes the anonymous ListingTable statistics reuse issue, but it does so by giving each anonymous table its own DefaultFileStatisticsCache.

That seems to bypass the intended global datafusion.runtime.file_statistics_cache_limit, since every anonymous table gets a separate cache and with_cache copies the full shared cache limit into each instance.
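A rough sketch of why per-table caches defeat a global limit, using abstract size units. `ToyLimitedCache` and `total_used` are hypothetical and do not mirror the real `DefaultFileStatisticsCache`; the only assumption taken from the comment above is that each new cache receives the full configured limit.

```rust
/// Hypothetical size-limited cache. `limit` stands in for the configured
/// file statistics cache limit; sizes are abstract units.
struct ToyLimitedCache {
    limit: usize,
    used: usize,
}

impl ToyLimitedCache {
    fn with_limit(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Admits an entry only if it fits under this cache's own limit.
    fn try_insert(&mut self, size: usize) -> bool {
        if self.used + size <= self.limit {
            self.used += size;
            true
        } else {
            false
        }
    }
}

/// Total size held across a set of caches.
fn total_used(caches: &[ToyLimitedCache]) -> usize {
    caches.iter().map(|c| c.used).sum()
}
```

With a limit of 100 units, one shared cache admits a single 60-unit entry out of three, while three per-table caches each admit one and together hold 180 units, which is above the intended limit.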

The invariant that was failing appears to be narrower: anonymous reads with SchemaSource::Specified should not reuse statistics that were computed for the same path under a different schema.

Could we avoid caching entirely for anonymous specified-schema tables instead? Registered tables could continue using the shared cache through their table reference, and anonymous inferred-schema reads could still share statistics by path when the schema is inferred consistently.
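The proposed rule could be sketched as a small decision function. This is an illustration under assumptions, not the merged implementation: `ToySchemaSource`, `CacheChoice`, and `choose_cache` are hypothetical names, and a table is treated as anonymous when it has no table reference.

```rust
/// Hypothetical mirror of how a table's schema was obtained.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ToySchemaSource {
    Inferred,
    Specified,
}

/// Which statistics cache, if any, a scan should use.
#[derive(Debug, PartialEq)]
enum CacheChoice {
    SharedSession,
    Skip,
}

/// Anonymous tables with an explicitly specified schema skip the cache,
/// because the same path may be read again under a different schema.
/// Registered tables and anonymous inferred-schema reads keep using the
/// shared session cache.
fn choose_cache(table_ref: Option<&str>, schema_source: ToySchemaSource) -> CacheChoice {
    match (table_ref, schema_source) {
        (None, ToySchemaSource::Specified) => CacheChoice::Skip,
        _ => CacheChoice::SharedSession,
    }
}
```

Only one of the four combinations opts out of caching, so the global cache limit keeps applying everywhere else.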

kumarUjjawal · Jun 17, 2026

Sounds good. Thanks for the feedback.

adriangb · 2026-06-17T11:53:31Z

I agree that, between giving all anonymous tables their own limit and not caching them, not caching them is the preferable behavior today.

If the anonymous scan is only used for 1 query, in what scenario are the cached statistics useful?

kosiew

@kumarUjjawal

Thanks for the iteration.

Looks 👍 to me

nuno-faria

Thanks @kumarUjjawal, I've just left some additional minor suggestions below.

nuno-faria · 2026-06-18T18:44:44Z

        self
    }

+    fn statistics_cache(


Could you please add a small comment explaining why we discard the cache for future reference?

nuno-faria · 2026-06-18T18:47:01Z

+}
+
+#[tokio::test]
+async fn anonymous_specified_schema_skips_session_statistics_cache() {


Is this test needed? Seems very similar to anonymous_parquet_stats_cache_with_explicit_wider_schema.

github-actions bot added the labels core (Core DataFusion crate) and catalog (Related to the catalog crate) on Jun 15, 2026

nuno-faria reviewed Jun 16, 2026


kosiew requested changes Jun 17, 2026


kosiew approved these changes Jun 18, 2026


nuno-faria approved these changes Jun 18, 2026


kumarUjjawal added 2 commits June 19, 2026 10:55

fix: isolate anonymous file statistics cache

e6fea16

skip cache

b62294f

kumarUjjawal force-pushed the fix/project_statistics_bounds branch from 912789a to b62294f Compare June 19, 2026 05:34

kosiew added this pull request to the merge queue Jun 20, 2026

Merged via the queue into apache:main with commit 88273eb Jun 20, 2026
57 of 58 checks passed

kosiew mentioned this pull request Jun 22, 2026

Make file-statistics cache keys schema-aware #23072 (Open)

kosiew merged 2 commits into apache:main from kumarUjjawal:fix/project_statistics_bounds
