fix: isolate anonymous file statistics cache#22950
Conversation
nuno-faria
left a comment
There was a problem hiding this comment.
Thanks @kumarUjjawal, I confirmed that this solves the issue.
However, I'm not sure about having a new file statistics cache for each new anonymous listing table. Right now we will be stuck with the DefaultFileStatisticsCache for anonymous tables, and the file_statistics_cache_limit can be exceeded since we can have multiple caches. I think it would be better to simply not cache anonymous tables that have Specified schemas. If these tables are constantly queried it's probably better to register them anyway.
Maybe @adriangb has some thoughts on this.
There was a problem hiding this comment.
@kumarUjjawal
Thanks for working on this fix. I think the current approach solves the immediate cache collision issue, but it introduces a broader caching behavior change that can bypass the intended runtime cache limits.
| /// Anonymous tables do not have a stable table id in the shared cache key | ||
| /// and may read the same path with different explicit schemas. Use this | ||
| /// cache for those tables rather than populating the shared session cache. | ||
| local_statistics_cache: Arc<dyn FileStatisticsCache>, |
There was a problem hiding this comment.
I think this fixes the anonymous ListingTable statistics reuse issue, but it does so by giving each anonymous table its own DefaultFileStatisticsCache.
That seems to bypass the intended global datafusion.runtime.file_statistics_cache_limit, since every anonymous table gets a separate cache and with_cache copies the full shared cache limit into each instance.
The invariant that was failing appears to be narrower: anonymous reads with SchemaSource::Specified should not reuse statistics that were computed for the same path under a different schema.
Could we avoid caching entirely for anonymous specified-schema tables instead? Registered tables could continue using the shared cache through their table reference, and anonymous inferred-schema reads could still share statistics by path when the schema is inferred consistently.
There was a problem hiding this comment.
Sounds good Thanks for the feedback.
|
I agree that between giving all anonymous tables their own limit and not caching them not caching them is preferable behavior today. If the anonymous scan is only used for 1 query, in what scenario are the cached statistics useful? |
nuno-faria
left a comment
There was a problem hiding this comment.
Thanks @kumarUjjawal, I just leave some additional minor suggestions below.
| self | ||
| } | ||
|
|
||
| fn statistics_cache( |
There was a problem hiding this comment.
Could you please add a small comment explaining why we discard the cache for future reference?
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn anonymous_specified_schema_skips_session_statistics_cache() { |
There was a problem hiding this comment.
Is this test needed? Seems very similar to anonymous_parquet_stats_cache_with_explicit_wider_schema.
912789a to
b62294f
Compare
Which issue does this PR close?
ProjectionExprs::project_statisticsindex out of bounds #22935.Rationale for this change
Anonymous file reads can read the same path with different explicit schemas in the same session. The shared file statistics cache was keyed by table/path metadata, but did not validate that cached statistics matched the schema used to compute them.
This could reuse narrower cached statistics for a later wider schema read and panic during statistics projection.
What changes are included in this PR?
This PR routes anonymous listing table statistics through a per-table cache instead of the shared session cache.
Named tables still use the shared session cache, since their table reference gives the cache a stable identity.
It also adds a regression test that first warms statistics with the physical schema, then reads the same Parquet file with a wider explicit schema.
Are these changes tested?
Yes
Are there any user-facing changes?
No API Change