fix(fts): enforce required terms for and queries#7385
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Xuanwo
left a comment
There was a problem hiding this comment.
Two issues need changes before merging. Please add regression coverage for multi-token max_expansions and grouped fuzzy AND top-k/resource-bound behavior.
| let part_for_wand = part.clone(); | ||
| let mut partition_result = spawn_cpu(move || { | ||
| let has_grouped_expansions = !grouped_expansions.is_empty(); | ||
| let wand_params = if has_grouped_expansions { |
There was a problem hiding this comment.
With grouped fuzzy AND, this clears the requested limit before WAND runs. Since WAND treats None as usize::MAX, a limit=1 query can still materialize every matching candidate in each partition and resolve all deferred row ids before the outer heap trims results, which removes the resource bound users expect from top-k search.
There was a problem hiding this comment.
Addressed in afef607: grouped fuzzy AND no longer clears the WAND limit to None; it now uses bounded oversampling based on grouped expansion terms while keeping final matched-expansion rescoring. Added regression coverage in test_fuzzy_and_grouped_rescore_keeps_wand_limit_bounded.
|
|
||
| let base_len = tokens.token_type().prefix_len(token) as u32; | ||
| if let TokenMap::Fst(ref map) = self.tokens.tokens { | ||
| let mut expanded = Vec::new(); |
There was a problem hiding this comment.
This makes max_expansions apply separately to each query token. The previous implementation accumulated expansions in one vector, so the same cap applied to the whole fuzzy query; multi-token fuzzy queries can now expand to tokens.len() * max_expansions terms, changing recall, scoring, and posting IO for existing queries.
There was a problem hiding this comment.
Addressed in afef607: fuzzy expansion now applies max_expansions across the whole query again while preserving original token positions for grouped fuzzy AND. Added regression coverage in test_fuzzy_expansion_cap_applies_to_whole_query.
Bug Fix
What is the bug?
FTS
ANDqueries could return matches from a partition that only contained a subset of the required query terms. For fuzzyAND, expansions were also flattened without preserving the original query-position grouping, so missing required positions and same-position expansion scoring could produce incorrect results.What issues or incorrect behavior does the bug cause?
A query such as
alpha AND betacould return rows from a partition that only hadalphabecause the missing term was skipped before WAND saw the query. FuzzyANDcould also treat expansions from one original position as separate required terms, or score grouped expansions using the wrong token IDF, which could affect top-k ordering.How does this PR fix the problem?
This PR makes partition posting-list loading aware of the query operator. For
AND, a partition now returns empty results when any required original position has no exact term or fuzzy expansion. For fuzzyAND, expansions are grouped by original query position, same-position expansions are unioned for candidate selection, and final scoring uses the actual matched expansion token frequencies.Validation
cargo fmt --all --checkgit diff --checkCARGO_TARGET_DIR=... cargo test -p lance-index test_fuzzy_and_scores_grouped_expansions_by_matched_token -- --nocaptureCARGO_TARGET_DIR=... cargo test -p lance-index test_and_query -- --nocaptureCARGO_TARGET_DIR=... cargo test -p lance-index test_fuzzy_and_groups_expansions_by_original_position -- --nocaptureCARGO_TARGET_DIR=... cargo test -p lance-index bm25_search -- --nocaptureCARGO_TARGET_DIR=... cargo test -p lance-index scalar::inverted::wand::tests -- --nocapture