fix(fts): enforce required terms for and queries #7385

Merged
BubbleCal merged 4 commits into main from yang/fix-fts-and-required-terms on Jun 22, 2026

Conversation

@BubbleCal

Contributor

Bug Fix

What is the bug?

FTS AND queries could return matches from a partition that only contained a subset of the required query terms. For fuzzy AND, expansions were also flattened without preserving the original query-position grouping, so missing required positions and same-position expansion scoring could produce incorrect results.

What issues or incorrect behavior does the bug cause?

A query such as alpha AND beta could return rows from a partition that only had alpha because the missing term was skipped before WAND saw the query. Fuzzy AND could also treat expansions from one original position as separate required terms, or score grouped expansions using the wrong token IDF, which could affect top-k ordering.
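
To make the failure mode concrete, here is a minimal sketch using a toy in-memory partition rather than the real lance-index types. The `Partition` alias and both function names are hypothetical. The first function mimics the old behavior of dropping terms the partition lacks; the second treats every term as required.

```rust
use std::collections::HashMap;

/// Toy stand-in for one partition: token -> sorted row ids (hypothetical layout).
type Partition = HashMap<String, Vec<u64>>;

/// Intersect two sorted row-id lists.
fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().copied().filter(|id| b.binary_search(id).is_ok()).collect()
}

/// Buggy shape: terms missing from the partition are skipped, so the AND
/// degenerates to an intersection over whatever terms happen to be present.
fn and_skipping_missing(part: &Partition, terms: &[&str]) -> Vec<u64> {
    let mut lists = terms.iter().filter_map(|t| part.get(*t));
    let Some(first) = lists.next() else { return Vec::new() };
    lists.fold(first.clone(), |acc, l| intersect(&acc, l))
}

/// Fixed shape: a missing required term empties this partition's result.
fn and_required(part: &Partition, terms: &[&str]) -> Vec<u64> {
    let mut acc: Option<Vec<u64>> = None;
    for t in terms {
        let Some(list) = part.get(*t) else { return Vec::new() };
        acc = Some(match acc {
            None => list.clone(),
            Some(prev) => intersect(&prev, list),
        });
    }
    acc.unwrap_or_default()
}
```

With a partition that only contains alpha, `and_skipping_missing` returns every alpha row for alpha AND beta, while `and_required` returns nothing.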

How does this PR fix the problem?

This PR makes partition posting-list loading aware of the query operator. For AND, a partition now returns empty results when any required original position has no exact term or fuzzy expansion. For fuzzy AND, expansions are grouped by original query position, same-position expansions are unioned for candidate selection, and final scoring uses the actual matched expansion token frequencies.
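
One way to picture the grouped fuzzy AND behavior described above is the sketch below. It is an illustration under assumptions, not the lance-index implementation: the `Partition` alias and function names are hypothetical, the IDF formula is a common BM25 smoothing chosen for the example, and the score ignores term frequency and length normalization.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy partition: token -> row ids containing it (hypothetical layout).
type Partition = HashMap<String, Vec<u64>>;

/// `groups[i]` holds the expansions of the i-th original query position.
/// Expansions within a position are unioned; positions are intersected.
fn grouped_and_candidates(part: &Partition, groups: &[Vec<&str>]) -> BTreeSet<u64> {
    let mut result: Option<BTreeSet<u64>> = None;
    for group in groups {
        let union: BTreeSet<u64> = group
            .iter()
            .filter_map(|t| part.get(*t))
            .flatten()
            .copied()
            .collect();
        // A required position with no exact term or expansion empties the AND.
        if union.is_empty() {
            return BTreeSet::new();
        }
        result = Some(match result {
            None => union,
            Some(acc) => acc.intersection(&union).copied().collect(),
        });
    }
    result.unwrap_or_default()
}

/// Smoothed BM25-style IDF (assumed formula, for illustration only).
fn idf(num_docs: f64, doc_freq: f64) -> f64 {
    ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln()
}

/// Score one row for one position using only the expansions that
/// actually occur in that row, each with its own document frequency.
fn position_score(part: &Partition, group: &[&str], row: u64, num_docs: f64) -> f64 {
    group
        .iter()
        .filter_map(|t| part.get(*t))
        .filter(|rows| rows.contains(&row))
        .map(|rows| idf(num_docs, rows.len() as f64))
        .sum()
}
```

In this sketch a row matched through a rarer expansion gets a higher position score than a row matched through a more common one, which is why scoring with the wrong token's IDF can reorder the top-k.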

Validation

  • cargo fmt --all --check
  • git diff --check
  • CARGO_TARGET_DIR=... cargo test -p lance-index test_fuzzy_and_scores_grouped_expansions_by_matched_token -- --nocapture
  • CARGO_TARGET_DIR=... cargo test -p lance-index test_and_query -- --nocapture
  • CARGO_TARGET_DIR=... cargo test -p lance-index test_fuzzy_and_groups_expansions_by_original_position -- --nocapture
  • CARGO_TARGET_DIR=... cargo test -p lance-index bm25_search -- --nocapture
  • CARGO_TARGET_DIR=... cargo test -p lance-index scalar::inverted::wand::tests -- --nocapture

@github-actions bot added the A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) labels on Jun 22, 2026
@codecov

codecov Bot commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 90.68966% with 54 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/inverted/index.rs | 91.16% | 43 Missing and 5 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/wand.rs | 83.78% | 6 Missing ⚠️ |

@github-actions bot added the A-python (Python bindings) label on Jun 22, 2026
@BubbleCal marked this pull request as ready for review on June 22, 2026 at 13:57

@Xuanwo left a comment

Collaborator

Two issues need changes before merging. Please add regression coverage for multi-token max_expansions and grouped fuzzy AND top-k/resource-bound behavior.

let part_for_wand = part.clone();
let mut partition_result = spawn_cpu(move || {
    let has_grouped_expansions = !grouped_expansions.is_empty();
    let wand_params = if has_grouped_expansions {

Collaborator

With grouped fuzzy AND, this clears the requested limit before WAND runs. Since WAND treats None as usize::MAX, a limit=1 query can still materialize every matching candidate in each partition and resolve all deferred row ids before the outer heap trims results, which removes the resource bound users expect from top-k search.

Contributor Author

Addressed in afef607: grouped fuzzy AND no longer clears the WAND limit to None; it now uses bounded oversampling based on grouped expansion terms while keeping final matched-expansion rescoring. Added regression coverage in test_fuzzy_and_grouped_rescore_keeps_wand_limit_bounded.
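
As a rough illustration of the idea (not the code from afef607), the sketch below widens the requested limit instead of dropping it, then trims after rescoring. The helper names and the choice of multiplier (the number of grouped expansion terms) are assumptions made for the example; the only behavior taken from this thread is that a `None` limit is treated as `usize::MAX`.

```rust
/// How a missing limit is interpreted, per the review comment above.
fn effective_limit(limit: Option<usize>) -> usize {
    limit.unwrap_or(usize::MAX)
}

/// Hypothetical oversampling rule: scale the requested limit by the number
/// of grouped expansion terms rather than clearing it to `None`.
fn oversampled_limit(requested: Option<usize>, grouped_terms: usize) -> Option<usize> {
    requested.map(|k| k.saturating_mul(grouped_terms.max(1)))
}

/// After rescoring the oversampled candidates, keep only the best `k`.
fn rescore_top_k(mut rescored: Vec<(u64, f32)>, k: usize) -> Vec<(u64, f32)> {
    // Sort by score, highest first.
    rescored.sort_by(|a, b| b.1.total_cmp(&a.1));
    rescored.truncate(k);
    rescored
}
```

Under this rule a limit=1 query with four grouped expansion terms asks WAND for at most four candidates per partition, so the work stays bounded while the final rescoring still has room to reorder results.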


let base_len = tokens.token_type().prefix_len(token) as u32;
if let TokenMap::Fst(ref map) = self.tokens.tokens {
    let mut expanded = Vec::new();

Collaborator

This makes max_expansions apply separately to each query token. The previous implementation accumulated expansions in one vector, so the same cap applied to the whole fuzzy query; multi-token fuzzy queries can now expand to tokens.len() * max_expansions terms, changing recall, scoring, and posting IO for existing queries.

Contributor Author

Addressed in afef607: fuzzy expansion now applies max_expansions across the whole query again while preserving original token positions for grouped fuzzy AND. Added regression coverage in test_fuzzy_expansion_cap_applies_to_whole_query.
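
A minimal sketch of a whole-query cap that still remembers which original position each expansion came from. This is an assumption-laden illustration, not the lance-index code: the function names are hypothetical, the vocabulary is a plain slice instead of an FST, the fuzzy matcher only allows one substitution, and the first-come, first-served budget allocation is a guess.

```rust
/// Toy fuzzy matcher: same length and at most one differing byte.
fn within_one_substitution(a: &str, b: &str) -> bool {
    a.len() == b.len() && a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count() <= 1
}

/// Expand every query token against `vocab` while sharing one budget.
/// Returns (original_position, expanded_token) pairs, at most
/// `max_expansions` in total across the whole query.
fn expand_with_shared_cap<'a>(
    query_tokens: &[&str],
    vocab: &'a [&'a str],
    max_expansions: usize,
) -> Vec<(usize, &'a str)> {
    let mut out = Vec::new();
    for (pos, q) in query_tokens.iter().enumerate() {
        for cand in vocab {
            // The cap is checked against the shared output, not per token.
            if out.len() >= max_expansions {
                return out;
            }
            if within_one_substitution(q, cand) {
                out.push((pos, *cand));
            }
        }
    }
    out
}
```

With a per-token cap the same query could produce up to `tokens.len() * max_expansions` pairs; here the total never exceeds `max_expansions`, and the position tag is what lets grouped fuzzy AND rebuild its groups afterwards.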

@Xuanwo left a comment

Collaborator

LGTM, thank you!

@BubbleCal merged commit 2b1b100 into main on Jun 22, 2026
41 of 42 checks passed
@BubbleCal deleted the yang/fix-fts-and-required-terms branch on June 22, 2026 at 15:58