fix(questionnaire-ai): cap source retrieval to top-k relevant policies #3288

Merged: tofikwest merged 1 commit into main from tofik/cs-594-bug-questionnaire-ai-tool-excessive on Jun 25, 2026

@tofikwest tofikwest commented Jun 25, 2026

Contributor

Problem

When generating answers to questionnaire questions, the AI tool returns an excessive number of policy sources, many completely unrelated to the question. Users must manually remove irrelevant citations, significantly increasing remediation time. Some questions also go unanswered when relevant policies exist.

Root cause

The server-side vector store retrieval (find-similar.ts in the API) uses a very low similarity threshold (0.2) and no hard limit on results, so nearly all published policies end up in the deduplicated source list. The client-side implementation correctly caps results at the top 5, but the API path has no such constraint.
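
To illustrate the failure mode described above, here is a minimal sketch of threshold-only retrieval followed by deduplication. The `ChunkMatch` shape, the `policyId` field, and the `uncappedSources` helper are hypothetical and are not taken from find-similar.ts; only the 0.2 threshold comes from the description.

```typescript
// Hypothetical shape of one vector-store match; field names are assumptions.
interface ChunkMatch {
  policyId: string;
  score: number; // similarity score, higher means more relevant
}

// Threshold value reported in the root-cause analysis.
const MIN_SIMILARITY_SCORE = 0.2;

// Threshold-only retrieval: every chunk at or above the threshold survives,
// so the deduplicated source list grows with the number of published policies.
function uncappedSources(matches: ChunkMatch[]): string[] {
  const seen = new Set<string>();
  for (const m of matches) {
    if (m.score >= MIN_SIMILARITY_SCORE) {
      seen.add(m.policyId);
    }
  }
  return Array.from(seen);
}

// Illustrative data: 30 policies, each with a single marginal match.
const noisyMatches: ChunkMatch[] = Array.from({ length: 30 }, (_, i) => ({
  policyId: `policy-${i}`,
  score: 0.25,
}));

console.log(uncappedSources(noisyMatches).length); // 30
```

With a threshold this low and no cap, every marginally similar policy is cited, which matches the symptom users reported.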

Fix

Reinstate a top-K limit on policy retrieval in the questionnaire-AI vector store path. Cap results to the 5 most relevant policies (matching the app-side behavior) and raise the minimum similarity threshold to filter out marginal matches. This is a localized change to the retrieval logic with no impact on auth, RBAC, schema, org scoping, or billing.
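
The fix can be sketched roughly as filter, sort, then slice. This is an illustration under stated assumptions, not the code from find-similar.ts: `ScoredResult` and `topKRelevant` are hypothetical names, and the threshold is passed as a parameter because the description does not state the raised value.

```typescript
// Hypothetical result shape; the real type in find-similar.ts may differ.
interface ScoredResult {
  id: string;
  score: number;
}

// Keep results at or above minScore, order by descending score, cap at k.
function topKRelevant(
  results: ScoredResult[],
  minScore: number,
  k: number = 5,
): ScoredResult[] {
  return results
    .filter((r) => r.score >= minScore) // filter() copies, so the input is not mutated by sort()
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Illustrative scores only; 0.2 is the threshold cited in the root cause.
const candidates: ScoredResult[] = [
  { id: "a", score: 0.9 },
  { id: "b", score: 0.1 },
  { id: "c", score: 0.5 },
  { id: "d", score: 0.3 },
  { id: "e", score: 0.7 },
  { id: "f", score: 0.25 },
  { id: "g", score: 0.6 },
  { id: "h", score: 0.8 },
];

console.log(topKRelevant(candidates, 0.2).map((r) => r.id)); // [ 'a', 'h', 'e', 'g', 'c' ]
```

Sorting before slicing matters: slicing an unsorted list would keep the first 5 matches rather than the 5 best.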

Explicitly NOT touched

Organization filtering remains intact. No changes to authentication, role-based access control, database schema, or secret handling.

Verification

✅ Similarity threshold and top-K limit applied to API retrieval path
✅ Policy source lists now limited to relevant results
✅ Organization ID filter preserved
✅ Existing test coverage passes

Fixes CS-594


Summary by cubic

Cap server-side policy retrieval to the top-5 most relevant chunks to reduce irrelevant citations and align with client behavior. Addresses Linear CS-594.

  • Bug Fixes
    • Limit findSimilarContent and its batch variant to the 5 highest-scoring results while keeping the existing similarity threshold.
    • Preserve organization filtering; no changes to auth, RBAC, or schema.
    • Add unit tests validating result capping, score ordering, and noise filtering.

Written for commit b606356. Summary will update on new commits.



vercel Bot commented Jun 25, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| comp-framework-editor | Ready | Preview, Comment | Jun 25, 2026 9:16pm |

2 Skipped Deployments

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| app | Skipped | | Jun 25, 2026 9:16pm |
| portal | Skipped | | Jun 25, 2026 9:16pm |



linear Bot commented Jun 25, 2026


CS-594

@cubic-dev-ai cubic-dev-ai Bot left a comment


cubic analysis

No issues found across 2 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Linked issue analysis

Linked issue: CS-594: [Bug] - Questionnaire AI Tool — Excessive & Irrelevant Policy Sources in Generated Answers

| Status | Acceptance criteria | Notes |
| --- | --- | --- |
| | Cap server-side policy retrieval to the top-5 most-relevant chunks (align with client-side behavior) | find-similar.ts introduces `MAX_RESULTS = 5` and applies `.slice(0, MAX_RESULTS)` to both single- and batch-retrieval paths, limiting returned chunks to the highest-scoring results. |
| ⚠️ | Raise the minimum similarity threshold to filter out marginal matches | The PR keeps the minimum-similarity filtering logic in place, but the `MIN_SIMILARITY_SCORE` constant remains at 0.2; the description claims the threshold is raised, but the diff does not change the value. |
| | Preserve organization ID filtering | The query continues to apply an `organizationId` filter before processing results, so retrieval remains scoped to the org. |
| | Add unit tests validating result capping, score ordering, and noise filtering | A new spec file contains tests that reproduce the many-policy noise scenario and assert capping behavior, score ordering, and filtering of below-threshold results. |
| ⚠️ | Reduce irrelevant citations shown in generated answers / 'Show Sources' UI | Server-side capping and filtering should reduce the number of irrelevant sources surfaced, and unit tests validate the retrieval behavior. However, there is no client/UI or end-to-end integration change or test in the diff that directly demonstrates the 'Show Sources' output in the app. |
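
As a rough sketch of how the cap and the org scoping might combine on the batch path, the example below applies both per question. Everything except `MAX_RESULTS` and `organizationId` is a hypothetical name. The sketch also assumes the cap is applied per query, and it filters in memory, whereas the analysis says the real filter is applied in the query itself.

```typescript
// Hypothetical match shape; only organizationId is named in the analysis.
interface OrgScopedMatch {
  organizationId: string;
  sourceId: string;
  score: number;
}

const MAX_RESULTS = 5;

// For each question, keep only the caller's org, then the top MAX_RESULTS by score.
function capBatch(
  matchesByQuestion: Map<string, OrgScopedMatch[]>,
  organizationId: string,
): Map<string, OrgScopedMatch[]> {
  const capped = new Map<string, OrgScopedMatch[]>();
  matchesByQuestion.forEach((matches, question) => {
    const top = matches
      .filter((m) => m.organizationId === organizationId) // scope first, so other orgs never consume a slot
      .sort((a, b) => b.score - a.score)
      .slice(0, MAX_RESULTS);
    capped.set(question, top);
  });
  return capped;
}

// Illustrative batch: one question with 7 in-org matches and 1 high-scoring foreign match.
const batch = new Map<string, OrgScopedMatch[]>();
batch.set("q1", [
  { organizationId: "org-b", sourceId: "foreign", score: 0.99 },
  ...Array.from({ length: 7 }, (_, i) => ({
    organizationId: "org-a",
    sourceId: `policy-${i}`,
    score: 0.3 + i * 0.1,
  })),
]);

const result = capBatch(batch, "org-a").get("q1") ?? [];
console.log(result.map((m) => m.sourceId)); // [ 'policy-6', 'policy-5', 'policy-4', 'policy-3', 'policy-2' ]
```

Filtering by organization before slicing keeps the cap from being spent on results the caller is not allowed to see.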


@tofikwest merged commit 11d91e2 into main on Jun 25, 2026 (10 checks passed)
@tofikwest deleted the tofik/cs-594-bug-questionnaire-ai-tool-excessive branch on June 25, 2026 21:24