Skip to content

Fix IVF extend list resize bug#2297

Open
qwertyforce wants to merge 2 commits into
NVIDIA:mainfrom
qwertyforce:fix_ivf_resize
Open

Fix IVF extend list resize bug#2297
qwertyforce wants to merge 2 commits into
NVIDIA:mainfrom
qwertyforce:fix_ivf_resize

Conversation

@qwertyforce

Copy link
Copy Markdown

Description

Fixes a bug in IVF-Flat and IVF-SQ extend() where a list could receive new vectors but keep its old logical list::size.

The issue happens when extending a list that already has enough allocated capacity. Interleaved IVF list storage may be padded, so the old copy extent can be larger than the old logical size.

Example:

old logical size: 100
old padded copy size: 128
new logical size: 120

Previously, resize_list() received only the padded value, 128, and used it both for copying old storage and for deciding whether list::size needed to be updated. It checked 120 <= 128, skipped the update, and left list::size == 100 even though new vectors were inserted up to 120.

This stale size can break the shared-list / copy-on-write logic in later extends and corrupt search results.

Fix

Pass separate old sizes to ivf::resize_list():

old_logical_size: used for the list::size update/CAS decision
old_used_size:    used for copying old storage, may be padded

The existing overload is kept for call sites where both sizes are the same.

Tests

Added regression tests for:

IVF-Flat repeated extend from a shared base index with enough existing capacity
IVF-SQ in-place extend with enough existing capacity

@copy-pr-bot

copy-pr-bot Bot commented Jul 4, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@achirkin

achirkin commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

old_used_size is the old logical size. old real size is checked via extends.

@qwertyforce

qwertyforce commented Jul 5, 2026

Copy link
Copy Markdown
Author

old_used_size is the old logical size. old real size is checked via extends.

in ivf-flat extend, resize_list() receives raft::Pow2<kIndexGroupSize>::roundUp(old_list_sizes[label]) as an argument, right?

@qwertyforce

Copy link
Copy Markdown
Author

here is poc

// ivf_flat_extend_size_poc.cpp

#include <cuvs/neighbors/ivf_flat.hpp>

#include <raft/core/copy.cuh>
#include <raft/core/device_mdarray.hpp>
#include <raft/core/device_mdspan.hpp>
#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resources.hpp>

#include <cstdint>
#include <iostream>
#include <numeric>
#include <optional>
#include <vector>

namespace ivf_flat = cuvs::neighbors::ivf_flat;

template <typename IndexT>
uint32_t visible_list_size(raft::resources const& handle, IndexT& index)
{
  auto stream = raft::resource::get_cuda_stream(handle);
  uint32_t h = 0;
  raft::copy(&h, index.list_sizes().data_handle(), 1, stream);
  raft::resource::sync_stream(handle);
  return h;
}

template <typename IndexT>
void print_sizes(char const* tag, raft::resources const& handle, IndexT& index)
{
  std::cout << tag << "\n"
            << "  index.list_sizes()[0] = " << visible_list_size(handle, index) << "\n"
            << "  list::size            = " << index.lists()[0]->get_size() << "\n"
            << "  capacity              = " << index.lists()[0]->indices_capacity() << "\n";
}

int main()
{
  raft::resources handle;
  auto stream = raft::resource::get_cuda_stream(handle);

  constexpr int64_t base_rows = 100;
  constexpr int64_t grow_rows = 20;
  constexpr int64_t rows      = base_rows + grow_rows;
  constexpr int64_t dim       = 4;

  std::vector<float> h_data(rows * dim);
  for (int64_t r = 0; r < rows; ++r) {
    for (int64_t c = 0; c < dim; ++c) {
      h_data[r * dim + c] = static_cast<float>(r + c);
    }
  }

  auto data = raft::make_device_matrix<float, int64_t>(handle, rows, dim);
  raft::copy(data.data_handle(), h_data.data(), h_data.size(), stream);

  ivf_flat::index_params params;
  params.n_lists                        = 1;
  params.metric                         = cuvs::distance::DistanceType::L2Expanded;
  params.add_data_on_build              = false;
  params.kmeans_trainset_fraction       = 1.0;
  params.adaptive_centers               = false;
  params.conservative_memory_allocation = false;

  auto all_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
    data.data_handle(), rows, dim);

  auto empty_index = ivf_flat::build(handle, params, all_view);

  auto base_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
    data.data_handle(), base_rows, dim);

  auto base_index = ivf_flat::extend(handle, base_view, std::nullopt, empty_index);
  raft::resource::sync_stream(handle);

  print_sizes("after base extend", handle, base_index);

  std::vector<int64_t> h_ids(grow_rows);
  std::iota(h_ids.begin(), h_ids.end(), base_rows);

  auto ids = raft::make_device_vector<int64_t, int64_t>(handle, grow_rows);
  raft::copy(ids.data_handle(), h_ids.data(), h_ids.size(), stream);

  auto grow_view = raft::make_device_matrix_view<const float, int64_t, raft::row_major>(
    data.data_handle() + base_rows * dim, grow_rows, dim);

  auto ids_view =
    raft::make_device_vector_view<const int64_t, int64_t>(ids.data_handle(), grow_rows);

  auto grown_index = ivf_flat::extend(
    handle,
    grow_view,
    std::make_optional<raft::device_vector_view<const int64_t, int64_t>>(ids_view),
    base_index);

  raft::resource::sync_stream(handle);

  print_sizes("after grow extend", handle, grown_index);

  auto visible  = visible_list_size(handle, grown_index);
  auto internal = grown_index.lists()[0]->get_size();

  if (visible == rows && internal == base_rows) {
    std::cout << "\nBUG: index-visible size is " << visible
              << " but internal list::size stayed " << internal << "\n";
    return 1;
  }

  std::cout << "\nNo mismatch observed\n";
  return 0;
}

after base extend
index.list_sizes()[0] = 100
list::size = 100
capacity = 128

after grow extend
index.list_sizes()[0] = 120
list::size = 100
capacity = 128

BUG: index-visible size is 120 but internal list::size stayed 100

@achirkin

achirkin commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Ok, sorry for closing this prematurely, this does seem legit. I'm not sure though if this internal get_size() is used where it could cause any problems. I'll look into this in the following days.

@achirkin achirkin reopened this Jul 5, 2026
@achirkin achirkin added bug Something isn't working non-breaking Introduces a non-breaking change labels Jul 5, 2026
@achirkin achirkin self-assigned this Jul 5, 2026
@qwertyforce

Copy link
Copy Markdown
Author

Thank you very much! I forgot to mention how i even stumbled upon this - i got very strange Illegal memory access errors (i am using python wrapper for multiple extends + search + torch for calculating the vectors) and believe that this bug is related (IMA was in calc_chunk_indices_kernel)
But I couldnt distill my production code to a small reproducible poc (the IMA is also non-deterministic), so all i can do is either the one poc i sent, or https://gist.github.com/qwertyforce/65d9bb2d108701240fc9ef1bc952e685 (tldr: if you create 2 new indexes by extend from the base one (not inplace), child_b data can leak into child_a. child_a is not modified after child_b is created, but child_a search results change)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants