Skip to content

refactor(storage): move batch state membership to app layer#233

Open
albertywu wants to merge 2 commits into
mainfrom
wua/batch-active-membership
Open

refactor(storage): move batch state membership to app layer#233
albertywu wants to merge 2 commits into
mainfrom
wua/batch-active-membership

Conversation

@albertywu

@albertywu albertywu commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace the storage-owned active_batch implementation with a generic app-maintained batch_state_membership store keyed by (queue, state, batch_id). BatchStore is back to pure batch CRUD/CAS; queue/state membership is a separate store and the orchestrator owns the app semantics for active/non-terminal state handling.

The new batchstate controller helper composes BatchStore + BatchStateMembershipStore to:

  • add initial membership before creating a non-terminal batch
  • add target non-terminal membership before state CAS, then best-effort remove previous membership after success
  • list queue/state batch IDs, resolve authoritative Batch rows by primary key, and filter stale membership rows in the app layer

This also generalizes the lookup to queue,state -> batch_ids instead of baking in an "active" table per state group. No backfill is included because this application is not deployed to production yet.

Reviewer feedback addressed

  • Moved "active batches" logic out of BatchStore and into the orchestrator app layer.
  • Added a separate storage entity/data store for batch state membership.
  • Replaced active_batch with a generalized queue/state membership table.
  • Removed storage-layer self-healing of app state; storage now exposes primitive membership records, and callers resolve/filter against authoritative batch rows.

Test Plan

make fmt
make build
make test
make e2e-test
./tool/bazel test //test/integration/submitqueue/extension/storage/... --test_output=errors

Copilot AI review requested due to automatic review settings June 10, 2026 03:18

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors how “active batches” are queried in the storage layer by removing reliance on a (queue, state) secondary index and introducing an explicit active_batch(queue, batch_id) membership table, enabling queue-scoped listing via primary-key-prefix reads.

Changes:

  • Replace BatchStore.GetByQueueAndStates(queue, states) with BatchStore.ListActive(queue) and move state filtering into callers.
  • Add active_batch membership table; remove the idx_queue_state index from the MySQL schema; update MySQL BatchStore to maintain/self-heal membership.
  • Extend integration/contract tests to cover ListActive semantics and MySQL-specific self-healing behavior.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/integration/submitqueue/extension/storage/suite.go Adds contract tests for BatchStore.ListActive and exposes storage via GetStorage() for backend-specific assertions.
test/integration/submitqueue/extension/storage/mysql/storage_test.go Adds MySQL-specific tests validating self-healing behavior of active_batch rows.
test/integration/submitqueue/extension/storage/mysql/BUILD.bazel Adds entity dependency required by new MySQL integration tests.
submitqueue/orchestrator/controller/cancel/cancel.go Switches cancel controller from GetByQueueAndStates to ListActive.
submitqueue/orchestrator/controller/cancel/cancel_test.go Updates mocks/expectations for ListActive.
submitqueue/orchestrator/controller/batch/batch.go Uses ListActive and filters desired states in-memory for conflict analysis.
submitqueue/orchestrator/controller/batch/batch_test.go Updates mocks/expectations and adds coverage for in-memory filtering behavior.
submitqueue/extension/storage/mysql/schema/README.md Documents the new membership-table approach and maintenance model.
submitqueue/extension/storage/mysql/schema/batch.sql Removes secondary index from batch table schema.
submitqueue/extension/storage/mysql/schema/active_batch.sql Adds new active_batch membership table schema.
submitqueue/extension/storage/mysql/batch_store.go Implements membership-first Create and ListActive resolution + best-effort terminal cleanup.
submitqueue/extension/storage/mock/batch_store_mock.go Updates mock interface to ListActive.
submitqueue/extension/storage/batch_store.go Updates storage interface to ListActive and documents new contract.
Files not reviewed (1)
  • submitqueue/extension/storage/mock/batch_store_mock.go: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread submitqueue/extension/storage/mysql/schema/README.md Outdated
Comment thread submitqueue/extension/storage/mysql/schema/active_batch.sql Outdated
Comment thread submitqueue/extension/storage/mysql/schema/active_batch.sql Outdated
@albertywu albertywu force-pushed the wua/batch-active-membership branch 2 times, most recently from ebba199 to 7c00f7b Compare June 10, 2026 04:07
Copilot AI review requested due to automatic review settings June 10, 2026 04:07

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • submitqueue/extension/storage/mock/batch_store_mock.go: Language not supported

Comment thread submitqueue/extension/storage/mysql/batch_store.go Outdated
Comment thread submitqueue/extension/storage/mysql/schema/batch.sql Outdated
@albertywu albertywu force-pushed the wua/batch-active-membership branch from 7c00f7b to 49119c4 Compare June 10, 2026 04:27
@albertywu albertywu marked this pull request as ready for review June 10, 2026 04:34
Copilot AI review requested due to automatic review settings June 10, 2026 04:34
@albertywu albertywu requested review from a team, behinddwalls and sbalabanov as code owners June 10, 2026 04:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • submitqueue/extension/storage/mock/batch_store_mock.go: Language not supported

Comment thread submitqueue/extension/storage/mysql/schema/README.md Outdated

@sbalabanov sbalabanov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to move the "active batches" logic to the app layer. Which means that a separate entity with its own data store has to be created.

- **Write overhead**: Every `INSERT` and `UPDATE` to the `batch` table must also update the secondary index, adding latency to write operations.
- **Storage cost**: The index consumes additional disk space proportional to the number of rows in the table.
- **Lock contention**: Under high write concurrency, index maintenance can increase lock contention on the affected index pages.
`queue` leads the composite primary key `(queue, batch_id)`, so listing a queue's active batches is a primary-key-prefix scan and the table is shardable by queue. On a key-value store the same shape maps directly onto a partition key (`queue`) and sort key (`batch_id`) with no secondary index.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typical key-value stores do not support querying by partition key alone.
Possible solutions:

  • make queue PK and list of ids as value. Cons: high conflict rate for updates
  • require datastore to support ranged queries

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I updated the docs to state the backend requirement explicitly: the companion table assumes ranged scans over a partition/sort key. It no longer claims a generic partition-key-only lookup, and this keeps writes low-conflict compared with storing one queue-level list value.

}
return nil, fmt.Errorf("failed to get active batch id=%q queue=%q: %w", id, queue, err)
}
if batch.State.IsTerminal() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage layer should be kept simple and not enforce the logic of the app.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, implemented in 4a6d905. Storage now exposes only Add, Remove, and ListIDs for membership rows; controller/batchstate owns create/update/list semantics and filters against authoritative Batch rows.

//
// Membership is tracked in active_batch (queue leads the PK), so listing is a
// PK-prefix scan that ports cleanly to a key-value store. Each member is fetched
// by primary key: a terminal batch's membership is best-effort removed (race-free,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not fully correct: list of batches in activeBatchIDs is not fetched by a primary key

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by the refactor. There is no activeBatchIDs helper now. The primitive lookup is ListIDs(queue, state) over (queue, state, batch_id), and callers then fetch each batch by primary key before using it.

@albertywu albertywu marked this pull request as draft June 11, 2026 03:20
-- Membership is best-effort: it is added on Create (before the batch row) and
-- removed on read by ListActive once a batch is terminal. A reconcile job reclaims
-- rows left dangling by a failed or crashed create. See schema/README.md.
CREATE TABLE IF NOT EXISTS active_batch (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we likely want a more general way to get queue, state -> [batch_ids], just active means we have to create a table per few states

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and implemented in 4a6d905. active_batch was replaced with general batch_state_membership(queue, state, batch_id), so any queue/state can be listed with the same table instead of adding one table per active-state grouping.

albertywu and others added 2 commits June 16, 2026 21:37
…h membership

Replace the secondary index over the batch table's mutable `state` column with
an `active_batch` membership table that answers the only queue-scoped query the
pipeline needs: "which batches in this queue are still active?" (the batch
controller uses it to find conflict dependencies; the cancel controller uses it
to find the batch holding a request). A row is intended to exist while its batch
is non-terminal, so the table stays bounded by the live speculation window
rather than growing with batch history. `queue` leads the PK so listing is a
PK-prefix scan and the table is shardable by queue — an access pattern that
ports cleanly to a key-value store (queue = partition key, batch_id = sort key),
unlike a server-maintained secondary index over a mutable non-key column.

Membership is best-effort, not an exact mirror of batch state, and is
maintained without transactions:

- Create writes the membership row before the batch row. This ordering is
  required for correctness: whenever a batch row is visible to a reader its
  membership row is already present, so a concurrent ListActive can never miss
  an active batch. INSERT IGNORE keeps the membership write idempotent across
  retries.
- If the batch insert then fails, Create deliberately leaves the membership row
  in place. A returned error does not prove the row was not written (an
  ambiguous failure can commit the batch row and still return an error), so
  deleting would risk permanently orphaning a live, non-terminal batch from
  ListActive. A dangling membership is the safe direction.
- ListActive resolves each member by primary key: a terminal batch's membership
  is best-effort removed (race-free — a terminal batch is fully committed and
  its id is never reused); a missing batch is skipped but NOT removed (it may
  belong to an in-flight Create that has written its membership but not yet its
  batch row). Cleanup failures are swallowed so a read never fails on index
  maintenance, and terminal-state writers (merge, speculate, dlq) need not touch
  the index.

Genuinely dangling rows (failed/crashed creates) and batches stuck in a
non-terminal state are left for a future reconcile/prune job, documented in
schema/README.md.

Integration tests cover the self-heal and membership invariants:
  - TestActiveBatch_SelfHealsTerminalMembership
  - TestActiveBatch_SkipsDanglingMembershipWithoutDeleting
  - TestActiveBatch_CreateKeepsMembershipOnDuplicate
  - TestActiveBatch_CreateKeepsMembershipOnFailedInsert

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@albertywu albertywu force-pushed the wua/batch-active-membership branch from 49119c4 to 4a6d905 Compare June 16, 2026 20:54
@albertywu albertywu changed the title refactor(storage): replace batch (queue,state) index with active_batch membership refactor(storage): move batch state membership to app layer Jun 16, 2026
@albertywu

Copy link
Copy Markdown
Contributor Author

Updated in 4a6d905 to follow the reviewer feedback.

What changed:

  • Replaced active_batch with generalized batch_state_membership(queue, state, batch_id).
  • Removed BatchStore.ListActive; BatchStore is pure batch CRUD/CAS again.
  • Added BatchStateMembershipStore as the separate storage entity/data store.
  • Moved active/non-terminal/state filtering semantics to the orchestrator via controller/batchstate.
  • Controllers now explicitly maintain membership around batch create/state transitions, and list paths resolve authoritative Batch rows before filtering.

I intentionally did not add a backfill because this app is not deployed to production yet.

Validation passed:

  • make fmt
  • make build
  • make test
  • make e2e-test
  • ./tool/bazel test //test/integration/submitqueue/extension/storage/... --test_output=errors

@albertywu

Copy link
Copy Markdown
Contributor Author

@sbalabanov addressing the review summary: agreed and implemented in 4a6d905. BatchStore is back to primary-key CRUD/CAS only, BatchStateMembershipStore is a separate primitive entity store, and active/non-terminal semantics live in the orchestrator package controller/batchstate.

@albertywu albertywu marked this pull request as ready for review June 16, 2026 21:00
Copilot AI review requested due to automatic review settings June 16, 2026 21:00

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 38 changed files in this pull request and generated 2 comments.

Files not reviewed (3)
  • submitqueue/extension/storage/mock/batch_state_membership_store_mock.go: Generated file
  • submitqueue/extension/storage/mock/batch_store_mock.go: Generated file
  • submitqueue/extension/storage/mock/storage_mock.go: Generated file

Comment on lines +122 to +124
if batch.Queue != queue {
continue
}
Comment on lines +29 to +46
// NonTerminalStates are the batch states that should remain discoverable via
// membership lookups.
var NonTerminalStates = []entity.BatchState{
entity.BatchStateCreated,
entity.BatchStateScored,
entity.BatchStateSpeculating,
entity.BatchStateMerging,
entity.BatchStateCancelling,
}

// ConflictStates are the active states the batch controller considers when
// building conflict dependencies for a new batch.
var ConflictStates = []entity.BatchState{
entity.BatchStateCreated,
entity.BatchStateSpeculating,
entity.BatchStateMerging,
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants