feat(mergeconflict): make merge-conflict check asynchronous via runway #245

Draft
behinddwalls wants to merge 1 commit into preetam/messagequeue-runway from preetam/mergeconflict-async

Conversation

behinddwalls commented Jun 15, 2026

Collaborator

Summary

Why?

The merge-conflict check ran synchronously inside the validate consumer by calling the mergechecker extension inline. A real merge attempt is slow and I/O-heavy, so doing it on the partition lease blocks the pipeline and couples SubmitQueue to the checker's latency. This moves the check to an asynchronous round-trip with runway, modelled on build/buildsignal but across a service boundary.

What?

The pipeline gains validate → mergeconflict ⇢ (runway) ⇢ mergeconflictsignal → batch:

  • validate drops the inline mergechecker call and now publishes the request id to the internal mergeconflict topic (dedup + change-metadata + claim are unchanged).
  • mergeconflict (new) publishes the full MergeRequest to the runway-owned merge-conflict-checker queue, keyed by the request id as the client-owned correlation id. No local record is needed — the request id round-trips, so the result correlates straight back to the request (unlike build, whose server-generated id needs a mapping store).
  • mergeconflictsignal (new) consumes runway's MergeResult off merge-conflict-checker-signal, advances the request to batch when mergeable, or fails it (user error) when conflicted.
  • DLQ reconcilers drive the request to Error on dead-letter; the signal DLQ reads the request id straight off the result.
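
The round-trip described in the bullets above can be sketched roughly as follows. This is only an illustration of why a client-owned correlation id removes the need for a mapping store: the `MergeRequest` and `MergeResult` field sets, the `Target` field, and the `EncodeCheck`/`NextStep` helpers are assumptions made for the example, not the actual runway contract from #260.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MergeRequest is a hypothetical stand-in for the payload published to the
// runway-owned queue. The real contract may carry different fields.
type MergeRequest struct {
	ID     string `json:"id"`     // request id, reused as the correlation id
	Target string `json:"target"` // assumed field: target branch of the check
}

// MergeResult is a hypothetical stand-in for the payload runway sends back.
type MergeResult struct {
	ID        string `json:"id"` // the request id echoed back unchanged
	Mergeable bool   `json:"mergeable"`
}

// EncodeCheck builds the full payload that crosses the service boundary.
// It carries data rather than an entity id because the receiver cannot read
// the sender's storage.
func EncodeCheck(requestID, target string) ([]byte, error) {
	return json.Marshal(MergeRequest{ID: requestID, Target: target})
}

// NextStep decodes a result and reports which request it belongs to and what
// should happen to it: "batch" when mergeable, "fail" when conflicted.
func NextStep(payload []byte) (requestID, action string, err error) {
	var res MergeResult
	if err := json.Unmarshal(payload, &res); err != nil {
		return "", "", fmt.Errorf("decode merge result: %w", err)
	}
	if res.Mergeable {
		return res.ID, "batch", nil
	}
	return res.ID, "fail", nil
}
```

Because the id in the result is the same id the sender chose, `NextStep` needs no lookup table to find the request; a server-generated id (as with `build`) would require one.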

Crossing the runway boundary is why these payloads carry full data rather than entity IDs; the new queue-payload-boundary rule is documented in CLAUDE.md, with the pipeline diagram and stage table updated in workflow.md and the superseded mergechecker validate-path row noted in extension-contract.md. The mergechecker package is left in-tree (unused on the validate path); removing it is a follow-up. Runway's service implementation is out of scope — only its contract (added in #260) is consumed here.

Test Plan

  • bazel build //...
  • bazel test //... --test_tag_filters=-integration,-e2e (57 tests pass)
  • make gazelle clean

Stack

  1. docs(rfc): add message queue contract RFC #259
  2. feat(api/runway): add external merge queue contract #260
  3. @ feat(mergeconflict): make merge-conflict check asynchronous via runway #245
  4. feat(merge): make merge asynchronous via runway #247

Comment thread doc/rfc/submitqueue/extension-contract.md Outdated
Comment thread submitqueue/core/topickey/topickey.go
Comment thread submitqueue/orchestrator/controller/mergeconflictsignal/mergeconflictsignal.go Outdated
behinddwalls force-pushed the preetam/mergeconflict-async branch from a51a86b to d565b91 on June 16, 2026 03:48
behinddwalls marked this pull request as ready for review on June 16, 2026 15:46
behinddwalls requested review from a team and sbalabanov as code owners on June 16, 2026 15:46
behinddwalls force-pushed the preetam/runway-contract branch from edf7ab9 to 3e8071d on June 16, 2026 17:09
behinddwalls force-pushed the preetam/mergeconflict-async branch from d565b91 to 00ed26c on June 16, 2026 17:09
**Outputs are unchanged except `pusher`.** This RFC moves the *input* toward identity; five of the six return contracts — conflicts, score, mergeability, change info, build id/status — are exactly what they are today. `pusher` is the lone exception: because its input becomes a *list* of independently-landed batches, its result regroups per batch (`BatchID`-tagged, per-change commit detail kept underneath) so each batch's outcome stays correlatable — the "output mirrors the input unit" principle above. No other output shape changes.
**Outputs are unchanged except `pusher`.** This RFC moves the *input* toward identity; four of the five return contracts — conflicts, score, change info, build id/status — are exactly what they are today. `pusher` is the lone exception: because its input becomes a *list* of independently-landed batches, its result regroups per batch (`BatchID`-tagged, per-change commit detail kept underneath) so each batch's outcome stays correlatable — the "output mirrors the input unit" principle above. No other output shape changes.

The validate-time mergeability check is no longer in this catalog because it is no longer an in-process extension. It now runs **asynchronously and out-of-process** in runway: `validate` hands off to the `mergeconflict` controller, which publishes a full check request to the runway-owned `merge-conflict-checker` queue, and `mergeconflictsignal` consumes runway's result (see [workflow.md](workflow.md)). The `mergechecker` package stays in-tree but unused on the validate path; removing it is a follow-up.

Contributor

ask AI to avoid history in documentation i.e. "no longer in this catalog"

Comment thread doc/rfc/submitqueue/workflow.md Outdated
# Orchestrator Workflow

The orchestrator processes land requests through a queue-driven pipeline of small, single-purpose controllers. The gateway accepts a request over RPC and hands it off asynchronously; from there each controller consumes one topic, advances the request or batch, and publishes to the next topic. Most hops carry only an ID — the controller fetches the entity from storage — while a few entry points (`start`, `buildsignal`, `log`) carry the full payload because there is no row to fetch yet.
The orchestrator processes land requests through a queue-driven pipeline of small, single-purpose controllers. The gateway accepts a request over RPC and hands it off asynchronously; from there each controller consumes one topic, advances the request or batch, and publishes to the next topic. Most hops carry only an ID — the controller fetches the entity from storage — while a few entry points (`start`, `buildsignal`, `log`) carry the full payload because there is no row to fetch yet. The merge-conflict check is the one stage that crosses a service boundary: `mergeconflict` publishes a full payload to the runway-owned `merge-conflict-checker` queue and `mergeconflictsignal` consumes a full payload back off `merge-conflict-checker-signal`, because runway cannot read submitqueue's storage. See the queue-payload-boundary rule in [CLAUDE.md](../../../CLAUDE.md).

Contributor

merge-conflict is not the only one, so it should not be mentioned explicitly

Comment thread doc/rfc/submitqueue/workflow.md Outdated
Every *consumed* primary pipeline topic above is paired with a `{topic}_dlq` subscription consumed by a dedicated DLQ controller. The `log` topic is the exception: the orchestrator only publishes to it (the gateway is the sole consumer that persists the request log), so it has no orchestrator-side subscription and therefore no DLQ. The consumer framework moves a message to its DLQ once the primary controller returns a non-retryable error or exhausts retries on a retryable one; without the DLQ side the affected request would stay in a non-terminal state forever and the gateway would still report it as "in progress".

The DLQ controllers do not re-attempt the failed work. They decode the payload to recover the affected `RequestID` (start, validate, batch, cancel) or `BatchID` (score, speculate, build, buildsignal, merge, conclude) and drive the entity to a terminal failed state — `RequestStateError` for requests, `BatchStateFailed` for batches with fan-out to the member requests. State writes use the same optimistic-locking CAS as the primary pipeline, so a late primary-pipeline update wins cleanly and a version mismatch is asked back for redelivery.
The DLQ controllers do not re-attempt the failed work. They decode the payload to recover the affected `RequestID` (start, validate, mergeconflict, batch, cancel) or `BatchID` (score, speculate, build, buildsignal, merge, conclude) and drive the entity to a terminal failed state — `RequestStateError` for requests, `BatchStateFailed` for batches with fan-out to the member requests. The `mergeconflictsignal` DLQ decodes a runway `MergeResult` whose `id` is the request id echoed back, so it fails that request directly. State writes use the same optimistic-locking CAS as the primary pipeline, so a late primary-pipeline update wins cleanly and a version mismatch is asked back for redelivery.

Contributor

same, avoid specifics unless only as an example
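
For readers following the DLQ paragraph in this thread, the optimistic-locking behaviour it describes can be sketched as below. This is a minimal in-memory model under stated assumptions: `Store`, `CompareAndSet`, `FailFromDLQ`, the state names, and the two sentinel errors are hypothetical and do not mirror the repository's storage API.

```go
package main

import (
	"errors"
	"sync"
)

// ErrVersionMismatch signals a lost compare-and-set race. A consumer would
// return it as a retryable error so the message is redelivered.
var ErrVersionMismatch = errors.New("version mismatch")

// ErrNotFound is returned when the request id is unknown to the store.
var ErrNotFound = errors.New("request not found")

// State is a hypothetical request state; the names are illustrative only.
type State string

const (
	StateValidating State = "validating"
	StateError      State = "error"
	StateCancelled  State = "cancelled"
)

// Request is a versioned row; Version increments on every successful write.
type Request struct {
	ID      string
	State   State
	Version int
}

// Store is an in-memory stand-in for versioned request storage.
type Store struct {
	mu   sync.Mutex
	rows map[string]Request
}

// NewStore seeds the store with the given rows.
func NewStore(rows ...Request) *Store {
	s := &Store{rows: make(map[string]Request)}
	for _, r := range rows {
		s.rows[r.ID] = r
	}
	return s
}

// Get returns a copy of the row and whether it exists.
func (s *Store) Get(id string) (Request, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[id]
	return r, ok
}

// CompareAndSet writes next only if the stored version equals expected.
func (s *Store) CompareAndSet(id string, expected int, next State) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[id]
	if !ok {
		return ErrNotFound
	}
	if r.Version != expected {
		return ErrVersionMismatch
	}
	r.State = next
	r.Version++
	s.rows[id] = r
	return nil
}

// FailFromDLQ drives a request to the error state without retrying the work.
// A request already in a terminal state is left untouched, so a late
// primary-pipeline update is never overwritten.
func FailFromDLQ(s *Store, id string) error {
	req, ok := s.Get(id)
	if !ok {
		return ErrNotFound
	}
	if req.State == StateError || req.State == StateCancelled {
		return nil
	}
	return s.CompareAndSet(id, req.Version, StateError)
}
```

If another writer bumps the version between `Get` and `CompareAndSet`, the call returns `ErrVersionMismatch` and the next delivery re-reads the row, at which point the terminal-state check decides whether anything is left to do.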

metrics.NamedCounter(c.metricsScope, opName, "deserialize_errors", 1)
return fmt.Errorf("failed to decode merge conflict check result from dlq payload: %w", err)
}
if result.ID == "" {

Contributor

do we really need to check for data correctness if we control both producer and consumer?

Collaborator Author

likely not, we can get rid of it.

// runway dedupes on it; the result is matched straight back to the request. At
// validate time the check is a single step (candidate vs target branch); a
// future in-flight caller layers earlier steps before the candidate.
req := runwayentity.MergeRequest{

Contributor

Do we even need mergeconflict controller now if the only thing it does is relaying a message? Why not do it in a prev controller in a pipeline?

Collaborator Author

I thought about it a bit. We could do this in the validate controller and get rid of it, but I thought it would be cleaner for each controller to relay its own transition, say Validating -> Validated, mergeconflictchecking -> mergeconflictchecked.

We are not doing PublishLog in all controllers yet, but ideally each controller would emit that transition to the request logs as well.

"github.com/uber/submitqueue/core/metrics"
"github.com/uber/submitqueue/entity/change"
entityqueue "github.com/uber/submitqueue/entity/messagequeue"
runwayentity "github.com/uber/submitqueue/runway/entity"

Contributor

sq should not reference runway

Collaborator Author

My thought was that these would be contract entities, like APIs. I could move them to the top level in a separate change, unless you think we should have a mirrored SQ entity for the queue payload (which would be the same as this entity).

// in Error skips the state CAS but still publishes the log (so a prior attempt
// that flipped the state but failed before logging is repaired); a request that
// reached a different terminal state (e.g. a racing cancel) is left untouched.
func (c *Controller) failRequest(ctx context.Context, request entity.Request, reason string) error {

Contributor

Likely a copy from another controller. Maybe consolidate them into a helper at some point; it does not have to be this diff.

Collaborator Author

Yes, this is the same as what we do in the DLQ control flow. It is ideally needed in multiple places, which we don't do yet, so I will make it a utility and move it to core later.
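
As a rough picture of what such a shared helper might look like, the sketch below models the three cases in the `failRequest` doc comment quoted above: already in Error (skip the state write but still publish the log), another terminal state (leave untouched), and everything else (write the state and publish the log). The `State` values, the `Outcome` type, and `FailRequest` itself are hypothetical and only illustrate the branching; they are not the controller's actual code.

```go
package main

// State is a hypothetical request state; the names are illustrative only.
type State string

const (
	StatePending   State = "pending"
	StateError     State = "error"
	StateCancelled State = "cancelled"
)

// Request is a minimal stand-in for the request entity.
type Request struct {
	ID    string
	State State
}

// Outcome records what a single fail attempt did, so callers and tests can
// see whether the state was written and whether a log entry was published.
type Outcome struct {
	WroteState   bool
	PublishedLog bool
}

// isTerminal reports whether the request can no longer change state.
func isTerminal(s State) bool {
	return s == StateError || s == StateCancelled
}

// FailRequest is idempotent across redeliveries:
//   - already in error: skip the state write but publish the log again, which
//     repairs an earlier attempt that changed the state and then failed
//     before logging;
//   - any other terminal state (for example a racing cancel): do nothing;
//   - otherwise: move to error and publish the log.
func FailRequest(req *Request) Outcome {
	switch {
	case req.State == StateError:
		return Outcome{PublishedLog: true}
	case isTerminal(req.State):
		return Outcome{}
	default:
		req.State = StateError
		return Outcome{WroteState: true, PublishedLog: true}
	}
}
```

Calling `FailRequest` twice on the same request writes the state once and publishes the log twice, which is the repair behaviour the comment describes.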

behinddwalls force-pushed the preetam/runway-contract branch from 3e8071d to cd798f6 on June 16, 2026 20:12
behinddwalls force-pushed the preetam/mergeconflict-async branch from 00ed26c to 2ee27cf on June 16, 2026 20:12
behinddwalls requested a review from sbalabanov on June 16, 2026 20:12
behinddwalls force-pushed the preetam/runway-contract branch from cd798f6 to 3e8071d on June 17, 2026 06:00
behinddwalls force-pushed the preetam/mergeconflict-async branch 2 times, most recently from a970815 to bc59a27 on June 17, 2026 06:09
behinddwalls force-pushed the preetam/runway-contract branch from 3e8071d to 78d66a5 on June 17, 2026 06:09
Base automatically changed from preetam/runway-contract to main on June 17, 2026 06:15
behinddwalls force-pushed the preetam/mergeconflict-async branch from bc59a27 to d9dc61f on June 17, 2026 06:20
behinddwalls marked this pull request as draft on June 17, 2026 16:56
behinddwalls force-pushed the preetam/mergeconflict-async branch from d9dc61f to 7ab64d3 on June 17, 2026 17:33
behinddwalls changed the base branch from main to preetam/messagequeue-runway on June 17, 2026 17:33
behinddwalls force-pushed the preetam/messagequeue-runway branch from 35e8531 to cee7fb5 on June 17, 2026 18:00
behinddwalls force-pushed the preetam/mergeconflict-async branch from 7ab64d3 to b6042a6 on June 17, 2026 18:00
behinddwalls force-pushed the preetam/messagequeue-runway branch from cee7fb5 to 5c1798d on June 17, 2026 18:34
behinddwalls force-pushed the preetam/mergeconflict-async branch from b6042a6 to 2a8ea85 on June 17, 2026 20:20
behinddwalls changed the base branch from preetam/messagequeue-runway to main on June 17, 2026 20:20
behinddwalls changed the base branch from main to preetam/messagequeue-runway on June 17, 2026 20:22
behinddwalls force-pushed the preetam/messagequeue-runway branch from 557ebb7 to 58313d2 on June 17, 2026 21:30
behinddwalls force-pushed the preetam/mergeconflict-async branch from 2a8ea85 to 6f9a08f on June 17, 2026 21:30
behinddwalls force-pushed the preetam/mergeconflict-async branch from 6f9a08f to 5d3231c on June 17, 2026 21:50
behinddwalls force-pushed the preetam/messagequeue-runway branch from 58313d2 to 823ec31 on June 17, 2026 21:50
behinddwalls force-pushed the preetam/mergeconflict-async branch from 5d3231c to eb1ba84 on June 17, 2026 21:55
behinddwalls force-pushed the preetam/messagequeue-runway branch 2 times, most recently from 738e35e to 8696ae5 on June 17, 2026 22:05
behinddwalls force-pushed the preetam/mergeconflict-async branch from eb1ba84 to eadb853 on June 17, 2026 22:05
The merge-conflict check ran synchronously inside the `validate` consumer by calling the `mergechecker` extension inline. A real merge attempt is slow and I/O-heavy, so doing it on the partition lease blocks the pipeline and couples SubmitQueue to the checker's latency. This moves the check to an asynchronous round-trip with runway, modelled on `build`/`buildsignal` but across a service boundary.

The pipeline gains `validate → mergeconflict ⇢ (runway) ⇢ mergeconflictsignal → batch`:

- `validate` drops the inline `mergechecker` call and now publishes the request id to the internal `mergeconflict` topic (dedup + change-metadata + claim are unchanged).
- `mergeconflict` (new) publishes the full `MergeConflictCheckRequest` to the runway-owned `merge-conflict-checker` queue, keyed by the request id as the client-owned correlation id. No local record is needed — the request id round-trips, so the result correlates straight back to the request (unlike `build`, whose server-generated id needs a mapping store).
- `mergeconflictsignal` (new) consumes runway's `MergeConflictCheckResult` off `merge-conflict-checker-signal`, advances the request to `batch` when mergeable, or fails it (user error) when conflicted.
- DLQ reconcilers drive the request to `Error` on dead-letter; the signal DLQ reads the request id straight off the result.

Crossing the runway boundary is why these payloads carry full data rather than entity IDs; the new queue-payload-boundary rule is documented in CLAUDE.md, with the pipeline diagram and stage table updated in workflow.md and the superseded `mergechecker` validate-path row noted in extension-contract.md. The `mergechecker` package is left in-tree (unused on the validate path); removing it is a follow-up. Runway's service implementation is out of scope — only its contract (added in the parent PR) is consumed here.

✅ `bazel build //...`
✅ `bazel test //... --test_tag_filters=-integration,-e2e` (54 tests pass)
✅ `make gazelle` clean

Co-authored-by: Cursor <cursoragent@cursor.com>
behinddwalls force-pushed the preetam/messagequeue-runway branch from 8696ae5 to e4512bb on June 17, 2026 22:20
behinddwalls force-pushed the preetam/mergeconflict-async branch from eadb853 to e77077c on June 17, 2026 22:21