Skip to content

Fix JobGroup state aggregation#557

Open
fallintoplace wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
fallintoplace:fix/jobgroup-state-aggregation
Open

Fix JobGroup state aggregation#557
fallintoplace wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
fallintoplace:fix/jobgroup-state-aggregation

Conversation

@fallintoplace

Copy link
Copy Markdown

What changed

  • add explicit JobGroup state aggregation instead of returning only self.states[0]
  • make cleanup() wait until every tracked handle is terminal before running executor cleanup
  • update JobGroup tests to cover mixed multi-handle states instead of only single-handle behavior

Why

JobGroup.state previously returned only the first handle's state, which let mixed groups report the wrong status. That leaked into status() and cleanup() behavior.

This fixes the main bad cases:

  • [SUCCEEDED, FAILED] now reports FAILED
  • [RUNNING, FAILED] no longer hides the failure
  • [FAILED, RUNNING] still reports the failure, but cleanup() no longer runs until all handles are terminal

Aggregation rules

  • FAILED if any handle failed
  • RUNNING if any handle is still running and none failed
  • PENDING or SUBMITTED if work is still queued and nothing is running or failed
  • SUCCEEDED only if all handles succeeded
  • CANCELLED when all handles are terminal and at least one was cancelled, with no failures
  • UNKNOWN otherwise

Checks

  • uv run -- pytest test/run/test_job.py
  • uv run --group lint -- ruff check nemo_run/run/job.py test/run/test_job.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant