AIR CLI Integration: air run Command Pt. 1 - Add GPU accelerator type and compute config model#5602
AIR CLI Integration: air run Command Pt. 1 - Add GPU accelerator type and compute config model#5602riddhibhagwat-db wants to merge 6 commits into
air run Command Pt. 1 - Add GPU accelerator type and compute config model#5602Conversation
Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.
Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Integration test reportCommit: 9efd3d1
23 interesting tests: 14 SKIP, 7 RECOVERED, 2 flaky
Top 29 slowest tests (at least 2 minutes):
|
| NodePoolID string `yaml:"node_pool_id"` | ||
| PoolName string `yaml:"pool_name"` |
There was a problem hiding this comment.
Hey let's leave out any pool related features from Go port. cc @ben-hansen-db @maggiewang-db I'd cc Yu Peng but he doesn't have a -db GH account?
| perNode, err := gpusPerNode(g) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| if c.NumAccelerators%perNode != 0 { | ||
| return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators) | ||
| } |
There was a problem hiding this comment.
I'm off the opinion this kind of check should be done in the backend. @maggiewang-db @ben-hansen-db @vinchenzo-db wdyt? can we do that easily using Training Service logic?
There was a problem hiding this comment.
I think that based on the project milestones and as I discussed with Maggie yesterday, we want to port this in phases. As written in the project doc, we want to first port the run functionality directly as is (including the validation) and then move the validation & add handlers to the backend in milestone 3.
There was a problem hiding this comment.
I agree. But my plan is to do that later in Milestone 3.2 after the initial lift and shift.
It needs some design to decide which validations to move to backend, which validations to keep in client
| case gpuType8xH100: | ||
| return 8, nil | ||
| } | ||
| return 0, fmt.Errorf("invalid GPU type %q", string(g)) |
There was a problem hiding this comment.
Nit: By the time validate() reaches gpusPerNode(), parseGPUType() has already guaranteed g is valid.
It's ok to leave the code as is to be defensive. Just add a comment this shouldn't be reachable.
7af56f3 to
a69e0d3
Compare
Add compute.go: the gpuType model and compute-block validation the upcoming `air run` config layer depends on. Defines the canonical GPU_* accelerator types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts), and computeConfig.validate (positive count, multiple-of-per-node, mutually exclusive node_pool_id/pool_name). Co-authored-by: Isaac
The training compute config no longer supports pool placement, so remove the node_pool_id and pool_name fields and the validation that rejected setting both. Co-authored-by: Isaac
f8477fc to
62be1a1
Compare
Changes
Adds
experimental/air/cmd/compute.go, which is thegpuTypemodel andcomputewhich is the block validation that theair runconfiguration layer depends on.Specifically:
GPU_1xA10,GPU_8xH100,GPU_1xH100)parseGPUTyperesolves a YAML accelerator type stringgpusPerNodeis the per node partition count based on the type namecomputeConfigandvalidate()are the port of the pythonComputeConfigvalidatorsWhy
This is the first, leaf-most piece of the
air runport for the AIR CLI and the root of the config validation layer dependencies. This piece for compute does not depend on anything else so it lands first as a small and fully unit-tested unit.Note that we also use exact case sensitive parsing since a potential typo in the user's YAML could misroute the run. Additionally, we only support
GPU_*training service types (legacy MAPI types (eg.h100_80gb) are no longer supported and intentionally deprecated in this port. However, they still have their own display map for historical runs to be able to be displayed (but no new runs can use the MAPI path). Rendering them in get is unaffected since format.go keeps its own display map for historical runs.Tests
Table-driven unit tests in compute_test.go: parseGPUType for valid types and rejected inputs (wrong casing, legacy types, unknown, empty); gpusPerNode counts plus its invalid-type error; and computeConfig.validate across valid configs and every failure mode (unknown/legacy type, non-positive count, non-multiple count, dual-pool conflict). go build, go test, and golangci-lint are clean.