AIR CLI Integration: air run Command Pt. 1 - Add GPU accelerator type and compute config model #5602

Open
riddhibhagwat-db wants to merge 6 commits into air-integration-m1-1 from air-integration-m2-1

Conversation

@riddhibhagwat-db

Changes

Adds experimental/air/cmd/compute.go, which contains the gpuType model and the compute-block validation that the air run configuration layer depends on.
Specifically:

  • the training service accelerator types were added (GPU_1xA10, GPU_8xH100, GPU_1xH100)
  • parseGPUType resolves a YAML accelerator type string
  • gpusPerNode is the per node partition count based on the type name
  • computeConfig and validate() are the port of the python ComputeConfig validators
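
For readers without the diff open, the model described above might look roughly like the sketch below. It is illustrative only and not the code in compute.go: the constant names other than gpuType8xH100, the parseGPUType signature and error text, and the per-node count of 1 for the 1x types are assumptions inferred from the type names.

```go
package main

import "fmt"

// gpuType is a training-service accelerator type name as written in YAML.
type gpuType string

// Only gpuType8xH100 appears in the review snippets; the other two
// constant names are assumptions made for this sketch.
const (
	gpuType1xA10  gpuType = "GPU_1xA10"
	gpuType1xH100 gpuType = "GPU_1xH100"
	gpuType8xH100 gpuType = "GPU_8xH100"
)

// parseGPUType resolves a YAML accelerator type string using an exact,
// case-sensitive match. Anything else, including legacy names such as
// "h100_80gb", is rejected. The signature is an assumption.
func parseGPUType(s string) (gpuType, error) {
	switch g := gpuType(s); g {
	case gpuType1xA10, gpuType1xH100, gpuType8xH100:
		return g, nil
	}
	return "", fmt.Errorf("unsupported accelerator type %q", s)
}

// gpusPerNode returns the per-node GPU count implied by the type name.
// The value 1 for the 1x types is inferred from the names.
func gpusPerNode(g gpuType) (int, error) {
	switch g {
	case gpuType1xA10, gpuType1xH100:
		return 1, nil
	case gpuType8xH100:
		return 8, nil
	}
	// Should not be reachable for values produced by parseGPUType.
	return 0, fmt.Errorf("invalid GPU type %q", string(g))
}

func main() {
	for _, s := range []string{"GPU_8xH100", "gpu_8xh100", "h100_80gb", ""} {
		g, err := parseGPUType(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		n, _ := gpusPerNode(g)
		fmt.Printf("%s -> %d GPU(s) per node\n", g, n)
	}
}
```

Only the first input is accepted here; the wrong-case, legacy, and empty strings all take the rejection path.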

Why

This is the first, leaf-most piece of the air run port for the AIR CLI and the root of the config validation layer's dependencies. The compute piece does not depend on anything else, so it lands first as a small, fully unit-tested unit.
Note that parsing is exact and case-sensitive, since a typo in the user's YAML could otherwise misroute the run. Additionally, we only support GPU_* training service types. Legacy MAPI types (e.g. h100_80gb) are intentionally not supported in this port, so no new runs can use the MAPI path. Rendering them in get is unaffected, since format.go keeps its own display map for historical runs.

Tests

Table-driven unit tests in compute_test.go: parseGPUType for valid types and rejected inputs (wrong casing, legacy types, unknown, empty); gpusPerNode counts plus its invalid-type error; and computeConfig.validate across valid configs and every failure mode (unknown/legacy type, non-positive count, non-multiple count, dual-pool conflict). go build, go test, and golangci-lint are clean.
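
As a rough illustration of the validation rules and the table-driven shape described above, here is a standalone sketch. It is not the contents of compute_test.go: the real tests presumably use the testing package, and the perNode map, the order of checks, and every error string except the multiple-of message (quoted from the review thread below) are assumptions. Pool fields are left out.

```go
package main

import "fmt"

// computeConfig mirrors the two fields visible in the review snippets.
type computeConfig struct {
	AcceleratorType string
	NumAccelerators int
}

// perNode stands in for parseGPUType + gpusPerNode (assumed counts).
var perNode = map[string]int{"GPU_1xA10": 1, "GPU_1xH100": 1, "GPU_8xH100": 8}

// validate rejects unknown types, non-positive counts, and counts that
// are not a multiple of the per-node GPU count.
func (c computeConfig) validate() error {
	n, ok := perNode[c.AcceleratorType]
	if !ok {
		return fmt.Errorf("unsupported accelerator type %q", c.AcceleratorType)
	}
	if c.NumAccelerators <= 0 {
		return fmt.Errorf("compute.num_accelerators must be positive, got %d", c.NumAccelerators)
	}
	if c.NumAccelerators%n != 0 {
		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d",
			c.AcceleratorType, n, c.NumAccelerators)
	}
	return nil
}

// failures runs the case table and describes every case whose outcome
// differs from the expectation; an empty result means all cases agree.
func failures() []string {
	cases := []struct {
		name    string
		cfg     computeConfig
		wantErr bool
	}{
		{"valid single GPU", computeConfig{"GPU_1xA10", 1}, false},
		{"valid two nodes", computeConfig{"GPU_8xH100", 16}, false},
		{"wrong casing", computeConfig{"gpu_8xh100", 8}, true},
		{"legacy type", computeConfig{"h100_80gb", 8}, true},
		{"zero count", computeConfig{"GPU_1xH100", 0}, true},
		{"negative count", computeConfig{"GPU_1xH100", -1}, true},
		{"non-multiple count", computeConfig{"GPU_8xH100", 12}, true},
	}
	var out []string
	for _, tc := range cases {
		err := tc.cfg.validate()
		if (err != nil) != tc.wantErr {
			out = append(out, fmt.Sprintf("%s: err=%v, wantErr=%v", tc.name, err, tc.wantErr))
		}
	}
	return out
}

func main() {
	if f := failures(); len(f) > 0 {
		fmt.Println("unexpected outcomes:", f)
		return
	}
	fmt.Println("all cases behave as expected")
}
```

Keeping each failure mode as one row makes it cheap to add a case when a validator moves or changes.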

Implement the read-only run-details command (renamed from `status` to `get`).
It fetches a job run via the Jobs API and renders the run's status, start time,
duration, retries, experiment, accelerators, dashboard URL, MLflow deep-link,
and a foreach/sweep summary. Output is the air-style {v, ts, data} JSON envelope
under -o json, or a text view.

Renames the command-level identifiers (status -> get) while keeping the run's
"status" field/label. Adds format/mlflow/sweep/output helpers with unit tests
and an acceptance test, and drops `get` from the not-implemented stub coverage.

Co-authored-by: Isaac
@eng-dev-ecosystem-bot

eng-dev-ecosystem-bot commented Jun 15, 2026

Collaborator

Integration test report

Commit: 9efd3d1

Run: 27724619645

| Env | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 💚 aws linux | | 7 | 14 | 264 | 998 | 6:41 |
| 💚 aws windows | | 7 | 14 | 266 | 996 | 8:00 |
| 💚 aws-ucws linux | | 7 | 14 | 360 | 912 | 6:22 |
| 💚 aws-ucws windows | | 7 | 14 | 362 | 910 | 9:00 |
| 💚 azure linux | | 1 | 16 | 267 | 996 | 7:02 |
| 💚 azure windows | | 1 | 16 | 269 | 994 | 7:12 |
| 💚 azure-ucws linux | | 1 | 16 | 365 | 908 | 8:10 |
| 🔄 azure-ucws windows | 2 | 1 | 16 | 365 | 906 | 8:23 |
| 💚 gcp linux | | 1 | 16 | 263 | 999 | 7:00 |
| 💚 gcp windows | | 1 | 16 | 265 | 997 | 9:11 |
23 interesting tests: 14 SKIP, 7 RECOVERED, 2 flaky
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
💚​ TestAccept 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestFsCpFileToFile ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p
🔄​ TestFsCpFileToFile/local_to_uc-volumes 🙈​s 🙈​s ✅​p ✅​p 🙈​s 🙈​s ✅​p 🔄​f 🙈​s 🙈​s
Top 29 slowest tests (at least 2 minutes):
duration env testname
4:10 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:10 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:00 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:28 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:26 aws-ucws windows TestAccept
3:22 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:21 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:19 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04 azure linux TestSecretsPutSecretStringValue
3:04 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:04 azure-ucws linux TestSecretsPutSecretStringValue
3:02 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:00 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:59 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:48 azure windows TestAccept
2:47 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:46 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:45 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:40 azure-ucws windows TestAccept
2:39 gcp windows TestAccept
2:38 aws windows TestAccept
2:34 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:34 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:24 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:13 gcp linux TestSecretsPutSecretStringValue
2:02 aws linux TestSecretsPutSecretStringValue

Comment thread experimental/air/cmd/compute.go Outdated
Comment on lines +58 to +59
NodePoolID string `yaml:"node_pool_id"`
PoolName string `yaml:"pool_name"`

Hey let's leave out any pool related features from Go port. cc @ben-hansen-db @maggiewang-db I'd cc Yu Peng but he doesn't have a -db GH account?

Author

Removed, thanks!

Comment on lines +73 to +79
perNode, err := gpusPerNode(g)
if err != nil {
	return err
}
if c.NumAccelerators%perNode != 0 {
	return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", c.AcceleratorType, perNode, c.NumAccelerators)
}

I'm of the opinion this kind of check should be done in the backend. @maggiewang-db @ben-hansen-db @vinchenzo-db wdyt? Can we do that easily using Training Service logic?

Author

I think that based on the project milestones and as I discussed with Maggie yesterday, we want to port this in phases. As written in the project doc, we want to first port the run functionality directly as is (including the validation) and then move the validation & add handlers to the backend in milestone 3.

I agree. But my plan is to do that later in Milestone 3.2 after the initial lift and shift.
It needs some design to decide which validations to move to backend, which validations to keep in client

sounds good

case gpuType8xH100:
	return 8, nil
}
return 0, fmt.Errorf("invalid GPU type %q", string(g))

Nit: By the time validate() reaches gpusPerNode(), parseGPUType() has already guaranteed g is valid.
It's ok to leave the code as is to be defensive. Just add a comment this shouldn't be reachable.

Author

Added comment, thanks

Add compute.go: the gpuType model and compute-block validation the upcoming
`air run` config layer depends on. Defines the canonical GPU_* accelerator
types, parseGPUType (exact, case-sensitive), gpusPerNode (partition counts),
and computeConfig.validate (positive count, multiple-of-per-node, mutually
exclusive node_pool_id/pool_name).

Co-authored-by: Isaac

The training compute config no longer supports pool placement, so remove the
node_pool_id and pool_name fields and the validation that rejected setting both.

Co-authored-by: Isaac
@riddhibhagwat-db riddhibhagwat-db changed the base branch from air-integration-m1-1 to air-cli June 17, 2026 22:39
@riddhibhagwat-db riddhibhagwat-db changed the base branch from air-cli to air-integration-m1-1 June 17, 2026 22:40
4 participants