Only show installable projects in 'databricks labs list'#5560
janniklasrose wants to merge 4 commits into
'databricks labs list' showed every non-archived, non-fork repository in the databrickslabs GitHub org (currently 39), but only repositories that ship a labs.yml manifest at the root of their release tag can actually be installed (currently 8). Everything else failed 'databricks labs install' with a not-found error.

Filter the listing to repositories that have a root labs.yml on their default branch, checked concurrently via raw.githubusercontent.com (not subject to the low unauthenticated GitHub API rate limit) and cached for 24 hours like the repository list itself.

Co-authored-by: Isaac
Approval status: pending
Integration test report
Commit: 14ce2cb
22 interesting tests: 15 SKIP, 7 RECOVERED
Top 25 slowest tests (at least 2 minutes):
simonfaltum left a comment
Reviewed the full diff plus the supporting packages (localcache, cmd/labs/github, clear_cache.go), and ran an independent second-model pass over the same diff; both converged on the same two issues, so requesting changes for those (details inline):
1. `labs clear-cache` does not know about the new cache file.
2. An offline cold start writes an empty installable cache that then sticks for 24h, and (1) means `clear-cache` cannot fix it.
Both fixes are small. Two smaller notes inline (changelog wording given #5559 is still open, and a test nit).
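To make the second issue concrete, here is a minimal sketch of one possible guard: store the refreshed list only when the fetch succeeds, so a failed cold start leaves the cache unpopulated. The type `installableCache`, its methods, and the sentinel `errOffline` are hypothetical names for illustration, and the cache is kept in memory rather than on disk; this is not the code under review.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errOffline is a hypothetical sentinel standing in for a network failure.
var errOffline = errors.New("offline")

// installableCache is a hypothetical in-memory stand-in for the cache file.
type installableCache struct {
	Names     []string
	Refreshed time.Time // zero value means "never populated"
}

// fresh reports whether the cache was populated within the last 24 hours.
func (c *installableCache) fresh() bool {
	return !c.Refreshed.IsZero() && time.Since(c.Refreshed) < 24*time.Hour
}

// refresh stores the fetched names only when fetch succeeds, so a failed
// cold start cannot pin an empty list for the whole cache lifetime.
func (c *installableCache) refresh(fetch func() ([]string, error)) error {
	names, err := fetch()
	if err != nil {
		return fmt.Errorf("refresh installable projects: %w", err)
	}
	c.Names, c.Refreshed = names, time.Now()
	return nil
}

func main() {
	c := &installableCache{}

	// Offline cold start: the error is reported and nothing is cached.
	err := c.refresh(func() ([]string, error) { return nil, errOffline })
	fmt.Println(err, c.fresh())

	// Back online: the next refresh populates the cache.
	err = c.refresh(func() ([]string, error) { return []string{"ucx"}, nil })
	fmt.Println(err, c.Names, c.fresh())
}
```

With a guard like this the next invocation retries the fetch instead of serving an empty list, independently of whether `clear-cache` knows about the file.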
Checked and found sound: the errgroup filter (writes to distinct slice elements, first-error semantics, limit 10, ctx propagation), preserved ordering and archived/fork semantics, graceful offline behavior when caches exist, the raw.githubusercontent choice and its failure mode (failing loudly beats caching a partial list for 24h), no stale-cache hazard on default_branch (it has been in ghRepo since #914, so old on-disk caches have it), and the test design (the blueprint fixture proof in TestListingWorks is a nice touch). Unit tests for cmd/labs, cmd/labs/github, and cmd/labs/localcache pass locally, including a -race run of the new test.
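The filter properties listed above (writes to distinct slice elements, first-error semantics, a limit of 10, preserved ordering) can be illustrated with a small sketch. It deliberately swaps errgroup for the standard library's `sync.WaitGroup` plus a buffered channel to stay dependency-free, and it does not cancel in-flight checks via a context the way the reviewed code is described as doing. `filterConcurrently`, `errProbe`, and the `check` callback are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errProbe is a hypothetical sentinel for a failed manifest check.
var errProbe = errors.New("probe failed")

// filterConcurrently keeps the items for which check returns true and
// preserves input order. At most limit checks run at once (limit must be
// positive), and the first error reported is the one returned.
func filterConcurrently(items []string, limit int, check func(string) (bool, error)) ([]string, error) {
	keep := make([]bool, len(items)) // goroutine i writes only keep[i]
	sem := make(chan struct{}, limit)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for i, item := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit checks are in flight
		go func(i int, item string) {
			defer wg.Done()
			defer func() { <-sem }()
			ok, err := check(item)
			if err != nil {
				once.Do(func() { firstErr = err })
				return
			}
			keep[i] = ok
		}(i, item)
	}
	wg.Wait()
	if firstErr != nil {
		// Fail loudly rather than return (and cache) a partial list.
		return nil, firstErr
	}
	var out []string
	for i, item := range items {
		if keep[i] {
			out = append(out, item)
		}
	}
	return out, nil
}

func main() {
	// Illustrative data only; the map is read concurrently but never written.
	hasManifest := map[string]bool{"blueprint": true, "ucx": true}
	got, err := filterConcurrently([]string{"blueprint", "brickster", "ucx"}, 10,
		func(name string) (bool, error) { return hasManifest[name], nil })
	fmt.Println(got, err)
}
```

Because each goroutine owns exactly one element of `keep`, no mutex is needed for the results, and reading the slice after `wg.Wait()` is race-free.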
This review was written by Isaac, an AI coding agent, with an independent second pass by another model.
I agree that labs list should only list installable projects, and considered the problem back when I addressed the paging issue.
The problem with testing for labs.yml is the additional REST calls that are required:
- Before this PR: 1 request (results cached for 24h)
- After this PR: 1 + N requests, where N is currently 39.
Although this implementation avoids the 60/IP/hour quota on the REST API by hitting the CDN directly, in terms of light-touch I think I'd prefer to filter projects based on repository "topics". I've tagged a few with databricks-cli-installable, for testing purposes. Filtering on this has a few benefits:
- No additional HTTP requests necessary, repository topics are already included in the response we get.
- Caching remains simple.
- On the labs/maintainer side, things become opt-in. At 8/39 I think opt-in is preferable to opt-out.
- On the labs/maintainer side, turning up on `labs list` becomes an admin operation.
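As a rough illustration of the topic-based proposal, the sketch below filters an already-fetched repository listing on the `databricks-cli-installable` topic, with no further HTTP requests. The `repo` struct, `parseRepos`, and `installable` are hypothetical names, and the sample repositories are made up; the JSON field names follow GitHub's repository listing response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"slices"
)

// installableTopic is the topic named in the discussion above.
const installableTopic = "databricks-cli-installable"

// repo holds only the fields this sketch needs from a repository listing.
type repo struct {
	Name     string   `json:"name"`
	Archived bool     `json:"archived"`
	Fork     bool     `json:"fork"`
	Topics   []string `json:"topics"`
}

// parseRepos decodes a JSON array of repositories.
func parseRepos(body string) ([]repo, error) {
	var repos []repo
	if err := json.Unmarshal([]byte(body), &repos); err != nil {
		return nil, fmt.Errorf("decode repositories: %w", err)
	}
	return repos, nil
}

// installable returns the names of non-archived, non-fork repositories
// that carry the opt-in topic, in listing order.
func installable(repos []repo) []string {
	var names []string
	for _, r := range repos {
		if r.Archived || r.Fork {
			continue
		}
		if slices.Contains(r.Topics, installableTopic) {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	// Made-up sample data, not the real org listing.
	body := `[
		{"name": "example-app", "topics": ["databricks-cli-installable"]},
		{"name": "example-lib", "topics": ["python"]},
		{"name": "example-old", "archived": true, "topics": ["databricks-cli-installable"]}
	]`
	repos, err := parseRepos(body)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(installable(repos))
}
```

Since the topics arrive in the same response as the repository list, the existing single cached request would cover both the listing and the filter.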
Before reviewing the technical implementation I'd like to get consensus on this.
P.S. I also rejected using the GraphQL API as a solution to detect the presence of labs.yml: calls need to be authenticated, and the quota system would still make it costly.
@asnare I like the |
…cks#5559)

## Changes

`databricks labs install <name>` only works for repositories in the databrickslabs GitHub org that ship a `labs.yml` manifest at the repository root. Most repositories in the org do not (they are libraries published to CRAN, PyPI, Maven, etc.), and installing one of them fails with:

```
Error: remote: read labs.yml from GitHub: not found
```

which gives the user no clue what is wrong. Detect the not-found case when fetching `labs.yml` and return an actionable error instead:

```
Error: remote: databrickslabs/brickster@v0.2.13 does not provide labs.yml (not found); this project cannot be installed with the Databricks CLI, see https://github.com/databrickslabs/brickster for instructions
```

This also covers projects without any GitHub release, where the version resolves to the literal ref `latest` and the fetch 404s the same way.

While PR databricks#5560 limits `databricks labs list` to those that are installable, users can still provide a lab name not in that list. Thus, this PR is still relevant for graceful error handling.

## Tests

New unit test simulating a project whose release tag has no `labs.yml` (`cmd/labs/project/fetcher_test.go`). Verified live: `databricks labs install brickster` produces the message above.

This pull request and its description were written by Isaac, an AI coding agent.
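The not-found handling described in that commit message could look roughly like the sketch below. It assumes the fetch layer exposes a sentinel error for a 404; `errNotFound` and `explainMissingManifest` are hypothetical names, not the CLI's actual identifiers, and the `remote:` prefix is assumed to be added by the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for a 404 from the manifest fetch.
var errNotFound = errors.New("not found")

// explainMissingManifest turns a not-found fetch error into an actionable
// message and passes every other error (including nil) through unchanged.
func explainMissingManifest(err error, org, name, version string) error {
	if !errors.Is(err, errNotFound) {
		return err
	}
	return fmt.Errorf("%s/%s@%s does not provide labs.yml (%w); "+
		"this project cannot be installed with the Databricks CLI, "+
		"see https://github.com/%s/%s for instructions",
		org, name, version, err, org, name)
}

func main() {
	err := explainMissingManifest(errNotFound, "databrickslabs", "brickster", "v0.2.13")
	fmt.Println("Error: remote:", err)
}
```

Wrapping with `%w` keeps the original error in the chain, so callers can still test for the not-found case with `errors.Is`.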
Changes
`databricks labs list` showed every non-archived, non-fork repository in the databrickslabs GitHub org (currently 39), but only repositories that ship a `labs.yml` manifest can actually be installed (currently 8: blueprint, dlt-meta, dqx, lakebridge, lsql, pylint-plugin, sandbox, ucx). Picking anything else from the list failed `databricks labs install` with `Error: remote: read labs.yml from GitHub: not found` (error message improved separately in #5559), so the listing mostly advertised projects that cannot be installed.

With this change, the filter is narrowed to labs that are tagged as `databricks-cli-installable`, currently 3.

Output before (39 entries, abridged) / after (3 entries):
Tests
Unit tests
This pull request and its description were written by Isaac, an AI coding agent.