docs: deduplicate per-language images into a shared build-time pool by grandixximo · Pull Request #4157 · LinuxCNC/linuxcnc

grandixximo · 2026-06-12T12:06:53Z

The HTML build copied every image into all 8 language trees (~324MB, mostly
duplicates). This dedups during the build instead of in a post-build pass, as
plain files (no symlinks).

The image resolver, previously PDF-only, now runs for HTML too and rewrites each
image src (and click-to-enlarge link) to a shared pool: generic => image/,
localized => /image/, where is the path relative to
docs/src. .html-images-stamp copies each image there once; all languages share the
one image/ tree. The CI-only dedup-docs.py pass is removed.

Full 8-language build: tree ~324MB => ~120MB (tarball ~39MB), zero image symlinks,
checkref passes for every language, second build is a no-op.

Scope: build-side dedup only. The source-tree hygiene raised in review (stray
placement, byte-identical and multi-resolution copies, localized variants orphaned
by naming) needs per-file judgement and moves translation strings, so it is a
follow-up.

BsAtHome · 2026-06-12T12:10:57Z

I went with a post-build pass rather than doing this in image_resolver.rb.

When it works as intended without too much extra work,... why not. I trust you have weighed the options and went for the better one :-)

I'll have a look, later.

grandixximo · 2026-06-13T02:30:07Z

Thanks. FWIW I did spend a fair bit of time weighing the build-integrated alternative before settling on the post-build pass.

This PR adds just one file and it is readable, but I agree the reason it has to exist is ugly: the build creates the duplicates and then this cleans them up.

The real alternative is not only refactoring the Ruby resolver but also shifting the build from zip to tar, so the resolver can place symlinks and have them survive. That preserves the dedup all the way through (artifact, deb, fetch), but it is a larger blast radius and would need coordination with @hdiethelm.

If we agree on that shape instead, the result is a more elegant structure, and I think genuinely better. Happy to go that way if you and @hdiethelm are on board.

andypugh · 2026-06-17T19:13:00Z

A bit late to think of this, but wouldn't Python be a good choice for the script to pull down the docs?
(I just checked, we do have Python)

I commented in the wrong thread. I was meaning to use Python on the web server rather than bash / php

hdiethelm · 2026-06-17T19:31:27Z

Sorry, I just see this now. Hmm, does this 300MB hurt anyone? Otherwise, we could just leave it as it is.

The best solution of course would be to build the docs in a way that common images are in a common folder and the html links are correct at build time, not just clean it up afterwards. But this might be unreasonable complex, I have no clue how this doc build works.

If you want to go with symlinks:

https://github.com/actions/upload-artifact can do zip or raw files. So I should be able to use tar to create an archive in CI and then upload this as raw file. Should I try that in #4150?

hdiethelm · 2026-06-17T20:19:36Z

I went for .tar.gz anyway, so you are free to do what you need.

If the size is a real issue for the homepage server but nowhere else, an option would be to run your dedup scrip in CI only for the artifact but not in the makefile. You can easily add another step before the tar step. If it's CI only, the right place for the script is in .github/scripts/ so the tree is not messed up with CI scripts.

That would be something like:

    - name: Dedup linuxcnc-doc
      run: |
        set -x
        python3 .github/scipts/dedup-docs.py docs/build/html

grandixximo · 2026-06-18T06:55:28Z

Went with your CI-only approach. The dedup is now .github/scripts/dedup-docs.py with no Makefile wiring, so the build itself is untouched and only the packaged artifact gets deduped.

On a full 8-language build it takes the tree from ~324MB to ~118MB (227MB of duplicate images down to 28.8MB), self-verifies that every image reference still resolves after the rewrite, and is idempotent.

This PR is only the script. Could you add the invocation before the tar step in #4150?

- name: Deduplicate doc images
  run: python3 .github/scripts/dedup-docs.py docs/build/html --apply

One ordering note: that step needs the script on master first, else the CI build (your branch merged onto master) cannot find .github/scripts/dedup-docs.py and fails. So merge #4157 first, then rebase here and add the line. Or guard it with [ -f .github/scripts/dedup-docs.py ] to avoid depending on order.

hdiethelm · 2026-06-18T10:52:33Z

This working on different branches on the same topic is anoying. Best you rebase on top of my branch and do the CI change yourself. If yours get merged first, my changes are in and I close mine. If mine gets merged first, you can rebase to master.

grandixximo · 2026-06-18T13:47:00Z

@hdiethelm since #4150 merged, the tar step is on master, so I added the dedup call to the CI in this PR instead of asking you to wire it. It runs in the htmldocs job after the build and before the tar, so only the packaged artifact is deduplicated and the build itself stays untouched.

BsAtHome · 2026-06-18T14:06:50Z

Deduplication after the fact is wasteful. First you create a lot of copies just to do a lot of work to get rid of them again. That is working backwards and fixing the symptoms of a "maybe-not-so-good" choice that caused the problem.

The better approach, IMO, is not to create the duplicates in the first place. You already have to traverse all the links for the languages to match a translated image and use it instead of the generic/English version. I have a hard time understanding why this extra work after the fact is better than prevention?

grandixximo · 2026-06-18T14:19:00Z

It will be a bigger PR, larger blast radius, I was settling for good enough with Hannes proposal...
Will reshape as we discussed before, we should be ok with symlinks right?

hdiethelm · 2026-06-18T20:11:53Z

It all depends if this 300MB is a real issue on the doc server or just "not nice". If it is a real issue, a quickfix like yours is fine. If not, I am with @BsAtHome, rather invest the time to really fix it.

Or also just don't invest time at all, I just checked the debian packages, they don't contain the doc's in html. So it might hurt the server a bit and takes a few seconds more to download but not worth the effort to really do anything about, especially thinking about the people that have to fix this script in a few years from now because something broke.

BsAtHome · 2026-06-18T21:06:17Z

It all depends if this 300MB is a real issue on the doc server or just "not nice". If it is a real issue, a quickfix like yours is fine. If not, I am with @BsAtHome, rather invest the time to really fix it.

It isn't really about whether the space is available or not. It is IMO bad design to copy files indiscriminately. Even more so when we are talking large binary files like images. Blind copying does not scale well and never does. It is also a huge data hog on caching proxies. Reducing your footprint makes things faster for the user. If we need to build 10 seconds longer for that to be accomplished, so be it.
For the download speed, well, that is a regular problem in my experience. Cutting a large part of that redundant download is a Good Thing^TM and reduces potential problems. Smaller overall size is fewer problems.

My original directory layout had imagined all those images together once outside the language directories and only the localized image versions would be part of the language directory. At least that is what I envisioned. I should have checked.

I do not think that symlinking it all is the solution because a server may not have enabled "followsymlinks". Symlinks in the html directory should be avoided and are not necessary. We need to adjust the paths in the docs correctly when they are generated.

This is what I originally had in mind:

html/
    index.html
    css/
        xyz.css
        ...
    images/
        code/
            image.png
        common/
        config/
        ...
    <lang>/
        index.html
        images/
            code/
                image-<lang>.png
            common/
            config/
        code/
            x.html
            y.html
            ....
        common/
            ....
        config/
            ....
        ....

A complete set is the collection of the directories: {css/, images/, [<lang>/]+} is what I envisioned. (empty directories should be pruned)

And:

some image placement in the source-tree needs to be fixed for a proper generic build script without tons of exceptions.
the some source images may be identical between documents (like a logo) but the directory where they reside is different (see also next point). This needs to be tackled too. This may possibly be handled in the stream handler
some images are duplicated in the source tree in (some in different resolutions) and that is unfortunate and not very smart

The HTML build copied every referenced image into all 8 language trees, so the built tree was ~324MB, mostly duplicate image bytes. The image resolver, previously used only for the PDF build, now runs for HTML too and rewrites every image src (and click-to-enlarge link) to a shared pool: a generic image to image/<srcrel> and a localized foo_<lang> variant to <lang>/image/<srcrel>, where <srcrel> is the path relative to docs/src. .html-images-stamp then materialises the pool as real files, one copy per unique image. No post-build pass and no symlinks (a server may lack FollowSymLinks). The 8-language tree drops to ~120MB. The page language is passed explicitly as lcnc-lang to both the HTML and PDF builds: the Asciidoctor 2.0 CLI has no sourcemap option and the render-tree path does not encode the language, so the previous path-based detection was unreliable. This replaces the CI-only dedup-docs.py pass, which is removed.

grandixximo · 2026-06-19T03:58:12Z

You are right that deduplicating after the fact is backwards. Reworked: the dedup
now happens during the build, output is plain files, no post-build pass and no
symlinks (your FollowSymLinks point). Layout is essentially the one you sketched,
and keying on the source-relative path means no per-case exceptions in the build.
Details in the PR description.

The rest is source-tree hygiene needing per-file judgement, so I would prefer a
separate PR after this lands: the 9 byte-identical cross-doc groups, the
multi-resolution copies, and one I noticed: ~119 localized _fr/_es variants are
never used because the adoc points at the plain name (e.g. pncconf-basic.png) not
pncconf-basic_en.png, so the swap can never reach them. Happy to fold it in if you
prefer, but it moves a lot of translation strings.

grandixximo mentioned this pull request Jun 12, 2026

Devel Docs not showing at linuxcnc.org/docs/devel/html/ #4152

Open

grandixximo force-pushed the docs-image-dedup branch from c3832e4 to f535a17 Compare June 18, 2026 06:54

grandixximo changed the title ~~docs: deduplicate built HTML images into a shared image/ tree~~ docs(ci): deduplicate doc-artifact images into a shared image/ tree Jun 18, 2026

grandixximo force-pushed the docs-image-dedup branch from f535a17 to 1a9eddb Compare June 18, 2026 08:32

grandixximo assigned hdiethelm and unassigned hdiethelm Jun 18, 2026

grandixximo force-pushed the docs-image-dedup branch from 1a9eddb to 2535f0d Compare June 18, 2026 13:44

grandixximo changed the title ~~docs(ci): deduplicate doc-artifact images into a shared image/ tree~~ docs(ci): deduplicate doc images before packaging the artifact Jun 18, 2026

grandixximo marked this pull request as draft June 18, 2026 14:15

grandixximo force-pushed the docs-image-dedup branch from 2535f0d to 4ae2cd5 Compare June 19, 2026 03:57

grandixximo changed the title ~~docs(ci): deduplicate doc images before packaging the artifact~~ docs: deduplicate per-language images into a shared build-time pool Jun 19, 2026

Conversation

grandixximo commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BsAtHome commented Jun 12, 2026

Uh oh!

grandixximo commented Jun 13, 2026

Uh oh!

andypugh commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hdiethelm commented Jun 17, 2026

Uh oh!

hdiethelm commented Jun 17, 2026

Uh oh!

grandixximo commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hdiethelm commented Jun 18, 2026

Uh oh!

grandixximo commented Jun 18, 2026

Uh oh!

BsAtHome commented Jun 18, 2026

Uh oh!

grandixximo commented Jun 18, 2026

Uh oh!

hdiethelm commented Jun 18, 2026

Uh oh!

BsAtHome commented Jun 18, 2026

Uh oh!

grandixximo commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

grandixximo commented Jun 12, 2026 •

edited

Loading

andypugh commented Jun 17, 2026 •

edited

Loading

grandixximo commented Jun 18, 2026 •

edited

Loading