Make it easier for contributors to update Docker images #15332

Closed
Tracked by #16203
ScottTodd opened this issue Oct 27, 2023 · 9 comments · Fixed by #18566
Assignees: ScottTodd
Labels: infrastructure (Relating to build systems, CI, or testing)

@ScottTodd
Member

There is some background context in this Discord discussion:

> as with most OSS projects, a push to the Dockerfile on main triggers a rebuild on an ephemeral runner and uses the project token to push to the GitHub package repo without more side-channel auth stuff. The label is based on the branch, so if I push to "test-my-change", then that gets pushed to a label of "test-my-change". Basically, the security perimeter ends up defined as who has write access to the repo and is traceable to commits. This seems superior in every way. We could also configure GA to push to another container registry, but we would need to manage the secrets, etc. (and we'd be right back to where we started).
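For reference, that pattern looks roughly like the workflow below. This is a minimal sketch, not IREE's actual configuration; the Dockerfile path and image name are hypothetical.

```yaml
# Sketch: rebuild and push on changes to a Dockerfile on main, authenticating
# to GitHub's container registry with the built-in GITHUB_TOKEN (no
# side-channel secrets). Paths and names here are illustrative.
name: Publish Docker image
on:
  push:
    branches: [main]
    paths: ["docker/base.Dockerfile"]
permissions:
  packages: write
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push, tagging with the branch name
        run: |
          docker build -f docker/base.Dockerfile \
            -t ghcr.io/${{ github.repository_owner }}/base:${{ github.ref_name }} .
          docker push ghcr.io/${{ github.repository_owner }}/base:${{ github.ref_name }}
```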

We have ~17 Dockerfiles in the main repo that require authentication (Storage Admin role in the iree-oss GCP project) to update: https://github.com/openxla/iree/tree/main/build_tools/docker. It would be much more convenient for contributors and maintainers if any approved commit could update the images used.

We should see what other projects are doing and modify our setup to be less bespoke.

Possible requirements:

  • Contributors can test that a Dockerfile builds
  • Contributors can test that a Dockerfile does what is expected in a CI workflow
  • Postsubmit uses the latest build of all images from checked-in code (our current solution uses a manifest file for this, which can get out of sync)

TBD how much scripting/automation is needed for this.
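The first requirement needs very little scripting: a presubmit check can build the Dockerfile without pushing it anywhere. A minimal sketch, with a hypothetical path:

```yaml
# Sketch: verify on pull requests that a changed Dockerfile still builds.
name: Test Dockerfile build
on:
  pull_request:
    paths: ["docker/**.Dockerfile"]
jobs:
  docker-build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image (no push)
        run: docker build -f docker/base.Dockerfile -t test-build:pr .
```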

ScottTodd added the infrastructure label on Oct 27, 2023
@stellaraccident
Collaborator

I just set up a GH action to update them when the file changes, and they push to GH's docker repository. And then we could just move IREE's docker image building over there.
It also seems like we should have fewer docker images, but I've not looked closely in quite some time.

ScottTodd added a commit that referenced this issue Dec 7, 2023
The associated workflow was only run once?
https://github.com/openxla/iree/actions/workflows/android_tflite_oneshot_build.yml

This is a rather large image too:
```
scotttodd:~$ docker image inspect gcr.io/iree-oss/gradle-android@sha256:cf7bf0392d5125f2babb4b9de4b43b583220506ecebd6b6201b23b2575f671c0 | grep Size
        "Size": 11533931968,
        "VirtualSize": 11533931968,
```

Progress towards #15332 (fewer
Dockerfiles to port/maintain)
ScottTodd self-assigned this on Dec 7, 2023
ScottTodd added a commit that referenced this issue Jan 10, 2024
Fixes #15299. This removes
SwiftShader-flavored Dockerfiles from this repository and switches all
uses across the project to use the non-SwiftShader equivalent images.

SwiftShader is a CPU implementation of the Vulkan API which we have been
using for min-spec coverage and testing on devices that lack a physical
GPU. We have CI workflows on most platforms we support that use real
hardware and those builds are reliable enough to drop the
CPU/SwiftShader coverage.

Vulkan tests are still included on presubmit using NVIDIA Tesla T4 GPUs
in the `test_gpu` and `test_tf_integrations_gpu` jobs. Postsubmit jobs
using NVIDIA A100 GPUs and Android phones (pixel-6-pro and
moto-edge-x30) also run Vulkan tests.

Removing these extra Dockerfiles will also help us with the planned
refactoring tracked in #15332.

---

Follow-up work _not_ completed as part of this PR:

- [ ] Add min-spec coverage via profiles
ScottTodd added a commit that referenced this issue Jan 18, 2024
Fixes #15623. If the workflow is
still useful, then (IMO) it should find another repository to live in.

Progress towards #15332 (fewer
Dockerfiles to port/maintain)
ScottTodd added a commit that referenced this issue Feb 7, 2024
…16346)

Progress on #16203 and
#15332

At the point where a job is installing a multi-gigabyte Docker image, it
might as well just install Python requirements like TF directly.
Switching to install from pip loses some control over the supply chain
but I think that is fine for these test/benchmark jobs.

Comparing [`build_e2e_test_artifacts`
before](https://github.com/openxla/iree/actions/runs/7815739382/job/21320388848)
to [`build_e2e_test_artifacts`
after](https://github.com/openxla/iree/actions/runs/7818347765/job/21328712850?pr=16346)
(sample size 1):
* Docker fetch time decreased from 1m50s to 30s
  * 'frontends' depended on 'android' so it included the NDK too 😛
* Python setup (including pip install) time increased from 6s to 1m20s

So about the same time taken, just using less cloud storage / infra
complexity.
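As a sketch of what that swap can look like in a workflow (the requirements file path here is hypothetical):

```yaml
# Sketch: install the Python dependencies directly instead of baking them
# into a multi-gigabyte Docker image.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.11"
  - name: Install test requirements from pip
    run: pip install -r integrations/tensorflow/test/requirements.txt
```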
@ScottTodd
Member Author

As part of moving more jobs from ci.yml to pkgci.yml, I've been chipping away at our use of Docker images. Better to not use Docker at all if we can avoid it.

@ScottTodd
Member Author

Seems like we could fork https://github.com/nod-ai/base-docker-images into iree-org and iterate from there. We can keep the original repo in nod-ai so existing packages continue to exist, and new dockerfiles/images specific to that GitHub org can be developed.

ScottTodd added a commit that referenced this issue Aug 14, 2024
Follow-up to #18144. Related to
#15332.

* `build_all.yml` was used as the first step in multiple other
workflows. New workflows are using `pkgci_build_packages.yml` directly
or nightly releases. Workflows could also use historical artifacts from
`pkgci_build_packages.yml` if they want to use versions different from
the nightly releases.
* `android.Dockerfile` was used for Android builds and benchmarks. New
workflows install the NDK on demand without needing a large Dockerfile.
* `nvidia.Dockerfile` and `nvidia-bleeding-edge.Dockerfile` were used
for CUDA/Vulkan benchmarks. New workflows rely on the drivers and
software packages that are already installed on runners. We could have
workflows install on demand or add new Dockerfiles as needed.
ScottTodd added a commit that referenced this issue Aug 16, 2024
…18252)

Progress on #15332 and
#18238.

The
[`build_tools/docker/docker_run.sh`](https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh)
script does a bunch of weird/hacky setup, including setup for `gcloud`
(for working with GCP) and Bazel-specific Docker workarounds. Most CMake
builds can just use a container for the entire workflow
(https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container).
Note that GitHub in its infinite wisdom changed the default shell _just_
for jobs that run in a container, from `bash` to `sh`, so we flip it
back.

These jobs run nightly on GitHub-hosted runners, so I tested here:
*
https://github.com/iree-org/iree/actions/runs/10396020082/job/28789218696
*
https://github.com/iree-org/iree/actions/runs/10422541951/job/28867245589

(Those jobs should also run on this PR, but they'll take a while)

skip-ci: no impact on other workflows
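A minimal sketch of the resulting job shape (the image name is illustrative, not necessarily what this PR used; the `defaults` block is the shell fix described above):

```yaml
# Run the whole job inside the container and restore bash as the default
# shell, since GitHub defaults container jobs to sh.
jobs:
  build:
    runs-on: ubuntu-latest
    container: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
    defaults:
      run:
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Build with CMake inside the container
        run: ./build_tools/cmake/build_all.sh
```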
ScottTodd added a commit that referenced this issue Aug 19, 2024
Progress on #15332 and
#18238.

Similar to #18252, this drops a
dependency on the
[`build_tools/docker/docker_run.sh`](https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh)
script. Unlike that PR, this goes a step further and also stops using
[`build_tools/cmake/build_all.sh`](https://github.com/iree-org/iree/blob/main/build_tools/cmake/build_all.sh).

Functional changes:
* No more building `iree-test-deps`
* We only get marginal value out of compiling test files using a debug
compiler
* Those tests are on the path to being moved to
https://github.com/iree-org/iree-test-suites
* No more ccache
* The debug build cache is too large for a local / GitHub Actions cache
* I want to limit our reliance on the remote cache at
`http://storage.googleapis.com/iree-sccache/ccache` (which uses GCP for
storage and needs GCP auth)
* Experiments show that this build is not significantly faster when
using a cache, or at least dropping `iree-test-deps` provides equivalent
time savings

Logs before:
https://github.com/iree-org/iree/actions/runs/10417779910/job/28864909582
(96% cache hits, 9 minute build but 19 minutes total, due to
`iree-test-deps`)
Logs after:
https://github.com/iree-org/iree/actions/runs/10423409599/job/28870060781?pr=18255
(no cache, 11 minute build)

ci-exactly: linux_x64_clang_debug

---------

Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
@ScottTodd
Member Author

> Seems like we could fork https://github.com/nod-ai/base-docker-images into iree-org and iterate from there. We can keep the original repo in nod-ai so existing packages continue to exist, and new dockerfiles/images specific to that GitHub org can be developed.

Sent an RFC to fork that repo into iree-org: https://groups.google.com/g/iree-discuss/c/IPLzMsPb5UI

ScottTodd added a commit that referenced this issue Aug 26, 2024
The workflows using this had been disabled. A new workflow was added in
#18274 that does not use this
Dockerfile, opting to instead run the steps contained in the file
directly.

Remaining "uses" are here, and those will be ported as-needed:
https://github.com/iree-org/iree/blob/e3936dca933893a8849195989db8b9e5a0893316/.github/workflows/ci.yml#L249-L263

Progress on #15332 (one less
Dockerfile to port)
ScottTodd added a commit that referenced this issue Aug 26, 2024
)

Progress on #15332.

See the RFC: https://groups.google.com/g/iree-discuss/c/IPLzMsPb5UI. I
have forked https://github.com/nod-ai/base-docker-images/ into
https://github.com/iree-org/base-docker-images.

Now, dockerfiles are built and published as packages in the iree-org
namespace using GitHub's container registry. Future changes will migrate
what remains in
https://github.com/iree-org/iree/tree/main/build_tools/docker.
@ScottTodd
Member Author

ScottTodd commented Aug 29, 2024

I've almost finished switching workflows to using dockerfiles hosted at https://github.com/iree-org/base-docker-images/.

Here's what's left:

  • build_test_all_bazel in ci.yml uses gcr.io/iree-oss/base-bleeding-edge
  • build:remote_cache_bazel_ci in iree.bazelrc uses gcr.io/iree-oss/base-bleeding-edge in the cache key value
  • publish_website in publish_website.yml uses gcr.io/iree-oss/base
  • web in samples.yml uses gcr.io/iree-oss/emscripten
  • linux_arm64_clang in ci_linux_arm64_clang.yml uses gcr.io/iree-oss/base-arm64

@ScottTodd
Member Author

For Bazel, we could try using https://bazel.build/install/docker-container (e.g. gcr.io/bazel-public/bazel:latest). That wouldn't have any of our other build deps or configurations for ramdisks, remote caches, etc. baked in though... if any of that is still critical.

@ScottTodd
Member Author

Actually, never mind RE: Bazel...? The source for that is https://github.com/bazelbuild/continuous-integration/blob/master/bazel/oci/Dockerfile and it's just for running Bazel. The entrypoint is hardcoded to /usr/local/bin/bazel, when I think we really just want a general-purpose container with software installed on it.

ScottTodd added a commit that referenced this issue Sep 3, 2024
Following iree-org/base-docker-images#6, the new
cpubuilder dockerfile should have all the software needed for ASan and
TSan building + testing (specifically `clang-19` instead of just
`clang-14`).

Progress on #15332. The only
remaining uses of `gcr.io/iree-oss/base.*` are:

* `build_test_all_bazel` uses `gcr.io/iree-oss/base-bleeding-edge`
* `publish_website` uses `gcr.io/iree-oss/base`
* arm64 workflows use `gcr.io/iree-oss/base-arm64`
* `gcr.io/iree-oss/emscripten` (used by web test workflows) depends on
`gcr.io/iree-oss/base`
ScottTodd added a commit that referenced this issue Sep 3, 2024
Progress on #15332 - one less
workflow needing the `gcr.io/iree-oss/base` dockerfile and a ccache
storage on GCP.

Tested on my fork:
* Cold cache (5m30s):
https://github.com/ScottTodd/iree/actions/runs/10690774251/job/29635889592
* Warm cache (3m30s):
https://github.com/ScottTodd/iree/actions/runs/10690871405/job/29636198158

skip-ci: no impact on other workflows
ScottTodd added a commit that referenced this issue Sep 3, 2024
)

Also delete the now unused `emscripten.Dockerfile` (it's technically
still referenced in a commented-out `cross_compile_and_test` build, but
that can be added back as needed using this same technique).

Progress on #15332 - one less
Dockerfile to maintain.

Tested here:
https://github.com/ScottTodd/iree/actions/runs/10691087132/job/29636852886
(this samples.yml workflow runs on a nightly schedule).

skip-ci: no impact on other workflows
saienduri pushed a commit that referenced this issue Sep 12, 2024
Progress on #15332. This uses a
new `cpubuilder_ubuntu_jammy_x86_64` dockerfile from
https://github.com/iree-org/base-docker-images.

This stops using the remote cache that is hosted on GCP. Build time
_without a cache_ is about 20 minutes on current runners, while build
_with a cache_ is closer to 10 minutes. Build time without a cache is
closer to 28-30 minutes on new runners. We can try adding back a cache
using GitHub or our own hosted storage.

I tried to continue using the previous cache during this transition
period, but the `gcloud` command needs to run on the host, and I'd like
to stop using the `docker_run.sh` script. I'm hoping we can keep folding
away this sort of complexity by having the build machines run a
dockerfile that includes key environment components like utility tools
and any needed authorization/secrets (see
#18238).

ci-exactly: linux_x64_clang
saienduri pushed a commit that referenced this issue Sep 12, 2024
Progress on #15332. I'm trying to
get rid of the `docker_run.sh` scripts, replacing them with GitHub's
`container:` feature. While local development flows _may_ want to use
Docker like the CI workflows do, those scripts contained a lot of
special handling and file mounting to be compatible with Bazel. Much of
that is not needed for CMake and can be folded away, though the
`--privileged` option needed here is one exception.

This stops using the remote cache that is hosted on GCP. We can try
adding back a cache using GitHub or our own hosted storage as part of
#18238.

| Job | Cache? | Runner cluster | Time | Logs |
| -- | -- | -- | -- | -- |
| ASan | Cache | GCP runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10620030527/job/29438925064) |
| ASan | No cache | GCP runners | 28 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848397/job/29395467181) |
| ASan | Cache | Azure runners | (not configured yet) | |
| ASan | No cache | Azure runners | 35 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238709/job/29442788013?pr=18396) |
| TSan | Cache | GCP runners | 12 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10612418711/job/29414025939) |
| TSan | No cache | GCP runners | 21 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848414/job/29395467002) |
| TSan | Cache | Azure runners | (not configured yet) | |
| TSan | No cache | Azure runners | 32 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238738/job/29442788341?pr=18396) |

ci-exactly: linux_x64_clang_asan
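Passing extra Docker flags through GitHub's `container:` feature is done via `options:`. A hedged sketch of how the `--privileged` exception can be expressed (image name and script name are illustrative):

```yaml
jobs:
  asan:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
      # The one Docker flag that could not be folded away for this job.
      options: --privileged
    defaults:
      run:
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Build and test with ASan (script name hypothetical)
        run: ./build_tools/cmake/build_and_test_asan.sh
```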
@ScottTodd
Member Author

I'm planning to start migrating build_test_all_bazel to a new dockerfile running on the new Azure build cluster soon. We'll see what issues I run into :)

ScottTodd added a commit that referenced this issue Sep 19, 2024
Progress on #15332 and
#18238. Fixes
#16915.

This switches the `build_test_all_bazel` CI job from the
`gcr.io/iree-oss/base-bleeding-edge` Dockerfile using GCP for remote
cache storage to the `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64`
Dockerfile with no remote cache.

With no cache, this job takes between 18 and 25 minutes. Early testing
also showed times as long as 60 minutes, if the Docker command and
runner are both not optimally configured for Bazel (e.g. not using a RAM
disk).

The job is also moved from running on every commit to running on a
nightly schedule while we evaluate how frequently it breaks and how long
it takes to run. If we set up a new remote cache
(https://bazel.build/remote/caching), we can move it back to running
more regularly.
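The schedule change amounts to swapping the workflow trigger; a sketch (the cron time is an assumption):

```yaml
on:
  schedule:
    # Nightly run at a fixed UTC time, instead of triggering on every commit.
    - cron: "0 5 * * *"
  # Keep a manual trigger for debugging.
  workflow_dispatch: {}
```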
ScottTodd added a commit to iree-org/base-docker-images that referenced this issue Sep 19, 2024
This should let us replace
https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/base-arm64.Dockerfile
with this cpubuilder dockerfile, making progress on
iree-org/iree#15332. It uses a
multi-architecture build rather than a fully forked file, which seems to
work reasonably well with some local testing. I can even run the arm64
dockerfile on my x86_64 host.

Various related changes are included here:

* Drop `x86_64` from file names
  * `publish_cpubuilder_x86_64.yml` --> `publish_cpubuilder.yml`
* `cpubuilder_ubuntu_jammy_x86_64.Dockerfile` -->
`cpubuilder_ubuntu_jammy.Dockerfile`
* `cpubuilder_ubuntu_jammy_ghr_x86_64.Dockerfile` -->
`cpubuilder_ubuntu_jammy_ghr.Dockerfile`
* Build with `--platform linux/amd64,linux/arm64` and update docs for
this
* Build ccache from source as needed in `build_tools/install_ccache.sh`
(code lifted from
https://github.com/iree-org/iree/blob/main/build_tools/docker/context/install_ccache.sh).
Note that if we standardize on sccache we can drop the ccache install
entirely
* Since this builds from source using CMake, I moved the ccache install
step to after the cmake install step
* I am _not_ installing qemu here yet, as we did in IREE
(https://github.com/iree-org/iree/blob/782f372b070eadd593a727004cf61dc84aabc634/build_tools/docker/dockerfiles/base-arm64.Dockerfile#L75-L80).
I want to try installing on demand with
https://github.com/docker/setup-qemu-action first.
ScottTodd added a commit that referenced this issue Sep 23, 2024
Progress on #15332. This was the
last active use of
[`build_tools/docker/`](https://github.com/iree-org/iree/tree/main/build_tools/docker),
so we can now delete that directory:
#18566.

This uses the same "cpubuilder" dockerfile as the x86_64 builds, which
is now built for multiple architectures thanks to
iree-org/base-docker-images#11. As before, we
install a qemu binary in the dockerfile, this time using the approach in
iree-org/base-docker-images#13 instead of a
forked dockerfile.

Prior PRs for context:
* #14372
* #16331

Build time varies pretty wildly depending on cache hit rate and the
phase of the moon:

| Scenario | Cache hit rate | Time | Logs |
| -- | -- | -- | -- |
| Cold cache | 0% | 1h45m | [Logs](https://github.com/iree-org/iree/actions/runs/10962049593/job/30440393279) |
| Warm (?) cache | 61% | 48m | [Logs](https://github.com/iree-org/iree/actions/runs/10963546631/job/30445257323) |
| Warm (hot?) cache | 98% | 16m | [Logs](https://github.com/iree-org/iree/actions/runs/10964289304/job/30447618503?pr=18569) |

CI history
(https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain)
shows that regular 97% cache hit rates and 17 minute job times are
possible. I'm not sure why one test run only got 61% cache hits. This
job only runs nightly, so that's not a super high priority to
investigate and fix.

If we migrate the arm64 runner off of GCP
(#18238) we can further simplify
this workflow by dropping its reliance on `gcloud auth
application-default print-access-token` and the `docker_run.sh` script.
Other workflows are now using `source setup_sccache.sh` and some other
code.
ScottTodd added a commit that referenced this issue Sep 23, 2024
Fixes #15332.

The dockerfiles in this repository have all been migrated to
https://github.com/iree-org/base-docker-images/ and all uses in-tree
have been updated.

I'm keeping the
https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh
script for now, but I've replaced nearly all uses of that with GitHub's
`container:` argument
(https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container).
All remaining uses need to run some code outside of Docker first, like
`gcloud auth application-default print-access-token`. As we continue to
migrate jobs off of GCP runners
(#18238), we'll be using a
different authentication and caching setup that removes that
requirement.
@ScottTodd
Member Author

Done!

  • Dockerfiles are now hosted in https://github.com/iree-org/base-docker-images/, which contains automated workflows to publish to GitHub's Container registry (https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry). Members of the iree-write team can contribute to that repository to make updates to the published images, with no additional privileges required.
  • There are 6 dockerfiles there at the moment
    • 3 dockerfiles are forked between "ghr" (GitHub Runner) and "non-ghr" variants, and we aren't actually using the "ghr" variants right now
    • Nearly all workflows just use the cpubuilder image; the manylinux image is used for Python package building, and the amdgpu image is currently used (maybe?)
  • All workflows have been updated to use the new dockerfiles, or no dockerfile at all.
  • When not using Docker, workflows instead rely on either
    • software included in GitHub's standard runners
    • software included in our own self-hosted runners
    • software that can be easily and quickly installed on demand (e.g. python packages via pip, the Emscripten SDK, etc.)
  • Documentation for our usage of Docker is now at https://iree.dev/developers/general/github-actions/#docker-and-dependencies
