Make it easier for contributors to update Docker images #15332
Comments
And then we could just move IREE's docker image building over there.
The associated workflow was only run once? https://github.com/openxla/iree/actions/workflows/android_tflite_oneshot_build.yml

This is a rather large image too:

```
scotttodd:~$ docker image inspect gcr.io/iree-oss/gradle-android@sha256:cf7bf0392d5125f2babb4b9de4b43b583220506ecebd6b6201b23b2575f671c0 | grep Size
        "Size": 11533931968,
        "VirtualSize": 11533931968,
```

Progress towards #15332 (fewer Dockerfiles to port/maintain)
Fixes #15299.

This removes the SwiftShader-flavored Dockerfiles from this repository and switches all uses across the project to the non-SwiftShader equivalent images. SwiftShader is a CPU implementation of the Vulkan API that we have been using for min-spec coverage and for testing on machines that lack a physical GPU. We now have CI workflows on most supported platforms that use real hardware, and those builds are reliable enough to drop the CPU/SwiftShader coverage.

Vulkan tests are still included on presubmit using NVIDIA Tesla T4 GPUs in the `test_gpu` and `test_tf_integrations_gpu` jobs. Postsubmit jobs using NVIDIA A100 GPUs and Android phones (pixel-6-pro and moto-edge-x30) also run Vulkan tests.

Removing these extra Dockerfiles will also help with the planned refactoring tracked in #15332.

---

Follow-up work _not_ completed as part of this PR:

- [ ] Add min-spec coverage via profiles
…16346)

Progress on #16203 and #15332.

At the point where a job is installing a multi-gigabyte Docker image, it might as well just install Python requirements like TF directly. Switching to installing from pip loses some control over the supply chain, but I think that is fine for these test/benchmark jobs.

Comparing [`build_e2e_test_artifacts` before](https://github.com/openxla/iree/actions/runs/7815739382/job/21320388848) to [`build_e2e_test_artifacts` after](https://github.com/openxla/iree/actions/runs/7818347765/job/21328712850?pr=16346) (sample size 1):

* Docker fetch time decreased from 1m50s to 30s
  * 'frontends' depended on 'android' so it included the NDK too 😛
* Python setup (including pip install) time increased from 6s to 1m20s

So about the same total time, just using less cloud storage / infra complexity.
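For illustration, a minimal sketch of the pattern described above: installing the Python frontend requirements with pip inside the job instead of pulling a prebuilt 'frontends' Docker image. The runner label, Python version, and requirements file path here are assumptions, not copied from the actual workflow.

```yaml
jobs:
  build_e2e_test_artifacts:
    runs-on: ubuntu-latest  # illustrative; the real job uses self-hosted runners
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Hypothetical requirements path; replaces the Docker image layer
      # that used to bake in TF/TFLite.
      - name: Install TF requirements from pip
        run: python -m pip install -r integrations/tensorflow/test/requirements.txt
```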
As part of moving more jobs from ci.yml to pkgci.yml, I've been chipping away at our use of Docker images. Better to not use Docker at all if we can avoid it.
Seems like we could fork https://github.com/nod-ai/base-docker-images into iree-org and iterate from there. We can keep the original repo in nod-ai so existing packages continue to exist, and new dockerfiles/images specific to that GitHub org can be developed there.
Follow-up to #18144. Related to #15332.

* `build_all.yml` was used as the first step in multiple other workflows. New workflows use `pkgci_build_packages.yml` directly or nightly releases. Workflows could also use historical artifacts from `pkgci_build_packages.yml` if they want versions different from the nightly releases.
* `android.Dockerfile` was used for Android builds and benchmarks. New workflows install the NDK on demand (see the sketch after this list) without needing a large Dockerfile.
* `nvidia.Dockerfile` and `nvidia-bleeding-edge.Dockerfile` were used for CUDA/Vulkan benchmarks. New workflows rely on the drivers and software packages that are already installed on runners. We could have workflows install on demand or add new Dockerfiles as needed.
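As a hedged sketch of the "install the NDK on demand" idea, something like the step below could replace `android.Dockerfile`. The NDK version and archive URL are assumptions (Google publishes NDK zips under dl.google.com/android/repository/), not the exact commands the new workflows use.

```yaml
# Illustrative only: fetch and unpack an NDK release in a workflow step,
# then export its location for later steps via GITHUB_ENV.
- name: Install Android NDK
  run: |
    curl -sLO https://dl.google.com/android/repository/android-ndk-r25c-linux.zip
    unzip -q android-ndk-r25c-linux.zip -d "${HOME}"
    echo "ANDROID_NDK_ROOT=${HOME}/android-ndk-r25c" >> "${GITHUB_ENV}"
```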
…18252)

Progress on #15332 and #18238.

The [`build_tools/docker/docker_run.sh`](https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh) script does a bunch of weird/hacky setup, including setup for `gcloud` (for working with GCP) and Bazel-specific Docker workarounds. Most CMake builds can just use a container for the entire workflow (https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container). Note that GitHub, in its infinite wisdom, changed the default shell _just_ for jobs that run in a container, from `bash` to `sh`, so we flip it back (see the sketch below).

These jobs run nightly on GitHub-hosted runners, so I tested here:

* https://github.com/iree-org/iree/actions/runs/10396020082/job/28789218696
* https://github.com/iree-org/iree/actions/runs/10422541951/job/28867245589

(Those jobs should also run on this PR, but they'll take a while.)

skip-ci: no impact on other workflows
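A minimal sketch of the container-plus-shell-override pattern described above. The image tag and build script name are illustrative; `container:` and `defaults.run.shell` are standard GitHub Actions syntax.

```yaml
jobs:
  linux_x64_clang:
    runs-on: ubuntu-latest
    # Run every step of the job inside this image instead of
    # wrapping individual steps with docker_run.sh.
    container: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
    defaults:
      run:
        # Container jobs default to sh; flip it back to bash.
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: ./build_tools/cmake/build_all.sh
```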
Progress on #15332 and #18238. Similar to #18252, this drops a dependency on the [`build_tools/docker/docker_run.sh`](https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh) script. Unlike that PR, this goes a step further and also stops using [`build_tools/cmake/build_all.sh`](https://github.com/iree-org/iree/blob/main/build_tools/cmake/build_all.sh).

Functional changes:

* No more building `iree-test-deps`
  * We only get marginal value out of compiling test files using a debug compiler
  * Those tests are on the path to being moved to https://github.com/iree-org/iree-test-suites
* No more ccache
  * The debug build cache is too large for a local / GitHub Actions cache
  * I want to limit our reliance on the remote cache at `http://storage.googleapis.com/iree-sccache/ccache` (which uses GCP for storage and needs GCP auth)
  * Experiments show that this build is not significantly faster when using a cache, or at least dropping `iree-test-deps` provides equivalent time savings

Logs before: https://github.com/iree-org/iree/actions/runs/10417779910/job/28864909582 (96% cache hits, 9 minute build but 19 minutes total, due to `iree-test-deps`)
Logs after: https://github.com/iree-org/iree/actions/runs/10423409599/job/28870060781?pr=18255 (no cache, 11 minute build)

ci-exactly: linux_x64_clang_debug

---------

Co-authored-by: Marius Brehler <marius.brehler@gmail.com>
Sent an RFC to fork that repo into iree-org: https://groups.google.com/g/iree-discuss/c/IPLzMsPb5UI
The workflows using this had been disabled. A new workflow was added in #18274 that does not use this Dockerfile, opting instead to run the steps contained in the file directly. Remaining "uses" are here, and those will be ported as needed: https://github.com/iree-org/iree/blob/e3936dca933893a8849195989db8b9e5a0893316/.github/workflows/ci.yml#L249-L263

Progress on #15332 (one less Dockerfile to port)
Progress on #15332. See the RFC: https://groups.google.com/g/iree-discuss/c/IPLzMsPb5UI.

I have forked https://github.com/nod-ai/base-docker-images/ into https://github.com/iree-org/base-docker-images. Dockerfiles are now built and published as packages in the iree-org namespace using GitHub's container registry. Future changes will migrate what remains in https://github.com/iree-org/iree/tree/main/build_tools/docker.
I've almost finished switching workflows to using dockerfiles hosted at https://github.com/iree-org/base-docker-images/. Here's what's left:
For Bazel, we could try using https://bazel.build/install/docker-container (e.g.
Actually never mind re: Bazel...? The source for that is https://github.com/bazelbuild/continuous-integration/blob/master/bazel/oci/Dockerfile and it's just for using Bazel. The entrypoint is hardcoded to
Following iree-org/base-docker-images#6, the new cpubuilder dockerfile should have all the software needed for ASan and TSan building + testing (specifically `clang-19` instead of just `clang-14`). Progress on #15332.

The only remaining uses of `gcr.io/iree-oss/base.*` are:

* `build_test_all_bazel` uses `gcr.io/iree-oss/base-bleeding-edge`
* `publish_website` uses `gcr.io/iree-oss/base`
* arm64 workflows use `gcr.io/iree-oss/base-arm64`
* `gcr.io/iree-oss/emscripten` (used by web test workflows) depends on `gcr.io/iree-oss/base`
Progress on #15332 - one less workflow needing the `gcr.io/iree-oss/base` dockerfile and ccache storage on GCP.

Tested on my fork:

* Cold cache (5m30s): https://github.com/ScottTodd/iree/actions/runs/10690774251/job/29635889592
* Warm cache (3m30s): https://github.com/ScottTodd/iree/actions/runs/10690871405/job/29636198158

skip-ci: no impact on other workflows
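For reference, a hedged sketch of replacing GCP-hosted ccache storage with the GitHub Actions cache, as the cold/warm numbers above suggest. Cache key names, paths, and the build commands are assumptions, not the actual workflow contents.

```yaml
# Persist the ccache directory between runs via actions/cache.
- uses: actions/cache@v4
  with:
    path: ${{ github.workspace }}/.ccache
    key: ccache-${{ github.job }}-${{ github.sha }}
    restore-keys: ccache-${{ github.job }}-
- name: Build with ccache
  env:
    CCACHE_DIR: ${{ github.workspace }}/.ccache
  run: |
    # Route compiler invocations through ccache, then report hit rates.
    cmake -G Ninja -B build \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache .
    cmake --build build
    ccache --show-stats
```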
Also delete the now unused `emscripten.Dockerfile` (it's technically still referenced in a commented-out `cross_compile_and_test` build, but that can be added back as needed using this same technique). Progress on #15332 - one less Dockerfile to maintain.

Tested here: https://github.com/ScottTodd/iree/actions/runs/10691087132/job/29636852886 (this samples.yml workflow runs on a nightly schedule).

skip-ci: no impact on other workflows
Progress on #15332. This uses a new `cpubuilder_ubuntu_jammy_x86_64` dockerfile from https://github.com/iree-org/base-docker-images.

This stops using the remote cache that is hosted on GCP. Build time _without a cache_ is about 20 minutes on current runners, while a build _with a cache_ is closer to 10 minutes. Build time without a cache is closer to 28-30 minutes on new runners. We can try adding back a cache using GitHub or our own hosted storage. I tried to continue using the previous cache during this transition period, but the `gcloud` command needs to run on the host, and I'd like to stop using the `docker_run.sh` script.

I'm hoping we can keep folding away this sort of complexity by having the build machines run a dockerfile that includes key environment components like utility tools and any needed authorization/secrets (see #18238).

ci-exactly: linux_x64_clang
Progress on #15332. I'm trying to get rid of the `docker_run.sh` scripts, replacing them with GitHub's `container:` feature. While local development flows _may_ want to use Docker like the CI workflows do, those scripts contained a lot of special handling and file mounting to be compatible with Bazel. Much of that is not needed for CMake and can be folded away, though the `--privileged` option needed here is one exception (see the sketch below).

This stops using the remote cache that is hosted on GCP. We can try adding back a cache using GitHub or our own hosted storage as part of #18238.

| Job | Cache? | Runner cluster | Time | Logs |
| -- | -- | -- | -- | -- |
| ASan | Cache | GCP runners | 14 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10620030527/job/29438925064) |
| ASan | No cache | GCP runners | 28 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848397/job/29395467181) |
| ASan | Cache | Azure runners | (not configured yet) | |
| ASan | No cache | Azure runners | 35 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238709/job/29442788013?pr=18396) |
| TSan | Cache | GCP runners | 12 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10612418711/job/29414025939) |
| TSan | No cache | GCP runners | 21 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10605848414/job/29395467002) |
| TSan | Cache | Azure runners | (not configured yet) | |
| TSan | No cache | Azure runners | 32 minutes | [logs](https://github.com/iree-org/iree/actions/runs/10621238738/job/29442788341?pr=18396) |

ci-exactly: linux_x64_clang_asan
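A minimal sketch of the `container:` feature with the one extra flag noted above. The image tag and build script name are illustrative assumptions; the `container.options` field is standard GitHub Actions syntax for passing extra `docker create` flags.

```yaml
jobs:
  linux_x64_clang_asan:
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main
      # The one docker_run.sh behavior that can't be folded away for
      # these sanitizer jobs.
      options: --privileged
    steps:
      - uses: actions/checkout@v4
      - run: ./build_tools/cmake/build_and_test_asan.sh  # hypothetical script name
```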
I'm planning to start migrating
Progress on #15332 and #18238. Fixes #16915.

This switches the `build_test_all_bazel` CI job from the `gcr.io/iree-oss/base-bleeding-edge` Dockerfile using GCP for remote cache storage to the `ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64` Dockerfile with no remote cache. With no cache, this job takes between 18 and 25 minutes. Early testing also showed times as long as 60 minutes when the Docker command and runner are not optimally configured for Bazel (e.g. not using a RAM disk; see the sketch below).

The job is also moved from running on every commit to running on a nightly schedule while we evaluate how frequently it breaks and how long it takes to run. If we set up a new remote cache (https://bazel.build/remote/caching), we can move it back to running more regularly.
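Illustrative of the RAM disk tuning mentioned above, not the exact CI setup: give the container a large `/dev/shm` and point Bazel's sandbox at it so sandboxed actions run on tmpfs. The image tag and size are assumptions; `--shm-size` (Docker) and `--sandbox_base` (Bazel) are real flags.

```bash
docker run --rm --shm-size=16g \
  -v "$PWD":/work -w /work \
  ghcr.io/iree-org/cpubuilder_ubuntu_jammy_x86_64:main \
  bazel test --sandbox_base=/dev/shm //...
```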
This should let us replace https://github.com/iree-org/iree/blob/main/build_tools/docker/dockerfiles/base-arm64.Dockerfile with this cpubuilder dockerfile, making progress on iree-org/iree#15332. It uses a multi-architecture build rather than a fully forked file, which seems to work reasonably well in local testing. I can even run the arm64 dockerfile on my x86_64 host (see the sketch after this list).

Various related changes are included here:

* Drop `x86_64` from file names
  * `publish_cpubuilder_x86_64.yml` --> `publish_cpubuilder.yml`
  * `cpubuilder_ubuntu_jammy_x86_64.Dockerfile` --> `cpubuilder_ubuntu_jammy.Dockerfile`
  * `cpubuilder_ubuntu_jammy_ghr_x86_64.Dockerfile` --> `cpubuilder_ubuntu_jammy_ghr.Dockerfile`
* Build with `--platform linux/amd64,linux/arm64` and update docs for this
* Build ccache from source as needed in `build_tools/install_ccache.sh` (code lifted from https://github.com/iree-org/iree/blob/main/build_tools/docker/context/install_ccache.sh). Note that if we standardize on sccache we can drop the ccache install entirely
  * Since this builds from source using CMake, I moved the ccache install step to after the cmake install step
* I am _not_ installing qemu here yet, as we did in IREE (https://github.com/iree-org/iree/blob/782f372b070eadd593a727004cf61dc84aabc634/build_tools/docker/dockerfiles/base-arm64.Dockerfile#L75-L80). I want to try installing it on demand with https://github.com/docker/setup-qemu-action first.
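A sketch of the multi-architecture publish flow described above, with qemu installed on demand via docker/setup-qemu-action as proposed. File paths and image tags are illustrative; the actions and `--platform` flag are standard.

```yaml
# Register qemu so the amd64 runner can build/execute arm64 layers.
- uses: docker/setup-qemu-action@v3
  with:
    platforms: arm64
# buildx provides the multi-platform builder.
- uses: docker/setup-buildx-action@v3
- name: Build and push multi-arch cpubuilder image
  run: |
    docker buildx build \
      --platform linux/amd64,linux/arm64 \
      -f dockerfiles/cpubuilder_ubuntu_jammy.Dockerfile \
      -t ghcr.io/iree-org/cpubuilder_ubuntu_jammy:main \
      --push .
```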
Progress on #15332. This was the last active use of [`build_tools/docker/`](https://github.com/iree-org/iree/tree/main/build_tools/docker), so we can now delete that directory: #18566.

This uses the same "cpubuilder" dockerfile as the x86_64 builds, which is now built for multiple architectures thanks to iree-org/base-docker-images#11. As before, we install a qemu binary in the dockerfile, this time using the approach in iree-org/base-docker-images#13 instead of a forked dockerfile. Prior PRs for context:

* #14372
* #16331

Build time varies pretty wildly depending on cache hit rate and the phase of the moon:

| Scenario | Cache hit rate | Time | Logs |
| -- | -- | -- | -- |
| Cold cache | 0% | 1h45m | [Logs](https://github.com/iree-org/iree/actions/runs/10962049593/job/30440393279) |
| Warm (?) cache | 61% | 48m | [Logs](https://github.com/iree-org/iree/actions/runs/10963546631/job/30445257323) |
| Warm (hot?) cache | 98% | 16m | [Logs](https://github.com/iree-org/iree/actions/runs/10964289304/job/30447618503?pr=18569) |

CI history (https://github.com/iree-org/iree/actions/workflows/ci_linux_arm64_clang.yml?query=branch%3Amain) shows that regular 97% cache hit rates and 17 minute job times are possible. I'm not sure why one test run only got 61% cache hits. This job only runs nightly, so that's not a super high priority to investigate and fix.

If we migrate the arm64 runner off of GCP (#18238) we can further simplify this workflow by dropping its reliance on `gcloud auth application-default print-access-token` and the `docker_run.sh` script. Other workflows are now using `source setup_sccache.sh` and some other code.
Fixes #15332. The dockerfiles in this repository have all been migrated to https://github.com/iree-org/base-docker-images/ and all uses in-tree have been updated.

I'm keeping the https://github.com/iree-org/iree/blob/main/build_tools/docker/docker_run.sh script for now, but I've replaced nearly all uses of it with GitHub's `container:` argument (https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container). All remaining uses need to run some code outside of Docker first, like `gcloud auth application-default print-access-token`. As we continue to migrate jobs off of GCP runners (#18238), we'll use a different authentication and caching setup that removes that requirement.
Done!
There is some background context in this Discord discussion.
We have ~17 Dockerfiles in the main repo that require authentication (`Storage Admin` role in the `iree-oss` GCP project) to update: https://github.com/openxla/iree/tree/main/build_tools/docker. It would be much more convenient for contributors and maintainers if any approved commit could update the images used. We should see what other projects are doing and modify our setup to be less bespoke.
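For context, roughly what updating one of these images requires today: authenticate to GCP, then build and push to gcr.io. The Dockerfile path and tag below are illustrative; `gcloud auth configure-docker` is the standard way to wire Docker credentials to gcr.io.

```bash
# Requires the Storage Admin role in the iree-oss GCP project.
gcloud auth login
gcloud auth configure-docker
docker build \
  -f build_tools/docker/dockerfiles/base.Dockerfile \
  -t gcr.io/iree-oss/base:latest \
  build_tools/docker
docker push gcr.io/iree-oss/base:latest
```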
Possible requirements:
TBD how much scripting/automation is needed for this.