
UX improvements for the new benchmark pipeline #12215

Closed · 10 of 19 tasks
pzread opened this issue Feb 15, 2023 · 4 comments
@pzread (Contributor) commented Feb 15, 2023

Since we moved to artificial IDs in the new benchmark pipeline, it isn't very friendly for users to figure out the details of a benchmark from those opaque IDs alone.

To improve the UX, we can add more information to the benchmark config files and provide tools that help users retrieve benchmark details.

P0

P1

  • Benchmark display names should be generated independently of run_benchmarks_on_*.py (Use benchmark name from benchmark config object #12723)
  • Support more x86_64 architectures for local development (e.g. CPUs without AVX-512)
  • Export the benchmark config when building the e2e_test_artifacts

P2

  • Improve the readability of the serialized format
  • More structural filters in benchmark tools
  • Derive CPU uarch from device spec
  • Clearer and more concise benchmark summary comments
  • Show timestamp on the benchmark summary
  • Docs explaining the structure of the new Python benchmark framework
  • Docs for the new benchmark suite structure and how to hack on it
@pzread added the infrastructure and infrastructure/benchmark labels Feb 15, 2023
@pzread self-assigned this Feb 15, 2023
@pzread (Contributor, Author) commented Mar 15, 2023

Dumping some thoughts about the traceability of SHA IDs and the naming issues for artifacts and benchmarks.

Background

E2E Test Artifacts

${IREE_BUILD_DIR}/e2e_test_artifacts is the central place to store all fetched
files and artifacts built in generated_e2e_test_iree_artifacts.cmake. The
script generate_cmake_e2e_test_artifacts_suite.py collects all
ModuleGenerationConfig (right now only from the benchmark suite) and converts
them into cmake build rules.

Each ModuleGenerationConfig has a unique SHA256 ID computed from its
sub-components (CompileConfig and ImportedModel). This ID is used as the target
and output file name in the build rules.
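
For illustration, a minimal sketch of how such an ID could be derived; the field names here are hypothetical, not the actual e2e_test_framework definitions:

```python
# Hypothetical sketch: derive a stable SHA256 ID from a config's
# sub-component IDs. Field names are illustrative only.
import hashlib
import json

def compute_generation_config_id(compile_config_id: str,
                                 imported_model_id: str) -> str:
    # Serialize the sub-components deterministically and hash the result,
    # so the same inputs always map to the same target/file name.
    payload = json.dumps(
        {"compile_config": compile_config_id,
         "imported_model": imported_model_id},
        sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(compute_generation_config_id("compile-config-abc", "imported-model-def"))
```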

IREE Benchmark Suites

There are two types of benchmarks: execution and compilation. An execution
benchmark is defined by an E2EModelRunConfig and a compilation benchmark is
defined by a ModuleGenerationConfig (note that only some generation configs are
compilation benchmarks; we simply reuse the same object from module artifact
generation). Each benchmark has a permanent SHA256 ID.

To run the benchmarks, export_benchmark_config.py exports the serialized
E2EModelRunConfig and ModuleGenerationConfig objects for the benchmark tools.
The config file contains the full objects and their sub-components, mainly for
indexing and searching benchmarks by their metadata in the tools and in the
upcoming database.

Issues and Solutions

Traceability in E2E Test Artifacts

Since the build rules and output artifacts only carry the SHA256 ID in their
names, it is very hard to trace back which Python code generated a given rule.
Right now you can only guess from the flags, or write code that prints the ID
when a ModuleGenerationConfig is generated.

Using the SHA256 ID as the CMake target and file name has advantages: there is
no need to escape or replace disallowed characters, names have a fixed length,
and collisions are avoided. However, not being able to trace a rule back to its
source is a big flaw when debugging build errors.

Solutions

To make them traceable, ModuleGenerationConfig first needs to carry information
about where it was generated. A simple way is to give each config a readable
name, auto-generated from its sub-components, such as the model name and the
tags of the compile config. This approach also aligns with the requirement to
generate IREE benchmark names from the config. The readable name needs to
contain the model name, the target architecture, and the compile tags.
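
A minimal sketch of such a name generator (hypothetical signature, not the real framework code):

```python
# Hypothetical sketch: auto-generate a readable display name for a
# ModuleGenerationConfig from its sub-components.
from typing import Sequence

def generate_display_name(model_name: str, target_arch: str,
                          compile_tags: Sequence[str]) -> str:
    parts = [model_name, target_arch, *sorted(compile_tags)]
    # Keep the name filesystem- and CMake-friendly: lower-case, no spaces.
    return "_".join(part.replace(" ", "-").lower() for part in parts)

print(generate_display_name("MobileNetV2", "x86_64-cascadelake",
                            ["experimental-flags"]))
# -> mobilenetv2_x86_64-cascadelake_experimental-flags
```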

The second step is to annotate the names on the build rules and output
artifacts. We could reconsider combining the readable name with the SHA256 as
the artifact file name, but that runs into problems with long file names and
disallowed characters (although they should be solvable). Another simple option
is to add the name either as a comment on each CMake build rule or in the
FRIENDLY_NAME argument of the iree_bytecode_module rule. This should at least
let people find the readable name when searching for the SHA256 build target
name in generated_e2e_test_iree_artifacts.cmake or in the build log.

Searchability and Traceability in IREE Benchmark Suites

The same problem exists in the IREE benchmark suites: the benchmark ID is only
a SHA256 hash, and the serialized config file, although it contains all the
metadata, is not easy to read.

In addition, we decided not to dump all execution benchmark flagfiles into the
e2e_test_artifacts directory, because we expected the number of files to grow
quickly: it is N x the number of module artifacts, where N ~= the number of
thread configurations plus experiments. The current N is around 3 (~330
execution benchmarks in total). It also isn't necessary when all benchmarks can
be listed in a single JSON config file.

However, the config format is not very readable, because the serializer uses
IDs to reference nested objects (in order to deduplicate common objects). For
example:

"iree_e2e_model_run_configs:<execution_benchmark_id_1>": {
  "module_generation_config": <id to its module generation config>
  "run_flags": [...]
},
"iree_e2e_model_run_configs:<execution_benchmark_id_2>": {
  "module_generation_config": <id to its module generation config>
  "run_flags": [...]
},
"iree_module_generation_configs:<compilation_benchmark_id_3>": {
  "imported_model": <id to its imported model>
  "compile_flags": [...]
},
"iree_module_generation_configs:<compilation_benchmark_id_4>": {
  "imported_model": <id to its imported model>
  "compile_flags": [...]
},
"iree_imported_models:<imported_model_id>": {
  "model": <id to its model>
},
"models:<model_id>": {
  "name": <model name>,
  ...
},
...

This format requires people to search back and forth by ID to gather all of a
benchmark's sub-components and their metadata. Naturally we would prefer a more
straightforward nested layout like the one below (a small resolution sketch
follows the example):

{
  "e2e_model_run_configs": {
    "<execution_benchmark_id>": {
      "run_flags": [...],
      "tags": [...],
      "module_generation_config": {
        "compile_flags": [...],
        "tags": [...],
        "imported_model": {
          "import_flags": [...],
          "model": {
            "name": <model_name>,
            "tags": [...],
          }
        }
      },
    }
  },
  ...
}
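
As an illustration, a small Python sketch that resolves the ID references of the serialized format into the nested layout above; the key prefixes and field names mirror the examples and are assumptions, not the exact serializer schema:

```python
# Hypothetical sketch: denormalize the ID-referenced config into nested,
# self-contained benchmark objects.
def resolve(config: dict, key: str) -> dict:
    """Recursively replace ID references with the referenced objects."""
    obj = dict(config[key])
    for field, prefix in [
        ("module_generation_config", "iree_module_generation_configs"),
        ("imported_model", "iree_imported_models"),
        ("model", "models"),
    ]:
        if isinstance(obj.get(field), str):
            obj[field] = resolve(config, f"{prefix}:{obj[field]}")
    return obj

def expand_run_configs(config: dict) -> dict:
    """Expand every execution benchmark into a fully nested object."""
    return {
        key.split(":", 1)[1]: resolve(config, key)
        for key in config
        if key.startswith("iree_e2e_model_run_configs:")
    }
```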
Solutions

We could consider dumping all compilation and run flags into a single file together with benchmark names, tags, and IDs, so that it is easier to search benchmarks by keyword in a text editor.

We might also need to re-evaluate the decision not to dump all run flags; 300-400 flag files might not be that problematic.

@ScottTodd (Member) commented

The new improvements are great, especially the docs (https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/benchmark_suites.md#3-fetch-the-benchmark-artifacts) and summaries (e.g. https://github.com/openxla/iree/actions/runs/4832269569#summary-13105536036).

One thing I'd like is a way to see how large the files are and to filter the download to only the "small" programs. When I run gcloud storage cp "%E2E_TEST_ARTIFACTS_DIR_URL%/*.mlir" [destination path], it starts to fetch ~54GB of files. I could let that download complete, but what I'm usually looking for is a snapshot of several real programs, and if I wanted the large programs I wouldn't mind running a different command. (I also wonder if we're actually storing 50+GB of data for each benchmark run... are those files compressed or deduplicated in the storage system?)

This was working well for me before (266MB):
[screenshot of the previous artifact listing]

I'm sure there's a way to do that with the gcloud CLI, so consider this a selfish request for an alternate command to copy/paste, a script to run, or some directory structure changes :p

@pzread (Contributor, Author) commented May 9, 2023

Yeah, it's unhealthy to upload 50GB of artifacts to the GCS for every presubmit commit. It's a new problem after we added a new long-running category with larger models to the benchmark suite. I think the first step is to create separate build targets so we don't build the "long-running" benchmarks in the presubmit (or even in the postsubmit, since we only run them nightly).

A trick to download only the small files:

gcloud storage ls -l "gs://iree-github-actions-presubmit-artifacts/4927200698/1/e2e-test-artifacts/*.mlir*" | sort -h

then pipe it through some shell tools to get a list of the smaller files. I'll update the doc if I find something suitable (or feel free to update it with your commands). (I also realized we are now uploading *.mlir and *.mlirbc, so the docs probably need updating too...)
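
For example, a hedged helper (not part of the repo) that keeps only the small files from the ls -l output, assuming the size in bytes is the first whitespace-separated column and the gs:// URL is the last:

```python
# Hypothetical filter_small.py: read `gcloud storage ls -l` output from stdin
# and print only the URLs of files under a size threshold.
import sys

MAX_BYTES = 100 * 1024 * 1024  # 100 MiB; adjust as needed.

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2 or not fields[0].isdigit():
        continue  # Skip summary/blank lines.
    size, url = int(fields[0]), fields[-1]
    if size <= MAX_BYTES:
        print(url)
```

Something like gcloud storage ls -l "..." | python filter_small.py | xargs -n1 -I{} gcloud storage cp {} . could then fetch just those files (untested sketch).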

Another option is to put TF/TFLite into the file name, which would let you download only the TFLite models, which are usually smaller.

@pzread (Contributor, Author) commented Jan 23, 2024

The major improvements are considered done.

@pzread closed this as completed Jan 23, 2024