[datasets-overhaul #2] Add a new Dataset class. #191

ChrisCummins · 2021-04-21T23:27:59Z

A dataset is a collection of benchmarks.

Datasets provide a convenience abstraction for installing and managing groups of benchmarks. Datasets provide the API for instantiating benchmarks, either by randomly selecting from the available benchmarks or by selecting by URI.

The Dataset class is an abstract base for implementing datasets. At a minimum, subclasses must implement the benchmark() and benchmark_uris() methods. Other methods such as install() may be used where helpful.

Issue #45.

JD-ETH

I think the structure is very clear now this way, don't see issues here.

tests/datasets/dataset_test.py

ChrisCummins · 2021-04-22T15:38:05Z

Thanks for the review @JD-ETH!

Issue #45.

This switches over the `CompilerEnv` environment to use the new dataset API, dropping the `LegacyDataset` class. Background ---------- Since the very first prototype of CompilerGym, a `Benchmark` protocol buffer has been used to provide a serializable representation of benchmarks that can be passed back and forth between the service and the frontend. Initially, it was up to the compiler service to maintain the set of available benchmarks, exposing the available benchmarks with a `GetBenchmarks()` RPC method, and allowing new benchmarks to be added using an `AddBenchmarks()` method. This was fine for the initial use case of shipping a handful of benchmarks and allowing ad-hoc new benchmarks to be added, but for managing larger sets of benchmarks, a *datasets* abstraction was added. Initial Datasets abstraction ---------------------------- To add support for managing large sets of programs, a [Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20) tuple was added that describes a set of programs, and a link to the a tarball containing those programs. The tarball is required to have a JSON file containing metadata, and a directory containing the benchmarks, one file per benchmark. A set of operations were added to the frontend command line to make downloading and unpacking these tarballs easier: https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133 Problems with this approach --------------------------- (1) **Leaky abstraction** Both the environment and backend service have to know about datasets. This means redundant duplicated logic, and adds a maintenance burden of keeping the C++/python logic in sync. (2) **Inflexible** Only supports environments in which a single file represents a benchmark. No support for multi-file benchmarks, benchmarks that are compiled on-demand, etc. (3) **O(n) space and time overhead** on each service instance, where *n* is the total number of benchmarks. At init time, each service needs to recursively scan a directory tree to build a list of available benchmarks. This list must be kept in memory. This adds startup time, and also causes cache invalidation issues when multiple environment instances are modifying the underlying filesystem. New Dataset API --------------- This commit changes the ownership model so that the *Environment* owns the benchmarks and datasets, not the service. This uses the new `Dataset` class hierarchy that has been added in previous pull requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201. Now, the backend has no knowledge of "datasets". Instead the service simply keeps a small cache of benchmarks that it has seen. If a session request has a benchmark URI that is not in this cache, the service returns a "resource not found" error and the frontend logic can then respond by sending it a copy of the benchmark as a `Benchmark` proto. The service is free to cache this for future use, and can empty the cache whenever it wants. This new approach has a few key benefits: (1) By moving all of the datasets logic into the frontend, it becomes much easier for users to define their own datasets. (2) Reduces compiler service startup time as it removes the need for each service to do a recursive filesystem sweep. (3) Removes the requirement that the set of benchmarks is fully enumerable, allow for program generators that can produce a theoretically infinite number of benchmarks. (4) Adds support for lazily-compiled datasets of programs that are generated on-demand. (5) Removes the need to download datasets ahead of time. Datasets can now be installed on-demand. Summary of changes ------------------ (1) Changes the type of `env.benchmark` from a string to a `Benchmark` instance. (2) Makes `env.benchmark` a mandatory attribute. If no benchmark is provided at init time, one is chosen deterministically. If you wish to select a random benchmark, use `env.datasets.benchmark()`. (3) `env.fork()` no longer requires `env.reset()` to have been called first. It will call `env.reset()` if required. (4) `env.benchmark = None` is no longer a valid way of requesting a random benchmark. If you would like a random benchmark, you must now roll your own random picker using `env.datasets.benchmark_uris()` and similar. (5) Deprecates all `LegacyDataset` operations, changing their behavior to no-ops, and removing the class. (6) Renames `cBench` to `cbench` to be consistent with the lower-case naming convention of gym. The old `cBench` datasets are kept around but are marked deprecated to encourage migration. Migrating to the new interface ------------------------------ To migrate existing code to the new interface: (1) Update references to `cBench-v[01]` to `cbench-v1`. (2) Review code that accesses the `env.benchmark` property and update to `env.benchmark.uri` if a string name is required. (3) Review code that calls `env.reset()` without first setting a benchmark. Previously, calling `env.reset()` would select a random benchmark. Now, `env.reset()` always selects the last used benchmark, or a predetermined default if none is specified. (4) Review code that relies on `env.benchmark` being `None` to select benchmarks randomly. Now, `env.benchmark` is always set to the previously used benchmark, or a predetermined default benchmark if none has been provided. (5) Remove calls to `env.require_dataset()`. Issue facebookresearch#45.

This release introduces some significant changes to the way that benchmarks are managed, introducing a new dataset API. This enabled us to add support for millions of new benchmarks and a more efficient implementation for the LLVM environment, but this will require some migrating of old code to the new interfaces (see “Migration Checklist” below). Some of the key changes of this release are: - [Core API change] We have added a Python Benchmark class (#190). The env.benchmark attribute is now an instance of this class rather than a string (#222). - [Core behavior change] Environments will no longer select benchmarks randomly. Now env.reset() will now always select the last-used benchmark, unless the benchmark argument is provided or env.benchmark has been set. If no benchmark is specified, a default is used. - [API deprecations] We have added a new Dataset class hierarchy (#191, #192). All datasets are now available without needing to be downloaded first, and a new Datasets class can be used to iterate over them (#200). We have deprecated the old dataset management operations, the compiler_gym.bin.datasets script, and removed the --dataset and --ls_benchmark flags from the command line tools. - [RPC interface change] The StartSession RPC endpoint now accepts a list of initial observations to compute. This removes the need for an immediate call to Step, reducing environment reset time by 15-21% (#189). - [LLVM] We have added several new datasets of benchmarks, including the Csmith and llvm-stress program generators (#207), a dataset of OpenCL kernels (#208), and a dataset of compilable C functions (#210). See the docs for an overview. - CompilerEnv now takes an optional Logger instance at construction time for fine-grained control over logging output (#187). - [LLVM] The ModuleID and source_filename of LLVM-IR modules are now anonymized to prevent unintentional overfitting to benchmarks by name (#171). - [docs] We have added a Feature Stability section to the documentation (#196). - Numerous bug fixes and improvements. Please use this checklist when updating code for the previous CompilerGym release: - Review code that accesses the env.benchmark property and update to env.benchmark.uri if a string name is required. Setting this attribute by string (env.benchmark = "benchmark://a-v0/b") and comparison to string types (env.benchmark == "benchmark://a-v0/b") still work. - Review code that calls env.reset() without first setting a benchmark. Previously, calling env.reset() would select a random benchmark. Now, env.reset() always selects the last used benchmark, or a predetermined default if none is specified. - Review code that relies on env.benchmark being None to select benchmarks randomly. Now, env.benchmark is always set to the previously used benchmark, or a predetermined default benchmark if none has been specified. Setting env.benchmark = None will raise an error. Select a benchmark randomly by sampling from the env.datasets.benchmark_uris() iterator. - Remove calls to env.require_dataset() and related operations. These are no longer required. - Remove accesses to env.benchmarks. An iterator over available benchmark URIs is now available at env.datasets.benchmark_uris(), but the list of URIs cannot be relied on to be fully enumerable (the LLVM environments have over 2^32 URIs). - Review code that accesses env.observation_space and update to env.observation_space_spec where necessary (#228). - Update compiler service implementations to support the updated RPC interface by removing the deprecated GetBenchmarks RPC endpoint and replacing it with Dataset classes. See the example service for details. - [LLVM] Update references to the poj104-v0 dataset to poj104-v1. - [LLVM] Update references to the cBench-v1 dataset to cbench-v1.

This switches over the `CompilerEnv` environment to use the new dataset API, dropping the `LegacyDataset` class. Background ---------- Since the very first prototype of CompilerGym, a `Benchmark` protocol buffer has been used to provide a serializable representation of benchmarks that can be passed back and forth between the service and the frontend. Initially, it was up to the compiler service to maintain the set of available benchmarks, exposing the available benchmarks with a `GetBenchmarks()` RPC method, and allowing new benchmarks to be added using an `AddBenchmarks()` method. This was fine for the initial use case of shipping a handful of benchmarks and allowing ad-hoc new benchmarks to be added, but for managing larger sets of benchmarks, a *datasets* abstraction was added. Initial Datasets abstraction ---------------------------- To add support for managing large sets of programs, a [Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20) tuple was added that describes a set of programs, and a link to the a tarball containing those programs. The tarball is required to have a JSON file containing metadata, and a directory containing the benchmarks, one file per benchmark. A set of operations were added to the frontend command line to make downloading and unpacking these tarballs easier: https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133 Problems with this approach --------------------------- (1) **Leaky abstraction** Both the environment and backend service have to know about datasets. This means redundant duplicated logic, and adds a maintenance burden of keeping the C++/python logic in sync. (2) **Inflexible** Only supports environments in which a single file represents a benchmark. No support for multi-file benchmarks, benchmarks that are compiled on-demand, etc. (3) **O(n) space and time overhead** on each service instance, where *n* is the total number of benchmarks. At init time, each service needs to recursively scan a directory tree to build a list of available benchmarks. This list must be kept in memory. This adds startup time, and also causes cache invalidation issues when multiple environment instances are modifying the underlying filesystem. New Dataset API --------------- This commit changes the ownership model so that the *Environment* owns the benchmarks and datasets, not the service. This uses the new `Dataset` class hierarchy that has been added in previous pull requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201. Now, the backend has no knowledge of "datasets". Instead the service simply keeps a small cache of benchmarks that it has seen. If a session request has a benchmark URI that is not in this cache, the service returns a "resource not found" error and the frontend logic can then respond by sending it a copy of the benchmark as a `Benchmark` proto. The service is free to cache this for future use, and can empty the cache whenever it wants. This new approach has a few key benefits: (1) By moving all of the datasets logic into the frontend, it becomes much easier for users to define their own datasets. (2) Reduces compiler service startup time as it removes the need for each service to do a recursive filesystem sweep. (3) Removes the requirement that the set of benchmarks is fully enumerable, allow for program generators that can produce a theoretically infinite number of benchmarks. (4) Adds support for lazily-compiled datasets of programs that are generated on-demand. (5) Removes the need to download datasets ahead of time. Datasets can now be installed on-demand. Summary of changes ------------------ (1) Changes the type of `env.benchmark` from a string to a `Benchmark` instance. (2) Makes `env.benchmark` a mandatory attribute. If no benchmark is provided at init time, one is chosen deterministically. If you wish to select a random benchmark, use `env.datasets.benchmark()`. (3) `env.fork()` no longer requires `env.reset()` to have been called first. It will call `env.reset()` if required. (4) `env.benchmark = None` is no longer a valid way of requesting a random benchmark. If you would like a random benchmark, you must now roll your own random picker using `env.datasets.benchmark_uris()` and similar. (5) Deprecates all `LegacyDataset` operations, changing their behavior to no-ops, and removing the class. (6) Renames `cBench` to `cbench` to be consistent with the lower-case naming convention of gym. The old `cBench` datasets are kept around but are marked deprecated to encourage migration. Migrating to the new interface ------------------------------ To migrate existing code to the new interface: (1) Update references to `cBench-v[01]` to `cbench-v1`. (2) Review code that accesses the `env.benchmark` property and update to `env.benchmark.uri` if a string name is required. (3) Review code that calls `env.reset()` without first setting a benchmark. Previously, calling `env.reset()` would select a random benchmark. Now, `env.reset()` always selects the last used benchmark, or a predetermined default if none is specified. (4) Review code that relies on `env.benchmark` being `None` to select benchmarks randomly. Now, `env.benchmark` is always set to the previously used benchmark, or a predetermined default benchmark if none has been provided. (5) Remove calls to `env.require_dataset()`. Issue facebookresearch#45.

This release introduces some significant changes to the way that benchmarks are managed, introducing a new dataset API. This enabled us to add support for millions of new benchmarks and a more efficient implementation for the LLVM environment, but this will require some migrating of old code to the new interfaces (see “Migration Checklist” below). Some of the key changes of this release are: - [Core API change] We have added a Python Benchmark class (facebookresearch#190). The env.benchmark attribute is now an instance of this class rather than a string (facebookresearch#222). - [Core behavior change] Environments will no longer select benchmarks randomly. Now env.reset() will now always select the last-used benchmark, unless the benchmark argument is provided or env.benchmark has been set. If no benchmark is specified, a default is used. - [API deprecations] We have added a new Dataset class hierarchy (facebookresearch#191, facebookresearch#192). All datasets are now available without needing to be downloaded first, and a new Datasets class can be used to iterate over them (facebookresearch#200). We have deprecated the old dataset management operations, the compiler_gym.bin.datasets script, and removed the --dataset and --ls_benchmark flags from the command line tools. - [RPC interface change] The StartSession RPC endpoint now accepts a list of initial observations to compute. This removes the need for an immediate call to Step, reducing environment reset time by 15-21% (facebookresearch#189). - [LLVM] We have added several new datasets of benchmarks, including the Csmith and llvm-stress program generators (facebookresearch#207), a dataset of OpenCL kernels (facebookresearch#208), and a dataset of compilable C functions (facebookresearch#210). See the docs for an overview. - CompilerEnv now takes an optional Logger instance at construction time for fine-grained control over logging output (facebookresearch#187). - [LLVM] The ModuleID and source_filename of LLVM-IR modules are now anonymized to prevent unintentional overfitting to benchmarks by name (facebookresearch#171). - [docs] We have added a Feature Stability section to the documentation (facebookresearch#196). - Numerous bug fixes and improvements. Please use this checklist when updating code for the previous CompilerGym release: - Review code that accesses the env.benchmark property and update to env.benchmark.uri if a string name is required. Setting this attribute by string (env.benchmark = "benchmark://a-v0/b") and comparison to string types (env.benchmark == "benchmark://a-v0/b") still work. - Review code that calls env.reset() without first setting a benchmark. Previously, calling env.reset() would select a random benchmark. Now, env.reset() always selects the last used benchmark, or a predetermined default if none is specified. - Review code that relies on env.benchmark being None to select benchmarks randomly. Now, env.benchmark is always set to the previously used benchmark, or a predetermined default benchmark if none has been specified. Setting env.benchmark = None will raise an error. Select a benchmark randomly by sampling from the env.datasets.benchmark_uris() iterator. - Remove calls to env.require_dataset() and related operations. These are no longer required. - Remove accesses to env.benchmarks. An iterator over available benchmark URIs is now available at env.datasets.benchmark_uris(), but the list of URIs cannot be relied on to be fully enumerable (the LLVM environments have over 2^32 URIs). - Review code that accesses env.observation_space and update to env.observation_space_spec where necessary (facebookresearch#228). - Update compiler service implementations to support the updated RPC interface by removing the deprecated GetBenchmarks RPC endpoint and replacing it with Dataset classes. See the example service for details. - [LLVM] Update references to the poj104-v0 dataset to poj104-v1. - [LLVM] Update references to the cBench-v1 dataset to cbench-v1.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 21, 2021

ChrisCummins added this to the v0.1.8 milestone Apr 21, 2021

ChrisCummins force-pushed the dataset branch 2 times, most recently from 8583b9e to fad989e Compare April 21, 2021 23:35

ChrisCummins changed the title ~~[datasets] Add a new Dataset class.~~ [datasets-overhaul #2] Add a new Dataset class. Apr 21, 2021

ChrisCummins force-pushed the benchmark branch from 5716633 to ac14b77 Compare April 22, 2021 07:19

ChrisCummins force-pushed the dataset branch from fad989e to 75730f1 Compare April 22, 2021 07:19

ChrisCummins requested review from JD-ETH and hughleat April 22, 2021 13:25

JD-ETH approved these changes Apr 22, 2021

View reviewed changes

tests/datasets/dataset_test.py Outdated Show resolved Hide resolved

ChrisCummins force-pushed the dataset branch from 75730f1 to 3455935 Compare April 22, 2021 15:37

ChrisCummins force-pushed the dataset branch from 3455935 to f0b28e7 Compare April 22, 2021 15:39

ChrisCummins force-pushed the benchmark branch from ac14b77 to 3bd8206 Compare April 22, 2021 18:19

[datasets] Add a new Dataset class.

12d27a3

Issue #45.

ChrisCummins force-pushed the dataset branch from f0b28e7 to 12d27a3 Compare April 22, 2021 18:20

Base automatically changed from benchmark to development April 22, 2021 21:13

ChrisCummins merged commit fa809b4 into development Apr 22, 2021

ChrisCummins deleted the dataset branch April 22, 2021 21:14

ChrisCummins mentioned this pull request Apr 27, 2021

[datasets] Switch CompilerEnv to the new dataset API. #222

Merged

ChrisCummins mentioned this pull request Apr 30, 2021

Release v0.1.8 #238

Merged

9 tasks

ChrisCummins mentioned this pull request Apr 30, 2021

Release v0.1.8 #241

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[datasets-overhaul #2] Add a new Dataset class. #191

[datasets-overhaul #2] Add a new Dataset class. #191

ChrisCummins commented Apr 21, 2021 •

edited

Loading

JD-ETH left a comment

ChrisCummins commented Apr 22, 2021

[datasets-overhaul #2] Add a new Dataset class. #191

[datasets-overhaul #2] Add a new Dataset class. #191

Conversation

ChrisCummins commented Apr 21, 2021 • edited Loading

JD-ETH left a comment

Choose a reason for hiding this comment

ChrisCummins commented Apr 22, 2021

ChrisCummins commented Apr 21, 2021 •

edited

Loading