
Initial release of `evaluate`

@lvwerra released this 31 May 13:57

Release notes

These are the release notes for the initial release of the Evaluate library.

Goals

Goals of the Evaluate library:

  • reproducibility: reporting and reproducing results is easy
  • ease-of-use: access to a wide range of evaluation tools with a unified interface
  • diversity: provide a wide range of evaluation tools with metrics, comparisons, and measurements
  • multimodal: models and datasets of many modalities can be evaluated
  • community-driven: anybody can add custom evaluations by hosting them on the Hugging Face Hub

Release overview:


  • evaluate.load(): The load() function is the main entry point into evaluate and lets you load evaluation modules from a local folder, the evaluate repository, or the Hugging Face Hub. It downloads, caches, and loads the evaluation module and returns an evaluate.EvaluationModule.
  • evaluate.save(): With save() a user can save evaluation results to a JSON file. In addition to the results from an evaluate.EvaluationModule, it can store additional parameters and automatically records the timestamp, git commit hash, library version, and Python path. One can either provide a directory for the results, in which case a file name is created automatically, or an explicit file name for the result.
  • evaluate.push_to_hub(): The push_to_hub() function pushes the results of a model evaluation to the model card on the Hugging Face Hub. The model, dataset, and metric are specified so that they can be linked on the Hub.
  • evaluate.EvaluationModule: The EvaluationModule class is the base class for all evaluation modules. There are three module types: metrics (to evaluate models), comparisons (to compare models), and measurements (to analyze datasets). Inputs can either be accumulated with add (a single input) or add_batch (a batch of inputs) followed by a final compute call, or passed to compute directly. Under the hood, Apache Arrow stores and loads the input data used to compute the scores. A usage sketch follows this list.
  • evaluate.EvaluationModuleInfo: The EvaluationModuleInfo class stores the attributes of an evaluation module:
    • description: A short description of the evaluation module.
    • citation: A BibTeX string for citation, when available.
    • features: A Features object defining the input format. The inputs provided to add, add_batch, and compute are tested against these types and an error is thrown in case of a mismatch.
    • inputs_description: This is equivalent to the module's docstring.
    • homepage: The homepage of the module.
    • license: The license of the module.
    • codebase_urls: Link to the code behind the module.
    • reference_urls: Additional reference URLs.
  • evaluate.evaluator: The evaluator provides automated evaluation and only requires a model, a dataset, and a metric, in contrast to the metrics in EvaluationModule, which require model predictions. It has three main components: a model wrapped in a pipeline, a dataset, and a metric, and it returns the computed evaluation scores. Besides these three components, it may also require two mappings to align the dataset columns with the pipeline inputs and the pipeline labels with the dataset labels. This is an experimental feature; currently, only text classification is supported. See the evaluator sketch after this list.
  • evaluate-cli: The community can add custom metrics by adding the necessary module script to a Space on the Hugging Face Hub. The evaluate-cli is a tool that simplifies this process by creating the Space, populating a template, and pushing it to the Hub. It also provides instructions to customize the template and integrate custom logic.
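Below is a minimal sketch of the core workflow described above (load, compute or add_batch, save, push_to_hub). The metric name, results path, model and dataset identifiers, and the exact push_to_hub keyword arguments are illustrative assumptions rather than values taken from these notes.

```python
# Minimal sketch of the metric workflow; names and values are placeholders.
import evaluate

# Load a metric module from the Hub (downloaded and cached on first use).
accuracy = evaluate.load("accuracy")

# Module metadata stored in evaluate.EvaluationModuleInfo:
print(accuracy.description)
print(accuracy.features)

# Either pass all inputs to compute() directly ...
results = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}

# ... or accumulate inputs batch by batch (or one by one with add())
# and call compute() at the end.
accuracy.add_batch(references=[0, 1], predictions=[0, 1])
accuracy.add_batch(references=[1, 0], predictions=[0, 0])
results = accuracy.compute()

# Save the results to JSON; passing a directory generates a file name
# automatically, alongside timestamp, git hash, library version, and Python path.
evaluate.save("./results/", experiment="demo run", **results)

# Push the score to a model card on the Hub (identifiers are placeholders;
# the keyword arguments follow the library documentation and may differ).
evaluate.push_to_hub(
    model_id="username/my-model",
    task_type="text-classification",
    task_name="Text Classification",
    dataset_type="imdb",
    dataset_name="IMDb",
    metric_type="accuracy",
    metric_name="Accuracy",
    metric_value=results["accuracy"],
)
```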
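And a hedged sketch of the experimental evaluator for text classification: the model checkpoint, dataset, and the column and label mapping arguments shown here are assumptions chosen for illustration.

```python
# Sketch of the experimental evaluator (text classification only).
import evaluate
from datasets import load_dataset
from transformers import pipeline

# Model wrapped in a pipeline, a dataset, and a metric.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))
metric = evaluate.load("accuracy")

task_evaluator = evaluate.evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric=metric,
    input_column="text",    # dataset column fed to the pipeline
    label_column="label",   # dataset column with the references
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # pipeline labels -> dataset labels
)
print(results)
```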

Main contributors:

@lvwerra, @sashavor, @NimaBoscarino, @ola13, @osanseviero, @lhoestq, @lewtun, @douwekiela