
Initial release of `evaluate`

@lvwerra released this 31 May 13:57

Release notes

These are the release notes for the initial release of the Evaluate library.

Goals

Goals of the Evaluate library:

  • reproducibility: reporting and reproducing results is easy
  • ease-of-use: access to a wide range of evaluation tools with a unified interface
  • diversity: provide a wide range of evaluation tools with metrics, comparisons, and measurements
  • multimodal: models and datasets of many modalities can be evaluated
  • community-driven: anybody can add custom evaluations by hosting them on the Hugging Face Hub

Release overview:


  • evaluate.load(): The load() function is the main entry point into evaluate and lets you load evaluation modules from a local folder, the evaluate repository, or the Hugging Face Hub. It downloads, caches, and loads the evaluation module and returns an evaluate.EvaluationModule.
  • evaluate.save(): With save() a user can save evaluation results to a JSON file. In addition to the results from an evaluate.EvaluationModule, it can store additional parameters and automatically records the timestamp, git commit hash, library version, and Python path. One can either provide a directory for the results, in which case a file name is created automatically, or an explicit file name for the result.
  • evaluate.push_to_hub(): The push_to_hub() function pushes the results of a model evaluation to the model card on the Hugging Face Hub. The model, dataset, and metric are specified so that they can be linked on the Hub.
  • evaluate.EvaluationModule: The EvaluationModule class is the base class for all evaluation modules. There are three module types: metrics (to evaluate models), comparisons (to compare models), and measurements (to analyze datasets). Inputs can either be accumulated with add (a single input) or add_batch (a batch of inputs) followed by a final compute call, or passed to compute directly. Under the hood, Apache Arrow stores and loads the input data used to compute the scores. A usage sketch follows this list.
  • evaluate.EvaluationModuleInfo: The EvaluationModuleInfo class stores the attributes of an evaluation module:
    • description: A short description of the evaluation module.
    • citation: A BibTeX string for citation, when available.
    • features: A Features object defining the input format. The inputs provided to add, add_batch, and compute are tested against these types and an error is thrown in case of a mismatch.
    • inputs_description: This is equivalent to the module's docstring.
    • homepage: The homepage of the module.
    • license: The license of the module.
    • codebase_urls: Link to the code behind the module.
    • reference_urls: Additional reference URLs.
  • evaluate.evaluator: The evaluator provides automated evaluation and only requires a model, a dataset, and a metric, in contrast to the metrics in EvaluationModule, which require model predictions. It has three main components: a model wrapped in a pipeline, a dataset, and a metric, and it returns the computed evaluation scores. Besides these three components, it may also require two mappings to align the dataset columns with the pipeline inputs and the pipeline labels with the dataset labels. This is an experimental feature; currently, only text classification is supported. See the evaluator sketch after this list.
  • evaluate-cli: The community can add custom metrics by adding the necessary module script to a Space on the Hugging Face Hub. The evaluate-cli is a tool that simplifies this process by creating the Space, populating a template, and pushing it to the Hub. It also provides instructions to customize the template and integrate custom logic.
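Below is a minimal sketch of the core workflow described above (load, compute or add_batch, save, push_to_hub). The metric name, results path, model and dataset identifiers, and the exact push_to_hub keyword arguments are illustrative assumptions rather than values taken from these notes.

```python
# Minimal sketch of the metric workflow; names and values are placeholders.
import evaluate

# Load a metric module from the Hub (downloaded and cached on first use).
accuracy = evaluate.load("accuracy")

# Module metadata stored in evaluate.EvaluationModuleInfo:
print(accuracy.description)
print(accuracy.features)

# Either pass all inputs to compute() directly ...
results = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(results)  # {'accuracy': 0.75}

# ... or accumulate inputs batch by batch (or one by one with add())
# and call compute() at the end.
accuracy.add_batch(references=[0, 1], predictions=[0, 1])
accuracy.add_batch(references=[1, 0], predictions=[0, 0])
results = accuracy.compute()

# Save the results to JSON; passing a directory generates a file name
# automatically, alongside timestamp, git hash, library version, and Python path.
evaluate.save("./results/", experiment="demo run", **results)

# Push the score to a model card on the Hub (identifiers are placeholders;
# the keyword arguments follow the library documentation and may differ).
evaluate.push_to_hub(
    model_id="username/my-model",
    task_type="text-classification",
    task_name="Text Classification",
    dataset_type="imdb",
    dataset_name="IMDb",
    metric_type="accuracy",
    metric_name="Accuracy",
    metric_value=results["accuracy"],
)
```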
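And a hedged sketch of the experimental evaluator for text classification: the model checkpoint, dataset, and the column and label mapping arguments shown here are assumptions chosen for illustration.

```python
# Sketch of the experimental evaluator (text classification only).
import evaluate
from datasets import load_dataset
from transformers import pipeline

# Model wrapped in a pipeline, a dataset, and a metric.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))
metric = evaluate.load("accuracy")

task_evaluator = evaluate.evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric=metric,
    input_column="text",    # dataset column fed to the pipeline
    label_column="label",   # dataset column with the references
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # pipeline labels -> dataset labels
)
print(results)
```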

Main contributors:

@lvwerra, @sashavor, @NimaBoscarino, @ola13, @osanseviero, @lhoestq, @lewtun, @douwekiela