

Initial release of BigCode Evaluation Harness

25 May 09:37
04b3493

Release notes

These are the release notes of the initial release of the BigCode Evaluation Harness.

Goals

The framework aims to achieve the following goals:

  • Reproducibility: Making it easy to report and reproduce results.
  • Ease-of-use: Providing access to a diverse range of code benchmarks through a unified interface.
  • Efficiency: Leveraging data parallelism on multiple GPUs to generate benchmark solutions quickly.
  • Isolation: Using Docker containers for executing the generated solutions.

Release overview

The framework supports the following features & tasks:

  • Features:

    • Any autoregressive model available on the Hugging Face Hub can be used, but we recommend code generation models trained specifically on code.
    • We provide multi-GPU text generation with accelerate for tasks that require many samples per problem, and Dockerfiles for executing the generated solutions inside Docker containers for security and reproducibility (a minimal generation sketch follows the task list).
  • Tasks:

    • 4 Python code generation tasks (with unit tests): HumanEval, APPS, MBPP and DS-1000, in both completion (left-to-right) and insertion (FIM) modes (the pass@k metric typically used for such unit-test tasks is sketched after this list).
    • MultiPL-E evaluation suite (HumanEval translated into 18 programming languages).
    • PAL (Program-aided Language Models) evaluation for grade-school math problems: GSM8K and GSM-HARD. These problems are solved by generating reasoning chains of text and code (see the execution sketch after this list).
    • Code-to-text task from CodeXGLUE (zero-shot & fine-tuning) for 6 languages: Python, Go, Ruby, Java, JavaScript and PHP, as well as the documentation translation task from CodeXGLUE.
    • CoNaLa for Python code generation (2-shot setting and evaluation with BLEU score).
    • Concode for Java code generation (2-shot setting and evaluation with BLEU score).
    • 3 multilingual downstream classification tasks: Java Complexity prediction, Java code equivalence prediction, C code defect prediction.
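The multi-GPU generation mentioned under Features builds on accelerate. Below is a minimal sketch of that pattern under illustrative assumptions (the checkpoint name, prompts and generation settings are examples, not harness defaults); the harness's actual generation loop handles batching, prompt construction and post-processing differently.

```python
# Minimal sketch of multi-GPU sampling with accelerate: each process generates
# completions for its own shard of the prompts. The checkpoint, prompts and
# generation settings below are illustrative.
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()
checkpoint = "bigcode/santacoder"  # any causal LM on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
model.to(accelerator.device)
model.eval()

prompts = ["def fibonacci(n):", "def is_prime(n):"]
# Simple sharding: process i handles every num_processes-th prompt.
for prompt in prompts[accelerator.process_index :: accelerator.num_processes]:
    inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
    outputs = model.generate(
        **inputs, do_sample=True, temperature=0.2, max_new_tokens=64
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Launched with `accelerate launch <script>.py`, this runs one such process per available GPU.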

More details about each task can be found in docs/README.md.
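For the unit-test tasks such as HumanEval and MBPP, the usual metric is pass@k, estimated from n samples per problem of which c pass the tests. A minimal sketch of the standard unbiased estimator from the Codex paper (Chen et al., 2021) follows; the harness's own implementation may differ in details such as vectorization.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of samples generated for a problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset of the samples contains a passing one.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))    # 0.185 (equals c / n for k = 1)
print(pass_at_k(n=200, c=37, k=100))  # close to 1.0
```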
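The PAL tasks score a completion by executing the generated program and comparing its result to the reference answer, rather than parsing generated text. A minimal sketch of that idea is below; the generated snippet and the bare exec() call are illustrative only, since the harness runs generated code inside containers.

```python
# Minimal sketch of PAL-style scoring: the model's completion is a Python
# program whose return value is compared to the reference answer. The snippet
# and the bare exec() are illustrative; the harness executes generated code
# in an isolated environment.
generated = '''
def solution():
    # "Leah had 32 chocolates and her sister had 42. If they ate 35,
    # how many pieces do they have left in total?"
    leah = 32
    sister = 42
    eaten = 35
    return leah + sister - eaten
'''

namespace = {}
exec(generated, namespace)
predicted = namespace["solution"]()
reference = 39
print(predicted == reference)  # True
```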

Main Contributors

Full Changelog: https://github.com/bigcode-project/bigcode-evaluation-harness/commits/v0.1.0