
Guide: adding a new task

Here we provide a step-by-step guide for adding a new task to the bigcode-evaluation-harness to evaluate code generation language models. The process is similar to adding tasks in lm-evaluation-harness, which inspired this repository, so this document is based on their task_guide. The Task class is the backbone of all tasks in this framework.

Setup

If you haven't already, fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:

# After forking...
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
git checkout -b <task-name>
pip install -r requirements.txt

Creating Your Task File

From the bigcode-evaluation-harness project root, copy over the new_task.py template to bigcode_eval/tasks.

cp template/new_task.py bigcode_eval/tasks/<task-name>.py

Task Heading

Open the file you've just created and add a multiline docstring on the first line with the following contents:

"""
<Paper title>
<Paper PDF URL>

<Short description of task>

Homepage: <URL to task's homepage>
"""

Data Handling

Downloading your Data

All data downloading and management is handled through the HuggingFace (HF) datasets API. So, if your dataset isn't already on the hub (see catalog), please consider adding it to make it accessible to a wider user base by following this new dataset guide.

Now that you have your HF dataset, you need to assign its path and name to your Task in the following fields:

class TaskName(...):
    DATASET_PATH = "..."
    DATASET_NAME = "..."

where DATASET_PATH is the name of the dataset as listed on the HF datasets Hub and DATASET_NAME is the name of the sub-task/configuration of the benchmark. If your task does not have any subsets, just set DATASET_NAME = None.
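
As a sketch, a task built on the openai_humaneval dataset from the Hub, which has no sub-task, would set these fields as follows (the class name is a placeholder):

class TaskName(Task):
    DATASET_PATH = "openai_humaneval"  # dataset id on the HF Hub
    DATASET_NAME = None  # this dataset has no sub-task/configuration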

Next, you need to load the evaluation split of the dataset in the get_dataset method. For example:

def get_dataset(self):
    return self.dataset["test"]

You might also need to redefine some class arguments, like stop_words, which defines the stop words used as stopping criteria during code generation, and requires_execution, which defines whether the task requires code execution or not.

    def __init__(self):
        super().__init__(
            stop_words=["\n"],
            requires_execution=True,
        )

Processing Documents

Then you need to format your document into a single query prompt, without the answer, to be sent to the language model in the get_prompt method.

It takes a single doc example of type dict with str key-value members.

def get_prompt(self, doc):
    return ""

If the prompt involves few-shot examples, you first need to save them in a JSON file <task_name>_few_shot_prompts.json in bigcode_eval/tasks/few_shot_examples and then load them in the fewshot_examples method like this:

def fewshot_examples(self):
    # `json` must be imported at the top of your task file
    with open("bigcode_eval/tasks/few_shot_examples/<task_name>_few_shot_prompts.json", "r") as file:
        examples = json.load(file)
    return examples
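
You can then prepend these examples to the query, for instance in get_prompt. The "prompt" keys below are only an assumption about how you chose to store the few-shot file and the dataset fields:

def get_prompt(self, doc):
    # Hypothetical layout: the JSON file holds a pre-formatted few-shot
    # prefix under the "prompt" key.
    examples = self.fewshot_examples()
    return examples["prompt"] + doc["prompt"].strip()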

The prompt will be sent to the language model, and the generation will be evaluated against ground-truth solutions or unit tests. You need to load them from the doc in the get_target method.

def get_target(self, doc):
    return ""

Postprocessing & Evaluation

The solutions generated by the language model often require postprocessing to remove unnecessary text and get executable code. This is done in the postprocess_generation function. It takes as input the model generation generation and the index idx of the document the generation belongs to in the dataset (the index is not needed in most cases).

def postprocess_generation(self, generation, idx):
    return ""

The evaluation happens in the process_results function. This function takes as arguments the list of generations for all selected problems of the benchmark, generations, and their references, references, and returns a dictionary of metrics and their values.

def process_results(self, generations, references):
    return {}

You need to load your metric and run it. Check the Hugging Face evaluate library for the available metrics. For example, code_eval for pass@k, BLEU for the BLEU score, and apps_metric are already implemented. If you cannot find your desired metric, you can either add it to the evaluate library or implement it in the bigcode_eval/tasks/custom_metrics folder and import it from there.
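
For an execution-based task, a minimal sketch using the code_eval metric from the evaluate library could look like this (code_eval requires setting the HF_ALLOW_CODE_EVAL=1 environment variable because it executes untrusted code):

import os
from evaluate import load

def process_results(self, generations, references):
    # `generations` is a list of candidate lists (one list per problem) and
    # `references` holds the corresponding unit tests.
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # opt in to code execution
    code_metric = load("code_eval")
    pass_at_k, _ = code_metric.compute(
        references=references,
        predictions=generations,
        k=[1, 10],
    )
    return pass_at_k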

Registering Your Task

Now's a good time to register your task to expose it for usage. All you'll need to do is import your task module in bigcode_eval/tasks/__init__.py and provide an entry in the TASK_REGISTRY dictionary with the key as the name of your benchmark task (in the form it'll be referred to in the command line) and the value as the task class. See how it's done for other tasks in the file.
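
For illustration (mytask and MyTask are placeholder names), the entry in bigcode_eval/tasks/__init__.py would look roughly like:

from bigcode_eval.tasks import mytask  # your new module

TASK_REGISTRY = {
    # ... existing tasks ...
    "mytask": mytask.MyTask,
}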

Task submission

Running Unit Tests

To run the entire test suite, use:

pytest

Fine-tuning

Few-shot tasks are easier to conduct, but if you need to add a fine-tuning script for your task, you can create a folder for it in the finetuning folder and use a training and evaluation script similar to those of the other tasks.

Code formatting

You can format your changes and run the standard black checks:

black bigcode_eval/tasks/<task-name>.py

Task documentation

Please document your task in the docs with the parameters advised in the literature for its execution, as it's done for the other benchmarks.

Pull request

Please specify in your pull request whether you followed the original paper's approach to build the prompts or whether some changes were introduced (especially if you build few-shot examples). Ideally, you can evaluate some public models, compare the scores to the published results, and see if they match.

If there are no published results for your task, make sure the evaluation works properly by testing some samples with a good code generation model such as InCoder-1B. During the experiments you have the option to save generation.json and references.json; take a look to see if the generations are properly cleaned and, for match-based evaluations for example, are somewhat close to the references.

Now push your work and make a pull request! Thanks for the contribution 🚀.