
A benchmark framework for large language models (LLMs) on scholarly manuscript revision

This repository contains code used to evaluate the effectiveness of prompts and LLMs for scholarly manuscript revision. Initially, the goal of these evaluations is to improve the prompts used in the Manubot AI Editor, a tool for Manubot that uses AI to help authors revise their manuscripts automatically.

Under the hood, it uses:

  • promptfoo for test configuration, running evaluations, and presenting comparisons.
  • Ollama for managing local models.
  • Python for basic scripting and coordination.

Setup

Install software requirements

  1. Install Miniconda.
  2. Create conda environment:
    conda env create -f environment.yml
    conda activate manubot-ai-editor-evals
  3. Install the last tested promptfoo version:
    npm install -g promptfoo@0.47.0
  4. Install this package in editable mode:
    pip install -e .
  5. Install Ollama. The latest version we tested is v0.1.32, which on Linux (amd64) you can install with:
    sudo curl -L https://github.com/ollama/ollama/releases/download/v0.1.32/ollama-linux-amd64 -o /usr/bin/ollama
    sudo chmod +x /usr/bin/ollama
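
As a quick sanity check, you can confirm that both command-line tools ended up on your PATH (exact version output will vary):

which promptfoo ollama
ollama --version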

Start required processes

  1. Activate the conda environment if you haven't already:
    conda activate manubot-ai-editor-evals
  2. If Ollama is not already running automatically, start it in a separate terminal (no need to activate the conda environment there):
    ollama serve
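
Once the server is up, you can confirm it is reachable by asking it to list the locally available models:

ollama list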

Select models

promptfoo supports a large selection of models from different providers. This tool lists a curated set of models in src/run.py, focusing on OpenAI ChatGPT and local models served by Ollama.

This list is used when running the script commands below. To add other models supported by promptfoo/Ollama, include their IDs in that list. To select specific models for a pull, evaluation, or view, comment or uncomment their entries.

Pull local models

Before you can run models locally, you have to pull them with Ollama.

python src/run.py --pull
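
The --pull flag pulls every model listed in src/run.py. If you need to fetch a single model manually, the equivalent Ollama command is a plain pull (the model id below is only an example, taken from the cache queries further down in this README):

ollama pull mixtral:8x22b-instruct-v0.1-q5_1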

Configure access to remote models

Provide an API key for the service you wish to use as an environment variable:

In .env file:

API_KEY_NAME="API_KEY_VALUE"

or in CLI:

export API_KEY_NAME="API_KEY_VALUE"

  Service     API_KEY_NAME
  -------     ------------
  OpenAI      OPENAI_API_KEY
  Replicate   REPLICATE_API_TOKEN

(Per promptfoo docs)
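
For example, to configure OpenAI access for the current shell session (the key value is a placeholder):

export OPENAI_API_KEY="sk-..."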

Evaluations

Evaluations are organized into folders by manuscript section. For example, for the abstract and the introduction sections, the structure could be:

├── abstract
│   ├── cases
│   │   └── phenoplier
│   │       ├── inputs
│   │       ├── outputs
│   │       └── promptfooconfig.yaml
│   └── prompts
│       ├── baseline.txt
│       └── candidate.txt
├── introduction
│   ├── ...

Under each section, there are two subfolders: 1) cases and 2) prompts.

A case corresponds to text from an existing manuscript (journal article, preprint, etc.) used for testing. In the above example, phenoplier corresponds to the published PhenoPLIER journal article. A case contains a promptfoo configuration file (promptfooconfig.yaml) with test cases and assertions, and an outputs folder with the results of the evaluations across different models.

The prompts folder contains the prompts to be evaluated for this manuscript section. At the moment, we are using 1) a candidate prompt containing a complex set of instructions and 2) a baseline prompt containing more basic instructions to compare the candidate prompt against.

Usage

First, move to the directory of the section and case of interest. Then run the src/run.py script from there. For example, for the abstract section and the phenoplier case:

cd abstract/cases/phenoplier/
python ../../../src/run.py

Run evaluation

Running the script without any flags executes the evaluations.

python ../../../src/run.py

By default, all queries to the models are cached in src/cache/*.db (SQLite) for faster and cheaper subsequent runs.
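
To check what has been cached so far, you can list the cache directory (from the repository root); deleting these files forces fresh, and for remote models potentially costly, queries on the next run:

ls -lh src/cache/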

Visualize results

To explore the results of your evaluations across all models in a web UI table, run:

python ../../../src/run.py --view

If you are interested only in a specific model, such as gpt-3.5-turbo-0125, run:

promptfoo view outputs/gpt-3.5-turbo-0125/
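
To see which model output folders are available for the current case, you can list the outputs directory:

ls outputs/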

See the promptfoo documentation for more details on viewing results.

Misc

If you need to clear promptfoo's cache, you can run:

promptfoo cache clear

Advanced

SQLite cache

If the cache files located in src/cache/*.db (SQLite) need to be updated, you can open a .db file with sqlite3:

sqlite3 src/cache/llm_cache-rep0.db
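
Before changing anything, it can be useful to inspect the cache table's layout and which model identifiers it currently stores; for example, from the terminal (the table and column names are the ones used in the queries below):

sqlite3 src/cache/llm_cache-rep0.db ".schema full_llm_cache"
sqlite3 src/cache/llm_cache-rep0.db "SELECT DISTINCT llm FROM full_llm_cache;"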

Updating cached queries

You can run queries to update the cache, such as:

-- Update the model name for a specific prompt
UPDATE full_llm_cache
SET llm = replace(llm, 'mixtral-8x22-fix', 'mixtral:8x22b-instruct-v0.1-q5_1')
WHERE llm LIKE '%mixtral-8x22%';

Deleting old entries

To delete certain entries (such as models that are no longer used):

DELETE FROM full_llm_cache
WHERE llm LIKE "%('model', 'mixtral:8x22b-instruct-v0.1-q4_1')%";
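
Since these LIKE patterns can match many rows, it is worth counting the matches before running an UPDATE or DELETE; for example, from the terminal (the pattern is only illustrative):

sqlite3 src/cache/llm_cache-rep0.db \
  "SELECT count(*) FROM full_llm_cache WHERE llm LIKE '%mixtral:8x22b-instruct-v0.1-q4_1%';"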

Vacuuming

From the terminal:

sqlite3 src/cache/llm_cache-rep0.db "VACUUM;"
