
Experimenting with fine-tuning Generative Pre-trained Transformer (GPT) models to build better intuition and knowledge on how best to fine-tune them and to understand their performance on downstream tasks.


Lit-LLaMA

⚡ Purpose of Lit-LLaMA-QA ⚡

Goal 1: By using academic datasets, we can build intuition on how to improve fine-tuning and understand what works. For example, answering questions such as "Does LoRA really work?" is very difficult with generative responses, as human evaluation is challenging and time-consuming. We want grounded feedback on the proposed training methodology, so we rely on academic datasets first to develop intuition about which practices to follow.

Goal 2: To gauge how performant GPT models are, especially under PEFT methods. With academic datasets, we at least have baseline results to compare against while experimenting with different methods. We are also curious about how easy it would be to reach SOTA results.

Please jump to Current takeaways from experiments for some of our learnings from experimenting with GPT models, or to Academic Paper Results and comparison (SQuAD 2.0) for our experiment results relative to published SOTA research.

Find the original lit-llama repository here.

SQuAD 2.0

We are focusing on QA datasets first, as the future goal is to train abstractive QA with dialogue-based replies (hard to evaluate; there is no standard benchmark for this). To start off, our first target dataset is SQuAD 2.0.

(A) Dataset detail

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable (SQuAD 2.0 reference).

The dataset consists of roughly 150,000 questions, about 50,000 of which are unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. A dev dataset and the official evaluation script are provided for evaluation.
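
To feed this data to a decoder-only model, each SQuAD 2.0 record has to be flattened into a prompt/label pair. Below is a minimal sketch of one way to do this; the field names follow the official SQuAD 2.0 JSON, but the prompt template and the `format_example` helper are illustrative assumptions, not necessarily what setup_squad.md uses.

```python
# Minimal sketch (assumption, not this repo's exact template): converting a
# SQuAD 2.0 record into a prompt/label pair for a decoder-only model.
def format_example(context: str, question: str,
                   answers: list, is_impossible: bool) -> dict:
    prompt = (
        "Answer the question using only the context below. "
        "If it cannot be answered from the context, reply 'unanswerable'.\n\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    # SQuAD 2.0 marks the adversarial questions with is_impossible = True.
    label = "unanswerable" if is_impossible else answers[0]
    return {"prompt": prompt, "label": label}


print(format_example(
    context="The Normans gave their name to Normandy, a region in France.",
    question="In what country is Normandy located?",
    answers=["France"],
    is_impossible=False,
))
```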

(B) Metric

Exact match (EM) and F1 score, as computed by the official SQuAD 2.0 evaluation script.
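
For intuition, here is a simplified sketch of how these two metrics are computed. The official evaluation script (which we use for all reported numbers) additionally takes the maximum score over all gold answers and handles no-answer questions, so reported results should always come from that script.

```python
# Simplified sketch of SQuAD-style EM and F1 between one prediction and one gold answer.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the France", "France"), f1_score("in France", "France"))
```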

Experiments

Please check out our experiment results here.

Experiments are done without tweaking parameters. Results are provided without bells or whistles; we have not done anything extra to boost the results, such as ensembling (generation/model), probability thresholding on unanswerable questions, etc.

For instructions on setting up fine-tuning and replicating our SQuAD 2.0 experiments, view setup_squad.md.

Results comparison with SOTA Research Paper

For comparison, we compare only against the best research out there to get an idea of how good the performance of fine-tuning LLaMA is. Comparisons are made on the dev set (as reported in the respective papers and in our own experiments).

| Model | F1 | Reference |
| --- | --- | --- |
| Ours (7B) | 88.13 | Full fine-tune |
| Ours (30B) | 90.14 | LoRA |
| FLAN 137B | 43.1 | 3-shot |
| GPT-3 | 69.8 | 16-shot |
| BERT | 83.1 | Supervised |
| Retrospective Reader | 91.3 | Supervised |
| DeBERTa (large) | 90.7 | Supervised |
| DeBERTa (base) | 86.2 | Supervised |
| DeBERTa V3 | 91.16 | Supervised |

The DeBERTa V3 paper claims an F1 score of 91.5; however, the current best on the dev set verified by paperswithcode is deepset/deberta-v3-large-squad2 with F1 91.16. The official eval script (the one we are using) gives a slightly lower result on their model; refer to the Hugging Face repo.

Models that were specifically developed for, or are better suited to (architecture, ablation studies), the task of extractive QA (e.g. SQuAD 2.0):

  1. BERT
  2. Retro-Reader
  3. DeBERTa
  4. DeBERTa V3

Current takeaways from experiments

  1. How performant is fine-tuning using LoRA?
  • Competitive results on downstream tasks can be achieved just by using LoRA for fine-tuning; our results are fairly close to the best (a minimal sketch of the LoRA idea is given after this list).
  • The fine-tuned GPT results are impressive considering that the task of a GPT model (decoder-only) is to generate the next token, which is less naturally suited to extractive QA than a BERT-based model (encoder-only) that can directly classify the start and end tokens of the answer span within the context.
  2. When to use full fine-tuning versus LoRA?
  • In our experiments, full fine-tuning gave even better results than LoRA for a small language model. The LoRA paper claims: "LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters". This claim may not translate as well to smaller models, as per our experiments; however, the degradation in performance is small.

  • However, LoRA requires far less training time and computation; you can even fine-tune a 7B GPT model on consumer GPUs. Thus, we need to determine whether the trade-off between performance and training cost/time is worth it.

  • Typically, for models above 7B parameters, full fine-tuning may not be feasible at all for most people due to GPU VRAM requirements; view the Hardware Requirement page for a rough idea of what is needed.

  • [Information to be added: Comparison of time taken for loss to converge for full finetuning versus LoRA]

  3. How easy is it to fine-tune GPT models?
  • Fine-tuning GPT models is easy to set up, and the loss converges quickly. Most experiments took between a few hours and 2 days to reach their lowest validation loss.
  • For example, fine-tuning the 30B model using LoRA on 2x 80GB A100 (DDP) took us only approximately 5 hours to reach the lowest validation loss.
  4. How does quantisation affect performance?
  • Surprisingly, it does not affect performance much. You can judge the full results over at Experiment 1; a summary is provided below:
| dtype | F1 | EM |
| --- | --- | --- |
| bfloat16 | 86.67 | 83.27 |
| int8 | 86.23 | 82.70 |
| int4 (GPTQ) | 85.07 | 81.32 |
  5. How does LoRA rank affect performance?
  • [Information to be added: Rank 16, Rank 32]
  6. What other PEFT methods can be as efficient and performant as LoRA?
  • [Information to be added:]
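
As a reference for the LoRA-related takeaways above (items 1, 2 and 5), below is a minimal sketch of the LoRA idea itself: the pretrained weight is frozen and only a low-rank update, scaled by alpha/r, is trained. This is illustrative only, assuming a plain PyTorch linear layer, and is not the exact implementation used in this repository.

```python
# Minimal LoRA sketch: W_eff = W + (alpha / r) * B @ A, with W frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight (and bias)
        # A is initialised with small random values, B with zeros, so the
        # update starts at zero and the model initially behaves like the base.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


layer = LoRALinear(nn.Linear(4096, 4096), r=16)
out = layer(torch.randn(1, 8, 4096))
print(out.shape)  # only lora_a and lora_b receive gradients during training
```

Only the two low-rank matrices are trainable, which is why LoRA fine-tuning fits on consumer GPUs even for 7B-parameter models.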

Future Work

  1. Fine-tune for abstractive question answering within a context length of 2048. Such a model will be more suitable for real-world applications.

  2. Try bigger language models

  • Experiment with the 13B, 30B, and 65B variants
  3. Experiment with more PEFT techniques
  • LoRA with different ranks
  • Prefix-tuning
  • Combining the ideas (LoRA + prefix-tuning), etc.
  4. Fine-tune the LM directly for Unified QA so that evaluation can be done on every QA dataset (paper inspiration).
