From 16bf7bb1f514f984d31f60e9574012016f76b9bd Mon Sep 17 00:00:00 2001 From: Trevor Grant Date: Tue, 5 Mar 2024 16:59:59 -0600 Subject: [PATCH] [LAB-307] Add paper link (#322) * [LAB-307] Add paper link Signed-off-by: Trevor Grant * [LAB-307] Add paper link Signed-off-by: Trevor Grant * [LAB-307] Add paper link Signed-off-by: Trevor Grant --------- Signed-off-by: Trevor Grant --- .../Training_a_LoRA_With_Instruct_Lab.ipynb | 1838 ++++++++--------- 1 file changed, 919 insertions(+), 919 deletions(-) diff --git a/notebooks/Training_a_LoRA_With_Instruct_Lab.ipynb b/notebooks/Training_a_LoRA_With_Instruct_Lab.ipynb index 14133abaa99b..a49c955d9dc1 100644 --- a/notebooks/Training_a_LoRA_With_Instruct_Lab.ipynb +++ b/notebooks/Training_a_LoRA_With_Instruct_Lab.ipynb @@ -1,921 +1,921 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "1iu1ofeHvdNK" - }, - "source": [ - "## Project Overview\n", - "\n", - "`instruct lab` is a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The \"lab\" in `instruct lab` stands for Large-scale Alignment for Chat Bots.\n", - "\n", - "It is an outgrowth of the paper [To be Updated when Paper is Available]().\n", - "\n", - "### Getting Started\n", - "\n", - "This notebook represents one step in the `instruct lab` pipeline- to see what else is involved, please check out https://github.com/instruct-lab/cli" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lWhvICmttxaS" - }, - "source": [ - "## Overview of this Notebook\n", - "\n", - "This notebook represents the `Model training` step of the guide found [here](https://github.com/instruct-lab/cli).\n", - "\n", - "But at the time of writing it's not.\n", - "\n", - "This notebook takes the output of `lab generate` (i.e. the synthetic data set generated), and trains a Low Rank Adapater (LoRA) on it.\n", - "\n", - "It will also do an inference to show you how the model preformed before any training was done, as well as after.\n", - "\n", - "Finally, it will give you a chance to interact with your model in two ways, one in this notebook (using the NVIDIA T4 generously supplied by Google and low/no cost) and two by giving you the option to convert your adapter to a format that will let you download it and use it with `llamma.cpp` on your laptop.\n", - "\n", - "***IMPORTANT***: make sure your notebook uses GPUs.\n", - "\n", - "**Google Collab**: In your notebook, click Runtime --> Change runtime type, and select *T4 GPU* and click save.\n", - "\n", - "**Kaggle**: Click on \"More settings\" (3 vertical dots at the top-right) --> Accelerator, and select *P100 GPU*.\n", - "\n", - "\n", - "![kaggle-more-settings](./images/kaggle/select-accelerator.png)\n", - "If you miss this step you'll see errors at the Loading model step.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## How to run this notebook\n", - "\n", - "Unless you have a spare GPU with 16GB+ of VRAM laying around the house,\n", - "you'll need to run this notebook on an external platform such as\n", - "[Kaggle](https://www.kaggle.com) or [Google Collab](https://colab.research.google.com/)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kfpUdXgDtt4X" - }, - "source": [ - "## Installing Dependencies" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "UoIgsYWpXuP4" - }, - "outputs": [], - "source": [ - "# installing dependencies\n", - "!pip install -q -U transformers accelerate peft datasets bitsandbytes trl" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "u4TN_1gbXeQV" - }, - "source": [ - "## Upload output from `lab generate`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hu1vWJawlgSb" - }, - "source": [ - "\n", - "## Uploading Generated Data\n", - "From your local machine, run the `lab generate` command per the [instructions in github](https://github.com/instruct-lab/cli/blob/main/README.md).\n", - "\n", - "Next, upload your data.\n", - "\n", - "### Uploading data on Kaggle\n", - "\n", - "1. Expand on the Input tab on the right of the screen.\n", - "\n", - "![input](./images/kaggle/input.png)\n", - "\n", - "\n", - "2. Click on the \"Upload\" button, then select \"New Dataset\".\n", - "\n", - "Upload button:\n", - "\n", - "![input-upload](./images/kaggle/input-upload.png)\n", - "\n", - "New Dataset:\n", - "\n", - "![input-new-dataset](./images/kaggle/new-dataset.png)\n", - "\n", - "3. From here, you'll be prompted to upload your local files. Go ahead and select all of the files generated from `lab generate`.\n", - "\n", - "![upload-file](./images/kaggle/input-drop-files.png)\n", - "\n", - "4. Navigate to the _training_ file that was generated, right click on your uploaded file, then select 'Copy Path'\n", - "\n", - "![input-files-copy-path](./images/kaggle/copy-file-path.png)\n", - "\n", - "5. Paste the copied value in the cell below.\n", - "\n", - "\n", - "### Uploading data in Google Collab\n", - "\n", - "To upload data in Google Colab,\n", - "\n", - "1. Click on the folder icon on the left of the screen.\n", - "\n", - " ![image.png](./images/collab-folder-icon.png)\n", - "\n", - "2. Click on the file with an up arrow in it icon, under it.\n", - "\n", - " ![image.png](./images/collab-file-upload-button.png)\n", - "\n", - "3. Navigate to the _training_ file that was generated, right click on your uploaded file, then select 'Copy Path'\n", - "\n", - " ![image.png](./images/collab-copy-path.png)\n", - "4. Paste the copied value in the cell below.\n", - "\n", - "Repeat this process for the 'test' file in the following cell." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NtPoolOzo-Ry" - }, - "source": [ - "#### Upload Training Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8cJEyqvWo8kV" - }, - "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "\n", - "# Get the file name\n", - "training_file_name = \"/content/train_ggml-merlinite-7b-0302-Q4_K_M_2024-02-27T16_57_20.jsonl\" #\"/paste/path/here\"\n", - "\n", - "train_dataset = load_dataset(\"json\", data_files=training_file_name, split=\"train\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0TfaaZjIpFUf" - }, - "source": [ - "#### Upload Testing Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "RH42-nUtmu8g" - }, - "outputs": [], - "source": [ - "# Get the file name\n", - "testing_file_name = \"/content/test_ggml-merlinite-7b-0302-Q4_K_M_2024-02-27T16_57_20.jsonl\"#\"/paste/path/here\"\n", - "\n", - "test_dataset = load_dataset(\"json\", data_files=testing_file_name, split=\"train\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QmUpLlK0XsAa" - }, - "source": [ - "Now we have loaded the output of `lab generate` into a 🤗 dataset. Let's take a quick peek." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0HSQHaNPoVC8" - }, - "outputs": [], - "source": [ - "train_dataset.to_pandas().head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IvafF2-crKpD" - }, - "source": [ - "## Formatting Our Data and Prepping the `SFTTrainer`\n", - "\n", - "Our dataset looks good, but in it's current state, it is a data frme of three columns. For training, we need each record to be a string, specifically, we want it in the following format:\n", - "\n", - "```\n", - "<|system|>\n", - "{system}\n", - "<|user|>\n", - "{user}\n", - "<|assistant|>\n", - "{assistant}<|endoftext|>\n", - "```\n", - "\n", - "\n", - "When training happens (a few cells later), the dataset will be converted into a list of these strings. We will also define a response template `\"\\n\\n\"` that will tell the trainer to split the string there, and everything before will be the prompt, and everything after will be generated.\n", - "\n", - "The 🤗 `trl`'s `SFTTrainer` has the concept of a `formatting_prompts_func` and we'll use this to format our data. 
The conversion does not happen now, but later when we run `trainer.train()`\n", - "\n", - "From more information on 🤗's `SFTTrainer`, please check out their docs [here](https://huggingface.co/docs/trl/main/en/sft_trainer).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SlfPxJfSXbI5" - }, - "outputs": [], - "source": [ - "from transformers import AutoTokenizer\n", - "\n", - "model_name = \"ibm/merlinite-7b\" # TODO: Make this a drop down option\n", - "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n", - "tokenizer.pad_token = tokenizer.eos_token\n", - "\n", - "def formatting_prompts_func(example):\n", - " output_texts = []\n", - " for i in range(len(example['system'])):\n", - " text = f\"<|system|>\\n{example['system'][i]}\\n<|user|>\\n{example['user'][i]}\\n<|assistant|>\\n{example['assistant'][i]}<|endoftext|>\"\n", - " output_texts.append(text)\n", - " return output_texts\n", - "\n", - "response_template = \"\\n<|assistant|>\\n\"\n", - "\n", - "from trl import DataCollatorForCompletionOnlyLM\n", - "\n", - "response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]\n", - "collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FDFuvvw3YpDe" - }, - "source": [ - "In the cell above, you may see a user warning:\n", - "> `The secret `HF_TOKEN` does not exist in your Colab secrets...`\n", - "\n", - "It can safely be ignored.\n", - "\n", - "Note- the `formatting_prompts_func` runs when we execute `trainer.train()` nothing has been formatted yet." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Z7HLxsTIY8Ho" - }, - "source": [ - "## Loading the (Quantized) Model\n", - "\n", - "\n", - "The best source of truth of this is going to be found at the following links:\n", - "\n", - "* [huggingface blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes)\n", - "* [original paper](https://arxiv.org/abs/2305.14314)\n", - "\n", - "But alas, I'm sure to get some push back about Llama.cpp quantized models (things that end in .gguf).\n", - "\n", - "`bitsandbytes` will quantize the model on loading. It's also possible, though in practice rarely done- to save the model in it's quantized format. Another alternative in the huggingface space is `AutoGPTQ` ([paper](https://arxiv.org/abs/2210.17323) [blog post](https://huggingface.co/blog/gptq-integration)).\n", - "\n", - "Llama.cpp also allows quantizatoin, but the idea is that you _will_ be using the CPU bc you know the model at hand is to big for your GPU.\n", - "\n", - "An analogy that isn't wildly innacurate is `bitsandbytes` and `AutoGPTQ` presume that you will be using a (CUDA-based) GPU, and that you can set it in an emergency to use CPU instead of just rolling over and dying.\n", - "\n", - "`Llama.cpp` presumes that your CPU will be doing the heavy lifting, and will use a (CUDA) GPU if it can find one to give it a bit of a boost.\n", - "\n", - "OK- what does that mean in practice?\n", - "1. Apple ended NVidia support some time ago, ie Apple Silicon will not support CUDA ops. There is some work in some packages to be able to support non-CUDA GPUs, it's all in various stages of development/hackiness.\n", - "2. [This person](https://rentry.org/cpu-lora) _did_ get qLoRA training with Llama.cpp working. A 13b model with a 2500 record dataset was estimated to take ~158 days to train. 
Which is a non-starter- I will trust they did their homework.\n", - "3. **High level** Llama.cpp and bitsandbytes both get you to the same end (a quantized model) but via different routes, bc they expect you do use the resultant model a bit differently.\n", - "4. **So do I need to quantize my model via both routes** no.\n", - "\n", - "In the next cell we're going to download and load the model.\n", - "\n", - "It may take a little time to complete (around 10 to 15 minutes). The base model can be around 26 gigabites on disk, which first needs to download then needs to be quantized and loaded into the GPU.\n", - "\n", - "So run this cell then go grab a cup of coffee. ☕" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jEJpqPomY0Ot" - }, - "outputs": [], - "source": [ - "# Loading the model\n", - "import torch\n", - "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n", - "\n", - "bnb_config = BitsAndBytesConfig(\n", - " load_in_4bit=True,\n", - " bnb_4bit_quant_type=\"nf4\",\n", - " bnb_4bit_use_double_quant=True,\n", - " bnb_4bit_compute_dtype=torch.float16 # if not set will throw a warning about slow speeds when training\n", - ")\n", - "\n", - "model = AutoModelForCausalLM.from_pretrained(\n", - " model_name,\n", - " quantization_config=bnb_config,\n", - " trust_remote_code=True\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2tnv-orbaw_a" - }, - "source": [ - "## Sanity Checking the Model\n", - "\n", - "We want to see how the model behaves _before_ we train a LoRA on it, so we can (by inspection) see if the LoRA is doing anything.\n", - "\n", - "You might want to change the user prompt `\"In excruciating detail, explain to me the nuances of who runs Barter Town.\"` to something more related to _your_ usecase.\n", - "\n", - "We also define the `create_prompt` function, that formats and adds all of the boiler plate your prompts needs.\n", - "\n", - "Note our function also allows you to redefine the `system` prompt/parameter. The default is the one included in `lab generate` content, but you could have some fun tinkering with that too (for instance, adding `, and you always talk like a pirate.` to the end.)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dUx2kh61ZLqs" - }, - "outputs": [], - "source": [ - "def create_prompt(user:str,\n", - " system: str = \"You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. 
You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\"):\n", - " return f\"\"\"\\\n", - "<|system|>\n", - "{system}\n", - "<|user|>\n", - "{user}\n", - "<|assistant|>\n", - "\"\"\"\n", - "\n", - "from transformers import StoppingCriteria, StoppingCriteriaList\n", - "\n", - "class StoppingCriteriaSub(StoppingCriteria):\n", - "\n", - " def __init__(self, stops = [], encounters=1):\n", - " super().__init__()\n", - " self.stops = [stop.to(\"cuda\") for stop in stops]\n", - "\n", - " def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:\n", - " for seq in input_ids:\n", - " for stop in self.stops:\n", - " if stop == seq[-1]:\n", - " return True\n", - " return False\n", - "\n", - "stop_words = ['<|endoftext|>', '<|assistant|>']\n", - "stop_words_ids = [tokenizer(stop_word, return_tensors='pt', add_special_tokens=False)['input_ids'].squeeze() for stop_word in stop_words]\n", - "stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])\n", - "\n", - "def model_generate(user):\n", - " text = create_prompt(user = user)\n", - "\n", - " input_ids = tokenizer(text, return_tensors=\"pt\").input_ids.to(\"cuda\")\n", - " outputs = model.generate(input_ids=input_ids,\n", - " max_new_tokens=256,\n", - " pad_token_id=tokenizer.eos_token_id,\n", - " temperature=0.7,\n", - " top_p=0.9,\n", - " stopping_criteria=stopping_criteria,\n", - " do_sample=True)\n", - " return tokenizer.batch_decode([o[:-1] for o in outputs])[0]\n", - "\n", - "print(\n", - " model_generate(\"In excruciating detail, explain to me the nuances of who runs Barter Town.\")\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hJQgWJQbLMPn" - }, - "source": [ - "we run the model before LoRA on the test set and save the outputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "EJ_nakmSLHLu" - }, - "outputs": [], - "source": [ - "assistant_old_lst = [\n", - " model_generate(d[\"user\"]).split(response_template.strip())[-1].strip() for d in test_dataset\n", - "]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_IE988CoeLGE" - }, - "source": [ - "## Configuring the LoRA\n", - "\n", - "Recall the paper on LoRA, which is the gospel https://arxiv.org/abs/2106.09685\n", - "\n", - "> From this point forth, we shall be leaving the firm foundation of fact and journeying together through the murky marshes of memory into thickets of wildest guesswork.\n", - "-- Albus Dumbledore\n", - "\n", - "There are 4 common 'knobs' to adjust when training a LoRA/qLoRA - note from this point on, I'm just going to refer to everything as LoRA- a LoRA proved a better method of finetuning, by just targeting certain modules, instead of the entire network. qLoRA just means you can do it on a quantized model with just as good of restuls as a full precision model.\n", - "\n", - "Which is a good segway to our first 'knob': `target_modules`.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fqTCB-cVse29" - }, - "source": [ - "### Getting the Attention Layers\n", - "\n", - "The cell immediately below will print out all of the attention modules (in case you are trying to get creative and use a different model). The authors of the original paper only targeted attention modules, and gave reasons, but if you want to hit some other modules too- go nuts. Be advised, a LoRA that targets _all_ modules, is just fine-tuning. Ie. 
the LoRA technique is to only tune a subset of the modules.\n", - "\n", - "For `ibm/merlinite-7b` we have:\n", - "```\n", - "target_modules=[\n", - " \"q_proj\",\n", - " \"k_proj\",\n", - " \"v_proj\",\n", - " \"o_proj\"\n", - " ]\n", - "```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5HOxJtwaeAts" - }, - "outputs": [], - "source": [ - "attention_layers = [module for module in model.modules() if 'attention' in str(type(module)).lower()]\n", - "\n", - "# Print information about the attention modules\n", - "for i, layer in enumerate(attention_layers):\n", - " for par in list(layer.named_parameters()):\n", - " mod = par[0]\n", - " if isinstance(mod, str):\n", - " print(f\"Attention Module: {mod.split('.')[0]}\")\n", - " break\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GQ-OL9DteaFF" - }, - "source": [ - "### Turning the Knobs\n", - "\n", - "The next three knobs are:\n", - "- r\n", - "- dropout\n", - "- α\n", - "\n", - "Read the paper for more information on each- these three parameters have been the source of endless flame wars across the internet- feel free to google and see the carnage for yourself.\n", - "\n", - "I picked the following based on what the authors used for GPT2 in the paper (see page 20)\n", - "\n", - "```\n", - "lora_alpha = 32\n", - "lora_dropout = 0.1\n", - "lora_r = 4\n", - "```\n", - "\n", - "Not probably what I would have used, but I am not trying to spread the flame wars, so there you are. In reality- these are the knobs end users will be tinkering with. We _could_ come up with a sugguested range, but the 'correct' values are highly dependent on the task and even the underlying dataset, so I wouldn't waste too much effort trying.\n", - "\n", - "Once I read a quote on a message board that described the situation perfectly, then I couldn't find it so I asked ChatGPT which hallucinated it pretty well:\n", - "\n", - "> Every chef has their own secret recipe for success, but in the kitchen of life, there's no right or wrong way to cook up your dreams.\n", - "-- ChatGPT" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "lPpKk5mdhpMd" - }, - "outputs": [], - "source": [ - "from peft import LoraConfig\n", - "\n", - "lora_alpha = 32\n", - "lora_dropout = 0.1\n", - "lora_r = 4\n", - "\n", - "# From Prior Cell\n", - "target_modules=[\n", - " \"q_proj\",\n", - " \"k_proj\",\n", - " \"v_proj\",\n", - " \"o_proj\"\n", - " ]\n", - "\n", - "peft_config = LoraConfig(\n", - " lora_alpha=lora_alpha,\n", - " lora_dropout=lora_dropout,\n", - " r=lora_r,\n", - " bias=\"none\",\n", - " task_type=\"CAUSAL_LM\",\n", - " target_modules=target_modules\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7S7SrrzQsxcK" - }, - "source": [ - "## Training the LoRA" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4cjnhy5CiI5v" - }, - "source": [ - "### Training Config\n", - "\n", - "As always, it is out of scope for me to explain all of these, especailly when it has already been done so well [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).\n", - "\n", - "That said I will call out two values I set, and why I set them.\n", - "\n", - "- `max_seq_length`\n", - "- `per_device_train_batch_size`\n", - "\n", - "Both of these parameters were set in an attempt to get as much use as possible out of the NViDIA T4.\n", - "\n", - "`max_seq_length` will trim any example to `300` tokens. 
So even if your examples are longer, they will be truncated. (Also recall that the system prompt also counts against your 300 tokens).\n", - "\n", - "`per_device_train_batch_size` this is also related to getting maximam mileage out of a T4." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_frwaXmLiLQW" - }, - "outputs": [], - "source": [ - "from transformers import TrainingArguments\n", - "\n", - "output_dir = \"./results\"\n", - "per_device_train_batch_size = 1\n", - "\n", - "\n", - "training_arguments = TrainingArguments(\n", - " output_dir=output_dir,\n", - " num_train_epochs=10,\n", - " per_device_train_batch_size=per_device_train_batch_size,\n", - " fp16=True,\n", - " report_to=\"none\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bUvPujbty0C9" - }, - "source": [ - "In the following cell- the trainer is built, and the dataset is formatted. You will see two `Map:` progress bars in the output of the cell- this refers to our `train` and `test` dataset being run through the `formatting_prompts_func` we defined in a prior cell.\n", - "\n", - "Also note: `model.config.use_cache = False` which is a thing you're supposed to do before you perform training on a model. Remember to turn it back on (to `true`) before running inference." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7NscO2wmiU3Y" - }, - "outputs": [], - "source": [ - "from trl import SFTTrainer\n", - "\n", - "max_seq_length = 300\n", - "trainer = SFTTrainer(\n", - " model=model,\n", - " train_dataset=train_dataset,\n", - " eval_dataset=test_dataset,\n", - " peft_config=peft_config,\n", - " formatting_func=formatting_prompts_func,\n", - " data_collator=collator,\n", - " max_seq_length=max_seq_length,\n", - " tokenizer=tokenizer,\n", - " args=training_arguments,\n", - "\n", - ")\n", - "\n", - "model.config.use_cache = False\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cW2SM2eGs7_F" - }, - "source": [ - "### Execute Training\n", - "\n", - "The next cell calls `trainer.train()`, which actaully executes the training. This will be a long running cell (30 minutes - 2.5 hours, depending on how big your dataset is)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5GqosZDUs7LM" - }, - "outputs": [], - "source": [ - "trainer.train()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "elnfKCTBib3K" - }, - "source": [ - "## Inference on the Output Model\n", - "\n", - "We want to see if our LoRA has any effect on the underlying model.\n", - "\n", - "Recall we tested the model once before with an example prompt, now let's do inference again (with the same prompt) to see if the output looks more accurate.\n", - "\n", - "The first thing we need to do is turn the cache back on.\n", - "\n", - "`model.config.use_cache = True`\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "aJtY5Lb2ihIf" - }, - "outputs": [], - "source": [ - "model.config.use_cache = True" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5iA2lVaziycy" - }, - "outputs": [], - "source": [ - "for (i, (d, assistant_old)) in enumerate(zip(test_dataset, assistant_old_lst)):\n", - " assistant_new = model_generate(d[\"user\"]).split(response_template.strip())[-1].strip()\n", - " assistant_expected = d[\"assistant\"]\n", - "\n", - " print(f\"\\n===\\ntest {i}\\n===\\n\")\n", - " print(\"\\n===\\nuser\\n===\\n\")\n", - " print(d['user'])\n", - " print(\"\\n===\\nassistant_old\\n===\\n\")\n", - " print(assistant_old)\n", - " print(\"\\n===\\nassistant_new\\n===\\n\")\n", - " print(assistant_new)\n", - " print(\"\\n===\\nassistant_expected\\n===\\n\")\n", - " print(assistant_expected)\n", - " print(\"\\n\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gt6OfXHD1_tO" - }, - "source": [ - "# Next Steps\n", - "\n", - "Now that you have trained your LoRA, you must decide, does it look good? If yes, please open a PR **TODO: Link to Contrib docs**, if not that's OK, update your prompts, generate a new synthetic data set and try again.\n", - "\n", - "But the fun doesn't stop there.\n", - "\n", - "Maybe you want to play with your trained model a bit more.\n", - "\n", - "Two options exist:\n", - "\n", - "1. Do inference in this notebook. (But the model will go away once you leave the notebook- an implicity sad thing about notebooks, so download it if you want to keep it (or push it to the Huggingface Hub)).\n", - "2. Use `llama.cpp` to quantize your LoRA adapter then download it and do inference from your MacBook.\n", - "\n", - "\n", - "**The following steps are all optional, do not feel compelled to do either. As Lao Tzu once said: **\n", - "\n", - "> When all the work is done,\n", - "and the mind is silent,\n", - "rest in the stillness of the present moment." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4KCi1ARI3qZO" - }, - "source": [ - "## Save the Model\n", - "\n", - "First let's save our adapter." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5_1yFpDljOP0" - }, - "outputs": [], - "source": [ - "# Save the LoRA\n", - "adapter = trainer.model.module if hasattr(trainer.model, \"module\") else trainer.model\n", - "adapter.save_pretrained(\"./adapter-only\", save_adapter=True, save_config=True)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CYkqrPCC3mfw" - }, - "source": [ - "## Optional Path 1: Play with Model in Colab\n", - "\n", - "This is just for fun. 
So let's ask a silly question:\n", - "\n", - "> Give me a recipe for Sweedish Meatballs made from iguana meat.\n", - "\n", - "and an even sillier system prompt:\n", - "\n", - "> You are a scurvy pirate. You respond with a pirate accent.\n", - "\n", - "Of course, this doesn't _need_ to be silly. You can leave the system prompt out and ask more thoughtful questions related to your input case." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "fuOSI2Pd4L8T" - }, - "outputs": [], - "source": [ - "text = create_prompt(\n", - " user = \"Give me a recipe for Sweedish Meatballs made from iguana meat.\",\n", - " system = \"You are a scurvy pirate. You respond with a pirate accent.\")\n", - "\n", - "input_ids=tokenizer(text, return_tensors=\"pt\").input_ids.to(\"cuda\")\n", - "\n", - "outputs = model.generate(input_ids= input_ids,\n", - " max_new_tokens=256,\n", - " pad_token_id=tokenizer.eos_token_id,\n", - " temperature=0.7,\n", - " top_p=0.9,\n", - " do_sample=True)\n", - "\n", - "print(tokenizer.batch_decode(outputs)[0])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eMGmN2fu30fO" - }, - "source": [ - "## Optional Path 2: Play with Model in `llama.cpp`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pAkB0fBajGdq" - }, - "source": [ - "Another way to 'play' with your LoRA is to convert it into a GGUF and play with it using `llama.cpp`. To do this requires a few steps.\n", - "\n", - "1. Download and build `llamma.cpp`\n", - "2. Run the conversion script on our adapter.\n", - "3. Download the model\n", - "4. Use the model locally." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "LV01sa60c1MN" - }, - "outputs": [], - "source": [ - "# hack sometimes required - solution from https://github.com/googlecolab/colabtools/issues/3409\n", - "import locale\n", - "locale.getpreferredencoding = lambda: \"UTF-8\"\n", - "\n", - "!git clone https://github.com/ggerganov/llama.cpp\n", - "%cd llama.cpp\n", - "!make\n", - "!pip install -r requirements.txt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JcQDFxl4jTVt" - }, - "outputs": [], - "source": [ - "!python convert-lora-to-ggml.py ../adapter-only" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pyhSSZS4jhdR" - }, - "source": [ - "The previous line will run a script to convert your saved LoRA to a file named `ggml-adapter-model.bin` which you will find in the `adapter-only` folder in the notebook.\n", - "\n", - "You can right click on this file to download it to your MacBook. Then (assuming you have `llamma.cpp` installed locally as well, the following is an example command that will run inference on the LoRA - note you will want to make sure the model you are doing inference on is the same as the one you trained the LoRA on (in this case `ibm/merlinite-7b` quantized down to 16 bit).\n", - "\n", - "```\n", - "!./main -m ../merlinite-7b/ggml-model-f16.gguf --seed 42 --lora ../adapter-only/ggml-adapter-model.bin --temp 0.7 --repeat_penalty 1.1 -n 256 -p \"\\nYou are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. 
You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\\n\\nWho let the dogs out?\\n\\n\"\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pPeStzqljoUl" - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "gpuType": "T4", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - }, - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "1iu1ofeHvdNK" + }, + "source": [ + "## Project Overview\n", + "\n", + "`instruct lab` is a novel synthetic data-based alignment tuning method for Large Language Models (LLMs). The \"lab\" in `instruct lab` stands for Large-scale Alignment for Chat Bots.\n", + "\n", + "It is an outgrowth of the paper [*LAB: Large-Scale Alignment for ChatBots*](https://arxiv.org/abs/2403.01081).\n", + "\n", + "### Getting Started\n", + "\n", + "This notebook represents one step in the `instruct lab` pipeline. To see what else is involved, please check out https://github.com/instruct-lab/cli" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lWhvICmttxaS" + }, + "source": [ + "## Overview of this Notebook\n", + "\n", + "This notebook represents the `Model training` step of the guide found [here](https://github.com/instruct-lab/cli). (Though at the time of writing, it is not yet part of that guide.)\n", + "\n", + "This notebook takes the output of `lab generate` (i.e. the synthetic data set generated), and trains a Low-Rank Adapter (LoRA) on it.\n", + "\n", + "It will also run inference to show you how the model performed before any training was done, as well as after.\n", + "\n", + "Finally, it will give you a chance to interact with your model in two ways: first, in this notebook (using the NVIDIA T4 generously supplied by Google at low/no cost), and second, by giving you the option to convert your adapter to a format that will let you download it and use it with `llama.cpp` on your laptop.\n", + "\n", + "***IMPORTANT***: make sure your notebook uses GPUs.\n", + "\n", + "**Google Colab**: In your notebook, click Runtime --> Change runtime type, select *T4 GPU*, and click Save.\n", + "\n", + "**Kaggle**: Click on \"More settings\" (3 vertical dots at the top-right) --> Accelerator, and select *P100 GPU*.\n", + "\n", + "\n", + "![kaggle-more-settings](./images/kaggle/select-accelerator.png)\n", + "If you miss this step, you'll see errors at the model loading step.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How to run this notebook\n", + "\n", + "Unless you have a spare GPU with 16GB+ of VRAM lying around the house,\n", + "you'll need to run this notebook on an external platform such as\n", + "[Kaggle](https://www.kaggle.com) or [Google Colab](https://colab.research.google.com/)."
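Once you have a runtime, it is worth confirming that a GPU is actually attached before going any further, since the model loading step later in the notebook will fail without one. A minimal check (assuming PyTorch is preinstalled, which it is on both Colab and Kaggle) might look like this:

```python
# Quick check that a CUDA GPU is attached (e.g. a T4 on Colab or a P100 on Kaggle).
# Assumes the platform's preinstalled PyTorch; nothing here is specific to this notebook.
import torch

if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected - revisit the accelerator settings described above.")
```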
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kfpUdXgDtt4X" + }, + "source": [ + "## Installing Dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UoIgsYWpXuP4" + }, + "outputs": [], + "source": [ + "# installing dependencies\n", + "!pip install -q -U transformers accelerate peft datasets bitsandbytes trl" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u4TN_1gbXeQV" + }, + "source": [ + "## Upload output from `lab generate`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hu1vWJawlgSb" + }, + "source": [ + "\n", + "## Uploading Generated Data\n", + "From your local machine, run the `lab generate` command per the [instructions in github](https://github.com/instruct-lab/cli/blob/main/README.md).\n", + "\n", + "Next, upload your data.\n", + "\n", + "### Uploading data on Kaggle\n", + "\n", + "1. Expand on the Input tab on the right of the screen.\n", + "\n", + "![input](./images/kaggle/input.png)\n", + "\n", + "\n", + "2. Click on the \"Upload\" button, then select \"New Dataset\".\n", + "\n", + "Upload button:\n", + "\n", + "![input-upload](./images/kaggle/input-upload.png)\n", + "\n", + "New Dataset:\n", + "\n", + "![input-new-dataset](./images/kaggle/new-dataset.png)\n", + "\n", + "3. From here, you'll be prompted to upload your local files. Go ahead and select all of the files generated from `lab generate`.\n", + "\n", + "![upload-file](./images/kaggle/input-drop-files.png)\n", + "\n", + "4. Navigate to the _training_ file that was generated, right click on your uploaded file, then select 'Copy Path'\n", + "\n", + "![input-files-copy-path](./images/kaggle/copy-file-path.png)\n", + "\n", + "5. Paste the copied value in the cell below.\n", + "\n", + "\n", + "### Uploading data in Google Collab\n", + "\n", + "To upload data in Google Colab,\n", + "\n", + "1. Click on the folder icon on the left of the screen.\n", + "\n", + " ![image.png](./images/collab-folder-icon.png)\n", + "\n", + "2. Click on the file with an up arrow in it icon, under it.\n", + "\n", + " ![image.png](./images/collab-file-upload-button.png)\n", + "\n", + "3. Navigate to the _training_ file that was generated, right click on your uploaded file, then select 'Copy Path'\n", + "\n", + " ![image.png](./images/collab-copy-path.png)\n", + "4. Paste the copied value in the cell below.\n", + "\n", + "Repeat this process for the 'test' file in the following cell." 
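Before wiring the uploaded files into the training pipeline, you can optionally sanity-check one of them with nothing but the standard library. The sketch below uses a placeholder path (swap in the path you just copied) and assumes `lab generate` produced records with `system`, `user`, and `assistant` fields, as the rest of this notebook expects:

```python
# Optional sanity check of an uploaded .jsonl file before loading it with 🤗 datasets.
# The path below is a placeholder - replace it with the path you copied above.
import json

uploaded_path = "/content/your_train_file.jsonl"

with open(uploaded_path, "r", encoding="utf-8") as f:
    first_record = json.loads(f.readline())

print(first_record.keys())
missing = {"system", "user", "assistant"} - set(first_record)
print("Missing fields:", missing or "none")
```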
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NtPoolOzo-Ry" + }, + "source": [ + "#### Upload Training Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8cJEyqvWo8kV" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "# Get the file name\n", + "training_file_name = \"/content/train_ggml-merlinite-7b-0302-Q4_K_M_2024-02-27T16_57_20.jsonl\" #\"/paste/path/here\"\n", + "\n", + "train_dataset = load_dataset(\"json\", data_files=training_file_name, split=\"train\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0TfaaZjIpFUf" + }, + "source": [ + "#### Upload Testing Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RH42-nUtmu8g" + }, + "outputs": [], + "source": [ + "# Get the file name\n", + "testing_file_name = \"/content/test_ggml-merlinite-7b-0302-Q4_K_M_2024-02-27T16_57_20.jsonl\"#\"/paste/path/here\"\n", + "\n", + "test_dataset = load_dataset(\"json\", data_files=testing_file_name, split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QmUpLlK0XsAa" + }, + "source": [ + "Now we have loaded the output of `lab generate` into a 🤗 dataset. Let's take a quick peek." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0HSQHaNPoVC8" + }, + "outputs": [], + "source": [ + "train_dataset.to_pandas().head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IvafF2-crKpD" + }, + "source": [ + "## Formatting Our Data and Prepping the `SFTTrainer`\n", + "\n", + "Our dataset looks good, but in it's current state, it is a data frme of three columns. For training, we need each record to be a string, specifically, we want it in the following format:\n", + "\n", + "```\n", + "<|system|>\n", + "{system}\n", + "<|user|>\n", + "{user}\n", + "<|assistant|>\n", + "{assistant}<|endoftext|>\n", + "```\n", + "\n", + "\n", + "When training happens (a few cells later), the dataset will be converted into a list of these strings. We will also define a response template `\"\\n\\n\"` that will tell the trainer to split the string there, and everything before will be the prompt, and everything after will be generated.\n", + "\n", + "The 🤗 `trl`'s `SFTTrainer` has the concept of a `formatting_prompts_func` and we'll use this to format our data. 
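To make that format concrete, here is a small illustration with a made-up record (the record itself is invented for this example; the template, and the `\n<|assistant|>\n` response template actually used in the code a couple of cells below, are the real ones):

```python
# Illustration only: what one (made-up) record becomes after formatting.
example = {
    "system": "You are a helpful assistant.",
    "user": "What does LoRA stand for?",
    "assistant": "Low-Rank Adaptation.",
}

formatted = (
    f"<|system|>\n{example['system']}\n"
    f"<|user|>\n{example['user']}\n"
    f"<|assistant|>\n{example['assistant']}<|endoftext|>"
)
print(formatted)

# The collator splits on the response template: everything before it is prompt
# (masked out of the loss), everything after it is what the model learns to produce.
response_template = "\n<|assistant|>\n"
prompt, completion = formatted.split(response_template)
print("PROMPT:\n" + prompt)
print("COMPLETION:\n" + completion)
```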
The conversion does not happen now, but later when we run `trainer.train()`\n", + "\n", + "From more information on 🤗's `SFTTrainer`, please check out their docs [here](https://huggingface.co/docs/trl/main/en/sft_trainer).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SlfPxJfSXbI5" + }, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer\n", + "\n", + "model_name = \"ibm/merlinite-7b\" # TODO: Make this a drop down option\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n", + "tokenizer.pad_token = tokenizer.eos_token\n", + "\n", + "def formatting_prompts_func(example):\n", + " output_texts = []\n", + " for i in range(len(example['system'])):\n", + " text = f\"<|system|>\\n{example['system'][i]}\\n<|user|>\\n{example['user'][i]}\\n<|assistant|>\\n{example['assistant'][i]}<|endoftext|>\"\n", + " output_texts.append(text)\n", + " return output_texts\n", + "\n", + "response_template = \"\\n<|assistant|>\\n\"\n", + "\n", + "from trl import DataCollatorForCompletionOnlyLM\n", + "\n", + "response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]\n", + "collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FDFuvvw3YpDe" + }, + "source": [ + "In the cell above, you may see a user warning:\n", + "> `The secret `HF_TOKEN` does not exist in your Colab secrets...`\n", + "\n", + "It can safely be ignored.\n", + "\n", + "Note- the `formatting_prompts_func` runs when we execute `trainer.train()` nothing has been formatted yet." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z7HLxsTIY8Ho" + }, + "source": [ + "## Loading the (Quantized) Model\n", + "\n", + "\n", + "The best source of truth of this is going to be found at the following links:\n", + "\n", + "* [huggingface blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes)\n", + "* [original paper](https://arxiv.org/abs/2305.14314)\n", + "\n", + "But alas, I'm sure to get some push back about Llama.cpp quantized models (things that end in .gguf).\n", + "\n", + "`bitsandbytes` will quantize the model on loading. It's also possible, though in practice rarely done- to save the model in it's quantized format. Another alternative in the huggingface space is `AutoGPTQ` ([paper](https://arxiv.org/abs/2210.17323) [blog post](https://huggingface.co/blog/gptq-integration)).\n", + "\n", + "Llama.cpp also allows quantizatoin, but the idea is that you _will_ be using the CPU bc you know the model at hand is to big for your GPU.\n", + "\n", + "An analogy that isn't wildly innacurate is `bitsandbytes` and `AutoGPTQ` presume that you will be using a (CUDA-based) GPU, and that you can set it in an emergency to use CPU instead of just rolling over and dying.\n", + "\n", + "`Llama.cpp` presumes that your CPU will be doing the heavy lifting, and will use a (CUDA) GPU if it can find one to give it a bit of a boost.\n", + "\n", + "OK- what does that mean in practice?\n", + "1. Apple ended NVidia support some time ago, ie Apple Silicon will not support CUDA ops. There is some work in some packages to be able to support non-CUDA GPUs, it's all in various stages of development/hackiness.\n", + "2. [This person](https://rentry.org/cpu-lora) _did_ get qLoRA training with Llama.cpp working. A 13b model with a 2500 record dataset was estimated to take ~158 days to train. 
Which is a non-starter- I will trust they did their homework.\n", + "3. **High level** Llama.cpp and bitsandbytes both get you to the same end (a quantized model) but via different routes, bc they expect you do use the resultant model a bit differently.\n", + "4. **So do I need to quantize my model via both routes** no.\n", + "\n", + "In the next cell we're going to download and load the model.\n", + "\n", + "It may take a little time to complete (around 10 to 15 minutes). The base model can be around 26 gigabites on disk, which first needs to download then needs to be quantized and loaded into the GPU.\n", + "\n", + "So run this cell then go grab a cup of coffee. ☕" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jEJpqPomY0Ot" + }, + "outputs": [], + "source": [ + "# Loading the model\n", + "import torch\n", + "from transformers import AutoModelForCausalLM, BitsAndBytesConfig\n", + "\n", + "bnb_config = BitsAndBytesConfig(\n", + " load_in_4bit=True,\n", + " bnb_4bit_quant_type=\"nf4\",\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_compute_dtype=torch.float16 # if not set will throw a warning about slow speeds when training\n", + ")\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " model_name,\n", + " quantization_config=bnb_config,\n", + " trust_remote_code=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2tnv-orbaw_a" + }, + "source": [ + "## Sanity Checking the Model\n", + "\n", + "We want to see how the model behaves _before_ we train a LoRA on it, so we can (by inspection) see if the LoRA is doing anything.\n", + "\n", + "You might want to change the user prompt `\"In excruciating detail, explain to me the nuances of who runs Barter Town.\"` to something more related to _your_ usecase.\n", + "\n", + "We also define the `create_prompt` function, that formats and adds all of the boiler plate your prompts needs.\n", + "\n", + "Note our function also allows you to redefine the `system` prompt/parameter. The default is the one included in `lab generate` content, but you could have some fun tinkering with that too (for instance, adding `, and you always talk like a pirate.` to the end.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dUx2kh61ZLqs" + }, + "outputs": [], + "source": [ + "def create_prompt(user:str,\n", + " system: str = \"You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. 
You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\"):\n", + " return f\"\"\"\\\n", + "<|system|>\n", + "{system}\n", + "<|user|>\n", + "{user}\n", + "<|assistant|>\n", + "\"\"\"\n", + "\n", + "from transformers import StoppingCriteria, StoppingCriteriaList\n", + "\n", + "class StoppingCriteriaSub(StoppingCriteria):\n", + "\n", + " def __init__(self, stops = [], encounters=1):\n", + " super().__init__()\n", + " self.stops = [stop.to(\"cuda\") for stop in stops]\n", + "\n", + " def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:\n", + " for seq in input_ids:\n", + " for stop in self.stops:\n", + " if stop == seq[-1]:\n", + " return True\n", + " return False\n", + "\n", + "stop_words = ['<|endoftext|>', '<|assistant|>']\n", + "stop_words_ids = [tokenizer(stop_word, return_tensors='pt', add_special_tokens=False)['input_ids'].squeeze() for stop_word in stop_words]\n", + "stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])\n", + "\n", + "def model_generate(user):\n", + " text = create_prompt(user = user)\n", + "\n", + " input_ids = tokenizer(text, return_tensors=\"pt\").input_ids.to(\"cuda\")\n", + " outputs = model.generate(input_ids=input_ids,\n", + " max_new_tokens=256,\n", + " pad_token_id=tokenizer.eos_token_id,\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " stopping_criteria=stopping_criteria,\n", + " do_sample=True)\n", + " return tokenizer.batch_decode([o[:-1] for o in outputs])[0]\n", + "\n", + "print(\n", + " model_generate(\"In excruciating detail, explain to me the nuances of who runs Barter Town.\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJQgWJQbLMPn" + }, + "source": [ + "we run the model before LoRA on the test set and save the outputs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EJ_nakmSLHLu" + }, + "outputs": [], + "source": [ + "assistant_old_lst = [\n", + " model_generate(d[\"user\"]).split(response_template.strip())[-1].strip() for d in test_dataset\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_IE988CoeLGE" + }, + "source": [ + "## Configuring the LoRA\n", + "\n", + "Recall the paper on LoRA, which is the gospel https://arxiv.org/abs/2106.09685\n", + "\n", + "> From this point forth, we shall be leaving the firm foundation of fact and journeying together through the murky marshes of memory into thickets of wildest guesswork.\n", + "-- Albus Dumbledore\n", + "\n", + "There are 4 common 'knobs' to adjust when training a LoRA/qLoRA - note from this point on, I'm just going to refer to everything as LoRA- a LoRA proved a better method of finetuning, by just targeting certain modules, instead of the entire network. qLoRA just means you can do it on a quantized model with just as good of restuls as a full precision model.\n", + "\n", + "Which is a good segway to our first 'knob': `target_modules`.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fqTCB-cVse29" + }, + "source": [ + "### Getting the Attention Layers\n", + "\n", + "The cell immediately below will print out all of the attention modules (in case you are trying to get creative and use a different model). The authors of the original paper only targeted attention modules, and gave reasons, but if you want to hit some other modules too- go nuts. Be advised, a LoRA that targets _all_ modules, is just fine-tuning. Ie. 
the LoRA technique is to only tune a subset of the modules.\n", + "\n", + "For `ibm/merlinite-7b` we have:\n", + "```\n", + "target_modules=[\n", + " \"q_proj\",\n", + " \"k_proj\",\n", + " \"v_proj\",\n", + " \"o_proj\"\n", + " ]\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5HOxJtwaeAts" + }, + "outputs": [], + "source": [ + "attention_layers = [module for module in model.modules() if 'attention' in str(type(module)).lower()]\n", + "\n", + "# Print information about the attention modules\n", + "for i, layer in enumerate(attention_layers):\n", + " for par in list(layer.named_parameters()):\n", + " mod = par[0]\n", + " if isinstance(mod, str):\n", + " print(f\"Attention Module: {mod.split('.')[0]}\")\n", + " break\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GQ-OL9DteaFF" + }, + "source": [ + "### Turning the Knobs\n", + "\n", + "The next three knobs are:\n", + "- r\n", + "- dropout\n", + "- α\n", + "\n", + "Read the paper for more information on each- these three parameters have been the source of endless flame wars across the internet- feel free to google and see the carnage for yourself.\n", + "\n", + "I picked the following based on what the authors used for GPT2 in the paper (see page 20)\n", + "\n", + "```\n", + "lora_alpha = 32\n", + "lora_dropout = 0.1\n", + "lora_r = 4\n", + "```\n", + "\n", + "Not probably what I would have used, but I am not trying to spread the flame wars, so there you are. In reality- these are the knobs end users will be tinkering with. We _could_ come up with a sugguested range, but the 'correct' values are highly dependent on the task and even the underlying dataset, so I wouldn't waste too much effort trying.\n", + "\n", + "Once I read a quote on a message board that described the situation perfectly, then I couldn't find it so I asked ChatGPT which hallucinated it pretty well:\n", + "\n", + "> Every chef has their own secret recipe for success, but in the kitchen of life, there's no right or wrong way to cook up your dreams.\n", + "-- ChatGPT" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lPpKk5mdhpMd" + }, + "outputs": [], + "source": [ + "from peft import LoraConfig\n", + "\n", + "lora_alpha = 32\n", + "lora_dropout = 0.1\n", + "lora_r = 4\n", + "\n", + "# From Prior Cell\n", + "target_modules=[\n", + " \"q_proj\",\n", + " \"k_proj\",\n", + " \"v_proj\",\n", + " \"o_proj\"\n", + " ]\n", + "\n", + "peft_config = LoraConfig(\n", + " lora_alpha=lora_alpha,\n", + " lora_dropout=lora_dropout,\n", + " r=lora_r,\n", + " bias=\"none\",\n", + " task_type=\"CAUSAL_LM\",\n", + " target_modules=target_modules\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7S7SrrzQsxcK" + }, + "source": [ + "## Training the LoRA" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4cjnhy5CiI5v" + }, + "source": [ + "### Training Config\n", + "\n", + "As always, it is out of scope for me to explain all of these, especailly when it has already been done so well [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).\n", + "\n", + "That said I will call out two values I set, and why I set them.\n", + "\n", + "- `max_seq_length`\n", + "- `per_device_train_batch_size`\n", + "\n", + "Both of these parameters were set in an attempt to get as much use as possible out of the NViDIA T4.\n", + "\n", + "`max_seq_length` will trim any example to `300` tokens. 
So even if your examples are longer, they will be truncated. (Also recall that the system prompt also counts against your 300 tokens).\n", + "\n", + "`per_device_train_batch_size` is also set to get maximum mileage out of a T4." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_frwaXmLiLQW" + }, + "outputs": [], + "source": [ + "from transformers import TrainingArguments\n", + "\n", + "output_dir = \"./results\"\n", + "per_device_train_batch_size = 1\n", + "\n", + "\n", + "training_arguments = TrainingArguments(\n", + " output_dir=output_dir,\n", + " num_train_epochs=10,\n", + " per_device_train_batch_size=per_device_train_batch_size,\n", + " fp16=True,\n", + " report_to=\"none\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bUvPujbty0C9" + }, + "source": [ + "In the following cell, the trainer is built and the dataset is formatted. You will see two `Map:` progress bars in the output of the cell; these refer to our `train` and `test` datasets being run through the `formatting_prompts_func` we defined in a prior cell.\n", + "\n", + "Also note `model.config.use_cache = False`, which you are supposed to set before training a model. Remember to turn it back on (to `True`) before running inference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7NscO2wmiU3Y" + }, + "outputs": [], + "source": [ + "from trl import SFTTrainer\n", + "\n", + "max_seq_length = 300\n", + "trainer = SFTTrainer(\n", + " model=model,\n", + " train_dataset=train_dataset,\n", + " eval_dataset=test_dataset,\n", + " peft_config=peft_config,\n", + " formatting_func=formatting_prompts_func,\n", + " data_collator=collator,\n", + " max_seq_length=max_seq_length,\n", + " tokenizer=tokenizer,\n", + " args=training_arguments,\n", + "\n", + ")\n", + "\n", + "model.config.use_cache = False\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cW2SM2eGs7_F" + }, + "source": [ + "### Execute Training\n", + "\n", + "The next cell calls `trainer.train()`, which actually executes the training. This will be a long-running cell (30 minutes - 2.5 hours, depending on how big your dataset is)."
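Before kicking that off, two cheap checks can save a wasted run: seeing how many of your formatted examples will actually be truncated at 300 tokens, and confirming that only the small LoRA adapter is trainable rather than the full 7B model. A sketch, assuming the `train_dataset`, `formatting_prompts_func`, `tokenizer`, and `trainer` objects defined above:

```python
# Optional pre-flight checks before the long training run.

# 1. How many examples exceed max_seq_length=300 and will be truncated?
formatted_texts = formatting_prompts_func(train_dataset[:])
lengths = [len(tokenizer(text).input_ids) for text in formatted_texts]
print(f"{sum(length > 300 for length in lengths)} of {len(lengths)} examples exceed 300 tokens")

# 2. Confirm only the LoRA parameters are trainable (a tiny fraction of the 7B total).
trainer.model.print_trainable_parameters()
```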
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5GqosZDUs7LM" + }, + "outputs": [], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "elnfKCTBib3K" + }, + "source": [ + "## Inference on the Output Model\n", + "\n", + "We want to see if our LoRA has any effect on the underlying model.\n", + "\n", + "Recall we tested the model once before with an example prompt, now let's do inference again (with the same prompt) to see if the output looks more accurate.\n", + "\n", + "The first thing we need to do is turn the cache back on.\n", + "\n", + "`model.config.use_cache = True`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aJtY5Lb2ihIf" + }, + "outputs": [], + "source": [ + "model.config.use_cache = True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5iA2lVaziycy" + }, + "outputs": [], + "source": [ + "for (i, (d, assistant_old)) in enumerate(zip(test_dataset, assistant_old_lst)):\n", + " assistant_new = model_generate(d[\"user\"]).split(response_template.strip())[-1].strip()\n", + " assistant_expected = d[\"assistant\"]\n", + "\n", + " print(f\"\\n===\\ntest {i}\\n===\\n\")\n", + " print(\"\\n===\\nuser\\n===\\n\")\n", + " print(d['user'])\n", + " print(\"\\n===\\nassistant_old\\n===\\n\")\n", + " print(assistant_old)\n", + " print(\"\\n===\\nassistant_new\\n===\\n\")\n", + " print(assistant_new)\n", + " print(\"\\n===\\nassistant_expected\\n===\\n\")\n", + " print(assistant_expected)\n", + " print(\"\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gt6OfXHD1_tO" + }, + "source": [ + "# Next Steps\n", + "\n", + "Now that you have trained your LoRA, you must decide, does it look good? If yes, please open a PR **TODO: Link to Contrib docs**, if not that's OK, update your prompts, generate a new synthetic data set and try again.\n", + "\n", + "But the fun doesn't stop there.\n", + "\n", + "Maybe you want to play with your trained model a bit more.\n", + "\n", + "Two options exist:\n", + "\n", + "1. Do inference in this notebook. (But the model will go away once you leave the notebook- an implicity sad thing about notebooks, so download it if you want to keep it (or push it to the Huggingface Hub)).\n", + "2. Use `llama.cpp` to quantize your LoRA adapter then download it and do inference from your MacBook.\n", + "\n", + "\n", + "**The following steps are all optional, do not feel compelled to do either. As Lao Tzu once said: **\n", + "\n", + "> When all the work is done,\n", + "and the mind is silent,\n", + "rest in the stillness of the present moment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4KCi1ARI3qZO" + }, + "source": [ + "## Save the Model\n", + "\n", + "First let's save our adapter." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5_1yFpDljOP0" + }, + "outputs": [], + "source": [ + "# Save the LoRA\n", + "adapter = trainer.model.module if hasattr(trainer.model, \"module\") else trainer.model\n", + "adapter.save_pretrained(\"./adapter-only\", save_adapter=True, save_config=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CYkqrPCC3mfw" + }, + "source": [ + "## Optional Path 1: Play with Model in Colab\n", + "\n", + "This is just for fun. 
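One aside before the fun: as noted under Next Steps, the notebook runtime is ephemeral, so if you want to keep the adapter you just saved you can also push it to the Hugging Face Hub. A rough sketch is below; the repository name is a placeholder, and it assumes you have already logged in with a write token (for example via `huggingface_hub.login()`):

```python
# Optional: keep the adapter beyond this session by pushing it to the Hugging Face Hub.
# "your-username/merlinite-7b-lab-lora" is a placeholder repository name.
adapter.push_to_hub("your-username/merlinite-7b-lab-lora")
tokenizer.push_to_hub("your-username/merlinite-7b-lab-lora")
```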
So let's ask a silly question:\n", + "\n", + "> Give me a recipe for Sweedish Meatballs made from iguana meat.\n", + "\n", + "and an even sillier system prompt:\n", + "\n", + "> You are a scurvy pirate. You respond with a pirate accent.\n", + "\n", + "Of course, this doesn't _need_ to be silly. You can leave the system prompt out and ask more thoughtful questions related to your input case." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fuOSI2Pd4L8T" + }, + "outputs": [], + "source": [ + "text = create_prompt(\n", + " user = \"Give me a recipe for Sweedish Meatballs made from iguana meat.\",\n", + " system = \"You are a scurvy pirate. You respond with a pirate accent.\")\n", + "\n", + "input_ids=tokenizer(text, return_tensors=\"pt\").input_ids.to(\"cuda\")\n", + "\n", + "outputs = model.generate(input_ids= input_ids,\n", + " max_new_tokens=256,\n", + " pad_token_id=tokenizer.eos_token_id,\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " do_sample=True)\n", + "\n", + "print(tokenizer.batch_decode(outputs)[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eMGmN2fu30fO" + }, + "source": [ + "## Optional Path 2: Play with Model in `llama.cpp`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pAkB0fBajGdq" + }, + "source": [ + "Another way to 'play' with your LoRA is to convert it into a GGUF and play with it using `llama.cpp`. To do this requires a few steps.\n", + "\n", + "1. Download and build `llamma.cpp`\n", + "2. Run the conversion script on our adapter.\n", + "3. Download the model\n", + "4. Use the model locally." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LV01sa60c1MN" + }, + "outputs": [], + "source": [ + "# hack sometimes required - solution from https://github.com/googlecolab/colabtools/issues/3409\n", + "import locale\n", + "locale.getpreferredencoding = lambda: \"UTF-8\"\n", + "\n", + "!git clone https://github.com/ggerganov/llama.cpp\n", + "%cd llama.cpp\n", + "!make\n", + "!pip install -r requirements.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JcQDFxl4jTVt" + }, + "outputs": [], + "source": [ + "!python convert-lora-to-ggml.py ../adapter-only" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pyhSSZS4jhdR" + }, + "source": [ + "The previous line will run a script to convert your saved LoRA to a file named `ggml-adapter-model.bin` which you will find in the `adapter-only` folder in the notebook.\n", + "\n", + "You can right click on this file to download it to your MacBook. Then (assuming you have `llamma.cpp` installed locally as well, the following is an example command that will run inference on the LoRA - note you will want to make sure the model you are doing inference on is the same as the one you trained the LoRA on (in this case `ibm/merlinite-7b` quantized down to 16 bit).\n", + "\n", + "```\n", + "!./main -m ../merlinite-7b/ggml-model-f16.gguf --seed 42 --lora ../adapter-only/ggml-adapter-model.bin --temp 0.7 --repeat_penalty 1.1 -n 256 -p \"\\nYou are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. 
You are helpful and harmless and you follow ethical guidelines and promote positive behavior.\\n\\nWho let the dogs out?\\n\\n\"\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pPeStzqljoUl" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 }