CUDA device error with llama2_chat strategy #568
Comments
I just tried it again with the latest commits from main. The issue still occurs. Here is the full error message:
|
I am having the very same issue: `RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED`
|
The same issue for me |
Other reports of similar errors trace back to an index mismatch. I found one place in llama_chat that adds a pad token to the tokenizer, which is either not accounted for in the model or happens too late for the model to be resized:
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.sequence_len = 4096
    self.tokenizer.add_special_tokens({"pad_token": "<pad>"})
    # https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/added_tokens.json
|
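The index mismatch described above can be checked before any GPU work: every token ID has to be smaller than the number of rows in the model's embedding matrix, otherwise the embedding lookup fails. A minimal sketch in plain Python follows. The helper name `find_out_of_range_ids` is hypothetical, and the numbers (a 32000-row embedding matrix, a pad token that received ID 32000) are illustrative assumptions, not values taken from the axolotl code.

```python
from typing import List, Tuple


def find_out_of_range_ids(
    input_ids: List[int], num_embeddings: int
) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that cannot be looked up.

    An embedding matrix with `num_embeddings` rows only accepts IDs in
    the range 0 .. num_embeddings - 1.
    """
    return [
        (pos, tok)
        for pos, tok in enumerate(input_ids)
        if tok < 0 or tok >= num_embeddings
    ]


# Illustrative values only (assumptions): a token appended to a
# 32000-entry vocabulary gets ID 32000, one past the last valid row.
num_embeddings = 32000
pad_id = 32000
batch = [1, 518, 25580, 29962, pad_id, pad_id]

bad = find_out_of_range_ids(batch, num_embeddings)
print(bad)
```

Running a check like this on a few tokenized samples, with the real embedding size of the loaded model, would show whether a newly added token is the source of the out-of-range index.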
Here's my llama2_chat.py version that worked for me. Just make sure you also remove the line that adds the pad token here: https://github.com/OpenAccess-AI-Collective/axolotl/blob/ec0958f4f846236ac2703dd644f6dac4365f64b4/src/axolotl/utils/models.py#L80
"""
Prompt Strategy for finetuning Llama2 chat models
see also https://github.com/facebookresearch/llama/blob/6c7fe276574e78057f917549435a2554000a876d/llama/generation.py#L213 for a reference implementation.

This implementation is based on the Vicuna PR and the fastchat repo, see also:
https://github.com/lm-sys/FastChat/blob/cdd7730686cb1bf9ae2b768ee171bdf7d1ff04f3/fastchat/conversation.py#L847

Use dataset type: "llama2_chat" in config.yml to use this prompt style.

E.g. in the config.yml:
datasets:
  - path: llama_finetune_train.jsonl
    type: llama2_chat

The dataset itself should look like this:
{"conversations": [{"from": "human", "value": "Who are you?"}, {"from": "gpt", "value": "I am Vicuna"},...]}
in a jsonl file. The first message should be from the human, the second from gpt.
For a custom system message, the first "from" can be "system" (followed by alternating "human" and "gpt" turns).

Important: Don't use "special_tokens:" in your config.yml if you are not sure what you are doing!
"""
import logging
from dataclasses import dataclass, field
from typing import Generator, List, Sequence
from axolotl.prompt_tokenizers import PromptTokenizingStrategy
from axolotl.prompters import IGNORE_TOKEN_ID, SHAREGPT_ASSERTION_FAILED_ROLE
import traceback
@dataclass
class Llama2ChatConversation:
    """A class that manages prompt templates and keeps all conversation history.
    copied from https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py"""

    name: str = "llama2"
    # The system prompt
    system: str = (
        "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. "
        "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. "
        "Please ensure that your responses are socially unbiased and positive in nature.\n\n"
        "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. "
        "If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n"
    )
    roles: Sequence[str] = ("[INST]", "[/INST]")
    messages: List[List[str]] = field(default_factory=list)
    offset: int = 0
    sep = " "
    sep2 = " </s><s>"
    sep3 = " </s>"
    stop_token_ids = [2]

    def get_prompt(self) -> str:
        """Get the prompt for generation."""
        seps = [self.sep, self.sep2]
        ret = ""
        for i, (role, message) in enumerate(self.messages):
            if (i == len(self.messages) - 1) and (role == self.roles[0]):
                # last message is from user (due to length),
                # return prompt without it for training
                return ret
            if i == 0:
                ret += self.system + message.strip()
            else:
                if i == len(self.messages) - 1:
                    ret += role + " " + message.strip() + self.sep3
                else:
                    ret += role + " " + message.strip() + seps[i % 2]
        return ret

    def append_message(self, role: str, message: str):
        """Append a new message."""
        self.messages.append([role, message])
class LLama2ChatTokenizingStrategy(PromptTokenizingStrategy):
    """
    Tokenizing strategy for ShareGPT prompts.
    adapted from https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sequence_len = 4096
        # self.tokenizer.add_special_tokens({"pad_token": "<pad>"})
        # https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/added_tokens.json

    def tokenize_prompt(self, prompt):
        conv = next(self.prompter.build_prompt(prompt))
        conversation_str = conv.get_prompt()

        # Tokenize conversations
        input_ids = self.tokenizer(
            conversation_str,
            return_tensors="pt",
            ## padding="max_length",
            padding=False,
            max_length=self.sequence_len,
            truncation=True,
        ).input_ids[0]
        target = input_ids.clone()

        # Mask targets. Only compute loss on the assistant outputs.
        sep = conv.roles[1]

        ## total_len = int(target.ne(self.tokenizer.pad_token_id).sum())
        total_len = len(target)

        turns = conversation_str.split(conv.sep2)
        cur_len = 0
        target[:cur_len] = IGNORE_TOKEN_ID
        for turn in turns:
            if turn == "":
                break
            # turn_len = len(self.tokenizer(turn).input_ids) - 1
            turn_len = len(self.tokenizer(turn).input_ids)

            parts = turn.split(sep)
            if len(parts) != 2:
                break
            parts[0] += sep
            # "-1" is hardcoded for the LLaMA tokenizer to make the offset correct.
            instruction_len = len(self.tokenizer(parts[0]).input_ids)

            # Ignore the user instructions
            target[cur_len - 1 : cur_len + instruction_len] = IGNORE_TOKEN_ID
            cur_len += turn_len

        target[cur_len:] = IGNORE_TOKEN_ID

        if cur_len < self.sequence_len:
            if cur_len != total_len:
                target[:] = IGNORE_TOKEN_ID
                logging.warning(
                    f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
                    f" (ignored)"
                )

        attention_mask = input_ids.ne(self.tokenizer.pad_token_id).tolist()
        input_ids = input_ids.tolist()
        target = target.tolist()
        # this is a fix for the tokenizer which tokenizes [ differently with eos tokens and
        # follows the original llama implementation
        for i in range(2, total_len - 2):
            if input_ids[i] == 29961:
                input_ids[i] = 518
            if target[i] == 29961:
                target[i] = 518
        return {
            "input_ids": input_ids,
            "labels": target,
            "attention_mask": attention_mask,
        }
class Llama2ChatPrompter:  # pylint: disable=too-few-public-methods
    """
    A prompter that generates prompts for Llama2 models.
    """

    system_prompt = (
        "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. "
        "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. "
        "Please ensure that your responses are socially unbiased and positive in nature.\n\n"
        "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. "
        "If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n"
    )

    def build_prompt(self, source) -> Generator[Llama2ChatConversation, None, None]:
        # see https://github.com/lm-sys/FastChat/blob/da0641e567cf93756b0978ab5a6b092e96f06240/fastchat/train/train.py#L78
        source = source["conversations"]  # fix data structure for datasets

        # if system prompt provided, use it
        if source[0]["from"] == "system":
            system = f"[INST] <<SYS>>\n{source[0]['value']}\n<</SYS>>\n\n"
            source = source[1:]
        else:
            system = self.system_prompt

        conv = Llama2ChatConversation(system=system)

        if len(source) < 2:
            # If there isn't a back and forth conversation, ignore it
            # also happens on the data splitting leaving empty conversations
            raise IndexError

        roles = {"human": conv.roles[0], "gpt": conv.roles[1]}

        if roles[source[0]["from"]] != conv.roles[0]:
            # Skip the first one if it is not from human
            source = source[1:]

        conv.messages = []  # pylint: disable=R0801
        for j, sentence in enumerate(source):
            role = roles[sentence["from"]]
            assert role == conv.roles[j % 2], SHAREGPT_ASSERTION_FAILED_ROLE
            if sentence["value"]:
                conv.append_message(role, sentence["value"])
        yield conv
def load(tokenizer, cfg) -> LLama2ChatTokenizingStrategy:
    return LLama2ChatTokenizingStrategy(
        Llama2ChatPrompter(),
        tokenizer,
        cfg.train_on_inputs,
        cfg.sequence_len,
    )
|
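The masking step in `tokenize_prompt` is easier to follow on plain lists. The sketch below is a simplified illustration, not the code above: it assumes the per-turn token counts are already known, uses a hypothetical helper `mask_instructions`, and assumes `IGNORE_TOKEN_ID` is -100, the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

IGNORE_TOKEN_ID = -100  # assumption: the usual ignore_index value


def mask_instructions(
    input_ids: List[int], turns: List[Tuple[int, int]]
) -> List[int]:
    """Copy input_ids into labels and hide everything but the responses.

    `turns` holds (instruction_len, response_len) token counts per turn.
    """
    labels = list(input_ids)
    cur = 0
    for instruction_len, response_len in turns:
        end = min(cur + instruction_len, len(labels))
        for i in range(cur, end):
            labels[i] = IGNORE_TOKEN_ID
        cur += instruction_len + response_len
    # Anything past the last counted turn (e.g. padding) is ignored too.
    for i in range(min(cur, len(labels)), len(labels)):
        labels[i] = IGNORE_TOKEN_ID
    return labels


# Two turns: 3 instruction tokens + 2 response tokens, then 2 + 1,
# followed by one trailing token that belongs to no turn.
ids = [10, 11, 12, 20, 21, 13, 14, 22, 0]
labels = mask_instructions(ids, [(3, 2), (2, 1)])
print(labels)
```

If the summed turn lengths do not equal the real token count, the mask lands on the wrong positions, which is the situation the `tokenization mismatch` warning in the code above reports.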
@dimichgh thanks. Can you let us know if you've gotten good results from your fine tuning with your changes? I have resorted to modifying the AlpacaPrompter with the llama2 prompt format and this has yielded quite good results. I'm not sure if I want to mess with llama2_chat until some dev takes the time and fixes it. |
Can you please share your AlpacaPrompter? Also, does it need the dataset to be formatted in alpaca style or sharegpt style? |
One thing you can try is the branch in PR #578. Simply set:
type: sharegpt
conversation: llama-2
|
@vibhorag101 I'm sorry but I don't feel comfortable sharing it, because I don't really know what I'm doing. I wouldn't want people to waste time on something if I'm not on the right track. But essentially, if you want to try it yourself -
@dimichgh I have tried your changes and the fine tune process itself works, but the model will not produce meaningful output and cannot be quantized because of unexpected tensor dimension errors from llama.cpp.
@winglian thanks, will try it out later this week. |
I tried it and it seems to somewhat work (fine tuning, converting, quantization and inference all finish without issues), but the model doesn't learn too well. With my hacked-together alpaca prompt_strategy that uses the llama2 message format I get a loss of around 1.0, where the same dataset with your suggestion gets about 1.4. That is also reflected in the much poorer response quality of the resulting model with your solution. Plus, I was trying to understand your code, but basing the classes off of sharegpt made it all even more difficult to understand. I really don't like the code layout there; it makes it hard to follow what's used where. Sorry to be so blunt. |
I see behaviour similar to what @kaldeberger saw. My loss was trending down from 0.9 to 0.2 after an epoch on my dataset, but after switching to the new prompt strategy it trends down from 1.5 to 0.5 after an epoch. |
@pshivraj can you clarify which prompt strategies were giving which loss results please? |
Hi @winglian Sorry for not mentioning this beforehand.
|
@dimichgh I found some more issues after that, but did not have time to provide a patch for that yet. |
By default, the current code sets the LlamaTokenizer to use Llama's EOS token as the pad token, except for the llama2 chat class above.
The model's embedding length is automatically resized if it does not match the tokenizer, so you don't need to do it yourself. It seems that the original issue is solved by using the other class. Regarding the weird loss, it could be due to how the data is tokenized, so providing the debugging output (see readme) could help. I'll close this for now as the original issue is solved. If you would like to dive into the loss, a separate issue might be better. |
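The two behaviours described in this comment can be sketched as follows. This is an illustration under stated assumptions, not axolotl's actual code: `ensure_pad_and_resize` is a hypothetical helper, and it relies only on the Hugging Face `transformers` members `pad_token`, `eos_token`, `len(tokenizer)`, `get_input_embeddings()` and `resize_token_embeddings()`.

```python
def ensure_pad_and_resize(tokenizer, model) -> int:
    """Reuse EOS as the pad token, then grow the embeddings if needed.

    Hypothetical helper. `tokenizer` and `model` are expected to behave
    like a Hugging Face tokenizer and PreTrainedModel.
    Returns the number of embedding rows after the call.
    """
    if tokenizer.pad_token is None:
        # No new vocabulary entry is created, so no new token ID appears.
        tokenizer.pad_token = tokenizer.eos_token

    vocab_size = len(tokenizer)  # includes any added special tokens
    rows = model.get_input_embeddings().num_embeddings
    if vocab_size > rows:
        # Only grow; shrinking could drop rows the checkpoint relies on.
        model.resize_token_embeddings(vocab_size)
        rows = model.get_input_embeddings().num_embeddings
    return rows
```

Reusing the EOS token avoids introducing a token ID the embedding matrix has no row for, which appears to be why the default path does not hit the assertion while the `<pad>` token added in the llama2 chat class does.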
Please check that this issue hasn't been reported before.
Expected Behavior
I am trying to finetune code-llama on runpod with the provided docker image using this command:
accelerate launch scripts/finetune.py
Using the config from examples/code-llama/13B/qlora.yml with a dataset from the local file system in llama2_chat format should work.
Note:
Other formats, e.g. alpaca or sharegpt:chat, do work fine with this config. The problem seems to be with the llama2 prompt strategy.
Current behaviour
After loading the model shards there are many of these error messages:
indexSelectLargeIndex: block: [0,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
followed by
CUDA Error: device-side assert triggered /tmp/pip-install-3n5798ar/dropout-layer-norm_1323beba825f4e58852d39754f678e64/csrc/layer_norm/ln_fwd_kernels.cuh 236
The finetune.py process terminates.
Steps to reproduce
{"conversations": [{"from": "human", "value": "Who are you?"}, {"from": "gpt", "value": "I am your chat assistant."}]}
accelerate launch scripts/finetune.py qlora.yml
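Before launching a run, each line of the JSONL file can be checked against the format the llama2_chat docstring describes: an optional leading "system" message, then alternating "human" and "gpt" turns starting with "human". A minimal sketch follows; `check_record` is a hypothetical helper, not part of axolotl.

```python
import json
from typing import List


def check_record(line: str) -> List[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    turns = record.get("conversations") if isinstance(record, dict) else None
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]

    # An optional system message may come first.
    if isinstance(turns[0], dict) and turns[0].get("from") == "system":
        turns = turns[1:]

    problems = []
    if len(turns) < 2:
        problems.append("needs at least one human/gpt exchange")
    expected = ("human", "gpt")
    for i, turn in enumerate(turns):
        role = turn.get("from") if isinstance(turn, dict) else None
        if role != expected[i % 2]:
            problems.append(
                f"turn {i}: expected {expected[i % 2]!r}, got {role!r}"
            )
        elif not turn.get("value"):
            problems.append(f"turn {i}: empty 'value'")
    return problems


sample = (
    '{"conversations": [{"from": "human", "value": "Who are you?"}, '
    '{"from": "gpt", "value": "I am your chat assistant."}]}'
)
print(check_record(sample))
```

A well-formed file rules out the data as the cause, leaving the tokenizer and model setup as the place to look for the device-side assert.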
Possible solution
No response
Which Operating Systems are you using?
Note: I am using the runpod template via the direct link included in the Readme.
Python Version
from the runpod template (docker image)
axolotl branch-commit
main
Acknowledgements