prepared dataset caching, other misc fixes (#665)
* prepared dataset caching, other misc fixes

* also don't load from disk cache unless explicit
winglian committed Oct 3, 2023
1 parent f4868d7 commit e50a64e
Showing 32 changed files with 35 additions and 34 deletions.
2 changes: 1 addition & 1 deletion examples/cerebras/qlora.yml
@@ -7,7 +7,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 adapter: qlora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/code-llama/13b/lora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./lora-out

2 changes: 1 addition & 1 deletion examples/code-llama/13b/qlora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./qlora-out

2 changes: 1 addition & 1 deletion examples/code-llama/34b/lora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./lora-out

2 changes: 1 addition & 1 deletion examples/code-llama/34b/qlora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./qlora-out

2 changes: 1 addition & 1 deletion examples/code-llama/7b/lora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./lora-out

2 changes: 1 addition & 1 deletion examples/code-llama/7b/qlora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./qlora-out

2 changes: 1 addition & 1 deletion examples/falcon/config-7b-lora.yml
@@ -12,7 +12,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca:chat
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 adapter: lora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/falcon/config-7b-qlora.yml
@@ -18,7 +18,7 @@ datasets:
     data_files:
       - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
     type: "alpaca:chat"
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 # enable QLoRA
 adapter: qlora
2 changes: 1 addition & 1 deletion examples/falcon/config-7b.yml
@@ -12,7 +12,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca:chat
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 adapter:
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/gptj/qlora.yml
@@ -7,7 +7,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 adapter: qlora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/jeopardy-bot/config.yml
@@ -6,7 +6,7 @@ load_in_8bit: false
 datasets:
   - path: openaccess-ai-collective/jeopardy
     type: jeopardy
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.02
 adapter:
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/llama-2/gptq-lora.yml
@@ -15,7 +15,7 @@ hf_use_auth_token: true
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 adapter: lora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/llama-2/lora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./lora-out

2 changes: 1 addition & 1 deletion examples/llama-2/qlora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./qlora-out

2 changes: 1 addition & 1 deletion examples/llama-2/relora.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./relora-out

2 changes: 1 addition & 1 deletion examples/llama-2/tiny-llama.yml
@@ -12,7 +12,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./lora-out

2 changes: 1 addition & 1 deletion examples/mistral/config.yml
@@ -11,7 +11,7 @@ strict: false
 datasets:
   - path: mhenrichsen/alpaca_2k_test
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 output_dir: ./out

2 changes: 1 addition & 1 deletion examples/mpt-7b/config.yml
@@ -6,7 +6,7 @@ load_in_8bit: false
 datasets:
   - path: vicgalle/alpaca-gpt4
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.02
 adapter:
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/openllama-3b/config.yml
@@ -9,7 +9,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.02
 adapter:
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/openllama-3b/lora.yml
@@ -9,7 +9,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.02
 adapter: lora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/openllama-3b/qlora.yml
@@ -9,7 +9,7 @@ push_dataset_to_hub:
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 adapter: qlora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/phi/phi-ft.yml
@@ -13,7 +13,7 @@ datasets:
   - path: garage-bAInd/Open-Platypus
     type: alpaca

-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.05
 output_dir: ./phi-sft-out

2 changes: 1 addition & 1 deletion examples/phi/phi-qlora.yml
@@ -13,7 +13,7 @@ datasets:
   - path: garage-bAInd/Open-Platypus
     type: alpaca

-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.05
 output_dir: ./phi-sft-out

2 changes: 1 addition & 1 deletion examples/pythia-12b/config.yml
@@ -10,7 +10,7 @@ device_map: auto
 datasets:
   - path: vicgalle/alpaca-gpt4
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.05
 adapter:
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/pythia/lora.yml
@@ -4,7 +4,7 @@ load_in_8bit: true
 datasets:
   - path: teknium/GPT4-LLM-Cleaned
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.05
 adapter: lora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/redpajama/config-3b.yml
@@ -7,7 +7,7 @@ load_in_8bit: false
 datasets:
   - path: vicgalle/alpaca-gpt4
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.02
 adapter:
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/replit-3b/config-lora.yml
@@ -5,7 +5,7 @@ load_in_8bit: false
 datasets:
   - path: vicgalle/alpaca-gpt4
     type: alpaca
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.05
 adapter: lora
 lora_model_dir:
2 changes: 1 addition & 1 deletion examples/xgen-7b/xgen-7b-8k-qlora.yml
@@ -16,7 +16,7 @@ datasets:
     data_files:
       - openassistant_best_replies_train.jsonl
     type: "completion"
-dataset_prepared_path: last_run_prepared
+dataset_prepared_path:
 val_set_size: 0.01
 # enable QLoRA
 adapter: qlora
2 changes: 1 addition & 1 deletion src/axolotl/cli/__init__.py
@@ -51,7 +51,7 @@ def print_axolotl_text_art(suffix=None):


 def get_multi_line_input() -> Optional[str]:
-    print("Give me an instruction (Ctrl + D to finish): ")
+    print("Give me an instruction (Ctrl + D to submit): ")
     instruction = ""
     for line in sys.stdin:
         instruction += line  # pylint: disable=consider-using-join
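The reworded prompt matches how the loop behaves: iterating over sys.stdin only stops at end-of-file, which Ctrl + D sends on Unix-like terminals, so the keystroke hands over everything typed so far. A minimal sketch of that behaviour follows; `read_instruction` is a hypothetical helper, not a function from this commit, and an in-memory stream stands in for the terminal.

```python
import io
from typing import TextIO


def read_instruction(stream: TextIO) -> str:
    """Collect every line until the stream reaches EOF (hypothetical helper)."""
    # Iterating a text stream yields one line at a time, newline included.
    return "".join(stream)


# io.StringIO stands in for sys.stdin; exhausting it plays the role of Ctrl + D.
fake_stdin = io.StringIO("Summarize this commit.\nKeep it short.\n")
instruction = read_instruction(fake_stdin)
```

The sketch uses "".join instead of repeated += purely for brevity; the result is the same string.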
6 changes: 3 additions & 3 deletions src/axolotl/utils/data.py
@@ -122,7 +122,7 @@ def load_tokenized_prepared_datasets(

     if dataset:
         ...
-    elif any(prepared_ds_path.glob("*")):
+    elif cfg.dataset_prepared_path and any(prepared_ds_path.glob("*")):
         LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
         dataset = load_from_disk(str(prepared_ds_path))
         LOG.info("Prepared dataset loaded from disk...")
@@ -357,7 +357,7 @@ def for_d_in_datasets(dataset_configs):
         if len(datasets) > 1:
             LOG.info("shuffle merged datasets")
             dataset = dataset.shuffle(seed=seed)
-        if cfg.local_rank == 0:
+        if cfg.local_rank == 0 and cfg.dataset_prepared_path:
             LOG.info(f"Saving merged prepared dataset to disk... {prepared_ds_path}")
             dataset.save_to_disk(prepared_ds_path)
             if cfg.push_dataset_to_hub:
@@ -425,7 +425,7 @@ def load_prepare_datasets(

     if dataset:
         ...
-    elif cfg.dataset_prepared_path and any(prepared_ds_path.glob("*")):
-    elif any(prepared_ds_path.glob("*")):
         LOG.info(
             f"Loading prepared packed dataset from disk at {prepared_ds_path}..."
         )
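Taken together, the three hunks make the on-disk cache opt-in: a prepared dataset is read back only when dataset_prepared_path is set and the directory is non-empty, and it is written only from rank 0 when the path is set. The sketch below isolates those two conditions. The function names are hypothetical and the temporary directory merely simulates a leftover cache; it does not call the datasets library.

```python
import tempfile
from pathlib import Path
from typing import Optional


def should_load_prepared(dataset_prepared_path: Optional[str], prepared_ds_path: Path) -> bool:
    """Reuse the on-disk copy only if a path was configured and the directory has content."""
    return bool(dataset_prepared_path) and any(prepared_ds_path.glob("*"))


def should_save_prepared(local_rank: int, dataset_prepared_path: Optional[str]) -> bool:
    """Write the prepared dataset only from rank 0, and only if a path was configured."""
    return local_rank == 0 and bool(dataset_prepared_path)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    # Pretend an earlier run left a prepared dataset behind (placeholder file name).
    (cache_dir / "data.arrow").write_text("placeholder")

    stale_hit = should_load_prepared(None, cache_dir)  # False: path left empty in the config
    explicit_hit = should_load_prepared("last_run_prepared", cache_dir)  # True: user opted in
```

With the example configs now leaving dataset_prepared_path empty, a stale cache from an earlier run is no longer picked up silently; setting the key back to a directory name restores the old behaviour.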
3 changes: 2 additions & 1 deletion src/axolotl/utils/tokenization.py
@@ -31,7 +31,8 @@ def check_example_labels(example, tokenizer, text_only=False):
         )
         colored_tokens.append(colored_token)

-    LOG.info(" ".join(colored_tokens))
+    delimiter = "" if text_only else " "
+    LOG.info(delimiter.join(colored_tokens))
     LOG.info("\n\n\n")
     print(" ".join(colored_tokens))

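The delimiter change presumably exists because decoded tokens already carry their own whitespace, so joining them with a space in text-only mode splits words apart. A small sketch of the difference, using a hypothetical `join_tokens` helper and hand-written token strings rather than real tokenizer output:

```python
from typing import List


def join_tokens(tokens: List[str], text_only: bool = False) -> str:
    """Join decoded tokens the way the patched logging line does."""
    delimiter = "" if text_only else " "
    return delimiter.join(tokens)


# Hand-written stand-ins for decoded tokens; note that " world" carries its own space.
pieces = ["Hel", "lo", ",", " world"]
readable = join_tokens(pieces, text_only=True)  # "Hello, world"
annotated = join_tokens(pieces)                 # "Hel lo ,  world"
```

In the annotated mode each token is followed by its label tuple, so the extra space keeps neighbouring entries visually separate.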
