remove columns after tokenizing for pretraining (#571)
winglian committed Sep 14, 2023
1 parent 3b18c96 commit 1157950
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions src/axolotl/utils/data.py
@@ -644,8 +644,8 @@ def load_pretraining_dataset(path, tokenizer, max_tokens=2048, seed=42):
         encode,
         batched=True,
         input_columns="text",
-        remove_columns=[
-            "text",
-        ],
+        # remove all the existing columns after mapping since they end up having
+        # a different length than the encoded/tokenized column
+        remove_columns=dataset.features.keys(),
     )
     return dataset
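
The reasoning in the added comment can be illustrated with a small, library-free sketch. It is a simplified model of a batched map, not the actual `datasets` implementation, and it assumes the encode step packs tokens into fixed-size rows so that the number of output rows differs from the number of input rows. The names `pack_tokens`, `batched_map`, and the `meta` column are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, Iterable, List

Batch = Dict[str, List]


def pack_tokens(texts: List[str], max_tokens: int = 4) -> Batch:
    """Hypothetical stand-in for the encode step.

    Whitespace-tokenizes every text, concatenates the tokens, and cuts the
    result into rows of at most max_tokens tokens.
    """
    tokens = [tok for text in texts for tok in text.split()]
    rows = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return {"input_ids": rows}


def batched_map(
    batch: Batch,
    fn: Callable[[List[str]], Batch],
    input_column: str,
    remove_columns: Iterable[str],
) -> Batch:
    """Simplified model of a batched map (an assumption, not library code).

    Drops remove_columns, merges in the function's output, and requires
    every remaining column to have the same number of rows.
    """
    remove = set(remove_columns)
    kept = {name: values for name, values in batch.items() if name not in remove}
    merged = {**kept, **fn(batch[input_column])}
    lengths = {name: len(values) for name, values in merged.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column length mismatch: {lengths}")
    return merged


# Three input rows holding eight tokens in total, which pack into two rows.
batch = {
    "text": ["a b c", "d e", "f g h"],
    "meta": ["x", "y", "z"],
}

# Removing only "text" leaves "meta" with 3 rows next to 2 packed rows.
try:
    batched_map(batch, pack_tokens, "text", ["text"])
    only_text_failed = False
except ValueError:
    only_text_failed = True

# Removing every existing column leaves only the packed output.
packed = batched_map(batch, pack_tokens, "text", batch.keys())
```

Passing `batch.keys()` here plays the role of `dataset.features.keys()` in the change: once every pre-existing column is dropped, only the newly produced columns remain, so their row count no longer has to match the input.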
