report min length of tokenized data (#1186) [skip ci]
winglian committed Jan 24, 2024
1 parent 02f2c72 commit d85d494
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/axolotl/utils/trainer.py
@@ -110,6 +110,8 @@ def process_datasets_for_packing(cfg, train_dataset, eval_dataset):
     drop_long = partial(drop_long_seq, sequence_len=cfg.sequence_len)
     with zero_first(is_main_process()):
         if cfg.is_preprocess:
+            min_input_len = np.min(get_dataset_lengths(train_dataset))
+            LOG.debug(f"min_input_len: {min_input_len}", main_process_only=True)
             max_input_len = np.max(get_dataset_lengths(train_dataset))
             LOG.debug(f"max_input_len: {max_input_len}", main_process_only=True)
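
For illustration only, here is a minimal standalone sketch of what the two added lines report. It assumes each tokenized example carries an `input_ids` list; `token_lengths` is a hypothetical stand-in for `get_dataset_lengths`, whose actual implementation is not shown in this diff, and the toy dataset is made up.

```python
import numpy as np


def token_lengths(dataset):
    """Hypothetical stand-in for get_dataset_lengths: one length per example."""
    return np.array([len(example["input_ids"]) for example in dataset])


# Toy tokenized dataset with made-up token ids (an assumption, not real data).
toy_dataset = [
    {"input_ids": [1, 5, 9, 2]},
    {"input_ids": [1, 2]},
    {"input_ids": [1, 7, 7, 7, 7, 3, 2]},
]

lengths = token_lengths(toy_dataset)

# Shortest and longest tokenized sequence, mirroring the values the commit logs.
min_input_len = int(np.min(lengths))
max_input_len = int(np.max(lengths))

print(f"min_input_len: {min_input_len}")
print(f"max_input_len: {max_input_len}")
```

Reporting the minimum next to the maximum makes it easier to spot empty or near-empty examples before packing, in the same way the maximum flags sequences that exceed `cfg.sequence_len`.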
