[Cherry Pick] allow dataset size smaller than calibration samples (#2091) (#2179)

* allow dataset size smaller than calibration samples (#2091)

* merge issue
Satrat committed Mar 13, 2024
1 parent 2c3bdf7 commit d64d9fb
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions src/sparseml/transformers/finetune/data/data_helpers.py
@@ -49,9 +49,17 @@ def format_calibration_data(
     :param accelerator: optional accelerator for if preparing in FSDP mode
     :return: list of trimmed calibration data tensors
     """
-    num_calibration_samples = num_calibration_samples or len(tokenized_dataset)
+    safe_calibration_samples = len(tokenized_dataset)
+    if num_calibration_samples is not None:
+        safe_calibration_samples = min(len(tokenized_dataset), num_calibration_samples)
+        if safe_calibration_samples != num_calibration_samples:
+            LOGGER.warn(
+                f"Requested {num_calibration_samples} calibration samples but "
+                f"the provided dataset only has {safe_calibration_samples}. "
+            )
+
     shuffled_calibration = tokenized_dataset.shuffle()
-    shuffled_calibration = shuffled_calibration.select(range(num_calibration_samples))
+    shuffled_calibration = shuffled_calibration.select(range(safe_calibration_samples))

     dataloader_params = {
         "batch_size": 1,
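
For readers who want to try the clamping behaviour outside of sparseml, here is a minimal standalone sketch. It is not the library's code: the helper names safe_sample_count and select_calibration_samples are hypothetical, a plain Python sequence stands in for the tokenized dataset, and random.Random(seed).shuffle plus slicing stands in for the dataset's shuffle() and select(range(n)) calls. It also calls the standard library logger's warning method, on the assumption that LOGGER is a standard logging logger, where warn is a deprecated alias.

```python
import logging
import random
from typing import List, Optional, Sequence, TypeVar

LOGGER = logging.getLogger(__name__)
T = TypeVar("T")


def safe_sample_count(dataset_size: int, requested: Optional[int]) -> int:
    """Clamp the requested sample count to the number of rows available.

    Hypothetical helper mirroring the logic added in this commit.
    """
    if requested is None:
        # No explicit request: use the whole dataset.
        return dataset_size
    safe = min(dataset_size, requested)
    if safe != requested:
        LOGGER.warning(
            "Requested %d calibration samples but the provided dataset only has %d.",
            requested,
            safe,
        )
    return safe


def select_calibration_samples(
    dataset: Sequence[T], requested: Optional[int], seed: int = 0
) -> List[T]:
    """Shuffle a copy of the dataset and keep the first `safe` items.

    Stand-in for dataset.shuffle().select(range(safe)) on a plain sequence.
    """
    count = safe_sample_count(len(dataset), requested)
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:count]
```

With a 10-row dataset and 512 requested samples, the sketch logs a warning and returns all 10 rows instead of raising an out-of-range error from the selection step.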
