Data validation notebook #1029

Closed
136 commits
8cb6522
add validation script
xiaohanzhan-db Dec 23, 2023
c59c11f
update
xiaohanzhan-db Jan 3, 2024
66f34eb
change token count function
Jan 3, 2024
2cd387b
reorganize cells
Jan 5, 2024
3eac3bf
Add unit tests
xiaohanzhan-db Jan 5, 2024
d2d9767
Add a printout for CPT
xiaohanzhan-db Jan 6, 2024
be25591
update question
xiaohanzhan-db Jan 6, 2024
4651be7
Add questions
Jan 8, 2024
5cd6a94
Fix lints
xiaohanzhan-db Jan 8, 2024
8e2c1f4
Merge branch 'main' into validation
XiaohanZhangCMU Jan 8, 2024
e6e4a81
update format
xiaohanzhan-db Jan 8, 2024
34c5690
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 8, 2024
1668b9a
update
xiaohanzhan-db Jan 8, 2024
2219135
nb source
xiaohanzhan-db Jan 8, 2024
86c6e87
add validation script
xiaohanzhan-db Dec 23, 2023
678b376
update
xiaohanzhan-db Jan 3, 2024
297e057
change token count function
Jan 3, 2024
09d0ebb
reorganize cells
Jan 5, 2024
460df65
Add unit tests
xiaohanzhan-db Jan 5, 2024
3ffd200
Add a printout for CPT
xiaohanzhan-db Jan 6, 2024
9362886
update question
xiaohanzhan-db Jan 6, 2024
898e5ac
Add questions
Jan 8, 2024
a4bef71
Fix lints
xiaohanzhan-db Jan 8, 2024
4ca9cc6
update format
xiaohanzhan-db Jan 8, 2024
d636a0f
update
xiaohanzhan-db Jan 8, 2024
827d155
nb source
xiaohanzhan-db Jan 8, 2024
6bbf3fc
Remove license insert for validation notebook
xiaohanzhan-db Jan 8, 2024
4f6a4fb
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 8, 2024
5966b68
Add validation utils
xiaohanzhan-db Jan 11, 2024
da17813
Merge branch 'main' into validation
xiaohanzhan-db Jan 11, 2024
89fb909
Validation (#856)
XiaohanZhangCMU Jan 11, 2024
55e4626
update utils/__init__.py to include extra validation functions
xiaohanzhan-db Jan 11, 2024
45544a1
update notebook
Jan 11, 2024
d2797b3
update
xiaohanzhan-db Jan 11, 2024
019da77
Merge branch 'validation' of github.com:XiaohanZhangCMU/llm-foundryX …
xiaohanzhan-db Jan 11, 2024
756fdae
update
xiaohanzhan-db Jan 11, 2024
93b5a9f
Add download remote function to util
xiaohanzhan-db Jan 11, 2024
b47c878
update
xiaohanzhan-db Jan 11, 2024
13fd34c
update
xiaohanzhan-db Jan 11, 2024
610f669
update
xiaohanzhan-db Jan 11, 2024
9f2e51b
update
xiaohanzhan-db Jan 11, 2024
ec68f10
update
xiaohanzhan-db Jan 11, 2024
1e76068
update
xiaohanzhan-db Jan 11, 2024
7a5c164
update
xiaohanzhan-db Jan 11, 2024
e76038f
Merge branch 'main' into validation
xiaohanzhan-db Jan 11, 2024
5b413f5
update
xiaohanzhan-db Jan 11, 2024
a1aa31f
update
xiaohanzhan-db Jan 11, 2024
d24fd5c
update
xiaohanzhan-db Jan 11, 2024
55fce37
Add dask and dataframe_to_mds
xiaohanzhan-db Jan 12, 2024
86e2412
update
xiaohanzhan-db Jan 12, 2024
bbfec65
update
xiaohanzhan-db Jan 12, 2024
b2e880d
update
xiaohanzhan-db Jan 12, 2024
596443a
update
xiaohanzhan-db Jan 12, 2024
ea65187
Add notebook
xiaohanzhan-db Jan 12, 2024
378a4e0
update
xiaohanzhan-db Jan 12, 2024
af6e9aa
update
Jan 12, 2024
4e286ec
remove script and tests, keep notebook
xiaohanzhan-db Jan 12, 2024
09c4892
update
xiaohanzhan-db Jan 12, 2024
c82da6c
update
xiaohanzhan-db Jan 12, 2024
e5f83cc
update
xiaohanzhan-db Jan 12, 2024
17d2b9f
update
xiaohanzhan-db Jan 12, 2024
6579d55
Merge branch 'main' into validation
xiaohanzhan-db Jan 12, 2024
56308ff
Merge branch 'byod/data_validation' into validation
XiaohanZhangCMU Jan 12, 2024
00a51b5
Validation (#862)
XiaohanZhangCMU Jan 12, 2024
4daa324
updated notebook
Jan 12, 2024
b809691
Merge branch 'main' into validation
xiaohanzhan-db Jan 12, 2024
8b75f94
remove scripts keep notebook
xiaohanzhan-db Jan 12, 2024
99bf2cd
merge with byod/data_validation
xiaohanzhan-db Jan 12, 2024
9b37063
Validation (#866)
XiaohanZhangCMU Jan 12, 2024
22014d6
update notebook. rephrase.
Jan 12, 2024
d9f28aa
merged
xiaohanzhan-db Jan 12, 2024
f1fa63c
Validation (#867)
XiaohanZhangCMU Jan 12, 2024
43c8ac9
update
xiaohanzhan-db Jan 12, 2024
b8ac771
Add response tokens
xiaohanzhan-db Jan 16, 2024
1b9681c
update
xiaohanzhan-db Jan 16, 2024
16883c2
merge
xiaohanzhan-db Jan 16, 2024
a9218d6
Validation (#875)
XiaohanZhangCMU Jan 16, 2024
c7567f1
update
xiaohanzhan-db Jan 20, 2024
1764b72
Disable MDSWrite, return token counts
xiaohanzhan-db Jan 22, 2024
808ced5
Change plot settings
xiaohanzhan-db Jan 23, 2024
26ae516
Fix conflict
xiaohanzhan-db Jan 23, 2024
a212ee8
update notebook
Jan 23, 2024
d279817
update
xiaohanzhan-db Jan 23, 2024
f1cfe9e
Validation (#898)
XiaohanZhangCMU Jan 23, 2024
dbe3f4e
update notebook
Jan 23, 2024
3005718
update
xiaohanzhan-db Jan 23, 2024
8498662
Validation (#900)
XiaohanZhangCMU Jan 23, 2024
f5b900c
update
Jan 23, 2024
02d0979
Merge branch 'byod/data_validation' of https://github.com/mosaicml/ll…
xiaohanzhan-db Jan 23, 2024
205e405
Validation (#901)
XiaohanZhangCMU Jan 23, 2024
2f883a7
update notebook
Jan 23, 2024
0315caf
update
xiaohanzhan-db Jan 23, 2024
1a510ff
update pip install link
xiaohanzhan-db Mar 13, 2024
530a55a
Change done file location
xiaohanzhan-db Mar 13, 2024
5493295
Validation (#902)
XiaohanZhangCMU Mar 13, 2024
81c3757
Create the dest folder
xiaohanzhan-db Mar 13, 2024
5090e13
Validation (#1025)
XiaohanZhangCMU Mar 13, 2024
de95862
clean up code ready for debug unit tests
xiaohanzhan-db Mar 13, 2024
765f399
refactor
xiaohanzhan-db Mar 13, 2024
0751a7a
Add unit tests
xiaohanzhan-db Mar 13, 2024
f88917d
update notebook
xiaohanzhan-db Mar 14, 2024
4c86f74
update
xiaohanzhan-db Mar 14, 2024
962974b
Merge branch 'byod/data_validation' into validation
XiaohanZhangCMU Mar 14, 2024
9fd91cf
Validation (#1027)
XiaohanZhangCMU Mar 14, 2024
67f7b4c
Merge pull request #1 from mosaicml/byod/data_validation
XiaohanZhangCMU Mar 14, 2024
28cd2e6
update notebook
xiaohanzhan-db Mar 14, 2024
944b260
Validation (#1028)
XiaohanZhangCMU Mar 14, 2024
8197ad5
Fix lints
xiaohanzhan-db Mar 14, 2024
0544e65
shuffle
xiaohanzhan-db Mar 14, 2024
b06cfb3
Fix lints
xiaohanzhan-db Mar 14, 2024
9a19d8a
fix conflict
xiaohanzhan-db Mar 14, 2024
a6b2ae0
Validation (#1031)
XiaohanZhangCMU Mar 14, 2024
de90934
update token_counts
xiaohanzhan-db Mar 14, 2024
5dfd30c
Validation (#1032)
XiaohanZhangCMU Mar 14, 2024
61adb43
update pip install list
xiaohanzhan-db Mar 14, 2024
c404dc7
Validation (#1033)
XiaohanZhangCMU Mar 14, 2024
c77bdf6
fix
xiaohanzhan-db Mar 14, 2024
ad71cc0
update
xiaohanzhan-db Mar 14, 2024
9bc3a39
fix token counts
xiaohanzhan-db Mar 14, 2024
9ec582e
Expose validate chat
xiaohanzhan-db Mar 14, 2024
734008e
Expose more
xiaohanzhan-db Mar 14, 2024
51f2eef
update
xiaohanzhan-db Mar 14, 2024
7b6956d
expose
xiaohanzhan-db Mar 14, 2024
60ed7de
add collate
xiaohanzhan-db Mar 14, 2024
fba1dcb
Fix
xiaohanzhan-db Mar 14, 2024
58185ba
Fix conflict
xiaohanzhan-db Mar 14, 2024
8e8f431
Validation (#1034)
XiaohanZhangCMU Mar 14, 2024
24f3d9e
update notebook
xiaohanzhan-db Mar 14, 2024
714002d
Fix conflict
xiaohanzhan-db Mar 14, 2024
1640f30
Validation (#1035)
XiaohanZhangCMU Mar 14, 2024
b053363
Merge branch 'byod/data_validation' of https://github.com/mosaicml/ll…
xiaohanzhan-db Mar 14, 2024
eb0fdbe
Add collate function to dataset.map
xiaohanzhan-db Mar 14, 2024
cdfe625
Fix lints
xiaohanzhan-db Mar 14, 2024
920c5e8
Fix lints
xiaohanzhan-db Mar 14, 2024
ffdcd5b
yapf removed
xiaohanzhan-db Mar 14, 2024
58f618b
Fix lints
xiaohanzhan-db Mar 14, 2024
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -67,6 +67,7 @@ repos:
     - "#"
     - --allow-past-years
     types: [python]
+    exclude: scripts/data_prep/validate_and_tokenize_data.py
 - repo: https://github.com/PyCQA/docformatter
   rev: v1.5.0
   hooks:
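The added `exclude` key takes a pattern, not a literal path. As far as I understand pre-commit, the value is a Python regular expression searched against each repo-relative file path. The snippet below is only a sketch of that matching using `re.search`; it is not pre-commit's own code, and every sample path other than the one in the diff is made up for illustration.

```python
import re

# Pattern copied from the `exclude:` line in the diff above.
EXCLUDE = r"scripts/data_prep/validate_and_tokenize_data.py"


def is_excluded(path: str, pattern: str = EXCLUDE) -> bool:
    """Return True if `pattern` matches anywhere in `path` (re.search semantics)."""
    return re.search(pattern, path) is not None


paths = [
    "scripts/data_prep/validate_and_tokenize_data.py",  # file named in the diff
    "scripts/data_prep/convert_dataset_hf.py",  # hypothetical sibling file
    "scripts/data_prep/validate_and_tokenize_dataXpy",  # hypothetical: '.' acts as a wildcard
]
results = {p: is_excluded(p) for p in paths}
for path, excluded in results.items():
    print(f"{path}: excluded={excluded}")
```

Because the dot is unescaped, the pattern is slightly broader than the literal filename. That is harmless here, but escaping it as `\.py$` would make the intent exact.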
14 changes: 12 additions & 2 deletions llmfoundry/data/finetuning/__init__.py
@@ -2,6 +2,16 @@
 # SPDX-License-Identifier: Apache-2.0

 from llmfoundry.data.finetuning.collator import Seq2SeqFinetuningCollator
-from llmfoundry.data.finetuning.dataloader import build_finetuning_dataloader
+from llmfoundry.data.finetuning.dataloader import (_build_collate_fn,
+                                                   build_finetuning_dataloader)
+from llmfoundry.data.finetuning.tasks import (
+    ChatFormattedDict, PromptResponseDict, TokenizedExample, _get_example_type,
+    _validate_chat_formatted_example,
+    _validate_prompt_response_formatted_example)

-__all__ = ['Seq2SeqFinetuningCollator', 'build_finetuning_dataloader']
+__all__ = [
+    'Seq2SeqFinetuningCollator', 'build_finetuning_dataloader',
+    '_build_collate_fn', '_validate_chat_formatted_example',
+    '_validate_prompt_response_formatted_example', '_get_example_type',
+    'PromptResponseDict', 'ChatFormattedDict', 'TokenizedExample'
+]
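One consequence of this change: underscore-prefixed names are normally skipped by `from module import *`, but listing them in `__all__` makes them part of the star-import surface. The sketch below shows that Python behavior with a throwaway in-memory module. `demo_pkg` and the function bodies are hypothetical stand-ins, not llm-foundry code.

```python
import sys
import types

# Build a throwaway module in memory; `demo_pkg` and its contents are hypothetical.
source = '''
def build_finetuning_dataloader():
    return "dataloader"

def _build_collate_fn():
    return "collate_fn"

def _not_listed():
    return "hidden"

__all__ = ["build_finetuning_dataloader", "_build_collate_fn"]
'''
module = types.ModuleType("demo_pkg")
exec(source, module.__dict__)
sys.modules["demo_pkg"] = module

# Run a star import into a fresh namespace and see which names arrive.
namespace = {}
exec("from demo_pkg import *", namespace)

exported = sorted(name for name in namespace if name != "__builtins__")
print(exported)
```

Only the names listed in `__all__` are imported, including the private `_build_collate_fn`, while `_not_listed` stays out. Explicit imports such as `from demo_pkg import _not_listed` work regardless of `__all__`.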
19 changes: 13 additions & 6 deletions llmfoundry/data/finetuning/tasks.py
@@ -282,10 +282,8 @@ def _tokenize_chat_formatted_example(
     }


-def _tokenize_prompt_response_formatted_example(
-        example: PromptResponseDict,
-        tokenizer: PreTrainedTokenizerBase) -> TokenizedExample:
-    """Tokenize a formatted example and validate expected keys."""
+def _validate_prompt_response_formatted_example(example: PromptResponseDict):
+    """Validate expected keys."""
     example_keys = set(example.keys())
     prompt_keys = example_keys.intersection(_ALLOWED_PROMPT_KEYS)
     response_keys = example_keys.intersection(_ALLOWED_RESPONSE_KEYS)
@@ -317,6 +315,15 @@ def _tokenize_prompt_response_formatted_example(
             f'Unable to tokenize example because {response_key} was not a string. {example=}'
         )

+    return prompt, response
+
+
+def _tokenize_prompt_response_formatted_example(
+        example: PromptResponseDict,
+        tokenizer: PreTrainedTokenizerBase) -> TokenizedExample:
+    """Tokenize a formatted example and validate expected keys."""
+    prompt, response = _validate_prompt_response_formatted_example(example)
+
     # Note: We default to the tokenizer's add_bos_token and add_eos_token behavior here
     # (which we do not do for chat-formatted examples). This is because chat examples specifically
     # go through the tokenizer's `apply_chat_template` method, which handles special tokens,
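The two hunks above split validation out of tokenization, so a caller can check an example's keys and types without loading a tokenizer. Below is a minimal, self-contained sketch of that shape. The key sets, the error messages, and the whitespace "tokenizer" are assumptions for illustration; the real function's elided middle section is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

# Assumed stand-ins for the module-level constants referenced in the diff.
ALLOWED_PROMPT_KEYS = {"prompt"}
ALLOWED_RESPONSE_KEYS = {"response", "completion"}


def validate_prompt_response(example: Dict[str, str]) -> Tuple[str, str]:
    """Check keys and types only; no tokenizer is needed for this step."""
    keys = set(example.keys())
    prompt_keys = keys & ALLOWED_PROMPT_KEYS
    response_keys = keys & ALLOWED_RESPONSE_KEYS
    if len(prompt_keys) != 1 or len(response_keys) != 1:
        raise KeyError(
            f"Expected exactly one prompt key and one response key. {example=}")
    prompt = example[next(iter(prompt_keys))]
    response = example[next(iter(response_keys))]
    if not isinstance(prompt, str) or not isinstance(response, str):
        raise TypeError(f"Prompt and response must both be strings. {example=}")
    return prompt, response


def tokenize_prompt_response(
        example: Dict[str, str],
        tokenize: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Validate first, then tokenize the two validated strings."""
    prompt, response = validate_prompt_response(example)
    return {"input_ids": tokenize(prompt), "labels": tokenize(response)}


good = {"prompt": "2 + 2 =", "completion": "4"}
tokens = tokenize_prompt_response(good, str.split)  # toy whitespace tokenizer

try:
    validate_prompt_response({"prompt": "hi", "response": 3})
    error_name = None
except TypeError as err:
    error_name = type(err).__name__
```

The validation helper can now run on its own, which is presumably what the validation notebook needs, and the tokenizing wrapper keeps its original behavior by delegating to it.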
@@ -787,8 +794,8 @@ def dataset_mapper(example: Dict):
             return tokenize_formatted_example(example, tokenizer)

         detected_cpu_count = os.cpu_count() or 1
-        detected_cpus_with_margin = detected_cpu_count - 8
-        num_cpus_to_use = max(1, detected_cpus_with_margin)
+        detected_cpus_with_margin = detected_cpu_count - 8  # pyright: ignore
+        num_cpus_to_use = detected_cpu_count  # Hack for Valiation instead of max(1, detected_cpus_with_margin)

         columns_to_remove = list(dataset[0].keys())
         tokenized_dataset = dataset.map(
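The last hunk swaps the worker-count rule: the old code left eight cores free (with a floor of one worker), while the new code uses every detected core. The sketch below compares the two rules on a few machine sizes. The function names and the sample core counts are hypothetical, and the assumption that this value feeds the `dataset.map` call that follows is inferred from context and not shown in the diff.

```python
import os


def workers_with_margin(cpu_count: int, margin: int = 8) -> int:
    """Pre-change rule: leave `margin` cores free, but never go below one worker."""
    return max(1, cpu_count - margin)


def workers_all(cpu_count: int) -> int:
    """Post-change rule from the diff: use every detected core."""
    return cpu_count


detected = os.cpu_count() or 1  # os.cpu_count() may return None

# Map each sample core count to (old rule, new rule).
comparison = {n: (workers_with_margin(n), workers_all(n)) for n in (4, 8, 16, 64)}
for cores, (old, new) in comparison.items():
    print(f"{cores} cores: old={old}, new={new}")
```

On small machines the old rule collapses to a single worker, which may be why the margin was dropped for the validation path. The trade-off is that the new rule leaves no headroom for other processes on shared hosts.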
23 changes: 20 additions & 3 deletions llmfoundry/utils/__init__.py
@@ -1,6 +1,7 @@
 # Copyright 2022 MosaicML LLM Foundry authors
 # SPDX-License-Identifier: Apache-2.0

+# yapf: disable # isort: skip
 from llmfoundry.utils.builders import (build_algorithm, build_callback,
                                        build_icl_evaluators, build_logger,
                                        build_optimizer, build_scheduler,
@@ -10,12 +11,18 @@
 from llmfoundry.utils.config_utils import (calculate_batch_size_info,
                                            log_config, pop_config,
                                            update_batch_size_info)
-# yapf: disable
+from llmfoundry.utils.data_validation_utils import (check_HF_datasets,
+                                                    cpt_token_counts,
+                                                    create_om_cfg,
+                                                    integrity_check,
+                                                    is_hf_dataset_path,
+                                                    is_uc_delta_table,
+                                                    parse_args, plot_hist,
+                                                    token_counts,
+                                                    token_counts_with_collate)
 from llmfoundry.utils.model_download_utils import (
     download_from_hf_hub, download_from_http_fileserver)

-# yapf: enable
-
 __all__ = [
     'build_callback',
     'build_logger',
@@ -32,4 +39,14 @@
     'update_batch_size_info',
     'log_config',
     'pop_config',
+    'create_om_cfg',
+    'token_counts_with_collate',
+    'token_counts',
+    'check_HF_datasets',
+    'is_hf_dataset_path',
+    'is_uc_delta_table',
+    'parse_args',
+    'cpt_token_counts',
+    'integrity_check',
+    'plot_hist',
 ]