Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) #519

knighton · 2023-12-04T13:32:50Z

Description

The Pickle serialization format, one of the available MDS encodings, is a potential security vulnerability because unpickling untrusted data can execute arbitrary code. This PR adds a flag to either allow or reject datasets containing Pickle.
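To make the risk concrete, here is a minimal sketch using only the standard library. The `Payload` class is hypothetical and not part of this PR; a harmless `eval` of a constant expression stands in for whatever code an attacker might choose.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form runs a callable on load."""

    def __reduce__(self):
        # pickle stores (callable, args) and calls callable(*args) when loading.
        # eval on a constant expression stands in for attacker-chosen code.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading never rebuilds a Payload; it calls eval("6 * 7") instead.
result = pickle.loads(data)
print(type(result).__name__, result)
```

The bytes in `data` look like any other pickled column value, which is why rejecting the encoding up front is safer than inspecting individual samples.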

Work

Add argument allow_unsafe_types: bool = False to StreamingDataset
Add argument allow_unsafe_types: bool to Stream.get_shards()
Add shard validation to get_shards()
Add validate(allow_unsafe_types) to Reader
    This is a no-op for non-MDS shard formats, because unsafe types such as Pickle are specific to MDS
Add is_mds_encoding_safe to MDS's encodings.py
Add _unsafe_mds_encodings set to MDS's encodings.py
Add tests
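The work items above could fit together roughly as follows. This is a sketch under stated assumptions, not the merged code: the names is_mds_encoding_safe and _unsafe_mds_encodings come from the list, while the encoding name 'pkl', the free function validate_shard, its column_encodings parameter, and the choice of ValueError are all hypothetical.

```python
from typing import Iterable

# Assumption: 'pkl' is the name under which the Pickle encoding is registered.
_unsafe_mds_encodings = {'pkl'}


def is_mds_encoding_safe(encoding: str) -> bool:
    """Return True if the named encoding is not in the unsafe set."""
    return encoding not in _unsafe_mds_encodings


def validate_shard(column_encodings: Iterable[str], allow_unsafe_types: bool) -> None:
    """Hypothetical stand-in for Reader.validate(allow_unsafe_types) on an MDS shard.

    Raises ValueError if an unsafe encoding is present and not explicitly allowed.
    """
    if allow_unsafe_types:
        # The caller opted in, so nothing is checked.
        return
    for encoding in column_encodings:
        if not is_mds_encoding_safe(encoding):
            raise ValueError(
                f'Column encoding {encoding!r} is unsafe; '
                f'set allow_unsafe_types=True to load this dataset.')


# A shard with only safe encodings passes validation.
validate_shard(['int', 'str'], allow_unsafe_types=False)

# A shard containing Pickle is rejected unless explicitly allowed.
try:
    validate_shard(['int', 'pkl'], allow_unsafe_types=False)
except ValueError as err:
    print(err)
```

Defaulting the flag to False means existing datasets that use Pickle fail loudly until the user opts in.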

Questions

Should we call this safe/allow_unsafe_types or secure/allow_insecure_types? For comparable terminology, the codebase currently has safe_keep_zip and references to thread-safety. Currently going with safe.

Issue

STR-137

tests/test_unsafe_types.py

simulation/core/sim_dataset.py

snarayan21

some minor comments, other than that lgtm!

snarayan21 · 2023-12-04T18:17:03Z

simulation/core/sim_dataset.py

@@ -125,7 +128,8 @@ def __init__(self,
                 shuffle_block_size: Optional[int] = None,
                 sampling_method: str = 'balanced',
                 sampling_granularity: int = 1,
-                 batching_method: str = 'random') -> None:
+                 batching_method: str = 'random',
+                 allow_unsafe_types: bool = False) -> None:


Once this goes in, can you also submit a PR to foundry that upstreams this arg? I don't think the diffusion repo needs it because it just passes in **kwargs, but foundry datasets don't do that.
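The difference between the two kinds of wrapper can be sketched as below. All three classes are hypothetical stand-ins, not the actual StreamingDataset, diffusion, or foundry classes; only the allow_unsafe_types and batching_method argument names come from the diff above.

```python
class BaseDataset:
    """Hypothetical stand-in for a dataset that gained allow_unsafe_types."""

    def __init__(self,
                 batching_method: str = 'random',
                 allow_unsafe_types: bool = False) -> None:
        self.batching_method = batching_method
        self.allow_unsafe_types = allow_unsafe_types


class KwargsWrapper(BaseDataset):
    """Forwards **kwargs, so new base-class arguments work without changes."""

    def __init__(self, image_size: int = 256, **kwargs) -> None:
        super().__init__(**kwargs)
        self.image_size = image_size


class ExplicitWrapper(BaseDataset):
    """Lists arguments explicitly, so each new one must be added by hand."""

    def __init__(self, batching_method: str = 'random') -> None:
        super().__init__(batching_method=batching_method)


# The kwargs-forwarding wrapper picks up the new flag automatically.
print(KwargsWrapper(allow_unsafe_types=True).allow_unsafe_types)

# The explicit wrapper rejects it until its signature is updated.
try:
    ExplicitWrapper(allow_unsafe_types=True)
except TypeError as err:
    print(f'TypeError: {err}')
```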

karan6181

Can you please update the PR title with more details? Thanks!

knighton added 2 commits December 1, 2023 17:21

Add allow_unsafe_types.

943619a

is_mds_encodings_safe.

8c5f7d3

karan6181 reviewed Dec 4, 2023

tests/test_unsafe_types.py (resolved)

tests/test_unsafe_types.py (outdated, resolved)

tests/test_unsafe_types.py (outdated, resolved)

simulation/core/sim_dataset.py (resolved)

snarayan21 reviewed Dec 4, 2023

knighton added 4 commits December 6, 2023 18:10

usefixtures.

baa6916

Merge branch 'main' into james/unsafe-types

983509b

Fix lint.

1f20eab

Merge branch 'james/unsafe-types' of github.com:mosaicml/streaming in…

8f1245b

…to james/unsafe-types

karan6181 approved these changes Dec 7, 2023

knighton changed the title ~~Add allow_unsafe_types.~~ Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) Dec 7, 2023

knighton merged commit 8b0f1df into main Dec 7, 2023
6 checks passed

knighton deleted the james/unsafe-types branch December 7, 2023 08:21

karan6181 mentioned this pull request Dec 8, 2023

Add safe_load option to restrict HF dataset downloads to allowed file types mosaicml/llm-foundry#779

Closed
