Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) #519

Merged (6 commits) on Dec 7, 2023

@knighton (Contributor) commented Dec 4, 2023

Description

The Pickle serialization format, one of the available MDS encodings, is a potential security vulnerability: unpickling untrusted data can execute arbitrary code. This PR adds a flag to either allow or reject datasets that contain Pickle-encoded data.
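
For context on the risk, here is a minimal, self-contained sketch (not code from this PR) showing that `pickle.loads` calls whatever callable a payload names in `__reduce__`. The `Payload` class is hypothetical, and the expression it evaluates is deliberately harmless.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form runs code when it is loaded."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and, on load, calls eval('6 * 7').
        # Untrusted data could name any other importable callable here instead.
        return (eval, ('6 * 7',))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it returns whatever the named callable returned.
result = pickle.loads(blob)
print(result)  # 42
```

Because the call happens inside `pickle.loads` itself, the only reliable defense is to refuse to unpickle untrusted data at all, which is what a reject-by-default flag achieves.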

Work

  • Add argument allow_unsafe_types: bool = False to StreamingDataset
  • Add argument allow_unsafe_types: bool to Stream.get_shards()
  • Add shard validation to get_shards()
  • Add validate(allow_unsafe_types) to Reader
      • This is a no-op for non-MDS shard formats, because unsafe types (Pickle) are specific to MDS
  • Add is_mds_encoding_safe to the MDS encodings.py
  • Add the _unsafe_mds_encodings set to the MDS encodings.py
  • Add tests
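
How these pieces could fit together is sketched below. This is a simplified illustration, not the merged implementation: `is_mds_encoding_safe`, `_unsafe_mds_encodings`, `validate(allow_unsafe_types)`, and `get_shards` are names from the list above, while the stand-in `Reader` and `MDSReader` classes, the free-function form of `get_shards`, the `'pkl'` encoding name, and the error message are all assumptions.

```python
from typing import List

# Assumption: 'pkl' is the name of the Pickle encoding.
_unsafe_mds_encodings = {'pkl'}


def is_mds_encoding_safe(encoding: str) -> bool:
    """Return True if the encoding is not in the unsafe set."""
    return encoding not in _unsafe_mds_encodings


class Reader:
    """Stand-in base shard reader (hypothetical, heavily simplified)."""

    def validate(self, allow_unsafe_types: bool) -> None:
        # No-op by default: non-MDS shard formats have no Pickle encoding.
        pass


class MDSReader(Reader):
    """Stand-in MDS shard reader that knows its column encodings."""

    def __init__(self, column_names: List[str], column_encodings: List[str]) -> None:
        self.column_names = column_names
        self.column_encodings = column_encodings

    def validate(self, allow_unsafe_types: bool) -> None:
        if allow_unsafe_types:
            return
        for name, encoding in zip(self.column_names, self.column_encodings):
            if not is_mds_encoding_safe(encoding):
                raise ValueError(
                    f'Column {name!r} uses unsafe encoding {encoding!r}; '
                    f'pass allow_unsafe_types=True to load it anyway.')


def get_shards(readers: List[Reader], allow_unsafe_types: bool) -> List[Reader]:
    """Hypothetical free-function version of the validation step in get_shards()."""
    for reader in readers:
        reader.validate(allow_unsafe_types)
    return readers


shard = MDSReader(['image', 'meta'], ['jpeg', 'pkl'])
get_shards([shard], allow_unsafe_types=True)  # Accepted: the caller opted in.
try:
    get_shards([shard], allow_unsafe_types=False)
except ValueError as err:
    print(err)  # Rejected: 'meta' is Pickle-encoded.
```

Validating at shard-listing time means an unsafe dataset is rejected before any sample is decoded, rather than partway through iteration.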

Questions

  • Should we call this safe/allow_unsafe_types or secure/allow_insecure_types? For comparable terminology, the codebase currently has safe_keep_zip and references to thread-safety. Currently going with safe.

Issue

STR-137

@snarayan21 (Collaborator) left a comment

some minor comments, other than that lgtm!

@@ -125,7 +128,8 @@ def __init__(self,
  shuffle_block_size: Optional[int] = None,
  sampling_method: str = 'balanced',
  sampling_granularity: int = 1,
- batching_method: str = 'random') -> None:
+ batching_method: str = 'random',
+ allow_unsafe_types: bool = False) -> None:

Once this goes in, can you also submit a PR to foundry that upstreams this arg? I don't think the diffusion repo needs it because it just passes in **kwargs, but the foundry datasets don't do that.
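
The difference between the two kinds of downstream wrapper can be sketched as follows. All three classes are hypothetical stand-ins: `StreamingDataset` here models only the two arguments shown in the diff, and neither wrapper is taken from the foundry or diffusion repos.

```python
class StreamingDataset:
    """Stand-in for the real class; only two arguments are modeled."""

    def __init__(self,
                 batching_method: str = 'random',
                 allow_unsafe_types: bool = False) -> None:
        self.batching_method = batching_method
        self.allow_unsafe_types = allow_unsafe_types


class KwargsWrapper(StreamingDataset):
    """Forwards **kwargs, so it picks up new base-class arguments automatically."""

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)


class ExplicitWrapper(StreamingDataset):
    """Lists its arguments explicitly, so each new one must be added by hand."""

    def __init__(self, batching_method: str = 'random') -> None:
        super().__init__(batching_method=batching_method)


ok = KwargsWrapper(allow_unsafe_types=True)
print(ok.allow_unsafe_types)  # True

try:
    ExplicitWrapper(allow_unsafe_types=True)
except TypeError as err:
    print(f'needs an upstream change: {err}')
```

Until an explicit wrapper is updated, its users cannot opt in, so datasets containing Pickle would be rejected under the default of `False`.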

@karan6181 (Collaborator) left a comment

Can you please update the PR title with more details? Thanks!

@knighton changed the title from "Add allow_unsafe_types." to "Add flag to allow or reject datasets containing unsafe types (i.e., Pickle)" on Dec 7, 2023
@knighton merged commit 8b0f1df into main on Dec 7, 2023 (6 checks passed)
@knighton deleted the james/unsafe-types branch on December 7, 2023 08:21