Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) #519
some minor comments, other than that lgtm!
@@ -125,7 +128,8 @@ def __init__(self,
                  shuffle_block_size: Optional[int] = None,
                  sampling_method: str = 'balanced',
                  sampling_granularity: int = 1,
-                 batching_method: str = 'random') -> None:
+                 batching_method: str = 'random',
+                 allow_unsafe_types: bool = False) -> None:
Once this goes in, can you also submit a PR to foundry that upstreams this arg? I don't think the diffusion repo needs it because it just passes in **kwargs, but the foundry datasets don't do that.
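The point about **kwargs can be sketched as follows. This is only an illustration: BaseDataset, KwargsWrapper, and ExplicitWrapper are hypothetical stand-ins, not classes from the foundry or diffusion repos.

```python
class BaseDataset:
    """Hypothetical stand-in for the upstream dataset class."""

    def __init__(self,
                 batching_method: str = 'random',
                 allow_unsafe_types: bool = False) -> None:
        self.batching_method = batching_method
        self.allow_unsafe_types = allow_unsafe_types


class KwargsWrapper(BaseDataset):
    """Forwards all keyword arguments, so a new upstream arg works unchanged."""

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)


class ExplicitWrapper(BaseDataset):
    """Lists its arguments explicitly, so a new upstream arg must be added by hand."""

    def __init__(self, batching_method: str = 'random') -> None:
        super().__init__(batching_method=batching_method)


# The **kwargs wrapper picks up the new flag automatically.
forwarded = KwargsWrapper(allow_unsafe_types=True)

# The explicit wrapper rejects it until its signature is updated.
try:
    ExplicitWrapper(allow_unsafe_types=True)
    explicit_accepts_flag = True
except TypeError:
    explicit_accepts_flag = False
```

A wrapper with an explicit signature needs its own change to expose the flag, which is why a follow-up PR is requested for one repo and not the other.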
Can you please update the PR title with more details? Thanks!
Description
The Pickle serialization format, which is one of the available MDS encodings, is a potential security vulnerability: unpickling untrusted data can execute arbitrary code. This PR adds a flag to either allow or reject datasets containing Pickle.
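For context, here is a minimal, harmless sketch of why loading Pickle data from an untrusted source is risky. It uses only the Python standard library. The Payload class is hypothetical, and a real attack would substitute a dangerous callable for operator.mul.

```python
import operator
import pickle


class Payload:
    """Hypothetical object whose pickled form names a callable to run on load."""

    def __reduce__(self):
        # pickle.loads will call operator.mul(6, 7) instead of rebuilding Payload.
        return (operator.mul, (6, 7))


data = pickle.dumps(Payload())

# Loading the bytes invokes the callable chosen by whoever produced them.
result = pickle.loads(data)
print(result)
```

The loaded value is the return value of the callable, not a Payload instance, which shows that the author of the bytes decides what code runs during loading.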
Work
- Add allow_unsafe_types: bool = False to StreamingDataset
- Add allow_unsafe_types: bool to Stream.get_shards()
- Add validate(allow_unsafe_types) to Reader, for use by get_shards()
- Add is_mds_encoding_safe to MDS' encodings.py
- Add _unsafe_mds_encodings set to MDS' encodings.py
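A minimal sketch of how these pieces might fit together is shown below. It is not the PR's actual code. The names is_mds_encoding_safe and _unsafe_mds_encodings come from the list above, but their signatures, the 'pkl' encoding name, and the validate_columns helper are assumptions made for illustration.

```python
from typing import Dict, Set

# Assumption: 'pkl' is the encoding name for Pickle and the only unsafe entry.
_unsafe_mds_encodings: Set[str] = {'pkl'}


def is_mds_encoding_safe(encoding: str) -> bool:
    """Return True unless the encoding is listed as unsafe."""
    return encoding not in _unsafe_mds_encodings


def validate_columns(column_encodings: Dict[str, str],
                     allow_unsafe_types: bool) -> None:
    """Hypothetical stand-in for a validate(allow_unsafe_types) check."""
    if allow_unsafe_types:
        return
    unsafe = sorted(name for name, encoding in column_encodings.items()
                    if not is_mds_encoding_safe(encoding))
    if unsafe:
        raise ValueError(
            f'Columns {unsafe} use unsafe encodings. '
            'Pass allow_unsafe_types=True to load them anyway.')


columns = {'image': 'jpeg', 'meta': 'pkl'}

# Explicit opt-in: the dataset is accepted.
validate_columns(columns, allow_unsafe_types=True)

# Default behavior: the dataset is rejected.
try:
    validate_columns(columns, allow_unsafe_types=False)
    rejected = False
except ValueError:
    rejected = True
```

Defaulting the flag to False means a dataset containing Pickle is rejected unless the caller opts in explicitly.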
Questions
Should the terminology be safe/allow_unsafe_types or secure/allow_insecure_types (for comparable terminology, we currently have safe_keep_zip and references to thread-safety in the codebase)? Currently going with safe.
Issue
STR-137