-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement zstd Compression Support for JSONL and Parquet Files #230
Implement zstd Compression Support for JSONL and Parquet Files #230
Conversation
@@ -40,8 +40,13 @@ def __init__( | |||
self.compression = compression | |||
self.output_folder = get_datafolder(output_folder) | |||
output_filename = output_filename or self.default_output_filename | |||
if self.compression == "gzip" and not output_filename.endswith(".gz"): | |||
if output_filename.endswith(".parquet") and compression != "none": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can just use .parquet (not .{compression}.parquet) without adding here.
Then, we do not pass compression option at src/datatrove/pipeline/writers/parquet.py
line 26.
change to
super().__init__(
output_folder,
output_filename,
compression=None,
adapter=adapter,
mode="wb",
expand_metadata=expand_metadata,
max_file_size=max_file_size,
)
Isn't the usual convention to append |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should go with your idea of handling everything on the ParquetWriter side, by changing https://github.com/huggingface/datatrove/pull/230/files#diff-6b0424e98052b42ca1d50f9fe6008cfb0b4191bbcd33ef7529112db08b5a2b4dR26 to None
Handle compression on ParquetWriter directly Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
None to out of list Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include the compression type in the filename within the ParquetWriter code, such as {filename}.snappy.parquet?
adapter: Callable = None, | ||
batch_size: int = 1000, | ||
expand_metadata: bool = False, | ||
max_file_size: int = 5 * 2**30, # 5GB | ||
): | ||
# Validate the compression setting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add validation check for supported protocols
super().__init__( | ||
output_folder, | ||
output_filename, | ||
compression, | ||
adapter, | ||
compression=None, # Ensure superclass initializes without compression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set superclass compression parameter to None to leverage pyarrow's native compression functions.
official extension for zstd is ".zst" Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
HI @guipenedo , I used the branch requested in the pull request to process the jsonl.zst file found at dclm-baseline-1.0, and I can confirm that it works correctly. Please review this pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, sorry for the delay and thanks a lot!
Support for zstd compression in both JSONL and Parquet file formats.
Parquet Files:
Filename Convention: