
Implement zstd Compression Support for JSONL and Parquet Files #230

Merged: 11 commits merged into huggingface:main on Aug 28, 2024

Conversation

justHungryMan (Contributor)

This PR adds support for zstd compression in both the JSONL and Parquet file formats.

Parquet Files:

  • The implementation applies compression directly within the internal write function (pq.ParquetWriter) using its compression option.
  • To keep the DiskWriter implementation clear and avoid redundancy, we still pass the compression option at the top level but set compression=None in the actual DiskWriter call. This ensures that DiskWriter does not attempt to compress already compressed data.
  • Even though the compression option is specified in ParquetWriter, it is feasible to pass super().__init__(compression=None) to control behavior from a higher abstraction level. This design choice allows flexibility depending on how users might want to integrate or extend our functionality. (A minimal sketch of this arrangement follows this list.)
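To make the division of responsibilities concrete, here is a minimal, standalone sketch (not the datatrove implementation; the function name and filename are illustrative) of the idea: pyarrow's ParquetWriter compresses column chunks itself, so the underlying file is opened plain, which corresponds to passing compression=None at the DiskWriter level.

import pyarrow as pa
import pyarrow.parquet as pq

def write_zstd_parquet(rows: list[dict], path: str = "00000.zstd.parquet") -> None:
    # Build an Arrow table from plain Python dicts.
    table = pa.Table.from_pylist(rows)
    # The file object is opened uncompressed ("wb", i.e. compression=None at the
    # disk-writer level); zstd is applied by pyarrow to each column chunk.
    with open(path, "wb") as f:
        with pq.ParquetWriter(f, table.schema, compression="zstd") as writer:
            writer.write_table(table)

if __name__ == "__main__":
    write_zstd_parquet([{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}])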

Filename Convention:

  • For clarity and standardization, files compressed with zstd are named with the suffix {file_name}.zstd.parquet (or .zstd.jsonl), aligning with typical conventions and making compressed files easy to identify.
  • If the file extension is set to .parquet.zstd, the pq.ParquetFile function fails to recognize and read the file properly. To ensure seamless integration, the recommended naming convention is {file_name}.zstd.parquet, which has been tested and confirmed to work with our current setup. (A small naming sketch follows this list.)
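As an illustration of the convention above, here is a hypothetical helper (not part of datatrove; the function name is made up) that builds the compressed filename so the codec sits before the .parquet extension:

def compressed_parquet_name(base: str, compression: str | None) -> str:
    # No codec suffix for uncompressed output.
    if compression in (None, "none"):
        return f"{base}.parquet"
    # e.g. "00000.zstd.parquet" or "00000.snappy.parquet"
    return f"{base}.{compression}.parquet"

assert compressed_parquet_name("00000", "zstd") == "00000.zstd.parquet"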

@@ -40,8 +40,13 @@ def __init__(
         self.compression = compression
         self.output_folder = get_datafolder(output_folder)
         output_filename = output_filename or self.default_output_filename
-        if self.compression == "gzip" and not output_filename.endswith(".gz"):
+        if output_filename.endswith(".parquet") and compression != "none":
justHungryMan (Contributor Author) commented on this diff:

Alternatively, we could just use .parquet (not .{compression}.parquet) and not add the suffix here. Then we would not pass the compression option at src/datatrove/pipeline/writers/parquet.py line 26.

Change it to:

super().__init__(
    output_folder,
    output_filename,
    compression=None,
    adapter=adapter,
    mode="wb",
    expand_metadata=expand_metadata,
    max_file_size=max_file_size,
)

guipenedo (Collaborator) commented on Jul 8, 2024:

Isn't the usual convention to append .zst (instead of .zstd) at the end of the filename rather than in the middle?
Edit: ok, I see this is a parquet-specific thing.

guipenedo (Collaborator) left a review comment:

I think we should go with your idea of handling everything on the ParquetWriter side, by changing https://github.com/huggingface/datatrove/pull/230/files#diff-6b0424e98052b42ca1d50f9fe6008cfb0b4191bbcd33ef7529112db08b5a2b4dR26 to None

Review threads (resolved):

  • src/datatrove/pipeline/writers/disk_base.py
  • src/datatrove/pipeline/writers/parquet.py
justHungryMan and others added 4 commits on July 16, 2024:

  • Handle compression on ParquetWriter directly (Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>)
  • None to out of list (Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>)
justHungryMan (Contributor Author) left a comment:

@guipenedo

Should we include the compression type in the filename within the ParquetWriter code, such as {filename}.snappy.parquet?

        adapter: Callable = None,
        batch_size: int = 1000,
        expand_metadata: bool = False,
        max_file_size: int = 5 * 2**30,  # 5GB
    ):
        # Validate the compression setting
justHungryMan (Contributor Author):

Add a validation check for the supported compression options.
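For illustration, a hypothetical validation along these lines (the exact accepted values and error message in datatrove may differ; the set below mirrors the codecs pyarrow's Parquet writer supports):

def validate_compression(compression: str | None) -> None:
    # Codecs accepted by pyarrow.parquet.ParquetWriter, plus None for no compression.
    supported = {"snappy", "gzip", "brotli", "lz4", "zstd", None}
    if compression not in supported:
        raise ValueError(f"Invalid compression {compression!r}; expected one of {supported}")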

        super().__init__(
            output_folder,
            output_filename,
-           compression,
            adapter,
+           compression=None,  # Ensure superclass initializes without compression
justHungryMan (Contributor Author):

Set superclass compression parameter to None to leverage pyarrow's native compression functions.

Commit: official extension for zstd is ".zst"
Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
justHungryMan (Contributor Author) commented:
Hi @guipenedo,

I used the branch from this pull request to process the jsonl.zst files found in dclm-baseline-1.0, and I can confirm that it works correctly. Please review this pull request.
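For reference, a minimal sketch (not datatrove's actual reader; it uses the zstandard package directly, and the function name is illustrative) of streaming records from a .jsonl.zst shard such as those in dclm-baseline-1.0:

import io
import json
import zstandard as zstd

def iter_jsonl_zst(path: str):
    # Yield one JSON object per line from a zstd-compressed JSONL file.
    with open(path, "rb") as raw:
        with zstd.ZstdDecompressor().stream_reader(raw) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)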

guipenedo (Collaborator) left a review:

LGTM, sorry for the delay and thanks a lot!

guipenedo merged commit d5d1924 into huggingface:main on Aug 28, 2024.