
Implement zstd Compression Support for JSONL and Parquet Files #230

Merged: 11 commits merged into huggingface:main on Aug 28, 2024

Conversation

justHungryMan (Contributor)

This PR adds support for zstd compression in both the JSONL and Parquet file formats.

Parquet Files:

  • The implementation applies compression directly within the internal write function (pq.ParquetWriter) using its compression option.
  • To keep the DiskWriter implementation clear and avoid redundancy, we still pass the compression option at the top level but set compression=None in the actual DiskWriter call. This ensures that DiskWriter does not attempt to compress already compressed data.
  • Even though the compression option is specified in ParquetWriter, it is feasible to pass super().__init__(compression=None) to control behavior from a higher abstraction level. This design choice allows flexibility depending on how users might want to integrate or extend our functionality. (A minimal sketch of this arrangement follows this list.)
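To make the division of responsibilities concrete, here is a minimal, standalone sketch (not the datatrove implementation; the function name and filename are illustrative) of the idea: pyarrow's ParquetWriter compresses column chunks itself, so the underlying file is opened plain, which corresponds to passing compression=None at the DiskWriter level.

import pyarrow as pa
import pyarrow.parquet as pq

def write_zstd_parquet(rows: list[dict], path: str = "00000.zstd.parquet") -> None:
    # Build an Arrow table from plain Python dicts.
    table = pa.Table.from_pylist(rows)
    # The file object is opened uncompressed ("wb", i.e. compression=None at the
    # disk-writer level); zstd is applied by pyarrow to each column chunk.
    with open(path, "wb") as f:
        with pq.ParquetWriter(f, table.schema, compression="zstd") as writer:
            writer.write_table(table)

if __name__ == "__main__":
    write_zstd_parquet([{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}])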

Filename Convention:

  • For clarity and standardization, files compressed with zstd are named with the suffix {file_name}.zstd.parquet (or .zstd.jsonl), aligning with typical conventions and making compressed files easy to identify.
  • If the file extension is set to .parquet.zstd, the pq.ParquetFile function fails to recognize and read the file properly. To ensure seamless integration, the recommended naming convention is {file_name}.zstd.parquet, which has been tested and confirmed to work with our current setup. (A small naming sketch follows this list.)
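As an illustration of the convention above, here is a hypothetical helper (not part of datatrove; the function name is made up) that builds the compressed filename so the codec sits before the .parquet extension:

def compressed_parquet_name(base: str, compression: str | None) -> str:
    # No codec suffix for uncompressed output.
    if compression in (None, "none"):
        return f"{base}.parquet"
    # e.g. "00000.zstd.parquet" or "00000.snappy.parquet"
    return f"{base}.{compression}.parquet"

assert compressed_parquet_name("00000", "zstd") == "00000.zstd.parquet"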

@@ -40,8 +40,13 @@ def __init__(
         self.compression = compression
         self.output_folder = get_datafolder(output_folder)
         output_filename = output_filename or self.default_output_filename
-        if self.compression == "gzip" and not output_filename.endswith(".gz"):
+        if output_filename.endswith(".parquet") and compression != "none":
justHungryMan (Contributor Author) commented on this diff:

Alternatively, we could just use .parquet (not .{compression}.parquet) and not add the suffix here. Then we would not pass the compression option at src/datatrove/pipeline/writers/parquet.py line 26.

Change it to:

super().__init__(
    output_folder,
    output_filename,
    compression=None,
    adapter=adapter,
    mode="wb",
    expand_metadata=expand_metadata,
    max_file_size=max_file_size,
)

guipenedo (Collaborator) commented on Jul 8, 2024:

Isn't the usual convention to append .zst (instead of .zstd) at the end of the filename rather than in the middle?
Edit: ok, I see this is a parquet-specific thing.

guipenedo (Collaborator) left a review comment:

I think we should go with your idea of handling everything on the ParquetWriter side, by changing https://github.com/huggingface/datatrove/pull/230/files#diff-6b0424e98052b42ca1d50f9fe6008cfb0b4191bbcd33ef7529112db08b5a2b4dR26 to None

Review threads (resolved):

  • src/datatrove/pipeline/writers/disk_base.py
  • src/datatrove/pipeline/writers/parquet.py
justHungryMan and others added 4 commits on July 16, 2024:

  • Handle compression on ParquetWriter directly (Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>)
  • None to out of list (Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>)
justHungryMan (Contributor Author) left a comment:

@guipenedo

Should we include the compression type in the filename within the ParquetWriter code, such as {filename}.snappy.parquet?

        adapter: Callable = None,
        batch_size: int = 1000,
        expand_metadata: bool = False,
        max_file_size: int = 5 * 2**30,  # 5GB
    ):
        # Validate the compression setting
justHungryMan (Contributor Author):

Add a validation check for the supported compression options.
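For illustration, a hypothetical validation along these lines (the exact accepted values and error message in datatrove may differ; the set below mirrors the codecs pyarrow's Parquet writer supports):

def validate_compression(compression: str | None) -> None:
    # Codecs accepted by pyarrow.parquet.ParquetWriter, plus None for no compression.
    supported = {"snappy", "gzip", "brotli", "lz4", "zstd", None}
    if compression not in supported:
        raise ValueError(f"Invalid compression {compression!r}; expected one of {supported}")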

        super().__init__(
            output_folder,
            output_filename,
-           compression,
            adapter,
+           compression=None,  # Ensure superclass initializes without compression
justHungryMan (Contributor Author):

Set superclass compression parameter to None to leverage pyarrow's native compression functions.

Commit: official extension for zstd is ".zst"
Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
justHungryMan (Contributor Author) commented:
Hi @guipenedo,

I used the branch from this pull request to process the jsonl.zst files found in dclm-baseline-1.0, and I can confirm that it works correctly. Please review this pull request.
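For reference, a minimal sketch (not datatrove's actual reader; it uses the zstandard package directly, and the function name is illustrative) of streaming records from a .jsonl.zst shard such as those in dclm-baseline-1.0:

import io
import json
import zstandard as zstd

def iter_jsonl_zst(path: str):
    # Yield one JSON object per line from a zstd-compressed JSONL file.
    with open(path, "rb") as raw:
        with zstd.ZstdDecompressor().stream_reader(raw) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)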

guipenedo (Collaborator) left a review:

LGTM, sorry for the delay and thanks a lot!

guipenedo merged commit d5d1924 into huggingface:main on Aug 28, 2024.