Adds parquet writer #103

Merged · 5 commits merged from parquet-writer into main on Feb 22, 2024
Conversation

guipenedo (Collaborator):
No description provided.

guipenedo marked this pull request as ready for review on February 21, 2024 at 15:30.
mariosasko (Contributor) left a comment:
Two comments :).

src/datatrove/pipeline/writers/parquet.py (outdated; resolved)
Comment on lines 42 to 44
```python
self._writers[filename] = pq.ParquetWriter(
    file_handler, schema=pa.table({name: [val] for name, val in document.items()}).schema
)
```
mariosasko (Contributor):
The Document's attributes have fixed types, so I wonder if it would make more sense to pass an explicit schema here:

```python
pa.schema(
    {
        "text": pa.string(),
        "id": pa.string(),
        "media": pa.struct(
            {"type": pa.int32(), "url": pa.string(), "alt": pa.string(), "local_path": pa.string()}
        ),
        "metadata": pa.string(),
    }
)
```

Parquet still doesn't support unions (see apache/parquet-format#44), so we would have to work around this limitation by turning the metadata value into a string with json.dumps(metadata). Then, to make the ParquetReader compatible with this format, we would also have to attach metadata to the schema (pa.schema(fields, metadata=...)) that the reader checks, deserializing with json.loads on the other side if needed.
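A minimal sketch of that workaround, assuming a hypothetical marker key (metadata_is_json) in the schema-level metadata; none of these names come from datatrove:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Writer side: serialize the free-form metadata dict to a JSON string,
# and record that fact as schema-level metadata.
doc = {"text": "hello", "id": "0", "metadata": {"lang": "en", "score": 0.9}}
row = {**doc, "metadata": json.dumps(doc["metadata"])}

schema = pa.schema(
    {"text": pa.string(), "id": pa.string(), "metadata": pa.string()},
    metadata={"metadata_is_json": "1"},  # hypothetical marker key
)
pq.write_table(pa.table({name: [val] for name, val in row.items()}, schema=schema), "docs.parquet")

# Reader side: check the marker and deserialize only if it is present.
table = pq.read_table("docs.parquet")
if table.schema.metadata and table.schema.metadata.get(b"metadata_is_json") == b"1":
    metadata_values = [json.loads(m) for m in table.column("metadata").to_pylist()]
```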

But the current solution is good enough, so this can also be addressed later.

PS: To be extra strict, the fields that should never be null ("text", "id", etc.) in the above schema can be marked non-nullable with pa.field(name, pa_type, nullable=False).
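For instance (field names as in the schema sketched above):

```python
import pyarrow as pa

strict_schema = pa.schema(
    [
        pa.field("text", pa.string(), nullable=False),  # must never be null
        pa.field("id", pa.string(), nullable=False),
        pa.field("metadata", pa.string()),  # nullable (the default)
    ]
)
```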

guipenedo (Collaborator, Author):
They used to have fixed types, but we now support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields).

Regarding unions: does this mean that if we have different value types in metadata (say, strings and floats), this doesn't work?

Regarding nullability: custom user formats would be a problem there as well.

guipenedo (Collaborator, Author):
Maybe we could also use pa.RecordBatch.from_pylist([document]).schema here instead?
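For illustration, both approaches infer the same schema from a plain dict (document here is just a stand-in):

```python
import pyarrow as pa

document = {"text": "hello", "id": "0", "metadata": "{}"}  # stand-in example

# Current approach: build a one-row table just to read off its schema.
schema_via_table = pa.table({name: [val] for name, val in document.items()}).schema

# Suggested alternative: infer the schema from the row directly.
schema_via_batch = pa.RecordBatch.from_pylist([document]).schema

# Both should infer plain string columns for this input.
assert schema_via_table.equals(schema_via_batch)
```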

mariosasko (Contributor) · Feb 21, 2024:
> They used to have fixed types but now we support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields)

We could use the fixed schema only when no adapter is specified.

> Regarding unions, does this mean if we have different value types in metadata (let's say strings and floats) then this doesn't work?

JSON supports these types, so it will work.

> maybe we could also have pa.RecordBatch.from_pylist([document]).schema here instead?

Yes, this would indeed be cleaner.

guipenedo (Collaborator, Author):
I see. I think for now we'll keep the current format, so that even when people upload directly to the Hub there isn't a big JSON field.

guipenedo and others added 3 commits February 21, 2024 17:45
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
guipenedo merged commit d4cf053 into main on Feb 22, 2024 — 4 checks passed.
guipenedo deleted the parquet-writer branch on February 22, 2024 at 10:45.