Adds parquet writer #103

Merged · 5 commits merged from parquet-writer into main on Feb 22, 2024
Conversation

guipenedo (Collaborator):
No description provided.

guipenedo marked this pull request as ready for review on February 21, 2024 at 15:30.
mariosasko (Contributor) left a comment:
Two comments :).

src/datatrove/pipeline/writers/parquet.py (outdated; resolved)
Comment on lines 42 to 44
```python
self._writers[filename] = pq.ParquetWriter(
    file_handler, schema=pa.table({name: [val] for name, val in document.items()}).schema
)
```
mariosasko (Contributor):
The Document's attributes have fixed types, so I wonder if it would make more sense to pass an explicit schema here:

```python
pa.schema(
    {
        "text": pa.string(),
        "id": pa.string(),
        "media": pa.struct(
            {"type": pa.int32(), "url": pa.string(), "alt": pa.string(), "local_path": pa.string()}
        ),
        "metadata": pa.string(),
    }
)
```

Parquet still doesn't support unions (see apache/parquet-format#44), so we would have to work around this limitation by turning the metadata value into a string with json.dumps(metadata). Then, to make the ParquetReader compatible with this format, we would also have to attach metadata to the schema (pa.schema(fields, metadata=...)) that the reader checks, deserializing with json.loads on the other side if needed.
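A minimal sketch of that workaround, assuming a hypothetical marker key (metadata_is_json) in the schema-level metadata; none of these names come from datatrove:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Writer side: serialize the free-form metadata dict to a JSON string,
# and record that fact as schema-level metadata.
doc = {"text": "hello", "id": "0", "metadata": {"lang": "en", "score": 0.9}}
row = {**doc, "metadata": json.dumps(doc["metadata"])}

schema = pa.schema(
    {"text": pa.string(), "id": pa.string(), "metadata": pa.string()},
    metadata={"metadata_is_json": "1"},  # hypothetical marker key
)
pq.write_table(pa.table({name: [val] for name, val in row.items()}, schema=schema), "docs.parquet")

# Reader side: check the marker and deserialize only if it is present.
table = pq.read_table("docs.parquet")
if table.schema.metadata and table.schema.metadata.get(b"metadata_is_json") == b"1":
    metadata_values = [json.loads(m) for m in table.column("metadata").to_pylist()]
```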

But the current solution is good enough, so this can also be addressed later.

PS: To be extra strict, the fields that should never be null ("text", "id", etc.) in the above schema can be marked non-nullable with pa.field(name, pa_type, nullable=False).
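For instance (field names as in the schema sketched above):

```python
import pyarrow as pa

strict_schema = pa.schema(
    [
        pa.field("text", pa.string(), nullable=False),  # must never be null
        pa.field("id", pa.string(), nullable=False),
        pa.field("metadata", pa.string()),  # nullable (the default)
    ]
)
```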

guipenedo (Collaborator, Author):
They used to have fixed types, but we now support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields).

Regarding unions: does this mean that if we have different value types in metadata (say, strings and floats), this doesn't work?

Regarding nullability: custom user formats would be a problem there as well.

guipenedo (Collaborator, Author):
Maybe we could also use pa.RecordBatch.from_pylist([document]).schema here instead?
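For illustration, both approaches infer the same schema from a plain dict (document here is just a stand-in):

```python
import pyarrow as pa

document = {"text": "hello", "id": "0", "metadata": "{}"}  # stand-in example

# Current approach: build a one-row table just to read off its schema.
schema_via_table = pa.table({name: [val] for name, val in document.items()}).schema

# Suggested alternative: infer the schema from the row directly.
schema_via_batch = pa.RecordBatch.from_pylist([document]).schema

# Both should infer plain string columns for this input.
assert schema_via_table.equals(schema_via_batch)
```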

mariosasko (Contributor) · Feb 21, 2024:
> They used to have fixed types but now we support an adapter so that people can choose their output format (still a dictionary, but they can do whatever they want with the fields)

We could use the fixed schema only when no adapter is specified.

> Regarding unions, does this mean if we have different value types in metadata (let's say strings and floats) then this doesn't work?

JSON supports these types, so it will work.

> maybe we could also have pa.RecordBatch.from_pylist([document]).schema here instead?

Yes, this would indeed be cleaner.

guipenedo (Collaborator, Author):
I see. I think for now we'll keep the current format, so that even when people upload directly to the Hub there isn't a big JSON field.

guipenedo and others added 3 commits February 21, 2024 17:45
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
guipenedo merged commit d4cf053 into main on Feb 22, 2024 — 4 checks passed.
guipenedo deleted the parquet-writer branch on February 22, 2024 at 10:45.