Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes #668
as per offline discussion, holding off for now
@XiaohanZhangCMU can you make the PR title and description more informative? In the description, mind adding what the issue was and why this addresses it? thanks :)
This is great -- seemingly simple fix for a hairy bug. Thank you @XiaohanZhangCMU !
@XiaohanZhangCMU can you make the PR title more descriptive before merging?
Description of changes:
Problem:
When calling dataframe_to_mds and writing files to dbfs:/Volumes, the resulting dataset can contain zero-byte shards or index files. The problem stems from two aspects:
Fix:
We add a manual check that reads the uploaded file's size from the remote metadata and compares it with the local file size. A mismatch raises an exception so that Streaming's upload can retry. In our experiments, retry=2 reliably minimized the chance of zero-byte uploads.
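The check described above could look roughly like the sketch below. This is an illustration, not the code in this PR: `upload_with_size_check`, `UploadSizeMismatchError`, and the `upload` and `get_remote_size` callables are hypothetical names standing in for the real uploader and the remote metadata lookup, and the retry loop is folded into the same function only to keep the example self-contained. It also assumes that `retries=2` means up to two additional attempts after the first upload.

```python
import os
from typing import Callable


class UploadSizeMismatchError(RuntimeError):
    """Hypothetical error: remote size differs from the local file size."""


def upload_with_size_check(
    local_path: str,
    upload: Callable[[str], None],
    get_remote_size: Callable[[], int],
    retries: int = 2,
) -> int:
    """Upload a file, verify its remote size, and retry on mismatch.

    `upload` and `get_remote_size` are placeholders supplied by the caller;
    they are assumptions for this sketch, not part of any real API.
    Returns the number of attempts that were needed.
    """
    if retries < 0:
        raise ValueError('retries must be non-negative')

    local_size = os.path.getsize(local_path)
    attempts = retries + 1  # first try plus `retries` additional tries
    remote_size = -1

    for attempt in range(1, attempts + 1):
        upload(local_path)
        # Read the size back from the remote side instead of trusting the upload call.
        remote_size = get_remote_size()
        if remote_size == local_size:
            return attempt

    raise UploadSizeMismatchError(
        f'{local_path}: local size is {local_size} bytes but remote size is '
        f'{remote_size} bytes after {attempts} attempts')
```

Comparing against the local size rather than only testing for zero also catches truncated uploads, at the cost of one extra metadata request per file.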
Issue #, if available:
Merge Checklist:
Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
Tests
pre-commit on my change. (check out the pre-commit section of prerequisites)