

Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes #668

Merged
merged 11 commits into mosaicml:main
May 14, 2024

Conversation

XiaohanZhangCMU
Member

@XiaohanZhangCMU XiaohanZhangCMU commented May 9, 2024

Description of changes:

Problem:

When calling dataframe_to_mds and writing files to dbfs:/Volumes, the result datasets can have some zero-byte shards or index files. The problem stems from two aspects:

  1. When mapInPandas is called, each executor is assigned several tasks, and only one Python thread per executor works through them. Each task initializes its own MDSWriter, which in turn instantiates a thread pool executor for file uploading, with one thread responsible for each file. When there are more tasks than available processes, the tasks share threads. Multiple upload futures can then be queued on a single thread, and the uploads are stochastically throttled, resulting in unsuccessful uploads, i.e., files that exist but contain zero bytes.
  2. Why didn't it retry? Databricks' Files API does not raise any exception when this upload failure occurs, so our code never attempted a retry.
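The oversubscription scenario in point 1 can be illustrated with a minimal sketch (the names `upload_file` and `shard_*.mds` are illustrative, not the actual Streaming internals): when more upload futures are submitted than there are worker threads, the extra futures queue and wait behind running ones.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def upload_file(name: str) -> str:
    """Stand-in for a remote shard upload."""
    time.sleep(0.01)
    return f"{name}: uploaded"

# Fewer workers than tasks: 8 futures share 2 threads, so most
# uploads sit queued behind others, mirroring the throttling
# scenario described above.
pool = ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(upload_file, f"shard_{i}.mds") for i in range(8)]
results = [f.result() for f in futures]
pool.shutdown()
```

In this toy example every future eventually completes; the bug arose because a real queued upload could silently produce a zero-byte file without raising.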

Fix:
We add a manual check that compares the size of the uploaded file, as reported by the remote metadata, with the local file size. A mismatch raises an exception so that Streaming's upload logic can retry. Experiments show that retry=2 reliably minimizes the chance of zero-byte uploads.
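A minimal sketch of the validate-and-retry idea described above (the callables `upload_fn` and `get_remote_size` are hypothetical stand-ins for the Databricks Files API calls, not Streaming's actual API):

```python
import os

class UploadSizeMismatch(Exception):
    """Raised when the remote object's size differs from the local file's."""

def upload_with_validation(local_path, upload_fn, get_remote_size, retries=2):
    """Upload a file, then verify the remote size matches the local size.

    Retries the upload on a size mismatch, since the remote API may not
    raise on a truncated (zero-byte) upload.
    """
    local_size = os.path.getsize(local_path)
    for _attempt in range(retries + 1):
        upload_fn(local_path)
        if get_remote_size(local_path) == local_size:
            return  # upload verified
    raise UploadSizeMismatch(
        f'upload of {local_path} still truncated after {retries} retries')
```

The key design point is that the size comparison converts a silent failure into an exception, which lets the existing retry machinery do its job.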

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

Collaborator

@snarayan21 snarayan21 left a comment


as per offline discussion, holding off for now

Review comments on streaming/base/storage/upload.py (outdated, resolved)
@snarayan21
Collaborator

@XiaohanZhangCMU can you make the PR title and description more informative? In the description, mind adding what the issue was and why this addresses it? thanks :)

Collaborator

@snarayan21 snarayan21 left a comment


This is great -- seemingly simple fix for a hairy bug. Thank you @XiaohanZhangCMU !

@snarayan21
Collaborator

@XiaohanZhangCMU can you make the PR title more descriptive before merging?

@XiaohanZhangCMU XiaohanZhangCMU merged commit 80cc752 into mosaicml:main May 14, 2024
8 checks passed
@XiaohanZhangCMU XiaohanZhangCMU changed the title Add fix Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes May 14, 2024