Introduce tony.application.x.untracked.timeout to solve partial jobs hang #610

Closed
zuston wants to merge 2 commits


zuston
Member

@zuston zuston commented Oct 13, 2021

No description provided.

@oliverhu
Member

Why not just tony.x.timeout? Don't tracked tasks also need a timeout?

@zuston
Member Author

zuston commented Oct 14, 2021

See #603.

Why
On TFRuntime, I found two problems with TensorFlow training in our production environment:

1. Sometimes, when using the tf Estimator API, the evaluator keeps waiting for the newest global step after the chief has already finished, so the evaluator hangs (see the TF doc).
2. Sometimes a worker hangs due to TF bugs even though the chief has finished its task, so we need to kill the workers that have not finished.

Besides, should the job be marked as failed when some worker or evaluator has not finished?

I think that can be left to the user via config.

I think introducing an untracked task timeout is enough to solve the above problems.

Do you have other ideas about possible problems?
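
To make the proposal concrete, here is a minimal sketch of the untracked-timeout idea. It is not the code in this PR: the Task class, the UntrackedTimeoutMonitor name, and the poll method are hypothetical, and it assumes the hanging evaluator and workers are configured as untracked job types. The clock starts once every tracked task has finished, and unfinished untracked tasks are reported for killing after the timeout.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Task:
    """Hypothetical task record; not a TonY class."""
    job_type: str          # e.g. "chief", "worker", "evaluator"
    finished: bool = False


class UntrackedTimeoutMonitor:
    """Hypothetical helper that decides when untracked tasks should be killed."""

    def __init__(self, untracked_types: Iterable[str], timeout_s: float):
        self.untracked_types: Set[str] = set(untracked_types)
        self.timeout_s = timeout_s
        # Time at which all tracked tasks were first seen finished.
        self.tracked_done_at: Optional[float] = None

    def poll(self, tasks: List[Task], now: float) -> List[Task]:
        """Return the unfinished untracked tasks to kill at time `now` (seconds)."""
        tracked = [t for t in tasks if t.job_type not in self.untracked_types]
        if not all(t.finished for t in tracked):
            return []  # tracked work (e.g. the chief) is still running
        if self.tracked_done_at is None:
            self.tracked_done_at = now  # start the timeout clock
        if now - self.tracked_done_at < self.timeout_s:
            return []  # still within the grace period
        return [
            t for t in tasks
            if t.job_type in self.untracked_types and not t.finished
        ]
```

Whether the job is then reported as failed or succeeded would be a separate, user-configurable decision, as mentioned above.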

@zuston
Member Author

zuston commented Oct 15, 2021

Gentle ping @oliverhu

@oliverhu
Member

I think a tony.x.timeout setting is more generic and easier to understand. People can specify, for example, tony.evaluator.timeout=100s.
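
As a rough illustration of what per-type keys could look like on the parsing side, here is a sketch only. The key pattern follows the tony.evaluator.timeout=100s example above; the function names and the s/m/h duration suffixes are assumptions, not existing TonY behavior.

```python
import re
from typing import Dict

# Assumed duration suffixes; only "100s"-style values appear in the discussion.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION = re.compile(r"^(\d+)([smh])$")
# Matches keys of the form tony.<jobtype>.timeout and captures <jobtype>.
_TIMEOUT_KEY = re.compile(r"^tony\.([A-Za-z0-9_]+)\.timeout$")


def parse_duration(text: str) -> int:
    """Convert a string such as '100s' or '10h' to seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def per_type_timeouts(conf: Dict[str, str]) -> Dict[str, int]:
    """Collect timeouts in seconds, keyed by job type, from a flat config map."""
    timeouts = {}
    for key, value in conf.items():
        match = _TIMEOUT_KEY.match(key)
        if match:
            timeouts[match.group(1)] = parse_duration(value)
    return timeouts
```

Any key that is not shaped like tony.<jobtype>.timeout is ignored, so other settings can live in the same map. A real implementation would also need to tell job-type names apart from other key segments.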

@oliverhu
Member

We discussed offline having generic task groups and specifying dependencies between them,
e.g. tony.groupA.timeout.aftergroupB = 10h.
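
A sketch of how such a dependency rule might be read and evaluated, under the assumption that the key means "time out groupA once 10h have passed since groupB finished". The DependentTimeout type, both function names, and the duration suffixes are hypothetical; nothing here is taken from the TonY code base.

```python
import re
from typing import Dict, NamedTuple, Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}   # assumed suffixes
_DURATION = re.compile(r"^(\d+)([smh])$")
# Matches tony.<group>.timeout.after<otherGroup>
_DEPENDENT_KEY = re.compile(r"^tony\.(\w+)\.timeout\.after(\w+)$")


class DependentTimeout(NamedTuple):
    group: str        # group that gets timed out
    after_group: str  # group whose completion starts the clock
    seconds: int


def parse_dependent_timeout(key: str, value: str) -> Optional[DependentTimeout]:
    """Return a rule for keys like tony.groupA.timeout.aftergroupB, else None."""
    key_match = _DEPENDENT_KEY.match(key.strip())
    value_match = _DURATION.match(value.strip())
    if key_match is None or value_match is None:
        return None
    seconds = int(value_match.group(1)) * _UNIT_SECONDS[value_match.group(2)]
    return DependentTimeout(key_match.group(1), key_match.group(2), seconds)


def is_expired(rule: DependentTimeout, finished_at: Dict[str, float], now: float) -> bool:
    """True once `rule.seconds` have elapsed since `rule.after_group` finished."""
    done = finished_at.get(rule.after_group)
    return done is not None and now - done >= rule.seconds
```

Until the group named after "after" has finished, the rule never fires, which matches the chief/evaluator case discussed earlier in this thread.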

@zuston
Member Author

zuston commented Nov 25, 2021

Closing this. Solved by #621.

@zuston zuston closed this Nov 25, 2021