Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Population based training #1833

Merged
merged 23 commits into from
Jun 21, 2022
Merged

Population based training #1833

merged 23 commits into from
Jun 21, 2022

Conversation

a9p
Copy link
Contributor

@a9p a9p commented Mar 8, 2022

What this PR does / why we need it:

Support the discovery of modulated hyperparameters rather than attempting to find a fixed set over the entire training process. The paper has more details about the technique.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

This PR provides some initial support for PBT within Katib (#1382).

Checklist:

  • Docs included if any changes are user facing

@aws-kf-ci-bot
Copy link
Contributor

Hi @a9p. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@a9p
Copy link
Contributor Author

a9p commented Mar 8, 2022

Comments and open questions:

  • The learned hyperparameter schedule is not accessible in the experiment object, only the last generation hyperparameters (as best trial).
  • The learned hyperparameter schedule is not presented in the UI. A new view may be necessary; or if there is a way to increment the trial generation field, we could create a plot in the frontend based on that. Suggestions?
  • Is it possible to/should the final learned model be added to the artifact store from the suggestion service on shutdown?

/cc @andreyvelich

@coveralls
Copy link

coveralls commented Mar 8, 2022

Coverage Status

Coverage decreased (-8.2%) to 72.857% when pulling d88187d on a9p:pbt into f7261de on kubeflow:master.

@johnugeorge
Copy link
Member

/cc @richardsliu

Copy link
Member

@Arhell Arhell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@a9p
Copy link
Contributor Author

a9p commented Apr 7, 2022

/retest

@johnugeorge
Copy link
Member

Please comment when PR is complete to be reviewed

@a9p
Copy link
Contributor Author

a9p commented Apr 27, 2022

/retest (@johnugeorge fyi)

@aws-kf-ci-bot
Copy link
Contributor

@a9p: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test kubeflow-katib-presubmit

Use /test all to run all jobs.

In response to this:

/retest (@johnugeorge fyi)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@a9p
Copy link
Contributor Author

a9p commented May 4, 2022

/test all

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this great contribution @a9p, sorry for the late reply.
I left my first comments.

You can also create PR, similar to this: kubeflow/testing#970.
We should add PBT Suggestion and Trial images to our CI.

/assign @gaocegege @johnugeorge @tenzen-y @anencore94

examples/v1beta1/trial-images/simple-pbt/Dockerfile Outdated Show resolved Hide resolved
examples/v1beta1/trial-images/simple-pbt/pbt_test.py Outdated Show resolved Hide resolved
operators/katib-controller/src/pbtTemplate.yaml Outdated Show resolved Hide resolved
pkg/suggestion/v1beta1/pbt/service.py Outdated Show resolved Hide resolved
pkg/suggestion/v1beta1/pbt/service.py Outdated Show resolved Hide resolved
pkg/suggestion/v1beta1/pbt/service.py Outdated Show resolved Hide resolved
pkg/suggestion/v1beta1/pbt/service.py Outdated Show resolved Hide resolved
pkg/suggestion/v1beta1/pbt/service.py Outdated Show resolved Hide resolved
@tenzen-y
Copy link
Member

@kubeflow/wg-automl-leads Can you start GH Actions?

@a9p
Copy link
Contributor Author

a9p commented Jun 17, 2022

It looks like no trials kicked off again, but this time due to a pull failure for the simple-pbt image. Did I miss something in the .github workflow?

@tenzen-y
Copy link
Member

It looks like no trials kicked off again, but this time due to a pull failure for the simple-pbt image. Did I miss something in the .github workflow?

Thanks for checking CI. @a9p
I guess that you need to add run "suggestion-pbt" "$CMD_PREFIX/suggestion/pbt/$VERSION/Dockerfile" to the following part.

run "suggestion-hyperopt" "$CMD_PREFIX/suggestion/hyperopt/$VERSION/Dockerfile"
run "suggestion-chocolate" "$CMD_PREFIX/suggestion/chocolate/$VERSION/Dockerfile"
run "suggestion-hyperband" "$CMD_PREFIX/suggestion/hyperband/$VERSION/Dockerfile"
run "suggestion-skopt" "$CMD_PREFIX/suggestion/skopt/$VERSION/Dockerfile"
run "suggestion-goptuna" "$CMD_PREFIX/suggestion/goptuna/$VERSION/Dockerfile"
run "suggestion-optuna" "$CMD_PREFIX/suggestion/optuna/$VERSION/Dockerfile"
run "suggestion-enas" "$CMD_PREFIX/suggestion/nas/enas/$VERSION/Dockerfile"
run "suggestion-darts" "$CMD_PREFIX/suggestion/nas/darts/$VERSION/Dockerfile"

@a9p
Copy link
Contributor Author

a9p commented Jun 19, 2022

@johnugeorge I think the last approved run failed due to a pull policy that I had in simple-pbt. I checked against the other hp jobs and they should be in parity now (no additional fields)!

@johnugeorge
Copy link
Member

@a9p Tests didn't complete yet. Does it succeed when you tried locally?

…mages.sh update yaml function (causing failing gha).
@a9p
Copy link
Contributor Author

a9p commented Jun 21, 2022

@johnugeorge it did work for me, but it seems like I had images that satisfied the docker image pull locally. I setup a new test vm -- the main issue that has been causing the ci failure was a missing entry for simple-pbt in update-images.sh which changes the image tag from latest to e2e-test for the tests.

@johnugeorge
Copy link
Member

Thanks @a9p for hard work and patience. We are ready to merge.

One minor addition request. Can you update https://github.com/kubeflow/katib#search-algorithms to add PBT as well in the README table?

In a separate PR, can you also add a new section https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail as well in https://github.com/kubeflow/website ? This will help drive users to try PBT.

@johnugeorge
Copy link
Member

@tenzen-y Can you give a LGTM?

@a9p we can merge this now. Can you create a website PR as well to update the docs pointing to the provided link?

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for implementing this excellent feature! @a9p
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jun 21, 2022
@tenzen-y
Copy link
Member

tenzen-y commented Jun 21, 2022

@a9p we can merge this now. Can you create a website PR as well to update the docs pointing to the provided link?

@a9p
FYI https://github.com/kubeflow/website

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this amazing contribution @a9p!
I am excited to see the outcome from our users.

/lgtm

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a9p, andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 04ac975 into kubeflow:master Jun 21, 2022
@a9p
Copy link
Contributor Author

a9p commented Jun 21, 2022

Thank you all for your guidance and patience with getting this merged in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants