
Add nightly build #1503
Merged
10 commits merged into staging from laserprec/nightly-ci on Aug 31, 2021
Conversation

laserprec (Contributor) commented Aug 19, 2021

Description

To improve the developer experience of working with this repo, we will begin migrating our DevOps pipelines to GitHub Actions, leveraging popular DevOps tools like tox and flake8. This will be a sequence of updates to our DevOps infrastructure:

  1. Propose and draft the CI pipeline in GitHub Actions
  2. Set up self-hosted machines to run our GPU workloads
  3. Create feature parity with the existing CI pipelines (pr-gates & nightly-build) <---- (Current PR)
  4. Run tests on the appropriate dependency subsets
  5. Optimize build time for unit tests
  6. Enforce flake8 (coding style checks) on the build and clean up coding styles to pass flake8 checks
  7. Deprecate the CI pipelines in ADO and switch to GitHub Actions

In this PR:

Add a nightly build pipeline that runs the full test suites: unit, integration, and smoke tests.

Related Issues

#1498

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

Comment on lines 82 to 89
      test-kind: ['unit', 'smoke']
      # pytest markers configured in tox.ini. See https://docs.pytest.org/en/6.2.x/example/markers.html
      test-marker: ['not spark and not gpu']
      include:
        # Integration tests needs a more powermachine with more memory. GitHub-hosted agents only have 7GB memory.
        - os: self-hosted
          python: 3.6
          test-kind: 'integration'
Collaborator
One question about this: is this pipeline running unit, smoke, and integration tests?

laserprec (Contributor Author) Aug 23, 2021

Yup! I know this extra include clause looks unusual. At the moment, the integration tests cannot run on the smaller SKUs (Standard_DS2_v2: 7GB) provided by GitHub because there isn't enough memory to run the training on the full dataset.

Ideally, all our tests shouldn't be that compute intensive, and they should be able to run on the default SKUs provided by the CI (unless we need to test for compatibility with GPU-enabled machines). Thus the matrix should ideally look like:

   test-kind: ['unit', 'smoke', 'integration']
   test-marker: [....]
   ....

I think we can have a discussion about the objectives of the different unit/smoke/integration tests and whether doing training on the full dataset achieves those goals. I added a new issue for the discussion: #1507.
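
For reference, here is a minimal sketch of how such a matrix with an include override can look in a GitHub Actions workflow (the ubuntu-latest label and the exact layout are assumptions on my part, not the literal contents of nightly.yml):

    strategy:
      matrix:
        os: [ubuntu-latest]          # assumed GitHub-hosted runner label
        python: [3.6]
        test-kind: ['unit', 'smoke']
        # pytest markers configured in tox.ini
        test-marker: ['not spark and not gpu']
        include:
          # extra combination: routes the memory-hungry integration tests to the self-hosted runner
          - os: self-hosted
            python: 3.6
            test-kind: 'integration'
            test-marker: 'not spark and not gpu'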

Collaborator

Sorry, I don't understand: you mentioned that this pipeline runs unit, smoke, and integration tests, but later that the integration tests cannot run on the smaller SKUs. Some questions:

  1. Where are the integration tests running?
  2. Is there a UI where I can see an example of how the tests are running?
  3. There are 3 test kinds running for CPU (unit, smoke, and integration); are they running in parallel or sequentially?
  4. Do we know if GitHub has plans to support bigger SKUs?
  5. Related to 4, Azure Pipelines also has SKUs for testing; are they the same as GitHub's, or are they bigger? (If they are bigger, then GitHub might catch up eventually.)

laserprec (Contributor Author) Aug 24, 2021

No worries!

  1. The integration tests are currently running on the self-hosted GPU agent. This is configured with the self-hosted keyword:

       include:
         - os: self-hosted

  2. Please find all of the GitHub Actions runs here, especially the pr-gate and nightly pipelines.
  3. Currently all of them run in parallel. However, because we have set up only 2 self-hosted machines while there are about 5 nightly steps that need to run on self-hosted machines, we haven't scaled up the resources enough to run them in true parallel.
  4. As far as I know, there isn't a plan. Maybe the SKU offerings are different with GitHub Enterprise, but I haven't gotten much info on that.
  5. Azure Pipelines and GitHub Actions both use the same SKU: Standard_DS2_v2.

I hope that helps.
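
As a small sketch of the routing piece only (assumed job name; the strategy.matrix with an 'os' dimension is defined as in the fragments earlier in this thread), the job simply takes its runner from the matrix, so the include entry lands on the self-hosted GPU agent while the other combinations stay on GitHub-hosted runners:

    jobs:
      test:
        # 'ubuntu-latest' entries run on GitHub-hosted agents;
        # the 'self-hosted' entry from the include clause runs on our own GPU machine
        runs-on: ${{ matrix.os }}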

Collaborator

Got it, thanks for the clarification.

  1. Would you mind explaining in the file that some tests are run on the hosted machines and some on the self-hosted one?
  2. For nightly, in the current pipeline we are running only smoke and integration tests, see https://github.com/microsoft/recommenders/blob/main/tests/ci/azure_pipeline_test/dsvm_nightly_linux_cpu.yml#L30, and these two are sequential. The reason they are sequential is that the smoke tests are just a small version of the integration tests. Is it possible to replicate this behavior in GitHub?

We are making so many changes to the tests that I think we also need to review the test README; there might be info that is not up to date: https://github.com/microsoft/recommenders/blob/main/tests/README.md

laserprec (Contributor Author) Aug 24, 2021

Thanks for the great feedback!

  1. I already added a comment on these mixed runs for integration tests, with a link to a new issue (Tests runnable on default CI agent #1507) about whether we should optimize/trim these integration tests so that they fit on the hosted VMs:
       include:
         # Integration tests need a powerful machine with more memory. GitHub-hosted agents only have 7GB memory.
         # TODO: Optimization/refactoring should be done to ensure that these test can run in the default CI agent (#1507)
         - os: self-hosted

@miguelgfierro is there anywhere else you would like me to comment?

  2. If I understand correctly, in the existing pipeline the smoke tests run before the integration tests so we can "fail fast" and avoid running the full integration tests when they are destined to fail. In a similar manner, GitHub Actions has the fail-fast keyword, which will cancel running jobs in the matrix if any of them fails (see the sketch below). If this isn't what you are after and we need to run both of them sequentially, I think we would need to refactor the CI structure, because it is designed to run everything in parallel as much as possible.
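
As a rough sketch (assumed values, not necessarily the exact configuration in this PR), fail-fast is set on the matrix strategy, and GitHub Actions then cancels the remaining in-progress jobs of the matrix as soon as one of them fails:

    strategy:
      # when fail-fast is true (the default), GitHub Actions cancels the remaining
      # in-progress jobs of this matrix as soon as any job in it fails
      fail-fast: true
      matrix:
        test-kind: ['unit', 'smoke', 'integration']

If we truly need the smoke tests to finish before the integration tests start, the usual alternative would be separate jobs chained with needs:, but that departs from the parallel matrix layout used here.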

We are making so many changes to the tests that I think we also need to review the test README

This was a great document when I was onboarding :). All of the CI changes are compatible with what's written in the document in terms of test execution. One thing we can change is that we no longer need the --durations 0 flag from, for example, pytest tests/integration -m "integration and not spark and not gpu" --durations 0, because these flags are stored in tox.ini and pytest will pick them up from tox.ini without you specifying them. Of course, it doesn't hurt to repeat yourself and add them when running. 😃

Collaborator

@laserprec

@miguelgfierro is there anywhere else you would like me to comment?

For new users, particularly when the pipeline is complex like this one, it saves them time to have a short summary at the top of the file. Here is an example: https://github.com/microsoft/recommenders/blob/main/tests/ci/azure_artifact_feed.yaml

miguelgfierro (Collaborator)

@laserprec do you have the time savings between the old pipeline and this one?

laserprec (Contributor Author) commented Aug 25, 2021

@laserprec do you have the time savings between the old pipeline and this one?

@miguelgfierro, please take the numbers with a grain of salt; it's not an apples-to-apples comparison.

The current nightly build on ADO takes about 4 hr 50 min to run the 3 builds and the smoke & integration tests sequentially.

Link to pipeline runs

This new nightly build on GitHub Actions takes about 4 hr 40 min to run the 3 builds and the full test suites (unit/smoke/integration) partially in parallel.

Link to pipeline run

As you can see, the GitHub Actions run is bottlenecked by its longest-running job, the GPU integration tests (4 hr 40 min), which is considerably longer than the same run in ADO. This is because we are not using the same Azure SKUs to run them: on ADO we are using NC6s_v2 (which has a P100 GPU, but we no longer have any quota to allocate more), and in GitHub we are using NC12_promo (which has two P40 GPUs).

So at this point we don't have much immediate gain, but with the new CI in GitHub Actions we are on better ground to scale up the infrastructure for distributed loads when we have more environments to support, and we get a centralized interface to interpret the runs and optimize the individual jobs in the next step.

miguelgfierro (Collaborator)

On ADO we are using NC6s_v2 (which has a P100 GPU, but we no longer have any quota to allocate more), and in GitHub we are using NC12_promo (which has two P40 GPUs).

I can try to fix that and get you more quota; for the comparison we should have the same SKU. Please ping me privately and I'll try to unblock you.

Another point I see is that, as you said, this is not apples-to-apples: in the old ADO pipeline we are not running the unit tests. Do we want to have the unit tests in the nightly build?

The unit tests for Spark take 30 min, whereas in the ADO pipeline they take around 14; see this and this. Do you know why?

I also noticed that this PR is triggering the nightly builds with GitHub; they shouldn't be triggered in the PR but scheduled. Can you review?

One way to scale would be to use AzureML VMs to execute the tests; this was an initiative that was tried in the past, but we didn't put it in production. We have some code here that might help you understand what we did. Can you estimate how much work and how difficult it would be to have AzureML VMs instead of a self-hosted VM on GitHub?

laserprec (Contributor Author)

Do we want to have the unit tests in the nightly build?

@miguelgfierro I think it's typical to run the full test suites in a nightly build; this way we can get the full code coverage report. Relatively speaking, the time to run the full unit tests is probably a small fraction of that of the integration tests. I don't think it hurts to run them 😄.

The unit tests for Spark take 30 min, whereas in the ADO pipeline they take around 14; see this and this. Do you know why?

This is because the unit tests are running on the default SKU, with only 7GB of memory. They took longer because we were running with less compute.

I also noticed that this PR is triggering the nightly builds with GitHub; they shouldn't be triggered in the PR but scheduled. Can you review?

Yup, the manual trigger will not be registered in the UI until these changes are checked into the main branch. I am just adding a custom PR trigger to make sure the nightly build runs; now that we have confirmed it runs fine, I can remove it (todo). A sketch of the intended triggers is below.
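
For illustration only (the cron time and branch name are assumptions, not the exact contents of nightly.yml), the intended shape of the triggers is a schedule plus a manual trigger, with the temporary PR trigger removed once validated:

    on:
      schedule:
        - cron: '0 0 * * *'       # assumed nightly time, not necessarily the one in the file
      workflow_dispatch: {}       # manual trigger; appears in the UI once the file is on the default branch
      # pull_request:             # temporary trigger used to validate this PR; to be removed
      #   branches: [staging]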

We have some code here that might help you understand what we did. Can you estimate how much work and how difficult it would be to have AzureML VMs instead of a self-hosted VM on GitHub?

I think that's a good idea for managed compute pools for model training. At the moment, I am not aware of any good integration plugins available for AzureML and GitHub Actions. I think this is not a trivial change in terms of secret management, the complexity introduced by involving AML in the pipeline, etc. The gain, I assume, is relieving the team from managing the self-hosted VMs and potentially introducing AML as the training scenario in our NB? I think we can investigate this direction. @miguelgfierro, what would be the best way to backlog this item for us to come back to?

miguelgfierro (Collaborator)

Got it, makes sense; really good suggestions and thought process.

I think we can investigate this direction. @miguelgfierro, what would be the best way to backlog this item for us to come back to?

We have an open issue about this, #995; we can either reuse that one or create a new one and close this, whichever you prefer.

laserprec (Contributor Author)

We have an open issue about this, #995; we can either reuse that one or create a new one and close this, whichever you prefer.

Awesome! We can reuse that issue, and I have already assigned it to myself. I think we can look into this after we have finished the migration. Btw, @miguelgfierro, is there anything else I haven't addressed? I would love to get your approval before merging this PR 😄.

miguelgfierro (Collaborator)

Btw, @miguelgfierro, is there anything else I haven't addressed? I would love to get your approval before merging this PR 😄.

There are still some questions that it would be good to have clarity on:

laserprec (Contributor Author) commented Aug 27, 2021

@miguelgfierro

Clarity on the naming, see Add nightly build #1503 (comment)

I hope I gave a better explanation :p. If not, I can hop on a call.

Trying 3.7, see Add nightly build #1503 (comment)

I feel like support for 3.7 can be its own PR; I'm trying to keep this PR modular. I have created a new issue for this: #1511. I can create a shadow CI job to try the build on 3.7 (which would not fail the build when related jobs fail) in upcoming changes; one possible way to express that is sketched below.
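
As an assumption on my part (continue-on-error and the experimental flag are not mentioned in this thread), one way to model such a non-blocking shadow job in GitHub Actions is to mark the 3.7 matrix entry as allowed to fail:

    jobs:
      build:
        runs-on: ubuntu-latest
        # a failing "experimental" (3.7) job will not mark the whole build as failed
        continue-on-error: ${{ matrix.experimental }}
        strategy:
          matrix:
            python: [3.6]
            experimental: [false]     # hypothetical flag used only to drive continue-on-error
            include:
              - python: 3.7
                experimental: true
        steps:
          - run: echo "Build with Python ${{ matrix.python }}"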

Short summary at the top to help users understand the pipeline quickly, see Add nightly build #1503 (comment)

I updated the comments with more details in commit 97faf6341d. Let me know if we need more context.

@codecov-commenter

Codecov Report

Merging #1503 (97faf63) into staging (3e23511) will decrease coverage by 0.00%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           staging    #1503      +/-   ##
===========================================
- Coverage    61.96%   61.95%   -0.01%     
===========================================
  Files           84       84              
  Lines         8368     8369       +1     
===========================================
  Hits          5185     5185              
- Misses        3183     3184       +1     
Impacted Files                             Coverage Δ
recommenders/models/vae/standard_vae.py    0.00% <0.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e23511...97faf63.

miguelgfierro (Collaborator) left a comment

This is a really good job, @laserprec.

laserprec merged commit c5029af into staging on Aug 31, 2021
laserprec linked an issue on Aug 31, 2021 that may be closed by this pull request
miguelgfierro deleted the laserprec/nightly-ci branch on September 1, 2021 at 08:43
miguelgfierro restored the laserprec/nightly-ci branch on September 1, 2021 at 08:43
miguelgfierro deleted the laserprec/nightly-ci branch on September 1, 2021 at 08:43
Successfully merging this pull request may close these issues: Add nightly build.
5 participants