
run full TPU pytests #2560

Closed
wants to merge 6 commits into from

Conversation

Borda
Member

@Borda Borda commented Jul 8, 2020

What does this PR do?

At this moment we execute just a single test file, and any TPU test living elsewhere has to be added manually, so we would rather run pytest on the whole package so that we do not accidentally (in the future) forget to update this separate test case...
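The motivation can be sketched with a throwaway example (all file names here are illustrative assumptions, not the actual Lightning test tree): pointing pytest at a directory collects every test module under it, so a test added in a new file cannot be missed the way it can when CI pins a single file.

```python
# Minimal sketch, assuming only that pytest is installed; the file names
# below are made up for illustration and are not the real Lightning layout.
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

suite = Path(tempfile.mkdtemp())

# The test file that CI originally pinned.
(suite / "test_first.py").write_text(textwrap.dedent("""\
    def test_first():
        assert 1 + 1 == 2
"""))

# A test file added later; pinning test_first.py alone would never run it.
(suite / "test_added_later.py").write_text(textwrap.dedent("""\
    def test_added_later():
        assert "tpu".upper() == "TPU"
"""))

# Running pytest on the whole directory discovers both files automatically.
result = subprocess.run(
    [sys.executable, "-m", "pytest", "-q", str(suite)],
    capture_output=True, text=True,
)
print(result.stdout)  # reports 2 passed
```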

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added ci Continuous Integration accelerator: tpu Tensor Processing Unit labels Jul 8, 2020
@Borda Borda added this to the 0.8.x milestone Jul 8, 2020
@mergify mergify bot requested a review from a team July 8, 2020 22:53
@codecov

codecov bot commented Jul 8, 2020

Codecov Report

Merging #2560 (df9ba0b) into master (3c86193) will increase coverage by 2%.
The diff coverage is n/a.

❗ Current head df9ba0b differs from pull request most recent head 082f639. Consider uploading reports for the commit 082f639 to get more accurate results

@@           Coverage Diff           @@
##           master   #2560    +/-   ##
=======================================
+ Coverage      91%     93%    +2%     
=======================================
  Files         192     159    -33     
  Lines       12231   11380   -851     
=======================================
- Hits        11184   10602   -582     
+ Misses       1047     778   -269     

@Borda
Member Author

Borda commented Jul 9, 2020

some TPU tests are failing:

terminate called after throwing an instance of 'std::runtime_error'
  what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1106 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Not found: From /job:tpu_worker/replica:0/task:0:

XRT memory handle not found: 3824064254680872
	 [[{{node XRTReleaseAllocationHandle}}]] vs. OK)
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
	xla::XrtComputationClient::HandleReleaser()
	xla::util::TriggeredTask::Runner()

@zcain117 any idea what it could be?
reported in pytorch/xla#2338

@Borda Borda marked this pull request as ready for review July 9, 2020 07:52
@Borda
Member Author

Borda commented Jul 9, 2020

running the same tests in Colab
https://colab.research.google.com/drive/1Gr1Wg4zVnu15WHE_-dU2YKr4Z5xsy-fL#scrollTo=Mx61q3X5bwoW

FAILED tests/callbacks/test_model_checkpoint.py::test_model_checkpoint_no_extraneous_invocations
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TensorBoardLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[CometLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[NeptuneLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TestTubeLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[WandbLogger]
FAILED tests/loggers/test_wandb.py::FLAKE8
FAILED tests/models/test_cpu.py::test_multi_cpu_model_ddp - Exception: proces...
FAILED tests/models/test_horovod.py::test_horovod_cpu - assert 1 == 0
FAILED tests/models/test_horovod.py::test_horovod_cpu_implicit - assert 1 == 0
FAILED tests/models/test_horovod.py::test_horovod_multi_optimizer - pytorch_l...
FAILED tests/models/test_tpu.py::test_base_tpu_model[1] - SystemExit: 17
FAILED tests/models/test_tpu.py::test_base_tpu_model[tpu_cores1] - RuntimeErr...
FAILED tests/models/test_tpu.py::test_base_tpu_model[8] - Exception: 
FAILED tests/models/test_tpu.py::test_base_tpu_16bit_model[1] - SystemExit: 17
FAILED tests/models/test_tpu.py::test_base_tpu_16bit_model[tpu_cores1] - Runt...
FAILED tests/models/test_tpu.py::test_base_tpu_16bit_model[8] - Exception: 
FAILED tests/models/test_tpu.py::test_early_stop_checkpoints_on_tpu[tpu_cores0-xla:1]
FAILED tests/models/test_tpu.py::test_early_stop_checkpoints_on_tpu[tpu_cores1-xla:8]
FAILED tests/models/test_tpu.py::test_single_tpu_core_model[tpu_cores0-xla:1]
FAILED tests/models/test_tpu.py::test_single_tpu_core_model[tpu_cores1-xla:8]
FAILED tests/models/test_tpu.py::test_multi_core_tpu_model[1] - SystemExit: 17
FAILED tests/models/test_tpu.py::test_multi_core_tpu_model[8] - Exception: 
FAILED tests/models/test_tpu.py::test_dataloaders_passed_to_fit - Exception: 
FAILED tests/models/data/horovod/train_default_model.py::FLAKE8
==== 25 failed, 591 passed, 59 skipped, 498 warnings in 2989.65s (0:49:49) =====

TPU started failing after #2512

@Borda Borda changed the title run full TPU pytests [blocked by #2432] run full TPU pytests Jul 27, 2020
@Borda Borda changed the title [blocked by #2432] run full TPU pytests run full TPU pytests Jul 28, 2020
@Borda
Member Author

Borda commented Jul 29, 2020

the only limitation is that the TPU job would take longer to finish, and even then it is not the longest of all the CI jobs...

@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from e27035f to b838141 Compare August 2, 2020 20:41
@Borda
Member Author

Borda commented Aug 2, 2020

@zcain117 any idea why even the TPU tests are failing?

@Borda Borda modified the milestones: 0.8.x, 0.9.0 Aug 6, 2020
@awaelchli awaelchli modified the milestones: 0.9.0, 1.0.0 Aug 8, 2020
@edenlightning edenlightning modified the milestones: 1.0.0, 0.9.x Sep 1, 2020
@Borda Borda changed the title run full TPU pytests [blocked by #3024] run full TPU pytests Sep 29, 2020
@mergify
Contributor

mergify bot commented Sep 30, 2020

This pull request is now in conflict... :(

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0 Oct 4, 2020
@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from 76d4b5e to d69a05d Compare February 8, 2021 12:32
@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from 9f3c531 to 9b8b5b2 Compare February 9, 2021 08:47
@edenlightning edenlightning removed this from the 1.2 milestone Feb 9, 2021
Base automatically changed from release/1.2-dev to master February 11, 2021 14:31
@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from 7f44723 to 2ce9e21 Compare February 19, 2021 20:21
@Borda
Member Author

Borda commented Feb 19, 2021

=========================== short test summary info ============================
FAILED tests/callbacks/test_pruning.py::test_pruning_callback_ddp_cpu - Excep...
FAILED tests/callbacks/test_swa.py::test_swa_callback_ddp_cpu - Exception: pr...
FAILED tests/checkpointing/test_model_checkpoint.py::test_model_checkpoint_no_extraneous_invocations
FAILED tests/checkpointing/test_torch_saving.py::test_model_torch_save_ddp_cpu
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[MLFlowLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[NeptuneLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TensorBoardLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TestTubeLogger]
FAILED tests/models/test_cpu.py::test_multi_cpu_model_ddp - Exception: proces...
FAILED tests/models/test_horovod.py::test_horovod_cpu - assert 127 == 0
FAILED tests/models/test_horovod.py::test_horovod_cpu_implicit - assert 127 == 0
FAILED tests/models/test_horovod.py::test_horovod_multi_optimizer - pytorch_l...
FAILED tests/models/test_tpu.py::test_model_tpu_cores_1 - AssertionError: exp...
FAILED tests/models/test_tpu.py::test_model_tpu_index[1] - AssertionError: ex...
FAILED tests/models/test_tpu.py::test_model_tpu_index[5] - AssertionError: ex...
FAILED tests/models/test_tpu.py::test_model_16bit_tpu_cores_1 - AssertionErro...
FAILED tests/models/test_tpu.py::test_model_16bit_tpu_index[1] - AssertionErr...
FAILED tests/models/test_tpu.py::test_model_16bit_tpu_index[5] - AssertionErr...
FAILED tests/models/test_tpu.py::test_model_tpu_early_stop - AssertionError: ...
FAILED tests/models/test_tpu.py::test_tpu_grad_norm - AssertionError: expecte...
FAILED tests/trainer/test_trainer.py::test_pytorch_profiler_nested - Assertio...
FAILED tests/trainer/logging_/test_distributed_logging.py::test_global_zero_only_logging_ddp_cpu
FAILED tests/trainer/properties/test_get_model.py::test_get_model_ddp_cpu - E...
FAILED tests/utilities/test_all_gather_grad.py::test_all_gather_ddp - Excepti...
= 24 failed, 3369 passed, 477 skipped, 3 xfailed, 2959 warnings in 1559.35s (0:25:59) =

@Borda
Member Author

Borda commented Feb 24, 2021

it would be nice to have it also with #6078 🐰

Contributor

@tchaton tchaton left a comment


Not sure we should have all our tests on TPU as it takes 30 min.

@carmocca
Contributor

carmocca commented Feb 26, 2021

I agree with Thomas. I would try to avoid adding extra test time to TPUs since they are a bottleneck in our CI pipeline and we get random kubernetes failures from time to time.

If what we want to avoid is people forgetting to add their test to a tpu_test.sh file, we could define the tests which require a TPU with a marker and instruct pytest to run only those tests in the TPU CI. See:

https://docs.pytest.org/en/stable/example/markers.html
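The marker approach suggested above can be sketched as follows; the marker name `tpu` and all file contents are assumptions for illustration, not the actual Lightning configuration. Registering the marker in `pytest.ini` avoids the unknown-marker warning, and `-m tpu` deselects everything else.

```python
# Sketch of marker-based test selection, assuming only that pytest is
# installed; the "tpu" marker name and file contents are illustrative.
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

suite = Path(tempfile.mkdtemp())

# Register the custom marker so pytest does not warn about it.
(suite / "pytest.ini").write_text(textwrap.dedent("""\
    [pytest]
    markers =
        tpu: mark test as requiring a TPU
"""))

(suite / "test_demo.py").write_text(textwrap.dedent("""\
    import pytest

    @pytest.mark.tpu
    def test_needs_tpu():
        assert True  # a real test would exercise TPU code paths

    def test_cpu_only():
        assert True
"""))

# "-m tpu" collects only the marked test, so the TPU CI could run
# `pytest -m tpu tests/` over the whole package without a curated file list.
result = subprocess.run(
    [sys.executable, "-m", "pytest", "-q", "-m", "tpu", str(suite)],
    capture_output=True, text=True,
)
print(result.stdout)  # reports 1 passed, 1 deselected
```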

@Borda
Member Author

Borda commented Feb 26, 2021

I agree with Thomas. I would try to avoid adding extra test time to TPUs since they are a bottleneck in our CI pipeline and we get random kubernetes failures from time to time.

I see your point about time, but what you are saying is that we do not care if TPU works except for some selected cases...

@carmocca
Contributor

we do not care if TPU works except for some selected cases...

Can the CPU tests fail if they are run in an environment with TPUs?

@Borda
Member Author

Borda commented Feb 26, 2021

we do not care if TPU works except for some selected cases...

Can the CPU tests fail if they are run in an environment with TPUs?

That is what the tests should tell you...

@Borda Borda removed the priority: 1 Medium priority task label Feb 27, 2021
@Borda Borda closed this May 26, 2021
@Borda Borda deleted the tests/full-pytest branch June 17, 2021 16:55
Labels
accelerator: tpu Tensor Processing Unit ci Continuous Integration help wanted Open to be worked on
6 participants