
run full TPU pytests #2560

Closed
wants to merge 6 commits into from

Conversation

Borda
Member

@Borda Borda commented Jul 8, 2020

What does this PR do?

At this moment we execute just a single test file, and any TPU test living elsewhere has to be added manually, so we would rather run pytest on the whole package so that we do not accidentally (in the future) forget to update this separate test case...
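The motivation can be sketched with a throwaway example (all file names here are illustrative assumptions, not the actual Lightning test tree): pointing pytest at a directory collects every test module under it, so a test added in a new file cannot be missed the way it can when CI pins a single file.

```python
# Minimal sketch, assuming only that pytest is installed; the file names
# below are made up for illustration and are not the real Lightning layout.
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

suite = Path(tempfile.mkdtemp())

# The test file that CI originally pinned.
(suite / "test_first.py").write_text(textwrap.dedent("""\
    def test_first():
        assert 1 + 1 == 2
"""))

# A test file added later; pinning test_first.py alone would never run it.
(suite / "test_added_later.py").write_text(textwrap.dedent("""\
    def test_added_later():
        assert "tpu".upper() == "TPU"
"""))

# Running pytest on the whole directory discovers both files automatically.
result = subprocess.run(
    [sys.executable, "-m", "pytest", "-q", str(suite)],
    capture_output=True, text=True,
)
print(result.stdout)  # reports 2 passed
```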

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added ci Continuous Integration accelerator: tpu Tensor Processing Unit labels Jul 8, 2020
@Borda Borda added this to the 0.8.x milestone Jul 8, 2020
@mergify mergify bot requested a review from a team July 8, 2020 22:53
@codecov

codecov bot commented Jul 8, 2020

Codecov Report

Merging #2560 (df9ba0b) into master (3c86193) will increase coverage by 2%.
The diff coverage is n/a.

❗ Current head df9ba0b differs from pull request most recent head 082f639. Consider uploading reports for the commit 082f639 to get more accurate results

@@           Coverage Diff           @@
##           master   #2560    +/-   ##
=======================================
+ Coverage      91%     93%    +2%     
=======================================
  Files         192     159    -33     
  Lines       12231   11380   -851     
=======================================
- Hits        11184   10602   -582     
+ Misses       1047     778   -269     

@Borda
Member Author

Borda commented Jul 9, 2020

some TPU tests are failing:

terminate called after throwing an instance of 'std::runtime_error'
  what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1106 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Not found: From /job:tpu_worker/replica:0/task:0:

XRT memory handle not found: 3824064254680872
	 [[{{node XRTReleaseAllocationHandle}}]] vs. OK)
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
	xla::XrtComputationClient::HandleReleaser()
	xla::util::TriggeredTask::Runner()

@zcain117 any idea what it could be?
reported in pytorch/xla#2338

@Borda Borda marked this pull request as ready for review July 9, 2020 07:52
@Borda
Member Author

Borda commented Jul 9, 2020

running the same tests in Colab
https://colab.research.google.com/drive/1Gr1Wg4zVnu15WHE_-dU2YKr4Z5xsy-fL#scrollTo=Mx61q3X5bwoW

FAILED tests/callbacks/test_model_checkpoint.py::test_model_checkpoint_no_extraneous_invocations
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TensorBoardLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[CometLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[NeptuneLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TestTubeLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[WandbLogger]
FAILED tests/loggers/test_wandb.py::FLAKE8
FAILED tests/models/test_cpu.py::test_multi_cpu_model_ddp - Exception: proces...
FAILED tests/models/test_horovod.py::test_horovod_cpu - assert 1 == 0
FAILED tests/models/test_horovod.py::test_horovod_cpu_implicit - assert 1 == 0
FAILED tests/models/test_horovod.py::test_horovod_multi_optimizer - pytorch_l...
FAILED tests/models/test_tpu.py::test_base_tpu_model[1] - SystemExit: 17
FAILED tests/models/test_tpu.py::test_base_tpu_model[tpu_cores1] - RuntimeErr...
FAILED tests/models/test_tpu.py::test_base_tpu_model[8] - Exception: 
FAILED tests/models/test_tpu.py::test_base_tpu_16bit_model[1] - SystemExit: 17
FAILED tests/models/test_tpu.py::test_base_tpu_16bit_model[tpu_cores1] - Runt...
FAILED tests/models/test_tpu.py::test_base_tpu_16bit_model[8] - Exception: 
FAILED tests/models/test_tpu.py::test_early_stop_checkpoints_on_tpu[tpu_cores0-xla:1]
FAILED tests/models/test_tpu.py::test_early_stop_checkpoints_on_tpu[tpu_cores1-xla:8]
FAILED tests/models/test_tpu.py::test_single_tpu_core_model[tpu_cores0-xla:1]
FAILED tests/models/test_tpu.py::test_single_tpu_core_model[tpu_cores1-xla:8]
FAILED tests/models/test_tpu.py::test_multi_core_tpu_model[1] - SystemExit: 17
FAILED tests/models/test_tpu.py::test_multi_core_tpu_model[8] - Exception: 
FAILED tests/models/test_tpu.py::test_dataloaders_passed_to_fit - Exception: 
FAILED tests/models/data/horovod/train_default_model.py::FLAKE8
==== 25 failed, 591 passed, 59 skipped, 498 warnings in 2989.65s (0:49:49) =====

TPU started failing after #2512

@Borda Borda changed the title run full TPU pytests [blocked by #2432] run full TPU pytests Jul 27, 2020
@Borda Borda changed the title [blocked by #2432] run full TPU pytests run full TPU pytests Jul 28, 2020
@Borda
Member Author

Borda commented Jul 29, 2020

the only limitation is that the TPU job would take longer to finish, and even then it is not the longest of all the CI jobs...

@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from e27035f to b838141 Compare August 2, 2020 20:41
@Borda
Member Author

Borda commented Aug 2, 2020

@zcain117 any idea why even the TPU tests are failing?

@Borda Borda modified the milestones: 0.8.x, 0.9.0 Aug 6, 2020
@awaelchli awaelchli modified the milestones: 0.9.0, 1.0.0 Aug 8, 2020
@edenlightning edenlightning modified the milestones: 1.0.0, 0.9.x Sep 1, 2020
@Borda Borda changed the title run full TPU pytests [blocked by #3024] run full TPU pytests Sep 29, 2020
@mergify
Contributor

mergify bot commented Sep 30, 2020

This pull request is now in conflict... :(

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0 Oct 4, 2020
@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from 76d4b5e to d69a05d Compare February 8, 2021 12:32
@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from 9f3c531 to 9b8b5b2 Compare February 9, 2021 08:47
@edenlightning edenlightning removed this from the 1.2 milestone Feb 9, 2021
Base automatically changed from release/1.2-dev to master February 11, 2021 14:31
@Borda Borda force-pushed the tests/full-pytest branch 2 times, most recently from 7f44723 to 2ce9e21 Compare February 19, 2021 20:21
@Borda
Member Author

Borda commented Feb 19, 2021

=========================== short test summary info ============================
FAILED tests/callbacks/test_pruning.py::test_pruning_callback_ddp_cpu - Excep...
FAILED tests/callbacks/test_swa.py::test_swa_callback_ddp_cpu - Exception: pr...
FAILED tests/checkpointing/test_model_checkpoint.py::test_model_checkpoint_no_extraneous_invocations
FAILED tests/checkpointing/test_torch_saving.py::test_model_torch_save_ddp_cpu
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[MLFlowLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[NeptuneLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TensorBoardLogger]
FAILED tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[TestTubeLogger]
FAILED tests/models/test_cpu.py::test_multi_cpu_model_ddp - Exception: proces...
FAILED tests/models/test_horovod.py::test_horovod_cpu - assert 127 == 0
FAILED tests/models/test_horovod.py::test_horovod_cpu_implicit - assert 127 == 0
FAILED tests/models/test_horovod.py::test_horovod_multi_optimizer - pytorch_l...
FAILED tests/models/test_tpu.py::test_model_tpu_cores_1 - AssertionError: exp...
FAILED tests/models/test_tpu.py::test_model_tpu_index[1] - AssertionError: ex...
FAILED tests/models/test_tpu.py::test_model_tpu_index[5] - AssertionError: ex...
FAILED tests/models/test_tpu.py::test_model_16bit_tpu_cores_1 - AssertionErro...
FAILED tests/models/test_tpu.py::test_model_16bit_tpu_index[1] - AssertionErr...
FAILED tests/models/test_tpu.py::test_model_16bit_tpu_index[5] - AssertionErr...
FAILED tests/models/test_tpu.py::test_model_tpu_early_stop - AssertionError: ...
FAILED tests/models/test_tpu.py::test_tpu_grad_norm - AssertionError: expecte...
FAILED tests/trainer/test_trainer.py::test_pytorch_profiler_nested - Assertio...
FAILED tests/trainer/logging_/test_distributed_logging.py::test_global_zero_only_logging_ddp_cpu
FAILED tests/trainer/properties/test_get_model.py::test_get_model_ddp_cpu - E...
FAILED tests/utilities/test_all_gather_grad.py::test_all_gather_ddp - Excepti...
= 24 failed, 3369 passed, 477 skipped, 3 xfailed, 2959 warnings in 1559.35s (0:25:59) =

@Borda
Member Author

Borda commented Feb 24, 2021

it would be nice to have it also with #6078 🐰

Contributor

@tchaton tchaton left a comment


Not sure we should have all our tests on TPU as it takes 30 min.

@carmocca
Contributor

carmocca commented Feb 26, 2021

I agree with Thomas. I would try to avoid adding extra test time to TPUs since they are a bottleneck in our CI pipeline and we get random kubernetes failures from time to time.

If what we want to avoid is people forgetting to add their test to a tpu_test.sh file, we could define the tests which require a TPU with a marker and instruct pytest to run only those tests in the TPU CI. See:

https://docs.pytest.org/en/stable/example/markers.html
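The marker approach suggested above can be sketched as follows; the marker name `tpu` and all file contents are assumptions for illustration, not the actual Lightning configuration. Registering the marker in `pytest.ini` avoids the unknown-marker warning, and `-m tpu` deselects everything else.

```python
# Sketch of marker-based test selection, assuming only that pytest is
# installed; the "tpu" marker name and file contents are illustrative.
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

suite = Path(tempfile.mkdtemp())

# Register the custom marker so pytest does not warn about it.
(suite / "pytest.ini").write_text(textwrap.dedent("""\
    [pytest]
    markers =
        tpu: mark test as requiring a TPU
"""))

(suite / "test_demo.py").write_text(textwrap.dedent("""\
    import pytest

    @pytest.mark.tpu
    def test_needs_tpu():
        assert True  # a real test would exercise TPU code paths

    def test_cpu_only():
        assert True
"""))

# "-m tpu" collects only the marked test, so the TPU CI could run
# `pytest -m tpu tests/` over the whole package without a curated file list.
result = subprocess.run(
    [sys.executable, "-m", "pytest", "-q", "-m", "tpu", str(suite)],
    capture_output=True, text=True,
)
print(result.stdout)  # reports 1 passed, 1 deselected
```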

@Borda
Member Author

Borda commented Feb 26, 2021

I agree with Thomas. I would try to avoid adding extra test time to TPUs since they are a bottleneck in our CI pipeline and we get random kubernetes failures from time to time.

I see your point about time, but what you are saying is that we do not care if TPU works except for some selected cases...

@carmocca
Contributor

we do not care if TPU works except for some selected cases...

Can the CPU tests fail if they are run in an environment with TPUs?

@Borda
Member Author

Borda commented Feb 26, 2021

we do not care if TPU works except for some selected cases...

Can the CPU tests fail if they are run in an environment with TPUs?

That is what the tests should tell you...

@Borda Borda removed the priority: 1 Medium priority task label Feb 27, 2021
@Borda Borda closed this May 26, 2021
@Borda Borda deleted the tests/full-pytest branch June 17, 2021 16:55
Labels
accelerator: tpu Tensor Processing Unit ci Continuous Integration help wanted Open to be worked on
6 participants