This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Hanging flaky test test_operator.test_norm @ Python 3: GPU Win #11509

Open

marcoabreu opened this issue Jun 30, 2018 · 21 comments

Comments

@marcoabreu
Contributor

test_operator.test_norm causes hangs in the Python 3: GPU Win job.

http://jenkins.mxnet-ci.amazon-ml.com/view/Master%20Jobs/job/incubator-mxnet/job/master/1109/

test_operator.test_pad ... ok

test_operator.test_instance_normalization ... ok

test_operator.test_l2_normalization ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.

ok

test_operator.test_norm ... Sending interrupt signal to process

After 10s process did not stop
@marcoabreu
Contributor Author

On other occasions, it causes assertion errors:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1108/pipeline

======================================================================

FAIL: test_operator.test_norm

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/unittest/common.py", line 157, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/unittest/test_operator.py", line 3074, in test_norm

    check_numeric_gradient(norm_sym, [in_data], numeric_eps=epsilon, rtol=1e-2, atol=1e-3)

  File "/work/mxnet/python/mxnet/test_utils.py", line 914, in check_numeric_gradient

    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))

  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal

    raise AssertionError(msg)

AssertionError: 

Items are not equal:

Error 1.028974 exceeds tolerance rtol=0.010000, atol=0.001000.  Location of maximum error:(0, 1, 4, 5, 0, 0), a=-0.003815, b=-0.002757

 NUMERICAL_data: array([[[[[[ 0.01990795, -0.06532669,  0.00363588, ...,  0.01215935,

            -0.00745058, -0.00047684],

           [-0.00017881, -0.10442734,  0.01651049, ..., -0.03290176,...

 BACKWARD_data: array([[[[[[ 0.01993995, -0.06537328,  0.00367729, ...,  0.01226579,

            -0.0074518 , -0.0004907 ],

           [-0.00015076, -0.10442342,  0.01648286, ..., -0.03304606,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=615595105 to reproduce.

--------------------- >> end captured logging << ---------------------
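
For context, check_numeric_gradient compares a central finite-difference estimate of the gradient against the gradient produced by the backward pass, using assert_almost_equal with the given rtol/atol. A minimal, self-contained sketch of that comparison (plain NumPy, not the MXNet implementation; the norm function, shapes, and eps below are illustrative):

import numpy as np

def numeric_grad(f, x, eps=1e-3):
    # Central finite-difference estimate of d(sum(f(x)))/dx.
    grad = np.zeros_like(x)
    flat_x = x.reshape(-1)    # views into x and grad
    flat_g = grad.reshape(-1)
    for i in range(flat_x.size):
        orig = flat_x[i]
        flat_x[i] = orig + eps
        plus = f(x).sum()
        flat_x[i] = orig - eps
        minus = f(x).sum()
        flat_x[i] = orig
        flat_g[i] = (plus - minus) / (2 * eps)
    return grad

x = np.random.uniform(-1.0, 1.0, (3, 4))
l2_norm = lambda a: np.sqrt((a * a).sum(axis=1))            # L2 norm along axis 1
analytic = x / np.sqrt((x * x).sum(axis=1, keepdims=True))  # gradient of sum of row norms
np.testing.assert_allclose(numeric_grad(l2_norm, x), analytic, rtol=1e-2, atol=1e-3)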

@eric-haibin-lin
Member

@anirudhacharya

@anirudhacharya
Member

I will look into it

@leezu mentioned this issue Jul 5, 2018
@leezu
Contributor

leezu commented Jul 6, 2018

The latter error could be due to numpy.linalg.norm not being numerically stable (neither is the MXNet one). Possibly #11573 could fix this issue.
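
For illustration (plain NumPy, not the MXNet kernel), the straightforward sqrt-of-sum-of-squares formulation overflows in float16 even when the norm itself is well within range, while a rescaled formulation survives; this is the kind of instability referred to above:

import numpy as np

x = np.full(4, 300.0, dtype=np.float16)    # true norm is 600, representable in float16
naive = np.sqrt((x * x).sum())             # x*x = 90000 > 65504, so this overflows to inf
scale = np.abs(x).max()
stable = scale * np.sqrt(((x / scale) ** 2).sum())

print(naive)   # inf
print(stable)  # 600.0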

@szha
Member

szha commented Jul 11, 2018

FYI @anirudhacharya the test has been partially re-enabled in #11573.

@KellenSunderland
Contributor

@szha Would it make sense to close this issue and re-open it if we see test issues with test_norm again?

@anirudhacharya
Member

@KellenSunderland As Sheng mentioned, the test has only been partially re-enabled; certain test cases are still disabled, and the fix that was merged does not address the occasional hanging of the test on the CI.

@leezu
Contributor

leezu commented Jul 13, 2018

@anirudh2290 @KellenSunderland Only the part that caused assertion errors during numerical checking of the gradient is disabled. It is not clear whether the hang was occurring during check_numeric_gradient or during the other parts of the test. If the hang occurred outside of check_numeric_gradient, it is fixed by the change of the underlying norm implementation; otherwise it is disabled.

You may want to rename this issue to make it clear that a part of the test is currently disabled due to numerical instability in the check_numeric_gradient.

@haojin2
Contributor

haojin2 commented Jul 30, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/22/pipeline/858

======================================================================

FAIL: test_operator_gpu.test_norm

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc

    return func(*arg, **kw)

  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/gpu/../unittest/test_ndarray.py", line 1324, in test_norm

    mx.test_utils.assert_almost_equal(npy_out, mx_out.asnumpy())

  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal

    raise AssertionError(msg)

AssertionError: 

Items are not equal:

Error 200000.015625 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(0, 0, 0, 0), a=0.567434, b=-0.567434

 a: array([[[[0.56743413, 0.3482769 ],

         [0.76402813, 0.21444368],

         [0.26532394, 0.35806283],...

 b: array([[[[-0.56743413, -0.3482769 ],

         [-0.76402813, -0.21444368],

         [-0.26532394, -0.35806283],...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1479125905 to reproduce.

--------------------- >> end captured logging << ---------------------

@marcoabreu
Contributor Author

@leezu

@leezu
Contributor

leezu commented Jul 30, 2018

I'm looking into the issue now.

@leezu
Contributor

leezu commented Jul 30, 2018

I can't reproduce this on p3 and p2 instances so far. I keep running this test in a loop for now. Do the tests run on a different instance type?

@szha
Member

szha commented Jul 30, 2018

Hao and I weren't able to reproduce the error on the (supposedly) same instance type that CI slaves use, using the same docker building logic.

@marcoabreu
Contributor Author

G3

Are you running with our docker setup?

@haojin2
Contributor

haojin2 commented Jul 30, 2018

Yes, Sheng and I were using g3.8xlarge + instructions from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results

@leezu
Contributor

leezu commented Jul 30, 2018

@marcoabreu is it feasible to restart the CI pipeline with the same global seed (not only the test seed)? Jenkins should allow restarting the pipeline from http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/24/pipeline/861/ while logged in, but I'm not sure how the global seed is set up.

@marcoabreu
Contributor Author

I don't think this is feasible, considering the setup is entirely reproducible with Docker. Also, the cost is not worth the effort.

I might be stating the obvious, but did you use the latest master or the commit at that time to reproduce the error?

@szha
Member

szha commented Jul 30, 2018

@marcoabreu Hao and I have been using my branch and I imagine @leezu is doing the same.

@anirudhacharya
Member

Update on this issue -

I ran this test case ~5k times for different scenarios on a Linux GPU machine (EC2 p2.8xlarge instance). Currently the reasons for the test failure are:

  1. check_numeric_gradient has precision issues when run with np.float16 and np.float32. I have an updated set of atol and rtol values that can fix this, but the new values still have to be verified with repeated runs of the test (see the sketch below).
  2. The bigger bug is that the test intermittently hangs while performing check_numeric_gradient. check_symbolic_backward and check_symbolic_forward work correctly and consistently.

The above two issues are only reproducible when run against the CPU context. When run against the GPU, the test neither hangs nor has accuracy issues.

I need to check the same with a Windows instance.
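
A hypothetical sketch of such a per-dtype tolerance sweep (the symbol, input shape, epsilons, and rtol/atol values here are illustrative placeholders, not the values under verification; float16 would need still looser settings):

import numpy as np
import mxnet as mx
from mxnet.test_utils import check_numeric_gradient

data = mx.sym.Variable('data')
norm_sym = mx.sym.norm(data, ord=2, axis=1)

# Looser tolerances for lower-precision dtypes; the exact values would need to
# be confirmed with repeated runs, as described above.
for dtype, eps, rtol, atol in [(np.float64, 1e-4, 1e-2, 1e-3),
                               (np.float32, 1e-2, 2e-2, 2e-3)]:
    in_data = np.random.uniform(-1.0, 1.0, (4, 5)).astype(dtype)
    check_numeric_gradient(norm_sym, [in_data], numeric_eps=eps,
                           rtol=rtol, atol=atol)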

@anirudhacharya
Member

With regard to the hanging check_numeric_gradient -

It is not reproducible when I run it in a single-threaded environment with MXNET_TEST_COUNT=500 MXNET_ENGINE_TYPE=NaiveEngine nosetests --nocapture --verbose tests/python/unittest/test_operator.py:test_norm.

It only hangs in a multi-threaded environment, due to resource contention. I have been using strace -p <PID> to inspect the system calls made while this test runs, but I have yet to narrow down the exact source of the contention.

The strace logs have recurring occurrences of the following system calls from competing threads -

101312 futex(0x22c1f90, FUTEX_WAKE_PRIVATE, 1) = 0
101169 futex(0x3b1f5bc, FUTEX_WAIT_PRIVATE, 6862611, NULL <unfinished ...>
101312 futex(0x7f19a4005ce4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
101341 futex(0x7f19a4015194, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
101312 futex(0x3b1f5bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x3b1f590, 6862612 <unfinished ...>
101341 <... futex resumed> )            = 0
101169 <... futex resumed> )            = 0
101312 <... futex resumed> )            = 1
