This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Hanging flaky test test_operator.test_norm @ Python 3: GPU Win #11509

Open

marcoabreu opened this issue Jun 30, 2018 · 21 comments

Comments

@marcoabreu
Contributor

test_operator.test_norm causes hangs in the Python 3: GPU Win job.

http://jenkins.mxnet-ci.amazon-ml.com/view/Master%20Jobs/job/incubator-mxnet/job/master/1109/

test_operator.test_pad ... ok

test_operator.test_instance_normalization ... ok

test_operator.test_l2_normalization ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.

ok

test_operator.test_norm ... Sending interrupt signal to process

After 10s process did not stop
@marcoabreu
Contributor Author

On other occasions, it causes assertion errors:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1108/pipeline

======================================================================

FAIL: test_operator.test_norm

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/python/unittest/common.py", line 157, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/unittest/test_operator.py", line 3074, in test_norm

    check_numeric_gradient(norm_sym, [in_data], numeric_eps=epsilon, rtol=1e-2, atol=1e-3)

  File "/work/mxnet/python/mxnet/test_utils.py", line 914, in check_numeric_gradient

    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))

  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal

    raise AssertionError(msg)

AssertionError: 

Items are not equal:

Error 1.028974 exceeds tolerance rtol=0.010000, atol=0.001000.  Location of maximum error:(0, 1, 4, 5, 0, 0), a=-0.003815, b=-0.002757

 NUMERICAL_data: array([[[[[[ 0.01990795, -0.06532669,  0.00363588, ...,  0.01215935,

            -0.00745058, -0.00047684],

           [-0.00017881, -0.10442734,  0.01651049, ..., -0.03290176,...

 BACKWARD_data: array([[[[[[ 0.01993995, -0.06537328,  0.00367729, ...,  0.01226579,

            -0.0074518 , -0.0004907 ],

           [-0.00015076, -0.10442342,  0.01648286, ..., -0.03304606,...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=615595105 to reproduce.

--------------------- >> end captured logging << ---------------------
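
For context, check_numeric_gradient compares a central finite-difference estimate of the gradient against the gradient produced by the backward pass, using assert_almost_equal with the given rtol/atol. A minimal, self-contained sketch of that comparison (plain NumPy, not the MXNet implementation; the norm function, shapes, and eps below are illustrative):

import numpy as np

def numeric_grad(f, x, eps=1e-3):
    # Central finite-difference estimate of d(sum(f(x)))/dx.
    grad = np.zeros_like(x)
    flat_x = x.reshape(-1)    # views into x and grad
    flat_g = grad.reshape(-1)
    for i in range(flat_x.size):
        orig = flat_x[i]
        flat_x[i] = orig + eps
        plus = f(x).sum()
        flat_x[i] = orig - eps
        minus = f(x).sum()
        flat_x[i] = orig
        flat_g[i] = (plus - minus) / (2 * eps)
    return grad

x = np.random.uniform(-1.0, 1.0, (3, 4))
l2_norm = lambda a: np.sqrt((a * a).sum(axis=1))            # L2 norm along axis 1
analytic = x / np.sqrt((x * x).sum(axis=1, keepdims=True))  # gradient of sum of row norms
np.testing.assert_allclose(numeric_grad(l2_norm, x), analytic, rtol=1e-2, atol=1e-3)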

@eric-haibin-lin
Member

@anirudhacharya

@anirudhacharya
Member

I will look into it

@leezu mentioned this issue Jul 5, 2018
@leezu
Contributor

leezu commented Jul 6, 2018

The latter error could be due to numpy.linalg.norm not being numerically stable (neither is the MXNet one). Possibly #11573 could fix this issue.
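
For illustration (plain NumPy, not the MXNet kernel), the straightforward sqrt-of-sum-of-squares formulation overflows in float16 even when the norm itself is well within range, while a rescaled formulation survives; this is the kind of instability referred to above:

import numpy as np

x = np.full(4, 300.0, dtype=np.float16)    # true norm is 600, representable in float16
naive = np.sqrt((x * x).sum())             # x*x = 90000 > 65504, so this overflows to inf
scale = np.abs(x).max()
stable = scale * np.sqrt(((x / scale) ** 2).sum())

print(naive)   # inf
print(stable)  # 600.0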

@szha
Member

szha commented Jul 11, 2018

FYI @anirudhacharya the test has been partially re-enabled in #11573.

@KellenSunderland
Contributor

@szha Would it make sense to close this issue and re-open it if we see test issues with test_norm again?

@anirudhacharya
Member

@KellenSunderland As Sheng mentioned, the test has only been partially re-enabled; certain test cases are still disabled, and the fix that was merged does not address the occasional hanging of the test on the CI.

@leezu
Contributor

leezu commented Jul 13, 2018

@anirudh2290 @KellenSunderland Only the part that caused assertion errors during numerical checking of the gradient is disabled. It is not clear whether the hang was occurring during check_numeric_gradient or during the other parts of the test. If the hang occurred outside of check_numeric_gradient, it is fixed by the change of the underlying norm implementation; otherwise it is disabled.

You may want to rename this issue to make it clear that a part of the test is currently disabled due to numerical instability in the check_numeric_gradient.

@haojin2
Contributor

haojin2 commented Jul 30, 2018

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/22/pipeline/858

======================================================================

FAIL: test_operator_gpu.test_norm

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc

    return func(*arg, **kw)

  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new

    orig_test(*args, **kwargs)

  File "/work/mxnet/tests/python/gpu/../unittest/test_ndarray.py", line 1324, in test_norm

    mx.test_utils.assert_almost_equal(npy_out, mx_out.asnumpy())

  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal

    raise AssertionError(msg)

AssertionError: 

Items are not equal:

Error 200000.015625 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(0, 0, 0, 0), a=0.567434, b=-0.567434

 a: array([[[[0.56743413, 0.3482769 ],

         [0.76402813, 0.21444368],

         [0.26532394, 0.35806283],...

 b: array([[[[-0.56743413, -0.3482769 ],

         [-0.76402813, -0.21444368],

         [-0.26532394, -0.35806283],...

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1479125905 to reproduce.

--------------------- >> end captured logging << ---------------------

@marcoabreu
Contributor Author

@leezu

@leezu
Contributor

leezu commented Jul 30, 2018

I'm looking into the issue now.

@leezu
Contributor

leezu commented Jul 30, 2018

I can't reproduce this on p3 and p2 instances so far. I keep running this test in a loop for now. Do the tests run on a different instance type?

@szha
Member

szha commented Jul 30, 2018

Hao and I weren't able to reproduce the error on the (supposedly) same instance type that CI slaves use, using the same docker building logic.

@marcoabreu
Contributor Author

G3

Are you running with our docker setup?

@haojin2
Contributor

haojin2 commented Jul 30, 2018

Yes, Sheng and I were using g3.8xlarge + instructions from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results

@leezu
Contributor

leezu commented Jul 30, 2018

@marcoabreu is it feasible to restart the CI pipeline with the same global seed (not only the test seed)? Jenkins should allow restarting the pipeline from http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/24/pipeline/861/ while logged in, but I'm not sure how the global seed is set up.

@marcoabreu
Contributor Author

I don't think this is feasible, considering the setup is entirely reproducible with Docker. Also, the cost is not worth the effort.

I might be stating the obvious, but did you use the latest master or the commit at that time to reproduce the error?

@szha
Member

szha commented Jul 30, 2018

@marcoabreu Hao and I have been using my branch and I imagine @leezu is doing the same.

@anirudhacharya
Member

Update on this issue -

I ran this test case ~5k times for different scenarios on a Linux GPU machine (EC2 p2.8xlarge instance). Currently the reasons for the test failure are:

  1. check_numeric_gradient has precision issues when run with np.float16 and np.float32. I have an updated set of atol and rtol values that can fix this, but the new values still have to be verified with repeated runs of the test (see the sketch below).
  2. The bigger bug is that the test intermittently hangs while performing check_numeric_gradient. check_symbolic_backward and check_symbolic_forward work correctly and consistently.

The above two issues are only reproducible when run against the CPU context. When run against the GPU, the test neither hangs nor has accuracy issues.

I need to check the same with a Windows instance.
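
A hypothetical sketch of such a per-dtype tolerance sweep (the symbol, input shape, epsilons, and rtol/atol values here are illustrative placeholders, not the values under verification; float16 would need still looser settings):

import numpy as np
import mxnet as mx
from mxnet.test_utils import check_numeric_gradient

data = mx.sym.Variable('data')
norm_sym = mx.sym.norm(data, ord=2, axis=1)

# Looser tolerances for lower-precision dtypes; the exact values would need to
# be confirmed with repeated runs, as described above.
for dtype, eps, rtol, atol in [(np.float64, 1e-4, 1e-2, 1e-3),
                               (np.float32, 1e-2, 2e-2, 2e-3)]:
    in_data = np.random.uniform(-1.0, 1.0, (4, 5)).astype(dtype)
    check_numeric_gradient(norm_sym, [in_data], numeric_eps=eps,
                           rtol=rtol, atol=atol)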

@anirudhacharya
Member

With regard to the hanging check_numeric_gradient -

It is not reproducible when I run it in a single-threaded environment with MXNET_TEST_COUNT=500 MXNET_ENGINE_TYPE=NaiveEngine nosetests --nocapture --verbose tests/python/unittest/test_operator.py:test_norm.

It only hangs in a multi-threaded environment, due to resource contention. I have been using strace -p <PID> to inspect the system calls made while this test runs, but I have yet to narrow down the exact source of the contention.

The strace logs have recurring occurrences of the following system calls from competing threads -

101312 futex(0x22c1f90, FUTEX_WAKE_PRIVATE, 1) = 0
101169 futex(0x3b1f5bc, FUTEX_WAIT_PRIVATE, 6862611, NULL <unfinished ...>
101312 futex(0x7f19a4005ce4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
101341 futex(0x7f19a4015194, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
101312 futex(0x3b1f5bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x3b1f590, 6862612 <unfinished ...>
101341 <... futex resumed> )            = 0
101169 <... futex resumed> )            = 0
101312 <... futex resumed> )            = 1
