
Frozen pretrained Faster RCNN/RFCN networks from model zoo yielding different outputs on different GPUs and runs #2374

Closed
EpochalEngineer opened this issue Sep 13, 2017 · 6 comments
Assignees
Labels
type:bug Bug in the code

Comments

EpochalEngineer commented Sep 13, 2017

System information

  • What is the top-level directory of the model you are using:
    Using unmodified pretrained coco models: faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017, faster_rcnn_resnet101_coco_11_06_2017, rfcn_resnet101_coco_11_06_2017

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    UPDATE: tested on two machines now, both reproduce it:
    Machine 1: Linux Ubuntu 14.04.4 LTS
    Machine 2: Linux Ubuntu 16.04.2 LTS

  • TensorFlow installed from (source or binary):
    official docker container, with last commit 58fb6d7
    docker version from 2017-08-24T02:37:57.51182742Z

  • TensorFlow version (use command below):
    ('v1.2.0-5-g435cdfc', '1.2.1')

  • Bazel version (if compiling from source):
    N/A

  • CUDA/cuDNN version:
From the official docker image: CUDA 8.0, cuDNN 5.1.10

  • GPU model and memory:
Machine 1: Three NVIDIA GeForce GTX 1080, 12 GB
    Machine 2: Two NVIDIA GeForce GTX 1080, 12 GB

  • Exact command to reproduce:
Run object_detection_tutorial.ipynb on different GPUs, selecting the device either with export CUDA_VISIBLE_DEVICES= or by setting it in the session config. A version that loops through the 3 GPUs several times and compares the outputs is attached.
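
For reference, the two selection mechanisms look like this. This is a minimal sketch: the device id "2" is just an example, and the TF 1.x session-config variant is left as a comment so the snippet runs without TensorFlow installed:

```python
import os

# Option (a): environment variable. This must be set before CUDA
# initializes, i.e. before importing tensorflow; listing only "2" makes
# physical GPU 2 the sole visible device, renumbered to /gpu:0 in-process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# Option (b): TF 1.x session config (commented out so this sketch has no
# TensorFlow dependency):
#   config = tf.ConfigProto(gpu_options=tf.GPUOptions(visible_device_list="2"))
#   sess = tf.Session(config=config)

print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 2
```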

Describe the problem

Running on different GPUs yields different results, and GPUs 1 and 2 are not deterministic across runs. GPU selection is done by hiding the other devices (e.g. making devices 1 and 2 invisible so TensorFlow runs on GPU 0, and so on). This uses the frozen pretrained networks from this repository's linked model zoo and the supplied object_detection_tutorial.ipynb, with no modification other than setting the CUDA visible_device_list. The SSD frozen models, however, give identical outputs on all 3 GPUs from what I have seen.
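
The comparison the modified notebook performs can be sketched independently of TensorFlow: run inference repeatedly and require the top-k scores to match exactly. The `run_inference` callables below are stand-ins for `sess.run` on the frozen graph, not the actual detector:

```python
# Determinism harness: call the same inference function several times
# and compare the top-k scores for exact equality.

def top_k(scores, k=4):
    """Return the k largest scores, descending."""
    return sorted(scores, reverse=True)[:k]

def is_deterministic(run_inference, iters=3, k=4):
    """True if `iters` calls to run_inference() yield identical top-k scores."""
    baseline = top_k(run_inference(), k)
    return all(top_k(run_inference(), k) == baseline for _ in range(iters - 1))

# A fixed-output stand-in model passes (like GPU 0 in the runs below):
assert is_deterministic(lambda: [0.9998, 0.9986, 0.9530, 0.9158])

# A drifting stand-in model is flagged (like GPU 1):
from itertools import count
step = count()
assert not is_deterministic(lambda: [0.5 + 0.01 * next(step)])
```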

I have also run cuda_memtest on all 3 GPUs; logs are attached.

UPDATE: I just tested on a second machine with 2 GPUs, and reproduced the issue. GPU 0 is deterministic, GPU 1 is not (and often produces bad results).

Source code / logs

I've attached a diff of the modified object_detection_tutorial.ipynb, which loops over the 3 GPUs 3 times and prints the top box scores; these change from run to run. Also attached is a PDF of that notebook with the detections drawn on the images. Text output:

Evaluating image 0

Running on GPU 0
Top 4 box scores:
Iter 1: [ 0.99978215 0.99857557 0.95300484 0.91580492]
Iter 2: [ 0.99978215 0.99857557 0.95300484 0.91580492]
Iter 3: [ 0.99978215 0.99857557 0.95300484 0.91580492]

Running on GPU 1
Top 4 box scores:
Iter 1: [ 0.68702352 0.16781448 0.13143283 0.12993629]
Iter 2: [ 0.18502565 0.16854601 0.08074528 0.07859289]
Iter 3: [ 0.18502565 0.16854601 0.05546702 0.05111229]

Running on GPU 2
Top 4 box scores:
Iter 1: [ 0.68702352 0.16781448 0.13143283 0.12993629]
Iter 2: [ 0.18941374 0.18502565 0.16854601 0.16230994]
Iter 3: [ 0.18502565 0.16854601 0.05546702 0.05482833]

Evaluating image 1

Running on GPU 0
Top 4 box scores:
Iter 1: [ 0.99755412 0.99750346 0.99380219 0.99067008]
Iter 2: [ 0.99755412 0.99750346 0.99380219 0.99067008]
Iter 3: [ 0.99755412 0.99750346 0.99380219 0.99067008]

Running on GPU 1
Top 4 box scores:
Iter 1: [ 0.96881998 0.96441168 0.96164131 0.96006596]
Iter 2: [ 0.9377929 0.91686022 0.80374646 0.79758978]
Iter 3: [ 0.90396696 0.89217037 0.85456908 0.85334581]

Running on GPU 2
Top 4 box scores:
Iter 1: [ 0.9377929 0.91686022 0.80374646 0.79758978]
Iter 2: [ 0.9377929 0.91686022 0.80374646 0.79758978]
Iter 3: [ 0.9377929 0.91686022 0.80374646 0.79758978]

object_detection_tutorial.diff.txt

gpu_output_differences.pdf

Updated with a longer run:
cuda_memtest.log.txt

@cy89 cy89 added the stat:awaiting response Waiting on input from the contributor label Sep 14, 2017
@EpochalEngineer EpochalEngineer changed the title Different Object Detection outputs from frozen inference graph ? Frozen pretrained Faster RCNN/RFCN networks from model zoo yielding different outputs on different GPUs and runs Sep 19, 2017

EpochalEngineer commented Sep 19, 2017

Updated with a simplified test using the model zoo, plus a second-machine test that reproduced these issues.

@aselle aselle removed the stat:awaiting response Waiting on input from the contributor label Sep 19, 2017
EpochalEngineer commented:

@aselle Was there supposed to be a response added with the removal of that tag?


aselle commented Sep 25, 2017

@nealwu, could you take a look?

@aselle aselle added the stat:community support, stat:awaiting model gardener (Waiting on input from TensorFlow model gardener), and type:bug (Bug in the code) labels and removed the stat:community support label Sep 25, 2017

nealwu commented Sep 25, 2017

Looks like this is an object detection question. Looping in @derekjchow @jch1


EpochalEngineer commented Oct 2, 2017

Noticed a difference between using the environment variable CUDA_VISIBLE_DEVICES and setting the config parameter (visible_device_list). We're no longer able to reproduce this behavior with the environment variable, only with the config parameter. In addition, when using the config parameter, there is a small (~180 MB) allocation on GPU 0 even when the config is set to use GPUs 1 and 2, which seems to correlate with these issues.
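
One way to confirm that stray allocation is to poll nvidia-smi while the notebook runs. The sketch below parses its CSV query output; the sample string (including the 180 MiB figure on GPU 0) is illustrative, not captured from the machines in question:

```python
# Parse `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader`
# output into {gpu_index: used_mib}. The sample text is hand-written.
sample = """\
0, 180 MiB
1, 7423 MiB
2, 7423 MiB
"""

def used_mib(csv_text):
    usage = {}
    for line in csv_text.strip().splitlines():
        idx, mem = line.split(", ")
        usage[int(idx)] = int(mem.split()[0])
    return usage

usage = used_mib(sample)
print(usage)  # -> {0: 180, 1: 7423, 2: 7423}
# GPU 0 should be idle when visible_device_list="1,2"; nonzero use here
# would reproduce the observation above.
assert usage[0] > 0
```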

@tombstone tombstone added this to To Do in Object Detection via automation Nov 18, 2017
@tensorflowbutler tensorflowbutler removed the stat:awaiting model gardener Waiting on input from TensorFlow model gardener label Apr 6, 2018
tensorflowbutler commented:

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

Object Detection automation moved this from To Do to Done Feb 7, 2020