CIFAR-10 tutorial for multi-GPU fails because full shape isn't passed to prefetch_queue #7216

Closed
chrismattmann opened this issue Jul 15, 2019 · 3 comments

Comments

@chrismattmann (Contributor) commented Jul 15, 2019

System information

  • What is the top-level directory of the model you are using:
    • tutorials/image/cifar10/cifar10_multi_gpu_train.py
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux jupyter-mattmann-40usc-2eedu 4.15.15-1.el7.x86_64 #1 SMP Thu Oct 4 07:42:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • TensorFlow installed from (source or binary): used PIP (binary)
  • TensorFlow version (use command below): 1.13.1, tensorflow-datasets 1.0.2
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: 4 GPUs
  • Exact command to reproduce:
    python3 cifar_eval.py
== env ==========================================================
LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Tue Jul 16 15:59:26 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 31%   32C    P0    85W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
|  0%   30C    P8    22W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:61:00.0 Off |                  N/A |
|  0%   30C    P0    65W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:63:00.0 Off |                  N/A |
| 29%   29C    P0    62W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130

== tensorflow installed from info ==================

== python version  ==============================================
(major, minor, micro, releaselevel, serial)
(3, 7, 3, 'final', 0)

== bazel version  ===============================================
jovyan@jupyter-mattmann-40usc-2eedu:~/models/tutorials/image/cifar10$ 

Describe the problem

The CIFAR-10 multi-GPU tutorial fails when run from the command line with TensorFlow 1.13.1 and tensorflow-datasets 1.0.2: the image and label tensors handed to prefetch_queue do not carry their full static shape. The fix is an explicit tf.reshape on those tensors before they are passed to prefetch_queue, so that the outer batch dimension becomes part of the static shape. I've got a quick PR that fixes this; a sketch of the change is below.
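
A minimal sketch of the kind of change described, assuming the tutorial's existing imports (cifar10, FLAGS) and its usual shapes (cifar10.IMAGE_SIZE x cifar10.IMAGE_SIZE x 3 crops, per-step batch size FLAGS.batch_size); the exact lines in the merged PR may differ:

```python
# In cifar10_multi_gpu_train.py, where the input pipeline feeds the prefetch
# queue. When the inputs come from tensorflow-datasets, the static outer
# (batch) dimension can be lost, and prefetch_queue needs fully defined
# shapes, so restore the full shape explicitly with tf.reshape.
images, labels = cifar10.distorted_inputs()
images = tf.reshape(
    images, [FLAGS.batch_size, cifar10.IMAGE_SIZE, cifar10.IMAGE_SIZE, 3])
labels = tf.reshape(labels, [FLAGS.batch_size])

batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [images, labels], capacity=2 * FLAGS.num_gpus)
```

With the static shapes restored, prefetch_queue can build its enqueue/dequeue ops instead of failing on partially defined shapes.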

Source code / logs

Will send a PR.

chrismattmann added a commit to chrismattmann/models that referenced this issue Jul 15, 2019
…ils because full shape isn't passed to prefetch_queue contributed by mattmann.
tensorflowbutler added the stat:awaiting response (waiting on input from the contributor) label on Jul 16, 2019
@tensorflowbutler (Member) commented:

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant to your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

@chrismattmann (Contributor, Author) commented:

@tensorflowbutler done

chrismattmann added a commit to chrismattmann/models that referenced this issue Jul 16, 2019
…ils because full shape isn't passed to prefetch_queue contributed by mattmann.
tensorflowbutler removed the stat:awaiting response (waiting on input from the contributor) label on Jul 17, 2019
tfboyd pushed a commit that referenced this issue Jul 19, 2019
…e full shape isn't passed to prefetch_queue contributed by mattmann. (#7217)
@chrismattmann (Contributor, Author) commented:

Committed by @tfboyd in 97a87f9. Thanks!
