CIFAR-10 tutorial for multi-GPU fails because full shape isn't passed to prefetch_queue #7216

Closed
chrismattmann opened this issue Jul 15, 2019 · 3 comments

Comments

@chrismattmann (Contributor) commented Jul 15, 2019

System information

  • What is the top-level directory of the model you are using:
    • tutorials/image/cifar10/cifar10_multi_gpu_train.py
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux jupyter-mattmann-40usc-2eedu 4.15.15-1.el7.x86_64 #1 SMP Thu Oct 4 07:42:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • TensorFlow installed from (source or binary): used PIP (binary)
  • TensorFlow version (use command below): 1.13.1, tensorflow-datasets 1.0.2
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: 4 GPUs
  • Exact command to reproduce:
    python3 cifar_eval.py
== env ==========================================================
LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Tue Jul 16 15:59:26 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 31%   32C    P0    85W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
|  0%   30C    P8    22W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:61:00.0 Off |                  N/A |
|  0%   30C    P0    65W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:63:00.0 Off |                  N/A |
| 29%   29C    P0    62W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130

== tensorflow installed from info ==================

== python version  ==============================================
(major, minor, micro, releaselevel, serial)
(3, 7, 3, 'final', 0)

== bazel version  ===============================================
jovyan@jupyter-mattmann-40usc-2eedu:~/models/tutorials/image/cifar10$ 

Describe the problem

The CIFAR-10 multi-GPU tutorial fails when run from the command line with TensorFlow 1.13.1 and tensorflow-datasets 1.0.2: the image and label tensors handed to prefetch_queue do not carry their full static shape. The fix is an explicit tf.reshape on those tensors before they are passed to prefetch_queue, so that the outer batch dimension becomes part of the static shape. I've got a quick PR that fixes this; a sketch of the change is below.
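
A minimal sketch of the kind of change described, assuming the tutorial's existing imports (cifar10, FLAGS) and its usual shapes (cifar10.IMAGE_SIZE x cifar10.IMAGE_SIZE x 3 crops, per-step batch size FLAGS.batch_size); the exact lines in the merged PR may differ:

```python
# In cifar10_multi_gpu_train.py, where the input pipeline feeds the prefetch
# queue. When the inputs come from tensorflow-datasets, the static outer
# (batch) dimension can be lost, and prefetch_queue needs fully defined
# shapes, so restore the full shape explicitly with tf.reshape.
images, labels = cifar10.distorted_inputs()
images = tf.reshape(
    images, [FLAGS.batch_size, cifar10.IMAGE_SIZE, cifar10.IMAGE_SIZE, 3])
labels = tf.reshape(labels, [FLAGS.batch_size])

batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [images, labels], capacity=2 * FLAGS.num_gpus)
```

With the static shapes restored, prefetch_queue can build its enqueue/dequeue ops instead of failing on partially defined shapes.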

Source code / logs

Will send a PR.

chrismattmann added a commit to chrismattmann/models that referenced this issue Jul 15, 2019
…ils because full shape isn't passed to prefetch_queue contributed by mattmann.
tensorflowbutler added the stat:awaiting response (waiting on input from the contributor) label on Jul 16, 2019
@tensorflowbutler (Member) commented:

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant to your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

@chrismattmann (Contributor, Author) commented:

@tensorflowbutler done

chrismattmann added a commit to chrismattmann/models that referenced this issue Jul 16, 2019
…ils because full shape isn't passed to prefetch_queue contributed by mattmann.
tensorflowbutler removed the stat:awaiting response (waiting on input from the contributor) label on Jul 17, 2019
tfboyd pushed a commit that referenced this issue Jul 19, 2019
…e full shape isn't passed to prefetch_queue contributed by mattmann. (#7217)
@chrismattmann (Contributor, Author) commented:

Committed by @tfboyd in 97a87f9. Thanks!
