Training Nasnet_large results in Out of memory #4846

Closed
antoajayraj opened this issue Jul 20, 2018 · 2 comments
Labels: models:research (models that come under research directory), type:support

Comments

@antoajayraj

Please go to Stack Overflow for help and support:

http://stackoverflow.com/questions/tagged/tensorflow

Also, please understand that many of the models included in this repository are experimental and research-style code. If you open a GitHub issue, here is our policy:

  1. It must be a bug, a feature request, or a significant problem with documentation (for small docs fixes please send a PR instead).
  2. The form below must be filled out.

Here's why we have that policy: TensorFlow developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.


System information

  • What is the top-level directory of the model you are using:
    models/research/slim/
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Redhat 7.5
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.8
  • Bazel version (if compiling from source): 0.10.0
  • CUDA/cuDNN version: 9.2/7.1.2
  • GPU model and memory: V100/16GB
  • Exact command to reproduce:
    python train_image_classifier.py --dataset_dir=/data/TF_records/ --dataset_name=imagenet --dataset_split_name=train --model_name=nasnet_large --num_clones=4

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

== cat /etc/issue ===============================================
Linux paiws7 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
VERSION="7.5 (Maipo)"
VERSION_ID="7.5"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.5
REDHAT_SUPPORT_PRODUCT_VERSION="7.5"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux paiws7 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

== check pips ===================================================
numpy 1.14.4
numpydoc 0.7.0
protobuf 3.5.0
tensorflow 1.8.0

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.8.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)
/root/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/lib64:/usr/lib:/usr/local/cuda-9.1/lib:/usr/local/cuda/nvvm/lib64:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/extras/CUPTI/lib64:/opt/DL/tensorflow/lib
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Fri Jul 20 00:56:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.15                 Driver Version: 396.15                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   38C    P0    39W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   42C    P0    40W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   39C    P0    42W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    39W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs ===================================================
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.2/targets/ppc64le-linux/lib/libcudart.so.9.2.64
/usr/local/cuda-9.2/targets/ppc64le-linux/lib/libcudart_static.a

Describe the problem

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

I was trying to train nasnet_large with models/research/slim on the ImageNet dataset, and the run fails with an out-of-memory (OOM) error.

python train_image_classifier.py --dataset_dir=/data/TF_records/ --dataset_name=imagenet --dataset_split_name=train --model_name=nasnet_large --num_clones=4

2018-07-18 23:04:09.439260: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 14813580493
InUse: 14813580288
MaxInUse: 14813580288
NumAllocs: 9334
MaxAllocSize: 338608128

2018-07-18 23:04:09.439375: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-07-18 23:04:09.439462: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at fused_batch_norm_op.cc:274 : Resource exhausted: OOM when allocating tensor with shape[32,336,21,21] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Recording summary at step 0.
Traceback (most recent call last):
  File "train_image_classifier.py", line 581, in <module>
    tf.app.run()
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train_image_classifier.py", line 577, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 332, in __init__
    def __init__(self, node_def, op, message):
KeyboardInterrupt
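
For anyone reading the log above, here is a rough sketch of the arithmetic behind the failure. It only uses numbers copied from the allocator stats and the OOM message; the variable and function names are made up for illustration, and it assumes "type float" means 32-bit floats (4 bytes each). It is not a diagnosis of where the rest of the memory went, just a check of how much was requested versus how much was left on GPU:0.

```python
from math import prod  # requires Python 3.8+

# Values copied from the log above.
oom_shape = (32, 336, 21, 21)   # shape of the tensor that failed to allocate
bytes_per_float32 = 4           # assumption: "type float" is 32-bit
limit_bytes = 14813580493       # BFC allocator "Limit"
in_use_bytes = 14813580288      # BFC allocator "InUse"


def tensor_bytes(shape, itemsize):
    """Return the size in bytes of a dense tensor with the given shape."""
    return prod(shape) * itemsize


requested = tensor_bytes(oom_shape, bytes_per_float32)
headroom = limit_bytes - in_use_bytes

print(f"requested: {requested} bytes ({requested / 2**20:.1f} MiB)")
print(f"headroom:  {headroom} bytes")
print(f"fits: {requested <= headroom}")
```

So the failing allocation is only about 18 MiB, but the allocator had roughly 200 bytes left: the GPU was already full before this tensor was requested. The leading 32 in the shape is the batch dimension of that tensor, so a smaller per-clone batch would be the first thing to try.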

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

@tensorflowbutler
Member

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

@robieta removed their assignment on Feb 6, 2020
@ravikyram added the models:research (models that come under research directory) and type:support labels on Jul 10, 2020
@google-ml-butler

Are you satisfied with the resolution of your issue?
Yes
No
