Training Nasnet_large results in Out of memory #4846

antoajayraj · 2018-07-20T06:00:56Z

System information

What is the top-level directory of the model you are using:
models/research/slim/
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Redhat 7.5
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 1.8
Bazel version (if compiling from source): 0.10.0
CUDA/cuDNN version: 9.2/7.1.2
GPU model and memory: V100/16GB
Exact command to reproduce:
python train_image_classifier.py --dataset_dir=/data/TF_records/ --dataset_name=imagenet --dataset_split_name=train --model_name=nasnet_large --num_clones=4

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

== cat /etc/issue ===============================================
Linux paiws7 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
VERSION="7.5 (Maipo)"
VERSION_ID="7.5"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.5
REDHAT_SUPPORT_PRODUCT_VERSION="7.5"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux paiws7 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

== check pips ===================================================
numpy 1.14.4
numpydoc 0.7.0
protobuf 3.5.0
tensorflow 1.8.0

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.8.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)
/root/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters

== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/lib64:/usr/lib:/usr/local/cuda-9.1/lib:/usr/local/cuda/nvvm/lib64:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/extras/CUPTI/lib64:/opt/DL/tensorflow/lib
DYLD_LIBRARY_PATH is unset

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs ===================================================
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.2/targets/ppc64le-linux/lib/libcudart.so.9.2.64
/usr/local/cuda-9.2/targets/ppc64le-linux/lib/libcudart_static.a

Describe the problem

I was trying to train nasnet_large from models/research/slim on the ImageNet dataset and ran into an out-of-memory (OOM) error.

python train_image_classifier.py --dataset_dir=/data/TF_records/ --dataset_name=imagenet --dataset_split_name=train --model_name=nasnet_large --num_clones=4

2018-07-18 23:04:09.439260: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 14813580493
InUse: 14813580288
MaxInUse: 14813580288
NumAllocs: 9334
MaxAllocSize: 338608128

2018-07-18 23:04:09.439375: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-07-18 23:04:09.439462: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at fused_batch_norm_op.cc:274 : Resource exhausted: OOM when allocating tensor with shape[32,336,21,21] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Recording summary at step 0.
Traceback (most recent call last):
File "train_image_classifier.py", line 581, in <module>
tf.app.run()
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train_image_classifier.py", line 577, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
sess, train_op, global_step, train_step_kwargs)
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 332, in __init__
def __init__(self, node_def, op, message):
KeyboardInterrupt
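
For a sense of scale, the numbers in the log above can be checked with a few lines of arithmetic. This is a rough sketch, assuming the failing tensor is float32 (4 bytes per element) and that `Limit` and `InUse` in the allocator stats are byte counts. The helper name `tensor_bytes` is made up for illustration and is not part of TensorFlow.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (hypothetical helper, float32 assumed)."""
    return reduce(mul, shape, 1) * bytes_per_element


# Values copied from the bfc_allocator stats and the OOM message in the log.
limit = 14813580493
in_use = 14813580288
requested = tensor_bytes([32, 336, 21, 21])

headroom = limit - in_use

print("requested bytes:", requested)
print("requested MiB:  ", round(requested / 2**20, 2))
print("free bytes:     ", headroom)
print("fits:           ", requested <= headroom)
```

Under these assumptions the failed allocation is about 18 MiB, while only 205 bytes were free on GPU:0, so the allocator was already full before this op ran. The leading dimension of 32 is probably the per-clone batch size (an assumption based on the shape alone), in which case passing a smaller `--batch_size` to train_image_classifier.py would shrink each activation tensor proportionally.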

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

tensorflowbutler · 2020-01-29T19:45:46Z

Hi There,
We are checking to see if you still need help with this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

google-ml-butler · 2021-01-24T05:04:43Z

Are you satisfied with the resolution of your issue?
Yes
No

tensorflowbutler assigned robieta Jul 20, 2018

robieta removed their assignment Feb 6, 2020

ravikyram added models:research models that come under research directory type:support labels Jul 10, 2020

ravikyram assigned sguada and marksandler2 Jul 10, 2020

saberkun closed this as completed Jan 24, 2021
