Hi There,
We are checking to see if you still need help with this, as it seems to be a considerably old issue. Please update the issue with the latest information, a code snippet that reproduces the problem, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
Please go to Stack Overflow for help and support:
http://stackoverflow.com/questions/tagged/tensorflow
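Also, please understand that many of the models included in this repository are experimental and research-style code. If you open a GitHub issue, here is our policy:
It must be a bug, a feature request, or a significant problem with documentation (for small docs fixes please send a PR instead).
The form below must be filled out.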
Here's why we have that policy: TensorFlow developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.
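System information
What is the top-level directory of the model you are using: models/research/slim/
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Redhat 7.5
TensorFlow installed from (source or binary): source
Exact command to reproduce: python train_image_classifier.py --dataset_dir=/data/TF_records/ --dataset_name=imagenet --dataset_split_name=train --model_name=nasnet_large --num_clones=4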
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
== cat /etc/issue ===============================================
Linux paiws7 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
VERSION="7.5 (Maipo)"
VERSION_ID="7.5"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.5
REDHAT_SUPPORT_PRODUCT_VERSION="7.5"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux paiws7 4.14.0-49.el7a.ppc64le #1 SMP Wed Mar 14 13:58:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
== check pips ===================================================
numpy 1.14.4
numpydoc 0.7.0
protobuf 3.5.0
tensorflow 1.8.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.8.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)
/root/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/lib64:/usr/lib:/usr/local/cuda-9.1/lib:/usr/local/cuda/nvvm/lib64:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/extras/CUPTI/lib64:/opt/DL/tensorflow/lib
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Fri Jul 20 00:56:20 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.15                 Driver Version: 396.15                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   38C    P0    39W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   42C    P0    40W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   39C    P0    42W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   42C    P0    39W / 300W |      0MiB / 15360MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.2/targets/ppc64le-linux/lib/libcudart.so.9.2.64
/usr/local/cuda-9.2/targets/ppc64le-linux/lib/libcudart_static.a
Describe the problem
Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.
I was trying to train nasnet_large with models/research/slim on the ImageNet dataset and ran into an out-of-memory (OOM) error on the GPU.
python train_image_classifier.py --dataset_dir=/data/TF_records/ --dataset_name=imagenet --dataset_split_name=train --model_name=nasnet_large --num_clones=4
2018-07-18 23:04:09.439260: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 14813580493
InUse: 14813580288
MaxInUse: 14813580288
NumAllocs: 9334
MaxAllocSize: 338608128
2018-07-18 23:04:09.439375: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-07-18 23:04:09.439462: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at fused_batch_norm_op.cc:274 : Resource exhausted: OOM when allocating tensor with shape[32,336,21,21] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Recording summary at step 0.
Traceback (most recent call last):
  File "train_image_classifier.py", line 581, in <module>
    tf.app.run()
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "train_image_classifier.py", line 577, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/opt/DL/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 332, in __init__
    def __init__(self, node_def, op, message):
KeyboardInterrupt
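For anyone reading the log, here is a rough sketch of the arithmetic behind the OOM message. It only uses figures copied from the allocator stats and the error line above. The helper name `tensor_bytes` and the list of smaller batch sizes are illustrative assumptions, not part of the slim scripts. Reading the leading dimension of the failed shape as a per-clone batch size of 32 is also an assumption, since the command does not set a batch size explicitly.

```python
# Figures copied from the BFC allocator stats and the OOM message in the log.
LIMIT_BYTES = 14813580493          # "Limit": size of the GPU memory pool
IN_USE_BYTES = 14813580288         # "InUse": bytes allocated when the failure occurred
FAILED_SHAPE = (32, 336, 21, 21)   # shape of the tensor that could not be allocated
FLOAT32_BYTES = 4                  # "type float" in the log is a 32-bit float


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Return the bytes needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


free_bytes = LIMIT_BYTES - IN_USE_BYTES
needed_bytes = tensor_bytes(FAILED_SHAPE)

print("free in allocator pool  :", free_bytes, "bytes")
print("needed by failed tensor :", needed_bytes,
      "bytes (%.1f MiB)" % (needed_bytes / 2**20))

# Hypothetical smaller leading dimensions, to show the size scales linearly
# with the first axis (assumed here to be the per-clone batch size).
for batch in (32, 16, 8):
    shape = (batch,) + FAILED_SHAPE[1:]
    print("batch %2d -> %d bytes" % (batch, tensor_bytes(shape)))
```

The pool had only a few hundred bytes left when a request of about 18 MiB arrived, so the fused batch norm op is where memory ran out, not necessarily what consumed it.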
Source code / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.