Very slow training speed. Is this expected for my system setup? #34

Open
mrxiaohe opened this issue Mar 11, 2019 · 4 comments
@mrxiaohe

I am currently training a model using the Chinese in the Wild image data. My system setup is as follows:

  • OS: Windows Server 2016 Standard
  • RAM: 256 GB
  • Hard drive: 6 TB
  • Processor: Intel Xeon CPU E5-2687W v4 (24 cores)
  • GPU: NVIDIA Tesla V100-PCIE-16GB

The speed is shown below: each step takes close to 30 seconds. Training has been running for 2 days and has only completed 5,410 steps so far. The GPU does seem to be in use -- 96% of GPU memory is allocated. The CPU also shows quite a bit of activity, e.g., 40% for the Python session in which training is running.

Also, when I started training, I got the message `failed to allocate 15.90G (17071144960 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY`. I'm not sure whether this is related.

So my question is whether the speed I am observing is normal for this kind of setup, and how I might improve it. Thanks!

```
INFO:tensorflow:Recording summary at step 5315.
INFO:tensorflow:Saving checkpoint to path E:\Projects\TEXT DETECTION\chinese_text_in_the_wild\ctw-baseline-master\classification\products\train_logs_alexnet_v2\model.ckpt
INFO:tensorflow:Recording summary at step 5319.
INFO:tensorflow:global step 5320: loss = 6.8224 (29.31 sec/step)
INFO:tensorflow:Recording summary at step 5323.
INFO:tensorflow:Recording summary at step 5327.
INFO:tensorflow:global step 5330: loss = 6.9395 (29.13 sec/step)
INFO:tensorflow:Recording summary at step 5331.
INFO:tensorflow:Recording summary at step 5335.
INFO:tensorflow:Recording summary at step 5339.
INFO:tensorflow:global step 5340: loss = 6.7953 (34.16 sec/step)
INFO:tensorflow:Recording summary at step 5343.
INFO:tensorflow:Recording summary at step 5347.
INFO:tensorflow:global step 5350: loss = 6.8213 (30.08 sec/step)
INFO:tensorflow:Recording summary at step 5351.
INFO:tensorflow:Recording summary at step 5355.
INFO:tensorflow:Saving checkpoint to path E:\Projects\TEXT DETECTION\chinese_text_in_the_wild\ctw-baseline-master\classification\products\train_logs_alexnet_v2\model.ckpt
INFO:tensorflow:Recording summary at step 5359.
INFO:tensorflow:global step 5360: loss = 6.8168 (29.48 sec/step)
INFO:tensorflow:Recording summary at step 5363.
INFO:tensorflow:Recording summary at step 5367.
INFO:tensorflow:global step 5370: loss = 6.8478 (29.09 sec/step)
INFO:tensorflow:Recording summary at step 5371.
INFO:tensorflow:Recording summary at step 5375.
INFO:tensorflow:Recording summary at step 5376.
INFO:tensorflow:global step 5380: loss = 6.8576 (30.47 sec/step)
INFO:tensorflow:Recording summary at step 5380.
INFO:tensorflow:Recording summary at step 5384.
INFO:tensorflow:Recording summary at step 5388.
INFO:tensorflow:global step 5390: loss = 6.8722 (30.95 sec/step)
INFO:tensorflow:Recording summary at step 5392.
```
@yuantailing
Owner

yuantailing commented Mar 12, 2019

The AlexNet v2 model is relatively small. It runs at about 0.2 sec/step on an NVIDIA GTX TITAN X.

I have never seen such slow training progress. Are you running any other GPU programs concurrently? A restart may help.

You can change the summary interval to a very large number (e.g., 999999) to effectively disable summaries. That may help; a sketch of the change follows the line below.

```
'save_summaries_secs': '120',
```
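A minimal sketch of that edit, assuming the flag lives in a plain Python dict as shown above (the exact file and variable name in ctw-baseline may differ):

```python
# Hypothetical location: wherever ctw-baseline defines its training flags.
train_flags = {
    'save_summaries_secs': '999999',  # was '120'; ~11.5 days between summaries
}
```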

In my experience, the `CUDA_ERROR_OUT_OF_MEMORY` warning is inessential -- TensorFlow opportunistically picks a memory-hungry algorithm when a large amount of GPU memory is available, but it works well with less memory too. The NVIDIA GTX TITAN X has only 12 GB of memory, in fact.
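If the warning is distracting, standard TensorFlow 1.x session configuration can make the allocator grow on demand instead of reserving nearly all GPU memory up front; this is generic `tf.ConfigProto` usage, not a ctw-baseline-specific setting:

```python
import tensorflow as tf

# Grow the GPU allocation on demand rather than attempting one ~16 GB
# allocation at startup; this typically avoids the one-time
# CUDA_ERROR_OUT_OF_MEMORY message from that initial attempt.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```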

@mrxiaohe
Author

Thanks for the response! I don't have any other GPU programs running concurrently. What would be the best way to check whether the GPU is actually being used? I pip-installed a Python module called GPUtil, which shows that 96% of GPU memory is being used, but it doesn't seem to report whether the GPU is actually doing any computation:

[screenshot: GPUtil output showing 96% GPU memory utilization]
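For what it's worth, GPUtil reports compute load separately from memory use via its documented `getGPUs()`, so a quick check along these lines distinguishes "memory is reserved" from "the GPU is computing":

```python
import GPUtil

# gpu.load is compute utilization (0.0-1.0); gpu.memoryUtil is the
# fraction of memory in use. High memoryUtil with near-zero load means
# the GPU holds memory but is mostly idle.
for gpu in GPUtil.getGPUs():
    print("GPU %d: load=%.0f%%  memory=%.0f%%"
          % (gpu.id, gpu.load * 100, gpu.memoryUtil * 100))
```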

@yuantailing
Owner

yuantailing commented Mar 12, 2019

On Linux, I run nvidia-smi, but I don't know how to do this on Windows.

```
nvidia-smi
```

Your GPU utilization should be very low -- the GPU can finish each step's computation in about 0.2 sec, but each step takes 30 sec. So for roughly 29 of every 30 seconds GPU utilization will read 0, and only in the remaining second will it be non-zero. It seems most of the time is spent preparing data and transferring it to GPU memory.
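If input preparation really is the bottleneck, overlapping it with GPU compute is the usual remedy. Here is a minimal sketch of that idea using `tf.data` -- note that ctw-baseline's classification training is built on TF-slim, so this illustrates the general technique rather than a drop-in patch, and the feature spec and file name below are hypothetical:

```python
import tensorflow as tf

def parse_and_augment(record):
    # Hypothetical feature spec; the real one depends on how the
    # CTW TFRecords were written.
    feats = tf.parse_single_example(record, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(feats['image'], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, feats['label']

dataset = (tf.data.TFRecordDataset(['train.tfrecord'])    # placeholder path
           .map(parse_and_augment, num_parallel_calls=8)  # decode/augment in parallel on the CPU
           .batch(64)
           .prefetch(2))                                  # keep batches ready while the GPU trains
```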

@mrxiaohe
Author

Thanks for following up so quickly! I just ran nvidia-smi on my Windows machine. It looks like Python (the session in which training is running) uses almost all the GPU memory, but volatile GPU utilization is 0:

[screenshot: nvidia-smi output showing Python holding nearly all GPU memory, with Volatile GPU-Util at 0%]
