lightning and apex amp performance not improved #2699

Closed
zhenhuahu opened this issue Jul 24, 2020 · 5 comments
Labels
question Further information is requested

Comments

@zhenhuahu

zhenhuahu commented Jul 24, 2020

❓ lightning and apex amp performance not improved

I'm trying to use Lightning and Apex amp to speed up DDP training. I tried amp_level O0, O1, O2, and O3, and they all take almost the same time (around 45 minutes).

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_dataset, batch_size=2, shuffle=True, num_workers=4)
val_loader = DataLoader(dataset=val_dataset, batch_size=1, shuffle=False, num_workers=4)

trainer = pl.Trainer(gpus=8, num_nodes=1, distributed_backend='ddp', precision=16, amp_level='O1')
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
```

I didn't change the batch size to be a multiple of 8 because I saw this post and my cuDNN version is 7.6.5.

Thanks!

What have you tried?

I also tried torch.backends.cudnn.benchmark = True but got no improvement.

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: latest
zhenhuahu added the question label on Jul 24, 2020
@ibeltagy
Contributor

Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.
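For example (a minimal sketch; the exact arguments depend on your Lightning version):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import AdvancedProfiler

# SimpleProfiler: wall-clock time per hook (data loading vs. forward/backward)
trainer = Trainer(gpus=8, distributed_backend='ddp', profiler=True)

# AdvancedProfiler: full cProfile stats for each hook
trainer = Trainer(gpus=8, distributed_backend='ddp', profiler=AdvancedProfiler())
```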

@zhenhuahu
Author

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

Hi, thanks for the reply. The GPU utilization is quite high without fp16, but the memory usage is not high.
[screenshot: nvidia-smi without amp (amp0_gpu)]

Below is the result of using amp_level = 'O1'. I can see the memory usage is reduced a little, but the utilization is still very high.
[screenshot: nvidia-smi with amp_level 'O1' (amp_O1_gpu_usage)]

I'll try the profiler to compare the time of dataset reader and forward/backward pass. Thank you so much!

@zhenhuahu
Author

zhenhuahu commented Jul 26, 2020

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

Hi, thanks for the reply. I used the basic profiler and got the following results. In 'O0', the result is
[screenshot: basic profiler output, 8 GPUs, amp 'O0' (profiler-8gpu_numdown2_amp_O0)]
and I can see that 'model_forward' uses much more time than 'model_backward'.
I then used 'AdvancedProfiler' and the result is

```
Profile stats for: model_forward
        236550 function calls (216450 primitive calls) in 24.804 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      150    0.001    0.000   24.798    0.165 training_loop.py:898(training_forward)
17250/150    0.058    0.000   24.793    0.165 module.py:522(__call__)
      150    0.011    0.000   24.790    0.165 data_parallel.py:80(forward)
      150    0.013    0.000   23.086    0.154 pl_fuse_model.py:54(training_step)
      300    0.018    0.000   22.077    0.074 utils.py:120(normalize_batch)
      600   22.053    0.037   22.053    0.037 {method 'new_tensor' of 'torch._C._TensorBase' objects}
      150    0.001    0.000    1.592    0.011 distributed.py:470(scatter)
      150    0.002    0.000    1.591    0.011 scatter_gather.py:34(scatter_kwargs)
      150    0.000    0.000    1.589    0.011 scatter_gather.py:5(scatter)
```

So here `{method 'new_tensor' of 'torch._C._TensorBase' objects}` consumes most of the time.

In order to use more fp16, I used 'O2' and got this profile:
[screenshot: basic profiler output, 8 GPUs, amp 'O2' (profiler-8gpu_numdown2_amp_O2)]
I can see that 'model_forward' does use much less time, but 'model_backward' now uses much more, even more than in the 'O0' case.
The result from 'AdvancedProfiler' is

```
Profile stats for: model_backward
        19430 function calls (19351 primitive calls) in 17.216 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      450    0.013    0.000   15.975    0.036 {built-in method builtins.next}
      300    0.010    0.000   15.961    0.053 handle.py:16(scale_loss)
      300    0.001    0.000   15.948    0.053 contextlib.py:85(__exit__)
      150    0.001    0.000   15.898    0.106 scaler.py:197(update_scale)
      150   15.897    0.106   15.897    0.106 {method 'item' of 'torch._C._TensorBase' objects}
      150    0.001    0.000    1.237    0.008 hooks.py:168(backward)
      150    0.001    0.000    1.236    0.008 tensor.py:167(backward)
      150    0.001    0.000    1.235    0.008 __init__.py:44(backward)
```

So here there is a new call, `{method 'item' of 'torch._C._TensorBase' objects}`, which consumes a lot of time and doesn't appear when I use 'O0' mode.

I don't know what `{method 'item' of 'torch._C._TensorBase' objects}` is doing. I'm not sure if it has something to do with the casting of FloatTensor to HalfTensor. Since I'm using a VGG loss in the program, I have to cast the output of the model to HalfTensor, otherwise it raises an error.
[screenshot: dtype-mismatch error from the VGG loss (vgg_halftensor)]
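The cast looks roughly like this (a simplified sketch, not my actual code; `vgg_features`, `output`, and `target` are illustrative names):

```python
import torch.nn.functional as F

def vgg_loss(vgg_features, output, target):
    # Under apex O2 the VGG weights are kept in half precision, so the model output
    # (a FloatTensor) must be cast to the same dtype to avoid the type-mismatch error.
    dtype = next(vgg_features.parameters()).dtype
    return F.l1_loss(vgg_features(output.to(dtype)), vgg_features(target.to(dtype)))
```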

Thank you so much!

@zhenhuahu
Author

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

I commented out the VGG loss and found that the bottleneck in model_backward (`{method 'item' of 'torch._C._TensorBase' objects}`) is still there, but model_forward no longer has the `{method 'new_tensor' of 'torch._C._TensorBase' objects}` bottleneck.

@zhenhuahu
Author

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

It turns out that I used `.new_tensor` to normalize the input every time I called the VGG loss, which caused a lot of extra work in model_forward. Using PyTorch 1.6 native amp instead of Apex solved the model_backward problem. I don't know why Apex amp doesn't help; maybe it's because I use 'ddp' as the backend.
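Roughly, the forward-pass fix looks like this (a simplified sketch, not my exact `normalize_batch`; the ImageNet mean/std values are just an example):

```python
import torch

# Build the normalization constants once instead of calling tensor.new_tensor(...)
# on every call; new_tensor allocates a fresh tensor each time, which is what
# dominated model_forward in the profile above.
_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize_batch(batch):
    # move the cached constants to the batch's device/dtype; no per-call allocation
    mean = _MEAN.to(device=batch.device, dtype=batch.dtype)
    std = _STD.to(device=batch.device, dtype=batch.dtype)
    return (batch - mean) / std
```

For the backward pass, the `{method 'item' ...}` time comes from Apex's dynamic loss scaler reading its overflow flag back to the CPU every step; with native amp I just pass `precision=16` and drop `amp_level`.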

Thanks for your help!
