lightning and apex amp performance not improved #2699

Closed
zhenhuahu opened this issue Jul 24, 2020 · 5 comments
Labels
question Further information is requested

Comments

@zhenhuahu

zhenhuahu commented Jul 24, 2020

❓ lightning and apex amp performance not improved

I'm trying to use Lightning and Apex amp to speed up DDP training. I tried amp_level O0, O1, O2, and O3, and they all take almost the same time (around 45 minutes).

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_dataset, batch_size=2, shuffle=True, num_workers=4)
val_loader = DataLoader(dataset=val_dataset, batch_size=1, shuffle=False, num_workers=4)

trainer = pl.Trainer(gpus=8, num_nodes=1, distributed_backend='ddp', precision=16, amp_level='O1')
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
```

I didn't change the batch size to be a multiple of 8 because I saw this post and my cuDNN version is 7.6.5.

Thanks!

What have you tried?

I also tried torch.backends.cudnn.benchmark = True but got no improvement.

What's your environment?

  • OS: Linux
  • Packaging: pip
  • Version: latest
zhenhuahu added the question label on Jul 24, 2020
@ibeltagy
Contributor

Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.
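For example (a minimal sketch; the exact arguments depend on your Lightning version):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import AdvancedProfiler

# SimpleProfiler: wall-clock time per hook (data loading vs. forward/backward)
trainer = Trainer(gpus=8, distributed_backend='ddp', profiler=True)

# AdvancedProfiler: full cProfile stats for each hook
trainer = Trainer(gpus=8, distributed_backend='ddp', profiler=AdvancedProfiler())
```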

@zhenhuahu
Author

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

Hi, thanks for the reply. The GPU utilization is quite high without fp16, but the memory usage is not high.
[screenshot: nvidia-smi without amp (amp0_gpu)]

Below is the result of using amp_level = 'O1'. I can see the memory usage is reduced a little, but the utilization is still very high.
[screenshot: nvidia-smi with amp_level 'O1' (amp_O1_gpu_usage)]

I'll try the profiler to compare the time of dataset reader and forward/backward pass. Thank you so much!

@zhenhuahu
Author

zhenhuahu commented Jul 26, 2020

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

Hi, thanks for the reply. I used the basic profiler and got the following results. In 'O0', the result is
[screenshot: basic profiler output, 8 GPUs, amp 'O0' (profiler-8gpu_numdown2_amp_O0)]
and I can see that 'model_forward' uses much more time than 'model_backward'.
I then used 'AdvancedProfiler' and the result is

```
Profile stats for: model_forward
        236550 function calls (216450 primitive calls) in 24.804 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      150    0.001    0.000   24.798    0.165 training_loop.py:898(training_forward)
17250/150    0.058    0.000   24.793    0.165 module.py:522(__call__)
      150    0.011    0.000   24.790    0.165 data_parallel.py:80(forward)
      150    0.013    0.000   23.086    0.154 pl_fuse_model.py:54(training_step)
      300    0.018    0.000   22.077    0.074 utils.py:120(normalize_batch)
      600   22.053    0.037   22.053    0.037 {method 'new_tensor' of 'torch._C._TensorBase' objects}
      150    0.001    0.000    1.592    0.011 distributed.py:470(scatter)
      150    0.002    0.000    1.591    0.011 scatter_gather.py:34(scatter_kwargs)
      150    0.000    0.000    1.589    0.011 scatter_gather.py:5(scatter)
```

So here `{method 'new_tensor' of 'torch._C._TensorBase' objects}` consumes most of the time.

In order to use more fp16, I used 'O2' and got this profile:
[screenshot: basic profiler output, 8 GPUs, amp 'O2' (profiler-8gpu_numdown2_amp_O2)]
I can see that 'model_forward' does use much less time, but 'model_backward' now uses much more, even more than in the 'O0' case.
The result from 'AdvancedProfiler' is

```
Profile stats for: model_backward
        19430 function calls (19351 primitive calls) in 17.216 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      450    0.013    0.000   15.975    0.036 {built-in method builtins.next}
      300    0.010    0.000   15.961    0.053 handle.py:16(scale_loss)
      300    0.001    0.000   15.948    0.053 contextlib.py:85(__exit__)
      150    0.001    0.000   15.898    0.106 scaler.py:197(update_scale)
      150   15.897    0.106   15.897    0.106 {method 'item' of 'torch._C._TensorBase' objects}
      150    0.001    0.000    1.237    0.008 hooks.py:168(backward)
      150    0.001    0.000    1.236    0.008 tensor.py:167(backward)
      150    0.001    0.000    1.235    0.008 __init__.py:44(backward)
```

So here there is a new call, `{method 'item' of 'torch._C._TensorBase' objects}`, which consumes a lot of time and doesn't appear when I use 'O0' mode.

I don't know what `{method 'item' of 'torch._C._TensorBase' objects}` is doing. I'm not sure if it has something to do with the casting of FloatTensor to HalfTensor. Since I'm using a VGG loss in the program, I have to cast the output of the model to HalfTensor, otherwise it raises an error.
[screenshot: dtype-mismatch error from the VGG loss (vgg_halftensor)]
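The cast looks roughly like this (a simplified sketch, not my actual code; `vgg_features`, `output`, and `target` are illustrative names):

```python
import torch.nn.functional as F

def vgg_loss(vgg_features, output, target):
    # Under apex O2 the VGG weights are kept in half precision, so the model output
    # (a FloatTensor) must be cast to the same dtype to avoid the type-mismatch error.
    dtype = next(vgg_features.parameters()).dtype
    return F.l1_loss(vgg_features(output.to(dtype)), vgg_features(target.to(dtype)))
```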

Thank you so much!

@zhenhuahu
Author

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

I commented out the VGG loss and found that the bottleneck in model_backward (`{method 'item' of 'torch._C._TensorBase' objects}`) is still there, but model_forward no longer has the `{method 'new_tensor' of 'torch._C._TensorBase' objects}` bottleneck.

@zhenhuahu
Author

> Can you check your GPU utilization without fp16 and make sure it is high? Low GPU utilization indicates that GPU compute is not the bottleneck, and if that is the case, speeding up your forward/backward pass won't help much.
> You can also try the PyTorch Lightning profiler to see how much time is spent in the dataset reader vs. the forward/backward pass.

It turns out that I used `.new_tensor` to normalize the input every time I called the VGG loss, which caused a lot of extra work in model_forward. Using PyTorch 1.6 native amp instead of Apex solved the model_backward problem. I don't know why Apex amp doesn't help; maybe it's because I use 'ddp' as the backend.
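Roughly, the forward-pass fix looks like this (a simplified sketch, not my exact `normalize_batch`; the ImageNet mean/std values are just an example):

```python
import torch

# Build the normalization constants once instead of calling tensor.new_tensor(...)
# on every call; new_tensor allocates a fresh tensor each time, which is what
# dominated model_forward in the profile above.
_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize_batch(batch):
    # move the cached constants to the batch's device/dtype; no per-call allocation
    mean = _MEAN.to(device=batch.device, dtype=batch.dtype)
    std = _STD.to(device=batch.device, dtype=batch.dtype)
    return (batch - mean) / std
```

For the backward pass, the `{method 'item' ...}` time comes from Apex's dynamic loss scaler reading its overflow flag back to the CPU every step; with native amp I just pass `precision=16` and drop `amp_level`.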

Thanks for your help!
