Commit
[feat] Add PyTorch Profiler. (#5560)
* add profiler

* add profiler

* update

* resolve flake8

* update doc

* update changelog

* clean doc

* delete prof file

* merge pr codebase

* update

* update doc

* update doc

* update doc

* update on comments

* update docstring

* update docstring

* try

* update test

* Update pytorch_lightning/profiler/__init__.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/profiler/__init__.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update on comments

* remove old code

* add support for ddp

* resolve flake8

* Update pytorch_lightning/profiler/__init__.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* resolve tests

* resolve flake8

Co-authored-by: Jirka Borovec <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
4 people authored Jan 26, 2021
1 parent f782230 commit 5f33728
Showing 13 changed files with 500 additions and 13 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -141,3 +141,4 @@ pytorch\ lightning
test-reports/
wandb
.forked/
*.prof
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -59,6 +59,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- `Recall` and `Precision` metrics (and their functional counterparts `recall` and `precision`) can now be generalized to Recall@K and Precision@K with the use of `top_k` parameter ([#4842](https://github.com/PyTorchLightning/pytorch-lightning/pull/4842))


- Added `PyTorchProfiler` ([#5560](https://github.com/PyTorchLightning/pytorch-lightning/pull/5560))


### Changed

5 changes: 3 additions & 2 deletions pytorch_lightning/core/memory.py
@@ -16,7 +16,7 @@
import shutil
import subprocess
from collections import OrderedDict
from typing import Tuple, Dict, Union, List, Any
from typing import Any, Dict, List, Tuple, Union

import numpy as np
import torch
@@ -182,7 +182,8 @@ def __init__(self, model, mode: str = MODE_DEFAULT):
self._model = model
self._mode = mode
self._layer_summary = self.summarize()
self._precision_megabytes = (self._model.precision / 8.0) * 1e-6 # 1 byte -> 8 bits
# 1 byte -> 8 bits
self._precision_megabytes = (self._model.precision / 8.0) * 1e-6

@property
def named_modules(self) -> List[Tuple[str, nn.Module]]:
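For context, a small worked sketch (not part of the diff) of the bits-to-megabytes conversion this hunk refactors::

    # precision is given in bits: divide by 8 for bytes, scale by 1e-6 for MB.
    precision_bits = 32                      # e.g. full-precision weights
    precision_megabytes = (precision_bits / 8.0) * 1e-6
    assert precision_megabytes == 4e-6       # 4 bytes -> 4e-6 MB per parameter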
89 changes: 87 additions & 2 deletions pytorch_lightning/profiler/__init__.py
@@ -50,7 +50,7 @@
Advanced Profiling
--------------------
------------------
If you want more information on the functions called during each event, you can use the `AdvancedProfiler`.
This option uses Python's cProfiler_ to provide a report of time spent on *each* function called within your code.
@@ -114,13 +114,98 @@ def custom_processing_step(self, data):
model = MyModel(profiler)
trainer = Trainer(profiler=profiler, max_epochs=1)
PyTorch Profiling
-----------------
Autograd includes a profiler that lets you inspect the cost of different operators
inside your model, on both the CPU and GPU.
Find the PyTorch Profiler documentation at `PyTorch Profiler <https://pytorch-lightning.readthedocs.io/en/stable/profiler.html>`_.
.. code-block:: python

    trainer = Trainer(..., profiler="pytorch")

or

.. code-block:: python

    profiler = PyTorchProfiler(...)
    trainer = Trainer(..., profiler=profiler)
This profiler works with PyTorch ``DistributedDataParallel``.
If ``output_filename`` is provided, each rank will save its profiled operations to its own file.
The profiler's results will be printed on the completion of ``fit()``. Since this report
can be quite long, you can also specify an ``output_filename`` to save the report
instead of logging it to your terminal.
By default, this profiler records only the ``training_step_and_backward``, ``evaluation_step``
and ``test_step`` functions. The output below shows the profiling for the action
``training_step_and_backward``. Pass ``PyTorchProfiler(profiled_functions=[...])`` to extend
the scope of profiled functions, as in the sketch that follows.
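For example, a minimal sketch (the output path is hypothetical; the action names are the documented defaults):

.. code-block:: python

    profiler = PyTorchProfiler(
        output_filename="profile_report.txt",  # hypothetical path; one file per rank under DDP
        profiled_functions=["training_step_and_backward", "evaluation_step", "test_step"],
    )
    trainer = Trainer(profiler=profiler, max_epochs=1)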
.. note:: When using the PyTorch Profiler, wall clock time will not be representative of the true wall clock time. This is due to forcing profiled operations to be measured synchronously, when many CUDA ops happen asynchronously. It is recommended to use this Profiler to find bottlenecks/breakdowns, however for end-to-end wall clock time use the ``SimpleProfiler``. # noqa E501
.. code-block:: python
Profiler Report
Profile stats for: training_step_and_backward
--------------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg
--------------------- --------------- --------------- --------------- --------------- ---------------
t 62.10% 1.044ms 62.77% 1.055ms 1.055ms
addmm 32.32% 543.135us 32.69% 549.362us 549.362us
mse_loss 1.35% 22.657us 3.58% 60.105us 60.105us
mean 0.22% 3.694us 2.05% 34.523us 34.523us
div_ 0.64% 10.756us 1.90% 32.001us 16.000us
ones_like 0.21% 3.461us 0.81% 13.669us 13.669us
sum_out 0.45% 7.638us 0.74% 12.432us 12.432us
transpose 0.23% 3.786us 0.68% 11.393us 11.393us
as_strided 0.60% 10.060us 0.60% 10.060us 3.353us
to 0.18% 3.059us 0.44% 7.464us 7.464us
empty_like 0.14% 2.387us 0.41% 6.859us 6.859us
empty_strided 0.38% 6.351us 0.38% 6.351us 3.175us
fill_ 0.28% 4.782us 0.33% 5.566us 2.783us
expand 0.20% 3.336us 0.28% 4.743us 4.743us
empty 0.27% 4.456us 0.27% 4.456us 2.228us
copy_ 0.15% 2.526us 0.15% 2.526us 2.526us
broadcast_tensors 0.15% 2.492us 0.15% 2.492us 2.492us
size 0.06% 0.967us 0.06% 0.967us 0.484us
is_complex 0.06% 0.961us 0.06% 0.961us 0.481us
stride 0.03% 0.517us 0.03% 0.517us 0.517us
--------------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 1.681ms
When running with ``PyTorchProfiler(emit_nvtx=True)``, you should run nvprof as follows::
nvprof --profile-from-start off -o trace_name.prof -- <regular command here>
To visualize the profiled operations, you can either:
* Use::
nvvp trace_name.prof
* Use::
python -c 'import torch; print(torch.autograd.profiler.load_nvprof("trace_name.prof"))'
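Or, as a small sketch (not part of this commit), inspect the loaded events programmatically;
this assumes each returned event exposes ``name`` and ``cpu_time_total`` attributes:

.. code-block:: python

    import torch

    # Load the nvprof trace and print the ten ops with the largest CPU time.
    events = torch.autograd.profiler.load_nvprof("trace_name.prof")
    for evt in sorted(events, key=lambda e: e.cpu_time_total, reverse=True)[:10]:
        print(evt.name, evt.cpu_time_total)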
"""

from pytorch_lightning.profiler.profilers import AdvancedProfiler, BaseProfiler, PassThroughProfiler, SimpleProfiler
from pytorch_lightning.profiler.profilers import (
AdvancedProfiler,
BaseProfiler,
PassThroughProfiler,
PyTorchProfiler,
SimpleProfiler,
)

__all__ = [
'BaseProfiler',
'SimpleProfiler',
'AdvancedProfiler',
'PassThroughProfiler',
"PyTorchProfiler",
]
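As an end-to-end usage sketch (a hypothetical script, not part of the diff), the new profiler can be enabled either way described in the docstring above::

    from pytorch_lightning import Trainer
    from pytorch_lightning.profiler import PyTorchProfiler

    # Either pass the string alias...
    trainer = Trainer(profiler="pytorch", max_epochs=1)

    # ...or configure the profiler explicitly and save the report to disk.
    profiler = PyTorchProfiler(output_filename="pytorch_profile.txt")
    trainer = Trainer(profiler=profiler, max_epochs=1)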