
[contrib] Improve FusedAdamSWA interface and add unit tests #1759

Merged
merged 2 commits on Dec 15, 2023

Conversation

@lirundong (Contributor) commented Dec 13, 2023

Why?

  • The FusedAdamSWA interface was loosely typed and error-prone
  • The training critical path of FusedAdamSWA (i.e., its step function) could incur an unnecessary GPU-host sync when grad_clip_scale is set to a non-CUDA-tensor value
  • FusedAdamSWA had no unit tests

What?

  • Encapsulate FusedAdamSWA math types and the internal numerical type into Python enumerations to improve type robustness and readability
  • Accept grad_clip_scale as either a tensor or a number; in the latter case it is moved to the GPU in a non-blocking manner to eliminate a GPU-host sync (see the sketch after this list)
  • Add unit tests to guarantee numerical correctness and demonstrate usage
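
To illustrate the grad_clip_scale change, here is a minimal sketch of the idea; the helper name is hypothetical and this is not the actual apex implementation:

```python
from numbers import Number
from typing import Optional, Union

import torch


def _grad_clip_scale_to_device(
    grad_clip_scale: Optional[Union[torch.Tensor, Number]],
    device: torch.device,
) -> Optional[torch.Tensor]:
    """Hypothetical helper: accept a tensor or a plain number for grad_clip_scale."""
    if grad_clip_scale is None:
        return None
    if isinstance(grad_clip_scale, torch.Tensor):
        # Already a tensor; assume the caller placed it on the right device.
        return grad_clip_scale
    # A plain number is wrapped into a tensor and copied to the GPU with
    # non_blocking=True, so the optimizer's step() does not pick up an extra
    # GPU-host synchronization point on the training critical path.
    return torch.tensor(float(grad_clip_scale), dtype=torch.float32).to(
        device, non_blocking=True
    )
```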

@lirundong (Contributor Author)

Maybe @crcrpar, would you please review? Thanks!

```python
kBF16 = 1
kFP32 = 2
kFP64 = 3
@unique
```
Collaborator

Didn't know this decorator existed

Contributor Author

Yeah, this is a neat decorator ensuring unique enum values ;)
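
For context, enum.unique is a standard-library class decorator that raises ValueError at class-definition time if two members share a value. Applied to the dtype values quoted above (the class name here is illustrative):

```python
from enum import Enum, unique


@unique
class KernelDType(Enum):  # illustrative name; values match the quoted diff
    kBF16 = 1
    kFP32 = 2
    kFP64 = 3


# Reusing a value would fail immediately:
#   @unique
#   class Broken(Enum):
#       A = 1
#       B = 1  # ValueError: duplicate values found in <enum 'Broken'>: B -> A
```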

Comment on lines +212 to +225
```python
params: List[nn.Parameter],
compute_params: List[nn.Parameter],
swa_params: List[nn.Parameter],
swa_decay_rate: float,
lr: float = 1e-3,
bias_correction: bool = True,
betas: Tuple[float, float] = (0.9, 0.999),
eps: float = 1e-8,
adam_math_mode: AdamMathType = AdamMathType.PyTorchAdam,
weight_decay: float = 0.0,
amsgrad: bool = False,
set_grad_none: bool = True,
capturable: bool = False,
master_weights: bool = False,
```
Collaborator

Nice
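
As a usage sketch of the typed constructor above: the import path, hyperparameter values, and the cloning of the parameter lists are assumptions for illustration, not the exact apex API.

```python
import torch
from torch import nn

# NOTE: assumed import path; the actual module defining FusedAdamSWA and
# AdamMathType lives somewhere under apex/contrib and may differ.
from apex.contrib.optimizers.fused_adam_swa import AdamMathType, FusedAdamSWA

model = nn.Linear(1024, 1024).cuda()
params = list(model.parameters())
# compute_params / swa_params are normally maintained by the training loop;
# they are cloned here only to keep the sketch self-contained.
compute_params = [nn.Parameter(p.detach().clone()) for p in params]
swa_params = [nn.Parameter(p.detach().clone()) for p in params]

optimizer = FusedAdamSWA(
    params=params,
    compute_params=compute_params,
    swa_params=swa_params,
    swa_decay_rate=0.99,  # illustrative value
    lr=1e-3,
    adam_math_mode=AdamMathType.PyTorchAdam,  # enum member instead of a loose int/str
)
```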

```python
chain(moments_gt, velocities_gt),
)
):
assert torch.allclose(m, m_gt, rtol=rtol, atol=atol)
```
Collaborator

Suggested change:

```diff
-assert torch.allclose(m, m_gt, rtol=rtol, atol=atol)
+torch.testing.assert_close(m, m_gt, rtol=rtol, atol=atol)
```

Contributor Author

Good catch! Fixed.

```python
chain(state_params_gt, compute_params_gt, swa_params_gt),
)
):
assert torch.allclose(p_test, p_gt, rtol=rtol, atol=atol)
```
Collaborator

Suggested change:

```diff
-assert torch.allclose(p_test, p_gt, rtol=rtol, atol=atol)
+torch.testing.assert_close(p_test, p_gt, rtol=rtol, atol=atol)
```

Contributor Author

Fixed.

@lirundong force-pushed the lirundong/adam-swa-remove-sync branch from d7101b4 to e7afced on December 14, 2023, 05:49
Collaborator

Ideally I want this to be compatible with the Python standard unittest module.
In this repository some files use PyTorch's TestCase class and runner, as in https://github.com/NVIDIA/apex/blob/37d83fce4dcbb59897dfd951906493a6fe7fae37/tests/L0/run_fused_layer_norm/test_fused_layer_norm.py

Contributor Author

Thanks for the pointer. I've refactored to unittest.TestCase.
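
For reference, a minimal skeleton of what a unittest-based test of this kind can look like, using torch.testing.assert_close as suggested above; this is a sketch, not the actual test added in this PR:

```python
import unittest

import torch


class TestFusedAdamSWA(unittest.TestCase):
    def setUp(self):
        if not torch.cuda.is_available():
            self.skipTest("FusedAdamSWA requires a CUDA device")
        torch.manual_seed(0)

    def test_matches_reference(self):
        # Placeholder comparison; the real test steps both the fused optimizer
        # and a reference PyTorch implementation, then compares parameters,
        # moments, and SWA weights with torch.testing.assert_close.
        actual = torch.randn(8, device="cuda")
        expected = actual.clone()
        torch.testing.assert_close(actual, expected, rtol=1e-5, atol=1e-5)


if __name__ == "__main__":
    unittest.main()
```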

Collaborator

thanks

@lirundong force-pushed the lirundong/adam-swa-remove-sync branch from e7afced to 499bfdc on December 14, 2023, 07:51
Collaborator

thanks

@crcrpar crcrpar added this to the 24.01 milestone Dec 15, 2023