
FP16 training fix #1560

Merged: 6 commits merged into OpenNMT:master on Sep 13, 2019

Conversation

vince62s
Member

Here is the issue:

For FP16 training, we use Nvidia/apex.
They changed their API to a new one called AMP, and we adapted to this new API on June 13.

However, a few things in this new API are broken when it comes to handling FusedAdam.

On Aug 8 they included FusedAdam in the new AMP API, but I realized that:

  1. The O2 level does not work at all, for either Adam or FusedAdam (O2 was our default level); see NVIDIA/apex#475.
  2. The O1 level works fine with Adam, but does not work (it is unstable) with FusedAdam.

This PR will:

  1. Default to the O1 level with Adam/AMP (FP16 training only).
  2. Use the legacy FusedAdam code that we have included in optimizer.py.

This solution should be temporary, since Nvidia is working on including AMP directly in PyTorch.

As of this PR, accuracy/PPL are OK for Adam FP32, Adam FP16, and FusedAdam FP16.
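For reference, here is a minimal sketch of what training at the O1 opt level looks like with apex AMP. The model, optimizer settings, and training loop below are placeholders for illustration, not OpenNMT-py code:

    import torch
    from apex import amp

    # Placeholder model and optimizer; OpenNMT-py builds these through its own code paths.
    model = torch.nn.Linear(512, 512).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

    # O1: apex patches whitelisted torch functions to run in FP16 while the model
    # weights themselves stay in FP32 (unlike O2, which casts the model to FP16).
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    for step in range(100):                        # placeholder training loop
        x = torch.randn(32, 512, device="cuda")
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()              # dummy loss
        # Dynamic loss scaling: scale the loss so FP16 gradients do not underflow.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()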

@vince62s vince62s merged commit f79f83c into OpenNMT:master Sep 13, 2019
@vince62s vince62s deleted the fixapex branch August 17, 2022 15:04