Here is the issue:
For FP16 training, we use NVIDIA/apex.
They changed the API to something called AMP, and we adapted to this new API on June 13.
However, a few things in this new API are broken when it comes to handling FusedAdam.
On Aug 8 they included FusedAdam in the new AMP API, but I realized that:
The O2 level does not work at all, for either Adam or FusedAdam (O2 was our default level);
see NVIDIA/apex#475
The O1 level works fine with Adam but is unstable with FusedAdam.
This PR will:
This solution should be temporary, since Nvidia is working on including AMP directly in PyTorch.
As of this PR, accuracy/PPL are OK for Adam FP32, Adam FP16, and FusedAdam FP16.