
Enable HF SpeedMonitor #997

Open · rlrs wants to merge 8 commits into main
Conversation

@rlrs commented Feb 26, 2024

Enable SpeedMonitor on HF models by using PyTorch FlopCounterMode to calculate model FLOPs.
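A minimal sketch of the idea (a toy model stands in for an HF model here; this is not the exact code in the PR):

```python
import torch
from torch import nn
from torch.utils.flop_counter import FlopCounterMode

# Toy model standing in for an HF model; the HF wrapper runs its
# forward under the counter in the same way.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(8, 512)

counter = FlopCounterMode(display=False)
with counter:
    model(x)  # forward pass only

fwd_flops = counter.get_total_flops()
# SpeedMonitor wants per-batch FLOPs; assume backward ~= 2x forward.
flops_per_batch = 3 * fwd_flops
print(f"fwd: {fwd_flops:,} FLOPs, fwd+bwd estimate: {flops_per_batch:,}")
```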

@rlrs requested a review from a team as a code owner · Feb 26, 2024 18:27
@rlrs (Author) commented Feb 26, 2024

Oops, some of these changes are for our internal use. I'll remove them from this PR.

@dakinggg (Collaborator) commented:
Hey @rlrs, thanks for the contribution! I didn't know about this PyTorch flop counter! We'll want to do a bit of testing to make sure that this reports the correct number and doesn't cause any issues with (1) speed, (2) memory usage, or (3) bad interactions with distributed training strategies like FSDP. What testing of this have you been able to do yourself?

@rlrs (Author) commented Feb 27, 2024

Apologies for the lack of explanation or tests, I rushed this a bit.

So far I've used this with Mistral 7B, comparing against the standard Transformer Math 6PD calculation, and the results are quite close. That said, I rely on one of the same assumptions, namely that the backward pass costs 2x the forward pass. It is possible to wrap fwd+bwd in FlopCounterMode instead of just the forward, but that seems more complicated to me, since the counting code would then have to live outside the HF model wrapper, which is where the model FLOPs are returned from; see the sketch below.
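For reference, a sketch of that fwd+bwd variant (toy linear layer; the backward pass is measured rather than assumed, with a Transformer-Math-style 6PD estimate computed alongside for comparison):

```python
import torch
from torch import nn
from torch.utils.flop_counter import FlopCounterMode

model = nn.Linear(1024, 1024, bias=False)
# requires_grad on the input so the input gradient is computed too,
# as it would be for the interior layers of a deep network.
x = torch.randn(16, 1024, requires_grad=True)

with FlopCounterMode(display=False) as counter:
    model(x).sum().backward()  # both forward AND backward are counted

measured = counter.get_total_flops()
# 6PD-style estimate: 6 * params * tokens (one "token" per batch row here).
estimate = 6 * sum(p.numel() for p in model.parameters()) * x.shape[0]
print(measured, estimate)  # both 100,663,296 for this layer
```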

One uncertainty I have is how the FLOP counter interacts with non-PyTorch constructs like Flash Attention. I suspect that it might be necessary to register FLOP formulas for such ops manually in order to get the correct result. If so, the counter might be silently underreporting FLOPs right now.
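If manual registration is needed, FlopCounterMode takes a `custom_mapping` from ops to FLOP formulas. Here's a sketch of the mechanism using `aten.mm` as a stand-in; for flash-attn you would instead map its dispatched custom op (the op name depends on the flash-attn build) to an attention formula, and the formula signature may vary across PyTorch versions:

```python
import torch
from torch.utils.flop_counter import FlopCounterMode

def mm_flops(a_shape, b_shape, *args, out_shape=None, **kwargs):
    # The counter passes input/output *shapes* to the formula, not tensors.
    m, k = a_shape
    k2, n = b_shape
    return 2 * m * n * k  # one multiply-add per output element per k-step

counter = FlopCounterMode(
    display=False,
    custom_mapping={torch.ops.aten.mm: mm_flops},  # op -> formula
)
with counter:
    torch.mm(torch.randn(64, 128), torch.randn(128, 32))

print(counter.get_total_flops())  # 2 * 64 * 32 * 128 = 524,288
```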

@dakinggg (Collaborator) commented Mar 1, 2024

No worries @rlrs! If you're able to do some testing (and add some unit tests), that would be great! Otherwise, we'll look into it when we get a chance. We appreciate the suggestion!
