Determine FSDP/deepspeed settings on device select. #883

Merged
merged 3 commits into from
Nov 29, 2023

Conversation

kallewoof
Contributor

Without this, the OS env check for accelerate will fail.

@NanoCode012
Collaborator

Could you provide an example of the error that you're trying to fix and how to reproduce that?

@kallewoof
Contributor Author

It's not a direct error, but a bug. Perhaps the behavior is intended, in which case it should be fixed in some other way.

In either case, unless I'm mistaken, the if case in

https://github.com/OpenAccess-AI-Collective/axolotl/blob/7ee3c4cacb1c7a0d92810247f014a7394e07db80/src/axolotl/utils/config.py#L37-L41

will never be true, unless the user manually does ACCELERATE_USE_something=true python -m ....

because choose_device() is called before setup_trainer() in every case.
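To illustrate the ordering issue, here is a minimal, self-contained sketch; all function and config names are illustrative, not axolotl's actual code. The env check in trainer setup can only succeed via a manual export, because device selection runs first and nothing before it sets the variable. Setting the env var from the config at device-selection time (the approach of this PR) makes the later check reflect the config:

```python
import os

def setup_deepspeed_env(cfg):
    # Proposed fix (hypothetical names): derive the accelerate env var from
    # the config at or before device selection, so the later check works.
    if cfg.get("deepspeed"):
        os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"

def choose_device(cfg):
    # Device selection; in the buggy flow this runs before anything in the
    # pipeline sets ACCELERATE_USE_DEEPSPEED.
    cfg["device"] = "cuda"

def setup_trainer(cfg):
    # Without the fix, this branch is reachable only if the user exported
    # the variable manually on the command line.
    if os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true":
        cfg["using_deepspeed"] = True

cfg = {"deepspeed": "ds_config.json"}
setup_deepspeed_env(cfg)  # run before device selection / trainer setup
choose_device(cfg)
setup_trainer(cfg)
print(cfg.get("using_deepspeed"))  # → True
```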

@NanoCode012
Collaborator

I'm not clear on that line. As I recall, the only env var used there is ACCELERATE_USE_DEEPSPEED.

@winglian
Collaborator

Thanks @kallewoof. I made a change to your PR, mostly naming the new function and calling it between the config validation and normalization steps (where the device setup happens).

@kallewoof
Contributor Author

@winglian Looks good to me.

@kallewoof
Contributor Author

kallewoof commented Nov 26, 2023

Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)

Edit: this:

eval_batch_size != micro_batch_size. This can lead to VRAM instability.

@winglian merged commit 71b7ea3 into axolotl-ai-cloud:main Nov 29, 2023
4 checks passed
@winglian
Collaborator

> Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)
>
> Edit: this:
>
> eval_batch_size != micro_batch_size. This can lead to VRAM instability.

I'll address this in a different PR

@NanoCode012
Collaborator

NanoCode012 commented Nov 29, 2023

> Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)
>
> Edit: this:
>
> eval_batch_size != micro_batch_size. This can lead to VRAM instability.
>
> I'll address this in a different PR

I think I just fixed this recently. #896

@kallewoof
Contributor Author

> Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)
>
> Edit: this:
>
> eval_batch_size != micro_batch_size. This can lead to VRAM instability.
>
> I'll address this in a different PR
>
> I think I just fixed this recently. #896

No, you circumvented one of the symptoms, but the underlying issue still remains. Normalization should happen before validation, to avoid having to put in code like in #896.

@kallewoof deleted the 202311-prep-loader branch November 29, 2023 14:20
@winglian
Collaborator

No, you circumvented one of the symptoms, but the underlying issue still remains. Normalization should happen before validation, to avoid having to put in code like in #896.

The reason we validate before normalization is that errors surfaced to the user could otherwise be harder to understand, since what they originally input may differ from the normalized values being validated. I'm open to figuring out a better way.

@kallewoof
Contributor Author

That makes sense. I think the opposite holds too, though: a user may be unnecessarily confused by an error or warning that does not reflect the actual parameters used in training, such as the VRAM warning mentioned above.
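The ordering trade-off discussed above can be shown with a minimal sketch (illustrative names, not axolotl's actual code): validating before normalizing warns on a None eval_batch_size even though normalization would make the values equal, while swapping the order silences the spurious warning.

```python
import warnings

def normalize(cfg):
    # Normalization: default eval_batch_size to micro_batch_size when unset.
    if cfg.get("eval_batch_size") is None:
        cfg["eval_batch_size"] = cfg["micro_batch_size"]

def validate(cfg):
    # Validation: warn when the two batch sizes differ.
    if cfg.get("eval_batch_size") != cfg["micro_batch_size"]:
        warnings.warn(
            "eval_batch_size != micro_batch_size. This can lead to VRAM instability."
        )

def run(order):
    # Run the given pipeline steps and count warnings raised.
    cfg = {"micro_batch_size": 2, "eval_batch_size": None}
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for step in order:
            step(cfg)
    return len(caught)

print(run([validate, normalize]))  # validate-first: → 1 (spurious warning)
print(run([normalize, validate]))  # normalize-first: → 0
```

The counterpoint in the thread also shows up here: with normalize-first, any validation message would describe the normalized value rather than what the user actually wrote in the config.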

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
…#883)

* Determine FSDP/deepspeed settings on device select.

Without this, the OS env check for accelerate will fail.

* rename and move env setup call

* chore: lint

---------

Co-authored-by: Karl-Johan Alm <kalle@gmail.com>
Co-authored-by: Wing Lian <wing.lian@gmail.com>