feature: loss watchdog for terminating training runs that are failing #899

Merged: 1 commit into axolotl-ai-cloud:main from kallewoof:202311-loss-watchdog on Dec 4, 2023

Conversation

kallewoof (Contributor) commented:

This adds a loss watchdog, which stops the trainer if the loss exceeds a given threshold for more than loss_watchdog_patience consecutive steps. It is useful for avoiding wasted compute when training continues after the model has already broken down.

winglian (Collaborator) left a comment:

lgtm

@winglian winglian merged commit 58ec8b1 into axolotl-ai-cloud:main Dec 4, 2023
4 checks passed
@kallewoof kallewoof deleted the 202311-loss-watchdog branch December 4, 2023 13:49
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023