-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add checkpoint save interval variable. #301
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this change @DriesSmit, looks good 👍
The failing checks seem to be an AutoRom issue, so we should be able to get this in once Kale-ab's manual install PR (#300) is in. |
Thanks @DriesSmit , this is useful! Any reason to set the default for I also pushed a commit removing system tests from github actions on this PR branch. @arnupretorius I think that automatically made your review stale on github :( _ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DriesSmit I tried one of the examples and this fails. Here is the stack trace:
Run Cmd: python examples/petting_zoo/sisl/multiwalker/feedforward/decentralised/run_maddpg.py
Stack:
Traceback (most recent call last):
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 92, in <module>
[trainer/0] app.run(main)
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/absl/app.py", line 300, in run
[trainer/0] _run_main(main, args)
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
[trainer/0] sys.exit(main(argv))
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 87, in main
[trainer/0] functions[task_id]()
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 72, in _construct_function
[trainer/0] return functools.partial(self._function, *args, **kwargs)()
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 107, in run
[trainer/0] instance = self._construct_instance() # pytype:disable=wrong-arg-types
[trainer/0] File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 165, in _construct_instance
[trainer/0] return self._constructor(*args, **kwargs)
[trainer/0] File "/home/kaleab/Documents/Code/Mava/Mava/mava/systems/tf/maddpg/system.py", line 377, in trainer
[trainer/0] return self._builder.make_trainer(
[trainer/0] File "/home/kaleab/Documents/Code/Mava/Mava/mava/systems/tf/maddpg/builder.py", line 417, in make_trainer
[trainer/0] trainer = self._trainer_fn(**trainer_config)
[trainer/0] TypeError: __init__() missing 1 required positional argument: 'checkpoint_minute_interval'
Hey @KaleabTessera :) I chose 5 minutes because I thought it might mitigate some confusion in the future. One of the Mava issuers had a problem with not being able to load their trained model. The problem was that they were training in less than 15 minutes. So I thought 5 minutes strikes a good balance while still allowing enough time between saves. We can change the default back to 15 minutes if you think 5 minutes is to small. Let me know 👍 Thanks for pointing out the bug in MADDPG 💯 I think it should be fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @DriesSmit , looks good 👍 Personally, I think 15 minutes might be a better default, striking a balance between writing to disk performance and restoring the latest model, but I am happy to leave this decision up to you :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @DriesSmit 😄
I think let's leave it at 5 for now, to cover the debugging env case that brought this up. If we feel we need to change it, we can update it in a separate PR. |
What?
Add an option in all the systems to set the time interval in minutes between checkpoint saving.
Why?
Helps address an issue where the default interval value of 15 minutes is too long when trying to debug if checkpoint loading is working.
How?
Made minor changes in each system's system.py, builder.py and trainer.py files.
Extra
.