Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add checkpoint save interval variable. #301

Merged
merged 7 commits into from
Aug 23, 2021

Conversation

DriesSmit
Copy link
Contributor

@DriesSmit DriesSmit commented Aug 19, 2021

What?

Add an option in all the systems to set the time interval in minutes between checkpoint saving.

Why?

Helps address an issue where the default interval value of 15 minutes is too long when trying to debug if checkpoint loading is working.

How?

Made minor changes in each system's system.py, builder.py and trainer.py files.

Extra

.

arnupretorius
arnupretorius previously approved these changes Aug 19, 2021
Copy link
Collaborator

@arnupretorius arnupretorius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change @DriesSmit, looks good 👍

@arnupretorius
Copy link
Collaborator

arnupretorius commented Aug 19, 2021

The failing checks seem to be an AutoRom issue, so we should be able to get this in once Kale-ab's manual install PR (#300) is in.

@KaleabTessera
Copy link
Contributor

Thanks @DriesSmit , this is useful! Any reason to set the default for checkpoint_minute_interval to 5, as opposed to 15? Do you think 15 minutes between checkpoints is too long?

I also pushed a commit removing system tests from github actions on this PR branch. @arnupretorius I think that automatically made your review stale on github :( _

Copy link
Contributor

@KaleabTessera KaleabTessera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DriesSmit I tried one of the examples and this fails. Here is the stack trace:
Run Cmd: python examples/petting_zoo/sisl/multiwalker/feedforward/decentralised/run_maddpg.py
Stack:

Traceback (most recent call last):
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 92, in <module>
[trainer/0]     app.run(main)
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/absl/app.py", line 300, in run
[trainer/0]     _run_main(main, args)
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
[trainer/0]     sys.exit(main(argv))
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 87, in main
[trainer/0]     functions[task_id]()
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 72, in _construct_function
[trainer/0]     return functools.partial(self._function, *args, **kwargs)()
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 107, in run
[trainer/0]     instance = self._construct_instance()  # pytype:disable=wrong-arg-types
[trainer/0]   File "/home/kaleab/anaconda3/envs/mava-exp/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 165, in _construct_instance
[trainer/0]     return self._constructor(*args, **kwargs)
[trainer/0]   File "/home/kaleab/Documents/Code/Mava/Mava/mava/systems/tf/maddpg/system.py", line 377, in trainer
[trainer/0]     return self._builder.make_trainer(
[trainer/0]   File "/home/kaleab/Documents/Code/Mava/Mava/mava/systems/tf/maddpg/builder.py", line 417, in make_trainer
[trainer/0]     trainer = self._trainer_fn(**trainer_config)
[trainer/0] TypeError: __init__() missing 1 required positional argument: 'checkpoint_minute_interval'

@DriesSmit
Copy link
Contributor Author

DriesSmit commented Aug 19, 2021

Thanks @DriesSmit , this is useful! Any reason to set the default for checkpoint_minute_interval to 5, as opposed to 15? Do you think 15 minutes between checkpoints is too long?

I also pushed a commit removing system tests from github actions on this PR branch. @arnupretorius I think that automatically made your review stale on github :( _

Hey @KaleabTessera :) I chose 5 minutes because I thought it might mitigate some confusion in the future. One of the Mava issuers had a problem with not being able to load their trained model. The problem was that they were training in less than 15 minutes. So I thought 5 minutes strikes a good balance while still allowing enough time between saves. We can change the default back to 15 minutes if you think 5 minutes is to small. Let me know 👍 Thanks for pointing out the bug in MADDPG 💯 I think it should be fixed now.

Copy link
Contributor

@KaleabTessera KaleabTessera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DriesSmit , looks good 👍 Personally, I think 15 minutes might be a better default, striking a balance between writing to disk performance and restoring the latest model, but I am happy to leave this decision up to you :)

Copy link
Collaborator

@arnupretorius arnupretorius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DriesSmit 😄

@arnupretorius
Copy link
Collaborator

Thanks @DriesSmit , looks good 👍 Personally, I think 15 minutes might be a better default, striking a balance between writing to disk performance and restoring the latest model, but I am happy to leave this decision up to you :)

I think let's leave it at 5 for now, to cover the debugging env case that brought this up. If we feel we need to change it, we can update it in a separate PR.

@arnupretorius arnupretorius merged commit ab63d5b into develop Aug 23, 2021
@arnupretorius arnupretorius deleted the feature/set-checkpoint-interval branch August 23, 2021 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants