feat(python): support checkpointing for moe #242
Conversation
Remaining comments that could not be posted as review comments, to avoid the GitHub rate limit:
blackfmt
bagua/torch_api/checkpointing.py: lines 148, 153, 158, 165, 187, 191, 195, 198, 201, 206, 211, 213, 216, 219, 223, 225
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…to moe_checkpoint
see comments
def _save_moe_checkpoint(
    iteration, checkpoints_path, num_experts, model, optimizer=None, lr_scheduler=None
This lacks type annotations.
done
The `_xxx` functions still lack type annotations.
done
def save_checkpoint(
    iteration, checkpoints_path, model, optimizer=None, lr_scheduler=None
This lacks type annotations; the same applies to the other functions.
done
def load_checkpoint(
    checkpoints_path: str,
    model: BaguaModule,
    optimizer: torch.optim.Optimizer = None,
Optional[xxx]
done
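The reviewer's point: `optimizer: torch.optim.Optimizer = None` claims the argument is always an `Optimizer` while defaulting it to `None`; PEP 484 expects such parameters to be annotated `Optional[...]`. A minimal sketch of the fix, using stub classes for `BaguaModule` and `torch.optim.Optimizer` so it is self-contained:

```python
from typing import Optional

# Stubs for the real bagua/torch types (assumption for illustration).
class BaguaModule: ...
class Optimizer: ...

# Before (flagged): optimizer: Optimizer = None
# After: the annotation admits None explicitly.
def load_checkpoint(
    checkpoints_path: str,
    model: BaguaModule,
    optimizer: Optional[Optimizer] = None,
) -> None:
    if optimizer is not None:
        ...  # restore optimizer state here
```

The same change applies to every `= None` parameter flagged below.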
    checkpoints_path: str,
    model: BaguaModule,
    optimizer: torch.optim.Optimizer = None,
    lr_scheduler: torch.optim.lr_scheduler._LRScheduler = None,
Optional[xxx]
done
    iteration: int,
    checkpoints_path: str,
    model: BaguaModule,
    optimizer: torch.optim.Optimizer = None,
Optional[xxx]
done
def _get_moe_state_dict(
    full_state_dict: Dict[str, torch.Tensor],
    num_local_experts: int,
    expp_rank: int,
What is `expp` short for?