feat(python): support checkpointing for moe #242

Merged: 25 commits, Oct 15, 2021

liuhatry
Member

No description provided.

Contributor

@github-actions bot left a comment

bagua/torch_api/checkpointing.py (outdated, resolved)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
liuhatry and others added 4 commits September 30, 2021 09:01
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@liuhatry marked this pull request as draft September 30, 2021 08:24
liuhatry and others added 4 commits September 30, 2021 16:58
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@liuhatry marked this pull request as ready for review September 30, 2021 09:58
Contributor

@NOBLES5E left a comment

See comments.

liuhatry and others added 2 commits October 8, 2021 15:45
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@liuhatry changed the title from "feat(python): support checkpointing for moe" to "feat: support checkpointing for moe" on Oct 8, 2021
@liuhatry changed the title from "feat: support checkpointing for moe" to "feat(python): support checkpointing for moe" on Oct 8, 2021


def _save_moe_checkpoint(
    iteration, checkpoints_path, num_experts, model, optimizer=None, lr_scheduler=None
Contributor
Missing type annotations.

Member Author
done

Contributor
Type annotations are still missing on the _xxx functions.

Member Author
done



def save_checkpoint(
    iteration, checkpoints_path, model, optimizer=None, lr_scheduler=None
Contributor
Missing type annotations; the same applies to the other functions.

Member Author
done

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
def load_checkpoint(
    checkpoints_path: str,
    model: BaguaModule,
    optimizer: torch.optim.Optimizer = None,
Contributor
Use Optional[xxx] here, since the default is None.

Member Author
done
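
For readers following the Optional[xxx] suggestion, here is a minimal sketch of what such a signature could look like. It is not the code merged in this PR: `load_checkpoint_sketch`, `Optimizer` and `LRScheduler` are hypothetical stand-ins (the real parameters are typed with torch.optim.Optimizer and torch.optim.lr_scheduler._LRScheduler), chosen so the snippet runs without PyTorch or Bagua installed.

```python
from typing import List, Optional


class Optimizer:
    """Hypothetical stand-in for torch.optim.Optimizer."""


class LRScheduler:
    """Hypothetical stand-in for torch.optim.lr_scheduler._LRScheduler."""


def load_checkpoint_sketch(
    checkpoints_path: str,
    model: object,  # BaguaModule in the PR; a plain object in this sketch
    optimizer: Optional[Optimizer] = None,
    lr_scheduler: Optional[LRScheduler] = None,
) -> List[str]:
    """Return the names of the optional components the caller supplied.

    A parameter whose default is None is annotated Optional[...], which
    tells type checkers and readers that None is an accepted value.
    """
    supplied = []
    if optimizer is not None:
        supplied.append("optimizer")
    if lr_scheduler is not None:
        supplied.append("lr_scheduler")
    return supplied
```

Annotating a None default as `torch.optim.Optimizer = None` relies on implicit Optional, which strict type checkers reject; spelling out Optional[...] makes the contract explicit.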

    checkpoints_path: str,
    model: BaguaModule,
    optimizer: torch.optim.Optimizer = None,
    lr_scheduler: torch.optim.lr_scheduler._LRScheduler = None,
Contributor
Use Optional[xxx] here, since the default is None.

Member Author
done

    iteration: int,
    checkpoints_path: str,
    model: BaguaModule,
    optimizer: torch.optim.Optimizer = None,
Contributor
Use Optional[xxx] here, since the default is None.

Member Author
done

def _get_moe_state_dict(
    full_state_dict: Dict[str, torch.Tensor],
    num_local_experts: int,
    expp_rank: int,
Contributor
What is expp short for?
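
The thread does not answer this question. As a hedged illustration only, the sketch below assumes that expp_rank means a process's rank within an expert-parallel group, and that each rank holds num_local_experts experts whose local indices are renumbered to global ones when saving. The function `remap_local_experts`, the "experts.<id>.<param>" key layout, and the float values standing in for tensors are all hypothetical and are not taken from the PR.

```python
from typing import Dict


def remap_local_experts(
    state: Dict[str, float],
    num_local_experts: int,
    expp_rank: int,
    prefix: str = "experts.",
) -> Dict[str, float]:
    """Rename local expert indices in state-dict keys to global indices.

    Assumes (hypothetically) that expert keys look like
    "experts.<local_id>.<param>" and that the global index is
    expp_rank * num_local_experts + local_id. Other keys pass through.
    """
    remapped = {}
    for key, value in state.items():
        if key.startswith(prefix):
            # Split "experts.<local_id>.<param>" into its id and remainder.
            local_id, _, rest = key[len(prefix):].partition(".")
            global_id = expp_rank * num_local_experts + int(local_id)
            key = f"{prefix}{global_id}.{rest}"
        remapped[key] = value
    return remapped


# Rank 1 with two local experts would own global experts 2 and 3.
example = remap_local_experts(
    {"experts.0.w": 1.0, "experts.1.w": 2.0, "gate.w": 3.0},
    num_local_experts=2,
    expp_rank=1,
)
```

Under these assumptions, keys from different ranks do not collide when their expert parameters are written to separate checkpoint files.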

@NOBLES5E merged commit 5dcd77b into master Oct 15, 2021
@NOBLES5E deleted the moe_checkpoint branch October 15, 2021 08:49