DataParallel support for torchvision #1332

Merged (6 commits) on Feb 10, 2023

@rahul-tuli rahul-tuli commented Jan 23, 2023

This PR adds DataParallel support to sparseml's torchvision integration. If no devices are specified, all available devices are used by default; otherwise, devices can be specified as "cuda:0,1,2,3".
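
For illustration, a device string of this form could be split into a device type and a list of GPU ids roughly as follows. This is a minimal sketch, not the code in this PR: `parse_device_spec` is a hypothetical helper, and treating a missing spec as "use all GPUs" is an assumption based on the description above.

```python
from typing import List, Optional, Tuple


def parse_device_spec(device: Optional[str]) -> Tuple[str, Optional[List[int]]]:
    """Split a spec such as "cuda:0,1,2,3" into (device type, GPU ids).

    Hypothetical helper for illustration; not sparseml's implementation.
    ids is None when no explicit ids are given, meaning "use all devices".
    """
    if not device or device == "cuda":
        return "cuda", None  # no ids given: caller uses every visible GPU
    if device == "cpu":
        return "cpu", []
    kind, _, ids = device.partition(":")
    if kind != "cuda" or not ids:
        raise ValueError(f"unsupported device spec: {device!r}")
    # "0,1,2,3" -> [0, 1, 2, 3]
    return kind, [int(i) for i in ids.split(",")]
```

The returned id list is the kind of value that `torch.nn.DataParallel` accepts through its `device_ids` argument.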

The bug in multi-GPU environments occurred because the device info wasn't propagated correctly after the model was wrapped in DataParallel.
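
The propagation issue can be sketched as follows: once the model is wrapped, the caller needs a single concrete device for inputs rather than the raw multi-GPU spec. This is a hedged sketch under stated assumptions, not the change made in this PR; `wrap_data_parallel` is a hypothetical name, and returning the device alongside the model is one possible way to propagate it.

```python
from typing import List, Optional, Tuple

import torch
from torch.nn import DataParallel, Module


def wrap_data_parallel(
    model: Module, device_ids: Optional[List[int]] = None
) -> Tuple[Module, str]:
    """Wrap model in DataParallel and return the device inputs should go to.

    Hypothetical helper for illustration; not sparseml's implementation.
    """
    if not torch.cuda.is_available():
        return model, "cpu"  # nothing to parallelize over
    # Assumption: no explicit ids means "use every visible GPU"
    ids = device_ids or list(range(torch.cuda.device_count()))
    primary = f"cuda:{ids[0]}"
    model = model.to(primary)  # DataParallel expects parameters on device_ids[0]
    if len(ids) > 1:
        model = DataParallel(model, device_ids=ids)
    return model, primary


model, device = wrap_data_parallel(torch.nn.Linear(4, 2))
batch = torch.randn(8, 4).to(device)  # use the returned device, not the raw spec
output = model(batch)
```

Callers that keep using the original spec string after wrapping would move tensors to the wrong place or fail outright, which is why the device is returned together with the model here.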

The bug has been fixed. The following command was run in a multi-GPU setup to confirm that all GPUs were picked up; GPU usage was verified manually.

sparseml.image_classification.train \
    --recipe "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none?recipe_type=transfer-classification" \
    --checkpoint-path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" \
    --arch-key resnet50 \
    --dataset-path /home/XXXX/sparseml/imagenette/imagenette-160

Screenshot 2023-02-08 at 9 58 45 AM


bfineran previously approved these changes Jan 23, 2023
@bfineran bfineran left a comment

LGTM, please add what testing was included to the PR description

@bfineran

@rahul-tuli looks like there are quality issues

KSGulin previously approved these changes Jan 23, 2023
@rahul-tuli rahul-tuli dismissed stale reviews from KSGulin and bfineran via 2c07ca7 January 26, 2023 15:33
@rahul-tuli rahul-tuli marked this pull request as draft January 26, 2023 15:51
Update: all _create_model calling code to accept a third argument
Update: device to maybe_dp_device after teacher creation
@rahul-tuli rahul-tuli marked this pull request as ready for review February 8, 2023 14:59
@rahul-tuli rahul-tuli merged commit a8a6992 into main Feb 10, 2023
@rahul-tuli rahul-tuli deleted the torchvision-dp-support branch February 10, 2023 22:17