Torch MPS (gpu) acceleration not working M1 Mac. #8102

Closed
1 of 2 tasks
jerjer1223 opened this issue Jun 5, 2022 · 14 comments · Fixed by #8121
Labels
bug Something isn't working

Comments

@jerjer1223

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Detection

Bug

When I set the device to MPS with --device mps, it fails with "RuntimeError: don't know how to restore data location of torch.storage._UntypedStorage (tagged with mps)."

PyTorch 1.13 supports GPU acceleration on M1 Macs, as stated on the PyTorch website and in this article (https://towardsdatascience.com/gpu-acceleration-comes-to-pytorch-on-m1-macs-195c399efcc1).
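
A quick way to confirm that the installed PyTorch build can actually see the MPS backend before passing --device mps is sketched below. The helper names choose_device and mps_available are hypothetical and not part of YOLOv5; torch.backends.mps.is_available() is the PyTorch call being wrapped, and falling back to the CPU is only an assumption about the desired behaviour.

```python
def choose_device(requested: str, mps_ok: bool) -> str:
    """Return the requested device, falling back to 'cpu' if MPS is unusable."""
    # Hypothetical helper for illustration; not part of YOLOv5.
    if requested == "mps" and not mps_ok:
        return "cpu"
    return requested


def mps_available() -> bool:
    """Report whether this PyTorch build can use the Apple MPS backend."""
    try:
        import torch  # imported lazily so choose_device works without PyTorch
    except ImportError:
        return False
    # torch.backends.mps is only present in builds that know about MPS.
    backend = getattr(torch.backends, "mps", None)
    return bool(backend is not None and backend.is_available())
```

Calling choose_device("mps", mps_available()) returns "mps" only when the backend reports itself as available. Note that this check would not have caught the error above, which comes from checkpoint loading rather than from device detection.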

Environment

YOLOv5 🚀 2022-6-3 Python-3.9.13 torch-1.13.0.dev20220604 MPS

Minimal Reproducible Example

python detect.py --device mps

Additional

Full log here

detect: weights=yolov5s.pt, source=data/images, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=mps, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
YOLOv5 🚀 2022-6-3 Python-3.9.13 torch-1.13.0.dev20220604 MPS

Traceback (most recent call last):
  File "/Users/jerry/Documents/yolov5-master/detect.py", line 252, in <module>
    main(opt)
  File "/Users/jerry/Documents/yolov5-master/detect.py", line 247, in main
    run(**vars(opt))
  File "/opt/homebrew/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/jerry/Documents/yolov5-master/detect.py", line 92, in run
    model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
  File "/Users/jerry/Documents/yolov5-master/models/common.py", line 334, in __init__
    model = attempt_load(weights if isinstance(weights, list) else w, device=device)
  File "/Users/jerry/Documents/yolov5-master/models/experimental.py", line 80, in attempt_load
    ckpt = torch.load(attempt_download(w), map_location=device)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 973, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 178, in default_restore_location
    raise RuntimeError("don't know how to restore data location of "
RuntimeError: don't know how to restore data location of torch.storage._UntypedStorage (tagged with mps)

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@jerjer1223 jerjer1223 added the bug Something isn't working label Jun 5, 2022
@github-actions
Contributor

github-actions bot commented Jun 5, 2022

👋 Hello @jerjer1223, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher added a commit that referenced this issue Jun 6, 2022
* experimental.py Apple MPS fix

May resolve #8102

* Update experimental.py

* Update experimental.py
@glenn-jocher
Member

@jerjer1223 good news 😃! Your original issue may now be fixed ✅ in PR #8121. This PR removes MPS from the torch.device() map_location argument which appears to be the original source of the issue.
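
The general pattern behind this kind of fix can be sketched as follows: restore the checkpoint tensors on the CPU first, then move the model to the MPS device afterwards. This is an illustration under stated assumptions, not the exact change made in the PR. The helper names checkpoint_map_location and load_state_on_device are hypothetical, and the checkpoint is assumed to be a plain state_dict rather than a YOLOv5 checkpoint dictionary.

```python
def checkpoint_map_location(device: str) -> str:
    """Pick a map_location that torch.load is able to restore storages to."""
    # Hypothetical helper: keep MPS out of map_location, pass others through.
    device = str(device)
    return "cpu" if device.startswith("mps") else device


def load_state_on_device(model, path: str, device: str = "mps"):
    """Load a state_dict from `path` into `model`, then move it to `device`."""
    import torch  # imported lazily so the helper above works without PyTorch

    # Tensors are restored on the CPU first when the target device is MPS...
    state = torch.load(path, map_location=checkpoint_map_location(device))
    model.load_state_dict(state)
    # ...and only afterwards is the whole model moved to the requested device.
    return model.to(device)
```

Loading on the CPU and moving afterwards costs one extra copy, but it avoids asking the deserializer to restore storages tagged with a device it does not know how to handle.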

To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@GerardWalsh

GerardWalsh commented Jun 7, 2022

@glenn-jocher the above solved the same issue for me "RuntimeError: don't know how to restore data location of torch.storage._UntypedStorage (tagged with mps).", but I am now running into:

MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:782: failed assertion `[MPSNDArray, initWithBuffer:descriptor:] Error: buffer is not large enough. Must be 25600 bytes

@glenn-jocher
Member

@GerardWalsh great, I'm glad we resolved the original issue. The buffer size issue is known to the pytorch team and I believe they are working on solutions for it. See pytorch/pytorch#77886

@jerjer1223
Author

PyTorch 1.13 also seems to have problems on the CPU. When I ran it with the CPU as the device, it gave me this error:

PyTorch version 1.13.0.dev20220607
Torchvision version 0.14.0a0+f9f721d

RuntimeError: Couldn't load custom C++ ops. This can happen if your PyTorch and torchvision versions are incompatible, or if you had errors while compiling torchvision from source. For further information on the compatible versions, check https://github.com/pytorch/vision#installation for the compatibility matrix. Please check your PyTorch version with torch.__version__ and your torchvision version with torchvision.__version__ and verify if they are compatible, and if not please reinstall torchvision so that it matches your PyTorch install.

@GerardWalsh

GerardWalsh commented Jun 8, 2022

@jerjer1223 try torchvision 0.14.0.dev20220603 with the torch version you're using (1.13.0.dev20220607).
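
To compare the two installs without importing either package (importing torchvision is what fails above), the version strings can be read from package metadata. The sketch below is illustrative only, and the helper names installed_version and nightly_date are hypothetical. A nightly wheel carries a .devYYYYMMDD tag, whereas a version such as 0.14.0a0+f9f721d has none, which usually points to a local source build. That reading is an assumption and is not confirmed in this thread.

```python
import re
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def nightly_date(version: str) -> Optional[str]:
    """Extract YYYYMMDD from a nightly version such as '1.13.0.dev20220607'."""
    match = re.search(r"\.dev(\d{8})", version)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in ("torch", "torchvision"):
        version = installed_version(name)
        tag = nightly_date(version) if version else None
        print(f"{name}: version={version}, nightly date={tag}")
```

The nightly dates do not have to be identical (the suggestion above pairs a 20220603 torchvision with a 20220607 torch), so the output is a starting point for checking the compatibility matrix, not a pass/fail test.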

tdhooghe pushed a commit to tdhooghe/yolov5 that referenced this issue Jun 10, 2022
* experimental.py Apple MPS fix

May resolve ultralytics#8102

* Update experimental.py

* Update experimental.py
@Symbadian

Hi @glenn-jocher and good day to you,

I am trying to use my M1 GPU for training via python3 train.py, and I am having trouble getting it to work.

I have been trying to set this up for some time now by googling, but it is more challenging than I expected.

I discovered the MPS backend, but based on my research it is only used for inference with detect.py. I may be wrong given my limited experience. Can you guide me on how to use the M1 GPU on the new MacBook Pro for YOLOv5 training, please?

Training is extremely painful on my old Mac, so I bought a newer model to handle the processing, and I'm not sure how this works. Thanks loads for YOLOv5 and your efforts. Training works on my old Mac, but it has been running since last Friday at 3am, today is Tuesday, 16th Aug 2022, and it has only reached 8 epochs out of 30!

Please help me speed this up with the M1 GPU or MPS; despite my googling, I am not sure how it works.

Thanks loads to anyone responding; I am grateful just to learn.

@glenn-jocher
Member

@Symbadian YOLOv5 currently supports MPS, but PyTorch's MPS support is not yet complete enough for training.

If you have an M1/M2 machine, you'll already see faster inference and training than on Intel chips simply by installing Python>=3.9 with the universal2 installer. The speedup is roughly 200 ms on Intel vs 70 ms on M1 with universal2. MPS would theoretically be faster still once PyTorch supports it fully.
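
Whether the running interpreter is a native arm64 process or an x86_64 build running under Rosetta translation can be checked with the standard library, as in this minimal sketch. The function names are hypothetical, and the sketch assumes that platform.machine() reports "arm64" for a native process on Apple Silicon.

```python
import platform


def is_native_apple_silicon(system: str, machine: str) -> bool:
    """True when the interpreter runs natively on an Apple Silicon Mac."""
    return system == "Darwin" and machine == "arm64"


def interpreter_summary() -> str:
    """Describe the running interpreter and whether it is native arm64."""
    system, machine = platform.system(), platform.machine()
    native = is_native_apple_silicon(system, machine)
    return (
        f"Python {platform.python_version()} on {system}/{machine} "
        f"(native arm64: {native})"
    )


if __name__ == "__main__":
    print(interpreter_summary())
```

If this reports x86_64 on an M1/M2 machine, the interpreter is most likely an Intel build running under translation, and the speedup described above would not apply.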

@Symbadian

Symbadian commented Aug 18, 2022

OK @glenn-jocher, so this would not work for MPS just yet, wow! DISAPPOINTED.....

  1. Would this be the reason why Python is hogging all of my memory and causing my terminal to freeze, making the entire operation a pain?
  2. It's pointless to try the n6, s6, m6, l6 and x6 models to increase the image size from 640 to 1280 for small objects in the scene: I am constantly running out of memory and I am struggling to understand why.

The reason for trying to use MPS (the GPU) from my terminal on my new MBP M1 Max 2021 (64GB, 32 cores, macOS Monterey 12.5) is to speed up the training.

It is taking 3-5 days to train the yolov5m, l and x models with 30 epochs and a batch size of 32. I tried the --hyp low, med and high settings, both from scratch and with pre-trained weights, to see which performs best for my solution. Every time I try a model larger than (m), I get the prompt below.

  • I just got the training results today for the (m) model, and my poor computer has been running since last Friday at 3am until now!
  • Can something be done here?
  • If yes, please guide me to an example.

Thanks loads @glenn-jocher for your work, I really appreciate it. I'm just trying to get this to work and understand what I am doing!
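
A rough back-of-the-envelope calculation helps explain the memory pressure when going from 640 to 1280. It assumes that activation memory grows approximately in proportion to the number of input pixels per batch, and it ignores model weights and optimizer state. The function names are hypothetical and the numbers are estimates, not measurements.

```python
def relative_pixels(new_size: int, old_size: int = 640) -> float:
    """Ratio of pixels per image when the side length changes."""
    return (new_size / old_size) ** 2


def equivalent_batch(batch: int, new_size: int, old_size: int = 640) -> int:
    """Batch size that keeps pixels per batch roughly constant (at least 1)."""
    return max(1, int(batch / relative_pixels(new_size, old_size)))


if __name__ == "__main__":
    print(relative_pixels(1280))       # 4.0: four times the pixels per image
    print(equivalent_batch(32, 1280))  # 8: batch size with similar pixel count
```

Under this assumption, a batch of 32 images at 1280 needs about as much activation memory as a batch of 128 at 640, so reducing the batch size is the first thing to try.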

[Attached image: IMG_2722]

@glenn-jocher
Member

@Symbadian you can track (and vote on) ongoing aten operator development in pytorch/pytorch#77764 that's needed for full MPS training to work correctly.
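
Until the missing operators are implemented, PyTorch offers an opt-in CPU fallback for operators that the MPS backend does not support, controlled by the PYTORCH_ENABLE_MPS_FALLBACK environment variable. The sketch below shows one way to set it from Python. The helper name enable_mps_fallback is hypothetical, and it is an assumption that the variable must be set before torch is imported. Whether the fallback is enough for YOLOv5 training is not confirmed in this thread, and operators that fall back to the CPU will run slowly.

```python
import os
from typing import MutableMapping


def enable_mps_fallback(env: MutableMapping[str, str] = os.environ) -> None:
    """Ask PyTorch to run MPS-unsupported operators on the CPU instead."""
    # Assumed to take effect only if set before `import torch` runs.
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"


if __name__ == "__main__":
    enable_mps_fallback()
    print(os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])
```

The same effect can be had by exporting the variable in the shell before starting Python.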

@kulinseth

Hi @Symbadian, can you please file an issue in PyTorch with the "MPS" label? We will take a look.

@Symbadian

Hi @kulinseth, how do I do so? I've never filed an issue before and would like to have the most productive impact to help others as well.

I am still struggling with this challenge: no matter what I do, all of the resources are being drained, and googling is not providing a solution. Please guide me.

@DenisVieriu97

Hi @Symbadian - to file a PyTorch issue, go to https://github.com/pytorch/pytorch/issues and click the green New Issue button (near the search bar). From there, select Bug Report and add the information needed to reproduce it (e.g. command line used, machine config info, PyTorch version). In the labels tab, please add module: mps, and we'll take a look from there.
Thanks!

@Symbadian

Symbadian commented Aug 24, 2022 via email

ctjanuhowski pushed a commit to ctjanuhowski/yolov5 that referenced this issue Sep 8, 2022
* experimental.py Apple MPS fix

May resolve ultralytics#8102

* Update experimental.py

* Update experimental.py