
Maybe the cheapest cloud inference option for Yolov5 (AWS Neuron inf1 instance) #2643

Closed
Ownmarc opened this issue Mar 28, 2021 · 6 comments
Labels: question (Further information is requested), Stale

Comments

@Ownmarc
Contributor

Ownmarc commented Mar 28, 2021

Been trying the Inf1 EC2 instances from AWS with their own Inferentia chips.

https://aws.amazon.com/ec2/instance-types/inf1/

The YOLOv5 model doesn't fully compile yet for their accelerated inference chip, but it still works pretty well. This issue on their repo tracks YOLOv5 support: aws-neuron/aws-neuron-sdk#253

I have done a really basic speed test (the zidane image run 10 times, batch size = 1) to compare my 1080 Ti against 1 NeuronCore of the Inferentia chip (there are 4 NeuronCores per chip, so if the inference job can be parallelized we can divide the inference time by 4, since each NeuronCore can load its own YOLOv5 model and run inference independently).
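
Roughly, the timing loop looks like this (a minimal sketch using the YOLOv5s PyTorch Hub model and the hosted zidane image, not the exact script I ran):

import time
import torch

# Basic benchmark sketch: run the zidane image 10 times at batch size = 1
# and report the mean latency (the first call is a warm-up and is excluded).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
img = 'https://ultralytics.com/images/zidane.jpg'

model(img)  # warm-up

times = []
for _ in range(10):
    t0 = time.time()
    model(img)
    times.append(time.time() - t0)

print(f'mean inference time: {sum(times) / len(times) * 1000:.1f} ms')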

I'll keep this updated as they update their issue on YOLOv5 support, and maybe I'll make a simple tutorial showing how easy it is to compile and run YOLOv5 on their chip.

[Attached: benchmark result screenshots]

Ownmarc added the question (Further information is requested) label Mar 28, 2021
@glenn-jocher
Member

glenn-jocher commented Mar 28, 2021

@Ownmarc that's really interesting!

BTW if the AWS inference instances can exploit FP16 inference then they should benefit from the new autocast PR #2641 I made today for PyTorch Hub models. This seemed to cut about 1/3 off the inference time on a Colab T4. The results.print() method also now displays profiling results:

[Screenshot: results.print() profiling output]
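
The Hub usage itself doesn't change; a minimal sketch like the one below will show the new per-image speed info via results.print() (FP16 autocast is applied inside the model's forward pass on CUDA devices, so no user-side changes are needed):

import torch

# Minimal PyTorch Hub sketch: load YOLOv5s, run one image, then print the
# detections plus the per-image speed/profiling info shown by results.print().
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()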

I think the best performance:cost ratio I've seen up to now is from T4 instances, but I have not tried the inf1.xlarge instances yet. Is a 4-GPU/NeuronCore instance the smallest they get?

EDIT: I should clarify, the best performance:cost from enterprise GPUs hosted on the large cloud providers is probably from T4s. The best overall GPU performance per cost would probably be from 1080 Tis on consumer clouds like vast.ai.

@Ownmarc
Contributor Author

Ownmarc commented Mar 28, 2021

@glenn-jocher from what AWS says, Amazon EC2 Inf1 instances based on AWS Inferentia chips deliver up to 30% higher throughput and up to 45% lower cost per inference than Amazon EC2 G4 instances. Those G4 instances use the T4 GPU. That is most likely assuming the model fully compiles for their chip, which is currently not the case for YOLOv5.

Yes, the inf1.xlarge is the smallest instance they offer with the Inferentia chip.

@glenn-jocher
Member

glenn-jocher commented Mar 28, 2021

@Ownmarc that's interesting! AWS seems to like offering larger chunks of hardware than GCP. The P4 instances are only available as full 8x A100 machines on AWS, whereas on GCP you can get a smaller instance with 1, 2, 4 or 8 A100s.

Are there any special steps you need to get started with the Inf1 instances?

BTW I forgot to tell you, when timing CUDA ops it's important to call torch.cuda.synchronize() so you don't get incorrect times. We have a helper function that can replace time.time() and does this here:

import time
import torch

def time_synchronized():
    # pytorch-accurate time: wait for all queued CUDA work to finish before reading the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time()

This function is always used when profiling (detect.py, PyTorch Hub, test.py, etc.).
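
Typical usage is just a drop-in replacement for time.time() (a minimal sketch, assuming model and img are already on the GPU):

t0 = time_synchronized()
pred = model(img)  # forward pass
t1 = time_synchronized()
print(f'inference time: {(t1 - t0) * 1000:.1f} ms')  # all queued CUDA work has finished before t1 is read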

@Ownmarc
Contributor Author

Ownmarc commented Mar 29, 2021

Yeah, so you need to compile the model for their Inferentia chip. It's pretty easy with the common frameworks (torch/tensorflow/mxnet); there are a couple of tutorials on how to do so.
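
Roughly, the compilation step looks like this with the torch-neuron package (a minimal sketch, not my exact script; since YOLOv5 doesn't fully compile yet, unsupported ops get partitioned to run on the CPU, and the real export needs some extra model tweaks):

import torch
import torch_neuron  # AWS Neuron SDK PyTorch integration (pip install torch-neuron)

# Sketch: load a YOLOv5 model and compile it for Inferentia with torch.neuron.trace.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s').eval()
example = torch.zeros(1, 3, 640, 640)  # batch size 1, 3x640x640 input

model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save('yolov5s_neuron.pt')  # reload later with torch.jit.load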

And yes, I know there are some extra steps required to measure the real inference time. When they release a version that gets the model to fully compile, I'll give it another shot and try to get better metrics! :)

@glenn-jocher
Member

@Ownmarc interesting stuff. I responded over at aws-neuron/aws-neuron-sdk#253 to let them know I'm available to offer any help I can.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
