-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrote get_top_k_items() to improve runtime #1748
Conversation
Do you have any timing results on this change? |
I tested the old vs new function on the movielens dataset with 1m samples. I set k = 10. Here are the results: |
@ChuyangKe Could you please also add a test case for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good progress @ChuyangKe. I agree with Simon, it would be great to have the tests. This can be done in this PR or in a different one, depending on what you prefer. Please let Simon and me know.
In another PR you maybe can explore how to perform the same operation with numpy, hopefully this gives us a much better time.
Some code I found in the web:
- with argpartition: https://stackoverflow.com/questions/43386432/how-to-get-indexes-of-k-maximum-values-from-a-numpy-multidimensional-array
- groupby with numpy (maybe the other method is better): https://cmdlinetips.com/2019/05/how-to-implement-pandas-groupby-operation-with-numpy/
I will create another PR for the test case @simonzhaoms @miguelgfierro. Just wanted to make sure @miguelgfierro, do you mean implementing the same function get_top_k_items() using numpy (instead of current pandas operations), or is it another function? |
get_top_k_items() using numpy I'll merge this, super good @ChuyangKe! keep up the good work |
Description
Rewrote get_top_k_items().
Optimize the runtime of ranking metrics.
Related Issues
Checklist:
staging branch
and not tomain branch
.