-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Niwilso/adding tfidf #1088
Niwilso/adding tfidf #1088
Conversation
Check out this pull request on You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB. |
This is really great, thanks for the addition! I looked through briefly and looks like you did a very thorough job fitting this into the repo. I need to spend a bit more time looking more closely and will share a few suggestions. I'm thinking we may want to find some ways to balance maintenance w/ likelihood of re-usability with some of the functions. If there's clear value in wrapping a function because it simplifies some complex functionality (particularly if it's expected to be repeated) then it makes sense. In general, we've sometimes gone the other way in this repo and ended up creating many functions that make it hard to discover functionality or ensure it is tested & working for all cases. So I'll take some time to make sure we can avoid that here. |
The notebook looks very nice, great job! Just one thing, could you truncate the text output, it is a bit too long. |
Thank you @anargyri for the comment! I have made the following changes in my latest commit:
|
amazing work Nile, I'll pass the tests manually since right now our test machines are down due to covid situation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of great content here! I added some suggestions to simplify and generalize where possible. the less code we put in the less we have to test =)
# Save to class | ||
self.recommendations = results | ||
|
||
def __organize_results_as_tabular(self, df_clean): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be possible to speed this up using existing functionality
https://github.com/microsoft/recommenders/blob/71a38d422c00329f8f8226ea24ad6260b1b7b4e9/reco_utils/common/python_utils.py#L69
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this function out! I tested if it was faster to (1) keep the code as is, (2) use python_utils.get_top_k_score_items()
in this method, or (3) reduce iterating through k and not use python_utils.get_top_k_score_items()
.
k = 200
Method | Time to execute __organize_results_as_tabular() |
---|---|
(1) Original | 0.266 seconds |
(2) Use python_utils.get_top_k_score_items() |
0.353 seconds |
(3) Reduce iterating in original | 0.228 seconds |
This is just a rough comparison, but I did a few runs and took the mean (for when k=200) and saw that approach (3) was the fastest. My latest commit uses this approach.
Also, I noticed when playing around with this that python_utils.get_top_k_score_items()
breaks when you set sort_top_k=True
. If I were to try to use python_utils.get_top_k_score_items()
, I would need to go through some extra steps to make sure the output is properly sorted.
tests/unit/test_covid_utils.py
Outdated
output = clean_dataframe(df) | ||
assert len(df) > len(output) | ||
|
||
def test_extract_text_from_file(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you plan on adding tests for these functions? I would rather force them to fail then add ones that pass, that way they will get attention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract_text_from_file()
is now joined with retrieve_text()
.
I was not planning on writing out tests for retrieve_text()
and get_public_domain_text()
because both of these functions require accessing Azure blob storage.
Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>
Making retrieving text more efficient Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>
… the COVID-19 dataset
…e column and instead only using id column
Thank you @gramhagen for the thorough review! With the changes, the code should be more efficient, considering we are no longer looping through pandas dataframes (thank you for pointing it out and providing the resource). In addition, the TfidfRecommender class is now more generalized and is not limited to the COVID-19 dataset. There are still a few open comments where I would like to get other's feedback as well before making changes. Regardless of if we make changes from the latest commit or not, I believe this PR should be ready to go. |
thanks for making these changes. one last thing, we try to keep everything in the same format for readability by using black. Can you run that on the files for this pr? Some details here if you need it: https://github.com/microsoft/recommenders/wiki/Coding-Guidelines#python-and-docstrings-style |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing work Nile!
@@ -0,0 +1,83 @@ | |||
# Copyright (c) Microsoft Corporation. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, it would be great to add an integration test for the notebook https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill
If you don't have bandwidth, at least can you please mark it as todo with @pytest.mark.skip(reason="TODO: Implement this")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a new integration test file ./tests/integration/test_covid.py
. It does not currently have any tests to run, but I hope to write them out today. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately did not have time to write out the tests today, but a placeholder file has been created ./tests/integration/test_covid.py
@niwilso just trying to make your life easier. If you are using VSCode, with the current repo, you would just need to add these lines to your settings:
and when you save, magically, everything will be automatically formatted. If you are interested in the full settings I use, you can find them here: https://github.com/miguelgfierro/codebase/blob/master/minsc/vscode_settings.json |
Summary of today's main edits:
I believe all that remains now is writing the integration tests for the quickstart notebook. I don't think I'll be able to get to it tomorrow but hopefully I can on Friday. If the notebook integration test isn't required for this PR, I would be happy to open a separate PR after this one has been completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent work!
Description
Adding the quick start notebook (
./notebooks/00_quick_start/tfidf_covid.ipynb
), associated utils (./reco_utils/dataset/covid_utils.py
and./reco_utils/recommender/tfidf/tfidf_utils.py
), and unit tests (./tests/unit/test_covid_utils.py
and./tests/unit/test_tfidf_utils.py
) to the staging branch.This directly addresses the open issue linked below, which proposes adding a simple TF-IDF recommender demonstration using a novel dataset.
Related Issues
#1087
Checklist:
staging
and notmaster
.