Niwilso/adding tfidf #1088

niwilso · 2020-04-18T03:09:16Z

Description

Adding the quick start notebook (./notebooks/00_quick_start/tfidf_covid.ipynb), associated utils (./reco_utils/dataset/covid_utils.py and ./reco_utils/recommender/tfidf/tfidf_utils.py), and unit tests (./tests/unit/test_covid_utils.py and ./tests/unit/test_tfidf_utils.py) to the staging branch.

This directly addresses the open issue linked below, which proposes adding a simple TF-IDF recommender demonstration using a novel dataset.
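
For readers new to the approach, here is a minimal sketch of a TF-IDF content-based recommender in plain Python. It is not the code added in this PR (that lives in tfidf_utils.py); the function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw tf * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {
                term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
                for term, count in tf.items()
            }
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(docs, query_index, k):
    """Return indices of the k documents most similar to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


docs = [
    "coronavirus transmission in hospital settings",
    "hospital transmission of coronavirus among staff",
    "deep learning for image classification",
]
top = recommend(docs, query_index=0, k=1)
print(top)
```

The second document shares three terms with the first while the third shares none, so the second is the one recommended.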

Related Issues

#1087

Checklist:

[ x ] I have followed the contribution guidelines and code style for this project.
[ x ] I have added tests covering my contributions.
[ x ] I have updated the documentation accordingly.
[ x ] This PR is being made to staging and not master.

review-notebook-app · 2020-04-18T03:09:22Z

Check out this pull request on ReviewNB. You'll be able to see Jupyter notebook diffs and discuss changes. Powered by ReviewNB.

gramhagen · 2020-04-20T01:10:40Z

This is really great, thanks for the addition! I looked through it briefly, and it looks like you did a very thorough job fitting this into the repo. I need to spend a bit more time looking at it closely and will then share a few suggestions.

I'm thinking we may want to find some ways to balance maintenance w/ likelihood of re-usability with some of the functions. If there's clear value in wrapping a function because it simplifies some complex functionality (particularly if it's expected to be repeated) then it makes sense. In general, we've sometimes gone the other way in this repo and ended up creating many functions that make it hard to discover functionality or ensure it is tested & working for all cases. So I'll take some time to make sure we can avoid that here.

anargyri · 2020-04-21T10:11:11Z

The notebook looks very nice, great job! Just one thing: could you truncate the text output? It is a bit too long.
I agree with the comments above about refactoring. The methods that are not specific to the Covid data set can be moved to another place such as download_utils.py, and some of them (like handling duplicates or NaN) can be done easily with Pandas.
Another suggestion is to follow the convention we use for the other data sets (Movielens, Criteo) with the load_pandas_df() functions, i.e. you could have such a function for the Covid data that the user can call, as in https://github.com/microsoft/recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb

niwilso · 2020-04-21T12:34:27Z

Thank you @anargyri for the comment! I have made the following changes in my latest commit:

Text output is now truncated.
get_blob_service() and load_csv_from_blob() have been moved from covid_utils.py to download_utils.py.
covid_utils.py now has a load_pandas_df() function like in the MovieLens example.
The quickstart notebook has been updated to accommodate the above changes.
I have kept remove_duplicates() and remove_nan() for the reasons explained in my responses to Miguel. These wrapper functions are more thorough in handling duplicates and NaN values in the case of this specific dataset.
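
For context, the plain-Pandas route suggested in the review could look roughly like the sketch below. The column names and sample rows are illustrative assumptions, not taken from covid_utils.py, and the wrappers kept in this PR are described above as doing more than this.

```python
import pandas as pd

# Tiny stand-in for a metadata table; the columns are hypothetical.
df = pd.DataFrame(
    {
        "cord_uid": ["a1", "a1", "b2", "c3", "d4"],
        "title": ["Paper A", "Paper A", "Paper B", None, "Paper D"],
        "abstract": ["text a", "text a", "text b", "text c", ""],
    }
)

# Treat empty strings as missing so they are dropped along with real NaN values,
# then remove repeated ids and rows lacking a title or abstract.
cleaned = (
    df.replace("", float("nan"))
    .drop_duplicates(subset=["cord_uid"])
    .dropna(subset=["title", "abstract"])
    .reset_index(drop=True)
)
print(cleaned)
```

Only the rows with ids a1 (first occurrence) and b2 survive: the second a1 row is a duplicate, c3 has no title, and d4 has an empty abstract.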

miguelgfierro · 2020-04-21T14:32:40Z

Amazing work, Nile! I'll pass the tests manually, since our test machines are currently down due to the COVID situation.

gramhagen

lots of great content here! I added some suggestions to simplify and generalize where possible. the less code we put in the less we have to test =)

gramhagen · 2020-04-21T15:10:48Z

reco_utils/recommender/tfidf/tfidf_utils.py

+        # Save to class
+        self.recommendations = results
+
+    def __organize_results_as_tabular(self, df_clean):


it may be possible to speed this up using existing functionality
https://github.com/microsoft/recommenders/blob/71a38d422c00329f8f8226ea24ad6260b1b7b4e9/reco_utils/common/python_utils.py#L69

niwilso · Apr 23, 2020

Thank you for pointing this function out! I tested which was faster: (1) keeping the code as is, (2) using python_utils.get_top_k_score_items() in this method, or (3) reducing the iteration over k without using python_utils.get_top_k_score_items().

Timings for k = 200:

| Method | Time to execute `__organize_results_as_tabular()` |
| --- | --- |
| (1) Original | 0.266 seconds |
| (2) Use `python_utils.get_top_k_score_items()` | 0.353 seconds |
| (3) Reduce iterating in original | 0.228 seconds |

This is just a rough comparison, but I did a few runs and took the mean (for k = 200), and approach (3) was the fastest. My latest commit uses this approach.
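
A comparison of this kind (a few runs, then the mean) can be sketched with the standard library as follows. The helper name `mean_runtime` and the placeholder workload are assumptions for illustration; this is not the script used to produce the numbers above.

```python
import statistics
import timeit


def mean_runtime(func, runs=5):
    """Call func `runs` times and return the mean wall-clock time in seconds."""
    timings = timeit.repeat(func, number=1, repeat=runs)
    return statistics.mean(timings)


def build_table():
    # Placeholder workload standing in for the method being timed.
    return sorted(range(10_000), reverse=True)


elapsed = mean_runtime(build_table)
print(f"mean over 5 runs: {elapsed:.4f} s")
```

Passing each candidate implementation to the same helper keeps the number of runs and the timer identical across the approaches being compared.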

Also, I noticed when playing around with this that python_utils.get_top_k_score_items() breaks when you set sort_top_k=True. If I were to try to use python_utils.get_top_k_score_items(), I would need to go through some extra steps to make sure the output is properly sorted.
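
To make the sorting concern concrete, here is a hedged NumPy sketch of selecting the top k scores per row and returning them in descending order. It is not python_utils.get_top_k_score_items() and makes no claim about how that function behaves; `top_k_sorted` is a hypothetical name.

```python
import numpy as np


def top_k_sorted(scores, k):
    """Return (indices, values) of the k largest scores per row, descending."""
    scores = np.asarray(scores)
    if not 1 <= k <= scores.shape[1]:
        raise ValueError("k must be between 1 and the number of columns")
    # argpartition picks the k largest entries per row, but in arbitrary order.
    part = np.argpartition(scores, -k, axis=1)[:, -k:]
    part_scores = np.take_along_axis(scores, part, axis=1)
    # Sort only those k candidates, largest first.
    order = np.argsort(-part_scores, axis=1)
    top_idx = np.take_along_axis(part, order, axis=1)
    top_val = np.take_along_axis(part_scores, order, axis=1)
    return top_idx, top_val


sim = np.array([[0.1, 0.9, 0.4, 0.7], [0.8, 0.2, 0.6, 0.3]])
idx, val = top_k_sorted(sim, k=2)
print(idx)
print(val)
```

Partitioning first and sorting only the k selected entries avoids a full sort of every row, which matters when k is much smaller than the number of items.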

gramhagen · 2020-04-21T15:14:45Z

tests/unit/test_covid_utils.py

+    output = clean_dataframe(df)
+    assert len(df) > len(output)
+
+def test_extract_text_from_file():


Do you plan on adding tests for these functions? I would rather force them to fail than add ones that pass; that way they will get attention.

niwilso · Apr 22, 2020

extract_text_from_file() is now joined with retrieve_text().

I was not planning on writing tests for retrieve_text() and get_public_domain_text() because both of these functions require access to Azure blob storage.
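
One possible way around the storage dependency, sketched here only as an idea, is to pass the blob service in as an argument and substitute a mock in the test. The function `retrieve_text_sketch` is hypothetical and is not the PR's retrieve_text(); the method name `get_blob_to_text` is likewise an assumption about what such a function might call.

```python
from unittest.mock import MagicMock


def retrieve_text_sketch(blob_service, container_name, blob_name):
    """Hypothetical stand-in: fetch a blob and return its stripped text."""
    blob = blob_service.get_blob_to_text(container_name, blob_name)
    return blob.content.strip()


def test_retrieve_text_sketch_without_network():
    # The mock replaces the real storage client, so no network access is needed.
    fake_service = MagicMock()
    fake_service.get_blob_to_text.return_value.content = "  some text \n"

    result = retrieve_text_sketch(fake_service, "container", "paper.json")

    assert result == "some text"
    fake_service.get_blob_to_text.assert_called_once_with("container", "paper.json")


test_retrieve_text_sketch_without_network()
```

This checks only the logic around the storage call, so it complements rather than replaces a test against real storage.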

niwilso · 2020-04-22T02:15:42Z

Thank you @gramhagen for the thorough review!

With the changes, the code should be more efficient, considering we are no longer looping through pandas dataframes (thank you for pointing it out and providing the resource).
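
As a small illustration of the kind of change described, the sketch below contrasts a row-by-row loop with the equivalent column-wise Series.map() call. The `clean_text` helper and the sample titles are assumptions for the example, not code from tfidf_utils.py or covid_utils.py.

```python
import pandas as pd

df = pd.DataFrame({"title": ["  COVID-19: A Review ", "SARS-CoV-2 & Masks!"]})


def clean_text(text):
    """Lowercase the text and keep only letters, digits, and single spaces."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ")
    return " ".join(kept.split())


# Row-by-row loop: works, but is verbose and slow on large frames.
looped = []
for _, row in df.iterrows():
    looped.append(clean_text(row["title"]))

# Equivalent column-wise form using Series.map().
df["title_clean"] = df["title"].map(clean_text)
print(df["title_clean"].tolist())
```

Both forms produce the same values; the map() version avoids constructing a Series object for every row.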

In addition, the TfidfRecommender class is now more generalized and is not limited to the COVID-19 dataset.

There are still a few open comments where I would like to get others' feedback before making changes.

Regardless of whether we make further changes after the latest commit, I believe this PR should be ready to go.

gramhagen · 2020-04-22T13:08:09Z

Thanks for making these changes. One last thing: we try to keep everything in the same format for readability by using black. Can you run that on the files for this PR? Some details are here if you need them: https://github.com/microsoft/recommenders/wiki/Coding-Guidelines#python-and-docstrings-style

miguelgfierro

amazing work Nile!

miguelgfierro · 2020-04-22T15:09:26Z

tests/unit/test_tfidf_utils.py

@@ -0,0 +1,83 @@
+# Copyright (c) Microsoft Corporation. All rights reserved.


Also, it would be great to add an integration test for the notebook: https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill

If you don't have bandwidth, can you please at least mark it as a TODO with @pytest.mark.skip(reason="TODO: Implement this")?

niwilso · Apr 22, 2020

I have created a new integration test file, ./tests/integration/test_covid.py. It does not currently have any tests to run, but I hope to write them today. Will update.

niwilso · Apr 23, 2020

Unfortunately I did not have time to write the tests today, but a placeholder file has been created: ./tests/integration/test_covid.py
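
A placeholder of the kind requested above might look like the following sketch. The test name is hypothetical, and the actual contents of ./tests/integration/test_covid.py are not shown in this thread.

```python
import pytest


@pytest.mark.skip(reason="TODO: Implement this")
def test_tfidf_covid_notebook_runs():
    # Placeholder: a future version would execute the quick start notebook
    # and check its recorded outputs.
    pass
```

pytest reports a test marked this way as skipped, together with the reason, so the missing coverage stays visible in every test run.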

miguelgfierro · 2020-04-22T15:19:47Z

@niwilso just trying to make your life easier. If you are using VSCode, with the current repo, you would just need to add these lines to your settings:

    // Python
    "python.pythonPath": "C:/Anaconda3/envs/py36/python.exe",
    "python.formatting.blackPath": "C:/Anaconda3/envs/py36/Scripts/black.exe",
    "python.formatting.provider": "black",

and when you save, magically, everything will be automatically formatted. If you are interested in the full settings I use, you can find them here: https://github.com/miguelgfierro/codebase/blob/master/minsc/vscode_settings.json

niwilso · 2020-04-23T01:05:47Z

Summary of today's main edits:

load_csv_from_blob() has been moved to a new general file ./reco_utils/dataset/blob_utils.py
Required packages have been added to ./scripts/generate_conda_file.py, matching the versioning in the NLP repo
Created placeholder for integration testing and designated skipping TODO tests
Formatted code using Black

I believe all that remains now is writing the integration tests for the quickstart notebook. I don't think I'll be able to get to it tomorrow but hopefully I can on Friday. If the notebook integration test isn't required for this PR, I would be happy to open a separate PR after this one has been completed.

gramhagen

excellent work!

niwilso added 9 commits April 15, 2020 15:52

WIP loading covid data in notebook

b430b9f

Initial full commit for quickstart

9c08026

Updated to-do list

a926ae1

Updating preprocessing to remove special characters

b8aadc6

Quickstart notebook v1 ready

633c2db

Added unit tests for tfidf

87491be

Fixed bug in clean_dataframe()

703f994

Updated readme with TF-IDF

4322d8e

Removed extraneous period

c544921

niwilso requested review from gramhagen, miguelgfierro and yueguoguo as code owners April 18, 2020 03:09

miguelgfierro reviewed Apr 20, 2020

miguelgfierro reviewed Apr 20, 2020

miguelgfierro requested review from anargyri and loomlike April 20, 2020 12:30

niwilso added 2 commits April 20, 2020 13:15

Generalizing download_metadata() to now be load_csv_from_blob()

775c25e

Reorganized code such that TF-IDF functions are now called within a TfidfRecommender class

bb5389b

Moved common Azure data storage functions to download_utils.py

cc48f5a

Fixed typo

31d31e2

anargyri approved these changes Apr 21, 2020

Merge branch 'staging' into niwilso/adding_tfidf

ab0d64f

miguelgfierro reviewed Apr 21, 2020

niwilso added 2 commits April 21, 2020 07:58

Changed .fit_tfidf() to .fit() and moved k definition location

182836c

Merge branch 'niwilso/adding_tfidf' of https://github.com/niwilso/recommenders into niwilso/adding_tfidf

8b2e875

gramhagen requested changes Apr 21, 2020

niwilso and others added 9 commits April 21, 2020 10:26

Update README.md

1f46356

Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>

Update reco_utils/dataset/covid_utils.py

e12e31a

Making retrieving text more efficient Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>

Consolidated retrieve_text() and extract_text_from_file()

3fcd477

Forcing users to specify model parameters

18b3546

Setting tokenization_method default to scibert

d601545

Reduced dataframe looping in tfidf_utils.py by using .map() and .apply()

101532a

Replaced pandas looping with apply function in covid_utils.py

5b6a86b

Generalized get_top_k_recommendations such that it is not specific to the COVID-19 dataset

7581a58

Generalized TfidfRecommender further by removing need to specify title column and instead only using id column

d5e0110

Minor comment change

42cfa5b

gramhagen reviewed Apr 22, 2020

miguelgfierro approved these changes Apr 22, 2020

niwilso added 8 commits April 22, 2020 11:22

Moving load_csv_from_blob into new utils file

92a1c61

Removed oneline extract_public_domain() function

86ac968

Simplified retrieve_text() method

6f5205b

Skipping TODO tests

0eb75ef

Adding requirements

e5c2768

Sped up __organize_results_as_tabular()

138471d

Added raise error when k is too large

64fe868

Formatted with Black

a3ccad7

gramhagen approved these changes Apr 23, 2020

Merge branch 'staging' into niwilso/adding_tfidf

b81de2d

miguelgfierro merged commit e646c4e into recommenders-team:staging Apr 23, 2020

niwilso mentioned this pull request Apr 23, 2020

[FEATURE] Adding TF-IDF recommender using COVID-19 dataset #1087

Closed
