add logic and example of item feature vector based item-item similari… #1505
Conversation
A few small comments; this looks like a nice addition.
```python
        p = 2
        return float(v1.dot(v2)) / float(v1.norm(p) * v2.norm(p))
    except:
        return 0
```
do we want to indicate that two vectors are identical if there's a failure to compute the distance? can we return a NaN or something?
If the vectors are identical, their cosine similarity will be 1. In which situation do you think the calculation will fail? @gramhagen
Sorry, you're right, we're not computing a distance here. I think in this case returning a 0 is okay in the exception. My concern is just about silent failures: if for some reason there is an exception, most likely because the vectors are not the correct size or have invalid values, we never surface the fact that one of the item vectors is incorrectly defined.
Added logic to check the sizes of the vectors. If the sizes of the two input vectors are different, an exception is raised.
I'm still a bit confused on this part. Instead of both raising an exception for a particular case and providing a generic except that silently swallows everything else and returns a 0, I think we should do one or the other. I'm fine with using a generic except and returning a NaN, or not using a try-except at all and letting the exception propagate.
It's probably better practice to pass the exception through instead of returning 0, because people interpret 0 as the two vectors being orthogonal. Also passing NaN in the case of an exception is not great because people would expect NaN to be the output when one of the vectors is zero.
removed try/catch block
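For illustration, here is a minimal plain-Python sketch of the behavior the thread converged on: check vector sizes explicitly and let any other failure propagate, rather than swallowing it in a bare except. The function name and list-based vectors are assumptions for the sketch; the PR's actual code operates on Spark ML vectors.

```python
import math

def cosine_similarity(v1, v2):
    # Raise on mismatched sizes instead of silently returning 0,
    # so a badly defined item feature vector surfaces as an error.
    if len(v1) != len(v2):
        raise ValueError(
            f"Vectors must have equal length: {len(v1)} vs {len(v2)}"
        )
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # No try/except: a zero vector raises ZeroDivisionError rather than
    # being silently mapped to 0 (orthogonal) or some other default.
    return dot / (norm1 * norm2)
```

With this shape, a caller sees an exception for malformed item vectors instead of a misleading similarity of 0.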
Co-authored-by: Scott Graham <5720537+gramhagen@users.noreply.github.com>
… just like user, item, relevance
this is really nice Yan, it looks like there are a couple of tests breaking; @laserprec might be able to help
…nders into zhangya/itemsimilarity
```
@@ -588,6 +608,15 @@ def _get_pairwise_items(self, df):
        )

    def _get_cosine_similarity(self, n_partitions=200):
        if self.item_feature_df is None:
```
It would be better to select the method for calculating similarity using an argument, instead of based on whether item_feature_df is present or not. Otherwise, if the item features are available in the dataframe, one would need to create a new SparkDiversityEvaluation object in order to compute the co-occurrence based similarities. There would also need to be an additional member dataframe, in order to keep track of whether it has been computed already.
Do you mean we want to only create a single SparkDiversityEvaluation object, and be able to get diversity value from two different item similarity approaches if item features are available?
e.g.

```python
eval = SparkDiversityEvaluation(args, ...)
diversity_item_cooccurrence = eval.diversity(method="item_cooccurrence")
diversity_item_similarity = eval.diversity(method="item_similarity")
```
If using this approach, it doubles the code in multiple functions and looks very confusing. Therefore I kept the previous design unchanged, i.e. only one item_sim_measure can be chosen for one object eval = SparkDiversityEvaluation(args, ...).
"select the method for calculating similarity using an argument instead of whether item_feature_df is present or not." -- this is implemented.
Why would it increase the code significantly? I think you would just need to add an extra argument in some functions and pass it through from the calling function to the called one, so that you could write

als_diversity = als_eval.diversity(method="cooccurrence")

or something similar. And you would remove the argument from the constructor that you added in the most recent commit.
The main thing I would like to avoid is the user creating two SparkDiversityEvaluation objects when they want to compute both types of diversity or serendipity metrics. This is more user-unfriendly than the alternative. And imagine what would happen if we added more item similarity methods: they would need to create as many objects as the methods available.
@anargyri @gramhagen what do you think? Should we make the change to use the above approach?
Actually, you don't need to duplicate the code in this way. What I meant was to change some of the member DFs into dictionaries indexed by the similarity method, i.e. do something like this instead:

```python
def user_diversity(self, item_sim_measure="item_cooccurrence_count"):
    if item_sim_measure not in self.df_user_diversity:
        self.df_intralist_similarity[item_sim_measure] = self._get_intralist_similarity(
            self.reco_df, item_sim_measure=item_sim_measure
        )
        self.df_user_diversity[item_sim_measure] = (
            self.df_intralist_similarity[item_sim_measure]
            .withColumn("user_diversity", 1 - F.col("avg_il_sim"))
            .select(self.col_user, "user_diversity")
            .orderBy(self.col_user)
        )
    return self.df_user_diversity[item_sim_measure]
```
So the code doesn't really grow, but it becomes less readable. I am fine either way; it's a trade-off, i.e. shifting convenience from the end-user to the developer or vice versa.
@yueguoguo @miguelgfierro any opinions on this?
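To make the trade-off concrete, here is a toy, self-contained sketch of the dictionary-keyed caching pattern proposed above, so one evaluator object can serve several similarity measures without recomputation. The class and method bodies are hypothetical stand-ins, not the real SparkDiversityEvaluation API.

```python
class ToyDiversityEval:
    """Hypothetical illustration of per-measure result caching."""

    def __init__(self):
        self.calls = 0
        self._user_diversity = {}  # results keyed by item_sim_measure

    def _compute(self, measure):
        # Stand-in for the expensive Spark computation.
        self.calls += 1
        return f"user diversity via {measure}"

    def user_diversity(self, item_sim_measure="item_cooccurrence_count"):
        # Compute once per measure, then reuse the cached result.
        if item_sim_measure not in self._user_diversity:
            self._user_diversity[item_sim_measure] = self._compute(item_sim_measure)
        return self._user_diversity[item_sim_measure]
```

A single object then serves both a co-occurrence measure and a feature-vector measure without being rebuilt; repeated calls with the same measure hit the cache.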
lgtm
Codecov Report
```
@@            Coverage Diff             @@
##           staging    #1505     +/-   ##
==========================================
- Coverage    74.23%   62.03%   -12.21%
==========================================
  Files           84       84
  Lines         8369     8397       +28
==========================================
- Hits          6213     5209     -1004
- Misses        2156     3188     +1032
```

Continue to review the full report at Codecov.
…ty calculation
Description
In scenarios where item features are available, we may want to calculate item-item similarity based on item feature vectors. This PR adds the logic and an example of how to calculate diversity metrics using item-feature-vector-based item-item similarity.
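As a sketch of the metric being added, diversity of a recommendation list can be computed as one minus the average pairwise item-item cosine similarity over the list. This is a plain-Python illustration under that definition, not the PR's Spark implementation; the function names and dict-of-features layout are assumptions.

```python
import math
from itertools import combinations

def cosine(v1, v2):
    # Cosine similarity of two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

def list_diversity(item_features, reco_list):
    # Diversity = 1 - average pairwise similarity of recommended items.
    sims = [
        cosine(item_features[i], item_features[j])
        for i, j in combinations(reco_list, 2)
    ]
    return 1.0 - sum(sims) / len(sims)
```

Two items with orthogonal feature vectors give a list diversity of 1, while duplicated feature vectors give 0.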
Related Issues
Checklist:
PR is being merged into the staging branch and not into the main branch.