Recommending documents to users based on keywords in the absence of interactions #716

Open
dlindelof opened this issue Aug 4, 2024 · 0 comments


I have a basic question about using LightFM; apologies if this isn't the right forum.

I'm building a recommender system that will recommend documents to users. There are no interactions yet, and all we know about each user is the set of keywords they're interested in.

I've built a prototype where I transform each document using TF-IDF. I then transform the user's keywords with the same transformer and use cosine similarity to find the most relevant documents. It works reasonably well.
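For reference, the TF-IDF baseline described above can be sketched roughly as follows. This is a minimal sketch and not the actual prototype: it assumes scikit-learn's `TfidfVectorizer` and `cosine_similarity`, and the function name `rank_documents` and the three toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_documents(documents, keywords, top_n=3):
    """Rank documents by cosine similarity to a user's keywords (hypothetical helper)."""
    vectorizer = TfidfVectorizer()
    # Fit the vocabulary and IDF weights on the corpus only.
    doc_matrix = vectorizer.fit_transform(documents)
    # Treat the keywords as a pseudo-document and reuse the fitted vectorizer.
    query = vectorizer.transform([" ".join(keywords)])
    scores = cosine_similarity(query, doc_matrix).ravel()
    # Highest-scoring documents first.
    order = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]


# Toy corpus, invented for this example.
docs = [
    "gradient descent for neural networks",
    "baking sourdough bread at home",
    "neural networks for image recognition",
]
ranking = rank_documents(docs, ["bread"])
```

Here `ranking` is a list of `(document index, similarity)` pairs; only the document that contains the keyword gets a non-zero score.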

I'm now porting this to LightFM so that we can include interactions later. Before that, I need the LightFM version to perform at least as well as the TF-IDF solution, and I'm struggling to get there. Here's the current approach:

  1. Build a Dataset object over all items in the corpus, using TF-IDF to build the item features.

When a request for recommendations for a new user comes in:

  1. Get that user's keywords and form a pseudo-document: a single string containing all the keywords.
  2. Compute the TF-IDF features of that pseudo-document, using the same vectorizer that was used to build the corpus features.
  3. Retrain the LightFM model with a single interaction between the user and the pseudo-document, and with item_features formed by concatenating the corpus's item features and the pseudo-document's features.
  4. Call the predict function to get the recommendations.
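The steps above could look roughly like the sketch below. It is an illustration under assumptions, not the code from this issue: it builds the sparse matrices directly with SciPy rather than through `Dataset`, the helper names `build_inputs` and `fit_and_score` are hypothetical, and `no_components`, `epochs`, and the default loss are arbitrary choices. LightFM is imported inside `fit_and_score` so that the matrix-building part can be inspected without it installed.

```python
import numpy as np
import scipy.sparse as sp


def build_inputs(corpus_features, pseudo_doc_features):
    """Stack the pseudo-document under the corpus and create one interaction.

    Item ids 0..n_docs-1 are the corpus documents and item id n_docs is the
    pseudo-document. There is a single user with id 0.
    """
    item_features = sp.vstack([corpus_features, pseudo_doc_features]).tocsr()
    n_items = item_features.shape[0]
    pseudo_id = n_items - 1
    # One positive interaction: user 0 with the pseudo-document.
    interactions = sp.coo_matrix(
        (np.ones(1, dtype=np.float32), ([0], [pseudo_id])),
        shape=(1, n_items),
    )
    return interactions, item_features


def fit_and_score(interactions, item_features, epochs=30):
    """Retrain LightFM and score every corpus document for user 0."""
    from lightfm import LightFM  # lazy import; requires the lightfm package

    model = LightFM(no_components=16)  # hyperparameters are assumptions
    model.fit(interactions, item_features=item_features, epochs=epochs)
    n_docs = item_features.shape[0] - 1  # leave out the pseudo-document
    item_ids = np.arange(n_docs, dtype=np.int32)
    return model.predict(0, item_ids, item_features=item_features)


# Tiny stand-in for the TF-IDF matrices (2 documents, 3 terms).
corpus = sp.csr_matrix(np.array([[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]))
pseudo = sp.csr_matrix(np.array([[0.0, 1.0, 0.0]]))
interactions, item_features = build_inputs(corpus, pseudo)
```

Calling `fit_and_score(interactions, item_features)` would then return one score per corpus document. Note that in this setup the model only ever sees a single training interaction.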

In my unit tests I have 52 documents, which are transformed into a TF-IDF matrix with about 3300 columns. The test user has a single keyword, so the user's pseudo-document is transformed into a vector with a single 1.0 entry in the column for that keyword.

So I would expect the prediction to give high scores to the documents whose TF-IDF entry for that keyword is also high. Instead, the scores are all more or less the same, about -0.5.

Am I doing something wrong here?
