Short project building a content-based filtering recommender system from the SciRate user's history

carlosparaciari/scirate_recommender_system

Recommendation system for scientific papers

Goal

In this study, we use (my) user history on the SciRate website to build a recommendation system based on content-based filtering. The idea is to use paper metadata such as the title, abstract, author list, and year of publication to create a similarity matrix between the papers the user scited on the website and the papers uploaded to the relevant arXiv category in the last few years. In this way, the model should recommend new papers that we have not yet scited (and therefore have probably not yet read).

Method

Since we only have access to the history of a single user (we cannot obtain the history of other people using the website), our model will only provide recommendations for this user and needs to be retrained to provide recommendations for other users.

We create a similarity matrix between the papers the user scited and the papers uploaded on the arXiv since 2016 in the relevant category (quant-ph, that is, quantum physics). The similarity score between a pair of papers is assigned by considering:

  • the text (title and abstract) of the papers. We use standard NLP techniques to clean these fields, and we map the text into a vector space using TF-IDF vectorization. The similarity is then computed as the cosine similarity (inner product of the normalized vectors).

  • the author list of the papers. The author list is embedded into a vector space using count vectorization, which essentially produces a vector of 0's and 1's, where 1 indicates that an author wrote the paper and 0 that they did not. The exponential of the cosine similarity is used to gauge the similarity of papers using this field.

  • the year of publication of the papers. We simply use a radial basis kernel to compute the similarity between the papers based on the year.
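The three similarity components above can be sketched in plain Python as follows. This is only an illustration, not the code used in the study: the function names, the smoothed IDF weighting, and the kernel width `gamma` are assumptions, and the text is assumed to be already cleaned and tokenised. The exact form of the exponential applied to the author similarity is not stated here, so the sketch returns the plain cosine.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenised documents to L2-normalised TF-IDF vectors (as dicts)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed IDF: an assumed weighting, one common choice among several.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Inner product of two already-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def author_similarity(authors_a, authors_b):
    """Cosine similarity of the binary author-indicator vectors of two papers."""
    a, b = set(authors_a), set(authors_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def year_similarity(year_a, year_b, gamma=0.5):
    """Radial basis kernel on the publication year (gamma is an assumed width)."""
    return math.exp(-gamma * (year_a - year_b) ** 2)
```

Two papers with identical text or identical author lists get a similarity of 1, papers with no shared terms or authors get 0, and the year similarity decays smoothly as the publication years move apart.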

Once the similarity matrix is built, our content-based filtering recommender system uses one of two methods:

  • We either score each unscited paper by its average similarity, computed by first finding the K scited papers most similar to it and then averaging these similarities. We then select the N unscited papers with the highest average similarity. We call this method user-centric in the following, to distinguish it from the next one.

  • Otherwise, we use the average similarity to weigh the score the scited papers were given. The problem is that the user can only scite or not scite a paper, so we do not have a user score to use. Instead, we can use the number of scites a paper received as a score, although this score does not represent the opinion of the user alone, but rather that of the community of people using SciRate. This method, which we call collective in the following, might suggest papers that are less relevant for the user, but more popular among their community.
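The two scoring rules can be sketched as below. This is a minimal sketch under stated assumptions: `similarity[i][j]` is taken to be the similarity between unscited paper `i` and scited paper `j`, each row is non-empty, `k >= 1`, and all function names are hypothetical. The second function implements one reading of the popularity-based method, a similarity-weighted average of the scite counts of the K nearest scited papers, which may differ from what the study actually does.

```python
def user_centric_scores(similarity, k):
    """Mean of the k largest similarities between each unscited paper
    and the scited papers."""
    scores = []
    for row in similarity:
        top_k = sorted(row, reverse=True)[:k]
        scores.append(sum(top_k) / len(top_k))
    return scores


def collective_scores(similarity, scites, k):
    """Similarity-weighted average of the scite counts of the k scited
    papers closest to each unscited paper (an assumed interpretation)."""
    scores = []
    for row in similarity:
        nearest = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        total_weight = sum(row[j] for j in nearest)
        weighted = sum(row[j] * scites[j] for j in nearest)
        scores.append(weighted / total_weight if total_weight else 0.0)
    return scores


def top_n(scores, n):
    """Indices of the n highest-scoring unscited papers, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
```

With either score list, `top_n(scores, n)` gives the indices of the papers to recommend. The first rule depends only on the user's own history, while the second also pulls in the community's scite counts.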

We test the above models using measures commonly used for recommender systems, in particular the hit rate and the diversity of the suggested papers. We find that the user-centric model works well at suggesting new papers that are of interest to the user, and achieves a hit rate of ~4% while still maintaining a high diversity. The collective method, instead, suggests papers that are less relevant to the user and are not even very popular on SciRate (but are likely quite similar to hyped papers the user scited).

A baseline model that suggests papers at random achieves a hit rate of ~0.2%, which hints that the user-centric model created here might indeed do a good job at suggesting papers. We end the study with a look at the top-N lists generated by the two models.
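For readers unfamiliar with these two measures, the sketch below uses their common definitions, which are assumed here rather than taken from the study: the hit rate is the fraction of held-out scited papers that reappear in the top-N list produced without them, and the diversity is one minus the mean pairwise similarity of the recommended papers. The function names and the `pair_similarity` callback are hypothetical.

```python
from itertools import combinations


def hit_rate(top_n_lists, held_out):
    """Fraction of trials in which the held-out paper appears in the
    top-N list generated for that trial."""
    hits = sum(1 for recs, target in zip(top_n_lists, held_out) if target in recs)
    return hits / len(held_out)


def diversity(recommended, pair_similarity):
    """One minus the mean pairwise similarity of the recommended papers."""
    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0  # diversity is not defined for fewer than two papers
    mean_similarity = sum(pair_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

Under these definitions, a random recommender serves as a natural baseline: its hit rate is roughly N divided by the number of candidate papers.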

Sections

  • Download SciRate history
  • Understanding user history
  • Downloading unscited papers
  • Data-preprocessing
    • Embedding titles and abstracts
    • Embedding authors
  • Similarity Matrix
  • Content-Based Filtering
    • Computing the hit rate
    • Computing the diversity
    • Producing the top-N list

Make your recommender system

To train your model, you only need two things:

  1. Your SciRate history. Get it by signing in to your SciRate account and opening your profile. The history is available in JSON format.

  2. A list of papers you have not scited. We use the papers published on the arXiv since 2016 in the quant-ph category. You can get the full metadata corpus of the arXiv at this link. We include a small script in the repository that extracts the papers relevant to you from this corpus. Please open it, understand it, and change the year and category to suit your needs.
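To give an idea of what such an extraction involves, here is a minimal sketch. It is not the script shipped with the repository. It assumes the metadata corpus is a JSON-lines file in which each record has a space-separated `categories` string and an `update_date` field in `YYYY-MM-DD` format; the function name and these field names are assumptions and should be checked against the file you download.

```python
import json


def select_papers(lines, category="quant-ph", min_year=2016):
    """Yield the metadata records that list `category` and were last
    updated in `min_year` or later."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        categories = record.get("categories", "").split()
        # Assumed field: the year is read from the first four characters.
        year = int(record.get("update_date", "0000")[:4])
        if category in categories and year >= min_year:
            yield record
```

Because the function only iterates over lines, it can be fed an open file handle directly, so the full corpus never needs to be loaded into memory. Note that filtering on the last update date is a simplification: a paper first posted before `min_year` and revised later would also be selected.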
