WSDM 2020 Workshop

https://biendata.com/competition/wsdm2020/

ID: @nlp-rabbit

Prerequisites

Python >= 3.6

Reproduce the Result

Clone Code and Install Requirements

git clone https://github.com/supercoderhawk/wsdm-digg-2020
cd wsdm-digg-2020
pip3 install -r requirements.txt
python3 -m spacy download en

Setup ElasticSearch

  1. Set up the Elasticsearch service; refer to the official Elasticsearch documentation.

  2. Set the value of ES_BASE_URL in constants.py to your configured Elasticsearch endpoint.
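
For a default local installation the setting might look like the following; the endpoint value here is an assumption, substitute your own host and port:

# constants.py
ES_BASE_URL = 'http://localhost:9200'  # hypothetical endpoint; use your own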

Prepare Data

  1. Unzip the downloaded data file and put all files under the data/ folder; rename test.csv to test_release.csv.

  2. Download the model, unzip it, and put the files into the data/ folder.

  3. Execute bash scripts/prepare_data.sh in the project root folder to build the data for the next step.

Execute the Retrieval Process End-to-End

  • Execute bash scripts/run_end2end.sh in the project root folder.

Details

The above script includes two main parts:

  1. Query Elasticsearch to retrieve candidate papers (see the query sketch after this list).

    The core logic is in search/search.py, which is called by benchmark/benchmark.py.

  2. Rerank the candidates with BERT.

    The core logic is in reranking/predict.py; the model code is in reranking/plm_rerank.py.
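
For orientation, here is a minimal sketch of the kind of BM25 match query the retrieval step issues, using the standard elasticsearch Python client. The index name, field name, and query text are hypothetical placeholders, not the project's actual schema:

from elasticsearch import Elasticsearch

# connect to the endpoint configured as ES_BASE_URL
es = Elasticsearch('http://localhost:9200')

# BM25 is Elasticsearch's default similarity, so a plain match
# query already scores candidates by BM25
response = es.search(
    index='papers',  # hypothetical index name
    body={
        'query': {'match': {'abstract': 'graph neural network citation'}},
        'size': 20,  # number of candidates to recall
    },
)

for hit in response['hits']['hits']:
    print(hit['_id'], hit['_score'])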

Basic Algorithm Architecture

  1. recall phase

    1. keyword and keyphrase extraction (see the extraction sketch after this list)

      1. noun chunk extraction

      2. TextRank keyword extraction

      3. candidate keyword filtering, keeping only nouns, proper nouns, and adjectives

    2. BM25-based search (Elasticsearch)

  2. rerank phase

    BERT-based rerank (SciBERT from AllenAI); a single model, without any ensemble methods

    the training data is built from the first-stage (BM25) search results

    the loss is a margin loss (hinge loss), which is widely used in ranking; see the loss sketch under Train the Model below
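
A minimal sketch of the recall-phase candidate extraction, using spaCy noun chunks plus part-of-speech filtering. The TextRank step is omitted, and the model name, stop-word handling, and filtering details are assumptions rather than the project's exact code:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model name

ALLOWED_POS = {'NOUN', 'PROPN', 'ADJ'}  # nouns, proper nouns, adjectives

def extract_candidates(text):
    doc = nlp(text)
    # multi-word keyphrase candidates from noun chunks
    phrases = [chunk.text.lower() for chunk in doc.noun_chunks]
    # single-word keyword candidates filtered by part of speech
    keywords = [token.text.lower() for token in doc
                if token.pos_ in ALLOWED_POS and not token.is_stop]
    return phrases, keywords

phrases, keywords = extract_candidates(
    'Graph neural networks have been applied to citation recommendation.')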

Train the Model

The only model that needs to be trained is the BERT-based reranking model.

# prepare training data for reranking
bash scripts/prepare_rerank.sh

# train the rerank model
bash scripts/train_rerank.sh

# predict the result
bash scripts/predict_rerank.sh
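
As noted above, the reranker is trained with a margin (hinge) loss over positive/negative pairs built from the BM25 results. Below is a minimal PyTorch sketch of that loss; the scores and the margin value are illustrative, not the project's actual hyperparameters:

import torch
import torch.nn as nn

# hinge formulation: loss = max(0, margin - (positive_score - negative_score))
loss_fn = nn.MarginRankingLoss(margin=1.0)  # margin value is an assumption

pos_scores = torch.tensor([2.5, 0.3])  # model scores for relevant papers
neg_scores = torch.tensor([1.0, 0.8])  # model scores for sampled negatives
target = torch.ones_like(pos_scores)   # +1 means pos should outrank neg

loss = loss_fn(pos_scores, neg_scores, target)
print(loss.item())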

Others

  1. In this project, the abbreviation plm stands for Pretrained Language Model.

  2. Methods that were tried but were not effective:

    1. BERT-KNRM and BERT-ConvKNRM (paper: CEDR: Contextualized Embeddings for Document Ranking); code in reranking/plm_knrm.py and reranking/plm_conv_knrm.py

    2. BERT-based sentence vectorization, following the paper Universal Sentence Encoder but with the BERT CLS output in place of a vanilla transformer trained from scratch; code in vectorization/plm_vectorization.py and vectorization/predict.py (see the sketch below)
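
A minimal sketch of extracting a CLS sentence vector from SciBERT with the Hugging Face transformers library; this illustrates the general technique, not the project's actual vectorization code:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
model.eval()

inputs = tokenizer('Contextual embeddings improve document ranking.',
                   return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
print(cls_vector.shape)  # (1, 768) for SciBERT base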

Related Papers

[1] Understanding the Behaviors of BERT in Ranking