Article Linking

This repository contains a description and supporting code for CSET's current method of cross-dataset article linking. Note that we use "article" very loosely, although in a way that to our knowledge is fairly consistent across corpora. Books, for example, are included.

For each article in arXiv, WoS, Papers With Code, Semantic Scholar, The Lens, and OpenAlex, we normalized titles, abstracts, and author last names. For the purpose of matching, we filtered out titles, abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles within or across datasets that shared at least one of the following (non-null) metadata fields:

Normalized title
Normalized abstract
Citations
DOI

as well as a match on one additional field from the list above, or on either of

Publication year
Normalized author last names

to correspond to one article in the merged dataset. We add to this set "near matches" of the concatenation of the normalized title and abstract within a publication year, which we identify using simhash.
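The matching rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation (which runs as SQL in the DAG): the field names (`title_norm`, `abstract_norm`, `citations`, `doi`, `year`, `last_names_norm`), the helper names, and the choice to form groups as connected components of pairwise matches are all hypothetical. Simhash near-matching is not covered here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names; the real schemas may differ.
# A shared value in one of these is required for any match.
STRONG_FIELDS = ("title_norm", "abstract_norm", "citations", "doi")
# These can only supply the second, supporting match.
SUPPORTING_FIELDS = ("year", "last_names_norm")
# Per the text, only titles, abstracts, and DOIs are frequency-filtered.
FILTERED_FIELDS = ("title_norm", "abstract_norm", "doi")
MAX_OCCURRENCES = 10


def drop_frequent_values(records):
    """Null out titles, abstracts, and DOIs seen more than MAX_OCCURRENCES times."""
    counts = {
        f: Counter(r[f] for r in records if r.get(f) is not None)
        for f in FILTERED_FIELDS
    }
    cleaned = []
    for r in records:
        r = dict(r)  # copy so the caller's records are not modified
        for f in FILTERED_FIELDS:
            if r.get(f) is not None and counts[f][r[f]] > MAX_OCCURRENCES:
                r[f] = None
        cleaned.append(r)
    return cleaned


def shared_fields(a, b, fields):
    """Return the fields on which both records have the same non-null value."""
    return [f for f in fields if a.get(f) is not None and a.get(f) == b.get(f)]


def is_match(a, b):
    """At least one strong field, plus one more strong or supporting field."""
    strong = shared_fields(a, b, STRONG_FIELDS)
    supporting = shared_fields(a, b, SUPPORTING_FIELDS)
    return len(strong) >= 1 and len(strong) + len(supporting) >= 2


def link(records):
    """Group record ids into connected components of pairwise matches.

    Assumption: transitive grouping via union-find. The pairwise loop is
    quadratic and only suitable for toy inputs.
    """
    records = drop_frequent_values(records)
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if is_match(a, b):
            parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), set()).add(r["id"])
    return sorted(sorted(g) for g in groups.values())
```

Under this sketch, two records that share only a publication year and author last names are not linked, because neither field appears in the first list.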

To do this, we run linkage_dag.py on Airflow. The article linkage runs weekly, triggered by the scholarly_lit_trigger DAG.

For an English description of what the DAG does, see the documentation.

How to use the linkage tables (CSET only)

We have three tables that are most likely to help you use article linkage.

gcp_cset_links_v2.article_links - For each original ID (e.g., from WoS), gives the corresponding CSET ID. This is a many-to-one mapping. Please update your scripts to use gcp_cset_links_v2.article_links_with_dataset, which has an additional column that contains the dataset of the orig_id.
gcp_cset_links_v2.all_metadata_with_cld2_lid - Provides CLD2 language identification (LID) for the titles and abstracts of each current version of each article's metadata. You can also use this table to get the metadata used in the match for each version of the raw articles. Note that the id column is not unique, as some corpora such as WoS have multiple versions of the metadata for different languages.
gcp_cset_links_v2.article_merged_metadata - This maps the CSET merged_id to a set of merged metadata. The merging method takes the maximum value of each metadata field across the matched articles, which may not be suitable for your purposes.
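To make the many-to-one mapping and the "maximum value" merge concrete, here is a small in-memory sketch in Python. It does not query the tables. The rows, IDs, and the `title` and `year` fields are made up for illustration, and `merge_metadata` is a hypothetical helper. It also assumes one metadata version per original ID, which, as noted above, does not hold for every corpus.

```python
from collections import defaultdict

# Toy rows shaped like article_links: one row per original ID (made-up values).
article_links = [
    {"orig_id": "wos_1", "merged_id": "merged_A"},
    {"orig_id": "arxiv_1", "merged_id": "merged_A"},
    {"orig_id": "s2_5", "merged_id": "merged_B"},
]

# Toy per-record metadata keyed by original ID (assumes one version per ID).
metadata = {
    "wos_1": {"title": "deep learning", "year": 2015},
    "arxiv_1": {"title": "deep learning v2", "year": 2014},
    "s2_5": {"title": "unrelated", "year": None},
}


def merge_metadata(links, records, fields):
    """For each merged_id, take the maximum non-null value of each field."""
    by_merged = defaultdict(list)
    for row in links:
        by_merged[row["merged_id"]].append(records[row["orig_id"]])

    merged = {}
    for merged_id, recs in by_merged.items():
        merged[merged_id] = {}
        for field in fields:
            values = [r[field] for r in recs if r.get(field) is not None]
            # max() on strings is lexicographic; on years it is the latest year.
            merged[merged_id][field] = max(values) if values else None
    return merged


merged = merge_metadata(article_links, metadata, ["title", "year"])
```

In this toy data, `merged["merged_A"]` ends up with the title from `arxiv_1` and the year from `wos_1`. A field-wise maximum can combine values from different source records, which is one reason the merged metadata may not suit every use.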
