The RUB Corpus and Code

Russian-Language Corpus & Lexicon-Based Sentiment Analysis

RUB Corpus and Code are two downloadable, open-source collections.

The RUB Corpus is a collection of Russian-language official government speeches, interviews, and press releases made by top policymakers in Russia, Ukraine, and Belarus from 2006 to 2016.

The first image at the top of this README is a screenshot of the corpus collection for Belarus.

The Code represents the programs used to compile the RUB Corpus and to conduct a lexicon-based sentiment analysis upon the RUB Corpus.

The second image above is a screenshot of a lemmatisation function, which is part of the sentiment analysis Code.

The sentiment analysis was conducted using a modified version of the lexicon created by Loukachevitch and Levchik (2016).

Usage

If you use any of the RUB Code scripts, please cite the project as follows:

Braga, P. (2020). RUB Corpus and Code. Project repository. Available at: https://github.com/pjbraga/rub_corpus_and_code.

The RUB Corpus Files

The RUB Corpus and Code can be used in several ways.

The Corpus is a collection of 71,515 Russian-language texts published between 1 January 2006 and 31 December 2016.

The Corpus is divided by country (russia_all_texts.tsv.zip, ukraine_all_texts.tsv.zip, belarus_all_texts.tsv.zip) in plain text .tsv (tab-separated values) files.

The Corpus files can be used with any language suited to text processing and handling .tsv files (such as Python or Perl).
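
As a minimal sketch of reading one of the country files, the following uses only the Python standard library. It assumes each archive holds a single UTF-8 .tsv file, and it makes no assumption about column names or order, which are not documented here. The function name iter_tsv_rows is hypothetical and is not part of the RUB Code.

```python
import csv
import io
import zipfile


def iter_tsv_rows(zip_path, encoding="utf-8"):
    """Yield each row of the first .tsv member of a zip archive as a list of strings.

    Assumption: the archive contains one UTF-8 encoded .tsv file.
    """
    # Speeches can be long, so raise the csv module's per-field size cap.
    csv.field_size_limit(2**31 - 1)
    with zipfile.ZipFile(zip_path) as archive:
        # Skip macOS resource-fork entries, which some zip tools add.
        tsv_names = [
            name for name in archive.namelist()
            if name.endswith(".tsv") and not name.startswith("__MACOSX/")
        ]
        if not tsv_names:
            raise ValueError(f"no .tsv file found in {zip_path}")
        with archive.open(tsv_names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # QUOTE_NONE treats quotation marks inside the texts as ordinary characters.
            yield from csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
```

For example, `next(iter_tsv_rows("belarus_all_texts.tsv.zip"))` would return the first row of the Belarus file, assuming the archive has been downloaded to the working directory.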

The sources for the RUB Corpus texts are online, official government archives. Therefore, the data is already within the public domain and can be used however the user sees fit.

The Code (Sentiment Analysis) Files

The Code is a series of programs used to: (1) assemble the RUB Corpus; and (2) conduct a Russian-language, lexicon-based sentiment analysis of the RUB Corpus.

All the Code scripts are written in Python.

Thus, to use the Code, it is necessary to have some basic knowledge of Python and of certain packages associated with it (such as pandas, the Natural Language Toolkit, and Beautiful Soup).

In addition, considering this code deals with Russian-language texts, it helps to have some knowledge of Russian as well.
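
To illustrate what a lexicon-based approach involves in general, here is a minimal sketch. It is not the RUB Code itself: the tiny lexicon, the lemma table standing in for a real lemmatiser, and the averaging rule are all assumptions made for illustration. The actual scripts use a modified version of the Loukachevitch and Levchik (2016) lexicon and their own lemmatisation function.

```python
import re

# Toy lexicon for illustration only: lemma -> polarity (+1 positive, -1 negative).
# This is NOT the lexicon used by the RUB Code.
TOY_LEXICON = {"хороший": 1, "успех": 1, "плохой": -1, "кризис": -1}

# Toy lookup table standing in for a real lemmatiser (hypothetical).
TOY_LEMMAS = {
    "хорошая": "хороший",
    "успеха": "успех",
    "плохие": "плохой",
    "кризиса": "кризис",
}


def tokenize(text):
    """Lowercase the text and return its runs of Cyrillic letters."""
    return re.findall(r"[а-яё]+", text.lower())


def sentiment_score(text, lexicon=TOY_LEXICON, lemmas=TOY_LEMMAS):
    """Return the mean polarity per token; words missing from the lexicon count as 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    # Map each token to its lemma (or itself), then look up the lemma's polarity.
    total = sum(lexicon.get(lemmas.get(token, token), 0) for token in tokens)
    return total / len(tokens)
```

With these toy tables, `sentiment_score("Это успех")` gives 0.5, because one of the two tokens is a positive lexicon entry. Dividing by the token count keeps long and short texts comparable, which is one possible design choice among several.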

The various .py (Python) scripts contain multiple notes and comments, which are intended to make the code easier to understand.

All the scripts uploaded here begin with a standard file header, and then a comment block with the following four headings:

Description (a one-to-two sentence explanation of the script)
Requirements (necessary prerequisites to run the script)
Summary of Code (synopsis of the layout and function of the script)
Notes (any peculiarities or potential issues with the code)

Reporting Issues

For any issues with the RUB Corpus and Code repositories, please use GitHub.

Contact

For general questions about this project or any ideas for academic collaboration, contact Peter Braga at: pjbraga.rubcc@gmail.com.

References

Loukachevitch, N. and Levchik, A. (2016). Creating a General Russian Sentiment Lexicon. In Proceedings of Language Resources and Evaluation Conference LREC-2016. Available at: http://www.lrec-conf.org/proceedings/lrec2016/pdf/285_Paper.pdf.
