A large parallel corpus of English and Japanese

rpryzant/JESC

JESC Code Release

Welcome to the JESC code release! This repo contains the crawlers, parsers, aligners, and various tools used to create the Japanese-English Subtitle Corpus (JESC).

Requirements

Install the Python dependencies with pip: pip install -r requirements.txt

Additionally, some of the corpus_processing scripts use google/sentencepiece; installation instructions are on its GitHub page.

Instructions

Each file is a standalone tool with usage instructions given in the comment header. These files are organized into the following categories (subdirectories):

  • corpus_generation: Scripts for downloading, parsing, and aligning subtitles from the internet.

  • corpus_cleaning: Scripts for converting file formats, thresholding on length ratios, and spellchecking.

  • corpus_processing: Scripts for manipulating completed datasets, including tokenization and train/test/dev splitting.
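As a rough illustration of two of the steps named above (thresholding on length ratios and train/dev/test splitting), here is a minimal Python sketch. It is not the code from this repo: the function names `within_length_ratio` and `split_pairs`, the character-based length measure, the `max_ratio=4.0` default, and the sample sentence pairs are all assumptions made for the example. The actual scripts may measure length in tokens and use different thresholds, so check each script's comment header for its real usage.

```python
import random


def within_length_ratio(en, ja, max_ratio=4.0):
    """Return True if the longer side is at most max_ratio times the shorter side.

    Lengths are counted in characters here (an assumption for this sketch).
    """
    len_en, len_ja = len(en.strip()), len(ja.strip())
    if len_en == 0 or len_ja == 0:
        return False  # drop pairs where either side is empty
    return max(len_en, len_ja) / min(len_en, len_ja) <= max_ratio


def split_pairs(pairs, dev_size, test_size, seed=0):
    """Shuffle pairs reproducibly, carve off dev and test sets, keep the rest as train."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


# Hypothetical (English, Japanese) subtitle pairs, for illustration only.
pairs = [
    ("I'm home.", "ただいま"),
    ("Thank you.", "ありがとう"),
    ("Yes, this is a very long English sentence that goes on and on.", "はい"),
    ("", "空"),
    ("Let's go.", "行こう"),
    ("Good morning.", "おはよう"),
]

# Keep only pairs whose lengths are plausibly aligned, then split them.
kept = [pair for pair in pairs if within_length_ratio(*pair)]
train, dev, test = split_pairs(kept, dev_size=1, test_size=1)
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when dev and test sets are used to compare models trained at different times.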

Citation

Please give the proper citation or credit if you use these data:

@ARTICLE{pryzant_jesc_2017,
   author = {{Pryzant}, R. and {Chung}, Y. and {Jurafsky}, D. and {Britz}, D.},
    title = "{JESC: Japanese-English Subtitle Corpus}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1710.10639},
 keywords = {Computer Science - Computation and Language},
     year = 2017,
    month = oct,
}
