awesome-japanese-nlp-resources

License: CC0-1.0

A curated list of resources for Japanese NLP: Python libraries, LLMs, dictionaries, and corpora

English | 日本語 (Japanese) | 繁體中文 (Chinese) | 简体中文 (Chinese)

The latest additions 🎉

Improve slow page loading issues

Removed the statistics table from README.md. Please refer to README.full.md for the previous pages.

Hugging Face 🤗

Dictionary and IME

  • azookey-desktop - Japanese Input Method azooKey for Desktop, supporting macOS
  • fcitx5-hazkey - Japanese input method for fcitx5, powered by azooKey engine

Python

  • Jusho - Easy wrapper for the postal code data of Japan

Updated on Aug 07, 2024

Contents

Python library

Morphology analysis

  • sudachi.rs - SudachiPy 0.6* and above are developed as Sudachi.rs.
  • Janome - Japanese morphological analysis engine written in pure Python
  • mecab-python3 - mecab-python. You can find the original version here: http://taku910.github.io/mecab/
  • mecab - This repository is for building Windows 64-bit MeCab binary and improving MeCab Python binding.
  • fugashi - A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
  • nagisa - A Japanese tokenizer based on recurrent neural networks
  • pyknp - A Python Module for JUMAN++/KNP
  • Mykytea-python - Python wrapper for KyTea
  • konoha - Konoha: Simple wrapper of Japanese Tokenizers
  • natto-py - natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
  • rakutenma-python - Rakuten MA (Python version)
  • python-vaporetto - Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
  • dango - An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
  • rhoknp - Yet another Python binding for Juman++/KNP
  • python-vibrato - Viterbi-based accelerated tokenizer (Python wrapper)
  • jagger-python - Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer)

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Parsing

  • ginza - A Japanese NLP Library using spaCy as framework based on Universal Dependencies
  • cabocha - Yet Another Japanese Dependency Structure Analyzer
  • UniDic2UD - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese
  • camphr - Camphr - NLP library for creating pipeline components
  • SuPar-UniDic - Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models
  • depccg - A* CCG Parser with a Supertag and Dependency Factored Model
  • bertknp - A Japanese dependency parser based on BERT
  • esupar - Tokenizer POS-Tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models for Japanese and other languages
  • yomikata - Heteronym disambiguation library using a fine-tuned BERT model.
  • jdepp-python - Python binding for J.DepP (C++ implementation of Japanese Dependency Parsers)

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Converter

  • pykakasi - Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
  • cutlet - Japanese to romaji converter in Python
  • alphabet2kana - Convert English alphabet to Katakana
  • Convert-Numbers-to-Japanese - Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
  • mozcpy - Mozc for Python: Kana-Kanji converter
  • jamorasep - Japanese text parser that separates Hiragana/Katakana strings into morae (syllables).
  • text2phoneme - Script to convert Japanese text into phoneme sequence.
  • jntajis-python - A fast character conversion and transliteration library based on the scheme defined for Japan National Tax Agency (国税庁) 's corporate number (法人番号) system.
  • wiredify - Convert Japanese kana from ba-bi-bu-be-bo into va-vi-vu-ve-vo
  • mecab-text-cleaner - Simple Python package (CLI/Python API) for getting Japanese readings (yomigana) and accents using MeCab.
  • pynormalizenumexp - Python implementation of NormalizeNumexp for extracting and normalizing quantity expressions and time expressions.
  • Jusho - Easy wrapper for the postal code data of Japan

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Preprocessor

  • neologdn - Japanese text normalizer for mecab-neologd
  • jaconv - A Python-based tool for converting Japanese characters between Hiragana, Katakana, Hankaku, and Zenkaku.
  • mojimoji - A quick converter for Japanese half-width and full-width characters.
  • text-cleaning - A powerful text cleaner for Japanese web texts
  • HojiChar - A text preprocessing tool that configures and manages multiple preprocessing steps.
  • utsuho - Utsuho is a Python module that facilitates bidirectional conversion between half-width katakana and full-width katakana in Japanese.
  • python-habachen - Yet Another Fast Japanese String Converter

To check the statistics table (GitHub stars/Downloads), please refer to this page.
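
As a rough illustration of the half-width/full-width conversion that several of the preprocessors above deal with, here is a minimal sketch using only Python's standard `unicodedata` module. It does not use or reproduce the API of any library listed here. The function name `normalize_width` is hypothetical, and NFKC is a coarser transformation than what the dedicated tools offer, since it also folds other compatibility characters.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold half-width katakana to full-width and full-width ASCII to half-width.

    Assumption: Unicode NFKC normalization is an acceptable stand-in for
    width conversion; the listed libraries give finer-grained control.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(normalize_width("ﾊﾟﾝ ＡＢＣ１２３"))
```

NFKC also recombines a half-width kana with its separate voicing mark into a single full-width character, which is why no lookup table is needed in this sketch.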

Sentence splitter

  • Bunkai - Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
  • japanese-sentence-breaker - Japanese Sentence Breaker
  • sengiri - Yet another sentence-level tokenizer for the Japanese text
  • budoux - Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool.
  • ja_sentence_segmenter - Japanese sentence segmentation library for Python
  • hasami - A tool to perform sentence segmentation on Japanese text
  • kuzukiri - Japanese Text Segmenter for Python written in Rust
  • ja-senter-benchmark - Comparison of Japanese Sentence Segmentation Tools

To check the statistics table (GitHub stars/Downloads), please refer to this page.
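
To give a sense of the baseline that the sentence splitters above improve on, here is a minimal rule-based sketch that splits only on sentence-final punctuation (。！？). It is not the method of any listed tool. The function name `split_sentences` is hypothetical, and the sketch deliberately ignores quotations, brackets, and line breaks, which are the cases dedicated tools are designed to handle.

```python
import re
from typing import List

# A run of non-terminator characters followed by any trailing terminators.
_SENTENCE = re.compile(r"[^。！？]+[。！？]*")


def split_sentences(text: str) -> List[str]:
    """Naively split Japanese text after 。, ！ or ？ (hypothetical helper)."""
    sentences = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence:  # skip whitespace-only fragments
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    for s in split_sentences("今日は晴れです。明日は雨でしょうか？はい！"):
        print(s)
```

A splitter like this breaks inside quoted speech such as 「行こう。」と言った, which is one reason to prefer the libraries listed above for real text.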

Sentiment analysis

  • oseti - Dictionary based Sentiment Analysis for Japanese
  • negapoji - Japanese document sentiment analysis to determine negative or positive.
  • pymlask - Emotion analyzer for Japanese text
  • asari - Japanese sentiment analyzer implemented in Python.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Machine translation

  • jparacrawl-finetune - An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
  • JASS - JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020) & Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (ACM TALLIP)
  • PheMT - A phenomenon-wise evaluation dataset for Japanese-English machine translation robustness. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant. COLING 2020.
  • VISA - An ambiguous subtitles dataset for visual scene-aware machine translation

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Named entity recognition

  • namaco - Character Based Named Entity Recognition.
  • entitypedia - Entitypedia is an Extended Named Entity Dictionary from Wikipedia.
  • noyaki - Converts character span label information to tokenized text-based label information.
  • bert-japanese-ner-finetuning - This is a sample code for creating and using a model for named entity recognition task through finetuning of the BERT model.
  • joint-information-extraction-hs - Code for inferring the accuracy of named entity and relation extraction from a case report corpus based on detailed annotation criteria.
  • pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
  • bert-ner-japanese - Program for fine-tuning Japanese named entity recognition using BERT

To check the statistics table (GitHub stars/Downloads), please refer to this page.

OCR

  • Manga OCR - Optical character recognition for Japanese text, with the main focus being Japanese manga
  • mokuro - Read Japanese manga inside browser with selectable text.
  • handwritten-japanese-ocr - Handwritten Japanese OCR demo using touch panel to draw the input text using Intel OpenVINO toolkit
  • OCR_Japanease - Japanese OCR
  • ndlocr_cli - NDLOCR application
  • donut - Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
  • JMTrans - Manga translator - retrieve Japanese manga from URL to translate manga images.
  • Kindai-OCR - OCR system for recognizing modern Japanese magazines
  • text_recognition - Text recognition module for NDLOCR.
  • Poricom - Optical character recognition in manga images. Manga OCR desktop application

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Tool for pretrained models

  • JGLUE - JGLUE: Japanese General Language Understanding Evaluation
  • ginza-transformers - Use custom tokenizers in spacy-transformers
  • t5_japanese_dialogue_generation - Conversation generation using T5.
  • japanese_text_classification - To investigate various DNN text classifiers including MLP, CNN, RNN, BERT approaches.
  • Japanese-BERT-Sentiment-Analyzer - Deploying sentiment analysis server with FastAPI and BERT
  • jmlm_scoring - Masked Language Model-based Scoring for Japanese and Vietnamese
  • allennlp-shiba-model - AllenNLP integration for Shiba: Japanese CANINE model
  • evaluate_japanese_w2v - script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
  • gector-ja - BERT-based GEC tagging for Japanese
  • Japanese-BPEEncoder - Japanese BPE encoder
  • Japanese-BPEEncoder_V2 - Japanese-BPEEncoder Version 2
  • transformer-copy - Japanese grammar error correction tool
  • japanese-stable-diffusion - Japanese Stable Diffusion is a Japanese specific latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
  • nagisa_bert - A BERT model for Nagisa.
  • prefix-tuning-gpt - Example code for prefix-tuning GPT/GPT-NeoX models and for inference with trained prefixes
  • JGLUE-benchmark - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
  • jptranstokenizer - A Japanese tokenizer (word segmentation tool) built for the Transformers library.
  • jp-stable - JP Language Model Evaluation Harness
  • compare-ja-tokenizer - How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese - ACL SRW 2023
  • lm-evaluation-harness-jp-stable - A framework for few-shot evaluation of autoregressive language models.
  • llm-lora-classification - llm-lora-classification
  • rinna_gpt-neox_ggml-lora - Scripts and merge scripts modified to apply an Alpaca-LoRA adapter for LoRA tuning, assuming a "rinna/japanese-gpt-neox..." [gpt-neox] model converted to ggml.
  • japanese-llm-roleplay-benchmark - This repository was created to evaluate the performance of character role-playing in Japanese LLM.
  • japanese-llm-ranking - This repository supports YuzuAI's Rakuda leaderboard of Japanese LLMs, which is a Japanese-focused analogue of LMSYS' Vicuna eval.
  • llm-jp-eval - This tool is designed to automatically evaluate large-scale Japanese language models across multiple datasets.
  • llm-jp-sft - This repository contains the code for supervised fine-tuning of LLM-jp models.
  • llm-jp-tokenizer - Repository collecting the tokenizers for the LLMs being developed by the LLM Study Group (LLM-jp).
  • japanese-lm-fin-harness - Financial evaluation harness for Japanese language models
  • ja-vicuna-qa-benchmark - Japanese Vicuna QA Benchmark
  • swallow-evaluation - Swallow Project Large-Scale Language Model Evaluation Script

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • namedivider-python - A tool for dividing the Japanese full name into a family name and a given name.
  • asa-python - A curated list of resources dedicated to Python libraries of NLP for Japanese
  • python_asa - Python-based Japanese semantic role labeling system (ASA)
  • toiro - A comparison tool of Japanese tokenizers
  • ja-timex - A rule-based parser for extracting/normalizing time expressions written in natural language.
  • JapaneseTokenizers - A set of metrics for feature selection from text data
  • daaja - This repository has implementations of data augmentation for NLP for Japanese.
  • accel-brain-code - The purpose of this repository is to make prototypes as case study in the context of proof of concept(PoC) and research and development(R&D) that I have written in my website. The main research topics are Auto-Encoders in relation to the representation learning, the statistical machine learning for energy-based models, adversarial generation net…
  • kyoto-reader - A processor for KyotoCorpus, KWDLC, and AnnotatedFKCCorpus
  • nlplot - Visualization Module for Natural Language Processing
  • rake-ja - Rapid Automatic Keyword Extraction algorithm for Japanese
  • jel - Japanese Entity Linker.
  • MedNER-J - Latest version of MedEX/J (Japanese disease name extractor)
  • zunda-python - Zunda: Japanese Enhanced Modality Analyzer client for Python.
  • AIO2_DPR_baseline - https://www.nlp.ecei.tohoku.ac.jp/projects/aio/
  • showcase - A PyTorch implementation of the Japanese Predicate-Argument Structure (PAS) analyser presented in the paper of Matsubayashi & Inui (2018) with some improvements.
  • darts-clone-python - Darts-clone python binding
  • jrte-corpus_example - Example codes for the Japanese Realistic Textual Entailment Corpus.
  • desuwa - Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
  • HotPepperGourmetDialogue - Restaurant Search System through Dialogue in Japanese.
  • nlp-recipes-ja - Samples codes for natural language processing in Japanese
  • Japanese_nlp_scripts - Small example scripts for working with Japanese texts in Python
  • DNorm-J - Japanese version of DNorm
  • pyknp-eventgraph - EventGraph is a development platform for high-level NLP applications in Japanese.
  • ishi - Ishi: A volition classifier for Japanese
  • python-npylm - Unsupervised morphological analysis using a Bayesian hierarchical language model.
  • python-npycrf - Semi-supervised morphological analysis through integration of conditional random fields and Bayesian hierarchical language models.
  • unsupervised-pos-tagging - Unsupervised part-of-speech tagging
  • negima - Negima is a Python package to extract phrases in Japanese text by using part-of-speech-based rules that you define.
  • YouyakuMan - Extractive summarizer using BertSum as summarization model
  • japanese-numbers-python - A parser for Japanese number (Kanji, arabic) in the natural language.
  • kantan - Look up Japanese words by radical patterns
  • make-meidai-dialogue - Obtain a corpus of Japanese dialogue.
  • japanese_summarizer - A summarizer for Japanese articles.
  • chirptext - ChirpText is a collection of text processing tools for Python.
  • yubin - Japanese address munger
  • jawiki-cleaner - Japanese Wikipedia Cleaner
  • japanese2phoneme - A Python library to convert Japanese to phonemes.
  • anlp_nlp2021_d3-1 - This repository contains codes related to the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification"
  • aozora_classification - This project classifies Japanese sentences by how closely they resemble the style of classical Japanese writers such as Soseki Natsume, Ogai Mori, and Ryunosuke Akutagawa.
  • aozora-corpus-generator - Generates plain or tokenized text files from the Aozora Bunko
  • JLM - A fast LSTM Language Model for large vocabulary language like Japanese and Chinese
  • NTM - Testing of Neural Topic Modeling for Japanese articles
  • EN-JP-ML-Lexicon - This is an English-Japanese lexicon for Machine Learning and Deep Learning terminology.
  • text-generation - Easy-to-use scripts to fine-tune GPT-2-JA with your own texts, to generate sentences, and to tweet them automatically.
  • chainer_nic - Neural Image Caption (NIC) on chainer, its pretrained models on English and Japanese image caption datasets.
  • unihan-lm - The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
  • mbart-finetuning - Code to perform finetuning of the mBART model.
  • xvector_jtubespeech - Model xvector on jtubespeech.
  • TinySegmenterMaker - A tool for creating a custom learning model for TinySegmenter.
  • Grongish - Script for mutual conversion between Japanese and Gurongi language.
  • WordCloud-Japanese - A script that enables morphological analysis-like display of Japanese sentences in WordCloud without using Mecab (a morphological analysis engine).
  • snark - DB access library using Japanese WordNet
  • toEmoji - Something that converts Japanese sentences into sentences made up of only emojis.
  • termextract - Practice implementing a specialized terminology extraction algorithm.
  • JDT-with-KenLM-scoring - Scoring is performed using an N-gram language model by KenLM on response candidates from Japanese-Dialog-Transformer, followed by filtering or re-ranking.
  • mixture-of-unigram-model - Mixture of Unigram Model and Infinite Mixture of Unigram Model in Python.
  • hidden-markov-model - Hidden Markov Model (HMM) and Infinite Hidden Markov Model (iHMM) in Python.
  • Ngram-language-model - N-gram language model in Python.
  • ASRDeepSpeech - Automatic Speech Recognition with deepspeech2 model in pytorch with support from Zakuro AI.
  • neural_ime - Neural IME: Neural Input Method Engine
  • neural_japanese_transliterator - Can neural networks transliterate Romaji into Japanese correctly?
  • tinysegmenter - Tokenizer specialized for Japanese
  • AugLy-jp - Data Augmentation for Japanese Text on AugLy
  • furigana4epub - A Python script for adding furigana to Japanese epub books using Mecab and Unidic.
  • PyKatsuyou - Japanese verb/adjective inflections tool
  • jageocoder - Pure Python Japanese address geocoder
  • pygeonlp - pygeonlp, A python module for geotagging Japanese texts.
  • nksnd - A new kana-kanji conversion engine
  • JaMIE - A Japanese Medical Information Extraction Toolkit
  • fasttext-vs-word2vec-on-twitter-data - This is a comparison between fasttext and word2vec, as well as execution and learning scripts.
  • minimal-search-engine - Smallest search engine/PageRank/tf-idf
  • 5ch-analysis - Scraping past logs from 5ch and conducting tracking investigations on words that were popular in the past (e.g. kagutsushi, orz).
  • tweet_extructor - Tweet downloader for Japanese sentiment analysis dataset on Twitter.
  • japanese-word-aggregation - Aggregating Japanese words based on Juman++ and ConceptNet5.5
  • jinf - A Japanese inflection converter
  • kwja - A unified language analyzer for Japanese
  • mlm-scoring-transformers - Reproduced package based on Masked Language Model Scoring (ACL2020).
  • ClipCap-for-Japanese - [PyTorch] ClipCap for Japanese
  • SAT-for-Japanese - [PyTorch] Show, Attend and Tell for Japanese
  • cihai - Python library for CJK (Chinese, Japanese, and Korean) language dictionary
  • marine - MARINE : Multi-task leaRnIng-based JapaNese accent Estimation
  • whisper-asr-finetune - Fine-tuning the Whisper ASR model.
  • japanese_chatbot - A PyTorch Implementation of japanese chatbot using BERT and Transformer's decoder
  • radicalchar - Radical character normalization library
  • akaza - Yet another Japanese IME for IBus/Linux
  • posuto - Japanese postal code data.
  • tacotron2-japanese - Tacotron2 implementation of Japanese
  • ibus-hiragana - Hiragana IME for IBus
  • furiganapad - Furigana pad
  • chikkarpy - Japanese synonym library
  • ja-tokenizer-docker-py - Mecab + NEologd + Docker + Python3
  • JapaneseEmbeddingEval - Japanese Embedding Evaluation
  • gptuber-by-langchain - GPT will become a YouTuber.
  • shuwa - Extend GNOME On-Screen Keyboard for Input Methods
  • japanese-nli-model - This repository provides the code for Japanese NLI model, a fine-tuned masked language model.
  • tra-fugu - A tool for Japanese-English translation and English-Japanese translation by using FuguMT
  • fugumt - This is a translation environment that uses a machine translation engine released on the Blue Forest Concept website. It is capable of translating input text strings and PDF files through a form.
  • JaSPICE - JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models
  • Retrieval-based-Voice-Conversion-WebUI-JP-localization - Japanese localization
  • pyopenjtalk - Python wrapper for OpenJTalk
  • yomigana-ebook - Make learning Japanese easier by adding readings for every kanji in the eBook
  • N46Whisper - Whisper based Japanese subtitle generator
  • japanese_llm_simple_webui - This is a simple web interface for Japanese compatible LLM (Large Language Model) such as Rinna-3.6B and OpenCALM.
  • pdf-translator - pdf-translator translates English PDF files into Japanese, preserving the original layout.
  • japanese_qa_demo_with_haystack_and_es - Sample Japanese question-answering system using Haystack, Elasticsearch, and Japanese Wikipedia
  • mozc-devices - Automatically exported from code.google.com/p/mozc-morse
  • natsume - A Japanese text frontend processing toolkit
  • vits-japros-webui - Gradio WebUI for training and speech synthesis with Japanese TTS (VITS)
  • ja-law-parser - A Japanese law parser
  • dictation-kit - Japanese speech recognition (dictation) kit using Julius
  • julius4seg - Segmentation support tool using Julius
  • voicevox_engine - VOICEVOX is a high-quality text-to-speech software that can be used for free.
  • LLaVA-JP - LLaVA-JP is a Japanese VLM trained by LLaVA method
  • RAG-Japanese - Open source RAG with Llama Index for Japanese LLM in low resource setting
  • bertjsc - Japanese spelling error corrector using BERT (masked language model).
  • llm-leaderboard - LLM evaluation project for Japanese tasks
  • jglue-evaluation-scripts - Training and evaluation scripts for JGLUE, a Japanese language understanding benchmark
  • BLIP2-Japanese - Modifying LAVIS' BLIP2 Q-former with models pretrained on Japanese datasets.
  • wikipedia-passages-jawiki-embeddings-utils - Scripts for converting Japanese Wikipedia passages into various Japanese embeddings and faiss indexes.
  • simple-simcse-ja - Exploring Japanese SimCSE
  • wikipedia-japanese-open-rag - Sample RAG based on Gradio to answer user questions using Japanese Wikipedia articles
  • gpt4-autoeval - Script for automatically evaluating language model responses using GPT-4.
  • t5-japanese - Japanese T5 model
  • japanese_llm_eval - A repo for evaluating Japanese LLMs
  • jmteb - The evaluation scripts of JMTEB (Japanese Massive Text Embedding Benchmark)
  • pydomino - This is a tool for aligning phoneme labels with Japanese language audio.
  • easynovelassistant - This is a simple novel generation assistant using the lightweight and unregulated Japanese local LLM "LightChatAssistant-TypeB". It generates forever with local privileges, stacking up hit gachas. It also supports reading aloud.
  • clip-japanese - Japanese CLIP model
  • rime-jaroomaji - Japanese rōmaji input schema for Rime IME
  • deep-question-generation - Quiz automatic generation using deep learning (Japanese T5 model)
  • magpie-nemotron - Code to create a synthetic dialogue dataset using the technique called Magpie and Nemotron-4-340B-Instruct.
  • qlora_ja - Sample code for QLoRA instruction tuning on a Japanese dataset.
  • mozcdic-ut-jawiki - Mozc UT Jawiki Dictionary is a dictionary generated from the Japanese Wikipedia for Mozc.
  • shisa-v2 - Japanese / English Bilingual LLM
  • llm-translator - Mixtral-based Ja-En (En-Ja) Translation model

To check the statistics table (GitHub stars/Downloads), please refer to this page.

C++

Morphology analysis

  • mecab - Yet another Japanese morphological analyzer
  • jumanpp - Juman++ (a Morphological Analyzer Toolkit)
  • kytea - The Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation, etc.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Parsing

  • cabocha - Yet Another Japanese Dependency Structure Analyzer
  • knp - A Japanese Parser

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • jsc - Joint source channel model for Japanese Kana Kanji conversion, Chinese pinyin input and CJE mixed input.
  • aquaskk - An input method without morphological analysis.
  • mozc - Mozc - a Japanese Input Method Editor designed for multi-platform
  • trimatch - Trimatch: An (Exact|Prefix|Approximate) String Matching Library
  • resembla - Resembla: Word-based Japanese similar sentence search library
  • corvusskk - ▽▼ SKK-like Japanese Input Method Editor for Windows

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Rust crate

Morphology analysis

  • lindera - A morphological analysis library.
  • vaporetto - Vaporetto: Very Accelerated POintwise pREdicTion based TOkenizer
  • goya - Japanese Morphological Analysis written in Rust
  • vibrato - vibrato: Viterbi-based accelerated tokenizer
  • yoin - A Japanese Morphological Analyzer written in pure Rust
  • mecab-rs - Safe Rust bindings for mecab a part-of-speech and morphological analyzer library
  • awabi - A morphological analyzer using mecab dictionary

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Converter

  • wana_kana_rust - Utility library for checking and converting between Japanese characters - Hiragana, Katakana - and Romaji
  • unicode-jp-rs - A Rust library to convert Japanese Half-width-kana[半角カナ] and Wide-alphanumeric[全角英数] into normal ones
  • kana - [Mirror] CLI program for transliterating romaji text to either hiragana or katakana
  • kanaria - This library provides functions such as mutual conversion and discrimination of hiragana, katakana, half-width, and full-width characters.
  • japanese-address-parser - This is a library that splits Japanese addresses into prefecture/city or town/village/neighborhood/other.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Search engine library

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • daachorse - A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
  • find-simdoc - Finding all pairs of similar documents time- and memory-efficiently
  • crawdad - Rust library of natural language dictionaries using character-wise double-array tries.
  • tokenizer-speed-bench - Comparison code of various tokenizers
  • stringmatch-bench - Benchmark tools to compare the performance of data structures for string matching.
  • vime - Using Vim as an input method for X11 apps
  • voicevox_core - The core of VOICEVOX, a medium-quality text-to-speech software that can be used for free.
  • akaza - Yet another Japanese IME for IBus/Linux
  • Jotoba - A free online, self-hostable, multilang Japanese dictionary.
  • dvorakjp-romantable - DvorakJP Roman Table for Google Japanese Input
  • niinii - Japanese glossator for assisted reading of text using Ichiran
  • cskk - SKK (Simple Kana Kanji conversion) library
  • japanki - Learn Japanese vocabs 🇯🇵 by doing quizzes on CLI!
  • jpreprocess - Japanese text preprocessor for Text-to-Speech applications (OpenJTalk rewrite in rust language)
  • listup_precedent - Software that scrapes and generates a list of case law data from the court's website (https://www.courts.go.jp/index.html)
  • jisho - Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

JavaScript

Morphology analysis

  • kuromoji.js - JavaScript implementation of Japanese morphological analyzer
  • rakutenma - Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
  • node-mecab-ya - Yet another mecab wrapper for nodejs
  • juman-bin - A user-extensible morphological analyzer for Japanese.
  • node-mecab-async - Asynchronous japanese morphological analyser using MeCab.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Converter

  • kuroshiro - Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
  • kuroshiro-analyzer-kuromoji - Kuromoji morphological analyzer for kuroshiro.
  • hepburn - Node.js module for converting Japanese Hiragana and Katakana script to, and from, Romaji using Hepburn romanisation
  • japanese-numerals-to-number - Converts Japanese Numerals into number
  • jslingua - Javascript libraries for text processing: Arabic, Japanese, and more.
  • WanaKana - A Javascript library that can detect and transliterate between Hiragana, Katakana, and Romaji.
  • node-romaji-name - Normalize and fix common issues with Romaji-based Japanese names.
  • kyujitai.js - Utility collections for making Japanese text old-fashioned
  • normalize-japanese-addresses - Open source address normalization library.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • bangumi-data - Raw data on Japanese anime
  • yomichan - Japanese pop-up dictionary extension for Chrome and Firefox.
  • proofreading-tool - GUI-based document proofreading tool for textlinting.
  • kanjigrid - A web-app displaying the 2200 kanji characters taught in James Heisig's "Remembering the Kanji", 6th edition.
  • japanese-toolkit - Monorepo for Kanji, Furigana, Japanese DB, and others
  • analyze-desumasu-dearu - A JavaScript library for analyzing polite language (desu-masu style) and plain language (da-aru style) in sentences.
  • hatsuon - Japanese pitch accent utils
  • sentiment_ja_js - Sentiment Analysis in Japanese. sentiment_ja with JavaScript
  • mecab-ipadic-seed - mecab-ipadic seed dictionary reader
  • Japanese-Word-Of-The-Day - Well, a different Japanese word everyday.
  • oskim - Extend GNOME On-Screen Keyboard for Input Methods
  • tweetMapping - This is a digital archive of geotagged tweets that were tweeted within 24 hours of the occurrence of the Great East Japan Earthquake.
  • pitch-accent - Predict pitch accent in Japanese
  • kana2ipa - Command to convert "hiragana" or "katakana" into International Phonetic Alphabet (IPA) symbols when pronouncing in Japanese.
  • voicevox - Editor for VOICEVOX, a high-quality text-to-speech software that can be used for free.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Go

Morphology analysis

  • kagome - Self-contained Japanese Morphological Analyzer written in pure Go

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • ojosama - Converts text into the ojou-sama speaking style of Hyakumantenbara Salome.
  • nihongo - Japanese Dictionary
  • yomichan-import - External dictionary importer for Yomichan.
  • imas-ime-dic - THE IDOLM@STER words dictionary for Japanese IME (by imas-db.jp)
  • go-kakasi - Kanji transliteration to hiragana/katakana/romaji, in Go
  • go-moji - A Go library for Zenkaku/Hankaku conversion
  • ojichat - Generate sentences that an uncle would send via LINE or email.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Java

Morphology analysis

  • kuromoji - Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
  • Sudachi - A Japanese Tokenizer for Business
  • SudachiDict - A lexicon for Sudachi

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • kanjitomo-ocr - Java library for identifying Japanese characters from images
  • jakaroma - Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
  • kakasi-java - Kanji transliteration to hiragana/katakana/romaji, in Java
  • Kamite - A desktop language immersion companion for learners of Japanese
  • react-native-japanese-tokenizer - Async Japanese Tokenizer Native Plugin for React Native for iOS and Android
  • elasticsearch-analysis-japanese - The Japanese analyzer utilizes the Kuromoji Japanese tokenizer for ElasticSearch.
  • moji4j - A Java library to convert between Japanese Hiragana, Katakana, and Romaji scripts.
  • neologdn-java - Japanese text normalizer for mecab-neologd
  • elasticsearch-sudachi - The Japanese analysis plugin for elasticsearch

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Pretrained model

Word2Vec

  • japanese-words-to-vectors - Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.
  • chiVe - Japanese word embedding with Sudachi and NWJC
  • elmo-japanese - Elmo (in Japanese)
  • embedrank - Python Implementation of EmbedRank
  • aovec - Easy aozorabunko Word2Vec Builder - Word2Vec Builder and pre-built model for all books in the Aozora Bunko library.
  • dependency-based-japanese-word-embeddings - This is a repository for the AI LAB article "係り受けに基づく日本語単語埋込 (Dependency-based Japanese Word Embeddings)" (article URL: https://ai-lab.lapras.com/nlp/japanese-word-embedding/)
  • jawikivec - Yet Another Japanese-Wikipedia Entity Vectors
  • jawiki_word_vector_updater - A script for learning word embedding models such as word2vec, fastText, and GloVe based on the results of morphological analysis using both the IPA dictionary and the latest Neologd dictionary, using MeCab on the latest Japanese Wikipedia dump data.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Transformer based models

  • bert-japanese - BERT models for Japanese language text.
  • japanese-pretrained-models - Code for producing Japanese pretrained models provided by rinna Co., Ltd.
  • bert-japanese - BERT with SentencePiece for Japanese text.
  • SudachiTra - Transformers for Japanese tokenizers (word segmentation tools)
  • japanese-dialog-transformers - Code for evaluating Japanese pretrained models provided by NTT Ltd.
  • shiba - Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
  • Dialog - A PyTorch implementation of a Japanese chatbot using BERT and a Transformer decoder
  • language-pretraining - BERT and ELECTRA models of PyTorch implementations for Japanese text.
  • medbertjp - Trials of pre-trained BERT models for the medical domain in Japanese.
  • ILYS-aoba-chatbot - ILYS-aoba-chatbot
  • t5-japanese - Code to pre-train Japanese T5 models
  • pytorch_bert_japanese - Using a pre-trained Japanese BERT model with Pytorch.
  • Laboro-BERT-Japanese - Laboro BERT Japanese: Japanese BERT Pre-Trained With Web-Corpus
  • RoBERTa-japanese - Japanese BERT Pretrained Model
  • aMLP-japanese - aMLP Transformer Model for Japanese
  • bert-japanese-aozora - Japanese BERT trained on Aozora Bunko and Wikipedia, pre-tokenized by MeCab with UniDic & SudachiPy
  • sbert-ja - Code to train Sentence BERT Japanese model for Hugging Face Model Hub
  • BERT-Japan-vaccination - Official fine-tuning code for "Emotion Analysis of Japanese Tweets and Comparison to Vaccinations in Japan"
  • gpt2-japanese - Japanese GPT2 Generation Model
  • text2text-japanese - gpt-2 based text2text conversion model
  • gpt-ja - GPT-2 Japanese model for HuggingFace's transformers
  • friendly_JA-Model - MT model trained using the friendly_JA Corpus attempting to make Japanese easier/more accessible to occidental people by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
  • albert-japanese - BERT with SentencePiece for Japanese text.
  • ja_text_bert - Repository for generating a pre-trained BERT model using the Japanese Wikipedia corpus.
  • DistilBERT-base-jp - A Japanese DistilBERT pretrained model, which was trained on Wikipedia.
  • bert - This repository provides snippets to use RoBERTa pre-trained on a Japanese corpus. Our dataset consists of Japanese Wikipedia and web-crawled articles, 25GB in total. The released model is built based on that from HuggingFace.
  • Laboro-DistilBERT-Japanese - Laboro DistilBERT Japanese
  • luke - LUKE -- Language Understanding with Knowledge-based Embeddings
  • GPTSAN - General-purpose Switch Transformer-based Japanese language model
  • japanese-clip - Japanese CLIP by rinna Co., Ltd.
  • AcademicBART - We pretrained a BART-based Japanese masked language model on paper abstracts from the academic database CiNii Articles
  • AcademicRoBERTa - We pretrained a RoBERTa-based Japanese masked language model on paper abstracts from the academic database CiNii Articles.
  • LINE-DistilBERT-Japanese - DistilBERT model pre-trained on 131 GB of Japanese web text. The teacher model is a BERT-base model built in-house at LINE.
  • Japanese-Alpaca-LoRA - Link to the Low-Rank Adapter created by fine-tuning LLaMA using the Stanford Alpaca dataset translated into Japanese, and sample code for generating it.
  • albert-japanese-tinysegmenter - Pretrained models, code, and guidance for pretraining the official ALBERT (https://github.com/google-research/albert) on Japanese Wikipedia resources
  • japanese-llama-experiment - Japanese LLaMA experiment
  • easylightchatassistant - EasyLightChatAssistant is an environment for easily trying out LightChatAssistant, a lightweight, uncensored and unrestricted local Japanese model, with KoboldCpp.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

ChatGPT

  • VRChatGPT - A program that allows you to chat using ChatGPT in VRChat.
  • AITuberDegikkoMirii - We are developing the foundation of AITuber.
  • wanna - Shell command launcher with natural language
  • ChatdollKit - ChatdollKit enables you to make your 3D model into a chatbot
  • ChuanhuChatGPTJapanese - GUI for ChatGPT API For Japanese
  • AISisterAIChan - This is an Ukagaka ghost equipped with ChatGPT3.5, called "AI Imouto Aichan". A separate ChatGPT API key is required to use it.
  • vrchatbot - Repository for creating AI bots in VRChat
  • gptuber-by-langchain - GPT will become a YouTuber.
  • openai-chatfriend - A chatbox application built with Nuxt 3 and powered by the OpenAI text completion endpoint. You can select different personalities for your AI friend. By default it responds in Japanese, so you can use this app to practice your Nihongo skills!
  • chrome-ext-translate-to-hiragana-with-chatgpt - This Chrome extension can translate selected Japanese text to Hiragana by using ChatGPT.
  • azure-search-openai-demo - In this sample, we demonstrate several approaches to creating an experience similar to ChatGPT for proprietary data using the Retrieval Augmented Generation pattern.
  • chatvrm - ChatVRM is a demo application that allows you to easily chat with 3D characters in your browser.
  • sftly-replace - A Chrome extension to replace the selected text softly
  • summarize_arxv - Summarize arXiv paper with figures
  • aiavatarkit - Building AI-based conversational avatars lightning fast
  • pva-aoai-integration-solution - This repository is intended to package and release the flows and other solutions created for the trial use of ChatGPT at Kobe City Hall.
  • jp-azureopenai-samples - We provide free samples of applications (reference architecture, sample code, and deployment instructions) for the purpose of implementing applications using Azure OpenAI.
  • character_chat - This is a chat script that uses OpenAI's API to have a conversation with a character set in Japanese.
  • chatgpt-slackbot - Slackbot script for using OpenAI's ChatGPT API on Slack (assumes usage in Japanese)
  • chatgpt-prompt-sample-japanese - This is a sample of ChatGPT's prompt.
  • kanji-flashcard-app-gpt4 - A Japanese Kanji Flashcard App built using Python and Langchain, enhanced with the intelligence of GPT-4.
  • IgakuQA - Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
  • japagen - Investigation of pseudo-learning data generation using LLM in Japanese language tasks

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Dictionary and IME

  • mecab-ipadic-neologd - Neologism dictionary based on the language resources on the Web for mecab-ipadic
  • tdmelodic - A Japanese accent dictionary generator
  • jamdict - Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings
  • unidic-py - Unidic packaged for installation via pip.
  • Japanese-Company-Lexicon - Japanese Company Lexicon (JCLdic)
  • manbyo-sudachi - A comprehensive medical dictionary for Sudachi.
  • jawiki-kana-kanji-dict - Generate SKK/MeCab dictionary from Wikipedia(Japanese edition)
  • JIWC-Dictionary - dictionary to find emotion related to text
  • JumanDIC - This repository contains source dictionary files to build dictionaries for JUMAN and Juman++.
  • ipadic-py - IPAdic packaged for easy use from Python.
  • unidic-lite - A small version of UniDic for easy pip installs.
  • emoji-ime-dictionary - An IME additional dictionary for inputting emojis in Japanese, such as the Google Japanese Input, which enables conversion from Japanese to emojis through an IME extension dictionary.
  • google-ime-dictionary - An IME additional dictionary called "orange_book" for Japanese-English conversion and expansion of English abbreviations, which enables Japanese-English conversion and expansion of English abbreviations in Google Japanese Input and ATOK.
  • dic-nico-intersection-pixiv - IME dictionary for the common parts of Nico Nico Daihyakka and Pixiv Encyclopedia.
  • google-ime-user-dictionary-ja-en - Project archive of a Google IME user dictionary mapping katakana words (Japanese loanwords) to English.
  • emoticon - Google Japanese Input's emoticon dictionary ∩(,,Ò‿Ó,,)∩
  • mecab-mozcdic - This is a conversion of the open source mozc dictionary to the MeCab dictionary format.
  • denonbu-ime-dic - Denonbu dictionary: a dictionary of terms related to "Denonbu" (電音部), intended for use with Microsoft IME and other similar software.
  • nijisanji-ime-dic - This is a glossary of "Nijisanji" related terms intended for use with Microsoft IME and other similar software.
  • pokemon-ime-dic - This is a terminology dictionary that covers the names of all currently known Pokémon, intended for use with Microsoft IME and similar software.
  • EJDict - English-Japanese Dictionary data (Public Domain) EJDict-hand
  • Ayashiy-Nipongo-Dic - Using the precious tobacco box as a visual aid, it is possible to speak proper Japanese.
  • genshin-dict - This is a vocabulary dictionary for Genshin Impact that can be used on Windows/macOS.
  • jmdict-simplified - JMdict and JMnedict in JSON format
  • mozcdict-ext - Convert external words into Mozc system dictionary
  • mh-dict-jp - I want to create a user dictionary for Monster Hunter...
  • jitenbot - Convert data from Japanese dictionary websites and applications into portable file formats
  • mecab-unidic-neologd - Neologism dictionary based on the language resources on the Web for mecab-unidic
  • hololive-dictionary - This is a dictionary file about Hololive (Hololive Production). You can use the text files in the ./dictionary folder to add words to your IME. Please refer to README.md for more details.
  • jmdict-yomitan - JMdict, JMnedict, KANJIDIC for Yomitan/Yomichan.
  • yomichan-jlpt-vocab - JLPT level tags for words in Yomichan
  • Jitendex - A free and openly licensed Japanese-to-English dictionary compatible with multiple dictionary clients
  • jiten - Japanese Android/CLI/web dictionary based on JMdict/KANJIDIC (Japanese dictionary; Japanese-English, kanji-English, Japanese-German, and Japanese-Dutch dictionaries)
  • pixiv-yomitan - Pixiv Encyclopedia Dictionary for Yomitan
  • uchinaaguchi_dict - Uchinaaguchi Dictionary (Okinawan Language Dictionary)
  • yomitan-dictionaries - Japanese and Chinese dictionaries for Yomitan.
  • mouse_over_dictionary - Generic dictionary tool that automatically reads the word you mouse over.
  • jisyo - New dictionary format for the kana-kanji conversion engine SKK
  • skk-jisyo.emoji-ja - SKK dictionary for converting Japanese readings to Emoji 😂
  • anthy - Anthy is a kana-kanji conversion engine for Japanese. It converts romaji to kana, and kana text to mixed kana and kanji.
  • aws_dic_for_google_ime - Dictionary for Google Japanese input for AWS service names
  • cl-skkserv - SKK dictionary server and its extensions using Common Lisp
  • anthy - Anthy maintenance
  • anthy-unicode - Anthy Unicode - Another Anthy
  • azooKey - azooKey: A Japanese Keyboard iOS Application Fully Developed in Swift
  • azookey-desktop - Japanese Input Method azooKey for Desktop, supporting macOS
  • fcitx5-hazkey - Japanese input method for fcitx5, powered by azooKey engine

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Corpus

Part-of-speech tagging / Named entity recognition

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Parallel corpus

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Dialog corpus

  • JMRD - Japanese Movie Recommendation Dialogue dataset
  • open2ch-dialogue-corpus - A dialogue corpus created by crawling the 2channel open forum.
  • BSD - The Business Scene Dialogue corpus
  • asdc - Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
  • japanese-corpus - Japanese dialogue data for seq2seq, etc.
  • BPersona-chat - This repository contains the Japanese–English bilingual chat corpus BPersona-chat published in the paper Chat Translation Error Detection for Assisting Cross-lingual Communications at AACL-IJCNLP 2022's Workshop Eval4NLP 2022.
  • japanese-daily-dialogue - Japanese Daily Dialogue, or 日本語日常対話コーパス in Japanese, is a high-quality multi-turn dialogue dataset containing daily conversations on five topics: daily life, school, travel, health, and entertainment.
  • llm-japanese-dataset - Japanese chat dataset for building LLM.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Others

  • jrte-corpus - Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
  • kanji-data - A JSON kanji dataset with updated JLPT levels and WaniKani information
  • JapaneseWordSimilarityDataset - Japanese Word Similarity Dataset
  • simple-jppdb - A paraphrase database for Japanese text simplification
  • chABSA-dataset - chakki's Aspect-Based Sentiment Analysis dataset
  • JaQuAD - JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension (2022, Skelter Labs)
  • JaNLI - Japanese Adversarial Natural Language Inference Dataset
  • ebe-dataset - Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
  • emoji-ja - Japanese pronunciation/keywords/classification dictionary for UNICODE emojis.
  • nayose-wikipedia-ja - Japanese name matching dataset created from Wikipedia.
  • ja.text8 - Japanese text8 corpus for word embedding.
  • ThreeLineSummaryDataset - 3-line summary dataset
  • japanese - This repo contains a list of the 44,998 most common Japanese words in order of frequency, as determined by the University of Leeds Corpus.
  • kanji-frequency - Kanji usage frequency data collected from various sources
  • TEDxJP-10K - TEDxJP-10K ASR Evaluation Dataset
  • CoARiJ - Corpus of Annual Reports in Japan
  • technological-book-corpus-ja - A raw corpus/tool that collects technical books written in Japanese.
  • ita-corpus-chuwa - Chunked word annotation for ITA corpus
  • wikipedia-utils - Utility scripts for preprocessing Wikipedia texts for NLP
  • inappropriate-words-ja - We will collect inappropriate expressions in Japanese. We believe it can be used for data cleaning in natural language processing.
  • house-of-councillors - We organized data on factions, members, bills, and interpellation requests from the official website of the House of Councillors.
  • house-of-representatives - National Diet Bill Database: House of Representatives
  • STAIR-captions - STAIR captions: A Japanese image caption dataset on a large scale.
  • Winograd-Schema-Challenge-Ja - Japanese Translation of Winograd Schema Challenge
  • speechBSD - An extension of the BSD corpus with audio and speaker attribute information
  • ita-corpus - List of sentences in the ITA corpus
  • rohan4600 - Mora balance Japanese corpus
  • anlp-jp-history - A complete list and machine-readable version of the presentations at the annual conference of the Association for Natural Language Processing (ANLP).
  • keigo_transfer_task - Evaluation dataset for honorific language conversion task.
  • loanwords_gairaigo - English loanwords in Japanese
  • jawikicorpus - Japanese-Wikipedia Wikification Corpus
  • GeneralPolicySpeechOfPrimeMinisterOfJapan - A corpus of Japanese text from the general policy speeches of the Prime Ministers of Japan
  • wrime - WRIME: Subjective and Objective Emotion Analysis Dataset.
  • jtubespeech - JTubeSpeech: Corpus of Japanese speech collected from YouTube
  • WikipediaWordFrequencyList - List of frequently used words in Japanese Wikipedia.
  • kokkosho_data - Dataset on vehicle malfunction information.
  • pdmocrdataset-part1 - OCR learning dataset created for digital material OCR text conversion project.
  • huriganacorpus-ndlbib - A dataset of furigana created from the National Bibliographic Data.
  • jvs_hiho - Creating labels for self-made JVS (Japanese Versatile Speech) corpus.
  • hirakanadic - Allows Sudachi to normalize from hiragana to katakana from any compound word list
  • animedb - Anime works list database spanning approximately 100 years.
  • security_words - Japanese-English correspondence of public organizations related to cybersecurity.
  • Data-on-Japanese-Diet-Members - Data of Japanese parliament members.
  • honkoku-data - Transcription texts created on Minna de Honkoku (https://honkoku.org), a crowdsourced transcription platform for historical Japanese documents.
  • wikihow_japanese - The wikiHow dataset (Japanese version)
  • engineer-vocabulary-list - Engineer Vocabulary List in Japanese/English
  • JSICK - Japanese Sentences Involving Compositional Knowledge (JSICK) Dataset/JSICK-stress Test Set
  • phishurl-list - Phishing URL dataset from JPCERT/CC
  • jcms - A Japanese Corpus of Many Specialized Domains (JCMS)
  • aozorabunko_text - text-only archives of www.aozora.gr.jp
  • friendly_JA-Corpus - friendly_JA is a parallel Japanese-to-Japanese corpus aimed at making Japanese easier by using the Latin/English derived katakana lexicon instead of the standard Sino-Japanese lexicon
  • topokanji - Topologically ordered lists of kanji for effective learning
  • isbn4groups - Data related to Japanese publications in ISBN-13 format (978-4-XXXXXXXXX)
  • NMeCab - NMeCab: A Japanese morphological analyzer for .NET
  • ndlngramdata - Dataset of n-gram frequency statistics information from OCR text data created from digitized materials.
  • ndlngramviewer_v2 - The complete set of source code for the NDL Ngram Viewer that was renewed in January 2023.
  • data_set - Dataset related to laws and precedents.
  • huggingface-datasets_wrime - WRIME for huggingface datasets
  • ndl-minhon-ocrdataset - NDL Classical Text OCR Learning Dataset (Collaborative Transcription and Processing Data)
  • PAX_SAPIENTICA - GIS & Archaeological Simulator is currently in development and is expected to be released in 2023.
  • j-liwc2015 - Japanese version of LIWC2015
  • huggingface-datasets_livedoor-news-corpus - Japanese Livedoor news corpus for huggingface datasets
  • huggingface-datasets_JGLUE - JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
  • commonsense-moral-ja - JCommonsenseMorality is a dataset created through crowdsourcing that reflects the commonsense morality of Japanese annotators.
  • comet-atomic-ja - COMET-ATOMIC ja
  • dcsg-ja - Dialogue Commonsense Graph in Japanese
  • japanese-toxic-dataset - "Proposal and Evaluation of Japanese Toxicity Schema" provides a schema and dataset for toxicity in the Japanese language.
  • camera - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) is the Japanese ad text generation dataset.
  • Japanese-Fakenews-Dataset - Japanese Fake News Dataset
  • jpn_explainable_qa_dataset - jpn_explainable_qa_dataset
  • copa-japanese - COPA dataset (Japanese)
  • WLSP-familiarity - Word Familiarity Rate for 'Word List by Semantic Principles (WLSP)'
  • ProSub - A cross-linguistic study of pronoun substitutes and address terms
  • ramendb - Scraping tool and collected data from Nantoka Database (https://supleks.jp/).
  • huggingface-datasets_CAMERA - CAMERA (CyberAgent Multimodal Evaluation for Ad Text GeneRAtion) for huggingface datasets
  • FactCheckSentenceNLI-FCSNLI- - FactCheckSentenceNLI dataset
  • databricks-dolly-15k-ja - This is a dataset that has been translated into Japanese from the databricks-dolly-15k.jsonl file used for training in databricks/dolly-v2-12b.
  • EaST-MELD - EaST-MELD is an English-Japanese dataset for emotion-aware speech translation based on MELD.
  • meconaudio - Mecon Audio (Medical Conference Audio) is a dataset of read-out minutes for advanced medical conferences sponsored by the Ministry of Health, Labour and Welfare.
  • japanese-addresses - Open data of address data at the town and block level nationwide (277,191 entries).
  • aozorasearch - A full-text search system for Aozora Bunko powered by Groonga, provided as both a search library and a web app.
  • llm-jp-corpus - This repository contains scripts to reproduce the LLM-jp corpus.
  • alpaca_ja - This is a Japanese version of the alpaca dataset.
  • instruction_ja - Japanese instruction data (日本語指示データ)
  • japanese-family-names - Top 5000 Japanese family names, with readings, ordered by frequency.
  • kanji-data-media - Japanese language data on kanji, radicals, media files, fonts and related resources from Kanji alive
  • reazonspeech - Construct large-scale Japanese audio corpus at home
  • huriganacorpus-aozora - Data set of furigana created from Aozora Bunko and Sapie's braille data.
  • koniwa - An open collection of annotated voices in Japanese language
  • JMMLU - Japanese Massive Multitask Language Understanding Benchmark
  • hurigana-speech-corpus-aozora - Dataset of audio corpus with furigana annotations from Aozora Bunko
  • jqara - JQaRA: Japanese Question Answering with Retrieval Augmentation - a Japanese Q&A dataset for evaluating retrieval augmentation (RAG)
  • jemhopqa - JEMHopQA (Japanese Explainable Multi-hop Question Answering) is a Japanese multi-hop QA dataset that can evaluate internal reasoning.
  • jacred - Repository for Japanese Document-level Relation Extraction Dataset (planned for release in March).
  • jades - JADES is a dataset for text simplification in Japanese, described in 'JADES: New Text Simplification Dataset in Japanese Targeted at Non-Native Speakers' (the paper will be available soon).
  • do-not-answer-ja - A safety evaluation dataset "Do-Not-Answer" released by the University of Melbourne in August 2023 has been automatically translated into Japanese for use in the evaluation of Japanese LLM, and further modified to take into account Japanese culture.
  • oasst1-89k-ja - This is a dataset that translates OpenAssistant's open source data OASST1 into Japanese.
  • jacwir - JaCWIR: Japanese Casual Web IR - a small-scale, casual web title and abstract dataset for Japanese information retrieval evaluation.
  • japanese-technical-dict - Comparison table of commonly used katakana and original words in the science and technology industry for Japanese language learners.
  • j-unimorph - Dataset of UniMorph in Japanese
  • GazeVQA - Dataset for the LREC-COLING 2024 paper A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions
  • J-CRe3 - Code for J-CRe3 experiments (Ueda et al., LREC-COLING, 2024)
  • jmed-llm - JMED-LLM: Japanese Medical Evaluation Dataset for Large Language Models
  • lawtext - Plain text format for Japanese law
  • pdmocrdataset-part2 - OCR learning dataset created in OCR processing program research and development project.

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Tutorial

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Research summary

  • awesome-bert-japanese - A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
  • GEC-Info-ja - Repository for collecting and categorizing Japanese literature on correcting grammar errors.
  • dataset-list - lists of text corpus and more (mainly Japanese)
  • tuning_playbook_ja - A playbook for systematically maximizing the performance of deep learning models.
  • japanese-pitch-accent-resources - An attempt to consolidate Japanese phonetic resources, in particular pitch accent resources, into one list
  • awesome-japanese-llm - Summary of Japanese LLM (Open Source)

To check the statistics table (GitHub stars/Downloads), please refer to this page.

Reference

Contributors