This repository shares the dependency-based Japanese word embeddings that we trained for the experiments in the article 係り受けに基づく日本語単語埋込 (Dependency-based Japanese Word Embeddings).
We applied the method proposed in the paper Dependency-based Word Embeddings to Japanese.
To prepare the training data, we first extracted sentences from Japanese Wikipedia dumps.
Then, we parsed them with GiNZA, an NLP framework.
Finally, we trained the embeddings with the script provided on the page of the paper's first author.
The parameter settings for the experiments are as follows, where DIM is the number of dimensions given in each file name.
-size DIM -negative 15 -threads 20
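The preparation steps above turn each parsed sentence into word/context pairs before training. The following is a minimal sketch of that idea, not the pipeline actually used for these embeddings. It assumes a simplified form of the contexts described by Levy and Goldberg (2014): each dependency arc yields one pair for the head and one inverse pair for the modifier. The `label_word` / `labelI_head` string encoding, the `dependency_contexts` helper, and the hand-written toy parse are all illustrative assumptions, and the preposition collapsing described in the paper is omitted.

```python
from typing import List, Tuple

# Each token is (surface, head_index, dependency_label).
# head_index is the 0-based position of the token's head, or -1 for the root.
Token = Tuple[str, int, str]


def dependency_contexts(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (word, context) pairs for one parsed sentence.

    For every arc head -> modifier with label L, emit:
      (head, "L_modifier")      the head sees its modifier
      (modifier, "LI_head")     the modifier sees its head (inverse relation)
    The string encoding of the context is an assumption made for illustration.
    """
    pairs = []
    for word, head_index, label in sentence:
        if head_index < 0:
            continue  # the root has no head, so it produces no arc of its own
        head = sentence[head_index][0]
        pairs.append((head, f"{label}_{word}"))
        pairs.append((word, f"{label}I_{head}"))
    return pairs


# Hand-written toy parse of "猫 が 魚 を 食べる"; the labels are illustrative
# and are not actual GiNZA output.
toy_sentence = [
    ("猫", 4, "nsubj"),
    ("が", 0, "case"),
    ("魚", 4, "obj"),
    ("を", 2, "case"),
    ("食べる", -1, "ROOT"),
]

for word, context in dependency_contexts(toy_sentence):
    print(word, context)
```

In a real run, the tuples would come from a dependency parser rather than being written by hand, and the resulting pairs would be written to a file for the training script.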
You can download the data from the links below.
The download begins as soon as you click a link.
- dep-ja-100dim (85.4 MB)
  - 100-dimensional word vectors
- dep-ja-200dim (169.9 MB)
  - 200-dimensional word vectors
- dep-ja-300dim (254.5 MB)
  - 300-dimensional word vectors
You can use the embeddings in the same way as embeddings trained with the original implementation of Word2Vec.
Here is example code to load them in a Python script.
from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format("path/to/embeddings")
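If you want to inspect the vectors without gensim, the sketch below reads the word2vec text format with the standard library only and computes a cosine similarity. It assumes the files are in the plain-text word2vec format (a header line with the vocabulary size and dimension, then one word and its values per line), which is what `load_word2vec_format` expects by default. The `read_word2vec_text` and `cosine` helpers and the three toy vectors are hypothetical and are not taken from the distributed files.

```python
import io
import math
from typing import Dict, List, TextIO


def read_word2vec_text(handle: TextIO) -> Dict[str, List[float]]:
    """Read word2vec text format: a 'vocab_size dim' header, then one vector per line."""
    _vocab_size, dim = map(int, handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data in the assumed format; real files contain far more rows and dimensions.
sample = "3 2\n猫 1.0 0.0\n犬 0.8 0.6\n車 0.0 1.0\n"
toy_vectors = read_word2vec_text(io.StringIO(sample))

print(cosine(toy_vectors["猫"], toy_vectors["犬"]))
print(cosine(toy_vectors["猫"], toy_vectors["車"]))
```

To try this on a downloaded file, you could replace the `io.StringIO(sample)` argument with a file opened as `open("path/to/embeddings", encoding="utf-8")`. For the full vocabulary, gensim is likely to be considerably faster.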
If you use them in a paper, please cite the following BibTeX entry:
@misc{matsuno2019dependencybasedjapanesewordembeddings,
title = {Dependency-based Japanese Word Embeddings},
author = {Tomoki, Matsuno},
affiliation = {LAPRAS inc.},
url = {https://github.com/lapras-inc/dependency-based-japanese-word-embeddings},
year = {2019}
}
- 松田寛, 大村舞, 浅原正幸. 短単位品詞の用法曖昧性解決と依存関係ラベリングの同時学習, 言語処理学会 第 25 回年次大会 発表論文集, 2019.
- Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Levy, O. & Goldberg, Y. (2014). Dependency-Based Word Embeddings. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 302-308), June, Baltimore, Maryland: Association for Computational Linguistics.