
C++ implementation of the paper "Word-like n-gram embedding" (EMNLP 2018 Workshop on Noisy User-generated Text).

SGNS-WNE : The word-like n-gram embedding version of the skip-gram model with negative sampling

SGNS-WNE is an open-source implementation of our framework for learning distributed representations of words by embedding word-like character n-grams, described in the paper "Word-like n-gram embedding" (EMNLP 2018 Workshop on Noisy User-generated Text).

Requirements & Environment

  • Linux (tested with CentOS Linux release 7.4.1708)
  • gcc (>= 5)
  • HDF5
  • Python 3
  • NumPy
  • Pandas
  • h5py
  • scikit-learn
  • tqdm
  • cmdline : download cmdline.h and place it in 2_count_ngram_frequency/, 4_count_expected_word_frequency/, and 5_SGNS_WNE/

Contents

  • 1_preprocess/ : Pre-process the corpus. Sentences are concatenated and white spaces are replaced with another character for visualization.
  • 2_count_ngram_frequency/ : Count n-gram frequencies. This implementation uses the lossy counting algorithm.
  • 3_logistic_regression/ : Probabilistic predictor for word boundaries.
  • 4_count_expected_word_frequency/ : Count the expected word frequency (ewf) of word-like n-grams.
  • 5_SGNS_WNE/ : Compute distributed representations of word-like n-grams via the skip-gram model with negative sampling.

Illustrative sketches of each step follow the directory tree below.
.
├── 1_preprocess
│   └── main.py
├── 2_count_ngram_frequency
│   ├── cmdline.h
│   ├── lossycounting.cpp
│   ├── lossycounting.h
│   ├── main.cpp
│   ├── makefile
│   └── run.sh
├── 3_logistic_regression
│   └── main.py
├── 4_count_expected_word_frequency
│   ├── cmdline.h
│   ├── counting_word.cpp
│   ├── counting_word.h
│   ├── main.cpp
│   ├── makefile
│   └── run.sh
├── 5_SGNS_WNE
│   ├── cheaprand.h
│   ├── cmdline.h
│   ├── main.cpp
│   ├── makefile
│   ├── run.sh
│   ├── skipgram.cpp
│   └── skipgram.h
└── README.md
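
A minimal sketch of step 1 (1_preprocess/), assuming the job is only what the bullet above describes: concatenate sentences and replace white space with a marker character. The marker "_" and the function name are placeholders, not taken from main.py.

```python
# Step 1 sketch: concatenate sentences and mark original white spaces.
# The actual marker character used by 1_preprocess/main.py is not stated
# in this README; "_" is an assumption.
import sys

def preprocess(lines, space_mark="_"):
    """Join sentences into one stream and replace spaces with a marker."""
    corpus = "".join(line.rstrip("\n") for line in lines)
    return corpus.replace(" ", space_mark)

if __name__ == "__main__":
    sys.stdout.write(preprocess(sys.stdin) + "\n")
```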
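Step 2 counts character n-gram frequencies with the lossy counting algorithm of Manku & Motwani. The repository does this in C++ (lossycounting.cpp); the Python sketch below only illustrates the algorithm itself, and the epsilon value is an arbitrary example.

```python
# Step 2 sketch: lossy counting over a stream of character n-grams.
# Guarantee: true_count - approx_count <= epsilon * stream_length.
import math

def char_ngrams(text, n_max):
    """Yield every character n-gram of text for n = 1 .. n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def lossy_count(stream, epsilon=1e-5):
    width = math.ceil(1.0 / epsilon)         # bucket width
    counts, deltas, bucket = {}, {}, 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:                   # bucket boundary: prune rare items
            for k in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[k], deltas[k]
            bucket += 1
    return counts

# e.g. counts = lossy_count(char_ngrams(corpus, n_max=8))
```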
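Step 3 trains a probabilistic word-boundary predictor. The sketch below uses scikit-learn (a listed requirement) with one-hot character-window features learned from a small segmented seed corpus; the actual feature design in 3_logistic_regression/main.py may differ, so treat boundary_features as an assumption.

```python
# Step 3 sketch: logistic regression over character windows around each
# candidate boundary. Feature design is assumed, not read from main.py.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def boundary_features(text, i, window=2):
    """Characters around the candidate boundary between text[i-1] and text[i]."""
    return {f"c{k}": text[i + k]
            for k in range(-window, window) if 0 <= i + k < len(text)}

def train_boundary_model(segmented_sentences):
    """segmented_sentences: list of word lists from a segmented seed corpus."""
    X, y = [], []
    for words in segmented_sentences:
        text = "".join(words)
        gold, pos = set(), 0
        for w in words[:-1]:                 # internal gold boundaries
            pos += len(w)
            gold.add(pos)
        for i in range(1, len(text)):
            X.append(boundary_features(text, i))
            y.append(1 if i in gold else 0)
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
    return vec, clf

# P(boundary at i): clf.predict_proba(vec.transform([boundary_features(t, i)]))[0, 1]
```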
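Step 4 turns those boundary probabilities into the expected word frequency (ewf) of each n-gram. The reading sketched below, where an occurrence counts as a word with the probability of boundaries at both edges and non-boundaries inside, is our interpretation; consult the paper and counting_word.cpp for the exact definition.

```python
# Step 4 sketch: expected word frequency under an ASSUMED definition:
# an occurrence of an n-gram at [i, i+n) is a word with probability
# P(boundary at i) * prod_{i<k<i+n} (1 - P(boundary at k)) * P(boundary at i+n).
from collections import defaultdict

def expected_word_frequency(text, p_boundary, n_max):
    """p_boundary[i] = P(boundary between text[i-1] and text[i]);
    the ends of the corpus are treated as certain boundaries."""
    def p(i):
        return 1.0 if i in (0, len(text)) else p_boundary[i]
    ewf = defaultdict(float)
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            prob = p(i) * p(i + n)
            for k in range(i + 1, i + n):    # interior positions
                prob *= 1.0 - p(k)
            ewf[text[i:i + n]] += prob
    return ewf
```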
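Step 5 embeds the surviving word-like n-grams with skip-gram and negative sampling. The real trainer is the C++ in 5_SGNS_WNE/ (adapted from word2vec [1] and w2v-sembei [2], see below); this NumPy fragment only sketches the standard SGNS update for one (center, context) pair.

```python
# Step 5 sketch: one SGD step of skip-gram with negative sampling.
# W_in, W_out: (vocab_size, dim) input/output embedding matrices.
import numpy as np

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    h = W_in[center]
    grad_h = np.zeros_like(h)
    # One positive pair (label 1) and several sampled negatives (label 0).
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = 1.0 / (1.0 + np.exp(-h @ W_out[idx]))   # sigmoid
        g = lr * (score - label)
        grad_h += g * W_out[idx]
        W_out[idx] -= g * h
    W_in[center] -= grad_h          # apply accumulated input-side gradient
```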

Submodules & Dependencies

Most of the C++ code used to compute n-gram representations with SGNS is adapted from word2vec (Google Code) [1] and w2v-sembei [2].

References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of ICLR 2013. [pdf, code]
  2. Oshikiri, T. (2017). Segmentation-Free Word Embedding for Unsegmented Languages. In Proceedings of EMNLP 2017. [pdf]
  3. Kudo, T., Yamamoto, K., & Matsumoto, Y. (2004). Applying Conditional Random Fields to Japanese Morphological Analysis. In Proceedings of EMNLP 2004. [pdf]
  4. MeCab: Yet Another Part-of-Speech and Morphological Analyzer. [code]
