Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Updated Aug 7, 2024 - Python
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
Byte Pair Encoding (BPE)
Simple-to-use scoring function for arbitrarily tokenized texts.
Natural Language EnCoder-Decoder: word, char, BPE, etc.
Fast bare-bones BPE for modern tokenizer training
Learning BPE embeddings by first learning a segmentation model and then training word2vec
Central repository with pretrained models for transfer learning, BPE subword-tokenization, mono/multilingual embeddings, and everything in between.
PyTorch original implementation of Cross-lingual Language Model Pretraining.
An extremely simple and restricted tool/lib for converting binary data into text that can be processed with unsupervised character-level natural language processing tools/libs
A modified, secure version of the BPE algorithm
ASR PyTorch project
A Python package that builds a corpus vocabulary using the byte pair methodology, with a tokenizer that tokenizes input texts based on the built vocabulary.
Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch with Python