PyTorch original implementation of Cross-lingual Language Model Pretraining. (Python, updated Jul 28, 2020)
ASR PyTorch project
Central repository with pretrained models for transfer learning, BPE subword-tokenization, mono/multilingual embeddings, and everything in between.
This project aims to implement word-based, character-based and subword-based tokenization techniques.
An extremely simple and restricted tool/lib for converting binary data into text that can be processed with unsupervised character-level natural language processing tools/libs
A modified, secure version of BPE algorithm
Low-resource language machine translation (az, be, tr -> en).
An educational project dedicated to text-to-image generation with neural networks. VQVAE and BPE autoencoders are used to learn embeddings of images and text, respectively. A transformer-based model is then trained to predict the next token in the concatenated sequence of image and text tokens and is used for generation.
A Python package to build a corpus vocabulary using the byte-pair encoding methodology, plus a tokenizer to tokenize input texts based on the built vocabulary.
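Several of the tokenizers listed here apply a learned merge table to new text at inference time. As a rough illustration (not taken from any particular repository above), segmenting a word with an already-learned merge list can be sketched like this, where the toy `merges` list is a hypothetical example:

```python
def bpe_tokenize(word, merges):
    """Greedily apply learned BPE merges to a single word."""
    # earlier-learned merges have lower rank and are applied first
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]  # end-of-word marker, a common convention
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# illustrative merge table, in the order the merges were learned
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(bpe_tokenize("lowest", merges))  # ['low', 'est</w>']
```

Unknown words simply fall back to smaller subword pieces, or ultimately single characters, which is the main appeal of subword tokenization over fixed word vocabularies.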
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Natural Language EnCoder-Decoder: word, char, BPE, etc.
Byte Pair Encoding (BPE)
Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch in Python
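For context on what these from-scratch implementations do: learning BPE merges amounts to repeatedly counting adjacent symbol pairs over a frequency-weighted word list and merging the most frequent pair. A minimal sketch in the spirit of Sennrich et al.'s classic formulation (not the code of any repository listed here):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace each whitespace-delimited occurrence of the pair with its merge
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# toy corpus: words as space-separated characters with an end-of-word marker
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(corpus, 10)
print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

The learned merge list doubles as the tokenizer's rule table: applying the merges in order to an unseen word reproduces the same segmentation.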
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
Learning BPE embeddings by first learning a segmentation model and then training word2vec