Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Simple-to-use scoring function for arbitrarily tokenized texts.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
A modified, secure version of BPE algorithm
Fast bare-bones BPE for modern tokenizer training
An extremely simple and restricted tool/library for converting binary data into text that can be processed with unsupervised character-level natural language processing tools and libraries
Natural Language EnCoder-Decoder: word, char, BPE, etc.
Byte-Pair Encoding (BPE) subword tokenization algorithm implementations from scratch in Python
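As a rough illustration of what such a from-scratch implementation involves (the function names and corpus here are illustrative, not taken from any listed repository), the core BPE merge-learning loop can be sketched as:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across a {space-separated word: frequency} corpus."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Rewrite every word, fusing the chosen pair into a single symbol.

    Note: plain string replacement is a simplification; production
    implementations match symbol boundaries (e.g. with a regex) to avoid
    merging inside a longer symbol.
    """
    merged = " ".join(pair)
    joined = "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

def learn_bpe(words, num_merges):
    """Greedily learn an ordered list of merge operations."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words
```

On the classic toy corpus `{"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}`, the first learned merges fuse `e s`, then `es t`, then `est </w>`, so frequent suffixes like `est</w>` become single vocabulary symbols. At tokenization time, the learned merges are replayed in order on each new word.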
Learning BPE embeddings by first learning a segmentation model and then training word2vec
ASR pytorch project
Central repository with pretrained models for transfer learning, BPE subword-tokenization, mono/multilingual embeddings, and everything in between.
This project aims to implement word-based, character-based and subword-based tokenization techniques.
An educational project dedicated to text-to-image generation with neural networks. A BPE tokenizer and a VQ-VAE are used to learn the embeddings of text and images respectively. A transformer-based model is then trained to predict the next token in the concatenated sequence of text and image tokens and is used for generation.