sentencepiece

Here are 35 public repositories matching this topic...

eliben / go-sentencepiece

Go implementation of the SentencePiece tokenizer

go golang encoding language-model tokenization sentencepiece llm

Updated Aug 8, 2024
Go

twinnydotdev / toxe

SentencePiece tokenizer for cross-encoders

machine-learning tokenizer artificial-intelligence sentencepiece crossencoder twinny

Updated Aug 7, 2024
JavaScript

Systemcluster / kitoken

Fast and versatile tokenizer for language models with BPE, Unigram and WordPiece tokenization. Compatible with SentencePiece, Tokenizers, Tiktoken and more.

nlp tokenizer word-segmentation unigram bpe sentencepiece

Updated Aug 7, 2024
Rust

niedev / RTranslator

Open source real-time translation app for Android that runs locally

android translator translation transformers mobile-app android-app bluetooth-le whisper realtime-translator onnx sentencepiece onnxruntime nllb

Updated Aug 1, 2024
C++

OpenNMT / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

python unicode natural-language-processing cpp icu tokenizer machine-translation tokenization bpe sentencepiece

Updated Jul 25, 2024
C++

Systemcluster / sentencepiece-model

SentencePiece model parser generated from the SentencePiece protobuf definition.

nlp tokenizer sentencepiece

Updated Jul 16, 2024
Rust

ReshiAdavan / Thoth

An industry-standard tokenizer for large-scale language models such as OpenAI's GPT series.

python rust natural-language-processing tokenizer gpt-2 sentencepiece bytepairencoding gpt-4 tiktoken llama2

Updated Jun 29, 2024
Python

ZJaume / escape-unk

Escape unknown symbols in SentencePiece vocabularies

natural-language-processing neural-machine-translation escaping sentencepiece

Updated Jun 25, 2024
Python

leliuga / datrin

dataset, train, inference

inference dataset flax train jax sentencepiece safetensors

Updated May 19, 2024
Python

himkt / konoha

🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.

nlp natural-language-processing japanese text-processing mecab kytea sudachi sentencepiece janome

Updated May 15, 2024
Python

kmaurinjones / WikiGameBot

Automated WikiGame-playing bot, built on SentenceTransformer word embeddings.

nlp api wikipedia transformer wordembeddings sentencepiece wikigame sentencetransformer

Updated Jan 18, 2024
Python

Doarakko / vector-text-similarity-search

Search for similar documents using Elasticsearch and BERT.

elasticsearch japanese bert similarity-search sentencepiece

Updated Sep 25, 2023
Jupyter Notebook

FloweryK / Sentencepiece-Pretrained-Models

Pretrained models and training code for SentencePiece

pretrained sentencepiece

Updated Jul 27, 2023
Python

danieldk / sentencepiece

Rust binding for the sentencepiece library

rust sentencepiece

Updated Jul 22, 2023
Rust

taishan1994 / sentencepiece_chinese_bpe

Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.

tokenization sentencepiece chinese-vocab

Updated Jun 24, 2023
Python

sunsikim / tf-spm-tokenizer-pattern

Training code for a SentencePiece tokenizer that can be incorporated into a TensorFlow model

nlp imdb-dataset tensorflow2 sentencepiece

Updated May 21, 2023
Python

stephantul / piecelearn

Learning BPE embeddings by first learning a segmentation model and then training word2vec

word2vec embeddings bpe wordpiece sentencepiece

Updated Dec 18, 2022
Python

bnosac / sentencepiece

R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece

natural-language-processing byte word-segmentation sentencepiece

Updated Nov 14, 2022
C++

arusl / anlp_nlp2021_d3-1

Code for the experiments in "An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification", presented at https://www.anlp.jp/nlp2021/. Authors: Andre Rusli and Makoto Shishido (Tokyo Denki University).

natural-language-processing text-classification mecab sentencepiece japanese-tokenizer sudachipy

Updated Mar 8, 2022
Jupyter Notebook

Sid911 / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

natural-language-processing cmake sentencepiece

Updated Oct 26, 2021
C++