A lightweight implementation of the Unicode Text Segmentation (UAX #29)
- Zero-dependencies: It doesn't bloat node_modules or the network tab.
- Excellent compatibility: It works well on older browsers, edge runtimes, and embedded JavaScript runtimes like Hermes and QuickJS.
- Small bundle size: It compresses Unicode data effectively and ships in a tree-shakeable format, so unused code can be eliminated.
- Extremely efficient: It's carefully optimized for performance, making it the fastest one in the ecosystem, outperforming even the built-in Intl.Segmenter.
- TypeScript: It's fully type-checked, and provides definitions with JSDoc.
- ESM-first: It natively supports ES Modules, and also supports CommonJS.
Unicode® 15.1.0
Unicode® Standard Annex #29 - Revision 43 (2023-08-16)
There are several entry points for text segmentation.
- unicode-segmenter/grapheme: Segments and counts extended grapheme clusters
- unicode-segmenter/intl-adapter: Intl.Segmenter adapter
- unicode-segmenter/intl-polyfill: Intl.Segmenter polyfill
And extra utilities for combined use cases.
- unicode-segmenter/emoji: Matches single codepoint emojis
- unicode-segmenter/general: Matches single codepoint alphanumerics
- unicode-segmenter/utils: Handles UTF-8 and UTF-16 surrogates
Utilities for text segmentation by extended grapheme cluster rules.
import { countGrapheme } from 'unicode-segmenter/grapheme';
'👋 안녕!'.length;
// => 6
countGrapheme('👋 안녕!');
// => 5
'a̐éö̲'.length;
// => 7
countGrapheme('a̐éö̲');
// => 3
import { graphemeSegments } from 'unicode-segmenter/grapheme';
[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }
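The gap between `.length` and the grapheme count comes from three different units: UTF-16 code units, code points, and extended grapheme clusters. The sketch below is a minimal illustration of that difference. It uses the built-in `Intl.Segmenter` as a stand-in so it runs without installing anything, and `measureText` is a hypothetical helper name, not part of unicode-segmenter.

```javascript
// Stand-in segmenter: the built-in Intl.Segmenter, not the library's implementation.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: reports the size of `text` in three different units.
function measureText(text) {
  let graphemes = 0;
  for (const _segment of graphemeSegmenter.segment(text)) {
    graphemes += 1;
  }
  return {
    codeUnits: text.length,       // UTF-16 code units
    codePoints: [...text].length, // string iteration yields code points
    graphemes,                    // extended grapheme clusters
  };
}

console.log(measureText('👋 안녕!'));
// => { codeUnits: 6, codePoints: 5, graphemes: 5 }

// "a" + U+0310, "e" + U+0301, "o" + U+0308 + U+0332
console.log(measureText('a\u0310e\u0301o\u0308\u0332'));
// => { codeUnits: 7, codePoints: 7, graphemes: 3 }
```

The emoji is a single code point stored as two code units, while the combining marks are separate code points that merge into their base letter's cluster.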
graphemeSegments() also exposes some information identified during segmentation, to support additional use cases.
For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer which boundary rule was applied.
import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';
function* matchEmoji(str) {
  for (const { segment, _catBegin } of graphemeSegments(str)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is an emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}
[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍
Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)
import { Segmenter } from 'unicode-segmenter/intl-adapter';
// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();
Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)
// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';
const segmenter = new Intl.Segmenter();
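Since the adapter and the polyfill both follow the `Intl.Segmenter` interface, code written against `segment()` should work with either of them or with the native implementation. The following is a minimal sketch under that assumption. `truncateGraphemes` is a hypothetical helper, and the native `Intl.Segmenter` is used here only so the snippet runs on its own.

```javascript
// Hypothetical helper: keeps at most `maxGraphemes` grapheme clusters of `text`.
// `segmenter` is any object exposing an Intl.Segmenter-style `segment()` method.
function truncateGraphemes(text, maxGraphemes, segmenter) {
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count >= maxGraphemes) break;
    result += segment;
    count += 1;
  }
  return result;
}

// Stand-in: the native segmenter. An adapter instance would be passed the same way.
const nativeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

const truncated = truncateGraphemes('a\u0310e\u0301o\u0308\u0332', 2, nativeSegmenter);
console.log(truncated === 'a\u0310e\u0301');
// => true
```

Passing the segmenter as a parameter keeps the helper independent of which implementation is available at runtime.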
Utilities for matching emoji-like characters.
import {
  isEmojiPresentation, // match \p{Emoji_Presentation}
  isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';
isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false
isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true
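To cross-check results like these in a runtime that supports Unicode property escapes, the two properties can also be tested with regular expressions. This is only a sketch for comparison: `matchesProperty` is a hypothetical helper, and the answers depend on the Unicode version bundled with the host engine rather than on unicode-segmenter's data.

```javascript
// Regex equivalents of the two properties, anchored to match exactly one code point.
const emojiPresentation = /^\p{Emoji_Presentation}$/u;
const extendedPictographic = /^\p{Extended_Pictographic}$/u;

// Hypothetical helper: tests a numeric code point against one of the patterns above.
function matchesProperty(pattern, codePoint) {
  return pattern.test(String.fromCodePoint(codePoint));
}

const smiley = 0x1f60d; // 😍
const whiteHeart = 0x2661; // ♡

console.log(matchesProperty(emojiPresentation, smiley));        // => true
console.log(matchesProperty(emojiPresentation, whiteHeart));    // => false
console.log(matchesProperty(extendedPictographic, smiley));     // => true
console.log(matchesProperty(extendedPictographic, whiteHeart)); // => true
```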
Utilities for matching alphanumeric characters.
import {
isLetter, // match \p{L}
isNumeric, // match \p{N}
isAlphabetic, // match \p{Alphabetic}
isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
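These predicates work on code points, so a string has to be walked by code point rather than by UTF-16 unit before applying them. Here is a minimal sketch of that pattern. `countCodePoints` is a hypothetical helper, and `isAlphanumericLike` is a regex-based stand-in used so the snippet runs without the package; it is not the library's implementation.

```javascript
// Hypothetical helper: counts the code points in `text` accepted by `predicate`.
function countCodePoints(text, predicate) {
  let count = 0;
  // for...of iterates a string by code point, so surrogate pairs stay together
  for (const char of text) {
    if (predicate(char.codePointAt(0))) count += 1;
  }
  return count;
}

// Stand-in predicate taking a numeric code point, mirroring [\p{N}\p{Alphabetic}]
const isAlphanumericLike = (codePoint) =>
  /^[\p{N}\p{Alphabetic}]$/u.test(String.fromCodePoint(codePoint));

console.log(countCodePoints('abc 123 안녕!', isAlphanumericLike));
// => 8
```

Any function that accepts a numeric code point and returns a boolean can be passed as the predicate.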
You can access some internal utilities to deal with JavaScript strings.
import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';
const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);
if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}
import { isBMP } from 'unicode-segmenter/utils';
const char = '😍'; // .length = 2
const cp = char.codePointAt(0);
// Parentheses are required: `===` binds tighter than the conditional operator
char.length === (isBMP(cp) ? 1 : 2);
// => true
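For readers unfamiliar with UTF-16, the arithmetic behind checks like these is standard and small. The sketch below shows it with hypothetical names (`isHigh`, `isLow`, `toCodePoint`, `inBMP`); it illustrates the encoding rules and is not taken from the library's source.

```javascript
// High (lead) surrogates occupy U+D800..U+DBFF, low (trail) surrogates U+DC00..U+DFFF.
const isHigh = (codeUnit) => codeUnit >= 0xd800 && codeUnit <= 0xdbff;
const isLow = (codeUnit) => codeUnit >= 0xdc00 && codeUnit <= 0xdfff;

// Each surrogate carries 10 bits of the code point, offset by 0x10000.
const toCodePoint = (high, low) =>
  ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;

// The Basic Multilingual Plane covers U+0000..U+FFFF and fits in one code unit.
const inBMP = (codePoint) => codePoint <= 0xffff;

const text = '😍'; // U+1F60D, stored as 0xD83D 0xDE0D
const high = text.charCodeAt(0);
const low = text.charCodeAt(1);

console.log(isHigh(high) && isLow(low));                    // => true
console.log(toCodePoint(high, low) === text.codePointAt(0)); // => true
console.log(text.length === (inBMP(0x1f60d) ? 1 : 2));       // => true
```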
unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.
To ensure compatibility, the runtime should support:
If the runtime doesn't support these features, they can easily be provided with tools like Babel.
Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.
unicode-segmenter compiles to smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.
unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec compliant. The benchmark therefore tracks the performance, bundle size, and Unicode version compliance of several libraries:
- graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
- grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
- @formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
- WebAssembly binding of Rust's unicode-segmentation library
- Built-in Intl.Segmenter API
Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) |
---|---|---|---|---|---|---|
unicode-segmenter/grapheme | 15.1.0 | ✔️ | 28,270 | 24,291 | 6,347 | 4,273 |
graphemer | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 |
grapheme-splitter | 10.0.0 | ✖️ | 122,252 | 23,680 | 7,852 | 4,841 |
@formatjs/intl-segmenter * | 15.0.0 | ✔️ | 491,043 | 318,721 | 54,248 | 34,380 |
unicode-segmentation * | 15.0.0 | ✔️ | 45,803 | 41,717 | 19,687 | 13,477 |
Intl.Segmenter * | - | - | 0 | 0 | 0 | 0 |
- @formatjs/intl-segmenter handles grapheme, word, and sentence, but it's not tree-shakable.
- unicode-segmentation size contains only the minimum WASM binary and bindings. It will be larger by adding more features.
- Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
- Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.
Name | Unicode® | Bytecode size | Bytecode size (gzip)* |
---|---|---|---|
unicode-segmenter/grapheme | 15.1.0 | 35,014 | 13,326 |
graphemer | 15.0.0 | 133,949 | 31,710 |
grapheme-splitter | 10.0.0 | 63,810 | 19,125 |
@formatjs/intl-segmenter * | 15.0.0 | 315,865 | 99,063 |
* It would be compressed when included as an app asset.
Here is a brief explanation, and you can see archived benchmark results.
Performance in Node.js: unicode-segmenter/grapheme is significantly faster than alternatives.
- 7~18x faster than other JavaScript libraries
- 1.5~3x faster than the WASM binding of Rust's unicode-segmentation
- 3~8x faster than the built-in Intl.Segmenter
Performance in Bun: unicode-segmenter/grapheme has almost the same performance as the built-in Intl.Segmenter, with no performance degradation compared to other JavaScript libraries.
Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, unicode-segmenter/grapheme generally outperforms other JavaScript libraries in most environments.
Performance in React Native: unicode-segmenter/grapheme is significantly faster than alternatives when compiled to Hermes bytecode. It's 2~4x faster than graphemer and 18~25x faster than grapheme-splitter, with the performance gap increasing with input size.
Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build a benchmark yourself.
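For anyone building their own benchmark, a rough timing harness can be quite small. The sketch below is one possible starting point, not the project's benchmark suite. `bench` is a hypothetical helper, the warm-up and iteration counts are arbitrary assumptions, and the built-in `Intl.Segmenter` is measured only so the snippet runs without dependencies. Other segmenters would be wrapped in a callback the same way.

```javascript
// Hypothetical micro-benchmark helper based on the global performance.now().
function bench(label, fn, iterations = 1000) {
  // Warm up first so JIT compilation is less likely to dominate the measurement
  for (let i = 0; i < 100; i++) fn();

  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = performance.now() - start;

  return { label, iterations, elapsedMs, opsPerSec: iterations / (elapsedMs / 1000) };
}

// Sample input: combining sequences followed by CRLF, repeated to get a longer text
const sampleInput = 'a\u0310e\u0301o\u0308\u0332\r\n'.repeat(100);
const builtinSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

const builtinResult = bench('Intl.Segmenter', () => {
  let count = 0;
  for (const _segment of builtinSegmenter.segment(sampleInput)) count += 1;
  return count;
});

console.log(builtinResult.label, builtinResult.elapsedMs.toFixed(2), 'ms');
```

Numbers from a loop like this are noisy, so comparing several runs and input sizes gives a more reliable picture than a single measurement.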
Note
The initial implementation was ported manually from Rust's unicode-segmentation library, which is licensed under the MIT license.