unicode-segmenter

A lightweight implementation of Unicode Text Segmentation (UAX #29).

Zero dependencies: It doesn't bloat node_modules or the network tab.
Excellent compatibility: It works well on older browsers, edge runtimes, and embedded JavaScript runtimes like Hermes and QuickJS.
Small bundle size: It compresses Unicode data effectively and ships in a tree-shakeable format, so unused code can be eliminated.
Extremely efficient: It's carefully optimized for performance, making it the fastest one in the ecosystem, outperforming even the built-in Intl.Segmenter.
TypeScript: It's fully type-checked and provides type definitions with JSDoc.
ESM-first: It natively supports ES Modules and also supports CommonJS.

Unicode® Version

Unicode® 15.1.0

Unicode® Standard Annex #29 - Revision 43 (2023-08-16)

APIs

There are several entry points for text segmentation.

unicode-segmenter/grapheme: Segments and counts extended grapheme clusters
unicode-segmenter/intl-adapter: Intl.Segmenter adapter
unicode-segmenter/intl-polyfill: Intl.Segmenter polyfill

There are also extra utilities for combined use cases.

unicode-segmenter/emoji: Matches single codepoint emojis
unicode-segmenter/general: Matches single codepoint alphanumerics
unicode-segmenter/utils: Handles UTF-8 and UTF-16 surrogates

Export `unicode-segmenter/grapheme`

Utilities for text segmentation by extended grapheme cluster rules.

Example: Count graphemes

import { countGrapheme } from 'unicode-segmenter/grapheme';

'👋 안녕!'.length;
// => 6
countGrapheme('👋 안녕!');
// => 5

'a̐éö̲'.length;
// => 7
countGrapheme('a̐éö̲');
// => 3
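The gap between the two numbers comes from what is being counted: `.length` counts UTF-16 code units, while a grapheme is what a reader perceives as one character. The sketch below uses only built-in JavaScript, not this library, to show the three levels side by side. `countUnits` is a hypothetical helper name, and the built-in `Intl.Segmenter` stands in as the grapheme counter, assuming the runtime provides it.

```javascript
// Hypothetical helper (not part of unicode-segmenter): counts the same string
// at three levels using only built-ins.
function countUnits(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    codeUnits: str.length, // UTF-16 code units
    codePoints: [...str].length, // Unicode code points
    graphemes: [...segmenter.segment(str)].length, // extended grapheme clusters
  };
}

countUnits('👋 안녕!');
// => { codeUnits: 6, codePoints: 5, graphemes: 5 }

// Combining marks are separate code points but belong to one grapheme.
countUnits('a\u0310e\u0301o\u0308\u0332');
// => { codeUnits: 7, codePoints: 7, graphemes: 3 }
```

Counting code points is enough for the first string but not for the second, which is why a grapheme-aware counter is needed.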

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }
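One common use of segment objects is truncating text without splitting a cluster. The following is a minimal sketch that works on any iterable of `{ segment, index, input }` objects shaped like the output above. `truncateGraphemes` is a hypothetical helper, and the segments are written out by hand here so the snippet runs without the library; with the library, `graphemeSegments(str)` would presumably be passed instead.

```javascript
// Segment objects shaped like the documented output: { segment, index, input }.
const input = 'a\u0310e\u0301o\u0308\u0332\r\n';
const segments = [
  { segment: 'a\u0310', index: 0, input },
  { segment: 'e\u0301', index: 2, input },
  { segment: 'o\u0308\u0332', index: 4, input },
  { segment: '\r\n', index: 7, input },
];

// Hypothetical helper: keep at most `max` graphemes, never splitting a cluster.
function truncateGraphemes(segments, max) {
  let out = '';
  let count = 0;
  for (const { segment } of segments) {
    if (count >= max) break;
    out += segment;
    count += 1;
  }
  return out;
}

truncateGraphemes(segments, 2);
// => 'a\u0310e\u0301' (two graphemes, four UTF-16 code units)
```

By contrast, `input.slice(0, 2)` would cut by code units and return only the first grapheme.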

Example: Build an advanced grapheme matcher

graphemeSegments() also exposes some information gathered during segmentation to support advanced use cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(str) {
  for (const { segment, _catBegin } of graphemeSegments(str)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍
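To sanity-check a matcher like this, a similar result can be approximated with built-ins only. The sketch below is not the library's method: it segments with the built-in `Intl.Segmenter` and tests whether each segment starts with a `\p{Extended_Pictographic}` code point. `matchEmojiByRegex` is a hypothetical name, and results may vary with the Unicode version of the host runtime.

```javascript
// Built-in segmenter used as a stand-in; assumes the runtime provides Intl.Segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical cross-check: yield segments whose first code point is Extended_Pictographic.
function* matchEmojiByRegex(str) {
  for (const { segment } of segmenter.segment(str)) {
    if (/^\p{Extended_Pictographic}/u.test(segment)) {
      yield segment;
    }
  }
}

[...matchEmojiByRegex('1🌷2🎁3💩4😜5👍')];
// => ['🌷', '🎁', '💩', '😜', '👍']
```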

Export `unicode-segmenter/intl-adapter`

Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();
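Because the adapter mirrors the `Intl.Segmenter` API, code can be written against that shape and receive either constructor. The sketch below demonstrates this with the built-in class so it runs without the library. `splitGraphemes` is a hypothetical helper, and it assumes only what is shown above: a constructor callable without arguments and an iterable returned by `segment()`.

```javascript
// Hypothetical helper: works with any constructor that follows the
// Intl.Segmenter shape (new Ctor() + .segment(str) returning an iterable).
function splitGraphemes(SegmenterCtor, str) {
  const segmenter = new SegmenterCtor();
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// The built-in class defaults to granularity: "grapheme".
splitGraphemes(Intl.Segmenter, '👋 안녕!');
// => ['👋', ' ', '안', '녕', '!']
```

Passing the constructor in keeps call sites unchanged if the built-in is swapped for the adapter, or the other way around.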

Export `unicode-segmenter/intl-polyfill`

Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();
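If the polyfill should only load where the built-in is missing, a feature check can guard the import. This is a minimal sketch, not something the library provides: `needsSegmenterPolyfill` is a hypothetical helper, and the `scope` parameter exists only to make the check testable.

```javascript
// Hypothetical helper: reports whether the given global scope lacks Intl.Segmenter.
function needsSegmenterPolyfill(scope = globalThis) {
  return (
    typeof scope.Intl !== 'object' ||
    scope.Intl === null ||
    typeof scope.Intl.Segmenter !== 'function'
  );
}

needsSegmenterPolyfill({ Intl: {} });
// => true
needsSegmenterPolyfill({ Intl: { Segmenter() {} } });
// => false

// In an ES module, the polyfill could then be loaded on demand:
// if (needsSegmenterPolyfill()) await import('unicode-segmenter/intl-polyfill');
```

Loading on demand keeps the polyfill out of the initial bundle for runtimes that already ship `Intl.Segmenter`.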

Export `unicode-segmenter/emoji`

Utilities for matching emoji-like characters.

Example: Use Unicode emoji property matchers

import {
  isEmojiPresentation,    // match \p{Emoji_Presentation}
  isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true
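Since these matchers take a code point, scanning a whole string means iterating by code point rather than by UTF-16 code unit. The sketch below shows one way to do that. `filterCodePoints` is a hypothetical helper, and the predicate is a stand-in built on a property escape so the snippet runs on its own; with the library, `isExtendedPictographic` would presumably be passed instead.

```javascript
// Hypothetical helper: collect the characters of `str` whose code point
// satisfies `predicate`. for...of iterates by code point, not by UTF-16 unit.
function filterCodePoints(str, predicate) {
  const out = [];
  for (const ch of str) {
    if (predicate(ch.codePointAt(0))) out.push(ch);
  }
  return out;
}

// Stand-in predicate using a built-in Unicode property escape.
const standIn = (cp) =>
  /^\p{Extended_Pictographic}$/u.test(String.fromCodePoint(cp));

filterCodePoints('a😍b🌷', standIn);
// => ['😍', '🌷']
```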

Export `unicode-segmenter/general`

Utilities for matching alphanumeric characters.

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
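The comments above name the Unicode properties each matcher corresponds to. To make the differences concrete, here is a sketch of reference predicates built on the same property escapes. The `ref*` names are hypothetical, and taking a code point as the argument is an assumption carried over from the emoji matchers.

```javascript
// Build a code point predicate from a single-character regular expression.
const matches = (re) => (cp) => re.test(String.fromCodePoint(cp));

// Hypothetical reference predicates, one per property named above.
const refIsLetter = matches(/^\p{L}$/u);
const refIsNumeric = matches(/^\p{N}$/u);
const refIsAlphabetic = matches(/^\p{Alphabetic}$/u);
const refIsAlphanumeric = matches(/^[\p{N}\p{Alphabetic}]$/u);

// U+2167 ROMAN NUMERAL EIGHT is a letter-like number (category Nl).
const romanEight = '\u2167'.codePointAt(0);
refIsLetter(romanEight); // => false
refIsNumeric(romanEight); // => true
refIsAlphabetic(romanEight); // => true

refIsAlphanumeric('_'.codePointAt(0)); // => false
```

The Roman numeral shows why `\p{L}` and `\p{Alphabetic}` are listed separately: Alphabetic also covers letter-like numbers and some marks.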

Export `unicode-segmenter/utils`

You can access some internal utilities to deal with JavaScript strings.

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}
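For reference, the arithmetic behind combining a surrogate pair is small. The functions below are hypothetical re-implementations for illustration only; the library's own code may differ.

```javascript
// A high surrogate is in U+D800..U+DBFF, a low surrogate in U+DC00..U+DFFF.
const isHigh = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLow = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Each surrogate carries 10 bits; the combined value is offset by 0x10000.
function pairToCodePoint(hi, lo) {
  return ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;
}

const s = '😍'; // U+1F60D, stored as 0xD83D 0xDE0D
isHigh(s.charCodeAt(0)) && isLow(s.charCodeAt(1));
// => true
pairToCodePoint(s.charCodeAt(0), s.charCodeAt(1)) === s.codePointAt(0);
// => true
```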

Example: Determine the length of a character

import { isBMP } from 'unicode-segmenter/utils';

const char = '😍'; // .length = 2
const cp = char.codePointAt(0);

char.length === (isBMP(cp) ? 1 : 2);
// => true
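The same rule extends from one character to a whole string. The sketch below sums UTF-16 lengths by walking code points. `inBMP` is a hypothetical stand-in for `isBMP`, assuming it tests whether a code point is at most U+FFFF, and `utf16Length` is a hypothetical helper.

```javascript
// Hypothetical stand-in for isBMP: code points up to U+FFFF fit in one UTF-16 unit.
const inBMP = (cp) => cp <= 0xffff;

// Sum the UTF-16 length of a string by walking its code points.
function utf16Length(str) {
  let units = 0;
  for (const ch of str) {
    units += inBMP(ch.codePointAt(0)) ? 1 : 2;
  }
  return units;
}

utf16Length('👋 안녕!') === '👋 안녕!'.length;
// => true
```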

Runtime Compatibility

unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.

To ensure compatibility, the runtime should support:

If the runtime doesn't support these features, they can easily be provided with tools like Babel.

React Native Support

Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.

unicode-segmenter compiles into smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.

Comparison

unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec-compliant. To that end, the benchmark tracks the performance, bundle size, and Unicode version compliance of several libraries.

`unicode-segmenter/grapheme` vs

graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
@formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
WebAssembly binding of Rust's unicode-segmentation library
Built-in Intl.Segmenter API

JS Bundle Stats

Name	Unicode®	ESM?	Size	Size (min)	Size (min+gzip)	Size (min+br)
`unicode-segmenter/grapheme`	15.1.0	✔️	28,270	24,291	6,347	4,273
`graphemer`	15.0.0	✖️	410,435	95,104	15,752	10,660
`grapheme-splitter`	10.0.0	✖️	122,252	23,680	7,852	4,841
`@formatjs/intl-segmenter`*	15.0.0	✖️	491,043	318,721	54,248	34,380
`unicode-segmentation`*	15.0.0	✔️	45,803	41,717	19,687	13,477
`Intl.Segmenter`*	-	-	0	0	0	0

@formatjs/intl-segmenter handles grapheme, word, and sentence, but it's not tree-shakable.
The unicode-segmentation size covers only the minimum WASM binary and bindings. It grows as more features are added.
Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.
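To read the table as ratios rather than absolute sizes, the min+gzip column can be normalized against `unicode-segmenter/grapheme`. This is plain arithmetic on the figures above. `relativeTo` is a hypothetical helper name.

```javascript
// Sizes from the "Size (min+gzip)" column of the table above.
const minGzip = {
  'unicode-segmenter/grapheme': 6347,
  'graphemer': 15752,
  'grapheme-splitter': 7852,
  '@formatjs/intl-segmenter': 54248,
  'unicode-segmentation': 19687,
};

// Hypothetical helper: express every size as a multiple of the baseline entry.
function relativeTo(sizes, baseline) {
  const base = sizes[baseline];
  return Object.fromEntries(
    Object.entries(sizes).map(([name, size]) => [name, (size / base).toFixed(2)]),
  );
}

relativeTo(minGzip, 'unicode-segmenter/grapheme');
// => {
//   'unicode-segmenter/grapheme': '1.00',
//   'graphemer': '2.48',
//   'grapheme-splitter': '1.24',
//   '@formatjs/intl-segmenter': '8.55',
//   'unicode-segmentation': '3.10'
// }
```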

Hermes Bytecode Stats

Name	Unicode®	Bytecode size	Bytecode size (gzip)*
`unicode-segmenter/grapheme`	15.1.0	35,014	13,326
`graphemer`	15.0.0	133,949	31,710
`grapheme-splitter`	10.0.0	63,810	19,125
`@formatjs/intl-segmenter`*	15.0.0	315,865	99,063

* Bytecode is compressed when it is included as an app asset.

Runtime Performance

Here is a brief summary; see the archived benchmark results for details.

Performance in Node.js: unicode-segmenter/grapheme is significantly faster than alternatives.

7~18x faster than other JavaScript libraries
1.5~3x faster than the WASM binding of Rust's unicode-segmentation
3~8x faster than built-in Intl.Segmenter

Performance in Bun: unicode-segmenter/grapheme has almost the same performance as the built-in Intl.Segmenter, with no performance degradation compared to other JavaScript libraries.

Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, unicode-segmenter/grapheme generally outperforms other JavaScript libraries in most environments.

Performance in React Native: unicode-segmenter/grapheme is significantly faster than alternatives when compiled to Hermes bytecode. It's 2~4x faster than graphemer and 18~25x faster than grapheme-splitter, with the performance gap increasing with input size.

Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build a benchmark yourself.
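For a quick check outside the repository's own benchmark suite, a small timing harness is enough to get a rough comparison. The sketch below is not the project's benchmark: `bench` is a hypothetical helper, the warm-up and iteration counts are arbitrary, and the built-in `Intl.Segmenter` serves as the baseline so the snippet runs without installing anything.

```javascript
// Hypothetical harness: average time per call of `fn(input)` in milliseconds.
function bench(fn, input, iterations = 1000) {
  for (let i = 0; i < 100; i++) fn(input); // warm-up so the JIT can settle
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(input);
  return (performance.now() - start) / iterations;
}

// Baseline: count graphemes with the built-in Intl.Segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const countWithIntl = (str) => {
  let n = 0;
  for (const _ of segmenter.segment(str)) n += 1;
  return n;
};

const sample = '👋 안녕! '.repeat(100);
console.log(`Intl.Segmenter: ${bench(countWithIntl, sample).toFixed(4)} ms/op`);
```

To compare, pass another counting function with the same signature to `bench` and run both on the same input. Micro-benchmarks like this are noisy, so treat small differences with caution.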

LICENSE

MIT

Note

The initial implementation was ported manually from Rust's unicode-segmentation library, which is licensed under the MIT license.
