base131072

Base131072 is a binary encoding optimised for UTF-32-encoded text and Twitter; it is the intended successor to Base65536. This JavaScript module, base131072, is an implementation of this encoding. However, it can't be used yet because there aren't enough safe Unicode characters.

Efficiency ratings are averaged over long inputs. Higher is better.

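| Constraint | Encoding | Efficiency (UTF‑8) | Efficiency (UTF‑16) | Efficiency (UTF‑32) | Bytes per Tweet * |
|---|---|---|---|---|---|
| ASCII‑constrained | Unary / Base1 | 0% | 0% | 0% | 1 |
| ASCII‑constrained | Binary | 13% | 6% | 3% | 35 |
| ASCII‑constrained | Hexadecimal | 50% | 25% | 13% | 140 |
| ASCII‑constrained | Base64 | 75% | 38% | 19% | 210 |
| ASCII‑constrained | Base85 † | 80% | 40% | 20% | 224 |
| BMP‑constrained | HexagramEncode | 25% | 38% | 19% | 105 |
| BMP‑constrained | BrailleEncode | 33% | 50% | 25% | 140 |
| BMP‑constrained | Base2048 | 56% | 69% | 34% | 385 |
| BMP‑constrained | Base32768 | 63% | 94% | 47% | 263 |
| Full Unicode | Ecoji | 31% | 31% | 31% | 175 |
| Full Unicode | Base65536 | 56% | 64% | 50% | 280 |
| Full Unicode | Base131072 ‡ | 53%+ | 53%+ | 53% | 297 |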

* New-style "long" Tweets, up to 280 Unicode characters, give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation etc.
‡ Base131072 is a work in progress, not yet ready for general use.

For example, using Base64, up to 210 bytes of binary data can fit in a Tweet. With Base131072, 297 bytes are possible.
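The UTF-32 column of the table can be reproduced with a little arithmetic: UTF-32 spends 32 bits on every character, so the efficiency is the number of payload bits per character divided by 32. The sketch below is illustrative only. `utf32Efficiency` is a hypothetical helper, not part of this module, and the bits-per-character figures are assumptions derived from each encoding's repertoire size (for example, 2¹⁷ characters carry 17 bits each).

```javascript
// Hypothetical helper: efficiency of an encoding when the output text is
// stored as UTF-32, i.e. payload bits per character over 32 bits per character.
function utf32Efficiency (bitsPerCharacter) {
  return bitsPerCharacter / 32
}

// Assumed payload bits per character, taken from the repertoire sizes.
const bitsPerCharacter = {
  Hexadecimal: 4, // 2^4 = 16 characters
  Base64: 6, // 2^6 = 64 characters
  Base32768: 15,
  Base65536: 16,
  Base131072: 17
}

for (const [name, bits] of Object.entries(bitsPerCharacter)) {
  // Rounded to a whole percentage, as in the table.
  console.log(name, Math.round(utf32Efficiency(bits) * 100) + '%')
}
```

The UTF-8 and UTF-16 columns depend on which code points each encoding actually uses, so they cannot be derived from the repertoire size alone.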

How does it work?

Base131072 is a 17-bit encoding. We take the input binary data as a sequence of 8-bit numbers, compact it into a sequence of bits, then dice the bits up again to make a sequence of 17-bit numbers. We then encode each of these 2¹⁷ = 131,072 possible numbers as a different Unicode code point.
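The regrouping step can be sketched as follows. This is a minimal illustration, not this module's actual code: `toSeventeenBitNumbers` and the shape of its return value are hypothetical names chosen for the example. It returns the complete 17-bit numbers plus whatever bits are left over at the end.

```javascript
// Hypothetical helper: regroup a sequence of 8-bit numbers into 17-bit numbers.
const BITS_PER_CHARACTER = 17

function toSeventeenBitNumbers (bytes) {
  const numbers = []
  let acc = 0 // bits read from the input but not yet emitted
  let accBits = 0 // how many bits `acc` currently holds (never more than 24)

  for (const byte of bytes) {
    acc = (acc << 8) | byte
    accBits += 8
    if (accBits >= BITS_PER_CHARACTER) {
      // Emit the top 17 bits and keep the remainder for the next round.
      accBits -= BITS_PER_CHARACTER
      numbers.push(acc >>> accBits)
      acc &= (1 << accBits) - 1
    }
  }

  // `leftoverBits` is 0 when the input fits exactly, otherwise 1 to 16.
  return { numbers, leftover: acc, leftoverBits: accBits }
}

// 297 bytes is 2,376 bits: 139 complete 17-bit numbers and 13 bits left over.
const result = toSeventeenBitNumbers(new Uint8Array(297))
console.log(result.numbers.length, result.leftoverBits)
```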

Padding

Note that the final 17-bit number in the sequence is likely to be "incomplete", i.e. missing some of its bits. We need to signal this fact in the output string somehow. Here's how we handle those cases.

Final 17-bit number has 1 to 7 missing bits

In the following cases:

bbbbbbbbcccccccc_ // 1 missing bit
bbbbbbbcccccccc__ // 2 missing bits
bbbbbbcccccccc___ // 3 missing bits
bbbbbcccccccc____ // 4 missing bits (note: this is how a Tweet containing 297 bytes of data will end)
bbbbcccccccc_____ // 5 missing bits
bbbcccccccc______ // 6 missing bits
bbcccccccc_______ // 7 missing bits

we pad the incomplete 17-bit number out to 17 bits using 1s:

bbbbbbbbcccccccc1
bbbbbbbcccccccc11
bbbbbbcccccccc111
bbbbbcccccccc1111
bbbbcccccccc11111
bbbcccccccc111111
bbcccccccc1111111

and then encode as normal using our 2¹⁷-character repertoire.
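As a rough sketch of this case, assuming the leftover bits are held as an integer together with a count of how many bits it contains (`padToSeventeenBits` is a hypothetical name, not this module's API):

```javascript
// Hypothetical helper: pad an incomplete final number out to 17 bits with 1s.
// Only valid when 1 to 7 bits are missing, i.e. 10 to 16 bits are present.
function padToSeventeenBits (leftover, leftoverBits) {
  const missing = 17 - leftoverBits
  if (!Number.isInteger(missing) || missing < 1 || missing > 7) {
    throw new RangeError('Expected 1 to 7 missing bits, got ' + missing)
  }
  // Shift the real bits up and fill the vacated low bits with 1s.
  return (leftover << missing) | ((1 << missing) - 1)
}

// 16 bits present, 1 missing: bbbbbbbbcccccccc_ becomes bbbbbbbbcccccccc1
console.log(padToSeventeenBits(0b1010101010101010, 16).toString(2))
```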

Final 17-bit number has 8 to 15 missing bits

In the following cases:

bcccccccc________ // 8 missing bits
cccccccc_________ // 9 missing bits
ccccccc__________ // 10 missing bits
cccccc___________ // 11 missing bits
ccccc____________ // 12 missing bits (note: this is how a Tweet containing 296 bytes of data will end)
cccc_____________ // 13 missing bits
ccc______________ // 14 missing bits
cc_______________ // 15 missing bits

we encode them differently. We'll pad the incomplete number out to only 9 bits using 1s:

bcccccccc
cccccccc1
ccccccc11
cccccc111
ccccc1111
cccc11111
ccc111111
cc1111111

and then encode them using a completely different, 2⁹-character repertoire. On decoding, we treat such a character differently: it returns 9 bits, rather than the 17 bits returned by characters in the main repertoire.
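A minimal sketch of this case, under the same assumption that the leftover bits are held as an integer plus a bit count (`padToNineBits` is a hypothetical name, not this module's API):

```javascript
// Hypothetical helper: pad an incomplete final number out to 9 bits with 1s.
// Only valid when 8 to 15 bits are missing, i.e. 2 to 9 bits are present.
function padToNineBits (leftover, leftoverBits) {
  if (!Number.isInteger(leftoverBits) || leftoverBits < 2 || leftoverBits > 9) {
    throw new RangeError('Expected 2 to 9 bits present, got ' + leftoverBits)
  }
  const padding = 9 - leftoverBits
  // Shift the real bits up and fill the vacated low bits with 1s.
  return (leftover << padding) | ((1 << padding) - 1)
}

// 5 bits present, 12 missing: ccccc____________ becomes ccccc1111
console.log(padToNineBits(0b10110, 5).toString(2))
```

The result is always below 2⁹ = 512, so it can index directly into a 512-character repertoire.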

Final 17-bit number has 16 missing bits

In this final case:

c________________ // 16 missing bits

we simply take this as a 1-bit number:

c

and encode it using a third, 2¹-character repertoire. Again, on decoding, this is treated specially, and only 1 bit is added to the stream, rather than 9 or 17 as for the other characters.

In other words, Base131072 is a slight misnomer. It uses not 131,072 but 2¹⁷ + 2⁹ + 2¹ = 131,586 characters for its three repertoires. Of course, Base64 uses a 65th character for its padding too.
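Putting the three cases together, the choice of repertoire depends only on how many bits the final number actually contains. The sketch below is an illustration under that assumption; `finalChunk` and its `{ width, value }` result are hypothetical, not this module's API.

```javascript
// Hypothetical helper: decide which repertoire encodes the final, incomplete
// number, and pad it with 1s to that repertoire's width.
function finalChunk (leftover, leftoverBits) {
  if (!Number.isInteger(leftoverBits) || leftoverBits < 1 || leftoverBits > 16) {
    throw new RangeError('Expected 1 to 16 bits present, got ' + leftoverBits)
  }
  // 10 to 16 bits present (1 to 7 missing): main 17-bit repertoire.
  // 2 to 9 bits present (8 to 15 missing): 9-bit repertoire.
  // 1 bit present (16 missing): 1-bit repertoire.
  const width = leftoverBits >= 10 ? 17 : leftoverBits >= 2 ? 9 : 1
  const padding = width - leftoverBits
  return { width, value: (leftover << padding) | ((1 << padding) - 1) }
}

// Total number of characters needed across the three repertoires.
const totalCharacters = 2 ** 17 + 2 ** 9 + 2 ** 1
console.log(totalCharacters, finalChunk(0b10110, 5))
```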

Decoding

On decoding, we get a series of 8-bit values, the last of which might be incomplete, like so:

1_______ // 7 missing bits
11______ // 6 missing bits
111_____ // 5 missing bits
1111____ // 4 missing bits
11111___ // 3 missing bits
111111__ // 2 missing bits
1111111_ // 1 missing bit

These are the padding 1s added at encoding time. We can check this and discard this final value.
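The check can be sketched like this, assuming each decoded character has already been mapped back to a number and the width (17, 9 or 1 bits) of the repertoire it came from. `chunksToBytes` and the `{ value, width }` chunk shape are hypothetical names for this example, not this module's API.

```javascript
// Hypothetical helper: turn decoded chunks back into bytes, then verify and
// discard the final incomplete byte, which must consist of padding 1s only.
function chunksToBytes (chunks) {
  const bytes = []
  let acc = 0 // bits received but not yet emitted as a byte
  let accBits = 0 // how many bits `acc` holds (at most 7 between chunks)

  for (const { value, width } of chunks) {
    acc = (acc << width) | value
    accBits += width
    while (accBits >= 8) {
      accBits -= 8
      bytes.push(acc >>> accBits)
      acc &= (1 << accBits) - 1
    }
  }

  // Anything left over is an incomplete byte. It should be all 1s.
  if (acc !== (1 << accBits) - 1) {
    throw new Error('Invalid padding in final character')
  }
  return Uint8Array.from(bytes)
}

// Two bytes 0xAB 0xCD are 16 bits: one 17-bit number with a single padding 1.
console.log(chunksToBytes([{ value: (0xABCD << 1) | 1, width: 17 }]))
```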

Is this ready yet?

No. We need 131,586 "safe" characters for this encoding, but as of Unicode 9.0 only 108,397 exist. However, future versions of Unicode may add enough safe characters for this to become possible. In any case, the groundwork can certainly be laid.
