Azam Codec

TL;DR : A lexicographically sortable multi-section string encoding of byte arrays.

A byte array encoding and decoding method, in big endian (network byte order), with markers to separate byte arrays into sections. The encoded symbols used are picked so that encoded string are lexicographically sortable based on the byte array values of each section, given comparing valid strings with the same number of section. Optimized to store multiple unsigned integers in a string, with human readable characters, and human input errors considered, by use of symbols inspired by Crockford's Base32 encoding. Having different set of characters in same case, representing higher and lower 4-bit of a byte makes the string resist case conversions.

Implementations

Language	Repository
Rust	azamcodec-rs
Java	azamcodec-java
Javascript	azamcodec-js
Go	azamcodec-go

Specification

Given an array of byte array of size m, with each byte array has variable lengths n1,...,nm.

A = [[b₁₁, b₁₂, ..., b₁ₙ₁], [b₂₁, b₂₂, ..., b₂ₙ₂], ..., , [bₘ₁, bₘ₂, ..., bₘₙₘ]]

Loop for each byte array (let's call these sections).

S₁ = [b₁₁, b₁₂, ..., b₁ₙ₁]
S₂ = [b₂₁, b₂₂, ..., b₂ₙ₂]
...
Sₘ = [bₘ₁, bₘ₂, ..., bₘₙₘ]

Encode for each section

input: I (Nybble value)
output: S (Nybble symbol)
procedure LOW_SYMBOL(I) -> S
  return symbol at index B in '0123456789abcdef'
end procedure

input: I (Nybble value)
output: S (Nybble symbol)
procedure HIGH_SYMBOL(I) -> S
  return symbol at index B in 'ghjkmnpqrstvwxyz'
end procedure

input: S (Byte array)
output: V (Encoded string)
procedure ENCODE_SECTION (S) -> V
  V <- ''
  lead_nybble_checked <- false
  for i = 0..LENGTH(S) do
    byte <- S[i]
    high_nybble <- byte >> 4
    low_nybble <- byte & 0x0f
    if lead_nybble_checked is true OR high_nybble > 0 do
      S <- HIGH_SYMBOL(high_nybble)
      V <- V + S
      if i + 1 < LENGTH(S) do
        S <- LOW_SYMBOL(high_nybble)
        V <- V + S
      else do
        S <- HIGH_SYMBOL(high_nybble)
        V <- V + S
      end if
      lead_nybble_checked <- true
    else if low_nybble > 0 do
      if i + 1 < LENGTH(S) do
        S <- LOW_SYMBOL(high_nybble)
        V <- V + S
      else do
        S <- HIGH_SYMBOL(high_nybble)
        V <- V + S
      end if
      lead_nybble_checked <- true
    end if
  end for

  return V
end procedure

Characteristics

While this could be a generic variable length byte array codec, the merits was weighed heavily on creating a short, sortable, multi-part integers.

Similar to a big endian hexadecimal string with following differences:
- Byte array can be marked as separate sequential byte arrays.
- Leading zeros are always truncated.
Given a set of encoded strings with same section counts, the set is lexicographically sortable.
Only digits and roman alphabets is used, with similar looking alphabets (digit zero 0 and letter O, or digit one 1 and letters I and L) mapped to the same value.
Length of encoded string is the sum of zero truncated 4-bits of each byte array of each sections.
- e.g. [[0xf1],[0x00, 0xf2],[0x10,0x03],[0x04]] (length: 6) => 'z1','z2','hgg3','4' => 'z1z2hgg34' (length: 9)
Minimum length of encoded string is count of sections.
Maximum length of encoded string is two times sum of all byte array length in all sections.

Advantages

Compact encoding for unsigned integers
Support multiple byte arrays
URL safe
Lexicographically sortable in normal alphabetical collation
High human readability
Case insensitive

Disadvantages

Total length is same as hexadecimal at maximum
Length of byte array in each section is not preserved
- e.g. [0x00, 0x00, 0x00] and [0x00] both encodes to the same string 0

Examples

Description	Bytes	Encoded
Empty	`[]`	`''`
Single byte array	`[0x00]`	`'0'`
	`[0x01]`	`'1'`
	`[0x0f]`	`'f'`
	`[0x10]`	`'h0'`
	`[0xff]`	`'zf'`
	`[0x00,0x00]`	`'0'`
	`[0x00,0x01]`	`'1'`
	`[0x01,0x01]`	`'hg1'`
	`[0x01,0xff]`	`'hzf'`
	`[0x0f,0xff]`	`'zzf'`
	`[0xff,0xff]`	`'zzzf'`
	`[0xff,0x00,0xff]`	`'zzggzf'`
Multiple byte arrays	`[0x01],[0x02],[0x03]`	`'123'`
	`[0x01],[0x02],[0xff]`	`'12zf'`
	`[0x01,0x02,0x03],[0x04,0x05,0x06],[0x07,0x08,0x09]`	`'hgjg3mgng6qgrg9'`

Encoded length

Given an array of array of bytes [[b₁, b₂, ..., bₙ₁], [b₁, b₂, ..., bₙ₂], ..., , [b₁, b₂, ..., bₙᵢ]], the encoded length is twice the sum of byte length of the arrays (2 × (n1+n2+...+ni)) at maximum (same as hexadecimal encoding), and the sum of byte length of the arrays (n1+n2+...+ni) at minimum.

Symbols

Inspired by Crockford's Base32, split in half with first 16 symbols for lower nybble and next 16 symbols as higher nybble, with lower case as encode symbol. A valid encoded string do not contain leading high nybble value of 0 (g or G) in any section, is always in lowercase, but decoding supports upper case. Any symbols outside of the decode symbols are considered invalid.

Lower nybble

Nybble Value	Decode Symbol	Encode Symbol
0	0 o O	0
1	1 i I l L	1
2	2	2
3	3	3
4	4	4
5	5	5
6	6	6
7	7	7
8	8	8
9	9	9
10	a A	a
11	b B	b
12	c C	c
13	d D	d
14	e E	e
15	f F	f

Higher nybble

Symbol	Decode Symbol	Encode Symbol / Decode Strict Symbol
0	g G	g
1	h H	h
2	j J	j
3	k K	k
4	m M	m
5	n N	n
6	p P	p
7	q Q	q
8	r R	r
9	s S	s
10	t T	t
11	v V	v
12	w W	w
13	x X	x
14	y Y	y
15	z Z	z

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Azam Codec

Implementations

Specification

Characteristics

Advantages

Disadvantages

Examples

Encoded length

Symbols

Lower nybble

Higher nybble

Prior Art

About

Releases

Packages

Symbol	Decode Symbol	Encode Symbol / Decode Strict Symbol
0	g G	g
1	h H	h
2	j J	j
3	k K	k
4	m M	m
5	n N	n
6	p P	p
7	q Q	q
8	r R	r
9	s S	s
10	t T	t
11	v V	v
12	w W	w
13	x X	x
14	y Y	y
15	z Z	z

Symbol	Decode Symbol	Encode Symbol / Decode Strict Symbol
0	g G	g
1	h H	h
2	j J	j
3	k K	k
4	m M	m
5	n N	n
6	p P	p
7	q Q	q
8	r R	r
9	s S	s
10	t T	t
11	v V	v
12	w W	w
13	x X	x
14	y Y	y
15	z Z	z

License

azam/azamcodec

Folders and files

Latest commit

History

Repository files navigation

Azam Codec

Implementations

Specification

Characteristics

Advantages

Disadvantages

Examples

Encoded length

Symbols

Lower nybble

Higher nybble

Prior Art

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages

Symbol	Decode Symbol	Encode Symbol / Decode Strict Symbol
0	g G	g
1	h H	h
2	j J	j
3	k K	k
4	m M	m
5	n N	n
6	p P	p
7	q Q	q
8	r R	r
9	s S	s
10	t T	t
11	v V	v
12	w W	w
13	x X	x
14	y Y	y
15	z Z	z