Refactor all IPLD specifications (ipld#72)
Retcon of all IPLD specifications.
mikeal authored Nov 8, 2018
1 parent 4463bff commit b95ed30
Showing 11 changed files with 406 additions and 873 deletions.
108 changes: 0 additions & 108 deletions CAR.md

This file was deleted.

68 changes: 68 additions & 0 deletions CID.md
@@ -0,0 +1,68 @@
# CIDv1

# Content IDs

This document uses the terms Content ID and CID interchangeably.

Earlier base58-encoded multihash links to protobuf data are called CID Version 0.

## CIDs Version 1

Putting together the IPLD Link update statements above, we can term the new handle for IPLD data CID Version 1, with a multibase prefix, a version, a packed multicodec, and a multihash.

```
<mbase><version><mcodec><mhash>
```

Where:
- `<mbase>` is a multibase prefix describing the base that encodes this CID. If binary, this is omitted.
- `<version>` is the version number of the CID.
- `<mcodec>` is a multicodec-packed identifier, from the CID multicodec table.
- `<mhash>` is a cryptographic multihash, including: `<mhash-code><mhash-len><mhash-value>`

Note that all CIDs from v1 onward should always begin with `<mbase><version>`, which lets the format evolve cleanly.
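
The layout above can be illustrated with a minimal sketch that splits a binary CIDv1 (no `<mbase>` prefix) into its fields. It assumes every field is an unsigned varint, as in the multiformats conventions; the function names `readVarint` and `parseBinaryCid` are hypothetical and not taken from any library, and the codes `0x71` (dag-cbor) and `0x12` (sha2-256) are assumed from the public multicodec and multihash tables.

```javascript
// Decode an unsigned varint (7 bits per byte, least significant group first)
// starting at `offset`. Returns the value and the offset of the next byte.
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  for (let i = offset; i < bytes.length; i++) {
    const byte = bytes[i];
    value += (byte & 0x7f) * 2 ** shift; // multiply instead of << to avoid 32-bit overflow
    if ((byte & 0x80) === 0) return { value, next: i + 1 };
    shift += 7;
  }
  throw new Error('truncated varint');
}

// Split a binary CID into <version><mcodec><mhash-code><mhash-len><mhash-value>.
function parseBinaryCid(bytes) {
  const version = readVarint(bytes, 0);
  const codec = readVarint(bytes, version.next);
  const hashCode = readVarint(bytes, codec.next);
  const hashLen = readVarint(bytes, hashCode.next);
  const digest = bytes.slice(hashLen.next);
  if (digest.length !== hashLen.value) throw new Error('digest length mismatch');
  return {
    version: version.value,
    codec: codec.value,
    hashCode: hashCode.value,
    digest,
  };
}

// Example input: version 1, codec 0x71, hash 0x12, 32-byte all-zero digest.
const exampleCid = Uint8Array.from([0x01, 0x71, 0x12, 0x20, ...new Array(32).fill(0)]);
const parsed = parseBinaryCid(exampleCid);
```

Because `<version>` comes first, a reader can branch on it before interpreting any later field.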

### Multicodec Packed Representation

It is useful to have a compact version of multicodec, for use in small identifiers. This compact identifier will just be a single varint, looked up in a table. Different applications can use different tables. We should probably have one common table for well-known formats.

We will establish a table for common authenticated data structure formats, for example: IPFS v0 Merkledag, CBOR IPLD, Git, Bitcoin, and more. The table is a simple varint lookup.
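
As a rough illustration of the "single varint, looked up in a table" idea, the sketch below pairs a small lookup table with a varint encoder. The table name `CODEC_TABLE` and the function `encodeVarint` are hypothetical, and the numeric codes are assumptions based on the public multicodec table rather than values defined in this document.

```javascript
// Hypothetical excerpt of a packed multicodec table (code -> format name).
// The codes are assumed from the public multicodec table.
const CODEC_TABLE = new Map([
  [0x55, 'raw'],
  [0x70, 'dag-pb'],        // IPFS v0 MerkleDAG (protobuf)
  [0x71, 'dag-cbor'],      // CBOR IPLD
  [0x78, 'git-raw'],
  [0xb0, 'bitcoin-block'],
]);

// Encode a non-negative integer as an unsigned varint:
// 7 bits per byte, least significant group first, high bit set on all but the last byte.
function encodeVarint(n) {
  const out = [];
  while (n >= 0x80) {
    out.push((n % 0x80) | 0x80);
    n = Math.floor(n / 0x80);
  }
  out.push(n);
  return Uint8Array.from(out);
}

const dagCborPrefix = encodeVarint(0x71); // one byte
const bitcoinPrefix = encodeVarint(0xb0); // two bytes, since 0xb0 >= 0x80
```

Codes below `0x80` fit in one byte, which is why the most common formats would want low table entries.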

### Distinguishing v0 and v1 CIDs (old and new)

It is a HARD CONSTRAINT that all IPFS links continue to work. We must therefore continue to support v0 CIDs, and IPFS APIs must accept both v0 and v1 CIDs. This section defines how to distinguish v0 from v1 CIDs.

Old v0 CIDs are strictly sha2-256 multihashes encoded in base58 -- this is because IPFS tooling only shipped with support for sha2-256. This means the binary versions are 34 bytes long (a 2-byte multihash prefix followed by the 32-byte sha2-256 digest), and the string versions are 46 characters long (base58 encoded). We can therefore recognize a v0 CID by checking that it is a sha2-256 multihash with a 256-bit digest and, when it is a string, that it is base58 encoded. Basically:

- `<mbase>` is implicitly base58.
- `<version>` is implicitly 0.
- `<mcodec>` is implicitly protobuf (for backwards compat with v0).
- `<mhash>` is a cryptographic multihash, explicit.

We can re-write old v0 CIDs into v1 CIDs, by making the elements explicit. This should be done henceforth to avoid creating more v0 CIDs. But note that many references exist in the wild, and thus we must continue supporting v0 links. In the distant future, we may remove this support after sha2 breaks.

Note we can cleanly distinguish the values, which makes it easy to support both. The code for this check is here: https://gist.github.com/jbenet/bf402718a7955bf636fb47d214bcef8a
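
The check described in this section can be sketched as follows. This is not the code from the gist linked above; it is a minimal illustration under two assumptions: that a v0 string is the base58 encoding of a 34-byte sha2-256 multihash (which always yields 46 characters beginning with `Qm`), and that anything else is handed to a v1-or-later parser. The function names are hypothetical.

```javascript
// A v0 CID string is a bare base58 sha2-256 multihash:
// 46 characters, always starting with "Qm".
function isCidV0String(s) {
  return s.length === 46 && s.startsWith('Qm');
}

// A v0 CID in binary is the 34-byte multihash itself:
// 0x12 = sha2-256 code, 0x20 = digest length (32 bytes).
function isCidV0Bytes(bytes) {
  return bytes.length === 34 && bytes[0] === 0x12 && bytes[1] === 0x20;
}

// Returns true for v0 input; false means "try parsing as v1 or later".
// This sketch does not validate the non-v0 case.
function looksLikeCidV0(input) {
  return typeof input === 'string' ? isCidV0String(input) : isCidV0Bytes(input);
}

const v0Bytes = Uint8Array.from([0x12, 0x20, ...new Array(32).fill(0)]);
const v1Bytes = Uint8Array.from([0x01, 0x70, 0x12, 0x20, ...new Array(32).fill(0)]);
```

The two cases cannot collide in binary form: a v1 CID starts with the version byte `0x01`, while a v0 CID starts with `0x12`.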

### IPLD supports non-CID hash links as implicit CIDv1s

Note that raw hash links _stored in various data structures_ (e.g. Protobuf, Git, Bitcoin, Ethereum, etc.) already exist. These links -- when loaded directly as one of these data structures -- can be seen as "linking within a network" whereas proper CIDv1 IPLD links can be seen as linking "across networks" (internet of data! internet of data structures!). Supporting these existing (or even new) raw hash links as a CIDv1 can be done by noting that when one data structure links with just a raw binary link, the rest of the CIDv1 fields are implicit:

- `<mbase>` is implicitly binary or whatever the format encodes.
- `<version>` is implicitly 1.
- `<mcodec>` is implicitly the same as the data structure.
- `<mhash>` can be determined from the raw hash.

Basically, we construct the corresponding CIDv1 out of the raw hash link because all the other information is _in the context_ of the data structure. This is very useful because it allows:
- more compact encoding of a CIDv1 when linking from one data struct to another
- linking from CBOR IPLD to other CBOR IPLD objects exactly as has been spec-ed out so far, so any IPLD adopters continue working.
- (most important) opens the door for native support of other data structures
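
A minimal sketch of that construction: given a raw digest and the context supplied by the enclosing data structure, assemble the equivalent binary CIDv1. The function name `implicitCidV1` is hypothetical, and the example codes (`0x78` for git-raw, `0x11` for sha1) are assumptions taken from the public multicodec and multihash tables. To stay short, the sketch only handles values that fit in a single varint byte.

```javascript
// Build <version><mcodec><mhash-code><mhash-len><mhash-value> from a raw digest
// plus the codec and hash function implied by the enclosing data structure.
function implicitCidV1(rawDigest, codecCode, hashCode) {
  if (codecCode >= 0x80 || hashCode >= 0x80 || rawDigest.length >= 0x80) {
    throw new Error('sketch only handles single-byte varints');
  }
  return Uint8Array.from([
    0x01,             // <version> is implicitly 1
    codecCode,        // <mcodec> is the same as the containing data structure
    hashCode,         // multihash function code, known from context
    rawDigest.length, // multihash digest length
    ...rawDigest,     // the raw hash link as stored
  ]);
}

// Example: a 20-byte digest as it might appear inside a Git object.
const gitDigest = new Uint8Array(20).fill(0xab);
const gitCid = implicitCidV1(gitDigest, 0x78, 0x11);
```

Nothing about the stored link changes; the extra four bytes exist only in the in-memory CID.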

### IPLD addresses raw data

Given the above addressing changes, it is now possible to address raw data directly, as an IPLD node. This node is of course taken to be just a byte buffer, and devoid of links (i.e. a leaf node).

This is useful because any object can be addressed directly by its hash, even when it lives outside IPLD data structures.

### Support for multiple binary packed formats

Contrary to prior Merkle objects (e.g. IPFS protobuf legacy, git, bitcoin, dat and others), new IPLD objects are authenticated AND self-described data blobs: each IPLD object is serialized and prefixed by a multicodec identifying its format.
37 changes: 37 additions & 0 deletions Codecs/DAG-CBOR.md
@@ -0,0 +1,37 @@
# [WIP] DagCBOR Spec

DAG-CBOR supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md)

CBOR already natively supports all ["IPLD Data Model v1: Simple Types."](../IPLD-Data-Model-v1.md#simple-types)

## Format

The CBOR IPLD format is called DagCBOR to disambiguate it from regular CBOR.
Most CBOR objects are valid DagCBOR. The only hard restriction is that any field
with the tag 42 must be a valid CID.

## Link Format

As with all IPLD formats, DagCBOR must be able to encode merkle-links. In
DagCBOR, links are encoded using the raw-binary (identity, NUL) multibase in a
field with a byte-string type (major type 2), with the tag 42.

(The multibase prefix is included for historical reasons.)
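
The byte layout described above can be sketched as follows, assuming standard CBOR framing: tag 42 is written as `0xd8 0x2a`, followed by a major type 2 byte string whose payload is the identity multibase byte `0x00` and then the binary CID. The function name `encodeDagCborLink` is hypothetical, and the sketch only handles payloads of up to 255 bytes.

```javascript
// Encode a binary CID as a DagCBOR link: tag(42) + bytes(0x00 || cid).
function encodeDagCborLink(cidBytes) {
  const payloadLen = cidBytes.length + 1; // +1 for the identity multibase prefix
  if (payloadLen > 0xff) throw new Error('sketch only handles payloads up to 255 bytes');
  const header = payloadLen < 24
    ? [0x40 | payloadLen] // major type 2, length stored in the initial byte
    : [0x58, payloadLen]; // major type 2, one-byte length follows
  return Uint8Array.from([
    0xd8, 0x2a, // CBOR tag 42
    ...header,
    0x00,       // identity (raw binary) multibase prefix
    ...cidBytes,
  ]);
}

// Example: a 36-byte CIDv1 (version, codec, 34-byte sha2-256 multihash).
const cidBytes = Uint8Array.from([0x01, 0x71, 0x12, 0x20, ...new Array(32).fill(0)]);
const link = encodeDagCborLink(cidBytes);
```

A decoder would reverse these steps and reject any tag 42 value whose payload is not a valid CID, matching the restriction in the Format section.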

## Map Key Restriction

In DagCBOR, map keys must be strings (TODO: drop this? We already have
unpathable map keys). Furthermore, map keys should avoid using `/` as this is
unpathable (TODO: drop this? IMO, we should support path escaping out of the
box).

## Canonical DagCBOR

Canonical DagCBOR should:

1. Use no tags other than the CID tag (42). Other tags may be lost in
conversion.
2. Use the canonical CBOR encoding and field ordering. Other orderings
will yield different CIDs.
3. Use only string map keys. Some implementations may not be able to
handle non-string keys.
25 changes: 25 additions & 0 deletions Codecs/DAG-JSON.md
@@ -0,0 +1,25 @@
# [WIP] DAG-JSON v1

DAG-JSON supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md)

## Format

### Simple Types

All simple types except binary are supported natively by JSON.

Contrary to popular belief, JSON as a format supports Big Integers. It's only
JavaScript itself that has trouble with them. This means JS implementations
of `dag-json` can't use the native JSON parser and serializer.
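
A short demonstration of the problem, using only built-in JavaScript features. The integer `2^53 + 1` is valid JSON, but the native parser rounds it to the nearest double. The regular expression used to recover the exact digits is purely illustrative and is not a proposed `dag-json` parsing strategy.

```javascript
const text = '{"n": 9007199254740993}'; // 2^53 + 1, written exactly in the JSON text

// The native parser converts the number to a double and silently loses precision.
const parsedNative = JSON.parse(text);
const lossy = String(parsedNative.n);

// Reading the digits as text and converting with BigInt keeps the exact value.
const digits = /"n":\s*(\d+)/.exec(text)[1];
const exact = BigInt(digits);
```

Here `lossy` no longer matches the digits in the source text, while `exact.toString()` does.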

#### Binary Type

```javascript
{"/": { "base64": String }}
```

### Link Type

```javascript
{"/": String /* base encoded CID */}
```
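
One way a decoder might tell these two shapes apart from ordinary objects is sketched below. It assumes that a link or binary value is an object whose only key is `"/"`; the document does not state that rule, so treat it as an assumption. The function name `classifyDagJsonValue` is hypothetical.

```javascript
// Classify an already-parsed JSON value as 'link', 'binary', or 'plain'.
function classifyDagJsonValue(value) {
  if (value === null || typeof value !== 'object' || Array.isArray(value)) return 'plain';
  const keys = Object.keys(value);
  // Assumption: the reserved forms have exactly one key, "/".
  if (keys.length !== 1 || keys[0] !== '/') return 'plain';
  const inner = value['/'];
  if (typeof inner === 'string') return 'link'; // {"/": "<base encoded CID>"}
  if (inner !== null && typeof inner === 'object' && typeof inner.base64 === 'string') {
    return 'binary'; // {"/": {"base64": "<data>"}}
  }
  return 'plain';
}

const kinds = [
  classifyDagJsonValue({ '/': 'placeholder-cid-string' }),
  classifyDagJsonValue({ '/': { base64: 'aGVsbG8=' } }),
  classifyDagJsonValue({ name: 'not a link' }),
];
```

The sketch does not validate that the string is a well-formed CID or that the `base64` payload decodes; a real codec would need both checks.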
5 changes: 5 additions & 0 deletions Data-Structures/HAMT.md
@@ -0,0 +1,5 @@
# [WIP] Hash-Array Mapped Trie

This specifies a standardized hash-array mapped trie on IPLD Data Model v1.

TODO: write this spec.
17 changes: 17 additions & 0 deletions IPLD-Data-Model-v1.md
@@ -0,0 +1,17 @@
# [WIP] IPLD Data Model

## Simple Types

* Boolean
* Null
* String
* Integer
* Float
* Array
* Object (Hash Map)
* Binary

## Link Type

This type represents a link to another IPLD Block. The link reference
is a [`CID`](./CID.md).
21 changes: 21 additions & 0 deletions IPLD-Path.md
@@ -0,0 +1,21 @@
# [WIP] IPLD Path v1

An IPLD Path is a string identifier used for deep references into IPLD
graphs.

IPLD Paths are constructed following the same constraints as [URI Paths](https://tools.ietf.org/html/rfc3986#section-3.3).

Similarly, the character `?` is reserved for future use as a query separator.

# Path Resolution

Path resolution is broken into two parts: full path resolution and block level resolution.

Block level path resolution is defined by individual codecs.

Full path resolution should use block level resolution through each block.
When a block level resolver returns an `IPLD Link`, full path resolution
should retrieve that block, load its codec, and continue on with additional
block level resolution until the full path is resolved. Finally, path resolution
should return a [**representation**](./IPLD-Path.md#representation)
of the value for the given path.
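
The two-part resolution described above can be sketched with an in-memory block store. Everything here is hypothetical: blocks are stored already decoded (so the "load its codec" step is skipped), links are modelled as `{ link: '<cid>' }` objects, and the names `resolveInBlock` and `resolvePath` are illustrative only. The sketch also follows a link when the path ends on one, which is a design choice the text above does not settle.

```javascript
// In this sketch a link is any object of the form { link: '<cid string>' }.
const isLink = (v) => v !== null && typeof v === 'object' && typeof v.link === 'string';

// Block level resolution: consume path segments inside one block
// until the path is exhausted or a link is reached.
function resolveInBlock(node, segments) {
  let i = 0;
  while (i < segments.length && !isLink(node)) {
    if (node === null || typeof node !== 'object' || !(segments[i] in node)) {
      throw new Error('path segment not found: ' + segments[i]);
    }
    node = node[segments[i]];
    i++;
  }
  return { value: node, remaining: segments.slice(i) };
}

// Full path resolution: run block level resolution, and whenever it
// returns a link, fetch that block and continue with the remaining path.
function resolvePath(store, rootCid, path) {
  let segments = path.split('/').filter((s) => s.length > 0);
  let node = store.get(rootCid);
  for (;;) {
    const step = resolveInBlock(node, segments);
    if (!isLink(step.value)) return step.value;
    if (!store.has(step.value.link)) throw new Error('block not found: ' + step.value.link);
    node = store.get(step.value.link);
    segments = step.remaining;
  }
}

const store = new Map([
  ['cid-root', { a: { b: { link: 'cid-child' } } }],
  ['cid-child', { c: [10, 20] }],
]);
const resolved = resolvePath(store, 'cid-root', 'a/b/c/1');
```

In this example the first two segments are resolved inside the root block, the link at `a/b` is followed, and the last two segments are resolved inside the child block.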