Refactor all IPLD specifications (ipld#72)

Retcon of all IPLD specifications.
iprs-dev · Nov 8, 2018 · b95ed30
1 parent 4463bff
commit b95ed30

Showing 11 changed files with 406 additions and 873 deletions.
diff --git a/CAR.md b/CAR.md
diff --git a/CID.md b/CID.md
@@ -0,0 +1,68 @@
+# CIDv1
+
+# Content IDs
+
+This document will use the words Content IDs or CIDs.
+
+Earlier base58-encoded multihash links to protobuf data will be called CID Version 0.
+
+## CIDs Version 1
+
+Putting together the IPLD Link update statements above, we can term the new handle for IPLD data CID Version 1, with a multibase prefix, a version, a packed multicodec, and a multihash.
+
+```
+<mbase><version><mcodec><mhash>
+```
+
+Where:
+- `<mbase>` is a multibase prefix describing the base that encodes this CID. If binary, this is omitted.
+- `<version>` is the version number of the CID.
+- `<mcodec>` is a multicodec-packed identifier, from the CID multicodec table.
+- `<mhash>` is a cryptographic multihash, including: `<mhash-code><mhash-len><mhash-value>`
+
+Note that all CIDs from v1 onward should always begin with `<mbase><version>`, which lets the format evolve cleanly.
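As an illustration of the `<version><mcodec><mhash>` layout, here is a minimal sketch that splits a binary CIDv1 (no multibase prefix) into its fields. It assumes every field is an unsigned varint that fits in a JavaScript number; the names `readVarint` and `parseBinaryCid` are hypothetical and do not come from any IPLD library.

```javascript
// Decode one unsigned varint (7 bits per byte, high bit = "more bytes follow").
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  for (let i = offset; i < bytes.length; i++) {
    const b = bytes[i];
    value += (b & 0x7f) * 2 ** shift;
    shift += 7;
    if ((b & 0x80) === 0) {
      return { value, next: i + 1 };
    }
  }
  throw new Error("truncated varint");
}

// Split a binary CID into <version>, <mcodec>, <mhash-code> and the digest.
function parseBinaryCid(bytes) {
  const version = readVarint(bytes, 0);
  const codec = readVarint(bytes, version.next);
  const hashCode = readVarint(bytes, codec.next);
  const hashLen = readVarint(bytes, hashCode.next);
  const digest = bytes.slice(hashLen.next);
  if (digest.length !== hashLen.value) {
    throw new Error("digest length does not match <mhash-len>");
  }
  return {
    version: version.value,
    codec: codec.value,
    hashCode: hashCode.value,
    digest,
  };
}

// Example input: version 1, codec 0x71 (dag-cbor), sha2-256 (0x12), 32-byte digest.
// The digest bytes are filler, not a real hash.
const exampleCid = Uint8Array.from([0x01, 0x71, 0x12, 0x20, ...new Array(32).fill(0xab)]);
const parsedCid = parseBinaryCid(exampleCid);
console.log(parsedCid.version, parsedCid.codec, parsedCid.hashCode, parsedCid.digest.length);
```

A string CID would first have its `<mbase>` prefix character read and the remainder decoded with that base before being passed to a parser like this one.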
+
+### Multicodec Packed Representation
+
+It is useful to have a compact version of multicodec, for use in small identifiers. This compact identifier will just be a single varint, looked up in a table. Different applications can use different tables. We should probably have one common table for well-known formats.
+
+We will establish a table for common authenticated data structure formats, for example: IPFS v0 Merkledag, CBOR IPLD, Git, Bitcoin, and more. The table is a simple varint lookup.
+
+### Distinguishing v0 and v1 CIDs (old and new)
+
+It is a HARD CONSTRAINT that all IPFS links continue to work. We must therefore continue to support v0 CIDs, and IPFS APIs must accept both v0 and v1 CIDs. This section defines how to distinguish v0 from v1 CIDs.
+
+Old v0 CIDs are strictly sha2-256 multihashes encoded in base58 -- this is because IPFS tooling only shipped with support for sha2-256. This means the binary versions are 34 bytes long (a 2-byte multihash header followed by a 256-bit sha2-256 digest), and that the string versions are 46 characters long (base58 encoded). We can therefore recognize a v0 CID by checking that it is a sha2-256 multihash with a 256-bit digest, and that it is base58 encoded (when a string). Basically:
+
+- `<mbase>` is implicitly base58.
+- `<version>` is implicitly 0.
+- `<mcodec>` is implicitly protobuf (for backwards compat with v0).
+- `<mhash>` is a cryptographic multihash, explicit.
+
+We can rewrite old v0 CIDs into v1 CIDs by making the elements explicit. This should be done henceforth to avoid creating more v0 CIDs. But note that many references exist in the wild, and thus we must continue supporting v0 links. In the distant future, we may remove this support once sha2 is broken.
+
+Note we can cleanly distinguish the values, which makes it easy to support both. The code for this check is here: https://gist.github.com/jbenet/bf402718a7955bf636fb47d214bcef8a
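The check described in this section can be sketched as follows. This is not the code from the linked gist; it is a shape-only test built from the lengths stated above. The `Qm` prefix test is an added assumption (it is the prefix that base58 encoding of a sha2-256 multihash produces), and the function names are hypothetical.

```javascript
// Base58 (Bitcoin alphabet): digits and letters without 0, O, I and l.
const BASE58_CHARS = /^[1-9A-HJ-NP-Za-km-z]+$/;

// String form: 46 base58 characters, starting with "Qm" (assumption, see lead-in).
function isCidV0String(s) {
  return s.length === 46 && s.startsWith("Qm") && BASE58_CHARS.test(s);
}

// Binary form: 34 bytes, beginning with the sha2-256 multihash header
// 0x12 (hash function code) and 0x20 (digest length, 32 bytes).
function isCidV0Bytes(bytes) {
  return bytes.length === 34 && bytes[0] === 0x12 && bytes[1] === 0x20;
}

// Shape-only example input, not a real hash.
console.log(isCidV0String("Qm" + "a".repeat(44)));
```

Anything that fails these checks would be handled as a v1 (or later) CID, starting with `<mbase><version>`.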
+
+### IPLD supports non-CID hash links as implicit CIDv1s
+
+Note that raw hash links _stored in various data structures_ (e.g. Protobuf, Git, Bitcoin, Ethereum, etc.) already exist. These links -- when loaded directly as one of these data structures -- can be seen as "linking within a network" whereas proper CIDv1 IPLD links can be seen as linking "across networks" (internet of data! internet of data structures!). Supporting these existing (or even new) raw hash links as a CIDv1 can be done by noting that when one data structure links with just a raw binary link, the rest of the CIDv1 fields are implicit:
+
+- `<mbase>` is implicitly binary or whatever the format encodes.
+- `<version>` is implicitly 1.
+- `<mcodec>` is implicitly the same as the data structure.
+- `<mhash>` can be determined from the raw hash.
+
+Basically, we construct the corresponding CIDv1 out of the raw hash link because all the other information is _in the context_ of the data structure. This is very useful because it allows:
+- more compact encoding of a CIDv1 when linking from one data struct to another
+- linking from CBOR IPLD to other CBOR IPLD objects exactly as has been spec-ed out so far, so any IPLD adopters continue working.
+- (most important) opens the door for native support of other data structures
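Making the implicit fields explicit could look like the sketch below. It assumes the caller knows the codec of the containing data structure and the hash function it uses, since the raw link carries only the digest. `encodeVarint` and `cidV1FromRawHash` are hypothetical names; the codes 0x78 (git-raw) and 0x11 (sha1) are taken from the multicodec table.

```javascript
// Encode a non-negative integer as an unsigned varint.
function encodeVarint(n) {
  const out = [];
  while (n >= 0x80) {
    out.push((n % 0x80) | 0x80);
    n = Math.floor(n / 0x80);
  }
  out.push(n);
  return out;
}

// Build binary <version><mcodec><mhash> from a raw digest plus its context.
function cidV1FromRawHash(contextCodec, hashCode, digest) {
  return Uint8Array.from([
    ...encodeVarint(1), // <version> is implicitly 1
    ...encodeVarint(contextCodec), // <mcodec> is that of the linking data structure
    ...encodeVarint(hashCode), // <mhash-code>
    ...encodeVarint(digest.length), // <mhash-len>
    ...digest, // <mhash-value>
  ]);
}

// Example: a 20-byte sha1 link found inside a Git object (filler bytes, not a real hash).
const gitRawLink = new Uint8Array(20).fill(0xcd);
const gitCid = cidV1FromRawHash(0x78, 0x11, gitRawLink);
console.log(gitCid.length);
```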
+
+### IPLD addresses raw data
+
+Given the above addressing changes, it is now possible to address raw data directly, as an IPLD node. This node is of course taken to be just a byte buffer, and devoid of links (i.e. a leaf node).
+
+This is useful because any object can be addressed directly by its hash, even when it lives outside IPLD data structures.
+
+### Support for multiple binary packed formats
+
+Contrary to prior Merkle objects (e.g. IPFS protobuf legacy, git, bitcoin, dat and others), new IPLD objects are authenticated AND self-described data blobs: each IPLD object is serialized and prefixed by a multicodec identifying its format.
diff --git a/Codecs/DAG-CBOR.md b/Codecs/DAG-CBOR.md
@@ -0,0 +1,37 @@
+# [WIP] DagCBOR Spec
+
+DAG-CBOR supports the full [IPLD Data Model v1](../IPLD-Data-Model-v1.md).
+
+CBOR already natively supports all [IPLD Data Model v1: Simple Types](../IPLD-Data-Model-v1.md#simple-types).
+
+## Format
+
+The CBOR IPLD format is called DagCBOR to disambiguate it from regular CBOR.
+Most CBOR objects are valid DagCBOR. The only hard restriction is that any field
+with the tag 42 must be a valid CID.
+
+## Link Format
+
+As with all IPLD formats, DagCBOR must be able to encode merkle-links. In
+DagCBOR, links are encoded using the raw-binary (identity, NUL) multibase in a
+field with a byte-string type (major type 2), with the tag 42.
+
+(The multibase prefix is included for historical reasons.)
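The link encoding described above can be sketched at the byte level as follows. The function name `encodeDagCborLink` is hypothetical, and the sketch only handles payloads shorter than 65536 bytes, which should be ample for a CID.

```javascript
// Wrap binary CID bytes as a DagCBOR link: tag 42 around a byte string whose
// first byte is the identity (NUL) multibase prefix.
function encodeDagCborLink(cidBytes) {
  const payloadLen = cidBytes.length + 1; // +1 for the 0x00 multibase prefix
  let header;
  if (payloadLen < 24) {
    header = [0x40 | payloadLen]; // major type 2, length in the initial byte
  } else if (payloadLen < 256) {
    header = [0x58, payloadLen]; // major type 2, 1-byte length follows
  } else if (payloadLen < 65536) {
    header = [0x59, payloadLen >> 8, payloadLen & 0xff]; // 2-byte length follows
  } else {
    throw new Error("CID too long for this sketch");
  }
  return Uint8Array.from([
    0xd8, 0x2a, // major type 6 (tag) with 1-byte tag number 42
    ...header,
    0x00, // identity multibase prefix
    ...cidBytes,
  ]);
}

// Example: a 36-byte CIDv1 (dag-cbor, sha2-256, filler digest).
const linkTarget = Uint8Array.from([0x01, 0x71, 0x12, 0x20, ...new Array(32).fill(0xab)]);
const encodedLink = encodeDagCborLink(linkTarget);
console.log(Buffer.from(encodedLink.slice(0, 5)).toString("hex"));
```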
+
+## Map Key Restriction
+
+In DagCBOR, map keys must be strings (TODO: drop this? We already have
+unpathable map keys). Furthermore, map keys should avoid using `/` as this is
+unpathable (TODO: drop this? IMO, we should support path escaping out of the
+box).
+
+## Canonical DagCBOR
+
+Canonical DagCBOR should:
+
+1. Use no tags other than the CID tag (42). Other tags may be lost in
+   conversion.
+2. Use the canonical CBOR encoding and field ordering. Other orderings
+   will yield different CIDs.
+3. Use only string map keys. Some implementations may not be able to
+   handle non-string keys.
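To illustrate why field ordering matters, here is a minimal sketch of a key sort. It assumes that "canonical CBOR" refers to the canonical rules of RFC 7049, section 3.9, where shorter encoded keys sort first and keys of equal length sort bytewise; the spec text above does not say which rule is intended, so treat this as one plausible reading. `canonicalKeyOrder` is a hypothetical helper.

```javascript
// Order string map keys by UTF-8 byte length first, then bytewise.
function canonicalKeyOrder(keys) {
  return [...keys].sort((a, b) => {
    const aBytes = Buffer.from(a, "utf8");
    const bBytes = Buffer.from(b, "utf8");
    return aBytes.length - bBytes.length || Buffer.compare(aBytes, bBytes);
  });
}

console.log(canonicalKeyOrder(["b", "aa", "a"]));
```

Note that this differs from plain lexicographic sorting, which would place `aa` before `b`. Two encoders that disagree on the rule produce different bytes, and therefore different CIDs, for the same map.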
diff --git a/Codecs/DAG-JSON.md b/Codecs/DAG-JSON.md
@@ -0,0 +1,25 @@
+# [WIP] DAG-JSON v1
+
+DAG-JSON supports the full [IPLD Data Model v1](../IPLD-Data-Model-v1.md).
+
+## Format
+
+### Simple Types
+
+All simple types except binary are supported natively by JSON.
+
+Contrary to popular belief, JSON as a format supports Big Integers. It's only
+JavaScript itself that has trouble with them. This means JS implementations
+of `dag-json` can't use the native JSON parser and serializer.
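The precision problem is easy to demonstrate. In the sketch below, `parseIntegerLiteral` is a hypothetical helper showing how a custom parser might keep an integer token exact by converting it to a `BigInt`; it is not part of any existing `dag-json` implementation.

```javascript
// 2^53 + 1 cannot be represented as a double, so the native parser rounds it.
const lossy = JSON.parse('{"n": 9007199254740993}').n;
console.log(Number.isSafeInteger(lossy));

// Convert a JSON integer token (as text) into an exact BigInt.
function parseIntegerLiteral(literal) {
  if (!/^-?(0|[1-9][0-9]*)$/.test(literal)) {
    throw new Error("not a JSON integer literal");
  }
  return BigInt(literal);
}

const exact = parseIntegerLiteral("9007199254740993");
console.log(exact.toString());
```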
+
+#### Binary Type
+
+```javascript
+{"/": { "base64": String }}
+```
+
+### Link Type
+
+```javascript
+{"/": String /* base encoded CID */}
+```
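The two shapes above can be produced and recognized with a few small helpers. This is a sketch under one assumption the spec text does not state: a value counts as a link or binary only when `/` is its sole key. The helper names are hypothetical.

```javascript
// Binary type: {"/": {"base64": String}}
function encodeBinary(bytes) {
  return { "/": { base64: Buffer.from(bytes).toString("base64") } };
}

// Link type: {"/": String}, where the string is a base-encoded CID.
function encodeLink(cidString) {
  return { "/": cidString };
}

// Classify a parsed JSON value as "link", "binary" or "plain".
function classifyDagJsonValue(value) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return "plain";
  }
  const keys = Object.keys(value);
  if (keys.length !== 1 || keys[0] !== "/") {
    return "plain"; // assumption: "/" must be the only key
  }
  const inner = value["/"];
  if (typeof inner === "string") {
    return "link";
  }
  if (inner !== null && typeof inner === "object" && typeof inner.base64 === "string") {
    return "binary";
  }
  return "plain";
}

console.log(JSON.stringify(encodeBinary([1, 2, 3])));
```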
diff --git a/Data-Structures/HAMT.md b/Data-Structures/HAMT.md
@@ -0,0 +1,5 @@
+# [WIP] Hash-Array Mapped Trie
+
+This specifies a standardized hash-array mapped trie on IPLD Data Model v1.
+
+TODO: write this spec.
diff --git a/IPLD-Data-Model-v1.md b/IPLD-Data-Model-v1.md
@@ -0,0 +1,17 @@
+# [WIP] IPLD Data Model
+
+## Simple Types
+
+* Boolean
+* Null
+* String
+* Integer
+* Float
+* Array
+* Object (Hash Map)
+* Binary
+
+## Link Type
+
+This type represents a link to another IPLD Block. The link reference
+is a [`CID`](./CID.md).
diff --git a/IPLD-Path.md b/IPLD-Path.md
@@ -0,0 +1,21 @@
+# [WIP] IPLD Path v1
+
+An IPLD Path is a string identifier used for deep references into IPLD
+graphs.
+
+IPLD Paths are constructed following the same constraints as [URI Paths](https://tools.ietf.org/html/rfc3986#section-3.3).
+
+Similarly, the string `?` is reserved for future use as a query separator.
+
+# Path Resolution
+
+Path resolution is broken into two parts: full path resolution and block level resolution.
+
+Block level path resolution is defined by individual codecs.
+
+Full path resolution should use block level resolution through each block.
+When a block level resolver returns an `IPLD Link`, full path resolution
+should retrieve that block, load its codec, and continue on with additional
+block level resolution until the full path is resolved. Finally, path resolution
+should return a [**representation**](./IPLD-Path.md#representation)
+of the value for the given path.
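The two-level resolution described above can be sketched as a loop. Several things here are assumptions made for illustration: blocks are kept already decoded in an in-memory `Map` (so the "load its codec" step is skipped), links use the `{"/": cid}` shape, the keys `cid-root` and `cid-leaf` are placeholders rather than real CIDs, and all function names are hypothetical.

```javascript
// Stand-in block store: placeholder CID string -> decoded block.
const blocks = new Map([
  ["cid-root", { name: "root", child: { "/": "cid-leaf" } }],
  ["cid-leaf", { size: 2, items: ["a", "b"] }],
]);

const isLink = (v) => v !== null && typeof v === "object" && typeof v["/"] === "string";

// Block level resolution: consume segments inside one block, stopping at a link.
function resolveInBlock(block, segments) {
  let value = block;
  let used = 0;
  while (used < segments.length && !isLink(value)) {
    const key = segments[used];
    if (value === null || typeof value !== "object" ||
        !Object.prototype.hasOwnProperty.call(value, key)) {
      throw new Error(`no such path segment: ${key}`);
    }
    value = value[key];
    used++;
  }
  return { value, remaining: segments.slice(used) };
}

// Full path resolution: repeat block level resolution across links.
function resolvePath(rootCid, path) {
  let cid = rootCid;
  let segments = path.split("/").filter((s) => s.length > 0);
  for (;;) {
    const block = blocks.get(cid);
    if (block === undefined) {
      throw new Error(`block not found: ${cid}`);
    }
    const { value, remaining } = resolveInBlock(block, segments);
    if (!isLink(value) || remaining.length === 0) {
      return value; // final value (a path that ends on a link returns the link)
    }
    cid = value["/"]; // follow the link into the next block
    segments = remaining;
  }
}

console.log(resolvePath("cid-root", "child/items/1"));
```

In a real resolver the `blocks.get` call would fetch raw bytes and pick a codec from the CID's `<mcodec>` field before block level resolution runs.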