From 479fb5534a4ef9273f88c0896e6308b6e9e46da4 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Fri, 28 Sep 2018 16:16:06 -0700 Subject: [PATCH 01/18] initial: refactor all IPLD specifications --- CAR.md | 108 -------- CID.md | 166 ++++++++++++ Codecs/DAG-CBOR.md | 28 ++ Codecs/DAG-JSON.md | 23 ++ Data-Structures/HAMT.md | 5 + IPLD-Data-Model-v1.md | 17 ++ IPLD-Path.md | 3 + IPLD.md | 553 --------------------------------------- IPLD_FUTURE_FROM_PAST.md | 181 ------------- README.md | 266 ++++++++++++++++--- ROADMAP.md | 3 - 11 files changed, 473 insertions(+), 880 deletions(-) delete mode 100644 CAR.md create mode 100644 CID.md create mode 100644 Codecs/DAG-CBOR.md create mode 100644 Codecs/DAG-JSON.md create mode 100644 Data-Structures/HAMT.md create mode 100644 IPLD-Data-Model-v1.md create mode 100644 IPLD-Path.md delete mode 100644 IPLD.md delete mode 100644 IPLD_FUTURE_FROM_PAST.md delete mode 100644 ROADMAP.md diff --git a/CAR.md b/CAR.md deleted file mode 100644 index 62bee684..00000000 --- a/CAR.md +++ /dev/null @@ -1,108 +0,0 @@ -# ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) Certified ARchive - -CARs (Certified ARchives) are archives for IPLD DAGs. - -## Summary - -CARs are archives of IPLD DAGs. They are: - -1. Certified -2. Seekable -3. Compact -4. Reproducible -5. Simple and Stable - -The actual format is just a (mostly) recursively-defined topological sort of the -desired DAG with some metadata for fast traversal. - -``` -CID -len(ROOT) -ROOT -[ - CHILD-1-OFFSET - ... -] -[ - len(CHILD-1) - CHILD-1 - [ - CHILD-1/1-OFFSET - ... - ] - ... -] -``` - -Offsets are relative, offsets for missing children use a sentinel value. - -We only bother including the root CID because all the other CIDs are embedded in -the objects themselves. This saves space and *forces* parsers to actually -traverse the DAG (hopefully validating it). - -## Motivation - -Use cases: - -1. 
Reliably export/import a DAG to/from an external hard drive (backup, sneakernet). -2. Traverse a large DAG on an external hard drive without parsing the *entire* DAG. -3. Traverse a large DAG streamed over HTTP without downloading the entire thing. - -The simple method is to copy the entire repo. However, for performance, we need -to be able to upgrade the repo format so this isn't really a stable format. -Additionally, repos need to support insertions, deletions, and random lookups. -Supporting these efficiently necessarily complicates the formats. We'd like -something simple and portable (backup). - -The slightly more complex way is to download every object into a separate file -and then import each file. However, this isn't very convenient and *does not* -scale to large directories well (use case 2). - -One could improve this multi-file approach by splitting up the DAG into multiple -directories and providing a set of tools to manage the files. However, we'd -rather not rely on the filesystem for anything, really. Filesystems: - -* Don't always deal with names well (e.g., FAT16). -* Don't always handle many small files well. -* Aren't usually as space-efficient as possible (to support updates). -* Are complex (easy to corrupt metadata/structure). - -Additionally, it's hard to download a directory structure over HTTP (motivated by -use-case 3). One can just TAR it up but that layers another (complex) file -format into the mix. - -So, we'd like a new single-file format that, if necessary, we can just `dd` to a -drive in place of a filesystem. - -TODO: Expand. - -# Questions - -However, there are a few open questions. - -## Uint64/Varint - -The advantage of using uint64s over varints is that we can leave the "jump -tables" blank and then fill them in on a second pass after we've written -everything. However, if we topologically sort the DAG, we may be able to compute -the jump tables up-front. 
- -The advantages of varints over uint64 are space and flexibility (DAGs larger -than 16 Exbibytes). - -Currently, I'm leaning toward varints as this will make storing lots of small -blocks significantly more efficient. - -## Inline Blocks - -So, we can technically have inline blocks using the identity multihash. How do -we deal with them? - -1. We *don't* want to duplicate the data. -2. We need to support inline blocks with children. - -## Topological sort - -So, a topological sort makes it really easy to traverse the CAR, even when -streaming. However, producing a topologically sorted DAG is a bit trickier. Note: -whatever we choose, it won't have any effect on the asymptotic runtime (memory or time). diff --git a/CID.md b/CID.md new file mode 100644 index 00000000..228ef52f --- /dev/null +++ b/CID.md @@ -0,0 +1,166 @@ +This is the first CID proposal, from: https://github.com/ipfs/specs/issues/130 - reproduced here for historical purposes. + +--- + +# READ THIS PARAGRAPH FIRST + +Hey everyone, the below is a proposal for some changes to IPFS, IPLD, and how we link to data structures. It would address a bunch of open problems that have been identified, and improve the use, tooling, and model of IPLD to allow lots of what people have been requesting for months. Please review and leave comments. We feel pretty strongly about this being a good solution, **but we're not sure if we're just drinking the koolaid and going to make things worse. Sanity check before we move further pls?** Also my apologies, i would spend more time writing up a better version but i just dont have enough time right now and time is of the essence on this. + +--- + +# [EXPERIMENTAL PROPOSAL] CIDv1 -- Important Updates to IPFS, IPLD, Multicodec, and more. + +> IPFS migration path to IPLD (CBOR) from MerkleDAG (ProtoBuf) + +## Multicodec Packed Representation + +It is useful to have a compact version of multicodec, for use in small identifiers. 
This compact identifier will just be a single varint, looked up in a table. Different applications can use different tables. We should probably have one common table for well-known formats. + +We will establish a table for common authenticated data structure formats, for example: IPFS v0 Merkledag, CBOR IPLD, Git, Bitcoin, and more. The table is a simple varint lookup. + +## IPLD Links Updates (new format) + +### Open Problems (Motivation) + +IPLD allows content to be stored in multiple different formats, and thus we need a way to understand what kind of content is being loaded in when traversing a link. A problematic issue is that old ipfs content (protobuf merkledag) does not use multicodec. It makes it difficult to distinguish between the new CBOR IPLD objects and the old Protobuf objects. + +It has been proposed earlier that we wrap protobuf objects with a multicodec. But this is a problem, because the protobuf multicodec would not be authenticated. This is further complicated because many people have been requesting the ability to address raw leaf objects directly (that is, a hash linking to raw content, without ipld nor protobuf wrapping). This is a nice thing to have, but introduces difficulty in distinguishing between a protobuf or a raw encoded object, particularly when neither has a multicodec header which is authenticated by the object's hash. This lack of authentication is an attack vector: adversaries may provide protobuf objects with a raw multicodec, and depending on how implementations handle the multicodec, may poison an implementation's object repo. + +Another important performance constraint is that multicodec headers are quite large: `/ipld/cbor/v0`, for example, is 13 bytes, which is way too large for many applications of small data. Instead, we would like to be able to use a compact multicodec representation ("multicodec packed", a single varint) to distinguish the formats. So that encoded objects are wrapped with minimal overhead. 
Note that this still does not affect protobuf or raw objects because these do not include headers. + +Additional complications include how bitswap sends or identifies blocks, how a DagStore can pull out the object for a multihash and know what format encoding to use for it (eg raw vs protobuf), whether to allow linking from one object type to another, support for multiple base encodings for links, among others. + +In discussions we (@jbenet, @diasdavid, and @whyrusleeping) reviewed many different possibilities. We considered possibilities and how they affected linking data, wrapping the data with multicodec, storing it that way under the many layers of abstraction (dag store, blockstore, datastore, file systems), fetching and retrieving objects, knowing what format to use when, ensuring values are authenticated and not opening up vectors for attackers to poison repos, and more. + +In the end, we came up with a few small changes to how we represent IPLD links that solve all our problems (tm) \o/. These are: +- teach IPLD links to carry data formats (using multicodec) +- teach IPLD links to distinguish base encodings + +It is worth crediting many people here that have tirelessly pushed hard to get a bunch of these ideas out. @davidar @mildred @nicola to name a few, but many others too. But they haven't looked at this yet. This first post is the first they'll hear of this construction, and they may very well hate this particular combination of ideas :) please be direct with feedback, the sooner the better. + +### IPLD Links learn about Base Encoding + +We propose adding a multibase prefix to representations of IPLD links. This is particularly important where the encoding is not binary. + +At this time, we recommend not including it in direct storage, where it should be binary. However, it may be found during the course of review that it is better to always retain the multibase prefix, even when storing in binary. 
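
As a rough illustration of the multibase idea described above (not part of the original proposal), the following sketch prepends a one-character base identifier to the text form of some bytes and strips it again on decode. The function names are hypothetical, and the prefix characters (`f` for lowercase base16, `m` for unpadded base64) are taken from the multibase table as published today, which may differ from what was on the table when this proposal was written.

```python
import base64

# Prefix characters assumed from the current multibase table:
# 'f' = base16 (lowercase hex), 'm' = base64 without padding.
PREFIXES = {"base16": "f", "base64": "m"}


def multibase_encode(base: str, data: bytes) -> str:
    """Return the text form of `data`, prefixed with its base identifier."""
    if base == "base16":
        body = data.hex()
    elif base == "base64":
        body = base64.b64encode(data).decode("ascii").rstrip("=")
    else:
        raise ValueError(f"unsupported base in this sketch: {base}")
    return PREFIXES[base] + body


def multibase_decode(text: str) -> bytes:
    """Read the prefix character, then decode the rest accordingly."""
    prefix, body = text[0], text[1:]
    if prefix == "f":
        return bytes.fromhex(body)
    if prefix == "m":
        padding = "=" * (-len(body) % 4)  # restore the stripped padding
        return base64.b64decode(body + padding)
    raise ValueError(f"unknown multibase prefix: {prefix!r}")
```

Because the prefix travels with the string, a reader can accept either form without being told out of band which base was used; the binary form carries no prefix, matching the recommendation above.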
+ +This change is a much requested option to support multiple encodings for the hashes. Current links use by default base58, which is perfect for URLs as it doesn't contain any non supported char and can be easily copy-pasted, however, for performance reasons, it is not always the best format. Some users already encode IPFS multihashes in other bases, and therefore it would be ideal to have all IPFS and IPLD tooling support these encodings through multibase, avoiding confusing failures. + +### IPLD Links acquire a version + +The fact we propose here changes to the basic link structure remind us of the basic multiformats principle: + +> "Never going to change" considered harmful. + +therefore we deem it wise to ensure that henceforth we include a version so that evolution can be simple, and not complex. The below changes suggest a way to distinguish between old and new links, but we should avoid such situations in the future, as this approach leverages knowledge about multihash distributions in the wild. This will be less feasible in the future. + +### IPLD Links learn about Codecs + +The most important component of these changes introduces a multicodec-packed varint prefix to the link, to signal the encoding of the linked-to object. This enables the link to carry information about the data it points to, and ensure it is interpreted correctly. This ensures that the multicodec prefix is NOT necessary for interpretation of an IPLD object, as the link to the object carries information for its interpretation. + +All proper IPLD formats (cbor and on) should carry the multicodec header at the beginning of their serialized representation, which authenticates the header and ensures clients can interpret the object without even having a link. But, this is not possible with objects of formats created before the IPLD spec, such as the first merkledag protobuf object codec in IPFS (go-ipfs 0.4.x and below). 
This includes also objects from other authenticated data structure distributed systems, such as Git, Bitcoin, Ethereum, and more. Finally, raw data -- which many hope to be able to address directly in IPLD -- cannot carry an authenticated prefix either. + +The approach of adding the multicodec to the link entirely side-steps the problem of not being able to authenticate multicodec headers for protobufs, git, bitcoin, or raw data objects. And this avoids a nasty repo poisoning attack, possible in other proposed suggestions that rely on an unauthenticated multicodec header (carried along with the object) to determine the type of an object. + +This also ensures that IPLD objects can still be content-addressed nicely, without needing to also store codec metadata alongside. + +This change has been long-proposed in other forms. These other forms usually suggested attaching a `@multicodec` key to IPLD link objects (as a property on or next to the link), which was cumbersome and introduced complexity in other ways. Specially, it was not easy to carry over this info to a URL or copy-pasted identifier. + +This multicodec-packed prefix will be sampled from a special table, maintained along with the IPLD spec. This table is expandable over time. A global multicodec table could grow from this one, or start separately. + +### Content IDs + +This document will use the words Content IDs or CIDs. this abstraction is useful here but may not be useful beyond it. Another word -- albeit much less precise -- may be IPLD Link. + +Other options are: +- SID - Self-describing IDentifier +- SSDID - Secure Self Describable Identifier +- IPLD Links -- no fancy name, less abstraction creep. less precise. + +Let the old base58 multihash links to protobuf data be called CID Version 0. 
+ +#### CIDs Version 1 (new) + +Putting together the IPLD Link update statements above, we can term the new handle for IPLD data CID Version 1, with a multibase prefix, a version, a packed multicodec, and a multihash. + +``` + +``` + +Where: +- `` is a multibase prefix describing the base that encodes this CID. If binary, this is omitted. +- `` is the version number of the cid. +- `` is a multicodec-packed identifier, from the CID multicodec table +- `` is a cryptographic multihash, including: `` + +Note that all CIDs v1 and on should always begin with ``, this evolving nicely. + +#### Distinguishing v0 and v1 CIDs (old and new) + +It is a HARD CONSTRAINT that all IPFS links continue to work. This means we need to continue to support v0 CIDs. This means IPFS APIs must accept both v0 and v1 CIDs. This section defines how to distinguish v0 from v1 CIDs. + +Old v0 CIDs are strictly sha2-256 multihashes encoded in base58 -- this is because IPFS tooling only shipped with support for sha2-256. This means the binary versions are 34 bytes long (sha2-256 256 bit multihash), and that the string versions are 46 characters long (base58 encoded). This means we can recognize a v0 CID by ensuring it is a sha256 bit multihash, of length 256 bits, and base58 encoded (when a string). Basically: + +- `` is implicitly base58. +- `` is implicitly 0. +- `` is implicitly protobuf (todo: add code here) +- `` is a cryptographic multihash, explicit. + +We can re-write old v0 CIDs into v1 CIDs, by making the elements explicit. This should be done henceforth to avoid creating more v0 CIDs. But note that many references exist in the wild, and thus we must continue supporting v0 links. In the distant future, we may remove this support after sha2 breaks. + +Note we can cleanly distinguish the values, which makes it easy to support both. 
The code for this check is here: https://gist.github.com/jbenet/bf402718a7955bf636fb47d214bcef8a + +### IPLD supports non-CID hash links as implicit CIDv1s + +Note that raw hash links _stored in various data structures_ (eg Protobuf, Git, Bitcoin, Ethereum, etc) already exist. These links -- when loaded directly as one of these data structures -- can be seen as "linking within a network" whereas proper CIDv1 IPLD links can be seen as linking "across networks" (internet of data! internet of data structures!). Supporting these existing (or even new) raw hash links as a CIDv1 can be done by noting that when one data structure links with just a raw binary link, the rest of the CIDv1 fields are implicit: + +- `` is implicitly binary or whatever the format encodes. +- `` is implicitly 1. +- `` is implicitly the same as the data structure. +- `` can be determined from the raw hash. + +Basically, we construct the corresponding CIDv1 out of the raw hash link because all the other information is _in the context_ of the data structure. This is very useful because it allows: +- more compact encoding of a CIDv1 when linking from one data struct to another +- linking from CBOR IPLD to other CBOR IPLD objects exactly as has been spec-ed out so far, so any IPLD adopters continue working. +- (most important) opens the door for native support of other data structures + +### IPLD native support for Git, Bitcoin, Ethereum, and other authenticated data structures + +Given the above addressing changes, it is now possible to directly address and implement native support for Git, Bitcoin, Ethereum, and other authenticated data structure formats. Such native support would allow resolving through such objects, and treating them as true IPLD objects, instead of needing to wrap them in CBOR or another format. This is the proper merkle-forest. \o/ + +### IPLD addresses raw data + +Given the above addressing changes, it is now possible to address raw data directly, as an IPLD node. 
This node is of course taken to be just a byte buffer, and devoid of links (i.e. a leaf node). + +The utility of this is the ability to directly address any object via hashing external to IPLD datastructures, which is a _much_-requested feature. + + +### Support for multiple binary packed formats + +Contrary to existing Merkle objects (e.g. IPFS protobuf legacy, git, bitcoin, dat and others), new IPLD objects are authenticated AND self described data blobs: each IPLD object is serialized and prefixed by a multicodec identifying its format. + +Some candidate formats: +- /ipld/cbor +- /ipld/ion/1.0.0 +- /ipld/protobuf/3.0.0 +- /ipld/protobuf/2.0.0 + +There is one strong requirement for these formats to work: a format MUST have a 1:1 mapping to the canonical IPLD serialization format. Today (July 29, 2016), that format is CBOR. + +## Changes to Interfaces / Specs + +Need changes to: + +- IPFS specs (addressing in particular) need to support CIDv1 +- IPFS interfaces need to support CIDv1 +- Add a new, small CIDv1 or "IPLD Links" spec +- IPLD spec is compatible. Can improve in wording. CBOR data format does not change. Pathing does not change. + +### Support for CID v0 and v1 + +It is a HARD CONSTRAINT that all IPFS links continue to work. In order to support both CID v0 paths (`/ipfs/`) and the new CID v1 paths (`/ipfs/`), IPFS and other IPLD tooling will detect the version of the CID through a matching function. (See "Distinguishing v0 and v1 CIDs (old and new)" above). + +The following interfaces must support both types: +- The IPFS API, which takes CIDs and Paths + - This includes subprotocols, such as Bitswap +- HTTP-to-IPFS Gateway, for all existing https://ipfs.io/ipfs/... 
links diff --git a/Codecs/DAG-CBOR.md b/Codecs/DAG-CBOR.md new file mode 100644 index 00000000..52b3c54c --- /dev/null +++ b/Codecs/DAG-CBOR.md @@ -0,0 +1,28 @@ +# [WIP] DAG-CBOR.md + +DAG-CBOR supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md) + +## Simple Types + +CBOR already natively supports all "IPLD Data Model v1: Simple Types." + +## Link Type + +IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). + +A tag `` is defined. This tag can be followed by a text string (major type 3) or byte string (major type 2) corresponding to the link target. + +When encoding an IPLD "link object" to CBOR, use this algorithm: + +- The *link value* is extracted. +- If the *link value* is a valid [multiaddress](https://github.com/multiformats/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result in the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2). +- Else, the *link value* is stored as text (major type 3) +- The resulting encoding is the `` followed by the CBOR representation of the *link value* + +When decoding CBOR and converting it to IPLD, each occurrence of `` is transformed by the following algorithm: + +- The following value must be the *link value*, which is extracted. +- If the link is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. +- A map is created with a single key value pair. The key is the standard IPLD *link key* `/`, the value is the textual string containing the *link value*. + +When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers should be able to use an optimized reading process to detect links using these tags. 
\ No newline at end of file diff --git a/Codecs/DAG-JSON.md b/Codecs/DAG-JSON.md new file mode 100644 index 00000000..9c383273 --- /dev/null +++ b/Codecs/DAG-JSON.md @@ -0,0 +1,23 @@ +# [WIP] DAG-JSON v1 + +DAG-JSON supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md) + +## Simple Types + +All simple types except binary are supported natively by JSON. + +Contrary to popular belief, JSON as a format supports Big Integers. It's only +JavaScript itself that has trouble with them. This means JS implementations +of `dag-json` can't use the native JSON parser and serializer. + +### Binary Type + +```javascript +{"/": { "base64": String }} +``` + +## Link Type + +```javascript +{"/": String /* base encoded CID */} +``` \ No newline at end of file diff --git a/Data-Structures/HAMT.md b/Data-Structures/HAMT.md new file mode 100644 index 00000000..4f155f8c --- /dev/null +++ b/Data-Structures/HAMT.md @@ -0,0 +1,5 @@ +# [WIP] Hash-Array Mapped Trie + +This specifies a standardized hash-array mapped trie on IPLD Data Model v1. + +TODO: write this spec. \ No newline at end of file diff --git a/IPLD-Data-Model-v1.md b/IPLD-Data-Model-v1.md new file mode 100644 index 00000000..5b6256e9 --- /dev/null +++ b/IPLD-Data-Model-v1.md @@ -0,0 +1,17 @@ +# [WIP] IPLD Data Model + +## Simple Types + +* Boolean +* Null +* String +* Integer +* Float +* Array +* Object (Hash Map) +* Binary + +## Link Type + +This type represents a link to another IPLD Block. The link reference +is a [`CID`](./CID.md). diff --git a/IPLD-Path.md b/IPLD-Path.md new file mode 100644 index 00000000..6bb57ad3 --- /dev/null +++ b/IPLD-Path.md @@ -0,0 +1,3 @@ +# [WIP] IPLD Path v1 + +TODO: write IPLD Path spec. 
\ No newline at end of file diff --git a/IPLD.md b/IPLD.md deleted file mode 100644 index 68cd8535..00000000 --- a/IPLD.md +++ /dev/null @@ -1,553 +0,0 @@ -# ![](https://img.shields.io/badge/status-draft-green.svg?style=flat-square) IPLD `OUT OF DATE` - -> The "thin-waist" merkle dag format - -There are a variety of systems that use merkle-tree and hash-chain inspired datastructures (e.g. git, bittorrent, ipfs, tahoe-lafs, sfsro). IPLD (Inter Planetary Linked Data) defines: - -- **_merkle-links_**: the core unit of a merkle-graph -- **_merkle-dag_**: any graphs whose edges are _merkle-links_. `dag` stands for "directed acyclic graph" -- **_merkle-paths_**: unix-style paths for traversing _merkle-dags_ with _named merkle-links_ -- **IPLD Formats**: a set of formats in which IPLD objects can be represented, for example JSON, CBOR, CSON, YAML, Protobuf, XML, RDF, etc. -- **IPLD Canonical Format**: a deterministic description on a serialized format that ensures the same _logical_ object is always serialized to _the exact same sequence of bits_. This is critical for merkle-linking, and all cryptographic applications. - -## Intro - -### What is a _merkle-link_? - -A _merkle-link_ is a link between two objects which is content-addressed with the _cryptographic hash_ of the target object, and embedded in the source object. Content addressing with merkle-links allows: - -- **Cryptographic Integrity Checking**: resolving a link's value can be tested by hashing. In turn, this allows wide, secure, trustless exchanges of data (e.g. git or bittorrent), as others cannot give you any data that does not hash to the link's value. -- **Immutable Datastructures**: data structures with merkle links cannot mutate, which is a nice property for distributed systems. This is useful for versioning, for representing distributed mutable state (eg CRDTs), and for long term archival. 
- -A _merkle-link_ is represented in the IPLD object model by a map containing a single key `/` mapped to a "link value". For example: - - -**A link, represented in json as a "link object"** - -```js -{ "/" : "/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k" } -// "/" is the link key -// "/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k" is the link value -``` - -**Object with a link at `foo/baz`** - -```js -{ - "foo": { - "bar": "/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k", // not a link - "baz": {"/": "/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k"} // link - } -} -``` - -**Object with pseudo "link object" at `files/cat.jpg` and actual link at `files/cat.jpg/link`** - -```js -{ - "files": { - "cat.jpg": { // give links properties wrapping them in another object - "link": {"/": "/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k"}, // the link - "mode": 0755, - "owner": "jbenet" - } - } -} -``` - -When dereferencing the link, the map itself is to be replaced by the object it points to unless the link path is invalid. - -The link can either be a multihash, in which case it is assumed that it is a link in the `/ipfs` hierarchy, or directly the absolute path to the object. Currently, only the `/ipfs` hierarchy is allowed. - -If an application wants to use objects with a single `/` key for other purposes, the application itself is responsible to escape the `/` key in the IPLD object so that the application keys do not conflict with IPLD's special `/` key. - -### What is a _merkle-graph_ or a _merkle-dag_? - -Objects with merkle-links form a Graph (merkle-graph), which necessarily is both Directed, and which can be counted on to be Acyclic, iff the properties of the cryptographic hash function hold. I.e. a _merkle-dag_. Hence all graphs which use _merkle-linking_ (_merkle-graph_) are necessarily also Directed Acyclic Graphs (DAGs, hence _merkle-dag_). - -### What is a _merkle-path_? - -A merkle-path is a unix-style path (e.g. 
`/a/b/c/d`) which initially dereferences through a _merkle-link_ and allows access of elements of the referenced node and other nodes transitively. - -General purpose filesystems are encouraged to design an object model on top of IPLD that would be specialized for file manipulation and have specific path algorithms to query this model. - -### How do _merkle-paths_ work? - -A _merkle-path_ is a unix-style path which initially dereferences through a _merkle-link_ and then follows _named merkle-links_ in the intermediate objects. Following a name means looking into the object, finding the _name_ and resolving the associated _merkle-link_. - -For example, suppose we have this _merkle-path_: - -``` -/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c/d -``` - -Where: -- `ipfs` is a protocol namespace (to allow the computer to discern what to do) -- `QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k` is a cryptographic hash. -- `a/b/c/d` is a path _traversal_, as in unix. - -Path traversals, denoted with `/`, happen over two kinds of links: - -- **in-object traversals** traverse data within the same object. -- **cross-object traversals** traverse from one object to another, resolving through a merkle-link. - -#### Examples - -Using the following dataset: - - > ipfs object cat --fmt=yaml QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k - --- - a: - b: - link: - /: QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT - c: "d" - foo: - /: QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE - - > ipfs object cat --fmt=yaml QmV76pUdAAukxEHt9Wp2xwyTpiCmzJCvjnMxyQBreaUeKT - --- - c: "e" - d: - e: "f" - foo: - name: "second foo" - - > ipfs object cat --fmt=yaml QmQmkZPNPoRkPd7wj2xUJe5v5DsY6MX33MFaGhZKB2pRSE - --- - name: "third foo" - -An example of the paths: - -- `/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/c` will only traverse the first object and lead to string `d`. 
-- `/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/link/c` will traverse two objects and lead to the string `e` -- `/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/link/d/e` will traverse two objects and lead to the string `f` -- `/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/link/foo/name` will traverse the first and second object and lead to the string `second foo` -- `/ipfs/QmUmg7BZC1YP1ca66rRtWKxpXp77WgVHrnv263JtDuvs2k/a/b/foo/name` will traverse the first and last object and lead to the string `third foo` - - -## What is the IPLD Data Model? - -The IPLD Data Model defines a simple JSON-based _structure_ for all merkle-dags, and identifies a set of formats to encode the structure into. - -### Constraints and Desires - -Some Constraints: -- IPLD paths MUST be unambiguous. A given path string MUST always deterministically traverse to the same object. (e.g. avoid duplicating link names) -- IPLD paths MUST be universal and avoid oppressing non-English societies (e.g. use UTF-8, not ASCII). -- IPLD paths MUST layer cleanly over UNIX and The Web (use `/`, have deterministic transforms for ASCII systems). -- Given the wide success of JSON, a huge number of systems present JSON interfaces. IPLD MUST be able to import and export to JSON trivially. -- The JSON data model is also very simple and easy to use. IPLD MUST be just as easy to use. -- Defining new datastructures MUST be trivially easy. It should not be cumbersome -- or require much knowledge -- to experiment with new definitions on top of IPLD. -- Since IPLD is based on the JSON data model, it is fully compatible with RDF and Linked Data standards through JSON-LD. -- IPLD Serialized Formats (on disk and on the wire) MUST be fast and space efficient. 
(should not use JSON as the storage format, and instead use CBOR or similar formats) -- IPLD cryptographic hashes MUST be upgradeable (use [multihash](https://github.com/multiformats/multihash)) - -Some nice-to-haves: -- IPLD SHOULD NOT carry over mistakes, e.g. the lack of integers in JSON. -- IPLD SHOULD be upgradable, e.g. if a better on-disk format emerges, systems should be able to migrate to it and minimize costs of doing so. -- IPLD objects SHOULD be able to resolve properties too as paths, not just merkle links. -- IPLD Canonical Format SHOULD be easy to write a parser for. -- IPLD Canonical Format SHOULD enable seeking without parsing full objects. (CBOR and Protobuf allow this). - - -### Format Definition - -(**NOTE:** Here we will use both JSON and YML to show what formats look like. We explicitly use both to show equivalence of the object across two different formats.) - -At its core, IPLD Data Model "is just JSON" in that it (a) is also tree based documents with a few primitive types, (b) maps 1:1 to json, (c) users can use it through JSON itself. It "is not JSON" in that (a) it improves on some mistakes, (b) has an efficient serialized representation, and (c) does not actually specify a single on-wire format, as the world is known to improve. - -#### Basic Node - -Here is an example IPLD object in JSON: - -```json -{ - "name": "Vannevar Bush" -} -``` - -Suppose it hashes to the multihash value `QmAAA...AAA`. Note that it has no links at all, just a string name value. But we are still be able to "resolve" the key `name` under it: - -```sh -> ipld cat --json QmAAA...AAA -{ - "name": "Vannevar Bush" -} - -> ipld cat --json QmAAA...AAA/name -"Vannevar Bush" -``` - -And -- of course -- we are able to view it in other formats - -```sh -> ipld cat --yml QmAAA...AAA ---- -name: Vannevar Bush - -> ipld cat --xml QmAAA...AAA - - - Vannevar Bush - -``` - -#### Linking Between Nodes - -Merkle-Linking between nodes is the reason for IPLD to exist. 
A Link in IPLD is just an embedded node with a special format: - -```js -{ - "title": "As We May Think", - "author": { - "/": "QmAAA...AAA" // links to the node above. - } -} -``` - -Suppose this hashes to the multihash value `QmBBB...BBB`. This node links the _subpath `author` to `QmAAA...AAA`, the node in the section above. So we can now do: - -```sh -> ipld cat --json QmBBB...BBB -{ - "title": "As We May Think", - "author": { - "/": "QmAAA...AAA" // links to the node above. - } -} - -> ipld cat --json QmBBB...BBB/author -{ - "name": "Vannevar Bush" -} - -> ipld cat --yml QmBBB...BBB/author ---- -name: "Vannevar Bush" - -> ipld cat --json QmBBB...BBB/author/name -"Vannevar Bush" -``` - -#### Link Properties Convention - -IPLD allows users to construct complex datastructures, with other properties associated with links. This is useful to encode other information along with a link, such as the kind of relationship, or ancilliary data required in the link. This is _different from_ the "Link Objects Convention", discussed below, which are very useful in their own right. But sometimes, you just want to add a bit of data on the link and not have to make another object. IPLD doesn't get in your way. You can simply do it by nesting the actual IPLD link within another object, with the additional properties. - -> IMPORTANT NOTE: the link properties are not allowed directly in the link object because of travesal ambiguities. Read the spec history for a discussion on the difficulties. - -For example, supposed you have a file system, and want to assign metadata like permissions, or owners in the link between objects. 
Suppose you have a `directory` object with hash `QmCCC...CCC` like this: - -```js -{ - "foo": { // link wrapper with more properties - "link": {"/": "QmCCC...111"} // the link - "mode": "0755", - "owner": "jbenet" - }, - "cat.jpg": { - "link": {"/": "QmCCC...222"}, - "mode": "0644", - "owner": "jbenet" - }, - "doge.jpg": { - "link": {"/": "QmCCC...333"}, - "mode": "0644", - "owner": "jbenet" - } -} -``` - -or in YML - -```yml ---- -foo: - link: - /: QmCCC...111 - mode: 0755 - owner: jbenet -cat.jpg: - link: - /: QmCCC...222 - mode: 0644 - owner: jbenet -doge.jpg: - link: - /: QmCCC...333 - mode: 0644 - owner: jbenet -``` - -Though we have new properties in the links that are _specific to this datastructure_, we can still resolve links just fine: - -```js -> ipld cat --json QmCCC...CCC/cat.jpg -{ - "data": "\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..." -} - -> ipld cat --json QmCCC...CCC/doge.jpg -{ - "subfiles": [ - { - "/": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh" - }, - { - "/": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR" - }, - { - "/": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3" - } - ] -} - -> ipld cat --yml QmCCC...CCC/doge.jpg ---- -subfiles: - - /: QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh - - /: QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR - - /: QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3 - -> ipld cat --json QmCCC...CCC/doge.jpg/subfiles/1/ -{ - "data": "\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..." -} -``` - -But we can't extract the link as nicely as other properties, as links are meant to _resolve through_. - -#### Duplicate property keys - -Note that having two properties with _the same_ name IS NOT ALLOWED, but actually impossible to prevent (someone will do it and feed it to parsers), so to be safe, we define the value of the path traversal to be _the first_ entry in the serialized representation. 
For example, suppose we have the object: - -```json -{ - "name": "J.C.R. Licklider", - "name": "Hans Moravec" -} -``` - -Suppose _this_ was the _exact order_ in the _Canonical Format_ (not json, but cbor), and it hashes to `QmDDD...DDD`. We would _ALWAYS_ get: - -```sh -> ipld cat --json QmDDD...DDD -{ - "name": "J.C.R. Licklider", - "name": "Hans Moravec" -} -> ipld cat --json QmDDD...DDD/name -"J.C.R. Licklider" -``` - - -#### Path Restrictions - -There are some important problems that come about with path descriptions in Unix and the web. For a discussion see [this discussion](https://github.com/ipfs/go-ipfs/issues/1710). In order to be compatible with the models and expectations of unix and the web, IPLD explicitly disallows paths with certain path components. **Note that the data itself _may_ still contain these properties (someone will do it, and there are legitimate uses for it). So it is only _Path Resolvers_ that MUST NOT resolve through those paths.** The restrictions are the same as typical unix and UTF-8 path systems: - - -TODO: -- [ ] list path resolving restrictions -- [ ] show examples - -#### Integers in JSON - -IPLD is _directly compatible_ with JSON, to take advantage of JSON's successes, but it need not be _held back_ by JSON's mistakes. This is where we can afford to follow format idiomatic choices, though care MUST be given to ensure there is always a well-defined 1:1 mapping. - -On the subject of integers, there exist a variety of formats which represent integers as strings in JSON, for example, [EJSON](https://docs.meteor.com/api/ejson.html). These can be used and conversion to and from other formats should happen naturally-- that is, when converting JSON to CBOR, an EJSON integer should be transformed naturally to a proper CBOR integer, instead of representing it as a map with string values. - - -## Serialized Data Formats - -IPLD supports a variety of serialized data formats through [multicodec](https://github.com/multiformats/multicodec). 
These can be used however is idiomatic to the format, for example in `CBOR`, we can use `CBOR` type tags to represent the merkle-link, and avoid writing out the full string key `@link`. Users are encouraged to use the formats to their fullest, and to store and transmit IPLD data in whatever format makes the most sense. The only requirement **is that there MUST be a well-defined one-to-one mapping with the IPLD Canonical format.** This is so that data can be transformed from one format to another, and back, without changing its meaning nor its cryptographic hashes. - -### Serialized CBOR with tags - -IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). - -A tag `` is defined. This tag can be followed by a text string (major type 3) or byte string (major type 2) corresponding to the link target. - -When encoding an IPLD "link object" to CBOR, use this algorithm: - -- The *link value* is extracted. -- If the *link value* is a valid [multiaddress](https://github.com/multiformats/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2). -- Else, the *link value* is stored as text (major type 3) -- The resulting encoding is the `` followed by the CBOR representation of the *link value* - -When decoding CBOR and converting it to IPLD, each occurences of `` is transformed by the following algorithm: - -- The following value must be the *link value*, which is extracted. -- If the link is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. -- A map is created with a single key value pair. The key is the standard IPLD *link key* `/`, the value is the textual string containing the *link value*. 
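
The encoding and decoding steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the reference implementation: the tag number is elided in the text above, so `LINK_TAG` is only a placeholder; `Tagged` stands in for a CBOR tagged item rather than producing real CBOR bytes; and the `to_binary` / `to_text` callables are hypothetical stand-ins for a multiaddress codec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# Placeholder only: the real tag number is elided in the text above,
# so this value is NOT the actual IPLD link tag.
LINK_TAG = -1


@dataclass(frozen=True)
class Tagged:
    """Stand-in for a CBOR tagged item: a tag number plus the tagged value."""
    tag: int
    value: Union[str, bytes]  # bytes ~ major type 2, str ~ major type 3


def encode_link(
    link_obj: Dict[str, str],
    to_binary: Callable[[str], bytes],  # hypothetical: text -> binary, raises ValueError if invalid
    to_text: Callable[[bytes], str],    # hypothetical: binary -> text
) -> Tagged:
    """Encode an IPLD link object {"/": <link value>} as a tagged item."""
    link_value = link_obj["/"]  # extract the link value
    try:
        binary = to_binary(link_value)
        # Use the binary form only if the round trip yields the exact same text.
        if to_text(binary) == link_value:
            return Tagged(LINK_TAG, binary)
    except ValueError:
        pass  # not a valid address: fall through to the text form
    return Tagged(LINK_TAG, link_value)


def decode_link(item: Tagged, to_text: Callable[[bytes], str]) -> Dict[str, str]:
    """Turn a tagged item back into an IPLD link object."""
    if item.tag != LINK_TAG:
        raise ValueError("not a link tag")
    value = item.value
    if isinstance(value, bytes):
        value = to_text(value)  # binary form is converted back to text
    return {"/": value}


# Usage with hex as a toy codec in place of multiaddress (an assumption for the demo).
round_trippable = encode_link({"/": "e464"}, bytes.fromhex, bytes.hex)
not_round_trippable = encode_link({"/": "E464"}, bytes.fromhex, bytes.hex)
```

With the toy hex codec, the lowercase value survives the text-to-binary-to-text round trip and is stored as bytes, while the uppercase value does not and is kept as text. Either way, decoding returns the original link text.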
- -When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. Readers should be able to use an optimized reading process to detect links using these tags. - -### Canonical Format - -In order to preserve merkle-linking's power, we must ensure that there is a single **_canonical_** serialized representation of an IPLD document. This ensures that applications arrive at the same cryptographic hashes. It should be noted --though-- that this is a system-wide parameter. Future systems might change it to evolve representations. However we estimate this would need to be done no more than once per decade. - -**The IPLD Canonical format is _canonicalized CBOR with tags_.** - -The canonical CBOR format must follow rules defines in [RFC 7049 section 3.9](http://tools.ietf.org/html/rfc7049#section-3.9) in addition to the rules defined here. - -Users of this format should not expect any specific ordering of the keys, as the keys might be ordered differently in non canonical formats. - -The legacy canonical format is protocol buffers. - -This canonical format is used to decide which format to use when creating the object for the first time and computing its hash. Once the format is decided for an IPLD object, it must be used in all communications so senders and receivers can check the data against the hash. - -For example, when sending a legacy object encoded in protocol buffers over the wire, the sender must not send the CBOR version as the receiver will not be able to check the file validity. - -In the same way, when the receiver is storing the object, it must make sure that the canonical format for this object is store along with the object so it will be able to share the object with other peers. - -A simple way to store such objects with their format is to store them with their multicodec header. 
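
One of the RFC 7049 section 3.9 rules referenced above concerns map key order: shorter keys sort first, and keys of equal length sort in byte-wise lexical order. A minimal Python sketch of that single rule follows. It assumes every key is a text string (so comparing UTF-8 bytes is equivalent to comparing the encoded keys), and `canonical_key_order` is a hypothetical helper name, not part of any IPLD library.

```python
from typing import Iterable, List


def canonical_key_order(keys: Iterable[str]) -> List[str]:
    """Order text-string map keys as RFC 7049 section 3.9 describes:
    shorter keys first, then byte-wise lexical order among equal lengths."""
    def sort_key(k: str):
        raw = k.encode("utf-8")
        return (len(raw), raw)
    return sorted(keys, key=sort_key)


# Keys taken from the directory example earlier in this document.
ordered = canonical_key_order(["owner", "mode", "link", "cat.jpg", "foo"])
# ordered == ["foo", "link", "mode", "owner", "cat.jpg"]
```

Two writers that order keys this way before hashing produce the same bytes for the same map. That is the property the canonical format relies on.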
- - -## Datastructure Examples - -It is important that IPLD be a simple, nimble, and flexible format that does not get in the way of users defining new or importing old datastractures. For this purpose, below I will show a few example data structures. - - -### Unix Filesystem - - -#### A small File - -```js -{ - "data": "hello world", - "size": "11" -} -``` - -#### A Chunked File - -Split into multiple independent sub-Files. - -```js -{ - "size": "1424119", - "subfiles": [ - { - "link": {"/": "QmAAA..."}, - "size": "100324" - }, - { - "link": {"/": "QmAA1..."}, - "size": "120345", - "repeat": "10" - }, - { - "link": {"/": "QmAA1..."}, - "size": "120345" - }, - ] -} -``` - -#### A Directory - -```js -{ - "foo": { - "link": {"/": "QmCCC...111"}, - "mode": "0755", - "owner": "jbenet" - }, - "cat.jpg": { - "link": {"/": "QmCCC...222"}, - "mode": "0644", - "owner": "jbenet" - }, - "doge.jpg": { - "link": {"/": "QmCCC...333"}, - "mode": "0644", - "owner": "jbenet" - } -} -``` - -### git - -#### git blob - -```js -{ - "data": "hello world" -} -``` - -#### git tree - -```js -{ - "foo": { - "link": {"/": "QmCCC...111"}, - "mode": "0755" - }, - "cat.jpg": { - "link": {"/": "QmCCC...222"}, - "mode": "0644" - }, - "doge.jpg": { - "link": {"/": "QmCCC...333"}, - "mode": "0644" - } -} -``` - -#### git commit - -```js -{ - "tree": {"/": "e4647147e940e2fab134e7f3d8a40c2022cb36f3"}, - "parents": [ - {"/": "b7d3ead1d80086940409206f5bd1a7a858ab6c95"}, - {"/": "ba8fbf7bc07818fa2892bd1a302081214b452afb"} - ], - "author": { - "name": "Juan Batiz-Benet", - "email": "juan@benet.ai", - "time": "1435398707 -0700" - }, - "committer": { - "name": "Juan Batiz-Benet", - "email": "juan@benet.ai", - "time": "1435398707 -0700" - }, - "message": "Merge pull request #7 from ipfs/iprs\n\n(WIP) records + merkledag specs" -} -``` - -### Bitcoin - -#### Bitcoin Block - -```js -{ - "parent": {"/": "Qm000000002CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8"}, - "transactions": {"/": 
"QmTgzctfxxE8ZwBNGn744rL5R826EtZWzKvv2TF2dAcd9n"}, - "nonce": "UJPTFZnR2CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8" -} -``` - -#### Bitcoin Transaction - -This time, in YML. TODO: make this a real txn - -```yml ---- -inputs: - - input: {/: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} - amount: 100 -outputs: - - output: {/: Qmes5e1x9YEku2Y4kDgT6pjf91TPGsE2nJAaAKgwnUqR82} - amount: 50 - - output: {/: QmbcfRVZqMNVRcarRN3JjEJCHhQBcUeqzZfa3zoWMaSrTW} - amount: 30 - - output: {/: QmV9PkR2gXcmUgNH7s7zMg9dsk7Hy7bLS18S9SHK96m7zV} - amount: 15 - - output: {/: QmP8r8fLUnEywGnRRUrHB28nnBKwmshMLiYeg8udzYg7TK} - amount: 5 -script: OP_VERIFY -``` diff --git a/IPLD_FUTURE_FROM_PAST.md b/IPLD_FUTURE_FROM_PAST.md deleted file mode 100644 index 37af6284..00000000 --- a/IPLD_FUTURE_FROM_PAST.md +++ /dev/null @@ -1,181 +0,0 @@ -**This document contains a draft from a proposed spec for an older version of IPLD. It remains here until we have published the new spec** - -# IPLD Spec v1 - -Editor: Nicola Greco, MIT - -> This specification defines a data model and a naming scheme for linking data with cryptographic hashes. -> -> InterPlanetary Linked Data (IPLD) is an information space of inter-linked data, where content addresses and links are expressed using the content's cryptographic hash. IPLD is designed to universally address and link to any data so that the data does not require a central naming authority, the integrity of the data can be verified, and untrusted parties can help distribute the data. This specification describes a data model for structured data and a naming scheme to point to data and subsets of the data. These design goals make it different from earlier data models such as JSON or RDF, and naming schemes such as NI [[RFC6920]](https://tools.ietf.org/html/rfc6920), or Magnet URI. 
- - -## Table of content - -- [Introduction](#introduction) -- [Basic Concepts](#basic-concepts) -- [IPLD](#ipld) - - [Data Model](#data-model) - - [Naming Scheme](#naming-scheme) -- [Serialization](#serialization) -- [Security Considerations](#security-considerations) -- [Examples](#examples) -- [Acknowledgements](#acknowledgements) -- [References](#references) - ---- - -## Introduction -Naming things with hashes solves three fundamental problems for the decentralized web: - -1. **Data integrity**: URLs give no guarantees that the data we receive hasn't been compromised. The IPLD naming system ensures that no one can lie about the data they are serving. -2. **Distributed naming**: Only the owner of a domain name can serve you the data behind a URL; in IPLD, any computer - trusted and untrusted - that has the data can participate in distributing it. -3. **Immutable Content**: The content behind URLs can change or disappear, breaking links or pointing to unexpected content. IPLD links cannot mutate. - -Using cryptographic hashes as pointers for data objects is not a new concept. Successful applications (e.g. Bitcoin, Git, Certificate Transparency) and existing specs ([[RFC6920]](https://tools.ietf.org/html/rfc6920)) used this strategy to authenticate their datasets, generate global identifiers and to provide end-to-end integrity to their systems. However existing applications have implemented a different data model and pointer format which does not interoperate, making it difficult to re-use the same data across applications. Furthermore, vertical implementations are application specific (e.g. forcing a particular data model) and can hardly be used elsewhere. - -IPLD is a forest of hash-linked directed acyclic graphs, also referred to as Merkle DAGs (or generically, tree-based authenticated data structures). -IPLD aims at being the way to address any authenticated data structure under the IPLD namespace `/ipld/`, followed by the hash of the data. 
Conceptually, any Bittorrent, Git, or Blockchain data also resides in this namespace, thus solving the described interoperability problem. - -This specification defines: -- **IPLD Data Model**: a data model to describe unstructured and structured data and to represent Merkle DAGs. -- **IPLD Naming Scheme**: a UNIX-like naming scheme that is self-authenticating. It can be used to point to data or subsets of it. - -The IPLD Data Model and Naming Scheme defined bellow follow specific design goals that are not currently met by other existing standards. The underlining data model is an extension of the JSON [[RFC4627]](https://www.ietf.org/rfc/rfc4627.txt) and the CBOR data model [[RFC7049]](https://tools.ietf.org/html/rfc7049). The Naming Scheme is built upon JSON Pointer [[RFC6901]](https://tools.ietf.org/html/rfc6901). It is important to note that this is not a proposal of a data format, but an abstract data model that can be serialized in multiple formats. - -Related specs: CID - - - -## Basic Concepts - -In this section we cover some basic concepts on which IPLD builds upon. - -**Cryptographic integrity**. Cryptographic hash functions are one way functions that can be used to map any binary data to a specific string, called a digest or a hash. A cryptographic hash function gives strong probabilistic guarantees that different content don't *collide* on the same hash, meaning that no two different content can have the same hash. By naming content with hashes, we can guarantee that the data has not been altered during storage or transmission, since when obtaining a file, the receiver can themselves regenerate the hash of the content received. - -**Range verifiability**. A cryptographic hash provides integrity guarantees not only to the content it directly dereferences to, as well as the entire graph of the content that it is linked from it. - -**Merkle DAGs**. We refer to directed acyclic graphs linked via cryptographic hashes as Merkle DAGs. 
Systems such as Git, IPFS, Bittorrent, Bitcoin use different type of hash-based direct acyclic graphs. - -## Objectives - -Objectives of the IPLD Data Model: - -1. Data must be able to be decoded without a schema description. -2. The Data model must support all the JSON data types for conversion from and to JSON. -3. The representation must be able to unambiguously encode most common data formats, as well as existing data structures used in Internet and Web standards. - -Objectives of the IPLD Naming Scheme: - -1. Names must be self-descriptive on how they are encoded, what type of content they contain and the hash functions used -2. The Naming Scheme must be extensible, new hash functions and new encoding must be able to be introduced without loosing backward compatibility. -3. The Naming Scheme must be respect conventions used in the Unix file system and on the World Wide Web. - -## Terminology - -| Name | Description | -| :---- | :---- | -| Resource | Any piece of data, structured or unstructured that can be addressed via cryptographic hash. | -| IPLD Objects | semi-structured data (similar to JSON) that consists of attribute-value pairs objects that conform to the IPLD Objects Data Model. | -| IPLD Link Object | The value of an attribute in an IPLD Object can be a Link Object, a special object that describes a link to another resource. | -| CID | The cryptographic hash of a resource prefixed by bits that describe the type of data, the cryptographic hash function used and the encoding of the hash. | -| IPLD Address | A name combined of the CID and an optional path scheme that points to a resource or an attribute in an IPLD Object. | -| IPLD Formats | The process of serialization/deserialization of an IPLD Object into/from a data format (e.g. CBOR, JSON) | -| IPLD Types | The process of serialization/deserialization of an IPLD object into/from a special data structure (e.g. 
Ethereum block) | - -## IPLD Data Model - -### IPLD Objects - -IPLD objects consists of attribute–value pairs (similar to JSON). - -An object has a set of attribute each of which has a corresponding value. -A value can be of four types: -- a terminal: which can be a string, an integer, a real number, a boolean -- an IPLD Object (recursive definition) -- an IPLD Link Object -- an ordered array of the previous - -### Link Object -``` -TODO: describe the link object -- the `/` keyword and accepted values -- pointers can be of these forms: - - relative (?) - - pointers: (for further understanding of pointers, see below) - - only hash - - hash + path -``` - - -## IPLD Naming Scheme - -``` -TODO: define the different components of an IRI -- A Pointer is "Protocol(optional?) + CID + Path" -- CID (multicodec, multihash, versioning, etc) -- Path (optional) - - must respect the shape of the object or will result in a error -``` - -``` -TODO: format -- restricted char -``` - -## Representations -``` -TODO: specifying the canonical format in the CID -``` - -``` -TODO: serializing and de-serializing -``` - -``` -TODO: different formats -- json -- yaml -- cbor - -TODO: define the possibility of converting -``` - -## Error Handling -``` -TODO: describe possible errors: -- CID has bad syntax -- hash function not known -- pointer referencing to non existent value -``` - -## Security considerations - -``` -TODO: -- no secret information required to generate or verify a name, names are secure and self-evident - - corollary: causal links -- disclosure of names does not disclose the referenced object but may enable the attacker to determine the contents referenced -- note about hash collision and probabilistic guarantees -- hash functions can break -``` - -## Examples - -### Hello World -### File system example -### Social network example - -## Acknowledgements - -``` -TODO: list all contributors -``` - -## References diff --git a/README.md b/README.md index 307171ba..b2a9888f 100644 
--- a/README.md +++ b/README.md @@ -1,45 +1,97 @@ IPLD Specifications =================== -[![](https://img.shields.io/badge/made%20by-Protocol%20Labs-blue.svg?style=flat-square)](http://ipn.io) -[![](https://img.shields.io/badge/project-IPLD-blue.svg?style=flat-square)](http://github.com/ipld/ipld) -[![](https://img.shields.io/badge/freenode-%23ipfs-blue.svg?style=flat-square)](http://webchat.freenode.net/?channels=%23ipfs) - -> This repository contains the specs for InterPlanetary Linked Data (IPLD). - -**Specs are not finished yet. We use the following tag system to identify their state:** - -- ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) - this spec is a work-in-progress, it is likely not even complete. -- ![](https://img.shields.io/badge/status-draft-yellow.svg?style=flat-square) - this spec is a rough draft and will likely change substantially. -- ![](https://img.shields.io/badge/status-reliable-green.svg?style=flat-square) - this spec is believed to be close to final. Minor changes only. -- ![](https://img.shields.io/badge/status-stable-brightgreen.svg?style=flat-square) - this spec is likely to improve, but not change fundamentally. -- ![](https://img.shields.io/badge/status-permanent-blue.svg?style=flat-square) - this spec will not change. - -Nothing in this spec repository is `permanent` yet. As in many IPLD repositories, most of the work is happening in [the issues](https://github.com/ipld/specs/issues/) or in [active pull requests](https://github.com/ipld/specs/pulls/). Go take a look! 
- -## Documents - -- [**Roadmap**](/ROADMAP.md) -- **Specifications:** - - ![](https://img.shields.io/badge/status-draft-yellow.svg?style=flat-square) [`IPLD`](/IPLD.md) - spec about the data model, pointers and link formats - - ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) `IPLD Selectors` - spec about simple language to select multiple unknown nodes in a graph - - ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) `IPLD Transformations` - spec about the language to trasform an IPLD graph into another - - ![](https://img.shields.io/badge/status-reliable-green.svg?style=flat-square) [`CID (Content IDentifier)`](https://github.com/ipld/cid) - - ![](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square) [`IPLD Formats`](https://github.com/ipld/interface-ipld-format) - interface definition for adding support to different formats - - ![](https://img.shields.io/badge/status-draft-yellow.svg?style=flat-square) [`CAR`](/CAR.md) - Content Addressable Archives +IPLD is not a simple specification, it is a set of specifications built on each other. + +``` + The IPLD Stack + +-----------------------------+ + +-------------+ | | + | | | End-User Application Stacks | + | MFS in IPFS | | | + | | +-----------------------------+ + +-------------+ | | + | | | Structured Data w/ indexes | + | unixfs v2 | | VR, Geo, SQL, etc. 
| +----------+ + | | | | | | + +-------------+ +-----------------------------+ | MFS in | + | | | | | IPFS | + | HAMT | | Sorted Index (sharded) | | | + | | | | +----------+ + +-------------+-+-----------------------------+ | | + | | | unixfs | + | Complex Data Structures | | v1 | + | | | | ++------------------------------------------------------------------------------+ +| | | | | +| | dag-json dag-cbor | | | +| | | | | +| Codecs +---------------------------------------------+ git | dag-pb | +| | | | | +| | IPLD Data Model | | | +| | | | | ++-------------------------------------------------------------+-----+----------+ + | | + | CID Path | + | | + +--------------------------------------------------------------+ +``` + +The goal of this technology stack is to enable decentralized data-structures +which in turn will enable more decentralized applications. + +Many of the specifications in this stack are inter-dependent. + +``` + IPLD Dependency Graph + ++---+ +-----+ +---+ +|CID+-----------+-------------->Block+-------->Raw| ++---+ | +--+--+ +---+ + +------v-------------+ | ++----+ |Links (Conceptually)| | +|Path| +------+-------------+ | +-----------+ ++-+--+ | +------------->Replication| + | Codecs | | +-----------+ ++-v-------------v-----------------+---+ +| | +| +---+ +-----------------------+ | Complex Data-Structures +| |Git| | Data Model v1 | | +--------------v-------+ +| +---+ | | | | | +| | +--------+ +--------+ +----> +----+ +-----------+ | +| +------+ | |dag|json< |dag|cbor< | | | |HAMT| |Sorted Tree| | +| |dag|pb| | +--------+ +--------+ | | | +--+-+ +----+------+ | +| +------+ | | | | | | | +| +-----------------------+ | +----------------------+ +| | | | ++-------------------------------------+ | | + | | + +----------------------+ | | + | File System (unixfs) <-----+ | + +----------------------+ | + +--------------------+ | + | | | +Structured Data | VR, Geo, SQL, etc. 
<----------------+ + w/ indexes | | + +--------------------+ +``` + +## Specification Repo Layout + +* /IPLD-Data-Model-v1.md +* /IPLD-Path.md +* /CID.md +* /Codecs + * /Codecs/DAG-JSON.md + * /Codecs/DAG-CBOR.md +* /Data-Structures + * /Data-Structures/HAMT.md ## Discussion -Join the discussion for: +Discussion of specific specifications happens in [this repository's issues](https://github.com/ipld/specs/issues). -- Specs - https://github.com/ipld/specs/issues -- General IPLD - https://github.com/ipld/ipld/issues -- JavaScript Implementation - https://github.com/ipld/js-ipld/issues -- Golang Implementation - https://github.com/ipfs/go-ipld-format - -## Weekly Hangout - -TBA soon™ +Discussion of IPLD more generally happens in the [IPLD repository](https://github.com/ipld/ipld/issues). ## Contribute @@ -54,3 +106,147 @@ Small note: If editing the README, please conform to the [standard-readme](https ## License This repository is only for documents. All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc. + +# Terminology + +## Description of IPLD + +IPLD is a set of standards and implementations for creating decentralized data-structures that are +universally addressable and linkable. These structures allow us to do for data what URLs and +links did for HTML web pages. + +## Generic Terms + +### Content Addressability + +"Content addressability" refers to the ability to refer to content by a trustless identifier. + +Rather than referring to content by a string identifier or URL, content addressable systems refer to content +by a cryptographic hash. This allows complete decentralization of the content as the identifier +does not specific the retreival method and provides a secure way to verify the content. + +## IPLD Terms + +### Multihash + +Multihash is hash format that is not specific to a single hashing algorithm. 
+ +Multihashes describe the algorithm used for the hash as well as the hash value. + +``` ++---------+------------------------------+ +| Hash Type | Hash Value | ++---------+------------------------------+ +``` + +SHA-256 example. + +``` ++---------+------------------------------------------------------------------+ +| SHA-256 | 2413fb3709b05939f04cf2e92f7d0897fc2596f9ad0b8a9ea855c7bfebaae892 | ++---------+------------------------------------------------------------------+ +``` + +Note: these examples are simplifications of the concepts. For a complete description visit the [project and its specs](https://github.com/multiformats/multihash). + +### CID (Content Identifier) + +Hash based content identifier. Includes the `codec` and + +``` ++-------+------------------------------+ +| Codec | Multihash | ++-------+------------------------------+ +``` + +The long version +``` ++------------------------------+ +|Codec | ++------------------------------+ +|Multihash | +| +----------+---------------+ | +| |Hash Type | Hash Value | | +| +----------+---------------+ | +| | ++------------------------------+ +``` + +Note: these examples are simplifications of the concepts. For a complete description visit the [spec](/CID.md). + +### Block + +A CID and the binary data value for that CID. + +The short version. +``` ++-----+--------------------------------+ +| CID | Data | ++-----+--------------------------------+ +``` + +The long version. +``` ++-----------------------------------+------------------+ +| CID | Binary Data | +| +------------------------------+ | | +| |Codec | | | +| +------------------------------+ | | +| |Multihash | | | +| | +----------+---------------+ | | | +| | |Hash Type | Hash Value | | | | +| | +----------+---------------+ | | | +| | | | | +| +------------------------------+ | | +| | | ++-----------------------------------+------------------+ +``` + +### IPLD Path + +TODO: + +### IPLD Data Model + +The IPLD Data Model describes a set of base types. 
Codecs that support these base types +can be used by any of the Data-Structures built on top of the IPLD Data Model. + +Codecs that support the IPLD Data Model: + +* [DAG-CBOR](/Codecs/DAG-CBOR.md) +* WIP: [DAG-JSON](/Codecs/DAG-JSON.md) + +### Codec + +A codec exposes serialization and deserialization for IPLD blocks. If it also supports +content addressable links then the codec also exposes those links as `CID`'s. A codec +also supports atomic IPLD Path lookups on the block. + +#### Serializer, Deserializer and Format + +A logical separation exists in any given IPLD codec between the **format** and the **serializer/deserializer**. + +``` ++--------------------+ +--------------------+ +| | | | +| Serializer | | Deserializer | +| | | | ++---------+----------+ +---------+----------+ + | ^ + | Sent to another peer | + v | ++---------+----------+ +----------+---------+ +| | | | +| Format +-------------> Format | +| | | | ++--------------------+ +--------------------+ +``` + +A **format** may represent object types and tree structures any +way it wishes. This includes existing representations (JSON, BSON, CBOR, +Protobuf, msgpack, etc) or even new custom serializations. We will refer to +this as the **representation**. + +Therefor, a **format** is the standardized representation of IPLD Links and Paths in a given **representation**. It describes how to translate between structured data and binary. + +It is worth noting that **serializers** and **deserializers** differ by programming language while the **format** does not and MUST remain consistent across all codec implementations. 
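
The Multihash, CID and Block diagrams above, together with the serializer/deserializer split, can be mirrored in a small Python sketch. Like the diagrams, it is a simplification: it models the fields as plain objects and does not produce the real binary CID or multihash encoding. The `toy-json` codec name and the `make_block` / `verify` helpers are assumptions made for illustration only.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Multihash:
    hash_type: str     # label for the algorithm, e.g. "sha2-256"
    hash_value: bytes  # the digest itself


@dataclass(frozen=True)
class CID:
    codec: str         # names the format needed to decode the data
    multihash: Multihash


@dataclass(frozen=True)
class Block:
    cid: CID
    data: bytes        # the binary data the CID refers to


def serialize(obj) -> bytes:
    """Toy serializer: sorted keys and compact separators, so equal
    objects always serialize to the same bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def deserialize(data: bytes):
    """Toy deserializer matching serialize()."""
    return json.loads(data.decode("utf-8"))


def make_block(obj, codec: str = "toy-json") -> Block:
    """Serialize an object and address the resulting bytes by their SHA-256 hash."""
    data = serialize(obj)
    digest = hashlib.sha256(data).digest()
    return Block(CID(codec, Multihash("sha2-256", digest)), data)


def verify(block: Block) -> bool:
    """Re-hash the data and compare it with the hash carried in the CID."""
    if block.cid.multihash.hash_type != "sha2-256":
        raise ValueError("unsupported hash type in this sketch")
    return hashlib.sha256(block.data).digest() == block.cid.multihash.hash_value


block = make_block({"name": "Vannevar Bush"})
```

A receiver that is handed `block` by an untrusted peer can call `verify(block)` before deserializing. Because the identifier is derived from the bytes, any change to `data` makes the check fail.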
diff --git a/ROADMAP.md b/ROADMAP.md deleted file mode 100644 index 1807a759..00000000 --- a/ROADMAP.md +++ /dev/null @@ -1,3 +0,0 @@ -# IPLD ROADMAP - -[Soon™](https://github.com/ipld/specs/issues/41) From 4f94bbf864df336847d0f5f6573bd18076b7480b Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Fri, 28 Sep 2018 16:22:06 -0700 Subject: [PATCH 02/18] doc: copy editing and typos --- README.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index b2a9888f..6ea152a7 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ IPLD Specifications =================== -IPLD is not a simple specification, it is a set of specifications built on each other. +IPLD is not a single specification, it is a set of specifications. ``` The IPLD Stack @@ -37,7 +37,7 @@ IPLD is not a simple specification, it is a set of specifications built on each +--------------------------------------------------------------+ ``` -The goal of this technology stack is to enable decentralized data-structures +The goal of this stack is to enable decentralized data-structures which in turn will enable more decentralized applications. Many of the specifications in this stack are inter-dependent. @@ -78,14 +78,14 @@ Structured Data | VR, Geo, SQL, etc. 
<----------------+ ## Specification Repo Layout -* /IPLD-Data-Model-v1.md -* /IPLD-Path.md -* /CID.md -* /Codecs - * /Codecs/DAG-JSON.md - * /Codecs/DAG-CBOR.md -* /Data-Structures - * /Data-Structures/HAMT.md +* [/IPLD-Data-Model-v1.md](/IPLD-Data-Model-v1.md) +* [/IPLD-Path.md](/IPLD-Path.md) +* [/CID.md](/CID.md) +* [/Codecs](/Codecs) + * [/Codecs/DAG-JSON.md](/Codecs/DAG-JSON.md) + * [/Codecs/DAG-CBOR.md](/Codecs/DAG-CBOR.md) +* [/Data-Structures](/Data-Structures) + * [/Data-Structures/HAMT.md](/Data-Structures/HAMT.md) ## Discussion From 08b0b98d1846e3b649ef2effb5486f217810e70b Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Fri, 28 Sep 2018 16:30:41 -0700 Subject: [PATCH 03/18] doc: copy editing and typos --- README.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 6ea152a7..ec8c1048 100644 --- a/README.md +++ b/README.md @@ -123,7 +123,7 @@ links did for HTML web pages. Rather than referring to content by a string identifier or URL, content addressable systems refer to content by a cryptographic hash. This allows complete decentralization of the content as the identifier -does not specific the retreival method and provides a secure way to verify the content. +does not specify the retrieval method and provides a secure way to verify the content. ## IPLD Terms @@ -131,12 +131,12 @@ does not specific the retreival method and provides a secure way to verify the c Multihash is hash format that is not specific to a single hashing algorithm. -Multihashes describe the algorithm used for the hash as well as the hash value. +A multihash describes the algorithm used for the hash as well as the hash value. ``` -+---------+------------------------------+ ++-----------+----------------------------+ | Hash Type | Hash Value | -+---------+------------------------------+ ++-----------+----------------------------+ ``` SHA-256 example. @@ -151,7 +151,9 @@ Note: these examples are simplifications of the concepts. 
For a complete descrip
 
 ### CID (Content Identifier)
 
-Hash based content identifier. Includes the `codec` and
+Hash based content identifier. Includes the `codec` and `multihash`.
+
+CID's
 
 ```
 +-------+------------------------------+

From 13eff4c7ea20e790c034bb4773242031e431f498 Mon Sep 17 00:00:00 2001
From: Mikeal Rogers
Date: Fri, 28 Sep 2018 16:41:09 -0700
Subject: [PATCH 04/18] spec: move format specifics into format section

---
 Codecs/DAG-CBOR.md | 6 ++++--
 Codecs/DAG-JSON.md | 8 +++++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/Codecs/DAG-CBOR.md b/Codecs/DAG-CBOR.md
index 52b3c54c..f83fefce 100644
--- a/Codecs/DAG-CBOR.md
+++ b/Codecs/DAG-CBOR.md
@@ -2,11 +2,13 @@
 
 DAG-CBOR supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md)
 
-## Simple Types
+## Format
+
+### Simple Types
 
 CBOR already natively supports all "IPLD Data Model v1: Simple Types."
 
-## Link Type
+### Link Type
 
 IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4).
 
diff --git a/Codecs/DAG-JSON.md b/Codecs/DAG-JSON.md
index 9c383273..488f8fbe 100644
--- a/Codecs/DAG-JSON.md
+++ b/Codecs/DAG-JSON.md
@@ -2,7 +2,9 @@
 
 DAG-JSON supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md)
 
-## Simple Types
+## Format
+
+### Simple Types
 
 All simple types except binary are supported natively by JSON.
 
@@ -10,13 +12,13 @@ Contrary to popular belief, JSON as a format supports Big Integers. It's only
 JavaScript itself that has trouble with them. This means JS implementations of
 `dag-json` can't use the native JSON parser and serializer.
-### Binary Type
+#### Binary Type
 
 ```javascript
 {"/": { "base64": String }}
 ```
 
-## Link Type
+### Link Type
 
 ```javascript
 {"/": String /* base encoded CID */}

From e884f5889dce2ac3b7cb411327683f8046ad583f Mon Sep 17 00:00:00 2001
From: Steven Allen
Date: Tue, 30 Oct 2018 15:27:52 -0700
Subject: [PATCH 05/18] dag-cbor: add a draft spec

---
 formats/DagCBOR.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 formats/DagCBOR.md

diff --git a/formats/DagCBOR.md b/formats/DagCBOR.md
new file mode 100644
index 00000000..58417493
--- /dev/null
+++ b/formats/DagCBOR.md
@@ -0,0 +1,31 @@
+# DagCBOR Spec
+
+The CBOR IPLD format is called DagCBOR to disambiguate it from regular CBOR.
+Most CBOR objects are valid DagCBOR. The only hard restriction is that any field
+with the tag 42 must be a valid CID.
+
+## Link Format
+
+As with all IPLD formats, DagCBOR must be able to encode merkle-links. In
+DagCBOR, links are encoded using the raw-binary (identity, NUL) multibase in a
+field with a byte-string type (major type 2), with the tag 42.
+
+(the inclusion of the multibase exists for historical reasons)
+
+## Map Key Restriction
+
+In DagCBOR, map keys must be strings (TODO: drop this? We already have
+unpathable map keys). Furthermore, map keys should avoid using `/` as this is
+unpathable (TODO: drop this? IMO, we should support path escaping out of the
+box).
+
+## Canonical DagCBOR
+
+Canonical DagCBOR should:
+
+1. Use no tags other than the CID tag (42). Other tags may be lost in
+   conversion.
+2. Should use the canonical CBOR encoding and field ordering. Other orderings
+   will yield different CIDs.
+3. Should only use string map keys. Some implementations may not be able to
+   handle non-string keys.

From 81aef5bccfb9047a557963a97fd0f68f05bef052 Mon Sep 17 00:00:00 2001
From: Mikeal Rogers
Date: Mon, 5 Nov 2018 14:40:40 -0800
Subject: [PATCH 06/18] fix: layer model improvement.
--- README.md | 49 +++++++++++++++++++++++++------------------------ 1 file changed, 25 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index ec8c1048..c0e1376d 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,8 @@ IPLD Specifications IPLD is not a single specification, it is a set of specifications. ``` - The IPLD Stack + The IPLD Stack + +-----------------------------+ +-------------+ | | | | | End-User Application Stacks | @@ -12,29 +13,29 @@ IPLD is not a single specification, it is a set of specifications. | | +-----------------------------+ +-------------+ | | | | | Structured Data w/ indexes | - | unixfs v2 | | VR, Geo, SQL, etc. | +----------+ - | | | | | | - +-------------+ +-----------------------------+ | MFS in | - | | | | | IPFS | - | HAMT | | Sorted Index (sharded) | | | - | | | | +----------+ - +-------------+-+-----------------------------+ | | - | | | unixfs | - | Complex Data Structures | | v1 | - | | | | -+------------------------------------------------------------------------------+ -| | | | | -| | dag-json dag-cbor | | | -| | | | | -| Codecs +---------------------------------------------+ git | dag-pb | -| | | | | -| | IPLD Data Model | | | -| | | | | -+-------------------------------------------------------------+-----+----------+ - | | - | CID Path | - | | - +--------------------------------------------------------------+ + | unixfs v2 | | VR, Geo, SQL, etc. 
| +----------+ + | | | | | | + +-------------+ +-----------------------------+ | MFS in | + | | | | | IPFS | + | HAMT | | Sorted Index (sharded) | | | + | | | | +----------+ + +-------------+-+-----------------------------+ | | + | | | unixfs | + | Complex Data Structures | | v1 | + | | | | ++-----------------------------------------------------------------------------------------+ +| | | | | +| | dag-json dag-cbor | ipld-git | | +| | | | | +| Codecs +---------------------------------------------+ ipld-btc | dag-pb | +| | | | | +| | IPLD Data Model | ipld-zcash | | +| | | | | ++-------------------------------------------------------------+----------------+----------+ + | | + | CID Path | + | | + +-------------------------------------------------------------------------+ ``` The goal of this stack is to enable decentralized data-structures From 9ad76def8d919fc0f99a8f780fd382fc8bf736c0 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Mon, 5 Nov 2018 14:45:02 -0800 Subject: [PATCH 07/18] Pulling DAG-CBOR from new feature branch. --- Codecs/DAG-CBOR.md | 43 ++++++++++++++++++++++--------------------- formats/DagCBOR.md | 31 ------------------------------- 2 files changed, 22 insertions(+), 52 deletions(-) delete mode 100644 formats/DagCBOR.md diff --git a/Codecs/DAG-CBOR.md b/Codecs/DAG-CBOR.md index f83fefce..58417493 100644 --- a/Codecs/DAG-CBOR.md +++ b/Codecs/DAG-CBOR.md @@ -1,30 +1,31 @@ -# [WIP] DAG-CBOR.md +# DagCBOR Spec -DAG-CBOR supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md) +The CBOR IPLD format is called DagCBOR to disambiguate it from regular CBOR. +Most CBOR objects are valid DagCBOR. The only hard restriction is that any field +with the tag 42 must be a valid CID. -## Format +## Link Format -### Simple Types +As with all IPLD formats, DagCBOR must be able to encode merkle-links. In +DagCBOR, links are encoded using the raw-binary (identity, NUL) multibase in a +field with a byte-string type (major type 2), with the tag 42. 
-CBOR already natively supports all "IPLD Data Model v1: Simple Types." +(the inclusion of the multibase exists for historical reasons) -### Link Type +## Map Key Restriction -IPLD links can be represented in CBOR using tags which are defined in [RFC 7049 section 2.4](http://tools.ietf.org/html/rfc7049#section-2.4). +In DagCBOR, map keys must be strings (TODO: drop this? We already have +unpathable map keys). Furthermore, map keys should avoid using `/` as this is +unpathable (TODO: drop this? IMO, we should support path escaping out of the +box). -A tag `` is defined. This tag can be followed by a text string (major type 3) or byte string (major type 2) corresponding to the link target. +## Canonical DagCBOR -When encoding an IPLD "link object" to CBOR, use this algorithm: +Canonical DagCBOR should: -- The *link value* is extracted. -- If the *link value* is a valid [multiaddress](https://github.com/multiformats/multiaddr) and converting that link text to the multiaddress binary string and back to text is guaranteed to result to the exact same text, the link is converted to a binary multiaddress stored in CBOR as a byte string (major type 2). -- Else, the *link value* is stored as text (major type 3) -- The resulting encoding is the `` followed by the CBOR representation of the *link value* - -When decoding CBOR and converting it to IPLD, each occurences of `` is transformed by the following algorithm: - -- The following value must be the *link value*, which is extracted. -- If the link is a binary string, it is interpreted as a multiaddress and converted to a textual format. Else, the text string is used directly. -- A map is created with a single key value pair. The key is the standard IPLD *link key* `/`, the value is the textual string containing the *link value*. - -When an IPLD object contains these tags in the way explained here, the multicodec header used to represent the object codec must be `/cbor/ipld-tagsv1` instead of just `/cbor`. 
Readers should be able to use an optimized reading process to detect links using these tags.
\ No newline at end of file
+1. Use no tags other than the CID tag (42). Other tags may be lost in
+   conversion.
+2. Should use the canonical CBOR encoding and field ordering. Other orderings
+   will yield different CIDs.
+3. Should only use string map keys. Some implementations may not be able to
+   handle non-string keys.
diff --git a/formats/DagCBOR.md b/formats/DagCBOR.md
deleted file mode 100644
index 58417493..00000000
--- a/formats/DagCBOR.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# DagCBOR Spec
-
-The CBOR IPLD format is called DagCBOR to disambiguate it from regular CBOR.
-Most CBOR objects are valid DagCBOR. The only hard restriction is that any field
-with the tag 42 must be a valid CID.
-
-## Link Format
-
-As with all IPLD formats, DagCBOR must be able to encode merkle-links. In
-DagCBOR, links are encoded using the raw-binary (identity, NUL) multibase in a
-field with a byte-string type (major type 2), with the tag 42.
-
-(the inclusion of the multibase exists for historical reasons)
-
-## Map Key Restriction
-
-In DagCBOR, map keys must be strings (TODO: drop this? We already have
-unpathable map keys). Furthermore, map keys should avoid using `/` as this is
-unpathable (TODO: drop this? IMO, we should support path escaping out of the
-box).
-
-## Canonical DagCBOR
-
-Canonical DagCBOR should:
-
-1. Use no tags other than the CID tag (42). Other tags may be lost in
-   conversion.
-2. Should use the canonical CBOR encoding and field ordering. Other orderings
-   will yield different CIDs.
-3. Should only use string map keys. Some implementations may not be able to
-   handle non-string keys.

From af45255f674eb613620ff0feda37aa520d4e7228 Mon Sep 17 00:00:00 2001
From: Mikeal Rogers
Date: Mon, 5 Nov 2018 14:46:48 -0800
Subject: [PATCH 08/18] Adding note about simple types.
--- Codecs/DAG-CBOR.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/Codecs/DAG-CBOR.md b/Codecs/DAG-CBOR.md index 58417493..a2b16c9b 100644 --- a/Codecs/DAG-CBOR.md +++ b/Codecs/DAG-CBOR.md @@ -1,4 +1,10 @@ -# DagCBOR Spec +# [WIP] DagCBOR Spec + +DAG-CBOR supports the full ["IPLD Data Model v1."](../IPLD-Data-Model-v1.md) + +CBOR already natively supports all ["IPLD Data Model v1: Simple Types."](../IPLD-Data-Model-v1.md#simple-types) + +## Format The CBOR IPLD format is called DagCBOR to disambiguate it from regular CBOR. Most CBOR objects are valid DagCBOR. The only hard restriction is that any field From 55150f4ef46e4f1e425c9349d343768a8a36c70e Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Mon, 5 Nov 2018 14:56:15 -0800 Subject: [PATCH 09/18] fix: removing old notes from CID spec. --- CID.md | 128 +++++++-------------------------------------------------- 1 file changed, 15 insertions(+), 113 deletions(-) diff --git a/CID.md b/CID.md index 228ef52f..01e957c8 100644 --- a/CID.md +++ b/CID.md @@ -1,85 +1,12 @@ -This is the first CID proposal, from: https://github.com/ipfs/specs/issues/130 - reproduced here for historical purposes. +# CIDv1 ---- +# Content IDs -# READ THIS PARAGRAPH FIRST +This document will use the words Content IDs or CIDs. -Hey everyone, the below is a proposal for some changes to IPFS, IPLD, and how we link to data structures. It would address a bunch of open problems that have been identified, and improve the use, tooling, and model of IPLD to allow lots of what people have been requesting for months. Please review and leave comments. We feel pretty strongly about this being a good solution, **but we're not sure if we're just drinking the koolaid and going to make things worse. Sanity check before we move further pls?** Also my apologies, i would spend more time writing up a better version but i just dont have enough time right now and time is of the essence on this. 
+Prior base58 multihash links to protobuf data be called CID Version 0. ---- - -# [EXPERIMENTAL PROPOSAL] CIDv1 -- Important Updates to IPFS, IPLD, Multicodec, and more. - -> IPFS migration path to IPLD (CBOR) from MerkleDAG (ProtoBuf) - -## Multicodec Packed Representation - -It is useful to have a compact version of multicodec, for use in small identifiers. This compact identifier will just be a single varint, looked up in a table. Different applications can use different tables. We should probably have one common table for well-known formats. - -We will establish a table for common authenticated data structure formats, for example: IPFS v0 Merkledag, CBOR IPLD, Git, Bitcoin, and more. The table is a simple varint lookup. - -## IPLD Links Updates (new format) - -### Open Problems (Motivation) - -IPLD allows content to be stored in multiple different formats, and thus we need a way to understand what kind of content is being loaded in when traversing a link. A problematic issue is that old ipfs content (protobuf merkledag) does not use multicodec. It makes it difficult to distinguish between the new CBOR IPLD objects and the old Protobuf objects. - -It has been proposed earlier that we wrap protobuf objects with a multicodec. But this is a problem, because the protobuf multicodec would not be authenticated. This is further complicated because many people have been requesting the ability to address raw leaf objects directly (that is, a hash linking to raw content, without ipld nor protobuf wrapping). This is a nice thing to have, but introduces difficulty in distinguishing between a protobuf or a raw encoded object, particularly when neither has a multicodec header which is authenticated by the object's hash. This lack of authentication is an attack vector: adversaries may provide protobuf objects with a raw multicodec, and depending on how implementations handle the multicodec, may poison an implementation's object repo. 
- -Another important performance constraint is that multicodec headers are quite large: `/ipld/cbor/v0`, for example, is 13 bytes, which is way too large for many applications of small data. Instead, we would like to be able to use a compact multicodec representation ("multicodec packed", a single varint) to distinguish the formats. So that encoded objects are wrapped with minimal overhead. Note that this still does not affect protobuf or raw objects because these do not include headers. - -Additional complications include how bitswap sends or identifies blocks, how a DagStore can pull out the object for a multihash and know what format encoding to use for it (eg raw vs protobuf), whether to allow linking from one object type to another, support for multiple base encodings for links, among others. - -In discussions we (@jbenet, @diasdavid, and @whyrusleeping) reviewed many different possiblities. We considered possibilities and how it affected linking data, wrapping the data with multicodec, storing it that way under the many layers of abstraction (dag store, blockstore, datastore, file systems), fetching and retrieving objects, knowing what format to use when, ensuring values are authenticated and not opening up vectors for attackers to poison repos, and more. - -In the end, we came up with a few small changes to how we represent IPLD links that solve all our problems (tm) \o/. These are: -- teach IPLD links to carry data formats (using multicodec) -- teach IPLD links to distinguish base encodings - -It is worth crediting many people here that have tirelessly pushed hard to get a bunch of these ideas out. @davidar @mildred @nicola to name a few, but many others too. But they haven't looked at this yet. this first post is the first they'll hear of this construction, and they may very well hate this particular combination of ideas :) please be direct with feedback, the sooner the better. 
- -### IPLD Links learn about Base Encoding - -We propose adding a multibase prefix to representations of IPLD links. This is particularly important where the encoding is not binary. - -At this time, we recommend not including it in direct storage, where it should be binary. However, it may be found during the course of review that it is better to always retain the multibase prefix, even when storing in binary. - -This change is a much requested option to support multiple encodings for the hashes. Current links use by default base58, which is perfect for URLs as it doesn't contain any non supported char and can be easily copy-pasted, however, for performance reasons, it is not always the best format. Some users already encode IPFS multihashes in other bases, and therefore it would be ideal to have all IPFS and IPLD tooling support these encodings through multibase, avoiding confusing failures. - -### IPLD Links acquire a version - -The fact we propose here changes to the basic link structure remind us of the basic multiformats principle: - -> "Never going to change" considered harmful. - -therefore we deem it wise to ensure that henceforth we include a version so that evolution can be simple, and not complex. The below changes suggest a way to distinguish between old and new links, but we should avoid such situations in the future, as this approach leverages knowledge about multihash distributions in the wild. This will be less feasible in the future. - -### IPLD Links learn about Codecs - -The most important component of these changes introduces a multicodec-packed varint prefix to the link, to signal the encoding of the linked-to object. This enables the link to carry information about the data it points to, and ensure it is interpreted correctly. This ensures that the multicodec prefix is NOT necessary for interpretation of an IPLD object, as the link to the object carries information for its interpretation. 
- -All proper IPLD formats (cbor and on) should carry the multicodec header at the beginning of their serialized representation, which authenticates the header and ensures clients can interpret the object without even having a link. But, this is not possible with objects of formats created before the IPLD spec, such as the first merkledag protobuf object codec in IPFS (go-ipfs 0.4.x and below). This includes also objects from other authenticated data structure distributed systems, such as Git, Bitcoin, Ethereum, and more. Finally, raw data -- which many hope to be able to address directly in IPLD -- cannot carry an authenticated prefix either. - -The approach of adding the multicodec to the link entirely side-steps the problem of not being able to authenticate multicodec headers for protobufs, git, bitcoin, or raw data objects. And this avoids a nasty repo poisoning attack, possible in other proposed suggestions that rely on an unauthenticated multicodec header (carried along with the object) to determine the type of an object. - -This also ensures that IPLD objects can still be content-addressed nicely, without needing to also store codec metadata alongside. - -This change has been long-proposed in other forms. These other forms usually suggested attaching a `@multicodec` key to IPLD link objects (as a property on or next to the link), which was cumbersome and introduced complexity in other ways. Specially, it was not easy to carry over this info to a URL or copy-pasted identifier. - -This multicodec-packed prefix will be sampled from a special table, maintained along with the IPLD spec. This table is expandable over time. A global multicodec table could grow from this one, or start separately. - -### Content IDs - -This document will use the words Content IDs or CIDs. this abstraction is useful here but may not be useful beyond it. Another word -- albeit much less precise -- may be IPLD Link. 
- -Other options are: -- SID - Self-describing IDentifier -- SSDID - Secure Self Describable Identifier -- IPLD Links -- no fancy name, less abstraction creep. less precise. - -Let the old base58 multihash links to protobuf data be called CID Version 0. - -#### CIDs Version 1 (new) +## CIDs Version 1 Putting together the IPLD Link update statements above, we can term the new handle for IPLD data CID Version 1, with a multibase prefix, a version, a packed multicodec, and a multihash. @@ -95,7 +22,13 @@ Where: Note that all CIDs v1 and on should always begin with ``, this evolving nicely. -#### Distinguishing v0 and v1 CIDs (old and new) +### Multicodec Packed Representation + +It is useful to have a compact version of multicodec, for use in small identifiers. This compact identifier will just be a single varint, looked up in a table. Different applications can use different tables. We should probably have one common table for well-known formats. + +We will establish a table for common authenticated data structure formats, for example: IPFS v0 Merkledag, CBOR IPLD, Git, Bitcoin, and more. The table is a simple varint lookup. + +### Distinguishing v0 and v1 CIDs (old and new) It is a HARD CONSTRAINT that all IPFS links continue to work. This means we need to continue to support v0 CIDs. This means IPFS APIs must accept both v0 and v1 CIDs. This section defines how to distinguish v0 from v1 CIDs. @@ -103,7 +36,7 @@ Old v0 CIDs are strictly sha2-256 multihashes encoded in base58 -- this is becau - `` is implicitly base58. - `` is implicitly 0. -- `` is implicitly protobuf (todo: add code here) +- `` is implicitly protobuf (for backwards compat with v0). - `` is a cryptographic multihash, explicit. We can re-write old v0 CIDs into v1 CIDs, by making the elements explicit. This should be done henceforth to avoid creating more v0 CIDs. But note that many references exist in the wild, and thus we must continue supporting v0 links. 
In the distant future, we may remove this support after sha2 breaks. @@ -124,43 +57,12 @@ Basically, we construct the corresponding CIDv1 out of the raw hash link because - linking from CBOR IPLD to other CBOR IPLD objects exactly as has been spec-ed out so far, so any IPLD adopters continue working. - (most important) opens the door for native support of other data structures -### IPLD native support for Git, Bitcoin, Ethereum, and other authenticated data structures - -Given the above addressing changes, it is now possible to directly address and implement native support for Git, Bitcoin, Ethereum, and other authenticated data structure formats. Such native support would allow resolving through such objects, and treat them as true IPLD objects, instead of needing to wrap them in CBOR or another format. This is the proper merkle-forest. \o/ - ### IPLD addresses raw data Given the above addressing changes, it is now possible to address raw data directly, as an IPLD node. This node is of course taken to be just a byte buffer, and devoid of links (i.e. a leaf node). -The utility of this is the ability to directly address any object via hashing external to IPLD datastructures, which is a _much_-requested feature. - +The utility of this is the ability to directly address any object via hashing external to IPLD datastructures. ### Support for multiple binary packed formats -Contrary to existing Merkle objects (e.g IPFS protobuf legacy, git, bitcoin, dat and others), new IPLD ojects are authenticated AND self described data blobs, each IPLD object is serialized and prefixed by a multicodec identifying its format. - -Some candidate formats: -- /ipld/cbor -- /ipld/ion/1.0.0 -- /ipld/protobuf/3.0.0 -- /ipld/protobuf/2.0.0 - -There is one strong requirement for these formats to work: a format MUST have a 1:1 mapping to the canonical IPLD serialiation format. Today (July 29, 2016), that format is CBOR. 
- -## Changes to Interfaces / Specs - -Need changes to: - -- IPFS specs (addressing in particular) need to support CIDv1 -- IPFS interfaces need to support CIDv1 -- Add a new, small CIDv1 or "IPLD Links" spec -- IPLD spec is compatible. Can improve in wording. CBOR data format does not change. Pathing does not change. . - -### Support for CID v0 and v1 - -It is a HARD CONSTRAINT that all IPFS links continue to work. In order to support both CID v0 paths (`/ipfs/`) and the new CID v1 paths (`/ipfs/`, IPFS and other IPLD tooling will detect the version of the CID through a matching function. (See "Distinguishing v0 and v1 CIDs (old and new)" above). - -The following interfaces must support both types: -- The IPFS API, which takes CIDs and Paths - - This includes subprotocols, such as Bitswap -- HTTP-to-IPFS Gateway, for all existing https://ipfs.io/ipfs/... links +Contrary to prior Merkle objects (e.g IPFS protobuf legacy, git, bitcoin, dat and others), new IPLD ojects are authenticated AND self described data blobs, each IPLD object is serialized and prefixed by a multicodec identifying its format. From 15b3cb399983164aeaa9fd2eed06c0b46dd21a94 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Mon, 5 Nov 2018 14:56:27 -0800 Subject: [PATCH 10/18] fix: require bigint. --- IPLD-Data-Model-v1.md | 1 + 1 file changed, 1 insertion(+) diff --git a/IPLD-Data-Model-v1.md b/IPLD-Data-Model-v1.md index 5b6256e9..be98407f 100644 --- a/IPLD-Data-Model-v1.md +++ b/IPLD-Data-Model-v1.md @@ -6,6 +6,7 @@ * Null * String * Integer + * Must support BigInt * Float * Array * Object (Hash Map) From 024eb854151b64df544c50cd31844fe5abafb881 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Mon, 5 Nov 2018 15:06:33 -0800 Subject: [PATCH 11/18] intial: path spec. 
---
 IPLD-Path.md | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/IPLD-Path.md b/IPLD-Path.md
index 6bb57ad3..1042b8c6 100644
--- a/IPLD-Path.md
+++ b/IPLD-Path.md
@@ -1,3 +1,20 @@
 # [WIP] IPLD Path v1
 
-TODO: write IPLD Path spec.
\ No newline at end of file
+An IPLD Path is a string identifier used for deep references into IPLD
+graphs.
+
+IPLD Path's are constructed following the same constraints as [URI Paths](https://tools.ietf.org/html/rfc3986#section-3.3).
+
+Similarly, the string `?` is reserved for future use as a query separator.
+
+# Path Resolution
+
+Path resolution is broken into two parts: full path resolution and block level resolution.
+
+Block level path resolutionis defined by individual codecs.
+For most common codecs each path segment is a property lookup.
+
+Full path resolution should use block level resolution through each block.
+When a block level resolver returns an `IPLD Link` a full path resolution
+should retreive that block, load its codec, and continue on with additional
+block level resolution until the full path is resolved.

From ebf79f6f76a5a12cada4730f1847bc34e9c7bf91 Mon Sep 17 00:00:00 2001
From: Mikeal Rogers
Date: Mon, 5 Nov 2018 15:13:31 -0800
Subject: [PATCH 12/18] fix: copy edits.

---
 README.md | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index c0e1376d..27b88296 100644
--- a/README.md
+++ b/README.md
@@ -175,7 +175,7 @@ The long version
 +------------------------------+
 ```
 
-Note: these examples are simplifications of the concepts. For a complete description visit the [spec](/CID.md).
+Note: these examples are simplifications of the concepts. For a complete description visit the [spec](./CID.md).
 
 ### Block
 
@@ -207,12 +207,13 @@ The long version.
 
 ### IPLD Path
 
-TODO:
+A string identifier used for deep references into IPLD
+graphs. Follows similar escape and segmentation rules as URI Paths.
 
 ### IPLD Data Model
 
 The IPLD Data Model describes a set of base types. Codecs that support these base types
-can be used by any of the Data-Structures built on top of the IPLD Data Model.
+can be used by any of the data-structures built on top of the IPLD Data Model.
 
 Codecs that support the IPLD Data Model:
 
@@ -222,7 +223,7 @@ Codecs that support the IPLD Data Model:
 ### Codec
 
 A codec exposes serialization and deserialization for IPLD blocks. If it also supports
-content addressable links then the codec also exposes those links as `CID`'s. A codec
+content addressable links then the codec exposes those links as `CID`'s. A codec
 also supports atomic IPLD Path lookups on the block.
 
 #### Serializer, Deserializer and Format
@@ -247,9 +248,12 @@ A logical separation exists in any given IPLD codec between the **format** and t
 
 A **format** may represent object types and tree structures any
 way it wishes. This includes existing representations (JSON, BSON, CBOR,
-Protobuf, msgpack, etc) or even new custom serializations. We will refer to
-this as the **representation**.
+Protobuf, msgpack, etc) or even new custom serializations.
 
-Therefor, a **format** is the standardized representation of IPLD Links and Paths in a given **representation**. It describes how to translate between structured data and binary.
+Therefor, a **format** is the standardized representation of IPLD Links and Paths. It describes how to translate between structured data and binary.
 
 It is worth noting that **serializers** and **deserializers** differ by programming language while the **format** does not and MUST remain consistent across all codec implementations.
+
+#### Representation
+
+The in-memory representation of a de-serialized IPLD value.

From 630716acb37ea6e8f4065f48954e1b5f39963c51 Mon Sep 17 00:00:00 2001
From: Mikeal Rogers
Date: Tue, 6 Nov 2018 16:25:23 -0800
Subject: [PATCH 13/18] fix: remove confusing statement.
--- IPLD-Path.md | 1 - 1 file changed, 1 deletion(-) diff --git a/IPLD-Path.md b/IPLD-Path.md index 1042b8c6..355d7426 100644 --- a/IPLD-Path.md +++ b/IPLD-Path.md @@ -12,7 +12,6 @@ Similarly, the string `?` is reserved for future use as a query separator. Path resolution is broken into two parts: full path resolution and block level resolution. Block level path resolutionis defined by individual codecs. -For most common codecs each path segment is a property lookup. Full path resolution should use block level resolution through each block. When a block level resolver returns an `IPLD Link` a full path resolution From 5ac77801ab103ea916322a79b94163f5697cbd08 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Tue, 6 Nov 2018 16:26:47 -0800 Subject: [PATCH 14/18] fix: link to IPLD Path spec --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 27b88296..1f180dcc 100644 --- a/README.md +++ b/README.md @@ -210,6 +210,8 @@ The long version. A string identifier used for deep references into IPLD graphs. Follows similar escape and segmentation rules as URI Paths. +[Read the full specification for more details.](./IPLD-Path.md) + ### IPLD Data Model The IPLD Data Model describes a set of base types. Codecs that support these base types From d99da8d868c43d2bb58129b08045010e8eb6d7f0 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Tue, 6 Nov 2018 16:28:26 -0800 Subject: [PATCH 15/18] fix: make raw reference clearer --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 1f180dcc..c344b38b 100644 --- a/README.md +++ b/README.md @@ -46,9 +46,9 @@ Many of the specifications in this stack are inter-dependent. 
``` IPLD Dependency Graph -+---+ +-----+ +---+ -|CID+-----------+-------------->Block+-------->Raw| -+---+ | +--+--+ +---+ ++---+ +-----+ +---------+ +|CID+-----------+-------------->Block+-------->Raw Block| ++---+ | +--+--+ +---------+ +------v-------------+ | +----+ |Links (Conceptually)| | |Path| +------+-------------+ | +-----------+ From 44cb5e1808f2625a336d27cd2048a85f841d3e2c Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Tue, 6 Nov 2018 16:30:34 -0800 Subject: [PATCH 16/18] fix: path resolution includes returning a value --- IPLD-Path.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/IPLD-Path.md b/IPLD-Path.md index 355d7426..ab7aae7f 100644 --- a/IPLD-Path.md +++ b/IPLD-Path.md @@ -16,4 +16,6 @@ Block level path resolutionis defined by individual codecs. Full path resolution should use block level resolution through each block. When a block level resolver returns an `IPLD Link` a full path resolution should retreive that block, load its codec, and continue on with additional -block level resolution until the full path is resolved. +block level resolution until the full path is resolved. Finally, path resolution +should return a [**representation**](./IPLD-Path.md#representation) +of the value for the given path. \ No newline at end of file From b4425c3aef5c6e73e3c3d93904fb9ac0cf5296aa Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Thu, 8 Nov 2018 15:01:31 -0800 Subject: [PATCH 17/18] doc: adjusting diagram --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index c344b38b..e63025e1 100644 --- a/README.md +++ b/README.md @@ -46,9 +46,9 @@ Many of the specifications in this stack are inter-dependent. 
``` IPLD Dependency Graph -+---+ +-----+ +---------+ -|CID+-----------+-------------->Block+-------->Raw Block| -+---+ | +--+--+ +---------+ ++---+ +----------+ +|CID+-----------+-------------->Raw Blocks| ++---+ | +--+-------+ +------v-------------+ | +----+ |Links (Conceptually)| | |Path| +------+-------------+ | +-----------+ From c6631c3132f9007d7ddcd6baf55f70568d346a63 Mon Sep 17 00:00:00 2001 From: Mikeal Rogers Date: Thu, 8 Nov 2018 15:04:00 -0800 Subject: [PATCH 18/18] fix: pulling bigint for now, will start new PR for it after merge --- IPLD-Data-Model-v1.md | 1 - 1 file changed, 1 deletion(-) diff --git a/IPLD-Data-Model-v1.md b/IPLD-Data-Model-v1.md index be98407f..5b6256e9 100644 --- a/IPLD-Data-Model-v1.md +++ b/IPLD-Data-Model-v1.md @@ -6,7 +6,6 @@ * Null * String * Integer - * Must support BigInt * Float * Array * Object (Hash Map)
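Note on the path resolution text added to IPLD-Path.md in this series: the split between block level resolution and full path resolution may be easier to follow with a small sketch. The code below is only an illustration under stated assumptions, not the reference behaviour of any IPLD implementation. `Link`, `Block`, `toy_json_resolver`, `resolve`, the `toy-json` codec name, and the `{"/": "<cid>"}` link encoding are all hypothetical stand-ins chosen for the example. CIDs are plain strings here, and percent-decoding of path segments is not handled.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for an IPLD Link; `cid` is a plain string, not a real CID."""
    cid: str


@dataclass(frozen=True)
class Block:
    """Hypothetical block: a codec name plus the serialized bytes."""
    codec: str
    data: bytes


# Assumed shape of a block level resolver: it consumes as many path segments
# as it can inside one block and returns (value, remaining_segments).
BlockResolver = Callable[[bytes, List[str]], Tuple[Any, List[str]]]


def toy_json_resolver(data: bytes, segments: List[str]) -> Tuple[Any, List[str]]:
    """Block level resolution for a toy JSON codec (links encoded as {"/": "<cid>"})."""
    node = json.loads(data)
    remaining = list(segments)
    while True:
        # Stop as soon as a link is reached; the caller decides whether to follow it.
        if isinstance(node, dict) and set(node) == {"/"}:
            return Link(node["/"]), remaining
        if not remaining:
            return node, remaining
        key = remaining.pop(0)
        node = node[int(key)] if isinstance(node, list) else node[key]


def resolve(root_cid: str, path: str,
            get_block: Callable[[str], Block],
            codecs: Dict[str, BlockResolver]) -> Any:
    """Full path resolution: run block level resolution through each block in turn."""
    segments = [s for s in path.split("/") if s]
    cid = root_cid
    while True:
        block = get_block(cid)
        value, segments = codecs[block.codec](block.data, segments)
        if isinstance(value, Link) and segments:
            # A link with path left over: retrieve that block and keep resolving.
            cid = value.cid
            continue
        # Either a plain value, or a link at the very end of the path.
        return value


# Two toy blocks, the first linking to the second.
store = {
    "cid-a": Block("toy-json", json.dumps({"child": {"/": "cid-b"}, "name": "root"}).encode()),
    "cid-b": Block("toy-json", json.dumps({"items": [10, 20, 30]}).encode()),
}
codecs = {"toy-json": toy_json_resolver}

result = resolve("cid-a", "child/items/1", store.__getitem__, codecs)
print(result)
```

In this sketch a path that ends exactly on a link returns the link itself rather than loading the linked block. The spec text in this series does not settle that case, so treat it as a choice made for the example.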