Add reader's privacy specification #5
Merged
49 commits
1fed637
Add reader's privacy specification
ischasny abe8237
Fixed spacing
ischasny b84ea7a
Fixed spacing
ischasny 9c3c06e
Update reader-privacy.md
ischasny 8565242
Update reader-privacy.md
ischasny 02f305b
Update reader-privacy.md
ischasny d184ef1
Update reader-privacy.md
ischasny f070287
Update reader-privacy.md
ischasny 7cd7d9c
Update reader-privacy.md
ischasny 50dc928
Update reader-privacy.md
ischasny 06889cd
Update reader-privacy.md
ischasny 0ea3bf8
Update reader-privacy.md
ischasny 1d39505
Update reader-privacy.md
ischasny 14354e2
Update reader-privacy.md
ischasny efadf02
Update reader-privacy.md
ischasny 7b8b876
Update reader-privacy.md
ischasny 4cd1705
Update reader-privacy.md
ischasny fc9f954
Update reader-privacy.md
ischasny e848b3c
Add mermaid diagram
ischasny 7242649
Add mermaid diagram
ischasny e4073be
Update double-hashing spec
ischasny 3817042
Add trade offs
ischasny 84ece7a
Add threat modeling section
ischasny 0c80a2d
Add threat modeling section
ischasny aa818cc
Update reader-privacy.md
ischasny 7c16595
Update reader privacy spec
ischasny 34968ef
Fix definition of ProviderRecord
ischasny cff4521
Update reader-privacy.md
ischasny 8d5d602
Minor fixes
ischasny c9cc59f
Add reader privacy implementation details
ischasny 8b74305
Add reader privacy implementation details
ischasny 3ecf6a8
Formatting
ischasny 51f7b50
Formatting
ischasny 9a97942
Formatting
ischasny 293ab38
Move addendum to the main spec
ischasny 7e6f3b9
Small update to the spec
ischasny b28fc7e
Remove few redundant sentences
ischasny 532e92d
Remove few redundant sentences
ischasny 3f4e381
Tidy up grammar
ischasny 40d0f6c
Add Extended Providers and more security considerations
ischasny fe8f9b9
Remove CR_HASH prefix that is not used
ischasny 95ddcbf
Remove ProviderRecord encryption
ischasny 51a2f02
Add info about libp2p peerstore
ischasny 600b664
Fix typo
ischasny 891c9f5
Align with the DHT specification
ischasny 453f531
Spacing
ischasny 033307c
Fix typos
ischasny 896ff3d
Update text, links, etc.
gammazero 7090ce0
Update storage space overhead
gammazero
# Reader Privacy Preservation

![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square)

**Author(s)**:
<!-- keep names alphabetically sorted -->

- [Andrew Gillis](https://github.com/gammazero)
- [Ivan Schasny](https://github.com/ischasny)
- [Masih Derkani](https://github.com/masih)
- [Will Scott](https://github.com/willscott)

**Maintainer(s)**:

- [Andrew Gillis](https://github.com/gammazero)
- [Ivan Schasny](https://github.com/ischasny)
- [Masih Derkani](https://github.com/masih)
- [Will Scott](https://github.com/willscott)

* * *

**Abstract**

The lookup APIs provided by IPNI nodes are able to observe what data is being accessed by the clients.
This is true regardless of whether the data itself is public or not. Because IPNI nodes continuously
catalogue the content hosted by all the providers, and provide a central lookup API, the need for
reader privacy is amplified. This makes IPNI a difficult choice as an alternative routing system in
projects such as IPFS, which use a more decentralised routing system that by nature reduces the
possibility of mass query snooping.

Building on top of IPFS's reader privacy mechanism, a.k.a. double-hashing, this document specifies how
a similar technique is applied to IPNI in order to preserve the reader's privacy while continuing to
facilitate low-latency provider lookup.

## Table of Contents

- [Introduction](#introduction)
- [Background](#background)
- [Specification](#specification)
- [Extended Providers](#extended-providers)
- [Security](#security)
- [Hashing and Encryption Function Upgrades](#hashing-and-encryption-function-upgrades)
- [Trade Offs](#trade-offs)
- [Threat Modelling](#threat-modelling)
- [Implementation](#implementation)
- [Hashing](#hashing)
- [Encryption](#encryption)
- [Data Formats](#data-formats)
- [Related Resources](#related-resources)

For more technical implementation details please see the [Addendum](reader-privacy-addendum.md).

## Introduction

IPFS requires various privacy protections, including privacy for the content routing subsystem. Readers
(clients accessing files) need privacy for the content they consume and writers (hosts storing and
distributing content) need privacy for the content they publish. It is very easy for a content router node or
a passive observer to learn which file is requested by which client during the routing process, as the
potential adversary easily learns about the requested `CID`. A snooping actor could request the same `CID`
and download the associated file to monitor the user’s behavior. This is obviously undesirable, and the
community has been strongly requesting a fix for some time.

This specification describes how IPNI provides Reader Privacy. It prevents passive observers from tracking a user's
actions as described above.

### Non Goals

* Writer Privacy, i.e. privacy for the content provider or publisher, which will be covered in a separate specification;
* Retrieval Privacy, which is out of scope for the content routing subsystem;
* Rogue IPNI behaviour (explained in the [Security](#security) section).

## Background

IPNI builds its indexes by ingesting chains of Advertisements. An Advertisement is a
construct that allows Storage Providers to publish their CIDs in bulk (FIL deals) instead of doing
that individually for each CID. A group of CIDs is represented by a ContextID that is unique per provider, as shown
in the diagram below:

![Index building flow](resources/readers-privacy-1.png)

## Magic Values

All salts below are 64 bytes long and represent a string padded with `\x00`.

- `SALT_DOUBLEHASH = bytes("CR_DOUBLEHASH\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")` | ||
- `SALT_ENCRYPTIONKEY = bytes("CR_ENCRYPTIONKEY\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")` | ||
- `SALT_NONCE = bytes("CR_NONCE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")` | ||
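
The salts above can be built by padding rather than typed out by hand. The following is a minimal Python sketch under that reading; the helper name `make_salt` and the constant `SALT_LENGTH` are illustrative and not part of the specification.

```python
SALT_LENGTH = 64  # bytes, as stated in this section


def make_salt(label: str) -> bytes:
    """Return the ASCII label right-padded with zero bytes to SALT_LENGTH."""
    raw = label.encode("ascii")
    if len(raw) > SALT_LENGTH:
        raise ValueError("label is longer than the salt length")
    return raw.ljust(SALT_LENGTH, b"\x00")


SALT_DOUBLEHASH = make_salt("CR_DOUBLEHASH")
SALT_ENCRYPTIONKEY = make_salt("CR_ENCRYPTIONKEY")
SALT_NONCE = make_salt("CR_NONCE")
```

Generating the padding makes it easy to check that every salt is exactly 64 bytes long.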
## Definitions

- **Advertisement** is [IPNI Advertisement](https://github.com/ipni/go-libipni/blob/main/ingest/schema/schema.ipldsch#L40).
- **Storage Provider** is a party who stores the data and wants that data to be discoverable through IPNI.
- **Publisher** is a party who publishes multihashes, via Advertisements, to IPNI on behalf of a Storage Provider.
- **Client** is a party who wants to find the content by its CID using IPNI for the purpose of retrieving it from the Storage Provider.
- **Passive Observer** is a snooping party that wants to understand what content is being looked up by observing Client-to-IPNI traffic.
- **`enc`** is [AESGCM](https://en.wikipedia.org/wiki/Galois/Counter_Mode) encryption. The notation `enc(passphrase, nonce, payload)` is used for the rest of the specification.
- **`hash`** is [SHA256](https://en.wikipedia.org/wiki/SHA-2) hashing.
- **`||`** is concatenation of two values.
- **`deriveKey`** derives a 32-byte encryption key from a passphrase as `hash(SALT_ENCRYPTIONKEY || passphrase)`.
- **`Nonce`** is a 12-byte nonce used as the Initialization Vector (IV) for the AESGCM encryption. IPNI expects an explicit instruction to delete a record (compared to the DHT, where records expire).
  Hence the IPNI server needs to be able to compare encrypted values without having to decrypt them, as that would require a key that it is unaware of.
  That means that the nonce has to be chosen deterministically so that `enc(passphrase, nonce, payload)` produces the same output for the same
  `passphrase` + `payload` pair. Nonce must be calculated as `hash(SALT_NONCE || passphrase || len(payload) || payload)[:12]`, where `len(payload)` is
  the length of the `payload` encoded as an 8-byte Little Endian integer. The choice of nonce is not enforced by the IPNI specification. The described approach will be used while IPNI encrypts Advertisements on behalf of Publishers. However, once Writer Privacy is implemented, the choice of nonce will be left up to the Publisher.
- **`CID`** is the [Content IDentifier](https://github.com/multiformats/cid).
- **`MH`** is the [Multihash](https://github.com/multiformats/multihash) contained in a `CID`. It corresponds to the digest of a hash function over some content. `MH` is represented as a 32-byte array.
- **`HASH2`** is a second hash over the multihash. Second hashes must be of `Multihash` format with the `DBL_SHA_256` codec. The digest must be calculated as `hash(SALT_DOUBLEHASH || MH)`.
- **`ProviderRecord`** is a data structure that contains information about a Storage Provider, such as their PeerID and addresses.
- **`ProviderRecordKey`** is a concatenation of `peerID || contextID`. There is no need for explicitly encoding lengths as they are already encoded as a part of the multihash format.
- **`EncProviderRecordKey`** is `Nonce || enc(deriveKey(multihash), Nonce, ProviderRecordKey)`.
- **`HashProviderRecordKey`** is a hash over `ProviderRecordKey` that must be calculated as `hash(SALT_DOUBLEHASH || ProviderRecordKey)`.
- **`Metadata`** is IPNI metadata that is supplied in Advertisements.
- **`EncMetadata`** is `Nonce || enc(deriveKey(ProviderRecordKey), Nonce, Metadata)`.
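
The hash-based definitions above can be sketched with Python's standard `hashlib`. This is an illustration, not a reference implementation. All function names are hypothetical. The multihash prefix for `HASH2` assumes that `0x56` is the multicodec code for `dbl-sha2-256`, and the nonce passphrase is assumed to be the same value that is passed to `deriveKey`. The `enc` step is left out because AES-GCM is not available in the Python standard library.

```python
import hashlib


def _salt(label: bytes) -> bytes:
    # 64-byte salt: the label padded with zero bytes.
    return label.ljust(64, b"\x00")


SALT_DOUBLEHASH = _salt(b"CR_DOUBLEHASH")
SALT_ENCRYPTIONKEY = _salt(b"CR_ENCRYPTIONKEY")
SALT_NONCE = _salt(b"CR_NONCE")

# Assumption: multicodec code for dbl-sha2-256.
DBL_SHA2_256_CODE = 0x56


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def hash2(mh: bytes) -> bytes:
    """HASH2 as a multihash: code byte, digest length byte, then the digest."""
    digest = sha256(SALT_DOUBLEHASH + mh)
    return bytes([DBL_SHA2_256_CODE, len(digest)]) + digest


def derive_key(passphrase: bytes) -> bytes:
    """deriveKey: 32-byte key, hash(SALT_ENCRYPTIONKEY || passphrase)."""
    return sha256(SALT_ENCRYPTIONKEY + passphrase)


def derive_nonce(passphrase: bytes, payload: bytes) -> bytes:
    """Deterministic 12-byte nonce with an 8-byte little-endian payload length."""
    length = len(payload).to_bytes(8, "little")
    return sha256(SALT_NONCE + passphrase + length + payload)[:12]


def provider_record_key(peer_id: bytes, context_id: bytes) -> bytes:
    """ProviderRecordKey: peerID || contextID."""
    return peer_id + context_id


def hash_provider_record_key(key: bytes) -> bytes:
    """HashProviderRecordKey: hash(SALT_DOUBLEHASH || ProviderRecordKey)."""
    return sha256(SALT_DOUBLEHASH + key)
```

Because the nonce depends only on the passphrase and the payload, encrypting the same pair twice yields the same ciphertext, which is what lets the server match records for deletion without decrypting them.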
## Specification

This specification improves Reader Privacy by changing Step 3, depicted above, where the Client
supplies the multihash (or CID) to IPNI in order to look up the corresponding Storage Providers.

* A Client who wants to do a lookup will calculate `HASH2` and use it for the lookup query;
* In response to that, IPNI will return a list of `EncProviderRecordKey`s that contain the encrypted `peerID` and `contextID` of the Storage Providers that have the content represented by the original `MH`.
  In order to make sense of that payload, a Passive Observer would need
  to get hold of the original `MH`, which isn't revealed during this communication round;
* Using the original `MH`, the Client decrypts the `EncProviderRecordKey`s and uses the `peerID` to fetch a `ProviderRecord`. `ProviderRecord`s can be cached on the Client side, so this roundtrip can be avoided most of the time.
  Peer addresses can also be discovered through alternative sources such as the libp2p peerstore;
* Using addresses from the `ProviderRecord`, the Client will reach out to the Storage Provider directly and fetch the desired content;
* The Client might choose to fetch IPNI Metadata, which will require another lookup round by `HashProviderRecordKey` to get `EncMetadata` in response.

With this scheme, only a party that knows the original CID, which is never revealed, can decode the exchange.

```mermaid
sequenceDiagram
    participant Client
    participant IPNI
    participant StorageProvider
    Client->>Client: calculates HASH2
    Client->>IPNI: sends a find request for HASH2
    IPNI->>Client: sends a list of [EncProviderRecordKey]
    loop EncProviderRecordKeys
        Client->>Client: decrypts EncProviderRecordKey and extracts peerID from it
        Client->>IPNI: [Optional] sends ProviderRecord lookup request for peerID
        IPNI->>Client: [Optional] sends a ProviderRecord
        Client->>IPNI: [Optional] sends EncMetadata lookup request for hash(ProviderRecordKey)
        IPNI->>Client: [Optional] sends EncMetadata
        Client->>Client: [Optional] decrypts the EncMetadata using ProviderRecordKey
    end
    Client->>StorageProvider: reaches out to the Storage Provider for the desired content
```
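
The "extracts peerID" step in the diagram can be sketched as two byte-level splits. The Python sketch below rests on two assumptions: that the `peerID` is a binary multihash (varint code, varint digest length, digest), and that the `contextID` is everything after it, which is one reading of the note that lengths are already encoded by the multihash format. All function names are hypothetical, and decryption itself is not shown.

```python
from typing import Tuple

NONCE_LENGTH = 12  # bytes, per the Nonce definition


def split_encrypted_value(value: bytes) -> Tuple[bytes, bytes]:
    """Split `Nonce || ciphertext` into its two parts."""
    if len(value) < NONCE_LENGTH:
        raise ValueError("value is shorter than a nonce")
    return value[:NONCE_LENGTH], value[NONCE_LENGTH:]


def read_uvarint(data: bytes, offset: int) -> Tuple[int, int]:
    """Decode an unsigned varint; return (value, offset after the varint)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, offset
        shift += 7


def split_provider_record_key(key: bytes) -> Tuple[bytes, bytes]:
    """Split a decrypted ProviderRecordKey into (peerID, contextID)."""
    _code, offset = read_uvarint(key, 0)
    digest_length, offset = read_uvarint(key, offset)
    end = offset + digest_length
    if end > len(key):
        raise ValueError("peerID multihash is truncated")
    return key[:end], key[end:]
```

A Client would call `split_encrypted_value` on each returned `EncProviderRecordKey`, decrypt the ciphertext with the key derived from the original `MH`, and then pass the plaintext to `split_provider_record_key`.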
### Extended Providers

[Extended Providers](https://github.com/ipni/specs/blob/main/IPNI.md#extendedprovider) allow a Publisher to add extra information to all their past and future Advertisements
or to a single Advertisement with a specific `ContextID`. That can be done by sending just a single Advertisement without having to re-publish the whole Advertisement chain.
If present, Extended Providers are applied to the IPNI output on the server, which results in more `ProviderRecord`s being returned to the user. The same will not be possible
for privacy-preserving lookups, as the required fields such as `PeerID` and `ContextID` are opaque to the server.

While the mechanics stay the same, applying Extended Providers to the decrypted values will have to be done on the Client side. If they exist, Extended Providers should be included
as a field in the `ProviderRecord`, which would make them cacheable too.

### Security

The security model of the Reader Privacy proposal boils down to the inability of a Passive Observer to *algorithmically* derive the original `MH` from
the `HASH2` that is used for IPNI lookups. IPNI Advertisements are authenticated but not encrypted, and contain plain multihash values.
Until Writer Privacy is implemented, a Passive Observer could build a map of `HASH2 -> MH`
by re-ingesting the Advertisement chain of each Publisher in order to collect all original multihashes, which can then be used to decrypt `EncProviderRecordKey`s and so on.
Doing that will require significant resources, as it involves crawling the entire network. Eliminating this requires a Writer Privacy upgrade.
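
To make the `HASH2 -> MH` map concrete, the following Python sketch shows how an observer who has already collected plain multihashes from Advertisements could reverse an observed lookup. The function names are hypothetical, and the crawling step is assumed to have happened elsewhere.

```python
import hashlib
from typing import Dict, Iterable, Optional

SALT_DOUBLEHASH = b"CR_DOUBLEHASH".ljust(64, b"\x00")


def hash2_digest(mh: bytes) -> bytes:
    """Digest part of HASH2: hash(SALT_DOUBLEHASH || MH)."""
    return hashlib.sha256(SALT_DOUBLEHASH + mh).digest()


def build_reverse_map(multihashes: Iterable[bytes]) -> Dict[bytes, bytes]:
    """Map every HASH2 digest back to the multihash it was computed from."""
    return {hash2_digest(mh): mh for mh in multihashes}


def resolve(observed_digest: bytes, reverse_map: Dict[bytes, bytes]) -> Optional[bytes]:
    """Return the original multihash if it was crawled, otherwise None."""
    return reverse_map.get(observed_digest)
```

The lookup is a dictionary hit, so the cost of the attack lies in collecting the multihashes, not in reversing the hash.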

Even with both Reader and Writer Privacy in place, a rogue IPNI actor might abuse this security model. For example:
* Someone wants to detect who is looking for a particular piece of content, i.e. surveilling content. For example, an IPNI endpoint that wants to know how
  frequently people are requesting some website it cares about;
* Someone wants to do mass surveillance on readily accessible data. For example, a group running an IPNI endpoint also runs web crawlers looking for IPFS links,
  or runs a public HTTP gateway and can log those requests.

Rogue IPNI behaviour will be addressed by IPNI Writer Privacy and a reputation system that is out of scope for this specification.

Reader Privacy is a first step towards a fully private content routing protocol.

Wider security implications are discussed in the [IPFS Reader Privacy specification](https://github.com/guillaumemichel/specs/blob/double-hashing-dht/IPIP/0373-double-hash-dht.md): TODO update the link once the PR is merged.

#### Hashing and Encryption Function Upgrades

All multihashes contain a codec. If a hashing or encryption function has to be rotated, different types of multihashes can coexist
and can be processed differently by IPNI implementations. It will not be possible to apply a fix retroactively to the data returned by previous lookup requests;
however, IPNI implementations can start blocking all new requests that use a compromised scheme, allowing for some transition period.

Moving an IPNI implementation to a new hash / encryption function will require reingesting all data from the beginning of its publication. Before Writer Privacy is implemented, the
index can be migrated to new functions by reingesting existing Advertisement chains. With Writer Privacy, Publishers will have to republish Advertisements
using the new functions (as the data in the Advertisements themselves will have to be re-hashed / re-encrypted). Both the old and new schemes can coexist for some time.
The old scheme should be retired either immediately or once the indexes have been rebuilt and the users have been migrated over.

The exact operational procedure will differ between IPNI implementations.

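
One way an implementation might let schemes coexist is to dispatch on the multihash code of the lookup key and refuse codes that have been retired. The Python sketch below is only an illustration: `route_lookup`, `UnsupportedScheme`, the handler table, and the blocked set are all hypothetical, and `0x56` is assumed to be the multicodec code for `dbl-sha2-256`.

```python
from typing import Callable, Dict, Set, Tuple

DBL_SHA2_256 = 0x56  # assumption: multicodec code for dbl-sha2-256


class UnsupportedScheme(Exception):
    """Raised when a lookup key uses an unknown or blocked hashing scheme."""


def read_uvarint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode an unsigned varint; return (value, offset after the varint)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, offset
        shift += 7


def route_lookup(key: bytes,
                 handlers: Dict[int, Callable[[bytes], object]],
                 blocked: Set[int]) -> object:
    """Pick a lookup handler from the multihash code of the key."""
    code, offset = read_uvarint(key)
    if code in blocked:
        raise UnsupportedScheme("scheme 0x%x is no longer accepted" % code)
    handler = handlers.get(code)
    if handler is None:
        raise UnsupportedScheme("scheme 0x%x is not supported" % code)
    length, offset = read_uvarint(key, offset)
    return handler(key[offset:offset + length])
```

During a transition period, both the old and the new code would appear in `handlers`; retiring the old scheme then amounts to adding its code to `blocked`.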
### Trade Offs

* **Multiple lookups**. In the best case, the Reader Privacy protocol will require one roundtrip to get a list of peers for a given CID.
  In the worst case, when both `ProviderRecord`s and `Metadata` need to be fetched from IPNI, it will require 3+ lookups.

* **Extra compute**. At minimum, Clients must perform an extra hash computation per CID and a decryption per `ProviderRecordKey`, which will add
  some overhead to each lookup.

* **Extra storage space**. Storing encrypted data will require more space due to padding, the nonce, and the addition of an encrypted provider key per multihash.

* **Bulk deletes**. The encrypted `PeerID` will be different for each multihash, and hence bulk delete operations (delete everything for a provider X) will not be possible. Such deletion will require a garbage collection mechanism that rereads deleted Advertisements and deletes the `HASH2` for all multihashes.

* **Operational overhead**. Reader Privacy rollout will be a gradual process, as many clients will have to migrate over. During the transition period
  IPNI implementations will have to serve both plain and hashed lookups. That will involve serving hashed and regular queries from the same instances. Servers will have to do decryption on behalf of their clients using the plain multihash that has been provided in the lookup request.

* **Data Migration**. Existing indexes will have to re-ingest all index content that they want to provide Reader Privacy for.
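
As a rough illustration of the storage overhead, the Python sketch below estimates the size of one `Nonce || enc(...)` value. It assumes AES-GCM with the common 16-byte authentication tag and counts only the nonce and the tag; any padding added by a particular storage layer is not modelled. The names and the 70-byte example payload are illustrative.

```python
NONCE_LENGTH = 12    # bytes, per the Nonce definition
GCM_TAG_LENGTH = 16  # assumption: standard 128-bit AES-GCM authentication tag


def encrypted_size(payload_length: int) -> int:
    """Size in bytes of `Nonce || ciphertext || tag` for a given payload length."""
    return NONCE_LENGTH + payload_length + GCM_TAG_LENGTH


def overhead(payload_length: int) -> int:
    """Extra bytes stored compared to the plain payload."""
    return encrypted_size(payload_length) - payload_length


# Example: a hypothetical 70-byte ProviderRecordKey.
example_total = encrypted_size(70)
```

Under these assumptions, the overhead is a constant 28 bytes per encrypted value, and it applies once per multihash.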
### Threat Modelling

There are three actors involved in the IPNI workflow: Publisher, Client and IPNI. Publishers make updates to indexes by publishing Advertisements.
Advertisements are signed by their Publishers and can be authenticated. Advertisements are organised in a chain and are ingested strictly in order.
It's not possible to reorder Advertisements without having to fork the chain. Advertisement processing is idempotent: re-ingesting the same Advertisement twice
doesn't affect IPNI state. The IPNI specification is agnostic to transport protocols, so the particular protocol choice is up to the implementation.
A compromised Publisher identity is out of scope of this specification.

Clients consume the index by performing CID (multihash) lookups. Additional hashing and encryption aim to prevent a Passive Observer
from being able to infer what data is being looked up by spying on the Client-to-IPNI traffic. Without a Writer Privacy solution, a malicious indexer that keeps a map of `HASH2 -> MH` could expose a client. Therefore the communication protocol between the Client and IPNI should be chosen carefully to prevent MITM attacks.

`EncProviderRecordKey`s do not contain any authentication data. It is possible for a malicious IPNI to
present a wrong dataset to the Client. Clients can tackle that by excluding such IPNIs from their pool. Returning wrong datasets will
eventually affect the IPNI's reputation score once a reputation tracking system is available. Data integrity is built into IPFS: Clients
can verify that the data returned by a Storage Provider matches the CID. So even if an IPNI is compromised, the data itself is not compromised.

### Data Formats

All binary data must be b58 encoded when transferred over the wire.
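
For illustration, a small base58 encoder and decoder in Python is sketched below. It assumes that "b58" means the Bitcoin alphabet (often called base58btc), which this section does not state explicitly; the function names are illustrative.

```python
# Assumption: the Bitcoin base58 alphabet (no 0, O, I or l).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as base58; each leading zero byte becomes a leading '1'."""
    zeros = len(data) - len(data.lstrip(b"\x00"))
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = ALPHABET[remainder] + encoded
    return "1" * zeros + encoded


def b58decode(text: str) -> bytes:
    """Decode a base58 string back into bytes."""
    zeros = len(text) - len(text.lstrip("1"))
    number = 0
    for char in text:
        number = number * 58 + ALPHABET.index(char)
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body
```

A Client would apply `b58encode` to values such as `HASH2` before placing them in a request and `b58decode` to binary fields in the response.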
## Related Resources

TODO: link to corresponding IPFS spec once materialised.

* [Double Hashing and Content Routing](https://youtu.be/ZPIDU1-JnVc)
* [Double Hashing as a way to increase reader privacy](https://youtu.be/VBlx-VvIZqU)
* [Deployment and transition options of Double Hashing](https://youtu.be/m-6_VZ8e1tk)

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
Is there anything preventing content providers from publishing countless dummy provider records to exhaust the IPNI?
Great question. The protocol itself doesn't enforce any restrictions on how many records can be published in a period of time. It's up to the implementation to implement rate limiting and block abusive storage providers.