Add reader's privacy specification #5

Merged
merged 49 commits into from
Mar 19, 2024
1fed637
Add reader's privacy specification
ischasny Dec 9, 2022
abe8237
Fixed spacing
ischasny Dec 9, 2022
b84ea7a
Fixed spacing
ischasny Dec 9, 2022
9c3c06e
Update reader-privacy.md
ischasny Dec 9, 2022
8565242
Update reader-privacy.md
ischasny Dec 9, 2022
02f305b
Update reader-privacy.md
ischasny Dec 9, 2022
d184ef1
Update reader-privacy.md
ischasny Dec 9, 2022
f070287
Update reader-privacy.md
ischasny Dec 9, 2022
7cd7d9c
Update reader-privacy.md
ischasny Dec 9, 2022
50dc928
Update reader-privacy.md
ischasny Dec 9, 2022
06889cd
Update reader-privacy.md
ischasny Dec 9, 2022
0ea3bf8
Update reader-privacy.md
ischasny Dec 9, 2022
1d39505
Update reader-privacy.md
ischasny Dec 9, 2022
14354e2
Update reader-privacy.md
ischasny Dec 9, 2022
efadf02
Update reader-privacy.md
ischasny Dec 9, 2022
7b8b876
Update reader-privacy.md
ischasny Dec 9, 2022
4cd1705
Update reader-privacy.md
ischasny Dec 9, 2022
fc9f954
Update reader-privacy.md
ischasny Dec 9, 2022
e848b3c
Add mermaid diagram
ischasny Dec 9, 2022
7242649
Add mermaid diagram
ischasny Dec 9, 2022
e4073be
Update double-hashing spec
ischasny Dec 12, 2022
3817042
Add trade offs
ischasny Dec 14, 2022
84ece7a
Add threat modeling section
ischasny Dec 15, 2022
0c80a2d
Add threat modeling section
ischasny Dec 15, 2022
aa818cc
Update reader-privacy.md
ischasny Dec 16, 2022
7c16595
Update reader privacy spec
ischasny Dec 16, 2022
34968ef
Fix definition of ProviderRecord
ischasny Jan 3, 2023
cff4521
Update reader-privacy.md
ischasny Jan 3, 2023
8d5d602
Minor fixes
ischasny Jan 4, 2023
c9cc59f
Add reader privacy implementation details
ischasny Jan 17, 2023
8b74305
Add reader privacy implementation details
ischasny Jan 17, 2023
3ecf6a8
Formatting
ischasny Jan 17, 2023
51f7b50
Formatting
ischasny Jan 17, 2023
9a97942
Formatting
ischasny Jan 17, 2023
293ab38
Move addendum to the main spec
ischasny Jan 18, 2023
7e6f3b9
Small update to the spec
ischasny Jan 24, 2023
b28fc7e
Remove few redundant sentences
ischasny Jan 27, 2023
532e92d
Remove few redundant sentences
ischasny Jan 27, 2023
3f4e381
Tidy up grammar
ischasny Jan 27, 2023
40d0f6c
Add Extended Providers and more security considerations
ischasny Jan 30, 2023
fe8f9b9
Remove CR_HASH prefix that is not used
ischasny Jan 30, 2023
95ddcbf
Remove ProviderRecord encryption
ischasny Jan 31, 2023
51a2f02
Add info about libp2p peerstore
ischasny Jan 31, 2023
600b664
Fix typo
ischasny Feb 14, 2023
891c9f5
Align with the DHT specification
ischasny Mar 3, 2023
453f531
Spacing
ischasny Mar 3, 2023
033307c
Fix typos
ischasny Mar 3, 2023
896ff3d
Update text, links, etc.
gammazero Mar 19, 2024
7090ce0
Update storage space overhead
gammazero Mar 19, 2024
244 changes: 244 additions & 0 deletions reader-privacy.md
@@ -0,0 +1,244 @@
# Reader Privacy Preservation
![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square)

**Author(s)**:
<!-- keep names alphabetically sorted -->
- [Andrew Gillis](https://github.com/gammazero)
- [Ivan Schasny](https://github.com/ischasny)
- [Masih Derkani](https://github.com/masih)
- [Will Scott](https://github.com/willscott)

**Maintainer(s)**:
- [Andrew Gillis](https://github.com/gammazero)
- [Ivan Schasny](https://github.com/ischasny)
- [Masih Derkani](https://github.com/masih)
- [Will Scott](https://github.com/willscott)
* * *
**Abstract**

IPNI nodes can observe what data clients access through the lookup APIs they provide.
This is true regardless of whether the data itself is public or not. Because IPNI nodes continuously
catalogue the content hosted by all providers and expose a central lookup API, the need for
reader privacy is amplified. This makes IPNI a difficult choice as an alternative routing system in
projects such as IPFS, which use a more decentralised routing system that by nature reduces the
possibility of mass query snooping.
Building on IPFS's reader privacy mechanism, also known as double-hashing, this document specifies how
a similar technique is applied to IPNI in order to preserve the reader's privacy while continuing to
facilitate low-latency provider lookup.
## Table of Contents

- [Introduction](#introduction)
- [Background](#background)
- [Specification](#specification)
- [Extended Providers](#extended-providers)
- [Security](#security)
- [Hashing and Encryption Function Upgrades](#hashing-and-encryption-function-upgrades)
- [Trade Offs](#trade-offs)
- [Threat Modelling](#threat-modelling)
- [Implementation](#implementation)
- [Hashing](#hashing)
- [Encryption](#encryption)
- [Data Formats](#data-formats)
- [Related Resources](#related-resources)

For more technical implementation details please see the [Addendum](reader-privacy-addendum.md).
## Introduction
IPFS requires various privacy protections, including privacy for the content routing subsystem. Readers
(clients accessing files) need privacy for the content they consume, and writers (hosts storing and
distributing content) need privacy for the content they publish. A content router node or
a passive observer can easily learn which file is requested by which client during the routing process, because
the requested `CID` is visible to them. A snooping actor could request the same `CID`
and download the associated file to monitor the user’s behavior. This is undesirable, and the community has
been asking for a fix for some time.

This specification describes how IPNI provides Reader Privacy. It prevents passive observers from tracking a user's
actions as described above.

### Non Goals

* Writer Privacy, i.e. privacy for content providers or publishers, which will be covered in a separate specification;
* Retrieval Privacy, which is out of scope for the content routing subsystem;
* Rogue IPNI behaviour (explained in the [Security](#security) section).
## Background
IPNI builds its indexes by ingesting chains of Advertisements. An Advertisement is a
construct that allows Storage Providers to publish their CIDs in bulk (FIL deals) instead of doing
so individually for each CID. A group of CIDs is represented by a ContextID that is unique per provider, as shown
in the diagram below:
![Index building flow](resources/readers-privacy-1.png)

## Magic Values

All salts below are 64 bytes long and represent a string padded with `\x00`.

- `SALT_DOUBLEHASH = bytes("CR_DOUBLEHASH\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")`
- `SALT_ENCRYPTIONKEY = bytes("CR_ENCRYPTIONKEY\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")`
- `SALT_NONCE = bytes("CR_NONCE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00")`
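
As a minimal sketch, the salts can be built programmatically rather than typed out by hand. This assumes each literal above is exactly the ASCII label right-padded with `\x00` to 64 bytes, as the preceding sentence states; the helper name `make_salt` is hypothetical.

```python
SALT_LEN = 64  # all salts are 64 bytes long


def make_salt(label: str) -> bytes:
    """Return the ASCII label right-padded with NUL bytes to SALT_LEN bytes."""
    raw = label.encode("ascii")
    if len(raw) > SALT_LEN:
        raise ValueError("label is longer than the salt length")
    return raw.ljust(SALT_LEN, b"\x00")


SALT_DOUBLEHASH = make_salt("CR_DOUBLEHASH")
SALT_ENCRYPTIONKEY = make_salt("CR_ENCRYPTIONKEY")
SALT_NONCE = make_salt("CR_NONCE")
```

Building the salts this way avoids miscounting padding bytes when copying the literals.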


## Definitions

- **Advertisement** is [IPNI Advertisement](https://github.com/ipni/go-libipni/blob/main/ingest/schema/schema.ipldsch#L40).
- **Storage Provider** is a party who stores the data and wants that data to be discoverable through IPNI.
- **Publisher** is a party who publishes multihashes, via Advertisements, to IPNI on behalf of a Storage Provider.
- **Client** is a party who wants to find the content by its CID using IPNI for the purpose of retrieving it from the Storage Provider.
- **Passive Observer** is a snooping party that wants to understand what content is being looked up by observing Client-to-IPNI traffic.
- **`enc`** is [AESGCM](https://en.wikipedia.org/wiki/Galois/Counter_Mode) encryption. The following notation will be used for the rest of the specification `enc(passphrase, nonce, payload)`.
- **`hash`** is [SHA256](https://en.wikipedia.org/wiki/SHA-2) hashing.
- **`||`** is concatenation of two values.
- **`deriveKey`** is deriving a 32-byte encryption key from a passphrase that is done as `hash(SALT_ENCRYPTIONKEY || passphrase)`.
- **`Nonce`** is a 12-byte nonce used as the Initialization Vector (IV) for the AESGCM encryption. IPNI expects an explicit instruction to delete a record (unlike the DHT, where records expire).
Hence the IPNI server needs to be able to compare encrypted values without decrypting them, as decryption would require a key that it does not have.
That means that the nonce has to be chosen deterministically so that `enc(passphrase, nonce, payload)` produces the same output for the same
`passphrase` + `payload` pair. Nonce must be calculated as `hash(SALT_NONCE || passphrase || len(payload) || payload)[:12]`, where `len(payload)` is
the length of the `payload` encoded as an 8-byte Little Endian integer. Choice of nonce is not enforced by the IPNI specification. The described approach will be used while IPNI encrypts Advertisements on behalf of Publishers. However, once Writer Privacy is implemented, the choice of nonce will be left up to the Publisher.
- **`CID`** is the [Content IDentifier](https://github.com/multiformats/cid).
- **`MH`** is the [Multihash](https://github.com/multiformats/multihash) contained in a `CID`. It corresponds to the digest of a hash function over some content. `MH` is represented as a 32-byte array.
- **`HASH2`** is a second hash over the multihash. Second Hashes must be of `Multihash` format with `DBL_SHA_256` codec. The digest must be calculated as `hash(SALT_DOUBLEHASH || MH)`.
- **`ProviderRecord`** is a data structure that contains information about a Storage Provider, such as their PeerID and Addresses.
- **`ProviderRecordKey`** is a concatenation of `peerID || contextID`. There is no need for explicitly encoding lengths as they are already encoded as a part of the multihash format.
- **`EncProviderRecordKey`** is `Nonce || enc(deriveKey(multihash), Nonce, ProviderRecordKey)`.
- **`HashProviderRecordKey`** is a hash over `ProviderRecordKey` that must be calculated as `hash(SALT_DOUBLEHASH || ProviderRecordKey)`.
- **`Metadata`** is IPNI metadata that is supplied in Advertisements.
- **`EncMetadata`** is `Nonce || enc(deriveKey(ProviderRecordKey), Nonce, Metadata)`.
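
The hash-based definitions above can be sketched with Python's standard `hashlib`. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and the multihash prefix for `HASH2` assumes the multicodec code `0x56` (`dbl-sha2-256`) followed by a one-byte digest length. AESGCM encryption itself is left out because it is not part of the Python standard library.

```python
import hashlib


def _salt(label: str) -> bytes:
    # Salts are the ASCII label padded with NUL bytes to 64 bytes.
    return label.encode("ascii").ljust(64, b"\x00")


SALT_DOUBLEHASH = _salt("CR_DOUBLEHASH")
SALT_ENCRYPTIONKEY = _salt("CR_ENCRYPTIONKEY")
SALT_NONCE = _salt("CR_NONCE")

# Assumption: multicodec code for dbl-sha2-256.
DBL_SHA2_256 = 0x56


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def derive_key(passphrase: bytes) -> bytes:
    """deriveKey: hash(SALT_ENCRYPTIONKEY || passphrase), a 32-byte key."""
    return sha256(SALT_ENCRYPTIONKEY + passphrase)


def derive_nonce(passphrase: bytes, payload: bytes) -> bytes:
    """Nonce: hash(SALT_NONCE || passphrase || len(payload) || payload)[:12]."""
    length = len(payload).to_bytes(8, "little")  # 8-byte Little Endian length
    return sha256(SALT_NONCE + passphrase + length + payload)[:12]


def hash2(mh: bytes) -> bytes:
    """HASH2: multihash wrapping hash(SALT_DOUBLEHASH || MH)."""
    digest = sha256(SALT_DOUBLEHASH + mh)
    return bytes([DBL_SHA2_256, len(digest)]) + digest


def hash_provider_record_key(provider_record_key: bytes) -> bytes:
    """HashProviderRecordKey: hash(SALT_DOUBLEHASH || ProviderRecordKey)."""
    return sha256(SALT_DOUBLEHASH + provider_record_key)
```

Because the nonce depends only on the passphrase and the payload, encrypting the same pair twice yields the same ciphertext, which is what lets the server match records for deletion without decrypting them.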

## Specification
This specification improves Reader Privacy by changing Step 3, depicted above, where the Client
supplies the multihash (or CID) to IPNI in order to look up the corresponding Storage Providers.

* A Client who wants to do a lookup will calculate `HASH2` and use it for the lookup query;
* In response, IPNI will return a list of `EncProviderRecordKey`s that contain the encrypted `peerID` and `contextID` of the Storage Providers that have the content represented by the original `MH`.
To make sense of that payload, a Passive Observer would need
to get hold of the original `MH`, which isn't revealed during this communication round;
* Using the original `MH`, the Client decrypts the `EncProviderRecordKey`s and uses the `peerID` to fetch a `ProviderRecord`. `ProviderRecord`s can be cached on the Client side, so this round trip can be avoided most of the time.
Peer addresses can also be discovered through alternative sources such as the libp2p peerstore;
* Using addresses from the `ProviderRecord` the Client will reach out to the Storage Provider directly and fetch the desired content;
* The Client might choose to fetch IPNI Metadata that will require another lookup round by `HashProviderRecordKey` to get `EncMetadata` in response.

With this scheme, only a party that knows the original CID, which is never revealed,
can decode the responses.

```mermaid
sequenceDiagram
participant Client
participant IPNI
participant StorageProvider
Client->>Client: calculates HASH2
Client->>IPNI: sends a find request for HASH2
IPNI->>Client: sends a list of [EncProviderRecordKey]
loop EncProviderRecordKeys
Client->>Client: decrypts EncProviderRecordKey and extracts peerID from it
Client->>IPNI: [Optional] sends ProviderRecord lookup request for peerID
IPNI->>Client: [Optional] sends a ProviderRecord
Client->>IPNI: [Optional] sends EncMetadata lookup request for hash(ProviderRecordKey)
IPNI->>Client: [Optional] sends EncMetadata
Client->>Client: [Optional] decrypts the EncMetadata using ProviderRecordKey
end
Client->>StorageProvider: reaches out to the Storage Provider for the desired content
```
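
The first two steps of the flow can be illustrated with a toy in-memory index. Everything here is a hypothetical stand-in: `ToyIndex` is not an IPNI API, the stored blob is a dummy value rather than real AESGCM output, and only the raw `HASH2` digest is used as the lookup key. The sketch shows that the server side only ever handles `HASH2` values and opaque blobs, and that the Client splits each `EncProviderRecordKey` into its 12-byte `Nonce` and the ciphertext.

```python
import hashlib

SALT_DOUBLEHASH = b"CR_DOUBLEHASH".ljust(64, b"\x00")
NONCE_LEN = 12


def hash2_digest(mh: bytes) -> bytes:
    """Digest part of HASH2: hash(SALT_DOUBLEHASH || MH)."""
    return hashlib.sha256(SALT_DOUBLEHASH + mh).digest()


class ToyIndex:
    """Hypothetical stand-in for an IPNI node; it never sees the original MH."""

    def __init__(self):
        self._records = {}

    def put(self, h2: bytes, enc_provider_record_key: bytes) -> None:
        self._records.setdefault(h2, []).append(enc_provider_record_key)

    def find(self, h2: bytes) -> list:
        return list(self._records.get(h2, []))


def split_enc_record(blob: bytes) -> tuple:
    """Split `Nonce || ciphertext` into its two parts."""
    if len(blob) < NONCE_LEN:
        raise ValueError("blob is shorter than the nonce")
    return blob[:NONCE_LEN], blob[NONCE_LEN:]


# A sha2-256 multihash: code 0x12, length 0x20, then the digest.
mh = b"\x12\x20" + hashlib.sha256(b"some content").digest()

index = ToyIndex()
index.put(hash2_digest(mh), b"N" * NONCE_LEN + b"opaque-ciphertext")

# The Client queries by HASH2 only, then splits what comes back.
results = index.find(hash2_digest(mh))
nonce, ciphertext = split_enc_record(results[0])
```

Decrypting `ciphertext` would then use `deriveKey(MH)` with the extracted nonce, which requires an AESGCM implementation.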

### Extended Providers

[Extended Providers](https://github.com/ipni/specs/blob/main/IPNI.md#extendedprovider) allow a Publisher to add extra information to all their past and future Advertisements
or to a single Advertisement with a specific `ContextID`. That can be done by sending just a single Advertisement without having to re-publish the whole Advertisement chain.
If present, Extended Providers are applied to the IPNI output on the server, which results in more `ProviderRecord`s being returned to the user. The same is not possible
for privacy-preserving lookups, as the required fields such as `PeerID` and `ContextID` are opaque to the server.

While the mechanics stay the same, Extended Providers have to be applied to the decrypted values on the Client side. If they exist, Extended Providers should be included
as a field in the `ProviderRecord`, which makes them cacheable too.

### Security
The security model of the Reader Privacy proposal boils down to the inability of a Passive Observer to *algorithmically* derive the original `MH` from
the `HASH2` that is used for IPNI lookups. IPNI Advertisements are not encrypted, but they are authenticated and contain plain multihash values.
Until Writer Privacy is implemented, a Passive Observer could build a map of `HASH2 -> MH`
by re-ingesting the Advertisement chain from each Publisher in order to collect all original multihashes, which can then be used to decrypt `EncProviderRecord`s and so on.
Doing that requires significant resources as it involves crawling the entire network. Eliminating this requires a Writer Privacy upgrade.

Even with both Reader and Writer Privacies in place a rogue IPNI actor might abuse this security model. For example:
* Someone wants to detect who is looking for a particular piece of content, i.e. surveilling content. For example, an IPNI endpoint that wants to know how
frequently people are requesting some website it cares about;
* Someone wants to do mass surveillance on readily accessible data. For example, a group running an IPNI endpoint also runs web crawlers looking for IPFS links,
or runs a public HTTP gateway and can log those requests.

Rogue IPNI behaviour will be addressed by IPNI writer privacy and reputation system that is out of scope for this specification.

Reader Privacy is a first step towards a fully private content routing protocol.

Wider security implications are discussed in the [IPFS Reader Privacy specification](https://github.com/guillaumemichel/specs/blob/double-hashing-dht/IPIP/0373-double-hash-dht.md): TODO update the link once the PR is merged.

#### Hashing and Encryption Function Upgrades

All multihashes contain a codec. If a hashing or encryption function has to be rotated, then different types of multihashes can coexist
and can be processed differently by IPNI implementations. It will not be possible to apply a fix retroactively to the data returned by previous lookup requests;
however, IPNI implementations can start blocking all new requests that use a compromised scheme, allowing for some transition period.

Moving an IPNI implementation to a new hash / encryption function will require reingesting all data from the beginning of its publication. Before Writer Privacy is implemented, the
index can be migrated over to new functions by reingesting existing Advertisement chains. With Writer Privacy, Publishers will have to republish Advertisements
using new functions (as the data in the Advertisements themselves will have to be re-hashed / re-encrypted). Both the old and the new scheme can coexist for some time.
The old scheme should be retired either immediately or once the indexes have been rebuilt and the users have been migrated over.

The exact operational procedure will differ between IPNI implementations.
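
One way the codec-based coexistence described above could look on the server side is sketched below. This is a hypothetical illustration: the function names, the returned labels, and the `BLOCKED_CODES` set are not part of the specification. It assumes the multihash code is an unsigned varint prefix, with `0x12` for `sha2-256` and `0x56` for `dbl-sha2-256`.

```python
SHA2_256 = 0x12      # plain sha2-256 multihash
DBL_SHA2_256 = 0x56  # assumed code used for HASH2 lookups

# Hypothetical set of codes an operator has decided to stop serving.
BLOCKED_CODES = set()


def read_uvarint(buf: bytes) -> tuple:
    """Decode an unsigned varint; return (value, number of bytes consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def classify_lookup(multihash: bytes) -> str:
    """Route a lookup by the hash function code at the start of the multihash."""
    code, _ = read_uvarint(multihash)
    if code in BLOCKED_CODES:
        return "rejected"
    if code == DBL_SHA2_256:
        return "reader-private"
    return "plain"
```

Retiring a compromised scheme then amounts to adding its code to the blocked set once the transition period ends.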

### Trade Offs

* **Multiple lookups**. In the best case, the Reader Privacy protocol requires one round trip to get a list of peers for a given CID.
In the worst case, when both `ProviderRecord`s and `Metadata` need to be fetched from IPNI, it requires 3+ lookups.

* **Extra compute**. At minimum, Clients must perform an extra hash computation per CID and decryption per `ProviderRecordKey` that will add
some overhead to each lookup.

* **Extra storage space**. Storing encrypted data will require more space due to padding, nonce, and the addition of an encrypted provider key per multihash;

* **Bulk deletes**. Encrypted `PeerID` will be different for each multihash and hence bulk delete operations (delete everything for a provider X) will not be possible. Such deletion will require a garbage collection mechanism that rereads deleted advertisements and deletes the HASH2 for all multihashes.

* **Operational overhead**. Reader Privacy rollout will be a gradual process as many clients will have to migrate over. During the transition period
IPNI implementations will have to serve both plain and hashed lookups. That will involve serving hashed and regular queries from the same instances. Servers will have to do decryption on behalf of their clients using the plain multihash provided in the lookup request;

* **Data Migration**. Existing indexes will have to re-ingest all index content that they want to provide Reader Privacy for.

### Threat Modelling
Review comment (Contributor): Is there anything preventing content providers from publishing countless dummy provider records to exhaust the IPNI?

Reply (Contributor Author): Great question. The protocol itself doesn't enforce any restrictions on how many records can be published in a period of time. It's up to the implementation to implement rate limiting and block abusive storage providers.


There are three actors involved in the IPNI workflow: Publisher, Client and IPNI. Publishers update indexes by publishing Advertisements.
Advertisements are signed by their Publishers and can be authenticated. Advertisements are organised in a chain and are ingested strictly in order.
It's not possible to reorder Advertisements without forking the chain. Advertisement processing is idempotent: re-ingesting the same Advertisement twice
doesn't affect IPNI state. The IPNI specification is agnostic to transport protocols, so the particular protocol choice is up to the implementation.
A compromised Publisher identity is out of scope of this specification.

Clients consume the index by performing CID (multihash) lookups. Additional hashing and encryption aims to prevent a Passive Observer
from inferring what data is being looked up by spying on the Client-to-IPNI traffic. Without a writer privacy solution, a malicious indexer that keeps a map of HASH2->MH could expose a client. Therefore the communication protocol between the client and IPNI should be chosen carefully to prevent MITM attacks.

`EncProviderRecord`s do not contain any authentication data. It is possible for a malicious IPNI to
present a wrong dataset to the Client. Clients can tackle that by excluding such IPNIs from their pool. Returning wrong datasets will
eventually affect the IPNI's reputation score when a reputation tracking system is available. Data integrity is built into IPFS - Clients
can verify that the data returned by a Storage Provider matches the CID. So even if an IPNI is compromised the data itself is not compromised.

### Data Formats

All binary data must be b58 encoded when transferred over the wire.
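
For illustration, a small standard-library-only base58 codec is sketched below. It assumes "b58" means the Bitcoin alphabet (base58btc), which the specification does not state explicitly; the function names are hypothetical.

```python
# Assumption: the Bitcoin (base58btc) alphabet, which omits 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as base58; each leading zero byte becomes a leading '1'."""
    zeros = len(data) - len(data.lstrip(b"\x00"))
    num = int.from_bytes(data, "big")
    out = ""
    while num > 0:
        num, rem = divmod(num, 58)
        out = ALPHABET[rem] + out
    return "1" * zeros + out


def b58decode(text: str) -> bytes:
    """Decode a base58 string back into bytes."""
    zeros = len(text) - len(text.lstrip("1"))
    num = 0
    for ch in text:
        num = num * 58 + ALPHABET.index(ch)  # raises ValueError on bad input
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body
```

A `HASH2` value or an `EncProviderRecordKey` would be passed through `b58encode` before being placed in a request or response and through `b58decode` on receipt.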

## Related Resources
TODO: link to corresponding IPFS spec once materialised.
* [Double Hashing and Content Routing](https://youtu.be/ZPIDU1-JnVc)
* [Double Hashing as a way to increase reader privacy](https://youtu.be/VBlx-VvIZqU)
* [Deployment and transition options of Double Hashing](https://youtu.be/m-6_VZ8e1tk)
## Copyright
Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
Binary file added resources/readers-privacy-1.png