
Call for Participants/Collaborators for data.gov Sprint #107

Open · flyingzumwalt opened this issue Jan 16, 2017 · 2 comments
flyingzumwalt commented Jan 16, 2017

What We're Doing

The IPFS team is working on an experiment with the Stanford University Libraries. This work is starting immediately, and we're looking for collaborators to participate in the experiment. The goal is to download all of data.gov (~350TB), add the data to IPFS nodes at Stanford, replicate it onto IPFS nodes at multiple collaborating institutions, and, through IPFS, allow anyone in the world to hold copies of the parts of data.gov they care about.
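To make that workflow concrete, here is a rough sketch of the add/replicate cycle with the ipfs CLI. The directory name and hashes below are placeholders for illustration, not real data.gov content:

```sh
# On the seeding node (e.g. at Stanford): add the harvested datasets to IPFS.
ipfs add -r datagov-mirror/
# ... added QmExampleRootHash datagov-mirror   (placeholder hash)

# On a collaborating node: replicate the whole corpus by pinning the root hash.
ipfs pin add QmExampleRootHash

# Anyone on the IPFS network can then fetch just the pieces they care about by hash.
ipfs get QmExampleRootHash/some-dataset
```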

Our objectives:

  • Capture these datasets so that they can be archived/preserved and redistributed, reinforcing the existing efforts of organizations like Internet Archive, EDGI and Data Refuge
  • Test/Prove the viability of decentralized approaches to storing, replicating and serving datasets like these
  • Provide a reference point for conversations about the role of Libraries, Archives and Museums in the decentralized web

For detailed information about the work plan, see the GitHub issues for the work sprint and the main "Epic": Replicate 350 TB of Data Between 3 Peers (and then the World). (Note: These GitHub issues are subject to change.)

Who We're Looking to Collaborate With

Institutional Collaborators

We are looking to collaborate with institutions who are able to allocate 300+ TB of network-available storage on short notice. Ideal collaborators would be institutions with data archivists on staff, or organizations who are familiar with the efforts to archive data.gov.

Individual or Private Participants

When we've finished the first round of tightly-coordinated tests, we will make the data available on the general IPFS network. That will be a great opportunity for everyone to help replicate the data and help us improve the experience of using & running an ipfs node.

Our Timeline

We are beginning work on this project immediately. The IPFS team has allocated major resources for a two-week sprint (16-27 January). After that sprint, community efforts and conversations will continue, but the IPFS engineers will turn their focus to other areas for the remainder of Q1.

How to Get Involved

To get involved, comment on this issue or contact @flyingzumwalt directly at matt at protocol dot ai.

Q & A

What does an Institutional Collaborator need to provide?

How much storage do we need to allocate?

UPDATE: Based on initial crawls of the first 3000 datasets, we might need far less storage than we initially estimated. The total corpus of data.gov datasets might be less than 50TB, or even less than 10TB, but the actual numbers are difficult to estimate until we finish crawling all 192,000 datasets. However, we have identified other big datasets to replicate in addition to data.gov.

If the new estimates are true, then collaborators would be able to allocate far less than 300TB in order to participate. Note, however, that you might want to use spare storage to store redundant copies of the data or to store other datasets from other harvesting initiatives.

Original answer: Ideally institutional collaborators should allocate enough storage to hold the entire corpus of datasets. Our current estimate is 350TB.

What if we can't get that much storage right away?
Our first rounds of replication will be 5TB, 10TB, 50TB, 100TB, etc., so you could participate at those volumes even if you don't have 300TB available yet.

It will also be possible to “pin” specific datasets or subsets of the whole collection.
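For example, a node that only wants one dataset could pin just that sub-path of the corpus. The hash and path here are placeholders:

```sh
# Pin a single dataset (a sub-path of the corpus root) instead of everything.
ipfs pin add QmExampleRootHash/noaa/some-dataset

# See what this node is currently pinning.
ipfs pin ls --type=recursive
```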

Do our machines need public IP addresses?

No. You don't need a public IP address. IPFS relies on peer-to-peer TCP connections. As long as your machines are able to connect to the internet, our engineers will probably be able to help you connect your IPFS nodes to the other participating nodes. If you want/need help, create an issue in the ipfs/archives repository and we will help you out.
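If a node does need a manual nudge, the connection step might look roughly like this (the address and peer ID are placeholders):

```sh
# Check which peers this node is already connected to.
ipfs swarm peers

# Manually dial another participating node if automatic discovery doesn't find it.
ipfs swarm connect /ip4/203.0.113.10/tcp/4001/ipfs/QmExamplePeerID
```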

What kind of bandwidth will we need?

The more the better.

When do we need to make the storage available?

The bulk of the tests & replication work will happen next week (23-27 January) and will continue after that.

Why IPFS?

There are a number of benefits to creating decentralized archives with IPFS. For example:

  • IPFS is content-addressed, so you get the benefits of checksums and content-versioning automatically
  • It's easy to replicate data across any number of peers, using an approach that scales to tens of millions of nodes
  • Peers can choose to "pin" subsets of a dataset (i.e. just one file, or one set of files, from a larger corpus)
  • Efficient de-duplication of blocks -- if I add two slightly different versions of a file, I only store the union of the two versions' content, not two full copies of the file (see the sketch after this list).
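A rough sketch of what that looks like in practice (file names and hashes are made up for illustration):

```sh
# The hash is derived from the content, so it doubles as a checksum and a version ID.
ipfs add dataset-2016.csv
# added QmHashA dataset-2016.csv   (placeholder hash)

# A slightly revised file gets a new root hash, but chunks that didn't change
# are stored only once -- the two versions share most of their blocks.
ipfs add dataset-2017.csv
# added QmHashB dataset-2017.csv   (placeholder hash)
```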

Related Discussions:

Exciting Features in the Works

There are a number of work-in-progress IPFS features that apply to this endeavor. This experiment will accelerate work on some of them.

  • The "filestore" feature for IPFS will allow you to index content in-place, serving it over the IPFS network without moving it, modifying it, or creating a redundant copy of the data on your machine (a short sketch follows this list).
  • ipfs-pack will allow you to build authenticatable (content-addressed) manifests of your data that are compatible with the BagIt specification.
  • ipfs-cluster will coordinate pinsets across networks of participating nodes, allowing groups of nodes to share the burden of hosting data.
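Since these features were still in progress at the time of writing, treat the following as an illustrative sketch rather than a stable interface (directory name and hash are placeholders):

```sh
# Filestore (experimental in go-ipfs): serve data in place instead of
# copying it into the IPFS repo. Enable the feature, then add with --nocopy.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy -r datagov-mirror/

# ipfs-cluster: ask the cluster to pin a hash so that participating nodes
# share the hosting burden.
ipfs-cluster-ctl pin add QmExampleRootHash
```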
flyingzumwalt (author) commented:
UPDATE: Based on initial crawls of the first 3000 datasets, we might need far less storage than we initially estimated. The total corpus of data.gov datasets might be less than 50TB, or even less than 10TB, but the actual numbers are difficult to estimate until we finish crawling all 192,000 datasets.

If the new estimates are true, then collaborators would be able to allocate far less than 300TB in order to participate. Note, however, that you might want to use spare storage to store redundant copies of the data or to store other datasets from other harvesting initiatives.

flyingzumwalt (author) commented:

Update on the update: We're identifying other big datasets and adding them to the corpus, like this 30TB NOAA dataset, so we'll definitely have plenty of data to replicate!
