
Call for Participants/Collaborators for data.gov Sprint #107

Open · flyingzumwalt opened this issue Jan 16, 2017 · 2 comments
flyingzumwalt commented Jan 16, 2017

What We're Doing

The IPFS team is working on an experiment with the Stanford University Libraries. This work is starting immediately, and we're looking for collaborators to participate in the experiment. The goal is to download all of data.gov (~350TB), add the data to IPFS nodes at Stanford, replicate it onto IPFS nodes at multiple collaborating institutions, and, through IPFS, allow anyone in the world to hold copies of the parts of data.gov they care about.
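To make that workflow concrete, here is a rough sketch of the add/replicate cycle with the ipfs CLI. The directory name and hashes below are placeholders for illustration, not real data.gov content:

```sh
# On the seeding node (e.g. at Stanford): add the harvested datasets to IPFS.
ipfs add -r datagov-mirror/
# ... added QmExampleRootHash datagov-mirror   (placeholder hash)

# On a collaborating node: replicate the whole corpus by pinning the root hash.
ipfs pin add QmExampleRootHash

# Anyone on the IPFS network can then fetch just the pieces they care about by hash.
ipfs get QmExampleRootHash/some-dataset
```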

Our objectives:

  • Capture these datasets so that they can be archived/preserved and redistributed, reinforcing the existing efforts of organizations like Internet Archive, EDGI and Data Refuge
  • Test/Prove the viability of decentralized approaches to storing, replicating and serving datasets like these
  • Provide a reference point for conversations about the role of Libraries, Archives and Museums in the decentralized web

For detailed information about the work plan, see the GitHub issues for the work sprint and the main "Epic": Replicate 350 TB of Data Between 3 Peers (and then the World). (Note: These GitHub issues are subject to change.)

Who We're Looking to Collaborate With

Institutional Collaborators

We are looking to collaborate with institutions who are able to allocate 300+ TB of network-available storage on short notice. Ideal collaborators would be institutions with data archivists on staff, or organizations who are familiar with the efforts to archive data.gov.

Individual or Private Participants

When we've finished the first round of tightly-coordinated tests, we will make the data available on the general IPFS network. That will be a great opportunity for everyone to help replicate the data and help us improve the experience of using & running an ipfs node.

Our Timeline

We are beginning work on this project immediately. The IPFS team has allocated major resources for a two-week sprint (16-27 January). After that sprint, community efforts and conversations will continue, but the IPFS engineers will turn their focus to other areas for the remainder of Q1.

How to Get Involved

To get involved, comment on this issue or contact @flyingzumwalt directly at matt at protocol dot ai.

Q & A

What does an Institutional Collaborator need to provide?

How much storage do we need to allocate?

UPDATE: Based on initial crawls of the first 3000 datasets, we might need far less storage than we initially estimated. The total corpus of data.gov datasets might be less than 50TB, or even less than 10TB, but the actual numbers are difficult to estimate until we finish crawling all 192,000 datasets. However, we have identified other big datasets to replicate in addition to data.gov.

If the new estimates are true, then collaborators would be able to allocate far less than 300TB in order to participate. Note, however, that you might want to use spare storage to store redundant copies of the data or to store other datasets from other harvesting initiatives.

Original answer: Ideally institutional collaborators should allocate enough storage to hold the entire corpus of datasets. Our current estimate is 350TB.

What if we can't get that much storage right away?
Our first rounds of replication will be 5TB, 10TB, 50TB, 100TB, etc., so you could participate at those volumes even if you don't have 300TB available yet.

It will also be possible to “pin” specific datasets or subsets of the whole collection.
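For example, a node that only wants one dataset could pin just that sub-path of the corpus. The hash and path here are placeholders:

```sh
# Pin a single dataset (a sub-path of the corpus root) instead of everything.
ipfs pin add QmExampleRootHash/noaa/some-dataset

# See what this node is currently pinning.
ipfs pin ls --type=recursive
```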

Do our machines need public IP addresses?

No. You don't need a public IP address. IPFS relies on peer-to-peer TCP connections. As long as your machines are able to connect to the internet, our engineers will probably be able to help you connect your IPFS nodes to the other participating nodes. If you want/need help, create an issue in the ipfs/archives repository and we will help you out.
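If a node does need a manual nudge, the connection step might look roughly like this (the address and peer ID are placeholders):

```sh
# Check which peers this node is already connected to.
ipfs swarm peers

# Manually dial another participating node if automatic discovery doesn't find it.
ipfs swarm connect /ip4/203.0.113.10/tcp/4001/ipfs/QmExamplePeerID
```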

What kind of bandwidth will we need?

The more the better.

When do we need to make the storage available?

The bulk of the tests & replication work will happen next week (23-27 January) and will continue after that.

Why IPFS?

There are a number of benefits to creating decentralized archives with IPFS. For example:

  • IPFS is content-addressed, so you get the benefits of checksums and content-versioning automatically
  • It's easy to replicate data across any number of peers, using an approach that scales to tens of millions of nodes
  • Peers can choose to "pin" subsets of a dataset (i.e. just one file, or one set of files, from a larger corpus)
  • Efficient de-duplication of blocks -- if I add two slightly different versions of a file, I only store the union of the two versions' content, not two full copies of the file (see the sketch after this list).
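A rough sketch of what that looks like in practice (file names and hashes are made up for illustration):

```sh
# The hash is derived from the content, so it doubles as a checksum and a version ID.
ipfs add dataset-2016.csv
# added QmHashA dataset-2016.csv   (placeholder hash)

# A slightly revised file gets a new root hash, but chunks that didn't change
# are stored only once -- the two versions share most of their blocks.
ipfs add dataset-2017.csv
# added QmHashB dataset-2017.csv   (placeholder hash)
```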

Related Discussions:

Exciting Features in the Works

There are a number of work-in-progress IPFS features that apply to this endeavor. This experiment will accelerate work on some of them.

  • The "filestore" feature for IPFS will allow you to index content in-place, serving it over the IPFS network without moving it, modifying it, or creating a redundant copy of the data on your machine (a short sketch follows this list).
  • ipfs-pack will allow you to build authenticatable (content-addressed) manifests of your data that are compatible with the BagIt specification.
  • ipfs-cluster will coordinate pinsets across networks of participating nodes, allowing groups of nodes to share the burden of hosting data.
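Since these features were still in progress at the time of writing, treat the following as an illustrative sketch rather than a stable interface (directory name and hash are placeholders):

```sh
# Filestore (experimental in go-ipfs): serve data in place instead of
# copying it into the IPFS repo. Enable the feature, then add with --nocopy.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy -r datagov-mirror/

# ipfs-cluster: ask the cluster to pin a hash so that participating nodes
# share the hosting burden.
ipfs-cluster-ctl pin add QmExampleRootHash
```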
flyingzumwalt (author) commented:
UPDATE: Based on initial crawls of the first 3000 datasets, we might need far less storage than we initially estimated. The total corpus of data.gov datasets might be less than 50TB, or even less than 10TB, but the actual numbers are difficult to estimate until we finish crawling all 192,000 datasets.

If the new estimates are true, then collaborators would be able to allocate far less than 300TB in order to participate. Note, however, that you might want to use spare storage to store redundant copies of the data or to store other datasets from other harvesting initiatives.

flyingzumwalt (author) commented:

Update on the update: We're identifying other big datasets and adding them to the corpus, like this 30TB NOAA dataset, so we'll definitely have plenty of data to replicate!
