Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ETL.js mapping example to use Streams #10

Closed
Zerim opened this issue Apr 24, 2018 · 2 comments
Closed

Update ETL.js mapping example to use Streams #10

Zerim opened this issue Apr 24, 2018 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@Zerim
Copy link
Contributor

Zerim commented Apr 24, 2018

We shouldn't load all of a dataset into memory at once... we should take advantage of streams/ flow control to make sure we're only loading a manageable amount of data into memory for processing at any given time.

@Zerim
Copy link
Contributor Author

Zerim commented Apr 25, 2018

After spending more time than I care to admit researching this... I've come up with several options for having a streaming API in the ETL mapping example.

  • Node Streams
  • pull-streams
  • generators/ async generators

Node Streams suffer from several issues that are well documented by the js-ipfs team here: ipfs/js-ipfs#362

In that discussion they chose to go with pull-streams, which addresses several of the problems they were having with Node Streams including error propagation. (RXJS was not chosen because of poor support for network backpressure).

The pull-stream API looks very similar to the ES6 generator API in terms of functionality. This lead me to explore using generators in our mappings API:

export function* extract(data, { adapters }) {
  const { ipfs } = adapters
  const originListing = data
  // Need to pretend that entityCount can be arbitrarily large
  const entityCount = originListing.listingsLength()
  for (let i = 0; i < entityCount; i++) {
    let listing = originListing.getListing(i)
    let [index, lister, ipfsHash, price, unitsAvailable] = listing
    yield ipfs.block
      .get(`Qm${entity.ipfsHash}`)
      .then(data => JSON.parse(data))
      .then(ipfsObject => ({
        index: entity.index,
        lister: entity.lister,
        price: entity.price,
        unitsAvailable: unitsAvailable,
        listingData: ipfsObject.data,
      }))
  }
}

The main difference between the pull-stream approach and the generator approach is that pull-streams are asynchronous while generators are synchronous. This is OK for the Ethereum web3 client because it makes synchronous HTTP calls by default, however we shouldn't expect this to be the case for all storage backends we integrate with. For example, reading from a large file on IPFS would require us supporting asynchronous IO.

In general I favor using idiomatic Javascript in our API, however, it doesn't seem like the functionality is quite there yet to avoid using something like Node Streams or pull-streams. There is a proposal to "modernize" the pull-streams API using the new async generators spec, which would be preferable, however the support for that async generators in Node is still experimental

I'm also not crazy about having users implement pull-stream Readable. In the case of js-ipfs-api it's not a big deal because they already return pull-stream Readable as part of their API, and so a new Readable could be derived by transforming the original one using pull-stream's operators. This wouldn't necessarily be the case for other storage adapter clients.

We may not even end up going with the ETL-based implementation of the mappings API, so for now I'm going to put this investigation on pause, and submit a PR for the generator approach (since it works fine for the Ethereum-only use case). I'll leave this issue open to track future investigation.

@Zerim
Copy link
Contributor Author

Zerim commented Apr 27, 2018

According to this, async iteration is available in Node 10.0 as well as nightly builds: https://node.green/#ES2018-features-Asynchronous-Iterators

@Zerim Zerim closed this as completed Apr 30, 2018
@Zerim Zerim removed the to do label Apr 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant