Update ETL.js mapping example to use Streams #10

Zerim · 2018-04-24T00:16:32Z

We shouldn't load all of a dataset into memory at once... we should take advantage of streams/ flow control to make sure we're only loading a manageable amount of data into memory for processing at any given time.

Zerim · 2018-04-25T05:09:48Z

After spending more time than I care to admit researching this... I've come up with several options for having a streaming API in the ETL mapping example.

Node Streams
pull-streams
generators/ async generators

Node Streams suffer from several issues that are well documented by the js-ipfs team here: ipfs/js-ipfs#362

In that discussion they chose to go with pull-streams, which addresses several of the problems they were having with Node Streams including error propagation. (RXJS was not chosen because of poor support for network backpressure).

The pull-stream API looks very similar to the ES6 generator API in terms of functionality. This lead me to explore using generators in our mappings API:

export function* extract(data, { adapters }) {
  const { ipfs } = adapters
  const originListing = data
  // Need to pretend that entityCount can be arbitrarily large
  const entityCount = originListing.listingsLength()
  for (let i = 0; i < entityCount; i++) {
    let listing = originListing.getListing(i)
    let [index, lister, ipfsHash, price, unitsAvailable] = listing
    yield ipfs.block
      .get(`Qm${entity.ipfsHash}`)
      .then(data => JSON.parse(data))
      .then(ipfsObject => ({
        index: entity.index,
        lister: entity.lister,
        price: entity.price,
        unitsAvailable: unitsAvailable,
        listingData: ipfsObject.data,
      }))
  }
}

The main difference between the pull-stream approach and the generator approach is that pull-streams are asynchronous while generators are synchronous. This is OK for the Ethereum web3 client because it makes synchronous HTTP calls by default, however we shouldn't expect this to be the case for all storage backends we integrate with. For example, reading from a large file on IPFS would require us supporting asynchronous IO.

In general I favor using idiomatic Javascript in our API, however, it doesn't seem like the functionality is quite there yet to avoid using something like Node Streams or pull-streams. There is a proposal to "modernize" the pull-streams API using the new async generators spec, which would be preferable, however the support for that async generators in Node is still experimental

I'm also not crazy about having users implement pull-stream Readable. In the case of js-ipfs-api it's not a big deal because they already return pull-stream Readable as part of their API, and so a new Readable could be derived by transforming the original one using pull-stream's operators. This wouldn't necessarily be the case for other storage adapter clients.

We may not even end up going with the ETL-based implementation of the mappings API, so for now I'm going to put this investigation on pause, and submit a PR for the generator approach (since it works fine for the Ethereum-only use case). I'll leave this issue open to track future investigation.

Zerim · 2018-04-27T22:29:32Z

According to this, async iteration is available in Node 10.0 as well as nightly builds: https://node.green/#ES2018-features-Asynchronous-Iterators

Zerim self-assigned this Apr 24, 2018

Zerim added enhancement New feature or request to do labels Apr 24, 2018

Zerim mentioned this issue Apr 24, 2018

Add examples of ingest mapping for old OriginListing contract #2

Merged

Zerim mentioned this issue Apr 25, 2018

Use generators in ETL mapping example to avoid memory issues #16

Merged

Zerim closed this as completed Apr 30, 2018

Zerim removed the to do label Apr 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update ETL.js mapping example to use Streams #10

Update ETL.js mapping example to use Streams #10

Zerim commented Apr 24, 2018

Zerim commented Apr 25, 2018

Zerim commented Apr 27, 2018

Update ETL.js mapping example to use Streams #10

Update ETL.js mapping example to use Streams #10

Comments

Zerim commented Apr 24, 2018

Zerim commented Apr 25, 2018

Zerim commented Apr 27, 2018