using chunked instead of ichunked #238

Merged 1 commit into master from seanmacavaney-patch-1 on Nov 8, 2021

Conversation

seanmacavaney (Collaborator)

I noticed some unexpectedly poor performance in indexing pipelines (sometimes it even resulted in segfaults?!). I tracked it down to the use of [`more_itertools.ichunked`](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.ichunked) in `index`. Replacing it with [`more_itertools.chunked`](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.chunked) fixes the issue.
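
The change itself is essentially a one-word swap. As a minimal sketch of the pattern (the helper name and signature here are illustrative, not the actual `index` code):

```python
# Illustrative sketch of the pattern, not the actual diff: swapping
# ichunked for chunked makes each batch a materialized list instead
# of a lazy sub-iterator.
from more_itertools import chunked  # previously: ichunked

def iter_batches(doc_iter, batch_size):
    # chunked yields plain lists of up to batch_size items each
    yield from chunked(doc_iter, batch_size)
```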

Unlike some `i`-prefixed iterable functions in the standard library, `ichunked` and `chunked` both return iterables; it's just that `chunked` returns an `iter[list]` whereas `ichunked` returns an `iter[iter]`. To support this, `ichunked` uses `tee`-ing so it can yield one version of the iterable and hold on to the other for itself. I bet this is where the problem arises.
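
To make that concrete, here's a rough sketch of what a `tee`-based lazy chunker has to do (a simplification for illustration based on the description above, not `more_itertools`' actual source):

```python
from itertools import chain, islice, tee

def ichunked_sketch(iterable, n):
    # Simplified tee-based lazy chunker, for illustration only; not
    # more_itertools' actual implementation.
    source = iter(iterable)
    while True:
        try:
            head = next(source)
        except StopIteration:
            return
        # tee the remaining items: one copy is sliced and yielded to the
        # caller as the next chunk, the other is kept so the generator can
        # advance past the chunk even if the caller never consumes it.
        source, clone = tee(chain([head], source))
        yield islice(clone, n)
        # skip the n items belonging to the chunk just yielded
        next(islice(source, n, n), None)
```

Each pass wraps the source in another `tee`/`chain` layer, so pulling the next item gets more expensive as chunks accumulate, which would explain the progressive slowdown in the `tqdm` trace below.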

In our case, we don't need an `iter[iter]` -- it usually just ends up being coerced into a DataFrame. And in cases where it doesn't, I also wouldn't expect problems because `batch_size` is likely always going to be a manageable size.
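
As a toy illustration of that coercion (the field names are made up; this is not the actual `index` code):

```python
import pandas as pd
from more_itertools import chunked

# stand-in corpus iterator; the field names are illustrative
docs = ({'docno': str(i), 'text': f'document {i}'} for i in range(10))

for batch in chunked(docs, 4):
    # each batch is already a plain list of dicts, so it can be
    # coerced into a DataFrame directly
    df = pd.DataFrame(batch)
    print(len(df))  # -> 4, 4, 2
```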

In a benchmark on `msmarco-passage`, I saw ~40x faster iteration.

A simple test using `more_itertools` alone clearly demonstrates the issue:

```python
>>> from tqdm import tqdm
>>> from more_itertools import chunked, ichunked
>>> for _ in ichunked(tqdm(range(3_000_000)), 1000):
...   pass
 23%|  698461/3000000 [00:05<00:36, 63231.90it/s]
 32%|  948724/3000000 [00:10<00:48, 42513.00it/s]
 38%| 1137833/3000000 [00:15<00:55, 33747.15it/s]
 43%| 1287378/3000000 [00:20<00:59, 28855.96it/s]
 48%| 1425604/3000000 [00:25<01:01, 25466.22it/s]
 51%| 1539932/3000000 [00:30<01:02, 23219.40it/s]
 55%| 1652571/3000000 [00:35<01:05, 20458.89it/s]
 58%| 1751094/3000000 [00:40<01:02, 19879.07it/s]
 62%| 1847406/3000000 [00:45<01:01, 18642.17it/s]
 65%| 1938015/3000000 [00:50<01:00, 17611.84it/s]
 68%| 2040571/3000000 [00:55<00:57, 16569.79it/s]
 70%| 2113594/3000000 [01:00<00:55, 15867.30it/s]
# ^ getting progressively slower over time

>>> for _ in chunked(tqdm(range(3_000_000)), 1000):
...   pass
100%| 3000000/3000000 [00:00<00:00, 6905401.43it/s]
# ^ finishes in under 1sec
```

I plan to write up a corresponding issue on `more_itertools`. Perhaps this isn't something fixable, but it's probably at least worthy of a warning in their documentation.
@cmacdonald (Contributor) commented Nov 8, 2021

Fabulous find!

@cmacdonald (Contributor)

Sean, do you have some timings of indexing with and without this patch?

@seanmacavaney (Collaborator, Author)

Setting: ColBERT (first 1M passages):

```python
>>> import pyterrier as pt
>>> pt.init()
>>> import pyterrier_colbert.indexing
>>> dataset = pt.get_dataset('irds:msmarco-passage')
>>> from itertools import islice
>>> indexer = pyterrier_colbert.indexing.ColBERTIndexer("http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip", "colbert_msmarco1M", "index_name", 64)
>>> (pt.apply.dummy(lambda x: None) >> indexer).index(islice(dataset.get_corpus_iter(), 1_000_000))
#    ^ dummy transformer to trigger pipeline mode
```

The first stage of indexing is simply loading up a large batch of passages to pass to the colbert library for encoding. Without the patch, it takes 6m26s to do this, getting progressively slower as it goes. It would also pretty reliably crash with a segfault after around 20min (with the 1M cap removed). With the patch it takes 33s, with no segfaults. Without the dummy transformer (so no chunking), it's a bit faster still, as you'd expect, because it doesn't need to convert to/from DataFrames and apply the transformer.

```
# ichunked:
msmarco-passage documents:  11%| 999999/8841823 [06:26<50:31, 2586.36it/s]
# chunked:
msmarco-passage documents:  11%| 999999/8841823 [00:33<04:21, 30012.35it/s]
# without dummy transformer (so no chunking):
msmarco-passage documents:  11%| 999999/8841823 [00:19<02:34, 50764.32it/s]
```

As far as I'm aware, the only impact is on this initial iteration stage (for ColBERT).

(The 40x speedup mentioned above was based on a slightly different setting that went for more documents; 1M gets the same point across in less time.)

cmacdonald merged commit 2fcc017 into master on Nov 8, 2021
@cmacdonald (Contributor)

Ok, perfect. I'm so pleased this is fixed!

cmacdonald deleted the seanmacavaney-patch-1 branch on October 13, 2022