using chunked instead of ichunked #238

Merged 1 commit into master from seanmacavaney-patch-1 on Nov 8, 2021

Conversation

seanmacavaney (Collaborator)

I noticed some unexpectedly poor performance in indexing pipelines (sometimes it even resulted in segfaults?!). I tracked it down to the use of [`more_itertools.ichunked`](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.ichunked) in `index`. Replacing it with [`more_itertools.chunked`](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.chunked) fixes the issue.
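
The change itself is essentially a one-word swap. As a minimal sketch of the pattern (the helper name and signature here are illustrative, not the actual `index` code):

```python
# Illustrative sketch of the pattern, not the actual diff: swapping
# ichunked for chunked makes each batch a materialized list instead
# of a lazy sub-iterator.
from more_itertools import chunked  # previously: ichunked

def iter_batches(doc_iter, batch_size):
    # chunked yields plain lists of up to batch_size items each
    yield from chunked(doc_iter, batch_size)
```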

Unlike some `i`-prefixed iterable functions in the standard library, `ichunked` and `chunked` both return iterables; it's just that `chunked` returns an `iter[list]` whereas `ichunked` returns an `iter[iter]`. To support this, `ichunked` uses `tee`-ing so it can yield one version of the iterable and hold on to the other for itself. I bet this is where the problem arises.
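
To make that concrete, here's a rough sketch of what a `tee`-based lazy chunker has to do (a simplification for illustration based on the description above, not `more_itertools`' actual source):

```python
from itertools import chain, islice, tee

def ichunked_sketch(iterable, n):
    # Simplified tee-based lazy chunker, for illustration only; not
    # more_itertools' actual implementation.
    source = iter(iterable)
    while True:
        try:
            head = next(source)
        except StopIteration:
            return
        # tee the remaining items: one copy is sliced and yielded to the
        # caller as the next chunk, the other is kept so the generator can
        # advance past the chunk even if the caller never consumes it.
        source, clone = tee(chain([head], source))
        yield islice(clone, n)
        # skip the n items belonging to the chunk just yielded
        next(islice(source, n, n), None)
```

Each pass wraps the source in another `tee`/`chain` layer, so pulling the next item gets more expensive as chunks accumulate, which would explain the progressive slowdown in the `tqdm` trace below.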

In our case, we don't need an `iter[iter]` -- it usually just ends up being coerced into a DataFrame. And in cases where it doesn't, I also wouldn't expect problems because `batch_size` is likely always going to be a manageable size.
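
As a toy illustration of that coercion (the field names are made up; this is not the actual `index` code):

```python
import pandas as pd
from more_itertools import chunked

# stand-in corpus iterator; the field names are illustrative
docs = ({'docno': str(i), 'text': f'document {i}'} for i in range(10))

for batch in chunked(docs, 4):
    # each batch is already a plain list of dicts, so it can be
    # coerced into a DataFrame directly
    df = pd.DataFrame(batch)
    print(len(df))  # -> 4, 4, 2
```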

In a benchmark on `msmarco-passage`, I saw ~40x faster iteration.

A simple test using `more_itertools` alone clearly demonstrates the issue:

```python
>>> from tqdm import tqdm
>>> from more_itertools import chunked, ichunked
>>> for _ in ichunked(tqdm(range(3_000_000)), 1000):
...   pass
 23%|  698461/3000000 [00:05<00:36, 63231.90it/s]
 32%|  948724/3000000 [00:10<00:48, 42513.00it/s]
 38%| 1137833/3000000 [00:15<00:55, 33747.15it/s]
 43%| 1287378/3000000 [00:20<00:59, 28855.96it/s]
 48%| 1425604/3000000 [00:25<01:01, 25466.22it/s]
 51%| 1539932/3000000 [00:30<01:02, 23219.40it/s]
 55%| 1652571/3000000 [00:35<01:05, 20458.89it/s]
 58%| 1751094/3000000 [00:40<01:02, 19879.07it/s]
 62%| 1847406/3000000 [00:45<01:01, 18642.17it/s]
 65%| 1938015/3000000 [00:50<01:00, 17611.84it/s]
 68%| 2040571/3000000 [00:55<00:57, 16569.79it/s]
 70%| 2113594/3000000 [01:00<00:55, 15867.30it/s]
# ^ getting progressively slower over time

>>> for _ in chunked(tqdm(range(3_000_000)), 1000):
...   pass
100%| 3000000/3000000 [00:00<00:00, 6905401.43it/s]
# ^ finishes in under 1sec
```

I plan to write up a corresponding issue on `more_itertools`. Perhaps this isn't something fixable, but it's probably at least worthy of a warning in their documentation.
@cmacdonald (Contributor) commented Nov 8, 2021

Fabulous find!

@cmacdonald (Contributor)

Sean, do you have some timings of indexing with and without this patch?

@seanmacavaney (Collaborator, Author)

Setting: ColBERT (first 1M passages):

```python
>>> import pyterrier as pt
>>> pt.init()
>>> import pyterrier_colbert.indexing
>>> dataset = pt.get_dataset('irds:msmarco-passage')
>>> from itertools import islice
>>> indexer = pyterrier_colbert.indexing.ColBERTIndexer("http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip", "colbert_msmarco1M", "index_name", 64)
>>> (pt.apply.dummy(lambda x: None) >> indexer).index(islice(dataset.get_corpus_iter(), 1_000_000))
#    ^ dummy transformer to trigger pipeline mode
```

The first stage of indexing is simply loading up a large batch of passages to pass to the colbert library for encoding. Without the patch, it takes 6m26s to do this, getting progressively slower as it goes. It would also pretty reliably crash with a segfault after around 20min (with the 1M cap removed). With the patch it takes 33s, with no segfaults. Without the dummy transformer (so no chunking), it's a bit faster still, as you'd expect, because it doesn't need to convert to/from DataFrames and apply the transformer.

```
# ichunked:
msmarco-passage documents:  11%| 999999/8841823 [06:26<50:31, 2586.36it/s]
# chunked:
msmarco-passage documents:  11%| 999999/8841823 [00:33<04:21, 30012.35it/s]
# without dummy transformer (so no chunking):
msmarco-passage documents:  11%| 999999/8841823 [00:19<02:34, 50764.32it/s]
```

As far as I'm aware, the only impact is on this initial iteration stage (for ColBERT).

(The 40x speedup mentioned above was based on a slightly different setting that went for more documents; 1M gets the same point across in less time.)

cmacdonald merged commit 2fcc017 into master on Nov 8, 2021
@cmacdonald (Contributor)

Ok, perfect. I'm so pleased this is fixed!

cmacdonald deleted the seanmacavaney-patch-1 branch on October 13, 2022