
[5.2] Speed up chunk for large sets of data #12861

Merged: 2 commits into laravel:5.2 on Mar 28, 2016

Conversation

kevindoole (Contributor)

I've been working on a project that involves sifting through some pretty large databases, and the query builder's chunk method was not really doing it for me because it slowed down so much as it worked through the table.

As chunk gets deeper and deeper into a table, it asks MySQL to count through more and more rows. When chunk sends MySQL `select whatever from wherever limit 800000, 1000`, MySQL counts through 800,000 rows without using an index at all. Then, for the next page, it again counts through all of the previous rows to find the next set. Poor MySQL... It gets really, really slow (details below).
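To make the two query shapes concrete, here is an illustrative sketch using the query builder (the `records` table and the chunk size are hypothetical):

```php
use Illuminate\Support\Facades\DB;

// Offset-based paging, as chunk does it today: MySQL has to walk past
// all 800,000 skipped rows before it can return the 1,000 we want.
$page = DB::table('records')->offset(800000)->limit(1000)->get();

// Keyset-based paging: with an index on `id`, MySQL seeks straight to
// the first row after $lastId instead of counting from the top.
$page = DB::table('records')
    ->where('id', '>', $lastId)
    ->orderBy('id')
    ->limit(1000)
    ->get();
```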

I realize this may be too much an edge case, so didn't want to spend too much time working on the code; just looking for feedback at this point.

If you're still reading, here are some scrappy benchmarks. :)


After 4 seconds, chunk is very optimistic:
187500/862443 [▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░] 21% 4 secs/18 secs 10.0 MiB

However, in the end the process has taken substantially longer:
862443/862443 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 2 mins/2 mins 10.0 MiB

As you can imagine, the problem gets progressively worse as the database gets larger.

If we instead query by `id > $theLastIdFromTheLastSet`, the chunking is much faster.

Again, after 4 seconds:
387000/862443 [▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░] 44% 4 secs/9 secs 10.0 MiB

And the same set takes only 21 seconds in the end:
862443/862443 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 21 secs/21 secs 10.0 MiB

@GrahamCampbell changed the title from "Speed up chunk for large sets of data" to "[5.2] Speed up chunk for large sets of data" on Mar 25, 2016
@GrahamCampbell (Member)

This only works if we are actually ordering by id.

@GrahamCampbell (Member)

Also, the each function is still forced to use the poorly performing chunk method.

@kevindoole (Contributor, Author)

Yeah, it's definitely a pretty specific use case (large set, ordered by id), but that might be common enough. For people using an auto-incrementing id, using a `where`, or better yet a `whereBetween`, will always be faster than an offset, regardless of the size of the dataset. So I guess there's actually some performance benefit for smaller sets too.

FWIW, for my needs I ended up implementing a method using `whereBetween` instead of using chunk. It makes sense when you don't necessarily care how many rows you get back in each set, which is probably not useful a whole lot of the time.
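As an illustration only, a minimal sketch of that whereBetween approach (the table name is hypothetical; it assumes a dense auto-incrementing id, and batches shrink wherever ids have gaps):

```php
use Illuminate\Support\Facades\DB;

$step = 1000;
$maxId = DB::table('records')->max('id');

for ($start = 1; $start <= $maxId; $start += $step) {
    // BETWEEN is an index range scan, so each batch stays cheap no
    // matter how deep into the table we are.
    $rows = DB::table('records')
        ->whereBetween('id', [$start, $start + $step - 1])
        ->get();

    // Process $rows; it may contain fewer than $step rows (or none)
    // where ids have been deleted.
}
```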

@taylorotwell (Member)

Yeah it's common enough to order by ID that it's probably worth the improvement here. Thanks!

@kevindoole (Contributor, Author)

Ok, great!

So far I've kept the changed functionality completely separated in a chunkById method. Is it worth spending a bit of extra time to make the chunk method use a `where` instead of an `offset` by default when possible, or would you rather keep it separated? I guess the logic would be: if there's already an order by, or if there's no id available, use `offset`; otherwise, use `where id > x`.

Come to think of it, I guess `where id < x order by id desc` is more natural.

Sound good?
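For context, calling the proposed chunkById ends up looking roughly like this (a sketch; the users table is just an example):

```php
use Illuminate\Support\Facades\DB;

// Each page runs `where id > ? order by id asc limit 100` under the
// hood, so every page is an index seek rather than an offset scan.
DB::table('users')->chunkById(100, function ($users) {
    foreach ($users as $user) {
        // Process each row...
    }
});
```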

@taylorotwell (Member)

I would just keep them separate how you have them now.

@taylorotwell merged commit bd09dda into laravel:5.2 on Mar 28, 2016
@taylorotwell (Member)

I just merged this but noticed the query builder version is all messed up. The query builder doesn't return a collection at all in 5.2. I guess you only wrote tests for the Eloquent version?

@kevindoole (Contributor, Author)

Yeah, sorry! I was expecting to need to put a little more work in -- should've put a 'WIP' in the PR title or something.

This evening I can send another PR to polish up the query builder side and add tests.

@vlakoff (Contributor) commented Mar 29, 2016

  • There are two chunkById methods defined, but they are very similar, so maybe the code duplication could be avoided. If you define chunkById only in Builder, it will also be callable from Eloquent because of __call().
  • More importantly, because of the select statement in pageAfterId, the results contain only the id column... (see the sketch after this list)
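For illustration, a sketch of the paging helper (keeping the pageAfterId name from the PR) that addresses the second point by not adding a select() of its own, so callers keep whatever columns they asked for:

```php
/**
 * Constrain the query to the page of results after the given ID.
 *
 * Note: no select() here, so the caller's column list is preserved
 * instead of being narrowed down to just `id`.
 */
public function pageAfterId($perPage = 15, $lastId = 0)
{
    return $this->where('id', '>', $lastId)
                ->orderBy('id', 'asc')
                ->limit($perPage);
}
```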

@halaei (Contributor) commented Apr 12, 2016

Could there be a similar thing for pagination?

Inline review comment on the diff:

```
@@ -1315,6 +1315,23 @@ public function forPage($page, $perPage = 15)
     }

     /**
      * Set the limit and query for the next set of results, given the
      * last scene ID in a set.
```

Contributor: Scene or seen?

@barryvdh (Contributor)

@kevindoole Did you perhaps also benchmark chunking with a whereIn() or a subquery for the ids? E.g., for large datasets, make two queries (or one query plus a subquery): one which selects just the ids for the result set, and a second to fetch the actual data. This means MySQL can look up the ids from just the index, which should be faster (see https://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/ and https://explainextended.com/2011/02/11/late-row-lookups-innodb/, although I'm not sure whether this has been optimised away by now).
That could work for other orderBys too.
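For reference, the late-row-lookup shape described above looks roughly like this as a raw query (illustrative only; the posts table and created_at ordering are made-up names, and it assumes an index on created_at):

```php
use Illuminate\Support\Facades\DB;

// The derived table pages through the index alone; the join then
// fetches full rows only for the 1,000 matching ids.
$rows = DB::select(
    'select p.*
     from posts p
     join (
         select id from posts order by created_at limit 800000, 1000
     ) page on page.id = p.id'
);
```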
