
Improve performance of prefixed column resolution #1115

Merged
merged 3 commits on May 1, 2022

Conversation

justinmeiners

For issue #1109

  • Previously, prefixed column resolution allocated arrays every time a column was accessed. It now allocates only in one of the error cases.
  • It also keeps track of the columnName index for both key and value, so it doesn't need to be looked up twice.
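
The change amounts to resolving a prefixed column in a single scan. Below is a minimal sketch of that idea, assuming columnNames maps each column name to its index (as in SQLite.swift's Row); the function name and error handling are illustrative, not the PR's exact diff:

    // Single pass over the column-name dictionary: remember the matching name
    // and its index, so no intermediate array is allocated and no second
    // dictionary lookup is needed. Error reporting is simplified to nil here.
    func resolve(_ template: String, in columnNames: [String: Int]) -> (name: String, index: Int)? {
        let suffix = ".\(template)"
        var match: (name: String, index: Int)?
        for (name, index) in columnNames where name.hasSuffix(suffix) {
            if match != nil { return nil }   // more than one prefixed match: ambiguous
            match = (name, index)
        }
        return match                         // nil when nothing matched
    }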

Justin Meiners added 2 commits February 24, 2022 09:49
- Previously this allocated arrays every time it was accessed. It will now only allocate in one of the error cases.
- It also keeps track of the columnName index for both key and value so it doesn't need to look it up twice.

jberkel commented Feb 24, 2022

Thanks for submitting a PR. Please use make lint to check for lint errors (requires SwiftLint)

justinmeiners

@jberkel fixed.


jberkel commented Feb 25, 2022

Wouldn't it be simpler to just change line 1172 to

let similar = columnNames.keys.filter { $0.hasSuffix(".\(column.template)") }

This avoids the allocation of the array with all columns. In the most common case (empty array) no allocation will be made.
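
For context, here is a hedged sketch of how that one-liner sits in the resolution logic; the switch structure, error messages, and function name are assumptions for illustration, not the library's exact code:

    func resolvePrefixed(_ template: String, in columnNames: [String: Int]) -> Int {
        // `filter` allocates an array of matching keys (usually zero or one element).
        let similar = columnNames.keys.filter { $0.hasSuffix(".\(template)") }
        switch similar.count {
        case 0:
            fatalError("no such column '\(template)' in columns: \(columnNames.keys.sorted())")
        case 1:
            return columnNames[similar[0]]!   // a second lookup: matched key -> index
        default:
            fatalError("ambiguous column '\(template)' (matches: \(similar))")
        }
    }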


justinmeiners commented Feb 25, 2022

@jberkel

> In the most common case (empty array) no allocation will be made.

I am less certain about this, but I believe the most common case is one match: the user asks for a column and it happens to be prefixed. Zero matches means neither the unprefixed nor the prefixed version of what the user requested could be found (they asked for the wrong thing).

> filter

Filter on dictionary keys creates an array. We could use the lazy variant, assuming this particular kind of sequence allows traversal multiple times.

If the lazy version worked, here is what would happen in the one-match case:

  1. Apply the predicate to the entire list to compute the count of 1 (at the start of the switch).
  2. Apply the predicate to half of the list again (on average) to find the key.
  3. Look up that key in the dictionary.
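
In code, the lazy variant would look roughly like this (a hypothetical sketch, not code from the PR or the library):

    func resolveLazily(_ template: String, in columnNames: [String: Int]) -> Int? {
        let similar = columnNames.keys.lazy.filter { $0.hasSuffix(".\(template)") }
        switch similar.count {                // step 1: predicate runs over every key
        case 1:
            let name = similar.first!         // step 2: predicate runs again until the match
            return columnNames[name]          // step 3: dictionary lookup for the index
        default:
            return nil                        // zero or multiple matches: error cases elided
        }
    }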

With the proposed changes we only apply the predicate to the entire list once, and we don't have to look up the entry in the dictionary afterwards.

Now, this kind of analysis is definitely a little overboard for most code, but I think this fix is pretty straightforward. get is the most common operation and will be called a huge number of times in all kinds of inner loops.


jberkel commented Feb 26, 2022

Yes, it will create an array, but one with just one element in non-error cases. I'm not sure if your changes really have a noticeable benefit. The array is allocated on the stack, so this is a cheap operation, and I doubt that this is slower than two separate indexOf operations. Have you done any profiling?


justinmeiners commented Feb 26, 2022

@jberkel If I demonstrate it is faster, will you include it? Or do you still have other concerns?


jberkel commented Feb 27, 2022

If it's faster, yes. A straightforward change would be to avoid the allocation of an array containing all the column names (Array(columnNames.keys).filter → columnNames.keys.filter). This should then be compared to your version.


justinmeiners commented Apr 29, 2022

@jberkel Just did a test where I pull a single column from about 50,000 rows. I measured the elapsed time using signposts in Instruments:

    // Wrap the row-parsing work in an os_signpost interval so the elapsed
    // time shows up in Instruments.
    os.os_signpost(.begin, log: perfLog, name: "parse rows")

    defer {
        os.os_signpost(.end, log: perfLog, name: "parse rows")
    }

    return rows.map({ row in
        return (row[value], row[value])
    })

The runtimes for a few trials were the following:

Filter keys

  • 476ms
  • 497ms
  • 447ms
  • 440ms

[Screenshot: Instruments trace of the signpost intervals, 2022-04-29]

Finding index

  • 378ms
  • 430ms
  • 410ms
  • 379ms

Explanation

columnNames.keys.filter has to allocate an array to store the results. Finding the first two indices forgoes this need.


jberkel commented May 1, 2022

Nice, that looks like a substantial improvement. How many columns are in your test data? There's not just the allocation, but also the iteration over all elements.

justinmeiners

@jberkel that's a good question. I just had 4 columns in this query result. I expect both algorithms to scale the same in the number of columns. Both visit each column exactly once, in order. Both apply a predicate on each visit.


jberkel commented May 1, 2022

You're right, that leaves the allocation overhead. Maybe the stack allocation isn't as fast as they claim it is. Thanks for investigating!
