Allow shorter run times with a lot of random endpoints #5

Open · abyrd opened this issue Dec 2, 2014 · 1 comment

abyrd (Member) commented Dec 2, 2014

Our existing approach essentially tries the cross product of all modes, all times of day, and all endpoints. It scales this product down somewhat, but it's still an enormous number of combinations.

This approach was motivated by the possibility of stumbling upon some random origin point that exhibits interesting characteristics only at a particular time of day or with a particular mode. It produces an enormous number of requests, though. In practice, to achieve realistic run times we find ourselves drastically limiting the number of random endpoints.

For example, with 1000 random endpoints in non-fast mode we see an estimated run time of over 21 days in NYC.

Note that when the total number of requests is held constant, the cross-product approach tests a much smaller number of distinct locations, yet because the search endpoints are random it has no higher probability of pairing an "interesting" endpoint with a particular set of search parameters than sampling endpoint-parameter pairs directly.

Therefore, I propose that we grab the endpoints two by two rather than doing anything resembling a cross product. We can step over the random endpoints and the search parameter combinations in parallel, looping over the shorter list until both have been exhausted.
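A minimal sketch of that pairing in Python, assuming the endpoints and parameter combinations arrive as plain lists (the names here are illustrative, not from the codebase):

    from itertools import cycle

    def paired_requests(endpoints, param_combos):
        """Pair each random endpoint with one search-parameter combination,
        cycling the shorter list until the longer one is exhausted."""
        if len(endpoints) >= len(param_combos):
            return list(zip(endpoints, cycle(param_combos)))
        return list(zip(cycle(endpoints), param_combos))

This keeps the total number of requests at max(len(endpoints), len(param_combos)) rather than their product.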

It may also be a good idea to take the full list of search parameter combinations from a CSV or JSON file rather than generating them automatically. Generating them by taking the cross product of a bunch of individual parameter values tends to over-represent unusual combinations. There should still be a script to generate "starter" files of request parameters that can then be edited down by the user. The current scripts already allow this approach, but it is not highlighted in the readme.
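For illustration, loading such a file could be as simple as the following, assuming a CSV with a header row naming the parameters and one row per combination (the format is an assumption, not an existing convention in the project):

    import csv

    def load_param_combos(path):
        """Load hand-edited search-parameter combinations from a CSV file;
        each row becomes one dict mapping parameter name to value."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

A starter file produced by the cross-product script could then be pruned by hand and fed back in through this loader.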

It is still a good idea to have a slow and a fast mode. Perhaps the slow mode should use the custom endpoints as origins, over the full range of search parameters, using the full range of random endpoints as destinations.

abyrd self-assigned this Dec 2, 2014

buma (Contributor) commented Dec 2, 2014

What about something similar to the non-exhaustive grid search in scikit-learn? From their documentation:

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

  • A budget can be chosen independent of the number of parameters and possible values.
  • Adding parameters that do not influence the performance does not decrease efficiency.

I think this is a similar problem: you can try all the combinations, which is exhaustive but takes a long time, or try only a sample of the options. With the same seed, the chosen settings are reproducible.
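A minimal sketch of that idea using scikit-learn's ParameterSampler; the parameter names and values below are hypothetical stand-ins for the real search parameters:

    from sklearn.model_selection import ParameterSampler

    # Hypothetical search-parameter grid; names and values are illustrative.
    param_grid = {
        "mode": ["WALK", "TRANSIT,WALK", "BICYCLE"],
        "time": ["06:00", "08:00", "12:00", "18:00"],
        "maxWalkDistance": [400, 800, 1600],
    }

    # Draw a fixed budget of 20 combinations; a fixed random_state makes
    # the sampled settings identical across runs, as noted above.
    sampled = list(ParameterSampler(param_grid, n_iter=20, random_state=42))

The request budget (n_iter) can then be chosen independently of how many parameters and values exist.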
