Allow shorter run times with a lot of random endpoints #5

Open · abyrd opened this issue Dec 2, 2014 · 1 comment

abyrd (Member) commented Dec 2, 2014

Our existing approach essentially tries the cross product of all modes, all times of day, and all endpoints. It scales this product down somewhat, but it's still an enormous number of combinations.

This approach was motivated by the possibility of stumbling upon some random origin point that exhibits interesting characteristics only at a particular time of day or with a particular mode. It produces an enormous number of requests, though. In practice, to achieve realistic run times we find ourselves drastically limiting the number of random endpoints.

For example, with 1000 random endpoints in non-fast mode we see an estimated run time of over 21 days in NYC.

Note that when the total number of requests is held constant, the cross-product approach tests a much smaller number of distinct locations, yet because the search endpoints are random it has no higher probability of pairing an "interesting" endpoint with a particular set of search parameters than sampling endpoint-parameter pairs directly.

Therefore, I propose that we grab the endpoints two by two rather than doing anything resembling a cross product. We can step over the random endpoints and the search parameter combinations in parallel, looping over the shorter list until both have been exhausted.
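A minimal sketch of that pairing in Python, assuming the endpoints and parameter combinations arrive as plain lists (the names here are illustrative, not from the codebase):

    from itertools import cycle

    def paired_requests(endpoints, param_combos):
        """Pair each random endpoint with one search-parameter combination,
        cycling the shorter list until the longer one is exhausted."""
        if len(endpoints) >= len(param_combos):
            return list(zip(endpoints, cycle(param_combos)))
        return list(zip(cycle(endpoints), param_combos))

This keeps the total number of requests at max(len(endpoints), len(param_combos)) rather than their product.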

It may also be a good idea to take the full list of search parameter combinations from a CSV or JSON file rather than generating them automatically. Generating them by taking the cross product of a bunch of individual parameter values tends to over-represent unusual combinations. There should still be a script to generate "starter" files of request parameters that can then be edited down by the user. The current scripts already allow this approach, but it is not highlighted in the readme.
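For illustration, loading such a file could be as simple as the following, assuming a CSV with a header row naming the parameters and one row per combination (the format is an assumption, not an existing convention in the project):

    import csv

    def load_param_combos(path):
        """Load hand-edited search-parameter combinations from a CSV file;
        each row becomes one dict mapping parameter name to value."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

A starter file produced by the cross-product script could then be pruned by hand and fed back in through this loader.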

It is still a good idea to have a slow and a fast mode. Perhaps the slow mode should use the custom endpoints as origins, over the full range of search parameters, using the full range of random endpoints as destinations.

abyrd self-assigned this Dec 2, 2014

buma (Contributor) commented Dec 2, 2014

What about something similar to the non-exhaustive grid search in scikit-learn? From their documentation:

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

  • A budget can be chosen independent of the number of parameters and possible values.
  • Adding parameters that do not influence the performance does not decrease efficiency.

I think this is a similar problem: you can try all the combinations, which is exhaustive but takes a long time, or try only a sample of the options. With the same seed, the chosen settings are reproducible.
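A minimal sketch of that idea using scikit-learn's ParameterSampler; the parameter names and values below are hypothetical stand-ins for the real search parameters:

    from sklearn.model_selection import ParameterSampler

    # Hypothetical search-parameter grid; names and values are illustrative.
    param_grid = {
        "mode": ["WALK", "TRANSIT,WALK", "BICYCLE"],
        "time": ["06:00", "08:00", "12:00", "18:00"],
        "maxWalkDistance": [400, 800, 1600],
    }

    # Draw a fixed budget of 20 combinations; a fixed random_state makes
    # the sampled settings identical across runs, as noted above.
    sampled = list(ParameterSampler(param_grid, n_iter=20, random_state=42))

The request budget (n_iter) can then be chosen independently of how many parameters and values exist.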
