This repository has been archived by the owner on May 1, 2020. It is now read-only.

No search method in Spark implementation #4

Open
daniel-acuna opened this issue Jan 15, 2016 · 5 comments

Comments

@daniel-acuna

Thanks very much for this implementation!

I was wondering if there was any plan to add a script for searching the index in Spark. So far, we can create a model, save it, and compute the codes for a set of data points, but there is no search functionality.

Thanks!

@pumpikano
Collaborator

Thanks, I hope you find it useful.

Yes, this is functionality that would be great to have. I can't say for sure if or when I will work on it, but I thought I might lay out a few approaches here for possible discussion.

  1. The simplest solution would be to cache an RDD of (id, lopq code) pairs, generate some candidate cells for a query with the multisequence generator, filter the RDD for those cells, and do distance reconstruction for this subset of items (either locally or distributively); see the sketch after this list. The main drawback here is that filtering for some set of cells requires a pass over the data - you will get the right results but it will likely be slow. This could be mitigated by a partitioning scheme that causes points with the same cell to fall in the same partition.
  2. Rather than filtering the whole dataset, you could first groupByKey to create buckets of points keyed by cell and filter this RDD. This could cause trouble with very large cell buckets, and grouping by key is sometimes nontrivial in terms of memory.
  3. I suspect the best solution would utilize something like an IndexedRDD, which can return cell buckets very quickly, perhaps iteratively until a result quota for the query is reached. Currently IndexedRDD does not have a Python interface.
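
For concreteness, a rough PySpark sketch of approach 1 might look like the following. The helpers `candidate_cells` and `reconstruct_distance` are placeholders rather than part of the lopq API; they stand in for the multisequence generator and LOPQ distance reconstruction, respectively.

```python
# Rough sketch of approach 1. Assumes a cached RDD `codes_rdd` of
# (id, (coarse_cell, lopq_code)) pairs and two placeholder helpers:
#   candidate_cells(query, model)            -> coarse cells to probe, nearest first
#   reconstruct_distance(query, code, model) -> approximate distance to the query
def search(query, model, codes_rdd, num_cells=16, top_k=100):
    cells = set(candidate_cells(query, model)[:num_cells])

    # One pass over the data to keep only points in the candidate cells.
    candidates = codes_rdd.filter(lambda kv: kv[1][0] in cells)

    # Reconstruct approximate distances distributively and take the best top_k.
    scored = candidates.map(
        lambda kv: (reconstruct_distance(query, kv[1][1], model), kv[0])
    )
    return scored.takeOrdered(top_k)
```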

Besides the convenience of building a large similarity index in Spark for testing, a tantalizing potential use case for a fast Spark implementation of LOPQ search might be providing fast near neighbor sampling as a component of other ML pipelines or algorithms in Spark. If you explore this further I would love to hear how it goes and possibly get your work into this project!

@daniel-acuna
Author

Thanks for the answer and the alternatives!

Some applications (e.g., entity disambiguation) need the nearest neighbors of the points that are already indexed, so approach 1 would be best in that scenario since you need all the nearest neighbors anyway. For other applications that only need to query a few points, solution 3 would work, but I'm not familiar with IndexedRDD. If you have very few points per cell (e.g., image search), then solution 2 would work, but in other applications (e.g., entity disambiguation) there might be big buckets.

I was wondering whether putting (id, hash, data) into a Parquet dataset, where you could use the built-in partitioning system, would let you retrieve candidate cells without traversing the whole dataset. One solution off the top of my head would be to define a partitioning by the first couple of codebooks in the product; a rough sketch is below.
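
Something along these lines with the DataFrame API, assuming a SparkSession `spark` and a DataFrame `df` with columns (id, cell, code, data); the column names, path, and candidate cells are illustrative, not part of lopq:

```python
from pyspark.sql import functions as F

# Write once, partitioned by coarse cell, so each cell gets its own directory.
df.write.partitionBy("cell").parquet("/tmp/lopq_index")

# At query time, read back only the candidate cells; partition pruning means
# Spark scans just those directories rather than the whole dataset.
index = spark.read.parquet("/tmp/lopq_index")
candidate_cells = [3, 17, 42]  # would come from the multisequence generator
candidates = index.where(F.col("cell").isin(candidate_cells))
```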

I'm interested in working on this, so I might try IndexedRDD or the Parquet solution and let you know.

Thanks for the great work!

@pumpikano
Collaborator

I recently needed better support for this myself, so I added some utilities to launch a search cluster on Spark. It is a bit experimental at the moment, but you can find it in the spark-search-cluster branch.

https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py

Usage example and documentation are in that file. Given an RDD of lopq codes, it creates a mapPartitions job that launches a search server on every partition in the cluster. Those servers register themselves with the driver when they start up. After all partitions have registered, you can query them in parallel from the driver, bypassing Spark completely. There is also a small utility that wraps a server around this functionality so that the cluster can be queried from, e.g., a webpage.

This is intended for experimentation and evaluation and certainly not for live production use. I have found this approach to work very well for the cases I have tested. The largest index size I have tested was 100 million items and the largest cluster size was 480 partitions. Queries that rank a few tens of thousands of points run in 2-3 seconds.

It is based on a set of primitives I wrote to aid in launching clusters like this on Spark: https://github.com/pumpikano/spark-partition-server

If you find it useful, let me know!

@MLnick

MLnick commented Mar 8, 2017

Hi there. This library looks cool - just wondering if it supports cosine similarity and dot-product (max inner product search) - it seems to only mention Euclidean distance?

@skamalas

skamalas commented Mar 8, 2017 via email
