LSH: solve the EXACT Near Neighbor Search? - data-mining

I'm curious whether it's possible to find exact matches using LSH.
On the MIT website about LSH they state:
Locality-Sensitive Hashing (LSH) is an algorithm for solving the approximate or exact Near Neighbor Search in high dimensional spaces
https://www.mit.edu/~andoni/LSH/
I did some searching around the internet and Google Scholar, but there seems to be no sign of it. Does anyone know if it's possible, and can you point me to a paper about it? Much appreciated.

You have to iterate through all cells that overlap your query range.
Then you'll find all neighbors. But of course this gets more costly, in particular with high-dimensional data or large query ranges. If your query range is small, you may get away with just a few cells.
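For concreteness, here is a minimal Python sketch of that cell-iteration idea (not the MIT E2LSH code; GridIndex, cell_width and range_query are illustrative names, and math.dist needs Python 3.8+): points are bucketed into axis-aligned grid cells, and an exact range query enumerates every cell that could intersect the query ball and scans only the points stored there.

import itertools
import math
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_width):
        self.w = cell_width
        self.cells = defaultdict(list)   # cell key -> points in that cell

    def _key(self, p):
        return tuple(int(math.floor(x / self.w)) for x in p)

    def insert(self, p):
        self.cells[self._key(p)].append(p)

    def range_query(self, q, radius):
        """Return all points within `radius` of q (exact, no false negatives)."""
        lo = [int(math.floor((x - radius) / self.w)) for x in q]
        hi = [int(math.floor((x + radius) / self.w)) for x in q]
        found = []
        # The number of enumerated cells grows exponentially with the dimension,
        # which is exactly the cost mentioned above.
        for key in itertools.product(*(range(l, h + 1) for l, h in zip(lo, hi))):
            for p in self.cells.get(key, ()):
                if math.dist(p, q) <= radius:
                    found.append(p)
        return found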

There are a lot of heuristic approaches, but if you want something really state of the art, check "Confirmation Sampling for Exact Nearest Neighbor Search".

Related

OpenCv/C++ - Find similarities from a picture with a big database easily

I would like to compare a query image with the pictures in a database (about 2000).
Before posting on this website I read a lot of papers about methods for matching a picture against a big database, and a lot of posts on Stack Overflow.
Concerning the papers, there is some interesting material, but it is quite technical and it is difficult to understand the algorithms well. (I have only just started to specialize in this field.)
Posts (the most interesting):
Simple and fast method to compare images for similarity ;
Nearest neighbors in high-dimensional data? ;
How to understand Locality Sensitive Hashing? ;
Image fingerprint to compare similarity of many images ;
C++/SIFT/SQL - If there a way to compare efficiently a SIFT descriptor of an image with a SIFT descriptor in a SQL database?
Papers:
Object retrieval with large vocabularies and fast spatial matching,
Image Similarity Search with Compact Data Structures,
LSH,
Near Duplicate Image Detection min-Hash and tf-idf Weighting
Vocabulary tree
Aggregating locals descriptors
But I'm still confused.
The first thing I did was implement BoW. I trained the Bag of Words (with ORB as detector and descriptor, and VLAD features) on 5 classes in order to test its efficiency. After a long training, I ran it. It worked well, with an accuracy of 94%. That's pretty good.
But there is a problem for me:
I don't want to do classification. In my database, I'll have about 2000 different pictures. I just want to find the best matches between my query and the database.
So if I have 2000 different pictures, logically I would have to consider these 2000 pictures as 2000 different classes, and obviously that's impossible...
For this first approach, do you agree with me that it's obviously not the best method for what I would like to do?
Maybe there is another way to use BoW in order to find similarities in the database?
The second thing I did is "simpler".
I compute the descriptors of my query. Then I loop over my whole database, compute the descriptors of each picture and add each set of descriptors to a vector.
std::vector<cv::Mat> all_descriptors_database;
cv::Ptr<cv::ORB> orb = cv::ORB::create();
for (int i = 0; i < 2000; ++i) {
    // database_paths holds the file names of the 2000 database images
    cv::Mat image = cv::imread(database_paths[i]);
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    orb->detectAndCompute(image, cv::noArray(), keypoints, descriptors);
    all_descriptors_database.push_back(descriptors);
}
At the end I have a big vector which contains all the descriptors of the whole database (and the same for all the keypoints).
Then, this is where I get confused.
At the beginning, I wanted to compute the matching inside the loop, that is to say, for each image in the database, compute its descriptors and match them against the query. But it took a lot of time.
So after reading a lot of papers about how to find similarities in big databases, I found the LSH algorithm, which seems to be appropriate for that kind of search.
Therefore I wanted to use this method.
So inside my loop I did something like this:
//Create Flann LSH index
cv::flann::Index flannIndex(all_descriptors_database.at(i), cv::flann::LshIndexParams(12, 20, 2), cvflann::FLANN_DIST_HAMMING);
cv::Mat results, dists;
int k=2; // find the 2 nearest neighbors
// search (nearest neighbor)
flannIndex.knnSearch(query_descriptors, results, dists, k, cv::flann::SearchParams() );
However, I have some questions:
It took more than 5 seconds to loop over my whole database (2000 images), whereas I thought it would take less than 1 second (in the papers they have huge databases, unlike me, and LSH is more efficient). Did I do something wrong?
I found some libraries on the internet which implement LSH, like http://lshkit.sourceforge.net/ or http://www.mit.edu/~andoni/LSH/ . So what is the difference between these libraries and the four lines of code I wrote using OpenCV? I checked the libraries, and for a kind of beginner like me they were very difficult to use. I got a bit confused.
The third thing:
I wanted to compute a kind of fingerprint of the descriptors of each picture (in order to compute the Hamming distance against the database), but it seems to be impossible to do that. OpenCV / SURF How to generate a image hash / fingerprint / signature out of the descriptors?
So for 3 days I've been blocked on this task. I don't know whether I'm on the wrong track or not.
Maybe I missed something.
I hope this is clear enough for you. Thanks for reading.
Your question is kind of big. I'll give you some hints, though.
Bag of Words can work, but the classification step is unnecessary. A BoW pipeline typically consists of:
keypoint detection - ORB
keypoint description (feature extraction) - ORB
quantization - VLAD (Fisher encoding might be better, but plain old k-means might be enough in your case)
classification - you probably can skip this stage
You can treat the quantization result (e.g. the VLAD encoding) of each image as its fingerprint. Computing the distance between fingerprints yields a similarity measure. You still have to do 1-vs-all matching, though, which becomes tremendously expensive when your database gets big enough.
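For illustration only, a minimal numpy sketch of that fingerprint matching, assuming each image's VLAD encoding has already been computed and L2-normalized (database_vlad, query_vlad and rank_by_similarity are illustrative names, not part of any library):

import numpy as np

def rank_by_similarity(query_vlad, database_vlad, top_k=10):
    # database_vlad: (N, D) array of per-image VLAD vectors; query_vlad: length-D vector.
    # With L2-normalized vectors, Euclidean distance and cosine similarity give the same ranking.
    dists = np.linalg.norm(database_vlad - query_vlad[None, :], axis=1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]   # indices of the best matches and their distances

This is still 1-vs-all, but a single vector comparison per image is far cheaper than descriptor-level matching.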
I didn't get your point.
I'd suggest reading G. Hinton's papers (e.g. this one) on dimensionality reduction with deep autoencoders and convolutional neural networks. He boasts of beating LSH. As for the tools, I'd recommend taking a look at BVLC's Caffe, a great neural network library.

Clustering a list of dates

I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.
I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.
Can anyone please help with a simple example on how to use something like numpy or scipy to do this?
k-means is exclusively for coordinates. And more precisely: for continuous and linear values.
The reason is the mean function. Many people overlook the role of the mean in k-means (despite it being in the name...).
On non-numerical data, how do you compute the mean?
There exist some variants for binary or categorical data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).
It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
In general, even if you projected your data into unix time (seconds since 1.1.1970), k-means will likely only return mediocre results for you. The reason is that it will try to make the three intervals have the same length.
Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.
You may however want to have a look at KDE; and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimation, and look for the largest increase / decrease, or estimate an "average" level, and look for the longest above-average interval).
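As a hedged sketch of that KDE idea (scipy-based; it assumes `dates` is a list of datetime objects and that the estimated density really has at least two valleys to cut at):

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

t = np.array([d.timestamp() for d in dates])          # seconds since the epoch
kde = gaussian_kde(t)
grid = np.linspace(t.min(), t.max(), 1000)
density = kde(grid)

# Local minima of the density are candidate boundaries between the groups.
min_idx = argrelextrema(density, np.less)[0]
deepest = min_idx[np.argsort(density[min_idx])[:2]]   # the two deepest valleys
cut1, cut2 = sorted(grid[deepest])

before = [d for d in dates if d.timestamp() < cut1]
during = [d for d in dates if cut1 <= d.timestamp() < cut2]
after  = [d for d in dates if d.timestamp() >= cut2]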
Here are some workaround methods that may not be the best answer but should help.
You can plot the dates as durations from a starting date (such as one week), i.e. convert each date to a numeric representation of the time in minutes or hours from that starting point.
These would all plot along an x-axis, but k-means is still possible and the clustering is still visible on a graph.
Here are more examples with numpy: Python k-means algorithm
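A minimal sketch of that convert-then-cluster workaround, assuming `dates` is a list of datetime objects and using scipy's kmeans2 (any k-means implementation would do):

import numpy as np
from scipy.cluster.vq import kmeans2

start = min(dates)
minutes = np.array([(d - start).total_seconds() / 60.0 for d in dates])   # 1-D feature

centroids, labels = kmeans2(minutes, 3, minit='points')   # 3 clusters on the 1-D data
groups = [[d for d, lab in zip(dates, labels) if lab == c] for c in range(3)]

As noted in the other answer, k-means will tend to produce groups of similar extent, so treat this as a quick baseline rather than a reliable before/during/after split.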

Minimizing Sum of Distances: Optimization Problem

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
k=1
Ans: {7}
This is what I've been thinking: For k=1, we can simply find out the median of the set, which would give the optimal location of the warehouse. However, for k>1, the given set should be divided into 'k' subsets (disjoint, and of contiguous elements of the superset), and median for each subset would give the warehouse locations. However, I don't understand on what basis the 'k' subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..
The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k, minimising this is the same as minimising the average) or the maximum distance so far to the closest warehouse.
At each stage you have worked out the answers for the leftmost N joints and have them indexed by number of warehouses used and position of the rightmost warehouse - you need to save only the best cost. Now consider the next joint and work out the best solution for N+1 joints and all possible values of k and rightmost warehouse, using the answers you have stored for N joints to speed this up. Once you have worked out the best cost solution covering all the joints you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the solution that has that warehouse as the rightmost joint and find out what solution that was based on. That gives you one more rightmost warehouse - and so you can work your way back to the location of all the warehouses for the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each of them based on considering no more than Nk previous solutions, so I reckon the cost is O(kN^2).
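As a sketch only: one standard way to implement this left-to-right dynamic programme is to phrase it as partitioning the sorted joint coordinates into k contiguous segments, each served by a warehouse at that segment's median (function and variable names below are illustrative; cost is O(kN^2) as estimated above):

def place_warehouses(joints, k):
    xs = sorted(joints)
    n = len(xs)
    prefix = [0]
    for x in xs:                                   # prefix sums for O(1) segment costs
        prefix.append(prefix[-1] + x)

    def seg_cost(i, j):
        """Total distance of joints xs[i..j] to a warehouse at their median."""
        m = (i + j) // 2
        left = xs[m] * (m - i + 1) - (prefix[m + 1] - prefix[i])
        right = (prefix[j + 1] - prefix[m + 1]) - xs[m] * (j - m)
        return left + right

    INF = float("inf")
    # dp[j][m]: best cost for the first j joints using m warehouses
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    choice = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for j in range(1, n + 1):
        for m in range(1, min(k, j) + 1):
            for i in range(m, j + 1):              # last segment covers joints i..j (1-based)
                cand = dp[i - 1][m - 1] + seg_cost(i - 1, j - 1)
                if cand < dp[j][m]:
                    dp[j][m], choice[j][m] = cand, i
    # Walk the stored choices backwards to recover the warehouse positions.
    result, j, m = [], n, k
    while m > 0:
        i = choice[j][m]
        result.append(xs[(i - 1 + j - 1) // 2])
        j, m = i - 1, m - 1
    return sorted(result)

On the sample above, place_warehouses([1, 3, 4, 5, 7, 7, 8, 10, 11], 1) returns [7]. For the min-max variant mentioned in the edit, the same DP works if seg_cost instead returns the segment's maximum distance to its best warehouse and the per-segment costs are combined with max rather than summed.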
This is NOT a clustering problem, it's a special case of a facility location problem. You can solve it using a general integer / linear programming package, but because the problem is on a line, there may be more efficient (and less expensive software-wise) algorithms that would work. You might consider dynamic programming since there are probably combination of facilities that could be eliminated rather quickly. Look into the P-Median problem for more info.

Finding the spread of each cluster from Kmeans

I'm trying to detect how well an input vector fits a given cluster centre. I can find the best match quite easily (the centre with the minimum Euclidean distance to the input vector is the best); however, I now need to work out how good a match that is.
To do this I need to find the spread (standard deviation?) of the vectors which build up the centroid, then see if the distance from my input vector to the centre is less than the spread. If it's more than the spread, then I should be able to say that I have no cluster that fits it (given that the best one doesn't fit the input vector well).
I'm not sure how to find the spread per cluster. I have all the centre vectors, and all the training vectors are labelled with their closest cluster, I just can't quite fathom exactly what I need to do to get the spread.
I hope that's clear? If not I'll try to reword it!
TIA
Ian
Use the distance function and calculate the distance from your center point to each labeled point, then take the root mean square of those distances. That gives you the standard deviation of the cluster about its centre (the plain mean of the distances also works as a simple spread measure).
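A small numpy sketch of that spread check, assuming X is the matrix of training vectors, labels their cluster assignments and centers the array of cluster centres (all numpy arrays; the names are illustrative):

import numpy as np

def cluster_spreads(X, labels, centers):
    spreads = np.empty(len(centers))
    for c in range(len(centers)):
        d = np.linalg.norm(X[labels == c] - centers[c], axis=1)
        spreads[c] = np.sqrt(np.mean(d ** 2))      # RMS distance ~ standard deviation
    return spreads

def fits(x, centers, spreads):
    d = np.linalg.norm(centers - x, axis=1)
    best = int(np.argmin(d))
    return best, d[best] <= spreads[best]          # True if x is within one spread of the best centre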
If you switch to using a different algorithm, such as Mixture of Gaussians, you get the spread (e.g., std. deviation) as part of the model (clustering result).
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://en.wikipedia.org/wiki/Mixture_model
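For comparison, a hedged scikit-learn sketch of the Mixture-of-Gaussians route (any GMM implementation exposes something similar); X is the training data and x a new input vector, both assumed to be numpy arrays:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=5, covariance_type='diag').fit(X)   # n_components chosen arbitrarily for the sketch
centres = gmm.means_                              # one centre per component
spreads = np.sqrt(gmm.covariances_)               # per-dimension std. deviation per component
how_well = gmm.score_samples(x[None, :])          # log-likelihood of the new vector under the model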

Generating contour lines from regularly spaced data

I am currently working on a data visualization project. My aim is to produce contour lines, in other words iso-lines, from gridded data. The data can be temperature, weather data or any other kind of environmental parameter; the only condition is that it must be regularly spaced.
I searched the internet, however I could not find a good algorithm, pseudo-code or source code for producing contour lines from grids.
Does anybody know of a library, source code or an algorithm for producing contour lines from gridded data?
It would be good if your suggestion had good runtime performance; I don't want my users to wait too long :)
Edit: thanks for the responses, but isolines have some constraints, for example they should not intersect,
so just generating Bezier curves does not accomplish my goal.
See this question: How to approximate a vector contour from an elevation raster?
It's a near duplicate, but uses quite different terminology. You'll find that cartography and computer graphics solve many of the same problems, but use different terminology for them.
There's some reasonably good contouring available in GNUplot; if you're able to use GPL code, that may help.
If your data is placed at regular intervals, this can be done fairly easily (assuming I understand your problem correctly). First you need to determine at what interval you want your contours. Next create the grid you are going to use to store the contour information (I'm assuming just a simple on/off or elevation-at-this-contour-level type of data), which should be one cell smaller in each dimension than the source data.
Now the trick here is to offset the two grids by half an interval (this won't actually show up in code like this, but it's the concept I'm dealing with here) and compare the 4 data points surrounding the current point in the contour grid you are calculating. If any of the 4 points are in a different interval range, then that 'pixel' in the contour grid should be set to true (or to the value of the contour range being crossed).
With this method, there will be a problem when the interval is too fine which will cause several contours to overlap onto each other.
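A minimal numpy sketch of that cell-marking scheme, assuming data is a 2-D array of regularly spaced values and interval is the contour spacing (the function name is illustrative):

import numpy as np

def contour_cells(data, interval):
    bands = np.floor(data / interval).astype(int)        # contour band of each sample
    # Each contour-grid cell sits half a step between samples; look at its 4 corners.
    corners = np.stack([bands[:-1, :-1], bands[:-1, 1:],
                        bands[1:, :-1],  bands[1:, 1:]])
    return corners.max(axis=0) != corners.min(axis=0)    # True where a contour passes through the cell

The result is one cell smaller than the input in each dimension, matching the offset-grid description above.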
As the link from Paul Tomblin suggests, Bezier curves (which are a subset of B-splines) are a ripe solution for your problem. If runtime performance is an issue, Bezier curves have the added benefit of being constructable via the very fast de Casteljau algorithm, instead of drawing them according to the parametric equations. On the off chance you're working with DirectX, it has a library function for the de Casteljau, but it should not be challenging to brew one yourself using the 1001 web pages that describe it.
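For reference, a tiny sketch of de Casteljau evaluation in plain Python (no DirectX; the names are illustrative): repeatedly interpolate between consecutive control points until a single point remains.

def de_casteljau(points, t):
    # points: list of (x, y) control points; t in [0, 1]
    pts = [tuple(p) for p in points]
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

# Sample a cubic Bezier at 50 parameter values:
curve = [de_casteljau([(0, 0), (1, 2), (3, 3), (4, 0)], i / 49) for i in range(50)]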