Anybody know a good way to do offline reverse geocoding? - geocoding

I need to do offline reverse geocoding, meant to run as cron jobs. The reverse geocoder should return the city and state for millions of records.
What services exist ( paid or free ) that can do 'it'?
I've looked at, but that's not exactly what I need.

Get a table of coordinates of all cities (and which state each city is in). For any coordinate, pick the closest city. For efficient lookups, index the table with an R-tree.
For links on R-tree etc, see Reverse Geocoding Without Web Access
Get cities table from geonames, see Given the lat/long coordinates, how can we find out the city/country?
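The closest-city idea can be sketched in a few lines. This is only a rough illustration: the three city rows are made-up sample data (a real table would come from a dataset such as GeoNames), the function names are my own, and it uses a plain linear scan where an R-tree or similar spatial index would go once the table gets large.

```python
import math

# Hypothetical sample rows: (city, state, latitude, longitude).
CITIES = [
    ("San Francisco", "CA", 37.7749, -122.4194),
    ("Sacramento", "CA", 38.5816, -121.4944),
    ("Portland", "OR", 45.5152, -122.6784),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_city(lat, lon, cities=CITIES):
    """Return (city, state) of the closest row; linear scan, no spatial index."""
    name, state, _, _ = min(cities, key=lambda c: haversine_km(lat, lon, c[2], c[3]))
    return name, state
```

For millions of records, the `min(...)` scan is the part a spatial index would replace; the rest stays the same.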

This question is quite old, but I have found another solution besides the popular R-Tree-based solution.
Treat the map as an image: you can run-length-encode the map into a (sorted) list of (x, y1, y2, loc) tuples at the desired resolution. To find the location of a point (a, b), simply find the row where x = a and y1 <= b <= y2.
This solution is efficient because the map has a fixed resolution. Compared to the R-Tree-based solution, it has the advantage of not introducing new libs into existing systems. The data can be queried efficiently by a regular SQL DB as long as (x, y1) is indexed, or by running binary searches on a static file.
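A minimal sketch of the binary-search variant, assuming the runs are sorted by (x, y1) and do not overlap within a row. The sample runs and the `locate` name are made up for illustration; a real table would be loaded from a file or database.

```python
from bisect import bisect_right

# Hypothetical run-length-encoded map: (x, y1, y2, loc), sorted by (x, y1).
RUNS = [
    (0, 0, 4, "A"),
    (0, 5, 9, "B"),
    (1, 0, 2, "A"),
    (1, 3, 9, "B"),
]
# Separate key list so bisect compares only (x, y1).
KEYS = [(x, y1) for x, y1, _, _ in RUNS]

def locate(a, b):
    """Return loc for grid point (a, b), or None if no run covers it."""
    i = bisect_right(KEYS, (a, b)) - 1  # last run whose (x, y1) <= (a, b)
    if i < 0:
        return None
    x, y1, y2, loc = RUNS[i]
    if x == a and y1 <= b <= y2:
        return loc
    return None
```

The same lookup in SQL would be a range query on an index over (x, y1), picking the last row with x = a and y1 <= b and then checking y2.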
There is a website that offers such a database and free demos of how this works. Disclosure: I'm working on this site.


Vector embeddings to mimic a ranking algorithm

Consider a search system where the user submits a query ‘query’ and retrieves products based on some ranking algorithm. Assume that these products are ordered according to their quality such that p_0, p_1, …, p_10 and so on.
I would like to generate vector embeddings that mimic this ranking algorithm. The closest product vector to a query vector should ideally be p_0, the next one should be p_1 and so on.
I have tried building word2vec embeddings for products by feeding products that have appeared in the same search session as sentences. Then, I have calculated the weighted average of product vectors to find query vectors, so that the query vector ends up closer to the top result. Although the closest result is usually the best result for a given query, the subsequent results include some results that would never appear as a top result.
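To make the setup concrete, here is a toy sketch of the weighted-average step. The 2-D product vectors and the 1/(rank+1) weighting are assumptions chosen for illustration; the real vectors come from word2vec and the actual weights may differ.

```python
import math

# Hypothetical 2-D product vectors standing in for trained word2vec vectors.
PRODUCT_VECTORS = {"p0": (1.0, 0.0), "p1": (0.8, 0.6), "p2": (0.0, 1.0)}

def weighted_average(vectors, weights):
    """Component-wise weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return tuple(sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Results shown for one query, best first; weights decay with rank (assumed scheme).
session_results = ["p0", "p1"]
weights = [1.0 / (rank + 1) for rank in range(len(session_results))]
query_vector = weighted_average([PRODUCT_VECTORS[p] for p in session_results], weights)

# Rank all products by similarity to the query vector.
ranking = sorted(PRODUCT_VECTORS, key=lambda p: cosine(query_vector, PRODUCT_VECTORS[p]), reverse=True)
```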
Is there a trick that the word2vec can learn the ranking algorithm or any other techniques that I can try? I have looked into multi-dimensional vector scaling with non-metric distances but it did not seem scalable to me for more than 100Ks of products.
There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.
Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.
You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
And even if just working with natural-language product descriptions – like product names, or descriptions, or reviews – there are other more-sophisticated text-representations that can be trained or used – such as sentence/document embeddings using deeper-networks than plain word2vec.
Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process.
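As a rough illustration of what learning-to-rank means in its simplest pairwise form, the sketch below fits a linear scoring function with perceptron updates so that better-ranked items score higher than worse-ranked ones. The feature vectors are invented, and this is just one simple technique from that family, not a recommendation of a specific method.

```python
from itertools import combinations

# Hypothetical training data: (product, feature vector), listed best to worst.
RANKED = [
    ("p0", (0.9, 0.8)),
    ("p1", (0.7, 0.9)),
    ("p2", (0.5, 0.4)),
    ("p3", (0.2, 0.3)),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_pairwise(ranked, max_epochs=100):
    """Learn weights w so that dot(w, better) > dot(w, worse) for every pair."""
    w = [0.0] * len(ranked[0][1])
    # combinations() keeps list order, so the first item of each pair ranks higher.
    pairs = [(better, worse) for (_, better), (_, worse) in combinations(ranked, 2)]
    for _ in range(max_epochs):
        mistakes = 0
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            if dot(w, diff) <= 0:  # pair is mis-ordered (or tied): nudge w
                w = [wi + di for wi, di in zip(w, diff)]
                mistakes += 1
        if mistakes == 0:  # every pair is ordered correctly
            break
    return w

weights = train_pairwise(RANKED)
learned_order = [name for name, feats in sorted(RANKED, key=lambda item: dot(weights, item[1]), reverse=True)]
```

Real systems replace the two toy features with things like text similarity, price, ratings, or click counts, and usually use a stronger model than a linear perceptron.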
To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.
For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required – with things like word-vector-modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or too many.
Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.

Building delivery list based on distance and point in polygons in LAMP App

Building a LAMP services application that will have 10000's of Vendors providing delivery to Customers, and upon the Customer entering their address, we need to generate a delivery list of Vendors which can provide service to that location. Each Vendor will have a delivery boundary that will be defined by one of these three criteria:
A. List of Zip Codes
B. Distance of delivery point from Vendor in miles (X) (point to point)
C. Defined polygon drawn (most likely) in GME and imported as KML (point in polygon)
A is straightforward, but after extensive research we are unsure of what would be the most efficient and scalable way to approach B and C. Should we use MySQL to store the data and calculate results using code/classes/library, or should we setup a spatial DB like PostGIS to handle all geo storage and calculation, and what about API solutions for some or all, etc.?
Here is our current line of thinking in broad strokes:
Store polygon data (as KML?)
Convert Vendor address to verified lat/long coordinates
Convert B and C boundaries to zip code array to generate subset of likely matches
Convert Customer address to verified lat/long coordinates
The algorithm would then have 3 parts to return a master delivery list:
Part (a):
Query all A Vendors who deliver to Customer's delivery zip code
Part (b):
Filter out all B Vendors that don't have the Customer's zip in likely match array
Query that subset of B Vendors where the distance between coordinates is less than specified
Part (c):
Filter out all C Vendors that don't have the Customer's zip in likely match array
Query that subset of C Vendors where Customer's coordinate is within the polygon
Seeking advice on best practice and what tools/technologies/APIs to use for each step, starting with address verification, lat/long of the verified addresses, auto-generating the zip array based on spatial data of B and C, calculating point-to-point distance, creating polygons, storing/converting polygon data (using KML or what?), and calculating point-in-polygon. Pointers to posts/research/resources very welcome!
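For reference, the two geometric checks in parts (b) and (c) are small enough to sketch directly. The code below is Python rather than PHP and is only an illustration under simplifying assumptions: the function names are made up, the polygon is treated as flat (fine for delivery-sized areas away from the poles and the antimeridian), and points exactly on an edge are not handled specially.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def distance_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles (criterion B)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def point_in_polygon(lat, lon, polygon):
    """Ray-casting test (criterion C); polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's longitude?
        if (lon1 > lon) != (lon2 > lon):
            # Latitude at which the edge crosses that longitude.
            cross_lat = lat1 + (lon - lon1) * (lat2 - lat1) / (lon2 - lon1)
            if lat < cross_lat:
                inside = not inside
    return inside
```

A spatial database such as PostGIS provides equivalent distance and containment operations with index support, which matters more as the vendor count grows.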

highlight buildings based on value and show in browser

I want to build a website with a map based on openstreetmap that colors buildings based on their potential average annual yield of solar power. I have the energy data for individual houses.
My question is now, can I assign each house (identified by street name and number) a value and the house can then be colored based on this value in the browser?
I have little to no experience with openstreetmap and would be happy about hints into the right direction.
So you need an OSM dataset, filtered for building=* ways to get the building outlines (e.g. with osmosis). Then do a second run to filter for addr:* tags on nodes and merge them with the building outlines from step 1. Be aware of conflicts, and that one building can have multiple addresses. You then have a dataset with normalized addresses and need to create a lookup structure such as a hashmap to get a mapping for your solar data: addr:street x addr:housenumber -> building id
(very raw idea on how to do it)
IMHO, mixing external data sources with data under the copyleft Open Database License means that you need to release your combined dataset under the ODbL as well.
Also keep in mind that not every address is currently at OSM and the existing ones can be wrong!
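The lookup step could look roughly like this. Everything here is illustrative: the building ids, addresses, yield numbers, colour thresholds and function names are invented, and the normalization is deliberately naive.

```python
def normalize(street, housenumber):
    """Naive address key: lowercase, collapse whitespace."""
    return (" ".join(street.lower().split()), housenumber.strip().lower())

# Hypothetical OSM extract: (building id, addr:street, addr:housenumber).
BUILDINGS = [
    (101, "Main Street", "12"),
    (102, "Main Street", "14a"),
    (102, "Side Lane", "1"),  # one building with a second address
]
ADDRESS_TO_BUILDING = {normalize(street, number): bid for bid, street, number in BUILDINGS}

# Hypothetical solar data: (street, house number, yield in kWh/year).
SOLAR = [
    ("main street", "12", 4200.0),
    ("Main  Street", "14A", 2600.0),
    ("Unknown Road", "5", 3000.0),
]

def colour_for(yield_kwh):
    """Map a yield to a fill colour; thresholds are arbitrary examples."""
    if yield_kwh >= 4000:
        return "#d73027"
    if yield_kwh >= 2500:
        return "#fee08b"
    return "#4575b4"

building_colours = {}
unmatched = []
for street, number, yield_kwh in SOLAR:
    building_id = ADDRESS_TO_BUILDING.get(normalize(street, number))
    if building_id is None:
        unmatched.append((street, number))  # address not (yet) in OSM
    else:
        building_colours[building_id] = colour_for(yield_kwh)
```

The resulting id-to-colour mapping is what the browser side would use to style the building outlines; the `unmatched` list shows which addresses need manual attention.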

Check a fingerprint in the database

I am saving the fingerprints in a "blob" field. Is the only way to compare these prints to retrieve all prints saved in the database and then build a list to check with the function "identify_finger"? Or can I check directly in the database using a SELECT?
I'm working with libfprint. In this code the verification is done in a vector:
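    def test_identify():
        cur = DB.cursor()
        # Load every stored print; identification compares against all of them.
        cur.execute('select id, fp from print')
        gallery = []
        for row in cur.fetchall():
            # Rebuild a libfprint print object from the stored blob.
            data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
            gallery.append(pyfprint.Fprint(data_ptr = data))
        # Scan a finger and match it against the whole gallery.
        n, fp, img = FingerDevice.identify_finger(gallery)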
There are two fundamentally different ways to use a fingerprint database. One is to verify the identity of a person who is known through other means, and one is to search for a person whose identity is unknown.
A simple library such as libfprint is suitable for the first case only. Since you're using it to verify someone you can use their identity to look up a single row from the database. Perhaps you've scanned more than one finger, or perhaps you've stored multiple scans per finger, but it will still be a small number of database blobs returned.
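For the verification case, the database part reduces to fetching only that person's rows. A small sketch, using an in-memory sqlite3 database as a stand-in for whatever database you actually use; the table and column names follow the question, but treating `id` as the person's identity (with possibly several scans per person) and the `load_prints_for` helper are assumptions.

```python
import sqlite3

def load_prints_for(conn, person_id):
    """Return only the stored print blobs belonging to one known person."""
    cur = conn.execute("SELECT fp FROM print WHERE id = ?", (person_id,))
    return [row[0] for row in cur.fetchall()]

# Stand-in database with placeholder blobs instead of real print data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE print (id INTEGER, fp BLOB)")
conn.executemany(
    "INSERT INTO print (id, fp) VALUES (?, ?)",
    [(1, b"scan-a"), (1, b"scan-b"), (2, b"scan-c")],
)

# Only these few blobs would be turned into print objects and compared.
candidates = load_prints_for(conn, 1)
```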
A fingerprint search algorithm must be designed from the ground up to narrow the search space, to compare quickly, and to rank the results and deal with false positives. Just as a Google search may come up with pages totally unrelated to what you're looking for, so too will a fingerprint search. There are companies that devote their entire existence to solving this problem.
Another way would be to have a mysql plugin that knows how to work with fingerprint images and select based on what you are looking for.
I really doubt that there is such a thing.
You could also try to parallelize the fingerprint comparison, i.e. by calling:
in parallel, on different cores/machines
You can't check directly from the database using a SELECT because each scan is different and will produce different blobs. libfprint does the hard work of comparing different scans and judging whether they are from the same person or not.
What zinking and Tudor are saying, I think, is that if you understand how that judgement process works (which is, by the way, by minutiae comparison), you can develop a method of storing the relevant data for the process (the minutiae, maybe?) in the database and then a method for fetching the relevant values, maybe a kind of index or some type of extension to the database.
In other words, you would have to reimplement the libfprint algorithms in a more complex (and beautiful) way, instead of just accepting the libfprint method of comparing the scan with all stored fingerprint in a loop.
Other solutions for speeding up your program:
use C:
I only know sufficient C to write kind of hello-world programs, but it was not hard to write code in pure C to use the fp_identify_finger_img function of libfprint and I can tell you it is much faster than pyfprint.identify_finger.
You can continue doing the enrollment part of the stuff in python. I do it.
use a time / location based SELECT:
If you know your users are more likely to scan their fingerprints at some times than others, or at some places than others (maybe arriving at work at a certain time and scanning their fingers, or leaving, or entering the building by one gate or another), you can collect data at each scan to estimate those probabilities, and build parallel tables that sort the users by their probability of arriving at each time and location.
We know that identify_finger tries to identify fingers in a loop over the fingerprint objects you provide in a list, so we can use that and pass it the objects sorted so that the most likely user for that time and location comes first in the list, and so on.
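The ordering step could be sketched like this. The scan log, user names and `order_candidates` function are hypothetical; in practice the counts would live in the database, and the sorted ids would be used to build the list of print objects handed to identify_finger.

```python
from collections import Counter

# Hypothetical scan history: (user id, hour of day, location).
SCAN_LOG = [
    ("alice", 8, "gate-1"),
    ("alice", 8, "gate-1"),
    ("bob", 8, "gate-1"),
    ("carol", 17, "gate-2"),
    ("bob", 17, "gate-2"),
]

def order_candidates(user_ids, hour, location, log=SCAN_LOG):
    """Sort users so those seen most often at this hour and place come first."""
    counts = Counter(user for user, h, loc in log if h == hour and loc == location)
    # sorted() is stable, so users with equal counts keep their input order.
    return sorted(user_ids, key=lambda user: -counts[user])
```

For example, at 8 o'clock at "gate-1" the sample log puts alice first, then bob, then carol, so the likeliest matches are compared first.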

How does data mining actually work?

Suppose I want to do some data mining on the database of a supermarket. What does that actually mean?
1) What will the output/results be like?
2) Will the output be different every day or change over time?
3) Before applying data mining, do I need to know what I want or will data mining give everything I want automatically?
Data Mining is a general category of techniques that can be applied to different kinds of datasets, just like programming is a general category of techniques that can be applied using different languages to do different things.
None of your questions make any sense.
A1: Data mining will give you accurate reports in answer to your queries on the supermarket database.
A2: Yes, because data mining depends on analysis over time; how the output changes depends on the problems or goals you want to address. If your database is very large and you have built the data warehouse properly, you will get different output over time.
A3: Yes, you should first determine which problems you want to mine, then use data mining tools to get the results or indicators automatically.
To answer your first question: For the case of supermarket customer data, I could imagine the following questions:
how many products X are usually sold on Fridays ?
(helps you to determine how many X you should have in stock)
which customers bought product X often in the last month/year ?
Useful when you introduce a new X-like product: send advertising material (which has a given cost) only to those customers.
given a customer buys product X (e.g. beer) what's the probability that he/she also buys product Y (e.g. chips) ?
Useful for the following: make sure X and Y are never on promotional offer at the same time (X and Y are often bought together). Get the customers into the store by offering a rebate on X, knowing they'll also buy Y at the same time. Or: put a high-priced X-like product right next to Y, putting the cheaper X somewhere else.
which neighborhoods have the smallest number of customers ?
helps to find out which neighborhoods you could target with advertising to bring more customers into the store.
Often, by 'asking certain questions to the data' one discovers some features and comes up with new questions.
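As a small illustration of the beer/chips question above, the conditional probability can be estimated directly from past baskets. The baskets and the `confidence` function name are made up for this sketch; in association rule mining this quantity is called the confidence of the rule X -> Y.

```python
# Hypothetical purchase history: one set of products per checkout.
BASKETS = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"chips", "bread"},
    {"bread", "milk"},
]

def confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket); None if x was never bought."""
    with_x = [basket for basket in baskets if x in basket]
    if not with_x:
        return None
    return sum(1 for basket in with_x if y in basket) / len(with_x)

beer_then_chips = confidence(BASKETS, "beer", "chips")  # 2 of the 3 beer baskets
```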
Data mining is a set of techniques. It refers to discovering interesting and unexpected patterns in data.
If you want to apply some data mining techniques, you need to know which one and you should know why. The answer to questions 1, 2 and 3 depends on the techniques that you choose.
For example, if I want to find associations between items sold in a supermarket, I may use association rule mining. If I want to find groups of similar customers, I might use a clustering algorithm, etc.
There is not just ONE technique in data mining.