Google - translating latitude, longitude into Geotarget locations - geocoding

I have a bunch of locations (cities, counties, street level addresses) that I want to translate into the 100k locations found here: https://developers.google.com/adwords/api/docs/appendix/geotargeting
Let's say I've got:
location          | latitude | longitude
------------------|----------|----------
New York, NY, USA | 40.70    | -74.00
And I want to map this to:
"21167","New York","New York, United States","US","State"
Is there a way of doing that? Parsing text and string-matching it like that isn't an option.
The Google Geocode API only gives me a list of coordinates based on an address, not an ID from that file.
One thing I thought of would be to take each canonical_name from the file, open up Maps, get the coordinates for all locations and map my lat, long addresses to the nearest point.
Could it be done in a better/more accurate way?

Long story short, this can't be done for free.
I made a Selenium script that inputs the address into the Maps search box, hits Enter (which I believe centers the map on the coordinates), then pulls the coordinates from somewhere within the HTML or the URL it redirects to.
For stuff I couldn't find a match for, I got the latitude & longitude for the locations in the Geotargets CSV and picked the nearest one to my address.
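For that nearest-point fallback, here is a minimal sketch, assuming you have already geocoded every canonical_name into a CSV (the file name and column names below are hypothetical):

import csv
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometers.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Hypothetical file with columns: criteria_id, canonical_name, lat, lng
with open("geotargets_geocoded.csv", newline="", encoding="utf-8") as f:
    geotargets = [(row["criteria_id"], row["canonical_name"],
                   float(row["lat"]), float(row["lng"]))
                  for row in csv.DictReader(f)]

def nearest_geotarget(lat, lng):
    # Return (criteria_id, canonical_name) of the closest geotarget.
    best = min(geotargets, key=lambda g: haversine_km(lat, lng, g[2], g[3]))
    return best[0], best[1]

print(nearest_geotarget(40.70, -74.00))  # e.g. ("21167", "New York"), given the row above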

Related

Sorting query by distance requires reading entire data set?

To perform geoqueries in DynamoDB, there are libraries in AWS (https://aws.amazon.com/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/). But to sort the results of a geoquery by distance, the entire dataset must be read, correct? If a geoquery produces a large number of results, there is no way to paginate that (on the backend, not to the user) if you're sorting by distance, is there?
You are correct. To sort all of the data points by distance from some arbitrary location, you must read all the data from your DynamoDB table.
In DynamoDB, you can only sort results using a pre-computed value that has been stored in the table and used as the sort key of the table or one of its indexes. If you need to sort by distance from a single fixed location known in advance, then you can precompute and store each item's distance to it as a sort key, and DynamoDB can do the sorting for you.
Possible Workaround (with limitations)
TL;DR: it's not such a bad problem if you can get away with only sorting the items that are within X km of an arbitrary point.
This still involves sorting the data points in memory, but it makes the problem easier by producing incomplete results (limiting the maximum range of the results).
To do this, you need the Geohash of your point P (from which you are measuring the distance of all other points). Suppose it is A234311. Then you need to pick what range of results is appropriate. Let's put some numbers on this to make it concrete. (I'm totally making these numbers up because the actual numbers are irrelevant for understanding the concepts.)
A - represents a 6400km by 6400km area
2 - represents a 3200km by 3200km area within A
3 - represents a 1600km by 1600km area within A2
4 - represents a 800km by 800km area within A23
3 - represents a 400km by 400km area within A234
1 - represents a 200km by 200km area within A2343
1 - represents a 100km by 100km area within A23431
Graphically, it might look like this:
View of A                        View of A23
|----------|-----------|         |----------|-----------|
|          | A21 | A22 |         |          |           |
|    A1    |-----|-----|         |   A231   |   A232    |
|          | A23 | A24 |         |          |           |
|----------|-----------|         |----------|-----------|
|          |           |         |          |A2341|A2342|
|    A3    |    A4     |         |   A233   |-----|-----|
|          |           |         |          |A2343|A2344|
|----------|-----------|         |----------|-----------|   ... and so on.
In this case, our point P is in A234311, which lies inside A2343. Suppose also that we want to get the sorted points within 400km. A2343 is 400km by 400km, so we need to load the results from A2343 and all of its 8-connected neighbors (A2341, A2342, A2344, A2334, A2332, A4112, A4121, A4122). Once we've loaded only those into memory, we calculate the distances, sort them, and discard any results that are more than 400km away.
(You could keep the results that are more than 400km away as long as the users/clients know that beyond 400km, the data could be incomplete.)
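A minimal sketch of that flow, where query_cell is a hypothetical stand-in for whatever per-cell DynamoDB query your geo library issues:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def sorted_points_within(p_lat, p_lon, center_cell, neighbor_cells, max_km=400):
    # Load the center cell plus its 8 neighbors, then sort in memory.
    items = []
    for cell in [center_cell] + list(neighbor_cells):
        items.extend(query_cell(cell))  # hypothetical: one DynamoDB query per cell
    for it in items:
        it["dist_km"] = haversine_km(p_lat, p_lon, it["lat"], it["lon"])
    # Drop anything beyond the radius the 9 cells are guaranteed to cover.
    return sorted((it for it in items if it["dist_km"] <= max_km),
                  key=lambda it: it["dist_km"])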
The hashing method that DynamoDB Geo library uses is very similar to a Z-Order Curve—you may find it helpful to familiarize yourself with that method as well as Part 1 and Part 2 of the AWS Database Blog on Z-Order Indexing for Multifaceted Queries in DynamoDB.
Not exactly. When querying by location you can query by a fixed partition key value and by the sort key, so you can limit your query's result set and also apply a little filtering.
I have been racking my brain while designing a DynamoDB geohash proximity locator service. For this example, customer_A wants to find all service providers_X in their area. All customers and providers have a 'g8' key that stores their precise geohash location (to 8 levels).
The accepted way to accomplish this search is to generate a secondary index from the main table with a less accurate geohash 'g4', which gives a broader area for the main query key. I am applying key overloading and composite key structures for a single-table design. The goal of this design is to return all the data required in a single query; secondary indexes can duplicate data by design (storage is cheap, but CPU and bandwidth are not).
GSI1PK    GSI1SK     providerId       Projected keys and attributes
--------------------------------------------------------------------
g4_9q5c   provider   pr_providerId1   name rating
g4_9q5c   provider   pr_providerId2   name rating
g4_9q5h   provider   pr_providerId3   name rating
Scenario 1: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK = g4_9q5c, and a list of two providers is returned, not the three I want.
But using geoHash.neighbor() will return the eight surrounding neighbors, like 9q5h (see reference below). That's great, because there is a provider in 9q5h, but it means I have to run nine queries: one on the center and eight on the neighbors, or run 1-N of them until I have the minimum number of results I require.
But which direction do you query second: NW, SW, E? That would require another level of hinting about which neighbor has more results, which you can't know up front unless you run a pre-query for weighted results. But then you run the risk of only returning favorable neighbors, as there could be new providers in previously unfavored neighbors. You could apply some ML and randomized queries into neighbors to check current counts.
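If your library has no geoHash.neighbor() equivalent, the eight neighbors can be computed from scratch: decode the cell to its center and size, then re-encode points offset one cell in each direction. A minimal sketch (it ignores wrap-around at the antimeridian and the poles):

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # the geohash alphabet

def gh_encode(lat, lon, precision):
    # Standard geohash: interleave longitude/latitude bisection bits.
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits, bit, even = [], 0, 0, True
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = bits * 2 + 1
            rng[0] = mid
        else:
            bits = bits * 2
            rng[1] = mid
        even, bit = not even, bit + 1
        if bit == 5:
            out.append(BASE32[bits])
            bits, bit = 0, 0
    return "".join(out)

def gh_decode(gh):
    # Return (center_lat, center_lon, cell_height_deg, cell_width_deg).
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    even = True
    for c in gh:
        d = BASE32.index(c)
        for shift in (4, 3, 2, 1, 0):
            rng = lon_rng if even else lat_rng
            mid = (rng[0] + rng[1]) / 2
            if (d >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            even = not even
    return ((lat_rng[0] + lat_rng[1]) / 2, (lon_rng[0] + lon_rng[1]) / 2,
            lat_rng[1] - lat_rng[0], lon_rng[1] - lon_rng[0])

def gh_neighbors(gh):
    # The 8 cells around gh, found by offsetting the center and re-encoding.
    lat, lon, dlat, dlon = gh_decode(gh)
    cells = {gh_encode(lat + i * dlat, lon + j * dlon, len(gh))
             for i in (-1, 0, 1) for j in (-1, 0, 1)}
    cells.discard(gh)
    return sorted(cells)

print(gh_neighbors("9q5c"))  # the 8 cells around 9q5c, at the same precision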
Before the above approach I tried this design.
GSI1PK   GSI1SK        providerId     Projected keys and attributes
--------------------------------------------------------------------
loc      g8_9q5cfmtk   pr_provider1
loc      g8_9q5cfjgq   pr_provider2
loc      g8_9q5fe954   pr_provider3
Scenario 2: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK = loc and GSI1SK is between g8_9q5ca and g8_9q5fz, and a list of three providers is returned, but a ton of data was pulled and discarded.
To achieve the above query, the between-X-and-Y sort criteria is composed as follows: 9q5c.neighbors().sorted() = 9q59, 9q5c, 9q5d, 9q5e, 9q5f, 9q5g, 9qh1, 9qh4, 9qh5. So we can just use X=9q59 and Y=9qh5, but there are over 50 (I really didn't count past 50) matching quadrants in such a lexicographic between condition.
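For reference, a sketch of that range query with boto3 (table and index names are hypothetical; the key layout matches the table above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("providers")  # hypothetical table name

resp = table.query(
    IndexName="GSI1",  # hypothetical index name
    KeyConditionExpression=(
        Key("GSI1PK").eq("loc")
        & Key("GSI1SK").between("g8_9q59", "g8_9qh5")  # lexicographic range
    ),
)
# Everything in the range comes back; out-of-radius items still have to be
# filtered out client-side, which is the "pulled and discarded" cost above.
providers = resp["Items"]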
Regarding the hash/size table above, I would recommend this reference instead: https://www.movable-type.co.uk/scripts/geohash.html
Geohash length   Cell width   Cell height
1                ≤ 5,000km    × 5,000km
2                ≤ 1,250km    × 625km
3                ≤ 156km      × 156km
4                ≤ 39.1km     × 19.5km
5                ≤ 4.89km     × 4.89km
...

What is the best option for creating a web app for searching through millions of addresses in a database?

I'm developing a Node.js web app that will allow users to search through every school in the world (~7 million) stored in a back-end PostgreSQL database.
UX
The user will select a location on Google Maps (with optional fields such as type of school, N number of schools to show, M radius in km), and the map will show the top N schools within M km. The location that the user selects may or may not be a valid address, so Google Maps will translate the user selected location into latitude and longitude, and my web app will call function findSchoolsByLocation(latitude, longitude, filterParams...) and return a JSON object of the data from PostgreSQL.
Data
The raw data I am given consists of the address and metadata about that school, like this:
| Primary Key | Address                                  | School Name             |
|-------------|------------------------------------------|-------------------------|
| ??????????? | 3210 Wimberly Rd, Amarillo TX 79109-3433 | University of Texas     |
| ??????????? | 5198 Jex St, Arlington, TX, 78019-4532   | Texas Elementary School |
After validating the address and metadata, is it better to 1) geocode all 7 million addresses as they are stored into PostgreSQL and use the latitude and longitude as the primary key, or 2) use the address as the primary key, with findSchoolsByLocation somehow finding the nearest N addresses solely from the string address, without latitude and longitude?
If 1), I'm considering using PostGIS on a local server (least code change), PostGIS on AWS RDS PostgreSQL to scale better (I'm not familiar with AWS), or the Google Geocode API (more accurate, but it is a web service). I need to geocode a huge number of addresses, but I only need to do it once; for subsequent changes I will just update the geocode for updated addresses (obviously not nearly as many). I've read about the benefits and downsides of using a web service vs. writing to the DB directly. Which is the better option for my use case?
I'm looking for paragraph responses here. I want to write a report that explains my decision process, alternative options, and how to handle risk and mistakes when implementing this web app, the geocoding, and the database design:
What should I do?
What would I do if I made a mistake in this decision? How would this be calculated risk-taking?
How would I handle conflict with my teammates in deciding which is the better solution?
Assuming that the data is provided in the format described:
| Primary Key | Address                                  | School Name             |
|-------------|------------------------------------------|-------------------------|
| ??????????? | 3210 Wimberly Rd, Amarillo TX 79109-3433 | University of Texas     |
| ??????????? | 5198 Jex St, Arlington, TX, 78019-4532   | Texas Elementary School |
What should I do?
I think the following solution will work just fine:
Extend your Postgres DB table as described with latitude and longitude columns. You don't have to use either of them as the primary key; you just need to index your table by these new columns.
Like this:
| Primary Key | Address                                  | School Name             | Latitude    | Longitude   |
|-------------|------------------------------------------|-------------------------|-------------|-------------|
| ??????????? | 3210 Wimberly Rd, Amarillo TX 79109-3433 | University of Texas     | ??????????? | ??????????? |
| ??????????? | 5198 Jex St, Arlington, TX, 78019-4532   | Texas Elementary School | ??????????? | ??????????? |
Populate the latitude and longitude columns from the address using a GIS tool or the Google Maps API. The majority should work fine, but you will have to fix some manually. You might also consider adding a GEO_ADDRESS column to your table, both for future use and to help generalize those manual fix-ups into an algorithm that works without user intervention regardless of which geocoding system is used.
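A sketch of that population step against the Google Geocoding API (hedged: no rate limiting, retries, or batching here, all of which you would want at 7 million rows):

import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(address, api_key):
    # Resolve one address to (lat, lng), or None if the API finds no match.
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": api_key})
    body = resp.json()
    if body["status"] != "OK":
        return None  # collect these for the manual fix-up pass
    loc = body["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# For each row still missing coordinates (DB cursor details omitted):
#   coords = geocode(row_address, API_KEY)
#   if coords: UPDATE schools SET latitude = ..., longitude = ... WHERE id = ...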
Now, given an address around which to show schools within radius R, you can find its geolocation as a latitude and longitude.
Using this geo location you can calculate a range of latitude (latitude_min and latitude_max) and longitude (longitude_min and longitude_max). You can do so by converting meters into degrees as described for example here. Alternatively, you can use this SQL query from here:
-- Using an example table 'addresses' with latitude and longitude columns.
-- Replace #LATITUDE#, #LONGITUDE#, and #DISTANCE_IN_MILES# with your search values
-- (3963.1676 is the Earth's radius in miles, so distances come out in miles).
SELECT addresses.*,
       (ACOS(SIN(RADIANS(#LATITUDE#)) * SIN(RADIANS(addresses.latitude))
           + COS(RADIANS(#LATITUDE#)) * COS(RADIANS(addresses.latitude))
           * COS(RADIANS(addresses.longitude) - RADIANS(#LONGITUDE#))) * 3963.1676) AS distance
FROM addresses
WHERE ((ACOS(SIN(RADIANS(#LATITUDE#)) * SIN(RADIANS(addresses.latitude))
           + COS(RADIANS(#LATITUDE#)) * COS(RADIANS(addresses.latitude))
           * COS(RADIANS(addresses.longitude) - RADIANS(#LONGITUDE#))) * 3963.1676) <= #DISTANCE_IN_MILES#)
   OR (addresses.latitude = #LATITUDE# AND addresses.longitude = #LONGITUDE#)
In either case, you might (depending on the supported area of addresses) have to handle the cases where your ranges cross -180/180 (longitude around the antimeridian) or -90/90 (latitude around the poles), for example by splitting the ranges into before and after the discontinuity. It's unlikely you need to support those areas, but still.
That should give you an exact selection; or, if you prefer a faster query such as SELECT * FROM Table WHERE Latitude > latitude_min AND Latitude < latitude_max AND Longitude > longitude_min AND Longitude < longitude_max, it should at least leave you with a small enough number of candidates to filter by actual distance rather than the bounding-box approximation. If it doesn't, you can safely show "too many schools selected to display" or "narrow down your search" to the user, but that has to be added to the requirements specification.
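A minimal sketch of the meters-to-degrees conversion behind those ranges (uses the common approximation of about 111.32 km per degree; reasonable away from the poles):

import math

def bounding_box(lat, lng, radius_km):
    # Return (lat_min, lat_max, lng_min, lng_max) for a radius around a point.
    km_per_deg_lat = 111.32                                # roughly constant
    km_per_deg_lng = 111.32 * math.cos(math.radians(lat))  # shrinks toward the poles
    dlat = radius_km / km_per_deg_lat
    dlng = radius_km / km_per_deg_lng
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng

# Pre-filter with the box, then compute exact distances on the survivors:
lat_min, lat_max, lng_min, lng_max = bounding_box(35.19, -101.87, 10.0)
# SELECT * FROM schools WHERE latitude  BETWEEN lat_min AND lat_max
#                         AND longitude BETWEEN lng_min AND lng_max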
What would I do if I made a mistake in this decision?
You might have to develop a new solution based on additional or clarified requirements. This is the nature of an iterative software development process. The earlier you deliver something, the earlier you fail and move on to the next iteration; therefore the simplest solution is a good one to start with, and a prototype is a valuable way to confirm requirements with the customer.
How would this be calculated risk-taking?
The smaller the step, the smaller the risk. Prototype frequently and you will avoid large risks. For example, to evaluate the performance implications of the two queries suggested above (the simple bounding-box selection vs. the complex exact-distance one), you can create a simple test without real data and measure each solution.
How would I handle conflict with my teammates in deciding which is the better solution?
By presenting alternatives, holding an open discussion, and agreeing on the best option. If for some reason this doesn't work, by escalating the discussion to include your management.
A few notes:
There is no option to get away from geocoding (string lookups are simply not possible).
Are you sure Google Geocoding is something useful? It's just a geocoding tool, and as somebody mentioned, they don't allow you to store the geocoding results. You might need to use some other service (MapQuest seemed to have plans that allow storing results).
I think your actual two options are:
You either upload all your 7M points into some cloud service that geocodes them for you and lets you run spatial queries through an API (check CartoDB, Mapbox). Google also has Fusion Tables; it's actually free, but there is a limit on data size per table and the data will be public (the tool itself is great, though).
Or you geocode the data yourself and run spatial queries in your own database. Geocoding looks like the main challenge here; make sure the Google API is right for you. Regarding AWS vs. local: go with AWS (or any other cloud) if you are in a small company or the budget allows it. If you already have infrastructure and resources, it probably makes more sense to go local.
Answering your question about risk: I think your main risk and concern will be the price. Just do a cost analysis of all the services you might be using; I think after that it will be clear to you. To start, I would check whether CartoDB (or something similar) has a solution for you. If not, research which geocoding provider is right for you (the key is being able to store the data you get back). Then get an estimate from AWS. I think running a local DB might be a headache, but it will probably be cost-effective.
Regarding the technical part: I think you should use spatial types/indexes; there is no need to calculate distances with formulas by hand. Below is a simple example of how to create, query, and retrieve spatial data (in case you are not familiar with it or fuzzy on the details):
--- set up postgis environment with docker if needed
--- (from here: https://alexurquhart.com/post/set-up-postgis-with-docker):
-- docker volume create pg_data
-- docker run --name=postgis -d -e POSTGRES_USER=alex -e POSTGRES_PASS=password -e POSTGRES_DBNAME=gis -e ALLOW_IP_RANGE=0.0.0.0/0 -p 5432:5432 -v pg_data:/var/lib/postgresql --restart=always kartoza/postgis:9.6-2.4
-- drop table schools
create table schools (
country varchar(20),
state varchar(20),
school varchar(60),
lat float,
long float,
loc GEOGRAPHY
);
---- NYC schools
insert into schools values ('USA', 'NY', 'New York City School District 1', 40.7212744,-73.986311, null);
insert into schools values ('USA', 'NY', 'KIPP NYC College Prep', 40.8162614,-73.9260793, null);
insert into schools values ('USA', 'NY', 'The Young Womens Leadership School of Astoria', 40.7712631,-73.9241695, null);
insert into schools values ('USA', 'NY', 'Brooklyn East Collegiate Charter School', 40.6784249,-73.9658189, null);
insert into schools values ('USA', 'NY', 'N Y City Board of Education', 40.6933457,-73.9215088, null);
insert into schools values ('USA', 'NY', 'New York City School District 28', 40.7027487,-73.8079333, null);
insert into schools values ('USA', 'NY', 'School of Math, Science, and Healthy', 40.6394884,-74.0202785, null);
UPDATE schools SET loc = ST_POINT(long, lat)::geography;  -- cast to geography so distances below come back in meters
CREATE INDEX school_loc ON schools USING GIST (loc);
--- get schools within 10km (10,000 m) around (-73.9091706, 40.71163)
select S.*
,ST_Distance(loc, ST_POINT(-73.9091706, 40.71163)::geography) as dist
from schools S
where ST_Distance(loc, ST_POINT(-73.9091706, 40.71163)::geography) < 10000;
---- Converting result to JSON.
---- It's a good idea to get it as GeoJSON since it's supported almost by any spatial tool. You can use http://geojson.io to visualize it
with result as (
select S.*, ST_Distance(loc, ST_POINT(-73.9091706, 40.71163)::geography) as dist from schools S
)
,features as (
select json_build_object(
'type', 'Feature',
'geometry', st_AsGeoJSON(loc)::json,
'properties', json_build_object('school', school, 'dist', dist)
) AS feature
from result
where dist < 10000
order by dist
)
------ main
-- select feature from features
select json_build_object(
'type', 'FeatureCollection',
'features', json_agg(feature)
)
from features

How to get lat-long values of all areas in a city

I have searched a lot on the Internet, but I am unable to find a suitable utility/API for my requirement.
I am interested in getting the latitude, longitude values of all areas in a city.
Currently I am using this Google Maps API:
https://developers.google.com/maps/documentation/geocoding/start
But when I enter a city name, it gives only one lat-long pair for that city. Is there any way that, if I give a city name, I can get all the areas and their corresponding latitude and longitude values?
Thanks.
There's good documentation for this at: Places API
I used this to get the latitude/longitude for one of my own projects, and I also have an example of it.
If you look at the example, you can just type a location and it will immediately get the lat/long of that location and zoom in; you can also do this for multiple locations at the same time. Remember there is a limit for the Maps API, so it can only process so much data at a time. Hope this helps you out! :)
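As a concrete starting point, here is a sketch against the Places API Text Search endpoint (assumes you have an API key; quotas and billing apply, and results may span multiple pages):

import requests

SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def areas_in_city(city, api_key):
    # Return (name, lat, lng) for places matching a free-text area query.
    resp = requests.get(SEARCH_URL, params={
        "query": f"neighborhoods in {city}",  # the query phrasing is an assumption
        "key": api_key,
    })
    return [(r["name"],
             r["geometry"]["location"]["lat"],
             r["geometry"]["location"]["lng"])
            for r in resp.json().get("results", [])]

# for name, lat, lng in areas_in_city("Hyderabad", API_KEY):
#     print(name, lat, lng)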

Finding the overlapping locations for a 5 mile and 10 mile radius for a list of location data with latitude and longitude

I have a 10,000 observation dataset with a list of location information looking like this:
ADDRESS                | CITY          | STATE | ZIP   | LATITUDE     | LONGITUDE
-----------------------|---------------|-------|-------|--------------|-------------------
1189 Beall Ave         | Wooster       | OH    | 44691 | 40.8110501   | -81.93361870000001
580 West 113th Street  | New York City | NY    | 10025 | 40.8059768   | -73.96506139999997
268 West Putnam Avenue | Greenwich     | CT    | 06830 | 40.81776801  | -73.96324589997
1 University Drive     | Orange        | CA    | 92866 | 40.843766801 | -73.9447589997
200 South Pointe Drive | Miami Beach   | FL    | 33139 | 40.1234801   | -73.966427997
I need to find the overlapping locations within a 5 mile and a 10 mile radius. I have heard that there is a function called geodist which may allow me to do that, although I have never used it. The problem is that for geodist to work I may need all the combinations of latitudes and longitudes to be side by side, which could make the file really large and hard to use. I also do not know how I would get the lat/longs for every combination to be side by side.
Does anyone know of a way I could get the final output that I am looking for?
Here is a broad outline of one possible approach to this problem:
Allocate each address into a latitude and longitude 'grid' by rounding the co-ordinates to the nearest 0.01 degrees or something like that.
Within each cell, number all the addresses 1 to n so that each has a unique id.
Write a datastep taking your address dataset as input via a set statement, and also load it into a hash object. Your dataset is fairly small, so you should have no problems fitting the relevant bits in memory.
For each address, calculate distances only to other addresses in the same cell, or other cells within a certain radius, i.e.
Decide which cell to look up
Iterate through all the addresses in that cell using the unique id you created earlier, looking up the co-ordinates of each from the hash object
Use geodist to calculate the distance for each one and output a record if it's a keeper.
This is a bit more work to program, but it is much more efficient than an O(n^2) brute force search. I once used a similar algorithm with a dataset of 1.8m UK postcodes and about 60m points of co-ordinate data.
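The SAS hash-object details aside, the grid-bucket idea itself looks roughly like this (a Python sketch for illustration; the cell size is an assumption you must tune so it is at least the search radius in degrees at your highest latitude):

import math
from collections import defaultdict

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(a))

def pairs_within(addresses, radius_miles, cell_deg):
    # addresses: iterable of (id, lat, lon); yields each (id_a, id_b, miles) once.
    # cell_deg must be >= the radius in degrees of longitude at your highest
    # latitude (e.g. 0.25 covers 10 miles anywhere in the continental US).
    grid = defaultdict(list)
    for rec in addresses:
        _, lat, lon = rec
        grid[(math.floor(lat / cell_deg), math.floor(lon / cell_deg))].append(rec)
    for (ci, cj), cell in grid.items():
        # Compare only against this cell and its 8 neighbors.
        candidates = [r for di in (-1, 0, 1) for dj in (-1, 0, 1)
                      for r in grid.get((ci + di, cj + dj), [])]
        for id_a, lat_a, lon_a in cell:
            for id_b, lat_b, lon_b in candidates:
                if id_a < id_b:  # count each pair exactly once
                    d = haversine_miles(lat_a, lon_a, lat_b, lon_b)
                    if d <= radius_miles:
                        yield id_a, id_b, d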

highlight buildings based on value and show in browser

I want to build a website with a map based on OpenStreetMap that colors buildings based on their potential average annual yield of solar power. I have the energy data for individual houses.
My question now is: can I assign each house (identified by street name and number) a value, so that the house can be colored based on this value in the browser?
I have little to no experience with openstreetmap and would be happy about hints into the right direction.
So you need an OSM dataset and to filter it for building=* ways to get the building outlines (e.g. with osmosis). Then you do a second pass to filter for addr:* tags on nodes and merge them with the building outlines from step 1. Be aware of conflicts, and note that one building can have multiple addresses. Now you have a dataset with normalized addresses and need to create a lookup structure, like a hashmap, that maps to your solar data: addr:street x addr:housenumber -> building id
(very raw idea on how to do it)
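That merge step might look roughly like this, assuming you have already extracted the building outlines as GeoJSON and keyed your solar data by address (file names and field names here are hypothetical):

import json

with open("buildings.geojson", encoding="utf-8") as f:
    buildings = json.load(f)
with open("solar_by_address.json", encoding="utf-8") as f:
    solar = json.load(f)  # e.g. {"Main Street|12": 4850.0} in kWh/year

for feature in buildings["features"]:
    props = feature["properties"]
    key = f'{props.get("addr:street")}|{props.get("addr:housenumber")}'
    props["solar_kwh_per_year"] = solar.get(key)  # None if no match / no data

with open("buildings_colored.geojson", "w", encoding="utf-8") as f:
    json.dump(buildings, f)

Any GeoJSON-capable map library in the browser (Leaflet, OpenLayers, MapLibre) can then color each polygon from the solar_kwh_per_year property via a style function.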
IMHO, mixing external data sources into the copyleft Open Database License means you would need to relicense your dataset under the ODbL as well.
Also keep in mind that not every address is currently in OSM, and the existing ones can be wrong!