Geocoding Accuracy

I'm brand new to geocoding and have a relatively large dataset of addresses (100,000+). When I geocode them using MapMarker Professional, about 10% cannot be geocoded with a high level of precision (I mostly get S2 precision values back, which means the match was to a Primary Postal Code centroid, the centerpoint of the Primary Postal Code boundary). Each of the addresses has already been standardized, so they should be valid (I have taken a random sample and run them through the USPS Zip Code Lookup process to verify this). My question is: should I be able to geocode addresses with a higher degree of accuracy than what I'm seeing, or am I expecting too much of the products currently on the market? I've tried geocoding using Google's and Yahoo's services without any better luck. All of the services appear to be able to give me the postal code centroid, but none of them appear to have enough information to give me distinct coordinates for houses in at least 98% of the addresses I send.
Thanks for any guidance you can provide,
Jeremy

Geocoding is an imprecise process. The addresses that come back with poor precision are likely in rural areas, where it is not uncommon for a geocoded address to be off by up to a mile.
Geocoders typically only know where an address is by taking the house numbers at the start and end of a street segment and interpolating between them.
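
To make the interpolation idea concrete, here is a minimal sketch in Python. It assumes a straight street segment whose two end points have known house numbers and coordinates; the function name and the sample segment are made up for illustration, and real geocoders also handle odd/even sides, curved geometry and offsets from the centerline.

```python
def interpolate_address(house_number, seg_start, seg_end):
    """Estimate (lat, lon) for a house number on a straight street segment.

    seg_start and seg_end are (house_number, lat, lon) tuples describing the
    two ends of the segment. The address is placed proportionally between them.
    """
    n0, lat0, lon0 = seg_start
    n1, lat1, lon1 = seg_end
    if n1 == n0:
        return lat0, lon0
    # Fraction of the way along the segment, clamped to the range [0, 1].
    t = (house_number - n0) / (n1 - n0)
    t = max(0.0, min(1.0, t))
    return lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0)


# Hypothetical segment covering house numbers 100 to 200.
start = (100, 40.0000, -75.0000)
end = (200, 40.0010, -75.0020)
print(interpolate_address(150, start, end))
```

On a long rural segment with only a few houses, the real building can sit far from the interpolated point, which is consistent with the errors described above.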

Related

how to get txPower to calculate distance from RSSI

I got this code from Google Code:
void QBluetoothDeviceDiscoveryAgent::deviceDiscovered(const QBluetoothDeviceInfo &info)
QBluetoothDeviceInfo::rssi()
But how do I get the RSSI distance from QBluetoothServiceDiscoveryAgent?
I tried with
QBluetoothServiceInfo serviceInfo; // as delivered by the agent's serviceDiscovered() signal
qint16 i = serviceInfo.device().rssi();
Here i = -43.
How do I convert it to a distance?
I found the link
Understanding ibeacon distancing
but how do I get the transmitter power needed to calculate the distance according to the formula?

Make sure you understand the implications of QBluetoothDeviceInfo::rssi(). Calling this function returns immediately with the value stored when the device was last scanned. If you only receive one advertisement packet, which happens to be at, e.g., -90 dBm, and then immediately connect, this function will keep returning -90 until you disconnect and scan the device again. Connected devices usually don't send advertisement packets, so the RSSI you can read via Qt won't be updated during the connection.
As for proximity, it's not so easy to get good values. To accurately convert from RSSI to geometric distance, you must know the sender's original/intended signal strength (the TX power level, i.e. the RSSI at 1 m distance). This value will differ between devices. To make things worse, in practice it can also vary by a huge margin depending on things like the sender's battery level, the physical orientation of sender and receiver to each other, the quality of individual parts, random interference from other RF devices, and so on.
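
For reference, a commonly used way to turn an RSSI reading into a rough distance is the log-distance path-loss model. The sketch below is in Python rather than Qt/C++ and is only an illustration: the calibrated TX power of -59 dBm and the path-loss exponent of 2.0 (free space) are assumptions, not values from the question, and indoors the exponent is usually higher and has to be measured.

```python
def estimate_distance_m(rssi_dbm, tx_power_dbm, path_loss_exponent=2.0):
    """Rough distance in metres from RSSI using the log-distance model.

    tx_power_dbm is the expected RSSI at 1 m from the sender.
    path_loss_exponent is about 2.0 in free space (assumed default here).
    """
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exponent))


# -43 is the reading from the question; -59 dBm is a hypothetical calibration.
print(estimate_distance_m(-43, -59))
```

Because the result depends exponentially on both assumed parameters, small calibration errors translate into large distance errors, which is why the rest of this answer recommends filtering and relative comparisons.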
The BLE folks have a blog explaining how you should do it. You can read it up here. The linked article doesn't read or assume the theoretical maximum RSSI of the sender; instead, it proposes gathering multiple RSSI values over time (plus some mean/mode filtering) and comparing the current mean with the previous one to determine whether you are approaching or moving away from the sender. Paired with some fine-tuning using real-world data you'll have to collect, plus documentation reading and common sense, you could probably develop a proximity calculation for many or even most sender devices that would be accurate to about one meter or even less at close proximity. In the end it's a tradeoff between how many devices you wish to 'calibrate' for and those you are okay with having shifted values due to higher or lower TX power levels.
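
A minimal sketch of that filtering idea, written in Python for brevity: keep a sliding window of RSSI samples, compare the current window mean with the previous one, and report the trend. The class name, window size and 1 dB threshold are assumptions made for illustration and are not taken from the linked article.

```python
from collections import deque
from statistics import mean


class RssiTrend:
    """Smooth RSSI samples with a moving mean and report the trend."""

    def __init__(self, window=5, threshold_db=1.0):
        self.samples = deque(maxlen=window)
        self.threshold_db = threshold_db
        self.previous_mean = None

    def update(self, rssi_dbm):
        """Add a sample; return 'approaching', 'receding', 'steady' or None."""
        self.samples.append(rssi_dbm)
        if len(self.samples) < self.samples.maxlen:
            return None  # window not full yet
        current = mean(self.samples)
        previous, self.previous_mean = self.previous_mean, current
        if previous is None:
            return None  # first full window, nothing to compare against
        delta = current - previous
        if delta > self.threshold_db:
            return "approaching"  # signal got stronger
        if delta < -self.threshold_db:
            return "receding"  # signal got weaker
        return "steady"


trend = RssiTrend(window=3)
for sample in (-80, -80, -80, -70, -90, -80):
    print(sample, trend.update(sample))
```

A larger window reacts more slowly but is less sensitive to single outliers, which is the tradeoff to tune with real-world data.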
The downside is that you can't test every possible device on the market and, as I said earlier, different devices have different TX power levels. With this approach you can develop an algorithm that gives pretty good measurements for devices with approximately equal signal configurations, but others will seem far off. The article's author talks about creating different profiles for different vendors, but that's not really going to help. Consider two otherwise identical beacons ("big/small"), one for large and one for small indoor locations: with RSSI alone you can't reliably determine whether you're close to the small beacon or at medium range from the big one, unless they identify themselves via GAP or otherwise (forget MAC addresses if you plan to deploy on macOS or iOS).
Also, prepare yourself for the joyride that is Android BLE development. Some vendors know that their BLE implementation is so terribly bad and broken, they even disabled the HCI-Logging-Feature on all their ROMs to hide it. Others can be BLE-nuked like Win98 by ethernet, back in the days.

Mapbox Geocoding API V5- Get all neighborhoods in a city

Is there a way to get all neighborhoods in a city by lat and lng from the Mapbox API V5?
For example, if I search using the lng and lat of Long Beach:
-118.1937, 33.7701
I expect to get back all the neighborhoods; instead, I only get back one result:
"place_name: "Downtown, Long Beach, California 90802, United States""
I have changed the response limit and bounding box, with no success.
Here is the mapbox playground.
https://www.mapbox.com/api-playground/#/forward-geocoding
Thanks!

Mapbox doesn't really do neighborhoods, they require some sort of search data to pull either addresses or places.
However, there are services where you can get neighborhood data. I found this Stack Overflow question to have several links (sadly, most of them outdated....), with the reference to Zillow having a lot of promise.
I'd also suggest the Census Bureau data as it may have what you are looking for, but it is what I would call 'less than user friendly' to find anything - unless you are comfortable reading government spec sort of things... :)

Zip-Centroid geo-coding: is there a zip+4 centroid

There is geo-coding by zip-code centroid, but is there a zip+4 centroid which would be more granular yet not quite street-address granularity?

Yes, there is. I work at SmartyStreets, where we validate and geocode street addresses. ZIP+4 is approximately block-level, which sounds like what you're looking for. There are some services that offer this type of geocoding, including ours, LiveAddress.
If you don't need super-precise geocoding (like you would if you were, say, going to skydive onto your buddy's rooftop), then ZIP+4 is most likely going to be the most affordable option, as well. Rooftop-level precision data has to be gathered manually (by companies like Google, etc.) and so it is very expensive.
So yes, it is possible to geocode by ZIP+4 centroid, and it would usually get you within about a street block of your target.

Google Places vs. Qype vs. others

At the moment I am working on a regional evaluation system.
I want to find out, for example, how a region is composed, given
a lat/long coordinate and a radius. I would like to be able to separate the results by type, and the data also needs to be up to date.
So which API based services do you recommend, if the following factors are important:
support for lat/long coordinates with search radius
differentiation by type of location
up to date information
As far as I know Google places and qype.com offer APIs which should be able to do so.
Is there a better option, or which of the two do you recommend, and why?

As far as I found out only Qype and Google Places offer the APIs.
Google offers 1000 requests per day for free while Qype only offers 200,
but one could apply for multiple keys in Qype which enables you to do more requests a day.
With Qype it is possible to check the full amount of commercial establishments in range (bounding box or radius), while google places has a restriction to 60 places per request.
That is the reason why I decided to use Qype.
About whether or not the information is up to date I did not make an evaluation,
but Qype shows reasonable results when applied to Munich.

I need an address matching algorithm

I have looked around online for this but haven't found much really. Basically I need to compare a bunch of addresses to see if they match. The addresses could be written in all different ways. For example: 1345 135th st NE, 1345 NE 135TH ST, etc. Plus they could be in different languages as well. Before I attempt to write some parsing/matching algorithm on my own, does anyone know any libraries or ways I could easily do this? My friend thought of using the Google or Bing Maps web service, passing them the address, getting back the geo-coordinates, and comparing the coordinates instead of string matching. But then I have to call a web service thousands of times for all these addresses I have, which is not very elegant ;) Any help would be nice :)

I don't think that this is a REGEX type of problem. You are looking at converting to a comparable format first.
There are several web services / products available that will standardize an address for you. Bing for "USPS Address Standardization API" and you will find a ton of information. Once the address is standardized, the comparison should be straightforward.
http://www.bing.com/search?q=usps+address+standardization+api&go=&form=QBRE&qs=n&sk=&sc=1-32
Alternatively you can GeoCode the address to get a set of coordinates and then compare those.
http://code.google.com/apis/maps/documentation/geocoding/

US addresses can (usually) be uniquely represented by a 12-digit number called the delivery point (DPBC). This number consists of the full 9-digit ZIP Code and a 3 digit delivery point number. This is what is used to form barcodes on mail pieces to speed up delivery. Using a service that is CASS-Certified can provide the 12-digit delivery point and even flag duplicates for you.
In the interest of full disclosure I work for SmartyStreets, which was formerly Qualified Address, which was mentioned in the other answer by Mowgli.
We provide an API that can be queried as well as a batch processing service (which will flag duplicates as explained above).
Keep in mind that even the 12-digit DPBC doesn't always uniquely identify a particular address. This happens frequently when a particular street block, or 9-digit ZIP code, has a long stretch of homes that have similar primary numbers. In these cases, it's best to use a CASS service to standardize and validate the addresses, then hash them for convenient comparisons. (But as said, duplicates will already be flagged by some CASS services.)
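
A minimal sketch of the "standardize, then hash" step in Python. It assumes the addresses have already been standardized and validated by some CASS service and are available as separate fields; the field layout, the function names and the sample ZIP+4 values are hypothetical.

```python
import hashlib


def address_key(primary_number, street, secondary, zip9):
    """Build a stable hash key from already-standardized address fields."""
    parts = [primary_number, street, secondary, zip9]
    # Upper-case and collapse whitespace so trivial differences do not matter.
    canonical = "|".join(" ".join(p.upper().split()) for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Return (first_index, duplicate_index) pairs for records sharing a key."""
    seen = {}
    duplicates = []
    for index, record in enumerate(records):
        key = address_key(*record)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


# Made-up sample records: (primary number, street, secondary, ZIP+4).
records = [
    ("1345", "135TH ST NE", "", "98000-0001"),
    ("1345", "135th St  NE", "", "98000-0001"),
    ("1347", "135TH ST NE", "", "98000-0001"),
]
print(find_duplicates(records))
```

Hashing only helps once the text is standardized; two different spellings of the same street would still produce different keys if they were hashed raw.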
Update: SmartyStreets now provides international address verification.

I wouldn't consider this a regex problem.
One free tool that could be helpful is usaddress, a python library for parsing addresses. It performs pretty well on all sorts of address formats, b/c it uses a probabilistic approach rather than a regex approach (although it is made for US addresses, & may not work well on addresses in other languages)
http://usaddress.readthedocs.org/en/latest/
Parsing addresses won't solve your problem 100%, but comparing two addresses, especially addresses w/ varying formats, will be much easier if the addresses are split into their respective components (so that you can compare street # against street #, city against city, etc)
Then, to compare records, you can use dedupe - another free python library.
http://dedupe.readthedocs.org/en/latest/

I found 2 options.
Firstly, maybe, instead of taking any input, you let the users choose from a limited number of options, similar to how facebook deals with addresses. If you use an autocomplete api, as they type, the possible addresses will be narrowed down by the api. Here is one from google:
http://code.google.com/p/geo-autocomplete/
Secondly, address finding & qualifying services (but they aren't free):
https://www.craftyclicks.co.uk/
https://smartystreets.com/ (Previously Qualified Address)
https://www.alliescomputing.com/ (Previously offered World Addresses)

There is an open source python library for record deduplication / entity resolution that can be applied to address matching: Dedupe.
It's free and can be run on a laptop, as opposed to a huge server.

This requires intelligence to do correctly; computers aren't intelligent.
A simple algorithm could tell you which addresses have something in common, for example, "1345 135th st NE" and "1345 NE 135TH ST" have the number "1345" in common.
You would then have fewer to compare yourself. It would also reduce the number you geolocate.
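
A rough sketch of such a pre-filter in Python: group addresses by the first purely numeric token (usually the house number) and keep only the groups with more than one member. The function name is made up, and this is just a blocking step to shrink the candidate list, not a matcher.

```python
from collections import defaultdict


def candidate_groups(addresses):
    """Group addresses that share the same first all-digit token."""
    groups = defaultdict(list)
    for address in addresses:
        # First token consisting only of digits, e.g. "1345" but not "135th".
        number = next((t for t in address.split() if t.isdigit()), "")
        groups[number].append(address)
    # Only groups with at least two addresses are worth comparing further.
    return {k: v for k, v in groups.items() if len(v) > 1}


addresses = ["1345 135th st NE", "1345 NE 135TH ST", "22 Rue de Berri"]
print(candidate_groups(addresses))
```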

This is definitely not a regex problem. It is 2018 and we have more advanced methods at hand. Both R and Python offer solutions for this type of problem:
In R: https://cran.r-project.org/web/packages/RecordLinkage/index.html
In python: https://recordlinkage.readthedocs.io/en/latest/about.html

1. Using address string similarity
Because addresses can be written in many different ways, it's useful to apply fuzzy matching and calculate the similarity of the address strings. I solved this task with the fuzzywuzzy Python library. It has functions that use the Levenshtein distance to measure the differences between strings.
from fuzzywuzzy import fuzz
addr1 = "USA AZ 850020 Phoenix Green Garden street, 283"
addr2 = "850020, USA AZ Phoenix Green Garden, 283, 3a"
addr3 = "Canada VC 9830 Vancouver Dark Ocean street, 283"
addr_similarity12 = fuzz.token_set_ratio(addr1, addr2)
addr_similarity13 = fuzz.token_set_ratio(addr1, addr3)
print(f"Address similarity 1 <-> 2: {addr_similarity12}")
print(f"Address similarity 1 <-> 3: {addr_similarity13}")
Output will be:
Address similarity 1 <-> 2: 96
Address similarity 1 <-> 3: 55
Indeed, the first two addresses are almost the same, while the first and third are different. The important task is choosing an appropriate threshold that will indicate address equality.
2. Using Google Maps Geocoding API
Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739). Once both addresses are geocoded, it's possible to calculate the numerical distance between them.
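
Once two addresses have coordinates, the distance between them can be computed with the haversine formula. This is a minimal sketch using only the standard library; the 50 m threshold in same_place is an arbitrary assumption and should be tuned to the precision of the geocoder.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def same_place(point_a, point_b, max_km=0.05):
    """Treat two (lat, lon) points as the same address if they are close."""
    return haversine_km(*point_a, *point_b) <= max_km


a = (37.423021, -122.083739)
b = (37.423021, -122.083739)
print(same_place(a, b))
```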

Well, one way to solve this problem is to convert both addresses to the same format. One easy way to do this is to use the Google Maps Geocoding API: simply pass both addresses to the API and compare the output. The output of the Geocoding API looks something like this:
FORMAT OF GOOGLE'S GEOCODING API RESPONSE (for reference):
{'results': [{'address_components': [{'long_name': '22',
                                      'short_name': '22',
                                      'types': ['street_number']},
                                     {'long_name': 'Rue de Berri',
                                      'short_name': 'Rue de Berri',
                                      'types': ['route']},
                                     {'long_name': 'Paris',
                                      'short_name': 'Paris',
                                      'types': ['locality', 'political']},
                                     {'long_name': 'Département de Paris',
                                      'short_name': 'Département de Paris',
                                      'types': ['administrative_area_level_2', 'political']},
                                     {'long_name': 'Île-de-France',
                                      'short_name': 'IDF',
                                      'types': ['administrative_area_level_1', 'political']},
                                     {'long_name': 'France',
                                      'short_name': 'FR',
                                      'types': ['country', 'political']},
                                     {'long_name': '75008', 'short_name': '75008', 'types': ['postal_code']}],
              'formatted_address': '22 Rue de Berri, 75008 Paris, France',
              'geometry': {'location': {'lat': 48.8728822, 'lng': 2.3054154},
                           'location_type': 'ROOFTOP',
                           'viewport': {'northeast': {'lat': 48.8743208802915,
                                                      'lng': 2.306719730291501},
                                        'southwest': {'lat': 48.8716229197085, 'lng': 2.304021769708497}}},
              'place_id': 'ChIJWxDbRsFv5kcRRcfu62JSRog',
              'plus_code': {'compound_code': 'V8F4+55 Paris, France',
                            'global_code': '8FW4V8F4+55'},
              'types': ['establishment', 'lodging', 'point_of_interest']}],
 'status': 'OK'}
Notice how Google has split the address into components such as street number, locality, etc. Now you can do a weighted/fuzzy match between these components. It's up to you whether you want all of them to match or apply rules, e.g. the street number must always match, while for the other components it's okay if 4 out of 5 match. You can also consider the distance between the coordinates (note: use the Haversine function and not just Euclidean distance; reference: https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b ). You can then compute a weighted score, which should be greater than a threshold for the two addresses to be considered the same place.
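
The component matching described above could be sketched like this in Python. It works on a single entry of the 'results' list shown earlier; the weights, the hard rule on the street number and the use of difflib for fuzzy comparison are illustrative assumptions, not part of the Geocoding API. A coordinate distance term could be added to the score in the same way.

```python
from difflib import SequenceMatcher

# Hypothetical weights per component type; tune them on real data.
WEIGHTS = {
    "street_number": 3.0,
    "route": 2.0,
    "locality": 1.0,
    "postal_code": 2.0,
    "country": 1.0,
}


def components_by_type(result):
    """Map each component type to its long_name for one geocoding result."""
    mapping = {}
    for component in result["address_components"]:
        for component_type in component["types"]:
            mapping[component_type] = component["long_name"]
    return mapping


def match_score(result_a, result_b, weights=WEIGHTS):
    """Weighted fuzzy similarity in [0, 1] between two geocoding results."""
    a, b = components_by_type(result_a), components_by_type(result_b)
    # Hard rule: a differing street number means a different address.
    if a.get("street_number") != b.get("street_number"):
        return 0.0
    score = 0.0
    for key, weight in weights.items():
        value_a, value_b = a.get(key, ""), b.get(key, "")
        if value_a and value_b:
            ratio = SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()
            score += weight * ratio
    return score / sum(weights.values())


def make_result(number, route, city, postal_code, country):
    """Build a minimal result dict in the API's shape, for demonstration."""
    pairs = [("street_number", number), ("route", route), ("locality", city),
             ("postal_code", postal_code), ("country", country)]
    return {"address_components": [
        {"long_name": value, "short_name": value, "types": [kind]}
        for kind, value in pairs
    ]}


first = make_result("22", "Rue de Berri", "Paris", "75008", "France")
second = make_result("22", "rue de berri", "Paris", "75008", "France")
print(match_score(first, second) >= 0.9)
```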
