I have looked around online for this but haven't found much. Basically I need to compare a bunch of addresses to see if they match. The addresses could be written in all different ways, for example: 1345 135th st NE, 1345 NE 135TH ST, etc. Plus they could be in different languages as well. Before I attempt to write some parsing/matching algorithm on my own, does anyone know any libraries or ways I could easily do this? My friend thought of using the Google or Bing Maps web service, passing it each address, getting back the geo-coordinates, and comparing using the coordinates instead of string matching. But then I have to call a web service thousands of times for all these addresses I have, not very elegant ;) Any help would be nice :)
I don't think that this is a REGEX type of problem. You are looking at converting to a comparable format first.
There are several web services / products available that will standardize an address for you. Bing for "USPS Address Standardization API" and you will find a ton of information. Once the address is standardized, the comparison should be straightforward.
http://www.bing.com/search?q=usps+address+standardization+api&go=&form=QBRE&qs=n&sk=&sc=1-32
Alternatively you can GeoCode the address to get a set of coordinates and then compare those.
http://code.google.com/apis/maps/documentation/geocoding/
US addresses can (usually) be uniquely represented by a 12-digit number called the delivery point barcode (DPBC). This number consists of the full 9-digit ZIP Code plus a 2-digit delivery point code and a check digit. This is what is used to form barcodes on mail pieces to speed up delivery. Using a service that is CASS-Certified can provide the 12-digit delivery point and even flag duplicates for you.
In the interest of full disclosure I work for SmartyStreets, which was formerly Qualified Address, which was mentioned in the other answer by Mowgli.
We provide an API that can be queried as well as a batch processing service (which will flag duplicates as explained above).
Keep in mind that even the 12-digit DPBC doesn't always uniquely identify a particular address. This happens frequently when a particular street block, or 9-digit ZIP code, has a long stretch of homes that have similar primary numbers. In these cases, it's best to use a CASS service to standardize and validate the addresses, then hash them for convenient comparisons. (But as said, duplicates will already be flagged by some CASS services.)
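To illustrate the "standardize, then hash" idea (this is not SmartyStreets' API; the component names and values below are made up, standing in for whatever your CASS-certified standardization step returns):

import hashlib

def address_key(components):
    """Build a stable comparison key from already-standardized address parts.

    `components` is assumed to be the output of a CASS-certified
    standardization step (hypothetical keys shown here).
    """
    canonical = "|".join(
        components.get(k, "").strip().upper()
        for k in ("delivery_line", "city", "state", "zip9")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two differently written inputs that standardize to the same components
# produce the same key, so duplicates fall out of a dict lookup.
standardized = [
    ("rec1", {"delivery_line": "1345 NE 135TH ST", "city": "SEATTLE", "state": "WA", "zip9": "981250000"}),
    ("rec2", {"delivery_line": "1345 NE 135TH ST", "city": "SEATTLE", "state": "WA", "zip9": "981250000"}),
]
seen = {}
for record_id, components in standardized:
    seen.setdefault(address_key(components), []).append(record_id)

duplicates = {k: ids for k, ids in seen.items() if len(ids) > 1}
print(duplicates)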
Update: SmartyStreets now provides international address verification.
I wouldn't consider this a regex problem.
One free tool that could be helpful is usaddress, a Python library for parsing addresses. It performs pretty well on all sorts of address formats because it uses a probabilistic approach rather than a regex approach (although it is made for US addresses and may not work well on addresses in other languages).
http://usaddress.readthedocs.org/en/latest/
Parsing addresses won't solve your problem 100%, but comparing two addresses, especially addresses with varying formats, will be much easier if the addresses are split into their respective components (so that you can compare street number against street number, city against city, etc.).
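For example, a minimal sketch of the component-wise comparison (the label names such as AddressNumber and StreetName are the ones usaddress returns):

# pip install usaddress
import usaddress

addr_a = "1345 135th st NE"
addr_b = "1345 NE 135TH ST"

def components(address):
    """Return a dict of usaddress labels -> uppercased values."""
    tagged, _address_type = usaddress.tag(address)
    return {label: value.upper().rstrip(".") for label, value in tagged.items()}

a, b = components(addr_a), components(addr_b)

# Compare component by component instead of whole strings; here we only
# insist that the house number and street name agree. Note the directional
# ("NE") may be labeled as a pre- or post-directional depending on position.
same_number = a.get("AddressNumber") == b.get("AddressNumber")
same_street = a.get("StreetName") == b.get("StreetName")
print(same_number and same_street)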
Then, to compare records, you can use dedupe - another free python library.
http://dedupe.readthedocs.org/en/latest/
I found 2 options.
Firstly, maybe, instead of accepting arbitrary input, you let the users choose from a limited number of options, similar to how Facebook deals with addresses. If you use an autocomplete API, the possible addresses will be narrowed down as the user types. Here is one from Google:
http://code.google.com/p/geo-autocomplete/
Secondly, address finding and validation services (but they aren't free):
https://www.craftyclicks.co.uk/
https://smartystreets.com/ (Previously Qualified Address)
https://www.alliescomputing.com/ (Previously offered World Addresses)
There is an open source python library for record deduplication / entity resolution that can be applied to address matching: Dedupe.
It's free and can be run on a laptop, as opposed to a huge server.
This requires intelligence to do correctly; computers aren't intelligent.
A simple algorithm could tell you which addresses have something in common, for example, "1345 135th st NE" and "1345 NE 135TH ST" have the number "1345" in common.
You would then have fewer to compare yourself. It would also reduce the number you geolocate.
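For example, a minimal sketch of that idea: bucket addresses by the numbers they contain, then only compare (or geocode) addresses that share a bucket.

from collections import defaultdict
import re

addresses = [
    "1345 135th st NE",
    "1345 NE 135TH ST",
    "6414 37th Ln SE, Olympia, WA 98503",
]

# Group addresses by the numbers they contain; only addresses sharing a
# number ever need to be compared (or geocoded) against each other.
buckets = defaultdict(list)
for addr in addresses:
    for number in re.findall(r"\d+", addr):
        buckets[number].append(addr)

candidate_groups = {n: group for n, group in buckets.items() if len(group) > 1}
print(candidate_groups)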
This is definitely not a regex problem. This is 2018 and we now have more advanced methods at hand. Both R and Python offer solutions for this type of problem:
In R: https://cran.r-project.org/web/packages/RecordLinkage/index.html
In python: https://recordlinkage.readthedocs.io/en/latest/about.html
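For example, a minimal sketch with the Python recordlinkage package (the data frames, column names, and ZIP values here are made up for illustration) could block on ZIP and score address similarity like this:

# pip install recordlinkage pandas
import pandas as pd
import recordlinkage

df_a = pd.DataFrame({"address": ["1345 135th st NE", "12 Main Street"],
                     "zip": ["98125", "10001"]})   # hypothetical records
df_b = pd.DataFrame({"address": ["1345 NE 135TH ST", "99 Broadway"],
                     "zip": ["98125", "10001"]})

# Block on ZIP so only addresses in the same ZIP are ever compared.
indexer = recordlinkage.Index()
indexer.block("zip")
candidate_links = indexer.index(df_a, df_b)

# Score each candidate pair with a fuzzy string comparison on the address.
compare = recordlinkage.Compare()
compare.string("address", "address", method="levenshtein", label="address_sim")
features = compare.compute(candidate_links, df_a, df_b)

print(features)  # similarity scores per candidate pair; pick your own cutoff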
1. Using address string similarity
Because addresses can be written in many different ways, it's useful to apply fuzzy logic and calculate the similarity of the address strings. I used the fuzzywuzzy Python library to solve this task. It has functions that calculate the Levenshtein distance as the difference between strings.
from fuzzywuzzy import fuzz
addr1 = "USA AZ 850020 Phoenix Green Garden street, 283"
addr2 = "850020, USA AZ Phoenix Green Garden, 283, 3a"
addr3 = "Canada VC 9830 Vancouver Dark Ocean street, 283"
addr_similarity12 = fuzz.token_set_ratio(addr1, addr2)
addr_similarity13 = fuzz.token_set_ratio(addr1, addr3)
print(f"Address similarity 1 <-> 2: {addr_similarity12}")
print(f"Address similarity 1 <-> 3: {addr_similarity13}")
Output will be:
Address similarity 1 <-> 2: 96
Address similarity 1 <-> 3: 55
Indeed, the first two addresses are almost the same, while the third one is different. The important task is choosing an appropriate threshold that indicates address equality.
2. Using Google Map Geocoding API
Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739). And then it's possible to calculate numerical "distance" between two addresses.
Well, one way to solve this problem is to convert both addresses into the same format. An easy way to do this is to use the Google Maps Geocoding API: simply pass both addresses to the API and compare the output. The output of the Geocoding API looks something like:
FORMAT OF GOOGLE'S GEOCODING API RESPONSE (for reference):
{'results': [{'address_components': [{'long_name': '22',
'short_name': '22',
'types': ['street_number']},
{'long_name': 'Rue de Berri',
'short_name': 'Rue de Berri',
'types': ['route']},
{'long_name': 'Paris',
'short_name': 'Paris',
'types': ['locality', 'political']},
{'long_name': 'Département de Paris',
'short_name': 'Département de Paris',
'types': ['administrative_area_level_2', 'political']},
{'long_name': 'Île-de-France',
'short_name': 'IDF',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'France',
'short_name': 'FR',
'types': ['country', 'political']},
{'long_name': '75008', 'short_name': '75008', 'types': ['postal_code']}],
'formatted_address': '22 Rue de Berri, 75008 Paris, France',
'geometry': {'location': {'lat': 48.8728822, 'lng': 2.3054154},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 48.8743208802915,
'lng': 2.306719730291501},
'southwest': {'lat': 48.8716229197085, 'lng': 2.304021769708497}}},
'place_id': 'ChIJWxDbRsFv5kcRRcfu62JSRog',
'plus_code': {'compound_code': 'V8F4+55 Paris, France',
'global_code': '8FW4V8F4+55'},
'types': ['establishment', 'lodging', 'point_of_interest']}],
'status': 'OK'}
Notice how Google has provided the different components of the address, like street number, locality, etc. Now you can do a weighted/fuzzy match between these components. It's up to you whether you require all of them to match, or use rules such as: the street number and other numbers should always match, while for the rest it's okay if 4 out of 5 match. You can also consider the distance between the coordinates (note: use the Haversine function, not just Euclidean distance; reference: https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b). You can then compute a weighted score that should exceed a threshold for two records to be considered the same place.
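For the coordinate part, a minimal Haversine sketch in plain Python (the distance threshold you compare against is up to you):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlng / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Coordinates from the geocoding response above versus a second,
# hypothetical geocoded point.
print(haversine_km(48.8728822, 2.3054154, 48.8729, 2.3060))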
I'm trying to do a fuzzy lookup on two datasets in SAS. I have searched over google and found the below link which explains the process of doing the fuzzy lookup in SAS.
Link: http://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
To explain the problem in detail: the two datasets contain hospital names and other additional information. I have to match both datasets based on hospital name. But the main challenge is that in some cases the hospital name appears as follows:
Dataset1(hospital Name): St.Hospital
Dataset2(hospital Name): Saint.Hospital
Likewise with INC and Incorporated.
I would like to know is there any best way to do the fuzzy lookup in SAS.
Thanks,
VJ
There can't be any single best way to do a fuzzy lookup, as the article you linked to explains. You have to decide on the best approach for your particular problem domain and your particular tolerances for false positives and false negatives, etc.
For your data, I would probably just define a set of 'best guess' transformations on the hospital name in both input data sets, and then do a standard merge on the transformed names. The transformations would be something like the following (a rough sketch of these steps appears after the list):
Convert to uppercase
Convert 'ST.' or 'ST ' to 'SAINT' (or should that be 'STREET'??)
Convert 'INC' or 'INC.' to 'INCORPORATED'
Convert any other known common strings as above
Remove any remaining punctuation
Use COMPBL to reduce multiple spaces to a single space
Do the merge
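Here is a rough sketch of that normalization logic, written in Python purely for illustration; the SAS version would use UPCASE, TRANWRD, and the COMPBL function mentioned above.

import re

def normalize_hospital_name(name):
    """Rough illustration of the transformation steps listed above."""
    name = name.upper()
    # Replace punctuation with spaces first so 'ST.HOSPITAL' splits cleanly.
    name = re.sub(r"[^\w\s]", " ", name)
    # Expand known abbreviations (assumed mappings; 'ST' is ambiguous with
    # 'STREET', so review your data before committing to this rule).
    name = re.sub(r"\bST\b", "SAINT", name)
    name = re.sub(r"\bINC\b", "INCORPORATED", name)
    # Collapse repeated blanks (the COMPBL step).
    return re.sub(r"\s+", " ", name).strip()

print(normalize_hospital_name("St.Hospital"))     # SAINT HOSPITAL
print(normalize_hospital_name("Saint.Hospital"))  # SAINT HOSPITAL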
You will then have to examine the result and decide if it's good enough for your purposes. There is no general way for a computer to match up two strings that might be arbitrarily badly-spelled, particularly if there are multiple possible 'correct' matches - this is the same problem that spell-checkers have been trying to solve for decades - there's no way of knowing (in isolation) whether a misspelled word like 'falt' was meant to be 'fault', 'fall', 'fast', 'fat' etc.
If your results have to be perfect, you will need a human to review anything that isn't an exact match, and even then some of the exact matches might be misspellings that happen to match another hospital's name (eg, 'Saint Mary's Hospital' vs 'Saint May's Hospital'). That's why the preferred approach would usually be to identify the hospital by an ID number and the name, rather than just the name.
What I am trying to do:
I am trying to take a list of terms and distinguish which domain they are coming from. For example "intestine" would be from the anatomical domain while the term "cancer" would be from the disease domain. I am getting these terms from different ontologies such as DOID and FMA (they can be found at bioportal.bioontology.org)
The problem:
I am having a hard time figuring out the best way to implement this. Currently I am naively taking the terms from the DOID and FMA ontologies and removing from the DOID list (which contains terms that may be partly anatomical, such as colon carcinoma: colon being anatomical and carcinoma being disease) any term that also appears in the FMA list, which we know is anatomical.
Thoughts:
I was thinking that I could extract root words, prefixes, and suffixes for the different term domains and try to match them against the terms in the list. Another idea is to take more information from the ontologies, such as metadata, and use that to distinguish between the terms.
Any ideas are welcome.
As a first run, you'll probably have the best luck with bigrams. As an initial hypothesis, diseases are usually noun phrases, and usually have a very English-specific structure where NP -> N N, like "liver cancer", which means roughly the same thing as "cancer of the liver." Doctors tend not to use the latter, while the former should be caught with bigrams quite well.
Use the two ontologies you have there as starting points to train some kind of bigram model. Like Rcynic suggested, you can count them up and derive probabilities. A Naive Bayes classifier would work nicely here. The features are the bigrams; classes are anatomy or disease. sklearn has Naive Bayes built in. The "naive" part means, in this case, that all your bigrams are independent of each other. This assumption is fundamentally false, but it works well in a lot of circumstances, so we pretend it's true.
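As a rough illustration (toy training data, nothing tuned), a Naive Bayes classifier over word n-grams in scikit-learn might look like this:

# pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled terms; in practice the labels would come from FMA (anatomy)
# and DOID (disease).
terms = ["liver", "small intestine", "colon", "liver cancer",
         "colon carcinoma", "lung disease"]
labels = ["anatomy", "anatomy", "anatomy", "disease", "disease", "disease"]

# Unigrams are included alongside bigrams so single-word terms still get features.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(terms, labels)

print(model.predict(["intestine", "breast cancer"]))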
This won't work perfectly. As it's your first pass, you should be prepared to probe the output to understand how it derived the answer it came upon, and to find the cases it failed on. When you find trends of errors, tweak your model and try again.
I wouldn't recommend WordNet here. It wasn't written by doctors, and since what you're doing relies on precise medical terminology, it's probably going to add bizarre meanings. Consider, from nltk.corpus.wordnet:
>>> from nltk.corpus import wordnet as wn
>>> from pprint import pprint
>>> livers = wn.synsets("liver")
>>> pprint([l.definition() for l in livers])
[u'large and complicated reddish-brown glandular organ located in the upper right portion of the abdominal cavity; secretes bile and functions in metabolism of protein and carbohydrate and fat; synthesizes substances involved in the clotting of the blood; synthesizes vitamin A; detoxifies poisonous substances and breaks down worn-out erythrocytes',
u'liver of an animal used as meat',
u'a person who has a special life style',
u'someone who lives in a place',
u'having a reddish-brown color']
Only one of these is really of interest to you. As a null hypothesis, there's an 80% chance WordNet will add noise, not knowledge.
The naive approach - what precision and recall is it getting you? If you set up a test case now, you can track your progress as you apply more sophisticated methods.
I don't know what initial set you are dealing with, but one thing to try is to get your hands on annotated documents (maybe use Mechanical Turk). The documents need to be tagged with the domains you're looking for - anatomical or disease.
Then counting and dividing will tell you how likely a word you encounter is to belong to a given domain. With that in place, the next step would be to tweak some weights.
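A tiny sketch of the count-and-divide idea (made-up term lists standing in for your annotated data):

from collections import Counter

# Toy labelled term lists standing in for annotated documents.
anatomy_terms = ["liver", "colon", "small intestine", "colon wall"]
disease_terms = ["colon carcinoma", "liver cancer", "carcinoma"]

anatomy_counts = Counter(w for t in anatomy_terms for w in t.split())
disease_counts = Counter(w for t in disease_terms for w in t.split())

def p_anatomy(word):
    """Fraction of this word's occurrences that came from the anatomy list."""
    a, d = anatomy_counts[word], disease_counts[word]
    return a / (a + d) if (a + d) else 0.5  # unseen word: no evidence either way

print(p_anatomy("colon"), p_anatomy("carcinoma"))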
Another approach (going in a whole other direction) is using WordNet. I don't know if it will be useful for exactly your purposes, but it's a massive ontology, so it might help.
Python has bindings to use WordNet via NLTK.
from nltk.corpus import wordnet as wn
wn.synsets('cancer')
which gives the output: [Synset('cancer.n.01'), Synset('cancer.n.02'), Synset('cancer.n.03'), Synset('cancer.n.04'), Synset('cancer.n.05')]
http://wordnetweb.princeton.edu/perl/webwn
Let us know how it works out.
There is geocoding by ZIP Code centroid, but is there a ZIP+4 centroid, which would be more granular, yet not quite street-address granularity?
Yes, there is. I work at SmartyStreets, where we validate and geocode street addresses. ZIP+4 is approximately block-level, which sounds like what you're looking for. There are some services that offer this type of geocoding, including ours, LiveAddress.
If you don't need super-precise geocoding (like you would if you were, say, going to skydive onto your buddy's rooftop), then ZIP+4 is most likely going to be the most affordable option, as well. Rooftop-level precision data has to be gathered manually (by companies like Google, etc.) and so it is very expensive.
So yes, it is possible to geocode by ZIP+4 centroid, and it would usually get you within about a street block of your target.
I'm brand new to geocoding and have a relatively large dataset of 100,000+ addresses. When I geocode them using MapMarker Professional, about 10% come back without a high level of precision (I get mostly S2 precision values, which mean it was only able to match to a Primary Postal Code centroid, the center point of the Primary Postal Code boundary). Each of the addresses has already been standardized, so they should be valid (I have taken a random sample and run them through the USPS ZIP Code Lookup process to verify this). My question is: should I be able to geocode addresses with a higher degree of accuracy than what I'm seeing, or am I expecting too much of the products currently on the market? I've tried geocoding using Google's and Yahoo's services without any better luck. All of the services appear to be able to give me the postal code centroid, but none of them appear to have enough information to give me distinct coordinates for houses for at least 98% of the addresses I send them.
Thanks for any guidance you can provide,
Jeremy
Geocoding is an imprecise process. The addresses you are geocoding that don't have good precision are likely in rural areas, where it is not uncommon for addresses to be off by up to a mile.
Most geocoders only know where addresses are by taking the house numbers at the start and end of a street segment and interpolating between them.
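To make that concrete, here is a small sketch of that interpolation with made-up segment data; it shows why long, unevenly numbered rural segments come out imprecise:

def interpolate_along_segment(house_number, start_number, end_number,
                              start_coord, end_coord):
    """Estimate a coordinate by linear interpolation along a street segment.

    This mirrors how many geocoders place an address: they only know the
    house-number range and the coordinates of the segment's two ends.
    """
    if end_number == start_number:
        return start_coord
    fraction = (house_number - start_number) / (end_number - start_number)
    lat = start_coord[0] + fraction * (end_coord[0] - start_coord[0])
    lng = start_coord[1] + fraction * (end_coord[1] - start_coord[1])
    return (lat, lng)

# Hypothetical segment: house numbers 100-198 running between two known points.
print(interpolate_along_segment(150, 100, 198,
                                (47.0000, -122.0000), (47.0010, -122.0020)))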
I need to distinguish a Queens-style address from a valid ranged address and from an address with a unit#. For example:
Queens style: 123-125 Some Street, NY
Ranged Address: 6414-6418 37th Ln SE, Olympia, WA 98503
Address with unit#: 1990-A Gildersleeve Ave, Bronx, NY.
In the case of #3, A is a unit# at street address 1990. The unit# might be a number as well, e.g. 1990-12. A ranged address identifies a range of addresses on a street, not a unique deliverable address.
So, the question is, is there an easy way to identify the Queens style address from the other cases?
---- UPDATE ---
Thanks, all. From your answers, it seems that there is no easy way to do this. I basically need to know if a street address in the form ABCD-WXYZ is a Queens-style address pointing to a single property, or if it is a ranged address.
How about some followup questions:
1) Are all addresses in New York City of the form ABCD-WXYZ?
2) Are there any other places in the US where this style of addressing is used? Wikipedia seems to imply that there are, but does not give any examples.
This is from the memory of growing up there, so beware:
An address like
198-16 100th Avenue, Hollis, NY, 11423
can be deciphered first by deciding whether the 11423 ZIP Code is in Queens. If not, then punt.
Next, it says "100th Avenue". That implies that the "198" is referring to "198th Street": Streets always run North to South, and Avenues always run East to West. You get some interesting things with "Road" and "Place" and such, but "Place" is a "Street", and I believe that "Road" is an "Avenue".
To find the building, start at 198th Street, on the South side (even numbers), and start counting. You'll find that 198-16 is on the corner of 199th Street and 100th Avenue, just like it was when I lived there, because if it was on the other side of 199th street, it would have been 200-something.
As to how to distinguish, you could start by applying the above rules, and seeing if you come up with something that makes sense. Maybe the Street never intersects the Avenue? Maybe the numbers don't go up that high (I don't believe there is a 300th Street, and I'm not sure about a 300th Avenue). Maybe the building number is too high (you'd live on a very long street if you lived at 198-200 100th Avenue, especially because the distance from 198th Street to 199th Street on 100th Avenue isn't very great: it's a short block in that direction).
Unfortunately addresses don't have enough to "verify themselves" like a mod 10 checksum on a credit card. This means that without external information, there really is no way to know for sure how the address is supposed to look in a standardized format as compared to the original, unprocessed input format.
This is where something like an address verification web service comes into play. For a few dollars a month (usually about $20) you can verify and clean up your address database, and also prevent bad or duplicate addresses from getting into your system and spreading through it like a cancer. Most address validation web services will standardize the format of the address and expose the various component parts, so you can do additional processing or inspection.
Just so you are aware, I'm the founder of SmartyStreets. We offer an address verification web service API called LiveAddress. You're more than welcome to contact me personally with questions about addresses whether you're a customer or not. I'm more than happy to help.
Well, you would know that the second address isn't in Queens because the X-Y format is based on the streets and avenues of the borough. There aren't 6414 avenues or streets in Queens (fewer than 280 of each). The house number shouldn't go much over 100 because the numbers reset at every numbered cross street/avenue, so the X and Y parts would rarely have the same number of digits. Ultimately, though, a valid address would include the house number, street name, city, state/province if applicable, ZIP or postal code, and country if international (otherwise it won't get sent), so you would rarely be given just the house number without the other information being clearly implied.
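Putting those heuristics into a rough sketch (the Queens ZIP prefixes below are an assumption for illustration only; check them against a real ZIP-to-borough table before relying on them):

import re

# Assumed Queens ZIP Code prefixes, for illustration only.
QUEENS_ZIP_PREFIXES = ("111", "113", "114", "116")

def looks_like_queens_hyphenated(street_number, zip_code):
    """Heuristic from the answer above: hyphenated number, ZIP in Queens,
    cross-street part under ~280 and house part under ~100."""
    m = re.fullmatch(r"(\d+)-(\d+)", street_number)
    if not m or not zip_code.startswith(QUEENS_ZIP_PREFIXES):
        return False
    cross_street, house = int(m.group(1)), int(m.group(2))
    return cross_street < 280 and house < 100

print(looks_like_queens_hyphenated("198-16", "11423"))      # Queens style
print(looks_like_queens_hyphenated("6414-6418", "98503"))   # ranged address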
The system was created to avoid confusion with the other boroughs before we had the ZIP Code system. I mean, there are some locations in Astoria where, if you don't give a ZIP Code or a neighborhood name (or use the dash system), Google Maps will point to Long Island City, and that's within the same borough. No other borough in the city uses this system. It's just a Queens gem. Outside of the state, however, I believe (don't quote me on this) that Philadelphia uses this system. I know this is an old post, but I just saw it and wanted to give my two cents.
Generally, you can't distinguish between these different address styles, without additional information. Fortunately, the remainder of the addresses provide some clues as to what address style is in use.
Your first example is a Queens style address. Knowing that the address is in NY, and knowing that it has a specific street name, you might be able to infer that it's in Queens, and treat accordingly. If you had the ZIP code, that would be even better, because then you could restrict treatment of Queens style addresses to only those that have specific ZIP codes.