Querying GeoLite2 Country CSV in SQL - geoip

Does anyone know how to look up an IPv4 address from MaxMind's GeoLite2 Country CSV using SQL?
I have been using MaxMind's free GeoIP data for many years, and would like to upgrade to their GeoLite2 data. I have the blocks and locations data loaded into MySQL tables, but am not sure how to determine the address range that an IPv4 address falls into. The old format had a start/end number for each block; the new format only seems to have a start number.
I have already hunted through the MaxMind developer docs, and Googled, but can't seem to find any information on how to query the new format. I'm sure it's obvious, and will edit this posting if I figure it out in the interim.
I think I'd have to find the first blocks entry that is greater than or equal to the IPv4 address, and LIMIT 1, perhaps.
I use this data both for web application lookups and for querying directly in SQL to generate reports, so I usually need to make sure I can implement the lookup in both Perl code and pure SQL.
I am upgrading because I'm seeing some funny results for Japanese visitors appearing to be from France on the old data.
many thanks

The address format used in the GeoLite2 CSV consists of a block start IP address followed by a prefix length, which can be converted into the block end IP address.
(Somewhat confusingly, MaxMind labels this field "Network_Mask_Length" instead of "Prefix_Length", the accepted IPv6 terminology.)
GeoLite2 CSV blocks field layout:
network_start_ip,network_mask_length,geoname_id,registered_country_geoname_id,represented_country_geoname_id,postal_code,latitude,longitude,is_anonymous_proxy,is_satellite_provider
Example (record extracted from GeoLite2-Country-Blocks.csv):
::ffff:81.248.136.0,120,3578476,3017382,,,,,0,0
Given the above example, what is the last IPv4 address assigned to the block?
First IP address: 81.248.136.0
Prefix_Length/Network_mask_Length: 120
Last IP address: 81.248.136.255
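The conversion above can be sketched in a few lines of Python using the standard `ipaddress` module. This is only an illustration: the helper name `block_range` is made up here, and it assumes every block start is an IPv4-mapped IPv6 address like the one in the example record.

```python
import ipaddress

def block_range(network_start_ip, network_mask_length):
    """Return (first, last) IPv4 addresses of a GeoLite2 block as strings.

    block_range is a hypothetical helper; it assumes the start address is
    IPv4-mapped, e.g. '::ffff:81.248.136.0'.
    """
    addr = ipaddress.IPv6Address(network_start_ip)
    v4 = addr.ipv4_mapped  # None when the address is not IPv4-mapped
    if v4 is None:
        raise ValueError("not an IPv4-mapped IPv6 address")
    # The first 96 bits belong to the ::ffff: prefix, so the IPv4 prefix
    # length is the IPv6 prefix length minus 96.
    v4_prefix = network_mask_length - 96
    net = ipaddress.ip_network(f"{v4}/{v4_prefix}", strict=False)
    return str(net.network_address), str(net.broadcast_address)

first, last = block_range("::ffff:81.248.136.0", 120)
print(first, last)  # 81.248.136.0 81.248.136.255
```

If start/end integer columns like those of the old format are wanted, `int(ipaddress.IPv4Address(last))` gives the numeric form of either address.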
The following URL might be handy to quickly look up the number of IP addresses available for a specific Prefix_Length:
http://www.gestioip.net/cgi-bin/subnet_calculator.cgi
__philippe

Prefix_Length calculator usage:
(In this case, more of a simple Table Lookup tool than a Calculator, really...;-)
http://www.gestioip.net/cgi-bin/subnet_calculator.cgi
In the calculator, tick the IPv6 button, then click the PL box down arrow.
A list of "Prefix Length" will be presented, with the corresponding available number of IP addresses.
To determine the last IP address of any GeoLite2 block, the following range of Prefix_Length/address-count pairs should most likely suffice:
Prefix_Length   # addresses
117             2048
118             1024
119             512
120             256
121             128
122             64
123             32
124             16
125             8
126             4
127             2
128             1
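The table does not need to be looked up: the number of addresses in a block is 2 raised to the number of bits not covered by the prefix. A minimal Python sketch (the function name `addresses_in_block` is just an illustrative choice):

```python
def addresses_in_block(prefix_length, total_bits=128):
    """Number of addresses in a block: 2 ** (bits not fixed by the prefix)."""
    return 2 ** (total_bits - prefix_length)

# Reproduce the Prefix_Length / address-count pairs for lengths 117..128.
for prefix_length in range(117, 129):
    print(prefix_length, addresses_in_block(prefix_length))
```

The same function covers shorter prefixes as well; for example a length of 113 leaves 15 free bits.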
Note that the GeoLite2 file structure follows a form of hybrid IPv4/IPv6 notation, known as an "IPv4-mapped IPv6 address".
Those "hybrid" addresses are written with the first 96 bits (80 zero bits followed by 16 one bits, i.e. ::ffff:) in the standard IPv6 format, and the remaining 32 bits in the customary dot-decimal notation of IPv4.
For instance, ::ffff:192.0.2.128 represents the IPv4 address 192.0.2.128
For much more on this very (hairy) subject, check here:
http://en.wikipedia.org/wiki/IPv6
__philippe

Related

IP address matching filter function

I am writing C++ code that runs on both Windows and Mac. I want to write a function that accepts a list of machine IP addresses and a list of IP filters in CIDR format, and checks whether a machine IP matches a filter.
For example, if the machine IP is 10.210.177.47 and the filter is 10.210.177.1/32,
the function will check whether 10.210.177.47 falls inside the filter range.
A filter can also be a plain IP address like 10.210.177.45.
I need to write common code that can run on both Windows and Mac.
The easiest solution is to convert the mask length into a bit mask. E.g. a /8 uses the upper 8 bits to identify the network and the lower 24 bits to identify hosts within that network. Hence, by shifting the IP address (expressed as std::uint32_t) right by 24 bits (>> 24), you keep just the network part. For 10.210.177.47 within 10.0.0.0/8, that leaves 10 on both sides, so it matches. For a /24 you shift right by 8 bits instead, which leaves 10.210.177 for the address but 10.0.0 for the network, so there is no match.
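The shift-and-compare idea can be sketched as follows. The sketch is in Python rather than C++ purely for brevity; the function names are illustrative, and input validation is left out. The arithmetic carries over directly to std::uint32_t, with one caveat: in C++ a shift by 32 bits (the /0 case) is undefined for a 32-bit type and needs special handling.

```python
def ipv4_to_int(ip):
    """Pack a dotted-quad IPv4 string into a 32-bit integer."""
    value = 0
    for part in ip.split("."):
        value = (value << 8) | int(part)
    return value

def matches_filter(ip, ip_filter):
    """True if ip falls inside ip_filter (CIDR, or a plain address = /32)."""
    network, _, length = ip_filter.partition("/")
    prefix_length = int(length) if length else 32
    # Drop the host bits on both sides, then compare the network parts.
    shift = 32 - prefix_length
    return (ipv4_to_int(ip) >> shift) == (ipv4_to_int(network) >> shift)

print(matches_filter("10.210.177.47", "10.0.0.0/8"))       # True
print(matches_filter("10.210.177.47", "10.210.177.1/32"))  # False
print(matches_filter("10.210.177.45", "10.210.177.45"))    # True
```

Checking a list of machine addresses against a list of filters is then a nested loop over `matches_filter`.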

How to understand network length from geolite2 block

I'm trying to decode the end address from the MaxMind GeoLite2 database for the IPv4 portion. However, I am used to working with classful lengths, e.g. /8, /16, etc., and lengths such as 113, 114, 112 aren't making any sense to me, presumably because these are v4 addresses in v6 notation.
eg.
::ffff:1.0.128.0,113
Can anyone point me to how to translate the lengths here so that I can generate the correct mask? I want to understand it mathematically, but for some reason the penny isn't dropping.
To get the IPv4 prefix/mask length, subtract 96 from the IPv6 prefix length. For instance, ::ffff:1.0.128.0/113, from your example, is equivalent to 1.0.128.0/17, i.e. the range 1.0.128.0 to 1.0.255.255.
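To see the arithmetic behind this, the mask and range can be built by hand with bit operations. The following Python sketch uses a made-up helper name, `v4_mask_and_range`, and assumes the input is always written in the `::ffff:a.b.c.d` form used by the CSV.

```python
import ipaddress

def v4_mask_and_range(mapped_start, v6_prefix_length):
    """Return (mask, first, last) as dotted-quad strings for a mapped block."""
    v4_length = v6_prefix_length - 96            # e.g. 113 - 96 = 17
    # v4_length one-bits followed by (32 - v4_length) zero-bits.
    mask = (0xFFFFFFFF << (32 - v4_length)) & 0xFFFFFFFF
    dotted = mapped_start.rsplit(":", 1)[-1]     # text after the last colon
    start = int(ipaddress.IPv4Address(dotted)) & mask
    end = start | (~mask & 0xFFFFFFFF)           # set every host bit to 1

    def to_str(n):
        return str(ipaddress.IPv4Address(n))

    return to_str(mask), to_str(start), to_str(end)

print(v4_mask_and_range("::ffff:1.0.128.0", 113))
# ('255.255.128.0', '1.0.128.0', '1.0.255.255')
```

The last address is simply the first address with all host bits (those outside the mask) set to one.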

I need an address matching algorithm

I have looked around online for this but haven't found much really. Basically I need to compare a bunch of addresses to see if they match. The addresses could be written in all different ways, for example: 1345 135th st NE, 1345 NE 135TH ST, etc. Plus they could be in different languages as well. Before I attempt to write some parsing/matching algorithm on my own, does anyone know any libraries or ways I could easily do this? My friend thought of using the Google or Bing Maps web service, passing them the address, getting back the geo-coordinates and comparing the coordinates instead of string matching. But then I have to call a web service thousands of times for all these addresses I have, not very elegant ;) Any help would be nice :)
I don't think that this is a REGEX type of problem. You are looking at converting to a comparable format first.
There are several web services / products available that will standardize an address for you. Bing for "USPS Address Standardization API" and you will find a ton of information. Once the address is standardized, the comparison should be straightforward.
http://www.bing.com/search?q=usps+address+standardization+api&go=&form=QBRE&qs=n&sk=&sc=1-32
Alternatively you can GeoCode the address to get a set of coordinates and then compare those.
http://code.google.com/apis/maps/documentation/geocoding/
US addresses can (usually) be uniquely represented by a 12-digit number called the delivery point (DPBC). This number consists of the full 9-digit ZIP Code and a 3 digit delivery point number. This is what is used to form barcodes on mail pieces to speed up delivery. Using a service that is CASS-Certified can provide the 12-digit delivery point and even flag duplicates for you.
In the interest of full disclosure I work for SmartyStreets, which was formerly Qualified Address, which was mentioned in the other answer by Mowgli.
We provide an API that can be queried as well as a batch processing service (which will flag duplicates as explained above).
Keep in mind that even the 12-digit DPBC doesn't always uniquely identify a particular address. This happens frequently when a particular street block, or 9-digit ZIP code, has a long stretch of homes that have similar primary numbers. In these cases, it's best to use a CASS service to standardize and validate the addresses, then hash them for convenient comparisons. (But as said, duplicates will already be flagged by some CASS services.)
Update: SmartyStreets now provides international address verification.
I wouldn't consider this a regex problem.
One free tool that could be helpful is usaddress, a python library for parsing addresses. It performs pretty well on all sorts of address formats, b/c it uses a probabilistic approach rather than a regex approach (although it is made for US addresses, & may not work well on addresses in other languages)
http://usaddress.readthedocs.org/en/latest/
Parsing addresses won't solve your problem 100%, but comparing two addresses, especially addresses w/ varying formats, will be much easier if the addresses are split into their respective components (so that you can compare street # against street #, city against city, etc)
Then, to compare records, you can use dedupe - another free python library.
http://dedupe.readthedocs.org/en/latest/
I found 2 options.
Firstly, maybe, instead of taking any input, you let the users choose from a limited number of options, similar to how facebook deals with addresses. If you use an autocomplete api, as they type, the possible addresses will be narrowed down by the api. Here is one from google:
http://code.google.com/p/geo-autocomplete/
Secondly, address finding & qualifying services (but they aren't free):
https://www.craftyclicks.co.uk/
https://smartystreets.com/ (Previously Qualified Address)
https://www.alliescomputing.com/ (Previously offered World Addresses)
There is an open source python library for record deduplication / entity resolution that can be applied to address matching: Dedupe.
It's free and can be run on a laptop, as opposed to a huge server.
This requires intelligence to do correctly; computers aren't intelligent.
A simple algorithm could tell you which addresses have something in common, for example, "1345 135th st NE" and "1345 NE 135TH ST" have the number "1345" in common.
You would then have fewer to compare yourself. It would also reduce the number you geolocate.
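A minimal sketch of such a "something in common" pre-filter, assuming a simple case-insensitive token comparison (the function names and the Jaccard-style score are illustrative choices, not a standard recipe):

```python
import re

def address_tokens(address):
    """Lower-case the address and split it into a set of word tokens."""
    return set(re.findall(r"\w+", address.lower()))

def token_overlap(a, b):
    """Share of tokens the two addresses have in common (0.0 to 1.0)."""
    tokens_a, tokens_b = address_tokens(a), address_tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(token_overlap("1345 135th st NE", "1345 NE 135TH ST"))  # 1.0
print(token_overlap("1345 135th st NE", "22 Rue de Berri"))   # 0.0
```

Pairs with a low score can be discarded cheaply, leaving only the plausible candidates for manual review or geocoding.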
This is definitely not a regex problem. As of 2018, more advanced methods are available. Both R and Python offer solutions for this type of problem:
In R: https://cran.r-project.org/web/packages/RecordLinkage/index.html
In python: https://recordlinkage.readthedocs.io/en/latest/about.html
1. Using address string similarity
Because addresses can be written in many different ways, it is useful to apply fuzzy matching and calculate the similarity of the address strings. I have used the fuzzywuzzy Python library for this task. It provides functions that score the similarity of two strings based on Levenshtein distance.
from fuzzywuzzy import fuzz
addr1 = "USA AZ 850020 Phoenix Green Garden street, 283"
addr2 = "850020, USA AZ Phoenix Green Garden, 283, 3a"
addr3 = "Canada VC 9830 Vancouver Dark Ocean street, 283"
addr_similarity12 = fuzz.token_set_ratio(addr1, addr2)
addr_similarity13 = fuzz.token_set_ratio(addr1, addr3)
print(f"Address similarity 1 <-> 2: {addr_similarity12}")
print(f"Address similarity 1 <-> 3: {addr_similarity13}")
Output will be:
Address similarity 1 <-> 2: 96
Address similarity 1 <-> 3: 55
Indeed, the first two addresses are almost the same, while the first and third are different. The important task is choosing an appropriate threshold that indicates address equality.
2. Using Google Maps Geocoding API
Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739). It is then possible to calculate a numerical "distance" between two addresses.
Another way to solve this problem is to convert both addresses into the same format. One easy way to do this is to pass both addresses to the Google Maps Geocoding API and compare the output. The output of the Geocoding API looks something like:
FORMAT OF GOOGLE'S GEOCODING API RESPONSE (for reference):
{'results': [{'address_components': [{'long_name': '22',
'short_name': '22',
'types': ['street_number']},
{'long_name': 'Rue de Berri',
'short_name': 'Rue de Berri',
'types': ['route']},
{'long_name': 'Paris',
'short_name': 'Paris',
'types': ['locality', 'political']},
{'long_name': 'Département de Paris',
'short_name': 'Département de Paris',
'types': ['administrative_area_level_2', 'political']},
{'long_name': 'Île-de-France',
'short_name': 'IDF',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'France',
'short_name': 'FR',
'types': ['country', 'political']},
{'long_name': '75008', 'short_name': '75008', 'types': ['postal_code']}],
'formatted_address': '22 Rue de Berri, 75008 Paris, France',
'geometry': {'location': {'lat': 48.8728822, 'lng': 2.3054154},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 48.8743208802915,
'lng': 2.306719730291501},
'southwest': {'lat': 48.8716229197085, 'lng': 2.304021769708497}}},
'place_id': 'ChIJWxDbRsFv5kcRRcfu62JSRog',
'plus_code': {'compound_code': 'V8F4+55 Paris, France',
'global_code': '8FW4V8F4+55'},
'types': ['establishment', 'lodging', 'point_of_interest']}],
'status': 'OK'}
Notice how Google returns the individual components of the address, such as street number and locality. You can now do weighted/fuzzy matching between these components. It's up to you whether you require all of them to match, or apply rules such as: the street number must always match, while for the other components it is acceptable if 4 out of 5 match. You can also consider the distance between the coordinates (note: use the Haversine function rather than plain Euclidean distance; reference: https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b ). You can then compute a weighted score, which should exceed a threshold for the two addresses to be considered the same place.
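For the coordinate comparison, a Haversine distance can be written with the standard library alone. This is a sketch that assumes a spherical Earth with a mean radius of 6371 km; the function name is illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Distance between the two sample coordinates quoted in this answer.
paris = (48.8728822, 2.3054154)
mountain_view = (37.423021, -122.083739)
print(haversine_km(*paris, *mountain_view))
```

Two geocoded addresses could then be treated as candidates for "the same place" when this distance falls below a threshold you choose, in combination with the component matching described above.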

How to distinguish a NY "Queens-style" street address from a ranged address, and an address with a unit#

I need to distinguish between a Queens style address, from a valid ranged address, and an address with a unit#. For eg:
Queens style: 123-125 Some Street, NY
Ranged Address: 6414-6418 37th Ln SE, Olympia, WA 98503
Address with unit#: 1990-A Gildersleeve Ave, Bronx, NY.
In the case of #3, A is a unit# at street address 1990. The unit# might be a number as well, e.g. 1990-12. A ranged address identifies a range of addresses on a street, and not a unique deliverable address.
So, the question is, is there an easy way to identify the Queens style address from the other cases?
---- UPDATE ---
Thanks, all. From your answers, it seems that there is no easy way to do this. I basically need to know if a street address in the form ABCD-WXYZ is a Queens-style address pointing to a single property, or if it is a ranged address.
How about some followup questions:
1) Are all addresses in NY City of the form ABCD-WXYZ?
2) Are there any other places in US where this style of addressing is used? Wikipedia seems to imply that is true, but does not give any examples.
This is from the memory of growing up there, so beware:
An address like
198-16 100th Avenue, Hollis, NY, 11423
Can be deciphered first by deciding whether the 11423 zip code is in Queens. If not, then punt.
Next, it says "100th Avenue". That implies that the "198" is referring to "198th Street": Streets always run North to South, and Avenues always run East to West. You get some interesting things with "Road" and "Place" and such, but "Place" is a "Street", and I believe that "Road" is an "Avenue".
To find the building, start at 198th Street, on the South side (even numbers), and start counting. You'll find that 198-16 is on the corner of 199th Street and 100th Avenue, just like it was when I lived there, because if it was on the other side of 199th street, it would have been 200-something.
As to how to distinguish, you could start by applying the above rules, and seeing if you come up with something that makes sense. Maybe the Street never intersects the Avenue? Maybe the numbers don't go up that high (I don't believe there is a 300th Street, and I'm not sure about a 300th Avenue). Maybe the building number is too high (you'd live on a very long street if you lived at 198-200 100th Avenue, especially because the distance from 198th Street to 199th Street on 100th Avenue isn't very great: it's a short block in that direction).
Unfortunately addresses don't have enough to "verify themselves" like a mod 10 checksum on a credit card. This means that without external information, there really is no way to know for sure how the address is supposed to look in a standardized format as compared to the original, unprocessed input format.
This is where something like an address verification web service would come into play. For a few dollars a month (usually about $20) you can verify your address database and clean it up and also prevent bad or duplicate addresses from getting into your system and spreading through it like a cancer. Most address validation web services will standardize the format of the address and expose the various component parts of the address so you can do additional process or inspection or whatever.
Just so you are aware, I'm the founder of SmartyStreets. We offer an address verification web service API called LiveAddress. You're more than welcome to contact me personally with questions about addresses whether you're a customer or not. I'm more than happy to help.
Well you would know that the second address isn't in Queens because the X-Y format is based on the streets and avenues of the borough. There aren't 6414 avenues or streets in Queens (less than 280 of each). The house number shouldn't go much over 100 because they reset every numbered cross street/avenue. So the X and Y would rarely have the same amount of digits. Ultimately though, a valid address would have the house number, street name, city, state/province if available, zip code or address code, country if international, otherwise they won't be sent, so you rarely would be given just the house number, if the other information weren't clearly implied.
The system was created to avoid confusion with the other boroughs before we had the Zip code system. I mean, there are some locations in Astoria, that if you don't give a zip code or a neighborhood name (or use the dash system), google maps will point to Long Island City and that's within the borough. No other borough in the city uses this system. It's just a Queens gem. Outside of the state, however, I believe (don't quote me on this) that Philadelphia uses this system. I know this is an old post, but I just saw it and wanted to give my two cents.
Generally, you can't distinguish between these different address styles, without additional information. Fortunately, the remainder of the addresses provide some clues as to what address style is in use.
Your first example is a Queens style address. Knowing that the address is in NY, and knowing that it has a specific street name, you might be able to infer that it's in Queens, and treat accordingly. If you had the ZIP code, that would be even better, because then you could restrict treatment of Queens style addresses to only those that have specific ZIP codes.
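The numeric clues mentioned in these answers can be combined into a rough first-pass heuristic. The Python sketch below is only that: the function name, the labels, and the cut-off of 280 (taken from the "less than 280 streets or avenues" remark above) are assumptions, and it cannot replace a ZIP code check or an address verification service. In particular, a short range such as 100-110 outside Queens would still be reported as a Queens candidate.

```python
import re

def classify_house_number(house_number, max_cross_street=280):
    """Guess the style of a hyphenated house number (heuristic only)."""
    match = re.fullmatch(r"(\d+)-([0-9A-Za-z]+)", house_number.strip())
    if match is None:
        return "plain"
    first, second = match.groups()
    if not second.isdigit():
        return "unit"              # e.g. 1990-A: a letter cannot end a range
    if int(first) < max_cross_street:
        return "possibly-queens"   # first part could be a cross street number
    if int(second) > int(first):
        return "range"             # e.g. 6414-6418
    return "unit"                  # e.g. 1990-12: too small to close a range

for sample in ["123-125", "6414-6418", "1990-A", "1990-12", "198-16"]:
    print(sample, classify_house_number(sample))
```

Anything labelled "possibly-queens" should then be confirmed against the ZIP code or city before being treated as a single Queens-style address.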

Yahoo Maps Geocode

How do I work around a problem with the Yahoo Maps geocode result set? The result set being returned is wrong: the city field contains the city, region and postal code, as seen below.
Is there a way to work around this issue without breaking scalability?
-33.924320
151.187057
203 Coward St
MASCOT NSW 2020
Australia
AU
The Yahoo geocoding service usually returns either XML or serialized PHP. Since you are querying the geocoding service, I suppose you already have the address and want to get the coordinates for your geoPoint. It is possible that you are sending the maps engine a malformed request.
If you think you have found a bug you can send them an email, but I suggest you first check with other locations, or post your code here so that any errors can be spotted.