How to use GeoLite City, how to obtain startIpNum and endIpNum - geoip

Alright, I downloaded the GeoLite City database from http://dev.maxmind.com/geoip/legacy/geolite/.
The values are in the GeoLiteCity-Blocks.csv
startIpNum,endIpNum,locId
"16777216","16777471","17"
However, I have no idea how to convert IPs into these startIpNum and endIpNum values. I can't find an explanation on their site, and they link to here.
What is the algorithm to convert an IPv4 address into this format?
Thank you very much

It depends on which programming language you use.
In php, you can use ip2long and long2ip to work with ips.
string long2ip ( string $proper_address )
The function long2ip() generates an Internet address in dotted format (i.e.: aaa.bbb.ccc.ddd) from the proper address representation.
http://php.net/manual/en/function.long2ip.php
Example with the ips you give: http://sandbox.onlinephpfunctions.com/code/3decd3d4818a02d65b9a80cf2dc1c70297b0d2d5

If you were using Python, you could use
>>> import socket, struct
>>> socket.inet_ntoa(struct.pack('!L', 16777216))
'1.0.0.0'
to convert from integer to IP string.
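Going the other way (dotted quad to the integer used in startIpNum/endIpNum) is the inverse operation; a minimal sketch using the same socket and struct modules:
>>> struct.unpack('!L', socket.inet_aton('1.0.0.0'))[0]
16777216
Equivalently, the underlying algorithm for an address a.b.c.d is ipnum = 16777216*a + 65536*b + 256*c + d.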

Related

ClientError: Unable to parse csv: rows 1-1000, file

I've looked at the other answers to this issue and none of them are helping me. I am trying to run a simple random cut forest algorithm. I have a small data set of IPs which have been stripped down to only have numbers. I still get this error. It only has one column of these numbers. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
rcf = RandomCutForest(role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1, 1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was using the sample notebook Julien Simon mentioned earlier, but at some point the data was ending up as strings! The funny thing about the RCF algorithm is that it has to run on numeric data, not strings.
What I did was make sure to cast the array to an integer array as a double check, and voilà! It worked. I'm at a loss as to how the data ended up in string format, but alas, that was the issue. Simple solution.
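For reference, the cast looked roughly like this (a sketch assuming numpy and a single-column CSV like the one above; the file name is hypothetical):
import numpy as np

# Values read from the CSV may come back as strings
data = np.genfromtxt('ips.csv', dtype=str)  # hypothetical file name

# RCF needs numeric input, so cast explicitly before building the record set
data = data.astype(np.int64).reshape(-1, 1)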

Maxmind GeoLite2-City is_anonymous_proxy and is_satellite_provider always 0

I've downloaded the Maxmind GeoLite2-City.csv file from https://dev.maxmind.com/geoip/geoip2/geolite2/
I see that the is_anonymous_proxy and is_satellite_provider columns are always 0.
I am trying to figure out why these columns are always 0, and whether, if I look up an IP address belonging to an anonymous proxy, I will get a 0 or a different value.
Is there an API (Java or another language) to query GeoLite2-City and see whether it gives reliable results?
The is_anonymous_proxy and is_satellite_provider columns are not always 0. Try the following command to find examples:
grep ",1," GeoLite2-City-Blocks-IPv4.csv

Dealing with a list of tuples - SQLite3 & Python 2.7

I am using a database to return a couple of values I place there. Let's just say the data is google, yahoo, bing.
The Code
dbCursor.execute('''SELECT ticker FROM SearchEngines''')
allEngines = dbCursor.fetchall()
for engine in allEngines:
    print engine
Yields the following result:
(u'google',)
(u'yahoo',)
(u'bing',)
This is troublesome because I need the results appended to a URL as plain strings. Does anybody know a way around this?
Thanks
fetchall() always returns each row as a tuple, even if you're just selecting one field. So...
for engine in allEngines:
    print engine[0]
Or:
for (engine,) in allEngines:
    print engine
Hope this helps.
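And to append the results to a URL, as you asked, a quick sketch (Python 2.7 to match the question; the base URL is made up):
base_url = 'http://example.com/search?engine='
for (engine,) in allEngines:
    print base_url + engine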

I need an address matching algorithm

I have looked around online for this but haven't found much. Basically, I need to compare a bunch of addresses to see if they match. The addresses could be written in all sorts of different ways, for example: 1345 135th st NE, 1345 NE 135TH ST, etc. Plus, they could be in different languages as well. Before I attempt to write some parsing/matching algorithm on my own, does anyone know of any libraries or ways I could easily do this? My friend thought of using the Google or Bing Maps web service, passing it each address, getting back the geo-coordinates, and comparing the coordinates instead of string matching. But then I'd have to call a web service thousands of times for all these addresses, which isn't very elegant ;) Any help would be nice :)
I don't think this is a REGEX type of problem. You are looking at converting to a comparable format first.
There are several web services/products available that will standardize an address for you. Search Bing for "USPS Address Standardization API" and you will find a ton of information. Once the address is standardized, the comparison should be straightforward.
http://www.bing.com/search?q=usps+address+standardization+api&go=&form=QBRE&qs=n&sk=&sc=1-32
Alternatively you can GeoCode the address to get a set of coordinates and then compare those.
http://code.google.com/apis/maps/documentation/geocoding/
US addresses can (usually) be uniquely represented by a 12-digit number called the delivery point barcode (DPBC). This number consists of the full 9-digit ZIP Code and a 3-digit delivery point number. This is what is used to form barcodes on mail pieces to speed up delivery. Using a service that is CASS-certified can provide the 12-digit delivery point and even flag duplicates for you.
In the interest of full disclosure I work for SmartyStreets, which was formerly Qualified Address, which was mentioned in the other answer by Mowgli.
We provide an API that can be queried as well as a batch processing service (which will flag duplicates as explained above).
Keep in mind that even the 12-digit DPBC doesn't always uniquely identify a particular address. This happens frequently when a particular street block, or 9-digit ZIP code, has a long stretch of homes that have similar primary numbers. In these cases, it's best to use a CASS service to standardize and validate the addresses, then hash them for convenient comparisons. (But as said, duplicates will already be flagged by some CASS services.)
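For instance, once addresses are standardized, hashing makes the duplicate check cheap; a minimal sketch (the standardization step itself is assumed to have been done by a CASS service):
import hashlib

def address_key(standardized_address):
    # Hash a standardized address string for cheap duplicate detection
    return hashlib.sha256(standardized_address.lower().encode('utf-8')).hexdigest()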
Update: SmartyStreets now provides international address verification.
I wouldn't consider this a regex problem.
One free tool that could be helpful is usaddress, a Python library for parsing addresses. It performs pretty well on all sorts of address formats, because it uses a probabilistic approach rather than a regex approach (although it is made for US addresses and may not work well on addresses in other languages).
http://usaddress.readthedocs.org/en/latest/
Parsing addresses won't solve your problem 100%, but comparing two addresses, especially addresses with varying formats, will be much easier if the addresses are split into their respective components (so that you can compare street number against street number, city against city, etc.).
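For instance, a minimal usaddress sketch (per its docs, tag() returns an ordered dict of labeled components plus an address type):
import usaddress

parsed, addr_type = usaddress.tag('1345 135th st NE')
# parsed maps component labels to values,
# e.g. AddressNumber -> '1345', StreetNamePostDirectional -> 'NE'
print(parsed)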
Then, to compare records, you can use dedupe - another free python library.
http://dedupe.readthedocs.org/en/latest/
I found 2 options.
Firstly, maybe, instead of taking any input, you let the users choose from a limited number of options, similar to how Facebook deals with addresses. If you use an autocomplete API, the possible addresses will be narrowed down by the API as the user types. Here is one from Google:
http://code.google.com/p/geo-autocomplete/
Secondly, address finding & qualifying services (but they aren't free):
https://www.craftyclicks.co.uk/
https://smartystreets.com/ (Previously Qualified Address)
https://www.alliescomputing.com/ (Previously offered World Addresses)
There is an open source python library for record deduplication / entity resolution that can be applied to address matching: Dedupe.
It's free and can be run on a laptop, as opposed to a huge server.
This requires intelligence to do correctly; computers aren't intelligent.
A simple algorithm could tell you which addresses have something in common, for example, "1345 135th st NE" and "1345 NE 135TH ST" have the number "1345" in common.
You would then have fewer to compare yourself. It would also reduce the number you geolocate.
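A minimal sketch of that idea (case-insensitive token overlap between two address strings):
def shared_tokens(a, b):
    # Tokens the two address strings have in common, ignoring case
    return set(a.upper().split()) & set(b.upper().split())

print(shared_tokens('1345 135th st NE', '1345 NE 135TH ST'))
# -> {'1345', '135TH', 'ST', 'NE'} (set order may vary)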
This is definitely not a REGEX problem. It's 2018, and we have more advanced methods at hand now. Both R and Python offer solutions for this type of problem.
In R: https://cran.r-project.org/web/packages/RecordLinkage/index.html
In python: https://recordlinkage.readthedocs.io/en/latest/about.html
1. Using address string similarity
Because addresses can be written in many different ways, it's useful to apply fuzzy logic and calculate the similarity of the address strings. I used the fuzzywuzzy Python library to solve this task. It has functions that calculate the Levenshtein distance as the difference between strings.
from fuzzywuzzy import fuzz

addr1 = "USA AZ 850020 Phoenix Green Garden street, 283"
addr2 = "850020, USA AZ Phoenix Green Garden, 283, 3a"
addr3 = "Canada VC 9830 Vancouver Dark Ocean street, 283"

# token_set_ratio ignores word order and duplicated tokens,
# which suits addresses written in different formats
addr_similarity12 = fuzz.token_set_ratio(addr1, addr2)
addr_similarity13 = fuzz.token_set_ratio(addr1, addr3)

print(f"Address similarity 1 <-> 2: {addr_similarity12}")
print(f"Address similarity 1 <-> 3: {addr_similarity13}")
Output will be:
Address similarity 1 <-> 2: 96
Address similarity 1 <-> 3: 55
Indeed, the first two addresses are almost the same, while the third is different from the first. The important task is choosing an appropriate threshold that will indicate address equality.
2. Using the Google Maps Geocoding API
Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739). And then it's possible to calculate numerical "distance" between two addresses.
Well, one way to solve this problem is to convert both addresses to the same format. One easy way to do this is with the Google Maps Geocoding API: simply pass both addresses to the API and compare the output. The output of the Geocoding API looks something like this:
FORMAT OF GOOGLE'S GEOCODING API RESPONSE (for reference):
{'results': [{'address_components': [{'long_name': '22',
'short_name': '22',
'types': ['street_number']},
{'long_name': 'Rue de Berri',
'short_name': 'Rue de Berri',
'types': ['route']},
{'long_name': 'Paris',
'short_name': 'Paris',
'types': ['locality', 'political']},
{'long_name': 'Département de Paris',
'short_name': 'Département de Paris',
'types': ['administrative_area_level_2', 'political']},
{'long_name': 'Île-de-France',
'short_name': 'IDF',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'France',
'short_name': 'FR',
'types': ['country', 'political']},
{'long_name': '75008', 'short_name': '75008', 'types': ['postal_code']}],
'formatted_address': '22 Rue de Berri, 75008 Paris, France',
'geometry': {'location': {'lat': 48.8728822, 'lng': 2.3054154},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 48.8743208802915,
'lng': 2.306719730291501},
'southwest': {'lat': 48.8716229197085, 'lng': 2.304021769708497}}},
'place_id': 'ChIJWxDbRsFv5kcRRcfu62JSRog',
'plus_code': {'compound_code': 'V8F4+55 Paris, France',
'global_code': '8FW4V8F4+55'},
'types': ['establishment', 'lodging', 'point_of_interest']}],
'status': 'OK'}
Notice how Google has provided the different components of the address, like street number, locality, etc. Now you can do a weighted/fuzzy matching between these components. It's up to you whether you want all of them to match, or to use rules such as: street numbers should always match, and for the other components it's okay if 4 out of 5 match. You can also consider the distance between the coordinates (note: use the Haversine function, not just Euclidean distance; reference: https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b). You can then compute a weighted score, which should be greater than a threshold for two addresses to be considered the same place.
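For the coordinate part, a minimal Haversine sketch (Earth radius taken as ~6371 km):
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))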

Can you get access to the NumberFormatter used by ICU MessageFormat

This may be a niche question but I'm working with ICU to format currency strings. I've bumped into a situation that I don't quite understand.
When using the MessageFormat class, is it possible to get access to the NumberFormat object it uses to format currency strings? When you create a NumberFormat instance yourself, you can specify attributes like the precision and rounding used when creating currency strings.
I have an issue where, for the South Korean locale ("ko_KR"), the MessageFormat class seems to create currency strings with rounding (100.50 -> ₩100).
In areas where I use NumberFormat directly, I set setMaximumFractionDigits and setMinimumFractionDigits to 2 but I can't seem to set this in the MessageFormat.
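For reference, a sketch of that direct NumberFormat usage via the PyICU bindings (assumed here for illustration; the C++/Java ICU APIs are analogous):
import icu

fmt = icu.NumberFormat.createCurrencyInstance(icu.Locale('ko_KR'))
fmt.setMinimumFractionDigits(2)
fmt.setMaximumFractionDigits(2)
print(fmt.format(100.50))  # forces e.g. ₩100.50 rather than the rounded ₩100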
Any ideas?
Probably because one Korean Won is worth $0.00088. They don't have pennies.
Same idea with the Swiss Franc, the 1 centime coin stopped being legal tender in 2007.
I've determined that gaining access to the internal formatter used is not possible. I've opened a ticket with the ICU project. http://bugs.icu-project.org/trac/ticket/7571#preview