Why does MaxMind's GeoLite2 City database return a non-existent geoname_id?

I've downloaded the latest update of MaxMind's GeoLite2 City database (in both the MaxMind DB binary and CSV formats). When I tried to look up "88.184.98.0", here's what I got:
{"city":{"geoname_id":2982652,"names":{"de":"Rouen","en":"Rouen","es":"Ruan","fr":"Rouen","ja":"ルーアン","pt-BR":"Ruão","ru":"Руан","zh-CN":"鲁昂"}},"continent":{"code":"EU","geoname_id":6255148,"names":{"de":"Europa","en":"Europe","es":"Europa","fr":"Europe","ja":"ヨーロッパ","pt-BR":"Europa","ru":"Европа","zh-CN":"欧洲"}},"country":{"geoname_id":3017382,"is_in_european_union":true,"iso_code":"FR","names":{"de":"Frankreich","en":"France","es":"Francia","fr":"France","ja":"フランス共和国","pt-BR":"França","ru":"Франция","zh-CN":"法国"}},"location":{"accuracy_radius":5,"latitude":49.4431,"longitude":1.0993,"time_zone":"Europe/Paris"},"postal":{"code":"76100"},"registered_country":{"geoname_id":3017382,"is_in_european_union":true,"iso_code":"FR","names":{"de":"Frankreich","en":"France","es":"Francia","fr":"France","ja":"フランス共和国","pt-BR":"França","ru":"Франция","zh-CN":"法国"}},"subdivisions":[{"geoname_id":11071621,"iso_code":"NOR","names":{"de":"Normandie","en":"Normandy","es":"Normandía","fr":"Normandie"}},{"geoname_id":2975248,"iso_code":"76","names":{"de":"Seine-Maritime","en":"Seine-Maritime","es":"Sena Marítimo","fr":"Seine-Maritime","pt-BR":"Sena Marítimo"}}]}
However, some of the returned subdivisions have no corresponding geoname_id in the CSV files (e.g. cat GeoLite2-City-Locations-en.csv | grep 11071621 returns nothing).
Is it a bug or expected behavior? I couldn't find anything in the documentation.

I've reached out to support, and the answer is that this is by design: the CSV files do not include a location if no IP block is directly associated with it.
Read more here: https://github.com/maxmind/MaxMind-DB/issues/61
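A minimal sketch of the grep-style check, using Python's csv module on a tiny inline sample (the real file is GeoLite2-City-Locations-en.csv; the columns shown here are a subset and may differ slightly across releases):

```python
import csv
import io

# Tiny inline stand-in for GeoLite2-City-Locations-en.csv. Note that
# subdivision IDs such as 11071621 may legitimately be absent by design.
SAMPLE_CSV = """geoname_id,locale_code,subdivision_1_iso_code,city_name
2982652,en,NOR,Rouen
2975248,en,76,
"""

def has_geoname_id(csv_file, geoname_id):
    """Return True if any row's geoname_id column matches."""
    reader = csv.DictReader(csv_file)
    return any(row["geoname_id"] == str(geoname_id) for row in reader)

print(has_geoname_id(io.StringIO(SAMPLE_CSV), 2982652))   # True: city row present
print(has_geoname_id(io.StringIO(SAMPLE_CSV), 11071621))  # False: subdivision absent
```

With a real download you would pass open("GeoLite2-City-Locations-en.csv") instead of the StringIO sample.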

Related

ClientError: Unable to parse csv: rows 1-1000, file

I've looked at the other answers to this issue and none of them help. I'm trying to run a simple Random Cut Forest algorithm on a small data set of IPs that have been stripped down to numbers only, and I still get this error. The file has a single column of these numbers. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
rcf = RandomCutForest(role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1, 1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was using the sample notebook Julien Simon mentioned earlier, but at some point the data was ending up as strings! The catch with the RCF algorithm is that it has to run on numeric data.
What I did was make sure to cast the array to an int array as a double check, and voilà! It worked. I'm at a loss as to how the data ended up in string format, but that was the issue. Simple solution.
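The defensive cast described above can be sketched like this (the values are illustrative, not from the actual notebook; the point is that a column read as strings has to be converted before record_set()):

```python
import numpy as np

# pandas can silently read a numeric column with an object/string dtype
# (e.g. stray whitespace in the file). RCF expects numeric input.
raw = np.array(["176162144", "176862141", "176762141"])  # strings, not numbers
assert raw.dtype.kind not in ("i", "f")  # the silent failure mode

# Explicit cast to integers, then reshape to the (n_samples, 1) layout
# that record_set() expects for a single feature.
data = raw.astype(np.int64).reshape(-1, 1)
print(data.dtype)  # int64
```

After this, rcf.fit(rcf.record_set(data)) would receive proper numeric data.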

sd_journal_send to send binary data. How can I retrieve the data using journalctl?

I'm looking at systemd-journal as a method of collecting logs from external processors. I'm very interested in its ability to collect binary data when necessary.
I'm simply testing and investigating journal right now. I'm well aware there are other, probably better, solutions.
I'm logging binary data like so:
// strData is a std::string containing binary data
strData += '\0';
sd_journal_send(
    "MESSAGE=test_msg",
    "MESSAGE_ID=12345",
    "BINARY=%s", strData.c_str(),
    NULL);
The log line shows up when using the journalctl tool. I can find the log line like this from the terminal:
journalctl MESSAGE_ID=12345
I can get the binary data of all logs in journal like so from the terminal:
journalctl --field=BINARY
I need to get the binary data to a file so that I can access from a program and decode it. How can I do this?
This does not work:
journalctl --field=BINARY MESSAGE_ID=12345
I get this error:
"Extraneous arguments starting with 'MESSAGE_ID=1234567890987654321"
Any suggestions? The documentation on systemd-journal seems slim. Thanks in advance.
You just got the wrong option. See the docs for:
-F, --field=
Print all possible data values the specified field can take in all entries of the journal.
vs
--output-fields=
A comma separated list of the fields which should be included in the output.
You also have to specify the plain output format (-o cat) to get the raw content:
journalctl --output-fields=BINARY MESSAGE_ID=12345 -o cat

MaxMind's GeoIPCity for a single country only?

Recently I stumbled upon a problem: MaxMind's GeoIPCity file is way too big for our needs and contains a lot of data we don't need and won't need.
The question is: is there a way to limit the City database to a single country? let's say, Canadian cities only?
You cannot conveniently download a database for Canadian cities only, but you can certainly prune the database once you have downloaded it. This is true whether you use the MaxMind DB binary or the CSV format: just trim out the rows that do not match Canada's country code or geoname_id (depending on whether you use v1 or v2 of the dataset).
If you identify your specific coding environment and language, I'm certain someone can help you write a few lines of code that chops out all the fat.
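As a sketch of the CSV pruning approach (assuming the v2 locations file layout with a country_iso_code column; the file and column names here are illustrative, so adjust them for your actual download):

```python
import csv
import io

# Tiny inline stand-in for a GeoLite2-style locations CSV.
SAMPLE = """geoname_id,country_iso_code,city_name
6167865,CA,Toronto
2988507,FR,Paris
6173331,CA,Vancouver
"""

def keep_country(csv_file, iso_code):
    """Keep only the rows whose country_iso_code matches."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["country_iso_code"] == iso_code]

rows = keep_country(io.StringIO(SAMPLE), "CA")
print([r["city_name"] for r in rows])  # ['Toronto', 'Vancouver']
```

The same filter, written back out with csv.DictWriter, gives you a Canada-only locations file; the blocks files can be filtered the same way on their geoname_id columns.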

Inconsistency between GeoLite2 City and GeoLite2 Country

We started using the GeoLite2 Country DB and saw some very strange results. Looking closely, it appeared that the DB itself had incorrect data. Looking more closely, I saw that the GeoLite2 Country and GeoLite2 City databases gave different results for the same IP.
(I may also be reading this wrong so any advice on that most welcome!)
The IP in question is 46.251.120.133
MaxMind – country:
- Doesn't have 46.251.120.0.
- It does have 46.251.0.0, which maps to location 719819 (Hungary) - incorrect.
MaxMind – city:
- Has 46.251.120.0, which maps to location 146268 (Nicosia, Cyprus) - correct.
To be specific, we're using the csv files found here:
http://dev.maxmind.com/geoip/geoip2/geolite2/
Really hoping I'm reading something wrong in the db...
Thanks!
::ffff:46.251.120.0 is part of the ::ffff:46.251.96.0/115 network, which is mapped to 146669. 146669 is the blocks record for Cyprus.
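The containment can be checked with Python's ipaddress module; this is just a sketch of the prefix arithmetic, not a lookup against the real database. The v2 CSVs store networks in IPv4-mapped IPv6 form, so ::ffff:46.251.96.0/115 is 46.251.96.0/19 in plain IPv4 terms (115 - 96 = 19 bits of IPv4 prefix):

```python
import ipaddress

# The IPv4-mapped IPv6 network from the blocks file.
net_v6 = ipaddress.ip_network("::ffff:46.251.96.0/115")
print(net_v6.network_address.ipv4_mapped)  # 46.251.96.0

# The equivalent plain-IPv4 network.
net_v4 = ipaddress.ip_network("46.251.96.0/19")
addr = ipaddress.ip_address("46.251.120.0")
print(addr in net_v4)  # True: the /19 spans 46.251.96.0 - 46.251.127.255
```

So the City file's answer for 46.251.120.0 comes from that /19 block, not from a 46.251.120.0 row.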

How to post Stata program via Dropbox or private website?

Here is a sample program .do file, sampleprog.do:
program sampleprog
    egen newVar = group(`1' `2')
end
How can I post it on my website (or dropbox), so that other people could install it to their Stata like this?
net from http://www.mywebsite.com/sampleprog.do
*** or maybe like this:
ssc install ...
I read the documentation about stata.toc but did not quite get it. What files should I upload, and should they all go in one folder?
(PS: I definitely can simply email the .do file but this is not an option in my case.)
Here is a full explanation of how to share program or data files with others using your own website. I tried using Dropbox, but Stata 12 appears to have issues with https, which is the protocol for all Dropbox public links. If you want to use Dropbox, I recommend creating a shared folder that will sync on your collaborators' machines. The rest of this answer assumes you have a website serving pages over http or are using Stata 13, which supports https.
If this is a one-time thing, you can skip the rest of this answer by putting the file on your website and telling your collaborator to type:
. copy http://your-site.com/ado/program.ado program.ado
That will copy the ado file at the specified URL into the user's current directory. If you want to provide information about your files, plan on sharing with multiple people, and need to maintain/document a set of files, read on!
Step 1 Create a folder on your website to hold the programs. I will call mine ado/
Step 2 Add the program files, help files, and data files you want to share. For this example, I have created a simple ado file called unique.ado with the following contents:
********************************************** unique.ado
capture program drop unique
program define unique
*! Count and number observations within group defined by varlist
* Example: unique person_id, obs(prow) tobs(pcount) sortby(time)
* to count and number rows by a variable called person_id
syntax varlist, obs(name) tobs(name) [sortby(varlist)]
bys `varlist' (`sortby') : gen long `obs' = _n
bys `varlist' (`sortby') : gen long `tobs' = _N
la var `obs' "Number of this row within `varlist' group."
la var `tobs' "Total number of rows with identical `varlist' values."
end
Step 3 Create a file called stata.toc to describe the files you wish to share. Here is mine:
********************************************** stata.toc
v 3
d Program to count observations by group
p unique [The unique.ado program for counting observations by group]
These files can be complicated. There are many features I won't cover here, but you can read this documentation to learn more.
Step 4 Create a package file for each of the packages defined by the lines in stata.toc that start with the letter p. Here is my package file for the unique package defined above:
********************************************** unique.pkg
v 3
d unique
d Program to count observations by group
d Distribution-Date: 28 June 2012
f unique.ado
Your directory now looks like this:
ado/
stata.toc
unique.ado
unique.pkg
Step 5 Use the site! Here are the commands to enter.
. net from http://example.com/ado/
. net describe unique
. net install unique
Here is what you'll see after entering the first command:
-----------------------------------------------------------------------------------
http://www.example.com/ado/
Program to count observations by group
-----------------------------------------------------------------------------------
PACKAGES you could -net describe-:
unique [The unique.ado program for counting observations by group]
-----------------------------------------------------------------------------------
The second command, net describe unique, will tell you more about the package:
---------------------------------------------------------------------------------------
package unique from http://www.example.com/ado
---------------------------------------------------------------------------------------
TITLE
unique
DESCRIPTION/AUTHOR(S)
Program to count observations by group
Distribution-Date: 28 June 2012
INSTALLATION FILES (type net install unique)
unique.ado
---------------------------------------------------------------------------------------
The third command, net install unique, will install the package:
checking unique consistency and verifying not already installed...
installing into /Users/cpoliquin/Library/Application Support/Stata/ado/plus/...
installation complete.
EDIT
See Nick's comments in the answer below. I intended this example to be simple and I don't expect other people to use this program. If you plan on submitting things to Stata Journal or SSC then his comments certainly apply! I hope this answer can serve as a decent tutorial for those confused by the official documentation.
This will be too long for a comment, so it is going to be an extra answer.
Your example uses the program name unique. If you search unique, all (or in Stata 13, search unique) you will find that a user-written program with the same name has been installed on SSC since 1998. This will create a clash of names for your users if (and only if) they attempt to use your program and also that earlier program. The more general advice is to search to see if a program name is already in use to try to avoid these problems.
Specifically, although you may just be using your unique as an arbitrary example, note that it contains bugs. An int doesn't contain enough bits to hold observation numbers exactly for large datasets. Also, as a matter of style, unique can change the sort order of your data, which is widely considered to be poor data management style.
Your example concerns dissemination of a program file without an accompanying help file. Suffice it to say that the SSC site would never accept such a program and the Stata Journal would not even review a paper based on such a submission before a help file was written to accompany it. Including explanatory comments with the code may be sufficient for your personal practices, but it falls below general Stata standards.
Stata 13 now supports https. See http://www.stata.com/manuals13/u.pdf, Section 3.6.
In short, I appreciate that you are trying to explain how to do something, but it is already well documented, and explicitly and implicitly some of your recommendations are below community standards.