How to parse the Freebase quad dump using Amazon MapReduce

I'm trying to extract movie information from Freebase; I just need the name of the movie and the names and IDs of the director and the actors.
I found this hard to do using Freebase's topic dumps, because they contain no reference to the director's ID, just the director's name.
What is the right approach for this task? Do I need to somehow parse the whole quad dump using Amazon's cloud, or is there an easier way?

You do need to use the quad dump, but it is under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing. A decent laptop should be fine. On a couple-year-old laptop, this simple-minded command:
time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545
real 18m56.968s
user 19m30.101s
sys 0m56.804s
extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
You'll need to traverse an intermediary node (a CVT, in Freebase-speak) to get the actors, but the rest of your information should be connected directly to the subject film node.
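To make that concrete, here is a rough, untested Python sketch of a two-pass extraction; the tab layout (subject, predicate, object, value) and the property IDs used below are assumptions you should verify against a few lines of your copy of the dump:
import bz2
import csv
from collections import defaultdict

DUMP = "freebase-datadump-quadruples.tsv.bz2"

def quads(path):
    """Yield (subject, predicate, object, value) tuples from the bz2 dump."""
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 3:
                yield parts[0], parts[1], parts[2], parts[3] if len(parts) > 3 else ""

# Pass 1: structure. Films point at directors directly, but at actors
# through an intermediary /film/performance node (the CVT mentioned above).
directed_by = defaultdict(list)    # film mid -> director mids
film_of_cvt = {}                   # performance CVT mid -> film mid
actor_of_cvt = {}                  # performance CVT mid -> actor mid

for subj, pred, obj, _ in quads(DUMP):
    if pred == "/film/film/directed_by":
        directed_by[subj].append(obj)
    elif pred == "/film/film/starring":
        film_of_cvt[obj] = subj
    elif pred == "/film/performance/actor":
        actor_of_cvt[subj] = obj

# Pass 2: names, but only for the mids we actually need, to keep memory sane.
wanted = set(directed_by)
wanted.update(m for ms in directed_by.values() for m in ms)
wanted.update(film_of_cvt.values())
wanted.update(actor_of_cvt.values())

names = {}
for subj, pred, obj, value in quads(DUMP):
    if pred == "/type/object/name" and subj in wanted:
        names[subj] = value or obj

# Join: one row per (film, actor) pair, with the director mids and names alongside.
with open("films.tsv", "w", newline="", encoding="utf-8") as out:
    w = csv.writer(out, delimiter="\t")
    for cvt, film in film_of_cvt.items():
        actor = actor_of_cvt.get(cvt)
        if actor:
            directors = directed_by.get(film, [])
            w.writerow([film, names.get(film, ""),
                        "|".join(directors),
                        "|".join(names.get(d, "") for d in directors),
                        actor, names.get(actor, "")])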
Tom

First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command-line tools to take 'interesting' slices of data out of the Freebase data dump.
However, an alternative would be to load Freebase data into a 'graph' storage system locally and use APIs and/or the query language available from that system to interact with the data for further processing.
I use RDF, since the data model is quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
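If you prefer to stay in Python rather than Java, a rough stand-in for that workflow uses rdflib instead of Jena; it only makes sense for a small extracted slice (rdflib keeps the whole graph in memory), and the namespace and property names below are assumptions that depend on what freebase2rdf emits, so adjust them to match your converted file:
from rdflib import Graph

g = Graph()
g.parse("film-slice.nt", format="nt")    # a small N-Triples slice, not the full dump

query = """
PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?film ?name ?director
WHERE {
    ?film fb:film.film.directed_by ?director .
    ?film fb:type.object.name ?name .
}
LIMIT 10
"""
for row in g.query(query):
    print(row.film, row.name, row.director)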
Another reasonable and scalable approach would be to implement what you need in MapReduce, but this makes sense only if the processing touches a large fraction of the Freebase data and is not as trivial as counting lines. It is also more expensive than using your own machine: you need a Hadoop cluster or Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-))
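For what it's worth, here is roughly what a trivial Hadoop Streaming job in Python looks like, counting quads per predicate (a sketch only; the jar location and paths in the comment are placeholders for your cluster or EMR setup):
#!/usr/bin/env python
# Run with something like:
#   hadoop jar hadoop-streaming.jar \
#       -input /data/quads -output /data/predicate-counts \
#       -mapper "quadcount.py map" -reducer "quadcount.py reduce" \
#       -file quadcount.py
import sys

def mapper():
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            print("%s\t1" % parts[1])              # key = predicate, value = 1

def reducer():
    current, total = None, 0
    for line in sys.stdin:                         # input arrives sorted by key
        key, _, count = line.rstrip("\n").partition("\t")
        if key != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = key
        total += int(count or 0)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    reducer() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else mapper()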
My 2 cents.

Related

Preprocessing data in EMR

I want to crunch 10 PB of data. The input data is in a proprietary format (stored in S3), and the first preprocessing step is to convert this proprietary data to CSV and move it back to S3. Due to some constraints, I can't couple the preprocessing step with the Map task. What would be the correct way to do that?
I'm planning to use AWS EMR for this. One way would be to run a separate EMR job with no reduce task and upload the data to S3 in the Map phase. Is there a better way? Running a MapReduce job without a reduce task just for preprocessing feels like a hacky solution.
It would seem you have at least two options:
Convert the data into a format you find easier to work with. You might want to look at formats such as Parquet or Avro. Using a map-only task for this is an appropriate method (a sketch of such a job follows these options); you would only use a reducer here if you wanted to control the number of files produced, i.e. to combine lots of small files into larger ones.
Create a custom InputFormat and read the data directly. There are lots of resources on the net about how to do this. Depending on what this proprietary format looks like, you might need to do this anyway to achieve #1.
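For option 1, here is a hedged sketch of such a map-only Streaming job in Python; parse_record is a hypothetical stand-in for whatever reads your proprietary format, and the paths in the comment are placeholders:
#!/usr/bin/env python
# Map-only converter for Hadoop Streaming: proprietary records in, CSV out.
# Run with reducers disabled, e.g.
#   hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 \
#       -input s3://bucket/raw/ -output s3://bucket/csv/ -mapper convert.py
import csv
import sys

def parse_record(line):
    """Hypothetical placeholder: turn one proprietary record into field values."""
    return line.rstrip("\n").split("|")   # replace with the real parsing logic

def main():
    writer = csv.writer(sys.stdout)
    for line in sys.stdin:
        fields = parse_record(line)
        if fields:
            writer.writerow(fields)

if __name__ == "__main__":
    main()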
A few things for you to think about are:
Is the proprietary format space efficient compared with other formats?
How easy is the format to work with? Would converting it to CSV make your processing jobs simpler?
Is the original data ever updated or added to? Would you need to continually convert it to another format or update already-converted data?

AWS ec2 to run a python program using latex and OpenCV

A friend and I are working on a machine learning project together. We've managed to collect about 5,000 tex documents (we hope to get up to around 100,000 soon). We have a Python script that we run on each document to do some text manipulation, extract particular parts of the tex code, compile the parts, convert the compiled parts to cropped PNG images, and search a converted PNG of the full tex for the cropped images using OpenCV. The code takes between 30 seconds and 2 minutes on the documents we've tried so far, so we really need to speed it up.
I've been tasked with gaining access to a computer cluster and figuring out how to implement our code on such a cluster. Someone suggested I look into using AWS, so I've made an account and have been trying to figure out how to use EC2 for the past few hours. Am I on the right track, or is there some other part of AWS or something else entirely that would be better suited to my task?
Whatever I use, it has to have access to the various Python libraries in our code and to pdflatex and the full set of tex packages. Is this possible on EC2? I have almost no idea how to go about using EC2 (I've managed to start some instances, but how do I use them to run my script? Do I need to change my Python script to accommodate the parallel processing, or does EC2 take care of that somehow? Is it as easy as starting a Linux instance and installing the programs I need, like I would on any other Linux machine?). None of the tutorials are immediately useful, and I'm still not even sure if EC2 is capable of doing what I'm looking for. Any advice is appreciated.
I wouldn't normally answer this kind of question, but it sounds like you are doing something interesting. So let's have a go.
Q1.
"We have a python script that we run on each document to do some text
manipulation, extract particular parts of the tex code, compile the
parts, convert the compiled parts to cropped PNG images, and search a
converted PNG of the full tex for the cropped images using OpenCV.. we
really need to speed it up"
Probably you could split the 100,000 documents into 10 parts and set up
10 instances of the processing software and do the run in parallel.
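The splitting itself can be trivial: give every instance the same document listing plus a shard number, and have each one process every Nth document. A rough Python sketch, assuming the documents live in S3 (the bucket/prefix names and process_document are placeholders for your own pipeline):
# shard_runner.py -- each of N instances runs:
#   python shard_runner.py <shard_index> <num_shards>
# and processes every num_shards-th document from the shared listing.
import sys
import boto3

def list_documents(bucket, prefix):
    """List all .tex object keys under the prefix (paginated)."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".tex"):
                yield obj["Key"]

def process_document(key):
    """Placeholder for your existing pipeline: download the document, run the
    tex/OpenCV steps, upload the results somewhere central."""
    print("processing", key)

def main():
    shard, num_shards = int(sys.argv[1]), int(sys.argv[2])
    keys = sorted(list_documents("my-tex-bucket", "documents/"))
    for key in keys[shard::num_shards]:
        process_document(key)

if __name__ == "__main__":
    main()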
To set up 10 identical instances there are many methods, but one of the simpler ways is to set up one machine as desired, take a snapshot, make an AMI, and then use the AMI to launch as many more copies as you need.
There might be an extra step with putting the results of the search into some
kind of central database.
I don't know anything about OpenCV, but there are several suggestions that it might go faster on a G3 instance type (which has a GPU). Google for "OpenCV on AWS".
Q2.
"trying to figure out how to use EC2 for the past few hours. Am I on
the right track, or is there some other part of AWS or something else
entirely that would be better suited to my task?"
EC2 is a general-purpose virtual machine, so if you already have code that runs on some other machine it is easy to move it to EC2.
EC2 has many features, but one you might find interesting is "spot instances": these are short-lived but cheap (typically 10% of the price) instances.
Q3.
Whatever I use, it has to have access to the various python libraries
in our code and to pdflatex and the full set of tex packages. Is this
possible on EC2?
Yes; you can pip install them or install them from packages just like on any other system.
Q4.
how do I use them to run my script? and do I need to change my python
script to accomodate the parallel processing, or does EC2 take care of
that somehow? is it as easy as starting a linux instance and
installing the programs I need like I would on any other linux
machine?
As described above, your basic task seems to scale well; you may need a step to collate the results. And yes, it is basically the same as any other Linux machine.

HBase BulkLoad without MapReduce

I'm wondering if it is possible to write a Java program that does a BulkLoad into HBase. I'm on a Hadoop cluster, but for various reasons I don't want to write a MapReduce job.
Thanks
It depends on what format your data is in at the moment.
BulkLoad works with HFiles, so if you already have HFiles you can use LoadIncrementalHFiles directly to handle the bulk load.
Generally we use MapReduce to convert the data into that format and then perform the bulk load.
If you have a CSV file, you can use the ImportTsv utility to process your data into HFiles; see this link for more information.
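As a loose sketch of what driving those existing tools from a plain script looks like (class names differ between HBase versions, so check the one on your cluster; the paths and table name are placeholders):
import subprocess

HFILE_DIR = "/tmp/hfiles"   # HDFS directory that already contains HFiles
TABLE = "mytable"

# If you are starting from CSV/TSV instead of HFiles, ImportTsv can produce
# them first (note this step does run a MapReduce job under the hood):
# subprocess.run([
#     "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
#     "-Dimporttsv.columns=HBASE_ROW_KEY,cf:col1",
#     "-Dimporttsv.bulk.output=" + HFILE_DIR,
#     TABLE, "/data/input.tsv",
# ], check=True)

# The bulk load itself just moves the HFiles into the table's regions;
# no MapReduce job is involved here.
subprocess.run([
    "hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
    HFILE_DIR, TABLE,
], check=True)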
One point to note: bulk loads do not use the write-ahead log (WAL); they skip that step and add data at a faster rate. If you have any other framework that depends on the WAL (replication, for example), consider other options for adding data to HBase. Happy coding.

How to create a program which works similarly to RAID1 (mirroring)?

I want to create a simple program which works very much like RAID1. It should work like this:
First I want to give it the primary HDD's drive letter and then the secondary one's. I will only write to the primary HDD; if any new data is written to the primary HDD, the program should automatically copy it to the secondary one.
I need some help: where should I start with all this? How do I monitor the data written to the primary HDD? Obviously there are many ways to do what I want (I think), but I need the simplest way.
If this isn't too complicated, how can I handle the case where the primary HDD has two or more partitions? In that case I should check the secondary HDD's partitions too, and then create/resize them if necessary.
Thanks in advance!
kampi
The concept of mirroring disk writes to another disk in real time is the basis for high availability, and implementing these schemes is not trivial.
The company I work for makes DoubleTake, which does real-time mirroring and replication of file-based I/O to local or remote volumes. This is a little different from what you are describing, which appears to be block-based disk/volume replication, but many of the concepts are similar.
For file-based replication, there are quite a few nasty scenarios; I'll describe a couple:
Synchronizing the contents of one volume to another volume, keeping in mind that changes can occur while you are doing this. I suppose you could simplify this by requiring that volumes start out freshly formatted, but for people who already have data that will not be a good solution!
Keeping up with disk changes: what if the volume you are mirroring to is slower than the source volume? Where do you buffer? To disk? In memory?
Anyway, we use a kernel-mode file system filter driver to capture the disk I/O, and then our user-mode service grabs this I/O and forwards it to a local or remote disk.
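If you only need something far simpler to experiment with, a user-mode sketch in Python using the third-party watchdog package shows the general "watch the primary, copy to the secondary" shape; it has none of the guarantees of the filter-driver approach (no ordering, no crash consistency, races while files are being written), so treat it purely as a starting point:
import os
import shutil
import sys
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class MirrorHandler(FileSystemEventHandler):
    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def _target(self, src_path):
        rel = os.path.relpath(src_path, self.primary)
        return os.path.join(self.secondary, rel)

    def on_created(self, event):
        self.on_modified(event)

    def on_modified(self, event):
        if event.is_directory:
            os.makedirs(self._target(event.src_path), exist_ok=True)
            return
        dst = self._target(event.src_path)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(event.src_path, dst)   # copy file contents plus timestamps

if __name__ == "__main__":
    primary, secondary = sys.argv[1], sys.argv[2]   # e.g. D:\data E:\mirror
    observer = Observer()
    observer.schedule(MirrorHandler(primary, secondary), primary, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()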
If you want to learn about file system filtering, one of the best books (it's old but good) is Windows NT File System Internals by Rajeev Nagar. It's a must-read for doing any serious work with file system filters.
Also take a look at the file system filter samples in the Windows 7 WDK; it's free, and it includes good file-monitoring examples that will get you seeing disk changes pretty quickly.
Good Luck!

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding API on top of OpenStreetMap (they publish their data in dumps on a regular basis), I guess, but that data was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who were more than willing to enter street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have a data point for each and every curve, where you hold the geocoded location of the curve and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too, to get accurate road measurements.)
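The crow-flies part is standard: the haversine (great-circle) formula. A quick sketch in Python:
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance between two lat/lon points,
    in kilometres, using the haversine formula."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. roughly the New York -> Los Angeles distance (~3,940 km)
print(round(haversine_km(40.7128, -74.0060, 34.0522, -118.2437)))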
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
I would:
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
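A minimal sketch of that first prototype in Python with Flask (the hard-coded ZIP table is a placeholder standing in for the real, denormalized database):
# ZIP code in, "lat,lon" out as raw text.
from flask import Flask, abort

app = Flask(__name__)

ZIP_TABLE = {            # placeholder data, not real coverage
    "10001": (40.7506, -73.9972),
    "94103": (37.7726, -122.4099),
}

@app.route("/geocode/<zipcode>")
def geocode(zipcode):
    try:
        lat, lon = ZIP_TABLE[zipcode]
    except KeyError:
        abort(404)
    return "%f,%f" % (lat, lon)

if __name__ == "__main__":
    app.run(port=8080)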