What is the easiest to use distributed map reduce programming system?

For example, in a distributed datastore containing many users, each with many connections, say I wanted to count the total number of connections:
Map:
    for each record of type "user":
        count the user's connections
        return connection_count_for_one_user
Reduce:
    reduce(connection_count_for_one_user):
        total_connections += connection_count_for_one_user
Is there any mapreduce system that lets me program in this way?

Well, I'll take a stab at making some suggestions, but your question isn't too clear.
So how are you storing your data? The storage mechanism is separate from how you apply MapReduce algorithms to the data. I'm going to assume you are using the Hadoop Distributed File System (HDFS).
The problem you illustrate actually looks very similar to the typical Hadoop MapReduce word count example. Instead of counting words, you are counting connections.
Some of the options you have for applying MapReduce to data stored on HDFS are:
Java framework - good if you are comfortable with Java.
Pig - a high-level scripting language.
Hive - a data warehousing solution for Hadoop that provides an SQL-like interface.
Hadoop streaming - allows you to write mappers and reducers in pretty much any language.
Which is easiest?
Well, that all depends on what you feel comfortable with. If you know Java, take a look at the standard Java framework. If you are used to scripting languages, you could use Pig or streaming. If you know SQL, you could take a look at using HiveQL to query data on HDFS. I would take a look at the documentation for each as a starting point.
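To make that concrete, here is a minimal sketch of the connection count from the question as a standard Hadoop Java job. It assumes a made-up input layout in which each line is one user record of the form userId|conn1,conn2,...; adjust the parsing to whatever your datastore actually exports.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ConnectionCount {

      // Mapper: one input line per user record; emit ("total", connection count for that user).
      public static class ConnectionMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("total");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Hypothetical layout: userId|conn1,conn2,conn3
          String[] fields = line.toString().split("\\|");
          int connections = (fields.length > 1 && !fields[1].isEmpty())
              ? fields[1].split(",").length : 0;
          context.write(KEY, new IntWritable(connections));
        }
      }

      // Reducer: sum the per-user counts into one grand total.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int total = 0;
          for (IntWritable c : counts) {
            total += c.get();
          }
          context.write(key, new IntWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "connection count");
        job.setJarByClass(ConnectionCount.class);
        job.setMapperClass(ConnectionMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-sum map-side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Every mapper emits its per-user count under a single key, so one reducer (also reused as a combiner here) ends up with the grand total; it is essentially the word count example with only one word.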

Related

The best approach to make a Document store on top of a Key/Value store?

There are some projects on GitHub which use a key/value store like leveldb and rocksdb to make a NoSQL/SQL store. For example, levelgraph is a graph DB written in Node.js that uses leveldb under the hood. YugaByte DB is a distributed RDBMS on top of rocksdb. These projects, especially levelgraph, motivated me to make a document store on top of rocksdb/leveldb. Since I'm not familiar with the algorithms, data structures, and general theory of DBs, I want to know what's the best approach to make an embeddable document store (I don't want it to be distributed right now).
Questions:
Is there any academic paper or reference on this subject? Would you please list the required skills I need to obtain to finish the project?
Levelgraph is written in Node.js using levelup, a wrapper for abstract-leveldown compliant stores; leveldown is a pure C++ Node.js leveldb binding. If I want to program my DB in Node.js using levelup, how much will the language difference impact the DB performance?
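No full answer here, but to make the core idea concrete: the simplest document store over a key/value engine just serializes each document (e.g. to JSON) and stores it under a composite collection/id key; secondary indexes are then extra keys pointing back at document ids. Below is a rough sketch using the RocksDB Java binding (RocksJava), chosen only because the other code on this page is Java; levelup in Node.js exposes the same put/get/iterator primitives. The class name and key layout are invented for illustration.

    import java.nio.charset.StandardCharsets;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    // A toy "document store" that maps "collection/docId" keys to JSON blobs.
    public class TinyDocStore implements AutoCloseable {
      private final RocksDB db;

      public TinyDocStore(String path) throws RocksDBException {
        RocksDB.loadLibrary();
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
      }

      // Store a document (already serialized to JSON) under collection/id.
      public void put(String collection, String id, String json) throws RocksDBException {
        db.put(key(collection, id), json.getBytes(StandardCharsets.UTF_8));
      }

      // Fetch a document by collection/id, or null if absent.
      public String get(String collection, String id) throws RocksDBException {
        byte[] value = db.get(key(collection, id));
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
      }

      private static byte[] key(String collection, String id) {
        return (collection + "/" + id).getBytes(StandardCharsets.UTF_8);
      }

      @Override
      public void close() {
        db.close();
      }

      public static void main(String[] args) throws Exception {
        try (TinyDocStore store = new TinyDocStore("/tmp/tiny-doc-store")) {
          store.put("users", "42", "{\"name\":\"alice\",\"connections\":[\"7\",\"9\"]}");
          System.out.println(store.get("users", "42"));
        }
      }
    }

Real queries need range scans over a key prefix plus those secondary index keys, which is roughly what levelgraph layers on top of leveldb for its triple indexes.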

Hive vs HBase vs Pig for time series data with AWS Elastic MapReduce

I'm trying to perform statistical analysis on relatively flat time series data with AWS Elastic MapReduce. AWS gives you the option of using Hive, Pig, or HBase for EMR jobs; which one would be best for this type of analysis? I don't think the data analysis is going to be on the terabyte scale; items in my tables are mostly under 1K. I've also never used any of the three, but the learning curve shouldn't be an issue. I'm more concerned with what is going to be more efficient; I'm also handing this project off soon, so something that is relatively easy to understand for people with NoSQL experience would be nice, but I'm mostly looking to make the sensible choice for the data I have. An example query I might make is something like "Find all accounts between last week and today with an event value over 20 for each day".
IMHO, none of these. You use MR, Hive, Pig, etc. when your data is big, really big, but you are talking about a dataset that is not even ~1 TB, and you want your system to be efficient as well. In such a scenario, using these tools would be overkill. So the sensible choice for the data you have would be an RDBMS of your choice.
And if it is just for learning purposes, then use HDFS + Hive or Pig (depending on what suits you better).
In response to your comment :
If I had a situation like this, I would use HDFS to store my flat data, with Hive. The reason I would go with Hive is that I don't see a lot of transformation going on here. So, yes, I would go with Hive. And I don't really see any need for HBase as of now. HBase is normally used when you need random real-time access to some part of your data. And if your use case really does demand HBase, you need to be careful while designing your schema, since you are dealing with time series data.
But the decision on whether to use Hive or Pig needs some deeper analysis of the kind of operations you are going to perform on your data. You might find these links helpful :
http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html
http://www.larsgeorge.com/2009/10/hive-vs-pig.html
P.S.: You might want to have a look at the R project.
A short summary answer:
Hive is an easy "first option" for your data analysis, because it will use familiar SQL syntax. Because of this there are many convenient connectors to front end analysis tools: Excel, Tableau, Pentaho, Datameer, SAS, etc.
Pig is used more for ETL (transformation) of data incoming to Hadoop. Your data analysis may require some "transformation" of the data before it is stored in Hive. For example you may choose to strip out headers, apply information from other sources, etc. A good example of how this works is provided with the free Hortonworks sandbox tutorials.
HBase is more valuable when you're explicitly looking for a NoSQL store on top of Hadoop.
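To give a feel for the Hive option, the example query from the question might look roughly like the HiveQL below, run here through the Hive JDBC driver. The events table, its columns, and the connection details are placeholders, the built-in current_date assumes a reasonably recent Hive, and the query is just one reading of "accounts with an event value over 20 for each day".

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DailyEventReport {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn =
                 DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

          // Assumes a table events(account_id STRING, event_value INT, event_date DATE).
          String hql =
              "SELECT account_id, event_date, MAX(event_value) AS max_value "
                  + "FROM events "
                  + "WHERE event_date BETWEEN date_sub(current_date, 7) AND current_date "
                  + "  AND event_value > 20 "
                  + "GROUP BY account_id, event_date";

          try (ResultSet rs = stmt.executeQuery(hql)) {
            while (rs.next()) {
              System.out.printf("%s %s %d%n",
                  rs.getString("account_id"), rs.getString("event_date"), rs.getInt("max_value"));
            }
          }
        }
      }
    }

The point is less the JDBC plumbing than the query itself: anyone who knows SQL can read it, which is why Hive tends to be the easy hand-off choice.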

NoSQL with analytic functions

I'm searching for a NoSQL system (preferably open source) that supports analytic functions (AF for short) the way Oracle/SQL Server/Postgres do. I didn't find any with built-in support. I've read something about Hive, but it doesn't actually have AF features (windowing, first/last values, ntile, lag, lead, and so on), just histograms and n-grams. Also, some NoSQL systems (Redis for example) support map/reduce, but I'm not sure whether AF can be replaced with it.
I want to make a performance comparison to choose either Postgres or NoSQL system.
So, in short:
Searching for NoSQL systems with AF
Can I rely on map/reduce to replace AF? Is it fast, reliable, and easy to use?
P.S. I tried to make my question more constructive.
Once you've really understood how MapReduce works, you can do amazing things with a few lines of code.
Here is a nice video course:
http://code.google.com/intl/fr/edu/submissions/mapreduce-minilecture/listing.html
The real difficulty factor will be between functions that you can implement with a single MapReduce and those that will need chained MapReduces. Moreover, some nice MapReduce implementations (like CouchDB) don't allow you to chain MapReduces (easily).
Some functions require knowledge of all existing data, when they involve some kind of aggregation (avg, median, standard deviation) or some ordering (first, last).
If you want a distributed NoSQL solution that supports AF out of the box, the system will need to rely on some centralized indexing and metadata to keep information about the data on all nodes, thus having a master node and probably a single point of failure.
You have to ask what you expect to accomplish using NoSQL. Do you want schemaless tables? Distributed data? Better raw performance for very simple queries?
Depending on your needs, I see three main alternatives here:
1 - Use a distributed NoSQL store with no single point of failure (e.g., Cassandra) to store your data and use map/reduce to process the data and produce the results for the desired function (almost any major NoSQL solution supports Hadoop). The caveat is that map/reduce queries are not realtime (they can take minutes or hours to execute) and require extra setup and learning.
2 - Use a traditional RDBMS that supports multiple servers, like MySQL Cluster.
3 - Use a NoSQL store with a master/slave topology that supports ad-hoc and aggregation queries, like MongoDB.
As for the second question: yes, you can rely on M/R to replace AF. You can do almost anything with M/R.
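To illustrate that last point, here is a sketch of how a simple AF such as AVG(value) OVER (PARTITION BY key) maps onto a single Hadoop map/reduce pass; the input layout (key<TAB>value per line) is assumed, and the job wiring is the same as in the connection count sketch earlier on this page. Ordering-based functions like lag, lead, first, and last additionally need a secondary sort within each key, and ntile or median effectively need either the whole partition in memory or a chained job, which is why single-pass M/R only gets you part of the way.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emulates AVG(value) OVER (PARTITION BY key) with a plain map/reduce pass.
    // Job setup is identical to the ConnectionCount driver shown earlier.
    public class PartitionedAverage {

      // Mapper: parse "key<TAB>value" lines and emit (key, value).
      public static class ParseMapper
          extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          if (parts.length == 2) {
            context.write(new Text(parts[0]), new DoubleWritable(Double.parseDouble(parts[1])));
          }
        }
      }

      // Reducer: all values for one key arrive together, so the aggregate is trivial.
      // No combiner is set: a naive average is not associative, so we sum in one reduce.
      public static class AverageReducer
          extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
          double sum = 0;
          long count = 0;
          for (DoubleWritable v : values) {
            sum += v.get();
            count++;
          }
          context.write(key, new DoubleWritable(sum / count));
        }
      }
    }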

Starting with Data Mining

I have started learning data mining and wish to create a small project in C++/Java that allows me to utilize a database, say from Twitter, and then publish a particular set of results (e.g. all the news items on a feed). I want to know how to go about it. Where should I start?
This is a really broad question, so it's hard to answer. Here are some things to consider:
Where are you going to get the data? You mention Twitter, but you'll still need to collect the data in some way. There are probably libraries out there for listening to Twitter streams, or you could probably buy the data if someone is selling it.
Where are you going to store the data? Depending on how much you'll have and what you plan to do with it, a traditional relational database may or may not be the best fit. You may be better off with something that supports running MapReduce jobs out of the box.
Based on the answers to those questions, the choice of programming languages and libraries will be easier to make.
If you're really set on Java, then I think a Hadoop cluster is probably what you want to start out with. It supports writing MapReduce jobs in Java, and works as an effective platform for other systems such as HBase, a column-oriented datastore.
If your data are going to be fairly regular (that is, not much variation in structure from one record to the next), maybe Hive would be a better fit. With Hive, you can write SQL-like queries, given only data files as input. I've never used Mahout, but I understand that its machine learning capabilities are suited for data mining tasks.
These are just some ideas that come to mind. There are lots of options out there and choosing between them has as much to do with the particular problem you're trying to solve and your own personal tastes as anything else.
If you just want to start learning about data mining, there are two books that I particularly enjoy:
Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer.
And this one, which is free:
http://infolab.stanford.edu/~ullman/mmds.html
Good references for you are
an AI course taught by people who actually know the subject, the Weka website, machine learning datasets, even more datasets, and a framework for supporting the mining of larger datasets.
The first link is a good introduction on AI taught by Peter Norvig and Sebastian Thrun, Google's Research Director, and Stanley's creator (the autonomous car), respectively.
The second link gets you to the Weka website. Download the software - which is pretty intuitive - and get the book. Make sure you understand all the concepts: what data mining is, what machine learning is, what the most common tasks are, and what the rationales behind them are. Play a lot with the examples - the software package bundles some datasets - until you understand what generated the results.
Next, go to real datasets and play with them. When tackling massive datasets, you may face several performance issues with Weka - which is more of a learning tool, as far as I can tell from my experience. Thus I recommend you take a look at the fifth link, which will get you to the Apache Mahout website.
It's far from a simple topic; however, it's quite interesting.
I can tell you how I did it.
1) I got the data using twitter4j.
2) I analyzed the data using JUNG.
You have to define a class representing edges and a class representing vertices.
These classes will contain the attributes of the edges and vertices.
3) Then there is a simple function to add an edge, g.addEdge(edgeFromV1ToV2, V1, V2), or to add a vertex, g.addVertex(V).
The class that defines edges or vertices is easy to create. As an example:

    public class MyEdge {
        int id;
    }
The same is done for vertices.
Today I would do it with R, but if you don't want to learn a new programming language, just import JUNG, which is a Java library.
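To make steps 2) and 3) concrete, here is a rough, self-contained sketch of the JUNG side. MyVertex and MyEdge are illustrative placeholders, and the Twitter4J calls that would actually populate the graph are left out.

    import edu.uci.ics.jung.graph.DirectedSparseGraph;
    import edu.uci.ics.jung.graph.Graph;

    public class TwitterGraphSketch {

      // Vertex: one Twitter user, keyed by id.
      static class MyVertex {
        final long userId;
        MyVertex(long userId) { this.userId = userId; }
      }

      // Edge: a "follows" relation between two users.
      static class MyEdge {
        final long id;
        MyEdge(long id) { this.id = id; }
      }

      public static void main(String[] args) {
        Graph<MyVertex, MyEdge> g = new DirectedSparseGraph<>();

        MyVertex v1 = new MyVertex(1L);
        MyVertex v2 = new MyVertex(2L);
        g.addVertex(v1);
        g.addVertex(v2);

        // JUNG's signature is addEdge(edge, source, destination).
        g.addEdge(new MyEdge(100L), v1, v2);

        System.out.println(g.getVertexCount() + " vertices, " + g.getEdgeCount() + " edges");
      }
    }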
Data mining is a broad field with many different techniques: classification, clustering, association and pattern mining, outlier detection, etc.
You should first decide what you want to do and then decide which algorithm you need.
If you are new to data mining, I would recommend reading some books, like Introduction to Data Mining by Tan, Steinbach and Kumar.
I would suggest you use Python or R for the data mining process. Doing this work with Java or C is a bit difficult, in the sense that you need to do a lot of coding.

Is MapReduce right for me?

I am working on a project that deals with analyzing a very large amount of data, so I discovered MapReduce fairly recently, and before I dive any further into it, I would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here; I am thinking of a 10-15 second limit. Assuming my data will be loaded into a distributed file system before I perform any analysis on it, what kind of performance can I expect from it?
Let's say I need to filter a simple 5GB XML file that is well formed, has a fairly flat data structure and 10,000,000 records in it. And let's say the output will result in 100,000 records. Is 10 seconds possible?
If it is, what kind of hardware am I looking at?
If not, why not?
I put the example down, but now wish that I hadn't. 5GB was just a sample that I was talking about, and in reality I would be dealing with a lot of data. 5GB might be data for one hour of the day, and I might want to identify all the records that meet certain criteria.
A database is really not an option for me. What I wanted to find out is the fastest performance I can expect out of using MapReduce. Is it always in minutes or hours? Is it never seconds?
MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the overhead of startup usually takes a couple of minutes alone. The idea here is to take a processing job that would take days and bring it down to the order of hours, or hours to minutes, etc. But you would not start a new job in response to a web request and expect it to finish in time to respond.
To touch on why this is the case, consider the way MapReduce works (general, high-level overview):
1. A bunch of nodes receive portions of the input data (called splits) and do some processing (the map step).
2. The intermediate data (the output from the last step) is repartitioned such that data with like keys ends up together. This usually requires some data transfer between nodes.
3. The reduce nodes (which are not necessarily distinct from the mapper nodes; a single machine can do multiple jobs in succession) perform the reduce step.
4. Result data is collected and merged to produce the final output set.
While Hadoop, et al try to keep data locality as high as possible, there is still a fair amount of shuffling around that occurs during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data such that web queries can be fast BECAUSE they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.
A distributed implementation of MapReduce such as Hadoop is not a good fit for processing a 5GB XML file.
Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your XML is trivially flat, splitting the file will be non-deterministic, so you'll need a pre-processing step to format the file for splitting.
If you had many 5GB files, then you could use Hadoop to distribute the splitting. You could also use it to merge results across files and store the results in a format for fast querying by your web interface, as other answers have mentioned.
MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually, the job control, network, data replication, and fault tolerance features of a MapReduce framework makes it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.
The MapReduce paradigm might be useful to you if your tasks can be split among independent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.
There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).
Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.
Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.
I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.
Note that XML is very verbose. One of your main costs will be reading from and writing to disk. If you can, choose a more compact format.
The number of nodes in your cluster will depend on type and number of disks and CPU. You can assume for a rough calculation that you will be limited by disk speed. If your 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need 1000 nodes.
You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.
It sounds like what you might want is a good old-fashioned database. Not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either just import your 5GB file into a SQL database, or you could implement your own indexing scheme, by either storing records in different files, storing everything in memory in a giant hashtable, or whatever is appropriate for your needs.
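For scale, here is a minimal single-machine sketch of the "stream the XML once and filter" approach using the JDK's built-in StAX parser. The element and attribute names ("record", "value") are invented, since the actual schema isn't given, and the matched records could just as easily be written out or indexed in a hashtable instead of counted.

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    // Streams a large XML file once and counts records matching a simple filter.
    public class XmlFilter {
      public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream(args[0]));

        long matches = 0;
        while (reader.hasNext()) {
          if (reader.next() == XMLStreamConstants.START_ELEMENT
              && "record".equals(reader.getLocalName())) {
            String value = reader.getAttributeValue(null, "value");
            if (value != null && Integer.parseInt(value) > 20) {
              matches++;  // or write the record out / add it to an in-memory index
            }
          }
        }
        reader.close();
        System.out.println("matching records: " + matches);
      }
    }

At the 50MB/s scan rate mentioned in another answer, a single pass over 5GB is on the order of a couple of minutes; if the same file is queried repeatedly, a one-time import into a database pays off instead.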