MapReduce realtime projects with code - mapreduce

I would like to look at the code of a large MapReduce job.
Please give me some idea of a real MapReduce project with real-time use cases.

From wikipedia:
MapReduce is useful in a wide range of applications, including
distributed pattern-based searching, distributed sorting, web
link-graph reversal, term-vector per host, web access log stats,
inverted index construction, document clustering, machine learning,[8]
and statistical machine translation. Moreover, the MapReduce model has
been adapted to several computing environments like multi-core and
many-core systems,[9][10][11] desktop grids,[12] volunteer computing
environments,[13] dynamic cloud environments,[14] and mobile
environments.[15]
At Google, MapReduce was used to completely regenerate Google's index
of the World Wide Web. It replaced the old ad hoc programs that
updated the index and ran the various analyses.[16]
MapReduce's stable inputs and outputs are usually stored in a
distributed file system. The transient data is usually stored on local
disk and fetched remotely by the reducers.
Follow the link and read the papers for source code.

Related

Distributed Tensorflow Training of Reinspect Human detection model

I am working with Distributed TensorFlow, specifically on an implementation of the Reinspect model from the following repository: https://github.com/Russell91/TensorBox .
We are using the between-graph asynchronous setup of Distributed TensorFlow, but the results are very surprising. While benchmarking, we found that distributed training takes more than twice as long as training on a single machine. Any leads on what could be happening and what else we could try would be really appreciated. Thanks
Note: There is a correction in the post: we are using the between-graph implementation, not the in-graph implementation. Sorry for the mistake.
In general, I wouldn't be surprised if moving from a single-process implementation of a model to a multi-machine implementation would lead to a slowdown. From your question, it's not obvious what might be going on, but here are a few general pointers:
If the model has a large number of parameters relative to the amount of computation (e.g. if it mostly performs large matrix multiplications rather than convolutions), then you may find that the network is the bottleneck. What is the bandwidth of your network connection?
Are there a large number of copies between processes, perhaps due to unfortunate device placement? Try collecting and visualizing a timeline to see what is going on when you run your model.
You mention that you are using "in-graph replication", which is not currently recommended for scalability. In-graph replication can create a bottleneck at the single master, especially when you have a large model graph with many replicas.
Are you using a single input pipeline across the replicas or multiple input pipelines? Using a single input pipeline would create a bottleneck at the process running the input pipeline. (However, with in-graph replication, running multiple input pipelines could also create a bottleneck as there would be one Python process driving the I/O with a large number of threads.)
Or are you using the feed mechanism? Feeding data is much slower when it has to cross process boundaries, as it would in a replicated setting. Using between-graph replication would at least remove the bottleneck at the single client process, but to get better performance you should use an input pipeline. (As Yaroslav observed, feeding and fetching large tensor values is slower in the distributed version because the data is transferred via RPC. In a single process these would use a simple memcpy() instead.)
How many processes are you using? What does the scaling curve look like? Is there an immediate slowdown when you switch to using a parameter server and single worker replica (compared to a single combined process)? Does the performance get better or worse as you add more replicas?
I was looking at a similar thing recently, and I noticed that moving data from gRPC into the Python runtime is slower than expected. In particular, consider the following pattern:
add_op = params.assign_add(update)
...
sess.run(add_op)
If add_op lives in a different process, then sess.run adds a decoding step that runs at a rate of 50-100 MB/second.
Here's a benchmark and relevant discussion

In-memory data structure

I have a distributed application. There is a set of processes, spread across multiple computers, communicating with each other. I have a data structure which is modified by these processes, and it is not stored in a database.
Now the question is: how do I maintain the same view of this data structure across all processes?
That is, at any point in time all processes should see the same data structure.
You say that you don't have a database. That's a shame, because database authors have solved your problem. You would need to incorporate the equivalent technology in your project. And obviously, the fastest and most simple way to incorporate the technology of databases is to incorporate a database.
Redis is designed to solve your problem. It is a key-value store for sharing between programs running on different machines but sharing the data. It is a server you run somewhere, and your programs all connect to this server using the client library it provides.
You can also use a database such as mysql but with in-memory tables.
If your data-structure does not fit into the key-value or relational models very well, you have the same kind of situation as multi-player games. It is non-trivial to sync multi-player games but it can be done and here is an excellent introduction as to how: gafferongames.com
I would recommend something like the Data Distribution Services platform for something like this (open source version is OpenDDS). Their key selling point is that it is designed to propagate changes to data to all interested in such changes. And performance isn't bad either.
Commercial implementations of this protocol are used in a variety of real-time systems, mostly military grade applications.
Another option to consider is a distributed cache (such as memcached). I've not played with this myself, but it looks quite straightforward to get up and running.

Has anyone used dataflow programming in a real project with a mainstream language?

I am looking at using some dataflow programming techniques in a Clojure program, but I am having difficulty finding much information from projects in Java, C#, or other mainstream languages that have used such techniques in the real world. I would be grateful to hear if anyone has any experiences they could share regarding this.
Here we are! We've made one (the quotation is from one of my older posts):
We've designed and implemented a DF
server for our automation project
(dispatcher, component interface, a
bunch of components, DF language, DF
compiler, UI). It is written in bare
C++, and runs on several Unix-like
systems (Linux x86, MIPS, avr32 etc.,
Mac OSX). It lacks several features,
e.g. sophisticated flow control,
complex thread control (there is only
a not too advanced component for it),
so it is just a prototype, even though
it works. We're now working on a
full-featured server. We've learnt a lot
while implementing and using the
prototype.
Also, we'll make a visual editor some
day.
There are dataflow systems which don't even mention the dataflow approach:
SynthEdit: http://www.synthedit.com/ - It's an audio related framework and component set for creating VST plugins
TinyOS: http://www.tinyos.net/ - It's an embedded operating system/framework
Digital synthesizers/samplers are dataflow systems, programmed (supposedly) in C, with some parts in Assembly. Check my answer to another post for some examples.
Quartz Composer, a graphic magic tool for Mac.
Blender has dataflow subsystem for image composing.
Writing a dataflow system is not rocket science. Here's my older post about the basics of dataflow framework.
The term dataflow is broad. There are realtime synchronous dataflow systems, like synthesizers and samplers; there are asynchronous ones, like our home automation system (the system is idle unless the user presses a button or a timer runs out); and there are even different architectures, like spreadsheets or make.
Want to read more about dataflow programming? Read J. Paul Morrison's site and book.
Pervasive DataRush is a framework for parallel dataflow programming for any JVM language, including Clojure.
Pervasive DataRush uses a dataflow architecture. The architecture implements a program that executes as a graph of computation nodes interconnected by dataflow queues. The nodes use the queues to share data. As the data is streaming, only data required by any active operation needs to be in memory at any given time, allowing very large data sets to be analyzed. Besides offering the potential for scaling to problems larger than available memory, dataflow graphs exploit multiple forms of parallelism.
Customers are using DataRush for big data analytics and data preparation (ETL).
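To make the "graph of computation nodes interconnected by dataflow queues" idea concrete, here is a minimal sketch in plain Python. It is not DataRush's API; the node functions (source, transform, sink), run_pipeline, and the queue_size parameter are all made up for illustration. Each node runs on its own thread and talks to its neighbours only through bounded queues, so only a handful of items are in memory at any time.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream (illustrative choice)


def source(out_q, items):
    """Node that emits each item downstream, then signals completion."""
    for item in items:
        out_q.put(item)
    out_q.put(_DONE)


def transform(in_q, out_q, fn):
    """Node that applies fn to every item it receives and forwards the result."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)
            return
        out_q.put(fn(item))


def sink(in_q, results):
    """Node that collects everything it receives into a list."""
    while True:
        item = in_q.get()
        if item is _DONE:
            return
        results.append(item)


def run_pipeline(items, fn, queue_size=4):
    """Wire source -> transform -> sink and run each node on its own thread."""
    # Bounded queues: a fast producer blocks until the consumer catches up.
    q1 = queue.Queue(maxsize=queue_size)
    q2 = queue.Queue(maxsize=queue_size)
    results = []
    nodes = [
        threading.Thread(target=source, args=(q1, items)),
        threading.Thread(target=transform, args=(q1, q2, fn)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    return results
```

Calling run_pipeline(range(5), lambda x: x * x) pushes five numbers through the graph. A real framework adds scheduling, fan-out/fan-in, and error handling on top of this basic shape.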
We've made another one: a collaborative spreadsheet with a MySQL/PHP backend and an AJAX frontend. The software is in beta; documentation is under construction.

What is the easiest to use distributed map reduce programming system?

What is the easiest to use distributed map reduce programming system?
For example, in a distributed datastore containing many users, each with many connections, say I wanted to count the total number of connections:
Map:
    for all records of type "user"
        do for each user
            count number of connections
            return connection_count_for_one_user
Reduce:
    reduce (connection_count_for_one_user)
        total_connections += connection_count_for_one_user
Is there any mapreduce system that lets me program in this way?
Well, I'll take a stab at making some suggestions, but your question isn't too clear.
So how are you storing your data? The storage mechanism is separate to how you apply MapReduce algorithms to the data. I'm going to assume you are using the Hadoop Distributed File System.
The problem you illustrate actually looks very similar to the typical Hadoop MapReduce word count example. Instead of counting words, you are counting connections per user.
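As a rough illustration of how the question's pseudocode maps onto code, here is a single-process sketch in plain Python. It does not use Hadoop at all; the record layout (a "type" field and a "connections" list), the sample data, and the function names are assumptions made up for this example.

```python
from functools import reduce

# Hypothetical sample records; a real job would read these from the datastore.
RECORDS = [
    {"type": "user", "name": "alice", "connections": ["bob", "carol"]},
    {"type": "user", "name": "bob", "connections": ["alice"]},
    {"type": "group", "name": "admins"},
    {"type": "user", "name": "carol", "connections": []},
]


def map_connections(record):
    """Map step: emit one connection count per record of type "user"."""
    if record.get("type") == "user":
        yield len(record.get("connections", []))


def reduce_counts(total, count):
    """Reduce step: fold one user's count into the running total."""
    return total + count


def total_connections(records):
    """Run the map step over every record, then reduce the emitted counts."""
    mapped = (count for record in records for count in map_connections(record))
    return reduce(reduce_counts, mapped, 0)
```

In a distributed framework the map function would run on many machines in parallel and the framework would deliver the emitted counts to the reducer, but the two functions you write keep roughly this shape.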
Some of the options you have for applying MapReduce to data stored on HDFS are:
Java framework - good if you are comfortable with Java.
Pig - a high-level scripting language.
Hive - a data warehousing solution for Hadoop that provides an SQL like interface.
Hadoop streaming - allows you to write mappers and reducers in pretty much any language.
Which is easiest?
Well, that all depends on what you feel comfortable with. If you know Java, take a look at the standard Java framework. If you are used to scripting languages, you could use Pig or streaming. If you know SQL, you could take a look at using Hive QL to query HDFS. I would take a look at the documentation for each as a starting point.

Is MapReduce right for me?

I am working on a project that deals with analyzing a very large amount of data. I discovered MapReduce fairly recently, and before I dive any further into it, I would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here; I am thinking of a 10-15 second limit. Assuming my data will be loaded into a distributed file system before I perform any analysis on it, what kind of performance can I expect from it?
Let's say I need to filter a simple 5GB XML file that is well formed, has a fairly flat data structure and 10,000,000 records in it. And let's say the output will result in 100,000 records. Is 10 seconds possible?
If it is, what kind of hardware am I looking at?
If not, why not?
I put the example down, but now wish that I hadn't. 5GB was just a sample that I was talking about; in reality I would be dealing with a lot more data. 5GB might be the data for one hour of the day, and I might want to identify all the records that meet certain criteria.
A database is really not an option for me. What I wanted to find out is the fastest performance I can expect from MapReduce. Is it always in minutes or hours? Is it never seconds?
MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the overhead of startup usually takes a couple of minutes alone. The idea here is to take a processing job that would take days and bring it down to the order of hours, or hours to minutes, etc. But you would not start a new job in response to a web request and expect it to finish in time to respond.
To touch on why this is the case, consider the way MapReduce works (general, high-level overview):
A bunch of nodes receive portions of the input data (called splits) and do some processing (the map step)
The intermediate data (output from the last step) is repartitioned such that data with like keys ends up together. This usually requires some data transfer between nodes.
The reduce nodes (which are not necessarily distinct from the mapper nodes - a single machine can do multiple jobs in succession) perform the reduce step.
Result data is collected and merged to produce the final output set.
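The four steps above can be sketched in a single Python process. This is only a toy model with no distribution and no fault tolerance; all function names here (split_input, map_phase, shuffle, reduce_phase, run_job) are invented for illustration and do not come from Hadoop. The shuffle step is the one that, on a real cluster, moves data between machines.

```python
from collections import defaultdict


def split_input(records, num_splits):
    """Step 1a: divide the input into splits, one per (simulated) node."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_phase(split, mapper):
    """Step 1b: run the mapper over one split, collecting (key, value) pairs."""
    pairs = []
    for record in split:
        pairs.extend(mapper(record))
    return pairs


def shuffle(mapped_outputs):
    """Step 2: repartition so that all values with the same key end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Steps 3 and 4: reduce each key's values and merge into one result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def run_job(records, mapper, reducer, num_splits=3):
    splits = split_input(records, num_splits)
    mapped = [map_phase(split, mapper) for split in splits]
    return reduce_phase(shuffle(mapped), reducer)


def word_mapper(line):
    """Example mapper: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    """Example reducer: sum the counts collected for one word."""
    return sum(values)
```

Calling run_job(["a b a", "b c", "a"], word_mapper, count_reducer) counts words across three simulated splits. Every one of these phases involves scheduling and I/O on a real cluster, which is where the per-job latency comes from.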
While Hadoop et al. try to keep data locality as high as possible, there is still a fair amount of shuffling around that occurs during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data such that web queries can be fast BECAUSE they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.
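A small sketch of that idea, in plain Python rather than on a cluster: the expensive pass over all documents happens once, offline, and the web request only does a dictionary lookup. The function names and the naive whitespace tokenization are assumptions for illustration; at scale the index build would be the MapReduce job.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Offline step: map each word to the sorted list of pages containing it.

    pages is a dict of {url: text}. Tokenization is a simple lowercase
    whitespace split, chosen only to keep the example short.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


def lookup(index, word):
    """Online step: answer a query with a single dictionary lookup."""
    return index.get(word.lower(), [])


if __name__ == "__main__":
    pages = {
        "a.html": "Hadoop runs MapReduce jobs",
        "b.html": "MapReduce builds the index",
    }
    index = build_inverted_index(pages)
    print(lookup(index, "MapReduce"))
```

The query path never touches the raw pages, which is why it can stay fast no matter how long the index build took.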
A distributed implementation of MapReduce such as Hadoop is not a good fit for processing a single 5GB XML file.
Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your XML is trivially flat, the splitting of the file will be non-deterministic, so you'll need a pre-processing step to format the file for splitting.
If you had many 5GB files, then you could use Hadoop to distribute the splitting. You could also use it to merge results across files and store the results in a format for fast querying for use by your web interface, as other answers have mentioned.
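For the single-machine case, a streaming parser lets you filter a large, flat XML file without loading it all into memory. Below is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse. The element names (records, record, id, status), the sample data, and the helper filter_records are assumptions for illustration, since the question does not show the real schema.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample with the assumed flat layout: <records><record>...</record></records>
SAMPLE_XML = (
    b"<records>"
    b"<record><id>1</id><status>ok</status></record>"
    b"<record><id>2</id><status>bad</status></record>"
    b"</records>"
)


def filter_records(source, tag, predicate):
    """Stream through the XML and yield matching records as dicts.

    source may be a file name or a binary file object. Each <tag> element
    is converted to {child_tag: child_text} once its end tag is seen.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            record = {child.tag: child.text for child in elem}
            if predicate(record):
                yield record
            # Drop the record's children so memory use stays small; the
            # emptied element itself remains attached to the root.
            elem.clear()


if __name__ == "__main__":
    for rec in filter_records(io.BytesIO(SAMPLE_XML), "record",
                              lambda r: r["status"] == "ok"):
        print(rec)
```

Whether this fits in 10 seconds for 5GB depends mostly on disk throughput and parser speed, so it is worth measuring on your own data before reaching for a cluster.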
MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually the job control, network, data replication, and fault tolerance features of a MapReduce framework make it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.
The MapReduce paradigm might be useful to you if your tasks can be split among independent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.
There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).
Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.
Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.
I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.
Note that XML is very verbose. One of your main costs will be reading from and writing to disk. If you can, choose a more compact format.
The number of nodes in your cluster will depend on type and number of disks and CPU. You can assume for a rough calculation that you will be limited by disk speed. If your 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need 1000 nodes.
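The back-of-the-envelope calculation above can be written down as a small function so you can plug in your own numbers. This is only the disk-bound estimate from the paragraph (one disk per node by default, decimal units, no allowance for startup, shuffle, or replication overhead); the function name and parameters are made up for this sketch.

```python
import math


def nodes_needed(data_gb, deadline_s, disk_mb_per_s, disks_per_node=1):
    """Estimate how many nodes are needed if scan speed is the only limit."""
    data_mb = data_gb * 1000  # decimal units, matching the rough estimate
    required_mb_per_s = data_mb / deadline_s
    per_node_mb_per_s = disk_mb_per_s * disks_per_node
    return math.ceil(required_mb_per_s / per_node_mb_per_s)


if __name__ == "__main__":
    # 500 GB in 10 s at 50 MB/s per disk, one disk per node
    print(nodes_needed(500, 10, 50))
```

With the numbers from the paragraph, 500 GB in 10 s requires 50,000 MB/s in aggregate, and 50,000 / 50 = 1000 nodes. The same formula gives 10 nodes for a single 5 GB file, before any framework overhead is counted.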
You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.
It sounds like what you might want is a good old-fashioned database. Not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either just import your 5GB file into a SQL database, or you could implement your own indexing scheme, by either storing records in different files, storing everything in memory in a giant hashtable, or whatever is appropriate for your needs.