MapReduce: How does it work?

What is MapReduce and how does it work?
I have tried reading some links but couldn't clearly understand the concept.
Can anyone please explain it in simple terms? Any help would be appreciated.

I am going to explain with an example.
Suppose you have weather (temperature) data for the last 100 years and you want to find the year-wise highest temperature. Suppose the total size of the data is 100 PB. How would you tackle this problem? We cannot process data of this size in a SQL database like Oracle or MySQL.
In Hadoop, there are mainly two parts:
Hadoop Distributed File System (HDFS)
Map-Reduce
HDFS is used to store data in a distributed environment, so HDFS will store your 100 PB of data across the cluster. It may be a 2-machine cluster or a 100-machine one. By default, the data is divided into 64 MB chunks and stored on different machines in the cluster.
Now we move on to processing the data. For processing data in a Hadoop cluster we have the Map-Reduce framework, in which we write the processing logic. We need to write Map-Reduce code to find the maximum temperature.
Structure of a Map-Reduce program (a simplified skeleton):
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class XYZ {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context) {
            // processing logic for the mapper
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) {
            // processing logic for the reducer
        }
    }
}
Whatever you write in the map() method is run by the data nodes in parallel, each on its own 64 MB chunk of data, and generates intermediate output.
The output of all the mapper instances is then shuffled and sorted and passed to the reduce() method as input.
The reducer generates the final output.
In our example, suppose Hadoop starts the three mappers below:
64 MB chunk of data -> mapper 1 -> (year, temperature): (1901,45), (1902,34), (1903,44)
64 MB chunk of data -> mapper 2 -> (year, temperature): (1901,55), (1902,24), (1904,44)
64 MB chunk of data -> mapper 3 -> (year, temperature): (1901,65), (1902,24), (1903,46)
The output of all mappers is passed to the reducer:
output of all mappers -> reducer -> (1901,65), (1902,34), (1903,46), (1904,44)
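To make this concrete, here is a minimal sketch of a reducer for the max-temperature example, using the org.apache.hadoop.mapreduce API (the class name MaxTempReducer is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each year, the reducer receives all temperatures emitted by the mappers
// (e.g. 1901 -> [45, 55, 65]) and writes only the maximum, e.g. (1901, 65).
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable temp : temps) {
            max = Math.max(max, temp.get());
        }
        context.write(year, new IntWritable(max));
    }
}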

A MapReduce job has a Mapper and a Reducer.
Map is a common functional programming tool which applies a single operation to multiple data items. For example, if we have the array
arr = [1,2,3,4,5]
and invoke
map(arr,*2)
it will multiply each element of the array by 2, so that the result would be:
[2,4,6,8,10]
Reduce is a bit counter-intuitive in my opinion, but it is not as complicated as one would expect.
Assume you have the mapped array above and would like to use a reducer on it. A reducer gets an array, a binary operator, and an initial element.
What it does is simple. Given the mapped array above, the binary operator '+', and the initial element '0', the reducer applies the operator again and again in the following order:
0 + 2 = 2
2 + 4 = 6
6 + 6 = 12
12 + 8 = 20
20 + 10 = 30.
It takes the last result and the next array element and applies the binary operator to them. In this case, we get the sum of the array.
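To illustrate the same idea in code, here is a small sketch using Java streams (just to show map followed by reduce, not Hadoop):

import java.util.Arrays;

public class MapReduceIdea {
    public static void main(String[] args) {
        int result = Arrays.stream(new int[]{1, 2, 3, 4, 5})
                           .map(x -> x * 2)           // map: [2, 4, 6, 8, 10]
                           .reduce(0, Integer::sum);  // reduce: 0+2+4+6+8+10 = 30
        System.out.println(result); // prints 30
    }
}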


Reducers for Hive data

I'm a novice. I'm curious to know how the number of reducers is chosen for different Hive data sets. Is it based on the size of the data processed, or is there a default number of reducers for everything?
For example, how many reducers does 5 GB of data require? Will the same number of reducers be used for a smaller data set?
Thanks in advance!! Cheers!
In open-source Hive (and likely EMR):
# reducers = (# bytes of input to mappers) / (hive.exec.reducers.bytes.per.reducer)
The default hive.exec.reducers.bytes.per.reducer is 1 GB, so 5 GB of mapper input gets roughly 5 reducers.
The number of reducers also depends on the size of the input file.
You can change it by setting the property hive.exec.reducers.bytes.per.reducer:
either in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000</value>
</property>
or using set:
hive -e "set hive.exec.reducers.bytes.per.reducer=100000"
In a MapReduce program, reducer input is assigned based on key, and the reduce method is called for each key with its grouped values. It is not dependent on the data size.
Suppose you are running a simple word-count program and the file size is 1 MB, but the mapper output contains 5 keys that go to the reduce phase; then there is a chance of getting 5 reducers to perform that task.
But suppose you have 5 GB of data and the mapper output contains only one key; then only one reducer will be assigned to process the data in the reduce phase.
The number of reducers in Hive is also controlled by the following configuration properties:
mapred.reduce.tasks
Default Value: -1
The default number of reduce tasks per job, typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. With this property set to -1, Hive will automatically figure out the number of reducers.
hive.exec.reducers.bytes.per.reducer
Default Value: 1000000000
The default is 1 GB, i.e. if the input size is 10 GB, 10 reducers will be used.
hive.exec.reducers.max
Default Value: 999
The maximum number of reducers that will be used. If the value specified in mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
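Putting these properties together, here is a rough sketch of the heuristic in plain Java (illustrative only, not Hive's actual source; the class and method names are made up):

// Rough, illustrative sketch of the reducer-count heuristic described above.
public class ReducerEstimate {
    static int estimateReducers(long inputBytes, long bytesPerReducer,
                                int maxReducers, int mapredReduceTasks) {
        if (mapredReduceTasks > 0) {
            return mapredReduceTasks;                                         // explicit setting wins
        }
        long reducers = (inputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        return (int) Math.min(Math.max(reducers, 1), maxReducers);            // clamp to [1, max]
    }

    public static void main(String[] args) {
        // 5 GB of mapper input with the 1 GB default -> 5 reducers
        System.out.println(estimateReducers(5_000_000_000L, 1_000_000_000L, 999, -1));
    }
}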
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
Source: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Please check the link below for more clarification about reducers:
Hadoop MapReduce: Clarification on number of reducers
hive.exec.reducers.bytes.per.reducer
Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later
Source: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

How to increase Mappers and Reducers in Apache Tez

I know this is a simple question, but I need some help from this community. When I create a partitioned table in ORC format and try to load data from a non-partitioned table pointing to a 2 GB file with 210 columns, I see that the number of mappers is 2 and the number of reducers is 2. Is there a way to increase the mappers and reducers? My assumption is that we can't set the number of mappers and reducers as in MR 1.0; it is based on settings like the YARN container size and the mapper minimum and maximum memory. Can anyone explain how Tez calculates mappers and reducers, and what the best memory settings are so that I don't run into Java heap space / Java out-of-memory problems? My file size may grow up to 100 GB. Please help me with this.
You can still set the number of mappers and reducers in Yarn. Have you tried that? If so, please get back here.
Yarn changes the underlying execution mechanism, but the number of mappers and reducers describes the job's requirements - not the way the job's resources are allocated (which is how Yarn and MRv1 differ).
Traditional Map/Reduce has a hard-coded number of map and reduce "slots". As you say, Yarn uses containers, which are per-application, so Yarn is more flexible. But the number of mappers and reducers is an input of the job in both cases, and in both cases the actual number of mappers and reducers may differ from the requested number. Typically the number of reducers is either:
(a) precisely the number that was requested, or
(b) exactly ONE reducer - if the job requires it, such as in total ordering.
For the memory settings, if you are using Hive with Tez, the following two settings will be of use to you:
1) hive.tez.container.size - this is the size of the YARN container that will be used (value in MB).
2) hive.tez.java.opts - this is for the Java opts that will be used for each task. If the container size is set to 1024 MB, set the Java opts to something like "-Xmx800m" rather than "-Xmx1024m". YARN kills processes that use more memory than the specified container size, and given that a Java process's memory footprint usually exceeds the specified Xmx value, setting Xmx to the same value as the container size usually leads to problems.

Mapper or Reducer, where to do more processing?

I have a 6 million line text file with lines up to 32,000 characters long, and I want to
measure the word-length frequencies.
The simplest method is for the Mapper to create a (word-length, 1) key-value pair for every word and let an 'aggregate' Reducer do the rest of the work.
Would it be more efficient to do some of the aggregation in the mapper, so that the key-value pair output would be (word-length, frequency_per_line)?
The output from the mapper would then be decreased by a factor of the average number of words per line.
I know there are many configuration factors involved. But is there a hard rule saying whether most of the work should be done by the Mapper or the Reducer?
The platform is AWS with a student account, limited to the following configuration.
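For reference, here is a minimal sketch (hypothetical class name, assuming TextInputFormat so each value is one line of text) of the per-line aggregation described above; an alternative with a similar effect is to keep the simple (word-length, 1) mapper and add a Combiner:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word-length, count-in-this-line) instead of (word-length, 1) per word,
// shrinking mapper output by roughly the average number of words per line.
public class WordLengthMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word.length(), 1, Integer::sum);
            }
        }
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            context.write(new IntWritable(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}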

Does the number of map tasks spawned depend on the number of job nodes?

The number of map() tasks spawned is equal to the number of 64 MB blocks of input data. Suppose we have 2 input files of 1 MB each; both files will be stored in a single block. But when I run my MR program with 1 namenode and 2 job nodes, I see 2 map() tasks spawned, one for each file. So is this because the system tried to split the job between the 2 nodes, i.e.,
Number of map() tasks spawned = number of 64 MB blocks of input data * number of job nodes?
Also, in the MapReduce tutorial, it is written that for a 10 TB file with a block size of 128 MB, 82,000 maps will be spawned. However, according to the logic that the number of maps depends only on the block size, 78,125 jobs should be spawned (10 TB / 128 MB). I don't understand how the few extra jobs were spawned. It would be great if anyone could share their thoughts on this. Thanks. :)
By default, one mapper is spawned per input file, and if the size of an input file is greater than the split size (which is normally kept the same as the block size), then the number of mappers for that file will be the ceiling of file size / split size.
Now say you have 5 input files and the split size is kept at 64 MB:
file1 - 10 MB
file2 - 30 MB
file3 - 50 MB
file4 - 100 MB
file5 - 1500 MB
number of mappers launched:
file1 - 1
file2 - 1
file3 - 1
file4 - 2
file5 - 24
total mappers - 29
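The same arithmetic as a tiny sketch in plain Java (illustrative only; the class name is made up):

// mappers per file = ceil(file size / split size), assuming split size = block size = 64 MB
public class MapperCountSketch {
    public static void main(String[] args) {
        long splitSizeMb = 64;
        long[] fileSizesMb = {10, 30, 50, 100, 1500};
        int totalMappers = 0;
        for (long sizeMb : fileSizesMb) {
            totalMappers += (int) Math.ceil((double) sizeMb / splitSizeMb);
        }
        System.out.println("total mappers = " + totalMappers); // prints 29 (1 + 1 + 1 + 2 + 24)
    }
}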
Additionally, the input split size and block size are not always honored. If an input file is gzip-compressed, it is not splittable, so if one of the gzip files is 1500 MB, it will not be split. It is better to use block compression with Snappy or LZO along with the sequence file format.
Also, the input split size is not used if the input is an HBase table. In the case of an HBase table, the only way to split is to maintain correct region sizes for the table. If the table is not properly distributed, manually split the table into multiple regions.
The number of mappers depends on just one thing: the number of InputSplits created by the InputFormat you are using (the default is TextInputFormat, which creates splits taking \n as the delimiter). It does not depend on the number of nodes, or the file, or the block size (64 MB or whatever). It's very good if the split is equal to the block, but this is just an ideal situation and cannot always be guaranteed. The MapReduce framework tries its best to optimise the process, and in doing so things like creating just 1 mapper for the entire file happen (if the file size is less than the block size). Another optimization could be to create fewer mappers than the number of splits. For example, if your file has 20 lines and you are using TextInputFormat, then you might think that you'll get 20 mappers (as the number of mappers = the number of splits, and TextInputFormat creates splits based on \n). But this does not happen; there would be unwanted overhead in creating 20 mappers for such a small file.
And if the size of a split is greater than the block size, the remaining data is moved in from the other, remote block on a different machine in order to get processed.
About the MapReduce tutorial:
If you have 10 TB of data, then
(10 * 1024 * 1024) MB / 128 MB = 81,920 mappers, which is almost 82,000.
Hope this clears some things up.

How are map and reduce workers in a cluster selected?

For a MapReduce job we need to specify the partitioning of the input data (the count of map processes, M) and the count of reduce processes (R). The MapReduce papers give an example of typical settings: a cluster with 2,000 workers and M = 200,000, R = 5,000. Workers are tagged as map-workers or reduce-workers. I wonder how these workers are selected in the cluster.
Is it done by choosing a fixed count of map-workers and a fixed count of reduce-workers (in which case data stored on reduce-worker nodes has to be sent to map-workers)?
Or does the map phase run on every node in the cluster, with some number of nodes then selected as reduce-workers?
Or is it done in another way?
Thanks for your answers.
The number of map workers (mappers) depends on the number of input splits of the input file.
So, for example: 200 input splits (they are logical) = 200 mappers.
How is the mapper node selected?
The mapper runs on the node where the data is local; if that is not possible, the data is transferred to a free node and the mapper is invoked on that node.
The number of reducers can be set by the user (Job.setNumReduceTasks(int)), or else it will be determined by the number of splits of the mappers' intermediate output.
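As a hedged sketch, explicitly requesting a reducer count with the org.apache.hadoop.mapreduce.Job API looks roughly like this (driver fragment; the class and job names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "example-job");
        job.setNumReduceTasks(5); // explicitly request 5 reduce tasks
        // ... set mapper/reducer classes and input/output paths, then job.waitForCompletion(true)
    }
}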
Answers to the other questions:
Q1> So can one node run, for example, 10 mappers in parallel at one time, or are these mappers processed sequentially?
Ans: sequentially (max number of (active/running) mappers = number of DataNodes)
Q2> How are the nodes where reducers are invoked chosen?
Ans:
Intermediate key-values are stored in the local file system, not in HDFS, and are then copied to the reducer node.
A single mapper will feed data to multiple reducers, so locality of data is out of the question, because the data for a particular reducer comes from many nodes, if not from all.
So a reducer is (or at least should be) selected based on the bandwidth of a node, keeping in mind all the points above.
Q3> If we need a reducer count bigger than the overall node count (for example 90 reducers in a 50-node cluster), are the reducers on one node processed in parallel or sequentially?
Ans: sequentially (max number of (active/running) reducers = number of DataNodes)