How are map and reduce workers in a cluster selected? - mapreduce

For a MapReduce job we need to specify the partitioning of the input data (the number of map processes, M) and the number of reduce processes (R). The MapReduce paper gives an example of typical settings: a cluster with 2,000 workers and M = 200,000, R = 5,000. Workers are tagged as map workers or reduce workers, and I wonder how these workers are selected in the cluster.
Is a fixed number of map workers and a fixed number of reduce workers chosen up front (so that data stored on reduce-worker nodes has to be sent to the map workers)?
Or does the map phase run on every node in the cluster, with some number of nodes then selected as reduce workers?
Or is it done in another way?
Thanks for your answers.

The number of map workers (mappers) depends on the number of input splits of the input file.
For example: 200 input splits (they are logical) = 200 mappers.
How is the mapper node selected?
The mapper runs on the node that holds the data locally; if that is not possible, the data is transferred to a free node and the mapper is invoked on that node.
The number of reducers can be set by the user (job.setNumReduceTasks(int) in the Java API); otherwise it follows the number of partitions of the mappers' intermediate output.
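A minimal sketch of setting the reducer count explicitly with the Hadoop (new) Java API; the class name and the value 5 are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        // Create a job and fix the number of reduce tasks explicitly.
        Job job = Job.getInstance(new Configuration(), "reducer-count-example");
        job.setNumReduceTasks(5); // illustrative value; omit it to let the framework decide
        // Mapper/reducer classes and input/output paths would be configured here
        // before submitting the job.
    }
}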
Answers to the other questions:
Q1> So can, for example, 10 mappers run in parallel on one node at a time, or are these mappers processed sequentially?
Ans: Sequentially (maximum number of active/running mappers = number of DataNodes).
Q2> How are the nodes where the reducers are invoked chosen?
Ans:
Intermediate key-value pairs are stored in the local file system, not in HDFS, and are then copied to the reducer node.
A single mapper feeds data to multiple reducers, so data locality is out of the question, because the data for a particular reducer comes from many nodes, if not from all of them.
So a reducer is (or at least should be) selected based on the bandwidth of a node, keeping all of the above points in mind.
Q3> If we need more reducers than the overall node count (for example 90 reducers in a 50-node cluster), are the reducers on one node processed in parallel or sequentially?
Ans: Sequentially (maximum number of active/running reducers = number of DataNodes).

Related

AWS Hadoop MapReduce - Word Count Average

Hi, I have a CSV data file as below.
bus,train,bus,TRAIN,car,bus,Train,CAr,car,Train,Cart,Bus,Bicycle,Bicycle,Car,Bus,Cart,Cart,Bicycle,Threewheel
I need to count the average word count in the above CSV using MapReduce.
E.g.: Bus = 5/20 = 0.25
I can get the word count easily, but I need the total number of records (20 in this case) to compute the word count average. Passing that to the reduce function using a global variable did not work out. I also tried to pass it as a key-value pair in the map (key = "Total", value = total count) to the reducer input, but that was not successful either.
Any suggestions on how to pass this total count from the map function to the reduce function?
I used one master and 3 slaves in an EMR cluster, if that is a needed piece of information.
Thank you in advance!
Once you have the pairs (K, V), where K is the word and V the number of times it appears, you can map everything to a single key, say (W, (K, V)). Now you can reduce to obtain a total word count. You can then run another map/reduce step to join the old keys with the new total.
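A minimal sketch of that single-key pass, assuming the word-count job has already written tab-separated (word, count) lines to HDFS; the class and key names here are illustrative, not from the original post:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TotalCount {

    // Input lines look like "bus<TAB>5" (output of the word-count job);
    // every count is re-emitted under one constant key.
    public static class ToSingleKeyMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text TOTAL = new Text("TOTAL");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            context.write(TOTAL, new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }

    // Only the "TOTAL" group arrives here, so its sum is the total number of records.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(key, new IntWritable(total)); // e.g. TOTAL 20
        }
    }
}

A final step (for example, a map-only job that reads the per-word counts and picks up the TOTAL value via the distributed cache) can then divide each word's count by the total to get the averages.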
Hope it helps.

Redshift -- Query Performance Issues

SELECT
a.id,
b.url as codingurl
FROM fact_A a
INNER JOIN dim_B b
ON strpos(a.url, b.url) > 0
Record count in fact_A: 2 million
Record count in dim_B: 1,500
Time taken to execute: 10 minutes
Number of nodes: 2
Could someone help me understand why the above query takes so long to execute?
We have declared the distribution key on fact_A to distribute the records evenly across both nodes, and a sort key is created on URL in fact_A.
The dim_B table is created with distribution style ALL.
Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in the filter) results in a full table scan, executing strpos roughly 3 billion times (2 million rows x 1,500 rows).
Depending on which URLs are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3, then you can extract "part1/part2/part3" as a separate column in both fact_A and dim_B and make it the distribution and sort key.
You can also rely on the parallelism of Redshift. If you resize your cluster from 2 nodes to 20 nodes, you should see an immediate performance improvement of roughly 8-10 times, as this kind of query can be executed by each node in parallel (for the most part).

Reducers for Hive data

I'm a novice. I'm curious to know how the number of reducers is determined for different Hive data sets. Is it based on the size of the data processed? Or is there a default number of reducers for everything?
For example, how many reducers does 5 GB of data require? Will the same number of reducers be used for a smaller data set?
Thanks in advance! Cheers!
In open-source Hive (and likely EMR):
# reducers = (# bytes of input to mappers) / (hive.exec.reducers.bytes.per.reducer)
The default hive.exec.reducers.bytes.per.reducer is 1 GB.
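Applying that formula to the 5 GB example in the question: 5 GB / 1 GB per reducer gives roughly 5 reducers, while an input well under 1 GB would end up with a single reducer.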
The number of reducers also depends on the size of the input file.
You can change that by setting the property hive.exec.reducers.bytes.per.reducer:
either in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1000000</value>
</property>
or by using set:
hive -e "set hive.exec.reducers.bytes.per.reducer=100000"
In a MapReduce program, reducer input is grouped by key, and the reduce method is called once for each key in the grouped input, so it is not dependent on data size.
For example, if you run a simple word count program on a 1 MB file and the mapper output contains 5 keys going to the reduce phase, then up to 5 reducers may be used to perform that task.
But if you have 5 GB of data and the mapper output contains only one key, then only one reducer will be assigned to process the data in the reduce phase.
The number of reducers in Hive is also controlled by the following configuration properties:
mapred.reduce.tasks
Default value: -1
The default number of reduce tasks per job. Typically set to a prime number close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what the number of reducers should be.
hive.exec.reducers.bytes.per.reducer
Default value: 1000000000
The default is 1 GB, i.e. if the input size is 10 GB, Hive will use 10 reducers.
hive.exec.reducers.max
Default value: 999
The maximum number of reducers that will be used. If the value specified in the configuration parameter mapred.reduce.tasks is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
Source: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
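As a quick worked example of that guideline (the numbers are purely illustrative): with 10 nodes and mapred.tasktracker.reduce.tasks.maximum = 2, the 0.95 factor suggests about 0.95 * 10 * 2 = 19 reduces, and the 1.75 factor about 35.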
Please check the link below for more clarification about reducers.
Hadoop MapReduce: Clarification on number of reducers
hive.exec.reducers.bytes.per.reducer
Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later
Source: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

MapReduce. How does it work?

What is MapReduce and how does it work?
I have tried reading some links but couldn't clearly understand the concept.
Can anyone please explain it in simple terms? Any help would be appreciated.
I am going to explain with an example.
Suppose you have weather temperature data for the last 100 years and you want to know the year-wise highest temperature. Suppose the total size of the data is 100 PB. How will you tackle this problem? We cannot process data of that size in an SQL database like Oracle or MySQL.
In Hadoop there are mainly two terms:
Hadoop Distributed File System (HDFS)
Map-Reduce
HDFS is used to store data in a distributed environment, so HDFS will store your 100 PB of data across the cluster. It may be a 2-machine cluster or a 100-machine cluster. By default your data is divided into 64 MB chunks and stored on different machines in the cluster.
Structure of the Map-Reduce code (just for understanding, the syntax is not exact; a fuller sketch follows the example output below):
class XYZ {
    static class map {
        void map() {
            // processing logic for the mapper
        }
    }
    static class Reduce {
        void reduce() {
            // processing logic for the reducer
        }
    }
}
Whatever you write in the map() method will be run by all data nodes in parallel, each on a 64 MB chunk of the data, generating intermediate output.
The output of all the mapper instances is then shuffled and sorted, and passed to the reduce() method as input.
The reducer generates the final output.
In our example, suppose Hadoop starts the 3 mappers below:
64 MB chunk of data -> mapper 1 -> (year, temperature): (1901,45),(1902,34),(1903,44)
64 MB chunk of data -> mapper 2 -> (year, temperature): (1901,55),(1902,24),(1904,44)
64 MB chunk of data -> mapper 3 -> (year, temperature): (1901,65),(1902,24),(1903,46)
The output of all mappers is passed to the reducer.
output of all mappers -> reducer -> (1901,65),(1902,34),(1903,46),(1904,44)
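To make the skeleton above concrete, here is a minimal sketch of the same max-temperature job written against the real Hadoop (new) MapReduce API. It assumes each input line looks like "1901,45" (year,temperature); the class names are illustrative, not part of the original answer:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Runs once per input line; emits (year, temperature).
    public static class MaxTempMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");
            context.write(new Text(parts[0]),
                          new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }

    // Receives all temperatures for one year (after shuffle/sort) and emits the maximum.
    public static class MaxTempReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(year, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework takes care of splitting the input, scheduling one map task per split, shuffling and sorting the (year, temperature) pairs, and calling reduce() once per year.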
A MapReduce job has a Mapper and a Reducer.
Map is a common functional-programming tool that applies a single operation to multiple pieces of data. For example, if we have the array
arr = [1,2,3,4,5]
and invoke
map(arr,*2)
it will multiply each element of the array, such that the result would be:
[2,4,6,8,10]
Reduce is a bit counter-intuitive in my opinion, but it is not as complicated as one would expect.
Assume you have the mapped array above and would like to use a reducer on it. A reducer takes an array, a binary operator, and an initial element.
What it does is simple. Given the mapped array above, the binary operator '+', and the initial element '0', the reducer applies the operator again and again in the following order:
0 + 2 = 2
2 + 4 = 6
6 + 6 = 12
12 + 8 = 20
20 + 10 = 30.
It takes the last result and the next array element and applies the binary operator to them. In the case shown, we get the sum of the array.
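The same map-then-reduce idea as a minimal sketch in plain Java streams (nothing Hadoop-specific; the class name is illustrative):

import java.util.Arrays;
import java.util.List;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<Integer> arr = Arrays.asList(1, 2, 3, 4, 5);
        int sum = arr.stream()
                     .map(x -> x * 2)           // map: [2, 4, 6, 8, 10]
                     .reduce(0, Integer::sum);  // reduce with '+' and initial element 0
        System.out.println(sum);                // prints 30
    }
}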

Why does a Solandra 2-node cluster perform worse at indexing than a single-node cluster?

In indexing tests using Solrj, 2 Solandra nodes perform worse than 1.
Each node runs with -Xms1G -Xmx12G.
Single index; the index is ~10M docs; each doc is ~4KB in size with a unique id. I built the index up to about 6M docs on a single node, then added a new node to the ring and used "move" to assign new tokens and balance it.
I am using all Solandra default configs, for example: solandra.maximum.docs.per.shard = 1048576, solandra.index.id.reserve.size = 16384, solandra.shards.at.once = 4.
Nodetool ring shows:
node1  Up  Normal  35.11 GB  50.00%  0
node2  Up  Normal  54.5 GB   50.00%  85070591730234615865843651857942052864
Indexing performance:
single node: 166 docs/s
2 nodes (sending to a single node): 111 docs/s
2 nodes (sending to both in parallel): 55 docs/s (see note below)
(Note) I was sending batches of 100K documents (a batch being a list of SolrInputDocuments plus a commit on the whole list); when I switched to batches of 10K, performance improved somewhat, to 98 docs/s.
Some questions:
In general, for both indexing and searching, how can I get Solandra to perform better with more than 1 node?
Why does indexing performance degrade with 2 nodes versus 1? When should I expect a performance improvement?
What is the recommended way to index documents with Solandra: sending to a single node in the ring or to multiple nodes?
What is the recommended way to query with Solandra: sending query requests to a single node or to multiple nodes?
Sending all query requests to a single node in a 2-node cluster performs roughly the same as a single-node cluster; any ideas why?