Reducers for Hive data - mapreduce

I'm a novice. I'm curious to know how reducers are set to different hive data sets. Is it based on the size of the data processed? Or a default set of reducers for all?
For example, 5GB of data requires how many reducers? will the same number of reducers set to smaller data set?
Thanks in advance!! Cheers!

In open source hive (and EMR likely)
# reducers = (# bytes of input to mappers)
/ (hive.exec.reducers.bytes.per.reducer)
default hive.exec.reducers.bytes.per.reducer is 1G.
Number of reducers depends also on size of the input file
You could change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
hive.exec.reducers.bytes.per.reducer 1000000
or using set
hive -e "set hive.exec.reducers.bytes.per.reducer=100000

In a MapReduce program, reducer is gets assigned based on key in the reducer input.Hence the reduce method is called for each pair in the grouped inputs.It is not dependent of data size.
Suppose if you are going a simple word count program and file size is 1 MB but mapper output contains 5 key which is going to reducer for reducing then there is a chance to get 5 reducer to perform that task.
But suppose if you have 5GB data and mapper output contains only one key then only one reducer will get assigned to process the data into reducer phase.
Number of reducer in hive is also controlled by following configuration:
mapred.reduce.tasks
Default Value: -1
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
hive.exec.reducers.bytes.per.reducer
Default Value: 1000000000
The default is 1G, i.e if the input size is 10G, it will use 10 reducers.
hive.exec.reducers.max
Default Value: 999
Max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is negative, hive will use this one as the max number of reducers when automatically determine number of reducers.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transfering map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
Source: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Please check below link to get more clarification about reducer.
Hadoop MapReduce: Clarification on number of reducers

hive.exec.reducers.bytes.per.reducer
Default Value: 1,000,000,000 prior to Hive 0.14.0; 256 MB (256,000,000) in Hive 0.14.0 and later
Source: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Related

Can anyone please explain what does mapreduce.job.reduces =-1 means

Need to understand the purpose of mapreduce.job.reduces = -1. I understand the attribute mapreduce.job.reduces reduces the file output to the configured value, but what does -1 means.
Quoting Hive's documentation:
mapred.reduce.tasks <-- (In YARN it is mapreduce.job.reduces)
Default Value: -1
Added In: Hive 0.1.0
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
Setting the number of reducers is much more than setting the number of output files. It somehow defines the level of parallelism, i.e., how many reduce tasks will run in parallel. If using 1 reduce task, no parallelism is achieved. If 2 reduce tasks are used, ideally, you want to cut the workload (and execution time) of each reduce task to half. The same holds for the number of mappers, but this is trickier to set.

How to increase Mappers and Reducer in Apache TEZ

I know this simple question, I need some help on this query from this community, When I create PartitionTable with ORC format, When I try to dump data from non partition table which is pointing to 2 GB File with 210 columns, I see Number of Mapper are 2 and reducer are 2 . is there a way to increase Mapper and reducer. My assumption is we cant set number of Mapper and reducer like MR 1.0, It is based on Settings like Yarn container size, Mapper minimum memory and maximum memory . can any one suggest me TEz Calculates mappers and reducers. What is best value to keep memory size setting, so that i dont come across : Java heap space, Java Out of Memory problem. My file size may grow upto 100GB. Please help me on this.
You can still set the number of mappers and reducers in Yarn. Have you tried that? If so, please get back here.
Yarn changes the underlying execution mechanism, but #mappers and #reducers is describing the Job requirements - not the way the job resources are allocated (which is how yarn and mrv1 differ).
Traditional Map/Reduce has a hard coded number of map and reduce "slot". As you say - Yarn uses containers - which are per-application. Yarn is thus more flexible. But the #mappers and #reducers are inputs of the job in both cases. And also in both cases the actual number of mappers and reducers may differ from the requested number. Typically the #reducers would either be
(a) precisely the number that was requested
(b) exactly ONE reducer - that is if the job required it such as in total ordering
For the memory settings, if you are using hive with tez, the following 2 settings will be of use to you:
1) hive.tez.container.size - this is the size of the Yarn Container that will be used ( value in MB ).
2) hive.tez.java.opts - this is for the java opts that will be used for each task. If container size is set to 1024 MB, set java opts to say something like "-Xmx800m" and not "-Xmx1024m". YARN kills processes that use more memory than specified container size and given that a java process's memory footprint usually can exceed the specified Xmx value, setting Xmx to be the same value as the container size usually leads to problems.

What factors decide the number of executors in a stand alone mode?

Given a Spark application
What factors decide the number of executors in a stand alone mode? In the Mesos and YARN according to this documents, we can specify the number of executers/cores and memory.
Once a number of executors are started. Does Spark start the tasks in a round robin fashion or is it smart enough to see if some of the executors are idle/busy and then schedule the tasks accordingly.
Also, how does Spark decide on the number of tasks? I did write a simple max temperature program with small dataset and Spark spawned two tasks in a single executor. This is in the Spark stand alone mode.
Answering your questions:
The standalone mode uses the same configuration variable as Mesos and Yarn modes to set the number of executors. The variable spark.cores.max defines the maximun number of cores used in the spark Context. The default value is infinity so Spark will use all the cores in the cluster. The spark.task.cpus variable defines how many CPUs Spark will allocate for a single task, the default value is 1. With these two variables you can define the maximun number of parallel tasks in your cluster.
When you create an RDD subClass you can define in which machines to run your task. This is defined in the getPreferredLocations method. But as the method signatures suggest this is only a preference so if Spark detects that one machine is not busy, it will launch the task in this idle machine. However I don't know the mechanism used by Spark to know what machines are idle. To achieve locality, we (Stratio) decided to make each Partions smaller so the task takes less time and achieve locality.
The number of tasks of each Spark's operation is defined according to the length of the RDD's partitions. This vector is the result of the getPartitions method that you have to override if you want to develop a new RDD subClass. This method returns how a RDD is split, where the information is and the partitions. When you join two or more RDDs using, for example, union or join operations, the number of tasks of the resulting RDD is the maximum number of tasks of the RDDs involved in the operation. For example: if you join RDD1 that has 100 tasks and RDD2 that has 1000 tasks, the next operation of the resulting RDD will have 1000 tasks. Note that a high number of partitions is not necessarily synonym of more data.
I hope this will help.
I agree with #jlopezmat about how Spark chooses its configuration. With respect to your test code, your are seeing two task due to the way textFile is implemented. From SparkContext.scala:
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
and if we check what is the value of defaultMinPartitions:
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Spark chooses the number of tasks based on the number of partitions in the original data set. If you are using HDFS as your data source, then the number of partitions with be equal to the number of HDFS blocks, by default. You can change the number of partitions in a number of different ways. The top two: as an extra argument to the SparkContext.textFile method; by calling the RDD.repartion method.
Answering some points that were not addressed in previous answers:
in Standalone mode, you need to play with --executor-cores and --max-executor-cores to set the number of executors that will be launched (granted that you have enough memory to fit that number if you specify --executor-memory)
Spark does not allocate task in a round-robin manner, it uses a mechanism called "Delay Scheduling", which is a pull-based technique allowing each executor to offer it's availability to the master, which will decide whether or not to send a task on it.

Mapper or Reducer, where to do more processing?

I have a 6 million line text file with lines up to 32,000 characters long, and I want to
measure the word-length frequencies.
The simplest method is for the Mapper to create a (word-length, 1) key-value pair for every word and let an 'aggregate' Reducer do the rest of the work.
Would it be more efficient to do some of the aggregation in the mapper? Where the key-value pair output would be (word-length, frequency_per_line).
The outputs from the mapper would be decreased by an factor of the average amount of words per line.
I know there are many configuration factors involved. But is there a hard rule saying whether most or the work should be done by the Mapper or the Reducer?
The platform is AWS with a student account, limited to the following configuration.

Sharing counter values between MapReduce mappers

I have a mapper that reads input and writes to a database. I want to limit how many inputs are actually converted and written to that database, and all mappers must contribute to the limit and then stop once that limit is reached (approximately; one or two extra isn't a big deal.)
I implemented a limiter function on our mapper that asks the other tasks, "How many records have you imported?" Once a given limit is reached, it will stop importing those records (although it will continue processing them for other purposes.)
the map code in question looks something like this:
public void map(ImmutableBytesWritable key, Result row, Context context) {
// prepare the input
// ...
if (context.getCounter(Metrics.IMPORTED).getValue()<IMPORT_LIMIT){
importRecord();
context.getCounter(Metrics.IMPORTED).increment(1l);
}
// do other things
// ...
}
So each mapper checks to see if there is more room to import, and only if the limit hasn't been reached does it perform any importing. However, each mapper itself is importing up to the limit, so that for 16 mappers, we get 16*IMPORT_LIMIT records imported. It's definitely doing SOME limiting (the count is much much lower than the normal number of imported records.)
When are counter values pushed to other mappers, or are they even available to each mapper? Can I actually get somewhat real-time values from the counter, or do they only update when a mapper is finished? Is there a better way to share a value between mappers?
Okay: from what I've seen, MapReduce doesn't share counters between mappers until the job is finished (ie. not at all.) I'm not sure if mappers that commit partway through will allow later mappers to see their counters, but it's not reliable enough to be done real time.
Instead what I'll do is I will run a simple java application that iterates over the rows on its own and write to a column, which the existing MapReduce job will use to determine if it should import the row or not.