What is the difference between the Partitioner phase and the Shuffle & Sort phase in MapReduce?

As I understand it, between mapping and reducing there is combining (if applicable), followed by partitioning, followed by shuffling.
While it seems clear that partitioning and shuffle & sort are distinct phases in map/reduce, I cannot differentiate their roles.
Together they must take the key/value pairs from many mappers (or combiners) and send them to the reducers, with all values sharing the same key sent to the same reducer. But I don't know what each of the two phases does.

Partitioning is the sub-phase executed just before the shuffle-sort sub-phase. But why is partitioning needed?
Each reducer takes data from several different mappers.
Hadoop must know that all records for a given key (say, Ayush) from every mapper are sent to the same reducer, or the job will return an incorrect result. The process of deciding which key is sent to which partition, and therefore to which reducer, is partitioning. The total number of partitions is equal to the total number of reducers.
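For concreteness, here is a minimal sketch of a custom partitioner whose modulo-hash logic mirrors Hadoop's default HashPartitioner (the class name and the Text/IntWritable type parameters are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner: hash the key, then map the hash
// onto the range [0, numPartitions). Every record with the same key
// (e.g. "Ayush") therefore lands in the same partition, and so reaches
// the same reducer.
public class NamePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A custom implementation like this would be registered with job.setPartitionerClass(NamePartitioner.class); without it, Hadoop falls back to HashPartitioner automatically.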
Shuffling is the process of moving the intermediate data produced by the partitioner to the reducer nodes. During this phase, there are merging and sorting sub-phases:
Merging - combines all key-value pairs that have the same key and returns (Key, List[Value]).
Sorting - takes the output of the merging step and sorts all key-value pairs by key. This step also returns (Key, List[Value]) output, but with the key-value pairs sorted by key.
The output of the shuffle-sort phase is sent directly to the reducers.
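From the reducer's side, that (Key, List[Value]) grouping is exactly what each reduce() call receives. A minimal sketch, assuming a word-count-style job with Text keys and IntWritable values (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// By the time reduce() runs, shuffle/merge/sort has already grouped all
// values that share a key into a single Iterable.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}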

Related

Which is better performance-wise in Informatica PowerCenter: a Sorter transformation, or adding sorted ports on the Source Qualifier?

I have a mapping in Informatica PowerCenter which combines data from two sources. One source has around 22 million rows of data while the other has more than 389 million rows. Will performance be better if I add a Sorter transformation, or if I increase the number of sorted ports in the Source Qualifier?
Also, what factors make one approach better than the other (Sorter transformation vs. sorted ports in the SQ)?
If both tables are from the same DB, without a doubt: sort in the SQ using the number of sorted ports.
The Informatica Sorter brings the whole data set onto the Informatica server and then sorts it, so sorting 300M+ resulting rows is going to take a lot of time and resources.
Joining the 389M and 22M tables in the source and sorting the result in the source itself will take less time and fewer resources, because Informatica doesn't have to bring any data onto its server.
If they are from different databases, sorting them in the Source Qualifier will still give a performance boost for the join. You have to join them using a Joiner transformation to get the whole data set, and I think the data order will stay the same if your sort key is the same as the join key, so you will not have to sort again with a Sorter. The issue is that joining both will take time.

What will be the input to the reducer without a combine phase in MapReduce?

I am reading a tutorial on MapReduce with combiners:
http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm
The reducer receives the following input from the combiner:
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
My question is: what if I skip the combiner and let the mapper pass its output to the reducer without performing any grouping operation, letting it go through the shuffle and sort phase as usual?
What input will the reducer receive after the map phase is over and after going through the shuffling and sorting phase?
Can I check what input the reducer receives?
I would say that the output you're looking at from that tutorial is perhaps a bit wrong. Since it's reusing the code from the reducer as the combine stage, the output from the combiner would actually look like:
<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
In this example you can absolutely skip the combine and the final result will be the same. In a scenario where you have multiple mappers and reducers, the combine would just be doing some local aggregation on the output from the mappers, with the reduce doing the final aggregation.
If you run without the combine, you are still going to get key-based groupings at the reduce stage; the combine just does some local aggregation for you on the map output.
The input to the reduce will simply be the output written by the mappers, grouped by key.
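One way to check the reducer's input for yourself is to log the grouped values inside reduce(). A rough sketch (illustrative names; the expected output in the comments assumes the tutorial's word-count data):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DebugWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder seen = new StringBuilder();
        int sum = 0;
        for (IntWritable v : values) {
            seen.append(v.get()).append(' ');
            sum += v.get();
        }
        // Without a combiner this logs e.g. "What -> 1 1 1"; with the
        // reducer reused as a combiner it would log "What -> 3".
        System.err.println(key + " -> " + seen);  // shows up in the task's stderr log
        context.write(key, new IntWritable(sum));
    }
}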

How to override shuffle/sort in map/reduce, or else, how can I get the sorted list in map/reduce from the last element to the partitioner?

Assuming only one reducer:
My scenario is to get the list of the top N scorers in the university. The data is in (name, score) format. The map/reduce framework, by default, sorts the data in ascending order, but I want the list in descending order, or at least to be able to access the sorted list from the end, which would make my work much easier: instead of sending a lot of data to the reducer, I could restrict the data to a limit.
(I want to override the predefined shuffle/sort.)
I guess a Combiner is what you want. It runs along with the mappers and typically does what a reducer does, but on a single mapper's output data. Generally the combiner class is set to the same class as the reducer. In your case you can sort and pick the top-K elements in each mapper and send only those out (sketched below, after the links).
So instead of sending all your map output records, you will be sending at most K * (number of mappers) records to the reducer.
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.
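As a rough illustration of the "sort and pick top-K in each mapper" idea, here is a minimal framework-free sketch (all names are hypothetical; in a real Hadoop job this logic would live in the mapper or combiner):

import java.util.Map;
import java.util.TreeMap;

public class TopKSketch {
    // Returns at most k (score -> name) entries. A TreeMap keyed by score
    // keeps them in ascending order; note it silently overwrites tied
    // scores, so a production version needs a tie-tolerant structure.
    static TreeMap<Integer, String> topK(Map<String, Integer> scores, int k) {
        TreeMap<Integer, String> best = new TreeMap<>();
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            best.put(e.getValue(), e.getKey());
            if (best.size() > k) {
                best.remove(best.firstKey()); // evict the current lowest score
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = Map.of("alice", 90, "bob", 72, "carol", 88, "dave", 95);
        System.out.println(topK(scores, 2).descendingMap()); // {95=dave, 90=alice}
    }
}

With this in place, each mapper ships at most K records, so the single reducer only has to merge (number of mappers) * K candidates.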

Mapper or Reducer, where to do more processing?

I have a 6 million line text file with lines up to 32,000 characters long, and I want to measure the word-length frequencies.
The simplest method is for the mapper to create a (word-length, 1) key-value pair for every word and let an 'aggregate' reducer do the rest of the work.
Would it be more efficient to do some of the aggregation in the mapper, so that the key-value pair output is (word-length, frequency_per_line)? (A sketch of this appears below.)
The output from the mapper would then be decreased by a factor of the average number of words per line.
I know there are many configuration factors involved, but is there a hard rule saying whether most of the work should be done by the mapper or the reducer?
The platform is AWS with a student account, limited to the following configuration.
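A minimal sketch of the mapper-side aggregation described above, emitting (word-length, frequency_per_line) instead of (word-length, 1) per word (the class name and whitespace tokenization are illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordLengthMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Count word lengths locally within the line first...
        Map<Integer, Integer> counts = new HashMap<>();
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word.length(), 1, Integer::sum);
            }
        }
        // ...then emit one record per distinct length, not one per word.
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            context.write(new IntWritable(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

A combiner (set to the same summing logic as the reducer) achieves a similar reduction without custom mapper code, so the two approaches are worth benchmarking against each other.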

Getting the partition number in a reducer to which a key-value pair belongs

When I am processing a given key-{set of values} pair in the reducer function, how can I get the number of the partition to which this key-{set of values} belongs? Is it possible to get this partition number without adding extra information about the partition number to each key-value pair during partitioning?
This has worked for me:
jobconf.getInt("mapred.task.partition", 0);
in the reducer.
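With the newer mapreduce API, the same lookup would look roughly like this (type parameters are illustrative; note that recent Hadoop versions rename the property to mapreduce.task.partition):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionAwareReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int partition;

    @Override
    protected void setup(Context context) {
        // Read the partition number once, before any reduce() calls.
        partition = context.getConfiguration().getInt("mapred.task.partition", 0);
        // Alternative: for a reduce task, the task id equals its partition number.
        // partition = context.getTaskAttemptID().getTaskID().getId();
    }
}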