How to override shuffle/sort in map/reduce, or else, how can I get the sorted list in map/reduce from the last element to the partitioner - mapreduce

Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in key/value format. The Map/Reduce framework, by default, sorts the data in ascending order. But I want the list in descending order, or at least to be able to access the sorted list from the end; that would make my work much easier. Instead of sending a lot of data to the reducer, I could restrict the data to a limit.
(I want to override the predefined Shuffle/Sort.)
Thanks & Regards
Ashwanth

I guess Combiners are what you want. They run along with the Mappers and typically do what a reducer does, but on a single mapper's output data. Generally the combiner class is set to the same class as the reducer. In your case you can sort and pick the top-K elements in each mapper and send only those out.
So instead of sending all your map output records, you will be sending at most K * (number of mappers) records to the reducer.
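For concreteness, here is a rough sketch (my own, not from the linked WordCount page) of the mapper-side top-K idea; the tab-separated name/score input layout and K = 10 are illustrative assumptions.

    // Mapper that keeps only its local top-K scores and emits them in cleanup().
    // Assumptions: input lines look like "name<TAB>score"; ties on score overwrite
    // each other in this simplified version.
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TopKMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

        private static final int K = 10;                      // hypothetical cut-off
        // TreeMap keeps entries ordered by score, smallest first.
        private final TreeMap<Integer, String> topK = new TreeMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String[] parts = value.toString().split("\t");
            if (parts.length != 2) {
                return;                                       // skip malformed lines
            }
            topK.put(Integer.parseInt(parts[1].trim()), parts[0]);
            if (topK.size() > K) {
                topK.remove(topK.firstKey());                 // drop the lowest score
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit only this mapper's K best records.
            for (Map.Entry<Integer, String> e : topK.entrySet()) {
                context.write(new IntWritable(e.getKey()), new Text(e.getValue()));
            }
        }
    }

The single reducer then only has to pick the overall top K from these pre-filtered records.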
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.

Related

Which one performs better in Informatica Powercenter: using a Sorter transformation, or adding to the number of sorted ports on the Source Qualifier?

I have a mapping in Informatica Powercenter which combines data from two sources. One source has around 22 million rows of data while the other has over 389 million rows. Will it perform better if I add a Sorter transformation, or is it better to add to the number of sorted ports in the Source Qualifier?
Also, what factors make one approach better than the other (Sorter transformation vs. adding sorted ports in the SQ)?
If both tables are from the same DB, without a doubt - sort in the SQ using the number of sorted ports.
The Informatica Sorter brings the whole data set onto the Informatica server and then sorts it, so sorting the roughly 300M resulting rows is going to take a lot of time and resources.
Joining the 389M and 22M tables in the source and sorting the result in the source itself will take less time and fewer resources, because Informatica doesn't have to bring any data onto its server.
If they are from different databases, then sorting them in the Source Qualifier will still give a performance boost for the join. You have to join them using a Joiner to get the whole data set, and the data order should stay the same if your sort key is the same as your join key, so you do not have to sort again using a Sorter. The issue is that joining both will take time.

What is the difference between Partitioner phase and Shuffle&Sort phase in MapReduce?

As I understand it, between mapping and reducing there is Combining (if applicable) followed by partitioning followed by shuffling.
While it seems clear that partitioning and shuffle&sort are distinct phases in map/reduce, I cannot differentiate their roles.
Together they must take the key/value pairs from many mappers (or combiners) and send them to reducers, with all values sharing the same key being sent to the same reducer. But I don't know what each of the two phases does.
Partitioning is the sub-phase executed just before the shuffle-sort sub-phase. But why is partitioning needed?
Each reducer takes data from several different mappers.
Hadoop must know that all records with a given key (say, "Ayush") from every mapper must be sent to the same particular reducer (or the task will return an incorrect result). The process of deciding which key goes to which partition, and therefore to which reducer, is the partitioning process. The total number of partitions is equal to the total number of reducers.
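As a hedged illustration (not part of the original answer), this is essentially what Hadoop's default HashPartitioner does; a custom partitioner just supplies a different getPartition function:

    // Minimal partitioner sketch: every mapper computes the same function,
    // so identical keys (e.g. "Ayush") always land in the same partition,
    // and therefore at the same reducer.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class NamePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // numReduceTasks equals the number of partitions / reducers.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }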
Shuffling is the process of moving the intermediate data provided by the partitioner to the reducer node. During this phase, there are sorting and merging subphases:
Merging - combines all key-value pairs which have the same key and returns (Key, List[Value]) pairs.
Sorting - takes the output of the Merging step and sorts all key-value pairs by key. This step also returns (Key, List[Value]) output, but with the pairs sorted by key.
Output of shuffle-sort phase is sent directly to reducers.
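To make the (Key, List[Value]) shape concrete, here is a toy, non-Hadoop illustration (my own) of what merging and sorting produce together:

    // Toy demo: individual (key, value) pairs become (key, list-of-values)
    // entries, ordered by key - the shape a reducer receives.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class ShuffleSortDemo {
        public static void main(String[] args) {
            String[][] mapOutput = { {"Hi", "1"}, {"Hello", "1"}, {"Hello", "1"} };

            // TreeMap keeps keys sorted, mirroring the sorted (Key, List[Value]) output.
            TreeMap<String, List<String>> grouped = new TreeMap<>();
            for (String[] pair : mapOutput) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
            System.out.println(grouped);   // {Hello=[1, 1], Hi=[1]}
        }
    }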

Is the reducer a bottleneck in the MR framework?

I want to understand what to do in the following case.
For example, I have 1TB of text data, and let's assume that 900GB of it is the word "Hello".
After each map operation, I will have a collection of key-value pairs of <"Hello", 1>.
But as I said, this is a huge collection, 900GB, and as I understand it, the reducer gets all of it and will crash.
My reducer has only 80GB of RAM.
Will the reducer really crash?
In other words, is the reducer the bottleneck of horizontal scaling?
Yes, all equal keys from all mappers get funneled into a single reducer.
It's not clear if you have 900GB of only one word, or a bunch of large text documents with many different words.
In the latter case, the string "Hello" really doesn't take up much space, and neither does a single integer.
The reducer will also get a long list of ones, sure, but if you re-use the reducer code as a Combiner, you can mitigate the memory issues by pre-aggregating the values for each input split.
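For reference, here is a minimal word-count-style sketch (following the standard Hadoop example, not code from the question) of that combiner re-use: the same IntSumReducer class is registered as both combiner and reducer, so <"Hello", 1> pairs are pre-summed on the map side and the reducer sees one partial count per input split instead of hundreds of millions of 1s.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);      // emits e.g. <"Hello", 1>
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // same code runs map-side...
            job.setReducerClass(IntSumReducer.class);    // ...and reduce-side
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }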

Map-reduce: work on multiple lines

I have a requirement where I need to work on multiple rows of input data: first sort the data, and then subtract one value from row one in the next row, and so on. Can we do this operation in map reduce somehow?
You can write a custom RecordReader that sends your desired number of records to each map task and perform the calculations there, as sketched below.
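As a rough sketch (my own; the class name and N = 3 lines per record are assumptions), a RecordReader like the following wraps Hadoop's LineRecordReader and hands each map() call a block of N consecutive lines:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class MultiLineRecordReader extends RecordReader<LongWritable, Text> {

        private static final int N = 3;                    // lines per record (assumption)
        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // Collect up to N lines; a block never spans an input-split boundary
            // in this simplified version, so the last block may be shorter.
            StringBuilder block = new StringBuilder();
            int linesRead = 0;
            while (linesRead < N && lineReader.nextKeyValue()) {
                if (linesRead == 0) {
                    key.set(lineReader.getCurrentKey().get());  // offset of first line
                } else {
                    block.append('\n');
                }
                block.append(lineReader.getCurrentValue().toString());
                linesRead++;
            }
            if (linesRead == 0) {
                return false;                              // end of split
            }
            value.set(block.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lineReader.getProgress(); }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }

To use it you would also add a small FileInputFormat subclass whose createRecordReader() returns this reader; the mapper can then sort the lines of each block and compute the differences.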

getting the partition number in a reducer to which a key-value pair belongs

When I am processing a given key-{set of values} pair in the reducer function, how can I get the partition number to which this key-{set of values} pair belongs? Is it possible to get this partition number without adding extra information about the partition number to each key-value pair during partitioning?
Cheers
This has worked for me:
jobconf.getInt("mapred.task.partition", 0);
in the reducer.
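For context, here is a small new-API sketch (my assumption of the equivalent setup, not from the original answer) that reads the same property from the reducer's Context; newer Hadoop versions also expose it under the name "mapreduce.task.partition":

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PartitionAwareReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private int partition;

        @Override
        protected void setup(Context context) {
            // 0 is the fallback if the property is missing (e.g. in a local unit test).
            partition = context.getConfiguration().getInt("mapred.task.partition", 0);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // The partition number is now available alongside every key this reducer sees.
            context.write(new Text(partition + "\t" + key.toString()), new IntWritable(sum));
        }
    }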