What will be the input to the reducer without a combine phase in MapReduce? - mapreduce

I am reading a tutorial on MapReduce with combiners:
http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm
The reducer receives the following input from the combiner:
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
My doubt is: what if I skip the combiner and let the mapper pass its output to the reducer without performing any grouping operation (i.e. without using a combiner), so that it only goes through the shuffle and sort phase?
What input will the reducer receive once the map phase is over and the data has gone through shuffling and sorting?
Can I check what input the reducer receives?

I would say that the output you're looking at from that tutorial is perhaps a bit wrong. Since it's reusing the reducer code as the combine stage, the output from the combiner would actually look like:
<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
In this example, you can absolutely skip the combiner and the final result will be the same. In a scenario with multiple mappers and reducers, the combiner would just be doing some local aggregation on each mapper's output, with the reducer doing the final aggregation.
If you run without the combiner, you will still get key-based grouping at the reduce stage; the combiner only does some local aggregation on the map output for you.
The input to the reduce will just be the output written by the mappers, grouped by key.
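To make that concrete: without a combiner, the word-count reducer receives the raw but still key-grouped pairs, e.g. <What, (1,1,1)> instead of <What, (3)>, and produces the same sums either way. Here is a minimal sketch of the standard Hadoop word-count reducer and the driver lines that make the combiner optional (class and path names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // With a combiner the iterable may hold pre-summed counts (e.g. [3]);
        // without one it holds the raw 1s (e.g. [1, 1, 1]). The total is the same.
        // To inspect what the reducer actually receives, you can log each value here.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// In the driver, the combine phase is just an optional extra step:
//   job.setReducerClass(SumReducer.class);
//   job.setCombinerClass(SumReducer.class);  // comment this line out to skip the combiner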

Related

What is the difference between Partitioner phase and Shuffle&Sort phase in MapReduce?

As I understand it, between mapping and reducing there is Combining (if applicable) followed by partitioning followed by shuffling.
While it seems clear that partitioning and shuffle&sort are distinct phases in map/reduce, I cannot differentiate their roles.
Together they must take the key/value pairs from many mappers (or combiners) and send them to reducers, with all values sharing the same key being sent to the same reducer. But I don't know what each of the two phases does.
Partitioning is the sub-phase executed just before the shuffle-sort sub-phase. But why is partitioning needed?
Each reducer takes data from several different mappers. Picture several mappers each emitting records keyed by a name such as Ayush:
Hadoop must know that all Ayush records from every mapper must be sent to one particular reducer (or the task will return an incorrect result). The process of deciding which key goes to which partition, and therefore to which reducer, is partitioning. The total number of partitions equals the total number of reducers.
Shuffling is the process of moving the intermediate data produced by the partitioner to the reducer nodes. During this phase there are sorting and merging sub-phases:
Merging - combines all key-value pairs which have the same key and returns (Key, List[Value]).
Sorting - takes the output of the merging step and sorts all key-value pairs by key. This step also returns (Key, List[Value]) output, but with the key-value pairs sorted.
Output of shuffle-sort phase is sent directly to reducers.
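To make the partitioning step concrete, here is a minimal sketch of a custom Partitioner; Hadoop's default HashPartitioner does essentially the same thing, and the Text/IntWritable types and class name are only illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer each map-output record goes to. All records with the
// same key (e.g. every "Ayush" record from every mapper) land in the same
// partition, and the number of partitions equals the number of reducers.
public class NamePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same scheme as Hadoop's default HashPartitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered in the driver with: job.setPartitionerClass(NamePartitioner.class);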

How to generate permutation of different words in mapreduce?

How to generate permutation of different words in mapreduce using Java
input:abc
output:abc,acb,bac,bca,cab,cba
If you know how to do it in plain Java, then your job is done: just put that code in the Mapper and write the output to the context.
In your case you will not need a reducer.
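A minimal sketch of such a map-only job's mapper, assuming one word per input line and emitting each permutation as a key with a NullWritable value (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PermutationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        permute("", value.toString().trim(), context);
    }

    // Standard recursive permutation: move each character of 'rest' onto the
    // end of 'prefix' and recurse; emit when nothing is left to place.
    private void permute(String prefix, String rest, Context context)
            throws IOException, InterruptedException {
        if (rest.isEmpty()) {
            context.write(new Text(prefix), NullWritable.get());
            return;
        }
        for (int i = 0; i < rest.length(); i++) {
            permute(prefix + rest.charAt(i),
                    rest.substring(0, i) + rest.substring(i + 1),
                    context);
        }
    }
}

For the input abc this emits abc, acb, bac, bca, cab, cba. Setting job.setNumReduceTasks(0) in the driver makes it a map-only job.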

How can I form a pair of inputs (coming from two different locations) before passing them to the map in a MapReduce job

I am new to the MapReduce world.
I have input files in two different locations. I want to pass them to my mapper in pairs after doing a merge sort. How can I do this?
For eg.
/folder1/file1.txt,file2.txt,file3.txt
/folder2/file1.txt,file2.txt,file3.txt
sample content of files:
folder1/file1.txt
"Key1": "value1"
folder2/file1.txt
"Key1": "value2"
After applying the merge sort, the input to my mapper should be:
"key1" : "value1,value2"
Please help me solve this problem.
It looks like you'll need two MapReduce jobs to get this to work properly. The first job should combine the files so that its output is in the form "key1" : "value1,value2". Then make a second job that does whatever you originally wanted to do, using the output of the first job as its input.
Alternatively - if possible - move the processing you want to do from the mapper to the reducer, and simply pass both files into the job. The reducer will process the values the same way regardless of which file they came from.
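For the two-job approach, here is a minimal sketch of the first job, assuming each input line looks like "Key1": "value1" and that both folders can simply be added as input paths (all class, folder and output-path names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeByKeyJob {

    // Emits (key, value) for every '"Key": "value"' line, whichever folder it came from.
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(":", 2);
            if (parts.length == 2) {
                context.write(new Text(parts[0].trim().replace("\"", "")),
                              new Text(parts[1].trim().replace("\"", "")));
            }
        }
    }

    // The shuffle groups by key, so the reducer sees all values for "key1" together
    // and can join them into "value1,value2".
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text v : values) {
                if (joined.length() > 0) joined.append(",");
                joined.append(v.toString());
            }
            context.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge by key");
        job.setJarByClass(MergeByKeyJob.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/folder1"));
        FileInputFormat.addInputPath(job, new Path("/folder2"));
        FileOutputFormat.setOutputPath(job, new Path("/merged"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The second job then reads /merged, where each line already has the form key1 "value1,value2".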

How to override shuffle/sort in map/reduce, or else, how can I get the sorted list in map/reduce from the last element to the partitioner

Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in format. The Map/Reduce framework by default sorts the data in ascending order, but I want the list in descending order, or at least to be able to access the sorted list from the end, which would make my work much easier. Instead of sending a lot of data to the reducer, I could then restrict the data to a limit.
(I want to override the predefined Shuffle/Sort)
Thanks & Regards
Ashwanth
I guess combiners are what you want. They run along with the mappers and typically do what a reducer does, but on a single mapper's output. Generally the combiner class is set to be the same as the reducer class. In your case you can sort and pick the top-K elements in each mapper and send only those on.
So instead of sending all your map output records, you will send at most K * (number of mappers) records to the reducer.
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.
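To make the top-K idea concrete, here is a minimal sketch of such a mapper, assuming the input lines look like name<TAB>score and K = 10; names and the tie handling (equal scores overwrite each other in this simple version) are illustrative:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only the top-K scores seen by this mapper and emits them in cleanup(),
// so at most K records per mapper reach the single reducer.
public class TopKMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int K = 10;                      // assumed limit
    private final TreeMap<Integer, String> topK = new TreeMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] fields = value.toString().split("\t");   // assumed "name<TAB>score" lines
        topK.put(Integer.parseInt(fields[1]), fields[0]);
        if (topK.size() > K) {
            topK.remove(topK.firstKey());                  // drop the current smallest score
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit from highest to lowest; the single reducer can apply the same trick
        // over the at most K * (number of mappers) records it receives.
        for (Map.Entry<Integer, String> e : topK.descendingMap().entrySet()) {
            context.write(new IntWritable(e.getKey()), new Text(e.getValue()));
        }
    }
}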

Sharing counter values between MapReduce mappers

I have a mapper that reads input and writes to a database. I want to limit how many inputs are actually converted and written to that database, and all mappers must contribute to the limit and then stop once that limit is reached (approximately; one or two extra isn't a big deal.)
I implemented a limiter function in our mapper that asks the other tasks, "How many records have you imported?" Once a given limit is reached, it stops importing records (although it continues processing them for other purposes).
The map code in question looks something like this:
public void map(ImmutableBytesWritable key, Result row, Context context) {
    // prepare the input
    // ...
    if (context.getCounter(Metrics.IMPORTED).getValue() < IMPORT_LIMIT) {
        importRecord();
        context.getCounter(Metrics.IMPORTED).increment(1L);
    }
    // do other things
    // ...
}
So each mapper checks whether there is more room to import, and only imports if the limit hasn't been reached. However, each mapper is importing up to the limit on its own, so with 16 mappers we get 16 * IMPORT_LIMIT records imported. It's definitely doing SOME limiting (the count is much, much lower than the normal number of imported records).
When are counter values pushed to other mappers, and are they even available to each mapper? Can I actually get somewhat real-time values from the counter, or do they only update when a mapper finishes? Is there a better way to share a value between mappers?
Okay: from what I've seen, MapReduce doesn't share counters between mappers until the job is finished (i.e. not at all during the run). I'm not sure whether mappers that commit partway through let later mappers see their counters, but it's not reliable enough to be used in real time.
Instead, what I'll do is run a simple Java application that iterates over the rows on its own and writes to a column, which the existing MapReduce job will then use to decide whether it should import each row.
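A minimal sketch of that standalone pre-pass, assuming the rows live in an HBase table (the ImmutableBytesWritable/Result types in the mapper suggest an HBase source) and that a boolean "import" flag column is what the job later checks; the table name, column family, qualifier and limit are all illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MarkRowsForImport {
    public static void main(String[] args) throws Exception {
        final long IMPORT_LIMIT = 10_000L;                // assumed global limit
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("records"));   // assumed table name
             ResultScanner scanner = table.getScanner(new Scan())) {
            long marked = 0;
            for (Result row : scanner) {
                if (marked >= IMPORT_LIMIT) {
                    break;
                }
                // Flag the row; the MapReduce job only imports rows carrying this flag,
                // so this single-threaded count enforces the global limit exactly.
                Put put = new Put(row.getRow());
                put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("import"), Bytes.toBytes(true));
                table.put(put);
                marked++;
            }
        }
    }
}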