How do I generate the permutations of a word in MapReduce using Java?
input: abc
output: abc, acb, bac, bca, cab, cba
If you already know how to do this in Java, your job is basically done: put that permutation code in the Mapper and write each result to the context.
In your case you will not need a reducer at all.
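For reference, here is a minimal sketch of what such a map-only job's mapper might look like, assuming one word per input line (the class name PermutationMapper is just for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only job: each input line is a word, the mapper emits all of its permutations.
    public class PermutationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            permute("", value.toString().trim(), context);
        }

        // Recursively build permutations by picking each remaining character in turn.
        private void permute(String prefix, String rest, Context context)
                throws IOException, InterruptedException {
            if (rest.isEmpty()) {
                context.write(new Text(prefix), NullWritable.get());
                return;
            }
            for (int i = 0; i < rest.length(); i++) {
                permute(prefix + rest.charAt(i),
                        rest.substring(0, i) + rest.substring(i + 1),
                        context);
            }
        }
    }

In the driver you would call job.setNumReduceTasks(0), so the mapper output is written straight to the output files.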
I am reading a tutorial about MapReduce combiners:
http://www.tutorialspoint.com/map_reduce/map_reduce_combiners.htm
The reducer receives the following input from the combiner:
<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
My doubt is: what if I skip the combiner and let the mapper pass its output to the reducer without any local grouping, so it only goes through the shuffle and sort phase?
What input will the reducer receive after the map phase is over, once the output has gone through shuffling and sorting?
Can I check what input the reducer actually receives?
I would say that the output you're looking at in that tutorial is a bit wrong. Since it reuses the reducer code as the combine stage, the output from the combiner would actually look like:
<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>
In this example, you can absolutely skip the combiner and the final result will be the same. In a scenario where you have multiple mappers and reducers, the combiner just does some local aggregation on each mapper's output, with the reducer doing the final aggregation.
If you run without the combiner, you are still going to get key-based groupings at the reduce stage. The combiner just does some local aggregation for you on the map output.
The input to the reduce will just be the output written by the mapper, but grouped by key.
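If you want to see this for yourself, one option is to log the grouped values inside the reducer and look at the task's stderr log. A rough word-count reducer sketch, assuming the usual Text/IntWritable types from that tutorial:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder seen = new StringBuilder();  // collects the incoming values for inspection
            int sum = 0;
            for (IntWritable v : values) {
                seen.append(v.get()).append(' ');
                sum += v.get();
            }
            // Without a combiner this would print e.g. "Java: 1 1 1"; with the reducer
            // reused as the combiner it would print the pre-aggregated "Java: 3".
            System.err.println(key + ": " + seen);
            context.write(key, new IntWritable(sum));
        }
    }

Setting job.setCombinerClass(WordCountReducer.class) in the driver is what turns the same class into the combine stage.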
I want to understand HBase MapReduce from an application point of view. I need some real use cases to better understand when writing these jobs is the efficient choice.
If there is any link to documentation or examples that explain real use cases, please share.
I can give an example based on my own use case. If you already store your data in HBase, you can write a plain Java program that scans a table, does something, and writes the output back to HBase or somewhere else. Or you can use MapReduce to do the same. The difference is that MapReduce runs where the data is, and the network is only used for the result data. We have hourly jobs that calculate the sum and average of KPIs; the input data is huge but the output is tiny. If I did not use MapReduce, I would need to move one hour of data, about 18 GB, over the network. The MapReduce output is only about 1 MB, and I can write it to HBase, a file, or somewhere else.
MapReduce also gives you parallel task execution, which you could build yourself in plain Java, but why would you :)
Keep in mind that YARN creates map tasks according to your HBase table's split (region) count. So if you need more map tasks, split your table.
If you already store your data in HDFS, you are lucky: a MapReduce job reading from HDFS is much faster than one reading from HBase. You can still write the MapReduce output to HBase if you want.
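As a rough illustration of that kind of job, here is a sketch of a scan-based MapReduce job over HBase using TableMapReduceUtil. The table name kpi_table, column family cf and qualifier kpi are made up for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

    public class KpiScanJob {

        // Map tasks run on the region servers holding the data, so only the small
        // (key, value) pairs travel over the network, not the raw rows.
        static class KpiMapper extends TableMapper<Text, LongWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result result, Context context)
                    throws IOException, InterruptedException {
                byte[] v = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("kpi"));
                if (v != null) {
                    context.write(new Text("kpi"), new LongWritable(Bytes.toLong(v)));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hourly kpi sum");
            job.setJarByClass(KpiScanJob.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // fetch bigger batches per RPC
            scan.setCacheBlocks(false);  // don't pollute the block cache during a full scan

            TableMapReduceUtil.initTableMapperJob("kpi_table", scan,
                    KpiMapper.class, Text.class, LongWritable.class, job);
            job.setReducerClass(LongSumReducer.class);  // ships with Hadoop, sums LongWritables
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }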
Please look into the use cases given:
1. here.
2. And a small reference here - 30. Joins.
3. Maybe an end-to-end example here.
In the end, it all depends on your understanding of each concept (MapReduce, HBase) and how you apply them to the needs of your project. The same task can be done with or without MapReduce. Happy coding.
I am new to the MapReduce world.
I have input files in two different locations. I want to pass them to my mapper in pairs after doing a merge sort. How can I do this?
For example:
/folder1/file1.txt,file2.txt,file3.txt
/folder2/file1.txt,file2.txt,file3.txt
Sample content of the files:
folder1/file1.txt
"Key1": "value1"
folder2/file1.txt
"Key1": "value2"
After applying merge sort, the input to my mapper should be:
"key1" : "value1,value2"
Please help me solve this problem.
It looks like you'll need two MapReduce jobs to get this to work properly. The first job should combine the files, so that its output is in the form "key1" : "value1,value2". Then write a second job that does whatever you originally wanted to do, using the output of the first job as its input.
Alternatively - if possible - move the processing you want to do from the mapper to the reducer, and simply pass both folders into a single job. The reducer will process the values the same way regardless of which file they came from.
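A sketch of that second approach, assuming the lines really look like "Key1": "value1" and using placeholder paths /folder1, /folder2 and /merged-output:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeByKey {

        // Emits (key, value) for each "Key": "value" line, whichever folder it came from.
        static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split(":", 2);
                if (parts.length == 2) {
                    context.write(new Text(parts[0].replace("\"", "").trim()),
                                  new Text(parts[1].replace("\"", "").trim()));
                }
            }
        }

        // The shuffle groups values by key, so both files' values arrive together here.
        static class JoinReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                StringBuilder joined = new StringBuilder();
                for (Text v : values) {
                    if (joined.length() > 0) joined.append(',');
                    joined.append(v);
                }
                context.write(key, new Text(joined.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(MergeByKey.class);
            job.setMapperClass(ParseMapper.class);
            job.setReducerClass(JoinReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Both folders feed the same job; no pre-merge step is needed.
            FileInputFormat.addInputPath(job, new Path("/folder1"));
            FileInputFormat.addInputPath(job, new Path("/folder2"));
            FileOutputFormat.setOutputPath(job, new Path("/merged-output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The shuffle guarantees that all values for the same key, from either folder, reach the same reduce() call, which is exactly the "merge" the question describes.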
Following the instructions on this link, I implemented a word-count program in C++ using a single mapper and a single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone please help me in this regard?
The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the size of a block, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post in SO.
You can set the number of reducers to as many as you wish. I believe in Hadoop Pipes you can do this by passing -D mapred.reduce.tasks=... when running hadoop. See this post in SO.
If you want to quickly test how your program works with more than one mappers, you can simply put a new file in your input path. This will make hadoop create another input split and thus another map task.
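For completeness, if you were writing a Java driver instead of using pipes, the same setting is available through the API; a compile-only sketch of that part of a driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            // The number of map tasks is not set here; it follows the input splits
            // (roughly one per HDFS block, and at least one per input file).
            job.setNumReduceTasks(1);  // one reducer, as asked in the question
            // ... set mapper/reducer classes and input/output paths, then submit.
        }
    }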
PS: The link that you provide is not reachable.
I have a 6 million line text file with lines up to 32,000 characters long, and I want to
measure the word-length frequencies.
The simplest method is for the Mapper to create a (word-length, 1) key-value pair for every word and let an 'aggregate' Reducer do the rest of the work.
Would it be more efficient to do some of the aggregation in the mapper, so the key-value pair output would be (word-length, frequency_per_line)?
The output from the mapper would then be reduced by a factor of the average number of words per line.
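A sketch of that per-line aggregation, assuming whitespace-separated words:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word length, count within the line) instead of (word length, 1) per word.
    public class WordLengthMapper extends Mapper<LongWritable, Text, IntWritable, LongWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<Integer, Long> counts = new HashMap<>();
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word.length(), 1L, Long::sum);
                }
            }
            for (Map.Entry<Integer, Long> e : counts.entrySet()) {
                context.write(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
            }
        }
    }

Holding the map across all map() calls and flushing it in cleanup() (the classic in-mapper combiner) reduces the output even further; adding a combiner class achieves much the same effect with less code.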
I know there are many configuration factors involved. But is there a hard rule saying whether most of the work should be done by the Mapper or the Reducer?
The platform is AWS with a student account, limited to the following configuration.