Hadoop MapReduce using 2 mappers and 1 reducer in C++

Following the instructions on this link, I implemented a word-count program in C++ using a single mapper and a single reducer. Now I need to use two mappers and one reducer for the same problem.
Can someone please help me with this?

The number of mappers depends on the number of input splits created. The number of input splits depends on the size of the input, the block size, the number of input files (each input file creates at least one input split), whether the input files are splittable or not, etc. See also this post on SO.
You can set the number of reducers to as many as you wish. I believe in Hadoop Pipes you can do this by passing -D mapred.reduce.tasks=... when running hadoop. See this post on SO.
If you want to quickly test how your program behaves with more than one mapper, you can simply put a new file in your input path. This will make Hadoop create another input split and thus another map task.
PS: The link that you provided is not reachable.
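Since the question is about Pipes, the following is only a comparison sketch: a minimal Java driver showing the same two knobs, namely that the reducer count is set explicitly while the mapper count falls out of the input splits. The mapper/reducer classes are omitted and the paths are made up for illustration; with Pipes you would pass the equivalent properties with -D on the hadoop pipes command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);

    // The reducer count is a direct setting.
    job.setNumReduceTasks(1);

    // The mapper count is not set directly: each input split becomes one map task,
    // and each input file produces at least one split. Pointing the job at two
    // input files is the easiest way to get two mappers for a small test input.
    FileInputFormat.addInputPath(job, new Path("input/part1.txt"));
    FileInputFormat.addInputPath(job, new Path("input/part2.txt"));
    FileOutputFormat.setOutputPath(job, new Path("output"));

    // Mapper and reducer classes omitted; this sketch only shows the job configuration.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}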

Related

Is there a way to specify the number of mappers in Scalding?

I am new to the Scalding world. My Scalding job will have multiple stages, and I need to tune each stage individually.
I have found that we might be able to change the number of reducers by using withReducers. I am also able to set the split size for the input data via the job config. However, I haven't found any way to change the number of mappers for my sub-tasks on the fly.
Did I miss something? Does anyone know how to specify the number of mappers for my sub-tasks? Thanks.
I got some answers/ideas that might be helpful for someone else with the same question.
It is much easier to control reducers than mappers.
Mappers are controlled by Hadoop, and there is no similarly simple knob. You can set some config parameters to give Hadoop an idea of how many map tasks to launch.
This Stack Overflow question may be helpful:
Setting the number of map tasks and reduce tasks
One workaround I could think of is splitting your major task into smaller ones, so that you can individually tweak the size of the input data (and hence the number of mappers) for each.
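To make the "config parameters" part concrete, below is a rough sketch of the Hadoop-level knobs in plain Java MapReduce, using the Hadoop 2.x property names. Scalding ultimately hands its configuration to the same machinery, so depending on how you launch the job these can also be passed as -D options; treat the exact values as placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Smaller max split size -> more input splits -> more map tasks (for splittable input).
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);

    // Reducers, by contrast, are a direct setting.
    conf.setInt("mapreduce.job.reduces", 10);

    Job job = Job.getInstance(conf, "stage-1");
    // ... rest of the job setup as usual
  }
}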

AWS EMR: increase the number of mappers

I am executing a MapReduce program on AWS and the code is working correctly.
My problem is with the number of map functions that run in parallel.
Every time I execute the program, only one map function and only one node are working in parallel.
My input file contains 100 lines with a total size of 4 kB. I need a map function for every 20 lines, with the maps running in parallel.
I tried to change the "fs.s3n.block.size" parameter in the config, yet nothing has changed.
Thank you.

When to use MapReduce with HBase?

I want to understand HBase MapReduce from an application point of view, and need some real use cases to better understand when writing these jobs makes sense.
If there is any link to documentation or examples that explain real use cases, please share.
I can give some examples based on my own use cases. If you already store your data in HBase, you can write a Java program that scans a table, does something, and writes the output to HBase or somewhere else. Or you can use MapReduce to do the same. The difference is that MapReduce runs where the data is, and network traffic is used only for the result data. We have hourly jobs that calculate the sum and average of KPIs; the input data is huge but the output is tiny. If I did not use MapReduce, I would need to move one hour of data over the network, which is 18 GB. The MapReduce output, however, is only about 1 MB, and I can write it to HBase, a file, or somewhere else.
MapReduce also gives you parallel task execution, which you could build yourself in plain Java, but why would you :)
Keep in mind that YARN creates map tasks according to your HBase table's split (region) count. So if you need more map tasks, split your table.
If you already store your data in HDFS, you are lucky: a MapReduce job reading from HDFS is much faster than one reading from HBase. You can also still write the MapReduce output to HBase, if you want.
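As a rough illustration of the kind of job described above (an hourly aggregation that scans an HBase table), the skeleton below uses TableMapReduceUtil; the table name, column family, and output path are made up, and the reducer is the stock LongSumReducer.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class HourlyKpiSum {

  // The map tasks run next to the regions, so only the small per-key sums cross the network.
  static class KpiMapper extends TableMapper<Text, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] kpi = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("kpi"));
      if (kpi != null) {
        context.write(new Text("kpi"), new LongWritable(Bytes.toLong(kpi)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hourly-kpi-sum");
    job.setJarByClass(HourlyKpiSum.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in larger batches
    scan.setCacheBlocks(false);  // don't pollute the block cache with a full scan

    // One map task is created per region (split) of the input table.
    TableMapReduceUtil.initTableMapperJob(
        "metrics", scan, KpiMapper.class, Text.class, LongWritable.class, job);

    job.setReducerClass(LongSumReducer.class);
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/kpi-sum"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}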
Please look into the use cases given:
1. here.
2. And a small reference here - 30. Joins
3. Maybe an end-to-end example here
In the end, it all depends on your understanding of each concept (MapReduce, HBase) and using them as your project requires. The same task can be done with or without MapReduce. Happy coding!

Multiple MapReduce jobs with multiple files as input and multiple reducers

I need to chain multiple MapReduce streaming jobs in order to perform some computation over a large dataset.
I intend to use multiple reducers for each job in order to speed up the overall workflow. As a workflow scheduler I use Oozie.
Here is an illustration to clarify my problem:
Let say I have two files
File 1:      File 2:
A B 1        A B 3
A C 4        C D 6
B D 2        B D 1
I'd like to have two mappers and two reducers and get the following output for the MapReduce job:
Output:
A B 4
A C 4
B D 3
C D 6
But this is not at all what I get; instead I get partial sums.
Here is what I think happens.
Since I have multiple reducers for each MapReduce job, the input of the next job is split into several files. These files are given to the mappers, which then send their output to the reducers. It seems that the mappers send their output to the reducers without the whole input first being grouped and sorted by key.
I've read several threads about using multiple files as input, and I don't think this is a matter of performing a map-side join. Maybe it has to do with partitioning, but I haven't exactly understood what partitioning consists of.
Is there any way to sort the output of several mappers before sending it to the reducers? Or can I tell Oozie to merge the output of several reducers in order to have only one file as the input of the next MapReduce job?
I'm fairly new to MapReduce, but if you are not getting the desired output in your example, it looks like your job isn't handling the keys correctly.
By default, Hadoop streaming uses the tab character as the field separator and takes everything from the start of a line up to the first tab as the key. In your case, if your input lines actually look like "A[space]B[space]1", you'll need to add
-D stream.map.output.field.separator=' ' \
-D stream.num.map.output.key.fields=2 \
to your Hadoop streaming command in order to set the space as the field separator and the first 2 columns as the key. This will send all lines whose first two columns are "A B" to the same reducer. More info can be found here.

HBase MapReduce output to HDFS & HBase

I have a MapReduce program that first scans an HBase table.
I want some of the reducer output to go to HDFS and some of it to be written to an HBase table. Can a reducer be configured to output to two different locations/formats like this?
A reducer can be configured to write to multiple files using the MultipleOutputs class. The documentation at the top of that class provides a clear example of writing to multiple files. However, since there is no built-in Hadoop OutputFormat for writing to HBase, you might consider writing the second stream to a specific place on HDFS and then using another job to insert it into HBase.
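A rough sketch of that pattern follows (the named output "hbasestage" and the class names are made up; see the MultipleOutputs javadoc for the canonical example). The main output goes through context.write() as usual, while the second stream goes to a named output that the driver registers with MultipleOutputs.addNamedOutput(job, "hbasestage", TextOutputFormat.class, Text.class, IntWritable.class).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private MultipleOutputs<Text, IntWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // Regular job output, written to the job's HDFS output directory.
    context.write(key, new IntWritable(sum));
    // Second stream, staged on HDFS for a follow-up job that loads it into HBase.
    mos.write("hbasestage", key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}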
If you don't want to write too much code, just open a Table in your mapper's or reducer's setup() method and issue a Put into your HBase table, while writing the job so that its normal output goes to an HDFS file. This way you get to write to both HBase and HDFS.
To be more specific, when you do a context.write() you write to the HDFS file, while the table.put() writes the row to HBase.
Also, don't forget to close the table (and anything else you opened) in your cleanup() method. The only drawback is that if there are, say, 1000 mappers, your table connection would be opened 1000 times; but at any given point only as many connections are open as there are tasks actually running, which would probably be around 50, depending on your setup. It works for me, at least!
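For reference, a sketch of that approach using the HBase 1.x client API (the table name and column family are made up): context.write() feeds the job's HDFS output file, while the Put goes straight to the HBase table opened in setup().

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DualWriteReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private Connection connection;
  private Table table;

  @Override
  protected void setup(Context context) throws IOException {
    // One connection per task; only as many are open at once as there are tasks running.
    connection = ConnectionFactory.createConnection(
        HBaseConfiguration.create(context.getConfiguration()));
    table = connection.getTable(TableName.valueOf("results"));
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }

    // Written to the job's HDFS output file.
    context.write(key, new IntWritable(sum));

    // Written directly to the HBase table.
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
    table.put(put);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Don't forget to close the table and the connection.
    if (table != null) table.close();
    if (connection != null) connection.close();
  }
}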
I think MultipleOutputs can do the job.
Check this out:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html