Maximum Monthly Temperature Reducer code - mapreduce

package com.ibm.dw61;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTempReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxTemp = Integer.MIN_VALUE;
for (IntWritable value: values) {
maxTemp = Math.max(maxTemp, value.get());
}
context.write(key, new IntWritable(maxTemp));
}
}
Questions :
1) int maxTemp = Integer.MIN_VALUE <----- this line seems to be an initialisation of the maxTemp variable. Why does the coder not initialise it to zero? Integer.MIN_VALUE gives -2147483648, and it is impossible for the lowest temperature to ever reach -100 degrees.
2) context.write(key, new IntWritable(maxTemp)) <------ This is the end result. Key is the month and maxTemp is the maximum temperature for the month. Why is the 'new' keyword required for maxTemp but not for the key (month)?

1) int maxTemp = Integer.MIN_VALUE
Integer.MIN_VALUE is a public static final int constant holding the minimum value an int can have, -2^31 (-2147483648). Initialising maxTemp to the smallest possible int guarantees that the first value in the iterable always replaces it; initialising it to zero would silently report 0 as the maximum for a month whose temperatures are all below zero.
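As a quick illustration (not from the original post, and with made-up readings), here is what goes wrong if maxTemp starts at zero for a month whose temperatures are all negative:
public class MinValueDemo {
    public static void main(String[] args) {
        int[] januaryTemps = {-12, -5, -20};   // hypothetical all-negative month

        int maxFromZero = 0;                   // the suggested initialisation
        int maxFromMin = Integer.MIN_VALUE;    // the initialisation used in the reducer
        for (int t : januaryTemps) {
            maxFromZero = Math.max(maxFromZero, t);
            maxFromMin = Math.max(maxFromMin, t);
        }
        System.out.println(maxFromZero);       // prints 0  (wrong: 0 was never observed)
        System.out.println(maxFromMin);        // prints -5 (the real maximum)
    }
}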
2) context.write(key, new IntWritable(maxTemp))
The key is already a Text object handed to reduce(), so it can be written as-is; maxTemp is a plain int, so it must be wrapped in a new IntWritable before it can be written. Hadoop uses classes like Text and IntWritable instead of String or Integer because they implement the Writable interface, which gives Hadoop a compact, efficient way to serialize keys and values as they move between map and reduce tasks.
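As a minimal sketch of the serialization round trip that Writable types provide (illustrative only, not part of the original job):
import java.io.*;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize an IntWritable to raw bytes, the way Hadoop ships values between tasks.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new IntWritable(42).write(new DataOutputStream(bytes));

        // Deserialize into a fresh, reusable instance.
        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.get()); // prints 42
    }
}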
Hope this is helpful

Related

String concatenation in mapper class of a MapReduce Program giving errors

In my mapper class I want to do a small manipulation to a string read from a file (as a line) and then send it over to the reducer to get a string count. The manipulation is to replace empty strings with "0" (the current replace-and-join part is failing my Hadoop job).
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
.....
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private static Text partString = new Text("");
private final static IntWritable count = new IntWritable(1);
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
// Read line by line to bufferreader and output the (line,count) pair
BufferedReader bufReader = new BufferedReader(new StringReader(line));
String l=null;
while( (l=bufReader.readLine()) != null )
{
/**** This part is the problem ****/
String a[]=l.split(",");
if(a[1]==""){ // if a[1] i.e. second string is "" then set it to "0"
a[1]="0";
l = StringUtils.join(",", a); // join the string array to form a string
}
/**** problematic part ends ****/
partString.set(l);
output.collect(partString, count);
}
}
}
After this is run, the mapper just fails and doesn't post any errors.
[The code is run with YARN]
I am not sure what I am doing wrong; the same code worked without the string-join part.
Could any of you explain what is wrong with the string replace/concat? Is there a better way to do it?
Here's a modified version of your Mapper class with a few changes:
Remove the BufferedReader; it seems redundant and isn't being closed
String equality should be .equals() and not ==
Declare a String array using String[] and not String a[]
Resulting in the following code:
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private Text partString = new Text();
private final static IntWritable count = new IntWritable(1);
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
String[] a = line.split(",");
if (a[1].equals("")) {
a[1] = "0";
line = StringUtils.join(",", a);
}
partString.set(line);
output.collect(partString, count);
}
}

getting output for only one key in a map reduce program

I am trying to write a Map Reduce program to do a join between two text files. The output that I get is only for one of the keys. For example, if I have one file R.txt with data as
a4 b3
a3 b4
and another file S.txt with data as
b3 c3
b3 c1
b3 c2
b4 c4
I get the output
a4 c2
a4 c1
a4 c3
whereas if R.txt has
b4 c4
and S.txt has
a3 b4
the output is
a3 c4.
Here is my program
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class RSJoin{
public static class SMap extends Mapper<Object, Text, Text, Text>{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(" ");
context.write(new Text(words[0]), new Text("S\t"+words[1]));
}
}
public static class RMap extends Mapper<Object, Text, Text, Text>{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] words = value.toString().split(" ");
context.write(new Text(words[1]), new Text("R\t"+words[0]));
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text val : values) {
String [] parts = val.toString().split("\t");
String a=parts[0];
if (a.equals("R")){
for (Text val1 : values){
String [] parts1=val1.toString().split("\t");
String b=parts1[0];
if (b.equals("S")){
context.write(new Text(parts[1]), new Text(parts1[1]));
}
}
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
@SuppressWarnings("deprecation")
Job job = new Job(conf, "ReduceJoin");
job.setJarByClass(RSJoin.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setReducerClass(Reduce.class);
MultipleInputs.addInputPath(job,new Path(args[0]),TextInputFormat.class,RMap.class);
MultipleInputs.addInputPath(job,new Path(args[1]),TextInputFormat.class,SMap.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.waitForCompletion(true);
}
}
Your join logic assumes that the R value comes before the S value in the values list: only when you see an R do you then look for an S. The inner for over the values Iterable begins where the outer for left off, so if the S comes first your inner loop won't find it.
If you only have one R value for multiple S values, either do a secondary sort (add the "R"/"S" tag to the key, add a partitioner, and add a grouping comparator; this is the right way), or hold the R value in a variable once you find it and buffer the S values in a list until the R value turns up (this doesn't really scale well), making a single pass over the set of values.
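For reference, a compressed sketch of the secondary-sort route; the class names, the tab-separated composite key ("b3\tR" / "b3\tS" emitted by the mappers), and the wiring are illustrative assumptions, not code from the post:
// Partition on the natural join key only, so R and S records for the same key
// land in the same reducer even though their composite keys differ.
public static class JoinPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split("\t")[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// Group values on the natural join key only; the default sort still orders
// "...\tR" before "...\tS", so the reducer sees the R value(s) first.
public static class JoinGroupingComparator extends WritableComparator {
    protected JoinGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split("\t")[0];
        String kb = b.toString().split("\t")[0];
        return ka.compareTo(kb);
    }
}
// In main(): job.setPartitionerClass(JoinPartitioner.class);
//            job.setGroupingComparatorClass(JoinGroupingComparator.class);
The existing RSJoin imports (org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.*) already cover Partitioner, WritableComparator and WritableComparable.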
I changed the reducer code as below and got the expected output
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
List<String> listR = new ArrayList <String>();
List<String> listS = new ArrayList <String>();
for (Text val : values) {
String [] parts = val.toString().split("\t");
String a=parts[0];
if (a.equals("R")){
listR.add(parts[1]);
}
else if (a.equals("S")){
listS.add(parts[1]);
}
}
for (String Temp: listR)
{
for (String Temp1: listS)
{
context.write(new Text(Temp), new Text(Temp1));
}
}
}

Map Reduce Filter records

I have a set of records where I need to process only male records. In my MapReduce program I have used an if condition to filter only the male records, but the program below gives zero records as output.
Input file:
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325
Mapreduce Program:
package org.samples.mapreduce.training;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class patientrxMR_filter {
public static class MapDemohadoop extends
Mapper<LongWritable, Text, Text, IntWritable> {
// setup , map, run, cleanup
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] elements = line.split(",");
String gender =elements[3];
if ( gender == "male" ) {
Text tx = new Text(elements[2]);
int i = Integer.parseInt(elements[4]);
IntWritable it = new IntWritable(i);
context.write(tx, it);
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
// setup, reduce, run, cleanup
// input - para [150,100]
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Insufficient args");
System.exit(-1);
}
Configuration conf = new Configuration();
//conf.set("fs.default.name","hdfs://localhost:50000");
conf.set("mapred.job.tracker", "hdfs://localhost:50001");
// conf.set("DrugName", args[3]);
Job job = new Job(conf, "Drug Amount Spent");
job.setJarByClass(patientrxMR_filter.class); // class contains mapper and
// reducer class
job.setMapOutputKeyClass(Text.class); // map output key class
job.setMapOutputValueClass(IntWritable.class);// map output value class
job.setOutputKeyClass(Text.class); // output key type in reducer
job.setOutputValueClass(IntWritable.class);// output value type in
// reducer
job.setMapperClass(MapDemohadoop.class);
job.setReducerClass(Reduce.class);
job.setNumReduceTasks(1);
job.setInputFormatClass(TextInputFormat.class); // default -- inputkey
// type -- longwritable
// : valuetype is text
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
if ( gender == "male" )
This line doesn't work for equality check, For equality in java pls use object.equals()
i.e
if ( gender.equals("male") )
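A quick standalone illustration of the difference (not from the original post):
public class EqualsDemo {
    public static void main(String[] args) {
        // Like elements[3], this String is created at runtime by split(), so it is a
        // different object from the literal "male" even though the characters match.
        String gender = "3,Zia Underwood,paracetamol,male,980".split(",")[3];
        System.out.println(gender == "male");       // false: compares object references
        System.out.println(gender.equals("male"));  // true:  compares character content
        System.out.println("male".equals(gender));  // true, and also null-safe
    }
}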
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] elements = line.split(",");
Hadoop uses a distributed file system. In "String line = value.toString();",
line is the record content from the block, and its offset is the key. In this case, line loads the entire test file, which apparently fits into one block, instead of each line in the file as you expected.

Mapreduce MultipleOutputs error

I want to store the output of a MapReduce job in two different directories.
My code is designed to store the same output in two different directories, but that is not what happens.
My Driver class code below
public class WordCountMain {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job myhadoopJob = new Job(conf);
myhadoopJob.setJarByClass(WordCountMain.class);
myhadoopJob.setJobName("WORD COUNT JOB");
FileInputFormat.addInputPath(myhadoopJob, new Path(args[0]));
myhadoopJob.setMapperClass(WordCountMapper.class);
myhadoopJob.setReducerClass(WordCountReducer.class);
myhadoopJob.setInputFormatClass(TextInputFormat.class);
myhadoopJob.setOutputFormatClass(TextOutputFormat.class);
myhadoopJob.setMapOutputKeyClass(Text.class);
myhadoopJob.setMapOutputValueClass(IntWritable.class);
myhadoopJob.setOutputKeyClass(Text.class);
myhadoopJob.setOutputValueClass(IntWritable.class);
MultipleOutputs.addNamedOutput(myhadoopJob, "output1", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(myhadoopJob, "output2", TextOutputFormat.class, Text.class, IntWritable.class);
FileOutputFormat.setOutputPath(myhadoopJob, new Path(args[1]));
System.exit(myhadoopJob.waitForCompletion(true) ? 0 : 1);
}
}
My Mapper Code
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
protected void map(LongWritable key, Text value, Context context)throws IOException, InterruptedException {
String line = value.toString();
String word =null;
StringTokenizer st = new StringTokenizer(line,",");
while(st.hasMoreTokens())
{
word= st.nextToken();
context.write(new Text(word), new IntWritable(1));
}
}
}
My Reducer Code is below
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
MultipleOutputs mout =null;
protected void reduce(Text key, Iterable<IntWritable> values, Context context)throws IOException, InterruptedException {
int count=0;
int num =0;
Iterator<IntWritable> ie =values.iterator();
while(ie.hasNext())
{
num = ie.next().get();//1
count= count+num;
}
mout.write("output1", key, new IntWritable(count));
mout.write("output2", key, new IntWritable(count));
#Override
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
super.setup(context);
mout = new MultipleOutputs<Text, IntWritable>(context);
}
}
#Override
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException {
super.setup(context);
mout = new MultipleOutputs<Text, IntWritable>(context);
}
}
I am simply giving the output directories in reduce method itself
But when I run this MapReduce job using the command below, it does nothing. The MapReduce job does not even start; it just shows a blank and stays idle.
hadoop jar WordCountMain.jar /user/cloudera/inputfiles/words.txt /user/cloudera/outputfiles/mapreduce/multipleoutputs
Could someone explain to me what went wrong and how I can correct my code?
What actually happens is that two output files with different names are stored inside /user/cloudera/outputfiles/mapreduce/multipleoutputs,
but what I need is to store the output files in different directories.
In Pig we can do this with two STORE statements pointing at different directories.
How do I achieve the same in MapReduce?
Can you try closing the MultipleOutputs object in the Reducer's cleanup() method?
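A minimal sketch of that suggestion, reusing the mout field from the reducer above:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Closing MultipleOutputs flushes and closes its underlying record writers;
    // without this the named outputs can stay empty.
    mout.close();
}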

Hadoop: Use only a part of the reduce Iterable

I have a situation in which I only want to use the first n values of the Iterable given to my reducer and then abort. I have been reading about the Iterable class and it seems like this may not be trivial.
I can't use a for loop or a next method. I can't use a foreach since it iterates over the whole object. Is there a straightforward solution or am I approaching the problem wrong?
Thanks.
You can just extract the iterator from the Iterable and use a good old for loop, or a while loop.
For example, the code below sums over at most the first TOPN values.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
private static final int TOPN = 10;
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
Iterator<IntWritable> iter = values.iterator();
for (int i=0; iter.hasNext() && i < TOPN; i++) {
sum += iter.next().get();
}
result.set(sum);
context.write(key, result);
}
}