String concatenation in mapper class of a MapReduce Program giving errors - mapreduce

In my mapper class I want to do a small manipulation to a string read from a file (as a line) and then send it over to the reducer to get a string count. The manipulation is to replace null (empty) strings with "0". (The current replace-and-join part is failing my Hadoop job.)
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
.....
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static Text partString = new Text("");
    private final static IntWritable count = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // Read line by line to bufferreader and output the (line,count) pair
        BufferedReader bufReader = new BufferedReader(new StringReader(line));
        String l = null;
        while ((l = bufReader.readLine()) != null) {
            /**** This part is the problem ****/
            String a[] = l.split(",");
            if (a[1] == "") { // if a[1] i.e. second string is "" then set it to "0"
                a[1] = "0";
                l = StringUtils.join(",", a); // join the string array to form a string
            }
            /**** problematic part ends ****/
            partString.set(l);
            output.collect(partString, count);
        }
    }
}
After this is run, the mapper just fails and doesn't post any errors.
[The code is run with YARN.]
I am not sure what I am doing wrong; the same code worked without the string-join part.
Could any of you explain what is wrong with the string replace/concat? Is there a better way to do it?

Here's a modified version of your Mapper class with a few changes:
Remove the BufferedReader; it seems redundant (the input value is already a single line) and is never closed
String equality should be .equals() and not ==
Declare a String array using String[] and not String a[]
Resulting in the following code:
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private Text partString = new Text();
    private final static IntWritable count = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String[] a = line.split(",");
        if (a[1].equals("")) {
            a[1] = "0";
            line = StringUtils.join(",", a);
        }
        partString.set(line);
        output.collect(partString, count);
    }
}
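One caveat the answer doesn't mention, but which is a standard Java fact worth knowing here: split(",") drops trailing empty fields, so a line such as "part1," yields an array of length 1 and a[1] throws an ArrayIndexOutOfBoundsException, which would also kill the mapper without an obvious error. Passing a negative limit keeps the trailing empties:
public class SplitCheck {
    public static void main(String[] args) {
        // Default split drops trailing empty fields:
        System.out.println("part1,".split(",").length);     // prints 1, so a[1] would throw
        // A negative limit preserves them:
        System.out.println("part1,".split(",", -1).length); // prints 2, element 1 is ""
    }
}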

Related

Text to String map reduce

I am trying to split a string using MapReduce 2 (YARN) in the Hortonworks Sandbox.
It throws an ArrayIndexOutOfBoundsException if I try to access val[1]; it works fine when I don't split the input file.
Mapper:
public class MapperClass extends Mapper<Object, Text, Text, Text> {
    private Text airline_id;
    private Text name;
    private Text country;
    private Text value1;

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        if (s.length() > 1) {
            String val[] = s.split(",");
            context.write(new Text("blah"), new Text(val[1]));
        }
    }
}
Reducer:
public class ReducerClass extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String airports = "";
        if (key.equals("India")) {
            for (Text val : values) {
                airports += "\t" + val.toString();
            }
            result.set(airports);
            context.write(key, result);
        }
    }
}
MainClass:
public class MainClass {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "Flights MR");
        job.setJarByClass(MainClass.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Can you help?
Update:
Figured out that it doesn't convert Text to String: key.equals("India") in the reducer compares a Text against a String and never matches, so it needs key.toString().equals("India").
If the string you are splitting does not contain a comma, the resulting String[] will be of length 1, with the entire string at val[0].
Currently, you are only checking that the string has more than one character:
if (s.length() > 1)
But you are not checking that the split actually produced an array of length greater than 1; you are assuming there was a split.
context.write(new Text("blah"), new Text(val[1]));
If there was no split, this causes the out-of-bounds error. A possible solution is to make sure the string contains at least one comma, instead of checking its length, like so:
String s = value.toString();
if (s.indexOf(',') > -1) {
    String[] val = s.split(",");
    context.write(new Text("blah"), new Text(val[1]));
}
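To make the failure mode concrete, here is a minimal standalone check of Java's split behavior (illustrative, not from the original answer):
public class SplitLengthDemo {
    public static void main(String[] args) {
        // With a comma present, index 1 exists:
        System.out.println("India,DEL".split(",").length); // 2
        // Without one, the whole string sits at index 0,
        // and accessing val[1] throws ArrayIndexOutOfBoundsException:
        System.out.println("India".split(",").length);     // 1
    }
}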

getting output for only one key in a map reduce program

I am trying to write a MapReduce program to do a join between two text files. The output that I get is only for one of the keys. For example, if I have one file R.txt with data as
a4 b3
a3 b4
and another file S.txt with data as
b3 c3
b3 c1
b3 c2
b4 c4
I get the output
a4 c2
a4 c1
a4 c3
whereas if R.txt has
b4 c4
and S.txt has
a3 b4
the output is
a3 c4.
Here is my program
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class RSJoin {
    public static class SMap extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            context.write(new Text(words[0]), new Text("S\t" + words[1]));
        }
    }

    public static class RMap extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            context.write(new Text(words[1]), new Text("R\t" + words[0]));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) {
                String[] parts = val.toString().split("\t");
                String a = parts[0];
                if (a.equals("R")) {
                    for (Text val1 : values) {
                        String[] parts1 = val1.toString().split("\t");
                        String b = parts1[0];
                        if (b.equals("S")) {
                            context.write(new Text(parts[1]), new Text(parts1[1]));
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "ReduceJoin");
        job.setJarByClass(RSJoin.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, RMap.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SMap.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }
}
Your join logic assumes that the R value comes before the S values in the values list: only when you see an R do you then look for an S. Both for-each loops share the same underlying iterator over the values Iterable, so the inner loop resumes where the outer loop left off; if the S values come first, your inner loop won't find them.
If you only have one R value for multiple S values, either do a secondary sort (add the "R"/"S" tag to the key, plus a custom partitioner and a grouping comparator; this is the right way), or buffer in the reducer: hold the R value in a variable once you find it and collect the S values in a list until then (which doesn't scale well), making a single pass over the set of values. A rough sketch of the secondary-sort pieces is below.
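For reference, here is a minimal, untested sketch of those secondary-sort pieces; the composite "joinKey\tTAG" key layout and the class names are illustrative, not from the original post:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// The mappers would emit composite keys such as "b3\tR" / "b3\tS";
// since 'R' sorts before 'S', the single R value reaches reduce() first.

// Partition on the join key alone, so tagged keys meet in the same reducer.
public class JoinKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String joinKey = key.toString().split("\t")[0];
        return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the join key alone, so all tagged keys for it share one reduce() call.
public class JoinKeyGroupingComparator extends WritableComparator {
    public JoinKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split("\t")[0];
        String kb = b.toString().split("\t")[0];
        return ka.compareTo(kb);
    }
}
These would be registered with job.setPartitionerClass(JoinKeyPartitioner.class) and job.setGroupingComparatorClass(JoinKeyGroupingComparator.class); the reducer then remembers the first (R) value and streams the S values against it.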
I changed the reducer code as below and got the expected output
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    List<String> listR = new ArrayList<String>();
    List<String> listS = new ArrayList<String>();
    for (Text val : values) {
        String[] parts = val.toString().split("\t");
        String a = parts[0];
        if (a.equals("R")) {
            listR.add(parts[1]);
        } else if (a.equals("S")) {
            listS.add(parts[1]);
        }
    }
    for (String r : listR) {
        for (String s : listS) {
            context.write(new Text(r), new Text(s));
        }
    }
}

Mapreduce MultipleOutputs error

I want to store the output of a MapReduce job in two different directories, but even though my code is set up to write the same output to two named outputs, both files end up in the same directory.
My Driver class code is below:
public class WordCountMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job myhadoopJob = new Job(conf);
        myhadoopJob.setJarByClass(WordCountMain.class);
        myhadoopJob.setJobName("WORD COUNT JOB");
        FileInputFormat.addInputPath(myhadoopJob, new Path(args[0]));
        myhadoopJob.setMapperClass(WordCountMapper.class);
        myhadoopJob.setReducerClass(WordCountReducer.class);
        myhadoopJob.setInputFormatClass(TextInputFormat.class);
        myhadoopJob.setOutputFormatClass(TextOutputFormat.class);
        myhadoopJob.setMapOutputKeyClass(Text.class);
        myhadoopJob.setMapOutputValueClass(IntWritable.class);
        myhadoopJob.setOutputKeyClass(Text.class);
        myhadoopJob.setOutputValueClass(IntWritable.class);
        MultipleOutputs.addNamedOutput(myhadoopJob, "output1", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(myhadoopJob, "output2", TextOutputFormat.class, Text.class, IntWritable.class);
        FileOutputFormat.setOutputPath(myhadoopJob, new Path(args[1]));
        System.exit(myhadoopJob.waitForCompletion(true) ? 0 : 1);
    }
}
My Mapper Code
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String word = null;
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            word = st.nextToken();
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
My Reducer Code is below
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    MultipleOutputs<Text, IntWritable> mout = null;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        mout = new MultipleOutputs<Text, IntWritable>(context);
    }

    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        int num = 0;
        Iterator<IntWritable> ie = values.iterator();
        while (ie.hasNext()) {
            num = ie.next().get();
            count = count + num;
        }
        mout.write("output1", key, new IntWritable(count));
        mout.write("output2", key, new IntWritable(count));
    }
}
I am simply writing to the named outputs in the reduce method itself.
But when I run this MapReduce job using the command below, it does nothing; MapReduce doesn't even start, just a blank prompt that stays idle.
hadoop jar WordCountMain.jar /user/cloudera/inputfiles/words.txt /user/cloudera/outputfiles/mapreduce/multipleoutputs
Could someone explain what went wrong and how to correct it in my code?
What actually happens is that two output files with different names are stored inside /user/cloudera/outputfiles/mapreduce/multipleoutputs,
but what I need is to store the output files in different directories.
In Pig we can do this with two STORE statements pointing to different directories.
How do I achieve the same in MapReduce?
Try closing the MultipleOutputs object in the cleanup method of the Reducer; a sketch is below.
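A minimal sketch of that cleanup method, assuming the mout field from the question:
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // MultipleOutputs keeps its own record writers; without close()
    // their buffered output may never be flushed to the files.
    mout.close();
    super.cleanup(context);
}
For the different-directories goal, the newer MultipleOutputs API also offers a write(namedOutput, key, value, baseOutputPath) overload; a baseOutputPath such as "dir1/part" writes into a subdirectory under the job's output directory.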

Mapreduce output showing all records in same line

I have implemented a MapReduce job for a log file on Amazon, using Hadoop with a custom JAR.
My output shows the correct keys and values, but all the records are displayed on a single line. For example, given the following pairs:
<1387, 2>
<1388, 1>
This is what's printing:
1387 21388 1
This is what I'm expecting:
1387 2
1388 1
How can I fix this?
Cleaned up your code for you :)
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LogAnalyzer.class);
    conf.setJobName("Loganalyzer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(LogAnalyzer.Map.class);
    conf.setCombinerClass(LogAnalyzer.Reduce.class);
    conf.setReducerClass(LogAnalyzer.Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.set("mapreduce.textoutputformat.separator", "--");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = ((Text) value).toString();
    Matcher matcher = p.matcher(line);
    if (matcher.matches()) {
        String timestamp = matcher.group(4);
        minute.set(getMinuteBucket(timestamp));
        output.collect(minute, ONE); // context.write(minute, one);
    }
}
This isn't hadoop-streaming; it's just a normal Java job, so you should amend the tag on the question.
This looks okay to me, although you don't have the mapper inside a class, which I assume is a copy/paste omission.
With regards to the line endings: I don't suppose you are looking at the output on Windows? It could be a problem with Unix/Windows line endings. If you open the file in Sublime Text or another advanced text editor, you can switch between Unix and Windows line endings; see if that works. There is also a quick standalone check below.
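If you don't have such an editor handy, here is a small check (illustrative, not from the original answer) that prints which line terminators a local copy of the output file actually contains:
import java.io.FileInputStream;
import java.io.IOException;

public class LineEndingCheck {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream(args[0])) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') System.out.println("LF");      // Unix newline
                else if (b == '\r') System.out.println("CR"); // part of a Windows CRLF
            }
        }
    }
}
If no LF or CR lines are printed at all, the records really are being written without separators rather than merely displayed that way.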

HBase Map/Reduce - How to access individual columns of the table?

I have a table called User with two columns: one called visitorId, and the other called friend, which is a list of strings. I want to check whether the visitorId is in the friend list. Can anyone direct me on how to access the table columns in a map function?
I'm not able to picture how data is output from a map function in HBase.
My code is as follows:
public class MapReduce {
    static class Mapper1 extends TableMapper<ImmutableBytesWritable, Text> {
        private int numRecords = 0;
        private static final IntWritable one = new IntWritable(1);
        private Text text = new Text();

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            // What should i do here??
            ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
            try {
                context.write(userKey, one);
                // context.write(text, one);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "CheckVisitor");
        job.setJarByClass(MapReduce.class);
        Scan scan = new Scan();
        Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("mId2"));
        scan.setFilter(f);
        scan.addFamily(Bytes.toBytes("visitor"));
        scan.addFamily(Bytes.toBytes("friend"));
        TableMapReduceUtil.initTableMapperJob("User", scan, Mapper1.class, ImmutableBytesWritable.class, Text.class, job);
    }
}
So the Result values instance contains the full row from the scanner.
To get the appropriate columns from the Result, I would do something like:
VisitorIdVal = value.getColumnLatest(Bytes.toBytes(columnFamily1), Bytes.toBytes("VisitorId"))
friendlistVal = value.getColumnLatest(Bytes.toBytes(columnFamily2), Bytes.toBytes("friendlist"))
Here VisitorIdVal and friendlistVal are of type KeyValue (http://archive.cloudera.com/cdh/3/hbase/apidocs/org/apache/hadoop/hbase/KeyValue.html); to get their values out you can do Bytes.toString(VisitorIdVal.getValue()).
Once you have extracted the values from the columns, you can check for "VisitorId" in "friendlist".
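Putting that together, here is a minimal sketch of the map body; the family and qualifier names are assumed from the question, and the emitted key/value choice is illustrative:
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
    KeyValue visitorKv = values.getColumnLatest(Bytes.toBytes("visitor"), Bytes.toBytes("VisitorId"));
    KeyValue friendKv = values.getColumnLatest(Bytes.toBytes("friend"), Bytes.toBytes("friendlist"));
    if (visitorKv == null || friendKv == null) {
        return; // the row lacks one of the columns
    }
    String visitorId = Bytes.toString(visitorKv.getValue());
    String friendList = Bytes.toString(friendKv.getValue());
    if (friendList.contains(visitorId)) {
        try {
            // Emit the row key together with the matching visitorId.
            context.write(row, new Text(visitorId));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}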