Text to String map reduce - mapreduce

I am trying to split a string using MapReduce2 (YARN) in the Hortonworks Sandbox.
It throws an ArrayIndexOutOfBoundsException when I try to access val[1]. It works fine when I don't split the input.
Mapper:
public class MapperClass extends Mapper<Object, Text, Text, Text> {
    private Text airline_id;
    private Text name;
    private Text country;
    private Text value1;

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        if (s.length() > 1) {
            String val[] = s.split(",");
            context.write(new Text("blah"), new Text(val[1]));
        }
    }
}
Reducer:
public class ReducerClass extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String airports = "";
        if (key.equals("India")) {
            for (Text val : values) {
                airports += "\t" + val.toString();
            }
            result.set(airports);
            context.write(key, result);
        }
    }
}
MainClass:
public class MainClass {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "Flights MR");
        job.setJarByClass(MainClass.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Can you help?
Update:
Figured out that it doesn't convert Text to String.
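(Presumably this refers to the key.equals("India") check in the reducer: key is a Text, not a String, so that comparison is always false. A minimal fix, assuming that is indeed the intent, is to convert the key before comparing:)
if (key.toString().equals("India")) {
    for (Text val : values) {
        airports += "\t" + val.toString();
    }
    result.set(airports);
    context.write(key, result);
}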

If the string you are splitting does not contain a comma, the resulting String[] will be of length 1, with the entire string at val[0].
Currently, you are only checking the length of the string:
if (s.length() > 1)
But you are not checking that the split actually produces an array with more than one element; you are assuming that a split happened.
context.write(new Text("blah"), new Text(val[1]));
If there was no split, this will cause an out-of-bounds error. A possible solution is to make sure that the string contains at least one comma, instead of only checking its length, like so:
String s = value.toString();
if (s.indexOf(',') > -1) {
    String val[] = s.split(",");
    context.write(new Text("blah"), new Text(val[1]));
}
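(An equivalent guard, not from the original answer, is to split first and then check the resulting array length; a minimal sketch assuming the same comma-delimited input:)
String s = value.toString();
String[] val = s.split(",");
if (val.length > 1) {
    context.write(new Text("blah"), new Text(val[1]));
}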

Related

How to solve the "too many connection" problem in zookeeper when I want to query too many times in reduce stage?

Sorry for my stupid question and thank you in advance.
I need to replace the output value in the reduce stage (or map stage). However, it causes too many connections in ZooKeeper, and I don't know how to deal with it.
This is my reduce method:
public static class HbaseToHDFSReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> address = new HashSet<>();
        for (Text item : values) {
            String city = getDataByRowKey("A1", item.toString());
            address.add(city);
        }
        context.write(key, new Text(String.valueOf(address).replace("\"", "")));
    }
}
This is the query method:
public static String getDataByRowKey(String tableName, String rowKey) throws IOException {
    Table table = ConnectionFactory.createConnection(conf).getTable(TableName.valueOf(tableName));
    Get get = new Get(rowKey.getBytes());
    String data = new String();
    if (!get.isCheckExistenceOnly()) {
        Result result = table.get(get);
        for (Cell cell : result.rawCells()) {
            String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
            String value = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
            if (colName.equals(rowKey)) {
                data = value;
            }
        }
    }
    table.close();
    return data;
}
What should I do to solve it?
Thank you again
You are creating one connection per query, and connection creation is a heavyweight operation. Maybe you can obtain the Connection once in your reducer instead, and change
getDataByRowKey(String tableName, String rowKey)
to
getDataByRowKey(Connection connection, String tableName, String rowKey).
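(A minimal sketch of that idea, creating the Connection once per reducer task in setup() and closing it in cleanup(); the class, table name "A1", and overall logic follow the question's code, and the exact placement is just one reasonable option:)
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class HbaseToHDFSReducer extends Reducer<Text, Text, Text, Text> {

    private Connection connection;

    @Override
    protected void setup(Context context) throws IOException {
        // One Connection for the whole reducer task instead of one per query
        connection = ConnectionFactory.createConnection(context.getConfiguration());
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> address = new HashSet<>();
        for (Text item : values) {
            address.add(getDataByRowKey(connection, "A1", item.toString()));
        }
        context.write(key, new Text(String.valueOf(address).replace("\"", "")));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (connection != null) {
            connection.close();
        }
    }

    private static String getDataByRowKey(Connection connection, String tableName, String rowKey) throws IOException {
        // Only the Connection is reused; the lightweight Table handle is still opened per query
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Result result = table.get(new Get(rowKey.getBytes()));
            String data = "";
            for (Cell cell : result.rawCells()) {
                String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
                if (colName.equals(rowKey)) {
                    data = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
                }
            }
            return data;
        }
    }
}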

String concatenation in mapper class of a MapReduce Program giving errors

In my mapper class I want to do a small manipulation to a string read from a file (as a line) and then send it over to the reducer to get a string count. The manipulation is to replace null (empty) strings with "0". (The current replace & join part is failing my Hadoop job.)
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
.....
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static Text partString = new Text("");
    private final static IntWritable count = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // Read line by line with a BufferedReader and output the (line, count) pair
        BufferedReader bufReader = new BufferedReader(new StringReader(line));
        String l = null;
        while ((l = bufReader.readLine()) != null) {
            /**** This part is the problem ****/
            String a[] = l.split(",");
            if (a[1] == "") { // if a[1], i.e. the second string, is "" then set it to "0"
                a[1] = "0";
                l = StringUtils.join(",", a); // join the string array to form a string
            }
            /**** problematic part ends ****/
            partString.set(l);
            output.collect(partString, count);
        }
    }
}
After this is run, the mapper just fails and doesn't post any errors.
[The code is run with yarn]
I am not sure what I am doing wrong; the same code worked without the string join part.
Could any of you explain what is wrong with the string replace/concat? Is there a better way to do it?
Here's a modified version of your Mapper class with a few changes:
Remove the BufferedReader, it seems redundant and isn't being closed
String equality should be .equals() and not ==
Declare a String array using String[] and not String a[]
Resulting in the following code:
public class PartNumberMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private Text partString = new Text();
    private final static IntWritable count = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String[] a = line.split(",");
        if (a[1].equals("")) {
            a[1] = "0";
            line = StringUtils.join(",", a);
        }
        partString.set(line);
        output.collect(partString, count);
    }
}
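(One caveat, not part of the original answer: if an input line contains no comma at all, or split() drops a trailing empty field, a[1] will still throw an ArrayIndexOutOfBoundsException. A guard along these lines may be worth adding, assuming the same comma-delimited input:)
String[] a = line.split(",", -1); // limit -1 keeps trailing empty fields
if (a.length > 1 && a[1].equals("")) {
    a[1] = "0";
    line = StringUtils.join(",", a);
}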

Mapreduce MultipleOutputs error

I want to store the output of a MapReduce job in two different directories.
Even though my code is designed to store the same output in two different directories, it does not work.
My Driver class code below
public class WordCountMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job myhadoopJob = new Job(conf);
        myhadoopJob.setJarByClass(WordCountMain.class);
        myhadoopJob.setJobName("WORD COUNT JOB");
        FileInputFormat.addInputPath(myhadoopJob, new Path(args[0]));
        myhadoopJob.setMapperClass(WordCountMapper.class);
        myhadoopJob.setReducerClass(WordCountReducer.class);
        myhadoopJob.setInputFormatClass(TextInputFormat.class);
        myhadoopJob.setOutputFormatClass(TextOutputFormat.class);
        myhadoopJob.setMapOutputKeyClass(Text.class);
        myhadoopJob.setMapOutputValueClass(IntWritable.class);
        myhadoopJob.setOutputKeyClass(Text.class);
        myhadoopJob.setOutputValueClass(IntWritable.class);
        MultipleOutputs.addNamedOutput(myhadoopJob, "output1", TextOutputFormat.class, Text.class, IntWritable.class);
        MultipleOutputs.addNamedOutput(myhadoopJob, "output2", TextOutputFormat.class, Text.class, IntWritable.class);
        FileOutputFormat.setOutputPath(myhadoopJob, new Path(args[1]));
        System.exit(myhadoopJob.waitForCompletion(true) ? 0 : 1);
    }
}
My Mapper Code
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String word = null;
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            word = st.nextToken();
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
My Reducer Code is below
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    MultipleOutputs mout = null;

    @Override
    protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        mout = new MultipleOutputs<Text, IntWritable>(context);
    }

    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        int num = 0;
        Iterator<IntWritable> ie = values.iterator();
        while (ie.hasNext()) {
            num = ie.next().get();
            count = count + num;
        }
        mout.write("output1", key, new IntWritable(count));
        mout.write("output2", key, new IntWritable(count));
    }
}
I am simply giving the output directories in the reduce method itself.
But when I run this MapReduce job using the command below, it does nothing. The MapReduce job does not even start; it just sits there blank and idle.
hadoop jar WordCountMain.jar /user/cloudera/inputfiles/words.txt /user/cloudera/outputfiles/mapreduce/multipleoutputs
Could someone explain what went wrong and how I can correct my code?
Actually, what happens is that two output files with different names are stored inside /user/cloudera/outputfiles/mapreduce/multipleoutputs,
but what I need is to store the output files in different directories.
In Pig we can do this with two STORE statements pointing at different directories.
How do I achieve the same in MapReduce?
Can you try closing the MultipleOutputs object in the cleanup method of the Reducer?
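(A minimal sketch of what that could look like, reusing the mout field from the question's reducer:)
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // MultipleOutputs keeps its own record writers open; without closing it,
    // the named outputs may never be flushed and committed.
    if (mout != null) {
        mout.close();
    }
}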

Mapper is not producing expected results

This is my input file (custs.txt):
1002|surender|23
1003|Rahja|24
And this is my program:
Main:
public class ReduceSideJoinMain {
    /**
     * @param args
     * @throws IOException
     * @throws ClassNotFoundException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        JobConf config = new JobConf();
        config.setQueueName("omega");
        Job job = new Job(config, "word count");
        job.setJarByClass(ReduceSideJoinMain.class);
        Path inputFilePath1 = new Path(args[0]);
        Path outputFilePath2 = new Path(args[1]);
        //MultipleInputs.addInputPath(job, inputFilePath1, TextInputFormat.class, CustMapper.class);
        //MultipleInputs.addInputPath(job, inputFilePath2, TextInputFormat.class, TxnsMapper.class);
        FileInputFormat.addInputPath(job, inputFilePath1);
        FileOutputFormat.setOutputPath(job, outputFilePath2);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(CustMapper.class);
        //job.setReducerClass(ReduceJoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper:
public class CustMapper extends Mapper<LongWritable, Text, Text, Text> {
    public static IntWritable one = new IntWritable(1);

    protected void map(LongWritable key, Text value, Context context) throws java.io.IOException, java.lang.InterruptedException {
        String line = value.toString();
        String arr[] = line.split("|");
        context.write(new Text(arr[0]), new Text(arr[1]));
    }
}
I am getting the following output, which is wrong:
1
1
I am expecting the output to be:
1002 surender
1003 Rahja
Why is it not giving the expected output? Is there any issue with the split method?
use String arr[] = line.split("\\|");
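(Some context not spelled out in that terse answer: String.split() takes a regular expression, and | is a regex metacharacter meaning "or", so line.split("|") splits between every character and arr[1] ends up being a single character instead of the name. Escaping the pipe, or using Pattern.quote("|"), makes it a literal delimiter. A small standalone sketch to see the difference:)
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "1002|surender|23";

        // "|" is regex alternation, so this splits between every character
        System.out.println(Arrays.toString(line.split("|")));

        // Escaping the pipe splits on the literal '|' character: [1002, surender, 23]
        System.out.println(Arrays.toString(line.split("\\|")));
    }
}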

Hbase Map/reduce-How to access individual columns of the table?

I have a table called User with two columns, one called visitorId and the other called friend which is a list of strings. I want to check whether the VisitorId is in the friendlist. Can anyone direct me as to how to access the table columns in a map function?
I'm not able to picture how data is output from a map function in hbase.
My code is as follows:
public class MapReduce {

    static class Mapper1 extends TableMapper<ImmutableBytesWritable, Text> {
        private int numRecords = 0;
        private static final IntWritable one = new IntWritable(1);
        private final IntWritable ONE = new IntWritable(1);
        private Text text = new Text();

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            //What should I do here??
            ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
            try {
                context.write(userKey, ONE);
                //context.write(text, ONE);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "CheckVisitor");
        job.setJarByClass(MapReduce.class);
        Scan scan = new Scan();
        Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("mId2"));
        scan.setFilter(f);
        scan.addFamily(Bytes.toBytes("visitor"));
        scan.addFamily(Bytes.toBytes("friend"));
        TableMapReduceUtil.initTableMapperJob("User", scan, Mapper1.class, ImmutableBytesWritable.class, Text.class, job);
    }
}
So the Result values instance contains the full row returned by the scanner.
To get the appropriate columns from the Result I would do something like:
VisitorIdVal = values.getColumnLatest(Bytes.toBytes(columnFamily1), Bytes.toBytes("VisitorId"))
friendlistVal = values.getColumnLatest(Bytes.toBytes(columnFamily2), Bytes.toBytes("friendlist"))
Here VisitorIdVal and friendlistVal are of type KeyValue (http://archive.cloudera.com/cdh/3/hbase/apidocs/org/apache/hadoop/hbase/KeyValue.html); to get their values out you can do Bytes.toString(VisitorIdVal.getValue()).
Once you have extracted the values from the columns you can check for "VisitorId" in "friendlist".
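(A minimal sketch of how that could look inside the question's map() method; the family/qualifier names "visitor"/"VisitorId" and "friend"/"friendlist" are assumptions based on the scan above, and the emitted values are purely illustrative:)
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
    KeyValue visitorIdVal = values.getColumnLatest(Bytes.toBytes("visitor"), Bytes.toBytes("VisitorId"));
    KeyValue friendListVal = values.getColumnLatest(Bytes.toBytes("friend"), Bytes.toBytes("friendlist"));
    if (visitorIdVal == null || friendListVal == null) {
        return; // the row is missing one of the two columns
    }

    String visitorId = Bytes.toString(visitorIdVal.getValue());
    String friendList = Bytes.toString(friendListVal.getValue());

    try {
        // Emit the row key together with whether the visitor appears in the friend list
        text.set(visitorId + (friendList.contains(visitorId) ? " IS_FRIEND" : " NOT_FRIEND"));
        context.write(row, text);
    } catch (InterruptedException e) {
        throw new IOException(e);
    }
}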