This is my input file (custs.txt):
1002|surender|23
1003|Rahja|24
And this is my program:
Main:
public class ReduceSideJoinMain {
/**
* @param args
* @throws IOException
* @throws ClassNotFoundException
* @throws InterruptedException
*/
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
JobConf config = new JobConf();
config.setQueueName("omega");
Job job = new Job(config,"word count");
job.setJarByClass(ReduceSideJoinMain.class);
Path inputFilePath1 = new Path(args[0]);
Path outputFilePath2 = new Path(args[1]);
//MultipleInputs.addInputPath(job, inputFilePath1, TextInputFormat.class,CustMapper.class);
//MultipleInputs.addInputPath(job, inputFilePath2, TextInputFormat.class,TxnsMapper.class);
FileInputFormat.addInputPath(job, inputFilePath1);
FileOutputFormat.setOutputPath(job, outputFilePath2);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(CustMapper.class);
//job.setReducerClass(ReduceJoinMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Mapper:
public class CustMapper extends Mapper<LongWritable,Text,Text,Text>
{
public static IntWritable one = new IntWritable(1);
protected void map(LongWritable key, Text value, Context context) throws java.io.IOException,java.lang.InterruptedException
{
String line = value.toString();
String arr[]= line.split("|");
context.write(new Text(arr[0]), new Text(arr[1]));
}
}
I am getting the following output, which is wrong:
1
1
I am expecting the output to be:
1002 surender
1003 Rahja
Why is it not giving the expected output? Is there an issue with the split method?
Use String arr[] = line.split("\\|");
String.split() takes a regular expression, and | is the alternation metacharacter. Unescaped, it matches the empty string between every character, so the split produces an empty first element and single-character pieces: arr[0] is "" and arr[1] is "1", which is exactly the "1" you see in the output.
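For example, a minimal sketch of the corrected map() (same record layout as custs.txt above; the escaped pattern is the only change):

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();                      // e.g. "1002|surender|23"
    String[] arr = line.split("\\|");                    // ["1002", "surender", "23"]
    context.write(new Text(arr[0]), new Text(arr[1]));   // emits: 1002  surender
}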
I am trying to use the AWS SDK v2 for Java for S3 Select operations, but I am not able to extract the final data. I am looking for an example if someone has implemented it. I got some idea from [this post][1] (Fetching specific fields from an S3 document), but I am not able to figure out how to get and read the full data.
Basically, I need the equivalent of the v1 SDK:
```
InputStream resultInputStream = result.getPayload().getRecordsInputStream(
    new SelectObjectContentEventVisitor() {
        @Override
        public void visit(SelectObjectContentEvent.StatsEvent event)
        {
            System.out.println(
                "Received Stats, Bytes Scanned: " + event.getDetails().getBytesScanned()
                + " Bytes Processed: " + event.getDetails().getBytesProcessed());
        }

        /*
         * An End Event informs that the request has finished successfully.
         */
        @Override
        public void visit(SelectObjectContentEvent.EndEvent event)
        {
            isResultComplete.set(true);
            System.out.println("Received End Event. Result is complete.");
        }
    }
);
```
In AWS SDK v2, how do I get the result stream?
```
public byte[] getQueryResults() {
    logger.info("V2 query");
    S3AsyncClient s3Client = S3AsyncClient.builder()
            .region(Region.US_WEST_2)
            .build();
    String fileObjKeyName = "upload/" + filePath;
    try {
        logger.info("Filepath: " + fileObjKeyName);
        ListObjectsV2Request listObjects = ListObjectsV2Request
                .builder()
                .bucket(Constants.bucketName)
                .build();
        ......
        InputSerialization inputSerialization = InputSerialization.builder()
                .json(JSONInput.builder().type(JSONType.LINES).build())
                .build();
        OutputSerialization outputSerialization = OutputSerialization.builder()
                .json(JSONOutput.builder().build())
                .build();
        SelectObjectContentRequest selectObjectContentRequest = SelectObjectContentRequest.builder()
                .bucket(Constants.bucketName)
                .key(partFilename)
                .expression(query)
                .expressionType(ExpressionType.SQL)
                .inputSerialization(inputSerialization)
                .outputSerialization(outputSerialization)
                .scanRange(ScanRange.builder().start(0L).end(Constants.limitBytes).build())
                .build();
        final DataHandler handler = new DataHandler();
        CompletableFuture future = s3Client.selectObjectContent(selectObjectContentRequest, handler);
        // hold it till we get an end event
        EndEvent endEvent = (EndEvent) handler.receivedEvents.stream()
                .filter(e -> e.sdkEventType() == SelectObjectContentEventStream.EventType.END)
                .findFirst()
                .orElse(null);
        // Now, from here, how do I get the response bytes?
        // ---> ISSUE: How do I get the ResultStream bytes?
        return <bytes>
    }
}
```
// handler
```
private static class DataHandler implements SelectObjectContentResponseHandler {
    private SelectObjectContentResponse response;
    private List<SelectObjectContentEventStream> receivedEvents = new ArrayList<>();
    private Throwable exception;

    @Override
    public void responseReceived(SelectObjectContentResponse response) {
        this.response = response;
    }

    @Override
    public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
        publisher.subscribe(receivedEvents::add);
    }

    @Override
    public void exceptionOccurred(Throwable throwable) {
        exception = throwable;
    }

    @Override
    public void complete() {
    }
}
```
[1]: https://stackoverflow.com/questions/67315601/fetching-specific-fields-from-an-s3-document
I came to your post since I was working on the same issue, trying to avoid v1.
After hours of searching I ended up finding the answer at https://github.com/aws/aws-sdk-java-v2/pull/2943/files
The answer is located in the SelectObjectContentIntegrationTest.java file:
services/s3/src/it/java/software/amazon/awssdk/services/SelectObjectContentIntegrationTest.java
The way to get the bytes is by using the RecordsEvent class. Please note that for my use case I used CSV; I am not sure whether this would be different for a different file type.
In the complete() method you have access to the receivedEvents. This is where you take the first index of the filtered returned results and cast it to the RecordsEvent class; that class then provides the payload as bytes:
@Override
public void complete() {
    RecordsEvent records = (RecordsEvent) this.receivedEvents.get(0);
    String result = records.payload().asUtf8String();
}
I am trying to split a string using MapReduce 2 (YARN) in the Hortonworks Sandbox.
It throws an ArrayIndexOutOfBoundsException if I try to access val[1], but works fine when I don't split the input file.
Mapper:
public class MapperClass extends Mapper<Object, Text, Text, Text> {
private Text airline_id;
private Text name;
private Text country;
private Text value1;
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String s = value.toString();
if (s.length() > 1) {
String val[] = s.split(",");
context.write(new Text("blah"), new Text(val[1]));
}
}
}
Reducer:
public class ReducerClass extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String airports = "";
if (key.equals("India")) {
for (Text val : values) {
airports += "\t" + val.toString();
}
result.set(airports);
context.write(key, result);
}
}
}
MainClass:
public class MainClass {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
@SuppressWarnings("deprecation")
Job job = new Job(conf, "Flights MR");
job.setJarByClass(MainClass.class);
job.setMapperClass(MapperClass.class);
job.setReducerClass(ReducerClass.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Can you help?
Update:
I figured out that it doesn't convert the Text to a String.
If the string you are splitting does not contain a comma, the resulting String[] will be of length 1, with the entire string at val[0].
Currently, you are only making sure that the string is not too short:
if (s.length() > 1)
But you are not checking that the split will actually result in an array of length greater than 1; you are assuming that a split happened:
context.write(new Text("blah"), new Text(val[1]));
If there was no split, this will cause an out-of-bounds error. A possible solution is to make sure that the string contains at least one comma, instead of checking its length, like so:
String s = value.toString();
if (s.indexOf(',') > -1) {
String val[] = s.split(",");
context.write(new Text("blah"), new Text(val[1]));
}
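A slightly different guard (my variation, equivalent in spirit) is to split first and check the resulting array length. This also covers a line such as "India," where a trailing comma still produces a one-element array, which the indexOf check alone would let through:

String s = value.toString();
String[] val = s.split(",");
if (val.length > 1) {
    context.write(new Text("blah"), new Text(val[1]));
}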
I have a set of records of which I need to process only the male records. In my MapReduce program I have used an if condition to filter only the male records, but the program below gives zero records as output.
Input file:
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325
Mapreduce Program:
package org.samples.mapreduce.training;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class patientrxMR_filter {
public static class MapDemohadoop extends
Mapper<LongWritable, Text, Text, IntWritable> {
// setup , map, run, cleanup
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] elements = line.split(",");
String gender =elements[3];
if ( gender == "male" ) {
Text tx = new Text(elements[2]);
int i = Integer.parseInt(elements[4]);
IntWritable it = new IntWritable(i);
context.write(tx, it);
}
}
}
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
// setup, reduce, run, cleanup
// innput - para [150,100]
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Insufficient args");
System.exit(-1);
}
Configuration conf = new Configuration();
//conf.set("fs.default.name","hdfs://localhost:50000");
conf.set("mapred.job.tracker", "hdfs://localhost:50001");
// conf.set("DrugName", args[3]);
Job job = new Job(conf, "Drug Amount Spent");
job.setJarByClass(patientrxMR_filter.class); // class conmtains mapper and
// reducer class
job.setMapOutputKeyClass(Text.class); // map output key class
job.setMapOutputValueClass(IntWritable.class);// map output value class
job.setOutputKeyClass(Text.class); // output key type in reducer
job.setOutputValueClass(IntWritable.class);// output value type in
// reducer
job.setMapperClass(MapDemohadoop.class);
job.setReducerClass(Reduce.class);
job.setNumReduceTasks(1);
job.setInputFormatClass(TextInputFormat.class); // default -- inputkey
// type -- longwritable
// : valuetype is text
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
if ( gender == "male" )
This line doesn't work for an equality check. For equality in Java, please use Object.equals(), i.e.
if ( gender.equals("male") )
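For background (standard Java semantics, not Hadoop-specific): == compares object references, and the String produced by split() is a different object from the "male" literal, so the condition is never true here. A null-safe spelling puts the literal first:

String gender = elements[3].trim();   // trim guards against stray whitespace in the input
if ("male".equals(gender)) {          // compares contents, never throws NPE
    context.write(new Text(elements[2]), new IntWritable(Integer.parseInt(elements[4])));
}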
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] elements = line.split(",");
Hadoop uses a distributed file system. In "String line = value.toString();", line is the content of a file block, keyed by its offset. In this case, the line loads the entire test file, which apparently fits into one block, instead of each line of the file as you expected.
I want to store the output of a MapReduce job in two different directories.
My code is designed to store the same output in two different directories, but it does not work as intended.
My Driver class code below
public class WordCountMain {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job myhadoopJob = new Job(conf);
myhadoopJob.setJarByClass(WordCountMain.class);
myhadoopJob.setJobName("WORD COUNT JOB");
FileInputFormat.addInputPath(myhadoopJob, new Path(args[0]));
myhadoopJob.setMapperClass(WordCountMapper.class);
myhadoopJob.setReducerClass(WordCountReducer.class);
myhadoopJob.setInputFormatClass(TextInputFormat.class);
myhadoopJob.setOutputFormatClass(TextOutputFormat.class);
myhadoopJob.setMapOutputKeyClass(Text.class);
myhadoopJob.setMapOutputValueClass(IntWritable.class);
myhadoopJob.setOutputKeyClass(Text.class);
myhadoopJob.setOutputValueClass(IntWritable.class);
MultipleOutputs.addNamedOutput(myhadoopJob, "output1", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(myhadoopJob, "output2", TextOutputFormat.class, Text.class, IntWritable.class);
FileOutputFormat.setOutputPath(myhadoopJob, new Path(args[1]));
System.exit(myhadoopJob.waitForCompletion(true) ? 0 : 1);
}
}
My Mapper Code
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String word =null;
StringTokenizer st = new StringTokenizer(line,",");
while(st.hasMoreTokens())
{
word= st.nextToken();
context.write(new Text(word), new IntWritable(1));
}
}
}
My Reducer Code is below
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
MultipleOutputs<Text, IntWritable> mout = null;

protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int count = 0;
    int num = 0;
    Iterator<IntWritable> ie = values.iterator();
    while (ie.hasNext())
    {
        num = ie.next().get();
        count = count + num;
    }
    mout.write("output1", key, new IntWritable(count));
    mout.write("output2", key, new IntWritable(count));
}

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    mout = new MultipleOutputs<Text, IntWritable>(context);
}
}
I am simply writing to the named outputs in the reduce method itself.
But when I run this MapReduce job with the command below, it does nothing: the job is not started at all, just a blank prompt that stays idle.
hadoop jar WordCountMain.jar /user/cloudera/inputfiles/words.txt /user/cloudera/outputfiles/mapreduce/multipleoutputs
Could someone explain what went wrong and how I can correct my code?
Actually, what happens is that two output files with different names are stored inside /user/cloudera/outputfiles/mapreduce/multipleoutputs,
but what I need is to store the output files in different directories.
In Pig we can do this with two STORE statements pointing at different directories.
How do I achieve the same in MapReduce?
Can you try closing the MultipleOutputs object in the cleanup() method of the Reducer?
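For example, a minimal sketch of the missing override (MultipleOutputs buffers its record writers, so without close() the named outputs may never be flushed to disk):

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    mout.close();   // flushes and closes all named-output writers
}

If you also need the files in physically separate directories, the MultipleOutputs.write(key, value, baseOutputPath) overload accepts a base path such as "output1/part", which writes under a subdirectory of the job output directory.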
I am new to JMockit. I am trying to mock multiple instances of the java.io.File type in a method. There are some places where I shouldn't mock the File object, so for that reason I am using @Injectable. It is throwing the exception below.
I don't want to mock all the instances of java.io.File; I want the instances returned from the methods to be actual Files.
Below is the test class.
/**
*
*/
package org.iis.uafdataloader.tasklet;
import static org.junit.Assert.fail;
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.util.regex.Pattern;
import mockit.Expectations;
import mockit.Injectable;
import mockit.Mocked;
import mockit.NonStrictExpectations;
import mockit.VerificationsInOrder;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.RegexFileFilter;
import org.iis.uafdataloader.tasklet.validation.FileNotFoundException;
import org.junit.Test;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.repeat.RepeatStatus;
/**
* @author K23883
*
*/
public class FileMovingTaskletTest {
private FileMovingTasklet fileMovingTasklet;
@Mocked
private StepContribution contribution;
@Mocked
private ChunkContext chunkContext;
/**
* Test method for
* {@link org.iis.uafdataloader.tasklet.FileMovingTasklet#execute(org.springframework.batch.core.StepContribution, org.springframework.batch.core.scope.context.ChunkContext)}
* .
*
* @throws Exception
*/
@Test
public void testExecuteWhenWorkingDirDoesNotExist(
        // @Mocked final File file,
        @Injectable final File sourceDirectory,
        @Injectable final File workingDirectory,
        @Injectable final File archiveDirectory,
        @Mocked final RegexFileFilter regexFileFilter,
        @Mocked final FileUtils fileUtils) throws Exception {
fileMovingTasklet = new FileMovingTasklet();
fileMovingTasklet.setSourceDirectoryPath("sourceDirectoryPath");
fileMovingTasklet.setInFileRegexPattern("inFileRegexPattern");
fileMovingTasklet.setArchiveDirectoryPath("archiveDirectoryPath");
fileMovingTasklet.setWorkingDirectoryPath("workingDirectoryPath");
final File[] sourceDirectoryFiles = new File[] {
new File("sourceDirectoryPath/ISGUAFFILE.D140728.C00"),
new File("sourceDirectoryPath/ISGUAFFILE.D140729.C00") };
final File[] workingDirectoryFiles = new File[] {
new File("workingDirectoryPath/ISGUAFFILE.D140728.C00"),
new File("workingDirectoryPath/ISGUAFFILE.D140729.C00") };
new NonStrictExpectations(){{
new File("sourceDirectoryPath");
result = sourceDirectory;
sourceDirectory.exists();
result = true;
sourceDirectory.isDirectory();
result = true;
// workingDirectory =
new File("workingDirectoryPath");
result = workingDirectory;
workingDirectory.exists();
result = false;
workingDirectory.mkdirs();
FileUtils.cleanDirectory(onInstance(workingDirectory));
FilenameFilter fileNameFilter = new RegexFileFilter(anyString,
Pattern.CASE_INSENSITIVE);
sourceDirectory.listFiles(fileNameFilter);
result = sourceDirectoryFiles;
System.out.println("sourceDirectoryFile :"
+ ((File[]) sourceDirectoryFiles).length);
// for (int i = 0; i < sourceDirectoryFiles.length; i++) {
// FileUtils.moveFileToDirectory(sourceDirectoryFiles[i],
// workingDirectory, true);
// }
// archiveDirectory =
new File("archiveDirectoryPath");
result = archiveDirectory;
workingDirectory.listFiles();
result = workingDirectoryFiles;
// for (int i = 0; i < workingDirectoryFiles.length; i++) {
// FileUtils.copyFileToDirectory(workingDirectoryFiles[i],
// archiveDirectory);
// }
}};
RepeatStatus status = fileMovingTasklet.execute(contribution,
chunkContext);
assert (status == RepeatStatus.FINISHED);
new VerificationsInOrder() {{
sourceDirectory.exists();
onInstance(sourceDirectory).isDirectory();
onInstance(workingDirectory).exists();
onInstance(workingDirectory).mkdirs();
onInstance(sourceDirectory).listFiles((FilenameFilter)any);
FileUtils.moveFileToDirectory((File)any, onInstance(workingDirectory), true);
times = 2;
FileUtils.copyFileToDirectory((File)any, onInstance(archiveDirectory));
times= 2;
}};
}
}
Below is the actual implementation method.
/*
* (non-Javadoc)
*
* @see org.springframework.batch.core.step.tasklet.Tasklet#execute(org.
* springframework.batch.core.StepContribution,
* org.springframework.batch.core.scope.context.ChunkContext)
*/
@Override
public RepeatStatus execute(StepContribution contribution,
ChunkContext chunkContext) throws Exception {
File sourceDirectory = new File(sourceDirectoryPath);
if (sourceDirectory == null || !sourceDirectory.exists()
|| !sourceDirectory.isDirectory()) {
throw new FileNotFoundException("The source directory '"
+ sourceDirectoryPath
+ "' doesn't exist or can't be read or not a directory");
}
File workingDirectory = new File(workingDirectoryPath);
if (workingDirectory != null && !workingDirectory.exists() ) {
workingDirectory.mkdirs();
}
FileUtils.cleanDirectory(workingDirectory);
FilenameFilter fileFilter = new RegexFileFilter(inFileRegexPattern,
Pattern.CASE_INSENSITIVE);
File[] sourceDirectoryFiles = sourceDirectory.listFiles(fileFilter);
System.out.println("sourceDirectoryFiles : " + sourceDirectoryFiles.length);
for (File file : sourceDirectoryFiles) {
FileUtils.moveFileToDirectory(file, workingDirectory, true);
}
File archiveDirectory = new File(archiveDirectoryPath);
for (File file : workingDirectory.listFiles()) {
FileUtils.copyFileToDirectory(file, archiveDirectory);
}
return RepeatStatus.FINISHED;
}
Below is the stack trace.
java.lang.IllegalStateException: Missing invocation to mocked type at this point; please make sure such invocations appear only after the declaration of a suitable mock field or parameter
at org.iis.uafdataloader.tasklet.FileMovingTaskletTest$1.<init>(FileMovingTaskletTest.java:75)
at org.iis.uafdataloader.tasklet.FileMovingTaskletTest.testExecuteWhenWorkingDirDoesNotExist(FileMovingTaskletTest.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Please help me solve the problem.
@Injectable gives you a single mocked instance; it won't affect other instances of the mocked type. So, when the test attempts to record new File("sourceDirectoryPath"), JMockit says "missing invocation to mocked type at this point" precisely because the File(String) constructor is not mocked.
To mock the entire File class (including its constructors) so that all instances are affected, you need to use @Mocked instead, as the following example shows:
@Test
public void mockFutureFileObjects(@Mocked File anyFile) throws Exception
{
final String srcDirPath = "sourceDir";
final String wrkDirPath = "workingDir";
new NonStrictExpectations() {{
File srcDir = new File(srcDirPath);
srcDir.exists(); result = true;
srcDir.isDirectory(); result = true;
File wrkDir = new File(wrkDirPath);
wrkDir.exists(); result = true;
}};
sut.execute(srcDirPath, wrkDirPath);
}
The JMockit Tutorial describes the same mechanism, although with a slightly different syntax.
That said, I would suggest instead writing the test with real files and directories.
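For instance, a hypothetical sketch (the test name and assertions are mine, assuming the setters shown in the test above) using JUnit 4's TemporaryFolder rule, which exercises the tasklet against a real file system with no mocking at all:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.File;

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;
import org.springframework.batch.repeat.RepeatStatus;

public class FileMovingTaskletRealFsTest {

    @Rule
    public TemporaryFolder tmp = new TemporaryFolder();

    @Test
    public void movesAndArchivesMatchingFiles() throws Exception {
        File sourceDir = tmp.newFolder("source");
        File archiveDir = tmp.newFolder("archive");
        File workingDir = new File(tmp.getRoot(), "working"); // deliberately not created yet
        new File(sourceDir, "ISGUAFFILE.D140728.C00").createNewFile();

        FileMovingTasklet tasklet = new FileMovingTasklet();
        tasklet.setSourceDirectoryPath(sourceDir.getAbsolutePath());
        tasklet.setWorkingDirectoryPath(workingDir.getAbsolutePath());
        tasklet.setArchiveDirectoryPath(archiveDir.getAbsolutePath());
        tasklet.setInFileRegexPattern("ISGUAFFILE\\..*");

        // execute() reads neither argument, so nulls are acceptable here
        RepeatStatus status = tasklet.execute(null, null);

        assertEquals(RepeatStatus.FINISHED, status);
        assertTrue(new File(workingDir, "ISGUAFFILE.D140728.C00").exists());
        assertTrue(new File(archiveDir, "ISGUAFFILE.D140728.C00").exists());
    }
}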