About GenericOptionsParser getRemainingArgs method - mapreduce

package com.ibm.dw61;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.ibm.dw61.MaxTempReducer;
import com.ibm.dw61.MaxTempMapper;
public class MaxMonthlyTemp {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] programArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (programArgs.length != 2) {
System.err.println("Usage: MaxTemp <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "Monthly Max Temp");
job.setJarByClass(MaxMonthlyTemp.class);
job.setMapperClass(MaxTempMapper.class);
job.setCombinerClass(MaxTempReducer.class);
job.setReducerClass(MaxTempReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
// Submit the job and wait for it to finish.
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Questions:
1) This is MapReduce code that extracts the maximum temperature for each month. The coder gets the non-generic options using the getRemainingArgs method. But the next line says that if the number of non-generic options is not 2, there is an error and the program aborts immediately. I couldn't figure out the coder's logic here. Would anyone be kind enough to explain?
2) In another example, WordCount, the coder didn't perform this step of getting the non-generic options. So under what circumstances do we have to perform this step and test whether the number of non-generic options is 2?

As you can see in the Hadoop API documentation, the purpose of the getRemainingArgs method is to extract the application-specific arguments, that is, those that are not related to the Hadoop framework. In this code you must specify exactly two arguments, first the input path and then the output path, as the usage message shows.
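To make the splitting step concrete, here is a minimal sketch in plain Java. It is a simplified stand-in and not Hadoop's implementation: it assumes every generic option (the commonly documented ones such as -D, -conf, -fs, -jt, -files, -libjars, -archives) is followed by exactly one value, and the class name RemainingArgsSketch is made up for this illustration. Whatever is left over is what the driver then checks against its own usage rule of two paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for the splitting step; NOT Hadoop's implementation.
public class RemainingArgsSketch {
    // Assumed list of generic options, each taking exactly one value.
    private static final Set<String> GENERIC = new HashSet<>(Arrays.asList(
            "-D", "-conf", "-fs", "-jt", "-files", "-libjars", "-archives"));

    // Returns the arguments left once generic options and their values are dropped.
    public static String[] remainingArgs(String[] args) {
        List<String> rest = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (GENERIC.contains(args[i])) {
                i++; // skip the option's value as well
            } else {
                rest.add(args[i]);
            }
        }
        return rest.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical command line: one generic option, then <in> and <out>.
        String[] cli = {"-D", "mapreduce.job.reduces=2", "/data/in", "/data/out"};
        String[] programArgs = remainingArgs(cli);
        if (programArgs.length != 2) {
            System.err.println("Usage: MaxTemp <in> <out>");
            System.exit(2);
        }
        System.out.println(Arrays.toString(programArgs));
    }
}
```

With the sample command line, only the two paths survive, so the length check passes. The check for 2 is specific to this driver, which needs one input and one output path; a driver with a different set of application arguments would test for a different count.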


Running MapReduce on an HBase exported table throws: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'

I took a backup of an HBase table using the HBase Export utility.
hbase org.apache.hadoop.hbase.mapreduce.Export "FinancialLineItem" "/project/fricadev/ESGTRF/EXPORT"
This kicked off a MapReduce job and transferred all my table data into the output folder.
According to the documentation, the output file is a sequence file.
So I ran the code below to extract the keys and values from the file.
Now I want to run MapReduce to read the key-value pairs from the output file, but I get the exception below:
java.lang.Exception: java.io.IOException: Could not find a
deserializer for the Value class:
'org.apache.hadoop.hbase.client.Result'. Please ensure that the
configuration 'io.serializations' is properly configured, if you're
using custom serialization.
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please
ensure that the configuration 'io.serializations' is properly
configured, if you're using custom serialization.
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1811)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1760)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1774)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
Here is my driver code
package SEQ;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class SeqDriver extends Configured implements Tool
{
public static void main(String[] args) throws Exception{
int exitCode = ToolRunner.run(new SeqDriver(), args);
System.exit(exitCode);
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s needs two arguments files\n",
getClass().getSimpleName());
return -1;
}
String outputPath = args[1];
FileSystem hfs = FileSystem.get(getConf());
Job job = new Job();
job.setJarByClass(SeqDriver.class);
job.setJobName("SequenceFileReader");
HDFSUtil.removeHdfsSubDirIfExists(hfs, new Path(outputPath), true);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Result.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setMapperClass(MySeqMapper.class);
job.setNumReduceTasks(0);
int returnValue = job.waitForCompletion(true) ? 0:1;
if(job.isSuccessful()) {
System.out.println("Job was successful");
} else if(!job.isSuccessful()) {
System.out.println("Job was not successful");
}
return returnValue;
}
}
Here is my mapper code
package SEQ;
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MySeqMapper extends Mapper <ImmutableBytesWritable, Result, Text, Text>{
@Override
public void map(ImmutableBytesWritable row, Result value,Context context)
throws IOException, InterruptedException {
}
}
So I will answer my own question.
Here is what was needed to make it work.
Because we use HBase to store our data and this reducer outputs its result to an HBase table, Hadoop is telling us that it doesn't know how to serialize our data, so we need to help it. Inside setUp, set the io.serializations variable:
hbaseConf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});
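The property holds a comma-separated list of serialization class names, so the new entries have to be added to whatever is already there. Below is a small sketch of building that value in plain Java; the helper SerializationsValue is hypothetical (not part of Hadoop or HBase), and the fully qualified class names are written as strings so the example runs without HBase on the classpath. The resulting string would then be set on the Configuration that the job is actually created from; note that the driver in the question calls `new Job()` without passing one in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the comma-separated value for "io.serializations".
public class SerializationsValue {
    // Appends each class name to the existing list, skipping blanks and duplicates.
    // A null "existing" value (property not set) is treated as an empty list.
    public static String append(String existing, String... classNames) {
        List<String> names = new ArrayList<>();
        if (existing != null) {
            for (String name : existing.split(",")) {
                if (!name.trim().isEmpty()) {
                    names.add(name.trim());
                }
            }
        }
        for (String name : classNames) {
            if (!names.contains(name)) {
                names.add(name);
            }
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        // Example starting value; the real one comes from conf.get("io.serializations").
        String current = "org.apache.hadoop.io.serializer.WritableSerialization";
        System.out.println(append(current,
                "org.apache.hadoop.hbase.mapreduce.MutationSerialization",
                "org.apache.hadoop.hbase.mapreduce.ResultSerialization"));
    }
}
```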

How to checkout and checkin any document outside alfresco using rest API?

I have created a web application using Servlets and JSP, and through it I have connected to an Alfresco repository. I am also able to upload documents to Alfresco and view them in the external web application.
Now my requirement is to provide check-in and check-out options for those documents.
I found the REST APIs below for this purpose, but I don't understand how to use them from servlets to fulfill my requirement.
POST /alfresco/service/slingshot/doclib/action/cancel-checkout/site/{site}/{container}/{path}
POST /alfresco/service/slingshot/doclib/action/cancel-checkout/node/{store_type}/{store_id}/{id}
Can anyone please provide the simple steps or some piece of code to do this task?
Thanks in advance.
Please do not use the internal slingshot URLs for this. Instead, use OpenCMIS from Apache Chemistry. It will save you a lot of time and headaches and it is more portable to other repositories besides Alfresco.
The example below grabs an existing document by path, performs a checkout, then checks in a new major version of the plain text document.
package com.someco.cmis.examples;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.chemistry.opencmis.client.api.Document;
import org.apache.chemistry.opencmis.client.api.ObjectId;
import org.apache.chemistry.opencmis.client.api.Repository;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.data.ContentStream;
import org.apache.chemistry.opencmis.commons.enums.BindingType;
public class CheckoutCheckinExample {
private String serviceUrl = "http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.1/atom"; // Atom Pub binding URL
private Session session = null;
public static void main(String[] args) {
CheckoutCheckinExample cce = new CheckoutCheckinExample();
cce.doExample();
}
public void doExample() {
Document doc = (Document) getSession().getObjectByPath("/test/test-plain-1.txt");
String fileName = doc.getName();
ObjectId pwcId = doc.checkOut(); // Checkout the document
Document pwc = (Document) getSession().getObject(pwcId); // Get the working copy
// Set up an updated content stream
String docText = "This is a new major version.";
byte[] content = docText.getBytes();
InputStream stream = new ByteArrayInputStream(content);
ContentStream contentStream = session.getObjectFactory().createContentStream(fileName, Long.valueOf(content.length), "text/plain", stream);
// Check in the working copy as a major version with a comment
ObjectId updatedId = pwc.checkIn(true, null, contentStream, "My new version comment");
doc = (Document) getSession().getObject(updatedId);
System.out.println("Doc is now version: " + doc.getProperty("cmis:versionLabel").getValueAsString());
}
public Session getSession() {
if (session == null) {
// default factory implementation
SessionFactory factory = SessionFactoryImpl.newInstance();
Map<String, String> parameter = new HashMap<String, String>();
// user credentials
parameter.put(SessionParameter.USER, "admin"); // <-- Replace
parameter.put(SessionParameter.PASSWORD, "admin"); // <-- Replace
// connection settings
parameter.put(SessionParameter.ATOMPUB_URL, this.serviceUrl); // Atom Pub binding URL
parameter.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value()); // Use the Atom Pub binding
List<Repository> repositories = factory.getRepositories(parameter);
this.session = repositories.get(0).createSession();
}
return this.session;
}
}
Note that on the version of Alfresco I tested with (5.1.e), the document must already have the versionable aspect applied for the version label to be incremented; otherwise the check-in will simply overwrite the original.

How to write Elastic unit tests to test query building

I want to write unit tests that test the Elastic query building. I want to test that certain param values produce certain queries.
I started looking into ESTestCase. I see that you can mock a client using ESTestCase. I don't really need to mock the ES node, I just need to reproduce the query building part, but that requires the client.
Has anybody dealt with such an issue?
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.DistanceUnit;
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.ESTestCase;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Ignore;
import org.junit.Test;
import com.google.common.collect.Lists;
public class SearchRequestBuilderTests extends ESTestCase {
private static Client client;
@BeforeClass
public static void initClient() {
//this client will not be hit by any request, but it needs to be a non null proper client
//that is why we create it but we don't add any transport address to it
Settings settings = Settings.builder()
.put("", createTempDir().toString())
.build();
client = TransportClient.builder().settings(settings).build();
}
@AfterClass
public static void closeClient() {
client.close();
client = null;
}
public static Map<String, String> createSampleSearchParams() {
Map<String, String> searchParams = new HashMap<>();
searchParams.put(SenseneConstants.ADC_PARAM, "US");
searchParams.put(SenseneConstants.FETCH_SIZE_QUERY_PARAM, "10");
searchParams.put(SenseneConstants.QUERY_PARAM, "some query");
searchParams.put(SenseneConstants.LOCATION_QUERY_PARAM, "");
searchParams.put(SenseneConstants.RADIUS_QUERY_PARAM, "20");
searchParams.put(SenseneConstants.DISTANCE_UNIT_PARAM, DistanceUnit.MILES.name());
searchParams.put(SenseneConstants.GEO_DISTANCE_PARAM, "true");
return searchParams;
}
@Test
public void test() {
BasicSearcher searcher = new BasicSearcher(client); // this is my application's searcher
Map<String, String> searchParams = createSampleSearchParams();
ArrayList<String> filterQueries = Lists.newArrayList();
SearchRequest searchRequest = SearchRequest.create(searchParams, filterQueries);
MySearchRequestBuilder medleyReqBuilder = new MySearchRequestBuilder.Builder(client, "my_index", searchRequest).build();
SearchRequestBuilder searchRequestBuilder = medleyReqBuilder.constructSearchRequestBuilder();
System.out.print(searchRequestBuilder.toString());
// Here I want to assert that the search request builder output is what it should be for the above client params
}
}
I get this, and nothing in the code runs:
Assertions mismatch: -ea was not specified but -Dtests.asserts=true
REPRODUCE WITH: mvn test -Pdev -Dtests.seed=5F09BEDD71BBD14E -Dtests.class=SearchRequestBuilderTests -Dtests.locale=en_US -Dtests.timezone=America/Los_Angeles
NOTE: test params are: codec=null, sim=null, locale=null, timezone=(null)
NOTE: Mac OS X 10.10.5 x86_64/Oracle Corporation 1.7.0_80 (64-bit)/cpus=4,threads=1,free=122894936,total=128974848
NOTE: All tests run in this JVM: [SearchRequestBuilderTests]
Obviously a bit late, but...
So this actually has nothing to do with the ES testing framework, but rather with your run settings. Assuming you are running this in Eclipse, this is actually a duplicate of "Assertions mismatch: -ea was not specified but -Dtests.asserts=true".
Either: Eclipse Preferences -> JUnit -> enable the "Add -ea" checkbox.
Or: right-click the Eclipse project -> Run As -> Run Configurations -> Arguments tab -> add the -ea option to the VM arguments.
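The error message says the test run expects assertions to be on while the JVM was started without -ea. A quick way to confirm what a given run configuration actually does is the small check below. It is a generic Java idiom, not something from the ES test framework, and the class name AssertionCheck is made up for this sketch.

```java
// Reports whether the JVM running this class has assertions enabled (-ea).
public class AssertionCheck {
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert only executes when assertions are on.
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
    }
}
```

Running it once with `java AssertionCheck` and once with `java -ea AssertionCheck` shows the difference that the -ea VM argument makes.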

How to force an Apache Mahout application read directly from the HDFS

I have implemented an Apache Mahout application (attached below) which does some basic computations. To do so, it has to load the dataset from my local machine. The application comes in the form of a jar file, which is executed within a Hadoop pseudo-distributed cluster. The terminal command for that is: $ hadoop jar /home/eualin/ApacheMahout/tdunning-MiA-5b8956f/target/mia-0.1-jar-with-dependencies.jar mia.recommender.ch03.IREvaluatorBooleanPrefIntro2 "/home/eualin/Desktop/links-final"
Now, my question is how to do the same, but this time reading the dataset from HDFS (we, of course, suppose that the dataset is already stored in HDFS, e.g. in /user/eualin/output/links-final). What should change in that case? This might help: hdfs://localhost:50010/user/eualin/output/links-final
package mia.recommender.ch03;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.DataModelBuilder;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import java.io.File;
public class IREvaluatorBooleanPrefIntro2 {
private IREvaluatorBooleanPrefIntro2() {
}
public static void main(String[] args) throws Exception {
if (args.length != 1) {
System.out.println("give file's HDFS path");
System.exit(1);
}
DataModel model = new GenericBooleanPrefDataModel(
GenericBooleanPrefDataModel.toDataMap(
new GenericBooleanPrefDataModel(new FileDataModel(new File(args[0])))));
RecommenderIRStatsEvaluator evaluator =
new GenericRecommenderIRStatsEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
@Override
public Recommender buildRecommender(DataModel model) throws TasteException {
UserSimilarity similarity = new LogLikelihoodSimilarity(model);
UserNeighborhood neighborhood =
new NearestNUserNeighborhood(10, similarity, model);
return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
}
};
DataModelBuilder modelBuilder = new DataModelBuilder() {
@Override
public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
return new GenericBooleanPrefDataModel(
GenericBooleanPrefDataModel.toDataMap(trainingData));
}
};
IRStatistics stats = evaluator.evaluate(
recommenderBuilder, modelBuilder, model, null, 10,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
}
}
You can't, directly, since the non-distributed code has no knowledge of HDFS. Instead, copy the file to a local location in setup() and then read it from a local file.
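As a rough sketch of the copy-to-local step, the helper below stages any InputStream as a local temp file, which can then be handed to `new FileDataModel(file)`. It is written against a plain stream so that it runs without Hadoop; in the real application the stream would come from something like `FileSystem.open(new Path(...))`, or the whole step could be replaced by `FileSystem.copyToLocalFile(src, dst)`. The class name LocalStaging, the method name, and the ".csv" suffix are assumptions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: stages a remote stream as a local temp file.
public class LocalStaging {
    // Copies the stream into a new temp file and returns it as a java.io.File.
    public static File stageLocally(InputStream in, String prefix) {
        try {
            Path local = Files.createTempFile(prefix, ".csv");
            Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
            File file = local.toFile();
            file.deleteOnExit(); // clean up when the JVM exits
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HDFS stream; the sample rows are made-up user,item pairs.
        byte[] sample = "1,10\n1,11\n2,10\n".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            File local = stageLocally(in, "links-final");
            System.out.println("staged " + local.length() + " bytes at " + local);
        }
    }
}
```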

Akka scheduled job questions

I have been experimenting with Play 2.0 and using Akka for a recurring scheduled job. I would like the job to run every 5 minutes. I have this really basic test and it works for the most part. Based on this test it should create a PDF file every 5 minutes. What happens is I get 4 files written every 5 minutes and sometimes more. I am not exactly sure why. Below is my code.
package models;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.*;
import javax.persistence.*;
import play.libs.*;
import play.db.ebean.*;
import akka.util.*;
import static java.util.concurrent.TimeUnit.*;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
@Entity
public class EmailService extends Model {
public EmailService() {
// Run the Service every 5 minutes
Akka.system().scheduler().schedule(
Duration.create(0, MILLISECONDS),
Duration.create(5, MINUTES),
new Runnable() {
public void run() {
try {
// TEST
com.itextpdf.text.Document document = new com.itextpdf.text.Document();
PdfWriter.getInstance(document, new FileOutputStream(UUID.randomUUID().toString() + ".pdf"));
document.open();
document.add(new Paragraph("Hello World!"));
document.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
);
}
}
Ideas why it runs multiple times?