Running MapReduce on an HBase exported table throws "Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'" - mapreduce

I have taken a backup of an HBase table using the HBase Export utility tool:
hbase org.apache.hadoop.hbase.mapreduce.Export "FinancialLineItem" "/project/fricadev/ESGTRF/EXPORT"
This kicked off a MapReduce job and transferred all my table data into the output folder.
As per the documentation, the format of the output file is a sequence file.
So I wrote the code below to extract my key and value from the file.
Now I want to run a MapReduce job to read the key and value from the output file, but I get the exception below:
java.lang.Exception: java.io.IOException: Could not find a
deserializer for the Value class:
'org.apache.hadoop.hbase.client.Result'. Please ensure that the
configuration 'io.serializations' is properly configured, if you're
using custom serialization.
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please
ensure that the configuration 'io.serializations' is properly
configured, if you're using custom serialization.
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1811)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1760)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1774)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
Here is my driver code
package SEQ;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class SeqDriver extends Configured implements Tool
{
public static void main(String[] args) throws Exception{
int exitCode = ToolRunner.run(new SeqDriver(), args);
System.exit(exitCode);
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s needs two arguments files\n",
getClass().getSimpleName());
return -1;
}
String outputPath = args[1];
FileSystem hfs = FileSystem.get(getConf());
Job job = new Job();
job.setJarByClass(SeqDriver.class);
job.setJobName("SequenceFileReader");
HDFSUtil.removeHdfsSubDirIfExists(hfs, new Path(outputPath), true);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Result.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setMapperClass(MySeqMapper.class);
job.setNumReduceTasks(0);
int returnValue = job.waitForCompletion(true) ? 0:1;
if(job.isSuccessful()) {
System.out.println("Job was successful");
} else if(!job.isSuccessful()) {
System.out.println("Job was not successful");
}
return returnValue;
}
}
Here is my mapper code
package SEQ;
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MySeqMapper extends Mapper <ImmutableBytesWritable, Result, Text, Text>{
@Override
public void map(ImmutableBytesWritable row, Result value,Context context)
throws IOException, InterruptedException {
}
}

So I will answer my own question.
Here is what was needed to make it work.
Because we use HBase to store our data and this job reads HBase Result objects from the exported sequence file, Hadoop is telling us that it doesn't know how to deserialize our data. That is why we need to help it. Inside setUp, set the io.serializations variable:
hbaseConf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});
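The answer above sets the properties on an HBase Configuration object. As a minimal sketch (an assumption on my part, not part of the original answer), the same registration can be done on the job configuration in the driver's run() method before the job is submitted, using the serialization classes from org.apache.hadoop.hbase.mapreduce:
// Sketch: register the HBase serializations on the job configuration before
// submission so SequenceFileInputFormat can deserialize the
// ImmutableBytesWritable/Result pairs written by the Export tool.
Configuration conf = getConf();
conf.setStrings("io.serializations",
        conf.get("io.serializations"),
        org.apache.hadoop.hbase.mapreduce.MutationSerialization.class.getName(),
        org.apache.hadoop.hbase.mapreduce.ResultSerialization.class.getName());
Job job = Job.getInstance(conf, "SequenceFileReader");
job.setJarByClass(SeqDriver.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setMapperClass(MySeqMapper.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Result.class);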

Related

Unable to read my config text file (column names) from GCS in Dataflow

I have a source CSV file (without a header) as well as a header config CSV file (containing only column names) in GCS. I also have a static table in BigQuery. I want to load the source file into the static table using the column header mapping (config file).
I tried a different approach earlier: I maintained a source file that contained the header and the data in the same file, then tried to split the header off the source file and insert the data into BigQuery using the header column mapping. I noticed this approach is NOT possible because Dataflow shuffles data across multiple worker nodes, so I dropped it.
In the code below I have used hard-coded column names. I am looking for an approach to read the column names from an external config file (I want to make my code dynamic).
package com.coe.cog;
import java.io.BufferedReader;
import java.util.*;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
public class SampleTest {
private static final Logger LOG = LoggerFactory.getLogger(SampleTest.class);
public static TableReference getGCDSTableReference() {
TableReference ref = new TableReference();
ref.setProjectId("myownproject");
ref.setDatasetId("DS_Employee");
ref.setTableId("tLoad14");
return ref;
}
static class TransformToTable extends DoFn<String, TableRow> {
@ProcessElement
public void processElement(ProcessContext c) {
String csvSplitBy = ",";
String lineHeader = "ID,NAME,AGE,SEX"; // Hard-coded column names, but I want to read this header from a GCS file.
String[] colmnsHeader = lineHeader.split(csvSplitBy); //Only Header array
String[] split = c.element().split(csvSplitBy); //Data section
TableRow row = new TableRow();
for (int i = 0; i < split.length; i++) {
row.set(colmnsHeader[i], split[i]);
}
c.output(row);
}
}
public interface MyOptions extends PipelineOptions {
/*
* Param
*
*/
}
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
options.setTempLocation("gs://demo-bucket-data/temp");
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("Read From Storage", TextIO.read().from("gs://demo-bucket-data/Demo/Test/SourceFile_WithOutHeader.csv"));
PCollection<TableRow> rows = lines.apply("Transform To Table",ParDo.of(new TransformToTable()));
rows.apply("Write To Table",BigQueryIO.writeTableRows().to(getGCDSTableReference())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
p.run();
}
}
Source File:
1,John,25,M
2,Smith,30,M
3,Josephine,20,F
Config File (Headers only):
ID,NAME,AGE,SEX
You have a couple of options:
Use a Dataflow/Beam side input to read the config/header file into some sort of collection, e.g. an ArrayList. It will be available to all workers in the cluster. You can then use the side input to dynamically assign the schema to the BigQuery table using DynamicDestinations (a sketch of the side input approach follows below).
Before dropping into your Dataflow pipeline, call the GCS API directly to grab your config/header file, parse it, and then use the results to set up your pipeline.
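For the first option, here is a rough sketch of reading the header as a singleton side input. The header file path is illustrative, it assumes the config file is a single comma-separated line, and it needs org.apache.beam.sdk.transforms.View and org.apache.beam.sdk.values.PCollectionView in addition to the imports already in the question:
// Read the single-line header file and turn it into a singleton side input.
PCollectionView<String> headerView = p
    .apply("Read Header", TextIO.read().from("gs://demo-bucket-data/Demo/Test/config_header.csv"))
    .apply("As Singleton", View.asSingleton());
PCollection<TableRow> rows = lines.apply("Transform To Table",
    ParDo.of(new DoFn<String, TableRow>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // The side input is available on every worker.
        String[] header = c.sideInput(headerView).split(",");
        String[] values = c.element().split(",");
        TableRow row = new TableRow();
        for (int i = 0; i < header.length && i < values.length; i++) {
          row.set(header[i], values[i]);
        }
        c.output(row);
      }
    }).withSideInputs(headerView));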
Using Beam's FileSystems API to read config files from GCS is another approach.
Advantages:
No additional dependencies are needed; it's included in the Beam API.
Using GCP's client libraries instead can lead to dependency version issues.
We can use Beam's FileSystems API inside any transform.
Here is a snippet for reading files.
//filePath format: gs://bucket/file
public static String loadSchema(String filePath) {
MatchResult.Metadata metadata;
try {
metadata = FileSystems.matchSingleFileSpec(filePath); // searching
} catch (IOException e) {
throw new RuntimeException(e);
}
String schema;
try {
// reading file
schema = CharStreams.toString(
Channels.newReader(
FileSystems.open(metadata.resourceId()),
StandardCharsets.UTF_8.name()
)
);
} catch (IOException e) {
throw new RuntimeException(e);
}
// returning content as string. We can process it now.
return schema;
}
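As a usage sketch (the path is illustrative and it assumes the config file is a single comma-separated line), the returned string can replace the hard-coded lineHeader in the DoFn above. Note that loadSchema itself relies on org.apache.beam.sdk.io.FileSystems, org.apache.beam.sdk.io.fs.MatchResult, com.google.common.io.CharStreams, java.nio.channels.Channels and java.nio.charset.StandardCharsets.
// Build the header array once, e.g. before constructing the pipeline,
// and pass it to the DoFn (for example via its constructor).
String[] columnsHeader = loadSchema("gs://demo-bucket-data/Demo/Test/config_header.csv")
        .trim()
        .split(",");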
Disadvantages of a side input:
The order of the file's lines is not preserved (the file's orientation changes).
It's hard to parse a multi-line file such as JSON.
A side input works well for single-line, static values.

About the GenericOptionsParser getRemainingArgs method

package com.ibm.dw61;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import com.ibm.dw61.MaxTempReducer;
import com.ibm.dw61.MaxTempMapper;
public class MaxMonthlyTemp {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] programArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (programArgs.length != 2) {
System.err.println("Usage: MaxTemp <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "Monthly Max Temp");
job.setJarByClass(MaxMonthlyTemp.class);
job.setMapperClass(MaxTempMapper.class);
job.setCombinerClass(MaxTempReducer.class);
job.setReducerClass(MaxTempReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
// Submit the job and wait for it to finish.
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Questions :
1) This is MapReduce code to extract the maximum temperature for each month. The coder is getting the non-generic options using the getRemainingArgs method. The next line says that if the number of non-generic options is not 2, there is an error and the program immediately aborts. I couldn't figure out the coder's logic here. Would anyone be kind enough to explain?
2) In another example, WordCount, the coder didn't perform this step of getting the non-generic options. So under what circumstances do we have to perform this step and test whether the number of non-generic options is 2?
As you can see in the Hadoop API documentation, the purpose of the getRemainingArgs method is to extract the application-specific arguments, i.e. those that are not related to the Hadoop framework. In this code you are expected to pass exactly two such arguments, first your input and then your output, as you can see in the usage message.
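As a hypothetical illustration (the jar name, the paths and the -D property are made up, not taken from the question):
// Hypothetical invocation:
//   hadoop jar maxtemp.jar com.ibm.dw61.MaxMonthlyTemp -D mapreduce.job.reduces=2 /data/in /data/out
// GenericOptionsParser applies the generic "-D mapreduce.job.reduces=2" option to the
// Configuration object; getRemainingArgs() returns only what is left over:
String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
// programArgs[0] == "/data/in"   (input path)
// programArgs[1] == "/data/out"  (output path)
// If the user passes anything other than exactly these two paths,
// programArgs.length != 2, so the program prints the usage message and exits.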

How to write Elastic unit tests to test query building

I want to write unit tests that test the Elastic query building. I want to test that certain param values produce certain queries.
I started looking into ESTestCase. I see that you can mock a client using ESTestCase. I don't really need to mock the ES node, I just need to reproduce the query building part, but that requires the client.
Has anybody dealt with such an issue?
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.DistanceUnit;
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.ESTestCase;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Ignore;
import org.junit.Test;
import com.google.common.collect.Lists;
public class SearchRequestBuilderTests extends ESTestCase {
private static Client client;
@BeforeClass
public static void initClient() {
//this client will not be hit by any request, but it needs to be a non null proper client
//that is why we create it but we don't add any transport address to it
Settings settings = Settings.builder()
.put("", createTempDir().toString())
.build();
client = TransportClient.builder().settings(settings).build();
}
@AfterClass
public static void closeClient() {
client.close();
client = null;
}
public static Map<String, String> createSampleSearchParams() {
Map<String, String> searchParams = new HashMap<>();
searchParams.put(SenseneConstants.ADC_PARAM, "US");
searchParams.put(SenseneConstants.FETCH_SIZE_QUERY_PARAM, "10");
searchParams.put(SenseneConstants.QUERY_PARAM, "some query");
searchParams.put(SenseneConstants.LOCATION_QUERY_PARAM, "");
searchParams.put(SenseneConstants.RADIUS_QUERY_PARAM, "20");
searchParams.put(SenseneConstants.DISTANCE_UNIT_PARAM, DistanceUnit.MILES.name());
searchParams.put(SenseneConstants.GEO_DISTANCE_PARAM, "true");
return searchParams;
}
@Test
public void test() {
BasicSearcher searcher = new BasicSearcher(client); // this is my application's searcher
Map<String, String> searchParams = createSampleSearchParams();
ArrayList<String> filterQueries = Lists.newArrayList();
SearchRequest searchRequest = SearchRequest.create(searchParams, filterQueries);
MySearchRequestBuilder medleyReqBuilder = new MySearchRequestBuilder.Builder(client, "my_index", searchRequest).build();
SearchRequestBuilder searchRequestBuilder = medleyReqBuilder.constructSearchRequestBuilder();
System.out.print(searchRequestBuilder.toString());
// Here I want to assert that the search request builder output is what it should be for the above client params
}
}
I get this, and nothing in the code runs:
Assertions mismatch: -ea was not specified but -Dtests.asserts=true
REPRODUCE WITH: mvn test -Pdev -Dtests.seed=5F09BEDD71BBD14E -Dtests.class=SearchRequestBuilderTests -Dtests.locale=en_US -Dtests.timezone=America/Los_Angeles
NOTE: test params are: codec=null, sim=null, locale=null, timezone=(null)
NOTE: Mac OS X 10.10.5 x86_64/Oracle Corporation 1.7.0_80 (64-bit)/cpus=4,threads=1,free=122894936,total=128974848
NOTE: All tests run in this JVM: [SearchRequestBuilderTests]
Obviously a bit late but...
So this actually has nothing to do with the ES testing framework but rather with your run settings. Assuming you are running this in Eclipse, this is actually a duplicate of Assertions mismatch: -ea was not specified but -Dtests.asserts=true.
Eclipse Preferences -> JUnit -> enable the "Add '-ea' to VM arguments" checkbox.
Or: right-click the Eclipse project -> Run As -> Run Configurations -> Arguments tab -> add the -ea option to the VM arguments.

How to force an Apache Mahout application to read directly from HDFS

I have implemented an Apache Mahout application (attached below) which does some basic computations. To do so, it needs to load the dataset from my local machine. The application comes in the form of a jar file, which is then executed within a Hadoop pseudo-distributed cluster. The terminal command for that is: $ hadoop jar /home/eualin/ApacheMahout/tdunning-MiA-5b8956f/target/mia-0.1-jar-with-dependencies.jar mia.recommender.ch03.IREvaluatorBooleanPrefIntro2 "/home/eualin/Desktop/links-final"
Now, my question is how to do the same, but this time reading the dataset from HDFS (we suppose, of course, that the dataset is already stored in HDFS, e.g. in /user/eualin/output/links-final). What should change in that case? This might help: hdfs://localhost:50010/user/eualin/output/links-final
package mia.recommender.ch03;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.DataModelBuilder;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import java.io.File;
public class IREvaluatorBooleanPrefIntro2 {
private IREvaluatorBooleanPrefIntro2() {
}
public static void main(String[] args) throws Exception {
if (args.length != 1) {
System.out.println("give file's HDFS path");
System.exit(1);
}
DataModel model = new GenericBooleanPrefDataModel(
GenericBooleanPrefDataModel.toDataMap(
new GenericBooleanPrefDataModel(new FileDataModel(new File(args[0])))));
RecommenderIRStatsEvaluator evaluator =
new GenericRecommenderIRStatsEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
@Override
public Recommender buildRecommender(DataModel model) throws TasteException {
UserSimilarity similarity = new LogLikelihoodSimilarity(model);
UserNeighborhood neighborhood =
new NearestNUserNeighborhood(10, similarity, model);
return new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
}
};
DataModelBuilder modelBuilder = new DataModelBuilder() {
@Override
public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
return new GenericBooleanPrefDataModel(
GenericBooleanPrefDataModel.toDataMap(trainingData));
}
};
IRStatistics stats = evaluator.evaluate(
recommenderBuilder, modelBuilder, model, null, 10,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
}
}
You can't, directly, since the non-distributed code has no knowledge of HDFS. Instead, copy the file to a local location in setup() and then read it from a local file.
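A minimal sketch of that workaround, using only classes already imported in the question (the local target path is illustrative, and it assumes the Hadoop configuration on the classpath points at the pseudo-distributed cluster):
// Copy the dataset from HDFS to the local file system, then hand the local
// copy to FileDataModel exactly as before.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path hdfsPath = new Path(args[0]);              // e.g. /user/eualin/output/links-final
Path localPath = new Path("/tmp/links-final");  // illustrative local target
fs.copyToLocalFile(false, hdfsPath, localPath);
DataModel model = new GenericBooleanPrefDataModel(
    GenericBooleanPrefDataModel.toDataMap(
        new GenericBooleanPrefDataModel(
            new FileDataModel(new File(localPath.toUri().getPath())))));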

Testing Solr via Embedded Server

I'm coding some tests for my solr-indexer application. Following testing best practices, I want the code to be self-contained, just loading schema.xml and solrconfig.xml and creating a temporary data tree for the indexing-searching tests.
As the application is mostly written in Java, I'm dealing with the SolrJ library, but I'm running into problems (well, I'm lost in the universe of CoreContainer-CoreDescriptor-CoreConfig-SolrCore ...).
Can anyone post some code here to create an embedded server that loads the config and also writes to a parameter-passed data dir?
You can start with SolrExampleTests, which extends SolrExampleTestBase, which extends AbstractSolrTestCase.
Also see this SampleTest.
Also take a look at this thread and this one.
This is an example for a simple test case. solr is the directory that contains your solr configuration files:
import java.io.IOException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.util.AbstractSolrTestCase;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;
public class SolrSearchConfigTest extends AbstractSolrTestCase {
private SolrServer server;
@Override
public String getSchemaFile() {
return "solr/conf/schema.xml";
}
@Override
public String getSolrConfigFile() {
return "solr/conf/solrconfig.xml";
}
@Before
@Override
public void setUp() throws Exception {
super.setUp();
server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());
}
@Test
public void testThatNoResultsAreReturned() throws SolrServerException {
SolrParams params = new SolrQuery("text that is not found");
QueryResponse response = server.query(params);
assertEquals(0L, response.getResults().getNumFound());
}
@Test
public void testThatDocumentIsFound() throws SolrServerException, IOException {
SolrInputDocument document = new SolrInputDocument();
document.addField("id", "1");
document.addField("name", "my name");
server.add(document);
server.commit();
SolrParams params = new SolrQuery("name");
QueryResponse response = server.query(params);
assertEquals(1L, response.getResults().getNumFound());
assertEquals("1", response.getResults().get(0).get("id"));
}
}
See this blog post for more info: Solr Integration Tests
First you need to set your Solr home directory, which contains solr.xml and a conf folder containing solrconfig.xml, schema.xml, etc.
After that you can use this simple and basic code for Solrj.
File solrHome = new File("Your/Solr/Home/Dir/");
File configFile = new File(solrHome, "solr.xml");
CoreContainer coreContainer = new CoreContainer(solrHome.toString(), configFile);
SolrServer solrServer = new EmbeddedSolrServer(coreContainer, "Your-Core-Name-in-solr.xml");
SolrQuery query = new SolrQuery("Your Solr Query");
QueryResponse rsp = solrServer.query(query);
SolrDocumentList docs = rsp.getResults();
Iterator<SolrDocument> i = docs.iterator();
while (i.hasNext()) {
System.out.println(i.next().toString());
}
I hope this helps.