Load MapReduce output data into HBase

The last few days I've been experimenting with Hadoop. I'm running Hadoop in pseudo-distributed mode on Ubuntu 12.10 and successfully executed some standard MapReduce jobs.
Next I wanted to start experimenting with HBase. I installed HBase and played a bit in the shell. That all went fine, so I wanted to experiment with HBase through a simple Java program: take the output of one of the previous MapReduce jobs and load it into an HBase table. I've written a Mapper that should produce HFileOutputFormat files that can easily be read into an HBase table.
Now, whenever I run the program (using hadoop jar [compiled jar]), I get a ClassNotFoundException. The program seems unable to resolve com.google.common.primitives.Longs. Of course, I thought it was just a missing dependency, but the JAR (Google's Guava) is there.
I've tried a lot of different things but can't seem to find a solution.
I attached the exception that occurs and the most important classes. I would truly appreciate it if someone could help me out or give me some advice on where to look.
Kind regards,
Pieterjan
ERROR
12/12/13 09:02:54 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/13 09:03:00 INFO mapred.JobClient: Running job: job_201212130304_0020
12/12/13 09:03:01 INFO mapred.JobClient: map 0% reduce 0%
12/12/13 09:04:07 INFO mapred.JobClient: map 100% reduce 0%
12/12/13 09:04:51 INFO mapred.JobClient: Task Id : attempt_201212130304_0020_r_000000_0,Status : FAILED
Error: java.lang.ClassNotFoundException: com.google.common.primitives.Longs
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1554)
at org.apache.hadoop.hbase.KeyValue$KVComparator.compare(KeyValue.java:1536)
at java.util.TreeMap.compare(TreeMap.java:1188)
at java.util.TreeMap.put(TreeMap.java:531)
at java.util.TreeSet.add(TreeSet.java:255)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:63)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:40)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
JAVA
Mapper:
public class TestHBaseMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //Tab delimiter \t, white space delimiter: \\s+
        String[] s = value.toString().split("\t");
        Put put = new Put(s[0].getBytes());
        put.add("amount".getBytes(), "value".getBytes(), value.getBytes());
        context.write(new ImmutableBytesWritable(Bytes.toBytes(s[0])), put);
    }
}
Job:
public class TestHBaseRun extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        try {
            Configuration configuration = getConf();
            Job hbasejob = new Job(configuration);
            hbasejob.setJobName("TestHBaseJob");
            hbasejob.setJarByClass(TestHBaseRun.class);
            //Specifies the InputFormat and the path.
            hbasejob.setInputFormatClass(TextInputFormat.class);
            TextInputFormat.setInputPaths(hbasejob, new Path("/hadoopdir/user/data/output/test/"));
            //Set Mapper, MapperOutputKey and MapperOutputValue classes.
            hbasejob.setMapperClass(TestHBaseMapper.class);
            hbasejob.setMapOutputKeyClass(ImmutableBytesWritable.class);
            hbasejob.setMapOutputValueClass(Put.class);
            //Specifies the OutputFormat and the path. If the path exists it's reinitialized.
            //In this case HFiles, that can be imported into HBase, are produced.
            hbasejob.setOutputFormatClass(HFileOutputFormat.class);
            FileSystem fs = FileSystem.get(configuration);
            Path outputpath = new Path("/hadoopdir/user/data/hbase/table/");
            fs.delete(outputpath, true);
            HFileOutputFormat.setOutputPath(hbasejob, outputpath);
            //Check if the table exists in HBase and create it if necessary.
            HBaseUtil util = new HBaseUtil(configuration);
            if (!util.exists("test")) {
                util.createTable("test", new String[]{"amount"});
            }
            //Reads the existing (or thus newly created) table.
            Configuration hbaseconfiguration = HBaseConfiguration.create(configuration);
            HTable table = new HTable(hbaseconfiguration, "test");
            //Write HFiles to disk. Autoconfigures partitioner and reducer.
            HFileOutputFormat.configureIncrementalLoad(hbasejob, table);
            boolean success = hbasejob.waitForCompletion(true);
            //Load generated files into table.
            LoadIncrementalHFiles loader;
            loader = new LoadIncrementalHFiles(hbaseconfiguration);
            loader.doBulkLoad(outputpath, table);
            return success ? 0 : 1;
        } catch (Exception ex) {
            System.out.println("Error: " + ex.getMessage());
        }
        return 1;
    }
}
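For completeness: a Tool like this is normally launched through ToolRunner from a main method that is not shown in the post. A minimal sketch of what that entry point could look like (an assumption, not part of the original question):
//Entry point used with "hadoop jar"; delegates to the run() method above.
public static void main(String[] args) throws Exception {
    int exitCode = org.apache.hadoop.util.ToolRunner.run(new TestHBaseRun(), args);
    System.exit(exitCode);
}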

A ClassNotFoundException means that the .jar containing com.google.common.primitives.Longs cannot be found on the classpath of the task that needs it.
There are several ways to solve this issue:
1. If you're just playing with Hadoop, the simplest way is to copy the required .jar into /usr/share/hadoop/lib.
2. Add the path to the required .jar to HADOOP_CLASSPATH. To do so, open /etc/hbase/hbase-env.sh and add:
export HADOOP_CLASSPATH="<jar_files>:$HADOOP_CLASSPATH"
3. Create a /lib folder in your project's root folder and copy your .jar files into it. Then create a package (.jar) for your project; the result will be a fat jar containing all the jars included in /lib.
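A fourth option, not listed above but worth knowing when the job already depends on HBase: HBase ships a helper that adds its own dependency jars (Guava included) to the job's distributed cache. A minimal sketch, assuming the hbasejob instance from the code above and a call placed before the job is submitted:
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
...
//Ships HBase and its transitive dependencies (Guava among them) with the job,
//so reduce tasks can resolve classes such as com.google.common.primitives.Longs.
TableMapReduceUtil.addDependencyJars(hbasejob);
This keeps the cluster's lib directories untouched, which tends to be easier to maintain than copying jars around by hand.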

Related

Cannot find org.codehaus.commons.compiler.properties resource

I created a Maven project that includes a dependency on the Calcite JDBC driver, as well as source code for a Calcite CSV adapter.
<dependency>
    <groupId>org.apache.calcite</groupId>
    <artifactId>calcite-core</artifactId>
    <version>1.20.0</version>
</dependency>
When I run from a JUnit test, I can query some CSV files using SQL. Very cool!
But I cannot get the JAR to work in SQL Workbench/J. The log file has this:
Caused by: java.lang.IllegalStateException: Unable to instantiate java compiler
at org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile(JaninoRelMetadataProvider.java:434)
Caused by: java.lang.ClassNotFoundException: No implementation of org.codehaus.commons.compiler is on the class path. Typically, you'd have 'janino.jar', or 'commons-compiler-jdk.jar', or both on the classpath.
at org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory(CompilerFactoryFactory.java:65)
SQL Workbench/J is successfully connecting, and I can see the list of CSV "tables" in the UI. But when I try to query them, I get the above error.
I found a link to someone having a similar problem, but did not see a resolution.
https://community.jaspersoft.com/questions/1035211/apache-calcite-jdbc-driver-jaspersoft
Also, here's the code that seems to be throwing the error:
public final
class CompilerFactoryFactory {
    ...
    public static ICompilerFactory
    getDefaultCompilerFactory() throws Exception {
        if (CompilerFactoryFactory.defaultCompilerFactory != null) {
            return CompilerFactoryFactory.defaultCompilerFactory;
        }
        Properties properties = new Properties();
        InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(
            "org.codehaus.commons.compiler.properties"
        );
        if (is == null) {
            throw new ClassNotFoundException(
                "No implementation of org.codehaus.commons.compiler is on the class path. Typically, you'd have "
                + "'janino.jar', or 'commons-compiler-jdk.jar', or both on the classpath."
            );
        }
From what I can tell, the org.codehaus.commons.compiler.properties resource is just not being found when running under SQL Workbench/J, but for some reason it works in my code.
If I unzip the JAR file, I do see org.codehaus.commons.compiler.properties in the directory structure, so I'm not sure why it's not being found.
Anyone else run into this problem?
Thanks for any help.
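One hypothetical way to narrow this down (not from the original post): check, from inside the failing environment, whether the resource is visible to the thread context class loader that the Janino code above uses, as opposed to the class loader that actually loaded the Calcite/Janino classes.
//Hypothetical diagnostic, temporarily wired into the adapter code or a scratch statement:
//the lookup in CompilerFactoryFactory goes through the thread context class loader,
//which may differ from the loader that loaded Janino inside SQL Workbench/J.
ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
ClassLoader janinoLoader = org.codehaus.commons.compiler.CompilerFactoryFactory.class.getClassLoader();
System.out.println("context loader sees it: "
        + (contextLoader != null && contextLoader.getResource("org.codehaus.commons.compiler.properties") != null));
System.out.println("janino loader sees it:  "
        + (janinoLoader != null && janinoLoader.getResource("org.codehaus.commons.compiler.properties") != null));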

Inject Jar and replace classes in running JVM

I want to be able to replace and add some classes to an already running JVM. I read that I need to use CreateRemoteThread, but I don't completely get it. I read this post on how to do it (Software RnD), but I can't figure out what it does and why. Besides that, it only introduces new classes, but doesn't change existing ones. How can I do it with C++?
You don't even need CreateRemoteThread; there is an official way to connect to a remote JVM and replace loaded classes, using the Attach API.
You need a Java Agent that calls Instrumentation.redefineClasses.
public static void agentmain(String args, Instrumentation instr) throws Exception {
    Class oldClass = Class.forName("org.pkg.MyClass");
    Path newFile = Paths.get("/path/to/MyClass.class");
    byte[] newData = Files.readAllBytes(newFile);
    instr.redefineClasses(new ClassDefinition(oldClass, newData));
}
You'll have to add a MANIFEST.MF with the Agent-Class attribute and pack the agent into a jar file.
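For example, the manifest could contain something like the following (the agent class name is just a placeholder for whatever class holds agentmain; Can-Redefine-Classes must be true or redefineClasses will be rejected):
Agent-Class: org.pkg.MyAgent
Can-Redefine-Classes: true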
Then use Dynamic Attach to inject the agent jar into the running VM (with process ID = pid).
import com.sun.tools.attach.VirtualMachine;
...
VirtualMachine vm = VirtualMachine.attach(pid);
try {
    vm.loadAgent(agentJarPath, options);
} finally {
    vm.detach();
}
There are a few more details in the article.
If you insist on using C/C++ instead of Java API, you may look at my jattach utility.

Can't deserialize Protobuf (2.6.1) data using elephant-bird and Hive in AWS

I am not able to deserialize protobuf data that has a repeated string in it, using elephant-bird 4.14 with Hive. This seems to be because the repeated string feature is available only in Protobuf 2.6 and not in Protobuf 2.5. When I run my Hive queries on an AWS EMR cluster, it uses the Protobuf 2.5 that is bundled with AWS Hive. Even after adding the Protobuf 2.6 jar explicitly, I am not able to get rid of this error. I want to know how I can make Hive use the Protobuf 2.6 jar that I add explicitly.
Below are the Hive queries used:
add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
add jar s3://gam.test/hive-jars/GAMDataModel-1.0.jar;
add jar s3://gam.test/hive-jars/GAMCoreModel-1.0.jar;
add jar s3://gam.test/hive-jars/GAMAccessLayer-1.1.jar;
add jar s3://gam.test/hive-jars/RodbHiveStorageHandler-0.12.0-jarjar-final.jar;
add jar s3://gam.test/hive-jars/elephant-bird-core-4.14.jar;
add jar s3://gam.test/hive-jars/elephant-bird-hive-4.14.jar;
add jar s3://gam.test/hive-jars/elephant-bird-hadoop-compat-4.14.jar;
add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
add jar s3://gam.test/hive-jars/GamProtoBufHiveDeserializer-1.0-jarjar.jar;
drop table GamRelationRodb;
CREATE EXTERNAL TABLE GamRelationRodb
row format serde "com.amazon.hive.serde.GamProtobufDeserializer"
with serdeproperties("serialization.class"=
"com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper")
STORED BY 'com.amazon.rodb.hadoop.hive.RodbHiveStorageHandler' TBLPROPERTIES
("file.name" = 'GAM_Relationship',"file.path" ='s3://pathtofile/');
select * from GamRelationRodb limit 10;
Below is the format of the Protobuf file:
message RepeatedRelationshipWrapper {
    repeated relationship.Relationship relationships = 1;
}

message Relationship {
    required RelationshipType type = 1;
    repeated string ids = 2;
}

enum RelationshipType {
    UKNOWN_RELATIONSHIP_TYPE = 0;
    PARENT = 1;
    CHILD = 2;
}
Below is the runtime exception thrown while running the query:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:215)
at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:137)
at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:239)
at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:234)
at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:126)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:72)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:162)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:157)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:495)
at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:355)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:337)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:170)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:882)
at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.fromBytes(ProtobufConverter.java:66)
at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:59)
at com.amazon.hive.serde.GamProtobufDeserializer.deserialize(GamProtobufDeserializer.java:63)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:502)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Protobuf is a brittle library. It may be wire-format compatible across 2.x versions, but the classes generated by protoc will only link against the protobuf JAR of exactly the same version as the protoc compiler that generated them.
This means, fundamentally, that you cannot update protobuf except by choreographing the change across all dependencies. The Great Protobuf Upgrade of 2013 was when Hadoop, HBase, Hive etc. all upgraded, and after that everyone froze at v2.5, probably for the entire life of the Hadoop 2.x codeline, unless it all gets shaded away or Java 9 hides the problem.
We are more scared of protobuf updates than of upgrades to Guava and Jackson, as the latter only break every single library, not the wire format.
Watch HADOOP-13363 for the topic of a 2.x upgrade, and HDFS-11010 on the question of a move up to protobuf 3 in Hadoop trunk. That's messy, as it changes the wire format, breaks the protobuf-JSON marshalling, and other things.
It's best just to conclude, "binary compatibility of protobuf code has been found lacking", and stick to protobuf 2.5. Sorry.
You could take the entire stack of libraries you want to use and rebuild them with an updated protoc compiler, a matching protobuf JAR, and any other patches you need applied. I would only recommend that to the bold, but I am curious about the outcome. If you do try this, let us know how it worked out.
Further reading: Fear of Dependencies.

Spring Batch process multiple files concurrently

I'm using Spring Batch to process a large XML file (~2 million entities) and update a database. The process is quite time-consuming, so I tried to use partitioning to speed up the processing.
The approach I'm pursuing is to split the large XML file into smaller files (say, 500 entities each) and then use Spring Batch to process each file in parallel.
I'm struggling with the Java configuration needed to process multiple XML files in parallel. These are the relevant beans of my configuration:
@Bean
public Partitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] resources;
    try {
        resources = resourcePatternResolver.getResources("file:/tmp/test/*.xml");
    } catch (IOException e) {
        throw new RuntimeException("I/O problems when resolving the input file pattern.", e);
    }
    partitioner.setResources(resources);
    return partitioner;
}

@Bean
public Step partitionStep() {
    return stepBuilderFactory.get("test-partitionStep")
            .partitioner(personStep())
            .partitioner("personStep", partitioner())
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public Step personStep() {
    return stepBuilderFactory.get("personStep")
            .<Person, Person>chunk(100)
            .reader(personReader())
            .processor(personProcessor())
            .writer(personWriter)
            .build();
}

@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor asyncTaskExecutor = new SimpleAsyncTaskExecutor("spring_batch");
    asyncTaskExecutor.setConcurrencyLimit(10);
    return asyncTaskExecutor;
}
When I execute the job, I get different XML parsing errors (every time a different one). If I remove all the xml files but one from the folder, then the processing works as expected.
I'm not sure I understand 100% the concept of Spring Batch partitioning, especially the "slave" part.
Thanks!
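For context, a setup like this usually hinges on a step-scoped reader bound to the single file of each partition: MultiResourcePartitioner puts each resource's URL into the step execution context under the default key fileName, and the slave step's reader picks it up from there. A rough sketch (the fragment root element, the marshaller setup and the Person class are assumptions, not taken from the question):
import java.net.MalformedURLException;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.xml.StaxEventItemReader;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.UrlResource;
import org.springframework.oxm.jaxb.Jaxb2Marshaller;

@Configuration
public class PartitionedReaderConfig {

    //Step-scoped so that every partition (slave step execution) gets its own reader
    //instance, bound to the one file that MultiResourcePartitioner assigned to it.
    @Bean
    @StepScope
    public StaxEventItemReader<Person> personReader(
            @Value("#{stepExecutionContext['fileName']}") String fileName) throws MalformedURLException {
        Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
        marshaller.setClassesToBeBound(Person.class);    //Person is the domain class from the question.

        StaxEventItemReader<Person> reader = new StaxEventItemReader<>();
        reader.setResource(new UrlResource(fileName));   //One partition = one XML file.
        reader.setFragmentRootElementName("person");     //Assumed element name for each entity.
        reader.setUnmarshaller(marshaller);
        return reader;
    }
}
A single reader shared by all worker threads (i.e. one that is not step-scoped) is a common cause of the kind of intermittent XML parsing errors described above.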

Embedding Jetty 9.3 with modular XmlConfiguration

I am migrating from Jetty 8.1.17 to Jetty 9.3.9. Our application embeds Jetty. Previously we had a single XML configuration file jetty.xml which contained everything we needed.
I felt that with Jetty 9.3.9 it would be much nicer to use the modular approach that they suggest. So far I have jetty.xml, jetty-http.xml, jetty-https.xml and jetty-ssl.xml in my $JETTY_HOME/etc; these are pretty much copies of those from the 9.3.9 distribution. This seems to work well when I use start.jar, but not through my own code, which embeds Jetty.
Ideally I would like to be able to scan for any Jetty XML files in the $JETTY_HOME/etc folder and load the configuration. However, for embedded mode I have not found a way to do that without explicitly defining the order in which those files should be loaded, due to <ref id="x"/> dependencies between them, etc.
My initial attempt is based on How can I programmatically start a jetty server with multiple configuration files? and looks like:
final List<Object> configuredObjects = new ArrayList<>();
XmlConfiguration last = null;
for (final Path confFile : configFiles) {
    logger.info("[loading jetty configuration : {}]", confFile.toString());
    try (final InputStream is = Files.newInputStream(confFile)) {
        final XmlConfiguration configuration = new XmlConfiguration(is);
        if (last != null) {
            configuration.getIdMap().putAll(last.getIdMap());
        }
        configuredObjects.add(configuration.configure());
        last = configuration;
    }
}

Server server = null;
// For all objects created by XmlConfigurations, start them if they are lifecycles.
for (final Object configuredObject : configuredObjects) {
    if (configuredObject instanceof Server) {
        server = (Server) configuredObject;
    }
    if (configuredObject instanceof LifeCycle) {
        final LifeCycle lc = (LifeCycle) configuredObject;
        if (!lc.isRunning()) {
            lc.start();
        }
    }
}
However, I get Exceptions at startup if jetty-https.xml is loaded before jetty-ssl.xml or if I place a reference in jetty.xml to an object from a sub-configuration jetty-blah.xml which has not been loaded first.
It seems to me like Jetty manages to do this okay itself when you call java -jar start.jar, so what am I missing to get Jetty to not care about what order the config files are parsed in?
Order is extremely important when loading the Jetty XML files.
That's the heart of what start.jar and its module system are about: making sure there is an appropriate set of properties, that the server classpath is sane, and that the XML files are loaded in the proper order.
Note: it's not possible to have everything in ${jetty.home}/etc loaded at the same time, as you will get conflicts between alternate implementations of common technologies (something start.jar also manages for you).
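As a rough illustration of such an order for the four files named in the question (a sketch that assumes the stock 9.3.x etc/ files; the comments describe their usual id dependencies), the configFiles list fed into the loading loop above could look like:
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
...
//Assumes the jetty.home system property points at the distribution whose etc/ files are used.
final Path etc = Paths.get(System.getProperty("jetty.home"), "etc");
final List<Path> configFiles = Arrays.asList(
        etc.resolve("jetty.xml"),       //creates the Server instance ("Server" id)
        etc.resolve("jetty-http.xml"),  //plain HTTP connector, refers to "Server"
        etc.resolve("jetty-ssl.xml"),   //defines "sslContextFactory" and the SSL connector
        etc.resolve("jetty-https.xml")  //adds the HTTPS connection factory to the SSL connector
);
This mirrors the order that start.jar derives automatically from its module definitions; when embedding without start.jar, that order has to be reproduced by hand (or derived from the module metadata yourself).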