How to schedule an HBase MapReduce job with Oozie? - mapreduce

I want to schedule an HBase MapReduce job with Oozie, and I am facing the following problem:
how/where do I specify these properties in the Oozie workflow?
(i) the table name for the mapper/reducer
(ii) the scan object for the mapper
Scan scan = new Scan(new Get());
scan.setMaxVersions();
scan.addColumn(Bytes.toBytes(FAMILY), Bytes.toBytes(VALUE));
scan.addColumn(Bytes.toBytes(FAMILY), Bytes.toBytes(DATE));

Job job = new Job(conf, JOB_NAME + "_" + TABLE_USER);

// These two calls are what I want to drive from the Oozie workflow:
TableMapReduceUtil.initTableMapperJob(TABLE_USER, scan,
        Mapper.class, Text.class, Text.class, job);
TableMapReduceUtil.initTableReducerJob(DETAILS_TABLE,
        Reducer.class, job);
Or, alternatively, please let me know the best way to schedule an HBase MapReduce job with Oozie.
Thanks :) :)

The best way (in my opinion) to schedule an HBase MapReduce job is to schedule it as a Java action, i.e. as a plain .java main class.
It works well, and there is no need to write code to convert your Scan to a string, etc.
So I am scheduling my jobs as Java actions until I find a better option. The workflow looks like this:
<workflow-app xmlns="uri:oozie:workflow:0.1" name="java-main-wf">
<start to="java-node"/>
<action name="java-node">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>org.apache.oozie.example.DemoJavaMain</main-class>
<arg>Hello</arg>
<arg>Oozie!</arg>
<arg>This</arg>
<arg>is</arg>
<arg>Demo</arg>
<arg>Oozie!</arg>
</java>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
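For reference, org.apache.oozie.example.DemoJavaMain above is just the stock Oozie example; in practice <main-class> points at your own driver, which builds the Scan and wires up the HBase mapper and reducer itself, so nothing from the Scan ever has to be serialized into the workflow XML. Below is a minimal sketch of such a driver; the class names, table names, column family and qualifiers are placeholders, not code from the question.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseJobDriver {   // hypothetical driver class referenced from <main-class>

    // Placeholder mapper: emits the row key; your real logic goes here.
    public static class UserMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(row.copyBytes()), new Text(""));
        }
    }

    // Placeholder reducer: writes one Put per key into the output table.
    public static class DetailsReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(Bytes.toBytes(key.toString()));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("seen"), Bytes.toBytes("1")); // addColumn in newer HBase
            context.write(null, put); // TableOutputFormat ignores the key
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // On a secure cluster, propagate the delegation token from the Oozie
        // launcher (see the note about Java actions further down this page).
        if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
            conf.set("mapreduce.job.credentials.binary",
                    System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
        }

        // The Scan is built in plain Java, exactly as in the question.
        Scan scan = new Scan();
        scan.setMaxVersions();
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"));
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("date"));

        Job job = new Job(conf, "hbase-mr-from-oozie");
        job.setJarByClass(HBaseJobDriver.class);
        TableMapReduceUtil.initTableMapperJob("TABLE_USER", scan,
                UserMapper.class, Text.class, Text.class, job);
        TableMapReduceUtil.initTableReducerJob("DETAILS_TABLE",
                DetailsReducer.class, job);

        if (!job.waitForCompletion(true)) {
            throw new RuntimeException("HBase MapReduce job failed");
        }
    }
}
Because the driver calls waitForCompletion and throws on failure, the Java action fails whenever the underlying MapReduce job fails, which keeps the Oozie error transition meaningful.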

You can also schedule the job using the <map-reduce> action, but it is not as easy as scheduling it as a Java action. It requires considerable effort, but it can be considered an alternative approach.
<action name='jobSample'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<!-- This is required for new api usage -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<!-- HBASE CONFIGURATIONS -->
<property>
<name>hbase.mapreduce.inputtable</name>
<value>TABLE_USER</value>
</property>
<property>
<name>hbase.mapreduce.scan</name>
<value>${wf:actionData('get-scanner')['scan']}</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>${hbaseZookeeperClientPort}</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>${hbaseZookeeperQuorum}</value>
</property>
<!-- MAPPER CONFIGURATIONS -->
<property>
<name>mapreduce.inputformat.class</name>
<value>org.apache.hadoop.hbase.mapreduce.TableInputFormat</value>
</property>
<property>
<name>mapred.mapoutput.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.mapoutput.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>com.hbase.mapper.MyTableMapper</value>
</property>
<!-- REDUCER CONFIGURATIONS -->
<property>
<name>mapreduce.reduce.class</name>
<value>com.hbase.reducer.MyTableReducer</value>
</property>
<property>
<name>hbase.mapred.outputtable</name>
<value>DETAILS_TABLE</value>
</property>
<property>
<name>mapreduce.outputformat.class</name>
<value>org.apache.hadoop.hbase.mapreduce.TableOutputFormat</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>${mapperCount}</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>${reducerCount}</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
To find out the exact property names and values to use, dump the configuration parameters; a sketch of how to do that follows.
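For example, Configuration is Iterable over its key/value entries, so a small helper like the following (a sketch; ConfDumper is a hypothetical class name, and you would call it from your driver or from a mapper's setup() method) prints every effective property:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ConfDumper {
    // Print every key/value pair of the effective job configuration.
    public static void dump(Configuration conf) {
        for (Map.Entry<String, String> entry : conf) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}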
Also, the hbase.mapreduce.scan property expects a serialized form of the Scan (a Base64-encoded string), so it is not obvious how to express something like this directly in the workflow XML:
scan.addColumn(Bytes.toBytes(FAMILY), Bytes.toBytes(VALUE));
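The value the workflow reads via ${wf:actionData('get-scanner')['scan']} would come from an earlier Java action named get-scanner that serializes the Scan and exposes it with <capture-output/>. The following is a rough sketch of what that action's main class could look like; the class name, column family and qualifier are placeholders, and whether TableMapReduceUtil.convertScanToString is public depends on your HBase version (if it is not, the same few lines can be copied from the HBase source).

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanner {
    public static void main(String[] args) throws Exception {
        // Build the Scan in plain Java, exactly as you would in a driver.
        Scan scan = new Scan();
        scan.setMaxVersions();
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"));

        // Base64-encode the serialized Scan, the format hbase.mapreduce.scan expects.
        String scanString = TableMapReduceUtil.convertScanToString(scan);

        // Oozie's <capture-output/> reads a Java properties file from the path
        // given in the oozie.action.output.properties system property.
        Properties props = new Properties();
        props.setProperty("scan", scanString);
        File out = new File(System.getProperty("oozie.action.output.properties"));
        try (OutputStream os = new FileOutputStream(out)) {
            props.store(os, "serialized scan for hbase.mapreduce.scan");
        }
    }
}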

Related

Checkpointing in flink

I am trying to use Flink's checkpointing mechanism with an HDFS filesystem backend.
When connecting with hdfs://aleksandar/0.0.0.0:50010/shared/ I get the following error:
Caused by: java.lang.IllegalArgumentException: Pathname /0.0.0.0:50010/shared/972dde22148f58ec9f266fb7bdfae891 from hdfs://aleksandar/0.0.0.0:50010/shared/972dde22148f58ec9f266fb7bdfae891 is not a valid DFS filename.
In the core-site settings I have the following configuration:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:123</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
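The path in that error mixes the HDFS authority with a DataNode address: HDFS resolves hdfs://aleksandar as the filesystem and then treats /0.0.0.0:50010/shared/... as the file path, which is not a valid DFS filename (50010 is the default DataNode transfer port and does not belong in the URI at all). The checkpoint URI should name only the NameNode, i.e. the authority from fs.defaultFS, followed by a plain path. A minimal sketch, assuming Flink's FsStateBackend; the class name and "namenode-host:8020" are placeholders:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToHdfs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpoint every 10 seconds

        // Point at the NameNode (the authority from fs.defaultFS), then a plain path.
        env.setStateBackend(new FsStateBackend("hdfs://namenode-host:8020/shared/checkpoints"));

        env.fromElements(1, 2, 3).print(); // trivial pipeline just to make the sketch runnable
        env.execute("checkpointed-job");
    }
}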

HBase connection in mapreduce running from Oozie workflow fails

I am running my MapReduce job as a Java action from an Oozie workflow.
When I run my MapReduce job directly on the Hadoop cluster it runs successfully, but when I use the same jar from the Oozie workflow it throws an exception.
This is my workflow.xml:
<workflow-app name="HBaseToFileDriver" xmlns="uri:oozie:workflow:0.1">
<start to="mapReduceAction"/>
<action name="mapReduceAction">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>oozie.libpath</name>
<value>${appPath}/lib</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>root.fricadev</value>
</property>
</configuration>
<main-class>com.thomsonretuers.hbase.HBaseToFileDriver</main-class>
<arg>fricadev:FinancialLineItem</arg>
<capture-output/>
</java>
<ok to="end"/>
<error to="killJob"/>
</action>
<kill name="killJob">
<message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end" />
</workflow-app>
Below is the exception I see in the YARN logs.
Even though the action shows as succeeded, the output files are not being generated.
Have you looked into the Oozie Java action documentation?
IMPORTANT: In order for a Java action to succeed on a secure cluster, it must propagate the Hadoop delegation token as in the following code snippet (this is benign on non-secure clusters):
// propagate delegation related props from launcher job to MR job
if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
    jobConf.set("mapreduce.job.credentials.binary",
            System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
}
You must read HADOOP_TOKEN_FILE_LOCATION from the environment and set it on the mapreduce.job.credentials.binary property.
HADOOP_TOKEN_FILE_LOCATION is set by Oozie at runtime.
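Also, regarding the action showing as succeeded while no output is produced: if a driver submits the job and returns without waiting, the Java action finishes successfully regardless of what the MapReduce job does afterwards. Below is a skeleton (the class name is a stand-in for your driver and the actual job wiring is elided) showing both the token propagation and an exit path that lets Oozie take the error transition when the job fails.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverSkeleton {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Propagate the delegation token (harmless on non-secure clusters).
        if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
            conf.set("mapreduce.job.credentials.binary",
                    System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
        }

        Job job = Job.getInstance(conf, "hbase-to-file");
        // ... input/output formats, mapper class, HBase scan setup, output path ...

        // Block until the job finishes and surface failure, so the Oozie action
        // does not report success while the underlying job has failed.
        if (!job.waitForCompletion(true)) {
            throw new RuntimeException("MapReduce job failed; check the YARN logs");
        }
    }
}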

Error: Java heap space Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

Error: Java heap space. Container killed by the ApplicationMaster. Container killed on request. Exit code is 143.
The Hadoop cluster has 3 machines: one of them is the master, the others are DataNodes, and each machine has 8 GB of RAM.
This is yarn-site.xml:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>Hadoop1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
</configuration>
This is mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop1:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.input.dir.recursive</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3072m</value>
</property>
</configuration>
When I run the MapReduce job I get the error: Error: Java heap space. Container killed by the ApplicationMaster. Container killed on request. Exit code is 143.
The input files are 500 MB and the number of reducers is 4. When the input files are less than 300 MB, the program runs fine.
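Without the full job it is hard to say exactly where the heap is exhausted, but two relationships in these files are worth checking: the -Xmx values in mapreduce.*.java.opts must leave headroom inside the matching mapreduce.*.memory.mb container sizes (roughly 80% is a common rule of thumb), and the container sizes must fit under yarn.nodemanager.resource.memory-mb (4096 MB here, which the 4096 MB reduce container already consumes entirely). Below is a sketch of overriding these per job at submission time and spreading the 500 MB input over more reducers; the class name and the exact numbers are illustrative assumptions, not a recommendation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Keep the heap well inside the container: roughly 80% of memory.mb.
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.memory.mb", "4096");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");

        Job job = Job.getInstance(conf, "memory-tuned-job");
        // More reducers means each one handles a smaller slice of the 500 MB input.
        job.setNumReduceTasks(8);
        // ... the rest of the job setup (mapper, reducer, input/output paths) ...
    }
}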

Configuring JDBCUserStoreManager as a secondary user store is not working in wso2esb

I am using WSO2 ESB 4.8.1.
I am trying to configure JDBCUserStoreManager as a secondary user store, but I am unable to add it because some query-related errors occur.
My configuration is like this:
<UserStoreManager class="org.wso2.carbon.user.core.jdbc.JDBCUserStoreManager">
<Property name="driverName">oracle.jdbc.OracleDriver</Property>
<Property name="url">jdbc:oracle:thin:#localhost:1521:xe</Property>
<Property name="userName">fff</Property>
<Property name="password">fff</Property>
<Property name="Disabled">false</Property>
<Property name="MaxUserNameListLength">100</Property>
<Property name="MaxRoleNameListLength">100</Property>
<Property name="UserRolesCacheEnabled">true</Property>
<Property name="PasswordDigest">SHA-256</Property>
<Property name="ReadGroups">true</Property>
<Property name="ReadOnly">false</Property>
<Property name="IsEmailUserName">false</Property>
<Property name="DomainCalculation">default</Property>
<Property name="StoreSaltedPassword">true</Property>
<Property name="WriteGroups">true</Property>
<Property name="UserNameUniqueAcrossTenants">false</Property>
<Property name="PasswordJavaRegEx">^[\S]{5,30}$</Property>
<Property name="PasswordJavaScriptRegEx">^[\S]{5,30}$</Property>
<Property name="UsernameJavaRegEx">^[\S]{5,30}$</Property>
<Property name="UsernameJavaScriptRegEx">^[\S]{5,30}$</Property>
<Property name="RolenameJavaRegEx">^[\S]{5,30}$</Property>
<Property name="RolenameJavaScriptRegEx">^[\S]{5,30}$</Property>
<Property name="SCIMEnabled">false</Property>
<Property name="SelectUserSQL">select fff.AUTHENTICATION.username from kkkk.AUTHENTICATION;</Property>
<Property name="GetRoleListSQL">select fff.AUTHENTICATION.username from kkkk.AUTHENTICATION;</Property>
<Property name="DomainName">TT.com</Property>
<Property name="Description"/>
</UserStoreManager>
It shows a success message when I add it, but if I restart the server it gives many errors,
like:
[2014-07-08 17:07:42,620] ERROR - JDBCUserStoreManager Using sql : select fff.AUTHENTICATION.username from fff.AUTHENTICATION;
[2014-07-08 17:07:42,624] ERROR - AbstractUserStoreManager org.wso2.carbon.user.core.UserStoreException: Invalid column index
[2014-07-08 17:07:42,663] INFO - ServiceBusInitializer Starting ESB...
If I add this configuration:
<Property name="SelectUserSQL">select kkkk.AUTHENTICATION.username from kkkk.AUTHENTICATION;</Property>
<Property name="EmptyRolesAllowed">Allowed</Property>
<Property name="DomainName">TT.com</Property>
again it gives this error:
tenant -1234
[2014-07-08 17:49:10,112] ERROR - JDBCUserStoreManager Error while retrieving roles from JDBC user store
java.sql.SQLException: ORA-00942: table or view does not exist
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:445)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:879)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:450)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)
Do I need to add a data source for this, or should it work without one? I am unable to figure out the issue.
My table is like this:
table name: AUTHENTICATION
column names: username, password, role
data: system, system, everyone
Any help with this would be appreciated.
Thanks in advance.
The cause may be that some of the other SQL queries being run expect the default WSO2 user store schema, since you are using the shipped JDBCUserStoreManager against a different schema. When you need this kind of non-default user store structure, it is recommended to write a custom user store manager, which gives you more freedom to handle user store functionality according to your requirements. The following links may help. (Please note that although the documentation is for Identity Server 5.0.0, it is valid for ESB 4.8.1 as well.)
[1] - http://docs.wso2.com/display/IS500/Writing+a+Custom+User+Store+Manager
[2] - http://pushpalankajaya.blogspot.com/2013/09/how-to-write-custom-user-store-manager.html

MapReduce job failed in oozie

I have a map-only job which takes a sequence file (key is Text, value is BytesWritable) as input and outputs data into a sequence file (key is NullWritable, value is Text).
Java class
import java.io.*;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
public class Test {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "Test");
job.setJarByClass(Test.class);
job.setMapperClass(TestMapper.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.submit();
}
public static class TestMapper extends Mapper<Text, BytesWritable, NullWritable, Text> {
Text outValue = new Text("");
int counter = 0;
public void map(Text filename, BytesWritable data, Context context) throws IOException, InterruptedException {
// mapper logic goes here
}
}
}
It works fine when run from the Unix command line, but when the same job is scheduled in Oozie I see the error below:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
at Test$TestMapper.map(Test.java:56)
The job configuration in Oozie:
<configuration>
<property>
<name>mapred.input.dir</name>
<value>${input}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/temp</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>Test$TestMapper</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>0</value>
</property>
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.NullWritable</value>
</property>
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.inputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat</value>
</property>
<property>
<name>mapreduce.job.outputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat</value>
</property>
<property>
<name>mapreduce.job.mapinput.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.mapinput.value.class</name>
<value>org.apache.hadoop.io.BytesWritable</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
</configuration>
Can someone tell me what the error is here? Thank you.
The ClassCastException indicates that Oozie was still using the default input format, TextInputFormat, whose key type is LongWritable. Since the mapper expects a key type of Text, there is a type mismatch when records reach the mapper, so the config key mapreduce.job.inputformat.class was not the right one.
After some trial and error, we found that the correct property name is mapreduce.inputformat.class, i.e.:
<property>
<name>mapreduce.inputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat</value>
</property>