MapReduce and HCatalog integration fails to use MySQL metastore

Environment: HDP 2.3 Sandbox
Problem: I have created a table in Hive with just 2 columns. Now I want to read it in my MapReduce code using HCatalog integration. The MR job fails to read the table from the MySQL metastore; it uses Derby for some reason and therefore fails with a "table not found" message.
Job Client code:
public class HCatalogMRJob extends Configured implements Tool {
public int run(String[] args) throws Exception {
Configuration conf = getConf();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
String inputTableName = args[0];
String outputTableName = args[1];
String dbName = null;
Job job = new Job(conf, "HCatalogMRJob");
HCatInputFormat.setInput(job, dbName, inputTableName);
job.setInputFormatClass(HCatInputFormat.class);
job.setJarByClass(HCatalogMRJob.class);
job.setMapperClass(HCatalogMapper.class);
job.setReducerClass(HCatalogReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(conf);
System.err.println("INFO: output schema explicitly set for writing:"
+ s);
HCatOutputFormat.setSchema(job, s);
job.setOutputFormatClass(HCatOutputFormat.class);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new HCatalogMRJob(), args);
System.exit(exitCode);
}
}
Job Run Command:
hadoop jar mr-hcat.jar input_table out_table
Before running this command, I set the necessary HCatalog and Hive jars on the classpath using the HADOOP_CLASSPATH variable.
Question:
Now, how do I make the job use hive-site.xml correctly?
I tried adding it to the classpath via the same HADOOP_CLASSPATH as mentioned above, but it still fails.
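One commonly suggested fix (a sketch, not verified against this exact sandbox) is to make the metastore settings visible to the job's Configuration explicitly, either by adding the client hive-site.xml as a resource or by setting hive.metastore.uris before HCatInputFormat.setInput is called; otherwise HCatalog falls back to a local Derby metastore. The path and hostname below are placeholders for your environment:
Configuration conf = getConf();
// Option A: load the client hive-site.xml explicitly (path is an assumption; adjust to your install)
conf.addResource(new Path("/etc/hive/conf/hive-site.xml"));
// Option B: point directly at the Thrift metastore service that fronts MySQL (placeholder host/port)
conf.set("hive.metastore.uris", "thrift://sandbox.hortonworks.com:9083");
Job job = new Job(conf, "HCatalogMRJob");
HCatInputFormat.setInput(job, dbName, inputTableName);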

Related

Unable to clean the Jetty working directory when Docker is restarted

This is for embedded Jetty. I am trying to clean the Jetty working directory that is automatically created in the /tmp folder inside the container. I wrote the method cleanJettyWorkingDirectory() below to clean the working directory, and it works. The problem is that it no longer lets a working directory be created, because I think I am calling this method from the wrong place: whenever I restart the Docker container, it cleans the entire working directory. Please assist.
public void cleanJettyWorkingDirectory(){
final File folder = new File(JETTY_WORKING_DIRECTORY);
final File[] files = folder.listFiles( new FilenameFilter() {
@Override
public boolean accept( final File dir,
final String name ) {
return name.matches( "jetty-0_0_0_0-.*" );
}
} );
for ( final File file : files ) {
try {
FileUtils.deleteDirectory(file);
} catch (IOException e) {
logger.info("Unable to delete the Jetty working directory");
}
}
}
The Jetty server main class is as below:
public class JettyServer {
private final static Logger logger = Logger.getLogger(JettyServer.class.getName());
private static final int JETTY_PORT = 10000;
private static final String JETTY_REALM_PROPERTIES_FILE_NAME = "realm.properties";
private static final String JETTY_REALM_NAME = "myrealm";
private static final String JETTY_WORKING_DIRECTORY="tmp";
public static QueuedThreadPool threadPool;
public JettyServer() {
try {
cleanJettyWorkingDirectory(); // *Calling here*
RolloverFileOutputStream os = new RolloverFileOutputStream(JETTY_STDOUT_LOG_FILE_NAME, true);
PrintStream logStream = new PrintStream(os);
System.setOut(logStream);
System.setErr(logStream);
Server server = new Server(JETTY_PORT);
server.addBean(getLoginService());
try {
logger.info("Configuring Jetty SSL..");
HttpConfiguration http_config = new HttpConfiguration();
http_config.setSecureScheme("https");
http_config.setSecurePort(JETTY_PORT);
https.setPort(JETTY_PORT);
server.setConnectors(new Connector[]{https});
logger.info("Jetty SSL successfully configured..");
} catch (Exception e){
logger.severe("Error configuring Jetty SSL.."+e);
throw e;
}
Configuration.ClassList classlist = Configuration.ClassList.setServerDefault(server);
classlist.addAfter("org.eclipse.jetty.webapp.FragmentConfiguration",
"org.eclipse.jetty.plus.webapp.EnvConfiguration",
"org.eclipse.jetty.plus.webapp.PlusConfiguration");
//register ui and service web apps
HandlerCollection webAppHandlers = getWebAppHandlers();
for (Connector c : server.getConnectors()) {
c.getConnectionFactory(HttpConnectionFactory.class).getHttpConfiguration().setRequestHeaderSize(MAX_REQUEST_HEADER_SIZE);
c.getConnectionFactory(HttpConnectionFactory.class).getHttpConfiguration().setSendServerVersion(false);
}
threadPool = (QueuedThreadPool) server.getThreadPool();
// request logs
RequestLogHandler requestLogHandler = new RequestLogHandler();
AsyncRequestLogWriter asyncRequestLogWriter = new AsyncRequestLogWriter(JETTY_REQUEST_LOG_FILE_NAME);
asyncRequestLogWriter.setFilenameDateFormat(JETTY_REQUEST_LOG_FILE_NAME_DATE_FORMAT);
asyncRequestLogWriter.setAppend(JETTY_REQUEST_LOG_FILE_APPEND);
asyncRequestLogWriter.setRetainDays(JETTY_REQUEST_LOG_FILE_RETAIN_DAYS);
asyncRequestLogWriter.setTimeZone(TimeZone.getDefault().getID());
requestLogHandler.setRequestLog(new AppShellCustomRequestLog(asyncRequestLogWriter));
webAppHandlers.addHandler(requestLogHandler);
StatisticsHandler statisticsHandler = new StatisticsHandler();
statisticsHandler.setHandler(new AppshellStatisticsHandler());
webAppHandlers.addHandler(statisticsHandler);
// set handler
server.setHandler(webAppHandlers);
//start jettyMetricsPsr
JettyMetricStatistics.logJettyMetrics();
// set error handler
server.addBean(new CustomErrorHandler());
// GZip Handler
GzipHandler gzip = new GzipHandler();
server.setHandler(gzip);
gzip.setHandler(webAppHandlers);
//setting server attribute for datasources
server.setAttribute("fawappshellDS", new Resource(JNDI_NAME_FAWAPPSHELL, DatasourceUtil.getFawAppshellDatasource()));
server.setAttribute("fawcommonDS", new Resource(JNDI_NAME_FAWCOMMON, DatasourceUtil.getCommonDatasource()));
//new Resource(server, JNDI_NAME_FAWAPPSHELL, getFawAppshellDatasource());
//new Resource(server, JNDI_NAME_FAWCOMMON, getFawCommonDatasource());
server.start();
server.join();
} catch (Exception e) {
e.printStackTrace();
}
}
private HandlerCollection getWebAppHandlers() throws SQLException, NamingException{
//Setting the war and context path for the service layer: oaxservice
WebAppContext serviceWebapp = new WebAppContext();
serviceWebapp.setWar(APPSHELL_API_WAR_FILE_PATH);
serviceWebapp.setContextPath(APPSHELL_API_CONTEXT_PATH);
serviceWebapp.setPersistTempDirectory(false);
//setting the war and context path for the UI layer: oaxui
WebAppContext uiWebapp = new WebAppContext();
uiWebapp.setWar(APPSHELL_UI_WAR_FILE_PATH);
uiWebapp.setContextPath(APPSHELL_UI_CONTEXT_PATH);
uiWebapp.setAllowNullPathInfo(true);
uiWebapp.setInitParameter("org.eclipse.jetty.servlet.Default.dirAllowed", "false");
//set error page handler for the UI context
uiWebapp.setErrorHandler(new CustomErrorHandler());
//handling the multiple war files using HandlerCollection.
HandlerCollection handlerCollection = new HandlerCollection();
handlerCollection.setHandlers(new Handler[]{serviceWebapp, uiWebapp});
return handlerCollection;
}
public LoginService getLoginService() throws IOException {
URL realmProps = JettyServer.class.getClassLoader().getResource(JETTY_REALM_PROPERTIES_FILE_NAME);
if (realmProps == null)
throw new FileNotFoundException("Unable to find " + JETTY_REALM_PROPERTIES_FILE_NAME);
return new HashLoginService(JETTY_REALM_NAME, realmProps.toExternalForm());
}
public void cleanJettyWorkingDirectory(){
final File folder = new File(JETTY_WORKING_DIRECTORY);
final File[] files = folder.listFiles( new FilenameFilter() {
@Override
public boolean accept( final File dir,
final String name ) {
return name.matches( "jetty-0_0_0_0-.*" );
}
} );
for ( final File file : files ) {
try {
FileUtils.deleteDirectory(file);
} catch (IOException e) {
logger.info("Unable to delete the Jetty working directory");
}
}
}
public static void main(String[] args) {
new JettyServer();
}
}
Option 1: Use docker tmpfs
If you want to eliminate the system temp persistence, configure Docker so that /tmp does not persist between restarts instead of writing this custom cleanup logic in your Java app.
A Docker tmpfs mount is probably going to be a better solution.
See past answer: https://stackoverflow.com/a/52662602/775715
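For example (image name and tmpfs size are placeholders), mounting a tmpfs over /tmp means nothing written there survives a container restart:
$ docker run --tmpfs /tmp:rw,size=64m,mode=1777 my-jetty-image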
Option 2: Use linux systemd tmpfiles
You could also use systemd-tmpfiles (or the systemd-tmpfiles-clean timer) to perform the cleanup periodically within the Linux environment inside your Docker image.
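A minimal tmpfiles.d entry might look like the following (path, owner, and age are assumptions; systemd-tmpfiles --clean then removes entries older than the Age column):
# /etc/tmpfiles.d/jetty-work.conf
# Type  Path                 Mode  UID    GID    Age
d       /var/run/jetty/work  0755  jetty  jetty  1d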
Option 3: Use a non-standard system temp directory for Jetty
Configure a new Temp Directory for your Java instance ...
$ java -Djava.io.tmpdir=/var/run/jetty/work/ -jar start.jar
Then have the shell script that starts your Jetty instance clear out that dedicated directory before it launches the JVM. For example:
JETTY_WORK=/var/run/jetty/work
rm -rf $JETTY_WORK/*
java -Djava.io.tmpdir=$JETTY_WORK/ -jar start.jar
This approach also catches temp-directory usage from your third-party libraries, not just Jetty itself.
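If you stay with embedded Jetty, you can also point each WebAppContext at its own disposable work directory instead of the shared system temp directory (a sketch; the directory path is an arbitrary example):
// In getWebAppHandlers(): give the webapp an explicit work directory so nothing lands in /tmp.
File workDir = new File("/var/run/jetty/work/oaxservice");
workDir.mkdirs();
serviceWebapp.setTempDirectory(workDir);
serviceWebapp.setPersistTempDirectory(false); // wipe and recreate it on each restart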

How to get the value of 'CARBON_HOME' in Java code

I was trying to implement an Axis2 service that receives user requests and publishes them as events to a CEP using Carbon DataBridge Thrift (via org.wso2.carbon.databridge.agent.thrift.DataPublisher).
I followed the code sample provided in wso2cep-3.1.0/samples/producers/activity-monitor.
Please see the following code snippet:
public class GatewayServiceSkeleton{
private static Logger logger = Logger.getLogger(GatewayServiceSkeleton.class);
public RequestResponse request(Request request)throws AgentException,
MalformedStreamDefinitionException,StreamDefinitionException,
DifferentStreamDefinitionAlreadyDefinedException,
MalformedURLException,AuthenticationException,DataBridgeException,
NoStreamDefinitionExistException,TransportException, SocketException,
org.wso2.carbon.databridge.commons.exception.AuthenticationException
{
final String GATEWAY_SERVICE_STREAM = "gateway.cep";
final String VERSION = "1.0.0";
final String PROTOCOL = "tcp://";
final String CEPHOST = "cep.gubnoi.com";
final String CEPPORT = "7611";
final String CEPUSERNAME = "admin";
final String CEPPASSWORD = "admin";
Object[] metadata = { request.getDeviceID(), request.getViewID()};
Object[] correlationdata = { request.getSessionID()};
Object[] payloaddata = {request.getBucket()};
KeyStoreUtil.setTrustStoreParams();
KeyStoreUtil.setKeyStoreParams();
DataPublisher dataPublisher = new DataPublisher(PROTOCOL + CEPHOST + ":" + CEPPORT, CEPUSERNAME, CEPPASSWORD);
//create event
Event event = new Event (GATEWAY_SERVICE_STREAM + ":" + VERSION, System.currentTimeMillis(), metadata, correlationdata, payloaddata);
//Publish event for a valid stream
dataPublisher.publish(event);
//stop
dataPublisher.stop();
RequestResponse response = new RequestResponse();
response.setSessionID(request.getSessionID());
response.setDeviceID(request.getDeviceID());
response.setViewID(request.getViewID());
response.setBucket(request.getBucket());
return response;
}
There is also a utility class that sets the key store parameters, as follows:
public class KeyStoreUtil {
static File filePath = new File("../../../repository/resources/security");
public static void setTrustStoreParams() {
String trustStore = filePath.getAbsolutePath();
System.setProperty("javax.net.ssl.trustStore", trustStore + "/client-truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "wso2carbon");
}
public static void setKeyStoreParams() {
String keyStore = filePath.getAbsolutePath();
System.setProperty("Security.KeyStore.Location", keyStore + "/wso2carbon.jks");
System.setProperty("Security.KeyStore.Password", "wso2carbon");
}
}
I uploaded the service into WSO2 AS 5.2.1 and called it using SoapUI.
The request returned the error message "Cannot borrow client for TCP".
I debugged and found that the problem might lie with the class KeyStoreUtil, where the filePath somehow returned a null:
static File filePath = new File("../../../repository/resources/security");
This caused the failure on this line:
DataPublisher dataPublisher = new DataPublisher(PROTOCOL + CEPHOST + ":" + CEPPORT, CEPUSERNAME, CEPPASSWORD);
I guess it would be better to use the value of CARBON_HOME to figure out the location of the key store.
So my question is: how can I get the value of CARBON_HOME in the Java code?
That said, if you think a bit more: the service will be called numerous times, whereas setTrustStoreParams and setKeyStoreParams only need to be executed once, when the server/service initializes.
So, is there an even better way to move setTrustStoreParams and setKeyStoreParams out of the service code, or to implement them as configurable items?
Please advise. Thanks.
So my question is: how can I get the value of CARBON_HOME in the Java code?
You should use the system property carbon.home, as follows; it returns the WSO2 product's home directory:
System.getProperty("carbon.home");

Java SDK for copying to Redshift

Is it possible to fire a COPY command from S3 to Redshift through a Java JDBC connection?
Example:
copy test from 's3://' CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxxxx'
Yes. Try code like the following:
String dbURL = "jdbc:postgresql://x.y.us-east-1.redshift.amazonaws.com:5439/dev";
String MasterUsername = "userame";
String MasterUserPassword = "password";
Connection conn = null;
Statement stmt = null;
try{
//Dynamically load postgresql driver at runtime.
Class.forName("org.postgresql.Driver");
System.out.println("Connecting to database...");
Properties props = new Properties();
props.setProperty("user", MasterUsername);
props.setProperty("password", MasterUserPassword);
conn = DriverManager.getConnection(dbURL, props);
stmt = conn.createStatement();
String sql="copy test from 's3://' CREDENTIALS 'aws_access_key_id=xxxxxxx;aws_secret_access_key=xxxxxxxxx'"
int j = stmt.executeUpdate(sql);
stmt.close();
conn.close();
}catch(Exception ex){
//For convenience, handle all errors here.
ex.printStackTrace();
}
Sandesh's answer works perfectly fine, but it uses the PostgreSQL driver. AWS provides a Redshift driver, which is preferable to the PostgreSQL driver.
The rest stays the same. I hope this information helps others.
1) The JDBC driver class changes from org.postgresql.Driver to com.amazon.redshift.jdbcXX.Driver, where XX is the Redshift driver version, e.g. 42.
2) The JDBC URL changes from jdbc:postgresql to jdbc:redshift.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;
public class RedShiftJDBC {
public static void main(String[] args) {
Connection conn = null;
Statement statement = null;
try {
//Make sure to choose appropriate Redshift Jdbc driver and its jar in classpath
Class.forName("com.amazon.redshift.jdbc42.Driver");
Properties props = new Properties();
props.setProperty("user", "username***");
props.setProperty("password", "password****");
System.out.println("\n\nconnecting to database...\n\n");
//Note the jdbc:redshift URL; with the PostgreSQL driver it would be jdbc:postgresql instead.
conn = DriverManager.getConnection("jdbc:redshift://********url-to-redshift.redshift.amazonaws.com:5439/example-database", props);
System.out.println("\n\nConnection made!\n\n");
statement = conn.createStatement();
String command = "COPY my_table from 's3://path/to/csv/example.csv' CREDENTIALS 'aws_access_key_id=******;aws_secret_access_key=********' CSV DELIMITER ',' ignoreheader 1";
System.out.println("\n\nExecuting...\n\n");
statement.executeUpdate(command);
//You must commit if you really want the data to be copied.
conn.commit();
System.out.println("\n\nThats all copy using simple JDBC.\n\n");
statement.close();
conn.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}
}

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job that accesses data from both an HBase table and HDFS files. E.g., the mapper reads data from an HBase table and from HDFS files; these data share the same primary key but have different schemas. A reducer then joins all the columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have some ideas. Sample code would be great.
After a few days of investigation (and getting help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {
public static class Map extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String s = value.toString();
String[] sa = s.split(",");
if (sa.length == 2) {
context.write(new Text(sa[0]), new Text(sa[1]));
}
}
}
public static class TableMap extends TableMapper<Text, Text> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "c1".getBytes();
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
String key = Bytes.toString(row.get());
String val = new String(value.getValue(CF, ATTR1));
context.write(new Text(key), new Text(val));
}
}
public static class Reduce extends Reducer <Object, Text, Object, Text> {
public void reduce(Object key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String ks = key.toString();
for (Text val : values){
context.write(new Text(ks), val);
}
}
}
public static void main(String[] args) throws Exception {
Path inputPath1 = new Path(args[0]);
Path inputPath2 = new Path(args[1]);
Path outputPath = new Path(args[2]);
String tableName = "test";
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MixMR.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.addFamily(Bytes.toBytes("cf"));
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
TableMap.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setReducerClass(Reduce.class); // reducer class
job.setOutputFormatClass(TextOutputFormat.class);
// inputPath1 here has no effect for HBase table
MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.waitForCompletion(true);
}
}
There is no OOTB feature that supports this. A possible workaround could be to scan your HBase table and write the Results to an HDFS file first, and then do the reduce-side join using MultipleInputs. But this will incur some additional I/O overhead.
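A minimal sketch of that export step, i.e. a map-only job that dumps the HBase rows to text files on HDFS (same column family as in the accepted answer above; the enclosing class name and output path are placeholders):
public static class ExportMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        String key = Bytes.toString(row.get());
        String val = Bytes.toString(value.getValue("cf".getBytes(), "c1".getBytes()));
        context.write(new Text(key), new Text(val)); // lands as key<TAB>value text on HDFS
    }
}
public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "HBaseExport");
    job.setJarByClass(HBaseExport.class); // hypothetical enclosing class
    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    TableMapReduceUtil.initTableMapperJob("test", scan, ExportMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(0); // map-only export
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
}
The text files produced this way can then be fed to MultipleInputs alongside the other HDFS inputs for the reduce-side join.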
A Pig script or Hive query can do that easily.
Sample Pig script:
tbl = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:* ...', '-loadKey true -limit 5')
AS (id:bytearray, info_map:map[],...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO ...

Cassandra MapReduce for TimeUUID columns

I recently set up a 4-node Cassandra cluster for learning, with one column family that holds time series data as:
Key -> {column name: timeUUID, column value: CSV log line, TTL: 1 year}. I use the Netflix Astyanax Java client to load about 1 million log lines.
I also configured Hadoop to run map-reduce jobs, with 1 namenode and 4 datanodes, to run some analytics on the Cassandra data.
All the available examples on the internet use a column name in the SlicePredicate for the Hadoop job configuration, whereas I have timeUUIDs as column names. How can I efficiently feed Cassandra data to the Hadoop job configuration in batches of 1000 columns at a time?
There are more than 10000 columns for some rows in this test data, and I expect more in the real data.
I configure my job as:
public int run(String[] arg0) throws Exception {
Job job = new Job(getConf(), JOB_NAME);
job.setJarByClass(LogTypeCounterByDate.class);
job.setMapperClass(LogTypeCounterByDateMapper.class);
job.setReducerClass(LogTypeCounterByDateReducer.class);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
ConfigHelper.setRangeBatchSize(getConf(), 1000);
SliceRange sliceRange = new SliceRange(ByteBuffer.wrap(new byte[0]),
ByteBuffer.wrap(new byte[0]), true, 1000);
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.setSlice_range(sliceRange);
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setInputRpcPort(job.getConfiguration(), INPUT_RPC_PORT);
ConfigHelper.setInputInitialAddress(job.getConfiguration(), INPUT_INITIAL_ADRESS);
ConfigHelper.setInputPartitioner(job.getConfiguration(), INPUT_PARTITIONER);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), slicePredicate);
FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
job.waitForCompletion(true);
return job.isSuccessful() ? 0 : 1;
}
But I am not able to understand how to define the Mapper; could you kindly provide a template for the Mapper class?
public static class LogTypeCounterByDateMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
{
private Text key = null;
private LongWritable value = null;
@Override
protected void setup(Context context){
}
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context){
//String[] lines = columns.;
}
}
ConfigHelper.setRangeBatchSize(getConf(), 1000);
...
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(TimeUUID.asByteBuffer(startValue), TimeUUID.asByteBuffer(endValue), false, 1000));
ConfigHelper.setInputSlicePredicate(conf, predicate);
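The answer above covers the slice predicate; for the mapper body itself, a minimal sketch could look like the following (the CSV parsing and the output key format are assumptions, since the real log schema is not shown):
public static class LogTypeCounterByDateMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    public void map(ByteBuffer rowKey, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        for (IColumn column : columns.values()) {
            // The column name is the 16-byte TimeUUID; read it from a duplicate so the buffer is untouched.
            ByteBuffer name = column.name().duplicate();
            UUID timeUuid = new UUID(name.getLong(), name.getLong());
            // Convert the 100-ns UUID timestamp (epoch 1582-10-15) to Unix milliseconds.
            long millis = (timeUuid.timestamp() - 0x01b21dd213814000L) / 10000L;
            String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date(millis));
            // The column value is the CSV log line; assume the first field is the log type.
            String logLine = ByteBufferUtil.string(column.value());
            String logType = logLine.split(",")[0];
            context.write(new Text(logType + "\t" + day), ONE);
        }
    }
}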