Hadoop 0.20.205 Job (and not JobConf) Bzip2 compression

In Hadoop 0.20.2 one can add input/output compression to the JobConf in the following way:
jobConf.setBoolean("mapred.output.compress", true);
jobConf.setClass("mapred.output.compression.codec", BZip2Codec.class, CompressionCodec.class);
JobConf is deprecated and Job should be used instead. How can I add compression/decompression there? In particular, how can I change the WordCount example to read bzip2 input files?
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(TokenCounterReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Use the Configuration class as below when submitting the Job (the codec class alone is not enough; output compression also has to be switched on):
Configuration conf = new Configuration();
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.BZip2Codec");
Job job = new Job(conf);
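Alternatively, a minimal sketch (not part of the original answer) using the new-API FileOutputFormat helpers, which set the same properties without spelling out the key names:
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// ... after the Job has been created:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);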

This is the way I found to compress the output:
Job job = new Job(conf, "FromToWordStatistics");
job.setJarByClass(FromToWordStatistics.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(20);
SequenceFileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
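As for the bzip2 input side of the question: TextInputFormat picks the decompression codec from the file extension (.bz2) via CompressionCodecFactory, so compressed input files usually need no extra job configuration. If intermediate (map output) compression is also wanted, a hedged sketch using the 0.20-era property names:
// set on the Configuration before the Job is created
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", BZip2Codec.class, CompressionCodec.class);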

How to collect map or reduce step results for each worker in MapReduce processing?

I want to verify the map and reduce step results, so I need to know their results for each worker. But I don't know how to collect the inputs and outputs per worker (node).
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
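A hedged illustration of one possible angle (an assumption, not something stated in the question): each map and reduce task already produces its own part file and its own task log, and per-node activity can additionally be surfaced through custom counters keyed by the host name, for example in the mapper's setup(). The counter group name "NodesUsed" below is an arbitrary label:
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Record which host ran this task attempt; shows up in the job counters UI
        String host = java.net.InetAddress.getLocalHost().getHostName();
        context.getCounter("NodesUsed", host).increment(1);
    }
    // map() as in the standard WordCount example
}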

MapReduce and HCatalog integration fails to use the MySQL metastore

Environment: HDP 2.3 Sandbox
Problem: I have created a table in Hive with just two columns. Now I want to read it in my MR code using HCatalog integration. The MR job fails to read the table from the MySQL metastore. It uses Derby for some reason and hence fails with a "table not found" message.
Job Client code:
public class HCatalogMRJob extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        String inputTableName = args[0];
        String outputTableName = args[1];
        String dbName = null;
        Job job = new Job(conf, "HCatalogMRJob");
        HCatInputFormat.setInput(job, dbName, inputTableName);
        job.setInputFormatClass(HCatInputFormat.class);
        job.setJarByClass(HCatalogMRJob.class);
        job.setMapperClass(HCatalogMapper.class);
        job.setReducerClass(HCatalogReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
        HCatSchema s = HCatOutputFormat.getTableSchema(conf);
        System.err.println("INFO: output schema explicitly set for writing: " + s);
        HCatOutputFormat.setSchema(job, s);
        job.setOutputFormatClass(HCatOutputFormat.class);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HCatalogMRJob(), args);
        System.exit(exitCode);
    }
}
Job Run Command:
hadoop jar mr-hcat.jar input_table out_table
Before running this command, I set the necessary HCatalog and Hive jars on the classpath using the HADOOP_CLASSPATH variable.
Question:
Now, how do I make the job use hive-site.xml correctly?
I tried adding it to the classpath via the same HADOOP_CLASSPATH as mentioned above, but it still fails.
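One hedged possibility, mirroring the core-site.xml fix shown for the Pail question further down: load hive-site.xml into the job Configuration before calling HCatInputFormat.setInput, so that the metastore settings (e.g. hive.metastore.uris) are picked up instead of the embedded Derby default. The file path below is an assumption for an HDP sandbox:
Configuration conf = getConf();
// assumed sandbox location of hive-site.xml; adjust to your installation
conf.addResource(new Path("/etc/hive/conf/hive-site.xml"));
Job job = new Job(conf, "HCatalogMRJob");
HCatInputFormat.setInput(job, dbName, inputTableName);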

Unable to create file using Pail DFS

Newbie here. I am trying to run the DFS datastore code using Pail from Nathan Marz's book Big Data. What am I doing wrong? I am trying to connect to an HDFS VM. I also tried replacing hdfs with file. Any help appreciated.
public class AppTest
{
    private App app = new App();
    private String path = "hdfs:////192.168.0.101:8080/mypail";

    @Before
    public void init() throws IllegalArgumentException, IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.delete(new Path(path), true);
    }

    @Test
    public void testAppAccess() throws IOException {
        Pail pail = Pail.create(path);
        TypedRecordOutputStream os = pail.openWrite();
        os.writeObject(new byte[] {1, 2, 3});
        os.writeObject(new byte[] {1, 2, 3, 4});
        os.writeObject(new byte[] {1, 2, 3, 4, 5});
        os.close();
    }
}
I get this error:
java.lang.IllegalArgumentException: Wrong FS: hdfs:/192.168.0.101:8080/mypail, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
On replacing hdfs with file (as file:///) I get:
java.io.IOException: Mkdirs failed to create file:/192.168.0.101:8080/mypail (exists=false, cwd=file:/Users/joshi/git/projectcsr/projectcsr)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at
I came across the same problem and I solved it! You should add your core-site.xml to the Hadoop Configuration object; something like this should work:
Configuration cfg = new Configuration();
Path core_site_path = new Path("path/to/your/core-site.xml");
cfg.addResource(core_site_path);
FileSystem fs = FileSystem.get(cfg);
I guess you could also do the same programmatically, by setting the property fs.defaultFS on the cfg object.
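For instance, a minimal sketch of that programmatic alternative (the host and port below are simply the values from the question; the actual NameNode address may differ on your VM):
Configuration cfg = new Configuration();
// assumed NameNode address; adjust host/port to your cluster
cfg.set("fs.defaultFS", "hdfs://192.168.0.101:8080");
FileSystem fs = FileSystem.get(cfg);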
Source:
http://opensourceconnections.com/blog/2013/03/24/hdfs-debugging-wrong-fs-expected-file-exception/

MapReduce - code written in configure is not reachable

I wanted to move some files to the input folder and tried the required code by placing it in the configure() method (I had to use the old mapred API due to some constraints I have).
But somehow the code in configure() is not being executed.
I have since achieved my requirement in another, better way. Though this was a clumsy approach, I wanted to know why it is not being executed. I have checked the JobTracker and all variables got the right values.
Code:
in main:
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
String date = sdf.format(new Date());
if (args[3].toString().equalsIgnoreCase("net")) {
    imgInpPath = "/user/mapreduce/output/net/" + date + "/";
    inputDir = inputDir + "net/";
}
conf.set("imgInpPath", imgInpPath);
conf.set("inputDir", inputDir);
FileInputFormat.setInputPaths(conf, inputDir);
in configure():
inputPath = conf.get("inputDir");
Path inputImgPath = new Path(conf.get("imgInpPath"));
Configuration config = new Configuration();
FileSystem fileSystem = FileSystem.get(config);
inputImgPath = fileSystem.makeQualified(inputImgPath);
FileStatus[] status = fileSystem.listStatus(inputImgPath, new PathFilter() {
    @Override
    public boolean accept(Path name) {
        return name.getName().contains("part");
    }
});
for (int i = 0; i < status.length; i++) {
    Path inpPath = status[i].getPath();
    FileUtil.copy(fileSystem, inpPath, fileSystem, new Path(inputPath), true, conf);
}
As I said, the requirement was achieved in another way, but I wanted to know why this code is not being executed, irrespective of the requirement.
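For reference, a minimal sketch of how configure() is normally wired up in the old mapred API (the class and field names below are assumptions, not from the question): configure(JobConf) is overridden from MapReduceBase, runs once per task attempt on the task node before any map() calls, and is the point where values set on the JobConf in main() become visible.
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class ImageMoveMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private JobConf conf;
    private String inputDir;  // hypothetical field, mirroring the question's variables

    @Override
    public void configure(JobConf job) {
        // Runs on the task node, once per task attempt, before map() is called
        this.conf = job;
        this.inputDir = job.get("inputDir");
    }

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // ... mapper logic ...
    }
}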

MapReduce job with mixed data sources: HBase table and HDFS files

I need to implement an MR job which accesses data from both an HBase table and HDFS files. E.g., the mapper reads data from the HBase table and from HDFS files; these data share the same primary key but have different schemas. A reducer then joins all columns (from the HBase table and the HDFS files) together.
I tried looking online and could not find a way to run an MR job with such mixed data sources. MultipleInputs seems to work only for multiple HDFS data sources. Please let me know if you have some ideas. Sample code would be great.
After a few days of investigation (and help from the HBase user mailing list), I finally figured out how to do it. Here is the source code:
public class MixMR {

    public static class Map extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String s = value.toString();
            String[] sa = s.split(",");
            if (sa.length == 2) {
                context.write(new Text(sa[0]), new Text(sa[1]));
            }
        }
    }

    public static class TableMap extends TableMapper<Text, Text> {
        public static final byte[] CF = "cf".getBytes();
        public static final byte[] ATTR1 = "c1".getBytes();

        public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
            String key = Bytes.toString(row.get());
            String val = new String(value.getValue(CF, ATTR1));
            context.write(new Text(key), new Text(val));
        }
    }

    public static class Reduce extends Reducer<Object, Text, Object, Text> {
        public void reduce(Object key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String ks = key.toString();
            for (Text val : values) {
                context.write(new Text(ks), val);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path inputPath1 = new Path(args[0]);
        Path inputPath2 = new Path(args[1]);
        Path outputPath = new Path(args[2]);
        String tableName = "test";

        Configuration config = HBaseConfiguration.create();
        Job job = new Job(config, "ExampleRead");
        job.setJarByClass(MixMR.class); // class that contains mapper

        Scan scan = new Scan();
        scan.setCaching(500);       // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false); // don't set to true for MR jobs
        scan.addFamily(Bytes.toBytes("cf"));

        TableMapReduceUtil.initTableMapperJob(
                tableName,      // input HBase table name
                scan,           // Scan instance to control CF and attribute selection
                TableMap.class, // mapper
                Text.class,     // mapper output key
                Text.class,     // mapper output value
                job);

        job.setReducerClass(Reduce.class); // reducer class
        job.setOutputFormatClass(TextOutputFormat.class);

        // inputPath1 here has no effect for HBase table
        MultipleInputs.addInputPath(job, inputPath1, TextInputFormat.class, Map.class);
        MultipleInputs.addInputPath(job, inputPath2, TableInputFormat.class, TableMap.class);

        FileOutputFormat.setOutputPath(job, outputPath);
        job.waitForCompletion(true);
    }
}
There is no OOTB feature that supports this. A possible workaround could be to Scan your HBase table and write the Results to an HDFS file first, and then do the reduce-side join using MultipleInputs. But this will incur some additional I/O overhead.
A Pig script or Hive query can do that easily.
Sample Pig script:
tbl = LOAD 'hbase://SampleTable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:* ...', '-loadKey true -limit 5')
AS (id:bytearray, info_map:map[],...);
fle = LOAD '/somefile' USING PigStorage(',') AS (id:bytearray,...);
Joined = JOIN tbl BY id, fle BY id;
STORE Joined INTO '...';