MapReduce job failed in oozie - mapreduce

I've a map only job which takes sequence file (key is Text, value is BytesWritable) as input and output data in to sequence file (key is NullWritable, value is Text).
Java class
import java.io.*;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
public class Test {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "Test");
job.setJarByClass(Test.class);
job.setMapperClass(TestMapper.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Text.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.submit();
}
public static class TestMapper extends Mapper<Text, BytesWritable, NullWritable, Text> {
Text outValue = new Text("");
int counter = 0;
public void map(Text filename, BytesWritable data, Context context) throws IOException, InterruptedException {
/ logic
}
}
}
It's working fine when running job from unix command, when the same job scheduled in oozie seeing below error
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
at Test$TestMapper.map(Test.java:56)
job configuration in oozie
<configuration>
<property>
<name>mapred.input.dir</name>
<value>${input}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/temp</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>Test$TestMapper</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>0</value>
</property>
<property>
<name>mapreduce.job.output.key.class</name>
<value>org.apache.hadoop.io.NullWritable</value>
</property>
<property>
<name>mapreduce.job.output.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.inputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat</value>
</property>
<property>
<name>mapreduce.job.outputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat</value>
</property>
<property>
<name>mapreduce.job.mapinput.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.job.mapinput.value.class</name>
<value>org.apache.hadoop.io.BytesWritable</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
Can someone tell me what is the error here.. thank you

The classcast exception indicates that Oozie is still using the default inputformat of TextInputFormat, which has a Key type of LongWritable. Since the mapper has a key type of Text, there is a type mismatch on ingestion by the mapper. So the config key of mapreduce.job.inputformat.class was incorrect.
(after some trial and error)
We found that the correct property name is mapreduce.inputformat.class, i.e.:
<property>
<name>mapreduce.inputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat</value>
</property>

Related

Checkpointing in flink

Trying to use checkpointing mechanism in flink with fs HDFS.
While connecting with hdfs://aleksandar/0.0.0.0:50010/shared/ i get the following error
Caused by: java.lang.IllegalArgumentException: Pathname /0.0.0.0:50010/shared/972dde22148f58ec9f266fb7bdfae891 from hdfs://aleksandar/0.0.0.0:50010/shared/972dde22148f58ec9f266fb7bdfae891 is not a valid DFS filename.
In the core-site settings i have the following configuration
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:123</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>

Python Load module in parent to prevent overwrite problems

I'm building a QGIS plugin with Python and designed a GUI for it. I can compile it with pyuic4, but on load it gives an error. I found out that I can prevent that error by adding the line below in the compiled Python code. Only at some moments I have to recompile and so the file gets overwritten and the line is lost.
form.py
from qgis.gui import QgsMapLayerComboBox
I have a 'parent' file which imports the compiled version like so:
dialog.py
from form import Ui_Dialog
Is there some way to import the QgsMapLayerComboBox in dialog.py so I don't have to add it every time to form.py after I recompiled my GUI?
EDIT:
<widget class="QgsMapLayerComboBox" name="mMapLayerComboBox">
<property name="geometry">
<rect>
<x>100</x>
<y>18</y>
<width>160</width>
<height>22</height>
</rect>
</property>
<property name="filters">
<set>QgsMapLayerProxyModel::RasterLayer</set>
</property>
</widget>
</widget>
<customwidgets>
<customwidget>
<class>QgsMapLayerComboBox</class>
<extends>QComboBox</extends>
<header>qgsmaplayercombobox.h</header>
</customwidget>
</customwidgets>
Open your form.ui with some text editor and replace:
<customwidget>
<class>QgsMapLayerComboBox</class>
<extends>QComboBox</extends>
<header>qgsmaplayercombobox.h</header>
</customwidget>
with
<customwidget>
<class>QgsMapLayerComboBox</class>
<extends>QComboBox</extends>
<header>qgis.gui</header>
</customwidget>
and compile again.

Error: Java heap space Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

Error: Java heap space Container killed by the ApplicationMaster. Container killed on request. Exit code is 143.
The hadoop cluster has 3 machines, one of them is the master,others are datanode,the machine's RAM is 8G.
the yarn-site.xml:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>Hadoop1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
</configuration>
the mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop1:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.input.dir.recursive</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3072m</value>
</property>
</configuration>``
when I run the mapreduce,get the error:Error: Java heap space Container killed by the ApplicationMaster. Container killed on request. Exit code is 143.
The input files are 500M,the reduce number is 4. when the input files less then 300M, the program can run good.

Why QSpinBox jumps twice the Step value

I have a simple python qt application in which I want to plot something dynamically via matplotlib. I was trying to follow instructions like this and that. It works as expected except for one thing, my spinbox updates twice upon mouse click and I do not understand why. Well, during the development the valueChanged.connect() worked fine until I got to a certain point. Then the update_figure() was called twice and the displayed value jumped by two instead of the desired one. I tried to make a minimum working example, you find below. Without the sleep command in update_figue my spinbox behaves normal, with sleep it is called twice and the value jumps by two, with an delay given by sleep.
Why does the sleep in update_figure influence the spinbox, or what is happening here? Is it the way I constructed the connect()?
BTW: I observed the same behaviour on a slider.
Cheers
(I do not think it matters, but this is on openSuse Leap)
EDIT: I could make the example even "more minimal".
from PyQt5.uic import loadUiType
from PyQt5.QtWidgets import QApplication, QMainWindow
import sys
from time import sleep
Ui_MainWindow, QMainWindow = loadUiType('main.ui')
class Main(QMainWindow, Ui_MainWindow):
def __init__(self, ):
super(Main, self).__init__()
self.setupUi(self)
self.spinBox_dac.valueChanged.connect(self.update_figure)
def update_figure(self):
##### The following line makes the difference!
sleep(1) #####################################
##### why?
print "damn...once or twice?"
return
if __name__ == '__main__':
app = QApplication(sys.argv)
main = Main()
main.show()
sys.exit(app.exec_())
Here is the main.ui:
<?xml version="1.0" encoding="UTF-8"?>
<ui version="4.0">
<class>MainWindow</class>
<widget class="QMainWindow" name="MainWindow">
<property name="geometry">
<rect>
<x>0</x>
<y>0</y>
<width>185</width>
<height>107</height>
</rect>
</property>
<property name="sizePolicy">
<sizepolicy hsizetype="Fixed" vsizetype="Maximum">
<horstretch>0</horstretch>
<verstretch>0</verstretch>
</sizepolicy>
</property>
<property name="maximumSize">
<size>
<width>1100</width>
<height>905</height>
</size>
</property>
<property name="windowTitle">
<string>mini</string>
</property>
<widget class="QWidget" name="centralwidget">
<widget class="QSpinBox" name="spinBox_dac">
<property name="geometry">
<rect>
<x>80</x>
<y>40</y>
<width>61</width>
<height>28</height>
</rect>
</property>
<property name="minimum">
<number>3</number>
</property>
<property name="maximum">
<number>20</number>
</property>
<property name="singleStep">
<number>1</number>
</property>
<property name="value">
<number>4</number>
</property>
</widget>
</widget>
</widget>
<resources/>
<connections/>
</ui>
The sleep() call prevent qt to process the mouse-release and the QSpinBox "thinks" that the mouse is pressed long enough to increase/decrease the value again (auto-repeat).
See https://bugreports.qt.io/browse/QTBUG-33128
A solution could be:
def update_figure(self):
for _ in xrange(10):
QApplication.processEvents()
sleep(0.1)
print "Only once :-)"
Anyway it's not a good practice to block the UI while performing long operations, the work is usually delegated to another thread.
EDIT
This affect Qt4 and Qt5 both, but it behave a bit differently.
There is no setAutoRepeat because in the QAbstrectSpinBox this behavior is implemented using styles, while in QAbstractButton is implemented with a timer.
The way around this is to create a ProxyStyle and set a very long "auto-repeat" interval
from PyQt5.uic import loadUiType
from PyQt5.QtWidgets import QApplication, QMainWindow, QProxyStyle, QStyle
import sys
from time import sleep
Ui_MainWindow, QMainWindow = loadUiType('main.ui')
class CustomStyle(QProxyStyle):
def styleHint(self, hint, option=None, widget=None, returnData=None):
if hint == QStyle.SH_SpinBox_KeyPressAutoRepeatRate:
return 10**10
elif hint == QStyle.SH_SpinBox_ClickAutoRepeatRate:
return 10**10
elif hint == QStyle.SH_SpinBox_ClickAutoRepeatThreshold:
# You can use only this condition to avoid the auto-repeat,
# but better safe than sorry ;-)
return 10**10
else:
return super().styleHint(hint, option, widget, returnData)
class Main(QMainWindow, Ui_MainWindow):
def __init__(self, ):
super(Main, self).__init__()
self.setupUi(self)
self.spinBox_dac.setStyle(CustomStyle())
self.spinBox_dac.valueChanged.connect(self.update_figure)
def update_figure(self):
sleep(1)
print "Only once!"
if __name__ == '__main__':
app = QApplication(sys.argv)
main = Main()
main.show()
sys.exit(app.exec_())

How to schedule Hbase Map-Reduce Job by oozie?

I want to schedule a Hbase Map-Reduce job by Oozie.I am facing following problem .
How/Where to specify these properties in oozie workflow ?
( i> Table name for Mapper/Reducer
ii> scan object for Mapper )
Scan scan = new Scan(new Get());
scan.setMaxVersions();
scan.addColumn(Bytes.toBytes(FAMILY),
Bytes.toBytes(VALUE));
scan.addColumn(Bytes.toBytes(FAMILY),
Bytes.toBytes(DATE));
Job job = new Job(conf, JOB_NAME + "_" + TABLE_USER);
// These two properties :-
TableMapReduceUtil.initTableMapperJob(TABLE_USER, scan,
Mapper.class, Text.class, Text.class, job);
TableMapReduceUtil.initTableReducerJob(DETAILS_TABLE,
Reducer.class, job);
or
please let me know the best way to schedule a Hbase Map-Reduce Job by Oozie .
Thanks :) :)
The best way(According to me ) to schedule a Hbase Map_Reduce job is to schedule it as a .java file .
It works well and there is no need to write code to change your scan to string , etc.
So i am scheduling my jobs like java file till i get any better option .
workflow-app xmlns="uri:oozie:workflow:0.1" name="java-main-wf">
<start to="java-node"/>
<action name="java-node">
<java>
<job-tracker></job-tracker>
<name-node></name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>org.apache.oozie.example.DemoJavaMain</main-class>
<arg>Hello</arg>
<arg>Oozie!</arg>
<arg>This</arg>
<arg>is</arg>
<arg>Demo</arg>
<arg>Oozie!</arg>
</java>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
You can also schedule the job using <Map-reduce> tag , but it is not as easy as scheduling it as java file. It requires a considerable effort, but can be considered as an alternate approach.
<action name='jobSample'>
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<!-- This is required for new api usage -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>
<!-- HBASE CONFIGURATIONS -->
<property>
<name>hbase.mapreduce.inputtable</name>
<value>TABLE_USER</value>
</property>
<property>
<name>hbase.mapreduce.scan</name>
<value>${wf:actionData('get-scanner')['scan']}</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>${hbaseZookeeperClientPort}</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>${hbaseZookeeperQuorum}</value>
</property>
<!-- MAPPER CONFIGURATIONS -->
<property>
<name>mapreduce.inputformat.class</name>
<value>org.apache.hadoop.hbase.mapreduce.TableInputFormat</value>
</property>
<property>
<name>mapred.mapoutput.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.mapoutput.value.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>com.hbase.mapper.MyTableMapper</value>
</property>
<!-- REDUCER CONFIGURATIONS -->
<property>
<name>mapreduce.reduce.class</name>
<value>com.hbase.reducer.MyTableReducer</value>
</property>
<property>
<name>hbase.mapred.outputtable</name>
<value>DETAILS_TABLE</value>
</property>
<property>
<name>mapreduce.outputformat.class</name>
<value>org.apache.hadoop.hbase.mapreduce.TableOutputFormat</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>${mapperCount}</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>${reducerCount}</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
</map-reduce>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Map/Reduce failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
To know more about the property name and value , dump the configration parameter.
Also, the scan property is some serialization of the scan information (a Base 64 encoded version) so not sure how to specify this -
scan.addColumn(Bytes.toBytes(FAMILY),
Bytes.toBytes(VALUE));