Cron-like scheduling using Apache Oozie - mapreduce

http://blog.cloudera.com/blog/2014/04/how-to-use-cron-like-scheduling-in-apache-oozie/
I referred to this link, which shows how to schedule jobs using a cron-like syntax. But it covers scheduling in CDH4 and CDH5. Can anyone tell me how this can be achieved in CDH3? We have a CDH3 cluster.

Unfortunately, the 'cron-like' scheduling functionality wasn't included until Oozie 4.1.0 (https://issues.apache.org/jira/browse/OOZIE-1306). CDH3 only ships a 2.x.x release, I believe.
However, you may be in luck if you're willing to compile Oozie by hand and install/upgrade your current version. These are the only requirements for Oozie 4.1.0:
Unix box (tested on Mac OS X and Linux)
Java JDK 1.6+
Maven 3.0.1+
Hadoop 0.20.2+
Pig 0.7+
All of which are covered in CDH3. https://oozie.apache.org/docs/4.1.0/DG_QuickStart.html
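For reference, once you have 4.1.0 running, the cron expression goes directly into the coordinator's frequency attribute. A minimal sketch, assuming an illustrative app name, workflow path, and date range (run at 02:00 every weekday):

    <coordinator-app name="daily-etl" frequency="0 2 * * MON-FRI"
                     start="2014-05-01T00:00Z" end="2015-05-01T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
        <!-- the workflow to run on each scheduled materialization -->
        <action>
            <workflow>
                <app-path>hdfs://namenode:8020/user/me/my-workflow</app-path>
            </workflow>
        </action>
    </coordinator-app>

The Cloudera post linked in the question walks through the same idea in more detail.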

Related

Will there be any compatibility issues if I upgrade my Databricks Runtime version

Will there be any issues in my current notebooks and jobs if I upgrade my Databricks Runtime version from 9.1 LTS to 10.4 LTS?
I haven't tried upgrading the version yet. If I upgrade it, will I be able to change it back to the previous version?
It's really a very broad question; the exact answer depends on the features and libraries/connectors that you're using in your code. You can refer to the Databricks Runtime 10.x migration guide and the Spark 3.2.1 migration guide for more information about the upgrade.
Usually, the correct approach is to try running your job with the new runtime, but in a test environment where your production data won't be affected.
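For example, if your jobs are defined through the Jobs API, the upgrade (and any later rollback) is just the spark_version field of the cluster spec; the node type and worker count below are illustrative placeholders:

    {
      "new_cluster": {
        "spark_version": "10.4.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    }

Because the runtime version is a per-cluster (or per-job-cluster) setting, going back to 9.1 LTS is just restoring the old value; the upgrade itself doesn't modify your notebooks.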

Problems integrating Hadoop 3.x with a Flink cluster

I am facing some issues while trying to integrate Hadoop 3.x with a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints there. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error that I get while trying to submit a job is that HDFS is not supported as a file system.
In the standalone version, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step, I applied that solution on all the machines used in my cluster, and in standalone mode I managed to submit my jobs on all of them without any issues. However, when I started modifying the configuration to set up my cluster (by specifying the IPs of my machines), the problem came up once again. What am I missing?
For Hadoop 2.x there are pre-bundled JAR files on the official Flink download page that would have solved similar issues in the past, but that's not the case for the Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail I was missing was in how the environment variables were defined.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. That works on a standalone cluster, but not on a distributed one, because .bashrc is only sourced for interactive shells and therefore isn't seen by the processes the cluster starts on the other nodes. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile, roughly as sketched below.
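A minimal sketch of the /etc/profile entries, assuming a typical Hadoop installation path (adjust to your layout):

    # /etc/profile - read by login shells on every node, unlike ~/.bashrc
    export HADOOP_HOME=/opt/hadoop-3.3.1
    export PATH="$HADOOP_HOME/bin:$PATH"
    # let Flink pick up the Hadoop jars
    export HADOOP_CLASSPATH=$("$HADOOP_HOME/bin/hadoop" classpath)

After editing, log in again (or source the file) on every node and restart the Flink cluster so its processes inherit the variable.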
I also managed to find another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled JAR files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the following Maven repository page, and after testing the existing JARs I managed to find one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 JAR worked (placed in $FLINK_HOME/lib). While it is not an "official" solution, it seems to fix the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos

Set up a cron job in Django on Windows

I want to set up a cron job in my Django project, but I am using Windows.
I tried django-cron, but it does not work on Windows.
How can I use cron-style job scheduling in my project?
django-cron cannot be used here, because Windows does not support cron job scheduling. Instead, you can use the Windows analog of the Unix cron command, called "schtasks", to schedule execution of your script, or use the Windows Task Scheduler GUI.
See more in the SO questions What is the Windows version of cron? and Schedule Python Script - Windows 7.
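For example, a daily task that invokes a Django management command could be registered roughly like this; the paths and the send_reports command name are illustrative, not part of any library:

    schtasks /Create /TN "django-send-reports" /SC DAILY /ST 06:00 ^
        /TR "C:\projects\mysite\venv\Scripts\python.exe C:\projects\mysite\manage.py send_reports"

Writing the periodic work as a custom management command keeps it testable from the shell (python manage.py send_reports) independently of the scheduler.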

I am using the Hortonworks sandbox for Hadoop... How do I debug Java code in Eclipse?

I'm using the Hortonworks sandbox for Hadoop, but I am using Eclipse on Windows for the Java code, and the sandbox is installed in Oracle VirtualBox. I created an executable JAR for the Java code and then ran it in the sandbox. How do I debug the Java code in Eclipse?
One way of debugging Hadoop code in Eclipse is to run the job in local mode, because the whole MapReduce job then runs inside a single JVM (the one Eclipse launched), so breakpoints in your mapper and reducer are actually hit. A sketch of such a driver follows.
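A minimal sketch of a local-mode driver, assuming Hadoop 2.x property names and illustrative input/output paths; wire in your own mapper and reducer classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalDebugDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "file:///");          // local file system instead of HDFS
            conf.set("mapreduce.framework.name", "local"); // run all tasks in this JVM

            Job job = Job.getInstance(conf, "local-debug");
            job.setJarByClass(LocalDebugDriver.class);
            // job.setMapperClass(MyMapper.class);   // your classes here
            // job.setReducerClass(MyReducer.class);
            FileInputFormat.addInputPath(job, new Path("input"));
            FileOutputFormat.setOutputPath(job, new Path("output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Run this main() under Eclipse's debugger on a small sample of your data; no VirtualBox round-trip is needed until the logic works.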
Also explore MRUnit and Jumbune's debugger.
https://github.com/impetus-opensource/jumbune
The blog posts below may also help you.
http://let-them-c.blogspot.com/2011/07/running-hadoop-locally-on-eclipse.html
http://let-them-c.blogspot.com/2011/07/configurations-of-running-hadoop.html
How to debug hadoop mapreduce jobs from eclipse?

RethinkDB Chef Solo cookbook

Is there any RethinkDB chef-solo cookbook that allows one to install the latest RethinkDB on Ubuntu 14.04 / AWS?
I tried a couple of options, but they didn't help:
https://github.com/vFense/rethinkdb-chef - how do I install the latest version?
https://github.com/sprij/rethinkdb-cookbook.git - source compilation takes hours
I would appreciate any help regarding this.
Thanks
Try the cookbook that is available from the community repository first:
https://supermarket.chef.io/cookbooks/rethinkdb
It claims to be integration tested on Ubuntu. If it doesn't work under chef-solo, then I'd advise you to switch to local-mode chef-client instead.
https://www.chef.io/blog/2013/10/31/chef-client-z-from-zero-to-chef-in-8-5-seconds/
PS: Also check out Berkshelf for managing cookbook dependencies; it's a standard tool in the Chef DK. A minimal local-mode run might look like the sketch below.
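This sketch assumes the supermarket cookbook's default recipe performs the install (check its README for attributes). First, a Berksfile:

    source 'https://supermarket.chef.io'
    cookbook 'rethinkdb'

then vendor the cookbooks and converge in local mode:

    berks vendor cookbooks
    chef-client --local-mode --override-runlist 'recipe[rethinkdb::default]'

Local mode (-z) spins up an in-memory chef-zero server, so no Chef server is required; this is the chef-solo replacement mentioned above.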
I updated rethinkdb-chef to work with the latest version of RethinkDB and removed the network portion of the .kitchen.yml file. I validated that this works on CentOS 6 and Ubuntu 14.04.
I still need to write tests as well as documentation. As per Mark's answer, try to use the community-supported version first. I created this cookbook so that I can customize it to my needs with vFense.