Problems Integrating Hadoop 3.x on Flink cluster - hdfs

I am facing some issues while trying to integrate Hadoop 3.x version on a Flink cluster. My goal is to use HDFS as a persistent storage and store checkpoints. I am currectly using Flink 1.13.1 and HDFS 3.3.1. The error that I am getting while trying to submit a job is that HDFS is not supported as a file system. In the standalone version, this error was solved by specifying the HADOOP_CLASSPATH on my local machine. As a next step I applied the solution above in all the machines that are used in my cluster and in the standalone version I managed to successfully submit my jobs in all of them without facing any issues. However, when I started modifying the configurations to setup my cluster (by specifying the IPs of my machines) that problem came up once again. What I am missing?
In Hadoop 2.x there are the pre-bundled jar files in the official flink download page that would solve similar issues in the past but that's not the case with Hadoop 3.x versions

It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.

For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail that I was missing was in the definition of the environment variables.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. This works in the standalone cluster which is not the case with a distributed cluster due to the scope of the script. What actually worked for me was defining my variables(and $HADOOP_CLASSPATH) in the /etc/profile
I also managed to find another solution while was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, in Hadoop 2.x there are pre-bundled jar files in the official Flink download page to support HDFS integration, which is not the case in Hadoop 3.x. I found the following maven repository page and after testing all of the existing jars I managed to find one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar (Placed the jar in the $FLINK_HOME/lib) worked. While it is not an "official solution" it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos

Related

where to write DAG files in apache air flow?

I just started learning apache airflow, and I created an environment in composer in gcp and web server is working fine and everything, but I was just confused about where to write the DAG file? I mean I want to write the file where I can test it multiple times because in the web UI it's showing me a bucket where I can store the file, but I am unable to understand where to write the code. do I have to install airflow in my machine?
p.s - i know this is a stupid question, any help will be appreciated
You can install it locally yes. If you want to test it locally this is the only way I think.
There are couple of tools that you could do that - there is an astro CLI for managing your "dag development" environment which is published by Astronomer, https://github.com/astronomer/astro-cli
Also MWAA has their own tool too - I think, I think Composer has no Composer-specific one.
However for "generic" Airflow (which should be enough to start), you can use the community managed quick-start (either with local venv or Docker-Compose):
https://airflow.apache.org/docs/apache-airflow/stable/start/index.html

AWS Elastic Beanstalk Python 3.7 Deployment Location

I'm trying to upgrade an existing application on AWS from the now deprecated Python 3.4 platform to 3.7 with Amazon Linux 2/3.0.1, and in the process I ran into an issue with where the application source code is deployed on the EC2 instance.
From some empirical testing, I found that instead of /opt/python/current/app directory that most if not all AWS documentations say (e.g Troubleshooting issues with the EB CLI - AWS Elastic Beanstalk), with Python 3.7 it is actually deployed in /var/app/current/. I wasn't able to find any documentation regarding this change, and it is causing some issues with the application. I'm wondering is there any reason that this change is made? And if it is possible to revert it, how to do so?
Thanks in advance!
This is because the 3.7 Python Elastic Beanstalk distribution uses Amazon Linux 2 which is fundamentally different from the AMI predecessor. If you opt to use Python 3.6 instead you should be able to avoid this issue, as it runs on the earlier Linux version where deployments still occur in /opt/var/app/current. Most tutorials I've found are designed to work with this older rollout, including the most up-to-date Amazon start guide.
If you have the time, try migrating your code to the newer version, as this seems to be the workflow Amazon is embracing going forward, for all newer versions of Python (such as 3.8 and others yet to come).

Using spark with latest aws-java-sdk?

We are currently using spark 2.1 with hadoop 2.7.3 and I know and can't believe that spark still requires aws-java-sdk version 1.7.4. We are using a maven project and I was wondering if there is any way to setup libraries or my environment to be able to use spark 2.1 along with other applications that use the latest aws-java-sdk? I guess it's the same thing as asking if it's possible to setup a workflow that uses different versions of the aws-java-sdk and then when I want to run the jar on a cluster I could just point to the latest aws-java-sdk. I know I could obviously maintain to separate projects one for spark and one for pure sdk work but I'd like to just have them in the same project.
use spark 2.1 along with other applications that use the latest aws-java-sdk
You can try to use the Maven Shade Plugin when you create your JAR, then ensure the user classpath is before the Hadoop classpath (spark.executor.userClassPathFirst). This will ensure you're loading all dependencies included by Maven, not what's provided by Spark
I've done this with Avro before, but I know that the AWS SDK has more to it

How to run pdftk on elastic beanstalk

I am trying to run pdftk on an Elastic Beanstalk. The first problem I run into is that I cannot install pdftk on an instance of a Amazon Linux AMI because one of the dependencies (gcj) is not supported.
One of the options I am looking at is creating my own AMI and using that for my Elastic Beanstalk. Amazon recommends not doing this, and there are no community images for EB and Ubuntu.
Another option is using Docker. I am not as familiar with Docker, but I think I would be able to install pdftk in a container and then deploy that to EB. I am using Codeship for deployments and it looks like they have some options for Docker. (This is the options I'm currently exploring)
The last option I can think of is writing a library for encrypting pdfs on my own. I had a look at the encryption specifications for pdfs and I think this is not a time efficient option.
Has any one had a similar problem and found a good solution to the problem?
UPDATE:
After some more research I discovered that the issue was not with Amazon Linux bug with Fedora. Fedora dropped gcj because there was a lack of maintainers on the project, then dropped pdftk because it depends on gcj.
If you need another pdf tool kit I have found podofo to be a good replacement for what I've needed.
First I apologise for resurrecting an old thread! Recently we wanted to create an Elastic Beanstalk worker environment that uses pdftk. Of course we also stumbled on the same issue, so this is what we did and it works for us so far. I hope it'll work for others too.
In the .ebextensions folder add the linked configs:
The needed LaTeX packages:
packages.config
You'll also need to add the el5 library in order to install libgcj.
01_el5_yum.config
Next add this config with the commands to install libgcj, pdftk and pdfjam
02_pdftk.config
And that should be it.
In case anyone comes here having problems with pdftk - poppler-utils also cover some tasks done by pdftk (in my case it was pdf splitting) and can be easily set up on an EB instance through .ebextensions:
packages:
yum:
poppler-utils: []

Zeppelin: how to download or save the Zeppelin notebook?

I am using Zeppelin sandbox with aws EMR.
Is there a way to download or save the zeppelin notebook in a way so that it can be imported into another Zeppelin server ?
As noted in the comments above, this feature is available starting in version 0.5.6. You can find more details in the release notes. Downloading and installing this version would solve that issue.
Given that you are using EMR, it looks like you will have to work with the version available. As Samuel mentioned above, you can backup the contents of the incubator-zeppelin/notebook folder and make the transfer.