Using Spark with the latest aws-java-sdk?

We are currently using Spark 2.1 with Hadoop 2.7.3, and I can't believe that Spark still requires aws-java-sdk version 1.7.4. We are using a Maven project, and I was wondering whether there is any way to set up my libraries or environment so that I can use Spark 2.1 alongside other applications that use the latest aws-java-sdk. I guess it's the same as asking whether it's possible to set up a workflow that uses different versions of the aws-java-sdk, so that when I want to run the JAR on a cluster I can just point to the latest aws-java-sdk. I know I could maintain two separate projects, one for Spark and one for pure SDK work, but I'd like to keep them in the same project.

"use spark 2.1 along with other applications that use the latest aws-java-sdk"
You can try using the Maven Shade Plugin when you create your JAR, and then ensure the user classpath is loaded before the Hadoop classpath (spark.executor.userClassPathFirst). This way you load the dependencies bundled by Maven rather than what Spark provides.
I've done this with Avro before, but I know that the AWS SDK has more to it.
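As a rough sketch of the classpath side of that suggestion (the shade configuration itself lives in pom.xml and is omitted here), the relevant settings are plain Spark conf keys. The example below uses PySpark only for brevity; the same keys apply to a Maven-built Java/Scala job submitted with spark-submit, and the JAR path and app name are placeholders.

```python
# Illustrative sketch only: point Spark at the shaded JAR produced by the
# Maven Shade Plugin and ask it to prefer user classes over its own copies.
# spark.executor.userClassPathFirst / spark.driver.userClassPathFirst are
# real (experimental) Spark settings; the paths below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shaded-aws-sdk-job")
    .config("spark.jars", "target/my-app-1.0-shaded.jar")   # shaded JAR containing the newer aws-java-sdk
    .config("spark.executor.userClassPathFirst", "true")    # executors load user JARs before Spark/Hadoop ones
    .config("spark.driver.userClassPathFirst", "true")      # driver side: only takes effect at submit time
    .getOrCreate()
)
```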

Related

Problems Integrating Hadoop 3.x on Flink cluster

I am facing some issues while trying to integrate Hadoop 3.x on a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error I get while trying to submit a job is that HDFS is not supported as a file system. In the standalone setup, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step I applied the solution above on all the machines used in my cluster, and in the standalone setup I managed to submit my jobs on all of them without any issues. However, when I started modifying the configuration to set up my cluster (by specifying the IPs of my machines), the problem came up again. What am I missing?
For Hadoop 2.x there are pre-bundled JAR files on the official Flink download page that would have solved similar issues in the past, but that's not the case for Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail that I was missing was in the definition of the environment variables.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. That works for a standalone cluster, but not for a distributed cluster, because of the scope of the script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
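As an illustrative sanity check (not part of the original answer), something like the following can be run on each node through the same channel that starts the Flink processes, to confirm the variable is really visible in that environment:

```python
# check_hadoop_classpath.py -- illustrative sanity check: run it on each node
# through the same non-interactive channel that launches the Flink processes
# to confirm HADOOP_CLASSPATH is actually visible there.
import os
import sys

value = os.environ.get("HADOOP_CLASSPATH", "")
if not value:
    sys.exit("HADOOP_CLASSPATH is not set in this environment")
print("HADOOP_CLASSPATH has %d entries" % len(value.split(os.pathsep)))
```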
I also found another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled JAR files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the following Maven repository page and, after testing the available JARs, found one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 JAR worked (placed in $FLINK_HOME/lib). While it is not an "official" solution, it seems to resolve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos

Is it possible to use lambda layers with zappa?

I want to deploy my Wagtail project (Wagtail is a CMS based on Django) to an AWS Lambda function. The best option seems to be using zappa.
Wagtail needs opencv installed to support all of its features.
As you might know, just running pip install opencv-python is not enough, because opencv needs some OS-level packages to be installed. So before running pip install opencv-python, one has to install some packages on the Amazon Linux image in which the Lambda environment runs (yum install ...).
The only solution that came to my mind is using lambda layers to properly install opencv.
But I'm not sure whether it's possible to use lambda layers with projects deployed by zappa.
Any help or shared experience would be really appreciated!
There is an open pull request that is ready to merge, but needs additional user testing.
The older project has a pull request that claims layer support has been merged.
Feel free to try it out and let the maintainers know so documentation can be updated.
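In the meantime, one possible workaround, not specific to zappa but using the plain Lambda API through boto3, is to publish the OpenCV layer yourself and attach it to the function that zappa created. The function name, zip path, and runtime below are illustrative assumptions:

```python
# Hypothetical workaround while native layer support in zappa is pending:
# publish an OpenCV layer zip and attach it to the zappa-managed function.
import boto3

lam = boto3.client("lambda")

# 1. Publish the layer; the zip must contain python/... with cv2 and its
#    native .so dependencies built on Amazon Linux. For large archives,
#    Content can reference S3Bucket/S3Key instead of an inline ZipFile.
with open("opencv-layer.zip", "rb") as f:
    layer = lam.publish_layer_version(
        LayerName="opencv-python",
        Description="OpenCV built for Amazon Linux",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.8"],
    )

# 2. Attach it to the zappa-managed function, keeping any existing layers.
fn = "my-wagtail-app-dev"  # hypothetical name; zappa uses <project>-<stage>
current = lam.get_function_configuration(FunctionName=fn).get("Layers", [])
lam.update_function_configuration(
    FunctionName=fn,
    Layers=[l["Arn"] for l in current] + [layer["LayerVersionArn"]],
)
```

Note that a later zappa update may rewrite the function configuration, so this step might have to be re-applied (or scripted as a post-deploy step).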

Versions of Python & Spark to work with VS Code Notebooks

I'm developing scripts for AWS Glue and trying to mimic the development environment as closely as possible to their specs here. Since it's a bit costly to run a notebook server/development endpoint, I set everything up on my local machine instead and develop the scripts in VS Code notebooks, which I find convenient.
There are some problems with the notebook setup due to incompatible versions of the installed Python and Spark.
For Python, I went through a rough time cleaning things up, and the version is 3.8.3 now.
For Spark, I use the manual installation method with version 2.4.3, since I plan to use Scala alongside it at a later time. I installed the findspark package to load that version as expected.
And it doesn't work! The error was TypeError: an integer is required (got type bytes)
I've searched around, and people said to downgrade to Python 3.7 using pyenv; I got 3.7.7 installed but still had the same error.
As a last resort, I tried pip install pyspark. That gives Spark 3.0.0, which works fine, but it's not what I'm aiming for.
I hope someone has experience with this matter.
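For reference, a minimal sketch of the findspark setup described in the question (the SPARK_HOME path is illustrative); Spark 2.4.x only documents support up to Python 3.7, which is why the import fails under 3.8:

```python
# Minimal sketch of the setup described above (path is illustrative).
# Under Python 3.8 the pyspark import itself raises
# "TypeError: an integer is required (got type bytes)" because the
# cloudpickle bundled with Spark 2.4.x predates Python 3.8.
import findspark

findspark.init("/opt/spark-2.4.3-bin-hadoop2.7")  # or set SPARK_HOME and call findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("glue-local").getOrCreate()
print(spark.version)  # expect "2.4.3" when the kernel runs Python 3.7 or lower
```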
A better approach would be to install the Glue dependencies in Docker and then connect to that Docker container from VS Code to mimic the exact Glue local dev environment.
I've written a blog post about this, if you'd like to refer to it:
https://towardsdatascience.com/develop-glue-jobs-locally-using-docker-containers-bffc9d95bd1

Import setup module error while deploying to app engine via google cloud sdk

I am writing after a lot of searching and trial and error with no luck.
I am trying to deploy a service in app engine.
You might be aware that deploying on App Engine is usually practiced as a two-step process:
1. Deploy on the local dev app server
2. If step 1 succeeds, deploy to the cloud
My problems are with step 1, when I include third-party Python libraries such as numpy, sklearn, gcloud, etc.
I am trying to deploy a service on the local dev app server. When I import numpy or any other third-party library in my main.py script, it throws an error saying it is unable to find the module.
I am using the Cloud SDK and have two Python distributions: the default Python 2.7 and Anaconda with Python 2.7. When I change the path to look for the modules in the Anaconda distribution, it fails to find the module ‘setup’ required by the Cloud SDK.
Is there a way to install the Cloud SDK for the Anaconda distribution?
Any help/pointers will be much appreciated!
When using the App Engine Python standard environment, you can install pure-Python third-party libraries using pip by vendoring them, as explained here.
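For the vendoring approach, a minimal sketch (assuming the pure-Python dependencies were installed into a local lib/ directory with pip install -t lib <package>) would look like this:

```python
# appengine_config.py -- minimal vendoring setup for the Python 2.7
# standard environment; assumes third-party packages were installed into
# a local lib/ directory with `pip install -t lib <package>`.
from google.appengine.ext import vendor

# Add lib/ to sys.path so the vendored packages can be imported normally.
vendor.add('lib')
```

Keep in mind this only helps with pure-Python packages; numpy and sklearn fall under the C-extension case covered below.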
There are also a number of libraries included in the python27 runtime that can be requested using the libraries directive in your app.yaml, as explained here.
If there is a library you want to use in your project that is not pure Python (i.e. it uses C extensions) and it's not part of this list, then your only option is to use a flexible VM. If you want to use Anaconda, you should consider customizing the runtime for your flexible VM.

Developing in Adobe CQ5 with jetty?

We use Maven to deploy code changes to the internal CQ server / CRX Lite, and the problem is that it takes a long time, even though the change itself is often only one line of code.
Does anybody have experience with CQ5 on Jetty and can point me to a good guide?
I'm not sure I understand the relationship with Jetty (which ships as the servlet container of later versions of AEM/CQ5), but I will answer the code deployment part:
- Deploying a full content package (full content) should be done using the content-package Maven plugin.
- For smaller deployments of content, when you can't use integrated dev environments like the Sling Eclipse dev tools, I'd suggest you use the excellent repo command, which basically zips the current folder and deploys it. I'm using it as an external tool command in IntelliJ and it's really fast.
- Finally, if the deployment you're referring to is an OSGi deployment, the Maven Sling plugin can help you with that (it will still compile/package the whole OSGi bundle, though).