Using JNI libraries in flink cluster - java-native-interface

I need to use JNI libraries in a jar that runs on a Flink cluster.
Can anyone please guide me on how to do this?
If the library path has to be provided via VM arguments, how do I pass VM arguments when running the jar with the Flink CLI?
Is there any change that needs to be made in the Flink scripts?

One way to point the JVM at JNI libraries is to set the java.library.path property:
java -Djava.library.path=
You can provide JVM arguments to a task manager (responsible for executing individual Flink tasks) using the FLINK_ENV_JAVA_OPTS_TM environment variable:
export FLINK_ENV_JAVA_OPTS_TM="-Djava.library.path="
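As a rough sketch, assuming the native library (say libmylib.so, a made-up name) is installed under /opt/native-libs on every task manager host and a recent Flink 1.x distribution is used, the setup could look like this:
# make the native library path visible to the task manager JVMs
export FLINK_ENV_JAVA_OPTS_TM="-Djava.library.path=/opt/native-libs"
# the same effect can be had via flink-conf.yaml:
#   env.java.opts.taskmanager: -Djava.library.path=/opt/native-libs
# then start the cluster and submit the jar as usual
./bin/start-cluster.sh
./bin/flink run my-job.jar
Inside the job code the library is then loaded with the usual System.loadLibrary("mylib") call.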

Related

Problems Integrating Hadoop 3.x on Flink cluster

I am facing some issues while trying to integrate Hadoop 3.x with a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints there. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error I get when trying to submit a job is that HDFS is not supported as a file system. In the standalone setup, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step I applied that solution on all the machines used in my cluster, and in the standalone setup I managed to submit my jobs on all of them without any issues. However, when I started modifying the configuration to set up my cluster (by specifying the IPs of my machines), the problem came up once again. What am I missing?
For Hadoop 2.x there are pre-bundled jar files on the official Flink download page that used to solve similar issues, but that is not the case for Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
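As a minimal sketch of what that amounts to, assuming the hadoop binary is on the PATH of every node and the variable is exported somewhere the Flink processes actually see it (see the note about /etc/profile below; the jar name is just an example):
# let Flink pick up the Hadoop 3.x client jars already installed on the node
export HADOOP_CLASSPATH=$(hadoop classpath)
# then start the cluster and submit the job as usual
$FLINK_HOME/bin/start-cluster.sh
$FLINK_HOME/bin/flink run my-checkpointing-job.jar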
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail that I was missing was in the definition of the environment variables.
In my initial attempts I was using the .bashrc script to define my environment variables permanently. This works for a standalone cluster, but not for a distributed cluster, because of the scope of that script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
I also found another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled jar files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the following Maven repository page and, after testing the existing jars, found one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar worked (placed in $FLINK_HOME/lib). While it is not an "official" solution, it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos
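As a sketch of that workaround, assuming the downloaded artifact follows the usual artifactId-version.jar naming (take the exact filename from the repository page above):
# drop the shaded Hadoop 3 uber jar into Flink's lib folder ...
cp flink-shaded-hadoop-3-uber-3.1.1.7.2.8.0-224-9.0.jar $FLINK_HOME/lib/
# ... and restart the cluster so it gets picked up
$FLINK_HOME/bin/stop-cluster.sh && $FLINK_HOME/bin/start-cluster.sh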

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR and have the dependency jar we build automatically loaded onto it, without them having to type in the Maven information manually every time (private repo location/credentials, package info, etc.).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load them from a common default location (like S3 or similar).
Is there a way to tell Zeppelin on an EMR cluster to load its interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses the AWS command-line options to start an EMR cluster with all the necessary settings pre-made for you. (Could also upload the .jar dependency if we can't get Maven to work.)
Use an infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go, imo: you would run aws emr create ..., and that create would include a shell-script step that replaces /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3 and then hard-restarts Zeppelin (sudo stop zeppelin; sudo start zeppelin).
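A rough sketch of such a step script, assuming the shared settings live at s3://my-bucket/zeppelin/interpreter.json (a hypothetical location) and reusing the path and restart commands from the answer above:
#!/bin/bash
# overwrite the default interpreter settings with the shared copy from S3
sudo aws s3 cp s3://my-bucket/zeppelin/interpreter.json /etc/zeppelin/conf.dist/interpreter.json
# hard restart Zeppelin so the new settings are loaded
sudo stop zeppelin
sudo start zeppelin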

How to run C++ files on the Google Cloud Functions?

As far as I am aware, Google Cloud Functions only allows you to deploy Node.js or Python scripts.
Question: How would I be able to deploy a simple Hello_World.cpp on Google Cloud Functions? For example, writing a hello world HTTP function.
What are alternate methods to do this? I want to use a serverless approach, since it's the cheapest method; that is why I'm going with Google Cloud Functions. Would I have to use Docker in order to run C++ files? I've been stuck on this for a while and any guidance or help would be appreciated.
You can compile your C++ function into a WebAssembly module using emscripten, then call it from a small piece of Node.js glue code.
I built an example for you here:
https://github.com/ArthurSonzogni/gcloud-cpp-starter
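As a small illustration of the emscripten step, assuming a file hello.cpp that exposes an extern "C" function hello() (both names are made up for this sketch):
# compile the C++ source to WebAssembly plus a JS loader that Node.js can require()
emcc hello.cpp -O2 -s MODULARIZE=1 -s EXPORTED_FUNCTIONS='["_hello"]' -o hello.js
# hello.js and hello.wasm then get bundled with the function and called from the Node.js glue code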
You can run C++ code via Node.js on Google Cloud Functions (tested with Node.js 10):
How to use C++ and N-API (node-addon-api): https://medium.com/@atulanand94/beginners-guide-to-writing-nodejs-addons-using-c-and-n-api-node-addon-api-9b3b718a9a7f
Use https://console.cloud.google.com/functions and click CREATE FUNCTION to upload a .zip, or run gcloud functions deploy --runtime nodejs10 --trigger-http
The trick is that when you zip the files you need to remove the /build and /node_modules folders, then cd to the folder containing index.js and run 'zip your_name.zip -r *'.
P.S. When I use firebase deploy --only functions it errors because it doesn't recognize the addon.node file format (in fact it shouldn't read this file at all, since it needs to be recompiled), but I think that using the gcloud functions command line with a .gcloudignore entry for /build and /node_modules would succeed: https://cloud.google.com/functions/docs/deploying/filesystem
HOW DOES IT WORK
I think that when you deploy Node.js source code to Cloud Functions it runs npm install, and your C++ code gets compiled too (as if npm run build were run automatically after npm install).
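Pulling the steps above together, a hedged sketch of the command-line route (function and file names are illustrative):
cd my-function                # folder containing index.js and package.json
rm -rf build node_modules     # strip locally compiled artifacts before packaging
zip my-function.zip -r *      # only needed if you upload the zip through the console
# or deploy straight from the source folder:
gcloud functions deploy helloWorld --runtime nodejs10 --trigger-http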
You can't use C++ on Cloud Functions, period. You can only use Node.js 6.14, Node.js 8.11.1 (beta) and Python 3.7 (also beta).
If you wish to use C++ in GCP with a serverless approach, my best suggestion would be running your own custom runtime in App Engine. You would still need to configure some instance options, but you don't have to manage servers and so on.
You can only use App Engine Flexible Environment (or, of course, standard VM architecture, Compute Engine). Extract from the docs (https://cloud.google.com/appengine/docs/flexible/):
Runtimes - The flexible environment includes native support for Java 8
(with no web-serving framework), Eclipse Jetty 9, Python 2.7 and Python 3.6,
Node.js, Ruby, PHP, .NET core, and Go. Developers can customize these
runtimes or provide their own runtime by supplying a custom Docker image
or Dockerfile from the open source community.
As an interesting side note, Google Serverless Containers will give you the chance to deploy your dockerized application but in a serverless flavour (in fact it's built on top of Google Cloud Functions technology). It's currently in Alpha stage.

Using spark with latest aws-java-sdk?

We are currently using Spark 2.1 with Hadoop 2.7.3, and I know (and can't believe) that Spark still requires aws-java-sdk version 1.7.4. We are using a Maven project, and I was wondering if there is any way to set up libraries or my environment so that I can use Spark 2.1 alongside other applications that use the latest aws-java-sdk. I guess it's the same as asking whether it's possible to set up a workflow that uses different versions of the aws-java-sdk, so that when I want to run the jar on a cluster I could just point to the latest aws-java-sdk. I know I could obviously maintain two separate projects, one for Spark and one for pure SDK work, but I'd like to have them in the same project.
use spark 2.1 along with other applications that use the latest aws-java-sdk
You can try using the Maven Shade Plugin when you create your JAR, then ensure the user classpath is placed before the Hadoop classpath (spark.executor.userClassPathFirst). This will ensure you're loading all the dependencies included by Maven, not what's provided by Spark.
I've done this with Avro before, but I know that the AWS SDK has more to it.
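A hedged sketch of the submit side of that suggestion, assuming the shaded artifact is called my-app-shaded.jar and the entry class is com.example.MyJob (both illustrative):
spark-submit \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  --class com.example.MyJob \
  my-app-shaded.jar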

spark-ec2 not recognized when launching cluster on Windows 8.1

I'm a complete beginner on spark. I'm trying to run spark on Amazon EC2, but my system does not recognize "spark-ec2" or "./spark-ec2". It says "spark-ec2" is not recognized as an internal or external command.
I followed the instructions here to launch a cluster. I would like to use Scala; how do I make it work?
Add a PYTHONPATH environment variable that includes the bundled boto library:
PYTHONPATH="${SPARK_EC2_DIR}/third_party/boto-2.4.1.zip/boto-2.4.1:$PYTHONPATH"
and then execute the Python script.
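Putting that together, a sketch of running the script directly with Python from the Spark distribution's ec2 folder (SPARK_EC2_DIR is assumed to point there; spark_ec2.py is the Python entry point the spark-ec2 wrapper normally calls):
export SPARK_EC2_DIR=/path/to/spark/ec2
export PYTHONPATH="${SPARK_EC2_DIR}/third_party/boto-2.4.1.zip/boto-2.4.1:$PYTHONPATH"
python "${SPARK_EC2_DIR}/spark_ec2.py" --help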
In order to run the Spark-EC2 script on Windows you need Cygwin and Python. If you don't want to install these programs, you can use the dockerized version of the script (https://github.com/edrevo/spark-ec2-docker), which only depends on Docker.