Flink job using JNI on EMR - java-native-interface

I am trying to invoke a native library from within a flink pipeline.
Environment is
EMR 5.34
Flink 1.13.1
I have built the uber fat jar and made sure the .so file is available in the JAR file.
However I am facing the below exception when starting up the flink application.
Appreciate any pointers.
Caused by: java.lang.UnsatisfiedLinkError: no <<my native library artifact name>> in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:871)
Thank you,
Amit

I was able to resolve this at least in "Session" mode by setting below config parameters in flink-conf.yaml file.
env.java.opts: "-Djava.library.path=<<path to libraries>>"
containerized.master.env.LD_LIBRARY_PATH: "<<path to libraries>>"
containerized.taskmanager.env.LD_LIBRARY_PATH: "<<path to libraries>>"
You also need to use StreamExecutionEnvironment.registerCachedFile to pass the extracted files on the JobManager to the TaskManagers involved.
On Driver side -
StreamExecutionEnvironment.getExecutionEnvironment.registerCachedFile(directorywherefilesareextracted,"somekey")
Hope this helps if someone is looking for an approach that could be used to work with such scenario.
You can access these cached files and store them in the directory configured in filnk-conf.yaml so that they are included in the library path for execution.
getRuntimeContext().getDistributedCache().getFile("somekey")
To be able to access the RuntimeContext, you need to extend RichMapFunction.
Update:
With all the above changes, when I run the Flink pipeline for the first time, it still complains about library not found. I did check the directory in which I am extracting distributed cache and the libraries are there.
Subsequent runs after the first failure are successful. I am not sure why I am seeing this kind of behavior.
Update:
Made sure that the directory, where we extract the libraries, is readily available when we create EMR cluster and it worked like a charm. I created this directory by configuring Bootstrap action.

Related

How can you find out Azure-pipeline image content?

I'm new to Azure-Pipeline and struggling to put together a C++ oriented pipeline that uses camke which properly compiles, run tests and build documentation on Ubuntu, macOS, and Windows.
I managed the macOS and Ubuntu cases rather easily but am struggling with the Windows case not knowing what's installed and what's in system PATH for the given image & container I've selected.
Not being super familiar with the Azure-Platform I'm basically relying on commit-push-run-pipeline every single little change to my YAML file thus wasting time and resources.
I can't imagine that the only way is to blindly try out commands by commit, push and run the pipeline.
I managed to find a basic description of the currently (hopefully) available images here following the included software link for Windows link yoou end up on a comprehensive list of what's supposedly installed (I have some doubts on whether this documentation actually matches the content of the image). Calling some of those tools like cmake and choco, present in the above list, failed. Whether or not they're actually installed and in system PATH I have no idea.
Q1: Is there any way to locally test out an Azure-Pipeline YAML?
Q2: Is there any way to figure what is actually installed on a given image/container (without issuing a DIR /s from the root folder??)
Q3: Is it possible to connect to a running container (or is it a VM???) instance and directly tinker with it?
Q4: Alternatively, is it possible to run such an image locally (Docker)? Does it imply execution on a Windows machine or is that a standalone VM image?
EDIT: Found out about this question, although doesn't quite answer mine: Is there a tool to validate an Azure DevOps Pipeline locally?
Q1: Is there any way to locally test out an Azure-Pipeline YAML?
The answer is yes. You could create your private agent to execute the Azure-Pipeline YAML.
Self-hosted agents
Q2: Is there any way to figure what is actually installed on a given
image/container (without issuing a DIR /s from the root folder??)
Just as you know, we could check the document Software for the software installed on the agent. If you want to know the install the path of some software, you could check the debug log from the build task. For example, cmake. We could check the build log from the cmake task:
Q3: Is it possible to connect to a running container (or is it a
VM???) instance and directly tinker with it?
For the hosted agent, I am afraid the answer is not.
Q4: Alternatively, is it possible to run such an image locally
(Docker)? Does it imply execution on a Windows machine or is that a
standalone VM image?
The answer is yes, we could Run a self-hosted agent in Docker. And it imply execution on a Windows machine.

Can I sync two bazel-remote-cache's using rsync

I have a build pipeline that builds and tests changes before they are merged to the main line. Once that happens, it would be great if the Bazel actions from that build are available to developers. Unfortunately, the build pipeline runs in the cloud and uses an in-cloud cache, but the developers use an on-premises cache.
I am using https://github.com/buchgr/bazel-remote
Does anyone know if I can just rsync the artifacts from the data directory of the cloud cache to the developers' cache in order to give them access to the pre-built artifacts? Normally, I would just try it out, but I'm concerned about subtle issues that might poison the cache or negatively effect the hit rate, so I'm hoping to hear from someone who understands the code before I go digging.
You can rsync the cache directory contents and use them from another location, but this won't work with a running bazel-remote- the items will be ignored until bazel-remote is restarted.
Another option would be to use the http_proxy configuration file setting to automatically put/get cache items to/from another bazel-remote instance. An example configuration file was recently added to README.md in the bazel-remote git repository.

Is ArtifactStagingDirectory always empty with each build in DevOps pipeline

We are using Build Pipeline in Azure DevOps to create a Deployment Artifact. Typical steps in such pipeline are:
Build Solution / Project
Copy dlls output into $Build.ArtifactStagingDirectory
Publish Artifact from $Build.ArtifactStagingDirectory
I just wonder if I can rely on the fact, that on start of each Build the Build.ArtifactStagingDirectory is empty. Or should I clean the folder as first step to be sure?
From my experience the folder was always empty, but I am not sure if I can rely on that. Is that something specific to Azure hosted Agent and maybe by using custom Build agents I have to do manual clean-ups of this folder? Maybe some old files from last build could remain there? I did not found this info in documentation.
Thanks.
I think that the main idea of this variable $Build.ArtifactStagingDirectory is to be a clean area so you can manage the code you're pushing from your repo. As far as I know, there is no explicit information on documentation talking that this folder is empty at every new build, but there are a few "clues":
You can see at the Microsoft's Build Variables documentation that Build.StagingDirectory is always purged before each new build, so you have a fresh start every build.
In the documentation above you have a few cases where it explicitly cites that some folders or files are not cleaned on a new build, like the Build.BinariesDirectory variable.
I've run a few build and realeases pointing to my Web App on Azure, and I never saw an unwanted file or folder that was not related to my build pipeline.
I hope that helps.

JNI in Dataflow

I need to use JNI in my Dataflow pipeline. The JNI uses C++ library that has a ton of external dependencies on other system libraries. What would be the best way to make sure that the libraries are where they should be in the operating system when a worker runs the DoFn that uses the C++ library?
I found that the DataflowPipelineOptions.setWorkerHarnessContainerImage might allow me to specify custom docker image from the Google Container Registry that I could potentially install bunch of libraries on, but the documentation doesn't say much more. Are there any requirements for the docker image in terms of installed packages, entry points, etc...?
Apache Beam recently published an example of calling sub-processes from a Dataflow worker. The solution downloads the binary dynamically within the DoFn's #Setup method and then executes the binary for each record processed by the pipeline. The solution also handles collecting the output from the process and propagating failures to the pipeline.

How do you configure proprietary dependencies for Leiningen?

We're working on a project that has some Clojure-Java interop. At this point we have a single class that has a variety of dependencies which we put into a user library in Eclipse for development, but of course that doesn't help when using Leiningen (2.x). Most of our dependencies are proprietary, so they aren't on a repository somewhere.
What is the easiest/right way to do this?
I've seen leiningen - how to add dependencies for local jars?, but it appears to be out of date?
Update: So I made a local maven repository for my jar following these instructions and the lein deployment docs on github, and edited my project.clj file like this:
:dependencies [[...]
[usc "0.1.0"]]
:repositories {"usc" "file://maven_repository"}
Where maven_repository is under the project directory (hence not using file:///). When I ran "lein deps"--I got this message:
Retrieving usc/usc/0.1.0/usc-0.1.0.pom from usc
Could not transfer artifact usc:usc:pom:0.1.0 from/to usc (file://maven_repository): no supported algorithms found
This could be due to a typo in :dependencies or network issues.
Could not resolve dependencies
What is meant by "no supported algorithms found" and how do I fix it?
Update2: Found the last bit of the answer here.
add them as a dependency to your leiningen project. You can make up the names and versions.
then run lein deps and the error message when it fails to find it will give you the exact command to run so you can install the jar to your local repo then sould you decide to use a shared repo you can use this same process to put your dependencies there.
#Arthur's answer is good but I figured I'd flesh it out a bit more since it leaves some details lacking.
Always keep in mind Repeatability. If you don't make it so that anyone who needs access to the artifacts can get access to the artifacts in a standard way, you're asking for support hell.
The documentation on deployment is a good place to go to find out everything you need to know about deploying your artifacts. Since you're in a polyglot environment you probably can't have lein take care of deploying all your artifacts but at least you can get your clojure specific jars up into S3 or even a file share if you like. The rest of your artifacts will have to use Maven or Ant directly to upload the artifacts to the Maven repo on the file server or S3. At my current company we are using technomancy's excellent s3 wagon private to great effect for hosting our closed source artifacts and clojars for hosting anything that we can open-source.
What #Arthur is referring to is doing a lein install. All that does is install a copy of the current project into your local .m2 directory so that other projects on your box can reference them. Unless you have configured your install of maven to use a shared directory for your .m2 folder (maybe not a bad idea in your environment?), this will mean that anyone else who checks out your project will not be able to build it. If you wanted to go this route, you need to set the localRepository node in your $M2_HOME/conf/settings.xml to be the shared location that the rest of your team has access to. See the docs for more information.
YMMV but I've found it best to use Maven rather than Leiningen when you are working with Polyglot Clojure / Java projects.
It's mainly because the Java based tools (Eclipse etc.) understand Maven projects but don't really understand Leiningen projects. It's getting slowly better with the excellent Counterclockwise Clojure plugin, but the integration still isn't quite good enough yet for an efficient IDE based workflow.
On the repository side of things, I'd suggest setting up a private shared Maven repository. You're going to need it sooner or later if you plan to manage a complex set of dependencies within your team: might as well bite the bullet and get it done now.