JNI in Dataflow

I need to use JNI in my Dataflow pipeline. The JNI code uses a C++ library that has a ton of external dependencies on other system libraries. What would be the best way to make sure that these libraries are where they should be in the operating system when a worker runs the DoFn that uses the C++ library?
I found that DataflowPipelineOptions.setWorkerHarnessContainerImage might allow me to specify a custom Docker image from the Google Container Registry that I could potentially install a bunch of libraries on, but the documentation doesn't say much more. Are there any requirements for the Docker image in terms of installed packages, entry points, etc.?

Apache Beam recently published an example of calling sub-processes from a Dataflow worker. The solution downloads the binary dynamically within the DoFn's @Setup method and then executes it for each record processed by the pipeline. The solution also handles collecting the output from the process and propagating failures to the pipeline.
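For illustration, here is a minimal sketch of that pattern, assuming the binary has already been staged on the worker's local disk (the /tmp path and tool name are hypothetical placeholders; the published example additionally handles downloading the binary in @Setup):

// Hedged sketch: run a staged native binary per element from a Beam DoFn.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import org.apache.beam.sdk.transforms.DoFn;

public class CallBinaryFn extends DoFn<String, String> {
  private transient Path localBinary;

  @Setup
  public void setup() throws IOException {
    // Download or copy the binary once per DoFn instance, not per element.
    // Here we simply assume it is already on the worker's local disk (hypothetical path).
    localBinary = Paths.get("/tmp/my-native-tool");
    Files.setPosixFilePermissions(localBinary,
        java.nio.file.attribute.PosixFilePermissions.fromString("rwxr-xr-x"));
  }

  @ProcessElement
  public void processElement(@Element String input, OutputReceiver<String> out)
      throws IOException, InterruptedException {
    Process process = new ProcessBuilder(localBinary.toString(), input)
        .redirectErrorStream(true)
        .start();
    String output = new String(process.getInputStream().readAllBytes(),
        StandardCharsets.UTF_8);
    if (process.waitFor() != 0) {
      // Propagate the failure so the pipeline retries or fails the bundle.
      throw new RuntimeException("Native tool failed: " + output);
    }
    out.output(output);
  }
}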

Related

Flink job using JNI on EMR

I am trying to invoke a native library from within a Flink pipeline.
The environment is:
EMR 5.34
Flink 1.13.1
I have built the uber (fat) JAR and made sure the .so file is available inside the JAR.
However, I am facing the exception below when starting up the Flink application.
Appreciate any pointers.
Caused by: java.lang.UnsatisfiedLinkError: no <<my native library artifact name>> in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:871)
Thank you,
Amit
I was able to resolve this, at least in "Session" mode, by setting the config parameters below in the flink-conf.yaml file.
env.java.opts: "-Djava.library.path=<<path to libraries>>"
containerized.master.env.LD_LIBRARY_PATH: "<<path to libraries>>"
containerized.taskmanager.env.LD_LIBRARY_PATH: "<<path to libraries>>"
You also need to use StreamExecutionEnvironment.registerCachedFile to ship the extracted files from the JobManager to the TaskManagers involved.
On the driver side:
StreamExecutionEnvironment.getExecutionEnvironment().registerCachedFile("<directory where files are extracted>", "somekey")
Hope this helps if someone is looking for an approach to this kind of scenario.
You can access these cached files and store them in the directory configured in flink-conf.yaml so that they are included in the library path for execution.
getRuntimeContext().getDistributedCache().getFile("somekey")
To be able to access the RuntimeContext, you need to extend RichMapFunction.
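For reference, here is a minimal sketch of the TaskManager side. The library file name (libmynative.so) and the native entry point are hypothetical placeholders; it loads the cached file directly by absolute path with System.load in open(), which is an alternative to copying it into the directory configured in flink-conf.yaml:

// Hedged sketch: load a native library shipped via Flink's distributed cache.
import java.io.File;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class NativeCallMapper extends RichMapFunction<String, String> {

  @Override
  public void open(Configuration parameters) throws Exception {
    // Retrieve the directory registered on the driver under "somekey".
    File cachedDir = getRuntimeContext().getDistributedCache().getFile("somekey");
    // Load the shared object by absolute path; "libmynative.so" is a
    // hypothetical name, use whatever artifact your JNI wrapper expects.
    System.load(new File(cachedDir, "libmynative.so").getAbsolutePath());
  }

  @Override
  public String map(String value) {
    return callNative(value);
  }

  // Hypothetical JNI binding; in practice this usually lives in a dedicated wrapper class.
  private native String callNative(String value);
}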
Update:
With all the above changes, when I run the Flink pipeline for the first time, it still complains that the library is not found. I did check the directory into which I am extracting the distributed cache, and the libraries are there.
Subsequent runs after the first failure are successful. I am not sure why I am seeing this behavior.
Update:
Made sure that the directory where we extract the libraries is already available when we create the EMR cluster, and it worked like a charm. I created this directory by configuring a bootstrap action.

Where can I find the source code for AWS Lambda's dotnetcore3.1 runtime?

AWS publishes a lot of their source code on GitHub, but so far I haven't been able to find the source for the dotnetcore3.1 Lambda Runtime.
I expect this source will be a console application responsible for startup and IoC (using the Lambda configuration to determine the handler class and function, using reflection to instantiate the handler, as well as the serializer specified on the lambda's assembly attribute). So far I have found a number of repositories containing things like NuGet libraries and dotnet CLI tooling--but haven't located the runtime itself.
Is the AWS Lambda dotnetcore3.1 runtime source publicly available? Can you point me to the repo and .csproj? Thanks!
I am not 100% sure, but there is probably nothing like a dedicated ".NET Core Runtime" that is publicly available.
Let's explore how Lambda works to explain why I think that:
First there is an EC2 instance with a host operating system. That's the bare metal your resources are coming from (vCPU, RAM, disk, etc.).
Second, there is a thin management layer for microVMs. For AWS Lambda this is a project called Firecracker.
The microVMs that are started use a micro kernel. In the case of AWS Lambda it is the OSv micro kernel. The micro kernel already has some support for runtimes:
OSv supports many managed language runtimes including unmodified JVM, Python 2 and 3, Node.JS, Ruby, Erlang as well as languages compiling directly to native machine code like Golang and Rust.
.NET Core is compiled, so I think this falls into the same category: it is a "native" binary that is self-contained and simply started.
You mentioned reflection etc. to run the binary. I think it is not that complicated. This is where the last puzzle piece comes into play: environment variables.
There are a bunch of environment variables that are set by the "runtime". You can find the full list in the AWS documentation: Defined runtime environment variables.
Every Lambda has an environment variable called _HANDLER which points to the handler. In your case this would be the binary. This allows the runtime to just "run" your handler.
When the handler runs, the underlying AWS SDK starts some kind of web server/socket (I think it depends on the framework/language) that exposes everything the "standard" runtime needs to communicate with your code.
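To make that concrete, here is a minimal, hypothetical sketch of such a loop against the documented Lambda Runtime API (written in Java purely for illustration; the .NET counterpart is Amazon.Lambda.RuntimeSupport, and the handler dispatch below is just a placeholder):

// Hedged sketch of the generic runtime loop every Lambda runtime implements.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MinimalRuntimeLoop {
  public static void main(String[] args) throws Exception {
    // Both variables are set by the Lambda environment.
    String api = System.getenv("AWS_LAMBDA_RUNTIME_API");
    String handler = System.getenv("_HANDLER");
    HttpClient http = HttpClient.newHttpClient();

    while (true) {
      // Long-poll for the next invocation.
      HttpResponse<String> next = http.send(
          HttpRequest.newBuilder(
              URI.create("http://" + api + "/2018-06-01/runtime/invocation/next")).build(),
          HttpResponse.BodyHandlers.ofString());
      String requestId = next.headers()
          .firstValue("Lambda-Runtime-Aws-Request-Id").orElseThrow();

      // Dispatch to whatever _HANDLER points at (placeholder result).
      String result = "{\"handler\":\"" + handler + "\"}";

      // Report the result back to the Runtime API.
      http.send(
          HttpRequest.newBuilder(
              URI.create("http://" + api + "/2018-06-01/runtime/invocation/"
                  + requestId + "/response"))
              .POST(HttpRequest.BodyPublishers.ofString(result)).build(),
          HttpResponse.BodyHandlers.ofString());
    }
  }
}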
So as far as I can tell, there is not really a dedicated .NET Core runtime. It is pretty generic all the way down, which simplifies development for the folks at AWS, I guess.
The source code for Lambda's dotnet runtime(s) is within the aws/aws-lambda-dotnet repository--it doesn't appear to have a dedicated netcoreapp3.1 library, or a preserved branch, but presumably the 3.1 runtime is contained within this repository's history.
The Amazon.Lambda.RuntimeSupport console application appears to be the entrypoint, which in turn instantiates and invokes the published Handler.

How can you find out Azure-pipeline image content?

I'm new to Azure Pipelines and struggling to put together a C++-oriented pipeline that uses CMake and properly compiles, runs tests, and builds documentation on Ubuntu, macOS, and Windows.
I managed the macOS and Ubuntu cases rather easily but am struggling with the Windows case, not knowing what's installed and what's in the system PATH for the image & container I've selected.
Not being super familiar with the Azure platform, I'm basically relying on commit-push-run-pipeline for every single little change to my YAML file, thus wasting time and resources.
I can't imagine that the only way is to blindly try out commands by commit, push, and run the pipeline.
I managed to find a basic description of the currently (hopefully) available images here; following the included-software link for Windows, you end up on a comprehensive list of what's supposedly installed (I have some doubts about whether this documentation actually matches the content of the image). Calling some of the tools present in that list, like cmake and choco, failed. Whether or not they're actually installed and in the system PATH, I have no idea.
Q1: Is there any way to locally test out an Azure-Pipeline YAML?
Q2: Is there any way to figure out what is actually installed on a given image/container (without issuing a DIR /s from the root folder??)
Q3: Is it possible to connect to a running container (or is it a VM???) instance and directly tinker with it?
Q4: Alternatively, is it possible to run such an image locally (Docker)? Does it imply execution on a Windows machine or is that a standalone VM image?
EDIT: Found out about this question, although doesn't quite answer mine: Is there a tool to validate an Azure DevOps Pipeline locally?
Q1: Is there any way to locally test out an Azure-Pipeline YAML?
The answer is yes. You could create your own private agent to execute the Azure Pipelines YAML.
Self-hosted agents
Q2: Is there any way to figure out what is actually installed on a given image/container (without issuing a DIR /s from the root folder??)
Just as you know, we could check the Software document for the software installed on the agent. If you want to know the install path of some software, you could check the debug log from the build task. For example, for cmake, we could check the build log from the CMake task.
Q3: Is it possible to connect to a running container (or is it a VM???) instance and directly tinker with it?
For the hosted agent, I am afraid the answer is no.
Q4: Alternatively, is it possible to run such an image locally (Docker)? Does it imply execution on a Windows machine or is that a standalone VM image?
The answer is yes, we could run a self-hosted agent in Docker. And it implies execution on a Windows machine.

Google Container Builder: How to cache dependencies between two builds

We are migrating our container-building process to Google Container Builder. We have multiple repos using Node or Scala.
With the current Container Builder features, is it possible to cache dependencies between two builds (e.g. node_modules, .ivy, ...)? It's really time- (and money-) consuming to download everything each time.
I know it's possible to build a custom Docker image with everything packaged within, but we would prefer to avoid this solution.
For example, can we mount a persistent volume for that purpose, as we used to do with DroneIO? Or, even better, have it happen automatically, as in Bitbucket Pipelines?
Thanks
GCB doesn't currently support mounting a persistent volume across builds.
In the meantime, the team recently published a document outlining some options for speeding up builds, which might be useful: https://cloud.google.com/container-builder/docs/speeding-up-builds
In particular, caching generated output to Google Cloud Storage and pulling it in at the beginning of your build might help in your case.

Continuous build system for Qt

I am a Qt/C++ developer. I would like to set up a continuous integration environment whereby committing the source code triggers a build process that builds the code for the three platforms I'm using:
Linux
OS X
Win32
If possible, how do I set up such an environment? Any hints or links are welcome.
I've read a bit about Jenkins, but I can't find any good tutorial for it.
I also suggest Jenkins for several reasons:
It will run on all of the platforms you listed.
It can be configured to start a build when the repository is updated (hint: configure the Job to "Poll SCM" and you won't have to muck with your SCM tool to get it to tell Jenkins to start building).
It provides good support (mostly through plugins) for unit testing. [Your project is doing unit testing, right?]
The price is right
A bigger issue you are going to have is that, AFAIK, Qt doesn't really do cross-compiling for other platforms well. Using Jenkins (and the appropriate plugins), you should be able to solve this.
One method that comes quickly to mind is to have an instance of Jenkins on each platform. Each instance is responsible for building the version for its own platform. At the end of the build, the created artifacts are all put into a common, shared location.
Jenkins supports this feature via plugins for all major source control systems. If you are seriously considering using Jenkins (and I would highly recommend it), consider buying John Ferguson Smart's Jenkins: The Definitive Guide.
Two solutions come to my mind:
BuildBot
BuildBot is a highly customizable continuous integration system written in Python. The master component offers a nice web-based GUI to monitor and trigger builds; slave components are put on the target machines (usually virtual machines, but they could be the Mac laptop of one of the developers). The docs are good enough to build up a basic system; customization could be a little tricky (at least it was for me). Using the commit/push hooks provided by VC systems, you can easily activate the master and trigger builds across the slaves. It also supports incremental builds (a must if your project is big).
CDash
Developed by the authors of CMake, CDash is a web application collecting builds coming from across the network. It is not exactly what you asked for, but I think it's worth a try. It is very powerful if you have a team of developers who can continuously submit build results from their machines to the server (and if you use CMake it's almost transparent). You cannot trigger builds from the server as BuildBot does, but you could set up a bunch of VMs with a cron job that checks for changes and, if there are any, performs the build and sends the results to CDash.
Sure, it's possible. Most version control systems are able to execute a custom script on the server side. Some of them (Git, for example) have hooks to achieve the same locally. Have a look at Git's post-commit hook.
All you need is to create a script that will trigger cross-platform builds.
Most version control systems provide post-commit hooks that let you kick off events like builds. Alternatively, build systems can be configured to regularly poll a source control repository and manage their own build scheduling (this is how we use Jenkins).
Something to bear in mind is how long it will take to do a complete build across platforms and the typical number of check-ins in that interval. You might find batching check-ins a better way of doing continuous integration builds if you have a fair-sized team or limited build server resources. Otherwise your build system could quickly end up trying to play catch-up.
As for whether it is possible to build on all target platforms, that depends on your tool chain.