What is the proper way to use Google Pub/Sub with Flink Streaming using Dataproc? - google-cloud-platform

I'm trying to figure out the proper way to run Apache Flink on Dataproc and use Google Pub/Sub as a source/sink. When I create a Dataproc cluster, after applying flink initialization action to the most recent image 1.4, Flink 1.6.4 will be installed.
The problem is that flink-connector-gcp-pubsub is only available starting from Flink version 1.9.0.
So my question is what is the proper way to use all of this together? Should I build my own gce image with the latest Flink? Is there one already existing?

As you already said flink-connector-gcp-pubusub is only available from Flink 1.9.0. So you have two options:
Either implement connector yourself
Build your own image based on the flink initialization actions
I would not recommend implementing connector as it is a complex task and requires an in-depth understanding of Flink while building your own image should be relatively easy given an example for Flink 1.6.4

I solved this problem by running Flink 1.9.0 in Kubernetes. This way I do not depend on anybody and can run whatever version I need.

Related

Apache airflow vs puckle airflow image

I am currently using puckel airflow but now the apache airflow image is also available.
Which one is better and reliable. And given I need to start from scratch, which option would be better?
Official. Puckel is no longer updated
I started with puckel airlfow but as Javier mentioned its no longer updated. Last updated version was 1.10.9. Its easier to start with this image and following updates and mimicking required behaviours from official docker image you can build on it.

Does anyone know how to use AWS App2Container(A2C)?

AWS App2Container (A2C) is a recently launched feature by AWS. It is a CLI tool to help you lift and shift applications that run in your on-premises data centres or on virtual machines so that they run in containers that are managed by Amazon ECS or Amazon EKS. Since there is not much info on the internet about this, apart from the AWS document so does anybody knows how to implement it and what are the dependencies required for it?
This is a fairly new service so most people will be relying on reading at the moment.
For JAVA applications the setup instructions on Linux indicate that you just download the app2container package and then run the following over your code
sudo app2container containerize --application-id java-app-id
For .NET applications the setup instructions on Windows indicate that it is exactly the same process, run the install file and that will have all dependencies.
The best way to try and implement this will be by following these tutorials step by step. Also remember at this time it is JAVA or .NET only.

Does Google Dataproc support Apache Impala?

I am new to using cloud services and navigating Google's Cloud Platform is quite intimidating. When it comes to Google Dataproc, they do advertise Hadoop, Spark and Hive.
My question is, is Impala available at all?
I would like to do some benchmarking projects using all four of these tools and I require Apache Impala along side Spark/Hive.
No, DataProc is a cluster that supports Hadoop, Spark, Hive and pig; using default images.
Check this link for more information about native image list for DataProc
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions
You can try also using another new instance of Dataproc, instead of using the default.
For example, you can create a Dataproc instance with HUE (Hadoop User Experience) which is an interface to handle Hadoop cluster built by Cloudera. The advantage here is that HUE has as a default component Apache Impala. It also has Pig, Hive, etc. So it's a pretty good solution for using Impala.
Another solution will be to create your own cluster by the beginning but is not a good idea (at least you want to customize everything). With this way, you can install Impala.
Here is a link, for more information:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/hue
Dataproc provides you SSH access to the master and workers, so it is possible to install additional software and according to Impala documentation you would need to:
Ensure Impala Requirements.
Set up Impala on a cluster by building from source.
Remember that it is recommended to install the impalad daemon with each DataNode.
Cloud Dataproc supports Hadoop, Spark, Hive, Pig by default on the cluster. You can install more optionally supported components such as Zookeeper, Jyputer, Anaconda, Kerberos, Druid and Presto (You can find the complete list here). In addition, you can install a large set of open source components using initialization-actions.
Impala is not supported as optional component and there is no initialization-action script for it yet. You could get it to work on Dataproc with HDFS but making it work with GCS may require non-trivial changes.

using Kinesis Client library with Spark Steaming PySpark

I am looking for using KCL on SparkStreaming using pySpark.
Any pointers would be helpful.
I tried few given by spark Kinesis Ingeration link.
But i get the error for JAVA class reference.
Seems Python is using JAVA class.
i tried linking
spark-streaming-kinesis-asl-assembly_2.10-2.0.0-preview.jar
while trying to apply the KCL app on spark.
but still having the error.
Please let me know if anyone has done it already.
if i search online i get more about Twitter and Kafka.
Not able to get much help with regard to Kinesis.
spark verision used: 1.6.3
I encountered the same problem. The kinesis-asl jar had several files missing.
To overcome this problem, I had included the following jars in my spark-submit.
amazon-kinesis-client-1.9.0.jar
aws-java-sdk-1.11.310.jar
jackson-dataformat-cbor-2.6.7.jar
Note: I am using Spark 2.3.0 so the jar versions listed might not be the same as those you should be using for your spark version.
Hope this helps.

scikit learn on google cloud platform through datalab or compute engine?

I am running a Django App inside GCP. My idea was to call a python script from "view.py" for some machine learning algorithm and then display the result on the page.
But now I understand that running a machine learning library like Scikit-learn on GAE will not be possible (read Tim's answer here and this thread).
But suppose I need to still do this, I believe there are 2 ways possible, but I am not sure weather my guess is right or wrong
1) As the Google-Datalab provides the entire anaconda like distribution, if we have any datalab api which can be called from a python file in the Django app, I can achieve my goal ?
2) If I can install the scikit-learn library on any compute engine on GCP and somehow send it the request to run my code and then return the output back to the python file in the Django app ?
I am very new to client-server and cloud computing on the whole, so please provide examples (if possible) for any suggestion/ pointer for the help.
Regards,
I believe what you want is to use the App Engine Flex environment rather than the standard App Engine environment.
App Engine Flex uses a compute engine VM for running your code, so it does not have the library limitations that standard App Engine has.
Specifically, you'll need to add a 'requirements.txt' file to specify the version of scikit-learn that you want installed, and then add a 'vm: true' clause to your app.yaml file.
sklearn is now supported on ML Engine.
So, another alternative now is to use online prediction on Cloud ML Engine, and deploy your scikit-learn model as a web service.
Here is a fully worked out example of using fully-managed scikit-learn training, online prediction and hyperparameter tuning:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/sklearn/babyweight_skl.ipynb