Is it possible to run kubeflow pipelines or notebooks using AWS EMR as Spark Master/Driver - amazon-web-services

I am trying to implement a solution on an EKS cluster where users/developers submit jobs through the Kubeflow central dashboard. To offer Spark as a service to users on the platform, I tried a standalone Spark installation on the EKS cluster, but with that approach all the other configuration has to be managed by an admin. A managed service such as EMR could instead be used as an independent service that is triggered only when a job is submitted.
I am trying to make EMR on EC2 or EMR on EKS available as an endpoint that can be used from Kubeflow notebooks or pipelines. I have tried various approaches but could not arrive at a robust solution.
If anybody has experience with this, please feel free to share your suggestions.

Related

AWS and GCP centrally managed airflows and Dataflow equivalent for AWS

I have two questions:
1. My company has two instances of Airflow running, one on a GCP-provisioned cluster and another on an AWS-provisioned cluster. Since GCP has Composer, which helps you manage Airflow, is there a way to integrate the Airflow DAGs on the AWS cluster so that they are managed by GCP as well?
2. For batch ETL/streaming jobs (in Python), GCP has Dataflow (Apache Beam). What is the AWS equivalent?
Thanks!
No, that is not possible so far. On AWS you have to provision and manage Airflow yourself. There are some options you can choose from: EC2, ECS + Fargate, or EKS.
The closest equivalents to Dataflow are Amazon Elastic MapReduce (EMR) or AWS Batch. Moreover, if you want to run your existing Apache Beam jobs, you can provision Apache Beam on EMR and the jobs should run the same way.

monitoring spark cluster in AWS EMR without spark UI

I am running a Spark cluster on AWS EMR. How do I get all the details of the jobs and executors running on it without using the Spark UI? I am going to use this for monitoring and optimization.
You can check out Nagios or Ganglia for cluster health, but you can't see the jobs running on Spark with these tools.
If you are using AWS EMR, you can do that with the lynx text browser, something like below.
Log in to the master node of the cluster.
Try the command below:
lynx http://localhost:4040
Note: make sure a job is running before you type the command.
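If the goal is to feed the numbers into monitoring rather than read them in a browser, the same port also serves Spark's monitoring REST API under /api/v1. The following is a minimal sketch, assuming the driver of a running application is reachable at localhost:4040 as in the lynx command above; the function names and the selection of fields are my own choices for illustration, not something from the answer.

```python
import json
from urllib.request import urlopen

# Assumption: a Spark application is running and its driver exposes the
# REST API on the same host and port as the lynx command (localhost:4040).
BASE_URL = "http://localhost:4040/api/v1"


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def summarize_applications(base_url=BASE_URL, fetch=fetch_json):
    """Collect job and executor details for each application the API lists."""
    summary = []
    for app in fetch(f"{base_url}/applications"):
        app_id = app["id"]
        jobs = fetch(f"{base_url}/applications/{app_id}/jobs")
        executors = fetch(f"{base_url}/applications/{app_id}/executors")
        summary.append({
            "app_id": app_id,
            "name": app.get("name"),
            # (job id, status) pairs, e.g. RUNNING or SUCCEEDED
            "jobs": [(j.get("jobId"), j.get("status")) for j in jobs],
            # (executor id, total cores, memory used in bytes)
            "executors": [
                (e.get("id"), e.get("totalCores"), e.get("memoryUsed"))
                for e in executors
            ],
        })
    return summary
```

Calling summarize_applications() on the master node while a job is running returns a list that can be printed or shipped to whatever monitoring system is in use. As with the lynx approach, the endpoint only answers while an application is running.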

What should be suitable configuration to set up 2-3 node hadoop cluster on AWS?

What would be a suitable configuration for setting up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr, and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR or to use EC2 and set up the cluster manually.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (e.g. Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster on Amazon EC2.
See: Getting Started: Analyzing Big Data with Amazon EMR
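As a rough illustration of what such a deployment looks like from code, here is a minimal sketch that builds the request for boto3's EMR run_job_flow call with one master and two core nodes. The cluster name, instance type, release label, and region are assumptions chosen for the example, and the two role names are the EMR defaults, which must already exist in the account. Solr and Tomcat are not included here; they would have to be installed separately.

```python
def build_cluster_request(name="poc-cluster", core_nodes=2,
                          instance_type="m5.xlarge", release="emr-6.15.0"):
    """Build keyword arguments for the EMR run_job_flow call.

    All default values are placeholders for a small POC cluster.
    """
    return {
        "Name": name,
        "ReleaseLabel": release,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "HBase"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Keep the cluster alive for interactive POC work.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR roles; assumed to exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request and return the new cluster id (needs AWS credentials)."""
    import boto3  # imported lazily so the request can be built without boto3
    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]
```

Building the request separately from submitting it makes it easy to inspect or version the configuration before anything is launched.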

Presto Sandbox cluster on AWS EMR - add connector (catalog/.properties)

I just deployed a Presto Sandbox cluster on AWS using EMR. Is there any way to add connectors to my Presto cluster apart from manually (ssh) creating the properties and then restarting the cluster?
If you're looking for a UI to add a connector, Presto itself doesn't offer that and as far as I know Amazon EMR doesn't either. I'm afraid you'll have to add connectors manually by SSH-ing to the master node, creating the appropriate file, distributing it to all the nodes and then restarting everything.
Adding connectors to Presto on EMR does require a manual restart, as you mention. You might be able to use a CloudFormation template (CFT) to automate some of this, or you can try something like Ahana Cloud https://ahana.io/ahana-cloud/ which is a managed service for Presto on AWS.
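For the manual route, the "appropriate file" is a catalog properties file whose first entry is connector.name. Below is a minimal sketch that renders such a file so it can be generated once and copied to every node. The MySQL values are made-up placeholders, and the target directory mentioned afterwards is an assumption about the EMR layout rather than something stated in the answers.

```python
def render_catalog(connector_name, properties):
    """Render the text of a Presto catalog .properties file."""
    lines = [f"connector.name={connector_name}"]
    # Sort keys so the generated file is stable across runs.
    lines += [f"{key}={value}" for key, value in sorted(properties.items())]
    return "\n".join(lines) + "\n"


# Hypothetical MySQL catalog; host and credentials are placeholders.
mysql_catalog = render_catalog("mysql", {
    "connection-url": "jdbc:mysql://example.net:3306",
    "connection-user": "presto",
    "connection-password": "secret",
})
```

The resulting text would be saved as, for example, mysql.properties in the Presto catalog directory on each node (assumed here to be /etc/presto/conf/catalog/ on EMR), followed by the restart described above.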

Possibility of taking snapshot of AWS EMR cluster or namenode

I am new to AWS services and am trying out some use cases. I want to create EMR clusters on demand with some predefined configurations and applications/scripts installed. I was planning to create a snapshot of an existing EMR cluster, or at least of the namenode, and then use it whenever I want to create other clusters. But after some Google searching, I couldn't find any way to capture a snapshot of an EMR cluster. Is it possible to create a snapshot? Or is there any alternative way that can help with my use case?
I appreciate any kind of help.
Thanks
It is not possible to create a snapshot of an EMR cluster node. When this answer was written you also could not use a custom AMI when running a cluster (EMR releases from 5.7.0 onward do accept one). However, you can install software on the cluster nodes at cluster creation time using custom bootstrap actions. You can create your own bootstrap scripts and use them every time you launch a new cluster. This way you can achieve functionality similar to what you are seeking.
For more information using bootstrap actions on EMR please visit: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#bootstrapCustom
Let us know if you need any further assistance.
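To make the bootstrap approach concrete, here is a minimal sketch of how a bootstrap action is described when launching a cluster through boto3's run_job_flow, which accepts a BootstrapActions list. The helper names, the S3 path, and the script arguments are hypothetical; the script itself has to be uploaded to S3 beforehand.

```python
def bootstrap_action(name, script_path, args=()):
    """Describe one bootstrap action in the shape run_job_flow expects."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_path, "Args": list(args)},
    }


def with_bootstrap(request, actions):
    """Return a copy of a run_job_flow request with bootstrap actions attached."""
    updated = dict(request)
    updated["BootstrapActions"] = list(actions)
    return updated


# Hypothetical script location and arguments, for illustration only.
install_tools = bootstrap_action(
    "install-tools",
    "s3://my-bucket/bootstrap/install-tools.sh",
    args=["--with-scripts"],
)
```

Reusing the same list of actions for every launch gives each new cluster the same preinstalled software, which is the effect the snapshot was meant to achieve.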