If I want to have a long-running EMR cluster and then set up a Data Pipeline that does something on that cluster, how can I do it?
Do I have to install Task Runner on this EMR cluster? Or is Task Runner preinstalled? Or is there some other, simpler way?
Task Runner does not come pre-installed on EMR. It has to be set up manually; follow these steps to install Task Runner on the EMR cluster.
When starting the Task Runner process, provide a name for --workerGroup. This name identifies the EMR cluster and can be used in the workerGroup field of Data Pipeline activities.
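For reference, starting Task Runner on the master node looks roughly like the command below (the jar version, worker group name, region, and S3 log path are placeholders to adjust):

java -jar TaskRunner-1.0.jar --config ~/credentials.json --workerGroup=myEmrWorkerGroup --region=us-east-1 --logUri=s3://my-bucket/task-runner-logs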
I am trying to implement a solution on an EKS cluster where jobs are expected to be submitted by users/developers through the Kubeflow central dashboard. To offer Spark as a service to users on the platform, I tried a standalone Spark installation on the EKS cluster, where every other piece of configuration would have to be managed by an admin. So the managed EMR service could possibly be used here as an independent service that is triggered only when a job is submitted.
I am trying to make EMR on EC2 or EMR on EKS available as an endpoint to be used in Kubeflow notebooks or pipelines. I have tried various things but could not come up with a robust solution.
So if anybody has any experience with this, please feel free to share your suggestions.
When I create an AWS EMR cluster I can do so through the simple wizard on the AWS Management Console. Once it's completed I can test it out, and when I'm happy with its configuration I can simply click the AWS CLI Export button and copy the CLI command that creates the cluster.
I need to create an EMR cluster as part of my AWS Data Pipeline process, and rather than configuring the EmrCluster and then running whatever EmrActivity I want, I'm wondering if I could just copy the CLI command I exported during my testing and paste it inside a ShellCommandActivity that creates the cluster. From there I could use either an EmrActivity to do some processing or just use the ShellCommandActivity to do the processing.
Can I create my AWS Data Pipeline EMR cluster using a CLI command run through a ShellCommandActivity? And if I do, will I be able to run an EmrActivity against that EMR cluster? I just think it would be easier to create the cluster this way, because I can use the AWS Management Console to create and test the cluster before exporting the CLI command, rather than going through the process of properly constructing the cluster through the AWS Data Pipeline wizard/JSON process. That is, the EMR wizard on the AWS Management Console is far easier than the Data Pipeline wizard for creating the cluster, especially when it comes to choosing security groups and other configuration.
Update:
I just verified that I can in fact run a CLI command through a ShellCommandActivity to create my EMR cluster through the Data Pipeline, but is this possibly a code smell or bad practice? Are there any downsides to creating an EMR cluster on the Data Pipeline this way rather than doing it through the predefined EmrCluster resource?
It's possible, but a little complicated:
The subsequent activity, or the script itself, has to wait for the cluster to be created. Make sure the activity does not time out.
The Data Pipeline does not know about the cluster, so you need to specify a workerGroup instead of runsOn in the EmrActivity. You also need to install Task Runner on the cluster.
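For the waiting part in the first point, the AWS CLI ships a waiter you can call from the script, roughly like this (all cluster settings below are placeholders):

CLUSTER_ID=$(aws emr create-cluster --name "pipeline-cluster" --release-label emr-5.30.0 \
  --instance-type m5.xlarge --instance-count 3 --use-default-roles \
  --query 'ClusterId' --output text)
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

The waiter blocks until the cluster reaches a running state, which is also why the activity's timeout needs to be generous.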
I created an AWS EMR cluster through the regular EMR cluster wizard on the AWS Management Console and I was able to select a security configuration, i.e., when you export the CLI command it appears as --security-configuration 'mySecurityConfigurationValue'.
I now need to create a similar EMR cluster through AWS Data Pipeline, but I don't see any option where I can specify this security-configuration field.
The only similar fields I see are EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, AdditionalSlaveSecurityGroups, AdditionalMasterSecurityGroups, and SubnetId. I already have all of those filled out in my Pipeline configuration but I just need to also specify the security-configuration. Any thoughts?
Unfortunately, Data Pipeline does not support the Security Configurations feature (nor other features introduced in the EMR 5.x releases, such as using a custom AMI).
One solution for this is to:
Replace the EmrCluster in your pipeline with an EC2 resource
Use a ShellCommandActivity on the EC2 resource to run the aws emr create-cluster CLI command
Use a bootstrap action to install Task Runner on the cluster
Replace all the runsOn properties in your pipeline with workerGroup so the tasks run on the EMR cluster you created in step 2
Add a final ShellCommandActivity at the end of the pipeline to terminate the cluster using the CLI
Now, since you are spinning up your cluster using the CLI, you have access to all kinds of features like security configurations, custom AMIs, instance fleets, etc., and you can still orchestrate the tasks using Data Pipeline.
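As a rough sketch of what the create and terminate commands could look like from the EC2 resource (the cluster name, release label, instance settings, and the bootstrap script path are all placeholders):

CLUSTER_ID=$(aws emr create-cluster \
  --name "pipeline-cluster" \
  --release-label emr-5.30.0 \
  --security-configuration "mySecurityConfigurationValue" \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/install-task-runner.sh \
  --query 'ClusterId' --output text)
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"
# ... the pipeline activities targeting the workerGroup run here ...
aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"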
I am running a Spark cluster on AWS EMR. How do I get all the details of the jobs and executors that are running on AWS EMR without using the Spark UI? I am going to use this for monitoring and optimization.
You can check out Nagios or Ganglia for cluster health, but you can't see the jobs running on Spark with these tools.
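If Ganglia is the route you take, note that on EMR it is available as a built-in application that can be enabled when the cluster is created; a minimal sketch, with the cluster settings below as placeholders:

aws emr create-cluster --name "spark-cluster" --release-label emr-5.30.0 \
  --applications Name=Spark Name=Ganglia \
  --instance-type m5.xlarge --instance-count 3 --use-default-roles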
If you are using AWS EMR you can do that with the lynx text-based browser, something like below.
Log in to the master node of the cluster.
Try the command below:
lynx http://localhost:4040
Note: before you type the command, make sure a job is running.
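If you need the same details in a machine-readable form for monitoring, the data behind that UI is also exposed through Spark's REST API on the same port. A quick sketch, with the application ID as a placeholder taken from the first call:

curl http://localhost:4040/api/v1/applications
curl http://localhost:4040/api/v1/applications/<app-id>/jobs
curl http://localhost:4040/api/v1/applications/<app-id>/executors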
What would be a suitable configuration to set up a 2-3 node Hadoop cluster on AWS?
I want to set up Hive, HBase, Solr, and Tomcat on the Hadoop cluster for the purpose of doing small POCs.
Also, please suggest whether to go with EMR, or with EC2 and manually setting up the cluster on it.
Amazon EMR can deploy a multi-node cluster with Hadoop and various applications (e.g. Hive, HBase) within a few minutes. It is much easier to deploy and manage than trying to deploy your own Hadoop cluster on Amazon EC2.
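For example, a small cluster with Hive and HBase can be launched with one CLI command along these lines (the release label, instance type, and key pair name are placeholders to adjust; Solr and Tomcat would still need to be installed separately):

aws emr create-cluster --name "poc-cluster" --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Hive Name=HBase \
  --instance-type m5.xlarge --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair --use-default-roles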
See: Getting Started: Analyzing Big Data with Amazon EMR