How to run PySpark on AWS EMR with AWS Lambda - amazon-web-services

How may I make my PySpark code to run with AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?

You need transient cluster for this case which will auto terminate once your job is completed or the timeout is reached whichever occurs first.
You can access this link on how to initialise the same.

What are the processes available to create a EMR cluster:
Using boto3
/ AWS
CLI
/ Java
SDK
Using cloudformation
Using Data Pipeline
Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
No. It isn’t mandatory to use lambda to create an auto-terminating cluster.
You just need to specify a flag --auto-terminate while creating a cluster using boto3 / CLi / Java-SDK. But this case you need to submit the job along with cluster config. Ref
Note:
Its not possible to create an auto-terminating cluster using cloudformation. By design, CloudFormation assumes that the
resources that are being created will be permanent to some extent.
If you REALLY had to do it this way, you could make an AWS api call to
delete the CF stack upon finishing your EMR tasks.
How may I make my PySpark code to run with AWS EMR from AWS Lambda?
You can design your lambda to submit spark
job.
You can find an example
here
In my use case I have one parameterised lambda which invoke CF to create cluster, submit job and terminate cluster.

Related

AWS DataPipeline ShellCommandActivity alternative

What can be an equivalent service to AWS DataPipeline ShellCommandActivity? I want the activity to trigger a command on the local environment, can we achieve this using Step Functions? If not, what can be the other AWS solutions?

Automate turning on and off of redis cluster

I'm trying to automate the turning on and off process of Redis Cluster in aws. I saw the following link for reference (https://forums.aws.amazon.com/thread.jspa?threadID=149772). Is there a way to do it via cloudwatch ?
I am very new to aws platform.
Check the documentation regarding scale in/out
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/redis-cluster-resharding-online.html It also has commands to reshard a cluster manually.
Check CloudWatch metrics from the Redis cluster. https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.HostLevel.html and https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html Choose the metrics that will trigger autoscaling
You can trigger an AWS Lambda on some event for a metric https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/RunLambdaSchedule.html
From the Lambda you cal call aws cli to reshard the cluster as described in 1. Example: https://alestic.com/2016/11/aws-lambda-awscli/
If you need to turn off the cluster completely, instead of the resharding commands just use https://docs.aws.amazon.com/cli/latest/reference/elasticache/delete-cache-cluster.html

is there a way to kill a hive job without killing the AWS EMR cluster

I use AWS EMR cluster to run HIVE query. For query optimization purpose, sometime I need to kill a long-running step but keep the EMR cluster live so I can keep using it. Is there a way to do it either in HIVE CLI or AWS console?
Please refer here for the detail. To cancel steps using the AWS CLI:
aws emr cancel-steps --cluster-id j-2QUAJ7T3OTEI8 --step-ids s-3M8DKCZYYN1QE

Creating an AWS Data Pipeline EMR cluster using ShellCommandActivity

When I create an AWS EMR I can do so through their simple wizard on the AWS Management Console. Once it's completed I can test it out and when I'm happy with it's configuration I can simply click the AWS CLI Export button and copy the CLI command that creates the EMR.
I need to create an EMR as part of my AWS Data Pipeline process and rather than configure the EmrCluser and then running whatever EmrActivity I want I'm wondering if I could just copy my CLI command I exported during my testing and paste it inside a ShellCommandActivity which will create the EMR. From there I could use either an EmrActivity to do some processing or just use the ShellCommandActivity to do the processing.
Can I create my AWS Data Pipeline EMR Cluster using a CLI command that's run through a ShellCommandActivity? And if I do so will I be able to run an EmrActivity against that EMR Cluster? I just think it would be easier to create the EMR this way because I can use the AWS Management Console to create my EMR and then I can test my EMR before exporting the CLI command rather than going through the process of properly constructing the EMR through the AWS Data Pipeline wizard/JSON process. I.E., The actual EMR wizard on the AWS Management Console is way easier than the Data Pipeline wizard for creating the EMR on the AWS Management Console, especially when it comes to choosing my security groups and various configurations.
Update:
I just verified that I can in fact run a CLI command through the ShellCommandActivity to create my EMR through the Data Pipeline but is this possibly a code smell or bad practice? Are there any downfalls to creating and EMR on the Data Pipeline this way rather than doing it through the predefined EmrCluster command?
It's possible, but a little complicated:
The following action or the script itself would have to wait for the cluster to be created. Make sure the action does not time out.
The data pipeline does not know about the cluster, hence you need to specify a workerGroup instead of runsOn in the EMRActivity. You also need to install Task Runner on the cluster.

Security-Configuration Field For AWS Data Pipeline EmrCluster

I created an AWS EMR Cluster through the regular EMR Cluster wizard on the AWS Management Console and I was able to select a security-configuration e.g., when you export the CLI command it's --security-configuration 'mySecurityConfigurationValue'.
I now need to create a similar EMR through the AWS Data Pipeline but I don't see any options where I can specify this security-configuration field.
The only similar fields I see are EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, AdditionalSlaveSecurityGroups, AdditionalMasterSecurityGroups, and SubnetId. I already have all of those filled out in my Pipeline configuration but I just need to also specify the security-configuration. Any thoughts?
Unfortunately, DataPipeline does not support the Security Configurations feature (as well as other features that were introduced in the EMR 5.x versions like using a custom AMI).
One solution for this is to:
Replace the EmrCluster in your pipeline with an EC2 resource
Use a ShellCommandActivity on the EC2 resource to run the aws emr create-cluster CLI command
Use a bootstrap step to install TaskRunner on the cluster
Replace all the runsOn properties in your pipeline with workerGroup so the tasks run on the EMR cluster you created in step 2
Add a final ShellCommandActivity at the end of the pipeline to terminate the cluster using CLI
Now since you are spinning up your cluster using the CLI you have access to all kinds of features like security configurations, custom AMI, instance fleets, etc. and you can still orchestrate the tasks using DataPipeline.