Run AWS EMR Cluster Using Step Functions - amazon-web-services

I am very new to AWS Step Functions and AWS Lambda and could really use some help getting an EMR cluster running through Step Functions. A sample of my current state machine structure is shown in the following code:
{
  "Comment": "This is a test for running the structure of the CustomCreate job.",
  "StartAt": "PreStep",
  "States": {
    "PreStep": {
      "Comment": "Check that all the necessary files exist before running the job.",
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:XXXXXXXXXX:function:CustomCreate-PreStep-Function",
      "Next": "Run Job Choice"
    },
    "Run Job Choice": {
      "Comment": "This step chooses whether or not to go forward with running the main job.",
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.FoundNecessaryFiles",
          "BooleanEquals": true,
          "Next": "Spin Up Cluster"
        },
        {
          "Variable": "$.FoundNecessaryFiles",
          "BooleanEquals": false,
          "Next": "Do Not Run Job"
        }
      ]
    },
    "Do Not Run Job": {
      "Comment": "This step triggers if the PreStep fails and the job should not run.",
      "Type": "Fail",
      "Cause": "PreStep unsuccessful"
    },
    "Spin Up Cluster": {
      "Comment": "Spins up the EMR Cluster.",
      "Type": "Pass",
      "Next": "Update Env"
    },
    "Update Env": {
      "Comment": "Update the environment variables in the EMR Cluster.",
      "Type": "Pass",
      "Next": "Run Job"
    },
    "Run Job": {
      "Comment": "Add steps to the EMR Cluster.",
      "Type": "Pass",
      "End": true
    }
  }
}
The PreStep and Run Job Choice tasks use a simple Lambda function to check that the files necessary to run this job exist on my S3 bucket, then proceed to spin up the cluster provided the necessary files are found. These tasks are working properly.
What I am not sure about is how to handle the EMR Cluster related steps.
In my current structure, the first task is to spin up an EMR cluster. This could be done directly in the Step Functions JSON, or preferably by using a JSON cluster config file (titled EMR-cluster-setup.json) that I have located on my S3 bucket.
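For what it's worth, here is a rough sketch of what I imagine the Spin Up Cluster task might look like if it were written directly in the Step Functions JSON using the elasticmapreduce:createCluster.sync integration. The cluster name, release label, roles, log path, and instance settings below are placeholders I made up, and I have not verified any of this:
"Spin Up Cluster": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
  "Parameters": {
    "Name": "CustomCreateCluster",
    "VisibleToAllUsers": true,
    "ReleaseLabel": "emr-5.29.0",
    "Applications": [
      { "Name": "Spark" }
    ],
    "ServiceRole": "EMR_DefaultRole",
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "LogUri": "s3://PATH/TO/LOGS/",
    "Instances": {
      "KeepJobFlowAliveWhenNoSteps": true,
      "InstanceFleets": [
        {
          "Name": "MasterFleet",
          "InstanceFleetType": "MASTER",
          "TargetOnDemandCapacity": 1,
          "InstanceTypeConfigs": [
            { "InstanceType": "m5.xlarge" }
          ]
        },
        {
          "Name": "CoreFleet",
          "InstanceFleetType": "CORE",
          "TargetOnDemandCapacity": 2,
          "InstanceTypeConfigs": [
            { "InstanceType": "m5.xlarge" }
          ]
        }
      ]
    }
  },
  "ResultPath": "$.cluster",
  "Next": "Update Env"
}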
My next task is to update the EMR cluster's environment variables. I have a .sh script located on my S3 bucket that can do this. I also have a JSON file (titled EMR-RUN-SCRIPT.json) located on my S3 bucket that will add a first step to the EMR cluster to run and source the .sh script. I just need to run that JSON file from within the EMR cluster, which I do not know how to do using Step Functions (my best guess at what that task might look like follows the JSON below). The code for EMR-RUN-SCRIPT.json is shown below:
[
  {
    "Name": "EMR-RUN-SCRIPT",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
      "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
      "Args": [
        "s3://PATH/TO/env_configs.sh"
      ]
    }
  }
]
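My guess is that the same step definition could be inlined into an elasticmapreduce:addStep.sync task that references the cluster created earlier, something along these lines (this assumes the create-cluster result was saved under $.cluster, which may not be right):
"Update Env": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
  "Parameters": {
    "ClusterId.$": "$.cluster.ClusterId",
    "Step": {
      "Name": "EMR-RUN-SCRIPT",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
        "Args": [
          "s3://PATH/TO/env_configs.sh"
        ]
      }
    }
  },
  "ResultPath": null,
  "Next": "Run Job"
}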
My third task is to add a step containing a spark-submit command to the EMR cluster. This command is described in a JSON config file (titled EMR-RUN-STEP.json) located on my S3 bucket, which can be added to the EMR cluster in a similar manner to the environment configs file in the previous step. The code for EMR-RUN-STEP.json is shown below:
[
  {
    "Name": "EMR-RUN-STEP",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
      "Jar": "command-runner.jar",
      "Args": [
        "bash", "-c",
        "source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
      ]
    }
  }
]
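Presumably the Run Job task would follow the same addStep.sync pattern (again assuming the cluster ID is available at $.cluster.ClusterId, and that a final Terminate Cluster state exists to hand off to, which is what I still need help with):
"Run Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
  "Parameters": {
    "ClusterId.$": "$.cluster.ClusterId",
    "Step": {
      "Name": "EMR-RUN-STEP",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
          "bash", "-c",
          "source /home/hadoop/.bashrc && spark-submit --master yarn --conf spark.yarn.submit.waitAppCompletion=false --class CLASSPATH.TO.MAIN s3://PATH/TO/JAR/FILE"
        ]
      }
    }
  },
  "ResultPath": null,
  "Next": "Terminate Cluster"
}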
Finally, I want to have a task that makes sure the EMR Cluster terminates after it completes its run.
I know there is a lot involved in this question, but I would greatly appreciate any assistance with any of the issues described above. Whether it be following the structure I outlined above, or if you know of another solution, I am open to any form of help. Thank you in advance.

You need a terminate cluster step, as the documentation states:
https://docs.aws.amazon.com/step-functions/latest/dg/connect-emr.html
createCluster uses the same request syntax as runJobFlow, except for the following: The field Instances.KeepJobFlowAliveWhenNoSteps is mandatory, and must have the Boolean value TRUE.
Because the cluster is therefore kept alive when it runs out of steps, you need a step in your state machine to shut it down for you:
terminateCluster.sync - for me this is preferable over the simple terminateCluster, as it waits for the cluster to actually terminate and you can handle any hangs here. You'll be using Standard Step Functions, so the bit of extra time will not be billed.
From the same documentation: terminateCluster shuts down a cluster (job flow) and uses the same request syntax as terminateJobFlows; terminateCluster.sync is the same as terminateCluster, but waits for the cluster to terminate.
PS: if you are using termination protection you'll need an extra step to turn it off before you can terminate your cluster ;)
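A minimal sketch of such a terminate state, assuming the cluster ID was stored at $.cluster.ClusterId by the create-cluster task (adjust the path to wherever you keep it):
"Terminate Cluster": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
  "Parameters": {
    "ClusterId.$": "$.cluster.ClusterId"
  },
  "End": true
}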

'KeepJobFlowAliveWhenNoSteps': False
Add the above configuration to your EMR cluster creation script; it goes in the Instances section of the boto3 run_job_flow call. The cluster will then terminate automatically once all of its steps are completed.
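For illustration, a sketch of where the flag sits inside the Instances block of such a cluster config (instance types and counts are placeholders). Note this only applies if you launch the cluster outside the Step Functions createCluster integration, which, as noted in the answer above, requires the flag to be TRUE:
"Instances": {
  "KeepJobFlowAliveWhenNoSteps": false,
  "InstanceGroups": [
    {
      "Name": "Master",
      "InstanceRole": "MASTER",
      "InstanceType": "m5.xlarge",
      "InstanceCount": 1
    },
    {
      "Name": "Core",
      "InstanceRole": "CORE",
      "InstanceType": "m5.xlarge",
      "InstanceCount": 2
    }
  ]
}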

Related

startup scripts on Google cloud platform using Packer

I'm using HashiCorp's Packer to create machine images for Google Cloud (like AMIs for Amazon). I want every instance to run a script once the instance is created in the cloud. As I understand from the Packer docs, I could use startup_script_file to do this. I got this working, but it seems that the script is only run once, at image creation, resulting in the same output on every running instance. How can I trigger this script only on instance creation, so that I can have different output for every instance?
packer config:
{
"builders": [{
"type": "googlecompute",
"project_id": "project-id",
"source_image": "debian-9-stretch-v20200805",
"ssh_username": "name",
"zone": "europe-west4-a",
"account_file": "secret-account-file.json",
"startup_script_file": "link to file"
}]
}
script:
#!/bin/bash
echo $((1 + RANDOM % 100)) > test.log #output of this remains the same on every created instance.

Incorrect image reference when launching Dataflow Flex templates

We are using Dataflow Flex Templates and following this guide (https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) to stage and launch jobs. This is working in our environment. However, when I SSH onto the Dataflow VM and run docker ps, I see it is referencing a different Docker image from the one we specify in our template.
The template I am launching from is as follows and jobs are created using gcloud beta dataflow flex-template run:
{
"image": "gcr.io/<MY PROJECT ID>/samples/dataflow/streaming-beam-sql:latest",
"metadata": {
"description": "An Apache Beam streaming pipeline that reads JSON encoded messages from Pub/Sub, uses Beam SQL to transform the message data, and writes the results to a BigQuery",
"name": "Streaming Beam SQL",
"parameters": [
{
"helpText": "Pub/Sub subscription to read from.",
"label": "Pub/Sub input subscription.",
"name": "inputSubscription",
"regexes": [
".*"
]
},
{
"helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.",
"is_optional": true,
"label": "BigQuery output table",
"name": "outputTable",
"regexes": [
"[^:]+:[^.]+[.].+"
]
}
]
},
"sdkInfo": {
"language": "JAVA"
}
}
So I would expect the output of docker ps to show gcr.io/<MY PROJECT ID>/samples/dataflow/streaming-beam-sql as the image on Dataflow, which is what docker ps shows when I launch the same image from GCR on a GCE instance.
Should I expect to see the name of the image I have referenced in the Dataflow template on the Dataflow VM? Or have I missed a step somewhere?
Thanks!
TLDR; You are looking in the worker VM instead of launcher VM.
In the case of Flex Templates, when you run the job, it first creates a launcher VM, where it pulls your container and runs it to generate the job graph. This VM is destroyed after this step is completed. Then the worker VM is started to actually run the generated job graph. On the worker VM there is no need for your container; your container is used only to generate the job graph based on the parameters passed.
In your case, you are trying to search for your image in the worker VM. The launcher VM is short lived and starts with launcher-*********************. If you SSH into that VM and do docker ps you will be able to see your container image.

Google Compute Engine GPU

Google recently added support for GPUs in their cloud service.
I'm trying to follow the instructions found here to start a machine with a GPU. Running this script on Windows:
gcloud beta compute instances create gpu-instance-1^
--machine-type n1-standard-2^
--zone us-east1-d^
--accelerator type=nvidia-tesla-k80,count=1^
--image-family ubuntu-1604-lts^
--image-project ubuntu-os-cloud^
--maintenance-policy TERMINATE^
--restart-on-failure^
with gcloud command line tool version 146.0.0 fails, saying:
ERROR: (gcloud.beta.compute.instances.create) unknown collection [compute.acceleratorTypes]
Any ideas?
I was never able to get the gcloud utility working, but using the API did work. Of note, when posting the API request (instructions are on the same page as the gcloud instructions), the key that creates an instance with a GPU is guestAccelerators. This key does not have an analogous option in gcloud.
Copying the API request as it appears on the instructions page linked above.
POST https://www.googleapis.com/compute/beta/projects/[PROJECT_ID]/zones/[ZONE]/instances?key={YOUR_API_KEY}
{
"machineType": "https://www.googleapis.com/compute/beta/projects/[PROJECT_ID]/zones/[ZONE]/machineTypes/n1-highmem-2",
"disks":
[
{
"type": "PERSISTENT",
"initializeParams":
{
"diskSizeGb": "[DISK_SIZE]",
"sourceImage": "https://www.googleapis.com/compute/beta/projects/[IMAGE_PROJECT]/global/images/family/[IMAGE_FAMILY]"
},
"boot": true
}
],
"name": "[INSTANCE_NAME]",
"networkInterfaces":
[
{
"network": "https://www.googleapis.com/compute/beta/projects/[PROJECT_ID]/global/networks/[NETWORK]"
}
],
"guestAccelerators":
[
{
"acceleratorCount": [ACCELERATOR_COUNT],
"acceleratorType": "https://www.googleapis.com/compute/beta/projects/[PROJECT_ID]/zones/[ZONE]/acceleratorTypes/[ACCELERATOR_TYPE]"
}
],
"scheduling":
{
"onHostMaintenance": "terminate",
"automaticRestart": true
},
"metadata":
{
"items":
[
{
"key": "startup-script",
"value": "[STARTUP_SCRIPT]"
}
]
}
}
Sometimes you need to ensure you have the latest version of the gcloud utility installed in order to use certain GCP features.
Try running this command or read the below docs on how to update your gcloud utility:
gcloud components update
https://cloud.google.com/sdk/gcloud/reference/components/update

AWS ECS leader commands (django migrate)

We are currently deploying our Django app on AWS Elastic Beanstalk. There we execute the Django DB migrations using container commands, where we ensure migrations run on only one instance by using the "leader_only" restriction.
We are considering moving our deployment to AWS EC2 Container Service (ECS). However, we cannot figure out a way to enforce that the migration runs on only one container when a new image is deployed.
Is it possible to configure leader_only commands in AWS EC2 Container Service?
There is a possibility to use ECS built-in functionality to handle deployments that involve migrations. Basically, the idea is the following:
Make containers fail their health checks if they are running against an unmigrated database, e.g. via a custom health-check view that builds the migration execution plan and returns 503 while any migrations are still pending:
plan = executor.migration_plan(executor.loader.graph.leaf_nodes())
status = 503 if plan else 200
Make a task definition that does nothing more than migrate the database, and make sure it is scheduled for execution with the rest of the deployment process.
The result is that the deployment process will try to bring up one new container. This new container will fail its health checks as long as the database is not migrated, and thus blocks the rest of the deployment (so you still have the old instances running to serve requests). Once the migration is done, the health check succeeds, so the deployment unblocks and proceeds.
This is by far the most elegant solution I was able to find in terms of running Django migrations on Amazon ECS.
Source: https://engineering.instawork.com/elegant-database-migrations-on-ecs-74f3487da99f
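As a rough sketch of the second point (the image name is a placeholder and log configuration is omitted), the containerDefinitions of such a migrate-only task definition could be as simple as:
[
  {
    "name": "migrate",
    "image": "my-django-image",
    "essential": true,
    "command": ["python3", "manage.py", "migrate"]
  }
]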
Look at using Container Dependency in your task definition to make your application container wait for a migration container to successfully complete. Here's a brief example of the container_definitions component of a task definition:
{
"name": "migration",
"image": "my-django-image",
"essential": false,
"command": ["python3", "mange.py migrate"]
},
{
"name": "django",
"image": "my-django-image",
"essential": true,
"dependsOn": [
{
"containerName": "migration",
"condition": "SUCCESS"
}
]
}
The migration container starts, runs the migrate command, and exits. If successful, then the django container is launched. Of course, if your service is running multiple tasks, each task will run in this fashion, but once migrations have been run once, additional migrate commands will be a no-op, so there's no harm.
For those using task definition JSON, all we need to do is flag a container as not essential in containerDefinitions:
{
"name": "migrations",
"image": "your-image-name",
"essential": false,
"cpu": 24,
"memory": 200,
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "your-logs-group",
"awslogs-region": "your-region",
"awslogs-stream-prefix": "your-log-prefix"
}
},
"command": [
"python3", "manage.py", "migrate"
],
"environment": [
{
"name": "ENVIRON_NAME",
"value": "${ENVIRON_NAME}"
}
]
}
I flagged this container as "essential": false.
"If the essential parameter of a container is marked as true, and that container fails or stops for any reason, all other containers that are part of the task are stopped. If the essential parameter of a container is marked as false, then its failure does not affect the rest of the containers in a task. If this parameter is omitted, a container is assumed to be essential."
source: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
Honestly, I have not figured this out. I have encountered exactly the same limitation on ECS (as well as others, which made me abandon it, but that is off topic).
Potential workarounds:
1) Run migrations inside your init script. This has the flaw that it runs on every node at the time of the deployment (I assume you have multiple replicas).
2) Add this as a step of your CI flow.
Hope I helped a bit; if I come up with another idea, I'll report back here.
It's not optimal, but you can simply run it as a command in task definition
"command": ["/bin/sh", "-c", "python manage.py migrate && gunicorn -w 3 -b :80 app.wsgi:application"],

Running Spark on AWS EMR, how to run driver on master node?

It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually unutilized. Is it possible to run the driver program on the MASTER node instead? I have experimented with the --deploy-mode arguments to no avail.
Here is my instance groups JSON definition:
[
{
"InstanceGroupType": "MASTER",
"InstanceCount": 1,
"InstanceType": "m3.xlarge",
"Name": "Spark Master"
},
{
"InstanceGroupType": "CORE",
"InstanceCount": 3,
"InstanceType": "m3.xlarge",
"Name": "Spark Executors"
}
]
Here is my configurations JSON definition:
[
{
"Classification": "spark",
"Properties": {
"maximizeResourceAllocation": "true"
},
"Configurations": []
},
{
"Classification": "spark-env",
"Properties": {
},
"Configurations": [
{
"Classification": "export",
"Properties": {
},
"Configurations": [
]
}
]
}
]
Here is my steps JSON definition:
[
{
"Name": "example",
"Type": "SPARK",
"Args": [
"--class", "com.name.of.Class",
"/home/hadoop/myjar-assembly-1.0.jar"
],
"ActionOnFailure": "TERMINATE_CLUSTER"
}
]
I am using aws emr create-cluster with --release-label emr-4.3.0.
Setting the location of the driver
With spark-submit, the flag --deploy-mode can be used to select the location of the driver.
Submitting applications in client mode is advantageous when you are debugging and wish to quickly see the output of your application. For applications in production, the best practice is to run the application in cluster mode. This mode offers you a guarantee that the driver is always available during application execution. However, if you do use client mode and you submit applications from outside your EMR cluster (such as locally, on a laptop), keep in mind that the driver is running outside your EMR cluster and there will be higher latency for driver-executor communication.
https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit
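As a sketch based on the steps JSON above (untested here), passing --deploy-mode client in the step arguments keeps the driver on the master node, because in client mode the driver runs on the node that invokes spark-submit, which for an EMR step is the master:
[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--deploy-mode", "client",
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]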
I don't think it is a waste. When running Spark on EMR, the master node will run the YARN ResourceManager, Livy server, and possibly other applications you selected. And if you run in client mode, the majority of the driver program will run on the master node as well.
Note that the driver program could be heavier than the tasks on executors, e.g. collecting all results from all executors, in which case you need to allocate enough resources to your master node if it is where the driver program is running.