I have built a local cluster on my laptop (pseudo-distributed mode), where I run different MapReduce commands like:
hadoop-streaming -D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files my_mapper.py,my_reducer.py \
-mapper my_mapper.py \
-reducer my_reducer.py \
-input /aws/input/input_warc.txt \
-output /aws/output
Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run exactly the commands above. For that, I think that if I SSH to the EMR master, I should be able to run this command. Is that the right way, or are there drawbacks to this approach?
Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html) to run arbitrary commands on the master instance, including, of course, distributed jobs like your example. You may add steps to a cluster using the AWS CLI ("aws emr add-steps ...", or during cluster creation with "aws emr create-cluster ... --steps ..."), the AWS SDKs (such as the AWS Java SDK), or the AWS EMR Console.
Some advantages of the Step API include that it captures the output of each step so that you can view it via the AWS CLI, SDK, or AWS Console, and you can also check the status of Steps to determine when they have completed.
One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.
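For example, a streaming job like the one above could be submitted as a step roughly like this (a sketch, assuming the mapper/reducer scripts and input have been copied to an S3 bucket of your own; the cluster id and bucket name are placeholders):
cat > streaming-step.json <<'EOF'
[
  {
    "Type": "STREAMING",
    "Name": "My streaming job",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-D", "mapred.output.compress=true",
      "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
      "-files", "s3://mybucket/my_mapper.py,s3://mybucket/my_reducer.py",
      "-mapper", "my_mapper.py",
      "-reducer", "my_reducer.py",
      "-input", "s3://mybucket/input/input_warc.txt",
      "-output", "s3://mybucket/output"
    ]
  }
]
EOF
aws emr add-steps --cluster-id j-XXXXXXXX --steps file://streaming-step.json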
Related
How to add environment variables to an EMR cluster.
Currently, I have added them to a .sh file and am using script-runner.jar to run the script.
#!/bin/bash
export PYSPARK_PYTHON=/home/hadoop/bin/python
export PYSPARK_DRIVER_PYTHON=/home/hadoop/bin/python
I was submitting the script like this, as mentioned here:
aws emr add-steps \
--cluster-id j-2AXXXXXXGAPLF \
--steps Type=CUSTOM_JAR,Name="Run a script from S3 with script-runner.jar",ActionOnFailure=CONTINUE,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/my-script.sh]
I have also tried using command-runner.jar. Neither approach worked. Can you suggest some other way to add environment variables to the cluster remotely/from an EC2 instance?
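For what it's worth, the documented alternative for these particular variables is EMR's configuration classifications rather than a step script. A minimal sketch, assuming launching a new cluster is an option and keeping the paths from the script above (the release label and instance settings are placeholders):
cat > spark-env.json <<'EOF'
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/home/hadoop/bin/python",
          "PYSPARK_DRIVER_PYTHON": "/home/hadoop/bin/python"
        }
      }
    ]
  }
]
EOF
aws emr create-cluster --release-label emr-5.30.0 --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 --use-default-roles \
  --configurations file://spark-env.json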
I'd like to add a step as a Spark application using the AWS CLI, but I cannot find a working command. The AWS official docs (https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html) list six examples, none of which is for Spark.
I can configure it through the AWS Console UI and it runs fine, but for efficiency I'd like to be able to do this via the AWS CLI.
The closest that I could come up with is this command:
aws emr add-steps --cluster-id j-cluster-id --steps Type=SPARK,Name='SPARK APP',ActionOnFailure=CONTINUE,Jar=s3://my-test/RandomJava-1.0-SNAPSHOT.jar,MainClass=JavaParquetExample1,Args=s3://my-test/my-file_0000_part_00.parquet,my-test --profile my-test --region us-west-2
but this resulted in the following configuration on the AWS EMR step:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit s3://my-test/my-file_0000_part_00.parquet my-test
Action on failure: Continue
which resulted in failure.
The correct one (completed successfully, configured through AWS Console UI) looks like this:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit --deploy-mode cluster --class sparkExamples.JavaParquetExample1 s3://my-test/RandomJava-1.0-SNAPSHOT.jar --s3://my-test/my-file_0000_part_00.parquet --my-test
Action on failure: Continue
Any help is greatly appreciated!
This works for me. I am adding a Spark application to a cluster with the step name My step name. Let's say you name the file step-addition.sh. Its content is the following:
#!/bin/bash
set -x
#cluster id
clusterId=$1
startDate=$2
endDate=$3
aws emr add-steps --cluster-id $clusterId --steps Type=Spark,Name='My step name',\
ActionOnFailure=TERMINATE_CLUSTER,Args=[\
"--deploy-mode","cluster","--executor-cores","1","--num-executors","20","--driver-memory","10g","--executor-memory","3g",\
"--class","your-package-structure-like-com.a.b.c.JavaParquetExample1",\
"--master","yarn",\
"--conf","spark.driver.my.custom.config1=my-value-1",\
"--conf","spark.driver.my.custom.config2=my-value-2",\
"--conf","spark.driver.my.custom.config.startDate=${startDate}",\
"--conf","spark.driver.my.custom.config.endDate=${endDate}",\
"s3://my-bucket/my-prefix/path-to-your-actual-application.jar"]
You can execute the above script simply like this:
bash $WORK_DIR/step-addition.sh $clusterId $startDate $endDate
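Applied to the jar and class from the question above, the same pattern would look roughly like this (an untested sketch; the cluster id, profile, and region are copied from the question):
aws emr add-steps --cluster-id j-cluster-id --profile my-test --region us-west-2 \
--steps Type=Spark,Name='SPARK APP',ActionOnFailure=CONTINUE,Args=[\
"--deploy-mode","cluster",\
"--class","sparkExamples.JavaParquetExample1",\
"s3://my-test/RandomJava-1.0-SNAPSHOT.jar",\
"--s3://my-test/my-file_0000_part_00.parquet","--my-test"]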
I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't been able to tell whether what I was reading referred to launching a job from the master node of the cluster or from some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts &/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
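For example, a setting passed once at cluster creation via a configuration classification does not need to be repeated on every spark-submit. A sketch (the release label, instance settings, and key name are placeholders):
aws emr create-cluster \
  --name "spark-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"3g"}}]'
With that in place, spark-submit PYSPARK_SCRIPT.py picks up spark.executor.memory without an explicit --conf flag.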
You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/
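"Put your deployment on S3" here is just copying the script and its input to the bucket paths referenced in the step, e.g. (a sketch, assuming wordcount.py and input.txt exist locally; bucket names are taken from the example above):
aws s3 cp wordcount.py s3://codelocation/wordcount.py
aws s3 cp input.txt s3://inputbucket/input.txt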
The use case is: a developer makes some code changes, and the following happens automatically:
the build runs, the application artifact is created, a Docker image is generated with the artifact, the image is pushed to the Docker registry, and the AWS ECS tasks and ECS services are updated.
I want to know what ways there are to achieve this automation of AWS ECS service updates. So far I have implemented the ECS update from a Jenkins build using:
1) running post-build AWS CLI scripts from Jenkins to update ECS;
2) a post-build action or pipeline step that invokes an AWS Lambda function. I have created a Lambda function in Java to implement that.
Please let me know what other ways we can achieve the above. Thanks.
I'm continuously deploying Docker containers from CircleCI to AWS ECS.
The outline of the deployment flow is as follows:
Build and tag a new Docker image
Login to AWS ECR and push the image
Update task definitions and services of ECS with ecs-deploy
ecs-deploy is a useful script that updates Docker images in ECS.
https://github.com/silinternational/ecs-deploy
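At the command level, the three steps above boil down to something like this (a rough sketch; the account id, region, repository, cluster, and service names are placeholders):
# 1. build and tag a new Docker image
REPO=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app
docker build -t my-app:latest .
docker tag my-app:latest "$REPO:latest"
# 2. log in to ECR and push the image (AWS CLI v1 helper shown; newer CLIs use aws ecr get-login-password)
eval $(aws ecr get-login --no-include-email --region us-east-1)
docker push "$REPO:latest"
# 3. update the ECS task definition and service with ecs-deploy
ecs-deploy -c my-cluster -n my-service -i "$REPO:latest" -r us-east-1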
You could use a shell script that calls AWS CLI commands to create CloudFormation stacks, or directly call the create commands in the AWS CLI for the ECR repository, Task Definition, and Events rule and target (for scheduling).
Then you just call this script from your terminal with ./setup.sh, and it executes all your commands at once.
aws ecr create-repository \
--repository-name tasks-${TASK_NAME}-${TASK_ENV} \
;
Or, if you want to set up your resources via CloudFormation templates, you can launch them using this command, as long as the template exists at file://name.yml:
aws cloudformation create-stack \
--stack-name stack-name \
--capabilities CAPABILITY_IAM \
--template-body file://name.yml \
--parameters ParameterKey=ParamName,ParameterValue=${PARAM_NAME} \
;
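A minimal setup.sh wrapping the calls above might look like this (a sketch; it assumes TASK_NAME, TASK_ENV, and PARAM_NAME are already exported, and uses aws cloudformation wait to block until the stack is ready):
#!/bin/bash
set -euo pipefail
# create the ECR repository for the task
aws ecr create-repository \
  --repository-name "tasks-${TASK_NAME}-${TASK_ENV}"
# create the CloudFormation stack from the template
aws cloudformation create-stack \
  --stack-name "stack-${TASK_NAME}-${TASK_ENV}" \
  --capabilities CAPABILITY_IAM \
  --template-body file://name.yml \
  --parameters ParameterKey=ParamName,ParameterValue="${PARAM_NAME}"
# block until the stack has finished creating
aws cloudformation wait stack-create-complete \
  --stack-name "stack-${TASK_NAME}-${TASK_ENV}"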
Take a look at Codefresh - https://docs.codefresh.io/docs/amazon-ecs
You can build your pipeline:
Build Step
Push to Registry
Deploy to ECS
It's that easy.
While there are a ton of CI/CD tools out there, since I am early in my rollout, I decided to write a small script instead of having CI/CD pipelines do it.
Here is a one-click deploy script I wrote using the ecs-deploy script as a dependency to achieve a rolling deploy of a docker image to ECS.
You can run this locally from your dev or build/deployment box or use Jenkins or some local build tool.
#!/bin/bash
# automatically login to AWS
eval $(aws ecr get-login)
# build local docker image and push repo to AWS
docker build -t <yourlocaldockerimagetag> .
docker tag <yourlocaldockerimagetag>:latest <yourECSRepoURL>:latest
docker -D -l debug push <yourECSRepoURL>:latest
# deploy to ECS
ecs-deploy/ecs-deploy -m 50 -k <access-key> -s <secret-key> -r <aws-region> -c <cluster-name> -n <service-name> -i <yourECSRepoURL>:latest
Parameters:
cluster-name: Your cluster name in ECS
service-name: Your service name that you had created in ECS
yourECSRepoURL: ECS Repository URL
yourlocaldockerimagetag: Any local image tag name
access-key: your AWS access key for deployments
secret-key: your AWS secret key
Make sure you install ecs-deploy before this script.
The -m 50 tells it that it can deploy even if the number of healthy tasks drops to 50%. Ideally you would have an extra node to do deployments, but if you can't afford that, this setting ensures that deployments continue to happen.
If you are also using an ELB (load balancer), then the default deregistration delay for target groups is 5 minutes, which is a bit excessive. The deregistration delay is the time to wait for existing requests to complete BEFORE ECS sends a SIGTERM or SIGINT to your Docker container. You should lower this by going to Target Groups in the EC2 dashboard and clicking Edit Attributes. Otherwise your deployments may take forever.
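The same attribute can also be lowered from the CLI (the target group ARN below is a placeholder):
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/0123456789abcdef \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30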
I think nobody has mentioned CodePipeline from AWS; it integrates really easily with many AWS services, including ECS and CodeCommit:
Push a commit to the CodeCommit repo, triggering the pipeline execution.
(Optional) Configure a Manual Approval step that requires you to take an action before the build.
Run a CodeBuild project that builds your Dockerfile and pushes the image to an ECR repo (see the sketch after this list).
Run a "Deploy" step that deploys to a specific ECS service. It updates the service with a new Task Definition that points to the new ECR image.
I have used this flow with BitBucket also, just configure a BitBucket pipeline that pushes all new code to a CodeCommit Repo as a previous step.
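For reference, the build commands inside such a CodeBuild project usually boil down to something like this (a sketch; the repository URI and container name are placeholders, and imagedefinitions.json is the artifact the ECS deploy action consumes):
REPO_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app
# log in to ECR, then build and push the image
eval $(aws ecr get-login --no-include-email --region us-east-1)
docker build -t "$REPO_URI:latest" .
docker push "$REPO_URI:latest"
# produce the artifact the ECS deploy action expects
printf '[{"name":"my-container","imageUri":"%s"}]' "$REPO_URI:latest" > imagedefinitions.json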
Exactly as in @minamiyojo's and @astav's answers, we ended up gluing ecs-deploy to a template engine to power our CD pipeline with some reusable components, which we also just open-sourced:
https://github.com/GuccioGucci/yoke
Please refer to the Motivation section in the README; I hope this helps your scenario too.
If there are 3 steps being executed in EMR, is it possible to know which step is currently executing using a shell script?
I am able to get current state of the EMR cluster, using this:
`aws emr describe-cluster --cluster-id ${cluster_id} | python -c 'import json,sys; obj=json.load(sys.stdin); print(obj["Cluster"]["Status"]["State"])'`
but I am not able to get the name of the step currently executing. Is that possible?
The list-steps option in the AWS CLI for EMR returns each step. It can be filtered to return only the steps in a certain state: aws emr list-steps --cluster-id ${cluster_id} --step-states "RUNNING"
(See http://docs.aws.amazon.com/cli/latest/reference/emr/list-steps.html)
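Combining that with the same parsing approach as in the question gives the name of the running step (a sketch, assuming at most one step is running at a time):
aws emr list-steps --cluster-id ${cluster_id} --step-states "RUNNING" \
  | python -c 'import json,sys; s=json.load(sys.stdin)["Steps"]; print(s[0]["Name"] if s else "")'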