EMR status not aligned with Spark job processing status - amazon-web-services

I have a python script that launches a Spark step on an EMR cluster.
This is the definition of the step:
Steps=[{
'Name': f"remove_num_{source['target_table'].upper()}",
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': ["spark-submit", \
"--deploy-mode", "cluster", \
"--master", "yarn", \
"--conf", "spark.yarn.submit.waitAppCompletion=false", \
"/home/hadoop/remove_row_num.py", \
source['target_table'].upper(), \
HDFS_PATH, \
IMPORT_BUCKET, \
S3_KEY]
}
}])
The script works fine and the step is correctly created and executed.
The problem is that the Spark job takes several minutes to complete, while the status of the EMR step changes to Completed about 10 seconds after the Spark job is triggered, not when the Spark processing actually finishes.
Is there a way to make the EMR step turn to Completed only once the Spark job has completed?
Thanks,
Luca
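One likely cause, judging from the step definition above: in YARN cluster mode, spark.yarn.submit.waitAppCompletion=false tells spark-submit to exit right after submitting the application, and the EMR step only tracks the spark-submit process. A minimal sketch of the same step with the flag set to true (dropping the flag should be equivalent, since true is the Spark default). The helper name build_spark_step and the sample values are hypothetical; only the step layout comes from the question.

```python
def build_spark_step(table, hdfs_path, import_bucket, s3_key, wait_for_completion=True):
    """Build an EMR step dict for remove_row_num.py (hypothetical helper)."""
    # spark-submit expects the literal strings "true" / "false"
    wait_flag = str(wait_for_completion).lower()
    return {
        "Name": f"remove_num_{table.upper()}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                # With "true", spark-submit stays alive until the YARN application
                # finishes, so the EMR step stays Running for the same duration.
                "--conf", f"spark.yarn.submit.waitAppCompletion={wait_flag}",
                "/home/hadoop/remove_row_num.py",
                table.upper(),
                hdfs_path,
                import_bucket,
                s3_key,
            ],
        },
    }


# Placeholder values for illustration only
step = build_spark_step("my_table", "/tmp/data", "my-import-bucket", "some/key")
```

The resulting dict can be passed in the Steps list exactly as before. If the launching script itself needs to block until the step ends, boto3's EMR client also offers a step_complete waiter.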

Related

AWS StepFunctionsLocal StepFunctions test with definition substitutions

I have been looking into StepFunctionsLocal (SFL) for testing. To get a project bootstrapped, I used the SAM CLI to generate a new project, which comes pre-packed with SFL tests and a Makefile to run everything.
However, it seems broken out of the box. When running the tests using directions in the README, I get this error:
InvalidDefinition: An error occurred (InvalidDefinition) when calling the CreateStateMachine operation: Invalid State Machine Definition: ''SCHEMA_VALIDATION_FAILED:
Value is not a valid resource ARN at /States/Check Stock Value/Resource','SCHEMA_VALIDATION_FAILED: Value is not a valid resource ARN at /States/Sell Stock/Resource', 'SCHEMA_VALIDATION_FAILED: Value is not a valid resource ARN at /States/Buy Stock/Resource', 'SCHEMA_VALIDATION_FAILED: Value is not a valid resource ARN at /States/Record Transaction/Resource''
And, indeed, the state machine definition is provided as a file that uses DefinitionSubstitutions:
{
"Comment": "A state machine that does mock stock trading.",
"StartAt": "Check Stock Value",
"States": {
"Check Stock Value": {
"Type": "Task",
"Resource": "${StockCheckerFunctionArn}", <--
"Retry": [
{
"ErrorEquals": [
"States.TaskFailed"
],
"IntervalSeconds": 15,
"MaxAttempts": 5,
"BackoffRate": 1.5
}
],
"Next": "Buy or Sell?"
},
...
The CloudFormation template injects those values
StockTradingStateMachine:
  Type: AWS::Serverless::StateMachine # More info about State Machine Resource: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-statemachine.html
  Properties:
    DefinitionUri: statemachine/stock_trader.asl.json
    DefinitionSubstitutions:
      StockCheckerFunctionArn: !GetAtt StockCheckerFunction.Arn <--
The Makefile commands to run the tests:
run:
docker run -p 8083:8083 -d \
--mount type=bind,readonly,source=$(ROOT_DIR)/statemachine/test/MockConfigFile.json,destination=/home/StepFunctionsLocal/MockConfigFile.json \
-e SFN_MOCK_CONFIG="/home/StepFunctionsLocal/MockConfigFile.json" \
amazon/aws-stepfunctions-local
create:
aws stepfunctions create-state-machine \
--endpoint-url http://localhost:8083 \
--definition file://statemachine/stock_trader.asl.json \
--name "StockTradingLocalTesting" \
--role-arn "arn:aws:iam::123456789012:role/DummyRole" \
--no-cli-pager \
--debug
happypathsellstocktest:
aws stepfunctions start-execution \
--endpoint http://localhost:8083 \
--name HappyPathSellStockTest \
--state-machine arn:aws:states:us-east-1:123456789012:stateMachine:StockTradingLocalTesting#HappyPathSellStockTest \
--no-cli-pager
It appears that nothing provides the definition substitutions. I've come up dry when combing through the AWS docs for how to provide those substitutions through the API, maybe I just don't know what to look for. Any clues?
I did make an issue to fix the template: https://github.com/aws/aws-sam-cli-app-templates/issues/342
Unfortunately, DefinitionSubstitutions is a feature of the CloudFormation resource and not supported directly in the Step Functions API. You would need to parse and replace the substitution variables in your own code before you call Create State Machine in your test.
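A minimal sketch of that "parse and replace" step, assuming the placeholders always have the ${Name} form used in stock_trader.asl.json. The function name substitute_definition and the ARN value are hypothetical. JSONPath strings such as "$.stock_price" are left alone because they contain no braces.

```python
import json
import re

# Matches ${SomeName}; plain JSONPath values like "$.foo" do not match
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def substitute_definition(definition_text, substitutions):
    """Replace every ${Name} in an ASL definition with its value."""

    def replace(match):
        name = match.group(1)
        if name not in substitutions:
            # Fail loudly rather than send an invalid ARN to CreateStateMachine
            raise KeyError(f"no substitution provided for {name}")
        return substitutions[name]

    return PLACEHOLDER.sub(replace, definition_text)


template = """
{
  "StartAt": "Check Stock Value",
  "States": {
    "Check Stock Value": {
      "Type": "Task",
      "Resource": "${StockCheckerFunctionArn}",
      "ResultPath": "$.stock_price",
      "End": true
    }
  }
}
"""

definition = substitute_definition(
    template,
    # Hypothetical ARN; with StepFunctionsLocal mocks any well-formed ARN should do
    {"StockCheckerFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:StockChecker"},
)
parsed = json.loads(definition)  # confirms the result is still valid JSON
```

The substituted text can then be written to a temporary file and passed to create-state-machine with --definition file://..., or handed directly to the definition parameter of the API call.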

Gitlab: how to pause job and resume based on codebuild input?

I'm looking for a way to pause a GitLab job and resume it based on AWS Lambda input.
Due to restrictive permissions in my organization, my current CI workflow is as follows.
On a push event, a Lambda triggers the GitLab job through a webhook. The GitLab job only gets the latest code, zips the files and copies the archive to a certain S3 bucket. After the GitLab job has finished, the same Lambda triggers a CodeBuild build, which gets the latest zip file from the S3 bucket and creates the UI chunk files; the artifacts are pushed to a different S3 bucket.
### gitlab-ci.yml ###
variables:
  environment: <aws-account-number>
stages:
  - get-latest-code
get-latest-code:
  stage: get-latest-code
  script:
    - zip -r project.zip $(pwd)/*
    - export PATH=$PATH:/tmp/project/.local/bin
    - pip install awscli
    - aws s3 cp $(pwd)/project.zip s3://project-input-bucket-dev
  rules:
    - if: ('$CI_PIPELINE_SOURCE == "merge_request_event"' || '$CI_PIPELINE_SOURCE == "push"')
### lambda code ###
import logging
import os

import boto3

logger = logging.getLogger(__name__)

def runner_lambda_handler(event, context):
    cb = boto3.client('codebuild')
    builds_dir = os.environ.get('BUILDS_DIR', '/tmp/project/builds')
    logger.debug("STARTING GIT LAB RUNNER")
    # `token` and `s3_libraries_loader` are not shown in this snippet
    gitlab_runner_cmd = f"gitlab-runner --debug run-single -u https://git.company.com/ -t {token} " \
                        f"--builds-dir {builds_dir} --max-builds 1 " \
                        f"--wait-timeout 900 --executor shell"
    # Run the GitLab job first, then start the CodeBuild build
    s3_libraries_loader.subprocess_cmd(gitlab_runner_cmd)
    cb.start_build(projectName='PROJECT-Deploy-dev')
    return {
        "statusCode": 200,
        "body": "Gitlab build success."
    }
#### Codebuild Stack ####
codebuild.Project(self, f"Project-{env_name}",
project_name=f"Project-{env_name}",
role=codebuild_role,
environment_variables={
"INPUT_S3_ARTIFACTS_BUCKET": {
"value": input_bucket.bucket_name
},
"INPUT_S3_ARTIFACTS_OBJECT": {
"value": "project.zip"
},
"OUTPUT_S3_ARTIFACTS_BUCKET": {
"value": output_bucket.bucket_name
},
"PROJECT_NAME": {
"value": f"Project-{env_name}"
}
},
cache=codebuild.Cache.bucket(input_bucket),
environment=codebuild.BuildEnvironment(
build_image=codebuild.LinuxBuildImage.STANDARD_5_0
),
vpc=vpc,
security_groups=[codebuild_sg],
artifacts=codebuild.Artifacts.s3(
bucket=output_bucket,
include_build_id=False,
package_zip=False,
encryption=False
),
build_spec=codebuild.BuildSpec.from_object({
"version": "0.2",
"cache": {
"paths": ['/root/.m2/**/*', '/root/.npm/**/*', 'build/**/*', '*/project/node_modules/**/*']
},
"phases": {
"install": {
"runtime-versions": {
"nodejs": "14.x"
},
"commands": [
"aws s3 cp --quiet s3://$INPUT_S3_ARTIFACTS_BUCKET/$INPUT_S3_ARTIFACTS_OBJECT .",
"unzip $INPUT_S3_ARTIFACTS_OBJECT",
"cd project",
"export SASS_BINARY_DIR=$(pwd)",
"npm cache verify",
"npm install",
]
},
"build": {
"commands": [
"npm run build"
]
},
"post_build": {
"commands": [
"echo Clearing s3 bucket folder",
"aws s3 rm --recursive s3://$OUTPUT_S3_ARTIFACTS_BUCKET/$PROJECT_NAME"
]
}
},
"artifacts": {
"files": [
"**/*.html",
"**/*.js",
"**/*.css",
"**/*.ico",
"**/*.woff",
"**/*.woff2",
"**/*.svg",
"**/*.png"
],
"discard-paths": "yes",
"base-directory": "$(pwd)/dist/proj"
}
})
)
What's needed:
Currently there is a disconnect between the GitLab job and the CodeBuild job. I'm looking for a way to pause the GitLab job after all its steps have executed. Later, when the CodeBuild job completes successfully, I would like to resume the same GitLab job and mark it as done.
Thanks in advance.
You need to change the webhook update part on the GitLab side; from there you can use curl to check the status of the other parts.
If their status is not finished, hold the push event until they are done.
You'll need to implement your own logic to wait for the CodeBuild job to finish; you can use batch-get-builds to check the status.
Check out this: you can have something similar in your GitLab job, waiting for the CodeBuild job to finish.
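A rough sketch of such a wait loop in Python, assuming a boto3 CodeBuild client (or any object exposing batch_get_builds) and a known build id. The function name wait_for_build, the poll interval and the timeout are illustrative choices, not part of the original setup.

```python
import time

# Final values of buildStatus; anything else (IN_PROGRESS) means "keep waiting"
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "FAULT", "TIMED_OUT", "STOPPED"}


def wait_for_build(client, build_id, poll_seconds=15, timeout_seconds=1800, sleep=time.sleep):
    """Poll batch_get_builds until the build reaches a terminal status."""
    waited = 0
    while True:
        response = client.batch_get_builds(ids=[build_id])
        status = response["builds"][0]["buildStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"build {build_id} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In the GitLab job this could be called with boto3.client("codebuild") and the id returned by start_build, failing the job when the result is anything other than SUCCEEDED. The sleep parameter is only there so the loop can be exercised without real waiting.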
You can create a manual job in your pipeline to serve as a pause. CodeBuild can call the GitLab API to run the manual job, allowing the pipeline to continue. That would probably be the most resource-efficient way to handle this scenario.
However you won’t be able to resume the same job with this method.
There is no mechanism to ‘pause’ a job, but you can implement a polling mechanism as suggested in another answer if you really need to keep things in the same job for some reason. However, you will end up consuming more minutes (if using GitLab.com shared runners) or system resources than needed. You may also want to consider your job timeouts if the process takes a long time.
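For the manual-job approach, the call that CodeBuild would make is the "play a job" endpoint of the GitLab REST API (POST /projects/:id/jobs/:job_id/play). A minimal sketch using only the standard library; the helper name, project id, job id and token below are placeholders, and how CodeBuild learns the job id is left open.

```python
import urllib.request


def build_play_job_request(gitlab_url, project_id, job_id, token):
    """Prepare the POST request that starts a manual GitLab job."""
    url = f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/jobs/{job_id}/play"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"PRIVATE-TOKEN": token},  # access token with API scope
    )


# Placeholder values for illustration only
request = build_play_job_request("https://git.company.com/", 42, 1001, "<token>")
# Sending it would be: urllib.request.urlopen(request)
```

The request is only built here, not sent, so the snippet can be tried without network access.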

How Can I Effortlessly Format An AWS CLI Command

I'm doing a lot of work with AWS EMR and when you build an EMR cluster through the AWS Management Console you can click a button to export the AWS CLI Command that creates the EMR cluster.
It then gives you a big CLI command that isn't formatted in any way i.e., if you copy and paste the command it's all on a single line.
I'm using these EMR CLI commands, that were created by other individuals, to create the EMR clusters in Python using the AWS SDK Boto3 library i.e., I'm looking at the CLI command to get all the configuration details. Some of the configuration details are present on the AWS Management Console UI but not all of them so it's easier for me to use the CLI command that you can export.
However, the AWS CLI command is very hard to read since it's not formatted. Is there an AWS CLI command formatter available online similar to JSON formatters?
Another solution I could use is to Clone the EMR Cluster and go through the EMR Cluster creation screen on the AWS Management Console to get all the configuration details but I'm still curious if I could format the CLI Command and do it that way. Another added benefit of being able to format the exported CLI command is that I could put it on a Confluence page for documentation.
Here is some quick python code to do it:
import shlex
import json
import re

def format_command(command):
    tokens = shlex.split(command)
    formatted = ''
    for token in tokens:
        # Flags get a new line
        if token.startswith("--"):
            formatted += '\\\n '
        # JSON data
        if token[0] in ('[', '{'):
            json_data = json.loads(token)
            data = json.dumps(json_data, indent=4).replace('\n', '\n ')
            formatted += "'{}' ".format(data)
        # Quote token when it contains whitespace anywhere (re.match would only check the first character)
        elif re.search(r'\s', token):
            formatted += "'{}' ".format(token)
        # Simple print for remaining tokens
        else:
            formatted += token + ' '
    return formatted
example = """aws emr create-cluster --applications Name=spark Name=ganglia Name=hadoop --tags 'Project=MyProj' --ec2-attributes '{"KeyName":"emr-key","AdditionalSlaveSecurityGroups":["sg-3822994c","sg-ccc76987"],"InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"sg-60832c2b","SubnetId":"subnet-3c76ee33","EmrManagedSlaveSecurityGroup":"sg-dd832c96","EmrManagedMasterSecurityGroup":"sg-b4923dff","AdditionalMasterSecurityGroups":["sg-3822994c","sg-ccc76987"]}' --service-role EMR_DefaultRole --release-label emr-5.14.0 --name 'Test Cluster' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"m4.xlarge","Name":"Master"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m4.xlarge","Name":"CORE"}]' --configurations '[{"Classification":"spark-defaults","Properties":{"spark.sql.avro.compression.codec":"snappy","spark.eventLog.enabled":"true","spark.dynamicAllocation.enabled":"false"},"Configurations":[]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"SPARK_DAEMON_MEMORY":"4g"},"Configurations":[]}]}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1"""
print(format_command(example))
Output looks like this:
aws emr create-cluster \
--applications Name=spark Name=ganglia Name=hadoop \
--tags Project=MyProj \
--ec2-attributes '{
"ServiceAccessSecurityGroup": "sg-60832c2b",
"InstanceProfile": "EMR_EC2_DefaultRole",
"EmrManagedMasterSecurityGroup": "sg-b4923dff",
"KeyName": "emr-key",
"SubnetId": "subnet-3c76ee33",
"AdditionalMasterSecurityGroups": [
"sg-3822994c",
"sg-ccc76987"
],
"AdditionalSlaveSecurityGroups": [
"sg-3822994c",
"sg-ccc76987"
],
"EmrManagedSlaveSecurityGroup": "sg-dd832c96"
}' \
--service-role EMR_DefaultRole \
--release-label emr-5.14.0 \
--name 'Test Cluster' \
--instance-groups '[
{
"EbsConfiguration": {
"EbsBlockDeviceConfigs": [
{
"VolumeSpecification": {
"VolumeType": "gp2",
"SizeInGB": 32
},
"VolumesPerInstance": 1
}
]
},
"InstanceCount": 1,
"Name": "Master",
"InstanceType": "m4.xlarge",
"InstanceGroupType": "MASTER"
},
{
"EbsConfiguration": {
"EbsBlockDeviceConfigs": [
{
"VolumeSpecification": {
"VolumeType": "gp2",
"SizeInGB": 32
},
"VolumesPerInstance": 1
}
]
},
"InstanceCount": 1,
"Name": "CORE",
"InstanceType": "m4.xlarge",
"InstanceGroupType": "CORE"
}
]' \
--configurations '[
{
"Properties": {
"spark.eventLog.enabled": "true",
"spark.dynamicAllocation.enabled": "false",
"spark.sql.avro.compression.codec": "snappy"
},
"Classification": "spark-defaults",
"Configurations": []
},
{
"Properties": {},
"Classification": "spark-env",
"Configurations": [
{
"Properties": {
"SPARK_DAEMON_MEMORY": "4g"
},
"Classification": "export",
"Configurations": []
}
]
}
]' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
--region us-east-1

Submitting spark job to AWS EMR cluster from aws-cli

I’m trying to figure out how to add a spark step properly to my aws-emr cluster from the command line aws-cli.
Some background:
I have a large dataset (thousands of .csv files) that I need to read in and analyze. I have a python script that looks something like:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3
#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("s3n://data_input/*csv")
def analysis(df):
    # do a bunch of stuff, create the output dataframe
    return df_output
df_output = analysis(df)
df_output.save_as_csv_to_s3_somehow
I want the output csv file to go to the directory s3://dataoutput/
Do I need to add the py file to a jar or something? What command do I use to run this analysis utilizing my cluster nodes, and how do I get the output to the correct directory? Thanks.
I launch the cluster using:
aws emr create-cluster --release-label emr-5.5.0 \
--name PySpark_Analysis \
--applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Ganglia Name=Presto Name=Zeppelin \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge \
--region us-west-2 \
--log-uri s3://emr-logs-zerex/ \
--configurations file://./zeppelin-env-config.json \
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-code/bootstraps/install_python_packages_custom.bash"
I usually use the --steps parameter of the aws emr create-cluster which can be specified like --steps file://mysteps.json. The file has the following look to it:
[
{
"Type": "Spark",
"Name": "KB Spark Program",
"ActionOnFailure": "TERMINATE_JOB_FLOW",
"Args": [
"--verbose",
"--packages",
"org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1,com.amazonaws:aws-java-sdk-s3:1.11.27,org.apache.hadoop:hadoop-aws:2.7.2,com.databricks:spark-csv_2.11:1.5.0",
"/tmp/analysis_script.py"
]
},
{
"Type": "Spark",
"Name": "KB Spark Program",
"ActionOnFailure": "TERMINATE_JOB_FLOW",
"Args": [
"--verbose",
"--packages",
"org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1,com.amazonaws:aws-java-sdk-s3:1.11.27,org.apache.hadoop:hadoop-aws:2.7.2,com.databricks:spark-csv_2.11:1.5.0",
"/tmp/analysis_script_1.py"
]
}
]
You can read more about steps here. I use the bootstrap script to load my code from S3 into /tmp and then specify the steps of execution in the file.
As for writing to s3 here is a link that explains that.
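If there are many scripts, the steps file can also be generated rather than written by hand. A small sketch, assuming the same step layout as the JSON above; the helper name spark_step is hypothetical and the package list is shortened for readability.

```python
import json


def spark_step(name, script_path, packages=(), action_on_failure="TERMINATE_JOB_FLOW"):
    """Build one entry of the --steps JSON for a PySpark script."""
    args = ["--verbose"]
    if packages:
        # spark-submit takes --packages as one comma-separated value
        args += ["--packages", ",".join(packages)]
    # The script path goes last, after all spark-submit options
    args.append(script_path)
    return {
        "Type": "Spark",
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "Args": args,
    }


packages = ["org.apache.hadoop:hadoop-aws:2.7.2", "com.databricks:spark-csv_2.11:1.5.0"]
steps = [
    spark_step("KB Spark Program", "/tmp/analysis_script.py", packages),
    spark_step("KB Spark Program", "/tmp/analysis_script_1.py", packages),
]
steps_json = json.dumps(steps, indent=2)
```

Writing steps_json to mysteps.json gives a file usable with --steps file://mysteps.json. For the output side, on the Spark 2.x releases that ship with emr-5.x, something along the lines of df_output.write.csv("s3://dataoutput/") is the usual way to write a DataFrame to S3 as CSV.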

Setting Environment Variables per step in AWS EMR

I am unable to set environment variables for my Spark application. I am using AWS EMR to run a Spark application, which is really a framework I wrote in Python on top of Spark that runs multiple Spark jobs depending on which environment variables are present. To start the right job, I need to pass the environment variable through spark-submit. I tried several methods, but none of them works: when I print the value of the environment variable inside the application, it comes back empty.
To run the cluster on EMR, I am using the following AWS CLI command:
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark --ec2-attributes '{"KeyName":"<Key>","InstanceProfile":"<Profile>","SubnetId":"<Subnet-Id>","EmrManagedSlaveSecurityGroup":"<Group-Id>","EmrManagedMasterSecurityGroup":"<Group-Id>"}' --release-label emr-5.13.0 --log-uri 's3n://<bucket>/elasticmapreduce/' --bootstrap-action 'Path="s3://<bucket>/bootstrap.sh"' --steps file://./.envs/steps.json --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"c4.xlarge","Name":"Master"}]' --configurations file://./.envs/Production.json --ebs-root-volume-size 64 --service-role EMRRole --enable-debugging --name 'Application' --auto-terminate --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region <region>
Now Production.json looks like this:
[
{
"Classification": "yarn-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"FOO": "bar"
}
}
]
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.executor.memory": "2800m",
"spark.driver.memory": "900m"
}
}
]
And steps.json like this :
[
{
"Name": "Job",
"Args": [
"--deploy-mode","cluster",
"--master","yarn","--py-files",
"s3://<bucket>/code/dependencies.zip",
"s3://<bucket>/code/__init__.py",
"--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
"--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
"--conf", "spark.executorEnv.SHAPE=SQUARE"
],
"ActionOnFailure": "CONTINUE",
"Type": "Spark"
}
]
When I try to access the environment variable inside my __init__.py code, it simply prints empty. As you can see, I am running the step using Spark on YARN in cluster mode. I went through these links to get this far:
How do I set an environment variable in a YARN Spark job?
https://spark.apache.org/docs/latest/configuration.html#environment-variables
https://spark.apache.org/docs/latest/configuration.html#runtime-environment
Thanks for any help.
Use classification yarn-env to pass environment variables to the worker nodes.
Use classification spark-env to pass environment variables to the driver, with deploy mode client. When using deploy mode cluster, use yarn-env.
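One more thing that may be worth checking in the steps.json from the question: spark-submit passes everything after the application file to the application itself, so --conf entries listed after __init__.py would not be read as Spark configuration. A minimal sketch that builds the Args list with the options first; the helper name is hypothetical, and the bucket placeholder is kept as in the question.

```python
def build_spark_step_args(app_path, py_files=None, app_master_env=None,
                          executor_env=None, app_args=()):
    """Build step Args with every spark-submit option before the application file."""
    args = ["--deploy-mode", "cluster", "--master", "yarn"]
    if py_files:
        args += ["--py-files", py_files]
    # Environment for the driver, which runs inside the YARN application master in cluster mode
    for name, value in (app_master_env or {}).items():
        args += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
    # Environment for the executors
    for name, value in (executor_env or {}).items():
        args += ["--conf", f"spark.executorEnv.{name}={value}"]
    # The application file comes after all options; anything after it is an application argument
    args.append(app_path)
    args.extend(app_args)
    return args


args = build_spark_step_args(
    "s3://<bucket>/code/__init__.py",
    py_files="s3://<bucket>/code/dependencies.zip",
    app_master_env={"SHAPE": "RECTANGLE"},
    executor_env={"SHAPE": "SQUARE"},
)
```

The returned list can be pasted into the "Args" field of the step. Whether this alone resolves the empty variable on a given EMR release is not something the sketch can confirm.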
To work with EMR clusters, I use AWS Lambda: a project that builds an EMR cluster when a flag is set.
Inside this project, we define variables that can be set in the Lambda and then replace each placeholder with its value. The values are fetched through the AWS API; the method to use is AWSSimpleSystemsManagement.getParameters.
Then build a map, for example val parametersValues = parameterResult.getParameters.asScala.map(k => (k.getName, k.getValue)), to get tuples of each parameter's name and value.
E.g.: ${BUCKET} = "s3://bucket-name"
This means you only have to write ${BUCKET} in your JSON instead of the full path.
Once the path has been replaced with the placeholder, the step JSON can look like this:
[
{
"Name": "Job",
"Args": [
"--deploy-mode","cluster",
"--master","yarn","--py-files",
"${BUCKET}/code/dependencies.zip",
"${BUCKET}/code/__init__.py",
"--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
"--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
"--conf", "spark.executorEnv.SHAPE=SQUARE"
],
"ActionOnFailure": "CONTINUE",
"Type": "Spark"
}
]
I hope this can help you to solve your problem.