Gitlab: how to pause job and resume based on codebuild input?

I'm looking for a way to pause a GitLab job and resume it based on input from AWS Lambda.
Due to restrictive permissions in my organization, my current CI workflow is as follows.
On a push event, a Lambda triggers the GitLab job through a webhook. The GitLab job only gets the latest code, zips the files, and copies the archive to an S3 bucket. After the GitLab job has finished, the same Lambda triggers a CodeBuild build, which gets the latest zip file from that S3 bucket and creates the UI chunk files; the artifacts are pushed to a different S3 bucket.
### gitlab-ci.yml ###
environment: <aws-account-number>
- get-latest-code
stage: get-latest-code
- zip -r $(pwd)/*
- export PATH=$PATH:/tmp/project/.local/bin
- pip install awscli
- aws s3 cp $(pwd)/ s3://project-input-bucket-dev
- if: ('$CI_PIPELINE_SOURCE == "merge_request_event"' || '$CI_PIPELINE_SOURCE == "push"')
### lambda code ###
import os

import boto3

def runner_lambda_handler(event, context):
    cb = boto3.client('codebuild')
    builds_dir = os.environ.get('BUILDS_DIR', '/tmp/project/builds')
    gitlab_runner_cmd = f"gitlab-runner --debug run-single -u -t {token} " \
                        f"--builds-dir {builds_dir} --max-builds 1 " \
                        f"--wait-timeout 900 --executor shell"
    return {
        "statusCode": 200,
        "body": "Gitlab build success."
    }
#### Codebuild Stack ####
codebuild.Project(self, f"Project-{env_name}",
"value": input_bucket.bucket_name
"value": ""
"value": output_bucket.bucket_name
"value": f"Project-{env_name}"
"version": "0.2",
"cache": {
"paths": ['/root/.m2/**/*', '/root/.npm/**/*', 'build/**/*', '*/project/node_modules/**/*']
"phases": {
"install": {
"runtime-versions": {
"nodejs": "14.x"
"commands": [
"cd project",
"export SASS_BINARY_DIR=$(pwd)",
"npm cache verify",
"npm install",
"build": {
"commands": [
"npm run build"
"post_build": {
"commands": [
"echo Clearing s3 bucket folder",
"aws s3 rm --recursive s3://$OUTPUT_S3_ARTIFACTS_BUCKET/$PROJECT_NAME"
"artifacts": {
"files": [
"discard-paths": "yes",
"base-directory": "$(pwd)/dist/proj"
What's needed:
Currently there is a disconnect between the GitLab job and the CodeBuild job. I'm looking for a way to pause the GitLab job after all of its steps have executed. Later, when the CodeBuild job completes successfully, I want to resume the same GitLab job and mark it as done.
thanks in advance

You need to change the part of your setup that handles GitLab's webhook update; from there you can use curl to check the status of the other parts.
If they have not finished yet, hold the push event until they have.

You'll need to implement your own logic to wait for the CodeBuild job to finish; you can use batch-get-builds to check the status.
You can have something similar in your GitLab job that waits for the CodeBuild job to finish.
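As a rough sketch of such a polling loop, assuming the GitLab job has AWS credentials and already knows the CodeBuild build ID (for example, passed in by the Lambda), the job could call batch_get_builds until the build leaves IN_PROGRESS. The helper name wait_for_build, the poll interval, and the timeout are illustrative choices, not part of the original setup.

```python
import time


def wait_for_build(client, build_id, poll_seconds=15, timeout_seconds=900, sleep=time.sleep):
    """Poll CodeBuild until the build finishes and return its final status.

    `client` is expected to behave like boto3.client("codebuild").
    """
    waited = 0
    while True:
        response = client.batch_get_builds(ids=[build_id])
        builds = response.get("builds", [])
        if not builds:
            raise RuntimeError(f"Build not found: {build_id}")
        status = builds[0]["buildStatus"]
        if status != "IN_PROGRESS":
            # One of SUCCEEDED, FAILED, FAULT, STOPPED or TIMED_OUT
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Build {build_id} still running after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In the job script this would be called as wait_for_build(boto3.client("codebuild"), build_id), failing the job when the result is anything other than SUCCEEDED. Keep the timeout below the GitLab job timeout so the job reports a clear error instead of being killed.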

You can create a manual job in your pipeline to serve as a pause. CodeBuild can call the GitLab API to run the manual job, allowing the pipeline to continue. That would probably be the most resource-efficient way to handle this scenario.
However you won’t be able to resume the same job with this method.
There is no mechanism to ‘pause’ a job, but you can implement a polling mechanism as suggested in another answer if you really need to keep things in the same job for some reason. However, you will end up consuming more minutes (if using shared runners) or system resources than needed. You may also want to consider your job timeouts if the process takes a long time.
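For the manual-job approach, a minimal sketch of the call CodeBuild (or the Lambda) could make is shown here. It uses the GitLab Jobs API endpoint POST /projects/:id/jobs/:job_id/play; the GitLab URL, project ID, job ID, and token are placeholders you would supply, and the function names are hypothetical.

```python
import urllib.request


def build_play_request(gitlab_url, project_id, job_id, token):
    """Build the POST request that starts a manual job through the GitLab Jobs API."""
    url = f"{gitlab_url.rstrip('/')}/api/v4/projects/{project_id}/jobs/{job_id}/play"
    return urllib.request.Request(url, method="POST", headers={"PRIVATE-TOKEN": token})


def play_manual_job(gitlab_url, project_id, job_id, token):
    """Send the request and return the HTTP status code GitLab answers with."""
    request = build_play_request(gitlab_url, project_id, job_id, token)
    with urllib.request.urlopen(request) as response:
        return response.status
```

The token needs API access to the project, and the ID of the manual job has to be known to the caller, for example by listing the jobs of the pipeline first.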


Packaging PySpark with PEX environment on dataproc

I'm trying to package a pyspark job with PEX to be run on google cloud dataproc, but I'm getting a Permission Denied error.
I've packaged my third-party and local dependencies into env.pex, and written a separate entrypoint script that uses those dependencies. I then gsutil cp those two files up to gs://<PATH> and run the script below.
from google.cloud import dataproc_v1 as dataproc
from google.cloud import storage
def submit_job(project_id: str, region: str, cluster_name: str):
job_client = dataproc.JobControllerClient(
client_options={"api_endpoint": f"{region}"}
operation = job_client.submit_job_as_operation(
"project_id": project_id,
"region": region,
"job": {
"placement": {"cluster_name": cluster_name},
"pyspark_job": {
"main_python_file_uri": "gs://<PATH>/",
"file_uris": ["gs://<PATH>/env.pex"],
"properties": {
"spark.pyspark.python": "./env.pex",
"spark.executorEnv.PEX_ROOT": "./.pex",
The error I get is
Exception in thread "main" Cannot run program "./env.pex": error=13, Permission denied
at java.lang.ProcessBuilder.start(
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(
at java.lang.ProcessImpl.start(
at java.lang.ProcessBuilder.start(
... 14 more
Should I expect packaging my environment like this to work? I don't see a way to change the permission of files included as file_uris in the pyspark job config, and I don't see any documentation on google cloud about packaging with PEX, but PySpark official docs include this guide.
Any help is appreciated - thanks!
You can always run a PEX file using a compatible interpreter. So instead of specifying a program of ./env.pex you could try python env.pex. That does not require env.pex to be executable.
I wasn't able to run the pex directly in the end, but did get a workaround working for now, which was suggested by a user in the pants slack community (thanks!)...
The workaround is to unpack the pex as a venv in a cluster initialization script.
The initialization script gsutil copied to gs://<PATH TO INIT SCRIPT>:
#!/bin/bash
set -exo pipefail

readonly PEX_ENV_FILE_URI=$(/usr/share/google/get_metadata_value attributes/PEX_ENV_FILE_URI || true)
readonly PEX_FILES_DIR="/pexfiles"
readonly PEX_ENV_DIR="/pexenvs"

function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $*" >&2
  exit 1
}

function install_pex_into_venv() {
  # Name the venv after the last path component of the PEX URI
  local -r pex_name=${PEX_ENV_FILE_URI##*/}
  local -r pex_file="${PEX_FILES_DIR}/${pex_name}"
  local -r pex_venv="${PEX_ENV_DIR}/${pex_name}"
  echo "Installing pex from ${pex_file} into venv ${pex_venv}..."
  gsutil cp "${PEX_ENV_FILE_URI}" "${pex_file}"
  PEX_TOOLS=1 python "${pex_file}" venv --compile "${pex_venv}"
}

function main() {
  if [[ -z "${PEX_ENV_FILE_URI}" ]]; then
    err "ERROR: Must specify PEX_ENV_FILE_URI metadata key"
  fi
  install_pex_into_venv
}

main
To start the cluster and run the initialization script to unpack the pex into a venv on the cluster:
from google.cloud import dataproc_v1 as dataproc
def start_cluster(project_id: str, region: str, cluster_name: str):
cluster_client = dataproc.ClusterControllerClient(...)
operation = cluster_client.create_cluster(
"project_id": project_id,
"region": region,
"cluster": {
"project_id": project_id,
"cluster_name": cluster_name,
"config": {
"master_config": <CONFIG>,
"worker_config": <CONFIG>,
"initialization_actions": [
"executable_file": "gs://<PATH TO INIT SCRIPT>",
"gce_cluster_config": {
"metadata": {"PEX_ENV_FILE_URI": "gs://<PATH>/env.pex"},
To start the job and use the unpacked pex venv to run the pyspark job:
def submit_job(project_id: str, region: str, cluster_name: str):
job_client = dataproc.JobControllerClient(...)
operation = job_client.submit_job_as_operation(
"project_id": project_id,
"region": region,
"job": {
"placement": {"cluster_name": cluster_name},
"pyspark_job": {
"main_python_file_uri": "gs://<PATH>/",
"properties": {
"spark.pyspark.python": "/pexenvs/env.pex/bin/python",
Following @megabits' answer, here is the bash-based workflow that works for me:
Copy the init script (from the answer) to GCS as gs://BUCKET/pkg/cluster-env-init.bash.
Build the PEX with the --include-tools argument, which the initialization script requires, e.g.
    pex --include-tools -r requirements.txt -o env.pex
Put the PEX file on GCS:
    gsutil mv env.pex "gs://BUCKET/pkg/env.pex"
Create the cluster, using the PEX file to set up the environment:
    gcloud dataproc clusters create your-cluster --region us-central1 \
        --initialization-actions="gs://BUCKET/pkg/cluster-env-init.bash" \
        --metadata "PEX_ENV_FILE_URI=gs://BUCKET/pkg/env.pex"
Run the job:
    gcloud dataproc jobs submit pyspark \
        --cluster=your-cluster --region us-central1 \
        --properties spark.pyspark.python="/pexenvs/env.pex/bin/python"
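The interpreter path passed as spark.pyspark.python follows from how the init script names the venv: it takes the last path component of PEX_ENV_FILE_URI and unpacks into /pexenvs/<that name>. A small sketch of deriving the path in Python rather than hard-coding it is given here; the helper names and the main.py URI in the usage note are hypothetical.

```python
import posixpath


def venv_python_for(pex_uri, env_dir="/pexenvs"):
    """Return the interpreter path the init script creates for a given PEX URI.

    Mirrors the script's ${PEX_ENV_FILE_URI##*/} expansion: the venv is named
    after the last path component of the URI.
    """
    pex_name = pex_uri.rsplit("/", 1)[-1]
    if not pex_name:
        raise ValueError(f"URI has no file name: {pex_uri!r}")
    return posixpath.join(env_dir, pex_name, "bin", "python")


def pyspark_job(main_uri, pex_uri, cluster_name):
    """Assemble the job dict that is passed to submit_job_as_operation."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_uri,
            "properties": {"spark.pyspark.python": venv_python_for(pex_uri)},
        },
    }
```

For example, pyspark_job("gs://BUCKET/pkg/main.py", "gs://BUCKET/pkg/env.pex", "your-cluster") yields a job whose interpreter is /pexenvs/env.pex/bin/python, matching the commands above.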

When I add this BucketDeployment to my CDK CodePipeline, cdk synth never finishes

I'm trying to use CDK and CodePipeline to build and deploy a React application to S3. After the CodePipeline phase, in my own stack, I defined the S3 bucket like this:
const bucket = new Bucket(this, "Bucket", {
websiteIndexDocument: "index.html",
websiteErrorDocument: "error.html",
which worked. And then I defined the deployment of my built React app like this:
new BucketDeployment(this, "WebsiteDeployment", {
sources: [Source.asset("./")],
destinationBucket: bucket
which doesn't seem to work. Is that use of BucketDeployment correct?
Something odd happens when I add the BucketDeployment lines: cdk synth and cdk deploy never finish, and they seem to generate an infinitely recursive tree in cdk.out, so something definitely seems wrong there.
And if I change to Source.asset("./build") I get the error:
> cdk synth
throw new Error(`Cannot find asset at ${this.sourcePath}`);
Error: Cannot find asset at C:\Users\pupeno\Code\ww3fe\build
at new AssetStaging (C:\Users\pupeno\Code\ww3fe\node_modules\aws-cdk-lib\core\lib\asset-staging.ts:109:13)
at new Asset (C:\Users\pupeno\Code\ww3fe\node_modules\aws-cdk-lib\aws-s3-assets\lib\asset.ts:72:21)
at Object.bind (C:\Users\pupeno\Code\ww3fe\node_modules\aws-cdk-lib\aws-s3-deployment\lib\source.ts:55:23)
at C:\Users\pupeno\Code\ww3fe\node_modules\aws-cdk-lib\aws-s3-deployment\lib\bucket-deployment.ts:170:83
at (<anonymous>)
at new BucketDeployment (C:\Users\pupeno\Code\ww3fe\node_modules\aws-cdk-lib\aws-s3-deployment\lib\bucket-deployment.ts:170:51)
at new MainStack (C:\Users\pupeno\Code\ww3fe\infra\pipeline-stack.ts:16:9)
at new DeployStage (C:\Users\pupeno\Code\ww3fe\infra\pipeline-stack.ts:28:26)
at new PipelineStack (C:\Users\pupeno\Code\ww3fe\infra\pipeline-stack.ts:56:24)
at Object.<anonymous> (C:\Users\pupeno\Code\ww3fe\infra\pipeline.ts:6:1)
Subprocess exited with error 1
which would indicate this is very wrong. Why is it searching for the build directory on my machine? It's supposed to search for it in CodePipeline, after building.
My whole pipeline is:
import {Construct} from "constructs"
import {CodeBuildStep, CodePipeline, CodePipelineSource} from "aws-cdk-lib/pipelines"
import {Stage, CfnOutput, StageProps, Stack, StackProps} from "aws-cdk-lib"
import {Bucket} from "aws-cdk-lib/aws-s3"
import {BucketDeployment, Source} from "aws-cdk-lib/aws-s3-deployment"
export class MainStack extends Stack {
constructor(scope: Construct, id: string, props?: StageProps) {
super(scope, id, props)
const bucket = new Bucket(this, "Bucket", {
websiteIndexDocument: "index.html",
websiteErrorDocument: "error.html",
new BucketDeployment(this, "WebsiteDeployment", {
sources: [Source.asset("./")],
destinationBucket: bucket
new CfnOutput(this, "BucketOutput", {value: bucket.bucketArn})
export class DeployStage extends Stage {
public readonly mainStack: MainStack
constructor(scope: Construct, id: string, props?: StageProps) {
super(scope, id, props)
this.mainStack = new MainStack(this, "example")
export class PipelineStack extends Stack {
constructor(scope: Construct, id: string, props?: StackProps) {
super(scope, id, props)
const pipeline = new CodePipeline(this, id, {
pipelineName: id,
synth: new CodeBuildStep("Synth", {
input: CodePipelineSource.connection("user/example", "main", {
connectionArn: "arn:aws:codestar-connections:....",
installCommands: [
"npm install -g aws-cdk"
commands: [
"npm ci",
"npm run build",
"npx cdk synth"
const deploy = new DeployStage(this, "Staging")
const deployStage = pipeline.addStage(deploy)
The actual error I'm experiencing in AWS is this:
[Container] 2022/01/25 20:28:27 Phase complete: BUILD State: SUCCEEDED
[Container] 2022/01/25 20:28:27 Phase context status code: Message:
[Container] 2022/01/25 20:28:27 Entering phase POST_BUILD
[Container] 2022/01/25 20:28:27 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/01/25 20:28:27 Phase context status code: Message:
[Container] 2022/01/25 20:28:27 Expanding base directory path: cdk.out
[Container] 2022/01/25 20:28:27 Assembling file list
[Container] 2022/01/25 20:28:27 Expanding cdk.out
[Container] 2022/01/25 20:28:27 Skipping invalid file path cdk.out
[Container] 2022/01/25 20:28:27 Phase complete: UPLOAD_ARTIFACTS State: FAILED
[Container] 2022/01/25 20:28:27 Phase context status code: CLIENT_ERROR Message: no matching base directory path found for cdk.out
In case it matters, my cdk.json, in the root of my repo, contains:
"app": "npx ts-node --project infra/tsconfig.json --prefer-ts-exts infra/pipeline.ts",
"watch": {
"include": [
"exclude": [
"context": {
"@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true,
"@aws-cdk/core:stackRelativeExports": true,
"@aws-cdk/aws-rds:lowercaseDbIdentifier": true,
"@aws-cdk/aws-lambda:recognizeVersionProps": true,
"@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true
TL;DR With two changes, the pipeline successfully deploys the React app: (1) Source.asset needs the full path to the build directory and (2) the React build commands need to be added to the synth step.
Give Source.asset the full path to the React build dir:
new BucketDeployment(this, "WebsiteDeployment", {
sources: [Source.asset(path.join(__dirname, './build'))], // relative to the Stack dir
destinationBucket: bucket
React build artifacts are typically .gitignored, so CodePipeline needs to build the React app. My version has a separate package.json for the React app, so the build step (1) needs a few more commands:
synth: new pipelines.CodeBuildStep('Synth', {
commands: [
// build react (new)
'cd react-app', // path from project root to React app package.json
'npm ci',
'npm run build',
'cd ..',
// synth cdk (as in OP)
"npm ci",
"npm run build",
"npx cdk synth" // synth must run AFTER the React build step
The React app deploys to an S3 URL:
// MainStack
new cdk.CfnOutput(this, 'WebsiteUrl', {
value: `http://${this.websiteBucket.bucketName}.s3-website-${this.region}`,
(1) pipelines.CodePipeline is an opinionated construct for deploying Stacks. The lower-level codepipeline.Pipeline construct has features many apps will need, such as separating build steps and passing build-time env vars between steps (e.g. injecting the API URL into the client bundle using a REACT_APP_API_URL env var).
For the first question:
And if I change to Source.asset("./build") I get the error: ... Why is it searching for the build directory on my machine?
This is happening when you run cdk synth locally. Remember, cdk synth will always reference the file system where this command is run. Locally it will be your machine, in the pipeline it will be in the container or environment that is being used by AWS CodePipeline.
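The same idea can be illustrated outside CDK. This is a plain-Python analogue, not CDK code: a relative path such as ./build is resolved against wherever the command runs, so anchoring it to the source file (the counterpart of __dirname in TypeScript) and checking that it exists makes the failure explicit. The helper name resolve_asset_dir is made up for this sketch.

```python
from pathlib import Path


def resolve_asset_dir(relative_dir, anchor_file):
    """Resolve an asset directory relative to a source file, not the current directory."""
    asset_dir = (Path(anchor_file).resolve().parent / relative_dir).resolve()
    if not asset_dir.is_dir():
        # Same failure mode as the "Cannot find asset at ..." error in the question
        raise FileNotFoundError(f"Cannot find asset at {asset_dir}")
    return asset_dir


# Typical use inside a stack module: resolve_asset_dir("build", __file__)
```

Whether the check passes depends only on the file system of the machine running the synth, which is why it fails locally when the React app has not been built there.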
Dig a little deeper into BucketDeployment
There are also some interesting things happening here that could be helpful. BucketDeployment doesn't just pull from the source you reference in BucketDeployment.sources and upload it to the bucket you specify in BucketDeployment.destinationBucket. According to the BucketDeployment docs, the assets are uploaded to an intermediary bucket and later merged into your bucket. This matters because it explains the error you received, Error: Cannot find asset at C:\Users\pupeno\Code\ww3fe\build: when you run cdk synth, it expects the directory ./build, as stated in Source.asset("./build"), to exist.
This gets really interesting when trying to use a CodePipeline to build and deploy a single-page app like React in your case. By default, CodePipeline will execute a Source step, followed by a Synth step, then any of the waves or stages you add after. Adding a wave that builds your React app won't work right away, because the output directory of your React build is needed during the Synth step due to how BucketDeployment works. We need the order to be Source -> Build -> Synth -> Deploy. As found in this question, we can control the order of the steps by using inputs and outputs: CodePipeline will order the steps to ensure input/output dependencies are met. So we need to have our Synth step use the Build step's output as its input.
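Ordering by input/output dependencies amounts to a topological sort. The sketch below is only an illustration of that idea using Python's standard library; it is not how CDK Pipelines is implemented, and the step names are just labels.

```python
from graphlib import TopologicalSorter


def order_steps(inputs):
    """Return step names ordered so each step runs after the steps it consumes.

    `inputs` maps a step name to the names of the steps whose output it uses.
    """
    return list(TopologicalSorter(inputs).static_order())


steps = {
    "Source": [],
    "Build": ["Source"],   # Build takes the Source output as input
    "Synth": ["Build"],    # Synth takes the Build output as input
    "Deploy": ["Synth"],
}
print(order_steps(steps))  # ['Source', 'Build', 'Synth', 'Deploy']
```

Declaring Synth's input as the Build output is what forces Build to run first; without that dependency there is nothing that places the React build ahead of the synth.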
Concerns with the currently defined pipeline
I believe that your current pipeline is missing a CodeBuildStep that would bundle your react app and output it to the directory that you specified in BucketDeployment.sources. We also need to set the inputs to order these actions correctly. Below are some updates to the pipeline definition, though some changes may need to be made to have the correct file paths. Also, set BucketDeployment.sources to the dir where your app bundle is written to.
// Note: ShellStep must also be imported from "aws-cdk-lib/pipelines"
export class PipelineStack extends Stack {
    constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props)

        const sourceAction = CodePipelineSource.connection("user/example", "main", {
            connectionArn: "arn:aws:codestar-connections:....",
        })

        const buildAction = new CodeBuildStep("Build", {
            input: sourceAction, // This places Source first
            installCommands: [ "npm ci" ],
            commands: [ "npm run build" ], // Change this to your build command for your react app
            // We need to output the entire contents of our file structure
            // this is both the CDK code needed for synth and the bundled app
            primaryOutputDirectory: "./",
        })

        const synthAction = new ShellStep("Synth", {
            input: buildAction, // This places Synth after Build
            installCommands: [
                "npm install -g aws-cdk"
            ],
            commands: [
                "npm ci",
                "npm run build",
                "npx cdk synth"
            ],
            // Synth step must output to cdk.out
            // if your CDK code is nested, this will have to match that structure
            primaryOutputDirectory: "./cdk.out",
        })

        const pipeline = new CodePipeline(this, id, {
            pipelineName: id,
            synth: synthAction,
        })

        const deploy = new DeployStage(this, "Staging");
        const deployStage = pipeline.addStage(deploy);
    }
}
Check your .gitignore file to see whether some files, such as JSON or HTML files, are being excluded from commits and therefore never pushed to the repo.

Running a shell script in CloudFormation cfn-init

I am trying to run a script in the cfn-init command but it keeps timing out.
What am I doing wrong when running the script?
"WebServerInstance" : {
"Type" : "AWS::EC2::Instance",
"DependsOn" : "AttachGateway",
"Metadata" : {
"Comment" : "Install a simple application",
"AWS::CloudFormation::Init" : {
"config" : {
"files": {
"/home/ec2-user/": {
"content": {
"Fn::Join": [
"aws s3 cp s3://server-assets/startserver.jar . --region=ap-northeast-1\n",
"aws s3 cp s3://server-assets/site-home-sprint2.jar . --region=ap-northeast-1\n",
"java -jar startserver.jar\n",
"java -jar site-home-sprint2.jar --spring.datasource.password=`< password.txt` --spring.datasource.username=`< username.txt` --spring.datasource.url=`<db_url.txt`\n"
"mode": "000755"
"commands": {
"start_server": {
"command": "./",
"cwd": "~",
The file part works fine and it creates the file but it times out at running the command.
What is the correct way of executing a shell script?
You can tail the logs in /var/log/cfn-init.log and detect the issues while running the script.
Commands in CloudFormation Init are run as root by default. The issue may be that your script resides in /home/ec2-user/ while you are trying to run it from '~' (i.e. /root).
Please give the absolute path (/home/ec2-user) in cwd. It will solve your concern.
However, the exact issue can be fetched from the logs only.
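To make the '~' point concrete, here is a tiny sketch of how a leading tilde resolves depending on whose home directory applies. It assumes, as the answer does, that cfn-init runs as root with /root as its home; the function name is hypothetical.

```python
def resolve_cwd(cwd, home):
    """Expand a leading "~" in `cwd` for a user whose home directory is `home`."""
    if cwd == "~" or cwd.startswith("~/"):
        return home + cwd[1:]
    return cwd


# As ec2-user the script location and "~" agree; as root they do not.
print(resolve_cwd("~", "/home/ec2-user"))  # /home/ec2-user
print(resolve_cwd("~", "/root"))           # /root
```

An absolute cwd such as /home/ec2-user resolves the same way for every user, which is why it is the safer choice in the template.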
Init scripts are usually executed by root unless specified otherwise. Can you try giving the full path when running your startup script? You can also give cloudkast a try: it is an online CloudFormation template generator that makes it easier to create objects such as AWS::CloudFormation::Init.

AWS CLI Update_Stack can't pass parameter value containing a /

I've been banging my head all morning trying to create a PowerShell script that will ultimately update an AWS stack. Everything is great right up to the point where I have to pass parameters to the CloudFormation template.
One of the parameter values (ParameterKey=ZipFilePath) contains a /. But the script fails complaining that it was expecting a = but found a /. I've tried escaping the slash but then the API complains that it found the backslash instead of an equals. Where am I going wrong?
... <snip creating a zip file> ...
$filename = ("TotalCommApi-" + $DateTime + ".zip")
aws s3 cp $filename ("s3://S3BucketName/TotalCommApi/" + $filename)
aws cloudformation update-stack --stack-name TotalCommApi-Dev --template-url --parameters ParameterKey=S3BucketName,ParameterValue=S3BucketNameValue,UsePreviousValue=false ParameterKey=ZipFilePath,ParameterValue=("TotalCommApi/" + $filename) ,UsePreviousValue=false
cd C:\Projects\TotalCommApi\TotalComm_API
And here is the pertinent section from the CloudFormation Template:
"Description": "An AWS Serverless Application that uses the ASP.NET Core framework running in Amazon Lambda.",
"Parameters": {
"ZipFilePath": {
"Type": "String",
"Description": "Path to the zip file containing the Lambda Functions code to be published."
"S3BucketName": {
"Type": "String",
"Description": "Name of the S3 bucket where the ZipFile resides."
"AWSTemplateFormatVersion": "2010-09-09",
"Outputs": {},
"Conditions": {},
"Resources": {
"ProxyFunction": {
"Type": "AWS::Lambda::Function",
"Properties": {
"Code": {
"S3Bucket": {"Ref": "S3BucketName" },
"S3Key": { "Ref": "ZipFilePath" }
And this is the error message generated by PowerShell ISE
[image removed]
Update: I am using Windows 7, which comes with PowerShell 2. I upgraded to PowerShell 4. Then my script yielded this error:
On recommendation from a consulting firm, I uninstalled the CLI that I installed via msi, then I upgraded Python to 3.6.2 and then re-installed the CLI via pip. I still get the same error. I "echo"d the command to the screen and this is what I see:
upload: .\ to s3://S3bucketName/TotalCommApi/
Sorry for the delay getting back to you on this - the good news is that I might have a hint about what your issue is.
ParameterKey=ZipFilePath,ParameterValue=("TotalCommApi/" + $filename) ,UsePreviousValue=false
I was driving myself mad trying to reproduce this issue. Why? Because I assumed that the space after ("TotalCommApi/" + $filename) was an artifact from copying, not the actual value that you were using. When I added the space in:
aws cloudformation update-stack --stack-name test --template-url --parameters ParameterKey=S3BucketName,ParameterValue=$bucketname,UsePreviousValue=false ParameterKey=ZipFilePath,ParameterValue=testfolder/$filename ,UsePreviousValue=false
Error parsing parameter '--parameters': Expected: '=', received: ','
This isn't exactly your error message (, instead of /), but I think it's probably a similar issue in your case - check to make sure the values that are being used in your command don't have extra spaces somewhere.
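One way to see the effect of a stray space is to tokenize the argument string the way a shell would. The sketch below uses Python's shlex, which follows POSIX rules rather than PowerShell's, so it is only an approximation; both split on unquoted whitespace, though. The file name app.zip and the helper names are made up for the example.

```python
import shlex


def split_parameters(arg_string):
    """Split a --parameters value into the tokens the CLI would receive."""
    return shlex.split(arg_string)


def find_broken_tokens(tokens):
    """Return tokens that do not start with 'ParameterKey=' (usually caused by a stray space)."""
    return [token for token in tokens if not token.startswith("ParameterKey=")]


good = "ParameterKey=ZipFilePath,ParameterValue=testfolder/app.zip,UsePreviousValue=false"
bad = "ParameterKey=ZipFilePath,ParameterValue=testfolder/app.zip ,UsePreviousValue=false"

print(find_broken_tokens(split_parameters(good)))  # []
print(find_broken_tokens(split_parameters(bad)))   # [',UsePreviousValue=false']
```

The second token starts with a comma, which matches the parser complaining that it expected '=' and received ','.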

Setting Environment Variables per step in AWS EMR

I am unable to set environment variables for my Spark application. I am using AWS EMR to run a Spark application, which is really a framework I wrote in Python on top of Spark that runs multiple Spark jobs depending on which environment variables are present. So in order to start the right job, I need to pass the environment variable into spark-submit. I tried several methods to do this, but none of them works: when I print the value of the environment variable inside the application, it comes back empty.
To run the cluster in the EMR I am using following AWS CLI command
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark --ec2-attributes '{"KeyName":"<Key>","InstanceProfile":"<Profile>","SubnetId":"<Subnet-Id>","EmrManagedSlaveSecurityGroup":"<Group-Id>","EmrManagedMasterSecurityGroup":"<Group-Id>"}' --release-label emr-5.13.0 --log-uri 's3n://<bucket>/elasticmapreduce/' --bootstrap-action 'Path="s3://<bucket>/"' --steps file://./.envs/steps.json --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"c4.xlarge","Name":"Master"}]' --configurations file://./.envs/Production.json --ebs-root-volume-size 64 --service-role EMRRole --enable-debugging --name 'Application' --auto-terminate --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region <region>
Now Production.json looks like this:
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "FOO": "bar" }
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "2800m",
      "spark.driver.memory": "900m"
    }
  }
]
And steps.json like this :
"Name": "Job",
"Args": [
"--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
"--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
"--conf", "spark.executorEnv.SHAPE=SQUARE"
"ActionOnFailure": "CONTINUE",
"Type": "Spark"
When I try to access the environment variable inside my code, it simply prints empty. As you can see, I am running the step using Spark on YARN in cluster mode. I went through these links to reach this position.
How do I set an environment variable in a YARN Spark job?
Thanks for any help.
Use classification yarn-env to pass environment variables to the worker nodes.
Use classification spark-env to pass environment variables to the driver, with deploy mode client. When using deploy mode cluster, use yarn-env.
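As a sketch of what those classifications look like in the --configurations file, the snippet below generates the nested "export" structure for both yarn-env and spark-env. The variable name SHAPE comes from the question; the helper name env_export is hypothetical.

```python
import json


def env_export(classification, variables):
    """Build an EMR configuration object that exports environment variables.

    `classification` is e.g. "yarn-env" or "spark-env"; the variables go into
    the nested "export" classification.
    """
    return {
        "Classification": classification,
        "Properties": {},
        "Configurations": [
            {"Classification": "export", "Properties": dict(variables)}
        ],
    }


configurations = [
    env_export("yarn-env", {"SHAPE": "TRIANGLE"}),
    env_export("spark-env", {"SHAPE": "TRIANGLE"}),
]
print(json.dumps(configurations, indent=2))
```

The printed JSON can be saved and passed to aws emr create-cluster with --configurations file://..., in the same way as Production.json in the question.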
To work with EMR clusters I use AWS Lambda: I have a project that builds an EMR cluster when a flag is set in the condition.
Inside this project, we define variables that you can set in the Lambda, and placeholders in the JSON are then replaced with their values. To do this, we have to use the AWS API. The method you would likely use is AWSSimpleSystemsManagement.getParameters.
Then, make a map like val parametersValues = => (k.getName, k.getValue)) to have a tuple of each parameter's name and value.
E.g.: ${BUCKET} = "s3://bucket-name/"
This means you only have to write ${BUCKET} in your JSON instead of the full path.
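The answer describes this with the AWS SDK for Java/Scala; a comparable sketch in Python with boto3 is shown here instead. It assumes the parameter names are simple identifiers such as BUCKET (names containing slashes would not work as ${...} placeholders with string.Template), and the function names are hypothetical.

```python
from string import Template


def load_parameters(ssm_client, names):
    """Fetch parameters and return them as a {name: value} dict.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    response = ssm_client.get_parameters(Names=list(names))
    return {p["Name"]: p["Value"] for p in response["Parameters"]}


def render_step_json(raw_json, values):
    """Replace ${NAME} placeholders in the step JSON text with their values."""
    return Template(raw_json).substitute(values)
```

With values = load_parameters(boto3.client("ssm"), ["BUCKET"]), calling render_step_json on the raw step JSON returns the text with every ${BUCKET} replaced; substitute raises KeyError if a placeholder has no value, which surfaces missing parameters early.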
Once you have replaced the values, the step JSON can look like this:
"Name": "Job",
"Args": [
"--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
"--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
"--conf", "spark.executorEnv.SHAPE=SQUARE"
"ActionOnFailure": "CONTINUE",
"Type": "Spark"
I hope this can help you to solve your problem.