I am unable to set environment variables for my Spark application. I am using AWS EMR to run a Spark application, which is really a framework I wrote in Python on top of Spark that runs one of several Spark jobs depending on the environment variables present. So in order to start the right job, I need to pass the environment variable into the spark-submit. I tried several methods to do this, but none of them works: when I print the value of the environment variable inside the application, it comes back empty.
To launch the cluster in EMR I am using the following AWS CLI command:
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark --ec2-attributes '{"KeyName":"<Key>","InstanceProfile":"<Profile>","SubnetId":"<Subnet-Id>","EmrManagedSlaveSecurityGroup":"<Group-Id>","EmrManagedMasterSecurityGroup":"<Group-Id>"}' --release-label emr-5.13.0 --log-uri 's3n://<bucket>/elasticmapreduce/' --bootstrap-action 'Path="s3://<bucket>/bootstrap.sh"' --steps file://./.envs/steps.json --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"c4.xlarge","Name":"Master"}]' --configurations file://./.envs/Production.json --ebs-root-volume-size 64 --service-role EMRRole --enable-debugging --name 'Application' --auto-terminate --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region <region>
Now Production.json looks like this:
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "FOO": "bar"
        }
      }
    ]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "2800m",
      "spark.driver.memory": "900m"
    }
  }
]
And steps.json looks like this:
[
  {
    "Name": "Job",
    "Args": [
      "--deploy-mode","cluster",
      "--master","yarn","--py-files",
      "s3://<bucket>/code/dependencies.zip",
      "s3://<bucket>/code/__init__.py",
      "--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
      "--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
      "--conf", "spark.executorEnv.SHAPE=SQUARE"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "Spark"
  }
]
When I try to access the environment variable inside my __init__.py code, it simply prints empty. As you can see, I am submitting the step to Spark on YARN in cluster mode. I went through these links to get this far:
How do I set an environment variable in a YARN Spark job?
https://spark.apache.org/docs/latest/configuration.html#environment-variables
https://spark.apache.org/docs/latest/configuration.html#runtime-environment
Thanks for any help.
Use the yarn-env classification to pass environment variables to the worker nodes.
Use the spark-env classification to pass environment variables to the driver when running with deploy mode client; with deploy mode cluster, use yarn-env.
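Once the variables are exported through the right classification, you can check what actually reaches the driver and the executors. A minimal PySpark sketch of that check, assuming the SHAPE variable from the question (not part of the original setup):

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()

# Driver side: set via spark.yarn.appMasterEnv.* in cluster mode (or spark-env / yarn-env).
print("driver SHAPE =", os.environ.get("SHAPE"))

# Executor side: set via spark.executorEnv.* (or yarn-env on the worker nodes).
def read_env(_):
    return [os.environ.get("SHAPE")]

print("executor SHAPE =", spark.sparkContext.parallelize([0], 1).mapPartitions(read_env).collect())

spark.stop()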
To work with EMR clusters I use AWS Lambda, with a project that builds an EMR cluster whenever a certain flag is set.
Inside this project we define placeholder variables that the Lambda resolves and replaces with their values. To do this we use the AWS API; the method to call is AWSSimpleSystemsManagement.getParameters.
Then build a map such as val parametersValues = parameterResult.getParameters.asScala.map(k => (k.getName, k.getValue)) to get (name, value) tuples.
E.g. ${BUCKET} = "s3://bucket-name/"
This means you only have to write ${BUCKET} in your JSON instead of the full path.
Once the placeholders have been replaced, the step JSON can look like this (a sketch of the substitution follows the JSON):
[
  {
    "Name": "Job",
    "Args": [
      "--deploy-mode","cluster",
      "--master","yarn","--py-files",
      "${BUCKET}/code/dependencies.zip",
      "${BUCKET}/code/__init__.py",
      "--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
      "--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
      "--conf", "spark.executorEnv.SHAPE=SQUARE"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "Spark"
  }
]
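For illustration, a rough Python/Boto3 sketch of that substitution step; the Parameter Store name /emr/bucket and the steps.json path are assumptions, not values from the original setup:

import json
import boto3

ssm = boto3.client("ssm")

# Fetch the placeholder values from Parameter Store (parameter names are hypothetical).
response = ssm.get_parameters(Names=["/emr/bucket"])
values = {p["Name"]: p["Value"] for p in response["Parameters"]}

# Map the ${...} placeholders used in the steps JSON to the fetched values.
placeholders = {"${BUCKET}": values["/emr/bucket"]}

with open("steps.json") as f:
    template = f.read()

for placeholder, value in placeholders.items():
    template = template.replace(placeholder, value)

steps = json.loads(template)
print(json.dumps(steps, indent=2))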
I hope this can help you to solve your problem.
I'm trying to create a CodePipeline to deploy an application to EC2 instances using Blue/Green Deployment.
My Deployment Group looks like this:
aws deploy update-deployment-group \
--application-name MySampleAppDeploy \
--deployment-config-name CodeDeployDefault.AllAtOnce \
--service-role-arn arn:aws:iam::1111111111:role/CodeDeployRole \
--ec2-tag-filters Key=Stage,Type=KEY_AND_VALUE,Value=Blue \
--deployment-style deploymentType=BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL \
--load-balancer-info targetGroupInfoList=[{name="sample-app-alb-targets"}] \
--blue-green-deployment-configuration file://configs/blue-green-deploy-config.json \
--current-deployment-group-name MySampleAppDeployGroup
blue-green-deploy-config.json
{
  "terminateBlueInstancesOnDeploymentSuccess": {
    "action": "KEEP_ALIVE",
    "terminationWaitTimeInMinutes": 1
  },
  "deploymentReadyOption": {
    "actionOnTimeout": "STOP_DEPLOYMENT",
    "waitTimeInMinutes": 1
  },
  "greenFleetProvisioningOption": {
    "action": "DISCOVER_EXISTING"
  }
}
I'm able to create a blue/green deployment manually using this command, and it works:
aws deploy create-deployment \
--application-name MySampleAppDeploy \
--deployment-config-name CodeDeployDefault.AllAtOnce \
--deployment-group-name MySampleAppDeployGroup \
# I can specify the Target Instances here
--target-instances file://configs/blue-green-target-instances.json \
--s3-location XXX
blue-green-target-instances.json
{
  "tagFilters": [
    {
      "Key": "Stage",
      "Value": "Green",
      "Type": "KEY_AND_VALUE"
    }
  ]
}
Now, In my CodePipeline Deploy Stage, I have this:
{
  "name": "Deploy",
  "actions": [
    {
      "inputArtifacts": [
        {
          "name": "app"
        }
      ],
      "name": "Deploy",
      "actionTypeId": {
        "category": "Deploy",
        "owner": "AWS",
        "version": "1",
        "provider": "CodeDeploy"
      },
      "outputArtifacts": [],
      "configuration": {
        "ApplicationName": "MySampleAppDeploy",
        "DeploymentGroupName": "MySampleAppDeployGroup"
        /* How can I specify Target Instances here? */
      },
      "runOrder": 1
    }
  ]
}
All EC2 instances are tagged correctly and everything works as expected when using CodeDeploy from the command line, so I must be missing something about how AWS CodePipeline works in this case.
Thanks
You didn't mention which error you get when you invoke the pipeline. Are you getting this one:
"The deployment failed because no instances were found in your green fleet"
Assuming that is the case: since you are using manual tagging in your CodeDeploy configuration, Blue/Green deployments with manual tags will not work here, because CodeDeploy expects a tagSet to find the "Green" instances and there is no way to provide this information via CodePipeline.
To work around this, use the 'Copy AutoScaling' option for implementing Blue/Green deployments in CodeDeploy through CodePipeline. See step 10 here [1].
Another workaround is to create a Lambda function that is invoked as an action in your CodePipeline. This Lambda function can trigger the CodeDeploy deployment, specifying the target instances with the tags of the green Auto Scaling group. It then needs to poll the CodeDeploy API at regular intervals for the deployment status and, once the deployment has completed, signal the result back to CodePipeline.
Here is an example that walks through how to invoke an AWS Lambda function in a CodePipeline pipeline [2].
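A minimal sketch of such a Lambda handler is below. It reuses the application and deployment group names from the question and assumes a Stage=Green tag for the green fleet; the S3 revision location is hypothetical, and the polling and error handling are deliberately simplified:

import time
import boto3

codedeploy = boto3.client("codedeploy")
codepipeline = boto3.client("codepipeline")

def handler(event, context):
    job_id = event["CodePipeline.job"]["id"]
    try:
        # Start the deployment, targeting the green fleet by tag (tag values are assumptions).
        deployment = codedeploy.create_deployment(
            applicationName="MySampleAppDeploy",
            deploymentGroupName="MySampleAppDeployGroup",
            deploymentConfigName="CodeDeployDefault.AllAtOnce",
            targetInstances={
                "tagFilters": [
                    {"Key": "Stage", "Value": "Green", "Type": "KEY_AND_VALUE"}
                ]
            },
            revision={
                "revisionType": "S3",
                "s3Location": {
                    "bucket": "my-artifact-bucket",  # hypothetical
                    "key": "app.zip",                # hypothetical
                    "bundleType": "zip",
                },
            },
        )
        # Poll CodeDeploy until the deployment reaches a terminal state.
        while True:
            info = codedeploy.get_deployment(deploymentId=deployment["deploymentId"])
            status = info["deploymentInfo"]["status"]
            if status in ("Succeeded", "Failed", "Stopped"):
                break
            time.sleep(15)
        # Signal the result back to CodePipeline.
        if status == "Succeeded":
            codepipeline.put_job_success_result(jobId=job_id)
        else:
            codepipeline.put_job_failure_result(
                jobId=job_id,
                failureDetails={"type": "JobFailed", "message": "Deployment " + status},
            )
    except Exception as exc:
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(exc)},
        )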
Ref:
[1] https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html
[2] https://docs.aws.amazon.com/codepipeline/latest/userguide/actions-invoke-lambda-function.html
I am trying to run a script in the cfn-init command but it keeps timing out.
What am I doing wrong when running the startup-script.sh?
"WebServerInstance" : {
"Type" : "AWS::EC2::Instance",
"DependsOn" : "AttachGateway",
"Metadata" : {
"Comment" : "Install a simple application",
"AWS::CloudFormation::Init" : {
"config" : {
"files": {
"/home/ec2-user/startup_script.sh": {
"content": {
"Fn::Join": [
"",
[
"#!/bin/bash\n",
"aws s3 cp s3://server-assets/startserver.jar . --region=ap-northeast-1\n",
"aws s3 cp s3://server-assets/site-home-sprint2.jar . --region=ap-northeast-1\n",
"java -jar startserver.jar\n",
"java -jar site-home-sprint2.jar --spring.datasource.password=`< password.txt` --spring.datasource.username=`< username.txt` --spring.datasource.url=`<db_url.txt`\n"
]
]
},
"mode": "000755"
}
},
"commands": {
"start_server": {
"command": "./startup_script.sh",
"cwd": "~",
}
}
}
}
},
The file part works fine and it creates the file, but it times out when running the command.
What is the correct way of executing a shell script?
You can tail the log in /var/log/cfn-init.log to see what goes wrong while the script runs.
The commands in AWS::CloudFormation::Init are run as the root user by default. The likely issue is that your script resides in /home/ec2-user/ while you are trying to run it from '~' (i.e. /root).
Give the absolute path in cwd, e.g. "cwd": "/home/ec2-user". That should resolve the problem.
However, the exact cause can only be confirmed from the logs.
Usually the init scripts are executed by root unless specified otherwise. Can you try giving the full path when running your startup script? You can also give cloudkast a try; it is an online CloudFormation template generator that makes it easier to create objects such as AWS::CloudFormation::Init.
I want to tune my Spark cluster on AWS EMR, but I couldn't change the default value of spark.driver.memory, which makes every Spark application crash because my dataset is big.
I tried editing the spark-defaults.conf file manually on the master machine, and I also tried configuring it directly with a JSON file on the EMR dashboard while creating the cluster.
Here's the JSON file used:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "7g",
      "spark.driver.cores": "5",
      "spark.executor.memory": "7g",
      "spark.executor.cores": "5",
      "spark.executor.instances": "11"
    }
  }
]
After using the JSON file, the configurations are correctly written to spark-defaults.conf, but on the Spark dashboard spark.driver.memory always shows the default value of 1000M, while the other values are applied correctly. Has anyone run into the same problem?
Thank you in advance.
You need to set
maximizeResourceAllocation=true
in the spark classification settings:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
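To confirm what the driver actually received once the cluster is up, a quick PySpark check may help (a sketch, assuming it is run from a job or pyspark shell on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prints the effective driver memory as resolved from spark-defaults.conf and any overrides.
print(spark.sparkContext.getConf().get("spark.driver.memory", "not set"))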
I'm doing a lot of work with AWS EMR and when you build an EMR cluster through the AWS Management Console you can click a button to export the AWS CLI Command that creates the EMR cluster.
It then gives you a big CLI command that isn't formatted in any way, i.e. if you copy and paste the command it's all on a single line.
I'm using these EMR CLI commands, which were created by other people, to create EMR clusters in Python using the AWS SDK Boto3 library, i.e. I'm reading the CLI command to get all the configuration details. Some of the configuration details are visible in the AWS Management Console UI, but not all of them, so it's easier for me to use the CLI command that you can export.
However, the AWS CLI command is very hard to read since it's not formatted. Is there an AWS CLI command formatter available online similar to JSON formatters?
Another option would be to clone the EMR cluster and go through the EMR cluster creation screens in the AWS Management Console to get all the configuration details, but I'm still curious whether I could just format the CLI command and do it that way. Another benefit of formatting the exported CLI command is that I could put it on a Confluence page for documentation.
Here is some quick Python code to do it:
import shlex
import json
import re

def format_command(command):
    tokens = shlex.split(command)
    formatted = ''
    for token in tokens:
        # Flags get a new line
        if token.startswith("--"):
            formatted += '\\\n '
        # JSON data
        if token[0] in ('[', '{'):
            json_data = json.loads(token)
            data = json.dumps(json_data, indent=4).replace('\n', '\n ')
            formatted += "'{}' ".format(data)
        # Quote token when it contains whitespace
        elif re.search(r'\s', token):
            formatted += "'{}' ".format(token)
        # Simple print for remaining tokens
        else:
            formatted += token + ' '
    return formatted
example = """aws emr create-cluster --applications Name=spark Name=ganglia Name=hadoop --tags 'Project=MyProj' --ec2-attributes '{"KeyName":"emr-key","AdditionalSlaveSecurityGroups":["sg-3822994c","sg-ccc76987"],"InstanceProfile":"EMR_EC2_DefaultRole","ServiceAccessSecurityGroup":"sg-60832c2b","SubnetId":"subnet-3c76ee33","EmrManagedSlaveSecurityGroup":"sg-dd832c96","EmrManagedMasterSecurityGroup":"sg-b4923dff","AdditionalMasterSecurityGroups":["sg-3822994c","sg-ccc76987"]}' --service-role EMR_DefaultRole --release-label emr-5.14.0 --name 'Test Cluster' --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"m4.xlarge","Name":"Master"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m4.xlarge","Name":"CORE"}]' --configurations '[{"Classification":"spark-defaults","Properties":{"spark.sql.avro.compression.codec":"snappy","spark.eventLog.enabled":"true","spark.dynamicAllocation.enabled":"false"},"Configurations":[]},{"Classification":"spark-env","Properties":{},"Configurations":[{"Classification":"export","Properties":{"SPARK_DAEMON_MEMORY":"4g"},"Configurations":[]}]}]' --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1"""
print(format_command(example))
Output looks like this:
aws emr create-cluster \
--applications Name=spark Name=ganglia Name=hadoop \
--tags Project=MyProj \
--ec2-attributes '{
"ServiceAccessSecurityGroup": "sg-60832c2b",
"InstanceProfile": "EMR_EC2_DefaultRole",
"EmrManagedMasterSecurityGroup": "sg-b4923dff",
"KeyName": "emr-key",
"SubnetId": "subnet-3c76ee33",
"AdditionalMasterSecurityGroups": [
"sg-3822994c",
"sg-ccc76987"
],
"AdditionalSlaveSecurityGroups": [
"sg-3822994c",
"sg-ccc76987"
],
"EmrManagedSlaveSecurityGroup": "sg-dd832c96"
}' \
--service-role EMR_DefaultRole \
--release-label emr-5.14.0 \
--name 'Test Cluster' \
--instance-groups '[
    {
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                    },
                    "VolumesPerInstance": 1
                }
            ]
        },
        "InstanceCount": 1,
        "Name": "Master",
        "InstanceType": "m4.xlarge",
        "InstanceGroupType": "MASTER"
    },
    {
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {
                        "VolumeType": "gp2",
                        "SizeInGB": 32
                    },
                    "VolumesPerInstance": 1
                }
            ]
        },
        "InstanceCount": 1,
        "Name": "CORE",
        "InstanceType": "m4.xlarge",
        "InstanceGroupType": "CORE"
    }
]' \
--configurations '[
    {
        "Properties": {
            "spark.eventLog.enabled": "true",
            "spark.dynamicAllocation.enabled": "false",
            "spark.sql.avro.compression.codec": "snappy"
        },
        "Classification": "spark-defaults",
        "Configurations": []
    },
    {
        "Properties": {},
        "Classification": "spark-env",
        "Configurations": [
            {
                "Properties": {
                    "SPARK_DAEMON_MEMORY": "4g"
                },
                "Classification": "export",
                "Configurations": []
            }
        ]
    }
]' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
--region us-east-1
I am following the instructions from http://docs.aws.amazon.com/vm-import/latest/userguide/import-vm-image.html to import an OVA. Here are the summarized steps I followed.
Step 1: Upload an OVA to S3 bucket.
Step 2: Create trust policy
Step 3: Create role policy
Step 4: Create containers.json with bucket name and ova filename.
Step 5: Run command for import-image
Command: aws ec2 import-image --description "My Unique OVA" --disk-containers file://containers.json
Step 6: Get the "ImportTaskId": "import-ami-fgi2cyyd" (in my case)
Step 7: Check status of import task
Error:
C:\Users\joe>aws ec2 describe-import-image-tasks --import-task-ids import-ami-fgi2cyyd
{
    "ImportImageTasks": [
        {
            "Status": "deleted",
            "SnapshotDetails": [
                {
                    "UserBucket": {
                        "S3Bucket": "my_unique_bucket",
                        "S3Key": "my_unique_ova.ova"
                    },
                    "DiskImageSize": 2871726592.0,
                    "Format": "VMDK"
                }
            ],
            "Description": "My Unique OVA",
            "StatusMessage": "ClientError: GRUB doesn't exist in /etc/default directory.",
            "ImportTaskId": "import-ami-fgi2cyyd"
        }
    ]
}
What am I doing wrong? I am on the free tier, trying things out.
Contents of containers.json:
[
  {
    "Description": "My Unique OVA",
    "Format": "ova",
    "UserBucket": {
      "S3Bucket": "my_unique_bucket",
      "S3Key": "my_unique_ova.ova"
    }
  }
]
The OVA file was corrupted in my case. I tried it again with a smaller OVA and it worked fine.
Alright, I figured it out. The problem I ran into, which I assume will be the case with yours as well, is that you probably aren't using the GRUB boot loader but rather LILO. I was able to change the boot loader by going into the GUI (startx) and opening the system configuration. Under the Boot menu I was able to switch from LILO to GRUB. Once I did that, I got further in the EC2 VM import process. Hope that helps.