I'm submitting a Spark job to EMR via the AWS CLI; the EMR steps and Spark configs are provided as separate JSON files. For some reason the name of my main class gets passed to my Spark jar as an unnecessary command-line argument, resulting in a failed job.
AWS CLI command:
aws emr create-cluster \
--name "Spark-Cluster" \
--release-label emr-5.5.0 \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=20,InstanceType=m3.xlarge \
--applications Name=Spark \
--use-default-roles \
--configurations file://conf.json \
--steps file://steps.json \
--log-uri s3://blah/logs
The JSON file describing my EMR step:
[
  {
    "Name": "RunEMRJob",
    "Jar": "s3://blah/blah.jar",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Type": "CUSTOM_JAR",
    "MainClass": "blah.blah.MainClass",
    "Args": [
      "--arg1",
      "these",
      "--arg2",
      "get",
      "--arg3",
      "passed",
      "--arg4",
      "to",
      "--arg5",
      "spark",
      "--arg6",
      "main",
      "--arg7",
      "class"
    ]
  }
]
The argument parser in my main class throws an error (and prints the parameters provided):
Exception in thread "main" java.lang.IllegalArgumentException: One or more parameters are invalid or missing:
blah.blah.MainClass --arg1 these --arg2 get --arg3 passed --arg4 to --arg5 spark --arg6 main --arg7 class
So for some reason the main class that I define in steps.json leaks into my separately provided command line arguments.
What's up?
I misunderstood how EMR steps work. There were two options for resolving this:
I could use Type = "CUSTOM_JAR" with Jar = "command-runner.jar" and add a normal spark-submit call to Args (see the sketch below).
Using Type = "Spark" simply adds the "spark-submit" call as the first argument; one still needs to provide a master, the jar location, the main class, etc.
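For reference, a rough sketch of the first option's steps.json (the --deploy-mode and --master values are assumptions; the remaining application arguments continue as before):
[
  {
    "Name": "RunEMRJob",
    "Jar": "command-runner.jar",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Type": "CUSTOM_JAR",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--master", "yarn",
      "--class", "blah.blah.MainClass",
      "s3://blah/blah.jar",
      "--arg1", "these",
      "--arg2", "get"
    ]
  }
]
Note that with command-runner.jar the MainClass field is omitted; the class is passed to spark-submit instead, so it no longer leaks into the application arguments.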
I have created an EMR cluster through the AWS CLI:
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
--tags Name="EMR-Atlas" --release-label emr-5.16.0 \
--ec2-attributes SubnetId=subnet-xxxxx,KeyName=atlas-emr-dif \
--use-default-roles --ebs-root-volume-size 100 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.xlarge \
--log-uri s3://xxx/logs/new-log \
--steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
Then I established an SSH connection for HUE:
ssh -L 8888:localhost:8888 -i key.pem hadoop@<EMR Master IP Address>
I then created a Hive table through HUE:
CREATE external TABLE us_disease
(
YearStart int,
StratificationCategory2 string,
GeoLocation string,
ResponseID string,
LocationID int,
TopicID string
)
row format delimited
fields terminated by ','
LOCATION 's3://XXXX/data/USHealthcare/'
TBLPROPERTIES ("skip.header.line.count"="1");
I am able to fetch records with a SELECT statement through HUE.
But if I try to execute the same SELECT statement through an HQL script, it fails.
I tried it in the following way:
My HQL is a plain SELECT statement:
select * from us_disease limit 10;
and I have stored it in S3 as hive.hql.
I executed the HQL through a step in the EMR cluster:
Log:
INFO redirectError to /mnt/var/log/hadoop/steps/s-xxxxxxxx/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-xxxxxxxx
INFO ProcessRunner started child process 30597 :
hadoop 30597 5505 0 11:40 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar hive-script --run-hive-script --args -f s3://dif-test/data-governance/hql/hive.hql
2021-03-30T11:40:36.318Z INFO HadoopJarStepRunner.Runner: startRun() called for s-xxxxxxxx Child Pid: 30597
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 127 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 2 seconds
2021-03-30T11:40:36.437Z INFO Step created jobs:
2021-03-30T11:40:36.438Z WARN Step failed with exitCode 127 and took 2 seconds
stderr:
/usr/lib/hadoop/bin/hadoop: line 169: /etc/alternatives/jre/bin/java: No such file or directory
Any help appreciated. Thank you.
The issue got fixed after I updated the EMR version. Previously I was using emr-5.16.0; I changed to emr-5.32.0.
Modified command:
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
--tags Name="EMR-Atlas" --release-label emr-5.32.0 \
--ec2-attributes SubnetId=subnet-xxxx,KeyName=atlas-emr-dif \
--use-default-roles --ebs-root-volume-size 100 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
--log-uri s3://xxx/xxx/new-log \
--steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
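For completeness, here is a hedged sketch of adding the HQL step itself once the cluster is up (the cluster ID is a placeholder; the script path matches the one in the log above):
aws emr add-steps \
--cluster-id j-XXXXXXXXXXXXX \
--steps Type=CUSTOM_JAR,Name="Run Hive HQL",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[hive-script,--run-hive-script,--args,-f,s3://dif-test/data-governance/hql/hive.hql]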
Let's say I have all the parameters needed to create a CloudFormation stack in a JSON file, but I want to override some parameters from that parameters file. Is this possible?
aws cloudformation create-stack \
--stack-name sample-stack \
--template-body file://sample-stack.yaml \
--parameters file://sample-stack.json \
--capabilities CAPABILITY_IAM \
--disable-rollback \
--region us-east-1 \
--output json && \
aws cloudformation wait stack-create-complete \
--stack-name sample-stack
So let's say there are 10 parameters in the sample-stack.json file, but there are 2 parameters I want to override from that file.
Is this possible?
Thanks
This isn't available in the AWS CLI right now, but there is a feature request on GitHub. For now you'll need to script something to generate your overrides prior to creating the stack. Another potential option is to store your values in something that you can dynamically reference, such as Parameter Store, and update them via the API prior to stack creation.
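As a rough sketch of the Parameter Store route (the parameter name and value are made up for illustration), you would update the value before each create-stack call and have the template read it via an SSM parameter type:
aws ssm put-parameter \
--name /sample-stack/instance-type \
--type String \
--value t3.small \
--overwrite

# In sample-stack.yaml, the parameter can then be declared as:
#   InstanceTypeParam:
#     Type: 'AWS::SSM::Parameter::Value<String>'
#     Default: /sample-stack/instance-type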
If you want to update a stack and specify only the list of parameters that changed, you can have a look at this shell script that I wrote.
Usage:
▶ bash update_stack.sh -h
Usage: update_stack.sh [-h] STACK_NAME KEY1=VAL1 [KEY2=VAL2 ...]
Updates CloudFormation stacks based on parameters passed here as key=value pairs. All
other parameters are based on existing values.
To solve your problem, you could borrow the edit() function:
PARAMS='sample-stack.json'
edit() {
  local key value pair
  for pair in "$@" ; do
    IFS='=' read -r key value <<< "$pair"
    # Update the matching ParameterKey's ParameterValue in place.
    jq --arg key "$key" \
       --arg value "$value" \
       '(.[] | select(.ParameterKey==$key)
         | .ParameterValue) |= $value' \
       "$PARAMS" > x ; mv x "$PARAMS"
  done
}
cp "$PARAMS" "$PARAMS.bak"
edit param1=newval1 param2=newval2
And then create your stack as normal.
Make all the values in the file variables, and use another script to pass in the default values or overwrite them.
For example, my JSON file sample-stack.json looks like the following:
[
{
"ParameterKey": "InstanceType",
"ParameterValue": "${instance_type}"
},
{
"ParameterKey": "DesiredSize",
"ParameterValue": "${ASG_DESIRED_Number}"
}
]
In the script file, run the following commands to substitute the values (the variables must be exported so envsubst can see them):
export instance_type=t3.small
envsubst < "${IN_FILENAME}" > "${OUT_FILENAME}"
You only need to set the variables you want to override; for those that don't need to change, the default value will be passed in.
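Putting it together, a minimal sketch under assumed file names (the template is sample-stack.json.tpl and the default values are illustrative) could look like this:
# envsubst only substitutes exported environment variables.
export instance_type="${instance_type:-t3.small}"
export ASG_DESIRED_Number="${ASG_DESIRED_Number:-2}"

envsubst < sample-stack.json.tpl > sample-stack.json

aws cloudformation create-stack \
--stack-name sample-stack \
--template-body file://sample-stack.yaml \
--parameters file://sample-stack.json \
--capabilities CAPABILITY_IAM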
Is there a way I can access a REST API endpoint for a Model created by Cloud ML Engine? I only see:
gcloud ml-engine jobs submit prediction $JOB_NAME \
--model census \
--version v1 \
--data-format TEXT \
--region $REGION \
--runtime-version 1.10 \
--input-paths gs://cloud-samples-data/ml-engine/testdata/prediction/census.json \
--output-path $GCS_JOB_DIR/predictions
Yes, in fact there are two APIs available to do this.
The projects.predict call is the simplest method. You pass in a request as described here, and it returns with the result. This cannot take input from GCS like your gcloud command does.
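For example, a hedged sketch of the online-prediction call with curl (the body is only a placeholder; the actual "instances" format depends on what your model expects):
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"instances": ["REPLACE_WITH_MODEL_INPUT"]}' \
"https://ml.googleapis.com/v1/projects/$PROJECT_ID/models/census/versions/v1:predict"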
The projects.jobs.create call with the predictionInput and predictionOutput fields allows batch prediction, with input from GCS.
The equivalent for your command is:
POST https://ml.googleapis.com/v1/projects/$PROJECT_ID/jobs
{
"jobId" : "$JOB_NAME",
"predictionInput": {
"dataFormat": "TEXT",
"inputPaths": "gs://cloud-samples-data/ml-engine/testdata/prediction/census.json",
"region": "REGION",
"runtimeVersion": "1.10",
"modelName": "projects/$PROJECT_ID/models/census"
},
"predictionOutput": {
"outputPath": "$GCS_JOB_DIR/predictions"
}
}
This returns immediately; use projects.jobs.get to check for success/failure.
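In the same REST style, the status check would look something like:
GET https://ml.googleapis.com/v1/projects/$PROJECT_ID/jobs/$JOB_NAME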
When creating a new cluster using boto3, I want to reuse the configuration from an existing (terminated) cluster and thus clone it.
As far as I know, emr_client.run_job_flow requires all the configuration (Instances, InstanceFleets, etc.) to be provided as parameters.
Is there any way I can clone an existing cluster, like I can do from the AWS console for EMR?
What I can recommend is using the AWS CLI to fire up your cluster.
It lets you version your cluster configuration, and you can easily load the step configuration from a JSON file.
aws emr create-cluster --name "Cluster's name" --ec2-attributes KeyName=SSH_KEY \
--instance-type m3.xlarge --release-label emr-5.2.1 --log-uri s3://mybucket/logs/ \
--enable-debugging --instance-count 1 --use-default-roles --applications Name=Spark \
--steps file://step.json
Where step.json looks like:
[
{
"Name": "Step #1",
"Type":"SPARK",
"Jar":"command-runner.jar",
"Args":
[
"--deploy-mode", "cluster",
"--class", "com.your.data.set.class",
"s3://path/to/your/spark-job.jar",
"-c", "s3://path/to/your/config/or/not",
"--aws-access-key", "ACCESS_KEY",
"--aws-secret-key", "SECRET_KEY"
],
"ActionOnFailure": "CANCEL_AND_WAIT"
}
]
(Multiple steps are okay too.)
After that you can always start up the same configured cluster, and, for example, schedule the whole cluster and its steps from one Airflow job.
But if you really want to use boto3, I suppose the describe_cluster() method can help you get all the information and then use the returned object to fire up a new one.
There is no way to get an "EMR export CLI" through the command line.
You should parse the parameters you want to clone from the output of "describe-cluster".
See the sample below:
https://github.com/awslabs/aws-support-tools/tree/master/EMR/Get_EMR_CLI_Export
import boto3
import json
import sys

# The ID of the cluster to clone is passed as the first command-line argument.
cluster_id = sys.argv[1]
client = boto3.client('emr')
clst = client.describe_cluster(ClusterId=cluster_id)
...
# Build up an equivalent "aws emr create-cluster" command string.
awscli += ' --steps ' + '\'' + json.dumps(cli_steps) + '\''
...
awscli += ' --instance-groups ' + '\'' + json.dumps(cli_igroups) + '\''
print(awscli)
It works by first parsing the parameters from "describe-cluster" and then building strings that fit the "create-cluster" syntax of the AWS CLI.
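Usage is along these lines (the script file name is hypothetical; pass the ID of the cluster you want to clone):
python emr_cli_export.py j-XXXXXXXXXXXXX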
Could you please point me to an appropriate documentation topic or provide an example of how to add an index to DynamoDB, as I couldn't find any related info.
According to this blog: http://aws.amazon.com/blogs/aws/amazon-dynamodb-update-online-indexing-reserved-capacity-improvements/?sc_ichannel=em&sc_icountry=global&sc_icampaigntype=launch&sc_icampaign=em_130867660&sc_idetail=em_1273527421&ref_=pe_411040_130867660_15 it seems to be possible to do it with the UI; however, there is no mention of how to do it with the CLI.
Thanks in advance,
Yevhenii
The aws command has help for every level of subcommand. For example, you can run aws help to get a list of all service names and discover the name dynamodb. Then you can aws dynamodb help to find the list of DDB commands and find that update-table is a likely culprit. Finally, aws dynamodb update-table help shows you the flags needed to add a global secondary index.
The AWS CLI documentation is really poor and lacks examples. Evidently AWS is promoting the SDK or the console.
This should work for updating:
aws dynamodb update-table --table-name Test \
--attribute-definitions AttributeName=City,AttributeType=S AttributeName=State,AttributeType=S \
--global-secondary-index-updates \
'Create={IndexName=state-index,KeySchema=[{AttributeName=State,KeyType=HASH}],Projection={ProjectionType=INCLUDE,NonKeyAttributes=[City]},ProvisionedThroughput={ReadCapacityUnits=1,WriteCapacityUnits=1}}'
Here's a shell function to do this. It sets the R/W capacities and optionally handles --global-secondary-index-updates if an index name is provided:
dynamodb_set_caps() {
  # [ "$1" ] || fail_exit "Missing table name"
  # [ "$2" ] || fail_exit "Missing read capacity"
  # [ "$3" ] || fail_exit "Missing write capacity"
  if [ "$4" ] ; then
    aws dynamodb update-table --region $region --table-name ${1} \
      --provisioned-throughput ReadCapacityUnits=${2},WriteCapacityUnits=${3} \
      --global-secondary-index-updates \
      "Update={IndexName=${4},ProvisionedThroughput={ReadCapacityUnits=${2},WriteCapacityUnits=${3}}}"
  else
    aws dynamodb update-table --region $region --table-name ${1} \
      --provisioned-throughput ReadCapacityUnits=${2},WriteCapacityUnits=${3}
  fi
}
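Example usage (the table name, index name, and region are placeholders; the function reads $region):
region=us-east-1
dynamodb_set_caps MyTable 10 5            # table throughput only
dynamodb_set_caps MyTable 10 5 my-index   # table throughput plus a GSI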
Completely agree that the aws docs are lacking in this area
Here is a reference for creating a global secondary index:
https://docs.aws.amazon.com/pt_br/amazondynamodb/latest/developerguide/getting-started-step-6.html
However, that example only covers creating an index for a simple (single-attribute) primary key.
This code helped me to create a global secondary index for a composite primary key:
aws dynamodb update-table \
--table-name YourTableName \
--attribute-definitions AttributeName=GSI1PK,AttributeType=S \
AttributeName=GSI1SK,AttributeType=S \
AttributeName=createdAt,AttributeType=S \
--global-secondary-index-updates \
"[{\"Create\":{\"IndexName\": \"GSI1\",\"KeySchema\":[{\"AttributeName\":\"GSI1PK\",\"KeyType\":\"HASH\"},{\"AttributeName\":\"GSI1SK\",\"KeyType\":\"RANGE\"}], \
\"ProvisionedThroughput\": {\"ReadCapacityUnits\": 5, \"WriteCapacityUnits\": 5 },\"Projection\":{\"ProjectionType\":\"ALL\"}}}]" --endpoint-url http://localhost:8000
Note: the --endpoint-url option on the last line assumes you are creating this index in a local DynamoDB instance. If not, just remove it.
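To check that the new index has finished backfilling, you can watch its IndexStatus (a general-purpose check, not specific to this table):
aws dynamodb describe-table --table-name YourTableName \
--query "Table.GlobalSecondaryIndexes[].{Index:IndexName,Status:IndexStatus}"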