Is there a way to set multiple --conf as job parameter in AWS Glue? - amazon-web-services

I'm trying to configure Spark in my Glue jobs. When I enter them one by one under 'Edit job' > 'Job Parameters' as a key and value pair (e.g. key: --conf, value: spark.executor.memory=10g) it works, but when I try putting them all together (delimited by a space or comma), it results in an error. I also tried using sc._conf.setAll, but Glue ignores the config and insists on using its defaults. Is there a way to do this with Spark 2.4?

Yes, you can pass multiple parameters as below:
Key: --conf
value: spark.yarn.executor.memoryOverhead=7g --conf spark.yarn.executor.memory=7g
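If you manage the job programmatically instead of through the console, the same chained value can be supplied under DefaultArguments. A minimal boto3 sketch, assuming a new job; the job name, role, and script location are placeholders:
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-glue-job",            # placeholder job name
    Role="MyGlueServiceRole",      # placeholder IAM role for the job
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    DefaultArguments={
        # a single --conf key whose value chains the remaining settings
        "--conf": "spark.yarn.executor.memoryOverhead=7g --conf spark.yarn.executor.memory=7g"
    },
)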

Related

Running multiple MSCK REPAIR TABLE statements in AWS Athena

So I'm trying to execute the following in AWS Athena, which allows running only one statement at a time:
MSCK REPAIR TABLE some_database.some_table_001;
MSCK REPAIR TABLE some_database.some_table_002;
MSCK REPAIR TABLE some_database.some_table_003;
Problem is, I don't have just three statements; I have 700+ similar statements and would like to run all 700+ in one go as a batch.
So, using the AWS CloudShell CLI, I tried running the following:
aws athena start-query-execution --query-string "MSCK REPAIR TABLE `some_table_001`;" --work-group "primary" \
--query-execution-context Database=some_database \
--result-configuration "OutputLocation=s3://some_bucket/some_folder"
...hoping I could use Excel to generate 700+ statements like this and run them as a batch,
...but I keep getting this error:
An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:1: mismatched input 'MSCK'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'PREPARE', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'UNLOAD', 'UPDATE', 'USE', <query>
Not sure what I'm doing wrong, as the same MSCK command seems to run fine in the Athena console. I know Athena is finicky when it comes to
`some_table_001`
versus
'some_table_001'
(backticks versus single quotes); I tried both but didn't get it to work.
Any thoughts on a possible solution?
In Amazon Athena, special characters other than underscore (_) are not supported in table names and table column names.
Since your table name doesn't contain special characters, you don't need to wrap it in quotes or backticks. However, had your table name contained special characters, you would have had to use backticks. Refer to the Amazon Athena documentation for more details.
Regarding running multiple MSCK REPAIR TABLE statements in Amazon Athena, you can also use an SDK such as boto3 (Python).
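For illustration, a minimal boto3 sketch that submits one MSCK REPAIR TABLE query per table; the database, workgroup, and output bucket are taken from the question, and the table-name pattern and count of 700 are assumptions:
import boto3

athena = boto3.client("athena")

# Assumed naming pattern from the question: some_table_001 ... some_table_700
tables = [f"some_table_{i:03d}" for i in range(1, 701)]

for table in tables:
    athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE some_database.{table}",
        QueryExecutionContext={"Database": "some_database"},
        WorkGroup="primary",
        ResultConfiguration={"OutputLocation": "s3://some_bucket/some_folder"},
    )
Each call returns immediately with a QueryExecutionId, so if you submit hundreds of queries at once you may want to throttle the loop or poll get_query_execution for completion.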

What is the aws-cli command for AWS Macie to create a job?

Actually, I want to create a job in AWS Macie using the AWS CLI.
I ran the following command:
aws macie2 create-classification-job --job-type "ONE_TIME" --name "maice-poc" --s3-job-definition bucketDefinitions=[{"accountID"="254378651398", "buckets"=["maice-poc"]}]
but it is giving me an error:
Unknown options: buckets=[maice-poc]}]
Can someone give me a correct command?
The s3-job-definition option requires a structure as its value.
In your case, you want to pass a JSON-formatted structure parameter, so you should wrap the JSON, starting with bucketDefinitions, in single quotes. Also, instead of =, use the JSON syntax : for key-value pairs.
The following API call should work:
aws macie2 create-classification-job --job-type "ONE_TIME" --name "macie-poc" --s3-job-definition '{"bucketDefinitions":[{"accountId":"254378651398", "buckets":["maice-poc"]}]}'
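If CLI quoting keeps getting in the way, the same call can also be made from boto3, where the structure is a native dict and no shell quoting is involved; a sketch using the account ID and bucket name from the question:
import boto3

macie = boto3.client("macie2")

macie.create_classification_job(
    jobType="ONE_TIME",
    name="macie-poc",
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "254378651398", "buckets": ["maice-poc"]}
        ]
    },
)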

AWS CLI DynamoDB Called From Powershell Put-Item fails when a value contains a space

So, let's say I'm trying to post this JSON via the command line (not in a file, because I'm not going to write a file for every invocation of this script) to a DynamoDB table:
{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One_Space\"},\"Environment\":{\"S\":\"cte\"},\"StartDate\":{\"S\":\"null\"},\"EndDate\":{\"S\":\"null\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someones user\"},\"EmailDistributionList\":{\"S\":\"test#test.com\"},\"RemedyGroup\":{\"S\":\"OneSpace\"},\"ScomSubscriptionId\":{\"S\":\"guid-ab22-2345\"},\"ZabbixActionId\":{\"S\":\"11\"},\"SnsTopic\":{\"M\":{\"TopicName\":{\"S\":\"ATopicName\"},\"TopicArn\":{\"S\":\"AtopicArn1234\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someones user\"}}}}
Then the result from the CLI is an error like this:
Unknown options: Space"},"ScomSubscriptionId":{"S":"guid-ab22-2345"},"ZabbixActionId":{"S":"11"},"SnsTopic":{"M":{"TopicName":{"S":"ATopicName"},"TopicArn":{"S":"AtopicArn1234"},"CreatedDate":{"S":"today"},"CreatedBy":{"S":"someones, user"}}}}, user"},"EmailDistributionList":{"S":"test#test.com"},"RemedyGroup":{"S":"One
As you can see, it fails on the TeamName property, which in the above example is "One Space". If I change that value to "OneSpace", it instead starts to fail on the "CreatedBy" property, which is populated with "someones user", but if I remove all spaces from all properties I can suddenly pass this JSON to DynamoDB successfully.
In a working example, the JSON looks like this:
{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One_Space\"},\"Environment\":{\"S\":\"cte\"},\"StartDate\":{\"S\":\"null\"},\"EndDate\":{\"S\":\"null\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someonesuser\"},\"EmailDistributionList\":{\"S\":\"test#test.com\"},\"RemedyGroup\":{\"S\":\"OneSpace\"},\"ScomSubscriptionId\":{\"S\":\"guid-ab22-2345\"},\"ZabbixActionId\":{\"S\":\"11\"},\"SnsTopic\":{\"M\":{\"TopicName\":{\"S\":\"ATopicName\"},\"TopicArn\":{\"S\":\"AtopicArn1234\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someonesuser\"}}}}
I can't find any documentation that says I can't have spaces; if I read this in from a file, it posts with the spaces intact, so what gives? If anyone has any advice on this matter, I'd certainly appreciate it.
For what it's worth, in PowerShell the execution currently looks like this (though I've tried various combinations of quoting the $dbTeamTableEntry variable):
$dbEntry = aws.exe dynamodb put-item --region $region --table-name $table --item "$($dbTeamTableEntry)"
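As a point of comparison (not a fix for the PowerShell quoting itself), the same item can be written with boto3's put_item, where the item is a native dict and values containing spaces need no shell escaping; a sketch using a few of the attributes from the question, with the table name as a placeholder:
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.put_item(
    TableName="some-team-table",  # placeholder table name
    Item={
        "TeamId": {"S": "One_Space_123"},
        "TeamName": {"S": "One Space"},       # value with a space, passed as-is
        "CreatedBy": {"S": "someones user"},
    },
)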

How do I set multiple --conf job parameters in AWS Glue?

Multiple answers on Stack Overflow for AWS Glue say to set the --conf job parameter. However, sometimes we'll need to set multiple --conf key-value pairs in one job.
I've tried the following ways to set multiple --conf values, all resulting in errors:
Add another job parameter called --conf. This results in the AWS console removing the 2nd parameter named --conf and setting focus to the value of the 1st parameter named --conf. Terraform likewise considers both parameters with the key --conf to be equal and overwrites the 1st parameter's value with the 2nd's.
Separate the config key-value pairs with a space in the value of the --conf job parameter, e.g. spark.yarn.executor.memoryOverhead=1024 spark.yarn.executor.memoryOverhead=7g spark.yarn.executor.memory=7g. This results in a failure to start the job.
Separate the config key-value pairs with a comma in the value of the --conf job parameter, e.g. spark.yarn.executor.memoryOverhead=1024, spark.yarn.executor.memoryOverhead=7g, spark.yarn.executor.memory=7g. This results in a failure to start the job.
Set the value of --conf so that a --conf string separates each key-value pair, e.g. spark.yarn.executor.memoryOverhead=1024 --conf spark.yarn.executor.memoryOverhead=7g --conf spark.yarn.executor.memory=7g. This results in the Glue job hanging.
How do I set multiple --conf job parameters in AWS Glue?
You can pass multiple parameters as below:
Key: --conf
value: spark.yarn.executor.memoryOverhead=7g --conf spark.yarn.executor.memory=7g
This has worked for me.
You can override the parameters by editing the job and adding job parameters. The key and value I used are:
Key: --conf
Value: spark.yarn.executor.memoryOverhead=7g
This seemed counterintuitive, since the setting's key is actually part of the value, but it was recognized. So if you're attempting to set spark.yarn.executor.memory, the following parameter would be appropriate:
Key: --conf
Value: spark.yarn.executor.memory=7g
For more information, see the answer this was adapted from: https://stackoverflow.com/a/50122948/10968161

Pass comma separated argument to spark jar in AWS EMR using CLI

I am using the AWS CLI to create an EMR cluster and add a step. My create-cluster command looks like:
aws emr create-cluster --release-label emr-5.0.0 --applications Name=Spark --ec2-attributes KeyName=*****,SubnetId=subnet-**** --use-default-roles --bootstrap-action Path=$S3_BOOTSTRAP_PATH --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=$instanceCount,InstanceType=m4.4xlarge --steps Type=Spark,Name="My Application",ActionOnFailure=TERMINATE_CLUSTER,Args=[--master,yarn,--deploy-mode,client,$JAR,$inputLoc,$outputLoc] --auto-terminate
$JAR is my Spark jar, which takes two params: input and output.
$inputLoc is basically a comma-separated list of input files, like s3://myBucket/input1.txt,s3://myBucket/input2.txt
However, the AWS CLI treats the comma-separated values as separate arguments, so the second input file is being treated as the second parameter, and $outputLoc here becomes s3://myBucket/input2.txt.
Is there any way to escape the comma and treat this whole argument as a single value in the CLI command, so that Spark can handle reading multiple files as input?
It seems there is no way to escape the commas in the input file list.
After trying quite a few approaches, I finally resorted to a hack: passing a different delimiter to separate the input files and handling it in code. In my case, I added % as my delimiter, and in the driver code I am doing:
// Convert the % delimiter back into the comma-separated list Spark expects
if (inputLoc.contains("%")) {
    inputLoc = inputLoc.replaceAll("%", ",");
}
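For illustration, with the assumed % delimiter the step arguments from the create-cluster command above would then pass the input files joined by % instead of commas, for example:
Args=[--master,yarn,--deploy-mode,client,$JAR,s3://myBucket/input1.txt%s3://myBucket/input2.txt,$outputLoc]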