Pass comma-separated argument to Spark jar in AWS EMR using CLI

I am using the AWS CLI to create an EMR cluster and add a step. My create-cluster command looks like:
aws emr create-cluster --release-label emr-5.0.0 --applications Name=Spark --ec2-attributes KeyName=*****,SubnetId=subnet-**** --use-default-roles --bootstrap-action Path=$S3_BOOTSTRAP_PATH --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=$instanceCount,InstanceType=m4.4xlarge --steps Type=Spark,Name="My Application",ActionOnFailure=TERMINATE_CLUSTER,Args=[--master,yarn,--deploy-mode,client,$JAR,$inputLoc,$outputLoc] --auto-terminate
$JAR is my Spark jar, which takes two parameters: an input location and an output location.
$inputLoc is a comma-separated list of input files, like s3://myBucket/input1.txt,s3://myBucket/input2.txt
However, the AWS CLI treats the comma-separated values as separate step arguments, so the second input file is taken as the second parameter and $outputLoc effectively becomes s3://myBucket/input2.txt
Is there any way to escape the comma and treat this whole argument as a single value in the CLI command, so that Spark can read multiple files as input?

It seems there is no way to escape the comma in the input file list.
After trying quite a few approaches, I finally resorted to a hack: passing a different delimiter to separate the input files and handling it in code. In my case, I used % as my delimiter, and in the driver code I do:
// Restore the commas that were replaced by % to get past the CLI's argument splitting
if (inputLoc.contains("%")) {
    inputLoc = inputLoc.replaceAll("%", ",");
}
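For illustration, the step would then receive the %-joined list in place of the comma-separated one (a sketch reusing the placeholders from the create-cluster command above):
inputLoc=s3://myBucket/input1.txt%s3://myBucket/input2.txt
--steps Type=Spark,Name="My Application",ActionOnFailure=TERMINATE_CLUSTER,Args=[--master,yarn,--deploy-mode,client,$JAR,$inputLoc,$outputLoc]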

Related

What is the aws-cli command for AWS Macie to create a job?

I want to create a job in AWS Macie using the AWS CLI.
I ran the following command:
aws macie2 create-classification-job --job-type "ONE_TIME" --name "maice-poc" --s3-job-definition bucketDefinitions=[{"accountID"="254378651398", "buckets"=["maice-poc"]}]
but it is giving me an error:
Unknown options: buckets=[maice-poc]}]
Can someone give me a correct command?
The --s3-job-definition parameter requires a structure as its value.
In your case you want to pass a JSON-formatted structure, so you should wrap the JSON starting at bucketDefinitions in single quotes. Also, use the JSON syntax : for key-value pairs instead of =.
The following API call should work:
aws macie2 create-classification-job --job-type "ONE_TIME" --name "macie-poc" --s3-job-definition '{"bucketDefinitions":[{"accountId":"254378651398", "buckets":["maice-poc"]}]}'
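If the inline quoting gets unwieldy, the AWS CLI can also load a complex parameter value from a file with file:// (a sketch; job-definition.json is a hypothetical filename holding the same JSON structure as above):
aws macie2 create-classification-job --job-type "ONE_TIME" --name "macie-poc" --s3-job-definition file://job-definition.json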

AWS CLI DynamoDB put-item called from PowerShell fails when a value contains a space

So, let's say I'm trying to post this JSON via the command line (not from a file, because I'm not going to write a file for every invocation of this script) to a DynamoDB table:
{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One_Space\"},\"Environment\":{\"S\":\"cte\"},\"StartDate\":{\"S\":\"null\"},\"EndDate\":{\"S\":\"null\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someones user\"},\"EmailDistributionList\":{\"S\":\"test#test.com\"},\"RemedyGroup\":{\"S\":\"OneSpace\"},\"ScomSubscriptionId\":{\"S\":\"guid-ab22-2345\"},\"ZabbixActionId\":{\"S\":\"11\"},\"SnsTopic\":{\"M\":{\"TopicName\":{\"S\":\"ATopicName\"},\"TopicArn\":{\"S\":\"AtopicArn1234\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someones user\"}}}}
The result from the CLI is an error like this:
Unknown options: Space"},"ScomSubscriptionId":{"S":"guid-ab22-2345"},"ZabbixActionId":{"S":"11"},"SnsTopic":{"M":{"TopicName":{"S":"ATopicName"},"TopicArn":{"S":"AtopicArn1234"},"CreatedDate":{"S":"today"},"CreatedBy":{"S":"someones, user"}}}}, user"},"EmailDistributionList":{"S":"test#test.com"},"RemedyGroup":{"S":"One
As you can see, it fails on the TeamName property, which in the example above is "One Space". If I change that value to "OneSpace", it instead starts to fail on the CreatedBy property, which is populated with "someones user"; but if I remove all spaces from all properties, I can suddenly post this JSON to DynamoDB successfully.
In a working example the JSON looks like this:
{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One_Space\"},\"Environment\":{\"S\":\"cte\"},\"StartDate\":{\"S\":\"null\"},\"EndDate\":{\"S\":\"null\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someonesuser\"},\"EmailDistributionList\":{\"S\":\"test#test.com\"},\"RemedyGroup\":{\"S\":\"OneSpace\"},\"ScomSubscriptionId\":{\"S\":\"guid-ab22-2345\"},\"ZabbixActionId\":{\"S\":\"11\"},\"SnsTopic\":{\"M\":{\"TopicName\":{\"S\":\"ATopicName\"},\"TopicArn\":{\"S\":\"AtopicArn1234\"},\"CreatedDate\":{\"S\":\"today\"},\"CreatedBy\":{\"S\":\"someonesuser\"}}}}
I can't find any documentation that says I can't have spaces. If I read this in from a file, it posts fine with the spaces, so what gives? If anyone has any advice on this matter, I certainly appreciate it.
For what it's worth, in PowerShell the execution currently looks like this (though I've tried various combinations of quoting the $dbTeamTableEntry variable):
$dbEntry = aws.exe dynamodb put-item --region $region --table-name $table --item "$($dbTeamTableEntry)"
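The PowerShell quoting rules quoted in the Kinesis Analytics answer below suggest a likely fix: enclose the JSON in single quotation marks so PowerShell passes the backslash-escaped double quotes through unchanged (a sketch with a shortened, hypothetical item):
$dbTeamTableEntry = '{\"TeamId\":{\"S\":\"One_Space_123\"},\"TeamName\":{\"S\":\"One Space\"}}'
$dbEntry = aws.exe dynamodb put-item --region $region --table-name $table --item $dbTeamTableEntry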

Creating Kinesis Analytics applications using the AWS CLI

I want to create a Kinesis Analytics application using the AWS CLI. I use this command to create the application:
aws kinesisanalytics create-application --application-name smartfactorytest1 --application-code "CREATE OR REPLACE STREAM DESTINATION_SQL_STREAM ( "device_serial" VARCHAR(16), "uploadRate" INTEGER, "downloadRate" INTEGER);
CREATE OR REPLACE PUMP "STREAM_PUMP"
AS INSERT INTO DESTINATION_SQL_STREAM
SELECT STREAM "device_serial", "uploadRate", "downloadRate"
FROM SOURCE_SQL_STREAM_001
-- LIKE compares a string to a string pattern (_ matches all char, % matches substring)
-- SIMILAR TO compares string to a regex, may use ESCAPE
WHERE "uploadRate" >20000" --inputs NamePrefix="SOURCE_SQL_STREAM",KinesisStreamsInput={ResourceARN="sourcearn",RoleARN="rolearn"}
But I get this error:
invalid type for parameter Inputs[0].KinesisStreamsInput, value: ResourceARN=string, type: <class 'str'>, valid types: <class 'dict'>
Can anyone tell me what I am doing wrong? Any help would be appreciated.
I believe the issue is either that you need to take the quotes out in the KinesisStreamsInput section, or you need to add quotes and escape them. The documentation is unclear on which is the correct option.
According to the AWS Kinesis Analytics CLI Reference, https://docs.aws.amazon.com/cli/latest/reference/kinesisanalytics/create-application.html, the syntax for --inputs with KinesisStreamsInput should look like the example provided for KinesisStreamsOutput:
Name=string,KinesisStreamsOutput={ResourceARN=string,RoleARN=string},...
This would mean removing the quotes around your sourcearn and rolearn. However, the documentation isn't clear that this refers to the CLI syntax in all cases.
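For example, the first option applied to the question's command would look like this (a sketch; sourcearn and rolearn are the placeholders from the question):
--inputs NamePrefix=SOURCE_SQL_STREAM,KinesisStreamsInput={ResourceARN=sourcearn,RoleARN=rolearn}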
If that doesn't work, this AWS CLI usage guide page, https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-parameters-quoting-strings.html, specifies adding quotes and escaping the relevant ones, depending on your OS:
"Linux or macOS
Use single quotation marks (' ') to enclose the JSON data structure, as in the following example. You don't have to do anything special with the double quotation marks embedded in the JSON string.
aws ec2 run-instances --image-id ami-12345678 --block-device-mappings '[{"DeviceName":"/dev/sdb","Ebs":{"VolumeSize":20,"DeleteOnTermination":false,"VolumeType":"standard"}}]'
PowerShell
PowerShell requires single quotation marks (' ') to enclose the JSON data structure. Also, because double quotation marks have a special meaning to PowerShell, you must use a backslash (\) to escape each double quotation mark (") within the JSON structure, as in the following example.
PS C:\> aws ec2 run-instances --image-id ami-12345678 --block-device-mappings '[{\"DeviceName\":\"/dev/sdb\",\"Ebs\":{\"VolumeSize\":20,\"DeleteOnTermination\":false,\"VolumeType\":\"standard\"}}]'
Windows Command Prompt
The Windows command prompt requires double quotation marks (" ") to enclose the JSON data structure. Also, to prevent the command processor from misinterpreting the double quotation marks embedded in the JSON, you must also escape (precede with a backslash [ \ ] character) each double quotation mark (") within the JSON data structure itself, as in the following example.
C:\> aws ec2 run-instances --image-id ami-12345678 --block-device-mappings "[{\"DeviceName\":\"/dev/sdb\",\"Ebs\":{\"VolumeSize\":20,\"DeleteOnTermination\":false,\"VolumeType\":\"standard\"}}]"
Only the outermost double quotation marks are not escaped."
This link also references needing to escape quotes on Windows, and is using the kinesisanalytics command: https://github.com/aws/aws-cli/issues/3103
"Rishi74744 commented on Feb 6, 2018
I got it to work as -
aws kinesisanalytics add-application-reference-data-source --endpoint https://kinesisanalytics.us-east-1.amazonaws.com --region us-east-1 --application-name alerts --reference-data-source "{\"TableName\":\"DeviceData\",\"S3ReferenceDataSource\":{\"BucketARN\":\"arn:aws:s3:::bucket-name\",\"FileKey\":\"device.csv\",\"ReferenceRoleARN\":\"arn:aws:iam::account-id:role/role-name\"},\"ReferenceSchema\":{\"RecordFormat\":{\"RecordFormatType\":\"CSV\",\"MappingParameters\":{\"CSVMappingParameters\":{\"RecordRowDelimiter\":\"\n\",\"RecordColumnDelimiter\":\", \"}}},\"RecordEncoding\":\"UTF-8\",\"RecordColumns\":[{\"Name\":\"key1\",\"SqlType\":\"VARCHAR(64)\"},{\"Name\":\"key2\",\"SqlType\":\"VARCHAR(64)\"}]}}" --current-application-version-id 2
But this should be mentioned in the documentation."
One note: it may be preferable to use JSON files as inputs and use this syntax instead: --cli-input-json file://input.json. This is referenced in the AWS Kinesis CLI command reference (first link, under 1.) and also mentioned in the GitHub link above. It's also the method used by the majority of the AWS Kinesis documentation. For example, JSON files are used for different purposes in Kinesis Analytics here:
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works-input.html
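A minimal sketch of that file-based approach (create-app.json is a hypothetical filename; --generate-cli-skeleton is the standard AWS CLI option for producing a parameter template):
aws kinesisanalytics create-application --generate-cli-skeleton > create-app.json
# fill in application-name, application-code, and inputs in create-app.json, then:
aws kinesisanalytics create-application --cli-input-json file://create-app.json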
Please let me know what works, and I will work with my AWS rep to improve the documentation.

Is there a way to set multiple --conf as job parameters in AWS Glue?

I'm trying to configure Spark in my Glue jobs. When I input the settings one by one under 'Edit job' > 'Job parameters' as key and value pairs (e.g. key: --conf, value: spark.executor.memory=10g) it works, but when I put them all together (delimited by a space or comma) it results in an error. I also tried using sc._conf.setAll, but Glue ignores the config and insists on using its defaults. Is there a way to do this with Spark 2.4?
Yes, you can pass multiple parameters as below:
Key: --conf
Value: spark.yarn.executor.memoryOverhead=7g --conf spark.yarn.executor.memory=7g
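This works presumably because Glue appends each job parameter to the spark-submit command line as key followed by value, so the single parameter above expands into two --conf options. A sketch of the effective invocation (the remaining Glue arguments are elided):
spark-submit --conf spark.yarn.executor.memoryOverhead=7g --conf spark.yarn.executor.memory=7g ...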

AWS Data Pipeline: escaping commas in the EmrActivity step section

I am creating an AWS Data Pipeline using the Architect provided in the AWS web console.
Everything is set up OK: my EmrCluster is configured and starts successfully.
But when I try to submit an EmrActivity I run into the following problem:
In the step section of the EmrActivity I need to provide a --packages argument with three packages.
As far as I understand, the step in an EmrActivity is a comma-separated value, and commas (,) are replaced with spaces in the resulting step arguments.
On the other hand, the --packages argument is itself comma-separated when there are multiple packages.
So when I try to pass it as an argument, the commas get replaced with spaces, which makes the step invalid.
This is the statement I need to appear as-is in the resulting EMR step:
--packages com.amazonaws:aws-java-sdk-s3:1.11.228,org.apache.hadoop:hadoop-aws:2.6.0,org.postgresql:postgresql:42.1.4
Any solution to escape the comma?
So far I have tried the \\\\ escape mentioned in http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emractivity.html, but it did not work.
When you use \\\\, it escapes the slashes themselves and the comma still gets replaced.
You can try using three slashes instead, like \\\, ; the same has worked for me.
I hope that works.
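For illustration, the step field would then carry the escaped commas like this (a sketch; the command-runner.jar/spark-submit wrapper and jar path are hypothetical, only the --packages value is from the question):
command-runner.jar,spark-submit,--packages,com.amazonaws:aws-java-sdk-s3:1.11.228\\\,org.apache.hadoop:hadoop-aws:2.6.0\\\,org.postgresql:postgresql:42.1.4,s3://myBucket/myJob.jar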