I'm using the AWS Parameter Store to hold parameters (environment variables) for my Lambda functions, 4 parameters in total. But I am seeing performance issues when loading them: it takes between 0.2 and 0.6 seconds to load a single parameter, which is a lot of time for my web app.
I measured the time by running this command:
time aws ssm get-parameter --name "sample_parameter"
I would expect loading a parameter value to take less time, since I need to get 4 parameters. So here is my question: is it good practice to store parameters as JSON text, so that I could put all 4 parameters in a single JSON object?
Is there anything I can do to improve performance when calling the get-parameter function?
Thanks
You can get all the parameters at once using get-parameters. In my tests, fetching all 4 parameters in a single call takes on average about the same time as fetching 1.
$ time aws ssm get-parameter --name w1
{
"Parameter": {
"Name": "w1",
"Type": "String",
"Value": "say anything",
"Version": 1,
"LastModifiedDate": 1566914540.044,
"ARN": "arn:aws:ssm:us-east-1:1234567890123:parameter/w1"
}
}
real 0m0.811s
user 0m0.509s
sys 0m0.095s
$ time aws ssm get-parameters --names w1 w2 w3 w4
{
"Parameters": [
{
"Name": "w1",
"Type": "String",
"Value": "say anything",
"Version": 1,
"LastModifiedDate": 1566914540.044,
"ARN": "arn:aws:ssm:us-east-1:1234567890123:parameter/w1"
},
{
"Name": "w2",
"Type": "String",
"Value": "say nothing",
"Version": 1,
"LastModifiedDate": 1566914550.377,
"ARN": "arn:aws:ssm:us-east-1:1234567890123:parameter/w2"
},
{
"Name": "w3",
"Type": "String",
"Value": "say what",
"Version": 1,
"LastModifiedDate": 1566914561.301,
"ARN": "arn:aws:ssm:us-east-1:1234567890123:parameter/w3"
},
{
"Name": "w4",
"Type": "String",
"Value": "say hello",
"Version": 1,
"LastModifiedDate": 1566914574.716,
"ARN": "arn:aws:ssm:us-east-1:1234567890123:parameter/w4"
}
],
"InvalidParameters": []
}
real 0m0.887s
user 0m0.561s
sys 0m0.097s
What is the difference between AWS SSM GetParameter and GetParameters?
I have a machine whose IAM policy allows GetParameters, and I am trying to read a variable with Terraform using the following code:
data "aws_ssm_parameter" "variable" { name = "variable"}
I get an error indicating I'm not authorized to perform GetParameter.
As the names suggest:
GetParameter returns details about a single parameter per API call.
GetParameters returns details about multiple parameters in one API call.
The details returned for each parameter are exactly the same for both calls, since both return Parameter objects:
"Parameter": {
"ARN": "string",
"DataType": "string",
"LastModifiedDate": number,
"Name": "string",
"Selector": "string",
"SourceResult": "string",
"Type": "string",
"Value": "string",
"Version": number
}
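The key benefit of GetParameters is that you can fetch many parameters in a single API call, which saves time. The two are also separate IAM actions (ssm:GetParameter and ssm:GetParameters), so a policy that allows only GetParameters does not authorize the GetParameter call that the error message refers to.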
Example use of GetParameter:
aws ssm get-parameter --name /db/password
{
"Parameter": {
"Name": "/db/password",
"Type": "String",
"Value": "secret password",
"Version": 1,
"LastModifiedDate": 1589285865.183,
"ARN": "arn:aws:ssm:us-east-1:xxxxxxxxx:parameter/db/password",
"DataType": "text"
}
}
Example use of GetParameters with two parameters:
aws ssm get-parameters --names /db/password /db/url
{
"Parameters": [
{
"Name": "/db/password",
"Type": "String",
"Value": "secret password",
"Version": 1,
"LastModifiedDate": 1589285865.183,
"ARN": "arn:aws:ssm:us-east-1:xxxxxxxxx:parameter/db/password",
"DataType": "text"
},
{
"Name": "/db/url",
"Type": "String",
"Value": "url to db",
"Version": 1,
"LastModifiedDate": 1589285879.912,
"ARN": "arn:aws:ssm:us-east-1:xxxxxxxxx:parameter/db/url",
"DataType": "text"
}
],
"InvalidParameters": []
}
Example use of GetParameters with non-existing second parameter (/db/wrong)
aws ssm get-parameters --names /db/password /db/wrong
{
"Parameters": [
{
"Name": "/db/password",
"Type": "String",
"Value": "secret password",
"Version": 1,
"LastModifiedDate": 1589285865.183,
"ARN": "arn:aws:ssm:us-east-1:xxxxxxxxx:parameter/db/password",
"DataType": "text"
}
],
"InvalidParameters": [
"/db/wrong"
]
}
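As the last example shows, GetParameters does not fail when a name is unknown; it reports the name under InvalidParameters instead, so callers have to check that list themselves. A minimal sketch of that check, assuming Python and a response dictionary shaped like the output above (the function name `split_get_parameters_response` is hypothetical):

```python
from typing import Dict, List, Tuple


def split_get_parameters_response(response: dict) -> Tuple[Dict[str, str], List[str]]:
    """Separate the values that were found from the names that were not."""
    found = {p["Name"]: p["Value"] for p in response.get("Parameters", [])}
    invalid = list(response.get("InvalidParameters", []))
    return found, invalid


# Response shaped like the /db/wrong example above.
sample = {
    "Parameters": [
        {"Name": "/db/password", "Type": "String", "Value": "secret password", "Version": 1}
    ],
    "InvalidParameters": ["/db/wrong"],
}

found, invalid = split_get_parameters_response(sample)
print(found)    # {'/db/password': 'secret password'}
print(invalid)  # ['/db/wrong']
```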
I'm using AWS SSM to run a long script on an EC2 instance.
I would like to configure the execution timeout (execution time, not launch time), but I can't find how to do this in the official documentation (the information is contradictory or does not work).
I'm using only the CLI interface.
This value is a document parameter that can be passed with the --parameters option using the executionTimeout key. You can use aws ssm describe-document to find this and other document-specific parameters.
aws ssm describe-document --name "AWS-RunShellScript"
{
"Document": {
"Hash": "99749de5e62f71e5ebe9a55c2321e2c394796afe7208cff048696541e6f6771e",
"HashType": "Sha256",
"Name": "AWS-RunShellScript",
"Owner": "Amazon",
"CreatedDate": "2017-08-21T22:25:02.029000+02:00",
"Status": "Active",
"DocumentVersion": "1",
"Description": "Run a shell script or specify the commands to run.",
"Parameters": [
{
"Name": "commands",
"Type": "StringList",
"Description": "(Required) Specify a shell script or a command to run."
},
{
"Name": "workingDirectory",
"Type": "String",
"Description": "(Optional) The path to the working directory on your instance.",
"DefaultValue": ""
},
{
"Name": "executionTimeout",
"Type": "String",
"Description": "(Optional) The time in seconds for a command to complete before it is considered to have failed. Default is 3600 (1 hour). Maximum is 172800 (48 hours).",
"DefaultValue": "3600"
}
],
"PlatformTypes": [
"Linux",
"MacOS"
],
"DocumentType": "Command",
"SchemaVersion": "1.2",
"LatestVersion": "1",
"DefaultVersion": "1",
"DocumentFormat": "JSON",
"Tags": []
}
}
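The document parameters listed above are passed as a map from parameter name to a list of strings, which is why executionTimeout is a string even though it holds a number. A minimal sketch of building such a request, assuming Python and boto3's `send_command` call; the helper name `build_send_command_args` and the instance ID in the usage note are hypothetical, and the upper bound is taken from the document description above.

```python
from typing import Dict, List

# Upper bound in seconds, as stated in the executionTimeout description.
MAX_EXECUTION_TIMEOUT = 172800


def build_send_command_args(
    instance_id: str, commands: List[str], timeout_seconds: int
) -> Dict[str, object]:
    """Build keyword arguments for an SSM SendCommand request."""
    if not 0 < timeout_seconds <= MAX_EXECUTION_TIMEOUT:
        raise ValueError(
            f"timeout must be between 1 and {MAX_EXECUTION_TIMEOUT} seconds"
        )
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        # Every document parameter value is a list of strings.
        "Parameters": {
            "commands": list(commands),
            "executionTimeout": [str(timeout_seconds)],
        },
    }
```

With boto3 this could be sent as `boto3.client("ssm").send_command(**build_send_command_args("i-0123456789abcdef0", ["./long_script.sh"], 7200))`. On the CLI, the same map goes into `--parameters` as JSON, for example `'{"commands":["./long_script.sh"],"executionTimeout":["7200"]}'`.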
I'm trying to use Data Pipeline to run a Spark Application. How can I access the input / output I specify (S3DataNode) for the EmrActivity, inside my Spark application?
My question is similar to this - https://forums.aws.amazon.com/message.jspa?messageID=507877
Earlier I used to pass the input and output as arguments to the Spark application in steps.
Thanks
I ran across the same question. There's very limited documentation around this. This is my understanding:
You specify the input and output for the EmrActivity. This will create the dependencies between the data nodes and the activity.
In the EmrActivity, you can reference the input sources like this: #{input.directoryPath},#{output.directoryPath}
Example:
...
{
"name": "Input Data Node",
"id": "inputDataNode",
"type": "S3DataNode",
"directoryPath": "s3://my/raw/data/path"
},
{
"name": "transform",
"id": "transform",
"type": "EmrActivity",
"step": [
"s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://my/transform/script.sh,#{input.directoryPath},#{output.directoryPath}"
],
"runsOn": {
"ref": "emrcluster"
},
"input": {
"ref": "inputDataNode"
},
"output": {
"ref": "outputDataNode"
}
},
{
"name": "Output Data Node",
"id": "outputDataNode",
"type": "S3DataNode",
"directoryPath": "s3://path/to/output/"
},
...
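With the step defined as above, the two resolved S3 paths reach the script as ordinary positional arguments, so the application reads them the same way as when they were passed by hand. A minimal sketch of the receiving side, assuming the application entry point is written in Python (the question does not say which language the Spark application uses); `parse_paths` is a hypothetical helper.

```python
import argparse
from typing import List, Optional


def parse_paths(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the input and output paths that the EmrActivity step appends."""
    parser = argparse.ArgumentParser(description="Spark job launched by Data Pipeline")
    parser.add_argument("input_path", help="value of #{input.directoryPath}")
    parser.add_argument("output_path", help="value of #{output.directoryPath}")
    return parser.parse_args(argv)


# Simulate the arguments produced by the pipeline definition above.
args = parse_paths(["s3://my/raw/data/path", "s3://path/to/output/"])
print(args.input_path, args.output_path)
```

In the real job, `parse_paths()` would be called without arguments so that it reads `sys.argv`, and the two paths would then be handed to the Spark read and write calls.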
I'm trying to run a simple AWS Data Pipeline for my POC. The case is the following: get data from a CSV stored on S3, run a simple Hive query on it, and put the results back on S3.
I've created a very basic pipeline definition and tried to run it on two different EMR versions, 4.2.0 and 5.3.1. Both fail, though in different places.
The pipeline definition is the following:
{
"objects": [
{
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"maximumRetries": "1",
"enableDebugging": "true",
"name": "EmrCluster",
"keyPair": "Jeff Key Pair",
"id": "EmrClusterId_CM5Td",
"releaseLabel": "emr-5.3.1",
"region": "us-west-2",
"type": "EmrCluster",
"terminateAfter": "1 Day"
},
{
"column": [
"policyID INT",
"statecode STRING"
],
"name": "SampleCSVOutputFormat",
"id": "DataFormatId_9sLJ0",
"type": "CSV"
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://aws-logs/datapipeline/",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"directoryPath": "s3://data-pipeline-input/",
"dataFormat": {
"ref": "DataFormatId_KIMjx"
},
"name": "InputDataNode",
"id": "DataNodeId_RyNzr",
"type": "S3DataNode"
},
{
"s3EncryptionType": "NONE",
"directoryPath": "s3://data-pipeline-output/",
"dataFormat": {
"ref": "DataFormatId_9sLJ0"
},
"name": "OutputDataNode",
"id": "DataNodeId_lnwhV",
"type": "S3DataNode"
},
{
"output": {
"ref": "DataNodeId_lnwhV"
},
"input": {
"ref": "DataNodeId_RyNzr"
},
"stage": "true",
"maximumRetries": "2",
"name": "HiveTest",
"hiveScript": "INSERT OVERWRITE TABLE ${output1} select policyID, statecode from ${input1};",
"runsOn": {
"ref": "EmrClusterId_CM5Td"
},
"id": "HiveActivityId_JFqr5",
"type": "HiveActivity"
},
{
"name": "SampleCSVDataFormat",
"column": [
"policyID INT",
"statecode STRING",
"county STRING",
"eq_site_limit FLOAT",
"hu_site_limit FLOAT",
"fl_site_limit FLOAT",
"fr_site_limit FLOAT",
"tiv_2011 FLOAT",
"tiv_2012 FLOAT",
"eq_site_deductible FLOAT",
"hu_site_deductible FLOAT",
"fl_site_deductible FLOAT",
"fr_site_deductible FLOAT",
"point_latitude FLOAT",
"point_longitude FLOAT",
"line STRING",
"construction STRING",
"point_granularity INT"
],
"id": "DataFormatId_KIMjx",
"type": "CSV"
}
],
"parameters": []
}
And the CSV file looks like this:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
The HiveActivity is just a simple query (copied from the AWS docs):
"INSERT OVERWRITE TABLE ${output1} select policyID, statecode from ${input1};"
However it fails when running on emr-5.3.1:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
/mnt/taskRunner/./hive-script:617:in `<main>': Error executing cmd: /usr/share/aws/emr/scripts/hive-script "--base-path" "s3://us-west-2.elasticmapreduce/libs/hive/" "--hive-versions" "latest" "--run-hive-script" "--args" "-f"
Digging deeper into the logs, I found the following exception:
2017-02-25T00:33:00,434 ERROR [316e5d21-dfd8-4663-a03c-2ea4bae7b1a0 main([])]: tez.DagUtils (:()) - Could not find the jar that was being uploaded
2017-02-25T00:33:00,434 ERROR [316e5d21-dfd8-4663-a03c-2ea4bae7b1a0 main([])]: exec.Task (:()) - Failed to execute tez graph.
java.io.IOException: Previous writer likely failed to write hdfs://ip-170-41-32-05.us-west-2.compute.internal:8020/tmp/hive/hadoop/_tez_session_dir/31ae6d21-dfd8-4123-a03c-2ea4bae7b1a0/emr-hive-goodies.jar. Failing because I am unlikely to write too.
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:1022)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:902)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:845)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.refreshLocalResourcesFromConf(TezSessionState.java:466)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:294)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:155)
When running on emr-4.2.0 I get a different crash:
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:105)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.hadoop.hive.ql.exec.Utilities.toTempPath(Utilities.java:1517)
at org.apache.hadoop.hive.ql.exec.Utilities.createTmpDirs(Utilities.java:3555)
at org.apache.hadoop.hive.ql.exec.Utilities.createTmpDirs(Utilities.java:3520)
Both S3 and the EMR cluster are in the same region and run under the same AWS account. I've tried a bunch of experiments with the S3DataNode and EmrCluster configurations, but it always crashes.
Also, I couldn't find any working example of a data pipeline with HiveActivity, either in the documentation or on GitHub.
Can someone please help me figure it out? Thank you.
I was facing the same problem when updating my EMR cluster from a 4.*.* release to the 5.28.0 release. After changing the release label, I followed @andrii-gorishnii's comment and added
delete jar /mnt/taskRunner/emr-hive-goodies.jar;
to the beginning of my Hive script, and it solved my problem! Thanks @andrii-gorishnii
I have been using the UNLOAD statement in Redshift for a while now; it makes it easy to dump a file to S3 and then let people analyse it.
The time has come to try to automate it. We have Amazon Data Pipeline running for several tasks, and I wanted to run an SQLActivity to execute UNLOAD automatically. I use a SQL script hosted in S3.
The query itself is correct, but what I have been trying to figure out is how I can dynamically assign the name of the file. For example:
UNLOAD('<the_query>')
TO 's3://my-bucket/' || to_char(current_date)
WITH CREDENTIALS '<credentials>'
ALLOWOVERWRITE
PARALLEL OFF
doesn't work, and of course I suspect that you can't execute functions (to_char) in the "TO" line. Is there any other way I can do it?
And if UNLOAD is not the way, do I have any other options for automating such tasks with the currently available infrastructure (Redshift + S3 + Data Pipeline; our Amazon EMR is not active yet)?
The only thing that I thought could work (but I'm not sure) is, instead of using a script file, to copy the script into the Script option in SQLActivity (at the moment it points to a file) and reference #{@scheduledStartTime}.
Why not use RedshiftCopyActivity to copy from Redshift to S3? The input is a RedshiftDataNode and the output is an S3DataNode, where you can specify an expression for directoryPath.
You can also specify the transformSql property in RedshiftCopyActivity to override the default value of: select * from + inputRedshiftTable.
Sample pipeline:
{
"objects": [{
"id": "CSVId1",
"name": "DefaultCSV1",
"type": "CSV"
}, {
"id": "RedshiftDatabaseId1",
"databaseName": "dbname",
"username": "user",
"name": "DefaultRedshiftDatabase1",
"*password": "password",
"type": "RedshiftDatabase",
"clusterId": "redshiftclusterId"
}, {
"id": "Default",
"scheduleType": "timeseries",
"failureAndRerunMode": "CASCADE",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
}, {
"id": "RedshiftDataNodeId1",
"schedule": {
"ref": "ScheduleId1"
},
"tableName": "orders",
"name": "DefaultRedshiftDataNode1",
"type": "RedshiftDataNode",
"database": {
"ref": "RedshiftDatabaseId1"
}
}, {
"id": "Ec2ResourceId1",
"schedule": {
"ref": "ScheduleId1"
},
"securityGroups": "MySecurityGroup",
"name": "DefaultEc2Resource1",
"role": "DataPipelineDefaultRole",
"logUri": "s3://myLogs",
"resourceRole": "DataPipelineDefaultResourceRole",
"type": "Ec2Resource"
}, {
"myComment": "This object is used to control the task schedule.",
"id": "ScheduleId1",
"name": "RunOnce",
"occurrences": "1",
"period": "1 Day",
"type": "Schedule",
"startAt": "FIRST_ACTIVATION_DATE_TIME"
}, {
"id": "S3DataNodeId1",
"schedule": {
"ref": "ScheduleId1"
},
"directoryPath": "s3://my-bucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "DefaultS3DataNode1",
"dataFormat": {
"ref": "CSVId1"
},
"type": "S3DataNode"
}, {
"id": "RedshiftCopyActivityId1",
"output": {
"ref": "S3DataNodeId1"
},
"input": {
"ref": "RedshiftDataNodeId1"
},
"schedule": {
"ref": "ScheduleId1"
},
"name": "DefaultRedshiftCopyActivity1",
"runsOn": {
"ref": "Ec2ResourceId1"
},
"type": "RedshiftCopyActivity"
}]
}
Are you able to SSH into the cluster? If so, I would suggest writing a shell script in which you create variables and so on, then pass those variables into a connection's statement or query.
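The idea of computing the file name outside SQL and substituting it into the statement can be sketched as follows. This is only an illustration, written in Python rather than shell; the function name `build_unload_statement` is hypothetical, and the bucket and credentials values are the placeholders from the question.

```python
from datetime import date


def build_unload_statement(
    query: str, bucket_prefix: str, credentials: str, run_date: date
) -> str:
    """Return an UNLOAD statement whose S3 target ends with the run date."""
    # Single quotes inside the quoted query text must be doubled.
    escaped_query = query.replace("'", "''")
    target = f"{bucket_prefix.rstrip('/')}/{run_date.isoformat()}"
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{target}'\n"
        f"WITH CREDENTIALS '{credentials}'\n"
        "ALLOWOVERWRITE\n"
        "PARALLEL OFF;"
    )


statement = build_unload_statement(
    "select * from orders where state = 'FL'",
    "s3://my-bucket/",
    "<credentials>",
    date(2020, 1, 31),
)
print(statement)
```

The generated text would then be sent to Redshift by whatever client the job already uses. Because the date is formatted before the statement is built, no function call is needed in the TO clause.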
Another option is a Redshift stored procedure that wraps the UNLOAD statement and derives the S3 path name dynamically.
The procedure builds the UNLOAD statement as a dynamic query and executes it; your job then only needs to call the procedure.
This way you can avoid the other services, but it depends on the kind of use case you are working on.