How much time does AWS Step Functions keep the execution running?

I am new to AWS Step Functions. I have created a basic step function with an activity worker in the back end. For how long does Step Functions keep the execution alive and not time out if the execution is still not picked up by the activity worker?

For how long does Step Functions keep the execution alive and not time out if the execution is still not picked up by the activity worker?
1 year.
You can specify TimeoutSeconds on the activity task, which is also the recommended approach:
"ActivityState": {
"Type": "Task",
"Resource": "arn:aws:states:us-east-1:123456789012:activity:HelloWorld",
"TimeoutSeconds": 300,
"HeartbeatSeconds": 60,
"Next": "NextState"
}
Step Functions can keep the task in the queue for a maximum of 1 year. You can find more info in the Step Functions limits documentation.
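For context, here is a minimal sketch of the worker side with boto3, assuming the activity ARN from the state above and a hypothetical worker name; the scheduled task simply sits unclaimed, up to the 1-year limit or your TimeoutSeconds, until some worker polls it:

import boto3

sfn = boto3.client("stepfunctions")

# Long-poll for work; returns within ~60 seconds with an empty
# taskToken if no task is currently scheduled for this activity.
task = sfn.get_activity_task(
    activityArn="arn:aws:states:us-east-1:123456789012:activity:HelloWorld",
    workerName="worker-1",  # hypothetical worker name
)

if task.get("taskToken"):
    # Send heartbeats more often than HeartbeatSeconds (60 above),
    # otherwise the state fails with States.Timeout.
    sfn.send_task_heartbeat(taskToken=task["taskToken"])
    # ... do the actual work here ...
    sfn.send_task_success(taskToken=task["taskToken"], output='{"result": "done"}')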

Related

Parallel AWS Glue job runs fail with "Rate exceeded" ThrottlingException (Status Code: 400)

I have a simple (just print hello) Glue 2.0 job that runs in parallel, triggered from a Step Functions Map state. The Glue job's maximum concurrency is set to 40, and so is the Step Functions Map's MaxConcurrency.
It runs fine if I kick off fewer than 20 parallel Glue jobs, but beyond that (I tried at most 35 in parallel) I get intermittent errors like this:
Rate exceeded (Service: AWSGlue; Status Code: 400; Error Code: ThrottlingException; Request ID: 0a350b23-2f75-4951-a643-20429799e8b5; Proxy: null)
I've checked the service quotas documentation (https://docs.aws.amazon.com/general/latest/gr/glue.html) and my account settings. The maximum of 200 concurrent job runs should have handled my 35 parallel jobs happily. There are no other Glue jobs scheduled to run at the same time in my AWS account.
Should I just blindly request a quota increase and hope that fixes it, or is there anything I can do to work around this?
Thanks to luk2302 and Robert for the suggestions. Based on their advice, I reached a solution: add a Retry to the Glue task. (I tried IntervalSeconds: 1 and BackoffRate: 1, but that was too low and didn't work.)
"Resource": "arn:aws:states:::glue:startJobRun",
"Type": "Task",
"Retry": [
{
"ErrorEquals": [
"Glue.AWSGlueException"
],
"BackoffRate": 2,
"IntervalSeconds": 2,
"MaxAttempts": 3
}
]
Hope this helps someone.
The quota that you are hitting is not Glue's concurrent job run quota, but the StartJobRun API rate quota: you are simply requesting too many job runs per second. If possible, just wait between the StartJobRun calls.
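If you are starting the runs from code rather than a Map state, here is a minimal sketch of spacing the calls out with boto3 (the job name and the one-second delay are assumptions):

import time
import boto3

glue = boto3.client("glue")

for i in range(35):
    # Same job started repeatedly, as in the question; name is hypothetical.
    glue.start_job_run(JobName="hello-job")
    # Pause to stay under the StartJobRun requests-per-second quota.
    time.sleep(1)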

DS SDK - AWS Step Functions Lambda job cancelled immediately

I am getting this weird result when I try to deploy the Lambda step. If I define my Lambda step like this:
lambda_step = steps.compute.LambdaStep(
    "Query Training Results",
    parameters={
        "FunctionName": execution_input["LambdaFunctionName"],
        "Payload": {"TrainingJobName.$": "$.TrainingJobName"},
    },
)
For some reason the console just greys out the box, meaning the job is cancelled immediately. If I simply remove the "Payload" part then it runs, but the Lambda step still fails because it does not know the training job name that I am trying to pass in the Payload.
I followed this example to a T here. Any suggestions would be greatly appreciated.

SNS mail notification when a step is not kicked off within a threshold timeframe

I have an EMR step which is submitted through Step Functions. During the step run I can see the task is submitted, but the EMR step is not executed and the EMR console doesn't have any information about it.
How can I debug this?
How can I send an SNS notification when a step doesn't start execution within a threshold timeframe? In my case the state machine shows the EMR task as submitted, but there is no information on the EMR console, and the pipeline keeps running without failing for more than half an hour.
You could start the debugging process through the Step Functions execution log to identify the specific step that has failed, and then move on to the EMR console or the specific service that failed. Usually, when the EMR step doesn't appear in the EMR console, it is due to a runtime error, caused by an exception raised when calling the EMR step.
For this scenario, you can use the error handling that Step Functions provides through the Catch and Timeout fields; you can find more details in the AWS documentation here.
Basically, you need to add these fields as shown below:
{
    "StartAt": "EmrStep",
    "States": {
        "EmrStep": {
            "Type": "Task",
            "Resource": "arn:aws:emr:execute-X-step",
            "Comment": "This is your EMR step",
            "TimeoutSeconds": 10,
            "Catch": [
                {
                    "ErrorEquals": ["States.Timeout"],
                    "Next": "ShutdownClusterAndSendSNS"
                }
            ],
            "End": true
        },
        "ShutdownClusterAndSendSNS": {
            "Type": "Pass",
            "Comment": "This step handles the timeout exception raised",
            "Result": "You can shut down the EMR cluster here to avoid increased cost, and then send an SNS notification!",
            "End": true
        }
    }
}
Note: To catch the timeout exception, you have to catch the States.Timeout error, but you can also define the same Catch field for other types of error.
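The Pass state above is only a placeholder. As a hedged sketch, the handler it alludes to could look like this with boto3, assuming the cluster ID arrives in the state input and the SNS topic ARN is hypothetical:

import boto3

emr = boto3.client("emr")
sns = boto3.client("sns")

def handler(event, context):
    # Assumption: the cluster id is passed through the state input.
    cluster_id = event["ClusterId"]
    # Shut down the cluster to avoid increased cost.
    emr.terminate_job_flows(JobFlowIds=[cluster_id])
    # Notify via SNS (topic ARN is hypothetical).
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:emr-alerts",
        Subject="EMR step timed out",
        Message=f"EMR step timed out; terminated cluster {cluster_id}.",
    )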

How to send a notification on every task execution in a state machine on AWS Step Functions?

I am working with Amazon Step Functions to build a workflow for multiple Batch jobs. The requirement is that the Batch jobs should be executed sequentially, and whenever a job transitions from one to another, a notification with the execution status of the task should be sent to an SNS topic. I need to send a notification for both SUCCESS and FAILURE of a task.
I have tried execution events using CloudWatch Event rules, but execution events only give information about the state machine's execution, not about the tasks' execution.
As you have found, task states don't show up in CloudWatch Events, so you need to add the notification as a separate step; there is no way around this. Add a notify step which either executes a Lambda or publishes to SNS.
There is also another way to do this, since you can compose step functions of step functions. You have a parent step function and a child step function: the child step function wraps the batch job itself, and then you can make use of CloudWatch Events rules on the batch-job-step-function state machine:
"BatchJob" : {
"Comment": "This snippet is in the parent step function. It will kick off another step function, called: batch-job-step-function",
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync",
"Parameters": {
"StateMachineArn": "arn:aws:states:us-east-1:TODO:stateMachine:batch-job-step-function",
"Input": {
"batchJobInput.$": "$$.Execution.Input.batchJobInput"
}
},
"End": true | "Next" : "TODO"
}
Now you can put CloudWatch Events rules against: arn:aws:states:us-east-1:TODO:stateMachine:batch-job-step-function
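As a sketch of that, the rule could be created with boto3 as below; the rule name and the SNS target ARN are assumptions, while "Step Functions Execution Status Change" is the detail type Step Functions emits for execution status changes:

import json
import boto3

events = boto3.client("events")

# Match SUCCEEDED/FAILED executions of the child state machine.
events.put_rule(
    Name="batch-job-sfn-status",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "status": ["SUCCEEDED", "FAILED"],
            "stateMachineArn": [
                "arn:aws:states:us-east-1:TODO:stateMachine:batch-job-step-function"
            ],
        },
    }),
)

# Route matching events to an SNS topic (hypothetical ARN).
events.put_targets(
    Rule="batch-job-sfn-status",
    Targets=[{"Id": "sns-target", "Arn": "arn:aws:sns:us-east-1:TODO:job-status"}],
)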

EMR Job Long Running Notifications

Consider that we have around 30 EMR jobs that run between 5:30 AM and 10:30 AM PST.
We receive flat files in an S3 bucket, and through Lambda functions the received files are copied to other target paths.
We have DynamoDB tables for data processing once data is received in the target path.
Now the problem area: since we have multiple dependencies and parallel execution, jobs sometimes fail due to memory issues and sometimes take longer to complete.
Sometimes a job will run for 4 or 5 hours and finally get terminated by a memory issue or some other problem, like a subnet not being available or an EC2 issue. We don't want to wait that long.
E.g.: Job_A processes the 1st to 4th files and Job_B processes the 5th to 10th files, and so on.
Here Job_B has a dependency on Job_A for the 3rd file, so Job_B will wait until Job_A completes. We have dependencies like this throughout our process.
I would like to get a notification from the EMR jobs like the following:
E.g.: the average running time for Job_A is 1 hour, but it has been running for more than 1 hour; in this case I need to be notified by email or some other way.
How can I achieve this? Please help or advise.
Regards,
Karthik
Repeatedly list the cluster's steps using a Lambda function and the AWS SDK (e.g. boto3) and check the start date. When it is more than 1 hour in the past, you can trigger a notification, for example via Amazon SES. See the documentation.
For example, you can call list_steps for the running steps only:
response = client.list_steps(
    ClusterId='string',
    StepStates=['RUNNING']
)
It will then give you a response like the one below.
{
    'Steps': [
        {
            ...
            'Status': {
                ...
                'Timeline': {
                    'CreationDateTime': datetime(2015, 1, 1),
                    'StartDateTime': datetime(2015, 1, 1),
                    'EndDateTime': datetime(2015, 1, 1)
                }
            }
        },
    ],
    ...
}
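Putting the pieces together, here is a minimal sketch of such a check (the cluster ID, email addresses, and the 1-hour threshold are assumptions):

from datetime import datetime, timedelta, timezone
import boto3

emr = boto3.client("emr")
ses = boto3.client("ses")

def handler(event, context):
    # Anything that started before this cutoff has run for over an hour.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    response = emr.list_steps(
        ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster id
        StepStates=["RUNNING"],
    )
    for step in response["Steps"]:
        started = step["Status"]["Timeline"].get("StartDateTime")
        if started and started < cutoff:
            # Notify via SES (addresses are hypothetical).
            ses.send_email(
                Source="alerts@example.com",
                Destination={"ToAddresses": ["oncall@example.com"]},
                Message={
                    "Subject": {"Data": f"EMR step {step['Name']} running > 1 hour"},
                    "Body": {"Text": {"Data": f"Step {step['Id']} started at {started}."}},
                },
            )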