I am using boto3 in my activity workers and I came upon a TaskTimedOut error when calling SendTaskFailure:
botocore.errorfactory.TaskTimedOut: An error occurred (TaskTimedOut) when calling the SendTaskFailure operation: Task Timed Out: 'arn:aws:states:eu-west-2:statemachinearn:activityname'
I think this happens because the connection pool gets full sometimes, which means the request is not fulfilled (even though a new connection is created).
I know it is possible to set a timeout value for Tasks and Parallel states, but that has nothing to do with calling the send_task_failure/send_task_success methods.
Does anyone have any idea on how to solve this?
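One thing that may be worth trying, assuming the connection-pool theory is right, is giving one shared client a bigger pool plus standard retries via botocore's Config (the pool size and retry settings below are just illustrative). Note that if the task itself has already timed out on the Step Functions side, no amount of client-side retrying will make the token valid again.
import boto3
from botocore.config import Config

# Illustrative settings: a bigger connection pool for many concurrent workers
# (the boto3 default is 10), plus standard retry mode for transient connection errors.
config = Config(
    max_pool_connections=50,
    retries={"max_attempts": 5, "mode": "standard"},
)

# Reuse this one client across your workers' send_task_success/send_task_failure calls.
sfn = boto3.client("stepfunctions", config=config)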
I can explain one scenario I encountered with Step Functions.
I was using nested state machines, and in my main state machine I would get an exception like 'timeout when invoking state machine' about one run in a thousand. As it was a transient error, I had to handle it as per the guidance provided by AWS, so I explicitly added a retry for framework exceptions like below.
"Retry": [
{
"ErrorEquals": [
"StepFunctions.SdkClientException"
],
"IntervalSeconds": 10,
"MaxAttempts": 4,
"BackoffRate": 2
}
]
Related
We have step function code, and when I trigger the API Gateway endpoint for the first time, I get a 3-second timeout error. As a result, I used "Lambda.Timeout" in the retry logic; now it no longer gives an error, but when I trigger it for the first time I get a blank screen, and when I run it again I get the response. Please suggest what I should do.
"Timeoutseconds": 3,
"Retry": [
{
"ErrorEquals":[
"Lambda.Timeout"
],
"Intervalseconds": 3,
"MaxAttempts":2,
"BackofRate":2
}
],
Goal
I wanted to make a proof of concept of the callback pattern. This is where you have a step function that puts a message and token in an sqs queue, the queue is wired up to some arbitrary work, and when that work is done you give the step function back the token so it knows to continue.
Problem
I started testing all this by starting an execution of the step function manually, and after a few failures I hit on what should have worked. send_task_success was called, but all I ever got back was this: An error occurred (TaskTimedOut) when calling the SendTaskSuccess operation: Task Timed Out: 'Provided task does not exist anymore'.
My architecture (you can skip this part)
I did this all in terraform.
Permissions
I'm going to skip all the IAM permission details for brevity but the idea is:
The queue has the following, with the resource being my lambda:
lambda:CreateEventSourceMapping
lambda:ListEventSourceMappings
lambda:ListFunctions
The step function has the following, with the resource being my queue:
sqs:SendMessage
The lambda has
AWSLambdaBasicExecutionRole
AWSLambdaSQSQueueExecutionRole
states:SendTaskSuccess with step function resource
Terraform
resource "aws_sqs_queue" "queue" {
name_prefix = "${local.project_name}-"
fifo_queue = true
# This one is required for fifo queues for some reason
content_based_deduplication = true
policy = templatefile(
"policy/queue.json",
{lambda_arn = aws_lambda_function.run_job.arn}
)
}
resource "aws_sfn_state_machine" "step" {
name = local.project_name
role_arn = aws_iam_role.step.arn
type = "STANDARD"
definition = templatefile(
"states.json", {
sqs_url = aws_sqs_queue.queue.url
}
)
}
resource "aws_lambda_function" "run_job" {
function_name = local.project_name
description = "Runs a job"
role = aws_iam_role.lambda.arn
architectures = ["arm64"]
runtime = "python3.9"
filename = var.zip_path
handler = "main.main"
}
resource "aws_lambda_event_source_mapping" "trigger_lambda" {
event_source_arn = aws_sqs_queue.queue.arn
enabled = true
function_name = aws_lambda_function.run_job.arn
batch_size = 1
}
Notes:
For my use case I definitely want a FIFO queue. However, there are two funny things you have to do to make a FIFO work (that also make me question what the heck the implementation is doing).
Deduplication. This can either be content-based deduplication for the whole queue, or you can use a deduplication id on a per-message basis.
MessageGroupId. This is set on a per-message basis. (There is a small boto3 sketch of both per-message fields just after these notes.)
I don't have to worry about the deduplication because every item I put in this queue comes with a unique guid.
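For reference, this is roughly what those two per-message FIFO fields look like if you send a message yourself with boto3 (the queue URL and payload here are made up):
import json
import uuid

import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.eu-west-2.amazonaws.com/123456789012/example-queue.fifo",  # placeholder
    MessageBody=json.dumps({"job": str(uuid.uuid4())}),
    MessageGroupId="me_group",                 # required for FIFO queues
    MessageDeduplicationId=str(uuid.uuid4()),  # only needed if content-based dedup is off
)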
State Machine
I expect this to be executed with a json that includes "job": "some job guid" at the top level.
{
"Comment": "This is a thing.",
"StartAt": "RunJob",
"States": {
"RunJob": {
"Type": "Task",
"Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
"Parameters": {
"QueueUrl": "${sqs_url}",
"MessageBody": {
"Message": {
"job_guid.$": "$.job",
"TaskToken.$": "$$.Task.Token"
}
},
"MessageGroupId": "me_group"
},
"Next": "Finish"
},
"Finish": {
"Type": "Succeed"
}
}
}
Notes:
"RunJob"s resource is not the arn of the queue followed by .waitForTaskToken. Seems obvious since it starts with arn:aws:states but it threw me for a bit.
Inside "MessageBody" I'm pretty sure you can just put whatever you want. For sure I know you can rename "TaskToken" to whatever you want.
You need "MessageGroupId" because it's required when you are using a FIFO queue (for some reason).
Python
import boto3
from json import loads

def main(event, context):
    # The SQS record body is a JSON string; our payload sits under "Message".
    message = loads(event["Records"][0]["body"])["Message"]
    task_token = message["TaskToken"]
    job_guid = message["job_guid"]
    print(f"{task_token=}")
    print(f"{job_guid=}")
    # Hand the token back to Step Functions so the execution can continue.
    client = boto3.client('stepfunctions')
    client.send_task_success(taskToken=task_token, output=event["Records"][0]["body"])
    return {"statusCode": 200, "body": "All good"}
Notes:
event["Records"][0]["body"] is a string of a json.
In send_task_success, output expects a string that is json. Basically this means the output of dumps. It just so happens that event["Records"][0]["body"] is a stringified json so that's why I'm returning it.
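For illustration, reusing client, task_token and job_guid from the handler above: building the output yourself is just json.dumps of a dict, and the failure path (send_task_failure) takes plain error/cause strings. The dict contents and error name here are made up:
import json

# Success: output must be a JSON string, so dump whatever dict you like.
client.send_task_success(
    taskToken=task_token,
    output=json.dumps({"job_guid": job_guid, "status": "done"}),
)

# Failure: error and cause are plain strings that show up in the execution history.
client.send_task_failure(
    taskToken=task_token,
    error="JobFailed",
    cause="whatever went wrong",
)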
This is the way lambda + sqs works:
A message comes into SQS
SQS passes that off to a lambda. At the same time it makes the item in the queue invisible. It doesn't delete the item at this stage.
If the lambda returns, SQS deletes the item. If not, it makes the item visible again once the visibility timeout expires, so it can be retried.
Since a FIFO queue has to deal with each item (within a message group) in turn, this means that, if the lambda never succeeds, SQS will just keep retrying that item every time the visibility timeout expires and never process anything behind it.
Note that a failure is an exception, a timeout, a permissions error, etc. If the lambda returns normally, regardless of what's returned, that counts as a success.
What happened to me is as follows:
First step function execution: There's some sort of configuration error in my lambda or something. I fix it and re-deploy the lambda. I abort this invocation and delete the lambda logs.
Second step function execution: Everything is properly configured this time, but my lambda doesn't receive the new invocation. Since the lambda failed, the first item wasn't removed from SQS, and SQS will just keep retrying that same item until it succeeds. However, the first execution was aborted, so its task will never succeed. Nothing else on the queue will ever see the light of day. However, I don't know this. I just see a failed attempt in the logs. So I delete the logs and abort the execution.
Subsequent executions: Finally, the timeout is hit for the first item in the queue, so SQS tries to process the second item in the queue. But I have already aborted that execution too. And so on.
Here are a few approaches to fixing this:
For my particular use-case, it probably doesn't make sense to retry a lambda. So I could set up a dead-letter queue. This is a queue that takes all the failed jobs from the main queue. SQS can be configured to only send it to the dead letter queue after n retries but I would just send it there immediately. The dead letter queue would then be attached to a lambda that deals with cleaning up any resources that need cleaning.
For development, I should wrap everything in a big try/except block. If there's an exception, print it to the logs, but return normally so the message is cleared out of the queue and I don't get a build-up (see the sketch after this list).
For development, I should use a really short visibility timeout. Like 1 second if possible (the visibility timeout is set in whole seconds, so 500 ms isn't an option). This keeps a bad item from blocking the queue for long. This should be used in addition to the previous suggestion to catch things like permissions errors.
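A sketch of that try/except idea (per the note above, returning normally counts as a success, so SQS deletes the message): log the exception, then return anyway so the queue doesn't back up. This is dev-only behaviour, not production advice.
import traceback
from json import loads

import boto3

client = boto3.client("stepfunctions")

def main(event, context):
    try:
        message = loads(event["Records"][0]["body"])["Message"]
        client.send_task_success(
            taskToken=message["TaskToken"],
            output=event["Records"][0]["body"],
        )
    except Exception:
        # Log the problem, but return normally so SQS still deletes the message
        # and the rest of the queue isn't blocked behind it.
        traceback.print_exc()
    return {"statusCode": 200, "body": "All good"}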
I found this stackoverflow post about SQS retry logic that I thought was helpful too.
I am executing a Java lambda in a step function. I throw any exceptions in the lambda code as RuntimeExceptions. I am hoping to retry the lambda execution via the code below on getting any runtime exception (since https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html says any unhandled lambda errors come up as Lambda.Unknown). However, this does not retry the lambda execution on failure:
"STATE_NAME": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.Payload",
"Parameters": {
"FunctionName": "arn:aws:lambda:*:$LATEST",
"Payload": {
...
}
},
"Retry": [
{
"ErrorEquals": [
"Lambda.Unknown"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
What does work, though, is if I replace the error condition with States.ALL. However, this would also include invalid permissions, state timeouts, etc., on which I do not want to retry the lambda execution. Is there something I am missing here?
Based on the AWS docs (https://docs.aws.amazon.com/step-functions/latest/dg/bp-lambda-serviceexception.html):
Unhandled errors in Lambda are reported as Lambda.Unknown in the error output. These include out-of-memory errors and function timeouts. You can match on Lambda.Unknown, States.ALL, or States.TaskFailed to handle these errors. When Lambda hits the maximum number of invocations, the error is Lambda.TooManyRequestsException. For more information about Lambda Handled and Unhandled errors, see FunctionError in the AWS Lambda Developer Guide.
If you are throwing an exception in your lambda code, it will not be classified as a Lambda unhandled error under that classification. Instead, Step Functions reports it under the exception's own error name (for a Java lambda, typically the fully qualified class name such as java.lang.RuntimeException), so that is the name to match in your Retry block.
We are creating a workflow composed of multiple SQL operations (aggregations, transposes, etc.) via AWS Step Functions. Every operation is modelled as a separate Lambda that houses the SQL query.
Now, every query accepts its input parameters from the state machine, so every lambda task is as below:
"SQLQueryTask": {
"Type": "Task",
"Parameters": {
"param1.$": "$$.Execution.Input.param1",
"param2.$": "$$.Execution.Input.param2"
},
"Resource": "LambdaArn",
"End": true
}
The Parameters block thus repeats for every SQLQuery node.
Added to this, since Lambdas can fail intermittently and we would like to retry them, we also need to have the below retry block in every state:
"Retry": [ {
"ErrorEquals": [ "Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException"],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
} ]
This is making the state definition very complex. Is there no way to extract the common part of the state definition into a reusable piece?
One solution could be to use the AWS CDK (https://aws.amazon.com/cdk/).
This allows developers to define higher-level abstractions of resources, which can easily be reused.
There are some examples here that could be helpful: https://docs.aws.amazon.com/cdk/api/latest/docs/aws-stepfunctions-readme.html
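As a rough sketch of what that could look like in CDK for Python (assuming CDK v2; the helper name is made up, and the parameters and retry values are the ones from the question), you can wrap the repeated Parameters and Retry blocks in a small factory function and call it for every SQL query state:
from aws_cdk import Duration, aws_stepfunctions as sfn, aws_stepfunctions_tasks as tasks

def sql_query_task(scope, task_name, lambda_fn):
    """Build a LambdaInvoke task carrying the shared parameters and retry policy."""
    task = tasks.LambdaInvoke(
        scope, task_name,
        lambda_function=lambda_fn,
        payload=sfn.TaskInput.from_object({
            "param1": sfn.JsonPath.string_at("$$.Execution.Input.param1"),
            "param2": sfn.JsonPath.string_at("$$.Execution.Input.param2"),
        }),
    )
    # The common retry block from the question, attached in one place.
    task.add_retry(
        errors=["Lambda.ServiceException", "Lambda.AWSLambdaException", "Lambda.SdkClientException"],
        interval=Duration.seconds(2),
        max_attempts=6,
        backoff_rate=2,
    )
    return task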
I use step functions for a big loop. So far no problem, but the day my loop exceeded 8000 iterations I came across the error "Maximum execution history size", which is 25,000 events.
Is there a solution for not hitting the history event limit?
Otherwise, where can I easily migrate my step functions (3 lambdas)? AWS Batch would require a lot of code rewriting.
Thanks a lot
One approach to avoid the 25k history event limit is to add a choice state in your loop that takes in a counter or boolean and decides to exit the loop.
Outside of the loop you can put a lambda function that starts another execution (with a different id). After this, your current execution completes normally and another execution will continue to do the work.
Please note that the "LoopProcessor" in the example below must return a variable "$.breakOutOfLoop" to break out of the loop, which must also be determined somewhere in your loop and passed through.
Depending on your use case, you may need to restructure the data you pass around. For example, if you are processing a lot of data, you may want to consider using S3 objects and pass the ARN as input/output through the state machine execution. If you are trying to do a simple loop, one easy way would be to add a start offset (think of it as a global counter) that is passed into the execution as input, and each LoopProcessor Task will increment a counter (with the start offset as the initial value). This is similar to pagination solutions.
Here is a basic example of the ASL structure to avoid the 25k history event limit:
{
"Comment": "An example looping while avoiding the 25k event history limit.",
"StartAt": "FirstState",
"States": {
"FirstState": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"Next": "ChoiceState"
},
"ChoiceState": {
"Type" : "Choice",
"Choices": [
{
"Variable": "$.breakOutOfLoop",
"BooleanEquals": true,
"Next": "StartNewExecution"
}
],
"Default": "LoopProcessor"
},
"LoopProcessor": {
"Type" : "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ProcessWork",
"Next": "ChoiceState"
},
"StartNewExecution": {
"Type" : "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:StartNewLooperExecution",
"Next": "FinalState"
},
"FinalState": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"End": true
}
}
}
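For completeness, a minimal sketch of what the StartNewLooperExecution function could do, assuming the state machine ARN arrives via an environment variable (STATE_MACHINE_ARN here is an assumed name) and the current state (for example the counter/offset mentioned earlier) is passed straight through as input:
import json
import os
import uuid

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # Start a fresh execution (with a new, unique name) that continues where this one left off.
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed to be configured on the function
        name=f"looper-{uuid.uuid4()}",
        input=json.dumps(event),  # carries the counter/offset forward
    )
    return {"startedExecutionArn": response["executionArn"]}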
Hope this helps!
To guarantee the execution of all the steps and their order, Step Functions stores the execution history after the completion of each state; this storage is the reason behind the limit on the execution history size.
Having said that, one way to mitigate this limit is by following #sunnyD's answer. However, it has the limitations below:
The invoker of a step function (if there is one) will not get the execution output for the complete data. Instead, it gets only the output of the first execution in the chain of executions.
The execution history size limit has a good chance of changing in future versions, so writing logic around this number would require you to modify the code/configuration every time the limit is increased or decreased.
Another alternative is to arrange the step functions as parent and child step functions. In this arrangement, the parent step function contains a task that loops through the entire set of data and creates a new execution of the child step function for each record or set of records (a number that will not exceed the execution history limit of a child SF). A second step in the parent step function waits for a period of time before it checks the CloudWatch metrics for the completion of all the child executions and exits with the output.
A few things to keep in mind about this solution:
The StartExecution API will throttle at a bucket size of 500 with 25 refills every second.
Make sure the wait time in the parent SF is sufficient for the child SFs to finish their executions; otherwise, implement a loop to check for the completion of the child SFs (see the sketch below).
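For that completion check, instead of CloudWatch metrics, one simple (if blunt) option is to poll ListExecutions for the child state machine until nothing is still RUNNING; whether you run this from a Lambda or model it as a Wait/Choice loop in the parent state machine is up to you. The names below are illustrative, and note this counts every RUNNING execution of that state machine, not just the ones your parent started.
import time

import boto3

sfn = boto3.client("stepfunctions")

def wait_for_children(child_state_machine_arn, poll_seconds=30):
    """Poll until the child state machine has no RUNNING executions left."""
    while True:
        running = sfn.list_executions(
            stateMachineArn=child_state_machine_arn,
            statusFilter="RUNNING",
            maxResults=1,
        )
        if not running["executions"]:
            return
        time.sleep(poll_seconds)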