I use Step Functions for a big loop. So far no problem, but the day my loop exceeded 8,000 executions I ran into the "Maximum execution history size" error, which is 25,000 events.
Is there a solution for not accumulating the history events?
Otherwise, where can I easily migrate my Step Functions (3 Lambdas)? AWS Batch would require a lot of code rewriting.
Thanks a lot
One approach to avoid the 25k history event limit is to add a Choice state in your loop that takes in a counter or boolean and decides whether to exit the loop.
Outside of the loop you can put a Lambda function that starts another execution (with a different id). After this, your current execution completes normally and the other execution continues the work.
Please note that the "LoopProcessor" in the example below must return a variable "$.breakOutOfLoop" to break out of the loop; it must be determined somewhere in your loop and passed through.
Depending on your use case, you may need to restructure the data you pass around. For example, if you are processing a lot of data, you may want to consider using S3 objects and pass the ARN as input/output through the state machine execution. If you are trying to do a simple loop, one easy way would be to add a start offset (think of it as a global counter) that is passed into the execution as input, and each LoopProcessor Task will increment a counter (with the start offset as the initial value). This is similar to pagination solutions.
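To make the counter idea concrete, here is a rough sketch of what the LoopProcessor Lambda could look like (Python is assumed; TOTAL_ITEMS, ITERATIONS_PER_EXECUTION and process_item are hypothetical placeholders for your own workload):
# Hypothetical LoopProcessor handler: does one unit of work, increments a
# counter carried in the state machine input, and signals when to stop.
TOTAL_ITEMS = 100000             # assumption: total amount of work to do
ITERATIONS_PER_EXECUTION = 4000  # stay well below the 25k history event limit

def process_item(index):
    # Placeholder for the real per-iteration work.
    pass

def handler(event, context):
    counter = event.get("counter", 0)
    process_item(counter)
    counter += 1
    return {
        "counter": counter,
        # Break out when this execution has done enough iterations to approach
        # the history limit, or when all the work is done. Distinguishing
        # "finished" from "continue in a new execution" is left to your
        # StartNewLooperExecution logic.
        "breakOutOfLoop": counter % ITERATIONS_PER_EXECUTION == 0
        or counter >= TOTAL_ITEMS,
    }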
Here is a basic example of the ASL structure to avoid the 25k history event limit:
{
  "Comment": "An example looping while avoiding the 25k event history limit.",
  "StartAt": "FirstState",
  "States": {
    "FirstState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "Next": "ChoiceState"
    },
    "ChoiceState": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.breakOutOfLoop",
          "BooleanEquals": true,
          "Next": "StartNewExecution"
        }
      ],
      "Default": "LoopProcessor"
    },
    "LoopProcessor": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:ProcessWork",
      "Next": "ChoiceState"
    },
    "StartNewExecution": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:StartNewLooperExecution",
      "Next": "FinalState"
    },
    "FinalState": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "End": true
    }
  }
}
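The StartNewLooperExecution task is not spelled out above; a rough sketch of such a Lambda in Python, assuming boto3 and that the state machine ARN is supplied via an environment variable, could be:
import json
import os
import uuid

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]  # assumption: supplied via config

def handler(event, context):
    # Start a fresh execution (new name, new 25k event budget) and hand it the
    # current progress so it can continue where this one stopped. Reset the
    # break flag so the new execution does not immediately exit its loop.
    response = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=f"looper-{uuid.uuid4()}",
        input=json.dumps({**event, "breakOutOfLoop": False}),
    )
    return {"startedExecutionArn": response["executionArn"]}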
Hope this helps!
To guarantee the execution of all the steps and their order, Step Functions stores the execution history after the completion of each state; this storage is the reason behind the limit on the execution history size.
Having said that, one way to mitigate this limit is to follow #sunnyD's answer. However, it has the following limitations:
The invoker of the step function (if there is one) will not get the execution output of the complete run. Instead, it gets the output of the first execution in the chain of executions.
The limit on the execution history size has a fair chance of being raised in future versions, so writing logic around this number would require you to modify the code/configuration every time the limit is increased or decreased.
Another alternative is to arrange step functions as parent and child step functions. In this arrangement, the parent step function contains a task that loops through the entire data set and starts a new execution of the child step function for each record or set of records (a number that will not exceed the execution history limit of a child SF). The second step in the parent step function waits for a period of time before it checks the CloudWatch metrics for the completion of all child executions and exits with the output. (A sketch of the parent's fan-out task follows the notes below.)
A few things to keep in mind about this solution:
The StartExecution API is throttled with a bucket size of 500 and a refill rate of 25 per second.
Make sure the wait time in the parent SF is sufficient for the child SFs to finish their executions; otherwise, implement a loop that checks for completion of the child SFs.
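A minimal sketch of the parent's fan-out task in Python/boto3 (the child state machine ARN, the records input shape, and the chunk size are assumptions):
import json
import os

import boto3

sfn = boto3.client("stepfunctions")
CHILD_STATE_MACHINE_ARN = os.environ["CHILD_STATE_MACHINE_ARN"]  # assumption
RECORDS_PER_CHILD = 1000  # keep each child well under the 25k history limit

def handler(event, context):
    records = event["records"]  # assumption about the input shape
    execution_arns = []
    # One child execution per chunk of records; mind the StartExecution
    # throttling limits mentioned above if you start many at once.
    for start in range(0, len(records), RECORDS_PER_CHILD):
        chunk = records[start:start + RECORDS_PER_CHILD]
        response = sfn.start_execution(
            stateMachineArn=CHILD_STATE_MACHINE_ARN,
            input=json.dumps({"records": chunk}),
        )
        execution_arns.append(response["executionArn"])
    return {"childExecutionArns": execution_arns}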
Goal
I wanted to make a proof of concept of the callback pattern. This is where you have a step function that puts a message and token in an SQS queue; the queue is wired up to some arbitrary work, and when that work is done you give the step function back the token so it knows to continue.
Problem
I started testing all this by starting an execution of the step function manually, and after a few failures I hit on what should have worked. send_task_success was called, but all I ever got back was: An error occurred (TaskTimedOut) when calling the SendTaskSuccess operation: Task Timed Out: 'Provided task does not exist anymore'.
My architecture (you can skip this part)
I did this all in terraform.
Permissions
I'm going to skip all the IAM permission details for brevity but the idea is:
The queue has the following, with my Lambda as the resource:
lambda:CreateEventSourceMapping
lambda:ListEventSourceMappings
lambda:ListFunctions
The step function has the following, with my queue as the resource:
sqs:SendMessage
The lambda has
AWSLambdaBasicExecutionRole
AWSLambdaSQSQueueExecutionRole
states:SendTaskSuccess with step function resource
Terraform
resource "aws_sqs_queue" "queue" {
name_prefix = "${local.project_name}-"
fifo_queue = true
# This one is required for fifo queues for some reason
content_based_deduplication = true
policy = templatefile(
"policy/queue.json",
{lambda_arn = aws_lambda_function.run_job.arn}
)
}
resource "aws_sfn_state_machine" "step" {
name = local.project_name
role_arn = aws_iam_role.step.arn
type = "STANDARD"
definition = templatefile(
"states.json", {
sqs_url = aws_sqs_queue.queue.url
}
)
}
resource "aws_lambda_function" "run_job" {
function_name = local.project_name
description = "Runs a job"
role = aws_iam_role.lambda.arn
architectures = ["arm64"]
runtime = "python3.9"
filename = var.zip_path
handler = "main.main"
}
resource "aws_lambda_event_source_mapping" "trigger_lambda" {
event_source_arn = aws_sqs_queue.queue.arn
enabled = true
function_name = aws_lambda_function.run_job.arn
batch_size = 1
}
Notes:
For my use case I definitely want a FIFO queue. However, there are two funny things you have to do to make a FIFO work (that also make me question what the heck the implementation is doing).
Deduplication. This can either be content-based deduplication for the whole queue, or you can use the deduplication ID on a per-message basis.
MessageGroupId. This is on a per message basis.
I don't have to worry about the deduplication because every item I put in this queue comes with a unique guid.
State machine
I expect this to be executed with a JSON input that includes "job": "some job guid" at the top level.
{
  "Comment": "This is a thing.",
  "StartAt": "RunJob",
  "States": {
    "RunJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "${sqs_url}",
        "MessageBody": {
          "Message": {
            "job_guid.$": "$.job",
            "TaskToken.$": "$$.Task.Token"
          }
        },
        "MessageGroupId": "me_group"
      },
      "Next": "Finish"
    },
    "Finish": {
      "Type": "Succeed"
    }
  }
}
Notes:
"RunJob"s resource is not the arn of the queue followed by .waitForTaskToken. Seems obvious since it starts with arn:aws:states but it threw me for a bit.
Inside "MessageBody" I'm pretty sure you can just put whatever you want. For sure I know you can rename "TaskToken" to whatever you want.
You need "MessageGroupId" because it's required when you are using a FIFO queue (for some reason).
Python
import boto3
from json import loads

def main(event, context):
    # The SQS record body is a JSON string; the task token and job id sit
    # under the "Message" key we set in the state machine definition.
    message = loads(event["Records"][0]["body"])["Message"]
    task_token = message["TaskToken"]
    job_guid = message["job_guid"]
    print(f"{task_token=}")
    print(f"{job_guid=}")
    # Hand the token back to Step Functions so the execution can continue.
    client = boto3.client('stepfunctions')
    client.send_task_success(taskToken=task_token, output=event["Records"][0]["body"])
    return {"statusCode": 200, "body": "All good"}
Notes:
event["Records"][0]["body"] is a string of a json.
In send_task_success, output expects a string that is json. Basically this means the output of dumps. It just so happens that event["Records"][0]["body"] is a stringified json so that's why I'm returning it.
This is the way lambda + sqs works:
A message comes into SQS
SQS passes that off to a Lambda. At the same time it makes the item in the queue invisible; it doesn't delete the item at this stage.
If the Lambda returns, SQS deletes the item. If not, it makes the item visible again (as long as it hasn't been longer than the visibility timeout since the item was handed to the Lambda).
Since a FIFO queue has to deal with each item in turn, this means that if the Lambda never succeeds, SQS will just keep retrying that item every visibility timeout and never process anything else.
Note that a failure is an exception, timeout, permissions error, etc. If the function returns normally, regardless of what's returned, that counts as a success.
What happened to me is as follows:
First step function execution: There's some sort of configuration error in my Lambda or something. I fix it and re-deploy the Lambda, abort this execution, and delete the Lambda logs.
Second step function execution: Everything is properly configured this time, but my Lambda doesn't receive the new invocation. Since the Lambda failed earlier, the item wasn't removed from SQS, and SQS will just keep retrying that same item until it is successful. However, that step function execution was aborted, so it will never be successful, and nothing else on the queue will ever see the light of day. I don't know this yet; I just see a failed attempt in the logs. So I delete the logs and abort the execution.
Subsequent executions: Finally, the visibility timeout runs out for the first item in the queue, so SQS tries the second item in the queue, whose execution I have already aborted. And so on.
Here are a few approaches to fixing this:
For my particular use-case, it probably doesn't make sense to retry a Lambda, so I could set up a dead-letter queue: a queue that receives the failed jobs from the main queue. SQS can be configured to send an item to the dead-letter queue only after n retries, but I would send it there immediately. The dead-letter queue would then be attached to a Lambda that cleans up any resources that need cleaning.
For development, I should wrap everything in a big try/except block. If there's an exception, print it to the logs, but clear the message out of the queue so I don't get a build-up (a rough sketch follows this list).
For development, I should use a really short visibility timeout, like 500 ms if possible. This ensures that my Lambda is going to be executed once, or maybe twice, but that's it. This should be used in addition to the previous suggestion to catch things like permissions errors.
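For the try/except suggestion, a rough development-only sketch (reusing the handler from above; swallowing the error on purpose so SQS deletes the message instead of retrying it forever):
import traceback

import boto3
from json import loads

client = boto3.client('stepfunctions')

def main(event, context):
    try:
        message = loads(event["Records"][0]["body"])["Message"]
        client.send_task_success(
            taskToken=message["TaskToken"],
            output=event["Records"][0]["body"],
        )
    except Exception:
        # Log the failure, but return normally so SQS considers the message
        # handled and removes it from the queue.
        traceback.print_exc()
    return {"statusCode": 200, "body": "All good"}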
I found this stackoverflow post about SQS retry logic that I thought was helpful too.
I have a StepFunction with input:
{
"Jobs": [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
}
and I want the sum of the values in this array. According to the Json-Path docs there should exist a .sum() function for this. When I try it here it even works. So I defined the following Pass state:
"Sum Jobs": {
"Type": "Pass",
"Parameters": {
"Jobs.$": "$.Jobs.sum()"
}
},
Nevertheless executions fail with:
"An error occurred while executing the state 'Sum Jobs' (entered at the event id #249). The JSONPath '$.Jobs.sum()' specified for the field 'Jobs.$' could not be found in the input '{\"Jobs\":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]}'"
You will need a Lambda Task for this. Step Functions' intrinsic functions (= operations accessible outside of Tasks) do not include any math or array manipulation operations.
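A minimal sketch of such a Lambda Task handler (assuming the state passes the whole input through and something like ResultPath keeps the rest of it):
# Hypothetical handler for a "Sum Jobs" Lambda Task.
def handler(event, context):
    # event is the state input, e.g. {"Jobs": [0, 0, ..., 1, ...]}
    return sum(event["Jobs"])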
The input that is being sent from previous state is in this form:
[
{
"bucketName": "test-heimdall-employee-data",
"executionId": "ca9f1e5e-4d3a-4237-8a10-8860bb9d58be_1586771571368",
"feedType": "lenel_badge",
"chunkFileKeys": "chunkFileLocation/lenel_badge/68ac7180-69a0-401a-b30c-8f809acf3a1c_1586771581154.csv",
"sanityPassFileKeys": "chunkFileLocation/lenel_badge/0098b86b-fe3c-45ca-a067-4d4a826ee2c1_1586771588882.json"
},
{
"bucketName": "test-heimdall-employee-data",
"executionId": "ca9f1e5e-4d3a-4237-8a10-8860bb9d58be_1586771571368",
"feedType": "lenel_badge",
"errorFilePath": "error/lenel_badge/2a899128-339d-4262-bb2f-a70cc60e5d4e/1586771589234_2e06e043-ad63-4217-9b53-66405ac9a0fc_1586771581493.csv",
"chunkFileKeys": "chunkFileLocation/lenel_badge/2e06e043-ad63-4217-9b53-66405ac9a0fc_1586771581493.csv",
"sanityPassFileKeys": "chunkFileLocation/lenel_badge/f6957aa7-6e22-496a-a6b8-4964da92cb73_1586771588793.json"
},
{
"bucketName": "test-heimdall-employee-data",
"executionId": "ca9f1e5e-4d3a-4237-8a10-8860bb9d58be_1586771571368",
"feedType": "lenel_badge",
"errorFilePath": "error/lenel_badge/8050eb12-c5e6-4ae9-8c4b-0ac539f5c189/1586771589293_1bb32e6c-03fc-4679-9c2f-5a4bca46c8aa_1586771581569.csv",
"chunkFileKeys": "chunkFileLocation/lenel_badge/1bb32e6c-03fc-4679-9c2f-5a4bca46c8aa_1586771581569.csv",
"sanityPassFileKeys": "chunkFileLocation/lenel_badge/48960b7c-04e0-4cce-a77a-44d8834289df_1586771588870.json"
}
]
state machine workflow design:
How do I extract the "feedType" value from the above input and transition to the next state, while also passing the entire input to the next state?
Thanks
You can access the input JSON you started your state machine with using $$.Execution.Input.todo. Other than that, you can't directly access a previous state's output from one step to the next.
As an example, let's say you have A -> B -> C.
Let's say you go through A, which adds a new field a: 1, and then through B, which returns b: 2; when you get to C you will only have b: 2. But if B also returns a: 1, you will then have {a: 1, b: 2} at C, which is typically how you pass state from a step a couple of steps prior.
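As a tiny illustration, assuming Python Lambdas and that each task's result replaces the state input, B can merge the incoming fields into its own output so that a survives to C:
# Hypothetical handler for step B: keep whatever A produced (e.g. a: 1)
# and add B's own result, so C sees both.
def handler(event, context):
    return {**event, "b": 2}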
There are other things people do, such as storing data in an S3 bucket and accessing that bucket in different stages. You can also query the step function itself, but that can be messy.
Other hacks include adding a Pass step in a parallel block, but these hacks are not good. The correct way is to pass the data on between your steps, or hopefully have what you need in your execution input.
Looking at your previous state input, it looks like feedType is a constant. Assuming the key to your entire input is "input", so that it's a dictionary like {"input": [{...}, {...}]} and so on, to access the value of feedType you can simply do $.input[0].feedType.
The Choice state by default passes its entire input on to the next state. So whichever state it transitions to next, that state is going to receive the same input that was passed to the Choice state.
To understand it better, or as a proof of concept, check the Step Function below, in which the Hello state is a Choice state and the other two states are simple Pass states.
And if you look below at the input and output of the Choice state, they are the same.
Hope it helps.
I have a state-machine consisting of a first pre-process task that generates an array as output, which is used by a subsequent map state to loop over. The output array of the first task has gotten too big and the state-machine throws the error States.DataLimitExceeded: The state/task 'arn:aws:lambda:XYZ' returned a result with a size exceeding the maximum number of characters service limit.
Here is an example of the state-machine yaml:
stateMachines:
  myStateMachine:
    name: "myStateMachine"
    definition:
      StartAt: preProcess
      States:
        preProcess:
          Type: Task
          Resource:
            Fn::GetAtt: [preProcessLambda, Arn]
          Next: mapState
          ResultPath: "$.preProcessOutput"
        mapState:
          Type: Map
          ItemsPath: "$.preProcessOutput.data"
          MaxConcurrency: 100
          Iterator:
            StartAt: doMap
            States:
              doMap:
                Type: Task
                Resource:
                  Fn::GetAtt: [doMapLambda, Arn]
                End: true
          Next: ### next steps, not relevant
A possible solution I came up with would be for the preProcess state to save its output to an S3 bucket and for the mapState to read directly from it. Is this possible? At the moment the output of preProcess is
ResultPath: "$.preProcessOutput"
and mapState takes the array
ItemsPath: "$.preProcessOutput.data"
as input.
How would I need to adapt the yaml so that the map state reads directly from S3?
I am solving a similar problem at work at the moment too. Because a step function stores its entire state, you can pretty quickly run into problems as your JSON grows while it maps over all the values.
The only real way to solve this is to use hierarchies of step functions. That is, step functions on your step functions. So you have:
parent -> [batch1, batch2, batch...N]
And then each batch have a number of single jobs:
batch1 -> [j1,j2,j3...jBATCHSIZE]
I had a pretty simple step function, and I found that ~4k was about the max batch size I could use before I started hitting state limits.
Not a pretty solution, but hey, it works.
I don't think it is possible to read directly from S3 at this time. There are a few things you could try to do to get around this limitation. One is making your own iterator and not using Map State. Another is the following:
Have a lambda read your s3 file and chunk it by index or some id/key. The idea behind this step is to pass the iterator in Map State a WAY smaller payload. Say your data has the below structure.
[ { idx: 1, ...more keys }, {idx: 2, ...more keys }, { idx: 3, ...more keys }, ... 4,997 more objects of data ]
Say you want your iterator to process 1,000 rows at a time. Return the following tuples representing indexes from your lambda instead: [ [ 0, 999 ], [ 1000, 1999 ], [ 2000, 2999 ], [ 3000, 3999 ], [ 4000, 4999 ] ]
Your Map State will receive this new data structure and each iteration will be one of the tuples. Iteration #1: [ 0, 999 ], Iteration #2: [ 1000, 1999 ], etc
Inside your iterator, call a lambda which uses the tuple indexes to query into your S3 file. AWS has a query language over S3 buckets called Amazon S3 Select: https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html
Here’s another great resource on how to use S3 select and get the data into a readable state with node: https://thetrevorharmon.com/blog/how-to-use-s3-select-to-query-json-in-node-js
So, for iteration #1, we are querying the first 1,000 objects in our data structure. I can now call whatever function I normally would have inside my iterator.
What's key about this approach is the inputPath is never receiving a large data structure.
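A rough sketch of that inner Lambda using S3 Select from Python/boto3 (the bucket, key, and idx field are assumptions based on the structure above):
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # event is one tuple from the Map state, e.g. [0, 999]
    start, end = event

    response = s3.select_object_content(
        Bucket="YOUR_BUCKET",          # assumption
        Key="path/to/your/data.json",  # assumption
        ExpressionType="SQL",
        Expression=f"SELECT * FROM S3Object[*] s WHERE s.idx BETWEEN {start} AND {end}",
        InputSerialization={"JSON": {"Type": "DOCUMENT"}},
        OutputSerialization={"JSON": {}},
    )

    # The result comes back as an event stream of Records chunks,
    # newline-delimited JSON by default.
    rows = []
    for stream_event in response["Payload"]:
        if "Records" in stream_event:
            payload = stream_event["Records"]["Payload"].decode("utf-8")
            rows.extend(json.loads(line) for line in payload.splitlines() if line)

    # rows now holds the ~1,000 objects for this iteration; process as usual.
    return {"count": len(rows)}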
As of September 2020, the payload limit on Step Functions has been increased 8-fold, to 256 KB:
https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb/
Maybe it now fits your requirements.
Just writing this in case someone else comes across the issue - I recently had to solve this at work as well. I found what I thought to be a relatively simple solution, without the use of a second step function.
I'm using Python for this and will provide a few examples in Python, but the solution should be applicable to any language.
Assuming the pre-process output looks like so:
[
{Output_1},
{Output_2},
.
.
.
{Output_n}
]
And a simplified version of the section of the Step Function is defined as follows:
"PreProcess": {
"Type": "Task",
"Resource": "Your Resource ARN",
"Next": "Map State"
},
"Map State": {
Do a bunch of stuff
}
To handle the scenario where the PreProcess output exceeds the Step Functions payload:
Inside the PreProcess, batch the output into chunks small enough to not exceed the payload.
This is the most complicated step. You will need to do some experimenting to find the largest size a single batch can be. Once you have that number (it may be smart to make it dynamic), you can use numpy, as I did, to split the original PreProcess output into that many batches.
import numpy as np
batches = np.array_split(original_pre_process_output, number_of_batches)
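If you want number_of_batches to be dynamic, one rough approach (a sketch, assuming a self-imposed per-batch budget such as 200 KB to stay under the 256 KB payload limit, and roughly evenly sized items) is to derive it from the serialized output:
import json
import math

MAX_BATCH_BYTES = 200_000  # assumption: leave headroom under the 256 KB limit

serialized_size = len(json.dumps(original_pre_process_output).encode("utf-8"))
number_of_batches = max(1, math.ceil(serialized_size / MAX_BATCH_BYTES))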
Again inside the PreProcess, upload each batch to Amazon S3, saving the keys in a new list. This list of S3 keys will be the new PreProcess output.
In Python, this looks like so:
import json
import boto3

s3 = boto3.resource('s3')
batch_keys = []
for batch in batches:
    s3_batch_key = 'Your S3 Key here'
    # np.array_split returns numpy arrays, so convert back to a plain list
    # before serializing to JSON.
    s3.Bucket(YOUR_BUCKET).put_object(Key=s3_batch_key, Body=json.dumps(batch.tolist()))
    batch_keys.append({'batch_key': s3_batch_key})
In the solution I implemented, I used for batch_id, batch in enumerate(batches) to easily give each S3 key its own ID.
Wrap the 'Inner' Map State in an 'Outer' Map State, and create a Lambda function within the Outer Map to feed the batches to the Inner Map.
Now that we have a small output consisting of S3 keys, we need a way to open one at a time, feeding each batch into the original (now 'Inner') Map state.
To do this, first create a new Lambda function - this will represent the BatchJobs state. Next, wrap the initial Map state inside an Outer map, like so:
"PreProcess": {
"Type": "Task",
"Resource": "Your Resource ARN",
"Next": "Outer Map"
},
"Outer Map": {
"Type": "Map",
"MaxConcurrency": 1,
"Next": "Original 'Next' used in the Inner map",
"Iterator": {
"StartAt": "BatchJobs",
"States": {
"BatchJobs": {
"Type": "Task",
"Resource": "Newly created Lambda Function ARN",
"Next": "Inner Map"
},
"Inner Map": {
Initial Map State, left as is.
}
}
}
}
Note the 'MaxConcurrency' parameter in the Outer Map - This simply ensures the batches are executed sequentially.
With this new Step Function definition, the BatchJobs state will receive {'batch_key': s3_batch_key}, for each batch. The BatchJobs state then simply needs to get the object stored in the key, and pass it to the Inner Map.
In Python, the BatchJobs Lambda function looks like so:
import json
import boto3

s3 = boto3.client('s3')

def batch_jobs_handler(event, context):
    # Fetch the batch uploaded by PreProcess and hand it to the Inner Map.
    return json.loads(s3.get_object(Bucket='YOUR_BUCKET_HERE',
                                    Key=event.get('batch_key'))['Body'].read().decode('utf-8'))
Update your workflow to handle the new structure of the output.
Before implementing this solution, your Map state outputs an array of outputs:
[
{Map_output_1},
{Map_output_2},
.
.
.
{Map_output_n}
]
With this solution, you will now get a list of lists, with each inner list containing the results of each batch:
[
[
{Batch_1_output_1},
{Batch_1_output_2},
.
.
.
{Batch_1_output_n}
],
[
{Batch_2_output_1},
{Batch_2_output_2},
.
.
.
{Batch_2_output_n}
],
.
.
.
[
{Batch_n_output_1},
{Batch_n_output_2},
.
.
.
{Batch_n_output_n}
]
]
Depending on your needs, you may need to adjust some code after the Map in order to handle the new format of the output.
That's it! As long as you set the max batch size correctly, the only way you will hit a payload limit is if your list of S3 keys exceeds the payload limit.
The proposed workarounds work for specific scenarios, but not for the one where processing a normal payload can generate a list of items big enough to exceed the payload limit.
In general I think the problem can recur in any 1->N scenario, i.e. whenever one step may generate many step executions in the workflow.
One of the clear ways to break up the complexity of a task is to divide it into many smaller ones, so this is likely to be needed a lot. From a scalability perspective there is also a clear advantage: the more you break big computations into little ones, the more granularity, parallelism and optimization you can get.
That is what AWS intends to facilitate by increasing the max payload size; they call it dynamic parallelism.
The problem is that the Map state is the cornerstone of that. Besides the service integrations (database queries, etc.), it is the only state that can dynamically derive many tasks from just one step. But there seems to be no way to tell it that the payload is in a file.
A quick solution to the problem would be for them to add an optional persistence spec to each step, for example:
stateMachines:
  myStateMachine:
    name: "myStateMachine"
    definition:
      StartAt: preProcess
      States:
        preProcess:
          Type: Task
          Resource:
            Fn::GetAtt: [preProcessLambda, Arn]
          Next: mapState
          ResultPath: "$.preProcessOutput"
          OutputFormat:
            S3:
              Bucket: myBucket
              Compression:
                Format: gzip
        mapState:
          Type: Map
          ItemsPath: "$.preProcessOutput.data"
          InputFormat:
            S3:
              Bucket: myBucket
              Compression:
                Format: gzip
          MaxConcurrency: 100
          Iterator:
            StartAt: doMap
            States:
              doMap:
                Type: Task
                Resource:
                  Fn::GetAtt: [doMapLambda, Arn]
                End: true
          Next: ### next steps, not relevant
That way the Map could perform its work even over large payloads.
There is now a Map State in Distributed Mode:
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html
Use the Map state in Distributed mode when you need to orchestrate large-scale parallel workloads that meet any combination of the following conditions:
The size of your dataset exceeds 256 KB.
The workflow's execution event history exceeds 25,000 entries.
You need a concurrency of more than 40 parallel iterations.
I have an SQS Queue of which I monitor its size from a state machine.
If size > desired size, then I trigger some lambda functions; otherwise, it waits for 30 seconds and checks the queue size again.
Here is my problem: when the queue length is > 20000 I want to trigger 10 lambda functions to empty it faster, and if its length is < 2000 then I want to only run 1 lambda function.
For now, I have hard-coded ten parallel steps, but it's a waste of resources if the queue size is less than 2000.
"CheckSize": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.Payload.size",
"NumericGreaterThan": 2000,
"Next": "invoke_lambda"
},
{
"Variable": "$.Payload.size",
"NumericLessThan": 2000,
"Next": "Wait30s"
}
],
"Default": "Wait30s"
},
AWS Step Functions does not appear to be the best tool for your scenario. I think you should be using one of the SQS metrics available in CloudWatch; in your case it would be ApproximateNumberOfMessagesVisible. You can create an alarm for ApproximateNumberOfMessagesVisible >= 20,000. The action for that alarm would probably be an SNS topic to which you subscribe a Lambda function. In that Lambda function you can asynchronously invoke, 10 times, the Lambda function that is supposed to clear down the queue.
Check out the AWS docs for creating a CloudWatch alarm for an SQS metric.
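As a rough sketch, such an alarm could be created from Python/boto3 like this (the alarm name, queue name, and SNS topic ARN are placeholders):
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="queue-backlog-high",  # assumption
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "YOUR_QUEUE_NAME"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=20000,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:REGION:ACCOUNT_ID:your-topic"],  # assumption
)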
Using Step Functions:
If you want to do it with Step Functions, then I don't think you need any condition check in your state machine definition. All you need is to pass $.size to a Lambda function and put the condition in that Lambda function: if size >= 20000, asynchronously invoke the queue-processing function 10 times, else once.
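A minimal sketch of that checker Lambda in Python/boto3 (the worker function name is a placeholder):
import json

import boto3

lambda_client = boto3.client("lambda")

WORKER_FUNCTION = "queue-processing-function"  # assumption

def handler(event, context):
    size = event["size"]
    invocations = 10 if size >= 20000 else 1

    # InvocationType="Event" makes the calls asynchronous, so this function
    # returns immediately instead of waiting for the workers to finish.
    for _ in range(invocations):
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",
            Payload=json.dumps({}),
        )

    return {"invocations": invocations}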
AWS Step Functions now supports dynamic parallelism, so you can optimize the performance and efficiency of application workflows such as data processing and task automation. By running identical tasks in parallel, you can achieve consistent execution durations and improve utilization of resources to save on operating costs. Step Functions automatically scales resources in response to your input.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
https://aws.amazon.com/about-aws/whats-new/2019/09/aws-step-functions-adds-support-for-dynamic-parallelism-in-workflows/
Not diving deep into the solution you have come up with, and focusing instead on providing guidance on your question:
So, if you look closely, you have answered the question yourself. The simplest solution is to add one more step called invoke_10_lambdas and branch to it from your Choice state. Pseudo-code for your step function would look something like this:
....
....
"CheckSizeAndDivert": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.Payload.size",
"NumericGreaterThan": 20000,
"Next": "invoke_10_lambdas"
},
{
"Variable": "$.Payload.size",
"NumericGreaterThan": 2000,
"Next": "invoke_lambda"
}
],
"Default": "Wait30s"
},
"invoke_10_lambdas": {
// This is your parallel step.
...
Next:"whatever next(i believe it is Wait30)"
},
"invoke_lambda": {
...
// This is your single lambda step.
...
Next:"whatever next(i believe it is Wait30)"
},
...
...
SQS now supports using Lambda via an EventSourceMapping, so the recommendation would be to have AWS directly take control of this and scale Lambdas as necessary.
Example CloudFormation template would be
"EventSourceMapping": {
"Type": "AWS::Lambda::EventSourceMapping",
"Properties": {
"BatchSize": 10,
"Enabled": true,
"EventSourceArn" : { "Fn::GetAtt" : ["SQSStandupWork", "Arn"] },
"FunctionName" : {
"Fn::Join": [
":", [
{ "Fn::GetAtt" : ["LambdaFunction", "Arn"] },
"production"
]
]
}
}
}
If you are really set on using a step function to drive this forward, you can add another Choice branch on top of what you currently have:
A - execute in parallel: 1 lambda (A1 => stop) + a checker (B)
B - call a lambda to check the size; go to Wait30 (B1) if the size is less than 2000, go to a Parallel state (B2) if the size is > 20000
B1 - wait 30 seconds and then Next: A
B2 - a Parallel state with 9 lambdas (since the 10th is A) => Next: A
Additional alternatives are:
a CloudWatch Events rule to schedule triggering every 30 seconds
triggering the 10 parallel lambda functions directly from a separate lambda. A lambda could check the size and then directly invoke the other lambdas asynchronously. Since it doesn't matter what the result is (we'll check again in 30 seconds), the step function will simply retry.
The biggest problem with your suggested approach is that a step function execution has a 1-year limit, so unless you are sure the queue will be drained within a year you'll have a problem when you get to the end. Even if you set it up to re-trigger a new step function, you'll be paying for a lot of unnecessary state transitions (Step Functions is not the cheapest service).