Dividing tasks into AWS Step Functions and joining them back when all are completed - amazon-web-services

We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4,000 records.
Now I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received with 2,500 records.
The outer step function calls another step function 2,500 times (the inner step function takes a single CSV record as input), processes it, and then stores the result in DynamoDB or any other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2,500 tokens, and I want the outer step function to continue only when all 2,500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know of any article or guide I could reference, that would be great.
Thanks in advance

It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
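To make this concrete, here is a minimal sketch of what such a Map state could look like, with the Amazon States Language definition expressed as a Python dict. The state names, the `$.records` input path, and the Lambda ARN are placeholders for illustration, not anything from your setup:

```python
import json

# Minimal sketch of a state machine with a Map state (Amazon States Language
# expressed as a Python dict). All names and ARNs below are placeholders.
definition = {
    "StartAt": "ProcessAllRecords",
    "States": {
        "ProcessAllRecords": {
            "Type": "Map",
            "ItemsPath": "$.records",   # JSON array of CSV records in the state input
            "MaxConcurrency": 40,       # 0 means "no configured limit"; actual concurrency may still be throttled
            "Iterator": {               # the sub-workflow run once per record
                "StartAt": "ProcessRecord",
                "States": {
                    "ProcessRecord": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecord",
                        "End": True
                    }
                }
            },
            "End": True
        }
    }
}

print(json.dumps(definition, indent=2))  # deploy this JSON as the state machine definition
```

The Map state waits for all iterations to finish and returns an array of their outputs, which gives you the "join when everything is done" behaviour within a single workflow.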
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).

Related

Run ML pipeline using AWS step function for entire dataset?

I have a Step Function set up which calls a preprocessing Lambda and an inference Lambda for each data item. Now I need to run this process on the entire dataset (over 10,000 items). One way is to invoke the step function in parallel for each input. Is there a better alternative to this approach?
Another way to do it would be to use a Map state to run over an array of items. You could start with a list of item IDs and run a set of tasks for each of them.
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/
This approach has some drawbacks though:
There is a 256 KB limit on input/output data. The initial array of items could possibly be bigger than that. If you pass an array of IDs only as the input to the Map state, though, 10k items would likely not cross that limit.
The Map state doesn't guarantee that all the items will run concurrently. It could possibly be fewer than 40 at a time (a workaround would be to use nested Map states, or maps of Map states). From the documentation:
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
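For illustration, starting the execution with only the item IDs as input might look something like this (the state machine ARN and the `itemIds` input shape are assumptions for this sketch):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical list of 10k item IDs; keeping the input to IDs only helps
# stay under the 256 KB input/output limit.
item_ids = [f"item-{i}" for i in range(10_000)]

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:MLPipeline",
    input=json.dumps({"itemIds": item_ids}),
)
print(response["executionArn"])
```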

Starting a new execution of Step Function after exceeding 25,000 events, when iterating through objects in an S3 bucket

I am iterating through an S3 bucket to process the files. My solution is based on this example:
https://rubenjgarcia.es/step-function-to-iterate-s3/
The iteration is working fine, but unfortunately I exceed the 25,000 events allowed for one execution, so it eventually fails. I know you have to start a new execution of the step function, but I'm unclear how to tell it where I am in the current iteration. I have the count of how many files have been processed and, obviously, the ContinuationToken. Can I use the ContinuationToken to keep track of where I am in iterating through the S3 bucket, and the count to tell it when to start a new execution?
I have read the AWS docs https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-continue-new.html but I am not sure where to start applying this to my own solution. Has anyone done this when iterating through objects in an S3 bucket? If so, can you point me in the right direction?
I can think of two options:
In your solution you are iterating as long as there is a next token. You can extend that: create a counter and increase it on each iteration, and change your condition to keep iterating only as long as there is a next token and the count is below a threshold. Once the threshold is reached, start a new execution and pass the ContinuationToken and a reset count along (see the sketch after these options).
I prefer to use a nested state machine to overcome the 25,000-event limitation. Let's say every time you read 100 items from S3: if you pass that list to a nested state machine to process, then the top-level state machine will not reach 25,000 events, and neither will the nested state machine.
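Here is a rough Python sketch of the first option, as the iteration Lambda inside the state machine. The event fields, the page size, and the threshold are all assumptions for illustration:

```python
import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

PAGE_THRESHOLD = 200  # pages per execution; tune so one execution stays well under 25,000 events


def process_object(bucket, key):
    # Placeholder for whatever per-file processing you already do.
    print(f"processing s3://{bucket}/{key}")


def handler(event, context):
    """One loop iteration: list a page of objects, process it, then decide whether to
    keep looping in this execution or hand off to a fresh one."""
    bucket = event["bucket"]
    pages_done = event.get("pagesDone", 0)

    kwargs = {"Bucket": bucket, "MaxKeys": 100}
    if event.get("continuationToken"):
        kwargs["ContinuationToken"] = event["continuationToken"]
    page = s3.list_objects_v2(**kwargs)

    for obj in page.get("Contents", []):
        process_object(bucket, obj["Key"])

    next_token = page.get("NextContinuationToken")
    pages_done += 1

    if next_token and pages_done >= PAGE_THRESHOLD:
        # Hand off to a brand-new execution before this one approaches the 25,000-event limit.
        sfn.start_execution(
            stateMachineArn=event["stateMachineArn"],
            input=json.dumps({
                "bucket": bucket,
                "continuationToken": next_token,
                "pagesDone": 0,
                "stateMachineArn": event["stateMachineArn"],
            }),
        )
        return {"continue": False}

    # A Choice state in the workflow loops back to this task while "continue" is true.
    return {
        "continue": bool(next_token),
        "bucket": bucket,
        "continuationToken": next_token,
        "pagesDone": pages_done,
        "stateMachineArn": event["stateMachineArn"],
    }
```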

How to handle the limitation of DynamoDB BatchWriteItem

Just wondering what's the best way to handle the fact that DynamoDB can only write batches of at most 25 items.
I have 3 Lambdas (there are more, but I am simplifying so we don't get sidetracked):
GetNItemsFromExternalSourceLambda
SaveAllToDynamoDBLambda
AnalyzeDynamoDBLambda
Here is what happens:
GetNItemsFromExternalSourceLambda can potentially fetch 250 items in one REST call it makes to an external API.
It then invokes SaveAllToDynamoDBLambda and passes a) all these items and b) paging info, e.g. {pageNum: 1, pageSize: 250, numPages: 5}, in the payload.
SaveAllToDynamoDBLambda needs to save all items to a DynamoDB table and then, based on the paging info, will either a) re-invoke GetNItemsFromExternalSourceLambda (to fetch the next page of data) or b) invoke AnalyzeDynamoDBLambda.
These steps can obviously loop many times until we have got all the data from the external source, before finally proceeding to the last step.
The final AnalyzeDynamoDBLambda is a Lambda that processes all the data that was fetched and saved to the DB.
So my problem lies in the fact that SaveAllToDynamoDBLambda can only write 25 items in a batch, which means I would have to tell my GetNItemsFromExternalSourceLambda to only fetch 25 items at a time from the external source, which is not ideal (being able to fetch 250 at a time would be a lot better).
One could extend the timeout period of SaveAllToDynamoDBLambda so that it could do multiple batch writes inside one invocation, but I don't like that approach.
I could also zip up the 250 items and save them to S3 in one upload, which could trigger a stream event, but I would have the same issue on the other side of that solution.
Just wondering what a better approach would be, while still being able to invoke AnalyzeDynamoDBLambda only after all the info from all the REST calls has been saved to DynamoDB.
Basically the problem is that you need a way of subdividing the large batch (250 items in this case) down into batches of 25 or less.
A very simple solution would be to use a Kinesis stream in the middle. Kinesis can take up to 500 records per PutRecords call. You can then use GetRecords with a Limit of 25 and put the records into Dynamo with a single BatchWriteItem call.
Make sure you look at the size limits as well before deciding if this solution will work for you.
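A minimal sketch of that pattern with boto3, assuming a stream named items-stream, a table named ItemsTable, and a Kinesis trigger on the consumer Lambda with a batch size of 25. All of these names and the item shape are placeholders, and real code should also handle FailedRecordCount and UnprocessedItems retries:

```python
import json
import base64
import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")

STREAM_NAME = "items-stream"  # assumed stream name
TABLE_NAME = "ItemsTable"     # assumed table name


def save_all_to_kinesis(items):
    """Producer side: push all 250 items in one PutRecords call (limit is 500 records per call)."""
    kinesis.put_records(
        StreamName=STREAM_NAME,
        Records=[
            {"Data": json.dumps(item).encode(), "PartitionKey": str(item["id"])}
            for item in items
        ],
    )  # check FailedRecordCount in the response and retry failures in real code


def kinesis_to_dynamo_handler(event, context):
    """Consumer side: a Lambda subscribed to the stream with a batch size of 25,
    so each invocation maps to a single BatchWriteItem call."""
    requests = []
    for record in event["Records"]:
        item = json.loads(base64.b64decode(record["kinesis"]["data"]))
        requests.append({
            "PutRequest": {
                "Item": {
                    "id": {"S": str(item["id"])},
                    "payload": {"S": json.dumps(item)},
                }
            }
        })
    if requests:
        # BatchWriteItem accepts at most 25 requests; retry UnprocessedItems in real code.
        dynamodb.batch_write_item(RequestItems={TABLE_NAME: requests})
```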

Use AWS Step Functions for processing a large amount of data?

We want to use AWS Step Functions to process a large amount of data from a CSV file, but we are not sure if this is the best choice.
Our use case is below:
- We upload a CSV with a large number of lines (like 50K), and for each line we run a small processing step (each step is handled by a Lambda function).
At this time, we think the best choice is to insert each line from our CSV into DynamoDB and, for each line, launch our Lambda functions.
What do you think of this?
There are multiple patterns to process large files with Lambda.
One approach is to use a Lambda function to split the large file and delegate the parts to worker Lambda functions.
If the processing steps for the parts are complex enough, you can trigger multiple Step Functions workflows.
In your proposed approach, if the processing of each item is heavy enough it will make sense to process item by item, but generally it's more efficient to process in batches.
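As a rough sketch of the split-and-delegate idea (the bucket/key fields, worker function name, and batch size are all placeholder assumptions; note that asynchronous Lambda invocations cap the payload at 256 KB, so for larger batches you would pass S3 references instead of the rows themselves):

```python
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BATCH_SIZE = 500                       # lines handed to each worker invocation (placeholder)
WORKER_FUNCTION = "process-csv-batch"  # assumed worker Lambda name


def split_handler(event, context):
    """Splitter Lambda: read the CSV from S3 and fan out batches of lines to worker Lambdas."""
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    body = obj["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))

    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",              # asynchronous fan-out
            Payload=json.dumps({"rows": batch}),  # must stay under the 256 KB async payload limit
        )

    return {"batches": (len(rows) + BATCH_SIZE - 1) // BATCH_SIZE}
```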

Callback for Multiple AWS Lambda execution completion

Here is my problem:
I have to compare a candidate object, against some criteria, with millions of other candidates in the DB. Since Lambda allows only 5 minutes of execution, this causes a timeout.
My Solution:
I planned to do this comparison in chunks of 10,000 candidates, so I have to call 10 Lambda functions (through SNS) to process 100,000 candidates and then save the results of each Lambda in some DynamoDB table. But how do I get a callback when all the Lambda functions are done processing, so that I can collect those individual results and then calculate the final result? How can I achieve this, or is there a better way to achieve my goal? Any help is most appreciated.
I'm not sure if AWS Lambda is truly a good fit for your use case. However, just focusing on the main part of your question, you could use DynamoDB Atomic Counters to determine when all processing is complete. You would do the following:
Initially, insert a record in DynamoDB with a numberOfLambdaCalls attribute set to the number of concurrent executions you are kicking off, and a completedLambdaCalls attribute set to 0.
As each function completes, it would atomically increment the completedLambdaCalls attribute as part of updating the DynamoDB record.
Each function could check the returned result of the update to see if it was the one to complete the processing (i.e. numberOfLambdaCalls == completedLambdaCalls) and, if so, perform whatever action is necessary to trigger your response.
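A minimal sketch of that increment-and-check with boto3, assuming a table named jobs with a jobId partition key (every name here is a placeholder):

```python
import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "jobs"  # assumed table name


def init_job(job_id, number_of_lambda_calls):
    """Create the tracking record before kicking off the Lambdas."""
    dynamodb.put_item(
        TableName=TABLE_NAME,
        Item={
            "jobId": {"S": job_id},
            "numberOfLambdaCalls": {"N": str(number_of_lambda_calls)},
            "completedLambdaCalls": {"N": "0"},
        },
    )


def mark_chunk_done(job_id):
    """Called at the end of each worker Lambda: atomic increment, then check if we were last."""
    result = dynamodb.update_item(
        TableName=TABLE_NAME,
        Key={"jobId": {"S": job_id}},
        UpdateExpression="ADD completedLambdaCalls :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="ALL_NEW",
    )
    attrs = result["Attributes"]
    if attrs["completedLambdaCalls"]["N"] == attrs["numberOfLambdaCalls"]["N"]:
        trigger_final_aggregation(job_id)


def trigger_final_aggregation(job_id):
    # Placeholder for whatever collects the per-Lambda results and computes the final output.
    print(f"all lambdas finished for job {job_id}")
```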