I have nearly 1,000 items in my DB, and I have to run the same operation on each item. The issue is that the operation calls a third-party service with a one-second rate limit per operation. Until now, I was able to do the entire thing inside a Lambda function, but it is now getting close to the 15-minute (900-second) timeout limit.
I was wondering what the best way to split this would be. Can I dump each item (or batches of items) into SQS and have a Lambda function process them sequentially? From what I understand, that isn't the recommended approach, because I can't delay invocations for long enough. Alternatively, I could call a Lambda from within a Lambda, which also sounds odd.
Is AWS Step Functions the way to go here? I haven't used that service yet, so I was wondering if there are other options too. I am also using the Serverless Framework, if that is of any significance.
Both methods you mentioned would work. Within Lambda you could add a delay (sleep) after one item has been processed and then trigger another Lambda invocation after the delay. You'll be paying for that dead time, of course, so Step Functions may be a more elegant solution. One Lambda can certainly invoke another, even invoke itself. If you invoke the next Lambda asynchronously, the initial function finishes while the newly invoked function starts to run; this article on Asynchronous invocation will be useful for that approach. Essentially, each Lambda invocation is responsible for processing one item, delaying long enough to respect the service's rate limit, and then invoking the function for the next item.
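As a rough sketch of that chaining pattern, assuming a Python runtime; the event shape ({"items": [...]}) and the process_item stub are illustrative assumptions, not an established API:

    import json
    import time

    import boto3

    lambda_client = boto3.client("lambda")

    def process_item(item_id):
        """Placeholder for the call to the rate-limited third-party service."""

    def handler(event, context):
        items = event["items"]            # remaining item IDs (assumed shape)
        current, rest = items[0], items[1:]

        process_item(current)
        time.sleep(1)                     # respect the 1-second rate limit

        if rest:
            # Asynchronous ("Event") invocation: hand the remainder of the
            # list to a fresh invocation of this same function and return.
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",
                Payload=json.dumps({"items": rest}),
            )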
If anything goes wrong, you'd want appropriate exception handling so that a problem with one item either halts the rest of the chain or allows it to continue, depending on what is appropriate for your use case.
Step Functions would also handle this use case well. With a Wait state and a loop you could achieve the same result: the flow could invoke one Lambda that processes an item and returns the next item, then run a Wait step, then process the next item, and so on until you reach the end. You could also use a Map that runs a Lambda task followed by a Wait task:
The Map state ("Type": "Map") can be used to run a set of steps for each element of an input array. While the Parallel state executes multiple branches of steps using the same input, a Map state will execute the same steps for multiple entries of an array in the state input.
This article on Iterating a Loop Using Lambda is also useful.
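As a rough sketch of what that could look like, here is an Amazon States Language definition built as a Python dict; the state names, the $.items input path, and the process-item function ARN are all illustrative assumptions:

    import json

    # "MaxConcurrency": 1 makes the Map iterate serially; the Wait state
    # enforces a 1-second gap between calls to the rate-limited service.
    definition = {
        "StartAt": "ProcessAllItems",
        "States": {
            "ProcessAllItems": {
                "Type": "Map",
                "ItemsPath": "$.items",
                "MaxConcurrency": 1,
                "Iterator": {
                    "StartAt": "ProcessItem",
                    "States": {
                        "ProcessItem": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:process-item",
                            "Next": "RateLimitWait",
                        },
                        "RateLimitWait": {"Type": "Wait", "Seconds": 1, "End": True},
                    },
                },
                "End": True,
            }
        },
    }

    print(json.dumps(definition, indent=2))

Deploying it would then be a matter of passing json.dumps(definition) to CreateStateMachine.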
If you want the messages to be processed serially and are happy to dump them into SQS, set both the reserved concurrency of the Lambda and the batchSize property of the SQS event that triggers the function to 1.
Make it a FIFO queue so that messages don't potentially get processed more than once, if that is important.
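A minimal sketch of that setup with boto3; the function name and queue ARN are placeholders. (With the Serverless Framework, the equivalent settings are reservedConcurrency: 1 on the function and batchSize: 1 on its sqs event.)

    import boto3

    lambda_client = boto3.client("lambda")

    # One concurrent execution, so messages are handled strictly one at a time.
    lambda_client.put_function_concurrency(
        FunctionName="process-item",             # placeholder function name
        ReservedConcurrentExecutions=1,
    )

    # Deliver one message per invocation from the queue.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:sqs:REGION:ACCOUNT:items-queue.fifo",
        FunctionName="process-item",
        BatchSize=1,
    )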
Related
I'm trying to build a process like this:
In state 1, it triggers 10 Lambdas, and only when all 10 Lambdas respond (or call back with their taskToken) does it proceed to state 2.
How to design this process?
This is a perfect scenario for the Map state. You can pass in an array of Lambda function names, then add a Lambda task and use the Parameters block to set the function dynamically. The Map state only completes once every iteration has finished, which gives you the "wait for all 10" behaviour. And if you want them to run one at a time instead of in parallel, you can set MaxConcurrency to 1.
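A sketch of such a definition, again written as a Python dict; the input shape ({"functions": ["fn-a", "fn-b", ...]}) is an assumption:

    definition = {
        "StartAt": "InvokeEach",
        "States": {
            "InvokeEach": {
                "Type": "Map",
                "ItemsPath": "$.functions",
                # "MaxConcurrency": 1,  # uncomment to run one at a time
                "Iterator": {
                    "StartAt": "Invoke",
                    "States": {
                        "Invoke": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName.$": "$",  # each element is a function name
                                "Payload": {},
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }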
Scenario
I'm looking for a way to create an instance of a step function that waits for me to start it. Pseudocode would look like this:
StateMachine myStateMachine = new();
string executionArn = myStateMachine.ExecutionArn;
myStateMachine.Start();
Use Case
We need a way to reliably store the Execution ARN of a step function in a database. If we fail to write the Execution ARN to the database, we won't call the Start method and the step function should time out. If starting the step function fails, the database operation will be rolled back.
These are the steps we plan to take:
A local transaction is started
The step function instance is created, but not started
The ExecutionArn of the created step function instance is recorded in a database
The step function is started
The local transaction is committed
Is there a simple way to start a step function like this?
Below is the result of some research I've done on this so far.
Manual Callbacks
Following the information in this article https://aws.amazon.com/blogs/compute/implementing-serverless-manual-approval-steps-in-aws-step-functions-and-amazon-api-gateway/, I created an empty activity, used that activity as the first step in the step function, and added a timeout of 30 seconds to the activity step. The expectation was that if I didn't send a success to that activity task, the step would time out and the workflow would fail, but it isn't doing that. Even though I set the timeout to 30 seconds, the step is not timing out. I'm guessing the timeout governs how long the step function waits to schedule the activity, not how long it waits before moving on from the activity step.
I've also considered using an SQS SendMessage step with "Wait for callback" checked and a similar timeout, but that would require creating a throw-away SQS queue just to hold messages I never intend to read, and I'm guessing the timeout would behave the same way there as with an activity.
Wait State
There may be something I can do with a Wait state and parallel branches by following the accepted answer in this SO article: Does AWS Step Functions have a timeout feature?, but before I go down that route I want to see if something simpler can be done.
Global Timeout
I have found that step functions have a global timeout, which is useful here in conjunction with a step that pauses until my application explicitly resumes it. But the global timeout only helps if it can be kept reasonably low (say, 20 minutes) while still keeping the step function viable for all use cases. If the maximum time the step function should take is 2 or 3 minutes, all is fine; but if another step can take longer than 20 minutes, I can no longer use the global timeout, or I have to set it very high, which I don't want to do.
Is there anything I can do here easily that I'm overlooking?
Thanks
Two-phase initialization of a step function cannot be done. We've worked around it as follows (a sketch follows the list):
Our Application: Writing a row in our DB to indicate the intent to start a step function
Our Application: Start the step function
Our Application: Record the ExecutionArn of the step function instance in the created row
Step Function: As its first step, the step function waits indefinitely on an SQS step (wait for callback)
Our Application: Poll the SQS queue and either abort the step function or allow it to proceed to the next step by sending a callback for the SQS step's task token. (This is the second phase.)
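A hedged sketch of both phases with boto3, assuming the first state is an SQS send using the .waitForTaskToken pattern and that its message body carries the task token; the ARNs, queue URL, and DB helpers are placeholders:

    import json

    import boto3

    sfn = boto3.client("stepfunctions")
    sqs = boto3.client("sqs")

    QUEUE_URL = "https://sqs.REGION.amazonaws.com/ACCOUNT/sfn-gate"  # placeholder

    def record_intent_row():
        """Placeholder: write the 'intent to start' row in our DB."""

    def record_execution_arn(arn):
        """Placeholder: store the ExecutionArn on the intent row."""

    def intent_row_committed():
        """Placeholder: check whether the DB transaction committed."""
        return True

    # Phase 1: record intent, start the execution, record its ARN.
    record_intent_row()
    execution = sfn.start_execution(
        stateMachineArn="arn:aws:states:REGION:ACCOUNT:stateMachine:my-flow",
        input=json.dumps({}),
    )
    record_execution_arn(execution["executionArn"])

    # Phase 2: the first state blocks on .waitForTaskToken, so the execution
    # sits still until we answer. Poll the queue and release or abort it.
    for message in sqs.receive_message(
        QueueUrl=QUEUE_URL, WaitTimeSeconds=20
    ).get("Messages", []):
        token = json.loads(message["Body"])["TaskToken"]   # body shape assumed
        if intent_row_committed():
            sfn.send_task_success(taskToken=token, output=json.dumps({}))
        else:
            sfn.send_task_failure(taskToken=token, error="Aborted",
                                  cause="Intent row was rolled back")
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=message["ReceiptHandle"])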
I have a question about the Step Functions part of AWS.
I have a function to watch and update data in databases. But because we can have anywhere from 1,000 to 1,000,000 items to update, I would like to handle them in batches of 10,000 or 100,000 per Lambda.
Ideally, the batches would be managed in parallel, so that all the data is updated at the same time and finishes together.
So I would like to create a Lambda function, using the aws-sdk, that creates a parallel step function with X tasks, where every task manages 10,000 or 100,000 items of the database.
But when I read the aws-sdk documentation, it looks like there is no way to create a parallel step function, even from a template.
So my question is: is it possible to create a parallel step function from a Lambda function with the aws-sdk? Or do you have a better solution to my problem?
Thanks in advance
Update: To give you more information, my problem is that I have to update or insert an unknown amount of data in my DB on the first day of each month, and I need to call an API that takes 15 seconds to return each piece of data (it's not our API, so I cannot improve its response time).
If I just use a Lambda function, it will time out after 15 minutes.
So I thought of using a step function to execute the Lambda function for each piece of data, but if we have a lot of data it could take more than 24 hours. I would like a solution where I can execute my Lambda function in parallel to optimize the time, which is why I thought about the parallel tasks of a step function.
But because the amount of data will change every month, I don't know how to dynamically increase or decrease the number of branches of my step function, and that's why I thought of generating the step function from another Lambda.
I have a function to watch and update data in databases.
I suppose what you need to watch is some kind of user/data events? What do you watch? What do you update?
Can you provide more info before I can give you some architectural suggestions?
By the way, Step Functions orchestrates/invokes Lambda functions, not the other way around.
Updated answer:
You seem to be facing the 15-minute hard limit on Lambda execution time. There are three approaches I can see:
Instead of using a Lambda function, use an ECS container or EC2 instance to handle the large volume of data processing and database writing. However, this requires a substantial code rewrite and infrastructure/architectural change.
Figure out a way to break down the input data so you can fan out the handling to multiple Lambda function instances, i.e.: input data -> Lambda to break down the task -> SQS messages -> Lambda to handle each task. My concern is that breaking down the input data may itself need substantial time.
Before the Lambda execution times out, mark the current processed position and invoke the same Lambda function with the original event plus a position offset; the next Lambda instance picks up the processing from where the previous execution stopped (see the sketch after this list). https://medium.com/swlh/processing-large-s3-files-with-aws-lambda-2c5840ae5c91
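A minimal sketch of that third approach in Python; the load_items and process_item stubs and the 60-second safety margin are illustrative assumptions:

    import json

    import boto3

    lambda_client = boto3.client("lambda")

    SAFETY_MARGIN_MS = 60_000  # re-invoke well before the 15-minute timeout

    def load_items():
        """Placeholder: fetch the month's list of records to process."""
        return []

    def process_item(item):
        """Placeholder: the ~15-second third-party API call plus DB write."""

    def handler(event, context):
        items = load_items()
        for index in range(event.get("offset", 0), len(items)):
            if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
                # Out of time: hand the remaining work to a fresh invocation,
                # carrying the original event plus the position reached so far.
                lambda_client.invoke(
                    FunctionName=context.function_name,
                    InvocationType="Event",
                    Payload=json.dumps({**event, "offset": index}),
                )
                return
            process_item(items[index])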
We are experiencing double invocations of Lambdas triggered by S3 ObjectCreated events. The double invocations happen exactly 10 minutes after the first invocation, not 10 minutes after the first try completes, but 10 minutes after the first invocation happened. The original invocation takes anywhere from 0.1 to 5 seconds. No invocation results in an error; they all complete successfully.
We are aware that SQS, for example, guarantees at-least-once rather than exactly-once delivery of messages, and we would accept some of the Lambdas getting invoked a second time as a consequence of the distributed system underneath. A delay of 10 minutes, however, sounds very weird.
Of about 10k messages, 100-200 result in double invocations.
The AWS Support basically says "the 10 minute wait time is by design but we cannot tell you why", which is not at all helpful.
Has anyone else experienced this behaviour before?
How did you solve the issue or did you simply ignore it (which we could do)?
One proposed solution is not to use direct S3-lambda-triggers, but let S3 put its event on SNS and subscribe a Lambda to that. Any experience with that approach?
example log: two invocations, 10 minutes apart, same RequestId
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:14:09 INFO ImageProcessingLambda:104 - handle 1 records
and
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:24:09 INFO ImageProcessingLambda:104 - handle 1 records
After a couple of rounds with AWS support and others, and a few isolated trial runs, it seems this is simply "by design". It is not clear why, but it happens. The problem is neither S3 nor SQS/SNS, but the Lambda invocation itself and how the Lambda service dispatches invocations to Lambda instances.
The double invocations happen on somewhere between 1% and 3% of all invocations, 10 minutes after the first invocation. Surprisingly, there are even triple (and probably quadruple) invocations, at rates that are powers of the base probability, so basically 0.09%, ... The triple invocations happened 20 minutes after the first one.
If you encounter this, you simply have to work around it with whatever you have access to. For example, we now store the already-processed entities in Cassandra with a TTL of 1 hour, and only act on a message in the Lambda if the entity has not been processed yet. The double and triple invocations all happen within this one-hour timeframe.
Not wanting to spin up a data store like Dynamo just to handle this, I did two things to solve our use case:
Write a lock file per function into S3 (which we were already using for this function) and check for its existence on function entry, aborting if it is present; for this function we only ever want one instance running at a time (sketched below). The lock file is removed before we call callback, on error or success.
Write a request time into the initial event payload and check it on function entry; if the request time is too old, abort. We don't want Lambda retries on error unless they happen quickly, so this handles the case where a duplicate or retry arrives while another invocation of the same function is not already running (which the lock file would otherwise stop), and it also avoids the small overhead of the S3 lock-file requests in that case.
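A sketch of the lock-file part in Python; the bucket and key names are placeholders, and note that the check-then-create pair is not atomic, which this approach accepts:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-work-bucket"          # placeholder: bucket the function already uses
    LOCK_KEY = "locks/my-function"     # placeholder lock object key

    def acquire_lock():
        """Return True if no lock file existed and we created one."""
        try:
            s3.head_object(Bucket=BUCKET, Key=LOCK_KEY)
            return False               # lock present: another invocation is running
        except ClientError as error:
            if error.response["Error"]["Code"] != "404":
                raise
        s3.put_object(Bucket=BUCKET, Key=LOCK_KEY, Body=b"")
        return True

    def release_lock():
        """Remove the lock before calling back, on error or success."""
        s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)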
Let's say that I have a switch statement in my thread function that evaluates triggered events, with each case handling a different event. Is it better to put the call to ResetEvent at the end of the case, or at the beginning? It seems to me that it should go at the end, so that the event cannot be triggered again until the thread has finished processing the previous event. If it is placed at the beginning, the event could be triggered again while it is being processed.
Yes, I think that is the way to go. Create a manual-reset event (second parameter of the CreateEvent API) so that the event is not automatically reset after it is set.
If you handle incoming traffic using a single Event object (implying you have no inbound queue), you will miss events. Is this really what you want?
If you want to catch all events, a full-blown producer-consumer queue would be a better bet. A reference implementation for Boost.Thread is here:
One problem that comes up time and again with multi-threaded code is how to transfer data from one thread to another. For example, one common way to parallelize a serial algorithm is to split it into independent chunks and make a pipeline — each stage in the pipeline can be run on a separate thread, and each stage adds the data to the input queue for the next stage when it's done. For this to work properly, the input queue needs to be written so that data can safely be added by one thread and removed by another thread without corrupting the data structure.
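The question is about Win32 events, but the pattern itself is language-agnostic. Here is a minimal sketch of such a thread-safe producer-consumer hand-off using Python's standard queue module, just to illustrate why no events are lost:

    import queue
    import threading

    SENTINEL = object()   # tells the consumer the producer is finished

    def producer(out_q):
        for item in range(10):
            out_q.put(item)          # thread-safe: every item is retained, in order
        out_q.put(SENTINEL)

    def consumer(in_q):
        while True:
            item = in_q.get()        # blocks until something is available
            if item is SENTINEL:
                break
            print(f"processed {item}")

    q = queue.Queue()
    threads = [threading.Thread(target=producer, args=(q,)),
               threading.Thread(target=consumer, args=(q,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Unlike a single manual-reset event, the queue buffers every notification, so a burst of events is processed one at a time instead of being collapsed into one.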