AWS Lambda | How to rollback database changes due to execution timeout - amazon-web-services

My team is working on an AWS Lambda function that has a configured timeout of 30 seconds. Given that lambdas have this timeout constraint and the fact that they can be reused for subsequent requests, it seems like there will always be the potential for the function's execution to timeout prior to completing all of its necessary steps. Is this a correct assumption? If so, how do we bake in resiliency so that db updates can be rolled back in the case of a timeout occurring after records have been updated, but a response hasn't been returned to the function's caller?
To be more specific, my team is managing a JavaScript-based Lambda (Node.js 16.x) that sits behind an API Gateway and implements a REST method to retrieve and update job records. The method retrieves records from DynamoDB given certain conditions, updates their states, then returns the updated job records to the caller. Is there a means to detect when a timeout has occurred and to roll back (either manually or automatically) the updated db records so that they're in the same state as when the Lambda began execution?

It is important to consider the consequences of what you are trying to do here. Instead of finding ways to detect when your Lambda function is about to expire, the best practice is to first monitor a good chunk of executed requests and analyze how much time, on average, it takes to complete the said requests. Perhaps 30 seconds may not be enough to complete the transaction implemented as a Lambda function.
Once you have settled on an acceptable timeout that fits the observed execution time of your requests, you can minimize the need for rollbacks caused by incomplete executions by using the support for transactions in DynamoDB. It allows you to group multiple operations together and submit them as a single all-or-nothing operation, thus ensuring atomicity.
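As a rough sketch of what such an all-or-nothing update could look like, the function below builds the parameters for a transactional write that moves several jobs from one state to another. The table name, the `jobId` key, and the `state` attribute are assumptions for illustration and are not taken from the question.

```javascript
// Builds parameters for a transactional write in DocumentClient format.
// Hypothetical schema: a table keyed by `jobId` with a `state` attribute.
function buildJobStateTransaction(tableName, jobIds, prevState, nextState) {
  return {
    TransactItems: jobIds.map((jobId) => ({
      Update: {
        TableName: tableName,
        Key: { jobId },
        // `#s` is an alias, in case `state` clashes with a reserved word.
        UpdateExpression: 'SET #s = :next',
        // Cancels the whole transaction if any job is not in the expected state.
        ConditionExpression: '#s = :prev',
        ExpressionAttributeNames: { '#s': 'state' },
        ExpressionAttributeValues: { ':next': nextState, ':prev': prevState },
      },
    })),
  };
}
```

These parameters could then be passed to `DocumentClient.transactWrite(params).promise()` in the AWS SDK v2, or to a `TransactWriteCommand` from `@aws-sdk/lib-dynamodb` in v3. If the function times out before that call completes, either all of the updates have been applied or none of them have.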
Another aspect of your design is how fast you can retrieve data from DynamoDB without eating into the timeout. Currently, your code retrieves records from DynamoDB and then updates them if certain conditions are met. The read therefore needs to happen as fast as possible so that the subsequent update can start. One way to speed up this read is to enable DAX (DynamoDB Accelerator) for in-memory acceleration. It acts as a cache for DynamoDB with microseconds of latency.
Finally, if you want to be extra careful and not even start a transaction in DynamoDB when there will not be enough time to finish it, you can use the context object from the Lambda API to query the remaining time of the function. In Node.js, you can do it like this:
const remainingTimeInMillis = context.getRemainingTimeInMillis();
if (remainingTimeInMillis < TIMEOUT_PASSED_AS_ENVIRONMENT_VARIABLE) {
  // Cancel the execution and clean things up
}
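A slightly fuller sketch of the same guard, wrapped in a handler, might look like the following. The `MIN_REMAINING_MS` environment variable, the `makeHandler` helper, and the `doWork` callback (standing in for the read-then-update logic) are hypothetical names, not part of the Lambda API. Only `context.getRemainingTimeInMillis()` comes from the runtime.

```javascript
// Hypothetical safety margin; tune it from observed execution times.
const MIN_REMAINING_MS = Number(process.env.MIN_REMAINING_MS || 5000);

// `doWork` stands in for the read-then-update logic (an assumption).
function makeHandler(doWork, minRemainingMs = MIN_REMAINING_MS) {
  return async function handler(event, context) {
    if (context.getRemainingTimeInMillis() < minRemainingMs) {
      // Not enough time left: refuse before touching any records.
      return {
        statusCode: 503,
        body: JSON.stringify({ message: 'Not enough time to process safely' }),
      };
    }
    const jobs = await doWork(event);
    return { statusCode: 200, body: JSON.stringify(jobs) };
  };
}
```

The check is most useful right before the write step, since that is the point after which a timeout would leave records updated without a response reaching the caller.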

Related

Will a click counter slow down my DynamoDB API?

I want to create a DynamoDB WebAPI. It allows the creation and reading of Posts. Now I would like to implement a click counter that updates the popularity of a post each time a user requests it. For this reason, every time a GET request for a post comes in, I would change the Post object itself.
But I know that DynamoDB is optimized for reads, not for writes. So updating the object that is being fetched every time would probably be a problem.
So how can I measure the popularity of posts without slowing down the API itself? I was thinking of generating a random number for every fetch and only updating it if it is below 0.05 or something similar.
But is there a better solution for this?
DynamoDB isn't "optimized for reads"; it's optimized to provide "consistent, single-digit millisecond response times at any scale."
To optimize DDB for reads, you'd want to stick an Amazon DynamoDB Accelerator (DAX) instance in front of it for "faster access with microsecond latency".
In actuality, the DDB read/write performance isn't going to be an issue. In your case the network latency between your app and DDB will be orders of magnitude higher. By making two calls synchronously one after the other you'd be doubling your response time, regardless of what cloud DB you're writing to.
Assuming the data and counter are in the same record, the simple DDB solution in this case would be to avoid making one call to GetItem() and another to UpdateItem(). Instead, simply call UpdateItem() with an UpdateExpression that uses the ADD action to add 1 to your counter, and set the ReturnValues parameter to return either ALL_OLD or ALL_NEW.
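A minimal sketch of the parameters for that single call, in DocumentClient format, could look like this. The `postId` key and the `clicks` attribute are assumed names for illustration.

```javascript
// Builds UpdateItem parameters that increment a counter and return the item
// in the same round trip. `postId` and `clicks` are hypothetical names.
function buildClickUpdate(tableName, postId) {
  return {
    TableName: tableName,
    Key: { postId },
    // ADD treats a missing numeric attribute as 0 before adding.
    UpdateExpression: 'ADD #clicks :one',
    ExpressionAttributeNames: { '#clicks': 'clicks' },
    ExpressionAttributeValues: { ':one': 1 },
    // ALL_NEW returns the post as it looks after the increment.
    ReturnValues: 'ALL_NEW',
  };
}
```

The response's `Attributes` field would then carry the post itself, so no separate GetItem() is needed.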
Other, more complex solutions:
Assuming you've already got the data for display, do an async call to UpdateItem().
At scale, you might consider decoupling the counter update from your app. Your app posts an SQS message, which is processed by a Lambda that can batch the updates to DDB.
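To illustrate the SQS variant, here is a sketch of how a Lambda could collapse a batch of click messages into one increment per post. It assumes each message body is JSON of the form `{"postId": "..."}` and reuses the hypothetical `postId` and `clicks` names. Since BatchWriteItem only supports puts and deletes, the sketch produces one UpdateItem parameter set per distinct post rather than a single batch call.

```javascript
// Counts clicks per post from an SQS-triggered Lambda event.
// Assumes each record body is JSON like {"postId": "abc"} (hypothetical format).
function countClicks(sqsEvent) {
  const counts = new Map();
  for (const record of sqsEvent.Records) {
    const { postId } = JSON.parse(record.body);
    counts.set(postId, (counts.get(postId) || 0) + 1);
  }
  return counts;
}

// Turns the counts into one UpdateItem parameter set per distinct post.
function buildCounterUpdates(tableName, counts) {
  return [...counts].map(([postId, n]) => ({
    TableName: tableName,
    Key: { postId },
    UpdateExpression: 'ADD #clicks :n',
    ExpressionAttributeNames: { '#clicks': 'clicks' },
    ExpressionAttributeValues: { ':n': n },
  }));
}
```

With this shape, ten clicks on the same post in one batch cost a single write instead of ten.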

DynamoDB and computed columns: Run Lambda on GetItem / Query request but before data is returned to caller

Is it possible to run a Lambda function as part of a GetItem / Query request? I plan to use some kind of computed column that I would like to update before the value is returned to the caller. The current idea is to do this with a Lambda function and DynamoDB Streams. Up to now, I have not found the part in the docs where I can specify the exact moment when the Lambda is executed (before or after fetching data). Of course, I am open to better ideas!
No, it is not possible. DynamoDB is designed to return items in distributed systems within milliseconds. There is no way to execute Lambdas synchronously with Put or Get requests. DynamoDB Streams work more like an asynchronous table trigger and only fire when data is written or changed.
One idea is to call a Lambda that collects and computes your data, instead of requesting DynamoDB directly.

Mysterious TransactionConflict in TransactionCanceledException

Using transactWriteItems in the aws-sdk (js) we get a TransactionCanceledException. The reason within that Exception is given as TransactionConflict. Sometimes all actions in the transaction fail, sometimes only a few or only one. We do run multiple transactions in parallel that can operate on the same items. The documentation doesn't mention this particular error. Possible reason excerpt:
A condition in one of the condition expressions is not met.
A table in the TransactWriteItems request is in a different account or region.
More than one action in the TransactWriteItems operation targets the same item.
There is insufficient provisioned capacity for the transaction to be completed.
An item size becomes too large (larger than 400 KB), or a local secondary index (LSI) becomes too large, or a similar validation error occurs because of changes made by the transaction.
There is a user error, such as an invalid data format.
None of these apply, and when retrying the transaction it seems to eventually work. Does anyone know about this exception? I can't find anything documented.
What you are experiencing is not a bug; it's actually part of the feature, and it was mentioned in the launch announcement.
Items are not locked during a transaction. DynamoDB transactions provide serializable isolation. If an item is modified outside of a transaction while the transaction is in progress, the transaction is canceled and an exception is thrown with details about which item or items caused the exception.
As an aside, instead of locking, DynamoDB uses something called optimistic concurrency control (which is also (confusingly) called optimistic locking). If you’re interested in learning more about that, the Wikipedia article on Optimistic Concurrency Control is pretty good.
Back to the matter at hand, the AWS documentation for transactions says:
Multiple transactions updating the same items simultaneously can cause conflicts that cancel the transactions. We recommend following DynamoDB best practices for data modeling to minimize such conflicts.
Specifically for TransactWriteItems, they say:
Write transactions don't succeed under the following circumstances:
When an ongoing TransactWriteItems operation conflicts with a concurrent TransactWriteItems request on one or more items in the TransactWriteItems operation. In this case, the concurrent request fails with a TransactionCanceledException.
Similarly for TransactGetItems:
Read transactions don't succeed under the following circumstances:
When there is an ongoing TransactGetItems operation that conflicts with a concurrent PutItem, UpdateItem, DeleteItem or TransactWriteItems request. In this case the TransactGetItems operation fails with a TransactionCanceledException.
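Since the question notes that retrying eventually works, one way to handle these conflicts is a small retry wrapper with backoff. The sketch below is an assumption about how that could be structured, not something prescribed by the AWS documentation. `withConflictRetry` and `isTransactionConflict` are hypothetical helpers. The detection relies on the error exposing `name` (or `code`) and, where available, a `CancellationReasons` array with a `Code` field, and otherwise falls back to matching the message text.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Returns true when the error looks like a transaction canceled by a conflict.
function isTransactionConflict(err) {
  const name = err && (err.name || err.code);
  if (name !== 'TransactionCanceledException') return false;
  const reasons = err.CancellationReasons;
  if (Array.isArray(reasons)) {
    return reasons.some((r) => r && r.Code === 'TransactionConflict');
  }
  // Reasons not exposed as a field: fall back to the message text.
  return /TransactionConflict/.test(String(err.message));
}

// Retries `runTransaction` on conflicts only, with jittered exponential backoff.
async function withConflictRetry(runTransaction, { maxAttempts = 5, baseDelayMs = 50 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      if (attempt >= maxAttempts || !isTransactionConflict(err)) throw err;
      await sleep(Math.random() * baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Other cancellation reasons, such as a failed condition check, are rethrown immediately, since retrying them unchanged would not help.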

AWS: Execute a task after 1 year has elapsed

Basically, I have a web service that receives a small json payload (an event) a few times per minute, say 60. This event must be sent to an SQS queue only after 1 year has elapsed (it's ok to have it happen a few hours sooner or later, but the day of month should be exactly the same).
This means I'll have to store more than 31 million events somewhere before the first one should be sent to the SQS queue.
I thought about using SQS message timers, but they have a limit of only 15 minutes, and as pointed out by @Charlie Fish, it's weird to have an element lurking around on a queue for such a long time.
A better possibility could be to schedule a lambda function using a Cron expression for each event (I could end up with millions or billions of scheduled lambda functions in a year, if I don't hit an AWS limit well before that).
Or I could store these events on DynamoDB or RDS.
What would be the recommended / most cost-effective way to handle this using AWS services? Scheduled lambda functions? DynamoDB? PostgreSQL on RDS? Or something entirely different?
And what if I have 31 billion events per year instead of 31 million?
I cannot afford to lose ANY of those events.
DynamoDB is a reasonable option, as is RDS. SQS is not a good choice for long-term storage. However, if you want to keep your costs down, I would suggest another approach: accumulate the events for a single 24-hour period (or a smaller interval if that is desirable), and write that set of data out as an S3 object instead of keeping it in DynamoDB. You could use DynamoDB or RDS (or just about anything else) as a place to accumulate events for the day (or hour) before writing that data out to S3 as a single set of data for the interval.
Each S3 object could be named appropriately, indicating either the date/time it was created or the date/time it needs to be used, e.g. 20190317-1400 to indicate that this file needs to be used on March 17th, 2019 at 2PM.
I would imagine a Lambda function, called by a CloudWatch event that is triggered every 60 minutes, that scans your S3 bucket looking for files that are due to be used. It then reads in the JSON data, puts the events into an SQS queue for further processing, and moves the processed S3 object to another 'already processed' bucket.
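The naming and scanning steps could be sketched as follows. This assumes keys use the `YYYYMMDD-HH00` form from the example above, that times are handled in UTC, and that the key encodes the time the object is due. The helper names are hypothetical, and listing and reading the actual S3 objects is left out.

```javascript
const pad = (n) => String(n).padStart(2, '0');

// Formats a date as an hourly key such as 20190317-1400 (UTC).
function hourKey(date) {
  return `${date.getUTCFullYear()}${pad(date.getUTCMonth() + 1)}` +
    `${pad(date.getUTCDate())}-${pad(date.getUTCHours())}00`;
}

// Key of the object an event belongs to: same day of month, one year later.
// Note: an event received on Feb 29 rolls over to Mar 1 of the next year.
function dueKeyFor(receivedAt) {
  const due = new Date(receivedAt.getTime());
  due.setUTCFullYear(due.getUTCFullYear() + 1);
  return hourKey(due);
}

// Keys are fixed-width and zero-padded, so string comparison matches
// chronological order.
function selectDueKeys(keys, now) {
  const cutoff = hourKey(now);
  return keys.filter((key) => key <= cutoff);
}
```

The hourly Lambda would call `selectDueKeys` on the listed object keys and forward the contents of each selected object to SQS before moving it to the processed bucket.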
Your storage costs would be minimal (especially if you batch them up by day or hour), S3 has 11 9's of durability, and you can archive older events off to Glacier if you want to keep them around even after they are processed.
DynamoDB is a great product that provides redundant storage and very high performance, but I see nothing in your requirements that would warrant incurring that cost or requiring that performance. Why keep millions of records in an 'always on' database when you know in advance that you don't need to use or see them until a year from now?
You could store some form of data in DynamoDB and run a daily Lambda task to query for all the items that are more than a year old, remove those from DynamoDB, and import them into SQS.
As you mentioned, SQS doesn't have this functionality built in, so you need to store the data using some other technology. DynamoDB seems like a reasonable choice based on what you have mentioned above.
Of course, you also have to think about whether running a cron task once per day is sufficient for your task. Do you need it to be exactly after 1 year? Is it acceptable to have it be one year and a few days? Or one year and a few weeks? What is the window that is acceptable for importing into SQS?
Finally, the other question you have to think about is whether SQS is even reasonable for your application. Having a queue that has a 1 year delay seems kind of strange. I could be wrong, but you might want to consider something besides SQS, because SQS is meant for much more immediate tasks. See the examples on this page (Decouple live user requests from intensive background work: let users upload media while resizing or encoding it, Allocate tasks to multiple worker nodes: process a high number of credit card validation requests, etc.). None of those examples are really meant for a year of wait time before executing. At the end of the day it depends on your use case, but off the top of my head I can't think of a situation that makes sense for delaying entry into an SQS queue for a year. There seem to be much better ways to handle this, but again I don't know your specific use case.
EDIT: Another question is whether your data is consistent. Is the amount of data you need to store consistent? How about the format? What about the number of events per second? You mention that you don't want to lose any data. For sure, build in error handling and backup systems. But DynamoDB doesn't scale the best if one moment you store 5 items and the next moment you want to store 5 million items. If you set your capacity to account for 5 million then it is fine. But the question is whether the amount of data and its frequency will be consistent or not.

Minimal delay when listing activities using the Reports API

I would like to use the list method in the Reports API to periodically fetch Activities of all users of some applications (e.g. 'admin' and 'login') and keep a local copy of all that data (using watch and push notifications is not an option in my particular scenario).
The idea is defining small time windows (e.g. 60 seconds) and, at the end of each time window plus some small delay, using the 'list' method and setting the startTime and endTime accordingly, fetching all events logged during the already finished time window.
This way I would be able to have an almost-real-time list of events locally stored. However, I'm not sure about what minimal delay should be used to ensure that the list method will be able to fetch all events. I'm assuming some delay is required here. Am I right? If so, is there any minimum delay that guarantees all events will be fetched?
In theory you wouldn't need a delay, but 10 seconds would probably be fine if you want to be sure. Another important thing to consider is the API quota; in this case the project would be limited to 5 queries per second. Here is the documentation on that: https://developers.google.com/admin-sdk/reports/v1/limits
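For the windowing scheme described in the question, the start and end of the most recently closed window could be computed like this. The 60-second window and 10-second delay are the figures discussed above. The `lastClosedWindow` helper is a hypothetical name, and the result is only meant to be passed as the `startTime` and `endTime` parameters of the list method, which take RFC 3339 timestamps.

```javascript
// Returns the most recent window that closed at least `delayMs` ago,
// as RFC 3339 timestamps in UTC.
function lastClosedWindow(now, windowMs = 60000, delayMs = 10000) {
  // Align the end of the window to a multiple of the window size.
  const endMs = Math.floor((now.getTime() - delayMs) / windowMs) * windowMs;
  return {
    startTime: new Date(endMs - windowMs).toISOString(),
    endTime: new Date(endMs).toISOString(),
  };
}
```

Because consecutive windows share a boundary, it is worth checking how the API treats events that fall exactly on `endTime`, and de-duplicating locally by event ID if there is any doubt.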