Mysterious TransactionConflict in TransactionCanceledException - amazon-web-services

Using transactWriteItems in the aws-sdk (js) we get a TransactionCanceledException. The reason within that Exception is given as TransactionConflict. Sometimes all actions in the transaction fail, sometimes only a few or only one. We do run multiple transactions in parallel that can operate on the same items. The documentation doesn't mention this particular error. Possible reason excerpt:
A condition in one of the condition expressions is not met.
A table in the TransactWriteItems request is in a different account or
region.
More than one action in the TransactWriteItems operation targets the
same item.
There is insufficient provisioned capacity for the transaction to be
completed.
An item size becomes too large (larger than 400 KB), or a local
secondary index (LSI) becomes too large, or a similar validation error
occurs because of changes made by the transaction.
There is a user error, such as an invalid data format.
None of these apply and when retrying the transaction it seems to eventually work. Anyone knows about this exception? I can't find anything documented.

What you are experiencing is not a bug—it’s actually part of the feature, and it was mentioned in the launch announcement .
Items are not locked during a transaction. DynamoDB transactions provide serializable isolation. If an item is modified outside of a transaction while the transaction is in progress, the transaction is canceled and an exception is thrown with details about which item or items caused the exception.
As an aside, instead of locking, DynamoDB uses something called optimistic concurrency control (which is also (confusingly) called optimistic locking). If you’re interested in learning more about that, the Wikipedia article on Optimistic Concurrency Control is pretty good.
Back to the matter at hand, the AWS documentation for transactions says:
Multiple transactions updating the same items simultaneously can cause conflicts that cancel the transactions. We recommend following DynamoDB best practices for data modeling to minimize such conflicts.
Specifically for TransactWriteItems, they say:
Write transactions don't succeed under the following circumstances:
When an ongoing TransactWriteItems operation conflicts with a concurrent TransactWriteItems request on one or more items in the TransactWriteItems operation. In this case, the concurrent request fails with a TransactionCancelledException
Similarly for TransactGetItems:
Read transactions don't succeed under the following circumstances:
When there is an ongoing TransactGetItems operation that conflicts with a concurrent PutItem, UpdateItem, DeleteItem or TransactWriteItems request. In this case the TransactGetItems operation fails with a TransactionCancelledException

Related

AWS Lambda | How to rollback database changes due to execution timeout

My team is working on an AWS Lambda function that has a configured timeout of 30 seconds. Given that lambdas have this timeout constraint and the fact that they can be reused for subsequent requests, it seems like there will always be the potential for the function's execution to timeout prior to completing all of its necessary steps. Is this a correct assumption? If so, how do we bake in resiliency so that db updates can be rolled back in the case of a timeout occurring after records have been updated, but a response hasn't been returned to the function's caller?
To be more specific, my team is managing a Javascript-based lambda (Node.js 16.x) that sits behind an Api Gateway and is an implementation of a REST method to retrieve and update job records. The method works by retrieving records from DynamodDB given certain conditions, updates their states, then returns the updated job records to the caller. Is there a means to detect when a timeout has occurred and to rollback (either manually or automatically) the updated db records so that they're in the same state as when the lambda began execution?
It is important to consider the consequences of what you are trying to do here. Instead of finding ways to detect when your Lambda function is about to expire, the best practice is to first monitor a good chunk of executed requests and analyze how much time, on average, it takes to complete the said requests. Perhaps 30 seconds may not be enough to complete the transaction implemented as a Lambda function.
Once you work with an admittable timeout that suits the average execution time for requests, you can minimize the possibility of rollbacks because of incomplete executions with the support for transactions in DynamoDB. It allows you to group multiple operations together and submit them as a single all-or-nothing, thus ensuring atomicity.
Another aspect related to the design of your implementation is about how fast can you retrieve data from DynamoDB without compromising the timeout. Currently, your code retrieves records from DynamoDB and then updates them if certain conditions are met. This creates a need for this read to happen as fast as possible so the subsequent operation of update can start. A way for you to speed up this read is enabling the DAX (DynamoDB Accelerator) to achieve in-memory acceleration. This acts as a cache for DynamoDB with microseconds of latency.
Finally, if you wat to be extra careful and not even start a transaction in DynamoDB because there will be not enough time to do so, you can use the context object from the Lambda API to query for the remaining time of the function. In Node.js, you can do this like this:
let remainingTimeInMillis = context.getRemainingTimeInMillis()
if (remainingTimeInMillis < TIMEOUT_PASSED_AS_ENVIRONMENT_VARIABLE) {
// Cancel the execution and clean things up
}

How to deal with DynamoDB boto3's batchwriteitem (called in lambda) skipping some items (not uploading to dynamodb)?

This is a project that is being developed using AWS.
I have scheduled my lambda function using the cron expression in CloudWatch. The function will upload items to DynamoDB daily.
Some items are not uploaded to Dynamodb despite having a unique primary key. Sometimes consecutive items are skipped, sometimes items with slightly similar primary keys are skipped. Usually, the number of items skipped is below 20.
It fully works when I run the lambda function manually again. Would like to know the reason behind this and possibly the solution.Thanks!
The BatchWriteItem documentation explains that if the database encounters problems - most notably if your request rate exceeds your provisioned capacity - it is likely that a BatchWriteItem will only successfully write some of the items in the batch. The list of the items not written will be returned in a UnprocessedItems attribute in the response, and you are expected to try again to write those same unprocessed items:
If any requested operations fail because the table's provisioned throughput is exceeded or an internal processing failure occurs, the failed operations are returned in the UnprocessedItems response parameter. You can investigate and optionally resend the requests. Typically, you would call BatchWriteItem in a loop. Each iteration would check for unprocessed items and submit a new BatchWriteItem request with those unprocessed items until all items have been processed.
If none of the items can be processed due to insufficient provisioned throughput on all of the tables in the request, then BatchWriteItem returns a ProvisionedThroughputExceededException.
If DynamoDB returns any unprocessed items, you should retry the batch operation on those items. However, we strongly recommend that you use an exponential backoff algorithm. If you retry the batch operation immediately, the underlying read or write requests can still fail due to throttling on the individual tables. If you delay the batch operation using exponential backoff, the individual requests in the batch are much more likely to succeed.

DynamoDB Transaction Write

DynamoDB's transaction write states:
Multiple transactions updating the same items simultaneously can cause conflicts that cancel the transactions. We recommend following DynamoDB best practices for data modeling to minimize such conflicts.
If there are multiple simultaneous TransactionWriteItems on the same item simultaneously, will all the transaction write request fail with TransactionCanceledException? Or at least one request will succeed?
This purely depends on how the conflicts happen. So, if all the writes in a transaction get successful, it will be successful. If some other write/trasaction midifies one of the items, it will fail.

How to achieve consistent read across multiple SELECT using AWS RDS DataService (Aurora Serverless)

I'm not sure how to achieve consistent read across multiple SELECT queries.
I need to run several SELECT queries and to make sure that between them, no UPDATE, DELETE or CREATE has altered the overall consistency. The best case for me would be something non blocking of course.
I'm using MySQL 5.6 with InnoDB and default REPEATABLE READ isolation level.
The problem is when I'm using RDS DataService beginTransaction with several executeStatement (with the provided transactionId). I'm NOT getting the full result at the end when calling commitTransaction.
The commitTransaction only provides me with a { transactionStatus: 'Transaction Committed' }..
I don't understand, isn't the commit transaction fonction supposed to give me the whole (of my many SELECT) dataset result?
Instead, even with a transactionId, each executeStatement is returning me individual result... This behaviour is obviously NOT consistent..
With SELECTs in one transaction with REPEATABLE READ you should see same data and don't see any changes made by other transactions. Yes, data can be modified by other transactions, but while in a transaction you operate on a view and can't see the changes. So it is consistent.
To make sure that no data is actually changed between selects the only way is to lock tables / rows, i.e. with SELECT FOR UPDATE - but it should not be the case.
Transactions should be short / fast and locking tables / preventing updates while some long-running chain of selects runs is obviously not an option.
Issued queries against the database run at the time they are issued. The result of queries will stay uncommitted until commit. Query may be blocked if it targets resource another transaction has acquired lock for. Query may fail if another transaction modified resource resulting in conflict.
Transaction isolation affects how effects of this and other transactions happening at the same moment should be handled. Wikipedia
With isolation level REPEATABLE READ (which btw Aurora Replicas for Aurora MySQL always use for operations on InnoDB tables) you operate on read view of database and see only data committed before BEGIN of transaction.
This means that SELECTs in one transaction will see the same data, even if changes were made by other transactions.
By comparison, with transaction isolation level READ COMMITTED subsequent selects in one transaction may see different data - that was committed in between them by other transactions.

How do you handle Amazon Kinesis Record duplicates?

According to the Amazon Kinesis Streams documentation, a record can be delivered multiple times.
The only way to be sure to process every record just once is to temporary store them in a database that supports Integrity checks (e.g. DynamoDB, Elasticache or MySQL/PostgreSQL) or just checkpoint the RecordId for each Kinesis shard.
Do you know a better / more efficient way of handling duplicates?
We had exactly that problem when building a telemetry system for a mobile app. In our case we were also unsure that producers where sending each message exactly once, therefore for each received record we calculated its MD5 on the fly and checked whether it is presented in some form of a persistent storage, but indeed what storage to use is the trickiest bit.
Firstly, we tried trivial relational database, but it quickly became a major bottleneck of the whole system as this isn't just read-heavy but also write-heavy case, since the volume of data going though Kinesis was quite significant.
We ended up having a DynamoDB table storing MD5's for each unique message. The issue we had was that it wasn't so easy to delete the messages - even though our table contained partition and sort keys, DynamoDB does not allow to drop all records with a given partition key, we had to query all of the to get sort key values (which wastes time and capacity). Unfortunately, we had to just simply drop the whole table once in a while. Another way suboptimal solution is to regularly rotate DynamoDB tables which store message identifiers.
However, recently DynamoDB introduced a very handy feature - Time To Live, which means that now we can control the size of a table by enabling auto-expiry on a per record basis. In that sense DynamoDB seems to be quite similar to ElastiCache, however ElastiCache (at least Memcached cluster) is much less durable - there is no redundancy there, and all data residing on terminated nodes is lost in case of scale in operation or failure.
The thing you mentioned is a general problem of all queue systems with "at least once" approach. Also, not just the queue systems, the producers and consumers both may process the same message multiple times (due to ReadTimeout errors etc.). Kinesis and Kafka both uses that paradigm. Unfortunately there is not an easy answer for that.
You may also try to use an "exactly-once" message queue, with stricter transaction approach. For example AWS SQS does that: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ . Be aware, SQS throughput is far smaller than Kinesis.
To solve your problem, you should be aware of your application domain and try to solve it internally like you suggested (database checks). Especially when you communicate with an external service (let's say an email server for example), you should be able to recover the operation state in order to prevent double processing (because double sending in the email server example, may result in multiple copies of the same post in the recipient's mailbox).
See also the following concepts;
At-least-once Delivery: http://www.cloudcomputingpatterns.org/at_least_once_delivery/
Exactly-once Delivery: http://www.cloudcomputingpatterns.org/exactly_once_delivery/
Idempotent Processor: http://www.cloudcomputingpatterns.org/idempotent_processor/