I know that AWS DynamoDB reads can be either eventually or strongly consistent. I also read a document that says: "The individual PutItem and DeleteItem operations specified in BatchWriteItem are atomic; however BatchWriteItem as a whole is not."
But I still don't understand what this means for write behavior: is it synchronized or not?
If this is an awkward question, please tell me.
BatchWriteItem is a batch API, meaning it allows you to specify a number of different operations to be submitted to DynamoDB for execution in the same request. So when you submit a BatchWriteItem request you are asking DynamoDB to perform a number of PutItem and/or DeleteItem requests for you.
The claim that the individual PutItem and DeleteItem requests are atomic means that each of them is atomic with respect to other requests that may want to modify the same item (identified by its partition/sort keys). In other words, the item cannot be corrupted by two PutItem requests executing at the same time, each modifying some part of the item and leaving it in an inconsistent state.
The claim that the whole BatchWriteItem request is not atomic just means that the sequence of PutItem and/or DeleteItem requests is not guaranteed to be isolated: other PutItem or DeleteItem requests, whether single or batched, can execute at the same time as the BatchWriteItem request and affect the state of the table(s) in between the individual PutItem/DeleteItem requests that make up the batch.
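For concreteness, here is a minimal boto3 sketch of what such a batch request looks like (the table name, key names and values are made up; each entry in the list is either a PutRequest or a DeleteRequest):

import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.batch_write_item(
    RequestItems={
        "MyTable": [  # hypothetical table name
            {"PutRequest": {"Item": {"partitionKey": {"N": "1000"},
                                     "name": {"S": "Alpha"},
                                     "value": {"N": "100"}}}},
            {"DeleteRequest": {"Key": {"partitionKey": {"N": "2000"}}}},
        ]
    }
)
# Whatever DynamoDB could not process is returned here and must be retried by the caller.
print(response["UnprocessedItems"])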
To illustrate the point, let's say you have a BatchWriteItem request that consists of the following two calls:
PutItem (partitionKey = 1000; name = 'Alpha'; value = 100)
DeleteItem (partitionKey = 1000)
And that at approximately the same time you submitted this request, there is another request that contains the following operation:
DeleteItem (partitionKey = 1000)
It is possible for the other DeleteItem request to execute in between the two operations of the batch: the PutItem succeeds, but by the time the batch's own DeleteItem runs, the item has already been deleted by the other request. This is one example of how the whole batch operation is not atomic.
Can we use DynamoDB optimistic locking with a batchWriteItem request? The AWS docs on optimistic locking mention that a ConditionalCheckFailedException is thrown when the version value is different while updating. In the case of a batchWriteItem request, will the whole batch fail, or only the record with a different version value? Will the record that failed due to a different version value be returned as an unprocessed record?
You cannot. You can verify this by looking at the low-level syntax and noticing that there is no way to specify a condition expression:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
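If you do need the version check, one option is to fall back to individual PutItem calls with a ConditionExpression; here is a minimal boto3 sketch, assuming a numeric version attribute (table, attribute names and values are hypothetical):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

try:
    dynamodb.put_item(
        TableName="MyTable",                        # hypothetical table
        Item={"pk": {"S": "customer#1"},            # hypothetical item
              "version": {"N": "3"},
              "email": {"S": "alpha@example.com"}},
        ConditionExpression="version = :expected",  # reject the write if someone else updated the item
        ExpressionAttributeValues={":expected": {"N": "2"}},
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("stale version, reload and retry")
    else:
        raise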
A Lambda needs to get all results from DynamoDB, perform some processing on each record, and trigger a Step Functions workflow. Although DynamoDB returns paginated results, the Lambda will time out if there are too many pages to process within the 15-minute Lambda limit. Is there any workaround that keeps using Lambda, other than moving to Fargate?
Overview of the current Lambda

while True:
    l, nextToken = get list of records from DynamoDB
    for each record in l:
        perform some preprocessing like reading a file and triggering a workflow
    if nextToken == None:
        break
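In boto3 terms that loop is the standard Scan pagination pattern, roughly like this sketch (the table name and process_record are placeholders):

import boto3

dynamodb = boto3.client("dynamodb")
kwargs = {"TableName": "MyTable"}       # hypothetical table name

def process_record(record):
    ...                                 # placeholder: read a file, trigger a workflow, etc.

while True:
    page = dynamodb.scan(**kwargs)
    for record in page["Items"]:
        process_record(record)
    if "LastEvaluatedKey" not in page:
        break                           # no more pages
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]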
I assume processing one record can fit inside the 15-minute lambda limit.
What you can do is make your original Lambda an orchestrator that calls a worker Lambda to process a single record.
Orchestrator Lambda

while True:
    l, nextToken = get list of records from DynamoDB
    for each record in l:
        call the worker lambda by passing the record as the event
    if nextToken == None:
        break
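The per-record call in the orchestrator can be an asynchronous Lambda invocation, for example (a sketch; the worker function name is hypothetical, and InvocationType='Event' means the orchestrator does not wait for the worker to finish):

import json
import boto3

lambda_client = boto3.client("lambda")

record = {"partitionKey": {"N": "1000"}}   # one item from the current page (example shape)

lambda_client.invoke(
    FunctionName="worker-lambda",          # hypothetical worker Lambda name
    InvocationType="Event",                # asynchronous, fire-and-forget
    Payload=json.dumps(record),            # the single record becomes the worker's event
)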
Worker Lambda
perform some preprocessing like reading a file and triggering a workflow
You can use SQS to process these in rapid succession. You can even use it to perform them more or less in parallel rather than sequentially.
Lambda reads from DynamoDB -> breaks each entry into a JSON object -> sends the JSON object to SQS -> SQS queues them out to multiple invoked Lambdas -> each of those Lambdas is designed to handle one single entry and finish.
Doing this allows you to split up long tasks that may take many hours across multiple Lambda invocations, by designing the second Lambda to handle only one iteration of the task and using SQS as your loop/iterator. You can configure SQS to deliver as fast as possible or one message at a time (though if you deliver one at a time you will have to manage the time-to-live and staleness settings of the messages in the queue).
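The fan-out step is then just one send_message call per item inside the same pagination loop, roughly (the queue URL is a placeholder):

import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"   # hypothetical queue

item = {"partitionKey": {"N": "1000"}}   # one item from the current Scan page (example shape)

# Each item becomes one JSON message; SQS then drives one worker invocation per message.
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(item))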
In addition, if this is a regular thing where new items get added to the table and then have to be processed, you should make use of DynamoDB Streams: every time a new item is added, the stream triggers a Lambda on that new item, allowing you to run your workflow in real time as items are added.
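A stream-triggered handler looks roughly like this sketch, assuming the stream is configured with the NEW_IMAGE (or NEW_AND_OLD_IMAGES) view type and start_workflow stands in for the real processing step:

def handler(event, context):
    # Invoked by the DynamoDB Stream; each record describes one change to the table.
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_item = record["dynamodb"]["NewImage"]   # present with NEW_IMAGE / NEW_AND_OLD_IMAGES
            start_workflow(new_item)                    # placeholder for the actual processing

def start_workflow(item):
    ...   # placeholder: trigger the Step Functions workflow, etc.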
I am trying to get the list of attributes that were overwritten in a DynamoDB batch operation, but the response doesn't include this information.
Is there any way to get the list of items that were overwritten when using batch write?
No: the API docs clearly list what information is returned, and that's not among it.
If you need the before state, you have to do individual PutItem requests with the ReturnValues parameter set to ALL_OLD - docs.
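For example, in boto3 (table and attribute names are made up), each individual write would look something like:

import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.put_item(
    TableName="MyTable",                    # hypothetical table
    Item={"pk": {"S": "customer#1"}, "email": {"S": "alpha@example.com"}},
    ReturnValues="ALL_OLD",                 # return the item as it was before the overwrite
)
# Empty if the item was newly created; otherwise the previous attribute values.
old_item = response.get("Attributes", {})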
We use DynamoDB UpdateItem.
This acts as an "upsert", as we can learn from the documentation:
Edits an existing item's attributes, or adds a new item to the table if it does not already exist. [...]
When we make a request, we set ReturnValues to ALL_OLD to determine whether an item was created or an existing item was updated. This works great and allows us to differentiate between update and create.
As an additional requirement we also want to return ALL_NEW, but still know the type of operation that was performed.
Question: Is this possible to do in a single request or do we have to make a second (get) request?
This is not supported in DynamoDB: there is no ALL or NEW_AND_OLD_IMAGES return value as there is in DynamoDB Streams, but you can always go DIY.
When you do the UpdateItem call, you have the UpdateExpression, which is basically the list of changes to apply to the item. Given that you told DynamoDB to return the item as it looked before the operation, you can construct the new state locally.
Just create a copy of the ALL_OLD response and locally apply the changes from the UpdateExpression to it. That's definitely faster than two API calls at the cost of a slightly more complex implementation.
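A minimal sketch of that idea (table, key and attribute names are hypothetical; a real implementation would have to mirror whatever your UpdateExpression actually does):

import boto3

dynamodb = boto3.client("dynamodb")

new_email = {"S": "alpha@example.com"}

response = dynamodb.update_item(
    TableName="MyTable",                          # hypothetical table
    Key={"pk": {"S": "customer#1"}},
    UpdateExpression="SET email = :email",
    ExpressionAttributeValues={":email": new_email},
    ReturnValues="ALL_OLD",                       # lets us tell "created" apart from "updated"
)

old_item = response.get("Attributes")             # empty/missing -> the item was just created
created = not old_item

# Rebuild the new state locally by applying the same change to a copy of the old item.
new_item = dict(old_item or {"pk": {"S": "customer#1"}})
new_item["email"] = new_email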
We have a service which inserts certain values into DynamoDB. For the sake of this question, let's say it's a key:value pair, i.e., customer_id:customer_email. The inserts don't happen that frequently, and once an insert is done, that specific key doesn't get updated.
What we have done is create a client library which, provided with a customer_id, will fetch the customer_email from DynamoDB.
Given that the customer_id data is static, we were thinking of adding a cache in front of the table, but one thing we are not sure about is what will happen in the following use case:
client_1 uses our library to fetch customer_email for customer_id = 2.
The customer doesn't exist, so API Gateway returns not found.
API Gateway will cache this response.
For any subsequent calls, this cached response will be sent.
Now another system inserts customer_id = 2 with its email. This system doesn't know whether a response for it has been cached previously, or even that any other system has fetched this specific data. How can we invalidate the cache for this specific customer_id when it gets inserted into DynamoDB?
You can send a request to the API endpoint with a Cache-Control: max-age=0 header, which will cause API Gateway to refresh the cache entry for that request.
This could open your application up to attack, as a bad actor could simply flood an expensive endpoint with lots of traffic and overwhelm your servers/database. To safeguard against that, it's best to require a signed request.
In case it's useful to people, here's .NET code to create the signed request:
https://gist.github.com/secretorange/905b4811300d7c96c71fa9c6d115ee24
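For reference, the same idea as a Python sketch using botocore's SigV4 signer (the endpoint URL and region are placeholders, and the signing identity needs permission to invalidate the cache):

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/customers/2"   # hypothetical endpoint

# Build and sign a request carrying Cache-Control: max-age=0 so API Gateway refreshes the entry.
aws_request = AWSRequest(method="GET", url=url, headers={"Cache-Control": "max-age=0"})
SigV4Auth(boto3.Session().get_credentials(), "execute-api", "us-east-1").add_auth(aws_request)

response = requests.get(url, headers=dict(aws_request.headers))
print(response.status_code, response.text)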
We've built a Lambda which takes care of re-filling the cache with updated results. It's quite a manual process, with very little re-usable code, but it works.
The Lambda is triggered by the application itself, according to the application's needs. For example, with CRUD operations the Lambda is triggered upon successful execution of POST, PATCH and DELETE on a specific resource, in order to clear the cached general GET request (i.e. clear GET /books whenever POST /book succeeds).
Unfortunately, if you have a view with a server-side paginated table, you are going to face all sorts of issues, because invalidating /books is not enough: you may actually have /books?page=2, /books?page=3 and so on... a nightmare!
I believe API Gateway should allow more granular control of cache entries, otherwise many use cases aren't covered. It would be enough if they allowed choosing a root cache group for each request, so that we could manage cache entries by group rather than by single request (which, imho, is also less common).
Did you look at this https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html ?
There is a way to invalidate the entire cache or a particular cache entry.
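For example, flushing the entire stage cache can be done through the API Gateway management API (REST API id and stage name are placeholders); a single entry is invalidated per request with the Cache-Control: max-age=0 header approach above:

import boto3

apigateway = boto3.client("apigateway")

# Drops every cached response for the stage in one call.
apigateway.flush_stage_cache(
    restApiId="abc123",   # hypothetical REST API id
    stageName="prod",     # hypothetical stage name
)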