DynamoDB version control using sort keys

Has anyone implemented versioning using sort keys as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html?
I'm trying to implement this in TypeScript to build a database that keeps versions of its items. Is there any way of doing this with updateItem, or is a get + put needed?
Any sample to get me started or help is much appreciated!

The concept of versioning using sort keys involves creating a completely new item that uses the same Partition Key and a different Sort Key.
DynamoDB offers operations that update values within an item atomically; this is perfect for when you have something like a counter or a quantity and you want to decrease/increase it without having to read its value first. - Docs here.
In the case you're trying to achieve, as mentioned, you are essentially creating a new item. DynamoDB, by itself, has no concept of versioning; what this pattern does is cleverly leverage the relation between Partition Key and Sort Key - the fact that one PK can have multiple SKs associated with it - to correlate multiple rows of the same table.
To answer your question: if your only source of truth (or data store) is DynamoDB, then yes, your client will have to first query the table to find the latest version of the item being updated, and then insert the new version.
If you are recording this information elsewhere and using DynamoDB only to store the versions, then no, one put operation will be enough - but again, this assumes you can retrieve that info somewhere else.
In terms of samples, the official documentation of the AWS SDK is always a good start; in your case I assume you'll want the JavaScript one, which you can find here.
At a very high level, you'll have to do the following (a sketch follows the list):
Create an AWS.DynamoDB() client.
Execute a query using the dynamodb.query() method and specifying the PK of the item you want to update.
Go through the items (rows) returned by the previous query and find the one with the highest version number as SK.
Put a new item using the dynamodb.putItem() method passing an item with the incremented version number as SK and same PK.
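If it helps to get started, here's a minimal TypeScript sketch of those steps using the AWS SDK for JavaScript v2 low-level client. The table name ("MyTable") and the key attribute names ("PK", and a numeric "SK" holding the version number) are assumptions for illustration, not anything prescribed by the docs:

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

async function putNewVersion(pk: string, attributes: DynamoDB.AttributeMap): Promise<void> {
  // 1. Query all rows (versions) that share this partition key.
  const result = await dynamodb
    .query({
      TableName: "MyTable",
      KeyConditionExpression: "PK = :pk",
      ExpressionAttributeValues: { ":pk": { S: pk } },
    })
    .promise();

  // 2. Find the highest version number among the returned sort keys.
  const latest = (result.Items ?? []).reduce(
    (max, item) => Math.max(max, Number(item.SK?.N ?? "0")),
    0
  );

  // 3. Put a new item with the same PK and the incremented version as SK.
  await dynamodb
    .putItem({
      TableName: "MyTable",
      Item: { ...attributes, PK: { S: pk }, SK: { N: String(latest + 1) } },
    })
    .promise();
}
```

Note that if the version number is the sort key, you can also let DynamoDB do the "find the max" work for you by querying with ScanIndexForward: false and Limit: 1.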

You can do the technique described by Amazon with a read and then a write, or more accurately, a read and then two writes (since they want to update both v0 and a new v4!). Often, you need the extra read because you want to build the new version v4 based on data you read from v3 (or equivalently, v0) - but in case you don't need that, the read is not necessary, and two writes are enough:
You first do an UpdateItem to v0 which increments the "Latest" attribute, sets whatever attributes you want to set in the new version, and uses the ReturnValues parameter to ask the update operation to return the new "Latest" attribute.
Then you write with PutItem the new row for v4 (where 4 is the "Latest" you just read).
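As a hedged sketch of those two writes (same SDK v2 client; "MyTable", "PK"/"SK", and a numeric "Latest" attribute on the v0 row are assumed placeholder names, following the docs' pattern):

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

async function createNewVersion(pk: string, newAttributes: DynamoDB.AttributeMap): Promise<void> {
  // Write 1: atomically bump "Latest" on the v0 row and get the new value
  // back. (In the full pattern you'd also SET the new attribute values on
  // v0 here, since v0 mirrors the latest version; omitted for brevity.)
  const updated = await dynamodb
    .updateItem({
      TableName: "MyTable",
      Key: { PK: { S: pk }, SK: { S: "v0" } },
      UpdateExpression: "ADD Latest :one",
      ExpressionAttributeValues: { ":one": { N: "1" } },
      ReturnValues: "UPDATED_NEW",
    })
    .promise();

  const latest = updated.Attributes!.Latest.N; // e.g. "4"

  // Write 2: insert the new version row itself, under the same PK.
  await dynamodb
    .putItem({
      TableName: "MyTable",
      Item: { ...newAttributes, PK: { S: pk }, SK: { S: `v${latest}` } },
    })
    .promise();
}
```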
This approach is safe in the sense that if two clients try to create two new versions at the same time, each one will pick a different "Latest" and both will appear in the version history. It is not safe, however, in the sense that if the client dies between step 1 and step 2, you'll have a "hole" in the version history - but I don't think there's any implementation of this technique that doesn't suffer from this problem.
After saying this, I want to reiterate what I said in the first paragraph: In most realistic use cases, the new version would be based on the old version, so your code anyway needs to read the old version first, then decide how to change it - and then write it (twice). You can't avoid the read in these cases. By the way, in this case the first write (to v0) would be a conditional update to verify that you only write the new version if the old version is still the same one ("Latest" is the same one you read during the read) - otherwise you'd be basing your modification on a non-current version. This is an example of optimistic locking.
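The conditional variant of that first write could look like this sketch (same assumed names as above):

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

// Bump "Latest" on v0 only if it still holds the value we read earlier.
// Throws ConditionalCheckFailedException otherwise - re-read and retry.
async function bumpIfUnchanged(pk: string, expectedLatest: number): Promise<number> {
  const updated = await dynamodb
    .updateItem({
      TableName: "MyTable",
      Key: { PK: { S: pk }, SK: { S: "v0" } },
      UpdateExpression: "ADD Latest :one",
      ConditionExpression: "Latest = :expected",
      ExpressionAttributeValues: {
        ":one": { N: "1" },
        ":expected": { N: String(expectedLatest) },
      },
      ReturnValues: "UPDATED_NEW",
    })
    .promise();
  return Number(updated.Attributes!.Latest.N);
}
```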

Related

What's the cheapest way to store an auto increment indexed list of values in AWS?

I have a web application that uses DynamoDB to store my large JSON objects and perform simple CRUD operations on them via a web API. I would like to add a new table that acts as a categorization of these values. The user should be able to select from a selection box which category an object belongs to. If a desirable category does not exist, the user should be able to create a new category, specifying a name which will be available to other objects in the future.
It is critical to the application that every one of these categories be given an integer ID that auto-increments, starting at 1. These auto-generated numbers will turn into reproducible serial numbers for back-end reports that will not use the user-visible text name.
So I would like to have a simple API available from the web frontend that allows me to:
A) GET /category : produces { int : string, ... } of all categories mapped to an ID
B) PUSH /category : accepts string and stores the string to the next integer
Here are some ideas for how to handle this kind of project.
Store it in DynamoDB with integer indexes. This has some benefits, but it leaves a lot to be desired. Firstly, there's no auto-incrementing ID in DynamoDB, but I could definitely get the state of the table, create a new ID, and store the result. This might have issues with consistency and race conditions, but there's probably a way to achieve it safely. It might, however, be a big anti-pattern to use DynamoDB this way.
Store it in DynamoDB as one object in a table with some random index. Just store the mapping as a JSON object. This really forgets the notion of tables in DynamoDB and uses it as a simple file. It might also run into some issues with race conditions.
Use AWS ElastiCache to have a Redis key-value store. This might be "the right" decision, but the downside is that ElastiCache is an always-on DB offering where you pay per hour. For a low-traffic web site like mine I'd be paying a minimum of $12/mo I think, and I would really like for this to be pay-per-access/update due to the low volume. I'm not sure there's an auto-increment feature for Redis built in the way I'd need it, but it's pretty trivial to make a transaction that gets the length of the table, adds one, and stores a new value. Race conditions are easily avoided with this solution.
Use a SQL database like AWS Aurora or MySQL. This has the same upsides as Redis, but it's more overkill than Redis, costs a lot more, and is still always on.
Run my own in-memory web service, or MongoDB, etc... but you're still paying for containers constantly running. Writing my own thing is obviously silly, and I'm sure there are services that match this issue perfectly, but they'd all require a constant container to run.
Is there a good way to just store a simple list or an integer mapping like this without a constant monthly cost? Is there a better way to do this with DynamoDB?
Store the maxCounterValue as an item in DynamoDB.
For the PUSH /category, perform the following:
Get the current maxCounterValue.
TransactWrite:
Put the category name and id into a new item with id = maxCounterValue + 1.
Update maxCounterValue to maxCounterValue + 1, adding a ConditionExpression to check that maxCounterValue = :valueFromGetOperation.
If the TransactWrite fails, start again at step 1 and retry up to X more times (a sketch follows this list).
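A hedged TypeScript sketch of that flow (AWS SDK for JavaScript v2; the table name, the "counter" item's key, and the attribute names are illustrative assumptions, and the counter item is assumed to be seeded with maxCounterValue = 0):

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();
const TABLE = "Categories";

async function pushCategory(name: string, maxRetries = 3): Promise<number> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Step 1: read the current maxCounterValue.
    const counter = await dynamodb
      .getItem({ TableName: TABLE, Key: { id: { S: "counter" } } })
      .promise();
    const current = Number(counter.Item?.maxCounterValue?.N ?? "0");
    const next = current + 1;

    try {
      // Step 2: one transaction that both inserts the new category and
      // bumps the counter; it is cancelled if the counter moved since
      // our read.
      await dynamodb
        .transactWriteItems({
          TransactItems: [
            {
              Put: {
                TableName: TABLE,
                Item: { id: { S: `category#${next}` }, name: { S: name } },
              },
            },
            {
              Update: {
                TableName: TABLE,
                Key: { id: { S: "counter" } },
                UpdateExpression: "SET maxCounterValue = :next",
                ConditionExpression: "maxCounterValue = :current",
                ExpressionAttributeValues: {
                  ":next": { N: String(next) },
                  ":current": { N: String(current) },
                },
              },
            },
          ],
        })
        .promise();
      return next; // success: this is the new category's serial number
    } catch (err) {
      // TransactionCanceledException: another writer won the race;
      // loop back, re-read the counter, and try again.
    }
  }
  throw new Error("Could not allocate a category id");
}
```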

Are DynamoDB conditional writes transactional?

I'm having trouble wrapping my head around the dichotomy of DDB providing conditional writes but also being eventually consistent. These two truths seem to be at odds with each other.
In the classic scenario, user Bob updates key A and sets the value to "FOO". User Alice reads from a node that hasn't received the update yet, and so it gets the original value "BAR" for the same key.
If Bob and Alice write to different nodes on the cluster without condition checks, it's possible to have a conflict where Alice and Bob wrote to the same key concurrently and DDB does not know which update should be the latest. This conflict has to be resolved by the client on next read.
But what about when conditional writes are used?
User Bob sends their update for A as "FOO" if the existing value for A is "BAR".
User Alice sends their update for A as "BAZ" if the existing value for A is "BAR".
Locally, each node can check whether it has the original "BAR" value and go through with the update. But the only way to know the true state of A across the cluster is to first make a strongly consistent read across the cluster. This strongly consistent read must be blocking for either Alice or Bob; otherwise they could both make a strongly consistent read at the same time.
So here is where I'm getting confused about the nature of DDB's conditional writes. It seems to me that either:
Conditional writes are only evaluated locally. Merge conflicts can still occur.
Conditional writes are evaluated cross-cluster.
If it is #2, the only way I see that working is if:
A lock is created for the key.
A strongly consistent read is made.
Let's say it's #2. Now where does this leave Bob's update? The update was made to node 2 and sent to node 1, and we have a majority quorum. But to make those updates available to Alice when they do their own conditional write, those updates need to be flushed from the WAL. So in a conditional write, are the updates always flushed? Are writes always flushed in general?
There have been other questions like this here on SO, but the answers were a repeat of, or a link to, the AWS documentation, which doesn't really explain this (or I missed it).
DynamoDB conditional writes are "transactional" writes, but how they're done is not public information & is perhaps proprietary intellectual property.
DynamoDB developers are the only ones with this information.
Your issue is that you're looking at this from a node perspective - I have gone through every mention of "node" in the DynamoDB documentation & they're all references to Node.js or DAX nodes, not database nodes.
While there can be outdated reads - yes, that would indicate some form of node - there are no database nodes per se when doing conditional writes.
User Bob sends their update for A as "FOO" if the existing value for A is "BAR". User Alice sends their update for A as "BAZ" if the existing value for A is "BAR".
Whoever's request gets there first is the one that goes through first.
The next request will simply fail, meaning you then need to make a new read request to obtain the latest value before proceeding with the second, later write.
The Amazon DynamoDB developer guide shows this very clearly.
Note that there are no nodes, replicas, etc. - there is only one reference, to the DynamoDB table.
Conditional writes are probably evaluated cross-cluster & a strongly consistent read is probably made, but Amazon has not made this information public.
Ermiya Eskandary is correct that the exact details of DynamoDB's implementation aren't public knowledge, and are also subject to change in the future while preserving the documented guarantees of the API. Nevertheless, various documents, and especially video presentations that Amazon developers gave in the past, make it relatively clear how this works under the hood - at least in broad strokes. I'll try to explain my understanding here:
Note that this might not be 100% accurate, and I don't have any inside knowledge about DynamoDB.
DynamoDB keeps three copies of each item.
One of the nodes holding a copy of a specific item is designated the leader for this item (there isn't a single "leader" - it can be a different leader per item). As far as I know, we have no details on which protocol is used to choose this leader (of course, if nodes go down, the leader choice changes).
A write to an item is started on the leader, who serializes writes to the same item. Note how DynamoDB conditional updates can only read and update the same item, so the same node (the leader) can read and write the item with just a local lock. After the leader evaluates the condition and decides to write, it also sends an update to the two other nodes - returning success to the user only after two of the three nodes successfully wrote the data (to ensure durability).
As you probably know, DynamoDB reads have two options, consistent and eventually-consistent: An eventually-consistent read reads from one of the three replicas at random, and might not yet see the result of a successful write (if the write wrote two copies, but not yet the third one). A consistent read reads from the leader, so it is guaranteed to read the previously-written data.
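For reference, this choice is exposed per request through the ConsistentRead flag. A tiny sketch (SDK v2, placeholder table/key names):

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

async function demoReads(): Promise<void> {
  // Eventually-consistent read (the default): may be served by any replica.
  await dynamodb
    .getItem({ TableName: "MyTable", Key: { PK: { S: "item#1" } } })
    .promise();

  // Strongly consistent read: guaranteed to reflect all prior successful
  // writes (per the explanation above, served by the leader).
  await dynamodb
    .getItem({
      TableName: "MyTable",
      Key: { PK: { S: "item#1" } },
      ConsistentRead: true,
    })
    .promise();
}
```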
Finally you asked about DynamoDB's newer and more expensive "Transaction" support. This is a completely different feature. DynamoDB "Transactions" are all about reads and writes to multiple items in the same request. As I explained above, the older conditional-updates feature only allows a read-modify-write operation to involve a single item at a time, and therefore has a simpler implementation - where a single node (the leader) can serialize concurrent writes and make the decisions without a complex distributed algorithm (however, a complex distributed algorithm is needed to pick the leader).

Dealing with read eventual consistency by retrying GetItem

I'm building an API #1 that creates an item in DynamoDB. I'm building another API #2 that retrieves an item using a GSI (the input key may not exist). But GSI reads can only be eventually consistent, and I don't want the scenario where API #1 creates an item but API #2 doesn't get that item.
So I am thinking of this:
API #1 creates the item via UpdateItem.
API #1 tries to retrieve the item using a Query on the GSI (GSIs don't support GetItem). It keeps retrying with exponential backoff until it gets the item. Once this happens, eventual consistency should be over.
API #2 retrieves the item using the same GSI Query. Since API #1 already got the item, this should get the item on the first try.
Note: I don't think API #2 can do the retries instead, because its input key may never exist.
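Concretely, something like this minimal sketch is what I have in mind for step 2 (SDK v2; the table, index, and key names are placeholders):

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

// Poll the GSI with exponential backoff until the freshly written item
// becomes visible (or give up after maxAttempts).
async function waitForItem(gsiKey: string, maxAttempts = 5): Promise<DynamoDB.AttributeMap | undefined> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await dynamodb
      .query({
        TableName: "MyTable",
        IndexName: "MyGsi",
        KeyConditionExpression: "gsiPk = :k",
        ExpressionAttributeValues: { ":k": { S: gsiKey } },
      })
      .promise();
    if (result.Items && result.Items.length > 0) {
      return result.Items[0];
    }
    // Back off: 100 ms, 200 ms, 400 ms, ...
    await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
  }
  return undefined;
}
```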
Would this work? Are there better solutions?
The property you are looking for is known in literature as monotonic read consistency - it's eventual consistency (after enough time you'll always read the new value), but additionally - when you read the new value once, further reads will not return the older value.
I couldn't find (and I tried to look hard...) any documentation guaranteeing that DynamoDB eventually-consistent reads have monotonic read consistency. Based on presentations I saw on DynamoDB's implementation (I don't have any inside knowledge), I believe that it in fact does not have monotonic read consistency:
From what I understood in those presentations, DynamoDB saves each piece of data on three nodes. One of the three nodes is the "leader" (for this piece of data) and writes go to it - and so do consistent reads. But eventually-consistent reads will go to one of the three nodes at random. So the following scenario is possible:
A write is supposed to update three copies of the GSI on three nodes - X, Y and Z - but at this point only X and Y were updated, Z wasn't yet.
API 1 reads from the GSI and randomly gets to ask node X and gets the new value.
Now API 2 reads from the GSI. It randomly gets node Z, and gets the old value!
So it will be possible that after your application finds the new value, another read will not find it :-(
If someone else can find better documentation for this issue than just my "what I understood from presentations" I'd love to read their answer too.

Using column versions for time series

In the official documentation there is a passage whose reasoning I can't fully understand:
When working with time series, do not leverage the transactional behavior of rows. Changes to data in an existing row should be stored as a new, separate row, not changed in the existing row. This is an easier model to construct, and it enables you to maintain a history of activity without relying upon column versions.
The last sentence is not obvious or concrete, so it doesn't convince me. For now, using versioning to update a cell's data still looks to me like a good fit for the update task. At the very least, versions are managed by Bigtable, so it's a simpler solution.
Can anybody provide a clearer explanation of why versioning shouldn't be used in that use case?
Earlier in that page, under Patterns for row key design, a bit more detail is explained. The high-level view is that using row keys instead of column versions will:
Make it easier to run queries across your data, allowing for scanning of less data.
Avoid going over the recommended maximum row size.
The one caveat being:
It is acceptable to use versions of a column where the use case is actually amending a value, and the value's history is important. For example, suppose you did a set of calculations based on the closing price of ZXZZT, and initially the data was mistakenly entered as 559.40 for the closing price instead of 558.40. In this case, it might be important to know the value's history in case the incorrect value had caused other miscalculations.

DynamoDB Concurrency Issue

I'm building a system in which many DynamoDB (NoSQL) tables contain data, and data in one table references data in another table.
Multiple processes access the same item in a table at the same time. I want to ensure that every process sees up-to-date data and that they aren't writing to that item at the exact same time, because they are all updating the item with different data.
I would love some suggestions on this as I am stuck right now and don't know what to do. Thanks in advance!
Optimistic locking is a strategy to ensure that the client-side item that you are updating (or deleting) is the same as the item in Amazon DynamoDB. If you use this strategy, your database writes are protected from being overwritten by the writes of others, and vice versa.
With optimistic locking, each item has an attribute that acts as a version number. If you retrieve an item from a table, the application records the version number of that item. You can update the item, but only if the version number on the server side has not changed. If there is a version mismatch, it means that someone else has modified the item before you did. The update attempt fails, because you have a stale version of the item. If this happens, you simply try again by retrieving the item and then trying to update it. Optimistic locking prevents you from accidentally overwriting changes that were made by others. It also prevents others from accidentally overwriting your changes.
To support optimistic locking, the AWS SDK for Java provides the @DynamoDBVersionAttribute annotation. In the mapping class for your table, you designate one property to store the version number, and mark it using this annotation. When you save an object, the corresponding item in the DynamoDB table will have an attribute that stores the version number. The DynamoDBMapper assigns a version number when you first save the object, and it automatically increments the version number each time you update the item. Your update or delete requests succeed only if the client-side object version matches the corresponding version number of the item in the DynamoDB table.
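The annotation is specific to the Java SDK's DynamoDBMapper; with the plain JavaScript SDK, the same pattern is simply a conditional update on a version attribute. A hedged TypeScript sketch (table and attribute names are assumptions):

```typescript
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

// Update an item only if its "version" attribute still matches the version
// we read earlier; otherwise DynamoDB throws ConditionalCheckFailedException
// and the caller should re-read the item and retry.
async function updateWithVersionCheck(id: string, readVersion: number, newData: string): Promise<void> {
  await dynamodb
    .updateItem({
      TableName: "MyTable",
      Key: { id: { S: id } },
      UpdateExpression: "SET #d = :data, #v = :next",
      ConditionExpression: "#v = :expected",
      // Aliases, since attribute names like "data" are DynamoDB reserved words.
      ExpressionAttributeNames: { "#d": "data", "#v": "version" },
      ExpressionAttributeValues: {
        ":data": { S: newData },
        ":next": { N: String(readVersion + 1) },
        ":expected": { N: String(readVersion) },
      },
    })
    .promise();
}
```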
ConditionalCheckFailedException is thrown if:
You use optimistic locking with @DynamoDBVersionAttribute and the version value on the server is different from the value on the client side.
You specify your own conditional constraints while saving data by using DynamoDBMapper with DynamoDBSaveExpression and these constraints fail.
Note
DynamoDB global tables use a “last writer wins” reconciliation between concurrent updates. If you use global tables, last writer policy wins. So in this case, the locking strategy does not work as expected.