DynamoDB eventually consistent read ordering for sequentially written data

I have an application that appends events to a DynamoDB table, with userid as the hash key and an incrementing sequence number as the range key (to guarantee the sort order). The table is append-only. Let's say a writer writes events for userid '1'. I have a reader that reads events for hash key userid '1', starting from the last sequence number it has seen.
If the reader uses strongly consistent reads, I know it will get the data in the same order in which it was written.
If the reader uses eventually consistent reads, can I expect the same behavior?

You cannot expect the same behavior, in that the items that were written last might not yet be visible to an eventually consistent read.
Say for example you have 3 writes in a row:
userId=1, seqNumber=1
userId=1, seqNumber=2
userId=1, seqNumber=3
If you do an eventually consistent read, you are not guaranteed to get all of the items. Your query would still return the items in order if that is how you are inserting them, since a Query returns items sorted by range key. If you want to get all of the most recent writes, you have to use a strongly consistent read.
From the DynamoDB FAQ:
...
Eventually Consistent Reads (Default) – the eventual consistency
option maximizes your read throughput. However, an eventually
consistent read might not reflect the results of a recently completed
write. Consistency across all copies of data is usually reached within
a second. Repeating a read after a short time should return the
updated data
...
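As a defensive measure on the reader side, here is a minimal sketch of a reader that only advances while sequence numbers are contiguous. It assumes the writer increments the sequence number by exactly 1 per event, and the function name `accept_contiguous` is hypothetical. The answer above does not establish whether an eventually consistent read can return a gap in the middle of the sequence, so this simply refuses to skip ahead if one appears.

```python
from typing import Iterable, List, Tuple


def accept_contiguous(
    last_seq: int, page: Iterable[Tuple[int, str]]
) -> Tuple[int, List[str]]:
    """Consume (seq, payload) pairs only while seq is exactly last_seq + 1.

    Assumption: the writer increments the sequence number by 1 per event.
    """
    accepted: List[str] = []
    for seq, payload in sorted(page):
        if seq <= last_seq:
            continue  # already processed on an earlier read
        if seq != last_seq + 1:
            break  # gap: stop here and retry the read later
        accepted.append(payload)
        last_seq = seq
    return last_seq, accepted
```

A reader would call this on every page it gets back and persist the returned sequence number. A page that starts beyond `last_seq + 1` is ignored until a later read fills the gap.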

Related

CockroachDB read transactions

I've been reading about the read-only lock-free transactions as implemented in Google Spanner and CockroachDB. Both claim to be implemented in a lock-free manner by making use of system clocks. Before getting to the question, here is my understanding (please skip the following section if you are aware of the machineries in both systems or just in CockroachDB):
Spanner's approach is simpler: before committing a write transaction, Spanner picks the max timestamp across all involved shards as the commit timestamp, then adds a wait, called commit wait, for the max clock error before returning from its write transaction. This means that all causally dependent transactions (both reads and writes) will have a timestamp value higher than the commit timestamp of the previous write. For read transactions, we pick the latest timestamp on the serving node. For example, if there was a write committed at timestamp 5, and the max clock error was 2, future writes and read-only transactions will have a timestamp of at least 7.
CockroachDB, on the other hand, does something more complicated. On writes, it picks the highest timestamp among all the involved shards, but does not wait. On reads, it assigns a preliminary read timestamp as the current timestamp on the serving node, then proceeds optimistically by reading across all shards and restarting the read transaction if any key on any shard reports a write timestamp that might imply uncertainty about whether the write causally preceded the read transaction. It assumes that keys with write timestamps less than the timestamp for the read transaction either appeared before the read transaction or were concurrent with it. The uncertainty machinery kicks in on timestamps higher than the read transaction timestamp. For example, if there was a write committed at timestamp 8, and a read transaction was assigned timestamp 7, we are unsure about whether that write came before the read or after, so we restart the read transaction with a read timestamp of 8.
Relevant sources - https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ and https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Given this implementation, does CockroachDB guarantee that the following two transactions will not see a violation of serializability?
1. A user blocks another user, then posts a message that they don't want the blocked user to see, as one write transaction.
2. The blocked user loads their friends list and their posts as one read transaction.
As an example, consider that the friends list and posts live on different shards, and that the following ordering happens (assuming a max clock error of 2):
1. The initial posts and friends list were committed at timestamp 5.
2. A read transaction starts at timestamp 7. It reads the friends list, which it sees as being committed at timestamp 5.
3. Then the write transaction for blocking the friend and making a post gets committed at 6.
4. The read transaction reads the posts, which it sees as being committed at timestamp 6.
Now the transactions violate serializability because the read transaction saw an old write and a newer write in the same transaction.
What am I missing?
CockroachDB handles this with a mechanism called the timestamp cache (which is an unfortunate name; it's not much of a cache).
In this example, at step two when the transaction reads the friends list at timestamp 7, the shard that holds the friends list remembers that it has served a read for this data at t=7 (the timestamp requested by the reading transaction, not the last-modified timestamp of the data that exists) and it can no longer allow any writes to commit with lower timestamps.
Then in step three, when the writing transaction attempts to write and commit at t=6, this conflict is detected and the writing transaction's timestamp gets pushed to t=8 or higher. Then that transaction must refresh its reads to see if it can commit as-is at t=8. If not, an error may be returned and the transaction must be retried from the beginning.
In step four, the reading transaction completes, seeing a consistent snapshot of the data as it existed at t=7, while both parts of the writing transaction are "in the future" at t=8.
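The four steps can be replayed with a toy model of the mechanism described in this answer. This is only a sketch under simplifying assumptions: timestamps are integers, the `Shard` class and its methods are hypothetical names, and each key simply tracks the highest timestamp at which it was read. It is not CockroachDB's actual implementation.

```python
class Shard:
    """Toy shard: versioned values plus a per-key read high-water mark."""

    def __init__(self):
        self.versions = {}         # key -> list of (commit_ts, value)
        self.read_high_water = {}  # key -> highest timestamp a read was served at

    def read(self, key, ts):
        # Remember that this key was read at ts (the "timestamp cache" entry).
        self.read_high_water[key] = max(self.read_high_water.get(key, 0), ts)
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None

    def write(self, key, value, ts):
        # A write may not commit at or below a timestamp already served to a read.
        floor = self.read_high_water.get(key, 0)
        commit_ts = ts if ts > floor else floor + 1
        self.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts


friends, posts = Shard(), Shard()
friends.write("friends", ["blocked_user"], 5)            # step 1
posts.write("posts", ["old post"], 5)
seen_friends = friends.read("friends", 7)                # step 2: read at t=7
pushed_ts = friends.write("friends", [], 6)              # step 3: pushed above 7
posts.write("posts", ["old post", "secret"], pushed_ts)  # same txn, same timestamp
seen_posts = posts.read("posts", 7)                      # step 4: snapshot at t=7
```

In this run the write that wanted t=6 ends up at `pushed_ts`, which is above 7, so the reader at t=7 sees the old friends list and the old posts together.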

How does Amazon Redshift reconstruct a row from columnar storage?

Amazon describes columnar storage like this:
So I guess this means in what PostgreSQL would call the "heap", blocks contain all the values for one column, then the next column, and so on.
Say I want to query for all people in their 30's, and I want to know their names. So columnar storage means less IO is required to read just the age of every row and find those that are 30-something, because all the other columns don't need to be read. Also maybe some efficient compression can be applied. That's neat, I guess.
Then what? This data structure alone doesn't explain how anything useful can happen after that. After determining what records are 30-something, how are the associated names found? What data structure is used? What are its performance characteristics?
If the Age column is the Sort Key, then the rows in the table will be stored in order of Age. This is great, because each 1MB storage block on disk keeps data for only one column, and it keeps note of the minimum and maximum values within the block.
Thus, searching for the rows that contain an Age of 30 means that Redshift can "skip over" blocks that do not contain Age=30. Since reading from disk is the slowest part of a database, this means it can operate much faster.
Once it has found the blocks that potentially contain Age=30, it reads those blocks from disk. Blocks are compressed, so they might contain much more data than the 1MB on disk. This means many rows can be read with fewer disk accesses.
Once those blocks are decompressed into memory, it finds the rows with Age=30 and then loads the corresponding blocks for the Name column. The compression ratio would be different for the Name column since it is text and is not sorted, so this might result in loading more blocks from disk for Name than for Age.
Redshift then assembles the data from Name and Age for the desired rows and performs any remaining operations.
These operations are also parallelized across multiple nodes based on the Distribution Key, which distributes data based on a given column (or replicates it between nodes for often-used tables). Data is typically distributed based upon a column that is frequently used in JOIN statements so that similar data is co-located on the same node. Each node returns its data to the Leader Node, which combines the data and provides the final results.
Bottom line: Minimise the amount of data read from disk and parallelize operations on separate nodes.
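The block-skipping idea can be sketched in a few lines. This is an illustration only: the `Block` class and `scan_equal` function are hypothetical names, blocks here hold three values instead of 1MB of compressed data, and nothing below reflects Redshift's real storage format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One storage block of a single column, with its min/max metadata."""

    values: List[int]

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_equal(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return (matching row positions, number of blocks actually read)."""
    matches: List[int] = []
    blocks_read = 0
    row_offset = 0
    for block in blocks:
        # Skip the block entirely if its min/max range rules out the target.
        if block.min_value <= target <= block.max_value:
            blocks_read += 1
            for i, value in enumerate(block.values):
                if value == target:
                    matches.append(row_offset + i)
        row_offset += len(block.values)
    return matches, blocks_read


# Age column stored in sort order, three values per block.
age_blocks = [Block([18, 21, 25]), Block([29, 30, 30]), Block([31, 40, 52])]
rows, blocks_read = scan_equal(age_blocks, 30)
```

Because the column is sorted, only the middle block overlaps Age=30, so the other two are never read.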
AFAIK every value in the columnar storage has an ID pointer (similar to the CTID you mentioned), and to get the select results Redshift needs to find and combine the values with the same ID pointer for each column that's selected from the raw data. The combined rows are kept in memory if memory allows; otherwise they spill to disk. This process is called materialization (not to be confused with materialized view materialization). In your case there are 2 technically possible scenarios:
1. materialize all Age/Name pairs, then filter by Age=30, and output the result
2. filter the Age column by Age=30, get the IDs, get the Name values with the corresponding IDs, materialize the pairs and output
I guess in this case #2 is what happens because materialization is more expensive than filtering. However, there are plenty of scenarios where this is much less obvious (with complex queries and aggregations). It is the responsibility of the query optimizer to decide what's better. #1 is still better than row-oriented storage because it would still read just 2 columns.
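The two scenarios can be compared with a small sketch. It assumes that a value's position in its column list plays the role of the ID pointer, and the function names are hypothetical. It says nothing about which plan Redshift actually picks.

```python
from typing import List, Tuple

# Two columns stored separately; position i in each list belongs to the same row.
ages = [25, 34, 41, 30, 38, 19]
names = ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"]


def early_materialize(
    ages: List[int], names: List[str], low: int, high: int
) -> List[Tuple[str, int]]:
    """Scenario 1: build every (name, age) row first, then filter."""
    rows = list(zip(names, ages))
    return [row for row in rows if low <= row[1] <= high]


def late_materialize(
    ages: List[int], names: List[str], low: int, high: int
) -> List[Tuple[str, int]]:
    """Scenario 2: filter the age column, then fetch names by position."""
    positions = [i for i, age in enumerate(ages) if low <= age <= high]
    return [(names[i], ages[i]) for i in positions]
```

Both functions return the same rows. The second one only touches the name column for the positions that passed the filter.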

DynamoDB: When does 1MB limit for queries apply

In the docs for DynamoDB it says:
In a Query operation, DynamoDB retrieves the items in sorted order, and then processes the items using KeyConditionExpression and any FilterExpression that might be present.
And:
A single Query operation can retrieve a maximum of 1 MB of data. This limit applies before any FilterExpression is applied to the results.
Does this mean that KeyConditionExpression is applied before this 1MB limit?
Indeed, your interpretation is correct. With KeyConditionExpression, DynamoDB can efficiently fetch only the data matching its criteria, so you pay only for this matching data, and the 1MB read size applies to the matching data. But with FilterExpression the story is different: DynamoDB has no efficient way of filtering out the non-matching items; it has to fetch all of the data first and then filter out the items you don't want. So you pay for reading the entire unfiltered data (before FilterExpression), and the 1MB maximum also corresponds to the unfiltered data.
If you're still unconvinced that this is the way it should be, here's another issue to consider: Imagine that you have 1 gigabyte of data in your database to be Scan'ed (or under a single key to be Query'ed), and after filtering, the result will be just 1 kilobyte. Were you to make this query and expect to get the 1 kilobyte back, Dynamo would need to read and process the entire 1 gigabyte of data before returning. This could take a very long time, you would have no idea how long, and the request would likely time out while waiting for the result. So instead, Dynamo makes sure to return to you after every 1MB of data it reads from disk (and for which you pay ;-)). Control will return to you 1000 (= 1 gigabyte / 1MB) times during the long query, so you won't have a chance to time out. Whether a 1MB limit actually makes sense here or whether it should have been more, I don't know, and maybe there should have been separate limits for the response size and the read amount. But some sort of limit on the read amount was definitely needed, even if it doesn't translate to large responses.
By the way, the Scan documentation includes a slightly differently-worded version of the explanation of the 1MB limit, maybe you will find it clearer than the version in the Query documentation:
A single Scan operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
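To make the order of operations concrete, here is a small in-memory simulation. It is not a call to the DynamoDB API: `query_page` and `query_all` are hypothetical helpers, item sizes are made up, and 1,000,000 bytes stands in for the 1MB limit. The input list represents items that already matched the key condition, and the read limit is charged before the filter runs.

```python
from typing import Callable, List, Optional, Tuple

Item = Tuple[int, int]  # (size_in_bytes, payload), a made-up representation


def query_page(
    items: List[Item], start: int, read_limit: int, keep: Callable[[int], bool]
) -> Tuple[List[int], Optional[int], int]:
    """Read items from `start` until the read limit, then apply the filter."""
    bytes_read = 0
    i = start
    page: List[int] = []
    # Always read at least one item so the caller makes progress.
    while i < len(items) and (bytes_read == 0 or bytes_read + items[i][0] <= read_limit):
        bytes_read += items[i][0]   # charged whether or not the item is kept
        if keep(items[i][1]):
            page.append(items[i][1])
        i += 1
    next_start = i if i < len(items) else None  # plays the role of a continuation key
    return page, next_start, bytes_read


def query_all(
    items: List[Item], read_limit: int, keep: Callable[[int], bool]
) -> Tuple[List[int], int]:
    """Follow continuation markers until the data is exhausted."""
    results: List[int] = []
    pages = 0
    start: Optional[int] = 0
    while start is not None:
        page, start, _ = query_page(items, start, read_limit, keep)
        results.extend(page)
        pages += 1
    return results, pages


# Ten 400,000-byte items; the filter keeps only the last one.
data = [(400_000, n) for n in range(10)]
results, pages = query_all(data, 1_000_000, lambda n: n == 9)
```

With these numbers each page reads two items, so five round trips are needed even though only one item survives the filter, and the first four pages come back empty.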

Amazon - DynamoDB Strong consistent reads, Are they latest and how?

In an attempt to use DynamoDB for one of my projects, I have a doubt regarding the strong consistency model of DynamoDB. From the FAQs:
Strongly Consistent Reads — in addition to eventual consistency, Amazon DynamoDB also gives you the
flexibility and control to request a strongly consistent read if your application, or an element of your application, requires it. A strongly consistent read returns a result that reflects all writes that received a successful response prior to the read.
From the definition above, what I understand is that a strongly consistent read will return the latest written value.
Taking an example: let's say Client1 issues a write command on key K1 to update the value from V0 to V1. After a few milliseconds Client2 issues a read command for key K1. In the case of strong consistency V1 will always be returned, whereas in the case of eventual consistency either V1 or V0 may be returned. Is my understanding correct?
If it is, what if the write operation returned success but the data has not yet been updated on all replicas, and we issue a strongly consistent read? How will it ensure that the latest written value is returned in this case?
The following question, "AWS DynamoDB read after write consistency - how does it work theoretically?", tries to explain the architecture behind this, but I don't know if this is how it actually works. The next question that comes to my mind after going through it is: is DynamoDB based on a single-master, multiple-slave architecture, where writes and strongly consistent reads go through the master replica and normal reads go through the others?
Short answer: Writing successfully in strongly consistent mode requires that your write succeed on a majority of servers that can contain the record, therefore any future consistent reads will always see the same data, because a consistent read must read a majority of the servers that can contain the desired record. If you do not perform a strongly consistent read, the system will ask a random server for the record, and it is possible that the data will not be up-to-date.
Imagine three servers. Server 1, server 2 and server 3. To write a strongly consistent record, you pick two servers at minimum, and write the data. Let's pick 1 and 2.
Now you want to read the data consistently. Pick a majority of servers. Let's say we picked 2 and 3.
Server 2 has the new data, and this is what the system returns.
Eventually consistent reads could come from server 1, 2, or 3. This means that if server 3 is chosen at random, your new write will not appear until replication occurs.
If a single server fails, your data is still safe, but if two out of three servers fail your new write may be lost until the offline servers are restored.
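The three-server example can be replayed with a toy model. This is a sketch of the quorum idea as this answer describes it, with hypothetical names (`Replica`, `quorum_write`, `quorum_read`) and an explicit version number per write. It is not a description of DynamoDB's verified internals.

```python
from typing import List


class Replica:
    """One server holding a single record as (version, value)."""

    def __init__(self, value: str):
        self.version = 0
        self.value = value


def quorum_write(replicas: List[Replica], targets: List[int], value: str, version: int) -> None:
    """Apply the write to the chosen servers only; the rest lag behind."""
    for i in targets:
        replicas[i].version = version
        replicas[i].value = value


def quorum_read(replicas: List[Replica], targets: List[int]) -> str:
    """Ask the chosen servers and return the value with the highest version."""
    newest = max((replicas[i] for i in targets), key=lambda r: r.version)
    return newest.value


servers = [Replica("V0") for _ in range(3)]    # servers 1, 2, 3 at indexes 0, 1, 2
quorum_write(servers, [0, 1], "V1", version=1)  # write reaches servers 1 and 2
consistent = quorum_read(servers, [1, 2])       # majority read from servers 2 and 3
stale = quorum_read(servers, [2])               # single-server read hits server 3
```

Any two of the three servers overlap with the two that took the write, which is why the majority read sees V1 while the single-server read can still see V0.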
More explanation:
DynamoDB (assuming it is similar to the database described in the Dynamo paper that Amazon released) uses a ring topology, where data is spread to many servers. Strong consistency is guaranteed because you directly query all relevant servers and get the current data from them. There is no master in the ring, there are no slaves in the ring. A given record will map to a number of identical hosts in the ring, and all of those servers will contain that record. There is no slave that could lag behind, and there is no master that can fail.
Feel free to read any of the many papers on the topic. A similar database called Apache Cassandra is available which also uses ring replication.
http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
Disclaimer: the following cannot be verified based on the public DynamoDB documentation, but they are probably very close to the truth
Starting from the theory, DynamoDB makes use of quorums, where V is the total number of replica nodes, Vr is the number of replica nodes a read operation asks and Vw is the number of replica nodes where each write is performed. The read quorum (Vr) can be leveraged to make sure the client is getting the latest value, while the write quorum (Vw) can be leveraged to make sure that writes do not create conflicts.
Based on the fact that there are no write conflicts in DynamoDB (since these would have to be reconciled by the client, and so would be exposed in the API), we conclude that DynamoDB is using a Vw that respects the second law (Vw > V/2), probably just V/2+1 to reduce write latency.
Now regarding read quorums, DynamoDB provides 2 different kinds of read. The strongly consistent read uses a read quorum that respects the first law (Vr + Vw > V), probably just V/2 rounded up, if we assume V/2+1 (with V/2 rounded down) for writes as before. However, an eventually consistent read can use only a single random replica (Vr = 1), thus being much quicker but giving zero guarantees around consistency.
Note: There's a possibility that the write quorum used does not respect the second law (Vw > V/2), but that would mean DynamoDB resolves such conflicts automatically (e.g. by selecting the latest one based on local time) without reconciliation from the client. But I believe that this is really unlikely to be true, since there is no such reference in the DynamoDB documentation. Even in that case though, the rest of the reasoning stays the same.
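The two quorum laws are easy to check numerically. The helpers below are hypothetical and only encode the inequalities stated above (Vr + Vw > V and Vw > V/2), and the replica counts are examples, not known DynamoDB settings.

```python
from typing import Tuple


def satisfies_quorum_laws(v: int, vr: int, vw: int) -> bool:
    """First law: Vr + Vw > V. Second law: Vw > V/2."""
    return vr + vw > v and vw > v / 2


def smallest_quorums(v: int) -> Tuple[int, int]:
    """Smallest (Vr, Vw) that satisfy both laws for V replicas."""
    vw = v // 2 + 1   # smallest integer strictly greater than V/2
    vr = v - vw + 1   # smallest integer with Vr + Vw > V
    return vr, vw
```

For three replicas this gives a read quorum and a write quorum of 2 each, whereas a single-replica read (Vr = 1) with Vw = 2 fails the first law, which matches the eventually consistent case.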
You can find the answer to your question here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/APISummary.html
When you issue a strongly consistent read request, Amazon DynamoDB returns a response with the most up-to-date data that reflects updates by all prior related write operations to which Amazon DynamoDB returned a successful response.
In your example, if the updateItem request to update the value from v0 to v1 was successful, the subsequent strongly consistent read request will return v1.
Hope this helps.

Storing Time Series in AWS DynamoDb

I would like to store 1M+ different time series in Amazon's DynamoDb database. Each time series will have about 50K data points. A data point is comprised of a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve (usually the whole time series) time series from time to time, for analytics.
How should I structure the database? Should I create a separate table for each timeseries? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In Dynamo, I can think of a few viable designs. In the first, you could use one table, with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering), and there would be an extra attribute on each item: a 'value'. This gives you the ability to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and a rangeKey BETWEEN clause). However, your main problem is the "hotspot" problem: internally, Dynamo will partition your data by hashKey, and will disperse your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored it might take you much longer to read it, as your single partition will throttle you much more heavily.
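The arithmetic and the sort-order property above can be checked with a short sketch. The helper names are hypothetical, the numbers are the ones used in this answer (1000 KB/s total, 100 partitions, 10 MB stored, with 1 MB taken as 1000 KB), and the range key format is an assumed fixed-width UTC layout.

```python
from datetime import datetime, timedelta, timezone


def per_partition_rate(total_kb_per_sec: float, partitions: int) -> float:
    """Throughput available to one partition if capacity is split evenly."""
    return total_kb_per_sec / partitions


def seconds_to_read(size_kb: float, rate_kb_per_sec: float) -> float:
    """Time to read size_kb at a steady rate."""
    return size_kb / rate_kb_per_sec


def range_key(ts: datetime) -> str:
    """Fixed-width ISO8601 string in UTC, so string order matches time order."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


rate = per_partition_rate(1000, 100)   # KB/s for a single partition
slow = seconds_to_read(10_000, rate)   # 10 MB read from one partition
fast = seconds_to_read(10_000, 1000)   # 10 MB at the full table rate

base = datetime(2013, 1, 9, 23, 59, 58, tzinfo=timezone.utc)
points = [base + timedelta(seconds=s) for s in (3, 0, 1)]
keys = [range_key(p) for p in points]
```

Under these assumptions a single time series reads at 10 KB/s, so 10 MB takes about 1000 seconds instead of the 10 seconds the table-level figure suggests. Sorting `keys` as plain strings gives the same order as sorting the timestamps.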
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try to spread out your time series across 10 tables, where "highly read" data goes in table 1, "almost never read" data in table 10, and all other data somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at the cost of a high degree of complexity in your design. Overall, it's probably not worth it: where do you put new time series? How do you remember where they all are? How do you move a time series?
From my own experience, I think DynamoDB supports some internal "bursting" on these kinds of reads, so it's possible my numbers are off and you will get adequate performance. However, my verdict is to look into Redshift.
How about dripping each time series into JSON or similar and storing it in S3? At most you'd need a lookup from somewhere like Dynamo.
You may still need Redshift to process your inputs.