Does Cosmos DB provide a possibility for resource locking or key range locks? - concurrency

I have a data model where the next row/document is dependent on the previous. E.g. each operation should add 1 to the previous value:
value: 10
value: 11
value: 12
In the example above, when operations 4 and 5 arrive at the same time, what can happen is that both read document 3 and both then insert a document with value 13, while I need the results to be 13 and 14 regardless of concurrency.
I want to make it so that only one such operation is permitted at a time. As far as I could find, I can lock the update of a single document in Cosmos, but here I really want to lock the read as well, so that only one update operation can execute at a time.
Is it possible to achieve something like that in Cosmos? It is acceptable to lock the whole "partition" so that only one write/update operation can be executed (ideally reads should still be allowed).
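As far as I know, Cosmos DB does not expose pessimistic locks or key-range locks; the workaround usually suggested is optimistic concurrency on a single "head" document via its ETag: keep one document per partition holding the latest value, replace it with an If-Match condition, and retry on a 412. A rough sketch with the Python SDK; the "head" document, the pk field, and the retry loop are assumptions of mine, not an official pattern:

from azure.cosmos import CosmosClient, exceptions
from azure.core import MatchConditions

client = CosmosClient("https://<account>.documents.azure.com", credential="<key>")
container = client.get_database_client("db").get_container_client("values")

def append_next(pk):
    while True:
        # Hypothetical "head" document that always stores the latest value.
        head = container.read_item(item="head", partition_key=pk)
        head["value"] += 1
        try:
            # The replace succeeds only if nobody changed the head since we read it.
            container.replace_item(item=head, body=head,
                                   etag=head["_etag"],
                                   match_condition=MatchConditions.IfNotModified)
        except exceptions.CosmosHttpResponseError as e:
            if e.status_code == 412:  # precondition failed: another writer won
                continue
            raise
        # The value is now reserved; insert the dependent document.
        container.create_item({"id": str(head["value"]), "pk": pk, "value": head["value"]})
        return head["value"]

Another option worth checking is a stored procedure, since stored procedures execute transactionally within a single partition key and could do the read and the insert in one scope.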

Related

DynamoDB - TransactWrite, how to do If Else condition

An example, a table with a column counter with only 1 row. I want to update this row's counter value based on its current value.
Say if counter is >= 10, set it to 0; otherwise counter++. How do I achieve this if/else clause in TransactWrite?
I can't have two actions in one transaction because the documentation states that it does not allow more than one action on the same item.
And of course, the reason I use TransactWrite is that there will be multiple Lambdas doing this task in parallel.
You cannot do it in one request, transactional or otherwise.
You can get the item, decide what to do, and then update the item in a second request accordingly. You’ll want to keep a version number or timestamp attribute to make sure the item hasn’t changed between the read and write, and use a condition expression to fail if it has.
That’s a common idiom:
https://dynobase.dev/dynamodb-locking/
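A sketch of that read-then-conditional-write idiom with boto3; the table, key, and attribute names are placeholders, and the item is assumed to already carry a numeric version attribute:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("counters")  # hypothetical table

def bump_counter():
    while True:
        item = table.get_item(Key={"pk": "counter"})["Item"]
        new_value = 0 if item["counter"] >= 10 else item["counter"] + 1
        try:
            table.update_item(
                Key={"pk": "counter"},
                UpdateExpression="SET #c = :new, #v = #v + :one",
                # Fail if someone else wrote between our read and this write.
                ConditionExpression="#v = :seen",
                ExpressionAttributeNames={"#c": "counter", "#v": "version"},
                ExpressionAttributeValues={":new": new_value, ":one": 1,
                                           ":seen": item["version"]},
            )
            return new_value
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Lost the race: re-read and try again.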
You can't do it the way you'd like, because if/else is not supported in transactions. There is, however, a simpler solution.
Just update the item and increment the counter. Whenever you read the value, take the counter value modulo 10, and you get the desired behavior, e.g. 123 % 10 = 3 or 10 % 10 = 0.
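A minimal boto3 sketch of that scheme (names are placeholders of mine): writers only ever issue an atomic ADD, and readers apply the modulo on the way out:

import boto3

table = boto3.resource("dynamodb").Table("counters")  # hypothetical table

# Writers never read first; they just issue an atomic, unconditional increment.
table.update_item(Key={"pk": "counter"},
                  UpdateExpression="ADD #c :one",
                  ExpressionAttributeNames={"#c": "counter"},
                  ExpressionAttributeValues={":one": 1})

# Readers derive the wrapped value on the way out, as described above.
raw = table.get_item(Key={"pk": "counter"})["Item"]["counter"]
print(int(raw) % 10)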

CockroachDB read transactions

I've been reading about the read-only lock-free transactions as implemented in Google Spanner and CockroachDB. Both claim to be implemented in a lock-free manner by making use of system clocks. Before getting to the question, here is my understanding (please skip the following section if you are aware of the machinery in both systems, or just in CockroachDB):
Spanner's approach is simpler -- before committing a write transaction, Spanner picks the max timestamp across all involved shards as the commit timestamp and adds a wait, called the commit wait, for the max clock error before returning from its write transaction. This means that all causally dependent transactions (both reads and writes) will have a timestamp value higher than the commit timestamp of the previous write. For read transactions, we pick the latest timestamp on the serving node. For example, if there was a write committed at timestamp 5, and the max clock error was 2, future writes and read-only transactions will have a timestamp of at least 7.
CockroachDB, on the other hand, does something more complicated. On writes, it picks the highest timestamp among all the involved shards, but does not wait. On reads, it assigns a preliminary read timestamp as the current timestamp on the serving node, then proceeds optimistically by reading across all shards and restarting the read transaction if any key on any shard reports a write timestamp that might imply uncertainty about whether the write causally preceded the read transaction. It assumes that keys with write timestamps less than the read transaction's timestamp either appeared before the read transaction or were concurrent with it. The uncertainty machinery kicks in for timestamps higher than the read transaction's timestamp. For example, if there was a write committed at timestamp 8, and a read transaction was assigned timestamp 7, we are unsure about whether that write came before the read or after, so we restart the read transaction with a read timestamp of 8.
Relevant sources - https://www.cockroachlabs.com/blog/living-without-atomic-clocks/ and https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Given this implementation does CockroachDB guarantee that the following two transactions will not see a violation of serializability?
A user blocks another user, then posts a message that they don't want the blocked user to see as one write transaction.
The blocked user loads their friends list and their posts as one read transaction.
As an example, consider that the friends list and posts live on different shards, and the following ordering happens (assuming a max clock error of 2):
The initial posts and friends list was committed at timestamp 5.
A read transaction starts at timestamp 7, it reads the friends list, which it sees as being committed at timestamp 5.
Then the write transaction for blocking the friend and making a post gets committed at 6.
The read transaction reads the posts, which it sees as being committed at timestamp 6.
Now, the transactions violate serializability because the read transaction saw an old write and a newer write in the same transaction.
What am I missing?
CockroachDB handles this with a mechanism called the timestamp cache (which is an unfortunate name; it's not much of a cache).
In this example, at step two when the transaction reads the friends list at timestamp 7, the shard that holds the friends list remembers that it has served a read for this data at t=7 (the timestamp requested by the reading transaction, not the last-modified timestamp of the data that exists) and it can no longer allow any writes to commit with lower timestamps.
Then in step three, when the writing transaction attempts to write and commit at t=6, this conflict is detected and the writing transaction's timestamp gets pushed to t=8 or higher. Then that transaction must refresh its reads to see if it can commit as-is at t=8. If not, an error may be returned and the transaction must be retried from the beginning.
In step four, the reading transaction completes, seeing a consistent snapshot of the data as it existed at t=7, while both parts of the writing transaction are "in the future" at t=8.
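A toy model of the per-key read-timestamp bookkeeping described above may make the example concrete. This is not CockroachDB code, just an invented illustration of the rule "a write may not commit at or below a timestamp already served to a reader":

# Toy illustration only; names and structure are invented, not CockroachDB internals.
class TimestampCache:
    def __init__(self):
        self.max_read_ts = {}  # key -> highest timestamp at which the key was served to a reader

    def on_read(self, key, read_ts):
        # Remember that this key has been read at read_ts.
        self.max_read_ts[key] = max(self.max_read_ts.get(key, 0), read_ts)

    def commit_ts_for_write(self, key, proposed_ts):
        # A write may not commit at or below a timestamp already served to readers,
        # so its commit timestamp is pushed above the cached read timestamp.
        floor = self.max_read_ts.get(key, 0)
        return proposed_ts if proposed_ts > floor else floor + 1

cache = TimestampCache()
cache.on_read("friends_list", read_ts=7)                         # step 2: read at t=7
print(cache.commit_ts_for_write("friends_list", proposed_ts=6))  # step 3: pushed to 8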

DynamoDB eventually consistent read ordering for sequentially written data

I have an application which appends events to a DynamoDB table, with userid as the hash key and an incremental sequence number as the range key (to guarantee sort order). The table is append-only. Let's say the writer writes events for userid '1'. I have a reader that reads events using the last sequence number with the userid '1' hash key.
If the reader uses strongly consistent reads, I know it will get the data in the same order as the write sequence.
If the reader uses eventually consistent reads, can I expect the same behavior?
You cannot expect the same behavior, in that items that were written last might not yet be visible to an eventually consistent read.
Say for example you have 3 writes in a row:
userId=1, seqNumber=1
userId=1, seqNumber=2
userId=1, seqNumber=3
If you do an eventually consistent read, you are not guaranteed to get all of the items. Your query will still return the items in order, if that is how you insert them. If you want to see all of the most recent writes, you have to use a strongly consistent read.
From the DynamoDB FAQ
...
Eventually Consistent Reads (Default) – the eventual consistency option maximizes your read throughput. However, an eventually consistent read might not reflect the results of a recently completed write. Consistency across all copies of data is usually reached within a second. Repeating a read after a short time should return the updated data.
...
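For completeness, here is what the strongly consistent read recommended above looks like with boto3; the table and attribute names mirror the example and are otherwise assumptions of mine:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("events")  # hypothetical table
last_seen_seq = 0  # read everything after the last sequence number we processed

# ConsistentRead=True makes the Query strongly consistent, so every item written
# before the call (seqNumber 1..3 in the example) is guaranteed to be returned.
resp = table.query(
    KeyConditionExpression=Key("userId").eq("1") & Key("seqNumber").gt(last_seen_seq),
    ConsistentRead=True,
)
for item in resp["Items"]:
    print(item["seqNumber"])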

Logical Programming Problem

I've been trying to solve this problem for quite some time, but I am having trouble with it.
Let's say on a trigger, you receive values.
First trigger: You get 1
Second trigger: You get 1, 2
Third trigger: You get 1, 2, 3
So, I store 1.
For 2nd trigger, I store 2 since 1 already exist.
For 3rd trigger, I store 3 since 1,2 already exist
so in total I have stored 1,2,3
As you can see, we can easily check for new values, if old != new.
Here's come the problem:
Fourth trigger: You get 1, 2, 4
For the 4th trigger, I store 1 and 2 because they already exist,
but how do I check against 3, remove 3 from the store, and check whether 4 is new?
If you are having problems understanding this, feel free to clarify. Thanks!
Use a std::set<int> container. When a trigger arrives, clear it and insert all the values from the trigger. This should be fine if you work with just a few numbers (about ten or so); with more, a slightly more sophisticated approach might be required.
Hard to tell what you're asking exactly, but see the std::set data structure if your main problem is maintaining a set of unique numbers and efficiently checking for existence in the set.
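If the requirement really is "tell me which values appeared and which disappeared since the last trigger", plain set difference expresses it directly. A small sketch (shown here in Python rather than std::set, same idea):

stored = set()

def on_trigger(values):
    global stored
    incoming = set(values)
    added = incoming - stored      # values never seen before
    removed = stored - incoming    # values the trigger no longer reports
    stored = incoming              # keep exactly what the trigger reported
    return added, removed

print(on_trigger([1]))        # ({1}, set())
print(on_trigger([1, 2]))     # ({2}, set())
print(on_trigger([1, 2, 3]))  # ({3}, set())
print(on_trigger([1, 2, 4]))  # ({4}, {3})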
Your logic changed between 1,2,3 and 1,2,4
(only stored 3 on former, but stored 1,2,4 on latter)
In that case, ignore data received that already exists and only store new data, unless some old data was not sent, in which case you'll create a new set of data to store.
But, I'm guessing that's not what you had in mind at all :)
edit
I see it's been edited now, so my answer is invalid
edit-2
the fastest way is to drop all stored data on each iteration, as the comparisons will take as long as (if not longer than) a complete save of the sent data.
Your approach sounds like it is better served by using some basic set theory. A couple of answers already point you to STL sets for that matter. With that, you'll need to iterate through the reported values to test for membership in the set.
However, is there an opportunity to affect what is reported with each "trigger"? For example, if this is something like a select poll, you could just put whatever it is that you're polling into a different state so that it is not reported as ready in subsequent triggers.

Amazon SimpleDB Woes: Implementing counter attributes

Long story short, I'm rewriting a piece of a system and am looking for a way to store some hit counters in AWS SimpleDB.
For those of you not familiar with SimpleDB, the (main) problem with storing counters is that the cloud propagation delay is often over a second. Our application currently gets ~1,500 hits per second. Not all those hits will map to the same key, but a ballpark figure might be around 5-10 updates to a key every second. This means that if we were to use a traditional update mechanism (read, increment, store), we would end up inadvertently dropping a significant number of hits.
One potential solution is to keep the counters in memcache and use a cron task to push the data. The big problem with this is that it isn't the "right" way to do it. Memcache shouldn't really be used for persistent storage... after all, it's a caching layer. In addition, we'll end up with issues when we do the push: making sure we delete the correct elements, and hoping that there is no contention for them as we're deleting them (which is very likely).
Another potential solution is to keep a local SQL database and write the counters there, updating our SimpleDB out-of-band every so many requests or running a cron task to push the data. This solves the syncing problem, as we can include timestamps to easily set boundaries for the SimpleDB pushes. Of course, there are still other issues, and though this might work with a decent amount of hacking, it doesn't seem like the most elegant solution.
Has anyone encountered a similar issue in their experience, or have any novel approaches? Any advice or ideas would be appreciated, even if they're not completely fleshed out. I've been thinking about this one for a while, and could use some new perspectives.
The existing SimpleDB API does not lend itself naturally to being a distributed counter. But it certainly can be done.
Working strictly within SimpleDB there are 2 ways to make it work. An easy method that requires something like a cron job to clean up. Or a much more complex technique that cleans as it goes.
The Easy Way
The easy way is to create a separate item for each "hit", with a single attribute which is the key. You can pump the domain(s) with counts quickly and easily. When you need to fetch the count (presumably much less often), you have to issue a query:
SELECT count(*) FROM domain WHERE key='myKey'
Of course this will cause your domain(s) to grow unbounded and the queries will take longer and longer to execute over time. The solution is a summary record where you roll up all the counts collected so far for each key. It's just an item with attributes for the key {summary='myKey'} and a "Last-Updated" timestamp with granularity down to the millisecond. This also requires that you add the "timestamp" attribute to your "hit" items. The summary records don't need to be in the same domain. In fact, depending on your setup, they might best be kept in a separate domain. Either way you can use the key as the itemName and use GetAttributes instead of doing a SELECT.
Now getting the count is a two-step process. You have to pull the summary record and also query for 'Timestamp' strictly greater than whatever the 'Last-Updated' time is in your summary record, then add the two counts together:
SELECT count(*) FROM domain WHERE key='myKey' AND timestamp > '...'
You will also need a way to update your summary record periodically. You can do this on a schedule (every hour) or dynamically based on some other criteria (for example do it during regular processing whenever the query returns more than one page). Just make sure that when you update your summary record you base it on a time that is far enough in the past that you are past the eventual consistency window. 1 minute is more than safe.
This solution works in the face of concurrent updates because even if many summary records are written at the same time, they are all correct and whichever one wins will still be correct because the count and the 'Last-Updated' attribute will be consistent with each other.
This also works well across multiple domains, even if you keep your summary records with the hit records: you can pull the summary records from all your domains simultaneously and then issue your queries to all domains in parallel. The reason to do this is if you need higher throughput for a key than what you can get from one domain.
This works well with caching. If your cache fails you have an authoritative backup.
The time will come when someone wants to go back and edit, remove, or add a record that has an old 'Timestamp' value. You will have to update your summary record (for that domain) at that time, or your counts will be off until you recompute that summary.
This will give you a count that is in sync with the data currently viewable within the consistency window. This won't give you a count that is accurate up to the millisecond.
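A rough sketch of the easy way using boto3's SimpleDB client; the domain, item, and attribute names are placeholders of mine, the summary roll-up is assumed to happen elsewhere, and millisecond timestamps are stored as fixed-width strings so the lexicographic comparison in the query behaves numerically:

import time
import uuid
import boto3

sdb = boto3.client("sdb")
DOMAIN = "hits"  # hypothetical domain

def record_hit(key):
    # One item per hit, carrying the key and a fixed-width millisecond timestamp.
    sdb.put_attributes(
        DomainName=DOMAIN, ItemName=str(uuid.uuid4()),
        Attributes=[{"Name": "key", "Value": key},
                    {"Name": "timestamp", "Value": "%013d" % int(time.time() * 1000)}])

def get_count(key):
    # Step 1: the rolled-up summary record, stored under a per-key item name.
    summary = sdb.get_attributes(DomainName=DOMAIN, ItemName="summary-" + key,
                                 ConsistentRead=True)
    attrs = {a["Name"]: a["Value"] for a in summary.get("Attributes", [])}
    base = int(attrs.get("count", 0))
    last_updated = attrs.get("last_updated", "0000000000000")
    # Step 2: hits recorded since the summary was last rolled up.
    recent = sdb.select(SelectExpression=(
        "select count(*) from `%s` where `key` = '%s' and timestamp > '%s'"
        % (DOMAIN, key, last_updated)))
    items = recent.get("Items", [])
    delta = int(items[0]["Attributes"][0]["Value"]) if items else 0
    return base + delta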
The Hard Way
The other way is to do the normal read-increment-store mechanism, but also write a composite value that includes a version number along with your value, where the version number you use is 1 greater than the version number of the value you are updating.
get(key) returns the attribute value="Ver015 Count089"
Here you retrieve a count of 89 that was stored as version 15. When you do an update you write a value like this:
put(key, value="Ver016 Count090")
The previous value is not removed, and you end up with an audit trail of updates that is reminiscent of Lamport clocks.
This requires you to do a few extra things.
The ability to identify and resolve conflicts whenever you do a GET.
A simple version number isn't going to work; you'll want to include a timestamp with resolution down to at least the millisecond, and maybe a process ID as well.
In practice you'll want your value to include the current version number and the version number of the value your update is based on, to make conflicts easier to resolve.
You can't keep an infinite audit trail in one item, so you'll need to issue deletes for older values as you go.
What you get with this technique is like a tree of divergent updates: you'll have one value, then all of a sudden multiple updates will occur, and you will have a bunch of updates based off the same old value, none of which know about each other.
When I say resolve conflicts at GET time I mean that if you read an item and the value looks like this:
      11 --- 12
     /
10 --- 11
     \
      11
You have to be able to figure out that the real value is 14 (four increments were applied on top of 10, even though three of them produced the same value). You can do this if you include, with each new value, the version of the value(s) you are updating.
It shouldn't be rocket science
If all you want is a simple counter, this is way overkill. It shouldn't be rocket science to make a simple counter, which is why SimpleDB may not be the best choice for making simple counters.
That isn't the only way, but most of those things will need to be done if you implement a SimpleDB solution in lieu of actually having a lock.
Don't get me wrong, I actually like this method, precisely because there is no lock, and the bound on the number of processes that can use this counter simultaneously is around 100 (because of the limit on the number of attributes in an item). And you can get beyond 100 with some changes.
Note
But if all these implementation details were hidden from you and you just had to call increment(key), it wouldn't be complex at all. With SimpleDB the client library is the key to making the complex things simple. But currently there are no publicly available libraries that implement this functionality (to my knowledge).
To anyone revisiting this issue, Amazon just added support for Conditional Puts, which makes implementing a counter much easier.
Now, to implement a counter - simply call GetAttributes, increment the count, and then call PutAttributes, with the Expected Value set correctly. If Amazon responds with an error ConditionalCheckFailed, then retry the whole operation.
Note that you can only have one expected value per PutAttributes call. So, if you want to have multiple counters in a single row, then use a version attribute.
pseudo-code:
begin
attributes = SimpleDB.GetAttributes
initial_version = attributes[:version]
attributes[:counter1] += 3
attributes[:counter2] += 7
attributes[:version] += 1
SimpleDB.PutAttributes(attributes, :expected => {:version => initial_version})
rescue ConditionalCheckFailed
retry
end
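For anyone who wants something closer to runnable code, here is the same loop with boto3's SimpleDB client; the domain, item, and attribute names are my own placeholders, and the item is assumed to have been seeded with version=0:

import boto3
from botocore.exceptions import ClientError

sdb = boto3.client("sdb")
DOMAIN, ITEM = "counters", "myKey"  # hypothetical names

def increment(deltas):  # e.g. increment({"counter1": 3, "counter2": 7})
    while True:
        got = sdb.get_attributes(DomainName=DOMAIN, ItemName=ITEM, ConsistentRead=True)
        attrs = {a["Name"]: int(a["Value"]) for a in got.get("Attributes", [])}
        initial_version = attrs.get("version", 0)  # item assumed seeded with version=0
        for name, delta in deltas.items():
            attrs[name] = attrs.get(name, 0) + delta
        attrs["version"] = initial_version + 1
        try:
            sdb.put_attributes(
                DomainName=DOMAIN, ItemName=ITEM,
                Attributes=[{"Name": k, "Value": str(v), "Replace": True}
                            for k, v in attrs.items()],
                # The whole put is rejected unless version is still what we read.
                Expected={"Name": "version", "Value": str(initial_version)},
            )
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailed":
                raise
            # Someone else incremented first; re-read and retry, as in the pseudo-code.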
I see you've accepted an answer already, but this might count as a novel approach.
If you're building a web app then you can use Google's Analytics product to track page impressions (if the page to domain-item mapping fits) and then to use the Analytics API to periodically push that data up into the items themselves.
I haven't thought this through in detail so there may be holes. I'd actually be quite interested in your feedback on this approach given your experience in the area.
Thanks
Scott
For anyone interested in how I ended up dealing with this... (slightly Java-specific)
I ended up using an EhCache on each servlet instance. I used the UUID as a key and a Java AtomicInteger as the value. Periodically a thread iterates through the cache and pushes rows to a SimpleDB temp stats domain, as well as writing a row with the key to an invalidation domain (which fails silently if the key already exists). The thread also decrements the counter by the previous value, ensuring that we don't miss any hits while it was updating. A separate thread pings the SimpleDB invalidation domain and rolls up the stats in the temporary domains (there are multiple rows for each key, since we're using EC2 instances), pushing the result to the actual stats domain.
I've done a little load testing, and it seems to scale well. Locally I was able to handle about 500 hits/second before the load tester broke (not the servlets - hah), so if anything I think running on EC2 should only improve performance.
Answer to feynmansbastard:
If you want to store a huge number of events, I suggest you use a distributed commit log system such as Kafka or AWS Kinesis. They let you consume a stream of events cheaply and simply (Kinesis pricing is about $25 per month for 1K events per second); you just need to implement a consumer (in any language) which bulk-reads all events since the previous checkpoint, aggregates counters in memory, flushes the data into permanent storage (DynamoDB or MySQL), and commits the checkpoint.
Events can be logged simply using the nginx log and transferred to Kafka/Kinesis using fluentd. This is a very cheap, performant, and simple solution.
Also had similar needs/challenges.
I looked at using Google Analytics and count.ly. The latter seemed too expensive to be worth it (plus they have a somewhat confusing definition of sessions). GA I would have loved to use, but I spent two days using their libraries and some 3rd-party ones (gadotnet and one other, maybe from CodeProject). Unfortunately I could only ever see counters post in the GA realtime section, never in the normal dashboards, even when the API reported success. We were probably doing something wrong, but we exceeded our time budget for GA.
We already had an existing SimpleDB counter that updated using conditional updates, as mentioned by a previous commenter. This works well, but suffers when there is contention and concurrency, and counts get missed (for example, our most-updated counter lost several million counts over a period of 3 months, versus a backup system).
We implemented a newer solution which is somewhat similar to the answer to this question, except much simpler.
We just sharded/partitioned the counters. When you create a counter, you specify the number of shards, which is a function of how many simultaneous updates you expect. This creates a number of sub-counters, each of which has the shard count stored with it as an attribute:
COUNTER (w/ 5 shards) creates:
shard0 { numshards = 5 } (informational only)
shard1 { count = 0, numshards = 5, timestamp = 0 }
shard2 { count = 0, numshards = 5, timestamp = 0 }
shard3 { count = 0, numshards = 5, timestamp = 0 }
shard4 { count = 0, numshards = 5, timestamp = 0 }
shard5 { count = 0, numshards = 5, timestamp = 0 }
Sharded Writes
Knowing the shard count, just randomly pick a shard and try to write to it conditionally. If it fails because of contention, choose another shard and retry.
If you don't know the shard count, get it from the root shard, which is present regardless of how many shards exist. Because the counter supports multiple writes in parallel, this lessens the contention issue to whatever level your needs require.
Sharded Reads
If you know the shard count, read every shard and sum them.
If you don't know the shard count, get it from the root shard and then read all and sum.
Because of slow update propagation, you can still miss counts when reading, but they should get picked up later. This is sufficient for our needs, although if you wanted more control over this you could ensure, when reading, that the last timestamp was what you expect, and retry if not.
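A compact sketch of the sharded write and read with boto3's SimpleDB client; the shard naming, domain name, and seeding of each shard with count=0 are assumptions of mine, and the root-shard bookkeeping and timestamps are omitted for brevity:

import random
import boto3
from botocore.exceptions import ClientError

sdb = boto3.client("sdb")
DOMAIN = "counters"  # hypothetical domain

def sharded_increment(counter, num_shards, delta=1):
    while True:
        # Randomly pick a shard; each shard is assumed to be seeded with count=0.
        shard = "%s-shard%d" % (counter, random.randint(1, num_shards))
        got = sdb.get_attributes(DomainName=DOMAIN, ItemName=shard, ConsistentRead=True)
        attrs = {a["Name"]: a["Value"] for a in got.get("Attributes", [])}
        current = int(attrs.get("count", 0))
        try:
            sdb.put_attributes(
                DomainName=DOMAIN, ItemName=shard,
                Attributes=[{"Name": "count", "Value": str(current + delta), "Replace": True}],
                Expected={"Name": "count", "Value": str(current)})
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailed":
                raise
            # Contention on this shard: loop around and pick another one.

def sharded_read(counter, num_shards):
    total = 0
    for i in range(1, num_shards + 1):
        got = sdb.get_attributes(DomainName=DOMAIN, ItemName="%s-shard%d" % (counter, i))
        total += sum(int(a["Value"]) for a in got.get("Attributes", []) if a["Name"] == "count")
    return total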