When a ConcurrencyException occurs, what gets written?

We use RavenDB in our production environment. It stores millions of documents, and gets updated pretty much constantly during the day.
We have two boxes load-balanced using a round-robin strategy which replicate to one another.
Every week or so, we get a ConcurrencyException from Raven. I understand that this basically means that one of the servers was told to insert or update the same document twice within a short timeframe - it's kind of like a conflict exception, except that it occurs on a single server rather than between two replicating servers.
What happens when this error occurs? Can I assume that at least one of the writes succeeded? Can I predict which one? Is there anything I can do to make these exceptions less likely?

ConcurrencyException means that on a single server, you have two writes to the same document at the same instant.
That leads to:
One write is accepted.
One write is rejected (with concurrency exception).
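In practice, the usual way to handle this on the losing writer is to catch the rejection, reload the accepted version, and reapply the change. Here is a minimal Python sketch with hypothetical load_document/save_document helpers and a stand-in ConcurrencyError (not RavenDB's actual client API):

    import time

    class ConcurrencyError(Exception):
        """Stand-in for the client library's concurrency exception (hypothetical)."""

    def update_with_retry(load_document, save_document, doc_id, mutate, max_attempts=3):
        # Reload-modify-save loop: the rejected writer reloads the accepted
        # version and reapplies its change instead of failing the request.
        for attempt in range(max_attempts):
            doc = load_document(doc_id)           # fetch the latest accepted version
            mutate(doc)                           # reapply our change on top of it
            try:
                save_document(doc)                # assumed to raise ConcurrencyError if we lost the race
                return doc
            except ConcurrencyError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(0.05 * (attempt + 1))  # brief backoff before retrying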

Related

How often do duplicates occur if I do not set "exactly once delivery" with Pub/Sub?

I have noticed that using "Exactly once delivery" affects performance when using pull and acknowledge. Pulling and acknowledging messages takes up to 5 times longer (~0.2s). If I disable "Exactly once delivery", the response is much faster, under 0.05s for both pull and acknowledge. I tested using curl and PHP with similar results (reusing an existing connection).
I am concerned about the consequences of disabling this feature. How often do duplicates occur if it is disabled? Are there ways to avoid duplicates without enabling it?
For example, if I have an acknowledgement deadline of 60 seconds, and I pull a message then pull again after 10 seconds, could I get the same message again? It's unclear from the docs how often duplicates will occur and under what circumstances they will occur if this option is disabled.
How often do duplicates occur if this feature is disabled?
Not super often in my experience, but that doesn't matter: your system needs to be able to handle them one way or another, because they will happen.
Are there ways to avoid duplicates without enabling this feature?
On Google's side? No; otherwise, what would be the point of the option? You should either de-duplicate with the messageID, by processing each ID only once, or make sure that whatever operation you perform is idempotent. Or you don't bother, hope it doesn't happen often, and live with the consequences (crashing, corruption somewhere that you may or may not fix, ...).
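As a rough illustration, here is a minimal Python sketch of messageID-based de-duplication using the google-cloud-pubsub client. The project, subscription, and process() handler are placeholders, and a real system would keep the seen-ID set in shared storage (e.g. Redis or a database) rather than in memory:

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription = subscriber.subscription_path("my-project", "my-subscription")

    seen_ids = set()  # placeholder; use a shared store in production

    def process(data):
        ...  # your idempotent message handler (placeholder)

    response = subscriber.pull(request={"subscription": subscription, "max_messages": 10})
    ack_ids = []
    for received in response.received_messages:
        msg = received.message
        if msg.message_id not in seen_ids:  # skip work already done for a duplicate
            seen_ids.add(msg.message_id)
            process(msg.data)
        ack_ids.append(received.ack_id)     # ack duplicates too so they stop being redelivered

    if ack_ids:
        subscriber.acknowledge(request={"subscription": subscription, "ack_ids": ack_ids})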
It's unclear from the docs how often duplicates will occur and under what circumstances they will occur if this option is disabled.
Pub/Sub is a complex, highly scalable distributed system; duplicated messages are not an intended feature on a fixed schedule, they are a necessary evil if you want high performance. Nobody can predict when they will happen, only that they can occur.
In the system I use, duplicates were happening often enough to cause us massive problems.

Basics to how a distributed, consistent key-value storage system return the latest key when dealing with concurrent requests?

I am getting up to speed on distributed systems (studying for an upcoming interview), and specifically on the basics of how a distributed, consistent key-value storage system managed in memory works.
My specific questions I am stuck on that I would love just a high level answer on if it's no trouble:
#1
Let's say we have 5 servers that are responsible for acting as readers, and I have one writer. When I write the value 'foo' to the key 'k1', I understand it has to propagate to all of those servers so they all store the value 'foo' for the key 'k1'. Is this correct, or does the writer only need to write to a majority (quorum) for this to work?
#2
After #1 above takes place, let's say that concurrently a read comes in for 'k1', and a write comes in to replace 'foo' with 'bar', but not all of the servers have been updated with 'bar' yet. This means some hold 'foo' and some hold 'bar'. If I had lots of concurrent reads, it's conceivable some would return 'foo' and some 'bar', since the value is not updated everywhere yet.
When we're talking about eventual consistency, this is expected, but if we're talking about strong consistency, how do you avoid #2 above? I keep seeing content about quorum and timestamps but on a high level, is there some sort of intermediary that sorts out what the correct value is? Just wanted to get a basic idea first before I dive in more.
Thank you so much for any help!
In doing more research, I found that "consensus algorithms" such as Paxos or Raft are the correct solution here. The idea is that your nodes need to arrive at a consensus on what the value is. If you read up on Paxos or Raft you'll learn everything you need to - it's quite complex to explain here, but there are videos/resources out there that cover this well.
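To make the quorum-and-timestamps idea from the question concrete, here is a toy in-memory sketch in Python (not Paxos or Raft, and the data is made up): the reader asks a majority of replicas and takes the value with the highest write version, which returns the latest write as long as the read quorum and the write quorum overlap.

    from dataclasses import dataclass

    @dataclass
    class Version:
        value: str
        timestamp: int  # monotonically increasing write version

    def quorum_read(replicas, key):
        # Ask a majority of replicas; return the value with the highest write timestamp.
        quorum = len(replicas) // 2 + 1
        responses = [r[key] for r in replicas[:quorum] if key in r]
        return max(responses, key=lambda v: v.timestamp).value if responses else None

    # Toy replicas: two already hold 'bar' (version 2), one still holds 'foo' (version 1).
    replicas = [
        {"k1": Version("bar", 2)},
        {"k1": Version("bar", 2)},
        {"k1": Version("foo", 1)},
    ]
    print(quorum_read(replicas, "k1"))  # 'bar'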
Another thing I found helpful was learning more about Dynamo and DynamoDB. They deal with the subject as well, although they are not strongly consistent by default.
Hope this helps someone, and message me if you'd like more details!
Reading about the CAP theorem will help you solve your problem. You are looking for consistency and partition tolerance in this question, so you have to sacrifice availability. The system needs to block and wait until all nodes finish writing. In other words, the change cannot be read before all nodes have updated it.
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that any distributed data store can only provide two of the following three guarantees:
Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
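As a toy illustration of that trade-off (a sketch, not a real replication protocol; node.put and node.delete are placeholder replica-client calls): the write is only acknowledged once every node has applied it, so an unreachable node makes the write fail rather than leaving readers able to see a stale value.

    def write_all_or_fail(replicas, key, value):
        # Apply the write to every replica before acknowledging it;
        # if any replica is unreachable, reject the write (sacrificing availability).
        applied = []
        for node in replicas:
            try:
                node.put(key, value)      # stand-in for a network call to the replica
                applied.append(node)
            except ConnectionError:
                for done in applied:      # undo partial writes so no reader sees them
                    done.delete(key)
                raise RuntimeError("write rejected: not all nodes reachable")
        return True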

DynamoDB Eventually consistent reads vs Strongly consistent reads

I recently came to know about the two read modes of DynamoDB, but I am not clear on when to choose which. Can anyone explain the trade-offs?
Basically, if you NEED to have the latest values, use a strongly consistent read. You'll get the guaranteed current value.
If your app is okay with potentially outdated information (mere seconds or less out of date), then use eventually consistent reads.
Examples of strongly consistent reads:
Bank balance (Want to know the latest amount)
Location of a locomotive on a train network (Need absolute certainty to guarantee safety)
Stock trading (Need to know the latest price)
Use-cases for eventually consistent reads:
Number of Facebook friends (Does it matter if another was added in the last few seconds?)
Number of commuters who used a particular turnstile in the past 5 minutes (Not important if it is out by a few people)
Stock research (Doesn't matter if it's out by a few seconds)
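For reference, this is roughly how the two read modes look with boto3 (table name and key are placeholders):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("User")  # placeholder table name

    # Eventually consistent read (the default): cheaper, but may be slightly stale.
    maybe_stale = table.get_item(Key={"id": "123"})

    # Strongly consistent read: reflects all writes acknowledged before the read,
    # at roughly double the read-capacity cost.
    latest = table.get_item(Key={"id": "123"}, ConsistentRead=True)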
Apart from the other answers, in short, the reason for these read modes is:
Let's say you have a User table in the eu-west-1 region. Without you being aware of it, AWS handles multiple Availability Zones in the background, for example replicating your data in case of failure. Basically, there are copies of your tables, and once you insert an item, multiple resources need to be updated.
But now, when you want to read, there is a chance that you are reading from a not-yet-updated copy without being aware of it. It usually takes under a second for DynamoDB to update. This is why it's called eventually consistent: it will eventually be consistent within a short amount of time :)
When making a decision, knowing this reasoning helps me understand and design my use cases.

Amazon S3 - What does eventual consistency mean in regard to delete operations?

I visited Amazon's website and read the available information regarding eventual consistency; however, it's still not completely clear to me.
What I am still not sure about is the behavior of S3 in the timeframe between the execution of an update/delete and the moment when consistency is eventually achieved.
For example, what will happen if I delete object A and subsequently execute a HEAD operation for object A multiple times?
I suppose I will eventually start getting a RESOURCE_NOT_FOUND error consistently at some point in time (when the deletion becomes consistent), but prior to that moment, what should I expect to get?
I see two options.
1) Every HEAD operation succeeds up to a point in time and after that every HEAD operation constantly fails with RESOURCE_NOT_FOUND.
2) Each HEAD operation succeeds or fails "randomly" until some moment in which the eventual consistency is achieved.
Could someone clarify which of the two should be the expected behavior?
Many thanks.
I see two options.
It could be either of these. Neither one is necessarily the "expected" behavior. Eventually, requests would all return 404.
S3 is a large-scale system, distributed across multiple availability zones in the region, so each request could hit one of several possible endpoints, each of which could reflect the bucket's state at a slightly different point in time. As long as they are all past the point where the object is deleted, they should consistently return 404, but the state of bucket index replication isn't exposed.
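For completeness, a small boto3 sketch of the HEAD check (bucket and key are placeholders). Under eventual consistency either outcome may show up for a short window after the DELETE, so the caller simply treats 404 as a normal case:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def object_exists(bucket, key):
        # HEAD the object; a 404 means this endpoint already sees the deletion.
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "404":
                return False
            raise

    print(object_exists("my-bucket", "path/to/object-a"))  # placeholder names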

REST search interface and the idempotency of GET

In order to stick with the REST concepts, such as safe operations, idempotency, etc., how can one implement a complex search operation involving multiple parameters?
I have seen Google's implementation, and it is creative. What are the options, other than that?
The idempotency requirement is what is tripping me up, as the operation will definitely not return the same results for the same criteria; for example, searching for customers named "Smith" will not return the same set every time, because more "Smith" customers are added all the time. My instinct is to use GET for this, but for a true search feature, the result would not seem to be idempotent, and would need to be marked as non-cacheable due to its fluid result set.
To put it another way, the basic idea behind idempotency is that the GET operation doesn't affect the state of the resource. That is, the GET can safely be repeated with no ill side effects.
However, an idempotent request has nothing to do with the representation of the resource.
Two contrived examples:
GET /current-time
GET /current-weather/90210
As should be obvious, these resources will change over time; some resources change more rapidly than others. But the GET operation itself does not affect the actual resource.
Contrast to:
GET /next-counter
This is, obviously I hope, not an idempotent request. The request itself is changing the resource.
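A minimal Flask sketch of the contrast (hypothetical endpoints): the first handler's representation changes over time but the GET itself changes nothing, while the second handler mutates the resource on every request.

    from datetime import datetime, timezone
    from flask import Flask, jsonify

    app = Flask(__name__)
    counter = 0

    @app.route("/current-time")
    def current_time():
        # Safe: the representation changes over time,
        # but the GET does not change any server state.
        return jsonify(now=datetime.now(timezone.utc).isoformat())

    @app.route("/next-counter")
    def next_counter():
        # Not safe: every GET mutates the resource, so the request itself
        # changes what subsequent requests will see.
        global counter
        counter += 1
        return jsonify(counter=counter)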
Also, there's nothing that says an idempotent operation has NO side effects. Clearly, many systems log accesses and requests, including GETs. Therefore, when you do GET /resource, the logs will change as a result of that GET. That kind of side effect doesn't make the GET non-idempotent. The fundamental premise is the effect on the resource itself.
But what about, say:
GET /logs
If the logs register every request, and the GET is returning the logs in their current state, does that mean that the GET in this case is not idempotent? Yup! Does it really matter? Nope. Not for this one edge case. Just the nature of the game.
What about:
GET /random-number
If you're using a pseudo-random number generator, most of those feed on their own state, starting with a seed and feeding their results back into themselves to get the next number. So, using a GET here may not be idempotent. But is it? How do you know how the random number is generated? It could be a white-noise source. And why do you care? If the resource is simply a random number, you really don't know whether the operation is changing it or not.
But the fact that there may be exceptions to the guidelines doesn't necessarily invalidate the concepts behind those guidelines.
Resources change; that's a simple fact of life. The representation of a resource does not have to be universal, or consistent across requests, or consistent across users. Literally, the representation of a resource is what GET delivers, and it is up to the application, using who knows what criteria, to determine that representation for each request. Idempotent requests are very nice because they work well with the rest of the REST model -- things like caching and content negotiation.
Most resources don't change quickly, and relying on specific transactions, using non-idempotent verbs, offers a more predictable and consistent interface for clients. When a method is supposed to be idempotent, clients will be quite surprised when that turns out not to be the case. But in the end, it's up to the application and its documented interface.
GET is safe and idempotent when properly implemented. That means:
It will cause no client-visible side-effects on the server side
When directed at the same URI, it causes the same server-side function to be executed each time, regardless of how many times it is issued, or when
What is not said above is that a GET to the same URI must always return the same data - it does not have to.
GET causes the same server-side function to be executed each time, and that function is typically, "return a representation of the requested resource". If that resource has changed since the last GET, the client will get the latest data. The function which the server executes is the source of the idempotency, not the data which it uses as input (the state of the resource being requested).
If a timestamp is used in the URI to make sure that the server data being requested is the same each time, that just means that something which is already idempotent (the function implementing GET) will act upon the same data, thereby guaranteeing the same result each time.
It would be idempotent for the same dataset. You could achieve this with a timestamp filter.
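A rough in-memory sketch of that idea (hypothetical data): filtering on a created_at timestamp pins the result set, so repeating the same query with the same as_of value returns the same rows even while new matching rows keep arriving.

    from datetime import datetime

    customers = [
        {"name": "Smith", "created_at": datetime(2023, 1, 10)},
        {"name": "Smith", "created_at": datetime(2023, 6, 2)},
    ]

    def search(name, as_of):
        # Only rows created at or before `as_of` are returned,
        # so the same (name, as_of) query always yields the same set.
        return [c for c in customers if c["name"] == name and c["created_at"] <= as_of]

    print(search("Smith", datetime(2023, 3, 1)))  # only the January row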