Is it possible to get strong consistency in a distributed system?

User sends a write request.
Someone then sends a read request for that resource.
The read request arrives before the write request, so the data returned to the reader is stale, but the reader has no way of knowing that it's stale yet.
Likewise, you could also have two write requests to the same resource but the later write request arrives first.
How is it possible to provide strong consistency in a distributed system when race conditions like this can happen?

What is consistency? You say two writes arrive "out of order", but what established that order? The thing that establishes that order is your basis for consistency.
A simple basis is a generation number: any object O is augmented by a version N. When you retrieve O, you also retrieve N. When you write, you write to O.N. If the object is already at O.N+1 when the write to O.N arrives, the write is stale and generates an error. Multiple versions of O remain available for some period.
Of course, you can't readily replicate the object with this scheme in any widely distributed system, since two disconnected owners of O could be permitting different operations that would be impossible to unify. Etcd, for example, solves this in a limited sense. Blockchains solve it in a wider sense.
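Here is a minimal sketch of that generation-number scheme in C++ (the VersionedStore name and shape are mine, just to illustrate the stale-write check, not any particular product's API):

    #include <map>
    #include <stdexcept>
    #include <string>

    // Toy single-node store illustrating the generation-number check.
    // A write must name the version it was based on; if the object has
    // already moved on, the write is rejected as stale.
    class VersionedStore {
    public:
        struct Versioned { std::string value; unsigned version = 0; };

        // A read returns both the value and the version it belongs to.
        Versioned read(const std::string& key) const { return data_.at(key); }

        // A write succeeds only if the caller's version matches the current one.
        void write(const std::string& key, const std::string& value,
                   unsigned basedOnVersion) {
            auto& entry = data_[key];
            if (entry.version != basedOnVersion)
                throw std::runtime_error("stale write: object is at a newer version");
            entry.value = value;
            ++entry.version;   // the object is now at N+1
        }

    private:
        std::map<std::string, Versioned> data_;
    };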

Algorithm or data structure for broadcast messages in 3D

Let's say some threads produce data, and every piece of data has an associated 3D coordinate. Other threads consume this data, and every consumer thread has a cubic volume of interest described by a center and a "radius" (the size of the cube). Consumer threads can update their cube-of-interest parameters (e.g. move the cube) from time to time. Every piece of data is broadcast: a copy of it should be received by every thread whose cube of interest includes its coordinate.
What multi-threaded data structure can be used for this with the best performance? I am using C++, but a pointer to a generic algorithm is fine too.
Bonus: it would be nice if the algorithm could generalize to multiple network nodes (some nodes produce data and some consume it, with the same rules as threads).
Extra information: there are more consumers than producers, and there are many more data broadcasts than cube-of-interest changes (cube size changes are very rare, but moving the cube is quite a common event). It's okay if a consumer only starts receiving data from its new cube of interest after some delay (but until then it should continue to receive data from the previous cube).
Your terminology is problematic. A cube by definition does not have a radius; a sphere does. A broadcast by definition is received by everyone, it is not received only by those who are interested; a multicast is.
I have encountered this problem in the development of an MMORPG. The approach taken in the development of that MMORPG was a bit wacky, but in the decade that followed my thinking has evolved, so I have a much better idea of how to go about it now.
The solution is a bit involved, but it does not require any advanced notions like space partitioning, and it is reusable for all kinds of information that the consumers will inevitably need besides just 3D coordinates. Furthermore, it is reusable for entirely different projects.
We begin by building a light-weight data modelling framework which allows us to describe, instantiate, and manipulate finite, self-contained sets of inter-related observable data known as "Entities" in memory and perform various operations on them in an application-agnostic way.
Description can be done in simple object-relational terms. ("Object-relational" means relational with inheritance.)
Instantiation means that given a schema, the framework creates a container (an "EntitySpace") to hold, during runtime, instances of entities described by the schema.
Manipulation means being able to read and write properties of those entities.
Self-contained means that although an entity may contain a property which is a reference to another entity, the other entity must reside within the same EntitySpace.
Observable means that when the value of a property changes, a notification is issued by the EntitySpace, telling us which property of which entity has changed. Anyone can register for notifications from an EntitySpace, and receives all of them.
Once you have such a framework, you can build lots of useful functionality around it in an entirely application-agnostic way. For example:
Serialization: you can serialize and de-serialize an EntitySpace to and from markup.
Filtering: you can create a special kind of EntitySpace which does not contain storage, and instead acts as a view into a subset of another EntitySpace, filtering entities based on the values of certain properties.
Mirroring: You can keep an EntitySpace in sync with another, by responding to each property-changed notification from one and applying the change to the other, and vice versa.
Remoting: You can interject a transport layer between the two mirrored parts, thus keeping them mirrored while they reside on different threads or on different physical machines.
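As a rough illustration of the "observable" part described above, here is a minimal C++ sketch (the names are mine, not a real framework): an EntitySpace that notifies every registered listener whenever a property changes.

    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // Toy observable container: every property write is reported to all
    // registered listeners as (entityId, propertyName, newValue).
    class EntitySpace {
    public:
        using Listener = std::function<void(int entityId,
                                            const std::string& property,
                                            const std::string& newValue)>;

        void subscribe(Listener l) { listeners_.push_back(std::move(l)); }

        void setProperty(int entityId, const std::string& property,
                         const std::string& value) {
            entities_[entityId][property] = value;
            for (const auto& l : listeners_)   // anyone registered gets all of them
                l(entityId, property, value);
        }

    private:
        std::map<int, std::map<std::string, std::string>> entities_;
        std::vector<Listener> listeners_;
    };

Serialization, filtering, mirroring, and remoting can all be layered on top of those notifications without the EntitySpace knowing anything about the application.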
Every node in the network must have a corresponding "agent" object running inside every node that it needs data from. If you have a centralized architecture (and I will continue under this hypothesis), this means that within the server you will have one agent object for each client connected to that server. The agent represents the client, so the fact that the client is remote becomes irrelevant. The agent is only responsible for filtering and sending data to the client that it represents, so multi-threading becomes irrelevant, too.
An agent registers for notifications from the server's EntitySpace and filters them based on whatever criteria you choose. One such criterion, for an entity which contains a 3D-coordinate property, can be whether that 3D coordinate is within the client's area of interest. The center-of-sphere-and-radius approach will work; the center-of-cube-and-size approach will probably work even better. (No need to compute squared distances.)
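The cube containment test itself is just three per-axis comparisons. A small sketch of the check an agent might run on each position notification (the names are hypothetical):

    #include <cmath>

    struct Vec3 { double x, y, z; };

    // Axis-aligned cube of interest: a center plus a half-size
    // (the "radius" of the cube).
    struct CubeOfInterest {
        Vec3   center;
        double halfSize;

        bool contains(const Vec3& p) const {
            return std::abs(p.x - center.x) <= halfSize &&
                   std::abs(p.y - center.y) <= halfSize &&
                   std::abs(p.z - center.z) <= halfSize;   // no squaring needed
        }
    };

    // The agent forwards a data point to its client only if it passes the test.
    // (sendToClient stands in for whatever transport layer is in use.)
    void onPositionChanged(const CubeOfInterest& interest, const Vec3& position) {
        if (interest.contains(position)) {
            // sendToClient(position);
        }
    }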

Basics of how a distributed, consistent key-value storage system returns the latest value for a key when dealing with concurrent requests?

I am getting up to speed on distributed systems (studying for an upcoming interview), and specifically on the basics of how a distributed, consistent key-value storage system managed in memory works.
The specific questions I am stuck on, which I would love just a high-level answer to if it's no trouble:
#1
Let's say we have 5 servers that are responsible for acting as readers, and I have one writer. When I write the value 'foo' to the key 'k1', I understand it has to propagate to all of those servers so they all store the value 'foo' for the key k1. Is this correct, or does the writer only write to a majority (quorum) for this to work?
#2
After #1 above takes place, let's say a read comes in for k1 concurrently with a write that replaces 'foo' with 'bar', but not all of the servers have been updated with 'bar' yet. This means some hold 'foo' and some hold 'bar'. If I had lots of concurrent reads, it's conceivable some would return 'foo' and some 'bar', since the update hasn't reached every server yet.
When we're talking about eventual consistency, this is expected, but if we're talking about strong consistency, how do you avoid #2 above? I keep seeing content about quorums and timestamps, but on a high level, is there some sort of intermediary that sorts out what the correct value is? Just wanted to get a basic idea first before I dive in more.
Thank you so much for any help!
In doing more research, I found that "consensus algorithms" such as Paxos or Raft are the correct solution here. The idea is that your nodes need to arrive at a consensus on what the value is. If you read up on Paxos or Raft you'll learn everything you need to; it's quite complex to explain here, but there are videos/resources out there that cover this well.
Another thing I found helpful was learning more about Dynamo and DynamoDB. They deal with the subject as well, although they are not strongly consistent by default.
Hope this helps someone, and message me if you'd like more details!
Reading about the CAP theorem will help you solve your problem. You are looking for consistency and partition tolerance in this question, so you have to sacrifice availability. The system needs to block and wait until all nodes finish writing. In other words, the change cannot be read before all nodes have applied it.
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that any distributed data store can only provide two of the following three guarantees:
Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
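For a high-level picture of the "quorum" machinery the question keeps running into: with N replicas, a write is acknowledged by W of them and a read consults R of them; if R + W > N, every read overlaps at least one replica that holds the latest write, and the reader returns the value with the highest version. A toy C++ sketch of that overlap idea (this is not Paxos or Raft, and not any real database's API):

    #include <cassert>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct Replica { std::string value; unsigned version = 0; };

    // Toy cluster: N replicas, write quorum W, read quorum R, with R + W > N.
    class QuorumKv {
    public:
        QuorumKv(std::size_t n, std::size_t w, std::size_t r)
            : replicas_(n), w_(w), r_(r) { assert(r_ + w_ > n); }

        // Write to the first W replicas (in reality: any W that acknowledge).
        void write(const std::string& value) {
            ++latestVersion_;
            for (std::size_t i = 0; i < w_; ++i)
                replicas_[i] = {value, latestVersion_};
        }

        // Read from any R replicas and return the value with the highest
        // version; the quorum overlap guarantees at least one of them has
        // seen the latest write.
        std::string read(const std::vector<std::size_t>& chosen) const {
            assert(chosen.size() >= r_);
            const Replica* best = &replicas_[chosen[0]];
            for (std::size_t i : chosen)
                if (replicas_[i].version > best->version) best = &replicas_[i];
            return best->value;
        }

    private:
        std::vector<Replica> replicas_;
        std::size_t w_, r_;
        unsigned latestVersion_ = 0;
    };

Consensus protocols like Paxos and Raft exist to make the version/ordering part of this safe under failures and concurrent writers; the quorum overlap is the intuition behind why a reader can see the latest value without contacting every server.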

AWS S3 Eventual Consistency and read after write Consistency

Help me understand better these concepts that I can't fully grasp.
Talking about AWS S3 consistency models, I'll try to explain what I've grasped.
Please demystify or confirm these claims.
first of all
talking about "read after write" is related only to "new writings"/creation of objects that didn't exist before.
talking about "eventual consistency" is related to "modifying existing objects" (updating or deleting)
are these first concepts correct?
then,
eventual consistency: a "client" who accesses a datum before it has been completely written to a node can read an old version of the object, because the write may still be in progress and the object might not have been committed yet.
This is a behavior universally tolerated in distributed systems, where this type of consistency is preferred over the alternative of waiting for some sort of lock to be released once the object has been committed.
read after write consistency: the objects are immediately available to the client, and the client will read the "real" version of the object, never an old version; if I've understood correctly, this is true only for new objects.
If so, why are these replication methods so different, and why do they produce different consistency?
The concept of "eventual consistency" is more natural to grasp, because you have to consider the "latency" needed to propagate the data to different nodes, and a client might access it during this time and not get fresh data yet.
But why should "read after write" be immediate? Propagating a modification of an existing datum, or creating a new datum, should have the same latency. I can't understand the difference.
Can you please tell me if my claims are correct, and explain this concept in a different way?
talking about "read after write" is related only to "new writings"/creation of objects that didn't exist before.
Yes
talking about "eventual consistency" is related to "modifying existing objects" (updating or deleting)
Almost correct, but be aware of one caveat. Here is a quote from the documentation:
The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
As for why they offer different consistency models, here is my understanding/speculation. (Note: the following content might be wrong, since I've never worked on S3 and don't know its actual internal implementation.)
S3 is a distributed system, so it's very likely that S3 uses some internal caching service. Think of how a CDN works; you can use a similar analogy here. In the case where you GET an object whose key is not in the cache yet, it's a cache miss! S3 will fetch the latest version of the requested object, save it into the cache, and return it to you. This is the read-after-write model.
On the other hand, if you update an object that's already in the cache, then besides replicating your new object to other availability zones, S3 needs to do extra work to update the existing data in the cache. Therefore, the propagation process will likely take longer. Instead of making you wait on the request, S3 made the decision to return the existing data in the cache. This data might be an old version of the object. Hence the eventual consistency.
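To make that speculation concrete, here is a toy model of the caching intuition (this is definitely not how S3 is actually implemented; it only shows why a cache miss can give read-after-write behavior while a cached, not-yet-propagated update gives eventual consistency):

    #include <map>
    #include <string>

    // Toy two-layer store: an authoritative backing store plus a cache that
    // is refreshed lazily in the background.
    class CachedStore {
    public:
        void put(const std::string& key, const std::string& value) {
            backing_[key] = value;
            // The cache is NOT updated here; propagation happens later.
        }

        std::string get(const std::string& key) {
            auto it = cache_.find(key);
            if (it != cache_.end())
                return it->second;                        // possibly stale (eventual)
            const std::string latest = backing_.at(key);  // cache miss: fetch fresh
            cache_[key] = latest;
            return latest;                                // read-after-write behavior
        }

        // Simulates the background propagation that eventually refreshes the cache.
        void propagate() { cache_ = backing_; }

    private:
        std::map<std::string, std::string> backing_;
        std::map<std::string, std::string> cache_;
    };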
As Phil Karlton said, there are only two hard things in Computer Science: cache invalidation and naming things. AWS has no good way to fully get around this, and has to make some compromises too.

REST search interface and the idempotency of GET

In order to stick with the REST concepts, such as safe operations, idempotency, etc., how can one implement a complex search operation involving multiple parameters?
I have seen Google's implementation, and that is creative. What is an option, other than that?
The idempotency requirement is what is tripping me up, as the operation will definitely not return the same results for the same criteria; say, searching for customers named "Smith" will not return the same set every time, because more "Smith" customers are added all the time. My instinct is to use GET for this, but for a true search feature, the result would not seem to be idempotent, and would need to be marked as non-cacheable due to its fluid result set.
To put it another way, the basic idea behind idempotency is that the GET operation doesn't affect the state of the resource. That is, the GET can safely be repeated with no ill side effects.
However, an idempotent request has nothing to do with the representation of the resource.
Two contrived examples:
GET /current-time
GET /current-weather/90210
As should be obvious, these resources will change over time; some resources change more rapidly than others. But the GET operation itself plays no part in changing the actual resource.
Contrast to:
GET /next-counter
This is, obviously I hope, not an idempotent request. The request itself is changing the resource.
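The difference is easy to see in code. A contrived C++ sketch of the two kinds of handler (the function names are made up):

    #include <atomic>
    #include <chrono>
    #include <string>

    std::atomic<long> counter{0};

    // Idempotent/safe: the handler only reads state. Repeating the request may
    // return a different representation (time moves on), but the request itself
    // changes nothing on the server.
    std::string handleGetCurrentTime() {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        return std::to_string(
            std::chrono::duration_cast<std::chrono::seconds>(now).count());
    }

    // Not idempotent: the handler mutates server state as part of answering.
    // Each GET /next-counter returns a different result *because of the GET*.
    std::string handleGetNextCounter() {
        return std::to_string(++counter);
    }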
Also, there's nothing that says an idempotent operation has NO side effects. Clearly, many systems log accesses and requests, including GETs. Therefore, when you do GET /resource, the logs will change as a result of that GET. That kind of side effect doesn't make the GET non-idempotent. The fundamental premise is the effect on the resource itself.
But what about, say:
GET /logs
If the logs register every request, and the GET is returning the logs in their current state, does that mean that the GET in this case is not idempotent? Yup! Does it really matter? Nope. Not for this one edge case. Just the nature of the game.
What about:
GET /random-number
If you're using a pseudo-random number generator, most of those feed upon themselves: they start with a seed and feed their results back in to get the next number. So, using a GET here may not be idempotent. But is it? How do you know how the random number is generated? It could be a white-noise source. And why do you care? If the resource is simply a random number, you really don't know if the operation is changing it or not.
But just because there may be exceptions to the guidelines, doesn't necessarily invalidate the concepts behind those guidelines.
Resources change; that's a simple fact of life. The representation of a resource does not have to be universal, or consistent across requests, or consistent across users. Literally, the representation of a resource is what GET delivers, and it is up to the application, using whatever criteria it chooses, to determine that representation for each request. Idempotent requests are very nice because they work well with the rest of the REST model -- things like caching and content negotiation.
Most resources don't change quickly, and relying on specific transactions, using non-idempotent verbs, offers a more predictable and consistent interface for clients. When a method is supposed to be idempotent, clients will be quite surprised when it turns out not to be the case. But in the end, it's up to the application and its documented interface.
GET is safe and idempotent when properly implemented. That means:
It will cause no client-visible side-effects on the server side
When directed at the same URI, it causes the same server-side function to be executed each time, regardless of how many times it is issued, or when
What is not said above is that GET to the same URI always returns the same data.
GET causes the same server-side function to be executed each time, and that function is typically, "return a representation of the requested resource". If that resource has changed since the last GET, the client will get the latest data. The function which the server executes is the source of the idempotency, not the data which it uses as input (the state of the resource being requested).
If a timestamp is used in the URI to make sure that the server data being requested is the same each time, that just means that something which is already idempotent (the function implementing GET) will act upon the same data, thereby guaranteeing the same result each time.
It would be idempotent for the same dataset. You could achieve this with a timestamp filter.
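A sketch of what such a timestamp filter might look like: the client pins the search to an as-of instant, so repeating the same request over the same dataset yields the same result even while new "Smith" customers keep arriving (the names and query shape here are made up for illustration):

    #include <string>
    #include <vector>

    struct Customer {
        std::string name;
        long        createdAt;   // e.g. seconds since epoch
    };

    // GET /customers?name=Smith&as_of=<timestamp>
    // Only customers that existed at the pinned instant are considered, so the
    // same (name, asOf) pair always selects the same subset of the data.
    std::vector<Customer> searchCustomers(const std::vector<Customer>& all,
                                          const std::string& name, long asOf) {
        std::vector<Customer> result;
        for (const auto& c : all)
            if (c.name == name && c.createdAt <= asOf)
                result.push_back(c);
        return result;
    }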

Should I be concerned with bit flips on Amazon S3?

I've got some data that I want to save on Amazon S3. Some of this data is encrypted and some is compressed. Should I be worried about single bit flips? I know of the MD5 hash header that can be added. This (from my experience) will prevent flips in the most unreliable part of the deal (network communication); however, I'm still wondering whether I need to guard against flips on disk.
I'm almost certain the answer is "no", but if you want to be extra paranoid you can precalculate the MD5 hash before uploading, compare that to the MD5 hash you get after upload, then when downloading calculate the MD5 hash of the downloaded data and compare it to your stored hash.
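In C++-flavored pseudocode, that paranoid round trip looks roughly like this (the hashing routine and the S3 upload/download calls are passed in as placeholders for whatever crypto library and SDK you actually use; nothing here is a real AWS API):

    #include <functional>
    #include <stdexcept>
    #include <string>

    // md5Hex, upload, and download are stand-ins supplied by the caller;
    // upload returns the MD5/ETag reported by the server for the stored object.
    void paranoidRoundTrip(
        const std::string& key, const std::string& data,
        const std::function<std::string(const std::string&)>& md5Hex,
        const std::function<std::string(const std::string&, const std::string&)>& upload,
        const std::function<std::string(const std::string&)>& download) {

        const std::string localHash = md5Hex(data);        // 1. hash before upload

        const std::string remoteHash = upload(key, data);  // 2. compare to the hash
        if (remoteHash != localHash)                       //    reported after upload
            throw std::runtime_error("upload corrupted in transit");

        const std::string fetched = download(key);         // 3. hash again on download
        if (md5Hex(fetched) != localHash)
            throw std::runtime_error("downloaded data does not match stored hash");
    }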
I'm not sure exactly what risk you're concerned about. At some point you have to defer the risk to somebody else. Does "corrupted data" fall under Amazon's Service Level Agreement? Presumably they know what the file hash is supposed to be, and if the hash of the data they're giving you doesn't match, then it's clearly their problem.
I suppose there are other approaches too:
Store your data with a forward error correction (FEC) code so that you can detect and correct N bit errors, up to your choice of N (a toy sketch follows at the end of this answer).
Store your data more than once in Amazon S3, perhaps across their US and European data centers (I think there's a new one in Singapore coming online soon too), with RAID-like redundancy so you can recover your data if some number of sources disappear or become corrupted.
It really depends on just how valuable the data you're storing is to you, and how much risk you're willing to accept.
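To illustrate the FEC option in its simplest possible form, here is a toy 3x repetition code with per-bit majority voting; it corrects any single corrupted copy of each byte, at the cost of 3x storage (real schemes such as Reed-Solomon are far more space-efficient):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Encode: store every byte three times.
    std::vector<std::uint8_t> encode3x(const std::vector<std::uint8_t>& data) {
        std::vector<std::uint8_t> out;
        for (std::uint8_t b : data) { out.push_back(b); out.push_back(b); out.push_back(b); }
        return out;
    }

    // Decode: take a bitwise majority vote over the three copies, which
    // silently corrects a bit flip in any one of them.
    std::vector<std::uint8_t> decode3x(const std::vector<std::uint8_t>& coded) {
        std::vector<std::uint8_t> out;
        for (std::size_t i = 0; i + 2 < coded.size(); i += 3) {
            const std::uint8_t a = coded[i], b = coded[i + 1], c = coded[i + 2];
            out.push_back((a & b) | (a & c) | (b & c));   // bitwise majority
        }
        return out;
    }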
I see your question from two points of view, a theoretical and practical.
From a theoretical point of view, yes, you should be concerned - and not only about bit flipping, but about several other possible problems. In particular, section 11.5 of the customer agreement says that Amazon
MAKE NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED, STATUTORY OR OTHERWISE WITH RESPECT TO THE SERVICE OFFERINGS. (..omiss..) WE AND OUR LICENSORS DO NOT WARRANT THAT THE SERVICE OFFERINGS WILL FUNCTION AS DESCRIBED, WILL BE UNINTERRUPTED OR ERROR FREE, OR FREE OF HARMFUL COMPONENTS, OR THAT THE DATA YOU STORE WITHIN THE SERVICE OFFERINGS WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED.
Now, in practice, I'd not be concerned. If your data is lost, you'll blog about it, and (although they might not face any legal action) their business will be pretty much over.
On the other hand, it depends on how vital your data is. Suppose that you were rolling your own stuff in your own data center(s). How would you plan for disaster recovery there? If you say "I'd just keep two copies in two different racks", then just use the same technique with Amazon, maybe keeping two copies in two different data centers. (Since you wrote that you are not interested in how to protect against bit flips, I'm providing only a trivial example here.)
Probably not: Amazon uses checksums to protect against bit flips, regularly combing through data at rest to ensure that no bit flips have occurred. So, unless you have corruption in all instances of the data within the interval between integrity-check passes, you should be fine.
Internally, S3 uses MD5 checksums throughout the system to detect/protect against bitflips. When you PUT an object into S3, we compute the MD5 and store that value. When you GET an object we recompute the MD5 as we stream it back. If our stored MD5 doesn't match the value we compute as we're streaming the object back we'll return an error for the GET request. You can then retry the request.
We also continually loop through all data at rest, recomputing checksums and validating them against the MD5 we saved when we originally stored the object. This allows us to detect and repair bit flips that occur in data at rest. When we find a bit flip in data at rest, we repair it using the redundant data we store for each object.
You can also protect yourself against bitflips during transmission to and from S3 by providing an MD5 checksum when you PUT the object (we'll error if the data we received doesn't match the checksum) and by validating the MD5 when you GET an object.
Source:
https://forums.aws.amazon.com/thread.jspa?threadID=38587
There are two ways of reading your question:
"Is Amazon S3 perfect?"
"How do I handle the case where Amazon S3 is not perfect?"
The answer to (1) is almost certainly "no". They might have lots of protection to get close, but there is still the possibility of failure.
That leaves (2). The fact is that devices fail, sometimes in obvious ways and other times in ways that appear to work but give an incorrect answer. To deal with this, many databases use a per-page CRC to ensure that a page read from disk is the same as the one that was written. This approach is also used in modern filesystems (for example ZFS, which can write multiple copies of a page, each with a CRC, to handle RAID controller failures; I have seen ZFS correct single-bit errors from a disk by reading a second copy, so disks are not perfect).
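For reference, a page checksum of the kind those systems use can be as small as this (a bitwise CRC-32 over the IEEE polynomial; real implementations are table-driven for speed):

    #include <cstddef>
    #include <cstdint>

    // Reflected CRC-32 (IEEE 802.3 polynomial), computed bit by bit.
    std::uint32_t crc32(const std::uint8_t* data, std::size_t len) {
        std::uint32_t crc = 0xFFFFFFFFu;
        for (std::size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; ++bit)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    // On write: store crc32(page) alongside the page.
    // On read: recompute and compare; a mismatch means the page is corrupt
    // and a redundant copy (if any) should be read instead.
    bool pageIsIntact(const std::uint8_t* page, std::size_t len, std::uint32_t storedCrc) {
        return crc32(page, len) == storedCrc;
    }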
In general you should have a check to verify that your system is operating as you expect. Using a hash function is a good approach. What approach you take when you detect a failure depends on your requirements. Storing multiple copies is probably the best approach (and certainly the easiest), because you can get protection from site failures, connectivity failures, and even vendor failures (by choosing a second vendor), instead of just redundancy in the data itself by using FEC.