Google File System Consistency Model - concurrency

I was reading about GFS and its consistency model but I'm failing to grasp some of it.
In particular, can someone provide me with a specific example scenario (or an explanation of why it cannot happen) of:
concurrent record append that could result in record duplication
concurrent record append that could result in undefined regions
concurrent writes (on a single chunk) that could result in undefined regions

I'm quoting from http://research.google.com/archive/gfs.html. Check out Table 1, which is a summary of the possible outcomes for writes/appends:
"If a record append fails at any replica, the client retries the
operation. As a result, replicas of the same chunk may contain
different data possibly including duplicates of the same
record in whole or in part." So any failure on a replica (e.g. timeout) will cause a duplicate record at least on the other replicas. This can happen without concurrent writes.
The same situation that causes a duplicate record also causes an inconsistent (and hence undefined) region. If a replica failed to acknowledge the mutation, it may not have performed it. In that case, when the client retries the append, this replica has to add padding in place of the missing data so that the record can be written at the correct offset. So in this region one replica will contain padding while the others contain the previously written record.
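A rough sketch of that retry behaviour (a toy Python model, not GFS code: replicas are plain lists and the padding marker is made up) might look like this, with an append that times out on one replica and is then retried:

```python
# Toy model of GFS record append with at-least-once retries.
PAD = "<pad>"

class Chunk:
    def __init__(self):
        self.replicas = [[], [], []]          # three replicas of one chunk

    def append(self, record, fail_on=None):
        """Primary-driven append: every replica writes at the same offset.
        `fail_on` simulates a replica that never applies the mutation."""
        offset = max(len(r) for r in self.replicas)
        for i, replica in enumerate(self.replicas):
            while len(replica) < offset:      # pad up to the chosen offset
                replica.append(PAD)
            if i == fail_on:
                continue                      # this replica missed the mutation
            replica.append(record)
        return fail_on is None                # success only if all replicas applied it

chunk = Chunk()
if not chunk.append("A", fail_on=2):          # replica 2 times out...
    chunk.append("A")                         # ...so the client retries
print(chunk.replicas)
# [['A', 'A'], ['A', 'A'], ['<pad>', 'A']]
# Replicas 0 and 1 hold a duplicate of "A"; replica 2 holds padding where the
# first attempt should have been, so that region is inconsistent (and undefined).
```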
A failed write can cause an inconsistent (hence undefined) region as well. More interestingly, successful concurrent writes can cause consistent but undefined regions. "If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They [...] may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in consistent but undefined state [...]."
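A minimal sketch of that scenario, assuming a write that spans exactly two chunks with made-up fragment sizes (an illustration, not GFS code):

```python
# Two clients each write 4 bytes at the same offset, straddling a chunk
# boundary, so the client library splits each write into two operations,
# one per chunk.
write_a = ("AA", "AA")       # client A's write, split across chunk 1 and chunk 2
write_b = ("BB", "BB")       # client B's write, split the same way

# Each chunk serializes the operations it receives, and the two chunks may
# pick different "winners". Every replica of a chunk applies the same serial
# order, so the replicas stay identical.
chunk1 = write_a[0]          # A's fragment is applied last at chunk 1
chunk2 = write_b[1]          # B's fragment is applied last at chunk 2

region = chunk1 + chunk2
replicas = [region] * 3
print(replicas)              # ['AABB', 'AABB', 'AABB']
# The replicas agree (consistent), but the region is half of A's write and
# half of B's write, which neither client ever issued (undefined).
```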

I don't think it really has to do with concurrent appends, but with the at-least-once semantics of their system.
Failure is a fundamental problem of large distributed systems. In the presence of failure a sender may not know if the computer on the other end of the network fully received its message.
For such occasions, distributed systems guarantee that a message is delivered either at most once or at least once.
In this case, it appears GFS decided on at-least-once delivery to the storage nodes.
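A minimal sketch of at-least-once delivery (the function names and the simulated ack loss below are made up for illustration): the sender keeps retrying until it sees an acknowledgement, so a message whose ack was lost gets applied more than once on the receiver, which is exactly the source of the duplicate records above.

```python
import random

def send_with_retry(apply_message, message, max_attempts=5):
    """At-least-once delivery: retry until an ack arrives (or give up)."""
    for _ in range(max_attempts):
        apply_message(message)                # the receiver may well apply it...
        ack_received = random.random() > 0.5  # ...even when the ack is lost
        if ack_received:
            return
    raise TimeoutError("gave up after retries")

log = []
send_with_retry(log.append, "record-1")
print(log)  # ['record-1'], or ['record-1', 'record-1', ...] when acks were lost
```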

Related

Is it possible to get strong consistency in a distributed system?

User sends a write request.
Someone then sends a read request for that resource.
The read request arrives before the write request, so the data it returns is stale, but the reader has no way of knowing that yet.
Likewise, you could have two write requests to the same resource where the later write arrives first.
How is it possible to provide strong consistency in a distributed system when race conditions like this can happen?
What is consistency? You say two writes arrive "out of order", but what established that order? The thing that establishes that order is your basis for consistency.
A simple basis is a generation number: any object O is augmented with a version N. When you retrieve O, you also retrieve N. When you write, you write to O.N. If the object is already at O.N+1 when the write to O.N arrives, the write is stale and generates an error. Multiple versions of O remain available for some period.
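A toy in-memory sketch of that generation-number scheme (class and method names invented for illustration, not any particular product's API):

```python
class StaleWriteError(Exception):
    pass

class VersionedStore:
    """Each object carries a version; a write must name the version it was based on."""
    def __init__(self):
        self._data = {}                            # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))      # (version, value)

    def write(self, key, value, based_on_version):
        current_version, _ = self.read(key)
        if based_on_version != current_version:
            raise StaleWriteError(f"{key} is already at version {current_version}")
        self._data[key] = (current_version + 1, value)

store = VersionedStore()
version, _ = store.read("O")
store.write("O", "first", based_on_version=version)            # ok, O is now at version 1
try:
    store.write("O", "late write", based_on_version=version)   # still based on version 0
except StaleWriteError as err:
    print("rejected:", err)                        # the out-of-order write is refused
```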
Of course, you can't readily replicate the object with this scheme in any widely distributed system, since two disconnected owners of O could be permitting different operations that would be impossible to unify. Etcd, for example, solves this in a limited sense; blockchains solve it in a wider sense.

Amazon S3 - What does eventual consistency mean in regard to delete operations?

I visited Amazon's website and I read the available information in regard to eventual consistency however it's still not completely clear to me.
What I am still not sure about is the behavior of S3 in the window between executing an update/delete and the moment when consistency is eventually achieved.
For example, what will happen if I delete object A and subsequently execute a HEAD operation for object A multiple times?
I suppose that at some point in time (when the deletion becomes consistent) I will start getting a RESOURCE_NOT_FOUND error on every request, but prior to that moment what should I expect to get?
I see two options.
1) Every HEAD operation succeeds up to a point in time and after that every HEAD operation constantly fails with RESOURCE_NOT_FOUND.
2) Each HEAD operation succeeds or fails "randomly" until some moment in which the eventual consistency is achieved.
Could someone clarify which of the two should be the expected behavior?
Many thanks.
I see two options.
It could be either of these. Neither one is necessarily the "expected" behavior. Eventually, requests would all return 404.
S3 is a large-scale system, distributed across multiple availability zones in the region, so each request could hit one of several possible endpoints, each of which could reflect the bucket's state at a slightly different point in time. Once they are all past the point where the object was deleted, they should consistently return 404, but the state of bucket index replication isn't exposed.
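For example, a probe with boto3 might look like the sketch below (the bucket and key names are placeholders and the 10-attempt loop is arbitrary); under option 2 the two branches can alternate until replication converges, after which only the 404 branch fires:

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "my-example-bucket", "object-a"     # placeholders

s3.delete_object(Bucket=BUCKET, Key=KEY)

for attempt in range(10):
    try:
        s3.head_object(Bucket=BUCKET, Key=KEY)
        print(attempt, "still visible")           # served by a not-yet-updated endpoint
    except ClientError as err:
        code = err.response.get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            print(attempt, "404 Not Found")       # served by an up-to-date endpoint
        else:
            raise
    time.sleep(1)
```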

When a ConcurrencyException occurs, what gets written?

We use RavenDB in our production environment. It stores millions of documents, and gets updated pretty much constantly during the day.
We have two boxes load-balanced using a round-robin strategy which replicate to one another.
Every week or so, we get a ConcurrencyException from Raven. I understand that this basically means that one of the servers was told to insert or update the same document twice within a short timeframe - it's kind of like a conflict exception, except that it occurs on a single server instead of between two replicating servers.
What happens when this error occurs? Can I assume that at least one of the writes succeeded? Can I predict which one? Is there anything I can do to make these exceptions less likely?
ConcurrencyException means that on a single server, you have two writes to the same document at the same instant.
That leads to:
One write is accepted.
One write is rejected (with concurrency exception).
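One generic way to make the exceptions less painful is the usual optimistic-concurrency retry loop: catch the exception, reload the latest version, and reapply the change. The sketch below uses hypothetical load_document / store_document helpers standing in for your data-access layer; it is not the RavenDB client API.

```python
class ConcurrencyError(Exception):
    """Raised by the (hypothetical) store when a concurrent write wins the race."""

def update_with_retry(load_document, store_document, doc_id, mutate, attempts=3):
    """Reload the latest version and reapply the change when a concurrent write wins."""
    for _ in range(attempts):
        doc = load_document(doc_id)        # re-read the current version
        mutate(doc)                        # apply the change on top of it
        try:
            store_document(doc_id, doc)    # may raise if someone wrote in between
            return doc
        except ConcurrencyError:
            continue                       # lost the race; retry against the new version
    raise ConcurrencyError(f"gave up updating {doc_id} after {attempts} attempts")
```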

Intel TSX hardware transactional memory what do non-transactional threads see?

Suppose you have two threads, one creates a TSX transaction, and modifies some data structure. The other thread does no synchronization of any kind and reads the same data structure. Is the transaction atomic to it? I can't actually imagine that it can be true, since there is no way afaik to block or restart it if it tries reading a cache line modified by the transaction.
If the transaction is not atomic, then are the write ordering rules on x86 still respected? If it sees write #2, then it is guaranteed that it must be able to see the previous write #1. Does this still hold for writes that happen as part of a transaction?
I could not find answers to these questions anywhere, and I kind of doubt anyone on SO would know either, but at least when somebody finds out this is a Google friendly place to put an answer.
(My answer is based on Intel® 64 and IA-32 Architectures Optimization Reference Manual, Chapter 12)
The transaction is atomic with respect to the read, in that the read will cause the transaction to abort, and thus the transaction appears never to have taken place. Within the transactional region, cache lines that are read (tracked in the L1) form the read-set and lines that are written to form the write-set. If another processor reads from the write-set (which is your example) or writes to either the read-set or the write-set, there is a data conflict.
Data conflicts are detected through the cache coherence protocol. Data conflicts cause transactional aborts. In the initial implementation, the thread that detects the data conflict will transactionally abort.
Thus the thread attempting the transaction is tracking the line and will detect the conflict when the other thread makes its read request. It aborts and "the hardware will restart at the instruction address provided by the operation of the XBEGIN instruction". In this chapter, there are no distinctions as to what the second processor is doing. It does not matter whether it is attempting a transaction or performing a simple read.
To summarize, all other threads (whether transactional or not) see either the full transaction or nothing; only the thread executing the TSX transaction can see its own intermediate state of memory.
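A toy model of that read-set/write-set bookkeeping (pure Python, nothing to do with the real hardware; all names are made up) may make the behaviour concrete:

```python
class ToyTransaction:
    """Buffers its writes until commit; the buffered lines are the write-set."""
    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}                      # speculative writes, invisible to others
        self.aborted = False

    def write(self, line, value):
        self.write_buffer[line] = value

    def commit(self):
        if not self.aborted:
            self.memory.update(self.write_buffer)   # all writes become visible at once

def nontransactional_read(memory, txn, line):
    """A plain read: touching a line in the transaction's write-set is a data conflict."""
    if line in txn.write_buffer:
        txn.aborted = True
        txn.write_buffer.clear()                    # the transaction rolls back
    return memory.get(line)

memory = {"X": 0, "Y": 0}
txn = ToyTransaction(memory)
txn.write("X", 1)                                   # write #1 inside the transaction
txn.write("Y", 2)                                   # write #2 inside the transaction
print(nontransactional_read(memory, txn, "Y"))      # 0 -- conflict, transaction aborts
txn.commit()
print(memory)                                       # {'X': 0, 'Y': 0} -- nothing committed
# The plain reader never sees write #2 without write #1, because it sees neither.
```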

Concurrent writes in Cassandra: Are conflicts possible?

Does Cassandra guarantee consistency of replicas in case of concurrent writes? For example, if N=3, W=3 and there are 3 concurrent writers, is it possible to end up with 3 different values on each replica?
Is it a Cassandra-specific problem, or does the canonical Dynamo design also have this problem, despite its use of vector clocks?
Cassandra uses client-provided timestamps in this case, to ensure each replica keeps the 'latest' value. In your example, where you write to each replica, even when replicas receive the writes in different order, they will use the timestamp provided with the writes to decide which one to keep. Writing the same key with an older timestamp to a replica will just be ignored.
This mechanism isn't just needed to cope with concurrent writes - Cassandra can receive writes out of order over long periods of time (i.e. when replaying hints to a recently down node). To cope with this, when Cassandra compacts SSTables and encounters two entries for the same key, it uses the timestamps to decide which one is kept.
Similarly, Cassandra has a feature called read repair. On read, Cassandra will compare the timestamp given by each replica and return the value associated with the latest timestamp to the client. It will then write this value back to any replicas which were out of date (this can have a performance impact, so the chance of it doing the subsequent write is tuneable).
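A small sketch of that last-write-wins rule (plain Python, not Cassandra code): whatever order the writes arrive in, every replica converges on the value carrying the highest timestamp.

```python
import itertools

def apply_write(replica, key, value, timestamp):
    """Keep a write only if its timestamp is newer than what is already stored."""
    stored = replica.get(key)
    if stored is None or timestamp > stored[1]:
        replica[key] = (value, timestamp)

writes = [("k", "from-client-A", 100),
          ("k", "from-client-B", 101),
          ("k", "from-client-C", 102)]

# Apply the same three writes to three replicas in three different orders.
replicas = [{}, {}, {}]
for replica, order in zip(replicas, itertools.permutations(writes)):
    for write in order:
        apply_write(replica, *write)

print(all(r == replicas[0] for r in replicas))   # True: all keep ("from-client-C", 102)
```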
Just to add to tom.wilkie's answer:
If you want a stronger consistency guarantee for your data, with the latest value always being read back, always read AND write at LOCAL_QUORUM or QUORUM consistency.
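The reason QUORUM reads observe QUORUM writes is the overlap argument: with replication factor N, a write acknowledged by W replicas and a read contacting R replicas must share at least one replica whenever R + W > N. A tiny sketch:

```python
def overlaps(n, r, w):
    """True when a read of r replicas must intersect a write acknowledged by w replicas."""
    return r + w > n

N = 3
QUORUM = N // 2 + 1                   # 2 when N == 3

print(overlaps(N, QUORUM, QUORUM))    # True  -- a QUORUM read sees a QUORUM write
print(overlaps(N, 1, 1))              # False -- a ONE read and a ONE write may miss each other
```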