Will AWS S3 reject a read while an object is being updated?

I read in the S3 docs that PUT is atomic, in the sense that there is never a partial write to an object. My naive intuition is that there might be some underlying locking mechanism that ensures this atomicity. Does that mean that while an object is being updated, reads for the same key will be rejected? (I searched the S3 documentation but did not find a clear answer.)

Does that mean that while an object is being updated, reads for the same key will be rejected?
No. It means that readers keep getting the current version until the update fully finishes; they are never blocked.

Marcin has already answered and his answer should be the accepted one.
Do note how the docs spell out this behavior for the single-key case:
Updates to a single key are atomic. For example, if you make a PUT request to an existing key from one thread and perform a GET request on the same key from a second thread concurrently, you will get either the old data or the new data, but never partial or corrupt data.
Just for the sake of documentation, to validate what Marcin said: https://aws.amazon.com/s3/consistency/
After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected.
For all existing and new objects, and in all regions, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket.
This clearly means you will always get the latest fully written version of the object. If an update is still in progress, you will get the current (old) version; only once the update completes will you get the new one.
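To make this concrete, here is a minimal boto3 sketch (the bucket name and key are placeholders, not from the question): one thread overwrites an object while another reads the same key. The GET is never rejected, and it returns either the complete old body or the complete new body, never a mix.

```python
import threading
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "config.json"  # hypothetical names

def writer():
    # Overwrite the object; the PUT is atomic.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=b'{"version": 2}')

def reader():
    # Concurrent GET: succeeds and returns a complete object,
    # either the old version or the new one.
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    print(body)

t_w = threading.Thread(target=writer)
t_r = threading.Thread(target=reader)
t_w.start(); t_r.start()
t_w.join(); t_r.join()
```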

Related

Easiest way to synchronize a Document across multiple Collections in Firestore?

Suppose I have two top-level collections, users and companies. The companies collection contains a subcollection of users called employees. What's the simplest way to ensure that user records in the users and companies/employees paths are synchronized? Is it more common to use batch operations or a trigger function?
If your document writes are coming directly from your client app, you can use security rules to make sure that all documents have the same values as part of a batch write. If you write the rules correctly, it will force the client to make appropriate batch writes at all required locations, assuming that you have a well-defined document structure.
You can see a similar example of this technique in this other question that ensures that clients increment and decrement a document counter with each create and delete. Your rules will obviously be more complex.
Since security rules only apply to client code, there are no similar techniques for backend code. If you're writing code on the backend, you just have to make sure your batch-write code is correct.
I see no need to trigger a Cloud Function if you're able to do a batch write, as the batch will take effect atomically and immediately, while the function will have some latency, and possibly incur a race condition, since you don't have a guaranteed order of execution.
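For illustration, here is a hedged sketch using the google-cloud-firestore Python client (the collection layout follows the question; the function and field names are invented). A single batch updates the user document and its duplicated copy under the company atomically:

```python
from google.cloud import firestore

db = firestore.Client()

def update_user(user_id: str, company_id: str, data: dict) -> None:
    batch = db.batch()
    # Write the canonical user document...
    batch.set(db.collection("users").document(user_id), data, merge=True)
    # ...and the duplicated copy under the company, in the same batch.
    employee_ref = (
        db.collection("companies")
        .document(company_id)
        .collection("employees")
        .document(user_id)
    )
    batch.set(employee_ref, data, merge=True)
    # Both writes commit atomically: either both documents change or neither does.
    batch.commit()

update_user("alice", "acme", {"displayName": "Alice"})
```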

How does BigtableIO achieves exactly once writes?

Dataflow guarantees exactly once processing and delivery as well. Is this guaranteed at sinks by not allowing mutations to the existing records and only allowing idempotent overwrite?
You're correct. The BigtableIO Dataflow/Beam connector will only write Put and Delete mutations, ignoring Append and Increment ones. See Note in the documentation for the class.
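To see why this matters, here is a toy Python illustration (not the connector's actual code) of why a replayed Put is harmless while a replayed Increment is not. When a runner retries a bundle, exactly-once behavior at the sink only holds if every mutation is idempotent:

```python
cell = {"count": 0}

def put(value):
    # Idempotent: replaying the mutation leaves the same state.
    cell["count"] = value

def increment(delta):
    # Not idempotent: replaying the mutation changes the state.
    cell["count"] += delta

put(5); put(5)                 # a retried Put is harmless
assert cell["count"] == 5

cell["count"] = 0
increment(5); increment(5)     # a retried Increment double-counts
assert cell["count"] == 10
```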

Optimistic locking / optimistic concurrency control

As I have learned, "optimistic locking", sometimes referred to as "optimistic concurrency control", doesn't really involve a lock. A typical implementation is CAS (compare-and-swap).
So I wonder: without locking, why is this still called "optimistic locking"? Is there a historical reason, perhaps because the term originated in the database world?
As you rightly pointed out, the transaction won't acquire any lock on the row/persistent object it tries to update. But, as you may also be aware, optimistic locking works on the principle of versioning: the record's version column (if you have set one) is incremented each time a transaction updates the record. Any transaction that tries to update a particular record must compare the version number it saw at retrieval time with the version number at update time. It is similar to having a key (as in lock-and-key) called the version number and checking whether it still matches: if it matches what is in the database (meaning the record has not been updated by another transaction in the meantime), you perform the update; if the match fails (the record was updated by another transaction), your key no longer works.
Hence versioning/optimistic locking looks as if you hold a key (called the version) to a virtually non-existent lock. The real sense of "lock" shows up when your version of the record fails to match and you are PREVENTED (that is, LOCKED out) from updating the record.
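A minimal sketch of this version-check pattern with sqlite3 (the table and column names are made up for the example); the WHERE clause plays the "compare" role of compare-and-swap:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 0)")

def update_balance(new_balance: int, expected_version: int) -> bool:
    # The row is only written if nobody bumped the version since we read it.
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = 1 AND version = ?",
        (new_balance, expected_version),
    )
    return cur.rowcount == 1  # False: another transaction won the race

assert update_balance(150, expected_version=0)      # succeeds
assert not update_balance(200, expected_version=0)  # stale version, rejected
```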

How do DynamoDB's conditional updates work with its eventual consistency model?

I was reading the DynamoDB documentation and found two interesting features:
Eventually consistent reads
Strongly consistent reads
Conditional updates
My question is: how do these three things interact with each other? Mostly I'm wondering whether conditional updates use a strongly consistent read for checking the condition, or an eventually consistent one. If it's the latter, there is still a race condition, correct?
For a conditional update you need strong consistency. My guess is that an update is a single operation in which the consistent read and the write happen atomically and fail or succeed together.
The way to think of Dynamo is as a group of separate entities that all keep track of the state, inform each other of updates, and agree on whether such updates can be propagated to the whole group or not.
When you write (via the Dynamo API on your behalf), you basically inform a subset of these entities that you want to update data. After that, the data propagates to the rest of them.
When you do an eventually consistent read, you read from one of the entities. It's eventually consistent in the sense that you might read from an entity that has not gotten the memo yet.
When you do a strongly consistent read, you read from enough entities to ensure that what you read has fully propagated. If propagation is in progress, you have to wait.
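In practice you express the condition as part of the write itself, so there is no separate read to race with. A hedged boto3 sketch (table, key, and attribute names are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-table")

try:
    # The condition is checked against the item's current state as part
    # of the atomic write: either the check passes and the update is
    # applied, or the whole operation fails.
    table.update_item(
        Key={"pk": "item-1"},
        UpdateExpression="SET #bal = :new, #ver = #ver + :one",
        ConditionExpression="#ver = :expected",
        ExpressionAttributeNames={"#bal": "balance", "#ver": "version"},
        ExpressionAttributeValues={":new": 150, ":one": 1, ":expected": 3},
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        pass  # another writer got there first; re-read and retry
    else:
        raise
```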

Taking a snapshot of a complex mutable structure in a concurrent environment

Given: a complex structure of various nested collections, with refs scattered in different levels.
Need: A way to take a snapshot of such a structure, while allowing writes to continue to happen in other threads.
So the "reader" thread needs to read the whole complex state in a single long transaction, while the "writer" thread makes modifications in multiple short transactions. As far as I understand, in such a case the STM engine relies on the refs' history.
Here we get some interesting results. E.g., the reader reaches some ref 10 seconds after the beginning of its transaction, while the writer modifies that ref every second. That produces 10 values in the ref's history. If that exceeds the ref's :max-history limit, the reader transaction may be rerun forever; if it exceeds :min-history, the transaction may be rerun several times.
But really the reader needs just a single value of each ref (the first one) and the writer needs just the most recent one; all the intermediate values in the history list are useless. Is there a way to avoid such history overuse?
Thanks.
To me it's a bit of a "design smell" to have a large structure with lots of nested refs. You are effectively emulating a mutable object graph, which is a bad idea if you believe Rich Hickey's take on concurrency.
Some various thoughts to try out:
The idiomatic way to solve this problem in Clojure would be to put the state in a single top-level ref, with everything inside it being immutable. Then the reader can take a snapshot of the entire concurrent state for free (without even needing a transaction); there is a sketch of this idea after this list. Might be difficult to refactor to this from where you currently are, but I'd say it is best practice.
If you only want the reader to get a snapshot of the top-level ref, you can just deref it directly outside of a transaction. Just be aware that the refs inside may continue to get mutated, so whether this is useful or not depends on the consistency requirements you have for the reader.
You can do everything within a (dosync...) transaction as normal for both readers and writer. You may get contention and transaction retries, but it may not be an issue.
You can create a "snapshot" function that quickly traverses the graph and dereferences all the refs within a transaction, returning the result with the refs stripped out (or replaced by new cloned refs). The reader calls snapshot once, then continues to do the rest of its work after the snapshot is completed.
You could take a snapshot immediately each time after the writer finishes, and store it separately in an atom. Readers can use this directly (i.e. only the writer thread accesses the live data graph directly)
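Here is what the first suggestion looks like outside Clojure: a Python sketch (names invented) of keeping all state as one immutable value behind a single reference. Readers snapshot by grabbing the reference; writers swap in a whole new value:

```python
import threading

_lock = threading.Lock()
_state = {"users": (), "orders": ()}  # treated as immutable

def snapshot():
    # Grabbing the reference is atomic, and the value it points to is
    # never mutated, so this is a consistent snapshot of everything.
    return _state

def add_user(user):
    global _state
    with _lock:  # writers serialize among themselves
        _state = {**_state, "users": _state["users"] + (user,)}
```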
The general answer to your question is that you need two things:
A flag to indicate that the system is in "snapshot write" mode
A queue to hold all transactions that occur while the system is in snapshot mode
As far as what to do if the queue overflows because the snapshot process isn't fast enough, well, there isn't much you can do about that except either optimize that process or increase the size of your queue. It's a balance you'll have to strike depending on the needs of your app, and striking it is going to take some pretty extensive testing, depending on how complex your system is.
But you're on the right track. If you basically put the system in "snapshot write mode", then your reader/writer methods should automatically change where they are reading/writing from, so that the thread that is making changes gets all the "current values" and the thread reading the snapshot state is reading all the "snapshot values". You can split these up into separate methods - the snapshot reader will use the "snapshot value" methods, and all other threads will read the "current value" methods.
When the snapshot reader is done with its work, it needs to clear the snapshot state.
If a thread tries to read the "snapshot values" when no "snapshot state" is currently set, they should simply respond with the "current values" instead. No biggie.
Systems that allow snapshots of file systems to be taken for backup purposes, while not preventing new data from being written, follow a similar scheme.
Finally, unless you need to keep a record of all changes to the system (i.e. for an audit trail), the queue of transactions doesn't actually need to be a queue of changes to be applied; it just needs to store the latest value of whatever you're changing in the system. When the "snapshot state" is cleared, you simply write all those uncommitted values to the system and call it done. One thing you might want to consider is keeping a log of the changes yet to be made, so that if you crash you can recover and still apply them. The log file gives you a record of what happened and lets you do this recovery. That's an oversimplification of the recovery process, but that's not really what your question is about, so I'll stop there.
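A rough Python sketch of this flag-plus-overlay scheme (all names invented): while a snapshot is being read, writes land in a "pending" map that keeps only the latest value per key, and are folded into the live state when the snapshot ends:

```python
import threading

class SnapshotStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}        # live values
        self._pending = {}        # latest value per key while snapshotting
        self._snapshotting = False

    def write(self, key, value):
        with self._lock:
            if self._snapshotting:
                self._pending[key] = value  # don't disturb the snapshot
            else:
                self._current[key] = value

    def read(self, key):
        with self._lock:
            # Normal readers always see the newest value.
            if self._snapshotting and key in self._pending:
                return self._pending[key]
            return self._current[key]

    def begin_snapshot(self):
        with self._lock:
            self._snapshotting = True
            return dict(self._current)  # stable copy for the snapshot reader

    def end_snapshot(self):
        with self._lock:
            self._current.update(self._pending)  # apply deferred writes
            self._pending.clear()
            self._snapshotting = False
```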
What you are after is the state-of-the-art in high-performance concurrency. You should look at the work of Nathan Bronson, and his lab's collaborations with Aleksandar Prokopec, Phil Bagwell and the Scala team.
Binary Tree:
http://ppl.stanford.edu/papers/ppopp207-bronson.pdf
https://github.com/nbronson/snaptree/
Tree-of-arrays-based hash map:
http://lampwww.epfl.ch/~prokopec/ctries-snapshot.pdf
However, a quick look at the implementations above should convince you that this is not "roll-your-own" territory. I'd try to adapt an off-the-shelf concurrent data structure to your needs if possible. Everything I've linked to is freely available on the JVM, but it's not native Clojure as such.