How does BigtableIO achieve exactly-once writes? - google-cloud-platform

Dataflow guarantees exactly-once processing and delivery as well. Is this guaranteed at sinks by disallowing mutations to existing records and only allowing idempotent overwrites?

You're correct. The BigtableIO Dataflow/Beam connector will only write Put and Delete mutations, ignoring Append and Increment ones. See the note in the documentation for the class.
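The note referenced above is on the Java BigtableIO class, but the idea is easy to see in a pipeline: every element becomes an idempotent Put/Delete-style mutation, so re-executing a bundle rewrites the same cells rather than appending or incrementing. Below is a minimal, non-authoritative sketch with the Beam Python SDK; the project, instance, table, and column-family names are placeholders.

```python
# A minimal sketch (Beam Python SDK); project, instance, table, and column
# family are placeholders.
import datetime

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row as bt_row


def to_mutation(record):
    """Turn a (key, value) pair into an idempotent Put-style mutation."""
    key, value = record
    direct_row = bt_row.DirectRow(row_key=key.encode("utf-8"))
    # set_cell with a fixed timestamp is an idempotent overwrite: replaying
    # the same element writes the same cell version again.
    direct_row.set_cell(
        column_family_id="cf",
        column=b"value",
        value=value.encode("utf-8"),
        timestamp=datetime.datetime(2024, 1, 1),
    )
    return direct_row


with beam.Pipeline() as p:
    (
        p
        | beam.Create([("row-1", "a"), ("row-2", "b")])
        | beam.Map(to_mutation)
        | WriteToBigTable(
            project_id="my-project",    # placeholder
            instance_id="my-instance",  # placeholder
            table_id="my-table",        # placeholder
        )
    )
```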

Related

Will AWS S3 reject a read while an object is being updated?

I read in the S3 docs that PUT is atomic, in the sense that there is never a partial write to an object. My naive intuition is that there might be some underlying locking mechanism that assures this atomicity. Does that mean that while an object is being updated, reads for the same key will be rejected? (I searched the S3 documentation but did not find a good answer.)
Does that mean that while an object is being updated, reads for the same key will be rejected?
No. It means that you will get the current version until the update fully finishes.
@Marcin has already answered, and his answer should be the accepted one.
Do note one caveat for a single case, as stated in the docs:
Updates to a single key are atomic. For example, if you make a PUT request to an existing key from one thread and perform a GET request on the same key from a second thread concurrently, you will get either the old data or the new data, but never partial or corrupt data.
Just to validate what Marcin said with the documentation: https://aws.amazon.com/s3/consistency/
After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected.
For all existing and new objects, and in all regions, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket.
This clearly means you will always get the latest committed version of the object. If an update is still in progress, you will get the current (old) version; only once the update completes will you get the new version.
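For illustration, here is a small boto3 sketch of the behavior described above (the bucket and key are placeholders): a GET issued while a PUT is in flight is not rejected, it simply returns either the complete old object or the complete new one; once the PUT has returned, reads see the new object.

```python
# A small boto3 sketch; the bucket and key are placeholders.
import threading

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-example-bucket", "demo/object.txt"

# Seed the object with the "old" content.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"old-version")


def overwrite():
    # Overwrite the whole object; an S3 PUT replaces the object atomically.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"new-version")


writer = threading.Thread(target=overwrite)
writer.start()

# A concurrent GET is not rejected: it returns either b"old-version" or
# b"new-version", never a mix of the two.
print(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())

writer.join()

# After the PUT has returned, reads are strongly consistent and see the new object.
assert s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read() == b"new-version"
```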

Easiest way to synchronize a Document across multiple Collections in Firestore?

Suppose I have two top-level collections, users and companies. The companies collection contains a subcollection of Users, called employees. What's the simplest way to ensure that User records in the users and companies/employees paths are synchronized? Is it more common to use batch operations or a trigger function?
If your document writes are coming directly from your client app, you can use security rules to make sure that all documents have the same values as part of a batch write. If you write the rules correctly, it will force the client to make appropriate batch writes at all required locations, assuming that you have a well-defined document structure.
You can see a similar example of this technique in this other question that ensures that clients increment and decrement a document counter with each create and delete. Your rules will obviously be more complex.
Since security rules only apply to client code, there are no similar techniques for backend code. If you're writing code on the backend, you just have to make sure your batch writes are all correct.
I see no need to trigger a Cloud Function if you're able to do a batch write, as the batch will take effect atomically and immediately, while the function will have some latency, and possibly incur a race condition, since you don't have a guaranteed order of execution.
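If the writes happen on a backend, a batch write might look like the following sketch with the google-cloud-firestore Python client; the collection structure follows the question, while the field names and IDs are illustrative assumptions.

```python
# A minimal backend sketch using google-cloud-firestore; field names and IDs
# are illustrative assumptions.
from google.cloud import firestore

db = firestore.Client()


def update_user_everywhere(uid: str, company_id: str, data: dict) -> None:
    """Write the same user data to both locations in one atomic batch."""
    user_ref = db.collection("users").document(uid)
    employee_ref = (
        db.collection("companies")
        .document(company_id)
        .collection("employees")
        .document(uid)
    )

    batch = db.batch()
    batch.set(user_ref, data, merge=True)
    batch.set(employee_ref, data, merge=True)
    # Both writes commit atomically: either both documents change or neither does.
    batch.commit()


update_user_everywhere("uid123", "acme", {"displayName": "Ada Lovelace"})
```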

Does DynamoDB protect against parallel operations on the same document?

Say that I want to make frequent updates to an object in DynamoDB, and I've implemented optimistic locking where we (1) read the document; (2) perform some operations including a version increment; and (3) do a conditional put where the condition is that the version hasn't changed.
If I had thousands of these kinds of requests happening, would I ever run into a situation where two put operations (x and y) proceed in parallel, both pass the condition, x finishes first, and then y overwrites what x just did? I've heard that MongoDB prevents multiple operations from changing a document at the same time, but I have no idea if the same is true for DynamoDB.
Originally, I was going to use transactWrite for this, but since it isn't enabled for global tables and that is a requirement, I'm wondering if optimistic locking will be sufficient.
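For reference, the read / modify / conditional-put cycle described in the question could be sketched with boto3 roughly as follows; the table, key, and attribute names are placeholders. As I understand DynamoDB's conditional-write semantics, the ConditionExpression is evaluated atomically with the write, so two racing writers cannot both pass the same version check: the loser receives ConditionalCheckFailedException and has to re-read and retry, so the "y silently overwrites x" scenario should not occur.

```python
# A sketch of the optimistic-locking loop with boto3; the table and attribute
# names are placeholders, and the item is assumed to already exist with a
# numeric "version" attribute.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-table")


def update_with_optimistic_lock(pk: str, mutate, max_attempts: int = 5) -> None:
    for _ in range(max_attempts):
        # 1) Read the current item (a stale read just makes the conditional
        #    put fail, which triggers a retry).
        item = table.get_item(Key={"pk": pk})["Item"]
        expected_version = item["version"]

        # 2) Apply the change and bump the version.
        new_item = mutate(dict(item))
        new_item["version"] = expected_version + 1

        # 3) Conditional put: only succeeds if nobody else bumped the version.
        try:
            table.put_item(
                Item=new_item,
                ConditionExpression="version = :expected",
                ExpressionAttributeValues={":expected": expected_version},
            )
            return
        except ClientError as err:
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Lost the race: another writer committed first. Re-read and retry.
    raise RuntimeError("gave up after repeated write conflicts")


update_with_optimistic_lock("user#1", lambda it: {**it, "name": "updated"})
```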

How do DynamoDB's conditional updates work with its eventual consistency model?

I was reading the DynamoDB documentation and found three interesting features:
Eventually consistent reads
Strongly consistent reads
Conditional updates
My question is, how do these three things interact with each other? Mostly I'm wondering whether conditional updates use a strongly consistent read for checking the condition, or an eventually consistent read. If it's the latter, there is still a race condition, correct?
For a conditional update you need strong consistency. I am going to guess that an update is a single operation in which the consistent read + write happen atomically and fail/succeed together.
The way to think of Dynamo is as a group of separate entities that all keep track of the state, inform each other of updates that are made, and agree on whether such updates can be propagated to the whole group or not.
When you (or the Dynamo API on your behalf) write, you basically inform a subset of these entities that you want to update data. After that, the data propagates to all of them.
When you do an eventually consistent read, you read from one of the entities. It's eventually consistent, meaning there is a possibility that you will read from an entity that did not get the memo yet.
When doing a strongly consistent read, you read from enough entities to ensure that what you read has propagated. If propagation is in progress, you need to wait.
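Assuming, as the answer above guesses, that the condition check is applied together with the write against the item's latest committed state, a conditional update catches the race even when the preceding read was eventually consistent. A small boto3 sketch, with hypothetical table and attribute names:

```python
# A small boto3 sketch; the table and attribute names are hypothetical.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-table")

# An eventually consistent read may return stale data...
stale = table.get_item(Key={"pk": "order#42"}, ConsistentRead=False)["Item"]
print(stale.get("status"))  # possibly out of date

# ...but the condition below is checked by DynamoDB against the latest
# committed item at write time, atomically with the write itself.
try:
    table.update_item(
        Key={"pk": "order#42"},
        UpdateExpression="SET #s = :shipped",
        ConditionExpression="#s = :pending",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":shipped": "SHIPPED", ":pending": "PENDING"},
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        # Someone else already changed the status: the stale read did not
        # let us clobber their write.
        pass
    else:
        raise
```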

libpqxx transaction serialization & consequences

For my implementation, a particular write must be done in bulk and without the chance of another interfering.
I have been told that two competing transactions in this way will lead to the first one blocking the second, and the second may or may not complete after the first has.
Please post the documentation that confirms this. Also, what exactly happens to the second transaction if the first is blocking? Will it be queued, fail, or some combination?
If this cannot be confirmed, should the transaction isolation level for this transaction be set to SERIALIZABLE? If so, how can that be done with libpqxx prepared statements?
If the transactions are serialized, will the second transaction fail or be queued until the first has completed?
If either fail, how can this be detected with libpqxx?
The only way to conclusively prevent concurrency effects is to LOCK TABLE ... IN ACCESS EXCLUSIVE MODE each table you wish to modify.
This means you're really only doing one thing at a time. It also leads to fun problems with deadlocks if you don't always acquire your locks in the same order.
So usually, what you need to do is figure out what exactly the operations you wish to do are, and how they interact. Determine what concurrency effects you can tolerate, and how to prevent those you cannot.
This question as it stands is just too broad to usefully answer.
Options include:
Exclusively locking tables. (This is the only way to do a multi-row upsert without concurrency problems in PostgreSQL right now). Beware of lock upgrade and lock order related deadlocks.
Appropriate use of SERIALIZABLE isolation - but remember, you have to be able to keep a record of what you did during the transaction and retry it if the tx aborts (see the retry sketch after this list).
Careful row-level locking - SELECT ... FOR UPDATE, SELECT ... FOR SHARE.
"Optimistic locking" / optimistic concurrency control, where appropriate
Writing your queries in ways that make them more friendly toward concurrent operation. For example, replacing read-modify-write cycles with in-place updates.
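The question is about libpqxx, but to keep the sketches here in one language the SERIALIZABLE-plus-retry pattern is shown below in Python with psycopg2; the connection string, table, and SQL are placeholders. The shape would be the same in libpqxx: run the work in a transaction at serializable isolation and retry when the server aborts it with a serialization failure (SQLSTATE 40001), rather than expecting the second transaction to be queued.

```python
# A sketch of the SERIALIZABLE-plus-retry pattern in Python/psycopg2 (the
# question is about libpqxx, but the pattern is library-agnostic).
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
conn.set_session(isolation_level="SERIALIZABLE")


def bulk_write(rows, max_attempts=5):
    for _ in range(max_attempts):
        try:
            with conn:  # commits on success, rolls back on exception
                with conn.cursor() as cur:
                    for key, value in rows:
                        # Placeholder statement: whatever your bulk write is.
                        cur.execute(
                            "UPDATE my_table SET value = %s WHERE key = %s",
                            (value, key),
                        )
            return
        except errors.SerializationFailure:
            # Under SERIALIZABLE, a conflicting transaction is aborted with
            # SQLSTATE 40001 instead of silently overwriting; the client is
            # expected to retry the whole transaction.
            continue
    raise RuntimeError("transaction kept failing with serialization errors")


bulk_write([("a", 1), ("b", 2)])
```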