Say that I want to make frequent updates to an object in DynamoDB, and I've implemented optimistic locking where we (1) read the document; (2) perform some operations including a version increment; and (3) do a conditional put where the condition is that the version hasn't changed.
If I had thousands of these kinds of requests happening, would I ever run into a situation where two put operations (x and y) proceed in parallel, both pass the condition, x finishes first, and then y overwrites what x just did? I've heard that MongoDB prevents multiple operations from changing a document at the same time, but I have no idea if the same is true for DynamoDB.
Originally, I was going to use transactWrite for this, but since it isn't enabled for global tables and that is a requirement, I'm wondering if optimistic locking will be sufficient.
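Concretely, the read / modify / conditional-put cycle described above would look something like this with boto3 (a sketch only; the table name, key, and "version" attribute are made up for illustration):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Documents")   # hypothetical table keyed on "pk"

def update_with_optimistic_lock(pk):
    item = table.get_item(Key={"pk": pk})["Item"]        # (1) read the item
    expected_version = item["version"]
    item["version"] = expected_version + 1                # (2) modify, including the version bump
    # ... apply the rest of the changes to item here ...
    try:
        table.put_item(                                   # (3) conditional put
            Item=item,
            ConditionExpression="#v = :v",
            ExpressionAttributeNames={"#v": "version"},
            ExpressionAttributeValues={":v": expected_version},
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # another writer got in first; re-read and retry instead of overwriting
            raise
        raise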
Related
I was reading the DynamoDB documentation and found three interesting features:
Eventually consistent reads
Strongly consistent reads
Conditional updates
My question is, how do these three things interact with each other? Mostly I'm wondering whether conditional updates use a strongly consistent read for checking the condition, or an eventually consistent one. If it's the latter, there is still a race condition, correct?
For a conditional update you need strong consistency. I am going to guess that an update is a separate operation in which the consistent read + write happen atomically and fail/succeed together.
The way to think of DynamoDB is as a group of separate entities that all keep track of the state, inform each other of updates that are made, and agree on whether such updates can be propagated to the whole group or not.
When you (or the DynamoDB API on your behalf) write, you basically inform a subset of these entities that you want to update data. After that, the data propagates to all of these entities.
When you do an eventually consistent read, you read from one of the entities. It's eventually consistent, meaning there is a possibility that you will read from one of the entities that did not get the memo yet.
When doing a strongly consistent read, you read from enough entities to ensure that what you read has propagated. If propagation is in progress, you need to wait.
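To see the read side of this in client code: with boto3, an eventually consistent read is the default and a strongly consistent one is opt-in (a sketch; the table and key names are made up):

import boto3

table = boto3.resource("dynamodb").Table("Documents")                # hypothetical table

eventual = table.get_item(Key={"pk": "doc-1"})                       # default: eventually consistent
strong = table.get_item(Key={"pk": "doc-1"}, ConsistentRead=True)    # waits for a consistent answer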
I'm doing thousands and thousands of inserts to a PostgreSQL database with Python and Django (using the CLI, so no web server at all).
The objects that are inserted are already in memory, and I'm popping them one by one from a FIFO queue (using Python's native Queue module: https://docs.python.org/2/library/queue.html).
What I'm doing basically is:
args1, args2 = queue.get()                       # args1/args2 are dicts of field values
m1, _ = Model1.objects.get_or_create(**args1)
Model2.objects.create(model1=m1, **args2)        # "model1" stands in for the actual FK field name
I was thinking a way to do this faster was to spawn a few more threads that can do this in parallel. To my surprise, the performance actually decreased slightly. I was expecting an almost linear improvement in relation to the number of threads, so I'm not sure what's going on.
Is there something database specific I'm missing, are there table locks that are blocking the threads when this is running?
Or does it have something to do with the fact that each thread can only access a single database connection at a time?
I have standard configuration for PostgreSQL (9.3) and Django (1.7.7) installed with apt-get on Debian Jessie.
Also I tried with 4 threads, which is the same number of CPUs I have available on my box.
There are a few things going on here.
Firstly, you are using very high-level ORM methods (get_or_create, create). Those are generally not a good fit for bulk operations, since methods like that tend to have a lot of overhead to provide a nice API and also do additional work to prevent users from shooting themselves in the foot too easily.
Secondly your careful use of a queue is very counterproductive in multiple ways:
Due to Django running in autocommit mode by default, each database operation is carried out in its own transaction. Since that is a relatively expensive operation, this causes unnecessary overhead.
Inserting each object by itself also causes a lot more back-and-forth communication between the database and Django, which again produces overhead and slows things down.
Thirdly, the reason using multiple threads is even slower stems from the fact that Python has a GIL (Global Interpreter Lock). This prevents multiple threads from executing Python code at the same time. There is a lot of material on the web about the whys and hows of the GIL and what can be done in which circumstances to mitigate it. There is a nice summary by Dave Beazley about the GIL that should get you started if you're interested in learning more about it.
Additionally I'd generally recommend against doing large inserts from multiple threads in any language since - depending on your database and data model - this can also cause slowdowns inside the database due to possibly required locking.
Now there are many solutions to your problem, but I'd recommend starting with a simple one:
Django actually provides a handy low-level interface to create models in bulk, fittingly enough called bulk_create(). I'd suggest removing all that fancy queue and thread code and using this interface as directly as possible with the data you already have.
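A rough sketch of what that could look like with the data from the question (the field names are placeholders, and the Model1/get_or_create lookups would still need to be handled, here left as-is):

rows = []
while not queue.empty():                                  # drain the queue in the main thread
    args1, args2 = queue.get()
    m1, _ = Model1.objects.get_or_create(**args1)         # still one lookup per row
    rows.append(Model2(model1=m1, **args2))               # build unsaved instances in memory
Model2.objects.bulk_create(rows, batch_size=500)          # one INSERT per batch instead of per row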
In case this isn't sufficient for your case, a possible alternative would be to generate an INSERT INTO statement from the data and execute that directly on the database.
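If you go that route, Django will hand you a raw cursor; something along these lines (the table and column names are invented for the example):

from django.db import connection

# "app_model2", "model1_id" and "value" are invented names for the example
with connection.cursor() as cursor:
    cursor.executemany(
        "INSERT INTO app_model2 (model1_id, value) VALUES (%s, %s)",
        params,   # an iterable of (model1_id, value) tuples built from the queue data
    )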
If all you want to achieve is simply insertion, could you just use the save() method instead of get_or_create()? get_or_create() queries the database first. If the table is large, the call to get_or_create() can be a bottleneck, and that's probably why having multiple parallel threads does not help.
The other possibility lies with the insertion itself. Postgres by default enables auto-commit on a per-insert (transaction) basis. The committing process involves complex mechanisms under the hood. Long story short, you may try disabling auto-commit and see if that helps in your particular case. A relevant article is here.
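From the Django side, the usual way to get that effect is to wrap the whole batch in a single transaction with transaction.atomic() (a sketch only; "model1" again stands in for the real FK field):

from django.db import transaction

with transaction.atomic():                                # one commit for the whole batch
    while not queue.empty():
        args1, args2 = queue.get()
        m1, _ = Model1.objects.get_or_create(**args1)
        Model2.objects.create(model1=m1, **args2)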
Having read the following statement from the official documentation of OrientDB:
In order to guarantee atomicity and consistency, OrientDB acquire an
exclusive lock on the storage during transaction commit.
I am wondering if my understanding of the situation is correct. Here is how I assume this will work:
1. Thread 1 opens a transaction, and reads records #1:100 to #1:200, some from class A and some from class B (something which cannot be determined without the transaction coming to a close).
2. Thread 1 massages the data, maybe even inserting a few records.
3. Thread 1 starts to commit the data. As the database does not have any way to know which parts of the data might be affected by the open transaction, it will blindly block the whole storage unit and verify the #version to enforce optimistic locking on all possibly affected records.
4. Thread 2 tries to read record #1:1 (or any other record in the whole database) and is blocked by the commit process, which is aligned, AFAIK, with exclusive locking on the storage unit. This block occurs, if I'm not off, regardless of the cluster the original data resides on, since we have multi-master datasets.
5. Thread 1 ends the commit process and the database becomes consistent, effectively lifting the lock.
6. At this point, any thread can operate on the dataset, transactionally or otherwise, and will not be bound by the exclusive locking mechanism.
If this is the case, then during the exchange highlighted in point 3 the data store, in its entirety, is in an effective trance state and cannot be reached, read from, or interacted with in any meaningful way.
I do so hope that I am missing my guess.
Disclaimer: I have not had the chance to dig into the underlying code from the rather rich OrientDB codebase. As such, this is, at its best, an educated guess and should not be taken as any sort of reference as to how OrientDB actually operates.
Possible Workarounds:
Should worst come to worst and this happens to be the way OrientDB actually works, I would dearly welcome any workarounds to this conundrum. We are looking for meaningful ways that will still keep OrientDB a viable option for an enterprise-grade, scalable, high-end application.
In the current release of OrientDB, transactions lock the storage in exclusive mode. Fortunately, OrientDB works in an optimistic way, and this is done "only" at commit() time, so it doesn't matter when the transaction was begun.
If this is a showstopper for your use case, you could consider the following:
don't use transactions. In this case you'll go in parallel with no locks, but consider that using indexes requires a lock at the index level. In case an index is the bottleneck, the most common workaround is to create X sub-classes with an index on each. OrientDB will use the sub-classes' indexes if needed, and on CRUD operations only the specific index will be locked
wait for OrientDB 3.0 where this limitation will be removed with real parallel transaction execution
For my implementation, a particular write must be done in bulk and without the chance of another interfering.
I have been told that two competing transactions in this way will lead to the first one blocking the second, and the second may or may not complete after the first has.
Please post the documentation that confirms this. Also, what exactly happens to the second transaction if the first is blocking? Will it be queued, fail, or some combination?
If this cannot be confirmed, should the transaction isolation level for this transaction be set to SERIALIZABLE? If so, how can that be done with libpqxx prepared statements?
If the transactions are serialized, will the second transaction fail or be queued until the first has completed?
If either fail, how can this be detected with libpqxx?
The only way to conclusively prevent concurrency effects is to LOCK TABLE ... IN ACCESS EXCLUSIVE MODE each table you wish to modify.
This means you're really only doing one thing at a time. It also leads to fun problems with deadlocks if you don't always acquire your locks in the same order.
So usually, what you need to do is figure out what exactly the operations you wish to do are, and how they interact. Determine what concurrency effects you can tolerate, and how to prevent those you cannot.
This question as it stands is just too broad to usefully answer.
Options include:
Exclusively locking tables. (This is the only way to do a multi-row upsert without concurrency problems in PostgreSQL right now). Beware of lock upgrade and lock order related deadlocks.
Appropriate use of SERIALIZABLE isolation - but remember, you have to be able to keep a record of what you did during a transaction and retry it if the tx aborts (see the sketch after this list).
Careful row-level locking - SELECT ... FOR UPDATE, SELECT ... FOR SHARE.
"Optimistic locking" / optimistic concurrency control, where appropriate
Writing your queries in ways that make them more friendly toward concurrent operation. For example, replacing read-modify-write cycles with in-place updates.
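The question asks about libpqxx, but the retry loop around a SERIALIZABLE transaction has the same shape in any driver; here is a sketch in Python with psycopg2 (the table, columns, and connection string are invented):

import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

conn = psycopg2.connect("dbname=app")                     # invented connection string
conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

def transfer(src, dst, amount, retries=5):
    for _ in range(retries):
        try:
            with conn, conn.cursor() as cur:               # commit on success, rollback on error
                cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, src))
                cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, dst))
            return
        except errors.SerializationFailure:
            continue                                       # the tx was aborted; redo all of its work
    raise RuntimeError("gave up after repeated serialization failures")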
So, I have a ton of objects, each having several fields, including a C array, which are modified within their "Update()" method. Now I create several threads, each updating a section of these objects. As far as I understand, calling lock() before calling the update function would be useless, since this would essentially cause the updates to be called in sequential order, just as they would be without multithreading. Now, these objects have pointers cross-referencing each other. Do I need to call lock every time ANY field is modified, or just before specific operations (like delete, re-initializing arrays, etc.)?
Do I need to call lock every time ANY field is modified, or just before specific operations (like delete, re-initializing arrays, etc?)
Neither. You need to hold a lock even to read, to make sure another thread isn't partway through modifying the data you're reading. You might want to use a many-reader/one-writer lock. I suggest you start by having a single lock (whether a simple mutex or the more elaborate multi-reader/writer lock) and get the code working so you can profile it and see whether you actually need more fine-grained locking; by then you'll have a bit more experience and understanding of the options and advice about how to manage that.
If you do need fine-grained locking, then the trick is to think about where the locks logically belong - for example - there could be one per object. You'll then need to learn about techniques for avoiding deadlocks. You should do some background reading too.
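The question reads like C++ (or C#), but the per-object lock structure is language-agnostic; here is a minimal sketch in Python, where the lock would be a std::mutex plus std::lock_guard in C++ (names are made up for illustration):

import threading

class Node:
    def __init__(self):
        self.lock = threading.Lock()       # guards every field of this object
        self.data = [0] * 16               # stand-in for the C array
        self.neighbours = []               # cross-references to other Node objects

    def update(self):
        with self.lock:                    # taken for writes *and* reads of self.data
            self.data = [x + 1 for x in self.data]
        # if update() also touches a neighbour's fields, that neighbour's lock must be
        # taken too, always in a consistent order to avoid deadlock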
It depends on the consequences of the data changes you want to make. If each thread is, for example, changing well defined sub-blocks of data and each sub-block is entirely independent of all other sub-blocks then it might make sense to have a mutex per sub-block.
That would allow one thread to deal with one set of sub-blocks whilst another gets a different subset to process.
Having threads make changes without gaining a mutex lock first is going to lead to inconsistencies at best...
If the data and processing isn't subdivisible that way then you would probably have to start thinking about how you might handle whole objects in parallel, ie adopt a coarser granularity and one mutex per object. This is perhaps more likely to be possible - different objects are supposed to be independent of each other, so it should in theory be possible to process their data in parallel.
However the unavoidable truth is that some computer jobs require fast single thread performance. For that one starts seriously needing the right sort of supercomputer and perhaps some jolly long pipelines.