Problem with concurrent transactions operating on the same keys (2PL) - concurrency

I am using RocksDB transactions in 2 Phase Commit mode (Prepare before Commit), having multiple transactions overlapping each other.
Say transaction A begins, then shortly afterwards transaction B begins, and both transactions operate on key 'X'.
A put operation is applied to 'X' in transaction A, so A acquires the lock on key 'X'.
Now the lock will not be released until transaction A is committed, and if transaction B wants to update key 'X', it will not be able to acquire the lock.
Is there a way to transfer the lock on key 'X' from A to B before either transaction is committed? Note that the instance of the class running the DB is single-threaded (technically the transactions are concurrent, but the operations are run sequentially).
I have relied on RocksDB to hopefully take care of this, but this appears not to be the case.
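For what it's worth, here is a minimal sketch of the conflict (not a solution), assuming the pessimistic TransactionDB API with an illustrative lock_timeout; B's put on 'X' simply times out while A still holds the lock:

#include <rocksdb/utilities/transaction_db.h>

void demo(rocksdb::TransactionDB* db) {
    rocksdb::WriteOptions write_opts;
    rocksdb::TransactionOptions txn_opts;
    txn_opts.lock_timeout = 100;   // illustrative; fail instead of waiting indefinitely

    rocksdb::Transaction* a = db->BeginTransaction(write_opts, txn_opts);
    a->SetName("A");
    a->Put("X", "1");              // A acquires the lock on key "X"
    a->Prepare();                  // 2PC prepare; the lock is still held

    rocksdb::Transaction* b = db->BeginTransaction(write_opts, txn_opts);
    b->SetName("B");
    rocksdb::Status s = b->Put("X", "2");
    // s.IsTimedOut() is true here: B cannot take the lock until A commits or
    // rolls back; as far as I know there is no API to hand the lock from A to B.

    a->Commit();
    delete a;
    delete b;
}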

Related

atomic exchange with memory_order_acquire and memory_order_release

I have a situation where I would like to prepare some data in one thread:
// My boolean flag
std::atomic<bool> is_data_ready = false;
Thread 1 (producer thread):
PrepareData();
if (!is_data_ready.exchange(true, std::memory_order_release)) {
    NotifyConsumerThread();
} else {
    return;
}
In the consumer thread (Thread 2):
if (is_data_ready.exchange(false, std::memory_order_acquire)) {
    ProcessData();
}
Does it make sense to use acquire/release order (instead of acq_rel order) for exchange? I am not sure if I understand it correctly: does std::memory_order_release in exchange mean the store is a release store? If so, what is the memory order for the load?
An atomic RMW has a load part and a store part. memory_order_release gives the store side release semantics, while leaving the load side relaxed. The reverse for exchange(val, acquire). With exchange(val, acq_rel) or seq_cst, the load would be an acquire load, the store would be a release store.
(compare_exchange_weak/_strong can have one memory order for the pure-load case where the compare failed, and a separate memory order for the RMW case where it succeeds. This distinction is meaningful on some ISAs, but not on ones like x86 where it's just a single instruction that effectively always stores, even in the false case.)
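As a hedged illustration of that two-order overload (my example, with illustrative names):

#include <atomic>

std::atomic<int> x{0};

bool try_claim() {
    int expected = 0;
    // acq_rel applies to the RMW if the CAS succeeds; acquire applies to the
    // pure load if it fails (expected is updated with the observed value).
    return x.compare_exchange_strong(expected, 1,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire);
}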
And of course atomicity of the exchange (or any other RMW) is guaranteed regardless of anything else; no stores or RMWs to this object by other cores can come between the load and store parts of the exchange. Notice that I didn't mention pure loads, or operations on other objects. See later in this answer and also For purposes of ordering, is atomic read-modify-write one operation or two?
Yes, this looks sensible, although it's simplistic and maybe racy in allowing more stuff to be published after the first batch is consumed (or has started to be consumed; see footnote 1). But for the purposes of understanding how atomic RMWs work, and the ordering of their load and store sides, we can ignore that.
exchange(true, release) "publishes" some shared data stored by PrepareData(), and checks the old value to see if the worker thread needs to get notified.
And in the reader, is_data_ready.exchange(false, acquire) is a load that syncs with the release-store if there was one, creating a happens-before relationship that makes it safe to read that data without data-race UB. And tied to that (as part of the atomic RMW), lets other threads see that it has gone past the point of checking for new work, so it needs another notify if there is any.
Yes, exchange(value, release) means the store part of the RMW has release ordering wrt. other operations in the same thread. The load part is relaxed, but the load/store pair still form an atomic RMW. So the load can't take a value until this core has exclusive ownership of the cache line.
Or in C++ terms, it sees the "latest value" in the modification order of is_data_ready; if some other thread was also storing to is_data_ready, that store will happen either before the load (before the whole exchange), or after the store (after the whole exchange).
Note that a pure load in another core coming after the load part of this exchange is indistinguishable from coming before, so only operations that involve a store are part of the modification order of an object. (That modification order is guaranteed to exist such that all threads can agree on it, even when you're using relaxed loads/stores.)
But the load part of another atomic RMW will have to come before the load part of the exchange, otherwise that other RMW would have this exchange happening between its load and its store. That would violate the atomicity guarantee of the other RMW, so that can't happen. Atomic RMWs on the same object effectively serialize across threads. That's why a million fetch_add(1, mo_relaxed) operations on an atomic counter will increment it by 1 million, regardless of what order they end up running in. (See also C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads re: why atomic RMWs have to work this way.)
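A tiny self-contained demo of that serialization guarantee (my example, not from the question):

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < 250000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);  // relaxed RMWs still serialize
        });
    for (auto& th : threads) th.join();
    assert(counter.load() == 1000000);  // no increments are ever lost
}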
C++ is specified in terms of syncs-with and whether a happens-before guarantee exists that allows your other loads to see other stores by other threads. But humans often like to think in terms of local reordering (within execution of one thread) of operations that access shared memory (via coherent cache).
In terms of a memory-reordering model, the store part of an exchange(val, release) can reorder with later operations other than release or seq_cst. (Note that unlocking a mutex counts as a release operation). But not with any earlier operations. This is what acquire and release semantics are all about, as Jeff Preshing explains: https://preshing.com/20120913/acquire-and-release-semantics/.
Wherever the store ends up, the load is at some point before it. Right before it in the modification order of is_data_ready, but operations on other objects by this thread (especially in other cache lines) may be able to happen in between the load and store parts of an atomic exchange.
In practice, some CPU architectures don't make that possible. Notably, x86 atomic RMW operations are always full barriers: the exchange waits for all earlier loads and stores to complete before it runs, and no later loads or stores start until after it. So not even StoreLoad reordering of the store part of an exchange with later loads is possible on x86.
But on AArch64 you can observe StoreLoad reordering of the store part of a seq_cst exchange with a later relaxed load. But only the store part, not the load part; being seq_cst means the load part of the exchange has acquire semantics and thus happens before any later loads. See For purposes of ordering, is atomic read-modify-write one operation or two?
Footnote 1: is this a usable producer/consumer sync algorithm?
With a single boolean flag (not a queue with a read-index / write-index), IDK how a producer would know when it can overwrite the shared variables that the consumer will look at. If it (or another producer thread) did that right away after seeing is_data_ready == false, you'd race with the reader that's just started reading.
If you can solve that problem, this does appear to avoid the possibility of the consumer missing an update and going to sleep, as long as it handles the case where a second writer adds more data and sends a notify before the consumer finishes ProcessData. (The writers only know that the consumer has started, not when it finishes.) I guess this example isn't showing the notification mechanism, which might itself create synchronization.
If two producers run PrepareData() at overlapping times, only the first one to finish will send a notification, not both. The exception is if the consumer's exchange resets is_data_ready between the two producer exchanges; in that case it will get a second notification. (That sounds pretty hard to deal with in the consumer, and in whatever data structure PrepareData() manages, unless it's something like a lock-free queue itself, in which case just check the queue for work instead of using this mechanism. But again, this is still a usable example for talking about how exchange works.)
If a consumer frequently checks and finds no work to do, that's also extra contention that could have been avoided by checking read-only until it sees a true, and only then exchanging it to false (with an acquire exchange). But since you're worrying about notifications, I assume it's not a spin-wait loop, and instead sleeps when there isn't work to do.
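For what it's worth, one sketch of tying the exchange to an actual notification mechanism is a condition variable; PrepareData and ProcessData are the question's functions, assumed to be defined elsewhere, and whether this fits depends on the unanswered reuse question above:

#include <atomic>
#include <condition_variable>
#include <mutex>

void PrepareData();   // as in the question, defined elsewhere
void ProcessData();

std::atomic<bool> is_data_ready{false};
std::mutex m;
std::condition_variable cv;

void producer() {
    PrepareData();
    if (!is_data_ready.exchange(true, std::memory_order_release)) {
        std::lock_guard<std::mutex> lk(m);   // taking the lock avoids a lost wakeup
        cv.notify_one();
    }
}

void consumer() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return is_data_ready.load(std::memory_order_acquire); });
    lk.unlock();
    if (is_data_ready.exchange(false, std::memory_order_acquire))
        ProcessData();
}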

How atomic is Compare and Swap against global variable in modern computer?

Assume we implement this in a modern programming language like C++. Say we have 5 threads, t1 to t5, and an array of timestamps TS[5], one for each thread. We also have a global timestamp GT which increases gradually as the process runs. Each of the five threads tx makes a local copy of TS[x] as local_ts and then tries to compare-and-swap its own timestamp: CAS(&TS[x], local_ts, GT). My question is: will the final timestamps in TS[5] reflect the order in which each thread's compare-and-swap actually took place? For example, if a thread does its CAS before another thread, its stored timestamp must be less than or equal to the other thread's timestamp.
Please refer to the following simple code example in C++:
TS timestamps[5];   // assuming TS is an atomic wrapper around a long value
TS GT = 0;          // periodically increased by another thread not shown here
void work_load(int id) {
    for (int i = 0; i < 10000; i++);  // simulate the thread doing some work
    TS local = timestamps[id];
    timestamps[id].compare_exchange_strong(local, GT, std::some_memory_order);
    // each thread then reads the other threads' entries in the array and acts
    // according to their values compared with its own
}
int main() {
    std::thread threads[5];
    for (int i = 0; i < 5; i++)
        threads[i] = std::thread(work_load, i);
    for (auto& t : threads)
        t.join();
}
The goal is to design a transactional system with no locks. Each transaction appends its updates to a record as deltas. In the validation phase, a transaction needs to check its deltas against other deltas on the same records that may conflict. I'm trying to design a global data structure which atomically records when each transaction starts its commit phase, so transactions can decide whether they need to abort based on observing who made the conflicting deltas.
Many thanks and kudos to the people in the comment section; my question is answered. The atomic CAS does not include an atomic read of the value I want to swap in upon success. That is, CAS(&object, expected, new_value) does not include fetching new_value from the global variable as part of the atomic CAS. The read-modify-write only guarantees atomicity on the target object.
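Spelled out in code (illustrative only, assuming the timestamps and GT are std::atomic<long>):

#include <atomic>

std::atomic<long> GT{0};            // global timestamp, advanced elsewhere
std::atomic<long> timestamps[5];

void stamp(int id) {
    long expected = timestamps[id].load();
    long desired  = GT.load();      // separate load; GT can advance after this line
    // Only the read-modify-write on timestamps[id] is atomic. Fetching GT is
    // not part of the CAS, so the stored value may already be stale.
    timestamps[id].compare_exchange_strong(expected, desired);
}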

What makes the boost::shared_mutex so slow

I used Google Benchmark to run the following 3 tests, and the result surprised me, since the RW lock is ~4x slower than a simple mutex in release mode (and ~10x slower in debug mode).
void raw_access() {
    (void) (gp->a + gp->b);
}

void mutex_access() {
    std::lock_guard<std::mutex> guard(g_mutex);
    (void) (gp->a + gp->b);
}

void rw_mutex_access() {
    boost::shared_lock<boost::shared_mutex> l(g_rw_mutex);
    (void) (gp->a + gp->b);
}
the result is:
2019-06-26 08:30:45
Running ./perf
Run on (4 X 2500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 5.35, 3.22, 2.57
-----------------------------------------------------------
Benchmark               Time             CPU   Iterations
-----------------------------------------------------------
BM_RawAccess         1.01 ns         1.01 ns    681922241
BM_MutexAccess       18.2 ns         18.2 ns     38479510
BM_RWMutexAccess     92.8 ns         92.8 ns      7561437
I didn't find enough information via Google, so I'm hoping for some help here.
Thanks
I don't know the particulars of how the standard library/boost/etc. implementations differ, although it seems like the standard library version is faster (congrats, whoever wrote it).
So instead I'll try to explain the speed differences between various mutex types on a theoretical level, which will explain why a shared mutex (should) be slower.
Atomic Spin Lock
More so as an academic exercise, consider the simplest "mutex-like" thread-safety construct: a simple atomic spin lock.
In essence, this is nothing more than an std::atomic<bool> or an std::atomic_flag. It is initialized to false. To "lock" the mutex, you do an atomic exchange (test-and-set) in a loop until you get back a false value (i.e. the previous value was false prior to atomically setting it to true).
std::atomic_flag flag = ATOMIC_FLAG_INIT;
// lock it by looping until we observe a false value
while (flag.test_and_set()) ;
// do stuff under "mutex" lock
// unlock by setting it back to false state
flag.clear();
However, due to the nature of this construct, it's what we call an "unfair" mutex, because the order in which threads acquire the lock is not necessarily the order in which they began their attempts to lock it. That is, under high contention, it's possible that a thread tries to lock and simply never succeeds because other threads are luckier. It's very timing-sensitive. Imagine musical chairs.
Because of this, although it functions like a mutex, it's not what we consider a "mutex".
Mutex
A mutex can be thought of as building on top of an atomic spin lock (although it's not typically implemented as such, since they typically are implemented with support of the operating system and/or hardware).
In essence, a mutex is a step above atomic spin locks because it has a queue of waiting threads. This allows it to be "fair" because the order of lock acquisition is (more or less) the same as the order of locking attempts.
If you check sizeof(std::mutex), you may find it's a bit larger than you might expect. On my platform it's 40 bytes. That extra space is used to hold state information, notably including some way of accessing a lock queue for each individual mutex.
When you try to lock a mutex, it performs some low-level thread-safety operation to have thread-safe access to the mutex's status information (e.g. atomic spin lock), checks the state of the mutex, adds your thread to the lock queue, and (typically) puts your thread to sleep while you wait so you don't burn precious CPU time. The low-level thread-safety operation (e.g. the atomic spin lock) is atomically released at the same time the thread goes to sleep (this is typically where OS or hardware support is necessary to be efficient).
Unlocking is performed by performing a low-level thread-safe operation (e.g. atomic spin lock), popping the next waiting thread from the queue, and waking it up. The thread that has been awoken now "owns" the lock. Rinse wash and repeat.
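As an aside, a toy ticket lock shows the fairness idea in miniature: threads are served strictly in the order they asked. It still spins instead of sleeping, so it's only a sketch of the queueing concept, not how std::mutex is actually implemented.

#include <atomic>

class TicketLock {
    std::atomic<unsigned> next_ticket{0};   // take-a-number dispenser
    std::atomic<unsigned> now_serving{0};   // whose turn it is
public:
    void lock() {
        unsigned me = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != me)
            ;                               // spin until our number is called
    }
    void unlock() {
        now_serving.fetch_add(1, std::memory_order_release);
    }
};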
Shared Mutex
A shared mutex takes this concept a step further. It can be owned by a single thread for read/write permissions (like a normal mutex), or by multiple threads for read-only permissions (semantically, anyway - it's up to the programmer to ensure it's safe).
Thus, in addition to the unique ownership queue (like a normal mutex) it also has a shared ownership state. The shared ownership state could be simply a count of the number of threads that currently have shared ownership. If you inspect sizeof(std::shared_mutex) you'll find it's typically even larger than std::mutex. On my system, for instance, it's 56 bytes.
So when you go to lock a shared mutex, it has to do everything a normal mutex does, but additionally has to verify some other things. For instance, if you're trying to lock uniquely, it must verify that there are no shared owners. And when you're trying to lock in shared mode, it must verify that there are no unique owners.
Because we typically want mutexes to be "fair", once a unique locker is in the queue, future sharing lock attempts must queue instead of acquiring the lock, even though it might currently be under sharing (i.e. non-unique) lock by several threads. This is to ensure shared owners don't "bully" a thread that wants unique ownership.
But this also goes the other way: the queuing logic must ensure a shared locker is never placed into an empty queue during shared ownership (because it should immediately succeed and be another shared owner).
Additionally, if there's a unique locker, followed by a shared locker, followed by a unique locker, it must (roughly) guarantee that acquisition order. So each entry in the lock queue also needs a flag denoting its purpose (i.e. shared vs. unique).
And then we think of the wake-up logic. When you unlock a shared mutex, the logic differs depending on the current ownership type of the mutex. If the unlocking thread has unique ownership or is the last shared owner it might have to wake up some thread(s) from the queue. It will either wake up all threads at the front of the queue who are requesting shared ownership, or a single thread at the front of the queue requesting unique ownership.
As you can imagine, all of this extra logic for who's locking for what reason, and how it changes depending not only on the current owners but also on the contents of the queue, makes this potentially quite a bit slower. The hope is that you read significantly more frequently than you write, and thus you can have many sharing owners running concurrently, which mitigates the performance hit of coordinating all of that.
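To make that extra bookkeeping concrete, here is a deliberately unfair toy reader/writer spin lock: no queue, no sleeping, just a sketch of the shared-vs-unique state a real shared mutex has to track on top of everything a plain mutex does.

#include <atomic>

class ToySharedSpinLock {
    // -1 = one unique owner, 0 = free, >0 = number of shared owners
    std::atomic<int> state{0};
public:
    void lock() {                           // unique / writer
        int expected = 0;
        while (!state.compare_exchange_weak(expected, -1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
            expected = 0;                   // only take it when completely free
    }
    void unlock() { state.store(0, std::memory_order_release); }

    void lock_shared() {                    // reader
        int cur = state.load(std::memory_order_relaxed);
        do {
            while (cur < 0)                 // a writer owns it; wait
                cur = state.load(std::memory_order_relaxed);
        } while (!state.compare_exchange_weak(cur, cur + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed));
    }
    void unlock_shared() { state.fetch_sub(1, std::memory_order_release); }
};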

How do I safely read a variable from one thread and modify it from another?

I have a class instance which is being used in multiple threads. I am updating multiple member variables from one thread and reading the same member variables from another thread. What is the correct way to maintain thread safety?
eg:
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
//unlock here?
Should I unlock the mutex here? (If another thread accesses obj1's member variables 1 and 2 now, the data it reads might not be correct because memberV2 has not yet been updated. However, if I do not release the lock, the other thread might block because there is a time-consuming operation below.)
//perform some time consuming operation which must be done before the assignment to memberV2 and after the assignment to memberV1
obj1.memberV2 = update field 2 from some calculation
pthread_mutex_unlock(&mutex1) //should I only unlock here?
Thanks
Your locking is correct. You should not release the lock early just to allow another thread to proceed (because that would allow the other thread to see the object in an inconsistent state.)
Perhaps it would be better to do something like:
//perform time consuming calculation
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = result;
pthread_mutex_unlock(&mutex1)
This of course assumes that the values used in the calculation won't be modified on any other thread.
It's hard to tell what you are doing that is causing problems. The mutex pattern is pretty simple. You lock the mutex, access the shared data, and unlock the mutex. This protects the data because the mutex only lets one thread get the lock at a time. Any thread that fails to get the lock has to wait until the mutex is unlocked. Unlocking wakes the waiters up. They will then fight to acquire the lock. Losers go back to sleep. The time it takes to wake up might be multiple ms or more from the time the lock is released. Make sure you always unlock the mutex eventually.
Make sure you don't keep locks held for a long period of time. Most of the time, a long period of time is like a microsecond. I prefer to keep it down around "a few lines of code." That's why people have suggested that you do the long-running calculation outside the lock. The reason for not holding locks a long time is that you increase the number of times other threads will hit the lock and have to spin or sleep, which decreases performance. You also increase the probability that your thread gets pre-empted while owning the lock, which means the lock stays held while that thread sleeps. That's even worse for performance.
Threads that fail to get a lock don't have to sleep. Spinning means a thread encountering a locked mutex doesn't sleep, but loops repeatedly testing the lock for a predefined period before giving up and sleeping. This is a good idea if you have multiple cores or cores capable of multiple simultaneous threads. Multiple active threads mean two threads can be executing the code at the same time. If the lock is around a small amount of code, then the thread that got the lock is going to be done very soon; the other thread need only wait a couple of nanoseconds before it gets the lock. Remember, sleeping a thread means a context switch plus some code to attach your thread to the waiters on the mutex, and all of that has costs. Plus, once your thread sleeps, you have to wait for the scheduler to wake it up, which could take multiple ms. Look up spinlocks.
If you only have one core, then a thread encountering a lock means another (sleeping) thread owns the lock, and no matter how long you spin it isn't going to unlock. So you would use a lock that puts a waiter to sleep immediately, in the hope that the thread owning the lock will wake up and finish.
You should assume that a thread can be preempted at any machine-code instruction. Also assume that each line of C code is probably many machine-code instructions. The classic example is i++. This is one statement in C, but a read, an increment, and a store in machine-code land.
If you really care about performance, try to use atomic operations first. Look to mutexes as a last resort. Most concurrency problems are easily solved with atomic operations (google gcc atomic operations to start learning) and very few problems really need mutexes. Mutexes are way way way slower.
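To make the atomics-first advice and the earlier i++ example concrete (my illustration): the plain counter is a data race under concurrent writers, while the atomic one is a single indivisible read-modify-write and needs no mutex.

#include <atomic>

int plain_hits = 0;                  // ++plain_hits from two threads: undefined behavior
std::atomic<int> atomic_hits{0};

void record_hit() {
    atomic_hits.fetch_add(1, std::memory_order_relaxed);  // one atomic read-modify-write
}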
Protect your shared data wherever it is written and wherever it is read. else...prepare for failure. You don't have to protect shared data during periods of time when only a single thread is active.
It's often useful to be able to run your app with 1 thread as well as N threads. This way you can debug race conditions more easily.
Minimize the shared data that you protect with locks. Try to organize data into structures such that a single thread can gain exclusive access to the entire structure (perhaps by setting a single locked flag or version number or both) and not have to worry about anything after that. Then most of the code isn't cluttered with locks and race conditions.
Functions that ultimately write to shared variables should use temp variables until the last moment and then copy the results over. Not only will the compiler generate better code, but accesses to shared variables, especially writes, cause cache-line traffic between the caches and main RAM and all sorts of other performance issues. Again, if you don't care about performance, disregard this. However, I recommend you google the document "What Every Programmer Should Know About Memory" if you want to know more.
If you are reading a single variable from the shared data you probably don't need to lock as long as the variable is an integer type and not a member of a bitfield (bitfield members are read/written with multiple instructions). Read up on atomic operations. When you need to deal with multiple values, then you need a lock to make sure you didn't read version A of one value, get preempted, and then read version B of the next value. Same holds true for writing.
You will find that copies of data, even copies of entire structures, come in handy. You can build a new copy of the data and then swap it in by changing a pointer with one atomic operation. You can also make a copy of the data and do calculations on it without worrying about whether it changes.
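One sketch of that copy-then-swap idea, assuming the C++11 atomic shared_ptr free functions so an old snapshot stays alive while readers still hold it (names are illustrative):

#include <atomic>
#include <memory>

struct Snapshot { int a = 0; int b = 0; };

std::shared_ptr<const Snapshot> current = std::make_shared<const Snapshot>();

void writer(int a, int b) {
    auto fresh = std::make_shared<const Snapshot>(Snapshot{a, b});  // private copy
    std::atomic_store(&current, fresh);    // publish in one atomic pointer swap
}

std::shared_ptr<const Snapshot> reader() {
    return std::atomic_load(&current);     // stable view; freed when the last reader drops it
}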
So maybe what you want to do is:
lock the mutex
Make a copy of the input data to the long running calculation.
unlock the mutex
L1: Do the calculation
Lock the mutex
if the input data has changed and this matters
read the input data, unlock the mutex and go to L1
update the data
unlock mutex
Maybe, in the example above, you still store the result even if the input changed, but then go back and recalculate. It depends on whether other threads can use a slightly out-of-date answer. Maybe other threads, when they see that a thread is already doing the calculation, simply change the input data and leave it to the busy thread to notice that and redo the calculation (there will be a race condition you need to handle if you do that, but an easy one). That way the other threads can do other work rather than just sleep.
cheers.
Probably the best thing to do is:
temp = //perform some time consuming operation which must be done before the assignment to memberV2
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = temp; //result from previous calculation
pthread_mutex_unlock(&mutex1)
What I would do is separate the calculation from the update:
temp = some calculation
pthread_mutex_lock(&mutex1);
obj.memberV1 = 1;
obj.memberV2 = temp;
pthread_mutex_unlock(&mutex1);

Design options for a C++ thread-safe object cache

I'm in the process of writing a template library for data-caching in C++ where concurrent read can be done and concurrent write too, but not for the same key. The pattern can be explained with the following environment:
A mutex for the cache write.
A mutex for each key in the cache.
This way, if a thread requests a key from the cache and it is not present, it can start a locked calculation for that unique key. In the meantime other threads can retrieve or calculate data for other keys, but a thread that tries to access the first key has to block and wait.
The main constraints are:
Never calculate the value for a key at the same time.
Calculating the value for 2 different keys can be done concurrently.
Data-retrieval must not lock other threads from retrieve data from other keys.
My other constraints but already resolved are:
fixed (known at compile time) maximum cache size with MRU-based (most recently used) eviction.
retrieval by reference (implies mutexed shared counting)
I'm not sure using 1 mutex for each key is the right way to implement this, but I didn't find any other substantially different way.
Do you know of other patterns to implement this, or do you find this a suitable solution? I don't like the idea of having about 100 mutexes (the cache size is around 100 keys).
You want to lock and you want to wait. Thus there shall be "conditions" somewhere (as pthread_cond_t on Unix-like systems).
I suggest the following:
There is a global mutex which is used only to add or remove keys in the map.
The map maps keys to values, where values are wrappers. Each wrapper contains a condition and potentially a value. The condition is signaled when the value is set.
When a thread wishes to obtain a value from the cache, it first acquires the global mutex. It then looks in the map:
If there is a wrapper for that key, and that wrapper contains a value, then the thread has its value and may release the global mutex.
If there is a wrapper for that key but no value yet, then this means that some other thread is currently busy computing the value. The thread then blocks on the condition, to be awakened by the other thread when it has finished.
If there is no wrapper, then the thread registers a new wrapper in the map and then proceeds to compute the value. When the value is computed, it sets the value and signals the condition.
In pseudo code this looks like this:
mutex_t global_mutex
hashmap_t map

lock(global_mutex)
w = map.get(key)
if (w == NULL) {
    w = new Wrapper
    map.put(key, w)
    unlock(global_mutex)
    v = compute_value()
    lock(global_mutex)
    w.set(v)
    signal(w.cond)
    unlock(global_mutex)
    return v
} else {
    v = w.get()
    while (v == NULL) {
        unlock-and-wait(global_mutex, w.cond)
        v = w.get()
    }
    unlock(global_mutex)
    return v
}
In pthreads terms, lock is pthread_mutex_lock(), unlock is pthread_mutex_unlock(), unlock-and-wait is pthread_cond_wait() and signal is pthread_cond_signal(). unlock-and-wait atomically releases the mutex and marks the thread as waiting on the condition; when the thread is awaken, the mutex is automatically reacquired.
This means that each wrapper will have to contain a condition. This embodies your various requirements:
No threads holds a mutex for a long period of time (either blocking or computing a value).
When a value is to be computed, only one thread does it, the other threads which wish to access the value just wait for it to be available.
Note that when a thread wishes to get a value and finds out that some other thread is already busy computing it, the thread ends up locking the global mutex twice: once in the beginning, and once when the value is available. A more complex solution, with one mutex per wrapper, may avoid the second locking, but unless contention is very high, I doubt that it is worth the effort.
About having many mutexes: mutexes are cheap. A mutex is basically an int; it costs little more than the four-or-so bytes of RAM used to store it. Beware of Windows terminology: in Win32, what I call a mutex here is known as a "critical section"; what Win32 creates when CreateMutex() is called is something quite different, which is accessible from several distinct processes and is much more expensive since it involves round trips to the kernel. Note that in Java, every single object instance contains a mutex, and Java developers do not seem to be overly grumpy on that subject.
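For reference, here is a rough C++ rendering of the pseudocode above, with std::condition_variable standing in for the pthread condition; the ComputeOnceCache name and the compute_value callback are illustrative, not from the question:

#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

template <typename V>
class ComputeOnceCache {
    struct Wrapper {
        std::condition_variable cond;
        std::optional<V> value;     // empty while some thread is still computing
    };
    std::mutex global_mutex;
    std::unordered_map<std::string, std::shared_ptr<Wrapper>> map;

public:
    V get(const std::string& key, const std::function<V()>& compute_value) {
        std::unique_lock<std::mutex> lock(global_mutex);
        auto it = map.find(key);
        if (it == map.end()) {
            auto w = std::make_shared<Wrapper>();
            map.emplace(key, w);
            lock.unlock();          // compute outside the global mutex
            V v = compute_value();
            lock.lock();
            w->value = v;
            w->cond.notify_all();   // wake every thread waiting on this key
            return v;
        }
        auto w = it->second;        // keep the wrapper alive while waiting
        w->cond.wait(lock, [&] { return w->value.has_value(); });
        return *w->value;
    }
};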
You could use a mutex pool instead of allocating one mutex per resource. As reads are requested, first check the slot in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that slot and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
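If the pool bookkeeping feels heavy, a simpler cousin of this idea is lock striping: hash each key onto one of a small fixed set of mutexes, so unrelated keys occasionally share a lock but nothing has to be checked out or returned. A rough sketch (names are mine):

#include <array>
#include <functional>
#include <mutex>
#include <string>

class LockStripe {
    std::array<std::mutex, 16> stripes;   // far fewer mutexes than the ~100 keys
public:
    std::mutex& for_key(const std::string& key) {
        return stripes[std::hash<std::string>{}(key) % stripes.size()];
    }
};

// usage (illustrative): std::lock_guard<std::mutex> guard(stripe.for_key(key));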
One possibility that would be a much simpler solution would be to use a single reader/writer lock on the entire cache. Given that you know there is a maximum number of entries (and it is relatively small), it sounds like adding new keys to the cache is a "rare" event. The general logic would be:
acquire read lock
search for key
if found
    use the key
else
    release read lock
    acquire write lock
    add key
    release write lock
    // acquire the read lock again and use it (probably encapsulate in a method)
endif
Not knowing more about the usage patterns, I can't say for sure if this is a good solution. It is very simple, though, and if the usage is predominantly reads, then it is very inexpensive in terms of locking.
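As a rough sketch of that logic, assuming C++17's std::shared_mutex and an illustrative get_or_add interface (the re-check under the write lock is needed because another thread may insert the key after the read lock is released):

#include <shared_mutex>
#include <string>
#include <unordered_map>

template <typename V>
class RWLockedCache {
    std::shared_mutex rw;
    std::unordered_map<std::string, V> map;

public:
    V get_or_add(const std::string& key, const V& value_if_missing) {
        {
            std::shared_lock<std::shared_mutex> read(rw);      // many concurrent readers
            auto it = map.find(key);
            if (it != map.end())
                return it->second;
        }
        std::unique_lock<std::shared_mutex> write(rw);          // exclusive writer
        auto result = map.try_emplace(key, value_if_missing);   // re-check, then insert
        return result.first->second;
    }
};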