Lockless reader/writer

Lockless reader/writer - c++

I have some data that is both read and updated by multiple threads. Both reads and writes must be atomic. I was thinking of doing it like this:
// Values must be read and updated atomically
struct SValues
{
double a;
double b;
double c;
double d;
};
class Test
{
public:
Test()
{
m_pValues = &m_values;
}
SValues* LockAndGet()
{
// Spin forver until we got ownership of the pointer
while (true)
{
SValues* pValues = (SValues*)::InterlockedExchange((long*)m_pValues, 0xffffffff);
if (pValues != (SValues*)0xffffffff)
{
return pValues;
}
}
}
void Unlock(SValues* pValues)
{
// Return the pointer so other threads can lock it
::InterlockedExchange((long*)m_pValues, (long)pValues);
}
private:
SValues* m_pValues;
SValues m_values;
};
void TestFunc()
{
Test test;
SValues* pValues = test.LockAndGet();
// Update or read values
test.Unlock(pValues);
}
The data is protected by stealing the pointer to it for every read and write, which should make it threadsafe, but it requires two interlocked instructions for every access. There will be plenty of both reads and writes and I cannot tell in advance if there will be more reads or more writes.
Can it be done more effective than this? This also locks when reading, but since it's quite possible to have more writes then reads there is no point in optimizing for reading, unless it does not inflict a penalty on writing.
I was thinking of reads acquiring the pointer without an interlocked instruction (along with a sequence number), copying the data, and then having a way of telling if the sequence number had changed, in which case it should retry. This would require some memory barriers, though, and I don't know whether or not it could improve the speed.
----- EDIT -----
Thanks all, great comments! I haven't actually run this code, but I will try to compare the current method with a critical section later today (if I get the time). I'm still looking for an optimal solution, so I will get back to the more advanced comments later. Thanks again!

What you have written is essentially a spinlock. If you're going to do that, then you might as well just use a mutex, such as boost::mutex. If you really want a spinlock, use a system-provided one, or one from a library rather than writing your own.
Other possibilities include doing some form of copy-on-write. Store the data structure by pointer, and just read the pointer (atomically) on the read side. On the write side then create a new instance (copying the old data as necessary) and atomically swap the pointer. If the write does need the old value and there is more than one writer then you will either need to do a compare-exchange loop to ensure that the value hasn't changed since you read it (beware ABA issues), or a mutex for the writers. If you do this then you need to be careful how you manage memory --- you need some way to reclaim instances of the data when no threads are referencing it (but not before).

There are several ways to resolve this, specifically without mutexes or locking mechanisms. The problem is that I'm not sure what the constraints on your system is.
Remember that atomic operations is something that often get moved around by the compilers in C++.
Generally I would solve the issue like this:
Multiple-producer-single-consumer by having 1 single-producer-single-consumer per writing thread. Each thread writes into their own queue. A single consumer thread that gathers the produced data and stores it in a single-consumer-multiple-reader data storage. The implementation for this is a lot of work and only recommended if you are doing a time-critical application and that you have the time to put in for this solution.
There are more things to read up about this, since the implementation is platform specific:
Atomic etc operations on windows/xbox360:
http://msdn.microsoft.com/en-us/library/ee418650(VS.85).aspx
The multithreaded single-producer-single-consumer without locks:
http://www.codeproject.com/KB/threads/LockFree.aspx#heading0005
What "volatile" really is and can be used for:
http://www.drdobbs.com/cpp/212701484
Herb Sutter has written a good article that reminds you of the dangers of writing this kind of code:
http://www.drdobbs.com/cpp/210600279;jsessionid=ZSUN3G3VXJM0BQE1GHRSKHWATMY32JVN?pgno=2

Related

Concurrency Problem: how to have only one thread go through critical section

I want to have a multithreaded function that allocates some memory for an object obj and returns the allocated memory. My current single-threaded and multiple threaded version codes are below.
The multi-threaded version has no race conditions but runs slow when a lot of threads are trying to get the lock. After the malloc and pointer update, each thread still needs to acquire and release the same lock. That causes some multi-threading performance drop. I wonder if there are some other ways to improve performance.
struct multi_level_tree{
multi_level_tree* ptr[256];
mutex mtx;
};
multi_level_tree tree; // A global object that every thread need to access and update
/* Single Threaded */
multi_level_tree* get_ptr(multi_level_tree* cur, int idx) {
if (!cur[idx].ptr)
cur[idx].ptr = malloc(sizeof(T));
return cur[idx].ptr;
}
/* Multi Threaded with mutex */
void get_ptr(multi_level_tree* cur, int idx) {
if (!cur[idx].ptr) {
cur[idx].mtx.lock(); // other threads wait here, and go one by one
/* Critical Section Start */
if (!cur[idx].ptr)
cur[idx].ptr = malloc(sizeof(multi_level_tree)); // malloc takes a while
/* Critical Section End */
cur[idx].mtx.unlock();
}
return cur[idx].ptr;
}
The code I am looking for should have the following property.
When the first thread allocated the memory, it should alert all other threads waiting for it.
All other threads should be unblocked at the same time.
No race condition.
The challenges in the problem
* The tree is sparse with multiple levels, initialize all of it is impossible considering the memory we have
* Similar to Double-Checked Locking problem, but was trying to avoid std::atomic
The point for this code is to implement a multi-level array as a global variable. Except for the lowest level, each array is a list of pointers to the next level array. Since this data structure needs to grow dynamically, I got into this problem.

how to have only one thread go through critical section
You could use a mutex. There's an example in your question.
It is not the most optimal solution for synchronised innitialisation. A simple improvement is to use a local static, in which case compiler is responsible for implementing the synchronisation:
T& get_T() {
static T instance;
return instance;
}
but runs slow when a lot of threads are trying to get the lock
This problem is inherent with serialising the access to the same data structure. A way to improve performance is to avoid doing that in the first place.
In this particular example, it appears that you could simply initialise the resource while the process is still single threaded, and start the parallel threads only after the initialisation is complete. That way no locking is required to access the pointer.
If that is not an option, another approach is to simply call get_ptr once in each thread, and store a copy locally. That way the locking overhead remains minimal.
Even better would be to have separate data structures in each thread. This is useful when threads only produce data, and don't need to access results from other threads.
Regarding edited example: You might benefit from a lock free tree implementation. It may be difficult to implement however.

Since you cannot easily fix it, since it's inherent of concurrency, i have an idea that may improve or decrease performance rather substantially, through.
If this resource is really used that often and is detrimental you could try to use Active Object (https://en.wikipedia.org/wiki/Active_object) and Boost Lockfree Queue (https://www.boost.org/doc/libs/1_66_0/doc/html/lockfree/reference.html#header.boost.lockfree.queue_hpp). Use atomic store/load on Future objects, and you will make this process completely lockless. But on the other hand it will require a single thread to maintain. Performance of such solution depends heavily on how often is this resource used.

From the comment form #WilliamClements , I see this is a double-checked locking problem itself. The original multi-threading code in my question may broke. To program it correctly, I switched to atomic pointers to prevent ordering problems with load/store instructions.
However, the example still uses a lock that I want to get rid of. Therefore, I choose to use std::atomic::compare_exchange_weak to only update the pointer when its value is nullptr. In this way, only one thread will successfully update the pointer value, and other threads are going to release requested memory if they fail std::atomic::compare_exchange_weak.
This code is doing very well for me so far.
struct multi_level_tree{
std::atomic<multi_level_tree*> ptr;
};
multi_level_tree tree;
void get_ptr(multi_level_tree* cur, int idx) {
if (!cur[idx].ptr.load()) {
/* Critical Section Start */
if (!cur[idx].ptr.load()) {
node* tmp = malloc(sizeof(multi_level_tree)*256);
if (cur[idx].ptr.compare_exchange_weak(nullptr, tmp)) {
/* successfully updated, do nothing */
}
else {
/* Already updated by other threads, release */
free(tmp);
}
}
/* Critical Section End */
}
return cur[idx].ptr;
}

Efficient way to have a thread wait for a value to change in memory?

For some silly reason, there's a piece of hardware on my (GNU/Linux) machine that can only communicate a certain occurrence by writing a value to memory. Assume that by some magic, the area of memory the hardware writes to is visible to a process I'm running. Now, I want to have a thread within that process keep track of that value, and as soon as possible after it has changed - execute some code. However, it is more important to me that the thread not waste CPU time than for it to absolutely minimize the response delay. So - no busy-waiting on a volatile...
How should I best do this (using modern C++)?
Notes:
I don't mind a solution involving atomics, or synchronization mechanisms (in fact, that would perhaps be preferable) - as long as you bear in mind that the hardware doesn't support atomic operations on host memory - it performs a plain write.
The value the hardware writes can be whatever I like, as can the initial value in the memory location it writes to.
I used C++11 since it's the popular tag for Modern C++, but really, C++14 is great and C++17 is ok. On the other hand, even a C-based solution will do.

So, the naive thing to do would be non-busy sleeping, e.g.:
volatile int32_t* special_location = get_special_location();
auto polling_interval_in_usec = perform_tradeoff_between_accuracy_and_cpu_load();
auto polling_interval = std::chrono::microseconds(polling_interval_in_usec);
while(should_continue_polling()) {
if (*special_location == HardwareIsDone) {
do_stuff();
return;
}
std::this_thread::sleep_for(polling_interval);
}

This is usually done via std::condition_variable.
... as long as you bear in mind that the hardware doesn't support atomic operations on host memory - it performs a plain write.
Implementations of std::atomic may fall back to mutexes in such cases
UPD - Possible implementation details: assuming you have some data structure in a form of:
struct MyData {
std::mutex mutex;
std::condition_variable cv;
some_user_type value;
};
and you have an access to it from several processes. Writer process overrides value and notifies cv via notify_one, reader process waits on cv in a somewhat similar to busy wait manner, but thread yields for the wait duration. Everything else I could add is already present in the referred examples.

C++ constructor memory synchronization

Assume that I have code like:
void InitializeComplexClass(ComplexClass* c);
class Foo {
public:
Foo() {
i = 0;
InitializeComplexClass(&c);
}
private:
ComplexClass c;
int i;
};
If I now do something like Foo f; and hand a pointer to f over to another thread, what guarantees do I have that any stores done by InitializeComplexClass() will be visible to the CPU executing the other thread that accesses f? What about the store writing zero into i? Would I have to add a mutex to the class, take a writer lock on it in the constructor and take corresponding reader locks in any methods that accesses the member?
Update: Assume I hand a pointer over to a bunch of other threads once the constructor has returned. I'm not assuming that the code is running on x86, but could be instead running on something like PowerPC, which has a lot of freedom to do memory reordering. I'm essentially interested in what sorts of memory barriers the compiler has to inject into the code when the constructor returns.

In order for the other thread to be able to know about your new object, you have to hand over the object / signal other thread somehow. For signaling a thread you write to memory. Both x86 and x64 perform all memory writes in order, CPU does not reorder these operations with regards to each other. This is called "Total Store Ordering", so CPU write queue works like "first in first out".
Given that you create an object first and then pass it on to another thread, these changes to memory data will also occur in order and the other thread will always see them in the same order. By the time the other thread learns about the new object, the contents of this object was guaranteed to be available for that thread even earlier (if the thread only somehow knew where to look).
In conclusion, you do not have to synchronise anything this time. Handing over the object after it has been initialised is all the synchronisation you need.
Update: On non-TSO architectures you do not have this TSO guarantee. So you need to synchronise. Use MemoryBarrier() macro (or any interlocked operation), or some synchronisation API. Signalling the other thread by corresponding API causes also synchronisation, otherwise it would not be synchronisation API.
x86 and x64 CPU may reorder writes past reads, but that is not relevant here. Just for better understanding - writes can be ordered after reads since writes to memory go through a write queue and flushing that queue may take some time. On the other hand, read cache is always consistent with latest updates from other processors (that have went through their own write queue).
This topic has been made so unbelievably confusing for so many, but in the end there is only a couple of things a x86-x64 programmer has to be worried about:
- First, is the existence of write queue (and one should not at all be worried about read cache!).
- Secondly, concurrent writing and reading in different threads to same variable in case of non-atomic variable length, which may cause data tearing, and for which case you would need synchronisation mechanisms.
- And finally, concurrent updates to same variable from multiple threads, for which we have interlocked operations, or again synchronisation mechanisms.)

If you do :
Foo f;
// HERE: InitializeComplexClass() and "i" member init are guaranteed to be completed
passToOtherThread(&f);
/* From this point, you cannot guarantee the state/members
of 'f' since another thread can modify it */
If you're passing an instance pointer to another thread, you need to implement guards in order for both threads to interact with the same instance. If you ONLY plan to use the instance on the other thread, you do not need to implement guards. However, do not pass a stack pointer like in your example, pass a new instance like this:
passToOtherThread(new Foo());
And make sure to delete it when you are done with it.

What level of locking granularity is good in concurrent data structures?

I am quite new to multi-threading, I have a single threaded data analysis app that has a good bit of potential for parallelization and while the data sets are large it does not come close to saturating the hard-disk read/write so I figure I should take advantage of the threading support that is now in the standard and try to speed the beast up.
After some research I decided that producer consumer was a good approach for the reading of data from the disk and processing it and I started writing an object pool that would become part of the circular buffer that will be where the producers put data and the consumers get the data. As I was writing the class it felt like I was being too fine grained in how I was handling locking and releasing data members. It feels like half the code is locking and unlocking and like there are an insane number of synchronization objects floating around.
So I come to you with a class declaration and a sample function and this question: Is this too fine-grained? Not fine grained enough? Poorly thought out?
struct PoolArray
{
public:
Obj* arr;
uint32 used;
uint32 refs;
std::mutex locker;
};
class SegmentedPool
{
public: /*Construction and destruction cut out*/
void alloc(uint32 cellsNeeded, PoolPtr& ptr);
void dealloc(PoolPtr& ptr);
void clearAll();
private:
void expand();
//stores all the segments of the pool
std::vector< PoolArray<Obj> > pools;
ReadWriteLock poolLock;
//stores pools that are empty
std::queue< int > freePools;
std::mutex freeLock;
int currentPool;
ReadWriteLock currentLock;
};
void SegmentedPool::dealloc(PoolPtr& ptr)
{
//find and access the segment
poolLock.lockForRead();
PoolArray* temp = &(pools[ptr.getSeg()]);
poolLock.unlockForRead();
//reduce the count of references in the segment
temp->locker.lock();
--(temp->refs);
//if the number of references is now zero then set the segment back to unused
//and push it onto the queue of empty segments so that it can be reused
if(temp->refs==0)
{
temp->used=0;
freeLock.lock();
freePools.push(ptr.getSeg());
freeLock.unlock();
}
temp->locker.unlock();
ptr.set(NULL,-1);
}
A few explanations:
First PoolPtr is a stupid little pointer like object that stores the pointer and the segment number in the pool that the pointer came from.
Second this is all "templatized" but i took those lines out to try to reduce the length of the code block
Third ReadWriteLock is something I put together using a mutex and a pair of condition variables.

Locks are inefficient no matter how fine grained they are, so avoid at all cost.
Both queue and vector can be easyly implemented lock free using compare-swap primitive.
there are a number of papers on the topic
Lock free queue:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.8674&rep=rep1&type=pdf
http://www.par.univie.ac.at/project/peppher/publications/Published/opodis10lfq.pdf
Lock free vector:
http://www.stroustrup.com/lock-free-vector.pdf
Straustrup's paper also refers to lock-free allocator, but don't jump at it right away, standard allocators are pretty good these days.
UPD
If you don't want to bother writing your own containers, use Intel's Threading Building Blocks library, it provide both thread safe vector and queue. They are NOT lock free, but they are optimized to use CPU cache efficiently.
UPD
Regarding PoolArray, you don't need a lock there either. If you can use c++11, use std::atomic for atomic increments and swaps, otherwise use compiler built-ins (InterLocked* functions in MSVC and _sync* in gcc http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html)

A good start - you lock things when needed and free them as soon as you're finished.
Your ReadWriteLock is pretty much a CCriticalSection object - depending on your needs it might improve performance to use that instead.
One thing I would say is that call your temp->locker.lock(); function before you release the lock on the pool poolLock.unlockForRead();, otherwise you're performing operations on the pool object when it's not under synchronisation control - it could be being used by another thread at that point. A minor point, but with multi-threading it's the minor points that trip you up in the end.
A good approach to take to multi-threading is to wrap any controlled resources in objects or functions that do the locking and unlocking inside them, so anyone who wants access to the data doesn't have to worry about which lock to lock or unlock, and when to do it. for example:
...
if(temp->refs==0)
{
temp->used=0;
freeLock.lock();
freePools.push(ptr.getSeg());
freeLock.unlock();
}
...
would be...
...
if(temp->refs==0)
{
temp->used=0;
addFreePool(ptr.getSeg());
}
...
void SegmentedPool::addFreePool(unsigned int seg)
{
freeLock.lock();
freePools.push(seg);
freeLock.unlock();
}
There are plenty of multi-threading benchmarking tools out there too. You can play around with controlling your resources in different ways, run it through one of the tools, and see where any bottlenecks are if you feel like performance is becoming an issue.

Is it practically safe to write static data from multiple threads

I have some status data that I want to cache from a database. Any of several threads may modify the status data. After the data is modified it will be written to the database. The database writes will always be done in series by the underlying database access layer which queues database operations in a different process so I cam not concerned about race conditions for those.
Is it a problem to just modify the static data from several threads? In theory it is possible that modifications are implemented as read, modify, write but in practice I can't imagine that this is so.
My data handling class will look something like this:
class StatusCache
{
public:
static void SetActivityStarted(bool activityStarted)
{ m_activityStarted = activityStarted; WriteToDB(); }
static void SetActivityComplete(bool activityComplete);
{ m_activityComplete = activityComplete; WriteToDB(); }
static void SetProcessReady(bool processReady);
{ m_processReady = processReady; WriteToDB(); }
static void SetProcessPending(bool processPending);
{ m_processPending = processPending; WriteToDB(); }
private:
static void WriteToDB(); // will write all the class data to the db (multiple requests will happen in series)
static bool m_activityStarted;
static bool m_activityComplete;
static bool m_processReady;
static bool m_processPending;
};
I don't want to use locks as there are already a couple of locks in this part of the app and adding more will increase the possibility of deadlocks.
It doesn't matter if there is some overlap between 2 threads in the database update, e.g.
thread 1 thread 2 activity started in db
SetActivityStarted(true) SetActivityStarted(false)
m_activityStated = true
m_activityStarted = false
WriteToDB() false
WriteToDB() false
So the db shows the status that was most recently set by the m_... = x lines. This is OK.
Is this a reasonable approach to use or is there a better way of doing it?
[Edited to state that I only care about the last status - order is unimportant]

No, it's not safe.
The code generated that does the writing to m_activityStarted and the others may be atomic, but that is not garantueed. Also, in your setters you do two things: set a boolean and make a call. That is definately not atomic.
You're better off synchronizing here using a lock of some sort.
For example, one thread may call the first function, and before that thread goes into "WriteDB()" another thread may call another function and go into WriteDB() without the first going there. Then, perhaps the status is written in the DB in the wrong order.
If you're worried about deadlocks then you should revise your whole concurrency strategy.

On multi CPU machines, there's no guarantee that memory writes will be seen by threads running on different CPUs in the correct order without issuing a synchronisation instruction. It's only when you issue a synch order, e.g. a mutex lock or unlock, that the each thread's view of the data is guaranteed to be consistent.
To be safe, if you want the state shared between your threads, you need to use synchronisation of some form.

You never know exactly how things are implemented at the lower levels. Especially when you start dealing with multiple cores, the various cache levels, pipelined execution, etc. At least not without a lot of work, and implementations change frequently!
If you don't mutex it, eventually you will regret it!
My favorite example involves integers. This one particular system wrote its integer values in two writes. E.g. not atomic. Naturally, when the thread was interrupted between those two writes, well, you got the upper bytes from one set() call, and the lower bytes() from the other. A classic blunder. But far from the worst that can happen.
Mutexing is trivial.
You mention: I don't want to use locks as there are already a couple of locks in this part of the app and adding more will increase the possibility of deadlocks.
You'll be fine as long as you follow the golden rules:
Don't mix mutex lock orders. E.g. A.lock();B.lock() in one place and B.lock();A.lock(); in another. Use one order or the other!
Lock for the briefest possible time.
Don't try to use one mutex for multiple purposes. Use multiple mutexes.
Whenever possible use recursive or error-checking mutexes.
Use RAII or macros to insure unlocking.
E.g.:
#define RUN_UNDER_MUTEX_LOCK( MUTEX, STATEMENTS ) \
do { (MUTEX).lock(); STATEMENTS; (MUTEX).unlock(); } while ( false )
class StatusCache
{
public:
static void SetActivityStarted(bool activityStarted)
{ RUN_UNDER_MUTEX_LOCK( mMutex, mActivityStarted = activityStarted );
WriteToDB(); }
static void SetActivityComplete(bool activityComplete);
{ RUN_UNDER_MUTEX_LOCK( mMutex, mActivityComplete = activityComplete );
WriteToDB(); }
static void SetProcessReady(bool processReady);
{ RUN_UNDER_MUTEX_LOCK( mMutex, mProcessReady = processReady );
WriteToDB(); }
static void SetProcessPending(bool processPending);
{ RUN_UNDER_MUTEX_LOCK( mMutex, mProcessPending = processPending );
WriteToDB(); }
private:
static void WriteToDB(); // read data under mMutex.lock()!
static Mutex mMutex;
static bool mActivityStarted;
static bool mActivityComplete;
static bool mProcessReady;
static bool mProcessPending;
};

Im no c++ guy but i dont think it will be safe to write to it if you dont have some sort of synchronization..

It looks like you have two issues here.
#1 is that your boolean assignment is not necessarily atomic, even though it's one call in your code. So, under the hood, you could have inconsistent state. You could look into using atomic_set(), if your threading/concurrency library supports that.
#2 is synchronization between your reading and writing. From your code sample, it looks like your WriteToDB() function writes out the state of all 4 variables. Where is WriteToDB serialized? Could you have a situation where thread1 starts WriteToDB(), which reads m_activityStarted but doesn't finish writing it to the database, then is preempted by thread2, which writes m_activityStarted all the way through. Then, thread1 resumes, and finishes writing its inconsistent state through to the database. At the very least, I think that you should have write access to the static variables locked out while you are doing the read access necessary for the database update.

In theory it is possible that modifications are implemented as read, modify, write but in practice I can't imagine that this is so.
Generally it is so unless you've set up some sort of transactional memory. Variables are generally stored in RAM but modified in hardware registers, so the read isn't just for kicks. The read is necessary to copy the value out of RAM and into a place it can be modified (or even compared to another value). And while the data is being modified in the hardware register, the stale value is still in RAM in case somebody else wants to copy it into another hardware register. And while the modified data is being written back to RAM somebody else may be in the process of copying it into a hardware register.
And in C++ ints are guaranteed to take at least a byte of space. Which means it is actually possible for them to have a value other than true or false, say due to race condition where the read happens partway through a write.
On .Net there is some amount of automatic synchronization of static data and static methods. There is no such guarantee in standard C++.
If you're looking at only ints, bools, and (I think) longs, you have some options for atomic reads/writes and addition/subtraction. C++0x has something. So does Intel TBB. I believe that most operating systems also have the needed hooks to accomplish this.

While you may be afraid of deadlocks, I am sure you will be extremely proud of your code to know it works perfectly.
So I would recommend you throw in the locks, you may also want to consider semaphores, a more primitive(and perhaps more versatile) type of lock.

You may get away with it with bools, but if the static objects being changed are of types of any great complexity, terrible things will occur. My advice - if you are going to write from multiple threads, always use synchronisation objects, or you will sooner or later get bitten.

This is not a good idea. There are many variables that will affect the timing of different threads.
Without some kind of lock you will not be guaranteed to have the correct last state.
It is possible that two status updates could be written to the database out of order.
As long as the locking code is designed properly dead locks should not be an issue with a simple process like this.

As others have pointed out, this is generally a really bad idea (with some caveats).
Just because you don't see a problem on your particular machine when you happen to test it doesn't prove that the algorithm works right. This is especially true for concurrent applications. Interleavings can change dramatically for example when you switch to a machine with a different number of cores.
Caveat: if all your setters are doing atomic writes and if you don't care about the timing of them, then you may be okay.
Based on what you've said, I'd think that you could just have a dirty flag that's set in the setters. A separate database writing thread would poll the dirty flag every so often and send the updates to the database. If some items need extra atomicity, their setters would need to lock a mutex. The database writing thread must always lock the mutex.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js