Why do mutexes sometimes fail to fix a multi-threading access problem? - c++

I'm hoping someone can fill a gap in my knowledge of multithreaded programming here.
I have used mutexes on many occasions to successfully share access to data-structures between threads. For example, I'll often feed a queue on the main thread that is consumed by a worker thread. A mutex wraps access to the queue as tightly as possible when the main thread enqueues something, or when the worker thread dequeues something. It works great!
Okay, now consider a different situation I found myself in recently. Multiple threads are rendering triangles to the same framebuffer/z-buffer pair. (This may be a bad idea on its face, but please overlook that for the moment.) In other words, the work-load of rendering a bunch of triangles is evenly distributed across all these threads that are all writing pixels to the same framebuffer and all checking Z-values against, and updating Z-values to, the same Z-buffer.
Now, I knew this would be problematic from the get-go, but I wanted to see what would happen. Sure enough, when drawing two quads, (one behind the other), some of the pixels from the background quad would occasionally bleed through the foreground quad, unless, of course, I only had one worker thread. So, to fix this problem, I decided to use a mutex. I knew this would be extremely slow, but I wanted to do it anyway just to demonstrate that I had a handle on what the problem really was. The fix was simple: just wrap access to the Z-buffer with a mutex. But to my great surprise, this didn't fix the problem at all! So the question is: why?!
I have a hypothesis, but it is a disturbing one. My guess is that at least one of two things is happening. First, a thread may write to the Z-buffer, but that write operation isn't necessarily flushed from CPU-memory back to Z-buffer memory when another thread goes to read it. Second, a thread may read from the Z-buffer, but do so in prefetched amounts that assume no other thread is writing to it. In either case, even if the mutex is doing its job correctly, there are still going to be cases where we're either reading the wrong Z-value or failing to write a Z-value.
What may support this hypothesis is that I unnecessarily widened the mutex lock time, and this not only made my rendering slower, it also appeared to fix the Z-buffer issue previously described. Why? My guess is that the extra lock time made it more likely that the Z-buffer writes were flushed.
Anyhow, this is disturbing to me, because I don't know why this isn't a problem I've run into before with, for example, the simple queues I've been using to communicate between threads for years. Why wasn't the CPU lazy about flushing its cache with my linked-list pointers?
So I looked around for ways to add a memory fence, flush a write operation, or make sure that a read operation always pulled from memory (e.g., by using the "volatile" keyword), but none of it was trivial or seemed to help.
What am I not understanding here? Do I really just have no idea what's going on? Thanks for any light you can shed on this.

The fix was simple: just wrap access to the Z-buffer with a mutex.
This is not enough -- the mutex needs to cover both the access to the Z buffer and the update of the framebuffer, making the whole operation (check & update Z buffer, conditionally update framebuffer) atomic. Otherwise there is a danger that the Z buffer and framebuffer updates will "cross" and happen in the reverse order. Something like:
thread 1: check/update z buffer (hit -- pixel is closer than previous)
thread 2: check/update z buffer (hit -- pixel is closer than previous)
thread 2: update framebuffer
thread 1: update framebuffer
and you end up with thread 1's color in the framebuffer even though thread 2 is closer in Z
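A minimal sketch of the wider critical section, assuming a simple software render target. The names `RenderTarget`, `writePixel`, `fillQuad` and `frontQuadWins` are hypothetical and not from the question; a single coarse mutex is used only to show the principle, not as a performance recommendation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical render target: names and layout are illustrative only.
struct RenderTarget {
    std::size_t width, height;
    std::vector<float> depth;          // smaller value = closer to the camera
    std::vector<std::uint32_t> color;
    std::mutex lock;                   // one coarse lock guarding BOTH buffers

    RenderTarget(std::size_t w, std::size_t h)
        : width(w), height(h),
          depth(w * h, std::numeric_limits<float>::infinity()),
          color(w * h, 0) {}

    // Depth test, depth write and color write form one critical section,
    // so two threads can no longer "cross" between the two updates.
    void writePixel(std::size_t x, std::size_t y, float z, std::uint32_t rgba) {
        assert(x < width && y < height);
        const std::size_t i = y * width + x;
        std::lock_guard<std::mutex> guard(lock);
        if (z < depth[i]) {
            depth[i] = z;
            color[i] = rgba;
        }
    }
};

// Fill the whole target with one color at one depth (a full-screen "quad").
void fillQuad(RenderTarget& rt, float z, std::uint32_t rgba) {
    for (std::size_t y = 0; y < rt.height; ++y)
        for (std::size_t x = 0; x < rt.width; ++x)
            rt.writePixel(x, y, z, rgba);
}

// Draw a back quad and a front quad from two threads. Returns true if every
// pixel ends up with the front quad's color, whatever the interleaving was.
bool frontQuadWins() {
    RenderTarget rt(64, 64);
    std::thread back([&] { fillQuad(rt, 0.9f, 0xFF0000FFu); });
    std::thread front([&] { fillQuad(rt, 0.1f, 0x00FF00FFu); });
    back.join();
    front.join();
    for (std::uint32_t c : rt.color)
        if (c != 0x00FF00FFu) return false;
    return true;
}
```

If the lock covered only the depth test and depth write, the color write could still land in either order, which is the interleaving listed above.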

Related

What is the most efficient way to coordinate between threads about which threads are free?

I'm working on a program that sends messages between threads. It looks at which threads are busy; if one is free, it grabs the first free one (or in some cases multiple free ones), marks it as taken, sends work to it and does its own work, then once finished waits for it to complete. The bottleneck in all of this is coordinating between threads about which thread is taken. This seems like a problem others must have encountered and have solutions to share, but I also want to know if you can do better than me.
My solution ultimately boils down to:
Maintain a set representing indexes of free threads, and be able to grab an item from the set getting the index of a free thread or add it back to the set increasing the size by one. Order unimportant. I know the fixed size of the set in advance.
I've tried a few ways of doing this:
Maintain a single unsigned long long int and use '__builtin_clz' to count bits in a single instruction and grab the lowest one, then use a lookup table of bitmasks to flip bits on and off, simultaneously claiming the thread number. (Interestingly, __builtin_ffsll was 10x slower; I suspect it is not supported as a single instruction on my processor.) I loved this version because I only needed to share a single atomic unsigned long long and could use a single atomic operation, but doing 'fetch_and' in a loop until it succeeds ended up slower than locking and doing the update non-atomically. The version using locking was faster, probably because threads didn't get stuck in loops repeating the same operation while waiting for others to finish theirs.
Use a linked list, allocate all nodes in advance, maintain a head node and a list, if pointing to nullptr, then we've reached the end of the list. Have only done this with a lock because it needs two simultaneous operations.
Maintain an array that represents all indexes of threads to claim. Either increment an array index and return previous pointer to claim a thread, or swap the last taken thread with the one being freed and decrement the pointer. Check if free.
Use the moodycamel queue which maintains a lock free queue.
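For reference, the first approach in the list (a single atomic bitmask) might be sketched roughly as follows. This is an illustration rather than the asker's code: the class name `FreeSet` is hypothetical, it uses a compare-exchange loop instead of 'fetch_and', and it finds the bit index with a portable loop rather than a compiler builtin. It makes no claim about being faster than the locked versions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical free-set of up to 64 worker indexes; bit i set = worker i is free.
class FreeSet {
public:
    explicit FreeSet(unsigned count) : bits_(maskFor(count)) {}

    // Claim the lowest free index, or return -1 if none is free.
    int claim() {
        std::uint64_t cur = bits_.load(std::memory_order_relaxed);
        while (cur != 0) {
            const std::uint64_t lowest = cur & (~cur + 1);  // isolate lowest set bit
            // On failure compare_exchange_weak reloads cur, so the loop
            // retries against the freshly observed state.
            if (bits_.compare_exchange_weak(cur, cur & ~lowest,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return indexOf(lowest);
        }
        return -1;
    }

    // Return an index to the set. One fetch_or is enough; no retry loop.
    void release(int index) {
        bits_.fetch_or(std::uint64_t{1} << index, std::memory_order_release);
    }

private:
    static std::uint64_t maskFor(unsigned count) {
        assert(count >= 1 && count <= 64);
        return count == 64 ? ~std::uint64_t{0}
                           : ((std::uint64_t{1} << count) - 1);
    }
    // Portable bit-index lookup; a builtin could replace this where available.
    static int indexOf(std::uint64_t bit) {
        int i = 0;
        while ((bit >>= 1) != 0) ++i;
        return i;
    }
    std::atomic<std::uint64_t> bits_;
};
```

Releasing needs no loop because setting a bit does not depend on the other bits; only claiming has to retry under contention.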
Happy to share C++ code, the answer was getting to be quite long though when I tried to include it.
All three are fast. __builtin_clzll is not universally supported, so even though it is a little faster, it is probably not worth it, and it is probably 10x slower on computers that don't natively support it, similar to how __builtin_ffsll was slow. Array and linked list are roughly as fast as each other; the array seems slightly faster when there is no contention. Moody is 3x slower.
Think you can do better and have a faster way to do this? Still the slowest part of this process, still just barely being worth the cost in some cases.
Thoughts for directions to explore:
Feels like there should be a way using a couple of atomics, maybe an array of atomics, one at a time, have to maintain the integrity of the set with every operation though, which makes this tricky. Most solutions at some point need two operations to be done simultaneously, atomics seem like they could provide a significantly faster solution than locking in my benchmarking.
Might be able to use lock but remove the need to check if the list is empty or swap elements in array
Maybe use a different data structure, for example, two arrays, add to one while emptying the other, then switch which one is being filled and which is emptied. This means no need to swap elements but rather just swap two pointers to arrays and only when one is empty.
Could have threads launching threads add work to a list of work to be done, then another thread can grab it while this thread keeps going. Ultimately still need a similar thread safe set.
See if the brilliant people on stackoverflow see directions to explore that I haven't seen yet :)
All you need is a thread pool, a queue (a list, deque or a ring buffer), a mutex and a condition_variable to signal when a new work item has been added to the queue.
Wrap work items in packaged_task if you need to wait on the result of each task.
When adding a new work item to the queue, 1) lock the mutex, 2) add the item, 3) release the lock and 4) call notify_one on the condition_variable, which will unblock one of the waiting threads.
Once the basic setup is working, if tasks are too fine-grained, work stealing can be added to the solution to improve performance. It's also important to use the main thread to perform part of the work instead of just waiting for all tasks to complete. These simple (albeit clunky) optimizations often result in >15% overall improvement in performance due to reduced context switching.
Also don't forget to think about false sharing. It might be a good idea to pad task items to 64 bytes, just to be on the safe side.
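A minimal sketch of the basic setup described in this answer (queue, mutex, condition_variable, packaged_task), without work stealing or padding. The class name `ThreadPool` and the helper `sumOfSquares` are hypothetical names chosen for illustration.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> g(m_);
            done_ = true;
        }
        cv_.notify_all();                       // wake everyone so they can exit
        for (auto& t : workers_) t.join();
    }

    // Wrap the callable in a packaged_task so the caller can wait on its result.
    template <class F>
    auto submit(F f) -> std::future<decltype(f())> {
        auto task = std::make_shared<std::packaged_task<decltype(f())()>>(std::move(f));
        auto result = task->get_future();
        {
            std::lock_guard<std::mutex> g(m_);  // 1) lock
            queue_.emplace_back([task] { (*task)(); });  // 2) add the item
        }                                       // 3) release the lock
        cv_.notify_one();                       // 4) wake one waiting worker
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (queue_.empty()) return;     // shutting down, nothing left
                job = std::move(queue_.front());
                queue_.pop_front();
            }
            job();                              // run outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool done_ = false;
    std::vector<std::thread> workers_;
};

// Usage: square the numbers 1..n on the pool and add up the results.
int sumOfSquares(int n) {
    ThreadPool pool(4);
    std::vector<std::future<int>> results;
    for (int i = 1; i <= n; ++i)
        results.push_back(pool.submit([i] { return i * i; }));
    int total = 0;
    for (auto& r : results) total += r.get();
    return total;
}
```

With this design no thread ever needs to know which workers are free: an idle worker is simply one that is blocked in `wait`.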

Compute shader image and atomic coherency [duplicate]

I previously ran into the problem that I wanted to blend color values in an image unit by doing something like:
vec4 texelCol = imageLoad(myImage, myTexel);
imageStore(myImage, myTexel, texelCol+newCol);
In a scenario where multiple fragments can have the same value for 'myTexel', this apparently isn't possible, because one can't create atomicity between the imageLoad and imageStore commands, and other shader invocations could change the texel color in between.
Now someone told me that people are working around this problem by creating semaphores using the atomic commands on uint textures, such that the shader would somehow wait in a while loop before accessing the texel and, as soon as it is free, atomically write into the integer texture to block other fragment shader invocations, process the color texel and, when finished, atomically free the integer texel again.
But I can't get my head around how this could really work and what such code would look like.
Is it really possible to do this? Can a GLSL fragment shader be set to wait in a while loop? If it's possible, can someone give an example?
Basically, you're just implementing a spinlock. Only instead of one lock variable, you have an entire texture's worth of locks.
Logically, what you're doing makes sense. But as far as OpenGL is concerned, this won't actually work.
See, the OpenGL shader execution model states that invocations execute in an order which is largely undefined relative to one another. But spinlocks only work if there is a guarantee of forward progress among the various threads. Basically, spinlocks require that the thread which is spinning not be able to starve the execution system from executing the thread that it is waiting on.
OpenGL provides no such guarantee. Which means that it is entirely possible for one thread to lock a pixel, then stop executing (for whatever reason), while another thread comes along and blocks on that pixel. The blocked thread never stops executing, and the thread that owns the lock never restarts execution.
How might this happen in a real system? Well, let's say you have a fragment shader invocation group executing on some fragments from a triangle. They all lock their pixels. But then they diverge in execution due to a conditional branch within the locking region. Divergence of execution can mean that some of those invocations get transferred to a different execution unit. If there are none available at the moment, then they effectively pause until one becomes available.
Now, let's say that some other fragment shader invocation group comes along and gets assigned an execution unit before the divergent group. If that group tries to spinlock on pixels from the divergent group, it is essentially starving the divergent group of execution time, waiting on an event that will never happen.
Now obviously, in real GPUs there is more than one execution unit, but you can imagine that with lots of invocation groups out there, it is entirely possible for such a scenario to occasionally jam up the works.

Implementing progress visualization in C++

Let's say I have a computationally intensive algorithm running.
For example, let's say it's a routing algorithm that contains heavily CPU-intensive code, and in a window running on a separate thread, I want to show the user which routes are currently being analyzed and such.
The important thing is that I don't want to slow down the worker thread just for the sake of displaying progress; it needs to run at full-speed as much as possible. It is perfectly OK if the user sees stale data, such as an in-between that didn't actually occur (say, two active routes at once), because this progress visualization is for informational purposes only, and nothing else.
From a theoretical standpoint, I think that according to the C++ standard, my best option is to use std::atomic with std::memory_order_relaxed on both threads. But that would slow down the code on the worker thread noticeably.
From a practical standpoint, though, I'm just tempted to ignore std::atomic altogether, and just have the worker thread work with all the variables normally. Who cares if the GUI thread reads stale data? I don't, and presumably neither will the user. In reality it won't matter because there is only one worker thread, and only that thread needs to observe valid writes, which in practice is the only thing that'll happen.
What I'm wondering about is:
What is the best way to solve this kind of problem, both in theory and in practice?
Do people just ignore the standard and go for raw primitives, or do they bite the bullet and take the performance hit of using std::atomic?
Or are there other facilities I'm not aware of for solving this problem?
Skipping the proper fences for std::atomic wouldn't buy you much, but you would risk losing the communication between threads completely, mostly on the compiler side. On x86, for example, the problem does not exist on the hardware side at all, because each store to memory (if you can ensure your compiler emits it as expected) has store-with-release semantics anyway.
Also, I doubt that sharing the progress more often than 30-100 FPS (or Hz) brings any value. On the other hand, it can certainly put an unnecessary burden on system resources (if repeated in a tight loop) and break compiler optimizations, e.g. vectorization.
So, if the overhead for worker thread is the concern, share the info with less frequency. E.g. update the atomic counter once in 1024 iterations:
// worker thread
if( i%1024 == 0 ) // update the progress info
    my_atomic_progress.store( i, std::memory_order_release ); // regular `mov` on x86
// GUI thread
auto i = my_atomic_progress.load( std::memory_order_consume );
This example also shows the minimal fences necessary to establish the communication, otherwise the compiler is free to optimize the memory operations out of a loop for example.
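A self-contained variant of the same idea, as a sketch under a few assumptions: the names `runWorker` and `progressIsMonotonic` are hypothetical, the "GUI thread" is simulated by a polling loop, and the load uses `memory_order_acquire` rather than `memory_order_consume` (a substitution made here, since compilers commonly treat consume as acquire anyway).

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Worker: publishes its loop index once every 1024 iterations.
void runWorker(std::atomic<std::uint64_t>& progress, std::uint64_t iterations) {
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // ... the real work for iteration i would go here ...
        if (i % 1024 == 0)
            progress.store(i, std::memory_order_release);
    }
    progress.store(iterations, std::memory_order_release);  // final value
}

// Stand-in for the GUI thread: polls the counter until the worker is done.
// Returns true if the observed values never went backwards.
bool progressIsMonotonic(std::uint64_t iterations) {
    std::atomic<std::uint64_t> progress{0};
    std::thread worker(runWorker, std::ref(progress), iterations);

    std::uint64_t last = 0;
    bool ok = true;
    while (last < iterations) {
        const std::uint64_t now = progress.load(std::memory_order_acquire);
        if (now < last) ok = false;   // would indicate a broken counter
        last = now;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    worker.join();
    return ok;
}
```

The observer may skip many intermediate values, which is fine for a display that is purely informational.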
There is no best way - it depends on how much data you need to send to the display. If it's just a single long integer value, and the display is completely non-guaranteed, then I'd just write the value and have done with it. Occasionally the reader will read a corrupted value, but it won't matter so I won't care.
Otherwise, I'd be tempted to send the value to a queue and use an event or condition variable to trigger the read afterwards (as often you do not want the reader running full tilt, and you need some way to inform it there is new data to read)
I'm not sure the overhead for std::atomic is that great - isn't it going to be implemented in the OS primitives anyway? If so, the primitives (on Windows, x86 at least via the InterlockedExchange function) end up as a single CPU instruction after the compiler and optimiser have done their thing.

Thread safe programming

I keep hearing about thread safe. What is that exactly and how and where can I learn to program thread safe code?
Also, assume I have 2 threads, one that writes to a structure and another one that reads from it. Is that dangerous in any way? Is there anything I should look for? I don't think it is a problem. Both threads will not (well, can't) be accessing the struct at the exact same time.
Also, can someone please tell me how in this example : https://stackoverflow.com/a/5125493/1248779 we are doing a better job in concurrency issues. I don't get it.
It's a very deep topic. At the heart threads are usually about making things go fast by using multiple cores at the same time; or about doing long operations in the background when you don't have a good way to interleave the operation with a 'primary' thread. The latter being very common in UI programming.
Your scenario is one of the classic trouble spots, and one of the first people run into. It's very rare to have a struct where the members are truly independent. It's very common to want to modify multiple values in the structure to maintain consistency. Without any precautions it is very possible to modify the first value, then have the other thread read the struct and operate on it before the second value has been written.
Simple example would be a 'point' struct for 2d graphics. You'd like to move the point from [2,2] to [5,6]. If you had a different thread drawing a line to that point you could end up drawing to [5,2] very easily.
This is the tip of the iceberg really. There are lots of great books, but learning this space usually goes something like this:
Uh oh, I just read from that thing in an inconsistent state.
Uh oh, I just modified that thing from 2 threads and now it's garbage.
Yay! I learned about locks
Whoa, I have a lot of locks and everything seems to just hang sometimes when I have lots of them locking in nested code.
Hrm. I need to stop doing this locking on the fly, I seem to be missing a lot of places; so I should encapsulate them in a data structure.
That data structure thing was great, but now I seem to be locking all the time and my code is just as slow as a single thread.
condition variables are weird
It's fast because I got clever with how I lock things. Hrm. Sometimes data corrupts.
Whoa.... InterlockedWhatDidYouSay?
Hey, look no lock, I do this thing called a spin lock.
Condition variables. Hrm... I see.
You know what, how about I just start thinking about how to operate on this stuff in completely independent ways, pipelining my operations, and having as few cross-thread dependencies as possible...
Obviously it's not all about condition variables. But there are many problems that can be solved with threading, and probably almost as many ways to do it, and even more ways to do it wrong.
Thread-safety is one aspect of a larger set of issues under the general heading of "Concurrent Programming". I'd suggest reading around that subject.
Your assumption that two threads cannot access the struct at the same time is not good. First: today we have multi-core machines, so two threads can be running at exactly the same time. Second: even on a single-core machine the slices of time given to any other thread are unpredictable. You have to anticipate that at any arbitrary time the "other" thread might be processing. See my "window of opportunity" example below.
The concept of thread-safety is exactly to answer the question "is this dangerous in any way". The key question is whether it's possible for code running in one thread to get an inconsistent view of some data, that inconsistency happening because while it was running another thread was in the middle of changing data.
In your example, one thread is reading a structure and at the same time another is writing. Suppose that there are two related fields:
{ foreground: red; background: black }
and the writer is in the process of changing those
foreground = black;
<=== window of opportunity
background = red;
If the reader reads the values at just that window of opportunity then it sees a "nonsense" combination
{ foreground: black; background: black }
The essence of this pattern is that for a brief time, while we are making a change, the system becomes inconsistent and readers should not use the values. As soon as we finish our changes it becomes safe to read again.
Hence we use the CriticalSection APIs mentioned by Stefan to prevent a thread seeing an inconsistent state.
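A portable sketch of the same protection, assuming `std::mutex` in place of the Win32 CriticalSection APIs. The names `Theme`, `invert`, `isReadable` and `readerNeverSeesNonsense` are hypothetical and exist only for this illustration.

```cpp
#include <mutex>
#include <thread>

enum class Color { Red, Black };

class Theme {
public:
    // Both fields change inside one critical section.
    void invert() {
        std::lock_guard<std::mutex> g(m_);
        const Color oldForeground = foreground_;
        foreground_ = background_;
        // <=== the window of opportunity still exists here, but no reader
        //      can look at the fields while the lock is held
        background_ = oldForeground;
    }

    // Reads both fields under the same lock; true if the pair is sensible.
    bool isReadable() const {
        std::lock_guard<std::mutex> g(m_);
        return foreground_ != background_;
    }

private:
    mutable std::mutex m_;
    Color foreground_ = Color::Red;
    Color background_ = Color::Black;
};

// One thread keeps inverting the theme while another keeps reading it.
// Returns true if the reader never saw foreground == background.
bool readerNeverSeesNonsense(int rounds) {
    Theme theme;
    std::thread writer([&] {
        for (int i = 0; i < rounds; ++i) theme.invert();
    });
    bool ok = true;
    for (int i = 0; i < rounds; ++i)
        if (!theme.isReadable()) ok = false;
    writer.join();
    return ok;
}
```

The reader takes the same lock as the writer; locking only on the writing side would not close the window.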
what is that exactly?
Briefly, a program that may be executed in a concurrent context without errors related to concurrency.
If ThreadA and ThreadB read and/or write data without errors and use proper synchronization, then the program may be threadsafe. It's a design choice -- making an object threadsafe can be accomplished a number of ways, and more complex types may be threadsafe using combinations of these techniques.
and how and where can I learn to program thread safe code?
boost/libs/thread/ would likely be a good introduction. The topic is quite complex.
The C++11 standard library provides implementations for locks, atomics and threads -- any well written programs which use these would be a good read. The standard library was modeled after boost's implementation.
also, assume I have 2 threads one that writes to a structure and another one that reads from it. Is that dangerous in any way? is there anything I should look for?
Yes, it can be dangerous and/or may produce incorrect results. Just imagine that a thread may run out of its time at any point, and then another thread could then read or modify that structure -- if you have not protected it, it may be in the middle of an update. A common solution is a lock, which can be used to prevent another thread from accessing shared resources during reads/writes.
When writing multithreaded C++ programs on WIN32 platforms, you need to protect certain shared objects so that only one thread can access them at any given time from different threads. You can use 5 system functions to achieve this. They are InitializeCriticalSection, EnterCriticalSection, TryEnterCriticalSection, LeaveCriticalSection, and DeleteCriticalSection.
Also, maybe these links can help:
how to make an application thread safe?
http://www.codeproject.com/Articles/1779/Making-your-C-code-thread-safe
Thread safety is a simple concept: is it "safe" to perform operation A on one thread whilst another thread is performing operation B, which may or may not be the same as operation A. This can be extended to cover many threads. In this context, "safe" means:
No undefined behaviour
All invariants of the data structures are guaranteed to be observed by the threads
The actual operations A and B are important. If two threads both read a plain int variable, then this is fine. However, if any thread may write to that variable, and there is no synchronization to ensure that the read and write cannot happen together, then you have a data race, which is undefined behaviour, and this is not thread safe.
This applies equally to the scenario you asked about: unless you have taken special precautions, then it is not safe to have one thread read from a structure at the same time as another thread writes to it. If you can guarantee that the threads cannot access the data structure at the same time, through some form of synchronization such as a mutex, critical section, semaphore or event, then there is not a problem.
You can use things like mutexes and critical sections to prevent concurrent access to some data, so that the writing thread is the only thread accessing the data when it is writing, and the reading thread is the only thread accessing the data when it is reading, thus providing the guarantee I just mentioned. This therefore avoids the undefined behaviour mentioned above.
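As a small illustration of the plain-int case, here is a sketch in which two threads modify the same int, with every access going through one `std::mutex`. The function name `countWithMutex` is hypothetical.

```cpp
#include <mutex>
#include <thread>

// Two threads increment one shared int. Every access goes through the same
// mutex, so there is no data race and no increment is lost.
int countWithMutex(int perThread) {
    int counter = 0;
    std::mutex m;

    auto bump = [&] {
        for (int i = 0; i < perThread; ++i) {
            std::lock_guard<std::mutex> g(m);
            ++counter;
        }
    };

    std::thread a(bump);
    std::thread b(bump);
    a.join();
    b.join();
    return counter;   // safe to read: both threads have been joined
}
```

Without the lock, the two `++counter` operations would race, which is undefined behaviour; for a single integer like this, `std::atomic<int>` would be another option.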
However, you still need to ensure that your code is safe in the wider context: if you need to modify more than one variable then you need to hold the lock on the mutex across the whole operation rather than for each individual access, otherwise you may find that the invariants of your data structure may not be observed by other threads.
It is also possible that a data structure may be thread safe for some operations but not others. For example, a single-producer single-consumer queue will be OK if one thread is pushing items on the queue and another is popping items off the queue, but will break if two threads are pushing items, or two threads are popping items.
In the example you reference, the point is that global variables are implicitly shared between all threads, and therefore all accesses must be protected by some form of synchronization (such as a mutex) if any thread can modify them. On the other hand, if you have a separate copy of the data for each thread, then that thread can modify its copy without worrying about concurrent access from any other thread, and no synchronization is required. Of course, you always need synchronization if two or more threads are going to operate on the same data.
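A sketch of the "separate copy per thread" idea, assuming a simple parallel sum as the workload. The function name `parallelSum` is hypothetical; each thread accumulates into its own local variable and writes only its own slot of the result vector, so no lock is needed until the results are combined after `join`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

long long parallelSum(const std::vector<int>& v, unsigned threads) {
    assert(threads >= 1);
    std::vector<long long> partial(threads, 0);   // one slot per thread
    std::vector<std::thread> pool;
    const std::size_t chunk = (v.size() + threads - 1) / threads;

    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t begin = std::min(v.size(), t * chunk);
            const std::size_t end = std::min(v.size(), begin + chunk);
            long long local = 0;                  // private to this thread
            for (std::size_t i = begin; i < end; ++i) local += v[i];
            partial[t] = local;                   // only this thread writes slot t
        });
    }
    for (auto& th : pool) th.join();              // join before reading partial

    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The input vector is shared but only read, which is safe; the only writes go to per-thread data.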
My book, C++ Concurrency in Action covers what it means for things to be thread safe, how to design thread safe data structures, and the C++ synchronization primitives used for the purpose, such as std::mutex.
Thread safety is when a certain block of code is protected from being accessed by more than one thread at a time, meaning that the data it manipulates always stays in a consistent state.
A common example is the producer consumer problem where one thread reads from a data structure while another thread writes to the same data structure : Detailed explanation
To answer the second part of the question: Imagine two threads both accessing std::vector<int> data:
//first thread
if (data.size() > 0)
{
    std::cout << data[0]; //fails if data.size() == 0
}
//second thread
if (rand() % 5 == 0)
{
    data.clear();
}
else
{
    data.push_back(1);
}
Run these threads in parallel and your program can crash, because std::cout << data[0]; might be executed directly after data.clear();.
You need to know that at any point of your thread code, the thread might be interrupted, e.g. after checking that (data.size() > 0), and another thread could become active. Although the first thread looks correct in a single threaded app, it's not in a multi-threaded program.
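One way to repair the example, sketched under some assumptions: the shared vector and its mutex are bundled in a hypothetical `SharedData` struct, the helper names `readFirst`, `mutate` and `stressTest` are invented here, and the loop counter stands in for `rand()` so the sketch does not depend on `rand()` being safe to call from several threads.

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct SharedData {
    std::mutex m;
    std::vector<int> values;
};

// First thread: the emptiness check and the element access share one lock,
// so clear() can no longer run between them.
bool readFirst(SharedData& s, int& out) {
    std::lock_guard<std::mutex> g(s.m);
    if (s.values.empty()) return false;
    out = s.values[0];
    return true;
}

// Second thread: same mutation as in the example, under the same lock.
void mutate(SharedData& s, int roll) {
    std::lock_guard<std::mutex> g(s.m);
    if (roll % 5 == 0)
        s.values.clear();
    else
        s.values.push_back(1);
}

// Run reader and writer concurrently; true if every value read was valid.
bool stressTest(int rounds) {
    SharedData s;
    std::thread writer([&] {
        for (int i = 0; i < rounds; ++i) mutate(s, i);
    });
    bool ok = true;
    for (int i = 0; i < rounds; ++i) {
        int v = 0;
        if (readFirst(s, v) && v != 1) ok = false;
    }
    writer.join();
    return ok;
}
```

The important detail is that the check and the use happen under a single lock acquisition; locking each of them separately would leave the same gap as before.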