Is this use of volatile in TBB a bug? - c++

Recently Eric Niebler had a tweet about volatile and thread safety, and somebody replied with a link to the following code from Intel TBB.
void Block::shareOrphaned(intptr_t binTag, unsigned index)
{
MALLOC_ASSERT( binTag, ASSERT_TEXT );
// unreferenced formal parameter warning
tbb::detail::suppress_unused_warning(index);
STAT_increment(getThreadId(), index, freeBlockPublic);
markOrphaned();
if ((intptr_t)nextPrivatizable==binTag) {
// First check passed: the block is not in mailbox yet.
// Need to set publicFreeList to non-zero, so other threads
// will not change nextPrivatizable and it can be zeroed.
if ( !readyToShare() ) {
// another thread freed an object; we need to wait until it finishes.
// There is no need for exponential backoff, as the wait here is not for a lock;
// but need to yield, so the thread we wait has a chance to run.
// TODO: add a pause to also be friendly to hyperthreads
int count = 256;
while( (intptr_t)const_cast<Block* volatile &>(nextPrivatizable)==binTag ) {
if (--count==0) {
do_yield();
count = 256;
}
}
}
}
MALLOC_ASSERT( publicFreeList.load(std::memory_order_relaxed) !=NULL, ASSERT_TEXT );
// now it is safe to change our data
previous = NULL;
// it is caller responsibility to ensure that the list of blocks
// formed by nextPrivatizable pointers is kept consistent if required.
// if only called from thread shutdown code, it does not matter.
(intptr_t&)(nextPrivatizable) = UNUSABLE;
}
as an example of wrong use of volatile (since it guarantees nothing wrt threading).
Is this really a bug?
My first intuition is yes, but then again TBB is not some anonymous person's GitHub project, so I am curious whether I am missing something.
github link

Related

Synchronize Threads - InterlockedExchange

I would like to check whether a thread is doing work. If it is, I will wait on an event until the thread has finished; the thread sets the event at the end of its work.
To check whether the thread is working, I declared a volatile bool variable. It is true while the thread is running and is set to false at the end of the thread's work.
Is a volatile bool variable adequate, or do I have to use an atomic function?
BTW: Can someone please explain the InterlockedExchange function to me? I don't understand the use case in which I would need it.
Update
I see that without my code it is hard to say whether a volatile bool variable is adequate. I wrote a test class which shows my problem.
class Testclass
{
public:
Testclass(void);
~Testclass(void);
void doThreadedWork();
void Work();
void StartWork();
void WaitUntilFinish();
private:
HANDLE hHasWork;
HANDLE hAbort;
HANDLE hFinished;
volatile bool m_bWorking;
};
//.cpp
#include "stdafx.h"
#include "Testclass.h"
CRITICAL_SECTION cs;
DWORD WINAPI myThread(LPVOID lpParameter)
{
Testclass* pTestclass = (Testclass*) lpParameter;
pTestclass->doThreadedWork();
return 0;
}
Testclass::Testclass(void)
{
InitializeCriticalSection(&cs);
DWORD myThreadID;
HANDLE myHandle = CreateThread(0, 0, myThread, this, 0, &myThreadID);
m_bWorking = false;
hHasWork = CreateEvent(NULL,TRUE,FALSE,NULL);
hAbort = CreateEvent(NULL,TRUE,FALSE,NULL);
hFinished = CreateEvent(NULL,FALSE,FALSE,NULL);
}
Testclass::~Testclass(void)
{
DeleteCriticalSection(&cs);
CloseHandle(hHasWork);
CloseHandle(hAbort);
CloseHandle(hFinished);
}
void Testclass::Work()
{
// do some work
m_bWorking = false;
SetEvent(hFinished);
}
void Testclass::StartWork()
{
EnterCriticalSection(&cs);
m_bWorking = true;
ResetEvent(hFinished);
SetEvent(hHasWork);
LeaveCriticalSection(&cs);
}
void Testclass::doThreadedWork()
{
HANDLE hEvents[2];
hEvents[0] = hHasWork;
hEvents[1] = hAbort;
while(true)
{
DWORD dwEvent = WaitForMultipleObjects(2, hEvents, FALSE, INFINITE);
if(WAIT_OBJECT_0 == dwEvent)
{
Work();
}
else
{
break;
}
}
}
void Testclass::WaitUntilFinish()
{
EnterCriticalSection(&cs);
if(!m_bWorking)
{
// if the thread is not working, do not wait and return
LeaveCriticalSection(&cs);
return;
}
WaitForSingleObject(hFinished,INFINITE);
LeaveCriticalSection(&cs);
}
For me it is not really clear whether m_bWorking needs to be accessed in an atomic way or whether volatile is adequate.
There is a lot of background to cover for your question. We don't know, for example, what toolchain you are using, so I am going to answer it as a WinAPI question. I further assume you have something like this in mind:
volatile bool flag = false;
DWORD WINAPI WorkFn(void*) {
flag = true;
// work here
....
// done.
flag = false;
return 0;
}
int main() {
HANDLE th = CreateThread(...., &WorkFn, NULL, ..);
// wait for start of work.
while (!flag) {
// ?? # 1
}
// Seems thread is busy now. Time to wait for it to finish.
while (flag) {
// ?? # 2
}
}
There are many things wrong here. For starters, the volatile does very little here. When flag = true happens it will eventually be visible to the other thread because it is backed by a global variable. This is so because it will at least make it into the cache, and the cache has ways to tell other processors that a given line (which is a range of addresses) is dirty. The only way it would not make it into the cache is if the compiler made a super crazy optimization in which flag stays in a CPU register. That could actually happen, but not in this particular code example.
So volatile tells the compiler to never keep the variable as a register. That is what it is, every time you see a volatile variable you can translate it as "never enregister this variable". Its use here is just basically a paranoid move.
If this code is what you had in mind then this looping over a flag pattern is called a Spinlock and this one is a really poor one. It is almost never the right thing to do in a user mode program.
Before we go into better approaches let me tackle your Interlocked question. What people usually mean is this pattern
volatile long flag = 0;
DWORD WINAPI WorkFn(void*) {
InterlockedExchange(&flag, 1);
....
}
int main() {
...
while (InterlockedCompareExchange(&flag, 1, 1) == 0L) {
YieldProcessor();
}
...
}
Assume the ... means similar code as before. What the InterlockedExchange() is doing is forcing the write to memory to happen in a deterministic, "broadcast the change now", kind of way and the typical way to read it in the same "bypass the cache" way is via InterlockedCompareExchange().
One problem with them is that they generate more traffic on the system bus. That is, the bus now being used to broadcast cache synchronization packets among the cpus on the system.
std::atomic<bool> flag would be the modern, C++11 way to do the same, but still not what you really want to do.
I added the YieldProcessor() call there to point to the real problem. When you wait for a memory address to change you are using cpu resources that would be better used somewhere else, for example in the actual work (!!). If you actually yield the processor there is at least a chance that the OS will give it to the WorkFn, but in a multicore machine it will quickly go back to polling the variable. In a modern machine you will be checking this flag millions of times per second, with the yield, probably 200000 times per second. Terrible waste either way.
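For reference, the std::atomic<bool> version mentioned above might look roughly like the sketch below. It assumes C++11 and uses std::thread instead of CreateThread; the function name run_worker_and_wait and the value 42 are made up for illustration. It still polls, so the waste described here applies to it just as much.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: starts a worker, spins until it reports completion,
// and returns the value the worker produced.
int run_worker_and_wait() {
    std::atomic<bool> done{false};
    int result = 0;  // ordinary, non-atomic data written by the worker

    std::thread worker([&] {
        result = 42;                                  // the "work"
        done.store(true, std::memory_order_release);  // publish the result
    });

    // Polling loop: correct, but it burns CPU time while waiting.
    while (!done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }

    worker.join();
    assert(done.load(std::memory_order_relaxed));
    return result;  // the acquire load above makes the worker's write visible
}
```

Unlike volatile, the release store paired with the acquire load gives an ordering guarantee: once the waiting thread sees done == true, it also sees the write to result.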
What you want to do here is to leverage Windows to do a zero-cost wait, or at least one that is as low-cost as possible:
DWORD WINAPI WorkFn(void*) {
// work here
....
return 0;
}
int main() {
HANDLE th = CreateThread(...., &WorkFn, NULL, ..);
WaitForSingleObject(th, INFINITE);
// work is done!
CloseHandle(th);
}
When you return from the worker thread, the thread handle gets signaled and the wait is satisfied. While blocked in WaitForSingleObject you don't consume any CPU cycles. If you want to do a periodic activity in the main() function while you wait, you can replace INFINITE with 1000, which will release the main thread every second. In that case you need to check the return value of WaitForSingleObject to tell a timeout (WAIT_TIMEOUT) apart from the thread having finished (WAIT_OBJECT_0).
If you need to actually know when work started, you need an additional waitable object, for example, a Windows event which is obtained via CreateEvent() and can be waited on using the same WaitForSingleObject.
Update [1/23/2016]
Now that we can see the code you have in mind, you don't need atomics, volatile works just fine. The m_bWorking is protected by the cs mutex anyhow for the true case.
If I might suggest, you can use TryEnterCriticalSection and cs to accomplish the same without m_bWorking at all:
void Testclass::Work()
{
EnterCriticalSection(&cs);
// do some work
LeaveCriticalSection(&cs);
SetEvent(hFinished); // could be removed as well
}
void Testclass::StartWork()
{
ResetEvent(hFinished); // could be removed.
SetEvent(hHasWork);
}
void Testclass::WaitUntilFinish()
{
if (TryEnterCriticalSection(&cs)) {
// Not busy now.
LeaveCriticalSection(&cs);
return;
} else {
// busy doing work. If we use EnterCriticalSection(&cs)
// here we can even eliminate hFinished from the code.
}
...
}
For some reason, the Interlocked API does not include an "InterlockedGet" or "InterlockedSet" function. This is a strange omission and the typical work around is to cast through volatile.
You can use code like the following on Windows:
#include <intrin.h>
__inline int InterlockedIncrement(int *j)
{ // This is VS-specific
return _InterlockedIncrement((volatile LONG *) j);
}
__inline int InterlockedDecrement(int *j)
{ // This is VS-specific
return _InterlockedDecrement((volatile LONG *) j);
}
__inline static void InterlockedSet(int *val, int newval)
{
*((volatile int *)val) = newval;
}
__inline static int InterlockedGet(int *val)
{
return *((volatile int *)val);
}
Yes, it's ugly. But it's the best way to work around the deficiency if you're not using C++11. If you're using C++11, use std::atomic instead.
Note that this is Windows-specific code and should not be used on other platforms.
No, volatile bool will not be enough. You need an atomic bool, as you correctly suspect. Otherwise, you might never see your bool updated.
There is also no InterlockedExchange in C++ (the tags of your question), but there are compare_exchange_weak and compare_exchange_strong functions in C++11. Those are used to set the value of an object to a certain NewValue, provided its current value is TestValue, and indicate the status of this attempt (whether the change was made or not). The benefit of those functions is that this is done in such a fashion that you are guaranteed that if two threads are trying to perform this operation, only one will succeed. This is very helpful when you need to take certain actions depending on the result of the operation.
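To make the "only one will succeed" behaviour concrete, here is a small single-threaded sketch that issues two competing attempts one after the other. The names claim_twice and ClaimResult, and the convention that 0 means "unclaimed", are assumptions made for this example only.

```cpp
#include <atomic>
#include <cassert>

struct ClaimResult {
    bool first;        // did the first attempt change the value?
    bool second;       // did the second attempt change the value?
    int seenBySecond;  // value the second attempt actually found
};

ClaimResult claim_twice() {
    std::atomic<int> owner{0};  // 0 = unclaimed (assumption of this sketch)

    int expected = 0;
    // Succeeds: owner is 0, so it becomes 1.
    bool first = owner.compare_exchange_strong(expected, 1);

    expected = 0;
    // Fails: owner is already 1. On failure, 'expected' is overwritten
    // with the value that was found instead.
    bool second = owner.compare_exchange_strong(expected, 2);

    assert(owner.load() == 1);  // the failed attempt changed nothing
    return {first, second, expected};
}
```

With real threads the two attempts would race, but the outcome has the same shape: exactly one caller gets true back, and the loser learns the current value through its expected argument.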

Mutex Safety with Interrupts (Embedded Firmware)

Edit: @Mike pointed out that my try_lock function in the code below is unsafe and that accessor creation can produce a race condition as well. The suggestions (from everyone) have convinced me that I'm going down the wrong path.
Original Question
The requirements for locking on an embedded microcontroller are different enough from multithreading that I haven't been able to convert multithreading examples to my embedded applications. Typically I don't have an OS or threads of any kind, just main and whatever interrupt functions are called by the hardware periodically.
It's pretty common that I need to fill up a buffer from an interrupt, but process it in main. I've created the IrqMutex class below to try to safely implement this. Each person trying to access the buffer is assigned a unique id through IrqMutexAccessor, then they each can try_lock() and unlock(). The idea of a blocking lock() function doesn't work from interrupts because unless you allow the interrupt to complete, no other code can execute so the unlock() code never runs. I do however use a blocking lock from the main() code occasionally.
However, I know that the double-check lock doesn't work without C++11 memory barriers (which aren't available on many embedded platforms). Honestly despite reading quite a bit about it, I don't really understand how/why the memory access reordering can cause a problem. I think that the use of volatile sig_atomic_t (possibly combined with the use of unique IDs) makes this different from the double-check lock. But I'm hoping someone can: confirm that the following code is correct, explain why it isn't safe, or offer a better way to accomplish this.
class IrqMutex {
friend class IrqMutexAccessor;
private:
std::sig_atomic_t accessorIdEnum;
volatile std::sig_atomic_t owner;
protected:
std::sig_atomic_t nextAccessor(void) { return ++accessorIdEnum; }
bool have_lock(std::sig_atomic_t accessorId) {
return (owner == accessorId);
}
bool try_lock(std::sig_atomic_t accessorId) {
// Only try to get a lock, while it isn't already owned.
while (owner == SIG_ATOMIC_MIN) {
// <-- If an interrupt occurs here, both attempts can get a lock at the same time.
// Try to take ownership of this Mutex.
owner = accessorId; // SET
// Double check that we are the owner.
if (owner == accessorId) return true;
// Someone else must have taken ownership between CHECK and SET.
// If they released it after CHECK, we'll loop back and try again.
// Otherwise someone else has a lock and we have failed.
}
// This shouldn't happen unless they called try_lock on something they already owned.
if (owner == accessorId) return true;
// If someone else owns it, we failed.
return false;
}
bool unlock(std::sig_atomic_t accessorId) {
// Double check that the owner called this function (not strictly required)
if (owner == accessorId) {
owner = SIG_ATOMIC_MIN;
return true;
}
// We still return true if the mutex was unlocked anyway.
return (owner == SIG_ATOMIC_MIN);
}
public:
IrqMutex(void) : accessorIdEnum(SIG_ATOMIC_MIN), owner(SIG_ATOMIC_MIN) {}
};
// This class is used to manage our unique accessorId.
class IrqMutexAccessor {
friend class IrqMutex;
private:
IrqMutex& mutex;
const std::sig_atomic_t accessorId;
public:
IrqMutexAccessor(IrqMutex& m) : mutex(m), accessorId(m.nextAccessor()) {}
bool have_lock(void) { return mutex.have_lock(accessorId); }
bool try_lock(void) { return mutex.try_lock(accessorId); }
bool unlock(void) { return mutex.unlock(accessorId); }
};
Because there is one processor, and no threading the mutex serves what I think is a subtly different purpose than normal. There are two main use cases I run into repeatedly.
The interrupt is a Producer and takes ownership of a free buffer and loads it with a packet of data. The interrupt/Producer may keep its ownership lock for a long time spanning multiple interrupt calls. The main function is the Consumer and takes ownership of a full buffer when it is ready to process it. The race condition rarely happens, but if the interrupt/Producer finishes with a packet and needs a new buffer, but they are all full it will try to take the oldest buffer (this is a dropped packet event). If the main/Consumer started to read and process that oldest buffer at exactly the same time they would trample all over each other.
The interrupt is just a quick change or increment of something (like a counter). However, if we want to reset the counter or jump to some new value with a call from the main() code, we don't want to try to write to the counter as it is changing. Here main actually does a blocking loop to obtain a lock; however, I think it's almost impossible to have to actually wait here for more than two attempts. Once it has a lock, any calls to the counter interrupt will be skipped, but that's generally not a big deal for something like a counter. Then I update the counter value and unlock it so it can start incrementing again.
I realize these two samples are dumbed down a bit, but some version of these patterns occurs in many of the peripherals in every project I work on, and I'd like one piece of reusable code that can safely handle this across various embedded platforms. I included the C tag, because all of this is directly convertible to C code, and on some embedded compilers that's all that is available. So I'm trying to find a general method that is guaranteed to work in both C and C++.
struct ExampleCounter {
volatile long long int value;
IrqMutex mutex;
} exampleCounter;
struct ExampleBuffer {
volatile char data[256];
volatile size_t index;
IrqMutex mutex; // One mutex per buffer.
} exampleBuffers[2];
const volatile char * const REGISTER;
// This accessor shouldn't be created in an interrupt or a race condition can occur.
static IrqMutexAccessor myMutex(exampleCounter.mutex);
void __irqQuickFunction(void) {
// Obtain a lock, add the data then unlock all within one function call.
if (myMutex.try_lock()) {
exampleCounter.value++;
myMutex.unlock();
} else {
// If we failed to obtain a lock, we skipped this update this one time.
}
}
// These accessors shouldn't be created in an interrupt or a race condition can occur.
static IrqMutexAccessor myMutexes[2] = {
IrqMutexAccessor(exampleBuffers[0].mutex),
IrqMutexAccessor(exampleBuffers[1].mutex)
};
void __irqLongFunction(void) {
static size_t bufferIndex = 0;
// Check if we have a lock.
if (!myMutexes[bufferIndex].have_lock() and !myMutexes[bufferIndex].try_lock()) {
// If we can't get a lock try the other buffer
bufferIndex = (bufferIndex + 1) % 2;
// One buffer should always be available so the next line should always be successful.
if (!myMutexes[bufferIndex].try_lock()) return;
}
// ... at this point we know we have a lock ...
// Get data from the hardware and modify the buffer here.
const char c = *REGISTER;
exampleBuffers[bufferIndex].data[exampleBuffers[bufferIndex].index++] = c;
// We may keep the lock for multiple function calls until the end of packet.
static const char END_PACKET_SIGNAL = '\0';
if (c == END_PACKET_SIGNAL) {
// Unlock this buffer so it can be read from main.
myMutexes[bufferIndex].unlock();
// Switch to the other buffer for next time.
bufferIndex = (bufferIndex + 1) % 2;
}
}
int main(void) {
while (true) {
// Mutex for counter
static IrqMutexAccessor myCounterMutex(exampleCounter.mutex);
// Change counter value
if (EVERY_ONCE_IN_A_WHILE) {
// Skip any updates that occur while we are updating the counter.
while(!myCounterMutex.try_lock()) {
// Wait for the interrupt to release its lock.
}
// Set the counter to a new value.
exampleCounter.value = 500;
// Updates will start again as soon as we unlock it.
myCounterMutex.unlock();
}
// Mutexes for __irqLongFunction.
static IrqMutexAccessor myBufferMutexes[2] = {
IrqMutexAccessor(exampleBuffers[0].mutex),
IrqMutexAccessor(exampleBuffers[1].mutex)
};
// Process buffers from __irqLongFunction.
for (size_t i = 0; i < 2; i++) {
// Obtain a lock so we can read the data.
if (!myBufferMutexes[i].try_lock()) continue;
// Check that the buffer isn't empty.
if (exampleBuffers[i].index == 0) {
myBufferMutexes[i].unlock(); // Don't forget to unlock.
continue;
}
// ... read and do something with the data here ...
exampleBuffers[i].index = 0;
myBufferMutexes[i].unlock();
}
}
}
Also note that I used volatile on any variable that is read-by or written-by the interrupt routine (unless the variable was only accessed from the interrupt like the static bufferIndex value in __irqLongFunction). I've read that mutexes remove some of need for volatile in multithreaded code, but I don't think that applies here. Did I use the right amount of volatile? I used it on: ExampleBuffer[].data[256], ExampleBuffer[].index, and ExampleCounter.value.
I apologize for the long answer, but perhaps it is fitting for a long question.
To answer your first question, I would say that your implementation of IrqMutex is not safe. Let me try to explain where I see problems.
Function nextAccessor
std::sig_atomic_t nextAccessor(void) { return ++accessorIdEnum; }
This function has a race condition, because the increment operator is not atomic, despite it being on an atomic value marked volatile. It involves 3 operations: reading the current value of accessorIdEnum, incrementing it, and writing the result back. If two IrqMutexAccessors are created at the same time, it's possible that they both get the same ID.
Function try_lock
The try_lock function also has a race condition. One thread (eg main), could go into the while loop, and then before taking ownership, another thread (eg an interrupt) can also go into the while loop and take ownership of the lock (returning true). Then the first thread can continue, moving onto owner = accessorId, and thus "also" take the lock. So two threads (or your main thread and an interrupt) can try_lock on an unowned mutex at the same time and both return true.
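The interleaving described above can be reproduced deterministically in a small sketch. This is a simplified, hypothetical copy of the question's try_lock (RacyMutex, the interruptHook parameter, and the use of 0 as the "unowned" value are all invented for the demonstration); the hook stands in for an interrupt firing between the check and the set.

```cpp
#include <cassert>
#include <csignal>
#include <functional>

// Simplified stand-in for IrqMutex; 0 means "unowned" in this sketch.
struct RacyMutex {
    volatile std::sig_atomic_t owner = 0;

    bool try_lock(std::sig_atomic_t id, const std::function<void()>& interruptHook) {
        while (owner == 0) {               // CHECK
            interruptHook();               // simulated interrupt lands here
            owner = id;                    // SET
            if (owner == id) return true;  // the double check passes regardless
        }
        return owner == id;
    }
};

struct RaceResult { bool mainGotLock; bool irqGotLock; };

RaceResult demonstrate_race() {
    RacyMutex m;
    bool irqGotLock = false;
    bool fired = false;

    // "main" (id 1) is interrupted once, and the "interrupt" (id 2) locks too.
    bool mainGotLock = m.try_lock(1, [&] {
        if (!fired) {
            fired = true;
            irqGotLock = m.try_lock(2, [] {});
        }
    });

    assert(m.owner == 1);  // main silently overwrote the interrupt's ownership
    return {mainGotLock, irqGotLock};
}
```

Both calls report success, which is exactly the failure mode: the double check only confirms that the caller's own write landed, not that nobody else believed they held the lock in between.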
Disabling interrupts by RAII
We can achieve some level of simplicity and encapsulation by using RAII for interrupt disabling, for example the following class:
class InterruptLock {
public:
InterruptLock() {
prevInterruptState = currentInterruptState();
disableInterrupts();
}
~InterruptLock() {
restoreInterrupts(prevInterruptState);
}
private:
int prevInterruptState; // Whatever type this should be for the platform
InterruptLock(const InterruptLock&); // Not copy-constructable
};
And I would recommend disabling interrupts to get the atomicity you need within the mutex implementation itself. For example something like:
bool try_lock(std::sig_atomic_t accessorId) {
InterruptLock lock;
if (owner == SIG_ATOMIC_MIN) {
owner = accessorId;
return true;
}
return false;
}
bool unlock(std::sig_atomic_t accessorId) {
InterruptLock lock;
if (owner == accessorId) {
owner = SIG_ATOMIC_MIN;
return true;
}
return false;
}
Depending on your platform, this might look different, but you get the idea.
As you said, this provides a platform to abstract away from the disabling and enabling interrupts in general code, and encapsulates it to this one class.
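One reason to save and restore the previous state, rather than unconditionally re-enabling interrupts, is that it makes nested locks behave. The sketch below shows this with mock versions of the platform functions; g_interruptsEnabled and the three free functions are hypothetical stand-ins for whatever intrinsics or registers your target provides.

```cpp
#include <cassert>

// Mock platform layer (assumption): a global flag instead of a CPU register.
static bool g_interruptsEnabled = true;
int currentInterruptState() { return g_interruptsEnabled ? 1 : 0; }
void disableInterrupts() { g_interruptsEnabled = false; }
void restoreInterrupts(int prev) {
    assert(prev == 0 || prev == 1);
    g_interruptsEnabled = (prev != 0);
}

class InterruptLock {
public:
    InterruptLock() : prev_(currentInterruptState()) { disableInterrupts(); }
    ~InterruptLock() { restoreInterrupts(prev_); }
    InterruptLock(const InterruptLock&) = delete;
    InterruptLock& operator=(const InterruptLock&) = delete;
private:
    int prev_;  // state to put back when this lock goes out of scope
};

// Returns true if interrupts stay disabled until the outermost lock is gone.
bool nested_locks_restore_correctly() {
    bool ok = g_interruptsEnabled;        // enabled before any lock
    {
        InterruptLock outer;
        ok = ok && !g_interruptsEnabled;
        {
            InterruptLock inner;
            ok = ok && !g_interruptsEnabled;
        }
        // The inner lock restored "disabled", not "enabled".
        ok = ok && !g_interruptsEnabled;
    }
    return ok && g_interruptsEnabled;     // enabled again at the very end
}
```

This means try_lock and unlock can be called safely from code that already holds an InterruptLock, or from inside an interrupt handler where interrupts may already be masked.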
Mutexes and Interrupts
Having said how I would consider implementing the mutex class, I would not actually use a mutex class for your use-cases. As you pointed out, mutexes don't really play well with interrupts, because an interrupt can't "block" on trying to acquire a mutex. For this reason, for code that directly exchanges data with an interrupt, I would instead strongly consider just directly disabling interrupts (for a very short time while the main "thread" touches the data).
So your counter might simply look like this:
volatile long long int exampleCounter;
void __irqQuickFunction(void) {
exampleCounter++;
}
...
// Change counter value
if (EVERY_ONCE_IN_A_WHILE) {
InterruptLock lock;
exampleCounter = 500;
}
In my mind, this is easier to read, easier to reason about, and won't "slip" when there's contention (ie miss timer beats).
Regarding the buffer use-case, I would strongly recommend against holding a lock for multiple interrupt cycles. A lock/mutex should be held for just the slightest moment required to "touch" a piece of memory - just long enough to read or write it. Get in, get out.
So this is how the buffering example might look:
struct ExampleBuffer {
char data[256];
} exampleBuffers[2];
ExampleBuffer* volatile bufferAwaitingConsumption = nullptr;
ExampleBuffer* volatile freeBuffer = &exampleBuffers[1];
const volatile char * const REGISTER;
void __irqLongFunction(void) {
static const char END_PACKET_SIGNAL = '\0';
static size_t index = 0;
static ExampleBuffer* receiveBuffer = &exampleBuffers[0];
// Get data from the hardware and modify the buffer here.
const char c = *REGISTER;
receiveBuffer->data[index++] = c;
// End of packet?
if (c == END_PACKET_SIGNAL) {
// Make the packet available to the consumer
bufferAwaitingConsumption = receiveBuffer;
// Move on to the next buffer
receiveBuffer = freeBuffer;
freeBuffer = nullptr;
index = 0;
}
}
int main(void) {
while (true) {
// Fetch packet from shared variable
ExampleBuffer* packet;
{
InterruptLock lock;
packet = bufferAwaitingConsumption;
bufferAwaitingConsumption = nullptr;
}
if (packet) {
// ... read and do something with the data here ...
// Once we're done with the buffer, we need to release it back to the producer
{
InterruptLock lock;
freeBuffer = packet;
}
}
}
}
This code is arguably easier to reason about, since there are only two memory locations shared between the interrupt and the main loop: one to pass packets from the interrupt to the main loop, and one to pass empty buffers back to the interrupt. We also only touch those variables under "lock", and only for the minimum time needed to "move" the value. (for simplicity I've skipped over the buffer overflow logic when the main loop takes too long to free the buffer).
It's true that in this case one may not even need the locks, since we're just reading and writing simple value, but the cost of disabling the interrupts is not much, and the risk of making mistakes otherwise, is not worth it in my opinion.
Edit
As pointed out in the comments, the above solution was meant to only tackle the multithreading problem, and omitted overflow checking. Here is more complete solution which should be robust under overflow conditions:
const size_t BUFFER_COUNT = 2;
struct ExampleBuffer {
char data[256];
ExampleBuffer* next;
} exampleBuffers[BUFFER_COUNT];
volatile size_t overflowCount = 0;
class BufferList {
public:
BufferList() : first(nullptr), last(nullptr) { }
// Atomic enqueue
void enqueue(ExampleBuffer* buffer) {
InterruptLock lock;
buffer->next = nullptr; // the new node becomes the tail
if (last)
last->next = buffer;
else
first = buffer;
last = buffer; // always advance the tail
}
// Atomic dequeue (or returns null)
ExampleBuffer* dequeueOrNull() {
InterruptLock lock;
ExampleBuffer* result = first;
if (first) {
first = first->next;
if (!first)
last = nullptr;
}
return result;
}
private:
ExampleBuffer* first;
ExampleBuffer* last;
} freeBuffers, buffersAwaitingConsumption;
const volatile char * const REGISTER;
void __irqLongFunction(void) {
static const char END_PACKET_SIGNAL = '\0';
static size_t index = 0;
static ExampleBuffer* receiveBuffer = &exampleBuffers[0];
// Recovery from overflow?
if (!receiveBuffer) {
// Try get another free buffer
receiveBuffer = freeBuffers.dequeueOrNull();
// Still no buffer?
if (!receiveBuffer) {
overflowCount++;
return;
}
}
// Get data from the hardware and modify the buffer here.
const char c = *REGISTER;
if (index < sizeof(receiveBuffer->data))
receiveBuffer->data[index++] = c;
// End of packet, or out of space?
if (c == END_PACKET_SIGNAL || index >= sizeof(receiveBuffer->data)) {
// Make the packet available to the consumer
buffersAwaitingConsumption.enqueue(receiveBuffer);
// Move on to the next free buffer
receiveBuffer = freeBuffers.dequeueOrNull();
index = 0;
}
}
size_t getAndResetOverflowCount() {
InterruptLock lock;
size_t result = overflowCount;
overflowCount = 0;
return result;
}
int main(void) {
// All buffers are free at the start, except exampleBuffers[0],
// which the interrupt already holds as its initial receiveBuffer
for (size_t i = 1; i < BUFFER_COUNT; i++)
freeBuffers.enqueue(&exampleBuffers[i]);
while (true) {
// Fetch packet from shared variable
ExampleBuffer* packet = buffersAwaitingConsumption.dequeueOrNull();
if (packet) {
// ... read and do something with the data here ...
// Once we're done with the buffer, we need to release it back to the producer
freeBuffers.enqueue(packet);
}
size_t overflowBytes = getAndResetOverflowCount();
if (overflowBytes) {
// ...
}
}
}
The key changes:
If the interrupt runs out of free buffers, it will recover
If the interrupt receives data while it doesn't have a receive buffer, it will communicate that to the main thread via getAndResetOverflowCount
If you keep getting buffer overflows, you can simply increase the buffer count
I've encapsulated the multithreaded access into a queue class implemented as a linked list (BufferList), which supports atomic dequeue and enqueue. The previous example also used queues, but of length 0-1 (either an item is enqueued or it isn't), and so the implementation of the queue was just a single variable. In the case of running out of free buffers, the receive queue could have 2 items, so I upgraded it to a proper queue rather than adding more shared variables.
If the interrupt is the producer and mainline code is the consumer, surely it's as simple as disabling the interrupt for the duration of the consume operation?
That's how I used to do it in my embedded micro controller days.

Safe multi-thread counter increment

For example, I've got some work that is computed simultaneously by multiple threads.
For demonstration purposes the work is performed inside a while loop. In a single iteration each thread performs its own portion of the work, before the next iteration begins a counter should be incremented once.
My problem is that the counter is updated by each thread.
As this seems like a relatively simple thing to want to do, I presume there is a 'best practice' or common way to go about it?
Here is some sample code to illustrate the issue and help the discussion along.
(I'm using Boost threads)
class someTask {
public:
int mCounter; //initialized to 0
int mTotal; //initialized to i.e. 100000
boost::mutex cntmutex;
int getCount()
{
boost::mutex::scoped_lock lock( cntmutex );
return mCounter;
}
void process( int thread_id, int numThreads )
{
while ( getCount() < mTotal )
{
// The main task is performed here and is divided
// into sub-tasks based on the thread_id and numThreads
// Wait for all thread to get to this point
cntmutex.lock();
mCounter++; // < ---- how to ensure this is only updated once?
cntmutex.unlock();
}
}
};
The main problem I see here is that you reason at too low a level. Therefore, I am going to present an alternative solution based on the new C++11 thread API.
The main idea is that you essentially have a schedule -> dispatch -> do -> collect -> loop routine. In your example you try to reason about all this within the do phase which is quite hard. Your pattern can be much more easily expressed using the opposite approach.
First we isolate the work to be done in its own routine:
void process_thread(size_t id, size_t numThreads) {
// do something
}
Now, we can easily invoke this routine:
#include <future>
#include <thread>
#include <vector>
void process(size_t const total, size_t const numThreads) {
for (size_t count = 0; count != total; ++count) {
std::vector< std::future<void> > results;
// Create all threads, launch the work!
for (size_t id = 0; id != numThreads; ++id) {
results.push_back(std::async(process_thread, id, numThreads));
}
// The destruction of `std::future`
// requires waiting for the task to complete (*)
}
}
(*) See this question.
You can read more about std::async here, and a short introduction is offered here (they appear to be somewhat contradictory on the effect of the launch policy, oh well). It is simpler here to let the implementation decide whether or not to create OS threads: it can adapt depending on the number of available cores.
Note how the code is simplified by removing shared state. Because the threads share nothing, we no longer have to worry about synchronization explicitly!
You protected the counter with a mutex, ensuring that no two threads can access the counter at the same time. Your other option would be using Boost::atomic, c++11 atomic operations or platform-specific atomic operations.
However, your code seems to access mCounter without holding the mutex:
while ( mCounter < mTotal )
That's a problem. You need to hold the mutex to access the shared state.
You may prefer to use this idiom:
1. Acquire lock.
2. Do tests and other things to decide whether we need to do work or not.
3. Adjust accounting to reflect the work we've decided to do.
4. Release lock. Do work. Acquire lock.
5. Adjust accounting to reflect the work we've done.
6. Loop back to step 2 unless we're totally done.
7. Release lock.
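As a rough sketch of that idiom, assuming C++11's std::mutex and std::thread rather than Boost (run_tasks and the claimed/completed counters are names invented for this example):

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Runs 'total' work items on 'numThreads' threads; returns how many finished.
int run_tasks(int total, int numThreads) {
    std::mutex m;
    int claimed = 0;    // items handed out so far
    int completed = 0;  // items finished so far

    auto worker = [&] {
        std::unique_lock<std::mutex> lock(m);  // acquire lock
        while (claimed < total) {              // decide whether there is work
            ++claimed;                         // account for the work we took
            lock.unlock();                     // release lock
            // ... do the actual work for this item here, without the lock ...
            lock.lock();                       // acquire lock again
            ++completed;                       // account for the work we did
        }                                      // loop back to the test
    };                                         // unique_lock releases on exit

    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();

    assert(claimed == total);  // every item was handed out exactly once
    return completed;
}
```

The shared counters are only ever read or written while the lock is held, and the lock is never held while the work itself runs, so threads do not serialize on the expensive part.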
You need to use a message-passing solution. This is more easily enabled by libraries like TBB or PPL. PPL is included for free in Visual Studio 2010 and above, and TBB can be downloaded for free under a FOSS licence from Intel.
concurrent_queue<unsigned int> done;
std::vector<Work> work;
// fill work here
parallel_for(0, work.size(), [&](unsigned int i) {
processWorkItem(work[i]);
done.push(i);
});
It's lockless and you can have an external thread monitor the done variable to see how much, and what, has been completed.
I would like to disagree with David on doing multiple lock acquisitions to do the work.
Mutexes are expensive, and with more threads contending for a mutex it basically falls back to a system call, which results in a user-space to kernel-space context switch along with the calling thread(s) being forced to sleep: thus a lot of overhead.
So if you are using a multiprocessor system, I would strongly recommend using spin locks instead [1].
So what I would do is:
=> Get rid of the scoped lock acquisition to check the condition.
=> Make your counter volatile to support above
=> In the while loop do the condition check again after acquiring the lock.
class someTask {
public:
volatile int mCounter; //initialized to 0 : Make your counter Volatile
int mTotal; //initialized to i.e. 100000
boost::mutex cntmutex;
void process( int thread_id, int numThreads )
{
while ( mCounter < mTotal ) //compare without acquiring lock
{
// The main task is performed here and is divided
// into sub-tasks based on the thread_id and numThreads
cntmutex.lock();
//Now compare again to make sure that the condition still holds.
//This saves all those lock acquisitions and releases we did just to
//check whether the condition was true.
if(mCounter < mTotal)
{
mCounter++;
}
cntmutex.unlock();
}
}
};
[1]http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock
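One caveat worth illustrating: in standard C++, volatile gives neither atomicity nor inter-thread ordering, so the unlocked read of mCounter above is formally a data race. A minimal sketch of the same "check, then claim" loop using the C++11 atomics mentioned earlier might look like this; AtomicTask, claimOne and runAtomicTask are hypothetical names, not part of the original code.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class AtomicTask {
public:
    explicit AtomicTask(int total) : mCounter(0), mTotal(total) {}

    // Tries to claim one unit of work; returns false once all are claimed.
    bool claimOne() {
        int seen = mCounter.load(std::memory_order_relaxed);
        while (seen < mTotal) {
            // On failure compare_exchange_weak reloads `seen`, so we retry
            // with the current value and never step past mTotal.
            if (mCounter.compare_exchange_weak(seen, seen + 1))
                return true;
        }
        return false;
    }

    void process() {
        while (claimOne()) {
            // The main task for the claimed unit would go here.
        }
    }

    int counter() const { return mCounter.load(); }

private:
    std::atomic<int> mCounter;
    const int mTotal;
};

// Runs numThreads threads against one task and returns the final counter.
int runAtomicTask(int total, int numThreads) {
    AtomicTask task(total);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back([&task] { task.process(); });
    for (auto& t : threads) t.join();
    return task.counter();
}
```

No mutex is needed for the counter itself in this variant; whether it is faster than the mutex version would have to be measured.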

What is the most efficient way to make this code thread safe?

Some C++ library I'm working on features a simple tracing mechanism which can be activated to generate log files showing which functions were called and what arguments were passed. It basically boils down to a TRACE macro being spilled all over the source of the library, and the macro expands to something like this:
typedef void(*TraceProc)( const char *msg );
/* Sets 'callback' to point to the trace procedure which actually prints the given
* message to some output channel, or to a null trace procedure which is a no-op in
* case the given source file/line position was disabled by the client.
*
* This function also registers the callback pointer in an internal data structure
* and resets it to zero in case the filtering configuration changed since the last
* invocation of updateTraceCallback.
*/
void updateTraceCallback( TraceProc *callback, const char *file, unsigned int lineno );
#define TRACE(msg) \
{ \
static TraceProc traceCallback = 0; \
if ( !traceCallback ) \
updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
traceCallback( msg ); \
}
The idea is that people can just say TRACE("foo hit") in their code and that will either call a debug printing function or be a no-op. They can use some other API (which is not shown here) to configure which TRACE uses (by source file/line number) should be printed. This configuration can change at runtime.
The issue is that this idea should now be used in a multi-threaded code base. Hence, the code which TRACE expands to needs to work correctly in the face of multiple threads of execution running the code simultaneously. There are about 20,000 different trace points in the code base right now and they are hit very often, so they should be rather efficient.
What is the most efficient way to make this approach thread safe? I need a solution for Windows (XP and newer) and Linux. I'm afraid of doing excessive locking just to check whether the filter configuration changed (99% of the time a trace point is hit, the configuration didn't change). I'm open to larger changes to the macro, too. So instead of discussing mutex vs. critical section performance, it would also be acceptable if the macro just sent an event to an event loop in a different thread (assuming that accessing the event loop is thread safe) and all the processing happens in the same thread, so it's synchronized using the event loop.
UPDATE: I can probably simplify this question to:
If I have one thread reading a pointer, and another thread which might write to the variable (but 99% of the time it doesn't), how can I avoid that the reading thread needs to lock all the time?
You could implement a configuration file version variable. When your program starts it is set to 0. The macro can hold a static int that is the last config version it saw. Then a simple atomic comparison between the last seen and the current config version will tell you if you need to do a full lock and call updateTraceCallback() again.
That way, 99% of the time you'll only add an extra atomic op, or memory barrier or something similar, which is very cheap. 1% of the time, just do the full mutex thing; it shouldn't affect your performance in any noticeable way if it's only 1% of the time.
Edit:
Some .h file:
extern long trace_version;
Some .cpp file:
long trace_version = 0;
The macro:
#define TRACE(msg) \
{ \
static long __lastSeenVersion = -1; \
static TraceProc traceCallback = 0; \
if ( !traceCallback || __lastSeenVersion != trace_version ) \
updateTraceCallback( &traceCallback, &__lastSeenVersion, __FILE__, __LINE__ ); \
traceCallback( msg ); \
}
The functions for incrementing a version and updates:
static long oldVersionRefcount = 0;
static long curVersionRefCount = 0;
// 'version' is a pointer because the macro passes &__lastSeenVersion
void updateTraceCallback( TraceProc *callback, long *version, const char *file, unsigned int lineno ) {
if ( *version != trace_version ) {
if ( InterlockedDecrement( &oldVersionRefcount ) == 0 ) {
//....free resources.....
//...no mutex needed, since no one is using this...
}
//....acquire mutex and do stuff....
InterlockedIncrement( &curVersionRefCount );
*version = trace_version;
//...release mutex...
}
}
void setNewTraceCallback( TraceProc *callback ) {
//...acquire mutex...
trace_version++; // No locks, mutexes or anything, this is atomic by itself.
while ( oldVersionRefcount != 0 ) { /* ...sleep? */ }
InterlockedExchange( &oldVersionRefcount, curVersionRefCount );
curVersionRefCount = 0;
//.... and so on...
//...release mutex...
}
Of course, this is very simplified, since if you need to upgrade the version and the oldVersionRefCount > 0, then you're in trouble; how to solve this is up to you, since it really depends on your problem. My guess is that in those situations, you could simply wait until the ref count is zero, since the amount of time that the ref count is incremented should be the time it takes to run the macro.
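For readers who can use C++11, the version-stamp idea might be sketched portably as below. This is only an illustration under assumptions: it leaves out the reference counting, models the filter as a single on/off flag, and every name in it (TracePoint, tracePoint, refreshTracePoint, setTracingEnabled, printTrace, nullTrace) is hypothetical rather than taken from the question.

```cpp
#include <atomic>
#include <mutex>

typedef void (*TraceProc)(const char* msg);

// Hypothetical stand-ins for the real trace procedures.
static std::atomic<int> g_printed(0);
inline void printTrace(const char*) { ++g_printed; }
inline void nullTrace(const char*) {}

static std::atomic<long> g_traceVersion(0);  // bumped on every config change
static std::mutex g_configMutex;             // guards the filter configuration
static bool g_tracingEnabled = true;         // toy filter: everything on or off

// What the macro's static variables would hold, one per trace point.
struct TracePoint {
    std::atomic<long> lastSeenVersion{-1};
    std::atomic<TraceProc> callback{nullptr};
};

// Slow path: taken only when the configuration changed since the last hit.
void refreshTracePoint(TracePoint& tp) {
    std::lock_guard<std::mutex> lock(g_configMutex);
    tp.callback.store(g_tracingEnabled ? printTrace : nullTrace);
    tp.lastSeenVersion.store(g_traceVersion.load());
}

// Fast path: two atomic loads and a compare when nothing changed.
void tracePoint(TracePoint& tp, const char* msg) {
    if (tp.lastSeenVersion.load(std::memory_order_acquire) !=
        g_traceVersion.load(std::memory_order_acquire))
        refreshTracePoint(tp);
    tp.callback.load()(msg);
}

// Changes the configuration and invalidates every trace point's cache.
void setTracingEnabled(bool enabled) {
    std::lock_guard<std::mutex> lock(g_configMutex);
    g_tracingEnabled = enabled;
    g_traceVersion.fetch_add(1);
}
```

A thread that is already past the version check when the configuration changes may still call the old callback once; whether that is acceptable depends on the application.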
I still don't fully understand the question, so please correct me on anything I didn't get.
(I'm leaving out the backslashes.)
#define TRACE(msg)
{
static TraceProc traceCallback = NULL;
TraceProc localTraceCallback;
localTraceCallback = traceCallback;
if (!localTraceCallback)
{
updateTraceCallback(&localTraceCallback, __FILE__, __LINE__);
// If two threads are running this at the same time
// one of them will update traceCallback and get it overwritten
// by the other. This isn't a big deal.
traceCallback = localTraceCallback;
}
// Now there's no way localTraceCallback can be null.
// An issue here is if in the middle of this executing
// traceCallback gets null'ed. But you haven't specified any
// restrictions about this either, so I'm assuming it isn't a problem.
localTraceCallback(msg);
}
Your comment says "resets it to zero in case the filtering configuration changes at runtime" but am I correct in reading that as "resets it to zero when the filtering configuration changes"?
Without knowing exactly how updateTraceCallback implements its data structure, or what other data it's referring to in order to decide when to reset the callbacks (or indeed to set them in the first place), it's impossible to judge what would be safe. A similar problem applies to knowing what traceCallback does - if it accesses a shared output destination, for example.
Given these limitations the only safe recommendation that doesn't require reworking other code is to stick a mutex around the whole lot (or preferably a critical section on Windows).
I'm afraid of doing excessive locking just to check whether the filter configuration changed (99% of the time a trace point is hit, the configuration didn't change). I'm open to larger changes to the macro, too. So instead of discussing mutex vs. critical section performance, it would also be acceptable if the macro just sent an event to an event loop in a different thread (assuming that accessing the event loop is thread safe)
How do you think thread safe messaging between threads is implemented without locks?
Anyway, here's a design that might work:
The data structure that holds the filter must be changed so that it is allocated dynamically from the heap because we are going to be creating multiple instances of filters. Also, it's going to need a reference count added to it. You need a typedef something like:
typedef struct Filter
{
unsigned int refCount;
// all the other filter data
} Filter;
There's a singleton 'current filter' declared somewhere.
static Filter* currentFilter;
and initialised with some default settings.
In your TRACE macro:
#define TRACE(msg) \
{ \
static Filter* filter = NULL; \
static TraceProc traceCallback = NULL; \
if (filterOutOfDate(filter)) \
{ \
getNewCallback(__FILE__, __LINE__, &traceCallback, &filter); \
} \
traceCallback(msg); \
}
filterOutOfDate() merely compares the filter with currentFilter to see if it is the same. It should be enough to just compare addresses. It does no locking.
getNewCallback() applies the current filter to get the new trace function and updates the filter passed in with the address of the current filter. Its implementation must be protected with a mutex lock. Also, it decrements the refCount of the original filter and increments the refCount of the new filter. This is so we know when we can free the old filter.
void getNewCallback(const char* file, int line, TraceProc* newCallback, Filter** filter)
{
// MUTEX lock
*newCallback = /* whatever you need to do */;
currentFilter->refCount++;
if (*filter != NULL)
{
(*filter)->refCount--;
if ((*filter)->refCount == 0)
{
// free filter and associated resources
}
}
*filter = currentFilter;
// MUTEX unlock
}
When you want to change the filter, you do something like
void changeFilter()
{
Filter* newFilter = /* build a new filter */;
newFilter->refCount = 0;
// MUTEX lock (same mutex as above)
currentFilter = newFilter;
// MUTEX unlock
}
If I have one thread reading a pointer, and another thread which might write to the variable (but 99% of the time it doesn't), how can I avoid that the reading thread needs to lock all the time?
From your code, it is OK to use the mutex inside updateTraceCallback() since it is going to be called very rarely (once per location). After taking the mutex, check whether traceCallback is already initialized: if yes, then another thread just did it for you and there is nothing to be done.
If updateTraceCallback() turns out to be a serious performance problem due to collisions on the global mutex, then you can simply make an array of mutexes instead and use a hashed value of the traceCallback pointer as an index into the mutex array. That would spread locking over many mutexes and minimize the number of collisions.
#define TRACE(msg) \
{ \
static TraceProc traceCallback = \
updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
traceCallback( msg ); \
}
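The "array of mutexes" suggestion, together with the re-check after taking the lock, could be sketched roughly as follows, assuming C++11. The stripe count of 16 and the names stripeFor, initTraceCallback and nullTrace are illustrative assumptions; the real code would consult the filter instead of always installing nullTrace.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>

typedef void (*TraceProc)(const char* msg);

inline void nullTrace(const char*) {}  // hypothetical default callback

const std::size_t kNumStripes = 16;    // assumed number of mutexes
static std::mutex g_stripes[kNumStripes];

// Picks a mutex by hashing the address of the trace point's static pointer.
std::mutex& stripeFor(const void* address) {
    std::size_t h = std::hash<const void*>()(address);
    return g_stripes[h % kNumStripes];
}

// Initializes *callback at most once; returns true if this call did the work.
bool initTraceCallback(TraceProc* callback) {
    std::lock_guard<std::mutex> lock(stripeFor(callback));
    if (*callback)          // another thread initialized it while we waited
        return false;
    *callback = nullTrace;  // the real code would consult the filter here
    return true;
}
```

Note that any check of the callback made outside the lock is still a plain read; to be strictly correct in C++11 terms it would need to be an atomic load.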

Debugging instance of another thread altering my data

I have a huge global array of structures. Some regions of the array are tied to individual threads and those threads can modify their regions of the array without having to use critical sections. But there is one special region of the array which all threads may have access to. The code that accesses these parts of the array needs to carefully use critical sections (each array element has its own critical section) to prevent any possibility of two threads writing to the structure simultaneously.
Now I have a mysterious bug I am trying to chase, it is occurring unpredictably and very infrequently. It seems that one of the structures is being filled with some incorrect number. One obvious explanation is that another thread has accidentally been allowed to set this number when it should be excluded from doing so.
Unfortunately it seems close to impossible to track down this bug. The array element in which the bad data appears is different each time. What I would love to be able to do is set some kind of trap for the bug as follows: I would enter a critical section for array element N, then I know that no other thread should be able to touch the data, then (until I exit the critical section) set some kind of flag to a debugging tool saying "if any other thread attempts to change the data here please break and show me the offending patch of source code"... but I suspect no such tool exists... or does it? Or is there some completely different debugging methodology that I should be employing?
How about wrapping your data with a transparent mutexed class? Then you could apply additional lock state checking.
class critical_section;
template < class T >
class element_wrapper
{
public:
element_wrapper(const T& v) : val(v) {}
element_wrapper() {}
const element_wrapper& operator = (const T& v) {
#ifdef _DEBUG_CONCURRENCY
if(!cs->is_locked())
_CrtDbgBreak();
#endif
val = v;
return *this;
}
operator T() { return val; }
critical_section* cs;
private:
T val;
};
As for critical section implementation:
class critical_section
{
public:
critical_section() : locked(FALSE) {
::InitializeCriticalSection(&cs);
}
~critical_section() {
_ASSERT(!locked);
::DeleteCriticalSection(&cs);
}
void lock() {
::EnterCriticalSection(&cs);
locked = TRUE;
}
void unlock() {
locked = FALSE;
::LeaveCriticalSection(&cs);
}
BOOL is_locked() {
return locked;
}
private:
CRITICAL_SECTION cs;
BOOL locked;
};
Actually, instead of a custom critical_section::locked flag, one could use ::TryEnterCriticalSection (followed by ::LeaveCriticalSection if it succeeds) to determine whether a critical section is owned. Still, the implementation above is almost as good.
So the appropriate usage would be:
typedef std::vector< element_wrapper<int> > cont_t;
// assumes x.cs has been pointed at the element's critical_section
void change(cont_t::reference x) { x.cs->lock(); x = 1; x.cs->unlock(); }
int main()
{
cont_t container(10, 0);
std::for_each(container.begin(), container.end(), &change);
}
I know two ways to handle such errors:
1) Read the code again and again, looking for possible errors. I can think of two errors that can cause this: unsynchronized access or writing to an incorrect memory address. Maybe you have more ideas.
2) Logging, logging and logging. Add lots of optional traces (OutputDebugString or log file) in every critical place, containing enough information: indexes, variable values etc. It is a good idea to guard this tracing with some #ifdef. Reproduce the bug and try to understand from the log what happens.
Your best (fastest) bet is still to review the mutex code. As you said, it is the obvious explanation - why not try to really find the explanation (by logic) instead of additional hints (by coding) that may turn out inconclusive? If the code review doesn't turn up something useful you may still take the mutex code and use it for a test run. The first try should not be to reproduce the bug in your system but to ensure correct implementation of the mutex - implement threads (start from 2 upwards) that all try to access the same data structure again and again with a random small delay in each of them to have them jitter around on the time line. If this test reveals a buggy mutex whose fault you simply can't identify in the code then you have fallen victim to some architecture dependent effect (maybe instruction reordering, multi-core cache incoherency, etc.) and need to find another mutex implementation. If OTOH you find an obvious bug in the mutex, try to exploit it in your real system (instrument your code so that the error should appear much more often) so that you can ensure that it really is the cause of your original problem.
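A stress test along those lines might look like the sketch below. It is written against any type with lock() and unlock(), so a wrapper around the real critical section code could be dropped in; std::mutex is used here only as a known-good stand-in, and the names stressLock and stressStdMutex are hypothetical.

```cpp
#include <chrono>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Hammers one shared counter from several threads. With a correct lock the
// result is exactly numThreads * iterations; a smaller value means that
// updates were lost, i.e. mutual exclusion failed.
template <class Lock>
long stressLock(Lock& lock, int numThreads, int iterations) {
    long shared = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&lock, &shared, iterations, t] {
            std::mt19937 rng(static_cast<unsigned>(t) + 1u);
            std::uniform_int_distribution<int> jitterUs(0, 20);
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                long seen = shared;
                std::this_thread::yield();  // widen the window for a race
                shared = seen + 1;
                lock.unlock();
                // small random delay so the threads jitter on the time line
                std::this_thread::sleep_for(
                    std::chrono::microseconds(jitterUs(rng)));
            }
        });
    }
    for (auto& th : threads) th.join();
    return shared;
}

// Convenience wrapper that exercises std::mutex.
long stressStdMutex(int numThreads, int iterations) {
    std::mutex m;
    return stressLock(m, numThreads, iterations);
}
```

A passing run does not prove the lock correct, but a failing run is a strong hint, and it is much cheaper to reproduce than the original bug.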
I was thinking about this while pedaling to work. One possible way of handling this is to make portions of the memory in question be read-only when it is not actively being accessed and protected via critical section ownership. This is assuming that the problem is caused by a thread writing to the memory when it does not own the appropriate critical section.
There are quite a few limitations to this that may prevent it from working. Most important is the fact that I think you can only set privileges on a page by page basis (4K I believe). So that would likely require some very specific changes to your allocation scheme so that you could narrow down the appropriate section to protect. The second problem is that it would not catch the rogue thread writing to the memory if another thread actively owned the critical section. But it would catch it and cause an immediate access violation if the critical section was not owned.
The idea would be to change your EnterCriticalSection calls to:
EnterCriticalSection()
VirtualProtect( … PAGE_READWRITE … );
And change the LeaveCriticalSection calls to:
VirtualProtect( … PAGE_READONLY … );
LeaveCriticalSection()
The following chunk of code shows a call to VirtualProtect
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main( int argc, char* argv[] )
{
unsigned char *mem;
int i;
DWORD dwOld;
// this assumes a 4K page size
mem = (unsigned char *)malloc( 4096 * 10 );
for ( i = 0; i < 10; i++ )
mem[i * 4096] = i;
// set the second page to be readonly. The allocation from malloc is
// not necessarily on a page boundary, but this will definitely be in
// the second page.
printf( "VirtualProtect res = %d\n",
VirtualProtect( mem + 4096,
1, // ends up setting entire page
PAGE_READONLY, &dwOld ));
// can still read it
for ( i = 1; i < 10; i++ )
printf( "%d ", mem[i*4096] );
printf( "\n" );
// Can write to all but the second page
for ( i = 0; i < 10; i++ )
if ( i != 1 ) // avoid second page which we made readonly
mem[i * 4096] = 1;
// this causes an access violation
mem[4096] = 1;
}