Multithreaded C++: force read from memory, bypassing cache

Multithreaded C++: force read from memory, bypassing cache - c++

I'm working on a personal hobby-time game engine and I'm working on a multithreaded batch executor. I was originally using a concurrent lockless queue and std::function all over the place to facilitate communication between the master and slave threads, but decided to scrap it in favor of a lighter-weight way of doing things that give me tight control over memory allocation: function pointers and memory pools.
Anyway, I've run into a problem:
The function pointer, no matter what I try, is only getting read correctly by one thread while the others read a null pointer and thus fail an assert.
I'm fairly certain this is a problem with caching. I have confirmed that all threads have the same address for the pointer. I've tried declaring it as volatile, intptr_t, std::atomic, and tried all sorts of casting-fu and the threads all just seem to ignore it and continue reading their cached copies.
I've modeled the master and slave in a model checker to make sure the concurrency is good, and there is no livelock or deadlock (provided that the shared variables all synchronize correctly)
void Executor::operator() (int me) {
while (true) {
printf("Slave %d waiting.\n", me);
{
std::unique_lock<std::mutex> lock(batch.ready_m);
while(!batch.running) batch.ready.wait(lock);
running_threads++;
}
printf("Slave %d running.\n", me);
BatchFunc func = batch.func;
assert(func != nullptr);
int index;
if (batch.store_values) {
while ((index = batch.item.fetch_add(1)) < batch.n_items) {
void* data = reinterpret_cast<void*>(batch.data_buffer + index * batch.item_size);
func(batch.share_data, data);
}
}
else {
while ((index = batch.item.fetch_add(1)) < batch.n_items) {
void** data = reinterpret_cast<void**>(batch.data_buffer + index * batch.item_size);
func(batch.share_data, *data);
}
}
// at least one thread finished, so make sure we won't loop back around
batch.running = false;
if (running_threads.fetch_sub(1) == 1) { // I am the last one
batch.done = true; // therefore all threads are done
batch.complete.notify_all();
}
}
}
void Executor::run_batch() {
assert(!batch.running);
if (batch.func == nullptr || batch.n_items == 0) return;
batch.item.store(0);
batch.running = true;
batch.done = false;
batch.ready.notify_all();
printf("Master waiting.\n");
{
std::unique_lock<std::mutex> lock(batch.complete_m);
while (!batch.done) batch.complete.wait(lock);
}
printf("Master ready.\n");
batch.func = nullptr;
batch.n_items = 0;
}
batch.func is being set by another function
template<typename SharedT, typename ItemT>
void set_batch_job(void(*func)(const SharedT*, ItemT*), const SharedT& share_data, bool byValue = true) {
static_assert(sizeof(SharedT) <= SHARED_DATA_MAXSIZE, "Shared data too large");
static_assert(std::is_pod<SharedT>::value, "Shared data type must be POD");
assert(std::is_pod<ItemT>::value || !byValue);
assert(!batch.running);
batch.func = reinterpret_cast<volatile BatchFunc>(func);
memcpy(batch.share_data, (void*) &share_data, sizeof(SharedT));
batch.store_values = byValue;
if (byValue) {
batch.item_size = sizeof(ItemT);
}
else { // store pointers instead of values
batch.item_size = sizeof(ItemT*);
}
batch.n_items = 0;
}
and here is the struct (and typedef) that it's dealing with
typedef void(*BatchFunc)(const void*, void*);
struct JobBatch {
volatile BatchFunc func;
void* const share_data = operator new(SHARED_DATA_MAXSIZE);
intptr_t const data_buffer = reinterpret_cast<intptr_t>(operator new (EXEC_DATA_BUFFER_SIZE));
volatile size_t item_size;
std::atomic<int> item; // Index into the data array
volatile int n_items = 0;
std::condition_variable complete; // slave -> master signal
std::condition_variable ready; // master -> slave signal
std::mutex complete_m;
std::mutex ready_m;
bool store_values = false;
volatile bool running = false; // there is work to do in the batch
volatile bool done = false; // there is no work left to do
JobBatch();
} batch;
How do I make sure that all the necessary reads and writes to batch.func get synchronized properly between threads?
Just in case it matters: I'm using Visual Studio and compiling an x64 Debug Windows executable. Intel i5, Windows 10, 8GB RAM.

So I did a little reading on the C++ memory model and I managed to hack together a solution using atomic_thread_fence. Everything is probably super broken because I'm crazy and shouldn't roll my own system here, but hey, it's fun to learn!
Basically, whenever you're done writing things that you want other threads to see, you need to call atomic_thread_fence(std::memory_order_release)
On the receiving thread(s), you call atomic_thread_fence(std::memory_order_acquire) before reading shared data.
In my case, release should be done immediately before waiting on a condition variable and acquire should be done immediately before using data written by other threads.
This ensures that the writes on one thread are seen by the others.
I'm no expert, so this is probably not the right way to tackle the problem and will likely be faced with certain doom. For instance, I still have a deadlock/livelock problem to sort out.
tl;dr: it's not exactly a cache thing: threads may not have their data totally in sync with each other unless you enforce that with atomic memory fences.

Related

Synchronize Threads - InterlockedExchange

I like to check if a thread is doing work. If the thread is doing work I will wait for an event until the thread has stopped its work. The event the thread will set at the end.
To check if the thread is working I declared a volatile bool variable. The bool variable will be true if the thread is running, else it is false. At the end of the thread the bool variable will be set to false.
Is it adequate to use a volatile bool variable or do I have to use an atomic function?
BTW: Can please someone explain me the InterlockedExchange Method, I don´t understand the use case I will need this function.
Update
I see without my code it is not clear to say if a volatile bool variable will adequate. I wrote a testclass which shows my problem.
class Testclass
{
public:
Testclass(void);
~Testclass(void);
void doThreadedWork();
void Work();
void StartWork();
void WaitUntilFinish();
private:
HANDLE hHasWork;
HANDLE hAbort;
HANDLE hFinished;
volatile bool m_bWorking;
};
//.cpp
#include "stdafx.h"
#include "Testclass.h"
CRITICAL_SECTION cs;
DWORD WINAPI myThread(LPVOID lpParameter)
{
Testclass* pTestclass = (Testclass*) lpParameter;
pTestclass->doThreadedWork();
return 0;
}
Testclass::Testclass(void)
{
InitializeCriticalSection(&cs);
DWORD myThreadID;
HANDLE myHandle = CreateThread(0, 0, myThread, this, 0, &myThreadID);
m_bWorking = false;
hHasWork = CreateEvent(NULL,TRUE,FALSE,NULL);
hAbort = CreateEvent(NULL,TRUE,FALSE,NULL);
hFinished = CreateEvent(NULL,FALSE,FALSE,NULL);
}
Testclass::~Testclass(void)
{
DeleteCriticalSection(&cs);
CloseHandle(hHasWork);
CloseHandle(hAbort);
CloseHandle(hFinished);
}
void Testclass::Work()
{
// do some work
m_bWorking = false;
SetEvent(hFinished);
}
void Testclass::StartWork()
{
EnterCriticalSection(&cs);
m_bWorking = true;
ResetEvent(hFinished);
SetEvent(hHasWork);
LeaveCriticalSection(&cs);
}
void Testclass::doThreadedWork()
{
HANDLE hEvents[2];
hEvents[0] = hHasWork;
hEvents[1] = hAbort;
while(true)
{
DWORD dwEvent = WaitForMultipleObjects(2, hEvents, FALSE, INFINITE);
if(WAIT_OBJECT_0 == dwEvent)
{
Work();
}
else
{
break;
}
}
}
void Testclass::WaitUntilFinish()
{
EnterCriticalSection(&cs);
if(!m_bWorking)
{
// if the thread is not working, do not wait and return
LeaveCriticalSection(&cs);
return;
}
WaitForSingleObject(hFinished,INFINITE);
LeaveCriticalSection(&cs);
}
For me it is not realy clear if m_bWorking value n a atomic way or if the volatile cast will adequate.

There is a lot of background to cover for your question. We don't know for example what tool chain you are using so I am going to answer it as a winapi question. I further assume you have some something in mind like this:
volatile bool flag = false;
DWORD WINAPI WorkFn(void*) {
flag = true;
// work here
....
// done.
flag = false;
return 0;
}
int main() {
HANDLE th = CreateThread(...., &WorkFn, NULL, ..);
// wait for start of work.
while (!flag) {
// ?? # 1
}
// Seems thread is busy now. Time to wait for it to finish.
while (flag) {
// ?? # 2
}
}
There are many things wrong here. For starters the volatile does very little here. When flag = true happens it will eventually be visible to the other thread because it is backed by a global variable. This is so because it will at least make it into the cache and the cache has ways to tell other processors that a given line (which is a range of addresses) is dirty. The only way it would not make it into the cache is that if the compiler makes a super crazy optimization in which flag stays in the cpu as a register. That could actually happen but not in this particular code example.
So volatile tells the compiler to never keep the variable as a register. That is what it is, every time you see a volatile variable you can translate it as "never enregister this variable". Its use here is just basically a paranoid move.
If this code is what you had in mind then this looping over a flag pattern is called a Spinlock and this one is a really poor one. It is almost never the right thing to do in a user mode program.
Before we go into better approaches let me tackle your Interlocked question. What people usually mean is this pattern
volatile long flag = 0;
DWORD WINAPI WorkFn(void*) {
InterlockedExchange(&flag, 1);
....
}
int main() {
...
while (InterlockedCompareExchange(&flag, 1, 1) = 0L) {
YieldProcessor();
}
...
}
Assume the ... means similar code as before. What the InterlockedExchange() is doing is forcing the write to memory to happen in a deterministic, "broadcast the change now", kind of way and the typical way to read it in the same "bypass the cache" way is via InterlockedCompareExchange().
One problem with them is that they generate more traffic on the system bus. That is, the bus now being used to broadcast cache synchronization packets among the cpus on the system.
std::atomic<bool> flag would be the modern, C++11 way to do the same, but still not what you really want to do.
I added the YieldProcessor() call there to point to the real problem. When you wait for a memory address to change you are using cpu resources that would be better used somewhere else, for example in the actual work (!!). If you actually yield the processor there is at least a chance that the OS will give it to the WorkFn, but in a multicore machine it will quickly go back to polling the variable. In a modern machine you will be checking this flag millions of times per second, with the yield, probably 200000 times per second. Terrible waste either way.
What you want to do here is to leverage Windows to do a zero-cost wait, or at least a low cost as you want to:
DWORD WINAPI WorkFn(void*) {
// work here
....
return 0;
}
int main() {
HANDLE th = CreateThread(...., &WorkFn, NULL, ..);
WaitForSingleObject(th, INFINITE);
// work is done!
CloseHandle(th);
}
When you return from the worker thread the thread handle get signaled and the wait it satisfied. While stuck in WaitForSingleObject you don't consume any cpu cycles. If you want to do a periodic activity in the main() function while you wait you can replace INFINITE with 1000, which will release the main thread every second. In that case you need to check the return value of WaitForSingleObject to tell the timeout from thread being done case.
If you need to actually know when work started, you need an additional waitable object, for example, a Windows event which is obtained via CreateEvent() and can be waited on using the same WaitForSingleObject.
Update [1/23/2016]
Now that we can see the code you have in mind, you don't need atomics, volatile works just fine. The m_bWorking is protected by the cs mutex anyhow for the true case.
If I might suggest, you can use TryEnterCriticalSection and cs to accomplish the same without m_bWorking at all:
void Testclass::Work()
{
EnterCriticalSection(&cs);
// do some work
LeaveCriticalSection(&cs);
SetEvent(hFinished); // could be removed as well
}
void Testclass::StartWork()
{
ResetEvent(hFinished); // could be removed.
SetEvent(hHasWork);
}
void Testclass::WaitUntilFinish()
{
if (TryEnterCriticalSection(&cs)) {
// Not busy now.
LeaveCriticalSection(&cs);
return;
} else {
// busy doing work. If we use EnterCriticalSection(&cs)
// here we can even eliminate hFinished from the code.
}
...
}

For some reason, the Interlocked API does not include an "InterlockedGet" or "InterlockedSet" function. This is a strange omission and the typical work around is to cast through volatile.
You can use code like the following on Windows:
#include <intrin.h>
__inline int InterlockedIncrement(int *j)
{ // This is VS-specific
return _InterlockedIncrement((volatile LONG *) j);
}
__inline int InterlockedDecrement(int *j)
{ // This is VS-specific
return _InterlockedDecrement((volatile LONG *) j);
}
__inline static void InterlockedSet(int *val, int newval)
{
*((volatile int *)val) = newval;
}
__inline static int InterlockedGet(int *val)
{
return *((volatile int *)val);
}
Yes, it's ugly. But it's the best way to work around the deficiency if you're not using C++11. If you're using C++11, use std::atomic instead.
Note that this is Windows-specific code and should not be used on other platforms.

No, volatile bool will not be enough. You need an atomic bool, as you correctly suspect. Otherwise, you might never see your bool updated.
There is also no InterlockedExchange in C++ (the tags of your question), but there are compare_exchange_weak and compare_exchange_strong functions in C++11. Those are used to set the value of an object to a certain NewValue, provided it's current value is TestValue and indicate the status of this attempt (was the change made or not). The benefit of those functions is that this is done in such a fasion that you are guaranteed that if two threads are trying to perform this operation, only one will succeed. This is very helpful when you need to take a certain actions depending on the result of the operation.

Mutex Safety with Interrupts (Embedded Firmware)

Edit #Mike pointed out that my try_lock function in the code below is unsafe and that accessor creation can produce a race condition as well. The suggestions (from everyone) have convinced me that I'm going down the wrong path.
Original Question
The requirements for locking on an embedded microcontroller are different enough from multithreading that I haven't been able to convert multithreading examples to my embedded applications. Typically I don't have an OS or threads of any kind, just main and whatever interrupt functions are called by the hardware periodically.
It's pretty common that I need to fill up a buffer from an interrupt, but process it in main. I've created the IrqMutex class below to try to safely implement this. Each person trying to access the buffer is assigned a unique id through IrqMutexAccessor, then they each can try_lock() and unlock(). The idea of a blocking lock() function doesn't work from interrupts because unless you allow the interrupt to complete, no other code can execute so the unlock() code never runs. I do however use a blocking lock from the main() code occasionally.
However, I know that the double-check lock doesn't work without C++11 memory barriers (which aren't available on many embedded platforms). Honestly despite reading quite a bit about it, I don't really understand how/why the memory access reordering can cause a problem. I think that the use of volatile sig_atomic_t (possibly combined with the use of unique IDs) makes this different from the double-check lock. But I'm hoping someone can: confirm that the following code is correct, explain why it isn't safe, or offer a better way to accomplish this.
class IrqMutex {
friend class IrqMutexAccessor;
private:
std::sig_atomic_t accessorIdEnum;
volatile std::sig_atomic_t owner;
protected:
std::sig_atomic_t nextAccessor(void) { return ++accessorIdEnum; }
bool have_lock(std::sig_atomic_t accessorId) {
return (owner == accessorId);
}
bool try_lock(std::sig_atomic_t accessorId) {
// Only try to get a lock, while it isn't already owned.
while (owner == SIG_ATOMIC_MIN) {
// <-- If an interrupt occurs here, both attempts can get a lock at the same time.
// Try to take ownership of this Mutex.
owner = accessorId; // SET
// Double check that we are the owner.
if (owner == accessorId) return true;
// Someone else must have taken ownership between CHECK and SET.
// If they released it after CHECK, we'll loop back and try again.
// Otherwise someone else has a lock and we have failed.
}
// This shouldn't happen unless they called try_lock on something they already owned.
if (owner == accessorId) return true;
// If someone else owns it, we failed.
return false;
}
bool unlock(std::sig_atomic_t accessorId) {
// Double check that the owner called this function (not strictly required)
if (owner == accessorId) {
owner = SIG_ATOMIC_MIN;
return true;
}
// We still return true if the mutex was unlocked anyway.
return (owner == SIG_ATOMIC_MIN);
}
public:
IrqMutex(void) : accessorIdEnum(SIG_ATOMIC_MIN), owner(SIG_ATOMIC_MIN) {}
};
// This class is used to manage our unique accessorId.
class IrqMutexAccessor {
friend class IrqMutex;
private:
IrqMutex& mutex;
const std::sig_atomic_t accessorId;
public:
IrqMutexAccessor(IrqMutex& m) : mutex(m), accessorId(m.nextAccessor()) {}
bool have_lock(void) { return mutex.have_lock(accessorId); }
bool try_lock(void) { return mutex.try_lock(accessorId); }
bool unlock(void) { return mutex.unlock(accessorId); }
};
Because there is one processor, and no threading the mutex serves what I think is a subtly different purpose than normal. There are two main use cases I run into repeatedly.
The interrupt is a Producer and takes ownership of a free buffer and loads it with a packet of data. The interrupt/Producer may keep its ownership lock for a long time spanning multiple interrupt calls. The main function is the Consumer and takes ownership of a full buffer when it is ready to process it. The race condition rarely happens, but if the interrupt/Producer finishes with a packet and needs a new buffer, but they are all full it will try to take the oldest buffer (this is a dropped packet event). If the main/Consumer started to read and process that oldest buffer at exactly the same time they would trample all over each other.
The interrupt is just a quick change or increment of something (like a counter). However, if we want to reset the counter or jump to some new value with a call from the main() code we don't want to try to write to the counter as it is changing. Here main actually does a blocking loop to obtain a lock, however I think its almost impossible to have to actually wait here for more than two attempts. Once it has a lock, any calls to the counter interrupt will be skipped, but that's generally not a big deal for something like a counter. Then I update the counter value and unlock it so it can start incrementing again.
I realize these two samples are dumbed down a bit, but some version of these patterns occur in many of the peripherals in every project I work on and I'd like once piece of reusable code that can safely handle this across various embedded platforms. I included the C tag, because all of this is directly convertible to C code, and on some embedded compilers that's all that is available. So I'm trying to find a general method that is guaranteed to work in both C and C++.
struct ExampleCounter {
volatile long long int value;
IrqMutex mutex;
} exampleCounter;
struct ExampleBuffer {
volatile char data[256];
volatile size_t index;
IrqMutex mutex; // One mutex per buffer.
} exampleBuffers[2];
const volatile char * const REGISTER;
// This accessor shouldn't be created in an interrupt or a race condition can occur.
static IrqMutexAccessor myMutex(exampleCounter.mutex);
void __irqQuickFunction(void) {
// Obtain a lock, add the data then unlock all within one function call.
if (myMutex.try_lock()) {
exampleCounter.value++;
myMutex.unlock();
} else {
// If we failed to obtain a lock, we skipped this update this one time.
}
}
// These accessors shouldn't be created in an interrupt or a race condition can occur.
static IrqMutexAccessor myMutexes[2] = {
IrqMutexAccessor(exampleBuffers[0].mutex),
IrqMutexAccessor(exampleBuffers[1].mutex)
};
void __irqLongFunction(void) {
static size_t bufferIndex = 0;
// Check if we have a lock.
if (!myMutex[bufferIndex].have_lock() and !myMutex[bufferIndex].try_lock()) {
// If we can't get a lock try the other buffer
bufferIndex = (bufferIndex + 1) % 2;
// One buffer should always be available so the next line should always be successful.
if (!myMutex[bufferIndex].try_lock()) return;
}
// ... at this point we know we have a lock ...
// Get data from the hardware and modify the buffer here.
const char c = *REGISTER;
exampleBuffers[bufferIndex].data[exampleBuffers[bufferIndex].index++] = c;
// We may keep the lock for multiple function calls until the end of packet.
static const char END_PACKET_SIGNAL = '\0';
if (c == END_PACKET_SIGNAL) {
// Unlock this buffer so it can be read from main.
myMutex[bufferIndex].unlock();
// Switch to the other buffer for next time.
bufferIndex = (bufferIndex + 1) % 2;
}
}
int main(void) {
while (true) {
// Mutex for counter
static IrqMutexAccessor myCounterMutex(exampleCounter.mutex);
// Change counter value
if (EVERY_ONCE_IN_A_WHILE) {
// Skip any updates that occur while we are updating the counter.
while(!myCounterMutex.try_lock()) {
// Wait for the interrupt to release its lock.
}
// Set the counter to a new value.
exampleCounter.value = 500;
// Updates will start again as soon as we unlock it.
myCounterMutex.unlock();
}
// Mutexes for __irqLongFunction.
static IrqMutexAccessor myBufferMutexes[2] = {
IrqMutexAccessor(exampleBuffers[0].mutex),
IrqMutexAccessor(exampleBuffers[1].mutex)
};
// Process buffers from __irqLongFunction.
for (size_t i = 0; i < 2; i++) {
// Obtain a lock so we can read the data.
if (!myBufferMutexes[i].try_lock()) continue;
// Check that the buffer isn't empty.
if (exampleBuffers[i].index == 0) {
myBufferMutexes[i].unlock(); // Don't forget to unlock.
continue;
}
// ... read and do something with the data here ...
exampleBuffer.index = 0;
myBufferMutexes[i].unlock();
}
}
}
}
Also note that I used volatile on any variable that is read-by or written-by the interrupt routine (unless the variable was only accessed from the interrupt like the static bufferIndex value in __irqLongFunction). I've read that mutexes remove some of need for volatile in multithreaded code, but I don't think that applies here. Did I use the right amount of volatile? I used it on: ExampleBuffer[].data[256], ExampleBuffer[].index, and ExampleCounter.value.

I apologize for the long answer, but perhaps it is fitting for a long question.
To answer your first question, I would say that your implementation of IrqMutex is not safe. Let me try to explain where I see problems.
Function nextAccessor
std::sig_atomic_t nextAccessor(void) { return ++accessorIdEnum; }
This function has a race condition, because the increment operator is not atomic, despite it being on an atomic value marked volatile. It involves 3 operations: reading the current value of accessorIdEnum, incrementing it, and writing the result back. If two IrqMutexAccessors are created at the same time, it's possible that they both get the same ID.
Function try_lock
The try_lock function also has a race condition. One thread (eg main), could go into the while loop, and then before taking ownership, another thread (eg an interrupt) can also go into the while loop and take ownership of the lock (returning true). Then the first thread can continue, moving onto owner = accessorId, and thus "also" take the lock. So two threads (or your main thread and an interrupt) can try_lock on an unowned mutex at the same time and both return true.
Disabling interrupts by RAII
We can achieve some level of simplicity and encapsulation by using RAII for interrupt disabling, for example the following class:
class InterruptLock {
public:
InterruptLock() {
prevInterruptState = currentInterruptState();
disableInterrupts();
}
~InterruptLock() {
restoreInterrupts(prevInterruptState);
}
private:
int prevInterruptState; // Whatever type this should be for the platform
InterruptLock(const InterruptLock&); // Not copy-constructable
};
And I would recommend disabling interrupts to get the atomicity you need within the mutex implementation itself. For example something like:
bool try_lock(std::sig_atomic_t accessorId) {
InterruptLock lock;
if (owner == SIG_ATOMIC_MIN) {
owner = accessorId;
return true;
}
return false;
}
bool unlock(std::sig_atomic_t accessorId) {
InterruptLock lock;
if (owner == accessorId) {
owner = SIG_ATOMIC_MIN;
return true;
}
return false;
}
Depending on your platform, this might look different, but you get the idea.
As you said, this provides a platform to abstract away from the disabling and enabling interrupts in general code, and encapsulates it to this one class.
Mutexes and Interrupts
Having said how I would consider implementing the mutex class, I would not actually use a mutex class for your use-cases. As you pointed out, mutexes don't really play well with interrupts, because an interrupt can't "block" on trying to acquire a mutex. For this reason, for code that directly exchanges data with an interrupt, I would instead strongly consider just directly disabling interrupts (for a very short time while the main "thread" touches the data).
So your counter might simply look like this:
volatile long long int exampleCounter;
void __irqQuickFunction(void) {
exampleCounter++;
}
...
// Change counter value
if (EVERY_ONCE_IN_A_WHILE) {
InterruptLock lock;
exampleCounter = 500;
}
In my mind, this is easier to read, easier to reason about, and won't "slip" when there's contention (ie miss timer beats).
Regarding the buffer use-case, I would strongly recommend against holding a lock for multiple interrupt cycles. A lock/mutex should be held for just the slightest moment required to "touch" a piece of memory - just long enough to read or write it. Get in, get out.
So this is how the buffering example might look:
struct ExampleBuffer {
char data[256];
} exampleBuffers[2];
ExampleBuffer* volatile bufferAwaitingConsumption = nullptr;
ExampleBuffer* volatile freeBuffer = &exampleBuffers[1];
const volatile char * const REGISTER;
void __irqLongFunction(void) {
static const char END_PACKET_SIGNAL = '\0';
static size_t index = 0;
static ExampleBuffer* receiveBuffer = &exampleBuffers[0];
// Get data from the hardware and modify the buffer here.
const char c = *REGISTER;
receiveBuffer->data[index++] = c;
// End of packet?
if (c == END_PACKET_SIGNAL) {
// Make the packet available to the consumer
bufferAwaitingConsumption = receiveBuffer;
// Move on to the next buffer
receiveBuffer = freeBuffer;
freeBuffer = nullptr;
index = 0;
}
}
int main(void) {
while (true) {
// Fetch packet from shared variable
ExampleBuffer* packet;
{
InterruptLock lock;
packet = bufferAwaitingConsumption;
bufferAwaitingConsumption = nullptr;
}
if (packet) {
// ... read and do something with the data here ...
// Once we're done with the buffer, we need to release it back to the producer
{
InterruptLock lock;
freeBuffer = packet;
}
}
}
}
This code is arguably easier to reason about, since there are only two memory locations shared between the interrupt and the main loop: one to pass packets from the interrupt to the main loop, and one to pass empty buffers back to the interrupt. We also only touch those variables under "lock", and only for the minimum time needed to "move" the value. (for simplicity I've skipped over the buffer overflow logic when the main loop takes too long to free the buffer).
It's true that in this case one may not even need the locks, since we're just reading and writing simple value, but the cost of disabling the interrupts is not much, and the risk of making mistakes otherwise, is not worth it in my opinion.
Edit
As pointed out in the comments, the above solution was meant to only tackle the multithreading problem, and omitted overflow checking. Here is more complete solution which should be robust under overflow conditions:
const size_t BUFFER_COUNT = 2;
struct ExampleBuffer {
char data[256];
ExampleBuffer* next;
} exampleBuffers[BUFFER_COUNT];
volatile size_t overflowCount = 0;
class BufferList {
public:
BufferList() : first(nullptr), last(nullptr) { }
// Atomic enqueue
void enqueue(ExampleBuffer* buffer) {
InterruptLock lock;
if (last)
last->next = buffer;
else {
first = buffer;
last = buffer;
}
}
// Atomic dequeue (or returns null)
ExampleBuffer* dequeueOrNull() {
InterruptLock lock;
ExampleBuffer* result = first;
if (first) {
first = first->next;
if (!first)
last = nullptr;
}
return result;
}
private:
ExampleBuffer* first;
ExampleBuffer* last;
} freeBuffers, buffersAwaitingConsumption;
const volatile char * const REGISTER;
void __irqLongFunction(void) {
static const char END_PACKET_SIGNAL = '\0';
static size_t index = 0;
static ExampleBuffer* receiveBuffer = &exampleBuffers[0];
// Recovery from overflow?
if (!receiveBuffer) {
// Try get another free buffer
receiveBuffer = freeBuffers.dequeueOrNull();
// Still no buffer?
if (!receiveBuffer) {
overflowCount++;
return;
}
}
// Get data from the hardware and modify the buffer here.
const char c = *REGISTER;
if (index < sizeof(receiveBuffer->data))
receiveBuffer->data[index++] = c;
// End of packet, or out of space?
if (c == END_PACKET_SIGNAL) {
// Make the packet available to the consumer
buffersAwaitingConsumption.enqueue(receiveBuffer);
// Move on to the next free buffer
receiveBuffer = freeBuffers.dequeueOrNull();
index = 0;
}
}
size_t getAndResetOverflowCount() {
InterruptLock lock;
size_t result = overflowCount;
overflowCount = 0;
return result;
}
int main(void) {
// All buffers are free at the start
for (int i = 0; i < BUFFER_COUNT; i++)
freeBuffers.enqueue(&exampleBuffers[i]);
while (true) {
// Fetch packet from shared variable
ExampleBuffer* packet = dequeueOrNull();
if (packet) {
// ... read and do something with the data here ...
// Once we're done with the buffer, we need to release it back to the producer
freeBuffers.enqueue(packet);
}
size_t overflowBytes = getAndResetOverflowCount();
if (overflowBytes) {
// ...
}
}
}
The key changes:
If the interrupt runs out of free buffers, it will recover
If the interrupt receives data while it doesn't have a receive buffer, it will communicate that to the main thread via getAndResetOverflowCount
If you keep getting buffer overflows, you can simply increase the buffer count
I've encapsulated the multithreaded access into a queue class implemented as a linked list (BufferList), which supports atomic dequeue and enqueue. The previous example also used queues, but of length 0-1 (either an item is enqueued or it isn't), and so the implementation of the queue was just a single variable. In the case of running out of free buffers, the receive queue could have 2 items, so I upgraded it to a proper queue rather than adding more shared variables.

If the interrupt is the producer and mainline code is the consumer, surely it's as simple as disabling the interrupt for the duration of the consume operation?
That's how I used to do it in my embedded micro controller days.

C++ Blocking Queue Segfault w/ Boost

I had a need for a Blocking Queue in C++ with timeout-capable offer(). The queue is intended for multiple producers, one consumer. Back when I was implementing, I didn't find any good existing queues that fit this need, so I coded it myself.
I'm seeing segfaults come out of the take() method on the queue, but they are intermittent. I've been looking over the code for issues but I'm not seeing anything that looks problematic.
I'm wondering if:
There is an existing library that does this reliably that I should
use (boost or header-only preferred).
Anyone sees any obvious flaw in my code that I need to fix.
Here is the header:
class BlockingQueue
{
public:
BlockingQueue(unsigned int capacity) : capacity(capacity) { };
bool offer(const MyType & myType, unsigned int timeoutMillis);
MyType take();
void put(const MyType & myType);
unsigned int getCapacity();
unsigned int getCount();
private:
std::deque<MyType> queue;
unsigned int capacity;
};
And the relevant implementations:
boost::condition_variable cond;
boost::mutex mut;
bool BlockingQueue::offer(const MyType & myType, unsigned int timeoutMillis)
{
Timer timer;
// boost::unique_lock is a scoped lock - its destructor will call unlock().
// So no need for us to make that call here.
boost::unique_lock<boost::mutex> lock(mut);
// We use a while loop here because the monitor may have woken up because
// another producer did a PulseAll. In that case, the queue may not have
// room, so we need to re-check and re-wait if that is the case.
// We use an external stopwatch to stop the madness if we have taken too long.
while (queue.size() >= this->capacity)
{
int monitorTimeout = timeoutMillis - ((unsigned int) timer.getElapsedMilliSeconds());
if (monitorTimeout <= 0)
{
return false;
}
if (!cond.timed_wait(lock, boost::posix_time::milliseconds(timeoutMillis)))
{
return false;
}
}
cond.notify_all();
queue.push_back(myType);
return true;
}
void BlockingQueue::put(const MyType & myType)
{
// boost::unique_lock is a scoped lock - its destructor will call unlock().
// So no need for us to make that call here.
boost::unique_lock<boost::mutex> lock(mut);
// We use a while loop here because the monitor may have woken up because
// another producer did a PulseAll. In that case, the queue may not have
// room, so we need to re-check and re-wait if that is the case.
// We use an external stopwatch to stop the madness if we have taken too long.
while (queue.size() >= this->capacity)
{
cond.wait(lock);
}
cond.notify_all();
queue.push_back(myType);
}
MyType BlockingQueue::take()
{
// boost::unique_lock is a scoped lock - its destructor will call unlock().
// So no need for us to make that call here.
boost::unique_lock<boost::mutex> lock(mut);
while (queue.size() == 0)
{
cond.wait(lock);
}
cond.notify_one();
MyType myType = this->queue.front();
this->queue.pop_front();
return myType;
}
unsigned int BlockingQueue::getCapacity()
{
return this->capacity;
}
unsigned int BlockingQueue::getCount()
{
return this->queue.size();
}
And yes, I didn't implement the class using templates - that is next on the list :)
Any help is greatly appreciated. Threading issues can be really hard to pin down.
-Ben

Why are cond, and mut globals? I would expect them to be members of your BlockingQueue object. I don't know what else is touching those things, but there may be an issue there.
I too have implemented a ThreadSafeQueue as part of a larger project:
https://github.com/cdesjardins/QueuePtr/blob/master/include/ThreadSafeQueue.h
It is a similar concept to yours, except the enqueue (aka offer) functions are non-blocking because there is basically no max capacity. To enforce a capacity I typically have a pool with N buffers added at system init time, and a Queue for message passing at run time, this also eliminates the need for memory allocation at run time which I consider to be a good thing (I typically work on embedded applications).
The only difference between a pool, and a queue is that a pool gets a bunch of buffers enqueued at system init time. So you have something like this:
ThreadSafeQueue<BufferDataType*> pool;
ThreadSafeQueue<BufferDataType*> queue;
void init()
{
for (int i = 0; i < NUM_BUFS; i++)
{
pool.enqueue(new BufferDataType);
}
}
Then when you want send a message you do something like the following:
void producerA()
{
BufferDataType *buf;
if (pool.waitDequeue(buf, timeout) == true)
{
initBufWithMyData(buf);
queue.enqueue(buf);
}
}
This way the enqueue function is quick and easy, but if the pool is empty, then you will block until someone puts a buffer back into the pool. The idea being that some other thread will be blocking on the queue and will return buffers to the pool when they have been processed as follows:
void consumer()
{
BufferDataType *buf;
if (queue.waitDequeue(buf, timeout) == true)
{
processBufferData(buf);
pool.enqueue(buf);
}
}
Anyways take a look at it, maybe it will help.

I suppose the problem in your code is modifying the deque by several threads. Look:
you're waiting for codition from another thread;
and then immediately sending a signal to other threads that deque is unlocked just before you want to modify it;
then you modifying the deque while other threads are thinking deque is allready unlocked and starting doing the same.
So, try to place all the cond.notify_*() after modifying the deque. I.e.:
void BlockingQueue::put(const MyType & myType)
{
boost::unique_lock<boost::mutex> lock(mut);
while (queue.size() >= this->capacity)
{
cond.wait(lock);
}
queue.push_back(myType); // <- modify first
cond.notify_all(); // <- then say to others that deque is free
}
For better understanding I suggest to read about the pthread_cond_wait().

Error with function in multithreaded environment

What my function does is iterate through an array of bools and upon finding an element set to false, it is set to true. The function is a method from my memory manager singleton class which returns a pointer to memory. I'm getting an error where my iterator appears to loop through and ends up starting at the beginning, which I believe to because multiple threads are calling the function.
void* CNetworkMemoryManager::GetMemory()
{
WaitForSingleObject(hMutexCounter, INFINITE);
if(mCounter >= NetConsts::kNumMemorySlots)
{
mCounter = 0;
}
unsigned int tempCounter = mCounter;
unsigned int start = tempCounter;
while(mUsedSlots[tempCounter])
{
tempCounter++;
if(tempCounter >= NetConsts::kNumMemorySlots)
{
tempCounter = 0;
}
//looped all the way around
if(tempCounter == start)
{
assert(false);
return NULL;
}
}
//return pointer to free space and increment
mCounter = tempCounter + 1;
ReleaseMutex(hMutexCounter);
mUsedSlots[tempCounter] = true;
return mPointers[tempCounter];
}
My error is the assert that goes off in the loop. My question is how do I fix the function and is the error caused by multithreading?
Edit: added a mutex to guard the mCounter variable. No change. Error still occurs.

I can't say if the error is caused by multi threading or not but I can say your code is not thread safe.
You free the lock with
ReleaseMutex(hMutexCounter);
and then access tempCounter and mUsedSlots:
mUsedSlots[tempCounter] = true;
return mPointers[tempCounter];
neither of which are const. This is a data race because you have not correctly serialized access to these variables.
Change this to:
mUsedSlots[tempCounter] = true;
const unsigned int retVal = mPointers[tempCounter];
ReleaseMutex(hMutexCounter);
return retVal;
Then at least your code is thread safe, whether this solves your problem I can't say, try it out. On machines with multiple cores very weird things to happen as a result of data races.
As general best practice I would suggest looking at some C++11 synchronization features like std::mutex and std::lock_guard, this would have saved you from your self because std::lock_guard releases that lock automatically so you can't forget and, as in this case, you can't do it too soon inadvertently. This would also make your code more portable. If you don't have C++11 yet use the boost equivalents.

How and what data must be synced in multithreaded c++

I build a little application which has a render thread and some worker threads for tasks which can be made nearby the rendering, e.g. uploading files onto some server. Now in those worker threads I use different objects to store feedback information and share these with the render thread to read them for output purpose. So render = output, worker = input. Those shared objects are int, float, bool, STL string and STL list.
I had this running a few months and all was fine except 2 random crashes during output, but I learned about thread syncing now. I read int, bool, etc do not require syncing and I think it makes sense, but when I look at string and list I fear potential crashes if 2 threads attempt to read/write an object the same time. Basically I expect one thread changes the size of the string while the other might use the outdated size to loop through its characters and then read from unallocated memory. Today evening I want to build a little test scenario with 2 threads writing/reading the same object in a loop, however I was hoping to get some ideas here aswell.
I was reading about the CriticalSection in Win32 and thought it may be worth a try. Yet I am unsure what the best way would be to implement it. If I put it at the start and at the end of a read/function it feels like some time was wasted. And if I wrap EnterCriticalSection and LeaveCriticalSection in Set and Get Functions for each object I want to have synced across the threads, it is alot of adminstration.
I think I must crawl through more references.
Okay I am still not sure how to proceed. I was studying the links provided by StackedCrooked but do still have no image of how to do this.
I put copied/modified together this now and have no idea how to continue or what to do: someone has ideas?
class CSync
{
public:
CSync()
: m_isEnter(false)
{ InitializeCriticalSection(&m_CriticalSection); }
~CSync()
{ DeleteCriticalSection(&m_CriticalSection); }
bool TryEnter()
{
m_isEnter = TryEnterCriticalSection(&m_CriticalSection)==0 ? false:true;
return m_isEnter;
}
void Enter()
{
if(!m_isEnter)
{
EnterCriticalSection(&m_CriticalSection);
m_isEnter=true;
}
}
void Leave()
{
if(m_isEnter)
{
LeaveCriticalSection(&m_CriticalSection);
m_isEnter=false;
}
}
private:
CRITICAL_SECTION m_CriticalSection;
bool m_isEnter;
};
/* not needed
class CLockGuard
{
public:
CLockGuard(CSync& refSync) : m_refSync(refSync) { Lock(); }
~CLockGuard() { Unlock(); }
private:
CSync& m_refSync;
CLockGuard(const CLockGuard &refcSource);
CLockGuard& operator=(const CLockGuard& refcSource);
void Lock() { m_refSync.Enter(); }
void Unlock() { m_refSync.Leave(); }
};*/
template<class T> class Wrap
{
public:
Wrap(T* pp, const CSync& sync)
: p(pp)
, m_refSync(refSync)
{}
Call_proxy<T> operator->() { m_refSync.Enter(); return Call_proxy<T>(p); }
private:
T* p;
CSync& m_refSync;
};
template<class T> class Call_proxy
{
public:
Call_proxy(T* pp, const CSync& sync)
: p(pp)
, m_refSync(refSync)
{}
~Call_proxy() { m_refSync.Leave(); }
T* operator->() { return p; }
private:
T* p;
CSync& m_refSync;
};
int main
{
CSync sync;
Wrap<string> safeVar(new string);
// safeVar what now?
return 0;
};
Okay so I was preparing a little test now to see if my attempts do something good, so first I created a setup to make the application crash, I believed...
But that does not crash!? Does that mean now I need no syncing? What does the program need to effectively crash? And if it does not crash why do I even bother. It seems I am missing some point again. Any ideas?
string gl_str, str_test;
void thread1()
{
while(true)
{
gl_str = "12345";
str_test = gl_str;
}
};
void thread2()
{
while(true)
{
gl_str = "123456789";
str_test = gl_str;
}
};
CreateThread( NULL, 0, (LPTHREAD_START_ROUTINE)thread1, NULL, 0, NULL );
CreateThread( NULL, 0, (LPTHREAD_START_ROUTINE)thread2, NULL, 0, NULL );
Just added more stuff and now it crashes when calling clear(). Good.
void thread1()
{
while(true)
{
gl_str = "12345";
str_test = gl_str;
gl_str.clear();
gl_int = 124;
}
};
void thread2()
{
while(true)
{
gl_str = "123456789";
str_test = gl_str;
gl_str.clear();
if(gl_str.empty())
gl_str = "aaaaaaaaaaaaa";
gl_int = 244;
if(gl_int==124)
gl_str.clear();
}
};

The rules is simple: if the object can be modified in any thread, all accesses to it require synchronization. The type of object doesn't matter: even bool or int require external synchronization of some sort (possibly by means of a special, system dependent function, rather than with a lock). There are no exceptions, at least in C++. (If you're willing to use inline assembler, and understand the implications of fences and memory barriers, you may be able to avoid a lock.)

I read int, bool, etc do not require syncing
This is not true:
A thread may store a copy of the variable in a CPU register and keep using the old value even in the original variable has been modified by another thread.
Simple operations like i++ are not atomic.
The compiler may reorder reads and writes to the variable. This may cause synchronization issues in multithreaded scenarios.
See Lockless Programming Considerations for more details.
You should use mutexes to protect against race conditions. See this article for a quick introduction to the boost threading library.

First, you do need protection even for accessing the most primitive of data types.
If you have an int x somewhere, you can write
x += 42;
... but that will mean, at the lowest level: read the old value of x, calculate a new value, write the new value to the variable x. If two threads do that at about the same time, strange things will happen. You need a lock/critical section.
I'd recommend using the C++11 and related interfaces, or, if that is not available, the corresponding things from the boost::thread library. If that is not an option either, critical sections on Win32 and pthread_mutex_* for Unix.
NO, Don't Start Writing Multithreaded Programs Yet!
Let's talk about invariants first.
In a (hypothetical) well-defined program, every class has an invariant.
The invariant is some logical statement that is always true about an instance's state, i.e. about the values of all its member variables. If the invariant ever becomes false, the object is broken, corrupted, your program may crash, bad things have already happened. All your functions assume that the invariant is true when they are called, and they make sure that it is still true afterwards.
When a member function changes a member variable, the invariant might temporarily become false, but that is OK because the member function will make sure that everything "fits together" again before it exits.
You need a lock that protects the invariant - whenever you do something that might affect the invariant, take the lock and do not release it until you've made sure that the invariant is restored.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Multithreaded C++: force read from memory, bypassing cache - c++

Related

Synchronize Threads - InterlockedExchange

Mutex Safety with Interrupts (Embedded Firmware)

C++ Blocking Queue Segfault w/ Boost

Error with function in multithreaded environment

How and what data must be synced in multithreaded c++

Categories

Resources