Not executing all writes before end of program

For a school project we were tasked with writing a ray tracer. I chose to use C++ since it's the language I'm most comfortable with, but I've been getting some weird artifacts.
Please keep in mind that we are still in the first few lessons of the class, so right now we are limited to checking whether or not a ray hits a certain object.
When my raytracer finishes quickly (less than 1 second spent on actual ray tracing) I've noticed that not all hits get registered in my "framebuffer".
To illustrate, here are two examples:
1. [image]
2. [image]
In the first image, you can clearly see that there are horizontal artifacts.
The second image contains a vertical artifact.
I was wondering if anyone could help me to figure out why this is happening?
I should mention that my application is multi-threaded, the multithreaded portion of the code looks like this:
Stats RayTracer::runParallel(const std::vector<Math::ivec2>& pixelList, const Math::vec3& eyePos, const Math::vec3& screenCenter, long numThreads) noexcept
{
//...
for (int i = 0; i < threads.size(); i++)
{
threads[i] = std::thread(&RayTracer::run, this, splitPixels[i], eyePos, screenCenter);
}
for (std::thread& thread: threads)
{
thread.join();
}
//...
}
The RayTracer::run method accesses the framebuffer as follows:
Stats RayTracer::run(const std::vector<Math::ivec2>& pixelList, const Math::vec3& eyePos, const Math::vec3& screenCenter) noexcept
{
this->frameBuffer.clear(RayTracer::CLEAR_COLOUR);
// ...
for (const Math::ivec2& pixel : pixelList)
{
// ...
for (const std::shared_ptr<Objects::Object>& object : this->objects)
{
std::optional<Objects::Hit> hit = object->hit(ray, pixelPos);
if (hit)
{
// ...
if (dist < minDist)
{
std::lock_guard lock (this->frameBufferMutex);
// ...
this->frameBuffer(pixel.y, pixel.x) = hit->getColor();
}
}
}
}
// ...
}
This is the operator() for the framebuffer class
class FrameBuffer
{
private:
PixelBuffer buffer;
public:
// ...
Color& FrameBuffer::operator()(int row, int col) noexcept
{
return this->buffer(row, col);
}
// ...
};
This in turn makes use of the PixelBuffer's operator():
class PixelBuffer
{
private:
int mRows;
int mCols;
Color* mBuffer;
public:
// ...
Color& PixelBuffer::operator()(int row, int col) noexcept
{
return this->mBuffer[this->flattenIndex(row, col)];
}
// ...
};
I didn't bother to use any synchronization primitives because each thread gets assigned a certain subset of pixels from the complete image. The thread casts a ray for each of its assigned pixels and writes the resultant color back to the color buffer in that pixel's slot. This means that, while all my threads are concurrently accessing (and writing to) the same object, they don't write to the same memory locations.
After some initial testing, using a std::lock_guard to protect the shared framebuffer seems to help, but it's not a perfect solution: artifacts still occur (although much less often).
It should be noted that the way I divide pixels between threads determines the direction of the artifacts. If I give each thread a set of rows, the artifacts are horizontal lines; if I give each thread a set of columns, they are vertical lines.
Another interesting observation is that when I trace more complex objects (these take anywhere between 30 seconds and 2 minutes), these artifacts are extremely rare (I've seen it once in my 100s-1000s of traces so far).
I can't help but feel like this is a problem related to multithreading, but I don't really understand why std::lock_guard wouldn't completely solve the problem.
Edit: After suggestions by Jeremy Friesner I ran the raytracer about 10 times on a single thread, without any issues, so the problem does indeed appear to be a race condition.

I solved the problem thanks to Jeremy Friesner.
As you can see in the code, every thread calls framebuffer.clear() separately (without locking the mutex!). This means that thread A, which was started first, might have already hit 5-10 pixels when thread B clears the framebuffer. This would erase thread A's already hit pixels.
By moving the framebuffer.clear() call to the beginning of the runParallel() method I was able to solve the issue.
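A minimal sketch of that fix, assuming a simplified frame buffer of one int per pixel (renderParallel, clearColour and hitColour are hypothetical names, not taken from the ray tracer above). The buffer is cleared exactly once before any worker starts, and each thread then writes only to its own pixels, so no mutex is needed:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the frame buffer: one int per pixel.
// Returns the buffer after every thread has filled its own pixels.
std::vector<int> renderParallel(std::size_t pixelCount, std::size_t numThreads)
{
    assert(numThreads > 0);
    const int clearColour = 0;
    const int hitColour = 1;

    // Clear exactly once, before any worker thread exists.
    std::vector<int> frameBuffer(pixelCount, clearColour);

    std::vector<std::thread> threads;
    threads.reserve(numThreads);
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        threads.emplace_back([&frameBuffer, t, numThreads, hitColour] {
            // Thread t owns pixels t, t + numThreads, t + 2 * numThreads, ...
            // so no two threads ever write the same element.
            for (std::size_t i = t; i < frameBuffer.size(); i += numThreads)
            {
                frameBuffer[i] = hitColour;
            }
        });
    }
    for (std::thread& thread : threads)
    {
        thread.join(); // after join(), the worker's writes are visible here
    }
    return frameBuffer;
}
```

If the clear were moved back inside the worker lambda, a late-starting thread could wipe pixels that an earlier thread had already written, which is the artifact described in the question.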

Related

Trying to control multithreaded access to array using std::atomic

I'm trying to control multithreaded access to a vector of data which is fixed in size, so threads will wait until their current position in it has been filled before trying to use it, or will fill it themselves if no-one else has yet. (But ensure no-one is waiting around if their position is already filled, or no-one has done it yet)
However, I am struggling to understand a good way to do this, especially involving std::atomic. I'm just not very familiar with C++ multithreading concepts aside from basic std::thread usage.
Here is a very rough example of the problem:
class myClass
{
struct Data
{
int res1;
};
std::vector<Data*> myData;
int foo(unsigned long position)
{
if (!myData[position])
{
bar(myData[position]);
}
// Do something with the data
return 5 * myData[position]->res1;
}
void bar(Data* &data)
{
data = new Data;
// Do a whole bunch of calculations and so-on here
data->res1 = 42;
}
};
Now imagine if foo() is being called multi-threaded, and multiple threads may (or may not) have the same position at once. If that happens, there's a chance that a thread may (between when the Data was created and when bar() is finished) try to actually use the data.
So, what are the options?
1: Make a std::mutex for every position in myData. What if there are 10,000 elements in myData? That's 10,000 std::mutexes, not great.
2: Put a lock_guard around it like this:
std::mutex myMutex;
{
const std::lock_guard<std::mutex> lock(myMutex);
if (!myData[position])
{
bar(myData[position]);
}
}
While this works, it also means if different threads are working in different positions, they wait needlessly, wasting all of the threading advantage.
3: Use a vector of chars and a spinlock as a poor man's mutex? Here's what that might look like:
static std::vector<char> positionInProgress;
static std::vector<char> positionComplete;
class myClass
{
struct Data
{
int res1;
};
std::vector<Data*> myData;
int foo(unsigned long position)
{
if (positionInProgress[position])
{
while (positionInProgress[position])
{
; // do nothing, just wait until it is done
}
}
else
{
if (!positionComplete[position])
{
// Fill the data and prevent anyone from using it until it is complete
positionInProgress[position] = true;
bar(myData[position]);
positionInProgress[position] = false;
positionComplete[position] = true;
}
}
// Do something with the data
return 5 * myData[position]->res1;
}
void bar(Data*& data) // take the pointer by reference so the caller's slot is filled
{
data = new Data;
// Do a whole bunch of calculations and so-on here
data->res1 = 42;
}
};
This seems to work, but none of the test or set operations are atomic, so I have a feeling I'm just getting lucky.
4: What about std::atomic and std::atomic_flag? Well, there are a few problems.
std::atomic_flag doesn't have a way to test without setting in C++11...which makes this kind of difficult.
std::atomic is not movable or copy-constructible, so I cannot make a vector of them (I do not know the number of positions during construction of myClass)
Conclusion:
This is the simplest example that (likely) compiles I can think of that demonstrates my real problem. In reality, myData is a 2-dimensional vector implemented using a special hand-rolled solution, Data itself is a vector of pointers to more complex data types, the data isn't simply returned, etc. This is the best I could come up with.

The biggest problem you're likely to have is that a vector itself is not thread-safe, so you can't do ANY operation that might change the vector (invalidate references to elements of the vector) while another thread might be accessing it, such as resize or push_back. However, if your vector is effectively "fixed" (you set the size prior to ever spawning threads and thereafter only ever access elements using at or operator[] and never ever modify the vector itself), you can get away with using a vector of atomic objects. In this case you could have:
std::vector<std::atomic<Data*>> myData;
and your code to set up and use an element could look like:
if (!myData[position].load()) {
Data *tmp = new Data;
Data *expected = nullptr; // compare_exchange_strong takes its expected value by reference
if (!myData[position].compare_exchange_strong(expected, tmp)) {
// some other thread did the setup
delete tmp; } }
myData[position].load()->bar();
Of course you still need to make sure that the operations done on members of Data in bar are themselves thread-safe, as you can get multiple threads calling bar on the same Data instance here.
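One way to sidestep concurrent calls on a half-built object is to finish all the work on a private Data before publishing the pointer. The following is only a sketch under that assumption: LazyTable and its members are hypothetical names, and the vector is sized once in the constructor and never resized afterwards. A thread that loses the race simply discards its own copy, so the expensive calculation may run more than once per position.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Data
{
    int res1 = 0;
};

// Hypothetical wrapper: a fixed-size table of lazily created Data objects.
class LazyTable
{
public:
    explicit LazyTable(std::size_t size) : slots(size)
    {
        for (auto& slot : slots) slot.store(nullptr);
    }
    ~LazyTable()
    {
        for (auto& slot : slots) delete slot.load();
    }
    LazyTable(const LazyTable&) = delete;
    LazyTable& operator=(const LazyTable&) = delete;

    int foo(std::size_t position)
    {
        assert(position < slots.size());
        Data* current = slots[position].load();
        if (!current)
        {
            // Build the object completely before anyone else can see it.
            auto fresh = std::make_unique<Data>();
            fresh->res1 = 42;

            Data* expected = nullptr;
            if (slots[position].compare_exchange_strong(expected, fresh.get()))
            {
                current = fresh.release(); // we published it; the table owns it now
            }
            else
            {
                current = expected; // another thread won; fresh is freed automatically
            }
        }
        return 5 * current->res1;
    }

private:
    std::vector<std::atomic<Data*>> slots;
};
```

Because the pointer only becomes visible after res1 is set, any thread that reads a non-null slot sees a fully initialized object.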

Waiting Thread or Create New Thread?

I have a decision to make regarding the way I code something, which is running on an embedded platform and am hoping there is a general "rule-of-thumb" that can be used in this case. Coding both my ideas and then benchmarking would obviously be the best way to go, but to get any meaningful, or rather accurate results out of this platform, in my particular case, would be quite tricky. I'm also sure that there may be others that are having the same question on their respective platforms, so I decided to ask it here. Please be kind, as I'm not very familiar with the threading library, so constructive feedback would be useful.
I have many threads (well, about 10-20 at maximum) all wanting to write to this hardware device. So I decided on using a simple ring-buffer consisting of 2 buffers (primary/secondary) of 8k each. This way each in-coming thread could be dealt with in a timely fashion. An arriving thread would obtain a mutex and write into the primary buffer and then release its mutex ready for the next thread. Now when the primary buffer is full, new incoming threads obviously switch to using the secondary buffer and then you start to write the primary buffer to the hardware device.
So the question really is... How best to write to the hardware device??? I'm thinking that there are two choices:
As soon as the buffer is full, create a new thread that does the write operation.
Signal a pre-created waiting worker-thread to do the write operation.
Both of the options seem to come with their respective pros/cons. Option 1 is the simplest to code and there are a number of ways to do this, but its effectiveness is dependent on how expensive it is to create/start the thread. The thread would be created, it would perform the write operation and then it would die. Option 2 however seems to be the most performant, but if you're going to have a reusable thread, you're going to need a mutex and a couple condition variables to control it. One to notify the thread that data is ready and another to ask for the thread to terminate when the program ends. Add to that a sprinkle of atomics for spurious wake-ups/missing notifications etc, and you've got quite an intricate solution to get right.
So what is the best method here? Are threads in general heavy to create/start or is this something that is completely platform dependent and benchmarking is the only way to know? Is there any benefit to using one method over the other that I've not thought about?
-- This is for the people not suffering from TL;DR syndrome --
I'm sure some of you have already wondered what happens if the secondary buffer becomes full before the write operation has finished? The answer in my case is fairly simple: this should never happen! Although the write operation is slow, it would never be slow enough such that the secondary buffer is filled before the write is complete. However, if someone is going to use this ring-buffer method, they must be prepared for this contingency. The way I thought about tackling this is to have a second mutex that is held during the write operation. This would mean that the thread that was due to write to the buffer would block until the write completed and the mutex was released.
Here's what I roughly ended up with after going with Option 2, but it seems awfully messy. I actually wanted to use promise/futures to avoid the spin-lock predicates on the condition variable, but couldn't think of a good way of moving a promise to an already created thread. Anyway... nice feedback is appreciated, bad-feedback, well, I'm not overly familiar with the threading library.
class Bar
{
public:
Bar(const size_t size) : buffer(new uint8_t[size]), buffer_size(size), used_size(0) {}
const size_t GetRemainingBufferSize(void) const { return buffer_size - used_size; }
const size_t GetUsedBufferSize(void) const { return used_size; }
const uint8_t* GetBuffer(void) const { return buffer.get(); }
const size_t GetBufferSize() const { return buffer_size; }
void ResetBuffer(void) { used_size = 0; }
void WriteIntoBuffer(const vector<uint8_t>& data)
{
std::copy(data.begin(), data.end(), buffer.get() + used_size);
used_size += data.size();
}
private:
std::unique_ptr<uint8_t[]> buffer;
size_t buffer_size;
size_t used_size;
};
class Foo
{
public:
Foo(const size_t buffer_size = 8192) : bar_buffers{ buffer_size, buffer_size }, primary_buffer(&bar_buffers[0]), secondary_buffer(&bar_buffers[1]),
write_predicate(false), quit_predicate(false), write_buffer(primary_buffer)
{
foo_thread = std::thread(&Foo::WriteHWThread, this);
}
~Foo()
{
quit_predicate = true;
begin_write.notify_one();
if (foo_thread.joinable())
foo_thread.join();
}
Foo(const Foo&) = delete;
Foo& operator=(const Foo&) = delete;
void WriteData(const std::vector<uint8_t>& data)
{
if (std::lock_guard<std::mutex> foo_lk(foo_lock); primary_buffer->GetRemainingBufferSize() < data.size())
{
std::unique_lock<std::mutex> write_lk(write_lock);
write_buffer = primary_buffer;
write_lk.unlock();
std::swap(primary_buffer, secondary_buffer);
primary_buffer->ResetBuffer();
write_predicate = true;
begin_write.notify_one();
}
primary_buffer->WriteIntoBuffer(data);
}
void WriteHWThread(void)
{
do
{
std::unique_lock<std::mutex> write_lk(write_lock);
begin_write.wait(write_lk, [&]() -> bool { return write_predicate.load() || quit_predicate.load(); });
write_predicate = false;
if (write_buffer.load()->GetUsedBufferSize())
<<< WRITE TO DEDICATED HARDWARE >>>
write_lk.unlock();
} while (!quit_predicate);
}
private:
Bar bar_buffers[2];
Bar* primary_buffer, *secondary_buffer;
std::atomic<bool> write_predicate, quit_predicate;
std::atomic<Bar*> write_buffer;
std::mutex foo_lock, write_lock;
std::thread foo_thread;
std::condition_variable begin_write;
};
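For comparison, here is a rough sketch of Option 2 in which every flag is an ordinary bool guarded by a single mutex, so no atomics are needed and the predicate form of wait() covers spurious wake-ups. All names (BufferedWriter, waitIdle, bytesFlushed) are hypothetical, the hardware write is simulated by counting bytes, and data still sitting in the filling buffer at destruction is not flushed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: callers fill one buffer while a pre-created worker
// drains the other. The "hardware write" is simulated by counting bytes.
class BufferedWriter
{
public:
    explicit BufferedWriter(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
        worker_ = std::thread(&BufferedWriter::run, this);
    }

    ~BufferedWriter()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_all();
        worker_.join();
    }

    BufferedWriter(const BufferedWriter&) = delete;
    BufferedWriter& operator=(const BufferedWriter&) = delete;

    void write(const std::vector<std::uint8_t>& data)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        if (filling_.size() + data.size() > capacity_)
        {
            // Block while the previous hand-off is still being written.
            idle_.wait(lock, [this] { return !pending_; });
            std::swap(filling_, flushing_);
            filling_.clear();
            pending_ = true;
            wake_.notify_one();
        }
        filling_.insert(filling_.end(), data.begin(), data.end());
    }

    // Blocks until the worker has finished any hand-off in progress.
    void waitIdle()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return !pending_; });
    }

    std::size_t bytesFlushed() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return flushed_;
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;)
        {
            wake_.wait(lock, [this] { return pending_ || quit_; });
            if (!pending_) return; // quit requested and nothing handed off
            std::vector<std::uint8_t> local;
            local.swap(flushing_);
            lock.unlock();
            const std::size_t written = local.size(); // the slow device write would go here
            lock.lock();
            flushed_ += written;
            pending_ = false;
            idle_.notify_all();
        }
    }

    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::vector<std::uint8_t> filling_;
    std::vector<std::uint8_t> flushing_;
    std::size_t flushed_ = 0;
    bool pending_ = false;
    bool quit_ = false;
    std::thread worker_; // started in the constructor body, after the members above exist
};
```

The pending_ flag doubles as the "second mutex" idea from the question: a caller that fills the second buffer before the first has been written waits on idle_ instead of overwriting it.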

Multiple threads access shared resources

I'm currently working on a particle system, which uses one thread in which the particles are first updated, then drawn. The particles are stored in a std::vector. I would like to move the update function to a separate thread to improve the system's performance. However, this means that I encounter problems when the update thread and the draw thread are accessing the std::vector at the same time. My update function will change the values for the position and colour of all particles, and also almost always resize the std::vector.
Single thread approach:
std::vector<Particle> particles;
void tick() //tick would be called from main update loop
{
//slow as must wait for update to draw
updateParticles();
drawParticles();
}
Multithreaded:
std::vector<Particle> particles;
//quicker as no longer need to wait to draw and update
//crashes when both threads access the same data, or update resizes vector
void updateThread()
{
updateParticles();
}
void drawThread()
{
drawParticles();
}
To fix this problem I have investigated using std::mutex; however, in practice, with a large number of particles, the constant locking of threads meant that performance didn't increase. I have also investigated std::atomic; however, neither the particles nor std::vector are trivially copyable, so I can't use this either.
Multithreaded using mutex:
NOTE: I am using SDL mutex, as far as I am aware, the principles are the same.
SDL_mutex* lock = SDL_CreateMutex();
SDL_cond* canDraw = SDL_CreateCond();
SDL_cond* canUpdate = SDL_CreateCond();
std::vector<Particle> particles;
//locking the threads leads to the same problems as before,
//now each thread must wait for the other one
void updateThread()
{
SDL_LockMutex(lock);
while(!canUpdate)
{
SDL_CondWait(canUpdate, lock);
}
updateParticles();
SDL_UnlockMutex(lock);
SDL_CondSignal(canDraw);
}
void drawThread()
{
SDL_LockMutex(lock);
while(!canDraw)
{
SDL_CondWait(canDraw, lock);
}
drawParticles();
SDL_UnlockMutex(lock);
SDL_CondSignal(canUpdate);
}
I am wondering if there are any other ways to implement the multi threaded approach? Essentially preventing the same data from being accessed by both threads at the same time, without having to make each thread wait for the other. I have thought about making a local copy of the vector to draw from, but this seems like it would be inefficient, and may run into the same problems if the update thread changes the vector while it's being copied?

I would use a more granular locking strategy. Instead of storing a particle object in your vector, I would store a pointer to a different object.
struct lockedParticle {
Particle* containedParticle;
SDL_mutex* lockingObject;
};
In updateParticles() I would attempt to obtain the individual locking objects using SDL_TryLockMutex() - if I fail to obtain control of the mutex I would add the pointer to this particular lockedParticle instance to another vector, and retry later to update them.
I would follow a similar strategy inside the drawParticles(). This relies on the fact that draw order does not matter for particles, which is often the case.

If data consistency is not a concern you can avoid blocking the whole vector by encapsulating vector in a custom class and setting mutex on single read/write operations only, something like:
struct SharedVector
{
// ...
std::vector<Particle> vec;
void push( const Particle& particle )
{
SDL_LockMutex(lock);
vec.push_back(particle);
SDL_UnlockMutex(lock);
}
};
//...
SharedVector particles;
Then of course, you need to amend updateParticles() and drawParticles() to use new type instead of std::vector.
EDIT:
You can avoid creating new structure by using mutexes in updateParticles() and drawParticles() methods, e.g.
void updateParticles()
{
//... get Particle particle object
SDL_LockMutex(lock);
particles.push_back(particle);
SDL_UnlockMutex(lock);
}
The same should be done for drawParticles() as well.

If the vector is changing all the time, you can use two vectors. drawParticles would have its own copy, and updateParticles would write to another one. Once both functions are done, swap, copy, or move the vector used by updateParticles to the one used by drawParticles. (updateParticles can read from the same vector used by drawParticles to get at the current particle positions, so you shouldn't need to create a complete new copy.) No locking is necessary while the two functions run; the only synchronization point is waiting for both to finish before the swap.
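A minimal sketch of the two-vector idea, assuming a toy Particle with only a position and velocity (runFrames and the simplified updateParticles/drawParticles signatures are hypothetical). It starts a new thread every frame purely to keep the example short; a long-lived update thread would be the more realistic choice.

```cpp
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

struct Particle
{
    float x;
    float vx;
};

// Reads the current state and writes the next state into a separate vector.
void updateParticles(const std::vector<Particle>& current, std::vector<Particle>& next)
{
    next.clear();
    for (const Particle& p : current)
    {
        next.push_back({p.x + p.vx, p.vx});
    }
}

// Stand-in for drawing: only reads the current state.
float drawParticles(const std::vector<Particle>& current)
{
    float sum = 0.0f;
    for (const Particle& p : current) sum += p.x;
    return sum;
}

std::vector<Particle> runFrames(std::vector<Particle> front, int frames)
{
    assert(frames >= 0);
    std::vector<Particle> back;
    for (int f = 0; f < frames; ++f)
    {
        // Update writes only to back; draw reads only from front.
        std::thread updater([&front, &back] { updateParticles(front, back); });
        drawParticles(front);
        updater.join();         // both sides are finished with this frame
        std::swap(front, back); // cheap: swaps the vectors' internal pointers
    }
    return front;
}
```

Both threads read front at the same time, which is safe because neither modifies it; the only vector that changes size is back, and only the update thread touches it.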

Multithreading and heap corruption

So I just started trying out some multithreaded programming for the first time, and I've run into this heap corruption problem. Basically the program will run for some random length of time (as short as 2 seconds, as long as 200) before crashing and spitting out a heap corruption error. Everything I've read on the subject says it's a very hard thing to diagnose, since what triggers the error often has little to do with what actually causes it. As such, I remain stumped.
I haven't been formally taught multithreading however, so I was mostly programming off of what I understood of the concept, and my code may be completely wrong. So here's a basic rundown of what I'm trying to do and how the program currently tries to handle it:
I'm writing code for a simple game that involves drawing several parallaxing layers of background. These levels are very large (eg 20000x5000 pixels), so obviously trying to load 3 layers of those sized images is not feasible (if not impossible). So currently the images are split up into 500x500 images and I have the code only have the images it immediately needs to display held in memory. Any images it has loaded that it no longer needs are removed from memory. However, in a single thread, this causes the program to hang significantly while waiting for the image to load before continuing.
This is where multithreading seemed logical to me. I wanted the program to do the loading it needed to do, without affecting the smoothness of the game, as long as the image was loaded by the time it was actually needed. So here is how I have it organized:
1.) All the data for where the images should go and any data associated with them is all stored in one multidimensional array, but initially no image data is loaded. Each frame, the code checks each position on the array, and tests if the spot where the image should go is within some radius of the player.
2.) If it is, it flags this spot as needing to be loaded. A pointer to where the image should be loaded into is push_back()'d onto a vector.
3.) The second thread is started once the level begins. This thread is initially passed a pointer to the aforementioned vector.
4.) This thread is put into an infinite While loop (which by itself sounds wrong) that only terminates when the thread is terminated. This loop continuously checks if there are any elements in the vector. If there are, it grabs the 0th element, loads the image data into that pointer, then .erase()'s the element from the vector.
That's pretty much a rundown of how it works. My uneducated assumption is that the 2 threads collide at some point trying to write and delete in the same space at once or something. Given that I'm new to this I'm certain this method is terrible to some embarrassing degree, so I'm eager to hear what I should improve upon.
EDIT: Adding source code upon request:
class ImageLoadQueue
{
private:
ImageHandle* image;
std::string path;
int frameWidth, frameHeight, numOfFrames;
public:
ImageLoadQueue();
ImageLoadQueue(ImageHandle* a, std::string b, int c, int d, int e=1) { setData(a,b,c,d,e); }
void setData(ImageHandle* a, std::string b, int c, int d, int e=1)
{
image = a;
path = b;
frameWidth = c;
frameHeight = d;
numOfFrames = e;
}
void loadThisImage() { image->loadImage(path, frameWidth, frameHeight, numOfFrames, numOfFrames); }
};
class ImageLoadThread : public sf::Thread
{
private:
std::vector<ImageLoadQueue*>* images;
public:
ImageLoadThread() { };
ImageLoadThread(std::vector<ImageLoadQueue*>* a) { linkVector(a); }
void linkVector(std::vector<ImageLoadQueue*>* a) { images = a; }
virtual void Run()
{
while (1==1)
{
if (!images->empty())
{
(*images)[0]->loadThisImage();
images->erase(images->begin());
}
}
}
};
class LevelArt
{
private:
int levelWidth, levelHeight, startX, startY, numOfLayers;
float widthScale, heightScale, widthOfSegs, heightOfSegs;
float* parallaxFactor;
ImageHandle** levelImages;
int** frame;
int** numOfFrames;
bool* tileLayer;
bool** isLoaded;
Animation** animData;
std::string** imagePath;
std::vector<ImageLoadQueue*> imageQueue;
ImageLoadThread imageThread;
public:
LevelArt(void);
LevelArt(std::string);
~LevelArt(void);
void loadData(std::string);
void drawLevel(sf::RenderWindow*, float, float);
void scaleLevel(float, float);
void forceDraw(sf::RenderWindow*);
void wipeLevel();
void initialLoad();
int getLevelWidth() { return levelWidth; }
int getLevelHeight() { return levelHeight; }
int getTotalWidth() { return widthOfSegs*levelWidth; }
int getTotalHeight() { return heightOfSegs*levelHeight; }
int getStartX() { return startX; }
int getStartY() { return startY; }
};
That's most of the relevant threading code, in this header. Within the levelArt.cpp file there are 3 nested for loops that iterate through all the levelArt data stored, testing whether each entry is close enough to the player to be displayed, in which case the code calls:
imageQueue.push_back(new ImageLoadQueue(&levelImages[i][(j*levelWidth)+k], imagePath[i][(j*levelWidth)+k], widthOfSegs, heightOfSegs, numOfFrames[i][(j*levelWidth)+k]));
i,j,k being the for loop iterators.

This seems like a reasonable use of multithreading. The key idea (in other words, the main place you'll have problems if you do it wrong) is that you have to be careful about data that is used by more than one thread.
You have two places where you have such data:
The vector (which, by the way, should probably be a queue)
The array where you return the data
One way to arrange things - by no means the only one - would be to wrap each of these into its own class (e.g., a class that has a member variable of the vector). Don't allow any direct access to the vector, only through methods on the class. Then synchronize the methods, for example using a mutex or whatever the appropriate synchronization object is. Note that you're synchronizing access to the object, not just the individual methods. So it's not enough to put a mutex in the "read from queue" method; you need a common mutex in the "read from queue" and "write to queue" methods so that no one is doing one while the other occurs. (Also note I'm using the term mutex; that may be a very wrong thing to use depending on your platform and the exact situation. I would likely use a semaphore and a critical section on Windows.)
Synchronization will make the program thread-safe. That's different than making the program efficient. To do that, you probably want a semaphore that represents the number of items in the queue, and have your "load data thread" wait on that semaphore, rather than doing a while loop.
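As a rough illustration of wrapping the queue in its own class, here is a sketch that uses a std::mutex plus a std::condition_variable in place of the semaphore mentioned above (a swapped-in technique, and standard C++ rather than the SFML classes used in the question). BlockingQueue and close() are hypothetical names. The loader thread sleeps inside pop() instead of spinning in a while loop.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

template <typename T>
class BlockingQueue
{
public:
    // Called by the producer (e.g. the game loop).
    void push(T item)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_ && "push after close");
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Called by the consumer (e.g. the loader thread). Blocks until an item
    // is available; returns std::nullopt once the queue is closed and empty.
    std::optional<T> pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

    // Tells waiting consumers that no more items will arrive.
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_; // one mutex shared by push, pop and close
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};
```

A loader thread would then loop with something like `while (auto job = queue.pop()) { /* load the image */ }` and exit cleanly when the level calls close().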

Debugging instance of another thread altering my data

I have a huge global array of structures. Some regions of the array are tied to individual threads and those threads can modify their regions of the array without having to use critical sections. But there is one special region of the array which all threads may have access to. The code that accesses these parts of the array needs to carefully use critical sections (each array element has its own critical section) to prevent any possibility of two threads writing to the structure simultaneously.
Now I have a mysterious bug I am trying to chase, it is occurring unpredictably and very infrequently. It seems that one of the structures is being filled with some incorrect number. One obvious explanation is that another thread has accidentally been allowed to set this number when it should be excluded from doing so.
Unfortunately it seems close to impossible to track this bug. The array element in which the bad data appears is different each time. What I would love to be able to do is set some kind of trap for the bug as follows: I would enter a critical section for array element N, then I know that no other thread should be able to touch the data, then (until I exit the critical section) set some kind of flag to a debugging tool saying "if any other thread attempts to change the data here please break and show me the offending patch of source code"... but I suspect no such tool exists... or does it? Or is there some completely different debugging methodology that I should be employing.

How about wrapping your data with a transparent mutexed class? Then you could apply additional lock state checking.
class critical_section;
template < class T >
class element_wrapper
{
public:
element_wrapper(const T& v) : val(v) {}
element_wrapper() {}
const element_wrapper& operator = (const T& v) {
#ifdef _DEBUG_CONCURRENCY
if(!cs->is_locked())
_CrtDebugBreak();
#endif
val = v;
return *this;
}
operator T() { return val; }
critical_section* cs;
private:
T val;
};
As for critical section implementation:
class critical_section
{
public:
critical_section() : locked(FALSE) {
::InitializeCriticalSection(&cs);
}
~critical_section() {
_ASSERT(!locked);
::DeleteCriticalSection(&cs);
}
void lock() {
::EnterCriticalSection(&cs);
locked = TRUE;
}
void unlock() {
locked = FALSE;
::LeaveCriticalSection(&cs);
}
BOOL is_locked() {
return locked;
}
private:
CRITICAL_SECTION cs;
BOOL locked;
};
Actually, instead of custom critical_section::locked flag, one could use ::TryEnterCriticalSection (followed by ::LeaveCriticalSection if it succeeds) to determine if a critical section is owned. Though, the implementation above is almost as good.
So the appropriate usage would be:
typedef std::vector< element_wrapper<int> > cont_t;
void change(cont_t::reference x) { x.cs->lock(); x = 1; x.cs->unlock(); } // assumes x.cs already points at a critical_section
int main()
{
cont_t container(10, 0);
std::for_each(container.begin(), container.end(), &change);
}
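The same idea can be sketched portably with the standard library, under the assumption that recording the owning thread's id is acceptable debugging overhead. OwnedMutex and Guarded are hypothetical names; set() reports an unguarded write by returning false, and in a real debugging session that line is where a breakpoint or abort would go.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical debug helper: a mutex that remembers which thread owns it.
class OwnedMutex
{
public:
    void lock()
    {
        mutex_.lock();
        owner_.store(std::this_thread::get_id());
    }
    void unlock()
    {
        assert(heldByCaller()); // only the owning thread may unlock
        owner_.store(std::thread::id());
        mutex_.unlock();
    }
    bool heldByCaller() const
    {
        return owner_.load() == std::this_thread::get_id();
    }

private:
    std::mutex mutex_;
    std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Hypothetical wrapper: refuses writes from threads that do not hold the lock.
template <typename T>
class Guarded
{
public:
    explicit Guarded(T initial) : value_(std::move(initial)) {}

    OwnedMutex& mutex() { return mutex_; }

    bool set(const T& v)
    {
        if (!mutex_.heldByCaller())
        {
            return false; // offending write: break or log here
        }
        value_ = v;
        return true;
    }

    T get() const { return value_; }

private:
    OwnedMutex mutex_;
    T value_;
};
```

Unlike a plain "is locked" flag, the owner check also catches a thread that writes while a different thread holds the lock.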

I know two ways to handle such errors:
1) Read the code again and again, looking for possible errors. I can think of two errors that can cause this: unsynchronized access, or writing to an incorrect memory address. Maybe you have more ideas.
2) Logging, logging and logging. Add a lot of optional traces (OutputDebugString or a log file) in every critical place, containing enough information: indexes, variable values, etc. It is a good idea to wrap this tracing in some #ifdef. Reproduce the bug and try to understand from the log what happens.

Your best (fastest) bet is still to review the mutex code. As you said, it is the obvious explanation, so why not try to really find the explanation (by logic) instead of additional hints (by coding) that may come out inconclusive? If the code review doesn't turn up something useful, you may still take the mutex code and use it for a test run. The first try should not be to reproduce the bug in your system but to ensure correct implementation of the mutex: implement threads (start from 2 upwards) that all try to access the same data structure again and again, with a random small delay in each of them to have them jitter around on the timeline. If this test reveals a buggy mutex whose flaw you simply can't identify in the code, then you have fallen victim to some architecture-dependent effect (maybe instruction reordering, multi-core cache incoherency, etc.) and need to find another mutex implementation. If, on the other hand, you find an obvious bug in the mutex, try to exploit it in your real system (instrument your code so that the error appears much more often) so that you can ensure that it really is the cause of your original problem.

I was thinking about this while pedaling to work. One possible way of handling this is to make portions of the memory in question be read-only when it is not actively being accessed and protected via critical section ownership. This is assuming that the problem is caused by a thread writing to the memory when it does not own the appropriate critical section.
There are quite a few limitations to this that may prevent it from working. Most important, I think you can only set protection on a page-by-page basis (4K, I believe). So that would likely require some very specific changes to your allocation scheme so that you could narrow down the appropriate section to protect. The second problem is that it would not catch the rogue thread writing to the memory if another thread actively owned the critical section. But it would catch it and cause an immediate access violation if the critical section was not owned.
The idea would be to change your EnterCriticalSection calls to:
EnterCriticalSection()
VirtualProtect( … PAGE_READWRITE … );
And change the LeaveCriticalSection calls to:
VirtualProtect( … PAGE_READONLY … );
LeaveCriticalSection()
The following chunk of code shows a call to VirtualProtect
int main( int argc, char* argv[] )
{
unsigned char *mem;
int i;
DWORD dwOld;
// this assumes 4K page size
mem = malloc( 4096 * 10 );
for ( i = 0; i < 10; i++ )
mem[i * 4096] = i;
// set the second page to be readonly. The allocation from malloc is
// not necessarily on a page boundary, but this will definitely be in
// the second page.
printf( "VirtualProtect res = %d\n",
VirtualProtect( mem + 4096,
1, // ends up setting entire page
PAGE_READONLY, &dwOld ));
// can still read it
for ( i = 1; i < 10; i++ )
printf( "%d ", mem[i*4096] );
printf( "\n" );
// Can write to all but the second page
for ( i = 0; i < 10; i++ )
if ( i != 1 ) // avoid second page which we made readonly
mem[i * 4096] = 1;
// this causes an access violation
mem[4096] = 1;
}
