What is the most efficient way to make this code thread safe? - c++

Some C++ library I'm working on features a simple tracing mechanism which can be activated to generate log files showing which functions were called and what arguments were passed. It basically boils down to a TRACE macro being sprinkled all over the source of the library, and the macro expands to something like this:
typedef void(*TraceProc)( const char *msg );
/* Sets 'callback' to point to the trace procedure which actually prints the given
* message to some output channel, or to a null trace procedure which is a no-op, in
* case the given source file/line position was disabled by the client.
*
* This function also registers the callback pointer in an internal data structure
* and resets it to zero in case the filtering configuration changed since the last
* invocation of updateTraceCallback.
*/
void updateTraceCallback( TraceProc *callback, const char *file, unsigned int lineno );
#define TRACE(msg) \
{ \
static TraceProc traceCallback = 0; \
if ( !traceCallback ) \
updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
traceCallback( msg ); \
}
The idea is that people can just say TRACE("foo hit") in their code and that will either
call a debug printing function or it will be a no-op. They can use some other API (not shown here) to configure which TRACE locations (source file/line number) should actually be printed. This configuration can change at runtime.
The issue is that this mechanism is now to be used in a multi-threaded code base, so the code which TRACE expands to needs to work correctly when multiple threads of execution run it simultaneously. There are about 20,000 different trace points in the code base right now, and they are hit very often, so they should be rather efficient.
What is the most efficient way to make this approach thread safe? I need a solution for Windows (XP and newer) and Linux. I'm afraid of doing excessive locking just to check whether the filter configuration changed (99% of the time a trace point is hit, the configuration didn't change). I'm open to larger changes to the macro, too. So instead of discussing mutex vs. critical section performance, it would also be acceptable if the macro just sent an event to an event loop in a different thread (assuming that accessing the event loop is thread safe) and all the processing happens in the same thread, so it's synchronized using the event loop.
UPDATE: I can probably simplify this question to:
If I have one thread reading a pointer, and another thread which might write to the variable (but 99% of the time it doesn't), how can I avoid making the reading thread take a lock every time?
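A minimal sketch of that read-mostly pattern, assuming C++11 atomics are available (the names here are illustrative, not part of the library above): the reader pays one acquire-load per hit, and only the rare writer synchronizes.
#include <atomic>
typedef void (*TraceProc)( const char *msg );
// Written rarely (when the filter configuration changes), read on every hit.
std::atomic<TraceProc> g_traceProc(nullptr);
void traceHit( const char *msg )
{
    // Fast path: a single atomic load with acquire ordering, no lock.
    TraceProc p = g_traceProc.load( std::memory_order_acquire );
    if ( p )
        p( msg );
}
void setTraceProc( TraceProc p )
{
    // Rare slow path: release ordering publishes the new pointer safely.
    g_traceProc.store( p, std::memory_order_release );
}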

You could implement a configuration version variable. When your program starts it is set to 0. The macro can hold a static int that is the last config version it saw. Then a simple atomic comparison between the last seen and the current config version will tell you whether you need to do a full lock and re-call updateTraceCallback().
That way, 99% of the time you'll only add an extra atomic op or memory barrier or something similar, which is very cheap. 1% of the time, just do the full mutex thing; it shouldn't affect your performance in any noticeable way if it's only 1% of the time.
Edit:
Some .h file:
extern long trace_version;
Some .cpp file:
long trace_version = 0;
The macro:
#define TRACE(msg) \
{ \
    static long __lastSeenVersion = -1; \
    static TraceProc traceCallback = 0; \
    if ( !traceCallback || __lastSeenVersion != trace_version ) \
        updateTraceCallback( &traceCallback, &__lastSeenVersion, __FILE__, __LINE__ ); \
    traceCallback( msg ); \
}
The functions for updating the callback and incrementing the version:
static long oldVersionRefCount = 0;
static long curVersionRefCount = 0;
void updateTraceCallback( TraceProc *callback, long *version, const char *file, unsigned int lineno ) {
    if ( *version != trace_version ) {
        if ( InterlockedDecrement( &oldVersionRefCount ) == 0 ) {
            //....free resources.....
            //...no mutex needed, since no one is using this...
        }
        //....acquire mutex and do stuff....
        InterlockedIncrement( &curVersionRefCount );
        *version = trace_version;
        //...release mutex...
    }
}
void setNewTraceCallback( TraceProc *callback ) {
    //...acquire mutex...
    trace_version++; // done under the mutex; readers only ever compare this value
    while ( oldVersionRefCount != 0 ) { /* ..sleep?.. */ }
    InterlockedExchange( &oldVersionRefCount, curVersionRefCount );
    curVersionRefCount = 0;
    //.... and so on...
    //...release mutex...
}
Of course, this is very simplified, since if you need to upgrade the version while oldVersionRefCount > 0, then you're in trouble; how to solve this is up to you, since it really depends on your problem. My guess is that in those situations you could simply wait until the ref count is zero, since the time the ref count stays incremented should only be the time it takes to run the macro.

I still don't fully understand the question, so please correct me on anything I didn't get.
(I'm leaving out the backslashes.)
#define TRACE(msg)
{
static TraceProc traceCallback = NULL;
TraceProc localTraceCallback;
localTraceCallback = traceCallback;
if (!localTraceCallback)
{
updateTraceCallback(&localTraceCallback, __FILE__, __LINE__);
// If two threads are running this at the same time
// one of them will update traceCallback and get it overwritten
// by the other. This isn't a big deal.
traceCallback = localTraceCallback;
}
// Now there's no way localTraceCallback can be null.
// An issue here is if in the middle of this executing
// traceCallback gets null'ed. But you haven't specified any
// restrictions about this either, so I'm assuming it isn't a problem.
localTraceCallback(msg);
}

Your comment says "resets it to zero in case the filtering configuration changes at runtime" but am I correct in reading that as "resets it to zero when the filtering configuration changes"?
Without knowing exactly how updateTraceCallback implements its data structure, or what other data it's referring to in order to decide when to reset the callbacks (or indeed to set them in the first place), it's impossible to judge what would be safe. A similar problem applies to knowing what traceCallback does - if it accesses a shared output destination, for example.
Given these limitations, the only safe recommendation that doesn't require reworking other code is to stick a mutex around the whole lot (or preferably a critical section on Windows).
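For illustration, "a mutex around the whole lot" could look like the following sketch; it uses C++11 std::mutex for brevity, where the original setting would use a CRITICAL_SECTION on Windows or pthread_mutex_t on Linux.
#include <mutex>
static std::mutex g_traceMutex;
#define TRACE(msg) \
{ \
    std::lock_guard<std::mutex> traceGuard( g_traceMutex ); \
    static TraceProc traceCallback = 0; \
    if ( !traceCallback ) \
        updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
    traceCallback( msg ); \
}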

I'm afraid of doing excessive locking just to check whether the filter configuration changed (99% of the time a trace point is hit, the configuration didn't change). I'm open to larger changes to the macro, too. So instead of discussing mutex vs. critical section performance, it would also be acceptable if the macro just sent an event to an event loop in a different thread (assuming that accessing the event loop is thread safe)
How do you think thread safe messaging between threads is implemented without locks?
Anyway, here's a design that might work:
The data structure that holds the filter must be changed so that it is allocated dynamically from the heap because we are going to be creating multiple instances of filters. Also, it's going to need a reference count added to it. You need a typedef something like:
typedef struct Filter
{
unsigned int refCount;
// all the other filter data
} Filter;
There's a singleton 'current filter' declared somewhere.
static Filter* currentFilter;
and initialised with some default settings.
In your TRACE macro:
#define TRACE(msg) \
{ \
    static Filter* filter = NULL; \
    static TraceProc traceCallback = NULL; \
    if (filterOutOfDate(filter)) \
    { \
        getNewCallback(__FILE__, __LINE__, &traceCallback, &filter); \
    } \
    traceCallback(msg); \
}
filterOutOfDate() merely compares the filter with currentFilter to see if it is the same. It should be enough to just compare addresses. It does no locking.
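A sketch of filterOutOfDate() under that description; the unlocked read is deliberate, since a stale comparison only means one extra trip through the locked slow path:
bool filterOutOfDate(const Filter* filter)
{
    // No locking: comparing addresses is enough, per the description above.
    return filter != currentFilter;
}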
getNewCallback() applies the current filter to get the new trace function and updates the filter passed in with the address of the current filter. Its implementation must be protected with a mutex lock. Also, it decrements the refCount of the original filter and increments the refCount of the new filter. This is so we know when we can free the old filter.
void getNewCallback(const char* file, int line, TraceProc* newCallback, Filter** filter)
{
    // MUTEX lock
    *newCallback = /* whatever you need to do */;
    currentFilter->refCount++;
    if (*filter != NULL)
    {
        (*filter)->refCount--;
        if ((*filter)->refCount == 0)
        {
            // free filter and associated resources
        }
    }
    *filter = currentFilter;
    // MUTEX unlock
}
When you want to change the filter, you do something like
changeFilter()
{
Filter* newFilter = // build a new filter
newFilter->refCount = 0;
// MUTEX lock (same mutex as above)
currentFilter = newFilter;
// MUTEX unlock
}

If I have one thread reading a pointer, and another thread which might write to the variable (but 99% of the time it doesn't), how can I avoid that the reading thread needs to lock all the time?
From your code, it is OK to use a mutex inside updateTraceCallback(), since it is going to be called very rarely (once per location). After taking the mutex, check whether traceCallback is already initialized: if yes, then another thread just did it for you and there is nothing to be done.
If updateTraceCallback() turned out to be a serious performance problem due to collisions on the global mutex, you could simply make an array of mutexes instead and use a hashed value of the traceCallback pointer as an index into the mutex array. That would spread the locking over many mutexes and minimize the number of collisions.
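A sketch of that mutex-array idea (the table size and hash are arbitrary, and lookupTraceProc is a hypothetical stand-in for whatever consults the filter configuration):
#include <mutex>
static const size_t kNumTraceMutexes = 64;
static std::mutex g_traceMutexes[kNumTraceMutexes];
static std::mutex& mutexFor( const void* p )
{
    size_t h = reinterpret_cast<size_t>( p );
    h ^= h >> 9; // cheap pointer hash; low address bits are poorly distributed
    return g_traceMutexes[h % kNumTraceMutexes];
}
void updateTraceCallback( TraceProc *callback, const char *file, unsigned int lineno )
{
    std::lock_guard<std::mutex> guard( mutexFor( callback ) );
    if ( *callback ) // another thread initialized it while we waited
        return;
    *callback = lookupTraceProc( file, lineno ); // hypothetical filter lookup
}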

(For this to work, updateTraceCallback would have to return the callback as well as register it, and it relies on C++11's thread-safe initialization of function-local statics.)
#define TRACE(msg) \
{ \
    static TraceProc traceCallback = \
        updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
    traceCallback( msg ); \
}

Related

Pointer passed to function changes unexpectedly

I'm designing a preloader-based lock tracing utility that attaches to Pthreads, and I've run into a weird issue. The program works by providing wrappers that replace relevant Pthreads functions at runtime; these do some logging, and then pass the args to the real Pthreads function to do the work. They do not modify the arguments passed to them, obviously. However, when testing, I discovered that the condition variable pointer passed to my pthread_cond_wait() wrapper does not match the one that gets passed to the underlying Pthreads function, which promptly crashes with "futex facility returned an unexpected error code," which, from what I've gathered, usually indicates an invalid sync object passed in. Relevant stack trace from GDB:
#8 __pthread_cond_wait (cond=0x7f1b14000d12, mutex=0x55a2b961eec0) at pthread_cond_wait.c:638
#9 0x00007f1b1a47b6ae in pthread_cond_wait (cond=0x55a2b961f290, lk=0x55a2b961eec0)
at pthread_trace.cpp:56
I'm pretty mystified. Here's the code for my pthread_cond_wait() wrapper:
int pthread_cond_wait(pthread_cond_t* cond, pthread_mutex_t* lk) {
// log arrival at wait
the_tracer.add_event(lktrace::event::COND_WAIT, (size_t) cond);
// run pthreads function
GET_REAL_FN(pthread_cond_wait, int, pthread_cond_t*, pthread_mutex_t*);
int e = REAL_FN(cond, lk);
if (e == 0) the_tracer.add_event(lktrace::event::COND_LEAVE, (size_t) cond);
else {
the_tracer.add_event(lktrace::event::COND_ERR, (size_t) cond);
}
return e;
}
// GET_REAL_FN is defined as:
#define GET_REAL_FN(name, rtn, params...) \
typedef rtn (*real_fn_t)(params); \
static const real_fn_t REAL_FN = (real_fn_t) dlsym(RTLD_NEXT, #name); \
assert(REAL_FN != NULL) // semicolon absence intentional
And here's the code for __pthread_cond_wait in glibc 2.31 (this is the function that gets called if you call pthread_cond_wait normally, it has a different name because of versioning stuff. The stack trace above confirms that this is the function that REAL_FN points to):
int
__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
/* clockid is unused when abstime is NULL. */
return __pthread_cond_wait_common (cond, mutex, 0, NULL);
}
As you can see, neither of these functions modifies cond, yet it is not the same in the two frames. Examining the two different pointers in a core dump shows that they point to different contents, as well. I can also see in the core dump that cond does not appear to change in my wrapper function (i.e. it's still equal to 0x5... in frame 9 at the crash point, which is the call to REAL_FN). I can't really tell which pointer is correct by looking at their contents, but I'd assume it's the one passed in to my wrapper from the target application. Both pointers point to valid segments for program data (marked ALLOC, LOAD, HAS_CONTENTS).
My tool is definitely causing the error somehow, the target application runs fine if it is not attached. What am I missing?
UPDATE: Actually, this doesn't appear to be what's causing the error, because calls to my pthread_cond_wait() wrapper succeed many times before the error occurs, and exhibit similar behavior (pointer value changing between frames without explanation) each time. I'm leaving the question open, though, because I still don't understand what's going on here and I'd like to learn.
UPDATE 2: As requested, here's the code for tracer.add_event():
// add an event to the calling thread's history
// hist_entry ctor gets timestamp & stack trace
void tracer::add_event(event e, size_t obj_addr) {
size_t tid = get_tid();
hist_map::iterator hist = histories.contains(tid);
assert(hist != histories.end());
hist_entry ev (e, obj_addr);
hist->second.push_back(ev);
}
// hist_entry ctor:
hist_entry::hist_entry(event e, size_t obj_addr) :
ts(chrono::steady_clock::now()), ev(e), addr(obj_addr) {
// these are set in the tracer ctor
assert(start_addr && end_addr);
void* buf[TRACE_DEPTH];
int v = backtrace(buf, TRACE_DEPTH);
int a = 0;
// find first frame outside of our own code
while (a < v && start_addr < (size_t) buf[a] &&
end_addr > (size_t) buf[a]) ++a;
// skip requested amount of frames
a += TRACE_SKIP;
if (a >= v) a = v-1;
caller = buf[a];
}
histories is a lock-free concurrent hashmap from libcds (mapping tid -> per-thread vectors of hist_entry), and its iterators are guaranteed to be thread-safe as well. GNU docs say backtrace() is thread-safe, and no data races are mentioned in the C++ docs for steady_clock::now(). get_tid() just calls pthread_self() using the same method as the wrapper functions, and casts its result to size_t.
Hah, figured it out! The issue is that Glibc exposes multiple versions of pthread_cond_wait(), for backwards compatibility. The version I reproduce in my question is the current version, the one we want to call. The version that dlsym() was finding is the backwards-compatible version:
int
__pthread_cond_wait_2_0 (pthread_cond_2_0_t *cond, pthread_mutex_t *mutex)
{
if (cond->cond == NULL)
{
pthread_cond_t *newcond;
newcond = (pthread_cond_t *) calloc (sizeof (pthread_cond_t), 1);
if (newcond == NULL)
return ENOMEM;
if (atomic_compare_and_exchange_bool_acq (&cond->cond, newcond, NULL))
/* Somebody else just initialized the condvar. */
free (newcond);
}
return __pthread_cond_wait (cond->cond, mutex);
}
As you can see, this version tail-calls the current one, which is probably why this took so long to detect: GDB is normally pretty good at detecting frames elided by tail calls, but I'm guessing it didn't detect this one because the functions have the "same" name (and the error doesn't affect the mutex functions because they don't expose multiple versions). This blog post goes into much more detail, coincidentally specifically about pthread_cond_wait(). I stepped through this function many times while debugging and sort of tuned it out, because every call into glibc is wrapped in multiple layers of indirection; I only realized what was going on when I set a breakpoint on the pthread_cond_wait symbol, instead of a line number, and it stopped at this function.
Anyway, this explains the changing pointer phenomenon: what happens is that the old, incorrect function gets called, reinterprets the pthread_cond_t object as a struct containing a pointer to a pthread_cond_t object, allocates a new pthread_cond_t for that pointer, and then passes the newly allocated one to the new, correct function. The frame of the old function gets elided by the tail-call, and to a GDB backtrace after leaving the old function it looks like the correct function gets called directly from my wrapper, with a mysteriously changed argument.
The fix for this was simple: GNU provides the libdl extension dlvsym(), which is like dlsym() but also takes a version string. Looking for pthread_cond_wait with version string "GLIBC_2.3.2" solves the problem. Note that these versions do not usually correspond to the current version (i.e. pthread_create()/exit() have version string "GLIBC_2.2.5"), so they need to be looked up on a per-function basis. The correct string can be determined either by looking at the compat_symbol() or versioned_symbol() macros that are somewhere near the function definition in the glibc source, or by using readelf to see the names of the symbols in the compiled library (mine has "pthread_cond_wait@@GLIBC_2.3.2" and "pthread_cond_wait@GLIBC_2.2.5").
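For reference, a sketch of how the GET_REAL_FN macro from the question might be adapted (dlvsym() requires _GNU_SOURCE to be defined before including <dlfcn.h>, and the version string must be supplied per function, as described above):
#define GET_REAL_VFN(name, ver, rtn, params...) \
    typedef rtn (*real_fn_t)(params); \
    static const real_fn_t REAL_FN = (real_fn_t) dlvsym(RTLD_NEXT, #name, ver); \
    assert(REAL_FN != NULL) // semicolon absence intentional, as before
// e.g. in the wrapper:
//   GET_REAL_VFN(pthread_cond_wait, "GLIBC_2.3.2", int,
//                pthread_cond_t*, pthread_mutex_t*);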

Setting all TLS (thread local storage) variables to a new, single value in C++

I have a class Foo with the following thread-specific static member:
__declspec(thread) static bool s_IsAllAboutThatBass;
In the implementation file it is initialized like so:
__declspec(thread) bool Foo::s_IsAllAboutThatBass = true;
So far so good. Now, any thread can flip this bool willy nilly as they deem fit. Then the problem: at some point I want each thread to reset that bool to its initial true value.
How can I slam all instances of the TLS to true from a central thread?
I've thought of ways I could do this with synchronization primitives I know about, like critical sections, read/write sections, or events, but nothing fits the bill. In my real use cases I am unable to block any of the other threads for any significant length of time.
Any help is appreciated. Thank you!
Edit: Plan A
One idea is to use a generation token, or cookie that is read by all threads and written to by the central thread. Each thread can then have a TLS for the last generation viewed by that thread when grabbing s_isAllAboutThatBass via some accessor. When the thread local cookie differs from the shared cookie, we increment the thread local one and update s_isAllAboutThatBass to true.
Here is a lightweight implementation of "Plan A" with the C++11 standard atomic variable and thread_local specifier. (If your compiler doesn't support them, please substitute vendor-specific facilities.)
#include <atomic>
struct Foo {
static std::atomic<unsigned> s_TokenGeneration;
static thread_local unsigned s_LocalToken;
static thread_local bool s_LocalState;
// for central thread
void signalResetIsAllAboutThatBass() {
++s_TokenGeneration;
}
// accessor for other threads
void setIsAllAboutThatBass(bool b) {
unsigned currToken = s_TokenGeneration;
s_LocalToken = currToken;
s_LocalState = b;
}
bool getIsAllAboutThatBass() const {
unsigned currToken = s_TokenGeneration;
if (s_LocalToken < currToken) {
// reset thread-local token & state
s_LocalToken = currToken;
s_LocalState = true;
}
return s_LocalState;
}
};
std::atomic<unsigned> Foo::s_TokenGeneration;
thread_local unsigned Foo::s_LocalToken = 0u;
thread_local bool Foo::s_LocalState = true;
The simplest answer is: you can't. The reason that it's called thread local storage is because only its thread can access it. Which, by definition, means that some other "central thread" can't get to it. That's what it's all about, by definition.
Now, depending on how your hardware and compiler platform implements TLS, there might be a trick around it, if your implementation of TLS works by mapping TLS variables to different virtual memory addresses. Typically, what happens is that one CPU register is thread-specific; it's set to point to a different memory address in each thread, and all TLS variables are accessed as relative addresses.
If that is the case, you could, perhaps, derive some thread-safe mechanism by which each thread takes a pointer to its TLS variable, and puts it into a non-TLS container, that your "central thread" can get to.
And, of course, you must keep all of that in sync with your threads, and clean things up after each thread terminates.
You'll have to figure out whether this is the case on your platform with a trivial test: declare a TLS variable, then compare its pointer address in two different threads. If it's different, you might be able to work around it in this fashion. Technically, this kind of pointer comparison is non-portable and implementation-defined, but by this time you're already far into implementation-specific behavior.
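A sketch of that trivial test, using C++11 thread_local (the address comparison is implementation-defined, as noted):
#include <cstdio>
#include <thread>
thread_local int tlsProbe;
int main()
{
    void* addrMain = &tlsProbe;
    void* addrOther = 0;
    std::thread t( [&]{ addrOther = &tlsProbe; } );
    t.join();
    std::printf( "main: %p, other: %p -> %s\n", addrMain, addrOther,
                 addrMain == addrOther ? "same address" : "different addresses" );
    return 0;
}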
But if the addresses are the same, it means that your implementation uses virtual memory addressing to implement TLS. Only the executing thread has access to its TLS variable, period, and there is no practical means by which any "central thread" could look at other threads' TLS variables. It's enforced by your operating system kernel. The "central thread" must cooperate with each thread, and make arrangements to access the thread's TLS variables using typical means of inter-thread communication.
The cookie approach would work fine, and you don't need to use a TLS slot to implement it, just a local variable inside your thread procedure. To handle the case where the cookie changes value between the time that the thread is created and the time that it starts running (there is a small delay), you would have to pass the current cookie value as an input parameter for the thread creation, then your thread procedure can initialize its local variable to that value before it starts checking the live cookie for changes.
intptr_t g_cookie = 1;
pthread_rwlock_t g_lock;
void* thread_proc(void *arg)
{
intptr_t cookie = (intptr_t)arg;
while (keepRunningUntilSomeCondition)
{
pthread_rwlock_rdlock(&g_lock);
if (cookie != g_cookie)
{
cookie = g_cookie;
s_IsAllAboutThatBass = true;
}
pthread_rwlock_unlock(&g_lock);
//...
}
pthread_exit(NULL);
}
void createThread()
{
...
pthread_t thread;
pthread_create(&thread, NULL, &thread_proc, (void*)g_cookie);
...
}
void signalThreads()
{
pthread_rwlock_wrlock(&g_lock);
++g_cookie;
pthread_rwlock_unlock(&g_lock);
}
int main()
{
pthread_rwlock_init(&g_lock, NULL);
// use createThread() and signalThreads() as needed...
pthread_rwlock_destroy(&g_lock);
return 0;
}

mutex lock is not unlocking

I use a mutex to guard a variable: the getter is called continuously from the main thread's update cycle, and the setter is called from another thread. I provide the code for the setter and getter below.
Definition
bool _flag;
System::Mutex m_flag;
Calls
#define LOCK(MUTEX_VAR) MUTEX_VAR.Lock();
#define UNLOCK(MUTEX_VAR) MUTEX_VAR.Unlock();
void LoadingScreen::SetFlag(bool value)
{
LOCK(m_flag);
_flag = value;
UNLOCK(m_flag);
}
bool LoadingScreen::GetFlag()
{
LOCK(m_flag);
bool value = _flag;
UNLOCK(m_flag);
return value;
}
This works well half the time, but at times the variable gets locked on calling SetFlag and hence is never set, which disturbs the flow of the code.
Can anyone tell me how to solve this issue?
EDIT:
This is the workaround I finally did. This is just a temporary solution; if anyone has a better answer, please let me know.
bool _flag;
bool accessingFlag = false;
void LoadingScreen::SetFlag(bool value)
{
if(!accessingFlag)
{
_flag = value;
}
}
bool LoadingScreen::GetFlag()
{
accessingFlag = true;
bool value = _flag;
accessingFlag = false;
return value;
}
The issue you have (which user1192878 alludes to) is due to delayed compiler loads/stores. You need to use memory barriers to implement the code. You could declare volatile bool _flag;, but this is not needed with compiler memory barriers for a single-CPU system. Hardware barriers (just below in the Wikipedia link) are needed for multi-CPU solutions; hardware barriers ensure the local processor's memory/cache is seen by all CPUs. The use of a mutex and other interlocks is not needed in this case. What exactly do they accomplish? They just create deadlocks and are not needed.
bool _flag;
#define memory_barrier() __asm__ __volatile__ ("" ::: "memory") /* GCC */
void LoadingScreen::SetFlag(bool value)
{
_flag = value;
memory_barrier(); /* Ensure write happens immediately, even for in-lines */
}
bool LoadingScreen::GetFlag()
{
bool value = _flag;
memory_barrier(); /* Ensure read happens immediately, even for in-lines */
return value;
}
Mutexes are only needed when multiple values are being set at the same time. You may also change the bool type to sig_atomic_t or LLVM atomics. However, this is rather pedantic, as bool will work on almost every practical CPU architecture. Cocoa's concurrency pages also have some information on alternative APIs to do the same thing. I believe gcc's in-line assembler uses the same syntax as Apple's compilers, but that could be wrong.
There are some limitations to the API. The instant GetFlag() returns, something can call SetFlag(); GetFlag()'s return value is then stale. If you have multiple writers, then you can easily miss one SetFlag(). This may be important if the higher-level logic is prone to ABA problems. However, all of these issues exist with or without mutexes. The memory barrier only solves the issue that a compiler/CPU will not cache the SetFlag() value for a prolonged time, and ensures the value is re-read in GetFlag(). Declaring volatile bool flag will generally result in the same behavior, but with extra side effects, and it does not solve multi-CPU issues.
std::atomic<bool> (as per Stefan's answer) and atomic_set(&accessing_flag, true); will generally do the same thing as described above in their implementations. You may wish to use them if they are available on your platforms.
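For completeness, the std::atomic variant might look like this sketch (assuming C++11 is available; the member layout mirrors the snippets above):
#include <atomic>
std::atomic<bool> _flag(false);
void LoadingScreen::SetFlag(bool value)
{
    _flag.store(value, std::memory_order_release);
}
bool LoadingScreen::GetFlag()
{
    return _flag.load(std::memory_order_acquire);
}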
First of all, you should use RAII for mutex lock/unlock. Second, either you are not showing some other code that uses _flag directly, or there is something wrong with the mutex you are using (unlikely). What library provides System::Mutex?
The code looks right if System::Mutex is correctly implemented.
Something to be mentioned:
As others pointed out, RAII is better than macro.
It might be better to define accessingFlag and _flag as volatile.
I think the temp solution you got is not correct if you compile with optimization:
bool LoadingScreen::GetFlag()
{
    accessingFlag = true; // might be reordered or deleted
    bool value = _flag; // might be optimized away
    accessingFlag = false; // might be reordered before value is set
    return value; // might be optimized to directly return _flag or a register
}
In the above code, the optimizer could do nasty things. For example, there is nothing to prevent the compiler from eliminating the first assignment accessingFlag = true, and the assignments could be reordered or cached. From the compiler's point of view, in a single-threaded program the first assignment to accessingFlag is useless because the value true is never used.
Using a mutex to protect a single bool variable is expensive, since most of the time is spent switching OS mode (from user to kernel and back). It might not be bad to use a spinlock (detailed code depends on your target platform; a sketch follows after the last point below). It should be something like:
spinlock_lock(&lock);
_flag = value;
spinlock_unlock(&lock);
An atomic variable would work well here too. It might look like:
atomic_set(&accessing_flag, true);
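A minimal spinlock of the kind point 4 alludes to can be built from C++11 std::atomic_flag; this is a sketch under that assumption, not a drop-in replacement:
#include <atomic>
std::atomic_flag spin = ATOMIC_FLAG_INIT;
bool _flag;
void LoadingScreen::SetFlag(bool value)
{
    while (spin.test_and_set(std::memory_order_acquire))
        ; // busy-wait until the holder clears the flag
    _flag = value;
    spin.clear(std::memory_order_release);
}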
Have you considered using CRITICAL_SECTION? It is only available on Windows, so you lose some portability, but it is an effective user-level mutex.
The second block of code that you provided may modify the flag while it is being read, even in uniprocessor settings.
The original code that you posted is correct, and cannot lead to deadlocks under two assumptions:
The m_flag lock is correctly initialized, and not modified by any other code.
The lock implementation is correct.
If you want a portable lock implementation, I would suggest using OpenMP:
How to use lock in openMP?
From your description it seems like you want to busy-wait for a thread to process some input. In this case, Stefan's solution (declare the flag std::atomic) is probably best. On semi-sane x86 systems, you could also declare the flag volatile int. Just don't do this for unaligned fields (packed structures).
You can avoid busy waiting with two locks. The first lock is unlocked by the slave when it finishes processing and locked by the main thread when waiting for the slave to finish. The second lock is unlocked by the main thread when providing input, and locked by the slave when waiting for input.
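A sketch of that handshake; note that std::mutex may not be unlocked by a thread that doesn't own it, so semaphores (C++20 std::binary_semaphore here, or POSIX sem_t) are the natural primitive for this cross-thread signaling:
#include <semaphore>
std::binary_semaphore inputReady(0);  // released by main when input is available
std::binary_semaphore outputReady(0); // released by the slave when done
void slaveLoop()
{
    for (;;) {
        inputReady.acquire();  // sleep until main provides input
        // ... process the input ...
        outputReady.release(); // wake main
    }
}
void provideInputAndWait()
{
    // ... publish the input ...
    inputReady.release();
    outputReady.acquire(); // sleeps instead of busy waiting
}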
Here's a technique I've seen somewhere, but couldn't find the source anymore. If I find it, I will edit the answer. Basically, the writer will just write, but the reader will read the value of the set variable more than once, and only when all copies are consistent, it would use it. And I've changed the writer so that it will try to keep writing the value as long as it does not match the value it expects.
bool _flag;
void LoadingScreen::SetFlag(bool value)
{
do
{
_flag = value;
} while (_flag != value);
}
bool LoadingScreen::GetFlag()
{
bool value;
do
{
value = _flag;
} while (value != _flag);
return value;
}

Debugging instance of another thread altering my data

I have a huge global array of structures. Some regions of the array are tied to individual threads and those threads can modify their regions of the array without having to use critical sections. But there is one special region of the array which all threads may have access to. The code that accesses these parts of the array needs to carefully use critical sections (each array element has its own critical section) to prevent any possibility of two threads writing to the structure simultaneously.
Now I have a mysterious bug I am trying to chase, it is occurring unpredictably and very infrequently. It seems that one of the structures is being filled with some incorrect number. One obvious explanation is that another thread has accidentally been allowed to set this number when it should be excluded from doing so.
Unfortunately it seems close to impossible to track this bug. The array element in which the bad data appears is different each time. What I would love to be able to do is set some kind of trap for the bug as follows: I would enter a critical section for array element N, then I know that no other thread should be able to touch the data, then (until I exit the critical section) set some kind of flag to a debugging tool saying "if any other thread attempts to change the data here please break and show me the offending patch of source code"... but I suspect no such tool exists... or does it? Or is there some completely different debugging methodology that I should be employing.
How about wrapping your data with a transparent mutexed class? Then you could apply additional lock state checking.
class critical_section;
template < class T >
class element_wrapper
{
public:
element_wrapper(const T& v) : val(v) {}
element_wrapper() {}
const element_wrapper& operator = (const T& v) {
#ifdef _DEBUG_CONCURRENCY
if(!cs->is_locked())
_CrtDbgBreak();
#endif
val = v;
return *this;
}
operator T() { return val; }
critical_section* cs;
private:
T val;
};
As for critical section implementation:
class critical_section
{
public:
critical_section() : locked(FALSE) {
::InitializeCriticalSection(&cs);
}
~critical_section() {
_ASSERT(!locked);
::DeleteCriticalSection(&cs);
}
void lock() {
::EnterCriticalSection(&cs);
locked = TRUE;
}
void unlock() {
locked = FALSE;
::LeaveCriticalSection(&cs);
}
BOOL is_locked() {
return locked;
}
private:
CRITICAL_SECTION cs;
BOOL locked;
};
Actually, instead of the custom critical_section::locked flag, one could use ::TryEnterCriticalSection (followed by ::LeaveCriticalSection if it succeeds) to determine whether a critical section is held. Though the implementation above is almost as good.
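That alternative could look like the sketch below (the method name is made up; note that TryEnterCriticalSection succeeds recursively for the owning thread, so this only detects ownership by another thread):
BOOL critical_section::is_locked_by_other() {
    if (::TryEnterCriticalSection(&cs)) {
        // We acquired it, so no other thread held it; undo and report FALSE.
        ::LeaveCriticalSection(&cs);
        return FALSE;
    }
    return TRUE;
}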
So the appropriate usage would be:
typedef std::vector< element_wrapper<int> > cont_t;
void change(cont_t::reference x) { x.cs->lock(); x = 1; x.cs->unlock(); } // assumes x.cs was pointed at a critical_section beforehand
int main()
{
cont_t container(10, 0);
std::for_each(container.begin(), container.end(), &change);
}
I know two ways to handle such errors:
1) Read the code again and again, looking for possible errors. I can think of two errors that can cause this: unsynchronized access, or writing through an incorrect memory address. Maybe you have more ideas.
2) Logging, logging and logging. Add lots of optional traces (OutputDebugString or a log file) in every critical place, containing enough information - indexes, variable values, etc. It is a good idea to add this tracing with some #ifdef. Reproduce the bug and try to understand from the log what happens.
Your best (fastest) bet is still to revise the mutex code. As you said, it is the obvious explanation - why not try to really find the explanation (by logic) instead of gathering additional hints (by coding) that may come out inconclusive? If the code review doesn't turn up something useful, you may still take the mutex code and use it for a test run. The first try should not be to reproduce the bug in your system but to ensure correct implementation of the mutex - implement threads (starting from 2 upwards) that all try to access the same data structure again and again, with a random small delay in each of them so they jitter around on the time line. If this test results in a buggy mutex which you simply can't identify in the code, then you have fallen victim to some architecture-dependent effect (maybe instruction reordering, multi-core cache incoherency, etc.) and need to find another mutex implementation. If OTOH you find an obvious bug in the mutex, try to exploit it in your real system (instrument your code so that the error appears much more often) so that you can ensure that it really is the cause of your original problem.
I was thinking about this while pedaling to work. One possible way of handling this is to make portions of the memory in question be read-only when it is not actively being accessed and protected via critical section ownership. This is assuming that the problem is caused by a thread writing to the memory when it does not own the appropriate critical section.
There are quite a few limitations that prevent this from working everywhere. Most important is the fact that I think you can only set privileges on a page-by-page basis (4K I believe). So that would likely require some very specific changes to your allocation scheme so that you could narrow down the appropriate section to protect. The second problem is that it would not catch the rogue thread writing to the memory if another thread legitimately owned the critical section at that moment. But it would catch the write and cause an immediate access violation if the critical section was not owned.
The idea would be to do to change your EnterCriticalSection calls to:
EnterCriticalSection()
VirtualProtect( … PAGE_READWRITE … );
And change the LeaveCriticalSection calls to:
VirtualProtect( … PAGE_READONLY … );
LeaveCriticalSection()
The following chunk of code shows a call to VirtualProtect
int main( int argc, char* argv[] )
{
unsigned char *mem;
int i;
DWORD dwOld;
// this assume 4K page size
mem = (unsigned char *)malloc( 4096 * 10 );
for ( i = 0; i < 10; i++ )
mem[i * 4096] = i;
// set the second page to be readonly. The allocation from malloc is
// not necessarily on a page boundary, but this will definitely be in
// the second page.
printf( "VirtualProtect res = %d\n",
VirtualProtect( mem + 4096,
1, // ends up setting entire page
PAGE_READONLY, &dwOld ));
// can still read it
for ( i = 1; i < 10; i++ )
printf( "%d ", mem[i*4096] );
printf( "\n" );
// Can write to all but the second page
for ( i = 0; i < 10; i++ )
    if ( i != 1 ) // avoid second page which we made readonly
        mem[i * 4096] = 1;
// this causes an access violation
mem[4096] = 1;
}

How do I use an arbitrary string as a lock in C++?

Let's say I have a multithreaded C++ program that handles requests in the form of a function call to handleRequest(string key). Each call to handleRequest occurs in a separate thread, and there are an arbitrarily large number of possible values for key.
I want the following behavior:
Simultaneous calls to handleRequest(key) are serialized when they have the same value for key.
Global serialization is minimized.
The body of handleRequest might look like this:
void handleRequest(string key) {
KeyLock lock(key);
// Handle the request.
}
Question: How would I implement KeyLock to get the required behavior?
A naive implementation might start off like this:
KeyLock::KeyLock(string key) {
global_lock->Lock();
internal_lock_ = global_key_map[key];
if (internal_lock_ == NULL) {
internal_lock_ = new Lock();
global_key_map[key] = internal_lock_;
}
global_lock->Unlock();
internal_lock_->Lock();
}
KeyLock::~KeyLock() {
internal_lock_->Unlock();
// Remove internal_lock_ from global_key_map iff no other threads are waiting for it.
}
...but that requires a global lock at the beginning and end of each request, and the creation of a separate Lock object for each request. If contention is high between calls to handleRequest, that might not be a problem, but it could impose a lot of overhead if contention is low.
You could do something similar to what you have in your question, but instead of a single global_key_map have several (probably in an array or vector) - which one is used is determined by some simple hash function on the string.
That way instead of a single global lock, you spread that out over several independent ones.
This is a pattern that is often used in memory allocators (I don't know if the pattern has a name - it should). When a request comes in, something determines which pool the allocation will come from (usually the size of the request, but other parameters can factor in as well), then only that pool needs to be locked. If an allocation request comes in from another thread that will use a different pool, there's no lock contention.
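A sketch of that striped-map idea, using C++11 for brevity (stripe count and hash are arbitrary; cleanup of per-key mutexes is omitted, as in the question's naive version):
#include <functional>
#include <map>
#include <mutex>
#include <string>
static const size_t kNumStripes = 16;
struct Stripe {
    std::mutex mapLock;                        // guards keyLocks only
    std::map<std::string, std::mutex> keyLocks;
};
static Stripe stripes[kNumStripes];
std::mutex& lockForKey(const std::string& key) {
    Stripe& s = stripes[std::hash<std::string>()(key) % kNumStripes];
    std::lock_guard<std::mutex> guard(s.mapLock);
    return s.keyLocks[key]; // default-constructs the mutex on first use
}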
It will depend on the platform, but the two techniques that I'd try would be:
Use named mutex/synchronization objects, where object name = key.
Use filesystem-based locking, where you try to create a non-shareable temporary file with the key name. If it already exists (= already locked), this will fail and you'll have to poll to retry.
Both techniques will depend on the details of your OS. Experiment and see which works.
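A sketch of the first technique on Windows (the name prefix is made up; note that kernel object names may not contain backslashes, so raw keys might need escaping, and a named mutex is visible machine-wide, which is heavyweight for purely in-process use):
#include <windows.h>
#include <string>
class NamedKeyLock {
public:
    explicit NamedKeyLock(const std::string& key) {
        std::string name = "Local\\KeyLock_" + key; // hypothetical naming scheme
        handle_ = ::CreateMutexA(NULL, FALSE, name.c_str());
        ::WaitForSingleObject(handle_, INFINITE);
    }
    ~NamedKeyLock() {
        ::ReleaseMutex(handle_);
        ::CloseHandle(handle_);
    }
private:
    HANDLE handle_;
};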
Perhaps an std::map<std::string, MutexType> would be what you want, where MutexType is the type of the mutex you want. You will probably have to wrap accesses to the map in another mutex in order to ensure that no other thread is inserting at the same time (and remember to perform the check again after the mutex is locked to ensure that another thread didn't add the key while waiting on the mutex!).
The same principle could apply to any other synchronization method, such as a critical section.
Raise granularity and lock entire key-ranges
This is a variation on Mike B's answer, where instead of having several fluid lock maps you have a single fixed array of locks that apply to key-ranges instead of single keys.
Simplified example: create array of 256 locks at startup, then use first byte of key to determine index of lock to be acquired (i.e. all keys starting with 'k' will be guarded by locks[107]).
To sustain optimal throughput you should analyze distribution of keys and contention rate. The benefits of this approach are zero dynamic allocations and simple cleanup; you also avoid two-step locking. The downside is potential contention peaks if key distribution becomes skewed over time.
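A sketch of that fixed-table variant (256 locks; the first byte of the key selects the lock, so all keys sharing a first character serialize together):
#include <mutex>
#include <string>
static std::mutex keyRangeLocks[256];
std::mutex& lockForKeyRange(const std::string& key) {
    unsigned char index = key.empty() ? 0 : (unsigned char)key[0];
    return keyRangeLocks[index];
}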
After thinking about it, another approach might go something like this:
In handleRequest, create a Callback that does the actual work.
Create a multimap<string, Callback*> global_key_map, protected by a mutex.
If a thread sees that key is already being processed, it adds its Callback* to the global_key_map and returns.
Otherwise, it calls its callback immediately, and then calls the callbacks that have shown up in the meantime for the same key.
Implemented something like this:
LockAndCall(string key, Callback* callback) {
    global_lock.Lock();
    if (global_key_map.contains(key)) {
        // Another thread is already processing this key; it will pick up
        // this callback when it finishes its own.
        global_key_map.insert(key, callback);
        global_lock.Unlock();
    } else {
        iterator iter = global_key_map.insert(key, callback);
        while (true) {
            global_lock.Unlock();
            iter->second->Call();
            global_lock.Lock();
            global_key_map.erase(iter);
            iter = global_key_map.find(key);
            if (iter == global_key_map.end()) {
                global_lock.Unlock();
                return;
            }
        }
    }
}
This has the advantage of freeing up threads that would otherwise be waiting for a key lock, but apart from that it's pretty much the same as the naive solution I posted in the question.
It could be combined with the answers given by Mike B and Constantin, though.
#include <list>
#include <map>
#include <string>
#include <pthread.h>
using namespace std;
// 'Event' is assumed to be a small event wrapper (condition variable or
// semaphore) with Wait() and Signal(); its implementation is not shown here.
/**
* StringLock class for string based locking mechanism
* e.g. usage
*     StringLock strLock;
*     strLock.Lock("row1");
*     strLock.UnLock("row1");
*/
class StringLock {
public:
/**
* Constructor
* Initializes the mutexes
*/
StringLock() {
pthread_mutex_init(&mtxGlobal, NULL);
}
/**
* Lock Function
* The thread will return immediately if the string is not locked
* The thread will wait if the string is locked until it gets a turn
* @param lockString the string to lock
*/
void Lock(string lockString) {
pthread_mutex_lock(&mtxGlobal);
TListIds *listId = NULL;
TWaiter *wtr = new TWaiter;
wtr->evPtr = NULL;
wtr->threadId = pthread_self();
if (lockMap.find(lockString) == lockMap.end()) {
listId = new TListIds();
listId->insert(listId->end(), wtr);
lockMap[lockString] = listId;
pthread_mutex_unlock(&mtxGlobal);
} else {
wtr->evPtr = new Event(false);
listId = lockMap[lockString];
listId->insert(listId->end(), wtr);
pthread_mutex_unlock(&mtxGlobal);
wtr->evPtr->Wait();
}
}
/**
* UnLock Function
* @param lockString the string to unlock
*/
void UnLock(string lockString) {
    pthread_mutex_lock(&mtxGlobal);
    if (lockMap.find(lockString) != lockMap.end()) {
        TListIds *listID = lockMap[lockString];
        TWaiter *doneWtr = listID->front(); // entry of the releasing thread
        listID->pop_front();
        delete doneWtr->evPtr; // NULL for the first locker; delete NULL is fine
        delete doneWtr;
        if (!(listID->empty())) {
            TWaiter *wtr = listID->front();
            Event *thdEvent = wtr->evPtr;
            thdEvent->Signal();
        } else {
            lockMap.erase(lockString);
            delete listID;
        }
    }
    pthread_mutex_unlock(&mtxGlobal);
}
protected:
struct TWaiter {
Event *evPtr;
long threadId;
};
StringLock(StringLock &);
void operator=(StringLock&);
typedef list<TWaiter*> TListIds;
typedef map<string, TListIds*> TMapLockWaiters;
private:
pthread_mutex_t mtxGlobal;
TMapLockWaiters lockMap;
};