Creating Mutexes with Binary Semaphore - c++

I'm working with a simple system that does NOT have mutexes, but rather a limited array of hardware binary semaphores. Typically, all multithreading has been done with heavyweight semaphore techniques that make code both poor in performance and difficult to write correctly without deadlocks.
A naive implementation is to use one semaphore globally in order to ensure atomic access to a critical section. However, this means that unrelated objects (even of different types) will block if any critical section is being accessed.
My current solution to this problem is to use a single global semaphore to ensure atomic access to a guard byte that then ensures atomic access to the particular critical section. I currently have this so far:
while (true) {
    while (mutexLock == Mutex::Locked) {
    } // wait on mutex
    Semaphore semaLock(SemaphoreIndex::Mutex); // RAII semaphore object
    if (mutexLock == Mutex::Unlocked) {
        mutexLock = Mutex::Locked;
        break;
    }
} // Semaphore is released by destructor here
// ... atomically safe code
mutexLock = Mutex::Unlocked;
I have a few questions: Is this the best way to approach this problem? Is this code thread-safe? Is this the same as a "double checked lock"? If so, does it suffer from the same problems and therefore need memory barriers?
EDIT: A few notes on the system this is being implemented on...
It is a RISC 16-bit processor with 32kB RAM. While it has heavy multithreading capabilities, its memory model is very primitive. Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads. Memory barriers are mostly for the compiler to know it should reload memory into general-purpose registers, not for any hardware reason (no cache).

Even if the processor has heavy multithreading capabilities (I'm not sure what you mean by that, by the way), the fact that it is a single-core processor still means that only one thread can execute at any one time.
A task-switching system must be employed (which would normally run in a privileged mode). Using this system you must be able to define critical sections in which to atomically execute a (software-implemented) mutex lock/unlock.
When you say "one core, many threads" does that mean that you have some kind of kernel running on your processor? The task switching system will be implemented by that kernel.
It might pay to look through the documentation of your kernel or ask your vendor for any tips.
Good luck.

Right now we still don't have that much information about your system (for example, what kind of registers are available for each instruction in parallel? Do you use a banked architecture? How many simultaneous instructions can you actually execute?), but hopefully what I suggest will help you.
If I understand your situation correctly, you have a piece of hardware that does not have true cores, but simply MIMD ability via a vectorized operation (based on your reply). It is a "RISC 16-bit processor with 32kB RAM" where:
Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads
The key here is that loads and stores are atomic. Note that you won't be able to do loads and stores larger than 16 bits atomically, since they will be compiled into two separate atomic instructions (and the combined operation is therefore not atomic).
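For illustration (a hypothetical sketch; the exact split depends on your compiler), a 32-bit write on such a 16-bit core becomes two separate stores, so a concurrent reader can observe a torn value:

#include <cstdint>

// Hypothetical illustration of tearing on a 16-bit core: the 32-bit store
// below compiles to two separate 16-bit stores, so a concurrent reader can
// see a value the program never meant to hold.
volatile uint32_t counter = 0x0000FFFFu;

void writer() {
    counter = 0x00010000u;       // emitted as two stores: low half 0x0000, then high half 0x0001
}

void reader() {
    uint32_t snapshot = counter; // between the two stores this reads 0x00000000
                                 // (or 0x0001FFFF if the halves are written in the other order)
    (void)snapshot;
}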
Here is the functionality of a mutex:
attempt to lock
unlock
To lock, you might run into issues if each resource attempts to lock. For example, say in your hardware N = 4 (the number of processes run in parallel). If instruction 1 (I1) and I2 try to lock, they will both be successful in locking. Since your loads and stores are atomic, both processes see "unlocked" at the same time, and both acquire the lock.
This means you can't do the following:
if mutex_1 unlocked:
lock mutex_1
which might look like this in an arbitrary assembly language:
load arithmetic mutex_addr
or arithmetic immediate(1) // unlocked = 0000 or 1 = 0001, locked = 0001 or 1 = 0001
store mutex_addr arithmetic // not putting in conditional label to provide better synchronization.
jumpifzero MUTEXLABEL arithmetic
To get around this, each "thread" will need to either know whether someone else is currently acquiring the lock, or avoid simultaneous lock access entirely.
I only see one way this can be done on your system (via flag/mutex-id checking). Have a mutex id associated with each thread for each mutex it is currently trying to acquire, and check all other threads to see if you can actually acquire the lock. Your binary semaphores don't really help here, because you would need to associate an individual mutex with a semaphore if you were going to use one (and you would still have to load the mutex from RAM).
Here is a simple implementation of this check-every-thread approach to lock and unlock. Each mutex has an ID and a state; to avoid per-instruction race conditions, the mutex currently being handled is identified well before it is actually acquired. Having the "identify which lock you want to use" and "actually try to get the lock" steps come separately stops accidental acquisition on simultaneous access. With this method you can have 2^16 - 1 mutexes (because 0 is used to say no lock is wanted) and your "threads" can exist on any instruction pipe.
#include <cstdint>

// init to zero
volatile uint16_t CURRENT_LOCK_ATTEMPT[NUM_THREADS] = {0};

// make thread id associated with priority
bool tryAcquireLock(uint16_t mutex_id, bool& mutex_lock_state) {
    if (mutex_lock_state == false) {
        // Do not actually attempt to take the lock until everything is checked.
        // No race condition can happen now: you won't actually have set the lock,
        // and if two threads attempt to acquire the same lock at the same time,
        // both will be able to see that someone else is trying as well.
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // Check all lower threads; some predetermined priority is needed
        // to decide who wins.
        for (int i = 0; i < MY_THREAD_ID; i++) {
            if (CURRENT_LOCK_ATTEMPT[i] == mutex_id) {
                // clearing our attempt
                CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
                return false;
            }
        }
        // Make sure to lock the mutex before clearing which mutex you are
        // currently handling.
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}

// It's your fault if you didn't make sure you owned the lock in the first place.
// If you did own it, there's no race condition, because of atomic load/store.
// If you happen to clear the state while another thread is attempting to
// acquire the lock, it simply won't get the lock and no race condition occurs.
void unlock(bool& mutex_lock_state) {
    mutex_lock_state = false;
}
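For illustration, a caller might use the pair above roughly like this (a hypothetical usage sketch; my_mutex_state and MY_MUTEX_ID are names invented here, everything else is assumed to be declared as above):

// Hypothetical usage sketch of the tryAcquireLock()/unlock() pair above.
// MY_MUTEX_ID is an arbitrary non-zero id (0 means "no lock wanted");
// my_mutex_state is the shared state byte for this particular mutex.
bool my_mutex_state = false;
constexpr uint16_t MY_MUTEX_ID = 1;

void doProtectedWork()
{
    while (!tryAcquireLock(MY_MUTEX_ID, my_mutex_state)) {
        // spin; yielding to the scheduler here, if one exists, would waste fewer cycles
    }
    // ... critical section: only the winning thread gets here ...
    unlock(my_mutex_state);
}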
If you want more equal access to resources, you could change the indexing: instead of always scanning from i = 0 up to MY_THREAD_ID, you randomly pick a "starting point" and circle around back to MY_THREAD_ID using modulo arithmetic. For example:
#include <random>

bool tryAcquireLock(uint16_t mutex_id, bool& mutex_lock_state, uint16_t per_mutex_random_seed) {
    if (mutex_lock_state == false) {
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // need a per-thread linear congruential generator for speed and consistency
        std::minstd_rand0 random(per_mutex_random_seed);
        for (int i = random() % TOTAL_NUM_THREADS; i != MY_THREAD_ID; i = (i + 1) % TOTAL_NUM_THREADS)
        {
            // same as before
        }
        // if we actually acquired the lock
        GLOBAL_SEED = global_random(); // use another generator to set the next seed to be used
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}
In general your lack of test-and-set ability really throws a wrench into everything, meaning you are forced to use other algorithms for mutexes. For more information on algorithms that you can use on non-test-and-set architectures, check out this SO post, and these Wikipedia algorithms, which rely only on atomic loads and stores:
Dekker's Algorithm
Eisenberg & McGuire Algorithm
Peterson's Algorithm
Lamport's Algorithm
Szymanski's Algorithm
All of these algorithms basically decompose into checking a set of flags to see whether you can access the resource safely, by going through everyone else's flags.
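As a concrete illustration of that flag-checking style, here is a minimal sketch of Peterson's algorithm for two threads, using only loads and stores (std::atomic with the default seq_cst ordering stands in for the plain-but-atomic loads and stores on your hardware; this is a sketch, not drop-in code for your platform):

#include <atomic>

// Peterson's algorithm for two threads (ids 0 and 1), using only loads and
// stores -- no test-and-set needed.
std::atomic<bool> wants_to_enter[2] = {{false}, {false}};
std::atomic<int>  turn{0};

void lock(int me)
{
    const int other = 1 - me;
    wants_to_enter[me].store(true);
    turn.store(other);                         // politely give the other thread the turn
    while (wants_to_enter[other].load() && turn.load() == other) {
        // spin until the other thread leaves or yields its turn
    }
}

void unlock(int me)
{
    wants_to_enter[me].store(false);
}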

Related

std::atomic - behaviour of relaxed ordering

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore the fact that these could easily be made atomic

// Thread 1
void do_work() // seldom called
{
    // avoid over
    std::lock_guard<std::mutex> lock{g};
    i++;
    j++;
    k++;
    seq.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
void consume_work() // spinning
{
    const auto s = g_s;
    // avoid overhead of constantly acquiring lock
    g_s = seq.load(std::memory_order_relaxed);
    if (s != g_s)
    {
        // no lock guard
        print(i, j, k);
    }
}
TL;DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of syncing with and values loads are allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairy to reason about, and release/acquire ops are quite cheap on most CPUs. If you expected it to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
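For what it's worth, the "acquire fence only on the has-new-work path" idea might look like the sketch below, using the same names as the question; this still assumes the writer publishes seq with at least a release operation (or a release fence before the store), as suggested above:

#include <atomic>
#include <mutex>

extern std::mutex g;
extern std::atomic<int> seq;
extern int g_s, i, j, k;
void print(int, int, int);

void consume_work()
{
    const auto s = g_s;
    g_s = seq.load(std::memory_order_relaxed);   // cheap check on the fast path
    if (s != g_s)
    {
        // Promote the relaxed load above: after this fence, everything the
        // writer did before its release on seq is guaranteed visible below.
        std::atomic_thread_fence(std::memory_order_acquire);
        std::lock_guard<std::mutex> lock{g};     // take the lock only when there is new work
        print(i, j, k);
    }
}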
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, then touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number, with appropriate memory barriers.)
See the Wikipedia article and/or the link below for the actual details, but the real change from what you have now is that the sequence number has to increment by 2. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.
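For reference, a minimal single-writer seqlock might look something like this (a sketch of the structure just described, with hypothetical names; not the exact code from the linked answer):

#include <atomic>

std::atomic<unsigned> seq_count{0};   // even = stable, odd = write in progress
int i, j, k;                          // the non-atomic payload

void write_payload(int ni, int nj, int nk)   // single writer assumed
{
    unsigned s = seq_count.load(std::memory_order_relaxed);
    seq_count.store(s + 1, std::memory_order_relaxed);    // count is now odd
    std::atomic_thread_fence(std::memory_order_release);  // odd count is visible before the payload stores
    i = ni; j = nj; k = nk;
    seq_count.store(s + 2, std::memory_order_release);    // payload stores happen-before the even count
}

bool try_read(int& oi, int& oj, int& ok)     // returns false if the reader must retry
{
    unsigned s1 = seq_count.load(std::memory_order_acquire);
    oi = i; oj = j; ok = k;                                // may race; result discarded on retry
    std::atomic_thread_fence(std::memory_order_acquire);  // payload loads complete before the re-check
    unsigned s2 = seq_count.load(std::memory_order_relaxed);
    return (s1 == s2) && !(s1 & 1);
}

A reader just loops: do { } while (!try_read(a, b, c));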
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can do a relaxed load and separately store an incremented temporary (with release semantics).
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or use the sequence counter as a spinlock, with a writer acquiring it by making the count odd. Otherwise you just need the sequence counter.
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing the same cache line as the writer, assuming that variables declared near each other all end up together. Consider making it static inside the function, or separate it with other stuff, or with padding, like alignas(64) or 128. (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)
Even ignoring the staleness, this causes a data race and UB.
Thread 2 can read i, j, and k while thread 1 is modifying them; you don't synchronize the access to those variables. If thread 2 doesn't respect the mutex g, there's no point in locking it in thread 1.
Yes, it can.
First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.
Thread 2 can read at any moment. It can read an incremented i and not-yet-incremented j and k. In theory, it can even read a weird partial value obtained by reading in between updates of the various bytes that compose i - for example, incrementing from 0xFF to 0x100 could result in reading 0x1FF or 0x0 - but not on x86, where these updates happen to be atomic.

Do I need std::atomic<bool> or is POD bool good enough?

Consider this code:
// global
std::atomic<bool> run = true;
// thread 1
while (run) { /* do stuff */ }
// thread 2
/* do stuff until it's time to shut down */
run = false;
Do I need the overhead associated with the atomic variable here? My intuition is that the read/write of a boolean variable is more or less atomic anyway (this is a common g++/Linux/Intel setup) and if there is some write/read timing weirdness, and my run loop on thread 1 stops one pass early or late as a result, I'm not super worried about it for this application.
Or is there some other consideration I am missing here? Looking at perf, it appears my code is spending a fair amount of time in std::atomic_bool::operator bool and I'd rather have it in the loop instead.
You need to use std::atomic to avoid undesired optimizations (compiler reading the value once and either always looping or never looping) and to get correct behavior on systems without a strongly ordered memory model (x86 is strongly ordered, so once the write finishes, the next read will see it; on other systems, if the threads don't flush CPU cache to main RAM for other reasons, the write might not be seen for a long time, if ever).
You can improve the performance though. The default use of std::atomic uses a sequential consistency model that's overkill for a single flag value. You can speed it up by using load/store with an explicit (and less strict) memory ordering, so each load isn't required to use the most paranoid mode to maintain consistency.
For example, you could do:
// global
std::atomic<bool> run = true;
// thread 1
while (run.load(std::memory_order_acquire)) { /* do stuff */ }
// thread 2
/* do stuff until it's time to shut down */
run.store(false, std::memory_order_release);
On an x86 machine, any ordering less strict than the (default, most strict) sequential consistency ordering typically ends up doing nothing but ensuring instructions are executed in a specific order; no bus locking or the like is required, because of the strongly ordered memory model. Thus, aside from guaranteeing the value is actually read from memory, not cached to a register and reused, using atomics this way on x86 is free, and on non-x86 machines, it makes your code correct (which it otherwise wouldn't be).

difference between lock, memory barrier, semaphore

This article: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf (page 12)
seems to make a difference between a lock and a memory barrier
I would like to know what the difference is between a lock, memory barrier, and a semaphore?
(While other questions might mention the difference between a lock and a synchronisation object, I found none about the difference between a lock and a memory barrier)
A memory barrier (also known as a fence) is a hardware operation which ensures the ordering of different reads and writes to the globally visible store. On a typical modern processor, memory accesses are pipelined, and may occur out of order. A memory barrier ensures that this doesn't happen. A full memory barrier will ensure that all loads and stores which precede it occur before any load or store which follows it. (Many processors support partial barriers; e.g. on a Sparc, a membar #StoreStore ensures that all stores which occur before it will be visible to all other processes before any store which occurs after it.)
That's all a memory barrier does. It doesn't block the thread, or anything.
Mutexes and semaphores are higher level primitives, implemented in the operating system. A thread which requests a mutex lock will block, and have its execution suspended by the OS, until that mutex is free. The kernel code in the OS will contain memory barrier instructions in order to implement a mutex, but it does much more; a memory barrier instruction will suspend the hardware execution (all threads) until the necessary conditions have been met—a microsecond or so at the most, and the entire processor stops for this time. When you try to lock a mutex, and another thread already has it, the OS will suspend your thread (and only your thread—the processor continues to execute other threads) until whoever holds the mutex frees it, which could be seconds, minutes or even days. (Of course, if it's more than a few hundred milliseconds, it's probably a bug.)
Finally, there's not really much difference between semaphores and mutexes; a mutex can be considered a semaphore with a count of one.
A memory barrier is a method to order memory access. Compilers and CPUs can change this order to optimize, but in multithreaded environments this can be an issue. The main difference from the others is that threads are not stopped by it.
A lock or mutex makes sure that code can only be accessed by one thread at a time. Within this section, you can view the environment as single-threaded, so memory barriers should not be needed.
A semaphore is basically a counter that can be increased (v()) or decreased (p()). If the counter is 0, then p() halts the thread until the counter is no longer 0. This is a way to synchronize threads, but I would prefer using mutexes or condition variables (controversial, but that's my opinion). When the initial counter is 1, the semaphore is called a binary semaphore and it is similar to a lock.
A big difference between locks and semaphores is that the thread owns the lock, so no other thread should try to unlock, while this is not the case for semaphores.
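To make the counter description concrete, here is a rough sketch of a counting semaphore built from a mutex and a condition variable (not a production implementation); with an initial count of 1 it behaves like the binary semaphore described above:

#include <condition_variable>
#include <mutex>

class Semaphore {
public:
    explicit Semaphore(unsigned initial) : count_(initial) {}

    void p() {                       // wait / decrement
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }

    void v() {                       // signal / increment
        std::lock_guard<std::mutex> lk(m_);
        ++count_;
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_;
};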
Simple explanation for now.
Lock
Is an atomic test for whether this piece of code can proceed
lock (myObject)
{
    // Stuff to do when I acquire the lock
}
This is normally a single CPU instruction that tests and sets a variable as a single atomic operation. More here, http://en.wikipedia.org/wiki/Test-and-set#Hardware_implementation_of_test-and-set_2
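In C++, that test-and-set primitive is exposed portably through std::atomic_flag; a minimal spinlock sketch (hypothetical helper names, not tied to any particular platform):

#include <atomic>

// Minimal spinlock built on the test-and-set primitive described above.
std::atomic_flag spin_flag = ATOMIC_FLAG_INIT;

void spin_lock() {
    // test_and_set atomically sets the flag and returns its previous value;
    // keep spinning until the previous value was "clear".
    while (spin_flag.test_and_set(std::memory_order_acquire)) {
        // busy-wait
    }
}

void spin_unlock() {
    spin_flag.clear(std::memory_order_release);
}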
Memory Barrier
Is a hint to the processor that it can not execute these instructions out of order. Without it, the instructions could be executed out of order; in double-checked locking, for example, the two null checks could be executed before the lock has been taken.
public static Singleton Instance()
{
    Thread.MemoryBarrier();
    if (_singletonInstance == null)
    {
        lock (myObject)
        {
            if (_singletonInstance == null)
            {
                _singletonInstance = new Singleton();
            }
        }
    }
    return _singletonInstance;
}
These are also CPU instructions that implement memory barriers, explicitly telling a CPU that it can not execute things out of order.
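For comparison, in C++11 the same double-checked pattern is usually written with std::atomic, so the acquire load and release store supply the barriers (a sketch, assuming a Singleton class of your own):

#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance() {
        Singleton* p = instance_.load(std::memory_order_acquire);    // first check, no lock
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(mutex_);
            p = instance_.load(std::memory_order_relaxed);           // second check, under the lock
            if (p == nullptr) {
                p = new Singleton();
                instance_.store(p, std::memory_order_release);       // publish the fully constructed object
            }
        }
        return p;
    }
private:
    Singleton() = default;
    static std::atomic<Singleton*> instance_;
    static std::mutex mutex_;
};

std::atomic<Singleton*> Singleton::instance_{nullptr};
std::mutex Singleton::mutex_;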
Semaphores
Are similar to locks, except they normally allow more than one thread in; i.e. if you can handle 10 concurrent disk reads, for example, you'd use a semaphore. Depending on the processor this is either its own instruction, or a test-and-set instruction with a load more work around interrupts (e.g. on ARM).

Memory barriers vs. interlocked operations

I am trying to improve my understanding of memory barriers. Suppose we have a weak memory model and we adapt Dekker's algorithm. Is it possible to make it work correctly under the weak memory model by adding memory barriers?
I think the answer is a surprising no. The reason (if I am correct) is that although a memory barrier can be used to ensure that a read is not moved past another, it cannot ensure that a read does not see stale data (such as that in a cache). Thus it could see some time in the past when the critical section was unlocked (per the CPU's cache) but at the current time other processors might see it as locked. If my understanding is correct, one must use interlocked operations such as those commonly called test-and-set or compare-and-swap to ensure synchronized agreement of a value at a memory location among multiple processors.
Thus, can we rightly expect that no weak memory model system would provide only memory barriers? The system must supply operations like test-and-set or compare-and-swap to be useful.
I realize that popular processors, including x86, provide memory models much stronger than a weak memory model. Please focus the discussion on weak memory models.
(If Dekker's algorithm is a poor choice, choose another mutual exclusion algorithm where memory barriers can successfully achieve correct synchronization, if possible.)
You are right that a memory barrier cannot ensure that a read sees up-to-date values. What it does do is enforce an ordering between operations, both on a single thread, and between threads.
For example, if thread A does a series of stores and then executes a release barrier before a final store to a flag location, and thread B reads from the flag location and then executes an acquire barrier before reading the other values, then the other variables will have the values stored by thread A:
// initially x=y=z=flag=0
// thread A
x=1;
y=2;
z=3;
release_barrier();
flag=1;
// thread B
while(flag==0) ; // loop until flag is 1
acquire_barrier();
assert(x==1); // asserts will not fire
assert(y==2);
assert(z==3);
Of course, you need to ensure that the load and store to flag is atomic (which simple load and store instructions are on most common CPUs, provided the variables are suitably aligned). Without the while loop on thread B, thread B may well read a stale value (0) for flag, and thus you cannot guarantee any of the values read for the other variables.
Fences can thus be used to enforce synchronization in Dekker's algorithm.
Here's an example implementation in C++ (using C++0x atomic variables):
std::atomic<bool> flag0(false), flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false, std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false, std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);

    // critical section

    turn.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false, std::memory_order_relaxed);
}
For a full analysis see my blog entry at http://www.justsoftwaresolutions.co.uk/threading/implementing_dekkers_algorithm_with_fences.html
Say you put in a load and store barrier after every statement, and in addition you ensured that the compiler didn't reorder your stores. Wouldn't this, on any reasonable architecture, provide strict consistency? Dekker's works on sequentially consistent architectures. Sequential consistency is a weaker condition than strict consistency.
http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
Even on a CPU that has a weak consistency model, you'd still expect cache coherence. I think that where things get derailed is the behavior of store buffers and speculated reads, and what operations are available to flush stored writes and invalidate speculated reads. If there isn't a load fence that can invalidate speculated reads, or there isn't a write fence that flushes the store buffer, then in addition to not being able to implement Dekker's, you won't be able to implement a mutex!
So here's my claim. If you have a write barrier available, and a read barrier, and the cache is coherent between CPUs, then you can trivially make all code sequentially consistent by flushing writes (store fence) after every instruction, and flushing speculations (read fence) before every instruction. So I claim that you don't need atomics for what you're talking about, and that you can do what you need with Dekker's only. Sure you wouldn't want to.
BTW, Corensic, the company I work for, writes cool tools for debugging concurrency issues. Check out http://www.corensic.com.
Some barriers (such as the powerpc isync, and a .acq load on ia64) also have an effect on subsequent loads. i.e.: if a load was available before the isync due to prefetching, it must be discarded. When used appropriately, perhaps that's enough to make Dekker's algorithm work on a weak memory model.
You've also got cache invalidation logic working for you. If you know that your load is current due to something like an isync and that the cached version of the data is invalidated if another cpu touches it, is that enough?
Interesting questions aside, Dekker's algorithm is for all practical purposes dumb. You are going to want to use atomic hardware interfaces and memory barriers for any real application, so focusing on how to fix up Dekker's with atomics just doesn't seem worthwhile to me;)

Design options for a C++ thread-safe object cache

I'm in the process of writing a template library for data caching in C++ where concurrent reads can be done, and concurrent writes too, but not for the same key. The pattern can be explained with the following environment:
A mutex for the cache write.
A mutex for each key in the cache.
This way, if a thread requests a key from the cache and it is not present, it can start a locked calculation for that unique key. In the meantime other threads can retrieve or calculate data for other keys, but a thread that tries to access the first key gets blocked waiting.
The main constraints are:
Never calculate the value for a key at the same time.
Calculating the value for 2 different keys can be done concurrently.
Data retrieval must not block other threads from retrieving data for other keys.
My other constraints, already resolved, are:
fixed (known at compile time) maximum cache size with MRU-based (most recently used) thrashing.
retrieval by reference (implies mutexed shared counting)
I'm not sure using one mutex for each key is the right way to implement this, but I didn't find any other substantially different way.
Do you know of other patterns to implement this, or do you find this a suitable solution? I don't like the idea of having about 100 mutexes. (The cache size is around 100 keys.)
You want to lock and you want to wait. Thus there shall be "conditions" somewhere (as pthread_cond_t on Unix-like systems).
I suggest the following:
There is a global mutex which is used only to add or remove keys in the map.
The map maps keys to values, where values are wrappers. Each wrapper contains a condition and potentially a value. The condition is signaled when the value is set.
When a thread wishes to obtain a value from the cache, it first acquires the global mutex. It then looks in the map:
If there is a wrapper for that key, and that wrapper contains a value, then the thread has its value and may release the global mutex.
If there is a wrapper for that key but no value yet, then this means that some other thread is currently busy computing the value. The thread then blocks on the condition, to be awakened by the other thread when it has finished.
If there is no wrapper, then the thread registers a new wrapper in the map, and then proceeds to computing the value. When the value is computed, it sets the value and signals the condition.
In pseudo code this looks like this:
mutex_t global_mutex
hashmap_t map

lock(global_mutex)
w = map.get(key)
if (w == NULL) {
    w = new Wrapper
    map.put(key, w)
    unlock(global_mutex)
    v = compute_value()
    lock(global_mutex)
    w.set(v)
    signal(w.cond)
    unlock(global_mutex)
    return v
} else {
    v = w.get()
    while (v == NULL) {
        unlock-and-wait(global_mutex, w.cond)
        v = w.get()
    }
    unlock(global_mutex)
    return v
}
In pthreads terms, lock is pthread_mutex_lock(), unlock is pthread_mutex_unlock(), unlock-and-wait is pthread_cond_wait() and signal is pthread_cond_signal(). unlock-and-wait atomically releases the mutex and marks the thread as waiting on the condition; when the thread is awakened, the mutex is automatically reacquired.
This means that each wrapper will have to contain a condition. This embodies your various requirements:
No thread holds a mutex for a long period of time (whether blocking or computing a value).
When a value is to be computed, only one thread does it, the other threads which wish to access the value just wait for it to be available.
Note that when a thread wishes to get a value and finds out that some other thread is already busy computing it, the thread ends up locking the global mutex twice: once in the beginning, and once when the value is available. A more complex solution, with one mutex per wrapper, may avoid the second locking, but unless contention is very high, I doubt that it is worth the effort.
About having many mutexes: mutexes are cheap. A mutex is basically an int; it costs nothing more than the four-or-so bytes of RAM used to store it. Beware of Windows terminology: in Win32, what I call here a mutex is deemed an "interlocked region"; what Win32 creates when CreateMutex() is called is something quite different, which is accessible from several distinct processes, and is much more expensive since it involves roundtrips to the kernel. Note that in Java, every single object instance contains a mutex, and Java developers do not seem to be overly grumpy on that subject.
You could use a mutex pool instead of allocating one mutex per resource. As reads are requested, first check the slot in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that slot and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
One possibility, which would be a much simpler solution, would be to use a single reader/writer lock on the entire cache. Given that you know there is a maximum number of entries (and it is relatively small), it sounds like adding new keys to the cache is a "rare" event. The general logic would be:
acquire read lock
search for key
if found
use the key
else
release read lock
acquire write lock
add key
release write lock
// acquire the read lock again and use it (probably encapsulate in a method)
endif
Not knowing more about the usage patterns, I can't say for sure if this is a good solution. It is very simple, though, and if the usage is predominantly reads, then it is very inexpensive in terms of locking.
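A rough C++17 sketch of that reader/writer-lock flow, using std::shared_mutex (hypothetical names; for simplicity the value is computed while holding the write lock, which gives up the "two different keys can be computed concurrently" property discussed earlier):

#include <map>
#include <shared_mutex>
#include <string>

class Cache {
public:
    std::string get(const std::string& key) {
        {
            std::shared_lock<std::shared_mutex> read_lock(rw_);   // many readers may hold this at once
            auto it = map_.find(key);
            if (it != map_.end())
                return it->second;
        }                                                          // release read lock
        std::unique_lock<std::shared_mutex> write_lock(rw_);      // exclusive
        auto it = map_.find(key);                                  // re-check: another writer may have added it
        if (it == map_.end())
            it = map_.emplace(key, compute(key)).first;
        return it->second;
    }

private:
    std::string compute(const std::string& key);                   // hypothetical expensive calculation
    std::shared_mutex rw_;
    std::map<std::string, std::string> map_;
};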