How many mutex and cond variable? - c++

I am wokring on pthread pool and there will be five separate thread and one queue. All the five threads are competing to get a job from the queue and I know the basic idea that I need to do lock/unlock and wait/signal.
But I am not sure how many mutex and cond variable I should have. Right now I have only one mutex and cond variable and all five thread will use it.

One mutex and at least one condition variable.
One mutex because there is one 'thing' (i.e. piece of memory) to synchronize access to: the shared state between all workers and the thread pushing the work.
One condition variable per, well, condition that one or more threads need to wait on. At the very least you need one condition variable for waiting on new jobs, the condition here being: "is there more stuff to do?" (or the converse: "is the work queue empty?").
A somewhat more substantive answer would be that there is a one-to-many relationship between a mutex and associated condition variables, and a one-to-one relationship between shared states and mutexes. From what you've told us and since you're learning, I recommend using only one shared state for your design. When or if you need more that one state I'd recommend looking for some higher level concepts (e.g. channels, futures/promises) to build up on abstraction.
In any case, don't use the same condition variable with different mutexes.

To elaborate on #Ivan's solution...
Instead of a mutex + condition variable, you can use a counting semaphore + atomic operations to create a very efficient queue.
semaphore dequeue_sem = 0;
semaphore enqueue_sem = 37; // or however large you want to bound your queue
Enqueue operation is just:
wait_for(enqueue_sem)
atomic_add_to_queue(element)
signal(dequeue_sem)
Dequeue operation is:
wait_for(dequeue_sem)
element = atomic_remove_from_queue()
signal(enqueue_sem)
The "atomic_add_to_queue" and "atomic_remove_from_queue" are typicaly implemented using atomic compare&exchange in a tight loop.
In addition to its symmetry, this formulation bounds the maximum size of the queue; if a thread calls enqueue() on a full queue, it will block. This is almost certainly what you want for any queue in a multi-threaded environment. (Your computer has finite memory; consuming it without bound should be avoided when possible.)
If you do stick with a mutex and condition variables, you want two conditions, one for enqueue to wait on (and deque to signal) and one for the other way around. The conditions mean "queue not full" and "queue not empty", respectively, and the enqueue/dequeue code is similarly symmetric.

I think that you could do stealing work from queue without locking at all via Interlocked operations if you organize it as stack/linked list (it will require semaphore instead of condition variable to prevent problem described in comments to this answer).
Pseudo-code is like that:
candidate = head.
if (candidate == null) wait_for_semaphore;
if (candidate == InterlockedCompareExchange(head, candidate->next, candidate)) perform_work(candidate->data);
else goto 1;
Of course, adding work to queue should be also via InterlockedCompareExchange in this case and signaling the semaphore.

Related

Lock-free producer consumer [duplicate]

Anecdotally, I've found that a lot of programmers mistakenly believe that "lock-free" simply means "concurrent programming without mutexes". Usually, there's also a correlated misunderstanding that the purpose of writing lock-free code is for better concurrent performance. Of course, the correct definition of lock-free is actually about progress guarantees. A lock-free algorithm guarantees that at least one thread is able to make forward progress regardless of what any other threads are doing.
This means a lock-free algorithm can never have code where one thread is depending on another thread in order to proceed. E.g., lock-free code can not have a situation where Thread A sets a flag, and then Thread B keeps looping while waiting for Thread A to unset the flag. Code like that is basically implementing a lock (or what I would call a mutex in disguise).
However, other cases are more subtle and there are some cases where I honestly can't really tell if an algorithm qualifies as lock-free or not, because the notion of "making progress" sometimes appears subjective to me.
One such case is in the (well-regarded, afaik) concurrency library, liblfds. I was studying the implementation of a multi-producer/multi-consumer bounded queue in liblfds - the implementation is very straightforward, but I cannot really tell if it should qualify as lock-free.
The relevant algorithm is in lfds711_queue_bmm_enqueue.c. Liblfds uses custom atomics and memory barriers, but the algorithm is simple enough for me to describe in a paragraph or so.
The queue itself is a bounded contiguous array (ringbuffer). There is a shared read_index and write_index. Each slot in the queue contains a field for user-data, and a sequence_number value, which is basically like an epoch counter. (This avoids ABA issues).
The PUSH algorithm is as follows:
Atomically LOAD the write_index
Attempt to reserve a slot in the queue at write_index % queue_size using a CompareAndSwap loop that attempts to set write_index to write_index + 1.
If the CompareAndSwap is successful, copy the user data into the
reserved slot.
Finally, update the sequence_index on the
slot by making it equal to write_index + 1.
The actual source code uses custom atomics and memory barriers, so for further clarity about this algorithm I've briefly translated it into (untested) standard C++ atomics for better readability, as follows:
bool mcmp_queue::enqueue(void* data)
{
int write_index = m_write_index.load(std::memory_order_relaxed);
for (;;)
{
slot& s = m_slots[write_index % m_num_slots];
int sequence_number = s.sequence_number.load(std::memory_order_acquire);
int difference = sequence_number - write_index;
if (difference == 0)
{
if (m_write_index.compare_exchange_weak(
write_index,
write_index + 1,
std::memory_order_acq_rel
))
{
break;
}
}
if (difference < 0) return false; // queue is full
}
// Copy user-data and update sequence number
//
s.user_data = data;
s.sequence_number.store(write_index + 1, std::memory_order_release);
return true;
}
Now, a thread that wants to POP an element from the slot at read_index will not be able to do so until it observes that the slot's sequence_number is equal to read_index + 1.
Okay, so there are no mutexes here, and the algorithm likely performs well (it's only a single CAS for PUSH and POP), but is this lock-free? The reason it's unclear to me is because the definition of "making progress" seems murky when there is the possibility that a PUSH or POP can always just fail if the queue is observed to be full or empty.
But what's questionable to me is that the PUSH algorithm essentially reserves a slot, meaning that the slot can never be POP'd until the push thread gets around to updating the sequence number. This means that a POP thread that wants to pop a value depends on the PUSH thread having completed the operation. Otherwise, the POP thread will always return false because it thinks the queue is EMPTY. It seems debatable to me whether this actually falls within the definition of "making progress".
Generally, truly lock-free algorithms involve a phase where a pre-empted thread actually tries to ASSIST the other thread in completing an operation. So, in order to be truly lock-free, I would think that a POP thread that observes an in-progress PUSH would actually need to try and complete the PUSH, and then only after that, perform the original POP operation. If the POP thread simply returns that the queue is EMPTY when a PUSH is in progress, the POP thread is basically blocked until the PUSH thread completes the operation. If the PUSH thread dies, or goes to sleep for 1,000 years, or otherwise gets scheduled into oblivion, the POP thread can do nothing except continuously report that the queue is EMPTY.
So does this fit the defintion of lock-free? From one perspective, you can argue that the POP thread can always make progress, because it can always report that the queue is EMPTY (which is at least some form of progress I guess.) But to me, this isn't really making progress, since the only reason the queue is observed as empty is because we are blocked by a concurrent PUSH operation.
So, my question is: is this algorithm truly lock-free? Or is the index reservation system basically a mutex in disguise?
This queue data structure is not strictly lock-free by what I consider the most reasonable definition. That definition is something like:
A structure is lock-free if only if any thread can be indefinitely
suspended at any point while still leaving the structure usable by the
remaining threads.
Of course this implies a suitable definition of usable, but for most structures this is fairly simple: the structure should continue to obey its contracts and allow elements to be inserted and removed as expected.
In this case a thread that has succeeded in incrementing m_write_increment, but hasn't yet written s.sequence_number leaves the container in what will soon be an unusable state. If such a thread is killed, the container will eventually report both "full" and "empty" to push and pop respectively, violating the contract of a fixed size queue.
There is a hidden mutex here (the combination of m_write_index and the associated s.sequence_number) - but it basically works like a per-element mutex. So the failure only becomes apparent to writers once you've looped around and a new writer tries to get the mutex, but in fact all subsequent writers have effectively failed to insert their element into the queue since no reader will ever see it.
Now this doesn't mean this is a bad implementation of a concurrent queue. For some uses it may behave mostly as if it was lock free. For example, this structure may have most of the useful performance properties of a truly lock-free structure, but at the same time it lacks some of the useful correctness properties. Basically the term lock-free usually implies a whole bunch of properties, only a subset of which will usually be important for any particular use. Let's look at them one by one and see how this structure does. We'll broadly categorize them into performance and functional categories.
Performance
Uncontended Performance
The uncontended or "best case" performance is important for many structures. While you need a concurrent structure for correctness, you'll usually still try to design your application so that contention is kept to a minimum, so the uncontended cost is often important. Some lock-free structures help here, by reducing the number of expensive atomic operations in the uncontended fast-path, or avoiding a syscall.
This queue implementation does a reasonable job here: there is only a single "definitely expensive" operation: the compare_exchange_weak, and a couple of possibly expensive operations (the memory_order_acquire load and memory_order_release store)1, and little other overhead.
This compares to something like std::mutex which would imply something like one atomic operation for lock and another for unlock, and in practice on Linux the pthread calls have non-negligible overhead as well.
So I expect this queue to perform reasonably well in the uncontended fast-path.
Contended Performance
One advantage of lock-free structures is that they often allow better scaling when a structure is heavily contended. This isn't necessarily an inherent advantage: some lock-based structures with multiple locks or read-write locks may exhibit scaling that matches or exceeds some lock-free approaches, but it is usually that case that lock-free structures exhibit better scaling that a simple one-lock-to-rule-them-all alternative.
This queue performs reasonably in this respect. The m_write_index variable is atomically updated by all readers and will be a point of contention, but the behavior should be reasonable as long as the underlying hardware CAS implementation is reasonable.
Note that a queue is generally a fairly poor concurrent structure since inserts and removals all happen at the same places (the head and the tail), so contention is inherent in the definition of the structure. Compare this to a concurrent map, where different elements have no particular ordered relationship: such a structure can offer efficient contention-free simultaneous mutation if different elements are being accessed.
Context-switch Immunity
One performance advantage of lock-free structures that is related to the core definition above (and also to the functional guarantees) is that a context switch of a thread which is mutating the structure doesn't delay all the other mutators. In a heavily loaded system (especially when runnable threads >> available cores), a thread may be switched out for hundreds of milliseconds or seconds. During this time, any concurrent mutators will block and incur additional scheduling costs (or they will spin which may also produce poor behavior). Even though such "unluckly scheduling" may be rare, when it does occur the entire system may incur a serious latency spike.
Lock-free structures avoid this since there is no "critical region" where a thread can be context switched out and subsequently block forward progress by other threads.
This structure offers partial protection in this area — the specifics of which depend on the queue size and application behavior. Even if a thread is switched out in the critical region between the m_write_index update and the sequence number write, other threads can continue to push elements to the queue as long as they don't wrap all the way around to the in-progress element from the stalled thread. Threads can also pop elements, but only up to the in-progress element.
While the push behavior may not be a problem for high-capacity queues, the pop behavior can be a problem: if the queue has a high throughput compared to the average time a thread is context switched out, and the average fullness, the queue will quickly appear empty to all consumer threads, even if there are many elements added beyond the in-progress element. This isn't affected by the queue capacity, but simply the application behavior. It means that the consumer side may completely stall when this occurs. In this respect, the queue doesn't look very lock-free at all!
Functional Aspects
Async Thread Termination
On advantage of lock-free structures it they are safe for use by threads that may be asynchronously canceled or may otherwise terminate exceptionally in the critical region. Cancelling a thread at any point leaves the structure is a consistent state.
This is not the case for this queue, as described above.
Queue Access from Interrupt or Signal
A related advantage is that lock-free structures can usually be examined or mutated from an interrupt or signal. This is useful in many cases where an interrupt or signal shares a structure with regular process threads.
This queue mostly supports this use case. Even if the signal or interrupt occurs when another thread is in the critical region, the asynchronous code can still push an element onto the queue (which will only be seen later by consuming threads) and can still pop an element off of the queue.
The behavior isn't as complete as a true lock-free structure: imagine a signal handler with a way to tell the remaining application threads (other than the interrupted one) to quiesce and which then drains all the remaining elements of the queue. With a true lock-free structure, this would allow the signal handler to full drain all the elements, but this queue might fail to do that in the case a thread was interrupted or switched out in the critical region.
1 In particular, on x86, this will only use an atomic operation for the CAS as the memory model is strong enough to avoid the need for atomics or fencing for the other operations. Recent ARM can do acquire and release fairly efficiently as well.
I am the author of liblfds.
The OP is correct in his description of this queue.
It is the single data structure in the library which is not lock-free.
This is described in the documentation for the queue;
http://www.liblfds.org/mediawiki/index.php?title=r7.1.1:Queue_%28bounded,_many_producer,_many_consumer%29#Lock-free_Specific_Behaviour
"It must be understood though that this is not actually a lock-free data structure."
This queue is an implementation of an idea from Dmitry Vyukov (1024cores.net) and I only realised it was not lock-free while I was making the test code work.
By then it was working, so I included it.
I do have some thought to remove it, since it is not lock-free.
Most of the time people use lock-free when they really mean lockless. lockless means a data-structure or algorithm that does not use locks, but there is no guarantee for forward progress. Also check this question. So the queue in liblfds is lockless, but as BeeOnRope mentioned is not lock-free.
A thread that calls POP before the next update in sequence is complete is NOT "effectively blocked" if the POP call returns FALSE immediately. The thread can go off and do something else. I'd say that this queue qualifies as lock-free.
However, I wouldn't say that it qualifies as a "queue" -- at least not the kind of queue that you could publish as a queue in a library or something -- because it doesn't guarantee a lot of the behaviors that you can normally expect from a queue. In particular, you can PUSH and element and then try and FAIL to POP it, because some other thread is busy pushing an earlier item.
Even so, this queue could still be useful in some lock-free solutions for various problems.
For many applications, however, I would worry about the possibility for consumer threads to be starved while a producer thread is pre-empted. Maybe liblfds does something about that?
"Lock-free" is a property of the algorithm, which implements some functionality. The property doesn't correlate with a way, how given functionality is used by a program.
When talk about mcmp_queue::enqueue function, which returns FALSE if underlying queue is full, its implementation (given in the question post) is lock-free.
However, implementing mcmp_queue::dequeue in lock-free manner would be difficult. E.g., this pattern is obviously not-lock free, as it spins on the variable changed by other thread:
while(s.sequence_number.load(std::memory_order_acquire) == read_index);
data = s.user_data;
...
return data;
I did formal verification on this same code using Spin a couple years ago for a course in concurrency testing and it is definitely not lock-free.
Just because there is no explicit "locking", doesn't mean it's lock-free. When it comes to reasoning about progress conditions, think of it from an individual thread's perspective:
Blocking/locking: if another thread gets descheduled and this can block my progress, then it is blocking.
Lock-free/non-blocking: if I am able to eventually make progress in the absence of contention from other threads, then it is at most lock-free.
If no other thread can block my progress indefinitely, then it is wait-free.

Mutex granularity

I have a question regarding threads. It is known that basically when we call for mutex(lock) that means that thread keeps on executing the part of code uninterrupted by other threads until it meets mutex(unlock). (At least that's what they say in the book) So my question is if it is actually possible to have several scoped WriteLocks which do not interfere with each other. For example something like this:
If I have a buffer with N elements without any new elements coming, however with high frequency updates (like change value of Kth element) is it possible to set a different lock on each element so that the only time threads would stall and wait is if actually 2 or more threads are trying to update the same element?
To answer your question about N mutexes: yes, that is indeed possible. What resources are protected by a mutex depends entirely on you as the user of that mutex.
This leads to the first (statement) part of your question. A mutex by itself does not guarantee that a thread will work uninterrupted. All it guarantees is MUTual EXclusion - if thread B attempts to lock a mutex which thread A has locked, thread B will block (execute no code) until thread A unlocks the mutex.
This means mutexes can be used to guarantee that a thread executes a block of code uninterrupted; but this works only if all threads follow the same mutex-locking protocol around that block of code. Which means it is your responsibility to assign semantics (or meaning) to each individual mutex, and correctly adhere to those semantics in your code.
If you decide for the semantics to be "I have an array a of N data elements and an array m of N mutexes, and accessing a[i] can only be done when m[i] is locked," then that's how it will work.
The need to consistently stick to the same protocol is why you should generally encapsulate the mutex and the code/data protected by it in a class in some way or another, so that outside code doesn't need to know the details of the protocol. It just knows "call this member function, and the synchronisation will happen automagically." This "automagic" will be the class correcrtly implementing the protocol.
A crucial consideration when deciding between a mutex per array and a mutex per element is whether there are operations - like tracking the number of "in-use" array elements, the "active" element, or moving a pointer-to-array to a larger buffer - that can only be done safely by one thread while all the others are blocked.
A lesser but sometimes important consideration is the amount of extra memory more mutexes use.
If you genuinely need to do this kind of update as quickly as possible in a highly contested multi-threaded program, you may also want to learn about lock-free atomic types and their compare-and-swap / exchange operations, but I'd recommend against considering that unless profiling the existing locking is significant in your overall program performance.
A mutex does not stop other threads from running completely, it only stops other threads from locking the same mutex. I.e. while one thread is keeping the mutex locked, the operating system continues to do context switches letting other threads run also, but if any other thread is trying to lock the same mutex its execution will be halted until the mutex is unlocked.
So yes, you can indeed have several different mutexes and lock/unlock them independently. Just beware of deadlocks, i.e. if one thread can lock more than one mutex at a time you can run into a situation where thread 1 has locked mutex A and is trying to lock mutex B but blocks because thread 2 already has mutex B locked and it is trying to lock mutex A..
Its not completely clear that your use case is:
the threads gets a buffer assigned on that they have to work
the threads have some results and request a special buffer to update.
On the first variant you need some assignment logic that assigns a buffer to a thread.
This logic has to be exectued in an atomic way. so the best is to use a mutex to protect the assignment logic.
On the other variant it may be the best to have a vector of mutexes, one for each buffer element.
In Both cases the buffer does not need a protection because it (or better each field of it) is only accessed from one thread at a time.
You also may inform yourself about 'semaphores'. These contain a counter that allows to manage ressources that have a limited amount but more than one. Mutexes are a special case of semaphores with n=1.
You can have mutex per entry, C++11 mutex can be easily converted into an adaptive-spinlock, so you can achieve good CPU/Latency tradeoff.
Or, if you need very low latency yet have enough CPUs you can use an atomic "busy" flag per entry and spin in a tight compare-exchange loop on contention.
From experience, though, the best performance and scalability are achieved when concurrent writes are serialized via a command queue (or a queue of smaller immutable buffers to be concatenated at destination) and a single thread processing the queue.

Concurrency with a producer/consumer (sort of) and STL

I have the following situation: I have two threads
thread1, which is a worker thread that executes an algorithm until its input list size is > 0
thread2, which is asynchronous (user driven) and can add elements to the input list to be processed
Now, thread1 loop does something similar to the following
list input_list
list buffer_list
if (input_list.size() == 0)
sleep
while (input_list.size() > 0) {
for (item in input_list) {
process(item);
possibly add items to buffer_list
}
input_list = buffer_list (or copy it)
buffer_list = new list (or empty it)
sleep X ms (100 < X < 500, still have to decide)
}
Now thread2 will just add elements to buffer_list (which will be the next pass of the algorithm) and possibly manage to awake thread1 if it was stopped.
I'm trying to understand which multithread issues can occur in this situation, assuming that I'm programming it into C++ with aid of STL (no assumption on thread-safety of the implementation), and I have of course access to standard library (like mutex).
I would like to avoid any possible delay with thread2, since it's bound to user interface and it would create delays. I was thinking about using 3 lists to avoid synchronization issues but I'm not real sure so far. I'm still unsure either if there is a safer container within STL according to this specific situation.. I don't want to just place a mutex outside everything and lose so much performance.
Any advice would be very appreciated, thanks!
EDIT:
This is what I managed so far, wondering if it's thread safe and enough efficient:
std::set<Item> *extBuffer, *innBuffer, *actBuffer;
void thread1Function()
{
actBuffer->clear();
sem_wait(&mutex);
if (!extBuffer->empty())
std::swap(actBuffer, extBuffer);
sem_post(&mutex);
if (!innBuffer->empty())
{
if (actBuffer->empty())
std::swap(innBuffer, actBuffer);
else if (!innBuffer->empty())
actBuffer->insert(innBuffer->begin(), innBuffer->end());
}
if (!actBuffer->empty())
{
set<Item>::iterator it;
for (it = actBuffer.begin; it != actBuffer.end(); ++it)
// process
// possibly innBuffer->insert(...)
}
}
void thread2Add(Item item)
{
sem_wait(&mutex);
extBuffer->insert(item);
sem_post(&mutex);
}
Probably I should open another question
If you are worried about thread2 being blocked for a long time because thread1 is holding on to the lock, then make sure that thread1 guarentees to only take the lock for a really short time.
This can be easily accomplished if you have two instances of buffer list. So your attempt is already in the right direction.
Each buffer is pointed to with a pointer. One pointer you use to insert items into the list (thread2) and the other pointer is used to process the items in the other list (thread1). The insert operation of thread2 must be surrounded by a lock.
If thread1 is done processing all the items it only has to swap the pointers (e.g. with std::swap) this is a very quick operation which must be surrounded by a lock. Only the swap operation though. The actual processing of the items is lock-free.
This solution has the following advantages:
The lock in thread1 is always very short, so the amount of time that it may block thread2 is minimal
No constant dynamic allocation of buffers, which is faster and less likely to cause memory leak bugs.
You just need a mutex around inserting, removing, or when accessing the size of the container. You could develop a class to encapsulate the container and that would have the mutex. This would keep things simple and the class would handle the functionally of using he mutex. If you limit the items accessing (what is exposed for functions/interfaces) and make the functions small (just calling the container classes function enacapsulated in the mutex), they will return relatively quickly. You should need only on list for this in that case.
Depending on the system, if you have semaphore available, you may want to check and see if they are more efficent and use them instead of the mutex. Same concept applies, just in a different manner.
You may want to check in the concept of guards in case one of the threads dies, you would not get a deadlock condition.

How do I safely read a variable from one thread and modify it from another?

I have a class instances which is being used in multiple threads. I am updating multiple member variables from one thread and reading the same member variables from one thread. What is the correct way to maintain the thread safety?
eg:
phthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
//unlock here?
Should I unlock the mutex over here? ( if another thread access the obj1 member variables 1 and 2 now, the accessed data might not be correct because memberV2 has not yet be updated. However, if I does not release the lock, the other thread might block because there is time consuming operation below.
//perform some time consuming operation which must be done before the assignment to memberV2 and after the assignment to memberV1
obj1.memberV2 = update field 2 from some calculation
pthread_mutex_unlock(&mutex1) //should I only unlock here?
Thanks
Your locking is correct. You should not release the lock early just to allow another thread to proceed (because that would allow the other thread to see the object in an inconsistent state.)
Perhaps it would be better to do something like:
//perform time consuming calculation
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = result;
pthread_mutex_unlock(&mutex1)
This of course assumes that the values used in the calculation won't be modified on any other thread.
Its hard to tell what you are doing that is causing problems. The mutex pattern is pretty simple. You Lock the mutex, access the shared data, unlock the mutex. This protects data, becuase the mutex will only let one thread get the lock at a time. Any thread that fails to get the lock has to wait till the mutex is unlocked. Unlocking wakes the waiters up. They will then fight to attain the lock. Losers go back to sleep. The time it takes to wake up might be multiple ms or more from the time the lock is released. Make sure you always unlock the mutex eventually.
Make sure you don't to keep locks locked for a long period of time. Most of the time, a long period of time is like a micro second. I prefer to keep it down around "a few lines of code." Thats why people have suggested that you do the long running calculation outside the lock. The reason for not keeping locks a long time is you increase the number of times other threads will hit the lock and have to spin or sleep, which decreases performance. You also increase the probability that your thread might be pre-empted while owning the lock, which means the lock is enabled while that thread sleeps. Thats even worse performance.
Threads that fail a lock dont have to sleep. Spinning means a thread encountering a locked mutex doesn't sleep, but loops repeatedly testing the lock for a predefine period before giving up and sleeping. This is a good idea if you have multiple cores or cores capable of multiple simultaneous threads. Multiple active threads means two threads can be executing the code at the same time. If the lock is around a small amount of code, then the thread that got the lock is going to be done real soon. the other thread need only wait a couple nano secs before it will get the lock. Remember, sleeping your thread is a context switch and some code to attach your thread to the waiters on the mutex, all have costs. Plus, once your thread sleeps, you have to wait for a period of time before the scheduler wakes it up. that could be multiple ms. Lookup spinlocks.
If you only have one core, then if a thread encounters a lock it means another sleeping thread owns the lock and no matter how long you spin it aint gonna unlock. So you would use a lock that sleeps a waiter immediately in hopes that the thread owning the lock will wake up and finish.
You should assume that a thread can be preempted at any machine code instruction. Also you should assume that each line of c code is probably many machine code instructions. The classic example is i++. This is one statement in c, but a read, an increment, and a store in machine code land.
If you really care about performance, try to use atomic operations first. Look to mutexes as a last resort. Most concurrency problems are easily solved with atomic operations (google gcc atomic operations to start learning) and very few problems really need mutexes. Mutexes are way way way slower.
Protect your shared data wherever it is written and wherever it is read. else...prepare for failure. You don't have to protect shared data during periods of time when only a single thread is active.
Its often useful to be able to run your app with 1 thread as well as N threads. This way you can debug race conditions easier.
Minimize the shared data that you protect with locks. Try to organize data into structures such that a single thread can gain exclusive access to the entire structure (perhaps by setting a single locked flag or version number or both) and not have to worry about anything after that. Then most of the code isnt cluttered with locks and race conditions.
Functions that ultimately write to shared variables should use temp variables until the last moment and then copy the results. Not only will the compiler generate better code, but accesses to shared variables especially changing them cause cache line updates between L2 and main ram and all sorts of other performance issues. Again if you don't care about performance disregard this. However i recommend you google the document "everything a programmer should know about memory" if you want to know more.
If you are reading a single variable from the shared data you probably don't need to lock as long as the variable is an integer type and not a member of a bitfield (bitfield members are read/written with multiple instructions). Read up on atomic operations. When you need to deal with multiple values, then you need a lock to make sure you didn't read version A of one value, get preempted, and then read version B of the next value. Same holds true for writing.
You will find that copies of data, even copies of entire structures come in handy. You can be working on building a new copy of the data and then swap it by changing a pointer in with one atomic operation. You can make a copy of the data and then do calculations on it without worrying if it changes.
So maybe what you want to do is:
lock the mutex
Make a copy of the input data to the long running calculation.
unlock the mutex
L1: Do the calculation
Lock the mutex
if the input data has changed and this matters
read the input data, unlock the mutex and go to L1
updata data
unlock mutex
Maybe, in the example above, you still store the result if the input changed, but go back and recalc. It depends if other threads can use a slightly out of date answer. Maybe other threads when they see that a thread is already doing the calculation simply change the input data and leave it to the busy thread to notice that and redo the calculation (there will be a race condition you need to handle if you do that, and easy one). That way the other threads can do other work rather than just sleep.
cheers.
Probably the best thing to do is:
temp = //perform some time consuming operation which must be done before the assignment to memberV2
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = temp; //result from previous calculation
pthread_mutex_unlock(&mutex1)
What I would do is separate the calculation from the update:
temp = some calculation
pthread_mutex_lock(&mutex1);
obj.memberV1 = 1;
obj.memberV2 = temp;
pthread_mutex_unlock(&mutex1);

Design options for a C++ thread-safe object cache

I'm in the process of writing a template library for data-caching in C++ where concurrent read can be done and concurrent write too, but not for the same key. The pattern can be explained with the following environment:
A mutex for the cache write.
A mutex for each key in the cache.
This way if a thread requests a key from the cache and is not present can start a locked calculation for that unique key. In the meantime other threads can retrieve or calculate data for other keys but a thread that tries to access the first key get locked-wait.
The main constraints are:
Never calculate the value for a key at the same time.
Calculating the value for 2 different keys can be done concurrently.
Data-retrieval must not lock other threads from retrieve data from other keys.
My other constraints but already resolved are:
fixed (known at compile time) maximum cache size with MRU-based ( most recently used ) thrashing.
retrieval by reference ( implicate mutexed shared counting )
I'm not sure using 1 mutex for each key is the right way to implement this but i didn't find any other substantially different way.
Do you know of other patterns to implements this or do you find this a suitable solution? I don't like the idea of having about 100 mutexs. ( the cache size is around 100 keys )
You want to lock and you want to wait. Thus there shall be "conditions" somewhere (as pthread_cond_t on Unix-like systems).
I suggest the following:
There is a global mutex which is used only to add or remove keys in the map.
The map maps keys to values, where values are wrappers. Each wrapper contains a condition and potentially a value. The condition is signaled when the value is set.
When a thread wishes to obtain a value from the cache, it first acquires the global mutex. It then looks in the map:
If there is a wrapper for that key, and that wrapper contains a value, then the thread has its value and may release the global mutex.
If there is a wrapper for that key but no value yet, then this means that some other thread is currently busy computing the value. The thread then blocks on the condition, to be awaken by the other thread when it has finished.
If there is no wrapper, then the thread registers a new wrapper in the map, and then proceeds to computing the value. When the value is computed, it sets the value and signals the condition.
In pseudo code this looks like this:
mutex_t global_mutex
hashmap_t map
lock(global_mutex)
w = map.get(key)
if (w == NULL) {
w = new Wrapper
map.put(key, w)
unlock(global_mutex)
v = compute_value()
lock(global_mutex)
w.set(v)
signal(w.cond)
unlock(global_mutex)
return v
} else {
v = w.get()
while (v == NULL) {
unlock-and-wait(global_mutex, w.cond)
v = w.get()
}
unlock(global_mutex)
return v
}
In pthreads terms, lock is pthread_mutex_lock(), unlock is pthread_mutex_unlock(), unlock-and-wait is pthread_cond_wait() and signal is pthread_cond_signal(). unlock-and-wait atomically releases the mutex and marks the thread as waiting on the condition; when the thread is awaken, the mutex is automatically reacquired.
This means that each wrapper will have to contain a condition. This embodies your various requirements:
No threads holds a mutex for a long period of time (either blocking or computing a value).
When a value is to be computed, only one thread does it, the other threads which wish to access the value just wait for it to be available.
Note that when a thread wishes to get a value and finds out that some other thread is already busy computing it, the threads ends up locking the global mutex twice: once in the beginning, and once when the value is available. A more complex solution, with one mutex per wrapper, may avoid the second locking, but unless contention is very high, I doubt that it is worth the effort.
About having many mutexes: mutexes are cheap. A mutex is basically an int, it costs nothing more than the four-or-so bytes of RAM used to store it. Beware of Windows terminology: in Win32, what I call here a mutex is deemed an "interlocked region"; what Win32 creates when CreateMutex() is called is something quite different, which is accessible from several distinct processes, and is much more expensive since it involves roundtrips to the kernel. Note that in Java, every single object instance contains a mutex, and Java developers do not seem to be overly grumpy on that subject.
You could use a mutex pool instead of allocating one mutex per resource. As reads are requested, first check the slot in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that slot and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
One possibility that would be a much simpler solution would be to use a single reader/writer lock on the entire cache. Given that you know there is a maximum number of entries (and it is relatively small), it sounds like adding new keys to the cache is a "rare" event. The general logic would be:
acquire read lock
search for key
if found
use the key
else
release read lock
acquire write lock
add key
release write lock
// acquire the read lock again and use it (probably encapsulate in a method)
endif
Not knowing more about the usage patterns, I can't say for sure if this is a good solution. It is very simple, though, and if the usage is predominantly reads, then it is very inexpensive in terms of locking.