What exactly is the meaning of "wait-free" in boost::lockfree? - c++

I am reading the docs for spsc_queue and even after reading a bit elsewhere I am not completely convinced about the meaning of "wait-free".
What exactly do they mean here:
bool push(T const & t);
Pushes object t to the ringbuffer.
Note: Thread-safe and wait-free
I mean there must be some overhead for synchronisation. Is
some_spscqueue.push(x);
guaranteed to take a constant amount of time? How does it compare to a non-thread safe queue?
PS: don't worry, I am going to measure, but due to my naive ignorance I just can't imagine a synchronisation mechanism that does not involve some kind of waiting, and I am quite puzzled what "wait-free" is supposed to tell me.

A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes.
(from the abstract of the Herlihy article).
See also Wikipedia, and all the other resources you immediately find by typing "wait free" into a search engine.
So wait-free doesn't mean no delay; it means you never enter a blocking wait state and can't be indefinitely blocked or starved.
In other words, wait has the specific technical meaning that your thread is either parked (and does not execute any instructions until something wakes it up), or is in a loop waiting for some external condition to be satisfied (e.g. a spinlock). If it's never woken up, or if it is woken up but always finds it can't proceed and has to wait again, or if the loop never exits, then your thread is being starved and cannot make progress.
Every operation has some latency, and wait-free doesn't say anything about what that is.
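To make that concrete, here is a minimal sketch of how a single-producer/single-consumer ring-buffer push can be wait-free (an illustration only, not boost::lockfree's actual code): the producer executes a bounded number of instructions no matter what the consumer is doing, and simply reports failure if the buffer is full.

    #include <atomic>
    #include <array>
    #include <cstddef>

    // Sketch of a wait-free SPSC ring buffer (capacity is N - 1).
    template <typename T, std::size_t N>
    class spsc_ring {
        std::array<T, N> buf_;
        std::atomic<std::size_t> head_{0}; // advanced by the consumer
        std::atomic<std::size_t> tail_{0}; // advanced by the producer

    public:
        // Called from the single producer thread only. Runs a bounded number
        // of steps regardless of what the consumer does -- that is what
        // "wait-free" promises. It does NOT promise success: if the buffer
        // is full it just returns false instead of waiting.
        bool push(T const& t) {
            std::size_t tail = tail_.load(std::memory_order_relaxed);
            std::size_t next = (tail + 1) % N;
            if (next == head_.load(std::memory_order_acquire))
                return false;                       // full: no blocking, no spinning
            buf_[tail] = t;
            tail_.store(next, std::memory_order_release);
            return true;
        }
    };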
How does it compare to a non-thread safe queue?
It'll almost certainly be more expensive than a wholly unsynchronized container (assuming you really do only access that container from a single thread), because you're still doing the extra synchronization work.

"Pushes object to the ring buffer" refers to the type of array buffer that the queue is implemented on. This is probably implemented using a circular array to store the objects in the queue. See for example:
https://en.wikipedia.org/wiki/Circular_buffer
and
http://opendatastructures.org/ods-python/2_3_ArrayQueue_Array_Based_.html

Related

Is there a C++ design pattern that implements a mechanism or mutex that controls the amount of time a thread can own a locked resource?

I am looking for a way to guarantee that any time a thread locks a specific resource, it is forced to release that resource after a specific period of time (if it has not already released it). Envision a connection where you need to limit the amount of time any specific thread can own that connection for.
This is how I envision it could be used:
{
    std::lock_guard<std::TimeLimitedMutex> lock(this->myTimeLimitedMutex, timeout);
    try {
        // perform some operation with the resource that myTimeLimitedMutex guards.
    }
    catch (MutexTimeoutException ex) {
        // perform cleanup
    }
}
I see that there is a timed_mutex that lets the program timeout if a lock cannot be acquired. I need the timeout to occur after the lock is acquired.
There are already some situations where you get a resource that can be taken away unexpectedly. For instance, TCP sockets -- once a socket connection is made, code on each side needs to handle the case where the other side drops the connection.
I am looking for a pattern that handles types of resources that normally time out on their own, but when they don't, they need to be reset. This does not have to handle every type of resource.
This can't work, and it never will. It goes against the whole concept of ownership and atomic transactions. When a thread acquires the lock and performs two transactions in a row, it expects them to become atomically visible to the outside world. In this scenario, it is entirely possible that the transaction will be torn: the first part of it will be performed, but the second will not be.
What's worse, since the lock will be forcefully removed, the partly executed transaction will become visible to the outside world before the interrupted thread has any chance to roll back.
This idea goes contrary to every school of multithreaded thinking.
I support SergeyA's answer. Releasing a locked mutex after a timeout is a bad idea and cannot work. Mutex stands for mutual exclusion, and this is a rock-hard contract which cannot be violated.
But you can do what you want:
Problem: You want to guarantee that your threads do not hold the mutex longer than a certain time T.
Solution: Never lock the mutex for longer than time T. Instead, write your code so that the mutex is locked only for the absolutely necessary operations. It is always possible to give such a time T (modulo the uncertainties and limits imposed by a multitasking and multiuser operating system, of course).
To achieve that (examples):
Never do file I/O inside a locked section.
Never call a system call while a mutex is locked.
Avoid sorting a list while a mutex is locked (*).
Avoid doing a slow operation on each element of a list while a mutex is locked (*).
Avoid memory allocation/deallocation while a mutex is locked (*).
There are exceptions to these rules, but the general guideline is:
Make your code slightly less optimal (e.g. do some redundant copying inside the critical section) to make the critical section as short as possible. This is good multithreading programming.
(*) These are just examples of operations where it is tempting to lock the entire list, do the operation, and then unlock the list. Instead, it is advisable to take a local copy of the list and clear the original list while the mutex is locked, ideally by using the swap() operation offered by most STL containers, and then do the slow operation on the local copy outside of the critical section. This is not always possible, but it is always worth considering. Sorting is at least O(n log n) and usually needs random access to the entire list, so it is better to sort (a copy of) the list outside of the critical section and later check whether elements need to be added or removed. Memory allocations also have quite some complexity behind them, so massive allocation/deallocation should be avoided inside a critical section.
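As a sketch of that swap trick (all names here are made up for illustration):

    #include <algorithm>
    #include <mutex>
    #include <vector>

    std::mutex mtx;                 // guards shared_items
    std::vector<int> shared_items;  // hypothetical shared container

    void process_all() {
        std::vector<int> local;
        {
            std::lock_guard<std::mutex> lock(mtx);
            local.swap(shared_items);   // O(1): the critical section stays tiny
        }
        // The slow work (sorting, per-element processing, I/O, ...) happens
        // outside the lock, on the private copy.
        std::sort(local.begin(), local.end());
        // ... do the rest of the slow per-element work on 'local' here ...
    }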
You can't do that with only C++.
If you are using a POSIX system, it can be done.
You'll have to trigger a SIGALRM signal that's only unmasked for the thread that will time out. In the signal handler, you'll have to set a flag and use longjmp to return to the thread code.
In the thread code, at the setjmp position, control only comes back if the signal was triggered, so you can throw the timeout exception there.
Please see this answer for how to do that.
Also, on Linux, it seems you can throw directly from the signal handler (so no longjmp/setjmp needed there).
BTW, if I were you, I would code the opposite. Think about it: You want to tell a thread "hey, you're taking too long, so let's throw away all the (long) work you've done so far so I can make progress".
Ideally, you should make your long-running thread more cooperative, doing something like "I've done A of an ABCD task; let's release the mutex so others can make progress. Then let's check if I can take it again to do B", and so on.
You probably want to be more fine-grained (have more mutexes on smaller objects, but make sure you're locking in the same order) or use RW locks (so that other threads can use the objects if you're not modifying them), etc...
Such an approach cannot be enforced because the holder of the mutex needs the opportunity to clean up anything which is left in an invalid state part way through the transaction. This can take an unknown arbitrary amount of time.
The typical approach is to release the lock when doing long tasks, and re-acquire it as needed. You have to manage this yourself, as everyone will have a slightly different approach.
The only situation I know of where this sort of thing is accepted practice is at the kernel level, especially with respect to microcontrollers (which either have no kernel, or are all kernel, depending on who you ask). You can set an interrupt which modifies the call stack, so that when it is triggered it unwinds the particular operations you are interested in.
"Condition" variables can have timeouts. This allows you to wait until a thread voluntarily releases a resource (with notify_one() or notify_all()), but the wait itself will timeout after a specified fixed amount of time.
Examples in the Boost documentation for "conditions" might make this more clear.
If you want to force a release, you have to write the code which will force it though. This could be dangerous. The code written in C++ can be doing some pretty close-to-the-metal stuff. The resource could be accessing real hardware and it could be waiting on it to finish something. It may not be physically possible to end whatever the program is stuck on.
However, if it is possible, then you can handle it in the thread in which the wait() times out.
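For reference, with the standard-library counterpart (std::condition_variable) a timed wait looks roughly like this; the identifiers are only illustrative:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable cv;
    bool resource_ready = false;   // set by whichever thread releases the resource

    bool wait_for_resource() {
        std::unique_lock<std::mutex> lock(m);
        // Returns false if 5 seconds elapse before the predicate becomes true.
        return cv.wait_for(lock, std::chrono::seconds(5),
                           [] { return resource_ready; });
    }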

Multiple vs single condition variable

I'm working on an application where I currently have only one condition variable and lots of wait statements with different conditions. Whenever one of the queues or some other state is changed I just call cv.notify_all(); and know that all threads waiting for this state change will be notified (and possibly other threads which are waiting for a different condition).
I'm wondering whether it makes sense to use separate condition variables for each queue or state. All my queues are usually separate, so a thread waiting for data in queue x doesn't care at all about new data in queue y. So I won't have situations where I have to notify or wait for multiple condition variables (which is what most questions I've found are about).
Is there any downside, from a performance point of view, for having dozens of condition variables around? Of course the code may be more prone to errors because one has to notify the correct condition variable.
Edit: All condition variables are still going to use the same mutex.
I would recommend having one condition_variable per actual condition. In producer/consumer, for example, there would be a cv for buffer empty and another cv for buffer full.
It doesn't make sense to notify all when you could just notify one. That makes me think that your one condition_variable is in reality managing several actual conditions.
You're asking about optimization. On one hand, we want to avoid premature optimization and do the profiling when it becomes a problem. On the other hand, we want to design code that's going to scale well algorithmically from the beginning.
From a speed performance point of view, it's hard to say without knowing the situation, but the potential for significant negative impact on speed from adding cv's is very low, completely dwarfed by the cost of a wait or notify. If adding more cv's can get you out of calling notify-all, the potential for positive impact on speed is high.
Notify-all makes for simple-to-understand code. However, it shouldn't be used when performance matters. When a thread waiting on cv is awoken, it is guaranteed to hold the mutex. If you notify-all, every thread waiting on that cv will be woken up and immediately try to grab the mutex. At that point, all but one will go back to sleep (this time waiting for the mutex to be free). When the lucky thread releases the mutex, the next thread will get it and the next thing it will do is check the cv predicate. Depending on the condition, it could be very likely that the predicate evaluates to false (because the first thread already took care of it, or because you have multiple actual conditions for the one cv) and the thread immediately goes back to sleep waiting on the cv. One-by-one, every thread now waiting on the mutex will wake up, check the predicate, and probably go back to sleep on the cv. It's likely that only one actually ends up doing something.
This is horrible performance because mutex lock, mutex unlock, cv wait, and cv signal are all relatively expensive.
Even when you call notify-one, you might consider doing it after the critical section to try to prevent the one waking thread from blocking on the mutex. And if you actually want to notify-all, you can simply call notify-one and have each thread notify-one once it finishes the critical section. This creates a nice linear cascade of turn-taking versus an explosion of contention.
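To illustrate "one condition_variable per actual condition" plus notifying after the critical section, a bounded producer/consumer might look roughly like this (a sketch, not production code):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>

    std::mutex m;
    std::condition_variable not_empty;  // consumers wait on this
    std::condition_variable not_full;   // producers wait on this
    std::queue<int> q;
    const std::size_t capacity = 64;

    void produce(int value) {
        {
            std::unique_lock<std::mutex> lock(m);
            not_full.wait(lock, [] { return q.size() < capacity; });
            q.push(value);
        }                       // unlock first ...
        not_empty.notify_one(); // ... then wake exactly one consumer
    }

    int consume() {
        int value;
        {
            std::unique_lock<std::mutex> lock(m);
            not_empty.wait(lock, [] { return !q.empty(); });
            value = q.front();
            q.pop();
        }
        not_full.notify_one();
        return value;
    }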

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
The questions then become, since I am still learning multithreading:
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism? myData isn't exactly shared, but rather updated by the first thread and used in the second thread.
How about the opposite case, when the data produced in one thread needs to be used at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops the data items off of the other end. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism) to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
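A minimal sketch of that FIFO approach (the type and function names are just illustrative):

    #include <deque>
    #include <mutex>
    #include <vector>

    struct MyData { double value; };

    std::mutex fifo_mutex;
    std::deque<MyData> fifo;   // producer pushes at 100 Hz, consumer drains at 10 Hz

    // Called by the 100 Hz producer thread.
    void push_sample(MyData d) {
        std::lock_guard<std::mutex> lock(fifo_mutex);
        fifo.push_back(d);
    }

    // Called by the 10 Hz consumer thread: take everything that has
    // accumulated since the last wakeup.
    std::vector<MyData> drain_samples() {
        std::lock_guard<std::mutex> lock(fifo_mutex);
        std::vector<MyData> out(fifo.begin(), fifo.end());
        fifo.clear();
        return out;
    }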
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does myData need to be some kind of Singleton with a locking mechanism?
A Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!).
How about the opposite case, when the data produced in one thread needs to be used at a higher rate in a different thread?
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
a producer-consumer container that lets the first thread enqueue data and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size, after which either data would be lost or the 1st thread would be forced to slow down and wait to enqueue further values
there are libraries available (e.g. Boost), or if you want to implement it yourself, google some tutorials/docs on mutexes and condition variables
do something conceptually similar to the above but where the size limit is 1 so there's just the single myData variable rather than a "container" - but all the synchronisation and delay choices remain the same
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer or reference argument to the function(s) run in the threads, as sketched below. Singleton is easily overused and best avoided unless the reasons stack up high...
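For example (a sketch only; SharedChannel stands in for whichever container and locking scheme you chose above):

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>

    struct SharedChannel {
        std::mutex mtx;
        std::deque<int> items;
    };

    void producer(SharedChannel& ch) {
        for (int i = 0; i < 10; ++i) {
            std::lock_guard<std::mutex> lock(ch.mtx);
            ch.items.push_back(i);
        }
    }

    void consumer(SharedChannel& ch) {
        std::lock_guard<std::mutex> lock(ch.mtx);
        ch.items.clear();   // placeholder for real consumption
    }

    int main() {
        SharedChannel channel;                        // no Singleton: one ordinary object...
        std::thread t1(producer, std::ref(channel));  // ...handed explicitly to each thread
        std::thread t2(consumer, std::ref(channel));
        t1.join();
        t2.join();
    }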

C++11 : Atomic variable : lock_free property : What does it mean?

I wanted to understand what is meant by the lock_free property of atomic variables in C++11. I googled around and saw the other relevant questions on this forum, but I still have only a partial understanding. I'd appreciate it if someone could explain it end-to-end and in a simple way.
It's probably easiest to start by talking about what would happen if it was not lock-free.
The most obvious way to handle most atomic tasks is by locking. For example, to ensure that only one thread writes to a variable at a time, you can protect it with a mutex. Any code that's going to write to the variable needs to obtain the mutex before doing the write (and release it afterwards). Only one thread can own the mutex at a time, so as long as all the threads follow the protocol, no more than one can write at any given time.
If you're not careful, however, this can be open to deadlock. For example, let's assume you need to write to two different variables (each protected by a mutex) as an atomic operation -- i.e., you need to ensure that when you write to one, you also write to the other. In such a case, if you aren't careful, you can cause a deadlock. For example, let's call the two mutexes A and B. Thread 1 obtains mutex A, then tries to get mutex B. At the same time, thread 2 obtains mutex B, and then tries to get mutex A. Since each is holding one mutex, neither can obtain both, and neither can progress toward its goal.
There are various strategies to avoid such deadlocks (e.g., all threads always try to obtain the mutexes in the same order, or, upon failure to obtain a mutex within a reasonable period of time, each thread releases the mutex it holds, waits some random amount of time, and then tries again).
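In C++ the "same order" strategy can also be delegated to the library: std::lock, or std::scoped_lock in C++17, acquires several mutexes using a built-in deadlock-avoidance algorithm. A sketch:

    #include <mutex>

    std::mutex a, b;   // protect two variables that must be updated together
    int x = 0, y = 0;

    void update_both() {
        // std::scoped_lock locks 'a' and 'b' with a deadlock-avoidance
        // algorithm, so two threads calling this concurrently cannot end
        // up each holding one mutex while waiting for the other.
        std::scoped_lock lock(a, b);
        ++x;
        ++y;
    }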
With lock-free programming, however, we (obviously enough) don't use locks. This means that a deadlock like above simply cannot happen. When done properly, you can guarantee that all threads continuously progress toward their goal. Contrary to popular belief, it does not mean the code will necessarily run any faster than well written code using locks. It does, however, mean that deadlocks like the above (and some other types of problems like livelocks and some types of race conditions) are eliminated.
Now, as to exactly how you do that: the answer is short and simple: it varies -- widely. In a lot of cases, you're looking at specific hardware support for doing certain specific operations atomically. Your code either uses those directly, or extends them to give higher level operations that are still atomic and lock free. It's even possible (though only rarely practical) to implement lock-free atomic operations without hardware support (but given its impracticality, I'll pass on trying to go into more detail about it, at least for now).
Jerry already mentioned common correctness problems with locks, i.e. they're hard to understand and program correctly.
Another danger with locks is that you lose determinism regarding your execution time: if a thread that has acquired a lock gets delayed (e.g. descheduled by the operating system, or "swapped out"), then it is possible that the entire program is delayed because it is waiting for the lock. By contrast, a lock-free algorithm is always guaranteed to make some progress, even if any number of threads are held up somewhere else.
By and large, lock-free programming is often slower (sometimes significantly so) than locked programming using non-atomic operations, because atomic operations cause a significant hit on caching and pipelining; however, it offers determinism and upper bounds on latency (at least overall latency of your process; as #J99 observed, individual threads may still be starved as long as enough other threads are making progress). Your program may get a lot slower, but it never locks up entirely and always makes some progress.
The very nature of hardware architectures allows for certain small operations to be intrinsically atomic. In fact, this is a necessity for any hardware that supports multitasking and multithreading. At the very heart of any synchronisation primitive, such as a mutex, you need some sort of atomic instruction that guarantees correct locking behaviour.
So, with that in mind, we now know that it is possible for certain types like booleans and machine-sized integers to be loaded, stored and exchanged atomically. Thus when we wrap such a type into an std::atomic template, we can expect that the resulting data type will indeed offer load, store and exchange operations that do not use locks. By contrast, your library implementation is always allowed to implement an atomic Foo as an ordinary Foo guarded by a lock.
To test whether an atomic object is lock-free, you can use the is_lock_free member function. Additionally, there are ATOMIC_*_LOCK_FREE macros that tell you whether atomic primitive types potentially have a lock-free instantiation. If you are writing concurrent algorithms that you want to be lock-free, you should thus include an assertion that your atomic objects are indeed lock-free, or a static assertion on the macro to have value 2 (meaning that every object of the corresponding type is always lock-free).
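For instance (assuming a platform where atomic<int> is always lock-free, which is true on mainstream hardware):

    #include <atomic>
    #include <cstdint>
    #include <iostream>

    struct Big { char data[64]; };   // almost certainly too large for a lock-free atomic

    int main() {
        // Compile-time check (C++17); ATOMIC_INT_LOCK_FREE == 2 expresses the same thing.
        static_assert(std::atomic<int>::is_always_lock_free,
                      "expected atomic<int> to be lock-free on this platform");

        std::atomic<std::uint64_t> counter{0};
        std::atomic<Big> big{};

        std::cout << "atomic<uint64_t> lock-free: " << counter.is_lock_free() << '\n';
        std::cout << "atomic<Big>      lock-free: " << big.is_lock_free() << '\n';  // typically 0
    }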
Lock-free is one of the non-blocking techniques. For an algorithm, it involves a global progress property: whenever a thread of the program is active, it can make a forward step in its action, either for itself or eventually for another thread.
Lock-free algorithms are supposed to behave better under heavy contention, where threads acting on a shared resource may spend a lot of time waiting for their next active time slice. They are also almost mandatory in contexts where you can't lock, like interrupt handlers.
Implementations of lock-free algorithms almost always rely on Compare-and-Swap (some may use things like LL/SC) and on a strategy where the visible modification can be reduced to a single value change (usually a pointer), which becomes the linearization point, looping over this modification if the value has changed in the meantime. Most of the time, these algorithms try to complete the jobs of other threads when possible. A good example is the lock-free queue of Michael & Scott (http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html).
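A classic illustration of that shape -- one pointer as the linearization point, retried in a loop -- is the push operation of a Treiber stack, sketched here (push only; a correct pop also has to deal with memory reclamation and the ABA problem, which is omitted):

    #include <atomic>
    #include <utility>

    template <typename T>
    class lockfree_stack {
        struct node {
            T value;
            node* next;
        };
        std::atomic<node*> head_{nullptr};

    public:
        // Lock-free (not wait-free): the loop can retry if another thread
        // changed head_ in the meantime, but some thread always succeeds,
        // so the system as a whole keeps making progress.
        void push(T value) {
            node* n = new node{std::move(value), head_.load(std::memory_order_relaxed)};
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
                // On failure, n->next was reloaded with the current head; retry.
            }
        }
    };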
For lower-level instructions like Compare-and-Swap, it means that the implementation (probably the micro-code of the corresponding instruction) is wait-free (see http://www.diku.dk/OLD/undervisning/2005f/dat-os/skrifter/lockfree.pdf)
For the sake of completeness, a wait-free algorithm enforces progress for each thread: every operation is guaranteed to terminate in a finite number of steps.

Intel Thread Building Blocks Concurrent Queue: Using pop() over pop_if_present()

What is the difference in using the blocking call pop() as compared to,
while(pop_if_present(...))
Which should be preferred over the other? And why?
I am looking for a deeper understanding of the tradeoff between polling yourself as in the case of while(pop_if_present(...)) with respect to letting the system doing it for you. This is quite a general theme. For example, with boost::asio I could do a myIO.run() which blocks or do the following:
while (1)
{
    myIO.poll();
}
One possible explanation is that the thread that invokes while(pop_if_present(...)) will remain busy, so this is bad. But someone or something has to poll for the async event. Why and how can this be cheaper when it is delegated to the OS or the library? Is it because the OS or the library is smart about polling, for example doing an exponential backoff?
Intel's TBB library is open source, so I took a look...
It looks like pop_if_present() essentially checks if the queue is empty and returns immediately if it is. If not, it attempts to get the element on the top of the queue (which might fail, since another thread may have come along and taken it). If it misses, it performs an "atomic_backoff" pause before checking again. The atomic_backoff will simply spin the first few times it's called (doubling its spin loop count each time), but after a certain number of pauses it'll just yield to the OS scheduler instead of spinning on the assumption that since it's been waiting a while, it might as well do it nicely.
For the plain pop() function, if there isn't anything in the queue, it will perform atomic_backoff waits until there is something in the queue for it to get.
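Roughly, the backoff behaviour just described looks like this (a sketch of the idea, not TBB's actual atomic_backoff code):

    #include <thread>

    // Exponential-backoff helper in the spirit of TBB's atomic_backoff:
    // spin a little at first, then yield to the scheduler.
    class backoff {
        int count_ = 1;
        static const int spin_limit = 16;   // threshold before yielding

    public:
        void pause() {
            if (count_ <= spin_limit) {
                for (volatile int i = 0; i < count_; ++i) {
                    // busy-wait; a real implementation would issue a CPU pause hint here
                }
                count_ *= 2;                // double the spin count each time
            } else {
                std::this_thread::yield();  // give up the rest of the time slice
            }
        }
    };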
Note that there are at least 2 interesting things (to me anyway) about this:
the pop() function performs spin waits (up to a point) for something to show up in the queue; it's not going to yield to the OS unless it has to wait for more than a short moment. So, as you might expect, there's not much reason to spin yourself calling pop_if_present() unless you have something else you're going to do between calls to pop_if_present()
when pop() does yield to the OS, it does so by simply giving up its time slice. It doesn't block the thread on a synchronization object that can be signaled when an item is placed on the queue - it seems to go into a sleep/poll cycle to check the queue for something to pop. This surprised me a little.
Take this analysis with a grain of salt... The source I used for it might be a bit old (it's actually from concurrent_queue_v2.h and .cpp), because the more recent concurrent_queue has a different API - there's no pop() or pop_if_present(), just a try_pop() function in the latest concurrent_queue class interface. The old interface has been moved (possibly changed somewhat) to the concurrent_bounded_queue class. It appears that the newer concurrent_queues can be configured when the library is built to use OS synchronization objects instead of busy waits and polling.
With the while(pop_if_present(...)) you are doing a brute-force busy wait (also called spinning) on the queue. When the queue is empty you waste cycles by keeping the CPU busy until either an item is pushed into the queue by another thread running on a different CPU, or the OS decides to give your CPU to some other, possibly unrelated, thread/process.
You can see how this could be bad if you have only one CPU - the producer thread would not be able to push, and thus stop the consumer from spinning, until at least the end of the consumer's time quantum plus the overhead of a context switch. Clearly a mistake.
With multiple CPUs this might be better if the OS selects (or you enforce) the producer thread to run on a different CPU. This is the basic idea of a spin-lock - a synchronization primitive built directly on special processor instructions such as compare-and-swap or load-linked/store-conditional, commonly used inside the operating system to communicate between interrupt handlers and the rest of the kernel, and to build higher-level constructs such as semaphores.
With blocking pop(), if the queue is empty, you are entering a sleep wait, i.e. asking the OS to put the consumer thread into a non-schedulable state until an event - a push onto the queue - occurs from another thread. The key here is that the processor is available for other (hopefully useful) work. The TBB implementation actually tries hard to avoid the sleep, since it's expensive (entering the kernel, rescheduling, etc.). The goal is to optimize the normal case where the queue is not empty and the item can be retrieved quickly.
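With the newer interface mentioned in the previous answer, the two styles look roughly like this (assuming TBB's concurrent_queue / concurrent_bounded_queue headers; treat it as a sketch):

    #include <tbb/concurrent_queue.h>
    #include <thread>

    tbb::concurrent_queue<int> q;           // non-blocking interface: try_pop only
    tbb::concurrent_bounded_queue<int> bq;  // blocking interface: pop() sleeps when empty

    void busy_wait_consumer() {
        int item;
        // Busy wait / polling: burns CPU between items (yield softens it a little).
        for (;;) {
            if (q.try_pop(item)) {
                // ... process item ...
            } else {
                std::this_thread::yield();
            }
        }
    }

    void blocking_consumer() {
        int item;
        // Sleep wait: the thread is descheduled until a producer pushes something.
        for (;;) {
            bq.pop(item);
            // ... process item ...
        }
    }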
The choice is really simple though - always sleep-wait, i.e. do a blocking pop(), unless you have to busy-wait (and that is in real-time systems, OS interrupt context, and some very specialized applications).
Hope this helps a bit.