Are there any equivalents to the futex in Linux/Unix? - c++

I'm looking for something could be used for polling (like select, kqueue, epoll i.e. not busy polling) in C/C++. In other word, I need to block a thread, and then wake it up in another thread with as little overhead as possible.
A mutex + condition variable works, but there is a lot of overhead. A futex also works, but that's for Linux only (or maybe not?). Extra synchronization is not required as long as the polling itself works properly, e.g. no race when I call wait and wake in two threads.
Edit: If such a "facility" doesn't exist in FreeBSD, how to create one with C++11 built-in types and system calls?
Edit2: Since this question is migrated to SO, I'd like to make it more general (not for FreeBSD only)

semaphores are not mutexes, and would work with slightly less overhead (avoiding the mutex+condvar re-lock, for example)
Note that since any solution where a thread sleeps until woken will involve a kernel syscall, it still isn't cheap. Assuming x86_64 glibc and the FreeBSD libc are both reasonable implementations, the unavoidable cost seems to be:
user-mode synchronisation of the count (with a CAS or similar)
kernel management of the wait queue and thread sleep/wait
I assume the mutex + condvar overhead you're worried about is the cond_wait->re-lock->unlock sequence, which is indeed avoided here.

You want semaphores and not mutexes for the signaling between the to threads..
http://man7.org/linux/man-pages/man3/sem_wait.3.html
Semaphores can be used like a counter such as if you have a queue, you increment (post) to the semaphore every time you insert a message, and your receiver decrement (wait) on the semaphore for every message it takes out. If the counter reach zero the receiver will block until something is posted.
So a typical pattern is to combine a mutex and a semaphore like;
sender:
mutex.lock
insert message in shared queue
mutex.unlock
semaphore.post
receiver:
semaphore.wait
mutex.lock
dequeue message from shared structure
mutex.unlock

Related

What is the overhead associated with std::condition_variable_any

I have read in many places that there is some overhead associated with std::condition_variable_any. Just wondering, what is this overhead?
My guess here is that since this is a generic condition variable that can work with any type of lock, it requires a manually rolled implementation of waiting (perhaps with another condition_variable and mutex or futex, or something similar) so the extra overhead probably comes from that? But not sure... As opposed to just being a native wrapper around pthread_cond_wait() (and equivalent on other systems) etc.
As a followup, if I was say implementing something that waits on, say, a shared mutex, then is this type of condition variable a bad choice because of the performance overhead? What else can I do in this situation?
pthread_cond_wait() / SleepConditionVariableSRW(), same as the the plain std::condition_variable::wait() require just a single, atomic syscall for both releasing the mutex, waiting for the condition variable and re-aquiring the mutex. The thread immediately goes to sleep and another thread - ideally one which was blocked by the mutex - can take over immediately on the same core.
With std::condition_variable_any, the unlock of the passed BasicLockable and starting to wait on the native event / condition is more than just a single syscall, it's invoking the unlock() method on the BasicLockable first and only then issues the syscall for waiting. So you have at least the overhead from the separate unlock(), plus you are more likely to trigger an less than ideal scheduling decision on the OS side. Worst case, the unlock even caused continuation of a waiting thread on a different core, with all the associated overhead.
The other way around, e.g. on spurious wakes, there are also OS side scheduling optimizations possible when dealing with a native mutex (as used in std::mutex) which don't apply with a generic BasicLockable.
Both involve some book keeping, in order to provide notify_all() logic (it's actually one event / condition per waiting thread) as well as the guarantees about all methods being atomic, so they both come with a small overhead anyway.
The real overhead comes from how well the OS can make a good scheduling decision on the combined signal-and-wait-and-lock syscall. If the OS isn't smart about the scheduling, then it makes virtually no difference.

Do mutexes guarantee ordering of acquisition? Unlocking thread takes it again while others are still waiting

A coworker had an issue recently that boiled down to what we believe was the following sequence of events in a C++ application with two threads:
Thread A holds a mutex.
While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
It appears that thread A is given the mutex again; thread B is still waiting, even though it "asked" for the lock first.
Does this sequence of events fit with the semantics of, say, C++11's std::mutex and/or pthreads? I can honestly say I've never thought about this aspect of mutexes before.
Are there any fairness guarantees to prevent starvation of other threads for too long, or any way to get such guarantees?
Known problem. C++ mutexes are thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair. They do not care for FIFO.
The other side of the same coin is that threads are usually not pre-empted until they run out of their time slice. As a result, thread A in this scenario was likely to continue to be executed, and got the mutex right away because of that.
The guarantee of a std::mutex is enable exclusive access to shared resources. Its sole purpose is to eliminate the race condition when multiple threads attempt to access shared resources.
The implementer of a mutex may choose to favor the current thread acquiring a mutex (over another thread) for performance reasons. Allowing the current thread to acquire the mutex and make forward progress without requiring a context switch is often a preferred implementation choice supported by profiling/measurements.
Alternatively, the mutex could be constructed to prefer another (blocked) thread for acquisition (perhaps chosen according FIFO). This likely requires a thread context switch (on the same or other processor core) increasing latency/overhead. NOTE: FIFO mutexes can behave in surprising ways. E.g. Thread priorities must be considered in FIFO support - so acquisition won't be strictly FIFO unless all competing threads are the same priority.
Adding a FIFO requirement to a mutex's definition constrains implementers to provide suboptimal performance in nominal workloads. (see above)
Protecting a queue of callable objects (std::function) with a mutex would enable sequenced execution. Multiple threads can acquire the mutex, enqueue a callable object, and release the mutex. The callable objects can be executed by a single thread (or a pool of threads if synchrony is not required).
•Thread A finishes the work that it was holding the mutex for, thus
releasing the mutex.
•Very shortly thereafter, thread A needs to touch a resource that is
protected by the mutex, so it locks it again
In real world, when the program is running. there is no guarantee provided by any threading library or the OS. Here "shortly thereafter" may mean a lot to the OS and the hardware. If you say, 2 minutes, then thread B would definitely get it. If you say 200 ms or low, there is no promise of A or B getting it.
Number of cores, load on different processors/cores/threading units, contention, thread switching, kernel/user switches, pre-emption, priorities, deadlock detection schemes et. al. will make a lot of difference. Just by looking at green signal from far you cannot guarantee that you will get it green.
If you want that thread B must get the resource, you may use IPC mechanism to instruct the thread B to gain the resource.
You are inadvertently suggesting that threads should synchronise access to the synchronisation primitive. Mutexes are, as the name suggests, about Mutual Exclusion. They are not designed for control flow. If you want to signal a thread to run from another thread you need to use a synchronisation primitive designed for control flow i.e. a signal.
You can use a fair mutex to solve your task, i.e. a mutex that will guarantee the FIFO order of your operations. Unfortunately, C++ standard library doesn't have a fair mutex.
Thankfully, there are open-source implementations, for example yamc (a header-only library).
The logic here is very simple - the thread is not preempted based on mutexes, because that would require a cost incurred for each mutex operation, which is definitely not what you want. The cost of grabbing a mutex is high enough without forcing the scheduler to look for other threads to run.
If you want to fix this you can always yield the current thread. You can use std::this_thread::yield() - http://en.cppreference.com/w/cpp/thread/yield - and that might offer the chance to thread B to take over the mutex. But before you do that, allow me to tell you that this is a very fragile way of doing things, and offers no guarantee. You could, alternatively, investigate the issue deeper:
Why is it a problem that the B thread is not started when A releases the resource? Your code should not depend on such logic.
Consider using alternative thread synchronization objects like barriers (boost::barrier or http://linux.die.net/man/3/pthread_barrier_wait ) instead, if you really need this sort of logic.
Investigate if you really need to release the mutex from A at that point - I find the practice of locking and releasing fast a mutex for more than one time a code smell, it usually impacts terribly the performace. See if you can group extraction of data in immutable structures which you can play around with.
Ambitious, but try to work without mutexes - use instead lock-free structures and a more functional approach, including using a lot of immutable structures. I often found quite a performance gain from updating my code to not use mutexes (and still work correctly from the mt point of view)
How do you know this:
While thread A is holding the mutex, thread B attempts to lock it.
Since it is held, thread B is suspended.
How do you know thread B is suspended. How do you know that it is not just finished the line of code before trying to grab the lock, but not yet grabbed the lock:
Thread B:
x = 17; // is the thread here?
// or here? ('between' lines of code)
mtx.lock(); // or suspended in here?
// how can you tell?
You can't tell. At least not in theory.
Thus the order of acquiring the lock is, to the abstract machine (ie the language), not definable.

posix pipe as a work queue

The normal implementations of a work queue I have seen involve mutexes and condition variables.
Consumer:
A) Acquires Lock
B) While Queue empty
Wait on Condition Variable (thus suspending thread and releasing lock)
C) Work object retrieved from queue
D) Lock is released
E) Do Work
F) GOTO A
Producer:
A) Acquires Lock
B) Work is added to queue
C) condition variable is signaled (potentially releasing worker)
D) Lock is released
I have been browsing some code and I saw an implementation using POSIX pipes (I have not seen this technique before).
Consumer:
A) Do select on pipe (thus suspending thread while no work)
B) Get Job from pipe
C) Do Work
D) GOTO A
Producer:
A) Write Job to pipe.
Since the producer and consumer are threads inside the same application (thus they share the same address space and thus pointers between them are valid); the jobs are written to the pipe as the address of the work object (A C++ object). So all that has to be written/read from the pipe is an 8-byte address.
My question is:
Is this a common technique (have I been sheltered from this) and what are the advantages/disadvantages?
My curiosity was piqued because the pipe technique does not involve any visible lock or signals (it may be hidden in the select). So I was wondering if this would be more efficient?
Edit:
Based on comments in #Maxim Yegorushkin answer.
Actually the "Producer" in this scenario is involved in a lot of high volume IO from lots of source in parallel. So I suspect that the original author though it very desirable that this thread did not block under any circumstances, but also did not want to high cost work in the "Producer" thread.
As it's been mentioned here already, people use pipes as queues to avoid blocking on a condition variable in a non-blocking I/O thread (i.e. the thread that handles multiple sockets and blocks on select/epoll). If an I/O thread blocks on a condition variable or a mutex it can't do non-blocking I/O any more.
Some say that writing into a pipe involves a system call and may increase latency when the volume of inter-thread events is high. That is only true for naive pipe-based queue implementations.
Advanced implementations use lock-free linked lists of jobs/events and only when the first job is added to the list the pipe is written to to wake the target I/O thread from the blocking epoll call (essentially using pipe as an edge-triggered notification mechanism but not for passing pointers to jobs/events). Because it takes a few micro-seconds to wake up a thread there may be more jobs/events posted to that thread's event queue during this time but every subsequent event doesn't require writing to the pipe, until later time when the I/O thread wakes up and consumes all events in the queue. Also, in newer Linux kernel a faster eventfd can be used instead of pipe to wake up an I/O thread.
I have done this. It's old-school but it works.
The reason I did it this way was I needed to wake up the same thread on either a job for it to do or read input from another source, so select() was involved.
It is because of select and how it is structured. As you can see in the man page
select() and pselect() allow a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform the corresponding I/O operation (e.g., read(2)) without blocking.
The key in the above is the 'waiting until one or more of the FDs become ready'. That is the synchronization point between the two threads.
I think the answer is that the pipe technique does not give as good performance as it involves system calls, which are relatively expensive. But it does mean that all that tricky locking and sleeping and waking gets taken care of for you.
I've used both myself, but pipes only for occasional non performance critical applications.
EDIT: I suppose I might as well make the standard recommendation since nobody has come along with any clearly authoritative comments.
Standard recommendation being: Try both and benchmark them. It's the one true way to find out which performs better...

What happens when pthreads wait in mutex_lock/cond_wait?

I have a program that should get the maximum out of my cpu.
It is multithreaded via pthreads that do their job well apart from the fact that they "only" get my cores to about 60% load which is not enough in my opinion.
I am searching for the reason and am asking myself (and hereby you) if the blocking functions mutex_lock/cond_wait are candidates?
What happens when a thread cannot run on in such a function?
Does pthread switch to another thread it handles or
does the thread yield its time to the system and if the latter is the case, can I change this behavior?
Regards,
Nobody
More Information
The setting is one mainthread that fills the taskpool and countless workers that fetch jobs from there and wait on a conditional that is signaled via broadcast when a serialized calculation is done. They go on with the values from this calculation until they are done, deliver their mail and fetch the next job...
On a typical modern pthreads implementation, each thread is managed by the kernel not unlike a separate process. Any blocking call like pthread_mutex_lock or pthread_cond_wait (but also, say, read) will yield its time to the system. The system will then find another eligible thread to schedule, whether in your process or another process, and run it.
If your program is only taking 60% of the CPU, it is more likely blocked on I/O than on pthread operations, unless you have done something way too granular with your pthread operations.
If a thread is waiting on a mutex/condition, it doesn't use resources (well, uses just a tiny amount). Whenever the thread enters waiting state, control switches to other threads. When the mutex is released (or condition variable signalled), the thread wakes up and may acquire the mutex (if no other thread grabs it first), and continue to run. If however some other thread acquires the mutex (this can happen if several threads are waiting for it), the thread returns to sleeping state.

Conditional wait overhead

When using boost::conditional_variable, ACE_Conditional or directly pthread_cond_wait, is there any overhead for the waiting itself? These are more specific issues that trouble be:
After the waiting thread is unscheduled, will it be scheduled back before the wait expires and then unscheduled again or it will stay unscheduled until signaled?
Does wait acquires periodically the mutex? In this case, I guess it wastes each iteration some CPU time on system calls to lock and release the mutex. Is it the same as constantly acquiring and releasing a mutex?
Also, then, how much time passes between the signal and the return from wait?
Afaik, when using semaphores the acquire calls responsiveness is dependent on scheduler time slice size. How does it work in pthread_cond_wait? I assume this is platform dependent. I am more interested in Linux but if someone knows how it works on other platforms, it will help too.
And one more question: are there any additional system resources allocated for each conditional? I won't create 30000 mutexes in my code, but should I worry about 30000 conditionals that use the same one mutex?
Here's what is written in the pthread_cond man page:
pthread_cond_wait atomically unlocks the mutex and waits for the condition variable cond to be signaled. The thread execution is suspended and does not consume any CPU time until the condition variable is signaled.
So from here I'd answer to the questions as following:
The waiting thread won't be scheduled back before the wait was signaled or canceled.
There are no periodic mutex acquisitions. The mutex is reacquired only once before wait returns.
The time that passes between the signal and the wait return is similar to that of thread scheduling due to mutex release.
Regarding the resources, on the same man page:
In the LinuxThreads implementation, no resources are associated with condition variables, thus pthread_cond_destroy actually does nothing except checking that the condition has no waiting threads.
Update: I dug into the sources of pthread_cond_* functions and the behavior is as follows:
All the pthread conditionals in Linux are implemented using futex.
When a thread calls wait it is suspended and unscheduled. The thread id is inserted at the tail of a list of waiting threads.
When a thread calls signal the thread at the head of the list is scheduled back.
So, the waking is as efficient as the scheduler, no OS resources are consumed and the only memory overhead is the size of the waiting list (see futex_wake function).
You should only call pthread_cond_wait if the variable is already in the "wrong" state. Since it always waits, there is always the overhead associated with putting the current thread to sleep and switching.
When the thread is unscheduled, it is unscheduled. It should not use any resources, but of course an OS can in theory be implemented badly. It is allowed to re-acquire the mutex, and even to return, before the signal (which is why you must double-check the condition), but the OS will be implemented so this doesn't impact performance much, if it happens at all. It doesn't happen spontaneously, but rather in response to another, possibly-unrelated signal.
30000 mutexes shouldn't be a problem, but some OSes might have a problem with 30000 sleeping threads.