pthread signaling without kernel call - c++

I am running a few threads using pthreads on a real-time Linux (RedHawk) in C++. All the threads run on a fixed-frequency loop, and one of the threads polls the CPU clock and alerts the other two threads that the next loop has started (by the end of the loop we can safely assume that the other threads have finished their work and are waiting for the next loop). My goal is to reduce latency where possible, and I have the ability to let threads take 100% of the CPU they are on (and to guarantee they are the only thing running on that CPU, thanks to the RedHawk enhancements).
My idea was to have the timing thread poll the CPU tick count until it exceeds X, then increment a 64- or 32-bit counter without taking a mutex. The other two threads will poll this counter and wait for it to increase, also without taking a mutex. As I see it, no mutex is needed: the first thread can increment the counter atomically, since it is the only thing writing to it, and the other two threads can read it without fear because a 32- or 64-bit number is written to memory without ever being visible in a partial state (I think).
I realize that all my threads will be polling something and therefore running at 100% all the time. I could reduce that by using pthreads signaling, but I believe the latency there is more than I want. I also know a mutex takes on the order of a couple of tens of nanoseconds, so I could probably use one without noticing the latency, but I don't see why it is needed when I have one thread incrementing a counter and the other two polling it.

You need to tell the compiler that your counter is a synchronization variable. You do that by declaring your counter std::atomic and then using the built-in operations: fetch_add() or operator++() for the increment, and load() for the reading threads. See http://en.cppreference.com/w/cpp/atomic/atomic.
If you don't declare your counter atomic you have a data race: your program has no defined semantics, and the compiler is permitted to (and probably will) move code around with respect to the counter test, which will probably lead to results you don't expect.
You need C++11 to get std::atomic. In most versions of g++ you enable it with the -std=c++0x flag; the most recent versions require -std=c++11 instead.
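A minimal sketch of what that looks like, assuming one timing thread and two workers (the tick polling and the per-frame work are placeholders):

#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<std::uint64_t> frame_counter{0};

void timing_thread() {
    for (;;) {
        // ... poll the CPU tick count here until the next frame boundary ...
        frame_counter.fetch_add(1, std::memory_order_release);   // or ++frame_counter
    }
}

void worker_thread() {
    std::uint64_t last_seen = 0;
    for (;;) {
        // spin until the timing thread publishes a new frame
        while (frame_counter.load(std::memory_order_acquire) == last_seen) { }
        last_seen = frame_counter.load(std::memory_order_acquire);
        // ... do this frame's work ...
    }
}

int main() {
    std::thread t1(timing_thread), t2(worker_thread), t3(worker_thread);
    t1.join(); t2.join(); t3.join();    // the loops in this sketch never return
}

The default sequentially consistent operations also work; release on the writer and acquire on the readers is enough for a single writer publishing to readers.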

Since there is a shared variable, with one thread modifying (incrementing) it and others reading it, the simplest safe approach is to wrap each access between pthread_mutex_lock and pthread_mutex_unlock to ensure mutual exclusion.
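If you do take the mutex route, the usual shape is the one below (the counter and function names are only illustrative); note that with a single writer, the std::atomic approach above avoids the lock entirely:

#include <pthread.h>
#include <cstdint>

pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
std::uint64_t frame_counter = 0;              // protected by counter_mutex

void publish_frame() {                        // timing thread
    pthread_mutex_lock(&counter_mutex);
    ++frame_counter;
    pthread_mutex_unlock(&counter_mutex);
}

std::uint64_t read_frame() {                  // polling threads
    pthread_mutex_lock(&counter_mutex);
    std::uint64_t value = frame_counter;
    pthread_mutex_unlock(&counter_mutex);
    return value;
}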

Related

How do I determine from strace output what part of my program is failing to acquire a mutex

I'm working on an embedded Linux system (3.12.something), and our application, after some random amount of time, starts hogging the CPU. I've run strace on our application, and right when the problem happens, I see a lot of lines similar to this in the strace output:
[48530666] futex(0x485f78b8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.009002>
I'm pretty sure this is the smoking gun I'm looking for and there is a race of some sort. However, I now need to figure out how to identify the place in the code that's trying to get this mutex. How can I do that? Our code is compiled with GCC and has debugging symbols in it.
My current thinking (that I haven't tried yet) is to print out a string to stdout and flush before trying to grab any mutex in our system, with the expectation that the string will print right before strace complains about getting the lock ... but there are a LOT of places in the code that would have to be instrumented like this.
EDIT: Another strange thing that I just realized is that our program doesn't start hogging the CPU until some random time has passed since it was run (5 minutes to 5 hours and anywhere in between). During that time, there are zero futex syscalls happening. Why do they suddenly start? From what I've read, I think maybe they are being used properly in userspace until something fails and falls back to making a futex() syscall...
Any suggestions?
If you repeatedly lock a mutex for a short time from different threads, e.g. one protecting a global logger, you can cause a so-called thread convoy. The problem doesn't occur until two threads compete for the lock. The first gets the lock and holds it for a short time; then, when it needs the lock a second time, it gets preempted because the second thread is already waiting. The second thread does the same. The timeslice available to each thread is suddenly reduced to the time between two lock attempts, causing many context switches and a corresponding slowdown. Further, all but one thread are always blocked on the mutex, effectively disabling any parallel execution.
In order to fix this, make your threads cooperate instead of competing for resources. For the logger example above, consider e.g. a lock-free queue for the entries, or a separate queue per thread using thread-local storage.
Concerning the futex() calls, the idea is to poll an atomic flag and, after some number of spins, fall back to the actual OS mutex. The atomic flag can be checked without the expensive switch between user space and kernel space. For longer waits, blocking in the kernel (via futex()) avoids tying up the CPU with polling. This explains why the program doesn't need any futex() calls in normal operation.
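As a rough sketch of that spin-then-block idea (this is not the glibc futex code, just the concept: try for a while in user space, then hand the CPU back; the spin budget here is arbitrary):

#include <atomic>
#include <sched.h>

std::atomic_flag locked = ATOMIC_FLAG_INIT;

void lock() {
    for (;;) {
        for (int spins = 0; spins < 100; ++spins) {
            if (!locked.test_and_set(std::memory_order_acquire))
                return;                       // acquired without leaving user space
        }
        sched_yield();                        // still contended: let the kernel run someone else
    }
}

void unlock() {
    locked.clear(std::memory_order_release);
}

A real futex-based mutex sleeps on the lock word with futex(FUTEX_WAIT) instead of yielding, so the waiter is woken exactly when the lock is released.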
You basically need to generate a core file at that moment.
Then you can load the program plus the core file in GDB and look at it:
man gcore
or
generate-core-file
During that time, there are zero futex syscalls happening. Why do they suddenly start?
This is because an uncontended mutex, implemented via futex, doesn't make a system call at all: it only performs an atomic operation, purely in user space. Only a contended lock is visible as a system call.
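You can observe this with a small test program run under strace -f: the futex() calls only show up once a second thread creates contention (the iteration count is arbitrary):

#include <pthread.h>
#include <thread>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void hammer() {
    for (int i = 0; i < 1000000; ++i) {
        pthread_mutex_lock(&m);               // uncontended case: atomic op only, no syscall
        pthread_mutex_unlock(&m);
    }
}

int main() {
    hammer();                                 // single thread: expect no futex() in the trace
    std::thread a(hammer), b(hammer);         // two threads: contention, futex() calls appear
    a.join();
    b.join();
}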

Block a thread with sleep vs block without sleep

I've created a multi-threaded application using C++ and POSIX threads, in which I now need to block a thread (the main thread) until a boolean flag is set (becomes true).
I've found two ways to get this done.
Spinning through a loop without sleep.
while(!flag);
Spinning through a loop with sleep.
while(!flag){
sleep(some_int);
}
If the first way works, why do some people write code the second way? And if the second way should be used, why should we make the current thread sleep? What are the disadvantages of that approach?
The first option (a "busy wait") wastes an entire core for the duration of the wait, preventing other useful work being done and/or wasting energy.
The second option is less wasteful - your waiting thread uses very little CPU and allows other threads to run. But it is still wasteful to keep switching back to the thread to check the flag.
Far better than either would be to use a condition variable, which allows the waiting thread to block without consuming any resources until it is able to proceed.
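In C++11 terms, a minimal flag wait with a condition variable looks like this (the names are illustrative; the same pattern exists with pthread_cond_t):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool flag = false;

void wait_for_flag() {                        // waiting (main) thread: blocks, no CPU burned
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return flag; });       // the predicate handles spurious wakeups for you
}

void set_flag() {                             // signalling thread
    {
        std::lock_guard<std::mutex> lock(m);
        flag = true;
    }
    cv.notify_one();
}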
while(!flag); will cause your thread to use all of its allotted time checking the condition. This wastes a lot of CPU cycles checking something which has likely not changed.
Sleeping for a bit causes the thread to pause and give up the CPU to programs that actually need it.
You shouldn't do either though; you should use a threading library to create a flag object and call its wait function, so that the kernel will pause the thread until the flag is set.
The first way (just the plain while) wastes resources, specifically the processor time of your process.
When a thread is put to sleep, the OS may decide to use the processor for other tasks, at least on systems with preemptive multitasking. In theory, if you had as many processors/cores as threads, there would not have to be any difference.
Whether a given solution is good or not depends on the operating system used, and sometimes on the architecture the program is running on. You should consult your syscall reference to find out more about this.

Pros and Cons of Busy Waiting on Modern Processors

I'm using busy waiting to synchronize access to critical regions, like this:
while (p1_flag != T_ID);
/* begin: critical section */
for (int i=0; i<N; i++) {
...
}
/* end: critical section */
p1_flag++;
p1_flag is a global volatile variable that is updated by another concurrent thread. In fact, I have two critical sections inside a loop and two threads (both executing the same loop) that alternate execution of these critical regions. For instance, the critical regions are named A and B:
Thread 1   Thread 2
A
B          A
A          B
B          A
A          B
B          A
           B
The parallel code executes faster than the serial one, but not as much as I expected. Profiling the parallel program using VTune Amplifier, I noticed that a large amount of time is spent in the synchronization directives, that is, the while(...) and the flag update. I'm not sure why I'm seeing such large overhead on these "instructions", since region A is exactly the same as region B. My best guess is that this is due to cache coherence latency: I'm using an Intel i7 Ivy Bridge machine, and this micro-architecture resolves cache coherence at the L3. VTune also reports that the while(...) instruction is consuming all the front-end bandwidth, but why?
To make the question(s) clear: Why are while(...) and update flag instructions taking so much execution time? Why would the while(...) instruction saturate the front-end bandwidth?
The overhead you're paying may very well be due to passing the sync variable back and forth between the core caches.
Cache coherency dictates that when you modify the cache line (p1_flag++) you need to have ownership of it. This means the write invalidates any copy existing in other cores, after waiting for any changes made by another core to be written back to a shared cache level. The line is then provided to the requesting core in the Modified state and the update is performed.
However, the other core would by then be constantly reading this line, reads that snoop the first core and ask whether it has a copy of that line. Since the first core holds a Modified copy of that line, the line gets written back to the shared cache and the core loses ownership.
Now, this depends on the actual hardware implementation, but if the line was snooped before the change was actually made, the first core would have to attempt to gain ownership of it again. In some cases I'd imagine this might take several attempts.
If you're set on using a busy wait, you should at least put a pause inside it: the _mm_pause() intrinsic, or just __asm("pause"). This serves both to give the other thread a chance to take the lock and release you from waiting, and to reduce the CPU effort spent busy-waiting (an out-of-order CPU would fill all its pipelines with parallel instances of this busy wait, consuming lots of power; a pause serializes it so only a single iteration can run at any given time, which consumes much less power to the same effect).
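Applied to the loop from the question, a hedged sketch would look like this (x86-specific; _mm_pause() comes from <immintrin.h>, and the flag here stands in for the question's global):

#include <immintrin.h>                        // _mm_pause(), x86/x86-64 only

volatile int p1_flag = 0;                     // std::atomic<int> would be the portable choice

void wait_for_turn(int T_ID) {
    while (p1_flag != T_ID) {
        _mm_pause();                          // spin-wait hint: throttles the loop, saves power,
    }                                         // and avoids flooding the pipeline with speculative reads
}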
A busy-wait is almost never a good idea in multithreaded applications.
When you busy-wait, thread scheduling algorithms will have no way of knowing that your loop is waiting on another thread, so they must allocate time as if your thread is doing useful work. And it does take processor time to check that variable over, and over, and over, and over, and over, and over...until it is finally "unlocked" by the other thread. In the meantime, your other thread will be preempted by your busy-waiting thread again and again, for no purpose at all.
This is an even worse problem if the scheduler is a priority-based one, and the busy-waiting thread is at a higher priority. In this situation, the lower-priority thread will NEVER preempt the higher-priority thread, thus you have a deadlock situation.
You should ALWAYS use semaphores or mutex objects or messaging to synchronize threads. I've never seen a situation where a busy-wait was the right solution.
When you use a semaphore or mutex, then the scheduler knows never to schedule that thread until the semaphore or mutex is released. Thus your thread will never be taking time away from threads that do real work.
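For example, with a POSIX semaphore the waiting thread sleeps in the kernel until it is posted (a minimal sketch; names are illustrative and error handling is omitted):

#include <semaphore.h>

sem_t work_ready;

void init() {
    sem_init(&work_ready, 0, 0);              // pshared = 0: shared between threads, initial count 0
}

void signal_work() {                          // producer: wake the waiter
    sem_post(&work_ready);
}

void wait_for_work() {                        // consumer: blocks in the kernel, no CPU wasted
    sem_wait(&work_ready);
}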

Does msleep() give cycles to other threads?

In a multi threaded app, is
while (result->Status == Result::InProgress) Sleep(50);
//process results
better than
while (result->Status == Result::InProgress);
//process results
?
By that I'm asking: will the first method be polite to other threads while waiting for results, rather than spinning constantly? The operation I'm waiting for usually takes about 1-2 seconds and runs on a different thread.
I would suggest using semaphores for such a case instead of polling. If you prefer active waiting, sleeping is a much better solution than evaluating the loop condition constantly.
It's better, but not by much.
As long as result->Status is not volatile, the compiler is allowed to reduce
while(result->Status == Result::InProgress);
to
if(result->Status == Result::InProgress) for(;;) ;
as the condition does not change inside the loop.
Calling the external function Sleep changes this, because the compiler must assume the call may modify the result structure (unless it knows that Sleep never modifies data). Thus, depending on the compiler, the version that calls Sleep is a lot less likely to turn into an endless loop.
There is also no guarantee that accesses to result->Status will be atomic. For specific memory layouts and processor architectures, reading and writing this variable may consist of multiple steps, which means that the scheduler may decide to step in in the middle.
As all you are communicating at this point is a simple yes/no, and the receiving thread should keep waiting while the answer is no, the best way is to use the appropriate thread synchronisation primitive provided by your OS. This has the advantage that your thread is woken up immediately when the condition changes, and that it uses no CPU in the meantime, as the OS knows what your thread is waiting for.
On Windows, use CreateEvent and co. to communicate using an event object; on Unix, use a pthread_cond_t object.
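On the pthreads side that looks roughly like this (the done flag stands in for result->Status leaving the InProgress state; the names are illustrative):

#include <pthread.h>

pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
bool done = false;

void wait_for_result() {                      // waiting thread
    pthread_mutex_lock(&m);
    while (!done)                             // loop guards against spurious wakeups
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
    // process results
}

void finish() {                               // worker thread: publish completion, wake the waiter
    pthread_mutex_lock(&m);
    done = true;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&cv);
}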
Yes, sleep and variants give up the processor. Other threads can take over. But there are better ways to wait on other threads.
Don't use the empty loop.
That also depends on your OS scheduling policy. For example, Linux uses the CFS scheduler by default, which distributes the processor fairly among all tasks. But if you make this thread a real-time thread with the FIFO policy, the code without sleep will never relinquish the processor unless a higher-priority thread comes along; threads of the same or lower priority will never get scheduled until you break out of the loop. If you apply SCHED_RR, threads of the same or higher priority will get scheduled, but lower-priority ones will not.
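For illustration, this is roughly how a thread gets a real-time policy on Linux (requires suitable privileges; the priority value is a caller-chosen example, and under SCHED_FIFO a spinning thread at that priority will starve equal- and lower-priority threads as described above):

#include <pthread.h>
#include <sched.h>

bool make_realtime_fifo(pthread_t thread, int priority) {
    sched_param sp{};
    sp.sched_priority = priority;             // pick within sched_get_priority_min/max(SCHED_FIFO)
    return pthread_setschedparam(thread, SCHED_FIFO, &sp) == 0;
}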

Fastest method to wait under thread contention

I'm using pthreads on Linux. I have a circular buffer to pass data from one thread to another. Maybe a circular buffer is not the best structure to use here, but changing that would not make my problem go away, so we'll just refer to it as a queue.
Whenever my queue is either full or empty, pop/push operations return NULL. This is problematic since my threads fire periodically, and waiting for the other thread's next loop iteration would take too long.
I've tried using semaphores (sem_post, sem_wait), but unlocking under contention takes up to 25 ms, which is about the period of my loop. I've also tried waiting with a pthread_cond_t, but the unlocking takes between 10 and 15 ms.
Is there a faster mechanism I could use to wait for data?
EDIT: OK, I used condition variables. I'm on an embedded device, so adding "more cores or CPU power" is not an option. This made me realise I had all sorts of thread priorities set all over the place, so I'll sort that out before going further.
You should use condition variables. The only faster ways are platform-specific, and they're only negligibly faster.
You're seeing what you think is poor performance simply because your threads are being de-scheduled. You're seeing long "delays" when your thread is near the end of its timeslice and the scheduler allows the unblocked thread to pre-empt the running thread. If you have more cores than threads or set your thread to a higher priority, you won't see these delays.
But these delays are actually a good thing, and you shouldn't be concerned about them. Other threads just get a chance to run too.