How can I avoid preemption of my thread in user mode - c++

I have a simple chunk of deterministic work that only takes thirteen machine instructions to complete. Because the first instruction takes a homemade semaphore (spinlock) and the last instruction releases it, I am safe from all of the other threads running on the other cores as they are attempting to take and give the same semaphore.
The problem arises when some thread interrupts a thread holding the semaphore before it can finish its "critical section". In the worst case the interruption kills the thread while it holds the semaphore, or, as can happen, one of the threads normally competing for the semaphore branches out into code that can generate the interrupt, causing a deadlock.
I don't have a way of synchronizing with these other threads when they branch into those parts of the code I can't control. I think I need to disable interrupts like I used to do in my old VxWorks days when I was running in kernel mode. It's always thirteen instructions and I am always completely safe if I can get all thirteen instructions done before I have to honor an interrupt. Oh, and it is all my own internal data; other than the homemade semaphore there is nothing that locks anything else up.
I have read several answers that I think are close. Most have to do with Critical Section calls on the Windows API (wrong OS but maybe the right concept). Most of the wrong solutions assume that I can get all of the offending threads to use a mutex that I create with the pthread libraries.
I need this solution in C/C++ on Linux and Solaris.
Johnny Crash's question is very close
prevent linux thread from being interrupted by scheduler
KermitG also
Can I prevent a Linux user space pthread yielding in critical code?
Thanks for your consideration.

You cannot prevent preemption of a user-mode thread. Critical sections (and all other sync objects) prevent collisions between your threads, but they by no means prevent the OS from preempting them.
If your other threads branch into something on a timeout, and that something may lead to a deadlock, you have a design problem.
A correct design should be the most pessimistic: preemption may occur anywhere and last for an indeterminate time.
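One way to design for that pessimistic case, when the protected state is small, is to drop the lock and publish each update with a single compare-and-swap, so a thread that is preempted (or killed) mid-update never leaves anything locked. The sketch below is only an illustration under the assumption that the state fits into one 64-bit word; `atomic_update`, `bump_both_halves` and `run_demo` are hypothetical names, not anything from the question.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Apply `step` to the shared word with a CAS retry loop. No lock is ever
// held, so preemption at any point cannot block the other threads.
template <class Step>
std::uint64_t atomic_update(std::atomic<std::uint64_t>& word, Step step) {
    std::uint64_t old_value = word.load(std::memory_order_relaxed);
    std::uint64_t new_value;
    do {
        new_value = step(old_value);  // pure function of the snapshot
    } while (!word.compare_exchange_weak(old_value, new_value,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed));
    return new_value;
}

// Illustrative update: two 32-bit counters packed into one word and
// incremented together, standing in for the real critical section.
inline std::uint64_t bump_both_halves(std::uint64_t v) {
    std::uint32_t hi = static_cast<std::uint32_t>(v >> 32) + 1;
    std::uint32_t lo = static_cast<std::uint32_t>(v) + 1;
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hammer the word from several threads and return the final value.
inline std::uint64_t run_demo(int threads, int updates_per_thread) {
    assert(threads > 0 && updates_per_thread >= 0);
    std::atomic<std::uint64_t> word{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&word, updates_per_thread] {
            for (int i = 0; i < updates_per_thread; ++i)
                atomic_update(word, bump_both_halves);
        });
    }
    for (auto& th : pool) th.join();
    return word.load();
}
```

If the real state does not fit into one atomic word, this approach does not apply directly and a different lock-free design would be needed.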

Yes, yes - 7 years old - I need to do exactly this but for other reasons.
So I put this here for others to read in a historical context.
I am writing an emulation layer for an embedded RTOS where I need to emulate the embedded platform's CPU_irq_disable() and CPU_irq_restore(). The closest thing I can think of is to disable preemption in the scheduler.
Yes, the target does have an RTOS - sometimes it does not.
IO is emulated via sockets, ie: a serial port is like a stream socket!
A GPIO pin (Edge IRQ) can be a socket too. The current value is in a quasi-global owned by the driver, and waiting for a pin change means waiting for a packet to arrive on a socket.
So the socket read thread acts like an IRQ when a packet shows up.
Thus, it is reasonable to emulate IRQ disable by disabling preemption within my own application.
Also at the embedded application layer, I need to emulate what would be a superloop.
No amount of mutex stuff is going to emulate the embedded platform reasonably.

Related

Futex throughput on Linux

I have an async API which wraps some IO library. The library uses C-style callbacks and the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build this API, something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, or more precisely the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know to sync two threads, I switched to raw futexes, which improved the situation somewhat but not drastically. The performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article, Futex Scaling for Multi-core Systems, which closely matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line, user space, simple synchronization primitive, shared between two threads only, only one thread sets the completion, only one thread waits for completion.
EDIT001:
What if... Previously I said no spinning in a busy wait. But the futex already spins in a busy wait, right? Its implementation covers the more general case, which requires a global hash table to hold the futexes, queues for all subscribers, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics and no global data structures, and busy wait on it like the futex already does?
In my experience, the bottleneck is due to linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter? Well, Google proposed a solution to this with a FUTEX_SWAP api that does exactly this, but has yet to be accepted into the linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups which will hopefully be able to do something similar. However at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)
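For the two-thread completion signal described in the question, one possible compromise between pure spinning and a futex call per completion is a short bounded spin followed by a futex sleep, where the signalling side only issues FUTEX_WAKE if the waiter really went to sleep. This is a Linux-only sketch, not the library's or the kernel's own implementation: the `Completion` class, the spin count of 2000 and `run_completion_demo` are assumptions for illustration, and the cast from `std::atomic<int>` to `int*` assumes the atomic has the layout of a plain `int`. It does not remove the core wake-up and migration cost described above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// One-shot completion flag for exactly one waiter and one signaller.
class Completion {
    // 0 = pending, 1 = pending and the waiter is (about to be) asleep, 2 = done
    std::atomic<int> state_{0};
    static_assert(sizeof(std::atomic<int>) == sizeof(int),
                  "futex needs a plain 32-bit word");

    long futex(int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(&state_), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void signal() {
        // Pay for the wake syscall only if the waiter announced it is sleeping.
        if (state_.exchange(2, std::memory_order_acq_rel) == 1)
            futex(FUTEX_WAKE_PRIVATE, 1);
    }

    void wait(int spins = 2000) {
        for (int i = 0; i < spins; ++i)
            if (state_.load(std::memory_order_acquire) == 2) return;
        int expected = 0;
        // Announce the sleep; this fails harmlessly if signal() already ran.
        state_.compare_exchange_strong(expected, 1, std::memory_order_acq_rel,
                                       std::memory_order_acquire);
        // FUTEX_WAIT returns at once if the word is no longer 1.
        while (state_.load(std::memory_order_acquire) != 2)
            futex(FUTEX_WAIT_PRIVATE, 1);
    }

    bool done() const { return state_.load(std::memory_order_acquire) == 2; }
};

// Hypothetical demo: an "IO" thread completes after 5 ms while we wait.
inline bool run_completion_demo() {
    Completion c;
    assert(!c.done());
    std::thread io([&c] {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        c.signal();
    });
    c.wait();
    io.join();
    return c.done();
}
```

When completions typically arrive within the spin window, neither side enters the kernel; the spin budget is the knob that trades CPU time against wake-up latency.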

Upgradable mutex lies at shared memory on both Windows and Linux

I have 2 processes called Writer and Reader running on the same machine. Writer is a singular thread and writes data to a shared memory. Reader has 8 threads that intend to read data from the shared memory concurrently. I need a locking mechanism that meets following criteria:
1) At a time, either Writer or Reader is allowed to access the shared memory.
2) If Reader has permission to read data from the shared memory, all its own threads can read data.
3) Writer has to wait until Reader "completely" releases the lock (because it has multiple threads).
I have read much about sharable mutex that seems to be the solution. Here I describe more detailed about my system:
1) System should run on both Windows & Linux.
2) I divide the shared memory into two regions: locks & data. The data region is further divided into 100 blocks. I intend to create 100 "lock objects" (sharable mutex) and lay them on the locks region. These lock objects are used for synchronization of 100 the data blocks, 1 lock object for 1 data block.
3) Writer, Readers first determine which block it would like to access then try to acquire the appropriate lock. Once acquired the lock, it then performs on the data block.
My concern now is:
Is there any "built-in" way to lay the lock objects on shared memory on Windows and Linux (Centos) and then I can do lock/unlock with the objects without using boost library.
[Edited Feb 25, 2016, 09:30 GMT]
I can suggest a few things. It really depends on the requirements.
If it seems like the boost upgradeable mutex fits the bill, then by all means, use it. From a 5-minute read it seems you can use them in shm. I have no experience with it as I don't use boost. Boost is available on Windows and Linux so I don't see why not use it. You can always grab the specific code you like and bring it into your project without dragging the entire behemoth along.
Anyway, isn't it fairly easy to test and see is it good enough?
I don't understand the requirement for placing locks in shm. If it's no real requirement, and you want to use OS native objects, you can use a different mechanism per OS. Say, named mutex on Windows (not in shm), and pthread_rwlock, in shm, on Linux.
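On the Linux side, placing a pthread_rwlock in shared memory mainly means initializing it with the PTHREAD_PROCESS_SHARED attribute. The sketch below is a rough illustration only: it uses an anonymous shared mapping as a stand-in for the real shared memory region (which would only be visible to processes related by fork), and `create_shared_rwlock` / `destroy_shared_rwlock` are hypothetical helper names. Unrelated processes such as Writer and Reader would map a named region instead, and exactly one of them should run the initialization.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <sys/mman.h>

// Map one lock-sized shared region and initialize a process-shared rwlock in it.
// Returns nullptr on failure.
inline pthread_rwlock_t* create_shared_rwlock() {
    void* mem = mmap(nullptr, sizeof(pthread_rwlock_t), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;

    pthread_rwlockattr_t attr;
    if (pthread_rwlockattr_init(&attr) != 0) {
        munmap(mem, sizeof(pthread_rwlock_t));
        return nullptr;
    }
    // Without this attribute the lock is only valid inside one process.
    int rc = pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_t* lock = static_cast<pthread_rwlock_t*>(mem);
    if (rc == 0) rc = pthread_rwlock_init(lock, &attr);
    pthread_rwlockattr_destroy(&attr);

    if (rc != 0) {
        munmap(mem, sizeof(pthread_rwlock_t));
        return nullptr;
    }
    return lock;
}

// Destroy the lock and release the mapping created above.
inline void destroy_shared_rwlock(pthread_rwlock_t* lock) {
    assert(lock != nullptr);
    pthread_rwlock_destroy(lock);
    munmap(lock, sizeof(pthread_rwlock_t));
}
```

Readers would then call pthread_rwlock_rdlock() and the writer pthread_rwlock_wrlock() on the same mapped object; for the layout in the question, an array of 100 such locks would sit in the locks region.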
I know what I would prefer to use: a seqlock.
I work in the low-latency domain, so I'm picking what gets me the lowest possible latency. I measure it in cpu cycles.
From you mentioning that you want a lock per object, rather than one big lock, I assume performance is important.
There're important questions here, though:
Since it's in shm, I assume it's POD (flat data)? If not, you can switch to a read/write spinlock.
Are you ok with spinning (busy wait) or do you want to sleep-wait? seqlocks and spinlocks are no OS mechanism, so there's nobody to put your waiting threads to sleep. If you do want to sleep-wait, read #4
If you care to know the other side (reader/write) died, you have to impl that in some other way. Again, because seqlock is no OS beast. If you want to be notified of other side's death as part of the synchronization mechanism, you'll have to settle for named mutexes, on Windows, and on robust mutexes, in shm, on Linux
Spinlocks and seqlocks provide the maximum throughput and minimum latency. With kernel supported synchronization, a big part of the latency is spent in switching between user and kernel space. In most applications it is not a problem as synchronization is only happening in a small fraction of the time, and the extra latency of a few microseconds is negligible. Even in games, 100 fps leaves you with 10ms per frame, that is eternity in term of mutex lock/unlock.
There are alternatives to spinlock that are usually not much more expensive.
In Windows, Critical Section is actually a spinlock with a back-off mechanism that uses an Event object. This was re-implemented using shm and named Event and called Metered Section.
In Linux, the pthread mutex is futex based. A futex is like Event on Windows. A non-robust mutex with no contention is just a spinlock.
These guys still don't provide you with notification when the other side dies.
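To make the seqlock suggestion concrete, here is a minimal single-writer sketch. It is an assumption-laden illustration rather than the answerer's actual code: the payload is just two 64-bit fields, `SeqLockedPair` and `run_seqlock_demo` are hypothetical names, and the demo runs inside one process. Placing it in shared memory would additionally rely on the atomics being lock-free on the target platform.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Single-writer seqlock: readers never block the writer, they retry instead.
struct SeqLockedPair {
    std::atomic<std::uint32_t> seq{0};  // odd while a write is in progress
    std::atomic<std::uint64_t> a{0}, b{0};

    void write(std::uint64_t x, std::uint64_t y) {  // one writer only
        std::uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);          // mark busy
        std::atomic_thread_fence(std::memory_order_release);
        a.store(x, std::memory_order_relaxed);
        b.store(y, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);          // publish
    }

    void read(std::uint64_t& x, std::uint64_t& y) const {
        std::uint32_t before, after;
        do {
            before = seq.load(std::memory_order_acquire);
            x = a.load(std::memory_order_relaxed);
            y = b.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            after = seq.load(std::memory_order_relaxed);
        } while ((before & 1u) != 0 || before != after);  // retry torn reads
    }
};

// Hypothetical demo: the writer keeps the invariant b == 2 * a, and the
// reader reports whether it ever observed a torn pair.
inline bool run_seqlock_demo(int writes) {
    assert(writes >= 0);
    SeqLockedPair shared;
    std::atomic<bool> stop{false};
    std::atomic<bool> torn{false};
    std::thread reader([&] {
        std::uint64_t x = 0, y = 0;
        while (!stop.load(std::memory_order_acquire)) {
            shared.read(x, y);
            if (y != 2 * x) torn.store(true);
        }
    });
    for (int i = 1; i <= writes; ++i)
        shared.write(static_cast<std::uint64_t>(i), 2 * static_cast<std::uint64_t>(i));
    stop.store(true, std::memory_order_release);
    reader.join();
    return !torn.load();
}
```

Note that the readers spin while a write is in progress, which is the busy-wait trade-off raised in the questions above.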
Addition [Feb 26, 2016, 10:00 GMT]
How to add your own owner death detection:
The Windows named mutex and pthread robust mutex have this capability built-in. It's easy enough to add it yourself when using other lock types and could be essential when using user-space-based locks.
First, I have to say, in many scenarios it's more appropriate to simply restart everything instead of detecting owner's death. It is definitely simpler as you also have to release the lock from a process that is not the original owner.
Anyway, native way to detect a process death is easy on Windows - processes are waitable objects so you can just wait on them. You can wait for zero time for an immediate check.
On Linux, only the parent is supposed to know about its child's death, so it is less trivial. The parent can get SIGCHLD, or use waitpid().
My favorite way to detect process death is different. I connect a non-blocking TCP socket between the 2 processes and trust the OS to kill it on process death.
When you try to read data from the socket (on any of the sides) you'd read 0 bytes if the peer has died. If it's still alive, you'd get EWOULDBLOCK.
Obviously, this also works between boxes, so kinda convenient to have it uniformly done once and for all.
Your worker loop will have to change to interleave the peer death check with its usual work.
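The socket-based death check described above can be sketched on POSIX as a non-blocking peek. This is only an illustration: `check_peer`, `PeerState` and `run_peer_demo` are hypothetical names, and the demo uses an AF_UNIX socketpair inside one process as a stand-in for the TCP connection between the two processes. A Windows build would need the Winsock equivalents.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum class PeerState { Alive, Dead, Error };

// Peek at the socket without blocking and without consuming any data.
inline PeerState check_peer(int fd) {
    assert(fd >= 0);
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0) return PeerState::Dead;   // orderly shutdown: peer closed or exited
    if (n > 0) return PeerState::Alive;   // data is pending, so the peer was alive
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return PeerState::Alive;          // nothing to read, connection still open
    return PeerState::Error;              // e.g. ECONNRESET, which usually also
                                          // means the peer is gone
}

// Hypothetical demo: the "peer" end is closed to simulate process death.
inline bool run_peer_demo() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return false;
    bool alive_before = check_peer(sv[0]) == PeerState::Alive;
    close(sv[1]);                         // the OS does this when a process dies
    bool dead_after = check_peer(sv[0]) == PeerState::Dead;
    close(sv[0]);
    return alive_before && dead_after;
}
```

A worker loop could call `check_peer` once per iteration, or every N iterations if the extra system call is too expensive for the hot path.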
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
//Mutex to protect access to the queue
boost::interprocess::interprocess_mutex mutex;
//Condition to wait when the queue is empty
boost::interprocess::interprocess_condition cond_empty;
//Condition to wait when the queue is full
boost::interprocess::interprocess_condition cond_full;

Make sure that main thread run on it's own core alone

I have a main thread which does some not-so-heavy work, and I'm also creating worker threads which do very heavy work. All documentation and examples show how to create a number of threads equal to std::thread::hardware_concurrency(). But since the main thread already exists, the number of threads becomes std::thread::hardware_concurrency() + 1. For example:
my machine supports 2 hardware threads.
in the main thread I'm creating these 2 threads and the total number of threads becomes 3.
the core with the main thread does its job plus (probably) the worker job.
Of course I don't want this because UI (which is done in main thread) becomes not responsive due to latency. What will happen if I create std::thread::hardware_concurrency() - 1 thread? Will it guarantee that the main thread and only main thread is running on single core? How can I check it?
P.S.: I'm using some sort of pool - I start threads on the program start and stop on exit. During the execution all worker threads run infinite while loop.
As others have written in the comments, you should carefully consider whether you can do a better job than the OS.
That being said, it is technically possible:
Use the native_handle method to get the OS's handle to your thread.
Consult your OS's documentation for setting the thread affinity. E.g., using pthreads on Linux, you'd want pthread_setaffinity_np.
This gives you full control over where each thread runs. In particular, you can give one of the threads a core of its own.
Note that this isn't part of the standard, as it is a level that is not portable. This might serve as another hint that it's possibly not what you're looking for.
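The two steps above can be sketched for Linux as follows. This is a non-portable illustration that assumes glibc's pthread_setaffinity_np; `pin_thread_to_cpu` and `first_allowed_cpu` are hypothetical helper names, and on Windows a different API would be needed.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Restrict the given thread to a single CPU.
// Returns 0 on success or an errno-style error code.
inline int pin_thread_to_cpu(pthread_t handle, int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(handle, sizeof(set), &set);
}

// Lowest-numbered CPU the calling thread is currently allowed to run on,
// or -1 on failure. Useful inside containers where CPU 0 may be off limits.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}
```

For a std::thread named worker, the call would look like pin_thread_to_cpu(worker.native_handle(), 1). Pinning one thread does not by itself keep other threads off that core; to give the main thread a core of its own, every worker would also have to be restricted to the remaining cores.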
No - std::thread::hardware_concurrency() only gives you a hint about the potential number of cores in use for multithreading. You might be interested in CPU Affinity Masks (Putting Threads on different CPUs). This works on the pthread level, which you can reach via std::thread::native_handle (http://en.cppreference.com/w/cpp/thread/thread/native_handle)
Depending on your OS, you can get the thread's native handle, and control their priority levels using pthread_setschedparam(), for example giving the worker threads a lower priority than the main thread. This can be one solution to the UI problem. In general, number of threads need not match number of available HW cores.
There are definitely cases where you want to be able to gain full control, and reliably analyze what is going on. You are using Windows, but as an example, it is possible on a multicore machine to exclude e.g. one core from the normal Linux OS scheduler, and use that core for time-critical hard real-time tasks. In essence, you will own that core and handle interrupts for it, thereby enabling something close to hard real-time response times and predictability. Requires careful programming and analysis, and takes a significant effort. But very attractive if done right.

How do I determine from strace output what part of my program is failing to acquire a mutex

I'm working on an embedded Linux system (3.12.something), and our application, after some random amount of time, starts hogging the CPU. I've run strace on our application, and right when the problem happens, I see a lot of lines similar to this in the strace output:
[48530666] futex(0x485f78b8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.009002>
I'm pretty sure this is the smoking gun I'm looking for and there is a race of some sort. However, I now need to figure out how to identify the place in the code that's trying to get this mutex. How can I do that? Our code is compiled with GCC and has debugging symbols in it.
My current thinking (that I haven't tried yet) is to print out a string to stdout and flush before trying to grab any mutex in our system, with the expectation that the string will print right before strace complains about getting the lock ... but there are a LOT of places in the code that would have to be instrumented like this.
EDIT: Another strange thing that I just realized is that our program doesn't start hogging the CPU until some random time has passed since it was run (5 minutes to 5 hours and anywhere in between). During that time, there are zero futex syscalls happening. Why do they suddenly start? From what I've read, I think maybe they are being used properly in userspace until something fails and falls back to making a futex() syscall...
Any suggestions?
If you perpetually and often lock a mutex for a short time from different threads, like e.g. one protecting a global logger, you might cause a so-called thread convoy. The problem doesn't occur until two threads compete for the lock. The first gets the lock and holds it for a short time, then, when it needs the lock a second time, it gets preempted because the second one is waiting already. The second one does the same. The timeslice available to each thread is suddenly reduced to the time between two lock attempts, causing many context switches and the according slowdown. Further, all but one thread is always blocked on the mutex, effectively disabling any parallel execution.
In order to fix this, make your threads cooperate instead of competing for resources. For above example of a logger, consider e.g. a lock-free queue for the entries or separate queues for each thread using thread-local storage.
Concerning the futex() calls, the idea is to poll an atomic flag and after some rotations use the actual OS mutex. The atomic flag is available without the expensive switch between user-space and kernel-space. For longer breaks, using the kernel preemption (with futex()) avoids blocking the CPU with polling. This explains why the program doesn't need any futex() calls in normal operation.
You basically need to generate a core file at the moment the problem occurs.
Then you could load program+core in GDB and look at it
man gcore
or
generate-core-file
During that time, there are zero futex syscalls happening. Why do they suddenly start?
This is due to the fact that an uncontended mutex, implemented via futex, doesn't make a system call; it only performs an atomic operation, purely in user space. Only a CONTENDED lock is visible as a system call.
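To illustrate why an uncontended lock never shows up in strace, here is a minimal futex-backed mutex in the style of the well-known three-state design (0 = unlocked, 1 = locked, 2 = locked with possible waiters). It is a Linux-only sketch and not glibc's actual code; `FutexMutex` and its `syscalls()` counter are hypothetical, and the cast assumes `std::atomic<int>` has the layout of a plain `int`. The `2` argument in the strace line is consistent with this kind of scheme, and FUTEX_WAIT returns EAGAIN when the word no longer holds the expected value by the time the kernel checks it.

```cpp
#include <atomic>
#include <cassert>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class FutexMutex {
    std::atomic<int> state_{0};        // 0 unlocked, 1 locked, 2 locked + waiters
    std::atomic<long> syscalls_{0};    // instrumentation for this sketch only
    static_assert(sizeof(std::atomic<int>) == sizeof(int),
                  "futex needs a plain 32-bit word");

    long futex(int op, int val) {
        syscalls_.fetch_add(1, std::memory_order_relaxed);
        return syscall(SYS_futex, reinterpret_cast<int*>(&state_), op, val,
                       nullptr, nullptr, 0);
    }

public:
    void lock() {
        int c = 0;
        // Fast path: uncontended, a single atomic operation in user space.
        if (state_.compare_exchange_strong(c, 1, std::memory_order_acquire,
                                           std::memory_order_relaxed))
            return;
        // Slow path: mark the lock as contended and sleep in the kernel.
        if (c != 2) c = state_.exchange(2, std::memory_order_acquire);
        while (c != 0) {
            futex(FUTEX_WAIT_PRIVATE, 2);  // EAGAIN if state_ is no longer 2
            c = state_.exchange(2, std::memory_order_acquire);
        }
    }

    void unlock() {
        int prev = state_.fetch_sub(1, std::memory_order_release);
        assert(prev != 0 && "unlock of an unlocked mutex");
        if (prev != 1) {                   // there may be sleepers: wake one
            state_.store(0, std::memory_order_release);
            futex(FUTEX_WAKE_PRIVATE, 1);
        }
    }

    long syscalls() const { return syscalls_.load(std::memory_order_relaxed); }
};
```

With a single thread locking and unlocking, the counter stays at zero; futex calls only start once a second thread finds the lock held, which matches the "zero futex syscalls for hours, then many" pattern in the question.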

My multithread program works slowly or appear deadlock on dual core machine, please help

I have a program with several threads. One thread changes a global when it exits, and the other thread repeatedly polls the global. There is no protection on the globals.
The program works fine on a uniprocessor. On a dual core machine, it works for a while and then halts either on Sleep(0) or SuspendThread(). Would anyone be able to help me out on this?
The code would be like this:
Thread 1:
do something...
while(1)
{
.....
flag_thread1_running=false;
SuspendThread(GetCurrentThread());
continue;
}
Thread 2
flag_thread1_running=true;
ResumeThread(thread1);
.....do some other work here....
while(flag_thread1_running) Sleep(0);
....
The fact that you don't see any problem on a uniprocessor machine, but see problems on a multiproc machine is an artifact of the relatively large granularity of thread context switching on a uniprocessor machine. A thread will execute for N amount of time (milliseconds, nanoseconds, whatever) before the thread scheduler switches execution to a different thread. A lot of CPU instructions can execute in the typical thread timeslice. You can think of it as having a fairly large chunk of "free play" exclusive processor time during which you probably won't run into resource collisions because nothing else is executing on the processor.
When running on a multiproc machine, though, CPU instructions in two threads execute exactly at the same time. The size of the "free play" chunk of time is near zero.
To reproduce a resource contention issue between two threads, you need to get thread 1 to be accessing the resource and thread 2 to be accessing the resource at the same time, or very nearly the same time.
In the large-granularity thread switching that takes place on a uniprocessor machine, the chances that a thread switch will happen exactly in the right spot are slim, so the program may never exhibit a failure under normal use on a uniproc machine.
In a multiproc machine, the instructions are executing at the same time in the two threads, so the chances of thread 1 and thread 2 accessing the same resource at the same time are much, much greater - thousands of times more likely than the uniprocessor scenario.
I've seen it happen many times: an app that has been running fine for years on uniproc machines suddenly starts failing all over the place when executed on a new multiproc machine. The cause is a latent threading bug in the original code that simply never hit the right coincidence of timeslicing to repro on the uniproc machines.
When working with multithreaded code, it is absolutely imperative to test the code on multiproc hardware. If you have thread collision issues in your code, they will quickly present themselves on a multiproc machine.
As others have noted, don't use SuspendThread() unless you are a debugger. Use mutexes or other synchronization objects to coordinate between threads.
Try using something more like WaitForSingleObjectEx instead of SuspendThread.
You are hitting a race condition. Thread 2 may execute flag_thread1_running=true;
before thread 1 executes flag_thread1_running=false.
This is not likely to happen on a single CPU, because with the usual scheduling quantum of 10-20 ms you are not likely to hit the problem. It will happen there as well, but very rarely.
Using proper synchronization primitives is a must here. Instead of bool, use event. Instead of checking the bool in a loop, use WaitForSingleObject (or WaitForMultipleObjects for more elaborate stuff later).
It is possible to perform synchronization between threads using plain variables, but it is rarely a good idea and it is quite hard to do it right - cf. How can I write a lock free structure?. It is definitely not a good idea to perform scheduling using Sleep, Suspend or Resume.
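As a portable illustration of "use an event instead of a bool", the sketch below builds a manual-reset event from std::mutex and std::condition_variable. It is a stand-in for the Win32 event plus WaitForSingleObject recommended above, not a replacement for it; `Event` and `run_event_demo` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Manual-reset event: wait() blocks until set() has been called.
class Event {
    mutable std::mutex m_;
    std::condition_variable cv_;
    bool signaled_ = false;

public:
    void set() {
        {
            std::lock_guard<std::mutex> lk(m_);
            signaled_ = true;
        }
        cv_.notify_all();
    }
    void reset() {
        std::lock_guard<std::mutex> lk(m_);
        signaled_ = false;
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return signaled_; });  // handles spurious wakeups
    }
    bool is_set() const {
        std::lock_guard<std::mutex> lk(m_);
        return signaled_;
    }
};

// Hypothetical demo mirroring the question: thread 2 starts thread 1's work
// with one event and sleeps on another instead of polling with Sleep(0).
inline int run_event_demo() {
    Event start, finished;
    int result = 0;
    assert(!finished.is_set());
    std::thread thread1([&] {
        start.wait();      // replaces SuspendThread / ResumeThread
        result = 42;       // "do something..."
        finished.set();    // replaces flag_thread1_running = false
    });
    start.set();
    finished.wait();       // replaces while (flag) Sleep(0);
    thread1.join();
    return result;
}
```

Because the flag is only read and written under the mutex, the ordering of the set and the wait no longer depends on which thread happens to run first.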
I guess that you already know that polling a global flag is a "Bad Idea™" so I'll skip that little speech. Try adding volatile to the flag declaration. That should force each read of it to read from memory. Without volatile, the implementation could be reading the flag into a register and not fetching it from memory.