Implementing breakpoints that resume safely in multithreaded code - C++

I'm writing a debugger and currently trying to make breakpoints work reliably when multiple threads hit them at the same time. As far as I know, most debuggers implement breakpoints by replacing the first byte of the instruction with 0xCC, and that's how I'm currently doing it as well. However, I don't see any way of restoring the original byte while still being able to stop other threads that are about to hit that breakpoint, without halting all running threads. Does anyone have any information on how that's usually achieved? Is halting all threads really the only solution?

With all threads stopped, you restore the original byte, single-step only that one thread for one instruction, re-insert the breakpoint, then resume execution of all threads. If you are using one of the limited hardware debug registers instead, you can use RF to temporarily ignore the breakpoint for one instruction (see below).
Stopping just the one thread during debugging while the other threads keep running is asking for trouble. Consider how you would handle hitting the same or a different breakpoint while you were still stopped at the first one, or what happens if an exception occurs.
On Intel CPUs there is a flag in the EFLAGS register (the Resume Flag, bit 16). When set, it lets the first instruction after resuming execute without triggering an instruction breakpoint; note that it applies to the hardware breakpoints, not to the breakpoint instruction.
Chapter 17 in Volume 3 (the System Programming Guide, available for download from Intel) contains lots of details on the Debug features of Intel IA-32 CPUs.
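A minimal sketch of that restore/step/re-arm sequence, assuming a Linux target driven with ptrace and an x86-64 tracee (the step_over_breakpoint name and saved-byte bookkeeping are illustrative; a Windows debugger would use WriteProcessMemory, FlushInstructionCache and the trap flag instead):

    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <cstdint>

    // All threads are assumed to be stopped already; `tid` is the thread that hit the int3.
    void step_over_breakpoint(pid_t tid, uintptr_t bp_addr, uint8_t saved_byte) {
        // 1. Put the original first byte back in place of the 0xCC.
        long word = ptrace(PTRACE_PEEKTEXT, tid, (void*)bp_addr, nullptr);
        ptrace(PTRACE_POKETEXT, tid, (void*)bp_addr,
               (void*)((word & ~0xFFL) | saved_byte));

        // 2. Rewind RIP to the breakpoint address (int3 leaves RIP one byte past the 0xCC).
        user_regs_struct regs{};
        ptrace(PTRACE_GETREGS, tid, nullptr, &regs);
        regs.rip = bp_addr;
        ptrace(PTRACE_SETREGS, tid, nullptr, &regs);

        // 3. Single-step only this thread across the real instruction.
        ptrace(PTRACE_SINGLESTEP, tid, nullptr, nullptr);
        int status = 0;
        waitpid(tid, &status, 0);

        // 4. Re-arm the breakpoint; the caller can now resume every thread.
        word = ptrace(PTRACE_PEEKTEXT, tid, (void*)bp_addr, nullptr);
        ptrace(PTRACE_POKETEXT, tid, (void*)bp_addr, (void*)((word & ~0xFFL) | 0xCC));
    }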

I'm aware that temporarily pausing all threads is the common way to solve that. I'm asking if there's any way to avoid doing that.
The first thread to hit your int3 software breakpoint is the one that you want to stop.
If other threads hit it before you can patch it back to the correct contents, resume those threads after removing the software breakpoint. (x86 has coherent instruction caches, so you can safely modify a single code byte without other cores needing to run a fence / isync instruction to re-sync their instruction caches with data cache. This is a harder problem on other ISAs.)
Other threads can see a small interruption.
Of course, if the user puts a breakpoint inside a critical section (with a lock held), or single-steps into a critical section, the other threads will block on that. This is also possible for lockless code that isn't lock-free (in the computer science sense).
Examining and modifying memory while other threads are running is potentially risky. Another thread could unmap memory just before you try to read or modify it. As long as your debugger itself doesn't crash, it's up to the user how much of a mess they want to make, though.
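A companion sketch for the fixup mentioned above, under the same ptrace/x86-64 assumptions: a thread that trapped on the int3 after the breakpoint byte had already been restored just needs its RIP backed up over the consumed 0xCC and a plain continue, and it will then execute the original instruction.

    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>

    void resume_after_stale_int3(pid_t tid) {
        user_regs_struct regs{};
        ptrace(PTRACE_GETREGS, tid, nullptr, &regs);
        regs.rip -= 1;                               // back up over the byte that was 0xCC
        ptrace(PTRACE_SETREGS, tid, nullptr, &regs);
        ptrace(PTRACE_CONT, tid, nullptr, nullptr);  // no single-step needed for this thread
    }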

Related

How to find which thread will execute an instruction?

I'm very surprised this hasn't been asked before. I'm trying to put a breakpoint on a specific instruction and read the registers in an already running process (Following this post: Read eax register).
I found the instruction I'm looking for; the problem I've been running into is finding the right thread in which the instruction is going to be executed, so I can call SetThreadContext() on it. This is a multithreaded program, so it's not as simple as looking up the single thread associated with the process.
I tried looking through Cheat Engine's source to see how it does this, but I couldn't find much.
One idea that comes to mind is just setting the breakpoint in every thread's context, but I'd like to avoid that.
EDIT: Forgot to mention I'm trying to do this with hardware breakpoints (using debug registers)
Unless you already know the answer / can predict the future, you need to set a hardware breakpoint in every thread that might run the instruction you care about.
The debug registers are per-core (and thus per-thread with context-switching), so a core will only actually break if the thread it's executing has its debug registers set to break on that instruction.
It might be easier to use a software breakpoint (0xcc byte replacing the first byte of the instruction) because you just have to store that once and every thread will see it. (x86 has coherent instruction caches; you don't have to invalidate them.)
As Margaret points out, once your breakpoint handler runs, you check the EIP / RIP of every thread, and the ones that are currently at that instruction are the one(s) that have reached the breakpoint and will run that instruction if single-stepped or resumed. (Or an address in your handler, if the handler runs in the context of that thread.)
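A hedged sketch of arming DR0 as an execute breakpoint in one thread on 64-bit Windows (set_hw_exec_breakpoint is an illustrative name; you would call it for every thread ID of the process, as suggested above):

    #include <windows.h>

    bool set_hw_exec_breakpoint(DWORD thread_id, void* addr) {
        HANDLE h = OpenThread(THREAD_GET_CONTEXT | THREAD_SET_CONTEXT |
                              THREAD_SUSPEND_RESUME, FALSE, thread_id);
        if (!h) return false;
        bool ok = false;
        if (SuspendThread(h) != (DWORD)-1) {             // don't edit a running thread's context
            CONTEXT ctx{};
            ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
            if (GetThreadContext(h, &ctx)) {
                ctx.Dr0 = (DWORD64)addr;                 // linear address to break on
                ctx.Dr7 &= ~((DWORD64)0xF << 16);        // R/W0 = 00 (execute), LEN0 = 00 (1 byte)
                ctx.Dr7 |= 1;                            // L0: locally enable DR0
                ok = SetThreadContext(h, &ctx) != 0;
            }
            ResumeThread(h);
        }
        CloseHandle(h);
        return ok;
    }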

How do I determine from strace output what part of my program is failing to acquire a mutex

I'm working on an embedded Linux system (3.12.something), and our application, after some random amount of time, starts hogging the CPU. I've run strace on our application, and right when the problem happens, I see a lot of lines similar to this in the strace output:
[48530666] futex(0x485f78b8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.009002>
I'm pretty sure this is the smoking gun I'm looking for and there is a race of some sort. However, I now need to figure out how to identify the place in the code that's trying to get this mutex. How can I do that? Our code is compiled with GCC and has debugging symbols in it.
My current thinking (that I haven't tried yet) is to print out a string to stdout and flush before trying to grab any mutex in our system, with the expectation that the string will print right before strace complains about getting the lock ... but there are a LOT of places in the code that would have to be instrumented like this.
EDIT: Another strange thing that I just realized is that our program doesn't start hogging the CPU until some random time has passed since it was run (5 minutes to 5 hours and anywhere in between). During that time, there are zero futex syscalls happening. Why do they suddenly start? From what I've read, I think maybe they are being used properly in userspace until something fails and falls back to making a futex() syscall...
Any suggestions?
If you repeatedly lock a mutex for a short time from different threads, e.g. one protecting a global logger, you can cause a so-called thread convoy. The problem doesn't occur until two threads compete for the lock. The first gets the lock and holds it for a short time; then, when it needs the lock a second time, it gets preempted because the second thread is already waiting. The second one does the same. The timeslice available to each thread is suddenly reduced to the time between two lock attempts, causing many context switches and a corresponding slowdown. Further, all but one thread are always blocked on the mutex, effectively disabling any parallel execution.
In order to fix this, make your threads cooperate instead of competing for resources. For the logger example above, consider e.g. a lock-free queue for the entries, or separate queues for each thread using thread-local storage, as sketched below.
Concerning the futex() calls: the idea is to spin on an atomic flag in user space and only after a number of iterations fall back to the actual OS wait. The atomic flag is available without the expensive switch between user space and kernel space; for longer waits, blocking in the kernel (via futex()) avoids burning CPU on polling. This explains why the program doesn't need any futex() calls in normal operation.
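A small sketch of the separate-queues-per-thread idea for a logger (all names here are illustrative, not from any real library): each thread appends to its own queue found via thread-local storage, so the only mutex a thread ever takes is one that is practically uncontended.

    #include <deque>
    #include <memory>
    #include <mutex>
    #include <string>
    #include <vector>

    struct PerThreadLog {
        std::mutex m;                          // shared only with the background drainer
        std::deque<std::string> entries;
    };

    std::mutex g_registry_mutex;
    std::vector<std::shared_ptr<PerThreadLog>> g_registry;   // one slot per logging thread

    void log_line(std::string s) {
        thread_local std::shared_ptr<PerThreadLog> mine = [] {
            auto p = std::make_shared<PerThreadLog>();
            std::lock_guard<std::mutex> g(g_registry_mutex);  // touched once per thread
            g_registry.push_back(p);
            return p;
        }();
        std::lock_guard<std::mutex> g(mine->m);
        mine->entries.push_back(std::move(s));
    }

    // A background thread periodically walks g_registry, swaps out each deque under
    // its per-thread mutex, and writes the entries, so no single lock is hammered.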
You basically need to generate a core file at that moment. Then you can load the program plus the core into GDB and look at it. See
man gcore
or, from within GDB,
generate-core-file
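For example (binary name and PID are placeholders), with `thread apply all bt` showing which code path each thread is blocked in:

    gcore 1234                  # writes core.1234 for the running process
    gdb ./myapp core.1234
    (gdb) info threads
    (gdb) thread apply all bt   # backtrace of every thread, with your debug symbols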
During that time, there are zero futex syscalls happening. Why do they suddenly start?
This is due to the fact that an uncontested mutex, implemented via a futex, doesn't make a system call at all; it is just an atomic operation, purely in user space. Only a contested lock is visible as a system call.
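A hedged sketch of that fast-path/slow-path split (the classic three-state futex mutex; FutexMutex is an illustrative name). The uncontested lock and unlock are single atomic operations, and only the contended path issues the futex() call that strace shows; the EAGAIN in your trace is simply FUTEX_WAIT noticing that the value changed before it could sleep.

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct FutexMutex {
        std::atomic<int> state{0};        // 0 = free, 1 = locked, 2 = locked with waiters

        static long futex(std::atomic<int>* addr, int op, int val) {
            return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                           nullptr, nullptr, 0);
        }

        void lock() {
            int c = 0;
            if (state.compare_exchange_strong(c, 1))      // fast path: pure user space
                return;
            if (c != 2) c = state.exchange(2);            // announce that we are waiting
            while (c != 0) {
                futex(&state, FUTEX_WAIT_PRIVATE, 2);     // the syscall strace shows
                c = state.exchange(2);
            }
        }

        void unlock() {
            if (state.exchange(0) == 2)                   // only wake if someone may be asleep
                futex(&state, FUTEX_WAKE_PRIVATE, 1);
        }
    };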

How can I avoid preemption of my thread in user mode

I have a simple chunk of deterministic work that only takes thirteen machine instructions to complete. Because the first instruction takes a homemade semaphore (spinlock) and the last instruction releases it, I am safe from all of the other threads running on the other cores as they are attempting to take and give the same semaphore.
The problem arises when something interrupts a thread holding the semaphore before it can finish its "critical section". Worst case, the interruption kills the thread while it holds the semaphore, or, as can happen, one of the threads normally competing for the semaphore branches out into code that can generate the interrupt, causing a deadlock.
I don't have a way of synchronizing with these other threads when they branch into the parts of the code I can't control. I think I need to disable interrupts, like I used to do in my old VxWorks days when I was running in kernel mode. It's always thirteen instructions, and I am always completely safe if I can get all thirteen instructions done before I have to honor an interrupt. Oh, and it is all my own internal data; other than the homemade semaphore, there is nothing that locks anything else up.
I have read several answers that I think are close. Most have to do with Critical Section calls on the Windows API (wrong OS but maybe the right concept). Most of the wrong solutions assume that I can get all of the offending threads to use a mutex that I create with the pthread libraries.
I need this solution in C/C++ on Linux and Solaris.
Johnny Crash's question is very close
prevent linux thread from being interrupted by scheduler
KermitG also
Can I prevent a Linux user space pthread yielding in critical code?
Thanks for your consideration.
You cannot prevent preemption of a user-mode thread. Critical sections (and all other sync objects) prevent collisions between your threads, but they by no means prevent the OS from preempting them.
If your other threads branch into something on a timeout, and that something may lead to a deadlock, you have a design problem.
A correct design should be maximally pessimistic: preemption may occur anywhere, for an indeterminate amount of time.
Yes, yes - 7 years old - I need to do exactly this but for other reasons.
So I put this here for others to read in a historical context.
I am writing an emulation layer for an embedded RTOS where I need to emulate the embedded platform's CPU_irq_disable() and CPU_irq_restore(). The closest thing I can think of is disabling preemption in the scheduler.
Yes, the target does have an RTOS - sometimes it does not.
I/O is emulated via sockets, i.e. a serial port becomes a stream socket.
A GPIO pin (edge IRQ) can be a socket too. The current value lives in a quasi-global inside the driver, and waiting for a pin change means waiting for a packet to arrive on a socket.
So the socket read thread acts like an IRQ when a packet shows up.
Thus, to emulate IRQ disable, it seems reasonable to do so by disabling preemption within my own application.
Also at the embedded application layer, I need to emulate what would be a superloop.
No amount of mutex stuff is going to emulate the embedded platform reasonably.
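A minimal sketch of that approach, purely as an illustration (the gate object and handler name are invented; CPU_irq_disable/CPU_irq_restore are the platform calls being emulated). It does not stop the OS from preempting anything; it only keeps the emulated ISRs out of the superloop's "interrupts disabled" windows, which is what the emulation needs.

    #include <mutex>

    static std::recursive_mutex g_irq_gate;   // "interrupts enabled" == gate not held

    void CPU_irq_disable() { g_irq_gate.lock(); }
    void CPU_irq_restore() { g_irq_gate.unlock(); }

    // The socket-reader thread that plays the role of an edge IRQ:
    void on_packet(/* emulated GPIO edge */) {
        std::lock_guard<std::recursive_mutex> irq(g_irq_gate);  // held off while "disabled"
        // ... run the emulated interrupt handler here ...
    }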

How do you stop a thread and flush its registers into the stack?

I'm creating a concurrent memory reclamation algorithm in C++. Periodically, the stacks of executing mutator threads need to be inspected, so that I can see what references the threads are currently holding. In the process of doing this, I need to also check the registers of the mutator thread to check any references that might be in there.
Clearly many JVM's and C# vm's have no problem doing this as part of their garbage collection cycles. However, I haven't been able to find a definitive solution to this issue.
I can't quite tease apart what is going on in the Boehm garbage collector in order to inspect the root set; if you can (or know how it's done), I'd really like to know.
Ideally I would be able to cause the mutator thread to be interrupted and execute a piece of handler code which would report its PC and also flush any register-based references onto the stack, and then perhaps help finish the collection cycle. I believe that most compilers on most systems will automatically flush the registers when interrupt or signal handlers are called, but I'm not clear on the specifics, or how to access that data. It seems that separate stacks might be used for interrupt and signal handlers. Additionally, I can't find any information about how to target a particular thread, or how to send a signal. Windows does not seem to support this form of signaling anyway, and I would like my system to run on both Linux and Windows on x86-64 processors.
Edit: SuspendThread() is used in some situations, although safepoints seem to be preferred. Any ideas on why? Is there any way to deal with long-lasting I/O waits or other waits for kernel code to return?
I thought this was a very interesting question, so I dug into it a bit. It turns out that the Hotspot JVM uses a mechanism called "safepoints" which cause the threads of the JVM to cooperatively all stop themselves so that the GC can begin. In other words, the thread initiating GC doesn't forcibly stop the other threads, the other threads voluntarily suspend themselves by various clever mechanisms.
I don't believe the JVM scans registers, because a safepoint is defined such that all roots are known (I presume this means in memory).
For more information see:
HotSpot Glossary -- which defines safepoints
safepoint.cpp -- the source in HotSpot that implements safepoints
A slide deck that describes safepoints in some detail (look 10 slides or so in)
In regards to your desire to "interrupt" all threads, according to the slide deck I referenced above, thread suspension is "unreliable on Solaris and Linux, e.g., spurious signals." I'm not sure what mechanism even exists for thread suspension that the slides would be referring to.
On Windows you should be able to get this done using SuspendThread (and ResumeThread) along with GetThreadContext (as Hans mentioned). All of these functions take handles to the specific thread you intend to target.
To get a list of all threads in the current process, see this (Toolhelp32 works on x64, despite its bad naming scheme...).
As a point of interest, one way to flush registers to the stack on 32-bit x86 is the PUSHAD instruction (it is not encodable in 64-bit mode, where you would push the general-purpose registers individually).
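A hedged sketch of the Windows path described above, enumerating the current process's threads with Toolhelp32 and reading RIP/RSP after SuspendThread (dump_thread_contexts is an illustrative name; real code needs more error handling and must never suspend the thread doing the suspending):

    #include <windows.h>
    #include <tlhelp32.h>
    #include <cstdio>

    void dump_thread_contexts() {
        DWORD pid = GetCurrentProcessId();
        HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
        if (snap == INVALID_HANDLE_VALUE) return;
        THREADENTRY32 te;
        te.dwSize = sizeof(te);
        for (BOOL more = Thread32First(snap, &te); more; more = Thread32Next(snap, &te)) {
            if (te.th32OwnerProcessID != pid || te.th32ThreadID == GetCurrentThreadId())
                continue;                                   // skip other processes and ourselves
            HANDLE h = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT, FALSE,
                                  te.th32ThreadID);
            if (!h) continue;
            if (SuspendThread(h) != (DWORD)-1) {
                CONTEXT ctx = {};
                ctx.ContextFlags = CONTEXT_FULL;            // control + integer registers
                if (GetThreadContext(h, &ctx))
                    std::printf("thread %lu: rip=%llx rsp=%llx\n",
                                te.th32ThreadID, ctx.Rip, ctx.Rsp);
                ResumeThread(h);
            }
            CloseHandle(h);
        }
        CloseHandle(snap);
    }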

My multithreaded program works slowly or appears to deadlock on a dual-core machine, please help

I have a program with several threads: one thread changes a global when it exits, and another thread repeatedly polls that global. There is no protection on the globals.
The program works fine on a uniprocessor. On a dual-core machine, it works for a while and then halts either in Sleep(0) or in SuspendThread(). Would anyone be able to help me out on this?
The code would be like this:
Thread 1:
    // do something...
    while (1)
    {
        // .....
        flag_thread1_running = false;
        SuspendThread(GetCurrentThread());
        continue;
    }
Thread 2:
    flag_thread1_running = true;
    ResumeThread(thread1);
    // ...do some other work here...
    while (flag_thread1_running) Sleep(0);
    // ....
The fact that you don't see any problem on a uniprocessor machine, but see problems on a multiproc machine is an artifact of the relatively large granularity of thread context switching on a uniprocessor machine. A thread will execute for N amount of time (milliseconds, nanoseconds, whatever) before the thread scheduler switches execution to a different thread. A lot of CPU instructions can execute in the typical thread timeslice. You can think of it as having a fairly large chunk of "free play" exclusive processor time during which you probably won't run into resource collisions because nothing else is executing on the processor.
When running on a multiproc machine, though, CPU instructions in two threads execute exactly at the same time. The size of the "free play" chunk of time is near zero.
To reproduce a resource contention issue between two threads, you need to get thread 1 to be accessing the resource and thread 2 to be accessing the resource at the same time, or very nearly the same time.
In the large-granularity thread switching that takes place on a uniprocessor machine, the chances that a thread switch will happen exactly in the right spot are slim, so the program may never exhibit a failure under normal use on a uniproc machine.
In a multiproc machine, the instructions are executing at the same time in the two threads, so the chances of thread 1 and thread 2 accessing the same resource at the same time are much, much greater - thousands of times more likely than the uniprocessor scenario.
I've seen it happen many times: an app that has been running fine for years on uniproc machines suddenly starts failing all over the place when executed on a new multiproc machine. The cause is a latent threading bug in the original code that simply never hit the right coincidence of timeslicing to repro on the uniproc machines.
When working with multithreaded code, it is absolutely imperative to test the code on multiproc hardware. If you have thread collision issues in your code, they will quickly present themselves on a multiproc machine.
As others have noted, don't use SuspendThread() unless you are a debugger. Use mutexes or other synchronization objects to coordinate between threads.
Try using something more like WaitForSingleObjectEx instead of SuspendThread.
You are hitting a race condition. Thread 2 may execute flag_thread1_running=true;
before thread 1 executes flag_thread1_running=false.
This is not likely to happen on a single CPU, because with the usual scheduling quantum of 10-20 ms you are unlikely to hit the problem. It can still happen there, but very rarely.
Using proper synchronization primitives is a must here. Instead of a bool, use an event. Instead of checking the bool in a loop, use WaitForSingleObject (or WaitForMultipleObjects for more elaborate stuff later).
It is possible to perform synchronization between threads using plain variables, but it is rarely a good idea and it is quite hard to do right; cf. How can I write a lock free structure?. It is definitely not a good idea to perform scheduling using Sleep, Suspend or Resume.
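A minimal sketch of the event-based replacement suggested above (handle and function names are illustrative): thread 1 signals an auto-reset event when it finishes, and thread 2 blocks on it instead of spinning on the flag with Sleep(0).

    #include <windows.h>

    HANDLE g_thread1_done = CreateEvent(nullptr, FALSE /*auto-reset*/, FALSE, nullptr);

    // Thread 1: called where the original code set flag_thread1_running = false
    // and then suspended itself.
    void thread1_signal_done() {
        SetEvent(g_thread1_done);
    }

    // Thread 2: replaces the  while (flag_thread1_running) Sleep(0);  busy wait.
    void thread2_wait_for_thread1() {
        WaitForSingleObject(g_thread1_done, INFINITE);   // blocks in the kernel, no polling
    }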
I guess that you already know that polling a global flag is a "Bad Idea™" so I'll skip that little speech. Try adding volatile to the flag declaration. That should force each read of it to read from memory. Without volatile, the implementation could be reading the flag into a register and not fetching it from memory.