How to debug a rare deadlock?

How to debug a rare deadlock? - c++

I'm trying to debug a custom thread pool implementation that has rarely deadlocks. So I cannot use a debugger like gdb because I have click like 100 times "launch" debugger before having a deadlock.
Currently, I'm running the threadpool test in an infinite loop in a shell script, but that means I cannot see variables and so on. I'm trying to std::cout data, but that slow down the thread and reduce the risk of deadlocks meaning that I can wait like 1hour with my infinite before getting messages. Then I don't get the error, and I need more messages, which means waiting one more hour...
How to efficiently debug the program so that its restart over and over until it deadlocks ? (Or maybe should I open another question with all the code for some help ?)
Thank you in advance !
Bonus question : how to check everything goes fine with a std::condition_variable ? You cannot really tell which thread are asleep or if a race condition occurs on the wait condition.

There are 2 basic ways:
Automate the running of program under debugger. Using gdb program -ex 'run <args>' -ex 'quit' should run the program under debugger and then quit. If the program is still alive in one form or another (segfault, or you broke it manually) you will be asked for confirmation.
Attach the debugger after reproducing the deadlock. For example gdb can be run as gdb <program> <pid> to attach to running program - just wait for deadlock and attach then. This is especially useful when attached debugger causes timing to be changed and you can no longer repro the bug.
In this way you can just run it in loop and wait for result while you drink coffee. BTW - I find the second option easier.

If this is some kind of homework - restarting again and again with more debug will be a reasonable approach.
If somebody pays money for every hour you wait, they might prefer to invest in a software that supports replay-based debugging, that is, a software that records everything a program does, every instruction, and allows you to replay it again and again, debugging back and forth. Thus instead of adding more debug, you record a session during which a deadlock happens, and then start debugging just before the deadlock happened. You can step back and forth as often as you want, until you finally found the culprit.
The software mentioned in the link actually supports Linux and multithreading.

Mozilla rr open source replay based debugging
https://github.com/mozilla/rr
Hans mentioned replay based debugging, but there is a specific open source implementation that is worth mentioning: Mozilla rr.
First you do a record run, and then you can replay the exact same run as many times as you want, and observe it in GDB, and it preserves everything, including input / output and thread ordering.
The official website mentions:
rr's original motivation was to make debugging of intermittent failures easie
Furthermore, rr enables GDB reverse debugging commands such as reverse-next to go to the previous line, which makes it much easier to find the root cause of the problem.
Here is a minimal example of rr in action: How to go to the previous line in GDB?

You can run your test case under GDB in a loop using the command shown in https://stackoverflow.com/a/8657833/341065: gdb --eval-command=run --eval-command=quit --args ./a.out.
I have used this myself: (while gdb --eval-command=run --eval-command=quit --args ./thread_testU ; do echo . ; done).
Once it deadlocks and does not exit, you can just interrupt it by CTRL+C to enter into the debugger.

An easy quick debug to find deadlocks is to have some global variables that you modify where you want to debug, and then print it in a signal handler. You can use SIGINT (sent when you interrupt with ctrl+c) or SIGTERM (sent when you kill the program):
int dbg;
int multithreaded_function()
{
signal(SIGINT, dbg_sighandler);
...
dbg = someVar;
...
}
void dbg_sighandler(int)
{
std::cout << dbg1 << std::endl;
std::exit(EXIT_FAILURE);
}
Like that you just see the state of all your debug variables when you interrupt the program with ctrl+c.
In addition you can run it in a shell while loop:
$> while [ $? -eq 0 ]
do
./my_program
done
which will run your program forever until it fails ($? is the exit status of your program and you exit with EXIT_FAILURE in your signal handler).
It worked well for me, especially for finding out how many thread passed before and after what locks.
It is quite rustic, but you do not need any extra tool and it is fast to implement.

Related

Attaching to gdb interupts and won't continue the process

got some big real time project to deal with (multiple processes (IPCs), multi Everything in short).
My working on process is started as service on Linux. I have the root access.
Here is the problem:
I'm trying to attach to a running proc, tried starting it through/with gdb but the result is the same: it stops the executable once I "touched" it with gdb or sometimes it throws:
Program received signal SIGUSR1, User defined signal 1. [Switching to Thread 0x7f9fe869f700 (LWP 2638)]
of course from there nothing can be done.
Tried:
handle all nostop
attach to launched as service (daemon) or launched as regular proc
started from gdb
thought maybe forking/multi-threaded problem - implemented in the very beginning sleep for 10 seconds - attached to it with "continue"
Guys, all I want it is to debug, hit the breakpoints, etc.
Please help! Share ideas.
Editing actual commands:
1) gdb attach myProcId. Then after reading symbols, I hit "c" which results:
Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7f9fe869f700 (LWP 2638)]
0x00007f9fec09bf73 in select () from /lib64/libc.so.6
2) If I make the first line 10 seconds sleep in the code, attaching to the process, hit "c", result: it runs, shows info threads, backtrace of main, but never hits the breakpoint (for sure the code runs there - I get logs and different behaviour if I change code there), meaning the process is stuck.
3) All other combinations like gdb path/to/my/proc args list, then start. Where arg list played with different related options gdb gives us.
Maybe worth to mention: process network packets related, timers driven also.
But for me the important thing is a current snapshot on break, i don't care what will happen to the system after timers expired.

Since you mentioned that you are debugging a multiprocessing program, I think the underlying program you have is to set the breakpoint in the correct subprocess.
Try break fork and set follow-fork-mode child/parent. What you want to achieve is have gdb attached to the process that is running the code you want to debug.
Refer to this link.
Another thought is to generate a crash, since you can compile the programe. For example add a int i = *(int*)NULL and that will generate a core dump. You can then debug the core dump with gdb <program> <core dump>. You can refer to this page for how to configure core dump.

Multithreaded application at boot sequence - C++/Debian

I'm having this obscure problem since 2 days : I created a launch-at-boot application in C++ on a debian system, which worked flawlessly until I integrated some multithreading elements.
There are only 2 threads (1 main and 1 child)
I included -lpthread and -pthread in the makefile
I tried both /.config/autostart and the .desktop file methods (same
result)
The program is lanched with sudo
There is no error/crash anywhere, the main thread works OK, but the
child thread runs 1 iteration only then stops for some reason
even tried to add some sleep in the lxsession boot sequence
If I launch the same command line than in the autostart file in a terminal (sudo or not), it works perfectly.
Its been 2 days and I just have NO CLUE !
If someone experienced this before or can find some logic in it, i'll be ever grateful.

It appears to me that you simply have ... a bug in your new logic. You have made an error in the design of your multi-threading logic, such that the child thread only runs one iteration. (Or, much more likely, stalls in an infinite-wait. Waits for a event that is never signaled, a semaphore that is never raised, a queue that runs dry and is never filled, and so on.)
We can help you further if you post excerpts of the code in question ... only illustrating how the child thread is launched and how it interacts with the parent. (Condition-variables, semaphores, and so-forth, which is probably where the crux of your error lies.)
I would suggest that "all the other stuff is irrelevant." You don't need "a sleep in the boot-sequence" (if the sequence waits for your program to complete, and if it needs to). I suggest that it seems to me that you simply have ... a bug in your new code which introduces multi-threading.
And you might wish to contemplate whether multi-threading is advantageous, given that you had a non-threaded version of the same thing that worked properly. If the processing that is to be done used to be done (successfully) by a single thread, such processing might or might not be more-advantageously processed by "n threads." Should you find-and-fix this bug, or is it just as well to abandon the change and revert back to what worked? Only you can decide that ...

Thank you all for your suggestions.
I found a "fix" : running the startup program in a terminal ('#lxterminal -e url/to/program &' in autostart of lxsession) instead of background seems to fix it SOMEHOW. There is no GUI though ... it is a service.
The multithreaded logic isnt at fault here, not my first shot, and I really want to keep this feature (#Mike Robinson).
I will reconsider the use of sudo as suggested as well, which seems sketchy all things considered. It might get it running in background. thanks # datenwolf.

How does gdb attach to multi-threaded process?

I will try to be as specific as I can, but so far I have worded this problem so poorly that Google failed to return any useful results (hence my question here).
I am attaching gdb to a multi-threaded c++ server process. All I can say is that strange things have been happening while trying to do the usual set-breakpoint-break-investigate.
First, while waiting for the breakpoint to be hit (in 'Continuing' mode), I suddenly got back the (gdb) prompt with the message:
Continuing.
[Thread 0x54d5b940 (LWP 28503) exited]
[New Thread 0x54d5b940 (LWP 28726)]
Cannot get thread event message: debugger service failed
Second, also while waiting for the breakpoint to be hit, I'm suddenly told the program has received SIGSEGV and - back to the (gdb) prompt - backtrace tells me the segfault happened in pthread_cancel(). Note the process under investigation does not normally segfault.
I clearly lack enough information about how gdb works to even begin guessing what is happening. Am I doing anything wrong? The steps I take are the same each time:
gdb attach
break 'MyFunction()'
continue
Thoughts? Thanks.

I fought with similar gdb issues for a while. My case was having lots of threads spawned that executed few functions and then exited.
It appears if a thread exits too fast and there's lots of these happening sometimes gdb cannot keep up and when it fails, it fails with style as in crashes :) I think it tries to attach to a thread that is already done as per the error message.
I see this as an issue in gdb 6.5 to 7.6 and still happening. Did not try with older versions.
My advice is look for this use case or similar. Once I changed my design to have a thread serving a queue of requests gdb works flawlessly.
Design wise is healthier to have already created threads that digest actions than always spawning new threads.
Still same code debugs without a problem on Visual Studio so I do have to say that is a small disappointment to me with regards to gdb.
I use Eclipse and looking at the GDB traces (usually enabled by default) will give you a better hint of where GDB fails. One of the buttons on the console shows you the GDB trace.

command to suspend a thread with GDB

I'm a little new to GDB. I'm hoping someone can help me with something that should be quite simple, I've used Google/docs but I'm just missing something.
What is the 'normal' way folks debug threaded apps with GDB? I'm using pthreads. I'm wanting to watch only one thread - the two options I see are
a) tell the debugger somehow to attach to a particular thread, such that stepping wont result in jumping threads on each context switch
b) tell the debugger to suspend/free any 'uninteresting' threads
I'd prefer to go route b) - reading the help for GDB I dont see a command for this, tips?

See documentation for set scheduler-locking on.
Beware: if you suspend other threads, and if one of them holds a lock, and if your interesting thread needs that lock at some point while stepping, you'll deadlock.
What is the 'normal' way folks debug threaded apps
You can never debug thread correctness, you can only design it in. In my experience, most of debugging of threaded apps is putting in assertions, and examining state of the world when one of the assertions is violated.

First, you need to enable comfortable for multi-threading debugger behavior with the following commands. No idea why it's disabled by default.
set target-async 1
set non-stop on
I personally put those commands into .gdbinit file. They make your every command to be applied only to the currently focused thread. Note: the thread might be running, so you have to pause it.
To see the focused thread execute the thread.
To switch to another thread append the number of the thread, e.g. thread 2.
To see all threads with their numbers issue info thread.
To apply a command to a particular thread issue something like thread apply threadnum command. E.g. thread apply 4 bt will apply backtrace command to a thread number 4. thread apply all continue continues all paused threads.
There is a small problem though — many commands needs the thread to be paused. I know a few ways of doing that:
interrupt command: interrupts the thread execution, accepts a number of a thread to pause, without an argument breaks the focused one.
Setting a breakpoint somewhere. Note that you may set a breakpoint to a particular thread, so that other threads will ignore it, like break linenum thread threadnum. E.g. break 25 thread 4.
You may also find very useful that you can set a list of commands to be executed when a breakpoint hit through the command commands — so e.g. you may quickly print interesting values, then continue execution.

Help with a cryptic error message with KGDB - Bogus trace status reply from target: E22

I'm using gdb to connect to a 2.6.31.13 linux kernel patched with KGDB over Ethernet, and when I try to detach the debugger I get this:
(gdb) quit
A debugging session is active.
Inferior 1 [Remote target] will be killed.
Quit anyway? (y or n) y
Bogus trace status reply from target: E22
after that the session is still open, I can keep going on and on with ctrl+d, and the debugger doesn't exit.
I've searched for that message in google and there are just 5 results (and none of them are useful :-/ ).
Any idea of what could it be and how to fix it?

If you cleared all breakpoints on the target and "C"ontinued from the latest breakpoint (assuming that the target code didn't crash, etc.), I think you'll be safe: when running, kgdb won't be talking to your gdb anyway, since if I recall, it only handles the link when stopped (in a breakpoint or exception) awaiting for commands.
A few Ctrl-C in a fast sequence if needed to get control back in gdb, then "q", and that's it.
That's at least my experience when debugging ko's...
I suspect gdb is saying this because it doesn't realize that it is talking to a kgdb rather than to a remote gdb server. I don't imagine kgdb accepting to kill a kernel thread because the debugger was exited, anyway!
Hmmm, afterthought:
You're talking about kgdb 'lite', the one now part of the kernel tree, are you? Because that's the only one I have experience with...
PS on June, 3:
I had never seen the exact message you mentioned until I moved to the 2.6.32 kernel (as part of the migration of my dev and target machines to Lucid). And then, surprise, I ran into it too. Here, it seems to happen in hopeless situations (like a segfault or kgbd seemingly running away after missing a breakpoint or single step). The only cure I have found so far was to pkill ddd (gdb) on the dev machine and reboot the target. But the good news is that the kgdb in 2.6.32 seems to be quite more stable than the one in Karmic (2.6.31).

ctrl + z should help you quit.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js