I have a C++ program of around 5k lines that hangs randomly on Linux. The code transmits and receives packets through a RAW socket. It just stops at a random point without any response - not even Ctrl+C helps; every time it hangs I have to kill the process.
I tried GDB and the result was the same: it hung, and Ctrl+C produced a SIGTERM error message.
Under Valgrind the code hung in the same way.
How can I debug this issue? Is it some kind of system error?
Using strace, it became clear that the hang was a wait on a private futex (FUTEX_WAIT_PRIVATE): the socket read had ended up in a deadlock. Increasing the select() timeout value resolved the issue.
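For illustration, here is a minimal sketch of the kind of select()-with-timeout read the answer refers to (the descriptor name and the timeout value are assumptions, not the original code):

#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>

// Wait on a raw socket with a generous select() timeout instead of blocking
// indefinitely. 'sock_fd' is assumed to be an already-opened raw socket.
bool wait_for_packet(int sock_fd)
{
    fd_set read_fds;
    FD_ZERO(&read_fds);
    FD_SET(sock_fd, &read_fds);

    struct timeval timeout;
    timeout.tv_sec  = 5;   // larger timeout than before; tune to your traffic
    timeout.tv_usec = 0;

    int ready = select(sock_fd + 1, &read_fds, nullptr, nullptr, &timeout);
    if (ready < 0) {
        perror("select");
        return false;
    }
    if (ready == 0)
        return false;      // timed out: no data, but the thread is not stuck
    return FD_ISSET(sock_fd, &read_fds);
}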
Related
I'm trying to debug a custom thread pool implementation that deadlocks only rarely. So I cannot simply use a debugger like GDB, because I would have to click "launch debugger" something like 100 times before hitting a deadlock.
Currently I'm running the thread pool test in an infinite loop in a shell script, but that means I cannot inspect variables and so on. I've tried printing data with std::cout, but that slows down the threads and reduces the chance of a deadlock, meaning I can wait something like an hour in my infinite loop before getting messages. Then I don't get the error, and I need more messages, which means waiting another hour...
How can I efficiently debug the program so that it restarts over and over until it deadlocks? (Or maybe should I open another question with all the code for some help?)
Thank you in advance!
Bonus question: how do you check that everything goes fine with a std::condition_variable? You cannot really tell which threads are asleep or whether a race condition occurs on the wait condition.
There are two basic approaches:
Automate running the program under the debugger. gdb program -ex 'run <args>' -ex 'quit' will run the program under the debugger and then quit. If the program is still alive in one form or another (it segfaulted, or you broke into it manually), you will be asked for confirmation before quitting.
Attach the debugger after reproducing the deadlock. For example, GDB can be run as gdb <program> <pid> to attach to a running program - just wait for the deadlock and then attach. This is especially useful when the attached debugger changes the timing so that you can no longer reproduce the bug.
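For example, to attach to an already-hung instance of the test program (the binary name thread_test is just a placeholder; pidof is a standard Linux utility):
gdb ./thread_test "$(pidof thread_test)"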
Either way you can just run it in a loop and wait for the result while you drink coffee. BTW - I find the second option easier.
If this is some kind of homework, restarting again and again with more debug output is a reasonable approach.
If somebody pays money for every hour you wait, they might prefer to invest in software that supports replay-based debugging, that is, software that records everything a program does, every instruction, and allows you to replay it again and again, debugging back and forth. Instead of adding more debug output, you record a session during which a deadlock happens, and then start debugging just before the deadlock happened. You can step back and forth as often as you want until you have finally found the culprit.
The software mentioned in the link actually supports Linux and multithreading.
Mozilla rr: open-source replay-based debugging
https://github.com/mozilla/rr
Hans mentioned replay based debugging, but there is a specific open source implementation that is worth mentioning: Mozilla rr.
First you do a record run; then you can replay that exact run as many times as you want and observe it in GDB. It preserves everything, including input/output and thread ordering.
The official website mentions:
rr's original motivation was to make debugging of intermittent failures easier
Furthermore, rr enables GDB reverse debugging commands such as reverse-next to go to the previous line, which makes it much easier to find the root cause of the problem.
Here is a minimal example of rr in action: How to go to the previous line in GDB?
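Basic usage looks like this (rr record and rr replay are rr's actual commands; the binary name is a placeholder):
rr record ./thread_test    # run the program once, recording the execution
rr replay                  # replay the most recent recording under GDB
Inside the replay session you get a normal GDB prompt where reverse-next and reverse-continue work.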
You can run your test case under GDB in a loop using the command shown in https://stackoverflow.com/a/8657833/341065: gdb --eval-command=run --eval-command=quit --args ./a.out.
I have used this myself: (while gdb --eval-command=run --eval-command=quit --args ./thread_testU ; do echo . ; done).
Once it deadlocks and does not exit, you can just interrupt it by CTRL+C to enter into the debugger.
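Once you are at the GDB prompt, two standard GDB commands are usually enough to see where every thread is stuck:
(gdb) info threads
(gdb) thread apply all bt
The second command prints a backtrace for each thread, which typically shows directly which threads are blocked on which locks.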
An easy, quick way to find deadlocks is to have some global variables that you update at the points you want to trace, and then print them from a signal handler. You can use SIGINT (sent when you interrupt with Ctrl+C) or SIGTERM (sent when you kill the program):
#include <csignal>
#include <cstdlib>
#include <iostream>

int dbg;   // global debug variable, updated from the worker code

void dbg_sighandler(int)
{
    // Not async-signal-safe, but good enough as a quick debugging hack.
    std::cout << dbg << std::endl;
    std::exit(EXIT_FAILURE);
}

int multithreaded_function()
{
    signal(SIGINT, dbg_sighandler);
    ...
    dbg = someVar;
    ...
}
That way you see the state of all your debug variables when you interrupt the program with Ctrl+C.
In addition you can run it in a shell while loop:
$> while [ $? -eq 0 ]
do
./my_program
done
which will run your program forever until it fails ($? is the exit status of your program, and you exit with EXIT_FAILURE from your signal handler).
It worked well for me, especially for finding out how many threads had passed before and after particular locks.
It is quite rustic, but you do not need any extra tools and it is fast to implement.
A very odd one: I have a Qt 4 embedded app that runs on the framebuffer; it normally runs from inittab as the only UI on the box. There is an option to put the machine to sleep - I do the normal thing and open /sys/power/state, write "mem" and close it (using QFile). Very straightforward, and it works fine EXCEPT the first time the app runs after booting. If it sleeps then, it receives SIGUSR2 and just hangs forever with a blank screen. The hang occurs after resume.
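For reference, the suspend code is essentially the usual couple of lines (this is a reconstruction of what the question describes, not the original source):

#include <QFile>

QFile f("/sys/power/state");
if (f.open(QIODevice::WriteOnly)) {
    f.write("mem");   // request suspend-to-RAM
    f.close();
}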
But if I manually kill it and run it again, sleep works fine. Note that it must make the failed sleep attempt and be killed - after that everything is peachy every time it runs, and SIGUSR2 never shows up again.
I have already tried to trap the signal - it doesn't trap. No idea why - except that I know that pthreads uses SIGUSR2..
Stumped. Ideas? Clues?
[edit] I found that if I fork(), write to /sys/power/state in the child, and then exit the child, it sort of solves the problem, but it doesn't solve the mystery..
[edit 2] I subsequently found that the child is in fact still hanging when the machine is shut down (causing it to hang forever without shutting down..). Although the ugly hack just mentioned did fix the hang coming out of suspend, I never figured out what was happening. I finally solved it with a script/daemon: in a while loop it checks a file in /tmp for an action, either halts or suspends, and restarts the app afterwards.. not pretty, but it works.
And still the mystery of the SIGUSR2 hang remains ..
I have a C++ application that accepts TCP connections from client applications.
After a seemingly random time of running fine (days), it stops receiving follow-up messages from the clients and only sees the first message on each TCP connection. After a restart all is fine again.
The trouble is, this only happens on the production server, where I have to restart it as soon as it gets stuck, and I have been unable to reproduce it on a lab machine. None of the socket operations seems to return an error that I would see in my log file, and the application is huge, so I can't just post the relevant part here.
First messages keep coming through all the time; only subsequent messages aren't received after a while. Even when my application stops receiving the follow-up messages, I can see them coming in with Wireshark.
Any ideas how I might find out what is happening? What should I be looking for?
Are any config settings used here? In the past I put a condition on a server accept loop to ignore messages after 50,000 had been processed. This was to prevent run-away situations in development. On one occasion this code went live without the config setting being changed to 'allow infinite messages'. The result was exactly what you describe: OK for 2-3 days, then messages were sent fine but just ignored, with no errors anywhere.
This may not be the case here, but I mention it as an example of where you may have to look.
What is the technique to log the segmentation faults and run time errors which crash the program, through a remote logging library?
The language is C++.
Here is the solution for printing a backtrace when you get a segfault, as an example of what you can do when such an error happens.
That leaves you with the problem of logging the error to the remote library. I would suggest keeping the signal handler as simple as possible and logging to a local file, because you cannot assume that a previously initialized logging library still works correctly once a segmentation fault has occurred.
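As a minimal sketch of that idea (the output file name is an assumption; note that backtrace() is not strictly async-signal-safe, so this is a best-effort hack rather than a guaranteed mechanism):

#include <execinfo.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <cstdlib>

static void segv_handler(int)
{
    void *frames[64];
    int n = backtrace(frames, 64);

    // Write the raw backtrace to a local file; shipping it to the remote
    // logger is left to a separate process that picks the file up later.
    int fd = open("/tmp/crash_backtrace.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        backtrace_symbols_fd(frames, n, fd);
        close(fd);
    }
    _exit(EXIT_FAILURE);
}

int main()
{
    struct sigaction sa = {};
    sa.sa_handler = segv_handler;
    sigaction(SIGSEGV, &sa, nullptr);
    // ... rest of the program ...
}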
What is the technique to log the segmentation faults and run time errors which crash the program, through a remote logging library?
From my experience, trying to log debugging messages (remotely or into a file) while the program is crashing may not be very reliable, especially if the app takes the system down along with it:
With a TCP connection you might lose the last several messages while the system is crashing. (TCP maintains packet order and retransmits lost data, AFAIK, but if the app just quits, some data can be lost before it is ever transmitted.)
With a UDP connection you might lose messages because of the nature of UDP, or receive them out of order.
If you're writing into a file, the OS might discard the most recent changes (buffers not flushed, a journaled filesystem reverting to an earlier state of the file).
Flushing buffers after every write, or sending messages via TCP/UDP, can incur performance penalties for a program that produces thousands of messages per second.
So as far as I know, the good approach is to maintain an in-memory plaintext log and write a core dump once the program has crashed. This way you'll be able to find the contents of the log inside the core dump. Writing into an in-memory log is also significantly faster than writing into a file or sending messages over the network. Alternatively, you could use some kind of "dual logging": write every debug message immediately into the in-memory log, and then send it asynchronously (from another thread) to a log file or over the network.
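A minimal sketch of such an in-memory log (the names and sizes are made up for illustration; the point is that the buffer is a plain global array, so it is easy to locate in a core dump):

#include <cstdarg>
#include <cstdio>

// Fixed-size in-memory ring buffer of log lines. Being an ordinary global,
// it ends up in the core dump and can be inspected there.
// Note: not thread-safe as written; protect or make the index atomic
// in a multithreaded program.
static char g_log[1024][128];
static unsigned g_log_pos = 0;

void mem_log(const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    vsnprintf(g_log[g_log_pos % 1024], sizeof g_log[0], fmt, args);
    va_end(args);
    ++g_log_pos;
}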
Handling of exceptions:
Platform-specific. On Windows you can use _set_se_translator to translate platform (structured) exceptions into C++ exceptions, or install an unhandled-exception filter to generate a backtrace.
On Linux I think you should be able to install a handler for the SIGSEGV signal.
While catching the segfault sounds like a decent idea, instead of trying to handle it from within the program it makes more sense to generate a core dump and bail out. On Windows you can use MiniDumpWriteDump from within the program, and on Linux the system can be configured to produce core dumps from the shell (ulimit -c, I think?).
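On the Linux side, enabling core dumps typically looks like this (the core_pattern path is just an example and writing it requires root):

# allow core files of unlimited size for processes started from this shell
ulimit -c unlimited

# optional: control where core files are written and how they are named
echo '/var/crash/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern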
I'd like to offer some solutions:
Use core dumps, and start a daemon that monitors and collects them and sends them to your host.
Use GDB (with gdbserver): you can debug remotely and see the backtrace when it crashes.
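A typical remote-debugging setup with gdbserver looks roughly like this (host name, port and binary name are placeholders):

# on the target machine
gdbserver :2345 ./my_daemon

# on the development machine
gdb ./my_daemon
(gdb) target remote target-host:2345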
To catch the segfault signal and send a log accordingly, read this post:
Is there a point to trapping "segfault"?
If it turns out that you won't be able to send the log from a signal handler (maybe the crash occurred before the logger was initialized), then you may need to write the info to a file and have an external entity send it remotely.
EDIT: Putting back some original info to be able to send the core file remotely too
To be able to send the core file remotely, you'll need an external entity (a different process than the one that crashed) that "waits" for core files and sends them remotely as they appear (possibly using scp). Additionally, the crashing process could catch the segfault signal and notify the monitoring process that a crash has occurred and a core file will be available soon.
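A very simple polling version of such a monitor could look like this (directory, host and naming are placeholders; a real setup might use inotify instead of polling):

#!/bin/sh
# watch for new core files and ship them to a collection host
while true; do
    for f in /var/crash/core.*; do
        [ -e "$f" ] || continue
        scp "$f" loghost:/srv/crashdumps/ && rm "$f"
    done
    sleep 10
done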
I'm writing a Linux server daemon. I want to know what the convention is in the UNIX/Linux community for what a daemon should do when it encounters a fatal error (e.g. failing to listen on its socket, a segmentation fault, etc.). I have already done the whole thing with the system log, but I want to know what to do about a fatal error. Should I log it and keep running in an infinite do-nothing loop? Should I log it and exit? What is the standard thing to do here, and how do I do it?
The daemon is written in C++, and I'm using a custom exception system that wraps POSIX error codes, so I'll know when things are fatal.
There are degrees of 'fatal error'.
A server failing to listen is possibly a temporary issue; your daemon should probably keep trying to set up its listener, retrying periodically and backing off slowly (1 second, 2 seconds, 4 seconds, etc.).
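A minimal sketch of that retry-with-backoff idea (try_listen() is a placeholder for whatever sets up the listening socket, and the cap is arbitrary):

#include <unistd.h>
#include <algorithm>

// Placeholder for the real code that creates, binds and listens on the socket.
bool try_listen();

void listen_with_backoff()
{
    unsigned delay = 1;                     // seconds
    while (!try_listen()) {
        sleep(delay);
        delay = std::min(delay * 2, 64u);   // 1, 2, 4, ... capped at 64 s
    }
}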
If you catch a segfault, maybe the best thing is for the daemon to try to restart itself by re-executing. The fault might recur, of course.
You shouldn't go into an infinite do-nothing loop; you should terminate rather than do that. If your loop isn't infinite but can be broken by a signal or something, maybe doing nothing is OK; I recommend the pause() system call as a way to do nothing without consuming CPU time.
You should certainly log what you're doing, and why, before you exit.