Stack corruption, unexpected jumps in backtrace - c++

For the sake of NDA I may not be able to paste any code here for this question.
The software I work upon is in c++. In which we use a lot of STL maps, vectors and other standard c++ features, like inheritance and others.
Recently I have been observing a SIGSEGV occur in the software. It is occurring very consistently.
The backtrace is confusing. There are many threads(related to this software) running in the system. The backtrace starts from a thread say THREAD1. It shows us that it is executing some functions in THREAD1, and it proceeds along and suddenly it jumps to THREAD2(which is running in the system), and it starts executing somewhere in the middle it and not from the beginning of that thread instance. It now takes two or three steps more, and then goes for a sigsegv. The threads THREAD1 and THREAD2 are always the same threads.
I have tried to make sure all the checked in code is proper and got them reviewed by many people.
My questions are as follows,
Are such jumps possible? If yes, what could be the root causes?
Are there any debugging steps I can take to find out what is happening with those threads?

Related

exit fails to set error code

I have a C++ Windows program that fails to set the exit code. The program is very complex and I'm currently unable to reproduce this with a simple test case. I do know that the program calls exit(1) because I have a breakpoint on that line. Immediately after I step over it, the debugger (VS2010) prints The program program.exe has exited with code 0 (0x0). When I run it from the shell, %ERRORLEVEL% is also set to 0.
I use subsystem:console and plain old main (no WinMain).
This only happens on Windows Server 2008 R2, not on my Windows 8.1 laptop. I'm running the same executable on both.
I have tried to use exit, _exit, ExitProcess, and return (the offending call is in main), but none of those seem to have any effect. I also have tried to return other codes, also with no result.
There's a similar question but I cannot reproduce the results described in it. My program does use threads.
How can I approach debugging this issue? I'm rather baffled.
I have tried to use exit, _exit, ExitProcess, and return
You've eliminated all reasonable explanations, particularly with ExitProcess(). There is only one possibility left, you need to try TerminateProcess(). If that still doesn't set the exit code then you need to shove that machine out of a 4th story window.
But with the expectation that it now works. The difference between ExitProcess() and TerminateProcess() is that the former ensures that all DLLs are notified by the termination. Their DllMain() function gets called with fdwReason = DLL_PROCESS_DETACH. Which gives a DLL the opportunity to do something icky like calling Exit/TerminateProcess() itself, thus screwing up the exit code.
Finding such a DLL can be difficult if you don't have all the source code. Could be an injected one as well, there are entirely too many around these days. Best thing to do is to set a breakpoint on system call so you can catch it in the act, you probably want to do this regardless.
Once you step into main(), use Debug > New Breakpoint > Break at Function and enter {,,ntdll.dll}_NtTerminateProcess#8. Press F5 and the debugger now stops just before the program terminates. Look at the Call Stack to find the evil-doer.
Strange symptoms involving exit(), _exit(), ExitProcess(), and others in a multithreaded program - particularly if the symptoms vary between hosts - have a smell of a variable being modified or accessed by different threads, without synchronisation.
Looking at the other thread you linked to, it appears you are using a volatile variable to communicate between threads, but not using any form of synchronisation (for example, code which accesses the value of that variable and code that modifies that value need to cooperate via means of a critical section, mutex, or comparable construct).
That little bit of indirect evidence makes the smell even stronger.
The basic problem I suspect is that declaring a variable as volatile is neither necessary nor sufficient to ensure that variable always has values that will make sense to your program. In particular, it is not sufficient to prevent a thread which is modifying a variable from being preempted when the modification is only partly complete, and for another thread to attempt accessing or modifying the affected variable.
If you look up some articles by Herb Sutter (particularly those concerned with thread synchronisation in his "Guru of the Week" series) you will find detailed explanations of why that is so. Other authors also describe such things, but Sutter's articles are ones that I recall offhand.
The solution is to introduce some means of synchronisation, and for EVERY thread in your program to religiously use it before accessing or modifying variables shared between them. This avoids the various problems (race conditions, operations being preempted partway through) that would cause symptoms like you describe.
Such problems are rarely caught by stepping through with a debugger. The reason for that is that the symptoms are an emergent property. Several unlikely and often independent occurrences, in disparate threads of execution, must occur together. Debuggers do typically change the timing of events in programs, and timing is a critical consideration in the symptoms emerging.
Options include making key variables atomic (so particular operations cannot be preempted), critical sections (where the threads explicitly cooperate within a program), or mutexes (which, depending on definition, allows threads in different programs to explicitly cooperate before accessing shared memory).
Yes, this introduces a bottleneck in your program - a point where every thread must rendezvous and potentially wait for each other. That can affect throughput of your program. Some people advocate using volatile variables to avoid such concerns. More often than not, the result is intermittent symptoms in long running programs like you have described in this question and the "similar question" you linked to.
It doesn't matter whether you use standard means of synchronisation (e.g. introduced in C++11) or windows specific means (WIN API functions). The important thing is that you use a deliberate synchronisation method, rather than just making variables volatile. Different options for synchronisation have different trade-offs, so you will need to make a decision relevant to needs of your program.
Another consideration is to signal all threads so they close cleanly, wait until they are all closed, capture their exit codes, and THEN exit the program. It is often less error prone to do this in the thread running main() - which ultimately starts the process, so is more likely to have access to information it needs to cleanup correctly. If another thread decides the program needs to exit, then better if it communicates that need back to main() to do it.

How do I correctly handle a permanently hung third-party library call in a thread in C++?

I have a device which has an library. Some of its functions are most awesomely ill-behaved, in the "occasionally hang forever" sense.
I have a program which uses this device. If/when it hangs, I need to be able to recover gracefully and reset it. The offending calls should return within milliseconds and are being called in a loop many many times per second.
My first question is: when a thread running the recalcitrant function hangs, what do I do? Even if I litter the thread with interruption points, this happens:
boost::this_thread::interruption_point(); // irrelevant, in the past
deviceLibrary.thatFunction(); // <-- hangs here forever
boost::this_thread::interruption_point(); // never gets here!
The only word I've read on what to do there is to modify the function itself, but that's out of the question for a variety of reasons -- not least of which is "this is already miles outside of my skill set".
I have tried asynchronous launching with C++11 futures:
// this was in a looping thread -- it does not work: wait_for sometimes never returns
std::future<void> future = std::async(std::launch::async,
[this] () { deviceLibrary.thatFunction(*data_ptr); });
if (future.wait_for(std::chrono::seconds(timeout)) == std::future_status::timeout) {
printf("no one will ever read this\n");
deviceLibrary.reset(); // this would work if it ever got here
}
No dice, in that or a number of variations.
I am now trying boost::asio with a thread_group of a number of worker threads running io_service::run(). It works magnificently until the second time it times out. Then I've run out of threads, because each hanging thread eats up one of my thread_group and it never comes back ever.
My latest idea is to call work_threads.create_thread to make a new thread to replace the now-hanging one. So my second question is: if this is a viable way of dealing with this, how should I cope with the slowly amassing group of hung threads? How do I remove them? Is it fine to leave them there?
Incidentally, I should mention that there is in fact a version of deviceLibrary.thatFunction() that has a timeout. It doesn't.
I found this answer but it's C# and Windows specific, and this one which seems relevant. But I'm not so sure about spawning hundreds of extra processes a second (edit: oh right; I could banish all the calls to one or two separate processes. If they communicate well enough and I can share the device between them. Hm...)
Pertinent background information: I'm using MSVC 2013 on Windows 7, but the code has to cross-compile for ARM on Debian with GCC 4.6 also. My level of C++ knowledge is... well... if it seems like I'm missing something obvious, I probably am.
Thanks!
If you want to reliably kill something that's out of your control and may hang, use a separate process.
While process isolation was once considered to be very 'heavy-handed', browsers like Chrome today will implement it on a per-tab basis. Each tab gets a process, the GUI has a process, and if the tab rendering dies it doesn't take down the whole browser.
How can Google Chrome isolate tabs into separate processes while looking like a single application?
Threads are simply not designed for letting a codebase defend itself from ill-behaved libraries. Processes are.
So define the services you need, put that all in one program using your flaky libraries, and use interprocess communication from your main app to speak with the bridge. If the bridge times out or has a problem due to the flakiness, kill it and restart it.
I am only going to answer this part of your text:
when a thread running the recalcitrant function hangs, what do I do?
A thread could invoke inline machine instructions.
These instructions might clear the interrupt flag.
This may cause the code to be non interruptible.
As long as it does not decide to return, you cannot force it to return.
You might be able to force it to die (eg kill the process containing the thread), but you cannot force the code to return.
I hope my answer convinces you that the answer recommending to use a bridge process is in fact what you should do.
The first thing you do is make sure that it's the library that's buggy. Then you create a minimal example that demonstrates the problem (if possible), and send a bug report and the example to the library's developer. Lastly, you cross your fingers and wait.
What you don't do is put your fingers in your ears and say "LALALALALA" while you hide the problem behind layers of crud in an attempt to pretend the problem is gone.

Segfault logic with two threads

I have an application with main thread and additional (detached) process created in it.
In that process we are running network server which sends logs from queue through the network.
The question is: is it possible to do something in segfault handler to wait/finish for sending that log queue. So I want almost 100% delivery of that queue.
While it is possible to write a segfault handler, I highly recommend against it. First off, it's very easy to get your program into a "won't terminate" state due to a segfault in the segfault handler.
Second, as dan3 mentions, the memory of the process is likely in a corrupt state, making it hard to know what will and won't work.
Finally, you lose the opportunity to use the coredump from the process to help track down the problem.
While it's not recommended, it is possible.
My recommendation is to write a small program that avoids memory allocation and the use of pointers as much as possible. Perhaps create buffers as global arrays and only ever access them with limited code that can be reviewed by several skilled developers and tested thoroughly (stress testing is great here). Keep in mind, though, that the message could still get lost by the sender or receiver if they crash, so it may not be worth the effort.
By the way - when Netscape first wrote a version of their browser for Linux, I ran it and it kept getting into a locked-up state. Using the strace program, I quickly found that it was in an infinite segfault loop. Very frustrating, and leading to almost 100% cpu wasted.
You can wait() for a process and pthread_wait() for a thread to finish (you didn't specify clearly which one you use).
Remember that if you are in segfault handler, your memory is messed up (avoid malloc() and free()) and your FILE * could also be borked.

stepping through program with debugger takes a long time

When I debug my program by stepping through it, it sometimes takes a long time for the step to finish. This was not happening in the beginning of the project so most likely it is due to something I have added. Could you give me pointers as to how to remedy this. I did notice one of the problems was due to the main thread trying to paint a widget. My application is multi-threaded (1 background thread and 1 main thread) so I am wondering if it has something to do with that. Your comments are appreciated.
With gdb just set scheduler-locking mode to desired behaviour.
In this case: "The step mode optimizes for single-stepping. It stops other threads from "seizing the prompt" by preempting the current thread while you are stepping. Other threads will only rarely (or never) get a chance to run when you step."
A guess: Is your "background thread" pegged at near 100% CPU utilization?
Between lines of of your main thread, while stepping, the debugger is going to allow the background thread to also "step". If the background thread is pegged it can be running a lot more than a few instructions, causing things to appear unresponsive.
Probably if your second thread is doing that much computation continuously it indicates you've got another problem in your application that you need to fix. If you get that thread under control you will probably see your debugger handling things a lot better.
I asked a very similar question regarding visual studio: VS2010 debugger takes an unreasonable amount of time
No real answer came about. You'll find similar questions for past versions of the IDE here as well.

Can I Always debug multiple instances of a same object that is of type thread with GDB?

program runs fine. When I put a breakpoint a segmentation fault is generated. Is it me or GDB? At run time this never happens and if I instantiate only one object then no problems.
Im using QtCreator on ubuntu x86_64 karmic koala.
UPDATE1:
I have made a small program containing a simplified version of that class. You can download it at:
example program
simply put a breakpoint on the first line of the function called drawChart() and step into to see the segfault happen
UPDATE2: This is another small program but it is practically the same as the mandlebrot example and it is still happening. You can diff it with mandlebrot to see the small difference.
almost the same as mandlebrot example program
To answer your question: Yes, you should be able to debug multiple threads using GDB. This depends on the concurrent design to be sound.
There is a chance you have a race condition on data that your threads access. It is possible that the problem does not show when you run the program normally, but attaching a debugger changes timing and scheduling. Even so, you should be able to use the debugger to break when the segfault happens. Understanding where this happens can inform you about the race condition or corruption, whatever the case may be.
It is worth looking into because even if it doesn't happen under most 'run time' conditions, it may manifest under different system load conditions.
Are you Calling into Qt's drawing code from multiple threads? (particularly widget methods)
http://doc.qt.nokia.com/4.3/threads.html#reentrancy-and-thread-safety
Seems like Qt is like GTK+ and you should only be touching GUI stuff from one thread (in particular the main one)
I'm not familiar enough with Qt to give you advice on how to change your code, but I'd suggest changing it to be event based (ie rendering starts in response to an event, then triggers an event in the main thread when it's done, every thread has it's own mainloop) that way you can probably completely avoid mutexes and synchronization.