I've been working on a WebRTC DataChannel library in C/C++ and wrote a program in C to:
Create two peers from the same process.
Establish a connection between them.
Close the connection if it's successful.
Everything runs fine in a Debian Docker container and on my openSUSE Tumbleweed host (all x86_64, 64-bit), but in an Alpine Linux container (x86_64, 64-bit) I'm getting a SEGFAULT inside the child processes:
The function above is from the program's dependency, libnice. It seems like *agent == NULL, yet there is no way it is made null in the caller's scope. I even inserted a printf("Argument is %p", agent); right before the function call, and it prints a non-null address. From the disassembly, it looks like the instruction that copies the agent pointer (0x557a1d20) into a local variable on the callee's stack is what segfaults. The segfault always occurs at this point, even after a make clean and recompilation. A failure in the activation record? Stack corruption?
UPDATE: I made a more lightweight container and ran it, and now it segfaults at a different place inside that same priv_conn_keepalive_tick_unlocked. The argument seems to be set, though (notice the 0x7ffff7f9ad08):
Since I thought I might be hitting musl's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size, and it looks like it's already 8 MB, not 80k. Increasing this limit further does not seem to make any difference, except that if I assign more than 8 MB my program crashes early inside GDB. GDB says it got an unknown signal "? ?"; outside GDB, it crashes at the point where it normally crashes without the altered stack size.
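(For reference, a minimal sketch of that check. Note that RLIMIT_STACK only governs the main thread's stack; threads created with pthread_create get the C library's default thread-stack size unless a pthread attribute overrides it, which is where musl's small default comes in.)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("main-thread stack limit: unlimited\n");
    else
        printf("main-thread stack limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}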
I'm not sure what exactly the problem is (stack corruption?) or what to do next to resolve it.
Here's my program's flow:
For every peer that is created, a child process is created with fork(). Parent <--> child communication is done over ZeroMQ, and I use Protocol Buffers to forward any callbacks (and their arguments) triggered inside the child onto an event loop running in the parent process.
So for the above program, there are 2 child processes and 1 parent process.
Steps to reproduce:
Source file: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/examples/websocket_client/2in1.c
Alpine docker container: https://github.com/hamon-in/librtcdcpp/blob/alpine-test/Dockerfile.amd64
Run the container; the binary is located at /psl-librtcdcpp/examples/websocket_client/2in1
2in1 will spawn two child processes, both of which will crash.
On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.
The right way to fix this is to reduce the excess stack usage or to explicitly request a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem is to override the default stack size for new threads by doing the following somewhere early in the program:
#define _GNU_SOURCE /* pthread_setattr_default_np is a GNU extension */
#include <pthread.h>

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1<<20); // 1 MB default stack for new threads
pthread_setattr_default_np(&attr);
Add -Werror=implicit-function-declaration to your CFLAGS and you'll immediately find the cause. The key clue is the pointer value 0x557a1d20, which is almost surely the result of truncating a pointer to 32 bits. This happens when you fail to declare a function that returns a pointer: the compiler (by an awful backwards default) assumes it returns int rather than producing an error, then allows the implicit conversion from int to pointer despite the C language disallowing it.
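A minimal two-file sketch of that failure mode (file and function names are hypothetical, chosen to mirror the situation): the caller compiles without a prototype in scope, so the compiler assumes int make_agent(), and the 64-bit pointer is truncated to 32 bits on the way back.

/* lib.c - the function actually returns a 64-bit pointer */
#include <stdlib.h>
void *make_agent(void) { return malloc(16); }

/* main.c - no prototype in scope, so the compiler assumes: int make_agent(); */
#include <stdio.h>
int main(void)
{
    void *agent = make_agent(); /* int result silently converted to pointer */
    printf("Argument is %p\n", agent); /* upper 32 bits lost, e.g. 0x557a1d20 */
    return 0;
}

Compiling main.c with -Werror=implicit-function-declaration turns the silent truncation into a hard error.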
My boss wants me to see if I can write a "one-button memory leak checker" into our Java client for our testers (since then I would have to spend less time running our full manual test suite under a profiler myself; we don't have an automated test suite).
I have found something that is pretty close to what I need: Heap Walker. I had to modify it slightly to make it compile in VS on Windows (I think it was made for GCC on Mac). But when I run it, the JVM crashes.
I'd like to get a native stack trace to see at least where it crashes, and, if I can, find out what happens, but I don't know how. I have built a "debug" DLL, but I still get no stack trace - neither on the console nor in the "hs_err_pidXXXX.log" file generated by the JVM.
I haven't done any C/C++ in about 15 years; I can still more or less guess what the code does, but I've forgotten how debugging goes (beyond "printf everywhere" ...), and I've never had to debug native code in the JVM. So far, Google has been no help; I'm probably using the wrong search terms.
The JVM usually reports a native stack trace in the crash dump. If there is no stack trace in hs_err_pid.log, it means the JVM could not obtain the top frame from the PC register, typically because it pointed to an unreadable address.
For example, this can happen if native code dereferences a null function pointer:
void (*func)() = 0;   /* null function pointer */
func();               /* the call jumps to address 0, so PC becomes 0 */
In this case the PC will be zero, and the JVM won't print the trace. But you can still find the caller's PC on the stack, because the return address is typically pushed onto the stack by the call instruction. Here is how to find it in hs_err_pid.log:
Top of Stack: (sp=0x0000000002e2f438)
0x0000000002e2f438: 00007ffcf03f1030 00007ffcf041d000
^^^^^^^^^^^^^^^^
the return address (the address right after the call instruction)
Then you can find this address in the Dynamic libraries section and calculate the offset from the beginning of the DLL.
Dynamic libraries:
...
0x00007ffcf03f0000 - 0x00007ffcf0426000 C:\Java\Test\crash.dll
^^^^^^^^^^^^^^^^^^
offset = 0x00007ffcf03f1030 - 0x00007ffcf03f0000 = 0x1030
Use a disassembler (e.g. Visual Studio's dumpbin) to map the offset to the particular function / instruction in the code.
You can also attach the Visual Studio debugger to the JVM when it crashes. To do so, run Java with -XX:+ShowMessageBoxOnError. A message box will then invite you to connect the debugger.
Some C++ libraries call the abort() function on error (for example, SDL). No helpful debug information is provided in this case, and it does not seem possible to catch the abort call and write some diagnostic log output. I would like to override this behaviour globally without rewriting/rebuilding these libraries. I would like to throw an exception and handle it. Is that possible?
Note that abort() raises the SIGABRT signal, as if it had called raise(SIGABRT). You can install a signal handler that gets called in this situation, like so:
#include <signal.h>

extern "C" void my_function_to_handle_aborts(int signal_number)
{
    /* Your code goes here. You can output debugging info.
       If you return from this function, and it was called
       because abort() was called, your program will exit or crash anyway
       (with a dialog box on Windows). */
}

/* Do this early in your program's initialization */
signal(SIGABRT, &my_function_to_handle_aborts);
If you can't prevent the abort calls (say, they're due to bugs that creep in despite your best intentions), this might allow you to collect some more debugging information. This is portable ANSI C, so it works on Unix and Windows, and other platforms too, though what you do in the abort handler will often not be portable. Note that this handler is also called when an assert fails, or even by other runtime functions - say, if malloc detects heap corruption - so your program might be in a crazy state when the handler runs. You shouldn't allocate memory; use static buffers if possible. Just do the bare minimum to collect the information you need, get an error message to the user, and quit.
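For instance, a minimal handler along those lines (a sketch assuming a POSIX environment; the message text is arbitrary) that sticks to async-signal-safe calls and static data:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void my_abort_handler(int signal_number)
{
    (void)signal_number;
    /* Only async-signal-safe calls and static data: no malloc, no printf. */
    static const char msg[] = "abort() called - collecting minimal diagnostics\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE); /* the process state is suspect, so quit immediately */
}

int main(void)
{
    signal(SIGABRT, my_abort_handler); /* install early in initialization */
    abort();                           /* demonstration: the handler runs first */
}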
Certain platforms may allow their abort functions to be customized further. For example, on Windows, Visual C++ has a function _set_abort_behavior that lets you choose whether or not a message is displayed to the user, and whether crash dumps are collected.
According to the man page on Linux, abort() generates a SIGABRT to the process that can be caught by a signal handler. EDIT: Ben's confirmed this is possible on Windows too - see his comment below.
You could try writing your own abort and getting the linker to call yours in place of std::abort. I'm not sure whether that is possible, however.
I wrote my own reference-counted memory manager in C++ (for fun), and I'm sure it isn't perfect ;). Now, when I try to use it, I get random SIGTRAP signals. If I comment out every line connected with the memory manager, everything runs fine. Getting SIGTRAPs instead of SIGSEGVs is quite strange.
I know that SIGTRAPs are raised when the program hits a breakpoint, but no breakpoint is set. I read in another thread that the debug builds of the EXEs and DLLs must be up to date. Mine are up to date, so that is not the reason.
Does anyone know why is this happening?
After searching on Google I realized that those SIGTRAPs are the same as the warnings you get in MSVC++ saying "Windows has triggered a breakpoint in xxxx.exe. This may be due to a corruption of the heap, and indicates a bug blahblahblah"...
So it seems that yes, unexpected SIGTRAPs can indicate memory corruption (quite strange...).
And I found my bug too. The MM is in a static library which is linked into a DLL, and both that static lib and the DLL are linked into my EXE. So there were two memory managers: one in my EXE and one in my DLL. When I called the initialization method of the MM, it initialized the MM in my EXE but not the one in the DLL, so the DLL's copy went uninitialized. I solved this by not linking my EXE against that static library.
I'd throw in a guess that you might be calling mismatched new/delete or malloc/free implementations - so something was allocated by your memory manager, but when the memory is released you end up in the default delete/free implementation.
Set a breakpoint on the signal and see whether there is a free() or operator delete on the stack, and whether it is the implementation of that function you would expect.
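To illustrate the kind of mismatch meant here, a toy sketch (all names hypothetical): memory handed out by a custom manager must be released through that same manager, never through the default free()/delete.

#include <cstdio>
#include <cstdlib>

// Toy stand-in for a custom memory manager: hands out chunks of a static pool.
static char pool[1024];
static std::size_t pool_used = 0;

void* mm_alloc(std::size_t n)
{
    void* p = pool + pool_used; // bump-pointer allocation from the pool
    pool_used += n;
    return p;
}

int main()
{
    void* p = mm_alloc(64);
    std::printf("pool allocation at %p\n", p);
    // free(p); // WRONG: p never came from malloc(); the debug CRT detects the
    //          // bogus heap block and raises the breakpoint ("Windows has
    //          // triggered a breakpoint...") that shows up as a SIGTRAP.
    return 0;
}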
Possible Duplicate:
Common reasons for bugs in release version not present in debug mode
Sometimes I encounter a strange situation where the program runs incorrectly when run normally (it pops up the termination dialog), but runs correctly while debugging. This makes me frustrated when I want to use the debugger to find the bug in my code.
Have you ever run into this kind of situation, and why does it happen?
Update:
To show that there are logical reasons that can lead to such a frustrating situation:
I think one big possibility is a heap access violation. I once wrote a function that allocated a small buffer, and later I stepped past its boundary. It ran correctly within gdb, cdb, etc. (I do not know why, but it did run correctly), yet terminated abnormally when run normally.
I am using C++.
I do not think my problem duplicates the one above. That one is a comparison between release mode and debug mode, but mine is between debugging and not debugging, for which there is a word - heisenbug - as many others have noted.
Thanks.
You have a heisenbug.
The debugger might be initializing values
Some environments initialize variables and/or memory to known values like zero in debug builds but not release builds.
Release might be built with optimizations
Modern compilers are good, but it could hypothetically happen that optimized code behaves differently from non-optimized code. Edit: These days, compiler bugs are rare. If you find yourself thinking you have one, exhaust all other ideas first.
There can be other reasons for heisenbugs.
Here's a common gotcha that can lead to a Heisenbug (love that name!):
// Sanity check - this should never fail
ASSERT( ReleaseResources() == SUCCESS);
In a debug build, this will work as expected, but the ASSERT macro's argument is ignored in a release build. By ignored, I mean that not only won't the result be reported, but the expression won't be evaluated at all (i.e. ReleaseResources() won't be called).
This is a common mistake, and it's why the Windows SDK defines a VERIFY() macro in addition to the ASSERT() macro. They both generate an assertion dialog at runtime in a debug build if the argument evaluates to false. Their behavior is different for a release build, however. Here's the difference:
ASSERT( foo() == true ); // Confirm that call to foo() was successful
VERIFY( bar() == true ); // Confirm that call to bar() was successful
In a debug build, the above two macros behave identically. In a release build, however, they are essentially equivalent to:
; // Confirm that call to foo() was successful
bar(); // Confirm that call to bar() was successful
By the way, if your environment defines an ASSERT() macro, but not a VERIFY() macro, you can easily define your own:
#ifdef _DEBUG
// DEBUG build: Define VERIFY simply as ASSERT
# define VERIFY(expr) ASSERT(expr)
#else
// RELEASE build: Define VERIFY as the expression, without any checking
# define VERIFY(expr) ((void)(expr))
#endif
Hope that helps.
Apparently stackoverflow won't let me post a response which contains only a single word :)
VALGRIND
When using a debugger, sometimes memory gets initialized (e.g. zeroed), whereas without a debugging session memory can be random. This could explain the behavior you are seeing.
You have dialogs, so there may be threads in your application. If there are threads, there is a possibility of race conditions.
Say your main thread initializes a structure that another thread uses. When you run your program inside the debugger, the initializing thread may be scheduled before the other thread, while in your real-life situation the thread that uses the structure is scheduled before the thread that actually initializes it.
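A minimal sketch of that pattern (the structure is hypothetical): nothing orders the consumer's read after the initializer's write, so which side wins depends on scheduling - exactly the kind of timing a debugger perturbs.

#include <cstdio>
#include <thread>

struct Config { int value = 0; bool ready = false; };
static Config g_cfg; // shared structure, intentionally unsynchronized

int main()
{
    // Consumer thread: may run before or after main() finishes initializing.
    std::thread user([] {
        if (g_cfg.ready)                      // data race: no mutex, no atomics
            std::printf("%d\n", g_cfg.value);
        else
            std::printf("not initialized yet\n");
    });
    g_cfg.value = 42;  // this initialization races with the read above
    g_cfg.ready = true;
    user.join();
    return 0;
}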
In addition to what JeffH said, you have to consider whether the deployment computer (or server) has the same environment/libraries/whatever else the program depends on.
Sometimes it's very difficult to debug correctly if the conditions you debug under differ from the conditions the program normally runs under.
Giovanni
Also, debuggers might add padding around allocated memory, changing the behaviour. This has caught me out a number of times, so you need to be aware of it: getting the same memory behaviour when debugging is important.
For MSVC, this can be disabled with the environment variable _NO_DEBUG_HEAP=1. (The debug heap is slow, so this also helps if your debug runs are hideously slow.)
Another method to get the same effect is to start the process outside the debugger, so you get a normal startup, then wait at the first line in main and attach the debugger to the process. That should work on "any" system, provided that you don't crash before main. (If you do, you could wait in the constructor of a statically constructed pre-main object instead...)
I've no experience with gcc/gdb in this matter, but things might be similar there... (Comments welcome.)
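A sketch of that wait-at-main trick (the environment variable is a hypothetical convention), which works the same whichever debugger you attach:

#include <cstdio>
#include <cstdlib>

int main()
{
    // Opt in via an environment variable so normal runs start exactly as in
    // production; when set, the process idles here until you attach a
    // debugger and press Enter.
    if (std::getenv("WAIT_FOR_DEBUGGER") != nullptr) {
        std::printf("waiting: attach the debugger, then press Enter\n");
        std::getchar();
    }
    // ... the rest of the program now runs under the attached debugger ...
    return 0;
}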
One real-world example of a heisenbug, from Raymond Zhang.
/*--------------------------------------------------------------
GdPage.cpp : a real example to illustrate Heisenberg Effect
related with guard page by Raymond Zhang, Oct. 2008
--------------------------------------------------------------*/
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    LPVOID lpvAddr; // address of the test memory

    lpvAddr = VirtualAlloc(NULL, 0x4096,
                           MEM_RESERVE | MEM_COMMIT,
                           PAGE_READONLY | PAGE_GUARD);

    if (lpvAddr == NULL)
    {
        printf("VirtualAlloc failed with %ld\n", GetLastError());
        return -1;
    }

    return *(long *)lpvAddr;
}
The program terminates abnormally whether compiled as Debug or Release, because specifying the PAGE_GUARD flag causes the following:

Pages in the region become guard pages. Any attempt to read from or write to a guard page causes the system to raise a STATUS_GUARD_PAGE exception and turn off the guard page status. Guard pages thus act as a one-shot access alarm.

So you get STATUS_GUARD_PAGE when trying to access *lpvAddr. But if you load the program in a debugger and watch *lpvAddr, or step through the last statement return *(long *)lpvAddr instruction by instruction, the debugger has to read the page to display the value of *lpvAddr, and that read trips the one-shot guard. So the debugger will have cleared the guard-page alarm for us before our own code accesses *lpvAddr.
Which programming language are you using? Certain languages, such as C++, behave differently between debug and release builds. In the case of C++, when you declare a variable such as int i; without initializing it, some debug runtimes fill it with a known value (zero or a recognizable pattern), while in release builds it may take any value (whatever was stored at its memory location before).
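A sketch of the kind of code this bites (reading an uninitialized variable is undefined behaviour either way, which is precisely why the two builds can disagree):

#include <cstdio>

int main()
{
    int i; // deliberately left uninitialized
    // In a debug build the stack slot may hold zero or a known fill pattern;
    // in a release build it holds whatever garbage was there before, so the
    // branch taken can differ between the two builds.
    if (i == 0)
        std::printf("i happened to be zero\n");
    else
        std::printf("i held garbage: %d\n", i);
    return 0;
}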
One big reason is that debug builds may define the _DEBUG macro, which one can use in the code to add extra work in debug builds.
For multithreaded code, optimization may affect ordering, which may influence race conditions.
I do not know whether debug builds add markers on the stack to delimit stack frames. Any extra space on the stack may hide the effects of buffer overruns.
Try using the same compiler options as your release build and just add -g (or the equivalent debug flag); gcc allows the debug option together with the optimization options.
If your logic depends on data from the system clock, you could see serious probe effects. If you break into the debugger, you will obviously affect the values returned from clock functions such as timeGetTime(). The same is true if your program takes longer to execute. As other people have said, debug builds insert NOPs. Also, simply running under the debugger (without hitting breakpoints) might slow things down.
An example of where this might happen is a real-time physics simulation with a variable time step based on elapsed system time. This is why there are articles like this:
http://gafferongames.com/game-physics/fix-your-timestep/
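For illustration, a minimal sketch of the fixed-timestep pattern that article advocates (update/render are hypothetical stubs): the simulation always advances in constant dt steps, so debugger-induced variations in frame time change how many steps run, not what each step computes.

#include <chrono>
#include <cstdio>

static double g_position = 0.0;
static void update(double dt) { g_position += 10.0 * dt; } // physics step
static void render()          { std::printf("pos=%f\n", g_position); }

int main()
{
    using clock_type = std::chrono::steady_clock;
    const double dt = 1.0 / 60.0; // fixed simulation step
    double accumulator = 0.0;
    auto prev = clock_type::now();
    for (int frame = 0; frame < 300; ++frame) { // bounded loop for the demo
        auto now = clock_type::now();
        accumulator += std::chrono::duration<double>(now - prev).count();
        prev = now;
        while (accumulator >= dt) { // consume elapsed wall time in fixed steps
            update(dt);
            accumulator -= dt;
        }
        render();
    }
    return 0;
}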