How to debug confusingly big code?

I am not a c++ programmer but try to debug some complex code. Not the best preconditions, I know...
So I have an openfoam solver which uses (includes) lots of code and I am struggling to really find the error. I compile with
SOURCE=mySolver.C ; g++ -m64 -Dlinux64 -DWM_DP -Wall -Wextra -Wno-unused-parameter -Wold-style-cast -O3 -DNoRepository -ftemplate-depth-100 -I/opt/software/openfoam/OpenFOAM-2.0.5/src/dynamicMesh/lnInclude {more linking} -I. -fPIC -c $SOURCE -o Make/linux64Gcc46DPOpt/mySolver.o
and after running the solver with the appropriate options, it crashes at the end after (or while) my return statement:
BEFORE return 0
*** glibc detected *** /opt/software/openfoam/myLibs/applications/bin/linux64Gcc46DPOpt/mySolver: double free or corruption (!prev): 0x000000000d3b7c30 ***
======= Backtrace: =========
/lib64/libc.so.6[0x31c307230f]
/lib64/libc.so.6(cfree+0x4b)[0x31c307276b]
/opt/software/openfoam/ThirdParty-2.0.5/platforms/linux64/gcc-4.5.3/lib64/libstdc++.so.6(_ZNSsD1Ev+0x39)[0x2b34781ffff9]
/opt/software/openfoam/myLibs/applications/bin/linux64Gcc46DPOpt/mySolver(_ZN4Foam6stringD1Ev+0x18)[0x441e2e]
/opt/software/openfoam/myLibs/applications/bin/linux64Gcc46DPOpt/mySolver(_ZN4Foam4wordD2Ev+0x18)[0x442216]
/lib64/libc.so.6(__cxa_finalize+0x8e)[0x31c303368e]
/opt/software/openfoam/myLibs/lib/linux64Gcc46DPOpt/libTMP.so[0x2b347a17f866]
======= Memory map: ========
...
My solver looks like (sorry, I can't post all parts):
#include "stuff1.H"
#include "stuff2.H"
int main(int argc, char *argv[])
{
#include "stuff3.H"
#include "stuffn.H"
while (runTime.run())
{
...
}
Info<< "BEFORE return 0\n" << endl;
return(0);
}
Running the solver in gdb after set environment MALLOC_CHECK_ 2 yields:
BEFORE return 0
Program received signal SIGABRT, Aborted.
0x00000031c3030265 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00000031c3030265 in raise () from /lib64/libc.so.6
#1 0x00000031c3031d10 in abort () from /lib64/libc.so.6
#2 0x00000031c3075ebc in free_check () from /lib64/libc.so.6
#3 0x00000031c30727f1 in free () from /lib64/libc.so.6
#4 0x00002aaab0496ff9 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() ()
from /opt/software/openfoam/ThirdParty-2.0.5/platforms/linux64/gcc-4.5.3/lib64/libstdc++.so.6
#5 0x0000000000441e2e in Foam::string::~string (this=0x2aaaac0bd3c8, __in_chrg=<value optimized out>) at /opt/software/openfoam/OpenFOAM-2.0.5/src/OpenFOAM/lnInclude/string.H:78
#6 0x0000000000442216 in Foam::word::~word (this=0x2aaaac0bd3c8, __in_chrg=<value optimized out>) at /opt/software/openfoam/OpenFOAM-2.0.5/src/OpenFOAM/lnInclude/word.H:63
#7 0x00000031c303368e in __cxa_finalize () from /lib64/libc.so.6
#8 0x00002aaab2416866 in __do_global_dtors_aux () from /opt/software/openfoam/myLibs/lib/linux64Gcc46DPOpt/libTMP.so
#9 0x0000000000000000 in ?? ()
(gdb)
How should I proceed to find the real source of my error?
Btw. I saw this and this which is similar but not solving my issue. Also valgrind isn't working correctly for me. I know it has to do with some wrong (de-)allocation but I don't know how to really find the problem.
/Edit
I wasn't able to locate my problem yet...
I think the backtrace which I posted above (position #8) shows the problem is in the code which compiles to libTMP.so. In the Make/options file I added the option -DFULLDEBUG -g -O0. I thought it's possible to track the bug then but I don't know how.
Any help is highly appreciated!

If you have dealt with all compiler warnings and valgrind errors but the problem remains, then divide and conquer.
Cut out half of the code (use #if directives, remove files from the Makefile, or delete lines and restore them later using source control).
If the problem goes away, it was likely caused by something you just removed. If the problem remains, then it's in the code that still remains.
Repeat the procedure recursively until you home in on the problem location.
This doesn't always work because undefined behaviour can manifest itself at a later time than the line which caused it.
However you can work towards producing a minimal program that still has the problem. Eventually you must either produce an actual minimal example that you cannot reduce further, or uncover the true cause.

If you haven't got anything concrete after using gdb and valgrind, you can try disassembling your .so library with objdump. The backtrace gives you the addresses where the error occurred; I used this approach long ago while debugging a problem in my own project. After disassembling, match the address from the backtrace to the address of an instruction in your library; it might give you an idea about the error location.
The command for disassembling is objdump -dR <library.so>
You can find more information about objdump here

valgrind
Ok, I risk being shot down for a one-word answer, but bear with me. Try valgrind. Build the most debug version you have that still has issues and simply issue:
valgrind path/to/program
Chances are, the first reported issue will be your problem source. You can even get valgrind to launch a gdb server and let you attach to debug the code leading to the first memory issue. See:
http://tromey.com/blog/?s=valgrind

Some other options that were not listed yet are:
You can try gdb execution flow recording capability:
$ gdb target_executable
(gdb) b main
(gdb) run
(gdb) target record-full
(gdb) set record full insn-number-max unlimited
Then when the program crashes, you will be able to execute flow backward with reverse-next and reverse-step commands. Note that program runs really slow in this mode.
Another possible approach is to try the clang static analyzer or clang-check tools on your code. Sometimes the analyzer can give a good hint about where the problem in the code might be.
Also, you can link your code with jemalloc and use its debugging capabilities. The options "opt.junk", "opt.quarantine", "opt.valgrind" and "opt.redzone" can be useful. In general, they make malloc allocate some additional memory that is used to monitor writes and reads past the end of buffers, reads of deallocated memory and so on. See the man page. These options can be enabled with the mallctl function.
One more way to find a bug is to build your code with gcc's or clang's sanitizers enabled. You can turn them on with -fsanitize="sanitizer", where "sanitizer" can be one of: address, thread, leak, undefined. Compiler will instrument application with some additional code that will do additional checks and will print the report. For example:
#include <vector>
#include <iostream>
int main() {
std::vector<int> vect;
vect.resize(5);
std::cout << vect[10] << std::endl; // access the element after the end of vector internal buffer
}
Compile it with sanitizer turned on and run:
$ clang++ -fsanitize=address test.cpp
$ ./a.out
Gives the output:
==29920==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60400000dff8 at pc 0x0000004bad10 bp 0x7fff16d63e10 sp 0x7fff16d63e08
READ of size 4 at 0x60400000dff8 thread T0
#0 0x4bad0f in main (/home/pablo/a.out+0x4bad0f)
#1 0x7f0b6ce43fdf in __libc_start_main (/lib64/libc.so.6+0x1ffdf)
#2 0x4baaac in _start (/home/pablo/a.out+0x4baaac)
0x60400000dff8 is located 0 bytes to the right of 40-byte region [0x60400000dfd0,0x60400000dff8)
allocated by thread T0 here:
#0 0x435b9b in operator new(unsigned long) (/home/pablo/a.out+0x435b9b)
#1 0x4c1f49 in __gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*) (/home/pablo/a.out+0x4c1f49)
#2 0x4c1d05 in __gnu_cxx::__alloc_traits<std::allocator<int> >::allocate(std::allocator<int>&, unsigned long) (/home/pablo/a.out+0x4c1d05)
#3 0x4bfd51 in std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long) (/home/pablo/a.out+0x4bfd51)
#4 0x4bdb2a in std::vector<int, std::allocator<int> >::_M_fill_insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) (/home/pablo/a.out+0x4bdb2a)
#5 0x4bbe49 in std::vector<int, std::allocator<int> >::insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) (/home/pablo/a.out+0x4bbe49)
#6 0x4bb358 in std::vector<int, std::allocator<int> >::resize(unsigned long, int) (/home/pablo/a.out+0x4bb358)
#7 0x4bacaa in main (/home/pablo/a.out+0x4bacaa)
#8 0x7f0b6ce43fdf in __libc_start_main (/lib64/libc.so.6+0x1ffdf)

I partially agree with Matt: divide et impera is the way. But I also partially disagree: modifying the code you are trying to debug can lead you down the wrong track, even more so if you are debugging a huge and complex code base that is not yours, in a language that you don't master.
Instead, follow a divide et impera method coupled with a top-down strategy: start by adding a few breakpoints at a higher level, let's say in main, then launch the program and see which breakpoints get hit and which do not before the crash. Now you have a general idea of where the bug is; remove all breakpoints and add new ones a little bit deeper, in the area you just found, and repeat until you hit the routine that causes the crash.
It can be tedious, I know, but it works and, moreover, it will give you a much better understanding of how the entire system works. I've fixed bugs in unknown applications made of tens of thousands of lines of code this way, and it always works; maybe it takes an entire day, but it works.

Related

How do I get an error from `pop_back()` if `size()` is 0?

I was explaining to a coworker why we have small tests with sanitizers on them. He asked about popping a vector too many times: is it an exception, an assert, or UB, and which sanitizer catches it?
It appears NONE catches it. The address and memory sanitizers will if you call back() after popping too many times, but if you pop and then call size() you get a large invalid value due to wrapping.
Is there a way I can get an assert, exception or runtime termination when I pop too many times? I really thought a debug build without sanitizers would have caught that (with an assert or exception).
I use clang sanitizer but build options with gcc will also be helpful

Both libstdc++ and libc++ have a "debug mode" with assertions, that can be enabled using:
-D_GLIBCXX_DEBUG for libstdc++
-D_LIBCPP_DEBUG for libc++
Also -fsanitize=undefined appears to catch it, but the error message is much more cryptic.

The sanitizers may need some annotations to do a good job on library types. g++ -fsanitize=address -D_GLIBCXX_SANITIZE_VECTOR (documentation), clang++ -stdlib=libc++ -fsanitize=address or clang++ -stdlib=libstdc++ -fsanitize=address -D_GLIBCXX_SANITIZE_VECTOR -D__SANITIZE_ADDRESS__ (the need for this last macro is a bug) all detect the issue and print a message
=================================================================
==12312==ERROR: AddressSanitizer: bad parameters to __sanitizer_annotate_contiguous_container:
beg : 0x602000000010
end : 0x602000000014
old_mid : 0x602000000010
new_mid : 0x60200000000c
#0 0x7f6a9db3b707 in __sanitizer_annotate_contiguous_container ../../../../src/libsanitizer/asan/asan_poisoning.cpp:362
#1 0x55b9dcd1fc41 in std::_Vector_base<int, std::allocator<int> >::_Vector_impl::_Asan<std::allocator<int> >::_S_adjust(std::_Vector_base<int, std::allocator<int> >::_Vector_impl&, int*, int*) (/tmp/a.out+0x1c41)
#2 0x55b9dcd1fb66 in std::_Vector_base<int, std::allocator<int> >::_Vector_impl::_Asan<std::allocator<int> >::_S_shrink(std::_Vector_base<int, std::allocator<int> >::_Vector_impl&, unsigned long) (/tmp/a.out+0x1b66)
#3 0x55b9dcd1f6c4 in std::vector<int, std::allocator<int> >::pop_back() (/tmp/a.out+0x16c4)
#4 0x55b9dcd1f36d in main (/tmp/a.out+0x136d)
#5 0x7f6a9d57fe49 in __libc_start_main ../csu/libc-start.c:314
#6 0x55b9dcd1f199 in _start (/tmp/a.out+0x1199)
SUMMARY: AddressSanitizer: bad-__sanitizer_annotate_contiguous_container ../../../../src/libsanitizer/asan/asan_poisoning.cpp:362 in __sanitizer_annotate_contiguous_container
==12312==ABORTING
Note that there is no need for something as complicated as the sanitizers for this case, as mentioned in the other answer, libraries often have debug modes. For instance with libstdc++, defining _GLIBCXX_DEBUG enables the full debug mode (ABI-incompatible with normal mode)
/usr/include/c++/11/debug/vector:523:
In function:
void std::__debug::vector<_Tp, _Allocator>::pop_back() [with _Tp = int;
_Allocator = std::allocator<int>]
Error: attempt to access an element in an empty container.
Objects involved in the operation:
sequence "this" @ 0x0x7ffe20bf5f80 {
type = std::__debug::vector<int, std::allocator<int> >;
}
while defining _GLIBCXX_ASSERTIONS enables much lighter assertions and preserves the ABI, but gives less information about the error
/usr/include/c++/11/bits/stl_vector.h:1227: void std::vector<_Tp, _Alloc>::pop_back() [with _Tp = int; _Alloc = std::allocator<int>]: Assertion '!this->empty()' failed.
and defining _LIBCPP_DEBUG with libc++ prints
/usr/lib/llvm-11/bin/../include/c++/v1/vector:1703: _LIBCPP_ASSERT '!empty()' failed. vector::pop_back called for empty vector

What is the proper way to break on failed asserts in gdb?

I am attempting to capture failed asserts in my program. I’m using a library that makes direct calls to assert(), rather than a custom function or macro, and it is within this library I am currently trying to trace several porting-related bugs. Everything involved has been compiled with debug symbols in g++.
The best solution I have found is breaking at the file:line of the assert, with the condition of the assert expression. This allows stopping on the assert before it fails, but is a horrible solution. It requires special setup for each possibly-failing assert, won’t work from my IDE, and is far too much effort in general.
How can I break on any failed assert using gdb & gcc in such a way that allows examination of the callstack and variables within the scope of the assert call?
It would be even better if the solution allowed me to discard the assert's failure and continue running.

Setting a breakpoint on abort() seems to be the best answer.
Use break abort in gdb's CLI.

No break is needed on Linux, just type bt on the prompt
abort() causes the SIGABRT signal to be raised in Linux, and GDB already breaks on signals by default. E.g.:
a.c
#include <assert.h>
void g(int i) {
assert(0);
}
void f(int i) {
g(i);
}
int main(void) {
f(1);
}
Then:
gcc -std=c99 -O0 -ggdb3 -o a a.c
gdb -ex run ./a
Then just type bt in the shell:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1 0x00007ffff7a483ea in __GI_abort () at abort.c:89
#2 0x00007ffff7a3ebb7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x555555554788 "0", file=file@entry=0x555555554784 "a.c", line=line@entry=4,
function=function@entry=0x55555555478a <__PRETTY_FUNCTION__.1772> "g") at assert.c:92
#3 0x00007ffff7a3ec62 in __GI___assert_fail (assertion=0x555555554788 "0", file=0x555555554784 "a.c", line=4, function=0x55555555478a <__PRETTY_FUNCTION__.1772> "g")
at assert.c:101
#4 0x00005555555546ca in g (i=1) at a.c:4
#5 0x00005555555546df in f (i=1) at a.c:8
#6 0x00005555555546f0 in main () at a.c:12
Which already shows the function values (f (i=1)).
And you can also do as usual:
(gdb) f 4
#4 0x00005555555546ca in g (i=1) at a.c:4
4 assert(0);
(gdb) p i
$1 = 1
The setting that controls if GDB breaks on signals by default or not is: handle all nostop as shown at: How to handle all signals in GDB
Tested in Ubuntu 16.10, gdb 7.11.

If the answers suggested above don't work for you, you may try to break on the __assert_fail function.
break __assert_fail
The name is most probably implementation-dependent, but it's easy to find if you look at the definition of the assert macro on your platform. This will allow you to break before SIGABRT.

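Another option in code (Windows only):
#include <windows.h>
#include <signal.h>
// Called when abort() raises SIGABRT, for example from a failed assert().
static void abortHandler(int signalNumber)
{
(void)signalNumber; // unused
DebugBreak(); // break into the attached debugger
}
int main()
{
signal(SIGABRT, &abortHandler);
// ... rest of the program ...
}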

Program compiled with GCC 4.5 crashes, while GCC 4.4 is fine

Recently I tried to compile and install ns-2, a network simulator based on C++ and Tcl.
Using some slight modification of the source code (don't worry, it won't cause the crash), I could make it compile using the latest gcc 4.5 version.
But when I execute the binary, it gives the following error:
$ bin/ns
*** buffer overflow detected ***: bin/ns terminated
The same code if compiled with earlier gcc, runs fine. So I believe it's due to some enhanced features in gcc 4.5.
How do I approach this problem? Of course compiling with gcc 4.4 is an option, but I would like to know what went wrong :)
Update:
Here is the full stack-trace and back-trace with gdb:
$ bin/ns
*** buffer overflow detected ***: bin/ns terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f01824ac1d7]
/lib/x86_64-linux-gnu/libc.so.6(+0xfd0f0)[0x7f01824ab0f0]
bin/ns[0x8d5b5a]
bin/ns[0x8d56de]
bin/ns[0x841077]
bin/ns[0x842b19]
bin/ns(Tcl_EvalEx+0x16)[0x843256]
bin/ns(Tcl_Eval+0x1d)[0x84327d]
bin/ns(Tcl_GlobalEval+0x2b)[0x84391b]
bin/ns(_ZN3Tcl4evalEPc+0x27)[0x83352b]
bin/ns(_ZN3Tcl5evalcEPKc+0xdd)[0x8334e9]
bin/ns(_ZN11EmbeddedTcl4loadEv+0x24)[0x834712]
bin/ns(Tcl_AppInit+0xb2)[0x8331a5]
bin/ns(Tcl_Main+0x1d0)[0x8ad6a0]
bin/ns(nslibmain+0x25)[0x8330c5]
bin/ns(main+0x20)[0x833254]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xff)[0x7f01823cceff]
bin/ns[0x5bc1a9]
Using GDB and with symbols turned on:
(gdb) bt
#0 0x00007ffff6970d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6974ab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff69a9d7b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff6a3b1d7 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007ffff6a3a0f0 in __chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00000000008d5b5a in strcpy (interp=0xd2dda0, optionIndex=<value optimized out>, objc=<value optimized out>, objv=0x7fffffffdad0)
at /usr/include/bits/string3.h:105
#6 TraceVariableObjCmd (interp=0xd2dda0, optionIndex=<value optimized out>, objc=<value optimized out>, objv=0x7fffffffdad0)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclTrace.c:912
#7 0x00000000008d56de in Tcl_TraceObjCmd (dummy=<value optimized out>, interp=0xd2dda0, objc=<value optimized out>, objv=0xd2ec00)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclTrace.c:293
#8 0x0000000000841077 in TclEvalObjvInternal (interp=0xd2dda0, objc=5, objv=0xd2ec00,
command=0x7ffff7f680fe "trace variable defaultRNG w { abort \"cannot update defaultRNG once assigned\"; }\n\n\nClass RandomVariable/TraceDriven -superclass RandomVariable\n\nRandomVariable/TraceDriven instproc init {} {\n$self instv"..., length=80, flags=0)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:3689
#9 0x0000000000842b19 in TclEvalEx (interp=0xd2dda0,
script=0x7ffff7f52010 "\n\n\n\n\n\nproc warn {msg} {\nglobal warned_\nif {![info exists warned_($msg)]} {\nputs stderr \"warning: $msg\"\nset warned_($msg) 1\n}\n}\n\nif {[info commands debug] == \"\"} {\nproc debug args {\nwarn {Script debugg"..., numBytes=422209, flags=<value optimized out>, line=4141,
clNextOuter=<value optimized out>,
outerScript=0x7ffff7f52010 "\n\n\n\n\n\nproc warn {msg} {\nglobal warned_\nif {![info exists warned_($msg)]} {\nputs stderr \"warning: $msg\"\nset warned_($msg) 1\n}\n}\n\nif {[info commands debug] == \"\"} {\nproc debug args {\nwarn {Script debugg"...)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:4386
#10 0x0000000000843256 in Tcl_EvalEx (interp=<value optimized out>, script=<value optimized out>, numBytes=<value optimized out>,
flags=<value optimized out>) at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:4043
#11 0x000000000084327d in Tcl_Eval (interp=0xd2dda0, script=<value optimized out>)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:4955
#12 0x000000000084391b in Tcl_GlobalEval (interp=0xd2dda0, command=<value optimized out>)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclBasic.c:6005
#13 0x000000000083352b in Tcl::eval(char*) ()
#14 0x00000000008334e9 in Tcl::evalc(char const*) ()
#15 0x0000000000834712 in EmbeddedTcl::load() ()
#16 0x00000000008331a5 in Tcl_AppInit ()
#17 0x00000000008ad6a0 in Tcl_Main (argc=<value optimized out>, argv=0x7fffffffe1d0, appInitProc=0x8330f3 <Tcl_AppInit>)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclMain.c:418
#18 0x00000000008330c5 in nslibmain ()
#19 0x0000000000833254 in main ()

Famous last words: "Don't worry - my change didn't break anything". How can we be sure of that?
However, there is a moderate chance you're correct if the code worked under 4.4 and crashes under 4.5.
GCC has adopted some aggressive optimizations that remove code which tries to detect integer overflow after it has happened. If that is the case, you're going to have to find that code in ns-2 and get it fixed, either by the ns-2 developers or on your own.
You should probably try to run the program under the debugger so that you can get control at the point where the buffer overflow is detected, and see where the code is. If you disabled core dumps (with ulimit -c 0 or equivalent), consider enabling them and see whether you get a core dump when it terminates. That should give you a starting point.
Further thoughts:
When you compiled the code, how stringent were the warning flags used? Can you recompile with more warnings enabled?
One technique that often works (with AutoTools-configured programs) if you can find no other way to get special options to the C or C++ compiler is:
./configure --prefix=/opt/ns CC="gcc -Wall -Wextra" CXX="g++ -Wall -Wextra"
(I also use this technique to specify 32-bit vs 64-bit builds, adding -m32 or -m64.)
Warning: if the code was not created to compile clean under these options, it can be traumatic to do the first compilation using these options. However, there is also a decent chance that in amongst all the warnings is one about the source of your problem. However, it is also indisputable that there will likely be 50 warnings not related to it to any 1 that is (or worse), and fixing all the warnings thus spotted still might not cure the problem. If the code compiles with stringent warnings anyway, then you are faced with enabling many more exotic warnings instead. But if you can get the compiler to help diagnose the problem that it is causing, you should certainly do so - it is much simpler than finding the problem unaided.
Also, make sure you are producing a debuggable program - even if you keep the optimization enabled.
Also, consider compiling with optimization off and see whether the program still crashes. If the program does not crash without optimization and does with optimization, you have some useful information. It won't make it easier to find the cause, but you know it is (probably) related to the optimizer. Or it might just be that the bug moves when not optimized and doesn't fail fatally.
The extended stack trace information is curious:
#5 0x00000000008d5b5a in strcpy (interp=0xd2dda0, optionIndex=<value optimized out>,
objc=<value optimized out>, objv=0x7fffffffdad0)
at /usr/include/bits/string3.h:105
#6 TraceVariableObjCmd (interp=0xd2dda0, optionIndex=<value optimized out>,
objc=<value optimized out>, objv=0x7fffffffdad0)
at /media/Linux/ns-allinone-2.35-RC7/tcl8.5.8/unix/../generic/tclTrace.c:912
Those are not the ordinary arguments to strcpy(). Usually, you have just two arguments. I can't immediately think of a circumstance where it would be appropriate to copy a string over the pointer to the Tcl interpreter's main control structure. So, to get further with this, I would be looking extremely hard at lines 900-920 or so in tclTrace.c, and in particular, line 912. This might just be an artefact of the way the optimizer is mungeing the object code, or it might be a genuine problem.
I found the tcl8.5.8 source and line 912 of tclTrace.c is the strcpy() in this code:
if ((enum traceOptions) optionIndex == TRACE_ADD) {
CombinedTraceVarInfo *ctvarPtr;
ctvarPtr = (CombinedTraceVarInfo *) ckalloc((unsigned)
(sizeof(CombinedTraceVarInfo) + length + 1
- sizeof(ctvarPtr->traceCmdInfo.command)));
ctvarPtr->traceCmdInfo.flags = flags;
if (objv[0] == NULL) {
ctvarPtr->traceCmdInfo.flags |= TCL_TRACE_OLD_STYLE;
}
ctvarPtr->traceCmdInfo.length = length;
flags |= TCL_TRACE_UNSETS | TCL_TRACE_RESULT_OBJECT;
strcpy(ctvarPtr->traceCmdInfo.command, command); // Line 912
ctvarPtr->traceInfo.traceProc = TraceVarProc;
ctvarPtr->traceInfo.clientData = (ClientData)
&ctvarPtr->traceCmdInfo;
ctvarPtr->traceInfo.flags = flags;
name = Tcl_GetString(objv[3]);
if (TraceVarEx(interp,name,NULL,(VarTrace*)ctvarPtr) != TCL_OK) {
ckfree((char *) ctvarPtr);
return TCL_ERROR;
}
} else {
So, the output from GDB and the stack trace looks somewhat misleading; there are two variables passed to strcpy() and one of those is locally allocated on the heap.
I would think about compiling tcl standalone from the source embedded with ns-2 and see whether you can tickle the bug (sorry, awful pun) on its own. This code is related to tracing a tcl variable - trace add varname ... AFAICT.
Assuming that passes, I'd consider getting hold of GCC 4.6 and seeing whether the same problem occurs when you compile ns-2 with that instead of GCC 4.5.
Valgrind
Since you are running on Linux, you should be able to use Valgrind. It is excellent at spotting memory abuse problems. For maximum benefit, use a debug build of ns-2.

"buffer overflow detected": you are writing to a zone which wasn't allocated. gcc 4.4 apparently generated code which didn't trigger a problem (or had a problem which didn't reveal itself as a crash but only as wrong results that went undetected), whereas gcc 4.5 generates code which detects the problem and warns you about it. The only solution is to find the source of the problem and fix the code.

It could be all sorts of things. It could be a GCC bug. It could be a Tcl bug (I hope it isn't, speaking as one of the Tcl developers, but I won't rule it out as Tcl quite often assumes that there's no guard code on structures; Tcl is definitely C89 code). It could be a bug in ns2. For all I know, it could even be a bug elsewhere (because ns2 is built on Tcl, it can load external code libraries; it's quite possible to have a problem there).
Alas, we can't tell from the information posted which of those possibilities it is. Do you know in which library the callstack was when the program crashed? While not a guarantee that that's the actual locus of the problem, it's at least a place to start the bug-hunt…

C++ Program Always Crashes While doing a std::string assign

I have been trying to debug a crash in my application, which aborts with
*** glibc detected *** free(): invalid pointer: 0x000000000070f0c0 ***
while I'm trying to do a simple assign to a string. Note that I'm compiling on a Linux system with gcc 4.2.4 with the optimization level set to -O2. With -O0 the application no longer crashes.
E.g.
std::string abc;
abc = "testString";
but if I changed the code as follows it no longer crashes
std::string abc("testString");
So again I scratched my head! But the interesting pattern was that the crash moved later on in the application, AGAIN at another string. I found it weird that the application was continuously crashing on a string assign. A typical crash backtrace would look as follows:
#0 0x00007f2c2663bfb5 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f2c2663bfb5 in raise () from /lib64/libc.so.6
#1 0x00007f2c2663dbc3 in abort () from /lib64/libc.so.6
#2 0x00000000004d8cb7 in people_streamingserver_sighandler (signum=6) at src/peoplestreamingserver.cpp:487
#3 <signal handler called>
#4 0x00007f2c2663bfb5 in raise () from /lib64/libc.so.6
#5 0x00007f2c2663dbc3 in abort () from /lib64/libc.so.6
#6 0x00007f2c26680ce0 in ?? () from /lib64/libc.so.6
#7 0x00007f2c270ca7a0 in std::string::assign (this=0x7f2c21bc8d20, __str=<value optimized out>)
at /home/bbazso/ThirdParty/sources/gcc-4.2.4/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.h:238
#8 0x00007f2c21bd874a in PEOPLESProtocol::GetStreamName (this=<value optimized out>,
pRawPath=0x2342fd8 "rtmp://127.0.0.1/mp4:pop.mp4", lStreamName=@0x7f2c21bc8d20)
at /opt/trx-HEAD/gcc/4.2.4/lib/gcc/x86_64-pc-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/basic_string.h:491
#9 0x00007f2c21bd9daa in PEOPLESProtocol::SignalProtocolCreated (pProtocol=0x233a4e0, customParameters=@0x7f2c21bc8de0)
at peoplestreamer/src/peoplesprotocol.cpp:240
This was really weird behavior and so I started to poke around further in my application to see if there was some sort of memory corruption (either heap or stack) error that could be occurring that could be causing this weird behavior. I even checked for ptr corruptions and came up empty handed. In addition to visual inspection of the code I also tried the following tools:
Valgrind using both memcheck and exp-ptrcheck
electric fence
libsafe
I compiled with -fstack-protector-all in gcc
I tried MALLOC_CHECK_ set to 2
I ran my code through lint checks as well as cppcheck (to check for mistakes)
And I stepped through the code using gdb
So I tried a lot of stuff and still came up empty handed. So I was wondering if it could be something like a linker issue or a library issue of some sort that could be causing this problem. Are there any known issues with std::string that make it susceptible to crashing at -O2, or maybe it has nothing to do with the optimization level? The only pattern that I can see thus far is that it always seems to crash on a string, so I was wondering if anyone knew of any issues that may be causing this type of behavior.
Thanks a lot!

This is an initial guess using all information I can extract from your back trace.
You are most likely mixing and matching gcc version, linker and libstdc++, which results in unusual behaviour on the host machine:
libc is the system's: /lib64/libc.so.6
libstdc++ is in a "ThirdParty" directory - this is suspicious, as it tells me it might have been compiled elsewhere with a different target - /home/bbazso/ThirdParty/sources/gcc-4.2.4/x86_64-pc-linux-gnu/libstdc++-v3/
Yet another libstdc++ in /opt: /opt/trx-HEAD/gcc/4.2.4/lib/gcc/x86_64-pc-linux-gnu/4.2.4/../../../../include/c++/4.2.4/bits/basic_string.h:491
In addition, GCC may use the system's ld instead of its own, which may cause further weird memory map usage.

Can you repeat the crash with a basic two line program?
#include <string>
int main()
{
std::string abc;
abc = "testString";
}
If that crashes, please post your exact compile / link options.
If not, start paring down your code. Remove lines a handful at a time until the bug goes away. Once you have a change that you can add to cause the crash and remove to make it go away, that should help you locate the problem.

Happened to me because of using malloc for a class which had std::strings as data members. Tricky.

As you said, it's weird behavior.
To be honest, I think you are wasting time looking into a possible bug in std::string. Strings are perfectly safe as long as you are using them correctly.
Anyway, with the information you are giving:
First, are you using threads? It might be a threading problem.
Second, you checked your program using valgrind. Did you get no warnings at all?
Note: the most critical valgrind warnings are invalid read and invalid write.
PS: As said in the comments, you should probably use g++ to compile C++ code ;)

What does the GDB backtrace message "0x0000000000000000 in ?? ()" mean?

What does it mean when it gives a backtrace with the following output?
#0 0x00000008009c991c in pthread_testcancel () from /lib/libpthread.so.2
#1 0x00000008009b8120 in sigaction () from /lib/libpthread.so.2
#2 0x00000008009c211a in pthread_mutexattr_init () from /lib/libpthread.so.2
#3 0x0000000000000000 in ?? ()
The program has crashed with a standard signal 11, segmentation fault.
My application is a multi-threaded FastCGI C++ program running on FreeBSD 6.3, using pthread as the threading library.
It has been compiled with -g and all the symbol tables for my source are loaded, according to info sources.
As is clear, none of my actual code appears in the trace but instead the error seems to originate from standard pthread libraries. In particular, what is ?? () ????
EDIT: eventually tracked the crash down to a standard invalid memory access in my main code. Doesn't explain why the stack trace was corrupted, but that's a question for another day :)

gdb wasn't able to extract the proper return address from pthread_mutexattr_init; it got an address of 0. The "??" is the result of looking up address 0 in the symbol table. It cannot find a symbolic name, so it prints a default "??"
Unfortunately right offhand I don't know why it could not extract the correct return address.

Something you did caused the threading library to crash. Since the threading library itself is not compiled with debugging symbols (-g), gdb cannot display the source file or line number the crash happened on. In addition, since it's threads, the call stack does not point back to your file. Unfortunately this will be a tough bug to track down; you're going to need to step through your code and try to narrow down exactly when the crash happens.

Make sure you compile with debug symbols. (For gcc I think that is the -g option). Then you should be able to get more interesting information out of GDB. Don't forget to turn it off when you compile the production version.

I could be missing something, but isn't this indicative of someone using NULL as a function pointer?
#include <stdio.h>
typedef int (*funcptr)(void);
int
func_caller(funcptr f)
{
return (*f)();
}
int
main()
{
return func_caller(NULL);
}
This produces the same style of a backtrace if you run it in gdb:
rivendell$ gcc -g -O0 foo.c -o foo
rivendell$ gdb --quiet foo
Reading symbols for shared libraries .. done
(gdb) r
Starting program: ...
Reading symbols for shared libraries . done
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x00000000 in ?? ()
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x00001f9d in func_caller (f=0) at foo.c:8
#2 0x00001fb1 in main () at foo.c:14
This is a pretty strange crash though... pthread_mutexattr_init rarely does anything more than allocate a data structure and memset it. I'd look for something else going on. Is there a possibility of mismatched threading libraries or something? My BSD knowledge is a little dated, but there used to be issues around this.

Maybe the bug that caused the crash has broken the stack (overwritten parts of the stack)? In that case, the backtrace might be useless; no idea what to do in that case...
