Invisiable SIGSEGV on linux that does not happen on windows? - c++

INTRO
I have a TCP/HTTP server that supports plugins in form of Shared Libraries (DLL and .so). It has make and .sln files build system via premake. When I start my application I feed to it a configuration file like this with description of what libraries server shall use as plugins and what arguments it shall pass to tham. For some time I had 2 plugins and all worked just fine. and even now works just fine if I feed to my server config fdiles alike this. But Now I have new plugin I am developing and so new config file.
SETUP
Steps required to setup my server on linux are fiew and simple
download build script (from here as described here)
./cloud_server_net_setup.sh , no superuser needed, requires curl, make and g++
In regular case (not development this is enought - it will get boost, and other libraries it needs into local folder, it will build all of tham, build server in release form )
now you can cd into cloud_server/install-dir/
call export LD_LIBRARY_PATH=./:./lib_boost
and run our server ./CloudServer
But we need debug wersion so after we call script we
cd cloud_server/CloudServer/projects/linux-gmake/
make
cd bin/debug
export LD_LIBRARY_PATH=./:(place from where we called our script)/cloud_server/install-dir/lib_boost
PROBLEM
and now, finally we can call gdb.
So we call it. and this is what we see:
gdb ./CloudServer
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/ole_jak/cloud_server/CloudServer/projects/linux-gmake/bin/debug/CloudServer...done.
(gdb) r
Starting program: /home/ole_jak/cloud_server/CloudServer/projects/linux-gmake/bin/debug/CloudServer
[Thread debugging using libthread_db enabled]
Cloud Server v0.5
Copyright (c) 2011 Cloud Forever. All rights reserved.
Type 'help' to see help messages.
Config file path: config.xml
[New Thread 0x7ffff5967700 (LWP 11516)]
[New Thread 0x7ffff5166700 (LWP 11517)]
[New Thread 0x7ffff4965700 (LWP 11518)]
[New Thread 0x7ffff4164700 (LWP 11519)]
[New Thread 0x7ffff3963700 (LWP 11520)]
[New Thread 0x7ffff3162700 (LWP 11521)]
[New Thread 0x7ffff2961700 (LWP 11522)]
[New Thread 0x7ffff2160700 (LWP 11523)]
[New Thread 0x7ffff195f700 (LWP 11524)]
[New Thread 0x7ffff115e700 (LWP 11525)]
[New Thread 0x7ffff095d700 (LWP 11526)]
[New Thread 0x7fffebfff700 (LWP 11527)]
[New Thread 0x7fffeb7fe700 (LWP 11528)]
[New Thread 0x7fffeaffd700 (LWP 11529)]
[New Thread 0x7fffea7fc700 (LWP 11530)]
[New Thread 0x7fffe9ffb700 (LWP 11531)]
Library libFileService.so opened.
[New Thread 0x7fffe953c700 (LWP 11532)]
Library libUsersFilesService.so opened.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) x/i $pc
0x0: Cannot access memory at address 0x0
I am Linux nube and all I know about Segmentation fault I know from wikipedia, but I know one more thing about my server and this new service I am creating - it compiles and runs on Windows with no errors at all (VS2008, 2010 solutions can be created from same premake script).
So I wonder how and where in this 2 files .cpp and .h I have created an error that does not show on windows at alss an shows so dramaticvally on Linux? And is it fixable, or visiable to fresh eye?
UPDATE:
Valgrind output
ole_jak#dspproc:~/cloud_server/CloudServer/projects/linux-gmake/bin/debug$ valgrind ./CloudServer
==11682== Memcheck, a memory error detector
==11682== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==11682== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==11682== Command: ./CloudServer
==11682==
Cloud Server v0.5
Copyright (c) 2011 Cloud Forever. All rights reserved.
Type 'help' to see help messages.
Config file path: config.xml
Library libFileService.so opened.
Library libUsersFilesService.so opened.
==11682== Jump to the invalid address stated on the next line
==11682== at 0x0: ???
==11682== by 0x4D49BE: sqlite3_free (sqlite3.c:18155)
==11682== by 0x102242D5: sqlite3OsInit (sqlite3.c:14162)
==11682== by 0x1029EB28: sqlite3_initialize (sqlite3.c:107299)
==11682== by 0x102A159F: openDatabase (sqlite3.c:108909)
==11682== by 0x102A1B29: sqlite3_open (sqlite3.c:109156)
==11682== by 0x1021CAB0: sqlite3pp::database::connect(char const*) (sqlite3pp.cpp:89)
==11682== by 0x1021C6E3: sqlite3pp::database::database(char const*) (sqlite3pp.cpp:74)
==11682== by 0x1020DDDF: users_files_service::create_files_table(std::string) (users_files_service.cpp:171)
==11682== by 0x1020BAFC: users_files_service::apply_config(boost::shared_ptr<boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> > >) (users_files_service.cpp:38)
==11682== by 0x4B5432: server_utils::parse_config_services(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:156)
==11682== by 0x4B6436: server_utils::parse_config(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:208)
==11682== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==11682==
==11682==
==11682== Process terminating with default action of signal 11 (SIGSEGV)
==11682== Bad permissions for mapped region at address 0x0
==11682== at 0x0: ???
==11682== by 0x4D49BE: sqlite3_free (sqlite3.c:18155)
==11682== by 0x102242D5: sqlite3OsInit (sqlite3.c:14162)
==11682== by 0x1029EB28: sqlite3_initialize (sqlite3.c:107299)
==11682== by 0x102A159F: openDatabase (sqlite3.c:108909)
==11682== by 0x102A1B29: sqlite3_open (sqlite3.c:109156)
==11682== by 0x1021CAB0: sqlite3pp::database::connect(char const*) (sqlite3pp.cpp:89)
==11682== by 0x1021C6E3: sqlite3pp::database::database(char const*) (sqlite3pp.cpp:74)
==11682== by 0x1020DDDF: users_files_service::create_files_table(std::string) (users_files_service.cpp:171)
==11682== by 0x1020BAFC: users_files_service::apply_config(boost::shared_ptr<boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> > >) (users_files_service.cpp:38)
==11682== by 0x4B5432: server_utils::parse_config_services(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:156)
==11682== by 0x4B6436: server_utils::parse_config(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:208)
==11682==
==11682== HEAP SUMMARY:
==11682== in use at exit: 124,050 bytes in 1,083 blocks
==11682== total heap usage: 1,814 allocs, 731 frees, 183,517 bytes allocated
==11682==
==11682== LEAK SUMMARY:
==11682== definitely lost: 0 bytes in 0 blocks
==11682== indirectly lost: 0 bytes in 0 blocks
==11682== possibly lost: 46,248 bytes in 799 blocks
==11682== still reachable: 77,802 bytes in 284 blocks
==11682== suppressed: 0 bytes in 0 blocks
==11682== Rerun with --leak-check=full to see details of leaked memory
==11682==
==11682== For counts of detected and suppressed errors, rerun with: -v
==11682== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
Убито
ole_jak#dspproc:~/cloud_server/CloudServer/projects/linux-gmake/bin/debug$

This is a nasty one. I am unsure about the exact root cause, but this seems to be a multi-threading related issue. The immediate cause of the problem is that the sqlite3Config.m.xSize function pointer is NULL at the place and time the error happens.
This pointer is supposed to be initialized to point to a proper function the first time that sqlite3_initialize() is called, which normally happens the first time you open an SQLite database file. By setting breakpoints and watchpoints in GDB I was able to verify that the pointer is successfully set, yet at the time of the segmentation fault its value is NULL.
That could mean one of two things:
The new pointer value is not properly propagated to all threads. SQLite3 is supposed to be thread-safe, but well, threads can be nasty little buggers...
Something resets the pointer after it has been initialized. I considered this highly unlikely since the sqlite3Config structure is not usually modified after initialization.
I performed a simple test, which incidentally can be used as a temporary workaround: I added an explicit call to sqite3_initialize() as the first statement in main(), allowing it to be executed before any threads are launched. As a result, the segmentation fault went away and I got a shell prompt for your server, which points to the first of the two alternatives. Note that this is a workaround at best, since sqite3_initialize() is not supposed to be explicitly called. The root cause of the issue may still be present and make itself known otherwise - or, worse, it could break things in subtle, yet hard to detect, ways.
Since SQLite3 is supposed to be thread-safe (and the source code of the sqlite3_initialize() function seems correct in that regard), I am unsure what is happening. It could be a problem with the sqlite3pp wrapper or with the way the threads are launched.

here are my suggestions.
turn off optimizations. Sometime optimizations cause errors. use -O0 for example.
remove dynamic loading, try linking your code in statically, and see if the problem still occurs.
reduce the size of the problem. Make the smallest possible program that can reproduce the error and then post it here.
thanks,
mike

Related

Valgrind reporting memory leak in RocksDB

I am trying to profile the performance of RocksDB using Callgrind / KCacheGrind on a mac. I am running the command valgrind --tool=callgrind ./simple_example on one of the example programs that comes with RocksDB in the examples folder. I seem to be getting a memory leak, though, which prevents me from being able to do the performance profiling that I ultimately want to do.
==54628== Callgrind, a call-graph generating cache profiler
==54628== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==54628== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==54628== Command: ./simple_example
==54628==
==54628== For interactive control, run 'callgrind_control -h'.
==54628==
==54628== Process terminating with default action of signal 11 (SIGSEGV)
==54628== Access not within mapped region at address 0x18
==54628== at 0x1016D25BA: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==54628== by 0x1016D250C: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==54628== by 0x1016D1BF8: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==54628== If you believe this happened as a result of a stack
==54628== overflow in your program's main thread (unlikely but
==54628== possible), you can try to increase the size of the
==54628== main thread stack using the --main-stacksize= flag.
==54628== The main thread stack size used in this run was 8388608.
--54628:0:schedule VG_(sema_down): read returned -4
==54628==
==54628== Events : Ir
==54628== Collected : 14961779
==54628==
==54628== I refs: 14,961,779
Segmentation fault: 11

I want to restore memory in GDB when I debug core file

I want to restore memory in GDB when I debug core file.
I checked restore function in GDB when process is running. (https://sourceware.org/gdb/onlinedocs/gdb/Dump_002fRestore-Files.html)
It was success.
But I want to restore memory when I debug core file.
I can use core_filter and I can select segment. but It is out very big size core file.
so I use a way that dump and restoring memory.
but it is possible when process is running.
I want to restore memory when process is not running.
I need the way when I debug core dump file.
Can I restore memory in this way?
Do you know another ways?
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
...
Reading symbols from main...done.
[New LWP 19274]
Core was generated by `./main'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000040053a in main () at main.c:6
6 *a = 3;
(gdb) restore memory_dump binary &array 0 40
You can't do that without a process to debug.

GDB with corefile on remote embedded device - How to get more information about backtrace?

I have a core dump from a C++ application running on an embedded imx6 board (yocto linux). I can put gdb on the box and run it in a terminal to examine the core file like so just fine:
gdb myApplication core.udpsrc256:src.1520419431.5526
I get extremely limited information, and really need to know more about what caused the core dump. All I have is a printout from the application:
(myApplication:5526): GLib-ERROR **: ../../glib-2.46.2/glib/gmem.c:100: failed to allocate 65611 bytes
./run-app.sh: line 8: 5526 Trace/breakpoint trap (core dumped) XDG_RUNTIME_DIR=/run/user/root ./myApplication
Also the core dump backtrace gives some useless stuff. I need to know more stuff up the stack that led to this frame:
#0 0x75ff1910 in raise () from /lib/libc.so.6
[Current thread is 1 (LWP 5533)]
(gdb)
(gdb)
(gdb) bt
#0 0x75ff1910 in raise () from /lib/libc.so.6
#1 0x6b169558 in g_logv () from /usr/lib/libglib-2.0.so.0
#2 0x6b169610 in g_log () from /usr/lib/libglib-2.0.so.0
#3 0x6b1681c4 in g_malloc () from /usr/lib/libglib-2.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Sidenote -- there is some warnings when I startup gdb:
GNU gdb (GDB) 7.10.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-poky-linux-gnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from qt5qmlvideo...done.
warning: exec file is newer than core file.
[New LWP 5533]
[New LWP 5526]
[New LWP 5531]
[New LWP 5528]
[New LWP 5534]
[New LWP 21064]
[New LWP 5536]
[New LWP 21065]
[New LWP 5532]
[New LWP 5527]
[New LWP 5530]
[New LWP 5537]
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `./qt5qmlvideo -platform wayland'.
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0 0x75ff1910 in raise () from /lib/libc.so.6
[Current thread is 1 (LWP 5533)]
(gdb)
Can anyone help? Do I need some of the stuff gdb warns about... or can i rebuild my application and its dependencies in some other configuration that would give more output? Thank you!
Some more notes that may matter -
This is a multithreaded application running a gstreamer pipeline. Many gstreamer plugins generate their own threads, one of which in this pipeline is 'udpsrc'. I'm wondering if it's because this failure happens in one of those threads is the reason why I can't get details, but I want to know how to get it to show the details if possible!
(1)
The
Do you need "set solib-search-path" or "set sysroot"?
is a problem. Check the path (on your device) where linux-vdso.so.1 resides, and include that in the solib-search-path. Similarly for the other shared-object libraries that your program uses. E.g. if some shared-object libraries are in /lib, some are in /usr/adowdy/lib and some are in /usr/adowdy/arm/lib, you can say:
(gdb) set solib-search-path /lib:/usr/adowdy/lib:/usr/adowdy/arm/lib
(2) The
warning: Unable to find libthread_db matching inferior's thread
library, thread debugging will not be available.
is also a problem. See the answer to this question
(3) The
failed to allocate 65611 bytes
is a clue. Are you, by any chance, trying to allocate a negative number of bytes (maybe 65536 - 65611 = -75 bytes)?
Also the core dump backtrace gives some useless stuff.
It's not entirely useless. The stack trace, and the message from the application say the same thing: your application ran out of memory (malloc failed to allocate 65611 bytes).
While a more complete stack would tell you which particular call to g_malloc failed, it's very likely to not matter in practice -- if this g_malloc didn't fail, the next one would.
You likely have a memory leak, or are simply allocating too much memory for what your system allows.
You should look into many debugging tools built for solving this exact problem.

Large core dump with no function names in GDB

I'm fairly new to debugging core dumps on Linux, and I'm running into a weird issue. Hoping to get some suggestions.
We're getting occasional crashes on our game servers running on AWS Linux boxes. I set up the boxes to generate core dumps. Often, the dumps are around a few hundred MB -- roughly the size of the program in memory. These I'm able to load in gdb and seemingly get a valid backtrace.
But frequently, we're getting dumps that are multiple GB in size. Usually, when I load these core dumps in gdb, there's no usable info in the backtrace.
Here's an example output:
> gdb AAPGOrbis core.3871
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from AAPGOrbis...Reading symbols from <path to>/AAPGOrbis.dbg...done.
done.
[New LWP 3871]
[New LWP 3877]
[New LWP 6557]
[New LWP 3876]
[New LWP 6558]
[New LWP 6559]
warning: Error reading shared library list entry at 0x302e6f732e646165
warning: Error reading shared library list entry at 0x74756d5f64616572
Core was generated by `/opt/aapg/Binaries/Linux/AAPGOrbis aaentry?game=AAGame.AAGamePreGameLobbyDedica'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fed61d001f7 in ?? ()
(gdb) bt full
#0 0x00007fed61d001f7 in ?? ()
No symbol table info available.
#1 0x00007fed61d018e8 in ?? ()
No symbol table info available.
#2 0x0000000000000020 in ?? ()
No symbol table info available.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb)
Any ideas as to what might be causing this? I'm wondering if the size of the core dumps coupled with the lack of valid data is indicative of some really bad memory corruption.
Any insight would be greatly appreciated!
warning: Error reading shared library list entry at 0x302e6f732e646165
warning: Error reading shared library list entry at 0x74756d5f64616572
GDB is attempting to read the list of loaded shared libraries from a clearly bogus address (both of these addresses are ASCII strings ead.so.0read_mut).
The most frequent cause is that you have given GDB the wrong binary: the AAPGOrbis that you give GDB must be exactly the same binary as the one that crashed.
Another possibility is that the shared library list (which is in heap) has indeed been corrupted by the program running amok.

Opening core dump file with different executable but the same sources

I have a coredump file from a colleague's machine.
We both have the same sources for the program and the same third party *.so files (like libmysqlclient18 and several in-house ones).
The problem is that we both compile the software (from the same sources) independently and I want to use his core dump files for inspection with GDB on my machine.
When I try to load the core file and my executable into gdb I get:
user#ubuntu:/mnt/hgfs/share/dir$ gdb prog core
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /mnt/hgfs/share/dir/prog...done.
warning: exec file is newer than core file.
[New LWP 4465]
[New LWP 4462]
[New LWP 4464]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
Core was generated by `./prog'.
Program terminated with signal 11, Segmentation fault.
#0 0x002d3706 in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt full
#0 0x002d3706 in ?? () from /lib/i386-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00000000 in ?? ()
No symbol table info available.
(gdb)
Is this possible for this scenario to work? (I compile the software with debugging symbols enabled, of course)
If not, what are the technical details I'm missing?
I know for sure that it wouldn't be possible if he or I would make modifications to the source, since then the executable would be different, but this is not the case, the third party *.so files and the sources, they all match.
UPDATE:
After installing libc6-dbg as user mkfs suggested in the comments, I get this in gdb:
user#ubuntu:/mnt/hgfs/share/dir$ gdb prog core
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /mnt/hgfs/share/dir/prog...done.
warning: exec file is newer than core file.
[New LWP 4465]
[New LWP 4462]
[New LWP 4464]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
Core was generated by `./prog'.
Program terminated with signal 11, Segmentation fault.
#0 0x002d3706 in _IO_helper_overflow (s=0x0, c=0) at vfprintf.c:2188
2188 vfprintf.c: No such file or directory.
(gdb) bt
#0 0x002d3706 in _IO_helper_overflow (s=0x0, c=0) at vfprintf.c:2188
#1 0x00000000 in ?? ()
(gdb) bt full
#0 0x002d3706 in _IO_helper_overflow (s=0x0, c=0) at vfprintf.c:2188
written = 47
target = <optimized out>
used = -1226838776
#1 0x00000000 in ?? ()
No symbol table info available.
(gdb)
Is this possible for this scenario to work?
Yes, but you need to ensure that the symbol layout of the binary you build on both machines is identical (or at least close enough). This isn't necessarily trivial: things like local username, pathname for sources or installation directory, and hostname sometimes leak into the built object files, and may cause symbol mismatch.
To check whether the binaries are close, run diff <(nm a.out) <(nm b.out) -- there should only be few differences. If you see a lot of differences, your binaries aren't close enough.
I compile the software with debugging symbols enabled, of course
This may be your first mistake: if your coworker builds with -O2, and you build with -g (and implied -O0), the binary is guaranteed to not match.
You need to build with exactly the flags your coworker builds with (but you may add debugging symbols; e.g. if your coworker builds with -O2, you should build with -O2 -g together).
P.S. Note that you also need identical versions of system libraries on the two machines.