Valgrind reporting invalid read on one system but not another


I need to run a rather large software package on a new machine for work. The application is written in C and C++ and I am running on CentOS 6.5.
The program builds fine, but segfaults when I go to run it. Using valgrind, I see the following error reported at the location of the segfault:
==23843== Invalid read of size 4
[stack trace here]
==23843== Address 0x642e7464 is not stack'd, malloc'd or (recently) free'd
So for some reason we are reading from memory we aren't supposed to and are invoking undefined behaviour. When I tar up my source files, take them to another CentOS 6.5 machine (w/ same kernel) and compile them (with same makefiles and same GCC version) the program seems to run fine.
I ran valgrind on that machine as well and expected to see the invalid read again. My thought was that the invalid read would always be present, yet because the behaviour is undefined things just happened to work correctly on one machine and not on the other.
What I found, however, was that valgrind reports no read errors on the second machine. How could this be possible?

Valgrind makes the running environment more deterministic, but it does not eliminate all randomness. The other machine may have slightly different versions of libraries installed, or something external that the program uses (files, network, ...) may differ, so the code execution does not have to be exactly the same.
You should look at the stack trace and analyze the code where the error happens. If the cause is not obvious from the stack trace alone, you can start valgrind with the --vgdb-error=0 option (optionally combined with --vgdb=full). Valgrind then waits for gdb at startup and again at each reported error, and prints instructions for attaching gdb. Or you can run the program under a debugger directly, since you wrote that it crashes even without valgrind.
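One further, speculative check concerns the reported address itself. An invalid address whose bytes are all printable ASCII can hint that a pointer was overwritten by string data (for example by a buffer overrun), although it may just as well be coincidence. The following is a minimal sketch only: bytes_as_text is a hypothetical helper, not part of Valgrind or any library, and it assumes a little-endian machine such as x86.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: render the four bytes of a 32-bit value as characters,
// in the order they would sit in memory on a little-endian machine.
// Bytes that are not printable ASCII are shown as '?'.
std::string bytes_as_text(std::uint32_t value) {
    std::string out;
    for (int i = 0; i < 4; ++i) {
        // Extract byte i, lowest-order byte first.
        unsigned char c = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
        out += (c >= 0x20 && c < 0x7F) ? static_cast<char>(c) : '?';
    }
    return out;
}
```

Applied to the address from the question, bytes_as_text(0x642e7464) yields "dt.d", which looks like a fragment of text rather than a plausible heap or stack address.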

Different library versions are the best guess, judging from the sparse information you gave. Things to try:
1) Bring both machines up to date via package manager and try again
2) Run ldd [binary] to see all libraries used by the program in question. Run something like md5sum on them on both machines to find out if there are differences.
In my experience, valgrind is quite poor at detecting invalid memory accesses on the stack, so this might be a hidden root cause. If all else fails, you might want to try clang with AddressSanitizer. It might find things valgrind doesn't catch, and vice versa.

This could be caused by using different versions of Valgrind.
Some common false positives are removed in newer versions, which would explain why one machine (older version) complains and the other (newer version) does not.

Related

c++ program terminating when one thread has access violation - how to catch this in linux - for win32 I get stacktraces in vs2010

c++ program terminated with no exceptions or stacktrace
I have a multi-threaded application.
If one of my threads has an access violation from reading out of bounds on an array (or any segfault condition), my entire application immediately terminates.
If this happens on my windows counter part using visual studio I get a nice stacktrace of where the error was, and what the issue was.
I desperately need this type of debugging environment to be able to succeed at my project. I have too many threads and too many developers running different parts of the project to have one person not handle an exception properly and it destroys the entire project.
I am running Fedora Core 14
I am compiling with gcc 4.5.1
gdb is fedora 7.2-16.fc14
My IDE is eclipse Juno I am using the CDT builder
my toolchain is the cross GCC and my builder is the CDT Internal Builder
Is there ANY setting for gdb, gcc or Eclipse that will help me detect these types of situations?

That's what's supposed to happen. Under Unix, you get a full
core dump (which you can examine in the debugger), provided
you've authorized them (ulimit -c). Traditionally, core dumps
were authorized by default, but Linux seems to have changed
this.
Of course, to get any useful information from the core dump,
you'll need to have compiled the code with symbol information,
and not stripped it later. (On the other hand, you can copy the
core dump from your user's machine onto your development machine,
and see what happened there.)
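Besides ulimit -c in the shell, a process can raise its own core-file limit at startup. The following is a minimal sketch, assuming a POSIX system. The function name enable_core_dumps is hypothetical; getrlimit, setrlimit and RLIMIT_CORE come from <sys/resource.h>. It can only raise the soft limit up to the hard limit, so it has no effect if the hard limit is already zero.

```cpp
#include <sys/resource.h>

// Hypothetical helper: raise the soft core-file size limit of the current
// process to its hard limit, so that a crash can produce a core dump.
// Returns true if both system calls succeed.
bool enable_core_dumps() {
    rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;  // could not query the current limits
    }
    lim.rlim_cur = lim.rlim_max;  // soft limit := hard limit
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling enable_core_dumps() early in main() would cover every thread, since the limit applies to the whole process. Where the core file ends up still depends on the system configuration.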

You're definitely looking for core dumps, as James Kanze wrote.
I would just add that core dumps will show you where the program crashed, which is not necessarily the same place as where the problem (memory corruption etc.) occurred. And of course some out-of-bounds reads/writes may not exhibit themselves by crashing immediately.
You can narrow the search by enabling checks for incorrect memory allocations/deallocations in glibc. The simplest way is to set the environment variable MALLOC_CHECK_. Set it to 2 and glibc will check for heap corruption with every memory allocation/deallocation and abort the program when it finds a problem (producing a core dump, if enabled). This often helps to get closer to the real problem.
http://www.gnu.org/software/libc/manual/html_node/Heap-Consistency-Checking.html

Segfaults from command line but works from GDB run

I'm really lost in here. Maybe some of you can point me to a right direction.
I'm developing a tool in ANSI C using GCC over MinGW. The tool is to be run only from the command line, probably only on Windows machines. It processes some data locally and generates files for use by other programs. Basically it does a lot of math and a little file handling. Nothing really fancy. I didn't find it necessary to post the whole 1000+ lines here for examination...
I compile it with GCC -ansi, making sure not even a single warning is present. Everything always worked well as the development progressed. But recently I started getting (almost) random segfaults. I checked the last changes made, but found nothing. I removed the last changes completely, going back to when it worked perfectly. Still segfaults. I traced line by line. I went back to read and re-read the whole code searching for possible pointer/malloc errors. I simply can't find the reason for it to fail so often and so randomly.
So here is the strange thing - I compiled it with -g and ran it through GDB.
start MSYS
change dir where the program resides
$ gdb generatore.exe (the one compiled -g that fails)
$ run
And it perfectly works inside GDB. I went step by step. Line by line. Perfect. I tried stressing it with huge amounts of data. All works. Can't reproduce the error. But if the same executable is run from command line, it fails.
I suspect an unpredictable behavior with some pointer but I cannot find it anywhere.
Has anyone ever encountered anything similar? Where should I be checking? Also, I am not very familiar with GDB. Since the program runs smoothly there, can I somehow force the failure in order to find its cause? Are there any other free debugging solutions for Windows you can recommend? How can I debug unpredictable behavior?
Thanks a lot for your attention,
maxim

There is a great free tool for catching all kinds of runtime errors to do with pointers, memory allocations, deallocations, etc., which cannot be caught at compile time, so your compiler will not warn you about them: valgrind http://valgrind.org/. The problem is, AFAIK it doesn't run on Windows. However, if your program is pure ANSI C you should be able to build it and run it with valgrind on a Linux box.
I'm not 100% sure about it, but it should run OK in a virtual machine, so if you don't have a separate Linux computer you can try installing e.g. Ubuntu in VirtualBox or VMware and running your program with valgrind in it.

C++ 64bit, variable not found

I have a problem with my C++ application. It was developed on a 32bit pc, on Microsoft Visual Studio 2008, and now I am trying to run it on a 64bit pc.
On my 32bit pc it works fine; on the 64bit pc, Visual Studio does not give any compilation problem, but then on execution gives wrong results.
And I have understood why.
In the code, I define a variable of type "dag", which is a structure for a directed acyclic graph. By debugging the software, I noticed that, although I declared it, later the software is not able to insert data into it, and the debugger says:
CXX0017: Error: symbol "dags" not found
Here's my code:
Dag<int64_t>* dags = new Dag<int64_t>();
dags = getDagsFromRequest2(request, dags);
The very strange thing is that, if I follow the flow inside getDagsFromRequest2() function, I can clearly see that dags variable is full of data: on "quickwatch", it shows 2342 nodes inside it. But when I come back from getDagsFromRequest2() function to this part of the code, debugger says "CXX0017: Error: symbol "dags" not found". How is it possible?
You can also see this screenshot from my Visual Studio debug set.
What could be the problem?
Thanks a lot

There are a few possibilities to consider:
Running in Release builds. Switch to a Debug build.
Using a Debug build that has optimizations enabled and/or debug information disabled. Disable the optimizations and enable the debug information (look in another project for the relevant settings).
A corrupt build of some sort. Clean and rebuild the entire solution.
Memory corruption which is preventing the debugger from displaying the variable. Ensure that no memory issues exist with a tool like Valgrind.
A VS bug. This report for VS2010 seems to suggest a known bug with similar characteristics for example. Ensure all patches and hotfixes for VS2008 are installed.

The variable dags is defined, since your code compiles. The error you see is simply related to the debugger. I am guessing it is caused by running the application in Release mode, which sometimes causes confusing and wrong watch values. Try changing the mode to Debug (there is a drop-down from which you can choose the build mode).
EDIT: as you say you are running in Debug mode, my next guess is that this behavior could be caused by stack corruption. Try using valgrind to detect if that is the case. It may take a while to get started with it, but it is worth it and will detect if you have some memory corruption.

What could be the reason that gdb running my program and just bash running my program show different outputs?

I debug my C++ program with gdb on Linux. I compile with -g and I do see a lot of information in the debugger, but it keeps telling me that my program exits normally and doesn't show any errors.
When I just run my program, though, it doesn't finish and shows that not everything is all right (one assertion in malloc.c failed).
I have also had cases where gdb and a plain run of the program showed different error messages. The errors are always related to wrong pointers and memory accesses.
The same actually goes for valgrind. Could it be that valgrind cannot be used here, in particular because several processes and a shared library are involved?
Running it with valgrind by: valgrind --trace-children=yes prog1 gives me no errors (which cannot be true). If I enable the suppressed errors by: valgrind -v --trace-children=yes prog1, I get warnings about redirection conflicts (which don't seem like errors either).

The problem with buggy programs is that their behavior is undefined. They work sometimes, and crash at other times unpredictably.
Both Valgrind and GDB affect the program timing, and may hide race conditions (which could happen for both multithreaded and multiprocess programs).
In addition, GDB disables address-space randomization, making addresses in the program repeatable from run to run. That's usually what you want while debugging, but your crash may only manifest itself for a particular random layout of shared libraries, and that layout may never happen under GDB.
Your best bet is to enable generation of core dumps (ulimit -c unlimited), run the program outside GDB and have it abort (failing assert calls abort). Once you have a core, debug it with GDB: gdb /path/to/your/executable core.
For the problems you've described, Valgrind is usually a better tool. If multiple processes are involved, you'll want to run valgrind with --trace-children=yes flag.

This could be a multithreading problem and gdb slows it down enough that your threads don't conflict. Also maybe the program is being run optimized? Does valgrind say everything is ok?

I would enable all warnings in the compilation with -Wall, so that gcc will warn you about uninitialized variables, and then run it inside valgrind. One of them should tell you the problem.

c++ what is a good debugger for segmentation errors?

Does anyone know a good debugger for C++ segmentation errors on the Linux environment? It would be good enough if the debugger could trace which function is causing the error.

Also consider some techniques that do not require you to change your code:
Run your app via the valgrind memcheck tool. It can catch errors where you access a wrong address (e.g. a freed or uninitialized pointer).
If you use STL/Boost extensively, consider compiling with -D_GLIBCXX_DEBUG and -D_GLIBCXX_DEBUG_PEDANTIC. This can catch errors such as using an invalidated iterator, accessing an incorrect index in a vector, etc.
tcmalloc (from Google perftools). When you link with its debug-enabled version, it may find memory-related problems.
Even more ...
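To illustrate the _GLIBCXX_DEBUG suggestion, here is a minimal sketch, assuming GCC with libstdc++. The names element_at and checked_containers_enabled are hypothetical. In a normal build, operator[] on std::vector does no bounds check, so an out-of-range index is undefined behavior. When the same file is compiled with -D_GLIBCXX_DEBUG, libstdc++ checks the index and aborts with a diagnostic instead.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: plain indexed access. Unchecked in a normal build;
// with -D_GLIBCXX_DEBUG, libstdc++ verifies that i < v.size() and aborts
// with a diagnostic if it is not.
int element_at(const std::vector<int>& v, std::size_t i) {
    return v[i];
}

// Hypothetical helper: reports whether this translation unit was compiled
// with libstdc++ debug mode switched on.
constexpr bool checked_containers_enabled() {
#ifdef _GLIBCXX_DEBUG
    return true;
#else
    return false;
#endif
}
```

No source change is needed to switch the checks on. The macro only has to be defined consistently for all translation units that share containers, because debug-mode containers are not guaranteed to be layout-compatible with the normal ones.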

GDB! What else is available on Linux?
Check this out to get started with GDB; it's a nice, concise and easy-to-understand tutorial.

GDB is indeed about the only choice. There are some GUIs, but they are almost all wrappers for gdb. Finding a segfault is easy. Make sure you compile with -g -O0, then start gdb with your program as argument.
In gdb, type run
to start your program running. gdb will stop it as soon as it hits a segfault and report which line that was on. If you need a full backtrace, just type bt. To get out of gdb, enter quit.
BTW, gdb has built-in help; just type help.
