My server daemon works fine on most machines; however, on one I am getting:
malloc.c:3074: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1)
- 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) ||
((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct
malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) -
1)))&& ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
gdb backtrace:
#4 0x002a8300 in sYSMALLOc (av=<value optimised out>, bytes=<value optimised out>) at malloc.c:3071
#5 _int_malloc (av=<value optimised out>, bytes=<value optimised out>) at malloc.c:4702
#6 0x002a9898 in *__GI___libc_malloc (bytes=16) at malloc.c:3638
#7 0x0804d575 in xmpp_ctx_new (mem=0x0, log=0x0) at src/ctx.c:383
#8 0x0804916e in main (argc=1, argv=0xbffff834) at ../src/adminbot.c:277
Any ideas what else to try? I am unable to find a bug in my code. It could be a bug in the XMPP library, and I need to determine whether that is the case.
Thanks.
This is almost certainly due to a heap corruption bug in your code (writing just before or just after an allocated block).
Since you are apparently on Linux, the tool to use here is Valgrind. It should point you straight at the problem, and it should do so even on machines where your daemon "works".
Trying anything other than Valgrind for this kind of problem is likely a waste of time.
The assertion almost certainly indicates some kind of memory corruption prior to a call to malloc. Given that the assertion is tripping in xmpp_ctx_new, which appears to be a very early call in the libstrophe XMPP library, I'd say it's very likely that the bug is in your code (though it may not be if you're allocating several XMPP contexts - not sure if there's any reason to do that).
If you're only allocating one XMPP context, you can isolate the bug to your code by inserting a call to malloc(sizeof(xmpp_ctx_t)) just before the call to xmpp_ctx_new: if that malloc trips the same assertion, the problem isn't in libstrophe. (Incidentally, I'm pretty sure the problem won't be in this call to xmpp_ctx_new: mem=0x0 looked likely to cause problems, so I googled the source of the function and saw that it basically reduces to a malloc and a few initializers. Reading the source is generally a good strategy when looking for bugs in open-source software.)
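A minimal sketch of such a probe, in case it helps. heap_probe is a hypothetical helper, not part of libstrophe, and the 16-byte size is simply the request shown in frame #6 of the backtrace (I am assuming sizeof(xmpp_ctx_t) may not be available from the public header).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of libstrophe): allocate, touch and release
 * a block of n bytes so that the allocator runs its internal checks at a
 * point of our choosing. Returns 1 if the allocation succeeded and 0 if
 * malloc returned NULL. If the heap is already corrupted, the C library may
 * abort inside this call instead of returning. */
int heap_probe(size_t n)
{
    assert(n > 0);
    unsigned char *p = malloc(n);
    if (p == NULL)
        return 0;
    /* The volatile access keeps an optimising compiler from removing the
     * malloc/free pair altogether. */
    volatile unsigned char *vp = p;
    vp[0] = 0xAB;
    int ok = (vp[0] == 0xAB);
    free(p);
    return ok;
}
```

Calling heap_probe(16) on the line before xmpp_ctx_new(NULL, NULL) should trip the same assertion if the heap was already damaged at that point. A probe that passes does not prove the heap is healthy, since the allocator only notices some kinds of damage.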
In my program I am encountering the following error:
free(): invalid size
Aborted (core dumped)
Running GDB I find that this occurs in the destructor of a vector:
#0 0x00007ffff58e8c01 in free () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000555555dd44e2 in __gnu_cxx::new_allocator<int>::deallocate (this=0x7fffffff6bf0, __p=0x555557117810) at /usr/include/c++/7/ext/new_allocator.h:125
#2 0x0000555555dcfbd7 in std::allocator_traits<std::allocator<int> >::deallocate (__a=..., __p=0x555557117810, __n=1) at /usr/include/c++/7/bits/alloc_traits.h:462
#3 0x0000555555dc85e6 in std::_Vector_base<int, std::allocator<int> >::_M_deallocate (this=0x7fffffff6bf0, __p=0x555557117810, __n=1)
at /usr/include/c++/7/bits/stl_vector.h:180
#4 0x0000555555dc49e1 in std::_Vector_base<int, std::allocator<int> >::~_Vector_base (this=0x7fffffff6bf0, __in_chrg=<optimized out>)
at /usr/include/c++/7/bits/stl_vector.h:162
#5 0x0000555555dbc5c9 in std::vector<int, std::allocator<int> >::~vector (this=0x7fffffff6bf0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435
#6 0x0000555556338081 in Gambit::Printers::HDF5Printer2::get_buffer_idcodes[abi:cxx11](std::vector<Gambit::Printers::HDF5MasterBuffer*, std::allocator<Gambit::Printers::HDF5MasterBuffer*> > const&) (this=0x555556fd8820, masterbuffers=...) at /home/farmer/repos/gambit/copy3/Printers/src/printers/hdf5printer_v2/hdf5printer_v2.cpp:2183
where that last line of code is simply:
std::vector<int> alllens(myComm.Get_size());
So firstly, I don't quite get why the destructor is called here, but supposing it is a normal part of how the vector is dynamically constructed then I guess this error must be due to some sort of heap corruption.
I don't quite get it fully though, is the idea that some other part of the code has previously illegally accessed the memory that is supposed to be allocated for this vector?
Second, I have tried running this through Intel Inspector, and I do get a bunch of "Invalid memory access" and "Uninitialized memory access" problems flagged, but they all look like false positives in libraries I am using, like HDF5.
Is there some in-code way of narrowing down where exactly the problem is coming from? E.g. since it gets triggered by a dynamic memory allocation, can I just start allocating huge arrays earlier and earlier in the code to try and trigger the crash closer to where it originates? I tried searching around for whether something like that would work or be helpful but didn't find anything about it, so maybe it is not a good idea?
So it turned out that I was corrupting the heap via some of the MPI routines, i.e. incorrect parameters for buffer lengths and so on. Unfortunately lots of crazy stuff goes on in the MPI libraries so memory analyzers like Intel Inspector weren't that useful in finding it.
However, I learned about Address Sanitizer (https://en.wikipedia.org/wiki/AddressSanitizer) that comes with modern GNU compilers, and that turned out to be great! Compiled against it in my CMake project (from https://gist.github.com/jlblancoc/44be9d4d466f0a973b1f3808a8e56782)
cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=address -fsanitize=leak -g" \
  -DCMAKE_C_FLAGS="-fsanitize=address -fsanitize=leak -g" \
  -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address -fsanitize=leak" \
  -DCMAKE_MODULE_LINKER_FLAGS="-fsanitize=address -fsanitize=leak"
Ran it with
export ASAN_OPTIONS=fast_unwind_on_malloc=0
(No idea if that was really necessary), and received a fantastic backtrace when my heap corruption occurred:
==12748==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000521340 at pc 0x7fda5011577a bp 0x7ffe231c55e0 sp 0x7ffe231c4d88
WRITE of size 32 at 0x602000521340 thread T0
#0 0x7fda50115779 (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779)
#1 0x7fda4fcd84e3 (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0xf24e3)
#2 0x7fda4fc228d7 (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3c8d7)
#3 0x7fda4fc23a26 (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3da26)
#4 0x7fda4fc2316c (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3d16c)
#5 0x7fda4fc2406c in PMPI_Gather (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3e06c)
#6 0x55e0c18586b0 in void Gambit::GMPI::Comm::Gather<unsigned long>(std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<unsigned long, std::allocator<unsigned long> >&, int) /home/farmer/repos/gambit/copy3/Utils/include/gambit/Utils/mpiwrapper.hpp:450
...etc...
Which pointed straight at the MPI call that I screwed up. Amazing!
But to answer my OP question, my idea of allocating lots of heap memory to trigger the crash closer to the problem wasn't really working. Not sure why. I guess I just don't understand what is going on under the hood there. In fact the place I was seeing the crash was before the MPI call in my code, so that was quite confusing. I guess the compiler moved some stuff around? I did have optimisations turned off, but I guess operations could still be ordered differently in the binary than I expect?
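For anyone hitting the same thing: in MPI_Gather the receive count is the number of elements received from each process, so the receive buffer on the root has to hold that count multiplied by the communicator size. Below is a rough sketch of a check one could run before the call. gather_recv_fits is a hypothetical helper, not part of MPI or of the wrapper in the backtrace, and I am not claiming this was the exact parameter that was wrong there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if a receive buffer with room for
 * recv_capacity elements can hold count_per_rank elements from each of
 * nprocs ranks, and 0 otherwise. */
int gather_recv_fits(size_t recv_capacity, int nprocs, size_t count_per_rank)
{
    assert(nprocs > 0);
    size_t ranks = (size_t)nprocs;
    /* Reject sizes whose product would not fit in size_t. */
    if (count_per_rank != 0 && ranks > SIZE_MAX / count_per_rank)
        return 0;
    return recv_capacity >= ranks * count_per_rank;
}
```

With 4 ranks each sending 1 element, a receive vector of size 1 fails the check and a vector of size 4 passes it.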
I usually love well-explained questions and answers, but in this case I really can't give any more clues.
The question is: why is malloc() giving me a SIGSEGV? The debug session below shows that the program has no time to test the returned pointer against NULL and exit. The program dies INSIDE MALLOC!
I'm assuming the malloc in my glibc is just fine. I have an up-to-date Debian Linux (wheezy) system on an old Pentium (i386/i486 arch).
To be able to track this down, I generated a core dump. Let's follow it:
iguana$gdb xadreco core-20131207-150611.dump
Core was generated by `./xadreco'.
Program terminated with signal 11, Segmentation fault.
#0 0xb767fef5 in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt
#0 0xb767fef5 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xb76824bc in malloc () from /lib/i386-linux-gnu/libc.so.6
#2 0x080529c3 in enche_pmovi (cabeca=0xbfd40de0, pmovi=0x...) at xadreco.c:4519
#3 0x0804b93a in geramov (tabu=..., nmovi=0xbfd411f8) at xadreco.c:1473
#4 0x0804e7b7 in minimax (atual=..., deep=1, alfa=-105000, bet...) at xadreco.c:2778
#5 0x0804e9fa in minimax (atual=..., deep=0, alfa=-105000, bet...) at xadreco.c:2827
#6 0x0804de62 in compjoga (tabu=0xbfd41924) at xadreco.c:2508
#7 0x080490b5 in main (argc=1, argv=0xbfd41b24) at xadreco.c:604
(gdb) frame 2
#2 0x080529c3 in enche_pmovi (cabeca=0xbfd40de0, pmovi=0x ...) at xadreco.c:4519
4519 movimento *paux = (movimento *) malloc (sizeof (movimento));
(gdb) l
4516
4517 void enche_pmovi (movimento **cabeca, movimento **pmovi, int c0, int c1, int c2, int c3, int p, int r, int e, int f, int *nmovi)
4518 {
4519 movimento *paux = (movimento *) malloc (sizeof (movimento));
4520 if (paux == NULL)
4521 exit(1);
Of course I need to look at frame 2, the last on stack related to my code. But the line 4519 gives SIGSEGV! It does not have time to test, on line 4520, if paux==NULL or not.
Here is "movimento" (abbreviated):
typedef struct smovimento
{
int lance[4]; //move in integer notation
int roque; // etc. ...
struct smovimento *prox;// pointer to next
} movimento;
This program can use a LOT of memory, and I know the memory is at its limits. But I thought malloc would handle it better when memory is not available.
Running $free -h during execution, I can see free memory drop as low as 1MB! That's OK. The old computer only has 96MB, and 50MB is used by the OS.
I don't know where to start looking. Maybe check the available memory BEFORE a malloc call? But that sounds like a waste of computer power, as malloc would supposedly do that. sizeof (movimento) is about 48 bytes. If I test before, at least I'll have some confirmation of the bug.
Any ideas, please share. Thanks.
Any crash inside malloc (or free) is an almost sure sign of heap corruption, which can come in many forms:
overflowing or underflowing a heap buffer
freeing something twice
freeing a non-heap pointer
writing to a freed block
etc.
These bugs are very hard to catch without tool support, because the crash often comes many thousands of instructions, and possibly many calls to malloc or free later, in code that is often in a completely different part of the program and very far from where the bug is.
The good news is that tools like Valgrind or AddressSanitizer usually point you straight at the problem.
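To make the first item in that list concrete, here is a small sketch of the guard-byte idea that such tools build on. The guarded_* names, the 16-byte guard size and the 0xA5 fill pattern are all made up for this illustration. Real tools do far more (and catch reads as well), so treat this as a way to see the principle, not as a replacement for Valgrind or AddressSanitizer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HEADER_BYTES 16   /* room for the stored size, keeps alignment */
#define GUARD_BYTES  16   /* padding on each side of the user block */
#define GUARD_FILL   0xA5 /* pattern written into the padding */

/* Layout: [header: size][front guard][user block of n bytes][rear guard] */
void *guarded_malloc(size_t n)
{
    if (n > SIZE_MAX - HEADER_BYTES - 2 * GUARD_BYTES)
        return NULL;
    unsigned char *base = malloc(HEADER_BYTES + GUARD_BYTES + n + GUARD_BYTES);
    if (base == NULL)
        return NULL;
    memcpy(base, &n, sizeof n);
    memset(base + HEADER_BYTES, GUARD_FILL, GUARD_BYTES);
    memset(base + HEADER_BYTES + GUARD_BYTES + n, GUARD_FILL, GUARD_BYTES);
    return base + HEADER_BYTES + GUARD_BYTES;
}

/* Returns 0 if both guards are intact, -1 if the front guard was
 * overwritten (underflow), 1 if the rear guard was overwritten (overflow). */
int guarded_check(const void *p)
{
    assert(p != NULL);
    const unsigned char *user = p;
    const unsigned char *base = user - GUARD_BYTES - HEADER_BYTES;
    size_t n;
    memcpy(&n, base, sizeof n);
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (base[HEADER_BYTES + i] != GUARD_FILL)
            return -1;
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (user[n + i] != GUARD_FILL)
            return 1;
    return 0;
}

void guarded_free(void *p)
{
    if (p != NULL)
        free((unsigned char *)p - GUARD_BYTES - HEADER_BYTES);
}
```

Calling guarded_check on suspect blocks at a few points in the program narrows down when a neighbouring write happened, as long as the stray write is small enough to land inside the padding.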
I am working on a large Fortran code. Before compiling with fast options (in order to run tests on a large database), I usually compile with "warning" options in order to detect and backtrace all the problems.
So with the gfortran -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow -Wall -fcheck=all -ftrapv -g2 compilation, I get the following error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7fec64cdfef7 in ???
#1 0x7fec64cdf12d in ???
#2 0x7fec6440e4af in ???
#3 0x7fec64a200b4 in ???
#4 0x7fec649dc5ce in ???
#5 0x4cf93a in __f_mod_MOD
at /f_mod.f90:132
#6 0x407d55 in main_loop_
at main.f90:419
#7 0x40cf5c in main_prog
at main.f90:180
#8 0x40d5d3 in main
at main.f90:68
The code at f_mod.f90:132 contains a where construct inside a do loop:
! Compute s parameter
do i = 1, Imax
where (dprim .ne. 1.0)
s(:,:,:, :) = s(:,:,:, :) +vprim(:,:,:, i,:)*dprim(:,:,:, :)*dprim(:,:,:, :)/(1.0 -dprim(:,:,:, :))
endwhere
enddo
But I do not see any mistake here. All the other locations in the backtrace are just the calls of the subroutines leading to this part. And of course, since it is a SIGFPE error, I have no problem at execution when I compile with gfortran -g1. (I use gfortran 6.4.0 on Linux.)
Moreover, this error appears and disappears when I modify completely different parts of the code. So does the problem come from this where construct, or from somewhere else, in which case the backtrace is misleading? If the latter, how can I find the mistake?
EDIT:
Since I cannot reproduce this error in a minimal example (those all work), I think the problem comes from somewhere else. But how do I find the problem in a large code?
As the code is dying with a SIGFPE, use each of the individual
possible traps to learn if it is a FE_DIVBYZERO, FE_INVALID,
FE_OVERFLOW, or FE_UNDERFLOW. If it is an underflow, change
your mask to '1 - dprim .ne. 0'.
PS: Don't use array section notation when a whole array reference
can be used instead.
PPS: You may want to compute dprim*dprim / (1 - dprim) outside
of the do-loop as it is loop invariant.
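The two suggestions above (masking on the denominator itself and hoisting the invariant factor out of the loop) can be sketched as follows. This is written in C only to keep the example short. The flattened array layout, the function name accumulate_s and the use of double are assumptions for illustration, not the poster's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: s[k] += vprim[i][k] * dprim[k]^2 / (1 - dprim[k]) for all i,
 * skipping every element whose denominator is exactly zero.
 * Assumed layout: vprim is stored as imax consecutive slabs of n values. */
void accumulate_s(double *s, const double *vprim, const double *dprim,
                  size_t n, size_t imax)
{
    assert(s != NULL && vprim != NULL && dprim != NULL);
    for (size_t k = 0; k < n; k++) {
        double denom = 1.0 - dprim[k];
        if (denom == 0.0)
            continue;                   /* mask tests the denominator */
        /* Does not depend on i, so compute it once per element. */
        double factor = dprim[k] * dprim[k] / denom;
        for (size_t i = 0; i < imax; i++)
            s[k] += vprim[i * n + k] * factor;
    }
}
```

Precomputing the factor changes the order of the floating-point operations slightly, so the last digits of s may differ from the original expression.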
A user reported an error to me where the line
read(unit_chk) ((kpt_latt(i,nkp),i=1,3),nkp=1,num_kpts)
failed with the error (similar to Why do I get a C malloc assertion failure?)
malloc.c:2365: sysmalloc: Assertion `(old_top == (((mbinptr)
(((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct
malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >=
(unsigned long)((((__builtin_offsetof (struct malloc_chunk,
fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) -
1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask)
== 0)' failed.
Abort
As far as I know, the error occurs only for a specific set of inputs. Also, when the read() is changed to the equivalent
((kpt_latt(i,nkp),i=1,3),nkp=1,(num_kpts-1)), &
kpt_latt(1,num_kpts),kpt_latt(2,num_kpts),kpt_latt(3,num_kpts)
the error disappears. Even compiling with a different compiler version (IntelStudio 2013 SP1 composer_xe_2013_sp1.2.144 instead of IntelStudio 2015 composer_xe_2015.6.233) made the error disappear. (This is all from the user's reports -- I have not yet reproduced the error.)
When the program is run through valgrind, it reports
valgrind: m_mallocfree.c:268 (mk_plain_bszB): Assertion 'bszB != 0' failed.
valgrind: This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.
Before that, there are a couple of messages (Conditional jump or move depends on uninitialised value(s), Use of uninitialised value of size 8, and Invalid read of size 8), plus one Invalid write of size 1 on the statement cited above.
The array that is being read into is allocated to the proper size just one line before:
allocate(kpt_latt(3,num_kpts))
read(unit_chk) ((kpt_latt(i,nkp),i=1,3),nkp=1,num_kpts)
EDIT: The user has reported back with a possible solution. The array kpt_latt that is being read was declared with a wrong data type, namely as integer while the data in the file was written as real. This is an error of course; but is it realistic that this caused the failed malloc() assertion?
Fine print: We are talking about a default-kind integer (4 bytes) and a double precision real (8 bytes) here. The resulting bogus values in kpt_latt were not noticed because the program does not actually use them. I still have not reproduced the error myself, so I have to rely on what the user tells me.
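To put numbers on the size mismatch, here is a small sketch of the arithmetic, written in C for brevity. The function names are made up for this illustration. Whether the runtime library actually copies more bytes than the array can hold is not something I have established. That depends on the compiler's I/O runtime, so this only shows how far apart the two sizes are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { COMPONENTS = 3, INTEGER4_BYTES = 4, REAL8_BYTES = 8 };

/* Bytes allocated for kpt_latt(3,num_kpts) declared as default integer. */
size_t kpt_array_bytes(size_t num_kpts)
{
    assert(num_kpts <= SIZE_MAX / (COMPONENTS * REAL8_BYTES));
    return (size_t)COMPONENTS * num_kpts * INTEGER4_BYTES;
}

/* Bytes of payload in the file if the same values were written as
 * double precision reals. */
size_t kpt_record_bytes(size_t num_kpts)
{
    assert(num_kpts <= SIZE_MAX / (COMPONENTS * REAL8_BYTES));
    return (size_t)COMPONENTS * num_kpts * REAL8_BYTES;
}

/* How many more bytes the record holds than the array has room for. */
size_t kpt_excess_bytes(size_t num_kpts)
{
    return kpt_record_bytes(num_kpts) - kpt_array_bytes(num_kpts);
}
```

For num_kpts = 10 the array holds 120 bytes while the record carries 240, so the record is twice the size of the destination.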
Hi, when I tried to run my program (C++), I got the following error:
a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
Aborted
When I traced my program using cout statements, I found that the error is triggered by the following line:
BNode* newNode=new BNode();
If I remove this line, I do not get the error.
Can anyone please help with this?
The shown line of code is ok in general. The heap probably was corrupted before. I would use a memory checker like valgrind to find out where.
Without a memory checking tool you just have to look hard at your code and find the error.
Sometimes a binary search strategy helps. Deliberately deactivate parts of your code and narrow down. Don't be fooled by false positives like the line you posted.
Another alternative is to switch to a programming language with automatic memory management.
The error message means that the integrity of the program heap was violated. The heap was broken. The line you removed... maybe it was the culprit, maybe it was not to blame. Maybe the heap was damaged by some code before that (or even well before that) and the new that you removed simply revealed the problem, not caused it. There's no way to say from what you posted.
So, it is possible that you actually changed nothing by removing that line. The error could still be there, and the program will simply fail in some other place. Buffer overrun, double free or something like that is normally to blame for the invalidated heap. Run your code through some static or dynamic checker to look for these problems (valgrind, coverity etc.)