boost serialization binary_oarchive crashes - c++

First I populate a structure which is quite big and has many interrelations, and then I serialize it to a binary archive. The size of the structure depends on what data I feed to the program. I see the program taking ~2GB of memory to build the structure, which is expected and acceptable.
Then I start serializing the object, and I see the program eating RAM while serializing. RAM usage keeps growing until it reaches nearly 100%; swap usage is still 0 bytes.
Then the application crashes with a bad_alloc exception thrown from new.
Why would the serialization process take so much RAM and time? And why would it crash while allocating memory when swap is empty? The backtrace is too long to be pasted in full.
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7c6e941 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7c71e42 in abort () at abort.c:92
#3 0xb7e92055 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#4 0xb7e8ff35 in ?? () from /usr/lib/libstdc++.so.6
#5 0xb7e8ff72 in std::terminate() () from /usr/lib/libstdc++.so.6
#6 0xb7e900e1 in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0xb7e90677 in operator new(unsigned int) () from /usr/lib/libstdc++.so.6
#8 0xb7f00a9f in boost::archive::detail::basic_oarchive_impl::save_pointer(boost::archive::detail::basic_oarchive&, void const*, boost::archive::detail::basic_pointer_oserializer const*) () from /usr/lib/libboost_serialization.so.1.42.0
#9 0xb7effb42 in boost::archive::detail::basic_oarchive::save_pointer(void const*, boost::archive::detail::basic_pointer_oserializer const*) () from /usr/lib/libboost_serialization.so.1.42.0
#10 0x082d052c in void boost::archive::detail::save_pointer_type<boost::archive::binary_oarchive>::non_polymorphic::save<gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > > >(boost::archive::binary_oarchive&, gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > >&) ()
#11 0x082d0472 in void boost::archive::detail::save_pointer_type<boost::archive::binary_oarchive>::save<gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > > >(boost::archive::binary_oarchive&, gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > > const&) ()
.......
#172 0x082a91d8 in boost::archive::detail::interface_oarchive<boost::archive::binary_oarchive>::operator<< <gcl::Collation const> (this=0xbfffe500, t=...) at /usr/include/boost/archive/detail/interface_oarchive.hpp:64
#173 0x082a6298 in boost::archive::detail::interface_oarchive<boost::archive::binary_oarchive>::operator&<gcl::Collation> (this=0xbfffe500, t=...) at /usr/include/boost/archive/detail/interface_oarchive.hpp:72
#174 0x0829bd63 in main (argc=4, argv=0xbffff3f4) at /home/neel/projects/app/main.cpp:93
The program works properly when smaller data is fed to it.
Using 64-bit Linux with a 32-bit PAE kernel, Boost 1.42.
The program was working without a crash a few revisions ago. I recently added some more bytes to the structures; maybe it was not reaching the end of RAM before, and now it is.
But why would new fail when there is plenty of swap? And why would the serialization process take so much RAM?

Question: why would it crash while allocating memory when swap is empty?
Because the requested block is too big to fit anywhere in the process's virtual address space:
- the allocated object is humongous;
- the virtual address space is too fragmented; or
- the virtual address space is entirely allocated.
If your application is compiled as 32-bit, the process virtual address space is limited to 4GB.
Question: why would the serialization process take so much RAM?
I have not found any evidence why.

I realized that the serialization process was taking extra memory for its own housekeeping, and that was hitting the 3GB barrier. To stop the serialization process from taking that extra memory I disabled object tracking with BOOST_CLASS_TRACKING, and that removed the extra memory overhead.

Related

Track down heap corruption by allocating lots of memory?

In my program I am encountering the following error:
free(): invalid size
Aborted (core dumped)
Running GDB I find that this occurs in the destructor of a vector:
#0 0x00007ffff58e8c01 in free () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000555555dd44e2 in __gnu_cxx::new_allocator<int>::deallocate (this=0x7fffffff6bf0, __p=0x555557117810) at /usr/include/c++/7/ext/new_allocator.h:125
#2 0x0000555555dcfbd7 in std::allocator_traits<std::allocator<int> >::deallocate (__a=..., __p=0x555557117810, __n=1) at /usr/include/c++/7/bits/alloc_traits.h:462
#3 0x0000555555dc85e6 in std::_Vector_base<int, std::allocator<int> >::_M_deallocate (this=0x7fffffff6bf0, __p=0x555557117810, __n=1)
at /usr/include/c++/7/bits/stl_vector.h:180
#4 0x0000555555dc49e1 in std::_Vector_base<int, std::allocator<int> >::~_Vector_base (this=0x7fffffff6bf0, __in_chrg=<optimized out>)
at /usr/include/c++/7/bits/stl_vector.h:162
#5 0x0000555555dbc5c9 in std::vector<int, std::allocator<int> >::~vector (this=0x7fffffff6bf0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435
#6 0x0000555556338081 in Gambit::Printers::HDF5Printer2::get_buffer_idcodes[abi:cxx11](std::vector<Gambit::Printers::HDF5MasterBuffer*, std::allocator<Gambit::Printers::HDF5MasterBuffer*> > const&) (this=0x555556fd8820, masterbuffers=...) at /home/farmer/repos/gambit/copy3/Printers/src/printers/hdf5printer_v2/hdf5printer_v2.cpp:2183
where that last line of code is simply:
std::vector<int> alllens(myComm.Get_size());
So firstly, I don't quite get why the destructor is called here, but supposing it is a normal part of how the vector is dynamically constructed, I guess this error must be due to some sort of heap corruption.
I don't fully get it though: is the idea that some other part of the code has previously written illegally into the memory that is supposed to be allocated for this vector?
Second, I have tried running this through Intel Inspector, and I do get a bunch of "Invalid memory access" and "Uninitialized memory access" problems flagged, but they all look like false positives in libraries I am using, like HDF5.
Is there some in-code way of narrowing down where exactly the problem is coming from? E.g. since it gets triggered by a dynamic memory allocation, can I just start allocating huge arrays earlier and earlier in the code to try and trigger the crash closer to where it originates? I tried searching around for whether something like that would work or be helpful but didn't find anything about it, so maybe it is not a good idea?
So it turned out that I was corrupting the heap via some of the MPI routines, i.e. incorrect parameters for buffer lengths and so on. Unfortunately lots of crazy stuff goes on in the MPI libraries so memory analyzers like Intel Inspector weren't that useful in finding it.
However, I learned about AddressSanitizer (https://en.wikipedia.org/wiki/AddressSanitizer), which comes with modern GNU compilers, and it turned out to be great! I enabled it in my CMake project (from https://gist.github.com/jlblancoc/44be9d4d466f0a973b1f3808a8e56782):
cmake .. -DCMAKE_CXX_FLAGS="-fsanitize=address -fsanitize=leak -g" \
    -DCMAKE_C_FLAGS="-fsanitize=address -fsanitize=leak -g" \
    -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address -fsanitize=leak" \
    -DCMAKE_MODULE_LINKER_FLAGS="-fsanitize=address -fsanitize=leak"
Ran it with
export ASAN_OPTIONS=fast_unwind_on_malloc=0
(No idea if that was really necessary), and received a fantastic backtrace when my heap corruption occurred:
==12748==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000521340 at pc 0x7fda5011577a bp 0x7ffe231c55e0 sp 0x7ffe231c4d88
WRITE of size 32 at 0x602000521340 thread T0
#0 0x7fda50115779 (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79779)
#1 0x7fda4fcd84e3 (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0xf24e3)
#2 0x7fda4fc228d7 (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3c8d7)
#3 0x7fda4fc23a26 (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3da26)
#4 0x7fda4fc2316c (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3d16c)
#5 0x7fda4fc2406c in PMPI_Gather (/usr/lib/x86_64-linux-gnu/libmpich.so.0+0x3e06c)
#6 0x55e0c18586b0 in void Gambit::GMPI::Comm::Gather<unsigned long>(std::vector<unsigned long, std::allocator<unsigned long> >&, std::vector<unsigned long, std::allocator<unsigned long> >&, int) /home/farmer/repos/gambit/copy3/Utils/include/gambit/Utils/mpiwrapper.hpp:450
...etc...
Which pointed straight at the MPI call that I screwed up. Amazing!
But to answer my OP question, my idea of allocating lots of heap memory to trigger the crash closer to the problem wasn't really working. Not sure why. I guess I just don't understand what is going on under the hood there. In fact the place I was seeing the crash was before the MPI call in my code, so that was quite confusing. I guess the compiler moved some stuff around? I did have optimisations turned off, but I guess operations could still be ordered differently in the binary than I expect?

why is delete being called in stdc++ library when there is no delete nor free in the code flow?

I am having a problem debugging my code and am a bit confused by the gdb output, attached below. The last two frames, #13 and #14, are my code, but everything else is from the C++ library. What is confusing to me is that from about frame #7 upward it appears to be calling delete. This is initialization code, and there are no deletes nor frees being called in the code flow, but something is causing delete to be called somewhere in the C++ library.
This is on a Debian box with g++ 4.7.2.
Anybody have a clue that could help me along?
EDIT: thank you all for your help. I indeed think there is something else going on here. Since the intent of my code is to construct a string using several append() calls, I added a call to reserve() in the ctor for that string so it would be large enough to handle a few append() calls without having to get more space. This has apparently helped, because it is now harder for me to force the crash. But I do agree that the cause is probably elsewhere in my code. Again, thanks for all your help.
Program received signal SIGABRT, Aborted.
0xb7fe1424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7a9a941 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7a9dd72 in *__GI_abort () at abort.c:92
#3 0xb7ad6e15 in __libc_message (do_abort=2, fmt=0xb7baee70 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:189
#4 0xb7ae0f01 in malloc_printerr (action=<optimized out>, str=0x6 <Address 0x6 out of bounds>, ptr=0xb71117f0) at malloc.c:6283
#5 0xb7ae2768 in _int_free (av=<optimized out>, p=<optimized out>) at malloc.c:4795
#6 0xb7ae581d in *__GI___libc_free (mem=0xb71117f0) at malloc.c:3738
#7 0xb7f244bf in operator delete(void*) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#8 0xb7f8b48b in std::string::_Rep::_M_destroy(std::allocator<char> const&) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#9 0xb7f8b4d0 in ?? () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#10 0xb7f8c7a0 in std::string::reserve(unsigned int) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#11 0xb7f8caaa in std::string::append(char const*, unsigned int) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#12 0xb7f8cb76 in std::string::append(char const*) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#13 0x0804fa38 in MethodRequest::MethodRequest (this=0x80977a0) at cLogProxy.cpp:26
#14 0x0804fac0 in DebugMethodRequest::DebugMethodRequest (this=0x80977a0,
thanks,
-Andres
You are calling std::string::append, which ultimately results in delete being called. If we go through the steps involved in std::string::append, it might make more sense why delete gets called.
Say you have:
std::string s("abc");
s.append("def");
When you create s, memory has to be allocated to hold "abc". At the end of s.append("def");, there has to be enough memory associated with s to hold "abcdef". Steps to get there:
Get the length of s => 3.
Get the length of the input string "def" => 3.
Add them to figure out the length of the new string. => 6.
Allocate memory to hold the new string (only needed when the existing capacity is too small).
Copy "abc" to the newly allocated memory.
Append "def" to the newly allocated memory.
Associate the newly allocated memory with s.
Delete the old memory associated with s. (This is where delete comes into picture).
Something is doing string operations that result in deletes internally. It seems likely that something else is trashing memory.

mmap call takes too long (>100 seconds)

Currently we are seeing our processes take too long in the mmap call. Once the process reaches roughly ~2.8 GB, the mmap call takes up to 100 seconds, and the process is killed by its built-in heartbeat mechanism.
I would like to know whether anyone has seen this issue or knows why mmap would take more than 100 seconds when asked for memory. In all cases the stack trace looks the same, but the memory is allocated in different parts of the code.
Host and compiler info:
Host memory: 70 GB; OS: Red Hat 6.3; compiler: gcc 4.4.6; process memory limit (32-bit): 4 GB; no swap configured.
And when this happens the host still has 50GB of memory left.
Stack Trace:
#0 0x55575430 in __kernel_vsyscall ()
#1 0x560f9dd8 in mmap () from /lib/libc.so.6
#2 0x5608f2db in _int_malloc () from /lib/libc.so.6
#3 0x5608fb7e in malloc () from /lib/libc.so.6
#4 0x55fb509a in operator new(unsigned int) () from /usr/lib/libstdc++.so.6
#5 0x55f91ed6 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_S_create(unsigned int, unsigned int, std::allocator<char> const&) ()
from /usr/lib/libstdc++.so.6

Program receives SIGSEGV error after return 0

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b8bc26 in std::basic_filebuf<char, std::char_traits<char> >::_M_terminate_output() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) where
#0 0x00007ffff7b8bc26 in std::basic_filebuf<char, std::char_traits<char> >::_M_terminate_output() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00007ffff7b8c6a2 in std::basic_filebuf<char, std::char_traits<char>>::close() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007ffff7b8cb2a in std::basic_ofstream<char, std::char_traits<char> >::~basic_ofstream() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x0000000000403e02 in main (argc=2, argv=0x7fffffffe1c8)
at main.cpp:630
I am facing this error after program execution, after "return 0;" has been executed.
I have used vectors from the STL. This error is thrown only when the input file is very large (I have around 10000 nodes in the graph).
Also, I am not able to write output to a file; currently I have commented out that part.
Please help me with this issue.
I am using Ubuntu 12.10 64-bit.
Errors after returning from main can be caused by (at least):
dodgy atexit handlers; or
memory corruption of some description.
Of those two, the latter is more likely, so you should run your code under a dynamic memory-analysis tool like valgrind. Your description of large vectors triggering the problem also supports this contention.

Is there some kind of list with special (memory) addresses for Linux or gcc?

I've heard there are some special addresses (or at least some ranges of special addresses) used by Linux (or gcc, I don't know, and that is part of the question), but I can't find such a list, and I don't even know how to look for it.
(For example, in Visual Studio there is such a thing for uninitialized variables.)
This question was prompted by a more specific one (which doesn't deserve to be a separate question, so I'll ask it here): is 0x30303030 some special address or something?
Because I have a backtrace like:
#0 0x003fa527 in memset () from /lib/tls/libc.so.6
#1 0x4e5fffa0 in ?? ()
#2 0x00787d13 in std::num_put > >::_M_group_int () from /usr/lib/libstdc++.so.6
#3 0x0079a1e4 in std::operator, std::allocator > () from /usr/lib/libstdc++.so.6
#4 0x30303030 in ?? ()
#5 0x30303030 in ?? ()
...
#1483 0x30303030 in ?? ()
#1484 0x30303030 in ?? ()
Cannot access memory at address 0xb3927000
And this backtrace has 1400+ lines like "0x30303030 in ?? ()".
Does this mean something, or is it just a random memory address that looks like bottomless recursion? The problem is that I cannot reproduce it, so debugging or using valgrind is useless. :\
I know this is an awful question with no useful information, but I decided to give it a try.
In ASCII 0x30303030 is "0000", so it may be that something got overrun, or that there is a memory error somewhere.
Fill patterns like that are usually placed by the debugger or debug runtime to mark uninitialized pointers. The addresses themselves are irrelevant and are not special in any way. Such a thing may not exist under GCC; it depends on how they chose to write their debugger.
Seeing memset at the top of the backtrace, the odds are high that it didn't set exactly the memory area you expected. Perhaps a bit too much was set to '0'?