Boost thread memory usage on 64-bit Linux - C++

I have been using Boost threads on 32-bit Linux for some time and am very happy with their performance so far. Recently the project was moved to a 64-bit platform and we saw a huge increase in memory usage (from about 2.5 GB to 16-17 GB). I profiled and found that the Boost threads are the source of the huge allocation: each thread allocates about 10x what it did on 32-bit.
I profiled using Valgrind's Massif and confirmed the issue using only Boost threads in a separate test application. I also tried using std::thread instead, and it does not exhibit the large memory allocation.
Has anyone else seen this behaviour, and does anyone know what the problem is? Thanks.

There's no problem. This is virtual memory, and each 64-bit process can allocate terabytes of virtual memory on every modern operating system. It's basically free and there's no reason to care about how much of it is used.
It's basically just reserved space for thread stacks. You can reduce it, if you want, by changing the default stack size. But there's absolutely no reason to.

1. Per-thread stack size
Use pthread_attr_getstacksize to inspect it, and boost::thread::attributes (which wraps pthread_attr_setstacksize) to change it.
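For reference, a minimal sketch of setting the stack size through boost::thread::attributes (the 512 KiB value and the work function are just illustrative, not from the original question):
#include <boost/thread.hpp>
#include <iostream>

void work()
{
    std::cout << "running with a smaller stack" << std::endl;
}

int main()
{
    boost::thread::attributes attrs;
    attrs.set_stack_size(512 * 1024);   // request a 512 KiB stack instead of the platform default
    boost::thread t(attrs, work);
    t.join();
    return 0;
}
Link against boost_thread (and, on older Boost releases, boost_system).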
2. Per-thread pre-mmap in glibc's malloc
A gdb backtrace from a Boost.Thread example:
#0 0x000000000040ffe0 in boost::detail::get_once_per_thread_epoch() ()
#1 0x0000000000407c12 in void boost::call_once<void (*)()>(boost::once_flag&, void (*)()) [clone .constprop.120] ()
#2 0x00000000004082cf in thread_proxy ()
#3 0x000000000041120a in start_thread (arg=0x7ffff7ffd700) at pthread_create.c:308
#4 0x00000000004c5cf9 in clone ()
#5 0x0000000000000000 in ?? ()
You will discover data = malloc(sizeof(boost::uintmax_t)); in get_once_per_thread_epoch (boost_1_50_0/libs/thread/src/pthread/once.cpp).
Continuing (a second backtrace, this time from a plain pthread + malloc test):
#1 0x000000000041a0d3 in new_heap ()
#2 0x000000000041b045 in arena_get2.isra.5.part.6 ()
#3 0x000000000041ed13 in malloc ()
#4 0x0000000000401b1a in test () at pthread_malloc_8byte.cc:9
#5 0x0000000000402d3a in start_thread (arg=0x7ffff7ffd700) at pthread_create.c:308
#6 0x00000000004413d9 in clone ()
#7 0x0000000000000000 in ?? ()
In the new_heap function (glibc-2.15/malloc/arena.c), glibc pre-mmaps 64 MB per thread arena on a 64-bit OS. In other words, each thread reserves 64 MB + 8 MB (the default thread stack) = 72 MB of virtual memory.
glibc-2.15/ChangeLog.17:
2009-03-13 Ulrich Drepper <drepper@redhat.com>
* malloc/malloc.c: Implement PER_THREAD and ATOMIC_FASTBINS features.
* malloc/arena.c: Likewise.
* malloc/hooks.c: Likewise.
http://wuerping.github.io/blog/malloc_per_thread.html
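If the arena reservations themselves are a problem, glibc lets you cap the number of malloc arenas, either with the MALLOC_ARENA_MAX environment variable or programmatically via mallopt. A minimal, glibc-specific sketch (the cap of 2 and the worker loop are just example values, not from the original answer):
#include <malloc.h>       // glibc-specific: mallopt, M_ARENA_MAX
#include <pthread.h>
#include <cstdlib>
#include <cstdio>

void *worker(void *)
{
    void *p = malloc(64); // a thread's first malloc is what normally triggers a new 64 MB arena
    free(p);
    return nullptr;
}

int main()
{
    // Cap the number of malloc arenas before any threads are created;
    // equivalent to running the program with MALLOC_ARENA_MAX=2 in the environment.
    mallopt(M_ARENA_MAX, 2);

    pthread_t t[8];
    for (int i = 0; i < 8; ++i)
        pthread_create(&t[i], nullptr, worker, nullptr);
    for (int i = 0; i < 8; ++i)
        pthread_join(t[i], nullptr);
    return 0;
}
With the cap in place, threads share the limited set of arenas instead of each mapping its own 64 MB heap, at the cost of some extra lock contention inside malloc.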

Related

core dump on malloc_consolidate () from /lib64/libc.so.6 [duplicate]

I usually love well-explained questions and answers, but in this case I really can't give any more clues.
The question is: why is malloc() giving me SIGSEGV? The debug session below shows that the program never gets the chance to test the returned pointer against NULL and exit: the program dies INSIDE malloc!
I'm assuming my glibc malloc is just fine. I have a Debian wheezy system, updated, on an old Pentium (i386/i486 arch).
To be able to track it down, I generated a core dump. Let's follow it:
iguana$gdb xadreco core-20131207-150611.dump
Core was generated by `./xadreco'.
Program terminated with signal 11, Segmentation fault.
#0 0xb767fef5 in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt
#0 0xb767fef5 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xb76824bc in malloc () from /lib/i386-linux-gnu/libc.so.6
#2 0x080529c3 in enche_pmovi (cabeca=0xbfd40de0, pmovi=0x...) at xadreco.c:4519
#3 0x0804b93a in geramov (tabu=..., nmovi=0xbfd411f8) at xadreco.c:1473
#4 0x0804e7b7 in minimax (atual=..., deep=1, alfa=-105000, bet...) at xadreco.c:2778
#5 0x0804e9fa in minimax (atual=..., deep=0, alfa=-105000, bet...) at xadreco.c:2827
#6 0x0804de62 in compjoga (tabu=0xbfd41924) at xadreco.c:2508
#7 0x080490b5 in main (argc=1, argv=0xbfd41b24) at xadreco.c:604
(gdb) frame 2
#2 0x080529c3 in enche_pmovi (cabeca=0xbfd40de0, pmovi=0x ...) at xadreco.c:4519
4519 movimento *paux = (movimento *) malloc (sizeof (movimento));
(gdb) l
4516
4517 void enche_pmovi (movimento **cabeca, movimento **pmovi, int c0, int c1, int c2, int c3, int p, int r, int e, int f, int *nmovi)
4518 {
4519 movimento *paux = (movimento *) malloc (sizeof (movimento));
4520 if (paux == NULL)
4521 exit(1);
Of course I need to look at frame 2, the last frame on the stack related to my code. But line 4519 gives SIGSEGV! It never gets the chance to test, on line 4520, whether paux == NULL or not.
Here is "movimento" (abbreviated):
typedef struct smovimento
{
    int lance[4];             // move in integer notation
    int roque;                // etc. ...
    struct smovimento *prox;  // pointer to next
} movimento;
This program can use a LOT of memory, and I know memory is near its limit. But I thought malloc would handle it more gracefully when memory is not available.
Running free -h during execution, I can see free memory drop as low as 1 MB! That's okay: the old computer only has 96 MB, and about 50 MB is used by the OS.
I don't know where to start looking. Maybe check available memory BEFORE each malloc call? But that sounds like a waste of computing power, as malloc supposedly does that already. sizeof(movimento) is about 48 bytes. If I test before the call, at least I'll have some confirmation of the bug.
Any ideas, please share. Thanks.
Any crash inside malloc (or free) is an almost sure sign of heap corruption, which can come in many forms:
overflowing or underflowing a heap buffer
freeing something twice
freeing a non-heap pointer
writing to a freed block
etc.
These bugs are very hard to catch without tool support, because the crash often comes many thousands of instructions later, possibly after many more calls to malloc or free, in code that is in a completely different part of the program and very far from where the bug actually is.
The good news is that tools like Valgrind or AddressSanitizer usually point you straight at the problem.
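For illustration, here is a hypothetical minimal reproducer of this failure mode (the buffer sizes are arbitrary); the later malloc or free is where the crash tends to surface, but the bug is the memset:
#include <cstdlib>
#include <cstring>

int main()
{
    char *buf = static_cast<char *>(std::malloc(8));
    std::memset(buf, 'x', 16);                          // heap buffer overflow: writes past the 8-byte block
    char *other = static_cast<char *>(std::malloc(8));  // may crash here, much later, or not at all
    std::free(other);
    std::free(buf);
    return 0;
}
Compiling with g++ -g -fsanitize=address, or running the unmodified binary under valgrind, reports the overflow at the memset line instead of at whichever later allocation happens to trip over the damaged heap metadata.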

false positive "Conflicting load" with DRD?

Analyzing my C++ code with DRD (valgrind) finds a "Conflicting load", but I cannot see why. The code is as follows:
int* x;
int Nt = 2;
x = new int[Nt];

omp_set_num_threads(Nt);

#pragma omp parallel for
for (int i = 0; i < Nt; i++)
{
    x[i] = i;
}

for (int i = 0; i < Nt; i++)
{
    printf("%d\n", x[i]);
}
The program behaves well, but DRD sees an issue when the master thread prints out the value of x[1]. Apart from possible false sharing due to how the x array is allocated, I do not see why there should be any conflict, nor how to avoid it... Any insights, please?
EDIT: Here's the DRD output for the above code (line 47 corresponds to the printf statement):
==2369== Conflicting load by thread 1 at 0x06031034 size 4
==2369== at 0x4008AB: main (test.c:47)
==2369== Address 0x6031034 is at offset 4 from 0x6031030. Allocation context:
==2369== at 0x4C2DCC7: operator new[](unsigned long) (vg_replace_malloc.c:363)
==2369== by 0x400843: main (test.c:37)
==2369== Other segment start (thread 2)
==2369== at 0x4C31EB8: pthread_mutex_unlock (drd_pthread_intercepts.c:703)
==2369== by 0x4C2F00E: vgDrd_thread_wrapper (drd_pthread_intercepts.c:236)
==2369== by 0x5868D95: start_thread (in /lib64/libpthread-2.15.so)
==2369== by 0x5B6950C: clone (in /lib64/libc-2.15.so)
==2369== Other segment end (thread 2)
==2369== at 0x5446846: ??? (in /usr/lib64/gcc/x86_64-pc-linux-gnu/4.7.3/libgomp.so.1.0.0)
==2369== by 0x54450DD: ??? (in /usr/lib64/gcc/x86_64-pc-linux-gnu/4.7.3/libgomp.so.1.0.0)
==2369== by 0x4C2F014: vgDrd_thread_wrapper (drd_pthread_intercepts.c:355)
==2369== by 0x5868D95: start_thread (in /lib64/libpthread-2.15.so)
==2369== by 0x5B6950C: clone (in /lib64/libc-2.15.so)
The GNU OpenMP runtime (libgomp) implements OpenMP thread teams using a pool of threads. After they are created, the threads sit docked at a barrier, waiting to be awakened to perform a specific task. In GCC these tasks come in the form of outlined (the opposite of inlined) code segments, i.e. the code for the parallel region (or for an explicit OpenMP task) is extracted into a separate function, and that function is supplied to some of the waiting threads as a task to execute. The docking barrier is then lifted and the threads start executing the task. Once that is finished, the threads are docked again: they are not joined, but simply put on hold. Therefore, from DRD's perspective, the master thread, which executes the serial part of the code after the parallel region, is accessing without protection resources that might be written to by the other threads. This of course cannot happen, since the other threads are docked and waiting for a new task.
Such false positives are common with general-purpose tools like DRD that do not understand the specific semantics of OpenMP, and such tools are thus not well suited to analysing OpenMP programs. You should instead use a specialised tool, e.g. the free Thread Analyzer from Sun/Oracle Solaris Studio for Linux or the commercial Intel Inspector (the latter can be used for free with a licence for non-commercial development purposes). Both tools understand the specifics of OpenMP and won't present such situations as possible data races.
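To make the "docking" pattern concrete, here is a hypothetical sketch of the same structure written with an explicit pthread barrier. DRD understands pthread barriers, so this version should be reported as clean, whereas libgomp synchronises its thread pool with its own futex-based primitives that DRD cannot see, which is why the equivalent OpenMP code gets flagged:
#include <pthread.h>
#include <cstdio>

int x[2];
pthread_barrier_t bar;

void *worker(void *arg)
{
    long i = reinterpret_cast<long>(arg);
    x[i] = static_cast<int>(i);        // worker thread writes its slot
    pthread_barrier_wait(&bar);        // "docking" point: publishes the write to the master
    return nullptr;
}

int main()
{
    pthread_barrier_init(&bar, nullptr, 3);   // 2 workers + the master
    pthread_t t[2];
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], nullptr, worker, reinterpret_cast<void *>(i));
    pthread_barrier_wait(&bar);               // master waits until both writes are done
    for (int i = 0; i < 2; ++i)
        std::printf("%d\n", x[i]);            // ordered after the writes by the barrier
    for (long i = 0; i < 2; ++i)
        pthread_join(t[i], nullptr);
    pthread_barrier_destroy(&bar);
    return 0;
}
The Valgrind manual suggests that DRD can only follow libgomp's synchronisation if libgomp was built with --disable-linux-futex, which is rarely the case on stock distributions.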

c++ new operator takes lots of memory (67MB) via libstdc++

I have some issues with the new operator in libstdc++. I wrote a program in C++ and ran into some problems with its memory management.
After debugging with gdb to determine what is eating up my RAM, I got the following from info proc mappings:
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x400000 0x404000 0x4000 0 /home/sebastian/Developement/powerserverplus-svn/psp-job-distributor/Release/psp-job-distributor
0x604000 0x605000 0x1000 0x4000 /home/sebastian/Developement/powerserverplus-svn/psp-job-distributor/Release/psp-job-distributor
0x605000 0x626000 0x21000 0 [heap]
0x7ffff0000000 0x7ffff0021000 0x21000 0
0x7ffff0021000 0x7ffff4000000 0x3fdf000 0
0x7ffff6c7f000 0x7ffff6c80000 0x1000 0
0x7ffff6c80000 0x7ffff6c83000 0x3000 0
0x7ffff6c83000 0x7ffff6c84000 0x1000 0
0x7ffff6c84000 0x7ffff6c87000 0x3000 0
0x7ffff6c87000 0x7ffff6c88000 0x1000 0
0x7ffff6c88000 0x7ffff6c8b000 0x3000 0
0x7ffff6c8b000 0x7ffff6c8c000 0x1000 0
0x7ffff6c8c000 0x7ffff6c8f000 0x3000 0
0x7ffff6c8f000 0x7ffff6e0f000 0x180000 0 /lib/x86_64-linux-gnu/libc-2.13.so
0x7ffff6e0f000 0x7ffff700f000 0x200000 0x180000 /lib/x86_64-linux-gnu/libc-2.13.so
0x7ffff700f000 0x7ffff7013000 0x4000 0x180000 /lib/x86_64-linux-gnu/libc-2.13.so
0x7ffff7013000 0x7ffff7014000 0x1000 0x184000 /lib/x86_64-linux-gnu/libc-2.13.so
That's just a snippet of it. However, everything there is normal: some of it belongs to the code of the standard libraries, some of it is heap, and some of it is stack sections for threads I created.
But there is this one section whose allocation I could not figure out:
0x7ffff0000000 0x7ffff0021000 0x21000 0
0x7ffff0021000 0x7ffff4000000 0x3fdf000 0
These two sections are created at a seemingly random time. Over several hours of debugging there was no similarity in timing, nor any correlation with a particular thread being created. I set a hardware watchpoint with awatch *0x7ffff0000000 and gave it several more runs.
The two sections are created at nearly the same time, within the same code section of a non-debuggable function (gdb shows it on the stack as ?? () from /lib/x86_64-linux-gnu/libc.so.6). More exactly, here is a sample stack where it occurred:
#0 0x00007ffff6d091d5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6d0b2bd in calloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff7dee28f in _dl_allocate_tls () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff77c0484 in pthread_create@@GLIBC_2.2.5 () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007ffff79d670e in Thread::start (this=0x6077c0) at ../src/Thread.cpp:42
#5 0x000000000040193d in MultiThreadedServer<JobDistributionServer_Thread>::Main (this=0x7fffffffe170) at /home/sebastian/Developement/powerserverplus-svn/mtserversock/src/MultiThreadedServer.hpp:55
#6 0x0000000000401601 in main (argc=1, argv=0x7fffffffe298) at ../src/main.cpp:29
Another example would be here (from a different run):
#0 0x00007ffff6d091d5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff6d0bc2d in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff751607d in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x000000000040191b in MultiThreadedServer<JobDistributionServer_Thread>::Main (this=0x7fffffffe170) at /home/sebastian/Developement/powerserverplus-svn/mtserversock/src/MultiThreadedServer.hpp:53
#4 0x0000000000401601 in main (argc=1, argv=0x7fffffffe298) at ../src/main.cpp:29
The whole thing says that it occurs in the calloc called from the pthread library, or, in other situations, in the new operator or the malloc called from it. It doesn't matter which new it is: in several runs it occurred at nearly every new or thread creation in my code. The only "constant" is that it happens every time inside libc.so.6.
No matter at which point of the code,
no matter if used with malloc or calloc,
no matter after how much time the program ran,
no matter after how many threads have been created,
it is always that section: 0x7ffff0000000 - 0x7ffff4000000.
Every time the program runs, but every time at a different point in the program. I am really confused, because it allocates 67 MB of virtual space but does not use it.
When watching the variables created there, especially those created when malloc or calloc were called by libc, none of this space is used by them. They are created in a heap section that is far away from that address range (0x7ffff0000000 - 0x7ffff4000000).
Edit:
I checked the stack size of the parent process too and got a usage of 8388608 bytes, which is 0x800000 (~8 MB). To get these values I did:
/* Needs <pthread.h>, <sys/resource.h> and <stdio.h>. */
pthread_attr_t attr;
size_t stacksize;
struct rlimit rlim;

pthread_attr_init(&attr);
pthread_attr_getstacksize(&attr, &stacksize);
getrlimit(RLIMIT_STACK, &rlim);

/* Cast so the value is guaranteed to fit into a size_t variable. */
printf("Resource limit: %zu\n", (size_t) rlim.rlim_cur);
printf("Stacksize: %zu\n", stacksize);

pthread_attr_destroy(&attr);
Please help me with this; I am really confused.
It looks like it is allocating stack space for a thread.
The space will be used as you make function calls in the thread.
But really, what it is doing is none of your business; it is part of the internal implementation of pthread_create(), and it can do anything it likes in there.
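That said, if the reserved space per thread is genuinely a concern, the stack size requested for new threads can be lowered explicitly. A minimal sketch using the standard pthread attribute API (1 MiB and the worker function are just illustrative values):
#include <pthread.h>
#include <cstdio>

void *worker(void *)
{
    std::puts("running on a smaller stack");
    return nullptr;
}

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 1024 * 1024);   // 1 MiB instead of the usual 8 MiB default
    pthread_t t;
    pthread_create(&t, &attr, worker, nullptr);
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}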

boost serialization binary_oarchive crashes

First I populate a structure which is quite big and has interrelations, and then I serialize it to a binary archive. The size of that structure depends on what data I feed to the program. I see the program taking ~2 GB of memory to build the structure, which is expected and acceptable.
Then I start serializing the object, and I see the program eating RAM while serializing. RAM usage grows until it reaches nearly 100%, while swap usage is still 0 bytes.
And then the application crashes, with a bad_alloc exception thrown from new.
Why would the serialization process take so much RAM and time? And why would it crash while allocating memory when swap is empty? The backtrace is too long to be pasted in full.
#0 0xb7fe1424 in __kernel_vsyscall ()
#1 0xb7c6e941 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7c71e42 in abort () at abort.c:92
#3 0xb7e92055 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#4 0xb7e8ff35 in ?? () from /usr/lib/libstdc++.so.6
#5 0xb7e8ff72 in std::terminate() () from /usr/lib/libstdc++.so.6
#6 0xb7e900e1 in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0xb7e90677 in operator new(unsigned int) () from /usr/lib/libstdc++.so.6
#8 0xb7f00a9f in boost::archive::detail::basic_oarchive_impl::save_pointer(boost::archive::detail::basic_oarchive&, void const*, boost::archive::detail::basic_pointer_oserializer const*) () from /usr/lib/libboost_serialization.so.1.42.0
#9 0xb7effb42 in boost::archive::detail::basic_oarchive::save_pointer(void const*, boost::archive::detail::basic_pointer_oserializer const*) () from /usr/lib/libboost_serialization.so.1.42.0
#10 0x082d052c in void boost::archive::detail::save_pointer_type<boost::archive::binary_oarchive>::non_polymorphic::save<gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > > >(boost::archive::binary_oarchive&, gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > >&) ()
#11 0x082d0472 in void boost::archive::detail::save_pointer_type<boost::archive::binary_oarchive>::save<gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > > >(boost::archive::binary_oarchive&, gcl::NestedConnection<gcl::Section, gcl::NestedConnection<gcl::Paragraph, gcl::NestedConnection<gcl::Line, void> > > const&) ()
.......
#172 0x082a91d8 in boost::archive::detail::interface_oarchive<boost::archive::binary_oarchive>::operator<< <gcl::Collation const> (this=0xbfffe500, t=...) at /usr/include/boost/archive/detail/interface_oarchive.hpp:64
#173 0x082a6298 in boost::archive::detail::interface_oarchive<boost::archive::binary_oarchive>::operator&<gcl::Collation> (this=0xbfffe500, t=...) at /usr/include/boost/archive/detail/interface_oarchive.hpp:72
#174 0x0829bd63 in main (argc=4, argv=0xbffff3f4) at /home/neel/projects/app/main.cpp:93
The program works properly when smaller data is fed to it.
Using Linux 64-bit with a 32-bit PAE kernel, Boost 1.42.
The program was working without a crash a few revisions ago. I recently added some more bytes to the structures; maybe it was not reaching the end of RAM then, and now it is.
But why would new crash when there is enough swap? Why would the serialization process take so much RAM?
Question: why would it crash while allocating memory when swap is empty?
The allocated object is too big to fit anywhere in the virtual address space. Possible reasons:
the allocated object is humongous,
the virtual address space is too fragmented, or
the virtual address space is already fully allocated.
If your application is compiled as 32-bit, the process virtual address space is limited to 4 GB.
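A hypothetical sketch of that failure mode: in a 32-bit process (build with -m32 to see the effect quickly), new eventually throws std::bad_alloc once the virtual address space is exhausted, no matter how much swap is still free, because the failure is about addresses rather than physical memory:
#include <cstdio>
#include <new>
#include <vector>

int main()
{
    std::vector<char *> blocks;
    try
    {
        for (;;)
        {
            // Each allocation consumes address space even if the pages are never touched,
            // so a 32-bit process fails somewhere below the 4 GB mark regardless of swap.
            blocks.push_back(new char[64 * 1024 * 1024]);
        }
    }
    catch (const std::bad_alloc &)
    {
        std::printf("bad_alloc after %zu allocations of 64 MB\n", blocks.size());
    }
    for (char *p : blocks)
        delete[] p;
    return 0;
}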
Question: why would the serialization process take so much RAM?
I have not found any evidence why.
I realized that the serialization process was taking extra memory for its own housekeeping, and that this was hitting the 3 GB barrier. To stop the serialization process from taking extra memory, I disabled object tracking with BOOST_CLASS_TRACKING, and that removed the extra memory overhead.
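For reference, a minimal sketch of how tracking is disabled for a type (Record and records.bin are illustrative names, not from the original program). Note that track_never trades the tracking table for the ability to deduplicate repeated pointers, so whether it is appropriate depends on how objects are referenced in the archive:
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/tracking.hpp>
#include <fstream>

struct Record
{
    int value;

    template <class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/)
    {
        ar & value;
    }
};

// Tell Boost.Serialization not to keep an address-tracking table for Record objects.
BOOST_CLASS_TRACKING(Record, boost::serialization::track_never)

int main()
{
    std::ofstream ofs("records.bin", std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    for (int i = 0; i < 1000; ++i)
    {
        const Record r = { i };
        oa << r;
    }
    return 0;
}
Link against boost_serialization when building.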

Multi-threading and atomicity/mem leaks

So, I'm implementing a program with multiple threads (pthreads), and I am looking for help on a few points. I'm doing C++ on Linux. All of my other questions have been answered by Google so far, but there are still two that I have not found answers for.
Question 1: I am going to be doing a bit of file I/O and web-page fetching/processing within my threads. Is there any way to guarantee that what the threads do is atomic? I am going to let my program run for quite a while, more than likely, and it won't really have a predetermined ending point. I am going to catch the ctrl+c signal, and I want to do some cleanup afterwards and still have my program print out results, close files, etc.
I'm just wondering whether it is reasonable for the program to wait for the threads to complete, or whether I should just kill all the threads, close the files and exit. I just don't want my results to be skewed. Should I/can I just do a pthread_exit() in the signal-catching method?
Any other comments/ideas on this would be nice.
Question 2: Valgrind is saying that I have some possible memory leaks. Are these avoidable, or does this always happen with threading in C++? Below are two of the six or so messages that I get when checking with Valgrind.
I have been looking at a number of different websites, and one said that some possible memory leaks could be caused by sleeping a thread. This doesn't make sense to me; nevertheless, I am currently sleeping the threads to test the setup I have right now (I'm not actually doing any real I/O at the moment, just playing with threads).
==14072== 256 bytes in 1 blocks are still reachable in loss record 4 of 6
==14072== at 0x402732C: calloc (vg_replace_malloc.c:467)
==14072== by 0x400FDAC: _dl_check_map_versions (dl-version.c:300)
==14072== by 0x4012898: dl_open_worker (dl-open.c:269)
==14072== by 0x400E63E: _dl_catch_error (dl-error.c:178)
==14072== by 0x4172C51: do_dlopen (dl-libc.c:86)
==14072== by 0x4052D30: start_thread (pthread_create.c:304)
==14072== by 0x413A0CD: clone (clone.S:130)
==14072==
==14072== 630 bytes in 1 blocks are still reachable in loss record 5 of 6
==14072== at 0x402732C: calloc (vg_replace_malloc.c:467)
==14072== by 0x400A8AF: _dl_new_object (dl-object.c:77)
==14072== by 0x4006067: _dl_map_object_from_fd (dl-load.c:957)
==14072== by 0x4007EBC: _dl_map_object (dl-load.c:2250)
==14072== by 0x40124EF: dl_open_worker (dl-open.c:226)
==14072== by 0x400E63E: _dl_catch_error (dl-error.c:178)
==14072== by 0x4172C51: do_dlopen (dl-libc.c:86)
==14072== by 0x4052D30: start_thread (pthread_create.c:304)
==14072== by 0x413A0CD: clone (clone.S:130)
I am creating my threads with:
rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
(rc = return code). At the end of the entry point, I call pthread_exit().
Here's my take:
1. If you want your threads to exit gracefully (killing them while they hold open file or socket handles is never a good idea), have them loop on a termination flag:
while (!stop)
{
    // do one unit of work
}
Then, when you catch the ctrl-c, set the flag to true and join them. Make sure to declare stop as std::atomic<bool> so that all the threads see the updated value. This way they will finish their current batch of work and then exit gracefully the next time they check the condition (a fuller sketch follows after point 2).
2. I don't have enough information about your code to answer this.
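A minimal sketch of point 1 (the question uses pthreads, but std::thread is used here for brevity; g_stop, worker and the thread count are illustrative, not from the question):
#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<bool> g_stop(false);

extern "C" void on_sigint(int)
{
    g_stop.store(true);   // a lock-free atomic store is one of the few safe things in a handler
}

void worker(int id)
{
    while (!g_stop.load())
    {
        // ... do one batch of file/network work ...
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    std::printf("worker %d finished cleanly\n", id);
}

int main()
{
    std::signal(SIGINT, on_sigint);
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);
    for (std::thread &t : threads)
        t.join();          // wait for graceful completion, then flush results and close files
    std::puts("all workers joined; safe to write results and exit");
    return 0;
}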