parallel c++11 program random crashes

parallel c++11 program random crashes - c++

I have a problem which I could not solve for a long time now. Since, I don't have more Ideas I am happy for any suggestions.
The program is a physics simulation which works on a huge tree data structure with millions of dynamical allocated nodes which are constructed / reorganized / destructed many times in parallel throughout the simulation with allot of pointers involved. Also this might sound very error-prone I am almost sure that I am doing all this in a thread-save manner. The program uses only standard libs and classes plus Intel-MKL (blas / lapack optimized for Intel CPUs) for matrix operations.
My code is parallelized using c++11 threads. The program runs fine on my desktop, my laptop and on two different Intel clusters using up to 8 threads. Only on one cluster the code suffers from random crashes if I use more than 2 threads (it runs absolutely fine with one or two threads).
The crash reports are varying from case to case but are mostly connected to heap corruption (segmentation fault, corrupted double linked list, malloc assertions, ...). some times the program gets caught in an infinite loop as well. In very rear cases the data structure suddenly blows up and the program runs out of memory. Anyway, since the program runs fine on all other machines I doubt the problem is in my source code. Since the crashes occur randomly I found all back tracing information relatively useless.
The hardware of the problematic cluster is almost identical to another cluster on which the code runs fine on up to 8 threads (Intel Xeon E5-2630 CPUs). The libs / compilers / OS are all relatively up to date. Note that other open-MP parallelized programs are running fine on the same cluster.
(Linux version 3.11.10-21-default (geeko#buildhost) (gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux) ) #1 SMP Mon Jul 21 15:28:46 UTC 2014 (9a9565d))
I already tried the following approaches:
adding allot of assertions to assure that all my pointers are handled correctly
linking against tc-malloc instead of glibc-malloc/free
trying different compilers (g++, icpc, clang++) and compiler options (with / without compiler optimizations / debugging options)
using the working binary from another machine with statically linked libraries to
using open-MP instead of c++ threads
switching between serial / parallel MKL
using other blas / lapack libraries
Using valgrind is out of question, since the problem occurs randomly after 10 minutes up to several hours and valgrind gives me a slowdown factor of around 50 - 100 (Plus valgrind does not allow real concurrency). Nevertheless I ran the code in valgrind for several hours without problems.
Also, I can not see any problem with the resource limits:
RLIMIT_AS: 18446744073709551615
RLIMIT_CORE : 18446744073709551615
RLIMIT_CPU: 18446744073709551615
RLIMIT_DATA: 18446744073709551615
RLIMIT_FSIZE: 18446744073709551615
RLIMIT_LOCKS: 18446744073709551615
RLIMIT_MEMLOCK: 18446744073709551615
RLIMIT_MSGQUEUE: 819200
RLIMIT_NICE: 0
RLIMIT_NOFILE: 1024
RLIMIT_NPROC: 2066856
RLIMIT_RSS: 18446744073709551615
RLIMIT_RTPRIO: 0
RLIMIT_RTTIME: 18446744073709551615
RLIMIT_SIGPENDING: 2066856
RLIMIT_STACK : 18446744073709551615
RLIMIT_STACK : 18446744073709551615
I found out that for some reason the stack size per thread seems to be only 2mb, so I increased it using ulimit -s. Anyway stack size shouldn't be the problem.
Also the program should not have problem with allocatable memory on the heap, since the memory size is more than sufficient.
Does anyone have an Idea of what could go wrong here / where I should look at? Maybe I miss some environment variables I should check? I think the fact that the error occurs only if I use more than two threads and that the crash rate for more than two threads is independent of the number of threads could be a hint.
Thanks in advance.

Related

why my c++ program uses significant memory when doing tcmalloc heap-checker or heap-profile

CentOS Linux release 7.3.1611
gcc version 4.8.5 20150623
gperftool 2.4-8.el7
1.my c++ program which links -ltcmalloc works fine without HEAPCHECKER or HEAPPROFILE. The memory it used stabilizes at 5M~10M.
2.if I run the program using heap-checker with env HEAPCHECKER=NORMAL,the memory increases about 50M per hour until OOM killer.
3.if I use heap-profile with env HEAPPROFILE="./hp" HEAP_PROFILE_ALLOCATION_INTERVAL=100000000, the memory increases about 100M per 40 minutes and also OOM is triggered.However, when I use pprof to analysis the heap file,it shows the total memory is only 0.1MB where I expect 100M.
I know heap-checker and heap-profile will cause extra memory use because they need to record some other information to trace memory allocation,but I don't think it is the reason for my case.
I use heap-checker and heap-profile with another small program,which works quite good.
The biggest difference between these two program is, the faulty program uses coroutine, by which i mean function swapcontext, getcontext and makecontext.
My questions are:
Q1.Why does the heap file opened by pprof show total memory is 0.1M when I set HEAP_PROFILE_ALLOCATION_INTERVAL=100000000?
Q2.Is it possible that heap-checker or heap-profile do not cooperate well with those coroutine functions?

I assume that you're using stackful coroutines so you're making new stacks all the time. It may well be that the stacks are no longer fully destructed/freed when you're running the checkers so they in effect leak.

Increased system CPU time in non-threaded regions of OpenMP code

I am experiencing a very peculiar issue. I am parallelizing parts of a very large code (about 600MB of source code) using OpenMP, but I'm not getting the speedups I expected. I profiled many parts of the code using omp_get_wtime() to get reliable estimates, and I discovered the following: the OpenMP parts of my code do have better wall times when I use more threads, however when I turn the threads on the program has increased system cpu time usage in parts of the code where there are no omp pragmas at all (e.g., time spent in IPOPT). I used grep for pragma to make sure. Just to clarify, I know for sure from my profiling and code structure that the threaded parts and the parts where I get the slowdowns run in completely different times.
My code consists of two main parts. The first is the base code (about 25MB of code), which I compile using -fopenmp, and the second is third-party code (the remaining 575MB) which is not compiled with the OpenMP flag.
Can anyone think of a reason why something like this could be happening? I can't fathom it being resource contention since the slowdowns are in the non-threaded parts of the code.
I am running my code on a 4-core (2 physical cores) Intel-i7 4600U compiled using clang++-3.8 in Ubuntu 14.04, with -O3 optimizations.
Any ideas would be great!

Speed performance of a Qt program: Windows vs Linux

I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my chance here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few precisions:
The computation that I make the program do are just some brutal and stupid mathematical calculations, basically, it computes products and cosines in a loop that is called a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads.
The laptop has two cores #1.73Ghz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity, according to the small graphic, both cores are running 100%.
Then I thought the C++ compiler for Windows didn’t the use the optimization options (things like -O1 -O2) that the C++ compiler for Linux automatically did (in release build), but apparently it does.
I’m bothered that the app is so mush slower (2 to 4 times) on Windows, and it’s really weird. On the other hand I haven’t tried on other computers with Windows yet. Still, do you have any idea why the difference?
Additional info: some data…
Even though Windows seems to be using the two cores, I’m thinking this might have something to do with threads management, here’s why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2 cores laptop (#1.73Ghz) with Windows 7
PC1-linux = my 2 cores laptop (#1.73Ghz) with Ubuntu 12.04
PC2-linux = my 8 cores laptop (#2.20Ghz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores #~3Ghz, don't remember exactly) under Mac OS, speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically used when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in debug build.
Lifetime of threads etc: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what he is computing), they are terminated when the computation ends. Nothing too fancy. Every thread just does brutal computations (except one, actually, which makes a (not so) small"computation every 30ms, basically checking whether the error is small enough).
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation only uses 1 thread, that way we are pretty much comparing the performance on "pure C++ code". It turned out that now Windows was only slightly slower than Linux (something like 15%). So I guess that a small (but not unsignificant) part of the difference is intrinsic to the system, but the largest part is due to threads management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the compiler for Microsoft Visual Studio (MSVC), instead of MinGW. And suprise, the computation (with all the threads and everything) was now "only" 20% to 50% slower than Linux! I think I'm going to go ahead and be content with that. I noticed that weirdly though, the "pure C++" computation (with only one thread) was now even slower (than with MinGW), which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC except that it handles threads like a moron.
So, I’m thinking either there’s something I can do to make MinGW (ideally I’d rather use it than MSVC) handle threads better, or it just can’t. I would be amazed, how could it not be well known and documented ? Although I guess I should be careful about drawing conclusions too quickly, I’ve only compared things on one computer (for the moment).

Another option it could be: on linux qt are just loaded, this could happens i.e. if you use KDE, while in Windows library must be loaded so this slow down computation time. To check how much library loading waste your application you could write a dummy test with pure c++ code.

I have noticed exactly the same behavior on my PC.
I am running Windows 7(64bits), Ubuntu (64bits) and OSX (Lion 64bits) and my program compares 2 XML files (more than 60Mb each). It uses Multithreading too (2 threads) :
-Windows : 40sec
-Linux : 14sec (!!!)
-OSX : 22sec.
I use a personal class for threads (and not Qt one) which uses "pthread" on linux/OSX and "threads" on windows.
I use Qt/mingw compiler as I need the XML class from Qt.
I have found no way (for now) to have the 3 OS having similar performances... but I hope I will !
I think that another reason may be the memory : my program uses about 500Mb of RAM. So I think that Unix is managing it best because, in mono-thread, Windows is exactly 1.89 times slower and I don't think that Linux could be more than 2 times slower !

I have heard of one case where Windows was extremely slow with writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
So if you're writing the results of your computation to disk, you might want to check if you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace for this (Linux-only, but should give you a basic idea).

You might experience performance differences by how mutexes run on Windows and Linux.
Pure mutex code on windows can have a 15ms wait every time there is a contention for resource when locking. Better performing synchronization mechanism on Windows is Critical Sections. It doesn't experience the locking penalty that regular mutexes experience in most cases.
I have found that on Linux, regular mutexes perform the same as Critical Sections on Windows.

It's probably the memory allocator, try using jemalloc or tcmalloc from Google. Glibc's ptmalloc3 is significantly better than the old crusty allocator in MSVC's crt. The comparable option from Microsoft is the Concurrency CRT but you cannot simply drop it in as a replacement.

How to optimize large data manipulation in parallel

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using a AMD Phenom II X4 965 Black Edition, so with decent amount of different caches.
I've developed both ST and MT version of the functions to perform all the single operations and, not surprisingly, in the best case the MT version are 2x faster than the ST, even when using 4 cores.
Given I'm a fan of using 100% of available resources, I was pissed about the fact just 2x, I'd want 4x.
For this reason I've spent already quite a considerable amount of time with -pg and valgrind, using the cache simulator and callgraph. The program is working as expected and cores are sharing the input process data (i.e. operations to apply on data) and the cache misses are reported (as expected sic.) when the different threads load the data to be processed (millions of entities or rows if now you have an idea what I'm trying to do :-) ).
Eventually I've used different compilers, g++ and clang++, with -O3 both, and performance is identical.
My conclusion is that due to the large amount of data (GB of data) to process, given the fact the data has got to be loaded eventually in the CPU, this is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)

What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If compiling with GCC and if you want to help the processor cache prefetching, consider using with care and parsimony __builtin_prefetch (but using it too much or badly would decrease performance).

Windows, Linux and Memory Management

I'm pretty curious about how Windows and Linux does memory management with C++ programs.
The reason of this curiosity is because I've just made 3 very simple programs in C++ portable between Linux and Windows. The code was exactly the same. The hardware too. But the results were incredibly different! Both tests were repeated 10 times and then the arithmetic mean was calculated.
I've tested sequential insertions on a static array of integers, on the class vector and at a stack (with pointers). The total number of insertions was 10^6.
Windows XP SP2 x86 results:
Static array of integers: 56 ms
Class vector: 686 ms
Stack (with pointers): 2193 ms
Slackware 11 x86 results:
Static array of integers: 100 ms
Class vector: 476 ms
Stack (with pointer): 505 ms
The speed difference between the stack insertion time on Windows and Slax is impressive. Does these results seem normal? Both codes were compiled using G++ (mingw32-g++ on Windows).
The computer used was a Dual Core 3.2Ghz with 4GB RAM and when the tests were made, there were more than 2GB of free RAM.

This may have more to do with the C++ stdlib implementation (Microsoft MSVC vs. GNU libc/stdc++) than memory management...
C++ sets the spec, but implementations vary greatly.
Update: And as I now notice that you used g++ on both platforms - well, it's still a different implementation. And the GNU code was developed for a Unix environment, and was ported to Windows later - so it may make assumptions and optimizations that are incorrect for a Windows environment.
But your original question is valid. It may have something to do with the underlying memory model of the operating system - but your tests here are too coarse to draw any conclusions.
If you run more tests, or try using different flags / compilers, please post some updated stats, they would be informative.

Two seconds for a million insertions is a little too high. Have you enabled optimizations? Without them, those numbers mean nothing.
Edit: Since you compiled without optimizations, enable them (use -O2 for example) and measure again. The difference will probably be much smaller. The standard libraries tend to be quite defensive and perform a lot of consistency checking, which can skew the measurement a lot.
Edit: If it still takes over 2 seconds even with optimizations enabled, there something else going on. Try posting some code. The following program runs about 40 milliseconds on my system under cygwin.
#include <stack>
int main()
{
std::stack<int> s;
for (int i = 0; i < 1000000; ++i)
s.push(i);
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js