Huge memory leak when combining OpenMP, Intel MKL and MSVC compiler - c++

I am working on a fairly large C++ project with a high emphasis on performance. It therefore relies on the Intel MKL library and on OpenMP. I recently observed a considerable memory leak that I could narrow down to the following minimal example:
#include <atomic>
#include <iostream>
#include <thread>

class Foo {
public:
    Foo() : calculate(false) {}

    // Start the thread
    void start() {
        if (calculate) return;
        calculate = true;
        thread = std::thread(&Foo::loop, this);
    }

    // Stop the thread
    void stop() {
        if (!calculate) return;
        calculate = false;
        if (thread.joinable())
            thread.join();
    }

private:
    // Function containing the loop that is continually executed
    void loop() {
        while (calculate) {
            #pragma omp parallel
            {
            }
        }
    }

    std::atomic<bool> calculate;
    std::thread thread;
};

int main() {
    Foo foo;
    foo.start();
    foo.stop();
    foo.start();

    // Let the program run until the user inputs something
    int a;
    std::cin >> a;

    foo.stop();
    return 0;
}
When compiled with Visual Studio 2013 and executed, this code leaks up to 200 MB of memory per second (!).
By modifying the above code only a little, the leak disappears entirely. For instance:
If the program is not linked against the MKL library (which is obviously not needed here), there is no leak.
If I tell OpenMP to use only one thread (i.e. I set the environment variable OMP_NUM_THREADS to 1), there is no leak.
If I comment out the line #pragma omp parallel, there is no leak.
If I don't stop the thread and start it again with foo.stop() and foo.start(), there is no leak.
Am I doing something wrong here, or am I missing something?

MKL's parallel (default) driver is built against Intel's OpenMP runtime. MSVC compiles OpenMP applications against its own runtime, which is built around the Win32 ThreadPool API. The two most likely don't play nicely together. It is only safe to use the parallel MKL driver with OpenMP code built using the Intel C/C++/Fortran compilers.
It should be fine if you link your OpenMP code with the serial driver of MKL. That way, you may call MKL from multiple threads at the same time and get concurrent serial instances of MKL. Whether n concurrent serial MKL calls are slower than, comparable to, or faster than a single MKL call parallelised over n threads likely depends on the kind of computation and the hardware. A sketch of that pattern follows.
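For illustration, a minimal sketch of concurrent serial MKL calls from std::thread workers, assuming the sequential driver is on the link line (e.g. mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib); the matrix sizes are placeholders:
#include <mkl.h>
#include <thread>
#include <vector>

// Each call runs single-threaded inside MKL, so concurrent invocations
// from different application threads are safe with the sequential driver.
void multiply(const std::vector<double>& A, const std::vector<double>& B,
              std::vector<double>& C, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
}

int main() {
    const int n = 512;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C1(n * n), C2(n * n);

    std::thread t1(multiply, std::cref(A), std::cref(B), std::ref(C1), n);
    std::thread t2(multiply, std::cref(A), std::cref(B), std::ref(C2), n);
    t1.join();
    t2.join();
    return 0;
}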
Note that Microsoft no longer supports its own OpenMP runtime. MSVC's OpenMP support is stuck at version 2.0, which is more than a decade older than the current specification. There are probably bugs in the runtime (and there are bugs in the compiler's OpenMP support itself), and those are not likely to get fixed. They don't want you to use OpenMP and would rather you favour their own Parallel Patterns Library instead. But PPL is not portable to other platforms (e.g. Linux), so you should really be using Intel Threading Building Blocks (TBB), as in the sketch below. If you want quality OpenMP support under Windows, use the Intel compiler or one of the GCC ports. (I don't work for Intel.)
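For reference, a minimal sketch of the kind of portable parallel loop TBB provides (tbb::parallel_for over an index range; the vector and its size are placeholders):
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> v(1 << 20, 1.0);
    // Runs the lambda over [0, v.size()) on TBB's worker threads.
    tbb::parallel_for(std::size_t(0), v.size(), [&](std::size_t i) {
        v[i] *= 2.0;
    });
    return 0;
}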

Related

Does any C/C++ compiler support OpenMP's task affinity clauses yet?

I am experimenting with OpenMP tasks and want to write an application that runs on a two-socket NUMA CPU and uses OpenMP's task affinity clauses, which can be added to the task creation pragma. They provide a hint about where a task should be executed by naming a variable; the task should then be executed close to that variable's physical location.
An example from the OpenMP 5.0 documentation shows how it could be used:
void task_affinity(double *A, int N)
{
    double *B;

    #pragma omp task depend(out: B) shared(B) affinity(A[0:N])
    {
        B = alloc_init_B(A, N);
    }

    #pragma omp task depend(in: B) shared(B) affinity(A[0:N])
    {
        compute_on_B(B, N);
    }

    #pragma omp taskwait
}
The gcc compiler that I have (version 11.2.0), however, only provides a stub as of now, which, as I understand it, means that the functionality is not actually implemented yet.
Is there any compiler that has OpenMP's task affinities fully implemented yet?
Does the gcc implementation of OpenMP handle tasks in a way that they are assigned to threads that are physically close to the data on which they work even if no affinities are explicitly stated?
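For what it's worth, here is a small probe sketch (assuming an OpenMP 4.5+ runtime for omp_get_place_num(); whether the affinity clause is honoured is exactly what is in question) that prints the place each task ran on, e.g. under OMP_PLACES=sockets:
#include <omp.h>
#include <cstdio>

int main() {
    static double A[1024] = {0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < 8; ++t) {
            // If the affinity clause is only a stub, task placement
            // should show no correlation with A's location.
            #pragma omp task affinity(A[0:1024])
            {
                std::printf("task %d: thread %d, place %d\n",
                            t, omp_get_thread_num(), omp_get_place_num());
            }
        }
    }
    return 0;
}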

Memory issue with threads (tinythread, C++)

I am debugging a strange memory issue: when a multithreaded algorithm runs in a loop, its memory consumption increases with every iteration, although the heap checker of GooglePerformanceTools says there is no leak. Finally I have made a separate minimal program that reproduces the bug. It seems that the threads are the problem:
#include <stdio.h>
#include <iostream>
#include <vector>
#include "tinythread.h"

using namespace std;

int a(0);

void doNothingAtAll(void*)
{
    ++a;
}

void startAndJoin100()
{
    vector<tthread::thread*> vThreads;
    for (int i = 0; i < 100; ++i)
    {
        vThreads.push_back(new tthread::thread(doNothingAtAll, NULL));
    }
    while (!vThreads.empty())
    {
        tthread::thread* pThread(vThreads.back());
        pThread->join();
        delete pThread;
        vThreads.pop_back();
    }
}

int main()
{
    for (int i = 0; i < 10; ++i)
    {
        cout << "calling startAndJoin100()" << endl;
        startAndJoin100();
        cout << "all threads joined" << endl;
        cin.get();
    }
    return 0;
}
main() calls startAndJoin100() 10 times. It waits for a keystroke after each iteration so that one can record the memory consumption, which is (VIRT under Ubuntu 17.10, 64-bit): 2.1 GB, 4 GB, 5.9 GB, 7.8 GB, 9.6 GB, 11.5 GB, 13.4 GB, 15.3 GB, 17.2 GB, 19.0 GB.
Note: C++11 can't be used and the program must compile on Linux and Windows, thus tinythread is used. Minimal test code with Makefile:
geom.at/_downloads/testTinyThread.zip
I am answering my own question; this may be useful for somebody later:
Conclusion:
1) I'd really like to keep TinyThread because C++11 is unavailable (VS2008 and old Linux systems must be supported) and no additional library should have to be linked (TinyThread consists only of an *.h and a *.cpp file, while Boost and other solutions I know require linking a DLL).
2) Valgrind and the heap checker of the GooglePerformanceTools do not report memory leaks, and I have looked into the code - it seems to be correct, although the virtual memory consumption increases drastically in the minimal example posted above. It seems that the system does not re-use the previously assigned memory pages, and I have not found an explanation for this behavior. Thus I do not blame TinyThread++, but the problem disappears when pthreads are used directly instead.
3) The workaround: there is a C alternative called TinyCThread: https://tinycthread.github.io/ that also works for C++, and it does not cause the problems observed with TinyThread++; a sketch of the rewritten test follows.
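A minimal sketch of the test above rewritten against TinyCThread's C11-style API (thrd_create/thrd_join, following the API documented in the TinyCThread distribution):
#include <stdio.h>
#include "tinycthread.h"

static int counter = 0;

// Thread entry point: TinyCThread uses the C11 signature int(*)(void*).
static int doNothingAtAll(void* arg)
{
    (void)arg;
    ++counter;
    return 0;
}

static void startAndJoin100(void)
{
    thrd_t threads[100];
    for (int i = 0; i < 100; ++i)
        thrd_create(&threads[i], doNothingAtAll, NULL);
    for (int i = 0; i < 100; ++i)
        thrd_join(threads[i], NULL);
}

int main(void)
{
    for (int i = 0; i < 10; ++i)
    {
        printf("calling startAndJoin100()\n");
        startAndJoin100();
        printf("all threads joined\n");
        getchar();
    }
    return 0;
}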

Is it possible to run piece of pure C++ code in GPU

I don't know OpenCL very well, but I know its C/C++ API requires the programmer to provide OpenCL kernel code as a string. Lately, however, I discovered the ArrayFire library, which doesn't require string code to invoke some calculations. I wondered how that works (it is open source, but the code is a bit confusing). Would it be possible to write a parallel for with an OpenCL backend that invokes any piece of compiled (x86, for example) code, like the following:
template <typename F>
void parallel_for(int starts, int ends, F task) // API
{ /* some OpenCL magic */ }

// ...
parallel_for(0, 255, [&tab](int i){ tab[i] *= 0.7; }); // usage
PS: I know I am almost certainly being too optimistic.
You cannot really call C++ host code from the device using standard OpenCL.
You can use SYCL, the Khronos standard for single-source C++ programming. SYCL allows C++ to be compiled directly into device code without requiring the OpenCL strings. You can call any C++ function from inside a SYCL kernel (as long as the source code is available). SYCL.tech has more links and updated information.
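To illustrate, a minimal sketch of the question's parallel_for written against the SYCL 2020 API (e.g. as implemented by DPC++ or AdaptiveCpp); the array name is taken from the question, everything else is an assumption:
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;

    const int n = 256;
    float* tab = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) tab[i] = static_cast<float>(i);

    // The lambda body is plain C++; the SYCL compiler generates the device
    // code, so no OpenCL source string is involved.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        tab[i] *= 0.7f;
    }).wait();

    sycl::free(tab, q);
    return 0;
}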

Segmentation fault using FFTW

I have a pretty involved program that uses an in-house FFT algorithm. I recently decided to try using FFTW for a performance increase. As a simple test to ensure that FFTW would link and run, I added the following code to the beginning of the application; however, when I run it, I get a segmentation fault when I create the fftwf_plan:
const size_t size = 1024;
vector<complex<float> > data(size);
for (size_t i = 0; i < size; ++i)
    data[i] = complex<float>(i, -i);

fftwf_plan plan =
    fftwf_plan_dft_1d(size,
                      (fftwf_complex*)&data[0],
                      (fftwf_complex*)&data[0],
                      FFTW_FORWARD,
                      FFTW_ESTIMATE);
// ^ seg faults here ^

fftwf_execute(plan);
fftwf_destroy_plan(plan);
Any ideas what would be causing this?
Using FFTW 3.3. Tried 2 different compilers, g++ 4.1.1 and icc 11.1. Also, the core file shows nothing of significance:
Thread 1.1: Error at 0x00000000
Stack Trace: PC: 000000, FP=Hex Address
EDIT
I reconfigured FFTW to add debugging symbols, using the following commands:
setenv CFLAGS "-fPIC -g -O0"
configure --enable-shared --enable-float --enable-debug
make
make install
When the program has a segmentation fault, it is in a random location in the fftwf_plan_dft_1d() method; however, the stack trace always shows that it is in or below the function search, which is called by mkplan.
Apparently the issue stems from multi-threading. Even though the main functions in FFTW are thread-safe (e.g. fftwf_execute), the functions that create a plan are not. This doesn't fully explain why just running a test on startup failed; however, when I encapsulated the plan creation in mutex locks, the segmentation faults ceased.
The creation and destruction of plans must be single-threaded:
fftw_init_threads();

#pragma omp parallel for
for (i = 0; i < n; i++) {
    // Plan creation is not thread-safe, so serialize it.
    #pragma omp critical
    {
        plan = fftw_create_plan....
    }

    fftw_execute(plan); // or fftw_execute_dft() for multiple in/out FFT operations

    // Plan destruction is not thread-safe either.
    #pragma omp critical
    {
        fftw_destroy_plan(plan);
    }
}

fftw_cleanup_threads();
I'm 3 years late, but I've just stumbled upon a very similar problem, also when using multi-threading (--enable-openmp and fftw_plan_with_nthreads(omp_get_max_threads())). Mine seg faulted on fftw_destroy_plan(p).
It turned out that I didn't pay attention when restructuring my code, and I was calling fftw_cleanup_threads() before calling fftw_destroy_plan(p) ... silly, I know, but it had me chasing my tail for about an hour.
When using multi-threading, fftw_cleanup_threads() needs to be called after all fftw* functions, just as fftw_init_threads() needs to be called before any fftw* function.
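In summary, a minimal sketch of the required ordering for FFTW's multi-threaded API (link with -lfftw3_omp -lfftw3 -fopenmp; the transform size and FFTW_ESTIMATE are placeholders):
#include <fftw3.h>
#include <omp.h>

int main() {
    fftw_init_threads();                 // before any other fftw_* call
    fftw_plan_with_nthreads(omp_get_max_threads());

    const int n = 1024;
    fftw_complex* data = fftw_alloc_complex(n);
    fftw_plan p = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD, FFTW_ESTIMATE);

    fftw_execute(p);                     // thread-safe, unlike plan creation/destruction

    fftw_destroy_plan(p);                // destroy every plan first...
    fftw_free(data);
    fftw_cleanup_threads();              // ...then clean up, strictly last
    return 0;
}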

c++ threads - parallel processing

I was wondering how to execute two processes on a dual-core processor in C++.
I know threads (or multi-threading) is not a built-in feature of C++.
There is threading support in Qt, but I did not understand anything from their reference. :(
So, does anyone know a simple way for a beginner to do it? Cross-platform support (like Qt) would be very helpful, since I am on Linux.
Try the Multithreading in C++0x part 1: Starting Threads as a 101. If your compiler does not have C++0x support, then stay with Boost.Thread.
Take a look at Boost.Thread. This is cross-platform and a very good library to use in your C++ applications.
What specifically would you like to know?
The POSIX threads (pthreads) library is probably your best bet if you just need a simple threading library; it has implementations on both Windows and Linux.
A guide can be found e.g. here. A Win32 implementation of pthreads can be downloaded here.
Edit: Didn't see you were on Linux. In that case I'm not 100% sure, but I think the libraries are probably already bundled with your GCC installation. A minimal example follows.
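A minimal pthread sketch (assuming a POSIX system; compile with g++ example.cpp -pthread):
#include <pthread.h>
#include <stdio.h>

// Thread entry point: receives a pointer argument, returns a pointer result.
void* worker(void* arg)
{
    int id = *static_cast<int*>(arg);
    printf("hello from thread %d\n", id);
    return NULL;
}

int main()
{
    pthread_t threads[2];
    int ids[2] = {0, 1};

    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}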
I'd recommend using the Boost library Boost.Thread instead. It wraps the platform specifics of Win32 and POSIX and gives you a solid set of threading and synchronization objects. It's also in very heavy use, so finding help on any issues you encounter on SO and other sites is easy.
You can search for a free PDF book "C++-GUI-Programming-with-Qt-4-1st-ed.zip" and read Chapter 18 about Multi-threading in Qt.
Concurrent programming features supported by Qt include (but are not limited to) the following:
Mutex
Read Write Lock
Semaphore
Wait Condition
Thread Specific Storage
However, be aware of the following trade-offs with Qt:
Performance penalties vs. native threading libraries. POSIX threads (pthreads) have been native to Linux since kernel 2.4 and may not substitute for <process.h> in the W32API in all situations.
Inter-thread communication in Qt is implemented with SIGNAL and SLOT constructs. These are NOT part of the C++ language and are implemented as macros, which require the proprietary code generator provided by Qt to be fully compiled.
If you can live with the above limitations, just follow these recipes for using QThread:
#include <QtCore>
Derive your own class from QThread. You must implement a public function run() that returns void and contains the instructions to be executed.
Instantiate your own class and call start() to kick off a new thread.
Sample code:
#include <QtCore>

class MyThread : public QThread {
public:
    void run() {
        // do something
    }
};

int main(int argc, char** argv) {
    MyThread t1, t2;
    t1.start(); // default implementation from QThread::start() is fine
    t2.start(); // another thread
    t1.wait();  // wait for thread to finish
    t2.wait();
    return 0;
}
As an important note: since C++11, concurrent threading is available in the standard library:
#include <future>
#include <string>
#include <thread>

class Example
{
public: // must be public so main() can take pointers to these members
    auto DoStuff() -> std::string
    {
        return "Doing Stuff";
    }

    auto DoStuff2() -> std::string
    {
        return "Doing Stuff 2";
    }
};

int main()
{
    Example EO;

    std::string (Example::*func_pointer)();
    func_pointer = &Example::DoStuff;
    std::future<std::string> thread_one =
        std::async(std::launch::async, func_pointer, &EO); // launches upon declaration

    std::string (Example::*func_pointer_2)();
    func_pointer_2 = &Example::DoStuff2;
    std::future<std::string> thread_two =
        std::async(std::launch::deferred, func_pointer_2, &EO);
    thread_two.get(); // deferred: launches upon calling get()
}
Both std::async (with std::launch::async or std::launch::deferred) and std::thread are fully compatible with Qt, and in some cases may work better across different OS environments. A std::thread counterpart is sketched below.
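For comparison, a minimal std::thread version of the same member-function call (the class and function names mirror the example above):
#include <iostream>
#include <thread>

class Example {
public:
    void DoStuff() {
        std::cout << "Doing Stuff\n";
    }
};

int main() {
    Example EO;
    // Like std::async, std::thread takes a member-function pointer
    // followed by the object to invoke it on.
    std::thread worker(&Example::DoStuff, &EO);
    worker.join(); // unlike std::async, joining is the caller's responsibility
    return 0;
}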
For parallel processing, see this.