Multithreaded MKL + OpenMP compiled with GCC - c++

My understanding, from reading the Intel MKL documentation and posts such as this--
Calling multithreaded MKL in from openmp parallel region --
is that building OpenMP parallelization into your own code AND MKL internal OpenMP for MKL functions such as DGESVD or DPOTRF is impossible unless building with the Intel compiler. For example, I have a large linear system I'd like to solve using MKL, but I'd also like to take advantage of parallelization to build the system matrix (my own code independent of MKL), in the same binary executable.
Intel states in the MKL documentation that 3rd party compilers "may have to disable multithreading" for MKL functions. So the options are:
openmp parallelization of your own code (standard #pragma omp ... etc) and single-thread calls to MKL
multi-thread calls to MKL functions ONLY, and single-threaded code everywhere else
use the Intel compiler (I would like to use gcc, so not an option for me)
parallelize both your code and MKL with Intel TBB? (not sure if this would work)
Of course, MKL ships with its own OpenMP build, libiomp*, which gcc can link against. Is it possible to use this library to achieve parallelization of your own code in addition to MKL functions? I assume some direct management of threads would be involved. However, as far as I can tell there are no iomp dev headers included with MKL, which may answer that question (--> NO).
So it seems at this point like the only answer is Intel TBB (Threading Building Blocks). Just wondering if I'm missing something or if there's a clever workaround.
(Edit:) Another solution might be if MKL has an interface to accept custom C++11 lambda functions or other arbitrary code (e.g., containing nested for loops) for parallelization via whatever internal threading scheme is being used. So far I haven't seen anything like this.

Intel TBB will also enable better nested parallelism, which might help in some cases. If you want to use GNU OpenMP with MKL, there are the following options:
Dynamically Selecting the Interface and Threading Layer. Link against the mkl_rt library and then either
set the environment variable MKL_THREADING_LAYER=GNU prior to loading MKL,
or call mkl_set_threading_layer(MKL_THREADING_GNU);
Linking with Threading Libraries directly (though the link does not mention GNU OpenMP explicitly): link against mkl_gnu_thread. This is not recommended when you are building a library, a plug-in, or an extension module (e.g. a Python package), which can be mixed with other components that might use MKL differently.
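The mkl_set_threading_layer route could look roughly like the sketch below. It assumes a GCC build with -fopenmp that links against mkl_rt; the helper names assemble_system and solve_spd are hypothetical, and the MKL calls are guarded so that the assembly part still compiles on a machine where mkl.h is not on the include path.

```cpp
#include <cstddef>
#include <vector>

// Assumption: mkl.h is on the include path when MKL is installed.
#if defined(__has_include)
#  if __has_include(<mkl.h>)
#    include <mkl.h>
#    define HAVE_MKL 1
#  endif
#endif

// Your own code: fill an n x n symmetric positive definite matrix
// (row-major) in parallel with GNU OpenMP.
// Diagonal entries are n + 1, off-diagonal entries are 1.
std::vector<double> assemble_system(int n) {
    std::vector<double> a(static_cast<std::size_t>(n) * n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[static_cast<std::size_t>(i) * n + j] = (i == j) ? n + 1.0 : 1.0;
    return a;
}

// Solve A x = b in place (b is overwritten with x).
// Returns the LAPACK info code, or -1 if built without MKL.
int solve_spd(std::vector<double>& a, std::vector<double>& b, int n) {
#ifdef HAVE_MKL
    // Ask MKL to thread via GNU OpenMP, the same runtime as the loop above.
    // This should happen before any other MKL call in the process.
    mkl_set_threading_layer(MKL_THREADING_GNU);
    return static_cast<int>(
        LAPACKE_dposv(LAPACK_ROW_MAJOR, 'U', n, 1, a.data(), n, b.data(), 1));
#else
    (void)a; (void)b; (void)n;
    return -1;
#endif
}
```

With this layout both the assembly loop and the MKL solve share one OpenMP runtime, so the usual OMP_NUM_THREADS setting applies to both.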


Using OpenBLAS with GSL

I compiled GSL and OpenBLAS from source with all default options in both cases. My GSL libraries are installed in /usr/local/lib and OpenBLAS in /opt/OpenBLAS/lib. How do I use OpenBLAS with GSL in C++?
The main reason I am doing this is because OpenBLAS utilizes all cores which Atlas does not in default configuration. My main aim is to multiply two large matrices (10000 x 10000) and perform 2D convolutions. Is there a better alternative to OpenBLAS or GSL for this?
I am using:
Linux Mint 17.2
GCC version 4.8.4
20 Core Intel CPU
I have been experimenting with the same thing in Octave with OpenBLAS. Will I get a significant performance improvement by using C++?
I would use an existing linear algebra library like Armadillo. AFAIK it wraps your BLAS implementation for matrix multiplications. I like it because it provides you with a syntax very similar to the one in Matlab or Octave.
Other linear algebra libraries like Eigen will also do the job.
But I do not expect them to perform (much) better than Octave or Matlab as long as the call to the underlying library remains the same. Also check out why Matlab is so fast and how Armadillo is parallelized.

What is the difference (if there is any) between BLAS/LAPACK in Mac OS and the original BLAS/LAPACK?

I recently switched to Mac OS from Linux. I need BLAS and LAPACK to do some computation. By checking the Wikipedia article on BLAS, I learnt that these two libraries have been implemented in Mac OS. However, it is said that
Apple's framework for Mac OS X and iOS, which includes tuned versions of BLAS and LAPACK.
(https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms)
So, what is the difference between the Mac BLAS/LAPACK and the original ones? (I checked the references cited in the Wikipedia article, but I didn't find information about the difference.)
There are many implementations of BLAS and LAPACK; there is no single Linux version either. There is a specification of the interface of the routines and a reference implementation on netlib.org, but that reference implementation is not focused on maximum performance on a particular platform (CPU, OS...).
So, what the Mac library does is that it has its own code for doing some of the computation, which is probably faster than the reference one. It may be programmed in a different programming language than the reference Fortran code.
There are many other implementations you could try: Intel MKL, ATLAS, GotoBLAS, OpenBLAS, Sun Performance Library, Cray Scientific Libraries and others. Often they are written in C or in the assembly language for a particular CPU.
The most important feature that Apple advertises for its Accelerate framework and vecLib (which contain the BLAS) is the optimized usage of vector (SIMD) instructions. That is what all the other versions also try to do. For the actual differences from other implementations it would be necessary to study the source code, which is often unavailable (at least for the commercial libraries).

MinGW vs MinGW-W64 vs MSVC (VC++) in cross compiling

Let's put it like this: we are going to create a library that needs to be cross-platform, and we chose GCC as the compiler. It works awesomely on Linux, we need to compile it on Windows, and we have MinGW to do the work.
MinGW tries to implement a native way to compile C++ on Windows, but it doesn't support some features like mutexes and threads.
We also have MinGW-W64, a fork of MinGW that supports those features, and I was wondering which one to use, knowing that GCC is one of the most used C++ compilers. Or is it better to use MSVC (VC++) on Windows and GCC on Linux, and use CMake to handle the compiler differences?
Thanks in advance.
Personally, I prefer a MinGW based solution that cross compiles on Linux, because there are lots of platform independent libraries that are nearly impossible (or a huge PITA) to build on Windows. (For example, those that use ./configure scripts to setup their build environment.) But cross compiling all those libraries and their dependencies is also annoying even on Linux, if you have to ./configure and make each of them yourself. That's where MXE comes in.
From the comments, you seem to worry about dependencies. They are costly in terms of build environment setup when cross compiling, if you have to cross compile each library individually. But there is MXE. It builds a cross compiler and a large selection of platform independent libraries (like boost, QT, and lots of less notable libraries). With MXE, boost becomes a lot more attractive as a solution. I've used MXE to build a project that depends on Qt, boost, and libexiv2 with nearly no trouble.
Boost threads with MXE
To do this, first install mxe:
git clone -b master https://github.com/mxe/mxe.git
Then build the packages you want (gcc and boost):
make gcc boost
C++11 threads with MXE
If you would still prefer C++11 threads, then that too is possible with MXE, but it requires a two stage compilation of gcc.
First, checkout the master (development) branch of mxe (this is the normal way to install it):
git clone -b master https://github.com/mxe/mxe.git
Then build gcc and winpthreads without modification:
make gcc winpthreads
Now, edit mxe/src/gcc.mk. Find the line that starts with $(PKG)_DEPS := and add winpthreads to the end of the line. And find --enable-threads=win32 and replace it with --enable-threads=posix.
Now, recompile gcc and enjoy your C++11 threads.
make gcc
Note: You have to do this because the default configuration supports Win32 threads using the WINAPI instead of posix pthreads. But GCC's libstdc++, the library that implements std::thread and std::mutex, doesn't have code to use WINAPI threads, so they add a preprocessor block that strips std::thread and std::mutex from the library when Win32 threads are enabled. By using --enable-threads=posix and the winpthreads library, instead of having GCC try to interface with Win32 in its libraries, which it doesn't fully support, we let winpthreads act as glue code that presents a normal pthreads interface for GCC to use and uses the WINAPI functions to implement the pthreads library.
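A quick way to check the rebuilt toolchain is a small program like the sketch below, which only compiles if libstdc++ actually provides std::thread and std::mutex. The function name count_with_threads is hypothetical, made up for this illustration.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Spawn `workers` threads; each one increments a shared counter
// `iterations` times while holding a mutex.
int count_with_threads(int workers, int iterations) {
    int counter = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&counter, &m, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++counter;
            }
        });
    }
    // Wait for every worker before reading the result.
    for (std::thread& t : pool) t.join();
    return counter;
}
```

Compile it with -std=c++11 using the cross g++ that MXE built. If the compiler was configured with Win32 threads, the build should fail on the std::thread and std::mutex types rather than at run time.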
Final note
You can speed up these builds by adding -jm and JOBS=n to the make command. -jm, where m is a number, builds m packages concurrently. JOBS=n, where n is a number, uses n processes to build each package. The two multiply, so pick m and n so that m*n is not much more than the number of processor cores you have. E.g. if you have 8 cores, then m=3, n=4 would be about right.
Citations
http://blog.worldofcoding.com/2014_05_01_archive.html#windows
If you want portability, use the standard way: the <thread> library of C++11.
If you can't use C++11, pthreads can be a solution, although VC++ cannot compile it.
Do you want to avoid both of these? Then just write your own threading abstraction layer. For example, you can write a class Thread like this:
class Thread
{
public:
    explicit Thread(int (*pf)(void *arg));
    void run(void *arg);
    int join();
    void detach();
    ...
Then, write an implementation for each platform you want to support. For example,
+src
|---thread.h
|--+win
|--|---thread.cpp
|--+linux
|--|---thread.cpp
After that, configure your build script to compile win/thread.cpp on Windows and linux/thread.cpp on Linux.
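As an illustration, the Linux side of such a layer could be sketched on top of pthreads as follows. This is only one possible shape: the private members, the trampoline helper, and the sample worker sum_to are assumptions added here, and the declaration and definition are merged into one listing for brevity.

```cpp
#include <pthread.h>

class Thread
{
public:
    explicit Thread(int (*pf)(void *arg))
        : pf_(pf), arg_(0), result_(0), started_(false) {}

    // Start the thread and pass `arg` to the user function.
    void run(void *arg) {
        arg_ = arg;
        started_ = (pthread_create(&handle_, 0, &Thread::trampoline, this) == 0);
    }

    // Wait for the thread and return the user function's result.
    int join() {
        if (started_) { pthread_join(handle_, 0); started_ = false; }
        return result_;
    }

    // Let the thread run on its own. The Thread object must outlive it,
    // because the trampoline still writes into this object.
    void detach() {
        if (started_) { pthread_detach(handle_); started_ = false; }
    }

private:
    // pthreads expects void *(*)(void *), so adapt the int-returning function.
    static void *trampoline(void *self) {
        Thread *t = static_cast<Thread *>(self);
        t->result_ = t->pf_(t->arg_);
        return 0;
    }

    int (*pf_)(void *);
    void *arg_;
    int result_;
    bool started_;
    pthread_t handle_;
};

// Hypothetical worker: sums 1..n, where arg points to an int n.
int sum_to(void *arg) {
    int n = *static_cast<int *>(arg);
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
```

A win/thread.cpp would keep the same public interface and swap the pthread calls for their WINAPI counterparts.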
You should definitely use Boost. It's really great and does all things.
Seriously, unless you need something that Boost.Thread doesn't support (such as std::async), take a look at the Boost library. Of course it's an extra dependency, but if that doesn't scare you, you will enjoy all the advantages of Boost, such as cross-compiling.
Learn about differences between Boost.Thread and the C++11 threads here.
I think this is a fairly generic list of considerations when you need to choose multi-platform tools or sets of tools, for a lot of these you probably already have an answer;
Tool support, who and how are you going to get support from if something doesn't work; how strong is the community and the vendor?
Native target support, how well does the tool understand the target platform?
Optimization potential?
Library support (now and in the intermediate future)?
Platform SDK support, if needed?
Build tools (although not directly asked here, do they work on both platforms; most of the popular ones do).
One thing I see that seems to not really have been dealt with is;
What is the target application expecting?
You mention you are building a library, so what application is going to use it and what does that application expect.
The constraint here is that the target application dictates the most fundamental aspect of the system, the very tool used to build it. How is the application going to use the library;
What API and what kind of API is needed by or for that application?
What kind of API do you want to offer (C-style, vs. C++ classes, or a combination)?
What runtime is it using, will it be the same, or will there be conflicts?
Given these, and the possible fact that the target application may still be unknown, maintain as much flexibility as possible. In this case, endeavour to maintain compatibility with gcc, mingw-w64 and msvc. They all offer a broad spectrum of C++11 language support (true, some more than others) and are generally supported by other popular libraries (even if these other libraries are not needed right now).
I thought the comment by Hans Passant...
Do what works first
... really does apply here.
Since you mentioned it: the mingw-builds for mingw-w64 support threads etc. with the posix build on Windows, both 64 bit and 32 bit.

Is TBB pre-Enabled in Opencv-2.4.5?

I have posted a question in the OpenCV answers group regarding the performance of TBB. And this is the link.
The answer in this link states as below.
Probably you used the 2.4.5 library with and without TBB to compare,
however, since OpenCV 2.4.3 multithreaded support functionality has
been included in the source code, not needing to build openCV with the
TBB support anymore. It is done automatically where necessary and the
included dll's are contained in the source where needed.
But I saw a performance change in the HOG descriptor. I used peopledetect.cpp from the samples and compiled it against OpenCV 2.4.5 built with TBB and without TBB. OpenCV 2.4.5 compiled with TBB runs about 2x faster, whereas OpenCV 2.4.5 without TBB is very slow.
Can someone please confirm the point below, as I couldn't find any reliable sources?
1) Since OpenCV 2.4.3, do we no longer need to rebuild OpenCV with TBB ON?
The prebuilt binaries have been compiled with the Visual Studio Concurrency framework since 2.4.3. However, not every algorithm uses the "new" parallel interface, where you can switch from Concurrency to IPP to TBB. Before, it was AFAIK hardcoded to use either TBB or nothing.
So the problem is that not every algorithm has been converted to the new parallel way, thus you can still get speedups from TBB in some cases. (IIRC one example is the BruteForceMatcher, which uses only one core with the prebuilt libs.)

how to use lapack under windows

I want to use LAPACK and write a C++ matrix wrapper for it, but LAPACK is written in Fortran. There is CLAPACK, but I want to build it from source: first compile the *.f and *.cpp files to object files, then link them into an application.
These are the applications and sources that I have:
Visual Studio Professional edition, Dev-C++, Ultimate++, MinGW, whatever
g95 and gfortran (under mingw) compiler
lapack (latest source)
blas (included in lapack)
How can I build an application from these? Please help...
My operating system is Windows 7, my CPU is a Core 2 Duo, and I don't have the Intel Math Kernel Library.
You can use the official C bindings for LAPACK, and then build your C++ wrapper around that. That avoids the issue of having to worry about the Fortran calling conventions, and the C bindings are a bit friendlier for C/C++ programmers than calling the Fortran routines directly.
Additionally, you could use one of the C++ matrix libraries that are already available, instead of rolling your own. I recommend Eigen.
PS.: Eigen matrices/vectors have a data() member that allows calling LAPACK without having to make temporary copies.
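If you do roll your own wrapper, the main thing it has to get right is the storage layout: the Fortran routines expect column-major data plus a leading dimension. A minimal sketch, with a hypothetical class name Matrix and accessor ld() chosen for this example, might look like this:

```cpp
#include <cstddef>
#include <vector>

// Minimal dense matrix stored column-major, the layout Fortran LAPACK expects.
class Matrix {
public:
    Matrix(int rows, int cols)
        : rows_(rows), cols_(cols),
          data_(static_cast<std::size_t>(rows) * cols, 0.0) {}

    // Element (i, j), zero-based; each column is contiguous in memory.
    double& operator()(int i, int j) { return data_[index(i, j)]; }
    double operator()(int i, int j) const { return data_[index(i, j)]; }

    int rows() const { return rows_; }
    int cols() const { return cols_; }
    int ld() const { return rows_; }  // leading dimension (LDA in LAPACK)

    // Raw pointer that can be handed to a LAPACK routine without copying.
    double* data() { return data_.data(); }
    const double* data() const { return data_.data(); }

private:
    std::size_t index(int i, int j) const {
        return static_cast<std::size_t>(j) * rows_ + i;
    }
    int rows_, cols_;
    std::vector<double> data_;
};

// With the LAPACKE C bindings, a solve would then look like this
// (shown as a comment only, since it needs the LAPACK library to link):
//   LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs,
//                 A.data(), A.ld(), ipiv, B.data(), B.ld());
```

The data() accessor plays the same role as the Eigen member mentioned above: the wrapper owns the memory, and LAPACK works on it in place.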