I have a working serial code and a working single-GPU parallel code parallelized via OpenACC. Now I am trying to increase the parallelism by running on multiple GPUs, employing the MPI+OpenACC paradigm. I wrote my code in Fortran 90 and compile it using the nvfortran compiler from NVIDIA's HPC SDK.
I have a few beginner-level questions:
How do I set up my compiler environment to start writing my MPI+OpenACC code? Are there any extra requirements other than NVIDIA's HPC SDK?
Assuming I have a code written under the MPI+OpenACC setup, how exactly do I compile it? Do I have to compile it twice, once for the CPUs (mpif90) and once for the GPUs (OpenACC)? An example of a makefile or some compilation commands would be helpful.
When communication between GPU-device-1 and GPU-device-2 is needed, is there a way to communicate directly between them, or should I be communicating via [GPU-device-1] ---> [CPU-host-1] ---> [CPU-host-2] ---> [GPU-device-2]?
Are there any sample Fortran codes with an MPI+OpenACC implementation?
As Vladimir F pointed out, your question is very broad, so if you have further questions about specific points you should consider posting each one individually. That said, I'll try to answer each.
If you install the NVIDIA HPC SDK, you should have everything you need. It includes an installation of OpenMPI plus NVIDIA's HPC compilers for OpenACC. You'll also have a variety of math libraries, if you need those too.
Compile using mpif90 for everything. For instance, mpif90 -acc=gpu will build files containing OpenACC with GPU support, while files that don't include OpenACC compile normally. The MPI module is found automatically during compilation and the MPI libraries are linked in.
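To make that concrete, here is a minimal makefile sketch; the source file names and the -gpu=cc80 compute capability are illustrative, not taken from your code (and recipe lines must be indented with a tab):

FC      = mpif90
FCFLAGS = -O2 -acc=gpu -gpu=cc80 -Minfo=accel

solver: main.o halo.o
	$(FC) $(FCFLAGS) -o $@ $^

%.o: %.f90
	$(FC) $(FCFLAGS) -c $<

So, no, you don't compile twice; the single mpif90 invocation covers both the CPU (MPI) and GPU (OpenACC) sides.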
You can use the acc host_data use_device directive to pass the device copy of your data to MPI. I don't have a Fortran example with MPI, but it looks similar to the call in this file: https://github.com/jefflarkin/openacc-interoperability/blob/master/openacc_cublas.f90#L19
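To make the directive concrete, here is a rough sketch of the pattern. The array a, its halo layout, and the neighbor ranks left and right are invented for illustration; a CUDA-aware MPI (such as the OpenMPI bundled with the HPC SDK) is assumed, with a already on the GPU from an enclosing data region:

! illustrative halo exchange passing device buffers straight to MPI
subroutine halo_exchange(a, m, n, left, right)
  use mpi
  implicit none
  integer, intent(in) :: m, n, left, right
  real(8), intent(inout) :: a(m, 0:n+1)
  integer :: ierr
  ! inside host_data, a refers to the device address of the array
  !$acc host_data use_device(a)
  call MPI_Sendrecv(a(:, n), m, MPI_DOUBLE_PRECISION, right, 0, &
                    a(:, 0), m, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  !$acc end host_data
end subroutine halo_exchange

Because MPI receives device addresses here, a CUDA-aware MPI can move the data directly between the GPUs (e.g. via GPUDirect) instead of you staging it through both hosts yourself, which also addresses your third question.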
This code uses both OpenACC and MPI, but doesn't use the host_data directive I referenced in point 3. If I find another, I'll update this answer. It's a common pattern, but I don't have an open code handy at the moment: https://github.com/UK-MAC/CloverLeaf_OpenACC
My understanding, from reading the Intel MKL documentation and posts such as "Calling multithreaded MKL in from openmp parallel region", is that combining OpenMP parallelization in your own code with MKL's internal OpenMP threading for functions such as DGESVD or DPOTRF is impossible unless you build with the Intel compiler. For example, I have a large linear system I'd like to solve using MKL, but I'd also like to take advantage of parallelization to build the system matrix (my own code, independent of MKL), in the same binary executable.
Intel states in the MKL documentation that third-party compilers "may have to disable multithreading" for MKL functions. So the options are:
OpenMP parallelization of your own code (standard #pragma omp ..., etc.) and single-threaded calls to MKL
multithreaded calls to MKL functions ONLY, with single-threaded code everywhere else
use the Intel compiler (I would like to use gcc, so not an option for me)
parallelize both your code and MKL with Intel TBB? (not sure if this would work)
Of course, MKL ships with its own OpenMP runtime, libiomp*, which gcc can link against. Is it possible to use this library to achieve parallelization of your own code in addition to the MKL functions? I assume some direct management of threads would be involved. However, as far as I can tell there are no iomp dev headers included with MKL, which may answer that question (--> NO).
So it seems at this point like the only answer is Intel TBB (Threading Building Blocks). Just wondering if I'm missing something or if there's a clever workaround.
(Edit:) Another solution might be if MKL had an interface that accepts custom C++11 lambda functions or other arbitrary code (e.g., containing nested for loops) for parallelization via whatever internal threading scheme it uses. So far I haven't seen anything like this.
Intel TBB will also enable better nested parallelism, which might help in some cases. If you want to enable GNU OpenMP with MKL, there are the following options:
Dynamically selecting the interface and threading layer: link against the mkl_rt library and then
set the environment variable MKL_THREADING_LAYER=GNU prior to loading MKL,
or call mkl_set_threading_layer(MKL_THREADING_GNU);
Linking with the threading libraries directly (though the linked page makes no explicit mention of GNU OpenMP): link against mkl_gnu_thread. This is not recommended when you are building a library, a plug-in, or an extension module (e.g. a Python package) that can be mixed with other components that might use MKL differently.
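Concretely, the two options look something like this with gcc (the source file name is a placeholder, and the exact library list for your MKL version is best confirmed with the MKL Link Line Advisor):

# option 1: single dynamic library, threading layer chosen at run time
gcc -fopenmp myprog.c -lmkl_rt -o myprog
MKL_THREADING_LAYER=GNU ./myprog

# option 2: link the GNU OpenMP threading layer explicitly (64-bit, LP64)
gcc -fopenmp myprog.c -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core \
    -lgomp -lpthread -lm -ldl -o myprog

Either way, your own #pragma omp regions and MKL's internal threading then share the same GNU OpenMP runtime, which is exactly the combination the question is after.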
I've recently become interested in learning more about parallel computing, concurrency, etc. My main language is C++, so obviously I decided to use that in my personal studies.
After some research (read: looking things up on Google), I decided that using Intel's TBB library would be the most ideal.
The one thing that has me stuck so far, though, is setting it up on my computer. I have looked on the Internet for some sort of tutorial on setting up TBB with MinGW (in my case, specifically Nuwen's distro) and haven't really found anything.
So, here's my question: how would I set up TBB to use with a Nuwen distro?
TBB doesn't provide binaries for MinGW in its Windows package, so you have to build it from source. You need a compiler and GNU make:
Download the source code (.zip) from https://github.com/01org/tbb/releases
Unpack it somewhere (as a general rule, beware of spaces in the directory path).
Open a console with your compiler's environment set up, go to $archive_root/src, and call gmake tbb tbbmalloc compiler=gcc. You could also try adding stdver=c++11 to the build command if your compiler supports that mode.
You will find the built libraries in the $archive_root/build/windows_... directory. A quick smoke test is sketched below.
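A minimal smoke test might look like this (the file name is a placeholder, and nothing here is Nuwen-specific):

// hello_tbb.cpp: doubles every element of a vector in parallel
#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> v(1000, 1.0);
    tbb::parallel_for(std::size_t(0), v.size(),
                      [&](std::size_t i) { v[i] *= 2.0; });
    std::printf("v[0] = %f\n", v[0]); // expect 2.000000
    return 0;
}

Compile it against the headers and the libraries you just built, pointing -I and -L at your actual unpack and build directories (the windows_... name is the directory from the previous step):

g++ -std=c++11 hello_tbb.cpp -I$archive_root/include -L$archive_root/build/windows_... -ltbb -o hello_tbb.exe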
I have a file that needs to use the Boost numeric bindings library. How can I get that bindings library?
The following link doesn't seem to work anymore; the zipped file is corrupted:
http://mathema.tician.de/dl/software/boost-numeric-bindings/
I also hope to be able to use it on Windows with Visual Studio.
I have a file that needs to use the Boost numeric bindings library. How can I get that bindings library?
You can grab the sources of the current version via
svn co http://svn.boost.org/svn/boost/sandbox/numeric_bindings
The version at http://mathema.tician.de/dl/software/boost-numeric-bindings/ is an older one with a different interface. You can grab the sources of that older version via
svn co http://svn.boost.org/svn/boost/sandbox/numeric_bindings-v1
I also hope to be able to use it on Windows with Visual Studio.
You need a BLAS/LAPACK library in addition to the bindings. For Windows, Intel MKL, AMD's ACML, CLAPACK, and ATLAS used to work, last time I checked. (You only need one of these, but note that ATLAS only implements BLAS and a small subset of LAPACK.) These libraries have widely different performance and license conditions, but they all implement (more or less) the same interface.
In general, the people at http://lists.boost.org/mailman/listinfo.cgi/ublas seem to be helpful if you try to use the bindings (or ublas) and run into problems. But I'm not sure whether any of them checks here at Stack Overflow for potential users who have run into problems.
Let's put it like this: we are going to create a library that needs to be cross-platform, and we chose GCC as the compiler. It works awesomely on Linux, and now we need to compile it on Windows, where we have MinGW to do the work.
MinGW tries to implement a native way to compile C++ on Windows, but it doesn't support some C++11 features such as std::mutex and std::thread.
We also have MinGW-w64, a fork of MinGW that supports those features, and I was wondering which one to use, knowing that GCC is one of the most used C++ compilers. Or is it better to use MSVC (VC++) on Windows and GCC on Linux, and use CMake to handle the compiler differences?
Thanks in advance.
Personally, I prefer a MinGW-based solution that cross-compiles on Linux, because there are lots of platform-independent libraries that are nearly impossible (or a huge PITA) to build on Windows. (For example, those that use ./configure scripts to set up their build environment.) But cross-compiling all those libraries and their dependencies is annoying even on Linux, if you have to ./configure and make each of them yourself. That's where MXE comes in.
From the comments, you seem to be worried about dependencies. They are costly in terms of build-environment setup when cross-compiling, if you have to cross-compile each library individually. But there is MXE: it builds a cross-compiler and a large selection of platform-independent libraries (like Boost, Qt, and lots of less notable ones). With MXE, Boost becomes a lot more attractive as a solution. I've used MXE to build a project that depends on Qt, Boost, and libexiv2 with nearly no trouble.
Boost threads with MXE
To do this, first install MXE:
git clone -b master https://github.com/mxe/mxe.git
Then build the packages you want (gcc and boost):
make gcc boost
C++11 threads with MXE
If you would still prefer C++11 threads, then that too is possible with MXE, but it requires a two-stage compilation of gcc.
First, check out the master (development) branch of MXE (this is the normal way to install it):
git clone -b master https://github.com/mxe/mxe.git
Then build gcc and winpthreads without modification:
make gcc winpthreads
Now, edit mxe/src/gcc.mk. Find the line that starts with $(PKG)_DEPS := and add winpthreads to the end of that line. Then find --enable-threads=win32 and replace it with --enable-threads=posix, as sketched below.
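Illustratively, the two edits look like this (the real contents of gcc.mk differ between MXE revisions, so the dependency list here is invented for the sketch):

# dependency line: winpthreads appended at the end
$(PKG)_DEPS := binutils gmp isl mpc mpfr winpthreads

# configure option: was --enable-threads=win32
--enable-threads=posix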
Now, recompile gcc and enjoy your C++11 threads.
make gcc
Note: You have to do this because the default configuration supports Win32 threads using the WINAPI instead of POSIX pthreads. But GCC's libstdc++, the library that implements std::thread and std::mutex, doesn't have code to use WINAPI threads, so a preprocessor block strips std::thread and std::mutex from the library when Win32 threads are enabled. By using --enable-threads=posix and the winpthreads library, instead of having GCC interface with Win32 in its libraries (which it doesn't fully support), we let winpthreads act as glue code: it presents a normal pthreads interface for GCC to use and implements that interface with WINAPI functions.
Final note
You can speed these compilations up by adding -jm and JOBS=n to the make command. -jm, where m is a number, means build m packages concurrently; JOBS=n, where n is a number, means use n processes to build each package. In effect they multiply, so pick m and n such that m*n is not much more than the number of processor cores you have. E.g., if you have 8 cores, then m=3, n=4 would be about right, as in the example below.
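For example, on an 8-core machine the Boost build from above becomes:

make gcc boost -j3 JOBS=4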
Citations
http://blog.worldofcoding.com/2014_05_01_archive.html#windows
If you want portability, use the standard way: the <thread> library of C++11.
If you can't use C++11, pthreads can be a solution, although VC++ can't compile pthreads code.
Don't want to use either of these? Then write your own abstraction layer for threading. For example, you can write a class Thread, like this:
class Thread
{
public:
    explicit Thread(int (*pf)(void *arg));
    void run(void *arg);
    int join();
    void detach();
    ...
};
Then, write an implementation for each platform you want to support. For example:
+src
|---thread.h
|--+win
|--|---thread.cpp
|--+linux
|--|---thread.cpp
After that, configure your build script to compile win/thread.cpp on Windows and linux/thread.cpp on Linux. A sketch of what the Linux side might look like follows.
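As an illustration only: everything below beyond the public interface above (the members pf_ and handle_, the StartData helper) is my invention for the sketch, not part of the original answer, and the class definition is repeated inline so the sketch compiles standalone (normally it would live in thread.h). Compile on Linux with g++ -pthread.

// linux/thread.cpp: hypothetical pthread-based implementation sketch
#include <pthread.h>

class Thread
{
public:
    explicit Thread(int (*pf)(void *arg)) : pf_(pf), handle_() {}
    void run(void *arg);
    int join() { return pthread_join(handle_, 0); }
    void detach() { pthread_detach(handle_); }
private:
    static void *trampoline(void *p);
    int (*pf_)(void *);
    pthread_t handle_;
};

// pthread expects void *(*)(void *), so a small adapter forwards
// to the user's int (*)(void *) callback
struct StartData
{
    int (*pf)(void *);
    void *arg;
};

void *Thread::trampoline(void *p)
{
    StartData *d = static_cast<StartData *>(p);
    d->pf(d->arg);
    delete d;
    return 0;
}

void Thread::run(void *arg)
{
    StartData *d = new StartData;
    d->pf = pf_;
    d->arg = arg;
    pthread_create(&handle_, 0, Thread::trampoline, d);
}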
You should definitely use Boost. It's really great and does all the things.
Seriously, unless you need some synchronization primitives that Boost.Thread doesn't support (such as std::async), take a look at the Boost library. Of course it's an extra dependency, but if that doesn't scare you, you will enjoy all the advantages of Boost, such as easy cross-compiling.
Learn about the differences between Boost.Thread and the C++11 thread library here.
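For flavor, a minimal Boost.Thread example (nothing here is specific to the question, and link flags may vary with your Boost version):

#include <boost/thread.hpp>
#include <iostream>

void work()
{
    std::cout << "hello from a boost::thread\n";
}

int main()
{
    boost::thread t(work); // start the thread
    t.join();              // wait for it to finish
    return 0;
}

Build it with something like g++ main.cpp -lboost_thread -lboost_system, and it cross-compiles fine under MXE, per the other answer.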
I think this is a fairly generic list of considerations for when you need to choose multi-platform tools or sets of tools; for a lot of these you probably already have an answer:
Tool support: who will you get support from, and how, if something doesn't work? How strong are the community and the vendor?
Native target support: how well does the tool understand the target platform?
Optimization potential?
Library support (now and in the intermediate future)?
Platform SDK support, if needed?
Build tools (not directly asked here, but do they work on both platforms? Most of the popular ones do).
One thing I see that seems not to have really been dealt with is:
What is the target application expecting?
You mention you are building a library, so what application is going to use it, and what does that application expect?
The constraint here is that the target application dictates the most fundamental aspect of the system: the very tool used to build it. How is the application going to use the library?
What API and what kind of API is needed by or for that application?
What kind of API do you want to offer (C-style, vs. C++ classes, or a combination)?
What runtime is it using, will it be the same, or will there be conflicts?
Given these, and the possibility that the target application is still unknown, maintain as much flexibility as possible. In this case, endeavour to maintain compatibility with gcc, mingw-w64, and msvc. They all offer a broad spectrum of C++11 language support (true, some more than others) and are generally supported by other popular libraries (even if those other libraries are not needed right now).
I thought the comment by Hans Passant really does apply here:
Do what works first.
Since you mentioned it: the MinGW-Builds distribution of MinGW-w64 supports std::thread etc. with its posix build on Windows, both 64-bit and 32-bit.