Getting started with an OpenACC + MPI Fortran program

I have a working serial code and a working single-GPU parallel code parallelized via OpenACC. Now I am trying to increase the parallelism by running on multiple GPUs, employing the MPI+OpenACC paradigm. I wrote my code in Fortran 90 and compile it using the nvfortran compiler from NVIDIA's HPC SDK.
I have a few beginner-level questions:
How do I set up my compiler environment to start writing my MPI+OpenACC code? Are there any extra requirements other than NVIDIA's HPC SDK?
Assuming I have a code written under the MPI+OpenACC setup, how exactly do I compile it? Do I have to compile it twice, once for the CPUs (mpif90) and once for the GPUs (OpenACC)? An example of a makefile or some compilation commands would be helpful.
When communication between GPU-device-1 and GPU-device-2 is needed, is there a way to communicate directly between them, or should I communicate via [GPU-device-1] ---> [CPU-host-1] ---> [CPU-host-2] ---> [GPU-device-2]?
Are there any sample Fortran codes with an MPI+OpenACC implementation?

As @Vladimir F pointed out, your question is very broad, so if you have further questions about specific points you should consider posting each one individually. That said, I'll try to answer each.
1. If you install the NVIDIA HPC SDK you should have everything you need. It includes an installation of OpenMPI plus NVIDIA's HPC compilers for OpenACC. You'll also get a variety of math libraries, if you need those too.
2. Compile everything with mpif90. For instance, mpif90 -acc=gpu will build files that contain OpenACC directives with GPU support, while files without OpenACC compile normally. The MPI module should be found automatically during compilation and the MPI libraries will be linked in.
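For example, a minimal makefile might look like the following (a sketch only; main.f90 and comm.f90 are hypothetical file names, and recipe lines must be indented with tabs):

    # Build an MPI+OpenACC Fortran program with the HPC SDK's mpif90 wrapper.
    FC      = mpif90
    FCFLAGS = -acc=gpu -Minfo=accel

    app: main.o comm.o
    	$(FC) $(FCFLAGS) -o app main.o comm.o

    %.o: %.f90
    	$(FC) $(FCFLAGS) -c $<

    clean:
    	rm -f app *.o *.mod

The -Minfo=accel flag is optional; it just makes the compiler report which loops were offloaded.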
3. You can use the acc host_data use_device directive to pass the GPU version of your data to MPI. I don't have a Fortran example with MPI, but it looks similar to the call in this file: https://github.com/jefflarkin/openacc-interoperability/blob/master/openacc_cublas.f90#L19
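The shape of that pattern would be roughly as follows (a sketch only; sendbuf, recvbuf, the neighbor ranks left/right, and the count n are hypothetical, and a CUDA-aware MPI such as the OpenMPI bundled with the HPC SDK is assumed, which also lets the transfer go directly device-to-device):

    ! Hand MPI the device addresses of sendbuf/recvbuf so the exchange
    ! can happen GPU-to-GPU without staging through host memory.
    !$acc host_data use_device(sendbuf, recvbuf)
    call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                      recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    !$acc end host_data

Inside the host_data region, MPI receives device pointers for the listed arrays instead of their host copies.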
4. This code uses both OpenACC and MPI, but doesn't use the host_data directive I referenced in point 3. If I find another, I'll update this answer. It's a common pattern, but I don't have an open code handy at the moment. https://github.com/UK-MAC/CloverLeaf_OpenACC
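In the meantime, here is a complete toy sketch of the combined pattern (my own minimal example, not from an existing project; the array contents are arbitrary):

    ! Each rank sums an array on its GPU, then the partial sums are
    ! combined across ranks with MPI_Allreduce.
    program mpi_acc_demo
      use mpi
      implicit none
      integer, parameter :: n = 1000000
      integer :: ierr, rank, nranks, i
      real(8) :: local_sum, global_sum
      real(8), allocatable :: x(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      allocate(x(n))
      x = real(rank + 1, 8)

      local_sum = 0.0d0
      !$acc parallel loop reduction(+:local_sum) copyin(x)
      do i = 1, n
        local_sum = local_sum + x(i)
      end do

      call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'global sum =', global_sum

      deallocate(x)
      call MPI_Finalize(ierr)
    end program mpi_acc_demo

Build it with mpif90 -acc=gpu and launch with, e.g., mpirun -np 2 ./a.out (by default ranks may share a device; binding each rank to a distinct GPU is a separate step).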

Related

Is there a way to compile C++ code using the wasi stdlib + pthread support?

I'm new to C++ compiling, tooling, LLVM and such. I'm exploring ways of compiling some C++ apps for the browser. I'm not looking for solutions that just run the C++ app; for those situations Emscripten seems to be just right. I'm looking for ways to build a hybrid app that has a lot of touch-points between the JavaScript part and the C++ part.
I had success compiling and running some C/C++ apps using the clang and LLVM provided by wasi-sdk. But the LLVM provided by wasi-sdk does not support threads.
The wasi-sdk offers a stdlib that respects the WASI specification, and this specification does not support multi-threading. Is there a way to add pthreads from another stdlib implementation and implement the JavaScript glue code by hand (maybe seeking inspiration from Emscripten)? If yes, what would the steps be? LLVM seems to be compiled without thread support in wasi-sdk, so simply adding additional headers that define pthreads might not work.
WASI (and wasi-sdk and wasi-libc) doesn't currently support threads. There is an effort underway to add support here: https://github.com/WebAssembly/wasi-threads
There have also been several recent patches to wasi-libc, e.g. https://github.com/WebAssembly/wasi-libc/pull/325.

Compiling Fortran77 with Julia

I have a bunch of Fortran 77 code that I need to use for my research, but I'm having trouble compiling it to run on my MacBook, so I turned to Julia. I'm new to the language, but for the life of me I can't figure out how to execute a Fortran script directly in Julia. All I want is a program that runs an F77 script and hands control directly to Fortran. I would just rewrite it in Julia or NumPy, but there are about 10,000 lines of code with fewer than 200 lines of comments, and I don't have time for that.
It seems from the wording of your question like you want to use Julia to directly call Fortran "scripts" – presumably Fortran .f source files – is that accurate?
As others have indicated in the comments, Fortran is not a scripting language: you cannot directly execute Fortran source files; instead you must use a Fortran compiler (e.g. gfortran, ifort) to translate Fortran programs into native libraries or executables for the system you want to run programs on. Julia will not help with this in any way as Julia is not a Fortran interpreter or compiler – it can neither run Fortran code directly nor convert Fortran source files into executables/libraries.
If, however, you already have a Fortran shared library compiled (.so file on Linux, .dylib on macOS, .dll on Windows), you can call it easily from Julia, as described in Integrating Fortran code in Julia. If you can compile Fortran source code to an executable (as opposed to a shared library), then you do not need anything else to run it – executables, by definition, are standalone.
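As a minimal sketch of that route (the file name sum2.f90 and the routine are hypothetical; the compile command assumes gfortran on macOS):

    ! sum2.f90 -- tiny example routine with a C-compatible interface.
    ! Compile to a shared library with, for example:
    !   gfortran -shared -fPIC sum2.f90 -o libsum2.dylib
    subroutine sum2(a, b, c) bind(c, name='sum2')
      use iso_c_binding, only: c_double
      real(c_double), intent(in)  :: a, b
      real(c_double), intent(out) :: c
      c = a + b
    end subroutine sum2

From Julia, such a routine could then be invoked along the lines of ccall((:sum2, "./libsum2.dylib"), Cvoid, (Ref{Float64}, Ref{Float64}, Ref{Float64}), a, b, out), since Fortran passes arguments by reference; see the linked answer for details.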
Most projects in compiled languages like Fortran or C/C++ come with Makefiles or other mechanisms to help invoke a compiler to generate the appropriate binary artifacts (executables and/or libraries).

C++ Versions, do they auto-detect the version of an exe?

Okay, so I know that there are multiple C++ versions, and I don't really know much about the differences between them, but my question is:
Let's say I made a C++ application in C++11 and sent it off to another computer: would it come up with errors from other versions of C++, or will it automatically detect the version and run with it? Or am I getting this wrong, and is it defined at compile time? Someone please tell me, because I have yet to find a single answer to my question on Google.
It depends on whether you copy the source code to the other machine and compile it there, or compile it on your machine and send the resulting binary to the other computer.
C++ is translated by the compiler to machine code, which runs directly on the processor. Any computer with a compatible processor will understand the machine code, but there is more than that. The program needs to interface with a filesystem, graphic adapters, and so on. This part is typically handled by the operating system, in different ways of course. Even if some of this is abstracted by C++ libraries, the calls to the operating system are different, and specific to it.
A compiled binary for Ubuntu will not run on Windows, for example, even if both computers have the same processor and hardware.
If you copy the source code to the other machine, and compile it there (or use a cross-compiler), your program should compile and run fine, if you don't use OS-specific features.
The C++ version does matter for compilation: you need a C++11-capable compiler if you have C++11 source code, of course, but once the program is compiled, the language version does not matter any more.
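In other words, the standard is chosen by a compiler flag when the binary is produced (a sketch; the file names are hypothetical):

    # The language version is fixed at build time; nothing in the resulting
    # executable later "detects" or needs a C++ version.
    g++ -std=c++11 main.cpp -o app

The produced app runs on any compatible machine, regardless of whether that machine even has a C++ compiler installed.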
C++ is compiled to machine code, which is then runnable on any computer having that architecture e.g. i386 or x64 (putting processor features like SSE etc. aside).
Java, to bring up a counterexample, is different. There the code is compiled to a bytecode format that is machine-independent. This bytecode format is read and understood by the Java Virtual Machine (JVM). The JVM then has to be available for your architecture, and the correct version has to be installed.
Or am I getting this wrong and is it defined at compile time?
This is precisely the idea: the code is compiled, and after that the language version is almost irrelevant. The only possible pitfall would be a newer C++ version including a breaking change to the standard C++ library (the library, not the language itself!). However, since the vast majority of that library is template code, it's compiled along with your own code anyway; it's basically baked into your .exe file, so it's just as portable as your code. Also, both the C and C++ designers take great care not to break old code, so you can expect even those parts that are provided by the system itself (the standard C library) not to break anything.
So, even though there are things that could break in theory, pure C++ code should run fine on all machines that understand the same .exe format as the machine it was compiled on.

Cocos2d-x cross-platform mystery for me

I am a C++ developer interested in the Cocos2d-x framework. I know that you can write C++ code using the framework, compile it for different platforms, and that's it: you have your 2D game on Windows, Android, and iOS. This is amazing, but I don't understand how it is done, and consequently I worry that something I have done for one platform will not work on another. To go into detail, let me spell out my concerns. To do that, let's clarify what compiling and running code mean.
What does it mean to compile C++ code for a platform (platform being OS + CPU architecture)? It means that the C++ source code is mapped to instructions understandable by a concrete CPU architecture, and the final set of instructions is packaged into an executable file format understandable by a concrete OS, meaning only an OS (or OSes) that knows how to handle that executable format can run it. We should also not forget that the set of instructions the executable contains may include system calls, which are likewise OS-specific.
What does it mean to run the executable? It means the OS knows the format of the particular executable file. When you give the run command, the OS loads it into virtual memory and starts executing its CPU instructions step by step. (Very rough, but in general it is something like that, I guess.)
Now, returning to Cocos2d-x: how is it possible to compile C++ code so that it can be recognized and loaded by different OSes and different CPUs? What mechanisms do we use to get the appropriate .apk, .ipa, or .exe files? Is there a trap we can fall into when using system calls or processor-specific calls? In general, how are all these problems solved? Please explain the process for Android, for example; iOS would be great too. :)
Cocos2d-x has 95% of its code the same for all target OSes and platforms, and 5% written for the concrete platform. For example, there are some Java sources for Android and some Objective-C files for iOS. There is also some C++ code that differs per platform; #define guards are used to separate this code (see the sketch below). An example of such code is file handling, which is written in C++ but differs from platform to platform.
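A rough illustration of what those guards look like (writablePath and the returned paths are hypothetical; CC_TARGET_PLATFORM and the CC_PLATFORM_* constants come from Cocos2d-x's platform configuration header):

    // The branch is selected at compile time by the preprocessor, so each
    // platform's build only contains its own code path.
    #include "platform/CCPlatformConfig.h"
    #include <string>

    std::string writablePath()
    {
    #if CC_TARGET_PLATFORM == CC_PLATFORM_ANDROID
        return "/data/data/com.example.game/";  // hypothetical Android location
    #elif CC_TARGET_PLATFORM == CC_PLATFORM_IOS
        return "Documents/";                    // hypothetical iOS location
    #else
        return "./";                            // desktop fallback
    #endif
    }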
Generating the appropriate output file is the responsibility of the compiler and SDK used for the target platform. For example, Xcode with the clang compiler will generate the iOS build, while the Android NDK with gcc inside will build the .apk.

Is TBB pre-enabled in OpenCV 2.4.5?

I have posted a question in the OpenCV Answers group regarding the performance of TBB, and this is the link.
The answer at that link states the following:
Probably you used the 2.4.5 library with and without TBB to compare; however, since OpenCV 2.4.3, multithreaded support functionality has been included in the source code, so there is no need to build OpenCV with TBB support anymore. It is done automatically where necessary, and the included DLLs are contained in the source where needed.
But I observed a performance change in the HOG descriptor. That is, I used peopledetect.cpp from the samples and compiled it both with and without TBB in OpenCV 2.4.5. I can see that OpenCV 2.4.5 compiled with TBB performs at 2x speed, whereas OpenCV 2.4.5 without TBB performs very slowly.
Can someone please confirm the points below, as I couldn't find any believable sources?
1) From OpenCV 2.4.3 on, do we no longer need to rebuild OpenCV with TBB ON?
The prebuilt binaries have been compiled with the Visual Studio Concurrency framework since 2.4.3. However, not every algorithm uses the "new" parallel interface, where you can switch between Concurrency, IPP, and TBB. Before that, it was AFAIK hardcoded to use either TBB or nothing.
So the problem is that not every algorithm has been converted to the new parallel framework, and thus you can still get speedups by building with TBB in some cases. (IIRC one example is the BruteForceMatcher, which uses only one core with the prebuilt libs.)
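If you do want to rebuild with TBB yourself, the relevant CMake switch is WITH_TBB (a sketch; the source path is hypothetical and TBB must already be installed where CMake can find it):

    # Configure an OpenCV 2.4.x build with TBB enabled, then build it.
    cmake -D WITH_TBB=ON ../opencv-2.4.5
    make -j4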