Does all OpenMP C++ code compiled with -fopenmp run on the GPU? - c++

I am new to OpenMP programming. While doing some basic examples, the .cpp file is compiled with -fopenmp. #pragma omp parallel is given at the beginning for parallelism, and #pragma omp parallel num_threads(4) can also be given. Does all code written this way use the GPU? From the Nvidia command, 540MiB / 2002MiB is shown as used, so maybe the GPU is not being used. What could be the reason?
Thanks in advance.

Does all code written this way use the GPU?
No, it does not use the GPU.

OpenMP 4 and higher has support for offloading computation to accelerators including GPUs, if your compiler supports it for your particular GPU. You have to explicitly tell OpenMP to do so; the normal pragmas continue to stick to multithreading and vectorizing on the CPU.
Here's a presentation I found with some examples (PDF warning).
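For illustration, here is a minimal sketch of the difference, assuming a compiler and runtime built with offload support for your GPU (the arrays and sizes are just placeholders; the exact compile flags vary by toolchain):
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    float *pa = a.data(), *pb = b.data(), *pc = c.data();

    // Plain OpenMP: runs multithreaded on the CPU, never on the GPU.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    // OpenMP 4.x offload: explicitly maps the data and asks for the loop to
    // run on the default accelerator (e.g. a GPU) if the compiler and runtime
    // support it; otherwise it falls back to the host.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];
}
Whether the second loop actually reaches the GPU still depends on how the compiler was built (GCC, for instance, needs an offload target configured), which is why plain -fopenmp alone never touches the GPU.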

Related

oneMKL cannot offload via OpenMP

I tried to run the official oneAPI example code and found that the following code does not actually run on the GPU.
#pragma omp target data map(to:a[0:sizea],b[0:sizeb]) map(tofrom:c[0:sizec]) device(dnum)
{
    // run gemm on gpu, use standard oneMKL interface within a variant dispatch construct
    #pragma omp target variant dispatch device(dnum) use_device_ptr(a, b, c)
    {
        cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc);
    }
}
I say this because with export LIBOMPTARGET_PLUGIN_PROFILE=T the program reports no kernel time, and with export MKL_VERBOSE=1 the MKL function is shown as running on the GPU 0 times.
I would like to know what the problem is and whether there is any solution. My Linux platform uses an Intel GPU (Intel(R) Graphics). Thanks.
cblas_zgemm is a BLAS function call, and OpenMP is not meant to rewrite it to use its own GPU-based implementation; from OpenMP's point of view it is just a function call. If the linked BLAS implementation is not designed to run on a GPU, then OpenMP will not automatically convert the (already compiled) code to GPU code (no such tool exists so far, because GPUs work very differently from CPUs). As a result, OpenMP cannot run this on the GPU if the BLAS library is not meant to use the GPU.
The oneAPI documentation mentions GPU offloading with OpenMP and with BLAS, but as separate/independent topics. It is not clear whether oneMKL has a GPU-based version. AFAIK it is not available from an OpenMP program, but possibly from SYCL/DPC++ code, though I am not sure that supports iGPUs so far.
Finally, even if you could do that, it would not be efficient on your target hardware. Intel iGPUs, like mainstream (client-side) PC GPUs, are not designed for fast double-precision computation, only single-precision. This is because they are designed for 3D rendering and 2D acceleration, where single precision is enough, and because single-precision units consume far less power than double-precision ones (for the same number of items computed per second). This means a cblas_zgemm call would almost certainly be significantly faster on your CPU than on your iGPU (assuming it were possible at all).
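To make the distinction concrete, here is a hedged sketch (the function name, sizes and row-major layout are made up for illustration): OpenMP can offload a loop whose body the compiler can see, but it cannot rewrite an already-compiled library routine such as cblas_zgemm.
// A naive matrix multiply written as a plain loop: OpenMP *can* offload this,
// because the compiler sees the loop body and can generate device code for it.
void matmul_offload(const double *a, const double *b, double *c,
                    int m, int n, int k) {
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:m*k], b[0:k*n]) map(tofrom: c[0:m*n])
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int l = 0; l < k; ++l)
                sum += a[i*k + l] * b[l*n + j];
            c[i*n + j] = sum;
        }
}
// By contrast, cblas_zgemm is an opaque, already-compiled CPU function:
// wrapping the call in a target region does not turn it into GPU code.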
Intel oneMKL does support running on CPU and GPU as stated in the documentation here:
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-mkl-for-dpcpp/top.html
but the cblas calls are C calls (actually built on top of a Fortran implementation) that run only on the CPU.
You should be able to make a oneMKL call within OpenMP without a problem, but as the other answer suggests this will just run the call in parallel without affecting the device the code is targeted for.

Image stitching is taking more time with Seamfinder and Exposurecompensator

I am trying to stitch images, and the code I am working on uses SeamFinder and ExposureCompensator along with other functions. While running the code, these two take a lot of time. Is there an alternative, or a way to improve their performance?
Ptr<ExposureCompensator> compensator = ExposureCompensator::createDefault(expos_comp_type);
compensator->feed(corners, images_warped, masks_warped);
seam_finder = makePtr<GraphCutSeamFinder>(GraphCutSeamFinderBase::COST_COLOR);
seam_finder->find(images_warped_f, corners, masks_warped);
The above are the two functions which are taking time.
Please help me in solving the problem.
Thanks in advance.
Image stitching via OpenCV is known to be slow in many cases. Maybe you can give OpenMP a shot here and counter the delay you are facing by using parallelization.
OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.
Loops whose iterations have nothing to do with each other are a prime target for parallelization. OpenMP effectively exploits this common program characteristic, so it is extremely easy to let an OpenMP program use multiple processors simply by adding a few lines of compiler directives to your source code.
If you are running a loop in which sets of images are being stitched, you can make sure that the stitching of each set runs in parallel.
#pragma omp parallel for
for( ... )
{
    // Image-stitching algorithms go here.
}
The compiler directive #pragma omp parallel for tells the compiler to auto-parallelize the for loop with OpenMP.
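As a concrete, purely hypothetical illustration: assuming each pair of images can be stitched independently by some helper stitchPair() (a made-up name for this sketch), the loop over pairs parallelizes directly:
#include <vector>
#include <opencv2/core.hpp>

// Hypothetical helper: stitches one pair of images independently of the others.
cv::Mat stitchPair(const cv::Mat &left, const cv::Mat &right);

std::vector<cv::Mat> stitchAllPairs(const std::vector<cv::Mat> &lefts,
                                    const std::vector<cv::Mat> &rights) {
    std::vector<cv::Mat> results(lefts.size());
    // Iterations are independent, so they can safely run in parallel.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(lefts.size()); ++i)
        results[i] = stitchPair(lefts[i], rights[i]);
    return results;
}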
For non-loop code, or just independent sections of code, you can do something of this sort:
#pragma omp parallel sections
{
    #pragma omp section
    {
        DoSomething();
    }
    #pragma omp section
    {
        DoSomethingElseParallely();
    }
}
I know that this answer might not directly help you out, but it might give you some avenues to dig into.
You can read more about OpenMP loop Parallelism and OpenMP: Sections before using them.

Translating Intel's #pragma offload to OpenMP for Xeon Phi (performance issues and other questions)

I use Intel C++ compiler 17.0.01, and I have two code blocks.
The first code block allocates memory on Xeon Phi like this:
#pragma offload target(mic:1) nocopy(data[0:size]: alloc_if(1) free_if(0))
The second block evaluates the above memory and copies it back to the host:
#pragma offload target(mic:1) out(data[0:size]: alloc_if(0) free_if(0))
This code runs just fine but the #pragma offload is part of Intel's compiler only (I think). So, I want to convert that to OpenMP.
This is how I translated the first block to OpenMP:
#pragma omp target device(1) map(alloc:data[0:size])
And this is how I translated the second block to OpenMP:
#pragma omp target device(1) map(from:data[0:size])
Also, I used export OFFLOAD_REPORT=2 in order to get a better idea on what is going on during the runtime.
Here are my problems/questions:
The OpenMP version of the first code block is as fast as the Intel version (#pragma offload). Nothing strange here.
The OpenMP version of the second code block is 5 times slower than the Intel version. However, the MIC_TIME of the two is the same, but the CPU_TIME is different (OpenMP version much higher). Why is that?
Are my Intel directives optimal?
Is my Intel -> OpenMP translation correct and optimal?
And here are some other, a bit different, questions:
On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (and not the CPU) because the performance is similar. Why is that?
I also tried my software (the OpenMP version) on a machine without a Xeon Phi and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) ignored?
Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region (so I will know for sure in which card my software is running)?
The second OpenMP code block allocates the memory again. You should map the data to a device data environment, either by enclosing both blocks in #pragma omp target data map(from:data[0:size]), or by adding #pragma omp target enter data map(alloc:data[0:size]) before the first block.
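A minimal sketch of both variants suggested above (data and size are placeholders; device(1) selects the same card as in your snippets):
#include <cstddef>

// Variant 1: a structured target data region keeps the buffer on the device
// across both target regions; the data is copied back once when it ends.
void offload_structured(double *data, std::size_t size) {
    #pragma omp target data device(1) map(from: data[0:size])
    {
        #pragma omp target device(1)
        {
            // first block: produce data on the device (no host copy yet)
        }
        #pragma omp target device(1)
        {
            // second block: evaluate data; copied back when the region ends
        }
    }
}

// Variant 2: unstructured mapping with enter/exit data.
void offload_unstructured(double *data, std::size_t size) {
    #pragma omp target enter data device(1) map(alloc: data[0:size])

    #pragma omp target device(1)
    {
        // produce and evaluate data on the device
    }

    // Copy the result back to the host and release the device buffer.
    #pragma omp target exit data device(1) map(from: data[0:size])
}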
On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
AFAIK, device(0) means the default card, device(1) means the first card, and device(2) is the second card.
If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (and not the CPU) because the performance is similar. Why is that?
Because liboffload does this (liboffload is a runtime library used by both gcc and icc). However, the OpenMP standard doesn't guarantee such behaviour.
I also tried my software (the OpenMP version) on a machine without a Xeon Phi and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) ignored?
Yes. Not sure about the standard, but offloading in icc and gcc is implemented this way.
Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region (so I will know for sure in which card my software is running)?
OpenMP 4.5 provides only the omp_is_initial_device() function to distinguish between the host and the accelerator. Maybe there is some Intel-specific interface to do it.
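For example, a small sketch using only that function (it computes the flag on the device and prints it from the host, since console output from inside a target region is not guaranteed to work):
#include <iostream>
#include <omp.h>

int main() {
    int on_host = 1;
    // Runs on device 1 if offloading is available; otherwise on the host.
    #pragma omp target device(1) map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
    std::cout << (on_host ? "offload region ran on the host CPU\n"
                          : "offload region ran on an accelerator\n");
}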

Make Eigen run in a Multi-Thread

I have some questions regarding Eigen parallelization.
To my understanding, Eigen handles its internal parallelization, but I want to activate multithreading. I just started compiling with G++ using the flag "-fopenmp" and running my executable with OMP_NUM_THREADS=4 ./exec.
For the parts of the code that use only plain C++, I used:
#pragma omp parallel
{
}
Looking at my system monitor, I can see that sometimes more than one thread is used, but most of the time it isn't. I don't know if I have to use additional OpenMP code.
In the following link:
https://eigen.tuxfamily.org/dox/TopicMultiThreading.html
They mention that "in the case your application is parallelized with OpenMP, you might want to disable Eigen's own parallelization as detailed in the previous section", but I don't really understand if I have to, or how to do it.
I hope I am not mixing concepts here.
My thanks in advance.
Quoting from the link you posted:
Currently, the following algorithms can make use of multi-threading: general matrix - matrix products, PartialPivLU
Thus, without knowing exactly what your program is doing, I'd hazard a guess that it's not mostly large matrix-matrix multiplications and/or PartialPivLU. This only regards Eigen's internal parallelization. What you do within the omp parallel blocks will probably run as expected (multiple threads).
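If you do parallelize at the application level with OpenMP and want to disable Eigen's own threading so the two don't oversubscribe your cores, here is a minimal sketch (either the compile-time define or the run-time call is enough):
// Option A: define EIGEN_DONT_PARALLELIZE before including any Eigen header.
// #define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>

int main() {
    // Option B: limit Eigen's internal parallelism to one thread at run time.
    Eigen::setNbThreads(1);

    Eigen::MatrixXd a = Eigen::MatrixXd::Random(512, 512);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(512, 512);

    Eigen::MatrixXd c = a * b;  // now single-threaded inside Eigen

    // Your own OpenMP regions are unaffected and still use multiple threads.
    #pragma omp parallel
    {
        // per-thread work here
    }
    return 0;
}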

Performance difference using dynamic number of threads

I'm using OpenMP to parallelize some heavy loops, and it works as expected.
Testing showed that this directive gave the most performance:
#pragma omp parallel for num_threads(7)
However, that may differ from machine to machine. Also, I wanted to be able to switch threading on/off using a runtime switch.
Therefore, I figured I could use something like this:
if(shouldThread)
    omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
else
    omp_set_num_threads(1);
where on my computer the optimal number of threads is 7 in this example. Then I use this directive instead:
#pragma omp parallel for
It works well, except that the code compiled with the second directive is about 50% slower. Is this to be expected? I figure the runtime has to do dynamic dispatching and work scheduling, while the compile-time directive can add some sort of optimization, I guess.
Code is compiled with MSVC 2013, on a Core i7-3740.
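For reference, here is a minimal self-contained sketch of the runtime switch described above (the loop body is a placeholder for the real heavy work):
#include <omp.h>
#include <vector>

void heavyLoop(std::vector<double> &v, bool shouldThread,
               int optimalNumberOfThreadsForThisMachine) {
    // Choose the thread count at run time instead of hard-coding num_threads(7).
    omp_set_num_threads(shouldThread ? optimalNumberOfThreadsForThisMachine : 1);

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] = v[i] * v[i];  // placeholder heavy work
}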