Translating Intel's #pragma offload to OpenMP for Xeon Phi (performance issues and other questions) - xeon-phi

I use Intel C++ compiler 17.0.01, and I have two code blocks.
The first code block allocates memory on Xeon Phi like this:
#pragma offload target(mic:1) nocopy(data[0:size]: alloc_if(1) free_if(0))
The second block evaluates the above memory and copies it back to the host:
#pragma offload target(mic:1) out(data[0:size]: alloc_if(0) free_if(0))
This code runs just fine, but #pragma offload is specific to Intel's compiler (I think), so I want to convert it to OpenMP.
This is how I translated the first block to OpenMP:
#pragma omp target device(1) map(alloc:data[0:size])
And this is how I translated the second block to OpenMP:
#pragma omp target device(1) map(from:data[0:size])
Also, I used export OFFLOAD_REPORT=2 in order to get a better idea of what is going on at runtime.
Here are my problems/questions:
The OpenMP version of the first code block is as fast as the Intel version (#pragma offload). Nothing strange here.
The OpenMP version of the second code block is 5 times slower than the Intel version. The MIC_TIME of the two is the same, but the CPU_TIME is different (much higher for the OpenMP version). Why is that?
Are my Intel directives optimal?
Is my Intel -> OpenMP translation correct and optimal?
And here are some other, a bit different, questions:
On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (and not the CPU) because the performance is similar. Why is that?
I also tried my software (the OpenMP version) on a machine without a Xeon Phi and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) simply ignored?
Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region (so I will know for sure in which card my software is running)?

The second OpenMP code block allocates the memory again (with map(alloc:), the device memory only lives for the duration of that single target region). You should map the data to a device data environment by enclosing both blocks in #pragma omp target data map(from:data[0:size]), or just add #pragma omp target enter data map(alloc:data[0:size]) prior to the first block.
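As a sketch (the region bodies are omitted, since only the directives were shown in the question; data, size and device(1) are the names from above), the two blocks could share one device data environment like this:
// Variant 1: structured target data region spanning both blocks.
#pragma omp target data device(1) map(from:data[0:size])
{
    #pragma omp target device(1)
    {
        // first block: fill data on the device
    }
    #pragma omp target device(1)
    {
        // second block: use data; it is copied back to the host
        // when the enclosing target data region ends
    }
}
// Variant 2: unstructured mapping (OpenMP 4.5).
#pragma omp target enter data device(1) map(alloc:data[0:size])
// ... both target regions as before ...
#pragma omp target exit data device(1) map(from:data[0:size])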
On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
AFAIK, device(0) means the default card, device(1) means the first card, and device(2) is the second card.
If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (and not the CPU) because the performance is similar. Why is that?
Because liboffload does this (liboffload is a runtime library used by both gcc and icc). However the OpenMP standard doesn't guarantee such behaviour.
I also tried my software (the OpenMP version) on a machine without a Xeon Phi and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) simply ignored?
Yes. Not sure about the standard, but offloading in icc and gcc is implemented this way.
Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region (so I will know for sure in which card my software is running)?
OpenMP 4.5 provides only the omp_is_initial_device() function to distinguish between the host and an accelerator. Maybe there is some Intel-specific interface to do it.
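A minimal sketch of that check (this assumes printf output from the offloaded region is forwarded to the host console, which the Intel offload runtime does):
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp target device(1)
    {
        if (omp_is_initial_device())
            printf("running on the host (fallback)\n");  // no device was used
        else
            printf("running on an offload device\n");    // cannot tell which card, though
    }
    return 0;
}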

Related

Does all OpenMP C++ code compiled using -fopenmp run on the GPU?

I am new to OpenMP programming. While working through some basic examples, I compile the cpp file using -fopenmp. #pragma omp parallel is given at the beginning for parallelism, and #pragma omp parallel num_threads(4) can also be given. Does all the code in this format use the GPU? The NVIDIA tool shows 540MiB / 2002MiB in use, so maybe the GPU is not being used. What could the reason be?
Thanks in advance.
Does all the code in this format use the GPU?
No, it does not use the GPU.
OpenMP 4 and higher has support for offloading computation to accelerators including GPUs, if your compiler supports it for your particular GPU. You have to explicitly tell OpenMP to do so; the normal pragmas continue to stick to multithreading and vectorizing on the CPU.
Here's a presentation I found with some examples (PDF warning).
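For illustration, an explicitly offloaded loop would look something like the following (directive syntax only; whether it actually runs on a GPU depends on how your compiler was built, e.g. GCC configured with an nvptx offload target; a and n are placeholder names):
// Explicit offload: without "target", this loop stays on the CPU.
#pragma omp target teams distribute parallel for map(tofrom: a[0:n])
for (int i = 0; i < n; ++i)
    a[i] = 2.0 * a[i];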

Image stitching is taking more time with Seamfinder and Exposurecompensator

I am trying to stitch images, and the code I am working on uses SeamFinder and ExposureCompensator along with other functions. While running the code, these two take a lot of time. Is there any alternative, or is there a way to improve the performance?
Ptr<ExposureCompensator> compensator = ExposureCompensator::createDefault(expos_comp_type);
compensator->feed(corners, images_warped, masks_warped);
seam_finder = makePtr<GraphCutSeamFinder>(GraphCutSeamFinderBase::COST_COLOR);
seam_finder->find(images_warped_f, corners, masks_warped);
The above are the two functions that are taking the time.
Please help me solve the problem.
Thanks in advance.
Image stitching via OpenCV is known to be slow in many cases. Maybe you can give OpenMP a shot here and counter the delay you are facing by using parallelization.
OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.
When different iterations of a loop have nothing to do with each other, those loops are prime targets for parallelization. OpenMP effectively exploits these common program characteristics, so it is extremely easy to let an OpenMP program use multiple processors simply by adding a few lines of compiler directives to your source code.
If you are running a loop in which a set of images is being stitched, you can make sure that the stitching for each set of images runs in parallel:
#pragma omp parallel for
for( ... )
{
// Image-stitching algorithms go here.
}
This compiler directive, #pragma omp parallel for, tells the compiler to auto-parallelize the for loop with OpenMP.
For non-loop code, or just independent sections of code, you can do something of this sort:
#pragma omp parallel sections
{
#pragma omp section
{
DoSomething();
}
#pragma omp section
{
DoSomethingElseParallely();
}
}
I know that the answer might not directly help you out, but it might give you some avenues to dig into.
You can read more about OpenMP loop parallelism and OpenMP: Sections before using them.

Performance difference using dynamic number of threads

I'm using OpenMP to parallelize some heavy loops, and it works as expected.
Testing showed that this directive gave the most performance:
#pragma omp parallel for num_threads(7)
However, that may differ from machine to machine. Also, I wanted to be able to switch threading on/off using a runtime switch.
Therefore, I figured I could use something like this:
if (shouldThread)
    omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
else
    omp_set_num_threads(1);
where, on my computer, the optimal number of threads is 7 in this example. Then I use this directive instead:
#pragma omp parallel for
It works well, except that the code compiled with the second directive is about 50% slower. Is this to be expected? I figure the runtime has to do dynamic dispatching and work scheduling, while the compile-time directive can add some sort of optimization, I guess.
The code is compiled with MSVC 2013, on a Core i7-3740.
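For reference, the num_threads clause also accepts a runtime integer expression, so the on/off switch can stay on the directive itself. A sketch reusing the names from the question (whether this recovers the original performance would have to be measured):
// num_threads takes any integer expression evaluated at runtime.
int threads = shouldThread ? optimalNumberOfThreadsForThisMachine : 1;
#pragma omp parallel for num_threads(threads)
for (int i = 0; i < count; ++i)  // count stands in for the real loop bound
{
    // heavy loop body
}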

Parallel for vs omp simd: when to use each?

OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be a better choice over the other?
EDIT:
Here is an interesting paper related to the SIMD directive.
A simple answer:
OpenMP used to exploit only multiple threads for multiple cores. This new simd extension allows you to explicitly use SIMD instructions on modern CPUs, such as Intel's AVX/SSE and ARM's NEON.
(Note that a SIMD instruction is executed in a single thread and a single core, by design. However, the meaning of SIMD can be expanded quite a bit for GPGPU. But I don't think you need to consider GPGPU for OpenMP 4.0.)
So, once you know SIMD instructions, you can use this new construct.
In a modern CPU, roughly there are three types of parallelism: (1) instruction-level parallelism (ILP), (2) thread-level parallelism (TLP), and (3) SIMD instructions (we could say this is vector-level or so).
ILP is done automatically by your out-of-order CPUs, or compilers. You can exploit TLP using OpenMP's parallel for and other threading libraries. So, what about SIMD? Intrinsics were a way to use them (as well as compilers' automatic vectorization). OpenMP's simd is a new way to use SIMD.
Take a very simple example:
for (int i = 0; i < N; ++i)
A[i] = B[i] + C[i];
The above code computes a sum of two N-dimensional vectors. As you can easily see, there is no (loop-carried) data dependency on the array A[]. This loop is embarrassingly parallel.
There could be multiple ways to parallelize this loop. For example, prior to OpenMP 4.0, it could be parallelized using only the parallel for construct: each thread will perform N/#threads iterations on multiple cores.
However, you might think that using multiple threads for such a simple addition would be overkill. That is why there is vectorization, which is mostly implemented by SIMD instructions.
Using SIMD, the loop conceptually becomes:
for (int i = 0; i < N; i += 8)
VECTOR_ADD(A + i, B + i, C + i);
This code assumes that (1) the SIMD instruction (VECTOR_ADD) is 256-bit or 8-way (8 * 32 bits); and (2) N is a multiple of 8.
An 8-way SIMD instruction means that 8 items in a vector can be executed in a single machine instruction. Note that Intel's latest AVX provides such 8-way (32-bit * 8 = 256 bits) vector instructions.
In SIMD, you still use a single core (again, this is only for conventional CPUs, not GPUs). But you can use hidden parallelism in the hardware: modern CPUs dedicate hardware resources for SIMD instructions, where each SIMD lane can execute in parallel.
You can use thread-level parallelism at the same time. The above example can be further parallelized by parallel for.
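As a sketch of how the directives combine on this example (simd is a request to the compiler; whether it actually emits AVX/SSE code depends on the target and flags):
// Vectorize only, within a single thread:
#pragma omp simd
for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];

// Distribute iterations across threads and vectorize each thread's chunk:
#pragma omp parallel for simd
for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];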
(However, I have doubts about how many loops can really be transformed into SIMDized loops. The OpenMP 4.0 specification seems a bit unclear on this, so real performance and practical restrictions will depend on actual compilers' implementations.)
To summarize, the simd construct allows you to use SIMD instructions; in turn, more parallelism can be exploited along with thread-level parallelism. However, I think actual implementations will matter.
The linked-to standard is relatively clear (p 13, lines 19+20)
When any thread encounters a simd construct, the iterations of the
loop associated with the construct can be executed by the SIMD lanes
that are available to the thread.
SIMD is a sub-thread thing. To make it more concrete, on a CPU you could imagine using simd directives to specifically request vectorization of chunks of loop iterations that individually belong to the same thread. It's exposing the multiple levels of parallelism that exist within a single multicore processor, in a platform-independent way. See for instance the discussion (along with the accelerator stuff) on this intel blog post.
So basically, you'll want to use omp parallel to distribute work onto different threads, which can then migrate to multiple cores; and you'll want to use omp simd to make use of vector pipelines (say) within each core. Normally omp parallel would go on the "outside" to deal with coarser-grained parallel distribution of work and omp simd would go around tight loops inside of that to exploit fine-grained parallelism.
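Concretely, a common pattern is omp parallel for on the outer loop and omp simd on the inner one; a sketch with placeholder names (rows, cols, out, a, b):
// Outer loop split across threads/cores, inner loop vectorized within each thread.
#pragma omp parallel for
for (int i = 0; i < rows; ++i)
{
    #pragma omp simd
    for (int j = 0; j < cols; ++j)
        out[i * cols + j] = a[i * cols + j] + b[i * cols + j];
}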
Compilers aren't required to make simd optimization in a parallel region conditional on the presence of the simd clause. Compilers I'm familiar with continue to support nested loops (parallel outer, vector inner) in the same way as before.
In the past, OpenMP directives were usually taken to prevent loop-switching optimizations involving the outer parallelized loop (multiple loops with collapse clause). This seems to have changed in a few compilers.
OpenMP 4 opens up new possibilities, including optimization of a parallel outer loop with a non-vectorizable inner loop by a sort of strip mining, when omp parallel do [for] simd is set. ifort sometimes reports this as outer-loop vectorization when it is done without the simd clause. It may then be optimized for a smaller number of threads than the omp parallel do simd version, which seems to need more threads than the simd vector width to pay off. Such a distinction might be inferred: without the simd clause, the compiler is implicitly asked to optimize for a loop count such as 100 or 300, while the simd clause requests unconditional simd optimization.
gcc 4.9 omp parallel for simd looked quite effective when I had a 24-core platform.

C++ intel TBB inner loop optimisation

I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3); however, I only get a decent payoff when the inner two loops are significant in size.
Is TBB spawning new threads for every iteration of the major loop?
Is there anyway to reduce the overhead?
tbb::task_scheduler_init tbb_init(4); //I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start =boost::chrono::system_clock::now();
for(unsigned i=0; i!=5000; ++i)
{
tbb::parallel_for(blk_rng,
[&](const tbb::blocked_range<size_t>& br)->void
{
:::
It might be interesting to note that OpenMP (which I am trying to remove!!!) doesn't have this problem.
I am compiling with:
Intel ICC 12.1 with -O3 -xHost -mavx
on an Intel 2500K (4 cores)
EDIT: I can't really change the order of the loops, because the outer loop's test needs to be replaced with a predicate based on the loop's result.
No, TBB does not spawn new threads for every invocation of parallel_for. Actually, unlike OpenMP parallel regions, each of which may start a new thread team, TBB works with the same thread team until all task_scheduler_init objects are destroyed; and in the case of implicit initialization (with task_scheduler_init omitted), the same worker threads are used until the end of the program.
So the performance issue is caused by something else. The most likely reasons, from my experience, are:
lack of compiler optimizations, with auto-vectorization being the first one (this can be checked by comparing the single-threaded performance of OpenMP and TBB; if TBB is much slower, then this is the most likely reason).
cache misses; if you run through the same data 5000 times, cache locality has huge importance, and OpenMP's default schedule(static) works very well here, deterministically repeating exactly the same partitioning each time, while TBB's work-stealing scheduler has significant randomness. Setting the blocked_range grain size equal to problem_size/num_threads ensures one piece of work per thread, but it does not guarantee the same distribution of pieces across threads; affinity_partitioner is supposed to help with that.
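A sketch of the affinity_partitioner idea (the range and loop body stand in for the question's code; the key point is that the same partitioner object is reused across the 5000 outer iterations so TBB can replay the previous chunk-to-thread assignment):
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

tbb::affinity_partitioner ap;  // reuse the same object across all outer iterations

for (unsigned i = 0; i != 5000; ++i)
{
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, crs_.y_sz),
        [&](const tbb::blocked_range<size_t>& br)
        {
            // ::: same inner work as in the question
        },
        ap);  // affinity_partitioner tries to give each thread the same chunks as last time
}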