efficiently use openmp on small loop with very large nested loops

Basically, I have a program that needs to go over several individual pictures.
I do this by:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int picture = 0; picture < 4; picture++) {
    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // do stuff with pixel[picture][row][col]
        }
    }
}
I just want to split the work among 4 cores (1 core per picture) so that each core/thread is working on a specific picture. That way core 0 is working on picture 0, core 1 on picture 1, and so on. The machine it is being tested on only has 4 cores as well. What is the best way to use OpenMP directives for this scenario? The one I posted is what I think would give the best performance for this scenario.
Keep in mind this is pseudocode. The goal of the program is not important; parallelizing these loops efficiently is the goal.

Just adding a simple
#pragma omp parallel for
is a good starting point for your problem. Don't bother hard-coding how many threads it should use. The runtime will usually do the right thing.
However, it is impossible to say in general what is most efficient. There are many performance factors that cannot be judged from your limited, general example. Your code may be memory bound and benefit only very little from parallelization on desktop CPUs. You may have a load imbalance, which means you need to split the work into more chunks and process them dynamically. That could be done by parallelizing the middle loop or by using nested parallelism. Whether parallelizing the middle loop works well depends on the amount of work done by the inner loop (and hence the ratio of useful work to overhead). The memory layout also heavily influences the efficiency of the parallelization. Or maybe you even have data dependencies in the inner loop that prevent parallelization there...
The only general recommendation one can give is to always measure, never guess. Learn to use the powerful parallel performance analysis tools that are available and incorporate them into your workflow.
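To illustrate the "more chunks" idea, here is a minimal sketch. It assumes the pictures sit in one contiguous std::vector<float> and that every pixel can be processed independently; process_pictures and brighten are made-up names standing in for the "do stuff" part of the question. The collapse(2) clause merges the picture and row loops, so the runtime gets 4 x 1000 work items to hand out instead of 4. Build with -fopenmp on GCC or Clang; without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel operation; stands in for "do stuff with pixel".
inline float brighten(float v) { return v * 0.5f + 1.0f; }

// Applies brighten() to every pixel of `pictures` images of rows x cols pixels,
// all stored back to back in `pixels` (an assumed layout).
void process_pictures(std::vector<float>& pixels, int pictures, int rows, int cols) {
    assert(pixels.size() == static_cast<std::size_t>(pictures) * rows * cols);
    // collapse(2) fuses the picture and row loops into one iteration space,
    // giving pictures * rows work items instead of just `pictures`.
    #pragma omp parallel for collapse(2) schedule(static)
    for (int picture = 0; picture < pictures; ++picture) {
        for (int row = 0; row < rows; ++row) {
            float* line = &pixels[(static_cast<std::size_t>(picture) * rows + row) * cols];
            for (int col = 0; col < cols; ++col) {
                line[col] = brighten(line[col]);
            }
        }
    }
}
```

If the cost per row varies a lot, schedule(dynamic) with a suitable chunk size may balance better than schedule(static). Whether either variant beats the plain one-thread-per-picture loop is something only a measurement on the real code can tell.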

Related

Is two statements in a single loop faster than a statement per loop? [closed]

Or is it the same in terms of performance?
For example, which is faster?
int a = 1, b = 2;
for (int i = 0; i < 10; ++i) {
    a = a + 1;
    b = b + 1;
}
or
for (int i = 0; i < 10; ++i) {
    a = a + 1;
}
for (int i = 0; i < 10; ++i) {
    b = b + 1;
}
Note: I changed my examples, given a lot of people seem hung up on the statements inside them rather than the purpose of my question.

Both of your examples do nothing at all and most compilers will optimize them both to the same thing -- nothing at all.
Update: Your two new examples are obviously equivalent. If any compiler generated better code for one than the other, then it's a poor quality compiler and you should just use a better compiler.

As people have pointed out, the compiler will optimize the code whichever way I go, but the result really depends on what statements are inside the loop(s).

The performance depends on the contents of the loops.
Let's decompose the for loop. A for loop is composed of:
1) Initialization
2) Comparison
3) Incrementing
4) Content (statements)
5) Branching
Let us define a comparison as a compare instruction (to set the processor status bits) and a branch (to take advantage of the processor status bits).
Processors are at their happiest when they are executing data instructions. The processor manipulates the data, then processes the next instruction in the pipeline (cache).
The processors don't like sections 2) Comparison and 5) Branching (to the top of the loop). Branching means that the processor has to stop processing data and execute logic to determine whether the instruction cache needs to be replaced or not. This time could be spent processing data instructions.
The primary goal of optimizing a for loop is to reduce the branching. The secondary one is to optimize the data cache / memory accesses. A common optimization technique is loop unrolling, which basically means placing more statements inside the for loop. As a measurement, you can take the overhead of the for loop and divide it by the number of statements inside the loop.
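The unrolling idea can be sketched as follows. This is only an illustration with made-up function names (sum_simple, sum_unrolled); optimizing compilers can apply the same transformation on their own, so unrolling by hand is rarely needed in practice.

```cpp
#include <cassert>
#include <cstddef>

// Straightforward loop: one compare-and-branch per element.
long sum_simple(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Manually unrolled by four: one compare-and-branch per four elements.
long sum_unrolled(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    for (; i < n; ++i) {  // remainder: the last n % 4 elements
        total += data[i];
    }
    return total;
}
```

Both functions return the same sum; the second one just spends fewer loop-control instructions per element.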
According to the above information, your first loop (with both assignment statements) would be more efficient, since there are more data instructions per loop; less overhead overall.
Edit 1: The Parallel Environment
However, your second example may be faster. The compiler could set up both loops to run in parallel (either through instructions or actual parallel tasks). Since both loops are independent, they can be run at the same time or split between CPU cores. Processors have instructions that can perform common operations on multiple memory locations. Your first example makes this a little more difficult because it requires more analysis from the compiler. Since the loops in the second example are simpler, the compiler's analysis is also simpler.
The number of iterations also plays a role. For small counts, the loops should perform the same or differ negligibly. For large counts, there may be some timing differences.
In summary: PROFILE. BENCHMARK. The only true answer depends on measurements. They may vary depending on the applications being run at the same time, the amount of memory (both RAM and hard drive), the quantity of CPU cores and other items. Profile and Benchmark on your system. Repeat on other systems.

OpenMP nested parallelization

I am using a library that is already parallelized with OpenMP. The issue is that 2-4 cores seem enough for the processing it is doing. Using more than 4 cores makes little difference.
My code is like this:
for (size_t i=0; i<4; ++i)
Call_To_Library (i, ...);
Since 4 cores seem enough for the library (i.e., 4 cores should be used in Call_To_Library), and I am working with a 16-core machine, I intend to also parallelize my for loop. Note that this loop consists of at most 3-4 iterations.
What would be the best approach to parallelize this outer for? Can I also use OpenMP? Is it a best practice to use nested parallelizations? The library I am calling already uses OpenMP and I cannot modify its code (and it wouldn't be straightforward anyway).
PS. Even if the outer loop consists only of 4 iterations, it is worth parallelizing. Each call to the library takes 4-5 seconds.

If there is no dependency between iterations of this loop you can do:
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < 4; ++i)
    Call_To_Library(i, ...);
If, as you said, every invocation of Call_To_Library takes that long, the overhead of nesting OpenMP constructs will probably be negligible.
Moreover, you say that you have no control over the number of OpenMP threads created in Call_To_Library. Provided nested parallelism is enabled in the runtime, this solution will multiply the number of OpenMP threads by 4, and most likely you will see a speedup of up to 4x. Probably the inner Call_To_Library was parallelized in such a way that no more than a few OpenMP threads could be executed at the same time. With the external parallel for you increase that number 4 times.
The problem with nested parallelism could be that you have an explosion of the number of threads created at the same time and therefore you could see less than ideal speedup because of the overhead related to creation/closing of openmp threads.
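A minimal sketch of the nested setup, under a few assumptions: call_to_library_stub is a made-up stand-in for Call_To_Library (whose real signature is not shown in the question), and the runtime supports omp_set_max_active_levels, which is part of the OpenMP API since version 3.0. Many runtimes keep nested parallelism switched off by default, in which case the inner regions run with a single thread each, so it has to be enabled explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for Call_To_Library: it opens its own parallel
// region, as the real library is said to do.
double call_to_library_stub(std::size_t i) {
    double sum = 0.0;
    #pragma omp parallel for num_threads(4) reduction(+ : sum)
    for (int k = 0; k < 1000; ++k) {
        sum += static_cast<double>(i + 1) * k;
    }
    return sum;
}

std::vector<double> run_all_calls(int call_count) {
    assert(call_count >= 0);
#ifdef _OPENMP
    // Allow two levels of active parallel regions: outer loop + library.
    omp_set_max_active_levels(2);
#endif
    std::vector<double> results(static_cast<std::size_t>(call_count));
    // One outer thread per call; each call may then start its own inner team.
    #pragma omp parallel for num_threads(4) schedule(static)
    for (int i = 0; i < call_count; ++i) {
        results[i] = call_to_library_stub(static_cast<std::size_t>(i));
    }
    return results;
}
```

With 4 outer threads and 4 inner threads per call this requests up to 16 threads, which matches the 16-core machine in the question. The same limit can usually be set from outside the program with the OMP_MAX_ACTIVE_LEVELS environment variable.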

Threads in c++ not generating speedup on mandelbrot image processing

So, I wrote a program that generates a mandelbrot image. Then, I decided to write it in a way that would use a specified number of threads to speed it up. This is what I came up with:
void mandelbrot_all(std::vector<std::vector<int>>& pixels, int X, int Y, int threadCount) {
    using namespace std;
    vector<thread> threads;
    int numThreads = threadCount;
    for (int i = 0; i < numThreads; i++) {
        threads.push_back(thread(mandelbrot_range, std::ref(pixels), i*X/numThreads, 0, X*(i+1)/numThreads, Y, X));
    }
    for (int i = 0; i < numThreads; i++) {
        threads[i].join();
    }
}
The intention was to split the processing into chunks and process each one separately. When I run the program, it takes a number as an argument, which will be used as the number of threads to be used in the program for that run. Unfortunately, I get similar times for any number of threads.
Is there something about threading in C++ that I'm missing? Do I have to add some kind of boilerplate to make the threads run simultaneously? Or is the way I'm making threads just silly?
I've tried running this code on a Raspberry Pi and on my quad-core laptop, with the same results.
Any help would be appreciated.

I'm a little late coming back to this question, but looking back, I remember the solution: I was programming on a single-core Raspberry Pi. One core means no speedup from threading.

I think spawning the threads is too expensive. You could try PPL or TBB, which both have parallel_for and parallel_for_each, and use those to loop through the pixels instead of using threads directly. They manage the threads internally, so you have less overhead and better throughput.

Solving one problem at a time, why not give it a try and hardcode the use of 2 threads, then 3? Starting a thread is expensive; however, if you start only 2 threads and calculate a fairly large Mandelbrot image, the thread start time will be negligible by comparison.
As long as you don't achieve close to 2x and 3x speedup, you have other problems that you need to debug and solve separately.

Without looking at your code and playing with it, it's hard to pinpoint what the problem is exactly. Here's a guess though: some portions of the Mandelbrot set image are much easier to compute than others. Your code is cutting the image up into equal slices along the x-axis, but the majority of the work (say 70%) could fall into one slice. In that case, the best you can do is about a 30% reduction in run time (roughly a 1.4x speedup), since the rest of the threads still have to wait for the slowest one to finish. For example, if you run with four threads and cut up the image into four pieces, the third piece certainly looks more intensive than the rest. Of course the 70% is just an estimate.
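One way around such an imbalance is to hand out rows dynamically instead of cutting the image into fixed slices. The sketch below is an assumption-laden illustration, not the asker's code: escape_time is a generic Mandelbrot kernel rather than mandelbrot_range, the image is indexed as pixels[y][x], and the viewed region is hard-coded. Each thread claims the next unprocessed row from a shared atomic counter, so threads that get cheap rows simply come back for more.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Escape-time iteration count for one point c = cr + ci*i.
int escape_time(double cr, double ci, int max_iter) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        const double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Fills pixels[y][x]; threads repeatedly claim the next unprocessed row.
void mandelbrot_dynamic(std::vector<std::vector<int>>& pixels, int width, int height,
                        int thread_count, int max_iter = 200) {
    assert(width > 0 && height > 0 && thread_count > 0);
    pixels.assign(height, std::vector<int>(width));
    std::atomic<int> next_row{0};
    auto worker = [&] {
        for (int y = next_row.fetch_add(1); y < height; y = next_row.fetch_add(1)) {
            for (int x = 0; x < width; ++x) {
                const double cr = -2.0 + 3.0 * x / width;   // real axis: [-2, 1)
                const double ci = -1.0 + 2.0 * y / height;  // imaginary axis: [-1, 1)
                pixels[y][x] = escape_time(cr, ci, max_iter);
            }
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
}
```

Every row is written by exactly one thread, so no locking is needed beyond the atomic counter. On a single-core machine this will of course not be faster than the serial version either.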

Which library for parallel for-loops that iterate 1M*1k times, OpenMP or boost::thread?

I want to iterate over an image pixel by pixel and do about 1000 floating-point operations per pixel. Do you think I should use multi-threading or multiprocessing, i.e. boost::thread or OpenMP for this? Is there a rule of thumb to choose between these 2 (for fastest speed)? I have understood that creating threads or switching between threads is several times faster than creating/switching processes. On the other hand, implementing OpenMP code is much easier.
My solution right now:
#pragma omp parallel for
for (size_t i = 0; i < 640; ++i) {
    for (size_t j = 0; j < 480; ++j) {
        // do 1000 float operations
    }
}

OpenMP is more than sufficient for this; in fact, Boost does not even have a built-in parallel loop construct.
Do you think I should use multi-threading or multiprocessing
Although OpenMP stands for Open MultiProcessing, it is in fact a multithreading library.
An alternative library worth looking at is Intel TBB.

Multidimensional Array Initialization: Any benefit from Threading?

Say I have the following code:
char array[5][5];
for (int i = 0; i < 5; ++i)
{
    for (int j = 0; j < 5; ++j)
    {
        array[i][j] = /* random char */;
    }
}
Would there be a benefit for initializing each row in this array in a separate thread?
Imagine instead of a 5 by 5 array, we have a 10 by 10?
n x n?
Also, this is done once, during application startup.

You're joking, right?
If not: The answer is certainly no!!!
You'd incur a lot of overhead for putting together enough synchronization to dispatch the work via a message queue, plus knowing all the threads had finished their rows and the arrays were ready. That would far outstrip the time it takes one CPU core to fill 25 bytes with a known value. So for almost any simple initialization like this you do not want to use threads.
Also bear in mind that threads provide concurrency but not speedup on a single core machine. If you have an operation which has to be completed synchronously--like an array initialization--then you'll only get value by adding a # of threads up to the # of CPU cores available. In theory.
So if you're on a multi-core system and if what you were putting in each cell took a long time to calculate... then sure, it may be worth exploiting some kind of parallelism. So I like genpfault's suggestion: write it multithreaded for a multi-core system and time it as an educational exercise just to get a feel for when the crossover of benefit happens...

Unless you're doing a significant amount of computation, no, there will not be any benefit. It's possible you might even see worse performance due to caching effects.
This type of initialization is memory-bound, not CPU bound. The time it takes to initialize the array depends on the speed of your memory; your CPU will just waste cycles spinning waiting for the memory operations to commit. Adding more threads will still have them all waiting for memory, and if they're all fighting over the same cache lines, the performance will be worse because now the caches of the separate CPUs have to synchronize with each other to avoid cache incoherency.
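The cache-line effect mentioned above can be avoided when each thread writes to data that sits on its own cache line. Here is a small sketch of that idea, assuming 64-byte cache lines (common, but not universal) and a C++17 compiler so that std::vector honors the extra alignment; PaddedCounter and count_in_parallel are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: cache lines are 64 bytes. alignas pads each counter to that size.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Each thread increments only its own counter. Because every counter occupies
// its own 64-byte slot, the cores do not keep invalidating each other's lines.
long count_in_parallel(int thread_count, long increments_per_thread) {
    assert(thread_count > 0);
    std::vector<PaddedCounter> counters(static_cast<std::size_t>(thread_count));
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) {
                ++counters[t].value;
            }
        });
    }
    for (auto& th : threads) th.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

For a 5 by 5 char array none of this pays off: the whole array fits into a single cache line, which is exactly why splitting it across threads makes the caches fight.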

On modern hardware? Probably none, since you're not doing any significant computation. You'll most likely be limited by your memory bandwidth.
Pretty easy to test though. Whip up some OpenMP and give it a whirl!

Doubtful, but for some value of n x n, maybe... I'd imagine it's a really high n, though, and by then you'd probably already be multi-threading the processing of this data. Remember that these threads will be writing back to the same area, which may also lead to cache contention.
If you want to know for sure, try it and profile.
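A rough way to try it yourself is sketched below. Everything here is illustrative: cell_value is a deterministic stand-in for the "random char" so both variants produce comparable output, fill_threaded starts one thread per row as the question proposes, and time_us measures a single call with std::chrono::steady_clock.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

using Grid = std::vector<std::vector<char>>;

// Deterministic stand-in for "random char" so both variants can be compared.
inline char cell_value(int i, int j) { return static_cast<char>('a' + (i * 31 + j) % 26); }

void fill_row(Grid& grid, int i) {
    assert(i >= 0 && i < static_cast<int>(grid.size()));
    for (int j = 0; j < static_cast<int>(grid[i].size()); ++j) grid[i][j] = cell_value(i, j);
}

void fill_serial(Grid& grid) {
    for (int i = 0; i < static_cast<int>(grid.size()); ++i) fill_row(grid, i);
}

// One thread per row, as proposed in the question.
void fill_threaded(Grid& grid) {
    std::vector<std::thread> threads;
    for (int i = 0; i < static_cast<int>(grid.size()); ++i) {
        threads.emplace_back(fill_row, std::ref(grid), i);
    }
    for (auto& th : threads) th.join();
}

// Returns the wall-clock time of one call to fill(grid), in microseconds.
template <typename Fill>
long long time_us(Fill fill, Grid& grid) {
    const auto start = std::chrono::steady_clock::now();
    fill(grid);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling time_us(fill_serial, grid) and time_us(fill_threaded, grid) for 5 x 5, 10 x 10 and much larger grids gives numbers for your own machine. Build with optimizations and repeat each measurement several times, since a single run is noisy.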

Also, this is done once, during application startup.
For this kind of thing, the cost of allocating the threads is probably greater than what you save by using them. Especially if you only need to do it once.

I did something similar, but in my case, the 2d array represented pixels on the screen. I was doing pretty expensive stuff, colour lerping, Perlin noise calculation... When launching it all in a single thread, I got around 40 fps, but when I added slave threads responsible for calculating rows of pixels, I managed to double that result. So yes, there might be situations where multithreading helps in speeding up whatever you do in the array, providing that what you do is expensive enough to justify using multiple threads.
You can download a live demo where you adjust the number of threads to watch the fps counter change: http://umbrarumregnum.110mb.com/download/mnd (the multithreading test is the "Noise Demo 3").
