I would like to parallel a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for performance decrease maybe include: private copies of config? or, random access of ref_table? or, expensive atomic operation? So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config or, random access of ref_tables are not problematic, I think the workload is very small, there are 2 potential issues which prevent efficient parallelization:
atomic operation is too expensive.
overheads are bigger than workload (it simply means that it is not worth parallelizing with OpenMP)
I do not know which one is more significant in your case, so it is worth trying to get rid of atomic operation. There are 2 cases:
a) If the results array is zero initialized you have to use:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config) where N is the size of result array and delete #pragma omp atomic. Note that this works on OpenMP 4.5 or later. It is also worth removing #parama omp simd for a loop of 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
result[index]++;
}
b) If the result array is not zero initialized the solution is very similar, but use a temporary zero initialized array in the loop and after that add it to result array.
If the speed will not increase then your code is not worth parallelizing with OpenMP on your hardware.
Related
I have a problem writing the parallel instructions for a code that work like this:
// every iteration depends on the previous one
for (int iter = 0; iter < numIters; ++i)
{
#pragma omp parallel for num_threads(numThreads)
for (int p = 0; p < numParticles; ++p)
{
p_velocity_calculation(...);
}
// implicit sync barrier
#pragma omp parallel for num_threads(numThreads)
for (int p = 0; p < numParticles; ++p)
{
p_position_calculation(...);
}
}
The program is about a n-body simulation where first I need to calculate the velocities and then the positions of a set of particles, hence the separation of the two for-loops.
The code runs as expected, but from what I have inquired, the thread pools created by the #pragma omp directives are created and destroyed every iteration of the outer for-loop, but I don't want to waste resources creating them.
So my question is how can I reuse those thread pools and not creating/destroying the threads every iteration?
First of all: the thread pools are not destroyed, only suspended.
Next: Have you timed this and found that creating the threads is a limiting factor in your application? If not, don't worry.
Or to put it constructively : I have timed it and unless you have an extremely short omp parallel for and you call it tens of thousand of times, the overhead is negligible.
But if you are really worried, put the omp parallel outside the time loop, and do an omp for around the particle loop. You will do some redundant work between the for loops, which you can either accept or put a omp master around if it affects global variables.
But really: I wouldn't worry.
I'm pretty new to OpenMP, so I'm fine if I have this wrong. Also I wasn't successful in finding information about this but I'm sure I missed something obvious.
I have some nested loops, I would like to parallelize a certain way.
This is a sequential version. Notice f(i) is a larger integer between 100 and 100,000 roughly.
for (int a = 0; a < 10; a++)
{
for (int b = 0; b < 10; b++)
{
for (int c = 0; c < f(a); c++)
{
for (int d = 0; d < f(b); d++)
{
if (comp(c, d))
{
result[a][b]++;
}
}
}
}
}
Naively, I came up with this method of parallelizing the code.
#pragma omp parallel
{
// Create a result_local array to avoid critical sections in the loop
#pragma omp for collapse(2) schedule(guided) nowait
for (int a = 0; a < 10; a++)
{
for (int b = 0; b < 10; b++)
{
for (int c = 0; c < f(a); c++)
{
for (int d = 0; d < f(b); d++)
{
if (comp(c, d))
{
result_local[a][b]++;
}
}
}
}
}
// Add the result_local to result
}
This is the part where I'm not so sure. If my understanding is correct, OpenMP will not parallelize the c and d loops meaning each thread will execute a c loop in its entirety. Given f(i) can return relatively low numbers like 100 or relatively high numbers like 100,000, this means some of the threads might get stuck with a lot more work than other threads which is not ideal.
So then the question is how can I parallelize the inner loops to share the work better. I can't change collapse(2) to collapse(4) because the c and d loops iterate up to a number that is a function of the a and b variables.
I saw something in my research that maybe is helpful.
#pragma omp parallel
{
// Create a result_local array to avoid critical sections in the loop
for (int a = 0; a < 10; a++)
{
for (int b = 0; b < 10; b++)
{
#pragma omp parallel for collapse(2) schedule(guided)
for (int c = 0; c < f(a); c++)
{
for (int d = 0; d < f(b); d++)
{
if (comp(c, d))
{
result_local[a][b]++;
}
}
}
}
}
// Add the result_local to result
}
Admittedly, I don't know enough to know if this helpful at all. What I saw indicates this might be parallelizing the c and d loops but leaving the a and b loops serial?
Any help is appreciated.
omp will not parallelize the c and d loops meaning each thread will execute a c loop in its entirety.
This is correct.
some of the threads might get stuck with a lot more work than other threads
You are right: the work imbalance between thread is a performance issue in the first code. A schedule(dynamic) help a bit to fix this, but there is not much more you can do on this version.
I don't know enough to know if this helpful at all. What I saw indicates this might be parallelizing the c and d loops but leaving the a and b loops serial?
Technically, the a and b loops are executed in parallel too (since they are in a parallel section, but all the threads will completely execute all the iterations in lockstep (because the omp parallel for contains an implicit synchronization). You should not use a second omp parallel: regarding the runtime, this can created new threads 100 times, and even when no new threads are created, this result in an inefficient code (for example because of a bad default thread pinning). Moreover, schedule(guided) is not needed here and should be less efficient than a schedule(static). Thus, use omp for collapse(2) schedule(static).
how can I parallelize the inner loops to share the work better.
The last code is not soo bad in term of work balancing although it introduces some unwanted overheads:
The implicit synchronization of the omp for can be skipped using nowait since all threads are working on thread-private data.
The access to result_local[a][b] can be replaced by a fast thread-private variable access.
The conditional increment can be replaced by a branch-less boolean increment.
f(a) and f(b) can be per-computed although optimizing compilers should already do this.
When f(a) * f(b) is very small, this could be better not to execute the loop in parallel (because of the expensive cost to communicate between cores). However this is highly dependent of whether cond is expensive or not.
When f(a) is big, there is no need to use a costly collapse(2) as there will be enough work for all threads (collapse(2) usually slow down the execution since compilers often generate a slow modulus instruction to find the value of the loop iterators at runtime).
Here is the resulting code tacking into account most fixes:
#pragma omp parallel
{
// Create a result_local array to avoid critical sections in the loop
// Arbritrary threshold (this may not be optimal)
const int threshold = 4 * omp_get_num_threads();
for (int a = 0; a < 10; a++)
{
const int c_lim = f(a);
for (int b = 0; b < 10; b++)
{
const int d_lim = f(b);
int64_t local_sum = 0;
if(c_lim < threshold)
{
#pragma omp for collapse(2) schedule(static) nowait
for (int c = 0; c < c_lim; c++)
for (int d = 0; d < d_lim; d++)
local_sum += comp(c, d);
}
else
{
#pragma omp for schedule(static) nowait
for (int c = 0; c < c_lim; c++)
for (int d = 0; d < d_lim; d++)
local_sum += comp(c, d);
}
result_local[a][b] += local_sum;
}
}
// Add the result_local to result
}
Another more efficient strategy is to redesign the sequential algorithm to significantly reduce the amount of work.
Redesigning of the algorithm
One can note that comp(c, d) is recomputed with the same value several times (up to 100 times) and the same for result_local[a][b]++ or even f(b) (up to 1,000,000 times). In such cases, the generic solution is to memoize the results (see here for more information) to avoid expensive parts of the algorithm to be recomputed over and over.
Note that you cannot pre-compute all the needed comp(a, b) values: this solution would be too expensive in terms of memory usage (up to 10 Gio needed). Thus, the trick is to split the 2D space in tiles. Here is how the algorithm works:
compute all the f(a) and f(b) sequentially (100 values);
split the iteration space in tiles of reasonable size (eg. 100x100) and pre-compute all the required tiles that should be completely computed (possibly in parallel, although this is tedious);
compute the sum of all comp(a, b) for each tile (i.e. for a in [a_tile_begin;a_tile_end[ and b in [b_tile_begin;b_tile_end[) in parallel (each thread should work on several tiles) and write the sums in a shared array.
compute the final result using the tile sums (partial tiles are computed on the fly in this last step) in parallel.
This algorithm is definitively much more complex, but it should be up to 100 time faster than the above one since most operations are computed only once.
For your attempt of parallelizing the inner loops to have a chance to work, you need to do something about the data race to result_local:
If you have enough memory for every thread to have it's own private version of result_local, you might be able to specify reduction(+: result_local[:10][:10]) in the pragma, but I haven't used it with multidimensional arrays yet. You might have to use a linear array and "lexic indexing" (idx = a * 10 + b). If result_local is dynamically allocated (on the heap), this might be the better way of dealing with it anyway (better than some std::vector<std::vector<int>>, due to cache locality).
If comp is computationally intensive enough you might be better off by putting #pragma omp atomic update in front of result_local[a][b]++. This takes less memory. In your example with a * b == 100 memory is probably not an issue.
As branching inside the innermost loop can be bad for performance, you might want to try out if result_local[a][b] += comp(c, d); gives better performance, as addition is quite cheap.
I look for a better way to cancel my threads.
In my approach, I use a shared variable and if this variable is set, I just throw a continue. This finishes my threads fast, but threads keep theoretically spawning and ending, which seems not elegant.
So, is there a better way to solve the issue (break is not supported by my OpenMP)?
I have to work with Visual, so my OpenMP Lib is outdated and there is no way around that. Consequently, I think #omp cancel will not work
int progress_state = RunExport;
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
for (int j = 0; j < foo.y; j++)
for (int i = 0; i < foo.x; i++) {
if (progress_state == StopExport) {
continue;
}
// do some fancy shit
// yeah here is a condition for speed due to the critical
#pragma omp critical
if (condition) {
progress_state = StopExport;
}
}
}
You should do it the simple way of "just continue in all remaining iterations if cancellation is requested". That can just be the first check in the outermost loop (and given that you have several nested loops, that will probably not have any measurable overhead).
std::atomic<int> progress_state = RunExport;
// You could just write #pragma omp parallel for instead of these two nested blocks.
#pragma omp parallel
{
#pragma omp for
for (int k = 0; k < foo.z; k++)
{
if (progress_state == StopExport)
continue;
for (int j = 0; j < foo.y; j++)
{
// You can add break statements in these inner loops.
// OMP only parallelizes the outermost loop (at least given the way you wrote this)
// so it won't care here.
for (int i = 0; i < foo.x; i++)
{
// ...
if (condition) {
progress_state = StopExport;
}
}
}
}
}
Generally speaking, OMP will not suddenly spawn new threads or end existing ones, especially not within one parallel region. This means there is little overhead associated with running a few more tiny iterations. This is even more true given that the default scheduling in your case is most likely static, meaning that each thread knows its start and end index right away. Other scheduling modes would have to call into the OMP runtime every iteration (or every few iterations) to request more work, but that won't happen here. The compiler will basically see this code for the threaded work:
// Not real omp functions.
int myStart = __omp_static_for_my_start();
int myEnd = __omp_static_for_my_end();
for (int k = myStart; k < myEnd; ++k)
{
if (progress_state == StopExport)
continue;
// etc.
}
You might try a non-atomic thread-local "should I cancel?" flag that starts as false and can only be changed to true (which the compiler may understand and fold into the loop condition). But I doubt you will see significant overhead either way, at least on x86 where int is atomic anyway.
which seems not elegant
OMP 2.0 does not exactly shine with respect to elegance. I mean, iterating over a std::vector requires at least one static_cast to silence signed -> unsigned conversion warnings. So unless you have specific evidence of this pattern causing a performance problem, there is little reason not to use it.
Currently, somewhere deep in my code, I am working with a nested for-loop (N1=~10000, N2 = ~500, x,y= 10-50). I used the #pragma omp, to have OpenMP distribute my calculation on several cores.
#pragma omp parallel for
for (int i = 0; i < N1; ++i)
{
for (int j = 0; j < N2; ++j)
{
for (int k = x; k <= y; ++k)
{
// calculation
}
}
}
Now, my two innerloops becomes conditional
#pragma omp parallel for
for (int i = 0; i < N1; ++i)
{
if (toExecute[i])
{
for (int j = 0; j < N2; ++j)
{
for (int k = x; k <= y; ++k)
{
// calculation
}
}
}
}
The inner nested loop either takes a long time, or is immediately done. Of course I can omit the if-statement by replacing the outer-loop and if-statement with a shorter loop and lookup for the later indexing.
My question is: Is OpenMP smart enough to handle the if-statement within my outer loop, or do I have to do something manually?
I am currently using C++ in Visual Studio 2017 if that matters (I think the OpenMP version is a bit behind).
Ideally, you should let OpenMP handle that for you. But as always when you're doing performance stuffs, you have to try to see what is best for you. Indeed, you can gain great speedup by doing things manually. OpenMP is not omniscient, he does not know all the details and intelligence about your calculation.
If your calculation implies the same work of amount for any iteration then your condition is likely to lead to some different work load regarding the most outter loop. So theoritically, a dynamic scheduling should be more fitted
#pragma omp parallel for schedule(dynamic)
You could also try static or guided scheduling which might fit your calculation (I don't know the details of your calculation so I cannot say) and play with the granularity block.
An other test to do, if you can afford that (i.e. is it parallelizable ?), you should try to move the parallelization in the inner loops.
You can even nest the parallelization, it sometimes give nice speedup. Try and tune step by step, take time to see what gives you the best output. Just to remind you these tweaks are often not generic accross different architectures, so aim for a good tradeoff between performance and code reusability.
I try using OpenMP to parallel some for-loop of my program but failed to get significant speed improvement (actual degradation is observed). My target machine will have 4-6 cores and I currently rely on the OpenMP runtime to get the thread count for me, so I haven't tried any threadcount combination yet.
Target/Development platform: Windows 64bits
using MinGW64 4.7.2 (rubenvb build)
Sample output with OpenMP
Thread count: 4
Dynamic :0
OMP_GET_NUM_PROCS: 4
OMP_IN_PARALLEL: 1
5.612 // <- returned by omp_get_wtime()
5.627 (sec) // <- returned by clock()
Wall time elapsed: 5.62703
Sample output without OpenMP
2.415 (sec) // <- returned by clock()
Wall time elapsed: 2.415
How I measure the time
struct timeval start, end;
gettimeofday(&start, NULL);
#ifdef _OPENMP
double t1 = (double) clock();
double wt = omp_get_wtime();
sim->resetEnvironment(run);
tout << omp_get_wtime() - wt << std::endl;
timeEnd(tout, t1);
#else
double = (double) clock();
sim->resetEnvironment(run);
timeEnd(tout, t1);
#endif
gettimeofday(&end, NULL);
tout << "Wall time elapsed: "
<< ((end.tv_sec - start.tv_sec) * 1000000u + (end.tv_usec - start.tv_usec)) / 1.e6
<< std::endl;
The code
void Simulator::resetEnvironment(int run)
{
#pragma omp parallel
{
// (a)
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
for (int level = 0; level < level_count; level++) // (b) level = 3
{
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic)
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
} // end #parallel
} // end: Simulator::resetEnvironment()
Randomness
Inside reset() function calls, I used a RNG for seeding some agents for subsequent tasks.
Below is my RNG implementation, as I saw suggestion that using one RNG per per-thread for thread-safety.
class RNG {
public:
typedef std::mt19937 Engine;
RNG()
: real_uni_dist_(0.0, 1.0)
#ifdef _OPENMP
, engines()
#endif
{
#ifdef _OPENMP
int threads = std::max(1, omp_get_max_threads());
for (int seed = 0; seed < threads; ++seed)
engines.push_back(Engine(seed));
#else
engine_.seed(time(NULL));
#endif
} // end_ctor(RNG)
/** #return next possible value of the uniformed distribution */
double operator()()
{
#ifdef _OPENMP
return real_uni_dist_(engines[omp_get_thread_num()]);
#else
return real_uni_dist_(engine_);
#endif
}
private:
std::uniform_real_distribution<double> real_uni_dist_;
#ifdef _OPENMP
std::vector<Engine> engines;
#else
std::mt19937 engine_;
#endif
}; // end_class(RNG)
Question:
at (a), is it good to not using shortcut 'parallel for' to avoid the overhead of creating teams?
which part of my implementation can be the cause of degradation of performance?
Why the time reported by clock() and omp_get_wtime() are so similar, as I expected clock() would be somehow longer than omp_get_wtime()
[Edit]
at (b), my intention of including OpenMP directive in the inner loop is that the iteration for outer loop is so small (only 3) so I think I can skip that and go directly to the inner loop of looping the vector_4[level]. Is this thought inappropriate (or will this instruct the OpenMP to repeat the outer loop by 4 and hence actually looping the inner loop 12 instead of 3 (say the current thread count is 4)?
Thanks
If the measured wall-clock time (as reported by omp_get_wtime()) is close to the total CPU time (as reported by clock()), this could mean several different things:
the code is running single-threaded, but then the total CPU time will be lower than the wall-clock time;
a very high synchronisation and cache coherency overhead is present and it is huge in comparison to the actual work being done by the threads.
Your case is the second one and the reason is that you use schedule(dynamic). Dynamic scheduling should only be used in cases when each iteration can take a varying amount of time. If such iterations are statically distributed among the threads, work imbalance could occur. schedule(dynamic) takes care of this by giving each task (in your case each single iteration of the loop) to the next thread to finish its work and become idle. There is a certain overhead in synchronising the threads and bookkeeping the distribution of the work items and therefore it should only be used when the amount of work per thread is huge in comparison to the overhead. OpenMP allows you to group more iterations into iteration blocks and this number is specified like schedule(dynamic,100) - this would make each thread execute a block (or chunk) of 100 consecutive iterations before asking for a new one. The default block size for dynamic scheduling is 1, i.e. each vector element in processed by a separate thread. I have no idea how much processing is done in reset() and what kind of elements are there in vector_*, but given the serial run time it is not much at all.
Another source of slowdown is the loss of data locality when you use dynamic scheduling. Depending on the type of elements of those vectors, processing neighbouring elements by different threads leads to false sharing. That means that, e.g. vector_1[i] lies in the same cache line with some other elements of vector_1, e.g. vector_1[i-1] and vector_1[i+1]. When thread 1 modifies vector_1[i], the cache line is reloaded in all other cores that work on the neighbouring elements. If vector_1[] is only written to, the compiler can be smart enough to generate non-temporal stores (those bypass the cache) but it only works with vector stores and having each core do a single iteration at a time means no vectorisation at all. Data locality can be improved by either switching to static scheduling or, if reset() really takes varying amount of time, by setting a reasonable chunk size in the schedule(dynamic) clause. The best chunk size is usually dependent on the processor and often one has to tweak it in order to get the best performance.
So I would strongly suggest that you first switch to static scheduling by replacing all schedule(dynamic) to schedule(static) and then try to optimise further. You don't have to specify the chunk size in the static case as the default is simply the total number of iterations divided by the number of threads, i.e. each thread would get one contiguous block of iterations.
to answer your question:
1) in a) the usage of the "parallel" keyword is exact
2) Congrats, your impl of your lok-free PRNG looks fine
3) the error can come from all the OpenMP pragma you use in the inner loop . Parallel at the top level and avoid fine-grain and inner loop parallelism
4) In the code below, i used 'nowait' on each 'omp for', I put the omp directive out-of-the-loop in the vector_4 proccessing and put a barrier at the end to join all the thread and wiat for the end of all the job we spawn before !
// pseudo code
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
#pragma omp for schedule(dynamic) nowait
for (int level = 0; level < level_count; level++)
{
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic) nowait
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
#pragma omp barrier
A single threaded program will run faster than a multi-threaded one if the useful processing time is less than the overhead incurred by threads.
It is a good idea to determine what the overhead is by implementing a null function and then deciding whether it is a better solution.
From a performance point of view, threads are only useful if the useful processing time is significantly higher than the overhead that is incurred by threads and there are real cpus available to run the threads.