I am trying to understand how to use the schedule(runtime) clause of OpenMP in C++. After some research I found OMP_SCHEDULE(1) and OMP_SCHEDULE(2).
I concluded that I need to set the environment variable OMP_SCHEDULE to a certain value.
However, I don't know how to do that, and I have not found any working C++ examples that explain how to do it correctly.
Can someone explain how to set this variable and provide a working C++ example?
There are four scheduling kinds in OpenMP: static, dynamic, guided, and runtime. Each has its advantages; scheduling exists to achieve better load balancing among the threads.
I will give some examples for static and dynamic scheduling; guided scheduling works similarly.
The schedule(runtime) clause tells OpenMP to choose the schedule at run time from the OMP_SCHEDULE environment variable, which can be set to any of the other scheduling kinds (optionally with a chunk size). In csh, for example:
setenv OMP_SCHEDULE "dynamic,5"
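For completeness, here is a minimal sketch of a C++ program that uses schedule(runtime); the array size and the messages are arbitrary choices for the illustration. The schedule is read from OMP_SCHEDULE at run time (e.g. export OMP_SCHEDULE="dynamic,5" in bash), or it can be set programmatically with omp_set_schedule().

#include <cstdio>
#include <omp.h>

int main() {
    const int n = 100;
    float a[n];

    // Because of schedule(runtime), the schedule kind and chunk size
    // are taken from OMP_SCHEDULE when the loop is entered.
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
        std::printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}

Compile with something like g++ -fopenmp example.cpp, then run it with different OMP_SCHEDULE values to see how the iteration-to-thread mapping changes.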
Static Scheduling
Static scheduling is used when you know at compile time that each thread will do more or less the same amount of work.
For example, the following code can be parallelized with OpenMP. Let's assume that we use only 4 threads.
With the default static scheduling and the pragma placed on the outer for loop, each thread will do 25% of the outer-loop (i) work and an equal amount of the inner-loop (j) work, so the total amount of work done by each thread is the same. Hence we can simply stick with the default static scheduling, which gives optimal load balancing.
float A[100][100];
for(int i = 0; i < 100; i++)
{
    for(int j = 0; j < 100; j++)
    {
        A[i][j] = 1.0f;
    }
}
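To make this concrete, here is a minimal sketch with the pragma placed on the outer loop (the schedule clause is written out explicitly; leaving it out usually gives the same behaviour, since the default schedule is implementation-defined but typically static):

float A[100][100];
// Each of the 4 threads gets a contiguous block of 25 outer iterations.
#pragma omp parallel for schedule(static)
for(int i = 0; i < 100; i++)
{
    for(int j = 0; j < 100; j++)
    {
        A[i][j] = 1.0f;
    }
}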
Dynamic Scheduling
Dynamic scheduling is used when you know that the threads would not do the same amount of work under static scheduling.
For example, in the following code,
float A[100][100];
for(int i = 0; i < 100; i++)
{
    for(int j = 0; j < i; j++)
    {
        A[i][j] = 1.0f;
    }
}
The inner loop variable j depends on i. If you use the default static scheduling, the outer-loop (i) work might be divided equally among the 4 threads, but the inner-loop (j) work will be large for some of them. The threads then do unequal amounts of work, and static scheduling does not give optimal load balancing. Hence we switch to dynamic scheduling (the iterations are handed out at run time), which lets the code achieve a better load balance.
Note: you can also specify a chunk_size for the schedule; a sensible value depends on the loop size.
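For example, a minimal sketch of the triangular loop above with a dynamic schedule and a chunk size of 5 (an arbitrary choice here): whichever thread is free grabs the next block of 5 outer iterations.

float A[100][100];
#pragma omp parallel for schedule(dynamic, 5)
for(int i = 0; i < 100; i++)
{
    for(int j = 0; j < i; j++)
    {
        A[i][j] = 1.0f;
    }
}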
I am currently working on parallelizing a nested for loop using C++ and OpenMP. Without going into the actual details of the program, I have constructed a basic example of the concepts I am using below:
float var = 0.f;
std::vector<float> distance; // = some float array
std::vector<float> temp;     // = some float array
for(int i = 0; i < distance.size(); i++){
    // some work
    for(int j = 0; j < temp.size(); j++){
        var += temp[i]/distance[j];
    }
}
I attempted to parallelize the above code in the following way:
float var = 0.f;
std::vector<float> distance; // = some float array
std::vector<float> temp;     // = some float array
#pragma omp parallel for default(shared)
for(int i = 0; i < distance.size(); i++){
    // some work
    #pragma omp parallel for reduction(+:var)
    for(int j = 0; j < temp.size(); j++){
        var += temp[i]/distance[j];
    }
}
I then compared the serial program output with the parallel program output and got incorrect results. I know that this is mainly due to the fact that floating-point arithmetic is not associative. But are there any workarounds to this that give exact results?
Although the lack of associativity of floating point arithmetic might be an issue in some cases, the code you show here exposes a much more essential problem which you need to address first: the status of the var variable in the outer loop.
Indeed, since var is modified inside the i loop, even if only in the j part of the i loop, it needs to be "privatized" somehow. Now the exact status it needs to get depends on the value you expect it to store upon exit of the enclosing parallel region:
If you don't care about its value at all, just declare it private (or better, declare it inside the parallel region).
If you need its final value at the end of the i loop, and considering that it accumulates a sum of values, you'll most likely need to declare it reduction(+:var), although lastprivate might also be what you want (impossible to say without further details).
If private or lastprivate was all you needed, but you also need its initial value upon entrance of the parallel region, then you'll have to consider adding firstprivate too (no need for that if you went for reduction, as it is already taken care of).
That should be enough for fixing your issue.
Now, in your snippet, you also parallelized the inner loop. Going for nested parallelism is usually a bad idea, so unless you have a very compelling reason for doing so, you will likely get much better performance by only parallelizing the outer loop and leaving the inner loop alone. That doesn't mean the inner loop won't benefit from the parallelization, but rather that several instances of the inner loop will be computed in parallel (each one being sequential admittedly, but the whole process is parallel).
A nice side effect of removing the inner loop's parallelization (in addition to making the code faster) is that all accumulations into the private var variables are now done in the same order as in the sequential code. Therefore, your (hypothetical) floating-point arithmetic issues inside the outer loop will have disappeared, and only if you need the final reduction upon exit of the parallel region might you still face them there.
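To illustrate, here is a minimal sketch of what the fixed code could look like, assuming you want the accumulated value of var after the loop and keeping the distance and temp containers from the question:

float var = 0.f;
// Parallelize only the outer loop; the reduction combines the per-thread copies of var.
#pragma omp parallel for default(shared) reduction(+:var)
for(int i = 0; i < distance.size(); i++){
    // some work
    for(int j = 0; j < temp.size(); j++){
        var += temp[i]/distance[j];
    }
}

Each thread accumulates into its own private copy of var, in the same order as the sequential code for its chunk of i values, and the partial sums are combined once at the end of the region.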
I'm trying to parallelize a simulator written in C++ using OpenMP pragmas.
I have a basic understanding of it but no experience.
The code below shows the main method to parallelize:
void run(long long end) {
    while (now + dt <= end) {
        now += dt;
        for (unsigned int i = 0; i < populations.size(); i++) {
            populations[i]->update(now);
        }
    }
}
where populations is a std::vector of instances of the class Population. Each population updates its own elements as follows:
void Population::update(long long ts) {
    for (unsigned int j = 0; j < this->size(); j++) {
        if (check(j, ts)) {
            doit(ts, j);
        }
    }
}
Since each population has a different size, the loop in Population::update() takes a varying amount of time, leading to suboptimal speedups. By adding #pragma omp parallel for schedule(static) in the run() method, I get a 2x speedup with 4 threads; however, it drops for 8 threads.
I am aware of the schedule(dynamic) clause, which allows balancing the computation between the threads. However, when I tried to dispatch the iterations dynamically I did not observe any improvement.
Am I going in the right direction? Do you think playing with the chunk size would help? Any suggestion is appreciated!
So there are two things to distinguish: the influence of the number of threads, and the scheduling policy.
For the number of threads: having more threads than cores usually slows performance down because of the context switches, so it depends on the number of cores you have on your machine.
The difference between the code generated (at least as far as I remember) for static and dynamic scheduling is that with static scheduling the loop iterations are divided equally among the threads up front, whereas with dynamic scheduling the distribution is computed at run time (after finishing its current chunk of iterations, a thread queries the OpenMP runtime again, via __builtin_GOMP_loop_dynamic_next in GCC's case).
The slowdown observed when switching to dynamic may be because the loop doesn't contain enough iterations/computation, so the overhead of computing the iteration distribution dynamically is not covered by the gain in performance.
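For illustration, here is a minimal sketch of run() with a dynamic schedule, using the members from your snippet (the chunk size of 1 is just an example value; whether it pays off depends on how many populations there are and how expensive each update is):

void run(long long end) {
    while (now + dt <= end) {
        now += dt;
        // Whichever thread finishes its current population grabs the next one,
        // instead of being tied to a fixed static partition.
        #pragma omp parallel for schedule(dynamic, 1)
        for (unsigned int i = 0; i < populations.size(); i++) {
            populations[i]->update(now);
        }
    }
}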
(I assumed that the population instances do not share data with each other.)
Just throwing out ideas, hope this helps =)
Suppose I have the following function, which makes use of #pragma omp parallel internally.
void do_heavy_work(double * input_array);
I now want to do_heavy_work on many input_arrays thus:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    for (int i = 0; i < num_arrays; ++i)
    {
        do_heavy_work(input_arrays[i]);
    }
}
Let's say I have N hardware threads. The implementation above would cause num_arrays invocations of do_heavy_work to occur in a serial fashion, each using all N threads internally to do whatever parallel thing it wants.
Now assume that when num_arrays > 1 it is actually more efficient to parallelise over this outer loop than it is to parallelise internally in do_heavy_work. I now have the following options.
Put #pragma omp parallel for on the outer loop and set OMP_NESTED=1. However, by setting OMP_NUM_THREADS=N I will end up with a large total number of threads (N*num_arrays) to be spawned.
As above but turn off nested parallelism. This wastes available cores when num_arrays < N.
Ideally I want OpenMP to split its team of OMP_NUM_THREADS threads into num_arrays subteams, and then each do_heavy_work can thread over its allocated subteam if given some.
What's the easiest way to achieve this?
(For the purpose of this discussion let's assume that num_arrays is not necessarily known beforehand, and also that I cannot change the code in do_heavy_work itself. The code should work on a number of machines so N should be freely specifiable.)
OMP_NUM_THREADS can be set to a list, thus specifying the number of threads at each level of nesting. E.g. OMP_NUM_THREADS=10,4 will tell the OpenMP runtime to execute the outer parallel region with 10 threads and each nested region will execute with 4 threads for a total of up to 40 simultaneously running threads.
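For instance, here is a minimal sketch that makes the two levels visible (it assumes nested parallelism is enabled, e.g. with OMP_NESTED=true or OMP_MAX_ACTIVE_LEVELS=2); running it with OMP_NUM_THREADS=10,4 should report an outer team of 10 threads, each with an inner team of 4:

#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel          // outer level: first value of the list (10)
    {
        #pragma omp parallel      // nested level: second value of the list (4)
        {
            #pragma omp single
            std::printf("outer thread %d has an inner team of %d threads\n",
                        omp_get_ancestor_thread_num(1), omp_get_num_threads());
        }
    }
    return 0;
}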
Alternatively, you can make your program adaptive with code similar to this one:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    #pragma omp parallel num_threads(num_arrays)
    {
        int nested_team_size = omp_get_max_threads() / num_arrays;
        omp_set_num_threads(nested_team_size);

        #pragma omp for
        for (int i = 0; i < num_arrays; ++i)
        {
            do_heavy_work(input_arrays[i]);
        }
    }
}
This code will not use all available threads if the value of OMP_NUM_THREADS is not divisible by num_arrays. If having a different number of threads per nested region is fine (it could result in some arrays being processed faster than others), come up with a way to distribute the threads and set nested_team_size in each thread accordingly. Calling omp_set_num_threads() from within a parallel region only affects nested regions started by the calling thread, so you can have different nested team sizes.
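One simple way to do that (just a sketch of one possible distribution, not taken from the answer above) is to give the first few outer threads one extra inner thread each:

void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    #pragma omp parallel num_threads(num_arrays)
    {
        int tid   = omp_get_thread_num();    // index of this outer thread
        int avail = omp_get_max_threads();   // thread budget for the nested level
        int base  = avail / num_arrays;
        int rem   = avail % num_arrays;
        // The first 'rem' outer threads get base+1 inner threads, the rest get base
        // (at least 1, in case num_arrays exceeds the thread budget).
        int nested_team_size = base + (tid < rem ? 1 : 0);
        omp_set_num_threads(nested_team_size > 0 ? nested_team_size : 1);

        #pragma omp for
        for (int i = 0; i < num_arrays; ++i)
        {
            do_heavy_work(input_arrays[i]);
        }
    }
}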
Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, when parallelizing over a large number of threads, I assume the second implementation would be more efficient, since initializing a variable (e.g. zeroing a register with xor) is cheaper than copying its value out to all the threads.
There is not much of a difference in performance among the three versions you presented, since each one of them uses #pragma omp parallel for. OpenMP automatically assigns the for iterations to different threads, so the variable i becomes private to each thread and each thread gets a different range of iterations to work with. The loop variable i is automatically made private in order to avoid race conditions when it is updated. Since i will be private in the parallel for anyway, there is no need to add private(i) to the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath #pragma omp parallel for to have the following format:
for(init-expr; test-expr; incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block (OpenMP Application Program Interface, pp. 39-40)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see -> version 2 and version 3.
I am trying OpenMP on a particular code snippet. Not sure if the snippet needs a revamp, perhaps it is set up too rigidly for sequential implementation. Anyway here is the (pseudo-)code that I'm trying to parallelize:
#pragma omp parallel for private(id, local_info, current_local_cell_id, local_subdomain_size) shared(cells, current_global_cell_id, global_id)
for(id = 0; id < grid_size; ++id) {
    local_info = cells.get_local_subdomain_info(id);
    local_subdomain_size = local_info.size();
    ...do other stuff...
    do {
        current_local_cell_id = cells.get_subdomain_cell_id(id);
        global_id.set(id, current_global_cell_id + current_local_cell_id);
    } while(id < local_subdomain_size && ++id);
    current_global_cell_id += local_subdomain_size;
}
This makes complete sense (after staring at it for some time) in a sequential sense, which also might mean that it needs to be re-written for OpenMP. My concern is that current_local_cell_id and local_subdomain_size are private, but current_global_cell_id and global_id are shared.
Hence the statement current_global_cell_id += local_subdomain_size after the inner loop:
do {
...
} while(...)
current_global_cell_id += local_subdomain_size;
might lead to errors in the OpenMP setting, I suspect. I would greatly appreciate it if any of the OpenMP experts out there could provide some pointers on special OpenMP directives I can use to make minimal changes to the code while still benefiting from OpenMP for this type of for loop.
I'm not sure I understand your code. However, I think you really want some kind of parallel accumulation.
You could use a pattern like
size_t total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < MAXITEMS; i++)
{
    total += getvalue(i); // TODO replace with your logic
}
// total has been 'magically' combined by OMP
On a related note, when you use gcc you can just use the __gnu_parallel::accumulate drop-in replacement for std::accumulate, which does exactly the same. See Chapter 18. Parallel Mode
size_t total = __gnu_parallel::accumulate(c.begin(), c.end(), 0, &myvalue_accum);
You can even compile with -D_GLIBCXX_PARALLEL, which will make all uses of std algorithms automatically parallelized where possible. Don't use that unless you know what you're doing! Frequently performance just suffers, and the chance of introducing bugs due to unexpected parallelism is real.
Changing id inside the loop is not correct. There is no way to dispatch the loop iterations to different threads, as the loop step does not produce predictable id values.
Why are you using id inside that do-while loop?