OpenMP and unbalanced nested loops - c++

I'm trying to parallelize a simulator written in C++ using OpenMP pragmas.
I have a basic understanding of it but no experience.
The code below shows the main method to parallelize:
void run(long long end) {
    while (now + dt <= end) {
        now += dt;
        for (unsigned int i = 0; i < populations.size(); i++) {
            populations[i]->update(now);
        }
    }
}
where populations is a std::vector of pointers to instances of the class Population. Each population updates its own elements as follows:
void Population::update(long long ts) {
    for (unsigned int j = 0; j < this->size(); j++) {
        if (check(j, ts)) {
            doit(ts, j);
        }
    }
}
Since each population has a different size, the loop in Population::update() takes a varying amount of time, leading to suboptimal speedups. By adding #pragma omp parallel for schedule(static) in the run() method, I get a 2X speedup with 4 threads; however, it drops for 8 threads.
I am aware of the schedule(dynamic) clause, which allows the computation to be balanced across the threads. However, when I tried dynamic scheduling I did not observe any improvement.
Am I going in the right direction? Do you think playing with the chunk size would help? Any suggestion is appreciated!

There are two things to distinguish: the influence of the number of threads, and the scheduling policy.
For the number of threads, having more threads than cores usually slows down performance because of context switches, so it also depends on how many cores your machine has.
The difference between the code generated (at least as far as I remember) for static and dynamic scheduling is that with static scheduling the loop iterations are divided equally among the threads ahead of time, whereas with dynamic scheduling the distribution is computed at runtime (after finishing its current chunk, a thread queries the OpenMP runtime for the next one; with GCC this goes through __builtin_GOMP_loop_dynamic_next).
The slowdown observed when switching to dynamic may be because the loop doesn't contain enough iterations/computation, so the overhead of computing the iteration distribution dynamically is not covered by the gain in performance.
(I am assuming that the population instances don't share data with one another.)
Just throwing out ideas, hope this helps =)
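As a rough sketch (assuming, as noted above, that the populations do not share data), dynamic scheduling with an explicit chunk size would look like this in run():

void run(long long end) {
    while (now + dt <= end) {
        now += dt;
        // Chunk size 1: each thread grabs one population at a time, so a
        // thread that finished a small population can pick up a large one.
        #pragma omp parallel for schedule(dynamic, 1)
        for (unsigned int i = 0; i < populations.size(); i++) {
            populations[i]->update(now);
        }
    }
}

If the per-step work is small, the cost of creating the thread team on every while iteration can dominate; hoisting the parallel construct outside the while loop and keeping only an omp for (plus a single construct for the now += dt update) inside is one way to reduce that.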

Related

Using TBB for a simple example

I am new to TBB and am trying a simple experiment.
The data for the functions is:
int n = 9000000;
int *data = new int[n];
First, I created a version that does not use TBB:
void _array(int* &data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = busyfunc(data[i])*123;
    }
}
It takes 0.456635 seconds.
I also created a second version, this one using TBB:
void parallel_change_array(int* &data, int list_count) {
    // Instructional example - parallel version
    parallel_for(blocked_range<int>(0, list_count),
                 [=](const blocked_range<int>& r) {
                     for (int i = r.begin(); i < r.end(); i++) {
                         data[i] = busyfunc(data[i])*123;
                     }
                 });
}
It takes 0.584889 seconds.
As for busyfunc(int m):
int busyfunc(int m)
{
    m *= 32;
    return m;
}
Can you tell me why the function without TBB takes less time than the one with TBB?
I think the problem is that the function is simple and easy to compute without TBB.
First, busyfunc() does not seem so busy, since 9M elements are computed in just half a second, which makes this example rather memory bound (uncached memory operations take orders of magnitude more cycles than arithmetic operations). Memory-bound computations do not scale as well as compute-bound ones; e.g. plain memory copying usually scales no more than, say, 4x even on a much bigger number of cores/processors.
Also, memory-bound programs are more sensitive to NUMA effects, and since you allocated this array as contiguous memory using standard C++, by default it will be allocated entirely on the memory node where the initialization occurs. This default can be altered by running with numactl -i all --.
And the last, but most important, thing is that TBB initializes its threads lazily and rather slowly. I guess you do not intend to write an application that exits after 0.5 seconds spent on parallel computation. Thus, a fair benchmark should take into account all the warm-up effects that are expected in the real application. At the very least, it has to wait until all the threads are up and running before starting measurements. This answer suggests one way to do that; a rough warm-up sketch follows.
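A minimal warm-up sketch (my own illustration; it assumes parallel_change_array and busyfunc from the question are in scope):

#include <iostream>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/tick_count.h>

int main() {
    int n = 9000000;
    int *data = new int[n];

    // Throwaway parallel loop: forces TBB's worker threads to be created
    // (and touches the memory) before the measured region starts.
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
                      [=](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i < r.end(); ++i) data[i] = 0;
                      });

    // Only now start the clock around the code being benchmarked.
    tbb::tick_count t0 = tbb::tick_count::now();
    parallel_change_array(data, n);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << (t1 - t0).seconds() << " s\n";

    delete[] data;
    return 0;
}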
[update] Please also refer to Alexey's answer for another possible reason lurking in compiler optimization differences.
In addition to Anton's answer, I recommend checking whether the compiler was able to optimize the code equivalently.
To start, check the performance of the TBB version executed by a single thread, without real parallelism. You can use tbb::global_control or tbb::task_scheduler_init to limit the number of threads to 1, e.g.
tbb::global_control ctl(tbb::global_control::max_allowed_parallelism, 1);
The overhead of thread creation, as well as cache locality or NUMA effects, should not play a role when all the code is executed by a single thread. Therefore you should see approximately the same performance as for the no-TBB version. If you do, then you have a scalability issue, and Anton explained the possible reasons.
However, if you see that performance drops a lot, then it is a serial optimization issue. One known reason is that some compilers cannot optimize the loop over a blocked_range as well as they optimize the original loop; it has also been observed that storing r.end() into a local variable may help:
int rend = r.end();
for (int i = r.begin(); i < rend; i++) {
    data[i] = busyfunc(data[i])*123;
}

How to split OpenMP threads into subteams over a loop

Suppose I have the following function, which makes use of #pragma omp parallel internally.
void do_heavy_work(double * input_array);
I now want to call do_heavy_work on many input arrays, thus:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    for (int i = 0; i < num_arrays; ++i)
    {
        do_heavy_work(input_arrays[i]);
    }
}
Let's say I have N hardware threads. The implementation above would cause num_arrays invocations of do_heavy_work to occur in a serial fashion, each using all N threads internally to do whatever parallel thing it wants.
Now assume that when num_arrays > 1 it is actually more efficient to parallelise over this outer loop than it is to parallelise internally in do_heavy_work. I now have the following options.
Put #pragma omp parallel for on the outer loop and set OMP_NESTED=1. However, with OMP_NUM_THREADS=N I will end up with a large total number of threads (N*num_arrays) being spawned.
As above but turn off nested parallelism. This wastes available cores when num_arrays < N.
Ideally I want OpenMP to split its team of OMP_NUM_THREADS threads into num_arrays subteams, and then each do_heavy_work can thread over its allocated subteam if given some.
What's the easiest way to achieve this?
(For the purpose of this discussion let's assume that num_arrays is not necessarily known beforehand, and also that I cannot change the code in do_heavy_work itself. The code should work on a number of machines so N should be freely specifiable.)
OMP_NUM_THREADS can be set to a list, thus specifying the number of threads at each level of nesting. E.g. OMP_NUM_THREADS=10,4 will tell the OpenMP runtime to execute the outer parallel region with 10 threads and each nested region will execute with 4 threads for a total of up to 40 simultaneously running threads.
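For instance, a small standalone check of the two-level setting (this test program is my own illustration, not from the question):

#include <cstdio>
#include <omp.h>

int main() {
    // Run with e.g.:  OMP_NESTED=true OMP_NUM_THREADS=10,4 ./a.out
    #pragma omp parallel
    {
        int outer_id = omp_get_thread_num();
        #pragma omp single
        printf("outer team: %d threads\n", omp_get_num_threads());

        #pragma omp parallel
        {
            #pragma omp single
            printf("outer thread %d: nested team of %d threads\n",
                   outer_id, omp_get_num_threads());
        }
    }
    return 0;
}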
Alternatively, you can make your program adaptive with code similar to this one:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    #pragma omp parallel num_threads(num_arrays)
    {
        int nested_team_size = omp_get_max_threads() / num_arrays;
        omp_set_num_threads(nested_team_size);

        #pragma omp for
        for (int i = 0; i < num_arrays; ++i)
        {
            do_heavy_work(input_arrays[i]);
        }
    }
}
This code will not use all available threads if the value of OMP_NUM_THREADS is not divisible by num_arrays. If having a different number of threads per nested region is fine (it could result in some arrays being processed faster than others), come up with a way to distribute the threads and set nested_team_size in each thread accordingly; one possible sketch follows below. Calling omp_set_num_threads() from within a parallel region only affects nested regions started by the calling thread, so different threads can have different nested team sizes.
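One possible way to hand out the remainder threads (a sketch under the same assumptions as the code above):

#pragma omp parallel num_threads(num_arrays)
{
    int tid  = omp_get_thread_num();
    int base = omp_get_max_threads() / num_arrays;   // threads everyone gets
    int rem  = omp_get_max_threads() % num_arrays;   // leftover threads
    // The first 'rem' outer threads get one extra thread in their nested team.
    omp_set_num_threads(base + (tid < rem ? 1 : 0));

    #pragma omp for
    for (int i = 0; i < num_arrays; ++i)
        do_heavy_work(input_arrays[i]);
}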

C++ dynamic memory allocation is slower in OpenMP, even for non-parallel sections of code

I have run into a rather frustrating problem with OpenMP: if OpenMP is used in parallel mode somewhere in the code (with more than one thread), then dynamic memory allocation/deallocation becomes slower even in the non-parallel portions of the code. Here is an example program (just an illustration):
#include <cmath>
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    #pragma omp parallel
    {
        // Just to get OpenMP going
    }
    double wtime0, wtime;
    wtime0 = omp_get_wtime();
    double **stuff;
    const int N = 1000000;
    stuff = new double*[N];
    for (int i = 0; i < N; i++) stuff[i] = new double;
    for (int i = 0; i < N; i++) *(stuff[i]) = sqrt(double(i));
    for (int i = 0; i < N; i++) delete stuff[i];   // scalar delete: each element was a single new double
    delete[] stuff;
    wtime = omp_get_wtime() - wtime0;
    cout << "Total CPU time: " << wtime << endl;
}
When I run this code with one thread on my laptop (an Intel Core 2 Duo), I get a time of 0.093 seconds. If I run it with two threads, the time increases to 0.13 seconds. The more pointer allocations there are, the worse the discrepancy becomes. If I were to replace "stuff" in the above code with a simple array, e.g.
double stuff2[N];
for (int i=0; i < N; i++) stuff2[i] = sqrt(i);
then there is no discrepancy. Can someone tell me why this problem exists when pointers are allocated/deallocated, even though it is not done in parallel? The reason this matters is that in the real code I am working with, dynamic memory allocation is essential. There are sections that can be sped up by running in parallel, but (with two threads versus one) the gain is more than offset by the memory allocation/deallocation slowing down considerably, even in the non-parallel sections. If someone with extensive OpenMP experience can tell me how to get around this problem, I would really appreciate it. (Worst case, I can just use MPI instead, but I would love it if this could be solved within OpenMP.)
Thanks in advance for the help.
Yes, this is conceivable. In general, one should avoid naive dynamic allocation in a multi-threaded environment, as there is a single lock around the default heap. MT-aware allocators provide much better performance and should be preferred in allocation-heavy scenarios (see the example below).
This is exactly why I always frown upon code that just uses vectors, strings, or shared pointers as class members without letting users specify an allocation policy.
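For example (one concrete option among several; this sketch uses TBB's scalable_malloc/scalable_free, and other MT-aware allocators such as jemalloc or tcmalloc work similarly), the per-element allocations from the question could go through a thread-aware allocator:

#include <cmath>
#include <tbb/scalable_allocator.h>   // scalable_malloc / scalable_free, link with -ltbbmalloc

int main() {
    const int N = 1000000;
    double **stuff = new double*[N];
    // Small allocations are served from per-thread pools instead of a
    // single lock-protected global heap.
    for (int i = 0; i < N; i++)
        stuff[i] = static_cast<double*>(scalable_malloc(sizeof(double)));
    for (int i = 0; i < N; i++)
        *(stuff[i]) = std::sqrt(double(i));
    for (int i = 0; i < N; i++)
        scalable_free(stuff[i]);
    delete[] stuff;
    return 0;
}

Alternatively, linking against a drop-in replacement such as TBB's tbbmalloc_proxy library swaps out malloc/new globally without changing the code.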

Using OMP_SCHEDULE with #pragma omp for parallel schedule(runtime)

I am trying to understand how to use the schedule(runtime) clause of OpenMP in C++. After some research I found the OMP_SCHEDULE environment variable.
I concluded that I need to set the variable OMP_SCHEDULE to a certain value.
However, I don't know how to do that, and I have not found any working C++ examples that explain how to do it correctly.
Can someone explain how to set the variable and provide a working C++ example?
There are four schedule kinds you will typically use with OpenMP's schedule clause: static, dynamic, guided and runtime. Each has its advantages; scheduling exists to achieve better load balancing among the threads.
I will give some examples for static and dynamic scheduling; guided works similarly.
The schedule(runtime) clause tells OpenMP to take the schedule from the OMP_SCHEDULE environment variable, which can be set to any of the other scheduling kinds. For example, it can be set with
setenv OMP_SCHEDULE "dynamic,5"
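A small, self-contained C++ example of schedule(runtime) (the array and its size here are made up for illustration):

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000;
    double a[N];

    // The schedule is not fixed in the code; it is read from OMP_SCHEDULE
    // when the loop starts, e.g.  OMP_SCHEDULE="dynamic,5" ./a.out
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
    }

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}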
Static Scheduling
Static scheduling is used when you know at compile time that each thread will do more or less the same amount of work.
For example, the following code can be parallelized with OpenMP. Let's assume that we use only 4 threads.
If we use the default static scheduling and place the pragma on the outer for loop, each thread will do 25% of the outer loop (i) work and an equal amount of inner loop (j) work, so the total amount of work done by each thread is the same. Hence we can simply stick with the default static scheduling to get optimal load balancing.
float A[100][100];
for (int i = 0; i < 100; i++)
{
    for (int j = 0; j < 100; j++)
    {
        A[i][j] = 1.0f;
    }
}
Dynamic Scheduling
Dynamic scheduling is used when you know that the threads would not do the same amount of work under static scheduling.
In the following code, for example,
float A[100][100];
for (int i = 0; i < 100; i++)
{
    for (int j = 0; j < i; j++)
    {
        A[i][j] = 1.0f;
    }
}
the inner loop variable j depends on i. With the default static scheduling, the outer loop (i) iterations might be divided equally between the 4 threads, but the inner loop (j) work will be much larger for some threads than for others. This means the threads do not do equal amounts of work, so static scheduling won't give optimal load balancing here. Hence we switch to dynamic scheduling (the iterations are handed out at run time), which balances the load much better across the threads.
Note: you can also specify a chunk_size for the schedule; a suitable value depends on the loop size (see the example below).
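For example, using the array A from the snippet above, the chunk size is passed as the second argument of the schedule clause (the value 4 here is arbitrary):

// Threads grab the outer iterations 4 at a time.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < 100; i++)
{
    for (int j = 0; j < i; j++)
    {
        A[i][j] = 1.0f;
    }
}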

OpenMP parallel thread

I need to parallelize this loop. I thought that using OpenMP was a good idea, but I have never studied it before.
#pragma omp parallel for
for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
     it != mesh->NEList[vid].end(); ++it) {
    worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses an iterator and the compiler cannot understand how to split it.
Can you help me?
OpenMP requires that the controlling predicate of a parallel for loop uses one of the following relational operators: <, <=, > or >=. Only random access iterators provide these operators, and hence OpenMP parallel loops work only with containers that provide random access iterators. std::set provides only bidirectional iterators. You can overcome that limitation by using explicit tasks. The reduction can then be performed by first reducing partially into variables private to each thread, followed by a global reduction over the partial values.
double *t_worst_q;
// Cache line size on x86/x64, expressed in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);

#pragma omp parallel
{
    #pragma omp single
    {
        t_worst_q = new double[omp_get_num_threads() * cb];
        for (int i = 0; i < omp_get_num_threads(); i++)
            t_worst_q[i * cb] = worst_q;
    }

    // Perform partial min reduction using tasks
    #pragma omp single
    {
        for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
             it != mesh->NEList[vid].end(); ++it) {
            size_t elem = *it;
            #pragma omp task
            {
                int tid = omp_get_thread_num();
                t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
                                               mesh->element_quality(elem));
            }
        }
    }

    // Perform global reduction
    #pragma omp critical
    {
        int tid = omp_get_thread_num();
        worst_q = std::min(worst_q, t_worst_q[tid * cb]);
    }
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The value pointed to by it is dereferenced before the task body. If it were dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration), which is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so as to end up in separate cache lines. On x86/x64 a cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require compiler which supports OpenMP 3.0 or 3.1. This rules out all versions of Microsoft C/C++ Compiler (it only supports OpenMP 2.0).
Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
    neListVector.push_back(&entry);
}

// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, still has the same good performance on small elements and increased performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP tasks in their answer, I want to comment on that too. Tasks are a very powerful construct, but they have a large overhead (which even increases with the number of threads) and in this case they just make things more complex.
For the min reduction alone, the use of tasks is never justified, because creating a task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality, you might think that the dynamic nature of tasks can help with load balancing, in case the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case there is a simpler solution: simply use dynamic scheduling by adding the schedule(dynamic) clause to the parallel for line in one of my previous solutions. It achieves the same behaviour with far less overhead.
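For completeness, a sketch of that combination (the copy-to-vector solution from above plus dynamic scheduling):

// Copy the set into a random-access container
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());

// Dynamic scheduling: threads that finish cheap elements immediately grab
// the next iteration instead of idling.
#pragma omp parallel for reduction(min : worst_q) schedule(dynamic)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}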