Performance of matrix multiplications remains unchanged with OpenMP in C++

auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
for(int i=0;i<n;i++)
{
#pragma omp for collapse(2)
for(int j=0;j<n;j++)
{
for(int k=0;k<n;k++)
{
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
auto t2 = chrono::steady_clock::now();
auto t = std::chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
With and without the parallelization, the variable t remains fairly constant. I am not sure why this is happening. Also, once in a while t is output as 0.
Another problem I am facing is that if I increase the value of n to something like 500, the program cannot be run. (Here I've taken n=100.)
I am using Code::Blocks with the GNU GCC compiler.

The proposed OpenMP parallelization is not correct and may lead to wrong results. With collapse(2), the (j,k) iterations are distributed among the threads and executed "simultaneously". If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] into the same array location C[i][j]. This is a so-called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). A data race does not necessarily produce wrong results on every run, but the code is not valid OpenMP, and whether the results are wrong depends on several factors (scheduling, compiler implementation, number of threads, ...). To fix the problem in the code above, OpenMP offers the reduction clause:
#pragma omp parallel
{
for(int i=0;i<n;i++) {
#pragma omp for collapse(2) reduction(+:C)
for(int j=0;j<n;j++) {
for(int k=0;k<n;k++) {
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that reduction on arrays in C/C++ is directly supported by the standard since OpenMP 4.5 (check whether your compiler supports it; otherwise there are older manual ways to achieve it, see Reducing on array in OpenMp).
However, for the given code, it is probably better to avoid parallelizing the innermost loop, so that the reduction is not needed at all:
#pragma omp parallel
{
#pragma omp for collapse(2)
for(int i=0;i<n;i++) {
for(int j=0;j<n;j++) {
for(int k=0;k<n;k++) {
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
The serial version can be faster than the OpenMP version for small matrices and/or a small number of threads.
On my Intel machine using up to 16 cores, with n=1000 and GNU compiler v6.1, the break-even point is around 4 cores when the -O3 optimization is activated, and around 2 cores when compiling with -O0. For clarity, here are the performances I measured:
Serial           418020
-----------  WRONG ORIG  +REDUCTION  OUTER.COLLAPSE  OUTER.NOCOLLAPSE
OpenMP-1        1924950     2841993         1450686           1455989
OpenMP-2         988743     2446098          747333            745830
OpenMP-4         515266     3182262          396524            387671
OpenMP-8         280285     5510023          219506            211913
OpenMP-16       2227567    10807828          150277            123368
Using reduction, the performance loss is dramatic (reversed speed-up). The outer parallelization (with or without collapse) is the best option.
As concerns your failure with large matrices, a possible reason is related to the size of the available stack. Try to enlarge both the system and OpenMP stack sizes, i.e.
ulimit -s unlimited
export OMP_STACKSIZE=10000000
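The two points of this answer (parallelize only the outer loops, and avoid exhausting the stack) can be combined in a minimal sketch. It assumes the matrices are stored row-major in flat std::vector buffers, which live on the heap; the function name matmul and this storage layout are illustrative choices, not part of the original code.

```cpp
#include <cstddef>
#include <vector>

// Multiply two n x n row-major matrices stored in flat std::vector buffers.
// The buffers live on the heap, so a large n does not exhaust the stack.
std::vector<double> matmul(const std::vector<double>& A,
                           const std::vector<double>& B,
                           int n)
{
    std::vector<double> C(static_cast<std::size_t>(n) * n, 0.0);

    // Each (i, j) pair is handled by exactly one thread, so no two threads
    // ever write the same element of C and no reduction is needed.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;  // declared inside the loop: private per thread
            for (int k = 0; k < n; k++) {
                sum += A[static_cast<std::size_t>(i) * n + k]
                     * B[static_cast<std::size_t>(k) * n + j];
            }
            C[static_cast<std::size_t>(i) * n + j] = sum;
        }
    }
    return C;
}
```

With GCC the pragma only takes effect when compiling with -fopenmp; without that flag the same code runs serially, which makes it easy to compare timings.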

The collapse directive may actually be responsible for this, because the index j is recreated using divide/mod operations.
Did you try without collapse?

Related

parallel programming in OpenMP

I have the following piece of code.
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
where offset is an array of size n containing values in the range [0, m) and cnt is an array of size m initialized to 0. I use OpenMP to parallelize it as follows.
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
According to the discussion in this post, if offset[i1] == offset[i2] for i1 != i2, the above piece of code may result in incorrect cnt. What can I do to avoid this?
This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
contains a race condition in the updates of the array cnt. To solve it, you need to guarantee mutual exclusion of those updates. That can be achieved with (for instance) #pragma omp atomic update, but as already pointed out in the comments:
However, this resolves just correctness and may be terribly
inefficient due to heavy cache contention and synchronization needs
(including false sharing). The only solution then is to have each
thread its private copy of cnt and reduce these copies at the end.
The alternative solution is to have a private array per thread and, at the end of the parallel region, manually reduce all those arrays into one. An example of such an approach can be found here.
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicated clause, namely:
#pragma omp parallel for reduction(+:cnt)
You can have a look at this example of how to apply that feature.
Regarding the reduction of arrays versus the atomic approach, it is worth mentioning the following, as kindly pointed out by @Jérôme Richard:
Note that this is fast only if the array is not huge (the atomic based
solution could be faster in this specific case regarding the platform
and if the values are not conflicting). So that is m << n.
As always, profiling is key! You should test your code with the aforementioned approaches to find out which one is the most efficient.
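As a minimal sketch of the array-reduction approach, the counting loop can be wrapped in a function. The name count_offsets and the use of std::vector are assumptions made for illustration; the sketch also assumes m > 0, every value in offset lies in [0, m), and the compiler supports OpenMP 4.5 array sections.

```cpp
#include <cstddef>
#include <vector>

// Count how often each value in [0, m) occurs in `offset`.
std::vector<int> count_offsets(const std::vector<int>& offset, int m)
{
    std::vector<int> result(m, 0);
    int* cnt = result.data();  // raw pointer, so an array section can be named
    const int n = static_cast<int>(offset.size());

    // reduction(+:cnt[:m]) gives every thread a zero-initialized private
    // copy of the m counters and sums the copies into `result` at the end.
    #pragma omp parallel for reduction(+:cnt[:m])
    for (int i = 0; i < n; ++i) {
        ++cnt[offset[i]];
    }
    return result;
}
```

Because every thread gets its own copy of all m counters, the memory and merge cost grow with m times the number of threads, which is why the quoted comment recommends this only when m is much smaller than n.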

OpenMP, reason for slowdown with more threads? (no sharing/no rand() (I think..) )

I am running my code on Intel® Xeon(R) CPU X5680 @ 3.33GHz × 12. Here is a fairly simple OpenMP pseudo code (the OpenMP parts are exact, just normal code in between is changed for compactness and clarity):
vector<int> myarray(arraylength,something);
omp_set_num_threads(3);
#pragma omp parallel
{
#pragma omp for schedule(dynamic)
for(int j=0;j<pr.max_iteration_limit;j++)
{
vector<int> temp_array(updated_array(a,b,myarray));
for(int i=0;i<arraylength;i++)
{
#pragma omp atomic
myarray[i]+=temp_array[i];
}
}
}
All parameters taken by the updated_array function are copied so that there would be no clashes. Basic structure of the updated_array function:
vector<int> updated_array(myClass1 a, vector<myClass2> b, vector<int> myarray)
{
//lots of preparations, but obviously there are only local variables, since
//function only takes copies
//the core code taking most of the time, which I will be measuring:
double time_s=time(NULL);
while(waiting_time<t_wait) //as long as needed
{
//a fairly short computation
//generates variable: vector<int> another_array
waiting_time++;
}
double time_f=time(NULL);
cout<<"Thread "<<omp_get_thread_num()<<" / "<<omp_get_num_threads()
<< " runtime "<<time_f-time_s<<endl;
//few more changes to the another_array
return another_array;
}
Questions and my attempts to resolve it:
Adding more threads (with omp_set_num_threads(3);) does create more threads, but each thread does the job more slowly. E.g. 1: 6s, 2: 10s, 3: 15s ... 12: 60s.
(By "job" I mean the exact part of the code I pointed out as the core, NOT the whole omp loop, since it takes most of the time; measuring it makes sure I am not missing anything additional.)
There are no rand() things happening inside the core code.
Dynamic or static scheduling doesn't make a difference here, of course (and I tried).
There seems to be no sharing possible in any way or form, so I am running out of ideas completely. What can it be? I would be extremely grateful if you could help me with this (even with just ideas)!
P.S. The point of the code is to take myarray, do a bit of Monte Carlo on it with a single thread, and then collect the tiny changes and add/subtract them to the original array.
OpenMP may implement the atomic access using a mutex, in which case your code will suffer from heavy contention on that mutex. This will result in a significant performance hit.
If the work in updated_array() dominates the cost of the parallel loop, you'd better put the whole of the second loop inside a critical section:
{ // body of parallel loop
vector<int> temp_array = updated_array(a,b,myarray);
#pragma omp critical(UpDateMyArray)
for(int i=0;i<arraylength;i++)
myarray[i]+=temp_array[i];
}
However, your code looks broken (essentially not threadsafe), see my comment.
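The critical-section variant can be sketched as a complete function. Here compute_delta is a hypothetical stand-in for the expensive updated_array() call; unlike the original, it deliberately does not read the shared array, so the sketch sidesteps the thread-safety problem mentioned above rather than solving it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive updated_array() call: it returns
// a vector of small changes computed from the iteration index only.
std::vector<int> compute_delta(int iteration, std::size_t length)
{
    std::vector<int> delta(length);
    for (std::size_t i = 0; i < length; ++i) {
        delta[i] = (iteration + static_cast<int>(i)) % 3 - 1;  // -1, 0 or 1
    }
    return delta;
}

// Sum the deltas of all iterations into one array.
std::vector<int> accumulate_deltas(std::size_t length, int iterations)
{
    std::vector<int> total(length, 0);

    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < iterations; ++j) {
        // Expensive part: runs in parallel and touches no shared data.
        const std::vector<int> delta = compute_delta(j, length);

        // Cheap part: one lock acquisition per iteration instead of one
        // atomic operation per array element.
        #pragma omp critical(update_total)
        {
            for (std::size_t i = 0; i < length; ++i) {
                total[i] += delta[i];
            }
        }
    }
    return total;
}
```

This pays off only when the work outside the critical section dominates; if the update loop is a large share of each iteration, the threads will mostly wait on each other.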

Parallelize some nested for loops in OpenMP C++

My serial code for the convolution between a matrix and a kernel works like this:
int index1, index2, a, b;
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
for(int i=0;i<krows;++i){
for(int j=0;j<kcolumns;++j){
a=x+i-krows/2;
b=y+j-kcolumns/2;
if(a<0)
index1=rows+a;
else if(a>rows-1)
index1=a-rows;
else
index1=a;
if(b<0)
index2=columns+b;
else if(b>columns-1)
index2=b-columns;
else
index2=b;
output[x*columns+y]+=input[index1*columns+index2]*kernel[i*kcolumns+j];
}
}
}
}
The convolution considers cyclic treatment for the borders. Now I want to parallelize the code with openmp. I thought about reducing the first two for-cycles to just one and using the syntax:
#pragma omp parallel
#pragma omp for private(x,y,a, b, index1, index2)
for(int z=0;z<rows*columns;z++){
x=z/columns;
y=z%columns;
...
I see that parallelizing like that reduces the CPU time, but I'm not an OpenMP expert, so I was wondering whether there are other, more efficient solutions. I don't think it is a good idea to also parallelize the other 2 nested for loops.
With an input matrix of dimensions 1000*10000 and a square kernel matrix 9*9 I obtain these times:
4823 ms for 1 thread
2696 ms for 2 threads
2513 ms for 4 threads.
I hope someone can give me some useful suggestions. What about the for reduction syntax?
My suggestion is to change approach altogether. If you are using cyclic treatment for the border (i.e. your problem is periodic), the fast way to do it is the FFT-based spectral approach:
- Fourier transform the matrix and the kernel
- compute the product
- inverse Fourier transform the product (you have the convolution)
This is (1) much more efficient (unless the dimensions of the kernel are much smaller than those of the matrix) and (2) lets you use an FFT library that supports multithreading (like FFTW) and leave the parallelization to it.
You don't need to change the for loops. You can make each thread iterate through all rows in a column or through all columns in a row. Also, bear in mind that if the number of threads is higher than the number of physical cores, the performance won't change much.
OpenMP already takes care of the number of threads that it should create, using the logical core count, which might be a problem on Intel i3 and i7, since they have hyperthreading and thus the performance gain per extra thread won't be big.
In summary, you can either:
#pragma omp parallel for private (a,b,index1,index2)
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
// ...
}
}
Or:
for(int x=0;x<rows;++x){
#pragma omp parallel for private (a,b,index1,index2)
for(int y=0;y<columns;++y){
// ...
}
}
If you are using OpenMP 3.0 or greater you may exploit the collapse clause of the loop work-sharing construct:
The collapse clause may be used to specify how many loops are
associated with the loop construct. The parameter of the collapse
clause must be a constant positive integer expression. If no collapse
clause is present, the only loop that is associated with the loop
construct is the one that immediately follows the loop directive
This means that you may write the following:
#pragma omp parallel for collapse(2)
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
/* Work here */
}
}
and obtain exactly the same result as your linearized loop:
#pragma omp parallel for
for(int z=0;z<rows*columns;z++){
int x=z/columns;
int y=z%columns;
/* Work here */
}
As you may see, with the collapse clause no modification is needed to your serial code, and you may easily experiment with further loop collapsing by changing the positive number in the clause.
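A self-contained sketch of the collapse(2) variant applied to the cyclic convolution might look as follows. The function name convolve_cyclic and the std::vector storage are illustrative assumptions. It replaces the if/else border handling with a modulo, which is equivalent as long as the kernel is not larger than the matrix, and it declares all temporaries inside the loops so that they are private to each thread without a private clause.

```cpp
#include <vector>

// Cyclic (periodic) 2D convolution of a rows x columns matrix with a
// krows x kcolumns kernel, all stored row-major in flat vectors.
std::vector<double> convolve_cyclic(const std::vector<double>& input,
                                    int rows, int columns,
                                    const std::vector<double>& kernel,
                                    int krows, int kcolumns)
{
    std::vector<double> output(input.size(), 0.0);

    // The x and y loops are fused into one iteration space of
    // rows * columns items, which is then split among the threads.
    #pragma omp parallel for collapse(2)
    for (int x = 0; x < rows; ++x) {
        for (int y = 0; y < columns; ++y) {
            double sum = 0.0;  // declared inside the loop: private per thread
            for (int i = 0; i < krows; ++i) {
                for (int j = 0; j < kcolumns; ++j) {
                    // Wrap around the borders. Adding rows/columns before
                    // the modulo keeps the operand non-negative, assuming
                    // the kernel is not larger than the matrix.
                    const int a = (x + i - krows / 2 + rows) % rows;
                    const int b = (y + j - kcolumns / 2 + columns) % columns;
                    sum += input[a * columns + b] * kernel[i * kcolumns + j];
                }
            }
            output[x * columns + y] = sum;  // each element has one writer
        }
    }
    return output;
}
```

Accumulating into a local sum and writing output once per element also avoids repeated read-modify-write traffic on the shared output array.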

C++ OpenMP writing to specific element of a shared array/vector

I have a long-running simulation program and I plan to use OpenMP to parallelize some of the code for speedup. I'm new to OpenMP and have the following question.
Given that the simulation is a stochastic one, I have following data structure and I need to capture age-specific count of seeded agents [Edited: some code edited]:
class CAgent {
int ageGroup;
bool isSeed;
/* some other stuff */
};
class Simulator {
std::vector<int> seed_by_age;
std::vector<CAgent> agents;
void initEnv();
/* some other stuff */
};
void Simulator::initEnv() {
std::fill(seed_by_age.begin(), seed_by_age.end(), 0);
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
seed_by_age[0]++; // index = 0 -> overall
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
} // end #parallel
} // end Simulator::initEnv()
As the variable seed_by_age is shared across threads, I know I have to protect it properly. So in (b), I used #pragma omp flush(seed_by_age[agents[i].ageGroup]). But the compiler complains "error: expected ')' before '[' token".
I'm not doing reduction, and I try to avoid 'critical' directive if possible. So, am I missing something here? How can I properly protect a particular element of the vector?
Many thanks and I appreciate any suggestions.
Development box: 2 core CPU, target platform 4-6 cores
Platform: Windows 7, 64bits
MinGW 4.7.2 64 bits (rubenvb build)
You can only use flush with variables, not elements of arrays and definitely not with elements of C++ container classes. The indexing operator for std::vector results in a call to operator[], an inline function, but still a function.
Because in your case std::vector::operator[] returns a reference to a simple scalar type, you can use the atomic update construct to protect the updates:
#pragma omp atomic update
seed_by_age[0]++; // index = 0 -> overall
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
As for not using reduction: each thread touches seed_by_age[0] when the condition inside the loop is met, thereby invalidating the same cache line in all other cores. Access to the other vector elements also leads to mutual cache invalidation, but assuming that agents are more or less equally distributed among the age groups, it would not be as severe as in the case of the first element in the vector. Therefore I would propose that you do something like:
int total_seed_by_age = 0;
#pragma omp parallel for schedule(static) reduction(+:total_seed_by_age)
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
total_seed_by_age++;
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
seed_by_age[0] = total_seed_by_age;
#pragma omp flush(seed_by_age[agents[i]].ageGroup)
Try closing all your brackets; that will fix the compiler error.
I am afraid that your #pragma omp flush statement is not sufficient to protect your data and prevent a race condition here.
If someRandomCondition() is true in only a very limited number of cases, you could use a critical section for the update of your vector without losing too much speed. Alternatively, if the size of your vector seed_by_age is not too large (which I assume), then it could be efficient to have a private version of the vector for each thread, which you merge right before leaving the parallel block.
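The private-vector-per-thread idea can be sketched as follows. The function count_seeds and its two input vectors are hypothetical stand-ins for the Simulator and CAgent members, and the index layout is an assumption made for the sketch: index 0 holds the overall count and index g holds the count for age group g, with groups numbered from 1 to groups.

```cpp
#include <cstddef>
#include <vector>

// age_group[i] is in [1, groups]; is_seed[i] marks seeded agents.
// Returns counts with index 0 = overall total, index g = count for group g.
std::vector<int> count_seeds(const std::vector<int>& age_group,
                             const std::vector<bool>& is_seed,
                             int groups)
{
    std::vector<int> counts(groups + 1, 0);
    const int n = static_cast<int>(age_group.size());

    #pragma omp parallel
    {
        // Private tally: no other thread ever touches it, so the hot loop
        // needs no atomics and causes no cache-line contention on `counts`.
        std::vector<int> local(groups + 1, 0);

        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            if (is_seed[i]) {
                ++local[0];
                ++local[age_group[i]];
            }
        }

        // Merge once per thread, right before leaving the parallel block.
        #pragma omp critical(merge_counts)
        {
            for (int g = 0; g <= groups; ++g) {
                counts[g] += local[g];
            }
        }
    }
    return counts;
}
```

The critical section is entered only once per thread, so its cost is independent of the number of agents.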

OpenMP Performance impact: private directive vs. declaring variable inside for construct

Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, when parallelizing for a large number of threads, I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads.
There is not much of a difference in terms of performance among the 3 versions you presented, since each of them uses #pragma omp parallel for. OpenMP automatically assigns the for iterations to different threads, so the variable i becomes private to each thread, and each thread has a different range of iterations to work with. The variable i is automatically made private in order to avoid race conditions when updating it. Since i will be private in the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath #pragma omp parallel for to have the following format:
for(init-expr; test-expr; incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block (OpenMP Application Program Interface, pp. 39-40.)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see -> version 2 and version 3.
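For reference, a minimal sketch of the second version in canonical loop form is given below. The function scale is a hypothetical example and not taken from the question; it only illustrates that a loop variable declared in the for statement needs no private clause.

```cpp
#include <vector>

// Scale every element in place. The loop has the canonical form
// for (init-expr; test-expr; incr-expr), and the loop variable i is
// made private to each thread automatically.
void scale(std::vector<double>& data, double factor)
{
    const int n = static_cast<int>(data.size());

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        data[i] *= factor;  // each index is written by exactly one thread
    }
}
```

Declaring i outside the loop and adding private(i) would behave the same way, which matches the identical assembly reported above.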