Parallelize some nested for loops in OpenMP C++ - c++

My serial code for the convolution between a matrix and a kernel works like this:
int index1, index2, a, b;
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
for(int i=0;i<krows;++i){
for(int j=0;j<kcolumns;++j){
a=x+i-krows/2;
b=y+j-kcolumns/2;
if(a<0)
index1=rows+a;
else if(a>rows-1)
index1=a-rows;
else
index1=a;
if(b<0)
index2=columns+b;
else if(b>columns-1)
index2=b-columns;
else
index2=b;
output[x*columns+y]+=input[index1*columns+index2]*kernel[i*kcolumns+j];
}
}
}
}
The convolution considers cyclic treatment for the borders. Now I want to parallelize the code with openmp. I thought about reducing the first two for-cycles to just one and using the syntax:
#pragma omp parallel
#pragma omp for private(x,y,a, b, index1, index2)
for(int z=0;z<rows*columns;z++){
x=z/columns;
y=z%columns;
...
I see that parallelizing like this reduces the CPU time, but I'm not an OpenMP expert, so I was wondering whether there are other, more efficient solutions. I don't think it is a good idea to also parallelize the other two nested for loops.
With an input matrix of dimensions 1000*10000 and a square kernel matrix 9*9 I obtain these times:
4823 ms for 1 thread
2696 ms for 2 threads
2513 ms for 4 threads.
I hope someone can give me some useful suggestions. What about the for reduction syntax?

My suggestion is to change the approach altogether. If you are using cyclic treatment for the borders (i.e. your problem is periodic), the fast way to do it is the FFT-based spectral approach:
- Fourier transform the matrix and the kernel (with the kernel zero-padded to the size of the matrix)
- compute their element-wise product
- inverse Fourier transform the product (this gives you the convolution)
This is (1) much more efficient (unless the dimensions of the kernel are much smaller than those of the matrix), and (2) you can use an FFT library that supports multithreading (like FFTW) and let it deal with the parallelization.
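The three steps above can be sketched in one dimension. This is only an illustration under stated assumptions: it uses a naive O(n^2) DFT written by hand instead of a real FFT library, the kernel is assumed to be zero-padded to the signal length, and the function names (dft, cyclic_convolve_dft, cyclic_convolve_direct) are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT, for illustration only; real code would call an FFT library.
// sign = -1 is the forward transform, sign = +1 the unnormalized inverse.
std::vector<cd> dft(const std::vector<cd>& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                sign * 2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += in[t] * cd(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Cyclic convolution through the frequency domain: transform, multiply, transform back.
std::vector<double> cyclic_convolve_dft(const std::vector<double>& signal,
                                        const std::vector<double>& kernel) {
    assert(signal.size() == kernel.size());  // kernel assumed zero-padded
    const std::size_t n = signal.size();
    const std::vector<cd> fs = dft(std::vector<cd>(signal.begin(), signal.end()), -1);
    const std::vector<cd> fk = dft(std::vector<cd>(kernel.begin(), kernel.end()), -1);
    std::vector<cd> product(n);
    for (std::size_t k = 0; k < n; ++k) product[k] = fs[k] * fk[k];
    const std::vector<cd> back = dft(product, +1);
    std::vector<double> out(n);
    for (std::size_t m = 0; m < n; ++m) out[m] = back[m].real() / static_cast<double>(n);
    return out;
}

// Direct cyclic convolution, used here only as a reference to compare against.
std::vector<double> cyclic_convolve_direct(const std::vector<double>& signal,
                                           const std::vector<double>& kernel) {
    assert(signal.size() == kernel.size());
    const std::size_t n = signal.size();
    std::vector<double> out(n, 0.0);
    for (std::size_t m = 0; m < n; ++m)
        for (std::size_t j = 0; j < n; ++j)
            out[m] += signal[(m + n - j) % n] * kernel[j];
    return out;
}
```

Both functions should agree up to rounding error. With a real FFT the transform cost drops from O(n^2) to O(n log n), which is where the efficiency claim comes from.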

You don't need to change the for loops. You can make each thread iterate through all rows in a column or through all columns in a row. Also, bear in mind that if the number of threads is higher than the number of physical cores, the performance won't change much.
OpenMP already takes care of the number of threads that it should create, using the logical core count, which might be a problem on Intel i3 and i7 CPUs: since they have hyperthreading, the performance gain per extra thread won't be big.
In summary, you can either:
// x and y are declared in the for statements, so they are private without being listed
#pragma omp parallel for private(a,b,index1,index2)
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
// ...
}
}
Or:
for(int x=0;x<rows;++x){
#pragma omp parallel for private(a,b,index1,index2)
for(int y=0;y<columns;++y){
// ...
}
}

If you are using OpenMP 3.0 or greater you may exploit the collapse clause of the loop work-sharing construct:
The collapse clause may be used to specify how many loops are
associated with the loop construct. The parameter of the collapse
clause must be a constant positive integer expression. If no collapse
clause is present, the only loop that is associated with the loop
construct is the one that immediately follows the loop directive
This means that you may write the following:
#pragma omp parallel for collapse(2)
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
/* Work here */
}
}
and obtain exactly the same result as your linearized loop:
#pragma omp parallel for
for(int z=0;z<rows*columns;z++){
x=z/columns;
y=z%columns;
/* Work here */
}
As you can see, with the collapse clause no modification of your serial code is needed, and you can easily experiment with collapsing more loops by changing the number in the clause.
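For reference, here is a minimal sketch of the whole cyclic convolution with collapse(2). The function name convolve_cyclic, the helper wrap and the use of std::vector<double> are assumptions made for this example; the question does not show its element type. Temporaries are declared inside the loops, so no private clause is needed, and each output element is accumulated in a local variable and written once.

```cpp
#include <cassert>
#include <vector>

// Wrap an index into [0, n) for cyclic (periodic) borders.
// Assumes the offset never exceeds n in magnitude, as in the original code.
inline int wrap(int a, int n) {
    if (a < 0) return a + n;
    if (a >= n) return a - n;
    return a;
}

std::vector<double> convolve_cyclic(const std::vector<double>& input, int rows, int columns,
                                    const std::vector<double>& kernel, int krows, int kcolumns) {
    assert(static_cast<int>(input.size()) == rows * columns);
    assert(static_cast<int>(kernel.size()) == krows * kcolumns);
    std::vector<double> output(input.size(), 0.0);

    // The two outer loops are perfectly nested, so they can be collapsed
    // into a single iteration space of rows*columns independent items.
    #pragma omp parallel for collapse(2)
    for (int x = 0; x < rows; ++x) {
        for (int y = 0; y < columns; ++y) {
            double acc = 0.0;  // declared inside the loop, hence private to each thread
            for (int i = 0; i < krows; ++i) {
                const int index1 = wrap(x + i - krows / 2, rows);
                for (int j = 0; j < kcolumns; ++j) {
                    const int index2 = wrap(y + j - kcolumns / 2, columns);
                    acc += input[index1 * columns + index2] * kernel[i * kcolumns + j];
                }
            }
            output[x * columns + y] = acc;  // each (x,y) is written by exactly one thread
        }
    }
    return output;
}
```

With GCC this is compiled with -fopenmp; without the flag the pragma is ignored and the same code runs serially, which is convenient for checking results.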

Related

Performance of matrix multiplications remains unchanged with OpenMP in C++

auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
for(int i=0;i<n;i++)
{
#pragma omp for collapse(2)
for(int j=0;j<n;j++)
{
for(int k=0;k<n;k++)
{
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
auto t2 = chrono::steady_clock::now();
auto t = std::chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
With and without the parallelization the variable t remains fairly constant. I am not sure why this is happening. Also once in a while t is outputted as 0.
One more problem I am facing is that if I increase the value of n to something like 500, the program fails to run. (Here I've taken n=100.)
I am using code::blocks with the GNU GCC compiler.
The proposed OpenMP parallelization is not correct and may lead to wrong results. With collapse(2), the threads execute the (j,k) iterations "simultaneously". If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] into the same array location C[i][j]. This is a so-called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). A data race does not necessarily lead to wrong results, but the code is not valid OpenMP and it can produce wrong results depending on several factors (scheduling, compiler implementation, number of threads, ...). To fix the problem in the code above, OpenMP offers the reduction clause:
#pragma omp parallel
{
for(int i=0;i<n;i++) {
#pragma omp for collapse(2) reduction(+:C)
for(int j=0;j<n;j++) {
for(int k=0;k<n;k++) {
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that reductions on arrays in C are directly supported by the standard only since OpenMP 4.5 (check whether your compiler supports it; otherwise there are older manual ways to achieve it, see Reducing on array in OpenMp).
However, for the given code, it is probably better to avoid parallelizing the innermost loop, so that the reduction is not needed at all:
#pragma omp parallel
{
#pragma omp for collapse(2)
for(int i=0;i<n;i++) {
for(int j=0;j<n;j++) {
for(int k=0;k<n;k++) {
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
The serial version can be faster than the OpenMP version for small matrices and/or a small number of threads.
On my Intel machine using up to 16 cores, with n=1000 and GNU compiler v6.1, the break-even point is around 4 cores when the -O3 optimization is activated, and around 2 cores when compiling with -O0. For clarity, here are the performances I measured:
Serial         418020
            WRONG ORIG   +REDUCTION   OUTER.COLLAPSE   OUTER.NOCOLLAPSE
OpenMP-1       1924950      2841993          1450686            1455989
OpenMP-2        988743      2446098           747333             745830
OpenMP-4        515266      3182262           396524             387671
OpenMP-8        280285      5510023           219506             211913
OpenMP-16      2227567     10807828           150277             123368
Using the reduction, the performance loss is dramatic (reversed speed-up). The outer parallelization (with or without collapse) is the best option.
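A self-contained sketch of the outer-loop variant might look like the following. The Matrix alias, the function name multiply and the use of std::vector are assumptions for this example, not the asker's code. Each C[i][j] is accumulated in a local variable by a single thread, so no reduction clause is involved.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// C = A * B, parallelized over the (i,j) pairs only.
Matrix multiply(const Matrix& A, const Matrix& B) {
    const int n = static_cast<int>(A.size());
    const int m = static_cast<int>(B.size());
    assert(n > 0 && m > 0 && static_cast<int>(A[0].size()) == m);
    const int p = static_cast<int>(B[0].size());
    Matrix C(n, std::vector<double>(p, 0.0));

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            double sum = 0.0;  // private accumulator, one per (i,j) iteration
            for (int k = 0; k < m; ++k) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;  // written by exactly one thread
        }
    }
    return C;
}
```

Because std::vector stores its elements on the heap, this version does not depend on the stack size, which may matter for larger n if the original matrices were declared as local arrays (the question does not show how they were declared).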
As concerns your failure with large matrices, a possible reason is related to the size of the available stack. Try to enlarge both the system and OpenMP stack sizes, i.e.
ulimit -s unlimited
export OMP_STACKSIZE=10000000
The collapse clause may actually be responsible for this, because the index j is recreated using divide/mod operations.
Did you try without collapse?

Parallelization of Jacobi algorithm using eigen c++ using openmp

I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes but since I plan to work with very large matrices I am trying to parallelize it using openmp.
void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
double g,h;
g=a(i,j);
h=a(k,l);
a(i,j)=g-s*(h+g*tau);
a(k,l)=h+s*(g-h*tau);
}
void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d )
{
int j,iq,ip,i;
double tresh,theta,tau,t,sm,s,h,g,c;
VectorXd b(n);
VectorXd z(n);
v.setIdentity();
z.setZero();
#pragma omp parallel for
for (ip=0;ip<n;ip++)
{
d(ip)=a(ip,ip);
b(ip)=d(ip);
}
for (i=0;i<50;i++)
{
sm=0.0;
for (ip=0;ip<n-1;ip++)
{
#pragma omp parallel for reduction (+:sm)
for (iq=ip+1;iq<n;iq++)
sm += fabs(a(ip,iq));
}
if (sm == 0.0) {
break;
}
if (i < 3)
tresh=0.2*sm/(n*n);
else
tresh=0.0;
#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
for (ip=0;ip<n-1;ip++)
{
//#pragma omp parallel for private (g,h,t,theta,c,s,tau)
for (iq=ip+1;iq<n;iq++)
{
g=100.0*fabs(a(ip,iq));
if (i > 3 && (fabs(d(ip))+g) == fabs(d[ip]) && (fabs(d[iq])+g) == fabs(d[iq]))
a(ip,iq)=0.0;
else if (fabs(a(ip,iq)) > tresh)
{
h=d(iq)-d(ip);
if ((fabs(h)+g) == fabs(h))
{
t=(a(ip,iq))/h;
}
else
{
theta=0.5*h/(a(ip,iq));
t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
if (theta < 0.0)
{
t = -t;
}
c=1.0/sqrt(1+t*t);
s=t*c;
tau=s/(1.0+c);
h=t*a(ip,iq);
#pragma omp critical
{
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
a(ip,iq)=0.0;
for (j=0;j<ip;j++)
ROTATE(a,j,ip,j,iq,s,tau);
for (j=ip+1;j<iq;j++)
ROTATE(a,ip,j,j,iq,s,tau);
for (j=iq+1;j<n;j++)
ROTATE(a,ip,j,iq,j,s,tau);
for (j=0;j<n;j++)
ROTATE(v,j,ip,j,iq,s,tau);
}
}
}
}
}
}
}
I wanted to parallelize the loop that does most of the calculations and both comments inserted in the code:
//#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
//#pragma omp parallel for private (g,h,t,theta,c,s,tau)
are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain of this.
I am not really sure if I am doing the parallelization in a correct manner because I have only recently started working with openmp, so any suggestion or recommendation would also be welcomed.
Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors and this algorithm is.
My thanks in advance.
Edit: I consider the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of size up to 4096x4096. In any case, I corrected the code in order to make it work, and maybe for big enough systems it could be of some use. I would advise using timers to test whether the
#pragma omp for
directives actually decrease the computation time.
I'll try to help, but I'm not sure this is the answer to your question.
There are tons of problems with your code. My friendly advice is: don't parallelize things if you don't understand the implications of what you're doing.
For some reason, it looks like you think that putting everything in a #pragma omp parallel for will make it faster. This is VERY wrong, because spawning threads is an expensive thing to do and costs (relatively) a lot of memory and time. So if you repeat that #pragma omp parallel for for every loop, you'll respawn threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is much larger than the cost of spawning them.
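As a concrete case, the off-diagonal sum in the question opens a parallel region inside the ip loop, i.e. n-1 times per sweep. A sketch that opens it once, on the outer loop, could look like this. It uses a plain row-major std::vector instead of Eigen's MatrixXd, and the function name off_diagonal_sum is made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of |a(ip,iq)| over the strict upper triangle of an n x n row-major matrix.
double off_diagonal_sum(const std::vector<double>& a, int n) {
    assert(static_cast<int>(a.size()) == n * n);
    double sm = 0.0;
    // One parallel region for the whole triangular loop nest;
    // iq is declared in the inner loop, so it is private automatically.
    #pragma omp parallel for reduction(+:sm)
    for (int ip = 0; ip < n - 1; ++ip) {
        for (int iq = ip + 1; iq < n; ++iq) {
            sm += std::fabs(a[ip * n + iq]);
        }
    }
    return sm;
}
```

The rows of a triangular loop have different lengths, so a non-default schedule might balance the load better; that is something to measure rather than assume.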
I ran into a similar issue when I wanted to multiply huge matrices element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices to linear arrays, distribute the array chunks among the threads, and then run a for loop in which every iteration uses elements that are guaranteed to be independent of the others, so that they all evolve independently. This was quite fast. Why? Because I never had to spawn the threads a second time.
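A rough sketch of that idea, not the answerer's actual code: the matrices are assumed to be already flattened into std::vector<double>, and the name product_and_sum is hypothetical. A single parallel region contains two work-sharing loops, so the thread team is set up once and reused for both.

```cpp
#include <cassert>
#include <vector>

// Element-wise product of two flattened matrices, followed by the sum of the products.
// The products are stored in 'product'; the sum is returned.
double product_and_sum(const std::vector<double>& a, const std::vector<double>& b,
                       std::vector<double>& product) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    product.assign(a.size(), 0.0);
    double total = 0.0;

    #pragma omp parallel
    {
        // Every iteration touches its own element only, so no synchronization is needed.
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            product[i] = a[i] * b[i];
        }
        // The implicit barrier at the end of the loop above guarantees that
        // all products are written before they are summed here.
        #pragma omp for reduction(+:total)
        for (long i = 0; i < n; ++i) {
            total += product[i];
        }
    }
    return total;
}
```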
Why are you getting wrong results? I believe the reason is that you're not respecting shared-memory rules. You have some variable(s) being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? What I see here:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
looks VERY thread-unsafe, and I don't understand what you're doing. Are you returning a reference that you then have to modify? This is a recipe for thread non-safety. Why don't you create clean arrays and deal with them instead of this?
How to debug: Start with a small example (2x2 matrix, maybe), and use only 2 threads, and try to understand what's going on. Use a debugger and define break points, and check what information is shared between threads.
Also consider using a mutex to check what data gets ruined when it becomes shared. Here is how to do it.
My recommendation: Don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back then when C++ didn't have any native multi-threading implementation. So learn how to use std::thread, and use it, and if you need to run many things in threads, then learn how to create a Thread Pool with std::thread. This is a good book for learning multithreading.

Situations faced in OpenMP on for() loops

I'm using OpenMP for this and I'm not confident of my answer as well. Really need your help in this. I've been wondering which method (serial or parallel) is faster in run speed in this. My #pragma commands (set into comments) are shown below.
Triangle Triangle::t_ID_lookup(Triangle a[], int ID, int n)
{
Triangle res; int i;
//#pragma omp for schedule(static) ordered
for(i=0; i<n; i++)
{
if(ID==a[i].t_ID)
{
//#pragma omp ordered
return (res=a[i]); // <-changed into "res = a[i]" instead of "return(...)"
}
}
return res;
}
It depends on n. If n is small, then the overhead required for the OMP threads makes the OMP version slower. This can be overcome by adding an if clause: #pragma omp parallel if (n > YourThreshold)
If the a[i].t_ID values are not all unique, then you may receive different results from the same data when using OMP.
If you have more in your function than just a single comparison, consider adding a shared flag variable that would indicate that it was found so that a comparison if(found) continue; can be added at the beginning of the loop.
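Here is one possible sketch of a parallel lookup. Note that it swaps the shared-flag idea for a min reduction on the index (available since OpenMP 3.1), which also makes the result deterministic when IDs are not unique: the smallest matching index always wins. The Triangle struct below is a hypothetical stand-in with only the t_ID field, and find_first_by_id and its threshold parameter are made-up names.

```cpp
#include <cassert>
#include <vector>

struct Triangle {
    int t_ID;  // hypothetical stand-in; the real class has more members
};

// Returns the index of the first element whose t_ID equals ID, or -1 if none matches.
int find_first_by_id(const std::vector<Triangle>& a, int ID, int threshold = 10000) {
    const int n = static_cast<int>(a.size());
    int first = n;  // n means "not found yet"
    // The if clause keeps small inputs serial, where thread overhead would dominate.
    #pragma omp parallel for reduction(min:first) if(n > threshold)
    for (int i = 0; i < n; ++i) {
        if (a[i].t_ID == ID && i < first) {
            first = i;
        }
    }
    return first < n ? first : -1;
}
```

This version always scans the whole array, since a parallel loop cannot return early; whether it beats the serial early-exit loop depends on n and on where matches typically sit.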
I have no experience with ordered, so if that was the crux of your question, ignore all the above and consider this answer.
Profile. In the end, there is no better answer.
If you still want a theoretical answer, then a random lookup would be O(n) with a mean of n/2 while the OMP version would be a constant n/k where k is the number of threads/cores not including overhead.
For an alternative way of writing your loop, see Z Boson's answer to a different question.

Join array results in OpenMP

I am writing c++ codes using OpenMP. I have a global huge array (100,000+ elements) that will be modified by adding values in a for loop. Is there a way that I can efficiently have each thread created by OpenMP for parallel maintain its local copy of array and then join after the loop? Since the number of threads is a variable, I could not create the local copies of array beforehand. If using a global copy and address the race condition by a synchronization lock, the performance is terrible.
Thanks!
Edited:
Sorry for not being clear. Here's some pseudo-code hopefully could clarify the scenario:
int* huge_array=new int[N];
memset(huge_array, 0, N*sizeof(int));
#pragma omp parallel for
for (i=0; i<n; i++)
{
get a value v independently
get a position p independently
// I have to set a lock here
omp_set_lock(&lock);
huge_array[p] += v;
omp_unset_lock(&lock);
}
Is there a way to improve the performance of the code above?
Okay, I finally understood what you want to do. Yes, you do it the same way as with pthreads.
std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
int np = omp_get_num_threads();
std::vector<int> localA(N);
local[omp_get_thread_num()] = localA.data();
// add values to local array
#pragma omp for
for(int i=0; i<num_values; ++i)
localA[position()] += value(); // (1)
// implicit barrier ensures all local copies are ready for aggregation
// aggregate local copies into global array
#pragma omp for
for(int k=0; k<N; ++k)
for(int p=0; p<np; ++p)
A[k] += local[p][k]; // (2)
// implicit barrier ensures no local copy is deleted before aggregation is done
}
but it is important to do the aggregate also in parallel.
In Walter's answer, I believe instead of
std::vector<int*> local(omp_max_num_threads());
It should be
std::vector<int*> local(omp_get_max_threads());
omp_max_num_threads() is not a routine in OpenMP.
What about using the directive
#pragma omp parallel for private(VARIABLE)
for your program?
EDIT:
For your code I would use my directive; you won't lose so much time locking and unlocking your variable...
EDIT 2:
Sorry, you cannot use my code for your problem unless you first create a temporary array in which you store your data temporarily...
As far as I can tell, you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you will add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. Before OpenMP 4.5, the C/C++ version of OpenMP had no direct support for this, although, as far as I understand, the Fortran version did. To do an array reduction in C++ with OpenMP I have two suggestions.
1.) If the number of bins of the histogram (array) is much smaller than the number of values that will fill the histogram (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them serially in a critical section. Since the number of bins is much smaller than the number of values, this should be efficient.
2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit more tricky. Additionally, one needs to be careful with cache alignment and false sharing.
I showed how to do both these methods and discuss some of the cache issues in the following question:
Fill histograms (array reduction) in parallel with openmp without using a critical section.
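A minimal sketch of the first suggestion, assuming the positions and values are already available in two arrays (the question computes them inside the loop) and that every position is smaller than num_bins. The function name accumulate_bins is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds values[i] to bin positions[i], using one private histogram per thread.
std::vector<int> accumulate_bins(const std::vector<std::size_t>& positions,
                                 const std::vector<int>& values, std::size_t num_bins) {
    assert(positions.size() == values.size());
    std::vector<int> result(num_bins, 0);
    const long n = static_cast<long>(positions.size());

    #pragma omp parallel
    {
        // Declared inside the parallel region, so every thread gets its own copy,
        // whatever the number of threads turns out to be at run time.
        std::vector<int> local(num_bins, 0);

        // nowait: a thread may start merging as soon as its own share is done.
        #pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            local[positions[i]] += values[i];  // no lock: 'local' is private
        }

        // Merge the private histograms one thread at a time.
        #pragma omp critical
        {
            for (std::size_t k = 0; k < num_bins; ++k) {
                result[k] += local[k];
            }
        }
    }
    return result;
}
```

The lock is now taken once per thread instead of once per value. The serial merge costs roughly (number of threads) x (number of bins) additions, which is why the second suggestion becomes attractive when the array is very large.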

OpenMP Performance impact: private directive vs. declaring variable inside for construct

Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, if parallelizing for a large number of threads I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads
There is not much of a difference in terms of performance among the 3 versions you presented, since each one of them uses #pragma omp parallel for. Hence, OpenMP will automatically assign the for iterations to different threads. Thus, the variable i will become private to each thread, and each thread will have a different range of for iterations to work with. The variable i is automatically made private in order to avoid race conditions when updating it. Since the variable i will be private in the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath the #pragma omp parallel for to have the following format:
for(init-expr; test-expr;incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr;incr-expr) structured-block (OpenMP Application Program Interface pag. 39/40.)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see -> version 2 and version 3.
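For anyone who wants to repeat the comparison, the two versions can be put side by side as below. The function names and the loop body (filling an array with squares) are made up for the example; only the placement of the declaration of i differs.

```cpp
#include <cassert>
#include <vector>

// Version 2: the loop variable is declared in the for statement.
std::vector<int> squares_loop_local(int n) {
    assert(n >= 0);
    std::vector<int> out(n);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = i * i;
    }
    return out;
}

// Version 3: the loop variable is declared outside and listed as private.
// The clause is redundant: the iteration variable of the associated loop
// is made private by the construct anyway.
std::vector<int> squares_private_clause(int n) {
    assert(n >= 0);
    std::vector<int> out(n);
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++) {
        out[i] = i * i;
    }
    return out;
}
```

Both functions return the same array; timing them, or comparing the compiler output with -S, is a simple way to check the claim on a given compiler.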