Trouble parallelizing OpenACC loop - fortran

I have an old code written in Fortran and I need to accelerate it using OpenACC, but when I try using directives, it says there is a dependence of un, vn, pn which prevents parallelism. Is it possible to parallelize this loop? I am new to OpenACC but have parallelized with OpenMP.
!$acc parallel loop
      do 9000 j=2,jmaxm
        jm=j-1
        jp=j+1
        do 9001 i=2,imaxm
          im=i-1
          ip=i+1
          if(rmask(i,j).eq.1.0) then
! Calculate un field
            un(i,j,kp)=un(i,j,km)+ tdt*rmask(i,j)*(
     +     txsav(i,j)*zn(nmm)/xpsi2(nmm)+ visch*zetun(i,j)
     +     -recdx*(pn(ip,j,k)-pn(i,j,k))-a*un(i,j,km)/cn(nmm)**2
     +     +0.25* fu(i,j)*(vn(i,j,k)+vn(ip,j,k)+vn(i,jm,k)+
     +     vn(ip,jm,k))
     +     -damp(i,j)*un(i,j,km)
     +     )
c SBnd damper is not used
cc + -(1./timkwd)*dampu(i,j)*un(i,j,km)
! Calculate vn field
            vn(i,j,kp)=vn(i,j,km)+ tdt*rmask(i,j)*(
     +     tysav(i,j)*zn(nmm)/xpsi2(nmm)+visch*zetvn(i,j)
     +     -recdy*(pn(i,jp,k)-pn(i,j,k))-a*vn(i,j,km)/cn(nmm)**2
     +     -0.25*fv(i,j)*(un(im,jp,k)+un(i,jp,k)+un(im,j,k)+
     +     un(i,j,k))
     +     )
c EBnd damper is not used
cc + -(1./timkwd)*dampv(i,j)*vn(i,j,km)
! Calculate pn field
            pn(i,j,kp)=pn(i,j,km)+tdt*rmask(i,j)*(
     +     cn(nmm)**2*(
     +     -recdx*(un(i,j,k)-un(im,j,k))
     +     -recdy*(vn(i,j,k)-vn(i,jm,k)))
     +     -a*pn(i,j,k)/cn(nmm)**2
     +     -dampu(i,j)*cn(nmm)/dx*pn(i,j,km)
     +     -dampv(i,j)*cn(nmm)/dx*pn(i,j,km)
     +     -damp(i,j)*pn(i,j,km)
     +     )
            rhon(i,j)=-pn(i,j,kp)/g
            wn(i,j)=
     +     -recdx*(un(i,j,kp)-un(im,j,kp))
     +     -recdy*(vn(i,j,kp)-vn(i,jm,kp))
          endif
 9001   continue
 9000 continue
!$acc end parallel loop

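You have a data dependency, and that means the algorithm as written is inherently sequential. For instance, wn(i,j) reads un(im,j,kp) and vn(i,jm,kp), which are written by other iterations of the same loop nest.
A simple example is the difference between the Gauss-Seidel and Jacobi iterations, and why people use Jacobi on GPUs and not Gauss-Seidel.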
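To make the Jacobi versus Gauss-Seidel distinction concrete, here is a minimal C++ sketch. It is not the poster's code: the 1-D grid, the function names `jacobi_sweep` and `gauss_seidel_sweep`, and the simple averaging stencil are assumptions chosen for illustration. The Jacobi sweep reads only the old array, so its iterations are independent and could run in any order or in parallel. The Gauss-Seidel sweep reads values written earlier in the same sweep, so its iterations must run in order.

```cpp
#include <cstddef>
#include <vector>

// Jacobi sweep: reads only `in` and writes only `out`, so no iteration
// depends on a value produced by another iteration of the same sweep.
std::vector<double> jacobi_sweep(const std::vector<double>& in) {
    std::vector<double> out(in);  // boundary values are copied unchanged
    for (std::size_t i = 1; i + 1 < in.size(); ++i) {
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }
    return out;
}

// Gauss-Seidel sweep: updates in place, so u[i - 1] has already been
// overwritten by the previous iteration (a loop-carried dependence).
void gauss_seidel_sweep(std::vector<double>& u) {
    for (std::size_t i = 1; i + 1 < u.size(); ++i) {
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
    }
}
```

Starting from the same input, the two sweeps give different results after one pass, which is exactly the effect of the dependence: the in-place version propagates new values along the loop, the two-array version does not.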

Related

Vectorization vs SIMD vs Workshare in Fortran

From your experience, which code snippet is the fastest?
Vectorization
x(2:ndof - 1) = x(2:ndof - 1) - dt*( &
    x(ndof + 1:ndof + ndof - 2) + x(ndof + 3:ndof + ndof))
Vectorization + workshare
!$omp parallel workshare schedule (runtime)
x(2:ndof - 1) = x(2:ndof - 1) - dt*( &
    x(ndof + 1:ndof + ndof - 2) + x(ndof + 3:ndof + ndof))
!$omp end parallel workshare
Explicit loop + OpenMP SIMD (single instruction, multiple data)
!$omp parallel do simd schedule (runtime)
do idx = 2, ndof - 1, 1
    x(idx) = x(idx) - (dt*coef*( &
        x(ndof+idx - 1) + x(ndof + idx+1)))
end do
!$omp end parallel do simd
I do not include the simple explicit loop, as it is the one whose performance I want to improve.

OpenMP parallel do loop working correctly ~50% of the time

I am currently working on adding OpenMP parallelization to a do loop in one of the codes I have written for research. I am fairly new to OpenMP, so I would appreciate any suggestions about what might be going wrong.
Basically, I have added a parallel do loop to the following code (which works prior to parallelization). r(:,:,:,:) is an array holding a ton of molecular coordinates indexed by time, molecule, atom, and (xyz). This array is about 100 GB of data (I am working on an HPC system with plenty of RAM). I am trying to parallelize the outer loop and subdivide it between processors so that I can reduce the time this calculation takes. I thought it would be a good candidate, as msd and msd_cm are the only things that need to be written by multiple processors and stored for later; since each iteration gets its own element of these arrays, they shouldn't have a race condition.
The problem: if I run this code 5 times I get varying results. Sometimes msd is calculated correctly (or appears to be), and sometimes it comes out as all zeros later when I average it together. Without parallelization there are no issues.
I have tried altering the shared vs. private variables in the code and I think I have accounted for everything. The i index of the msd and msd_cm arrays should never be the same in two threads, so I would think that they wouldn't be an issue.
! Loop over time origins
counti = 0
ind = 0
!$OMP PARALLEL DO schedule(static) PRIVATE(i,j,k,it,r_old,r_cm_old,shift,shift_cm,dsq,ind) &
!$OMP& SHARED(msd,msd_cm)
do i=1, nconfigs-nt, or_int
    if(MOD(counti*or_int,500) == 0) then
        write(*,*) 'Reached the ', counti*or_int,'th time origin'
    end if
    ! Set the Old Coordinates
    counti = counti + 1
    ind = (i-1)/or_int + 1
    r_old(:,:,:) = r(i,:,:,:)
    r_cm_old(:,:) = r_cm(i,:,:)
    shift = 0.0
    shift_cm = 0.0
    ! Loop over the timesteps in each trajectory
    do it=i+2, nt+i
        ! Loop over molecules
        do j = 1, nmols
            do k=1, atms_per_mol
                ! Calculate the shift if it occurs.
                shift(j,k,:) = shift(j,k,:) - L(:)*anint((r(it,j,k,:) - &
                    r_old(j,k,:) )/L(:))
                ! Calculate the square displacements
                dsq = ( r(it,j,k,1) + shift(j,k,1) - r(i,j,k,1) ) ** 2. &
                     +( r(it,j,k,2) + shift(j,k,2) - r(i,j,k,2) ) ** 2. &
                     +( r(it,j,k,3) + shift(j,k,3) - r(i,j,k,3) ) ** 2.
                msd(ind, it-1-i, k) = msd(ind, it-1-i, k) + dsq
                ! Calculate the contribution to the c1,c2
            enddo ! End Atoms Loop (k)
            ! Calculate the shift if it occurs.
            shift_cm(j,:) = shift_cm(j,:) - L(:)*anint((r_cm(it,j,:) - &
                r_cm_old(j,:) )/L(:))
            ! Calculate the square displacements
            dsq = ( r_cm(it,j,1) + shift_cm(j,1) - r_cm(i,j,1) ) ** 2. &
                 +( r_cm(it,j,2) + shift_cm(j,2) - r_cm(i,j,2) ) ** 2. &
                 +( r_cm(it,j,3) + shift_cm(j,3) - r_cm(i,j,3) ) ** 2.
            msd_cm(ind,it-1-i) = msd_cm(ind, it-1-i) + dsq
        enddo ! End Molecules Loop (j)
        r_old(:,:,:) = r(it,:,:,:)
        r_cm_old(:,:) = r_cm(it,:,:)
    enddo ! End t's loop (it)
enddo
!$OMP END PARALLEL DO
When this code is run and I later print the averaged msd results, they come out either correct or zero, and it is always one or the other. Do you see an issue that might explain why it only works part of the time? I am brand new to OpenMP, so it is completely possible there is just something incredibly stupid in how I am trying to do this.
Thanks in advance!

Rewriting Matlab eig(A,B) (Generalized eigenvalues/eigenvectors) to C/C++

Does anyone have any idea how I can rewrite eig(A,B) from Matlab, which is used to calculate generalized eigenvectors/eigenvalues? I've been struggling with this problem lately. So far:
Matlab definition of eig function I need:
[V,D] = eig(A,B) produces a diagonal matrix D of generalized
eigenvalues and a full matrix V whose columns are the corresponding
eigenvectors so that A*V = B*V*D.
So far I tried the Eigen library (http://eigen.tuxfamily.org/dox/classEigen_1_1GeneralizedSelfAdjointEigenSolver.html)
My implementation looks like this:
std::pair<Matrix4cd, Vector4d> eig(const Matrix4cd& A, const Matrix4cd& B)
{
    Eigen::GeneralizedSelfAdjointEigenSolver<Matrix4cd> solver(A, B);
    Matrix4cd V = solver.eigenvectors();
    Vector4d D = solver.eigenvalues();
    return std::make_pair(V, D);
}
But the first thing that comes to my mind is that I can't use Vector4cd, as .eigenvalues() doesn't return complex values whereas Matlab does. Furthermore, the results of .eigenvectors() and .eigenvalues() for the same matrices are not the same at all:
C++:
Matrix4cd x;
Matrix4cd y;
pair<Matrix4cd, Vector4d> result;
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < 4; j++)
    {
        x.real()(i,j) = (double)(i+j+1+i*3);
        y.real()(i,j) = (double)(17 - (i+j+1+i*3));
        x.imag()(i,j) = (double)(i+j+1+i*3);
        y.imag()(i,j) = (double)(17 - (i+j+1+i*3));
    }
}
result = eig(x,y);
cout << result.first << endl << endl;
cout << result.second << endl << endl;
Matlab:
for i=1:1:4
for j=1:1:4
x(i,j) = complex((i-1)+(j-1)+1+((i-1)*3), (i-1)+(j-1)+1+((i-1)*3));
y(i,j) = complex(17 - ((i-1)+(j-1)+1+((i-1)*3)), 17 - ((i-1)+(j-1)+1+((i-1)*3)));
end
end
[A,B] = eig(x,y)
So I give eig the same 4x4 matrices holding the values 1-16 in ascending (x) and descending (y) order. But I receive different results; furthermore, the Eigen method returns double eigenvalues while Matlab returns complex double. I also found out that there is another Eigen solver named GeneralizedEigenSolver. Its documentation (http://eigen.tuxfamily.org/dox/classEigen_1_1GeneralizedEigenSolver.html) says that it solves A*V = B*V*D, but to be honest I tried it and the results (matrix sizes) are not the same size as Matlab's, so I got quite lost as to how it works (example results are on the website I've linked). It also has only an .eigenvector method.
C++ results:
(-0.222268,-0.0108754) (0.0803437,-0.0254809) (0.0383264,-0.0233819) (0.0995482,0.00682079)
(-0.009275,-0.0182668) (-0.0395551,-0.0582127) (0.0550395,0.03434) (-0.034419,-0.0287563)
(-0.112716,-0.0621061) (-0.010788,0.10297) (-0.0820552,0.0294896) (-0.114596,-0.146384)
(0.28873,0.257988) (0.0166259,-0.0529934) (0.0351645,-0.0322988) (0.405394,0.424698)
-1.66983
-0.0733194
0.0386832
3.97933
Matlab results:
[A,B] = eig(x,y)
A =
Columns 1 through 3
-0.9100 + 0.0900i -0.5506 + 0.4494i 0.3614 + 0.3531i
0.7123 + 0.0734i 0.4928 - 0.2586i -0.5663 - 0.4337i
0.0899 - 0.4170i -0.1210 - 0.3087i 0.0484 - 0.1918i
0.1077 + 0.2535i 0.1787 + 0.1179i 0.1565 + 0.2724i
Column 4
-0.3237 - 0.3868i
0.2338 + 0.7662i
0.5036 - 0.3720i
-0.4136 - 0.0074i
B =
Columns 1 through 3
-1.0000 + 0.0000i 0.0000 + 0.0000i 0.0000 + 0.0000i
0.0000 + 0.0000i -1.0000 - 0.0000i 0.0000 + 0.0000i
0.0000 + 0.0000i 0.0000 + 0.0000i -4.5745 - 1.8929i
0.0000 + 0.0000i 0.0000 + 0.0000i 0.0000 + 0.0000i
Column 4
0.0000 + 0.0000i
0.0000 + 0.0000i
0.0000 + 0.0000i
-0.3317 + 1.1948i
My second try was with Intel IPP, but it seems that it solves only A*V = V*D, and support told me that it's not supported anymore.
https://software.intel.com/en-us/node/505270 (list of constructors for Intel IPP)
I got a suggestion to move from Intel IPP to MKL. I did, and hit the wall again. I tried to check all algorithms for Eigen, but it seems that only A*V = V*D problems are solved. I was checking lapack95.lib. The list of algorithms used by this library is available here:
https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/index.htm#dsyev.htm
Somewhere on the web I found a topic on Mathworks where someone said that they managed to partially solve my problem using MKL:
http://jp.mathworks.com/matlabcentral/answers/40050-generalized-eigenvalue-and-eigenvectors-differences-between-matlab-eig-a-b-and-mkl-lapack-dsygv
That person said that he/she used the dsygv algorithm, but I can't locate anything like that on the web. Maybe it's a typo.
Does anyone have any other proposal or idea for how I can implement it? Or maybe point out my mistake? I'd appreciate that.
EDIT:
In the comments I received a hint that I was using the Eigen solver wrong: my A matrix wasn't self-adjoint and my B matrix wasn't positive-definite. I took matrices from the program I want to rewrite in C++ (from a random iteration) and checked whether they meet the requirements. They did:
Rj =
1.0e+02 *
Columns 1 through 3
0.1302 + 0.0000i -0.0153 + 0.0724i 0.0011 - 0.0042i
-0.0153 - 0.0724i 1.2041 + 0.0000i -0.0524 + 0.0377i
0.0011 + 0.0042i -0.0524 - 0.0377i 0.0477 + 0.0000i
-0.0080 - 0.0108i 0.0929 - 0.0115i -0.0055 + 0.0021i
Column 4
-0.0080 + 0.0108i
0.0929 + 0.0115i
-0.0055 - 0.0021i
0.0317 + 0.0000i
Rt =
Columns 1 through 3
4.8156 + 0.0000i -0.3397 + 1.3502i -0.2143 - 0.3593i
-0.3397 - 1.3502i 7.3635 + 0.0000i -0.5539 - 0.5176i
-0.2143 + 0.3593i -0.5539 + 0.5176i 1.7801 + 0.0000i
0.5241 + 0.9105i 0.9514 + 0.6572i -0.7302 + 0.3161i
Column 4
0.5241 - 0.9105i
0.9514 - 0.6572i
-0.7302 - 0.3161i
9.6022 + 0.0000i
As for Rj, which is now my A - it is self-adjoint because Rj = Rj' and Rj = ctranspose(Rj). (http://mathworld.wolfram.com/Self-AdjointMatrix.html)
As for Rt, which is now my B - it is positive-definite, which I checked with the method linked to me. (http://www.mathworks.com/matlabcentral/answers/101132-how-do-i-determine-if-a-matrix-is-positive-definite-using-matlab). So
>> [~,p] = chol(Rt)
p =
0
I've rewritten matrices manually to C++ and performed eig(A,B) again with matrices meeting requirements:
Matrix4cd x;
Matrix4cd y;
pair<Matrix4cd, Vector4d> result;
x.real()(0,0) = 13.0163601949795;
x.real()(0,1) = -1.53172561296005;
x.real()(0,2) = 0.109594869350436;
x.real()(0,3) = -0.804231869422614;
x.real()(1,0) = -1.53172561296005;
x.real()(1,1) = 120.406645675346;
x.real()(1,2) = -5.23758765476463;
x.real()(1,3) = 9.28686785230169;
x.real()(2,0) = 0.109594869350436;
x.real()(2,1) = -5.23758765476463;
x.real()(2,2) = 4.76648319080400;
x.real()(2,3) = -0.552823839520508;
x.real()(3,0) = -0.804231869422614;
x.real()(3,1) = 9.28686785230169;
x.real()(3,2) = -0.552823839520508;
x.real()(3,3) = 3.16510496622613;
x.imag()(0,0) = -0.00000000000000;
x.imag()(0,1) = 7.23946944213164;
x.imag()(0,2) = 0.419181335323979;
x.imag()(0,3) = 1.08441894337449;
x.imag()(1,0) = -7.23946944213164;
x.imag()(1,1) = -0.00000000000000;
x.imag()(1,2) = 3.76849276970080;
x.imag()(1,3) = 1.14635625342266;
x.imag()(2,0) = 0.419181335323979;
x.imag()(2,1) = -3.76849276970080;
x.imag()(2,2) = -0.00000000000000;
x.imag()(2,3) = 0.205129702522089;
x.imag()(3,0) = -1.08441894337449;
x.imag()(3,1) = -1.14635625342266;
x.imag()(3,2) = 0.205129702522089;
x.imag()(3,3) = -0.00000000000000;
y.real()(0,0) = 4.81562784930907;
y.real()(0,1) = -0.339731222392148;
y.real()(0,2) = -0.214319720979258;
y.real()(0,3) = 0.524107127885349;
y.real()(1,0) = -0.339731222392148;
y.real()(1,1) = 7.36354235698375;
y.real()(1,2) = -0.553927983436786;
y.real()(1,3) = 0.951404408649307;
y.real()(2,0) = -0.214319720979258;
y.real()(2,1) = -0.553927983436786;
y.real()(2,2) = 1.78008768533745;
y.real()(2,3) = -0.730246631850385;
y.real()(3,0) = 0.524107127885349;
y.real()(3,1) = 0.951404408649307;
y.real()(3,2) = -0.730246631850385;
y.real()(3,3) = 9.60215057284395;
y.imag()(0,0) = -0.00000000000000;
y.imag()(0,1) = 1.35016928394966;
y.imag()(0,2) = -0.359262708214312;
y.imag()(0,3) = -0.910512495060186;
y.imag()(1,0) = -1.35016928394966;
y.imag()(1,1) = -0.00000000000000;
y.imag()(1,2) = -0.517616473138836;
y.imag()(1,3) = -0.657235460367660;
y.imag()(2,0) = 0.359262708214312;
y.imag()(2,1) = 0.517616473138836;
y.imag()(2,2) = -0.00000000000000;
y.imag()(2,3) = -0.316090662865005;
y.imag()(3,0) = 0.910512495060186;
y.imag()(3,1) = 0.657235460367660;
y.imag()(3,2) = 0.316090662865005;
y.imag()(3,3) = -0.00000000000000;
result = eig(x,y);
cout << result.first << endl << endl;
cout << result.second << endl << endl;
And the results of C++:
(0.0295948,0.00562174) (-0.253532,0.0138373) (-0.395087,-0.0139696) (-0.0918132,-0.0788735)
(-0.00994614,-0.0213973) (-0.0118322,-0.0445976) (0.00993512,0.0127006) (0.0590018,-0.387949)
(0.0139485,-0.00832193) (0.363694,-0.446652) (-0.319168,0.376483) (-0.234447,-0.0859585)
(0.173697,0.268015) (0.0279387,-0.0103741) (0.0273701,0.0937148) (-0.055169,0.0295393)
0.244233
2.24309
3.24152
18.664
Results of MATLAB:
>> [A,B] = eig(Rj,Rt)
A =
Columns 1 through 3
0.0208 - 0.0218i 0.2425 + 0.0753i -0.1242 + 0.3753i
-0.0234 - 0.0033i -0.0044 + 0.0459i 0.0150 - 0.0060i
0.0006 - 0.0162i -0.4964 + 0.2921i 0.2719 + 0.4119i
0.3194 + 0.0000i -0.0298 + 0.0000i 0.0976 + 0.0000i
Column 4
-0.0437 - 0.1129i
0.2351 - 0.3142i
-0.1661 - 0.1864i
-0.0626 + 0.0000i
B =
0.2442 0 0 0
0 2.2431 0 0
0 0 3.2415 0
0 0 0 18.6640
The eigenvalues are the same! Nice, but why are the eigenvectors not similar at all?
There is no problem here with Eigen.
In fact, for the second example run, Matlab and Eigen produced the very same result. Please remember from basic linear algebra that eigenvectors are determined only up to an arbitrary scaling factor. (I.e., if v is an eigenvector, then so is alpha*v, where alpha is a non-zero complex scalar.)
It is quite common that different linear algebra libraries compute different eigenvectors, but this does not mean that one of the two codes is wrong: it simply means that they choose a different scaling of the eigenvectors.
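One way to confirm that two libraries agree is to test whether corresponding eigenvectors are scalar multiples of each other, rather than comparing them entry by entry. The following is a minimal sketch using only the C++ standard library; the function name `same_up_to_scaling` and the default tolerance are assumptions for illustration and are not part of Eigen or Matlab. It relies on the Cauchy-Schwarz inequality, which becomes an equality exactly when the two vectors are parallel.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Returns true when u is (numerically) a complex scalar multiple of v.
// The tolerance is an assumed default; pick one that matches the
// precision of the data being compared.
bool same_up_to_scaling(const cvec& u, const cvec& v, double tol = 1e-9) {
    if (u.size() != v.size()) return false;
    std::complex<double> dot(0.0, 0.0);
    double norm_u_sq = 0.0;
    double norm_v_sq = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i) {
        dot += std::conj(u[i]) * v[i];  // Hermitian inner product <u, v>
        norm_u_sq += std::norm(u[i]);   // std::norm returns |z|^2
        norm_v_sq += std::norm(v[i]);
    }
    if (norm_u_sq == 0.0 || norm_v_sq == 0.0) return false;
    // |<u, v>| / (|u| |v|) equals 1 exactly when u and v are parallel.
    const double cosine = std::abs(dot) / std::sqrt(norm_u_sq * norm_v_sq);
    return std::abs(1.0 - cosine) < tol;
}
```

When comparing against values copied from printed output, such as the four-digit Matlab columns in the question, a looser tolerance (for example 1e-4) is needed to absorb the rounding.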
EDIT
The main problem with exactly replicating the scaling chosen by Matlab is that eig(A,B) is a driver routine which, depending on the properties of A and B, may call different libraries/routines and apply extra steps like balancing the matrices and so on. By quickly inspecting your example, I would say that in this case Matlab is enforcing the following condition:
all(imag(V(end,:))==0) (the last component of each eigenvector is real)
but not imposing other constraints. This unfortunately means that the scaling is not unique, and probably depends on intermediate results of the generalised eigenvector algorithm used. In this case I'm not able to give you advice on how to exactly replicate matlab: knowledge of the internal working of matlab is required.
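For comparison purposes, the condition above can be imposed on any eigenvector after the fact by rotating its phase. The sketch below uses only the C++ standard library; the function name `make_last_component_real` is hypothetical, and choosing a non-negative last component is an assumption (the condition itself leaves the sign free, and Matlab's output above contains negative last components too). The magnitude of the vector is left unchanged.

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Multiplies v by a unit-modulus complex factor so that its last
// component becomes real and non-negative. The vector's norm is unchanged.
cvec make_last_component_real(cvec v) {
    if (v.empty()) return v;
    const double mag = std::abs(v.back());
    if (mag == 0.0) return v;  // phase is undefined for a zero component
    const std::complex<double> phase = std::conj(v.back()) / mag;
    for (auto& x : v) {
        x *= phase;
    }
    return v;
}
```

Applied to the first eigenvector column printed by Eigen in the question, this gives approximately 0.0208 - 0.0218i for the first entry and 0.3194 for the last, which agrees with the first column of the Matlab output to the printed precision.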
As a general remark, in linear algebra one usually does not care too much about eigenvector scaling, since it is usually completely irrelevant to the problem being solved when the eigenvectors are just used as intermediate results.
The only case in which the scaling has to be defined exactly is when you are going to give a graphic representation of the eigenvectors.
The eigenvector scaling in Matlab seems to be based on normalizing them to 1.0 (i.e. the absolute value of the biggest term in each vector is 1.0). In the application I was using, it also returns the left eigenvector rather than the more commonly used right eigenvector. This could explain the differences between Matlab and the eigensolvers in Lapack MKL.

Nested Loop Optimization in OpenMP

I can't get the correct output once I apply OpenMP. Is there any way to get this right?
!$OMP PARALLEL DO SHARED(outmtresult,inpa,inpb,dynindexlist) PRIVATE(i,j) REDUCTION(+:outcountb)
do i=1,size1
    do j=1, size1
        outcountb = outcountb + 1
        outmtresult(j) = tan(inpa(j) + inpb(j)) + alpha1 + dynindexlist(i)
    enddo
enddo
!$OMP END PARALLEL DO
Just swap your loops and everything will be fine:
!$OMP PARALLEL DO SHARED(outmtresult,inpa,inpb,dynindexlist) PRIVATE(i,j) REDUCTION(+:outcountb)
do j=1,size1 ! <-- Swap i and
    do i=1, size1 ! j here
        outcountb = outcountb + 1
        outmtresult(j) = tan(inpa(j) + inpb(j)) + alpha1 + dynindexlist(i)
    enddo
enddo
!$OMP END PARALLEL DO
In your example, multiple threads write into the same memory address outmtresult(j) since you parallelize the do i loop.
By swapping the loops, you parallelize over do j and you will not write
at the same destination with multiple concurrent threads.
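The same ownership argument can be sketched in C++ (the thread itself is Fortran; the function name `fill_result`, the zero-based indexing, and the assumption that all four vectors have the same length are choices made for illustration). With the `j` loop outermost and parallelized, each thread writes a disjoint set of `out[j]` entries, and the counter is combined through a reduction. If the compiler is not run with OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cmath>
#include <vector>

// C++ analogue of the swapped loop nest. Returns the iteration count.
// Assumes out, a, b and dyn all have the same length.
long fill_result(std::vector<double>& out,
                 const std::vector<double>& a,
                 const std::vector<double>& b,
                 const std::vector<double>& dyn,
                 double alpha) {
    const int n = static_cast<int>(out.size());
    long count = 0;
    // Parallelizing over j means out[j] is written by exactly one thread.
    #pragma omp parallel for reduction(+ : count)
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            count += 1;
            // As in the original, the last i iteration determines out[j].
            out[j] = std::tan(a[j] + b[j]) + alpha + dyn[i];
        }
    }
    return count;
}
```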

Intrinsic store - bad performance

I want to write a benchmark for Xeon Phi (60 cores). In my program I use the OpenMP standard and Intel intrinsics. I implemented a parallel version of the algorithm (5-point stencil computation) which is almost 230 times faster than the scalar algorithm. I want to add SIMD to the parallel code, but I have a problem with performance. When I call _mm512_store_pd(), the performance of the computation drops and the parallel version with SIMD is slower than the version without SIMD. What is the problem? What should I do to get better performance?
for(int i=start; i<stop; i+=threadsPerCore)
{
    for(int j=8; j<n+8; j+=8)
    {
        __m512d v_c = _mm512_load_pd(&matrixIn[i * n_real + j]);
        __m512d v_g = _mm512_load_pd(&matrixIn[(i - 1) * n_real + j]);
        __m512d v_d = _mm512_load_pd(&matrixIn[(i + 1) * n_real + j]);
        __m512d v_l = _mm512_loadu_pd(&matrixIn[i * n_real + (j - 1)]);
        __m512d v_p = _mm512_loadu_pd(&matrixIn[i * n_real + (j + 1)]);
        __m512d v_max = _mm512_max_pd(v_c, v_g);
        v_max = _mm512_max_pd(v_max, v_d);
        v_max = _mm512_max_pd(v_max, v_l);
        v_max = _mm512_max_pd(v_max, v_p);
        _mm512_store_pd(&matrixOut[i * n_real + j], v_max);
    }
}
I start the computation from 8 because one vector at the beginning and one vector at the end are halo elements. n_real is the size of a vector -> n + 16. start and stop are computed because I partition the matrix for 60 cores, and one part (m/60) is computed by 4 HM threads.
Someone (maybe you) seems to have asked an identical question (at least, the code sample quoted is the same as yours) on Intel Developer Zone at https://software.intel.com/en-us/forums/topic/531721 where there are answers (including a rewrite that got 40% performance improvement).
Perhaps reading that would be useful?
(If it was you, I see no objection to asking in both places, but it would be polite to tell people here that you have already asked there, so that they don't waste time reproducing answers that people have already given in the other forum).