how to tell the compiler that a loop can be safely parallelized? - fortran

I'm updating a few elements in a large array.
Each update consists of:
multiplying the current value by ten (if it's not zero)
clearing the current value
moving the updated value to a new position in the array
I know there will be no collision when a move occurs.
How can I tell the compiler that it can safely parallelize the loop?
do i = 1, 1000000
  if ( v(i) /= 0 ) then
    temp = v(i) * 10
    v(i) = 0
    ndx = get_move_to_ndx(i)
    v(ndx) = temp
  end if
end do
I'm on ifort, but I guess this is compiler independent.

Here is a hybrid approach that uses temporary vectors, to give you some ideas. The WHERE construct may not be correct as written; you would have to try it. The main advantage of WHERE/ELSEWHERE is readability: it is usually not as fast as explicit loops, just easier to read.
!DIR$ SIMD
FillTemp: Do I = 1, 1000000
  Temps(I) = v(I)*10
ENDDO FillTemp
!$OMP PARALLEL DO
FindIndex: Do I = 1, 1000000
  ndx_vect(I) = get_move_to_ndx(I)
ENDDO FindIndex
! Untested, as noted above: check the mask logic before relying on it
WHERE( Temps /= 0 )
  v = 0
ELSEWHERE
  v(ndx_vect) = Temps
ENDWHERE
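The temporary-vector idea can be sketched in Python as two phases: a read-only pass that computes every scaled value and destination, then a write pass that clears and scatters. This is only an illustration, not the answer's Fortran. The function `shift_right` is a hypothetical stand-in for `get_move_to_ndx`, indices are 0-based, and the sketch assumes all destinations are distinct and none of them is itself a nonzero source.

```python
def move_scaled(v, get_move_to_ndx):
    """Return a copy of v with each nonzero value multiplied by ten and moved."""
    # Phase 1: read-only pass. Every entry depends on a single index,
    # so these three lists could be filled in any order (or in parallel).
    sources = [i for i, value in enumerate(v) if value != 0]
    temps = [v[i] * 10 for i in sources]
    targets = [get_move_to_ndx(i) for i in sources]

    # Phase 2: clear the sources, then scatter the scaled values.
    # This is order-independent only under the no-collision assumption.
    out = list(v)
    for i in sources:
        out[i] = 0
    for target, temp in zip(targets, temps):
        out[target] = temp
    return out


def shift_right(i):
    # Hypothetical stand-in for get_move_to_ndx: move to the next slot.
    return i + 1


print(move_scaled([1, 0, 2, 0], shift_right))  # [0, 10, 0, 20]
```

Separating the reads from the writes is what makes each phase free of loop-carried dependencies, at the cost of the extra temporary storage.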

Related

What is the time complexity of this for loop (be related to `n`)?

What is the time complexity of this for loop, as a function of n?
for(int i = 1, j; i <= n; i = j + 1)
{
j = n / (n / i);
}
Please note that i, j and n are integer variables and they follow integer arithmetic. In particular, the expression n/(n/i) inside the loop should be interpreted as floor(n / floor(n/i)).
If we use j = i; instead of j = n / (n / i);, the time complexity is O(n).
Now with j = n / (n / i);, write n = i*k + r, where k = n/i and r = n%i are integers. Then j = (i*k+r)/((i*k+r)/i) = (i*k+r)/k = i + r/k >= i, which means i advances at least as fast as in the case j = i;. So the loop runs at most n times, and O(n) is an upper bound on the time complexity.
Besides big O, there are two other notations, Ω and Θ: Ω gives an asymptotic lower bound, while Θ gives a tight bound (both upper and lower). You can pin down the time complexity by finding matching upper and lower bounds. There is also the rule that O(k*n) = O(n): a constant coefficient k doesn't matter, no matter how big it is.
As elaborated by taotsi, the increment for i in each iteration is
inc = 1 + r/k
where r=n%i and k=n/i. Since r<i, the increment is 1 as long as i<sqrt(n) (because then r/k is roughly bounded by i*i/n<1, which becomes 0 in integer division). Thereafter, the increment is (typically) 2 as long as i<2*sqrt(n). This continues in a manner similar to a geometric series, giving a factor of 2 on top of sqrt(n), i.e. about 2*sqrt(n) iterations.
If we write n = a*a+b with integers 0 <= b <= 2*a (i.e. a=int(sqrt(n)) and b=n-a*a), then the total number of iterations in simple experiments is always
b < a? 2*a-1 : 2*a
Thus, the complexity is O(√n) (provided some useful work is done inside the loop, for example counting the number of total iterations, such that the compiler is not allowed to elide the whole loop).
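The closed form above can be checked empirically with a short Python sketch. The function names are illustrative, and `math.isqrt` (Python 3.8+) plays the role of `int(sqrt(n))`. This is a brute-force comparison over small n, not a proof.

```python
import math


def count_iterations(n):
    """Count how many times the loop body runs for a given n."""
    i = 1
    count = 0
    while i <= n:
        j = n // (n // i)  # integer division, as in the original loop
        i = j + 1
        count += 1
    return count


def predicted_iterations(n):
    """Closed form: with n = a*a + b, the count is 2a-1 if b < a, else 2a."""
    a = math.isqrt(n)
    b = n - a * a
    return 2 * a - 1 if b < a else 2 * a


# Compare the two for every n in a small range.
mismatches = [n for n in range(1, 5001)
              if count_iterations(n) != predicted_iterations(n)]
print(mismatches)  # [] when the closed form holds on this range
```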
As @Walter has already offered a proof, I am too late for that part, but here is a Python 3 version of your code and a plot of the number of iterations as a function of n, compared with the function 2*sqrt(n). They look approximately the same (up to n = 1e9).
import math

import matplotlib.pyplot as plt
from numba import jit


@jit  # compile the loop with numba so the large values of n run quickly
def weird_increment_loop(n):
    i = 1
    j = 0
    iterations = 0
    while i <= n:
        j = n // (n // i)
        i = j + 1
        iterations = iterations + 1
    return iterations


iterations = []
func_2sqrt = []
domain = range(0, 1000000001, 1000000)
for n in domain:
    iterations.append(weird_increment_loop(n))
    func_2sqrt.append(math.sqrt(n) * 2)

plt.plot(domain, iterations)
plt.plot(domain, func_2sqrt)
plt.xlabel("n")
plt.ylabel("iterations(n) and 2*sqrt(n)")
plt.show()
Here is the plot: [figure: iterations(n) and 2*sqrt(n) plotted against n]
If you see no difference, it is because there is close to none :D Of course, one should always trust Mathematics ;)
Strictly by the rules of C++, it's O(1). Either the loop terminates after some finite amount of doing no observable work, or it loops forever (which is undefined behaviour). A conforming implementation may assume that undefined behaviour is not encountered, so we may assume it terminates.
Because the observable effects of the program do not depend on what happens inside the loop, an implementation is allowed, under the as-if rule, to optimize the loop away entirely.

Non-recursive implementation of perms in Matlab, compatible with Coder

I am trying to convert part of my MATLAB function into C++ using Coder. Coder doesn't support the function perms, which I use extensively in my code. After looking online I found a few suggestions for generating a list of all permutations without perms, but they do it "by hand": for permutations of 3 elements there are three for loops, for 4 elements four loops, and so on.
Example for 1:4:
row = 1;
n = 4;
Z = zeros(factorial(n),n);
idxarray1=[1:4];
for idx=idxarray1
    idxarray2=idxarray1(find(idxarray1~=idx)) ;
    for jdx=idxarray2
        idxarray3=idxarray2(find(idxarray2~=jdx));
        for kdx=idxarray3
            idxarray4=idxarray3(find(idxarray3~=kdx)) ;
            for mdx=idxarray4
                Z(row,:) = [idx,jdx,kdx,mdx];
                row = row + 1 ;
            end
        end
    end
end
For 8 elements I would have to write 8 for loops. Any suggestions on how I can generalize this to n elements? Something like
for i=n:-1:1
I=[1:n] ;
for j=1:i
J=I(find(I~=j));
... ?
thank you
The problem here is that perms uses recursion, which is one of the language features that Matlab Coder does not support. So what we need to do is to come up with an implementation that is non-recursive.
Interestingly enough, perms was recursive before Matlab 6.0, then non-recursive, and then recursive again. So rather than reinventing the wheel, we can just take one of the previous non-recursive revisions, e.g. 1.10.
Note that the order of the permutations is different, but you should not be relying on that in your code anyway. You might need to change the name to avoid a conflict with the native perms function. Tested with coder.screener, which confirms that Coder supports it.
function P = perms(V)
%PERMS All possible permutations.
% PERMS(1:N), or PERMS(V) where V is a vector of length N, creates a
% matrix with N! rows and N columns containing all possible
% permutations of the N elements.
%
% This function is only practical for situations where N is less
% than about 10 (for N=11, the output takes over 3 giga-bytes).
%
% See also NCHOOSEK, RANDPERM, PERMUTE.
% ZP. You, 1-18-99
% Copyright 1984-2000 The MathWorks, Inc.
% $Revision: 1.10 $ $Date: 2000/06/16 17:00:47 $
V = V(:)';
n = length(V);
if n == 0
P = [];
else
c = cumprod(1:n);
cn = c(n);
P = V(ones(cn,1),:);
for i = 1:n-1; % for column 1 to n-1, switch oldidx entry with newidx entry
% compute oldidx
j = n-i;
k = (n-j-1)*cn;
oldidx = (c(j)+1+k:c(j+1)+k)';
% spread oldidx and newidx over corresponding rows
for k = j+1:n-1
q = 0:c(k):k*c(k);
shift = q(ones(length(oldidx),1),:);
oldidx = oldidx(:,ones(1,k+1));
oldidx = oldidx(:)+shift(:);
end
% compute newidx
colidx = cn:cn:j*cn;
colidx = colidx(ones(c(j),1),:);
colidx = colidx(:);
colidx = colidx(:,ones(1,length(oldidx)/(j*c(j))));
newidx = oldidx + colidx(:);
% do the swap
q = P(newidx);
P(newidx)=P(oldidx);
P(oldidx)=q;
end
end
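For comparison, here is a different non-recursive technique, sketched in Python: repeatedly stepping to the next permutation in lexicographic order. This is not the method used in the MathWorks revision above, and the function name `iter_perms` is illustrative. It assumes the elements are comparable. With repeated elements it returns only the distinct orderings, unlike perms.

```python
def iter_perms(values):
    """Return all permutations of values in lexicographic order, without recursion."""
    a = sorted(values)
    n = len(a)
    result = [tuple(a)]
    while True:
        # Find the rightmost position i with a[i] < a[i + 1].
        i = n - 2
        while i >= 0 and a[i] >= a[i + 1]:
            i -= 1
        if i < 0:
            return result  # a is in descending order: no permutation follows
        # Find the rightmost element larger than a[i] and swap the two.
        j = n - 1
        while a[j] <= a[i]:
            j -= 1
        a[i], a[j] = a[j], a[i]
        # Reverse the tail so it becomes the smallest possible suffix.
        a[i + 1:] = reversed(a[i + 1:])
        result.append(tuple(a))


for p in iter_perms([1, 2, 3]):
    print(p)
# (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)
```

The loop uses only indexing, swaps, and a reversal, so the same structure should carry over to languages or tools that restrict recursion.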

Example of C++ code optimization for parallel computing

I'm trying to understand optimization techniques. I'm focusing on the most critical part of my code: it has several loops of length "nc" and one loop of length "np", where "np" is much larger than "nc". I present that part here. The rest of the code accounts for little of the computational time, so there I prefer to keep the code clean. The critical loop of length "np", however, is a fairly simple piece of code that can be parallelized, so it will not hurt to rewrite it in a more efficient but less readable form (perhaps using SSE instructions). I'm using gcc, C++, and OpenMP parallelization.
This code is part of the well-known particle-in-cell (PIC) algorithm, in one of its most basic forms. I want to learn code optimization on this version: my goal is not just an efficient PIC algorithm, which has already been written in a thousand variants, but also a demonstrative example of code optimization. I have made some attempts, but I am not sure I have handled all the optimization issues correctly.
const int NT = ...;       // number of threads (in two versions: about 6 or about 30)
const int np = 10000000;  // np is commonly about 1000-10000 times larger than nc
const int nc = 10000;
const int step = 1000;
float u[np], x[np], weight[np];  // weight is indexed by particle k, so it needs np entries
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc];
int i, k, p, num;
for ( i = 0 ; i<step ; i++) {
  // ***
  // *** some not very time consuming code for calculation
  // *** a, a_lin from values of rho_full and rho_diff
  // "parallel for" assumes there is no enclosing parallel region;
  // OpenMP also requires the canonical loop form used below
  #pragma omp parallel for private(p,num)
  for ( k = 0 ; k < np ; k++ ) {
    num = omp_get_thread_num();
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k]<0 ) {x[k]+=nc;} else
    if (x[k]>=nc) {x[k]-=nc;}  // >= keeps p below nc
    p = (int) x[k];
    rho_full[num][p] += weight[k];
    rho_diff[num][p] += weight[k] * (x[k] - p);
  }
}
I am aware of the following issues:
1) (main question) I use a set of arrays rho_full[num][p], where num is the index of each thread. After the computation I simply sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] ...). The reason is to avoid two different threads writing to the same part of an array. I am not sure whether this is an efficient solution (note that "nc" is relatively small, so the operations over "np" probably still dominate).
2) (also an important question) I read x[k] many times, and it is also modified many times. Maybe it is better to read the value into a register, leave the x array alone during the calculation (or keep a pointer to the element), and store the result back into x[k] only at the end. I believe the compiler does this for me, but I am not sure, because x[k] is modified in the middle of the algorithm. So the compiler probably does something efficient on its own, but in this version it may load and store the value more often than necessary, because I alternate between reading and writing it more than once.
3) (probably not relevant) The code works with both the integer part and the fractional part of x, and it needs both. I compute the integer part as p = (int) x and the fractional part as x - p. I do this at the beginning and again at the end of the loop body. This split could be stored and reused in the next step (I mean the next step of the i loop). Do you think the following version is better? It stores the integer and fractional parts in separate arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for ( k = 0 ; k < np ; k++ ) {
  num = omp_get_thread_num();
  u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
  x_rem[k] += u[k];
  p = (int) floorf(x_rem[k]);  // *** This part is added to simplify the rest. floorf (from <cmath>),
                               // *** unlike truncation, keeps the remainder in [0,1).
  x_int[k] += p;               // *** Maybe there is a better way to implement
  x_rem[k] -= p;               // *** this "pushing correction".
  if (x_int[k]<0 ) {x_int[k]+=nc;} else
  if (x_int[k]>=nc) {x_int[k]-=nc;}
  rho_full[num][x_int[k]] += weight[k];
  rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
}
You can use an OpenMP reduction clause for your for loop. Note that this sums everything into a single scalar rather than into per-cell arrays:
float result = 0;  // float, because the summed terms are floats
#pragma omp parallel for private(p,num) reduction(+:result)
for ( k = 0 ; k < np ; k++ ) {
  num = omp_get_thread_num();
  p = (int) x[k];
  u[k] += a[p] + a_lin[p] * (x[k] - p);
  x[k] += u[k];
  if (x[k]<0 ) {x[k]+=nc;} else
  if (x[k]>=nc) {x[k]-=nc;}
  p = (int) x[k];
  result += weight[k] + weight[k] * (x[k] - p);
}
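The per-thread accumulation from question 1 can be illustrated with a small serial Python sketch: each worker deposits its slice of particles into private arrays, and the private arrays are summed afterwards. The function names are illustrative, and the workers run one after another here, so this only shows that the private-copy scheme reproduces the single-array result. It says nothing about speed.

```python
def deposit(x, weight, nc):
    """Accumulate weights (and weight * fractional part) into nc cells."""
    rho_full = [0.0] * nc
    rho_diff = [0.0] * nc
    for xk, wk in zip(x, weight):
        p = int(xk)  # cell index; assumes 0 <= xk < nc
        rho_full[p] += wk
        rho_diff[p] += wk * (xk - p)
    return rho_full, rho_diff


def deposit_private(x, weight, nc, n_workers):
    """Same result, but each worker fills its own arrays, summed at the end."""
    size = max(1, -(-len(x) // n_workers))  # ceiling division
    partials = [deposit(x[s:s + size], weight[s:s + size], nc)
                for s in range(0, len(x), size)]
    rho_full = [0.0] * nc
    rho_diff = [0.0] * nc
    for part_full, part_diff in partials:  # the final O(n_workers * nc) sum
        for c in range(nc):
            rho_full[c] += part_full[c]
            rho_diff[c] += part_diff[c]
    return rho_full, rho_diff


x = [0.5, 1.25, 1.75, 3.0]
w = [1.0, 2.0, 4.0, 8.0]
print(deposit(x, w, 4))             # ([1.0, 6.0, 0.0, 8.0], [0.5, 3.5, 0.0, 0.0])
print(deposit_private(x, w, 4, 2))  # same values
```

The final sum costs on the order of NT * nc additions per step, which is small next to the np particle updates when nc is much smaller than np, as the question assumes.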

Efficient algorithm for iterative ranking of a vector

Suppose I've a scrambled vector of consecutive integers 1:n, say {3,6,2,1,4,5}. My problem is to find, for each element, the number of elements to its left that are smaller than itself. So I'd like the program to return {0,1,0,0,3,4} for this example. This is what I've written in Fortran:
subroutine iterrank(n,invec,outvec,tempvec)
implicit none
integer :: n, i, currank
integer, dimension(n) :: invec, outvec, tempvec
tempvec = 0
outvec = 0
do i = 1,n
currank = invec(i)
outvec(i) = tempvec(currank)
tempvec(currank:n) = tempvec(currank:n) + 1
end do
return
end subroutine
It takes a temporary array (vector), and for each value d the loop encounters, it adds 1 to every element from position d onward in the temporary vector. A later iteration then reads the appropriate element of the temporary vector as the count of elements smaller than itself. My questions are:
1) I believe this is of complexity O(n^2), since there are O(n) writes to the temporary vector in each iteration of the loop. Am I correct?
2) Is there a more efficient way of doing this for large n (say, >100k)?
I believe this would be more efficient, and you could also shrink the elements of the temporary array to single-byte integers, since they only ever hold 0 or 1.
subroutine iterrank(n,invec,outvec,tempvec)
implicit none
integer :: n, i, currank
integer, dimension(n) :: invec, outvec, tempvec
tempvec = 0
!outvec = 0 ! no need to initialize something overwritten below
do i = 1 , n
currank = invec(i)
outvec(i) = sum( tempvec(1:currank) )
tempvec(currank) = 1
end do
end subroutine
The gain is that you only write twice per index; however, you read up to n*n elements in total.
EDIT:
I haven't tried this, but it should do fewer reads, with a possible overhead from branching. It is possibly faster for extremely large arrays; I would, however, expect it to be slower for short arrays:
subroutine iterrank(n,invec,outvec,tempvec)
implicit none
integer :: n, i, currank, prevrank
integer, dimension(n) :: invec, outvec, tempvec
tempvec = 0
outvec(1) = 0
tempvec(invec(1)) = 1
do i = 2 , n
prevrank = invec(i-1)
currank = invec(i)
if ( abs(prevrank-currank) > currank ) then
outvec(i) = sum( tempvec(1:currank) )
else if ( prevrank < currank ) then
outvec(i) = outvec(i-1) + sum( tempvec(prevrank:currank) )
else
outvec(i) = outvec(i-1) - sum( tempvec(currank:prevrank-1) )
end if
tempvec(currank) = 1
end do
end subroutine iterrank
Complete rewrite of the answer. If memory is not a concern, you can add another vector and use an algorithm like the one below. The additional vector is used to compute the inverse permutation (the position of each value). Because the original vector is a permutation of the integers 1 to n, this permutation is computed in O(n). With vectors of size 100k on my computer, this algorithm runs in 1.9 s on average (100 runs), whereas zeroth's initial proposal takes 2.8 s on average. I suggest this solution simply because zeroth said he did not test his new solution; you can test both and use the better one.
subroutine iterrank(n,invec,outvec,tempvec,ord)
implicit none
!
integer :: n, i, currPos, prevPos, currOut, prevOut
integer, dimension(n) :: invec, outvec, tempvec,ord
!
tempvec = 0
do i = 1, n
ord(invec(i)) = i
end do
!
currPos = ord(1)
tempvec(currPos) = 1
currOut = 0
outvec(currPos) = currOut
! last = 0
do i = 2 , n
prevPos = currPos
currPos = ord(i)
!
if(currPos>prevPos)then
currOut = currOut+sum( tempvec(prevPos:currPos) )
else
currOut = sum( tempvec(1:currPos) )
end if
!
outvec(currPos) = currOut
tempvec(currPos) = 1
end do
!
end subroutine iterrank
The downside of this solution is the random access to the vectors outvec and tempvec, which does not make the best use of cache and registers. It is possible to address that and significantly reduce the run time, possibly at the expense of additional temporary vectors.
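None of the variants above changes the quadratic worst case, because each still sums over a slice of tempvec. A standard alternative that is not taken from the answers is a Fenwick tree (binary indexed tree), which supports both "mark this value as seen" and "count seen values below d" in O(log n), for O(n log n) overall. A minimal Python sketch, assuming the input is a permutation of 1..n as in the question (the function name is illustrative):

```python
def count_smaller_to_left(perm):
    """For each element of a permutation of 1..n, count smaller elements to its left."""
    n = len(perm)
    tree = [0] * (n + 1)  # 1-based Fenwick tree over the values 1..n
    out = []
    for value in perm:
        # Query: how many already-seen values lie in 1..value-1?
        i = value - 1
        count = 0
        while i > 0:
            count += tree[i]
            i -= i & -i  # drop the lowest set bit
        out.append(count)
        # Update: mark value as seen.
        i = value
        while i <= n:
            tree[i] += 1
            i += i & -i  # move to the next node covering this index
    return out


print(count_smaller_to_left([3, 6, 2, 1, 4, 5]))  # [0, 1, 0, 0, 3, 4]
```

The same structure translates directly to Fortran with an integer array of size n and two short do-while loops per element.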

Nested for loops to iterate to the power of 2

I need to use two loops in such a way that the outer loop drives the inner loop to do computations for 2, 4, 8, 16, and 32 iterations.
For example, if i=2 (in the outer loop),
then the inner loop will iterate 4 times,
and if i=3,
then the inner loop will iterate 8 times, and so on.
This is the logic I'm using:
for ( i = 0 ; i < n ; i++ )
{
for ( c = 0 ; c <= pow(2,i) ; c=c++ )
I would really appreciate any suggestions
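Compute the number of iterations of the inner loop once and reuse it instead of computing it every time.
Don't use pow(2, i), which works in floating point. Use the more reliable 1 << i, which is exact integer arithmetic.
Don't use c = c++. Just use c++. c = c++ is not equivalent to c = c+1: it is undefined behaviour before C++17, and since C++17 it leaves c unchanged.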
for ( i = 0 ; i < n ; i++ )
{
int nextCount = (1 << i);
for ( c = 0 ; c < nextCount ; c++ )
In C++ you can compute a small power of two with a left bit shift:
for ( i = 0 ; i < n ; i++ ) {
for ( c = 0 ; c < (1 << i) ; c++ ) {
...
}
}
The reason behind this "magic" is the same as the reason why adding a zero to the right of a decimal number multiplies the number by ten.
Note that since you start the iteration of the inner loop at zero, you need to use <, not <=. Otherwise you would iterate 2^i + 1 times.
You'll want to use something like everyone else has suggested:
for (int i=0 ; i<n ; i++){
for(int c=0 ; c < (1<<i) ; c++){
//do computations
}
}
The reason you want to use < instead of <= is that <= will actually give you (2^i)+1 iterations, due to counting zero.
The reason you want to use the bit-shift operation 1<<i is that integers are stored in base two, and appending a zero at the end is equivalent to multiplying by two. (The literal 1 is an int, while 1.0 is a double. You could not safely do this with floating-point values: shifting the bit pattern of the float 1.0f left by one gives 1.70141e+38, if you can get the compiler to do it.)
Finally, you want to use c++ rather than c = c++, because c++ increments c but returns the original value; assigning that value back to c can leave c unchanged, so the inner for loop never advances.
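The off-by-one effect of <= versus < can be checked with a small Python sketch that counts the inner iterations for each i. The function name and the `inclusive` flag are illustrative; Python's `1 << i` behaves like the C++ shift for these small values.

```python
def inner_iteration_counts(n, inclusive=False):
    """Count inner-loop iterations for i = 0..n-1 with bound 1 << i."""
    counts = []
    for i in range(n):
        limit = 1 << i  # 2**i, computed with an integer shift
        c = 0
        count = 0
        # inclusive=True mimics "c <= limit"; otherwise "c < limit".
        while (c <= limit) if inclusive else (c < limit):
            count += 1
            c += 1
        counts.append(count)
    return counts


print(inner_iteration_counts(6))                  # [1, 2, 4, 8, 16, 32]
print(inner_iteration_counts(6, inclusive=True))  # [2, 3, 5, 9, 17, 33]
```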