Write elements of matrix into vector using OpenMP - fortran

I am rather new with OpenMP. I want to write all elements of a big matrix into a vector using OpenMP threading to speed things up.
In my serial code I am simply doing the following:
m=1
DO k=1,n_lorentz
DO i=1,n_channels
DO p=1,n_lorentz
DO j=1,n_channels
vector(m) = Omega(j,p,i,k)
m=m+1
END DO
END DO
END DO
END DO
Now I'd like to use an OMP loop to write the elements of Omega into vector in a parallel fashion:
!$OMP PARALLEL DO PRIVATE(k,i,p,j)
! bla bla
!$OMP END PARALLEL DO
The question is how to keep track of the current vector index, since in this case the m parameter from the serial code will be incremented by different threads, resulting in a total mess.

One answer is: you don't need to keep track of m. Instead, analyzing the loop, we find that:
Every time j increases by one, m increases by one;
Every time p increases by one, m increases by n_channels;
Every time i increases by one, m increases by n_channels*n_lorentz;
Every time k increases by one, m increases by n_channels*n_lorentz*n_channels.
From these observations, you can write an explicit expression for m:
m = j + n_channels*((p-1) + n_lorentz*((i-1) + n_channels*(k-1)))
Being able to explicitly calculate the index should solve your problem :).

Related

MATMUL result not equal with explicit calculation for double precision?

sorry for a seemingly stupid question. I was testing the computational efficiency when replacing for-loop operations on matrices with intrinsic functions. When I check the matrices product results of the two methods, it confused me that the two outputs were not the same. Here is the simplified code I used
program matmultest
integer,parameter::nx=64,ny=32,nz=16
real*8::mat1(nx,ny),mat2(ny,nz)
real*8::result1(nx,nz),result2(nx,nz),diff(nx,nz)
real*8::localsum
integer::i,j,m
do i=1,ny
do j=1,nx
mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
enddo
enddo
do i=1,nz
do j=1,ny
mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
enddo
enddo
do j=1,nz
do i=1,nx
localsum=0d0
do m=1,ny
localsum=localsum+mat1(i,m)*mat2(m,j)
enddo
result1(i,j)=localsum
enddo
enddo
result2=matmul(mat1,mat2)
diff=result2-result1
print*,sum(abs(diff)),maxval(diff)
end program matmultest
And the result gives
1.6705598682165146E-008 5.8207660913467407E-011
The difference is non-zero for real8 but zero when I tested for integer later. I wonder if it is because of my code's faults somewhere or the numerical accuracy of MATMUL() is single precision?
And the compiler I am using is GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Thanks!
francescalus explained that reordering of operations causes these differences. Let's try to find out how it actually happened.
A few words about matrix product
Consider matrices A(n,p), B(p,q), C(n,q) and C = A*B.
The naive approach, a variant of which you used, involves the following nested loops:
c = 0
do i = 1, n
do j = 1, p
do k = 1, q
c(i, j) = c(i, j) + a(i, k) * b(k, j)
end do
end do
end do
These loops can be executed in any of 6 orders, depending on the variable that you choose at each level. In the example above, the loop is named "ijk", and the other variants "ikj", "jik", etc. are all correct.
There is a speed difference, due to the memory cache: when the inner loop runs across contiguous memory elements, the loop is faster. That's the jki or kji cases.
Indeed, since Fortran matrices are stored in column major order, if the innermost loop runs on i, in the instruction c(i, j) = c(i, j) + a(i, k) * c(k, j), the value c(k, j) is constant, and the operation is equivalent to v(i) = v(i) + x * u(i), where the elements of vectors v and u are contiguous.
However, regarding the order of operations, there shouldn't be a difference: you can check for yourself that all elements of C are computed in the same order. At least at the "higher level": the compiler might optimize things differently, and it's where it becomes really interesting.
What about MATMUL? I believe it's usually a naive matrix product, based on the nested loops above, say a jki loop.
There are other ways to multiply matrices, that involve the Strassen algorithm to improve the algorithm complexity or blocking (i.e. computed products of submatrices) to improve cache use. Other methods that could change the result are OpenMP (i.e. multithread), or using FMA instructions. But here we are not going to delve into these methods. It's really only about the nested loops. If you are interested, there are many resources online, check this.
A few words about optimization
Three remarks first:
On a processor without SIMD instructions, you would get the same result as MATMUL (i.e. you would print zero in the end).
If you had implemented the loops as above, you would also get the same result. There is a tiny but significant difference in your code.
If you had implemented the loops as a subroutine, you would also get the same result. Here I suspect the compiler optimizer is doing some reordering, as I can't reproduce your "accumulator" code with a subroutine, at least with Intel Fortran.
Here is your implementation:
do i = 1, n
do j = 1, p
s = 0
do k = 1, q
s = s + a(i, k) * b(k, j)
end do
c(i, j) = s
end do
end do
It's also correct of course. Here, you are using an accumulator, and at the end of the innermost loop, the value of the accumulator is written in the matrix C.
Optimization is typically relevant on the innermost loop mainly. For our purpose, two "basic" instructions in the innermost loop are relevant, if we get rid of all other details:
v(i) = v(i) + x*u(i) (the jki loop)
s = s + x(k)*y(k) (the accumulator loop where y is contiguous in memory, but not x)
The first is usually called a "daxpy" (from the name of a BLAS routine), for "A X Plus Y", the "D" meaning double precision. The second one is just an accumulator.
On an old sequential processor, there is not much to be done to optimize. On a modern processor with SIMD, registers can hold several values, and computations can be done on all of them at once, in parallel. For instance, on x86, an XMM register (from SSE instruction set) can hold two double precision floating-point numbers. A YMM register (from AVX2) can hold four numbers, and a ZMM register (AVX512, found on Xeon) can hold eight numbers.
For instance, on YMM the innermost loop will be "unrolled" to deal with four vector elements at a time (or even more if using several registers).
Here is what the basic loop block is then roughly doing:
daxpy case:
Read 4 numbers from u into register YMM1
Read 4 numbers from v into register YMM2
x is constant and is kept in another register
Multiply in parallel x with YMM1, add in parallel to YMM2, put the result in YMM2
Write back the result to corresponding elements of v
The read/write part is faster if the elements are contiguous in memory, but if they are not it's still worth doing this in parallel.
Note that here, we haven't changed the execution order of additions of the high level Fortran loop.
accumulator case
For the parallelism to be useful, there will be a trick: accumulate four values in parallel in a YMM register, and then add the four accumulated values.
The basic loop block is thus doing this:
The accumulator is kept in YMM3 (four numbers)
Read 4 numbers from X into register YMM1
Read 4 numbers from Y into register YMM2
Multiply in parallel YMM1 with YMM2, add in parallel to YMM3
At the end of the innermost loop, add the four components of the accumulator, and write this back as the matrix element.
It's like if we had computed:
s1 = x(1)*y(1) + x(5)*y(5) + ... + x(29)*y(29)
s2 = x(2)*y(2) + x(6)*y(6) + ... + x(30)*y(30)
s3 = x(3)*y(3) + x(7)*y(7) + ... + x(31)*y(31)
s4 = x(4)*y(4) + x(8)*y(8) + ... + x(32)*y(32)
And then the matrix element written is c(i,j) = s1+s2+s3+s4.
Here the order of additions has changed! And then, since the order is different, the result is very likely different.
I can replicate the results when using fast math (I have Intel Fortran), and when I compile with the default /fp:fast I get the following max error and speed
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
The error is just maxval(abs(diff)) speed measured is in # of matrix operations per second.
But when I compile with /fp:strict then I get no error, but a slowdown with the loops
! Error Loops Matmul
! 0.0000 43140.6 141844.0 STRICT
I see a -60% slowdown in the loops with strict floating-point handling, but surprisingly no slowdown with the matmul() function.
Source Code for completeness
program Console1
use iso_fortran_env
implicit none
integer,parameter :: nr = 100000
integer,parameter::nx=64,ny=32,nz=16
real(real64)::mat1(nx,ny),mat2(ny,nz)
real(real64)::result1(nx,nz),result2(nx,nz),diff(nx,nz)
real(real64)::localsum
integer::i,j,r
integer(int64) :: tic, toc, rate
real(real64) :: dt1, dt2
do i=1,ny
do j=1,nx
mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
enddo
enddo
do i=1,nz
do j=1,ny
mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
enddo
enddo
call SYSTEM_CLOCK(tic,rate)
do r=1, nr
result1=mymatmul(mat1,mat2)
end do
call SYSTEM_CLOCK(toc,rate)
dt1 = dble(toc-tic)/rate
call SYSTEM_CLOCK(tic,rate)
do r=1, nr
result2=matmul(mat1,mat2)
end do
call SYSTEM_CLOCK(toc,rate)
dt2 = dble(toc-tic)/rate
diff=result2-result1
print ('(1x,a16,1x,a16,1x,a16)'), "Error", "Loops", "Matmul"
print ('(1x,g16.5,1x,f16.1,1x,f16.1)'), maxval(abs(diff)), nr/dt1, nr/dt2
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
! 0.0000 43140.6 141844.0 STRICT
!
contains
pure function mymatmul(a,b) result(c)
real(real64), intent(in) :: a(:,:), b(:,:)
real(real64) :: c(size(a,1), size(b,2))
integer :: i,j,k
real(real64) :: sum
do j=1, size(c,2)
do i=1, size(c,1)
sum = 0d0
do k=1, size(a,2)
sum = sum + a(i,k)*b(k,j)
end do
c(i,j) = sum
end do
end do
end function
end program Console1
Always compiled as Release-x64 and not Debug.

Time complexity of mandelbrot set in term of big O notation

I'm trying to find the time complexity of a simple implementation of mandelbrot set. with following code
int main(){
int rows, columns, iterations;
rows = 22;
columns = 72;
iterations = 28;
char matrix[max_rows][max_columns];
for(int r = 0; r < rows; ++r){
for(int c = 0; c < columns; ++c){
complex<float> z;
int itr = 0;
while(abs(z) < 2 && ++itr < iterations)
z = pow(z, 2) + decltype(z)((float)c * 2 / columns - 1.5,
(float)r * 2 / rows - 1);
matrix[r][c]=(itr== iterations ? '*' : '.');
}
}
Now looking at above code i made some estimation for time complexity in terms of big O notation and want to know if it is correct or not
So we are creating a 2d array traversing it through nested loops and and at each element we are performing an operation and setting a value of that element, if we take n as input size we can say that greater the input the greater will be the complexity, so the time complexity for rowsxcolumns would be O(rxc) and then again we are traversing it for printout, so what would be the time complexity? is it O(rxc)+O(rxc) ? does the function itself have some effect on time complexity when we are doing multiplication and subtraction on rows and columns? If yes then how?
Almost, given r rows, c columns and i iterations then the running time is O(r*c*i). This should be trivial to see if abs(z)<2 is not there. But with this extra condition its not clear how many times will the inner while loop run in total. Yes, it will be less than r*c*i times, so O(r*c*i) is still the upper bound. But perhaps we might do better. Given that for any r,c you compute Mandelbrot set over the same domain with varying resolution then the while loop will run k*r*c*i times for some constant k which is somewhere between area-of-Mandelbrot-set-over-area-of-the-domain and 1 --> Running time of the code is Θ(r*c*i) and O(r*c*i) cannot be improved.
Had you computed the set over [-c,c]x[-r,r] domain with fixed resolution then for any |z|>2 the abs(z)<2 breaks after first iteration. Then O(r*c*i) would not be tight bound and this condition (as all loop conditions) should be considered if you want accurate estimation.
Please don't use malloc, std::vector is safer.
In big-O notation, O(rxc)+O(rxc) collapses to O(rxc).
Since the maximal iteration count is also an input variable, it has an influence on the complexity as well. In particular, the inner loop runs at most n iterations, therefore, your complexity is O(rxcxn).
All other operations are constant, in particular multiplication and addition of complex<float>. These operations by themselves are always O(1), which does not contribut to the overall complexity.

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didnt work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
std::vector<Sol> seeds; // vector with initial solutions
std::vector<Sol> sols (N_OUTER*N_INNER); // vector for output solutions
int N_OUTER; // typically 1-8
int N_INNER; // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting
#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
#pragma omp parallel for if (N_OUTER < PAR_THRESH)
for (int inner = 0; inner < N_INNER; ++inner){
sols[outer*N_INNER + inner] = solve(seeds[outer]);
}
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop, and if N_OUTER is greater than the number of threads available, then the outer loop should be the one to be parallelised; because it uses maximum available threads and the threads are long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).
Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?
Or is there a kind of trade off in play, and maybe 2 outer loop iterations might be wasting threads, but if there are 3 outer loop iterations, then having longer threads is beneficial, despite the fact that one thread is remaining unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
Didn't have much experience with OpenMP, but I have found something like collapse directive:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems to be even more appropriate when number of inner loop iterations differs.
--
On the other hand:
It seems to me that solve(...) is side-effect free. It seems also that N_ITER is N_INNER.
Currently you calculate solve N_INNER*N_OUTER times.
While reducing that won't reduce O notation complexity, assuming it has very large constant factor - it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp (N_INNER);
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
sols_tmp[i] = solve(seeds[i]);
}
This calculates only N_OUTER times.
Because solve returns same value for each row:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
sols[i] = sols_tmp[i/N_INNER];
}
Of course it must be measured if parallelization is suitable for those loops.

How to convert a simple computer algorithm into a mathematical function in order to determine the big o notation?

In my University we are learning Big O Notation. However, one question that I have in light of big o notation is, how do you convert a simple computer algorithm, say for example, a linear searching algorithm, into a mathematical function, say for example 2n^2 + 1?
Here is a simple and non-robust linear searching algorithm that I have written in c++11. Note: I have disregarded all header files (iostream) and function parameters just for simplicity. I will just be using basic operators, loops, and data types in order to show the algorithm.
int array[5] = {1,2,3,4,5};
// Variable to hold the value we are searching for
int searchValue;
// Ask the user to enter a search value
cout << "Enter a search value: ";
cin >> searchValue;
// Create a loop to traverse through each element of the array and find
// the search value
for (int i = 0; i < 5; i++)
{
if (searchValue == array[i])
{
cout << "Search Value Found!" << endl;
}
else
// If S.V. not found then print out a message
cout << "Sorry... Search Value not found" << endl;
In conclusion, how do you translate an algorithm into a mathematical function so that we can analyze how efficient an algorithm really is using big o notation? Thanks world.
First, be aware that it's not always possible to analyze the time complexity of an algorithm, there are some where we do not know their complexity, so we have to rely on experimental data.
All of the methods imply to count the number of operations done. So first, we have to define the cost of basic operations like assignation, memory allocation, control structures (if, else, for, ...). Some values I will use (working with different models can provide different values):
Assignation takes constant time (ex: int i = 0;)
Basic operations take constant time (+ - * ∕)
Memory allocation is proportional to the memory allocated: allocating an array of n elements takes linear time.
Conditions take constant time (if, else, else if)
Loops take time proportional to the number of time the code is ran.
Basic analysis
The basic analysis of a piece of code is: count the number of operations for each line. Sum those cost. Done.
int i = 1;
i = i*2;
System.out.println(i);
For this, there is one operation on line 1, one on line 2 and one on line 3. Those operations are constant: This is O(1).
for(int i = 0; i < N; i++) {
System.out.println(i);
}
For a loop, count the number of operations inside the loop and multiply by the number of times the loop is ran. There is one operation on the inside which takes constant time. This is ran n times -> Complexity is n * 1 -> O(n).
for (int i = 0; i < N; i++) {
for (int j = i; j < N; j++) {
System.out.println(i+j);
}
}
This one is more tricky because the second loop starts its iteration based on i. Line 3 does 2 operations (addition + print) which take constant time, so it takes constant time. Now, how much time line 3 is ran depends on the value of i. Enumerate the cases:
When i = 0, j goes from 0 to N so line 3 is ran N times.
When i = 1, j goes from 1 to N so line 3 is ran N-1 times.
...
Now, summing all this we have to evaluate N + N-1 + N-2 + ... + 2 + 1. The result of the sum is N*(N+1)/2 which is quadratic, so complexity is O(n^2).
And that's how it works for many cases: count the number of operations, sum all of them, get the result.
Amortized time
An important notion in complexity theory is amortized time. Let's take this example: running operation() n times:
for (int i = 0; i < N; i++) {
operation();
}
If one says that operation takes amortized constant time, it means that running n operations took linear time, even though one particular operation may have taken linear time.
Imagine you have an empty array of 1000 elements. Now, insert 1000 elements into it. Easy as pie, every insertion took constant time. And now, insert another element. For that, you have to create a new array (bigger), copy the data from the old array into the new one, and insert the element 1001. The 1000 first insertions took constant time, the last one took linear time. In this case, we say that all insertions took amortized constant time because the cost of that last insertion was amortized by the others.
Make assumptions
In some other cases, getting the number of operations require to make hypothesises. A perfect example for this is insertion sort, because it is simple and it's running time depends of how is the data ordered.
First, we have to make some more assumptions. Sorting involves two elementary operations, that is comparing two elements and swapping two elements. Here I will consider both of them to take constant time. Here is the algorithm where we want to sort array a:
for (int i = 0; i < a.length; i++) {
int j = i;
while (j > 0 && a[j] < a[j-1]) {
swap(a, i, j);
j--;
}
}
First loop is easy. No matter what happens inside, it will run n times. So the running time of the algorithm is at least linear. Now, to evaluate the second loop we have to make assumptions about how the array is ordered. Usually, we try to define the best-case, worst-case and average case running time.
Best-case: We do never enter the while loop. Is this possible ? Yes. If a is a sorted array, then a[j] > a[j-1] no matter what j is. Thus, we never enter the second loop. So, what operations are done in this case is the assignation on line 2 and the evaluation of the condition on line 3. Both take constant time. Because of the first loop, those operations are ran n times. Then in the best case, insertion sort is linear.
Worst-case: We leave the while loop only when we reach the beginning of the array. That is, we swap every element all the way to the 0 index, for every element in the array. It corresponds to an array sorted in reverse order. In this case, we end up with the first element being swapped 0 times, element 2 is swapped 1 times, element 3 is swapped 2 times, etc up to element n being swapped n-1 times. We already know the result of this: worst-case insertion is quadratic.
Average case: For the average case, we assume the items are randomly distributed inside the array. If you're interested in the maths, it involves probabilities and you can find the proof in many places. Result is quadratic.
Conclusion
Those were basics about analyzing the time complexity of an algorithm. The cases were easy, but there are some algorithms which aren't as nice. For example, you can look at the complexity of the pairing heap data structure which is much more complex.

Parallelizing nested loop with OpenMP

I have a relatively simple loop where I'm calculating the net acceleration of a system of particles using a brute-force method.
I have a working OpenMP loop which loops over each particles and compares it to every other particles for an n^2 complexity here:
!$omp parallel do private(i) shared(bodyArray, n) default(none)
do i = 1, n
!acc is real(real64), dimension(3)
bodyArray(i)%acc = bodyArray(i)%calcNetAcc(i, bodyArray)
end do
which works just fine.
What I'm trying to do now is to reduce my calculation time by only computing the force on each body once using the fact that the force from F(a->b) = -F(b->a), reducing the number of interactions to calculate by half (n^2 / 2). Which I do in this loop:
call clearAcceleration(bodyArray) !zero out acceleration
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i, n
if ( i /= j .and. j > i) then
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end if
end do
end do
But I'm having a lot of difficulty with this parallelizing this loop, I keep getting junk results. I think it has to do with this line:
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
and that the forces are not being added up properly with all the different 'j' writing to it.
I've tried using the atomic statement, but that's not allowed on array variables. So then I tried critical, but that increases the time it takes by about 20, and still doesn't give correct results. I also tried adding an ordered statement, but then I just get NaN for all my results.
Is there an easy fix to get this loop working with OpenMP?
Working code, it has a slight speed improvement but not the ~2x I was looking for.
!$omp parallel do private(i, j) shared(bodyArray, forces, n) default(none) schedule(guided)
do i = 1, n
do j = 1, i-1
forces(j, i)%vec = bodyArray(i)%accTo(bodyArray(j))
forces(i, j)%vec = -forces(j, i)%vec
end do
end do
!$omp parallel do private(i, j) shared(bodyArray, n, forces) schedule(static)
do i = 1, n
do j = 1, n
bodyArray(i)%acc = bodyArray(i)%acc + forces(j, i)%vec
end do
end do
With your current approach and data structures you're going to struggle to get good speedup with OpenMP. Consider the loop nest
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i, n
if ( i /= j .and. j > i) then
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end if
end do
end do
[Actually, before you consider it, revise it as follows ...
!$omp parallel do private(i, j) shared(bodyArray, n) default(none)
do i = 1, n
do j = i+1, n
bodyArray(i)%acc = bodyArray(i)%acc + bodyArray(i)%accTo(bodyArray(j))
bodyArray(j)%acc = bodyArray(j)%acc - bodyArray(i)%acc
end do
end do
..., now back to the issues]
There are two problems here:
As you've already twigged, you've got a data race updating bodyArray(j)%acc; multiple threads will try to update the same element and there is no coordination of those updates. Junk results. Using critical sections or ordering the statements serialises the code; when you get it right you also get it as slow as it was before you started with OpenMP.
The pattern of access to elements of bodyArray is cache-unfriendly. It wouldn't surprise me to find that, even if you address the data race without serialising the computation, the impact of the cache-unfriendliness is to produce code slower than the original. Modern CPUs are crazy-fast in computation but the memory systems struggle to feed the beasts so cache effects can be massive. Trying to run two loops over the same rank-1 array simultaneously, which is in essence what your code does, is never (?) going to shift data through cache at maximum speed.
Personally I would try the following. I'm not going to guarantee that this will be faster, but it will be easier (I think) than fixing your current approach and fit OpenMP like a glove. I do have a nagging doubt that this is overcomplicating matters, but I haven't had a better idea yet.
First, create a 2D array of reals, call it forces, where element force(i,j) is the force that element i exerts on j. Then, some code like this (untested, that's your responsibility if you care to follow this line)
forces = 0.0 ! Parallelise this if you want to
!$omp parallel do private(i, j) shared(forces, n) default(none)
do i = 1, n
do j = 1, i-1
forces(i,j) = bodyArray(i)%accTo(bodyArray(j)) ! if I understand correctly
end do
end do
then sum the forces on each particle (and get the following right, I haven't checked carefully)
!$omp parallel do private(i) shared(bodyArray,forces, n) default(none)
do i = 1, n
bodyArray(i)%acc = sum(forces(:,i))
end do
As I wrote above, computation is extremely fast and if you have the memory to spare it's often well worth trading some space for time.
Now what you have is, probably, a problem with load balancing in the loop nest over forces. Most OpenMP implementations will, by default, perform a static distribution of work (this is not required by the standard but seems to be most common, check your documentation). So thread 1 will get the first n/num_threads rows to deal with, but these are the itty-bitty little rows at the top of the triangle you're computing. Thread 2 will get more work, thread 3 still more, and so forth. You might get away with simply adding a schedule(dynamic) clause to the parallel directive, you might have to work a bit harder to balance the load.
You may also want to review my code snippets wrt cache-friendliness and adjust as appropriate. And you may well find, if you do as I suggest, that you were better off with your original code, that halving the amount of computation doesn't actually save much time.
Another approach would be to pack the lower (or upper) triangle of forces into a rank-1 array and use some fancy indexing arithmetic to transform 2D (i,j) indices into a 1D index into that array. This would save storage space, and might be easier to make cache-friendly.