I want to write a simple N-body program in C++, and I am using OpenMP to speed up the computations. At some point, I have nested loops that look like this:
int N;
double* S = new double[N];
double* Weight = new double[N];
double* Coordinate = new double[N];
...
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
S[i] += K*Weight[j];
S[j] -= K*Weight[i];
}
}
The issue is that I do not get exactly the same result as when I remove the #pragma. I am guessing it has to do with the fact that the inner loop depends on the integer i, but I don't see how to get past that issue.
The problem is a data race on the updates of S[i] and S[j]. Different threads may read from and write to the same array element at the same time (the element that is S[j] in one thread's iteration is S[i] in another's). Each update must therefore be an atomic operation; add #pragma omp atomic to avoid the data race and ensure memory consistency:
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
#pragma omp atomic
S[i] += K*Weight[j];
#pragma omp atomic
S[j] -= K*Weight[i];
}
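As a sketch of how the fix can be checked, the following self-contained version runs the loop with and without OpenMP. The function names accumulate_serial, accumulate_atomic and nearly_equal, and the use of std::vector, are assumptions for illustration and do not come from the question. Because floating-point addition is not associative and the threads add the contributions in a different order, the two results should be expected to agree only up to rounding error, not bit for bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference: the loop from the question without any pragma.
std::vector<double> accumulate_serial(const std::vector<double>& coordinate,
                                      const std::vector<double>& weight)
{
    assert(coordinate.size() == weight.size());
    const int n = static_cast<int>(coordinate.size());
    std::vector<double> s(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            const double k = coordinate[i] - coordinate[j];
            s[i] += k * weight[j];
            s[j] -= k * weight[i];
        }
    }
    return s;
}

// Parallel version: both updates are atomic, because s[i] and s[j]
// can be touched by several threads at once.
std::vector<double> accumulate_atomic(const std::vector<double>& coordinate,
                                      const std::vector<double>& weight)
{
    assert(coordinate.size() == weight.size());
    const int n = static_cast<int>(coordinate.size());
    std::vector<double> s(n, 0.0);
    double* s_ptr = s.data();  // plain pointer keeps the atomic expressions simple
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            const double k = coordinate[i] - coordinate[j];
            #pragma omp atomic
            s_ptr[i] += k * weight[j];
            #pragma omp atomic
            s_ptr[j] -= k * weight[i];
        }
    }
    return s;
}

// True if a and b agree element-wise up to a relative tolerance.
bool nearly_equal(const std::vector<double>& a, const std::vector<double>& b,
                  double tol)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double scale = std::max({1.0, std::fabs(a[i]), std::fabs(b[i])});
        if (std::fabs(a[i] - b[i]) > tol * scale) return false;
    }
    return true;
}
```

Comparing accumulate_serial(c, w) with accumulate_atomic(c, w) through nearly_equal, rather than with ==, is the point of the sketch. If the atomics turn out to be a bottleneck, accumulating into per-thread copies of S and summing them afterwards is a common alternative.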
I have a vector of vector. I construct this vector in a parallel manner with each index in the vector being dealt with by a single thread. Something similar to this :
vector<vector<int> > global_vec(10, vector<int>({}));
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < 10; i++)
{
for(int j = 0; j < i * 5; j++)
{
global_vec[i].push_back(i);
}
}
I know if I had known the size of each vector beforehand, I could have allocated required size in the beginning and then there would have been no issue. But this can't be done by me and I need to dynamically push back. Is this thread safe?
Thanks in advance.
Yes, this is thread-safe, since each inner vector is modified by only one thread. You can omit the schedule(dynamic) clause and still be safe.
This becomes a bit clearer if you get rid of the inner loop using vector::assign, which fills the inner vector with i * 5 copies of i, just as the push_back loop does.
vector<vector<int> > global_vec(10, vector<int>({}));
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < 10; i++)
{
    // Each thread touches only its own inner vector
    global_vec[i].assign(i * 5, i);
}
P.S. If your outer vector has a fixed size, consider using a std::array<vector<int>, 10> instead.
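A minimal sketch of the pattern, assuming a hypothetical helper build_rows that is not in the question: the outer vector gets its final size before the parallel loop and is never resized inside it, so the only concurrent modifications are push_back calls on distinct inner vectors.

```cpp
#include <cassert>
#include <vector>

// Builds n inner vectors in parallel; row i receives i * 5 copies of i.
std::vector<std::vector<int>> build_rows(int n)
{
    assert(n >= 0);
    // Sized up front: the outer vector itself is not modified in the loop.
    std::vector<std::vector<int>> rows(n);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        // Only the thread that owns iteration i ever touches rows[i].
        for (int j = 0; j < i * 5; ++j) {
            rows[i].push_back(i);
        }
    }
    return rows;
}
```

What would not be safe is calling push_back or resize on the outer vector from inside the loop, because that may reallocate the storage that the other threads are using.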
I need to iterate over an array and assign each element according to a calculation that requires some iteration itself. Removing all unnecessary details the program boils down to something like this.
float output[n];
const float input[n] = ...;
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
some_calculation does not alter its arguments, nor does it have internal state, so it's thread-safe. Looking at the loops, I understand that the outer loop is thread-safe, because different iterations write to different memory locations (different output[i]) and the shared elements of input are never altered while the loop runs. The inner loop is not thread-safe, because every iteration modifies output[i], which would be a race condition.
Consequently, I'd like to spawn threads and get them working for different values of i but the whole iteration over input should be local to each thread so as not to introduce a race condition on output[i]. I think the following achieves this.
std::array<float, n> output;
const std::array<float, n> input = ...;
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
I'm not sure how this handles the inner loop. Threads working on different values of i should be able to run the inner loop in parallel, but I don't understand whether I'm allowing that without another #pragma omp directive. On the other hand, I don't want to accidentally allow threads to run for different values of j over the same i, because that introduces a race condition. I'm also not sure if I need some extra specification on how the two arrays should be handled.
Lastly, if this code is in a function that is going to be called repeatedly, does it need the parallel directive, or can that be issued once before my main loop begins, like so?
void iterative_step(const std::array<float, n> &input, std::array<float, n> &output) {
    // Threads have already been spawned
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        output[i] = 0.0f;
        for (int j = 0; j < n; ++j) {
            output[i] += some_calculation(i, input[j]);
        }
    }
}
int main() {
...
// spawn threads once, but not for this loop
#pragma omp parallel
while (...) {
iterative_step(input, output);
}
...
}
I looked through various other questions but they were about different problems with different race conditions and I'm confused as to how to generalize the answers.
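You don't want the omp parallel in main. As written, every thread of the team executes the while loop, and the omp for inside iterative_step merely divides the i iterations among the threads that are already running; an omp for does not create threads by itself. It is simpler to drop the parallel region in main and put #pragma omp parallel for on the for (int i loop, so that threads are created or reused only for that loop. For any particular value of i, the j loop will run entirely on one thread.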
One other thing that would help a little is to compute your output[i] result into a local variable, then store that into output[i] once you're done with the j loop.
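A sketch of that suggestion, under stated assumptions: some_calculation below is a hypothetical stand-in for the function in the question, and std::vector replaces std::array so that the size can be chosen at run time.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the question's some_calculation:
// no internal state, does not modify its arguments.
float some_calculation(int i, float x)
{
    return static_cast<float>(i) * x;
}

// Fills output[i] with the sum over j of some_calculation(i, input[j]).
void iterative_step(const std::vector<float>& input, std::vector<float>& output)
{
    assert(input.size() == output.size());
    const int n = static_cast<int>(input.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Declared inside the loop body, so each iteration has its own copy.
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) {
            sum += some_calculation(i, input[j]);
        }
        output[i] = sum;  // a single write to shared memory per i
    }
}
```

Besides making it obvious that nothing is shared inside the j loop, the local variable lets the compiler keep the running sum in a register instead of writing to output[i] on every inner iteration.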
Hi, I am new to C++. I wrote code that runs, but it is slow because of many nested for loops, and I want to speed it up with OpenMP. Can anyone guide me? I tried putting '#pragma omp parallel' before the ip loop and, inside that loop, '#pragma omp parallel for' before the it loop, but it does not work.
#pragma omp parallel
for(int ip=0; ip !=nparticle; ip++){
    inf14>>r>>xp>>yp>>zp;
    zp/=sqrt(gamma2);
    counter++;
    double para[7]={0,0,Vz,x0-xp,y0-yp,z0-zp,0};
    if(ip>=0 && ip<=43){
        #pragma omp parallel for
        for(int it=0;it<NT;it++){
            para[6]=PosT[it];
            for(int ix=0;ix<NumX;ix++){
                para[3]=PosX[ix]-xp;
                for(int iy=0;iy<NumY;iy++){
                    para[4]=PosY[iy]-yp;
                    for(int iz=0;iz<NumZ;iz++){
                        para[5]=PosZ[iz]-zp;
                        int position=it*NumX*NumY*NumZ+ix*NumY*NumZ+iy*NumZ+iz;
                        rotation(para,&Field[3*position]);
                        MagX[position] +=chg*Field[3*position];
                        MagY[position] +=chg*Field[3*position+1];
                        MagZ[position] +=chg*Field[3*position+2];
                    }
                }
            }
        }
    }
}
My rotation function also contains an unbounded integration loop, shown below:
for(int i=1;;i++){
gsl_integration_qag(&F, 10*i, 10*i+10, 1.0e-8, 1.0e-8, 100, 2, w, &temp, &error);
result+=temp;
if(abs(temp/result)<ACCURACY){
break;
}
}
I am using the GSL libraries as well. How can I speed up this process, or how should I apply OpenMP here?
If you don't have inter-loop dependences, you can use the collapse clause to parallelize multiple loops together. Example:
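// A and B are N x M matrices stored row by row in flat arrays
void scale( int N, int M, float* A, const float* B, float alpha ) {
    // "parallel for" creates the team; a bare "omp for" only shares work
    // among the threads of an enclosing parallel region
    #pragma omp parallel for collapse(2)
    for( int i = 0; i < N; i++ ) {
        for( int j = 0; j < M; j++ ) {
            A[i * M + j] = alpha * B[i * M + j];
        }
    }
}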
I suggest you check out the OpenMP C/C++ cheat sheet (PDF), which contains all the specifications for loop parallelization.
Do not put parallel pragmas inside another parallel region. You might oversubscribe the machine by creating more threads than it can handle. I would parallelize the outer loop (if it is big enough):
// OpenMP versions before 5.0 require a relational test (<, <=, >, >=) here, not !=
#pragma omp parallel for
for(int ip=0; ip<nparticle; ip++)
Also make sure you do not have any race conditions between threads (e.g. read-after-write, RAW, hazards on variables that several threads share).
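One concrete race in the code from the question is the scratch array para, which every thread of the inner parallel loop writes to. A minimal sketch of one way around it, under loud assumptions: field_value is a hypothetical stand-in for rotation, the time dimension is left out, and the signature of accumulate_field is invented for illustration. The scratch array is declared inside the loop body, so each iteration gets its own copy, and the grid loops are collapsed into one parallel iteration space.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for rotation(): a pure function of its parameters.
double field_value(const double para[3])
{
    return para[0] + 10.0 * para[1] + 100.0 * para[2];
}

// Adds chg * field_value(...) to every grid point for one particle at (xp, yp, zp).
void accumulate_field(const std::vector<double>& pos_x,
                      const std::vector<double>& pos_y,
                      const std::vector<double>& pos_z,
                      double xp, double yp, double zp, double chg,
                      std::vector<double>& mag)
{
    const int nx = static_cast<int>(pos_x.size());
    const int ny = static_cast<int>(pos_y.size());
    const int nz = static_cast<int>(pos_z.size());
    assert(mag.size() == static_cast<std::size_t>(nx) * ny * nz);

    #pragma omp parallel for collapse(3)
    for (int ix = 0; ix < nx; ++ix) {
        for (int iy = 0; iy < ny; ++iy) {
            for (int iz = 0; iz < nz; ++iz) {
                // Local to this iteration: no other thread can see it.
                const double para[3] = {pos_x[ix] - xp, pos_y[iy] - yp, pos_z[iz] - zp};
                const int position = (ix * ny + iy) * nz + iz;
                // Each position is written by exactly one iteration.
                mag[position] += chg * field_value(para);
            }
        }
    }
}
```

In this sketch the loop over particles would stay serial and call accumulate_field once per particle, which also keeps the file reads and the counter from the question out of the parallel part. Whether the real rotation function can be called from several threads at once is not something the question shows, so that would need checking separately.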
Advice: if you do not get a great speed-up, a good practice is iterating by chunks and not only by one increment. For instance:
int num_threads = 1;
#pragma omp parallel
{
    #pragma omp single
    {
        num_threads = omp_get_num_threads();
    }
}
int chunkSize = 20; // Define your own chunk size here
for (int position = 0; position < total; position += chunkSize * num_threads) {
    // Clamp so the last block does not run past total (std::min is in <algorithm>)
    int endOfChunk = std::min(position + chunkSize * num_threads, total);
    #pragma omp parallel for
    for (int ip = position; ip < endOfChunk; ip += chunkSize) {
        // Code: process elements ip up to min(ip + chunkSize, endOfChunk) - 1
    }
}
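The schedule clause can hand out iterations in fixed-size chunks without the manual outer loop. A sketch, assuming a hypothetical function scale_in_chunks whose body (doubling each element) merely stands in for the real work:

```cpp
#include <cassert>
#include <vector>

// Doubles every element; iterations are dealt to the threads in
// round-robin blocks of chunk_size consecutive indices.
void scale_in_chunks(std::vector<double>& data, int chunk_size)
{
    assert(chunk_size > 0);
    const int total = static_cast<int>(data.size());
    double* p = data.data();
    #pragma omp parallel for schedule(static, chunk_size)
    for (int ip = 0; ip < total; ++ip) {
        p[ip] *= 2.0;
    }
}
```

With schedule(static, chunk_size) the runtime takes care of the final partial chunk, so there is no end-of-range arithmetic to get wrong, and the parallel region is entered once instead of once per block.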
I have a school task about parallel programming and I'm having a lot of problems with it.
My task is to create a parallel version of the given matrix multiplication code and test its performance (and yes, it has to be in KIJ order):
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
This is what I came up with so far:
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
#pragma omp parallel
{
#pragma omp for schedule(static, 16)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
And that's where I found something confusing. This parallel version of the code runs around 50% slower than the non-parallel one. The difference in speed varies only a little with the matrix size (tested so far with SIZE = 128, 256, 512, 1024, 2048 and with various schedule settings: dynamic, static, no schedule clause at all, etc.).
Can someone help me understand what am I doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster using openMP?
EDIT:
I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is an Intel Core i5-2520M @ 2.50 GHz (yes, yes, it's a laptop, but I'm getting the same results on my home i7 PC).
I'm using global arrays:
float matrix_a[SIZE][SIZE];
float matrix_b[SIZE][SIZE];
float matrix_r[SIZE][SIZE];
I'm assigning random (float) values to matrix a and b, matrix r is filled with 0s.
I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048 etc.). For some of them it is intended NOT to fit in cache.
My current version of code looks like this:
void multiply_matrices_KIJ()
{
#pragma omp parallel
{
for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}
}
And just to be clear, I know that a different loop order could give better results, but that is the thing: I HAVE TO use KIJ order. My task is to run the KIJ for loops in parallel and check the performance increase. My problem is that I expect(ed) at least somewhat faster execution than what I'm getting now (5-10% faster at most), even though it's the I loop that runs in parallel (I can't do that with the K loop, because I would get an incorrect result, since every k writes to matrix_r[i][j]).
These are the results I'm getting when using the code shown above (I'm doing calculations hundreds of times and getting the average time):
SIZE = 128
Serial version : 0,000608s
Parallel I, schedule(dynamic, 16): 0,000683s
Parallel I, schedule(static, 16): 0,000647s
Parallel J, no schedule: 0,001978s (this is where I expected way slower execution)
SIZE = 256
Serial version: 0,005787s
Parallel I, schedule(dynamic, 16): 0,005125s
Parallel I, schedule(static, 16): 0,004938s
Parallel J, no schedule: 0,013916s
SIZE = 1024
Serial version: 0,930250s
Parallel I, schedule(dynamic, 16): 0,865750s
Parallel I, schedule(static, 16): 0,823750s
Parallel J, no schedule: 1,137000s
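The question does not show how these times were taken. As a sketch of one way to average wall-clock time over many runs, assuming a hypothetical helper average_seconds (not from the original code): wall-clock time is what matters for parallel speed-up, whereas CPU-time counters may add up the time spent by all threads.

```cpp
#include <cassert>
#include <chrono>

// Calls work() `repeats` times and returns the mean wall-clock
// time per call in seconds.
template <class Work>
double average_seconds(Work&& work, int repeats)
{
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic wall clock
    const auto start = clock::now();
    for (int r = 0; r < repeats; ++r) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

A call such as average_seconds(multiply_matrices_KIJ, 100) would time the function from the question; matrix_r would have to be reset between runs if the results themselves are going to be compared.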
Note: This answer is not about how to get the best performance out of your loop order or how to parallelize it, because I consider that order suboptimal for several reasons. I'll try to give some advice on how to improve the order (and parallelize it) instead.
Loop order
OpenMP is usually used to distribute work over several CPUs. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.
You want to execute the outermost loop in parallel instead of the second one. Therefore, you'll want to have one of the r_matrix indices as outer loop index in order to avoid race conditions when writing to the result matrix.
The next thing is that you want to traverse the matrices in memory storage order (the fastest-changing index should be the second subscript, not the first).
You can achieve both with the following loop/index order:
for i = 0 to a_rows
    for k = 0 to a_cols
        for j = 0 to b_cols
            r[i][j] += a[i][k]*b[k][j]
Where
j changes faster than i or k and k changes faster than i.
i is a result matrix subscript and the i loop can run parallel
Rearranging your multiply_matrices_KIJ in that way gives quite a bit of a performance boost already.
I did some short tests and the code I used to compare the timings is:
template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows,
std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows,
std::size_t const b_cols, T * const matrix_r)
{
for (std::size_t k = 0; k < a_cols; k++)
{
for (std::size_t i = 0; i < a_rows; i++)
{
for (std::size_t j = 0; j < b_cols; j++)
{
matrix_r[i*b_cols + j] +=
matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
}
}
}
}
mimicking your multiply_matrices_KIJ() function, versus
template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
for (std::size_t i = 0; i < a_rows; ++i)
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < a_cols; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
}
implementing the above mentioned order.
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k
mm_kij(): 6.16706s.
mm_opt(): 2.6567s.
The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:
template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
#pragma omp parallel
{
auto ar = static_cast<std::ptrdiff_t>(a_rows);
#pragma omp for schedule(static) nowait
for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
for (std::size_t i = 0; i < a_rows; ++i)
#endif
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < b_rows; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
#if defined(_OPENMP)
}
#endif
}
Where each thread writes to an individual result row
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k (4 OMP threads)
mm_kij(): 6.16706s.
mm_opt(): 2.6567s.
mm_opt_par(): 0.968325s.
Not perfect scaling, but as a start it is already faster than the serial code.
OpenMP implementations create a thread pool (a thread pool is not mandated by the OpenMP standard, but every implementation of OpenMP I have seen does this) so that threads don't have to be created and destroyed each time a parallel region is entered. Nevertheless, there is a barrier at the end of each parallel region, so all threads have to synchronize. There is probably some additional overhead in the fork-join model between parallel regions, so even though the threads don't have to be recreated, they still have to be initialized between parallel regions. More details can be found here.
In order to avoid the overhead between entering parallel regions I suggest creating the parallel region on the outermost loop but doing the work sharing on the inner loop over i like this:
void multiply_matrices_KIJ() {
#pragma omp parallel
for (int k = 0; k < SIZE; k++)
#pragma omp for schedule(static) nowait
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
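There is an implicit barrier at the end of #pragma omp for. The nowait clause removes that barrier. This is safe here because schedule(static) gives each thread the same rows i for every k, so no two threads ever update the same row; with schedule(dynamic), nowait could let two threads work on the same row for different k at the same time.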
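A sketch for checking this arrangement against the serial loop, assuming flat row-major std::vector storage and the hypothetical names multiply_kij_serial and multiply_kij_parallel (the question uses global 2D arrays instead). Since each element of r is still updated in increasing k order by a single thread, the two results should match exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial KIJ product of two n x n row-major matrices; r must start zeroed.
void multiply_kij_serial(const std::vector<float>& a, const std::vector<float>& b,
                         std::vector<float>& r, int n)
{
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                r[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Same loop order: one parallel region around the k loop,
// work sharing over i inside it.
void multiply_kij_parallel(const std::vector<float>& a, const std::vector<float>& b,
                           std::vector<float>& r, int n)
{
    assert(r.size() == static_cast<std::size_t>(n) * n);
    #pragma omp parallel
    for (int k = 0; k < n; ++k) {  // every thread runs the whole k loop
        // Static schedule: a given row i goes to the same thread for every k,
        // which is what makes dropping the barrier with nowait safe.
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                r[i * n + j] += a[i * n + k] * b[k * n + j];
    }
}
```

Running both on the same inputs and comparing the outputs with == is a quick way to catch a schedule or barrier mistake before looking at timings.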
Also make sure you compile with optimization enabled. There is little point in comparing performance without it. I would use -O3 (with Visual Studio, /O2 in a Release build).
Always keep in mind that for caching purposes, the best ordering of your loops is from the slowest-changing index to the fastest-changing one. In your case, that means IKJ order. I would be quite surprised if your serial code were not automatically reordered from KIJ to IKJ by your compiler (assuming you have "-O3"). However, the compiler cannot do this with your parallel loop, because that would break the logic you are declaring within your parallel region.
If you really truly cannot reorder your loops, then your best bet would probably be to rewrite the parallel region to encompass the largest possible loop. If you have OpenMP 4.0, you could also consider utilizing SIMD vectorization across your fastest dimension as well. However, I am still doubtful you will be able to beat your serial code by much because of the aforementioned caching issues inherent in your KIJ ordering...
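void multiply_matrices_KIJ()
{
    // The parallel region spans the whole k loop, but the work is shared
    // over i: sharing it over k would make threads race on matrix_r[i][j].
    #pragma omp parallel
    for (int k = 0; k < SIZE; k++)
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < SIZE; i++)
            #pragma omp simd
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}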