There are two consecutive loops, and the second loop has a reduction clause.
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < N; ++i)
{
}
#pragma omp barrier
#pragma omp for reduction(+ \
: sumj)
for (size_t i = 0; i < N; ++i)
{
sumj = 0.0;
for (size_t j = 0; j < adjList[i].size(); ++j)
{
sumj += 0;
}
Jac[i, i] = sumj;
}
}
To reduce the overhead of creating threads, I want to keep the threads alive and reuse them in the second loop, but I get the following error:
lib.cpp:131:17: error: reduction variable ‘sumj’ is private in outer context
#pragma omp for reduction(+ \
^~~
How can I fix that?
I'm not sure what you are trying to do, but it seems that something like this would do what you expect:
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < N; ++i)
{
}
#pragma omp barrier
#pragma omp for
for (size_t i = 0; i < N; ++i)
{
double sumj = 0.0;
for (size_t j = 0; j < adjList[i].size(); ++j)
{
sumj += 0;
}
Jac[i, i] = sumj;
}
}
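A reduction would only be useful if the "omp for" were applied to the inner loop. Here each iteration of the outer loop computes its own sumj, so a variable declared inside the loop body is enough.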
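As a minimal sketch of that idea: the function below keeps one parallel region around two worksharing loops and uses a loop-local accumulator instead of a reduction clause. The name rowSums and the vector diag (standing in for the diagonal of Jac) are assumptions made for illustration, as is the choice to store real weights in adjList. The loop index is signed because older OpenMP versions require that.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's data: adjList[i] holds the
// values of row i, and diag[i] plays the role of Jac(i, i).
std::vector<double> rowSums(const std::vector<std::vector<double>>& adjList)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(adjList.size());
    std::vector<double> diag(adjList.size(), -1.0);

    #pragma omp parallel
    {
        // First worksharing loop: the existing threads split the iterations.
        #pragma omp for
        for (std::ptrdiff_t i = 0; i < n; ++i)
        {
            diag[i] = 0.0;
        }
        // "omp for" ends with an implicit barrier, so no explicit one is needed.

        // Second worksharing loop: reuses the same team of threads.
        #pragma omp for
        for (std::ptrdiff_t i = 0; i < n; ++i)
        {
            double sumj = 0.0; // declared inside the loop body, hence private
            for (std::size_t j = 0; j < adjList[i].size(); ++j)
            {
                sumj += adjList[i][j];
            }
            diag[i] = sumj; // each iteration writes only its own element
        }
    }

    assert(diag.size() == adjList.size());
    return diag;
}
```

Because every iteration owns its accumulator and its output element, no thread ever combines values with another thread, which is the situation a reduction clause is meant for.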
Related
I've just started studying parallel programming with OpenMP, and there is a subtle point about nested loops that I'd like to ask about. I wrote a simple matrix multiplication code and checked that the result is correct. However, there are several ways to parallelize this for loop, and they may differ in their low-level details.
At first, I wrote the code below, which multiplies two matrices A and B and assigns the result to C.
for(i = 0; i < N; i++)
{
for(j = 0; j < N; j++)
{
sum = 0;
#pragma omp parallel for reduction(+:sum)
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
It works, but it takes a really long time. I found out that, because of where the parallel directive is placed, it constructs the parallel region N^2 times. I noticed this through a huge increase in user time when I ran it under the Linux time command.
Next, I tried the code below, which also worked.
#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
for(j = 0; j < N; j++)
{
sum = 0;
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
The elapsed time decreased from 72.720s in sequential execution to 5.782s in parallel execution with the code above. That is a reasonable result, because I ran it on 16 cores.
But I cannot easily picture the flow of the second code. As I understand it, if we privatize all loop variables, the program treats the nested loop as one large loop of size N^3. This can be checked by executing the code below.
#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
for(j = 0; j < N; j++)
{
for(k = 0; k < N; k++)
{
printf("%d, %d, %d\n", i, j, k);
}
}
}
The printf was executed N^3 times.
But in my second matrix multiplication code, sum is used right before and after the innermost loop, and that makes it hard for me to unfold the loop in my mind. The third code, by contrast, is easy to unfold.
To summarize, I want to know what really happens behind the scenes in my second matrix multiplication code, especially how the value of sum changes. I would also be grateful for recommendations of tools for observing the flow of a multithreaded program written with OpenMP.
omp for by default only applies to the loop that directly follows it. The inner loops are not affected at all. This means you can think about your second version like this:
// Example for two threads
with one thread execute
{
// declare private variables "locally"
int i, j, k, sum;
for(i = 0; i < N / 2; i++) // loop range changed
{
for(j = 0; j < N; j++)
{
sum = 0;
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
}
with the other thread execute
{
// declare private variables "locally"
int i, j, k, sum;
for(i = N / 2; i < N; i++) // loop range changed
{
for(j = 0; j < N; j++)
{
sum = 0;
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
}
You can simplify all reasoning about variables with OpenMP by declaring them as locally as possible, i.e. instead of the explicit private declaration use:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
{
int sum = 0;
for(int k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
This way you can see the private scope of each variable more easily.
In some cases it can be beneficial to apply parallelism to multiple loops.
This is done by using collapse, i.e.
#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
You can imagine this works with a transformation like:
#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
int i = ij / N;
int j = ij % N;
A collapse(3) would not work for this loop because of the sum = 0 in-between.
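A minimal sketch of the collapse(2) variant for the matrix multiplication discussed above. The Matrix alias and the function name multiplyCollapsed are assumptions for illustration. The i and j loops are perfectly nested, so they can be collapsed, while sum and the k loop stay inside the body of each (i, j) iteration. The collapse clause needs OpenMP 3.0 or later.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>; // hypothetical square matrix type

Matrix multiplyCollapsed(const Matrix& A, const Matrix& B)
{
    assert(A.size() == B.size());
    const int n = static_cast<int>(A.size());
    Matrix C(n, std::vector<int>(n, 0));

    // The n * n (i, j) pairs are distributed among the threads.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            int sum = 0; // local to one (i, j) iteration, hence private
            for (int k = 0; k < n; k++)
            {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum; // each iteration writes a distinct element
        }
    }
    return C;
}
```

Collapsing mainly helps when the outer loop alone has too few iterations to keep all threads busy; for large N the plain outer-loop version usually distributes the work just as well.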
Now one more detail:
#pragma omp parallel for
is a shorthand for
#pragma omp parallel
#pragma omp for
The first creates the threads - the second shares the work of a loop among all threads reaching this point. This may not be of importance for the understanding now, but there are use-cases for which it matters. For instance you could write:
#pragma omp parallel
for(int i = 0; i < N; i++)
{
#pragma omp for
for(int j = 0; j < N; j++)
{
I hope this sheds some light on what happens there from a logical point of view.
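To make the split form concrete, here is a small sketch in which every thread runs the whole outer loop and only the inner loop is shared. The function name fillTable and the values written are assumptions chosen so that the result is easy to check.

```cpp
#include <cassert>
#include <vector>

std::vector<std::vector<int>> fillTable(int n)
{
    assert(n >= 0);
    std::vector<std::vector<int>> table(n, std::vector<int>(n, 0));

    // "parallel" creates the team; the outer loop is NOT workshared,
    // so every thread executes all n iterations of i.
    #pragma omp parallel
    for (int i = 0; i < n; i++)
    {
        // "for" splits the j iterations among the threads of the team.
        #pragma omp for
        for (int j = 0; j < n; j++)
        {
            table[i][j] = i * n + j;
        }
        // Implicit barrier here: the team finishes row i before any
        // thread starts on row i + 1.
    }
    return table;
}
```

The threads are created once, but they synchronize n times, once at the end of each inner loop.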
I am new to using OpenMP 2.0 with MSVC++ 2017. I'm working with a big data structure (referred to as bigMap), so I need to distribute the workload as well as possible when iterating over it. My attempt at doing so is:
std::map<int, std::set<std::pair<double, double>>> bigMap;
///thousands of values are added here
int k;
int max_threads = omp_get_max_threads();
omp_set_num_threads(max_threads);
#pragma omp parallel default(none) private(k)
{
#pragma omp for
for(k = kMax; k > kMin; k--)
{
for (auto& myPair : bigMap[k])
{
int pthread = omp_get_thread_num();
std::cout << "Thread " << pthread << std::endl;
for (auto& item : myPair)
{
#pragma omp critical
myMap[k-1].insert(std::make_pair(item, 0));
}
}
}
}
The output for "pthread" is always "0" and the execution time is the same as for single-thread (so I assume no new threads are being created).
Why doesn't this code work, and which OMP directives / clauses / sections are wrong?
UPDATE:
OMP is now working, but the code below is not working as expected:
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < map_size; ++i) {
#pragma omp critical
bigMap[i] = std::set<int>();
}
bigMap[1] = { 10, 100, 1000 };
int i;
#pragma omp parallel for schedule(static) num_threads(8)
for (i = thread_num; i < map_size; i += thread_count)
{
for (auto it = bigMap[i].begin(); it != bigMap[i].end(); ++it)
{
int elem = *it;
bigMap[i + 1].insert(elem);
}
}
I expect the 3 elements from bigMap[1] to be inserted into all entries of bigMap. Instead, they're inserted only once, into bigMap[2]. Why?
There is a small bug:
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < map_size; ++i) {
#pragma omp critical
bigMap[i] = std::set<int>();
}
bigMap[1] = { 10, 100, 1000 };
int i;
#pragma omp parallel for schedule(static) num_threads(8)
for (i = thread_num; i < map_size; i += thread_count)
{
// here you loop over bigMap[i], which is empty except for i == 1
//for (auto it = bigMap[i].begin(); it != bigMap[i].end(); ++it)
for (auto it = bigMap[1].begin(); it != bigMap[1].end(); ++it)
{
int elem = *it;
bigMap[i + 1].insert(elem);
}
}
Maybe you misunderstand what schedule(static) means: it only decides which thread runs which iterations.
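A minimal sketch of the behaviour the question seems to expect, under some assumptions: a std::vector of sets (named buckets here) replaces the std::map, and the values to copy live in a separate read-only set called seed. With this layout every iteration writes only its own bucket and reads only shared constant data, so no critical section is needed and the result does not depend on how the schedule assigns iterations to threads.

```cpp
#include <cassert>
#include <set>
#include <vector>

std::vector<std::set<int>> spreadSeed(int bucket_count, const std::set<int>& seed)
{
    assert(bucket_count >= 0);
    std::vector<std::set<int>> buckets(bucket_count);

    // schedule(static) hands each thread a fixed block of iterations;
    // it does not let one iteration see what another one produced.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < bucket_count; ++i)
    {
        // Read-only access to seed, write access only to buckets[i].
        buckets[i].insert(seed.begin(), seed.end());
    }
    return buckets;
}
```

Copying from bigMap[i] into bigMap[i + 1], as in the question, makes iteration i + 1 depend on iteration i, and that kind of chain cannot be run in parallel as written.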
I'm trying to use OpenMP to run the code below, but I get a segmentation fault:
void modKeyGenPrs(mat_GF2E *&Prs, mat_GF2E Lst[], mat_GF2E L1, mat_GF2E L2) {
Prs = new mat_GF2E[m];
mat_GF2E L1_trans = transpose(L1);
#pragma omp parallel shared(L1_trans,L2,Lst,Prs,L1)
{
#pragma omp for
for (int i = 0; i < m; i++) {
(Prs[i]).SetDims(n, n);
for (int j = 0; j < m; j++) {
Prs[i] = Prs[i] + (L2[i][j] * (L1_trans * (Lst[i]) * L1));
}
}
}
}
What is wrong with my OpenMP code? It always uses only 1 thread and takes the same time as the non-parallel version.
template <typename T>
Matrix<T>* Matrix<T>::OMPMultiplication(Matrix<T>* A, Matrix<T>* B){
if(A->ySize != B->xSize)
throw;
Matrix<T>* C = new Matrix<T>(A->xSize, B->ySize);
sizeType i, j, k;
T element;
#pragma omp parallel for private(i, j)
{
#pragma omp for private(i, j)
for( i = 0; i < A->xSize; i++ )
cout<<"There are "<<omp_get_num_threads()<<" threads"<<endl;
for(j = 0; j < B->ySize; j++){
C->matrix[i][j] = 0;
for(k = 0; k < A->ySize; k++){
C->matrix[i][j] += A->matrix[i][k] * B->matrix[k][j];
}
}
}
return C;
}
First of all, you are missing the {} around the body of the i loop, and the variable k needs to be private as well. I also think you have mixed up how the parallel and for pragmas are combined. A parallel pragma opens a region that creates the threads, and a for pragma inside that region shares the iterations of the loop among them. To do this you could either change your code into
#pragma omp parallel private(i, j, k)
{
#pragma omp for
for( i = 0; i < A->xSize; i++ ) {
cout<<"There are "<<omp_get_num_threads()<<" threads"<<endl;
for(j = 0; j < B->ySize; j++) {
C->matrix[i][j] = 0;
for(k = 0; k < A->ySize; k++){
C->matrix[i][j] += A->matrix[i][k] * B->matrix[k][j];
}
}
}
}
or make use of the combined parallel for notation
#pragma omp parallel for private(i, j, k)
for( i = 0; i < A->xSize; i++ ) {
...
}
Also, make sure you are telling OpenMP to use more than 1 thread here. This can be done both with omp_set_num_threads(<number of threads here>) and by setting environment variables like OMP_NUM_THREADS.
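A small sketch for checking how many threads a parallel region really gets. The helper name teamSize is an assumption. Note that omp_get_num_threads() returns 1 when it is called outside a parallel region, so the query has to happen inside one. The _OPENMP guard lets the code build even when OpenMP is not enabled in the compiler.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads in a parallel region after requesting
// `requested` threads; returns 1 when OpenMP is not enabled.
int teamSize(int requested)
{
    assert(requested > 0);
    int observed = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    #pragma omp parallel
    {
        // One thread records the team size for everybody.
        #pragma omp single
        observed = omp_get_num_threads();
    }
#endif
    return observed;
}
```

If this returns 1 even though more threads were requested, a likely cause is that OpenMP support is not switched on in the compiler (for example -fopenmp for GCC and Clang, /openmp for MSVC).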
Hope you get it parallelized. :)
I get a slightly faster result on my 4 cores using this code:
omp_set_num_threads(4);
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
c[i] += b[j] * a[j][i];
}
}
Full program
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
int main() {
int i, j, n, a[719][719], b[719], c[719];
clock_t start = clock();
n = 100; //Max 719
printf("Matrix A\n");
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
a[i][j] = 10;
printf("%d ", a[i][j]);
}
printf("\n");
}
printf("\nMatrix B\n");
#pragma omp parallel private(i) shared(b)
{
#pragma omp for
for (i = 0; i < n; ++i) {
b[i] = 5;
printf("%d\n", b[i]);
}
}
printf("\nA * B\n");
#pragma omp parallel private(i) shared(c)
{
#pragma omp for
for (i = 0; i < n; ++i) {
c[i] = 0;
}
}
#pragma omp parallel private(i,j) shared(n,a,b,c)
{
#pragma omp for schedule(dynamic)
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
c[i] += b[j] * a[j][i];
}
}
}
#pragma omp parallel private(i) shared(c)
{
#pragma omp for
for (i = 0; i < n; ++i) {
printf("%d\n", c[i]);
}
}
clock_t stop = clock();
double elapsed = (double) (stop - start) / CLOCKS_PER_SEC;
printf("\nTime elapsed: %.5f\n", elapsed);
start = clock();
printf("Matrix A\n");
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
a[i][j] = 10;
printf("%d ", a[i][j]);
}
printf("\n");
}
printf("\nMatrix B\n");
#pragma omp parallel private(i) shared(b)
{
#pragma omp for
for (i = 0; i < n; ++i) {
b[i] = 5;
printf("%d\n", b[i]);
}
}
printf("\nA * B\n");
omp_set_num_threads(4);
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
c[i] += b[j] * a[j][i];
}
}
stop = clock();
elapsed = (double) (stop - start) / CLOCKS_PER_SEC;
printf("\nTime elapsed: %.5f\n", elapsed);
return 0;
}
First method takes
Time elapsed: 0.03442
Second method
Time elapsed: 0.02630
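One caveat about these numbers: clock() reports processor time, which on many systems is added up over all threads, so it can hide or even invert a parallel speedup. Below is a sketch of a wall-clock measurement with std::chrono::steady_clock. The function names matVec and secondsFor are assumptions, and the product is the same c[i] += b[j] * a[j][i] as above, written with a local accumulator.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

std::vector<int> matVec(const std::vector<std::vector<int>>& a,
                        const std::vector<int>& b)
{
    assert(a.size() == b.size());
    const int n = static_cast<int>(b.size());
    std::vector<int> c(n, 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        int sum = 0; // private accumulator for element i
        for (int j = 0; j < n; j++)
        {
            sum += b[j] * a[j][i];
        }
        c[i] = sum;
    }
    return c;
}

// Runs matVec once, stores the result in `out`, and returns the
// elapsed wall-clock time in seconds.
double secondsFor(const std::vector<std::vector<int>>& a,
                  const std::vector<int>& b,
                  std::vector<int>& out)
{
    const auto start = std::chrono::steady_clock::now();
    out = matVec(a, b);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With n = 100 the whole product takes very little time, so printing and thread start-up probably dominate both measurements; larger sizes and repeated runs would give a fairer comparison.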
I have this C++ code.
The loop goes through the matrix, finds the minimum element in each row, and subtracts it from each element of that row.
The variable myr is the sum of all the row minima.
I am trying to parallelize it with parallel for:
int min = 0;
int myr = 0;
int temp[SIZE][SIZE];
int size = 0;
...//some initialization
omp_set_num_threads(1);
start_time = omp_get_wtime();
#ifdef _OPENMP
#pragma omp parallel for firstprivate(min, size) reduction(+:myr)
#endif
for(int i = 0; i < size; i++){
min = INFINITY;
for(int j = 0; j < size; j++){
if (temp[i][j] < min)
min = temp[i][j];
}
myr+=min;
for(int j = 0; j < size; j++)
temp[i][j]-=min;
}
end_time = omp_get_wtime();
If I set omp_set_num_threads(2); this part of the code runs slower.
My processor has 2 cores.
Why does the code run slower with 2 threads?
There must be some aliasing or something going on. Make things simpler for OpenMP:
int const size0 = size;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:myr)
#endif
for(int i = 0; i < size0; i++){
int min = INFINITY;
int * tmp = temp[i];
for(int j = 0; j < size0; j++){
if (tmp[j] < min)
min = tmp[j];
}
for(int j = 0; j < size0; j++)
tmp[j]-=min;
myr+=min;
}
That is, make most of the variables local, and const where you can.
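The same idea as a self-contained sketch. The function name subtractRowMinima and the use of std::vector instead of the fixed-size temp array are assumptions. The row minimum comes from std::min_element rather than from a variable initialized with INFINITY, because converting a floating-point infinity to int is not well defined.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Subtracts each row's minimum from that row and returns the sum of
// all the minima (the question's myr).
long long subtractRowMinima(std::vector<std::vector<int>>& m)
{
    const int rows = static_cast<int>(m.size());
    long long total = 0;

    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < rows; i++)
    {
        std::vector<int>& row = m[i]; // each iteration touches one row only
        assert(!row.empty());
        const int rowMin = *std::min_element(row.begin(), row.end());
        for (int& v : row)
        {
            v -= rowMin;
        }
        total += rowMin; // combined across threads by the reduction
    }
    return total;
}
```

Everything except total is declared inside the loop body, so the reduction clause is the only data-sharing clause the loop needs.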
The parallel part can be reinterpreted as follows (I have used the snippet by @jens-gustedt, but in my experience it didn't make much difference):
#pragma omp parallel private(myr_private) shared(myr)
{
myr_private = 0;
#pragma omp for
for(int i = 0; i < size; i++){
int min = INFINITY;
int * tmp = temp[i];
for(int j = 0; j < size; j++){
if (tmp[j] < min)
min = tmp[j];
}
for(int j = 0; j < size; j++)
tmp[j]-=min;
myr_private+=min;
}
#pragma omp critical
{
myr+=myr_private;
}
}
(This interpretation is straight from http://www.openmp.org/mp-documents/OpenMP3.1.pdf Example A.36.2c).
If the number of threads is n > 1, there is overhead when #pragma omp parallel creates the additional thread(s), and again in the critical section, where the threads have to wait for each other.
I have experimented with different matrix sizes. In my limited tests, two threads are considerably faster for sizes above 1000 and start lagging behind for sizes below 500.