I am not good in parallel programming but I need to do this loop more faster. I know how to use the library but I am not sure about if the output is right because the output is very close to be zero, but is not.
for ( int l = 0; l < s; ++l ) {
double denonimator_l = 0;
for (int g = 0; g < G; ++g ) {
double &pi_gl = pi(g, l) = w[g];
for (int i = 0; i < p; ++i )
pi_gl *= P[g](i, Y(l, i) - 1);
denonimator_l += pi_gl;
}
for (int g = 0; g < G; ++g ) {
double &pi_gl = pi(g, l);
pi_gl /= denonimator_l;
}
}
I need to parallel the four loops and not just the first.
#pragma omp parallel for schedule(dynamic)
So, I am not sure if I put the line above in the first loop is enough. Also, I need the correct output and I don't know if I am doing something wrong if I don't use private variables because I am using read/write operations.
The loops are not prefect nested so I can't use collapse.
Note that each cycle has a temporary loop variable.
Related
I have the current version of a function:
void*
function(const Input_st *Data, Output_st *Image)
{
int i,j,r,Offset;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r,Offset)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
Offset = i*Data->NR*Data->NZ + j*Data->NR + r;
Image->pTime[Offset] = function2()
}
}
}
return NULL;
}
It works very well, however I wanted to remove the calculation of the variable Offset and use of a pointer pointing to the member Image->pTimeR and then increment, which can look like following:
void*
function(const Input_st *Data, Output_st *Image)
{
int i, j, r;
double *pTime = Image->pTime;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
*pTime = function2()
pTime++;
}
}
}
return NULL;
}
I get Seg Fault. I assume I need to use the reduction clause like reduction(+:pTime).
First, the purpose here is to speed up the function and I am wondering if such change would significantly speed up? (Like less cache memory used?)
Second, well I tried to benchmark it and failed to do so! I think the problem here can be solved by using a reduction clause, but since loops are nested the problem is not that straightforward to me.
There's no need of any sort of reduction clause here. However,at the moment, all threads use the same pointer and update the same memory location (with race conditions in the value assigned to pTime, hence the crashes I suspect).
So you need to define your pointer in a private way (typically by declaring it within the parallel region, and to set it individually per thread to a meaningful value. Then it can be incremented the way you want.
Here is what the code could look like once fixed (not tested obviously):
void* function( const Input_st *Data, Output_st *Image ) {
#pragma omp parallel for schedule( static ) num_threads( 24 )
for ( int i = 0; i < Data->NX; i++ ) {
double *pTime = Image->pTime + i * Data->NR * Data->NZ;
for ( int j = 0; j < Data->NZ; j++ ) {
for ( int r = 0; r < Data->NR; r++ ) {
*pTime = function2();
pTime++;
}
}
}
return NULL;
}
I am working on a code that emulates a matrix-vector product. In this code, we are explicitly calculating the vector A x, but we do not have access to the full matrix A, as dim is typically very large such that the full matrix A cannot be stored in memory.
By doing some timings, we have established that the following loop takes about 4 seconds to complete. string is a an instance of a custom class, basically a wrapper around an unsigned long, and matvec is an Eigen::VectorXd.
// Off-diagonal contributions
for (size_t I = 0; I < dim; I++) {
for (size_t p = 0; p < K; p++) {
if (string.isOccupied(p)) { // p in I
for (size_t q = 0; q < p; q++) {
if (!string.isOccupied(q)) {
string.annihilate(p);
string.create(q);
size_t J = string.address();
matvec(I) += value(p,q) * x(J);
matvec(J) += value(p,q) * x(I);
string.annihilate(q);
string.create(p);
}
} // q < p loop
}
} // p loop
string.nextPermutation();
}
In trying to determine the bottleneck of this loop, we tried a variety of things, such as optimizing the annihilate and create calls, but these have little effect. However, when we time the code without the Eigen::VectorXd value updating, i.e.
// Off-diagonal contributions
for (size_t I = 0; I < dim; I++) {
for (size_t p = 0; p < K; p++) {
if (string.isOccupied(p)) { // p in I
for (size_t q = 0; q < p; q++) {
if (!string.isOccupied(q)) {
string.annihilate(p);
string.create(q);
size_t J = string.address();
string.annihilate(q);
string.create(p);
}
} // q < p loop
}
} // p loop
string.nextPermutation();
}
the loop now only takes 0.2 seconds. From this, we have deduced that the bottleneck is actually adding the right values to the resulting vector. see the edit
Is there a faster way to do the matvec(I) += value(p,q) * x(J) calls, using the C++ Eigen library?
EDIT: I have done some better timings, inspired by #PeterT and #chtz's comments, and I find that the += statements take about 30% of all the time spent in the for loop, whereas the other parts all take about an equally proportioned size of the remaining time.
So, the question remains: how can the matvec(I) += value(p,q) * x(J) and matvec(J) += value(p,q) * x(I) be sped up?
I've a problem in parallelizing a piece of code with openmp, I think that there is a conceptual problem with some operations that have to be made sequentially.
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist_perf[PERF_ROWS];
int array_dist[MAX_ROWS];
#pragma omp parallel for collapse(2)
for (int i = 0; i < MAX_COLUMNS;
i = i + 1 + (i % PERF_CLMN == 0 ? 1:0))
{
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist_perf[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist_perf);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = 0; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
The error is that "collapsed loops are not perfectly nested". I'm using Clion on g++-5. Thanks in advance
First of all, perfectly nested loops have the following form:
for (init1; cond1; inc1)
{
for (init2; cond2; inc2)
{
...
}
}
Notice that the body of the outer loop consists solely of the inner loop and nothing else. This is definitely not the case with your code - you have plenty of other statements following the inner loop.
Second, your outer loop is not in the canonical form required by OpenMP. Canonical are loops for which the number of iterations and the iteration step can be easily pre-determined. Since what you are doing is skip an iteration each time i is a multiple of PERF_CLMN, you can rewrite the loop as:
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
This will create work imbalance depending on whether MAX_COLUMNS is a multiple of the number of threads or not. But there is yet another source or imbalance, namely the conditional evaluation of rank_function(). You should therefore utilise dynamic scheduling.
Now, apparently both array_dist* loops are meant to be private, which they are not in your case and that will result in data races. Either move the definition of the arrays within the loop body or use the private() clause.
#pragma omp parallel for schedule(dynamic) private(array_dist_perf,array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
Now, for some unsolicited optimisation advice: the two inner loops are redundant as the first one is basically doing a subset of the work of the second one. You can optimise the computation and save on memory by using a single array only and let the second loop continue from where the first one ends. The final version of the code should look like:
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist[MAX_ROWS];
#pragma omp parallel for schedule(dynamic) private(array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = PERF_ROWS; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
Another potential for optimisation lies in the fact that input_matrix is not accessed in a cache-friendly way. Transposing it will result in columns data being stored continuously in memory and improve the memory access locality.
I'm new here and this is my first question in this site;
I am doing a simple program to find a maximum value of a vector c that is function of two other vectors a and b. I'm doing it on Microsoft Visual Studio 2013 and the problem is that it only support OpenMP 2.0 and I cannot do a Reduction operation to find directy the max or min value of a vector, because OpenMP 2.0 does not supports this operation.
I'm trying to do the without the constructor reduction with the following code:
for (i = 0; i < NUM_THREADS; i++){
cMaxParcial[i] = - FLT_MAX;
}
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private (i,j,indice)
for (i = 0; i < N; i++){
for (j = 0; j < N; j++){
indice = omp_get_thread_num();
if (c[i*N + j] > cMaxParcial[indice]){
cMaxParcial[indice] = c[i*N + j];
bMaxParcial[indice] = b[j];
aMaxParcial[indice] = a[i];
}
}
}
cMax = -FLT_MAX;
for (i = 0; i < NUM_THREADS; i++){
if (cMaxParcial[i]>cMax){
cMax = cMaxParcial[i];
bMax = bMaxParcial[i];
aMax = aMaxParcial[i];
}
}
I'm getting the error: "The expression must have integral or unscoped enum type"
on the command cMaxParcial[indice] = c[i*N + j];
Can anybody help me with this error?
Normally, the error is caused by one of the indices not being in integer type. Since you haven't shown the code where i, j, N and indice are declared, my guess is that either N or indice is a float or double, but it would be simpler to answer if you had provided a MCVE. However, the line above it seems to have used the same indices correctly. This leads me to believe that it's an IntelliSense error, which often are false positives. Try compiling the code and running it.
Now, on to issues that you haven't (yet) asked about (why is my parallel code slower than my serial code?). You're causing false sharing by using (presumably) contiguous arrays to find the a, b, and c values of each thread. Instead of using a single pragma for parallel and for, split it up like so:
cMax = -FLT_MAX;
#pragma omp parallel
{
float aMaxParcialPerThread;
float bMaxParcialPerThread;
float cMaxParcialPerThread;
#pragma omp for nowait private (i,j)
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
if (c[i*N + j] > cMaxParcialPerThread){
cMaxParcialPerThread = c[i*N + j];
bMaxParcialPerThread = b[j];
aMaxParcialPerThread = a[i];
} // if
} // for j
} // for i
#pragma omp critical
{
if (cMaxParcialPerThread < cMax) {
cMax = cMaxParcialPerThread;
bMax = bMaxParcialPerThread;
aMax = aMaxParcialPerThread;
}
}
}
I don't know what is wrong with your compiler since (as far as I can see with only the partial data you gave), the code seems valid. However, it is a bit convoluted and not so good.
What about the following:
#include <omp.h>
#include <float.h>
extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;
int foo() {
cMax = -FLT_MAX;
#pragma omp parallel num_threads( NUM_THREADS )
{
float localAMax, localBMax, localCMax = -FLT_MAX;
#pragma omp for
for ( int i = 0; i < N; i++ ) {
for ( int j = 0; j < N; j++ ) {
float pivot = c[i*N + j];
if ( pivot > localCMax ) {
localAMax = a[i];
localBMax = b[j];
localCMax = pivot;
}
}
}
#pragma omp critical
{
if ( localCMax > cMax ) {
aMax = localAMax;
bMax = localBMax;
cMax = localCMax;
}
}
}
}
It compiles but I haven't tested it...
Anyway, I avoided using the [a-c]MaxParcial arrays since they will generate false sharing between the threads, leading to poor performance. The final reduction is done based on critical. It is not ideal, but will perform perfectly as long as you have a "moderated" number of threads. If you see some hot spot there or you need to use a "large" number of threads, it can be optimised better with a proper parallel reduction later.
Do OpenMP 'For' loops work with multiple loop variables? For example:
int i;
double k;
#pragma omp parallel for
for (k = 0, i = 0; k < 1; k += 0.1, i++)
{ }
It works fine without OpenMP, but using it I get the following errors:
C3015: initialization in OpemMP 'for' statement has improper form
C3019: increment in OpenMP 'for' statement has improper form
You can do this
#pragma omp parallel for
for (int i = 0; i<10; i++) {
double k = 0.1*i;
}
If you really want to avoid the multiplication in the loop and be closer to your original code you can do this
#pragma omp parallel
{
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
int starti = ithread*10/nthreads;
int finishi = (ithread+1)*10/nthreads;
double start = 0.1*starti;
double finish = 0.1*finishi;
double k;
int i;
for (k = start, i = starti; k < finish; k += 0.1, i++) {
}
}
When I first wrote this answer I did not realize one subtle point.
The conversion from
for (k = 0; k < 1; k += 0.1)
to
for (int i = 0; i<10; i++) double k = 0.1*i;
is not one-to-one. I mean the results are not necessarily identical. That's because for floating point math multiplication times an integer is not necessarily repeated addition. It may be fine in many cases it's important to be aware that they are not the same thing.
It's possible to go the other way from multiplication to repeated addition if you use Kahan summation but going from repeated addition to multiplication is not guaranteed to give the same result.
You can read more about it at floating-point-multiplication-vs-repeated-addition.
You need to convert the code to only use i (i.e., the int variable with the simple increment) for the the loop itself, and work with k in code controlled by the loop:
double k = 0.0;
int i;
for (i=0; i<10; i++) {
// body of loop goes here
k+=0.1;
}