I wrote a program that multiplies a vector by a matrix. The matrix has periodically repeated cells, so I use a temporary variable to sum the vector elements before the multiplication. The period is the same for adjacent rows. I create a separate temp variable for each thread: sizeof(InnerVector) == 400 and I don't want to allocate memory for it on every iteration (= 600 times).
The code looks something like this:
tempsSize = omp_get_max_threads();
InnerVector* temps = new InnerVector[tempsSize];  // one temp per thread
for(int k = 0; k < tempsSize; k++)
    InnerVector_init(temps[k]);
for(int jmin = 1, jmax = 2; jmax < matrixSize/2; jmin *= 2, jmax *= 2)
{
    int period = getPeriod(jmax);
    #pragma omp parallel
    {
        int threadNum = omp_get_thread_num();
        // printf("\n threadNum = %i", threadNum);
        #pragma omp for
        for(int j = jmin; j < jmax; j++)
        {
            InnerVector_reset(temps[threadNum]);
            for(int i = jmin; i < jmax; i++)
            {
                InnerMatrix cell = getCell(i, j);
                if(temps[threadNum].IsZero)
                    for(int k = j; k < matrixSize; k += period)
                        InnerVector_add(temps[threadNum], temps[threadNum], v[k]);
                InnerVector_add_mul(v_res[i], cell, temps[threadNum]);
            }
        }
    }
}
The code looks correct, but I get wrong results. In fact, I get different results on different runs... sometimes the result is correct.
When I compile in debug mode, the result is always correct.
When I uncomment the line with the printf, the result is always correct.
p.s. I use Visual Studio 2010.
I suspect there might be a data race in
InnerVector_add_mul(v_res[i], cell, temps[threadNum]);
Since v_res appears to be the result vector, and the inner loop runs i over the whole [jmin, jmax) range in every iteration of the parallelized j loop, multiple threads can write to v_res[i] for the same value of i, with unpredictable results.
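One possible restructuring (a sketch only; jsums is an illustrative name, assumed to be an array of jmax InnerVectors initialised with InnerVector_init before the jmin/jmax loop) is to precompute the per-j sums first and then parallelise over i, so that every thread writes only its own v_res[i]:

// inside the jmin/jmax loop, after period has been computed:
#pragma omp parallel for
for(int j = jmin; j < jmax; j++)
{
    InnerVector_reset(jsums[j]);
    for(int k = j; k < matrixSize; k += period)
        InnerVector_add(jsums[j], jsums[j], v[k]);
}

#pragma omp parallel for
for(int i = jmin; i < jmax; i++)        // distinct i per thread -> distinct v_res[i], no race
    for(int j = jmin; j < jmax; j++)
        InnerVector_add_mul(v_res[i], getCell(i, j), jsums[j]);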
Related
I aim to compute a simple N-body program in C++ and I am using OpenMP to speed up the computations. At some point, I have nested loops that look like this:
int N;
double* S = new double[N];
double* Weight = new double[N];
double* Coordinate = new double[N];
...
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
    for (int j = 0; j < i; ++j)
    {
        double K = Coordinate[i] - Coordinate[j];
        S[i] += K*Weight[j];
        S[j] -= K*Weight[i];
    }
}
The issue here is that I do not obtain exactly the same result when removing the #pragma ... I am guessing it has to do with the fact that the second loop depends on the integer i, but I don't see how to get past that issue.
The problem is that there is a data race when updating S[i] and S[j]. Different threads may read from/write to the same element of the array at the same time, so the update has to be an atomic operation (you have to add #pragma omp atomic) to avoid the data race and to ensure memory consistency:
for (int j = 0; j < i; ++j)
{
    double K = Coordinate[i] - Coordinate[j];
    #pragma omp atomic
    S[i] += K*Weight[j];
    #pragma omp atomic
    S[j] -= K*Weight[i];
}
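An alternative worth mentioning (not part of the original answer, and it requires a compiler with OpenMP 4.5 array-section reductions, e.g. a recent GCC or Clang): let each thread accumulate into a private copy of S and have the runtime combine the copies at the end, which avoids paying for an atomic on every update:

#pragma omp parallel for reduction(+ : S[:N])
for (int i = 0; i < N; ++i)
{
    for (int j = 0; j < i; ++j)
    {
        double K = Coordinate[i] - Coordinate[j];
        S[i] += K * Weight[j];   // goes to the thread's private copy of S
        S[j] -= K * Weight[i];
    }
}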
I have a problem parallelizing a piece of code with OpenMP; I think there is a conceptual problem with some operations that have to be made sequentially.
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
    int array_dist_perf[PERF_ROWS];
    int array_dist[MAX_ROWS];
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < MAX_COLUMNS;
         i = i + 1 + (i % PERF_CLMN == 0 ? 1 : 0))
    {
        for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
        {
            array_dist_perf[j] = abs(input[j] - input_matrix[j][i]);
        }
        float av = mean(PERF_ROWS, array_dist_perf);
        float score = score_func(av);
        if (score > THRESHOLD_SCORE)
        {
            for (int k = 0; k < MAX_ROWS; k++)
            {
                array_dist[k] = abs(input[k] - input_matrix[k][i]);
            }
            float av_real = mean(MAX_ROWS, array_dist);
            float score_real = score_func(av_real);
            rank_function(score_real, i);
        }
    }
}
The error is "collapsed loops are not perfectly nested". I'm using CLion with g++-5. Thanks in advance.
First of all, perfectly nested loops have the following form:
for (init1; cond1; inc1)
{
    for (init2; cond2; inc2)
    {
        ...
    }
}
Notice that the body of the outer loop consists solely of the inner loop and nothing else. This is definitely not the case with your code - you have plenty of other statements following the inner loop.
Second, your outer loop is not in the canonical form required by OpenMP. Canonical loops are those for which the number of iterations and the iteration step can easily be determined in advance. Since what you are doing is skipping an iteration each time i is a multiple of PERF_CLMN, you can rewrite the loop as:
for (int i = 0; i < MAX_COLUMNS; i++)
{
    if (i % PERF_CLMN == 1) continue;
    ...
}
This will create work imbalance depending on whether MAX_COLUMNS is a multiple of the number of threads or not. But there is yet another source of imbalance, namely the conditional evaluation of rank_function(). You should therefore use dynamic scheduling.
Now, apparently both array_dist* arrays are meant to be private, which they are not in your case, and that will result in data races. Either move the definition of the arrays inside the loop body or use the private() clause.
#pragma omp parallel for schedule(dynamic) private(array_dist_perf,array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
    if (i % PERF_CLMN == 1) continue;
    ...
}
Now, for some unsolicited optimisation advice: the two inner loops are redundant, as the first one does a subset of the work of the second. You can optimise the computation and save memory by using a single array and letting the second loop continue from where the first one ends. The final version of the code should look like this:
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
    int array_dist[MAX_ROWS];
    #pragma omp parallel for schedule(dynamic) private(array_dist)
    for (int i = 0; i < MAX_COLUMNS; i++)
    {
        if (i % PERF_CLMN == 1) continue;
        for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
        {
            array_dist[j] = abs(input[j] - input_matrix[j][i]);
        }
        float av = mean(PERF_ROWS, array_dist);
        float score = score_func(av);
        if (score > THRESHOLD_SCORE)
        {
            for (int k = PERF_ROWS; k < MAX_ROWS; k++)
            {
                array_dist[k] = abs(input[k] - input_matrix[k][i]);
            }
            float av_real = mean(MAX_ROWS, array_dist);
            float score_real = score_func(av_real);
            rank_function(score_real, i);
        }
    }
}
Another potential optimisation lies in the fact that input_matrix is not accessed in a cache-friendly way. Transposing it would store the column data contiguously in memory and improve the locality of the memory accesses.
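A minimal sketch of that transposition (input_matrix_t is an illustrative name, not in the original code, and its element type is assumed to match input_matrix):

// build the transpose once, outside the parallel loop
static int input_matrix_t[MAX_COLUMNS][MAX_ROWS];
for (int r = 0; r < MAX_ROWS; r++)
    for (int c = 0; c < MAX_COLUMNS; c++)
        input_matrix_t[c][r] = input_matrix[r][c];

// the inner loops then read along a row of the transposed matrix, i.e. contiguously:
// array_dist[j] = abs(input[j] - input_matrix_t[i][j]);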
I'm running the following code for matrix multiplication the performance of which I'm supposed to measure:
for (int j = 0; j < COLUMNS; j++)
    #pragma omp for schedule(dynamic, 10)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
Yes, I know it's really slow, but that's not the point - it's purely for performance measuring purposes. I'm running 3 versions of the code depending on where I put the #pragma omp directive, and therefore depending on where the parallelization happens. The code is run in Microsoft Visual Studio 2012 in release mode and profiled in CodeXL.
One thing I've noticed from the measurements is that the option in the code snippet (with parallelization before the k loop) is the slowest, then the version with the directive before the j loop, then the one with it before the i loop. The presented version is also the one which calculates a wrong result because of race conditions - multiple threads accessing the same cell of the result matrix at the same time. I understand why the i loop version is the fastest - all the particular threads process only part of the range of the i variable, increasing the temporal locality. However, I don't understand what causes the k loop version to be the slowest - does it have something to do with the fact that it produces the wrong result?
Of course race conditions can slow the code down. When two or more threads access the same part of memory (the same cache line), that part must be loaded into the cache of each core over and over again, as the other thread invalidates the cached content by writing into it. They compete for a shared resource.
When two variables located too close together in memory are written and read by multiple threads, the result is also a slowdown. This is known as false sharing. In your case it is even worse: the accesses are not just too close, they actually coincide.
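For illustration only (this snippet is not from the question): the usual cure for false sharing is to make sure that data written by different threads lives on different cache lines, for example by padding per-thread accumulators to 64 bytes:

#include <omp.h>
#include <vector>

// each slot is padded to a full 64-byte cache line, so two threads'
// accumulators never end up in the same line
struct PaddedSum { double value; char pad[64 - sizeof(double)]; };

double parallel_sum(const double* data, int n)
{
    std::vector<PaddedSum> partial(omp_get_max_threads());  // value-initialised to 0
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[t].value += data[i];   // each thread writes only its own padded slot
    }
    double total = 0.0;
    for (const PaddedSum& p : partial)
        total += p.value;
    return total;
}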
Your assumption is correct. But if we are talking about performance, and not just validating your assumption, there is more to the story.
The order of your indices is a big issue, multi-threaded or not. Given that the distance between mat[x][y] and mat[x][y+1] is one, while the distance between mat[x][y] and mat[x+1][y] is dim(mat[x]), you want x to be the outer index and y the inner one, so that the distance between iterations is minimal. Given __[i][j] += __[i][k] * __[k][j]; you can see that the proper order for spatial locality is i -> k -> j.
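Spelled out with the array names and bounds from the question, that i -> k -> j ordering is:

for (int i = 0; i < ROWS; i++)
    for (int k = 0; k < COLUMNS; k++)
        for (int j = 0; j < COLUMNS; j++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];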
Whatever the order, there is one value which can be saved for later. Given your snippet
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
the matrix_b[k][j] value will be fetched from memory ROWS times (once for every i). You could have started from:
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
    {
        int temp = matrix_b[k][j];
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * temp;
    }
But given that you are writing to matrix_r[i][j], and writing is slower than reading, the best access to optimize is the one to matrix_r[i][j].
Unnecessary write accesses to memory
for (int i = 0; i < ROWS; i++)
    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
will write to matrix_r ROWS times on every (j, k) iteration, so over the whole computation each element of matrix_r is written COLUMNS times. Accumulating into a temporary variable and writing the result out once reduces the write accesses per element to one.
for (int i = 0; i < ...; i++)
    for (int j = 0; j < ...; j++)
    {
        int temp = 0;
        for (int k = 0; k < ...; k++)
            temp += matrix_a[i][k] * matrix_b[k][j];
        matrix_r[i][j] = temp;
    }
This decreases write accesses from n^3 to n^2.
Now you are using threads. To maximize the efficiency of multithreading, you should isolate each thread's memory accesses from the others as much as possible. One way to do that would be to give each thread a column and prefetch that column once. A simple way to achieve this is to use the transpose of matrix_b, so that
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j]; becomes
matrix_r[i][j] += matrix_a[i][k] * matrix_b_trans[j][k];
so that the innermost loop over k always deals with contiguous memory with respect to both matrix_a and matrix_b_trans:
for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
    {
        int temp = 0;
        for (int k = 0; k < SAMEDIM; k++)
            temp += matrix_a[i][k] * matrix_b_trans[j][k];
        matrix_r[i][j] = temp;
    }
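For completeness, a minimal sketch of building matrix_b_trans once before the multiplication (its dimensions are assumed to be the transpose of matrix_b; it is not declared in the snippets above):

for (int k = 0; k < SAMEDIM; k++)
    for (int j = 0; j < COLS; j++)
        matrix_b_trans[j][k] = matrix_b[k][j];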
Do OpenMP 'For' loops work with multiple loop variables? For example:
int i;
double k;
#pragma omp parallel for
for (k = 0, i = 0; k < 1; k += 0.1, i++)
{ }
It works fine without OpenMP, but using it I get the following errors:
C3015: initialization in OpenMP 'for' statement has improper form
C3019: increment in OpenMP 'for' statement has improper form
You can do this
#pragma omp parallel for
for (int i = 0; i < 10; i++) {
    double k = 0.1*i;
}
If you really want to avoid the multiplication in the loop and stay closer to your original code, you can do this:
#pragma omp parallel
{
    int nthreads = omp_get_num_threads();
    int ithread = omp_get_thread_num();
    int starti = ithread*10/nthreads;
    int finishi = (ithread+1)*10/nthreads;
    double start = 0.1*starti;
    double finish = 0.1*finishi;
    double k;
    int i;
    for (k = start, i = starti; k < finish; k += 0.1, i++) {
    }
}
When I first wrote this answer I did not realize one subtle point.
The conversion from
for (k = 0; k < 1; k += 0.1)
to
for (int i = 0; i<10; i++) double k = 0.1*i;
is not one-to-one. I mean the results are not necessarily identical. That's because in floating-point math, multiplication by an integer is not necessarily the same as repeated addition. It may be fine in many cases, but it's important to be aware that they are not the same thing.
It's possible to go the other way, from multiplication to repeated addition, if you use Kahan summation, but going from repeated addition to multiplication is not guaranteed to give the same result.
You can read more about it at floating-point-multiplication-vs-repeated-addition.
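A small self-contained demonstration of the difference (the commented values are what IEEE-754 doubles typically produce):

#include <cstdio>

int main() {
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;                // repeated addition of 0.1
    double prod = 0.1 * 10;        // single multiplication

    std::printf("repeated addition: %.17g\n", sum);    // 0.99999999999999989
    std::printf("multiplication:    %.17g\n", prod);   // 1
    std::printf("equal: %s\n", sum == prod ? "yes" : "no");  // no
    return 0;
}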
You need to convert the code to use only i (i.e., the int variable with the simple increment) for the loop itself, and work with k in code controlled by the loop:
double k = 0.0;
int i;
for (i = 0; i < 10; i++) {
    // body of loop goes here
    k += 0.1;
}
In my previous question, Shared vectors in OpenMP, it was stated that one can let different threads read and write a shared vector as long as the different threads access different elements of the vector.
What if different threads have to read all the (so sometimes the same) elements of a vector, like in the following case?
#include <cmath>
#include <vector>
using std::vector;

int main(){
    vector<double> numbers;
    vector<double> results(10);
    double x;
    //write 25 values in vector numbers
    for (int i = 0; i < 25; i++){
        numbers.push_back(cos(i));
    }
    #pragma omp parallel for default(none) \
        shared(numbers, results) \
        private(x)
    for (int j = 0; j < 10; j++){
        for(int k = 0; k < 25; k++){
            x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
        }
        results[j] = x;
    }
    return 0;
}
Will this parallelization be slow because only one thread at a time can read any element of the vector, or is that not the case? Could I resolve the problem with the clause firstprivate(numbers)?
Would it make sense to create an array of vectors, so that every thread gets its own vector?
For instance:
vector<double> numbersx[**-number of threads-**];
Reading elements of the same vector from multiple threads is not a problem. No synchronization is involved, so the reads happen concurrently without slowing each other down.
With the size of vectors that you are working with, you will not have any cache problems either, although for bigger vectors you may get some slow-downs due to the cache access pattern. In that case, separate copies of the numbers data would improve performance.
better approach:
#include <cmath>
#include <vector>
using std::vector;

int main(){
    vector<double> numbers;
    vector<double> results(10);
    //write 25 values in vector numbers
    for (int i = 0; i < 25; i++){
        numbers.push_back(cos(i));
    }
    #pragma omp parallel for
    for (int j = 0; j < 10; j++){
        double x = 0; // make x a local variable
        for(int k = 0; k < 25; k++){
            x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
        }
        results[j] = x; // no race here
    }
    return 0;
}
It will still be somewhat slow, though, simply because there isn't much work to share between the threads.
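As for the firstprivate(numbers) part of the question: it would give every thread its own copy of the vector (made by the copy constructor on entry to the parallel region), so all reads become thread-local. With only 25 elements that is unlikely to pay off, but as a sketch it would look like this:

#pragma omp parallel for default(none) firstprivate(numbers) shared(results)
for (int j = 0; j < 10; j++){
    double x = 0;
    for(int k = 0; k < 25; k++){
        x += 2 * numbers[j] * numbers[k] + 5 * numbers[j * k / 25];
    }
    results[j] = x;
}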