I'm trying to parallelize a loop whose iterations depend on each other. I tried a reduction and the code runs, but the result is wrong: I think the reduction works for the sum but not for the update of the array inside the loop. Is there a way to get the right result while parallelizing this loop?
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < DATA_MAG; i++)
{
    sum += H[i];
    LUT[i] = sum * scale_factor;
}
The reduction clause creates a private copy of sum for each thread in the team, as if the private clause had been used on sum. After the for loop the result of each private copy is combined with the original shared value of sum. Since the shared sum is only updated after the for loop, you cannot rely on its value inside the loop.
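To make that concrete, here is a rough sketch of what reduction(+:sum) effectively expands to (illustration only, reusing DATA_MAG and H from the question and assuming float, as in the SSE example below; the real expansion is up to the runtime):

float sum = 0.0f;
#pragma omp parallel
{
    float private_sum = 0.0f;        // each thread works on its own copy
    #pragma omp for
    for (int i = 0; i < DATA_MAG; i++)
        private_sum += H[i];         // no thread sees the others' running totals
    #pragma omp atomic
    sum += private_sum;              // copies are combined only here, after the loop
}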
In this case you need to do a prefix sum. Unfortunately, the parallel prefix sum using threads is memory-bandwidth bound for large DATA_MAG and dominated by the OpenMP overhead for small DATA_MAG. However, there may be some sweet spot in between where you see some benefit from using threads.
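For reference, a two-pass threaded prefix sum could look roughly like the sketch below (my own illustration, not tuned; the function name and chunking scheme are mine, while H, LUT and scale_factor follow the question):

#include <stdlib.h>
#include <omp.h>

/* Two-pass threaded prefix sum sketch.
   Pass 1: each thread sums its chunk. A short serial scan turns those partial
   sums into per-thread offsets. Pass 2: each thread re-scans its chunk, adds
   its offset and writes the scaled running sum. */
void prefix_sum_omp(const float *H, float *LUT, int n, float scale_factor)
{
    const int nthreads = omp_get_max_threads();
    float *offset = calloc((size_t)nthreads + 1, sizeof(float));
    #pragma omp parallel num_threads(nthreads)
    {
        const int tid = omp_get_thread_num();
        const int lo = (int)((long long)n * tid / nthreads);
        const int hi = (int)((long long)n * (tid + 1) / nthreads);
        float s = 0.0f;
        for (int i = lo; i < hi; i++) s += H[i];   /* pass 1: chunk sum */
        offset[tid + 1] = s;
        #pragma omp barrier
        #pragma omp single
        for (int t = 1; t <= nthreads; t++) offset[t] += offset[t - 1];
        /* implicit barrier at the end of single */
        s = offset[tid];
        for (int i = lo; i < hi; i++) {            /* pass 2: local scan + offset */
            s += H[i];
            LUT[i] = s * scale_factor;
        }
    }
    free(offset);
}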
But in your case DATA_MAG is only 256, which is very small and would not benefit from OpenMP anyway. What you can do is use SIMD. However, OpenMP's simd construct is not powerful enough for the prefix sum as far as I know. You can, however, do it by hand using intrinsics like this:
#include <stdio.h>
#include <x86intrin.h>

#define N 256

// Inclusive prefix sum of the four floats in x (log-step shifts and adds).
inline __m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

// s[i] = scale_factor * (a[0] + ... + a[i]), computed 4 floats at a time.
void prefix_sum_SSE(float *a, float *s, int n, float scale_factor) {
    __m128 offset = _mm_setzero_ps();
    __m128 f = _mm_set1_ps(scale_factor);
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(&a[i]);
        __m128 out = scan_SSE(x);            // scan within the 4-element block
        out = _mm_add_ps(out, offset);       // add running total of previous blocks
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3)); // carry the last element
        out = _mm_mul_ps(out, f);            // apply the scale factor
        _mm_storeu_ps(&s[i], out);
    }
}

int main(void) {
    float H[N], LUT[N];
    for (int i = 0; i < N; i++) H[i] = i;
    prefix_sum_SSE(H, LUT, N, 3.14159f);
    for (int i = 0; i < N; i++) printf("%.1f ", LUT[i]);
    puts("");
    // Reference: scale_factor * i*(i+1)/2, since H[i] = i.
    for (int i = 0; i < N; i++) printf("%.1f ", 3.14159f * i * (i + 1) / 2);
    puts("");
}
See here for more details about SIMD prefix sums with SSE and AVX.
Here is my matrix multiplication C++ OpenMP code. I am trying to use OpenMP to optimize the program. The sequential version took 7 seconds, but after I added the OpenMP statements it only got about 3 seconds faster. I thought it would get much faster and don't understand whether I'm doing it right.
The OpenMP statements are in the fill_random function and in the matrix multiplication triple for loop section in main.
I would appreciate any help or advice you can give to understand this!
#include <iostream>
#include <cassert>
#include <omp.h>
#include <chrono>
using namespace std::chrono;
double** fill_random(int rows, int cols)
{
    double** mat = new double* [rows]; //Allocate rows.
    #pragma omp parallell collapse(2)
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols]; // added
        for (int j = 0; j < cols; ++j)
        {
            mat[i][j] = rand() % 10;
        }
    }
    return mat;
}

double** create_matrix(int rows, int cols)
{
    double** mat = new double* [rows]; //Allocate rows.
    for (int i = 0; i < rows; ++i)
    {
        mat[i] = new double[cols](); //Allocate each row and zero initialize..
    }
    return mat;
}

void destroy_matrix(double** &mat, int rows)
{
    if (mat)
    {
        for (int i = 0; i < rows; ++i)
        {
            delete[] mat[i]; //delete each row..
        }
        delete[] mat; //delete the rows..
        mat = nullptr;
    }
}

int main()
{
    int rowsA = 1000; // number of rows
    int colsA = 1000; // number of columns
    double** matA = fill_random(rowsA, colsA);

    int rowsB = 1000; // number of rows
    int colsB = 1000; // number of columns
    double** matB = fill_random(rowsB, colsB);

    //Checking matrix multiplication qualification
    assert(colsA == rowsB);

    double** matC = create_matrix(rowsA, colsB);

    //measure the multiply only
    const auto start = high_resolution_clock::now();

    //Multiplication
    #pragma omp parallel for
    for (int i = 0; i < rowsA; ++i)
    {
        for (int j = 0; j < colsB; ++j)
        {
            for (int k = 0; k < colsA; ++k) //ColsA..
            {
                matC[i][j] += matA[i][k] * matB[k][j];
            }
        }
    }

    const auto stop = high_resolution_clock::now();
    const auto duration = duration_cast<seconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " seconds" << std::endl;

    //Clean up..
    destroy_matrix(matA, rowsA);
    destroy_matrix(matB, rowsB);
    destroy_matrix(matC, rowsA);
    return 0;
}
Your problem is rather small.
The collapse in the matrix creation does nothing because the loops are not perfectly nested. On the other hand, in the multiplication routine you should add a collapse(2) directive.
Creating a matrix as an array of pointers means that the expression matB[k][j] dances all over memory. Allocate each matrix as a single contiguous array and use i*N+j as the indexing expression. (Of course I would wrap that in a macro or the like.)
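For example, a sketch combining both suggestions (contiguous storage plus collapse(2) on the multiplication); it reuses rowsA, colsA, rowsB and colsB from the question's main, and the IDX macro is just an illustrative helper, not part of the original code:

// One contiguous allocation per matrix, indexed as row * ncols + col.
#define IDX(i, j, ncols) ((i) * (ncols) + (j))

double* matA = new double[rowsA * colsA];   // fill as before
double* matB = new double[rowsB * colsB];
double* matC = new double[rowsA * colsB](); // zero-initialized

#pragma omp parallel for collapse(2)
for (int i = 0; i < rowsA; ++i)
    for (int j = 0; j < colsB; ++j)
    {
        double sum = 0.0;
        for (int k = 0; k < colsA; ++k)
            sum += matA[IDX(i, k, colsA)] * matB[IDX(k, j, colsB)];
        matC[IDX(i, j, colsB)] = sum;
    }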
A 1000x1000 matrix of double (64-bit) elements takes 8 MB. When you multiply two matrices you read 16 MB, and writing the third matrix brings the total data accessed to 24 MB.
If the L3 cache is smaller than 24 MB, then RAM is the bottleneck. A single thread may not fully use the memory bandwidth on its own, but once OpenMP is used the RAM bandwidth is saturated; in your case there was only about 50% headroom left.
The naive version also does not use the cache well. You need to swap the j and k loops to get better cache reuse:

loop i
    loop k
        loop j
            C[i][j] += A[i][k] * B[k][j]

Although incrementing C[i][j] no longer reuses a register in this version, it reuses cache lines, which matters more here. With this change you should see roughly 100-200 milliseconds of computation time even single-threaded.
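In code, the reordered version of the question's multiplication loop would look roughly like this (same matrices and bounds as in the question):

// For a fixed i and k, matA[i][k] is a constant and both matC[i][...] and
// matB[k][...] are walked contiguously, which is cache- and vectorizer-friendly.
#pragma omp parallel for
for (int i = 0; i < rowsA; ++i)
    for (int k = 0; k < colsA; ++k)
    {
        const double aik = matA[i][k];
        for (int j = 0; j < colsB; ++j)
            matC[i][j] += aik * matB[k][j];
    }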
Also, if you need performance, don't allocate each row separately (the "//Allocate each row and zero initialize.." part): allocate the whole matrix at once so that it is not scattered in memory.
To use more threads efficiently, you can build the full matrix multiplication out of sub-matrix multiplications. Scan-line multiplication is not good for load balancing between threads. Multiplying sub-matrices gives better load distribution thanks to caching and more floating-point operations per element fetched from memory.
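A minimal sketch of such a sub-matrix (blocked) multiplication over flat, row-major N x N arrays; the names A, B, C and the tile size BS are my own illustrative choices, not from the answer:

// Each (ii, jj) tile of C is owned by exactly one collapsed iteration,
// so there is no race on C. BS is a tunable tile size chosen so that
// three BS x BS tiles fit comfortably in cache.
const int BS = 64;
#pragma omp parallel for collapse(2)
for (int ii = 0; ii < N; ii += BS)
    for (int jj = 0; jj < N; jj += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int i = ii; i < ii + BS && i < N; ++i)
                for (int k = kk; k < kk + BS && k < N; ++k)
                {
                    const double aik = A[i * N + k];
                    for (int j = jj; j < jj + BS && j < N; ++j)
                        C[i * N + j] += aik * B[k * N + j];
                }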
Edit:
Swapping the order of the loops also lets the compiler vectorize the innermost loop, because the element of one input matrix becomes a constant during that loop.
I have the following C++ code that multiplies elements of two large arrays of size count:
double* pA1 = { large array };
double* pA2 = { large array };
for(register int r = mm; r <= count; ++r)
{
lg += *pA1-- * *pA2--;
}
Is there a way that I can implement parallelism for the code?
Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):
double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper;
    double sum = 0.;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];
    return sum;
}
GCC and ICC are able to vectorize this loop at -O3. Clang 13.0 fails to do so, even with -ffast-math and even with explicit OpenMP SIMD directives as well as loop tiling. This appears to be a bug in Clang's optimizer related to OpenMP... Note that you can use -mavx to target the AVX instruction set, which can be up to twice as fast as SSE (the default). It is available on almost all recent x86-64 PC processors.
I wanted to answer my own question. It looks like we can use OpenMP as follows. However, the speed gain is not that large (about 2x). My computer has 16 cores.
// need to use compile flag /openmp
double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper;
    double sum = 0.;
    int i;
    #pragma omp parallel shared(sum)
    {
        int num = omp_get_num_threads();
        int id = omp_get_thread_num();
        printf("I am thread # % d of % d.\n", id, num);
        double priv_sum = 0.;
        #pragma omp for
        for (i = 0; i < dim; i++)
        {
            priv_sum += v1[i] * v2[i];
        }
        #pragma omp critical
        {
            cout << "priv_sum = " << priv_sum << endl;
            sum += priv_sum;
        }
    }
    return sum;
}
I wrote a C++ code to solve a linear system A.x = b, where A is a symmetric matrix, by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x = A^-1.b = V.D^-1.V^T.b, where of course V is orthogonal.
Now I would like to optimize this last operation as much as possible, e.g. by using (C)BLAS routines and OpenMP.
Here is my naive implementation:
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i=0; i<N; i++)
    {
        for (int j=0; j<N; j++)
        {
            for (int k=0; k<N; k++)
            {
                X[i] += B[j] * V[i+k*N] * V[j+k*N] / D[k];
            }
        }
    }
}
All arrays are C-style arrays, where V is of size N^2, D is of size N, B is of size N and X is of size N (and initialized with zeros).
For now, this naive implementation is very slow and is the bottleneck of the code. Any hints or help would be much appreciated!
Thanks
EDIT
Thanks to Jérôme Richard's answer and comment I further optimized his solution by calling BLAS and parallelizing the middle loop with OpenMP. On a 1000x1000 matrix, this solution is ~4 times faster than his proposal, which itself was 1000 times faster than my naive implementation.
I found the #pragma omp parallel for simd clause to be faster than the other alternatives on two different machines with 4 and 20 cores respectively, for N=1000 and N=2000.
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    double* sum = new double[N]{0.};
    cblas_dgemv(CblasColMajor, CblasTrans, N, N, 1., V, N, B, 1, 0., sum, 1);
    #pragma omp parallel for simd
    for (int i=0; i<N; ++i)
    {
        sum[i] /= D[i];
    }
    cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, 1., V, N, sum, 1, 0., X, 1);
    delete [] sum;
}
This code is currently highly memory-bound. Thus the resulting program will probably scale poorly (as long as compiler optimizations are enabled). Indeed, on most common systems (e.g. a single-socket non-NUMA processor) the RAM throughput is a resource shared between cores, and a scarce one. Moreover, the memory access pattern is inefficient and the algorithmic complexity of the code can be improved.
To speed up the computation, the j and k loops can be swapped so that V is read contiguously. Moreover, V[i+k*N] and D[k] become constants in the innermost loop. The computation can then be factorized, since B[j] and V[j+k*N] do not depend on i either. The resulting algorithm runs in O(n^2) rather than O(n^3) thanks to precomputing the sums!
Finally, omp simd can be used to help compilers vectorize the code, making it even faster!
Note that _OPENMP seems useless here, since #pragma directives should be ignored by compilers when OpenMP is disabled or not supported.
// Solve linear system A.X = B for X (V contains eigenvectors and D eigenvalues of A)
// Requires #include <vector>
void solve(const double* V, const double* D, const double* B, double* X, const int& N)
{
    std::vector<double> kSum(N);

    #pragma omp parallel for
    for (int k=0; k<N; k++)
    {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)
        for (int j=0; j<N; j++)
        {
            sum += B[j] * V[j+k*N];
        }
        kSum[k] = sum / D[k];
    }

    // Loop tiling can be used to speed up this section even more.
    // The idea is to swap i-based and k-based loops and work on thread-private copies
    // of X and finally sum the thread-private versions into a global X.
    // The resulting code should work on contiguous data and can even be vectorized.
    #pragma omp parallel for
    for (int i=0; i<N; i++)
    {
        double sum = X[i];
        for (int k=0; k<N; k++)
        {
            sum += kSum[k] * V[i+k*N];
        }
        X[i] = sum;
    }
}
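Following the comment in the code above, one possible shape of that swapped second loop with thread-private copies of X is sketched below (my own illustration of the idea, not code from the original answer; the tiling proper is left out):

// k becomes the parallel outer loop so V is read contiguously, and each thread
// accumulates into a private copy of X that is merged at the end.
#pragma omp parallel
{
    std::vector<double> localX(N, 0.0);
    #pragma omp for
    for (int k=0; k<N; k++)
    {
        const double ks = kSum[k];
        for (int i=0; i<N; i++)
            localX[i] += ks * V[i+k*N];
    }
    #pragma omp critical
    for (int i=0; i<N; i++)
        X[i] += localX[i];
}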
The new code should be several orders of magnitude faster than the original one (but still memory-bound). Note that results might differ slightly (as floating-point operations are not really associative), but I expect them to be more accurate.
I have this code:
scalar State::add(const int N, const int M,
                  vector<scalar>& flmn,
                  vector<scalar>& BSum,
                  const vector<scalar>& prev_flm,
                  const vector<scalar>& prev_bigsum,
                  const vector<scalar>& Qratio,
                  const int test)
{
    scalar c = 1;
    #pragma omp parallel for
    for (int i = 1; i <= M; i++)
    {
        flmn.at(i-1) = Qratio.at(i-1)*k1+k2;
        BSum.at(i-1) = someconstant + somepublicvector.at(1)*flmn.at(i-1);
        c *= BSum.at(i-1);
    }
    return c;
}
At the end I am returning the variable c. When I use "#pragma omp parallel for" it definitely won't give me consistent answers, since there is always an overlap between the iterations. I wonder how such a combination of matrix or vector manipulations should be parallelized in OpenMP so that I also get consistent results, because there is obviously a race condition here.
for (int i = 1; i <= M; i++) {
    flmn.at(i - 1) = Qratio.at(i - 1) * k1 + k2;
    BSum.at(i - 1) = someconstant + somepublicvector.at(1) * flmn.at(i - 1);
    c *= BSum.at(i - 1);
}
A few notes:
Don't use std::vector::at unless you really need the exception-safe indexing.
You are using the same index for each vector, so start at i=0 rather than the Fortran-style i=1.
Is M different from the sizes of the vectors being used (i.e., is it a subset)? If not, then it doesn't need to be specified.
A possible OpenMP implementation could then be
scalar c{1.0};
#pragma omp parallel reduction(*:c)
{
    const std::size_t nthreads = omp_get_num_threads();
    const std::size_t chunk_size = M / nthreads; // WARNING: non-even division case left to user
    const std::size_t tid = omp_get_thread_num();
    // Each thread works on its own contiguous chunk [tid*chunk_size, (tid+1)*chunk_size).
    for (std::size_t j = 0; j < chunk_size; j++) {
        const std::size_t i = j + tid * chunk_size;
        flmn[i] = Qratio[i] * k1 + k2;
        BSum[i] = someconstant + somepublicvector[1] * flmn[i];
        c *= BSum[i];
    }
}
Note that I have assumed that nthreads evenly divides M. If it does not, this case needs to be handled separately. If you are using OpenMP 4.0, then I recommend using the simd directive since the first two lines are both saxpy operations and can benefit from vectorization. For optimal performance, make sure that chunk_size is a multiple of your CPU's cacheline size.
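For what it's worth, here is a sketch of the OpenMP 4.0 combined construct hinted at above, which also lets the runtime handle the chunking (same names as in the answer; this is an illustration, not the original answer's code):

scalar c{1.0};
#pragma omp parallel for simd reduction(*:c)
for (int i = 0; i < M; i++) {
    flmn[i] = Qratio[i] * k1 + k2;                           // first saxpy-like update
    BSum[i] = someconstant + somepublicvector[1] * flmn[i];  // second update reuses flmn[i]
    c *= BSum[i];                                            // multiplicative reduction
}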
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            XY[2*nind] = i;
            XY[2*nind + 1] = j;
            nind++;
        }
    }
}
Here x = 512, z = 512, nind = 0 initially, and XY has size 2*x*z.
I want to optimize these for loops with OpenMP, but the nind variable ties the loop to serial execution. I have no clue how to handle this, because I am also checking a condition, so sometimes the if branch is skipped and nind is not incremented, and sometimes it is. With OpenMP, whichever thread gets there first would increment nind first. Is there any way to unbind it? (By 'binding' I mean that it can only be implemented serially.)
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
    uint myXY[2*z*x];
    uint mynind = 0;

    #pragma omp for collapse(2) schedule(dynamic,N)
    for (uint i = 0; i < x; i++) {
        for (uint j = 0; j < z; j++) {
            if (inFunc(p, index)) {
                myXY[2*mynind] = i;
                myXY[2*mynind + 1] = j;
                mynind++;
            }
        }
    }

    #pragma omp critical(concat_arrays)
    {
        memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
        nind += mynind;
    }
}

// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
int compar(const void *pa, const void *pb)
{
    const uint *p1 = (const uint *)pa;   // each element is an (i, j) pair of uints
    const uint *p2 = (const uint *)pb;
    if (p1[0] < p2[0])
        return -1;
    else if (p1[0] > p2[0])
        return 1;
    else
    {
        if (p1[1] < p2[1])
            return -1;
        else if (p1[1] > p2[1])
            return 1;
    }
    return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(uint *a, uint *b, uint *c, int na, int nb) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
    while (i < na) c[k++] = a[i++];
    while (j < nb) c[k++] = b[j++];
}
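For instance, a quick standalone check of merge (my own example, assuming uint is unsigned int as in the question):

#include <stdio.h>

typedef unsigned int uint;

int main(void) {
    uint a[] = {1, 4, 7};
    uint b[] = {2, 3, 9, 10};
    uint c[7];
    merge(a, b, c, 3, 4);                      /* expected output: 1 2 3 4 7 9 10 */
    for (int k = 0; k < 7; k++) printf("%u ", c[k]);
    puts("");
    return 0;
}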
Here is the remaining code
uint nind = 0;
uint *P = NULL;   // grows as per-thread results are merged in
#pragma omp parallel
{
    uint myP[x*z];
    uint mynind = 0;
    #pragma omp for schedule(dynamic) nowait
    for (uint k = 0; k < x*z; k++) {
        if (inFunc(p, index)) myP[mynind++] = k;
    }
    #pragma omp critical
    {
        uint *t = (uint*)malloc(sizeof *P * (nind + mynind));
        merge(P, myP, t, nind, mynind);
        free(P);
        P = t;
        nind += mynind;
    }
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it runs in O(omp_get_num_threads()) steps, but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations per thread increase monotonically. I think in practice they do, but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
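A sketch of such hand-rolled dynamic scheduling using a shared chunk counter (my own illustration; CHUNK and the variable names are arbitrary choices, while x, z, inFunc, p and index come from the question):

#define CHUNK 256

uint next = 0;                          /* shared counter handing out chunks in increasing order */
#pragma omp parallel
{
    uint myP[x*z];
    uint mynind = 0;
    for (;;) {
        uint start;
        #pragma omp atomic capture
        { start = next; next += CHUNK; }
        if (start >= x*z) break;
        uint end = start + CHUNK < x*z ? start + CHUNK : x*z;
        for (uint k = start; k < end; k++)   /* iterations within a thread increase monotonically */
            if (inFunc(p, index)) myP[mynind++] = k;
    }
    /* merge myP into P inside a critical section, as in the code above */
}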
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OpenMP multithreading is probably not the tool for the job, as threads should (in the best case) avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also, what do you want to achieve by optimizing it? x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution, map your indices to the array, e.g. (not tested, but should work):
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            uint idx = 2 * (i * z + j);   // fixed slot for the (i, j) pair in XY[2*x*z]
            XY[idx] = i;
            XY[idx + 1] = j;
        }
    }
}
However, you will then have gaps in your array XY. This may or may not be a problem for you.
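If the gaps are a problem, one option (my own sketch, not part of the answer) is to pre-fill XY with a sentinel value that can never be a valid coordinate and compact the array in a short serial pass afterwards:

/* UINT_MAX comes from <limits.h>; it can never be a valid i since i < 512. */
for (uint k = 0; k < 2*x*z; k++) XY[k] = UINT_MAX;   /* sentinel: nothing stored here yet */

/* ... run the parallel loop from above ... */

uint nind = 0;
for (uint k = 0; k < x*z; k++) {
    if (XY[2*k] != UINT_MAX) {         /* slot k holds a valid (i, j) pair */
        XY[2*nind]     = XY[2*k];
        XY[2*nind + 1] = XY[2*k + 1];
        nind++;
    }
}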