I have the following C++ code that multiplies the elements of two large arrays of size count:
double* pA1 = { large array };
double* pA2 = { large array };

for (register int r = mm; r <= count; ++r)
{
    lg += *pA1-- * *pA2--;
}
Is there a way that I can implement parallelism for the code?
Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):
double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper;
    double sum = 0.;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];

    return sum;
}
GCC and ICC are able to vectorize this loop at -O3. Clang 13.0 fails to do so, even with -ffast-math and even with explicit OpenMP SIMD directives as well as with loop tiling. This appears to be a bug in Clang's optimizer related to OpenMP... Note that you can use -mavx to enable the AVX instruction set, which can be up to twice as fast as SSE (the default). It is available on almost all recent x86-64 PC processors.
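For reference, this is what the explicit OpenMP SIMD variant mentioned above looks like; it helps GCC/ICC but, as noted, did not help Clang 13. The compile command in the comment is just one typical invocation, not the only option:

// Typical build (assumed): g++ -O3 -fopenmp -mavx dot.cpp
double dot_prod_simd(const double* v1, const double* v2, int dim)
{
    double sum = 0.;
    // "parallel for simd" requests both threading and vectorization;
    // the reduction clause keeps the accumulation race-free.
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];
    return sum;
}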
I wanted to answer my own question. It looks like we can use OpenMP as follows. However, the speed gain is not that great (about 2x). My computer has 16 cores.
// need to use compile flag /openmp
double dot_prod_parallel(double* v1, double* v2, int dim)
{
    TimeMeasureHelper helper;
    double sum = 0.;
    int i;

    #pragma omp parallel shared(sum)
    {
        int num = omp_get_num_threads();
        int id = omp_get_thread_num();
        printf("I am thread #%d of %d.\n", id, num);

        double priv_sum = 0.;
        #pragma omp for
        for (i = 0; i < dim; i++)
        {
            priv_sum += v1[i] * v2[i];
        }

        #pragma omp critical
        {
            cout << "priv_sum = " << priv_sum << endl;
            sum += priv_sum;
        }
    }
    return sum;
}
I have the following piece of code. I run it on a sample of N=3000, and the sequential C++ code is faster by 3 seconds, which is not good at all.
This code fills the array jsd[N] with calculated values, and I want to locate the maximum value and its location.
So:
1. Is this OpenMP conversion correct, and is there any better suggestion to make it more professional?
2. Why is it slower than the equivalent C++ code, and why does it get slower the more threads I create?
Thanks in advance.
double maxval = 0;
int pos = -1;
double jsd[N];

#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
    if (jsd[i] >= maxval) {
        #pragma omp critical
        {
            maxval = jsd[i];
            pos = i;
        }
    }
} // for
Update:
Here is the new code, but it is still slow and gets slower with more threads.
double maxval = 0;
int pos = -1;
double jsd[N];

#pragma omp parallel num_threads(50)
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
} // for

#pragma omp master
{
    vector<double> jsd2 (jsd, jsd + N);
    vector<double>::iterator jsditer;
    jsditer = std::max_element(jsd2.begin(), jsd2.end());
    maxval = *jsditer;
    pos = std::distance(jsd2.begin(), jsditer);
    // cout << "pos" << pos << endl;
}
#pragma omp barrier
The first optimization I would suggest is to compute all the jsd values in the loop first, and then find the maximum element via std::max_element().
This way you are not forcing the threads to synchronise.
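A minimal sketch of that first suggestion, reusing the names from your question (obj, sequenceVctr, H, LEFT, RIGHT) and assuming jsd can live in a std::vector, might look like this (untested):

#include <algorithm>
#include <vector>

std::vector<double> jsd(N);

// Parallel part: fill jsd with no shared state other than the array itself.
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    double Hl = obj.function1(sequenceVctr, i, LEFT);
    double Hr = obj.function1(sequenceVctr, i, RIGHT);
    jsd[i] = obj.function2(H, i + 1, N, Hl, Hr);
}

// Serial part: find the maximum and its position after the parallel loop.
std::vector<double>::iterator it = std::max_element(jsd.begin(), jsd.end());
double maxval = *it;
int pos = (int)std::distance(jsd.begin(), it);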
The second thing I would do is move over to Intel TBB instead of OpenMP and use parallel_reduce().
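And here is a rough sketch of the TBB route, using the functional form of parallel_reduce to get the index of the maximum once jsd has been filled (untested):

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

int pos = tbb::parallel_reduce(
    tbb::blocked_range<int>(0, N),
    0,                                          // initial candidate index
    [&](const tbb::blocked_range<int>& r, int best) {
        for (int i = r.begin(); i != r.end(); ++i)
            if (jsd[i] > jsd[best]) best = i;   // best index within this subrange
        return best;
    },
    [&](int x, int y) { return jsd[x] > jsd[y] ? x : y; });
double maxval = jsd[pos];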
But the biggest question is: how complex are the objective functions you are evaluating?
I'm trying to parallelize a loop with interdependent iterations. I've tried a reduction and the code works, but the result is wrong. I think the reduction works for the sum but not for the update of the array inside the loop. Is there a way to obtain the right result while parallelizing the loop?
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < DATA_MAG; i++)
{
    sum += H[i];
    LUT[i] = sum * scale_factor;
}
The reduction clause creates private copies of sum for each thread in the team, as if the private clause had been used on sum. After the for loop, the result of each private copy is combined with the original shared value of sum. Since the shared sum is only updated after the for loop, you cannot rely on it inside the for loop.
In this case you need a prefix sum. Unfortunately, a parallel prefix sum using threads is memory-bandwidth bound for large DATA_MAG and dominated by the OpenMP overhead for small DATA_MAG. However, there may be some sweet spot in between where you see some benefit from using threads.
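For completeness, a two-pass threaded prefix sum could look roughly like this; the function name and the shared vector of per-chunk totals are my own choices, and it is only worth it for much larger arrays (untested):

#include <omp.h>
#include <vector>

// Rough sketch of a two-pass threaded prefix sum; only pays off for large n.
void prefix_sum_omp(const float *H, float *LUT, int n, float scale_factor) {
    std::vector<float> chunk_sum;           // chunk_sum[t+1] = total of thread t's chunk
    #pragma omp parallel
    {
        #pragma omp single
        chunk_sum.assign(omp_get_num_threads() + 1, 0.0f);

        int tid = omp_get_thread_num();
        float running = 0.0f;

        // Pass 1: every thread scans its own contiguous chunk.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) {
            running += H[i];
            LUT[i] = running;               // prefix sum within the chunk only
        }
        chunk_sum[tid + 1] = running;
        #pragma omp barrier

        // Offset contributed by the chunks owned by lower-numbered threads.
        float offset = 0.0f;
        for (int t = 0; t <= tid; t++) offset += chunk_sum[t];

        // Pass 2: same static schedule, so each thread revisits its own chunk.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            LUT[i] = (LUT[i] + offset) * scale_factor;
    }
}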
But in your case DATA_MAG is only 256, which is very small and would not benefit from OpenMP anyway. What you can do is use SIMD. However, OpenMP's simd construct is not powerful enough for a prefix sum as far as I know. You can, however, do it by hand using intrinsics like this:
#include <stdio.h>
#include <x86intrin.h>

#define N 256

inline __m128 scan_SSE(__m128 x) {
    // in-register inclusive scan of 4 floats (log-step shifts and adds)
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

void prefix_sum_SSE(float *a, float *s, int n, float scale_factor) {
    __m128 offset = _mm_setzero_ps();
    __m128 f = _mm_set1_ps(scale_factor);
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(&a[i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        // carry the running total (last lane) into the next block of 4
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
        out = _mm_mul_ps(out, f);
        _mm_storeu_ps(&s[i], out);
    }
}

int main(void) {
    float H[N], LUT[N];
    for (int i = 0; i < N; i++) H[i] = i;
    prefix_sum_SSE(H, LUT, N, 3.14159f);
    for (int i = 0; i < N; i++) printf("%.1f ", LUT[i]);
    puts("");
    for (int i = 0; i < N; i++) printf("%.1f ", 3.14159f*i*(i+1)/2);
    puts("");
}
See here for more details about SIMD prefix sums with SSE and AVX.
I'm trying to implement two versions of a function that finds the max element in an array of floats. However, my parallel functions appear to run much slower than the serial code.
With array of 4194304 (2048 * 2048) floats, I get the following numbers (in microseconds):
serial code: 9433
PPL code: 24184 (more than two times slower)
OpenMP code: 862093 (almost 100 times slower)
Here's the code:
PPL:
float find_largest_element_in_matrix_PPL(float* m, size_t dims)
{
    float max_element;
    int row, col;
    concurrency::combinable<float> locals([] { return (float)INT_MIN; });

    concurrency::parallel_for(size_t(0), dims * dims, [&locals](int curr)
    {
        float &localMax = locals.local();
        localMax = max<float>(localMax, curr);
    });

    max_element = locals.combine([](float left, float right) { return max<float>(left, right); });

    return max_element;
}
OpenMP:
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    int i, row, col, index;

    #pragma omp parallel for private(i) shared(max_value, index)
    for (i = 0; i < dims * dims; ++i)
    {
        #pragma omp critical
        if (m[i] > max_value)
        {
            max_value = m[i];
            index = i;
        }
    }

    //row = index / dims;
    //col = index % dims;

    return max_value;
}
What's making the code run so slowly? Am I missing something?
Could you help me find out what I'm doing wrong?
So, as Baum mit Augen noticed, the problem with OpenMP was that I had a critical section and the code didn't actually run in parallel, but synchronously.
Removing the critical section did the trick.
As for PPL, I've found out that it does a lot more preparations (creating threads and stuff) than OpenMP does, hence the slowdown.
Update
So, here's the corrected variant to find the max element with OpenMP (a critical section is still needed, but only inside the if block, and with a re-check so a smaller value cannot overwrite a larger one):
float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    int i;

    #pragma omp parallel for
    for (i = 0; i < dims * dims; ++i)
    {
        if (m[i] > max_value)
        {
            #pragma omp critical
            {
                // re-check inside the critical section: another thread may
                // have written a larger value since the unsynchronised check
                if (m[i] > max_value)
                    max_value = m[i];
            }
        }
    }
    return max_value;
}
PS: not tested.
I'm new here and this is my first question on this site.
I am writing a simple program to find the maximum value of a vector c that is a function of two other vectors a and b. I'm using Microsoft Visual Studio 2013, and the problem is that it only supports OpenMP 2.0, so I cannot use a reduction to find the max or min value of a vector directly, because OpenMP 2.0 does not support this operation.
I'm trying to do it without the reduction clause with the following code:
for (i = 0; i < NUM_THREADS; i++){
    cMaxParcial[i] = -FLT_MAX;
}

omp_set_num_threads(NUM_THREADS);

#pragma omp parallel for private (i,j,indice)
for (i = 0; i < N; i++){
    for (j = 0; j < N; j++){
        indice = omp_get_thread_num();
        if (c[i*N + j] > cMaxParcial[indice]){
            cMaxParcial[indice] = c[i*N + j];
            bMaxParcial[indice] = b[j];
            aMaxParcial[indice] = a[i];
        }
    }
}

cMax = -FLT_MAX;
for (i = 0; i < NUM_THREADS; i++){
    if (cMaxParcial[i] > cMax){
        cMax = cMaxParcial[i];
        bMax = bMaxParcial[i];
        aMax = aMaxParcial[i];
    }
}
I'm getting the error: "The expression must have integral or unscoped enum type"
on the line cMaxParcial[indice] = c[i*N + j];
Can anybody help me with this error?
Normally, the error is caused by one of the indices not being an integer type. Since you haven't shown the code where i, j, N, and indice are declared, my guess is that either N or indice is a float or double, but it would be simpler to answer if you had provided an MCVE. However, the line above it seems to use the same indices correctly. This leads me to believe that it's an IntelliSense error, which are often false positives. Try compiling the code and running it.
Now, on to issues that you haven't (yet) asked about (why is my parallel code slower than my serial code?). You're causing false sharing by using (presumably) contiguous arrays to find the a, b, and c values of each thread. Instead of using a single pragma for parallel and for, split it up like so:
cMax = -FLT_MAX;

#pragma omp parallel
{
    float aMaxParcialPerThread;
    float bMaxParcialPerThread;
    float cMaxParcialPerThread = -FLT_MAX; // must be initialised before comparing

    #pragma omp for nowait private (i,j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (c[i*N + j] > cMaxParcialPerThread){
                cMaxParcialPerThread = c[i*N + j];
                bMaxParcialPerThread = b[j];
                aMaxParcialPerThread = a[i];
            } // if
        } // for j
    } // for i

    #pragma omp critical
    {
        if (cMaxParcialPerThread > cMax) {
            cMax = cMaxParcialPerThread;
            bMax = bMaxParcialPerThread;
            aMax = aMaxParcialPerThread;
        }
    }
}
I don't know what is wrong with your compiler, since (as far as I can tell from the partial code you gave) the code seems valid. However, it is a bit convoluted and not so good.
What about the following:
#include <omp.h>
#include <float.h>

extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;

void foo() {
    cMax = -FLT_MAX;
    #pragma omp parallel num_threads( NUM_THREADS )
    {
        float localAMax, localBMax, localCMax = -FLT_MAX;

        #pragma omp for
        for ( int i = 0; i < N; i++ ) {
            for ( int j = 0; j < N; j++ ) {
                float pivot = c[i*N + j];
                if ( pivot > localCMax ) {
                    localAMax = a[i];
                    localBMax = b[j];
                    localCMax = pivot;
                }
            }
        }

        #pragma omp critical
        {
            if ( localCMax > cMax ) {
                aMax = localAMax;
                bMax = localBMax;
                cMax = localCMax;
            }
        }
    }
}
It compiles but I haven't tested it...
Anyway, I avoided using the [a-c]MaxParcial arrays since they would generate false sharing between the threads, leading to poor performance. The final reduction is done with a critical section. It is not ideal, but will perform perfectly well as long as you have a moderate number of threads. If you see a hot spot there, or you need a large number of threads, it can be optimised further with a proper parallel reduction later.
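For example, with a compiler supporting OpenMP 4.0 or later (not your Visual Studio 2013, unfortunately), that final reduction could be expressed with a user-defined reduction instead of the critical section. A rough, untested sketch, where the struct and reduction name are my own and the extern variables are the same as in the snippet above:

#include <float.h>

extern int N;
extern float aMax, bMax, cMax, *a, *b, *c;

struct Triple { float a, b, c; };

// Combine two candidates, keeping the one with the larger c value.
#pragma omp declare reduction( maxByC : Triple : \
    omp_out = ( omp_in.c > omp_out.c ? omp_in : omp_out ) ) \
    initializer( omp_priv = Triple{ 0.f, 0.f, -FLT_MAX } )

void findMax() {
    Triple best = { 0.f, 0.f, -FLT_MAX };
    #pragma omp parallel for reduction( maxByC : best )
    for ( int i = 0; i < N; i++ ) {
        for ( int j = 0; j < N; j++ ) {
            float pivot = c[i*N + j];
            if ( pivot > best.c )
                best = Triple{ a[i], b[j], pivot };
        }
    }
    aMax = best.a;
    bMax = best.b;
    cMax = best.c;
}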
I'm using a version of OpenMP which does not support the reduction clause for complex arguments. I need a fast dot-product function like
std::complex< double > dot_prod( std::complex< double > *v1, std::complex< double > *v2, int dim )
{
    std::complex< double > sum = 0.;
    int i;

    #pragma omp parallel shared(sum)
    #pragma omp for
    for (i = 0; i < dim; i++)
    {
        #pragma omp critical
        {
            sum += std::conj<double>(v1[i]) * v2[i];
        }
    }
    return sum;
}
Obviously this code does not speed things up but slows them down. Do you have a fast solution that does not use a reduction for complex arguments?
Each thread can calculate a private sum as the first step, and in the second step those private sums can be combined into the final sum. That way the critical section is only needed in the final step.
std::complex< double > dot_prod( std::complex< double > *v1, std::complex< double > *v2, int dim )
{
    std::complex< double > sum = 0.;
    int i;

    #pragma omp parallel shared(sum)
    {
        std::complex< double > priv_sum = 0.;

        #pragma omp for
        for (i = 0; i < dim; i++)
        {
            priv_sum += std::conj<double>(v1[i]) * v2[i];
        }

        #pragma omp critical
        {
            sum += priv_sum;
        }
    }
    return sum;
}
Try doing the multiplications in parallel, then sum them serially:
template <typename T>
std::complex<T> dot_prod(std::complex<T> *a, std::complex<T> *b, size_t dim)
{
    std::vector<std::complex<T> > prod(dim); // or boost::scoped_array + new[]

    #pragma omp parallel for
    for (size_t i = 0; i < dim; i++)
        // I believe you had these reversed
        prod[i] = a[i] * std::conj(b[i]);

    std::complex<T> sum(0);
    for (size_t i = 0; i < dim; i++)
        sum += prod[i];

    return sum;
}
This does require O(dim) working memory, of course.
Why not have the N threads compute N individual sums? Then at the end you only need to add up the N sums, which can be done serially, since N is quite small. I don't know how to accomplish that with OpenMP at the moment (I don't have any experience with it), but I'm quite sure this is easily achievable.
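For what it's worth, here is a rough, untested sketch of that idea with OpenMP: one slot per thread, combined serially at the end.

#include <complex>
#include <vector>
#include <omp.h>

std::complex<double> dot_prod(std::complex<double>* v1, std::complex<double>* v2, int dim)
{
    // one partial sum per thread that might be used
    std::vector<std::complex<double> > partial(omp_get_max_threads(), 0.);

    #pragma omp parallel
    {
        std::complex<double> local = 0.;
        #pragma omp for
        for (int i = 0; i < dim; i++)
            local += std::conj(v1[i]) * v2[i];
        partial[omp_get_thread_num()] = local;   // one write per thread
    }

    // Serial combine: the number of threads is small, so this is cheap.
    std::complex<double> sum = 0.;
    for (std::size_t t = 0; t < partial.size(); t++)
        sum += partial[t];
    return sum;
}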