I have this for loop that finds the minimum and maximum length. As you can see, I have two values to reduce here, but looking at OpenMP I can only find a reduction clause for a single value.
for (size_t i = 0; i < m_patterns.size(); ++i)
{// start for loop
if (m_patterns[i].size() < m_lmin)
m_lmin = m_patterns[i].size();
else if (m_patterns[i].size() > m_lmax)
m_lmax = m_patterns[i].size();
}// end for loop
Can I do the following?
#pragma omp parallel for reduction (min:m_lmin,max:m_lmax)
Or should I rewrite the for loop as two for loops, one for the minimum and one for the maximum?
Another question: can I use TBB containers such as concurrent_vector with OpenMP?
OpenMP supports min and max reduction operations starting from OpenMP 3.1, which is available from GCC 4.7. Refer to the OpenMP 3.1 specification for further details of min/max reduction.
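For example, a minimal sketch applied to your loop (this assumes m_lmin and m_lmax are size_t class members; members cannot appear directly in a reduction clause, so local copies are used, and min and max need separate reduction clauses):
size_t lmin = m_lmin, lmax = m_lmax;   // local copies for the reduction
#pragma omp parallel for reduction(min:lmin) reduction(max:lmax)
for (size_t i = 0; i < m_patterns.size(); ++i)
{
    if (m_patterns[i].size() < lmin)
        lmin = m_patterns[i].size();
    if (m_patterns[i].size() > lmax)
        lmax = m_patterns[i].size();
}
m_lmin = lmin;   // write the reduced results back to the members
m_lmax = lmax;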
You can roll your own concurrent vector, as well as min and max reductions, by filling private versions of the variables in parallel and then merging them in a critical section. This works in MSVC, which only supports OpenMP 2.5 (and therefore does not support min and max reductions). But irrespective of whether your version of OpenMP supports min and max reductions, this is a useful technique to learn.
This method is efficient as long as the number of items you loop over is much larger than the number of threads (or the time spent on the items is large compared to the merging).
#pragma omp parallel
{
    size_t m_lmin_private = m_lmin;   // private copies of the shared results
    size_t m_lmax_private = m_lmax;
    #pragma omp for nowait
    for (size_t i = 0; i < m_patterns.size(); ++i) {
        if (m_patterns[i].size() < m_lmin_private)
            m_lmin_private = m_patterns[i].size();
        else if (m_patterns[i].size() > m_lmax_private)
            m_lmax_private = m_patterns[i].size();
    }
    #pragma omp critical   // merge the private results
    {
        if (m_lmin_private < m_lmin)
            m_lmin = m_lmin_private;
        if (m_lmax_private > m_lmax)
            m_lmax = m_lmax_private;
    }
}
For concurrent vectors use the same method:
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<n; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
As far as OpenMP is concerned, the official specification is available at www.openmp.org. But in the end your compiler does all the work, so the answer to your question may be compiler dependent.
However, Microsoft's documentation at http://msdn.microsoft.com/de-de/library/2etkydkz(v=vs.80).aspx
suggests
#pragma omp parallel for reduction(min:m_lmin) reduction(max:m_lmax)
Related
I would like to parallelize a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallelize
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for the performance decrease may include: private copies of config? Or random access of ref_table? Or the expensive atomic operation? What are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config and random accesses to ref_table are not the problem. I think the workload is very small, and there are two potential issues which prevent efficient parallelization:
the atomic operation is too expensive;
the overheads are bigger than the workload (which simply means that it is not worth parallelizing with OpenMP).
I do not know which one is more significant in your case, so it is worth trying to get rid of the atomic operation. There are two cases:
a) If the result array is zero initialized, you have to use
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
where N is the size of the result array, and delete #pragma omp atomic. Note that array-section reduction requires OpenMP 4.5 or later. It is also worth removing #pragma omp simd for a loop of only 2-10 iterations. So your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallelize
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
result[index]++;
}
b) If the result array is not zero initialized, the solution is very similar, but use a temporary zero-initialized array in the loop and add it to the result array afterwards.
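For example, a minimal sketch of variant b), reusing the names from your code and assuming result has N elements (the array-section reduction needs a raw pointer or array, hence tmp_p):
std::vector<int> tmp(N, 0);   // zero-initialized temporary
int* tmp_p = tmp.data();
#pragma omp parallel for reduction(+:tmp_p[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) {
    for (int j = 0; j < (int)indices.size(); ++j)
        config[j] = ref_table[i][indices[j]];
    tmp_p[GetIndex(config)]++;
}
for (int k = 0; k < N; ++k)   // one serial pass to merge into the existing result
    result[k] += tmp_p[k];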
If the speed still does not increase, then your code is not worth parallelizing with OpenMP on your hardware.
I'm currently trying to use OpenMP for parallel computing.
I've written the following basic code.
However it returns the following warning:
warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored.
Changing the number of threads does not change the running time, since the OpenMP directives are ignored for some reason that is unclear to me.
Can someone help me out?
#include <stdio.h>
#include <omp.h>
#include <math.h>
int main(void)
{
double ts;
double something;
clock_t begin = clock();
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
clock_t end = clock();
ts = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time elpased is %f seconds", ts);
}
In order to get OpenMP support you need to explicitly tell your compiler.
g++, gcc and clang need the option -fopenmp
MSVC needs the option /openmp (see the Visual Studio documentation for more info)
Aside from the obvious need to compile with the -fopenmp flag, your code has some problems worth pointing out, namely:
To measure time, use omp_get_wtime() instead of clock(); clock() returns CPU clock ticks accumulated across all threads, not the elapsed wall-clock time.
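A minimal sketch of the timing change, replacing the clock() calls in your main (it also uses the corrected single parallel for shown further below and a plain integer bound instead of pow(10,7)):
double t_start = omp_get_wtime();   // wall-clock start
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 10000000; i++)
    something = sqrt(123456);
double ts = omp_get_wtime() - t_start;   // elapsed wall-clock seconds
printf("Time elapsed is %f seconds\n", ts);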
The other problem is:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
the iterations of the loop are not being assigned to threads as you intended. Because you have added the parallel clause again to #pragma omp for, and assuming that nested parallelism is disabled (which it is by default), each of the threads created in the outer parallel region will execute the code within that region "sequentially". Consequently, for a simplified example with n = 6 iterations (instead of pow(10,7)) and 4 threads, you would have the following block of code:
for (int i=0; i<n; i++) {
something=sqrt(123456);
}
being executed 6 x 4 = 24 times (i.e., the total number of loop iterations multiplied by the total number of threads). For a more in-depth explanation, check this SO thread about a similar issue.
To fix this adapt your code to the following:
#pragma omp parallel for num_threads(4)
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
I am dealing with huge point cloud data and am trying to use OpenMP,
but I found it is very hard for a beginner to optimize the code.
For example, when I want to compute the histogram of the point cloud (each point carries additional information beyond x, y, z), I write the code below:
#pragma omp parallel num_threads(N_THREAD) shared(hist,partHist)
{
int tId = omp_get_thread_num();
int index = tId * partCount;
#pragma omp for nowait
for(int i =0;i<partCount;++i)
{
if (index + i < size)
#pragma omp atomic
partHist[tId][(int)floor((array[index + i] - minValue) / stride)]++;
}
#pragma omp critical
{
for (int i = 0; i < binCount; ++i)
hist[i] += partHist[tId][i];
}
}
The code runs on Linux on an i7-9700K, is compiled with g++, and uses OpenMP 4.0.
I have two questions
The data set has at least about 10^8 points and I use 128 threads, but it is slower than the serial version. How can I optimize the code?
Are there general rules I can follow to optimize the code when similar problems occur?
I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix. In my code I call a vector-vector dot product and a matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelize the code using OpenMP (especially the above two subroutines).
I also have sequential code in between that I do not intend to parallelize.
My question is: how do I handle the threads created when a subroutine is called? Should I put a barrier at the end of every subroutine call?
Also, where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
For the scalar dot product, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
double prod = 0.0;
int chunk = 10;
int i;
//double* c = (double*) malloc(n*sizeof(double));
omp_set_num_threads(4);
// #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
#pragma omp parallel
{
double pprod = 0.0;
#pragma omp for
for(i=0;i<n;i++) {
pprod += vec1[i]*vec2[i];
}
//#pragma omp for reduction (+:prod)
#pragma omp critical
for(i=0;i<n;i++) {
prod += pprod;
}
}
return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results: time taken for the dot product is 0.000007 s for the sequential version and 0.002110 s for the parallel version.
I am doing a simple compile using the gcc -fopenmp command on Linux on my Intel i7 laptop.
I am currently using a matrix of size n = 5000.
I am getting a huge slowdown overall since the same dot product gets called many times until convergence is achieved (around 80k times).
Please suggest some improvements. Any help is much appreciated!
Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallel regions you are using. Every time you split up the work among your threads, there is OpenMP overhead. Try to avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
rm[i] = b[i] - rm[i]; // (1)
xm[i] = x0[i]; // (2) does not require (1)
pm[i] = rm[i]; // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how I show that no barriers are actually necessary between your loops anyway.
If the majority of your time had been spent in this computation stage, you will surely be seeing considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
** EDIT **
And as per your edit, I am seeing a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 sec to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal. Indeed, I suspect your implementation of the matrix-vector product is not either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
double prod = 0.0;
int i;
// omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
#pragma omp parallel for private(i) reduction(+:prod)
for (i = 0; i < n; ++i) {
prod += vec1[i]*vec2[i];
}
return prod;
}
(4) There are entire libraries (LAPACK, BLAS, etc.) that provide highly optimized matrix-vector and vector-vector operations; any serious linear algebra code should be built on top of them. Therefore, I'd suggest looking at using one of those libraries for your two operations before reinventing the wheel and implementing your own.
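For example, a minimal sketch using the CBLAS interface for the dot product (this assumes a CBLAS implementation such as OpenBLAS or Netlib BLAS is installed and linked, e.g. with -lopenblas or -lcblas):
#include <cblas.h>

double scalarProd(double* vec1, double* vec2, int n) {
    // ddot computes the sum of vec1[i]*vec2[i]; the two 1s are the strides of the vectors
    return cblas_ddot(n, vec1, 1, vec2, 1);
}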
I'm sitting here trying to make orphaning work and to reduce the overhead by reducing the number of calls to #pragma omp parallel.
What I'm trying is something like:
#pragma omp parallel default(none) shared(mat,mat2,f,max_iter,tol,N,conv) private(diff,k)
{
#pragma omp master // I'm not against using #pragma omp single or whatever will work
{
while(diff>tol) {
do_work(mat,mat2,f,N);
swap(mat,mat2);
if( !(k%100) ) // Only test stop criteria every 100 iteration
diff = conv[k] = do_more_work(mat,mat2);
k++;
} // end while
} // end master
} // end parallel
do_work depends on the previous iteration, so the while loop has to run sequentially.
But I would like to be able to run do_work in parallel, so it would look something like:
void do_work(double *mat, double *mat2, double *f, int N)
{
int i,j;
double scale = 1/4.0;
#pragma omp for schedule(runtime) // Just so I can test different settings without having to recompile
for(i=0;i<N;i++)
for(j=0;j<N;j++)
mat[i*N+j] = scale*(mat2[(i+1)*N+j]+mat2[(i-1)*N+j] + ... + f[i*N+j]);
}
I hope this can be accomplished in some way, I'm just not sure how. So any help I can get is greatly appreciated (also if you're telling me this isn't possible). Btw I'm working with OpenMP 3.0, the gcc compiler and the Sun Studio compiler.
The outer parallel region in your original code contains only a serial piece (#pragma omp master), which makes no sense and effectively results in purely serial execution (no parallelism). As do_work() depends on the previous iteration, but you want to run it in parallel, you must use synchronisation. The OpenMP tool for that is an (explicit or implicit) synchronisation barrier.
For example (code similar to yours):
#pragma omp parallel
for(int j=0; diff>tol; ++j) // must be the same condition for each thread!
#pragma omp for // note: implicit synchronisation after for loop
for(int i=0; i<N; ++i)
work(j,i);
Note that the implicit synchronisation ensures that no thread enters the next j if any thread is still working on the current j.
The alternative
for(int j=0; diff>tol; ++j)
#pragma omp parallel for
for(int i=0; i<N; ++i)
work(j,i);
should be less efficient, as it creates a new team of threads at each iteration, instead of merely synchronising.
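If you want to keep your original structure with the orphaned #pragma omp for inside do_work, a sketch along the following lines should also work; this assumes diff, k, mat and mat2 are shared (not private) and diff is initialized above tol before the region, so that every thread evaluates the same loop condition:
#pragma omp parallel default(none) shared(mat,mat2,f,N,tol,conv,diff,k)
{
    while (diff > tol) {              // all threads see the same shared diff
        do_work(mat, mat2, f, N);     // contains the orphaned omp for; its implicit barrier ends the sweep
        #pragma omp single
        {                             // one thread updates the shared state
            swap(mat, mat2);
            if (!(k % 100))
                diff = conv[k] = do_more_work(mat, mat2);
            k++;
        }                             // implicit barrier of single: everyone sees the new diff before re-testing
    }
}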