C++ OpenMP slower than serial with default thread count

I tried using OpenMP to parallelize some for-loops in my program but failed to get any significant speed improvement (an actual degradation is observed). My target machine will have 4-6 cores and I currently rely on the OpenMP runtime to pick the thread count for me, so I haven't tried any thread-count combinations yet.
Target/Development platform: Windows 64-bit
using MinGW64 4.7.2 (rubenvb build)
Sample output with OpenMP
Thread count: 4
Dynamic :0
OMP_GET_NUM_PROCS: 4
OMP_IN_PARALLEL: 1
5.612 // <- returned by omp_get_wtime()
5.627 (sec) // <- returned by clock()
Wall time elapsed: 5.62703
Sample output without OpenMP
2.415 (sec) // <- returned by clock()
Wall time elapsed: 2.415
How I measure the time
struct timeval start, end;
gettimeofday(&start, NULL);
#ifdef _OPENMP
double t1 = (double) clock();
double wt = omp_get_wtime();
sim->resetEnvironment(run);
tout << omp_get_wtime() - wt << std::endl;
timeEnd(tout, t1);
#else
double t1 = (double) clock();
sim->resetEnvironment(run);
timeEnd(tout, t1);
#endif
gettimeofday(&end, NULL);
tout << "Wall time elapsed: "
<< ((end.tv_sec - start.tv_sec) * 1000000u + (end.tv_usec - start.tv_usec)) / 1.e6
<< std::endl;
The code
void Simulator::resetEnvironment(int run)
{
#pragma omp parallel
{
// (a)
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
for (int level = 0; level < level_count; level++) // (b) level = 3
{
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic)
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
} // end #parallel
} // end: Simulator::resetEnvironment()
Randomness
Inside the reset() calls I use an RNG to seed some agents for subsequent tasks.
Below is my RNG implementation; I saw a suggestion to use one RNG per thread for thread safety.
class RNG {
public:
typedef std::mt19937 Engine;
RNG()
: real_uni_dist_(0.0, 1.0)
#ifdef _OPENMP
, engines()
#endif
{
#ifdef _OPENMP
int threads = std::max(1, omp_get_max_threads());
for (int seed = 0; seed < threads; ++seed)
engines.push_back(Engine(seed));
#else
engine_.seed(time(NULL));
#endif
} // end_ctor(RNG)
/** @return next value of the uniform distribution */
double operator()()
{
#ifdef _OPENMP
return real_uni_dist_(engines[omp_get_thread_num()]);
#else
return real_uni_dist_(engine_);
#endif
}
private:
std::uniform_real_distribution<double> real_uni_dist_;
#ifdef _OPENMP
std::vector<Engine> engines;
#else
std::mt19937 engine_;
#endif
}; // end_class(RNG)
Questions:
At (a), is it good to not use the 'parallel for' shortcut, to avoid the overhead of creating teams for every loop?
Which part of my implementation can be the cause of the performance degradation?
Why are the times reported by clock() and omp_get_wtime() so similar? I expected clock() to be somewhat longer than omp_get_wtime().
[Edit]
At (b), my intention in putting the OpenMP directive on the inner loop is that the outer loop has so few iterations (only 3) that I thought I could skip it and parallelize the inner loop over vector_4[level] directly. Is this thought inappropriate? Or will it instruct OpenMP to repeat the outer loop 4 times and hence run the inner loops 12 times instead of 3 (say the current thread count is 4)?
Thanks

If the measured wall-clock time (as reported by omp_get_wtime()) is close to the total CPU time (as reported by clock()), this could mean several different things:
the code is running single-threaded, but then the total CPU time will be lower than the wall-clock time;
a very high synchronisation and cache coherency overhead is present and it is huge in comparison to the actual work being done by the threads.
Your case is the second one, and the reason is that you use schedule(dynamic). Dynamic scheduling should only be used in cases where each iteration can take a varying amount of time. If such iterations are statically distributed among the threads, work imbalance can occur. schedule(dynamic) takes care of this by giving each work item (in your case each single iteration of the loop) to the next thread that finishes its work and becomes idle. There is a certain overhead in synchronising the threads and bookkeeping the distribution of the work items, and therefore it should only be used when the amount of work per thread is huge in comparison to that overhead. OpenMP allows you to group more iterations into iteration blocks; this is specified as, e.g., schedule(dynamic,100), which makes each thread execute a block (or chunk) of 100 consecutive iterations before asking for a new one. The default block size for dynamic scheduling is 1, i.e. each vector element is processed by a separate thread. I have no idea how much processing is done in reset() and what kind of elements are in vector_*, but given the serial run time it is not much at all.
Another source of slowdown is the loss of data locality when you use dynamic scheduling. Depending on the type of the elements of those vectors, having neighbouring elements processed by different threads leads to false sharing. That means that, e.g., vector_1[i] lies in the same cache line as some other elements of vector_1, e.g. vector_1[i-1] and vector_1[i+1]. When thread 1 modifies vector_1[i], the cache line is reloaded in all other cores that work on the neighbouring elements. If vector_1[] is only written to, the compiler can be smart enough to generate non-temporal stores (which bypass the cache), but that only works with vector stores, and having each core do a single iteration at a time means no vectorisation at all. Data locality can be improved by either switching to static scheduling or, if reset() really takes a varying amount of time, by setting a reasonable chunk size in the schedule(dynamic) clause. The best chunk size is usually processor-dependent and one often has to tweak it in order to get the best performance.
So I would strongly suggest that you first switch to static scheduling by replacing all schedule(dynamic) with schedule(static), and then try to optimise further. You don't have to specify the chunk size in the static case, as the default is simply the total number of iterations divided by the number of threads, i.e. each thread gets one contiguous block of iterations.
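For illustration, here is the suggested change sketched on the largest loop in resetEnvironment() (the chunk size in the dynamic alternative is only an illustrative starting point and has to be tuned on the target machine):
// inside the existing #pragma omp parallel region:
#pragma omp for schedule(static)              // was: schedule(dynamic)
for (size_t i = 0; i < vector_2.size(); i++)  // size ~ 2.3M
    reset(vector_2[i]);

// only if reset() really takes a varying amount of time per element,
// a chunked dynamic schedule is the alternative, e.g.:
// #pragma omp for schedule(dynamic, 10000)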

To answer your questions:
1) In (a), the use of a single 'parallel' region is correct.
2) Congrats, your implementation of the lock-free PRNG looks fine.
3) The problem can come from the OpenMP pragma you use on the inner loop. Parallelize at the top level and avoid fine-grained, inner-loop parallelism.
4) In the code below (meant as the body of your existing '#pragma omp parallel' region), I used 'nowait' on each 'omp for', moved the omp directive out of the inner loop in the vector_4 processing, and put a barrier at the end to join all the threads and wait for all the work spawned before it.
// pseudo code
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
#pragma omp for schedule(dynamic) nowait
for (int level = 0; level < level_count; level++)
{
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic) nowait
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
#pragma omp barrier

A single-threaded program will run faster than a multi-threaded one if the useful processing time is less than the overhead incurred by the threads.
It is a good idea to determine what that overhead is by implementing a null function and then deciding whether threading is the better solution.
From a performance point of view, threads are only useful if the useful processing time is significantly higher than the overhead incurred by the threads and there are real CPUs available to run them.
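A minimal sketch of such a "null function" measurement (the iteration counts are arbitrary; compile with -fopenmp as in the question's setup):
#include <cstdio>
#include <omp.h>

// Time an empty worksharing loop to estimate the per-region fork/join and
// scheduling overhead on the target machine.
int main()
{
    const int reps = 1000;
    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; ++r)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < 1000; ++i)
        {
            // deliberately empty: we only measure the cost of entering and
            // leaving the parallel region and distributing the iterations
        }
    }
    double per_region = (omp_get_wtime() - t0) / reps;
    std::printf("approx. overhead per parallel region: %g s\n", per_region);
    return 0;
}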

Related

Parallelizing for-loop inside another loop efficiently with OpenMP

I have a problem writing the parallel instructions for code that works like this:
// every iteration depends on the previous one
for (int iter = 0; iter < numIters; ++iter)
{
#pragma omp parallel for num_threads(numThreads)
for (int p = 0; p < numParticles; ++p)
{
p_velocity_calculation(...);
}
// implicit sync barrier
#pragma omp parallel for num_threads(numThreads)
for (int p = 0; p < numParticles; ++p)
{
p_position_calculation(...);
}
}
The program is an n-body simulation where first I need to calculate the velocities and then the positions of a set of particles, hence the separation into two for-loops.
The code runs as expected but, from what I have read, the thread teams created by the #pragma omp directives are created and destroyed in every iteration of the outer for-loop, and I don't want to waste resources on that.
So my question is: how can I reuse those thread pools instead of creating/destroying the threads in every iteration?
First of all: the thread pools are not destroyed, only suspended.
Next: Have you timed this and found that creating the threads is a limiting factor in your application? If not, don't worry.
Or to put it constructively: I have timed it, and unless you have an extremely short omp parallel for and call it tens of thousands of times, the overhead is negligible.
But if you are really worried, put the omp parallel outside the time loop and use an omp for around the particle loop. You will do some redundant work between the for loops, which you can either accept or wrap in an omp master if it affects global variables.
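A sketch of that restructuring, keeping the placeholder calls from the question:
#pragma omp parallel num_threads(numThreads)   // threads created once
for (int iter = 0; iter < numIters; ++iter)    // every thread runs the time loop
{
    #pragma omp for
    for (int p = 0; p < numParticles; ++p)
        p_velocity_calculation(...);
    // implicit barrier at the end of the omp for

    #pragma omp for
    for (int p = 0; p < numParticles; ++p)
        p_position_calculation(...);
    // implicit barrier again before the next time step

    // per-iteration work that touches global state would go in an
    // 'omp single' or 'omp master' block here
}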
But really: I wouldn't worry.

Aspects that affect the efficiency of OpenMP parallelism

I would like to parallelize a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
#pragma omp simd
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
#pragma omp atomic
result[index]++;
}
Then I found I cannot get any improvement in efficiency when I use 2, 4, or 8 threads: the execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent, so I want multiple threads to execute those iterations in parallel.
I guess the reasons for the performance decrease may include: private copies of config? Random access of ref_table? The expensive atomic operation? So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config or random access of ref_table are not the problem. I think the workload is very small; there are 2 potential issues which prevent efficient parallelization:
the atomic operation is too expensive;
the overheads are bigger than the workload (which simply means that it is not worth parallelizing with OpenMP).
I do not know which one is more significant in your case, so it is worth trying to get rid of the atomic operation. There are 2 cases:
a) If the result array is zero-initialized, you can use #pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config), where N is the size of the result array, and delete the #pragma omp atomic. Note that array-section reductions require OpenMP 4.5 or later. It is also worth removing #pragma omp simd for a loop of 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
config[j] = ref_table[i][indices[j]];
}
int index = GetIndex(config); // do simple computations on the picked values to get the index
result[index]++;
}
b) If the result array is not zero-initialized, the solution is very similar: use a temporary zero-initialized array in the loop and add it to the result array afterwards.
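A sketch of case b) (N is the size of the result array, as above; the element type of the temporary is only illustrative, since the type of result is not shown in the question):
#pragma omp parallel firstprivate(config)
{
    std::vector<int> local(N, 0);          // zero-initialized per-thread copy

    #pragma omp for schedule(static, 5000)
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < (int)indices.size(); ++j)
            config[j] = ref_table[i][indices[j]];
        local[GetIndex(config)]++;         // no atomic needed on the private copy
    }

    #pragma omp critical                   // merge once per thread, not per iteration
    for (int k = 0; k < N; ++k)
        result[k] += local[k];
}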
If the speed still does not increase, then your code is not worth parallelizing with OpenMP on your hardware.

optimising OpenCV for loops

Trying to optimize OpenCV code with OpenMP; the code is as follows. The actual execution time with OpenMP is longer. 2 cores, 4 threads. Image size: [3024 x 4032].
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
std::clock_t start;
double duration;
start = std::clock();
////none, without openMP 0.129677 sec
//#pragma omp parallel for // 0.213286 sec
#pragma omp parallel for collapse(2)// 0.206435 sec
for (int i = 0; i < maskedImage.rows; ++i)
for (int j = 0; j < maskedImage.cols; ++j){
pixelsD[i][j] = maskedImage.at<cv::Vec3b>(i, j);
// printf("%d %d %d\n", i, j, omp_get_thread_num());
}
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
My guess: the reason is context switching, which takes longer. What may the other reasons be?
How could I optimize it to utilize the available resources? Any other ways?
Input appreciated.
P.S.:
The reason for translating from cv::Mat to std::vector is to use erase, push_back and insert to manipulate the image's content.
Thread creation can be quite costly, as can context switches: strangely, with GCC 9.3 it takes 10-20 ms just to start the parallel section on my machine with this sample code. Note that some OpenMP runtimes, like Clang's, create the threads once for all OpenMP sections. Moreover, setting OMP_PROC_BIND to TRUE can help OpenMP threads not move between cores. Note that the timings between GCC and Clang are quite different on this code.
std::clock does not measure what you probably want it to: it does not account for process inactivity, and it sums the ticks of every thread of the process. Please use C++ std::chrono::steady_clock or omp_get_wtime to correctly measure durations.
Please do not use std::vector<std::vector<cv::Vec3b>>, as it uses a very inefficient memory layout. If you want to do complex matrix operations, you can use Eigen, for example, or write your own type based on contiguous flattened arrays. Splitting each color channel into a separate array may also help the compiler vectorize operations, improving performance.
On Clang, the pixelsD[i][j] access produces very slow code with OpenMP as the compiler fails to optimize it. Actually, using collapse is not useful here since the number of threads should be much smaller than the number of rows (it could even decrease performance).
Here is a new version where the time is more correctly measured:
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
#pragma omp parallel
{
double start;
// Wait for all threads to be created and ready
#pragma omp barrier
#pragma omp master
start = omp_get_wtime();
#pragma omp for
for (int i = 0; i < maskedImage.rows; ++i)
{
std::vector<cv::Vec3b>& row = pixelsD[i];
for (int j = 0; j < maskedImage.cols; ++j)
{
row[j] = maskedImage.at<cv::Vec3b>(i, j);
}
} // Implicit barrier here
#pragma omp master
{
const double duration = omp_get_wtime() - start;
cout << duration << endl;
}
}
// Side effect to force the compiler to not optimize the previous loop to nothing
cout << "result: " << (int)pixelsD[0][0][0] << endl;
On my 6-core machine and with an image of size 3840x2160, I get the following results:
Clang:
- initial sequential clock time: 8.5 ms
- initial parallel clock time: 60 ~ 63 ms
- new sequential time: 8.5 ms
- new parallel time: 2.4 ms
GCC:
- initial sequential clock time: 9.7 ms
- initial parallel clock time: 3 ~ 93 ms
- new sequential time: 8.5 ms
- new parallel time: 2.3 ms
Theoretical optimal time: 1.2 ms
Note that this operation can be made even faster using direct access to the data of maskedImage. Note also that memory accesses tend to scale poorly with the number of cores. The results are not bad here because the compilers generate quite inefficient code (although it is difficult to do much better given the memory layout).
Another possible explanation is this link.
It is suggested to avoid using i and j indices inside the loop code.
If I remember correctly, the data of an OpenCV Mat occupies a contiguous chunk of memory, at least within each row, and for the entire data in some cases.
As this is also the case for vectors, you could copy the image line by line (or even the entire image at once) instead of pixel by pixel.
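A sketch of that row-by-row copy (it assumes maskedImage stores cv::Vec3b pixels, i.e. a CV_8UC3 Mat, so cv::Mat::ptr<cv::Vec3b>(i) points at the first pixel of row i):
#include <algorithm>   // std::copy

std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows,
        std::vector<cv::Vec3b>(maskedImage.cols));

#pragma omp parallel for
for (int i = 0; i < maskedImage.rows; ++i)
{
    const cv::Vec3b* src = maskedImage.ptr<cv::Vec3b>(i);   // contiguous row
    std::copy(src, src + maskedImage.cols, pixelsD[i].begin());
}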
I think the threads switch too frequently (once per row), which requires more processor time for management. It should work more effectively if you assign larger pieces of work to the threads; an image per thread, for instance.

openMP excessive synchronization

I am trying to add OpenMP parallelization to quite a big project, and I found that OpenMP does too much synchronization outside the parallel blocks.
This synchronization is done for all of the variables, even those not used in the parallel block, and it is done continuously, not only before entering the block.
I made an example demonstrating this:
#include <cmath>
int main()
{
double dummy1 = 1.234;
int const size = 1000000;
int const size1 = 2500;
int const size2 = 500;
for(unsigned int i=0; i<size; ++i){
//for (unsigned int j=0; j<size1; j++){
// dummy1 = pow(dummy1/2 + 1, 1.5);
//}
#pragma omp parallel for
for (unsigned int j=0; j<size2; j++){
double dummy2 = 2.345;
dummy2 = pow(dummy2/2 + 1, 1.5);
}
}
}
If I run this code (with the first inner for loop commented out), the runtimes are 6.75 s with parallelization and 30.6 s without. Great.
But if I uncomment the for loop and run it again, the excessive synchronization kicks in and I get 67.9 s with parallelization and 73 s without. If I increase size1, I even get slower results with parallelization than without it.
Is there a way to disable this synchronization and force it only before the second for loop? Or any other way to improve the speed?
Note that neither the outer loop nor the first inner loop is parallelizable in the real example. The outer one is in fact an ODE solver, and the first inner one updates lots of internal values.
I am using gcc (SUSE Linux) 4.8.5
Thanks for Your answers.
In the end, the solution to my problem was setting the number of threads equal to the number of processor cores. It seems hyper-threading was causing the problems. So, using (my processor has 4 physical cores)
#pragma omp parallel for num_threads(4)
I get times of 8.7 s without the first for loop and 51.9 s with it. There is still about 1.2 s of overhead, but that is acceptable. Using the default (8 threads)
#pragma omp parallel for
the times are 6.65 s and 68 s. Here the overhead is about 19 s.
So hyper-threading helps if no other code is present, but when it is, it might not always be a good idea to use it.
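A minimal sketch of fixing this once at program start instead of repeating num_threads(4) on every pragma (4 is the physical core count from the answer above; setting the OMP_NUM_THREADS environment variable to 4 before running has the same effect):
#include <omp.h>

int main()
{
    omp_set_num_threads(4);   // limit to physical cores, avoiding hyper-threading

    // ... ODE solver with its #pragma omp parallel for loops ...
}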

Multithreaded Program for Sparse Matrices

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix system. In my code I call a vector-vector dot product and a matrix-vector product as subroutines many times to arrive at the final solution. I am trying to parallelise the code using OpenMP (especially the above two subroutines).
I also have sequential code in between which I do not intend to parallelise.
My question is: how do I handle the threads created when a subroutine is called? Should I put a barrier at the end of every subroutine call?
Also, where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
For the scalar dot product, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
double prod = 0.0;
int chunk = 10;
int i;
//double* c = (double*) malloc(n*sizeof(double));
omp_set_num_threads(4);
// #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
#pragma omp parallel
{
double pprod = 0.0;
#pragma omp for
for(i=0;i<n;i++) {
pprod += vec1[i]*vec2[i];
}
//#pragma omp for reduction (+:prod)
#pragma omp critical
prod += pprod; // each thread adds its partial sum exactly once
}
return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results: time taken by the rm_rm dot product: sequential version 0.000007 s, parallel version 0.002110 s.
I am doing a simple compile using the gcc -fopenmp command on Linux on my Intel i7 laptop.
I am currently using a matrix of size n = 5000.
I am getting a huge overall slowdown since the same dot product gets called many times until convergence is achieved (around 80k times).
Please suggest some improvements. Any help is much appreciated!
Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallel regions you are using. Every time you split up the work among your threads, there is OpenMP overhead. Try to avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
rm[i] = b[i] - rm[i]; // (1)
xm[i] = x0[i]; // (2) does not require (1)
pm[i] = rm[i]; // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how the comments show that no explicit barriers are actually necessary between your loops anyway.
If the majority of your time were spent in this computation stage, you would surely see a considerable improvement. However, I'm reasonably confident that the majority of your time is spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll save is probably minimal.
** EDIT **
As per your edit, I see a few problems. (1) Always compile with -O3 when you are testing the performance of your algorithm. (2) You won't be able to improve the runtime of something that takes 0.000007 sec to complete; that's nearly instantaneous. This goes back to what I said previously: try to parallelize at a higher level. The CG method is inherently a sequential algorithm, but there are certainly research papers detailing parallel CG. (3) Your implementation of the scalar product is not optimal. Indeed, I suspect your implementation of the matrix-vector product is not either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
double prod = 0.0;
int i;
// omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
#pragma omp parallel for private(i) reduction(+:prod)
for (i = 0; i < n; ++i) {
prod += vec1[i]*vec2[i];
}
return prod;
}
(4) There are entire libraries (LAPACK, BLAS, etc.) with highly optimized matrix-vector and vector-vector operations; any linear algebra library is built upon them. Therefore, I'd suggest using one of those libraries for your two operations before reinventing the wheel and implementing your own.
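For example, with a CBLAS implementation (OpenBLAS, ATLAS, Intel MKL, ...) the whole scalarProd() collapses to a single call. A sketch, assuming plain double arrays as in the question (link against the chosen BLAS, e.g. -lopenblas):
#include <cblas.h>

// dot product of two double vectors of length n; the 1s are the strides
double scalarProd(double* vec1, double* vec2, int n) {
    return cblas_ddot(n, vec1, 1, vec2, 1);
}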