slow serial "for" with openmp enabled

slow serial "for" with openmp enabled - c++

I try to use openmp and find strange results.
Parallel "for" run faster with openmp as expected. But serial "for" run much faster when openmp disabled (without /openmp option. vs 2013).
Test code
const int n = 5000;
const int m = 2000000;
vector <double> a(n, 0);
double start = omp_get_wtime();
#pragma omp parallel for shared(a)
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "omp Time: " << (omp_get_wtime() - start) << endl;
start = omp_get_wtime();
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "serial Time: " << (omp_get_wtime() - start) << endl;
Output without /openmp option
0
omp Time: 6.4389
serial Time: 6.37592
Output with /openmp option
0
1
2
3
omp Time: 1.84636
serial Time: 16.353
Is it correct results? Or I'm doing something wrong?

I believe part of the answer lies hidden in the architecture of the computer you run on. I tried running the same code another machine (GCC 4.8 on GNU+Linux, quad Core2 CPU), and over many runs, found a slightly odd thing: while the time for both loops varied, and OpenMP with many threads always ran faster, the second loop never ran significantly faster than the first, even without OpenMP.
The next step was to try to eliminate a dependency between the loops, allocating a second vector for the second loop. It still ran no faster than the first. So I tried reversing them, running the OpenMP loop after the serial one; and while it still ran fast when multithreaded, it would now see delays when the first loop didn't. It's looking more like an operating system behaviour at this point; long-lived threads simply seem more likely to get interrupted. I had taken some measures to reduce interruptions (niceness -15, specific cpu set) but this is not a system dedicated to benchmarking.
None of my results were anywhere near as extreme as yours, however. My first guess as to what caused your large difference was that you reused the same array and ran the parallel loop first. This would distribute the array into caches on all cores, causing a slight dilemma of whether to migrade the thread to the data or the other way around; and OpenMP may have chosen any distribution, including iteration i to thread i%threads (as with schedule(static,1)), which probably would hurt multithreaded runtime, or one cacheline each which would hurt later single threaded reading if it fit in per-core caches. However, all of the array accesses are writes, so the processor shouldn't need to wait for them in the first place.
In summary, your results are certainly platform dependent and unexpected. I would suggest rerunning the test with swapped order, the two loops operating on different arrays, and placed in different compilation units, and of course to verify the written results. It is possible you've found a flaw in your compiler.

Related

optimising OpenCV for loops

Trying to optimize OpenCV code with openMP, code as follows. The actual execution time with openMP is longer. 2 cores, 4 threads. Image size: [3024 x 4032]
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
std::clock_t start;
double duration;
start = std::clock();
////none, without openMP 0.129677 sec
//#pragma omp parallel for // 0.213286 sec
#pragma omp parallel for collapse(2)// 0.206435 sec
for (int i = 0; i < maskedImage.rows; ++i)
for (int j = 0; j < maskedImage.cols; ++j){
pixelsD[i][j] = maskedImage.at<cv::Vec3b>(i, j);
// printf("%d %d %d\n", i, j, omp_get_thread_num());
}
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
My guess: the reason is the context switch which takes longer. What may be other reasons?
How could I optimize it utilizing available resources? Any other ways?
Input appreciated.
P.S.:
The reason for the translate between cv::Mat to std::vector is to utilise erase, push_back and insert for image's content manipulation.

Thread creation can be quite costly as well as context switches: strangely with GCC 9.3, it takes 10-20 ms to just start the parallel section on my machine on this sample code. Note that some OpenMP runtimes like Clang can create thread once for all OpenMP section. Moreover, setting OMP_PROC_BIND to TRUE can help OpenMP threads to not move between cores. Note that timings between GCC and Clang are quite different on this code.
std::clock do not measure what you probably want to: it does not consider process inactivity and sum the tick of each thread of the process. Please use C++ std::chrono::steady_clock or omp_get_wtime to correctly measure durations.
Please do not use std::vector<std::vector<cv::Vec3b>> as it use a very inefficient memory layout pattern. If you want to make complex matrix operation, you can use Eigen for example or write your own type based on contiguous flatten arrays. Splitting each color channel in a separate array may also help compiler to vectorize operations improving performance.
On Clang, the pixelsD[i][j] access produce a very slow code with OpenMP as the compiler fail to optimize it. Actually, using a collapse is not useful here as the number of threads should be much smaller than the number of rows (it could even decrease performance).
Here is a new version where the time is more correctly measured:
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
#pragma omp parallel
{
double start;
// Wait for all threads to be created and ready
#pragma omp barrier
#pragma omp master
start = omp_get_wtime();
#pragma omp for
for (int i = 0; i < maskedImage.rows; ++i)
{
std::vector<cv::Vec3b>& row = pixelsD[i];
for (int j = 0; j < maskedImage.cols; ++j)
{
row[j] = maskedImage.at<cv::Vec3b>(i, j);
}
} // Implicit barrier here
#pragma omp master
{
const double duration = omp_get_wtime() - start;
cout << duration << endl;
}
}
// Side effect to force the compiler to not optimize the previous loop to nothing
cout << "result: " << (int)pixelsD[0][0][0] << endl;
On my 6-core machine and with an image of size 3840x2160, I get the following results:
Clang:
- initial sequential clock time: 8.5 ms
- initial parallel clock time: 60 ~ 63 ms
- new sequential time: 8.5 ms
- new parallel time: 2.4 ms
GCC:
- initial sequential clock time: 9.7 ms
- initial parallel clock time: 3 ~ 93 ms
- new sequential time: 8.5 ms
- new parallel time: 2.3 ms
Theoretical optimal time: 1.2 ms
Note that this operation can be made even faster using direct access to data of maskedImage. Note also that memory access tend to barely scale. Results are not bad here because compilers generate a quite inefficient code (although it is difficult regarding the memory layout).

Another possible explanation is this link.
It is suggested to avoid using i and j indices inside the loop code.
If I remember correctly, the data part of an OpenCV Mat uses contiguous part of the memory, at least for rows, and for the entire data in some cases.
As this is also the case for vectors, you could copy the image line by line (or the entire image) instead of pixels by pixels.

I think threads switching too frequently (once per row), and it requires more processor time for management. It should work more effective, if you will assign larger pieces of woek for threads. An image per thread for instance.

Throughput of trivially concurrent code does not increase with the number of threads

I am trying to use OpenMP to benchmark the speed of data structure that I implemented. However, I seem to make a fundamental mistake: the throughput decreases instead of increasing with the number of threads no matter what operation I try to benchmark.
Below you can see the code that tries to benchmark the speed of a for-loop, as such I would expect it to scale (somewhat) linearly with the number of threads, it doesn't (compiled on a dualcore laptop with and without -O3 flag on g++ with c++11).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>
thread_local const int OPS = 10000;
thread_local const int TIMES = 200;
double get_tp(int THREADS)
{
double threadtime[THREADS] = {0};
//Repeat the test many times
for(int iteration = 0; iteration < TIMES; iteration++)
{
#pragma omp parallel num_threads(THREADS)
{
double start, stop;
int loc_ops = OPS/float(THREADS);
int t = omp_get_thread_num();
//Force all threads to start at the same time
#pragma omp barrier
start = omp_get_wtime();
//Do a certain kind of operations loc_ops times
for(int i = 0; i < loc_ops; i++)
{
//Here I would put the operations to benchmark
//in this case a boring for loop
int x = 0;
for(int j = 0; j < 1000; j++)
x++;
}
stop = omp_get_wtime();
threadtime[t] += stop-start;
}
}
double total_time = 0;
std::cout << "\nThread times: ";
for(int i = 0; i < THREADS; i++)
{
total_time += threadtime[i];
std::cout << threadtime[i] << ", ";
}
std::cout << "\nTotal time: " << total_time << "\n";
double mopss = float(OPS)*TIMES/total_time;
return mopss;
}
int main()
{
std::cout << "\n1 " << get_tp(1) << "ops/s\n";
std::cout << "\n2 " << get_tp(2) << "ops/s\n";
std::cout << "\n4 " << get_tp(4) << "ops/s\n";
std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
Outputs with -O3 on a dualcore, so we don't expect the throughput to increase after 2 threads, but it does not even increase when going from 1 to 2 threads it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
To make sure that the compiler does not remove the loop, I also tried outputting "x" after measuring the time and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.

I'm not sure why you are defining performance as the total number of operations per total CPU time and then get surprised by the decreasing function of the number of threads. This will almost always and universally be the case except for when cache effects kick in. The true performance metric is the number of operations per wall-clock time.
It is easy to show with simple mathematical reasoning. Given a total work W and processing capability of each core P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the overhead per thread induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n (H / P) = T_1 + n (H / P). The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise. For example, launching the OpenMP threads takes time. You cannot get rid of the overhead, you can only amortise it over the lifetime of the threads by making sure that each one get a lot to work on. Therefore, T_n > T_1 and with fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception of this rule is the case when the data for work of size W doesn't fit in the lower-level caches but that for work of size W / n does. This results in massive speed-up that exceeds the number of cores, known as superlinear speed-up. You are measuring inside the thread function so you ignore the value of H and T_n should more or less be equal to T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely last-level cache (if any), memory bandwidth, and thermal envelope.
The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but becomes the bottleneck when the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is the sparse matrix-vector multiplication (spMVM) -- a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth, that one can completely saturate the memory bus with as low as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will overclock or clock down the cores depending on how many of them are active. Therefore, while n clocked down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.
With all this in mind, there is one last (but not least) thing to consider -- timer resolution and measurement noise. Your run times are in couples of microseconds. Unless your code is running on some specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes and no interrupt handing), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.

The loop is almost certainly still getting optimized, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>
int main()
{
int x = 0;
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
++x;
}
}
std::cout << x << '\n';
return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>
int main()
{
std::mt19937 r;
std::uniform_real_distribution<double> dist{0, 1};
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
dist(r);
}
}
return 0;
}
Both loops and the PRNG invocation are left intact here.

How to parallelise a 2D time evolution

I have a C++ code that performs a time evolution of four variables that live on a 2D spatial grid. To save some time, I tried to parallelise my code with OpenMP but I just cannot get it to work: No matter how many cores I use, the runtime stays basically the same or increases. (My code does use 24 cores or however many I specify, so the compilation is not a problem.)
I have the feeling that the runtime for one individual time-step is too short and the overhead of producing threads kills the potential speed-up.
The layout of my code is:
for (int t = 0; t < max_time_steps; t++) {
// do some book-keeping
...
// perform time step
// (1) calculate righthand-side of ODE:
for (int i = 0; i < nr; i++) {
for (int j = 0; j < ntheta; j++) {
rhs[0][i][j] = A0[i][j] + B0[i][j] + ...;
rhs[1][i][j] = A1[i][j] + B1[i][j] + ...;
rhs[2][i][j] = A2[i][j] + B2[i][j] + ...;
rhs[3][i][j] = A3[i][j] + B3[i][j] + ...;
}
}
// (2) perform Euler step (or Runge-Kutta, ...)
for (int d = 0; d < 4; d++) {
for (int i = 0; i < nr; i++) {
for (int j = 0; j < ntheta; j++) {
next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
}
}
}
}
I thought this code should be fairly easy to parallelise... I put "#pragma omp parellel for" in front of the (1) and (2) loops, and I also specified the number of cores (e.g. 4 cores for loop (2) since there are four variables) but there is simply no speed-up whatsoever.
I have found that OpenMP is fairly smart about when to create/destroy the threads. I.e. it realises that threads are required soon again and then they're only put asleep to save overhead time.
I think one "problem" is that my time step is coded in a subroutine (I'm using RK4 instead of Euler) and the computation of the righthand-side is again in another subroutine that is called by the time_step() function. So, I believe that due to this, OpenMP cannot see that the threads should be kept open for longer and hence the threads are created and destroyed at every time step.
Would it be helpful to put a "#pragma omp parallel" in front of the time-loop so that the threads are created at the very beginning? And then do the actual parallelisation for the righthand-side (1) and the Euler step (2)? But how do I do that?
I have found numerous examples for how to parallelise nested for loops, but none of them were concerned with the setup where the inner loops have been sourced out to separate modules. Would this an obstacle for parallelising?
I have now removed the d loops (by making the indices explicit) and collapsed the i and j loops (by running over the entire 2D array with one variable only).
The code looks like:
for (int t = 0; t < max_time_steps; t++) {
// do some book-keeping
...
// perform time step
// (1) calculate righthand-side of ODE:
#pragma omp parallel for
for (int i = 0; i < nr*ntheta; i++) {
rhs[0][0][i] = A0[0][i] + B0[0][i] + ...;
rhs[1][0][i] = A1[0][i] + B1[0][i] + ...;
rhs[2][0][i] = A2[0][i] + B2[0][i] + ...;
rhs[3][0][i] = A3[0][i] + B3[0][i] + ...;
}
// (2) perform Euler step (or Runge-Kutta, ...)
#pragma omp parallel for
for (int i = 0; i < nr*ntheta; i++) {
next[0][0][i] = current[0][0][i] + time_step * rhs[0][0][i];
next[1][0][i] = current[1][0][i] + time_step * rhs[1][0][i];
next[2][0][i] = current[2][0][i] + time_step * rhs[2][0][i];
next[3][0][i] = current[3][0][i] + time_step * rhs[3][0][i];
}
}
The size of nr*ntheta is 400*40=1600 and I a make max_time_steps=1000 time steps. Still, the parallelisation does not result in a speed-up:
Runtime without OpenMP (result of time on the command line):
real 0m23.597s
user 0m23.496s
sys 0m0.076s
Runtime with OpenMP (24 cores)
real 0m23.162s
user 7m47.026s
sys 0m0.905s
I do not understand what's happening here.
One peculiarity that I don't show in my code snippet above is that my variables are not actually doubles but a self-defined struct of two doubles which resemble real and imaginary part. But I think this should not make a difference.
Just wanted to report some success after I left the parallelisation alone for a while. The code evolved for a year and now I went back to parallelisation. This time, I can say that OpenMP does it's job and reduces the required walltime.
While the code evolved overall, this particular loop that I've shown above did not really change; merely two things: a) The resolution is higher so that it covers about 10 times as many points and b) the number of calculations per loop also is about 10-fold (maybe even more).
My only explanation why it works now and didn't work a little over a year ago, is that, when I tried to parallelise the code last time, it wasn't computationally expensive enough and the speed-up was killed by the OpenMP overhead. One single loop now requires about 200-300ms whereas that time required must have been in the single digit ms last time.
I can see such effect when comparing gcc and the Intel compiler (which are doing a very different job when vectorizing):
a) Using gcc, one loop needs about 300ms without OpenMP, and on two cores only 52% of the time is required --> near perfect optimization.
b) Using icpc, one loop needs about 160ms without OpenMP, and on two cores it needs 60% of the time --> good optimization but about 20% less effective.
When going for more than two cores, the speed-up is not large enough to make it worthwhile.

OpenMP parallel for loop speedup issues

Recently I started using OpenMP. Doing a numerical calculation involving 3d matrices created in c++ as vectors and I used parallel for loops to speedup the code. But it runs slower than serial code. I compile the code using Codeblocks in Windows 7. The code is something like this.
int main(){
vector<vector<vector<float> > > Dx; //
/*create 3d array Dx[IE][JE][KE] as vectors*/
Dx.resize(IE);
for (int i = 0; i < IE; ++i) {
for (int j = 0; j < JE; ++j){
dx[i][j].resize(KE);
}
}
//declare and initialize more matrices like this
.
.
.
double wtime = omp_get_wtime(); // start time
//and matrix calculations using parallel for loop
#pragma omp parallel for
for (int i=1; i < IE; ++i ) {
for (int j=1; j < JE; ++j ) {
for (int k=1; k < KE; ++k ) {
curl_h = ( Hz[i][j][k] - Hz[i][j-1][k] - Hy[i][j][k] + Hy[i][j][k-1]);
idxl[i][j][k] = idxl[i][j][k] + curl_h;
Dx[i][j][k] = gj3[j]*gk3[k]*dx[i][j][k]
+ gj2[j]*gk2[k]*.5*(curl_h + gi1[i]*idxl[i][j][k]);
}
}
}
wtime = omp_get_wtime() - wtime;
}
But code with parallel loops run slower than the serial code. Any ideas ?
Thxs.

The loop uses the variable curl_h, which is not declared as thread private. This is both a bug, and also the reason for your perceived performance problem:
As there is only one place in memory where curl_h is stored, all threads constantly and concurrently try to read and write it. One CPU core will load the value into its cache, the next one will issue a write to it, invalidating the cache of the first CPU, which will again grab the cacheline when it itself tries to use curl_h (read or write, both will require the cacheline to be in the local cache).
The point is, that the fierce pretense put up by the hardware that there is only one memory location called curl_h demands its tribute. You get a huge amount of chatter in the cache coherency protocol, and keep your memory buses busy with constantly refetching the same cacheline from memory. All your threads are really doing is fighting over that one cacheline.
Of course, the constant races between the threads are a big bug, as no process can be certain that the value it's currently using is actually the one it calculated in the statement above.
So, just add the correct private() declarations to your omp parallel for statement, and you'll fix both the bug and the performance issue.

Strange slowdown when using openmp

I am trying to increase performance of a rather complex iteration algorithm by parallelizing matrix multiplication, which is being called on each iteration.
The algorithm takes 500 iterations and approximately 10 seconds. But after parallelizing matrix multiplication it slows down to 13 seconds.
However, when I tested matrix multiplication of the same dimension alone, there was an increase in speed. (I am talking about 100x100 matrices.)
Finally, I switched off any parallelizing inside the algorithm and added on each iteration the following piece of code, which does absolutely nothing and presumably shouldn't take long:
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
And again, there is a 30% slowdown comparing to the same algorithm without this piece of code.
Thus, calling any parallelization using openmp 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me, anybody has any clues what the problem is?
The main algorithm is being called by a desktop application, compiled by VS2010, Win32 Release.
I work on Intel Core i3 (parallelization creates 4 threads), 64 bit Windows 7.
Here is a structure of a program:
int internal_method(..)
{
...//no openmp here
// the following code does nothing, has nothing to do with the rest of the program and shouldn't take long,
// but somehow adding of this code caused a 3 sec slowdown of the Huge_algorithm()
double sum;
#pragma omp parallel for private(sum)
for (int i = 0; i < 10; i++)
sum = i*i*i / (1.0 + i*i*i*i);
...//no openmp here
}
int Huge_algorithm(..)
{
...//no openmp here
for (int i = 0; i < 500; i++)
{
.....// no openmp
internal_method(..);
......//no openmp
}
...//no openmp here
}
So, the final point is:
calling the parallel piece of code 500 times alone (when the rest of the algorithm is omitted) takes less than 0.01 sec, but when you call it 500 times inside a huge algorithm it causes 3 sec delay of the entire algorithm.
And what I don't understand is how the small parallel part affects the rest of the algorithm?

For 10 iterations and a simple assignment, I guess there is too much OpenMP overhead compared to the computation itself. What looks lightweight here is actually managing and synchronizing multiple threads which may not even come from a thread pool. There might be some locking involved, and I don't know how good MSVC is at estimating whether to parallelize at all.
Try with bigger loop bodies or a bigger amount of iterations (say 1024*1024 iterations, just for starters).
Example OpenMP Magick:
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
This might be approximately expanded by a compiler to:
const unsigned __cpu_count = __get_cpu_count();
const unsigned __j = alloca (sizeof (unsigned) * __cpu_count);
__thread *__threads = alloca (sizeof (__thread) * __cpu_count);
for (unsigned u=0; u!=__cpu_count; ++u) {
__init_thread (__threads+u);
__run_thread ([u]{for (int i=u; i<10; i+=__cpu_count)
__j[u] = __i;}); // assume lambdas
}
for (unsigned u=0; u!=__cpu_count; ++u)
__join (__threads+u);
with __init_thread(), __run_thread() and __join() being non-trivial function that invoke certain system calls.
In case thread-pools are used, you would replace the first alloca() by something like __pick_from_pool() or so.
(note this, names and emitted code, was all imaginary, actual implementation will look different)
Regarding your updated question:
You seem to be parallelizing at the wrong granularity. Put as much workload as possible in a thread, so instead of
for (...) {
#omp parallel ...
for (...) {}
}
try
#omp parallel ...
for (...) {
for (...) {}
}
Rule of thumb: Keep workloads big enough per thread so as to reduce relative overhead.

Maybe just j=i is not high-yield for core-cpu bandwith. maybe you should try something more yielding calculation. (for exapmle taking i*i*i*i*i*i and dividing it by i+i+i)
are you running this on multi-core cpu or gpu?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js