I am trying to optimize OpenCV code with OpenMP; the code is below. The execution time with OpenMP is actually longer than without it. The machine has 2 cores and 4 threads. Image size: [3024 x 4032]
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
std::clock_t start;
double duration;
start = std::clock();
////none, without openMP 0.129677 sec
//#pragma omp parallel for // 0.213286 sec
#pragma omp parallel for collapse(2)// 0.206435 sec
for (int i = 0; i < maskedImage.rows; ++i)
for (int j = 0; j < maskedImage.cols; ++j){
pixelsD[i][j] = maskedImage.at<cv::Vec3b>(i, j);
// printf("%d %d %d\n", i, j, omp_get_thread_num());
}
duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
My guess: the reason is the context switch which takes longer. What may be other reasons?
How could I optimize it utilizing available resources? Any other ways?
Input appreciated.
P.S.:
The reason for converting from cv::Mat to std::vector is so that I can use erase, push_back and insert to manipulate the image's content.
Thread creation can be quite costly, as can context switches: strangely, with GCC 9.3, it takes 10-20 ms just to start the parallel section on my machine with this sample code. Note that some OpenMP runtimes, like the one used by Clang, can create the threads once and reuse them for all OpenMP sections. Moreover, setting OMP_PROC_BIND to TRUE can help by preventing OpenMP threads from moving between cores. Note that timings between GCC and Clang are quite different on this code.
std::clock does not measure what you probably want: it ignores the time the process spends inactive and sums the ticks of every thread of the process. Please use C++ std::chrono::steady_clock or omp_get_wtime to measure durations correctly.
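To make the difference between the two clocks visible, here is a minimal sketch that measures the same piece of work with both. The Timing struct and time_both helper are names I made up for illustration. The assumption is a platform where std::clock reports process CPU time (as on Linux), so for a well-parallelized region cpu_seconds would come out at roughly the thread count times wall_seconds.

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Hypothetical helper (not from the original post): runs `work` once and
// reports both wall-clock time and process CPU time, in seconds.
struct Timing {
    double wall_seconds; // elapsed real time (std::chrono::steady_clock)
    double cpu_seconds;  // CPU time of the whole process (std::clock)
};

template <typename F>
Timing time_both(F&& work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::forward<F>(work)();

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping the parallel loop in a lambda and passing it to time_both should show whether a "slow" measurement is really slow in wall-clock terms or only looks slow because CPU time was summed over threads.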
Please do not use std::vector<std::vector<cv::Vec3b>>, as it uses a very inefficient memory layout. If you want to perform complex matrix operations, you can use Eigen, for example, or write your own type based on a contiguous flattened array. Splitting each color channel into a separate array may also help the compiler vectorize operations, improving performance.
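As a rough idea of what "your own type based on a contiguous flattened array" could look like, here is a minimal sketch. Pixel and FlatImage are hypothetical names, not OpenCV types, and the three-byte pixel is only meant to mirror what cv::Vec3b stores.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pixel type: three bytes, like cv::Vec3b (OpenCV's usual BGR order).
struct Pixel {
    std::uint8_t b, g, r;
};

// Hypothetical container: one contiguous, row-major buffer instead of a
// vector of row vectors, so neighbouring rows are also neighbours in memory.
class FlatImage {
public:
    FlatImage(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    // Element (i, j) lives at offset i * cols + j.
    Pixel& at(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    const Pixel& at(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    // Pointer to the first pixel of row i, useful for row-wise copies.
    Pixel* row(std::size_t i) { return data_.data() + i * cols_; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<Pixel> data_;
};
```

The trade-off is that inserting or erasing a whole row is less convenient than with a vector of vectors, so whether this fits depends on how often the image is reshaped compared to how often it is read.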
On Clang, the pixelsD[i][j] access produces very slow code with OpenMP, as the compiler fails to optimize it. Also, using collapse is not useful here, as the number of threads should be much smaller than the number of rows (it could even decrease performance).
Here is a new version where the time is more correctly measured:
std::vector<std::vector<cv::Vec3b> > pixelsD(maskedImage.rows, std::vector<cv::Vec3b>(maskedImage.cols));
#pragma omp parallel
{
double start;
// Wait for all threads to be created and ready
#pragma omp barrier
#pragma omp master
start = omp_get_wtime();
#pragma omp for
for (int i = 0; i < maskedImage.rows; ++i)
{
std::vector<cv::Vec3b>& row = pixelsD[i];
for (int j = 0; j < maskedImage.cols; ++j)
{
row[j] = maskedImage.at<cv::Vec3b>(i, j);
}
} // Implicit barrier here
#pragma omp master
{
const double duration = omp_get_wtime() - start;
cout << duration << endl;
}
}
// Side effect to force the compiler to not optimize the previous loop to nothing
cout << "result: " << (int)pixelsD[0][0][0] << endl;
On my 6-core machine and with an image of size 3840x2160, I get the following results:
Clang:
- initial sequential clock time: 8.5 ms
- initial parallel clock time: 60 ~ 63 ms
- new sequential time: 8.5 ms
- new parallel time: 2.4 ms
GCC:
- initial sequential clock time: 9.7 ms
- initial parallel clock time: 3 ~ 93 ms
- new sequential time: 8.5 ms
- new parallel time: 2.3 ms
Theoretical optimal time: 1.2 ms
Note that this operation can be made even faster using direct access to the data of maskedImage. Note also that memory-bound code tends to scale poorly. The results are not bad here because the compilers generate quite inefficient code (although generating good code is difficult given the memory layout).
Another possible explanation is this link.
It is suggested to avoid using i and j indices inside the loop code.
If I remember correctly, the data part of an OpenCV Mat uses contiguous part of the memory, at least for rows, and for the entire data in some cases.
As this is also the case for vectors, you could copy the image line by line (or the entire image at once) instead of pixel by pixel.
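A sketch of the line-by-line copy, written against a raw byte buffer so it does not depend on OpenCV. copy_rows and its parameters are illustrative names; with a cv::Mat, the row pointer would come from something like maskedImage.ptr<cv::Vec3b>(i) instead of the manual src + i * step.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies a strided image buffer into a tightly packed vector, one row at a time.
// `src` points to the first byte, `step` is the distance in bytes between the
// starts of consecutive rows (rows may be padded), and `row_bytes` is the number
// of payload bytes per row. All names are illustrative, not OpenCV API.
std::vector<std::uint8_t> copy_rows(const std::uint8_t* src, std::size_t rows,
                                    std::size_t row_bytes, std::size_t step) {
    std::vector<std::uint8_t> dst(rows * row_bytes);
    // Rows are independent, so the loop can be shared between threads.
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(rows); ++i) {
        const std::size_t r = static_cast<std::size_t>(i);
        const std::uint8_t* row = src + r * step;
        std::copy(row, row + row_bytes, dst.data() + r * row_bytes);
    }
    return dst;
}
```

Copying whole rows lets the library use its bulk copy path instead of going through a per-pixel accessor.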
I think the threads are switching too frequently (once per row), which costs more processor time for management. It should work more efficiently if you assign larger pieces of work to each thread, for instance one image per thread.
Related
Without using Open MP Directives - serial execution - check screenshot here
Using OpenMp Directives - parallel execution - check screenshot here
#include "stdafx.h"
#include <omp.h>
#include <iostream>
#include <time.h>
using namespace std;
static long num_steps = 100000;
double step;
double pi;
int main()
{
clock_t tStart = clock();
int i;
double x, sum = 0.0;
step = 1.0 / (double)num_steps;
#pragma omp parallel for shared(sum)
for (i = 0; i < num_steps; i++)
{
x = (i + 0.5)*step;
#pragma omp critical
{
sum += 4.0 / (1.0 + x * x);
}
}
pi = step * sum;
cout << pi <<"\n";
printf("Time taken: %.5fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
getchar();
return 0;
}
I have tried multiple times; the serial execution is always faster. Why?
Serial Execution Time: 0.0200s
Parallel Execution Time: 0.02500s
Why is serial execution faster here? Am I calculating the execution time the right way?
OpenMP implements parallel processing with multithreading internally, and the benefit of multithreading only becomes measurable with a large volume of data. With a very small volume of data you cannot measure the performance of a multithreaded application. The reasons:
a) To create a thread, the OS needs to allocate memory for it, which takes time (even if only a tiny bit).
b) Running multiple threads requires context switching, which also takes time.
c) The memory allocated to the threads has to be released, which also takes time.
d) It depends on the number of processors and the total memory (RAM) in your machine.
So when you try a small operation with multiple threads, its performance will be about the same as with a single thread (by default the OS assigns one thread, called the main thread, to every process). So your outcome is expected in this case. To measure the performance of a multithreaded design, use a large amount of data with a complex operation; only then can you see the difference.
Because of your critical block you cannot accumulate sum in parallel. Every time one thread reaches the critical section, all other threads have to wait.
The smart approach would be to create a temporary copy of sum for each thread that can be accumulated without synchronization, and afterwards to add up the results from the different threads.
OpenMP can do this automatically with the reduction clause, so your loop changes to:
#pragma omp parallel for private(x) reduction(+:sum) // x is declared outside the loop, so it must be private
for (i = 0; i < num_steps; i++)
{
x = (i + 0.5)*step;
sum += 4.0 / (1.0 + x * x);
}
On my machine this performs 10 times faster than the version using the critical block (I also increased num_steps to reduce the influence of one-time actions like thread-creation).
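For reference, the "temporary copy of sum for each thread" idea can also be written out by hand, which is roughly what the reduction clause does for you. This is only a sketch; compute_pi and local_sum are names I chose, and the reduction clause remains the simpler option.

```cpp
// Manual equivalent of reduction(+:sum): each thread accumulates into its own
// local variable and the partial results are combined once per thread.
// compute_pi and its parameter are illustrative names, not the asker's code.
double compute_pi(long num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double sum = 0.0;

    #pragma omp parallel
    {
        double local_sum = 0.0; // private to each thread

        #pragma omp for
        for (long i = 0; i < num_steps; ++i) {
            const double x = (i + 0.5) * step; // declared inside the loop, so private
            local_sum += 4.0 / (1.0 + x * x);
        }

        // One synchronized update per thread instead of one per iteration.
        #pragma omp atomic
        sum += local_sum;
    }
    return step * sum;
}
```

Compiled without OpenMP, the pragmas are ignored and the function simply computes the same sum serially.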
PS: I recommend you use <chrono>, <boost/timer/timer.hpp> or Google Benchmark for timing your code.
I am new to OpenMP and am now trying to use OpenMP + SIMD intrinsics to speed up my program, but the result is far from what I expected.
To simplify the case without losing much essential information, I wrote a simpler toy example:
#include <omp.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <sys/time.h>
#include "immintrin.h" // for SIMD intrinsics
int main() {
int64_t size = 160000000;
std::vector<int> src(size);
// generating random src data
for (int i = 0; i < size; ++i)
src[i] = (rand() / (float)RAND_MAX) * size;
// to store the final results, so size is the same as src
std::vector<int> dst(size);
// get pointers for vector load and store
int * src_ptr = src.data();
int * dst_ptr = dst.data();
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
omp_set_num_threads(4); // you can change thread count here
// only measure the parallel part
struct timeval one, two;
double get_time;
gettimeofday (&one, NULL);
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
// load needed data
vec_src = _mm256_loadu_si256((__m256i const *)(src_ptr + i));
// computation part
vec_dst = _mm256_add_epi32(vec_src, vec_op);
vec_dst = _mm256_mullo_epi32(vec_dst, vec_src);
vec_dst = _mm256_slli_epi32(vec_dst, 1);
vec_dst = _mm256_add_epi32(vec_dst, vec_src);
vec_dst = _mm256_sub_epi32(vec_dst, vec_src);
// store results
_mm256_storeu_si256((__m256i *)(dst_ptr + i), vec_dst);
}
gettimeofday(&two, NULL);
double oneD = one.tv_sec + (double)one.tv_usec * .000001;
double twoD = two.tv_sec + (double)two.tv_usec * .000001;
get_time = 1000 * (twoD - oneD);
std::cout << "took time: " << get_time << std::endl;
// output something in case the computation is optimized out
int64_t i = (int)((rand() / (float)RAND_MAX) * size);
for (int64_t i = 0; i < size; ++i)
std::cout << i << ": " << dst[i] << std::endl;
return 0;
}
It is compiled using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp test.cpp -o test and the elapsed time of the parallel part is measured. The result is as follows (the median value is picked out of 5 runs each):
1 thread: 92.519
2 threads: 89.045
4 threads: 90.361
The computations seem embarrassingly parallel, as different threads can load their needed data simultaneously given different indices, and the case is similar for writing the results, but why no speedups?
More information:
I checked the assembly code using icpc -g -std=c++11 -march=core-avx2 -O3 -qopenmp -S test.cpp and found vectorized instructions are generated;
To check if it is memory-bound, I commented the computation part in the loop, and the measured time decreased to around 60, but it does not change much if I change the thread count from 1 -> 2 -> 4.
Any advice or clue is welcome.
EDIT-1:
Thanks to @JerryCoffin for pointing out the possible cause. I ran the Memory Access Analysis using VTune. Here are the results:
1-thread: Memory Bound: 6.5%, L1 Bound: 0.134, L3 Latency: 0.039
2-threads: Memory Bound: 18.0%, L1 Bound: 0.115, L3 Latency: 0.015
4-threads: Memory Bound: 21.6%, L1 Bound: 0.213, L3 Latency: 0.003
It is an Intel 4770 Processor with 25.6GB/s (23GB/s measured by Vtune) max. bandwidth. The memory bound does increase, but I am still not sure if that is the cause. Any advice?
EDIT-2 (just trying to give thorough information, so the appended stuff can be long but not tedious hopefully):
Thanks for the suggestions from @PaulR and @bazza. I tried 3 ways for comparison. One thing to note is that the processor has 4 cores and 8 hardware threads. Here are the results:
(1) just initialize dst as all zeros in advance: 1 thread: 91.922; 2 threads: 93.170; 4 threads: 93.868 --- seems not effective;
(2) without (1), put the parallel part in an outer loop over 100 iterations, and measure the time of the 100 iterations: 1 thread: 9109.49; 2 threads: 4951.20; 4 threads: 2511.01; 8 threads: 2861.75 --- quite effective except for 8 threads;
(3) based on (2), put one more iteration before the 100 iterations, and measure the time of the 100 iterations: 1 thread: 9078.02; 2 threads: 4956.66; 4 threads: 2516.93; 8 threads: 2088.88 --- similar with (2) but more effective for 8 threads.
It seems more iterations can expose the advantages of OpenMP + SIMD, but the computation / memory access ratio is unchanged regardless of the loop count, and locality does not seem to be the reason either, since src and dst are too large to stay in any cache, so no relation exists between consecutive iterations.
Any advice?
EDIT 3:
To avoid any misunderstanding, one thing needs to be clarified: in (2) and (3), the OpenMP directive is outside the added outer loop:
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int k = 0; k < 100; ++k) {
for (int64_t i = 0; i < size; i += 8) {
......
}
}
i.e. the outer loop is parallelized across threads, and the inner loop is still processed serially. So the effective speedup in (2) and (3) might come from enhanced locality among threads.
I did another experiment in which the OpenMP directive is put inside the outer loop:
for (int k = 0; k < 100; ++k) {
#pragma omp parallel for private(vec_src, vec_op, vec_dst)
for (int64_t i = 0; i < size; i += 8) {
......
}
}
and the speedup is still not good: 1 thread: 9074.18; 2 threads: 8809.36; 4 threads: 8936.89.93; 8 threads: 9098.83.
Problem still exists. :(
EDIT-4:
If I replace the vectorized part with scalar operations like this (the same calculations but in scalar way):
#pragma omp parallel for
for (int64_t i = 0; i < size; i++) { // not i += 8
int query = src[i];
int res = src[i] + 2;
res = res * query;
res = res << 1;
res = res + query;
res = res - query;
dst[i] = res;
}
The timings are 1 thread: 92.065; 2 threads: 89.432; 4 threads: 88.864. May I conclude that the seemingly embarrassingly parallel problem is actually memory bound (the bottleneck is the load / store operations)? If so, why can't load / store operations be parallelized well?
May I conclude that the seemingly embarrassingly parallel problem is actually memory bound (the bottleneck is the load / store operations)? If so, why can't load / store operations be parallelized well?
Yes this problem is embarrassingly parallel in the sense that it is easy to parallelize due to the lack of dependencies. That doesn't imply that it will scale perfectly. You can still have a bad initialization overhead vs work ratio or shared resources limiting your speedup.
In your case, you are indeed limited by memory bandwidth. A practical consideration first: when compiled with icpc (16.0.3 or 17.0.1), the "scalar" version yields better code when size is made constexpr. This is not due to the fact that it optimizes away these two redundant lines:
res = res + query;
res = res - query;
It does, but that makes no difference. Mainly, the compiler uses exactly the same instructions that you do with the intrinsics, except for the store. For the store, it uses vmovntdq instead of vmovdqu, making use of sophisticated knowledge about the program, the memory and the architecture. Not only does vmovntdq require aligned memory, which can make it more efficient; it also gives the CPU a non-temporal hint, preventing the data from being cached during the write to memory. This improves performance, because writing the data to cache requires loading the remainder of the cache line from memory first. So while your initial SIMD version requires three memory operations (reading the source, reading the destination cache line, writing the destination), the compiler version with the non-temporal store requires only two. In fact, on my i7-4770 system, the compiler-generated version reduces the runtime at 2 threads from ~85.8 ms to 58.0 ms, an almost perfect 1.5x speedup. The lesson here is to trust your compiler unless you know the architecture and instruction set extremely well.
Considering peak performance here, 58 ms for transferring 2*160000000*4 bytes corresponds to 22.07 GB/s (summing read and write), which is about the same as your VTune results. (Funnily enough, 85.8 ms corresponds to about the same bandwidth for two reads and one write.) There isn't much direct room for improvement left.
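The bandwidth arithmetic above can be checked with a couple of one-liners. The helper names are mine; GB here means 10^9 bytes, and the element size of 4 bytes matches the int arrays in the question.

```cpp
// Effective bandwidth in GB/s (10^9 bytes per second) for `bytes` moved in
// `milliseconds`. Helper name and signature are illustrative.
double bandwidth_gb_per_s(double bytes, double milliseconds) {
    return bytes / (milliseconds * 1e-3) / 1e9;
}

// Bytes moved per pass over `elements` 4-byte ints, counting `streams` memory
// streams (2 = read src + write dst, 3 = additionally read the dst cache lines).
double bytes_moved(double elements, int streams) {
    return elements * 4.0 * streams;
}
```

With 160000000 elements, two streams in 58 ms work out to about 22.07 GB/s, and three streams in 85.8 ms to about 22.4 GB/s, which is why both versions look equally bandwidth-limited.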
To further improve performance, you would have to do something about the operations / byte ratio of your code. Remember that your processor can perform 217.6 GFLOP/s (I guess either the same or twice that for integer operations), but can only read and write 3.2 G int/s. That gives you an idea of how many operations you need to perform per element to not be limited by memory. So if you can, work on the data in blocks so that you can reuse data in caches.
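To illustrate what "work on the data in blocks" can mean, here is a small sketch that applies several passes to an array either pass by pass or block by block. It assumes the passes are pointwise (each output element depends only on the same input element), which is what makes the reordering valid. The function names and the block size are made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One cheap pass over a range, standing in for a single step of real work.
// Unsigned arithmetic is used so that overflow wraps instead of being undefined.
inline void pass(std::uint32_t* data, std::size_t n, std::uint32_t k) {
    for (std::size_t i = 0; i < n; ++i) data[i] = data[i] * 3u + k;
}

// Pass by pass: every pass streams the whole array through memory.
void run_streaming(std::vector<std::uint32_t>& v, std::uint32_t passes) {
    for (std::uint32_t k = 0; k < passes; ++k) pass(v.data(), v.size(), k);
}

// Block by block: all passes run on one block while it is still in cache,
// so ideally each element travels to and from main memory only once.
void run_blocked(std::vector<std::uint32_t>& v, std::uint32_t passes,
                 std::size_t block = 8192) { // block size is a guess to be tuned
    for (std::size_t start = 0; start < v.size(); start += block) {
        const std::size_t n = std::min(block, v.size() - start);
        for (std::uint32_t k = 0; k < passes; ++k) pass(v.data() + start, n, k);
    }
}
```

Both functions produce the same result; the blocked one just does more arithmetic per byte fetched from memory, which is the ratio that matters here.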
I cannot reproduce your results for (2) and (3). When I loop around the inner loop, the scaling behaves the same. The results look fishy, particularly in light of the other results being so consistent with peak performance. Generally, I recommend measuring inside the parallel region and using omp_get_wtime, like this:
double one, two;
#pragma omp parallel
{
__m256i vec_src;
__m256i vec_op = _mm256_set1_epi32(2);
__m256i vec_dst;
#pragma omp master
one = omp_get_wtime();
#pragma omp barrier
for (int kk = 0; kk < 100; kk++)
#pragma omp for
for (int64_t i = 0; i < size; i += 8) {
...
}
#pragma omp master
{
two = omp_get_wtime();
std::cout << "took time: " << (two-one) * 1000 << std::endl;
}
}
A final remark: desktop processors and server processors have very different characteristics regarding memory performance. On contemporary server processors, you need many more active threads to saturate the memory bandwidth, while on desktop processors a single core can often almost saturate the memory bandwidth.
Edit: One more thought about VTune not classifying it as memory-bound. This may be caused by the short computation time compared to initialization. Try to see what VTune says about the code when it runs in a loop.
I am trying to use OpenMP and am getting strange results.
The parallel "for" runs faster with OpenMP, as expected. But the serial "for" runs much faster when OpenMP is disabled (without the /openmp option, VS 2013).
Test code
const int n = 5000;
const int m = 2000000;
vector <double> a(n, 0);
double start = omp_get_wtime();
#pragma omp parallel for shared(a)
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "omp Time: " << (omp_get_wtime() - start) << endl;
start = omp_get_wtime();
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "serial Time: " << (omp_get_wtime() - start) << endl;
Output without /openmp option
0
omp Time: 6.4389
serial Time: 6.37592
Output with /openmp option
0
1
2
3
omp Time: 1.84636
serial Time: 16.353
Are these results correct? Or am I doing something wrong?
I believe part of the answer lies hidden in the architecture of the computer you run on. I tried running the same code on another machine (GCC 4.8 on GNU+Linux, quad Core2 CPU), and over many runs, found a slightly odd thing: while the time for both loops varied, and OpenMP with many threads always ran faster, the second loop never ran significantly faster than the first, even without OpenMP.
The next step was to try to eliminate a dependency between the loops, allocating a second vector for the second loop. It still ran no faster than the first. So I tried reversing them, running the OpenMP loop after the serial one; and while it still ran fast when multithreaded, it would now see delays when the first loop didn't. It's looking more like an operating system behaviour at this point; long-lived threads simply seem more likely to get interrupted. I had taken some measures to reduce interruptions (niceness -15, specific cpu set) but this is not a system dedicated to benchmarking.
None of my results were anywhere near as extreme as yours, however. My first guess as to what caused your large difference was that you reused the same array and ran the parallel loop first. This would distribute the array into caches on all cores, causing a slight dilemma of whether to migrate the thread to the data or the other way around; and OpenMP may have chosen any distribution, including iteration i to thread i%threads (as with schedule(static,1)), which probably would hurt multithreaded runtime, or one cache line each, which would hurt later single-threaded reading if it fit in per-core caches. However, all of the array accesses are writes, so the processor shouldn't need to wait for them in the first place.
In summary, your results are certainly platform dependent and unexpected. I would suggest rerunning the test with swapped order, the two loops operating on different arrays, and placed in different compilation units, and of course to verify the written results. It is possible you've found a flaw in your compiler.
I tried using OpenMP to parallelize some for loops in my program but failed to get a significant speed improvement (a degradation is actually observed). My target machine will have 4-6 cores and I currently rely on the OpenMP runtime to pick the thread count for me, so I haven't tried any other thread counts yet.
Target/Development platform: Windows 64bits
using MinGW64 4.7.2 (rubenvb build)
Sample output with OpenMP
Thread count: 4
Dynamic :0
OMP_GET_NUM_PROCS: 4
OMP_IN_PARALLEL: 1
5.612 // <- returned by omp_get_wtime()
5.627 (sec) // <- returned by clock()
Wall time elapsed: 5.62703
Sample output without OpenMP
2.415 (sec) // <- returned by clock()
Wall time elapsed: 2.415
How I measure the time
struct timeval start, end;
gettimeofday(&start, NULL);
#ifdef _OPENMP
double t1 = (double) clock();
double wt = omp_get_wtime();
sim->resetEnvironment(run);
tout << omp_get_wtime() - wt << std::endl;
timeEnd(tout, t1);
#else
double t1 = (double) clock();
sim->resetEnvironment(run);
timeEnd(tout, t1);
#endif
gettimeofday(&end, NULL);
tout << "Wall time elapsed: "
<< ((end.tv_sec - start.tv_sec) * 1000000u + (end.tv_usec - start.tv_usec)) / 1.e6
<< std::endl;
The code
void Simulator::resetEnvironment(int run)
{
#pragma omp parallel
{
// (a)
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
for (int level = 0; level < level_count; level++) // (b) level = 3
{
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic)
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
} // end #parallel
} // end: Simulator::resetEnvironment()
Randomness
Inside the reset() function calls, I use an RNG to seed some agents for subsequent tasks.
Below is my RNG implementation, as I saw a suggestion to use one RNG per thread for thread safety.
class RNG {
public:
typedef std::mt19937 Engine;
RNG()
: real_uni_dist_(0.0, 1.0)
#ifdef _OPENMP
, engines()
#endif
{
#ifdef _OPENMP
int threads = std::max(1, omp_get_max_threads());
for (int seed = 0; seed < threads; ++seed)
engines.push_back(Engine(seed));
#else
engine_.seed(time(NULL));
#endif
} // end_ctor(RNG)
/** @return next possible value of the uniform distribution */
double operator()()
{
#ifdef _OPENMP
return real_uni_dist_(engines[omp_get_thread_num()]);
#else
return real_uni_dist_(engine_);
#endif
}
private:
std::uniform_real_distribution<double> real_uni_dist_;
#ifdef _OPENMP
std::vector<Engine> engines;
#else
std::mt19937 engine_;
#endif
}; // end_class(RNG)
Question:
at (a), is it good to avoid the shortcut 'parallel for' in order to avoid the overhead of creating teams?
which part of my implementation could be the cause of the performance degradation?
Why are the times reported by clock() and omp_get_wtime() so similar? I expected clock() to be somewhat longer than omp_get_wtime().
[Edit]
at (b), my intention in putting the OpenMP directive on the inner loop is that the iteration count of the outer loop is so small (only 3) that I think I can skip it and go directly to the inner loop over vector_4[level]. Is this thinking inappropriate (or will it instruct OpenMP to repeat the outer loop 4 times, and hence actually run the inner loop 12 times instead of 3, if the current thread count is 4)?
Thanks
If the measured wall-clock time (as reported by omp_get_wtime()) is close to the total CPU time (as reported by clock()), this could mean several different things:
the code is running single-threaded, but then the total CPU time will be lower than the wall-clock time;
a very high synchronisation and cache coherency overhead is present and it is huge in comparison to the actual work being done by the threads.
Your case is the second one, and the reason is that you use schedule(dynamic). Dynamic scheduling should only be used when each iteration can take a varying amount of time. If such iterations are statically distributed among the threads, work imbalance could occur. schedule(dynamic) takes care of this by giving each task (in your case each single iteration of the loop) to the next thread that finishes its work and becomes idle. There is a certain overhead in synchronising the threads and bookkeeping the distribution of the work items, and therefore it should only be used when the amount of work per work item is large in comparison to the overhead. OpenMP allows you to group iterations into blocks, and the block size is specified like schedule(dynamic,100); this would make each thread execute a block (or chunk) of 100 consecutive iterations before asking for a new one. The default block size for dynamic scheduling is 1, i.e. each vector element is handed out to a thread separately. I have no idea how much processing is done in reset() and what kind of elements there are in vector_*, but given the serial run time it is not much at all.
Another source of slowdown is the loss of data locality when you use dynamic scheduling. Depending on the type of elements of those vectors, processing neighbouring elements by different threads leads to false sharing. That means that, e.g. vector_1[i] lies in the same cache line with some other elements of vector_1, e.g. vector_1[i-1] and vector_1[i+1]. When thread 1 modifies vector_1[i], the cache line is reloaded in all other cores that work on the neighbouring elements. If vector_1[] is only written to, the compiler can be smart enough to generate non-temporal stores (those bypass the cache) but it only works with vector stores and having each core do a single iteration at a time means no vectorisation at all. Data locality can be improved by either switching to static scheduling or, if reset() really takes varying amount of time, by setting a reasonable chunk size in the schedule(dynamic) clause. The best chunk size is usually dependent on the processor and often one has to tweak it in order to get the best performance.
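A small sketch of one common way to keep per-thread data out of each other's cache lines: pad each slot to a full line. The 64-byte line size is an assumption (typical for x86, not guaranteed), and PaddedCounter and count_even are names invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86 but is not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Each counter occupies a full cache line's worth of bytes, so two counters
// updated by different threads are always at least 64 bytes apart.
struct PaddedCounter {
    long value = 0;
    char padding[kCacheLine - sizeof(long)];
};

// Counts even values with one padded slot per chunk of the input.
// `slots` plays the role of the thread count; all names are illustrative.
long count_even(const std::vector<int>& data, int slots) {
    std::vector<PaddedCounter> counters(static_cast<std::size_t>(slots));
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < slots; ++s) {
        // Slot s handles its own contiguous range and writes only counters[s].
        const long long begin = n * s / slots;
        const long long end = n * (s + 1) / slots;
        for (long long i = begin; i < end; ++i) {
            if (data[static_cast<std::size_t>(i)] % 2 == 0)
                counters[static_cast<std::size_t>(s)].value += 1;
        }
    }
    long total = 0;
    for (const PaddedCounter& c : counters) total += c.value;
    return total;
}
```

Giving each thread a contiguous range (as static scheduling does) and a padded slot addresses both halves of the problem described above: neighbouring elements stay with one thread, and the per-thread results do not share a line.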
So I would strongly suggest that you first switch to static scheduling by replacing all schedule(dynamic) with schedule(static) and then try to optimise further. You don't have to specify the chunk size in the static case, as the default is simply the total number of iterations divided by the number of threads, i.e. each thread gets one contiguous block of iterations.
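For comparison, the two variants discussed above could look like this on a plain vector. reset_item stands in for the asker's reset(), whose body is not shown, and the chunk size of 1000 is only an arbitrary starting point.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the asker's reset(); the real function is not shown in the post.
inline void reset_item(int& item) { item = 0; }

// Uniform work per element: static scheduling gives each thread one
// contiguous block of iterations with almost no bookkeeping.
void reset_all_static(std::vector<int>& items) {
    const long long n = static_cast<long long>(items.size());
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        reset_item(items[static_cast<std::size_t>(i)]);
}

// Varying work per element: keep dynamic scheduling but hand out chunks.
// The chunk size of 1000 is an arbitrary starting point to be tuned.
void reset_all_dynamic(std::vector<int>& items) {
    const long long n = static_cast<long long>(items.size());
    #pragma omp parallel for schedule(dynamic, 1000)
    for (long long i = 0; i < n; ++i)
        reset_item(items[static_cast<std::size_t>(i)]);
}
```

Timing both against the serial loop on the real data is the only reliable way to pick between them.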
To answer your questions:
1) At (a), your usage of the "parallel" directive is correct.
2) Congrats, your implementation of a lock-free PRNG looks fine.
3) The slowdown can come from all the OpenMP pragmas you use in the inner loop. Parallelize at the top level and avoid fine-grained, inner-loop parallelism.
4) In the code below, I used 'nowait' on each 'omp for', moved the OpenMP directive out of the inner loop in the vector_4 processing, and put a barrier at the end to join all the threads and wait for the end of all the jobs spawned before.
// pseudo code
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic) nowait
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
#pragma omp for schedule(dynamic) nowait
for (int level = 0; level < level_count; level++)
{
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic) nowait
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
#pragma omp barrier
A single threaded program will run faster than a multi-threaded one if the useful processing time is less than the overhead incurred by threads.
It is a good idea to determine what the overhead is by implementing a null function and then deciding whether it is a better solution.
From a performance point of view, threads are only useful if the useful processing time is significantly higher than the overhead that is incurred by threads and there are real cpus available to run the threads.
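One way to put a number on that overhead is to time a parallel region that does almost nothing. This is a rough sketch; the function name is made up, and the result depends heavily on the compiler, the runtime and the machine.

```cpp
#include <chrono>

// Average wall-clock cost, in microseconds, of entering and leaving a parallel
// region with an (almost) empty body. Function name is illustrative.
double parallel_region_overhead_us(int repetitions) {
    int sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) {
        #pragma omp parallel
        {
            // Minimal side effect so the region is not removed entirely.
            #pragma omp atomic
            sink += 1;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    const double total_us =
        std::chrono::duration<double, std::micro>(stop - start).count();
    return sink > 0 ? total_us / repetitions : 0.0;
}
```

If the useful work inside a real region takes about as long as this figure, or less, parallelizing it is unlikely to pay off.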
I am trying to increase performance of a rather complex iteration algorithm by parallelizing matrix multiplication, which is being called on each iteration.
The algorithm takes 500 iterations and approximately 10 seconds. But after parallelizing matrix multiplication it slows down to 13 seconds.
However, when I tested matrix multiplication of the same dimension alone, there was an increase in speed. (I am talking about 100x100 matrices.)
Finally, I switched off any parallelizing inside the algorithm and added on each iteration the following piece of code, which does absolutely nothing and presumably shouldn't take long:
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
And again, there is a 30% slowdown comparing to the same algorithm without this piece of code.
Thus, calling any parallelization using openmp 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me, anybody has any clues what the problem is?
The main algorithm is being called by a desktop application, compiled by VS2010, Win32 Release.
I work on Intel Core i3 (parallelization creates 4 threads), 64 bit Windows 7.
Here is a structure of a program:
int internal_method(..)
{
...//no openmp here
// the following code does nothing, has nothing to do with the rest of the program and shouldn't take long,
// but somehow adding of this code caused a 3 sec slowdown of the Huge_algorithm()
double sum;
#pragma omp parallel for private(sum)
for (int i = 0; i < 10; i++)
sum = i*i*i / (1.0 + i*i*i*i);
...//no openmp here
}
int Huge_algorithm(..)
{
...//no openmp here
for (int i = 0; i < 500; i++)
{
.....// no openmp
internal_method(..);
......//no openmp
}
...//no openmp here
}
So, the final point is:
calling the parallel piece of code 500 times alone (when the rest of the algorithm is omitted) takes less than 0.01 sec, but when you call it 500 times inside a huge algorithm it causes 3 sec delay of the entire algorithm.
And what I don't understand is how the small parallel part affects the rest of the algorithm?
For 10 iterations and a simple assignment, I guess there is too much OpenMP overhead compared to the computation itself. What looks lightweight here is actually managing and synchronizing multiple threads which may not even come from a thread pool. There might be some locking involved, and I don't know how good MSVC is at estimating whether to parallelize at all.
Try with bigger loop bodies or a bigger amount of iterations (say 1024*1024 iterations, just for starters).
Example OpenMP Magick:
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
This might be approximately expanded by a compiler to:
const unsigned __cpu_count = __get_cpu_count();
unsigned *__j = (unsigned *) alloca (sizeof (unsigned) * __cpu_count);
__thread *__threads = alloca (sizeof (__thread) * __cpu_count);
for (unsigned u=0; u!=__cpu_count; ++u) {
__init_thread (__threads+u);
__run_thread ([u]{for (int i=u; i<10; i+=__cpu_count)
__j[u] = i;}); // assume lambdas
}
for (unsigned u=0; u!=__cpu_count; ++u)
__join (__threads+u);
with __init_thread(), __run_thread() and __join() being non-trivial functions that invoke certain system calls.
In case thread pools are used, you would replace the first alloca() by something like __pick_from_pool() or so.
(Note that all of this, names and emitted code, is imaginary; an actual implementation will look different.)
Regarding your updated question:
You seem to be parallelizing at the wrong granularity. Put as much workload as possible in a thread, so instead of
for (...) {
#omp parallel ...
for (...) {}
}
try
#omp parallel ...
for (...) {
for (...) {}
}
Rule of thumb: Keep workloads big enough per thread so as to reduce relative overhead.
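When the loop size is only known at run time, one option I would consider (not something from the question) is OpenMP's if clause, which keeps small loops on a single thread. The threshold below is a guess that would need measuring, and scale is just an example function.

```cpp
#include <cstddef>
#include <vector>

// Threshold below which the loop runs on one thread; the value is a guess to
// be measured on the target machine, not something taken from the post.
constexpr long long kParallelThreshold = 100000;

void scale(std::vector<double>& v, double factor) {
    const long long n = static_cast<long long>(v.size());
    // The if clause makes OpenMP execute the region with a single thread when
    // the condition is false, avoiding the cost of waking a team for tiny loops.
    #pragma omp parallel for if(n >= kParallelThreshold)
    for (long long i = 0; i < n; ++i)
        v[static_cast<std::size_t>(i)] *= factor;
}
```

This does not remove the overhead entirely, but it keeps a 10-iteration loop from paying for a full team of threads on every call.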
Maybe just j=i does not give the CPU cores enough work to be worthwhile. Maybe you should try a more demanding calculation (for example, taking i*i*i*i*i*i and dividing it by i+i+i).
Are you running this on a multi-core CPU or a GPU?