OpenMP bottleneck issue - C++

I was trying to observe basic OpenMP-based parallelism with the following code:
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <time.h>

int main(){
    long i;
    long x[] = {0, 0, 0, 0};
    omp_set_num_threads(4);
    clock_t time = clock();
    #pragma omp parallel for
    for(i = 0; i < 100000000; i++){
        x[omp_get_thread_num()]++;
    }
    double time_taken = (double)(clock() - time) / CLOCKS_PER_SEC;
    printf("%ld %ld %ld %ld %lf\n", x[0], x[1], x[2], x[3], time_taken);
}
Now, I am using a quad-core i5 processor. I have tested 4 different thread counts and found the following results:
Set: omp_set_num_threads(1);
Out: 100000000 0 0 0 0.203921
Set: omp_set_num_threads(2);
Out: 50000000 50000000 0 0 0.826322
Set: omp_set_num_threads(3);
Out: 33333334 33333333 33333333 0 1.448936
Set: omp_set_num_threads(4);
Out: 25000000 25000000 25000000 25000000 1.919655
The x array values are accurate, but the time surprisingly increases with the number of threads. I cannot find any explanation for this. Is the omp_get_thread_num() function somehow atomic in nature? Or is there something else I am missing?
Compiling with: gcc -o test test.c -fopenmp
UPDATE
So, as per the suggestion in the accepted answer, I have modified the code as follows,
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>

int main(){
    long i, t_id, fact = 1096;
    long x[fact * 4];
    x[0] = x[fact] = x[2 * fact] = x[3 * fact] = 0;
    omp_set_num_threads(4);
    double time = omp_get_wtime();
    #pragma omp parallel for private(t_id)
    for(i = 0; i < 100000000; i++){
        t_id = omp_get_thread_num();
        x[t_id * fact]++;
    }
    double time_taken = omp_get_wtime() - time;
    printf("%ld %ld %ld %ld %lf\n", x[0], x[fact], x[2 * fact], x[3 * fact], time_taken);
}
Now, the results are understandable,
Set: omp_set_num_threads(1)
Out: 100000000 0 0 0 0.250205
Set: omp_set_num_threads(2)
Out: 50000000 50000000 0 0 0.154980
Set: omp_set_num_threads(3)
Out: 33333334 33333333 33333333 0 0.078874
Set: omp_set_num_threads(4)
Out: 25000000 25000000 25000000 25000000 0.061155
So it was indeed about the cache line size, as explained in the accepted answer. Have a look there for the details.

Note that the 4 integers that you are operating on lie very closely together, probably on one cache line. Since cache lines are loaded into the CPU cache in one go, each thread needs to ensure that it has the latest version of that cache line. Since all threads want to modify (and not just read) that one cache line, they are constantly invalidating one another's copy. Welcome to false sharing!
To solve this problem, ensure that the integers are (physically) far enough apart from one another, e.g., by allocating structures that fill (at least) one full cache line for each thread to work with.
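For illustration, here is a minimal sketch of that idea (the padded_counter struct, the CACHE_LINE constant, and the assumed 64-byte line size are illustrative additions, not from the original question; 64 bytes is typical for x86 but not guaranteed everywhere):

/* Minimal sketch: give each thread its own counter on its own cache line. */
#include <stdio.h>
#include <omp.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* fill the rest of the line */
} padded_counter;

int main(void){
    padded_counter x[4] = {{0}, {0}, {0}, {0}};
    omp_set_num_threads(4);
    double start = omp_get_wtime();
    #pragma omp parallel for
    for(long i = 0; i < 100000000; i++){
        x[omp_get_thread_num()].value++;   /* counters are now 64 bytes apart */
    }
    printf("%ld %ld %ld %ld %lf\n", x[0].value, x[1].value,
           x[2].value, x[3].value, omp_get_wtime() - start);
    return 0;
}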
When executing your sample program using 4 threads on one of my machines, I got the following result:
25000000 25000000 25000000 25000000 5.049694
When modifying the program, such that the array has 4096 elements, and using the elements 0, 1024, 2048 and 3072 (which ensures enough distance), the program runs a lot faster:
25000000 25000000 25000000 25000000 1.617231
Note that you are counting the processor time used by the whole process (clock() sums the CPU time of all threads), so even with perfect scaling it will not decrease; but without false sharing it should not increase significantly either, and rather stay more or less constant (there is some additional locking involved, but it should not usually be on the order of a 10x increase). In fact, the performance boost shown above also translates into wall-clock time (~1.25 seconds to ~500ms).

The reasons for your observation, as noted by gha.st, are false sharing and the properties of the clock function.
For this reason, x[omp_get_thread_num()] is an anti-pattern. Sure, you can leverage your new knowledge by adding a stride in memory. But this also encodes hardware-specific properties (i.e. the cache line size) into your data structures, which can lead to nasty code that is difficult to understand and still has bad performance portability.
The idiomatic solution is to use either of the following:
If you are only interested in an aggregate, use a reduction clause, i.e.:
long x = 0;
#pragma omp parallel for reduction(+:x)
for(i = 0; i < 100000000; i++){
    x++;
}
// total sum is now in x
If you need individual values within the thread, just use a private variable, preferably implicitly by scope. Or, if you need particular initialization from outside the construct, use firstprivate.
#pragma omp parallel
{
    long local_x = 0; // implicitly private by scope!
    #pragma omp for
    for(i = 0; i < 100000000; i++) {
        local_x++;
    }
    // can now do something with the sum of the current thread
}
And if you need the per-thread results outside, you can just use the second form and write the result once:
#pragma omp parallel
{
    long local_x = 0; // implicitly private by scope!
    #pragma omp for
    for(i = 0; i < 100000000; i++) {
        local_x++;
    }
    x[omp_get_thread_num()] = local_x;
}
That's not to say that you never need to design a data structure with false sharing in mind. But it's not as common as you might think.

Related

chrono gives different measures for the same function

I am trying to measure the execution time.
I'm on Windows 10 and use the gcc compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
This is my code to measure the time of bp_w->insert().
The insert function works internally as follows (just pseudocode):
insert(){
    _load_node(node);
    // do something //
    _save_node(node, addr);
}

_save_node(n){
    ofstream file(name);
    file.write(n);
    file.close();
}

_load_node(n, addr){
    ifstream file(name);
    file.read_from(n, addr);
    file.close();
}
The actual results are below:
read is the number of _load_node executions.
write is the number of _save_node executions.
time is in nanoseconds.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
I don't have any idea why these results come out this way and would like to know.
What you are trying to measure is ill-defined.
"How long did this code take to run" can seem simple. In practice, though, do you mean "how many CPU cycles my code took" ? Or how many cycles between my program and the other running programs ? Do you account for the time to load/unload it on the CPU ? Do you account for the CPU being throttled down when on battery ? Do you want to account for the time to access the main clock located on the motherboard (in terms of computation that is extremely far).
So, in practice timing will be affected by a lot of factors and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Micros, maybe. Millis, certainly.
So, that leaves you in a position where any measurement will fluctuate a lot. The sane way is to average it out over multiple measurement. Or, even better, do the same operation (on different data) a thousand (million?) times and divide the results by a thousand.
Then, you'll get significant improvement on accuracy.
In code:
start_t = chrono::system_clock::now();
for(int i = 0; i < 1000000; i++)
    tree->insert();
end_t = chrono::system_clock::now();
You are using the wrong clock. system_clock is not useful for timing intervals due to low resolution and its non-monotonic nature.
Use steady_clock instead. It is guaranteed to be monotonic and to have a high enough resolution to be useful.
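Putting both suggestions together, a minimal sketch might look like this (the doWork stand-in and the iteration count are illustrative, not from the original post):

#include <chrono>
#include <iostream>

volatile long sink = 0;                 // stand-in for the real work (e.g. tree->insert())
void doWork() { ++sink; }

int main() {
    const int iterations = 1000000;
    auto start = std::chrono::steady_clock::now();   // monotonic, intended for intervals
    for (int i = 0; i < iterations; ++i)
        doWork();
    auto end = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "average: " << total.count() / iterations << " ns per call\n";
}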

volatile increments with false sharing run slower in release than in debug when 2 threads are sharing the same physical core

I'm trying to test the performance impact of false sharing. The test code is as below:
#include <cstdint>
#include <chrono>
#include <iostream>
#include <thread>
#include <windows.h>
using namespace std;

constexpr uint64_t loop = 1000000000;

struct no_padding_struct {
    no_padding_struct() :x(0), y(0) {}
    uint64_t x;
    uint64_t y;
};

struct padding_struct {
    padding_struct() :x(0), y(0) {}
    uint64_t x;
    char padding[64];
    uint64_t y;
};

alignas(64) volatile no_padding_struct n;
alignas(64) volatile padding_struct p;

constexpr uint64_t core_a = 0;
constexpr uint64_t core_b = 1;

void func(volatile uint64_t* addr, uint64_t b, uint64_t mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
    for (uint64_t i = 0; i < loop; ++i) {
        *addr += b;
    }
}

void test1(uint64_t a, uint64_t b) {
    thread t1{ func, &n.x, a, 1 << core_a };
    thread t2{ func, &n.y, b, 1 << core_b };
    t1.join();
    t2.join();
}

void test2(uint64_t a, uint64_t b) {
    thread t1{ func, &p.x, a, 1 << core_a };
    thread t2{ func, &p.y, b, 1 << core_b };
    t1.join();
    t2.join();
}

int main() {
    uint64_t a, b;
    cin >> a >> b;
    auto start = std::chrono::system_clock::now();
    //test1(a, b);
    //test2(a, b);
    auto end = std::chrono::system_clock::now();
    cout << (end - start).count();
}
The results were mostly as follows:
                   x86                                       x64
cores   test1            test2             cores   test1            test2
        debug  release   debug  release            debug  release   debug  release
0-0     4.0s   2.8s      4.0s   2.8s       0-0     2.8s   2.8s      2.8s   2.8s
0-1     5.6s   6.1s      3.0s   1.5s       0-1     4.2s   7.8s      2.1s   1.5s
0-2     6.2s   1.8s      2.0s   1.4s       0-2     3.5s   2.0s      1.4s   1.4s
0-3     6.2s   1.8s      2.0s   1.4s       0-3     3.5s   2.0s      1.4s   1.4s
0-5     6.5s   1.8s      2.0s   1.4s       0-5     3.5s   2.0s      1.4s   1.4s
My CPU is an Intel Core i7-9750H. 'core0' and 'core1' belong to the same physical core, as do 'core2' and 'core3' and so on. MSVC 14.24 was used as the compiler.
The time recorded was an approximate value of the best score over several runs, since there were tons of background tasks. I think this is fair enough, since the results can be clearly divided into groups and a 0.1s~0.3s error does not affect that division.
Test2 was quite easy to explain. As x and y are in different cache lines, running on 2 physical cores gives a 2x performance boost (the cost of context switching when running 2 threads on a single core is negligible here). Running on one core with SMT is less efficient than 2 physical cores, limited by the throughput of Coffee Lake (I believe Ryzen can do slightly better), but more efficient than temporal multithreading. It seems 64-bit mode is more efficient here.
But the result of test1 is confusing to me.
First, in debug mode, 0-2, 0-3, and 0-5 are slower than 0-0, which makes sense. I explained this as certain data being moved from L1 to L3 and L3 to L1 repeatedly, since the cache must stay coherent between the 2 cores, while the data would always stay in L1 when running on a single core. But this theory conflicts with the fact that the 0-1 pair is always the slowest: technically, those two threads share the same L1 cache, so 0-1 should run twice as fast as 0-0.
Second, in release mode, 0-2, 0-3, and 0-5 were faster than 0-0, which disproved the theory above.
Last, 0-1 runs slower in release than in debug in both 64-bit and 32-bit mode. That is the part I can't understand at all. I read the generated assembly code and did not find anything helpful.
@PeterCordes Thank you for your analysis and advice.
I finally profiled the program using VTune, and it turns out your expectation was correct.
When running on SMT threads of the same core, machine_clear consumes lots of time, and it is more severe in Release than in Debug. This happens in both 32-bit and 64-bit mode.
When running on different physical cores, the bottleneck is memory (store latency and false sharing), and Release is always faster since it contains significantly fewer memory accesses than Debug in the critical part, as shown in the Debug assembly (godbolt) and Release assembly (godbolt). The total number of instructions retired is also lower in Release, which strengthens this point. It seems the assembly I found in Visual Studio yesterday was not correct.
This might be explained by hyper-threading. Two logical cores sharing one physical core do not get double the throughput that 2 entirely separate cores might. Instead you might get something like 1.7 times the performance.
Indeed, your processor has 6 cores and 12 threads, and core0/core1 are 2 threads on the same underlying core, if I am reading all this correctly.
In fact, if you picture in your mind how hyper-threading works, with the work of the 2 logical cores interleaved on one physical core, it is not surprising.

Verify the number of times a CUDA kernel is called

Say you have a cuda kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However, now say that for debugging purposes, when you're calling the kernel millions of times, you want to verify that you're actually calling the kernel that many times.
What I did was pass a pointer to the kernel and increment the value it points to every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
    kernelCount[0]++; // add to the counter
}
However, when I copied that value back to the main function, I got "2".
At first it baffled me; then, after 5 minutes of coffee and pacing back and forth, I realized this probably makes sense because the CUDA kernel runs 1024 instances of itself at the same time, which means the kernels overwrite kernelCount[0] instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
    // Get the id of the kernel
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // If the id is bigger than the stored count, overwrite it
    if(id > kernelCount[0]){
        kernelCount[0] = id;
    }
}
Genius!! This was guaranteed to work I thought. Until I ran it and got all sorts of numbers between 0 and 2000.
Which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?
Assuming this is a simplified example, and you are not in fact trying to do profiling as others have already suggested, but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which will ensure that the increment operation is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
    atomicAdd(kernelCount, 1); // atomically add 1 to the counter
}
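A rough host-side sketch of how the counter might be allocated, zeroed, and read back (error checking omitted; the variable names are illustrative, and the kernel from above is repeated so the sketch compiles on its own):

#include <cstdio>

__global__ void run2048Times(int *kernelCount){
    atomicAdd(kernelCount, 1);
}

int main(){
    int *d_count = nullptr;
    int h_count = 0;
    cudaMalloc(&d_count, sizeof(int));                   // allocate the counter on the device
    cudaMemset(d_count, 0, sizeof(int));                 // zero it before launching
    run2048Times<<<2, 1024>>>(d_count);
    cudaDeviceSynchronize();                             // wait for the kernel to finish
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kernel body executed %d times\n", h_count);  // expected: 2048
    cudaFree(d_count);
    return 0;
}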
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see here for description of PTX instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases its local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases its local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
Your second solution is wrong even if the threads are launched sequentially because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
    // Get the id of the kernel
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // If the id is bigger than the stored count, overwrite it
    if(id + 1 > kernelCount[0]){
        kernelCount[0] = id + 1;
    }
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
@%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;
It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.

What can prevent multiprocessing from improving speed - OpenMP?

I am scanning through every permutation of vectors and I would like to multithread this process (each thread would scan all the permutations of some vectors).
I managed to extract code that does not speed up (I know it does not do anything useful, but it reproduces my problem):
#include <algorithm>
#include <string>
#include <vector>
#include <omp.h>

int main(int argc, char *argv[]){
    std::vector<std::string *> myVector;
    for(int i = 0; i < 8; ++i){
        myVector.push_back(new std::string("myString" + std::to_string(i)));
    }
    std::sort(myVector.begin(), myVector.end());
    omp_set_dynamic(0);
    omp_set_num_threads(8);
    #pragma omp parallel for shared(myVector)
    for(int i = 0; i < 100; ++i){
        std::vector<std::string*> test(myVector);
        do{
            // here is a permutation
        } while(std::next_permutation(test.begin(), test.end())); // tests all the permutations of this combination
    }
    return 0;
}
The results are:
1 thread : 15 seconds
2 threads : 8 seconds
4 threads : 15 seconds
8 threads : 18 seconds
16 threads : 20 seconds
I am working with an i7 processor with 8 cores. I can't understand how it could be slower with 8 threads than with 1... I don't think the cost of creating new threads is higher than the cost of going through 40320 permutations, so what is happening?
Thanks to everyone's help, I finally managed to find the answer.
There were two problems :
A quick performance profile showed that most of the time was spent in std::_Lockit, which is used by Visual Studio's debug-mode iterator checking. To prevent that, just add these options to the command line: /D "_HAS_ITERATOR_DEBUGGING=0" /D "_SECURE_SCL=0" (a sketch of setting the same macros in source follows after this list). That was why adding more threads resulted in a loss of time.
Turning optimization on also helped improve the performance.
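For reference, a sketch of setting the same macros in source instead of on the command line; this assumes no precompiled header pulls in the standard headers first, since the defines must appear before any standard library include (the include list shown is illustrative):

// Must be defined before any standard library header is included.
#define _HAS_ITERATOR_DEBUGGING 0
#define _SECURE_SCL 0

#include <algorithm>
#include <string>
#include <vector>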

No speedup for vector sums with threading

I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to MKL or ACML depending on the platform. A lot of these matrix calculations operate on different, independent matrices, and hence I use std::thread to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy BLAS routine. It seems that if two threads are using this routine in parallel, each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problem as the BLAS routine: doubling the number of threads running in parallel also doubles the amount of time each thread needs, so no speed-up is gained. However, using the Intel C++ Compiler this problem vanishes: with an increasing number of threads, the time a single thread needs stays constant.
However, I need to compile on systems where no Intel compiler is available as well. So my questions are: why is there no speed-up with gcc, and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread

#include <iostream>
#include <thread>
#include <vector>

#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"

void simplesum(double* a, double* b, std::size_t dim);

int main() {
    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };
        std::vector<std::size_t> times(num_threads, 0);

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];

            for (std::size_t i {0}; i < N; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }

            boost::posix_time::ptime now1 =
                boost::posix_time::microsec_clock::universal_time();

            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }

            boost::posix_time::ptime now2 =
                boost::posix_time::microsec_clock::universal_time();
            boost::posix_time::time_duration dur = now2 - now1;
            times[tid] += dur.total_milliseconds();

            delete[] pA;
            delete[] pB;
        };

        std::vector<std::thread> mythreads;

        // start threads
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }

        // wait for threads to finish
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads
                      << " took " << times[n] << "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){
    for(std::size_t i{0}; i < dim; ++i)
    {
        *(++a) += *(++b);
    }
}
The output with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The output with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with icc the time needed for one thread to perform the computations is constant (as I would have expected; my CPU has 4 physical cores), and with gcc the time for one thread increases. Replacing the simplesum routine with BLAS daxpy yields the same results for icc and gcc (no surprise, as most of the time is spent in the library), which are almost the same as the gcc results stated above.
The answer is fairly simple: Your threads are fighting for memory bandwidth!
Consider that each addition in your loop performs two reads (a and b) and one store (the result back to a), i.e. the kernel is almost pure memory traffic. Most modern systems providing multiple CPUs actually have to share the memory controller among several cores.
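As a rough back-of-the-envelope check (assuming the arrays do not stay resident in the caches between passes): each thread streams two arrays of 936 × 936 = 876,096 doubles, about 6.7 MiB each, and every addition moves roughly 3 × 8 = 24 bytes. A thousand passes therefore generate on the order of 876,096 × 24 × 1000 ≈ 21 GB of memory traffic per thread, and every additional thread adds the same demand on the one shared memory interface.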
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(++a) += std::exp(*(++b));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
But what about ICC?
First, ICC inlines simplesum. Proving that inlining happens is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
To
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline the library routine, and therefore whatever else lets ICC perform better than g++ here cannot happen.
It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it
Loads two doubles
Adds one to the other 1000 times (or even performs a += 1000 * b)
Stores the result back
This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
And now, let us exchange
for (std::size_t n{0}; n < 1000; ++n){
    simplesum(pA, pB, dim);
}
with the version in which the inner loop was made the outer loop:
double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n{0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, even though doing so improves performance by 200x in the single-threaded and 15x in the 8-threaded case. This is due to the fact that the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.
How can we force g++ to perform this optimization?
Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are significantly stacked in its favor.
Instead of std::size_t, all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough. (A combined sketch follows below.)
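Here is a small self-contained version with all three changes applied (the file name, initial values, and the final printf are illustrative; whether a given g++ version actually performs the interchange is best verified by timing or by inspecting the assembly):

// g++ -std=c++11 -O3 -floop-interchange interchange.cpp -o interchange
#include <cstdio>
#include <vector>

// int bounds and indexed access instead of size_t bounds and pointer increments
void simplesum(double* a, double* b, int dim){
    for (int i = 0; i < dim; ++i)
        a[i] += b[i];
}

int main(){
    const int dim = 936 * 936;
    std::vector<double> a(dim, 1.0), b(dim, 2.0);
    for (int n = 0; n < 1000; ++n)      // the repetition loop also uses an int bound
        simplesum(a.data(), b.data(), dim);
    std::printf("%f\n", a[0]);          // keep the result observable
    return 0;
}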
When all three criteria are met, the g++ performance is similar to what ICC delivers:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: The version of g++ used in this experiment was 4.9.0 on x64 Arch Linux.
OK, I came to the conclusion that the main problem is that the processor cores act on different parts of memory in parallel, and hence I assume one has to deal with lots of cache misses, which slows the process down further. Putting the actual sum function in a critical section,
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache misses, but of course does not yield an optimal speed-up. Anyway, because the other threads are now blocked, the simplesum method might as well use all available threads for the sum:
void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){
    omp_set_num_threads(numberofthreads);
    #pragma omp parallel
    {
        #pragma omp for
        for(std::size_t i = 0; i < dim; ++i)
        {
            a[i] += b[i];
        }
    }
}
In this case all the threads work on the same chunk of memory: it should be in the processor cache, and if the processor needs to load some other part of memory into its cache, the other threads benefit from this as well (depending on whether this is the L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).
I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on some loop switching tricks which I cannot do in my actual code.