I started working with OpenMP using C++.
I have two questions:
What is #pragma omp for schedule?
What is the difference between dynamic and static?
Please, explain with examples.
Others have since answered most of the question, but I would like to point to some specific cases where a particular scheduling type is more suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have a great impact on the speed of the application.
static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing with static scheduling is that the OpenMP run-time guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory locations faster since they will reside on the same NUMA node.
Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with a 4-core CPU in each socket. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:
| | core 0 | thread 0 |
| socket 0 | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
| | core 3 | thread 3 |
| | core 4 | thread 4 |
| socket 1 | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
| | core 7 | thread 7 |
Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:
char *a = (char *)malloc(8*4096);
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 0, 4096);
4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous but only in virtual memory. In physical memory half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is so because different parts are zeroed by different threads and those threads reside on different cores and there is something called first touch NUMA policy which means that memory pages are allocated on the NUMA node on which the thread that first "touched" the memory page resides.
| | core 0 | thread 0 | a[0] ... a[4095]
| socket 0 | core 1 | thread 1 | a[4096] ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192] ... a[12287]
| | core 3 | thread 3 | a[12288] ... a[16383]
| | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1 | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
| | core 7 | thread 7 | a[28672] ... a[32767]
Now let's run another loop like this:
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 1, 4096);
Each thread will access the already mapped physical memory and it will have the same mapping of thread to memory region as the one during the first loop. It means that threads will only access memory located in their local memory blocks which will be fast.
Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" iteration space into blocks of two iterations and there will be 4 such blocks in total. What will happen is that we will have the following thread to memory location mapping (through the iteration number):
| | core 0 | thread 0 | a[0] ... a[8191] <- OK, same memory node
| socket 0 | core 1 | thread 1 | a[8192] ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
| | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory
| | core 4 | thread 4 | <idle>
| socket 1 | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
| | core 7 | thread 7 | <idle>
Two bad things happen here:
threads 4 to 7 remain idle and half of the compute capability is lost;
threads 2 and 3 access non-local memory and it will take them about twice as much time to finish during which time threads 0 and 1 will remain idle.
So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.
dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely would) produce completely different "iteration space" -> "threads" mappings as one can easily verify:
$ cat dyn.c
#include <stdio.h>
#include <omp.h>
int main (void)
{
int i;
#pragma omp parallel num_threads(8)
{
#pragma omp for schedule(dynamic,1)
for (i = 0; i < 8; i++)
printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());
#pragma omp for schedule(dynamic,1)
for (i = 0; i < 8; i++)
printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
}
return 0;
}
$ icc -openmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4
(same behaviour is observed when gcc is used instead)
If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. This fact is often overlooked and hence suboptimal performance is achieved.
There is another reason to choose between static and dynamic scheduling - workload balancing. If the time to complete each iteration differs vastly from the mean, then a high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first one, and hence for 2/3 of the compute time the first thread will be idle. Dynamic scheduling introduces some additional overhead but in that particular case leads to a much better workload distribution. A special kind of dynamic scheduling is guided scheduling, where smaller and smaller iteration blocks are given to each thread as the work progresses.
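A minimal sketch of such an imbalanced loop (the per-iteration cost grows with the iteration number, as in the example above); with schedule(static) the thread that owns the high iterations dominates the run time, while schedule(dynamic) or schedule(guided) hands out chunks as threads become free at the cost of some scheduling overhead:
#include <stdio.h>
#include <omp.h>
int main(void)
{
    enum { N = 10000 };
    static double out[N];
    // Iteration i does O(i) work, so the total work is triangular.
    // With schedule(static) one thread would end up with most of the work;
    // schedule(dynamic,16) or schedule(guided) balances it at run time.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j <= i; j++)   // cost grows linearly with i
            s += 0.5 * j;
        out[i] = s;
    }
    printf("out[%d] = %g\n", N - 1, out[N - 1]);
    return 0;
}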
Since precompiled code could be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This makes it possible to test different scheduling types without recompiling the application and also allows the end user to fine-tune for his or her platform.
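For example, the following small program (a sketch, using the standard OMP_SCHEDULE syntax of "type[,chunk]") can be compiled once and then steered from the environment, e.g. OMP_SCHEDULE="static,4" ./a.out or OMP_SCHEDULE="guided" ./a.out:
// The schedule is picked at run time from OMP_SCHEDULE.
#include <stdio.h>
#include <omp.h>
int main(void)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 16; i++)
        printf("iter %2d ran on thread %d\n", i, omp_get_thread_num());
    return 0;
}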
I think the misunderstanding comes from the fact that you are missing the point of OpenMP.
In a sentence, OpenMP allows you to execute your program faster by enabling parallelism.
In a program parallelism can be enabled in many ways, and one of them is by using threads.
Suppose you have an array:
[1,2,3,4,5,6,7,8,9,10]
and you want to increment every element of this array by 1.
If you are going to use
#pragma omp for schedule(static, 5)
it means that each thread will be assigned a chunk of 5 contiguous iterations. In this case the first thread will take 5 numbers, the second one will take another 5, and so on until there is no more data to process or the maximum number of threads is reached (typically equal to the number of cores). The division of work is fixed up front rather than decided while the loop runs.
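A minimal sketch of that situation, reusing the 10-element array and chunk size from above:
#include <omp.h>
int main(void)
{
    int v[10] = {1,2,3,4,5,6,7,8,9,10};
    // With schedule(static, 5) and at least two threads, thread 0 gets
    // iterations 0..4 and thread 1 gets iterations 5..9.
    #pragma omp parallel for schedule(static, 5)
    for (int i = 0; i < 10; i++)
        v[i] += 1;
    return 0;
}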
In case of
#pragma omp for schedule(dynamic, 5)
The work will still be shared amongst the threads, but chunks are handed out to threads as they become free at run time, which involves more overhead. The second parameter specifies the size of each chunk of data.
Not being very familiar with OpenMP, I would venture that the dynamic type is more appropriate when the compiled code is going to run on a system with a different configuration than the one on which the code was compiled.
I would recommend the page below, which discusses techniques used for parallelizing code, along with their preconditions and limitations:
https://computing.llnl.gov/tutorials/parallel_comp/
Additional links:
http://en.wikipedia.org/wiki/OpenMP
Difference between static and dynamic schedule in openMP in C
http://openmp.blogspot.se/
The loop partitioning scheme is different. The static scheduler would divide a loop over N elements into M subsets, and each subset would then contain roughly N/M elements.
The dynamic approach calculates the size of the subsets on the fly, which can be useful if the subsets' computation times vary.
The static approach should be used if computation times do not vary much.
Related
For a university project we are implementing an algorithm capable of brute-forcing an AES key that we assume is partially known.
We have implemented several versions including one that exploits the multithreading mechanism in C++.
The implementation works by allocating a variable number of threads, passed as input at launch, and dividing the key space equally among the threads; each thread then cycles through its range attempting each key. In practice the implementation works, as it succeeds in finding the key for any combination of #bitsToHack/#threads, but it returns strange timing results.
//Structs for threads and respective data
pthread_t threads[num_of_threads];
struct bf_data td[num_of_threads];
int rc;

//Space division
uintmax_t index = pow(BASE_NUMBER, num_bits_to_hack);
uintmax_t step = index/num_of_threads;

if(sem_init(&s, 1, 0) != 0){
    printf("Error during semaphore initialization\n");
    return -1;
}

for(int i = 0; i < num_of_threads; i++){
    //Structure initialization
    td[i].ciphertext = ciphertext;
    td[i].hacked_key = hacked_key;
    td[i].iv_aes = iv_aes;
    td[i].key = key_aes;
    td[i].num_bits_to_hack = num_bits_to_hack;
    td[i].plaintext = plaintext;
    td[i].starting_point = step*i;
    td[i].step = step;
    td[i].num_of_threads = num_of_threads;
    if(DEBUG)
        printf("Starting point for thread %d is: %lu, using step: %lu\n", i, td[i].starting_point, td[i].step);
    rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);
    if (rc){
        cout << "Error:unable to create thread," << rc << endl;
        exit(-1);
    }
}

sem_wait(&s);

for(int i = 0; i < num_of_threads; i++){
    pthread_join(threads[i], NULL);
}
For the decryption_brute_force function (The body of each thread):
void* decryption_brute_force(void* data){
** Copy data on local thread memory
** Build the key to begin the search from starting point
** for each key from starting_point to starting_point + step
** Try decryption
** if obtained plaintext corresponds to the expected one
** Print results, wake up main thread and terminate
** else
** increment the key and continue
}
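A rough skeleton of that thread body, with try_key() standing in as a hypothetical helper for the actual AES decryption and plaintext comparison (not the real implementation):
// hypothetical helper: builds the candidate key from k, decrypts the
// ciphertext, and compares the result with the expected plaintext
bool try_key(const struct bf_data* d, uintmax_t k);

void* decryption_brute_force(void* data)
{
    // copy the shared descriptor onto this thread's stack
    struct bf_data local = *(struct bf_data*)data;

    for (uintmax_t k = local.starting_point;
         k < local.starting_point + local.step; ++k) {
        if (try_key(&local, k)) {
            // print results, wake up the main thread and terminate
            sem_post(&s);
            break;
        }
    }
    return NULL;
}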
To conclude the project we intended to conduct a study of the optimal number of threads, expecting an increase in performance as the number of threads increased up to a threshold, after which the system would no longer benefit from additional threads.
At the end of the analysis (a simulation lasting about 9 hours), the results were as shown in the linked plot of execution time versus the number of threads.
We cannot understand why 8 threads perform better than 16. Could it be due to the CPU architecture? Could it schedule 32 and 8 threads better than 16?
From the comments, I think the linear-search pattern in each thread yields different results for different numbers of threads. When you double the threads, the point a thread has to scan up to before finding the key may shift further into its range; but once you double again, it cannot shift much further because there are already too many threads. This matters because you said you always use the same encrypted data. Did you try different inputs?
the step is an integer (so the distribution may not be exact)
^
8 threads & step=7 (56 work total)
index 16 (0-based)
v
01234567 89abcdef 01234567 89abcdef
| | |. | ...
500 seconds, as it's the first loop iteration
16 threads & step=3 (56 work total)
index 16 again, but at the second iteration now
v
012 345 678 9ab cde f01 234 567 8
| | | | | | . | | | ...
1000 seconds, as it is found only after the second iteration in the thread
Another example with 2 threads and 3 threads:
x is to be found at the 51st element of 100 elements of work:
2 threads
| |x(1st iteration) |
3 threads
| |........x | |
5x slower than 2 threads
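A quick way to convince yourself of this effect is to compute, for a fixed target position, how many iterations the owning thread needs before it reaches the target for various thread counts (a standalone sketch, not part of the project code; the numbers are made up):
#include <stdio.h>
int main(void)
{
    const unsigned long long key_space = 100;  // e.g. 100 candidate keys
    const unsigned long long target    = 50;   // 0-based position of the real key

    for (int threads = 1; threads <= 16; threads *= 2) {
        unsigned long long step  = key_space / threads;
        unsigned long long owner = target / step;          // thread that owns the key
        unsigned long long iters = target - owner * step;  // iterations until it is hit
        printf("%2d threads: thread %llu finds the key after %llu iterations\n",
               threads, owner, iters + 1);
    }
    return 0;
}
The wall-clock time is governed by how many keys the owning thread tries before hitting the target, not by the total number of threads, which is why more threads can occasionally be slower.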
Say you have a CUDA kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However, now say that for debugging purposes, when you're calling the kernel millions of times, you want to verify that you're actually calling the kernel that many times.
What I did was pass a pointer to the kernel and increment the value it points to every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
kernelCount[0]++; // Add to the pointer
}
However, when I copied that value back to the main function I got "2".
At first it baffled me, then after 5 minutes of coffee and pacing back and forth I realized this probably makes sense, because the CUDA kernel is running 1024 instances of itself at the same time, which means that the kernels overwrite kernelCount[0] instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id > kernelCount[0]){
kernelCount[0] = id;
}
}
Genius!! This was guaranteed to work I thought. Until I ran it and got all sorts of numbers between 0 and 2000.
Which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?
Assuming this is a simplified example, and you are not in fact trying to do profiling as others have already suggested, but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which will ensure that the increment operation is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
atomicAdd(kernelCount, 1); // Add to the pointer
}
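For completeness, a minimal host-side harness along these lines (the surrounding names are made up for the sketch) should read back 2048:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void run2048Times(int *kernelCount){
    atomicAdd(kernelCount, 1);          // every thread adds 1 atomically
}

int main(){
    int *kernelCount = nullptr;
    cudaMalloc((void**)&kernelCount, sizeof(int));
    cudaMemset(kernelCount, 0, sizeof(int));

    run2048Times<<<2, 1024>>>(kernelCount);

    int hostCount = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish
    cudaMemcpy(&hostCount, kernelCount, sizeof(int), cudaMemcpyDeviceToHost);
    printf("threads counted: %d\n", hostCount);   // expected: 2048

    cudaFree(kernelCount);
    return 0;
}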
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see here for description of PTX instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases its local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases its local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
Your second solution is wrong even if the threads are launched sequentially because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id + 1 > kernelCount[0]){
kernelCount[0] = id + 1;
}
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
@%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;
It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.
I have a simple C++ code that runs a default sin function across a vector of values.
static void BM_sin() {
int data_size = 100000000;
double lower_bound = 0;
double upper_bound = 1;
random_device device;
mt19937 engine(device());
uniform_real_distribution<double> distribution(lower_bound, upper_bound);
auto generator = bind(distribution, engine);
vector<double> data(data_size);
generate(begin(data), end(data), generator);
#pragma omp parallel for
for(int i = 0; i < data_size; ++i) {
data[i] = sin(data[i]);
}
cout << accumulate(data.begin(), data.end(), 0) << endl;
}
I get the same time when I run this function with OMP_NUM_THREADS set to 1 and to 8, having 8 cores. Commenting out the #pragma omp parallel for line does not help either. So I wonder why the sine applied to a vector from all threads is as fast as when applied from one thread?
(I compile with -Ofast -fopenmp on gcc-4.8)
Simple answer is simple:
Not all things scale well. I don't know fast_sin, but it's possible it's mainly memory-bandwidth limited. In that case, you'll win nothing by distributing the workload across cores.
Also, I doubt your measuring methods. If your generator is the mt19937, it's a lot more complex than your sine, so parallelizing your sine doesn't do much, because most of the time is spent generating random numbers.
You are measuring something wrongly. The generator loop is slow, but not that slow that it completely overshadows the sine loop. Here are the results of measuring the execution speed of several code parts on two different Intel architectures:
Code part | WM (x64) | WM (x86) | SB (x64) | SB (x86)
-----------------------+----------+----------+----------+----------
generate() | 1,45 s | 2,44 s | 1,28 s | 2,18 s
sine loop (serial) | 2,17 s | 2,88 s | 1,80 s | 2,91 s
sine loop (6 threads) | 0,37 s | 0,51 s | 0,31 s | 0,52 s
accumulate() | 0,31 s | 0,70 s | 0,33 s | 0,67 s
-----------------------+----------+----------+----------+----------
speed-up: overall | 1,85x | 1,65x | 1,78x | 1,71x
speed-up: sine loop | 5,86x | 5,65x | 5,81x | 5,60x
speed-up: Amdahl | 2,23x | 1,92x | 2,12x | 2,02x
In the above table, WM stands for Intel X5675, a Westmere CPU, while SB stands for Intel E5-2650, a Sandy Bridge CPU. x64 stands for 64-bit mode and x86 - for 32-bit mode. GCC 4.8.5 was used with -Ofast -fopenmp -mtune=native (-m32 for 32-bit mode). Both systems are running CentOS 7.2. The execution times are only approximate, as I haven't done proper timing by taking the average of multiple executions. Timing was done using the portable omp_get_wtime() timer routine.
As you can see, the overall speed-up with 6 threads ranges from 1,65x to 1,85x, while the speed-up for the sine loop alone ranges from 5,60x to 5,86x. Both the generator loop and the accumulator loop are performed in serial, which caps the parallel speed-up (see Amdahl's law).
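For reference, the "Amdahl" row appears to be the asymptotic bound obtained from the usual formula, with p the sine loop's share of the total serial run time:

speed-up(n) = 1 / ((1 - p) + p / n),   speed-up(infinity) = 1 / (1 - p)

e.g. for WM (x64), p = 2,17 / (1,45 + 2,17 + 0,31) ≈ 0,55, which gives 1 / (1 - 0,55) ≈ 2,23x.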
Two things to note here. First, the generator loop could be a tad faster if the memory for the vector is pre-faulted. That basically means sweeping over the vector and touching every memory page that backs it. Running the generator loop twice and only timing the second invocation will also do the trick. On my systems this brings no noticeable advantage (the savings are on the same order as the measurement error), most likely because CentOS's kernel has transparent huge pages turned on by default.
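A pre-faulting sweep could be as simple as the following sketch (assuming 4 KiB pages and that it runs before the call to generate()):
// touch one double per (assumed) 4 KiB page so that the physical pages
// are already mapped when the timed loops run
const std::size_t doubles_per_page = 4096 / sizeof(double);
for (std::size_t i = 0; i < data.size(); i += doubles_per_page)
    data[i] = 0.0;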
The second thing is the last parameter to accumulate() is an integer 0, therefore the algorithm is forced to perform an integer conversion every time, which slows it down considerably and gives the wrong result at the end (0). accumulate(data.begin(), data.end(), 0.0) executes ten times faster and also produces the correct result.
I have a periodic Cartesian grid of MPI processes; for 6 processes the layout looks like this:
__________________________
| | | |
| 0 | 1 | 2 |
|_______|________|_______ |
| | | |
| 3 | 4 | 5 |
|_______|________|________|
where the numbers are the ranks of the proc's in the communicator.
During the calculation every process has to send a number to its left neighbour, and this number should be summed with the one that the left neighbour already has:
int a[2];
a[0] = calculateSomething1();
a[1] = calculateSomething2();
int tempA;
MPI_Request recv_request, send_request;
//Receive from the right neighbour
MPI_Irecv(&tempA, 1, MPI_INT, myMPI.getRightNeigh(), 0, Cart_comm, &recv_request);
//Send to the left neighbour
MPI_Isend(&a[0], 1, MPI_INT, myMPI.getLeftNeigh(), 0, Cart_comm, &send_request);
MPI_Status status;
MPI_Wait(&recv_request, &status);
MPI_Wait(&send_request, &status);
//now I have to do something like this
a[1] += tempA;
I am wondering if there is a sort of "local" reduction operation for just a sender-receiver pair, or whether the only solution is to create "local" communicators and use collective operations there.
You can use MPI_Sendrecv in this case; it is made for exactly this pattern.
I don't think you would get any benefit from using collectives.
BTW: your code is probably not correct. You are sending from a local stack variable &a[0]. You must complete the send_request communication before exiting the scope and reusing a's memory. This is done by some form of MPI_Wait(all) or a successful MPI_Test.
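Adapted to the code in the question, a minimal MPI_Sendrecv sketch might look like this (keeping the same neighbour helpers and communicator from the question):
int a[2];
a[0] = calculateSomething1();
a[1] = calculateSomething2();

int tempA;
MPI_Status status;

// Send a[0] to the left neighbour and, in the same call, receive the
// corresponding value coming from the right neighbour.
MPI_Sendrecv(&a[0], 1, MPI_INT, myMPI.getLeftNeigh(),  0,
             &tempA, 1, MPI_INT, myMPI.getRightNeigh(), 0,
             Cart_comm, &status);

a[1] += tempA;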
I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to the MKL or ACML depending on the platform. A lot of these matrix calculations operate on different independent matrices and hence I use std::thread's to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy Blas routine. It seems that if two threads are using this routine in parallel each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problems as the BLAS routine: doubling the number of threads running parallel also doubles the amount of time each threads needs, so no speed-up is gained. However, using the Intel C++ Compiler this problems vanishes: with increasing number of threads the time a single thread needs is constant.
However, I need to compile as well on systems where no Intel compiler is available. So my questions are: why is there no speed-up with the gcc and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
const std::size_t N { 936 };
std::vector <std::size_t> times(num_threads, 0);
auto threadfunction = [&](std::size_t tid)
{
const std::size_t dim { N * N };
double* pA = new double[dim];
double* pB = new double[dim];
for (std::size_t i {0}; i < N; ++i){
pA[i] = i;
pB[i] = 2*i;
}
boost::posix_time::ptime now1 =
boost::posix_time::microsec_clock::universal_time();
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
boost::posix_time::ptime now2 =
boost::posix_time::microsec_clock::universal_time();
boost::posix_time::time_duration dur = now2 - now1;
times[tid] += dur.total_milliseconds();
delete[] pA;
delete[] pB;
};
std::vector <std::thread> mythreads;
// start threads
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads.emplace_back(threadfunction, n);
}
// wait for threads to finish
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads[n].join();
std::cout << " Thread " << n+1 << " of " << num_threads
<< " took " << times[n]<< "msec" << std::endl;
}
}
}
void simplesum(double* a, double* b, std::size_t dim){
for(std::size_t i{0}; i < dim; ++i)
{*(++a) += *(++b);}
}
The output with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The output with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with the icc the time needed for one thread to perform the computations is constant (as I would have expected; my CPU has 4 physical cores) and with the gcc the time for one thread increases. Replacing the simplesum routine by BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the above stated gcc results.
The answer is fairly simple: Your threads are fighting for memory bandwidth!
Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple cpus actually have to share the memory controller among several cores.
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(++a) += std::exp(*(++b));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
But what about ICC?
First, ICC inlines simplesum. Proving that inlining happens is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
To
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline it and therefore no matter what else causes ICC to perform better than g++, it will not happen.
It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it
Loads two doubles
Adds them 1000 times (or even performs a += 1000 * b)
Stores two doubles
This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
And now, let us exchange
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
with the version in which the inner loop was made the outer loop:
double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
double x = *a, y = *b;
for (std::size_t n{0}; n < 1000; ++n)
{
x += y;
}
*a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which would improve performance by 200x in the single-threaded and 15x in the 8-threaded cases. This is because the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.
How can we force g++ to perform this optimization?
Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are significantly stacked in its favor.
Instead of std::size_t all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough.
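Putting the three points together, the kernel might look like the following sketch (g++ must also still be able to inline simplesum, i.e. it stays in the same translation unit, and is invoked with -O3 -floop-interchange):
// simplesum rewritten to satisfy the criteria above: int bounds and
// indexed accesses instead of std::size_t and pointer bumping
void simplesum(double* a, double* b, int dim){
    for (int i = 0; i < dim; ++i)
        a[i] += b[i];
}

// call site inside the thread function, also with an int repeat counter,
// so that after inlining g++ can interchange the two loops
for (int n = 0; n < 1000; ++n){
    simplesum(pA, pB, dim);
}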
When all three criteria are met, the g++ performance is similar to what ICC delivers:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: The version of g++ used in this experiment is 4.9.0 on an x64 Arch Linux system.
OK, I came to the conclusion that the main problem is that the processor acts on different parts of the memory in parallel, and hence I assume that one has to deal with lots of cache misses, which slow the process down further. Putting the actual sum function in a critical section
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache misses, but of course does not yield optimal speed-up. Anyway, because the other threads are now blocked, the simplesum method might as well use all available threads for the sum:
void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){
omp_set_num_threads(numberofthreads);
#pragma omp parallel
{
#pragma omp for
for(std::size_t i = 0; i < dim; ++i)
{
a[i]+=b[i];
}
}
}
In this case all the threads work on the same chunk of memory: it should be in the processor cache, and if the processor needs to load some other part of the memory into its cache, the other threads benefit from this as well (depending on whether this is the L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).
I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on loop-switching tricks that I cannot apply in my actual code.