I am writing some code which is computationally expensive, but highly parallelisable. Once parallelised, I intend to run it on a HPC, however to keep the runtime down to within a week, the problem needs to scale well, with the number of processors.
Below is a simple and ludicrous example of what I am attempting to achieve, which is concise enough to compile and demonstrate my problem;
#include <iostream>
#include <ctime>
#include "mpi.h"
using namespace std;
double int_theta(double E){
double result = 0;
for (int k = 0; k < 20000; k++)
result += E*k;
return result;
}
int main()
{
int n = 3500000;
int counter = 0;
time_t timer;
int start_time = time(&timer);
int myid, numprocs;
int k;
double integrate, result;
double end = 0.5;
double start = -2.;
double E;
double factor = (end - start)/(n*1.);
integrate = 0;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
for (k = myid; k<n+1; k+=numprocs){
E = start + k*(end-start)/n;
if (( k == 0 ) || (k == n))
integrate += 0.5*factor*int_theta(E);
else
integrate += factor*int_theta(E);
counter++;
}
cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;
MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0)
cout<<result<<endl;
MPI_Finalize();
return 0;
}
I have compiled the problem on my quadcore laptop with
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 - lmkl_sequential -lmkl_core -lpthread -lm -ldl
and I get the following output;
$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08
$ mpirun -np 3 ./a.out
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08
$ mpirun -np 2 ./a.out
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08
To me it appears that there must be a barrier somewhere that I am not aware of. I get better performance with 2 processors over 3. Please can somebody offer any advice? Thanks
If I read the output of lscpu you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you are having 4 logical CPUs, but only 2 hardware cores (1 socket x 2 cores per socket).
Using cat /proc/cpuinfo you can finde the make and model for the CPU to maybe find out more.
The four logical CPUs might result from hyperthreading, which means that some hardware resources (e.g. the FPU unit, but I am not an expert on this) are shared between two cores. Thus, I would not expect any good parallel scaling beyond two processes.
For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores do get a better estimate.
From looking at your code, I would expect perfect scalability to any number of cores - At least as long as you do not include the time needed for process startup and the final MPI_Reduce. These will for sure become slower with more processes involved.
Related
i am trying to measure the execution time.
i'm on windows 10 and use gcc compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
this is my code to measure time about bp_w->insert()
the function insert work internally like follow (just pseudo code)
insert(){
_load_node(node);
// do something //
_save_node(node, addr);
}
_save_node(n){
ofstream file(name);
file.write(n);
file.close();
}
_load_node(n, addr){
ifstream file(name);
file.read_from(n, addr);
file.close();
}
the actual results is,
read is number of _load_node executions.
write is number of _save_node executions.
time is nano secs.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
i don't have any idea why this result come and want to know.
What you are trying to measure is ill-defined.
"How long did this code take to run" can seem simple. In practice, though, do you mean "how many CPU cycles my code took" ? Or how many cycles between my program and the other running programs ? Do you account for the time to load/unload it on the CPU ? Do you account for the CPU being throttled down when on battery ? Do you want to account for the time to access the main clock located on the motherboard (in terms of computation that is extremely far).
So, in practice timing will be affected by a lot of factors and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Micros, maybe. Millis, certainly.
So, that leaves you in a position where any measurement will fluctuate a lot. The sane way is to average it out over multiple measurement. Or, even better, do the same operation (on different data) a thousand (million?) times and divide the results by a thousand.
Then, you'll get significant improvement on accuracy.
In code:
start_t = chrono::system_clock::now();
for(int i = 0; i < 1000000; i++)
tree->insert();
end_t = chrono::system_clock::now();
You are using the wrong clock. system_clock is not useful for timing intervals due to low resolution and its non-monotonic nature.
Use steady_clock instead. it is guaranteed to be monotonic and have a low enough resolution to be useful.
I was reading Chapter 8 of the "Modern C++ Programming Cookbook, 2nd edition" on concurrency and stumbled upon something that puzzles me.
The author implements different versions of parallel map and reduce functions using std::thread and std::async. The implementations are really close; for example, the heart of the parallel_map functions are
// parallel_map using std::async
...
tasks.emplace_back(std::async(
std::launch::async,
[=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); }));
...
// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...
The complete code can be found here for std::thread and there for std::async.
What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async implementation. Moreover, the author acknowledge this fact as being obvious, without providing any hint of justification:
If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the fold function.
I ran the code above on my computer, and even though the differences are not as compelling as in the book, I find that the std::async implementation is indeed faster than the std::thread one. (The author also later brings in standard implementations of these algorithms, which are even faster). On my computer, the code runs with four threads, which corresponds to the number of physical cores of my CPU.
Maybe I missed something, but why is it obvious that std::async should run faster than std::thread on this example? My intuition was that std::async being a higher-level implementation of threads, it should take at least the same amount of time, if not more, than threads -- obviously I was wrong. Are those findings consistent, as suggested by the book, and what is the explanation?
My original interpretation was incorrect. Refer to #OznOg's answer below.
Modified Answer:
I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:
#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <iostream>
__thread volatile int you_shall_not_optimize_this;
void work() {
// This is the simplest way I can think of to prevent the compiler and
// operating system from doing naughty things
you_shall_not_optimize_this = 42;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
std::vector<std::optional<std::thread>> threads;
threads.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
threads[i] = std::thread { work };
for (size_t i = 0; i < count; ++i)
threads[i]->join();
threads.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
std::vector<std::optional<std::future<void>>> results;
results.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
results[i] = std::async(policy, work);
for (size_t i = 0; i < count; ++i)
results[i]->wait();
results.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
std::ostream& operator<<(std::ostream& stream, std::launch value)
{
if (value == std::launch::async)
return stream << "std::launch::async";
else if (value == std::launch::deferred)
return stream << "std::launch::deferred";
else
return stream << "std::launch::unknown";
}
// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async
int main() {
std::cout << "Running benchmark:\n"
<< " threads? " << std::boolalpha << CONFIG_THREADS << '\n'
<< " iterations " << CONFIG_ITERATIONS << '\n'
<< " async policy " << CONFIG_POLICY << std::endl;
std::chrono::nanoseconds duration;
if (CONFIG_THREADS) {
duration = benchmark_threads(CONFIG_ITERATIONS);
} else {
duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
}
std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}
I've run the benchmark as follows:
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)
I re-ran all the benchmarks with strace attached and accumulated the system calls made:
# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
We observe that std::async is significantly faster with std::launch::deferred but that everything else doesn't seem to matter as much.
My conclusions are:
The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.
std::async with std::launch::deferred saves setup and destroy costs and is much faster for this case.
My machine is configured as follows:
$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU # 3.50GHz
Original Answer:
std::thread is a wrapper for thread objects which are provided by the operating system, they are extremely expensive to create and destroy.
std::async is similar, but there isn't a 1-to-1 mapping between tasks and operating system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.
So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.
Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)
Maybe to clarify, in your case std::async saves the overhead from creating and destroying threads.
(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often, thus the scheduler could decide go give the individual threads smaller slices of processing time, thus creating more overhead for switching between threads.)
Looks like something is not happening like expected. I compiled the whole thing on my fedora and the first results where surprising. I commented out all tests to keep only the 2 ones comparing threads and async. The output looks like confirming the behaviour:
Thead version result
size s map p map s fold p fold
10000 642 628 751 770
100000 6533 3444 7985 3338
500000 14885 5760 13854 6304
1000000 23428 11398 27795 12129
2000000 47136 22468 55518 24154
5000000 118690 55752 139489 60142
10000000 236496 112467 277413 121002
25000000 589277 276750 694742 297832
500000001.17839e+06 5553181.39065e+06 594102
Async version:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 248 232 231 273 282 273
100000 2323 1562 2827 2757 1536 1766
500000 12312 5615 12044 14014 6272 7431
1000000 23585 11701 24060 27851 12376 14109
2000000 47147 22796 48035 55433 25565 30095
5000000 118465 59980 119698 140775 62960 68382
10000000 241727 110883 239554 277958 121205 136041
Looks like the async is actually 2 times faster (for small vallues) than the threaded one.
Then I used strace to count the number of clone system call done (number of thread created):
64 clone with threads
92 clone with async
So looks like the explaination on the time spend to create thread is somehow in contradiction as he async version actually creates as many threads than the thread based one (the difference comes from the fact that there are two version of fold in the async code).
Then I tried to swap both test order of execution (put async before thread), and here are the results:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 653 694 624 718 748 718
100000 6731 3931 2978 8533 3116 1724
500000 12406 5839 14589 13895 8427 7072
1000000 23813 11578 24099 27853 13091 14108
2000000 47357 22402 48197 55469 24572 33543
5000000 117923 55869 120303 139061 61801 68281
10000000 234861 111055 239124 277153 121270 136953
size s map p map s fold p fold
10000 232 232 273 328
100000 6424 3271 8297 4487
500000 21329 5547 13913 6263
1000000 23654 11419 27827 12083
2000000 47230 22763 55653 24135
5000000 117448 56785 139286 61679
10000000 235394 111021 278177 119805
25000000 589329 279637 696392 301485
500000001.1824e+06 5564431.38722e+06 606279
So now the "thread" version is 2 time faster than async for small values.
Looking at clone calls di not show any differences:
92 clone
64 clone
I didn't have much time to go further, but at least on linux, we can consider that there are no differences between the two version (the async could even be seen as less efficient as it requires more threads).
We can see that it has nothing to do with the async/thread problem.
Moreover, if we look at values that realy need computation time, the difference of time is really small and not relevent: 55752us vs 56785us for 5'000'000, and keeps matching for bigger values.
This looks like usual problem with microbenches, we somehow measure latencies of the system more than the computation time itself.
Note: the figures shown are without optimization (original code) ; adding -O3 obviously speeds up the computations, but results show the same: there is not real difference in computation time on big values.
Say you have a cuda kernel that you want to run 2048 times, so you define your kernel like this:
__global__ void run2048Times(){ }
Then you call it from your main code:
run2048Times<<<2,1024>>>();
All seems well so far. However now say for debugging purposes when you're calling the kernel millions of times, you want to verify that your actually calling the Kernel that many times.
What I did was pass a pointer to the kernel and ++'d the pointer every time the kernel ran.
__global__ void run2048Times(int *kernelCount){
kernelCount[0]++; // Add to the pointer
}
However when I copied that pointer back to the main function I get "2".
At first it baffeled me, then after 5 minutes of coffee and pacing back and forth I realized this probably makes sense because the cuda kernel is running 1024 instances of itself at the same time, which means that the kernels overwrite the "kernelCount[0]" instead of truly adding to it.
So instead I decided to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id > kernelCount[0]){
kernelCount[0] = id;
}
}
Genius!! This was guaranteed to work I thought. Until I ran it and got all sorts of numbers between 0 and 2000.
Which tells me that the problem mentioned above still happens here.
Is there any way to do this, even if it involves forcing the kernels to pause and wait for each other to run?
Assuming this is a simplified example, and you are not in fact trying to do profiling as others have already suggested, but want to use this in a more complex scenario, you can achieve the result you want with atomicAdd, which will ensure that the increment operation is executed as a single atomic operation:
__global__ void run2048Times(int *kernelCount){
atomicAdd(kernelCount, 1); // Add to the pointer
}
Why your solutions didn't work:
The problem with your first solution is that it gets compiled into the following PTX code (see here for description of PTX instructions):
ld.global.u32 %r1, [%rd2];
add.s32 %r2, %r1, 1;
st.global.u32 [%rd2], %r2;
You can verify this by calling nvcc with the --ptx option to only generate the intermediate representation.
What can happen here is the following timeline, assuming you launch 2 threads (Note: this is a simplified example and not exactly how GPUs work, but it is enough to illustrate the problem):
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 0 increases it's local copy by 1
thread 0 stores 1 back to kernelCount
thread 1 increases it's local copy by 1
thread 1 stores 1 back to kernelCount
and you end up with 1 even though 2 threads were launched.
Your second solution is wrong even if the threads are launched sequentially because thread indexes are 0-based. So I'll assume you wanted to do this:
__global__ void run2048Times(int *kernelCount){
// Get the id of the kernel
int id = blockIdx.x * blockDim.x + threadIdx.x;
// If the id is bigger than the pointer overwrite it
if(id + 1 > kernelCount[0]){
kernelCount[0] = id + 1;
}
}
This will compile into:
ld.global.u32 %r5, [%rd1];
setp.lt.s32 %p1, %r1, %r5;
#%p1 bra BB0_2;
add.s32 %r6, %r1, 1;
st.global.u32 [%rd1], %r6;
BB0_2:
ret;
What can happen here is the following timeline:
thread 0 reads 0 from kernelCount
thread 1 reads 0 from kernelCount
thread 1 compares 0 to 1 + 1 and stores 2 into kernelCount
thread 0 compares 0 to 0 + 1 and stores 1 into kernelCount
You end up having the wrong result of 1.
I suggest you pick up a good parallel programming / CUDA book if you want to better understand problems with synchronization and non-atomic operations.
EDIT:
For completeness, the version using atomicAdd compiles into:
atom.global.add.u32 %r1, [%rd2], 1;
It seems like the only point of that counter is to do profiling (i.e. analyse how the code runs) rather than to actually count something (i.e. no functional benefit to the program).
There are profiling tools available designed for this task. For example, nvprof gives the number of calls, as well as some time metrics for each kernel in your codebase.
I am scanning through every permutation of vectors and I would like to multithread this process (each thread would scan all the permutation of some vectors).
I manage to extract the code that would not speed up (I know it does not do anything useful but it reproduces my problem).
int main(int argc, char *argv[]){
std::vector<std::string *> myVector;
for(int i = 0 ; i < 8 ; ++i){
myVector.push_back(new std::string("myString" + std::to_string(i)));
}
std::sort(myVector.begin(), myVector.end());
omp_set_dynamic(0);
omp_set_num_threads(8);
#pragma omp parallel for shared(myVector)
for(int i = 0 ; i < 100 ; ++i){
std::vector<std::string*> test(myVector);
do{ //here is a permutation
} while(std::next_permutation(test.begin(), test.end())); // tests all the permutations of this combination
}
return 0;
}
The result is :
1 thread : 15 seconds
2 threads : 8 seconds
4 threads : 15 seconds
8 threads : 18 seconds
16 threads : 20 seconds
I am working with an i7 processor with 8 cores. I can't understand how it could be slower with 8 threads than with 1... I don't think the cost of creating new threads is higher than the one to go through 40320 permutations.. so what is happening?
Thanks to the help of everyone, I finally manage to find the answer :
There were two problems :
A quick performance profiling showed that most of the time was spent in std::lockit which is something used for debug on visual studio.. to prevent that just add this command line /D "_HAS_ITERATOR_DEBUGGING=0" /D "_SECURE_SCL=0". That was why adding more threads resulted in loss of time
Switching optimization on helped improve the performance
I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to the MKL or ACML depending on the platform. A lot of these matrix calculations operate on different independent matrices and hence I use std::thread's to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy Blas routine. It seems that if two threads are using this routine in parallel each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problems as the BLAS routine: doubling the number of threads running parallel also doubles the amount of time each threads needs, so no speed-up is gained. However, using the Intel C++ Compiler this problems vanishes: with increasing number of threads the time a single thread needs is constant.
However, I need to compile as well on systems where no Intel compiler is available. So my questions are: why is there no speed-up with the gcc and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
const std::size_t N { 936 };
std::vector <std::size_t> times(num_threads, 0);
auto threadfunction = [&](std::size_t tid)
{
const std::size_t dim { N * N };
double* pA = new double[dim];
double* pB = new double[dim];
for (std::size_t i {0}; i < N; ++i){
pA[i] = i;
pB[i] = 2*i;
}
boost::posix_time::ptime now1 =
boost::posix_time::microsec_clock::universal_time();
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
boost::posix_time::ptime now2 =
boost::posix_time::microsec_clock::universal_time();
boost::posix_time::time_duration dur = now2 - now1;
times[tid] += dur.total_milliseconds();
delete[] pA;
delete[] pB;
};
std::vector <std::thread> mythreads;
// start threads
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads.emplace_back(threadfunction, n);
}
// wait for threads to finish
for (std::size_t n {0} ; n < num_threads; ++n)
{
mythreads[n].join();
std::cout << " Thread " << n+1 << " of " << num_threads
<< " took " << times[n]<< "msec" << std::endl;
}
}
}
void simplesum(double* a, double* b, std::size_t dim){
for(std::size_t i{0}; i < dim; ++i)
{*(++a) += *(++b);}
}
The outout with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The outout with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with the icc the time needed for one thread perform the computations is constant (as I would have expected; my CPU has 4 physical cores) and with the gcc the time for one thread increases. Replacing the simplesum routine by BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the above stated gcc results.
The answer is fairly simple: Your threads are fighting for memory bandwidth!
Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple cpus actually have to share the memory controller among several cores.
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(++a) += std::exp(*(++b));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
But what about ICC?
First, ICC inlines simplesum. Proving that inlining happens is simple: Using icc, I have disable multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
To
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline it and therefore no matter what else causes ICC to perform better than g++, it will not happen.
It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it
Loads two doubles
Adds them 1000 times (or even performs a = 1000 * b)
Stores two doubles
This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
And now, let us exchange
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
with the version in which the inner loop was made the outer loop:
double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
double x = *a, y = *b;
for (std::size_t n{0}; n < 1000; ++n)
{
x += y;
}
*a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which improves performance by 200x in the single threaded and 15x in the 8-threaded cases. This is due to the fact that the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.
How can we force g++ to perform this optimization?
Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are significantly stacked into its favor.
Instead of std::size_t all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough.
When all three criteria are met, the g++ performance is similar to what ICC delivers:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: The version of g++ used in this experiment is 4.9.0 on a x64 Arch linux.
Ok, I came to the conclusion that the main problem is that the processor acts on different parts of the memory in parallel and hence I assume that one has to deal with lots of cache misses which slows the process further down. Putting the actual sum function in a critical section
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache missses, but of course does not yield optimal speed-up. Anyway, because now the other threads are blocked the simplesum method might as well use all available threads for the sum
void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){
omp_set_num_threads(numberofthreads);
#pragma omp parallel
{
#pragma omp for
for(std::size_t i = 0; i < dim; ++i)
{
a[i]+=b[i];
}
}
}
In this case all the threads work on the same chunk on memory: it should be in the processor cache and if the processor needs to load some other parts of the memory into its cache the other threads benefit from this all well (depending whether this is L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).
I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on some loop switching tricks which I cannot do in my actual code.