Why would concurrency using std::async be faster than using std::thread? - c++

I was reading Chapter 8 of the "Modern C++ Programming Cookbook, 2nd edition" on concurrency and stumbled upon something that puzzles me.
The author implements different versions of parallel map and reduce functions using std::thread and std::async. The implementations are really close; for example, the heart of the parallel_map functions is
// parallel_map using std::async
...
tasks.emplace_back(std::async(
    std::launch::async,
    [=, &f] { std::transform(begin, last, begin, std::forward<F>(f)); }));
...
// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...
The complete code can be found here for std::thread and there for std::async.
What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async implementation. Moreover, the author acknowledges this fact as obvious, without providing any justification:
If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the fold function.
I ran the code above on my computer, and even though the differences are not as compelling as in the book, I find that the std::async implementation is indeed faster than the std::thread one. (The author also later brings in standard implementations of these algorithms, which are even faster). On my computer, the code runs with four threads, which corresponds to the number of physical cores of my CPU.
Maybe I missed something, but why is it obvious that std::async should run faster than std::thread on this example? My intuition was that, std::async being a higher-level abstraction over threads, it should take at least as much time as threads, if not more -- obviously I was wrong. Are those findings consistent, as suggested by the book, and what is the explanation?

My original interpretation was incorrect. Refer to OznOg's answer below.
Modified Answer:
I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:
#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <iostream>
#include <optional> // needed for std::optional below

__thread volatile int you_shall_not_optimize_this;

void work() {
    // This is the simplest way I can think of to prevent the compiler and
    // operating system from doing naughty things
    you_shall_not_optimize_this = 42;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
    std::vector<std::optional<std::thread>> threads;
    threads.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        threads[i] = std::thread { work };
    for (size_t i = 0; i < count; ++i)
        threads[i]->join();
    threads.clear();

    auto after = std::chrono::high_resolution_clock::now();
    return after - before;
}

[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
    std::vector<std::optional<std::future<void>>> results;
    results.resize(count);

    auto before = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < count; ++i)
        results[i] = std::async(policy, work);
    for (size_t i = 0; i < count; ++i)
        results[i]->wait();
    results.clear();

    auto after = std::chrono::high_resolution_clock::now();
    return after - before;
}

std::ostream& operator<<(std::ostream& stream, std::launch value)
{
    if (value == std::launch::async)
        return stream << "std::launch::async";
    else if (value == std::launch::deferred)
        return stream << "std::launch::deferred";
    else
        return stream << "std::launch::unknown";
}

// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async

int main() {
    std::cout << "Running benchmark:\n"
              << " threads? " << std::boolalpha << CONFIG_THREADS << '\n'
              << " iterations " << CONFIG_ITERATIONS << '\n'
              << " async policy " << CONFIG_POLICY << std::endl;

    std::chrono::nanoseconds duration;
    if (CONFIG_THREADS) {
        duration = benchmark_threads(CONFIG_ITERATIONS);
    } else {
        duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
    }

    std::cout << "Completed in " << duration.count() << "ns ("
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count()
              << "ms)\n";
}
I've run the benchmark as follows:
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)
I re-ran all the benchmarks with strace attached and accumulated the system calls made:
# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
We observe that std::async is significantly faster with std::launch::deferred but that everything else doesn't seem to matter as much.
My conclusions are:
The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.
std::async with std::launch::deferred saves setup and destroy costs and is much faster for this case.
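As an illustration of why the deferred case is so cheap (this is a minimal sketch of my own, not code from the book): with std::launch::deferred the callable is not started eagerly at all; it runs lazily inside the thread that calls get() or wait(), which is why no clone/mmap traffic shows up for it in the strace counts above.
#include <future>
#include <iostream>
#include <thread>

int main() {
    auto deferred = std::async(std::launch::deferred, [] {
        // Runs lazily, inside whatever thread calls get()/wait().
        return std::this_thread::get_id();
    });
    std::cout << "caller thread:   " << std::this_thread::get_id() << '\n'
              << "deferred ran on: " << deferred.get() << '\n'; // same id as the caller

    auto eager = std::async(std::launch::async, [] {
        // std::launch::async must behave as if executed on a new thread.
        return std::this_thread::get_id();
    });
    std::cout << "async ran on:    " << eager.get() << '\n'; // a different id
}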
My machine is configured as follows:
$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Original Answer:
std::thread is a wrapper around thread objects provided by the operating system; these are extremely expensive to create and destroy.
std::async is similar, but there isn't a 1-to-1 mapping between tasks and operating system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.
So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.
Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)
Maybe to clarify, in your case std::async saves the overhead from creating and destroying threads.
(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often, so the scheduler could decide to give the individual threads smaller slices of processing time, thus creating more overhead for switching between threads.)
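To illustrate the thread-pool idea mentioned above, here is a minimal sketch of the general technique (my own illustration; it is not how std::async is specified, and as the strace counts above show, the current libstdc++ std::async does not do this): worker threads are created once and reused, so submitting a task costs a queue push and a wakeup rather than a clone()/join() pair.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m_);
                        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task(); // run the task on a reused worker thread
                }
            });
    }

    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Submit a void() callable and get a future to wait on.
    template <class F>
    std::future<void> submit(F f) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(f));
        std::future<void> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.emplace([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};
The point is only that task submission and thread creation can be decoupled; whether a given standard library does this for std::async is up to the implementation.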

Looks like something is not happening as expected. I compiled the whole thing on my Fedora box and the first results were surprising. I commented out all tests to keep only the two that compare threads and async. The output seems to confirm the behaviour:
Thread version results:
size s map p map s fold p fold
10000 642 628 751 770
100000 6533 3444 7985 3338
500000 14885 5760 13854 6304
1000000 23428 11398 27795 12129
2000000 47136 22468 55518 24154
5000000 118690 55752 139489 60142
10000000 236496 112467 277413 121002
25000000 589277 276750 694742 297832
50000000 1.17839e+06 555318 1.39065e+06 594102
Async version:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 248 232 231 273 282 273
100000 2323 1562 2827 2757 1536 1766
500000 12312 5615 12044 14014 6272 7431
1000000 23585 11701 24060 27851 12376 14109
2000000 47147 22796 48035 55433 25565 30095
5000000 118465 59980 119698 140775 62960 68382
10000000 241727 110883 239554 277958 121205 136041
Looks like the async version is actually 2 times faster (for small values) than the threaded one.
Then I used strace to count the number of clone system calls made (i.e. the number of threads created):
64 clone with threads
92 clone with async
So it looks like the explanation based on the time spent creating threads is somewhat contradicted, as the async version actually creates as many threads as the thread-based one (the difference comes from the fact that there are two versions of fold in the async code).
Then I tried to swap the execution order of the two tests (running async before thread), and here are the results:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 653 694 624 718 748 718
100000 6731 3931 2978 8533 3116 1724
500000 12406 5839 14589 13895 8427 7072
1000000 23813 11578 24099 27853 13091 14108
2000000 47357 22402 48197 55469 24572 33543
5000000 117923 55869 120303 139061 61801 68281
10000000 234861 111055 239124 277153 121270 136953
size s map p map s fold p fold
10000 232 232 273 328
100000 6424 3271 8297 4487
500000 21329 5547 13913 6263
1000000 23654 11419 27827 12083
2000000 47230 22763 55653 24135
5000000 117448 56785 139286 61679
10000000 235394 111021 278177 119805
25000000 589329 279637 696392 301485
50000000 1.1824e+06 556443 1.38722e+06 606279
So now the "thread" version is 2 times faster than async for small values.
Looking at clone calls did not show any differences:
92 clone
64 clone
I didn't have much time to go further, but at least on Linux, we can consider that there is no real difference between the two versions (the async one could even be seen as less efficient, as it requires more threads).
We can see that the measured difference has nothing to do with the async/thread distinction.
Moreover, if we look at values that really need computation time, the time difference is really small and not relevant: 55752us vs 56785us for 5'000'000, and it keeps matching for bigger values.
This looks like the usual problem with microbenchmarks: we end up measuring latencies of the system more than the computation time itself.
Note: the figures shown are without optimization (original code); adding -O3 obviously speeds up the computations, but the results show the same thing: there is no real difference in computation time for big values.

Related

chrono gives different measures for the same function

I am trying to measure the execution time.
I'm on Windows 10 and use the gcc compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
This is my code to measure the time taken by tree->insert().
The insert function works internally as follows (just pseudocode):
insert() {
    _load_node(node, addr);
    // do something //
    _save_node(node, addr);
}
_save_node(n, addr) {
    ofstream file(name);
    file.write(n);
    file.close();
}
_load_node(n, addr) {
    ifstream file(name);
    file.read_from(n, addr);
    file.close();
}
The actual results are below:
read is the number of _load_node executions.
write is the number of _save_node executions.
time is in nanoseconds.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
I don't have any idea why these results come out this way and would like to understand them.
What you are trying to measure is ill-defined.
"How long did this code take to run" can seem simple. In practice, though, do you mean "how many CPU cycles my code took" ? Or how many cycles between my program and the other running programs ? Do you account for the time to load/unload it on the CPU ? Do you account for the CPU being throttled down when on battery ? Do you want to account for the time to access the main clock located on the motherboard (in terms of computation that is extremely far).
So, in practice timing will be affected by a lot of factors and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Micros, maybe. Millis, certainly.
So, that leaves you in a position where any measurement will fluctuate a lot. The sane way is to average it out over multiple measurements. Or, even better, do the same operation (on different data) a thousand (million?) times and divide the results by a thousand.
Then, you'll get significant improvement on accuracy.
In code:
start_t = chrono::system_clock::now();
for(int i = 0; i < 1000000; i++)
tree->insert();
end_t = chrono::system_clock::now();
You are using the wrong clock. system_clock is not useful for timing intervals due to its low resolution and its non-monotonic nature.
Use steady_clock instead. It is guaranteed to be monotonic and has a high enough resolution to be useful.
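For illustration, a minimal sketch of interval timing with steady_clock, averaged over many iterations (the loop count is arbitrary and the commented call is a placeholder for your own operation):
#include <chrono>
#include <iostream>

int main() {
    const int iterations = 1000000;

    auto start = std::chrono::steady_clock::now(); // monotonic clock
    for (int i = 0; i < iterations; ++i) {
        // tree->insert(); // placeholder for the operation being measured
    }
    auto end = std::chrono::steady_clock::now();

    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "total:   " << total.count() << " ns\n"
              << "average: " << total.count() / iterations << " ns per call\n";
}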

How to load balance a simple loop using MPI in C++

I am writing some code which is computationally expensive but highly parallelisable. Once parallelised, I intend to run it on an HPC; however, to keep the runtime within a week, the problem needs to scale well with the number of processors.
Below is a simple and ludicrous example of what I am attempting to achieve, which is concise enough to compile and demonstrate my problem:
#include <iostream>
#include <ctime>
#include "mpi.h"

using namespace std;

double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main()
{
    int n = 3500000;
    int counter = 0;
    time_t timer;
    int start_time = time(&timer);
    int myid, numprocs;
    int k;
    double integrate, result;
    double end = 0.5;
    double start = -2.;
    double E;
    double factor = (end - start)/(n*1.);
    integrate = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    for (k = myid; k < n+1; k += numprocs){
        E = start + k*(end-start)/n;
        if ((k == 0) || (k == n))
            integrate += 0.5*factor*int_theta(E);
        else
            integrate += factor*int_theta(E);
        counter++;
    }

    cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
    cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;

    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0)
        cout<<result<<endl;

    MPI_Finalize();
    return 0;
}
I have compiled the problem on my quadcore laptop with
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
and I get the following output:
$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08
$ mpirun -np 3 ./a.out
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08
$ mpirun -np 2 ./a.out
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08
To me it appears that there must be a barrier somewhere that I am not aware of. I get better performance with 2 processes than with 3. Can somebody please offer some advice? Thanks
If I read the output of lscpu you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you have 4 logical CPUs but only 2 hardware cores (1 socket x 2 cores per socket).
Using cat /proc/cpuinfo you can find the make and model of the CPU to maybe find out more.
The four logical CPUs might result from hyperthreading, which means that some hardware resources (e.g. the FPU unit, but I am not an expert on this) are shared between the two logical CPUs of a core. Thus, I would not expect any good parallel scaling beyond two processes.
For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores to get a better estimate.
From looking at your code, I would expect perfect scalability to any number of cores - At least as long as you do not include the time needed for process startup and the final MPI_Reduce. These will for sure become slower with more processes involved.
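A hedged side note on the measurement itself: timing each rank with time() as in the question also includes MPI start-up; a sketch that times only the compute loop (reusing the question's variables, with MPI's own clock and a barrier so all ranks start together) could look like this. It is meant as an illustration, not a drop-in replacement:
MPI_Barrier(MPI_COMM_WORLD);      // line all ranks up before timing
double t0 = MPI_Wtime();

for (k = myid; k < n + 1; k += numprocs) {
    E = start + k * (end - start) / n;
    if ((k == 0) || (k == n))
        integrate += 0.5 * factor * int_theta(E);
    else
        integrate += factor * int_theta(E);
    counter++;
}

double t1 = MPI_Wtime();
cout << "process " << myid << " compute time: " << (t1 - t0) << "s" << endl;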

why perf has such high context-switches?

I was trying to understand the linux perf, and found some really confusing behavior:
I wrote a simple multi-threading example with one thread pinned to each core; each thread runs computation locally and does not communicate with each other (see test.cc below). I was thinking that this example should have really low, if not zero, context switches. However, using linux perf to profile the example shows thousands of context-switches - much more than what I expected. I further profiled the linux command sleep 20 for a comparison, showing much fewer context switches.
This profile result does not make any sense to me. What is causing so many context switches?
> sudo perf stat -e sched:sched_switch ./test
Performance counter stats for './test':
6,725 sched:sched_switch
20.835 seconds time elapsed
> sudo perf stat -e sched:sched_switch sleep 20
Performance counter stats for 'sleep 20':
1 sched:sched_switch
20.001 seconds time elapsed
For reproducing the results, please run the following code:
perf stat -e context-switches sleep 20
perf stat -e context-switches ./test
To compile the source code, please type the following code:
g++ -std=c++11 -pthread -o test test.cc
// test.cc
#include <iostream>
#include <thread>
#include <vector>
#include <pthread.h> // for pthread_setaffinity_np
#include <sched.h>   // for cpu_set_t, CPU_ZERO, CPU_SET

int main(int argc, const char** argv) {
    unsigned num_cpus = std::thread::hardware_concurrency();
    std::cout << "Launching " << num_cpus << " threads\n";

    std::vector<std::thread> threads(num_cpus);
    for (unsigned i = 0; i < num_cpus; ++i) {
        threads[i] = std::thread([i] {
            int j = 0;
            while (j++ < 100) {
                int tmp = 0;
                while (tmp++ < 110000000) { }
            }
        });

        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        int rc = pthread_setaffinity_np(threads[i].native_handle(),
                                        sizeof(cpu_set_t), &cpuset);
        if (rc != 0) {
            std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
        }
    }

    for (auto& t : threads) {
        t.join();
    }
    return 0;
}
You can use sudo perf sched record -- ./test to determine which processes are being scheduled to run in place of the one of the threads of your application. When I execute this command on my system, I get:
sudo perf sched record -- ./test
Launching 4 threads
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 23.886 MB perf.data (212100 samples) ]
Note that I have four cores and the name of the executable is test. perf sched has captured all the sched:sched_switch events and dumped the data into a file called perf.data by default. The file is about 23 MB and contains about 212100 events. The duration of profiling spans from the time perf starts until test terminates.
You can use sudo perf sched map to print all the recorded events in a nice format that looks like this:
*. 448826.757400 secs . => swapper:0
*A0 448826.757461 secs A0 => perf:15875
*. A0 448826.757477 secs
*. . A0 448826.757548 secs
. . *B0 448826.757601 secs B0 => migration/3:22
. . *. 448826.757608 secs
*A0 . . 448826.757625 secs
A0 *C0 . 448826.757775 secs C0 => rcu_sched:7
A0 *. . 448826.757777 secs
*D0 . . 448826.757803 secs D0 => ksoftirqd/0:3
*A0 . . 448826.757807 secs
A0 *E0 . . 448826.757862 secs E0 => kworker/1:3:13786
A0 *F0 . . 448826.757870 secs F0 => kworker/1:0:5886
A0 *G0 . . 448826.757874 secs G0 => hud-service:1609
A0 *. . . 448826.758614 secs
A0 *H0 . . 448826.758714 secs H0 => kworker/u8:2:15585
A0 *. . . 448826.758721 secs
A0 . *I0 . 448826.758740 secs I0 => gnome-terminal-:8878
A0 . I0 *J0 448826.758744 secs J0 => test:15876
A0 . I0 *B0 448826.758749 secs
The two-letter names A0, B0, C0, E0, and so on, are short names given by perf to every thread running on the system. The first four columns show which thread was running on each of the four cores. For example, in the second-to-last row, you can see the first thread that got created in your for loop. The name assigned to this thread is J0. The thread is running on the fourth core. The asterisk indicates that it has just been context-switched to from some other thread. Without an asterisk, it means that the same thread has continued to run on the same core for another time slice. A dot represents an idle core. To determine the names for all of the four threads, run the following command:
sudo perf sched map | grep 'test'
On my system, this prints:
A0 . I0 *J0 448826.758744 secs J0 => test:15876
J0 A0 *K0 . 448826.758868 secs K0 => test:15878
J0 *L0 K0 . 448826.758889 secs L0 => test:15877
J0 L0 K0 *M0 448826.758894 secs M0 => test:15879
Now that you know the two-letter names assigned to your threads (and all other threads), you can determine which other threads are causing your threads to be context-switched. For example, if you see this:
*G1 L0 K0 M0 448826.822555 secs G1 => firefox:2384
then you'd know that three of your app threads are running, but one of the cores is being used to run Firefox. So the fourth thread needs to wait until the scheduler decides to schedule it again.
If you want all the scheduler slots where at least one of your threads is occupying, then you can use the following command:
sudo perf sched map > mydata
grep -E 'J0|K0|L0|M0' mydata > mydata2
wc -l mydata
wc -l mydata2
The last two commands tell you how many rows (time slices) there are in which at least one thread of your app was running. You can compare that to the total number of time slices. Since there are four cores, the total number of scheduler slots is 4 * (number of time slices). Then you can do all sorts of manual calculations and figure out exactly what happened.
We can't tell you exactly what is being scheduled - but you can find out yourself using perf.
perf record -e sched:sched_switch ./test
Note this requires a mounted debugfs and root permissions. Now a perf report will give you an overview of what the scheduler was switching to (or see perf script for a full listing). Now there is no apparent thing in your code that would cause a context switch (e.g. sleep, waiting for I/O), so it is most likely another task that is being scheduled on these cores.
The reason why sleep has almost no context switches is simple. It goes to sleep almost immediately - which is one context switch. While the task is not active, it cannot be displaced by another task.

No speedup for vector sums with threading

I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to MKL or ACML depending on the platform. A lot of these matrix calculations operate on different, independent matrices, and hence I use std::thread to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy BLAS routine. It seems that if two threads are using this routine in parallel, each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problems as the BLAS routine: doubling the number of threads running in parallel also doubles the amount of time each thread needs, so no speed-up is gained. However, using the Intel C++ Compiler this problem vanishes: with an increasing number of threads the time a single thread needs is constant.
However, I need to compile as well on systems where no Intel compiler is available. So my questions are: why is there no speed-up with the gcc and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };
        std::vector<std::size_t> times(num_threads, 0);

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];
            for (std::size_t i {0}; i < N; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }
            boost::posix_time::ptime now1 =
                boost::posix_time::microsec_clock::universal_time();
            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }
            boost::posix_time::ptime now2 =
                boost::posix_time::microsec_clock::universal_time();
            boost::posix_time::time_duration dur = now2 - now1;
            times[tid] += dur.total_milliseconds();
            delete[] pA;
            delete[] pB;
        };

        std::vector<std::thread> mythreads;
        // start threads
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }
        // wait for threads to finish
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads
                      << " took " << times[n] << "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){
    for (std::size_t i{0}; i < dim; ++i)
    { *(++a) += *(++b); }
}
The output with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The output with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with icc the time needed for one thread to perform the computations is constant (as I would have expected; my CPU has 4 physical cores), and with gcc the time for one thread increases. Replacing the simplesum routine by BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the gcc results stated above.
The answer is fairly simple: Your threads are fighting for memory bandwidth!
Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple CPUs actually have to share the memory controller among several cores.
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(++a) += std::exp(*(++b));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
But what about ICC?
First, ICC inlines simplesum. Proving that this inlining matters is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
To
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline it and therefore no matter what else causes ICC to perform better than g++, it will not happen.
It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it
Loads two doubles
Adds them 1000 times (or even performs a = 1000 * b)
Stores two doubles
This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
And now, let us exchange
for (std::size_t n{0}; n < 1000; ++n){
    simplesum(pA, pB, dim);
}
with the version in which the inner loop was made the outer loop:
double* a = pA; double* b = pB;
for (std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n{0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which improves performance by 200x in the single threaded and 15x in the 8-threaded cases. This is due to the fact that the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.
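For the curious, here is a minimal sketch (mine, not from the measurements above) of the integer variant in which the repeated addition may legally be folded into a single multiplication, because integer addition is exact and the folded result is bit-identical:
#include <cstddef>
#include <cstdint>

// With integers a compiler may turn the inner loop into x += 1000 * y;
// with doubles it may not, because repeated rounding can give a
// different result than one multiplication.
void simplesum_int(std::int64_t* a, const std::int64_t* b, std::size_t dim) {
    for (std::size_t i = 0; i < dim; ++i) {
        std::int64_t x = a[i];
        const std::int64_t y = b[i];
        for (int n = 0; n < 1000; ++n)
            x += y;
        a[i] = x;
    }
}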
How can we force g++ to perform this optimization?
Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are significantly stacked in its favor.
Instead of std::size_t all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough.
When all three criteria are met, the g++ performance is similar to what ICC delivers:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: The version of g++ used in this experiment is 4.9.0 on x64 Arch Linux.
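For reference, a minimal sketch (my own, not taken verbatim from the experiment above) of a kernel that satisfies the three criteria, with both loops visible in one function so the compiler has a chance to interchange them; whether a particular g++ version actually does so still has to be verified, for example by inspecting the generated assembly:
// g++ -std=c++11 -O3 -floop-interchange ...   (a plain -O3 is not enough)
void simplesum_repeat(double* a, double* b, int dim, int repeats) {
    for (int n = 0; n < repeats; ++n)   // repeat loop
        for (int i = 0; i < dim; ++i)   // element loop: int bounds, indexed access
            a[i] += b[i];
}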
OK, I came to the conclusion that the main problem is that the processor acts on different parts of memory in parallel, and hence I assume one has to deal with lots of cache misses, which slows the process down further. Putting the actual sum function in a critical section
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache misses, but of course does not yield optimal speed-up. Anyway, because the other threads are now blocked, the simplesum method might as well use all available threads for the sum:
// requires #include <omp.h> and compiling with -fopenmp
void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){
    omp_set_num_threads(numberofthreads);
    #pragma omp parallel
    {
        #pragma omp for
        for(std::size_t i = 0; i < dim; ++i)
        {
            a[i] += b[i];
        }
    }
}
In this case all the threads work on the same chunk of memory: it should be in the processor cache, and if the processor needs to load some other part of memory into its cache, the other threads benefit from this as well (depending on whether this is the L1 or L2 cache, but I reckon the details do not really matter for this discussion).
I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on some loop switching tricks which I cannot do in my actual code.

Windows Critical Section strange behaviour

I have two shared global variables
int a = 0;
int b = 0;
and two threads
// thread 1
for (int i = 0; i < 10; ++i) {
    EnterCriticalSection(&sect);
    a++;
    b++;
    std::cout << a << " " << b << std::endl;
    LeaveCriticalSection(&sect);
}
// thread2
for (int i = 0; i < 10; ++i) {
    EnterCriticalSection(&sect);
    a--;
    b--;
    std::cout << a << " " << b << std::endl;
    LeaveCriticalSection(&sect);
}
The code always prints the following output
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
That is quite strange; it looks like the threads are working sequentially. What's the problem here?
Thanks.
Each thread has a specific time slice during which it executes before being preempted. In your example, the time slice seems to be longer than the time required to complete the loop.
However, you can actively yield control by calling Sleep(0) after leaving the critical section inside the loop.
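For illustration, a minimal sketch of the first loop with such a yield added (Sleep(0) gives up the remainder of the time slice to another ready thread of equal priority):
// thread 1, with an explicit yield after releasing the lock
for (int i = 0; i < 10; ++i) {
    EnterCriticalSection(&sect);
    a++;
    b++;
    std::cout << a << " " << b << std::endl;
    LeaveCriticalSection(&sect);
    Sleep(0); // give the other thread a chance to acquire the critical section
}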
IMO the critical-section leave/enter in your example is so fast that the other thread does not get a chance to enter the section in that moment.
Try putting in some (maybe random) sleeps to slow the code down and observe the desired effects.
Note:
The default timeout for EnterCriticalSection is something like 30 days (which effectively means infinity), so you cannot expect the function to time out. And the documentation says:
There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads.
For me it looks like the topic discussed in http://social.msdn.microsoft.com/forums/en-US/windowssdk/thread/980e5018-3ade-4823-a6dc-5ddbcc3091d5/
Please look at the example from June 28, 2006.
(unfortunately I cannot find the original article by Microsoft telling about the change of CriticalSection)
Could you try your code on Windows XP? What does it show?
I guess that I/O operations (cout) affect the scheduling similarly to a Sleep() call, so starting with Windows Vista a thread could cause starvation of other threads when doing I/O inside a CS.