I was trying to understand Linux perf and found some really confusing behavior:
I wrote a simple multi-threading example with one thread pinned to each core; each thread runs its computation locally and does not communicate with the others (see test.cc below). I expected this example to have really low, if not zero, context switches. However, profiling with Linux perf shows thousands of context switches - many more than I expected. Profiling the Linux command sleep 20 for comparison shows far fewer context switches.
This profiling result makes no sense to me. What is causing so many context switches?
> sudo perf stat -e sched:sched_switch ./test
Performance counter stats for './test':
6,725 sched:sched_switch
20.835 seconds time elapsed
> sudo perf stat -e sched:sched_switch sleep 20
Performance counter stats for 'sleep 20':
1 sched:sched_switch
20.001 seconds time elapsed
To reproduce the results, run the following commands:
perf stat -e context-switches sleep 20
perf stat -e context-switches ./test
To compile the source code, use the following command:
g++ -std=c++11 -pthread -o test test.cc
// test.cc
#include <iostream>
#include <thread>
#include <vector>
#include <pthread.h> // pthread_setaffinity_np, cpu_set_t

int main(int argc, const char** argv) {
    unsigned num_cpus = std::thread::hardware_concurrency();
    std::cout << "Launching " << num_cpus << " threads\n";
    std::vector<std::thread> threads(num_cpus);
    for (unsigned i = 0; i < num_cpus; ++i) {
        threads[i] = std::thread([i] {
            // Purely local busy work; no communication between threads.
            // (Compiled without optimization, so the empty loop is not elided.)
            int j = 0;
            while (j++ < 100) {
                int tmp = 0;
                while (tmp++ < 110000000) { }
            }
        });
        // Pin thread i to core i
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        int rc = pthread_setaffinity_np(threads[i].native_handle(),
                                        sizeof(cpu_set_t), &cpuset);
        if (rc != 0) {
            std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
        }
    }
    for (auto& t : threads) {
        t.join();
    }
    return 0;
}
You can use sudo perf sched record -- ./test to determine which processes are being scheduled to run in place of your application's threads. When I execute this command on my system, I get:
sudo perf sched record -- ./test
Launching 4 threads
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 23.886 MB perf.data (212100 samples) ]
Note that I have four cores and that the executable is named test. perf sched has captured all the sched:sched_switch events and dumped the data into a file called perf.data by default. The file is about 23 MB in size and contains about 212,100 events. The profiling window runs from the time perf starts until test terminates.
You can use sudo perf sched map to print all the recorded events in a nice format that looks like this:
*. 448826.757400 secs . => swapper:0
*A0 448826.757461 secs A0 => perf:15875
*. A0 448826.757477 secs
*. . A0 448826.757548 secs
. . *B0 448826.757601 secs B0 => migration/3:22
. . *. 448826.757608 secs
*A0 . . 448826.757625 secs
A0 *C0 . 448826.757775 secs C0 => rcu_sched:7
A0 *. . 448826.757777 secs
*D0 . . 448826.757803 secs D0 => ksoftirqd/0:3
*A0 . . 448826.757807 secs
A0 *E0 . . 448826.757862 secs E0 => kworker/1:3:13786
A0 *F0 . . 448826.757870 secs F0 => kworker/1:0:5886
A0 *G0 . . 448826.757874 secs G0 => hud-service:1609
A0 *. . . 448826.758614 secs
A0 *H0 . . 448826.758714 secs H0 => kworker/u8:2:15585
A0 *. . . 448826.758721 secs
A0 . *I0 . 448826.758740 secs I0 => gnome-terminal-:8878
A0 . I0 *J0 448826.758744 secs J0 => test:15876
A0 . I0 *B0 448826.758749 secs
The two-letter names A0, B0, C0, E0, and so on, are short names given by perf to every thread running on the system. The first four columns show which thread was running on each of the four cores. For example, in the second-to-last row you can see the first thread created in your for loop. The name assigned to this thread is J0, and it is running on the fourth core. The asterisk indicates that the thread has just been context-switched in from some other thread; without an asterisk, the same thread has continued running on the same core for another time slice. A dot represents an idle core. To determine the names of all four of your threads, run the following command:
sudo perf sched map | grep 'test'
On my system, this prints:
A0 . I0 *J0 448826.758744 secs J0 => test:15876
J0 A0 *K0 . 448826.758868 secs K0 => test:15878
J0 *L0 K0 . 448826.758889 secs L0 => test:15877
J0 L0 K0 *M0 448826.758894 secs M0 => test:15879
Now that you know the two-letter names assigned to your threads (and to all other threads), you can determine which other threads are causing yours to be context-switched. For example, if you see this:
*G1 L0 K0 M0 448826.822555 secs G1 => firefox:2384
then you'd know that three of your app's threads are running, but one of the cores is being used to run Firefox. So the fourth thread has to wait until the scheduler decides to schedule it again.
If you want all the scheduler slots in which at least one of your threads was running, you can use the following commands:
sudo perf sched map > mydata
grep -E 'J0|K0|L0|M0' mydata > mydata2
wc -l mydata
wc -l mydata2
The last two commands tell you how many rows (time slices) there were in which at least one thread of your app was running; you can compare that to the total number of time slices. Since there are four cores, the total number of scheduler slots is 4 * (number of time slices). From there you can do all sorts of manual calculations and figure out exactly what happened.
We can't tell you exactly what is being scheduled - but you can find out yourself using perf.
perf record -e sched:sched_switch ./test
Note that this requires a mounted debugfs and root permissions. perf report will then give you an overview of what the scheduler was switching to (or see perf script for a full listing). There is nothing in your code that would obviously cause a context switch (e.g. sleep, waiting for I/O), so it is most likely another task being scheduled on these cores.
The reason why sleep has almost no context switches is simple. It goes to sleep almost immediately - which is one context switch. While the task is not active, it cannot be displaced by another task.
Related
For a university project we are implementing an algorithm capable of brute-forcing an AES key that we assume is partially known.
We have implemented several versions, including one that exploits the multithreading mechanism in C++.
The implementation allocates a variable number of threads, passed as input at launch, and divides the key space equally among them; each thread then cycles through its own range, attempting each key. The implementation does work, in that it succeeds in finding the key for any combination of #bitsToHack/#threads, but it returns strange timing results.
// Structs for threads and respective data
pthread_t threads[num_of_threads];
struct bf_data td[num_of_threads];
int rc;

// Key-space division
uintmax_t index = pow(BASE_NUMBER, num_bits_to_hack);
uintmax_t step = index / num_of_threads;

// Semaphore used by the winning thread to wake up the main thread
if (sem_init(&s, 1, 0) != 0) {
    printf("Error during semaphore initialization\n");
    return -1;
}
for (int i = 0; i < num_of_threads; i++) {
    // Structure initialization
    td[i].ciphertext = ciphertext;
    td[i].hacked_key = hacked_key;
    td[i].iv_aes = iv_aes;
    td[i].key = key_aes;
    td[i].num_bits_to_hack = num_bits_to_hack;
    td[i].plaintext = plaintext;
    td[i].starting_point = step * i;
    td[i].step = step;
    td[i].num_of_threads = num_of_threads;
    if (DEBUG)
        printf("Starting point for thread %d is: %ju, using step: %ju\n",
               i, td[i].starting_point, td[i].step);
    rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);
    if (rc) {
        cout << "Error: unable to create thread, " << rc << endl;
        exit(-1);
    }
}
sem_wait(&s);
for (int i = 0; i < num_of_threads; i++) {
    pthread_join(threads[i], NULL);
}
For the decryption_brute_force function (the body of each thread):
void* decryption_brute_force(void* data){
    // Copy data into thread-local memory
    // Build the key to begin the search from starting_point
    // For each key from starting_point to starting_point + step:
    //     Try decryption
    //     If the obtained plaintext corresponds to the expected one:
    //         Print the results, wake up the main thread and terminate
    //     Else:
    //         Increment the key and continue
}
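As a rough illustration, the body might look like the following sketch (hypothetical: try_decrypt(), the bf_data field types, and the global semaphore s are stand-ins inferred from the description above, not the actual project code):
void* decryption_brute_force(void* data) {
    // Copy the shared data into thread-local storage
    struct bf_data local = *(struct bf_data*) data;
    // Build the first candidate key from this thread's starting point
    uintmax_t candidate = local.starting_point;
    // Scan this thread's share of the key space
    for (uintmax_t i = 0; i < local.step; ++i, ++candidate) {
        // try_decrypt() is a hypothetical helper: decrypt with the candidate
        // key and compare against the expected plaintext
        if (try_decrypt(local.ciphertext, candidate, local.plaintext)) {
            printf("Key found: %ju\n", candidate);
            sem_post(&s); // wake up the main thread waiting in sem_wait
            return NULL;
        }
    }
    return NULL;
}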
To conclude the project, we intended to study the optimal number of threads, expecting performance to increase with the number of threads up to a threshold, after which the system would no longer benefit from additional threads.
At the end of the analysis (a simulation lasting about 9 hours), we obtained the results shown in the figure. [Plot of execution time versus number of threads omitted.]
We cannot understand why 8 threads perform better than 16. Could it be due to the CPU architecture? Could it be that the CPU can schedule 32 and 8 threads better than 16?
From the comments, I think the linear-search pattern in each thread yields different results for different numbers of threads: when you double the number of threads, the point the search has to reach within a thread's range can shift further away, but once you double again it cannot shift much further because there are so many threads. And because you said you always use the same encrypted data - did you try different inputs?
The step variable is an integer, so the distribution may not be exact:
^
8 threads & step=7 (56 work total)
index-16 (0-based)
v
01234567 89abcdef 01234567 89abcdef
| | |. | ...
500 seconds, as it is the first loop iteration
16 threads & step=3 (56 work total)
index-16 again, but at second-iteration now
v
012 345 678 9ab cde f01 234 567 8
| | | | | | . | | | ...
1000 seconds, as it is found only in the second iteration within the thread
Another example with 2 threads and 3 threads:
x is found at the 51st element of a 100-element workload:
2 threads
| |x(1st iteration) |
3 threads
| |........x | |
about 5x slower than with 2 threads
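A tiny sketch of this model (illustrative only; it assumes each key test costs one time unit and that the target index falls inside one thread's contiguous range):
#include <cstdio>

// How many key tests the owning thread performs before hitting the target,
// when num_keys keys are split into contiguous ranges across t threads.
unsigned long tests_until_found(unsigned long target, unsigned long num_keys, unsigned long t) {
    unsigned long step = num_keys / t; // keys per thread
    return target % step;              // offset inside the owning thread's range
}

int main() {
    // Target at index 51 of 100 keys, as in the example above
    const unsigned long thread_counts[] = {2, 3, 4};
    for (unsigned long t : thread_counts)
        std::printf("%lu threads -> %lu key tests until found\n",
                    t, tests_until_found(51, 100, t));
}
For this fixed input, 2 threads find the key after a single test while 3 threads need 18, which is exactly the kind of inversion visible in your plot.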
I have this minimal example of Google Benchmark usage.
The weird thing is that "42" is printed several times (4), not just once.
I understand that the library has to run things several times to gather statistics, but I thought that this was handled by the state loop itself.
This is a minimal example of something more complicated where I wanted to print the result (outside the loop) to verify that different implementations of the same function give the same result.
#include <benchmark/benchmark.h>
#include <iostream>
#include <thread> // for sleep_for
int SomeFunction(){
using namespace std::chrono_literals;
std::this_thread::sleep_for(10ms);
return 42;
}
static void BM_SomeFunction(benchmark::State& state) {
// Perform setup here
int result = -1;
for (auto _ : state) {
// This code gets timed
result = SomeFunction();
benchmark::DoNotOptimize(result);
}
std::cout<< result <<std::endl;
}
// Register the function as a benchmark
BENCHMARK(BM_SomeFunction);
// Run the benchmark
BENCHMARK_MAIN();
Output ("42" is printed 4 times - why more than once, and why 4?):
Running ./a.out
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.30, 0.65, 0.79
42
42
42
42
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_SomeFunction 10243011 ns 11051 ns 1000
How else could I test (at least visually) that different benchmarking blocks give the same answer?
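One thing worth knowing, shown in the hedged sketch below: the benchmark function itself is invoked several times while the library calibrates how many iterations it needs, so anything outside the state loop also runs several times. A static counter makes this visible (state.iterations() is standard Google Benchmark API; the print format is mine):
static void BM_SomeFunction(benchmark::State& state) {
    static int calls = 0; // counts invocations of the whole function
    ++calls;
    int result = -1;
    for (auto _ : state) {
        result = SomeFunction();
        benchmark::DoNotOptimize(result);
    }
    // Runs once per *invocation* of BM_SomeFunction, not once per benchmark,
    // which is why "42" appeared several times in the output above.
    std::cout << "call " << calls << ": result = " << result
              << " (" << state.iterations() << " iterations)\n";
}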
I was reading Chapter 8 of the "Modern C++ Programming Cookbook, 2nd edition" on concurrency and stumbled upon something that puzzles me.
The author implements different versions of parallel map and reduce functions using std::thread and std::async. The implementations are very close; for example, the heart of each parallel_map function is
// parallel_map using std::async
...
tasks.emplace_back(std::async(
std::launch::async,
[=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); }));
...
// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...
The complete code can be found here for std::thread and there for std::async.
What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async implementation. Moreover, the author acknowledges this fact as obvious, without providing any justification:
If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the fold function.
I ran the code above on my computer, and even though the differences are not as compelling as in the book, I find that the std::async implementation is indeed faster than the std::thread one. (The author also later brings in standard implementations of these algorithms, which are even faster). On my computer, the code runs with four threads, which corresponds to the number of physical cores of my CPU.
Maybe I missed something, but why is it obvious that std::async should run faster than std::thread in this example? My intuition was that std::async, being a higher-level abstraction over threads, should take at least as much time as raw threads, if not more -- obviously I was wrong. Are these findings consistent, as the book suggests, and what is the explanation?
My original interpretation was incorrect. Refer to #OznOg's answer below.
Modified Answer:
I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:
#include <chrono>
#include <cstddef>  // size_t
#include <future>
#include <iostream>
#include <optional> // std::optional
#include <thread>
#include <vector>
__thread volatile int you_shall_not_optimize_this;
void work() {
// This is the simplest way I can think of to prevent the compiler and
// operating system from doing naughty things
you_shall_not_optimize_this = 42;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
std::vector<std::optional<std::thread>> threads;
threads.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
threads[i] = std::thread { work };
for (size_t i = 0; i < count; ++i)
threads[i]->join();
threads.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
std::vector<std::optional<std::future<void>>> results;
results.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
results[i] = std::async(policy, work);
for (size_t i = 0; i < count; ++i)
results[i]->wait();
results.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
std::ostream& operator<<(std::ostream& stream, std::launch value)
{
if (value == std::launch::async)
return stream << "std::launch::async";
else if (value == std::launch::deferred)
return stream << "std::launch::deferred";
else
return stream << "std::launch::unknown";
}
// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async
int main() {
std::cout << "Running benchmark:\n"
<< " threads? " << std::boolalpha << CONFIG_THREADS << '\n'
<< " iterations " << CONFIG_ITERATIONS << '\n'
<< " async policy " << CONFIG_POLICY << std::endl;
std::chrono::nanoseconds duration;
if (CONFIG_THREADS) {
duration = benchmark_threads(CONFIG_ITERATIONS);
} else {
duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
}
std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}
I've run the benchmark as follows:
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)
I re-ran all the benchmarks with strace attached and accumulated the system calls made:
# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
We observe that std::async is significantly faster with std::launch::deferred, but that everything else doesn't seem to matter much.
My conclusions are:
The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.
std::async with std::launch::deferred saves setup and destroy costs and is much faster for this case.
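The deferred case is also consistent with what the standard specifies: with std::launch::deferred the task runs lazily on the first thread that calls wait() or get(), so no thread is ever created. A minimal sketch demonstrating this (the behaviour is per the standard; the output wording is mine):
#include <future>
#include <iostream>
#include <thread>

int main() {
    std::cout << "main thread:      " << std::this_thread::get_id() << '\n';

    auto deferred = std::async(std::launch::deferred, [] {
        // Runs on whichever thread calls get()/wait() - here, the main thread
        std::cout << "deferred task on: " << std::this_thread::get_id() << '\n';
    });
    auto eager = std::async(std::launch::async, [] {
        // Runs on a newly created thread
        std::cout << "async task on:    " << std::this_thread::get_id() << '\n';
    });

    deferred.get(); // same id as the main thread
    eager.get();    // a different id
}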
My machine is configured as follows:
$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Original Answer:
std::thread is a wrapper around thread objects provided by the operating system; these are extremely expensive to create and destroy.
std::async is similar, but there doesn't have to be a 1-to-1 mapping between tasks and operating system threads. It could be implemented with a thread pool, where threads are reused for multiple tasks.
So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.
Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)
Maybe to clarify, in your case std::async saves the overhead from creating and destroying threads.
(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often; the scheduler could then decide to give individual threads smaller slices of processing time, creating more overhead for switching between threads.)
It looks like something is not happening as expected. I compiled the whole thing on my Fedora machine, and the first results were surprising. I commented out all tests except the two comparing threads and async. The output seems to confirm the behaviour:
Thread version results:
size s map p map s fold p fold
10000 642 628 751 770
100000 6533 3444 7985 3338
500000 14885 5760 13854 6304
1000000 23428 11398 27795 12129
2000000 47136 22468 55518 24154
5000000 118690 55752 139489 60142
10000000 236496 112467 277413 121002
25000000 589277 276750 694742 297832
50000000 1.17839e+06 555318 1.39065e+06 594102
Async version:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 248 232 231 273 282 273
100000 2323 1562 2827 2757 1536 1766
500000 12312 5615 12044 14014 6272 7431
1000000 23585 11701 24060 27851 12376 14109
2000000 47147 22796 48035 55433 25565 30095
5000000 118465 59980 119698 140775 62960 68382
10000000 241727 110883 239554 277958 121205 136041
It looks like the async version is actually 2 times faster (for small values) than the threaded one.
Then I used strace to count the number of clone system calls made (i.e. the number of threads created):
64 clone with threads
92 clone with async
So the explanation based on the time spent creating threads seems to be contradicted: the async version actually creates as many threads as the thread-based one (the difference comes from the fact that there are two versions of fold in the async code).
Then I swapped the order of execution of the two tests (putting async before thread), and here are the results:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 653 694 624 718 748 718
100000 6731 3931 2978 8533 3116 1724
500000 12406 5839 14589 13895 8427 7072
1000000 23813 11578 24099 27853 13091 14108
2000000 47357 22402 48197 55469 24572 33543
5000000 117923 55869 120303 139061 61801 68281
10000000 234861 111055 239124 277153 121270 136953
size s map p map s fold p fold
10000 232 232 273 328
100000 6424 3271 8297 4487
500000 21329 5547 13913 6263
1000000 23654 11419 27827 12083
2000000 47230 22763 55653 24135
5000000 117448 56785 139286 61679
10000000 235394 111021 278177 119805
25000000 589329 279637 696392 301485
50000000 1.1824e+06 556443 1.38722e+06 606279
So now the "thread" version is 2 times faster than async for small values.
Looking at the clone calls did not show any differences:
92 clone
64 clone
I didn't have much time to go further, but at least on Linux, we can consider that there is no difference between the two versions (the async one could even be seen as less efficient, since it requires more threads).
We can see that this has nothing to do with the async/thread question.
Moreover, if we look at values that really need computation time, the time difference is small and not relevant: 55752us vs 56785us for 5,000,000, and the values keep matching for bigger inputs.
This looks like the usual problem with microbenchmarks: we end up measuring system latencies more than the computation time itself.
Note: the figures shown are without optimization (original code); adding -O3 obviously speeds up the computations, but the results show the same thing: there is no real difference in computation time for big values.
I am writing some code that is computationally expensive but highly parallelisable. Once parallelised, I intend to run it on an HPC system; to keep the runtime within a week, the problem needs to scale well with the number of processors.
Below is a simple (and ludicrous) example of what I am attempting to achieve, which is concise enough to compile and demonstrate my problem:
#include <iostream>
#include <ctime>
#include "mpi.h"
using namespace std;
double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main()
{
    int n = 3500000;
    int counter = 0;
    time_t timer;
    int start_time = time(&timer);
    int myid, numprocs;
    int k;
    double integrate, result;
    double end = 0.5;
    double start = -2.;
    double E;
    double factor = (end - start)/(n*1.);
    integrate = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    // Cyclic distribution: rank myid handles iterations myid, myid+numprocs, ...
    for (k = myid; k < n+1; k += numprocs){
        E = start + k*(end-start)/n;
        if ((k == 0) || (k == n))   // trapezoidal rule: half weight at the endpoints
            integrate += 0.5*factor*int_theta(E);
        else
            integrate += factor*int_theta(E);
        counter++;
    }

    cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
    cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;

    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        cout<<result<<endl;
    MPI_Finalize();
    return 0;
}
I have compiled the problem on my quadcore laptop with
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
and I get the following output;
$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08
$ mpirun -np 3 ./a.out
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08
$ mpirun -np 2 ./a.out
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08
To me it appears that there must be a barrier somewhere that I am not aware of, since I get better performance with 2 processors than with 3. Can somebody please offer some advice? Thanks.
If I read the output of lscpu you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you have 4 logical CPUs but only 2 hardware cores (1 socket x 2 cores per socket).
Using cat /proc/cpuinfo you can find the make and model of the CPU to find out more.
The four logical CPUs probably result from hyperthreading, which means that some hardware resources (e.g. the FPU, but I am not an expert on this) are shared between the two logical CPUs of a core. Thus, I would not expect good parallel scaling beyond two processes.
For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores to get a better estimate.
From looking at your code, I would expect perfect scalability to any number of cores - at least as long as you do not include the time needed for process startup and the final MPI_Reduce. Those will certainly become slower as more processes are involved.
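To check that, you could time only the compute loop, excluding startup: synchronise all ranks with a barrier just before the loop and use MPI_Wtime around it. A sketch of the pattern, to be dropped around your existing loop:
MPI_Barrier(MPI_COMM_WORLD); // ensure all ranks start the loop together
double t0 = MPI_Wtime();
for (k = myid; k < n + 1; k += numprocs) {
    // ... existing integration body ...
}
double t1 = MPI_Wtime();
cout << "process " << myid << " compute time: " << (t1 - t0) << "s" << endl;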
I want to run a loop inside a thread that calculates some data every millisecond. But I am having trouble with the sleep function. It is sleeping much too long.
I created a basic console application in visual studio:
#include <windows.h>
#include <tchar.h>  // _tmain / _TCHAR (Visual Studio console application template)
#include <iostream>
#include <chrono>
#include <thread>
using namespace std;
typedef std::chrono::high_resolution_clock Clock;
int _tmain(int argc, _TCHAR* argv[])
{
int iIdx = 0;
bool bRun = true;
auto aTimeStart = Clock::now();
while (bRun){
iIdx++;
if (iIdx >= 500) bRun = false;
//Sleep(1);
this_thread::sleep_for(chrono::microseconds(10));
}
printf("Duration: %i ms\n", chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - aTimeStart).count());
cin.get();
return 0;
}
This prints out: Duration: 5000 ms
The same result is printed when I use Sleep(1);
I would expect the duration to be 500 ms, and not 5000 ms. What am I doing wrong here?
Update:
I was using Visual Studio 2013. Now I have installed Visual Studio 2015, and it's fine - it prints: Duration: 500 ms (sometimes 527 ms).
However, sleep_for still isn't very accurate, so I will look for other solutions.
The typical time slice used by popular OSs is much longer than 1 ms (say 20 ms or so); the sleep sets a minimum for how long you want your thread to be suspended, not a maximum. Once your thread becomes runnable, it is up to the OS when to next schedule it.
If you need this level of accuracy you either need a real time OS, or set a very high priority on your thread (so it can pre-empt almost anything else), or write your code in the kernel, or use a busy wait.
But do you really need to do the calculation every ms? That sort of timing requirement normally comes from hardware. What goes wrong if you bunch up the calculations a bit later?
On Windows, try timeBeginPeriod: https://msdn.microsoft.com/en-us/library/windows/desktop/dd757624(v=vs.85).aspx
It increases timer resolution.
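A minimal sketch of using it (requires linking against winmm.lib; remember to restore the resolution when done):
#include <windows.h>
#pragma comment(lib, "winmm.lib") // timeBeginPeriod / timeEndPeriod live in winmm

int main() {
    timeBeginPeriod(1); // request 1 ms timer resolution
    // ... the Sleep(1) / sleep_for loop from the question ...
    timeEndPeriod(1);   // restore the previous resolution
}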
What am I doing wrong here?
Attempting to use sleep for precise timing.
sleep(n) does not pause your thread for precisely n time then immediately continue.
sleep(n) yields control of the thread back to the scheduler, and indicates that you do not want control back until at least n time has passed.
Now, the scheduler already divvies up thread processing time into time slices, and these are typically on the order of 25 milliseconds or so. That's the bare minimum you can expect your sleep to last.
sleep is simply the wrong tool for this job. Never use it for precise scheduling.
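If you do stick with sleeping, a pattern that at least avoids accumulating drift is to sleep until an absolute deadline rather than for a relative duration. A sketch using std::chrono::steady_clock (each wake-up is still only as precise as the scheduler allows, but the error does not add up across iterations):
#include <chrono>
#include <thread>

int main() {
    using namespace std::chrono;
    auto next = steady_clock::now();
    for (int i = 0; i < 500; ++i) {
        // ... per-millisecond work here ...
        next += milliseconds(1);             // advance the absolute deadline
        std::this_thread::sleep_until(next); // one late wake-up doesn't shift later ones
    }
}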
This thread is fairly old, but perhaps someone can still use this code.
It's written for C++11 and I've tested it on Ubuntu 15.04.
#include <chrono>
#include <cstdint>
#include <random> // for the simulated work in on_tick
#include <thread>

class MillisecondPerLoop
{
public:
    void do_loop(uint32_t loops)
    {
        int64_t time_to_wait = 0;
        // Align the first deadline to a whole millisecond
        next_clock = ((get_current_clock_ns() / one_ms_in_ns) * one_ms_in_ns);
        for (uint32_t loop = 0; loop < loops; ++loop)
        {
            on_tick();
            // Assume on_tick takes less than 1 ms to run.
            // Calculate the next tick time and the time to wait from now until then.
            time_to_wait = calc_time_to_wait();
            // Check if we're already past the 1 ms time interval
            if (time_to_wait > 0)
            {
                // Wait that many ns
                std::this_thread::sleep_for(std::chrono::nanoseconds(time_to_wait));
            }
            ++m_tick;
        }
    }
private:
    void on_tick()
    {
        // TEST only: simulate the work done in every tick
        // by waiting a random amount of time
        std::this_thread::sleep_for(std::chrono::microseconds(distribution(generator)));
    }

    // Nanoseconds since the epoch; needs 64 bits, as this value overflows a uint32_t
    uint64_t get_current_clock_ns()
    {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::system_clock::now().time_since_epoch()).count();
    }

    int64_t calc_time_to_wait()
    {
        next_clock += one_ms_in_ns;
        return static_cast<int64_t>(next_clock) - static_cast<int64_t>(get_current_clock_ns());
    }

    static constexpr uint64_t one_ms_in_ns = 1000000L;
    uint32_t m_tick = 0;
    uint64_t next_clock = 0;

    // Random "work" duration for on_tick; the 100-900 us range is an arbitrary test value
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution{100, 900};
};
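Typical usage might look like this (assuming the member initialisers above):
int main() {
    MillisecondPerLoop looper;
    looper.do_loop(1000); // run 1000 ticks, one per millisecond
}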
A typical run shows a pretty accurate 1 ms loop with a 1-3 microsecond error. Your PC may be more accurate than this if it has a faster CPU.
Here's typical output:
One Second Loops:
Avg (ns) ms err(ms)
[ 0] 999703 0.9997 0.0003
[ 1] 999888 0.9999 0.0001
[ 2] 999781 0.9998 0.0002
[ 3] 999896 0.9999 0.0001
[ 4] 999772 0.9998 0.0002
[ 5] 999759 0.9998 0.0002
[ 6] 999879 0.9999 0.0001
[ 7] 999915 0.9999 0.0001
[ 8] 1000043 1.0000 -0.0000
[ 9] 999675 0.9997 0.0003
[10] 1000120 1.0001 -0.0001
[11] 999606 0.9996 0.0004
[12] 999714 0.9997 0.0003
[13] 1000171 1.0002 -0.0002
[14] 999670 0.9997 0.0003
[15] 999832 0.9998 0.0002
[16] 999812 0.9998 0.0002
[17] 999868 0.9999 0.0001
[18] 1000096 1.0001 -0.0001
[19] 999665 0.9997 0.0003
Expected total time: 20.0000ms
Actual total time : 19.9969ms
I have a more detailed write up here:
https://arrizza.org/wiki/index.php/One_Millisecond_Loop