Multithreaded superlinear performance implementation with CPU boost mode consideration - C++

I am studying the std::thread class in C++11, using the MinGW 4.8.1 toolchain on Windows 7 64-bit.
The CPU is an Intel® Core™ i7-820QM, which has four physical cores with 8 MB cache and supports up to eight hardware threads. Its base frequency is 1.73 GHz when all four cores are busy, and it can boost to 3.08 GHz when only one core is in use.
The main goal of my study is to implement a multithreaded test program that demonstrates super-linear performance increases as the number of threads increases.
Here, by SUPER-linear I mean a speedup of exactly 4 times (3.8 times would be acceptable) when using four threads compared to a single thread, not 3.2 or 3.5 times.
The code and results are pasted below:
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// These functions are extracted from a real application, except that there the
// count() function does some "meaningful" job; this reduced version shows the
// same speedup ratio as my real application.
inline void count(int workNum) // some work to do
{
    int s = 0;
    for (int i = 0; i < workNum; ++i)
        ++s;
}

inline void devide(int numThread) // create [1,7] threads that share the same total amount of work
{
    int max = 100000000;
    typedef std::vector<std::thread> threadList;
    threadList list;
    for (int i = 1; i <= numThread; ++i) {
        list.push_back(std::thread(count, max / numThread));
    }
    std::for_each(list.begin(), list.end(), std::mem_fun_ref(&std::thread::join));
}
inline void thread_efficiency_load() // run the test
{
    for (int i = 7; i > 0; --i)
    {
        std::cout << "*****************************************" << std::endl;
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        devide(i); // this is the workload to be measured, where i is the number of threads
        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "thread num=#" << i << " time=" << elapsed_seconds.count() << std::endl;
    }
}
The output is below; the time unit is seconds:
*****************************************
thread num=#7 time=0.101006
*****************************************
thread num=#6 time=0.0950055
*****************************************
thread num=#5 time=0.0910052
*****************************************
thread num=#4 time=0.0910052
*****************************************
thread num=#3 time=0.102006
*****************************************
thread num=#2 time=0.127007
*****************************************
thread num=#1 time=0.229013
It is very clear that I do not obtain a super-linear performance increase as the number of threads increases. I would like to know why. Why? Why? Why?
Some basic observations from my side:
Since there are only 4 physical cores, the maximum speedup should appear with four active threads (more threads do not really help). However, there is only about a 2.4 times speedup with four threads compared to a single thread, where a 4 times speedup is expected. I hope the implementation above does not block the 4 times speedup through a memory issue (cache-line ping-pong), because all variables are local.
Considering the CPU boost mode: the CPU raises its operating frequency to 3.08 GHz when only one core is busy, a ratio of roughly 1.8 over the 1.73 GHz base frequency, and 2.4 * 1.8 is about 4.3, close to the expected 4. Does this really mean that 2.4 times is the maximum speedup achievable compared to the boosted single-thread mode?
I would very much appreciate answers to the following:
1) In the above implementation, are there variables located on the same cache line, causing a lot of cache-line bouncing between threads that reduces performance?
2) How should the code above be modified to achieve super-linear performance (4 times speedup compared to a single thread) as the number of threads increases?
Thank you very much for your help.

Just as a warning up front: arguing about actual performance numbers of a multithreaded program on a modern x86/x64 system without an RTOS always involves a lot of speculation; there are just too many layers between your C/C++ code and the actual operations performed on the processor.
As a rough upper-bound estimate: yes, for an ALU-bound (not memory-bound) workload you won't get much more than a 1.73 * 4 / 3.08 ≈ 2.25 times speedup for 4 threads on 4 cores versus 1 thread on one core, even in the ideal case. Aside from that, I'd argue that your test's "workload" is too small to get meaningful results. As mentioned in the comments, a compiler would be allowed to replace your workload function entirely with a no-op, leaving you with only the overhead of creating and joining the threads and of your measurement (although I don't think that happened here).
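As an illustration of that last point, here is a minimal sketch of the same test with a workload whose result is observable, so the optimizer cannot legally discard the loop (the names busy_work and g_sink are mine, not from the question; the structure otherwise mirrors the code above):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // The result of every loop ends up in an atomic sink that is printed at
    // the end, so the compiler must actually perform the additions.
    std::atomic<long long> g_sink{0};

    void busy_work(int workNum)
    {
        long long s = 0;
        for (int i = 0; i < workNum; ++i)
            s += i;                                      // depends on the loop variable
        g_sink.fetch_add(s, std::memory_order_relaxed);  // one atomic op per thread per call
    }

    int main()
    {
        const int total = 100000000;
        for (int numThreads = 1; numThreads <= 4; ++numThreads) {
            auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> threads;
            for (int i = 0; i < numThreads; ++i)
                threads.emplace_back(busy_work, total / numThreads);
            for (auto& t : threads)
                t.join();
            std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
            std::printf("threads=%d time=%f s (sink=%lld)\n",
                        numThreads, dt.count(), (long long)g_sink.load());
        }
    }

Even with an observable workload like this, the boost-clock argument above still caps the expected ideal speedup at roughly that 2.25 times factor on this CPU.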

Related

Execution time inconsistency in a program with high priority in the scheduler using RT Kernel

Problem
We are trying to implement a program that sends commands to a robot at a given cycle time; thus this program should be a real-time application. We set up a PC with a preempt-RT Linux kernel and launch our programs with chrt -f 98 or chrt -rr 99 to define the scheduling policy and priority. Loading the kernel and launching the programs seems to work fine (see details below).
Now we were measuring the time (CPU ticks) it takes our program to be computed. We expected this time to be constant with very little variation. What we measured though, were quite significant differences in computation time. Of course, we thought this could be undefined behavior in our rather complex program, so we created a very basic program and measured the time as well. The behavior was similarly bad.
Question
Why are we not measuring a (close to) constant computation time even for our basic program?
How can we solve this problem?
Environment Description
First of all, we installed an RT Linux Kernel on the PC using this tutorial. The main characteristics of the PC are:
CPU: Intel(R) Atom(TM) Processor E3950 @ 1.60 GHz (4 cores)
Memory (RAM): 8 GB
Operating System: Ubuntu 20.04.1 LTS
Kernel: Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture: x86-64
Tests
The first time we detected this problem was when measuring the time it takes to execute our "complex" program with a single thread. We did a few tests with this program, and also with a simpler one. In both cases we measured:
The CPU execution time
The wall time (real-world time)
The difference (wall time - CPU time) between them and the ratio (CPU time / wall time)
We also did a latency test on the PC.
Latency Test
For this one, we followed this tutorial, and these are the results:
(Figures: latency test results for the generic kernel and for the RT kernel are not reproduced here.)
The processes are shown in htop with a priority of RT
Test Program - Complex
We called the function multiple times in the program and measured the time each call takes. (The result plots of the two tests are not reproduced here.)
From this we observed that:
The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
The mode is around 0.17 ms.
For the iterations around 0.17 ms, the difference is usually 0 and the ratio 1, although this is not exclusive to that duration. For these, it seems like only 1 CPU is being used and it is saturated (there is no waiting time).
When the difference is not 0, it is usually negative. This, from what we have read here and here, is because more than 1 CPU is being used.
Test Program - Simple
We did the same test but this time with a simpler program:
#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;
    std::cout << "Iteration | WallTime | cpuTime" << std::endl;
    for (unsigned int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);
        // Function
        a = b + c + i;
        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);
        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec) * 1e-9;
        cpuTime[i] = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec) * 1e-9;
        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }
    return 0;
}
Final Thoughts
We understand that:
If the ratio == number of CPUs used, they are saturated and there is no waiting time.
If the ratio < number of CPUs used, it means that there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).
Of course, we can give more details.
Thanks a lot for your help!
Your function will almost certainly be optimized away, so you are just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions:
The very first time you run the code (unless you just compiled it) the pages need to be loaded from disk. If you are unlucky the code spans pages and you include the loading of the next page in the measured time. Quite unlikely given the code size.
On the first loop iteration, the code and any data need to be loaded into cache, so that iteration takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly longer too.
For everything else I think you can blame scheduling:
an IRQ happens but nothing gets rescheduled
the process gets paused while another process runs
the process gets moved to another hardware thread of the same core, leaving the caches hot
the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)
You can do little about IRQs. Some you can pin to specific cores, but others are simply essential (like the timer interrupt for the scheduler itself). You kind of just have to live with that.
But you can pin your program to a specific CPU and pin everything else to the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups/cpusets for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; nothing you can do about that. But it should eliminate most of the large execution times.
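As a rough illustration of the pinning part, here is a minimal Linux-only sketch; the core number 3 and the priority 98 are arbitrary values of mine, and actually keeping everything else off that core still requires isolcpus or a cpuset as described above:

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main()
    {
        // Restrict the calling thread to CPU core 3, the core reserved for RT work.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            std::fprintf(stderr, "failed to set CPU affinity\n");

        // Request SCHED_FIFO with a high priority (needs root or CAP_SYS_NICE);
        // the programmatic equivalent of launching with chrt -f 98.
        sched_param sp{};
        sp.sched_priority = 98;
        if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0)
            std::fprintf(stderr, "failed to set SCHED_FIFO priority\n");

        // ... real-time work goes here ...
        return 0;
    }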

code runs significantly slower after thread::sleep_for() [duplicate]

Consider:
#include <time.h>
#include <unistd.h>
#include <iostream>

using namespace std;

const int times = 1000;
const int N = 100000;

void run() {
    for (int j = 0; j < N; j++) {
    }
}

int main() {
    clock_t main_start = clock();
    for (int i = 0; i < times; i++) {
        clock_t start = clock();
        run();
        cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
        //usleep(1000);
    }
    cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}
Here is the example code. In the first 26 iterations of the timing loop, the run function costs about 0.4 ms, but then the cost reduces to 0.2 ms.
When the usleep is uncommented, the delay-loop takes 0.4 ms for all runs, never speeding up. Why?
The code is compiled with g++ -O0 (no optimization), so the delay loop isn't optimized away. It's run on an Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, with kernel 3.13.0-32-generic on Ubuntu 14.04.1 LTS (Trusty Tahr).
After 26 iterations, Linux ramps the CPU up to the maximum clock speed since your process uses its full time slice a couple of times in a row.
If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).
If you tested on a Skylake with kernel support for the new power-management mode (where the hardware takes full control of the clock speed), ramp-up would happen much faster.
If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See Why can't my CPU maintain peak performance in HPC for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)
Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)
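To make the earlier point about performance counters concrete, here is a hedged sketch that counts core clock cycles around each delay loop via the Linux perf_event_open interface (Linux-specific; it assumes perf_event_paranoid permits user-space counting, and the wrapper name sys_perf_event_open is mine). Under DVFS the reported cycles per delay loop should stay roughly constant even while the wall-clock time changes:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    // Thin wrapper: glibc does not provide a perf_event_open() function.
    static long sys_perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                                    int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main()
    {
        perf_event_attr attr{};
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   // core clock cycles, not wall time
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = (int)sys_perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
        if (fd == -1) { std::perror("perf_event_open"); return 1; }

        for (int i = 0; i < 1000; i++) {
            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            for (volatile int j = 0; j < 100000; j++) {}      // the delay loop
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
            uint64_t cycles = 0;
            if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
                cycles = 0;
            std::printf("iteration %d: %llu core cycles\n", i,
                        (unsigned long long)cycles);
        }
        close(fd);
    }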
Comments on other theories:
re: David's theory that a potential context switch from usleep could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.
Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!
In more detail for x86:
The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.
The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.
gcc -O0 will compile the OP's code to something like this (Godbolt Compiler explorer): keeping the loop counter in memory on the stack.
The empty loop keeps the loop counter in stack memory, so on a typical Intel x86 CPU the loop runs at one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles; at the rated 3.3 GHz that is roughly 0.18 ms, which matches the ~0.2 ms observed once the clock has ramped up. This dominates the contribution of at most a couple of cache misses (~200 cycles each for code-fetch misses, which prevent further instructions from issuing until they're resolved).
Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).
Even if the loop counter were kept in a register, 100k cycles would still be a lot.
A call to usleep may or may not result in a context switch. If it does, it will take longer than if it doesn't.

Is lock-free multithreading slower than a single-threaded program?

I have considered parallelizing a program so that in its first phase it sorts items into buckets modulo the number of parallel workers, which avoids collisions in the second phase. Each thread of the parallel program uses std::atomic::fetch_add to reserve a place in the output array, and then uses std::atomic::compare_exchange_weak to update the current bucket head pointer, so it's lock-free. However, I had doubts about the performance of multiple threads contending for a single atomic (the one we fetch_add on; the number of bucket heads equals the number of threads, so on average there is not much contention on those), so I decided to measure it. Here is the code:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

void CountingThread() {
    for (int64_t i = 0; i < gnAtomicIterations; i++) {
        gCounter.fetch_add(1, std::memory_order_acq_rel);
    }
}

void BenchmarkAtomic() {
    const uint32_t maxThreads = std::thread::hardware_concurrency();
    std::vector<std::thread> thrs;
    thrs.reserve(maxThreads + 1);
    for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (uint32_t i = 0; i < nThreads; i++) {
            thrs.emplace_back(CountingThread);
        }
        for (uint32_t i = 0; i < nThreads; i++) {
            thrs[i].join();
        }
        auto elapsed = std::chrono::high_resolution_clock::now() - start;
        double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads, (nThreads * gnAtomicIterations) / nSec,
               (long long)gCounter.load(std::memory_order_acquire));
        thrs.clear();
        gCounter.store(0, std::memory_order_release);
    }
}

int __cdecl main() {
    BenchmarkAtomic();
    return 0;
}
And here is the output:
1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000
The CPU is 8-core, 16-thread (Ryzen 1800X @ 3.9 GHz).
So the total operations per second across all threads decreases dramatically until 4 threads are used; then it decreases slowly and fluctuates a bit.
So is this phenomenon common to other CPUs and compilers? Is there any workaround (except resorting to a single thread)?
A lock-free multithreaded program is not inherently slower than a single-threaded program. What makes it slow is data contention. The example you provided is in fact a highly contended, artificial program. In a real program you do a lot of work between accesses to shared data, so it suffers fewer cache invalidations and so on.
This CppCon talk by Jeff Preshing can answer some of your questions better than I did.
Addendum: Try modifying CountingThread to sleep once in a while, to pretend you are busy with something other than incrementing the atomic variable gCounter. Then play with the value in the if statement to see how it influences the results of your program.
void CountingThread() {
    for (int64_t i = 0; i < gnAtomicIterations; i++) {
        // Take a nap every 10000th iteration to simulate work on something
        // unrelated to the shared resource.
        if (i % 10000 == 0) {
            std::chrono::milliseconds timespan(1);
            std::this_thread::sleep_for(timespan);
        }
        gCounter.fetch_add(1, std::memory_order_acq_rel);
    }
}
In general, every call to gCounter.fetch_add marks that data as invalid in the other cores' caches, forcing them to fetch the data from a cache further away from the core. This effect is a major contributor to the slowdown in your program.
local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns )
local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns )
local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns )
local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns )
local DRAM ~60 ns
remote DRAM ~100 ns
Above table taken from Approximate cost to access various caches and main memory?
Lock-free doesn't mean you can exchange data between threads without cost. Lock-free means that you don't wait for other threads to unlock a mutex before you can read shared data. In fact, even lock-free programs rely on hardware-level interlocking (atomic read-modify-write instructions) to prevent data corruption.
Just follow a simple rule: access shared data as little as possible to gain more from multicore programming.
It depends on the concrete workload.
See Amdahl's law:
speedup = 100% / ((sequential workload in %) + (parallel workload in %) / (number of workers))
The parallel workload in your program is 0%, so the speedup is 1, i.e. no speedup. (You are synchronising on incrementing the same memory cell, so only one thread can increment the cell at any given time.)
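To get a quick feel for the formula, a made-up example: a workload that is 10% sequential and 90% parallel, run on 4 workers, gives speedup = 100% / (10% + 90% / 4) = 100 / 32.5 ≈ 3.1, whereas with a 0% parallel part the denominator stays at 100% no matter how many workers there are, so the speedup is exactly 1.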
A rough explanation of why it even performs worse than speedup = 1:
The cache line containing gCounter stays in the CPU cache with only one thread.
With multiple threads, which are scheduled to different CPUs or cores, the cache line containing gCounter bounces around the caches of those CPUs or cores.
So the difference is somewhat comparable to incrementing a register with only one thread versus accessing memory for each increment operation. (Sometimes it is faster than a memory access, as modern CPU architectures support cache-to-cache transfers.)
Like most very broad "which is faster" questions, the only completely general answer is: it depends.
A good mental model is that when parallelizing an existing task, the runtime of the parallel version over N threads is composed of roughly three contributions:
A still-serial part common to both the serial and parallel algorithms, i.e. work that wasn't parallelized, such as setup or tear-down work, or work that didn't run in parallel because the task was inexactly partitioned [1].
A parallel part which was effectively parallelized among the N workers.
An overhead component that represents extra work done in the parallel algorithm that doesn't exist in the serial version. There is almost invariably some small amount of overhead to partition the work, delegate to worker threads and combine the results, but in some cases the overhead can swamp the real work.
So in general you have these three contributions; let's call them T1p, T2p and T3p respectively. The T1p component exists and takes the same time in both the serial and parallel algorithms, so we can ignore it, since it cancels out for the purposes of determining which is slower.
Of course, if you used coarser-grained synchronization, e.g. incrementing a local variable on each thread and only periodically (perhaps only once at the very end) updating the shared variable, the situation would reverse (see the sketch after the footnote).
[1] This also includes the case where the workload was well partitioned, but some threads did more work per unit time, which is common on modern CPUs and in modern OSes.
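As a minimal sketch of that coarser-grained approach, here is a hypothetical drop-in replacement for CountingThread in the benchmark above; each thread accumulates into a local counter and touches gCounter only once at the end:

    void CountingThreadLocal() {   // hypothetical drop-in replacement for CountingThread
        int64_t local = 0;         // private to this thread: no sharing, no cache-line traffic
        for (int64_t i = 0; i < gnAtomicIterations; i++) {
            local++;               // in a real benchmark, do work the optimizer cannot fold away
        }
        gCounter.fetch_add(local, std::memory_order_acq_rel);  // one shared update per thread
    }

With this variant the shared cache line is touched once per thread instead of ten million times, so the per-thread throughput should stay roughly flat as threads are added (bear in mind that an optimizer may fold this trivial loop into a constant, so a real measurement needs a less trivial loop body).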

Cost of OpenMPI in C++

I have the following C++ program, which uses no communication; the same identical work is done on all cores. I know that this doesn't use parallel processing at all:
unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
for (unsigned i = 0; i < n; i++)
{
    // Do something so it's not a trivial loop
    vec1[i] = vec2[i] + i;
}
t2 = MPI_Wtime();
dt = t2 - t1;
I'm running this program on a single node with two Intel® Xeon® Processor E5-2690 v3 CPUs, so I have 24 cores altogether. This is a dedicated node; no one else is using it.
Since there is no communication, and each process is doing the same amount of identical work, running it on multiple processes should take the same time. However, I get the following times (averaged over all cores):
1 core: 0.237
2 cores: 0.240
4 cores: 0.241
8 cores: 0.261
16 cores: 0.454
What could cause the increase in time, particularly for 16 cores?
I have run callgrind and I get roughly the same amount of data/instruction misses on all cores (the percentages of misses are the same).
I have repeated the same test on a node with two Intel® Xeon® Processor E5-2628L v2 CPUs (16 cores altogether) and observe the same increase in execution times. Is this something to do with the MPI implementation?
Considering you are using ~2 GiB of memory per rank, your code is memory-bound. Except for prefetchers you are not operating within the cache but in main memory. You are simply hitting the memory bandwidth at a certain number of active cores.
Another aspect can be turbo mode, if enabled. Turbo mode can increase the core frequency to higher levels if fewer cores are utilized. As long as the memory bandwidth is not saturated, the higher turbo frequency will increase the bandwidth each core gets. This paper discusses the available aggregate memory bandwidth on Haswell processors depending on the number of active cores and the frequency (Fig. 7/8).
Please note that this has nothing to do with MPI / OpenMPI. You might as well launch the same program X times by any other means.
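To illustrate that last point, here is a minimal sketch that reproduces the setup without MPI at all: N plain std::threads each run the same loop on their own private vectors (the names kernel and nCopies are mine; note that 16 copies of two 130-million-element double vectors need roughly 31 GiB of RAM, so shrink n if the node has less). Once the aggregate memory bandwidth is saturated, the per-copy time should degrade with the number of copies just as the MPI ranks do:

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Each worker owns its vectors, exactly like an MPI rank with private memory.
    void kernel(unsigned n, double* seconds)
    {
        std::vector<double> vec1(n, 1.0), vec2(n, 1.0);
        auto t1 = std::chrono::steady_clock::now();
        for (unsigned i = 0; i < n; i++)
            vec1[i] = vec2[i] + i;
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t1;
        *seconds = dt.count();
    }

    int main()
    {
        const unsigned n = 130000000;
        for (int nCopies : {1, 2, 4, 8, 16}) {
            std::vector<std::thread> threads;
            std::vector<double> times(nCopies, 0.0);
            for (int t = 0; t < nCopies; ++t)
                threads.emplace_back(kernel, n, &times[t]);
            for (auto& th : threads)
                th.join();
            double avg = 0.0;
            for (double s : times)
                avg += s;
            std::printf("%2d copies: average loop time %.3f s\n", nCopies, avg / nCopies);
        }
    }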
I suspect there are shared resources that your program needs, so when the number of processes increases, there are delays while a resource is freed so it can be used by another process.
You see, you may have 24 cores, but that doesn't mean your whole system allows every core to do everything concurrently. As mentioned in the comments, memory access is one thing that might cause delays (due to traffic); the same goes for disk.
Also consider the interconnection network, which can suffer from many concurrent accesses as well. In conclusion, these hardware delays are enough to overwhelm the processing time.
General note: Remember how Efficiency of a program is defined:
E = S/p, where S is the speedup and p the number of nodes/processes/threads
Now take scalability into account. Usually programs are weakly scalable, i.e. you have to increase the problem size at the same rate as p. Increasing only p while keeping the problem size (n in your case) constant, and still keeping the efficiency constant, is what characterises a strongly scalable program.
Your program is not using parallel processing at all. Just because you compiled it with OpenMP does not make it parallel.
To parallelize the for loop, for example, you need to use the #pragmas that OpenMP offers (and compile with -fopenmp or your compiler's equivalent):
unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
#pragma omp parallel for
for (unsigned i = 0; i < n; i++)
{
    // Do something so it's not a trivial loop
    vec1[i] = vec2[i] + i;
}
t2 = MPI_Wtime();
dt = t2 - t1;
However, take into account that for large values of n, the impact of cache misses may hide the performance gained with multiple cores.