Using multithreading to optimize the execution time (simple example) - C++

I have the following code:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/cstdint.hpp>
#include <iostream>

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    boost::uint64_t sum = 0;
    for (int i = 0; i < 1000000000; ++i)
        sum += i;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    std::cout << end - start << std::endl;
    std::cout << sum << std::endl;
}
The task is: refactor the following program to calculate the total using two threads. Since many processors nowadays have two cores, the execution time should decrease by utilizing threads.
Here is my solution:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/thread.hpp>
#include <boost/cstdint.hpp>
#include <iostream>

boost::uint64_t s1 = 0;
boost::uint64_t s2 = 0;

void sum1()
{
    for (int i = 0; i < 500000000; ++i)
        s1 += i;
}

void sum2()
{
    for (int i = 500000000; i < 1000000000; ++i)
        s2 += i;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    boost::thread t1(sum1);
    boost::thread t2(sum2);
    t1.join();
    t2.join();
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    std::cout << end - start << std::endl;
    std::cout << s1+s2 << std::endl;
}
Please review it, and also answer the following questions:
1. Why does this code not actually improve the execution time? :) (I use an Intel Core i5 processor and 64-bit Windows 7.)
2. Why does the sum become incorrect when I use a single variable s to store the sum instead of s1 and s2?
Thanks in advance.

I'll answer your second question, because the first one is not yet clear to me. When you are using a single global variable to calculate the sum, there is a so-called "data race", caused by the fact that the operation
s += i;
is not "atomic", meaning that at the assembler level it is translated into several instructions. If one thread is executing this set of instructions it may be interrupted by another thread doing the same thing and your results will be inconsistent.
This is due to the fact that threads are scheduled on and off the CPU by the OS and it's impossible to predict how the threads will interleave their instruction execution.
The classic pattern in this case is to have two local variables collecting the sums for each thread and then summing them up together into a global variable once the threads have finished their work.
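A sketch of that pattern applied to the question's code (my own illustration; it uses boost::ref to hand each thread its own output slot):

#include <boost/thread.hpp>
#include <boost/cstdint.hpp>

boost::uint64_t total = 0;   // written only after both threads finish

void partial_sum(int begin, int end, boost::uint64_t &out)
{
    boost::uint64_t local = 0;           // thread-private accumulator
    for (int i = begin; i < end; ++i)
        local += i;
    out = local;                         // a single write per thread
}

int main()
{
    boost::uint64_t a = 0, b = 0;
    boost::thread t1(partial_sum, 0, 500000000, boost::ref(a));
    boost::thread t2(partial_sum, 500000000, 1000000000, boost::ref(b));
    t1.join();
    t2.join();
    total = a + b;                       // combine after both have finished
}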

The answer to 1 should be: run it under a profiler and see what it tells you.
But there is at least one usual suspect: false sharing. Your s1 and s2 likely end up on the same cache line, so your two cores (if your two threads do end up on different cores) have to synchronize at the cache-line level. Make sure the two uint64_t values are on different cache lines (whose size depends on the architecture you're targeting).
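One way to separate them, sketched with C++11 alignas and an assumed 64-byte cache line (verify your target's actual line size):

#include <boost/cstdint.hpp>

// sizeof(PaddedCounter) becomes 64, so two adjacent instances
// can never share a cache line.
struct alignas(64) PaddedCounter {
    boost::uint64_t value;
};

PaddedCounter s1 = {0};
PaddedCounter s2 = {0};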
As to the answer to 2... Nothing in your program guarantees that the updates from one thread will not get stomped on by the second, and vice versa. You need either synchronization primitives, to make sure your updates don't happen at the same time, or atomic updates, to make sure they don't stomp on each other.

I'll answer the first:
It takes far more time to create and join threads than it takes to do nothing (the optimized baseline).
The compiler will convert this:
for (int i = 0; i < 1000000000; ++i)
    sum += i;
into this:
// << optimized away >>
Even in your worst case, with the sums kept in local data, the whole computation boils down to a single constant with optimization enabled.
The parallel version reduces the compiler's ability to optimize the program while adding work.
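Concretely, a typical optimizer evaluates the whole loop at compile time using the closed form N(N-1)/2; a sketch of what the generated program amounts to:

#include <cstdint>
#include <iostream>

int main()
{
    // The optimizer folds the original loop into this constant.
    constexpr std::uint64_t N = 1000000000;
    constexpr std::uint64_t sum = N * (N - 1) / 2; // 499999999500000000
    std::cout << sum << std::endl;
}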

The simplest way to refactor the program (code-wise) to compute the sum using multiple threads is to use OpenMP:
// $ g++ -fopenmp parallel-sum.cpp && ./a.out
#include <stdint.h>
#include <iostream>

const int32_t N = 1 << 30;

int main() {
    int64_t sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int32_t i = 0; i < N; ++i)
        sum += i;
    std::cout << sum << " " << static_cast<int64_t>(N)*(N-1)/2 << std::endl;
}
Output
576460751766552576 576460751766552576
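Incidentally, the reduction clause performs the familiar thread-local-subtotal pattern for you; a rough hand-written equivalent (my own illustration, not what the runtime literally generates) looks like this, and it is essentially what the C++11 version below does by hand:

int64_t sum = 0;
#pragma omp parallel
{
    int64_t local = 0;                 // private subtotal per thread
    #pragma omp for nowait
    for (int32_t i = 0; i < N; ++i)
        local += i;
    #pragma omp atomic
    sum += local;                      // one combining update per thread
}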
Here's a parallel reduction implemented using C++11 threads:
// $ g++ -std=c++0x -pthread parallel-sum-c++11.cpp && ./a.out
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

namespace {
std::mutex mutex;

void sum_interval(int32_t start, int32_t end, int64_t &sum) {
    int64_t s = 0; // accumulate locally; no sharing inside the loop
    for ( ; start < end; ++start) s += start;
    std::lock_guard<std::mutex> lock(mutex);
    sum += s;      // one synchronized update per thread
}
}

int main() {
    int64_t sum = 0;
    const int num_threads = 4;
    const int32_t N = 1 << 30;
    std::thread t[num_threads];
    // fork threads; assign intervals to sum
    int32_t start = 0, step = N / num_threads;
    for (int i = 0; i < num_threads-1; ++i, start += step)
        t[i] = std::thread(sum_interval, start, start+step, std::ref(sum));
    t[num_threads-1] = std::thread(sum_interval, start, N, std::ref(sum));
    // wait for result and print it
    for (int i = 0; i < num_threads; ++i) t[i].join();
    std::cout << sum << " " << static_cast<int64_t>(N)*(N-1)/2 << std::endl;
}
Note: Access to sum is guarded so only one thread at a time can change it. If sum is std::atomic<int64_t> then the locking can be omitted.
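A sketch of that atomic variant (the mutex and the reference parameter go away; relaxed ordering suffices here because thread::join() already provides the synchronization):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> sum(0);

void sum_interval(int32_t start, int32_t end) {
    int64_t s = 0;                               // still accumulate locally
    for ( ; start < end; ++start) s += start;
    sum.fetch_add(s, std::memory_order_relaxed); // one atomic add per thread
}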

Related

Why does isolating tasks in task arenas to NUMA nodes for memory locality slow down my embarrassingly parallel TBB application?

I have this self-contained example of a TBB application that I run on a 2-NUMA-node CPU; it performs a simple vector addition repeatedly on dynamic arrays. It recreates an issue that I am having with a somewhat more complicated example. I am trying to divide the computations cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are linked to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that memory accesses are performed on data that is local to the CPU that computes its task.

A control example uses a simple parallel_for with a static_partitioner to perform the computation, while my intended example invokes per task_arena a task which invokes a parallel_for to compute the vector addition of the designated region, i.e. the half of the dynamic arrays that was initialized before in the corresponding NUMA node.

This example always takes twice as much time to perform the vector addition compared to the control example. It cannot be the overhead of creating the tasks for the task_arenas that will invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied. Could anyone explain to me what happens and why this performance penalty is so harsh? A pointer to resources would also be helpful, as I am doing this for a university project.
#include <iostream>
#include <iomanip>
#include <cmath>
#include <tbb/tbb.h>
#include <vector>

int main(){
    std::vector<int> numa_indexes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_indexes.size());
    std::vector<tbb::task_group> task_groups(numa_indexes.size());
    std::size_t numa_nodes = numa_indexes.size();
    for(unsigned j = 0; j < numa_indexes.size(); j++){
        arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
    }

    std::size_t size = 10000000;
    std::size_t part_size = std::ceil((float)size/numa_nodes);
    double * A = (double *) malloc(sizeof(double)*size);
    double * B = (double *) malloc(sizeof(double)*size);
    double * C = (double *) malloc(sizeof(double)*size);
    double * D = (double *) malloc(sizeof(double)*size);

    //DATA INITIALIZATION
    for(unsigned k = 0; k < numa_indexes.size(); k++)
        arenas[k].execute(
        [&](){
            std::size_t local_start = k*part_size;
            std::size_t local_end = std::min(local_start + part_size, size);
            tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                [&](std::size_t i)
            {
                C[i] = D[i] = 0;
                A[i] = B[i] = 1;
            }, tbb::static_partitioner());
        });

    //PARALLEL ALGORITHM
    tbb::tick_count t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++)
        tbb::parallel_for(static_cast<std::size_t>(0), size,
            [&](std::size_t i)
        {
            C[i] += A[i] + B[i];
        }, tbb::static_partitioner());
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;

    //TASK ARENA & PARALLEL ALGORITHM
    t0 = tbb::tick_count::now();
    for(int i = 0; i<100; i++){
        for(unsigned k = 0; k < numa_indexes.size(); k++){
            arenas[k].execute(
            [&](){
                for(unsigned i=0; i<numa_indexes.size(); i++)
                    task_groups[i].wait();
                task_groups[k].run([&](){
                    std::size_t local_start = k*part_size;
                    std::size_t local_end = std::min(local_start + part_size, size);
                    tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
                        [&](std::size_t i)
                    {
                        D[i] += A[i] + B[i];
                    });
                });
            });
        }
    }
    t1 = tbb::tick_count::now();
    std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;

    double sum1 = 0;
    double sum2 = 0;
    for(int i = 0; i<size; i++){
        sum1 += C[i];
        sum2 += D[i];
    }
    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;
    return 0;
}
Performance with:
for(unsigned j = 0; j < numa_indexes.size(); j++){
    arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07
Performance without constraints:
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
EDIT: I implemented the use of task_group as found in @AlekseiFedotov's suggested resources, but the issue still remains.
The part of the provided example where the work with the arenas happens is not a one-to-one match to the example from the docs (the "Setting the preferred NUMA node" section).
Looking further into the specification of the task_arena::execute() method, we can find that task_arena::execute() is a blocking API, i.e. it does not return until the passed lambda completes.
On the other hand, the specification of the task_group::run() method reveals that it is asynchronous, i.e. it returns immediately, without waiting for the passed functor to complete.
That is where the problem lies, I guess. The code executes the two parallel loops within the arenas one by one, in a serial manner, so to speak. Consider following the example from the docs carefully.
BTW, the oneTBB project, which is the revamped version of TBB, can be found here.
EDIT answer for the EDITED question:
See the comment to the question.
The waiting should happen after the work is submitted, not before it. Also, there is no need to go to another arena's task group to do the wait within the loop; just submit the work in the NUMA loop via arenas[i].execute( [&, i] { task_groups[i].run( [&, i] { /*...*/ } ); } ), then, in another loop, wait for each task_group within the corresponding task_arena.
Please note how I capture the NUMA loop iteration variable by copy. Otherwise, the code might be referring to the wrong data inside the lambda body.
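A minimal sketch of that submit-then-wait structure, reusing the question's arenas, task_groups, and interval arithmetic:

// Submit all the work first; task_group::run() returns immediately.
for (unsigned k = 0; k < numa_indexes.size(); k++) {
    arenas[k].execute([&, k] {                 // capture k by copy
        task_groups[k].run([&, k] {
            std::size_t local_start = k * part_size;
            std::size_t local_end = std::min(local_start + part_size, size);
            tbb::parallel_for(local_start, local_end, [&](std::size_t i) {
                D[i] += A[i] + B[i];
            });
        });
    });
}
// Only then wait, each task_group inside its own arena.
for (unsigned k = 0; k < numa_indexes.size(); k++)
    arenas[k].execute([&, k] { task_groups[k].wait(); });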

Parallelizing a small array is slower than parallelizing a large array?

I wrote a small program that generates random values for two valarrays, and in a for loop the values of said arrays are added to a new one.
However, when I use a small array size (20 elements), the parallel version takes significantly longer than the serial one, and when I use large arrays (200,000 elements) it takes roughly the same amount of time (the parallel version is always a bit slower, though).
Why is this?
The only reason I can think of is that with the large array the CPU puts it in L3 cache and shares it across all cores, whereas with the small one it has to copy it around the lower cache levels? Or am I getting this wrong?
Here is the code:
#include <valarray>
#include <iostream>
#include <ctime>
#include <omp.h>
#include <chrono>

int main()
{
    int size = 2000000;
    std::valarray<double> num1(size), num2(size), result(size);
    std::srand(std::time(nullptr));
    std::chrono::time_point<std::chrono::high_resolution_clock> start, stop;
    std::chrono::microseconds duration;
    for (int i = 0; i < size; ++i) {
        num1[i] = std::rand();
        num2[i] = std::rand();
    }

    //Parallel execution
    start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(8)
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Parallel for loop executed in: " << duration.count() << " microseconds" << std::endl;

    //Serial execution
    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Serial for loop executed in: " << duration.count() << " microseconds" << std::endl;
}
Output with size = 200 000
Parallel for loop executed in: 2450 microseconds
Serial for loop executed in: 2726 microseconds
Output with size = 20
Parallel for loop executed in: 4727 microseconds
Serial for loop executed in: 0 microseconds
I'm using a Xeon E3-1230 V5, and I'm compiling with Intel's compiler using maximum optimization and Skylake-specific optimizations as well.
I get identical results with Visual Studio's C++ compiler.

Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

If we look at the Visual C++ documentation of omp_set_dynamic, it is literally copy-pasted from the OMP 2.0 standard (section 3.1.7 on page 39):
If [the function argument] evaluates to a nonzero value, the number of threads that are used for executing upcoming parallel regions may be adjusted automatically by the run-time environment to best use system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function.
It seems clear that omp_set_dynamic(1) allows the implementation to use fewer than the current maximum number of threads for a parallel region (presumably to prevent oversubscription under high loads). Any reasonable reading of this paragraph would suggest that said reduction should be observable by querying omp_get_num_threads inside parallel regions.
(Both documentations also show the signature as void omp_set_dynamic(int dynamic_threads);. It appears that "the number of threads specified by the user" does not refer to dynamic_threads but instead means "whatever the user specified using the remaining OpenMP interface").
However, no matter how high I push my system load under omp_set_dynamic(1), the return value of omp_get_num_threads (queried inside the parallel regions) never changes from the maximum in my test program. Yet I can still observe clear performance differences between omp_set_dynamic(1) and omp_set_dynamic(0).
Here is a sample program to reproduce the issue:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <cstdlib>
#include <cmath>
#include <omp.h>

#define UNDER_LOAD true

const int SET_DYNAMIC_TO = 1;
const int REPEATS = 3000;
const unsigned MAXCOUNT = 1000000;

std::size_t threadNumSum = 0;
std::size_t threadNumCount = 0;

void oneRegion(int i)
{
    // Pseudo-randomize the number of iterations.
    unsigned ui = static_cast<unsigned>(i);
    int count = static_cast<int>(((MAXCOUNT + 37) * (ui + 7) * ui) % MAXCOUNT);
#pragma omp parallel for schedule(guided, 512)
    for (int j = 0; j < count; ++j)
    {
        if (j == 0)
        {
            threadNumSum += omp_get_num_threads();
            threadNumCount++;
        }
        if ((j + i + count) % 16 != 0)
            continue;

        // Do some floating point math.
        double a = j + i;
        for (int k = 0; k < 10; ++k)
            a = std::sin(i * (std::cos(a) * j + std::log(std::abs(a + count) + 1)));
        volatile double out = a;
    }
}

int main()
{
    omp_set_dynamic(SET_DYNAMIC_TO);

#if UNDER_LOAD
    for (int i = 0; i < 10; ++i)
    {
        std::thread([]()
        {
            unsigned x = 0;
            float y = static_cast<float>(std::sqrt(2));
            while (true)
            {
                //#pragma omp parallel for
                for (int i = 0; i < 100000; ++i)
                {
                    x = x * 7 + 13;
                    y = 4 * y * (1 - y);
                }
                volatile unsigned xx = x;
                volatile float yy = y;
            }
        }).detach();
    }
#endif

    std::chrono::high_resolution_clock clk;
    auto start = clk.now();
    for (int i = 0; i < REPEATS; ++i)
        oneRegion(i);
    std::cout << (clk.now() - start).count() / 1000ull / 1000ull << " ms for " << REPEATS << " iterations" << std::endl;

    double averageThreadNum = double(threadNumSum) / threadNumCount;
    std::cout << "Entered " << threadNumCount << " parallel regions with " << averageThreadNum << " threads each on average." << std::endl;

    std::getchar();
    return 0;
}
Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27024.1 for x64
On gcc, for example, this program will print a significantly lower averageThreadNum for omp_set_dynamic(1) than for omp_set_dynamic(0). But on MSVC, the same value is shown in both cases, despite a 30% performance difference (170 s vs 230 s).
How can this be explained?
In Visual C++, the number of threads executing the loop does get reduced with omp_set_dynamic(1) in this example, which explains the performance difference.
However, contrary to any good-faith interpretation of the standard (and Visual C++ docs), omp_get_num_threads does not report this reduction.
The only way to figure out how many threads MSVC actually uses for each parallel region is to inspect omp_get_thread_num on every loop iteration (or parallel task). The following would be one way to do it with little in-loop performance overhead:
// std::hardware_destructive_interference_size is not available in gcc or clang, also see comments by Peter Cordes:
// https://stackoverflow.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons
#include <new>      // std::hardware_destructive_interference_size
#include <vector>
#include <omp.h>

struct alignas(2 * std::hardware_destructive_interference_size) NoFalseSharing
{
    int flagValue = 0;
};

void foo(int count) // count: the trip count of your real loop
{
    std::vector<NoFalseSharing> flags(omp_get_max_threads());
#pragma omp parallel for
    for (int j = 0; j < count; ++j)
    {
        flags[omp_get_thread_num()].flagValue = 1;
        // Your real loop body
    }

    int realOmpNumThreads = 0;
    for (auto flag : flags)
        realOmpNumThreads += flag.flagValue;
}
Indeed, you will find that realOmpNumThreads yields significantly different values from the omp_get_num_threads() reported inside the parallel region with omp_set_dynamic(1) on Visual C++.
One could argue that technically
"the number of threads in the team executing a parallel region" and
"the number of threads that are used for executing upcoming parallel regions"
are not literally the same.
This is a nonsensical interpretation of the standard in my view, because the intent is very clear and there is no reason for the standard to say "The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function" in this section if this number is unrelated to the functionality of omp_set_dynamic.
However, it could be that MSVC decided to keep the number of threads in a team unaffected and just assign no loop iterations for execution to a subset of them under omp_set_dynamic(1) for ease of implementation.
Whatever the case may be: Do not trust omp_get_num_threads in Visual C++.

C++11 std::thread summation with atomic very slow

I wanted to learn to use C++11 std::thread with VS2012, so I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. The test program is given below:
#include <iostream>
#include <thread>
#include <chrono>
#include <conio.h>
#include <atomic>
#include <tchar.h>

std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;

void test_without_threds()
{
    sum = 0;
    for(unsigned int j = 0; j < 2; j++)
        for(unsigned int k = 0; k < RANGE; k++)
            sum ++ ;
}

void call_from_thread(int tid)
{
    for(unsigned int k = 0; k < RANGE; k++)
        sum ++ ;
}

void test_with_2_threds()
{
    std::thread t[2];
    sum = 0;

    //Launch a group of threads
    for (int i = 0; i < 2; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < 2; ++i) {
        t[i].join();
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    chrono::time_point<chrono::system_clock> start, end;
    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";

    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end-start;
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with 2_threds\n";

    start = chrono::system_clock::now();
    test_with_2_threds();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}
Now, when I use just the plain long long variable for the counter (the commented-out line), I get a value different from the correct one - 100000000 instead of 200000000. I am not sure why that is; I suppose the two threads are changing the counter at the same time, but I am not sure how that really happens, because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at the beginning. Performance is 110 ms with two threads vs. 200 ms for one thread.
So the correct way according to the documentation is to use std::atomic. However, now the performance is much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
I am not sure why that is; I suppose the two threads are changing the counter at the same time, but I am not sure how that really happens, because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
So the correct way according to the documentation is to use std::atomic. However, now the performance is much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code; regardless of the value you give for RANGE, it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel, without synchronizing, then add the individual results together at the end.
Another point is to guard against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the whole computation at compile-time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <chrono>
#include <conio.h>
#include <atomic>
#include <numeric>

const int num_threads = 4;

struct val {
    long long sum;
    int pad[2]; // padding so adjacent sums don't share a cache line
    val &operator=(long long i) { sum = i; return *this; }
    operator long long &() { return sum; }
    operator long long() const { return sum; }
};

val sum[num_threads];

using namespace std;
const int RANGE = 100000000;

void test_without_threds()
{
    sum[0] = 0LL;
    for(unsigned int j = 0; j < num_threads; j++)
        for(unsigned int k = 0; k < RANGE; k++)
            sum[0] ++ ;
}

void call_from_thread(int tid)
{
    for(unsigned int k = 0; k < RANGE; k++)
        sum[tid] ++ ;
}

void test_with_threads()
{
    std::thread t[num_threads];
    std::fill_n(sum, num_threads, 0);

    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }

    long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}

int main()
{
    chrono::time_point<chrono::system_clock> start, end;
    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";

    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end-start;
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with threads\n";

    start = chrono::system_clock::now();
    test_with_threads();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the sums are identical, but N threads increase speed by a factor of approximately N (up to the number of cores available).
Try using the prefix increment, which may give a small performance improvement.
In a test on my machine, std::memory_order_relaxed did not give any advantage.

OpenMP overhead calculation

Given n threads, is there a way that I can calculate the amount of overhead (e.g. the number of cycles) required to implement a specific directive in OpenMP?
For example, given the code below
#pragma omp parallel
{
    #pragma omp for
    for( int i=0 ; i < m ; i++ )
        a[i] = b[i] + c[i];
}
Can I calculate somehow how much overhead is required to create these threads?
I think the way to measure the overhead is to time both the serial and parallel versions, and then see how far off the parallel version is from its 'ideal' running time for your number of threads.
So for example, if your serial version takes 10 seconds and you have 4 threads on 4 cores, then your ideal running time is 2.5 seconds. If your OpenMP version takes 4 seconds, then your 'overhead' is 1.5 seconds. I put overhead in quotes because some of that will be thread creation and memory sharing (actual threading overhead), and some of that will just be unparallelized sections of code. I'm trying to think here in terms of Amdahl's Law.
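To put that in formula form (my own sketch of the arithmetic; T_s is the serial time, T_p the measured parallel time, and n the number of threads):

ideal time = T_s / n
'overhead' ≈ T_p - T_s / n    (here: 4 - 10/4 = 1.5 seconds)

Amdahl's Law then says that with a parallelizable fraction p, the best speedup you can hope for is S(n) = 1 / ((1 - p) + p/n), so the unparallelized fraction of the code is part of what that 1.5 seconds contains.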
For demonstration, here are two examples. They don't measure thread creation overhead, but they might show the difference between expected and achieved improvement. And while Mystical was right that the only real way to measure is to time it, even trivial examples like your for loop aren't necessarily memory bound. OpenMP does a lot of work that we don't see.
Serial (speedtest.cpp)
#include <iostream>

int main(int argc, char** argv) {
    const int SIZE = 100000000;
    int* a = new int[SIZE];
    int* b = new int[SIZE];
    int* c = new int[SIZE];

    for(int i = 0; i < SIZE; i++) {
        a[i] = b[i] * c[i] * 2;
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    for(int i = 0; i < SIZE; i++) {
        a[i] = b[i] + c[i] + 1;
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
Parallel (omp_speedtest.cpp)
#include <omp.h>
#include <iostream>

int main(int argc, char** argv) {
    const int SIZE = 100000000;
    int* a = new int[SIZE];
    int* b = new int[SIZE];
    int* c = new int[SIZE];

    std::cout << "There are " << omp_get_num_procs() << " procs." << std::endl;

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < SIZE; i++) {
            a[i] = b[i] * c[i];
        }
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < SIZE; i++) {
            a[i] = b[i] + c[i] + 1;
        }
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
So I compiled these with
g++ -O3 -o speedtest.exe speedtest.cpp
g++ -fopenmp -O3 -o omp_speedtest.exe omp_speedtest.cpp
And when I ran them
$ time ./speedtest.exe
a[99999999]=0
a[99999999]=1
real 0m1.379s
user 0m0.015s
sys 0m0.000s
$ time ./omp_speedtest.exe
There are 4 procs.
a[99999999]=0
a[99999999]=1
real 0m0.854s
user 0m0.015s
sys 0m0.015s
Yes, you can. Please take a look at the EPCC benchmark. Although this code is a bit old, it measures the various overheads of OpenMP's constructs, including omp parallel for and omp critical.
The basic approach is very simple and straightforward: measure a baseline serial time without any OpenMP, then measure the same code with just the OpenMP pragma you want to evaluate, and subtract the elapsed times. This is exactly how the EPCC benchmark measures the overhead; see, for example, the source file 'syncbench.c'.
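A crude sketch of that subtraction approach (my own illustration, not code from the benchmark; EPCC's syncbench additionally arranges for each thread to repeat the same work as the serial reference, so that parallel speedup doesn't mask the overhead):

#include <omp.h>
#include <iostream>

int main() {
    const int REPS = 1000;
    const int M = 1000;
    static volatile double a[M];   // volatile so the loops aren't optimized away

    // Baseline: the loop body with no OpenMP construct around it.
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r)
        for (int i = 0; i < M; ++i)
            a[i] = a[i] + 1.0;
    double base = omp_get_wtime() - t0;

    // The same body wrapped in the directive under test.
    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel for
        for (int i = 0; i < M; ++i)
            a[i] = a[i] + 1.0;
    }
    double withOmp = omp_get_wtime() - t0;

    // The difference, averaged over the repetitions, approximates the
    // per-invocation overhead of the construct, expressed as time.
    std::cout << "overhead per 'parallel for': "
              << (withOmp - base) / REPS << " s" << std::endl;
}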
Please note that the overhead is expressed as time rather than as a number of cycles. I also tried to measure the number of cycles, but OpenMP parallel constructs' overhead may include time blocked on synchronization, so a cycle count may not reflect the real overhead of OpenMP.