Given n threads, is there a way to calculate the amount of overhead (e.g. number of cycles) required to implement a specific directive in OpenMP?
For example, given the code below
#pragma omp parallel
{
#pragma omp for
for( int i=0 ; i < m ; i++ )
a[i] = b[i] + c[i];
}
Can I calculate somehow how much overhead is required to create these threads?
I think the way to measure the overhead is to time both the serial and parallel versions, and then see how far off the parallel version is from its 'ideal' running time for your number of threads.
So for example, if your serial version takes 10 seconds and you have 4 threads on 4 cores, then your ideal running time is 2.5 seconds. If your OpenMP version takes 4 seconds, then your 'overhead' is 1.5 seconds. I put overhead in quotes because some of that will be thread creation and memory sharing (actual threading overhead), and some of that will just be unparallelized sections of code. I'm trying to think here in terms of Amdahl's Law.
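Written out as a formula, the estimate in the example above looks like this. The symbols are my own shorthand, not anything standard: T_s is the serial time, T_p the measured parallel time, and p the number of threads.

```latex
\[
T_{\text{overhead}} \;=\; T_p - \frac{T_s}{p}
\;=\; 4\,\text{s} - \frac{10\,\text{s}}{4}
\;=\; 1.5\,\text{s}
\]
```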
For demonstration, here are two examples. They don't measure thread creation overhead, but they might show the difference between expected and achieved improvement. And while Mystical was right that the only real way to measure is to time it, even trivial examples like your for loop aren't necessarily memory bound. OpenMP does a lot of work that we don't see.
Serial (speedtest.cpp)
#include <iostream>
int main(int argc, char** argv) {
const int SIZE = 100000000;
int* a = new int[SIZE];
int* b = new int[SIZE];
int* c = new int[SIZE];
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] * c[i] * 2;
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] + c[i] + 1;
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
delete[] a;
delete[] b;
delete[] c;
return 0;
}
Parallel (omp_speedtest.cpp)
#include <omp.h>
#include <iostream>
int main(int argc, char** argv) {
const int SIZE = 100000000;
int* a = new int[SIZE];
int* b = new int[SIZE];
int* c = new int[SIZE];
std::cout << "There are " << omp_get_num_procs() << " procs." << std::endl;
#pragma omp parallel
{
#pragma omp for
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] * c[i];
}
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
#pragma omp parallel
{
#pragma omp for
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] + c[i] + 1;
}
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
delete[] a;
delete[] b;
delete[] c;
return 0;
}
So I compiled these with
g++ -O3 -o speedtest.exe speedtest.cpp
g++ -fopenmp -O3 -o omp_speedtest.exe omp_speedtest.cpp
And when I ran them
$ time ./speedtest.exe
a[99999999]=0
a[99999999]=1
real 0m1.379s
user 0m0.015s
sys 0m0.000s
$ time ./omp_speedtest.exe
There are 4 procs.
a[99999999]=0
a[99999999]=1
real 0m0.854s
user 0m0.015s
sys 0m0.015s
Yes, you can. Please take a look at the EPCC benchmark. Although this code is a bit old, it measures the overhead of various OpenMP constructs, including omp parallel for and omp critical.
The basic approach is simple and straightforward. You measure a baseline serial time without any OpenMP, then add just the OpenMP pragma that you want to measure and time it again. Then subtract the two elapsed times. This is exactly how the EPCC benchmark measures the overhead; see source files such as 'syncbench.c'.
Please note that the overhead is expressed as time rather than as a number of cycles. I also tried to measure cycles, but the overhead of OpenMP parallel constructs may include time spent blocked on synchronization. Hence, a cycle count may not reflect the real overhead of OpenMP.
I have this self-contained example of a TBB application that I run on a 2-NUMA-node CPU; it repeatedly performs a simple vector addition on dynamic arrays. It recreates an issue that I am having with a somewhat more complicated example. I am trying to divide the computations cleanly between the available NUMA nodes by initializing the data in parallel with 2 task_arenas that are linked to separate NUMA nodes through TBB's NUMA API. The subsequent parallel execution should then be conducted so that memory accesses are performed on data that is local to the CPU that computes its task.
A control example uses a simple parallel_for with a static_partitioner to perform the computation, while my intended example invokes, per task_arena, a task which invokes a parallel_for to compute the vector addition of the designated region, i.e. the half of the dynamic array that was initialized before in the corresponding NUMA node. This example always takes twice as much time to perform the vector addition compared to the control example. It cannot be the overhead of creating the tasks for the task_arenas that will invoke the parallel_for algorithms, because the performance degradation only occurs when the tbb::task_arena::constraints are applied.
Could anyone explain to me what happens and why this performance penalty is so harsh? A pointer to resources would also be helpful, as I am doing this for a university project.
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <tbb/tbb.h>
#include <vector>
int main(){
std::vector<int> numa_indexes = tbb::info::numa_nodes();
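std::vector<tbb::task_arena> arenas(numa_indexes.size());
std::vector<tbb::task_group> task_groups(numa_indexes.size()); // one task_group per arena, used further down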
std::size_t numa_nodes = numa_indexes.size();
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
std::size_t size = 10000000;
std::size_t part_size = std::ceil((float)size/numa_nodes);
double * A = (double *) malloc(sizeof(double)*size);
double * B = (double *) malloc(sizeof(double)*size);
double * C = (double *) malloc(sizeof(double)*size);
double * D = (double *) malloc(sizeof(double)*size);
//DATA INITIALIZATION
for(unsigned k = 0; k < numa_indexes.size(); k++)
arenas[k].execute(
[&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
C[i] = D[i] = 0;
A[i] = B[i] = 1;
}, tbb::static_partitioner());
});
//PARALLEL ALGORITHM
tbb::tick_count t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++)
tbb::parallel_for(static_cast<std::size_t>(0), size,
[&](std::size_t i)
{
C[i] += A[i] + B[i];
}, tbb::static_partitioner());
tbb::tick_count t1 = tbb::tick_count::now();
std::cout << "Time 1: " << (t1-t0).seconds() << std::endl;
//TASK ARENA & PARALLEL ALGORITHM
t0 = tbb::tick_count::now();
for(int i = 0; i<100; i++){
for(unsigned k = 0; k < numa_indexes.size(); k++){
arenas[k].execute(
[&](){
for(unsigned i=0; i<numa_indexes.size(); i++)
task_groups[i].wait();
task_groups[k].run([&](){
std::size_t local_start = k*part_size;
std::size_t local_end = std::min(local_start + part_size, size);
tbb::parallel_for(static_cast<std::size_t>(local_start), local_end,
[&](std::size_t i)
{
D[i] += A[i] + B[i];
});
});
});
} // end of loop over NUMA nodes
} // end of repetition loop
t1 = tbb::tick_count::now();
std::cout << "Time 2: " << (t1-t0).seconds() << std::endl;
double sum1 = 0;
double sum2 = 0;
for(int i = 0; i<size; i++){
sum1 += C[i];
sum2 += D[i];
}
std::cout << sum1 << std::endl;
std::cout << sum2 << std::endl;
return 0;
}
Performance with:
for(unsigned j = 0; j < numa_indexes.size(); j++){
arenas[j].initialize( tbb::task_arena::constraints(numa_indexes[j]));
}
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.896496
Time 2: 1.60392
2e+07
2e+07
Performance without constraints:
$ taskset -c 0,1,8,9 ./RUNME
Time 1: 0.652501
Time 2: 0.638362
2e+07
2e+07
EDIT: I implemented the use of task_group as found in @AlekseiFedotov's suggested resources, but the issue still remains.
The part of the provided example where the work with arenas happens is not a one-to-one match with the example from the docs, "Setting the preferred NUMA node" section.
Looking further into the specification of the task_arena::execute() method, we find that task_arena::execute() is a blocking API, i.e. it does not return until the passed lambda completes.
On the other hand, the specification of the task_group::run() method reveals that it is asynchronous, i.e. it returns immediately without waiting for the passed functor to complete.
That is where the problem lies, I guess. The code executes the two parallel loops within the arenas one after the other, serially so to speak. Consider following the example from the docs carefully.
BTW, the oneTBB project, which is the revamped version of the TBB, can be found here.
EDIT answer for the EDITED question:
See the comment to the question.
The waiting should happen after the work is submitted, not before it. Also, there is no need to go to another arena's task group to do the wait within the loop; just submit the work in the NUMA loop via arena[i].execute( [&, i] { task_group[i].run( [&, i] { /*...*/ } ); } ), then, in another loop, wait for each task_group within its corresponding task_arena.
Please note how I capture the NUMA loop iteration variable by copy. Otherwise, the code might refer to the wrong data inside the lambda body.
This loop:
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
finishes in 0 ms, while this one:
long n = 0;
unsigned int i, j, innerLoopLength = argc;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
takes 35 ms.
No matter what innerLoopLength is, the first method is always pretty fast, while the second gets slower and slower.
Does anybody know why, and is there a way to speed up the second version? I'm grateful for every ms.
Full code:
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
int main(int argc, char *argv[]) {
vector<long> v;
cout << "argc: " << argc << endl;
for (long l = 1; l <= argc; l++) {
v.push_back(l);
}
auto start = chrono::steady_clock::now();
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
auto end = chrono::steady_clock::now();
cout << "duration: " << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0 << " ms" << endl;
cout << "n: " << n << endl;
return 0;
}
Compiled with -std=c++1z and -O3.
The fixed-length loop was far quicker due to loop unrolling:
Loop unrolling, also known as loop unwinding, is a loop transformation
technique that attempts to optimize a program's execution speed at the
expense of its binary size, which is an approach known as space–time
tradeoff. The transformation can be undertaken manually by the
programmer or by an optimizing compiler.
The goal of loop unwinding is to increase a program's speed by
reducing or eliminating instructions that control the loop, such as
pointer arithmetic and "end of loop" tests on each iteration; reducing
branch penalties; as well as hiding latencies, including the delay in
reading data from memory. To eliminate this computational overhead,
loops can be re-written as a repeated sequence of similar independent
statements.
Essentially, the inner loop of your C(++) code is transformed into the following during compilation:
for (i = 0; i < 10000000; i++) {
n += v[0];
n += v[1];
n += v[2];
n += v[3];
}
As you can see, it is a little bit faster.
In your specific case there is yet another optimization: you add the same values to n 10000000 times. gcc has been able to detect this since around 3.*, and converts it to a multiplication. You can check this: running the same loop 100000000000 times will also finish in 0 ms. You can also check at the assembly level (g++ -S -o bench.s bench.c -O3): you will see only a multiplication, not an addition in a loop. To avoid this, add something that can't be converted to a multiplication so easily.
Neither optimization can be done in the second case. Thus, at the assembly level, you will have to deal with a lot of conditional jumps. These are costly on a modern CPU, because a mispredicted branch forces the CPU pipeline to be flushed.
What can help:
If you know something about innerLoopLength, for example that it is always divisible by 4, you can unroll the loop yourself.
Some gcc (g++) optimization flags can tell the compiler that you need fast code here. Compile with at least -O3 -funroll-loops.
I wanted to learn to use C++ 11 std::threads with VS2012 and I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. Test program is given below:
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum = 0;
for(unsigned int j = 0; j < 2; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void test_with_2_threds()
{
std::thread t[2];
sum = 0;
//Launch a group of threads
for (int i = 0; i < 2; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < 2; ++i) {
t[i].join();
}
}
int _tmain(int argc, _TCHAR* argv[])
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
cout << "-----------------------------------------\n";
cout << "test with 2_threds\n";
start = chrono::system_clock::now();
test_with_2_threds();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
_getch();
return 0;
}
Now, when I use for the counter just the long long variable (which is commented) I get value which is different from the correct - 100000000 instead of 200000000. I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at beginning. Performance is 110 ms with two threads vs 200 ms for one thread.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code, so (regardless of the value you give for range) it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel, without synchronizing, then adding together the individual results at the end.
Another point is to ensure against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the whole computation at compile-time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <numeric>
const int num_threads = 4;
struct val {
long long sum;
int pad[2];
val &operator=(long long i) { sum = i; return *this; }
operator long long &() { return sum; }
operator long long() const { return sum; }
};
val sum[num_threads];
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum[0] = 0LL;
for(unsigned int j = 0; j < num_threads; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum[0] ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum[tid] ++ ;
}
void test_with_threads()
{
std::thread t[num_threads];
std::fill_n(sum, num_threads, 0);
//Launch a group of threads
for (int i = 0; i < num_threads; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i) {
t[i].join();
}
long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}
int main()
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
cout << "-----------------------------------------\n";
cout << "test with threads\n";
start = chrono::system_clock::now();
test_with_threads();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
_getch();
return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the printed sum values are identical (note that sum is now an array, so this line prints its address rather than a total), and N threads increase speed by a factor of approximately N (up to the number of cores available).
Try using prefix increment, which may give a performance improvement.
Tested on my machine: std::memory_order_relaxed does not give any advantage.
I'm trying to get the elapsed time of my program. I thought I should use clock() from time.h, but it stays zero in all phases of the program although I'm adding 10^5 numbers (there must be some CPU time consumed). I already searched for this problem, and it seems that only people running Linux have this issue. I'm running Ubuntu 12.04 LTS.
I'm going to compare AVX and SSE instructions, so using time_t is not really an option. Any hints?
Here is the code:
//Dimension of Arrays
unsigned int N = 100000;
//Fill two arrays with random numbers
unsigned int a[N];
clock_t start_of_programm = clock();
for(int i=0;i<N;i++){
a[i] = i;
}
clock_t after_init_of_a = clock();
unsigned int b[N];
for(int i=0;i<N;i++){
b[i] = i;
}
clock_t after_init_of_b = clock();
//Add the two arrays with Standard
unsigned int out[N];
for(int i = 0; i < N; ++i)
out[i] = a[i] + b[i];
clock_t after_add = clock();
cout << "start_of_programm " << start_of_programm << endl; // prints
cout << "after_init_of_a " << after_init_of_a << endl; // prints
cout << "after_init_of_b " << after_init_of_b << endl; // prints
cout << "after_add " << after_add << endl; // prints
cout << endl << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << endl;
And the output of the console. I also used printf() with %d, with no difference.
start_of_programm 0
after_init_of_a 0
after_init_of_b 0
after_add 0
CLOCKS_PER_SEC 1000000
clock does indeed return the CPU time used, but the granularity is in the order of 10Hz. So if your code doesn't take more than 100ms, you will get zero. And unless it's significantly longer than 100ms, you won't get a very accurate value, because your error margin will be around 100ms.
So, increasing N or using a different method to measure time would be your choices. std::chrono will most likely produce a more accurate timing (but it will measure "wall-time", not CPU-time).
#include <time.h>
// Difference t2 - t1 in seconds.
double timespec_diff(timespec t2, timespec t1)
{
double d1 = t1.tv_sec + t1.tv_nsec / 1000000000.0;
double d2 = t2.tv_sec + t2.tv_nsec / 1000000000.0;
return d2 - d1;
}
timespec t1, t2;
clock_gettime(CLOCK_REALTIME, &t1);
// ... do stuff ...
clock_gettime(CLOCK_REALTIME, &t2);
double t = timespec_diff(t2, t1);
The simplest way to get the time is to just use a stub function from OpenMP. This will work on MSVC, GCC, and ICC. With MSVC you don't even need to enable OpenMP. With ICC you can link just the stubs if you like -openmp-stubs. With GCC you have to use -fopenmp.
#include <omp.h>
double dtime;
dtime = omp_get_wtime();
foo();
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
First, the compiler is very likely optimizing your code away. Check your compiler's optimization options.
Since the arrays out[], a[] and b[] are not used by any later code, and no value from them is ever output, the compiler is free to treat the following block as dead code and never execute it at all:
for(int i=0;i<N;i++){
a[i] = i;
}
for(int i=0;i<N;i++){
b[i] = i;
}
for(int i = 0; i < N; ++i)
out[i] = a[i] + b[i];
Since clock() returns CPU time, the above code consumes almost no time after optimization.
One more thing: set N to a bigger value. 100000 is too small for a performance test; a modern computer gets through O(n) code at that scale very quickly. A larger N such as the one below will no longer fit on a typical stack if the arrays stay local, so allocate them on the heap in that case.
unsigned int N = 10000000;
Add this to the end of the code
int sum = 0;
for(int i = 0; i<N; i++)
sum += out[i];
cout << sum;
Then you will see the times.
Since you don't use a[], b[], out[], the compiler ignores the corresponding for loops. This is because of compiler optimization.
Also, to see the time the code actually takes, build in debug mode instead of release; then you will be able to see the time it takes.
I have the following code:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/cstdint.hpp>
#include <iostream>
int main()
{
boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
boost::uint64_t sum = 0;
for (int i = 0; i < 1000000000; ++i)
sum += i;
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
std::cout << end - start << std::endl;
std::cout << sum << std::endl;
}
The task is: refactor the following program to calculate the total using two threads. Since many processors nowadays have two cores, the execution time should decrease by utilizing threads.
Here is my solution:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/thread.hpp>
#include <boost/cstdint.hpp>
#include <iostream>
boost::uint64_t s1 = 0;
boost::uint64_t s2 = 0;
void sum1()
{
for (int i = 0; i < 500000000; ++i)
s1 += i;
}
void sum2()
{
for (int i = 500000000; i < 1000000000; ++i)
s2 += i;
}
int main()
{
boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
boost::thread t1(sum1);
boost::thread t2(sum2);
t1.join();
t2.join();
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
std::cout << end - start << std::endl;
std::cout << s1+s2 << std::endl;
}
Please review and also answer the following questions:
1. Why does this code not actually reduce the execution time? :) (I use an Intel Core i5 processor and a Win7 64-bit system)
2. Why does the sum become incorrect when I use one variable s to store the sum instead of s1 and s2?
Thanks in advance.
I'll answer your second question, because the first one is not yet clear to me. When you are using a single global variable to calculate the sum there is a so-called "data race" caused by the fact that the operation
s += i;
is not "atomic", meaning that at the assembler level it is translated into several instructions. If one thread is executing this set of instructions it may be interrupted by another thread doing the same thing and your results will be inconsistent.
This is due to the fact that threads are scheduled on and off the CPU by the OS and it's impossible to predict how the threads will interleave their instruction execution.
The classic pattern in this case is to have two local variables collecting the sums for each thread and then summing them up together into a global variable once the threads have finished their work.
The answer to 1 should be: run in a profiler and see what it tells you.
But there is at least one usual suspect: False sharing. Your s1 and s2 likely end up on the same cacheline, so your 2 cores (if your 2 threads indeed end up on different cores) have to synchronize at the cacheline level. Make sure the 2 uint64_t are on different cachelines (whose size depends on the architecture you're targeting).
As to the answer to 2... Nothing in your program guarantees that the updates from one thread will not get stomped by the second and vice-versa. You need either synchronization primitives to make sure your updates don't happen at the same time, or atomic updates to make sure the updates don't stomp on each other.
I'll answer the first:
It takes a lot more time to create a thread than it takes to do nothing (base).
the compiler will convert this:
for (int i = 0; i < 1000000000; ++i)
sum += i;
into this:
// << optimized away >>
Even in your worst case, using local data, it would be one addition with optimization enabled.
The parallel version reduces the compiler's ability to optimize the program, while adding work.
The simplest way to refactor the program (code-wise) to compute the sum using multiple threads is to use OpenMP:
// $ g++ -fopenmp parallel-sum.cpp && ./a.out
#include <stdint.h>
#include <iostream>
const int32_t N = 1 << 30;
int main() {
int64_t sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int32_t i = 0; i < N; ++i)
sum += i;
std::cout << sum << " " << static_cast<int64_t>(N)*(N-1)/2 << std::endl;
}
Output
576460751766552576 576460751766552576
Here's a parallel reduction implemented using c++11 threads:
// $ g++ -std=c++0x -pthread parallel-sum-c++11.cpp && ./a.out
#include <cstdint>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
namespace {
std::mutex mutex;
void sum_interval(int32_t start, int32_t end, int64_t &sum) {
int64_t s = 0;
for ( ; start < end; ++start) s += start;
std::lock_guard<std::mutex> lock(mutex);
sum += s;
}
}
int main() {
int64_t sum = 0;
const int num_threads = 4;
const int32_t N = 1 << 30;
std::thread t[num_threads];
// fork threads; assign intervals to sum
int32_t start = 0, step = N / num_threads;
for (int i = 0; i < num_threads-1; ++i, start += step)
t[i] = std::thread(sum_interval, start, start+step, std::ref(sum));
t[num_threads-1] = std::thread(sum_interval, start, N, std::ref(sum));
// wait for result and print it
for (int i = 0; i < num_threads; ++i) t[i].join();
std::cout << sum << " " << static_cast<int64_t>(N)*(N-1)/2 << std::endl;
}
Note: Access to sum is guarded so only one thread at a time can change it. If sum is std::atomic<int64_t> then the locking can be omitted.