C++11 std::thread summation with atomic very slow - c++

I wanted to learn to use C++11 std::thread with VS2012, so I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. The test program is given below:
#include <iostream>
#include <thread>
#include <chrono>
#include <conio.h>
#include <atomic>

std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;

void test_without_threds()
{
    sum = 0;
    for (unsigned int j = 0; j < 2; j++)
        for (unsigned int k = 0; k < RANGE; k++)
            sum++;
}

void call_from_thread(int tid)
{
    for (unsigned int k = 0; k < RANGE; k++)
        sum++;
}

void test_with_2_threds()
{
    std::thread t[2];
    sum = 0;

    //Launch a group of threads
    for (int i = 0; i < 2; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < 2; ++i) {
        t[i].join();
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    chrono::time_point<chrono::system_clock> start, end;

    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";
    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end - start;
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with 2_threds\n";
    start = chrono::system_clock::now();
    test_with_2_threds();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}
Now, when I use just the plain long long variable for the counter (the commented-out line), I get a value different from the correct one - 100000000 instead of 200000000. I am not sure why that is. I suppose the two threads are changing the counter at the same time, but I am not sure how that happens, because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at the beginning. Performance is 110 ms with two threads vs. 200 ms for one thread.
So the correct way, according to the documentation, is to use std::atomic. However, now the performance is much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?

I am not sure why that is. I suppose the two threads are changing the counter at the same time, but I am not sure how that happens, because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
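In other words, with the plain long long counter, the loop is effectively rewritten along these lines (a simplified sketch of what the optimizer does; the exact code generation is compiler-dependent):

    void call_from_thread(int tid)
    {
        long long local = sum;                 // value pulled into a register once
        for (unsigned int k = 0; k < RANGE; k++)
            local++;                           // increments happen in the register
        sum = local;                           // single write back at the end
    }

With two threads doing this concurrently, both read 0, both count to 100000000 in private registers, and the second write-back overwrites the first, which matches the 100000000 you observed.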
So the correct way, according to the documentation, is to use std::atomic. However, now the performance is much worse in both cases: about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
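For instance, the increment loop from the question could be written as (a minimal sketch using the relaxed ordering):

    void call_from_thread(int tid)
    {
        for (unsigned int k = 0; k < RANGE; k++)
            sum.fetch_add(1, std::memory_order_relaxed);
    }

A relaxed fetch_add still makes each increment atomic; it only drops the ordering guarantees that the default memory_order_seq_cst provides, which can save some synchronization cost on some architectures.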
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
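One way to apply that rule to the code above (a minimal sketch, not from the original post): each thread counts in a private local variable and touches the shared atomic exactly once.

    void call_from_thread(int tid)
    {
        long long local = 0;
        for (unsigned int k = 0; k < RANGE; k++)
            local++;                            // private, no synchronization
        sum.fetch_add(local, std::memory_order_relaxed); // one shared write per thread
    }

This turns 100000000 synchronized writes per thread into a single one.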

Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code; regardless of the value you give for RANGE, it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel without synchronizing, then add the individual results together at the end.
Another point is to ensure against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but stops the optimizer from seeing that it can do the whole computation at compile time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable, though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <chrono>
#include <conio.h>
#include <atomic>
#include <numeric>
#include <algorithm>

const int num_threads = 4;

struct val {
    long long sum;
    int pad[2];   // padding, so adjacent vals are less likely to share a cache line

    val &operator=(long long i) { sum = i; return *this; }
    operator long long &() { return sum; }
    operator long long() const { return sum; }
};

val sum[num_threads];

using namespace std;
const int RANGE = 100000000;

void test_without_threds()
{
    sum[0] = 0LL;
    for (unsigned int j = 0; j < num_threads; j++)
        for (unsigned int k = 0; k < RANGE; k++)
            sum[0]++;
}

void call_from_thread(int tid)
{
    for (unsigned int k = 0; k < RANGE; k++)
        sum[tid]++;
}

void test_with_threads()
{
    std::thread t[num_threads];
    std::fill_n(sum, num_threads, 0);

    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }

    long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}

int main()
{
    chrono::time_point<chrono::system_clock> start, end;

    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";
    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end - start;
    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    // Note: sum is an array here, so this prints its address, not a total.
    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with threads\n";
    start = chrono::system_clock::now();
    test_with_threads();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
         << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
         << "ms.\n";
    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the printed sums are identical (though note that sum is an array here, so what is printed is its address rather than a total), and N threads increase speed by a factor of approximately N (up to the number of cores available).
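One aside on the padding in struct val: sizeof(val) is only 16 bytes, while a cache line on typical x86 hardware is 64 bytes, so that padding reduces, but doesn't eliminate, the chance of two threads' counters sharing a line. A sketch of a stricter variant (assuming 64-byte lines; alignas is standard C++11, and the conversion operators from the original struct would stay the same):

    struct alignas(64) val {
        long long sum;
        // the alignment forces sizeof(val) up to 64, so each element of
        // the sum[] array owns an entire cache line by itself
    };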

Try using the prefix increment, which will give a performance improvement.
Tested on my machine: std::memory_order_relaxed does not give any advantage.

Related

Implementation of a lock free vector

After several searches, I cannot find a lock-free vector implementation.
There is a paper that talks about it, but nothing concrete (in any case, I have not found it): http://pirkelbauer.com/papers/opodis06.pdf
There are currently 2 threads dealing with the arrays; there may be more in a while.
One thread updates the different vectors and another thread accesses the vectors to do calculations, etc. Each thread accesses each array a large number of times per second.
I implemented a lock with a mutex on the different vectors, but when the reading or writing thread takes too long to unlock, all further updates are delayed.
I then thought of copying the array each time to go faster, but copying an array of thousands of elements thousands of times per second doesn't seem great to me.
So I thought of using one mutex per value in each table, to lock only the value I am working on.
A lock-free approach could be better, but I cannot find a solution, and I wonder whether the performance would really be better.
EDIT:
I have a thread that receives data and stores it in vectors.
When I instantiate the structure, I use a fixed size.
I have to do 2 different things for the updates:
-Update vector elements. (1d vector which simulates a 2d vector)
-Add a line at the end of the vector and remove the first line. The array always remains sorted. Adding elements is much, much rarer than updating.
The thread that is read-only walks the array and performs calculations.
To limit the time spent on the array and do as little calculation as possible, I use arrays that store the results of my calculations. Despite this, I often have to scan the array enough to do new calculations or just update them. (The application is real-time, so the calculations to be made vary according to the requests.)
When a new element is added to the vector, the reading thread must use it directly to update the calculations.
When I say calculation, it is not necessarily only arithmetic; it is more a treatment to be done.
There is no perfect implementation for concurrency; each task has its own "good enough". My go-to method for finding a decent implementation is to allow only what is needed, and then check whether I would need something more in the future.
You described a quite simple scenario: one thread performs one action on a shared vector at a time, and the vector needs to tell whether the action is allowed, so std::atomic_flag is good enough.
This example should give you an idea of how it works and what to expect. Mainly, I just attached a flag to each array and check it first to see whether it is safe to do something; some people also like to add a guard around the flag, just in case (see the RAII sketch after the code below).
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <cstdint>

const int vector_size = 1024;

struct Element {
    void some_yield() {
        std::this_thread::yield();
    };
    void some_wait() {
        std::this_thread::sleep_for(
            std::chrono::microseconds(1)
        );
    };
};

Element ** data;
std::atomic_flag * vector_safe;
std::atomic<bool> alive(true);   // shared stop flag; atomic to avoid a data race

uint64_t c_down_time = 0;
uint64_t p_down_time = 0;
uint32_t c_iterations = 0;
uint32_t p_iterations = 0;
std::chrono::high_resolution_clock::time_point c_time_point;
std::chrono::high_resolution_clock::time_point p_time_point;

// Simple variants without timing; not launched in main, shown for reference.
int simple_consumer_work(){
    Element a_read;
    uint16_t i, e;
    while (alive){
        // Loops through the vectors
        for (i = 0; i < vector_size; i++){
            // Spins until the vector at index i is free to read:
            // test_and_set returns the previous value, so true means
            // someone else holds the flag and we keep spinning
            while (vector_safe[i].test_and_set()){}
            // Do the whatever
            for (e = 0; e < vector_size; e++){
                a_read = data[i][e];
            }
            // And signal that this vector is done
            vector_safe[i].clear();
        }
    }
    return 0;
};

int simple_producer_work(){
    uint16_t i;
    while (alive){
        for (i = 0; i < vector_size; i++){
            while (vector_safe[i].test_and_set()){}
            data[i][i].some_wait();
            vector_safe[i].clear();
        }
        p_iterations++;
    }
    return 0;
};

int consumer_work(){
    Element a_read;
    uint16_t i, e;
    bool waiting;
    while (alive){
        for (i = 0; i < vector_size; i++){
            waiting = false;
            c_time_point = std::chrono::high_resolution_clock::now();
            while (vector_safe[i].test_and_set(std::memory_order_acquire)){
                waiting = true;
            }
            if (waiting){
                c_down_time += (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>
                    (std::chrono::high_resolution_clock::now() - c_time_point).count();
            }
            for (e = 0; e < vector_size; e++){
                a_read = data[i][e];
            }
            vector_safe[i].clear(std::memory_order_release);
        }
        c_iterations++;
    }
    return 0;
};

int producer_work(){
    bool waiting;
    uint16_t i;
    while (alive){
        for (i = 0; i < vector_size; i++){
            waiting = false;
            p_time_point = std::chrono::high_resolution_clock::now();
            while (vector_safe[i].test_and_set(std::memory_order_acquire)){
                waiting = true;
            }
            if (waiting){
                p_down_time += (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>
                    (std::chrono::high_resolution_clock::now() - p_time_point).count();
            }
            data[i][i].some_wait();
            vector_safe[i].clear(std::memory_order_release);
        }
        p_iterations++;
    }
    return 0;
};

void print_time(uint64_t down_time){
    if (down_time <= 1000) {
        std::cout << down_time << " [nanoseconds] \n";
    } else if (down_time <= 1000000) {
        std::cout << down_time / 1000 << " [microseconds] \n";
    } else if (down_time <= 1000000000) {
        std::cout << down_time / 1000000 << " [milliseconds] \n";
    } else {
        std::cout << down_time / 1000000000 << " [seconds] \n";
    }
};

int main(){
    std::uint16_t i;
    std::thread consumer;
    std::thread producer;

    vector_safe = new std::atomic_flag [vector_size] {ATOMIC_FLAG_INIT};
    data = new Element * [vector_size];
    for (i = 0; i < vector_size; i++){
        data[i] = new Element[vector_size];   // one array per flag
    }

    consumer = std::thread(consumer_work);
    producer = std::thread(producer_work);

    std::this_thread::sleep_for(
        std::chrono::seconds(10)
    );

    alive = false;
    producer.join();
    consumer.join();

    std::cout << " Consumer loops > " << c_iterations << std::endl;
    std::cout << " Consumer time lost > "; print_time(c_down_time);
    std::cout << " Producer loops > " << p_iterations << std::endl;
    std::cout << " Producer time lost > "; print_time(p_down_time);

    for (i = 0; i < vector_size; i++){
        delete [] data[i];
    }
    delete [] vector_safe;
    delete [] data;
    return 0;
}
And don't forget that the compiler can and will rearrange portions of the code; spaghetti code is really, really buggy in multithreading.
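On the guard idea mentioned above, here is a minimal sketch (my own naming, not part of the original answer) of an RAII wrapper around an std::atomic_flag, so the flag is always cleared even if the guarded code returns early or throws:

    #include <atomic>

    class flag_guard {
        std::atomic_flag &flag_;
    public:
        explicit flag_guard(std::atomic_flag &f) : flag_(f) {
            // spin until we own the flag
            while (flag_.test_and_set(std::memory_order_acquire)) {}
        }
        ~flag_guard() { flag_.clear(std::memory_order_release); }
        flag_guard(const flag_guard&) = delete;
        flag_guard &operator=(const flag_guard&) = delete;
    };

Usage would be `flag_guard guard(vector_safe[i]);` at the top of each per-vector critical section, replacing the manual test_and_set/clear pairs.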

Parallelizing a small array is slower than parallelizing a large array?

I wrote a small program that generates random values for two valarrays, and in a for loop the values of those arrays are added into a new one.
However, when I use a small array size (20 elements) the parallel version takes significantly longer than the serial one, and when I'm using large arrays (200 000 elements) it takes roughly the same amount of time (parallel is always a bit slower, though).
Why is this?
The only reason I can think of is that with the large array the CPU puts it in the L3 cache and shares it across all cores, whereas with the small one it has to copy it around the lower cache levels? Or am I getting this wrong?
Here is the code:
#include <valarray>
#include <iostream>
#include <ctime>
#include <omp.h>
#include <chrono>

int main()
{
    int size = 2000000;
    std::valarray<double> num1(size), num2(size), result(size);
    std::srand(std::time(nullptr));
    std::chrono::time_point<std::chrono::steady_clock> start, stop;
    std::chrono::microseconds duration;

    for (int i = 0; i < size; ++i) {
        num1[i] = std::rand();
        num2[i] = std::rand();
    }

    //Parallel execution
    start = std::chrono::steady_clock::now();
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::steady_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Parallel for loop executed in: " << duration.count() << " microseconds" << std::endl;

    //Serial execution
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::steady_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Serial for loop executed in: " << duration.count() << " microseconds" << std::endl;
}
Output with size = 200 000
Parallel for loop executed in: 2450 microseconds
Serial for loop executed in: 2726 microseconds
Output with size = 20
Parallel for loop executed in: 4727 microseconds
Serial for loop executed in: 0 microseconds
I'm using a Xeon E3-1230 V5, and I'm compiling with Intel's compiler using maximum optimization and Skylake-specific optimizations as well.
I get identical results with Visual Studio's C++ compiler.

Loop is faster with fixed limit

This loop:
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
    for (j = 0; j < innerLoopLength; j++) {
        n += v[j];
    }
}
finishes in 0 ms, while this one:
long n = 0;
unsigned int i, j, innerLoopLength = argc;
for (i = 0; i < 10000000; i++) {
    for (j = 0; j < innerLoopLength; j++) {
        n += v[j];
    }
}
takes 35 ms.
No matter what innerLoopLength is, the first version is always pretty fast, while the second gets slower and slower.
Does anybody know why, and is there a way to speed up the second version? I'm grateful for every ms.
Full code:
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;

int main(int argc, char *argv[]) {
    vector<long> v;
    cout << "argc: " << argc << endl;
    for (long l = 1; l <= argc; l++) {
        v.push_back(l);
    }

    auto start = chrono::steady_clock::now();

    long n = 0;
    unsigned int i, j, innerLoopLength = 4;
    for (i = 0; i < 10000000; i++) {
        for (j = 0; j < innerLoopLength; j++) {
            n += v[j];
        }
    }

    auto end = chrono::steady_clock::now();
    cout << "duration: " << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0 << " ms" << endl;
    cout << "n: " << n << endl;
    return 0;
}
Compiled with -std=c++1z and -O3.
The fixed-length loop was far quicker due to loop unrolling:
Loop unrolling, also known as loop unwinding, is a loop transformation
technique that attempts to optimize a program's execution speed at the
expense of its binary size, which is an approach known as space–time
tradeoff. The transformation can be undertaken manually by the
programmer or by an optimizing compiler.
The goal of loop unwinding is to increase a program's speed by
reducing or eliminating instructions that control the loop, such as
pointer arithmetic and "end of loop" tests on each iteration; reducing
branch penalties; as well as hiding latencies, including the delay in
reading data from memory. To eliminate this computational overhead,
loops can be re-written as a repeated sequence of similar independent
statements.
Essentially, the inner loop of your C(++) code is transformed to the following before compilation:
for (i = 0; i < 10000000; i++) {
    n += v[0];
    n += v[1];
    n += v[2];
    n += v[3];
}
As you can see, the loop-control instructions are gone, so it is a little bit faster.
In your specific case, there is yet another source of optimization: you sum the same four values into n 10000000 times. gcc has been able to detect this since around version 3.x, and converts it to a multiplication. You can verify that doing the same loop 100000000000 times will similarly finish in 0 ms. You can also check at the ASM level (g++ -S -o bench.s bench.c -O3): you will see only a multiplication, not an addition in a loop. To avoid this, you would have to add something that can't be converted to a multiplication so easily.
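In effect, the transformation looks like this (a sketch of what the optimizer derives; the exact code generation varies by compiler version):

    // the four loads are loop-invariant, so the repeated addition
    // collapses into a single multiplication:
    n += 10000000L * (v[0] + v[1] + v[2] + v[3]);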
Neither of these optimizations is possible in the second case, because the trip count is only known at runtime. Thus, on the ASM level, you will have to deal with a lot of conditional expressions (conditional jumps). These are costly on a modern CPU, because an unexpected outcome causes the CPU pipeline to be flushed.
What can help:
If you know something about innerLoopLength, for example that it is always divisible by 4, you can unroll the loop yourself, as sketched below.
Use gcc (g++) optimization flags to tell the compiler that you need fast code here. Compile with at least -O3 -funroll-loops.
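A minimal sketch of the manual unrolling suggested in the first point (assuming innerLoopLength is always divisible by 4 and v holds at least innerLoopLength elements, as in the question's code):

    long n = 0;
    unsigned int i, j, innerLoopLength = argc;   // still a runtime value
    for (i = 0; i < 10000000; i++) {
        for (j = 0; j + 4 <= innerLoopLength; j += 4) {
            // four additions per iteration: one loop test and one branch
            // instead of four of each
            n += v[j];
            n += v[j + 1];
            n += v[j + 2];
            n += v[j + 3];
        }
    }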

Using multithreading to optimize the execution time (simple example)

I have the following code:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/cstdint.hpp>
#include <iostream>

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    boost::uint64_t sum = 0;
    for (int i = 0; i < 1000000000; ++i)
        sum += i;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();

    std::cout << end - start << std::endl;
    std::cout << sum << std::endl;
}
The task is: refactor the following program to calculate the total using two threads. Since many processors nowadays have two cores, the execution time should decrease by utilizing threads.
Here is my solution:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/thread.hpp>
#include <boost/cstdint.hpp>
#include <iostream>

boost::uint64_t s1 = 0;
boost::uint64_t s2 = 0;

void sum1()
{
    for (int i = 0; i < 500000000; ++i)
        s1 += i;
}

void sum2()
{
    for (int i = 500000000; i < 1000000000; ++i)
        s2 += i;
}

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    boost::thread t1(sum1);
    boost::thread t2(sum2);
    t1.join();
    t2.join();

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();

    std::cout << end - start << std::endl;
    std::cout << s1 + s2 << std::endl;
}
Please review and also answer the following questions:
1. Why does this code not actually improve the execution time? :) (I use an Intel Core i5 processor and a Win7 64-bit system.)
2. Why, when I use one variable s to store the sum instead of s1 and s2, does the sum become incorrect?
Thanks in advance.
I'll answer your second question, because the first one is not yet clear to me. When you use a single global variable to calculate the sum, there is a so-called "data race", caused by the fact that the operation
s += i;
is not "atomic", meaning that at the assembler level it is translated into several instructions. If one thread is executing this set of instructions, it may be interrupted by another thread doing the same thing, and your results will be inconsistent.
This is because threads are scheduled on and off the CPU by the OS, and it's impossible to predict how the threads will interleave their instruction execution.
The classic pattern in this case is to have two local variables collecting the sums for each thread, and then summing them up into a global variable once the threads have finished their work.
The answer to 1 should be: run it in a profiler and see what it tells you.
But there is at least one usual suspect: false sharing. Your s1 and s2 likely end up on the same cache line, so your 2 cores (if your 2 threads do end up on different cores) have to synchronize at the cache-line level. Make sure the two uint64_t are on different cache lines (whose size depends on the architecture you're targeting), as sketched below.
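A minimal sketch of that fix (assuming 64-byte cache lines, which is typical for x86; the struct name is my own):

    // pad each counter out to a full cache line, so the two cores stop
    // invalidating each other's line on every increment
    struct padded_counter {
        boost::uint64_t value;
        char pad[64 - sizeof(boost::uint64_t)];
    };

    padded_counter s1, s2;   // use s1.value / s2.value in sum1() and sum2()

Because each struct is 64 bytes, the two value members are 64 bytes apart and therefore cannot occupy the same 64-byte line.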
As to the answer to 2... Nothing in your program guarantees that the updates from one thread will not get stomped by the second and vice-versa. You need either synchronization primitives to make sure your updates don't happen at the same time, or atomic updates to make sure the updates don't stomp on each other.
I'll answer the first:
It takes a lot more time to create a thread than the baseline takes to do anything at all, because the compiler will convert this:

    for (int i = 0; i < 1000000000; ++i)
        sum += i;

into this:

    // << optimized away >>

Even in your worst case, using local data, it would be one addition with optimization enabled.
The parallel version reduces the compiler's ability to optimize the program, while adding work.
The simplest way to refactor the program (code-wise) to compute the sum using multiple threads is to use OpenMP:
// $ g++ -fopenmp parallel-sum.cpp && ./a.out
#include <stdint.h>
#include <iostream>

const int32_t N = 1 << 30;

int main() {
    int64_t sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int32_t i = 0; i < N; ++i)
        sum += i;
    std::cout << sum << " " << static_cast<int64_t>(N)*(N-1)/2 << std::endl;
}
Output
576460751766552576 576460751766552576
Here's a parallel reduction implemented using C++11 threads:
// $ g++ -std=c++0x -pthread parallel-sum-c++11.cpp && ./a.out
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

namespace {
    std::mutex mutex;

    void sum_interval(int32_t start, int32_t end, int64_t &sum) {
        int64_t s = 0;
        for ( ; start < end; ++start) s += start;
        std::lock_guard<std::mutex> lock(mutex);
        sum += s;
    }
}

int main() {
    int64_t sum = 0;
    const int num_threads = 4;
    const int32_t N = 1 << 30;
    std::thread t[num_threads];

    // fork threads; assign intervals to sum
    int32_t start = 0, step = N / num_threads;
    for (int i = 0; i < num_threads-1; ++i, start += step)
        t[i] = std::thread(sum_interval, start, start+step, std::ref(sum));
    t[num_threads-1] = std::thread(sum_interval, start, N, std::ref(sum));

    // wait for result and print it
    for (int i = 0; i < num_threads; ++i) t[i].join();
    std::cout << sum << " " << static_cast<int64_t>(N)*(N-1)/2 << std::endl;
}
Note: Access to sum is guarded so only one thread at a time can change it. If sum is std::atomic<int64_t> then the locking can be omitted.
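For example, that atomic variant might look like this (a sketch: add #include <atomic>, declare sum in main as std::atomic<int64_t>, and drop the mutex):

    void sum_interval(int32_t start, int32_t end, std::atomic<int64_t> &sum) {
        int64_t s = 0;
        for ( ; start < end; ++start) s += start;
        sum += s;   // atomic read-modify-write; no lock needed
    }

Each thread still accumulates into a local variable first, so the shared atomic is only touched num_threads times and contention stays negligible.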

OpenMP overhead calculation

Given n threads, is there a way to calculate the amount of overhead (e.g., the number of cycles) that is required to implement a specific directive in OpenMP?
For example, given the code below
#pragma omp parallel
{
    #pragma omp for
    for( int i=0 ; i < m ; i++ )
        a[i] = b[i] + c[i];
}
Can I somehow calculate how much overhead is required to create these threads?
I think the way to measure the overhead is to time both the serial and parallel versions, and then see how far off the parallel version is from its 'ideal' running time for your number of threads.
So for example, if your serial version takes 10 seconds and you have 4 threads on 4 cores, then your ideal running time is 2.5 seconds. If your OpenMP version takes 4 seconds, then your 'overhead' is 1.5 seconds. I put overhead in quotes because some of that will be thread creation and memory sharing (actual threading overhead), and some of that will just be unparallelized sections of code. I'm trying to think here in terms of Amdahl's Law.
For demonstration, here are two examples. They don't measure thread creation overhead, but they might show the difference between expected and achieved improvement. And while Mystical was right that the only real way to measure is to time it, even trivial examples like your for loop aren't necessarily memory bound. OpenMP does a lot of work that we don't see.
Serial (speedtest.cpp)
#include <iostream>

int main(int argc, char** argv) {
    const int SIZE = 100000000;
    int* a = new int[SIZE];
    int* b = new int[SIZE];
    int* c = new int[SIZE];

    for(int i = 0; i < SIZE; i++) {
        a[i] = b[i] * c[i] * 2;
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    for(int i = 0; i < SIZE; i++) {
        a[i] = b[i] + c[i] + 1;
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
Parallel (omp_speedtest.cpp)
#include <omp.h>
#include <iostream>

int main(int argc, char** argv) {
    const int SIZE = 100000000;
    int* a = new int[SIZE];
    int* b = new int[SIZE];
    int* c = new int[SIZE];

    std::cout << "There are " << omp_get_num_procs() << " procs." << std::endl;

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < SIZE; i++) {
            a[i] = b[i] * c[i];
        }
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    #pragma omp parallel
    {
        #pragma omp for
        for(int i = 0; i < SIZE; i++) {
            a[i] = b[i] + c[i] + 1;
        }
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
So I compiled these with
g++ -O3 -o speedtest.exe speedtest.cpp
g++ -fopenmp -O3 -o omp_speedtest.exe omp_speedtest.cpp
And when I ran them
$ time ./speedtest.exe
a[99999999]=0
a[99999999]=1
real 0m1.379s
user 0m0.015s
sys 0m0.000s
$ time ./omp_speedtest.exe
There are 4 procs.
a[99999999]=0
a[99999999]=1
real 0m0.854s
user 0m0.015s
sys 0m0.015s
Yes, you can. Please take a look at the EPCC benchmark. Although this code is a bit older, it measures the various overheads of OpenMP's constructs, including omp parallel for and omp critical.
The basic approach is very simple and straightforward. You measure a baseline serial time without any OpenMP, then add just the OpenMP pragma that you want to measure, and subtract the elapsed times. This is exactly how the EPCC benchmark measures the overhead; see a source file such as 'syncbench.c'.
Please note that the overhead is expressed as time rather than as a number of cycles. I also tried to measure the number of cycles, but an OpenMP parallel construct's overhead may include time blocked on synchronization, so the cycle count may not reflect the real overhead of OpenMP.
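To make the subtraction method concrete, here is a condensed sketch of the EPCC idea (my own reduction of it, not the actual syncbench.c; the delay() workload and repetition count are placeholders):

    // $ g++ -fopenmp overhead.cpp && ./a.out
    #include <omp.h>
    #include <iostream>

    // a tiny amount of work; volatile keeps the compiler from removing it
    void delay() {
        volatile int x = 0;
        for (int i = 0; i < 1000; ++i) x = x + 1;
    }

    int main() {
        const int reps = 10000;

        // reference: the work alone, no OpenMP construct
        double t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r)
            delay();
        double serial = omp_get_wtime() - t0;

        // the same work wrapped in the construct under test; every thread
        // runs delay() concurrently, so the ideal wall time equals the
        // serial time and the difference is fork/join overhead
        t0 = omp_get_wtime();
        for (int r = 0; r < reps; ++r) {
            #pragma omp parallel
            delay();
        }
        double parallel = omp_get_wtime() - t0;

        std::cout << "approx overhead per 'omp parallel': "
                  << (parallel - serial) / reps * 1e6 << " microseconds\n";
    }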