I'm trying to get the elapsed time of my program. I thought I should use clock() from time.h, but it stays at zero in all phases of the program even though I'm adding 10^5 numbers (some CPU time must be consumed). I already searched for this problem, and it seems that only people running Linux have this issue. I'm running Ubuntu 12.04 LTS.
I'm going to compare AVX and SSE instructions, so using time_t is not really an option. Any hints?
Here is the code:
//Dimension of Arrays
unsigned int N = 100000;
//Fill two arrays with random numbers
unsigned int a[N];
clock_t start_of_programm = clock();
for(int i=0;i<N;i++){
a[i] = i;
}
clock_t after_init_of_a = clock();
unsigned int b[N];
for(int i=0;i<N;i++){
b[i] = i;
}
clock_t after_init_of_b = clock();
//Add the two arrays with Standard
unsigned int out[N];
for(int i = 0; i < N; ++i)
out[i] = a[i] + b[i];
clock_t after_add = clock();
cout << "start_of_programm " << start_of_programm << endl; // prints
cout << "after_init_of_a " << after_init_of_a << endl; // prints
cout << "after_init_of_b " << after_init_of_b << endl; // prints
cout << "after_add " << after_add << endl; // prints
cout << endl << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << endl;
And here is the console output. I also tried printf() with %d, with no difference.
start_of_programm 0
after_init_of_a 0
after_init_of_b 0
after_add 0
CLOCKS_PER_SEC 1000000
clock() does indeed return the CPU time used, but its granularity is on the order of 10 Hz. So if your code doesn't take more than 100 ms, you will get zero. And unless it takes significantly longer than 100 ms, you won't get a very accurate value, because your error margin will be around 100 ms.
So your choices are to increase N or to use a different method to measure time. std::chrono will most likely produce more accurate timing (but it measures wall time, not CPU time).
#include <time.h>
// Returns t2 - t1 in seconds. Defined before its first use.
double timespec_diff(timespec t2, timespec t1)
{
double d1 = t1.tv_sec + t1.tv_nsec / 1000000000.0;
double d2 = t2.tv_sec + t2.tv_nsec / 1000000000.0;
return d2 - d1;
}
timespec t1, t2;
clock_gettime(CLOCK_REALTIME, &t1);
... do stuff ...
clock_gettime(CLOCK_REALTIME, &t2);
double t = timespec_diff(t2, t1);
The simplest way to get the time is to just use a stub function from OpenMP. This will work on MSVC, GCC, and ICC. With MSVC you don't even need to enable OpenMP. With ICC you can link just the stubs if you like, using -openmp-stubs. With GCC you have to use -fopenmp.
#include <omp.h>
double dtime;
dtime = omp_get_wtime();
foo();
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
First, the compiler is very likely optimizing your code away. Check your compiler's optimization options.
Since the arrays out[], a[] and b[] are not used by any later code, and no value from them is ever output, the compiler may optimize the following block so that it never executes at all:
for(int i=0;i<N;i++){
a[i] = i;
}
for(int i=0;i<N;i++){
b[i] = i;
}
for(int i = 0; i < N; ++i)
out[i] = a[i] + b[i];
Since clock() returns CPU time, the above code consumes almost no time after optimization.
One more thing: set N to a bigger value. 100000 is too small for a performance test; today's computers run O(n) code at the 100000 scale very quickly.
unsigned int N = 10000000;
Add this to the end of the code:
int sum = 0;
for(int i = 0; i<N; i++)
sum += out[i];
cout << sum;
Then you will see the times.
Since you don't use a[], b[] or out[], the compiler ignores the corresponding for loops. This is a compiler optimization.
Also, to see the time the code actually takes, build in debug mode instead of release mode.
I wrote a small program that generates random values for two valarrays and in a for loop the values of said arrays are added to a new one.
However, when I use a small array size (20 elements), the parallel version takes significantly longer than the serial one, and when I use large arrays (200 000 elements), it takes roughly the same amount of time (parallel is always a bit slower, though).
Why is this?
The only reason I can think of is that with the large array the CPU puts it in L3 cache and shares it across all cores, whereas with the small one it's having to copy it around the lower cache levels. Or am I getting this wrong?
Here is the code:
#include <valarray>
#include <iostream>
#include <ctime>
#include <omp.h>
#include <chrono>
int main()
{
int size = 2000000;
std::valarray<double> num1(size), num2(size), result(size);
std::srand(std::time(nullptr));
std::chrono::time_point<std::chrono::high_resolution_clock> start, stop;
std::chrono::microseconds duration;
for (int i = 0; i < size; ++i) {
num1[i] = std::rand();
num2[i] = std::rand();
}
//Parallel execution
start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(8)
for (int i = 0; i < size; ++i) {
result[i] = num1[i] + num2[i];
}
stop = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Parallel for loop executed in: " << duration.count() << " microseconds" << std::endl;
//Serial execution
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < size; ++i) {
result[i] = num1[i] + num2[i];
}
stop = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Serial for loop executed in: " << duration.count() << " microseconds" << std::endl;
}
Output with size = 200 000
Parallel for loop executed in: 2450 microseconds
Serial for loop executed in: 2726 microseconds
Output with size = 20
Parallel for loop executed in: 4727 microseconds
Serial for loop executed in: 0 microseconds
I'm using a Xeon E3-1230 V5 and I'm compiling with Intel's compiler using maximum optimization and Skylake specific optimizations as well.
I get identical results with Visual Studio's C++ compiler.
I have to add two vectors and compare serial performance against parallel performance.
However, my parallel code seems to take longer to execute than the serial code.
Could you please suggest changes to make the parallel code faster?
#include <iostream>
#include <time.h>
#include "omp.h"
#define ull unsigned long long
using namespace std;
void parallelAddition (ull N, const double *A, const double *B, double *C)
{
ull i;
#pragma omp parallel for shared (A,B,C,N) private(i) schedule(static)
for (i = 0; i < N; ++i)
{
C[i] = A[i] + B[i];
}
}
int main(){
ull n = 100000000;
double* A = new double[n];
double* B = new double[n];
double* C = new double[n];
double time_spent = 0.0;
for(ull i = 0; i<n; i++)
{
A[i] = 1;
B[i] = 1;
}
//PARALLEL
clock_t begin = clock();
parallelAddition(n, &A[0], &B[0], &C[0]);
clock_t end = clock();
time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
cout<<"time elapsed in parallel : "<<time_spent<<endl;
//SERIAL
time_spent = 0.0;
for(ull i = 0; i<n; i++)
{
A[i] = 1;
B[i] = 1;
}
begin = clock();
for (ull i = 0; i < n; ++i)
{
C[i] = A[i] + B[i];
}
end = clock();
time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
cout<<"time elapsed in serial : "<<time_spent;
return 0;
}
These are results:
time elapsed in parallel : 0.824808
time elapsed in serial : 0.351246
I've read on another thread that there are factors like spawning of threads, allocation of resources. But I don't know what to do to get the expected result.
EDIT:
Thanks! @zulan and @Daniel Langr's answers actually helped!
I used omp_get_wtime() instead of clock().
It turns out that clock() measures the cumulative CPU time of all threads, whereas omp_get_wtime() measures the wall-clock time elapsed between two arbitrary points.
This answer too answers this query pretty well: https://stackoverflow.com/a/10874371/4305675
Here's the fixed code:
void parallelAddition (ull N, const double *A, const double *B, double *C)
{
....
}
int main(){
....
//PARALLEL
double begin = omp_get_wtime();
parallelAddition(n, &A[0], &B[0], &C[0]);
double end = omp_get_wtime();
time_spent += (double)(end - begin);
cout<<"time elapsed in parallel : "<<time_spent<<endl;
....
//SERIAL
begin = omp_get_wtime();
for (ull i = 0; i < n; ++i)
{
C[i] = A[i] + B[i];
}
end = omp_get_wtime();
time_spent += (double)(end - begin);
cout<<"time elapsed in serial : "<<time_spent;
return 0;
}
RESULT AFTER CHANGES:
time elapsed in parallel : 0.204763
time elapsed in serial : 0.351711
There are multiple factors that influence your measurements:
Use omp_get_wtime() as @zulan suggested; otherwise, you may actually calculate combined CPU time instead of wall time.
Threading has some overhead and typically does not pay off for short calculations. You may want to use higher n.
"Touch" data in C array before running parallelAddition. Otherwise, the memory pages are actually allocated from OS inside parallelAddition. Easy fix since C++11: double* C = new double[n]{};.
I tried your program for n being 1G and the last change reduced runtime of parallelAddition from 1.54 to 0.94 [s] for 2 threads. Serial version took 1.83 [s], therefore, the speedup with 2 threads was 1.95, which was pretty close to ideal.
Other considerations:
Generally, if you profile something, make sure that the program has some observable effect. Otherwise, a compiler may optimize a lot of code away. Your array addition has no observable effect.
Add some form of restrict keyword to the C parameter. Without it, a compiler might not be able to apply vectorization.
If you are on a multi-socket system, take care about affinity of threads and NUMA effects. On my dual-socket system, runtime of a parallel version for 2 threads took 0.94 [s] (as mentioned above) when restricting threads to a single NUMA node (numactl -N 0 -m 0). Without numactl, it took 1.35 [s], thus 1.44 times more.
This loop:
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
finishes in 0 ms, while this one:
long n = 0;
unsigned int i, j, innerLoopLength = argc;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
takes 35 ms.
No matter what innerLoopLength is, the first method is always pretty fast, while the second gets slower and slower.
Does anybody know why, and is there a way to speed up the second version? I'm grateful for every ms.
Full code:
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
int main(int argc, char *argv[]) {
vector<long> v;
cout << "argc: " << argc << endl;
for (long l = 1; l <= argc; l++) {
v.push_back(l);
}
auto start = chrono::steady_clock::now();
long n = 0;
unsigned int i, j, innerLoopLength = 4;
for (i = 0; i < 10000000; i++) {
for (j = 0; j < innerLoopLength; j++) {
n += v[j];
}
}
auto end = chrono::steady_clock::now();
cout << "duration: " << chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0 << " ms" << endl;
cout << "n: " << n << endl;
return 0;
}
Compiled with -std=c++1z and -O3.
The fixed-length loop was far quicker due to loop unrolling:
Loop unrolling, also known as loop unwinding, is a loop transformation
technique that attempts to optimize a program's execution speed at the
expense of its binary size, which is an approach known as space–time
tradeoff. The transformation can be undertaken manually by the
programmer or by an optimizing compiler.
The goal of loop unwinding is to increase a program's speed by
reducing or eliminating instructions that control the loop, such as
pointer arithmetic and "end of loop" tests on each iteration; reducing
branch penalties; as well as hiding latencies, including the delay in
reading data from memory. To eliminate this computational overhead,
loops can be re-written as a repeated sequence of similar independent
statements.
Essentially, the inner loop of your C(++) code is transformed to the following before compilation:
for (i = 0; i < 10000000; i++) {
n += v[0];
n += v[1];
n += v[2];
n += v[3];
}
As you can see, the unrolled version no longer needs the inner loop's counter and end-of-loop test, which makes it faster.
In your specific case, there is yet another source of optimization: you add the same values to n 10000000 times. gcc has been able to detect this since around 3.* and converts it to a multiplication. You can check this: running the same loop 100000000000 times will similarly finish in 0 ms. You can also check at the assembly level (g++ -S -o bench.s bench.c -O3); you will see only a multiplication, not an addition in a loop. To avoid this, add something that can't be converted to a multiplication so easily.
Neither of these can be done in the second case. Thus, at the assembly level, you have to deal with a lot of conditional jumps. These are costly on a modern CPU, because a mispredicted branch forces the CPU pipeline to be flushed.
What can help:
If you know something about innerLoopLength, for example that it is always divisible by 4, you can unroll the loop yourself.
Use gcc (g++) optimization flags to tell the compiler that you need fast code here. Compile with at least -O3 -funroll-loops.
I wanted to learn to use C++ 11 std::threads with VS2012 and I wrote a very simple C++ console program with two threads which just increment a counter. I also want to test the performance difference when two threads are used. Test program is given below:
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
std::atomic<long long> sum(0);
//long long sum;
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum = 0;
for(unsigned int j = 0; j < 2; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum ++ ;
}
void test_with_2_threds()
{
std::thread t[2];
sum = 0;
//Launch a group of threads
for (int i = 0; i < 2; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < 2; ++i) {
t[i].join();
}
}
int _tmain(int argc, _TCHAR* argv[])
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
cout << "-----------------------------------------\n";
cout << "test with 2_threds\n";
start = chrono::system_clock::now();
test_with_2_threds();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
_getch();
return 0;
}
Now, when I use a plain long long variable for the counter (the commented-out line), I get a value that differs from the correct one: 100000000 instead of 200000000. I am not sure why that is. I suppose that the two threads are changing the counter at the same time, but I am not sure how that really happens, because ++ is just a very simple instruction. It seems that the threads are caching the sum variable at the beginning. Performance is 110 ms with two threads vs 200 ms for one thread.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
I am not sure why is that and I suppose that the two threads are changing the counter at the same time, but I am not sure how it happens really because ++ is just a very simple instruction.
Each thread is pulling the value of sum into a register, incrementing the register, and finally writing it back to memory at the end of the loop.
So the correct way according to documentation is to use std::atomic. However now the performance is much worse for both cases as about 3300 ms without threads and 15820 ms with threads. What is the correct way to use std::atomic in this case?
You're paying for the synchronization std::atomic provides. It won't be nearly as fast as using an un-synchronized integer, though you can get a small improvement to performance by refining the memory order of the add:
sum.fetch_add(1, std::memory_order_relaxed);
In this particular case, you're compiling for x86 and operating on a 64-bit integer. This means that the compiler has to generate code to update the value in two 32-bit operations; if you change the target platform to x64, the compiler will generate code to do the increment in a single 64-bit operation.
As a general rule, the solution to problems like this is to reduce the number of writes to shared data.
Your code has a couple of problems. First of all, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the value for the single-threaded code, so (regardless of the value you give for range) it shows as running in 0 ms.
Second, you're sharing a single variable (sum) between all the threads, forcing all of their accesses to be synchronized at that point. Without synchronization, that gives undefined behavior. As you've already found, synchronizing the access to that variable is quite expensive, so you usually want to avoid it if at all reasonable.
One way to do that is to use a separate subtotal for each thread, so they can all do their additions in parallel without synchronizing, and then add the individual results together at the end.
Another point is to ensure against false sharing. False sharing arises when two (or more) threads are writing to data that really is separate, but has been allocated in the same cache line. In this case, access to the memory can be serialized even though (as already noted) you don't have any data actually shared between the threads.
Based on those factors, I've rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the whole computation at compile-time, so we end up comparing one thread to 4 (which reminds me: I did increase the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable though, so it should be easy to test with different numbers of threads.
#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <numeric>
const int num_threads = 4;
struct val {
long long sum;
int pad[2];
val &operator=(long long i) { sum = i; return *this; }
operator long long &() { return sum; }
operator long long() const { return sum; }
};
val sum[num_threads];
using namespace std;
const int RANGE = 100000000;
void test_without_threds()
{
sum[0] = 0LL;
for(unsigned int j = 0; j < num_threads; j++)
for(unsigned int k = 0; k < RANGE; k++)
sum[0] ++ ;
}
void call_from_thread(int tid)
{
for(unsigned int k = 0; k < RANGE; k++)
sum[tid] ++ ;
}
void test_with_threads()
{
std::thread t[num_threads];
std::fill_n(sum, num_threads, 0);
//Launch a group of threads
for (int i = 0; i < num_threads; ++i) {
t[i] = std::thread(call_from_thread, i);
}
//Join the threads with the main thread
for (int i = 0; i < num_threads; ++i) {
t[i].join();
}
long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}
int main()
{
chrono::time_point<chrono::system_clock> start, end;
cout << "-----------------------------------------\n";
cout << "test without threds()\n";
start = chrono::system_clock::now();
test_without_threds();
end = chrono::system_clock::now();
chrono::duration<double> elapsed_seconds = end-start;
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
cout << "-----------------------------------------\n";
cout << "test with threads\n";
start = chrono::system_clock::now();
test_with_threads();
end = chrono::system_clock::now();
cout << "finished calculation for "
<< chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
<< "ms.\n";
cout << "sum:\t" << sum << "\n";
_getch();
return 0;
}
When I run this, my results are closer to what I'd guess you hoped for:
-----------------------------------------
test without threds()
finished calculation for 78ms.
sum: 000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum: 000000013FCBC370
... the sums are identical, but N threads increases speed by a factor of approximately N (up to the number of cores available).
Try to use prefix increment, which will give a performance improvement.
Tested on my machine, std::memory_order_relaxed does not give any advantage.
Given n threads, is there a way that I can calculate the amount of overhead (e.g. # of cycles) that is required to implement a specific directive in OpenMP?
For example, given the code below
#pragma omp parallel
{
#pragma omp for
for( int i=0 ; i < m ; i++ )
a[i] = b[i] + c[i];
}
Can I calculate somehow how much overhead is required to create these threads?
I think the way to measure the overhead is to time both the serial and parallel versions, and then see how far off the parallel version is from its 'ideal' running time for your number of threads.
So for example, if your serial version takes 10 seconds and you have 4 threads on 4 cores, then your ideal running time is 2.5 seconds. If your OpenMP version takes 4 seconds, then your 'overhead' is 1.5 seconds. I put overhead in quotes because some of that will be thread creation and memory sharing (actual threading overhead), and some of that will just be unparallelized sections of code. I'm trying to think here in terms of Amdahl's Law.
For demonstration, here are two examples. They don't measure thread creation overhead, but they might show the difference between expected and achieved improvement. And while Mystical was right that the only real way to measure is to time it, even trivial examples like your for loop aren't necessarily memory bound. OpenMP does a lot of work that we don't see.
Serial (speedtest.cpp)
#include <iostream>
int main(int argc, char** argv) {
const int SIZE = 100000000;
int* a = new int[SIZE];
int* b = new int[SIZE];
int* c = new int[SIZE];
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] * c[i] * 2;
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] + c[i] + 1;
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
delete[] a;
delete[] b;
delete[] c;
return 0;
}
Parallel (omp_speedtest.cpp)
#include <omp.h>
#include <iostream>
int main(int argc, char** argv) {
const int SIZE = 100000000;
int* a = new int[SIZE];
int* b = new int[SIZE];
int* c = new int[SIZE];
std::cout << "There are " << omp_get_num_procs() << " procs." << std::endl;
#pragma omp parallel
{
#pragma omp for
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] * c[i];
}
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
#pragma omp parallel
{
#pragma omp for
for(int i = 0; i < SIZE; i++) {
a[i] = b[i] + c[i] + 1;
}
}
std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;
delete[] a;
delete[] b;
delete[] c;
return 0;
}
So I compiled these with
g++ -O3 -o speedtest.exe speedtest.cpp
g++ -fopenmp -O3 -o omp_speedtest.exe omp_speedtest.cpp
And when I ran them
$ time ./speedtest.exe
a[99999999]=0
a[99999999]=1
real 0m1.379s
user 0m0.015s
sys 0m0.000s
$ time ./omp_speedtest.exe
There are 4 procs.
a[99999999]=0
a[99999999]=1
real 0m0.854s
user 0m0.015s
sys 0m0.015s
Yes, you can. Please take a look at the EPCC benchmark. Although this code is a bit old, it measures the overhead of various OpenMP constructs, including omp parallel for and omp critical.
The basic approach is simple and straightforward. You measure a baseline serial time without any OpenMP, then add just the OpenMP pragma that you want to measure, and subtract the elapsed times. This is exactly how the EPCC benchmark measures the overhead. See source files such as 'syncbench.c'.
Please note that the overhead is expressed as time rather than as a number of cycles. I also tried to measure the number of cycles, but the overhead of OpenMP parallel constructs may include time spent blocked on synchronization. Hence, the number of cycles may not reflect the real overhead of OpenMP.