Cost of OpenMPI in C++

I have the following C++ program, which uses no communication; the same identical work is done on all cores. I know that this doesn't use parallel processing at all:
unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
for (unsigned i = 0; i < n; i++)
{
    // Do something so it's not a trivial loop
    vec1[i] = vec2[i] + i;
}
t2 = MPI_Wtime();
dt = t2 - t1;
I'm running this program on a single node with two Intel® Xeon® E5-2690 v3 processors, so I have 24 cores altogether. This is a dedicated node; no one else is using it.
Since there is no communication and each processor does the same amount of (identical) work, running it on multiple processors should give the same time. However, I get the following times (averaged over all cores):
1 core: 0.237
2 cores: 0.240
4 cores: 0.241
8 cores: 0.261
16 cores: 0.454
What could cause the increase in time? Particularly for 16 cores.
I have run callgrind and I get roughly the same number of data/instruction misses on all cores (the percentage of misses is the same).
I have repeated the same test on a node with two Intel® Xeon® E5-2628L v2 processors (16 cores altogether) and I observe the same increase in execution times. Is this something to do with the MPI implementation?

Considering you are using ~2 GiB of memory per rank, your code is memory-bound. Except for the prefetchers, you are not operating within the cache but in main memory. You are simply hitting the memory bandwidth limit at a certain number of active cores.
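A rough back-of-envelope check (my numbers, assuming write-allocate traffic for the stores): each rank allocates two std::vector<double> of 130,000,000 elements, i.e. about 2 x 130e6 x 8 B ≈ 2.1 GB. The loop streams vec2 in and vec1 out, so each run moves roughly 3 GB of memory traffic per rank; at ~0.24 s that is on the order of 12-13 GB/s per rank. Sixteen ranks would then need around 200 GB/s, well above what two Haswell sockets can typically sustain (roughly 100-140 GB/s), which matches the slowdown you observe at 16 cores.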
Another aspect can be turbo mode, if enabled. Turbo mode can increase the core frequency to higher levels if fewer cores are utilized. As long as the memory bandwidth is not saturated, the higher frequency from turbo mode will increase the bandwidth each core gets. This paper discusses the available aggregate memory bandwidth on Haswell processors depending on the number of active cores and frequency (Fig. 7/8).
Please note that this has nothing to do with MPI / OpenMPI. You might as well launch the same program X times by any other means.

I suspect there are shared resources that your program needs, so as the number of processes increases there are delays while each process waits for a resource to be freed so it can use it.
You see, you may have 24 cores, but that doesn't mean the whole system allows every core to do everything concurrently. As mentioned in the comments, memory access is one thing that might cause delays (due to traffic); the same goes for disk.
Also consider the interconnection network, which can also suffer under many accesses. In conclusion, these hardware delays are enough to overwhelm the processing time.
General note: Remember how Efficiency of a program is defined:
E = S/p, where S is the speedup and p the number of nodes/processes/threads
Now take scalability into account. Usually programs are weakly scalable, i.e. you have to increase the problem size at the same rate as p to keep the efficiency constant. A program that keeps its efficiency constant while only p increases and the problem size (n in your case) stays fixed is strongly scalable.
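As a purely illustrative example: if a run on p = 4 processes is 3.2 times faster than on one process, then S = 3.2 and E = 3.2/4 = 0.8, i.e. 80% efficiency; a strongly scalable program would keep E at roughly that level as p grows while n stays fixed.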

Your program is not using parallel processing at all. Just because you have compiled it with OpenMP does not make it parallel.
To parallelize the for loop, for example, you need to use the different #pragmas OpenMP offers.
unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
#pragma omp parallel for
for (unsigned i = 0; i < n; i++)
{
    // Do something so it's not a trivial loop
    vec1[i] = vec2[i] + i;
}
t2 = MPI_Wtime();
dt = t2 - t1;
However, take into account that for large values of n, the impact of cache misses may hide the performance gained with multiple cores.
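For completeness, here is a minimal self-contained sketch of the same timing experiment with the loop actually parallelized. It assumes an OpenMP-capable compiler (e.g. g++ -fopenmp) and uses omp_get_wtime() instead of MPI_Wtime() so no MPI initialization is needed:
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    unsigned n = 130000000;
    std::vector<double> vec1(n, 1.0);
    std::vector<double> vec2(n, 1.0);

    double t1 = omp_get_wtime();
    #pragma omp parallel for
    for (unsigned i = 0; i < n; i++)
    {
        vec1[i] = vec2[i] + i;   // the work is now split across threads
    }
    double t2 = omp_get_wtime();

    std::printf("dt = %f s using up to %d threads\n", t2 - t1, omp_get_max_threads());
    return 0;
}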

Related

Execution time inconsistency in a program with high priority in the scheduler using RT Kernel

Problem
We are trying to implement a program that sends commands to a robot in a given cycle time, so this program should be a real-time application. We set up a PC with a PREEMPT_RT Linux kernel and launch our programs with chrt -f 98 or chrt -rr 99 to define the scheduling policy and priority. Loading of the kernel and launching of the program seem to be fine and work (see details below).
Now we measured the time (CPU ticks) it takes our program to be computed. We expected this time to be constant with very little variation. What we measured, though, were quite significant differences in computation time. Of course, we thought this could be undefined behavior in our rather complex program, so we created a very basic program and measured the time as well. The behavior was similarly bad.
Question
Why are we not measuring a (close to) constant computation time even for our basic program?
How can we solve this problem?
Environment Description
First of all, we installed an RT Linux kernel on the PC using this tutorial. The main characteristics of the PC are:
CPU: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz with 4 cores
RAM: 8 GB
Operating System: Ubuntu 20.04.1 LTS
Kernel: Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture: x86-64
Tests
The first time we detected this problem was when we were measuring the time it takes to execute this "complex" program with a single thread. We did a few tests with this program, but also with a simpler one, measuring:
The CPU execution time
The wall time (real-world time)
The difference (wall time - CPU time) between them and the ratio (CPU time / wall time).
We also did a latency test on the PC.
Latency Test
For this one, we followed this tutorial, and these are the results:
Latency test plots: generic kernel and RT kernel.
The processes are shown in htop with a priority of RT
Test Program - Complex
We called the function multiple times in the program and measured the time each call takes. From the results of the 2 tests we observed that:
The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
The mode is around 0.17 ms.
For those that take 0.17 ms, the difference is usually 0 and the ratio 1, although this is not exclusive to this time. For these, it seems like only 1 CPU is being used and it is saturated (there is no waiting time).
When the difference is not 0, it is usually negative. This, from what we have read here and here, is because more than 1 CPU is being used.
Test Program - Simple
We did the same test but this time with a simpler program:
#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;

    std::cout << "Iteration | WallTime | cpuTime" << std::endl;

    for (unsigned int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);

        // Function
        a = b + c + i;

        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);

        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
        cpuTime[i]  = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec)*1e-9;

        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }
    return 0;
}
Final Thoughts
We understand that:
If the ratio == number of CPUs used, they are saturated and there is no waiting time.
If the ratio < number of CPUs used, it means that there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).
Of course, we can give more details.
Thanks a lot for your help!
Your function will almost certainly be optimized away, so you are just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions:
The very first time you run the code (unless you just compiled it), the pages need to be loaded from disk. If you are unlucky, the code spans pages and you include the loading of the next page in the measured time. Quite unlikely given the code size.
On the first loop iteration, the code and any data need to be loaded into cache, so that iteration takes longer to execute. The branch predictor might also need a few iterations to predict the loop right, so the second and third iterations might be slightly longer too.
For everything else I think you can blame scheduling:
an IRQ happens but nothing gets rescheduled
the process gets paused while another process runs
the process gets moved to another CPU thread leaving the caches hot
the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)
You can do little about IRQs. Some you can pin to specific cores, but others are simply essential (like the timer interrupt for the scheduler itself). You kind of just have to live with that.
But you can pin your program to a specific CPU and you can confine everything else to all the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. And you might still get some kernel threads running on the reserved core; nothing you can do about that. But that should eliminate most of the large execution times.
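A minimal Linux sketch of the pinning idea, using the standard sched_setaffinity call (core 3 here is just an example; everything else would still need to be kept off that core, e.g. with cpusets/cgroups or the isolcpus boot parameter):
#include <sched.h>     // sched_setaffinity (Linux; may require _GNU_SOURCE with some compilers)
#include <cstdio>

int main()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                                   // pin this process to core 3 (example core)
    if (sched_setaffinity(0, sizeof(set), &set) != 0)   // pid 0 = the calling process
    {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... real-time work runs here, always on core 3 ...
    return 0;
}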

How to reach AVX computation throughput for a simple loop

Recently I have been working on a numerical solver for computational electrodynamics using the finite difference method.
The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example:
#pragma ivdep
for (int ii = 0; ii < Large_Number; ii++)
{
    Z[ii] = C1*Z[ii] + C2*D[ii];
}
Large_Number is about 1,000,000, but not bigger than 10,000,000
I have tried to manually unroll the loop and write AVX code but failed to make it faster:
int Vec_Size = 8;
int Unroll_Num = 6;
int remainder = Large_Number%(Vec_Size*Unroll_Num);
int iter = Large_Number/(Vec_Size*Unroll_Num);
int addr_incr = Vec_Size*Unroll_Num;
__m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
__m256 AVX_Z1, AVX_Z2, AVX_Z3, AVX_Z4, AVX_Z5, AVX_Z6;
__m256 AVX_Zb = _mm256_set1_ps(Zb);
__m256 AVX_Za = _mm256_set1_ps(Za);
for (int it = 0; it < iter; it++)
{
    int addr = it * addr_incr;   // start index of this unrolled block
    AVX_Div1 = _mm256_loadu_ps(&Div1[addr]);
    AVX_Z1   = _mm256_loadu_ps(&Z[addr]);
    AVX_Z1   = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div1), _mm256_mul_ps(AVX_Za, AVX_Z1));
    _mm256_storeu_ps(&Z[addr], AVX_Z1);
    AVX_Div2 = _mm256_loadu_ps(&Div1[addr+8]);
    AVX_Z2   = _mm256_loadu_ps(&Z[addr+8]);
    AVX_Z2   = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div2), _mm256_mul_ps(AVX_Za, AVX_Z2));
    _mm256_storeu_ps(&Z[addr+8], AVX_Z2);
    AVX_Div3 = _mm256_loadu_ps(&Div1[addr+16]);
    AVX_Z3   = _mm256_loadu_ps(&Z[addr+16]);
    AVX_Z3   = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div3), _mm256_mul_ps(AVX_Za, AVX_Z3));
    _mm256_storeu_ps(&Z[addr+16], AVX_Z3);
    AVX_Div4 = _mm256_loadu_ps(&Div1[addr+24]);
    AVX_Z4   = _mm256_loadu_ps(&Z[addr+24]);
    AVX_Z4   = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div4), _mm256_mul_ps(AVX_Za, AVX_Z4));
    _mm256_storeu_ps(&Z[addr+24], AVX_Z4);
    AVX_Div5 = _mm256_loadu_ps(&Div1[addr+32]);
    AVX_Z5   = _mm256_loadu_ps(&Z[addr+32]);
    AVX_Z5   = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div5), _mm256_mul_ps(AVX_Za, AVX_Z5));
    _mm256_storeu_ps(&Z[addr+32], AVX_Z5);
    AVX_Div6 = _mm256_loadu_ps(&Div1[addr+40]);
    AVX_Z6   = _mm256_loadu_ps(&Z[addr+40]);
    AVX_Z6   = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div6), _mm256_mul_ps(AVX_Za, AVX_Z6));
    _mm256_storeu_ps(&Z[addr+40], AVX_Z6);
}
The above AVX loop is actually a bit slower than the Intel compiler generated code.
The compiler-generated code can reach about 8 GFLOPS/s, about 25% of the single-thread theoretical throughput of a 3 GHz Ivy Bridge processor. I wonder if it is even possible to reach that throughput for a simple loop like this.
Thank you!
Improving performance for code like yours is a "well explored" and still popular area. Take a look at dot product (perfect link provided by Z Boson already) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills).
In general, the key topics to explore and consider applying are:
AVX2's advantage over AVX due to FMA and the better load/store port microarchitecture on Haswell
Prefetching. "Streaming stores" / "non-temporal stores" on some platforms.
Threading parallelism to reach maximum sustained bandwidth
Unrolling (already done by you; the Intel compiler is also capable of doing that with #pragma unroll(X)). Not a big difference for "streaming" codes.
Finally, deciding which set of hardware platforms you want to optimize your code for
The last bullet is particularly important, because for "streaming" and overall memory-bound codes it's important to know more about the target memory subsystems; for example, with existing and especially future high-end HPC servers (2nd-generation Xeon Phi, codenamed Knights Landing, as an example) you may have a very different "roofline model" balance between bandwidth and compute, and even different techniques, than when optimizing for an average desktop machine.
Are you sure that 8 GFLOPS/s is about 25% of the maximum throughput of a 3 GHz Ivy Bridge processor? Let's do the calculations.
Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivy Bridge processor can only execute one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Also, since the addition is dependent on the two multiplications, 3 cycles are required to process 8 elements. Since the addition can be overlapped with the next multiplication, we can reduce this to 2 cycles per 8 elements. For one billion elements, 2*10^9/8 = 10^9/4 cycles are required. Considering a 3 GHz clock, we get 10^9/4 * 10^-9/3 = 1/12 ≈ 0.08 seconds. So the maximum theoretical throughput is 12 GFLOPS/s and the compiler-generated code is reaching 66%, which is fine.
One more thing: by unrolling the loop 8 times, it can be vectorized efficiently. I doubt that you'll gain any significant speedup if you unroll this particular loop more than that, especially more than 16 times.
I think the real bottleneck is that there are 2 load and 1 store instructions for every 2 multiplications and 1 addition. Maybe the calculation is memory-bandwidth limited. Every element requires transferring 12 bytes of data, and if 2G elements are processed every second (which is 6 GFLOPS) that is 24 GB/s of data transfer, reaching the theoretical bandwidth of Ivy Bridge. I wonder if this argument holds and there is indeed no solution to this problem.
The reason why I am answering my own question is the hope that someone can correct me before I give up on the optimization too easily. This simple loop is EXTREMELY important for many scientific solvers; it is the backbone of the finite element and finite difference methods. If one cannot even feed one processor because the computation is memory-bandwidth limited, why bother with multicore? A high-bandwidth GPU or Xeon Phi might be a better solution.

Why is this for loop not faster using OpenMP?

I have extracted this simple member function from a larger 2D program; all it does is run a for loop accessing three different arrays and doing a math operation (a 1D convolution). I have been testing with OpenMP to make this particular function faster:
void Image::convolve_lines()
{
    const int *ptr0 = tmp_bufs[0];
    const int *ptr1 = tmp_bufs[1];
    const int *ptr2 = tmp_bufs[2];
    const int width = Width;
    #pragma omp parallel for
    for ( int x = 0; x < width; ++x )
    {
        const int sum = 0
            + 1 * ptr0[x]
            + 2 * ptr1[x]
            + 1 * ptr2[x];
        output[x] = sum;
    }
}
If I use gcc 4.7 on debian/wheezy amd64, the overall program performs a lot slower on an 8-CPU machine. If I use gcc 4.9 on debian/jessie amd64 (only 4 CPUs on this machine), the overall program performs with very little difference.
Using time to compare:
single core run:
$ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total
multi core run:
$ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total
Where:
$ head -3 black.pgm
P5
65536 65536
255
So Width is set to 65536 during execution.
If that matters, I am using cmake for compilation:
add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")
And CMAKE_BUILD_TYPE is set to:
CMAKE_BUILD_TYPE:STRING=Release
which implies -O3 -DNDEBUG
My question: why is this for loop not faster using multiple cores? There is no overlap between the arrays; OpenMP should split the work equally. I do not see where the bottleneck is coming from.
EDIT: as suggested in the comments, I changed my input file to:
$ head -3 black2.pgm
P5
33554432 128
255
So Width is now set to 33554432 during execution (which should be big enough). Now the timing reveals:
single core run:
$ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total
multi core run (for some reason cpu% was always below 100%, which would indicate no threads at all):
$ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total
I have some general comments:
1. Before optimizing your code, make sure the data is 16-byte aligned. This is extremely important for whatever optimization you want to apply. If the data is separated into 3 pieces, it is better to add some dummy elements so that the starting addresses of the 3 pieces are all 16-byte aligned. By doing so, the CPU can load your data into cache lines easily.
2. Make sure the simple function is vectorized before bringing in OpenMP. In most cases, using the AVX/SSE instruction sets should give you a decent 2 to 8x single-thread improvement. And it is very simple in your case: create a constant mm256 register and set it with value 2, and load 8 integers into three mm256 registers (see the sketch after this list). With a Haswell processor, one addition and one multiplication can be done together, so theoretically the loop should speed up by a factor of 12 if the AVX pipeline can be filled!
3. Sometimes parallelization can degrade performance: a modern CPU needs several hundred to thousands of clock cycles to warm up, entering high-performance states and scaling up its frequency. If the task is not large enough, it is very likely that it finishes before the CPU warms up, so you cannot gain a speed boost by going parallel. And don't forget that OpenMP has overhead as well: thread creation, synchronization and deletion. Another case is poor memory management: if data accesses are too scattered, all CPU cores sit idle waiting for data from RAM.
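As a rough illustration of point 2 (my own sketch, not tested against the original program; it assumes AVX2 is available and the code is compiled with -mavx2, and output is passed in explicitly instead of being a member):
#include <immintrin.h>

void convolve_lines_avx2(const int *ptr0, const int *ptr1, const int *ptr2,
                         int *output, int width)
{
    const __m256i two = _mm256_set1_epi32(2);   // constant register holding the value 2
    int x = 0;
    for (; x + 8 <= width; x += 8)
    {
        __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ptr0 + x));
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ptr1 + x));
        __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ptr2 + x));
        __m256i sum = _mm256_add_epi32(a, c);
        sum = _mm256_add_epi32(sum, _mm256_mullo_epi32(two, b));   // a + 2*b + c
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(output + x), sum);
    }
    for (; x < width; ++x)                       // scalar remainder
        output[x] = ptr0[x] + 2 * ptr1[x] + ptr2[x];
}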
My Suggestion:
You might want to try Intel MKL; don't reinvent the wheel. The library is optimized to the extreme and no clock cycle is wasted. You can link with the serial library or the parallel version, and a speed boost is guaranteed if going parallel is possible at all.

Multithread superlinear performance implementation with CPU boost mode consideration

I am studying the C++11 thread class using the MinGW 4.8.1 library on a WIN7 64-bit OS.
The CPU is an Intel® Core™ i7-820QM processor, which has four physical cores with 8M cache and supports a maximum of eight threads. This CPU has a base operating frequency of 1.73 GHz if four cores are used simultaneously and can be boosted to 3.08 GHz if only one core is used.
The main target of my study is to implement a multithread test program that demonstrates super-linear performance increases as the number of threads increases.
Here, the SUPER-linear term means exactly 4 times speedup (maybe 3.8 times is acceptable) when employing four threads compared to a single thread, not 3.2 or 3.5 times.
The code and results are pasted here:
// These codes are extracted from a real application, except that the count function
// does some "meaningful" job; they have an identical speedup ratio to my real application.
#include <thread>
#include <vector>
#include <algorithm>
#include <functional>
#include <iostream>
#include <chrono>

inline void count(int workNum) // some work to do
{
    int s = 0;
    for (int i = 0; i < workNum; ++i)
        ++s;
}

inline void devide(int numThread) // create multiple threads [1,7] to do the same total amount of work
{
    int max = 100000000;
    typedef std::vector<std::thread> threadList;
    threadList list;
    for (int i = 1; i <= numThread; ++i) {
        list.push_back(std::thread(count, max/numThread));
    }
    std::for_each(list.begin(), list.end(), std::mem_fun_ref(&std::thread::join));
}

inline void thread_efficiency_load() // to start the test
{
    for (int i = 7; i > 0; --i)
    {
        std::cout << "*****************************************" << std::endl;
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        devide(i); // this is the workload to be measured, where i is the number of threads
        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "thread num=#" << i << " time=" << elapsed_seconds.count() << std::endl;
    }
}
The output is:
The time unit is seconds,
*****************************************
thread num=#7 time=0.101006
*****************************************
thread num=#6 time=0.0950055
*****************************************
thread num=#5 time=0.0910052
*****************************************
thread num=#4 time=0.0910052
*****************************************
thread num=#3 time=0.102006
*****************************************
thread num=#2 time=0.127007
*****************************************
thread num=#1 time=0.229013
It is very clear that I do not obtain a super-linear performance increase as the number of threads increases. I would like to know why I do not get it. Why? Why? Why?
Some basic thoughts of mine:
Due to the fact that there are only 4 physical cores, the maximum speedup should show up when there are four active threads (more threads do not really help a lot). There is only a 2.4 times speedup using four cores compared to a single one, where a 4 times speedup is expected. I hope the above implementation does not block the 4 times speedup through memory issues (cache paging), because all variables are local variables.
Considering the CPU boost mode, the CPU increases its operating frequency to 3.07 GHz when only one core is busy, which is a ratio of about 1.7 (the base operating frequency is 1.73 GHz); 2.4 * 1.7 is about 4, as expected. Does that really mean that a 2.4 times speedup is the maximum that can be achieved compared to the boosted single-thread mode?
I would be very appreciative if you could answer:
1) In the above implementation, are there some variables located on the same cache line, resulting in a lot of cache-line ping-ponging between threads that reduces the performance?
2) How can I modify the above codes to achieve super-linear performance (4 times speedup compared to a single thread) as the number of threads increases?
Thank you very much for your help.
Just as a warning up front: arguing about actual performance numbers of a multithreaded program on a modern x86/x64 system without an RTOS always involves a lot of speculation; there are just too many layers between your C/C++ code and the actual operations performed on the processor.
As a rough upper-bound estimate: yes, for an ALU-bound (not memory-bound) workload you won't get much more than a 1.73*4 / 3.08 = 2.24 times speedup factor for 4 threads on 4 cores vs. 1 thread on one core, even in the ideal case. Aside from that, I'd argue that your test's "workload" is too small to get meaningful results. As mentioned in the comments, a compiler would be allowed to completely replace your workload function with a NOP, leaving you only with the overhead of creating and joining the threads and your measurement (although I don't think that happened here).
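As a hedged illustration (my own variation, not the original code): one way to make the per-thread workload resistant to being optimized away is to give it a loop-carried dependency and an observable result, for example:
static volatile long long sink;      // observable side effect, keeps the loop alive

inline void count(int workNum)       // possible replacement for the original count()
{
    long long s = 1;
    for (int i = 0; i < workNum; ++i)
        s = s * 3 + i;               // loop-carried dependency: hard to collapse or remove
    sink = s;                        // volatile store prevents dead-code elimination
}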

Measuring time to perform simple instruction

I am trying to measure the number of cycles it takes my CPU to perform a specific instruction (one that should take one CPU cycle), and the output must be in cycle lengths (the time it takes the CPU to complete one cycle).
So first of all, my CPU runs at 2.1 GHz, which means that one cycle-length unit on my computer is 1/2100 of a microsecond, right?
Also, I am using gettimeofday to measure time in microseconds and I calculate the average over 1,000,000 iterations.
So if I'm not mistaken, my desired output must be result*2100 (in order to get it in cycle lengths). Am I right?
Thanks!
P.S. Don't know if it matters, but I'm writing in C++.
I believe you have been misinformed about a few things.
In modern terms, clock speed is an indication of speed, not an actual measure of speed, so there is no reasonable way to estimate how long a single instruction may take.
Your question is based on the assumption that all instructions are equal - they most certainly aren't. Some CPU instructions get decoded into sequences of micro-operations on some architectures, and on others the timings may change.
Additionally, you can't safely assume that on modern architectures a repeated instruction will perform the same way; this depends on data and instruction caches, pipelines and branch prediction.
The resolution of gettimeofday isn't anywhere near accurate enough to measure single instructions; even CPU clock cycle counters (TSC on x86) are not sufficient.
Furthermore, your operating system is a major source of error in estimating such timings: context switches, power management, machine load and interrupts all have a huge impact. Even on a true hard real-time operating system (QNX or VxWorks), such measurements are still difficult and require time and tools, as well as the expertise to interpret the results. On a general-purpose operating system (Windows or basic Linux), you've got little to no hope of getting accurate measurements.
The computational overhead and error of reading and storing the CPU cycle counts will also tend to dwarf the time required for one instruction. At minimum, I suggest you consider grouping several hundred or thousand instructions together.
On deterministic architectures (1 cycle = 1 instruction) with no caches, like a PIC chip, you can do exactly what you suggest by using the clock multiplier, but even then to validate your measurements you would probably need a logic analyser (ie. you need to do this in hardware).
In short, this is an extremely hard problem.
The CPU contains a cycle counter which you can read with a bit of inline assembly:
#include <cstdint>

static inline uint64_t get_cycles()
{
    // rdtsc places the counter in edx:eax; combine the two 32-bit halves
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
If you measure the cycle count for 1, 2 and 3 million iterations of your operation, you ought to be able to interpolate the cost of one, but be sure to also measure "empty" loops to remove the cost of the looping:
{
    uint64_t m = get_cycles();
    for (unsigned int i = 0; i != 1000000; ++i)
    {
        // (compiler barrier)
    }
    uint64_t n = get_cycles();
    // cost of loop: n - m
}
{
    uint64_t m = get_cycles();
    for (unsigned int i = 0; i != 1000000; ++i)
    {
        my_operation();
    }
    uint64_t n = get_cycles();
    // cost of 1000000 operations: n - m - cost of loop
}
// repeat for 2000000, 3000000.
I am trying to measure the time it takes my computer to perform a simple instruction
If that's the case, the key isn't even the most accurate time function you can find. I bet none have the resolution necessary to provide a meaningful result.
The key is to increase your sample count.
So instead of doing something like:
start = tik();
instruction();
end = tok();
time = end - start;
do
start = tik();
for ( 1..10000 )
instruction();
end = tok();
time = (end - start) / 10000;
This will provide more accurate results, and the error caused by the measuring mechanism will be negligible.
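A concrete, hedged version of that idea using <chrono>; the volatile increment stands in for the "instruction" being measured, and the per-operation figure is only an amortized estimate:
#include <chrono>
#include <cstdio>

int main()
{
    constexpr int reps = 10000;
    volatile double x = 1.0;                         // volatile keeps the work observable

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        x = x + 1.0;                                 // the operation under test
    auto end = std::chrono::steady_clock::now();

    double per_op_ns =
        std::chrono::duration<double, std::nano>(end - start).count() / reps;
    std::printf("~%.2f ns per operation (amortized over %d reps)\n", per_op_ns, reps);
    return 0;
}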