Repeated timing of a void function in C++ - c++

I am trying to time a void function
for (size_t round = 0; round < 5; round++) {
    cpu_time_start = get_cpu_time();
    wall_time_start = get_wall_time();
    scan.assign_clusters(epsilon, mu);
    cpu_time_end = get_cpu_time();
    wall_time_end = get_wall_time();
    ...
}
The first timing yields 300 seconds, while the next four timings yield 0.000002 seconds each. This suggests that the call to assign_clusters is being optimized out. How can I force my program to execute this time-consuming function call every time, while still using optimization for the rest of the code?
What I usually do is to save the result of the function in question and then print it, but since this is a void function, do I have the same option?
I use the following optimization flags: -std=c++0x -march=native -O2
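For the narrower question of keeping a void call from being thrown away when there is no result to save and print (separate from why the first run is slower, which the answer below addresses), a minimal sketch, assuming GCC or Clang, is an empty inline-asm statement with a memory clobber after the call:
    for (size_t round = 0; round < 5; round++) {
        cpu_time_start  = get_cpu_time();
        wall_time_start = get_wall_time();

        scan.assign_clusters(epsilon, mu);

        // Compiler-level barrier (GCC/Clang extension): the optimizer must assume
        // all of memory may be read or written here, so results that
        // assign_clusters wrote to memory cannot be treated as dead and removed.
        asm volatile("" ::: "memory");

        cpu_time_end  = get_cpu_time();
        wall_time_end = get_wall_time();
    }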

The right fix depends on what is actually taking the time.
The slowdown on the first run could be caused by:
Loading services. Your clustering may be database-based and require the database services to start (the first time).
Disk caching. The OS remembers data it has read and can serve it again as if it were already in memory.
Memory caching. The CPU has several speeds of memory available to it; using the same memory twice is faster the second time.
State caching. The data may be left in a more amenable state for subsequent runs. Think of sorting an array twice: the second time the array is already sorted, which produces a speed-up.
Starting a service can take a number of seconds.
The disk cache gives roughly a 20x speed-up.
The memory cache gives roughly a 6x speed-up.
State caching can give an unbounded speed-up.
I think your code needs to reset the scan object between rounds to ensure it does the full work again.
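To make that concrete, here is a minimal sketch of the loop with the object rebuilt each round (the Scan type and its constructor argument are hypothetical, since the question doesn't show how scan is created):
    for (size_t round = 0; round < 5; round++) {
        Scan scan(input_data);          // hypothetical: rebuild the object so no
                                        // computed state survives from the last round
        cpu_time_start  = get_cpu_time();
        wall_time_start = get_wall_time();
        scan.assign_clusters(epsilon, mu);
        cpu_time_end  = get_cpu_time();
        wall_time_end = get_wall_time();
    }
Note that this only removes the state-caching component; the disk and CPU caches will still be warm on the later rounds.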

Related

Time measurement repeatedly gives wrong results in specific places

I need to write a program that measures the performance of certain data structures, but I can't get reliable results. For example, when I measured performance 8 times for the same structure size, every other result was different (for example: 15 ms, 9 ms, 15 ms, 9 ms, 15 ms, ...), although the measurements weren't dependent on each other (for every measurement I generated new data). I tried to isolate the problem, and here is what I have:
while (true) {
    auto start = high_resolution_clock::now();
    for (int j = 0; j < 500; j++)
        ;
    auto end = high_resolution_clock::now();
    cout << duration<double, milli>(end - start).count() << " ";
    _getch();
}
What happens when I run this code is: in the first run of the loop the time is significantly higher than in the next runs. It's always higher in the first run, but from time to time in other measurements as well.
Example output: 0.006842 0.002566 0.002566 0.002138 0.002993 0.002138 0.002139 ...
And that's the behaviour every time I start the program.
Here are some things I tried:
It matters whether I compile the Release or Debug version; the measurements are still faulty, but in different places.
I turned off code optimization.
I tried using different clocks.
And what I think is quite important: while my Add function wasn't empty (i.e. the loop body did real work), the problem depended on data size. For example, the program worked well for most data sizes, but for, say, an element count of 7500 the measurements were drastically different.
I just deleted the part of the code after the segment I posted here, and guess what: the first measurement is no longer faulty. I have no idea what's happening here.
I would be glad if someone could explain what the possible cause of all of this is.
In that code, it's likely that you're just seeing the effect of the instruction cache or the micro-op cache. The first time the test is run, more instructions have to be fetched and decoded; on subsequent runs the results of that are available in the caches. As for the alternating times you were seeing on some other code, that could be fluctuations in the branch prediction buffer, or something else entirely.
There are too many complex processes involved in execution on modern CPUs to expect a normal sequence of instructions to execute in a fixed amount of time. While it's possible to measure or at least account for these externalities when looking at individual instructions, for nontrivial code you basically have to accept empirical measurements, including their variance.
Depending on what kind of operating system you're on, for durations this short, the scheduler can cause huge differences. If your thread is preempted, then you have the idle duration in your time. There are also many things that happen that you don't see: caches, pages, allocation. Modern systems are complex.
You're better off making the whole benchmark bigger, doing multiple runs of each thing you're testing, using something like ministat from FreeBSD to compare the runs of the same test, and then comparing the ministat output for the different things you're comparing.
To do this effectively, your benchmark should try to use the same amount of memory as the real workload, so that your memory access pattern is part of the benchmark.
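As a concrete illustration of "make the benchmark bigger and do multiple runs", here is a minimal C++ sketch; run_workload is a stand-in for whatever is being measured, and the warm-up and run counts are arbitrary:
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <vector>

    // Stand-in workload; replace with the code under test.
    static void run_workload() {
        volatile long sink = 0;
        for (long i = 0; i < 1000000; ++i)
            sink += i;
    }

    int main() {
        using Clock = std::chrono::high_resolution_clock;
        const int warmup = 3, runs = 30;

        for (int i = 0; i < warmup; ++i)
            run_workload();                   // discard the cold runs

        std::vector<double> ms;
        for (int i = 0; i < runs; ++i) {
            auto t0 = Clock::now();
            run_workload();
            auto t1 = Clock::now();
            ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
        }

        std::sort(ms.begin(), ms.end());
        std::cout << "min " << ms.front()
                  << " ms, median " << ms[ms.size() / 2]
                  << " ms, max " << ms.back() << " ms\n";
        return 0;
    }
Reporting the minimum and median rather than a single number makes scheduler preemptions and cache warm-up show up as outliers instead of polluting the result.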

First method call takes 10 times longer than consecutive calls with the same data

I am performing some execution time benchmarks for my implementation of quicksort. Out of 100 successive measurements on exactly the same input data it seems like the first call to quicksort takes roughly 10 times longer than all consecutive calls. Is this a consequence of the operating system getting ready to execute the program, or is there some other explanation? Moreover, is it reasonable to discard the first measurement when computing an average runtime?
The bar chart below illustrates execution time (milliseconds) versus method call number. Each time the method is called, it processes exactly the same data.
To produce this particular graph the main method makes a call to quicksort_timer::time_fpi_quicksort(5, 100) whose implementation can be seen below.
static void time_fpi_quicksort(int size, int runs)
{
    std::vector<int> vector(size);
    for (int i = 0; i < runs; i++)
    {
        vector = utilities::getRandomIntVectorWithConstantSeed(size);
        Timer timer;
        quicksort(vector, ver::FixedPivotInsertion);
    }
}
The getRandomIntVectorWithConstantSeed is implemented as follows
std::vector<int> getRandomIntVectorWithConstantSeed(int size)
{
    std::vector<int> vector(size);
    srand(6475307);
    for (int i = 0; i < size; i++)
        vector[i] = rand();
    return vector;
}
CPU and Compilation
CPU: Broadwell 2.7 GHz Intel Core i5 (5257U)
Compiler Version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Compiler Options: -std=c++17 -O2 -march=native
Yes, it could be a page fault on the page holding the code for the sort function (and the timing code itself). The 10x could also include ramp-up to max turbo clock speed.
Caching is not plausible, though: you're writing the (tiny) array outside the timed region, unless the compiler somehow reordered the init with the constructor of your Timer. Memory allocation being much slower the first time would easily explain it, maybe having to make a system call to get a new page the first time, but later calls to new (to construct std::vector) just grabbing already-hot-in-cache memory from the free list.
Training the branch predictors could also be a big factor, but you'd expect it to take more than one run for the TAGE branch predictors in a modern Intel CPU, or the perceptron predictors in a modern AMD, to "learn" the full pattern of all the branching. But maybe they get close after the first run.
Note that you produce the same random array every time, by using srand() on every call. To test if branch prediction is the explanation, remove the srand so you get different arrays every time, and see if the time stays much higher.
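One way to run that experiment, sketched against the question's own helper (the seed parameter and the renamed function are additions for illustration, not part of the original code):
    // Seed differently on every call so each run sorts a different array.
    std::vector<int> getRandomIntVector(int size, unsigned seed)
    {
        std::vector<int> vector(size);
        srand(seed);
        for (int i = 0; i < size; i++)
            vector[i] = rand();
        return vector;
    }

    // In time_fpi_quicksort, pass e.g. the run index (or std::random_device{}())
    // as the seed:
    //     vector = utilities::getRandomIntVector(size, i);
    // If the first call is still ~10x slower, branch-predictor training is not
    // the main explanation.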
What CPU, compiler version / options, etc. are you using?
It is probably because of caching: the first time, the memory has to be fetched from DRAM and brought into the CPU's data cache, which incurs (much) more latency than loads that hit in the cache.
On later runs the instructions are already in the pipeline and caches, and the branches follow the same pattern, because it is the same code operating on the same data, so nothing needs to be invalidated.
It would be interesting to implement four methods with more or less the same functionality and alternate between them, to see what happens.

How to demonstrate the impact of instruction cache limitations

My original idea was to give an elegant code example that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, which creates a large number of identical functions using template metaprogramming.
volatile int checksum;

void (*funcs[MAX_FUNCS])(void);

template <unsigned t>
__attribute__ ((noinline)) static void work(void) { ++checksum; }

template <unsigned t>
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }

template <> void create<0>(void) { }

int main()
{
    create<MAX_FUNCS>();

    for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
    {
        checksum = 0;
        for (unsigned i = 0; i < WORKLOAD; ++i)
        {
            funcs[i % range]();
        }
    }
    return 0;
}
The outer loop varies the amount of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...
The only jumps of the piecewise constant red curve occur exactly where the total memory consumption of all functions within the range exceeds the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case), however, run time still increases with the number of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure whether there should be any branching penalty at all.
The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmically thereafter. Yet for larger ranges, the curve seems to approach a constant asymptote again. How exactly can the qualitative differences between the two curves be explained?
I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.
Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!
First, let me say that I really like how you've approached this problem; this is a really neat solution for intentional code bloating. However, there might still be several possible issues with your test:
You also measure the warmup time. You didn't show where you've placed your time checks, but if it's just around the internal loop, then until you reach range/2 you'd still enjoy the warmup of the previous outer iteration the first time through. Instead, measure only warm performance: run each internal iteration several times (add another loop in the middle) and take the timestamp only after 1-2 rounds; see the sketch after this list.
You claim to have measured several cache levels, but your L1 cache is only 32k, which is where your graph ends. Even assuming this counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you'll reach at most 256KB, which is only then scratching the size of your L2.
You didn't specify your CPU model (the i7 has had at least 4 generations on the market now: Haswell, Ivy Bridge, Sandy Bridge and Nehalem). The differences are quite large, for example an additional uop cache since Sandy Bridge, with complicated storage rules and conditions. Your baseline also complicates things: if I recall correctly, the P4 had a trace cache, which might cause all sorts of performance impacts as well. You should check whether there's an option to disable them if possible.
Don't forget the TLB: even though it probably doesn't play a role here in such tightly organized code, the number of unique 4k pages should not exceed the ITLB (128 entries), and even before that you may start having collisions if your OS did not spread the physical code pages well enough to avoid ITLB collisions.
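One simple way to implement the first point, sketched as a drop-in replacement for the measurement loop in the question's main() (it reuses the question's own funcs, MAX_FUNCS, WORKLOAD and create<>(), and adds explicit timing with <chrono> and <cstdio>; where exactly the original timestamps were taken is unknown):
    #include <chrono>
    #include <cstdio>

    int main()
    {
        create<MAX_FUNCS>();

        for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
        {
            checksum = 0;

            // Warm-up: touch every function in this range once, untimed, so the
            // timed pass measures warm instruction-cache behaviour only.
            for (unsigned i = 0; i < range; ++i)
                funcs[i]();

            auto t0 = std::chrono::high_resolution_clock::now();
            for (unsigned i = 0; i < WORKLOAD; ++i)
                funcs[i % range]();
            auto t1 = std::chrono::high_resolution_clock::now();

            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
            std::printf("range %6u: %.2f ns/call\n", range, ns / WORKLOAD);
        }
        return 0;
    }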

Simple operation to waste time?

I'm looking for a simple operation / routine which can "waste" time if repeated continuously.
I'm researching how gprof profiles applications, so this "time waster" needs to waste time in user space and should not require external libraries. I.e., calling sleep(20) will "waste" 20 seconds of time, but gprof will not record this time because it occurred within another library.
Any recommendations for simple tasks which can be repeated to waste time?
Another variant on Tomalak's solution is to set up an alarm, so that in your busy-wait loop you don't need to keep issuing a system call; instead you just check whether the signal has been sent.
The simplest way to "waste" time without yielding CPU is a tight loop.
If you don't need to restrict the duration of your waste (say, you control it by simply terminating the process when done), then go C style*:
for (;;) {}
(Be aware, though, that the standard allows the implementation to assume that programs will eventually terminate, so technically speaking this loop — at least in C++0x — has Undefined Behaviour and could be optimised out!**)
Otherwise, you could time it manually:
time_t s = time(0);
while (time(0) - s < 20) {}
Or, instead of repeatedly issuing the time syscall (which will lead to some time spent in the kernel), if on a GNU-compatible system you could make use of signal.h "alarms" to end the loop:
alarm(20);
while (true) {}
There's even a very similar example on the documentation page for "Handler Returns".
(Of course, these approaches will all send you to 100% CPU for the intervening time and make fluffy unicorns fall out of your ears.)
* {} rather than trailing ; used deliberately, for clarity. Ultimately, there's no excuse for writing a semicolon in a context like this; it's a terrible habit to get into, and becomes a maintenance pitfall when you use it in "real" code.
** See [n3290: 1.10/2] and [n3290: 1.10/24].
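For completeness, a sketch of the alarm-plus-flag variant mentioned in the comment above, assuming a POSIX system (the handler name and structure are illustrative, not from the original answers):
    #include <csignal>
    #include <unistd.h>

    static volatile sig_atomic_t done = 0;

    static void on_alarm(int) { done = 1; }

    int main() {
        std::signal(SIGALRM, on_alarm);   // install the handler
        alarm(20);                        // deliver SIGALRM after ~20 seconds

        // Busy-wait entirely in user space; each pass only checks the flag,
        // no system calls inside the loop.
        while (!done) { }

        return 0;
    }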
A simple loop would do.
If you're researching how gprof works, I assume you've read the paper, slowly and carefully.
I also assume you're familiar with these issues.
Here's a busy loop which runs at one cycle per iteration on modern hardware, at least as compiled by clang or gcc or probably any reasonable compiler with at least some optimization flag:
#include <cstdint>

void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;
    } while (--iters > 0);
    (void)sink;
}
The idea is just to store to the volatile sink every iteration. This prevents the loop from being optimized away and makes each iteration have a predictable amount of work (at least one store). Modern hardware can do one store per cycle, and the loop overhead generally completes in parallel in that same cycle, so the loop usually achieves one cycle per iteration. You can therefore ballpark the wall-clock time in nanoseconds that a given number of iters will take by dividing by your CPU speed in GHz. For example, a 3 GHz CPU will take about 2 seconds (2 billion nanoseconds) to run busy_loop when iters == 6,000,000,000.
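A rough calibration harness for that estimate might look like this (not part of the original answer; the iteration count is arbitrary and the result depends on your clock speed):
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    void busy_loop(uint64_t iters);   // as defined above

    int main() {
        const uint64_t iters = 1000000000;   // 1e9 iterations, ~0.33 s at 3 GHz
        auto t0 = std::chrono::steady_clock::now();
        busy_loop(iters);
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        // ns per iteration; multiply by your clock speed in GHz to estimate
        // cycles per iteration (ideally close to 1).
        std::printf("%.3f ns per iteration\n", ns / iters);
        return 0;
    }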

What is the cost of a function call?

Compared to:
Simple memory access
Disk access
Memory access on another computer (on the same network)
Disk access on another computer (on the same network)
in C++ on Windows.
relative timings (shouldn't be off by more than a factor of 100 ;-)
memory access in cache = 1
function call/return in cache = 2
memory access out of cache = 10 .. 300
disk access = 1000 .. 1e8 (amortized; depends upon the number of bytes transferred)
    depends mostly upon seek times
    the transfer itself can be pretty fast
    involves at least a few thousand ops, since the user/system threshold must be crossed at least twice; an I/O request must be scheduled, the result must be written back; possibly buffers are allocated...
network calls = 1000 .. 1e9 (amortized; depends upon the number of bytes transferred)
    same argument as with disk I/O
    the raw transfer speed can be quite high, but some process on the other computer must do the actual work
A function call essentially pushes the current frame pointer onto the stack and sets up a new frame on top of it. The function parameters are moved into registers (or onto the stack) for use, and the stack pointer is advanced to the new top of the stack for the execution of the function.
Comparing the times:
Function call ~ simple memory access
Function call < Disk Access
Function call < memory access on another computer
Function call < disk access on another computer
Compared to a simple memory access - slightly more, negligible really.
Compared to every thing else listed - orders of magnitude less.
This should hold true for just about any language on any OS.
In general, a function call is going to be slightly slower than memory access since it in fact has to do multiple memory accesses to perform the call. For example, multiple pushes and pops of the stack are required for most function calls using __stdcall on x86. But if your memory access is to a page that isn't even in the L2 cache, the function call can be much faster if the destination and the stack are all in the CPU's memory caches.
For everything else, a function call is many (many) magnitudes faster.
Hard to answer because there are a lot of factors involved.
First of all, "Simple Memory Access" isn't simple. Since at modern clock speeds, a CPU can add two numbers faster than it get a number from one side of the chip to the other (The speed of light -- It's not just a good idea, it's the LAW)
So, is the function being called inside the CPU memory cache? Is the memory access you're comparing it too?
Then we have the function call will clear the CPU instruction pipeline, which will affect speed in a non-deterministic way.
Assuming you mean the overhead of the call itself, rather than what the callee might do, it's definitely far, far quicker than all but the "simple" memory access.
It's probably slower than the memory access, but note that since the compiler can do inlining, function call overhead is sometimes zero. Even if not, it's at least possible on some architectures that some calls to code already in the instruction cache could be quicker than accessing main (uncached) memory. It depends how many registers need to be spilled to stack before making the call, and that sort of thing. Consult your compiler and calling convention documentation, although you're unlikely to be able to figure it out faster than disassembling the code emitted.
Also note that "simple" memory access sometimes isn't - if the OS has to bring the page in from disk then you've got a long wait on your hands. The same would be true if you jump into code currently paged out on disk.
If the underlying question is "when should I optimise my code to minimise the total number of function calls made?", then the answer is "very close to never".
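For anyone who wants to put a number on the raw call overhead themselves, here is a minimal sketch (the noinline attribute is a GCC/Clang extension, and the resulting figures depend heavily on CPU, calling convention, and optimizer, so treat them as ballpark only):
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    __attribute__((noinline)) static int add(int a, int b) { return a + b; }

    int main() {
        const uint64_t N = 200000000;
        volatile int sink = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i < N; ++i)
            sink = add(sink, 1);          // real call: noinline blocks inlining
        auto t1 = std::chrono::steady_clock::now();

        auto t2 = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i < N; ++i)
            sink = sink + 1;              // same work, no call
        auto t3 = std::chrono::steady_clock::now();

        double call_ns   = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
        double inline_ns = std::chrono::duration<double, std::nano>(t3 - t2).count() / N;
        std::printf("with call: %.2f ns/iter, without call: %.2f ns/iter\n",
                    call_ns, inline_ns);
        return 0;
    }
The volatile sink keeps either loop from being optimized away, so the difference between the two figures approximates the per-call overhead.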
This link comes up a lot in Google. For future reference, I ran a short program in C# on the cost of a function call, and the answer is: "about six times the cost of inline". Details are below; see // Output at the bottom. UPDATE: To better compare apples with apples, I changed Class1.Method1 to return void, as so: public void Method1() { // return 0; }
Still, inline is faster by 2x: inline (avg): 610 ms; function call (avg): 1380 ms. So the answer, updated, is "about two times".
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace FunctionCallCost
{
    class Program
    {
        static void Main(string[] args)
        {
            Debug.WriteLine("stop1");
            int iMax = 100000000; // 100M

            DateTime funcCall1 = DateTime.Now;
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iMax; i++)
            {
                // gives about 5.94 seconds to do a billion loops,
                // or 0.594 for 100M, about 6 times faster than
                // the method call.
            }
            sw.Stop();
            long iE = sw.ElapsedMilliseconds;
            Debug.WriteLine("elapsed time of main function (ms) is: " + iE.ToString());

            Debug.WriteLine("stop2");
            Class1 myClass1 = new Class1();
            Stopwatch sw2 = Stopwatch.StartNew();
            int dummyI;
            for (int ie = 0; ie < iMax; ie++)
            {
                dummyI = myClass1.Method1();
            }
            sw2.Stop();
            long iE2 = sw2.ElapsedMilliseconds;
            Debug.WriteLine("elapsed time of helper class function (ms) is: " + iE2.ToString());
            Debug.WriteLine("Hi3");
        }
    }
}

// Class1 (separate file)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace FunctionCallCost
{
    class Class1
    {
        public Class1()
        {
        }

        public int Method1()
        {
            return 0;
        }
    }
}
// Output:
stop1
elapsed time of main function (ms) is: 595
stop2
elapsed time of helper class function (ms) is: 3780
stop1
elapsed time of main function (ms) is: 592
stop2
elapsed time of helper class function (ms) is: 4042
stop1
elapsed time of main function (ms) is: 626
stop2
elapsed time of helper class function (ms) is: 3755
The cost of actually calling the function but not executing it in full, or the cost of actually executing the function? Simply setting up a function call is not a costly operation (update the PC?), but obviously the cost of a function executing in full depends on what the function is doing.
Let's not forget that C++ has virtual calls (significantly more expensive, about 10x), and on Windows you can expect VS to inline calls (0 cost by definition, as there is no call left in the binary).
It depends on what that function does; it would fall 2nd on your list if it were doing logic with objects in memory, and further down the list if it included disk/network access.
A function call usually involves merely a couple of memory copies (often into registers, so they should not take up much time) and then a jump operation. This will be slower than a memory access, but faster than any of the other operations mentioned above, because they require communication with other hardware. The same should usually hold true on any OS/language combination.
If the function is inlined at compile time, the cost of the function becomes equivalent to 0.
0, of course, being what you would have gotten by not having a function call, i.e. having inlined it yourself.
This of course sounds excessively obvious when I write it like that.
The cost of a function call depends on the architecture. On x86 it is considerably slower (a few clocks plus a clock or so per function argument), while on 64-bit it is much less because most function arguments are passed in registers instead of on the stack.
A function call is actually a copy of the parameters onto the stack (multiple memory accesses), a register save, the actual code execution, and finally the result copy and register restore (what gets saved and restored depends on the system).
So.. speaking relatively:
Function call > Simple memory access.
Function call << Disk access - compared with memory it can be hundreds of times more expensive.
Function call << Memory access on another computer - the network bandwidth and protocol are the grand time killers here.
Function call <<< Disk access on another computer - all of the above and more :)
Only a memory access is faster than a function call.
But the call can be avoided if the compiler applies inline optimization (for the GCC compiler(s), among others, this is activated when using optimization level 3 (-O3)).