10 millisecond C++ execution time - c++

I am trying to find out the exact execution time of a "for loop" with 2e6 iterations.
The following code runs within 10 ms after being compiled with g++.
People told me that this is because the C++ compiler optimizes the code automatically, so I get a meaningless execution time. In other words, since there is no output call such as printf or cout<< for the variables a, b and c, the optimized code does nothing in that "for loop", and that is why the program runs in only 10 ms. Right? Why do they say the timing result is meaningless for the "for loop"?
Please advise.
// getMilliCount() and getMilliSpan() are the asker's own millisecond timing helpers
int main() {
    int max = 2e6;
    int a, b, c;
    // CODE YOU WANT TO TIME
    int start = getMilliCount();
    for (int i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }
    int milliSecondsElapsed = getMilliSpan(start);
    printf("\n\nElapsed time = %u milliseconds %d\n", milliSecondsElapsed, max);
    return 0;
}

The run-time is absolutely not meaningless. It proves at least one important point: the optimizer is smarter than it is given credit for, and it is able to deduce that the loop has no side effects, so it cuts the loop out.
So even if the profile result only proves this one thing, it does have meaning.
To address what you want:
I am trying to find out the exact execution time of a "for loop" with 2e6 iterations.
The execution time of a for loop with 2e6 iterations can be 0 if there are no observable effects, or very large if there are. That's why you usually profile actual code using dedicated tools.

The compiler can change the program in any way that does not change anything observable, i.e. all outputs etc. must be exactly the same as the outputs of the un-optimized code. In your example, the compiler may notice that the values of a, b and c after the loop are never used and the loop does nothing else, so it might as well remove the loop from your program.
It could also observe that the final values of the variables depend directly on max and simply skip all but the last iteration.
In both cases, the result would not depend on max. It still is not meaningless, it just means that you underestimate your compiler.
Edit:
I tested this scenario with g++ -O2, the loop gets completely removed and does not run at all.
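If you actually want the loop to be timed, you have to make its results observable so the optimizer cannot delete it. Here is a minimal sketch of that idea (volatile is one common way to force the stores; std::chrono stands in for the asker's getMilliCount()/getMilliSpan() helpers, which were not shown):

#include <chrono>
#include <cstdio>

int main() {
    const int max = 2000000;
    // volatile forces every store to actually happen,
    // so the loop can no longer be removed as dead code
    volatile int a = 0, b = 0, c = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < max; i++) {
        a = 1234 + 5678 + i;
        b = 1234 * 5678 + i;
        c = 1234 / 2 + i;
    }
    auto stop = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    // printing the results afterwards also keeps them observable
    printf("Elapsed time = %lld microseconds (a=%d b=%d c=%d)\n",
           (long long)us, a, b, c);
    return 0;
}

Note that volatile also inhibits legitimate optimizations on a, b and c, so this measures a deliberately de-optimized loop; that trade-off is unavoidable when you insist on timing work the compiler considers dead.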

Why is this variable returning 32766?

I wrote a very basic evolution algorithm. The way it's supposed to work is that the user types in the desired value, and the amount of generations to try to reach it. Then, the program will run through, taking the nearest value in an array to the goal and mutating it four times (while also leaving the original, in case it's right) to try and get closer to the goal. In theory, it should take roughly |n|/2 generations to reach the value, as mutations happen in either one or two points.
Here's the code to demonstrate what I mean:
#include <iostream>
using namespace std;

int gen [5] = {0, 0, 0, 0, 0}; int goal; int gens; int best; int i = 0; int fit;

int dif(int in) {
    return abs(gen[in] - goal);
}

void nextgen() {
    int fit [5] = {dif(1), dif(2), dif(3), dif(4), dif(5)};
    best = *max_element(fit, fit + 6);
    int gen [5] = {best - 2, best - 1, best, best + 1, best + 2};
}

int main() {
    cout << "Goal: "; cin >> goal; cout << "Gens: "; cin >> gens;
    while (i < gens) {
        nextgen(); cout << "Generation " << i + 1 << ": " << best << "\n";
        i = i + 1;
    }
}
It's pretty simple code. However, it seems that the int best bit of the output is returning 32766 every time, no matter what I do. Do you know what I've done wrong?
I've tried outputting the entire generation (which is even worse: a jumbled mess of non-user-friendly data that appears meaningless), I've reworked the code, I've added variables and functions to try and pin down exactly where the error is, and I watched the entire CodeAesthetic YouTube channel to make sure this looked good for you guys.
Looks like you're driving C++ without a license or safety belt. Joking aside, please keep trying and learning. But with C/C++ you should always enable compiler warnings. The godbolt link in the comment from @user4581301 is really good; the compiler flags -Wall -Wextra -pedantic -O2 -fsanitize=address,undefined are all best practice. (I would add -Werror.)
Why you got 32766 could be analyzed with a debugger, but the exact value isn't meaningful. A number close to 32768 (= 2^15) should set off all the warning bells (it could be an integer overflow). Your code is accessing uninitialized memory (among other issues), leading to what is called undefined behaviour. This means it may produce different output depending on your machine, compiler, optimization flags, OS, standard libraries, etc. - even adding a debug print could change what it does.
For optimization algorithms (like GAs) it's also super easy to fool yourself into thinking that your implementation is correct, because the optimization will find a way to avoid (or exploit) any bugs. I've had one in my NN implementation that was accessing some data from the previous example by accident, and it took several days until I even noticed there was a problem.
If you want to focus on the algorithms, I suggest starting with a different language (anything except C/C++/Assembly). My advice would be either Python (though it can be 50x slower, it's much easier to learn and write) or Rust (just as fast as C++ and just as complicated, but with no undefined behaviour). With Rust, every mistake in your code above would have given you either a warning by default, a compiler error, or a runtime error instead of wrong output. Though C++ with the flags mentioned above does the same for your specific code.
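For reference, here is a sketch of what nextgen() could look like once the out-of-bounds indices, the max/min mix-up, and the shadowed gen array are fixed. The intended logic (pick the stored value closest to the goal, then mutate around it) is my reading of the question, so treat this as an assumption rather than the one true fix:

#include <algorithm>  // std::min_element, std::copy
#include <cstdlib>    // std::abs

int gen[5] = {0, 0, 0, 0, 0};
int goal;
int best;

int dif(int in) { return std::abs(gen[in] - goal); }

void nextgen() {
    // valid indices are 0..4; the original used 1..5 and read past the array
    int fit[5] = {dif(0), dif(1), dif(2), dif(3), dif(4)};
    // "best" means closest to the goal, i.e. the SMALLEST distance,
    // and we want the element itself, not its distance
    int bestIdx = (int)(std::min_element(fit, fit + 5) - fit);
    best = gen[bestIdx];
    // write back to the global array; a local `int gen[5]` would shadow it
    // and be thrown away when the function returns
    int next[5] = {best - 2, best - 1, best, best + 1, best + 2};
    std::copy(next, next + 5, gen);
}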

Timing of using variables passed by reference and by value in C++

I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <cstdio>    // printf
#include <iostream>
#include <sys/time.h>
using namespace std;

int fooVal(int a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int fooRef(int & a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int main() {
    int a = 0;
    struct timeval stop, start;

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooVal(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooRef(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    return 0;
}
It was expected that the fooRef execution would take much more time than the fooVal case because of "looking up" the referenced value in memory while performing operations inside fooRef. But the result proved to be unexpected for me:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time the produced values are close to each other (with fooRef being just a little bit slower), but sometimes outbursts like the one in the first run's output can happen (for both the fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off (-O0).
If the gettimeofday() function relies on the operating system clock, that clock is not really designed for dealing with microseconds in an accurate manner. The clock is typically updated periodically and only frequently enough to give the appearance of showing seconds accurately for the purpose of working with date/time values. Sampling at the microsecond level may be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
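As an aside, the huge first value almost certainly comes from subtracting only the tv_usec fields: whenever stop.tv_sec differs from start.tv_sec, that difference can be negative, and %lu prints the negative value as an enormous unsigned number. A minimal sketch of a safer computation (the helper name is mine):

#include <sys/time.h>

// include the seconds field, so the result cannot go negative
// merely because tv_usec wrapped past a second boundary
long elapsedMicros(const struct timeval &start, const struct timeval &stop) {
    return (stop.tv_sec - start.tv_sec) * 1000000L
         + (stop.tv_usec - start.tv_usec);
}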
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason why the two times are somewhat equivalent is due to cache memory.
When you need to access a memory location (say, address 0xaabbc125 on an IA-32 architecture), the CPU copies the memory block (addresses 0xaabbc000 to 0xaabbcfff) into your cache memory. Reading from and writing to main memory is very slow, but once a block has been copied into your cache, you can access values very quickly. This is useful because programs usually require the same range of addresses over and over.
Since you execute the same code over and over and your code doesn't require a lot of memory, the first time the function is executed the memory block(s) is (are) copied to your cache once, which probably takes most of the 97000 time units. Any subsequent calls to your fooVal and fooRef functions will require addresses that are already in your cache, so they will require only a few nanoseconds (I'd figure roughly between 10 ns and 1 µs). Thus, dereferencing the pointer (since a reference is implemented as a pointer) is about double the time compared to just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea: try running the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the loop, as sketched below. That way (if my explanation is correct!) the memory block should already be in the cache when you begin looping, which means you won't be including caching in your times.
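That warm-up would look something like this (a sketch; the iteration counts are arbitrary):

// warm up the cache (and let the CPU settle) before taking any timestamps
for (int warm = 0; warm < 10; ++warm) {
    fooVal(a);
    fooRef(a);
}
gettimeofday(&start, NULL);
// ... timed loops as before ...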
About the super-high value you got, I can't explain that. But the value is obviously wrong.
It's not a bug, it's a feature! =)

Can reducing loop iterations in C++ code help increase the speed?

I give the following example to illustrate my question:
void fun(int i, float *pt)
{
    // do something based on i
    std::cout << *(pt + i) << std::endl;
}

const unsigned int LOOP = 2000000007;

void fun_without_optimization()
{
    float *example = new float[LOOP];
    for (unsigned int i = 0; i < LOOP; i++)
    {
        fun(i, example);
    }
    delete [] example;
}
void fun_with_optimization()
{
    float *example = new float[LOOP];
    unsigned int unit_loop = LOOP / 10;
    unsigned int left_loop = LOOP % 10;  // computed but never processed (see the answers below)
    float *pt = example;
    for (unsigned int i = 0; i < unit_loop; i++)
    {
        fun(0, pt);
        fun(1, pt);
        fun(2, pt);
        fun(3, pt);
        fun(4, pt);
        fun(5, pt);
        fun(6, pt);
        fun(7, pt);
        fun(8, pt);
        fun(9, pt);
        pt = pt + 10;
    }
    delete [] example;
}
As far as I understand, function fun_without_optimization() and function fun_with_optimization() should perform the same. The only argument why the second function is better than the first is that the pointer calculation in fun becomes simple. Any other arguments why the second function is better?
Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 from London an inch eastward in JFK.
Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?
Manually unrolling a loop is error-prone, as is clearly illustrated by your code: you forgot to process the tail left_loop (a corrected sketch follows after this list of points).
For at least a couple of decades, compilers have done this optimization for you.
How do you know the optimal number of iterations to put in that unrolled loop? Do you target a specific cache size and calculate the length of the assembly instructions in bytes? The compiler might.
Your messing with the otherwise clean loop can prevent other optimizations, like the use of SIMD.
The bottom line is: if you know something that your compiler doesn't (specific pattern of the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling. But even then - profile.
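For illustration, a manually unrolled version that does process the tail might look like this (a sketch only; as argued above, the compiler usually does this better):

float *pt = example;
unsigned int unit_loop = LOOP / 10;
unsigned int left_loop = LOOP % 10;

for (unsigned int i = 0; i < unit_loop; i++) {
    fun(0, pt); fun(1, pt); fun(2, pt); fun(3, pt); fun(4, pt);
    fun(5, pt); fun(6, pt); fun(7, pt); fun(8, pt); fun(9, pt);
    pt += 10;
}
// process the LOOP % 10 elements the unrolled body leaves over
for (unsigned int i = 0; i < left_loop; i++) {
    fun(i, pt);
}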
The technique you describe is called loop unrolling; it potentially increases performance, as the time spent evaluating the control structures (updating the loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you, and maintainability of the code decreases if it is done manually.
This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and the level of "parallelization" the code permits, multiple instructions can be executed in one cycle.
So this code:
for (int i = 0; i < n; i++)  // n multiple of 4, for simplicity
    a += temp;               // just a random instruction
Will actually execute faster on a parallel architecture if rewritten like:
for (int i = 0; i < n; i += 4)
{
    temp0 = temp0 + temp1;  // reads and additions can be executed in parallel
    temp1 = temp2 + temp3;
    a = temp0 + temp1 + a;
}
There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
It does not stop here: the code you want to optimize has to be a continuous block of code, meaning no jumps (no function calls, no change-of-flow instructions), for maximum efficiency.
Writing your code like:
for (unsigned int i = 0; i < unit_loop; i++)
{
    fun(0, pt);
    fun(1, pt);
    fun(2, pt);
    fun(3, pt);
    fun(4, pt);
    fun(5, pt);
    fun(6, pt);
    fun(7, pt);
    fun(8, pt);
    fun(9, pt);
    pt = pt + 10;
}
would not do much unless the compiler inlines the function calls; and it looks like too many instructions anyway...
On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely only on it when you want to get the maximum optimization out of your code. Remember, the compiler handles 'the general case' while you are likely interested in a particular situation - that's why some compilers have special directives to help with the optimization process.
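For example, GCC and Clang expose per-loop unrolling pragmas (a sketch; availability and exact syntax depend on the compiler and version):

// GCC 8+ syntax; Clang's equivalent is #pragma clang loop unroll_count(10)
#pragma GCC unroll 10
for (unsigned int i = 0; i < LOOP; i++) {
    fun(i, example);
}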

Function pointer runs faster than inline function. Why?

I ran a benchmark of mine on my computer (Intel i3-3220 @ 3.3GHz, Fedora 18), and got very unexpected results. A function pointer was actually a bit faster than an inline function.
Code:
#include <iostream>
#include <chrono>

inline short toBigEndian(short i)
{
    return (i << 8) | (i >> 8);
}

short (*toBigEndianPtr)(short i) = toBigEndian;

int main()
{
    std::chrono::duration<double> t(0);  // zero-initialized accumulator
    int total = 0;
    for (int i = 0; i < 10000000; i++)
    {
        auto begin = std::chrono::high_resolution_clock::now();
        short a = toBigEndian((short)i); // toBigEndianPtr((short)i);
        total += a;
        auto end = std::chrono::high_resolution_clock::now();
        t += std::chrono::duration_cast<std::chrono::duration<double>>(end - begin);
    }
    std::cout << t.count() << ", " << total << std::endl;
    return 0;
}
compiled with
g++ test.cpp -std=c++0x -O0
The 'toBigEndian' loop always finishes at around 0.26-0.27 seconds, while 'toBigEndianPtr' takes 0.21-0.22 seconds.
What makes this even more odd is that when I remove 'total', the function pointer becomes the slower one at 0.35-0.37 seconds, while the inline function is at about 0.27-0.28 seconds.
My question is:
Why is the function pointer faster than the inline function when 'total' exists?
Short answer: it isn't.
You compile with -O0, which does not optimize (much). Without optimization you have no say in "fast", because unoptimized code is not as fast as the code can be.
You take the address of toBigEndian, which prevents inlining. The inline keyword is a hint for the compiler anyway, which it may or may not follow. You did your best to make it not follow that hint.
So, to give your measurements any meaning:
optimize your code
use two functions doing the same thing, one that gets inlined, the other one called through its address
A common mistake in measuring performance (besides forgetting to optimize) is to use the wrong tool to measure. Using std::chrono would be fine if you were measuring the performance of the entire loop of 10000000 or 500000000 iterations. Instead, you are asking it to measure the call / inline of toBigEndian, a function that is all of 6 instructions. So I switched to rdtsc (read time stamp counter, i.e. clock cycles).
Allowing the compiler to really optimize everything in the loop, not cluttering it with recording the time on every tiny iteration, we have a different code sequence. Now, after compiling with g++ -O3 fp_test.cpp -o fp_test -std=c++11, I observe the desired effect. The inlined version averages around 2.15 cycles per iteration, while the function pointer takes around 7.0 cycles per iteration.
Even without using rdtsc, the difference is still quite observable. The wall clock time was 360ms for the inlined code and 1.17s for the function pointer. So one could use std::chrono in place of rdtsc in this code.
Modified code follows:
#include <cstdint>   // uint32_t, uint64_t
#include <iostream>

static inline uint64_t rdtsc(void)
{
    uint32_t hi, lo;
    // read the CPU's time stamp counter (clock cycles since reset)
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | (((uint64_t)hi) << 32);
}

inline short toBigEndian(short i)
{
    return (i << 8) | (i >> 8);
}

short (*toBigEndianPtr)(short i) = toBigEndian;

#define LOOP_COUNT 500000000

int main()
{
    uint64_t t = 0, begin = 0, end = 0;
    int total = 0;
    begin = rdtsc();
    for (int i = 0; i < LOOP_COUNT; i++)
    {
        short a = 0;
        a = toBigEndianPtr((short)i);
        //a = toBigEndian((short)i);
        total += a;
    }
    end = rdtsc();
    t += (end - begin);
    std::cout << ((double)t / LOOP_COUNT) << ", " << total << std::endl;
    return 0;
}
Oh s**t (do I need to censor swearing here?), I found it out. It was somehow related to the timing being inside the loop. When I moved it outside as follows,
#include <iostream>
#include <chrono>

inline short toBigEndian(short i)
{
    return (i << 8) | (i >> 8);
}

short (*toBigEndianPtr)(short i) = toBigEndian;

int main()
{
    int total = 0;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        short a = toBigEndianPtr((short)i);
        total += a;
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::duration<double>>(end - begin).count()
              << ", " << total << std::endl;
    return 0;
}
the results are just as they should be. 0.08 seconds for inline, 0.20 seconds for pointer. Sorry for bothering you guys.
First off, with -O0, you aren't running the optimizer, which means the compiler is ignoring your request to inline, as it is free to do. The cost of the two different calls ought to be nearly identical. Try with -O2.
Second, if you are only running for 0.22 seconds, weirdly variable costs involved with starting your program totally dominate the cost of running the test function. That function call is just a few instructions. If your CPU is running at 2 GHz, it ought to execute that function call in something like 20 nanoseconds, so you can see that whatever it is you're measuring, it's not the cost of running that function.
Try calling the test function in a loop, say 1,000,000 times. Make the number of loops 10x bigger until it takes > 10 seconds to run the test. Then divide the result by the number of loops for an approximation of the cost of the operation.
With many/most self-respecting modern compilers, the code you posted will still inline the function call even when it is called through the pointer (assuming the compiler makes a reasonable effort to optimize the code). The situation is just too easy to see through. In other words, the generated code can easily end up virtually the same in both cases, meaning that your test is not really useful for measuring what you are trying to measure.
If you really want to make sure the call is physically performed through the pointer, you have to make an effort to "confuse" the compiler to the point where it can't figure out the pointer value at compile time. For example, make the pointer value run-time dependent, as in
toBigEndianPtr = rand() % 1000 != 0 ? toBigEndian : NULL;
or something along these lines. You can also declare your function pointer as volatile, which will typically cause a genuine through-the-pointer call each time as well as force the compiler to re-read the pointer value from memory on each iteration.
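The volatile variant would look like this (a sketch; the volatile qualifier sits on the pointer itself, not on the function):

// forces a reload of the pointer and a genuine indirect call on every use;
// the compiler can no longer fold it into a direct (or inlined) call
short (* volatile toBigEndianPtr)(short) = toBigEndian;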

Strange C++ performance difference?

I just stumbled upon a change that seems to have counterintuitive performance ramifications. Can anyone provide a possible explanation for this behavior?
Original code:
for (int i = 0; i < ct; ++i) {
    // do some stuff...
    int iFreq = getFreq(i);
    double dFreq = iFreq;
    if (iFreq != 0) {
        // do some stuff with iFreq...
        // do some calculations with dFreq...
    }
}
While cleaning up this code during a "performance pass," I decided to move the definition of dFreq inside the if block, as it was only used inside the if. There are several calculations involving dFreq, so I didn't eliminate it entirely, as it does save the cost of multiple run-time conversions from int to double. I expected no performance difference, or at most a negligible improvement. However, the performance decreased by nearly 10%. I have measured this many times, and this is indeed the only change I've made. The code snippet shown above executes inside a couple of other loops. I get very consistent timings across runs and can definitely confirm that the change I'm describing decreases performance by ~10%. I would expect performance to increase, because the int to double conversion would only occur when iFreq != 0.
Changed code:
for (int i = 0; i < ct; ++i) {
    // do some stuff...
    int iFreq = getFreq(i);
    if (iFreq != 0) {
        // do some stuff with iFreq...
        double dFreq = iFreq;
        // do some stuff with dFreq...
    }
}
Can anyone explain this? I am using VC++ 9.0 with /O2. I just want to understand what I'm not accounting for here.
You should put the conversion to dFreq immediately inside the if() before doing the calculations with iFreq. The conversion may execute in parallel with the integer calculations if the instruction is farther up in the code. A good compiler might be able to push it farther up, and a not-so-good one may just leave it where it falls. Since you moved it to after the integer calculations it may not get to run in parallel with integer code, leading to a slowdown. If it does run parallel, then there may be little to no improvement at all depending on the CPU (issuing an FP instruction whose result is never used will have little effect in the original version).
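In code, that suggestion amounts to something like this (a sketch of the suggested placement, not a measured result):

for (int i = 0; i < ct; ++i) {
    int iFreq = getFreq(i);
    if (iFreq != 0) {
        double dFreq = iFreq;  // start the int-to-double conversion first,
                               // so it can overlap the integer work below
        // do some stuff with iFreq...
        // do some calculations with dFreq...
    }
}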
If you really want to improve performance, a number of people have done benchmarks and rank the following compilers in this order:
1) ICC - Intel compiler
2) GCC - A good second place
3) MSVC - generated code can be quite poor compared to the others.
You may also want to try -O3 if they have it.
Maybe the result of getFreq is kept inside a register in the first case and written to memory in the second case? It might also be, that the performance decrease has to do with CPU mechanisms as pipelining and/or branch prediction.
You could check the generated assembly code.
This looks to me like a pipeline stall.

int iFreq = getFreq(i);
double dFreq = iFreq;
if (iFreq != 0) {

allows the conversion to double to happen in parallel with other code, since dFreq is not being used immediately. It gives the compiler something to do between storing iFreq and using it, so this conversion is most likely "free".
But
int iFreq = getFreq(i);
if (iFreq != 0) {
    // do some stuff with iFreq...
    double dFreq = iFreq;
    // do some stuff with dFreq...
}
Could be hitting a store/reference stall after the conversion to double since you begin using the double value right away.
Modern processors can do multiple things per clock cycle, but only when the things are independent. Two consecutive instructions that reference the same register often result in a stall. The actual conversion to double may take 3 clocks, but all but the first clock can be done in parallel with other work, provided you don't refer to the result of the conversion for an instruction or two.
C++ compilers are getting pretty good at re-ordering instructions to take advantage of this, it looks like your change defeated some nice optimization.
One other (less likely) possibility is that when the conversion to double was before the branch, the compiler was able to remove the branch entirely. Branchless code is often a major performance win in modern processors.
It would be interesting to see what instructions the compiler actually emitted for these two cases.
Try moving the definition of dFreq outside of the for loop but keeping the assignment inside the for loop/if block.
Perhaps the creation of dFreq on the stack every iteration, inside the if, is causing the issue (although the compiler should take care of that). Perhaps there is a regression in the compiler: if the dFreq var is in the for loop it's created once; inside the if inside the for, it's created every time.
double dFreq;
int iFreq;
for (int i = 0; i < ct; ++i)
{
    // do some stuff...
    iFreq = getFreq(i);
    if (iFreq != 0)
    {
        // do some stuff with iFreq...
        dFreq = iFreq;
        // do some stuff with dFreq...
    }
}
Maybe the compiler is optimizing by taking the definition outside the for loop, and when you put it in the if, the compiler optimizations aren't doing that.
There's a likelihood that this change caused your compiler to disable some optimizations. What happens if you move the declarations above the loop?
I once read a document about optimization that said that defining variables just before their usage, and not earlier, is good practice, and that compilers can optimize code following that advice.
This article (a bit old but quite valid) says (with statistics) something similar: http://www.tantalon.com/pete/cppopt/asyougo.htm#PostponeVariableDeclaration
It's easy enough to find out. Just take 20 stackshots of the slow version, and of the fast version. In the slow version you will see on roughly 2 of the shots what it is doing that it is not doing in the fast version. You will see a subtle difference in where it halts in the assembly language.