I'm trying to benchmark the difference between a function pointer call and a virtual function call. To do this, I have written two pieces of code that do the same mathematical computation over an array. One variant uses an array of pointers to functions and calls those in a loop. The other variant uses an array of pointers to a base class and calls its virtual function, which is overridden in the derived classes to do exactly the same thing as the functions in the first variant. Then I print the time elapsed and use a simple shell script to run the benchmark many times and compute the average run time.
Here is the code:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>
using namespace std;
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
// Widen tv_sec before multiplying, so the conversion to nanoseconds cannot overflow on a 32-bit time_t.
return (((long long)timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
(((long long)timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}
void function_not( double *d ) {
*d = sin(*d);
}
void function_and( double *d ) {
*d = cos(*d);
}
void function_or( double *d ) {
*d = tan(*d);
}
void function_xor( double *d ) {
*d = sqrt(*d);
}
void ( * const function_table[4] )( double* ) = { &function_not, &function_and, &function_or, &function_xor };
int main(void)
{
srand(time(0));
void ( * index_array[100000] )( double * );
double array[100000];
for ( long int i = 0; i < 100000; ++i ) {
index_array[i] = function_table[ rand() % 4 ];
array[i] = ( double )( rand() / 1000 );
}
struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for ( long int i = 0; i < 100000; ++i ) {
index_array[i]( &array[i] );
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
unsigned long long time_elapsed = timespecDiff(&end, &start);
cout << time_elapsed / 1000000000.0 << endl;
}
and here is the virtual function variant:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>
using namespace std;
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
// Widen tv_sec before multiplying, so the conversion to nanoseconds cannot overflow on a 32-bit time_t.
return (((long long)timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
(((long long)timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}
class A {
public:
virtual void calculate( double *i ) = 0;
};
class A1 : public A {
public:
void calculate( double *i ) {
*i = sin(*i);
}
};
class A2 : public A {
public:
void calculate( double *i ) {
*i = cos(*i);
}
};
class A3 : public A {
public:
void calculate( double *i ) {
*i = tan(*i);
}
};
class A4 : public A {
public:
void calculate( double *i ) {
*i = sqrt(*i);
}
};
int main(void)
{
srand(time(0));
A *base[100000];
double array[100000];
for ( long int i = 0; i < 100000; ++i ) {
array[i] = ( double )( rand() / 1000 );
switch ( rand() % 4 ) {
case 0:
base[i] = new A1();
break;
case 1:
base[i] = new A2();
break;
case 2:
base[i] = new A3();
break;
case 3:
base[i] = new A4();
break;
}
}
struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for ( int i = 0; i < 100000; ++i ) {
base[i]->calculate( &array[i] );
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
unsigned long long time_elapsed = timespecDiff(&end, &start);
cout << time_elapsed / 1000000000.0 << endl;
}
My system is Linux, Fedora 13, gcc 4.4.2. The code is compiled with g++ -O3. The first variant is test1, the second is test2.
Now I see this in console:
[Ignat#localhost circuit_testing]$ ./test2 && ./test2
0.0153142
0.0153166
Well, more or less, I think. And then, this:
[Ignat#localhost circuit_testing]$ ./test2 && ./test2
0.01531
0.0152476
Where are the 25% which should be visible? How can the first executable be even slower than the second one?
I'm asking this because I'm doing a project which involves calling a lot of small functions in a row like this in order to compute the values of an array, and the code I've inherited does a very complex manipulation to avoid the virtual function call overhead. Now where is this famous call overhead?
In both cases you are calling functions indirectly. In one case through your table of function pointers, and in the other through the compiler's array of function pointers (the vtable). Not surprisingly, two similar operations give you similar timing results.
Virtual functions may be slower than regular functions, but that's mostly because direct calls can be inlined and indirect calls generally can't. If you call a function through a function table, those calls can't be inlined either, and the lookup time is pretty much the same. Looking up through your own lookup table is of course going to cost the same as looking up through the compiler's lookup table.
Edit: Or even slower, because the compiler knows a lot more than you about things like processor cache and such.
I think you're seeing the difference, but it's just the function call overhead. Branch misprediction, memory access and the trig functions are the same in both cases. Compared to those, it's just not that big a deal, though the function pointer case was definitely a bit quicker when I tried it.
If this is representative of your larger program, this is a good demonstration that this type of micro-optimization is sometimes just a drop in the ocean, and at worst futile. But leaving that aside, for a clearer test the functions should perform some simpler operation that is different for each function:
void function_not( double *d ) {
*d = 1.0;
}
void function_and( double *d ) {
*d = 2.0;
}
And so on, and similarly for the virtual functions.
(Each function should do something different, so that they don't get elided and all end up with the same address; that would make the branch prediction work unrealistically well.)
With these changes, the results are a bit different. Best of 4 runs in each case. (Not very scientific, but the numbers are broadly similar for larger numbers of runs.) All timings are in cycles, running on my laptop. The code was compiled with VC++ (I only changed the timing code), but gcc implements virtual function calls in the same way, so the relative timings should be broadly similar even with a different OS/x86 CPU/compiler.
Function pointers: 2,052,770
Virtuals: 3,598,039
That difference seems a bit excessive! Sure enough, the two bits of code aren't quite the same in terms of their memory access behaviour. The second one should have a table of 4 A *s, used to fill in base, rather than new'ing up a new one for each entry. Both examples will then have similar behaviour (1 cache miss/N entries) when fetching the pointer to jump through. For example:
A *tbl[4] = { new A1, new A2, new A3, new A4 };
for ( long int i = 0; i < 100000; ++i ) {
array[i] = ( double )( rand() / 1000 );
base[i] = tbl[ rand() % 4 ];
}
With this in place, still using the simplified functions:
Virtuals (as suggested here): 2,487,699
So there's 20%, best case. Close enough?
So perhaps your colleague was right to at least consider this, but I suspect that in any realistic program the call overhead won't be enough of a bottleneck to be worth jumping through hoops over.
Nowadays, on most systems, memory access is the primary bottleneck, not the CPU. In many cases, there is little significant difference between virtual and non-virtual functions - they usually represent a very small portion of execution time. (Sorry, I don't have reported figures to back this up, just empirical data.)
If you want the best performance, you will get more bang for your buck by looking into how to parallelize the computation to take advantage of multiple cores/processing units, rather than worrying about micro-details of virtual vs non-virtual functions.
Many people fall into the habit of doing things just because they are thought to be "faster". It's all relative.
If I'm going to take a 100-mile drive from my home, I have to start by driving around the block. I can drive around the block to the right, or to the left. One of those will be "faster". But will it matter? Of course not.
In this case, the functions that you call are in turn calling math functions.
If you pause the program under the IDE or GDB, I suspect you will find that nearly every time you pause it it will be in those math library routines (or it should be!), and dereferencing an additional pointer to get there (assuming it doesn't bust a cache) should be lost in the noise.
Added: Here is a favorite video: Harry Porter's relay computer. As that thing laboriously clacks away adding numbers and stepping its program counter, I find it helpful to be mindful that that's what all computers are doing, just on a different scale of time and complexity. In your case, think about an algorithm to do sin, cos, tan, or sqrt. Inside, it is clacking away doing those things, and only incidentally following addresses or messing with a really slow memory to get there.
And finally, the function pointer approach has turned out to be the fastest one, which was what I'd expected from the very beginning.
Related
I have a project where it is important to do speedy conversions from bytes (char) to hex-formatted strings ("00" - "ff").
The problem I have is that my conversion function slows down when I move it from my test file to my conversion library.
The function uses a std::vector<int> as a lookup table for the precomputed strings.
The speed difference when testing is 4 µs in the test file versus 8 µs when called from the library, using 1000 iterations of the conversion function.
Can anyone help me understand what is going on? To my eyes, the same code is taking twice the time to execute.
test code with catch2 (partial)
BENCHMARK("fast, local")
{
auto l = [](){
string x;
for (int i = 0; i < 1000; ++i) {
// this is exactly how conv::char2hex works as well
x += lookupvector[conv::byte2int(random_bytes[i])];
}
return x;
};
return l();
};
BENCHMARK("slow, lib")
{
auto l = [](){
string x;
for (int i = 0; i < 1000; ++i) {
x += conv::char2hex(random_bytes[i]);
}
return x;
};
return l();
};
function code in conversion.h
inline string char2hex(const char &x){
return lookupvector[byte2int(x)];
};
Compiled with cmake, using clang, release mode (-O2)
Update:
random_bytes is a pre-allocated std::vector<char> with 1M entries for testing.
The BENCHMARK macro runs the test repeatedly for better statistics.
10x'ing the number in the loop does not change the timing difference significantly.
x.reserve(2000); does not change anything, I believe it is already optimized for.
Changing the order of the tests does not change anything.
-flto does not improve the situation
Having the conversion function and lookup table in a local header, compared to a lib does not improve the speed.
I have tried to use restrict-qualified pointers, and I have encountered a problem.
The program below is just a simple example to present the problem.
The calc_function uses three pointers which are restrict-qualified, so they "SHALL" not alias each other. When I compile this code in Visual Studio 2010, the function gets inlined, and the qualifiers are then ignored. If I disable inlining, the code executes more than six times faster (from 2200 ms to 360 ms). But I do not want to disable inlining in the whole project, nor in the whole file (because then there would be call overhead in e.g. all getters and setters, which would be horrible).
(Might the only solution be to disable inlining of only this function?)
I have tried to create temporary restrict qualified pointers in the function, both at the top and in the inner loop to try to tell the compiler that I promise that there is no aliasing, but the compiler won't believe me, and it will not work.
I have also tried tweaking compiler settings, but the only one I have found that works is to disable inlining.
I would appreciate some help to solve this optimization problem.
To run the program (in release mode), don't forget to use the arguments 0 1000 2000.
The reason for using user input/program arguments is to make sure the compiler can't know whether there is aliasing between the pointers a, b and c.
#include <cstdlib>
#include <cstdio>
#include <ctime>
// Data table that a, b and c will point into, so the compiler can't know whether they alias.
const size_t listSize = 10000;
int data[listSize];
//void calc_function(int * a, int * b, int * c){
void calc_function(int *__restrict a, int *__restrict b, int *__restrict c){
for(size_t y=0; y<1000*1000; ++y){ // <- Extra loop to be able to measure the time.
for(size_t i=0; i<1000; ++i){
*a += *b;
*c += *a;
}
}
}
int main(int argc, char *argv[]){ // argv SHALL be "0 1000 2000" (with no quotes)
// init
for(size_t i=0; i<listSize; ++i)
data[i] = i;
// get a, b and c from argv (0, 1000, 2000)
int ai, bi, ci;
sscanf(argv[1],"%d",&ai);
sscanf(argv[2],"%d",&bi);
sscanf(argv[3],"%d",&ci);
int *a = data + ai; // a, b and c will (with the specified argv) be
int *b = data + bi; // a = &data[0], b = &data[1000], c = &data[2000],
int *c = data + ci; // so they will not alias, and the compiler can't know that.
// calculate and take time
clock_t start = clock();
calc_function(a,b,c);
clock_t end = clock();
clock_t t = (end-start);
printf("calc_function %ld (clock ticks)\n", (long)t);
system("PAUSE");
return EXIT_SUCCESS;
}
If you declare a function with __declspec(noinline), it will force it not to be inlined:
http://msdn.microsoft.com/en-us/library/kxybs02x%28v=vs.80%29.aspx
You can use this to manually disable inlining on a per-function basis.
As for restrict, the compiler is free to use it only when it wants to. So fiddling around with different versions of the same code is somewhat unavoidable when attempting to "trick" compilers to do such optimizations.
I want to write code that compiles conditionally and according to the following two cases:
CASE_A:
for(int i = 1; i <= 10; ++i){
// do something...
}
CASE_B: ( == !CASE_A)
{
const int i = 0;
// do something...
}
That is, in case A, I want to have a normal loop over the variable i but, in case B, I want to restrict the local-scope variable i to only a special case (designated here as i = 0). Obviously, I could write something along the lines of:
for(int i = (CASE_A ? 1 : 0); i <= (CASE_A ? 10 : 0); ++i){
// do something
}
However, I do not like this design as it doesn't allow me to take advantage of the const declaration in the special case B. Such declaration would presumably allow for lots of optimization as the body of this loop benefits greatly from a potential compile-time replacement of i by its constant value.
Looking forward to any tips from the community on how to efficiently achieve this.
Thank you!
EDITS:
CASE_A vs CASE_B can be evaluated at compile-time.
i is not passed as reference
i is not re-evaluated in the body (otherwise const would not make sense), but I am not sure the compiler will go through the effort to certify that
Assuming you aren't over-simplifying your example, it shouldn't matter. Assuming CASE_A can be evaluated at compile time, the code:
for( int i = 0; i <= 0; ++i ) {
do_something_with( i );
}
is going to generate the same machine code as:
const int i = 0;
do_something_with( i );
for any decent compiler (with optimization turned on, of course).
In researching this, I find there is a fine point here. If i gets passed to a function via a pointer or reference, the compiler can't assume it doesn't change. This is true even if the pointer or reference is const! (Since the const can be cast away in the function.)
Seems to be the obvious solution:
template<int CASE>
void do_case();
template<>
void do_case<CASE_A>()
{
for(int i = 1; i <= 10; ++i){
do_something( i );
}
}
template<>
void do_case<CASE_B>()
{
do_something( 0 );
}
// Usage
...
do_case<CURRENT_CASE>(); // CURRENT_CASE is the compile time constant
If your CASE_A/CASE_B determination can be expressed as a compile-time constant, then you can do what you want in a nice, readable format using something like the following (which is just a variation on your example of using the ?: operator for the for loop initialization and condition):
enum {
kLowerBound = (CASE_A ? 1 : 0),
kUpperBound = (CASE_A ? 10 : 0)
};
for (int i = kLowerBound; i <= kUpperBound; ++i) {
// do something
}
This makes it clear that the for loop bounds are compile time constants - note that I think most compilers today would have no problem making that determination even if the ?: expressions were used directly in the for statement's controlling clauses. However, I do think using enums makes it more evident to people reading the code.
Again, any compiler worth its salt today should recognize when i is invariant inside the loop, and in the CASE_B situation also determine that the loop body executes exactly once. Making i const won't improve the compiler's optimization possibilities.
If you're convinced that the compiler might be able to optimize better if i is const, then a simple modification can help:
for (int ii = kLowerBound; ii <= kUpperBound; ++ii) {
const int i = ii;
// do something
}
I doubt this will help the compiler much (but check its output - I could be wrong) if i isn't modified and doesn't have its address taken (even by passing it as a reference). However, it might help you make sure that i isn't inappropriately modified or passed by reference/address in the loop.
On the other hand, you might actually see a benefit to the optimizations produced by the compiler if you use the const modifier - in the cases where the address of i is taken or the const is cast away, the compiler is still permitted to treat i as not being modified for its lifetime. Any modifications made by something that cast away the const would be undefined behavior, so the compiler is allowed to ignore the possibility that they occur. Of course, if you have code that might do this, you have bigger worries than optimization. So it's more important to make sure that there are no 'behind the back' modification attempts on i than simply to mark i as const 'for optimization' - but using const might help you identify whether such modifications are made (remembering that casts can continue to hide them).
I'm not quite sure whether this is what you're looking for, but I'm using this macro version of the vanilla for loop, which forces the loop counter to be const to catch any modification of it in the body:
#define FOR(type, var, start, maxExclusive, inc) if (bool a = true) for (type var##_ = start; var##_ < maxExclusive; a=true,var##_ += inc) for (const auto var = var##_;a;a=false)
Usage:
#include <stdio.h>
#define FOR(type, var, start, maxExclusive, inc) if (bool a = true) for (type var##_ = start; var##_ < maxExclusive; a=true,var##_ += inc) for (const auto var = var##_;a;a=false)
int main()
{
FOR(int, i, 0, 10, 1) {
printf("i: %d\n", i);
}
// does the same as:
for (int i = 0; i < 10; i++) {
printf("i: %d\n", i);
}
// FOR catches some bugs:
for (int i = 0; i < 10; i++) {
i += 10; // is legal but bad
printf("i: %d\n", i);
}
FOR(int, i, 0, 10, 1) {
i += 10; // is illegal and will not compile
printf("i: %d\n", i);
}
return 0;
}
I wonder whether there are performance comparisons of classes and C-style structs in C++ with the g++ -O3 option. Is there any benchmark or comparison on this? I've always thought of C++ classes as heavier, and possibly slower, than structs (compile time isn't very important to me; run time is more crucial). I'm going to implement a B-tree; should I implement it with classes or with structs for the sake of performance?
At the runtime level there is no difference between structs and classes in C++ at all.
So it makes no performance difference whether you use struct A or class A in your code.
Another thing: using certain features - like constructors, destructors and virtual functions - can have some performance cost (but if you use them, you probably need them anyway). You can use them equally well inside either a class or a struct.
In this document you can read about other performance-related subtleties of C++.
In C++, struct is syntactic sugar for classes whose members are public by default.
My honest opinion...don't worry about performance until it actually shows itself to be a problem, then profile your code. Premature optimization is the root of all evil. But, as others have said, there is no difference between a struct and class in C++ at runtime.
Focus on creating an efficient data structure and efficient logic to manipulate the data structure. C++ classes are not inherently slower than C-style structs, so don't let that limit your design.
AFAIK, from a performance point of view, they are equivalent in C++.
Their difference is syntactic sugar, like struct members being public by default, for example.
my2c
Just do an experiment, people!
Here is the code for the experiment I designed:
#include <iostream>
#include <string>
#include <ctime>
using namespace std;
class foo {
public:
void foobar(int k) {
for (; k > 0; k--) {
cout << k << endl;
}
}
void initialize() {
accessor = "asdfasdfasdfasdfasdfasdfasdfasdfasdfasdf";
}
string accessor;
};
struct bar {
public:
void foobar(int k) {
for (; k > 0; k--) {
cout << k << endl;
}
}
void initialize() {
accessor = "asdfasdfasdfasdfasdfasdfasdfasdfasdfasdf";
}
string accessor;
};
int main() {
clock_t timer1 = clock();
for (int j = 0; j < 200; j++) {
foo f;
f.initialize();
f.foobar(7);
cout << f.accessor << endl;
}
clock_t classstuff = clock();
clock_t timer2 = clock();
for (int j = 0; j < 200; j++) {
bar b;
b.initialize();
b.foobar(7);
cout << b.accessor << endl;
}
clock_t structstuff = clock();
cout << "struct took " << structstuff-timer2 << endl;
cout << "class took " << classstuff-timer1 << endl;
return 0;
}
On my computer, struct took 1286 clock ticks, and class took 1450 clock ticks. To answer your question, struct is slightly faster. However, that shouldn't matter, because computers are so fast these days.
Well, actually structs can be more efficient than classes in both time and memory (e.g. arrays of structs vs arrays of objects):
There is a huge difference in efficiency in some cases. While the overhead of an object might not seem like very much, consider an array of objects and compare it to an array of structs. Assume the data structure contains 16 bytes of data, the array length is 1,000,000, and this is a 32-bit system.
For an array of objects the total space usage is:
8 bytes array overhead + (4 byte pointer size × 1,000,000) + ((8 bytes overhead + 16 bytes data) × 1,000,000) = 28 MB
For an array of structs, the results are dramatically different:
8 bytes array overhead + (16 bytes data × 1,000,000) = 16 MB
With a 64-bit process, the object array takes over 40 MB while the struct array still requires only 16 MB.
see this article for details.
I am trying to show by example that the prefix increment is more efficient than the postfix increment.
In theory this makes sense: i++ needs to be able to return the unincremented original value and therefore store it, whereas ++i can return the incremented value without storing the previous value.
But is there a good example to show this in practice?
I tried the following code:
int array[100];
int main()
{
for(int i = 0; i < sizeof(array)/sizeof(*array); i++)
array[i] = 1;
}
I compiled it using gcc 4.4.0 like this:
gcc -Wa,-adhls -O0 myfile.cpp
I did this again, with the postfix increment changed to a prefix increment:
for(int i = 0; i < sizeof(array)/sizeof(*array); ++i)
The result is identical assembly code in both cases.
This was somewhat unexpected. It seemed like, by turning off optimizations (with -O0), I should see a difference that shows the concept. What am I missing? Is there a better example to show this?
In the general case, the post-increment will result in a copy where a pre-increment will not. Of course this will be optimized away in a large number of cases, and in the cases where it isn't the copy operation will be negligible (i.e., for built-in types).
Here's a small example that show the potential inefficiency of post-increment.
#include <stdio.h>
class foo
{
public:
int x;
foo() : x(0) {
printf( "construct foo()\n");
};
foo( foo const& other) {
printf( "copy foo()\n");
x = other.x;
};
foo& operator=( foo const& rhs) {
printf( "assign foo()\n");
x = rhs.x;
return *this;
};
foo& operator++() {
printf( "preincrement foo\n");
++x;
return *this;
};
foo operator++( int) {
printf( "postincrement foo\n");
foo temp( *this);
++x;
return temp;
};
};
int main()
{
foo bar;
printf( "\n" "preinc example: \n");
++bar;
printf( "\n" "postinc example: \n");
bar++;
}
The results from an optimized build (which actually removes a second copy operation in the post-increment case due to RVO):
construct foo()
preinc example:
preincrement foo
postinc example:
postincrement foo
copy foo()
In general, if you don't need the semantics of the post-increment, why take the chance that an unnecessary copy will occur?
Of course, it's good to keep in mind that a custom operator++() - either the pre or post variant - is free to return whatever it wants (or even do whatever it wants), and I'd imagine that there are quite a few that don't follow the usual rules. Occasionally I've come across implementations that return "void", which makes the usual semantic difference go away.
You won't see any difference with integers. You need to use iterators or something where post and prefix really do something different. And you need to turn all optimisations on, not off!
I like to follow the rule of "say what you mean".
++i simply increments. i++ increments and has a special, non-intuitive result of evaluation. I only use i++ if I explicitly want that behavior, and use ++i in all other cases. If you follow this practice, when you do see i++ in code, it's obvious that post-increment behavior really was intended.
Several points:
First, you're unlikely to see a major performance difference in any way
Second, your benchmarking is useless if you have optimizations disabled. What we want to know is if this change gives us more or less efficient code, which means that we have to use it with the most efficient code the compiler is able to produce. We don't care whether it is faster in unoptimized builds, we need to know if it is faster in optimized ones.
For built-in datatypes like integers, the compiler is generally able to optimize the difference away. The problem mainly occurs for more complex types with overloaded increment operators, such as iterators, where the compiler can't trivially see that the two operations would be equivalent in the context.
You should use the code that most clearly expresses your intent. Do you want to "add one to the value", or "add one to the value, but keep working on the original value a bit longer"? Usually, the former is the case, and then a pre-increment better expresses your intent.
If you want to show the difference, the simplest option is simply to implement both operators, and point out that one requires an extra copy and the other does not.
This code and its comments should demonstrate the differences between the two.
class a {
public:
int index;
some_ridiculously_big_type big;
//etc...
};
// prefix ++a
a& operator++ (a& _a) {
++_a.index;
return _a;
}
// postfix a++
a operator++ (a& _a, int) {
a copy = _a; // copies the whole object, including the *big* member
++_a.index;
return copy;
}
// now the program
int main (void) {
a my_a;
// prefix:
// 1. updates my_a.index
// 2. copies my_a.index to b
int b = (++my_a).index;
// postfix
// 1. creates a copy of my_a, including the *big* member.
// 2. updates my_a.index
// 3. copies index out of the **copy** of my_a that was created in step 1
int c = (my_a++).index;
}
You can see that the postfix has an extra step (step 1) which involves creating a copy of the object. This has implications for both memory consumption and runtime. That is why prefix is more efficient than postfix for non-basic types.
Depending on some_ridiculously_big_type and also on whatever you do with the result of the increment, you'll be able to see the difference either with or without optimizations.
In response to Mihail, this is a somewhat more portable version of his code:
#include <cstdio>
#include <ctime>
using namespace std;
#define SOME_BIG_CONSTANT 100000000
#define OUTER 40
int main( int argc, char * argv[] ) {
int d = 0;
time_t now = time(0);
if ( argc == 1 ) {
for ( int n = 0; n < OUTER; n++ ) {
int i = 0;
while(i < SOME_BIG_CONSTANT) {
d += i++;
}
}
}
else {
for ( int n = 0; n < OUTER; n++ ) {
int i = 0;
while(i < SOME_BIG_CONSTANT) {
d += ++i;
}
}
}
int t = time(0) - now;
printf( "%d\n", t );
return d % 2;
}
The outer loops are there to allow me to fiddle the timings to get something suitable on my platform.
I don't use VC++ any more, so I compiled it (on Windows) with:
g++ -O3 t.cpp
I then ran it by alternating:
a.exe
and
a.exe 1
My timing results were approximately the same for both cases. Sometimes one version would be faster by up to 20% and sometimes the other. This I would guess is due to other processes running on my system.
Try using a while loop and doing something with the returned value, e.g.:
#define SOME_BIG_CONSTANT 1000000000
int _tmain(int argc, _TCHAR* argv[])
{
int i = 1;
int d = 0;
DWORD d1 = GetTickCount();
while(i < SOME_BIG_CONSTANT + 1)
{
d += i++;
}
DWORD t1 = GetTickCount() - d1;
printf("%d", d);
printf("\ni++ > %d <\n", t1);
i = 0;
d = 0;
d1 = GetTickCount();
while(i < SOME_BIG_CONSTANT)
{
d += ++i;
}
t1 = GetTickCount() - d1;
printf("%d", d);
printf("\n++i > %d <\n", t1);
return 0;
}
Compiled with VS 2005 using /O2 or /Ox, and tried on my desktop and on my laptop.
On the laptop I stably get something like the following; on the desktop the numbers are a bit different (but the ratio is about the same):
i++ > 8xx <
++i > 6xx <
The xx means the exact digits vary from run to run, e.g. 813 vs 640 - still around a 20% speed-up.
And one more point - if you replace "d +=" with "d = " you will see nice optimization trick:
i++ > 935 <
++i > 0 <
However, it's quite specific. And after all this, I don't see any reason to change my mind and conclude that there is no difference :)
Perhaps you could just show the theoretical difference by writing out both versions with x86 assembly instructions? As many people have pointed out before, the compiler will always make its own decisions on how best to compile/assemble the program.
If the example is meant for students not familiar with the x86 instruction set, you might consider using the MIPS32 instruction set - for some odd reason many people seem to find it easier to comprehend than x86 assembly.
Ok, all this prefix/postfix "optimization" is just... one big misunderstanding.
The major idea is that i++ returns its original value and thus requires copying that value.
This may be correct for some inefficient implementations of iterators. However, in 99% of cases, even with STL iterators, there is no difference, because the compiler knows how to optimize it and the actual iterators are just pointers wrapped to look like a class. And of course there is no difference for primitive types like integers or pointers.
So... forget about it.
EDIT: Clarification
As I mentioned, most STL iterator classes are just pointers wrapped in classes, with all member functions inlined, which allows the irrelevant copy to be optimized out.
And yes, if you have your own iterators without inlined member functions, then it may work slower. But you should just understand what the compiler does and what it does not.
As a small proof, take this code:
int sum1(vector<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();x++)
n+=*x;
return n;
}
int sum2(vector<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();++x)
n+=*x;
return n;
}
int sum3(set<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();x++)
n+=*x;
return n;
}
int sum4(set<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();++x)
n+=*x;
return n;
}
Compile it to assembly and compare sum1 with sum2, and sum3 with sum4...
I can just tell you: gcc gives exactly the same code with -O2.