Function slows down when I put it in header - c++

I have a project where it is important to do speedy conversions from bytes (char) to hex-formatted strings ("00" - "ff").
The problem I have is that my conversion function slows down when I move it from my test file to my conversion library.
The function uses a std::vector<int> as a lookup table for the precomputed strings.
The speed difference when testing is 4us in the test file versus 8us when called from the library. This is for 1000 iterations of the conversion function.
Can anyone help me understand what is going on? To my eyes, the same code is taking twice the time to execute.
Test code with Catch2 (partial):
BENCHMARK("fast, local")
{
auto l = [](){
string x;
for (int i = 0; i < 1000; ++i) {
// this is exactly how conv::char2hex works as well
x += lookupvector[conv::byte2int(random_bytes[i])];
}
return x;
};
return l();
};
BENCHMARK("slow, lib")
{
auto l = [](){
string x;
for (int i = 0; i < 1000; ++i) {
x += conv::char2hex(random_bytes[i]);
}
return x;
};
return l();
};
Function code in conversion.h:
inline string char2hex(const char &x){
return lookupvector[byte2int(x)];
};
Compiled with CMake, using clang, in release mode (-O2).
Update:
random_bytes is a pre-allocated std::vector<char> with 1M entries for testing.
The BENCHMARK macro runs the test repeatedly for better statistics.
10x'ing the number in the loop does not change the timing difference significantly.
x.reserve(2000); does not change anything, I believe it is already optimized for.
Changing the order of the tests does not change anything.
-flto does not improve the situation
Having the conversion function and lookup table in a local header, compared to a lib does not improve the speed.
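For reference, here is a minimal sketch of the lookup-table approach described above. The namespace `sketch` and the function `hex_table` are hypothetical stand-ins (the real `lookupvector` and `conv::byte2int` are not shown in the post). One difference visible in the posted code is that `char2hex` returns a `string` by value, so each call copies the table entry into a temporary before it is appended, while the local loop appends the entry directly. Whether that accounts for the measured gap is an assumption that would need profiling; the sketch returns a const reference to avoid the copy.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical stand-ins for the question's lookupvector and conv:: helpers.
namespace sketch {

// Map a (possibly signed) char to an index in 0..255.
inline int byte2int(char c) {
    return static_cast<unsigned char>(c);
}

// Build the 256 two-character strings "00".."ff" once, on first use.
inline const std::array<std::string, 256>& hex_table() {
    static const std::array<std::string, 256> table = [] {
        std::array<std::string, 256> t;
        const char digits[] = "0123456789abcdef";
        for (int i = 0; i < 256; ++i) {
            t[i] = std::string{digits[i >> 4], digits[i & 0xF]};
        }
        return t;
    }();
    return table;
}

// Returning a reference avoids constructing a temporary std::string per call.
inline const std::string& char2hex(char c) {
    return hex_table()[byte2int(c)];
}

}  // namespace sketch
```

With this signature, `x += sketch::char2hex(random_bytes[i]);` appends straight from the table entry.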

Related

How to make sure the call is not optimised away when measuring time?

I wrote a function template to measure time:
#include <ctime>
template <typename FUNCTION,typename INPUT,int N>
double measureTime(FUNCTION f,INPUT inp){
// double x;
double duration = 0;
clock_t begin = clock();
for (int i=0;i<N;i++){
// x = f(inp);
f(inp);
}
clock_t end = clock();
// std::cout << x << std::endl;
return double(end-begin) / CLOCKS_PER_SEC;
}
And I use it like this:
#include <iostream>
#include <vector>
typedef std::vector<double> DVect;
double passValue(DVect a){
double sum = 0;
for (int i=0;i<a.size();i++){sum += sum+a[i];}
return sum;
}
typedef double (*passValue_type)(DVect);
int main(int argc, char *argv[]) {
const int N = 1000;
const int size = 10000;
std::vector<double> v(size,0);
std::cout << measureTime<passValue_type,DVect,N>(passValue,v) << std::endl;
}
The aim is to reliably measure the CPU time of different functions, e.g. pass-by-value vs pass-by-reference. It actually seems to work nicely; however, sometimes the resulting time is too short to be measured and I just get 0 as the result. To make sure the function is called, I printed the result of the call (see the comments in the code above). I would like to avoid this and keep the template as simple as possible, so my question is:
How can I make sure that the function is really called and not optimised away (because the return value is not used)?
I typically do something like this:
#include <ctime>
template <typename FUNCTION,typename INPUT,int N>
double measureTime(FUNCTION f,INPUT inp){
double x = 0;
double duration = 0;
clock_t begin = clock();
for (int i=0;i<N;i++){
x += f(inp);
}
clock_t end = clock();
std::cout << x << std::endl;
// or if (x < 0) cout << x; or similar.
// such that it doesn't ACTUALLY print anything.
return double(end-begin) / CLOCKS_PER_SEC;
}
The above assumes that f actually does something non-trivial that the compiler can't figure out how to simplify. If f is return 6; then the compiler will convert it to x = 6 * N;, and you get very short runtime indeed.
If you want to be able to use "any" function, you will have to do some more clever stuff:
template <typename FUNCTION,typename INPUT,int N, typename RET>
double measureTime(FUNCTION f,INPUT inp){
RET x = 0;
double duration = 0;
clock_t begin = clock();
for (int i=0;i<N;i++){
x += f(inp);
}
clock_t end = clock();
std::cout << x << std::endl;
return double(end-begin) / CLOCKS_PER_SEC;
}
template <typename FUNCTION,typename INPUT,int N, void>
double measureTime(FUNCTION f,INPUT inp){
clock_t begin = clock();
for (int i=0;i<N;i++){
f(inp);
}
clock_t end = clock();
return double(end-begin) / CLOCKS_PER_SEC;
}
[I haven't actually compiled the above code, so it may have minor flaws, but as a concept it should work].
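The two overloads above will not compile as written, since a function template cannot be specialized on `void` that way. A minimal C++17 sketch of the same idea, assuming the non-void return type is arithmetic (so that it can be stored in a `volatile` object), could look like this. The parameter order and the name `sink` are my own choices, not part of the original code.

```cpp
#include <cassert>
#include <ctime>
#include <type_traits>

// Sketch: one template for both void and value-returning callables (C++17).
template <int N, typename Function, typename Input>
double measureTime(Function f, const Input& inp) {
    using Ret = std::invoke_result_t<Function&, const Input&>;
    const std::clock_t begin = std::clock();
    if constexpr (std::is_void_v<Ret>) {
        // A void function is assumed to have side effects of its own.
        for (int i = 0; i < N; ++i) {
            f(inp);
        }
    } else {
        // The volatile stores cannot be removed, so the result is never
        // "unused". The compiler may still simplify a trivial f.
        volatile Ret sink{};
        for (int i = 0; i < N; ++i) {
            sink = f(inp);
        }
    }
    const std::clock_t end = std::clock();
    return static_cast<double>(end - begin) / CLOCKS_PER_SEC;
}
```

It would be called as `measureTime<1000>(passValue, v)`; only `N` has to be given explicitly, the other template arguments are deduced.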
Since any meaningful void function will have to do something that affects the surrounding world (output to a stream, change a global variable or call some system call), it won't be eliminated. Of course, calling an empty function or similar is likely to cause trouble.
Another method, assuming you don't mind the call not being inlined, is to place the function under test in a separate file and not let the compiler "see" that function from the code that measures the time [and not use -flto, which would allow it to inline the function at link time]. That way, the compiler can't KNOW what the function under test is doing, and can't eliminate the call.
It should be noted that there is really no way to GUARANTEE that the compiler doesn't eliminate a call, other than either "make it impossible for the compiler to know what the outcome of the function is" (for example use random/externally sourced input), or "don't let the compiler know what the function does".
Without inlining: Make sure that the function call and the function definition are in separate compilation units (i.e. cpp-files), then disable link-time optimization in your build.
In this case compiler will not be able to inline your function call due to how compilation units work in C++. Also, the compiler will not be able to remove the call completely. In fact, it will know nothing about your function (except for the signature) at the moment it optimizes the call.
With inlining: The simple way described above will not work if you want to measure time with your function call inlined. In that case you have to make sure that every operation inside your function has some observable behavior that depends on it. You can, for example, write your results to volatile variables, or calculate some sum/hash of the results and print it to stdout.
Many compilers have extensions to disable inlining of a function. For gcc, it is __attribute__((noinline)), e.g.:
__attribute__((noinline)) void foo() { ... }
Boost provides a portable BOOST_NOINLINE macro.
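If Boost is not available, a small macro along these lines is a common pattern. `SKETCH_NOINLINE` and `work` are hypothetical names; the attribute spellings are the compilers' own. Note that noinline only prevents inlining: as far as I know, a call to a function without side effects whose result is unused may still be dropped, so the result should still be consumed somewhere.

```cpp
#include <cassert>

// Hypothetical macro name; falls back to nothing on unknown compilers.
#if defined(_MSC_VER)
#  define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#  define SKETCH_NOINLINE __attribute__((noinline))
#else
#  define SKETCH_NOINLINE
#endif

// The function under test is kept out of line at every call site.
SKETCH_NOINLINE int work(int x) {
    return x * x + 1;
}
```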

C++ use `const int` as looping variable?

I want to write code that compiles conditionally and according to the following two cases:
CASE_A:
for(int i = 1; i <= 10; ++i){
// do something...
}
CASE_B: ( == !CASE_A)
{
const int i = 0;
// do something...
}
That is, in case A, I want to have a normal loop over variable i but, in case B, I want to restrict the local scope variable i to only a special case (designated here as i = 0). Obviously, I could write something along the lines of:
for(int i = (CASE_A ? 1 : 0); i <= (CASE_A ? 10 : 0); ++i){
// do something
}
However, I do not like this design as it doesn't allow me to take advantage of the const declaration in the special case B. Such declaration would presumably allow for lots of optimization as the body of this loop benefits greatly from a potential compile-time replacement of i by its constant value.
Looking forward to any tips from the community on how to efficiently achieve this.
Thank you!
EDITS:
CASE_A vs CASE_B can be evaluated at compile-time.
i is not passed as reference
i is not re-evaluated in the body (otherwise const would not make sense), but I am not sure the compiler will go through the effort to certify that
Assuming you aren't over-simplifying your example, it shouldn't matter. Assuming CASE_A can be evaluated at compile-time, the code:
for( int i = 0; i <= 0; ++i ) {
do_something_with( i );
}
is going to generate the same machine code as:
const int i = 0;
do_something_with( i );
for any decent compiler (with optimization turned on, of course).
In researching this, I find there is a fine point here. If i gets passed to a function via a pointer or reference, the compiler can't assume it doesn't change. This is true even if the pointer or reference is const! (Since the const can be cast away in the function.)
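A small sketch of that fine point (the function names are made up for illustration): the callee takes a const reference but still changes the object, which is legal as long as the object itself was not declared const. If `i` were declared `const int`, the write would be undefined behavior and the compiler could assume the value never changes.

```cpp
#include <cassert>

// The callee receives a const reference but casts the const away.
// This is legal only because the referenced object is not itself const.
void sneaky(const int& r) {
    const_cast<int&>(r) = 42;
}

int observe() {
    int i = 0;   // non-const object
    sneaky(i);   // the compiler cannot assume i is still 0 after this call
    return i;    // returns 42
}
```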
Seems to be the obvious solution:
template<int CASE>
void do_case();
template<>
void do_case<CASE_A>()
{
for(int i = 1; i <= 10; ++i){
do_something( i );
}
}
template<>
void do_case<CASE_B>()
{
do_something( 0 );
}
// Usage
...
do_case<CURRENT_CASE>(); // CURRENT_CASE is the compile time constant
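With C++17 the two specializations can also be folded into one template using `if constexpr`. This is only a sketch: `kCaseA` plays the role of the compile-time CASE_A flag, and `do_something` is a hypothetical stand-in for the loop body that records the values it is called with.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the loop body: records each i it was given.
inline void do_something(int i, std::vector<int>& seen) {
    seen.push_back(i);
}

// kCaseA plays the role of the compile-time CASE_A flag.
template <bool kCaseA>
void do_case(std::vector<int>& seen) {
    if constexpr (kCaseA) {
        for (int i = 1; i <= 10; ++i) {
            do_something(i, seen);
        }
    } else {
        constexpr int i = 0;  // a genuine compile-time constant in case B
        do_something(i, seen);
    }
}
```

Only the selected branch is instantiated, so case B contains no loop at all.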
If your CASE_A/CASE_B determination can be expressed as a compile time constant, then you can do what you want in a nice, readable format using something like the following (which is just a variation on your example of using the ?: operator for the for loop initialization and condition):
enum {
kLowerBound = (CASE_A ? 1 : 0),
kUpperBound = (CASE_A ? 10 : 0)
};
for (int i = kLowerBound; i <= kUpperBound; ++i) {
// do something
}
This makes it clear that the for loop bounds are compile time constants - note that I think most compilers today would have no problem making that determination even if the ?: expressions were used directly in the for statement's controlling clauses. However, I do think using enums makes it more evident to people reading the code.
Again, any compiler worth its salt today should recognize when i is invariant inside the loop, and in the CASE_B situation also determine that the loop will never iterate. Making i const won't benefit the compiler's optimization possibilities.
If you're convinced that the compiler might be able to optimize better if i is const, then a simple modification can help:
for (int ii = kLowerBound; ii <= kUpperBound; ++ii) {
const int i = ii;
// do something
}
I doubt this will help the compiler much (but check its output - I could be wrong) if i isn't modified and doesn't have its address taken (even by passing it as a reference). However, it might help you make sure that i isn't inappropriately modified or passed by reference/address in the loop.
On the other hand, you might actually see a benefit to optimizations produced by the compiler if you use the const modifier on it - in the cases where the address of i is taken or the const is cast away, the compiler is still permitted to treat i as not being modified for its lifetime. Any modifications that might be made by something that cast away the const would be undefined behavior, so the compiler is allowed to ignore that they might occur. Of course, if you have code that might do this, you have bigger worries than optimization. So it's more important to make sure that there are no 'behind the back' modification attempts to i than to simply mark i as const 'for optimization', but using const might help you identify whether modifications are made (but remember that casts can continue to hide that).
I'm not quite sure that this is what you're looking for, but I'm using this macro version of the vanilla for loop, which forces the loop counter to be const in order to catch any modification of it in the body:
#define FOR(type, var, start, maxExclusive, inc) if (bool a = true) for (type var##_ = start; var##_ < maxExclusive; a=true,var##_ += inc) for (const auto var = var##_;a;a=false)
Usage:
#include <stdio.h>
#define FOR(type, var, start, maxExclusive, inc) if (bool a = true) for (type var##_ = start; var##_ < maxExclusive; a=true,var##_ += inc) for (const auto var = var##_;a;a=false)
int main()
{
FOR(int, i, 0, 10, 1) {
printf("i: %d\n", i);
}
// does the same as:
for (int i = 0; i < 10; i++) {
printf("i: %d\n", i);
}
// FOR catches some bugs:
for (int i = 0; i < 10; i++) {
i += 10; // is legal but bad
printf("i: %d\n", i);
}
FOR(int, i, 0, 10, 1) {
i += 10; // is illegal and will not compile
printf("i: %d\n", i);
}
return 0;
}

Where is the virtual function call overhead?

I'm trying to benchmark the difference between a function pointer call and a virtual function call. To do this, I have written two pieces of code that do the same mathematical computation over an array. One variant uses an array of pointers to functions and calls those in a loop. The other variant uses an array of pointers to a base class and calls its virtual function, which is overridden in the derived classes to do exactly the same thing as the functions in the first variant. Then I print the time elapsed and use a simple shell script to run the benchmark many times and compute the average run time.
Here is the code:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>
using namespace std;
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}
void function_not( double *d ) {
*d = sin(*d);
}
void function_and( double *d ) {
*d = cos(*d);
}
void function_or( double *d ) {
*d = tan(*d);
}
void function_xor( double *d ) {
*d = sqrt(*d);
}
void ( * const function_table[4] )( double* ) = { &function_not, &function_and, &function_or, &function_xor };
int main(void)
{
srand(time(0));
void ( * index_array[100000] )( double * );
double array[100000];
for ( long int i = 0; i < 100000; ++i ) {
index_array[i] = function_table[ rand() % 4 ];
array[i] = ( double )( rand() / 1000 );
}
struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for ( long int i = 0; i < 100000; ++i ) {
index_array[i]( &array[i] );
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
unsigned long long time_elapsed = timespecDiff(&end, &start);
cout << time_elapsed / 1000000000.0 << endl;
}
and here is the virtual function variant:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>
using namespace std;
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
return ((timeA_p->tv_sec * 1000000000) + timeA_p->tv_nsec) -
((timeB_p->tv_sec * 1000000000) + timeB_p->tv_nsec);
}
class A {
public:
virtual void calculate( double *i ) = 0;
};
class A1 : public A {
public:
void calculate( double *i ) {
*i = sin(*i);
}
};
class A2 : public A {
public:
void calculate( double *i ) {
*i = cos(*i);
}
};
class A3 : public A {
public:
void calculate( double *i ) {
*i = tan(*i);
}
};
class A4 : public A {
public:
void calculate( double *i ) {
*i = sqrt(*i);
}
};
int main(void)
{
srand(time(0));
A *base[100000];
double array[100000];
for ( long int i = 0; i < 100000; ++i ) {
array[i] = ( double )( rand() / 1000 );
switch ( rand() % 4 ) {
case 0:
base[i] = new A1();
break;
case 1:
base[i] = new A2();
break;
case 2:
base[i] = new A3();
break;
case 3:
base[i] = new A4();
break;
}
}
struct timespec start, end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
for ( int i = 0; i < 100000; ++i ) {
base[i]->calculate( &array[i] );
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
unsigned long long time_elapsed = timespecDiff(&end, &start);
cout << time_elapsed / 1000000000.0 << endl;
}
My system is Linux, Fedora 13, gcc 4.4.2. The code is compiled with g++ -O3. The first one is test1, the second is test2.
Now I see this in console:
[Ignat@localhost circuit_testing]$ ./test2 && ./test2
0.0153142
0.0153166
Well, more or less, I think. And then, this:
[Ignat@localhost circuit_testing]$ ./test2 && ./test2
0.01531
0.0152476
Where are the 25% which should be visible? How can the first executable be even slower than the second one?
I'm asking this because I'm doing a project which involves calling a lot of small functions in a row like this in order to compute the values of an array, and the code I've inherited does a very complex manipulation to avoid the virtual function call overhead. Now where is this famous call overhead?
In both cases you are calling functions indirectly. In one case through your table of function pointers, and in the other through the compiler's array of function pointers (the vtable). Not surprisingly, two similar operations give you similar timing results.
Virtual functions may be slower than regular functions, but that's due to things like inlines. If you call a function through a function table, those can't be inlined either, and the lookup time is pretty much the same. Looking up through your own lookup table is of course going to be the same as looking up through the compiler's lookup table.
Edit: Or even slower, because the compiler knows a lot more than you about things like processor cache and such.
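To make the comparison concrete, here is a reduced sketch of the two dispatch forms side by side (all names are made up for illustration and are not the question's code). Both calls load a code address from a table in memory and jump through it; the virtual call needs one extra load to find the table via the object.

```cpp
#include <cassert>
#include <cmath>

// Variant 1: hand-written table of function pointers.
inline double fn_sin(double d) { return std::sin(d); }
inline double fn_cos(double d) { return std::cos(d); }
using Fn = double (*)(double);
const Fn fn_table[2] = { &fn_sin, &fn_cos };

// Variant 2: compiler-generated table (the vtable).
struct Op {
    virtual ~Op() = default;
    virtual double apply(double d) const = 0;
};
struct SinOp : Op {
    double apply(double d) const override { return std::sin(d); }
};
struct CosOp : Op {
    double apply(double d) const override { return std::cos(d); }
};

// Indirect call through the explicit table.
inline double call_via_pointer(int which, double d) {
    return fn_table[which](d);
}

// Indirect call through the object's vtable.
inline double call_via_virtual(const Op& op, double d) {
    return op.apply(d);
}
```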
I think you're seeing the difference, but it's just the function call overhead. Branch misprediction, memory access and the trig functions are the same in both cases. Compared to those, it's just not that big a deal, though the function pointer case was definitely a bit quicker when I tried it.
If this is representative of your larger program, this is a good demonstration that this type of microoptimization is sometimes just a drop in the ocean, and at worst futile. But leaving that aside, for a clearer test, the functions should perform some simpler operation, that is different for each function:
void function_not( double *d ) {
*d = 1.0;
}
void function_and( double *d ) {
*d = 2.0;
}
And so on, and similarly for the virtual functions.
(Each function should do something different, so that they don't get elided and all end up with the same address; that would make the branch prediction work unrealistically well.)
With these changes, the results are a bit different. Best of 4 runs in each case. (Not very scientific, but the numbers are broadly similar for larger numbers of runs.) All timings are in cycles, running on my laptop. Code was compiled with VC++ (only changed the timing) but gcc implements virtual function calls in the same way so the relative timings should be broadly similar even with different OS/x86 CPU/compiler.
Function pointers: 2,052,770
Virtuals: 3,598,039
That difference seems a bit excessive! Sure enough, the two bits of code aren't quite the same in terms of their memory access behaviour. The second one should have a table of 4 A *s, used to fill in base, rather than new'ing up a new one for each entry. Both examples will then have similar behaviour (1 cache miss/N entries) when fetching the pointer to jump through. For example:
A *tbl[4] = { new A1, new A2, new A3, new A4 };
for ( long int i = 0; i < 100000; ++i ) {
array[i] = ( double )( rand() / 1000 );
base[i] = tbl[ rand() % 4 ];
}
With this in place, still using the simplified functions:
Virtuals (as suggested here): 2,487,699
So there's 20%, best case. Close enough?
So perhaps your colleague was right to at least consider this, but I suspect that in any realistic program the call overhead won't be enough of a bottleneck to be worth jumping through hoops over.
Nowadays, on most systems, memory access is the primary bottleneck, and not the CPU. In many cases, there is little significant difference between virtual and non-virtual functions - they usually represent a very small portion of execution time. (Sorry, I don't have reported figures to back this up, just empirical data.)
If you want to get the best performance, you will get more bang for your buck if you look into how to parallelize the computation to take advantage of multiple cores/processing units, rather than worrying about micro-details of virtual vs non-virtual functions.
Many people fall into the habit of doing things just because they are thought to be "faster". It's all relative.
If I'm going to take a 100-mile drive from my home, I have to start by driving around the block. I can drive around the block to the right, or to the left. One of those will be "faster". But will it matter? Of course not.
In this case, the functions that you call are in turn calling math functions.
If you pause the program under the IDE or GDB, I suspect you will find that nearly every time you pause it it will be in those math library routines (or it should be!), and dereferencing an additional pointer to get there (assuming it doesn't bust a cache) should be lost in the noise.
Added: Here is a favorite video: Harry Porter's relay computer. As that thing laboriously clacks away adding numbers and stepping its program counter, I find it helpful to be mindful that that's what all computers are doing, just on a different scale of time and complexity. In your case, think about an algorithm to do sin, cos, tan, or sqrt. Inside, it is clacking away doing those things, and only incidentally following addresses or messing with a really slow memory to get there.
And finally, the function pointer approach has turned out to be the fastest one. Which was what I'd expected from the very beginning.

C++ arrays as parameters, EDIT: now includes variable scoping

Alright, I'm guessing this is an easy question, so I'll take the knocks, but I'm not finding what I need on google or SO. I'd like to create an array in one place, and populate it inside a different function.
I define a function:
void someFunction(double results[])
{
for (int i = 0; i<100; ++i)
{
for (int n = 0; n<16; ++n) //note this iteration limit
{
results[n] += i * n;
}
}
}
That's an approximation to what my code is doing, but regardless, shouldn't be running into any overflow or out of bounds issues or anything. I generate an array:
double result[16];
for(int i = 0; i<16; i++)
{
result[i] = -1;
}
then I want to pass it to someFunction
someFunction(result);
When I set breakpoints and step through the code, upon entering someFunction, results is set to the same address as result, and the value there is -1.000000 as expected. However, when I start iterating through the loop, results[n] doesn't seem to resolve to *(results+n) or *(results+n*sizeof(double)), it just seems to resolve to *(results). What I end up with is that instead of populating my result array, I just get one value. What am I doing wrong?
EDIT
Oh fun, I have a typo: it wasn't void someFunction(double results[]). It was:
void someFunction(double result[])...
So perhaps this is turning into a scoping question. If my double result[16] array is defined in a main.cpp, and someFunction is defined in a Utils.h file that's included by the main.cpp, does the result variable in someFunction then wreak havoc on the result array in main?
EDIT 2:
@gf, in the process of trying to reproduce this problem with a fresh project, the original project "magically" started working.
I don't know how to explain it, as nothing changed, but I'm pretty sure of what I saw - my original description of the issue was pretty clear, so I don't think I was hallucinating. I appreciate the time and answers...sorry for wasting your time. I'll update again if it happens again, but for the meantime, I think I'm in the clear. Thanks again.
Just a point about the variable scope part of the question - there is no issue of variable scope here. result/results in your someFunction definition is a parameter -> it will take on the value passed in. There is no relation between variables in a called function and its caller -> the variables in the caller function are unknown to the called function unless passed in. Also, variable scoping issues do not occur between routines in C++ because there are no nested routines. The following piece of code demonstrates a scoping issue:
int i = 0;
{
int i = 0;
i = 5; //changes the second i, not the first.
//The first i is shadowed (hidden) by the second i inside this block.
}
i = 5; //now changes the first i; the inner block is gone and so is its local i
So if C++ did have nested routines, this would cause variable scoping issues:
void main()
{
double results[16];
double blah[16];
doSomething(blah);
void doSomething(double * results)
{
//blah doing something here uses our parameter results,
//which refers to blah, but not to the results in the higher scope.
//The results in the higher scope is hidden.
}
}
void someFunction(double results[])
should be exactly equivalent to
void someFunction(double *results)
Try using the alternative declaration and see if the problem persists.
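A small sketch to back this up (the function names are hypothetical): the compiler itself confirms that the two declarations have the same type, and `results[n]` is simply `*(results + n)`, where the pointer arithmetic already scales by sizeof(double).

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

void fillA(double results[]);   // array syntax
void fillB(double* results);    // pointer syntax

// The two declarations have the same function type.
static_assert(std::is_same_v<decltype(fillA), decltype(fillB)>,
              "array parameters are adjusted to pointers");

// results[n] means *(results + n); no manual sizeof scaling is needed.
inline void addIndex(double results[], std::size_t count) {
    for (std::size_t n = 0; n < count; ++n) {
        results[n] += static_cast<double>(n);
    }
}
```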
To me it seems that your code should simply work.
I just tried this in g++ and it worked fine. I guess your problem is elsewhere? Have you tried the snippet you posted?
#include <iostream>
void someFunction(double results[])
{
for (int i = 0; i<100; ++i)
{
for (int n = 0; n<16; ++n) //note this iteration limit
{
results[n] += i * n;
}
}
}
int main()
{
double result[16];
for(int i = 0; i<16; i++)
{
result[i] = -1;
}
someFunction(result);
for(int i = 0; i<16; i++)
std::cerr << result[i] << " ";
std::cerr << std::endl;
}
Have you perhaps defined your results array twice in a couple of places and then accidentally referred to one copy in one place and another copy elsewhere? Perhaps the second is a pointer and not an array, and that is why the debugger is confused?
To ensure this problem doesn't occur, you should never use global variables like that. If you absolutely must have one, put it in a namespace for clarity.

i++ less efficient than ++i, how to show this?

I am trying to show by example that the prefix increment is more efficient than the postfix increment.
In theory this makes sense: i++ needs to be able to return the unincremented original value and therefore store it, whereas ++i can return the incremented value without storing the previous value.
But is there a good example to show this in practice?
I tried the following code:
int array[100];
int main()
{
for(int i = 0; i < sizeof(array)/sizeof(*array); i++)
array[i] = 1;
}
I compiled it using gcc 4.4.0 like this:
gcc -Wa,-adhls -O0 myfile.cpp
I did this again, with the postfix increment changed to a prefix increment:
for(int i = 0; i < sizeof(array)/sizeof(*array); ++i)
The result is identical assembly code in both cases.
This was somewhat unexpected. It seemed like that by turning off optimizations (with -O0) I should see a difference to show the concept. What am I missing? Is there a better example to show this?
In the general case, the post-increment will result in a copy where a pre-increment will not. Of course, this will be optimized away in a large number of cases, and in the cases where it isn't, the copy operation will be negligible (i.e., for built-in types).
Here's a small example that shows the potential inefficiency of post-increment.
#include <stdio.h>
class foo
{
public:
int x;
foo() : x(0) {
printf( "construct foo()\n");
};
foo( foo const& other) {
printf( "copy foo()\n");
x = other.x;
};
foo& operator=( foo const& rhs) {
printf( "assign foo()\n");
x = rhs.x;
return *this;
};
foo& operator++() {
printf( "preincrement foo\n");
++x;
return *this;
};
foo operator++( int) {
printf( "postincrement foo\n");
foo temp( *this);
++x;
return temp;
};
};
int main()
{
foo bar;
printf( "\n" "preinc example: \n");
++bar;
printf( "\n" "postinc example: \n");
bar++;
}
The results from an optimized build (which actually removes a second copy operation in the post-increment case due to RVO):
construct foo()
preinc example:
preincrement foo
postinc example:
postincrement foo
copy foo()
In general, if you don't need the semantics of the post-increment, why take the chance that an unnecessary copy will occur?
Of course, it's good to keep in mind that a custom operator++() - either the pre or post variant - is free to return whatever it wants (or even do whatever it wants), and I'd imagine that there are quite a few that don't follow the usual rules. Occasionally I've come across implementations that return "void", which makes the usual semantic difference go away.
You won't see any difference with integers. You need to use iterators or something where post and prefix really do something different. And you need to turn all optimisations on, not off!
I like to follow the rule of "say what you mean".
++i simply increments. i++ increments and has a special, non-intuitive result of evaluation. I only use i++ if I explicitly want that behavior, and use ++i in all other cases. If you follow this practice, when you do see i++ in code, it's obvious that post-increment behavior really was intended.
Several points:
First, you're unlikely to see a major performance difference either way.
Second, your benchmarking is useless if you have optimizations disabled. What we want to know is if this change gives us more or less efficient code, which means that we have to use it with the most efficient code the compiler is able to produce. We don't care whether it is faster in unoptimized builds, we need to know if it is faster in optimized ones.
For built-in datatypes like integers, the compiler is generally able to optimize the difference away. The problem mainly occurs for more complex types with overloaded increment operators, where the compiler can't trivially see that the two operations would be equivalent in the context.
You should use the code that most clearly expresses your intent. Do you want to "add one to the value", or "add one to the value, but keep working on the original value a bit longer"? Usually, the former is the case, and then a pre-increment better expresses your intent.
If you want to show the difference, the simplest option is simply to implement both operators, and point out that one requires an extra copy, the other does not.
This code and its comments should demonstrate the differences between the two.
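struct some_ridiculously_big_type { char payload[4096]; }; // placeholder definition, assumed for illustration
struct a { // struct, so that the free operators below can access the members
int index;
some_ridiculously_big_type big;
//etc...
};
// prefix ++a: increments in place and returns a reference, no copy
a& operator++ (a& _a) {
++_a.index;
return _a;
}
// postfix a++: has to copy the whole object in order to return the old value
a operator++ (a& _a, int) {
a old = _a;
++_a.index;
return old;
}
// now the program
int main (void) {
a my_a = a(); // value-initialized, so index starts at 0
// prefix:
// 1. updates my_a.index
// 2. copies my_a.index to b
int b = (++my_a).index;
// postfix
// 1. creates a copy of my_a, including the *big* member.
// 2. updates my_a.index
// 3. copies index out of the **copy** of my_a that was created in step 1
int c = (my_a++).index;
}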
You can see that the postfix has an extra step (step 1) which involves creating a copy of the object. This has implications for both memory consumption and runtime. That is why prefix is more efficient than postfix for non-basic types.
Depending on some_ridiculously_big_type and also on whatever you do with the result of the increment, you'll be able to see the difference either with or without optimizations.
In response to Mihail, this is a somewhat more portable version of his code:
#include <cstdio>
#include <ctime>
using namespace std;
#define SOME_BIG_CONSTANT 100000000
#define OUTER 40
int main( int argc, char * argv[] ) {
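unsigned d = 0; // unsigned, so the sum wraps around instead of overflowing (signed overflow is undefined behavior)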
time_t now = time(0);
if ( argc == 1 ) {
for ( int n = 0; n < OUTER; n++ ) {
int i = 0;
while(i < SOME_BIG_CONSTANT) {
d += i++;
}
}
}
else {
for ( int n = 0; n < OUTER; n++ ) {
int i = 0;
while(i < SOME_BIG_CONSTANT) {
d += ++i;
}
}
}
int t = time(0) - now;
printf( "%d\n", t );
return d % 2;
}
The outer loops are there to allow me to fiddle the timings to get something suitable on my platform.
I don't use VC++ any more, so I compiled it (on Windows) with:
g++ -O3 t.cpp
I then ran it by alternating:
a.exe
and
a.exe 1
My timing results were approximately the same for both cases. Sometimes one version would be faster by up to 20% and sometimes the other. This I would guess is due to other processes running on my system.
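Since time(0) only has one-second resolution, a finer-grained timer makes such comparisons easier to read. The following is a sketch using std::chrono::steady_clock; `elapsed_ms`, `sum_postfix` and `sum_prefix` are hypothetical helper names, and the accumulator is unsigned (my assumption) so that the sum wraps around instead of overflowing.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs f once and returns the elapsed
// wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// The postfix loop: adds 0, 1, ..., limit - 1.
inline unsigned sum_postfix(int limit) {
    unsigned d = 0;
    int i = 0;
    while (i < limit) {
        d += i++;
    }
    return d;
}

// The prefix loop: adds 1, 2, ..., limit.
inline unsigned sum_prefix(int limit) {
    unsigned d = 0;
    int i = 0;
    while (i < limit) {
        d += ++i;
    }
    return d;
}
```

A call such as `elapsed_ms([&] { result = sum_postfix(100000000); })` then gives the time for one variant; printing `result` afterwards keeps the loop from being discarded as unused.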
Try to use while or do something with returned value, e.g.:
#define SOME_BIG_CONSTANT 1000000000
int _tmain(int argc, _TCHAR* argv[])
{
int i = 1;
int d = 0;
DWORD d1 = GetTickCount();
while(i < SOME_BIG_CONSTANT + 1)
{
d += i++;
}
DWORD t1 = GetTickCount() - d1;
printf("%d", d);
printf("\ni++ > %d <\n", t1);
i = 0;
d = 0;
d1 = GetTickCount();
while(i < SOME_BIG_CONSTANT)
{
d += ++i;
}
t1 = GetTickCount() - d1;
printf("%d", d);
printf("\n++i > %d <\n", t1);
return 0;
}
Compiled with VS 2005 using /O2 or /Ox, tried on my desktop and on laptop.
On the laptop I consistently get something like the following; on the desktop the numbers are a bit different (but the ratio is about the same):
i++ > 8xx <
++i > 6xx <
xx means that numbers are different e.g. 813 vs 640 - still around 20% speed up.
And one more point - if you replace "d +=" with "d = " you will see nice optimization trick:
i++ > 935 <
++i > 0 <
However, it's quite specific. But after all, I don't see any reasons to change my mind and think there is no difference :)
Perhaps you could just show the theoretical difference by writing out both versions with x86 assembly instructions? As many people have pointed out before, the compiler will always make its own decisions on how best to compile/assemble the program.
If the example is meant for students not familiar with the x86 instruction set, you might consider using the MIPS32 instruction set -- for some odd reason many people seem to find it to be easier to comprehend than x86 assembly.
Ok, all this prefix/postfix "optimization" is just... some big misunderstanding.
The main idea is that i++ returns a copy of the original value and thus requires copying it.
This may be correct for some inefficient implementations of iterators. However, in 99% of cases, even with STL iterators, there is no difference, because the compiler knows how to optimize it and the actual iterators are just pointers that look like a class. And of course there is no difference for primitive types like integers or pointers.
So... forget about it.
EDIT: Clarification
As I mentioned, most STL iterator classes are just pointers wrapped in classes, with all member functions inlined, which allows such an irrelevant copy to be optimized out.
And yes, if you have your own iterators without inlined member functions, then they may work slower. But you should just understand what the compiler does and what it does not.
As a small proof, take this code:
int sum1(vector<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();x++)
n+=*x;
return n;
}
int sum2(vector<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();++x)
n+=*x;
return n;
}
int sum3(set<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();x++)
n+=*x;
return n;
}
int sum4(set<int> const &v)
{
int n = 0;
for(auto x=v.begin();x!=v.end();++x)
n+=*x;
return n;
}
Compile it to assembly and compare sum1 and sum2, sum3 and sum4...
I can just tell you... gcc gives exactly the same code with -O2.