I don't recall seeing examples of code like this hypothetical snippet:
cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16; //or the equivalent using a macro
in which a member in a large structure gets dereferenced using pointers, operated on, and the result assigned back to the same field of the structure.
The kernel seems to be a place where such large structures are frequent but I haven't seen examples of it and became interested as to the reason why.
Is there a performance reason for this, maybe related to the time required to follow the pointers? Is it simply not good style and if so, what is the preferred way?
There's nothing wrong with the statement syntactically, but it's easier to code it like this:
cpu->dev.bus->uevent >>= 16;
It's much more a matter of history: the kernel is mostly written in C (not C++), and C was originally (in the K&R era) thought of as a "high-level assembler" whose statements and expressions should correspond literally to assembly instructions. In that environment ++i, i+=1 and i=i+1 were completely different things that translated into completely different CPU instructions.
Compiler optimizations at that time were not so advanced or widespread, so following the pointer chain twice was often avoided by first storing the destination address in a local temporary variable (most likely a register) and then doing the assignment
(like int* p = &a->b->c->d; *p = *p + 1;)
or by using a compound assignment like a->b->c >>= 16;
With today's computers (multicore processors, multilevel caches and pipelining), executing code inside registers can be ten times faster than accessing memory, so following three pointers can be faster than storing an address in memory, which reverses the old priorities.
Compiler optimization can then freely change the produced code to tune it for size or speed, depending on what is considered more important and on what kind of processor you are targeting.
So nowadays it doesn't really matter if you write ++i or i+=1 or i=i+1: the compiler will most likely produce the same code, attempting to access i only once. Likewise, following the pointer chain twice will most likely be rewritten as the equivalent of (cpu->dev.bus->uevent) >>= 16, since >>= corresponds to a single machine instruction on x86-derived processors.
That said ("it doesn't really matter"), it is also true that code style tends to reflect the styles and fashions of the age in which it was first written (since later developers tend to maintain consistency).
Your code is not "bad" by itself; it just looks "odd" in the place where it would be written.
Just to give you an idea of what pipelining and branch prediction do, consider the comparison of two vectors:
bool equal(size_t n, int* a, int *b)
{
for(size_t i=0; i<n; ++i)
if(a[i]!=b[i]) return false;
return true;
}
Here, as soon as we find something different we take the shortcut and report that they are different.
Now consider this:
bool equal(size_t n, int* a, int *b)
{
register size_t c=0;
for(register size_t i=0; i<n; ++i)
c+=(a[i]==b[i]);
return c==n;
}
There is no shortcut: even if we find a difference, we continue to loop and count.
But having removed the if from inside the loop, if n isn't that big (let's say less than 20) this can be 4 or 5 times faster!
An optimizing compiler can even recognize this situation and, provided there are no differing side effects, rework the first code into the second!
I see nothing wrong with something like that, it appears as innocuous as:
i = i + 42;
If you're accessing the data items a lot, you could consider something like:
tSomething *cdb = cpu->dev.bus;
cdb->uevent = cdb->uevent >> 16;
// and many more accesses to cdb here
but, even then, I'd tend to leave it to the optimiser, which tends to do a better job than most humans anyway :-)
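As a minimal sketch of the spellings discussed here (the long form, the compound assignment, and the cached pointer), all three leave the same value in the field. The struct layouts and the run_on_value helper are hypothetical stand-ins invented for illustration; the kernel's real definitions are different and much larger:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the real structures. */
struct bus { uint32_t uevent; };
struct dev { struct bus *bus; };
struct cpu { struct dev dev; };

/* Long form: the pointer chain is spelled out twice. */
void shift_long_form(struct cpu *cpu)
{
    cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16;
}

/* Compound assignment: same effect, chain written once. */
void shift_compound(struct cpu *cpu)
{
    cpu->dev.bus->uevent >>= 16;
}

/* Cached pointer: mainly worthwhile when many accesses follow. */
void shift_cached(struct cpu *cpu)
{
    struct bus *cdb = cpu->dev.bus;
    cdb->uevent = cdb->uevent >> 16;
}

/* Hypothetical helper: builds the chain around one value, applies a
   variant, and returns what ended up in the field. */
uint32_t run_on_value(void (*shift)(struct cpu *), uint32_t value)
{
    struct bus b = { value };
    struct cpu c = { { &b } };
    shift(&c);
    return b.uevent;
}
```

Compiling the three variants with optimization enabled and comparing the assembly (for example with gcc -S) is the way to check whether your compiler treats them identically.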
There's nothing inherently wrong with doing
cpu->dev.bus->uevent = (cpu->dev.bus->uevent) >> 16;
but depending on the type of uevent, you need to be careful when shifting right like that, so you don't accidentally shift unexpected bits into your value. For instance, if it's a 64-bit value
uint64_t uevent = 0xDEADBEEF00000000;
uevent = uevent >> 16; // now uevent is 0x0000DEADBEEF0000;
if you thought you shifted a 32-bit value and then pass the new uevent to a function taking a 64-bit value, you're not passing 0xBEEF0000, as you might have expected. Since the sizes fit (64-bit value passed as 64-bit parameter), you won't get any compiler warnings here (which you would have if you passed a 64-bit value as a 32-bit parameter).
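A small sketch of that pitfall, using the same constant as above. The function names are hypothetical; the point is only that making the narrowing explicit gives the 32-bit result a reader might have been expecting:

```c
#include <stdint.h>

/* Shifts the full 64-bit value; the upper half moves down into view. */
uint64_t shift_right_16(uint64_t v)
{
    return v >> 16;
}

/* Hypothetical helper: keep only the low 32 bits, so the narrowing is
   written down instead of being assumed by the caller. */
uint32_t low_32_bits(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}
```

With these, shift_right_16(0xDEADBEEF00000000) yields 0x0000DEADBEEF0000, and passing that through low_32_bits yields 0xBEEF0000.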
Also interesting to note is that the above operation, while similar to
i = ++i;
which is undefined behavior in C and in C++ before C++11 (see http://josephmansfield.uk/articles/c++-sequenced-before-graphs.html for details), is still well defined, since there are no side effects in the right-hand side expression.
Is there a performance difference between i++ and ++i if the resulting value is not used?
Executive summary: No.
i++ could potentially be slower than ++i, since the old value of i
might need to be saved for later use, but in practice all modern
compilers will optimize this away.
We can demonstrate this by looking at the code for this function,
both with ++i and i++.
$ cat i++.c
extern void g(int i);
void f()
{
int i;
for (i = 0; i < 100; i++)
g(i);
}
The files are the same, except for ++i and i++:
$ diff i++.c ++i.c
6c6
< for (i = 0; i < 100; i++)
---
> for (i = 0; i < 100; ++i)
We'll compile them, and also get the generated assembler:
$ gcc -c i++.c ++i.c
$ gcc -S i++.c ++i.c
And we can see that both the generated object and assembler files are the same.
$ md5 i++.s ++i.s
MD5 (i++.s) = 90f620dda862cd0205cd5db1f2c8c06e
MD5 (++i.s) = 90f620dda862cd0205cd5db1f2c8c06e
$ md5 *.o
MD5 (++i.o) = dd3ef1408d3a9e4287facccec53f7d22
MD5 (i++.o) = dd3ef1408d3a9e4287facccec53f7d22
From "Efficiency versus intent" by Andrew Koenig:
First, it is far from obvious that ++i is more efficient than i++, at least where integer variables are concerned.
And:
So the question one should be asking is not which of these two operations is faster, it is which of these two operations expresses more accurately what you are trying to accomplish. I submit that if you are not using the value of the expression, there is never a reason to use i++ instead of ++i, because there is never a reason to copy the value of a variable, increment the variable, and then throw the copy away.
So, if the resulting value is not used, I would use ++i. But not because it is more efficient: because it correctly states my intent.
A better answer is that ++i will sometimes be faster but never slower.
Everyone seems to be assuming that i is a regular built-in type such as int. In this case there will be no measurable difference.
However, if i is a complex type then you may well find a measurable difference. For i++ you must make a copy of your object before incrementing it. Depending on what's involved in a copy it could indeed be slower, since with ++i you can just return the final value.
Foo Foo::operator++(int) // postfix form: the dummy int parameter selects it
{
Foo oldFoo = *this; // copy existing value - could be slow
// yadda yadda, do increment
return oldFoo;
}
Another difference is that with ++i you have the option of returning a reference instead of a value, whereas i++ has to return a copy by value. Again, depending on what's involved in making a copy of your object, returning by value could be slower.
A real-world example of where this can occur would be the use of iterators. Copying an iterator is unlikely to be a bottle-neck in your application, but it's still good practice to get into the habit of using ++i instead of i++ where the outcome is not affected.
Short answer:
There is never any difference between i++ and ++i in terms of speed. A good compiler should not generate different code in the two cases.
Long answer:
What every other answer fails to mention is that the difference between ++i versus i++ only makes sense within the expression it is found.
In the case of for(i=0; i<n; i++), the i++ is alone in its own expression: there is a sequence point before the i++ and there is one after it. Thus the only machine code generated is "increase i by 1" and it is well-defined how this is sequenced in relation to the rest of the program. So if you would change it to prefix ++, it wouldn't matter in the slightest, you would still just get the machine code "increase i by 1".
The difference between ++i and i++ only matters in expressions such as array[i++] = x; versus array[++i] = x;. Some may argue that the postfix will be slower in such operations because the register where i resides has to be reloaded later. But note that the compiler is free to order your instructions in any way it pleases, as long as it doesn't "break the behavior of the abstract machine", as the C standard calls it.
So while you may assume that array[i++] = x; gets translated to machine code as:
Store value of i in register A.
Store address of array in register B.
Add A and B, store results in A.
At this new address represented by A, store the value of x.
Store value of i in register A // inefficient because extra instruction here, we already did this once.
Increment register A.
Store register A in i.
the compiler might as well produce the code more efficiently, such as:
Store value of i in register A.
Store address of array in register B.
Add A and B, store results in B.
Increment register A.
Store register A in i.
... // rest of the code.
Just because you as a C programmer are trained to think that the postfix ++ happens at the end, the machine code doesn't have to be ordered in that way.
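To make the semantic (as opposed to performance) difference concrete, here is a minimal sketch of the two array expressions mentioned above. The function names and the slot_written helper are hypothetical; both stores leave i one larger, but they write to different slots:

```c
#include <stddef.h>

/* x is written at the old index, then i advances; returns the new i. */
size_t store_postfix(int *array, size_t i, int x)
{
    array[i++] = x;
    return i;
}

/* i advances first, then x is written at the new index. */
size_t store_prefix(int *array, size_t i, int x)
{
    array[++i] = x;
    return i;
}

/* Hypothetical helper: starts with i == 0 and reports which slot of a
   3-element array received the value. */
size_t slot_written(size_t (*store)(int *, size_t, int))
{
    int array[3] = { 0, 0, 0 };
    store(array, 0, 7);
    for (size_t k = 0; k < 3; ++k)
        if (array[k] == 7)
            return k;
    return 3; /* not reached for the two functions above */
}
```

Starting from i == 0, the postfix form writes slot 0 and the prefix form writes slot 1, so the two can only compile to the same code when the value of the expression is not used.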
So there is no difference between prefix and postfix ++ in C. Now what you as a C programmer should be wary of is people who inconsistently use prefix in some cases and postfix in other cases, without any rationale. This suggests that they are uncertain about how C works or that they have incorrect knowledge of the language. This is always a bad sign; it in turn suggests that they are making other questionable decisions in their program, based on superstition or "religious dogmas".
"Prefix ++ is always faster" is indeed one such false dogma that is common among would-be C programmers.
Taking a leaf from Scott Meyers, More Effective C++, Item 6: Distinguish between prefix and postfix forms of increment and decrement operations.
The prefix version is always preferred over the postfix with regard to objects, especially iterators.
The reason for this becomes clear if you look at the call pattern of the operators.
// Prefix
Integer& Integer::operator++()
{
*this += 1;
return *this;
}
// Postfix
const Integer Integer::operator++(int)
{
Integer oldValue = *this;
++(*this);
return oldValue;
}
Looking at this example it is easy to see how the prefix operator will always be more efficient than the postfix, because the postfix needs a temporary object.
This is why when you see examples using iterators they always use the prefix version.
But as you point out, for ints there is effectively no difference because of the compiler optimisation that can take place.
Here's an additional observation if you're worried about micro optimisation. Decrementing loops can 'possibly' be more efficient than incrementing loops (depending on instruction set architecture e.g. ARM), given:
for (i = 0; i < 100; i++)
On each loop you will have one instruction each for:
Adding 1 to i.
Comparing whether i is less than 100.
A conditional branch if i is less than 100.
Whereas a decrementing loop:
for (i = 100; i != 0; i--)
The loop will have an instruction for each of:
Decrement i, setting the CPU register status flag.
A conditional branch depending on CPU register status (Z==0).
Of course this works only when decrementing to zero!
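A minimal sketch of the two loop shapes, applied to summing an array so that both can be checked against each other. The function names are hypothetical, and whether the count-down form actually saves an instruction depends on the target and the compiler, which may well perform this transformation on its own:

```c
#include <stddef.h>

/* Counting up: increment, compare against n, branch. */
long sum_counting_up(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Counting down to zero: the decrement itself can set the flag that the
   branch tests, so no separate compare is needed on some architectures. */
long sum_counting_down(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = n; i != 0; i--)
        total += a[i - 1];
    return total;
}
```

Note the a[i - 1] in the second version: the index runs from n down to 1, so the element accessed is one below the counter.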
Remembered from the ARM System Developer's Guide.
First of all: The difference between i++ and ++i is negligible in C.
To the details.
1. The well known C++ issue: ++i is faster
In C++, ++i is more efficient iff i is some kind of an object with an overloaded increment operator.
Why?
In ++i, the object is first incremented, and can subsequently be passed as a const reference to any other function. This is not possible if the expression is foo(i++) because now the increment needs to be done before foo() is called, but the old value needs to be passed to foo(). Consequently, the compiler is forced to make a copy of i before it executes the increment operator on the original. The additional constructor/destructor calls are the bad part.
As noted above, this does not apply to fundamental types.
2. The little known fact: i++ may be faster
If no constructor/destructor needs to be called, which is always the case in C, ++i and i++ should be equally fast, right? No. They are virtually equally fast, but there may be small differences, which most other answerers got the wrong way around.
How can i++ be faster?
The point is data dependencies. If the value needs to be loaded from memory, two subsequent operations need to be done with it: incrementing it, and using it. With ++i, the increment needs to be done before the value can be used. With i++, the use does not depend on the increment, and the CPU may perform the use operation in parallel with the increment operation. The difference is at most one CPU cycle, so it is really negligible, but it is there. And it is the other way round than many would expect.
Please don't let the question of "which one is faster" be the deciding factor of which to use. Chances are you're never going to care that much, and besides, programmer reading time is far more expensive than machine time.
Use whichever makes most sense to the human reading the code.
@Mark
Even though the compiler is allowed to optimize away the (stack-based) temporary copy of the variable, and gcc (in recent versions) does so, that doesn't mean all compilers will always do so.
I just tested it with the compilers we use in our current project and 3 out of 4 do not optimize it.
Never assume the compiler gets it right, especially if the possibly faster, but never slower, code is just as easy to read.
Unless you have a really stupid implementation of one of the operators in your code:
Always prefer ++i over i++.
I have been reading through most of the answers here and many of the comments, and I didn't see any reference to the one instance that I could think of where i++ is more efficient than ++i (and perhaps surprisingly --i was more efficient than i--). That is for C compilers for the DEC PDP-11!
The PDP-11 had assembly instructions for pre-decrement of a register and post-increment, but not the other way around. The instructions allowed any "general-purpose" register to be used as a stack pointer. So if you used something like *(i++) it could be compiled into a single assembly instruction, while *(++i) could not.
This is obviously a very esoteric example, but it does provide the exception where post-increment is more efficient (or I should say was, since there isn't much demand for PDP-11 C code these days).
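The two C idioms that line up with those addressing modes are *--p (pre-decrement, then access) and *p++ (access, then post-increment). A downward-growing stack uses exactly that pair. This is only a sketch: the names and the fixed size are hypothetical, and there are deliberately no overflow or underflow checks:

```c
#define STACK_SLOTS 8

static int stack_storage[STACK_SLOTS];
/* Empty stack: sp points one past the end; the stack grows downwards. */
static int *sp = stack_storage + STACK_SLOTS;

/* Pre-decrement, then store. */
void push(int value)
{
    *--sp = value;
}

/* Load, then post-increment. */
int pop(void)
{
    return *sp++;
}
```

Pushing 1 and then 2 and popping twice returns 2 and then 1; on a machine with those two addressing modes each function body can map onto a single move instruction.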
In C, the compiler can generally optimize them to be the same if the result is unused.
However, in C++ if using other types that provide their own ++ operators, the prefix version is likely to be faster than the postfix version. So, if you don't need the postfix semantics, it is better to use the prefix operator.
I can think of a situation where postfix is slower than prefix increment:
Imagine a processor where register A is used as the accumulator and is the only register available in many instructions (some small microcontrollers are actually like this).
Now imagine the following programs and their translation into a hypothetical assembly:
Prefix increment:
a = ++b + c;
; increment b
LD A, [&b]
INC A
ST A, [&b]
; add with c
ADD A, [&c]
; store in a
ST A, [&a]
Postfix increment:
a = b++ + c;
; load b
LD A, [&b]
; add with c
ADD A, [&c]
; store in a
ST A, [&a]
; increment b
LD A, [&b]
INC A
ST A, [&b]
Note how the value of b was forced to be reloaded. With prefix increment, the compiler can just increment the value and go ahead with using it, possibly avoiding a reload since the desired value is already in the register after the increment. However, with postfix increment, the compiler has to deal with two values, the old one and the incremented one, which as I show above results in one more memory access.
Of course, if the value of the increment is not used, such as a single i++; statement, the compiler can (and does) simply generate an increment instruction regardless of postfix or prefix usage.
As a side note, I'd like to mention that an expression in which there is a b++ cannot simply be converted to one with ++b without any additional effort (for example by adding a - 1). So comparing the two if they are part of some expression is not really valid. Often, where you use b++ inside an expression you cannot use ++b, so even if ++b were potentially more efficient, it would simply be wrong. Exception is of course if the expression is begging for it (for example a = b++ + 1; which can be changed to a = ++b;).
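A quick sketch of that side note, with hypothetical function names: a = b++ + 1 and a = ++b agree, while a = b++ + c and a = ++b + c differ by exactly one, so the two forms are not interchangeable inside a larger expression.

```c
/* a = b++ + 1: uses the old b, then adds 1. */
int postfix_plus_one(int *b)
{
    return (*b)++ + 1;
}

/* a = ++b: increments b and uses the new value. */
int prefix_only(int *b)
{
    return ++*b;
}

/* a = b++ + c: the sum is computed from the old b. */
int postfix_sum(int *b, int c)
{
    return (*b)++ + c;
}

/* a = ++b + c: the sum is computed from the incremented b. */
int prefix_sum(int *b, int c)
{
    return ++*b + c;
}
```

In all four cases b itself ends up one larger; only the value handed to a changes.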
I always prefer pre-increment, however ...
I wanted to point out that even in the case of calling the operator++ function, the compiler will be able to optimize away the temporary if the function gets inlined. Since the operator++ is usually short and often implemented in the header, it is likely to get inlined.
So, for practical purposes, there likely isn't much of a difference between the performance of the two forms. However, I always prefer pre-increment since it seems better to directly express what I'm trying to say, rather than relying on the optimizer to figure it out.
Also, giving the optimizer less to do likely means the compiler runs faster.
My C is a little rusty, so I apologize in advance. Speedwise, I can understand the results. But, I am confused as to how both files came out to the same MD5 hash. Maybe a for loop runs the same, but wouldn't the following 2 lines of code generate different assembly?
myArray[i++] = "hello";
vs
myArray[++i] = "hello";
The first one writes the value to the array, then increments i. The second increments i then writes to the array. I'm no assembly expert, but I just don't see how the same executable would be generated by these 2 different lines of code.
Just my two cents.
In the process of refactoring some code, I want to change a function like this
bool A::function() {
return this->a == this->b || this->c == this->d || this->e == this->f || this->g == this->h ;
}
to something like this
bool A::function(int a, int b, int c, int d, int e, int g) {
return a == b || c == d || e == this->f || g == this->h ;
}
This function is supposed to be called each time inside a main loop that processes at most 10M elements.
The people I'm working with are reluctant to use the second version because of the performance cost of passing 6 ints.
I'm pretty sure that this is negligible, considering that each iteration of the loop goes through a LOT of code, and it takes roughly 1 minute to process the 10M elements.
Is the cost of passing 6 ints by value all the time so high? If not, how can I make them change their mind?
edit:
About inlining: I told them that the penalty would be 0 if the function was inlined, but their answer was basically "we can't know for sure if it will be inlined", which I seem to recall is true (it's up to the compiler).
I suspect that you won't see any big difference between these two variants in reasonably optimised code. However, the proof of that would be to actually change the code and compare the timings. (All the more so if 10M entries are being processed in a minute: that's 6 microseconds per item, so around 30000-200000 instructions on a modern processor. Adding 6 argument passes won't budge it one way or the other, I'd say, unless this function is called many times in the loop, of course.)
And yes, if the function is inlined, the result would be identical code for the two alternatives. But as your colleagues say, you can't know for sure whether it is inlined; the only way to really determine that is to have a look at the generated machine code (-S, or use objdump or similar).
In terms of performance, I would suggest you profile your code, and see if there is a difference that matters. Passing ints around is usually very cheap and open to automatic optimization, so I doubt you would see a measurable performance hit.
Also worth pointing out that the two functions are different. The second doesn't necessarily use the member variables and the first does. If you're always comparing member variables, why pass them as parameters? Extra unnecessary parameters means more source code and a greater scope for bugs.
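For anyone who wants to inspect the generated code, here is a rough C rendering of the two variants from the question. The original is a C++ member function; struct A, the explicit self pointer, and the function names are hypothetical stand-ins for it:

```c
#include <stdbool.h>

/* Hypothetical C stand-in for class A; field names follow the question. */
struct A { int a, b, c, d, e, f, g, h; };

/* First variant: everything is read through the object pointer. */
static inline bool check_members(const struct A *self)
{
    return self->a == self->b || self->c == self->d ||
           self->e == self->f || self->g == self->h;
}

/* Second variant: six ints passed by value; f and h still come from
   the object. */
static inline bool check_params(const struct A *self,
                                int a, int b, int c, int d, int e, int g)
{
    return a == b || c == d || e == self->f || g == self->h;
}
```

Compiling a caller of both with optimization and -S shows whether the calls were inlined on your compiler, which settles the "we can't know for sure" objection for that particular build.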
Write the code and as Shane says, profile it, or I prefer to grab a few stack samples because you can see exactly what's going on.
If you find the program counter in the instructions that pass those int arguments, on more than one sample, then they are costing a significant fraction of time, and you should do something about it.
On the other hand, the samples might tell you something else is the main time-taker, and maybe you should fix that first.
Then the program will be faster, and if you do the whole process again, it might come back to your original question.
Today I browsed some source code (it was an example file explaining the use of a software framework) and discovered a lot of code like this:
int* array = new int[10]; // or malloc, who cares. Please, no language wars. This is applicable to both languages
for ( int* ptr = &(array[0]); ptr <= &(array[9]); ptr++ )
{
...
}
So basically, they've done "take the address of the object that lies at address array + x".
Normally I would say that this is plain stupidity, as writing array + 0 or array + 9 directly does the same. I would even always rewrite such code to a size_t for loop, but that's a matter of style.
But the overuse of this got me thinking: am I missing something blatantly obvious, or something subtly hidden in the dark corners of the language?
For anyone wanting to take a look at the original source code, with all its nasty gotos, mallocs and of course this pointer thing, feel free to look at it online.
Yeah, there's no good reason for the first one. This is exactly the same thing:
int *ptr = array;
I agree on the second also, may as well just write:
ptr < (array + 10)
Of course you could also just make it a for loop from 0-9 and set the temp pointer to point to the beginning of the array.
for(int i = 0, *ptr = array; i < 10; ++i, ++ptr)
/* ... */
That of course assumes that ptr is not being modified within the body of the loop.
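As a sketch, here are the three loop spellings side by side over a 10-element array; COUNT and the function names are hypothetical. All three visit the same elements, so they return the same sum:

```c
#include <stddef.h>

#define COUNT 10

/* Bounds written as addresses of the first and last elements. */
int sum_address_of_style(int *array)
{
    int total = 0;
    for (int *ptr = &(array[0]); ptr <= &(array[COUNT - 1]); ptr++)
        total += *ptr;
    return total;
}

/* Bounds written as the array pointer and a one-past-the-end pointer. */
int sum_pointer_style(int *array)
{
    int total = 0;
    for (int *ptr = array; ptr < array + COUNT; ptr++)
        total += *ptr;
    return total;
}

/* Plain index loop. */
int sum_index_style(const int *array)
{
    int total = 0;
    for (size_t i = 0; i < COUNT; ++i)
        total += array[i];
    return total;
}
```

Which one reads best is a matter of taste; none of them relies on anything beyond ordinary pointer arithmetic within the array and one element past its end.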
You're not missing anything, they do mean the same thing.
However, to try to shed some more light on this, I should say that I also write expressions like that from time to time, for added clarity.
I personally tend to think in terms of object-oriented programming, meaning that I prefer to refer to "the address of the nth element of the array", rather than "the nth offset from the beginning address of the array". Even though those two things are equivalent in C, when I'm writing the code, I have the former in mind - so I express that.
Perhaps that's the reasoning of the person who wrote this as well.
Edit: this is partially incorrect. Read the comments.
The problem with &(array[0]) is that it expands to &(*(array + 0)), which involves a dereference. Now, every compiler will obviously optimize this into the same thing as array + 0, but as far as the language is concerned the dereference can cause UB in places where array + 0 would not.
I think the reason why they wrote it this way was that
&(array[0])
and
&(array[9])
just look similar. Another way would be to write it
array + 0
and
array + 9
respectively. As you already mentioned, they essentially do the same (at least most compilers treat it as the same, I hope).
You could interpret the two types of expression differently: the first one can be read as "the address of element 0 / 9". The second one can be read as "array pointer with an element offset of 0 / 9". The first one sounds more high-level, the second more low-level. Most people tend to use the second form, though.
Now since array + 0 of course is the same as array, you could just write array. I think the point here is that the begin and end of the loop look "analogous" to each other. A question of personal taste.
According to classical mathematics:
Array[n]
refers to the nth element in the array.
To "take the address of" the nth element, the & or address of operator is applied:
&Array[n]
To clear up any assumed ambiguities, parentheses are added:
&(Array[n])
To a reader, reading from left to right, this expression has the meaning:
Return the address of the element at position 'n'
The insurance may have developed as a protection against old faulty compilers.
Some people consider it more readable than:
Array + n
Sorry, but I am old school and prefer using the '&' version, with or without parentheses. I'd rather spend my time making code easier to read than worrying about which version takes longer to compile or which version is more efficient.
A clearly commented section of code has a higher Return On Investment than a section of code that is micro-optimized for efficiency or uses parts of the language that are unfamiliar to non-language-lawyers.
Variable x is int with possible values: -1, 0, 1, 2, 3.
Which expression will be faster (in CPU ticks):
1. (x < 0)
2. (x == -1)
Language: C/C++, but I suppose the answer will be the same for all other languages.
P.S. I personally think that the answer is (x < 0).
More widely for gurus: what if x ranges from -1 to 2^30?
That depends entirely on the ISA you're compiling for, and the quality of your compiler's optimizer. Don't optimize prematurely: profile first to find your bottlenecks.
That said, in x86, you'll find that both are equally fast in most cases. In both cases, you'll have a comparison (cmp) and a conditional jump (jCC) instruction. However, for (x < 0), there may be some instances where the compiler can elide the cmp instruction, speeding up your code by one whole cycle.
Specifically, if the value x is stored in a register and was recently the result of an arithmetic operation (such as add, or sub, but there are many more possibilities) that sets the sign flag SF in the EFLAGS register, then there's no need for the cmp instruction, and the compiler can emit just a js instruction. There's no simple jCC instruction that jumps when the input was -1.
Try it and see! Do a million, or better, a billion of each and time them. I bet there is no statistical significance in your results, but who knows -- maybe on your platform and compiler, you might find a result.
This is a great experiment to convince yourself that premature optimization is probably not worth your time--and may well be "the root of all evil--at least in programming".
Both operations can be done in a single CPU step, so they should be the same performance wise.
x < 0 will be faster. If nothing else, it prevents fetching the constant -1 as an operand.
Most architectures have special instructions for comparing against zero, so that will help too.
It could be dependent on what operations precede or succeed the comparison. For example, if you assign a value to x just before doing the comparison, then it might be faster to check the sign flag than to compare to a specific value. Or the CPU's branch-prediction performance could be affected by which comparison you choose.
But, as others have said, this is dependent upon CPU architecture, memory architecture, compiler, and a lot of other things, so there is no general answer.
The important consideration, anyway, is which actually directs your program flow accurately, and which just happens to produce the same result?
If x is actually an index or a value in an enum, then will -1 always be what you want, or will any negative value work? Right now, -1 is the only negative, but that could change.
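A small sketch of why the choice is about intent first: the two tests agree on the values the question lists (-1 to 3) but part ways as soon as another negative value shows up. The function names are hypothetical:

```c
#include <stdbool.h>

bool is_negative(int x)
{
    return x < 0;
}

bool is_minus_one(int x)
{
    return x == -1;
}

/* Returns true when both tests give the same answer for every x in
   the closed range [lo, hi]. */
bool tests_agree_on(int lo, int hi)
{
    for (int x = lo; x <= hi; ++x)
        if (is_negative(x) != is_minus_one(x))
            return false;
    return true;
}
```

tests_agree_on(-1, 3) holds, while tests_agree_on(-2, 3) does not, because -2 is negative but is not -1.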
You can't even answer this question out of context. If you try for a trivial microbenchmark, it's entirely possible that the optimizer will waft your code into the ether:
// Get time
int x = -1;
for (int i = 0; i < ONE_JILLION; i++) {
int dummy = (x < 0); // Poof! Dummy is ignored.
}
// Compute time difference - in the presence of good optimization
// expect this time difference to be close to useless.
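One common way to keep such a loop from being removed, sketched here under the assumption that a volatile read per iteration is acceptable, is to read x through a volatile pointer and accumulate the results into a value that is actually returned. The function names are hypothetical; clock() and CLOCKS_PER_SEC come from the standard <time.h>:

```c
#include <time.h>

/* Counts how often (x < 0) holds over n iterations. Reading x through a
   volatile pointer forces a fresh load each time, so the comparison
   cannot be hoisted out or discarded like a plain dummy assignment. */
long count_negative(const volatile int *x, long n)
{
    long hits = 0;
    for (long i = 0; i < n; i++)
        hits += (*x < 0);
    return hits;
}

/* Returns the CPU time in seconds spent in count_negative; *hits_out
   receives the count so the caller can use (and thereby keep) it. */
double time_count_negative(const volatile int *x, long n, long *hits_out)
{
    clock_t start = clock();
    *hits_out = count_negative(x, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The volatile load has a cost of its own, so this measures "load plus compare" rather than the comparison in isolation; it is still no substitute for timing the real program.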
Same, both operations are usually done in 1 clock.
It depends on the architecture, but the x == -1 is more error-prone. x < 0 is the way to go.
As others have said there probably isn't any difference. Comparisons are such fundamental operations in a CPU that chip designers want to make them as fast as possible.
But there is something else you could consider. Analyze the frequencies of each value and have the comparisons in that order. This could save you quite a few cycles. Of course you still need to compile your code to asm to verify this.
I'm sure you're confident this is a real time-taker.
I would suppose asking the machine would give a more reliable answer than any of us could give.
I've found, even in code like you're talking about, my supposition that I knew where the time was going was not quite correct. For example, if this is in an inner loop, if there is any sort of function call, even an invisible one inserted by the compiler, the cost of that call will dominate by far.
Nikolay, you write:
It's actually bottleneck operator in
the high-load program. Performance in
this 1-2 strings is much more valuable
than readability...
All bottlenecks are usually this
small, even in perfect design with
perfect algorithms (though there is no
such). I do high-load DNA processing
and know my field and my algorithms
quite well
If so, why not do the following:
1. get a timer, set it to 0;
2. compile your high-load program with (x < 0);
3. start your program and the timer;
4. on program end look at the timer and remember result1.
5. same as 1;
6. compile your high-load program with (x == -1);
7. same as 3;
8. on program end look at the timer and remember result2.
9. compare result1 and result2.
You'll get the Answer.