Is comparing to zero faster than comparing to any other number? - c++

Is
if(!test)
faster than
if(test==-1)
I can produce assembly but there is too much assembly produced and I can never locate the particulars I'm after. I was hoping someone just knows the answer. I would guess they are the same unless most CPU architectures have some sort of "compare to zero" short cut.
thanks for any help.

Typically, yes. In typical processors testing against zero, or testing sign (negative/positive) are simple condition code checks. This means that instructions can be re-ordered to omit a test instruction. In pseudo assembly, consider this:
Loop:
LOADCC r1, test // load test into register 1, and set condition codes
BCZS Loop // If zero was set, go to Loop
Now consider testing against 1:
Loop:
LOAD r1, test // load test into register 1
SUBT r1, 1 // Subtract Test instruction, with destination suppressed
BCNE Loop // If not equal to 1, go to Loop
Now for the usual pre-optimization disclaimer: Is your program too slow? Don't optimize, profile it.

It depends.
Of course it's going to depend, not all architectures are equal, not all µarchs are equal, even compilers aren't equal but I'll assume they compile this in a reasonable way.
Let's say the platform is 32bit x86, the assembly might look something like
test eax, eax
jnz skip
Vs:
cmp eax, -1
jnz skip
So what's the difference? Not much. The first snippet takes a byte less. The second snippet might be implemented with an inc to make it shorter, but that would make it destructive so it doesn't always apply, and anyway, it's probably slower (but again it depends).
Take any modern Intel CPU. They do "macro fusion", which means they take a comparison and a branch (subject to some limitations), and fuse them. The comparison becomes essentially free in most cases. The same goes for test. Not inc though, but the inc trick only really applied in the first place because we just happened to compare to -1.
Apart from any "weird effects" (due to changed alignment and whatnot), there should be absolutely no difference on that platform. Not even a small difference.
Even if you got lucky and got the test for free as a result of a previous arithmetic instruction, it still wouldn't be any better.
It'll be different on other platforms, of course.

On x86 there won't be any noticeably difference, unless you are doing some math at the same time (e.g. while(--x) the result of --x will automatically set the condition code, where while(x) ... will necessitate some sort of test on the value in x before we know if it's zero or not.
Many other processors do have a "automatic updates of the condition codes on LOAD or MOVE instructions", which means that checking for "postive", "negative" and "zero" is "free" with every movement of data. Of course, you pay for that by not being able to backward propagate the compare instruction from the branch instruction, so if you have a comparison, the very next instruction MUST be a conditional branch - where an extra instruction between these would possibly help with alleviating any delay in the "result" from such an instruction.
In general, these sort of micro-optimisations are best left to compilers, rather than the user - the compiler will quite often convert for(i = 0; i < 1000; i++) into for(i = 1000-1; i >= 0; i--) if it thinks that makes sense [and the order of the loop isn't important in the compiler's view]. Trying to be clever with these sort of things tend to make the code unreadable, and performance can suffer badly on other systems (because when you start tweaking "natural" code to "unnatural", the compiler tends to think that you really meant what you wrote, and not optimise it the same way as the "natural" version).

Related

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to concider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instruction. This means I have to choose an operation to wouldn't be performed only once due to compiler optimization (we can't use -o0 flag as school staff does not).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP
1) This sounds (to me) like an impossible task, if optimizations are (or might be) enabled. You can never be sure on what the compiler will do during optimizations. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.

Branch on ?: operator?

For a typical modern compiler on modern hardware, will the ? : operator result in a branch that affects the instruction pipeline?
In other words which is faster, calling both cases to avoid a possible branch:
bool testVar = someValue(); // Used later.
purge(white);
purge(black);
or picking the one actually needed to be purged and only doing it with an operator ?::
bool testVar = someValue();
purge(testVar ? white : black);
I realize you have no idea how long purge() will take, but I'm just asking a general question here about whether I would ever want to call purge() twice to avoid a possible branch in the code.
I realize this is a very tiny optimization and may make no real difference, but would still like to know. I expect the ?: does not result in branching, but want to make sure my understanding is correct.
Depends on the platform. Specifically, it depends on the size of jump prediction table of the CPU and whether the CPU allows conditional operations (like on ARM).
CPUs with conditional operations will strongly favor the second case. CPUs with bigger jump prediction tables will favor the first case.
The real answer (like with any other performance questions): measure and compare. Sometimes the rest of the code throws a curve ball and it's usually impossible to predict effects of some changes.
The CMOV (Conditional MOVe) instruction has been part of the x86 instruction set since the Pentium Pro. It is rarely automatically generated by GCC because of compiler options commonly used and restrictions placed by the C language. A SETCC/CMOV sequence can be inserted by inline assembly in your C program. This should only be done is cases where the conditional variable is a randomly oscillating value in the inner loop (millions of executions) of a program. In non-oscillating cases and in cases of simple patterns of oscillation, modern processors can predict branches with a very high degree of accuracy. In 2007, Linus Torvalds suggested here to avoid use of CMOV in most situations.
Intel describes the conditional move in the Intel(R) Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual:
The CMOVcc instructions check the state of one or more of the status
flags in the EFLAGS register (CF, OF, PF, SF, and ZF) and perform a
move operation if the flags are in a specified state (or condition). A
condition code (cc) is associated with each instruction to indicate
the condition being tested for. If the condition is not satisfied, a
move is not performed and execution continues with the instruction
following the CMOVcc instruction.
These instructions can move a 16- or 32-bit value from memory to a
general-purpose register or from one general-purpose register to
another. Conditional moves of 8-bit register operands are not
supported.
The conditions for each CMOVcc mnemonic is given in the description
column of the above table. The terms “less” and “greater” are used for
comparisons of signed integers and the terms “above” and “below” are
used for unsigned integers.
Because a particular state of the status flags can sometimes be
interpreted in two ways, two mnemonics are defined for some opcodes.
For example, the CMOVA (conditional move if above) instruction and the
CMOVNBE (conditional move if not below or equal) instruction are
alternate mnemonics for the opcode 0F 47H.
I can't imagine the first method would ever be faster.
With the first method you may avoid a branch, but you replace it with a function call, which would usually involve a branch plus a lot more (unless it was inlined). Even if inlined, unless the functionality inside the purge() function was absolutely trivial it would almost certainly be slower.
Calling a function is at least as expensive as doing a logic test + jump (and yes, the ? : ternary operator would require a jump).
in the first case purge is called twice. In the second case purge is called once
Its hard to answer the question about branching because its so dependent on compilers and instruction set. For example on an ARM (which has conditional instruction execution) it might not branch. ON an x86 it almost certainly will

Could this alternative way to loop be more effcient?

I was bored one rainy afternoon and came up with this:
int ia_array[5][5][5]; //interger array called array
{
int i = 0, j = 0, k = 0;//counters
while( i < 5 )//loop conditions
{
ia_array[i][j][k] = 0;//do something
__asm inc k;//++k;
if( k > 4)
{
__asm inc j; //++j;
__asm mov k,0;///k = 0;
}
if( j > 4)
{
__asm inc i; //++i;
__asm mov j,0;//j = 0;
}
}//end of while
}//i,j,k fall out of scope
its functionally equivalent to three nested for loops. However in a for loop you cannot use __asm statements. Also you have the option to not put the counters in a scope so you can reuse them for other loops. I have looked at the disassembly for both and my alternative has 15 opcodes and the nested for loops have 24. Therefore is it potentially faster? suppose I'm really asking is __asm inc i; faster then ++i;?
note: i don't intent to use this code in any projects, just out of curiosity. thanks for your time.
First off, your compiler will likely store the values of i, j and k in registers.
It's more efficient to do for (i = 4; i <=0; i--) than for(i = 0; i < 5; i++) as the cpu can determine if the result of the last operation it executed was zero for free - it doesn't have to explicitly compare to 4 (see the cmovz instruction).
It's the not the case for x86 that having to execute less instruction will lead to faster code. There are all sorts of issues to do with instruction pipelining that quickly get too much for a programmer to write by hand. Leave it to the compiler, they're sufficiently efficient these days (though definitely not optimal... but who wants to wait hours for their code to compile).
You can check it out yourself by running your function a few hundred thousand times with each implementation and check which is faster. Check if you can write asm instructions in for loops with
__asm {
inc j;
mov k, 0;
}
(it's been a while since I did this)
P.S. Have fun experimenting with asm, it can be very interesting and rewarding!
No, it won't be even remotely faster. Infact, it could quite easily be slower. Your compiler's optimizer is almost certainly more effective at this than you are.
This is going to be very compiler and compiler switch specific, but your code will have three tests per loop iteration where a traditional nested loop would only have one per inner-most loop iteration, so I think your approach would tend to be slower in general.
Several things:
You can't judge the speed of assembly code based on the number of opcodes in the output. Compilers can unroll loops to eliminate branches, and many modern compilers will attempt to vectorize a loop like the one above. The former could have more opcodes than naive code and be faster, and the latter could have fewer and be faster.
By putting __asm statements in your code, you're probably precluding any optimizations the compiler could do on the loop. So if you compiled this with something really fast like, say, the Intel compilers, then you will likely get worse performance with your code than with the compiler. This is especially true for something as simple as your code here, where the array sizes are known statically and the loop bounds are constant.
If you really want to get a sense of what compilers can/can't do, grab a book or take a course on optimizing compilers and vectorization. There are tons of different optimizations and understanding the performance of even a simple piece of code like this on a particular architecture can be subtle.
There are plenty of kernels and number crunching codes where compilers still can't do better than knowledgable humans, but without a lot of experience with architecture details you're not going to do much better than icc -fast or xlC -O5.
While it certainly is possible to beat a compiler at optimization, you're not going to do it this way. The bits you've written in assembly language are pretty obvious, mechanical types of translations that any half-way decent compiler (or even a pretty lousy one) can do easily.
If you want to beat the compiler, you need to go a lot further, such as rearranging instructions to allow more to execute in parallel (decidedly non-trivial) or finding a better sequence of instructions than the compiler can.
In this case, for example, you might at least stand a chance by noting that iarray[5][5][5] can (from an assembly language viewpoint) be treated as a single, flat array of 5*5*5 = 125 elements, and encode most of what's essentially a memset into a single instruction:
mov ecx, 125 // 125 elements
xor eax, eax // set them to zero
mov di, offset ia_array // where we're going to store them
rep stosd // and fill that memory.
Realistically, however, this probably isn't going to be a major (or probably even minor) improvement over what the compiler is likely to generate. It's more likely close to the minimum necessary to (at least nearly) keep up.
The next step would be to consider using non-temporal stores instead of a simple stosd. This won't actually speed up this loop (much, anyway), but it might gain some speed overall by avoiding this store polluting the cache if it's possible that other code already in the cache is more important immediately. You could also use some of the other SSE instructions to gain a little speed -- but even at best, you can't expect much better than a couple of percent out of this. The bottom line is that for zeroing some memory, the speed is limited primarily by the bus speed, not the instructions you use, so nothing you do is likely to help much.

Performance of comparisons in C++ ( foo >= 0 vs. foo != 0 )

I've been working on a piece of code recently where performance is very important, and essentially I have the following situation:
int len = some_very_big_number;
int counter = some_rather_small_number;
for( int i = len; i >= 0; --i ){
while( counter > 0 && costly other stuff here ){
/* do stuff */
--counter;
}
/* do more stuff */
}
So here I have a loop that runs very often and for a certain number of runs the while block will be executed as well until the variable counter is reduced to zero and then the while loop will not be called because the first expression will be false.
The question is now, if there is a difference in performance between using
counter > 0 and counter != 0?
I suspect there would be, does anyone know specifics about this.
To measure is to know.
Do you think that what will solve your problem! :D
if(x >= 0)
00CA1011 cmp dword ptr [esp],0
00CA1015 jl main+2Ch (0CA102Ch) <----
...
if(x != 0)
00CA1026 cmp dword ptr [esp],0
00CA102A je main+3Bh (0CA103Bh) <----
In programming, the following statement is the sign designating the road to Hell:
I've been working on a piece of code recently where performance is very important
Write your code in the cleanest, most easy to understand way. Period.
Once that is done, you can measure its runtime. If it takes too long, measure the bottlenecks, and speed up the biggest ones. Keep doing that until it is fast enough.
The list of projects that failed or suffered catastrophic loss due to a misguided emphasis on blind optimization is large and tragic. Don't join them.
I think you're spending time optimizing the wrong thing. "costly other stuff here", "do stuff" and "do more stuff" are more important to look at. That is where you'll make the big performance improvements I bet.
There will be a huge difference if the counter starts with a negative number. Otherwise, on every platform I'm familiar with, there won't be a difference.
Is there a difference between counter > 0 and counter != 0? It depends on the platform.
A very common type of CPU are those from Intel we have in our PC's. Both comparisons will map to a single instruction on that CPU and I assume they will execute at the same speed. However, to be certain you will have to perform your own benchmark.
As Jim said, when in doubt see for yourself :
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
using namespace boost::posix_time;
using namespace std;
void main()
{
ptime Before = microsec_clock::universal_time(); // UTC NOW
// do stuff here
ptime After = microsec_clock::universal_time(); // UTC NOW
time_duration delta_t = After - Before; // How much time has passed?
cout << delta_t.total_seconds() << endl; // how much seconds total?
cout << delta_t.fractional_seconds() << endl; // how much microseconds total?
}
Here's a pretty nifty way of measuring time. Hope that helps.
OK, you can measure this, sure. However, these sorts of comparisons are so fast that you are probably going to see more variation based on processor swapping and scheduling then on this single line of code.
This smells of unnecessary, and premature, optimization. Right your program, optimize what you see. If you need more, profile, and then go from there.
I would add that the overwhelming performance aspects of this code on modern cpus will be dominated not by the comparison instruction but whether the comparison is well predicted since any mis-predict will waste many more cycles than any integral operation.
As such loop unrolling will probably be the biggest winner but measure, measure, measure.
Thinking that the type of comparison is going to make a difference, without knowing it, is the definition of guessing.
Don't guess.
In general, they should be equivalent (both are usually implemented in single-cycle instructions/micro-ops). Your compiler may do some strange special-case optimization that is difficult to reason about from the source level, which may make either one slightly faster. Also, equality testing is more energy-efficient than inequality testing (>), though the system-level effect is so small as to not merit discussion.
There may be no difference. You could try examining the assembly output for each.
That being said, the only way to tell if any difference is significant is to try it both ways and measure. I'd bet that the change makes no difference whatsoever with optimizations on.
Assuming you are developing for the x86 architecture, when you look at the assembly output it will come down to jns vs jne. jns will check the sign flag and jne will check the zero flag. Both operations, should as far as I know, be equally costly.
Clearly the solution is to use the correct data type.
Make counter an unsigned int. Then it can't be less than zero. Your compiler will obviously know this and be forced to choose the optimal solution.
Or you could just measure it.
You could also think about how it would be implemented...(here we go on a tangent)...
less than zero: the sign bit would be set, so need to check 1 bit
equal to zero : the whole value would be zero, so need to check all the bits
Of course, computers are funny things, and it may take longer to check a single bit than the whole value (however many bytes it is on your platform).
You could just measure it...
And you could find out that one it more optimal than another (under the conditions you measured it). But your program will still run like a dog because you spent all your time optimising the wrong part of your code.
The best solution is to use what many large software companies do - blame the hardware for not runnnig fast enough and encourage your customer to upgrade their equipment (which is clearly inferior since your product doesn't run fast enough).
< /rant>
I stumbled across this question just now, 3 years after it is asked, so I am not sure how useful the answer will still be... Still, I am surprised not to see clearly stated that answering your question requires to know two and only two things:
which processor you target
which compiler you work with
To the first point, each processor has different instructions for tests. On one given processor, two similar comparisons may turn up to take a different number of cycles. For example, you may have a 1-cycle instruction to do a gt (>), eq (==), or a le (<=), but no 1-cycle instruction for other comparisons like a ge (>=). Following a test, you may decide to execute conditional instructions, or, more often, as in your code example, take a jump. There again, cycles spent in jumps take a variable number of cycles on most high-end processors, depending whether the conditional jump is taken or not taken, predicted or not predicted. When you write code in assembly and your code is time critical, you can actually take quite a bit of time to figure out how to best arrange your code to minimize overall the cycle count and may end up in a solution that may have to be optimized based on the number of time a given comparison returns a true or false.
Which leads me to the second point: compilers, like human coders, try to arrange the code to take into account the instructions available and their latencies. Their job is harder because some assumptions an assembly code would know like "counter is small" is hard (not impossible) to know. For trivial cases like a loop counter, most modern compilers can at least recognize the counter will always be positive and that a != will be the same as a > and thus generate the best code accordingly. But that, as many mentioned in the posts, you will only know if you either run measurements, or inspect your assembly code and convince yourself this is the best you could do in assembly. And when you upgrade to a new compiler, you may then get a different answer.

What is faster (x < 0) or (x == -1)?

Variable x is int with possible values: -1, 0, 1, 2, 3.
Which expression will be faster (in CPU ticks):
1. (x < 0)
2. (x == -1)
Language: C/C++, but I suppose all other languages will have the same.
P.S. I personally think that answer is (x < 0).
More widely for gurus: what if x from -1 to 2^30?
That depends entirely on the ISA you're compiling for, and the quality of your compiler's optimizer. Don't optimize prematurely: profile first to find your bottlenecks.
That said, in x86, you'll find that both are equally fast in most cases. In both cases, you'll have a comparison (cmp) and a conditional jump (jCC) instructions. However, for (x < 0), there may be some instances where the compiler can elide the cmp instruction, speeding up your code by one whole cycle.
Specifically, if the value x is stored in a register and was recently the result of an arithmetic operation (such as add, or sub, but there are many more possibilities) that sets the sign flag SF in the EFLAGS register, then there's no need for the cmp instruction, and the compiler can emit just a js instruction. There's no simple jCC instruction that jumps when the input was -1.
Try it and see! Do a million, or better, a billion of each and time them. I bet there is no statistical significance in your results, but who knows -- maybe on your platform and compiler, you might find a result.
This is a great experiment to convince yourself that premature optimization is probably not worth your time--and may well be "the root of all evil--at least in programming".
Both operations can be done in a single CPU step, so they should be the same performance wise.
x < 0 will be faster. If nothing else, it prevents fetching the constant -1 as an operand.
Most architectures have special instructions for comparing against zero, so that will help too.
It could be dependent on what operations precede or succeed the comparison. For example, if you assign a value to x just before doing the comparison, then it might be faster to check the sign flag than to compare to a specific value. Or the CPU's branch-prediction performance could be affected by which comparison you choose.
But, as others have said, this is dependent upon CPU architecture, memory architecture, compiler, and a lot of other things, so there is no general answer.
The important consideration, anyway, is which actually directs your program flow accurately, and which just happens to produce the same result?
If x is actually and index or a value in an enum, then will -1 always be what you want, or will any negative value work? Right now, -1 is the only negative, but that could change.
You can't even answer this question out of context. If you try for a trivial microbenchmark, it's entirely possible that the optimizer will waft your code into the ether:
// Get time
int x = -1;
for (int i = 0; i < ONE_JILLION; i++) {
int dummy = (x < 0); // Poof! Dummy is ignored.
}
// Compute time difference - in the presence of good optimization
// expect this time difference to be close to useless.
Same, both operations are usually done in 1 clock.
It depends on the architecture, but the x == -1 is more error-prone. x < 0 is the way to go.
As others have said there probably isn't any difference. Comparisons are such fundamental operations in a CPU that chip designers want to make them as fast as possible.
But there is something else you could consider. Analyze the frequencies of each value and have the comparisons in that order. This could save you quite a few cycles. Of course you still need to compile your code to asm to verify this.
I'm sure you're confident this is a real time-taker.
I would suppose asking the machine would give a more reliable answer than any of us could give.
I've found, even in code like you're talking about, my supposition that I knew where the time was going was not quite correct. For example, if this is in an inner loop, if there is any sort of function call, even an invisible one inserted by the compiler, the cost of that call will dominate by far.
Nikolay, you write:
It's actually bottleneck operator in
the high-load program. Performance in
this 1-2 strings is much more valuable
than readability...
All bottlenecks are usually this
small, even in perfect design with
perfect algorithms (though there is no
such). I do high-load DNA processing
and know my field and my algorithms
quite well
If so, why not to do next:
get timer, set it to 0;
compile your high-load program with (x < 0);
start your program and timer;
on program end look at the timer and remember result1.
same as 1;
compile your high-load program with (x == -1);
same as 3;
on program end look at the timer and remember result2.
compare result1 and result2.
You'll get the Answer.