SPOJ TESTINT time limit exceeded - C++

The code basically adds two numbers. I was wondering whether it can be optimised further to reduce the execution time, because the online judge for the SPOJ TESTINT problem tells me "time limit exceeded".
Here's my code:
#include <cstdio>
int main()
{
    int a, b;
    scanf("%d\n%d", &a, &b);
    printf("%d", a + b);
    return 0;
}

From the problem page:
both not greater than 200
That's almost certainly a clue. But you shouldn't care.
The real answer to this question is to stop playing with these utterly stupid "online judges" that test nothing of any real value whatsoever. Maybe you could "optimise" this code to be faster, with some assembly or something, but why on earth would you want to? This is about as good as it gets for most real-world practical purposes. Anything else is just a waste of your time, unless you have an extremely narrow and niche use case.
On a more practical note, is it possible that you have misunderstood the requirements of the task and are trying to read too much input from STDIN? Your program would then block waiting for the rest. This program should not take anywhere near 0.2 s; for me it takes 0.009 s.

Related

Why is std::remquo so slow?

In some inner loop I have:
double x;
...
int i = x/h_;
double xx = x - i*h_;
Thinking that might be a better way to do this, I tried with std::remquo
double x;
...
int i;
double xx = std::remquo(x, h_, &i);
Suddenly, timings went from 2.6 seconds to 40 seconds (for many executions of the loop).
The timing test is difficult to replicate here, but I put the code online in case someone can help me understand what is going on.
naive version: https://godbolt.org/z/PnsfR8
remquo version: https://godbolt.org/z/NSMwyW
It looks like the main difference is that remquo is not inlined and the naive code is. If that is the case, what is the purpose of remquo if it is always going to be slower than the manual code? Is it a matter of accuracy (e.g. for large arguments), or of not relying on the (not well-defined) cast conversion?
I just realized that the remquo version is not even doing something equivalent to the first code. So I am using it wrong. In any case, I am surprised that remquo is so slow.
It's a rubbish function that was added to C99 to entice Fortran coders to switch to C. There's little reason to actually use it, so library vendors avoid wasting time optimizing it.
See also: What does the function remquo do and what can it be used for?.
BTW if you assumed that i gets the quotient stored in it, read the documentation more closely! (Or read the answers on the question linked in the previous paragraph).
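For what it's worth, here is a small sketch of my own (not the asker's benchmark) showing why the two snippets are not equivalent; h_ is just a local value here, and the values in the comments are what typical implementations produce.

#include <cmath>
#include <cstdio>

int main()
{
    double x = 10.75, h_ = 3.0;

    // Naive version: truncated quotient plus the matching remainder.
    int i = static_cast<int>(x / h_);   // 3
    double xx = x - i * h_;             // 1.75

    // std::remquo: the remainder is computed against the quotient rounded to
    // *nearest* (so it can be negative), and quo is only guaranteed to hold
    // the sign and a few low-order bits of that quotient, not the quotient.
    int quo = 0;
    double rem = std::remquo(x, h_, &quo);  // rem == -1.25; quo is typically 4 here

    std::printf("naive : i=%d  xx=%g\n", i, xx);
    std::printf("remquo: quo=%d rem=%g\n", quo, rem);
    return 0;
}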

Do variables affect performance?

I am using C++ with Qt 5.6. I have a simple console application written in two styles, as follows:
//First style
#include <QString>
#include <QTextStream>
QString x = "Hi!";
void func()
{
    QTextStream(stdout) << x;
}
int main()
{
    while (true)
    {
        func();
    }
}
//Second style
#include <QTextStream>
void func()
{
    QTextStream(stdout) << "Hi!";
}
int main()
{
    while (true)
    {
        func();
    }
}
Which style will stress the CPU more and therefore perform worse? There might not be a big difference here, but when this is scaled up, such as on a server where a connection is made every 2 seconds, the situation becomes similar to the loop above, and with many variables (not the same variable and data each time) a small saving in resource usage can translate into a noticeable performance improvement. So does using a variable give any performance improvement, even though I will use the variable only once in my function while the function itself is called repeatedly? Or does using a variable slow the program down, because it has to repeatedly look up where the value of "x" is stored in RAM and then retrieve the data?
Edit 1:
I will not be using the variable again in my code, and we can assume there are no compiler optimizations. @DrDonut, the answer in the link you gave also doesn't answer Is $array === (array) $array faster than is_array($array)?, i.e. whether it is a micro-optimization; likewise I am asking whether the second style is a micro-optimization or whether it harms performance.
Your example is bad because of possible compiler optimizations and because it is not clear whether you will use this variable in different places or whether it is just test code that will be thrown away.
But generally you are optimizing the wrong way. There is no sense in optimizing a single variable or a single function. You should not guess where your program will spend its time; first write the program so that it works and reads well.
After the program works, if you find its performance is bad, search for bottlenecks - the places where the program spends most of its time. They can be found with the help of profilers or a debugger, not by guessing.
When you have found them, optimize those critical places.
Read about premature optimization.
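If you do want a number rather than a guess, here is a hedged sketch (the function names and iteration count are mine) of the kind of measurement you could run. The report goes to stderr so it does not mix with the stream being timed; redirect stdout to a file or /dev/null so you are not timing your terminal.

#include <QString>
#include <QTextStream>
#include <chrono>
#include <cstdio>

// Hypothetical names for the two styles from the question.
static const QString x = QStringLiteral("Hi!");

static void funcVariable(QTextStream& out) { out << x; }      // first style
static void funcLiteral (QTextStream& out) { out << "Hi!"; }  // second style

template <typename F>
static long long microsecondsFor(F f, int iterations)
{
    QTextStream out(stdout);
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f(out);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    const int n = 100000;
    std::fprintf(stderr, "variable: %lld us\n", microsecondsFor(funcVariable, n));
    std::fprintf(stderr, "literal : %lld us\n", microsecondsFor(funcLiteral,  n));
    return 0;
}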

Visual Studio: how to make a C++ console application use more CPU power?

So I am running a console project, but while the code is running I see in Task Manager that only 5% (2.8 GHz) of the CPU is being used. Of course, I am not exactly sure how the CPU distributes processing power in Windows to begin with, but for future reference I would like to know: if I had performance-demanding code and needed the answer faster, how would I do that?
here is the code if you would like to know:
#include "stdafx.h"
#include <iostream>
#include <string.h>
using namespace std;
void swap(char *x, char *y)
{
char temp;
temp = *x;
*x = *y;
*y = temp;
}
void permute(char *a, int l, int r)
{
int i;
if (l == r)
cout << a << endl;
else
{
for (i = l; i <= r; i++)
{
swap((a + l), (a + i));
permute(a, l + 1, r);
swap((a + l), (a + i));
}
}
}
int main()
{
char Short[] = "ABCD";
int n1 = strlen(Short);
char Long[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int n2 = strlen(Long);
while(true)
{
cout << "Would you like to see the permutions of only a) ABCD or b) the whole alphabet?!\n(please enter a or b): ";
char input;
cin >> input;
if (input == 'a')
{
cout << "The permutions of ABCD:\n";
permute(Short, 0, n1 - 1);
cout << "-----------------------------------";
}
else if (input == 'b')
{
cout << "The permutions of Alphabet:\n";
permute(Long, 0, n2 - 1);
cout << "-----------------------------------";
}
else
{
cout << "ERROR! : Enter either a or b.\n";
}
}
}
I found the code in a blog showing the permutations of "ABCD" as part of an assignment, but I also used it for the entire alphabet, and I wanted to know whether, for that use, there is a way to make the code use more CPU (it's taking much longer than I expected).
Learning to optimize code efficiently is a major challenge even for experienced coders, and there are volumes of books, articles, and presentations on the topic. As such, a complete treatment is well out of scope for a Stack Overflow question.
That said, here are a few principles:
Focus initially on the algorithm. You can write a messy bubble sort or an efficient one, but in most 'real world' cases quicksort will beat either handily. This is arguably the primary reason the field of computer science exists: the study and selection of algorithms and their theoretical performance.
Related to this, make sure you are comparing your implementation against a 'stock' algorithm when possible. For example, you should see how your implementation performs compared to a standard-library facility such as std::random_shuffle or std::next_permutation from the <algorithm> header (see the sketch after these points for one way to set up such a comparison).
Optimize the compiler settings first. Debug builds are never going to be fast, and they aren't supposed to be. Using inline can help, but inlining only happens if the compiler is actually performing inline optimizations. For Visual C++, there are a number of different optimization settings you can try out, but remember that there are tradeoffs, so /Ox (maximum optimization) may not always be the right choice, which is why most project templates default to /O2 (maximize speed). In some cases, /O1 (minimize space) is actually better.
Always measure performance before and after optimization. Modern out-of-order CPUs are sophisticated systems, and they don't always do what you think they are doing. In many cases, what is a textbook optimization in code actually performs worse than the original code due to various pipelining and microarchitecture effects. The only way to know for sure is to use a good profiler, have solid test cases, and measure the impact of any optimization work. If it's slower on average than before, then revert to the 'unoptimized' version and try something else.
Focus optimization on the hotspots. This is the so-called '80/20' rule. In many applications the vast majority of the code is run rarely, so only a few areas of your application are actually spending enough time running to be worth optimizing.
As a corollary to this rule, having all of your code using extremely inefficient anti-patterns can really hurt the baseline performance of your entire application. For this reason, it's worth knowing how to write good code generally. The point of the 80/20 rule is to spend your limited time optimizing on the areas that will have the most impact rather than what you as the programmer assume matters.
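For instance, here is a hedged sketch of a 'stock' baseline for this particular task: the standard library can already enumerate permutations via std::next_permutation (declared in <algorithm>), which gives you something to time the hand-written recursive version against.

#include <algorithm>
#include <iostream>
#include <string>

int main()
{
    std::string s = "ABCD";
    std::sort(s.begin(), s.end());   // next_permutation needs sorted input to enumerate every order
    do {
        std::cout << s << '\n';
    } while (std::next_permutation(s.begin(), s.end()));
    return 0;
}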
All that said, in your case none of this matters. The vast majority of the CPU time is spent just creating your process and handling the serialized input and output. When dealing with an n of 4 or 26, it doesn't matter how bad or good your algorithm is. In other words, it is highly unlikely permute is your program's 'hotspot' unless you are working with tens of thousands of millions of characters.
NOTE: Yes I am oversimplifying the topic, but I'm concerned that
without this basic understanding, the more advanced topics will
actually lead to some disastrous program designs.
Maybe I'm missing something, but there also seems to be a misunderstanding regarding the link between CPU and efficiency in your mind.
Your program has N instructions, and the CPU will process those N instructions at relatively the same speed (3.56 GHz is about 3.56 billion instructions per second). That's the same (more or less), whether you're getting "5%" or "25%" use of the CPU from a single program. (I'll explain that percentage in a moment.)
The only way to get "faster" in terms of processor usage is, as erip said, with parallel computing techniques, which in a nutshell employ multiple CPUs to accomplish the task.
If you think of it like an assembly line, your one worker can only process one widget at a time. If your batch of widgets takes up 5% of his time, that means that in order to process ALL of your widgets one by one, he uses 5% of his time, and the other 95% is not needed for that batch (and he'll probably use it for other batches that other people assigned to him).
He cannot process more than one widget at a time, so that's as fast as he'll get with your batch. You might be able to make things appear faster by having him alternate between two different types of widgets, instead of finishing all of batch A before starting on batch B, but it will still take the same amount of time in the end to process both batches.
MASSIVE EXCEPTION: If he's spending 100% of his time on someone else's batch of widgets, you're literally going to have to cool your heels. That's not something you can do a thing about.
However, if you add another worker to that assembly line, they can process twice (roughly) the widgets in the same amount of time, because you are processing two widgets at once. When we say you have a "quad core processor", that basically means that you have four workers available (literally 4 CPUs). Each one can only process a single instruction at once, but by assigning more than one to the batch of widgets, you get it done faster.
All of this said, one must keep in mind that those CPUs are doing a lot - they run the entire computer. You want to try and keep those percentages down as much as possible, so your program is fast and responsive on any supported computer. Not all of your users will have 3.46 GHz quad-core machines, after all.
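To make the "add another worker" idea concrete, here is a toy sketch (entirely mine, not from the question) that splits the same total work across however many hardware threads are available using std::thread:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main()
{
    const std::uint64_t total = 400000000;  // the "batch of widgets"
    const unsigned workers =
        std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::uint64_t> partial(workers, 0);  // one result slot per worker
    std::vector<std::thread> pool;

    for (unsigned w = 0; w < workers; ++w)
    {
        pool.emplace_back([&partial, w, workers, total]
        {
            // Each worker processes an interleaved slice of the range.
            std::uint64_t sum = 0;
            for (std::uint64_t i = w; i < total; i += workers)
                sum += i % 7;            // stand-in for real per-widget work
            partial[w] = sum;
        });
    }
    for (auto& t : pool) t.join();

    std::cout << std::accumulate(partial.begin(), partial.end(),
                                 std::uint64_t(0)) << '\n';
    return 0;
}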
Surely the reason this program is not using all available CPU bandwidth is that it emits the permutation results to the screen once for each permutation. This results in blocking I/O within the implementation of cout.
If you want 100% cpu use you'll want to separate computation from I/O. In this case you'd then need to either:
a) store the results for later output, or
b) communicate results across a thread boundary (which will itself have an efficiency cost, because of acquiring mutexes and synchronising cache memory), or
c) a combination of the above (batching results and communicating them across the thread boundary)
For a quick check, you could comment out all the cout calls and see how much CPU use you get (as mentioned, it will be close to 100% divided by the number of CPUs on your computer).
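As a hedged sketch of option (a), adapted from the code in the question: build the whole result in a string and write it once, so the permutation loop never blocks on cout. The word and buffer size here are my own choices.

#include <cstring>
#include <iostream>
#include <string>
#include <utility>

// Same recursion as in the question, but results are appended to a buffer
// instead of being written to cout one line at a time.
static void permute(char* a, int l, int r, std::string& out)
{
    if (l == r) { out += a; out += '\n'; return; }
    for (int i = l; i <= r; ++i)
    {
        std::swap(a[l], a[i]);
        permute(a, l + 1, r, out);
        std::swap(a[l], a[i]);   // restore order before the next branch
    }
}

int main()
{
    char word[] = "ABCDEFGH";    // 8! = 40320 lines; adjust as you like
    std::string out;
    out.reserve(1 << 20);        // avoid repeated reallocations
    permute(word, 0, static_cast<int>(std::strlen(word)) - 1, out);
    std::cout << out;            // a single blocking write at the end
    return 0;
}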

Best way to test code speed in C++ without profiler, or does it not make sense to try?

On SO, there are quite a few questions about performance profiling, but I don't seem to find the whole picture. There are quite a few issues involved and most Q & A ignore all but a few at a time, or don't justify their proposals.
What I'm wondering about: if I have two functions that do the same thing, and I'm curious about the difference in speed, does it make sense to test this without external tools, using timers, or will this compiled-in testing affect the results too much?
I ask this because, if it is sensible, then as a C++ programmer I want to know how it should best be done, since timers are much simpler than external tools. If it makes sense, let's proceed through all the possible pitfalls:
Consider this example. The following code shows 2 ways of doing the same thing:
#include <algorithm>
#include <ctime>
#include <iostream>

typedef unsigned char byte;

inline void swapBytes( byte* in, size_t n )
{
    for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
        in[lo] ^= in[hi]
      , in[hi] ^= in[lo]
      , in[lo] ^= in[hi] ;
}

int main()
{
    byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
    const int iterations = 100000000;

    clock_t begin = clock();
    for( int i=iterations; i!=0; --i )
        swapBytes( arr, 8 );
    clock_t middle = clock();
    for( int i=iterations; i!=0; --i )
        std::reverse( arr, arr+8 );
    clock_t end = clock();

    double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
    double secReve = (double) ( end-middle ) / CLOCKS_PER_SEC;
    std::cout << "swapBytes, for: " << iterations << " times takes: " << middle-begin
              << " clock ticks, which is: " << secSwap << "sec." << std::endl;
    std::cout << "std::reverse, for: " << iterations << " times takes: " << end-middle
              << " clock ticks, which is: " << secReve << "sec." << std::endl;
    std::cin.get();
    return 0;
}
// Output:
// Release:
// swapBytes, for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
// std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
// Debug:
// swapBytes, for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
// std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.
The issues:
Which timers should be used, and how do we get the CPU time actually consumed by the code in question?
What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If so, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function? I don't have the source code from the VC++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would this also be expected to run twice as fast as the custom function, and if so, why?
Contemplations:
It seems two high-precision timers are being proposed: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements. This page on the GNU C library seems to contradict that, but when I put a breakpoint in VC++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under GNU). Am I missing alternative counters for this, or do we need special libraries or classes for this? If not, is clock good enough in this example, or would there be a reason to use QueryPerformanceCounter?
What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I'd rather know from theory why than from testing.
Thanks for any directions.
update
Thanks to a hint from tojas, the swapBytes function now runs as fast as std::reverse. I had failed to realize that the temporary copy in the case of a byte needs only a register, and is thus very fast. Elegance can blind you.
inline void swapBytes( byte* in, size_t n )
{
    byte t;
    for( int i=0; i<7-i; ++i )
    {
        t = in[i];
        in[i] = in[7-i];
        in[7-i] = t;
    }
}
Thanks to a tip from ChrisW I have found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. This definitely looks more interesting than the high-precision counter.
Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements.
I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:
Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)
Test when the machine is more or less relatively idle except for whatever I'm testing.
Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).
What can we know for certain without debugging, dissassembling and profiling tools?
Nearly nothing (except that I/O tends to be relatively slow).
To answer your main question: the "reverse" algorithm just swaps elements of the array; it does not operate on the elements' values.
Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter accuracy depends on the CPU, but it can go up to a single clock pulse. However, profiling real-world operations is always a better idea.
Is it safe to say you're asking two questions?
Which one is faster, and by how much?
And why is it faster?
For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)
For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.
Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.
(This answer is specific to Windows XP and the 32-bit VC++ compiler.)
The easiest thing for timing little bits of code is the time-stamp counter of the CPU. This is a 64-bit value, a count of the number of CPU cycles run so far, which is about as fine a resolution as you're going to get. The actual numbers you get aren't especially useful as they stand, but if you average out several runs of various competing approaches then you can compare them that way. The results are a bit noisy, but still valid for comparison purposes.
To read the time-stamp counter, use code like the following:
LARGE_INTEGER tsc;
__asm {
    cpuid
    rdtsc
    mov tsc.LowPart, eax
    mov tsc.HighPart, edx
}
(The cpuid instruction is there to ensure that there aren't any incomplete instructions waiting to complete.)
There are four things worth noting about this approach.
Firstly, because of the inline assembly language, it won't work as-is on MS's x64 compiler. (You'll have to create a .ASM file with a function in it. An exercise for the reader; I don't know the details.)
Secondly, to avoid problems with cycle counters not being in sync across different cores/threads/what have you, you may find it necessary to set your process's affinity so that it only runs on one specific execution unit. (Then again... you may not.)
Thirdly, you'll definitely want to check the generated assembly language to ensure that the compiler is generating roughly the code you expect. Watch out for bits of code being removed, functions being inlined, that sort of thing.
Finally, the results are rather noisy. The cycle counters count cycles spent on everything, including waiting for caches, time spent on running other processes, time spent in the OS itself, etc. Unfortunately, it's not possible (under Windows, at least) to time just your process. So, I suggest running the code under test a lot of times (several tens of thousands) and working out the average. This isn't very cunning, but it seems to have produced useful results for me at any rate.
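For what it's worth, a hedged alternative sketch: MSVC also exposes these instructions as intrinsics in <intrin.h> (__cpuid and __rdtsc), which avoids the inline assembly and therefore also builds with the x64 compiler. The same caveats about noise, affinity and averaging apply.

#include <intrin.h>

// Read the time-stamp counter, serialized by cpuid, without inline assembly.
unsigned __int64 readTimeStampCounter()
{
    int cpuInfo[4];
    __cpuid(cpuInfo, 0);   // barrier: wait for earlier instructions to retire
    return __rdtsc();      // 64-bit cycle count since reset
}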
I would suppose that anyone competent enough to answer all your questions is going to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined question. That way you may hope to get well-defined answers which you can collect and be on your way to wisdom.
So, anyway, perhaps I can answer your question about which clock to use on Windows.
clock() is not considered a high-precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with tens of thousands of iterations. As you point out, if you try to repeat a simple method tens of thousands of times in order to get a time that can be measured with clock(), the compiler is liable to step in and optimize the whole thing away.
So, really, the only clock to use is QueryPerformanceCounter()
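For reference, a minimal, hedged sketch of using QueryPerformanceCounter (Windows only; the loop body is just a stand-in for the code under test):

#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER frequency, start, stop;
    QueryPerformanceFrequency(&frequency);   // ticks per second, fixed at boot

    QueryPerformanceCounter(&start);
    volatile long long sink = 0;             // volatile so the loop is not optimized away
    for (int i = 0; i < 10000000; ++i)
        sink += i;
    QueryPerformanceCounter(&stop);

    const double seconds =
        static_cast<double>(stop.QuadPart - start.QuadPart) / frequency.QuadPart;
    std::printf("%.6f s\n", seconds);
    return 0;
}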
Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give the trial version of VTune a try. Try a call-graph sampling test and look at the self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).
Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best about where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (ex: reordering things, putting exceptional case code in a separate, non-inlined function, etc) can make functions a dozen times faster and even the best disassembly experts usually can't predict that without the profiler.
As for relying on simplistic timing tests alone, they are extremely problematic. That current test is not so bad but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code and end up testing the time it takes to do essentially a nop or even nothing at all. You should have some knowledge to interpret the disassembly to make sure the compiler isn't doing this.
Also, timing tests like this have a tendency to bias the results significantly, since a lot of them just involve running your code over and over in the same loop, which tends simply to test the effect of your code when all the memory is in the cache and the branch prediction is working perfectly for it. It often just shows you best-case scenarios without showing you the average, real-world case.
Depending on real world timing tests is a little bit better; something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.
Wha? How to measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to, "how can I write my own profiler?" And the answer is clearly, "don't".
Besides, you should be using std::swap in the first place, which completely invalidates this whole pointless pursuit.
-1 for pointlessness.

Performance of comparisons in C++ ( foo >= 0 vs. foo != 0 )

I've been working on a piece of code recently where performance is very important, and essentially I have the following situation:
int len = some_very_big_number;
int counter = some_rather_small_number;
for( int i = len; i >= 0; --i ){
    while( counter > 0 && costly other stuff here ){
        /* do stuff */
        --counter;
    }
    /* do more stuff */
}
So here I have a loop that runs very often, and for a certain number of iterations the while block will be executed as well, until the variable counter has been reduced to zero; after that the while body is not entered because the first expression is false.
The question now is whether there is a difference in performance between using
counter > 0 and counter != 0?
I suspect there would be; does anyone know the specifics?
To measure is to know.
Do you think that will solve your problem? :D
if(x >= 0)
00CA1011 cmp dword ptr [esp],0
00CA1015 jl main+2Ch (0CA102Ch) <----
...
if(x != 0)
00CA1026 cmp dword ptr [esp],0
00CA102A je main+3Bh (0CA103Bh) <----
In programming, the following statement is the sign designating the road to Hell:
I've been working on a piece of code recently where performance is very important
Write your code in the cleanest, most easily understood way. Period.
Once that is done, you can measure its runtime. If it takes too long, measure the bottlenecks, and speed up the biggest ones. Keep doing that until it is fast enough.
The list of projects that failed or suffered catastrophic loss due to a misguided emphasis on blind optimization is large and tragic. Don't join them.
I think you're spending time optimizing the wrong thing. "costly other stuff here", "do stuff" and "do more stuff" are more important to look at. That is where you'll make the big performance improvements I bet.
There will be a huge difference if the counter starts with a negative number. Otherwise, on every platform I'm familiar with, there won't be a difference.
Is there a difference between counter > 0 and counter != 0? It depends on the platform.
A very common type of CPU is the Intel family found in our PCs. Both comparisons will map to a single instruction on that CPU, and I assume they will execute at the same speed. However, to be certain you will have to run your own benchmark.
As Jim said, when in doubt, see for yourself:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
using namespace boost::posix_time;
using namespace std;

int main()
{
    ptime Before = microsec_clock::universal_time();  // UTC now
    // do stuff here
    ptime After = microsec_clock::universal_time();   // UTC now

    time_duration delta_t = After - Before;           // how much time has passed?
    cout << delta_t.total_seconds() << endl;          // whole seconds elapsed
    cout << delta_t.fractional_seconds() << endl;     // fractional part of the last second (microseconds by default)
    return 0;
}
Here's a pretty nifty way of measuring time. Hope that helps.
OK, you can measure this, sure. However, these sorts of comparisons are so fast that you are probably going to see more variation based on process swapping and scheduling than on this single line of code.
This smells of unnecessary, and premature, optimization. Get your program right, then optimize what you see. If you need more, profile, and then go from there.
I would add that on modern CPUs the overwhelming performance factor in this code will be not the comparison instruction but whether the comparison is well predicted, since any mispredict wastes many more cycles than any integer operation.
As such, loop unrolling will probably be the biggest winner, but measure, measure, measure.
Thinking that the type of comparison is going to make a difference, without knowing it, is the definition of guessing.
Don't guess.
In general, they should be equivalent (both are usually implemented in single-cycle instructions/micro-ops). Your compiler may do some strange special-case optimization that is difficult to reason about from the source level, which may make either one slightly faster. Also, equality testing is more energy-efficient than inequality testing (>), though the system-level effect is so small as to not merit discussion.
There may be no difference. You could try examining the assembly output for each.
That being said, the only way to tell if any difference is significant is to try it both ways and measure. I'd bet that the change makes no difference whatsoever with optimizations on.
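If you do want to look at the assembly, here is a tiny hedged pair of functions of my own that you could feed to your compiler (or an online compiler explorer) to compare the generated code for each form:

// Compile with optimizations on and inspect the comparison/jump emitted for each.
bool stillPositive(int counter) { return counter > 0;  }
bool notZero      (int counter) { return counter != 0; }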
Assuming you are developing for the x86 architecture, when you look at the assembly output it will come down to jns vs. jne. jns checks the sign flag and jne checks the zero flag. Both operations should, as far as I know, be equally costly.
Clearly the solution is to use the correct data type.
Make counter an unsigned int. Then it can't be less than zero. Your compiler will obviously know this and be forced to choose the optimal solution.
Or you could just measure it.
You could also think about how it would be implemented...(here we go on a tangent)...
less than zero: the sign bit would be set, so need to check 1 bit
equal to zero : the whole value would be zero, so need to check all the bits
Of course, computers are funny things, and it may take longer to check a single bit than the whole value (however many bytes it is on your platform).
You could just measure it...
And you might find that one is more optimal than the other (under the conditions you measured). But your program will still run like a dog, because you spent all your time optimising the wrong part of your code.
The best solution is to do what many large software companies do - blame the hardware for not running fast enough and encourage your customer to upgrade their equipment (which is clearly inferior, since your product doesn't run fast enough).
</rant>
I stumbled across this question just now, 3 years after it was asked, so I am not sure how useful the answer will still be... Still, I am surprised not to see it clearly stated that answering your question requires knowing two, and only two, things:
which processor you target
which compiler you work with
To the first point: each processor has different instructions for tests. On one given processor, two similar comparisons may turn out to take a different number of cycles. For example, you may have a 1-cycle instruction for gt (>), eq (==), or le (<=), but no 1-cycle instruction for other comparisons such as ge (>=). Following a test, you may decide to execute conditional instructions or, more often, as in your code example, take a jump. Then again, jumps take a variable number of cycles on most high-end processors, depending on whether the conditional jump is taken or not taken, predicted or not predicted. When you write code in assembly and your code is time-critical, you can spend quite a bit of time figuring out how best to arrange your code to minimize the overall cycle count, and you may end up with a solution that has to be tuned based on the number of times a given comparison returns true or false.
Which leads me to the second point: compilers, like human coders, try to arrange the code to take into account the instructions available and their latencies. Their job is harder because some assumptions a hand-written assembly routine could rely on, such as "counter is small", are hard (though not impossible) for a compiler to establish. For trivial cases like a loop counter, most modern compilers can at least recognize that the counter will always be positive and that != will behave the same as >, and generate the best code accordingly. But, as many have mentioned in the posts, you will only know for sure if you either run measurements or inspect your assembly code and convince yourself that this is the best you could do in assembly. And when you upgrade to a new compiler, you may get a different answer.