Time of an assembler operation - C++

Why can the same assembler operation (mul, for example) in different parts of a program consume different amounts of time?
P.S. I'm using C++ and a disassembler.

This question is very vague, but generally on a modern CPU you cannot expect operations to have a constant execution time, because a lot of factors can influence this, including but not limited to:
Branch prediction failures
Cache misses
Pipelining
...

There are all kinds of reasons why the same kind of operation can have massively varying performance on modern processors.
Data Cache Misses:
If your operation accesses memory, it might hit the cache at one location and generate a cache miss elsewhere. Cache misses can be on the order of a hundred cycles, while simple operations often execute in a few cycles, so this will make the operation much slower (see the sketch after this list).
Pipeline Stalls:
Modern CPUs are typically pipelined, so an instruction (or more than one) can be scheduled each cycle, but they typically need more than one cycle until the result is available. Your operation might depend on the result of another operation, which isn't ready when the operation is scheduled, so the CPU has to wait until the operation generating the result has finished.
Instruction Cache Misses:
The instruction stream is also cached, so you might find a situation where the CPU generates a cache miss each time it encounters one particular location (unlikely for anything which takes a measurable amount of the runtime though; instruction caches aren't that small).
Branch Misprediction:
Another kind of pipeline stall. The CPU will try to predict which way a conditional jump will go and speculatively execute the code in that execution path. If it is wrong it has to discard the results from this speculative execution and start on the other path. This might show up on the first line of the other path in a profiler.
Resource Contention: The operation might not depend on an unavailable result, but the execution unit it needs might still be occupied by another instruction (some instructions are not fully pipelined on all processors, or it might be because of some kind of Hyper-Threading or Bulldozer's shared FPU). Again, the CPU might have to stall until the unit is free.
Page Faults: Should be pretty obvious. Basically a cache miss on steroids. If the accessed memory has to be reloaded from disk, it will cost hundreds of thousands of cycles.
...: The list goes on; however, the points mentioned are the ones most likely to make an impact, in my opinion.
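A minimal sketch (mine, not from the answer above) of the data-cache-miss point: the loop body performs exactly the same multiply in both runs, but whether the operand comes from cache or from main memory changes the measured time dramatically. Array size, constants, and labels are arbitrary.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;                    // 16M elements, much larger than a typical L3
    std::vector<std::uint32_t> data(n, 3);

    // Two index orders: sequential (cache/prefetcher friendly) and shuffled
    // (mostly cache misses). The multiply per element is identical.
    std::vector<std::uint32_t> sequential(n);
    std::iota(sequential.begin(), sequential.end(), 0u);
    std::vector<std::uint32_t> shuffled = sequential;
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});

    auto run = [&](const std::vector<std::uint32_t>& order, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        std::uint64_t sum = 0;
        for (std::uint32_t i : order)
            sum += std::uint64_t{data[i]} * 7;        // the same multiply in both runs
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms (sum=%llu)\n", label, (long long)ms, (unsigned long long)sum);
    };

    run(sequential, "sequential (mostly cache hits)  ");
    run(shuffled,   "shuffled   (mostly cache misses)");
}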

I assume you're asking about exactly the same instruction applied to the same operands.
One possible cause that could have huge performance implications is whether the operands are readily available in the CPU cache or whether they have to be fetched from the main RAM.
This is just one example; there are many other potential causes. With modern CPUs it's generally very hard to figure out how many cycles a given instruction will require just by looking at the code.

In the profiler I see "mulps %xmm11, %xmm5", for example. I guess the data is in registers.
The xmmXX registers are used by SSE instructions. mulps is a packed single-precision multiply; it depends whether or not you are comparing an SSE multiply against a normal scalar multiply, in which case the difference is understandable.
We really need more information for a better answer: a chunk of asm and your profiler's figures.
Is it just this instruction that is slow, or a block of instructions? Maybe it is loading from unaligned memory, or you are getting cache misses, pipeline hazards, or any of a significant number of other possibilities.

Related

Shorter loop, same coverage, why do I get more Last Level Cache Misses in c++ with Visual Studio 2013?

I'm trying to understand what creates cache misses and eventually how much do they cost in terms of performance in our application. But with the tests I'm doing now, I'm quite confused.
Assuming that my L3 cache is 4MB, and my LineSize is 64 bytes, I would expect that this loop (loop 1):
int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3); ++i )
{
    ++aArr[i];
}
...and this loop (loop 2):
int8_t aArr[SIZE_L3];
int i;
for ( i = 0; i < (SIZE_L3 / 64u); ++i )
{
    ++aArr[i * 64];
}
give roughly the same amount of Last Level Cache Misses, but different amount of Inclusive Last Level Cache References.
However the numbers that the profiler of Visual Studio 2013 gives me are unsettling.
With loop 1:
Inclusive Last Level Cache References: 53,000
Last Level Cache Misses: 17,000
With loop 2:
Inclusive Last Level Cache References: 69,000
Last Level Cache Misses: 35,000
I have tested this with a dynamically allocated array, and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Why don't I get the same amount of cache misses, and why do I have more references in a shorter loop?
Incrementing every byte of int8_t aArr[SIZE_L3]; separately is slow enough that hardware prefetchers are probably able to keep up pretty well a lot of the time. Out-of-order execution can keep a lot of read-modify-writes in flight at once to different addresses, but the best-case is still one byte per clock of stores. (Bottleneck on store-port uops, assuming this was a single-threaded test on a system without a lot of other demands for memory bandwidth).
Intel CPUs have their main prefetch logic in L2 cache (as described in Intel's optimization guide; see the x86 tag wiki). So successful hardware prefetch into L2 cache before the core issues a load means that the L3 cache never sees a miss.
John McCalpin's answer on this Intel forum thread confirms that L2 hardware prefetches are NOT counted as LLC references or misses by the normal perf events like MEM_LOAD_UOPS_RETIRED.LLC_MISS. Apparently there are OFFCORE_RESPONSE events you can look at.
IvyBridge introduced next-page HW prefetch. Intel Microarchitectures before that don't cross page boundaries when prefetching, so you still get misses every 4k. And maybe TLB misses if the OS didn't opportunistically put your memory in a 2MiB hugepage. (But speculative page-walks as you approach a page boundary can probably avoid much delay there, and hardware definitely does do speculative page walks).
With a stride of 64 bytes, execution can touch memory much faster than the cache / memory hierarchy can keep up. You bottleneck on L3 / main memory. Out-of-order execution can keep about the same number of read/modify/write ops in flight at once, but the same out-of-order window covers 64x more memory.
Explaining the exact numbers in more detail
For array sizes right around L3, IvyBridge's adaptive replacement policy probably makes a significant difference.
Until we know the exact uarch, and more details of the test, I can't say. It's not clear if you only ran that loop once, or if you had an outer repeat loop and those miss / reference numbers are an average per iteration.
If it's only from a single run, that's a tiny noisy sample. I assume it was somewhat repeatable, but I'm surprised the L3 references count was so high for the every-byte version. 4 * 1024^2 / 64 = 65536, so there was still an L3 reference for most of the cache lines you touched.
Of course, if you didn't have a repeat loop, and those counts include everything the code did besides the loop, maybe most of those counts came from startup / cleanup overhead in your program. (i.e. your program with the loop commented out might have 48k L3 references, IDK.)
I have tested this with a dynamically allocated array
Totally unsurprising, since it's still contiguous.
and on a CPU that has a larger L3 cache (8MB) and I get a similar pattern in the results.
Did this test use a larger array? Or did you use a 4MiB array on a CPU with an 8MiB L3?
Your assumption that "If I skip over more elements in the array, making for fewer iterations of the loop and fewer array accesses, that I should have fewer cache misses" seems to be ignoring the way that data gets fetched into the cache.
When you access memory, more data is kept in the cache than just the specific data you accessed. If I access intArray[0], then intArray[1] and intArray[2] are likely going to be fetched as well at the same time. This is one of the optimizations that allows the cache to help us work faster. So if I access those three memory locations in a row, it's sort of like having only 1 memory read that you need to wait for.
If you increase the stride, instead accessing intArray[0], then intArray[100] and intArray[200], the data may require 3 separate reads because the second and third memory accesses might not be in cache, resulting in a cache miss.
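A small sketch of the cache-line arithmetic described above, assuming 64-byte lines (the question's assumption) and 4-byte int (my assumption); the array name and indices are just the ones used in the explanation.
#include <cstdio>

int main() {
    const int kLineSize = 64;            // bytes per cache line (from the question's assumption)
    const int kIntSize  = sizeof(int);   // typically 4 bytes

    const int indices[] = {0, 1, 2, 100, 200};
    for (int i : indices) {
        // The byte offset of intArray[i], divided by the line size, identifies
        // the cache line that element lives in.
        std::printf("intArray[%3d] -> cache line %d\n", i, (i * kIntSize) / kLineSize);
    }
    // Indices 0, 1 and 2 all land in line 0 (one fetch covers them),
    // while 100 and 200 land in lines 6 and 12 (separate fetches).
}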
All of the exact details of your specific problem depend on your computer architecture. I would assume you are running an Intel x86-based architecture, but when we are talking about hardware at this low a level I should not assume (I think you can get Visual Studio to run on other architectures, can't you?); and I don't remember all of the specifics for that architecture anyway.
Because you generally don't know what exactly the caching system will be like on the hardware your software is run on, and it can change over time, it is usually better to just read up on caching principles in general and try to write in general code that is likely to produce fewer misses. Trying to make the code perfect on the specific machine you're developing on is usually a waste of time. The exceptions to this are for certain embedded control systems and other types of low-level systems which are not likely to change on you; unless this describes your work I suggest you just read some good articles or books about computer caches.

Will Speculative Execution Follow Into an Expensive Operation?

If I understand branching correctly (x86), the processor will sometimes speculatively take a code path and perform the instructions and 'cancel' the results of the wrong path. What if the operation in the wrong codepath is very expensive, like a memory read that causes a cache miss or some expensive math operation? Will the processor try to perform something expensive ahead of time? How would a processor typically handle this?
if (likely) {
    // do something lightweight (addition, subtraction, etc.)
} else {
    // do something expensive (cache miss, division, sin/cos/tan, etc.)
}
tl;dr: the impact isn't as bad as you think, because the CPU no longer has to wait for slow things, even if it doesn't cancel them. Almost everything is heavily pipelined, so many operations can be in flight at once. The mis-speculated operations don't prevent new ones from starting.
Current x86 designs do not speculate on both sides of a branch at once. They only speculate down the predicted path.
I'm not aware of any specific microarchitecture that does speculate along both ways of a branch in any circumstances, but that doesn't mean there aren't any. I've mostly only read up on x86 microarchitectures (see the tag wiki for links to Agner Fog's microarch guide). I'm sure it's been suggested in academic papers, and maybe even implemented in a real design somewhere.
I'm not sure exactly what happens in current Intel and AMD designs when a branch mispredict is detected while a cache-miss load or store is still pending, or while a divide is occupying the divide unit. Certainly out-of-order execution doesn't have to wait for the result, because no future uops depend on it.
On uarches other than P4, bogus uops in the ROB/scheduler are discarded when a mispredict is detected. From Agner Fog's microarch doc, talking about P4 vs. other uarches:
the misprediction penalty is unusually high for two reasons ... [long pipeline and] ... bogus μops in a mispredicted branch are not discarded before they retire. A misprediction typically involves 45 μops. If these μops are divisions or other time-consuming operations then the misprediction can be extremely costly. Other microprocessors can discard μops as soon as the misprediction is detected so that they don't use execution resources unnecessarily.
uops that are currently occupying execution units are another story:
Almost all execution units except the divider are fully pipelined, so another multiply, shuffle, or whatever can start without cancelling an in-flight FP FMA. (Haswell: 5 cycle latency, two execution units each capable of one per clock throughput, for a total sustained throughput of one per 0.5c. This means max throughput requires keeping 10 FMAs in flight at once, typically with 10 vector accumulators). Divide is interesting, though. Integer divide is many uops, so a branch mispredict will at least stop issuing them. FP div is only a single uop instruction, but not fully pipelined, esp. in older CPUs. It would be useful to cancel an FP div that was tying up the divide unit, but IDK if that's possible. If adding the ability to cancel would have slowed down the normal case, or cost more power, then it would probably be left out. It's a rare special case which probably wasn't worth spending transistors on.
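To make the multiple-accumulators remark concrete, here is a minimal sketch (mine, not from the answer) of a scalar dot product written with one versus four independent accumulators; the exact number of accumulators needed to saturate the FMA units depends on the latency and throughput figures quoted above for your CPU.
#include <cstddef>

// Single accumulator: every addition depends on the previous one, so the loop
// is limited by the latency of one FP add/FMA per element.
float dot_one_acc(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Four independent accumulators: four separate dependency chains, so the
// out-of-order core can keep several multiply-adds in flight at once.
// (Without fast-math options the compiler is not allowed to do this
// reassociation itself, which is why it is written out by hand here.)
float dot_four_acc(const float* a, const float* b, std::size_t n) {
    float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)
        acc0 += a[i] * b[i];          // handle any leftover elements
    return (acc0 + acc1) + (acc2 + acc3);
}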
x87 fsin or something is a good example of a really expensive instruction. I didn't notice that until I went back to re-read the question. It's microcoded, so even though it has a latency of 47-106 cycles (Intel Haswell), it's also 71-100 uops. A branch mispredict would stop the frontend from issuing the remaining uops, and cancel all the ones that are queued, like I said for integer division. Note that real libm implementations typically don't use fsin and so on because they're slower and less accurate than what can be achieved in software (even without SSE), IIRC.
For a cache-miss, it might be cancelled, potentially saving bandwidth in L3 cache (and maybe main memory). Even if not, the instruction no longer has to retire, so the ROB won't fill up waiting for it to finish. That's normally why cache misses hurt OOO execution so much, but here it's at worst just tying up a load or store buffer. Modern CPUs can have many outstanding cache misses in flight at once. Often code doesn't make this possible because future operations depend on the result of a load that missed in cache (e.g. pointer chasing in a linked list or tree), so multiple memory operations can't be pipelined. Even if a branch mispredict doesn't cancel much of an in-flight memory op, it avoids most of the worst effects.
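A sketch (mine) of that dependent-load point: in the linked-list walk each load's address comes from the previous load, so its cache misses serialize, while in the array sum the addresses are independent of the loaded data and many misses can be outstanding at once. The types and names here are illustrative only.
#include <cstddef>
#include <cstdint>

struct Node { Node* next; std::uint64_t value; };

std::uint64_t sum_list(const Node* head) {
    std::uint64_t sum = 0;
    // p->next must arrive from memory before the next load can even start.
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;
    return sum;
}

std::uint64_t sum_array(const std::uint64_t* a, std::size_t n) {
    std::uint64_t sum = 0;
    // Addresses are known up front, so the core can have many loads in flight.
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}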
I have heard of putting a ud2 (illegal instruction) at the end of a block of code to stop instruction prefetch from triggering a TLB miss when the block is at the end of a page. I'm not sure when this technique is necessary. Maybe if there's a conditional branch that's always actually taken? That doesn't make sense, you'd just use an unconditional branch. There must be something I'm not remembering about when you'd do that.

Speed ratio of algorithm versus precompiled reference implementation differs across computers

We have a small C++ project with the following architecture.
These two were compiled into a DLL:
An algorithm
A tester for the algorithm which checks the correctness of the result and measures the execution speed.
Then another implementation of the same algorithm is written by someone else.
The main() function does this:
Invoke the tester on both implementations of the algorithm and measure their execution speed. This is done several times, so that averages can be taken later.
Compute the speed ratio between them (measured time/measured reference time). This is referred to as the score.
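For illustration only, a rough sketch of that measurement (the workload functions are hypothetical placeholders, not the project's actual API): time both implementations several times and report the ratio of the totals.
#include <chrono>
#include <cstdio>

// Placeholder workloads; in the real project these would call the two
// implementations exported from the DLL.
void run_reference() { volatile double x = 0; for (int i = 0; i < 1000000; ++i) x = x + i * 0.5; }
void run_candidate() { volatile double x = 0; for (int i = 0; i < 3000000; ++i) x = x + i * 0.5; }

double time_ms(void (*impl)()) {
    auto t0 = std::chrono::steady_clock::now();
    impl();                                       // one timed run of the algorithm
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int runs = 10;                          // average over several runs
    double ref_total = 0, cand_total = 0;
    for (int i = 0; i < runs; ++i) {
        ref_total  += time_ms(run_reference);
        cand_total += time_ms(run_candidate);
    }
    // "score" = measured time / measured reference time, as described above.
    std::printf("score = %.2f\n", cand_total / ref_total);
}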
We found that running the very same code and DLL on different computers returned quite different speed ratios. On one computer an implementation scored 6.4, and the very same implementation scored 2.8 on another machine. How could that be?
There could be tons of factors, but here are a few:
CPU cache can be a big one. Different processors have different caches (and not just in terms of raw cache size, but also caching strategies). One might be "smarter" than the other, or perhaps one just happens to work better than another in this specific situation.
CPU pipelining. Instructions these days are interleaved in the CPU, even in a single thread of execution. The way the CPU pipeline works varies from CPU to CPU, and one CPU might be able to do two particular things at once, while another CPU can't. If one of the implementations exploits this, then it gets a speed boost (or if they both do, then they both get closer to the same speed).
CPU instruction execution times may vary. So one CPU executing the exact same instructions as another might be able to execute each one faster. If one computer's CPU takes longer to execute a particular instruction (and one of the implementations happens to use that instruction), while the other CPU has been improved to speed up that instruction, then there will be a larger time discrepancy.
Branch prediction models in the CPUs might be different, and one implementation might be more or less friendly to a particular CPU's branch prediction model.
Operating systems can affect this in many ways, from memory allocation strategies (maybe one OS has a memory allocation strategy that causes a bigger discrepancy in times, while another OS has a different allocation strategy that minimizes the discrepancy), to CPU time slice management (are the algorithms multithreaded, for example?).

Measuring performance/throughput of fast code ignoring processor speed?

Is there a way I could write a "tool" which could analyse the produced x86 assembly language from a C/C++ program and measure the performance in such a way that it wouldn't matter if I ran it on a 1GHz or 3GHz processor?
I am thinking more along the lines of instruction throughput? How could I write such a tool? Would it be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. If you want static analysis you'll be in "undecidable land" as the other answer pointed out (you can't even detect whether a given piece of code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions, and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in L1 data cache and have 4 cycle latency; all instructions are in L1 instruction cache but in trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details of all the rest differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they do take (and the resources they occupy during this time) vary also. Accuracy of branch prediction is hard to predict, and it takes different numbers of cycles for processors to recover. Caches are different sizes, take different times to access, and have different algorithms for deciding what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. And the more you can narrow down the processor you are targeting, and the more you constrain the code you are evaluating, the better you can predict how code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly unknown) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.

How do you measure the effect of branch misprediction?

I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipelines after a mispredicted branch. Suppose N = 10; then that means you are losing 2 clock cycles per iteration on aggregate. Unless you have a very small inner loop, this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
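Plugging those numbers into the earlier estimate (the 20% rate comes from the question; the 4-cycle penalty is the guess above, not a measured value):
#include <cstdio>

int main() {
    const double mispredict_rate = 0.20;   // ~20% as measured by the question's counters
    const double penalty_cycles  = 4.0;    // rough guess for a 4-5 stage in-order pipeline
    // Expected cost per iteration = rate * penalty, i.e. about 0.8 cycles here.
    std::printf("~%.1f wasted cycles per iteration\n", mispredict_rate * penalty_cycles);
}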
On an in-order CPU you may be able to calculate the approximate mispredict cost as the product of the number of mispredicts and the mispredict cost (which is generally a function of the length of some part of the pipeline).
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc, or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: one such paper found a penalty ranging from 9 to 35 cycles averaged across entire benchmarks; if you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark input to have a range of behavior, starting with always following the same branch pattern and going all the way to a random pattern, you can plot the misprediction count versus runtime degradation. If you do, share your result!
1Hundreds of instructions in-flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
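A hedged sketch (mine, not from the answer) of that experiment: run the same number of binary searches, once with a single repeated key (the branch pattern is identical every time) and once with random keys. The array is kept small enough to stay in cache so that the runtime gap is mostly due to branch prediction rather than cache misses; all sizes and seeds here are arbitrary.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 4096;           // small enough to stay in L1/L2 cache
    const std::size_t lookups = 1 << 24;  // same number of searches in both runs

    std::vector<int> haystack(n);
    for (std::size_t i = 0; i < n; ++i) haystack[i] = (int)i;   // sorted values 0..n-1

    std::mt19937 rng(123);
    std::vector<int> fixed_q(lookups, (int)(n / 2));            // always the same key
    std::vector<int> random_q(lookups);
    for (auto& q : random_q) q = (int)(rng() % n);              // unpredictable keys

    auto run = [&](const std::vector<int>& queries, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        std::size_t found = 0;
        for (int q : queries)
            found += std::binary_search(haystack.begin(), haystack.end(), q);
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: %lld ms (found %zu)\n", label, (long long)ms, found);
    };

    run(fixed_q,  "repeated key (predictable branches)");
    run(random_q, "random keys  (hard to predict)     ");
}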
Look at your specs for that info, and if that fails, run it a billion times and time it externally to your program (stopwatch or something). Then run it with and without a miss and compare.