I heard that Intel publishes a document online describing the CPU cycles needed for each assembly instruction, but I could not find it despite trying hard. Could anyone show me how to find the CPU cycle counts, please?
Here is an example: in the code below, mov and lock are 1 CPU cycle each, and xchg is 3 CPU cycles.
// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress,
int nValue)
{
__asm
{
mov edx, dword ptr [pTargetAddress]
mov eax, nValue
lock xchg eax, dword ptr [edx]
}
// mov = 1 CPU cycle
// lock = 1 CPU cycle
// xchg = 3 CPU cycles
}
#endif // WIN32
BTW: here is the URL for the code I posted: http://www.codeproject.com/KB/threads/spinlocks.aspx
Modern CPUs are complex beasts, using pipelining, superscalar execution, and out-of-order execution among other techniques which make performance analysis difficult... but not impossible!
While you can no longer simply add together the latencies of a stream of instructions to get the total runtime, you can still get an (often) highly accurate analysis of the behavior of some piece of code (especially a loop) as described below and in other linked resources.
Instruction Timings
First, you need the actual timings. These vary by CPU architecture, but the best resource currently for x86 timings is Agner Fog's instruction tables. Covering no fewer than thirty different microarchitectures, these tables list the instruction latency, which is the minimum/typical time that an instruction takes from inputs ready to output available. In Agner's words:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN's and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
So, for example, the add instruction has a latency of one cycle, so a series of dependent add instructions, as shown, will have a latency of 1 cycle per add:
add eax, eax
add eax, eax
add eax, eax
add eax, eax # total latency of 4 cycles for these 4 adds
Note that this doesn't mean that add instructions will only take 1 cycle each. For example, if the add instructions were not dependent, it is possible that on modern chips all 4 add instructions can execute independently in the same cycle:
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx # these 4 instructions might all execute in parallel, in a single cycle
Agner provides a metric which captures some of this potential parallelism, called reciprocal throughput:
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind
in the same thread.
For add this is listed as 0.25 meaning that up to 4 add instructions can execute every cycle (giving a reciprocal throughput of 1 / 4 = 0.25).
The reciprocal throughput number also gives a hint at the pipelining capability of an instruction. For example, on most recent x86 chips, the common forms of the imul instruction have a latency of 3 cycles, and internally only one execution unit can handle them (unlike add which usually has four add-capable units). Yet the observed throughput for a long series of independent imul instructions is 1/cycle, not 1 every 3 cycles as you might expect given the latency of 3. The reason is that the imul unit is pipelined: it can start a new imul every cycle, even while the previous multiplication hasn't completed.
This means a series of independent imul instructions can run at up to 1 per cycle, but a series of dependent imul instructions will run at only 1 every 3 cycles (since the next imul can't start until the result from the prior one is ready).
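To make the latency versus throughput distinction concrete, here is a minimal sketch of the two estimates. The Timing struct and both function names are made up for illustration, and the model deliberately ignores everything else in the pipeline (decode limits, port conflicts, memory), so treat the results as lower bounds rather than predictions.

```cpp
#include <cassert>

// Hypothetical per-instruction timing, in core clock cycles,
// as you would read it off a latency/throughput table.
struct Timing {
    double latency;                // cycles from inputs ready to result ready
    double reciprocal_throughput;  // average cycles per instruction when independent
};

// Dependent chain: each instruction waits for the previous result,
// so the latencies simply add up.
double dependent_chain_cycles(Timing t, int count) {
    assert(count >= 0);
    return count * t.latency;
}

// Independent instructions overlap in the pipeline; in steady state
// only the reciprocal throughput matters.
double independent_steady_state_cycles(Timing t, int count) {
    assert(count >= 0);
    return count * t.reciprocal_throughput;
}
```

With the imul numbers quoted above (latency 3, reciprocal throughput 1), 100 dependent multiplies come out at 300 cycles and 100 independent ones at about 100 cycles.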
So with this information, you can start to see how to analyze instruction timings on modern CPUs.
Detailed Analysis
Still, the above is only scratching the surface. You now have multiple ways of looking at a series of instructions (latency or throughput) and it may not be clear which to use.
Furthermore, there are other limits not captured by the above numbers, such as the fact that certain instructions compete for the same resources within the CPU, and restrictions in other parts of the CPU pipeline (such as instruction decoding) which may result in a lower overall throughput than you'd calculate just by looking at latency and throughput. Beyond that, you have factors "beyond the ALUs" such as memory access and branch prediction: entire topics unto themselves - you can mostly model these well, but it takes work. For example here's a recent post where the answer covers in some detail most of the relevant factors.
Covering all the details would increase the size of this already long answer by a factor of 10 or more, so I'll just point you to the best resources. Agner Fog has an Optimizing Assembly guide that covers in detail the precise analysis of a loop with a dozen or so instructions. See "12.7 An example of analysis for bottlenecks in vector loops" which starts on page 95 in the current version of the PDF.
The basic idea is that you create a table, with one row per instruction and mark the execution resources each uses. This lets you see any throughput bottlenecks. In addition, you need to examine the loop for carried dependencies, to see if any of those limit the throughput (see "12.16 Analyzing dependencies" for a complex case).
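The table idea can be sketched in a few lines. This is a heavily simplified model, not what Agner's guide or IACA actually compute: it assumes each instruction occupies exactly one named resource for one cycle per iteration, and the Row struct, the resource names, and both functions are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One row of the table: an instruction and the (hypothetical) execution
// resource it occupies for one cycle per loop iteration.
struct Row {
    std::string instruction;
    std::string resource;
};

// Throughput bound: the busiest resource determines the minimum
// number of cycles per iteration.
int busiest_resource_cycles(const std::vector<Row>& loop_body) {
    std::map<std::string, int> uses;
    int worst = 0;
    for (const Row& row : loop_body) {
        worst = std::max(worst, ++uses[row.resource]);
    }
    return worst;
}

// A loop-carried dependency chain can be the tighter limit, so the
// estimate is the larger of the two bounds.
int estimated_cycles_per_iteration(const std::vector<Row>& loop_body,
                                   int carried_chain_latency) {
    assert(carried_chain_latency >= 0);
    return std::max(busiest_resource_cycles(loop_body), carried_chain_latency);
}
```

Real CPUs let many instructions choose between several ports, which is exactly the part that makes the by-hand analysis tedious and the automated tools worthwhile.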
If you don't want to do it by hand, Intel has released the Intel Architecture Code Analyzer, which is a tool that automates this analysis. It currently hasn't been updated beyond Skylake, but the results are still largely reasonable for Kaby Lake since the microarchitecture hasn't changed much and therefore the timings remain comparable. This answer goes into a lot of detail and provides example output, and the user's guide isn't half bad (although it is out of date with respect to the newest versions).
Other sources
Agner usually provides timings for new architectures shortly after they are released, but you can also check out instlatx64 for similarly organized timings in the InstLatX86 and InstLatX64 results. The results cover a lot of interesting old chips, and new chips usually show up fairly quickly. The results are mostly consistent with Agner's, with a few exceptions here and there. You can also find memory latency and other values on this page.
You can even get the timing results directly from Intel in their IA32 and Intel 64 optimization manual in Appendix C: INSTRUCTION LATENCY AND THROUGHPUT. Personally I prefer Agner's version because they are more complete, often arrive before the Intel manual is updated, and are easier to use as they provide a spreadsheet and PDF version.
Finally, the x86 tag wiki has a wealth of resources on x86 optimization, including links to other examples of how to do a cycle accurate analysis of code sequences.
If you want a deeper look into the type of "dataflow analysis" described above, I would recommend A Whirlwind Introduction to Data Flow Graphs.
Given pipelining, out-of-order processing, microcode, multi-core processors, etc., there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycles/whatever cycles.
If such a reference exists, it will only be able to provide broad generalizations given a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M is different than the Core 2 Duo which is different than the AMD dual core, etc.
Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing. The PIII, PII, and the original Pentium were easier, and the texts referenced were probably based on those earlier processors, which had more well-defined instruction timing.
These days people generally use statistical analysis for code timing estimation.
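A minimal sketch of what that can look like in practice: run the code many times and look at a robust statistic such as the median rather than a single reading. The helper name median_nanoseconds is made up, and note that it measures wall-clock nanoseconds via std::chrono, not core clock cycles.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Runs `work` `runs` times and returns the median wall-clock duration
// of a single run, in nanoseconds.
template <typename F>
long long median_nanoseconds(F&& work, int runs) {
    assert(runs > 0);
    std::vector<long long> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    // Partially sort so that the middle element is the median.
    const auto middle = samples.begin() + runs / 2;
    std::nth_element(samples.begin(), middle, samples.end());
    return *middle;
}
```

For very short sequences the timer overhead dominates, so you would normally time a loop around the code of interest and divide by the iteration count.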
What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.
The exact latencies for Intel's and AMD's processors are listed in Agner Fog's instruction tables. See also Intel® 64 and IA-32 Architectures Optimization Reference Manual, and Instruction latencies and throughput for AMD and Intel x86 processors (from Can Berk Güder's now-deleted link-only answer). AMD also has PDF manuals on their own website with their official values.
For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).
Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, the number of execution units and so on. And of course, these numbers vary between CPUs. But you can often come up with a reasonable average that more or less works for all CPUs.
It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.
Edit
Looking in Intel's optimization manual, table C-13:
The first column is the instruction type, followed by a number of latency columns, one for each CPUID. The CPUID indicates which processor family the numbers apply to, and is explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.
The throughput columns show how many instructions of this type can be executed per cycle.
Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. These are for the register-to-register forms of the instructions, not for a lock xchg with memory, which is a lot slower. More importantly, lock xchg with memory has hugely variable latency and impact on surrounding code (it is much slower when there's contention with another core), so looking only at the best case is a mistake. (I haven't looked up what each CPUID means, but I assume the .5 values are for the Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles.)
I don't really see what you plan to use this information for, but if you know the exact CPU family the code is running on, then adding up the latencies tells you the minimum number of cycles required to execute this sequence of instructions.
Measuring and counting CPU-cycles does not make sense on the x86 anymore.
First off, ask yourself which CPU you're counting cycles for. Core 2? An Athlon? Pentium M? Atom? All these CPUs execute x86 code but all of them have different execution times. The execution time even varies between different steppings of the same CPU.
The last x86 where cycle-counting made sense was the Pentium-Pro.
Also consider that inside the CPU most instructions are transcoded into microcode and executed out of order by an internal execution unit that does not even remotely look like an x86. The performance of a single CPU instruction depends on how many resources in the internal execution unit are available.
So the time for an instruction depends not only on the instruction itself but also on the surrounding code.
Anyway: You can estimate the throughput-resource usage and latency of instructions for different processors. The relevant information can be found at the Intel and AMD sites.
Agner Fog has a very nice summary on his website. See the instruction tables for latency, throughput, and uop count. See the microarchitecture PDF to learn how to interpret those.
http://www.agner.org/optimize
But note that xchg-with-memory does not have predictable performance, even if you look at only one CPU model. Even in the no-contention case with the cache line already hot in L1D cache, being a full memory barrier means its impact depends a lot on loads and stores to other addresses in the surrounding code.
Btw - since your example-code is a lock-free datastructure basic building block: Have you considered using the compiler built-in functions? On win32 you can include intrin.h and use functions such as _InterlockedExchange.
That'll give you better execution time because the compiler can inline the instructions. Inline-assembler always forces the compiler to disable optimizations around the asm-code.
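For comparison, here is a portable sketch of the same test-and-set idea. It swaps in std::atomic from standard C++ instead of the MSVC-specific _InterlockedExchange mentioned above, and the class name SpinLockSketch is made up. On x86, compilers typically turn the exchange into an xchg with memory, and they are free to inline it.

```cpp
#include <atomic>
#include <cassert>

// Minimal test-and-set lock built on an atomic exchange.
class SpinLockSketch {
public:
    // Atomically writes 1 and returns true if the previous value was 0,
    // i.e. if this call acquired the lock.
    bool try_lock() {
        return state_.exchange(1, std::memory_order_acquire) == 0;
    }

    // Releases the lock; must only be called by the current holder.
    void unlock() {
        assert(state_.load(std::memory_order_relaxed) == 1);
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> state_{0};
};
```

A real spinlock would loop on try_lock() with some back-off; this only shows the exchange primitive itself.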
lock xchg eax, dword ptr [edx]
Note that the lock prefix locks memory for the fetch across all cores. This can take 100 cycles on some multi-core systems, and a cache line will also need to be flushed. It will also stall the pipeline. So I wouldn't worry about the rest.
So optimal performance comes back to tuning your algorithm's critical regions.
Note that on a single core you can optimize this by removing the lock, but it is needed for multi-core.
Consider the following two alternative pieces of code:
Alternative 1:
if (variable != new_val) // (1)
variable = new_val;
f(); // This function reads `variable`.
Alternative 2:
variable = new_val; // (2)
f(); // This function reads `variable`.
Which alternative is "statistically" faster? Assume variable is in cache L1 before (1) or (2).
I guess that alternative (1) is faster even if the branch-misprediction rate is high, but I don't really know the costs of "ifs". My guess is based on the assumption that cache-misses are way more expensive than branch-mispredictions but I don't really know.
What if variable wasn't in cache before (1) or (2)? Does it change the situation too much?
NOTE: Since the situation could change a lot among different CPUs, you can base your answer on an architecture you are familiar with, although a widely used one, such as any modern Intel architecture, is preferred. The goal of my question is actually to learn a bit more about how CPUs work.
Normally, alternative 2 is faster because it's less machine code executing, and the store buffer will decouple unconditional stores from other parts of the core, even if they miss in cache.
If alternative 1 was consistently faster, compilers would make asm that did that, but it's not so they don't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.
Reading variable in the first place already touches memory for both variables (if neither is in registers). If you expect new_val to almost always be the same (so the branch predicts well), and for that load to miss in cache, branch prediction + speculative execution can be helpful to decouple later reads of variable from that cache-miss load. But it's still a cache-miss load that has to be waited for before the branch condition can be checked, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise, you're hiding a lot of the cache-miss load penalty by making more later work independent of it, allowing OoO exec up to the limit of the ROB size.
Other than breaking the data dependency, if f() inlines and variable optimizes into a register, it would be pointless to branch. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in shared state and dirtying it would interfere with them reading values of other global vars. (False sharing)
Note that later reloads of variable can use the newly-stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if f() didn't inline and use the new_val load result directly, its use of variable still doesn't have to wait for a possible store miss on variable.
Avoiding false-sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.
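The two alternatives from the question can be written as follows. This is only a sketch: it uses std::atomic with relaxed ordering so that both variants are well-defined when other threads read the variable, and the function names are made up. Which one is faster still has to be measured for your workload.

```cpp
#include <atomic>

// Alternative 2: unconditional store. The cache line always ends up
// modified, even when the value does not actually change.
inline void store_always(std::atomic<int>& variable, int new_val) {
    variable.store(new_val, std::memory_order_relaxed);
}

// Alternative 1: load and compare first, store only on a change.
// Returns true if a store was performed. When the value is usually
// unchanged, the line can stay in shared state in other cores' caches.
inline bool store_if_changed(std::atomic<int>& variable, int new_val) {
    if (variable.load(std::memory_order_relaxed) == new_val) {
        return false;
    }
    variable.store(new_val, std::memory_order_relaxed);
    return true;
}
```

Note that the check-then-store is not one atomic operation; another thread can write between the load and the store, which is fine for this use but matters if you need compare-and-swap semantics.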
Two questions linked in comments by @EOF discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with std::atomic variables because false sharing is an even bigger deal. (And stores with the default mo_seq_cst memory order are slow on most ISAs other than AArch64, draining the store buffer.)
Strange optimization? in `libuv`. Please explain
C optimization: conditional store to avoid dirtying a cache line
Or is it the same in terms of performance?
For example, which is faster?
int a = 1, b = 2;
for (int i = 0; i < 10; ++i) {
a = a + 1;
b = b + 1;
}
or
for (int i = 0; i < 10; ++i) {
a = a + 1;
}
for (int i = 0; i < 10; ++i) {
b = b + 1;
}
Note: I changed my examples, given a lot of people seem hung up on the statements inside them rather than the purpose of my question.
Both of your examples do nothing at all and most compilers will optimize them both to the same thing -- nothing at all.
Update: Your two new examples are obviously equivalent. If any compiler generated better code for one than the other, then it's a poor quality compiler and you should just use a better compiler.
As people have pointed out, the compiler will optimize whichever way I go, but it really depends on what statements are inside the loop(s).
The performance depends on the contents of the loops.
Let's decompose the for loop. A for loop is comprised of:
1) Initialization
2) Comparison
3) Incrementing
4) Content (statements)
5) Branching
Let us define a comparison as a compare instruction (to set the processor status bits) and a branch (to take advantage of the processor status bits).
Processors are at their happiest when they are executing data instructions. The processor manipulates the data, then processes the next instruction in the pipeline (cache).
The processors don't like sections 2) Comparison and 5) Branching (to the top of the loop). Branching means that the processor has to stop processing data and execute logic to determine if the instruction cache needs to be replaced or not. This time could be spent processing data instructions.
The goal of optimizing a for loop is to reduce the branching. The secondary goal is to optimize the data cache / memory accesses. A common optimization technique is loop unrolling, or basically placing more statements inside the for loop. As a measurement, you can take the overhead of the for loop and divide by the quantity of statements inside the loop.
According to the above information, your first loop (with both assignment statements) would be more efficient, since there are more data instructions per loop; less overhead overall.
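To illustrate the unrolling idea, here is a small sketch that sums an array with the body unrolled by four, so there is one compare-and-branch per four additions instead of one per addition. The function name is made up, and keep in mind that optimizing compilers often do this on their own.

```cpp
#include <cassert>
#include <cstddef>

// Sums n ints with the loop body unrolled by 4.
long long sum_unrolled(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long long total = 0;
    std::size_t i = 0;
    // Main loop: four data statements per loop-overhead (compare + branch).
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    // Remainder loop: handles the last 0 to 3 elements.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

By the measure described above, the loop overhead per data statement drops to roughly a quarter of the non-unrolled version.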
Edit 1: The Parallel Environment
However, your second example may be faster. The compiler could set up both loops to run in parallel (either through instructions or actual parallel tasks). Since both loops are independent, they can be run at the same time or split between CPU cores. Processors have instructions that can perform common operations on multiple memory locations. Your first example makes this a little more difficult because it requires more analysis from the compiler. Since the loops in the second example are simpler, the compiler's analysis is also simpler.
The number of iterations is also a factor. For small iteration counts, the loops should perform the same or have negligible differences. For large iteration counts, there may be some timing differences.
In summary: PROFILE. BENCHMARK. The only true answer depends on measurements. They may vary depending on the applications being run at the same time, the amount of memory (both RAM and hard drive), the quantity of CPU cores and other items. Profile and Benchmark on your system. Repeat on other systems.
Quoted from https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html:
-falign-labels
-falign-labels=n
Align all branch targets to a power-of-two boundary, skipping up to n bytes like -falign-functions. This option can easily
make code slower, because it must insert dummy operations for when the
branch target is reached in the usual flow of the code.
-fno-align-labels and -falign-labels=1 are equivalent and mean that labels are not aligned.
If -falign-loops or -falign-jumps are applicable and are greater than
this value, then their values are used instead.
If n is not specified or is zero, use a machine-dependent default
which is very likely to be ‘1’, meaning no alignment.
Enabled at levels -O2, -O3.
The more I think about this flag, the less sense it makes: alignment can provoke code cache misses, and what does "enabled" even mean when the parameter takes a numeric value (1..)?
It doesn't say that. It says that it can easily make code slower. That means that in certain situations it can make code slower. In other situations, it can make code faster.
Alignment can make code run slower:
it increases code size, so there is a higher chance that the code is not in the cache.
the added nop operations slow down the code
Alignment can make code run faster: branch prediction, instruction fetch, and god-knows-what.
In the case of a single if, it is hard to say which effect is stronger. It depends on the conditions.
However, for a loop, the code usually becomes faster. Why? Because the slowing factors are paid only once, but every iteration of the loop executes faster.
(My GCC seems to align labels to 8)
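To see where the extra size comes from, here is a sketch of the padding arithmetic: how many filler bytes are needed to move an address up to the next power-of-two boundary. The function name is made up, and this only illustrates the arithmetic, not how GCC implements the option.

```cpp
#include <cassert>
#include <cstdint>

// Number of padding bytes needed to move `address` up to the next
// multiple of `boundary`. `boundary` must be a power of two.
std::uint64_t padding_for_alignment(std::uint64_t address, std::uint64_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    const std::uint64_t offset = address & (boundary - 1);  // address mod boundary
    return (boundary - offset) & (boundary - 1);            // 0 if already aligned
}
```

With a boundary of 8, a label at address 13 needs 3 bytes of padding, while one at address 16 needs none. Those padding bytes are the nops that get executed when the label is reached by falling through.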
Assuming that in some C or C++ code I have a function named T fma( T a, T b, T c ) that performs 1 multiplication and 1 addition like so: ( a * b ) + c; how am I supposed to optimize multiple mul & add steps?
For example, my algorithm needs to be implemented with 3 or 4 fma operations chained and summed together. How can I write this in an efficient way, and to what part of the syntax or semantics should I pay particular attention?
I also would like some hints on the critical part: avoid changing the rounding mode for the CPU to avoid flushing the cpu pipeline. But I'm quite sure that just using the + operation between multiple calls to fma shouldn't change that, I'm saying "quite sure" because I don't have too many CPUs to test this, I'm just following some logical steps.
My algorithm is something like the total of multiple fma calls
fma ( triplet 1 ) + fma ( triplet 2 ) + fma ( triplet 3 )
Recently, at Build 2014, Eric Brumer gave a very nice talk on the topic (see here).
The bottom line of talk was that
Using Fused Multiply Accumulate (aka FMA) everywhere hurts performance.
On Intel CPUs an FMA instruction costs 5 cycles. Doing a multiplication (5 cycles) and an addition (3 cycles) instead costs 8 cycles. Using FMA you are getting two operations for the price of one (see picture below).
However, FMA seems not to be the holy grail of instructions. As you can see in the picture below, FMA can in certain situations hurt performance.
In the same fashion, your case fma(triplet1) + fma(triplet2) + fma(triplet 3) costs 21 cycles, whereas doing the same operations without FMA would cost 30 cycles. That's a 30% gain in performance.
Using FMA in your code would demand using compiler intrinsics. In my humble opinion though, FMA etc. is not something you should be worried about, unless you are a C++ compiler programmer. If you are not, let the compiler optimization take care of these technicalities. Generally, under such kinds of concerns lies the root of all evil (i.e., premature optimization), to paraphrase one of the great ones (i.e., Donald Knuth).
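The 21 and 30 figures follow from simply adding the quoted costs in series, which can be checked with a few constants. This sketch uses the cycle counts stated above and assumes nothing overlaps (a real out-of-order core would run the independent FMAs in parallel); the names are made up.

```cpp
// Cycle costs as quoted in this answer.
constexpr int kFmaCycles = 5;
constexpr int kMulCycles = 5;
constexpr int kAddCycles = 3;

// `terms` fused multiply-adds, joined by (terms - 1) additions.
constexpr int cycles_with_fma(int terms) {
    return terms * kFmaCycles + (terms - 1) * kAddCycles;
}

// `terms` separate multiply + add pairs, joined by (terms - 1) additions.
constexpr int cycles_without_fma(int terms) {
    return terms * (kMulCycles + kAddCycles) + (terms - 1) * kAddCycles;
}

static_assert(cycles_with_fma(3) == 21, "3 * 5 + 2 * 3");
static_assert(cycles_without_fma(3) == 30, "3 * (5 + 3) + 2 * 3");
```

Going from 30 to 21 cycles is a reduction of 9 / 30 = 30%.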
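As an alternative to intrinsics, standard C++ (since C++11) also provides std::fma in <cmath>, which computes a * b + c with a single rounding. Here is a sketch of the sum from the question written with it; the Triplet struct and the function name are made up, and whether each call becomes a hardware FMA instruction depends on the target CPU and compiler flags.

```cpp
#include <cmath>

// One (a, b, c) triplet, evaluated as a * b + c.
struct Triplet {
    double a;
    double b;
    double c;
};

// fma(triplet 1) + fma(triplet 2) + fma(triplet 3).
// The three std::fma calls are independent of each other, so the CPU
// can overlap them; only the two final additions form a chain.
double fma_sum(const Triplet& t1, const Triplet& t2, const Triplet& t3) {
    return std::fma(t1.a, t1.b, t1.c)
         + std::fma(t2.a, t2.b, t2.c)
         + std::fma(t3.a, t3.b, t3.c);
}
```

The plain + between the calls does not touch the rounding mode; it is an ordinary floating-point addition.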