Measuring performance/throughput of fast code ignoring processor speed?

Measuring performance/throughput of fast code ignoring processor speed? - c++

Is there a way I could write a "tool" which could analyse the produced x86 assembly language from a C/C++ program and measure the performance in such a way, that it wouldnt matter if I ran it on a 1GHz or 3GHz processor?
I am thinking more along the lines of instruction throughput? How could I write such a tool? Would it be possible?

I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.

Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. IF you want static analysis you'll be in "undecidable land" as the other answer pointed out (you can't even detect if a given code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.

It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions, and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in L1 data cache and have 4 cycle latency; all instructions are in L1 instruction cache but in trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details for all the rest that differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they do take (and the resources they occupy during this time) vary also. Accuracy of branch prediction is hard to predict, and it takes different numbers of cycles for processors to recover. Caches are different sizes, take different times to access, and have different algorithms for decided what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. And the more you can narrow down the processor you are targetting, and the more you constrain the code you are evaluating, the better you can predict how code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly unknown) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.

Related

Finding which code segment is faster than the other

Say that we have two C++ code segments, for doing the same task. How can we determine which code will run faster?
As an example lets say there is this global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways(I do not want to alter the content of the array)
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i]
I assume the second one is faster. but I can't measure the time to make sure because it will be a negligible difference.
So in this type of a situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code or set of lines and view
how many lines of assembly instructions are there?
I would appreciate your thoughts on this matter.

The basics are to run the piece of code so many times that it takes a few seconds at least to complete, and measure the time.
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Todays compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it. When the same line appears in a different context, the optimizations applied may be different.
Short pieces of code can be much faster than the surrounding looping code.
Not only the compiler makes optimizations, the processor has a cache, an instruction pipeline, and tries to predict branching code. A value which has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code to seemingly optimal code. Choose the right algorithm, this also depends on your input sizes. For example, insertion sort can be faster than quicksort, if the input is very small. But don't write your own sorting code, if your input is not special, use the libraries available in general. And don't optimize prematurely.

Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.

How to measure read/cycle or instructions/cycle?

I want to thoroughly measure and tune my C/C++ code to perform better with caches on a x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine) but I'm wondering how would one measure the instructions per cycle or reads/write per cycle with respect to the working set.
How should I proceed to measure these values?

Modern processors (i.e., those not very constrained that are less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given correct instruction ordering). Latest x86 processors translate the CISC instructions into internal RISC instructions, reorder them and execute the result, have even several regster banks so instructions using "the same registers" can be done in parallel. There isn't any reasonable way to define the "time the instruction execution takes" today.
The current CPUs are much faster than memory (a few hundred instructions is the typical cost of accessing memory), they are all heavily dependent on cache for performance. And then you have all kinds of funny effects of cores sharing (or not) parts of cache, ...
Tuning code for maximal performance starts with the software architecture, goes on to program organization, algorithm and data structure selection (here a modicum of cache/virtual memory awareness is useful too), careful programming and (as te most extreme measures to squeeze out the last 2% of performance) considerations like the ones you mention (and the other favorite, "rewrite in assembly"). And the ordering is that one because the first levels give more performance for the same cost. Measure before digging in, programmers are notoriously unreliable in finding bottlenecks. And consider the cost of reorganizing code for performance, both in the work itself, in convincing yourself this complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes any sense (perhaps for heavily travelled code paths in popular operating systems, in common code paths generated by a compiler, but almost nowhere else).

If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There is tools like VTune from Intel, CodeAnalyst from AMD and oprofile for Linux (works with both AMD and Intel processors).
There are a whole range of different registers that count the number of instructions actually completed, the number of cycles the processor is waiting for . You can also get a count of things like "number of memory reads", "number of cache misses", "number of TLB misses", "number of FPU instructions".
The next, more tricky part, is of course to try to fix any of these sort of issues, and as mentioned in another answer, programmers aren't always good at tweaking these sort of things - and it's certainly time consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y (there were some tuning tricks for early Pentium 4 that works VERY badly on AMD processors - if on the other hand, you tune that code for AMD processors of that age, you get code that runs well on the same generation Intel processor too!)

You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
However, I'd suggest simply measuring using QueryPerformanceCounter. It is rare that the actual number of cycles is important, to tune code you typically only need to be able to compare relative time measurements, and rdtsc has many pitfalls (though probably not applicable to the situation you described):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust the frequency, changing the rate of change in time with respect to the rate of change in cycles.

Pipeline optimzation, is there any point to do this?

Some very expencied programmer from another company told me about some low-level code-optimzation tips that targetting specific CPU, including pipeline-optimzation, which means, arrange the code (inlined assembly, obviously) in special orders such that it fit the pipeline better for the targetting hardware.
With the presence of out-of-order and speculative execuation, I just wonder is there any points to do this kind of low-level stuff? We are mostly invovled in high performance computing, so we can really focus on one very specific CPU type to do our optimzation, but I just dont know if there is any point to do this specific optimzation, anyone has any experience here, where to begin? are there any code examples for this kind of optimzation? many thanks!

I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this provided your high-level code and algorithms are optimized. In general, manual optimizing should only happen if you have hard evidence that there is an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.

Pipeline optimization will improve your programs performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes some time. This time could be devoted to data processing instructions.
Some platform independent methods for pipeline optimizations:
Reduce number of branches.
Use Boolean Arithmetic
Set up code to allow for conditional execution of instructions.
Unroll loops.
Make loops have short content (that can fit in a processor's cache
without loading).
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have
provided little performance improvement; optimize only the code that
is executed 80% of the time; find out by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization is to: Load data (into registers), operate on data, then write all data back to memory. The idea here is to trigger the processor's data cache loading circuitry before it processes the data (or registers).
If you can, organize the data to fit in one "line" of your processor's cache. Sequential locations require less time than random access locations.

There are always things that "help" vs. "hinder" the execution in the pipeline, but for most general purpose code that isn't highly specialized, I would expect that performance from compiled code is about as good as the best you can get without highly specialized code for each model of processor. If you have a controlled system, where all of your machines are using the same (or a small number of similar) processor model, and you know that 99% of the time is spent in this particular function, then there may be a benefit to optimizing that particular function to become more efficient.
In your case, it being HPC, it may well be beneficial to handwrite some of the low-level code (e.g. matrix multiplication) to be optimized for the processor you are running on. This does take some reasonable amount of understanding of the processor however, so you need to study the optimization guides for that processor model, and if you can, talk to people who've worked on that processor before.
Some of the things you'd look at is "register to register dependencies" - where you need the result of c = a + b to calculate x = c + d - so you try to separate these with some other useful work, such that the calculation of x doesn't get held up by the c = a + b calculation.
Cache-prefetching and generally caring for how the caches are used is also a useful thing to look at - not kicking useful cached data out that you need 100 instructions later, when you are storing the resulting 1MB array that won't be used again for several seconds can be worth a lot of processor time.
It's hard(er) to control these things when compilers decide to shuffle it around in it's own optimisation, so handwritten assembler is pretty much the only way to go.

Determine Values AND/OR Address of Values in CPU Cache

Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit a reading which shows how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine if certain approaches are effective.
Bottom line: is it possible to be 100% certain what does and does not make it into the CPU cache.
Searching for this topic returns several results on how to determine the cache size, but not contents.
Edit: To clarify some of the comments below: Since software would undoubtedly alter the cache, do CPU manufactures have a tool / hardware diagnostic system (built-in) which provides this functionality?

Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).

In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools it's unlikely that such a "debug" facility would be made available to regular mortals like you and I, but even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all other hotspots in the code and poor cache behavior by the CPU is the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.

If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.

To optimize for one specific CPU cache size is usually in vain since this optimization will break when your assumptions about the CPU cache sizes are wrong when you execute on a different CPU.
But there is a way out there. You should optimize for certain access patterns to allow the CPU to easily predict what memory locations should be read next (the most obvious one is a linear increasing read). To be able to fully utilize a CPU you should read about cache oblivious algorithms where most of them follow a divide and conquer strategy where a problem is divided into sub parts to a certain extent until all memory accesses fit completly into the CPU cache.
It is also noteworthy to mention that you have a code and data cache which are separate. Herb Sutter has a nice video online where he talks about the CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and L2 counters. These options are available when you select instrumentation profiling.
Intel has also a paper online which talks in greater detail about these CPU counters and what the task manager of Windows and Linux do show you and how wrong it is for todays CPUs which do work internally asynchronous and parallel at many diffent levels. Unfortunatley there is no tool from intel to display this stuff directly. The only tool I do know is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code you might look as well into GPU programming. You need at least a PHD to get your head around SIMD instructions, cache locality, ... to get perhaps a factor 5 over your original design. But by porting your algorithm to a GPU you get a factor 100 with much less effort ony a decent graphics card. NVidia GPUs which do support CUDA (all today sold cards do support it) can be very nicely programmed in a C dialect. There are even wrapper for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform agnostic by using OpenCL but NVidia OpenCL support is very bad. The OpenCL drivers are at least 8 times slower than its CUDA counterpart.

Almost everything you do will be in the cache at the moment when you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
For nearly all cases, CPU's do have a fairly good system of knowing what to keep and what to throw away in the cache, and the cache is nearly always "full" - not necessarily of useful stuff, if, for example you are working your way through an enormous array, it will just contain a lot of "old array" [this is where the "non-temporal" memory operations come in handy, as they allow you to read and/or write data that won't be stored in the cache, since next time you get back to the same point, it won't be in the cache ANYWAYS].
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read from suitable software [need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. In Linux, there's "perf". And AMD has a similar set of performance counters that can be used to find out, for example "how many cache misses have we had over this period of time" or "how many cache hits in L" have we had, etc.

Optimization Techniques for C++

In his talk a few days ago at Facebook - slides, video, Andrei Alexandrescu talks about common intuitions that might prove us wrong. For me one very interesting point came up on Slide 7 where he states that the assumption "Fewer instructions = faster code" is not true and more instructions will not necessarily mean slower code.
Here comes my problem: The audio quality of his talk (around 6:20min) is not that well and I don't understand the explanation very well, but from what I get is that he is comparing retired instructions with optimality of an algorithm on a performance level.
However, from my understanding this cannot be done because these are two independent structural levels. Instructions (especially actually retired instructions) are one very important measure and basically, gives you an idea about performance to achieve a goal. If we leave out the latency of an instruction, we can generalize that fewer retired instructions = faster code. Now, of course there are cases where an algorithm that performs complex calculations inside a loop will yield better performance even though it is performed inside the loop, because it will break the loop earlier (think graph traversal). But wouldn't it be more useful to compare to algorithms on a complexity level rather than saying this loop has more instructions and is better than the other? From my point of view, the better algorithm will have less retired instructions in the end.
Can someone please help me to understand where he was going with his example, and how can there be a case where (significantly) more retired instructions lead to better performance?

The quality is indeed bad, but I think he leads to the fact that CPUs are good for calculations, but suffer from bad performance for memory seek (RAM is much slower then CPU), and branches (because CPU works as a pipeline, and branches might cause the pipeline to break).
Here are some cases where more instructions are faster:
Branch prediction - even if we need to do more instructions, but it causes for a better branch prediction, the pipeline of the CPU will be full more time, and less ops will be "thrown out" of it, which ultimately leads to better performance. This thread for example, shows how doing the same thing, but first sorting - improves performnce.
CPU Cache - If your code is more cache optimized, and follows the principle of locality - it is more likely to be faster then a code who doesn't, even if the code that doesn't do half the amount of instructions. This thread gives an example for a small cache optimization - that the same number of instructions might result in much slower code if it is not cache optimized.
It also matters which instructions are done. Sometimes - some instructions might be slower to perform then others, for example - divide might be slower then integer addition.
Note: All of the above are machine dependent and how/if they actually change the performance might vary from one architecture to the other.

The number of instructions is not a good measure in itself.
Fewer retired instructions (because there is nothing more to do) = faster code.
Fewer retired instructions (because they have to wait for dependencies) = slower code.
It can sometimes be that more instructions in the code also means more retired instructions, because they can use up execution slots that would otherwise be wasted in case 2.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js