Each CPU instruction consumes a number of bytes. The smaller the instructions, the more of them can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of far jumps (jumps to code at distant addresses). When the jump target is nearby, the offset fits in a smaller operand type and the overall instruction encoding is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t, because it is the native CPU integer size and therefore yields more efficient CPU instructions.
Or is this something which can only be done whilst writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get the Code Working Robustly & Correctly First
Debugging optimized code is more difficult than debugging unoptimized code: with optimization off, the symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate or rearrange code, which gets the executable out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits the instructions that it thinks will have the best performance according to the optimization settings.
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, Intel processors have block instructions (which perform operations on blocks of data). These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors will use large instructions to communicate with other processors, such as a floating point processor. Reducing the floating point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches change execution to a new location, as with loops and function calls. Processors love data instructions, because they don't have to reload the instruction pipeline. Increasing the proportion of data instructions and reducing the quantity of branches will improve performance, usually regardless of the instruction sizes.
Related
Say that we have two C++ code segments, for doing the same task. How can we determine which code will run faster?
As an example, let's say there is a global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways (I do not want to alter the content of the array):
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i];
I assume the second one is faster, but I can't measure the time to make sure because the difference would be negligible.
So in this type of situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code (or a set of lines) and view how many assembly instructions it produces?
I would appreciate your thoughts on this matter.
The basics are to run the piece of code so many times that it takes a few seconds at least to complete, and measure the time.
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Today's compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it. When the same line appears in a different context, the optimizations applied may be different.
A short piece of code can be much faster than the looping code you wrap around it to measure it, so the loop overhead can dominate the measurement.
It's not only the compiler that optimizes: the processor has a cache and an instruction pipeline, and tries to predict branches. A value which has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code to seemingly optimal code. Choose the right algorithm; this also depends on your input sizes. For example, insertion sort can be faster than quicksort if the input is very small. But don't write your own sorting code; if your input is not special, use the available library routines. And don't optimize prematurely.
Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.
Is there a way I could write a "tool" which could analyse the produced x86 assembly language from a C/C++ program and measure the performance in such a way that it wouldn't matter if I ran it on a 1GHz or 3GHz processor?
I am thinking more along the lines of instruction throughput. How could I write such a tool? Would it even be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. If you want static analysis you'll be in "undecidable land", as the other answer pointed out (you can't even detect whether a given piece of code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, that all memory reads hit the L1 data cache with 4-cycle latency, or that all instructions are already in the L1 instruction cache or trace cache) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details of all the rest differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they take (and the resources they occupy during that time) varies as well. Accuracy of branch prediction is hard to predict, and different processors take different numbers of cycles to recover from a misprediction. Caches are different sizes, take different times to access, and have different algorithms for deciding what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. The more you can narrow down the processor you are targeting, and the more you constrain the code you are evaluating, the better you can predict how the code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly unknown) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.
A very experienced programmer from another company told me about some low-level code-optimization tips that target specific CPUs, including pipeline optimization, which means arranging the code (inline assembly, obviously) in a special order so that it fits the pipeline of the target hardware better.
With the presence of out-of-order and speculative execution, I just wonder whether there is any point in doing this kind of low-level work. We are mostly involved in high-performance computing, so we really can focus on one very specific CPU type for our optimization, but I just don't know whether such specific optimization is worthwhile. Does anyone have any experience here? Where should I begin? Are there any code examples for this kind of optimization? Many thanks!
I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this provided your high-level code and algorithms are optimized. In general, manual optimizing should only happen if you have hard evidence that there is an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.
Pipeline optimization will improve your program's performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes some time. This time could be devoted to data processing instructions.
Some platform independent methods for pipeline optimizations:
Reduce number of branches.
Use Boolean Arithmetic
Set up code to allow for conditional execution of instructions.
Unroll loops.
Make loops have short content (so that it fits in the processor's cache without reloading).
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have provided little performance improvement; optimize only the code that is executed 80% of the time, and find out which code that is by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization is to load data (into registers), operate on the data, then write it all back to memory. The idea here is to trigger the processor's data cache loading circuitry before it processes the data.
If you can, organize the data to fit in one "line" of your processor's cache. Sequential locations require less time than random access locations.
There are always things that "help" vs. "hinder" the execution in the pipeline, but for most general purpose code that isn't highly specialized, I would expect that performance from compiled code is about as good as the best you can get without highly specialized code for each model of processor. If you have a controlled system, where all of your machines are using the same (or a small number of similar) processor model, and you know that 99% of the time is spent in this particular function, then there may be a benefit to optimizing that particular function to become more efficient.
In your case, it being HPC, it may well be beneficial to handwrite some of the low-level code (e.g. matrix multiplication) to be optimized for the processor you are running on. This does take some reasonable amount of understanding of the processor however, so you need to study the optimization guides for that processor model, and if you can, talk to people who've worked on that processor before.
Some of the things you'd look at is "register to register dependencies" - where you need the result of c = a + b to calculate x = c + d - so you try to separate these with some other useful work, such that the calculation of x doesn't get held up by the c = a + b calculation.
Cache-prefetching and generally caring for how the caches are used is also a useful thing to look at - not kicking useful cached data out that you need 100 instructions later, when you are storing the resulting 1MB array that won't be used again for several seconds can be worth a lot of processor time.
It's hard(er) to control these things when the compiler shuffles the code around during its own optimization, so handwritten assembler is pretty much the only way to go.
I am learning assembly and doing some inline assembly in my Digital Mars C++ compiler. I searched for ways to make a program better and found these parameters to tune:
use a better C++ compiler // thinking of GCC or the Intel compiler
use assembly only in critical part of program
find better algorithm
Cache miss, cache contention.
Loop-carried dependency chain.
Instruction fetching time.
Instruction decoding time.
Instruction retirement.
Register read stalls.
Execution port throughput.
Execution unit throughput.
Suboptimal reordering and scheduling of micro-ops.
Branch misprediction.
Floating point exception.
I understood all except "register read stalls".
Question: Can anybody tell me how this happens in the CPU, and what the "superscalar" form of "out-of-order execution" is?
Normal "out-of-order" seemed logical, but I couldn't find a logical explanation of the "superscalar" form.
Question 2: Can someone also give a good instruction list for SSE, SSE2, and newer CPUs, preferably with micro-op tables, port throughputs, execution units, and latency tables for finding the real bottleneck of a piece of code?
I would be happy with a small example like this:
//loop carried dependency chain breaking:
__asm
{
loop_begin:
....
....
sub edx,05h //rather than taking i*5 in each iteration, we sub 5 each iteration
sub ecx,01h //i-- counter
...
...
jnz loop_begin // edit: the sub ecx must come after the sub edx, so that jnz tests the counter's flags
}
// sub edx gets rid of a multiplication and also makes the address computation independent of ecx
Thank you.
Computer: Pentium-M 2GHz , Windows XP-32 bit
You should take a look at Agner Fog's optimization manuals: Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms or Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
But to really be able to outsmart a modern compiler, you need some good background knowledge of the arch you want to optimize for: The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
My two cents: Intel Architecture Developers Manuals
Really detailed, there are all SSE instructions as well, with opcodes, instruction latency and throughput, and all gory details you might need :)
"Superscalar" stalls are an added problem in scheduling instructions. A modern processor can not only execute instructions out of order, it can also execute 3-4 simple instructions at a time, using parallel execution units.
But to actually do that, the instructions must be sufficiently independent of each other. If, for example, one instruction uses the result of a previous instruction, it must wait for that result to be available.
In practice, this makes creating an optimal assembly program by hand extremely difficult. You really have to be like a computer (compiler) to calculate the optimal order of the instructions. And if you change one instruction, you have to do it all over again....
For question #1 I would highly recommend Computer Architecture: A Quantitative Approach. It does a very good job of explaining the concepts in context, so you can see the big picture. The examples are also very useful for a person who is interested in optimizing code, because they always focus on prioritizing and improving the bottleneck.
Newer ARM processors include the PLD and PLI instructions.
I'm writing tight inner loops (in C++) which have a non-sequential memory access pattern, but one that my code naturally fully understands. I would anticipate a substantial speedup if I could prefetch the next location whilst processing the current one, and trying it out seems quick enough to be worth the experiment!
I'm using new, expensive compilers from ARM, and they don't seem to emit PLD instructions anywhere, let alone in this particular loop that I care about.
How can I include explicit prefetch instructions in my C++ code?
You need compiler-specific features; there is no standard way to do it in C/C++. Check your compiler's reference guide. For the RealView compiler, see this or this.
If you are trying to extract truly maximum performance from these loops, then I would recommend writing the entire looping construct in assembler. You should be able to use inline assembly, depending on the data structures involved in your loop. Even better if you can unroll any piece of your loop (like the parts involved in making the access non-sequential).
At the risk of asking the obvious: have you verified the compiler's target architecture? For example (humor me), if by default the compiler is targeted to ARM7, you're never going to see the PLD instruction.
It is not outside the realm of possibility that other optimizations like software pipelining and loop unrolling may achieve the same effect as your prefetching idea (hiding the latency of the loads by overlapping it with useful computation), but without the extra instruction-cache pressure caused by the extra instructions. I would even go so far as to say that this is the case more often than not for tight inner loops, which tend to have few instructions and little control flow. Is your compiler doing these types of traditional optimizations instead? If so, it may be worth looking at the pipeline diagram to develop a more detailed cost model of how your processor works, and evaluate more quantitatively whether prefetching would help.