I'm currently writing a pass in opt that happens to create extra control flow, and as a result I also need to insert a lot of llvm::PHINode instructions. The end goal of my pass is to reduce code size and, as far as I can tell, the number of LLVM instructions after I run it is lower. However, in most cases I don't see a significant reduction in code size, and sometimes I even see an increase (even though the total number of LLVM instructions is smaller). I've been trying to find a reference for how PHINode instructions are implemented on x86/amd64, but without luck. The obvious solution would be to go through the source and find out myself, but I can't invest that much time in investigating this issue. Any help would be much appreciated.
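To make the question concrete, here is the kind of merge my pass keeps introducing. My own (possibly wrong) understanding of the lowering is in the comments; I'd like confirmation of whether it is roughly right:

// The kind of value merge my pass introduces; my assumed lowering is in the comments.
int select_one(bool cond, int a, int b) {
    int x;
    if (cond)
        x = a + 1;   // IR: %x.then = add i32 %a, 1
    else
        x = b - 1;   // IR: %x.else = sub i32 %b, 1
    // IR: %x = phi i32 [ %x.then, %then ], [ %x.else, %else ]
    // My assumption: the phi itself is not an x86 instruction; at worst each
    // predecessor block ends with an extra register-to-register mov so that
    // both paths leave the value in the same register.
    return x;
}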
Say that we have two C++ code segments for doing the same task. How can we determine which code will run faster?
As an example, let's say there is this global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways (I do not want to alter the contents of the array):
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i];
I assume the second one is faster, but I can't measure the time to make sure because the difference would be negligible.
So in this type of situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code, or a set of lines, and see how many assembly instructions it produces?
I would appreciate your thoughts on this matter.
The basic approach is to run the piece of code so many times that it takes at least a few seconds to complete, and measure the time.
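A minimal harness for that (assuming you can wrap the snippet in a function; work() below is just a stand-in) could look like this:

#include <chrono>
#include <cstdio>

// Stand-in for the snippet being measured.
long work(long i) { return i * 3 + 1; }

int main() {
    const long reps = 200000000;
    volatile long sink = 0;                 // keep the result observable so it isn't optimized away
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < reps; ++i)
        sink = sink + work(i);              // repeat until the total takes a few seconds
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("%.3f s total, %.2f ns per call\n", seconds, seconds * 1e9 / reps);
    return 0;
}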
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Today's compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it; when the same line appears in a different context, the optimizations applied may be different.
Short pieces of code can be much faster than the looping code that surrounds them in the measurement, so the loop overhead can dominate the result.
It's not only the compiler that makes optimizations: the processor has a cache and an instruction pipeline, and it tries to predict branches. A value that has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code over seemingly optimal code. Choose the right algorithm; this also depends on your input sizes. For example, insertion sort can be faster than quicksort if the input is very small. But don't write your own sorting code if your input is not special; in general, use the available libraries. And don't optimize prematurely.
Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
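For example, if you only need to read the element, a const reference avoids the copy entirely while keeping value-like syntax (a sketch; the struct layout here is made up):

struct some_struct_type { int id; double weight; char payload[64]; };  // made-up layout
some_struct_type numbers[1000];                                        // the global array from the question

double read_weight(int i) {
    const some_struct_type& val = numbers[i];   // no copy; effectively just an address
    return val.weight;                          // the array contents are never modified
}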
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
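As a small illustration (a sketch, not a benchmark), both functions below add up the same elements, but the first walks memory sequentially and usually stays in cache, while the second strides across it and typically incurs far more cache misses:

const int N = 2048;
static double grid[N][N];

double sum_row_major() {            // sequential, cache-friendly accesses
    double s = 0.0;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            s += grid[r][c];
    return s;
}

double sum_column_major() {         // strided accesses, many more cache misses
    double s = 0.0;
    for (int c = 0; c < N; ++c)
        for (int r = 0; r < N; ++r)
            s += grid[r][c];
    return s;
}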
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.
In order to check whether AVX or SSE is available, the compiler usually defines a corresponding macro. However, is there an option to obtain the size of the cache at compile time?
Edit:
I have rephrased the question a little, since it was not clear enough.
I want information about the cache at compile time - the more, the better. I would like to have it for optimization purposes (i.e. cache blocking, ...). In my current task - spatial blocking - the size of the cache is what matters most. But the comments below rightly asked which level of cache I am asking about. In addition, caches can behave very differently if you consider their eviction strategy, cache line size, number of levels, how they are shared between cores... The list goes on.
So my general question: how do I receive any information about the cache at compile time?
For my current task, it would be sufficient to read out /proc/cpuinfo and use the cache size given there. However, the general question is far more interesting.
How do I receive information about the CPU (with a focus on its cache) at compile time?
(I am not considering cross-compiling at all. The compiled code will be run on the SAME machine.)
Apparently, no compiler is capable of having such detailed information about a CPU AND making it available to the program being compiled. However, I have found a library which seems to do exactly that. Sadly, again only at runtime; still, it is at least a comfortable solution.
(I found it by accident. I was not looking for it :) If you ever look for something, 1) google 2) look here)
You can compile several versions of your library for various cache sizes and then pick them dynamically. That is the only possible approach for adding cache size info at compile-time.
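A minimal sketch of that idea follows (the block sizes are arbitrary, and _SC_LEVEL1_DCACHE_SIZE is a glibc extension to sysconf that may not exist on every platform):

#include <cstddef>
#include <unistd.h>     // sysconf; POSIX/glibc

template <std::size_t BLOCK>
void blocked_kernel(double* data, std::size_t n) {
    for (std::size_t i = 0; i < n; i += BLOCK)               // BLOCK is a compile-time constant
        for (std::size_t j = i; j < i + BLOCK && j < n; ++j)
            data[j] *= 2.0;                                   // stand-in for the real blocked work
}

void run(double* data, std::size_t n) {
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);                // runtime query; may return -1 or 0
    if (l1 >= 64 * 1024)      blocked_kernel<8192>(data, n);  // pick the version built for this size
    else if (l1 >= 32 * 1024) blocked_kernel<4096>(data, n);
    else                      blocked_kernel<1024>(data, n);
}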
Each CPU instruction occupies a number of bytes. The smaller the encoding, the more instructions can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of far jumps (jumps to code at distant addresses). Because the offset is a smaller number, the type used to encode it is smaller and the overall instruction is smaller.
I thought GCC's __builtin_expect might reduce jump instruction sizes by putting unlikely code further away.
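For example, this is the kind of hint I mean (whether it actually shrinks the hot path depends on the compiler and flags):

// GCC/Clang: mark the error path as unlikely so the compiler can move it out
// of the fall-through path (often into a cold section at -O2 and above).
int process(int value) {
    if (__builtin_expect(value < 0, 0)) {   // expected to be false
        return -1;                          // rarely taken error handling
    }
    return value * 2;                       // hot path stays fall-through
}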
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t, because int32_t is the native CPU integer size and therefore gives more efficient CPU instructions.
Or is this something which can only be done while writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow-up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get the Code Working Robustly & Correctly First
Debugging optimized code is more difficult than debugging code that is not optimized: with optimization off, the symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate or rearrange code, which gets the generated code out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, Intel processors have block instructions (which perform operations on blocks of data). These block instructions perform better than loops of small instructions, yet they take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors use larger instructions to communicate with other processors, such as a floating point coprocessor. Reducing the amount of floating point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches are changes of the execution path to a new location, such as loops and function calls. Processors love data instructions, because they don't have to reload the instruction pipeline. Increasing the proportion of data instructions and reducing the quantity of branches will improve performance, usually regardless of the instruction sizes.
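As a small, compiler-dependent example of trading a branch for data instructions, a plain select is often lowered to a conditional move on x86 (not guaranteed; check the generated assembly):

// Branchy version: the CPU has to predict the comparison.
int clamp_branchy(int x, int hi) {
    if (x > hi)
        return hi;
    return x;
}

// Select version: compilers frequently emit a single cmov here instead of a jump.
int clamp_select(int x, int hi) {
    return (x > hi) ? hi : x;
}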
Is there a way I could write a "tool" which could analyse the x86 assembly language produced from a C/C++ program and measure the performance in such a way that it wouldn't matter whether I ran it on a 1GHz or a 3GHz processor?
I am thinking more along the lines of instruction throughput. How could I write such a tool? Would it be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. If you want static analysis, you'll be in "undecidable land" as the other answer pointed out (you can't even detect whether a given piece of code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions, and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in the L1 data cache and have 4-cycle latency; all instructions are in the L1 instruction cache, or in the trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details of all the rest differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they take (and the resources they occupy during this time) varies as well. The accuracy of branch prediction is hard to predict, and different processors take different numbers of cycles to recover from a misprediction. Caches are different sizes, take different times to access, and have different algorithms for deciding what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. The more you can narrow down the processor you are targeting, and the more you constrain the code you are evaluating, the better you can predict how the code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly little-known) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
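If you try it, the usual workflow (as far as I remember it; treat the details as approximate) is to bracket the loop of interest with the markers from iacaMarks.h, compile, and point the analyzer at the object file:

#include "iacaMarks.h"   // ships with IACA; defines IACA_START and IACA_END

void kernel(float* a, const float* b, const float* c, int n) {
    for (int i = 0; i < n; ++i) {
        IACA_START                   // begin the analyzed region
        a[i] = b[i] * c[i] + a[i];
    }
    IACA_END                         // end the analyzed region
}
// Then run something like: iaca -arch HSW kernel.o (exact flags depend on the IACA version).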
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.
I built a Fortran code with Intel 11.1, using the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't part of my code. I assume they were put there by Intel. They include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, the code doesn't spend much time in them, except that the results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.
Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but deep inside, several subroutine layers down, guess what - it's doing I/O.)
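If you want to convince yourself of the difference, a tiny sketch like this (the sleep is just a stand-in for blocking I/O) shows a CPU clock and a wall clock disagreeing:

#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

int main() {
    std::clock_t cpu0 = std::clock();                       // CPU time
    auto wall0 = std::chrono::steady_clock::now();          // wall-clock time

    std::this_thread::sleep_for(std::chrono::seconds(2));   // stand-in for blocking I/O

    double cpu  = double(std::clock() - cpu0) / CLOCKS_PER_SEC;
    double wall = std::chrono::duration<double>(std::chrono::steady_clock::now() - wall0).count();
    std::printf("cpu: %.2f s, wall: %.2f s\n", cpu, wall);  // a CPU-time profiler only sees the first number
    return 0;
}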
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.