Benefit of LLVM's SelectInst

LLVM has a SelectInst that is used to represent expressions like something = cond ? true-part : false-part.
What is the benefit of this instruction in the IR, as ?: could also always be lowered to a BranchInst by the compiler? Are there CPUs that support such instructions? Or is select lowered to jumps by the CodeGenerator anyway?
I reckon there may be benefits for analysis passes, as the select guarantees two "branches" of the implicit if. But on the other hand, compilers are not required to use the instruction at all, so these passes must be able to deal with plain br instructions anyway.

Yes, you can always use a conditional branch instead of a select instruction, but a select has several advantages:
There are indeed relevant CPU instructions to lower those into, the most obvious example in x86 being cmov and the various setcc instructions.
A select is a lot easier to vectorize - in fact, one of the usual phases of vectorization is "if conversion", the process of converting control flow (a conditional branch) to data flow (a select).
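To make this concrete (my own illustration, not from the question): clang will typically represent a simple ternary like the one below as a select in the IR, and the x86 backend can then lower it to cmov instead of a branch.

// Hypothetical example: a branchless minimum.
// At -O1 and above, clang typically emits something like
//   %cmp = icmp slt i32 %a, %b
//   %res = select i1 %cmp, i32 %a, i32 %b
// and the x86 backend can lower that select to cmovl rather than
// a conditional branch.
int imin(int a, int b) {
    return a < b ? a : b;
}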

Related

Can I improve branch prediction with my code?

This is a naive general question open to any platform, language, or compiler. Though I am most curious about Aarch64, C++, GCC.
When coding an unavoidable branch in program flow dependent on I/O state (compiler cannot predict), and I know that one state is much more likely than another, how do I indicate that to the compiler?
Is this better
if(true == get(gpioVal))
unlikelyFunction();
else
likelyFunction();
than this?
if(true == get(gpioVal))
likelyFunction(); // performance critical, fill prefetch caches from this branch
else
unlikelyFunction(); // missed prediction not consequential on this branch
Does it help if the communication protocol makes the more likely or critical value true(high), or false(low)?
TL;DR: Yes, in C or C++, use a likely() macro, or C++20 [[likely]], to help the compiler make better asm. That's separate from influencing actual CPU branch prediction, though. If writing in asm, lay out your code to minimize taken branches.
For most ISAs, there's no way in asm to hint the CPU whether a branch is likely to be taken or not. (Some exceptions include Pentium 4 (but not earlier or later x86), PowerPC, and some MIPS, which allow branch hints as part of conditional-branch asm instructions.)
Is it possible to tell the branch predictor how likely it is to follow the branch?
But not-taken straight-line code is cheaper than taken, so hinting the compiler from a high-level language to lay out code with the fast path contiguous doesn't help branch prediction accuracy, but can help (or hurt) performance. (I-cache locality, front-end bandwidth: remember that code fetch happens in contiguous 16 or 32-byte blocks, so a taken branch means a later part of that fetch block isn't useful. Also branch-prediction throughput: some CPUs, Intel Skylake for example, can't handle a predicted-taken branch at more than 1 per 2 clocks, other than loop branches. That includes unconditional branches like jmp or ret.)
Taken branches are harder than not-taken ones: a correctly predicted not-taken branch is just a normal instruction for an execution unit (verifying the prediction), with nothing special for the front-end. See also Modern Microprocessors: A 90-Minute Guide!, which has a section on branch prediction. (And is overall excellent.)
What exactly happens when a skylake CPU mispredicts a branch?
Avoid stalling pipeline by calculating conditional early
How does the branch predictor know if it is not correct?
Many people misunderstand source-level branch hints as branch-prediction hints. That could be one effect when compiling for a CPU that supports branch hints in asm, but for most CPUs the significant effects are code layout and the decision whether to use branchless code (cmov) or not; a [[likely]] condition also tells the compiler the branch should predict well.
With some CPUs, especially older ones, the layout of a branch did sometimes influence runtime prediction: if the CPU didn't remember anything about the branch in its dynamic predictors, the standard static-prediction heuristic is that forward conditional branches are not-taken and backward conditional branches are assumed taken (because that's normally the bottom of a loop). See the BTFNT section in https://danluu.com/branch-prediction/.
A compiler can lay out an if(c) x else y; either way: matching the source, with a jump over x if !c, or swapping the if and else blocks and using the opposite branch condition. Or it can put one block out of line (e.g. after the ret at the end of the function), so the fast path has no taken branches, conditional or otherwise, while the less likely path has to jump there and then jump back.
It's easy to do more harm than good with branch hints in high-level source, especially if surrounding code changes without paying attention to them, so profile-guided optimization is the best way for compilers to learn about branch predictability and likelihood. (e.g. gcc -O3 -fprofile-generate / run with some representative inputs that exercise code-paths in relevant ways / gcc -O3 -fprofile-use)
But there are ways to hint in some languages, like C++20 [[likely]] and [[unlikely]], which are the portable version of GNU C likely() / unlikely() macros around __builtin_expect.
C++20 [[likely]]: https://en.cppreference.com/w/cpp/language/attributes/likely
How to use C++20's likely/unlikely attribute in if-else statement syntax help
Is there a compiler hint for GCC to force branch prediction to always go a certain way? (to the literal question, no. To what's actually wanted, branch hints to the compiler, yes.)
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? The GNU C macros using __builtin_expect, same effect but different syntax than C++20 [[likely]]
What is the advantage of GCC's __builtin_expect in if else statements? example asm output. (Also see CiroSantilli's answers to some of the other questions where he made examples.)
Simple example where [[likely]] and [[unlikely]] affect program assembly?
I don't know of ways to annotate branches for languages other than GNU C / C++, and ISO C++20.
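To make the syntax concrete, here is a sketch of both styles, reusing the gpioVal / likelyFunction names from the question (the macro definitions are the usual GNU C idiom, not part of any standard library):

// GNU C / GCC / clang style: wrap the condition with __builtin_expect.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

if (unlikely(get(gpioVal))) {
    unlikelyFunction();    // laid out off the fast path
} else {
    likelyFunction();      // fast path kept contiguous, falls through
}

// C++20 attribute style: annotate the branch, not the condition.
if (get(gpioVal)) [[unlikely]] {
    unlikelyFunction();
} else [[likely]] {
    likelyFunction();
}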
Absent any hints or profile data
Without that, optimizing compilers have to use heuristics to guess which side of a branch is more likely. If it's a loop branch, they normally assume that the loop will run multiple times. On an if, they have some heuristics based on the actual condition and maybe what's in the blocks being controlled; IDK I haven't looked into what gcc or clang do.
I have noticed that GCC does care about the condition, though. It's not as naive as assuming that int values are uniformly randomly distributed, although I think it normally assumes that if (x == 10) foo(); is somewhat unlikely.
JIT compilers like in a JVM have an advantage here: they can potentially instrument branches in the early stages of running, to collect branch-direction information before making final optimized asm. OTOH they need to compile fast because compile time is part of total run time, so they don't try as hard to make good asm, which is a major disadvantage in terms of code quality.

Implementing a simulator for a subset of x86

I wish to implement a simulator for a subset of instructions for the x86 architecture. Given a binary, I wish to disassemble it and run a simulation on the instructions. For that, one would need to look at certain bits of an instruction to decide whether it is a control instruction, arithmetic instruction or a logical instruction and based on that, one must derive the parameters of the operation by looking at the remaining bits. One obvious yet painful way to implement this is by using nested if-else/switch-case statements. Can someone suggest a better methodology for implementing this?
Use a lookup table, perhaps in the form of a std::map.
You can look at the source of an x86 emulator to find an implementation of this idea, already fully written and fleshed out.
Here's one you might try: http://www.dosbox.com/wiki/BuildingDOSBox#1._Grab_the_source
Let me know if this doesn't work out; there are lots to choose from.
In general, with an emulator, I would think that a switch on the opcode would be one way to go. Another good approach would be a 256-entry array of function pointers, corresponding to the first byte of the instruction. That gives a little more separation than a giant switch or if block. Of course you can reuse the functions as needed.
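A minimal sketch of the 256-entry table idea (all names here are mine, not from any particular emulator):

#include <cstdint>
#include <cstdio>

struct Cpu { /* registers, flags, memory, ... */ };

// One handler per possible first opcode byte.
using OpHandler = void (*)(Cpu& cpu, const uint8_t* code);

static void op_nop(Cpu&, const uint8_t*) { /* 0x90: do nothing */ }
static void op_unimplemented(Cpu&, const uint8_t* code) {
    std::printf("unimplemented opcode 0x%02x\n", code[0]);
}

// Dispatch table indexed by the first instruction byte.
static OpHandler dispatch[256];

static void init_dispatch() {
    for (auto& h : dispatch) h = op_unimplemented;
    dispatch[0x90] = op_nop;   // fill in the opcodes you actually support
}

static void step(Cpu& cpu, const uint8_t* code) {
    dispatch[code[0]](cpu, code);   // one indirect call replaces a giant switch
}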
Doing a nested if/else type construct should be fine if you cache the output of the translation. In a simulation, the program has relatively few static instructions compared to the number of dynamic instructions executed, so the same instructions get decoded over and over. So the best performance optimization is to cache the output of the translation and then reuse it when the dynamic instruction executes. Eventually your cache will fill up and you will need to clear it for new entries. But it makes more sense to cache the translation somehow, rather than try to come up with a really fast method of doing the translation in the first place.
As an example, QEMU is an emulator, optimized for performance, that supports a variety of targets. You can see how they translate x86 instructions here:
https://github.com/qemu/QEMU/blob/master/target-i386/translate.c#L4076
If QEMU did this for every executed instruction, performance would be very slow. But since they cache the results, it does not matter too much that the first time an instruction is translated there is a complex case statement.
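A much-simplified sketch of the caching idea (my own illustration; QEMU's real translation cache works on whole basic blocks and is far more elaborate): decode an instruction the first time its address is executed, then reuse the decoded form on every later execution.

#include <cstdint>
#include <unordered_map>

struct Decoded {          // hypothetical decoded-instruction record
    int opcode;
    int length;           // bytes consumed, so the next address is known
};

// Stand-in for the slow nested if/else or switch decoder.
static Decoded decode_slow(const uint8_t* code) { return {code[0], 1}; }

static std::unordered_map<uint64_t, Decoded> cache;   // keyed by guest address

static const Decoded& decode_cached(uint64_t addr, const uint8_t* code) {
    auto it = cache.find(addr);
    if (it == cache.end())                                  // first execution:
        it = cache.emplace(addr, decode_slow(code)).first;  // pay the decode cost once
    return it->second;                                      // later executions: a lookup
}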

What do SSE instructions optimize in practice, and how does the compiler enable and use them?

SSE and/or 3DNow! have vector instructions, but what do they optimize in practice? Are 8-bit characters treated 4 by 4 instead of 1 by 1, for example? Are there optimisations for some arithmetical operations? Does the word size have any effect (16 bits, 32 bits, 64 bits)?
Do all compilers use them when they are available?
Does one really have to understand assembly to use SSE instructions? Does knowing about electronics and gate logic help in understanding this?
Background: SSE has both vector and scalar instructions. 3DNow! is dead.
It is uncommon for any compiler to extract a meaningful benefit from vectorization without the programmer's help. With programming effort and experimentation, one can often approach the speed of pure assembly, without actually mentioning any specific vector instructions. See your compiler's vector programming guide for details.
There are a couple portability tradeoffs involved. If you code for GCC's vectorizer, you might be able to work with non-Intel architectures such as PowerPC and ARM, but not other compilers. If you use Intel intrinsics to make your C code more like assembly, then you can use other compilers but not other architectures.
Electronics knowledge will not help you. Learning the available instructions will.
In the general case, you can't rely on compilers to use vectorized instructions at all. Some do: Intel's C++ compiler does a reasonable job of it in many simple cases, and GCC attempts to do so too, with mixed success.
But the idea is simply to apply the same operation to four 32-bit words (or two 64-bit values in some cases).
So instead of the traditional `add` instruction, which adds together the values from two different 32-bit-wide registers, you can use a vectorized add, which uses special 128-bit-wide registers each holding four 32-bit values, and adds the corresponding values together as a single operation.
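For instance, with Intel's SSE2 intrinsics (my own sketch; the intrinsics map almost one-to-one onto the instructions), four 32-bit additions happen in a single paddd:

#include <emmintrin.h>   // SSE2 intrinsics

// Adds four pairs of 32-bit ints with one paddd instruction instead of
// four separate scalar adds.
void add4(const int* a, const int* b, int* out) {
    __m128i va  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i sum = _mm_add_epi32(va, vb);    // four adds at once
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}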
Duplicate of other questions:
Using SSE instructions
In short, SSE is short for Streaming SIMD Extensions, where SIMD = Single Instruction, Multiple Data. This is useful for performing a single mathematical or logical operation on many values at once, as is typically done for matrix or vector math operations.
The compiler can target this instruction set as part of its optimizations (research your /O options); however, you typically have to restructure code and either write SSE manually or use a library like Intel Performance Primitives to really take advantage of it.
If you know what you are doing, you might get a huge performance boost. See for example here, where this guy improved the performance of his algorithm sixfold.

Number of cycles taken for C++ or ANSI C?

Is there anywhere on the web where I can get an idea of what various programming-language constructs cost in terms of processor (Core i7 and Core 2) cycles? At university I learnt the ARM assembly language and we could map the number of cycles taken to do a subtraction operation etc. I just wondered if it's possible to do this with a higher-level language on the Core i7 or Core 2?
No. That's completely dependent on the compiler you use, and what optimization settings you use, etc.
You can use your favorite compiler and settings to generate assembly code, and from the assembly code you can make these kinds of predictions.
However, remember that on modern architectures things like memory latency and register renaming have large effects on speed, and these effects are not obvious even from inspection of the assembly code.
In general, in higher-level languages, individual statements don't map cleanly onto specific sequences of machine-code instructions. The compiler will typically optimise things, which involves various transformations, rearrangements, and even eliminations of instructions. Therefore, it's not usually meaningful to quote metrics like "a for expression takes 20 cycles".
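For instance (a made-up illustration), modern compilers will typically fold this whole loop into a constant, so the question "how many cycles does a for loop take" has no fixed answer:

// At -O2, GCC and clang typically reduce the entire loop to
// "return 499500;", so the for loop costs zero cycles at run time.
int sum_to_1000() {
    int s = 0;
    for (int i = 0; i < 1000; ++i)
        s += i;
    return s;
}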
You have to map higher level instructions into assembly instructions manually, or look at the assembly listing. And then look here
http://gmplib.org/~tege/x86-timing.pdf
or here
http://www.intel.com/Assets/PDF/manual/248966.pdf

Producing the fastest possible executable

I have a very large program which I have been compiling under Visual Studio (v6, then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little I/O.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "properties" dialog.
EDIT: I already make extensive use of profilers.
Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond optimization options, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially if it gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".
Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.
1) Reduce aliasing by using __restrict (a short sketch of this, combined with point 4, follows this list).
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.
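Here is a small sketch of points 1 and 4 combined (my own example; the __restrict spelling varies slightly by compiler, and OpenMP must be enabled with /openmp or -fopenmp):

// __restrict promises the compiler that a, b and out never alias, so it
// can keep values in registers and vectorize more aggressively.
// The #pragma asks OpenMP to split the loop iterations across threads.
void scale_add(const float* __restrict a,
               const float* __restrict b,
               float* __restrict out, int n, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}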
You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There are no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
Hope this helps,
-Tom
Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.
Check which /fp (floating-point) mode you are using. Each one generates quite different code, and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code Generation options).
Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This can make quite a big difference in performance, as the compiled code takes fewer cycles. Details are nicely covered in the blog SomeAssemblyRequired.
Since you are already profiling, I would suggest loop unrolling if it is not already happening. I have seen VS2008 fail to do it fairly often (templates, references, etc.).
Use __forceinline in hotspots if applicable.
Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.
You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.
Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving the performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific, but NVIDIA provides an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.
I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.
You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU core; on a two-core CPU, divide the integer set into two sets and give half to each core (a small sketch follows this list). This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.
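As a sketch of point 2 (my own illustration using std::thread; any threading facility would do), summing an integer set by giving half to each of two cores:

#include <cstdint>
#include <thread>
#include <vector>

// Split the data in two and let each core sum its half, then combine.
int64_t parallel_sum(const std::vector<int>& v) {
    const size_t mid = v.size() / 2;
    int64_t lo = 0, hi = 0;
    std::thread t([&] { for (size_t i = 0; i < mid; ++i) lo += v[i]; });
    for (size_t i = mid; i < v.size(); ++i) hi += v[i];
    t.join();                 // wait for the helper thread before combining
    return lo + hi;
}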
You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.