What does the SELECT instruction lower to in the ISA? - llvm

While vectorizing, if the loop contains 'if' constructs, LLVM tries to flatten them by replacing the branches with SELECT instructions to make the control flow straight; if the basic blocks cannot be turned into predicated instructions like SELECT, LLVM can't vectorize the loop. Up to now I have been assuming that there should be some machine instruction equivalent to the SELECT IR instruction. I searched for predicated instructions in the Intel architecture but didn't find any. Can someone please tell me whether current ISAs support predicated instructions? If not, how are SELECT instructions lowered into machine instructions? Please correct me if I have made any wrong assumptions.
Thanks in advance

Yes, several architectures support conditional/predicated execution. For example, the AArch64 ISA has csel.
It is quite common in VLIW architectures because they need to fill their instruction packets.
Predicated execution
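For a concrete (and purely illustrative) example of what this looks like from the C++ side, here is a branch the compiler can flatten into a select; the comments describe the typical lowering, which may differ by target and optimization level:
// Sketch only: a conditional the optimizer commonly turns into a select.
int clamp_to_limit(int x, int limit) {
    // In LLVM IR this usually becomes a `select` instruction. On x86-64 the
    // scalar select typically lowers to cmovcc; on AArch64 to csel. In a
    // vectorized loop it becomes a mask/blend (e.g. vblendvps with AVX).
    return (x > limit) ? limit : x;
}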

Related

What happens when I compile on a machine that supports avx2 and run the binary on another machine that only supports avx?

I compiled my C++ program on a machine that supports AVX2 (Intel E5-2643 V3). It compiles and runs just fine. I confirmed that AVX2 instructions are used: after disassembling the binary, I saw AVX2 instructions such as vpbroadcastd.
Then I ran this binary on another machine that only has the AVX instruction set (Intel E5-2643 V2). It also runs fine. Does the binary fall back to a backward-compatible AVX instruction instead? What would that instruction be? Do you see any potential issues?
There are multiple compilers and multiple settings you can use, but the general principle is that a compiler usually isn't targeting a particular processor; it's targeting an architecture, and by default it will usually take a fairly inclusive approach, meaning the generated code will be compatible with as many processors as reasonable. You would normally expect an x86_64 compiler to generate code that runs without AVX2, indeed code that should run on some of the earliest CPUs supporting the x86_64 instruction set.
If you have code that benefits greatly from instruction-set extensions that aren't universally supported, like AVX2, your aim when producing software is generally to degrade gracefully. For instance, you could use runtime feature detection to see if the current processor supports AVX2 and run a separate code path. Some compilers may support automated ways of doing this, or helpers to assist you in achieving it yourself.
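A minimal sketch of that kind of dispatch (my example, assuming GCC or Clang on x86-64; the function names are invented):
#include <cstddef>

// The target attribute lets this one function use AVX2 even in a baseline build.
__attribute__((target("avx2")))
static void add_avx2(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i]; // may auto-vectorize with AVX2
}

static void add_baseline(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i]; // plain x86-64 code
}

void add(float* dst, const float* src, std::size_t n) {
    // __builtin_cpu_supports consults CPUID, so the AVX2 path is only taken
    // on processors that actually report the feature.
    if (__builtin_cpu_supports("avx2"))
        add_avx2(dst, src, n);
    else
        add_baseline(dst, src, n);
}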
It's not rare to have AVX2 instructions in a binary that uses CPU detection to make sure it only runs them on CPUs that support them. (e.g. via cpuid and setting function pointers).
If the AVX2 instruction actually executed on a CPU without AVX2 support, it raises #UD, so the OS delivers SIGILL (illegal instruction) to your process, or the Windows equivalent.
There are a few cases where an instruction like lzcnt decodes as rep bsr, which runs as bsr on CPUs without BMI1. (Giving a different answer). But VEX-coded AVX2 instructions just fault on older CPUs.

C++ techniques for reducing CPU instruction sizes?

Each CPU instruction consumes a number of bytes. The smaller the size, the more instructions that can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of far jumps (jumps to code at distant addresses). Because the offset is a smaller number, it can be encoded in fewer bytes and the overall instruction is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
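For reference, a hedged sketch of how that hint is used (a GCC/Clang extension; whether it actually shrinks any encoding depends on the target and the compiler's layout decisions):
int process(int value) {
    // The second argument says the condition is expected to be false, so the
    // compiler can keep the hot path as a straight fall-through and move the
    // error handling out of line.
    if (__builtin_expect(value < 0, 0))
        return -1;        // cold path
    return value * 2;     // hot path
}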
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t, because that is the native CPU integer size and therefore gives more efficient CPU instructions.
Or is this something which can only be done whilst writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has variable-length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get the Code Working Robustly & Correctly First
Debugging optimized code is more difficult than debugging unoptimized code: without optimization, the symbols used by the debugger line up with the source code better. During optimization, the compiler can eliminate code, which gets the executable out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the small length instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
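One way to do that check (a sketch, assuming GCC or Clang; the file name is made up) is to optimize for size and read the compiler's own listing:
// Compile with:  g++ -Os -S -o - size_check.cpp
// -Os asks the compiler to prefer smaller encodings; -S emits the assembly so
// you can see which instructions it actually chose for this loop.
long sum(const int* v, long n) {
    long total = 0;
    for (long i = 0; i < n; ++i)
        total += v[i];
    return total;
}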
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high level language with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, Intel processors have block instructions that perform operations on blocks of data. These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller ones.
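As an illustration (my sketch, not the answer's code), a byte-at-a-time loop is built from small instructions, while a memcpy call leaves the compiler or library free to use a block instruction such as rep movsb or a wide vector loop, spending a few more code bytes for better throughput:
#include <cstring>
#include <cstddef>

// Small-instruction version: each iteration is a handful of short instructions.
void copy_loop(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

// The compiler or C library may lower this to a block instruction or vector code.
void copy_block(char* dst, const char* src, std::size_t n) {
    std::memcpy(dst, src, n);
}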
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors use large instructions to communicate with other processors, such as a floating point coprocessor. Reducing the floating point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches change execution to a new location, as with loops and function calls. Processors love straight runs of data instructions, because they don't have to reload the instruction pipeline. Increasing the proportion of data instructions and reducing the quantity of branches will improve performance, usually regardless of the instruction sizes.
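A small example of trading a branch for data instructions (my sketch; it assumes 32-bit int with arithmetic right shift, and ignores INT_MIN):
// Branchy version: the CPU must predict the sign test.
int abs_branchy(int x) {
    if (x < 0) return -x;
    return x;
}

// Branchless version: pure data instructions, nothing to mispredict. Compilers
// often generate similar code (or a cmov) for the branchy version anyway.
int abs_branchless(int x) {
    int mask = x >> 31;        // 0 for non-negative x, -1 for negative x
    return (x ^ mask) - mask;  // conditionally negates without a jump
}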

assembly / __asm inlining

I am learning assembly and doing some inlining with my Digital Mars C++ compiler. I looked into how to make a program faster and found these parameters to tune:
use a better C++ compiler // thinking of GCC or the Intel compiler
use assembly only in the critical parts of the program
find a better algorithm
Cache miss, cache contention.
Loop-carried dependency chain.
Instruction fetching time.
Instruction decoding time.
Instruction retirement.
Register read stalls.
Execution port throughput.
Execution unit throughput.
Suboptimal reordering and scheduling of micro-ops.
Branch misprediction.
Floating point exception.
I understood all except "register read stalls".
Question: Can anybody tell me how this happens in the CPU, and explain the "superscalar" form of "out of order execution"?
Normal "out of order" seemed logical, but I couldn't find a logical explanation of the "superscalar" form.
Question 2: Can someone also point me to a good instruction list for SSE, SSE2 and newer CPUs, preferably with micro-op tables, port throughputs, execution units and latency tables, so I can find the real bottleneck of a piece of code?
I would be happy with a small example like this:
//loop carried dependency chain breaking:
__asm
{
loop_begin:
....
....
sub edx,05h //rather than taking i*5 in each iteration, we sub 5 each iteration
sub ecx,01h //i-- counter
...
...
jnz loop_begin // edit: sub ecx must be the last flag-setting instruction before jnz, so jnz tests the loop counter
}
// the sub edx removes a multiplication from each iteration, and because it does not depend on ecx, the two subtractions are independent of each other
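The same idea at the C++ level (my sketch; the array and names are invented) is classic strength reduction, replacing the per-iteration multiply with an index that is stepped by 5:
// Instead of computing i*5 each iteration, keep a running offset and subtract
// 5 from it, mirroring the `sub edx,05h` in the inline assembly above.
long sum_every_fifth(const int* data, int count) {
    long total = 0;
    int offset = (count - 1) * 5;       // last offset we will touch
    for (int i = count; i != 0; --i) {  // i-- counter, like ecx above
        total += data[offset];
        offset -= 5;                    // independent of the loop counter
    }
    return total;
}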
Thank you.
Computer: Pentium-M 2GHz , Windows XP-32 bit
You should take a look at Agner Fog's optimization manuals: Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms or Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
But to really be able to outsmart a modern compiler, you need some good background knowledge of the arch you want to optimize for: The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
My two cents: the Intel Architecture Developer's Manuals.
Really detailed; all the SSE instructions are there as well, with opcodes, instruction latency and throughput, and all the gory details you might need :)
The "superscalar" stalls is an added problem for scheduling instructions. A modern processor can not only execute instructions out of order, it can also do 3-4 simple instructions at a time, using parallel execution units.
But to actually do that, the instructions must be sufficiently independent of each other. If, for example, one instruction uses the result of a previous instruction, it must wait for that result to be available.
In practice, this makes creating an optimal assembly program by hand extremely difficult. You really have to be like a computer (compiler) to calculate the optimal order of the instructions. And if you change one instruction, you have to do it all over again....
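A C++-level illustration of that point (my sketch): splitting a reduction into two accumulators gives the superscalar core two independent dependency chains it can run in parallel:
// Single accumulator: every add must wait for the previous add (one long chain).
float sum_single(const float* v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += v[i];
    return s;
}

// Two accumulators: the adds into s0 and s1 are independent, so a superscalar,
// out-of-order core can overlap them. (Assumes n is even to keep the sketch short.)
float sum_pair(const float* v, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    return s0 + s1;
}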
For question #1 I would highly recommend Computer Architecture: A Quantitative Approach. It does a very good job of explaining the concepts in context, so you can see the big picture. The examples are also very useful for a person who is interested in optimizing code, because they always focus on prioritizing and improving the bottleneck.

Measure how often a branch is mispredicted

Assuming I have an if-else branch in C++, how can I (in code) measure how often the branch is mispredicted? I would like to add some calls or macros around the branch (similar to how you do bottom-up profiling) that would report branch mispredictions.
It would be nice to have a generic method, but let's target an Intel i5 2500K for starters.
If you are using an AMD CPU, AMD's CodeAnalyst is just what you need (works on Windows and Linux)*.
If you're not, then you may need to fork out for a VTune licence, or build something yourself using the on-CPU performance registers and counters detailed in the instruction manuals.
You can also check out gperf & OProfile (Linux only) and see how well they perform (I've never used these, but I see them referred to quite a bit).
*CodeAnalyst should work on an Intel CPU, you just don't get all the nice CPU-level analysis.
I wonder if it would be possible to extract this information from g++ -fprofile-arcs? It has to measure exactly this in order to feed back into the optimizer in order to optimize branching.
OProfile
OProfile is pretty complex, but it can profile anything your CPU tracks.
Look through the Event Type Reference and look for your particular CPU.
For instance, here are the core2 events. After a quick search I don't see any event counters for missed branch prediction on the Core 2 architecture.
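On Linux you can also read the hardware branch-miss counter directly around the code in question via perf_event_open. A sketch (assuming a kernel with perf support and a perf_event_paranoid setting that allows self-monitoring; the measured loop is just a stand-in):
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>
#include <cstdint>

static int open_branch_miss_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;        // start stopped; enable around the region of interest
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // Measure the calling thread on any CPU.
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int fd = open_branch_miss_counter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // --- the if/else you want to measure goes here ---
    volatile int sink = 0;
    for (int i = 0; i < 1000000; ++i) {
        if (((unsigned)i * 2654435761u >> 16) & 1) sink += i; else sink -= i;
    }
    // --------------------------------------------------

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != (ssize_t)sizeof(misses)) misses = 0;
    std::printf("branch misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}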

What happens when Windows encounters an unknown instruction in a binary?

We have a binary compiled with SSE3 optimizations which ends up using the instruction LDDQU. When this code is executed on a Windows system (single core, XP2) which has only SSE1/SSE2 support (as seen through the CPU-Z tool), the application crashes.
(924.4f0): Invalid lock sequence - code c000001e (first chance)
...
001700a10 f20ff00430 lddqu xmm0,xmmword ptr [eax+esi] ds:0023:1e08d200=270a57364a4a77896db676459d8c40a9
...
Can some one enlighten me what does this crash signify and possible fixes?
An application is compiled with SSE3 support and crashes when run on a CPU not supporting SSE3. Gee, so strange! Compiler options for choosing an instruction set must be there just because some programmer at Microsoft was bored as hell one day.
You have several options:
make a single version of the application using SSE2 instruction set only
make different versions of the application compiled with different instruction sets
use structured exception handling (SEH) to implement user-mode emulation of unsupported instructions.
The last approach is a bit more time-consuming than the first two and has some performance issues, but those downsides are much smaller than the advantages it gives you. If you choose the third option, you will also be able to invent your own opcodes! That is a perfect way to obfuscate program control flow, which in turn is very useful for hindering reverse engineering of your program and thus protecting your IP.
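For the simpler end of that third option, here is a hedged sketch (MSVC-specific SEH; the transform_* functions are placeholders) that merely catches the fault once and falls back to an SSE2 path, rather than emulating the instruction as described above:
#include <windows.h>

// In a real project these would live in separate translation units, one built
// with SSE3 enabled and one restricted to SSE2; the bodies here are dummies.
static void transform_sse3(float* p, int n) { for (int i = 0; i < n; ++i) p[i] *= 2.0f; }
static void transform_sse2(float* p, int n) { for (int i = 0; i < n; ++i) p[i] *= 2.0f; }

static bool g_use_sse3 = true;

static int fault_filter(DWORD code) {
    // The dump in the question showed code c000001e (invalid lock sequence),
    // so accept that as well as the usual illegal-instruction status.
    return (code == EXCEPTION_ILLEGAL_INSTRUCTION || code == 0xC000001E)
               ? EXCEPTION_EXECUTE_HANDLER
               : EXCEPTION_CONTINUE_SEARCH;
}

void transform(float* p, int n) {
    if (g_use_sse3) {
        __try {
            transform_sse3(p, n);   // may fault on pre-SSE3 CPUs
            return;
        } __except (fault_filter(GetExceptionCode())) {
            g_use_sse3 = false;     // remember the CPU can't do SSE3
        }
    }
    transform_sse2(p, n);           // assumes redoing a partial SSE3 run is harmless
}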
It's the hardware that encounters an instruction it doesn't know. Just as you can't have a Motorola chip execute x86 code, this processor doesn't recognise the LDDQU instruction.
The CPU will raise an interrupt, which is handled by the OS, and translated to the error message you got.
What can you do? You can build your binary for the 'lower level' platform instead. Probably the target "x86" will do: the compiler will then emit only baseline x86-compliant code. You may want to release your software in two versions: the 'optimized' one and the 'compatible' one.
It should raise an exception, generally EXCEPTION_ILLEGAL_INSTRUCTION. From MSDN:
EXCEPTION_ILLEGAL_INSTRUCTION: The thread tried to execute an invalid instruction.
However, in your case the CPU couldn't properly interpret the instruction stream and broke it into smaller pieces, leading to undefined behaviour (here, an instruction ended up with a LOCK prefix added to it from a residual byte of the SSE3 instruction, but that instruction doesn't support a LOCK prefix, so the CPU signals an exception). There is really nothing to be done other than making an SSE2 version, or testing for the SSE feature flags and branching the code based on what is supported.