C/C++ usage of special CPU features

I am curious: do new compilers use the extra features built into newer CPUs, such as MMX, SSE, 3DNow! and so on?
I mean, the original 8086 did not even have an FPU, so a compiler that old could not use one, but new compilers can, since an FPU is part of every new CPU. So, do new compilers use the new CPU features?
Or, to put it better, do the new C/C++ standard library functions use the new features?
Thanks for your answers.
EDIT:
OK, so, if I understand all of you correctly, even some standard operations, especially on floating-point numbers, can be done faster using SSE.
To use it, I must enable the feature in my compiler, if the compiler supports it. If it does, I must make sure that the target platform supports those features.
For system libraries that need top performance, such as OpenGL, DirectX and so on, this support may already be built into the system.
By default, for compatibility reasons, the compiler doesn't use these features, but you can add support using special C functions (intrinsics) provided by, for example, Intel. This seems the best way, since you can directly control whether and when you use the special features of the desired platform, and so write applications that support multiple CPUs.

gcc will support newer instructions via command line arguments. See here for more info. To quote:
GCC can take advantage of the
additional instructions in the MMX,
SSE, SSE2, SSE3 and 3dnow extensions
of recent Intel and AMD processors.
The options -mmmx, -msse, -msse2,
-msse3 and -m3dnow enable the use of these extra instructions, allowing
multiple words of data to be processed
in parallel. The resulting executables
will only run on processors supporting
the appropriate extensions--on other
systems they will crash with an
Illegal instruction error (or similar)
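
As a rough illustration of the kind of code these options affect, here is a minimal sketch of a loop a vectorizing compiler may turn into packed instructions. The function name add_arrays, the file name and the exact command lines in the comments are assumptions made for this example, and whether the loop is actually vectorized depends on the compiler version and optimization level.

```cpp
#include <cstddef>

// Independent iterations over contiguous floats: a typical candidate for
// automatic vectorization.
// Possible builds (assumed file name add.cpp, 32-bit x86 target):
//   g++ -O3 -m32 -c add.cpp          # baseline instructions only
//   g++ -O3 -m32 -msse2 -c add.cpp   # compiler may emit SSE2 instructions
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The source is the same in both builds; only the instructions the compiler is allowed to emit change, which is why the second binary can fail on a CPU without SSE2.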

These instructions are not part of any ISO C/C++ standards. They are available through compiler intrinsics, depending on the compiler used.
For MSVC, see http://msdn.microsoft.com/en-us/library/26td21ds(VS.80).aspx
For GCC, you could look at http://developer.apple.com/hardwaredrivers/ve/sse.html
AFAIK, SSE intrinsics are the same between GCC and MSVC.

Compilers will aim for producing code for a minimal set of features in a processor. They also provide compilation switches that allow you to target specific processors. In this manner, they can sell more compilers (to those folks with old processors as well as the trendy folk with new ones).
You will need to study the documentation that came with your compiler.

Sometimes the runtime library will contain multiple implementations of a feature, and the library will dynamically choose between implementations when the program is run. The overhead might be the cost of a function pointer call instead of a direct function call, but the benefit could be much greater when using a CPU-specific optimised function.
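
A minimal sketch of that idea, assuming GCC or Clang on x86 (other toolchains fall back to the baseline). The names sum_baseline, sum_avx2, choose_sum, sum_i32 and the HAVE_X86_DISPATCH macro are invented for this example; it also assumes the answer from __builtin_cpu_supports can be trusted on the system in question.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1   // hypothetical macro for this sketch
#else
#define HAVE_X86_DISPATCH 0
#endif

// Baseline version: runs on any CPU the program targets.
static std::int64_t sum_baseline(const std::int32_t* p, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if HAVE_X86_DISPATCH
// Same source, but the compiler may use AVX2 instructions in this function.
__attribute__((target("avx2")))
static std::int64_t sum_avx2(const std::int32_t* p, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
#endif

using SumFn = std::int64_t (*)(const std::int32_t*, std::size_t);

// Runs once: pick the best implementation the current CPU reports.
static SumFn choose_sum() {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

// Callers go through this pointer, paying one indirect call per use.
static const SumFn sum_i32 = choose_sum();
```

A caller simply writes sum_i32(data, count); the choice of implementation was made once at program start.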
JIT compilers (for VM languages such as Java and C#) take this one step further and compile the bytecode for the specific CPU that it's running on. This gives your own code the benefit of specific CPU optimisation. This is one reason why Java code can actually be faster than compiled C code, because the Java JIT compiler can delay its optimisation decisions until the program is run on the actual target machine. A C compiler must make those decisions without always knowing what the target CPU is. Furthermore, JIT compilers evolve and can make your program faster over time without you having to do anything.

If you use the Intel C compiler, and set sufficiently high optimisation options, you will find that some of your loops get 'vectorised', which means the compiler has rewritten them to use SSE-style instructions.
If you want to use SSE operations directly, you use the intrinsics defined in the 'xmmintrin.h' header file; say
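#include <xmmintrin.h>

int main() {
    __m128 U, V, W;
    float ww[4];
    V = _mm_set1_ps(1.5f);        // all four lanes = 1.5
    U = _mm_set_ps(0, 1, 2, 3);   // arguments run from the highest lane to the lowest
    W = _mm_add_ps(U, V);         // lane-wise addition
    _mm_storeu_ps(ww, W);         // unaligned store; ww = {4.5, 3.5, 2.5, 1.5}
    return 0;
}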

Varying compilers will use varying new features. Visual Studio will use SSE/2, and I believe the Intel compiler will support the very latest in CPU features. You should, of course, be wary about the market penetration of your favourite feature.
As for what your favourite standard library uses, that depends on what it was compiled with. However, the C++ standard library is typically compiled on-site, since it's very heavily templated, so if you enable SSE2, the C++ std libs should use it. As for the CRT, it depends on what it was compiled with.

There are generally two ways a compiler can generate code that uses special features like these:
When the compiler itself is compiled, you configure it to generate code for a particular architecture, and it can take advantage of any features it knows that architecture will have. For example, if gcc is configured for an Intel processor new enough (or is that "not old enough"?) to contain an integrated FPU, it will generate floating-point instructions.
When the compiler is invoked, flags or parameters can specify the type of features available to the processor that will run the program, and then the compiler will know it is safe to use these features. If the flags aren't present, it will generate equivalent code without using the special instructions provided by those features.

If you're talking about code written in C/C++, the new features are exploited if you tell your compiler to do so. By default, your compiler probably targets "plain x86" (naturally with FPU :) ), usually optimized for the most widespread processor generation at the moment, but still able to run on older processors.
If you want the compiler to generate code also considering the new instruction sets, you should tell it to do so with the appropriate command line switch/project setting, for example for Visual C++ the option to enable SSE/SSE2 instructions generation is /arch.
Notice that many features of new instruction sets cannot be exploited directly in "normal" code, so you are usually provided with compiler intrinsics to operate on the particular datatypes native of the new instruction sets.

Intel has provided updated CPUID example code with every new CPU release for as long as I can remember, so that you can check for the new features. At least this is what I found the first time I thought about this same question myself.
Using CPUID to Detect the presence of SSE 4.1 and SSE 4.2 Instruction Sets
As new compilers are released they add the new features directly like VS2010 for example.
Visual C++ Code Generation in Visual Studio 2010

Related

How do applications determine if instruction set is available and use it in case it is?

Just interesting how it works in games and other software.
More precisely, I'm asking for a solution in C++.
Something like:
if AMX available -> Use AMX version of the math library
else if AVX-512 available -> Use AVX-512 version of the math library
else if AVX-256 available -> Use AVX-256 version of the math library
etc.
The basic idea I have is to compile the library in different DLLs and swap them on runtime but it seems not to be the best solution for me.

For the detection part
See Are the xgetbv and CPUID checks sufficient to guarantee AVX2 support? which shows how to detect CPU and OS support for new extensions: cpuid and xgetbv, respectively.
ISA extensions that add new/wider registers that need to be saved/restored on context switch also need to be supported and enabled by the OS, not just the CPU. New instructions like AVX-512 will still fault on a CPU that supports them if the OS hasn't set a control-register bit. (Effectively promising that it knows about them and will save/restore them.) Intel designed things so the failure mode is faulting, not silent corruption of registers on CPU migration, or context switch between two programs using the extension.
Extensions that added new or wider registers are AVX, AVX-512F, and AMX. OSes need to know about them. (AMX is very new, and adds a large amount of state: 8 tile registers T0-T7 of 1KiB each. Apparently OSes need to know about AMX for power-management to work properly.)
OSes don't need to know about AVX2/FMA3 (still YMM0-15), or any of the various AVX-512 extensions which still use k0-k7 and ZMM0-31.
There's no OS-independent way to detect OS support of SSE, but fortunately it's old enough that these days you don't have to. It and SSE2 are baseline for x86-64. Everything up to SSE4.2 uses the same register state (XMM0-15) so OS support for SSE1 is sufficient for user-space to use SSE4.2. SSE1 was new in 1999, with Pentium 3.
Different compilers have different ways of doing CPUID and xgetbv detection. See does gcc's __builtin_cpu_supports check for OS support? - unfortunately no, only CPUID, at least when that was asked. I'd consider that a GCC bug, but IDK if it ever got reported or fixed.
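
A minimal sketch of the two-step check for AVX, assuming GCC or Clang on x86 (the function names avx_usable and read_xcr0 are made up for this example, and other platforms simply report false). It uses __get_cpuid from <cpuid.h> for the CPU side and the xgetbv instruction, via inline assembly, for the OS side.

```cpp
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

// Read extended control register 0 (XCR0).
// Only valid to execute when CPUID reports OSXSAVE.
static std::uint64_t read_xcr0() {
    std::uint32_t eax, edx;
    __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return (static_cast<std::uint64_t>(edx) << 32) | eax;
}

// True only if the CPU implements AVX and the OS has enabled
// saving/restoring of the XMM and YMM register state.
bool avx_usable() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    const bool osxsave = (ecx >> 27) & 1;  // OS has enabled XSAVE/XGETBV
    const bool avx     = (ecx >> 28) & 1;  // CPU implements AVX
    if (!osxsave || !avx) return false;
    return (read_xcr0() & 0x6) == 0x6;     // XCR0 bit 1: SSE state, bit 2: AVX state
}
#else
// Assumption for this sketch: anything else is treated as "no AVX".
bool avx_usable() { return false; }
#endif
```

AVX2 and AVX-512 need further CPUID leaves and XCR0 bits; this sketch covers plain AVX only.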
For the optional-use part
Typically setting function pointers to selected versions of some important functions. Inlining through function pointers isn't generally possible, so make sure you choose the boundaries appropriately, like an AVX-512 version of a function that includes a loop, not just a single vector.
GCC's function multi-versioning can automate that for you, transparently compiling multiple versions and hooking some function-pointer setup.
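
For illustration, a minimal sketch using the target_clones attribute. The function name scale and the CLONE_FOR_CPUS macro are assumptions for this example, and the guard restricts cloning to GCC on x86-64 Linux with glibc, where the loader support it relies on is available.

```cpp
#include <cstddef>

// Assumption: cloning is enabled only for GCC on x86-64 Linux with glibc;
// elsewhere the macro expands to nothing and one ordinary version is built.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && \
    defined(__linux__) && defined(__GLIBC__)
#define CLONE_FOR_CPUS __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define CLONE_FOR_CPUS
#endif

// The compiler builds one copy of this function per listed target and
// picks between them when the program is loaded.
CLONE_FOR_CPUS
void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}
```

Callers just call scale(...) as usual. Note that the whole loop lives inside the cloned function, in line with the advice above about choosing the dispatch boundary.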
There have been some previous Q&As about this with different compilers, search for "CPU dispatch avx" or something like that, along with other search terms.
See The Effect of Architecture When Using SSE / AVX Intrinisics to understand the difference between the two models for intrinsics. With GCC/clang you have to enable -march=skylake or whatever, or manually -mavx2, before you can use an intrinsic. With MSVC and classic ICC you can use any intrinsic anywhere, even to emit instructions the compiler wouldn't be able to auto-vectorize with. (Those compilers can't or don't optimize intrinsics much at all, perhaps because that could lead to them getting hoisted out of if(cpu) statements.)
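
With GCC and clang, one way to use an intrinsic without raising the baseline for the whole file is the per-function target attribute. The sketch below assumes GCC or Clang on x86-64; the names add8 and add8_avx2 are hypothetical, and the runtime guard relies on __builtin_cpu_supports.

```cpp
#include <cstdint>

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>

// Only this function is compiled with AVX2 enabled; the rest of the file
// keeps whatever baseline was chosen on the command line.
__attribute__((target("avx2")))
static void add8_avx2(const std::int32_t* a, const std::int32_t* b,
                      std::int32_t* out) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                        _mm256_add_epi32(va, vb));
}
#endif

// Adds eight 32-bit integers, taking the AVX2 path only when the CPU
// reports support for it.
void add8(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) {
        add8_avx2(a, b, out);
        return;
    }
#endif
    for (int i = 0; i < 8; ++i) out[i] = a[i] + b[i];
}
```

Without the attribute (and without -mavx2 or a suitable -march), GCC and clang would reject the intrinsic calls in add8_avx2.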

Windows provides IsProcessorFeaturePresent but AVX support is not on the list.
For more detailed detection you need to ask the CPU directly. On x86 this means the CPUID instruction. Visual C++ provides the __cpuidex intrinsic for this. In your case, query function/leaf 1 and check bit 28 in ECX. Wikipedia has a decent article but you really should download the Intel instruction set manual to use as a reference.

Backward compatibility of the code compiled optimized for new instruction set extensions

In order to narrow the scope of this question, let's consider projects in C / C++ only.
There is a whole array of new SIMD instruction set extensions for x86 architecture, though in order to benefit from them a developer should recompile the code with an appropriate optimization flag, and perhaps, modify it accordingly as well.
Since new instruction set extensions come out relatively frequently, it's unclear how the backward compatibility can be maintained while utilizing the benefits of available instruction set extensions.
Does the resulting application stay compatible with older CPU models that don't support a new instruction set extension? If yes, could you elaborate on how such support is implemented?

New CPU instructions require new hardware to execute. If you try to run them on older CPUs that don't support those instructions, your program will crash with an Invalid Opcode fault. Occasionally OSes will handle this condition, but usually not.
To run with the new instructions, you either need to require that they are supported in hardware, or (if the benefit is great enough) check at runtime to see if the new instructions you need are supported. If they are, you run a section of code that uses them. If they are not, you run a different section of code that does not use them.
Generally "backwards compatible" refers to a new version of something running stuff that runs on the older, existing things, and not old things running with new stuff.

Historically, most x86 instruction sets have been (practically) strict supersets of previous sets. However, the AVX-512 extension comes in several mutually-incompatible variants, so particular care will need to be taken.
Fortunately, compilers are also getting smarter. GCC has __attribute__((simd)) and __attribute__((target_clones(...))) to automatically create multiple implementations of the given function, and choose the best one at load time based on what the actual CPU supports. (For older GCC versions, you had to use IFUNC manually ... and in ancient days, ld.so would load libraries from a completely separate directory depending on things like cmov).

Optimize for a specific machine / processor architecture

In this highly voted answer to a question on the performance differences between C++ and Java I learn that the JIT compiler is sometimes able to optimize better because it can determine the exact specifics of the machine (processor, cache sizes, etc.):
Generally, C# and Java can be just as fast or faster because the JIT
compiler -- a compiler that compiles your IL the first time it's
executed -- can make optimizations that a C++ compiled program cannot
because it can query the machine. It can determine if the machine is
Intel or AMD; Pentium 4, Core Solo, or Core Duo; or if supports SSE4,
etc.
A C++ program has to be compiled beforehand usually with mixed
optimizations so that it runs decently well on all machines, but is
not optimized as much as it could be for a single configuration (i.e.
processor, instruction set, other hardware).
Question: Is there a way to tell the compiler to optimize specifically for my current machine? Is there a compiler which is able to do this?

For GCC, you can use the flag -march=native. Be aware that the generated code may not run on other CPUs because
GCC uses this name to determine what kind of instructions it can emit
when generating assembly code.
So CPU specific assembly can be generated.
If you want your code to run on other CPU types, but tune it for better performance on your CPU, then you should use -mtune=native:
Specify the name of the processor to tune the performance for. The
code will be tuned as if the target processor were of the type
specified in this option, but still using instructions compatible with
the target processor specified by a -mcpu= option.
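
One way to see what a given -march setting allowed is to look at the macros the compiler predefines. The following is a small sketch, assuming GCC or Clang (which define macros such as __SSE2__ and __AVX2__ when the corresponding instructions are enabled); the function name enabled_extensions is made up for this example.

```cpp
#include <string>
#include <vector>

// Lists the x86 SIMD extensions the compiler was allowed to use for this
// translation unit, based on macros that GCC and Clang predefine.
std::vector<std::string> enabled_extensions() {
    std::vector<std::string> out;
#ifdef __SSE2__
    out.push_back("SSE2");
#endif
#ifdef __SSE4_2__
    out.push_back("SSE4.2");
#endif
#ifdef __AVX__
    out.push_back("AVX");
#endif
#ifdef __AVX2__
    out.push_back("AVX2");
#endif
#ifdef __AVX512F__
    out.push_back("AVX-512F");
#endif
    return out;
}
```

Built with -march=native on a recent machine the list will usually be longer than with the default target. -mtune=native is not expected to change it, since tuning does not enable new instructions.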

Certainly a compiler could be instructed to optimize for a specific architecture. This is true of gcc, if you look at the multitude of architecture flags that you can pass in. The same is true to a lesser extent on Visual Studio, as it has the -MACHINE option and /arch options.
However, unlike in Java, this likely means that the generated code is only (safe) to run on that hardware that is being targeted. The assertion that Java can be just as fast or faster only likely holds in the case of generically compiled C++ code. Given the target architecture, C++ code compiled for that specific architecture will likely be as fast or faster than equivalent Java code. Of course, it's much more work to support multiple architectures in this way.

Does building the compiler from source result in better optimization?

Consider this simple case scenario:
I download the pre-built binaries of a C++ compiler (say Clang or GCC or anything else) for my generic OS (that is not Windows). I compile my code, which consists of some computationally expensive mathematical calculation, with optimization flag -O3 and I get an execution time of T1.
On a different attempt, instead of using pre-built binaries, I download the source code and build the compiler myself on my generic machine. I compile the same code with the same optimization flag, achieving execution time T2.
Will T2 < T1, or will they be more or less the same?
In other words, is the execution time independent from the way that compiler is built?

The compiler's optimization of your code is the result of the behavior of the compiler, not the performance of the compiler.
As long as the compiler has the same behavioral design, it will produce exactly the same output.

Generally the same compiler version should generate the same assembler code given the same C or C++ code input. However, there are certain things that might further affect the code that is being executed when you run the compiler.
A distro might have backported (or even created its own) patches from other versions.
Modern compilers often have library dependencies (e.g. cloog) that may behave differently in different versions, causing the compiler to make code generation decisions based on essentially different data.
These libraries may (in some compiler versions) be optional at compile time (you might need to give --enable switches to configure, or configure tries to autodetect them).
Compiler switches like -march=native will look at what hardware you compile on and try to optimize accordingly.
A time limit in the compiler's optimizer may trigger, essentially making better optimizations on better machines; or the same for memory (I don't think that's to be found in modern compilers anymore, though).
That said, even the same assembler might perform differently on your machine and theirs, e.g. because one is optimized for AMD, the other for Intel.

In my opinion, and in theory, compilation itself can be faster, since you can tell the "compiler which compiles the compiler": "please target my computer; you may use my processor's own machine code to optimize".
But I think the code the compiler generates cannot get faster this way. To improve the compiler's optimization, I think we need to put something like new technology into the compiler, not just recompile it.

That depends on how that compiler is implemented and on your platform, but the answer will be most likely "no".
If your platform provides specific functionality that can improve the performance of your program, the optimizer in your compiler might use that functionality to produce a faster program. The optimizer can do so only if the compiler writer was aware of the functionality and has implemented special treatment for your platform in the optimizer. If that is the case, the detection might be done dynamically in the optimizer, meaning any build of the optimizer can detect the platform and optimize your code. Only if the detection has to occur at compile time of the optimizer for some reason could recompiling it on your platform give that advantage. But if such a better build for your platform exists, the compiler vendor has most likely provided binaries for it.
So, with all these ifs, it's unlikely that your program will be any faster when you recompile the compiler on your platform. There is a chance, however, that the compiler will be a bit faster if it is optimized for your platform rather than a generic binary, resulting in shorter compile times.

Register allocation rules in code generated by major C/C++ compilers

I remember some rules from a while ago (pre-32-bit Intel processors), when it was quite common (at least for me) to have to analyze the assembly output generated by C/C++ compilers (in my case, Borland/Turbo at that time) to find performance bottlenecks, and to safely mix assembly routines with C/C++ code. Things like using the SI register for the this pointer, AX being used for return values, which registers should be preserved when an assembly routine returns, etc.
Now I was wondering if there's some reference for the more popular C/C++ compilers (Visual C++, GCC, Intel...) and processors (Intel, ARM, ...), and if not, where to find the pieces to create one. Ideas?

You are asking about "application binary interface" (ABI) and calling conventions. These are typically set by operating systems and libraries, and enforced by compilers and linkers. Google for "ABI" or "calling convention." Some starting points from Wikipedia and Debian for ARM.

Agner Fog's "Calling Conventions" document summarizes, amongst other things, the Windows and Linux 64 and 32-bit ABIs: http://www.agner.org/optimize/calling_conventions.pdf. See Table 4 on p.10 for a summary of register usage.
One warning from personal experience: don't embed assumptions about the ABI in inline assembly. If you write a function in inline assembly that assumes return and/or parameter transfer in particular registers (e.g. eax, rdi, rsi), it will break if/when the function is inlined by the compiler.

Open Watcom C/C++ compiler supports two calling conventions, register-based (default) and stack-based (very close to what other compilers use). User's Guide for this compiler describes them both and is available for free online, together with the compiler itself. You may find these topics in the User's Guide especially helpful:
10.4.1 Passing Arguments Using Register-Based Calling Conventions
10.4.6 Using Stack-Based Calling Conventions
10.5 Calling Conventions for 80x87-based Applications

Well, today, if optimisation is turned on, there aren't any. But GCC allows you to declare that your assembly instruction should use a particular variable regardless of whether it's in a register or not, or even to force GCC to put that variable into a register usable with your instruction. You can also declare which registers your inline assembly block reserves for itself (so the compiler should generate appropriate save/restore code around your inline piece, if needed).
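
A minimal sketch of that GCC syntax, assuming GCC or Clang on x86 (the function name add_with_asm is made up, and other platforms use the plain C++ fallback). The operands are described by constraints instead of hard-coded register names, so the compiler stays free to choose registers, including after inlining.

```cpp
#include <cstdint>

// Adds b to a using one x86 instruction, letting the compiler choose the
// registers instead of assuming any fixed ones.
std::uint32_t add_with_asm(std::uint32_t a, std::uint32_t b) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // "+r"(a): a is both read and written, in any general-purpose register.
    // "r"(b):  b is an input in any general-purpose register.
    // "cc":    the instruction modifies the flags register.
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b;  // fallback for other compilers/architectures
#endif
}
```

If the block used a specific register behind the compiler's back, that register would have to be named in the clobber list (the part after the last colon).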

I believe, but am by no means sure, that GCC uses the Itanium ABI for most of its functionality; the incompatibilities between it and the ABI it uses are documented.
