I keep reading opinions on which header file is better to include to access Intel's intrinsics : x86intrin.h or immintrin.h .
Both seem to achieve an identical outcome, but I'm sure there must some subtle differences, with regards to code portability. Maybe one is more common, or more complete, than the other ?
I couldn't find an explanation on any of them. If anyone knows why there are 2 files, and what differences they have, this would be a welcomed SO answer.
Speaking of portability, for older compilers (like gcc < v4.4.0), of course things become more complex, and neither is available. One has to consider including another intrinsic header (likely emmintrin.h for SSE support).
(posting an answer here because Header files for x86 SIMD intrinsics has out of date answers that suggest including individual header files).
immintrin.h is portable across all compilers, and includes all Intel SIMD intrinsics, and some scalar extensions like _pdep_u32 that are available with -mbmi2 or a -march= that includes it. (For AMD SSE4a and XOP (Bulldozer-family only, dropped for Zen), you need to include a different header as well.)
The only reason I can think of for including <emmintrin.h> specifically would be if you're using MSVC and want to leave intrinsics undefined for ISA extensions you don't want to depend on.
GCC's model of requiring you to enable extensions before you can use intrinsics for them means the compiler does this checking for you, so you can just #include <immintrin.h> but still get an error if you try to use _mm_shuffle_epi8 (pshufb) without -mssse3.
Don't use compilers older than gcc4.4. They're obsolete and will typically generate slower code, especially for modern CPUs that didn't exist when their tuning settings were being decided.
gcc/clang's x86intrin.h vs. MSVC intrin.h are only useful if you need some extra non-SIMD intrinsics like MSVC's _BitScanReverse() that aren't always portable across compilers. Stuff like integer rotate / bit-scan intrinsics that are baseline (unlike BMI1 lzcnt/tzcnt or BMI2 rorx) but hard or impossible to express in C in a way that compilers will recognize and turn a loop back into a single instruction.
Intel documents some of those as being available in immintrin.h in their intrinsics guide, but gcc/clang and MSVC actually have them in their x86intrin.h or intrin.h headers, respectively.
See How to get the CPU cycle count in x86_64 from C++? for an example of using #ifdef _MSC_VER to choose the right header to define uint64_t __rdtsc(void) and __rdtscp().
just found out, that _mm_broadcastsd_pd, which is listed in the intel intrinsics guide (link), is not implemented in GCCs avx2intrin.h. I tested a small example on Godbolt with the latest GCC version and it won't compile (Example GCC). Clang does (Example Clang). It's the same on my computer (GCC 8.3).
Should I file a bug report or is there any particular reason why it is not included? I mean, sure, _mm_movedup_pd does exactly the same thing and clang actually generates the same assembly for both intrinsics, but I think that shouldn't be a reason to exclude it.
Greetings
Edit
Created a bug report: link
Not all compilers have all aliases for an intrinsic (different names for the same thing). Other than trying them on Godbolt, IDK how to find out which ones are portable across current versions of the major 4 compilers.
But yes, GCC/clang do accept bugs about missing _mm intrinsics, especially ones that Intel documents.
_mm_broadcastsd_pd is documented by Intel as being an intrinsic for movddup so you're not missing out on anything. More importantly, it's a bit misleading because there is no vbroadcastsd xmm, xmm, only with a YMM or ZMM destination. (_mm256_broadcast_sd(double *a); and _mm256_broadcastsd_pd(__m128d a);)
The asm reference manual doesn't even document _mm_broadcastsd_pd in the vbroadcast or the movddup entry; it's only in the intrinsics guide.
GCC would probably want to add this, especially since clang has it. Having _mm_broadcastsd_pd as an alias would be useful for people that are looking for it and don't know the asm well enough to know that they need a movddup. (Or with AVX 3-operand instructions, movlhps or unpcklpd same,same)
In this youtube video the guy explains that there are 2 types of assembly syntax - Intel and AT&T. Is there a relation between assembly-syntax and platform, (for example intel platforms assemblers only accept Intel assembly syntax)?
No connection whatsoever. An assembly language syntax is defined by the assembler, the program that parses it. The machine code is the only real standard that has to be conformed to. Although folks like to talk about two x86 syntaxes for example there are many many more than that as the syntax encompasses all of the text not just the ordering of operands. There are multiple syntaxes for ARM, and others for the same reason.
It is in the interest of the chip or ip vendor to define the instruction set and at the same time in their interest to present this with an assembly language syntax, or at least with respect to the instructions (there is a lot more to an assembly language than just the instructions, in the case of x86 variation within the instructions themselves other than just the ordering of operands), and these vendors make or have made for them at a minimum an assembler and linker as needed (in their best interest). Their documentation and this specific assembler ideally match (in their best interest). But the standardization stops there, the next person/team to make an assembler only needs to conform to the machine code, and will often make at least subtle changes making the code incompatible (intentionally or not, why bother making a perfect clone). For whatever reason gas implementations tend to create an incompatible language from the chip/ip vendor except for the cases where the gas version IS the vendor driven initial version (lets say risc-v for example)
The syntax connection and binary file format (elf, coff, etc) are important with respect to the toolchain. For obvious reasons compilers output assembly language then launch the assembler, so the code produced by the compiler needs to match the syntax of the assembler in that chain. Likewise the object file format output by the assembler needs to match that of the linker, I assume that linkers like gnu ld dont actually rely on the assembler for patching up the machine code used during linking, just modifies the machine code directly (or adds machine code as needed although in that case maybe does use assembly language, have not personally inspected/tested/confirmed).
One can argue since the assembler comes before the compiler that the compiler conforms to the assembler, if it is the same folks then possible that items are added to the assembler to help the overall toolchain solution, but would expect the assembler and thus the assembly language as a whole comes first.
And lastly we as programmers expect the installed compiler to produce programs for that system/platform. gcc on a linux machine makes working programs gcc hello.c -o hello. Without us having to do the bootstrap and linking for that platform, and that comes from the C library usually, but depends on the design of the toolchain, everyone from start to end may have to be system aware.
I answer in case someone else made that mistake. As Cody Gray commented, both the Intel and AT&T are in-fact Intel X86 assembly languages.
Resources:
https://en.wikipedia.org/wiki/X86_assembly_language#Syntax
https://stackoverflow.com/questions/tagged/x86
https://stackoverflow.com/questions/tagged/att
https://stackoverflow.com/questions/tagged/intel-syntax
After reading around the subject, there is overwhelming evidence from numerous sources that using standard C or C++ casts to convert from floating point to integer numbers on Intel is very slow. In order to meeting the ANSI/ISO specification, Intel CPUs need to execute a large number of instructions including those needed to switch the rounding mode of the FPU hardware.
There are a number of workarounds described in various documents, but the cleanest and most portable seems to be the lrint() call added to C99 and C++ 0x standards. Many documents say that a compiler should inline expand these functions when optimization is enabled, leading to code which is faster than a conventional cast, or a function call.
I even found references to gcc feature tracking bags to add this inline expansion to the gcc optimizer, but in my own performance tests I have been unable to get it to work. All my attempts show lrint performance to be much slower than a simple C or C++ style cast. Examining the assembly output of the compiler, and disassembling the compiled objects always shows an explicit call to an external lrint() or lrintf() function.
The gcc versions I have been working with are 4.4.3 and 4.6.1, and I have tried a number of flag combinations on 32bit and 64bit x86 targets, including options to explicitly enable SSE.
How do I get gcc to inline expand lrint, and give me fast conversions?
The lrint() function may raise domain and range errors. One possible way the libc deals with such errors is setting errno (see C99/C11 section 7.12.1). The overhead of the error checking can be quite significant and in this particular case seems to be enough for the optimizer to decide against inlining.
The gcc flag -fno-math-errno (which is part of -ffast-math) will disable these checks. It might be a good idea to look into -ffast-math if you do not rely on standards-compliant handling of floating-point semantics, in particular NaNs and infinities...
Have you tried the -finline-functions flag to gcc.
You can also direct GCC to try to integrate all “simple enough” functions into their callers with the option -finline-functions.
see http://gcc.gnu.org/onlinedocs/gcc/Inline.html
Here you can say gcc to make all function to inline but not all will be inlined.
The compiler uses some heuristics to determine whether the function is small enough to be inlined. One more thing is that a recursive function are also not going to be inline here.
I am curious, do new compilers use some extra features built into new CPUs such as MMX SSE,3DNow! and so?
I mean, in original 8086 there was even no FPU, so compiler that old cannot even use it, but new compilers can, since FPU is part of every new CPU. So, does new compilers use new features of CPU?
Or, it should be more right to ask, does new C/C++ standart library functions use new features?
Thanks for answer.
EDIT:
OK, so, if I get all of you right,even some standart operations, especially with float numbers can be done using SSE faster.
In order to use it, I must enable this feature in my compiler, if it supports it. If it does, I must be sure that targeted platform supports that features.
In case of some system libraries that require top performance, such as OpenGL, DirectX and so, this support may be supported in system.
By default, for compatibility reasons, compiler doesen´t support it, but you can add this support using special C functions delivered by, for example Intel. This should be the best way, since you can directly control wheather and when you use special features of desired platform, to write multi-CPU-support applications.
gcc will support newer instructions via command line arguments. See here for more info. To quote:
GCC can take advantage of the
additional instructions in the MMX,
SSE, SSE2, SSE3 and 3dnow extensions
of recent Intel and AMD processors.
The options -mmmx, -msse, -msse2,
-msse3 and -m3dnow enable the use of these extra instructions, allowing
multiple words of data to be processed
in parallel. The resulting executables
will only run on processors supporting
the appropriate extensions--on other
systems they will crash with an
Illegal instruction error (or similar)
These instructions are not part of any ISO C/C++ standards. They are available through compiler intrinsics, depending on the compiler used.
For MSVC, see http://msdn.microsoft.com/en-us/library/26td21ds(VS.80).aspx
For GCC, you could look at http://developer.apple.com/hardwaredrivers/ve/sse.html
AFAIK, SSE intrinsics are the same between GCC and MSVC.
Compilers will aim for producing code for a minimal set of features in a processor. They also provide compilation switches that allow you to target specific processors. In this manner, they can sell more compilers (to those folks with old processors as well as the trendy folk with new ones).
You will need to study the documentation that came with your compiler.
Sometimes the runtime library will contain multiple implementations of a feature, and the library will dynamically choose between implementations when the program is run. The overhead might be the cost of a function pointer call instead of a direct function call, but the benefit could be much greater when using a CPU-specific optimised function.
JIT compilers (for VM languages such as Java and C#) take this one step further and compile the bytecode for the specific CPU that it's running on. This gives your own code the benefit of specific CPU optimisation. This is one reason why Java code can actually be faster than compiled C code, because the Java JIT compiler can delay its optimisation decisions until the program is run on the actual target machine. A C compiler must make those decisions without always knowing what the target CPU is. Furthermore, JIT compilers evolve and can make your program faster over time without you having to do anything.
If you use the Intel C compiler, and set sufficiently high optimisation options, you will find that some of your loops get 'vectorised', which means the compiler has rewritten them to use SSE-style instructions.
If you want to use SSE operations directly, you use the intrinsics defined in the 'xmmintrin.h' header file; say
#include <xmmintrin.h>
__m128 U, V, W;
float ww[4];
V=_mm_set1_ps(1.5);
U=_mm_set_ps(0,1,2,3);
W=_mm_add_ps(U,V);
_mm_storeu_ps(ww,W);
Varying compilers will use varying new features. Visual Studio will use SSE/2, and I believe the Intel compiler will support the very latest in CPU features. You should, of course, be wary about the market penetration of your favourite feature.
As for what your favourite standard library use, that depends on what it was compiled with. However, C++ standard library is typically compiled on-site, since it's very heavily templated, so if you enable SSE2, the C++ std libs should use it. As for the CRT, depends on what they were compiled with.
There are generally two ways a compiler can generate code that uses special features like these:
When the compiler itself is compiled, you configure it to generate code for a particular architecture, and it can take advantage of any features it knows that architecture will have. For example, if it gcc is configured for an Intel processor new enough (or is that "not old enough"?) to contain an integrated FPU, it will generate floating-point instructions.
When the compiler is invoked, flags or parameters can specify the type of features available to the processor that will run the program, and then the compiler will know it is safe to use these features. If the flags aren't present, it will generate equivalent code without using the special instructions provided by those features.
If you're talking about code written in C/C++, the new features are explited if you tell to your compiler to do so. By default, your compiler probably targets "plain x86" (naturally with FPU :) ), usually optimized for the most widespread processor generation at the moment, but still able to run on older processors.
If you want the compiler to generate code also considering the new instruction sets, you should tell it to do so with the appropriate command line switch/project setting, for example for Visual C++ the option to enable SSE/SSE2 instructions generation is /arch.
Notice that many features of new instruction sets cannot be exploited directly in "normal" code, so you are usually provided with compiler intrinsics to operate on the particular datatypes native of the new instruction sets.
Intel provides updated CPUID example code every time they release a new cpu so that you can check for the new features and has been as long as I remember. At least this is what I found the first time I thought about this same question myself.
Using CPUID to Detect the presence of SSE 4.1 and SSE 4.2 Instruction Sets
As new compilers are released they add the new features directly like VS2010 for example.
Visual C++ Code Generation in Visual Studio 2010