Common gcc C/C++ flags between Atom & i3/i5?

We'd like to deploy our product on two different HW platforms: an i5 (typically an i5-7500, but older CPUs back to the 4100 must be supported) and an Atom (E3845).
Supporting the Atom is new. Running the current binaries on the E3845 doesn't work: "Illegal instruction". Disassembling in gdb doesn't show me exactly which instruction; it only says "(bad)".
Since both are x86 I'd like to deploy a single set of binaries but, other than exhaustive trial and error, I don't know how to find which combination of gcc flags will generate code compatible with both CPUs.
P Brady's gcccpuopt.sh script looked promising, but it doesn't support my CPUs.
Looking at /proc/cpuinfo, here's the difference:

              Atom E3845   i3-4160
  Family      6            6
  Model       55           60

Flags reported on the i3-4160 but not on the Atom E3845:

  3dnowprefetch   3DNow! prefetch instructions
  epb             IA32_ENERGY_PERF_BIAS support
  abm             Advanced Bit Manipulation
  avx             Advanced Vector Extensions
  avx2            Advanced Vector Extensions 2
  bmi1            Bit Manipulation Instructions
  bmi2            Bit Manipulation Instructions
  eagerfpu        ???
  f16c            16-bit floating-point conversions
  fma             4-operand MAC instructions for fused multiply-add
  fsgsbase        ????
  invpcid         Invalidate Processor Context ID
  pcid            Process Context Identifiers
  pdpe1gb         One-GB pages (allows hugepagesz=1G)
  pln             Intel Power Limit Notification
  pts             Intel Package Thermal Status
  xsave           Save Processor Extended States (also provides XGETBV, XRSTOR, XSETBV)
  xsaveopt        Optimized XSAVE
I don't really know what all of those mean... Would I just disable (if possible) generation of everything in the i3-4160 column? Or is there a better procedure for finding the settings?
Target environment is 32-bit CentOS 6 with a 3.10 kernel, GCC 4.9. Code is mostly C++ with some C.

In order to make this answer applicable to more use cases, I'll try to make it generic and use the Atom and the i5 as the examples.
1. On each platform, run gcc -march=native -Q --help=target as noted here.
2. Gather the options that are common to all platforms and either add them to your CFLAGS or make a wrapper that always adds them to your compiler command line (it could just be a shell script that runs /path/to/real-gcc $myflags "$@", where $myflags is your list of common flags). I have often had to resort to the wrapper method for stubborn build systems that ignore $CFLAGS.
3. Compile as normal, ensuring that your CFLAGS get used.
4. If performance is acceptable, stop here; otherwise do a profile-guided optimization (PGO) build.
5. If performance is acceptable, stop here; otherwise use your profile data to identify functions that may benefit from gcc's target_clones function attribute, or a combination of the ifunc and target function attributes (also supported by clang), to generate sub-architecture-specific versions of each function that get resolved at run time; see the sketch after this list. (Note that in this specific case there may be no functions where this is useful, since the i5 outperforms the Atom in most cases.)
6. If performance is acceptable, stop here; otherwise fix the code.
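For step 5, here's a minimal sketch of the target_clones approach (the function and clone list are illustrative, not from the question; note that target_clones needs GCC 6 or later, so on GCC 4.9 as in the question you'd use the ifunc/target combination instead):

    #include <cstddef>

    // GCC emits one clone per listed target plus a resolver that picks
    // the best clone at load time based on the CPU it is running on.
    __attribute__((target_clones("default", "avx2")))
    void scale(float* v, std::size_t n, float k) {
        for (std::size_t i = 0; i < n; ++i)
            v[i] *= k;  // the avx2 clone can vectorize this with 256-bit ops
    }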

Related

How to specify target CPU/architecture Haswell for MSVC Visual Studio?

I have a program that makes heavy use of the intrinsics _BitScanForward / _BitScanForward64 (aka count trailing zeros, TZCNT, CTZ).
I would like to not use the intrinsic but instead use the corresponding CPU instruction (available on Haswell and later).
When using gcc or clang (where the intrinsic is called __builtin_ctz), I can achieve this by specifying either -march=haswell or -mbmi2 as compiler flags.
The documentation of _BitScanForward only specifies that the intrinsic is available on all architectures ("x86, ARM, x64, ARM64" or "x64, ARM64"), but I don't just want it to be available, I want to ensure it is compiled to use the CPU instruction instead of the intrinsic function. I also checked /Oi but that doesn't explain it either.
I also searched the web but there are curiously few matches for my question; most just explain how to use intrinsics, e.g. this question and this question.
Am I overthinking this and MSVC will create code that magically uses the CPU instruction if the CPU supports it? Are there any flags required? How can I ensure that the CPU instructions are used when available?
UPDATE
Here is what it looks like with Godbolt.
Please be nice; my assembly reading skills are pretty basic.
GCC uses tzcnt with haswell/bmi2, otherwise resorts to rep bsf.
MSVC uses bsf without rep.
I also found this useful answer, which states that:
"Using a redundant rep prefix for bsr was generally defined to be ignored [...]". I wonder whether the same is true for bsf?
It explains (as I knew) that bsf is not the same as tzcnt; however, MSVC doesn't appear to check for input == 0.
This adds the question: why does bsf work for MSVC?
UPDATE
Okay, this was easy, I actually call _BitScanForward for MSVC. Doh!
UPDATE
So I added a bit of unnecessary confusion here. Ideally I would like to use an intrinsic __tzcnt, but that doesn't exist in MSVC so I resorted to _BitScanForward plus an extra check to account for 0 input.
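That pattern looks roughly like this (a minimal sketch; tz32 is a made-up helper name):

    #include <intrin.h>

    // Count trailing zeros via _BitScanForward, handling the x == 0 case
    // that the underlying bsf instruction leaves undefined.
    unsigned tz32(unsigned long x) {
        unsigned long idx;
        return _BitScanForward(&idx, x) ? static_cast<unsigned>(idx) : 32;
    }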
However, MSVC supports LZCNT, where I have a similar issue (but it is used less in my code).
A slightly updated question would be: how does MSVC deal with LZCNT (instead of TZCNT)?
Answer: see here. Specifically: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."
The article suggests falling back to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead they suggest manually identifying the CPU via __cpuid and then calling either bsr or lzcnt.
In short, MSVC has no support for different CPU architectures (beyond x86/64/ARM).
As I posted above, MSVC doesn't appear to have support for different CPU sub-architectures (beyond x86/64/ARM). The article quoted above suggests falling back to bsr if older CPUs are a concern, which implies there is no compiler flag to control this; instead you identify the CPU with __cpuid and call either bsr or lzcnt depending on the result.
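Such a dispatch might look like the following sketch (the CPUID leaf/bit is the documented LZCNT/ABM one; the helper names are mine, and a fuller version would first confirm that extended leaf 0x80000001 exists):

    #include <immintrin.h>
    #include <intrin.h>

    // CPUID leaf 0x80000001: ECX bit 5 reports LZCNT support (the ABM bit).
    static bool has_lzcnt() {
        int regs[4];
        __cpuid(regs, 0x80000001);
        return (regs[2] >> 5) & 1;
    }

    // Fallback via bsr, which is undefined for 0, so handle that explicitly.
    static unsigned lz32_bsr(unsigned long x) {
        if (x == 0) return 32;
        unsigned long idx;
        _BitScanReverse(&idx, x);
        return 31 - static_cast<unsigned>(idx);
    }

    unsigned count_leading_zeros(unsigned x) {
        static const bool use_lzcnt = has_lzcnt();
        return use_lzcnt ? _lzcnt_u32(x) : lz32_bsr(x);
    }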
UPDATE
As @dewaffled pointed out, there are indeed _tzcnt_u32 / _tzcnt_u64 in the x64 intrinsics list.
I got misled by looking at the alphabetical listing of intrinsic functions on the left side of the pane. I wonder whether there is a distinction between "intrinsics" and "intrinsic functions", i.e. _tzcnt_u64 is an intrinsic but not an intrinsic function.
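For reference, using it is as simple as the sketch below (_tzcnt_u32 compiles as-is under MSVC; GCC/clang additionally want -mbmi; and on pre-BMI1 CPUs the tzcnt encoding silently executes as bsf, so a runtime check like the one above is still advisable):

    #include <immintrin.h>

    unsigned trailing_zeros(unsigned x) {
        return _tzcnt_u32(x);  // on BMI1 CPUs this is defined even for x == 0 (returns 32)
    }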

How do applications determine if instruction set is available and use it in case it is?

I'm just curious how this works in games and other software.
More precisely, I'm asking for a solution in C++.
Something like:
if AMX available -> Use AMX version of the math library
else if AVX-512 available -> Use AVX-512 version of the math library
else if AVX-256 available -> Use AVX-256 version of the math library
etc.
The basic idea I have is to compile the library into different DLLs and swap them at runtime, but that doesn't seem like the best solution to me.
For the detection part
See Are the xgetbv and CPUID checks sufficient to guarantee AVX2 support? which shows how to detect CPU and OS support for new extensions: cpuid and xgetbv, respectively.
ISA extensions that add new/wider registers that need to be saved/restored on context switch also need to be supported and enabled by the OS, not just the CPU. New instructions like AVX-512 will still fault on a CPU that supports them if the OS hasn't set a control-register bit. (Effectively promising that it knows about them and will save/restore them.) Intel designed things so the failure mode is faulting, not silent corruption of registers on CPU migration, or context switch between two programs using the extension.
Extensions that added new or wider registers are AVX, AVX-512F, and AMX. OSes need to know about them. (AMX is very new, and adds a large amount of state: 8 tile registers T0-T7 of 1KiB each. Apparently OSes need to know about AMX for power-management to work properly.)
OSes don't need to know about AVX2/FMA3 (still YMM0-15), or any of the various AVX-512 extensions which still use k0-k7 and ZMM0-31.
There's no OS-independent way to detect OS support of SSE, but fortunately it's old enough that these days you don't have to. It and SSE2 are baseline for x86-64. Everything up to SSE4.2 uses the same register state (XMM0-15) so OS support for SSE1 is sufficient for user-space to use SSE4.2. SSE1 was new in 1999, with Pentium 3.
Different compilers have different ways of doing CPUID and xgetbv detection. See does gcc's __builtin_cpu_supports check for OS support? - unfortunately no, only CPUID, at least when that was asked. I'd consider that a GCC bug, but IDK if it ever got reported or fixed.
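Putting the detection steps into code, here's a sketch assuming GCC/clang's <cpuid.h> (__get_cpuid_count needs a reasonably recent GCC or clang; the bit positions are the architecturally documented ones):

    #include <cpuid.h>
    #include <cstdint>

    // CPU *and* OS support for AVX2: CPUID feature bits plus XGETBV.
    bool cpu_and_os_support_avx2() {
        unsigned eax, ebx, ecx, edx;
        // Leaf 1: ECX bit 27 = OSXSAVE (OS enabled XGETBV), bit 28 = AVX.
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
        if (!(ecx & (1u << 27)) || !(ecx & (1u << 28))) return false;
        // XGETBV(0): the OS must save/restore XMM (bit 1) and YMM (bit 2) state.
        std::uint32_t lo, hi;
        __asm__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        if ((lo & 0x6) != 0x6) return false;
        // Leaf 7, subleaf 0: EBX bit 5 = AVX2.
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
        return (ebx & (1u << 5)) != 0;
    }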
For the optional-use part
Typically this means setting function pointers to the selected versions of some important functions. Inlining through function pointers isn't generally possible, so make sure you choose the boundaries appropriately, like an AVX-512 version of a function that contains a whole loop, not just a single vector operation.
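For instance (a sketch with hypothetical names; the AVX-512 version would live in a translation unit built with the matching -m flags):

    #include <cstddef>

    void dot_avx512(const float* a, const float* b, float* out, std::size_t n);
    void dot_scalar(const float* a, const float* b, float* out, std::size_t n);
    bool cpu_and_os_support_avx512();  // analogous to the AVX2 check sketched above

    // Resolve once at startup; every later call is a single indirect call,
    // so the selected function should contain a whole loop.
    using dot_fn = void (*)(const float*, const float*, float*, std::size_t);
    dot_fn dot = cpu_and_os_support_avx512() ? dot_avx512 : dot_scalar;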
GCC's function multi-versioning can automate that for you, transparently compiling multiple versions and hooking some function-pointer setup.
There have been some previous Q&As about this with different compilers, search for "CPU dispatch avx" or something like that, along with other search terms.
See The Effect of Architecture When Using SSE / AVX Intrinisics to understand the difference between GCC/clang's model for intrinsics, where you have to enable -march=skylake or whatever (or manually -mavx2) before you can use an intrinsic, vs. MSVC and classic ICC, where you can use any intrinsic anywhere, even ones that emit instructions the compiler isn't otherwise allowed to use when auto-vectorizing. (Those compilers can't or don't optimize intrinsics much at all, perhaps because that could lead to them getting hoisted out of if(cpu) statements.)
Windows provides IsProcessorFeaturePresent but AVX support is not on the list.
For more detailed detection you need to ask the CPU directly. On x86 this means the CPUID instruction. Visual C++ provides the __cpuidex intrinsic for this. In your case, function/leaf 1 and check bit 28 in ECX. Wikipedia has a decent article but you really should download the Intel instruction set manual to use as a reference.
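In code that check is short (CPU-side only; remember the OS/xgetbv caveat from the answer above):

    #include <intrin.h>

    bool cpu_reports_avx() {
        int regs[4];                 // EAX, EBX, ECX, EDX
        __cpuidex(regs, 1, 0);       // function/leaf 1, subleaf 0
        return (regs[2] >> 28) & 1;  // ECX bit 28 = AVX
    }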

How does an assembler convert from assembly to machine code?

I know this has been asked many times, but I am looking for a simple interpretation.
Let's say I have some assembly code that a C++ compiler generated.
Now the assembler kicks in and has to transform the assembly code into machine code.
Question 1) Will the C++ assembler compiler look at a table where each assembly instruction has the corresponding machine code instruction?
Question 2) If the C++ program runs on an Intel processor, then the assembler needs to take a look at the table published by the Intel team, right? Because in the end, the C++ program runs on an Intel processor.
Question 3) If I am right about question 2, then how is it possible that a program written in C++ can run both on a computer which uses an Intel processor and on one which uses an AMD processor?
Please try to limit your questions to one question per question. Nevertheless, let me try and answer them.
Question 1
An “assembly compiler” is called an “assembler.” Assembly is assembled, not compiled. And the assembler is not specific to C++. It is specific to the architecture and can only be used to assemble assembly programs for that architecture.
Yes, assemblers are usually implemented by having a large table mapping instruction mnemonics to the operation codes (opcodes) they correspond to. This table also tells the assembler what operands the instruction takes and how the operands are encoded. There can be multiple entries for the same mnemonic if the mnemonic corresponds to multiple instructions.
It is, however, not a requirement to do it this way. Assemblers may choose different approaches or combine tables with pre- and postprocessing steps.
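As a toy illustration of such a table (a made-up fragment, not from any real assembler, though the opcode bytes shown are the genuine x86 ones):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // One entry: the opcode byte(s) plus a note on how operands are encoded.
    struct InsnEntry {
        std::vector<std::uint8_t> opcode;
        const char* operands;
    };

    // A multimap, because one mnemonic can correspond to several encodings.
    const std::multimap<std::string, InsnEntry> insn_table = {
        {"nop", {{0x90}, "no operands"}},
        {"ret", {{0xC3}, "no operands"}},
        {"add", {{0x00}, "r/m8, r8 via a ModRM byte"}},
        {"add", {{0x01}, "r/m32, r32 via a ModRM byte"}},
    };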
Question 2
This is correct. Processor vendors generally provide documentation for their processors in which all instructions and their instruction encodings are listed. For Intel, this information can be found in the Intel Software Development Manuals. Note that while the processor vendor provides such specifications, it is the job of the assembler author to translate these documents into tables for use by the assembler. This is traditionally done manually but recently, people have started automatically translating manuals into tables.
Question 3
Both Intel and AMD produce processors of the amd64 (also called x86-64, IA32e, Intel 64, EM64T, and other things) architecture. So a program written for an Intel processor generally also runs on an AMD processor.
Note that there are tiny differences between Intel's and AMD's implementation of this architecture. Your compiler is aware of them and won't generate code that can behave differently between the two.
There are also various instruction set extensions available on some but not all amd64 processors. Programs using these will only run on processors that have these instruction set extensions. However, unless you specifically tell your compiler to make use of such extensions, it won't use any of them and your code will run on amd64 processors of any vendor.
Will the C++ assembler
There is no "the C++" assembler. An assembler generally doesn't need to know anything about any higher-level language (if any) that was compiled to the assembly code.
... look at a table where each assembly instruction has the corresponding machine code instruction?
Nothing says that there has to be a "table" but sure, an assembler supporting multiple CPU architectures could do that.
If the C++ program runs on an Intel processor, then the assembler needs to take a look at the table published by the Intel team, right?
Such table would likely be written by the authors of the assembler program rather than the CPU vendor. It would be based on manuals published by the vendor.
how is it possible that a program written in C++ can run both on a computer which uses an Intel processor and on one which uses an AMD processor?
Intel, AMD and VIA all make CPUs that implement the same(ish) instruction set, called x86-64. An assembler targeting the x86-64 instruction set should work on CPUs that support that instruction set.
There are a few small variations between the different implementations, and the assemblers (and compilers) must be designed to take such differences into consideration if the program is to work on all those systems. Example: early Intel 64 CPUs lack the NX bit (according to Wikipedia, which doesn't cite a source). A program that is to work on those CPUs mustn't use that feature.

How does "__builtin_popcount" of gcc work?

I want to know the inner workings of __builtin_popcount.
As far as I understand, it works differently for different CPUs.
Similar to many other built-ins, it translates into a specific CPU instruction if one is available on the target CPU, thus considerably speeding up the application.
For example, on x86_64 it translates to the popcnt instruction (popcntl in AT&T syntax).
Additional information can be found on the GCC page: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
It is also worth noting that the actual speedup is only seen if gcc is run with a -march flag targeting an architecture that supports this instruction, or with an argument which specifically enables it, -mpopcnt. Without either of those, gcc falls back to generic bit counting via bit operations.
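You can see this for yourself with a toy program (flag behavior as described above; argc keeps the call from being constant-folded away):

    // Build it both ways and compare the generated code for popcount():
    //   g++ -O2 -mpopcnt popcount.cpp   -> a single popcnt instruction
    //   g++ -O2 popcount.cpp            -> a generic bit-twiddling routine
    #include <cstdio>

    int popcount(unsigned x) { return __builtin_popcount(x); }

    int main(int argc, char**) {
        std::printf("%d\n", popcount(static_cast<unsigned>(argc)));
    }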

Optimize for a specific machine / processor architecture

In this highly voted answer to a question on the performance differences between C++ and Java, I learned that the JIT compiler is sometimes able to optimize better because it can determine the exact specifics of the machine (processor, cache sizes, etc.):
Generally, C# and Java can be just as fast or faster because the JIT compiler -- a compiler that compiles your IL the first time it's executed -- can make optimizations that a C++ compiled program cannot because it can query the machine. It can determine if the machine is Intel or AMD; Pentium 4, Core Solo, or Core Duo; or if it supports SSE4, etc.
A C++ program has to be compiled beforehand usually with mixed optimizations so that it runs decently well on all machines, but is not optimized as much as it could be for a single configuration (i.e. processor, instruction set, other hardware).
Question: Is there a way to tell the compiler to optimize specifically for my current machine? Is there a compiler which is able to do this?
For GCC, you can use the flag -march=native. Be aware that the generated code may not run on other CPUs, because
GCC uses this name to determine what kind of instructions it can emit when generating assembly code.
So CPU-specific assembly can be generated.
If you want your code to run on other CPU types, but tune it for better performance on your CPU, then you should use -mtune=native:
Specify the name of the processor to tune the performance for. The code will be tuned as if the target processor were of the type specified in this option, but still using instructions compatible with the target processor specified by a -mcpu= option.
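A small example to experiment with the difference (illustrative only):

    // The same source, three ways to build it (assuming GCC):
    //   g++ -O2 dot.cpp                 -> generic code, runs on any x86-64
    //   g++ -O2 -mtune=native dot.cpp   -> scheduled for this CPU, still runs anywhere
    //   g++ -O2 -march=native dot.cpp   -> may use AVX etc.; may fault on older CPUs
    #include <cstddef>

    float dot(const float* a, const float* b, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += a[i] * b[i];  // a loop the compiler can auto-vectorize
        return s;
    }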
Certainly a compiler can be instructed to optimize for a specific architecture. This is true of gcc, if you look at the multitude of architecture flags that you can pass in. The same is true to a lesser extent in Visual Studio, with its /MACHINE linker option and /arch compiler options.
However, unlike in Java, this likely means that the generated code is only safe to run on the hardware being targeted. The assertion that Java can be just as fast or faster likely holds only in the case of generically compiled C++ code. Given the target architecture, C++ code compiled for that specific architecture will likely be as fast or faster than equivalent Java code. Of course, it's much more work to support multiple architectures this way.