Practical approach to vectorization on different CPUs [duplicate] - c++

What is the best way to implement multiple versions of the same function that use specific CPU instructions if available (tested at run time), falling back to a slower implementation if not?
For example, x86 BMI2 provides the very useful PDEP instruction. How would I write C code that tests for BMI2 availability on the executing CPU at startup and uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in plain C? Is there any built-in support for such cases? How would I make GCC compile for an older arch while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked via a global function pointer rather than an if/else every time.

You can declare a function pointer and point it at the correct version at program startup by calling cpuid to determine the current architecture.
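A minimal sketch of that approach, assuming GCC or Clang on x86-64 (__builtin_cpu_supports and the target attribute are GNU extensions; the function names here are made up for illustration):

```cpp
#include <cstdint>

// Portable fallback: deposit the low bits of `value` into the positions
// selected by `mask` -- the same operation PDEP performs in hardware.
static std::uint64_t pdep_scalar(std::uint64_t value, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (value & bit)
            result |= mask & -mask; // set the lowest remaining mask bit
        mask &= mask - 1;           // clear that mask bit
    }
    return result;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
// Compiled with BMI2 enabled for this one function only, so the rest of
// the file can still target a baseline architecture.
__attribute__((target("bmi2")))
static std::uint64_t pdep_hw(std::uint64_t value, std::uint64_t mask) {
    return _pdep_u64(value, mask);
}
#endif

using pdep_fn = std::uint64_t (*)(std::uint64_t, std::uint64_t);

static pdep_fn resolve_pdep() {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("bmi2"))
        return pdep_hw; // safe: only reached on CPUs that have BMI2
#endif
    return pdep_scalar;
}

// Resolved once at startup; each call afterwards is one indirect call.
static const pdep_fn my_pdep = resolve_pdep();
```

The indirect call through `my_pdep` costs a pointer load per invocation, which is usually cheaper than re-testing a flag in every call.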
But it's better to use the support built into many modern compilers. Intel's ICC has had automatic function dispatching to select an optimized version for each architecture for a long time. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, which is unfair to other manufacturers; there are many patches and workarounds for that on Agner Fog's CPU blog.
Later, a feature called Function Multiversioning was introduced in GCC 4.8. It adds the target attribute that you declare on each version of your function:
__attribute__ ((target ("default")))
int foo() { return 0; } // a "default" version is required for FMV to link

__attribute__ ((target ("sse4.2")))
int foo() { return 1; }

__attribute__ ((target ("arch=atom")))
int foo() { return 2; }

int main() {
    int (*p)() = &foo;
    return foo() + p();
}
That duplicates a lot of code and is cumbersome, so GCC 6 added target_clones, which tells GCC to compile a function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} will create 3 different versions of foo. More information can be found in GCC's documentation on function attributes.
The syntax was later adopted by Clang and ICC. Performance can even be better than a global function pointer, because the function symbol can be resolved at process load time instead of at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
Function multi-versioning in GCC 6
Function Multi-Versioning
The - surprisingly limited - usefulness of function multiversioning in GCC
Generate code for multiple SIMD architectures
Here's an example from The one with multi-versioning (Part II), along with its demo. It's about popcnt, but you get the idea:
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;
    for (int r = 0; r < repeat; r++)
        for (int i = 0; i < size / 8; i++) {
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    return res;
}
Note that PDEP and PEXT are very slow on AMD CPUs before Zen 3 (they are microcoded there), so they should only be enabled on Intel, or on AMD Zen 3 and later.

Related

Using MSVC's __popcnt in a constexpr function

Background: I am trying to reuse some C++ code written for GCC in an MSVC project, and I have been refactoring the code to make it compatible with the MSVC compiler.
Simplified, one of the functions originally was this:
[[nodiscard]] constexpr int count() const noexcept {
    return __builtin_popcountll(mask); // gcc-specific function
}
Where mask is a 64-bit member variable. The obvious conversion to MSVC is:
[[nodiscard]] constexpr int count() const noexcept {
    return __popcnt64(mask); // MSVC replacement
}
However, it doesn't compile, because __popcnt64 is not allowed in a constexpr function.
I am using C++17, and I would prefer to avoid having to switch to C++20 if possible.
Is there a way to make it work?
You cannot make a non-constexpr function become a constexpr one. If the standard library doesn't declare it constexpr, then that's it: you will have to write your own, which is awkward in C++17.
It depends on the goal:
Just count bits at compile time, and possibly at run time. Then simply implement your own constexpr bit counting and don't use __popcnt64. You can look into Wikipedia's Hamming weight article for ideas.
Use the popcnt instruction at run time. Then you need a compile-time / run-time distinction, so that different implementations are used in the two contexts.
For the compile-time/run-time distinction in C++20 you would use if (std::is_constant_evaluated()) { ... } else { ... }.
In MSVC, std::is_constant_evaluated is implemented via the compiler intrinsic __builtin_is_constant_evaluated(), which happens to compile and work properly in C++17 as well. So you can:
constexpr int popcount(unsigned long long x)
{
    if (__builtin_is_constant_evaluated())
    {
        // Portable constexpr fallback (Kernighan's bit-clearing loop)
        int count = 0;
        for (; x != 0; x &= x - 1)
            ++count;
        return count;
    }
    else
    {
        return __popcnt64(x);
    }
}
Note: __builtin_popcountll compiles to either the popcnt instruction or bit counting via bit hacks, depending on compilation flags. MSVC's __popcnt64 always compiles to the popcnt instruction. If the goal is to support older CPUs that lack popcnt, you'd have to provide CPU detection (compile-time or run-time, again depending on goals) and a fallback, or not use __popcnt64 at all.
The question has already been answered, so just a side note.
Building your own efficient popcount function would probably be the best solution.
For that you can reuse the very old wisdom from the book "Hacker's Delight" by Henry S. Warren. The book dates from a time when programmers and algorithm developers worked to minimize the number of precious assembler instructions, both for ROM consumption (yes, indeed) and for performance.
You will find many very efficient solutions there, some of them completely loop-free and with an astonishingly low instruction count, for example using the divide-and-conquer method.
It is worth a visit.
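For instance, the divide-and-conquer population count from that book can be written as a loop-free constexpr function (a sketch; the function name is mine):

```cpp
#include <cstdint>

// "Hacker's Delight"-style divide-and-conquer popcount: sum bits in
// pairs, then nibbles, then bytes, then add all bytes with a multiply.
constexpr int popcount64(std::uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           // 2-bit sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // 8-bit sums
    return static_cast<int>((x * 0x0101010101010101ULL) >> 56);           // sum of all bytes
}

static_assert(popcount64(0) == 0, "no bits set");
static_assert(popcount64(0xFFULL) == 8, "one byte of ones");
static_assert(popcount64(~0ULL) == 64, "all bits set");
```

Being constexpr, it works both at compile time and as the run-time fallback discussed above.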

Is it possible to calculate function length at compile time in C++?

I have this piece of code:
constexpr static VOID fStart()
{
    auto a = 3;
    a++;
}

__declspec(naked)
constexpr static VOID fEnd() {};

static constexpr auto getFSize()
{
    return (SIZE_T)((PBYTE)fEnd - (PBYTE)fStart);
}

static constexpr auto fSize = getFSize();
static BYTE func[fSize];
Is it possible to declare the size of the func[fSize] array as the size of fStart() at compile time, without using any standard library facilities? This is necessary in order to copy the full code of fStart() into that array later.
There is no method in standard C++ to get the length of a function.
You'll need to use a compiler specific method.
One method is to have the linker create a segment, and place your function in that segment. Then use the length of the segment.
You may be able to use some assembly language constructs to do this; depends on the assembler and the assembly code.
Note: in embedded systems, there are reasons to move function code, such as to On-Chip memory or swap to external memory, or to perform a checksum on the code.
The following calculates the "byte size" of the fStart function. However, the size cannot be obtained as a constexpr this way, because casting loses the compile-time const'ness (see for example Why is reinterpret_cast not constexpr?), and the difference of two unrelated function pointers cannot be evaluated without some kind of casting.
#pragma runtime_checks("", off)
__declspec(code_seg("myFunc$a")) static void fStart()
{ auto a = 3; a++; }
__declspec(code_seg("myFunc$z")) static void fEnd(void)
{ }
#pragma runtime_checks("", restore)
constexpr auto pfnStart = fStart; // ok
constexpr auto pfnEnd = fEnd; // ok
// constexpr auto nStart = (INT_PTR)pfnStart; // error C2131
const auto fnSize = (INT_PTR)pfnEnd - (INT_PTR)pfnStart; // ok
// constexpr auto fnSize = (INT_PTR)pfnEnd - (INT_PTR)pfnStart; // error C2131
On some processors, and with some known compilers and ABI conventions, you could do the opposite:
generate machine code at runtime.
For x86-64 on Linux, I know of GNU lightning, asmjit, and libgccjit doing so.
The elf(5) format knows the size of functions.
On Linux, you can generate shared libraries (perhaps by generating C or C++ code at runtime, like RefPerSys does and GCC MELT did, then compiling it with gcc -fPIC -shared -O) and later dlopen(3) / dlsym(3) them. dladdr(3) is also very useful. You'll use function pointers.
Read also a book on linkers and loaders.
But you usually cannot move machine code without doing some relocation, unless that machine code is position-independent code (and quite often PIC is slower to run than ordinary code).
A related topic is garbage collection of code (or even of agents). You would need to read the garbage collection handbook and take inspiration from implementations like SBCL.
Remember also that a good optimizing C++ compiler is allowed to unroll loops, inline-expand function calls, remove dead code, clone functions, and so on. So it may happen that machine-code functions are not even contiguous: two C functions foo() and bar() could share dozens of common machine instructions.
Read the Dragon book, and study the source code of GCC (consider extending it with your own GCC plugin). Also look at the assembler code produced by gcc -O2 -Wall -fverbose-asm -S. Some experimental variants of GCC can generate OpenCL code running on your GPGPU (and then the very notion of a function's end does not make sense).
With plugins generated through C and C++, you could carefully remove them using dlclose(3), if you use Ian Taylor's libbacktrace and dladdr to explore your call stack. In 99% of the cases it is not worth the trouble, since in practice a Linux process (on current x86-64 laptops in 2021) can do perhaps a million dlopen(3) calls, as my manydl.c program demonstrates (it generates "random" C code, compiles it into a unique /tmp/generated123.so, dlopens that, and repeats many times).
The only reason (on desktop and server computers) to overwrite machine code is for long-lasting server processes that generate machine code every second. If this were your scenario, you should have mentioned it (and generating JVM bytecode via Java class loaders could make more sense).
Of course, on 16 bits microcontrollers things are very different.
Is it possible to calculate function length at compile time in C++?
No, because at run time some functions do not exist anymore.
The compiler has somehow removed them. Or cloned them. Or inlined them.
And for C++ this is practically important with standard containers: a lot of template expansion occurs, including for useless code which has to be removed by your optimizing compiler at some point.
(Think, in 2021, of compiling with a recent GCC 10.2 or 11 used everywhere, and linking with gcc -O3 -flto -fwhole-program: a function foo23 might be defined but never called, and then it is not inside the ELF executable.)

What is the difference between gcc builtin function and common c/c++ code [duplicate]

Comparing the following two expressions
std::bitset<8>(5).count()
__builtin_popcount(5)
which one is better?
int __builtin_popcount(unsigned int);
is a built-in function of GCC, while std::bitset<N>::count is part of the C++ standard.
Both functions do the same thing: return the number of bits that are set to true.
What should you use?
Prefer the C++ standard's functions, because other compilers don't support the __builtin_popcount function.
UPDATE
If you take a look at the statistics made by Google Benchmark tool:
#include <benchmark/benchmark.h>
#include <bitset>

static void GccBuiltInPopCount(benchmark::State& state) {
    for (auto _ : state) {
        __builtin_popcount(5);
    }
}
BENCHMARK(GccBuiltInPopCount);

static void StdBitsetCount(benchmark::State& state) {
    for (auto _ : state) {
        std::bitset<8>(5).count();
    }
}
BENCHMARK(StdBitsetCount);
with GCC 9.2 and flags -std=c++2a -O3, the GCC built-in function is 10% slower than std::bitset<N>::count(), but since the ASM output is the same for both functions, the difference in the benchmark is probably due to other factors.
According to Godbolt, bitset and popcount yield exactly the same asm output on the latest g++. However, as mentioned in the comments, __builtin_popcount is a GCC extension and won't be available on other compilers such as MSVC. Therefore, the bitset option is clearly better.
When you don’t know the value of N in std::bitset<N>::count, I think the second one is better
Update: in C++20 you can use std::popcount.

compiler reordering and load operation

I'm starting with lock-free programming and I'm running into some difficulties with basic things. I have found the following example:
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int Value;
int IsPublished = 0;

void sendValue(int x)
{
    Value = x;
    COMPILER_BARRIER(); // prevent reordering of stores
    IsPublished = 1;
}

int tryRecvValue()
{
    if (IsPublished)
    {
        COMPILER_BARRIER(); // prevent reordering of loads
        return Value;
    }
    return -1; // or some other value to mean not yet received
}
What kind of reordering can a compiler perform in the tryRecvValue function?
Answer
Without the inline asm*, the compiler would be free to load Value first, then IsPublished, and then perform the check and return.
Always check the assembly output if you're not sure what the compiler is doing. This is especially true of lock free programming techniques.
More Information
* NOTE: Inline asm is not a required part of the C++ standard, but is implemented by all major compilers. It is only mentioned in section 7.4:
The asm declaration is conditionally-supported; its meaning is implementation-defined. [ Note: Typically it
is used to pass information through the implementation to an assembler. — end note ]
Generally speaking, compilers cannot reorder reads and writes around inline assembly with a "memory" clobber, because the compiler cannot make assumptions about what the assembly does.
The use of COMPILER_BARRIER() above will not act as any kind of hardware read/write barrier; it just has the side effect of disallowing compiler reordering of memory operations around the statement.
Assuming the above code is called from different threads, it will only work as-is on x86, since the architecture guarantees that writes are never (observably) reordered by the cpu. (reference: 8.2.3.2 in volume 3A of the Intel Manuals)
For use on other architectures with relaxed memory models (like PowerPC and ARM), you'll need hardware barriers that prevent these kinds of reorderings. For a good breakdown, check out Jeff Preshing's articles.
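For portability, the same publish/consume pattern is usually written with C++11 atomics, which subsume both the compiler barrier and whatever hardware barrier the target needs (a sketch of the example above, not Preshing's exact code):

```cpp
#include <atomic>

int Value;
std::atomic<int> IsPublished{0};

void sendValue(int x)
{
    Value = x;
    // release: the store to Value cannot be reordered past this store
    IsPublished.store(1, std::memory_order_release);
}

int tryRecvValue()
{
    // acquire: if we observe the flag, we also observe the published Value
    if (IsPublished.load(std::memory_order_acquire))
        return Value;
    return -1; // not yet received
}
```

On x86 the release store and acquire load compile to plain mov instructions (the hardware already provides the ordering), while on ARM or PowerPC the compiler inserts the required barrier instructions automatically.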

Clang/GCC Compiler Intrinsics without corresponding compiler flag

I know there are similar questions to this, but compiling different files with different flags is not an acceptable solution here, since it would complicate the codebase very quickly. An answer of "No, it is not possible" will do.
Is it possible, in any version of Clang or GCC, to compile intrinsic functions for SSE2/SSE3/SSSE3/SSE4.1 while only allowing the compiler to use the plain SSE instruction set for its own optimization?
EDIT: For example, I want the compiler to turn _mm_load_si128() into movdqa, but the compiler must not emit this instruction anywhere other than in this intrinsic function, similar to how the MSVC compiler works.
EDIT2: I have a dynamic dispatcher in place and several versions of a single function written with intrinsics for different instruction sets. Using multiple files would make this much harder to maintain, as the same version of the code would span multiple files, and there are a lot of functions of this type.
EDIT3: Example source code as requested: https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/filters/resample.cpp or most file in that folder really.
Here is an approach using gcc that might be acceptable. All source code goes into a single source file. The single source file is divided into sections. One section generates code according to the command line options used. Functions like main() and processor feature detection go in this section. Another section generates code according to a target override pragma. Intrinsic functions supported by the target override value can be used. Functions in this section should be called only after processor feature detection has confirmed the needed processor features are present. This example has a single override section for AVX2 code. Multiple override sections can be used when writing functions optimized for multiple targets.
// temporarily switch target so that all x64 intrinsic functions will be available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
#include <x86intrin.h>   // GCC's intrinsics header; <intrin.h> is MSVC-only
// restore the target selection
#pragma GCC pop_options

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
//----------------------------------------------------------------------------

int dummy1 (int a) {return a;}

//----------------------------------------------------------------------------
// the following functions will be compiled using core-avx2 code generation
// all x64 intrinsic functions are available
#pragma GCC push_options
#pragma GCC target ("arch=core-avx2")
//----------------------------------------------------------------------------

static __m256i bitShiftLeft256ymm (__m256i *data, int count)
{
    __m256i innerCarry, carryOut, rotate;

    innerCarry = _mm256_srli_epi64 (*data, 64 - count);                      // carry outs in bit 0 of each qword
    rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                // rotate ymm left 64 bits
    innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC); // clear lower qword
    *data      = _mm256_slli_epi64 (*data, count);                           // shift all qwords left
    *data      = _mm256_or_si256 (*data, innerCarry);                        // propagate carrys from low qwords
    carryOut   = _mm256_xor_si256 (innerCarry, rotate);                      // clear all except lower qword
    return carryOut;
}

//----------------------------------------------------------------------------
// the following functions will be compiled using default code generation
#pragma GCC pop_options
//----------------------------------------------------------------------------

int main (void)
{
    return 0;
}
//----------------------------------------------------------------------------
There is no way to control the instruction set used by the compiler other than through the switches on the compiler itself. In other words, there are no pragmas or other features for this, just the overall compiler flags.
This means that the only viable solution for achieving what you want is to use -msseX and split your source into multiple files (of course, you can always use various clever #include tricks to keep one single text file as the main source, and just include the same file in multiple places).
Of course, the source code of the compiler is available. I'm sure the maintainers of GCC and Clang/LLVM would happily take patches that improve on this. But bear in mind that the path from "parsing the source" to "emitting instructions" is quite long and complicated. What should happen if we do this:
#pragma use_sse=1
void func()
{
    ... some code goes here ...
}

#pragma use_sse=3
void func2()
{
    ...
    func();
    ...
}
Now, func is short enough to be inlined; should the compiler inline it? If so, should it use SSE1 or SSE3 instructions for func()?
I understand that YOU may not care about that sort of difficulty, but the maintainers of Clang and GCC will indeed have to deal with this in some way.
Edit:
In the header files declaring the SSE intrinsics (and many other intrinsics), a typical function looks something like this:
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ss (__m128 __A, __m128 __B)
{
    return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
}
The __builtin_ia32_addss builtin is only available in the compiler when the -msse option is enabled. So if you convince the compiler to still allow you to use _mm_add_ss() when you have -mno-sse, it will give you an error like "__builtin_ia32_addss is not declared in this scope" (I just tried).
It would probably not be very hard to change this particular behaviour; there are probably only a few places where the code introduces the builtin functions. However, I'm not convinced there aren't further issues later on in the code, when it comes to actually issuing instructions in the compiler.
I have done some work with "builtin functions" in a Clang-based compiler, and unfortunately, there are several steps involved in getting from the "parser" to the "code generation", where the builtin function gets involved.
Edit2:
Compared to GCC, solving this for Clang is even more complex, in that the compiler itself has an understanding of SSE operations, so the header file simply contains:
static __inline__ __m128 __attribute__((__always_inline__, __nodebug__))
_mm_add_ps(__m128 __a, __m128 __b)
{
    return __a + __b;
}
The compiler then knows that to add a couple of __m128 values it needs to produce the correct SSE instruction. I have just downloaded Clang (I'm at home; my work on Clang is at work and not related to SSE at all, just builtin functions in general, and I haven't really made many changes to Clang as such, but it was enough to understand roughly how builtin functions work).
However, from your perspective, the fact that it's not a builtin function makes things worse, because the operator+ translation is much more complicated. I'm pretty sure the compiler just turns it into an "add these two things" node and passes it on to LLVM for further work; LLVM is the part that understands SSE instructions and so on. But for your purposes this makes it worse, because the fact that this is an "intrinsic function" is now pretty much lost: the compiler deals with it just as if you'd written a + b, with the side note that a and b are types 128 bits wide. That makes it even more complicated to generate "the right instructions" while keeping "all other" instructions at a different SSE level.