-fno-strict-aliasing as function attribute - c++

I have a function in which I'm type-punning for performance reasons. Basically, I have a 32-by-32 bit array stored as an array of 32 uint32s:
struct Tile {
    uint32_t d[32];
};
I then want to calculate the population (number of '1's) of the 28-by-28 'interior' of the 32-by-32 tile. The naive method would take 28 calls to the machine's popcnt instruction, one for each row. However, since popcnt can take a 64-bit argument, this can be reduced to 14 popcnt calls:
int countPopulation(Tile* sqt) __attribute__((optimize("-fno-strict-aliasing"))) {
    int pop = 0;
    for (int i = 2; i < 30; i += 2) {
        const uint64_t v = *reinterpret_cast<const uint64_t*>(sqt->d + i);
        pop += __builtin_popcountll(v & 0x3ffffffc3ffffffcull);
    }
    return pop;
}
If I don't include the attribute:
__attribute__((optimize("-fno-strict-aliasing")))
then g++ will consistently complain, for obvious reasons, about my type-punning:
warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
const uint64_t v = *reinterpret_cast<const uint64_t*>(sqt->d + i);
On the other hand, if I do include the attribute, certain versions of g++ complain whereas others do not. Of the machines on which I've tried this, I get:
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4 complains
g++-4.6.real (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 does not complain
g++ (Debian 5.3.1-5) 5.3.1 20160101 does not complain
What's wrong with the Ubuntu flavour of g++ 4.8.4?

Do not type-pun; there is no reason for it here. Instead, properly use memcpy() to copy to and from a uint64_t. The optimizer will do the rest.
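A minimal sketch of that approach (same Tile struct as in the question); GCC and Clang at -O2 compile the memcpy into a single 64-bit load, so performance is identical to the type-punned version:

#include <cstdint>
#include <cstring>

struct Tile {
    uint32_t d[32];
};

int countPopulation(const Tile* sqt) {
    int pop = 0;
    for (int i = 2; i < 30; i += 2) {
        uint64_t v;
        std::memcpy(&v, sqt->d + i, sizeof v); // well-defined, no aliasing violation
        // The mask is symmetric in its two 32-bit halves, so the byte order
        // within the 64-bit load does not matter.
        pop += __builtin_popcountll(v & 0x3ffffffc3ffffffcull);
    }
    return pop;
}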


Integral promotion and operator+=

I need to eliminate gcc -Wconversion warnings. For example
typedef unsigned short uint16_t;
uint16_t a = 1;
uint16_t b = 2;
b += a;
gives
warning: conversion to 'uint16_t {aka short unsigned int}' from 'int' may alter its value [-Wconversion]
b += a;
~~^~~~
I can eliminate this by
uint16_t a = 1;
uint16_t b = 2;
b = static_cast<uint16_t>(b + a);
Is there any way to keep the operator+= and eliminate the warning? Thank you.
EDIT
I use
gcc test.cpp -Wconversion
My gcc version is
gcc.exe (Rev3, Built by MSYS2 project) 7.2.0
I need to eliminate gcc -Wconversion warnings.
You don't say why, but it is actually unlikely that you need to.
From the GCC wiki page on this switch:
Why isn't Wconversion enabled by -Wall or at least by -Wextra?
Implicit conversions are very common in C. This tied with the fact that there is no data-flow in front-ends (see next question) results in hard to avoid warnings for perfectly working and valid code. Wconversion is designed for a niche of uses (security audits, porting 32 bit code to 64 bit, etc.) where the programmer is willing to accept and workaround invalid warnings. Therefore, it shouldn't be enabled if it is not explicitly requested.
If you don't want it, just turn it off.
Mangling your code with unnecessary casts, making it harder to read and maintain, is the wrong solution.
If your build engineers are insisting on this flag, ask them why, and ask them to stop.
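If the flag must stay enabled globally but you want to silence one known-safe spot, GCC's diagnostic pragmas can do that without a cast. A minimal sketch:

#include <cstdint>

uint16_t add(uint16_t a, uint16_t b) {
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wconversion"
    b += a; // int result implicitly narrowed back to uint16_t; warning suppressed here only
#pragma GCC diagnostic pop
    return b;
}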
You could build your own abstraction to overload the += operator, something like
template <typename T>
class myVar {
public:
    myVar(T var) : val{var} {}
    myVar& operator+=(const myVar& t) {
        this->val = static_cast<T>(this->val + t.val);
        return *this;
    }
    T val;
};
int main()
{
    typedef unsigned short uint16_t;
    myVar<uint16_t> c{3};
    myVar<uint16_t> d{4};
    c += d;
}
It still uses a static_cast, but you only need to write it once and can then reuse it. And you don't need it in your main.
IMHO it just adds overhead, but opinions may vary...

Auto-vectorization with SSE2 movemask for bytes to bitmap with gcc

With properly constructed C/C++ code one can coax gcc into generating efficient SIMD assembly on its own, without the use of intrinsics, e.g. https://locklessinc.com/articles/vectorize/
I am trying to achieve a similar effect for movemask operation (PMOVMSKB / *_movemask_epi8 family), but so far without success.
The simplest code I could think of:
#include <cstdint>

alignas(128) int8_t arr[32];

uint32_t foo()
{
    uint32_t rv = 0;
    for (int it = 0; it < 32; ++it)
    {
        rv |= uint32_t(arr[it] < 0) << it; // cast avoids shifting a signed int by 31
    }
    return rv;
}
leads to assembly that fails to utilize move mask instruction: https://godbolt.org/z/3XimYc
Does anyone have an idea if there's a way to do that with gcc without explicitly using intrinsics?
I haven't yet looked into gcc's machine-description (.md) files and the associated implementation (https://github.com/gcc-mirror/gcc/tree/master/gcc/config/i386).
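For reference, a minimal sketch of the intrinsics version the loop is trying to express (shown only for comparison, not as an answer to the no-intrinsics question; the 32-byte variant assumes -mavx2):

#include <immintrin.h>
#include <cstdint>

alignas(32) int8_t arr2[32];

uint32_t foo_movemask()
{
    // One 32-byte load, then VPMOVMSKB gathers the sign bit of each byte.
    __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(arr2));
    return static_cast<uint32_t>(_mm256_movemask_epi8(v));
}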

C++ Centralizing SIMD usage

I have a library and a lot of projects depending on that library. I want to optimize certain procedures inside the library using SIMD extensions. However, it is important for me to stay portable, so to the user it should be quite abstract.
I'll say at the beginning that I don't want to use some other great library that does the trick. I actually want to understand whether what I want is possible, and to what extent.
My very first idea was to have a "vector" wrapper class so that the use of SIMD is transparent to the user, with a "scalar" vector class that could be used in case no SIMD extension is available on the target machine.
The naive thought that came to my mind was to use the preprocessor to select one vector class out of many, depending on which target the library is compiled for. So one scalar vector class, one with SSE (something like this, basically: http://fastcpp.blogspot.de/2011/12/simple-vector3-class-with-sse-support.html) and so on... all with the same interface.
This gives me good performance, but it would mean I have to compile the library for every SIMD ISA I use. I would rather evaluate the processor's capabilities dynamically at runtime and select the "best" implementation available.
So my second guess was to have a general "vector" class with abstract methods. A "processor evaluator" function would then return instances of the optimal implementation. Obviously this would lead to ugly code, but the pointer to the vector object could be stored in a smart-pointer-like container that just delegates the calls to the vector object. Actually I would prefer this method because of its abstraction, but I'm not sure whether calling the virtual methods would kill the performance gained by using SIMD extensions.
The last option I figured out would be to optimize whole routines and select the optimal one at runtime. I don't like this idea so much because it forces me to implement whole functions multiple times. I would prefer to do it once; using my idea of the vector class, I would like to do something like this, for example:
void Memcopy(void *dst, void *src, size_t size)
{
    vector v;
    for (int i = 0; i < size; i += v.size())
    {
        v.load(src);
        v.store(dst);
        dst += v.size();
        src += v.size();
    }
}
I assume here that "size" is a suitable value, so that no overlap happens. This example should just show what I would prefer to have. The size() method of the vector object would, for example, just return 4 in case SSE is used and 1 in case the scalar version is used.
Is there a proper way to implement this using only runtime information, without losing too much performance? Abstraction is more important to me than performance, but as this is a performance optimization I wouldn't include it if it did not speed up my application.
I also found this on the web: http://compeng.uni-frankfurt.de/?vc
It's open source, but I don't understand how the correct vector class is chosen.
Your idea will only compile to efficient code if everything inlines at compile time, which is incompatible with runtime CPU dispatching. For v.load(), v.store(), and v.size() to actually be different at runtime depending on the CPU, they'd have to be actual function calls, not single instructions. The overhead would be killer.
If your library has functions that are big enough to work without being inlined, then function pointers are great for dispatching based on runtime CPU detection. (e.g. make multiple versions of memcpy, and pay the overhead of runtime detection once per call, not twice per loop iteration.)
This shouldn't be visible in your library's external API/ABI, unless your functions are mostly so short that the overhead of an extra (direct) call/ret matters. In the implementation of your library functions, put each sub-task that you want to make a CPU-specific version of into a helper function. Call those helper functions through function pointers.
Start with your function pointers initialized to versions that will work on your baseline target. e.g. SSE2 for x86-64, scalar or SSE2 for legacy 32bit x86 (depending on whether you care about Athlon XP and Pentium III), and probably scalar for non-x86 architectures. In a constructor or library init function, do a CPUID and update the function pointers to the best version for the host CPU. Even if your absolute baseline is scalar, you could make your "good performance" baseline something like SSSE3, and not spend much/any time on SSE2-only routines. Even if you're mostly targeting SSSE3, some of your routines will probably end up only requiring SSE2, so you might as well mark them as such and let the dispatcher use them on CPUs that only do SSE2.
Updating the function pointers shouldn't even require any locking. Any calls that happen from other threads before your constructor is done setting function pointers may get the baseline version, but that's fine. Storing a pointer to an aligned address is atomic on x86. If it's not atomic on any platform where you have a version of a routine that needs runtime CPU detection, use C++ std::atomic (with memory_order_relaxed stores and loads, not the default sequential consistency which would trigger a full memory barrier on every load). It matters a lot that there's minimal overhead when calling through the function pointers, and it doesn't matter what order different threads see the changes to the function pointers. They're write-once.
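A minimal sketch of this pattern (the function names here are hypothetical; it assumes GCC's __builtin_cpu_supports and constructor attribute):

#include <atomic>
#include <cstddef>
#include <cstring>

// Hypothetical baseline and AVX2 implementations; in a real library the
// AVX2 one would live in a translation unit compiled with -mavx2.
static void copy_baseline(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }
static void copy_avx2(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }

// Write-once function pointer; relaxed ordering suffices because every
// version is correct, so early callers may simply get the baseline.
static std::atomic<void (*)(void*, const void*, std::size_t)> copy_ptr{copy_baseline};

__attribute__((constructor)) // runs at library load time, before main()
static void init_dispatch() {
    if (__builtin_cpu_supports("avx2"))
        copy_ptr.store(copy_avx2, std::memory_order_relaxed);
}

void lib_copy(void* dst, const void* src, std::size_t n) {
    copy_ptr.load(std::memory_order_relaxed)(dst, src, n);
}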
x264 (the heavily-optimized open source h.264 video encoder) uses this technique extensively, with arrays of function pointers. See x264_mc_init_mmx(), for example. (That function handles all CPU dispatching for Motion Compensation functions, from MMX to AVX2). I assume libx264 does the CPU dispatching in the "encoder init" function. If you don't have a function that users of your library are required to call, then you should look into some kind of mechanism for running global constructor / init functions when programs using your library start up.
If you want this to work with very C++ey code (C++ish? Is that a word?) i.e. templated classes & functions, the program using the library will probably have to do the CPU dispatching, and arrange to get baseline and multiple CPU-requirement versions of functions compiled.
I do exactly this with a fractal project. It works with vector sizes of 1, 2, 4, 8, and 16 for float and 1, 2, 4, and 8 for double. I use a CPU dispatcher at run-time to select the following instruction sets: SSE2, SSE4.1, AVX, AVX+FMA, and AVX512.
The reason I use a vector size of 1 is to test performance. There is already a SIMD library that does all this: Agner Fog's Vector Class Library. He even includes example code for a CPU dispatcher.
The VCL emulates hardware such as AVX on systems that only have SSE (and even AVX512 on SSE-only systems). It just implements AVX twice (or four times for AVX512), so in most cases you can just use the largest vector size you want to target.
//#include "vectorclass.h"
void Memcopy(void *dst, void *src, size_t size)
{
Vec8f v; //eight floats using AVX hardware or AVX emulated with SSE twice.
for(int i = 0; i < size; i +=v.size())
{
v.load(src);
v.store(dst);
dst += v.size();
src += v.size();
}
}
(However, writing an efficient memcpy is complicated. For large sizes you should consider non-temporal stores, and on Ivy Bridge and above use rep movsb instead.) Notice that this code is essentially identical to what you asked for, except that the word vector became Vec8f.
Using the VCL, a CPU dispatcher, templating, and macros, you can write your code/kernel so that it looks nearly identical to scalar code, without source code duplication for every different instruction set and vector size. It's your binaries which will be bigger, not your source code.
I have described CPU dispatchers several times. You can also see an example using templating and macros for a dispatcher here: alias of a function template
Edit: Here is an example of part of my kernel to calculate the Mandelbrot set for a set of pixels equal to the vector size. At compile time I set TYPE to float, double, or doubledouble and N to 1, 2, 4, 8, or 16. The type doubledouble, which I created and added to the VCL, is described here. This produces vector types of Vec1f, Vec4f, Vec8f, Vec16f, Vec1d, Vec2d, Vec4d, Vec8d, doubledouble1, doubledouble2, doubledouble4, doubledouble8.
template<typename TYPE, unsigned N>
static inline intn calc(floatn const &cx, floatn const &cy, floatn const &cut, int32_t maxiter) {
    floatn x = cx, y = cy;
    intn n = 0;
    for (int32_t i = 0; i < maxiter; i++) {
        floatn x2 = square(x), y2 = square(y);
        floatn r2 = x2 + y2;
        booln mask = r2 < cut;
        if (!horizontal_or(mask)) break;
        add_mask(n, mask);
        floatn t = x*y; mul2(t);
        x = x2 - y2 + cx;
        y = t + cy;
    }
    return n;
}
So my SIMD code for several different data types and vector sizes is nearly identical to the scalar code I would use. I have not included the part of my kernel which loops over each super-pixel.
My build file looks something like this
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass kernel.cpp -okernel_sse2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse4.1 -Ivectorclass kernel.cpp -okernel_sse41.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx -Ivectorclass kernel.cpp -okernel_avx.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel.cpp -okernel_avx2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel_fma.cpp -okernel_fma.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx512f -mfma -Ivectorclass kernel.cpp -okernel_avx512.o
g++ -m64 -Wall -Wextra -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass frac.cpp vectorclass/instrset_detect.cpp kernel_sse2.o kernel_sse41.o kernel_avx.o kernel_avx2.o kernel_avx512.o kernel_fma.o -o frac
Then the dispatcher looks something like this
int iset = instrset_detect();
fp_float1 = NULL;
fp_floatn = NULL;
fp_double1 = NULL;
fp_doublen = NULL;
fp_doublefloat1 = NULL;
fp_doublefloatn = NULL;
fp_doubledouble1 = NULL;
fp_doubledoublen = NULL;
fp_float128 = NULL;
fp_floatn_fma = NULL;
fp_doublen_fma = NULL;
if (iset >= 9) {
    fp_float1 = &manddd_AVX512<float,1>;
    fp_floatn = &manddd_AVX512<float,16>;
    fp_double1 = &manddd_AVX512<double,1>;
    fp_doublen = &manddd_AVX512<double,8>;
    fp_doublefloat1 = &manddd_AVX512<doublefloat,1>;
    fp_doublefloatn = &manddd_AVX512<doublefloat,16>;
    fp_doubledouble1 = &manddd_AVX512<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX512<doubledouble,8>;
}
else if (iset >= 8) {
    fp_float1 = &manddd_AVX<float,1>;
    fp_floatn = &manddd_AVX2<float,8>;
    fp_double1 = &manddd_AVX2<double,1>;
    fp_doublen = &manddd_AVX2<double,4>;
    fp_doublefloat1 = &manddd_AVX2<doublefloat,1>;
    fp_doublefloatn = &manddd_AVX2<doublefloat,8>;
    fp_doubledouble1 = &manddd_AVX2<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX2<doubledouble,4>;
}
....
This sets function pointers to each of the different possible datatype/vector-size combinations for the instruction set found at runtime. Then I can call whatever function I'm interested in.
Thanks Peter Cordes and Z boson. With both your replies I came to a solution that satisfies me.
I chose Memcopy as an example just because everyone knows it, and because of its beautiful simplicity (but also slowness) when implemented naively, in contrast to SIMD optimizations, which are often not very readable anymore but of course much faster.
I now have two classes (more are possible, of course), a scalar vector and an SSE vector, both with inline methods. To the user I expose something like:
typedef void(*MEM_COPY_FUNC)(void *, const void *, size_t);
extern MEM_COPY_FUNC memCopyPointer;
I declare my function something like this, as Z boson pointed out:
template <typename VectorType>
void MemCopyTemplate(void *pDest, const void *prc, size_t size)
{
    VectorType v;
    byte *pDst = (byte *)pDest;
    const byte *pSrc = (const byte *)prc;
    // GetSize() returns the vector width in bytes (a power of two).
    uint32 mask = v.GetSize() - 1;
    // Copy single bytes until the remaining size is a multiple of the vector width.
    while (size & mask)
    {
        *pDst++ = *pSrc++;
        size--;
    }
    while (size)
    {
        v.Load(pSrc);
        v.Store(pDst);
        pDst += v.GetSize();
        pSrc += v.GetSize();
        size -= v.GetSize();
    }
}
And at runtime, when the library is loaded, I use CPUID to do either
memCopyPointer = MemCopyTemplate<ScalarVector>;
or
memCopyPointer = MemCopyTemplate<SSEVector>;
as you both suggested. Thanks a lot.

Wrapper for `__m256` Producing Segmentation Fault with Constructor - Windows 64 + MinGW + AVX Issues

I have a union that looks like this
union bareVec8f {
    __m256 m256; // AVX 8x float vector
    float floats[8];
    int ints[8];
    inline bareVec8f() {
    }
    inline bareVec8f(__m256 vec) {
        this->m256 = vec;
    }
    inline bareVec8f &operator=(__m256 m256) {
        this->m256 = m256;
        return *this;
    }
    inline operator __m256 &() {
        return m256;
    }
};
The __m256 needs to be aligned on a 32-byte boundary to be used with AVX functions, and should be automatically, even within the union.
And when I do this
bareVec8f test = _mm256_set1_ps(1.0f);
I get a segmentation fault. This code should work because of the constructor I made. However, when I do this
bareVec8f test;
test.m256 = _mm256_set1_ps(8.f);
I do not get a segmentation fault.
So because that works fine, the union is probably aligned properly; the segmentation fault seems to be caused by the constructor.
I'm using the gcc 64-bit Windows compiler.
EDIT
Matt managed to produce the simplest example of the error that seems to be happening here.
#include <immintrin.h>

void foo(__m256 x) {}

int main()
{
    __m256 r = _mm256_set1_ps(0.0f);
    foo(r);
}
I'm compiling with -std=c++11 -mavx
This is a bug in g++ for Windows: it does not perform 32-byte stack alignment when it should. See GCC Bug 49001 and Bug 54412.
On this SO thread someone made a Python script to process the assembly output by g++ to fix the problem, so that would be one option.
Otherwise, to avoid this in your union, you could change the functions which take __m256 by value to take it by reference instead. This shouldn't have any performance penalty unless optimization is low/off.
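A sketch of that change applied to the constructor and assignment operator above (taking the argument by const reference so no over-aligned temporary is passed on the under-aligned stack):

inline bareVec8f(const __m256 &vec) {
    this->m256 = vec;
}
inline bareVec8f &operator=(const __m256 &m256) {
    this->m256 = m256;
    return *this;
}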
In case you are unaware - union aliasing causes undefined behaviour in C++, it's not permitted to write m256 and then read floats or ints for example. So perhaps there is a different solution to your problem.
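If the union aliasing is the concern, one well-defined alternative is to go through an explicit store (or memcpy); a minimal sketch of extracting a lane that way:

#include <immintrin.h>

// Extract lane 3 without union type punning: store to a local array first.
// Compilers typically optimize this to a single shuffle/extract.
inline float get_lane3(__m256 v) {
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, v);
    return tmp[3];
}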

Counting the number of leading zeros in a 128-bit integer

How can I count the number of leading zeros in a 128-bit integer (uint128_t) efficiently?
I know GCC's built-in functions:
__builtin_clz, __builtin_clzl, __builtin_clzll
__builtin_ffs, __builtin_ffsl, __builtin_ffsll
However, these functions only work with 32- and 64-bit integers.
I also found some x86 intrinsics:
__lzcnt16, __lzcnt, __lzcnt64
As you may guess, these only work with 16-, 32- and 64-bit integers.
Is there any similar, efficient built-in functionality for 128-bit integers?
inline int clz_u128 (uint128_t u) {
    uint64_t hi = u >> 64;
    uint64_t lo = u;
    int retval[3] = {
        __builtin_clzll(hi),
        __builtin_clzll(lo) + 64,
        128
    };
    int idx = !hi + ((!lo) & (!hi));
    return retval[idx];
}
This is a branch-free variant. Note that more work is done than in the branchy solution, and in practice the branching would probably be predictable.
It also relies on __builtin_clzll not crashing when fed 0: the docs say the result is undefined, but is that just unspecified, or undefined?
Assuming a 'random' distribution, the first non-zero bit will be in the high 64 bits, with an overwhelming probability, so it makes sense to test that half first.
Have a look at the code generated for:
/* inline */ int clz_u128 (uint128_t u)
{
    unsigned long long hi, lo; /* (or uint64_t) */
    int b = 128;
    if ((hi = u >> 64) != 0) {
        b = __builtin_clzll(hi);
    }
    else if ((lo = u & ~0ULL) != 0) {
        b = __builtin_clzll(lo) + 64;
    }
    return b;
}
I would expect gcc to implement each __builtin_clzll using the bsrq instruction - bit scan reverse, i.e., most-significant bit position - in conjunction with an xor, (msb ^ 63), or sub, (63 - msb), to turn it into a leading zero count. gcc might generate lzcnt instructions with the right -march= (architecture) options.
Edit: others have pointed out that the 'distribution' is not relevant in this case, since the HI uint64_t needs to be tested regardless.
Yakk's answer works well for all kinds of targets as long as gcc supports 128-bit integers for the target. However, note that on the x86-64 platform, with an Intel Haswell processor or newer, there is a more efficient solution:
#include <immintrin.h>
#include <stdint.h>

// tested with compiler options: gcc -O3 -Wall -m64 -mlzcnt
inline int lzcnt_u128 (unsigned __int128 u) {
    uint64_t hi = u >> 64;
    uint64_t lo = u;
    lo = (hi == 0) ? lo : -1ULL;
    return _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
The _lzcnt_u64 intrinsic compiles (gcc 5.4) to the lzcnt instruction, which is well defined for a zero input (it returns 64), in contrast to gcc's __builtin_clzll(). The ternary operator compiles to the cmove instruction.