Gcc misoptimises sse function

I'm converting a project to compile with gcc instead of clang and I've run into an issue with a function that uses SSE intrinsics:
void dodgy_function(
const short* lows,
const short* highs,
short* mins,
short* maxs,
int its
)
{
__m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
__m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
for (int i = 0; i < its; ++i) {
reinterpret_cast<short*>(v00)[i] = lows[i];
reinterpret_cast<short*>(v10)[i] = highs[i];
}
reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];
__m128i v01[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i v11[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i min[2];
__m128i max[2];
min[0] = _mm_min_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));
min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1]));
max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));
reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);
reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}
With clang this gives me the expected output, but with gcc it prints all zeros: godbolt link
Playing around I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behavior?
As a workaround I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2
Any ideas?

The problem is that you're using short* to access the elements of a __m128i object. That violates the strict-aliasing rule. It's only safe to go the other way: dereferencing a __m128i*, or more normally using _mm_load_si128( (const __m128i*)ptr ).
__m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
The only standard blessed way to do type punning is with memcpy:
memcpy(v00, lows, its * sizeof(short));
memcpy(v10, highs, its * sizeof(short));
memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));
https://godbolt.org/z/f63q7x
I prefer just using aligned memory of the correct type directly:
alignas(16) short v00[16];
alignas(16) short v10[16];
auto mv00 = reinterpret_cast<__m128i*>(v00);
auto mv10 = reinterpret_cast<__m128i*>(v10);
_mm_store_si128(mv00, _mm_setzero_si128());
_mm_store_si128(mv10, _mm_setzero_si128());
_mm_store_si128(mv00 + 1, _mm_setzero_si128());
_mm_store_si128(mv10 + 1, _mm_setzero_si128());
for (int i = 0; i < its; ++i) {
v00[i] = lows[i];
v10[i] = highs[i];
}
v00[its] = v00[its - 1];
v10[its] = v10[its - 1];
https://godbolt.org/z/bfanne
I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps since you can do it without type punning at all) but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that memcpy does since it's more or less the kosher way to go from straight line to vectorized code in x86.
As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre c++11 contexts. Even in that case though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory) because issues like this tend to pop up if you don't.
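As a minimal sketch of the "explicit loads and stores" habit described above: the data lives in plain short storage, is filled with memcpy, and only ever becomes a vector through a load intrinsic. The helper names load_partial_epi16 and store_epi16 are made up for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics

// Build one __m128i from up to 8 shorts. The vector is never written
// through a short*; unused lanes stay zero.
inline __m128i load_partial_epi16(const short* src, int count) {
    assert(count >= 0 && count <= 8);
    alignas(16) short lanes[8] = {};  // plain short storage, 16-byte aligned
    std::memcpy(lanes, src, count * sizeof(short));
    // Reading short storage through a __m128i* is the safe direction.
    return _mm_load_si128(reinterpret_cast<const __m128i*>(lanes));
}

// Write the 8 lanes of v to a short array of any alignment.
inline void store_epi16(short* dst, __m128i v) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}
```

The extra copy through lanes is usually optimised away, and the unaligned store means the destination needs no particular alignment.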

Related

_mm256_rem_epu64 intrinsic not found with GCC 10.3.0

I am trying to rewrite the following uint64_t 2x2 matrix multiplication with AVX-512 instructions, but GCC 10.3 does not find the _mm256_rem_epu64 intrinsic.
#include <cstdint>
#include <immintrin.h>
constexpr uint32_t LAST_9_DIGITS_DIVIDER = 1000000000;
void multiply(uint64_t f[2][2], uint64_t m[2][2])
{
uint64_t x = (f[0][0] * m[0][0] + f[0][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t y = (f[0][0] * m[0][1] + f[0][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
uint64_t z = (f[1][0] * m[0][0] + f[1][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t w = (f[1][0] * m[0][1] + f[1][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
f[0][0] = x;
f[0][1] = y;
f[1][0] = z;
f[1][1] = w;
}
void multiply_simd(uint64_t f[2][2], uint64_t m[2][2])
{
__m256i v1 = _mm256_set_epi64x(f[0][0], f[0][0], f[1][0], f[1][0]);
__m256i v2 = _mm256_set_epi64x(m[0][0], m[0][1], m[0][0], m[0][1]);
__m256i v3 = _mm256_mullo_epi64(v1, v2);
__m256i v4 = _mm256_set_epi64x(f[0][1], f[0][1], f[1][1], f[1][1]);
__m256i v5 = _mm256_set_epi64x(m[1][0], m[1][1], m[1][0], m[1][1]);
__m256i v6 = _mm256_mullo_epi64(v4, v5);
__m256i v7 = _mm256_add_epi64(v3, v6);
__m256i div = _mm256_set1_epi64x(LAST_9_DIGITS_DIVIDER);
__m256i v8 = _mm256_rem_epu64(v7, div);
_mm256_store_epi64(f, v8);
}
Is it possible to somehow enable _mm256_rem_epu64 or, if not, is there some other way to calculate the remainder with SIMD instructions?

As Peter Cordes mentioned in the comments, _mm256_rem_epu64 is an SVML function. Most compilers don't support SVML; AFAIK really only ICC does, but clang can be configured to use it too.
The only other implementation of SVML I'm aware of is in one of my projects, SIMDe. In this case, since you're using GCC 10.3, the implementation of _mm256_rem_epu64 will use vector extensions, so the code from SIMDe is going to be basically the same as something like:
#include <immintrin.h>
#include <stdint.h>
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i
foo_mm256_rem_epu64(__m256i a, __m256i b) {
return (__m256i) (((u64x4) a) % ((u64x4) b));
}
In this case, both GCC and clang will scalarize the operation (see Compiler Explorer), so performance is going to be pretty bad, especially considering how slow the div instruction is.
That said, since you're using a compile-time constant, the compiler should be able to replace the division with a multiplication and a shift, so performance will be better, but we can squeeze out some more by using libdivide.
Libdivide usually computes the magic value at runtime, but the libdivide_u64_t structure is very simple and we can just skip the libdivide_u64_gen step and provide the struct at compile time:
__m256i div_by_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return libdivide_u64_do_vec256(a, &d);
}
Now, if you can use AVX-512VL + AVX-512DQ there is a 64-bit multiplication function (_mm256_mullo_epi64). If you can use that it's probably the right way to go:
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return
_mm256_sub_epi64(
a,
_mm256_mullo_epi64(
libdivide_u64_do_vec256(a, &d),
_mm256_set1_epi64x(1000000000)
)
);
}
(or on Compiler Explorer, with LLVM-MCA)
If you don't have AVX-512DQ+VL, you'll probably want to fall back on vector extensions again:
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
u64x4 one_billion = { 1000000000, 1000000000, 1000000000, 1000000000 };
return (__m256i) (
(
(u64x4) a) -
(((u64x4) libdivide_u64_do_vec256(a, &d)) * one_billion
)
);
}
(on Compiler Explorer)
All this is untested, but assuming I haven't made any stupid mistakes it should be relatively snappy.
If you really want to get rid of the libdivide dependency you could perform those operations yourself, but I don't really see any good reason not to use libdivide so I'll leave that as an exercise for someone else.
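For reference, here is a scalar sketch of what "performing those operations yourself" could look like for this one divisor. It assumes the two constants above decode as libdivide's "add" variant (93 = 0x40 add marker + a shift of 29), and it uses unsigned __int128, a GCC/clang extension, for the high multiply. The function names are invented for illustration. A SIMD version would need a vector 64x64 high multiply in place of mulhi_u64.

```cpp
#include <cassert>
#include <cstdint>

// High 64 bits of a 64x64-bit product (unsigned __int128 is a GCC/clang extension).
inline std::uint64_t mulhi_u64(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}

// n / 1000000000 using a multiply and shifts instead of a div instruction.
inline std::uint64_t div_1e9(std::uint64_t n) {
    const std::uint64_t magic = UINT64_C(1360296554856532783);
    const std::uint64_t q = mulhi_u64(magic, n);  // q <= n because magic < 2^64
    const std::uint64_t t = ((n - q) >> 1) + q;   // (n + q) / 2 without overflow
    return t >> 29;
}

// n % 1000000000, computed as n - (n / d) * d.
inline std::uint64_t rem_1e9(std::uint64_t n) {
    return n - div_1e9(n) * UINT64_C(1000000000);
}

// Debug-build spot check against the compiler's own division.
inline void check_1e9(std::uint64_t n) {
    assert(div_1e9(n) == n / UINT64_C(1000000000));
    assert(rem_1e9(n) == n % UINT64_C(1000000000));
}
```

Calling check_1e9 on a few edge values (0, 999999999, 1000000000, UINT64_MAX) is a cheap way to gain confidence in the constants before porting the same steps to vectors.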

Vectorizing sparse matrix vector product with Compressed Sparse Row SegFault [duplicate]

I have the following function:
template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
__m128d a;
__m128d b;
__m128d c;
double A2[2], B2[2], C[2];
const double * A2ptr, * B2ptr;
A2ptr = &A2[0];
B2ptr = &B2[0];
a = _mm_load_pd(A);
for(int i = 0; i < size; i+=2)
{
std::cout << "In SSE_vectormult: i is: " << i << '\n';
A2[0] = A[i];
B2[0] = B[i];
A2[1] = A[i+1];
B2[1] = B[i+1];
std::cout << "Values from A and B written to A2 and B2\n";
a = _mm_load_pd(A2ptr);
b = _mm_load_pd(B2ptr);
std::cout << "Values converted to a and b\n";
c = _mm_mul_pd(a,b);
_mm_store_pd(C, c);
A[i] = C[0];
A[i+1] = C[1];
};
// const int mask = 0xf1;
// __m128d res = _mm_dp_pd(a,b,mask);
// r1 = _mm_mul_pd(a, b);
// r2 = _mm_hadd_pd(r1, r1);
// c = _mm_hadd_pd(r2, r2);
// c = _mm_scale_pd(a, b);
// _mm_store_pd(A, c);
}
When I call it on Linux, everything is fine, but when I call it on Windows, my program crashes with "program is not working anymore". What am I doing wrong, and how can I track down the error?

Your data is not guaranteed to be 16 byte aligned as required by SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for static or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
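A minimal sketch of such a macro, assuming only the three spellings shown above. The helpers is_aligned_16 and mul2 are hypothetical names added for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Pick the alignment spelling the current compiler understands.
#if defined(_MSC_VER)
#  define ALIGN(x) __declspec(align(x))
#elif defined(__GNUC__)  // also defined by clang and ICC
#  define ALIGN(x) __attribute__((aligned(x)))
#else
#  define ALIGN(x) alignas(x)  // C++11 fallback
#endif

inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Multiply two pairs of doubles. All three pointers must be 16-byte
// aligned, because _mm_load_pd and _mm_store_pd require it.
inline void mul2(const double* a, const double* b, double* out) {
    assert(is_aligned_16(a) && is_aligned_16(b) && is_aligned_16(out));
    _mm_store_pd(out, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
}
```

A declaration then reads ALIGN(16) double A2[2]; on every supported compiler, and the assert catches misaligned callers in debug builds instead of crashing inside the load.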

Let me try to answer why your code works on Linux and not on Windows. Code compiled in 64-bit mode has the stack aligned to 16 bytes. However, code compiled in 32-bit mode is only 4-byte aligned on Windows and is not guaranteed to be 16-byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems. However, MSVC defaults to 32-bit mode even on 64-bit systems. So I'm going to guess that you did not compile your code in 64-bit mode on Windows; since _mm_load_pd and _mm_store_pd both need 16-byte aligned addresses, the code crashes.
You have at least three different solutions to get your code working on Windows as well:
1. Compile your code in 64-bit mode.
2. Use unaligned loads and stores (e.g. _mm_storeu_pd).
3. Align the data yourself as Paul R suggested.
The best solution is the third solution since then your code will work on 32 bit systems and on older systems where unaligned loads/stores are much slower.

If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function _mm_load_pd is declared as:
__m128d _mm_load_pd (double *p);
So, in your code A should be of type double*, but A is of type T*, where T is a template parameter. Make sure that you are calling SSE_vectormult with the right template arguments, or just remove the template and use double instead.

Setting a buffer of char* with intermediate casting to int*

I could not fully understand the consequences of what I read here: Casting an int pointer to a char ptr and vice versa
In short, would this work?
set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffff;
if ((uintmax_t)buffer % 4) {//misaligned
for (int i = 0; i < 4; i++) {
buffer[i] = 0xff;
}
} else {//4-byte alignment
*((uint32_t*) buffer) = MASK;
}
}
Edit
There was a long discussion (in the comments, which mysteriously got deleted) about what type the pointer should be cast to in order to check the alignment. The subject is now addressed here.

This conversion is safe if you are filling all 4 bytes with the same value. If the byte order matters, it is not safe, because when you use an integer to fill 4 bytes at a time, the order in which the bytes are written depends on the endianness.

No, it won't work in every case. Aside from endianness, which may or may not be an issue, you assume that the alignment of uint32_t is 4. But this quantity is implementation-defined (C11 Draft N1570 Section 6.2.8). You can use the _Alignof operator to get the alignment in a portable way.
Second, the effective type (ibid. Sec. 6.5) of the location pointed to by buffer may not be compatible to uint32_t (e.g. if buffer points to an unsigned char array). In that case you break strict aliasing rules once you try reading through the array itself or through a pointer of different type.
Assuming that the pointer actually points to an array of unsigned char, the following code will work
typedef union { unsigned char chr[sizeof(uint32_t)]; uint32_t u32; } conv_t;
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffffU;
if ((uintptr_t)buffer % _Alignof(uint32_t)) {// misaligned
for (size_t i = 0; i < sizeof(uint32_t); i++) {
buffer[i] = 0xffU;
}
} else { // correct alignment
conv_t *cnv = (conv_t *) buffer;
cnv->u32 = MASK;
}
}
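An alternative sketch that sidesteps both the alignment check and the effective-type question is to copy the value in with memcpy, which is defined for any alignment and any destination type. This is a swapped-in technique, not the code from the answer above, and it assumes an 8-bit char so that uint32_t covers exactly 4 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Set the first 4 bytes of buffer to 0xff without casting the pointer.
inline void set4Bytes(unsigned char* buffer) {
    assert(buffer != nullptr);
    const std::uint32_t MASK = 0xffffffffu;
    // memcpy has no alignment or aliasing requirements; compilers
    // typically turn a fixed 4-byte copy into a single store.
    std::memcpy(buffer, &MASK, sizeof MASK);
}
```

With a value whose bytes differ, the result would still depend on endianness, exactly as with the cast-based versions.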

This code might be of help to you. It shows a 32-bit number being built by assigning its contents a byte at a time, forcing misalignment. It compiles and works on my machine.
#include<stdint.h>
#include<stdio.h>
#include<inttypes.h>
#include<stdlib.h>
int main () {
uint32_t *data = (uint32_t*)malloc(sizeof(uint32_t)*2);
char *buf = (char*)data;
uintptr_t addr = (uintptr_t)buf;
int i,j;
i = !(addr%4) ? 1 : 0;
uint32_t x = (1<<6)-1;
for( j=0;j<4;j++ ) buf[i+j] = ((char*)&x)[j];
printf("%" PRIu32 "\n",*((uint32_t*) (addr+i)) );
}
As mentioned by @Learner, endianness must be obeyed. The code above is not portable and would break on a big endian machine.
Note that my compiler throws the error "cast from ‘char*’ to ‘unsigned int’ loses precision [-fpermissive]" when trying to cast a char* to an unsigned int, as done in the original post. This post explains that uintptr_t should be used instead.

In addition to the endian issue, which has already been mentioned here:
CHAR_BIT - the number of bits per char - should also be considered.
It is 8 on most platforms, where for (int i=0; i<4; i++) should work fine.
A safer way of doing it would be for (int i=0; i<sizeof(uint32_t); i++).
Alternatively, you can include <limits.h> and use for (int i=0; i<32/CHAR_BIT; i++).

Use reinterpret_cast<>() if you want to ensure the underlying data does not "change shape".
As Learner has mentioned, when you store data in machine memory endianness becomes a factor. If you know how the data is stored correctly in memory (correct endianness) and you are specifically testing its layout as an alternate representation, then you would want to use reinterpret_cast<>() to test that memory, as a specific type, without modifying the original storage.
Below, I've modified your example to use reinterpret_cast<>():
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffff;
if (reinterpret_cast<uintptr_t>(buffer) % 4) {//misaligned: test the address, not the pointed-to value
for (int i = 0; i < 4; i++) {
buffer[i] = 0xff;
}
} else {//4-byte alignment
*reinterpret_cast<unsigned int *>(buffer) = MASK;
}
}
It should also be noted that your function appears to set the buffer (32 bits of contiguous memory) to 0xFFFFFFFF regardless of which branch it takes.

Your code will work on any architecture that is 32-bit or wider. There is no issue with byte ordering, since all your source bytes are 0xFF.
On x86 or x64 machines, the extra work needed to handle a possibly unaligned access to RAM is done by the CPU and is transparent to the programmer (since the Pentium II), at some performance cost on each access. So, if you are just setting the first four bytes of a buffer a few times, you can simplify your function:
void set4Bytes(unsigned char* buffer) {
const uint32_t MASK = 0xffffffff;
*((uint32_t *)buffer) = MASK;
}
Some readings:
A Linux kernel doc about UNALIGNED MEMORY ACCESSES
Intel Architecture Optimization Manual, section 3.4
Windows Data Alignment on IPF, x86, and x64
A Practical 'Aligned vs. unaligned memory access', by Alexander Sandler

_mm_load_si128 - Passed memory address is not 16-byte-aligned?

I'm having trouble understanding an SSE2 intrinsic. According to the Microsoft documentation, _mm_load_si128 requires a 16-byte-aligned address as its parameter. In the code I am trying to understand, this seems not to be the case:
void f(uchar* buf0, const int n)
{
ushort* buf = (ushort*)alignPtr(buf0, 16);
for(int i = 0; i < n; i += 16)
{
__m128i v0 = _mm_load_si128((__m128i*)(buf+i)); // 16-byte-aligned, since buf is 16-byte-aligned and i is divisible by 16.
__m128i v1 = _mm_load_si128((__m128i*)(buf+i+8)); // If buf+i is 16-byte-aligned, then buf+i+8 cannot be 16-byte-aligned.
}
}
I reduced the code to the relevant part and renamed some variables. The original code is from the OpenCV implementation of Konolige's block matching algorithm (stereobm.cpp, especially line 313). My question is: why is the code correct, and what is written into v1?
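One thing worth checking when reading code like this is that pointer arithmetic counts elements, not bytes. The sketch below assumes ushort is a 16-bit type (std::uint16_t stands in for it here). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte distance covered by advancing a 16-bit element pointer by index.
inline std::size_t byte_offset(std::size_t index) {
    return index * sizeof(std::uint16_t);
}

// Given a 16-byte-aligned block of 16-bit elements, return the address
// 8 elements further on, i.e. 16 bytes later.
inline const std::uint16_t* second_half(const std::uint16_t* block) {
    assert(reinterpret_cast<std::uintptr_t>(block) % 16 == 0);
    return block + 8;
}
```

Under that assumption, buf+i+8 is 16 bytes past buf+i and therefore keeps the 16-byte alignment, so v1 would hold the next eight 16-bit values after the ones in v0.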

OS portable memcpy optimized for SSE2 & SSE3

If I were to write an OS-portable memcpy optimized for SSE2/SSE3, what would it look like? I want to support both the GCC and ICC compilers. The reason I ask is that memcpy is written in assembler code in glibc and not optimized for SSE2/SSE3, and other generic memcpy implementations may not fully take advantage of the system's capabilities with data alignment, size, etc.
Here is my current memcpy, which takes data alignment into consideration and is optimized for SSE2 (I think) but not for SSE3:
#ifdef __SSE2__
// SSE2 optimized memcpy()
void *CMemUtils::MemCpy(void *restrict b, const void *restrict a, size_t n)
{
char *s1 = b;
const char *s2 = a;
for(; 0<n; --n)*s1++ = *s2++;
return b;
}
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
// Use system memcpy()
return memcpy(dest, source, count);
#else
size_t blockIdx;
size_t blocks = count >> 3;
size_t bytesLeft = count - (blocks << 3);
// Copy 64-bit blocks first
_UINT64 *sourcePtr8 = (_UINT64*)source;
_UINT64 *destPtr8 = (_UINT64*)dest;
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];
if (!bytesLeft) return dest;
blocks = bytesLeft >> 2;
bytesLeft = bytesLeft - (blocks << 2);
// Copy 32-bit blocks
_UINT32 *sourcePtr4 = (_UINT32*)&sourcePtr8[blockIdx];
_UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];
if (!bytesLeft) return dest;
blocks = bytesLeft >> 1;
bytesLeft = bytesLeft - (blocks << 1);
// Copy 16-bit blocks
_UINT16 *sourcePtr2 = (_UINT16*)&sourcePtr4[blockIdx];
_UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];
if (!bytesLeft) return dest;
// Copy byte blocks
_UINT8 *sourcePtr1 = (_UINT8*)&sourcePtr2[blockIdx];
_UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
return dest;
#endif
}
#endif
Not all memcpy implementations are thread-safe, which is just another reason to make our own version. All this leads me to conclude I should at least try to make a thread-safe OS portable memcpy that is optimized for SSE2/SSE3 where available.
I've also read that GCC supports aggressive unrolling with the -funroll-loops compiler option, could this improve performance with SSE2 and/or SSE3 if there are no significant cache misses?
Is there a performance gain of making different memcpy versions for 32-and 64-bit architectures?
Is there any performance gain of pre-aligning internal memory buffers before copying?
How do I use #pragma loop to control how loop code is treated by the SSE2/SSE3 auto-parallelizer? Supposedly one can use #pragma loop where contiguous data regions are moved by a for() loop.
Do I need to use the GCC compiler option -fno-builtin-memcpy even with -O3 to prevent the compiler from inlining the GCC memcpy when adding my own memcpy? Or is just overriding memcpy in my code enough?
Update:
After some tests it seems to me that an SSE2 optimized memcpy() is not that much faster for it to be worth the effort. I've asked a question in that regard on the Intel C/C++ Compiler forums.
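For comparison with the byte loop in the __SSE2__ branch above, here is a rough sketch of what a copy loop built on SSE2 intrinsics might look like. The name memcpy_sse2_sketch is hypothetical. It uses unaligned loads and stores throughout, makes no attempt at prefetching or non-temporal stores, and should not be expected to beat a tuned library memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics

// Copy n bytes in 16-byte blocks, then hand the tail to memcpy.
// Like memcpy, the two regions must not overlap.
inline void* memcpy_sse2_sketch(void* dest, const void* src, std::size_t n) {
    assert(dest != nullptr && src != nullptr);
    unsigned char* d = static_cast<unsigned char*>(dest);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (; n >= 16; n -= 16, d += 16, s += 16) {
        // Unaligned load/store: valid for any address.
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(d), v);
    }
    std::memcpy(d, s, n);  // remaining 0..15 bytes
    return dest;
}
```

Because it touches no shared state, the function is as thread-safe as the buffers passed to it.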
