What is the equivalent of v4sf and __attribute__ in Visual Studio C++? - c++

typedef float v4sf __attribute__ ((mode(V4SF)));
This is in GCC. Anyone knows the equivalence syntax?
VS 2010 will show __attribute__ has no storage class of this type, and mode is not defined.
I searched on the Internet and it said
Equivalent to __attribute__( aligned( size ) ) in GCC
It is helpful
for former unix developers or people writing code that works on
multiple platforms that in GCC you achieve the same results using
attribute( aligned( ... ) )
See here for more information:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Type-Attributes.html#Type-Attributes
The full GCC code is here: http://pastebin.com/bKkTTmH1

If you're looking for the alignment directive in VC++ it's __declspec(align(16)). (or whatever you want the alignment to be)
And example usage is this:
__declspec(align(16)) float x[] = {1.,2.,3.,4.};
http://msdn.microsoft.com/en-us/library/83ythb65.aspx
Note that both attribute (in GCC) and __declspec (in VC++) are compiler-specific extensions.
EDIT :
Now that I take a second look at the code, it's gonna take more work than just replacing the __attribute__ line with the VC++ equivalent to get it to compile in VC++.
VC++ doesn't have any if these macros/functions that you are using:
__builtin_ia32_xorps
__builtin_ia32_loadups
__builtin_ia32_mulps
__builtin_ia32_addps
__builtin_ia32_storeups
You're better off just replacing all of those with SSE intrinsics - which will work on both GCC and VC++.
Here's the code converted to intrinsics:
float *mv_mult(float mat[SIZE][SIZE], float vec[SIZE]) {
static float ret[SIZE];
float temp[4];
int i, j;
__m128 m, v, r;
for (i = 0; i < SIZE; i++) {
r = _mm_xor_ps(r, r);
for (j = 0; j < SIZE; j += 4) {
m = _mm_loadu_ps(&mat[i][j]);
v = _mm_loadu_ps(&vec[j]);
v = _mm_mul_ps(m, v);
r = _mm_add_ps(r, v);
}
_mm_storeu_ps(temp, r);
ret[i] = temp[0] + temp[1] + temp[2] + temp[3];
}
return ret;
}

V4SF and friends have to do with GCC "vector extensions":
http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Vector-Extensions.html#Vector%20Extensions
http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/X86-Built-in-Functions.html
I'm not sure how much - if any of this stuff - is supported in MSVS/MSVC. Here are a few links:
http://www.codeproject.com/KB/recipes/sseintro.aspx?msg=643444
http://msdn.microsoft.com/en-us/library/y0dh78ez%28v=vs.80%29.aspx
http://msdn.microsoft.com/en-us/library/01fth20w.aspx

Related

_mm256_rem_epu64 intrinsic not found with GCC 10.3.0

I try to re-write the following uint64_t 2x2 matrix multiplication with AVX-512 instructions, but GCC 10.3 does not found _mm256_rem_epu64 intrinsic.
#include <cstdint>
#include <immintrin.h>
constexpr uint32_t LAST_9_DIGITS_DIVIDER = 1000000000;
void multiply(uint64_t f[2][2], uint64_t m[2][2])
{
uint64_t x = (f[0][0] * m[0][0] + f[0][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t y = (f[0][0] * m[0][1] + f[0][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
uint64_t z = (f[1][0] * m[0][0] + f[1][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t w = (f[1][0] * m[0][1] + f[1][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
f[0][0] = x;
f[0][1] = y;
f[1][0] = z;
f[1][1] = w;
}
void multiply_simd(uint64_t f[2][2], uint64_t m[2][2])
{
__m256i v1 = _mm256_set_epi64x(f[0][0], f[0][0], f[1][0], f[1][0]);
__m256i v2 = _mm256_set_epi64x(m[0][0], m[0][1], m[0][0], m[0][1]);
__m256i v3 = _mm256_mullo_epi64(v1, v2);
__m256i v4 = _mm256_set_epi64x(f[0][1], f[0][1], f[1][1], f[1][1]);
__m256i v5 = _mm256_set_epi64x(m[1][0], m[1][1], m[1][0], m[1][1]);
__m256i v6 = _mm256_mullo_epi64(v4, v5);
__m256i v7 = _mm256_add_epi64(v3, v6);
__m256i div = _mm256_set1_epi64x(LAST_9_DIGITS_DIVIDER);
__m256i v8 = _mm256_rem_epu64(v7, div);
_mm256_store_epi64(f, v8);
}
Is it possible somehow to enable _mm256_rem_epu64 or if not, some other way to calculate the reminder with SIMD instructions?
As Peter Cordes mentioned in the comments, _mm256_rem_epu64 is an SVML function. Most compilers don't support SVML; AFAIK really only ICC does, but clang can be configured to use it too.
The only other implementation of SVML I'm aware of is in one of my projects, SIMDe. In this case, since you're using GCC 10.3, the implementation of _mm256_rem_epu64 will use vector extensions, so the code from SIMDe is going to be basically the same as something like:
#include <immintrin.h>
#include <stdint.h>
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i
foo_mm256_rem_epu64(__m256i a, __m256i b) {
return (__m256i) (((u64x4) a) % ((u64x4) b));
}
In this case, both GCC and clang will scalarize the operation (see Compiler Explorer), so performance is going to be pretty bad, especially considering how slow the div instruction is.
That said, since you're using a compile-time constant, the compiler should be able to replace the division with a multiplication and a shift, so performance will be better, but we can squeeze out some more by using libdivide.
Libdivide usually computes the the magic value at runtime, but the libdivide_u64_t structure is very simple and we can just skip the libdivide_u64_gen step and provide the struct at compile time:
__m256i div_by_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return libdivide_u64_do_vec256(a, &d);
}
Now, if you can use AVX-512VL + AVX-512DQ there is a 64-bit multiplication function (_mm256_mullo_epi64). If you can use that it's probably the right way to go:
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return
_mm256_sub_epi64(
a,
_mm256_mullo_epi64(
libdivide_u64_do_vec256(a, &d),
_mm256_set1_epi64x(1000000000)
)
);
}
(or on Compiler Explorer, with LLVM-MCA)
If you don't have AVX-512DQ+VL, you'll probably want to fall back on vector extensions again:
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
u64x4 one_billion = { 1000000000, 1000000000, 1000000000, 1000000000 };
return (__m256i) (
(
(u64x4) a) -
(((u64x4) libdivide_u64_do_vec256(a, &d)) * one_billion
)
);
}
(on Compiler Explorer)
All this is untested, but assuming I haven't made any stupid mistakes it should be relatively snappy.
If you really want to get rid of the libdivide dependency you could perform those operations yourself, but I don't really see any good reason not to use libdivide so I'll leave that as an exercise for someone else.

Gcc misoptimises sse function

I'm converting a project to compile with gcc from clang and I've ran into a issue with a function that uses sse functions:
void dodgy_function(
const short* lows,
const short* highs,
short* mins,
short* maxs,
int its
)
{
__m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
__m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };
for (int i = 0; i < its; ++i) {
reinterpret_cast<short*>(v00)[i] = lows[i];
reinterpret_cast<short*>(v10)[i] = highs[i];
}
reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];
__m128i v01[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i v11[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
__m128i min[2];
__m128i max[2];
min[0] = _mm_min_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));
min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1]));
max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));
reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);
reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}
Now with clang it gives it gives me the expected output but in gcc it prints all zeros: godbolt link
Playing around I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behavior?
As a workaround I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2
Any ideas?
The problem is that you're using short* to access the elements of a __m128i* object. That violates the strict-aliasing rule. It's only safe to go the other way, using __m128i* dereference or more normally _mm_load_si128( (const __m128i*)ptr ).
__m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
The only standard blessed way to do type punning is with memcpy:
memcpy(v00, lows, its * sizeof(short));
memcpy(v10, highs, its * sizeof(short));
memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));
https://godbolt.org/z/f63q7x
I prefer just using aligned memory of the correct type directly:
alignas(16) short v00[16];
alignas(16) short v10[16];
auto mv00 = reinterpret_cast<__m128i*>(v00);
auto mv10 = reinterpret_cast<__m128i*>(v10);
_mm_store_si128(mv00, _mm_setzero_si128());
_mm_store_si128(mv10, _mm_setzero_si128());
_mm_store_si128(mv00 + 1, _mm_setzero_si128());
_mm_store_si128(mv10 + 1, _mm_setzero_si128());
for (int i = 0; i < its; ++i) {
v00[i] = lows[i];
v10[i] = highs[i];
}
v00[its] = v00[its - 1];
v10[its] = v10[its - 1];
https://godbolt.org/z/bfanne
I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps since you can do it without type punning at all) but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that memcpy does since it's more or less the kosher way to go from straight line to vectorized code in x86.
As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre c++11 contexts. Even in that case though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory) because issues like this tend to pop up if you don't.

Enable fast-math in Clang on a per-function basis?

GCC provides a way to optimize a function/section of code selectively with
fast-math using attributes. Is there a way to enable the same in Clang with pragmas/attributes?
I understand Clang provides some pragmas
to specify floating point flags. However, none of these pragmas enable fast-math.
PS: A similar question was asked before but was not answered in the context of Clang.
UPD: I missed that you mentioned pragmas. The Option1 is fast math for one op as far as I understood. I am not sure about flushing subnormals, I hope it's not affected.
I didn't find the per function options, but I did find 2 pragmas that can help.
Let's say we want dot product.
Option 1.
float innerProductF32(const float* a, const float* b, std::size_t size) {
float res = 0.f;
for (std::size_t i = 0; i != size; ++i) {
#pragma float_control(precise, off)
res += a[i] * b[i];
}
return res;
}
Option2:
float innerProductF32(const float* a, const float* b, std::size_t size) {
float res = 0.f;
_Pragma("clang loop vectorize(enable) interleave(enable)")
for (std::size_t i = 0; i != size; ++i) {
res += a[i] * b[i];
}
return res;
}
The second one is less powerful, it does not generate fma instructions, but maybe it's not what you want.

Vectorizing sparse matrix vector product with Compressed Sparse Row SegFault [duplicate]

I have the following function:
template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
__m128d a;
__m128d b;
__m128d c;
double A2[2], B2[2], C[2];
const double * A2ptr, * B2ptr;
A2ptr = &A2[0];
B2ptr = &B2[0];
a = _mm_load_pd(A);
for(int i = 0; i < size; i+=2)
{
std::cout << "In SSE_vectormult: i is: " << i << '\n';
A2[0] = A[i];
B2[0] = B[i];
A2[1] = A[i+1];
B2[1] = B[i+1];
std::cout << "Values from A and B written to A2 and B2\n";
a = _mm_load_pd(A2ptr);
b = _mm_load_pd(B2ptr);
std::cout << "Values converted to a and b\n";
c = _mm_mul_pd(a,b);
_mm_store_pd(C, c);
A[i] = C[0];
A[i+1] = C[1];
};
// const int mask = 0xf1;
// __m128d res = _mm_dp_pd(a,b,mask);
// r1 = _mm_mul_pd(a, b);
// r2 = _mm_hadd_pd(r1, r1);
// c = _mm_hadd_pd(r2, r2);
// c = _mm_scale_pd(a, b);
// _mm_store_pd(A, c);
}
When I am calling it on Linux, everything is fine, but when I am calling it on a windows OS, my program crashes with "program is not working anymore". What am I doing wrong, and how can I determine my error?
Your data is not guaranteed to be 16 byte aligned as required by SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for static or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
Let me try and answer why your code works in Linux and not Windows. Code compiled in 64-bit mode has the stack aligned by 16 bytes. However, code compiled in 32-bit mode is only 4 byte aligned on windows and is not guaranteed to be 16 byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems. However MSVC defaults to 32-bit mode even on 64-bit systems. So I'm going to guess that you did not compile your code in 64-bit mode in windows and _mm_load_pd and _mm_store_pd both need 16 byte aligned addresses so the code crashes.
You have at least three different solutions to get your code working in Windows as well.
Compile your code in 64 bit mode.
Use unaligned loads and stores (e.g. _mm_storeu_pd)
Align the data yourself as Paul R suggested.
The best solution is the third solution since then your code will work on 32 bit systems and on older systems where unaligned loads/stores are much slower.
If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function __mm_load_pd is defined as:
__m128d _mm_load_pd (double *p);
So, in your code A should be of type double, but A is of tipe T that is a template param. You should be sure that you are calling your SSE_vectormult function with the rights template params or just remove the template and use the double type instead,

Function crashes when using _mm_load_pd

I have the following function:
template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
__m128d a;
__m128d b;
__m128d c;
double A2[2], B2[2], C[2];
const double * A2ptr, * B2ptr;
A2ptr = &A2[0];
B2ptr = &B2[0];
a = _mm_load_pd(A);
for(int i = 0; i < size; i+=2)
{
std::cout << "In SSE_vectormult: i is: " << i << '\n';
A2[0] = A[i];
B2[0] = B[i];
A2[1] = A[i+1];
B2[1] = B[i+1];
std::cout << "Values from A and B written to A2 and B2\n";
a = _mm_load_pd(A2ptr);
b = _mm_load_pd(B2ptr);
std::cout << "Values converted to a and b\n";
c = _mm_mul_pd(a,b);
_mm_store_pd(C, c);
A[i] = C[0];
A[i+1] = C[1];
};
// const int mask = 0xf1;
// __m128d res = _mm_dp_pd(a,b,mask);
// r1 = _mm_mul_pd(a, b);
// r2 = _mm_hadd_pd(r1, r1);
// c = _mm_hadd_pd(r2, r2);
// c = _mm_scale_pd(a, b);
// _mm_store_pd(A, c);
}
When I am calling it on Linux, everything is fine, but when I am calling it on a windows OS, my program crashes with "program is not working anymore". What am I doing wrong, and how can I determine my error?
Your data is not guaranteed to be 16 byte aligned as required by SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for static or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
Let me try and answer why your code works in Linux and not Windows. Code compiled in 64-bit mode has the stack aligned by 16 bytes. However, code compiled in 32-bit mode is only 4 byte aligned on windows and is not guaranteed to be 16 byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems. However MSVC defaults to 32-bit mode even on 64-bit systems. So I'm going to guess that you did not compile your code in 64-bit mode in windows and _mm_load_pd and _mm_store_pd both need 16 byte aligned addresses so the code crashes.
You have at least three different solutions to get your code working in Windows as well.
Compile your code in 64 bit mode.
Use unaligned loads and stores (e.g. _mm_storeu_pd)
Align the data yourself as Paul R suggested.
The best solution is the third solution since then your code will work on 32 bit systems and on older systems where unaligned loads/stores are much slower.
If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function __mm_load_pd is defined as:
__m128d _mm_load_pd (double *p);
So, in your code A should be of type double, but A is of tipe T that is a template param. You should be sure that you are calling your SSE_vectormult function with the rights template params or just remove the template and use the double type instead,