GCC provides a way to selectively optimize a function or section of code with fast-math, using attributes. Is there a way to enable the same in Clang with pragmas or attributes?
I understand Clang provides some pragmas to specify floating-point flags. However, none of these pragmas enables fast-math.
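For reference, the GCC mechanism I have in mind is the per-function optimize attribute; a rough illustrative sketch (not my actual code, the function name is made up):

// GCC-only approach: compile just this function as if -ffast-math were passed
// on the command line.
__attribute__((optimize("fast-math")))
float dotGcc(const float* a, const float* b, int n) {
    float s = 0.f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}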
PS: A similar question was asked before but was not answered in the context of Clang.
UPD: I missed that you had already mentioned pragmas. Option 1 is essentially fast-math for a single operation, as far as I understand. I am not sure about flushing subnormals; I hope that is not affected.
I didn't find per-function options, but I did find two pragmas that can help.
Let's say we want a dot product.
Option 1:
#include <cstddef>

float innerProductF32(const float* a, const float* b, std::size_t size) {
float res = 0.f;
for (std::size_t i = 0; i != size; ++i) {
#pragma float_control(precise, off)
res += a[i] * b[i];
}
return res;
}
Option 2:
float innerProductF32(const float* a, const float* b, std::size_t size) {
float res = 0.f;
_Pragma("clang loop vectorize(enable) interleave(enable)")
for (std::size_t i = 0; i != size; ++i) {
res += a[i] * b[i];
}
return res;
}
The second one is less powerful; it does not generate FMA instructions, but maybe that is fine for your case.
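Also worth mentioning (untested sketch, function name is just for illustration): newer Clang versions have #pragma clang fp, which can enable reassociation and FMA contraction for a block, and that covers most of what fast-math buys for a reduction like this:

#include <cstddef>

// Untested sketch: allow reassociation and FMA contraction for this block only.
float innerProductF32FastBlock(const float* a, const float* b, std::size_t size) {
#pragma clang fp reassociate(on)
#pragma clang fp contract(fast)
    float res = 0.f;
    for (std::size_t i = 0; i != size; ++i) {
        res += a[i] * b[i];
    }
    return res;
}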
I am trying to rewrite the following uint64_t 2x2 matrix multiplication with AVX-512 instructions, but GCC 10.3 does not find the _mm256_rem_epu64 intrinsic.
#include <cstdint>
#include <immintrin.h>
constexpr uint32_t LAST_9_DIGITS_DIVIDER = 1000000000;
void multiply(uint64_t f[2][2], uint64_t m[2][2])
{
uint64_t x = (f[0][0] * m[0][0] + f[0][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t y = (f[0][0] * m[0][1] + f[0][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
uint64_t z = (f[1][0] * m[0][0] + f[1][1] * m[1][0]) % LAST_9_DIGITS_DIVIDER;
uint64_t w = (f[1][0] * m[0][1] + f[1][1] * m[1][1]) % LAST_9_DIGITS_DIVIDER;
f[0][0] = x;
f[0][1] = y;
f[1][0] = z;
f[1][1] = w;
}
void multiply_simd(uint64_t f[2][2], uint64_t m[2][2])
{
__m256i v1 = _mm256_set_epi64x(f[0][0], f[0][0], f[1][0], f[1][0]);
__m256i v2 = _mm256_set_epi64x(m[0][0], m[0][1], m[0][0], m[0][1]);
__m256i v3 = _mm256_mullo_epi64(v1, v2);
__m256i v4 = _mm256_set_epi64x(f[0][1], f[0][1], f[1][1], f[1][1]);
__m256i v5 = _mm256_set_epi64x(m[1][0], m[1][1], m[1][0], m[1][1]);
__m256i v6 = _mm256_mullo_epi64(v4, v5);
__m256i v7 = _mm256_add_epi64(v3, v6);
__m256i div = _mm256_set1_epi64x(LAST_9_DIGITS_DIVIDER);
__m256i v8 = _mm256_rem_epu64(v7, div);
_mm256_store_epi64(f, v8);
}
Is it possible somehow to enable _mm256_rem_epu64, or if not, is there some other way to calculate the remainder with SIMD instructions?
As Peter Cordes mentioned in the comments, _mm256_rem_epu64 is an SVML function. Most compilers don't support SVML; AFAIK really only ICC does, but clang can be configured to use it too.
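(If I remember correctly, the clang configuration meant here is the -fveclib flag, which only affects how the auto-vectorizer lowers libm calls; it does not make SVML intrinsics like _mm256_rem_epu64 available. Something like:

clang -O3 -mavx2 -fveclib=SVML foo.c
)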
The only other implementation of SVML I'm aware of is in one of my projects, SIMDe. In this case, since you're using GCC 10.3, the implementation of _mm256_rem_epu64 will use vector extensions, so the code from SIMDe is going to be basically the same as something like:
#include <immintrin.h>
#include <stdint.h>
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i
foo_mm256_rem_epu64(__m256i a, __m256i b) {
return (__m256i) (((u64x4) a) % ((u64x4) b));
}
In this case, both GCC and clang will scalarize the operation (see Compiler Explorer), so performance is going to be pretty bad, especially considering how slow the div instruction is.
That said, since you're using a compile-time constant, the compiler should be able to replace the division with a multiplication and a shift, so performance will be better, but we can squeeze out some more by using libdivide.
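For illustration, that constant-divisor version might look like this (my guess at the shape of it, the function name is made up); each scalarized lane should then avoid an actual div instruction:

#include <immintrin.h>
#include <stdint.h>

typedef uint64_t u64x4 __attribute__((__vector_size__(32)));

/* Hypothetical sketch: same idea as foo_mm256_rem_epu64 above, but with the
   divisor visible as a compile-time constant so each scalarized lane can be
   strength-reduced to a multiply and shift. */
__m256i foo_mm256_rem_1e9(__m256i a) {
  const u64x4 divisor = { 1000000000, 1000000000, 1000000000, 1000000000 };
  return (__m256i) (((u64x4) a) % divisor);
}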
Libdivide usually computes the magic value at runtime, but the libdivide_u64_t structure is very simple, so we can just skip the libdivide_u64_gen step and provide the struct at compile time:
__m256i div_by_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return libdivide_u64_do_vec256(a, &d);
}
Now, if you can use AVX-512VL + AVX-512DQ, there is a 64-bit multiplication function (_mm256_mullo_epi64); if it's available to you, it's probably the right way to go:
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
return
_mm256_sub_epi64(
a,
_mm256_mullo_epi64(
libdivide_u64_do_vec256(a, &d),
_mm256_set1_epi64x(1000000000)
)
);
}
(or on Compiler Explorer, with LLVM-MCA)
If you don't have AVX-512DQ+VL, you'll probably want to fall back on vector extensions again:
typedef uint64_t u64x4 __attribute__((__vector_size__(32)));
__m256i rem_1000000000(__m256i a) {
static const struct libdivide_u64_t d = {
UINT64_C(1360296554856532783),
UINT8_C(93)
};
u64x4 one_billion = { 1000000000, 1000000000, 1000000000, 1000000000 };
return (__m256i) (((u64x4) a) -
(((u64x4) libdivide_u64_do_vec256(a, &d)) * one_billion));
}
(on Compiler Explorer)
All this is untested, but assuming I haven't made any stupid mistakes it should be relatively snappy.
If you really want to get rid of the libdivide dependency you could perform those operations yourself, but I don't really see any good reason not to use libdivide so I'll leave that as an exercise for someone else.
New here, hoping you can help. I am attempting to explicitly vectorize both of the for loops in the member function below, as they are the main runtime bottleneck and auto-vectorization does not work due to dependencies. For the life of me, however, I cannot find a "safe" clause/reduction.
PSvector& RTMap::Apply(PSvector& X) const
{
PSvector Y(0);
double k=0;
// linear map
#pragma omp simd
for(RMap::const_itor r = rterms.begin(); r != rterms.end(); r++)
{
Y[r->i] += r->val * X[r->j];
}
// non-linear map
#pragma omp simd
for(const_itor t = tterms.begin(); t != tterms.end(); t++)
{
Y[t->i] += t->val * X[t->j] * X[t->k];
}
Y.location() = X.location();
Y.type() = X.type();
Y.id() = X.id();
Y.sd() = X.sd();
return X = Y;
}
Note that the #pragmas as written don't work because of race conditions. Is there a way of declaring a reduction that could work? I tried something like:
#pragma omp declare reduction(+:PSvector:(*omp_out.getvec())+=(*omp_in.getvec()))
which compiles (icpc) but seems to produce nonsense.
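For completeness, my guess is that the declaration also needs an initializer clause so each private copy starts from a zeroed PSvector, something like this (untested guess):

// Untested guess: same combiner as above, plus an initializer so omp_priv
// starts from a zeroed PSvector rather than a default-constructed one.
#pragma omp declare reduction(+ : PSvector : (*omp_out.getvec()) += (*omp_in.getvec())) \
    initializer(omp_priv = PSvector(0))

// ...and then on each loop:
// #pragma omp simd reduction(+ : Y)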
/cheers
If we have a division by one in an inner loop, is it smart to add special case treatment to eliminate the division:
BEFORE:
int collapseFactorDepth...
for (int i = 0; i < numPixels; i++)
{
pDataTarget[i] += pPixelData[i] / collapseFactorDepth;
}
AFTER:
if (collapseFactorDepth != 1)
{
for (int i = 0; i < numPixels; i++)
{
pDataTarget[i] += pPixelData[i] / collapseFactorDepth;
}
}
else
{
for (int i = 0; i < numPixels; i++)
{
pDataTarget[i] += pPixelData[i];
}
}
Can the compiler reason this by itself? Do modern CPUs contain any means to optimize this?
I am particularly interested if you consider the additional code beneficial in contrast to the performance gain (is there any?).
Background:
numPixels is big
90% of the time, collapseFactorDepth is 1
Modern CPUs: Intel x86/amd64 architecture
Please don't consider wider issues; the memory overhead of loading the data is already optimized.
Let's not worry about the fact that we should probably do this as a double multiplication anyway.
As a general rule, the answer is no. Write clear code first and optimize it later, when the profiler tells you that you have a problem.
The only way to answer whether this particular optimization will help in this particular hotspot is: "measure it and see".
Unless collapseFactorDepth is almost always 1, or numPixels is very large (at least thousands and possibly more), I would not expect the optimization to help (branches are expensive).
You are much more likely to benefit from using SSE or similar SIMD instructions.
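If it comes to that, a minimal SSE2 sketch of just the collapseFactorDepth == 1 path might look like the following (assuming 32-bit int pixels and numPixels a multiple of 4; the function name is made up, the other names are from the question):

#include <emmintrin.h>  // SSE2

// Minimal sketch: add pPixelData into pDataTarget four ints at a time.
void addPixelsSse2(int* pDataTarget, const int* pPixelData, int numPixels)
{
    for (int i = 0; i < numPixels; i += 4)
    {
        __m128i t = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pDataTarget + i));
        __m128i p = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pPixelData + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(pDataTarget + i), _mm_add_epi32(t, p));
    }
}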
Follow Martin Bonner's advice. Optimize when you need to.
When you need to:
int identity(int pixel)
{
return pixel;
}
template<int collapseFactorDepth>
int div(int pixel)
{
return pixel / collapseFactorDepth;
}
struct Div
{
int collapseFactorDepth_;
Div(int collapseFactorDepth)
: collapseFactorDepth_(collapseFactorDepth) {}
int operator()(int pixel)
{
return pixel / collapseFactorDepth_;
}
};
// numPixels and pPixelData are assumed to come from the surrounding context, as in the question.
template<typename Op>
void fn(int* pDataTarget, Op op)
{
for (int i = 0; i < numPixels; i++)
{
pDataTarget[i] += op(pPixelData[i]);
}
}
void fn(int* pDataTarget)
{
fn(pDataTarget, identity);
}
template<int collapseFactorDepth>
void fnComp(int* pDataTarget)
{
fn(pDataTarget, div<collapseFactorDepth>);
}
void fn(int* pDataTarget, int collapseFactorDepth)
{
fn(pDataTarget, Div(collapseFactorDepth));
}
This gives you a convenient default behaviour, a compile-time divide when possible (which might be faster than a divide by a runtime int), and a way (passing a Div) to specify the divisor at runtime.
Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent
vectorization. Different iterations of the loop interfere with each
other such that vectorizing the loop would produce wrong answers, and
the auto-vectorizer cannot prove to itself that there are no such data
dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
These source and dest pointers do not ever overlap nor alias the same memory and I'm trying to provide the compiler with that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is an 8bpp->32bpp rendering/conversion function, operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;
for (int i = 0; i < count; ++i) {
dest[(i*2*pitch)+0] = (source[(i*8)+0]);
dest[(i*2*pitch)+1] = (source[(i*8)+1]);
dest[(i*2*pitch)+2] = (source[(i*8)+2]);
dest[(i*2*pitch)+3] = (source[(i*8)+3]);
dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here, because the loop still won't be auto-vectorized without the function call, even in its simplest form.
EDIT:
I've simplified the loop to the most trivial form I can:
for (int i = 0; i < 200; ++i) {
dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;
source = (const byte * __restrict ) new byte[1600];
dest = (byte * __restrict ) new byte[1600];
for (int i = 0; i < 200; ++i) {
dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
Let's just say there are more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
byte *source = new byte[1000];
DWORD *dest = new DWORD[1000];
for (int i = 0; i < 200; ++i) {
dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
for (int i = 0; i < 200; ++i) {
dest[i*2*4096] = source[i*8];
}
for (int i = 0; i < 200; ++i) {
dest[i*8192] = source[i*8];
}
for (int i = 0; i < 200; ++i) {
dest[i] = source[i];
}
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case; I wouldn't have expected any modern compiler to be able to do this. And even if it could, you'd need to tell it to target SSE4.1.
So the only thing that's out of the ordinary is why the initial loop reports a loop-carried dependency. Well, that's a tough call... I would go so far as to say that the compiler just isn't issuing the correct reason (when it really should be non-contiguous access).
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4, which makes it contiguous enough to vectorize, but it's a long shot to expect the compiler to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
dest[(i*2*pitch)+0] = (source[(i*8)+0]);
dest[(i*2*pitch)+1] = (source[(i*8)+1]);
dest[(i*2*pitch)+2] = (source[(i*8)+2]);
dest[(i*2*pitch)+3] = (source[(i*8)+3]);
dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
__m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8)); // load 8 source bytes into the low half
__m128i s1 = _mm_srli_si128(s0, 4);                     // bring bytes 4..7 down to the low 32 bits
*(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
*(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.
From the MSDN documentation, here is a case in which reason 1203 would be reported:
void code_1203(int *A)
{
// Code 1203 is emitted when non-vectorizable memory references
// are present in the loop body. Vectorization of some non-contiguous
// memory access is supported - for example, the gather/scatter pattern.
for (int i=0; i<1000; ++i)
{
A[i] += A[0] + 1; // constant memory access not vectorized
A[i] += A[i*2+2] + 2; // non-contiguous memory access not vectorized
}
}
It really could be the index computations that mess with the auto-vectorizer. Funny that the reported code is not 1203, though.
MSDN Parallelizer and Vectorizer Messages
typedef float v4sf __attribute__ ((mode(V4SF)));
This is GCC syntax. Does anyone know the equivalent syntax?
VS 2010 reports that __attribute__ has no storage class of this type, and that mode is not defined.
I searched on the Internet and it said:
"Equivalent to __attribute__( aligned( size ) ) in GCC. It is helpful for former unix developers or people writing code that works on multiple platforms that in GCC you achieve the same results using __attribute__( aligned( ... ) )."
See here for more information:
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Type-Attributes.html#Type-Attributes
The full GCC code is here: http://pastebin.com/bKkTTmH1
If you're looking for the alignment directive in VC++ it's __declspec(align(16)). (or whatever you want the alignment to be)
An example usage is this:
__declspec(align(16)) float x[] = {1.,2.,3.,4.};
http://msdn.microsoft.com/en-us/library/83ythb65.aspx
Note that both __attribute__ (in GCC) and __declspec (in VC++) are compiler-specific extensions.
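If the code needs to build with both compilers, a small portability macro is a common pattern (sketch; the macro name ALIGN16 is just for illustration):

// Pick the compiler-specific alignment syntax at preprocessing time.
#if defined(_MSC_VER)
  #define ALIGN16 __declspec(align(16))
#else
  #define ALIGN16 __attribute__((aligned(16)))
#endif

ALIGN16 float x[4] = { 1.f, 2.f, 3.f, 4.f };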
EDIT:
Now that I take a second look at the code, it's gonna take more work than just replacing the __attribute__ line with the VC++ equivalent to get it to compile in VC++.
VC++ doesn't have any of these macros/functions that you are using:
__builtin_ia32_xorps
__builtin_ia32_loadups
__builtin_ia32_mulps
__builtin_ia32_addps
__builtin_ia32_storeups
You're better off just replacing all of those with SSE intrinsics - which will work on both GCC and VC++.
Here's the code converted to intrinsics:
#include <xmmintrin.h> // SSE intrinsics

float *mv_mult(float mat[SIZE][SIZE], float vec[SIZE]) {
static float ret[SIZE];
float temp[4];
int i, j;
__m128 m, v, r;
for (i = 0; i < SIZE; i++) {
r = _mm_setzero_ps(); // zero the accumulator (avoids reading r uninitialized)
for (j = 0; j < SIZE; j += 4) {
m = _mm_loadu_ps(&mat[i][j]);
v = _mm_loadu_ps(&vec[j]);
v = _mm_mul_ps(m, v);
r = _mm_add_ps(r, v);
}
_mm_storeu_ps(temp, r);
ret[i] = temp[0] + temp[1] + temp[2] + temp[3];
}
return ret;
}
V4SF and friends have to do with GCC "vector extensions":
http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Vector-Extensions.html#Vector%20Extensions
http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/X86-Built-in-Functions.html
I'm not sure how much of this stuff, if any, is supported in MSVS/MSVC. Here are a few links:
http://www.codeproject.com/KB/recipes/sseintro.aspx?msg=643444
http://msdn.microsoft.com/en-us/library/y0dh78ez%28v=vs.80%29.aspx
http://msdn.microsoft.com/en-us/library/01fth20w.aspx