I want something like this:
template <const char *op, int lane_select>
static int cmpGT64Sx2(V64x2 x, V64x2 y)
{
int result;
__asm__("movups %1,%%xmm6\n"
"\tmovups %2,%%xmm7\n"
// order swapped for AT&T style which has destination second.
"\t " op " %%xmm7,%%xmm6\n"
"\tpextrb %3, %%xmm6, %0"
: "=r" (result) : "m" (x), "m" (y), "i" (lane_select*8) : "xmm6");
return result;
}
Clearly it has to be a template parameter because the value must be known at compile time. lane_select already works fine as a template parameter, because it's an operand. What I want is for the op inside the asm string to vary, e.g. either pcmpgtd or pcmpgtq, etc. If it helps: I always want some form of x86's pcmpgt, with just the final letter changing.
Edit:
This is a test case for valgrind where it's very important that the exact instruction is run, so that we can examine the definedness of the output.
This is possible with some asm hackery, but normally you'd be better off using intrinsics, like this:
https://gcc.gnu.org/wiki/DontUseInlineAsm
#include <immintrin.h>
template<int size, int element>
int foo(__m128i x, __m128i y) {
if (size==32) // if constexpr if you want to be C++17 fancy
x = _mm_cmpgt_epi32(x, y);
else
x = _mm_cmpgt_epi64(x, y); // needs -msse4.2 or -march=native or whatever
return x[element]; // GNU C extension to index vectors with [].
// GCC defines __m128i as a vector of two long long
// cast to typedef int v4si __attribute__((vector_size(16)))
// or use _mm_extract_epi8 or whatever with element*size/8
// if you want to access one of 4 dword elements.
}
int test(__m128i x, __m128i y) {
return foo<32, 0>(x, y);
}
// compiles to pcmpgtd %xmm1, %xmm0 ; movq %xmm0, %rax ; ret
You can even go full-on GNU C native vector style and do x = x>y after casting them to v4si or not, depending on whether you want 32 or 64-bit element compares. GCC will implement the operator > however it can, with a multi-instruction emulation if SSE4.2 isn't available for pcmpgtq. There's no other fundamental difference between that and intrinsics though; the compiler isn't required to emit pcmpgtq just because the source contained _mm_cmpgt_epi64, e.g. it can do constant-propagation through it if x and y are both compile-time constants, or if y is known to be LONG_MAX so nothing can be greater than it.
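As a hedged sketch of what that native-vector version could look like (the typedefs and function names here are only illustrative; GCC produces 0 / -1 element masks for vector comparisons):
typedef int v4si __attribute__((vector_size(16)));
typedef long long v2di __attribute__((vector_size(16)));

int gt32_native(v4si x, v4si y, int element) {
    v4si mask = x > y;       // element-wise compare: pcmpgtd when SSE2 is available
    return mask[element];    // each element is 0 or -1
}

int gt64_native(v2di x, v2di y, int element) {
    v2di mask = x > y;       // pcmpgtq with SSE4.2, else a multi-instruction emulation
    return (int)mask[element];
}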
Using inline asm
Only the C preprocessor could work the way you're hoping; the asm template has to be a string literal at compile time, and AFAIK C++ template / constexpr stuff can't stringify and paste a variable's value into an actual string literal. Template evaluation happens after parsing.
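For illustration, a minimal preprocessor sketch (the CMPGT_LANE macro and gt32_macro wrapper are made-up names, and it assumes the same V64x2 vector typedef used below): the mnemonic suffix gets pasted into the string literal before the compiler ever parses the asm.
#define CMPGT_LANE(suffix, out, x, y, bytepos)               \
    __asm__("pcmpgt" suffix " %[src],%[dst]\n\t"             \
            "pextrb %[byte], %[dst], %0"                     \
            : "=r" (out), [dst]"+x"(x)                       \
            : [src]"x"(y), [byte]"i"(bytepos))

static int gt32_macro(V64x2 x, V64x2 y) {
    int result;
    CMPGT_LANE("d", result, x, y, 4);   // dword element 1 lives at byte offset 4
    return result;
}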
I came up with an amusing hack that gets GCC to print d or q as the asm symbol name of a global (or static) variable, using %p4 (See operand modifiers in the GCC manual.) An empty array like constexpr char d[] = {}; is probably a good choice here. You can't pass string literals to template parameters anyway.
(I also fixed the inefficiencies and bugs in your inline asm statement: e.g. let the compiler pick registers, and ask for the inputs in XMM regs, not memory. You were missing an "xmm7" clobber, but this version doesn't need any clobbers. This is still worse than intrinsics for cases where the inputs might be compile-time constants, or where one was in aligned memory so could use a memory operand, or various other possible optimizations. I could have used "xm" as a source but clang would always pick "m". **https://gcc.gnu.org/wiki/DontUseInlineAsm**.)
If you need it to not be optimized away for valgrind testing, maybe make it asm volatile to force it to run even if the output isn't needed. That's about the only reason you'd want to use inline asm instead of intrinsics or GNU C native vector syntax (x > y).
typedef long long V64x2 __attribute__((vector_size(16), may_alias));
// or #include <immintrin.h> and use __m128i which is defined the same way
static constexpr char q[0] asm("q") = {}; // override asm symbol name which gets mangled for const or constexpr
static constexpr char d[0] asm("d") = {};
template <const char *op, int element_select>
static int cmpGT64Sx2(V64x2 x, V64x2 y)
{
int result;
__asm__(
// AT&T style has destination second.
"pcmpgt%p[op] %[src],%[dst]\n\t" // %p4 - print the bare name, not $d or $q
"pextrb %3, %[dst], %0"
: "=r" (result), [dst]"+x"(x)
: [src]"x"(y), "i" (element_select*8),
[op]"i"(op) // address as an immediate = symbol name
: /* no clobbers */);
return result;
}
int gt64(V64x2 x, V64x2 y) {
return cmpGT64Sx2<q, 1>(x,y);
}
int gt32(V64x2 x, V64x2 y) {
return cmpGT64Sx2<d, 1>(x,y);
}
So at the cost of having d and q as global-scope names in this file(!??), we can use <d, 2> or <q, 0> template params that look like the instruction we want.
Note that in x86 SIMD terminology, a "lane" is a 128-bit chunk of an AVX or AVX-512 vector. As in vpermilps (In-Lane Permute of 32-bit float elements).
This compiles to the following asm with GCC10 -O3 (https://godbolt.org/z/ovxWd8)
gt64(long long __vector(2), long long __vector(2)):
pcmpgtq %xmm1,%xmm0
pextrb $8, %xmm0, %eax
ret
gt32(long long __vector(2), long long __vector(2)):
pcmpgtd %xmm1,%xmm0
pextrb $8, %xmm0, %eax // This is actually element 2 of 4, not 1, because your scale doesn't account for the size.
ret
You can hide the global-scope vars from the template users and have them pass an integer size. I also fixed the element indexing to account for the variable element size.
static constexpr char q[0] asm("q") = {}; // override asm symbol name which gets mangled for const or constexpr
static constexpr char d[0] asm("d") = {};
template <int size, int element_select>
static int cmpGT64Sx2_new(V64x2 x, V64x2 y)
{
//static constexpr char dd[0] asm("d") = {}; // nope, asm symbol name overrides don't work on local-scope static vars
constexpr int bytepos = size/8 * element_select;
constexpr const char *op = (size==32) ? d : q;
// maybe static_assert( size == 32 || size == 64 )
int result;
__asm__(
// AT&T style has destination second.
"pcmpgt%p[op] %[src],%[dst]\n\t" // SSE2 or SSE4.2
"pextrb %[byte], %[dst], %0" // SSE4.1
: "=r" (result), [dst]"+x"(x)
: [src]"x"(y), [byte]"i" (bytepos),
[op]"i"(op) // address as an immediate = symbol name
: /* no clobbers */);
return result;
}
// Note *not* referencing d or q static vars, but the template is
int gt64_new(V64x2 x, V64x2 y) {
return cmpGT64Sx2_new<64, 1>(x,y);
}
int gt32_new(V64x2 x, V64x2 y) {
return cmpGT64Sx2_new<32, 1>(x,y);
}
This also compiles like we want, e.g.
gt32_new(long long __vector(2), long long __vector(2)):
pcmpgtd %xmm1,%xmm0
pextrb $4, %xmm0, %eax # note the correct element 1 position
ret
BTW, you could use typedef int v4si __attribute__((vector_size(16))) and then v[element] to let GCC do it for you, if your asm statement just produces an "=x" output of that type in the same register as a "0"(x) input for example.
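A hedged sketch of that variant (cmpGT32_vecout is a made-up name; note the element index needs no byte scaling because GCC indexes the v4si directly):
typedef int v4si __attribute__((vector_size(16)));

template <int element_select>
static int cmpGT32_vecout(V64x2 x, V64x2 y)
{
    v4si result;
    __asm__("pcmpgtd %[src],%[dst]"
            : [dst]"=x"(result)            // whole-vector output
            : "0"((v4si)x), [src]"x"(y));  // input x in the same register as the output
    return result[element_select];         // GCC extracts the element itself
}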
Without global-scope var names, using GAS .if / .else
We can easily get GCC to print a bare number into the asm template, e.g. for use as an operand to a .if %[size] == 32 directive. The GNU assembler has some conditional-assembly features, so we just get GCC to feed it the right text input to use that. Much less of a hack on the C++ side, but less compact source. Your template param could be a 'd' or 'q' size-code character if you wanted to compare on that instead of a size number.
template <int size, int element_select>
static int cmpGT64Sx2_mask(V64x2 x, V64x2 y)
{
constexpr int bytepos = size/8 * element_select;
unsigned int result;
__asm__(
// AT&T style has destination second.
".if %c[opsize] == 32\n\t" // note Godbolt hides directives; use binary mode to verify the assemble-time condition worked
"pcmpgtd %[src],%[dst]\n\t" // SSE2
".else \n\t"
"pcmpgtq %[src],%[dst]\n\t" // SSE4.2
".endif \n\t"
"pmovmskb %[dst], %0"
: "=r" (result), [dst]"+x"(x)
: [src]"x"(y), [opsize]"i"(size) // address as an immediate = symbol name
: /* no clobbers */);
return (result >> bytepos) & 1; // can just be TEST when branching on it
}
I also changed to using SSE2 pmovmskb to extract both / all of the element compare results, and scalar code to select which bit to look at. This change is orthogonal and can be combined with any of the other versions. After inlining, it's generally going to be more efficient, allowing test $imm32, %eax. (pmovmskb is cheaper than pextrb, and it lets the whole thing require only SSE2 for the pcmpgtd version.)
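The wrappers behind the gt32_mask / gt64_mask symbols in the disassembly further down would look something like this (sketched here, reconstructed from that output):
int gt32_mask(V64x2 x, V64x2 y) {
    return cmpGT64Sx2_mask<32, 1>(x, y);   // pcmpgtd path, tests bit 4 of the mask
}
int gt64_mask(V64x2 x, V64x2 y) {
    return cmpGT64Sx2_mask<64, 1>(x, y);   // pcmpgtq path, tests bit 8 of the mask
}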
The asm output from the compiler looks like
.if 64 == 32
pcmpgtd %xmm1,%xmm0
.else
pcmpgtq %xmm1,%xmm0
.endif
pmovmskb %xmm0, %eax
To make sure that did what we want, we can assemble to binary and look at disassembly (https://godbolt.org/z/5zGdfv):
gt32_mask(long long __vector(2), long long __vector(2)):
pcmpgtd %xmm1,%xmm0
pmovmskb %xmm0,%eax
shr $0x4,%eax
and $0x1,%eax
(and gt64_mask uses pcmpgtq and shr by 8.)
Related
I have two int16_t integers that I want to store in one 32-bit int.
I can use the lower and upper bits to do this as answered in this question.
The top answer demonstrates this:
int int1 = 345;
int int2 = 2342;
take2Integers( int1 | (int2 << 16) );
But my value could also be negative, so will << 16 cause undefined behavior?
If so, what is the solution to storing and retrieving those two numbers which can also be negative in a safe way?
Further context
The whole 32 bit number will be stored in an array with other numbers similar to it.
These numbers will never be utilized as a whole, I'll only want to work with one of the integers packed inside at once.
So what is required is two functions which ensure safe storing and retrieval of two int16_t values into one 4 byte integer
int pack(int16_t lower, int16_t upper);
int16_t get_lower(int);
int16_t get_upper(int);
Here’s the solution with a single std::memcpy from the comments:
std::int32_t pack(std::int16_t l, std::int16_t h) {
std::int16_t arr[2] = {l, h};
std::int32_t result;
std::memcpy(&result, arr, sizeof result);
return result;
}
Any compiler worth its salt won’t emit code for the temporary array. Case in point, the resulting assembly code on GCC with -O2 looks like this:
pack(short, short):
sal esi, 16
movzx eax, di
or eax, esi
ret
Without checking, I’m confident that clang/ICC/MSVC produce something similar.
C++ (pre-20) makes it notoriously hard to alias values. If you want to remain within defined behavior pre-20, you'd have to use memcpy.
Like following:
int32_t pack(int16_t l, int16_t h) {
int32_t r;
memcpy(&r, &l, 2);
memcpy(reinterpret_cast<char*>(&r) + 2, &h, 2);
return r;
}
Unpack is similar, with reverse order of memcpy arguments.
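For completeness, a hedged sketch of the unpack side (the function names follow the interface requested above; the byte offsets match the pack shown here):
#include <cstdint>
#include <cstring>

int16_t get_lower(int32_t v) {
    int16_t r;
    memcpy(&r, &v, 2);                               // bytes 0..1, where l was stored
    return r;
}

int16_t get_upper(int32_t v) {
    int16_t r;
    memcpy(&r, reinterpret_cast<char*>(&v) + 2, 2);  // bytes 2..3, where h was stored
    return r;
}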
Note that this pack code is very well optimized by both Clang and GCC in my experiments. Here is what the latest version of Clang produced:
pack(short, short): # #pack(short, short)
movzwl %di, %eax
shll $16, %esi
orl %esi, %eax
retq
It all ended up being the same combination of shifts and bitwise ORs.
Today I wanted to test, how Clang would transform a recursive power of two function and noticed that even with known exponent, the recursion is not optimized away even when using constexpr.
#include <array>
constexpr unsigned int pow2_recursive(unsigned int exp) {
if(exp == 0) return 1;
return 2 * pow2_recursive(exp-1);
}
unsigned int pow2_5() {
return pow2_recursive(5);
}
pow2_5 is compiled as a call to pow2_recursive.
pow2_5(): # #pow2_5()
mov edi, 5
jmp pow2_recursive(unsigned int) # TAILCALL
However, when I use the result in a context that requires it to be known at compile time, it will correctly compute the result at compile time.
unsigned int pow2_5_arr() {
std::array<int, pow2_recursive(5)> a;
return a.size();
}
is compiled to
pow2_5_arr(): # #pow2_5_arr()
mov eax, 32
ret
Here is the link to the full example in Godbolt: https://godbolt.org/z/fcKef1
So, am I missing something here? Is there something that can change the result at runtime and a reason, that pow2_5 cannot be optimized in the same way as pow2_5_arr?
I'm investigating ways to speed up a large section of C++ code, which has automatic derivatives for computing jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the jacobians.
This surprised me, since most of the jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x the function, not 10-12x. In order to model what a large amount of the jacobian work is like, I made a super minimal example with just a dot product (instead of sin, cos, sqrt and more that would be in a real situation) that the compiler should be able to optimize to a single return value:
#include <Eigen/Core>
#include <Eigen/Geometry>
using Array12d = Eigen::Matrix<double,12,1>;
double testReturnFirstDot(const Array12d& b)
{
Array12d a;
a.array() = 0.;
a(0) = 1.;
return a.dot(b);
}
Which should be the same as
double testReturnFirst(const Array12d& b)
{
return b(0);
}
I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6 or MSVC 19 were able to make any optimizations at all over the naive dot-product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy) the optimizations are very poor in GCC and Clang (still involve multiplications and additions), and MSVC doesn't do any optimizations at all.
I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations being able to do better constant propagation/folding would make more optimizations apparent, even if the constant-fold itself didn't result in a speedup.
While I'm interested in explanations for why this isn't done on the compiler side, I'm also interested for what I can do on a practical side to make my own code faster when facing these kinds of patterns.
This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd, and one horizontal reduction of the remaining 4-component register (this assumes AVX; with SSE only you'll get 6 mulpd and 5 addpd). With -ffast-math GCC and clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.
So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH) but other pieces of code might become much slower.
If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internals, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:
static const int DontVectorize = 0x80000000;
namespace Eigen {
namespace internal {
template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
: traits<Matrix<_Scalar, _Rows, _Cols> >
{
typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
enum {
EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
};
};
}
}
using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;
Full example there: https://godbolt.org/z/bOEyzv
I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6 or MSVC 19 were able to make any optimizations at all over the naive dot-product with a matrix full of 0s.
They have no other choice unfortunately. Since IEEE floats have signed zeros, adding 0.0 is not an identity operation:
-0.0 + 0.0 = 0.0 // Not -0.0!
Similarly, multiplying by zero does not always yield zero:
0.0 * Infinity = NaN // Not 0.0!
So the compilers simply cannot perform these constant folds in the dot product while retaining IEEE float compliance - for all they know, your input might contain signed zeros and/or infinities.
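A tiny demonstration of both identities (just a sketch to make the point concrete):
#include <cstdio>
#include <limits>

int main() {
    double negzero = -0.0;
    double inf = std::numeric_limits<double>::infinity();
    std::printf("%g\n", negzero + 0.0);   // prints 0, not -0
    std::printf("%g\n", 0.0 * inf);       // prints a NaN
}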
You will have to use -ffast-math to get these folds, but that may have undesired consequences. You can get more fine-grained control with specific flags (from http://gcc.gnu.org/wiki/FloatingPointMath). According to the above explanation, adding the following two flags should allow the constant folding:
-ffinite-math-only, -fno-signed-zeros
Indeed, you get the same assembly as with -ffast-math this way: https://godbolt.org/z/vGULLA. You only give up the signed zeros (probably irrelevant), NaNs and the infinities. Presumably, if you were to still produce them in your code, you would get undefined behavior, so weigh your options.
As for why your example is not optimized better even with -ffast-math: That is on Eigen. Presumably they have vectorization on their matrix operations, which are much harder for compilers to see through. A simple loop is properly optimized with these options: https://godbolt.org/z/OppEhY
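As a hedged sketch of the kind of simple loop meant here (the exact source behind that Godbolt link may differ):
double dot_first(const double* b) {
    const double a[12] = {1.0};   // = {1, 0, 0, ...}
    double sum = 0.0;
    for (int i = 0; i < 12; ++i)
        sum += a[i] * b[i];
    return sum;                   // should fold to a single load of b[0] with the flags above
}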
One way to force a compiler to optimize multiplications by 0s and 1s is to manually unroll the loop. For simplicity let's use
#include <array>
#include <cstddef>
constexpr std::size_t n = 12;
using Array = std::array<double, n>;
Then we can implement a simple dot function using fold expressions (or recursion if they are not available):
#include <utility>
template<std::size_t... is>
double dot(const Array& x, const Array& y, std::index_sequence<is...>)
{
return ((x[is] * y[is]) + ...);
}
double dot(const Array& x, const Array& y)
{
return dot(x, y, std::make_index_sequence<n>{});
}
Now let's take a look at your function
double test(const Array& b)
{
const Array a{1}; // = {1, 0, ...}
return dot(a, b);
}
With -ffast-math gcc 8.2 produces:
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
ret
clang 6.0.0 goes along the same lines:
test(std::array<double, 12ul> const&): # #test(std::array<double, 12ul> const&)
movsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
ret
For example, for
double test(const Array& b)
{
const Array a{1, 1}; // = {1, 1, 0...}
return dot(a, b);
}
we get
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
ret
Addendum: Clang unrolls a for (std::size_t i = 0; i < n; ++i) ... loop without all these fold-expression tricks; GCC doesn't and needs some help.
My goal is to observe the outputs (destination operand and eflag register) of several instructions like add, or, xor, and, sub, etc...
For this I am randomly feeding them with inputs and recording the outputs. I embed the target instruction into a C++ routine with some genericity over the width/signedness of the operands:
/* generic operand.
W: width, S: signedness
*/
template<size_t W, bool S> class gen_op { public: typedef size_t type; };
followed by specializations:
template<> class gen_op<8U,true> { public: typedef int8_t type; };
...
template<> class gen_op<16U,false> { public: typedef uint16_t type; };
...
and then:
template<size_t W, bool S>
typename gen_op<W,S>::type test_add(typename gen_op<W,S>::type i1, typename gen_op<W,S>::type i2)
{
typename gen_op<W,S>::type o;
{
__asm movzx eax, i1
__asm add eax, i2
__asm mov o, eax
}
return o;
}
Another routine exists to test the eflag register using pushfd/pop. I know it would be more efficient (in time) to use only one routine and report the destination operand and the eflag register at once.
Since I am quite new to inline assembler (and to assembler as well), I would like to get your feedback on the approach/validity of results.
On my side, I see one issue: for the time being, I limit myself to 8/16/32-bit operands. But when moving to 64-bit operands, EAX will not be enough. How can I handle this in a simple manner?
I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
assert( i <= 3 );
return V.m128_f32[i];
}
The error I get is as follows:
Member reference base type '__m128' is not a structure or union.
I've looked around and found that Clang (and maybe GCC) has a problem with treating __m128 as a struct or union. However I haven't managed to find a straight answer as to how I can get these values back. I've tried using the subscript operator and couldn't do that, and I've glanced around the huge list of SSE intrinsics functions and haven't yet found an appropriate one.
A union is probably the most portable way to do this:
union U {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
};

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;
    assert(i <= 3);
    u.v = V;
    return u.a[i];
}
As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:
template<unsigned i>
float vectorGetByIndex( __m128 V)
{
// shuffle V so that the element that you want is moved to the least-
// significant element of the vector (V[0])
V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
// return the value in V[0]
return _mm_cvtss_f32(V);
}
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions. This will inline as just a shufps (or nothing for i==0).
Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13) so no need for an if (i). https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0 which is 1 byte shorter than shufps and produces the same low element. You could manually do this with switch/case for other compilers since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
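A hedged sketch of that manual switch/case idea (the function name is made up; only worth it if you care about that byte of code size):
#include <immintrin.h>

template<unsigned i>
float vectorGetByIndex_sw(__m128 V)
{
    switch (i) {
    case 0:  return _mm_cvtss_f32(V);                    // low element: no shuffle needed
    case 2:  return _mm_cvtss_f32(_mm_movehl_ps(V, V));  // movhlps xmm0, xmm0
    default: return _mm_cvtss_f32(_mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i)));
    }
}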
Note that SSE4.1 _mm_extract_epi32(V, i); is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)
The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For i!=0, otherwise it would use movss). To leave the result in an XMM register as a scalar float, shufps is good.
(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)
Use
template<unsigned i>
float vectorGetByIndex( __m128 V) {
union {
__m128 v;
float a[4];
} converter;
converter.v = V;
return converter.a[i];
}
which will work regardless of the available instruction set.
Note: Even if SSE4.1 is available and i is a compile time constant, you can't use pextract etc. this way, because these instructions extract a 32-bit integer, not a float:
// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
return _mm_extract_epi32(V, i);
}
// broken code ends here
I don't delete it because it is a useful reminder how to not do things.
The way I use is
union vec { __m128 sse; float f[4]; };

float accessmember(__m128 v, int index)
{
    vec u;
    u.sse = v;
    return u.f[index];
}
Seems to work out pretty well for me.
Late to this party but found that this works for me in MSVC where z is a variable of type __m128.
#define _mm_extract_f32(v, i) _mm_cvtss_f32(_mm_shuffle_ps(v, v, i))
__m128 z = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);
float f = _mm_extract_f32(z, 2);
Or even simpler:
__m128 z;
float f = z.m128_f32[2]; // to get the 3rd float value in the vector