I want something like this:
template <const char *op, int lane_select>
static int cmpGT64Sx2(V64x2 x, V64x2 y)
{
    int result;
    __asm__("movups %1,%%xmm6\n"
            "\tmovups %2,%%xmm7\n"
            // order swapped for AT&T style which has destination second.
            "\t " op " %%xmm7,%%xmm6\n"
            "\tpextrb %3, %%xmm6, %0"
            : "=r" (result) : "m" (x), "m" (y), "i" (lane_select*8) : "xmm6");
    return result;
}
Clearly it has to be a template parameter because the instruction must be known at compile time. The lane_select already works fine; it's a template parameter, but it's used as an operand. I want the op inside the asm to vary, e.g. either pcmpgtd or pcmpgtq. If it helps: I always want some form of x86's pcmpgt, with just the final letter changing.
Edit:
This is a test case for valgrind where it's very important that the exact instruction is run, so that we can examine the definedness of the output.
This is possible with some asm hackery, but normally you'd be better off using intrinsics, like this:
https://gcc.gnu.org/wiki/DontUseInlineAsm
#include <immintrin.h>

template<int size, int element>
int foo(__m128i x, __m128i y) {
    if (size==32)          // if constexpr if you want to be C++17 fancy
        x = _mm_cmpgt_epi32(x, y);
    else
        x = _mm_cmpgt_epi64(x, y);  // needs -msse4.2 or -march=native or whatever
    return x[element];     // GNU C extension to index vectors with [].
    // GCC defines __m128i as a vector of two long long.
    // Cast to typedef int v4si __attribute__((vector_size(16)))
    // or use _mm_extract_epi8 or whatever with element*size/8
    // if you want to access one of 4 dword elements.
}

int test(__m128i x, __m128i y) {
    return foo<32, 0>(x, y);
}
// compiles to pcmpgtd %xmm1, %xmm0 ; movq %xmm0, %rax ; ret
You can even go full-on GNU C native vector style and do x = x>y after casting them to v4si or not, depending on whether you want 32 or 64-bit element compares. GCC will implement the operator > however it can, with a multi-instruction emulation if SSE4.2 isn't available for pcmpgtq. There's no other fundamental difference between that and intrinsics though; the compiler isn't required to emit pcmpgtq just because the source contained _mm_cmpgt_epi64, e.g. it can do constant-propagation through it if x and y are both compile-time constants, or if y is known to be LONG_MAX so nothing can be greater than it.
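For example, a minimal sketch of the native-vector style (the typedefs and function names here are illustrative, not from the question):

typedef int v4si __attribute__((vector_size(16)));
typedef long long v2di __attribute__((vector_size(16)));

// Each element of the result is 0 or -1 (all-ones), like pcmpgt.
v4si gt32_native(v4si x, v4si y) { return x > y; }   // pcmpgtd
v2di gt64_native(v2di x, v2di y) { return x > y; }   // pcmpgtq with -msse4.2, emulated otherwise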
Using inline asm
Only the C preprocessor could work the way you're hoping; the asm template has to be a string literal at compile time, and AFAIK C++ template / constexpr machinery can't stringify and paste a variable's value into an actual string literal. Template evaluation happens after parsing.
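For completeness, this is roughly what the preprocessor version would look like (a sketch I'm adding, not part of the original answer; the macro name is made up):

// The suffix is a preprocessor token: #suffix stringifies it, and adjacent
// string literals are concatenated, so "pcmpgt" #suffix becomes "pcmpgtq" etc.
#define PCMPGT(suffix, dst, src) \
    __asm__("pcmpgt" #suffix " %1,%0" : "+x" (dst) : "x" (src))

// usage: PCMPGT(q, x, y);   // emits pcmpgtq, with x as the destination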
I came up with an amusing hack that gets GCC to print d or q as the asm symbol name of a global (or static) variable, using %p4 (See operand modifiers in the GCC manual.) An empty array like constexpr char d[] = {}; is probably a good choice here. You can't pass string literals to template parameters anyway.
(I also fixed the inefficiencies and bugs in your inline asm statement: e.g. let the compiler pick registers, and ask for the inputs in XMM regs, not memory. You were missing an "xmm7" clobber, but this version doesn't need any clobbers. This is still worse than intrinsics for cases where the inputs might be compile-time constants, or where one was in aligned memory so it could use a memory operand, or various other possible optimizations. I could have used "xm" as a source but clang would always pick "m". https://gcc.gnu.org/wiki/DontUseInlineAsm.)
If you need it to not be optimized away for valgrind testing, maybe make it asm volatile to force it to run even if the output isn't needed. That's about the only reason you'd want to use inline asm instead of intrinsics or GNU C native vector syntax (x > y).
typedef long long V64x2 __attribute__((vector_size(16), may_alias));
// or #include <immintrin.h> and use __m128i which is defined the same way

static constexpr char q[0] asm("q") = {}; // override asm symbol name which gets mangled for const or constexpr
static constexpr char d[0] asm("d") = {};

template <const char *op, int element_select>
static int cmpGT64Sx2(V64x2 x, V64x2 y)
{
    int result;
    __asm__(
        // AT&T style has destination second.
        "pcmpgt%p[op] %[src],%[dst]\n\t"   // %p4 - print the bare name, not $d or $q
        "pextrb %3, %[dst], %0"
        : "=r" (result), [dst]"+x"(x)
        : [src]"x"(y), "i" (element_select*8),
          [op]"i"(op)                       // address as an immediate = symbol name
        : /* no clobbers */);
    return result;
}

int gt64(V64x2 x, V64x2 y) {
    return cmpGT64Sx2<q, 1>(x,y);
}
int gt32(V64x2 x, V64x2 y) {
    return cmpGT64Sx2<d, 1>(x,y);
}
So at the cost of having d and q as global-scope names in this file(!??), we can use <d, 2> or <q, 0> template params that look like the instruction we want.
Note that in x86 SIMD terminology, a "lane" is a 128-bit chunk of an AVX or AVX-512 vector. As in vpermilps (In-Lane Permute of 32-bit float elements).
This compiles to the following asm with GCC10 -O3 (https://godbolt.org/z/ovxWd8)
gt64(long long __vector(2), long long __vector(2)):
        pcmpgtq %xmm1,%xmm0
        pextrb  $8, %xmm0, %eax
        ret
gt32(long long __vector(2), long long __vector(2)):
        pcmpgtd %xmm1,%xmm0
        pextrb  $8, %xmm0, %eax   # This is actually element 2 of 4, not 1, because your scale doesn't account for the element size.
        ret
You can hide the global-scope vars from the template users and have them pass an integer size. I also fixed the element indexing to account for the variable element size.
static constexpr char q[0] asm("q") = {}; // override asm symbol name which gets mangled for const or constexpr
static constexpr char d[0] asm("d") = {};

template <int size, int element_select>
static int cmpGT64Sx2_new(V64x2 x, V64x2 y)
{
    //static constexpr char dd[0] asm("d") = {}; // nope, asm symbol name overrides don't work on local-scope static vars
    constexpr int bytepos = size/8 * element_select;
    constexpr const char *op = (size==32) ? d : q;
    // maybe static_assert( size == 32 || size == 64 )

    int result;
    __asm__(
        // AT&T style has destination second.
        "pcmpgt%p[op] %[src],%[dst]\n\t"   // SSE2 or SSE4.2
        "pextrb %[byte], %[dst], %0"       // SSE4.1
        : "=r" (result), [dst]"+x"(x)
        : [src]"x"(y), [byte]"i" (bytepos),
          [op]"i"(op)                       // address as an immediate = symbol name
        : /* no clobbers */);
    return result;
}

// Note *not* referencing d or q static vars, but the template is
int gt64_new(V64x2 x, V64x2 y) {
    return cmpGT64Sx2_new<64, 1>(x,y);
}
int gt32_new(V64x2 x, V64x2 y) {
    return cmpGT64Sx2_new<32, 1>(x,y);
}
This also compiles like we want, e.g.
gt32_new(long long __vector(2), long long __vector(2)):
        pcmpgtd %xmm1,%xmm0
        pextrb  $4, %xmm0, %eax   # note the correct element 1 position
        ret
BTW, you could use typedef int v4si __attribute__((vector_size(16))) and then v[element] to let GCC do it for you, if your asm statement just produces an "=x" output of that type in the same register as a "0"(x) input for example.
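For instance, a rough sketch of that idea (my addition; assumes the V64x2 typedef from above):

typedef int v4si __attribute__((vector_size(16)));

template <int element_select>
static int cmpGT32_extract(V64x2 x, V64x2 y)
{
    v4si result;
    __asm__("pcmpgtd %[src],%[dst]"     // only the compare stays in inline asm
            : [dst]"=x"(result)
            : "0"(x), [src]"x"(y));
    return result[element_select];      // GCC picks the extraction (pextrd, shuffle, ...)
}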
Without global-scope var names, using GAS .if / .else
We can easily get GCC to print a bare number into the asm template, e.g. for use as an operand to a .if %[size] == 32 directive. The GNU assembler has some conditional-assembly features, so we just get GCC to feed it the right text input to use that. Much less of a hack on the C++ side, but less compact source. Your template param could be a 'd' or 'q' size-code character if you wanted to compare on that instead of a size number.
template <int size, int element_select>
static int cmpGT64Sx2_mask(V64x2 x, V64x2 y)
{
    constexpr int bytepos = size/8 * element_select;
    unsigned int result;
    __asm__(
        // AT&T style has destination second.
        ".if %c[opsize] == 32\n\t"      // note Godbolt hides directives; use binary mode to verify the assemble-time condition worked
        "pcmpgtd %[src],%[dst]\n\t"     // SSE2
        ".else \n\t"
        "pcmpgtq %[src],%[dst]\n\t"     // SSE4.2
        ".endif \n\t"
        "pmovmskb %[dst], %0"
        : "=r" (result), [dst]"+x"(x)
        : [src]"x"(y), [opsize]"i"(size)  // bare number printed into the template for the .if
        : /* no clobbers */);
    return (result >> bytepos) & 1;     // can just be TEST when branching on it
}
I also changed to using SSE2 pmovmskb to extract all of the element compare results at once, and scalar code to select which bit to look at. This change is orthogonal and can be combined with any of the other versions. After inlining, it's generally going to be more efficient, allowing test $imm32, %eax. (pmovmskb is cheaper than pextrb, and it lets the whole thing require only SSE2 for the pcmpgtd version.)
The asm output from the compiler looks like
        .if 64 == 32
        pcmpgtd %xmm1,%xmm0
        .else
        pcmpgtq %xmm1,%xmm0
        .endif
        pmovmskb %xmm0, %eax
To make sure that did what we want, we can assemble to binary and look at disassembly (https://godbolt.org/z/5zGdfv):
gt32_mask(long long __vector(2), long long __vector(2)):
        pcmpgtd  %xmm1,%xmm0
        pmovmskb %xmm0,%eax
        shr      $0x4,%eax
        and      $0x1,%eax
(and gt64_mask uses pcmpgtq and shr by 8.)
I want to use vector<char> as a buffer. The interface is perfect for my needs, but there's a performance penalty when resizing it beyond its current size, since the memory is initialized. I don't need the initialization, since the data will be overwritten in any case by some third-party C functions.

Is there a way or a specific allocator to avoid the initialization step? Note that I do want to use resize(), not other tricks like reserve() and capacity(), because I need size() to always represent the meaningful size of my "buffer" at any moment, while capacity() might be greater than its size after a resize(), so I cannot rely on capacity() as meaningful information for my application. Furthermore, the (new) size of the vector is never known in advance, so I cannot use std::array.

If vector cannot be configured that way, I'd like to know what kind of container or allocator I could use instead of std::vector<char, std::allocator<char>>. The only requirement is that the alternative must be based at most on the STL or Boost. I have access to C++11.
It is a known issue that this initialization cannot be turned off, even explicitly, for std::vector.
People normally implement their own pod_vector<> that does not do any initialization of the elements.
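A minimal sketch of what such a pod_vector<> might look like (the name and interface are illustrative; copying, insert, push_back and error handling are omitted):

#include <cstddef>
#include <cstdlib>
#include <type_traits>

template <typename T>
class pod_vector {
    static_assert(std::is_trivial<T>::value, "pod_vector only supports trivial types");
    T* data_ = nullptr;
    std::size_t size_ = 0, cap_ = 0;
public:
    ~pod_vector() { std::free(data_); }
    std::size_t size() const { return size_; }
    T* data() { return data_; }
    T& operator[](std::size_t i) { return data_[i]; }

    void resize(std::size_t n) {            // grows without zeroing the new elements
        if (n > cap_) {
            std::size_t newCap = cap_ ? cap_ * 2 : 16;
            if (newCap < n) newCap = n;
            data_ = static_cast<T*>(std::realloc(data_, newCap * sizeof(T)));
            cap_ = newCap;
        }
        size_ = n;                          // new elements are left uninitialized
    }
};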
Another way is to create a type which is layout-compatible with char, whose constructor does nothing:
struct NoInitChar
{
    char value;
    NoInitChar() noexcept {
        // do nothing
        static_assert(sizeof *this == sizeof value, "invalid size");
        static_assert(__alignof *this == __alignof value, "invalid alignment");
    }
};

int main() {
    std::vector<NoInitChar> v;
    v.resize(10); // calls NoInitChar() which does not initialize

    // Look ma, no reinterpret_cast<>!
    char* beg = &v.front().value;
    char* end = beg + v.size();
}
There's nothing in the standard library that meets your requirements, and nothing I know of in boost either.
There are three reasonable options I can think of:
Stick with std::vector for now, leave a comment in the code and come back to it if this ever causes a bottleneck in your application.
Use a custom allocator with empty construct/destroy methods, and hope your optimiser will be smart enough to remove any calls to them (a sketch follows after this list).
Create a wrapper around a dynamically allocated array, implementing only the minimal functionality that you require.
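A minimal sketch of the second option (my naming; the summary answer further down shows a more complete default_init_allocator): an allocator whose construct() default-initializes, which for char means doing nothing:

#include <memory>
#include <new>
#include <utility>
#include <vector>

template <typename T>
struct no_init_allocator : std::allocator<T> {
    template <typename U> struct rebind { using other = no_init_allocator<U>; };
    using std::allocator<T>::allocator;

    // default-initialize (a no-op for char) instead of value-initialize (zeroing)
    template <typename U>
    void construct(U* p) noexcept { ::new (static_cast<void*>(p)) U; }

    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
    }
};

// usage: std::vector<char, no_init_allocator<char>> buf;  buf.resize(1 << 20);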
As an alternative solution that works with vectors of different pod types:
template<typename V>
void resize(V& v, size_t newSize)
{
    struct vt { typename V::value_type v; vt() {} };
    static_assert(sizeof(vt[10]) == sizeof(typename V::value_type[10]), "alignment error");
    typedef std::vector<vt, typename std::allocator_traits<typename V::allocator_type>::template rebind_alloc<vt>> V2;
    reinterpret_cast<V2&>(v).resize(newSize);
}
And then you can:
std::vector<char> v;
resize(v, 1000); // instead of v.resize(1000);
This is most likely UB, even though it works properly for me for cases where I care more about performance. Difference in generated assembly as produced by clang:
test():
        push    rbx
        mov     edi, 1000
        call    operator new(unsigned long)
        mov     rbx, rax
        mov     edx, 1000
        mov     rdi, rax
        xor     esi, esi
        call    memset
        mov     rdi, rbx
        pop     rbx
        jmp     operator delete(void*)

test_noinit():
        push    rax
        mov     edi, 1000
        call    operator new(unsigned long)
        mov     rdi, rax
        pop     rax
        jmp     operator delete(void*)
So to summarize the various solutions found on stackoverflow:
use a special default-init allocator. (https://stackoverflow.com/a/21028912/1984766)
drawback: changes the vector-type to std::vector<char, default_init_allocator<char>> vec;
use a wrapper-struct struct NoInitChar around a char that has an empty constructor and therefore skips the value-initialization (https://stackoverflow.com/a/15220853/1984766)
drawback: changes the vector-type to std::vector<NoInitChar> vec;
temporarily cast the vector<char> to vector<NoInitChar> and resize it (https://stackoverflow.com/a/57053750/1984766)
drawback: does not change the type of the vector but you need to call your_resize_function (vec, x) instead of vec.resize (x).
With this post I wanted to point out that all of these methods rely on the compiler optimizing away the construction calls in order to speed up your program. I can confirm that the initialization of the new chars when resizing is indeed optimized away with every compiler I tested. So everything looks good ...
But: since methods 1 & 2 change the type of the vector, what happens when you use these vectors under more "complex" circumstances?
Consider this example:
#include <time.h>
#include <vector>
#include <string_view>
#include <iostream>

// high-precision timer
double get_time () {
    struct timespec timespec;
    ::clock_gettime (CLOCK_MONOTONIC_RAW, &timespec);
    return timespec.tv_sec + timespec.tv_nsec / (1000.0 * 1000.0 * 1000.0);
}

// method 1 --> special allocator
// reformatted to make it readable
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
private:
    typedef std::allocator_traits<A> a_t;
public:
    template<typename U>
    struct rebind {
        using other = default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
    };
    using A::A;

    template <typename U>
    void construct (U* ptr) noexcept (std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }
    template <typename U, typename...Args>
    void construct (U* ptr, Args&&... args) {
        a_t::construct (static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};

// method 2 --> wrapper struct
struct NoInitChar {
public:
    NoInitChar () noexcept { }
    NoInitChar (char c) noexcept : value (c) { }
public:
    char value;
};

// some work to waste time
template<typename T>
void do_something (T& vec, std::string_view str) {
    vec.push_back ('"');
    vec.insert (vec.end (), str.begin (), str.end ());
    vec.push_back ('"');
    vec.push_back (',');
}

int main (int argc, char** argv) {
    double timeBegin = get_time ();

    std::vector<char> vec;                                  // normal case
    //std::vector<char, default_init_allocator<char>> vec; // method 1
    //std::vector<NoInitChar> vec;                          // method 2

    vec.reserve (256 * 1024 * 1024);
    for (int i = 0; i < 1024 * 1024; ++i) {
        do_something (vec, "foobar1");
        do_something (vec, "foobar2");
        do_something (vec, "foobar3");
        do_something (vec, "foobar4");
        do_something (vec, "foobar5");
        do_something (vec, "foobar6");
        do_something (vec, "foobar7");
        do_something (vec, "foobar8");
        vec.resize (vec.size () + 64);
    }

    double timeEnd = get_time ();
    std::cout << (timeEnd - timeBegin) * 1000 << "ms" << std::endl;
    return 0;
}
You would expect methods 1 & 2 to outperform the normal vector with every "recent" compiler, since the resize is free and the other operations are the same. Well, think again:
                 g++ 7.5.0   g++ 8.4.0   g++ 9.3.0   clang++ 9.0.0
vector<char>        95ms       134ms       133ms         97ms
method 1           130ms       159ms       166ms         91ms
method 2           166ms       160ms       159ms         89ms
All test applications are compiled like this and executed 50 times, taking the lowest measurement:
$(cc) -O3 -flto -std=c++17 sample.cpp
Encapsulate it.
Initialise it to the maximum size (not reserve).
Keep a reference to the iterator representing the end of the real size, as you put it.
Use begin and the real end, instead of end(), for your algorithms; a minimal sketch follows.
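A minimal sketch of that encapsulation (my names; it tracks the real size with an index rather than an iterator, which stays valid because the storage never reallocates):

#include <cstddef>
#include <vector>

class CharBuffer {
    std::vector<char> storage_;   // initialised once to the maximum size
    std::size_t used_ = 0;        // the "real" size
public:
    explicit CharBuffer(std::size_t maxSize) : storage_(maxSize) {}

    void resize(std::size_t n) { used_ = n; }   // no reallocation, no re-initialisation
    std::size_t size() const { return used_; }

    char* begin() { return storage_.data(); }
    char* end()   { return begin() + used_; }   // the real end, not storage_.end()
};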
It's very rare that you would need to do this; I strongly encourage you to benchmark your situation to be absolutely sure this hack is needed.
Even then, I prefer the NoInitChar solution. (See Maxim's answer)
But if you're sure you would benefit from this, and NoInitChar doesn't work for you, and you're using clang, or gcc, or MSVC as your compiler, consider using folly's routine for this purpose.
See https://github.com/facebook/folly/blob/master/folly/memory/UninitializedMemoryHacks.h
The basic idea is that each of these library implementations has a routine for doing uninitialized resizing; you just need to call it.
While hacky, you can at least console yourself with knowing that facebook's C++ code relies on this hack working properly, so they'll be updating it if new versions of these library implementations require it.
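Usage looks roughly like this, if I remember the header correctly (treat the macro and function names as assumptions and verify them against UninitializedMemoryHacks.h for your folly version):

#include <folly/memory/UninitializedMemoryHacks.h>
#include <cstddef>
#include <vector>

// Opt std::vector<char> in to the hack (macro name assumed from the folly header).
FOLLY_DECLARE_VECTOR_RESIZE_WITHOUT_INIT(char)

void fill(std::vector<char>& buf, std::size_t n) {
    folly::resizeWithoutInitialization(buf, n);  // grows size() without zeroing the new bytes
    // ... hand buf.data() / buf.size() to the third-party C functions ...
}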
If I have a variable inside a function (say, a large array), does it make sense to declare it both static and constexpr? constexpr guarantees that the array is created at compile time, so would the static be useless?
void f() {
    static constexpr int x [] = {
        // a few thousand elements
    };
    // do something with the array
}
Is the static actually doing anything there in terms of generated code or semantics?
The short answer is that not only is static useful, it is pretty well always going to be desired.
First, note that static and constexpr are completely independent of each other. static defines the object's lifetime during execution; constexpr specifies that the object should be available during compilation. Compilation and execution are disjoint and discontiguous, both in time and space. So once the program is compiled, constexpr is no longer relevant.
Every variable declared constexpr is implicitly const but const and static are almost orthogonal (except for the interaction with static const integers.)
The C++ object model (§1.9) requires that all objects other than bit-fields occupy at least one byte of memory and have addresses; furthermore all such objects observable in a program at a given moment must have distinct addresses (paragraph 6). This does not quite require the compiler to create a new array on the stack for every invocation of a function with a local non-static const array, because the compiler could take refuge in the as-if principle provided it can prove that no other such object can be observed.
That's not going to be easy to prove, unfortunately, unless the function is trivial (for example, it does not call any other function whose body is not visible within the translation unit) because arrays, more or less by definition, are addresses. So in most cases, the non-static const(expr) array will have to be recreated on the stack at every invocation, which defeats the point of being able to compute it at compile time.
On the other hand, a local static const object is shared by all observers, and furthermore may be initialized even if the function it is defined in is never called. So none of the above applies, and a compiler is free not only to generate only a single instance of it; it is free to generate a single instance of it in read-only storage.
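To make the contrast concrete, here is a small illustration I'm adding (actual code generation varies by compiler and optimization level):

void consume(const int*);   // opaque: defined in another translation unit

void local_array() {
    constexpr int lut[] = {1, 2, 3, 4};   // automatic storage: may be rebuilt
    consume(lut);                         // on the stack at every call
}

void static_array() {
    static constexpr int lut[] = {1, 2, 3, 4};  // one instance, typically in read-only data
    consume(lut);
}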
So you should definitely use static constexpr in your example.
However, there is one case where you wouldn't want to use static constexpr. Unless a constexpr declared object is either ODR-used or declared static, the compiler is free to not include it at all. That's pretty useful, because it allows the use of compile-time temporary constexpr arrays without polluting the compiled program with unnecessary bytes. In that case, you would clearly not want to use static, since static is likely to force the object to exist at runtime.
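For example (an illustrative sketch; whether storage is actually emitted is up to the compiler):

int scaled(int i) {
    constexpr int coeff[] = {2, 3, 5, 7};   // local, non-static constexpr
    return i * coeff[2];                    // typically folded to 'i * 5'; the array
                                            // usually leaves no trace in the binary
}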
In addition to the given answer, it's worth noting that the compiler is not required to initialize a constexpr variable at compile time; the practical difference is that static constexpr ensures the variable is initialized only once.
The following code demonstrates how a constexpr variable may be initialized multiple times (with the same value, though), while a static constexpr variable is certainly initialized only once.
In addition, the code compares the advantage of constexpr against const, in combination with static.
#include <iostream>
#include <string>
#include <cassert>
#include <sstream>

const short const_short = 0;
constexpr short constexpr_short = 0;

// print only the last 3 characters of the address value
const short addr_offset = 3;

// This function will print the name, value and address of the given parameter
void print_properties(std::string ref_name, const short* param, short offset)
{
    // determine initial size of strings
    std::string title = "value \\ address of ";
    const size_t ref_size = ref_name.size();
    const size_t title_size = title.size();
    assert(title_size > ref_size);

    // create title (resize)
    title.append(ref_name);
    title.append(" is ");
    title.append(title_size - ref_size, ' ');

    // extract the last 'offset' characters from the address
    std::stringstream addr;
    addr << param;
    const std::string addr_str = addr.str();
    const size_t addr_size = addr_str.size();
    assert(addr_size - offset > 0);

    // print title / ref value / address at offset
    std::cout << title << *param << " " << addr_str.substr(addr_size - offset) << std::endl;
}

// here we test initialization of a const variable (runtime)
void const_value(const short counter)
{
    static short temp = const_short;
    const short const_var = ++temp;
    print_properties("const", &const_var, addr_offset);

    if (counter)
        const_value(counter - 1);
}

// here we test initialization of a static variable (runtime)
void static_value(const short counter)
{
    static short temp = const_short;
    static short static_var = ++temp;
    print_properties("static", &static_var, addr_offset);

    if (counter)
        static_value(counter - 1);
}

// here we test initialization of a static const variable (runtime)
void static_const_value(const short counter)
{
    static short temp = const_short;
    static const short static_var = ++temp;
    print_properties("static const", &static_var, addr_offset);

    if (counter)
        static_const_value(counter - 1);
}

// here we test initialization of a constexpr variable (compile time)
void constexpr_value(const short counter)
{
    constexpr short constexpr_var = constexpr_short;
    print_properties("constexpr", &constexpr_var, addr_offset);

    if (counter)
        constexpr_value(counter - 1);
}

// here we test initialization of a static constexpr variable (compile time)
void static_constexpr_value(const short counter)
{
    static constexpr short static_constexpr_var = constexpr_short;
    print_properties("static constexpr", &static_constexpr_var, addr_offset);

    if (counter)
        static_constexpr_value(counter - 1);
}

// final test; call this method from main()
void test_static_const()
{
    constexpr short counter = 2;

    const_value(counter);
    std::cout << std::endl;

    static_value(counter);
    std::cout << std::endl;

    static_const_value(counter);
    std::cout << std::endl;

    constexpr_value(counter);
    std::cout << std::endl;

    static_constexpr_value(counter);
    std::cout << std::endl;
}
Possible program output:
value \ address of const is 1 564
value \ address of const is 2 3D4
value \ address of const is 3 244
value \ address of static is 1 C58
value \ address of static is 1 C58
value \ address of static is 1 C58
value \ address of static const is 1 C64
value \ address of static const is 1 C64
value \ address of static const is 1 C64
value \ address of constexpr is 0 564
value \ address of constexpr is 0 3D4
value \ address of constexpr is 0 244
value \ address of static constexpr is 0 EA0
value \ address of static constexpr is 0 EA0
value \ address of static constexpr is 0 EA0
As you can see, the constexpr variable is initialized multiple times (the address is not the same), while the static keyword ensures that initialization is performed only once.
Not making large arrays static, even when they're constexpr, can have a dramatic performance impact and can lead to many missed optimizations. It may slow down your code by orders of magnitude. Your variables are still local, and the compiler may decide to initialize them at runtime instead of storing them as data in the executable.
Consider the following example:
template <int N>
void foo();

void bar(int n)
{
    // array of four function pointers to void(void)
    constexpr void(*table[])(void) {
        &foo<0>,
        &foo<1>,
        &foo<2>,
        &foo<3>
    };
    // look up function pointer and call it
    table[n]();
}
You probably expect gcc-10 -O3 to compile bar() to a jmp to an address which it fetches from a table, but that is not what happens:
bar(int):
        mov     eax, OFFSET FLAT:_Z3fooILi0EEvv
        movsx   rdi, edi
        movq    xmm0, rax
        mov     eax, OFFSET FLAT:_Z3fooILi2EEvv
        movhps  xmm0, QWORD PTR .LC0[rip]
        movaps  XMMWORD PTR [rsp-40], xmm0
        movq    xmm0, rax
        movhps  xmm0, QWORD PTR .LC1[rip]
        movaps  XMMWORD PTR [rsp-24], xmm0
        jmp     [QWORD PTR [rsp-40+rdi*8]]
.LC0:
        .quad   void foo<1>()
.LC1:
        .quad   void foo<3>()
This is because GCC decides not to store table in the executable's data section, but instead initializes a local variable with its contents every time the function runs. In fact, if we remove constexpr here, the compiled binary is 100% identical.
This can easily be 10x slower than the following code:
template <int N>
void foo();

void bar(int n)
{
    static constexpr void(*table[])(void) {
        &foo<0>,
        &foo<1>,
        &foo<2>,
        &foo<3>
    };
    table[n]();
}
Our only change is that we have made table static, but the impact is enormous:
bar(int):
        movsx   rdi, edi
        jmp     [QWORD PTR bar(int)::table[0+rdi*8]]
bar(int)::table:
        .quad   void foo<0>()
        .quad   void foo<1>()
        .quad   void foo<2>()
        .quad   void foo<3>()
In conclusion, never make your lookup tables local variables, even if they're constexpr. Clang actually optimizes such lookup tables well, but other compilers don't. See Compiler Explorer for a live example.
I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    return V.m128_f32[i];
}
The error I get is as follows:
Member reference base type '__m128' is not a structure or union.
I've looked around and found that Clang (and maybe GCC) has a problem with treating __m128 as a struct or union. However I haven't managed to find a straight answer as to how I can get these values back. I've tried using the subscript operator and couldn't do that, and I've glanced around the huge list of SSE intrinsics functions and haven't yet found an appropriate one.
A union is probably the most portable way to do this:
union U {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
};

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;
    assert(i <= 3);
    u.v = V;
    return u.a[i];
}
As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:
template<unsigned i>
float vectorGetByIndex( __m128 V)
{
    // shuffle V so that the element that you want is moved to the least-
    // significant element of the vector (V[0])
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    // return the value in V[0]
    return _mm_cvtss_f32(V);
}
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions. This will inline as just a shufps (or nothing for i==0).
Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13) so no need for an if (i). https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0 which is 1 byte shorter than shufps and produces the same low element. You could manually do this with switch/case for other compilers since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
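For reference, a hand-rolled version of that idea could look like this (my sketch; it uses C++17 if constexpr, and as noted the compilers above already make this unnecessary):

template<unsigned i>
float vectorGetByIndex_manual(__m128 V)
{
    if constexpr (i == 0)
        return _mm_cvtss_f32(V);                    // low element: no shuffle needed
    else if constexpr (i == 2)
        return _mm_cvtss_f32(_mm_movehl_ps(V, V));  // movhlps xmm0,xmm0: element 2 to the bottom
    else
        return _mm_cvtss_f32(_mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i)));
}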
Note that SSE4.1 _mm_extract_epi32(V, i); is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)
The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For i!=0, otherwise it would use movss). To leave the result in an XMM register as a scalar float, shufps is good.
(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)
Use
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    union {
        __m128 v;
        float a[4];
    } converter;
    converter.v = V;
    return converter.a[i];
}
which will work regardless of the available instruction set.
Note: Even if SSE4.1 is available and i is a compile time constant, you can't use pextract etc. this way, because these instructions extract a 32-bit integer, not a float:
// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
return _mm_extract_epi32(V, i);
}
// broken code ends here
I don't delete it because it is a useful reminder how to not do things.
The way I use is
union vec { __m128 sse; float f[4]; };

float accessmember(__m128 v, int index)
{
    vec u;
    u.sse = v;          // reinterpret the __m128 through the union
    return u.f[index];
}
Seems to work out pretty well for me.
Late to this party but found that this works for me in MSVC where z is a variable of type __m128.
#define _mm_extract_f32(v, i) _mm_cvtss_f32(_mm_shuffle_ps(v, v, i))
__m128 z = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);
float f = _mm_extract_f32(z, 2);
Or even simpler (MSVC only):
__m128 z;
float f = z.m128_f32[2]; // to get the 3rd float value in the vector