Building GCC SIMD vector constants using constexpr functions (rather than literals)

Building GCC SIMD vector constants using constexpr functions (rather than literals) - c++

I'm writing a piece of code using GCC's vector extensions (__attribute__((vector_size(x)))) that needs several constant masks. These masks are simple enough to fill in sequentially, but adding them as vector literals is tedious and error prone, not to mention limiting potential changes in vector size.
Is it possible to generate the constants using a constexpr function?
I've tried generating the values like this:
using u8v = uint8_t __attribute__((vector_size(64)));
auto generate_mask = []() constexpr {
u8v ret;
for (size_t i = 0; i < sizeof(u8v); ++i) {
ret[i] = i & 0xff;
}
return ret;
};
constexpr auto mask = generate_mask();
but GCC says modification of u8v is not a constant expression.
Is there some workaround, using G++ with C++20 features?

Technically possible, but complicated to the point of being unusable. An example unrelated to SIMD.
Still, the workarounds aren’t that bad.
SIMD vectors can’t be embedded into instruction stream anyway. Doesn’t matter your code says constexpr auto mask, in reality the compiler will generate the complete vector, place the data in the read-only segment of your binary, and load from there as needed.
The exception is when your vector can be generated faster than RAM loads. A vector with all zeroes can be made without RAM access with vpxor, a vector with all bits set can be made with vpcmpeqd, and vectors with the same value in all lanes can be made with broadcast instructions. Anything else will result in full-vector load from the binary.
This means you can replace your constexpr with const and the binary will be the same. If that code is in a header file, ideally you’d also want __declspec(selectany) when building with VC++, or in your case the GCC’s equivalent is __attribute__((weak))

Related

Is generally faster do _mm_set1_ps once and reuse it or set it on the fly?

Here's two example. Set float values on a vector every time within a loop:
static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...
for (int i = 0; i < numValues; i++) {
paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v - _mm_set1_ps(0.5f)), _mm_set1_ps(2.0f));
paramKnob.v = _mm_mul_ps(paramKnob.v, _mm_set1_ps(kFMScale));
// code...
}
or do it once and "reload" the same register every time:
static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...
__m128 v05 = _mm_set1_ps(0.5f);
__m128 v2 = _mm_set1_ps(2.0f);
__m128 vFMScale = _mm_set1_ps(kFMScale);
for (int i = 0; i < numValues; i++) {
paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, v05), v2);
paramKnob.v = _mm_mul_ps(paramKnob.v, vFMScale);
// code...
}
which one is generally the best and suited approch? I'll bet on the second, but vectorization most of the time fool me.
And what if I use them as const in the whole project instead of "within a block"? "load everywhere" instead of "set everywhere" would be better?
i think its all about cache missing rather than latency/throughput used by the operations.
I'm on a windows/64 bit machine, using FLAGS += -O3 -march=nocona -funsafe-math-optimizations

In general, for best performance you want to make the loop body as concise as possible and move any code invariant to the iteration out of the loop body. Whether a given compiler is able to do it for you depends on its ability to perform this kind of optimization and on the currently enabled optimization level.
Modern gcc and clang versions are typically smart enough to convert _mm_set1_ps and similar intrinsics with constant arguments to in-memory constants, so even without hoisting this results in a fairly efficient binary code. On the other hand, MSVC is not very smart in SIMD optimizations.
As a rule of thumb I would still recommend to move constant generation (except for all-zero or all-ones constants) out of the loop body. There are several reasons to do so, even if your compiler is capable to do so on its own:
You are not relying on the compiler to do the optimization, which makes your code more portable across compilers. The code will also perform more similarly between different optimization levels, which may be useful.
Moving the constant out of the loop body may convince the compiler to pre-load the constant before entering the loop instead of referencing it in-memory inside the loop. Again, this is subject to compiler's optimization capabilities and current optimization level.
The constants can be reused in multiple places, including multiple functions. This reduces the binary size and makes your code more cache-friendly in run time. (Some linkers are able to merge equivalent constants when linking the binary, but this feature is not universally supported, it is subject to the optimization level, and in some cases makes the code non-compliant to the C/C++ standards, which can affect code correctness. For this reason, this feature is normally disabled by default, even when supported.)
If you are going to declare a constant in namespace scope, it is highly recommended to use an aligned array and/or a union to ensure static initialization. Compilers may not be able to perform static initialization if you use intrinsics to initialize the constant. You can use something like the following:
template< typename T >
union mm_constant128
{
T as_array[sizeof(__m128i) / sizeof(T)];
__m128i as_m128i;
__m128 as_m128;
__m128d as_m128d;
operator __m128i () const noexcept { return as_m128i; }
operator __m128 () const noexcept { return as_m128; }
operator __m128d () const noexcept { return as_m128d; }
};
constexpr mm_constant128< float > k05 = {{ 0.5f, 0.5f, 0.5f, 0.5f }};
void foo()
{
// Use as follows:
_mm_sub_ps(paramKnob.v, k05)
}
If you are targeting a specific compiler at a given optimization level, you can inspect the generated assembler code to see whether the compiler is doing good enough optimization on its own.

Or do this:
Declare a compile-time constant __m128 outside your function, preferably in a "constants" namespace.
inline constexpr __m128 _val = {0.5f,0.5f,0.5f,0.5f};
//Your function here
This will save you from using the _mm_set1_ps function, which does not evaluate to 1 instruction, it is a rather inefficient sequence of instructions.
Since templates are resolved at compile-time, there will be little to no performance difference between this and the other answer. However, I think this code is significantly simpler.

Bool Flags vs. Unsigned Char Flags

Disclaimer: Please correct me in the event that I make any false claims in this post.
Consider a struct that contains eight bool member variables.
/*
* Struct uses one byte for each flag.
*/
struct WithBools
{
bool f0 = true;
bool f1 = true;
bool f2 = true;
bool f3 = true;
bool f4 = true;
bool f5 = true;
bool f6 = true;
bool f7 = true;
};
The space allocated to each variable is a byte in length, which seems like a waste if the variables are used solely as flags. One solution to reduce this wasted space, as far as the variables are concerned, is to encapsulate the eight flags into a single member variable of unsigned char.
/*
* Struct uses a single byte for eight flags; retrieval and
* manipulation of data is achieved through accessor functions.
*/
struct WithoutBools
{
unsigned char getFlag(unsigned index)
{
return flags & (1 << (index % 8));
}
void toggleFlag(unsigned index)
{
flags ^= (1 << (index % 8));
}
private:
unsigned char flags = 0xFF;
};
The flags are retrieved and manipulated via. bitwise operators, and the struct provides an interface for the user to retrieve and manipulate the flags. While flag sizes have been reduced, we now have the two additional methods that add to the size of the struct. I do not know how to benchmark this difference, therefore I could not be certain of any fluctuation between the above structs.
My questions are:
1) Would the difference in space between these two structs be negligible?
2) Generally, is this approach of "optimising" a collection of bools by compacting them into a single byte a good idea? Either in an embedded systems context or otherwise.
3) Would a C++ compiler make such an optimisation that compacts a collection of bools wherever possible and appropriate.

we now have the two additional methods that add to the size of the
struct
Methods are code and do not increase the size of the struct. Only data makes up size on the structure.
3) Would a C++ compiler make such an optimisation that compacts a
collection of bools wherever possible and appropriate.
That is a sound resounding no. The compiler is not allowed to change data types.
1) Would the difference in space between these two structs be
negligible?
No, there definitely is a size difference between the two approaches.
2) Generally, is this approach of "optimising" a collection of bools
by compacting them into a single byte a good idea? Either in an
embedded systems context or otherwise.
Generally yes, the idiomatic way to model flags is with bit-wise manipulation inside an unsigned integer. Depending on the number of flags needed you can use std::uint8_t, std::uint16_t and so on.
However the most common way to model this is not via index as you've done, but via masks.

Would the difference in space between these two structs be negligible?
That depends on how many values you are storing and how much space you have to store them in. The size difference is 1 to 8.
Generally, is this approach of "optimising" a collection of bools by compacting them into a single byte a good idea? Either in an embedded systems context or otherwise.
Again, it depends on how many values and how much space. Also note that dealing with bits instead of bytes increases code size and execution time.
Many embedded systems have relatively little RAM and plenty of Flash. Code is stored in Flash, so the increased code size can be ignored, and the saved memory could be important on small RAM systems.
Would a C++ compiler make such an optimisation that compacts a collection of bools wherever possible and appropriate.
Hypothetically it could. I would consider that an aggressive space optimization, at the expense of execution time.
STL has a specialization for vector<bool> that I frequently avoid for performance reasons - vector<char> is much faster.

Authoritative "correct" way to avoid signed-unsigned warnings when testing a loop variable against size_t

The code below generates a compiler warning:
private void test()
{
byte buffer[100];
for (int i = 0; i < sizeof(buffer); ++i)
{
buffer[i] = 0;
}
}
warning: comparison between signed and unsigned integer expressions
[-Wsign-compare]
This is because sizeof() returns a size_t, which is unsigned.
I have seen a number of suggestions for how to deal with this, but none with a preponderance of support and none with any convincing logic nor any references to support one approach as clearly "better." The most common suggestions seem to be:
ignore the warnings
turn off the warnings
use a loop variable of type size_t
use a loop variable of type size_t with tricks to avoid decrementing past zero
cast size_of(buffer) to an int
some extremely convoluted suggestions that I did not have the patience to follow because they involved unreadable code, generally involving vectors and/or iterators
libraries that I cannot load in the AVR / ARM embedded environments I often use.
free functions returning a valid int or long representing the byte count of T
Don't use loops (gotta love that advice)
Is there a "correct" way to approach this?
-- Begin Edit --
The example I gave is, of course, trivial, and meant only to demonstrate the type mismatch warning that can occur in an indexing situation.
#3 is not necessarily the obviously correct answer because size_t carries special risks in a decrementing loop such as
for (size_t i = myArray.size; i > 0; --i)
(the array may someday have a size of zero).
#4 is a suggestion to deal with decrementing size_t indexes by including appropriate and necessary checks to avoid ever decrementing past zero. Since that makes the code harder to read, there are some cute shortcuts that are not particularly readable, hence my referring to them as "tricks."
#7 is a suggestion to use libraries that are not generalizable in the sense that they may not be available or appropriate in every setting.
#8 is a suggestion to keep the checks readable, but to hide them in a non-member method, sometimes referred to as a "free function."
#9 is a suggestion to use algorithms rather than loops. This was offered many times as a solution to the size_t indexing problem, and there were a lot of upvotes. I include it even though I can't use the stl library in most of my environments and would have to write the code myself.
-- End Edit--
I am hoping for evidence-based guidance or references as to best practices for handling something like this. Is there a "standard text" or a style guide somewhere that addresses the question? A defined approach that has been adopted/endorsed internally by a major tech company? An emulatable solution forthcoming in a new language release? If necessary, I would be satisfied with an unsupported public recommendation from a single widely recognized expert.
None of the options on offer seem very appealing. The warnings drown out other things I want to see. I don't want to miss signed/unsigned comparisons in places where it might matter. Decrementing a loop variable of type size_t with comparison >=0 results in an infinite loop from unsigned integer wraparound, and even if we protect against that with something like for (size_t i = sizeof(buffer); i-->0 ;), there are other issues with incrementing/decrementing/comparing to size_t variables. Testing against size_t - 1 will yield a large positive 'oops' number when size_t is unexpectedly zero (e.g. strlen(myEmptyString)). Casting an unsigned size_t to an integer is a container size problem (not guaranteed a value) and of course size_t could potentially be bigger than an int.
Given that my arrays are of known sizes well below Int_Max, it seems to me that casting size_t to a signed integer is the best of the bunch, but it makes me cringe a little bit. Especially if it has to be static_cast<int>. Easier to take if it's hidden in a function call with some size testing, but still...
Or perhaps there's a way to turn off the warnings, but just for loop comparisons?

I find any of the three following approaches equally good.
Use a variable of type int to store the size and compare the loop variable to it.
byte buffer[100];
int size = sizeof(buffer);
for (int i = 0; i < size; ++i)
{
buffer[i] = 0;
}
Use size_t as the type of the loop variable.
byte buffer[100];
for (size_t i = 0; i < sizeof(buffer); ++i)
{
buffer[i] = 0;
}
Use a pointer.
byte buffer[100];
byte* end = buffer + sizeof(buffer)
for (byte* p = buffer; p < end; ++p)
{
*p = 0;
}
If you are able to use a C++11 compiler, you can also use a range for loop.
byte buffer[100];
for (byte& b : buffer)
{
b = 0;
}

The most appropriate solution will depend entirely on context. In the context of the code fragment in your question the most appropriate action is perhaps to have type-agreement - the third option in your bullet list. This is appropriate in this case because the usage of i throughout the code is only to index the array - in this case the use of int is inappropriate - or at least unnecessary.
On the other hand if i were an arithmetic object involved in some arithmetic expression that was itself signed, the int might be appropriate and a cast would be in order.
I would suggest that as a guideline, a solution that involves the fewest number of necessary type casts (explicit of implicit) is appropriate, or to look at it another way, the maximum possible type agreement. There is not one "authoritative" rule because the purpose and usage of the variables involved is semantically rather then syntactically dependent. In this case also as has been pointed out in other answers, newer language features supporting iteration may avoid this specific issue altogether.
To discuss the advice you say you have been given specifically:
ignore the warnings
Never a good idea - some will be genuine semantic errors or maintenance issues, and by teh time you have several hundred warnings you are ignoring, how will you spot the one warning that is and issue?
turn off the warnings
An even worse idea; the compiler is helping you to improve your code quality and reliability. Why would you disable that?
use a loop variable of type size_t
In this precise example, that is exactly why you should do; exact type agreement should always be the aim.
use a loop variable of type size_t with tricks to avoid decrementing past zero
This advice is irrelevant for the trivial example given. Moreover I presume that by "tricks" the adviser in fact means checks or just correct code. There is no need for "tricks" and the term is entirely ambiguous - who knows what the adviser means? It suggests something unconventional and a bit "dirty", when there is not need for any solution with such attributes.
cast size_of(buffer) to an int
This may be necessary if the usage of i warrants the use of int for correct semantics elsewhere in the code. The example in the question does not, so this would not be an appropriate solution in this case. Essentially if making i a size_t here causes type agreement warnings elsewhere that cannot themselves be resolved by universal type agreement for all operands in an expression, then a cast may be appropriate. The aim should be to achieve zero warnings an minimum type casts.
some extremely convoluted suggestions that I did not have the patience to follow, generally involving vectors and/or iterators
If you are not prepared to elaborate or even consider such advice, you'd have better omitted the "advice" from your question. The use of STL containers in any case is not always appropriate to a large segment of embedded targets in any case, excessive code size increase and non-deterministic heap management are reasons to avoid on many platforms and applications.
libraries that I cannot load in an embedded environment.
Not all embedded environments have equal constraints. The restriction is on your embedded environment, not by any means all embedded environments. However the "loading of libraries" to resolve or avoid type agreement issues seems like a sledgehammer to crack a nut.
free functions returning a valid int or long representing the byte count of T
It is not clear what that means. What id a "free function"? Is that just a non-member function? Such a function would internally necessarily have a type case, so what have you achieved other than hiding a type cast?
Don't use loops (gotta love that advice).
I doubt you needed to include that advice in your list. The problem is not in any case limited to loops; it is not because you are using a loop that you have the warning, it is because you have used < with mismatched types.

My favorite solution is to use C++11 or newer and skip the whole manual size bounding entirely like so:
// assuming byte is defined by something like using byte = std::uint8_t;
void test()
{
byte buffer[100];
for (auto&& b: buffer)
{
b = 0;
}
}
Alternatively, if I can't use the ranged-based for loop (but still can use C++11 or newer), my favorite syntax becomes:
void test()
{
byte buffer[100];
for (auto i = decltype(sizeof(buffer)){0}; i < sizeof(buffer); ++i)
{
buffer[i] = 0;
}
}
Or for iterating backwards:
void test()
{
byte buffer[100];
// relies on the defined modwrap semantics behavior for unsigned integers
for (auto i = sizeof(buffer) - 1; i < sizeof(buffer); --i)
{
buffer[i] = 0;
}
}

The correct generic way is to use a loop iterator of type size_t. Simply because the is the most correct type to use for describing an array size.
There is not much need for "tricks to avoid decrementing past zero", because the size of an object can never be negative.
If you find yourself needing negative numbers to describe a variable size, it is probably because you have some special case where you are iterating across an array backwards. If so, the "trick" to deal with it is this:
for(size_t i=0; i<sizeof(array); i++)
{
size_t index = sizeof(array)-1 - i;
array[index] = something;
}
However, size_t is often an inconvenient type to use in embedded systems, because it may end up as a larger type than what your MCU can handle with one instruction, resulting in needlessly inefficient code. It may then be better to use a fixed width integer such as uint16_t, if you know the maximum size of the array in advance.
Using plain int in an embedded system is almost certainly incorrect practice. Your variables must be of deterministic size and signedness - most variables in an embedded system are unsigned. Signed variables also lead to major problems whenever you need to use bitwise operators.

If you are able to use C++ 11, you could use decltype to obtain the actual type of what sizeof returns, for instance:
void test()
{
byte buffer[100];
// On macOS decltype(sizeof(buffer)) returns unsigned long, this passes
// the compiler without warnings.
for (decltype(sizeof(buffer)) i = 0; i < sizeof(buffer); ++i)
{
buffer[i] = 0;
}
}

How to improve compilation time for a gigantic boolean expression?

I need to perform a rather complex check over a vector and I have to repeat it thousands and millions of times. To make it more efficient, I translate given formula into C++ source code and compile it in heavily-optimized binary, which I call in my code. The formula is always purely Boolean: only &&, || and ! used. Typical source code looks like this:
#include <assert.h>
#include <vector>
using DataType = std::vector<bool>;
static const char T = 1;
static const char F = 0;
const std::size_t maxidx = 300;
extern "C" bool check (const DataType& l);
bool check (const DataType& l) {
assert (l.size() == maxidx);
return (l[0] && l[1] && l[2]) || (l[3] && l[4] && l[5]); //etc, very large line with && and || everywhere
}
I compile it as follows:
g++ -std=c++11 -Ofast -march=native -fpic -c check.cpp
Performance of the resulting binary is crucial.
It worked perfectly util recent test case with the large number of variables (300, as you can see above). With this test case, g++ consumes more than 100 GB of memory and freezes forever.
My question is pretty straightforward: how can I simplify that code for the compiler? Should I use some additional variables, get rid of vector or something else?
EDIT1: Ok, here is the screenshot from top utility.
cc1plus is busy with my code. The check function depends on 584 variables (sorry for a imprecise number in the example above) and it contains 450'000 expressions.
I would agree with #akakatak's comment below. It seems that g++ performs something O(N^2).

The obvious optimization here is to toss out the vector and use a bit-field, based on the fastest possible integer type:
uint_fast8_t item [n];
You could write this as
#define ITEM_BYTES(items) ((items) / sizeof(uint_fast8_t))
#define ITEM_SIZE(items) ( ITEM_BYTES(items) / CHAR_BIT + (ITEM_BYTES(items)%CHAR_BIT!=0) )
...
uint_fast8_t item [ITEM_SIZE(n)];
Now you have a chunk of memory with n segments, where each segment is the ideal size for your CPU. In each such segment, set bits to 1=true or 0=false, using bitwise operators.
Depending on how you want to optimize, you would group the bits in different ways. I would suggest storing 3 bits of data in every segment, since you always wish to check 3 adjacent boolean numbers. This mean that "n" in the above example will be the total number of booleans divided by 3.
You can then simply iterate through the array like:
bool items_ok ()
{
for(size_t i=0; i<n; i++)
{
if( (item[i] & 0x7u) == 0x7u )
{
return true;
}
}
return false;
}
With the above method you optimize:
The data size in which comparisons are made, and with it possible alignment issues.
The overall memory use.
The number of branches needed for the comparisons.
This also rules out any risks of ineffectiveness caused by the usual C++ meta programming. I would never trust std::vector, std::array or std::bitfield to produce optimal code.
Once you have the above working you can always test if std::bitfield etc containers yields the very same, effective machine code. If you find that they spawned any form of unrelated madness in your machine code, then don't use them.

It's a necro-posting a little bit, but I still should share my results.
The solution proposed by Thilo in comments above is the best. It's very simple and it provides measurable compile time improvement. Just split your expression into chunks of the same size. But, in my experience, you have to choose an appropriate sub expression length carefully - you can encounter significant execution performance drop in case of large number of sub expressions; a compiler will not be able to optimize the whole expression perfectly.

an optimized memcpy for small, or fixed size data in gcc

I use memcpy to copy both variable sizes of data and fixed sized data. In some cases I copy small amounts of memory (only a handful of bytes). In GCC I recall that memcpy used to be an intrinsic/builtin. Profiling my code however (with valgrind) I see thousands of calls to the actual "memcpy" function in glibc.
What conditions have to be met to use the builtin function? I can roll my own memcpy quickly, but I'm sure the builtin is more efficient than what I can do.
NOTE: In most cases the amount of data to be copied is available as a compile-time constant.
CXXFLAGS: -O3 -DNDEBUG
The code I'm using now, forcing builtins, if you take off the _builtin prefix the builtin is not used. This is called from various other templates/functions using T=sizeof(type). The sizes that get used are 1, 2, multiples of 4, a few 50-100 byte sizes, and some larger structures.
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
__builtin_memcpy( address, data + at, T );
at += T;
}

For the cases where T is small, I'd specialise and use a native assignment.
For example, where T is 1, just assign a single char.
If you know the addresses are aligned, use and appropriately sized int type for your platform.
If the addresses are not aligned, you might be better off doing the appropriate number of char assignments.
The point of this is to avoid a branch and keeping a counter.
Where T is big, I'd be surprised if you do better than the library memcpy(), and the function call overhead is probably going to be lost in the noise. If you do want to optimise, look around at the memcpy() implementations around. There are variants that use extended instructions, etc.
Update:
Looking at your actual(!) question about inlining memcpy, questions like compiler versions and platform become relevant. Out of curiosity, have you tried using std::copy, something like this:
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
std::copy(at, at + T, static_cast<char*>(address));
at += T;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js