Why don't C++ compilers do better constant folding?

I'm investigating ways to speed up a large section of C++ code, which has automatic derivatives for computing jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the jacobians.
This surprised me, since most of the jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x the function, not 10-12x. In order to model what a large amount of the jacobian work is like, I made a super minimal example with just a dot product (instead of sin, cos, sqrt and more that would be in a real situation) that the compiler should be able to optimize to a single return value:
#include <Eigen/Core>
#include <Eigen/Geometry>
using Array12d = Eigen::Matrix<double,12,1>;
double testReturnFirstDot(const Array12d& b)
{
Array12d a;
a.array() = 0.;
a(0) = 1.;
return a.dot(b);
}
Which should be the same as
double testReturnFirst(const Array12d& b)
{
return b(0);
}
I was disappointed to find that, without fast-math enabled, none of GCC 8.2, Clang 6, or MSVC 19 was able to make any optimizations at all over the naive dot product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy) the optimizations are very poor in GCC and Clang (they still involve multiplications and additions), and MSVC doesn't do any optimizations at all.
I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations being able to do better constant propagation/folding would make more optimizations apparent, even if the constant-fold itself didn't result in a speedup.
While I'm interested in explanations for why this isn't done on the compiler side, I'm also interested for what I can do on a practical side to make my own code faster when facing these kinds of patterns.

This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd, and 1 horizontal reduction within the remaining 4-component register (this assumes AVX; with SSE only you'll get 6 mulpd and 5 addpd). With -ffast-math, GCC and Clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.
So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH) but other pieces of code might become much slower.
If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internals, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:
static const int DontVectorize = 0x80000000;
namespace Eigen {
namespace internal {
template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
: traits<Matrix<_Scalar, _Rows, _Cols> >
{
typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
enum {
EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
};
};
}
}
using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;
Full example here: https://godbolt.org/z/bOEyzv
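For reference, a minimal sketch (mine, not part of the answer) of the question's test function rewritten with this opted-out type; per the linked example, compilers should then fold it down to just returning b(0):
double testReturnFirstDotScalar(const ArrayS12d& b)
{
ArrayS12d a;
a.array() = 0.;
a(0) = 1.;
return a.dot(b); // with PacketAccessBit cleared, this should constant-fold to b(0)
}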

I was disappointed to find that, without fast-math enabled, none of GCC 8.2, Clang 6, or MSVC 19 was able to make any optimizations at all over the naive dot product with a matrix full of 0s.
They have no other choice unfortunately. Since IEEE floats have signed zeros, adding 0.0 is not an identity operation:
-0.0 + 0.0 = 0.0 // Not -0.0!
Similarly, multiplying by zero does not always yield zero:
0.0 * Infinity = NaN // Not 0.0!
So the compilers simply cannot perform these constant folds in the dot product while retaining IEEE float compliance - for all they know, your input might contain signed zeros and/or infinities.
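A quick standalone check of those two identities (my snippet, not from the answer):
#include <cstdio>
#include <limits>
int main()
{
double negzero = -0.0;
double inf = std::numeric_limits<double>::infinity();
std::printf("%g\n", negzero + 0.0); // prints 0, not -0: adding 0.0 did not preserve the -0.0
std::printf("%g\n", 0.0 * inf); // prints nan: multiplying by zero did not yield zero
return 0;
}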
You will have to use -ffast-math to get these folds, but that may have undesired consequences. You can get more fine-grained control with specific flags (from http://gcc.gnu.org/wiki/FloatingPointMath). According to the above explanation, adding the following two flags should allow the constant folding:
-ffinite-math-only, -fno-signed-zeros
Indeed, you get the same assembly as with -ffast-math this way: https://godbolt.org/z/vGULLA. You only give up the signed zeros (probably irrelevant), NaNs and the infinities. Presumably, if you were to still produce them in your code, you would get undefined behavior, so weigh your options.
As for why your example is not optimized better even with -ffast-math: That is on Eigen. Presumably they have vectorization on their matrix operations, which are much harder for compilers to see through. A simple loop is properly optimized with these options: https://godbolt.org/z/OppEhY
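For illustration, a plain loop of the kind meant here (a sketch; the exact code is at the godbolt link), which the compilers should reduce to just loading b[0] under -ffinite-math-only -fno-signed-zeros:
double dotFirstOne(const double* b)
{
double a[12] = {1.0}; // {1, 0, 0, ...}
double sum = 0.0;
for (int i = 0; i < 12; ++i)
sum += a[i] * b[i];
return sum;
}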

One way to force a compiler to optimize multiplications by 0s and 1s is to manually unroll the loop. For simplicity let's use
#include <array>
#include <cstddef>
constexpr std::size_t n = 12;
using Array = std::array<double, n>;
Then we can implement a simple dot function using fold expressions (or recursion if they are not available):
#include <utility>
template<std::size_t... is>
double dot(const Array& x, const Array& y, std::index_sequence<is...>)
{
return ((x[is] * y[is]) + ...);
}
double dot(const Array& x, const Array& y)
{
return dot(x, y, std::make_index_sequence<n>{});
}
Now let's take a look at your function
double test(const Array& b)
{
const Array a{1}; // = {1, 0, ...}
return dot(a, b);
}
With -ffast-math gcc 8.2 produces:
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
ret
clang 6.0.0 goes along the same lines:
test(std::array<double, 12ul> const&): # #test(std::array<double, 12ul> const&)
movsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
ret
The same folding works for other constant patterns. For example, for
double test(const Array& b)
{
const Array a{1, 1}; // = {1, 1, 0...}
return dot(a, b);
}
we get
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
ret
Addendum: Clang unrolls a for (std::size_t i = 0; i < n; ++i) ... loop without all these fold-expression tricks; gcc doesn't and needs some help.
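For reference, that plain-loop version (a sketch using the Array and n defined above) would be:
double dot_loop(const Array& x, const Array& y)
{
double r = 0.0;
for (std::size_t i = 0; i < n; ++i)
r += x[i] * y[i]; // clang fully unrolls and folds this; gcc needs the fold-expression form
return r;
}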

Related

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array.
On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX256. When I use clang's bool vector extension in combination with __builtin_convertvector, I'm happy with the results:
#include <cstdint>
#include <memory> // std::assume_aligned (C++20)
uint64_t handvectorized(const bool* input_) noexcept {
const bool* __restrict input = std::assume_aligned<64>(input_);
using VecBool64 __attribute__((vector_size(64))) = char;
using VecBitmaskT __attribute__((ext_vector_type(64))) = bool;
auto input_vec = *reinterpret_cast<const VecBool64*>(input);
auto result_vec = __builtin_convertvector(input_vec, VecBitmaskT);
return reinterpret_cast<uint64_t&>(result_vec);
}
produces (godbolt):
vmovdqa64 zmm0, zmmword ptr [rdi]
vptestmb k0, zmm0, zmm0
kmovq rax, k0
vzeroupper
ret
However, I cannot get clang (or GCC, or ICX) to produce anything that uses vector mask extraction like this from (portable) scalar code.
For this implementation:
uint64_t loop(const bool* input_) noexcept {
const bool* __restrict input = std::assume_aligned<64>(input_);
uint64_t result = 0;
for(size_t i = 0; i < 64; ++i) {
if(input[i]) {
result |= 1ull << i;
}
}
return result;
}
clang produces a 64*8B = 512B lookup table and 39 instructions.
This implementation, and some other scalar implementations (branchless, inverse bit order, using std::bitset) that I've tried, can all be found on godbolt. None of them results in code close to the handwritten vector instructions.
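For illustration, one branchless variant of the kind mentioned (a sketch assuming the same headers as the loop above; the exact versions tried are on the linked godbolt):
uint64_t branchless(const bool* input_) noexcept {
const bool* __restrict input = std::assume_aligned<64>(input_);
uint64_t result = 0;
for(size_t i = 0; i < 64; ++i) {
result |= static_cast<uint64_t>(input[i]) << i; // OR in a 0 or 1 bit, no branch
}
return result;
}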
Is there anything I'm missing or any reason the optimization doesn't work well here? Is there a scalar version I can write that produces reasonably vectorized code?
I'm wondering especially since the "handvectorized" version doesn't use any platform-specific intrinsics, and doesn't really have much programming to it. All it does is "load as vector" and "convert to bitmask". Perhaps clang simply doesn't detect the loop pattern? It just feels strange to me, a simple bitwise OR reduction loop feels like a common pattern, and the documentation of the loop vectorizer explicitly lists reductions using OR as a supported feature.
Edit: Updated godbolt link with suggestions from the comments

Optimal implementation of iterative Kahan summation

Intro
Kahan summation (compensated summation) is a technique that compensates for the fact that floating-point addition is not associative: truncation error means (a+b)+c is not exactly equal to a+(b+c), so long series of sums accumulate an undesired relative error, which is a common obstacle in scientific computing.
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N))
Kahan summation [g++]: Compensated summation using the C/C++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).
Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__(
"csumasm:\n"
"movsd (%rcx), %xmm0\n" //xmm0 = a
"subsd (%r8), %xmm1\n" //xmm1 - r8 (c) | y = b-c
"movapd %xmm0, %xmm2\n"
"addsd %xmm1, %xmm2\n" //xmm2 + xmm1 (y) | b = a+y
"movapd %xmm2, %xmm3\n"
"subsd %xmm0, %xmm3\n" //xmm3 - xmm0 (a) | b - a
"movapd %xmm3, %xmm0\n"
"subsd %xmm1, %xmm0\n" //xmm0 - xmm1 (y) | - y
"movsd %xmm0, (%r8)\n" //xmm0 to c
"movsd %xmm2, (%rcx)\n" //b to a
"ret\n"
);
void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
double y = b-c; //y is the correction of b argument
b = a+y; //add corrected b argument to a argument. The output of the current summation
c = (b-a)-y; //find new error to be passed as a compensation term
a = b;
}
double fun(double fMin, double fMax){
double f = (double)rand()/RAND_MAX;
return fMin + f*(fMax - fMin); //returns random value
}
int main(int argc, char** argv) {
int N = 1000;
srand(0); //use 0 seed for each method
double sum1 = 0;
for (int n = 0; n < N; ++n)
sum1 += fun(0,1);
srand(0);
double sum2 = 0;
double c = 0; //compensation term
for (int n = 0; n < N; ++n)
csum(sum2,fun(0,1),c);
srand(0);
double sum3 = 0;
c = 0;
for (int n = 0; n < N; ++n)
csumasm(sum3,fun(0,1),c);
printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
return 0;
}
The output with -O3 is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
5.1991955320902127e+002
The output with -O3 -ffast-math
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?
If no better asm codes are possible, is there a c++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a c++ implementation is generally more flexible for the compiler to optimize.
Ideas or suggestions are appreciated.
Further information
The contents of "fun" cannot be inlined, but the "csum" function could be.
The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
The intended summation function is called indefinitely and several hundred million times per second, which motivates the pursuit of a high-performance low-level implementation.
Higher-precision arithmetic such as long double, float128, or arbitrary-precision libraries is not to be considered, for performance reasons.
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
subsd xmm0, QWORD PTR [rsp+32]
movapd xmm1, xmm3
addsd xmm3, xmm0
movsd QWORD PTR [rsp+16], xmm3
subsd xmm3, xmm1
movapd xmm1, xmm3
subsd xmm1, xmm0
movsd QWORD PTR [rsp+32], xmm1
You can put functions that need to not use -ffast-math (like a csum loop) in a separate file that gets compiled without -ffast-math.
Possibly you could also use __attribute__((optimize("no-fast-math"))), but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.
update: apparently part of the question was based on a misunderstanding that -O3 wasn't safe, or something? It is; ISO C++ specifies FP math rules that are like GCC's -fno-fast-math. Compiling everything with just -O3 apparently makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP to get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math.
ICC defaults to fast-math, so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings (except -Ofast = -O3 -ffast-math).
You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).
e.g. use SSE2 __m128d for 2 packed additions per instruction. Or AVX __m256d. On modern x86, addpd / subpd have the same performance as addsd and subsd (1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).
So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.
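A minimal sketch of such a loop with AVX intrinsics (the helper name is mine, not from this answer; the file containing it must be compiled without -ffast-math, and n is assumed to be a multiple of 4):
#include <immintrin.h>
#include <cstddef>
double kahan_sum_avx(const double* data, std::size_t n)
{
__m256d sum = _mm256_setzero_pd();
__m256d comp = _mm256_setzero_pd();
for (std::size_t i = 0; i < n; i += 4) {
__m256d x = _mm256_loadu_pd(data + i);
__m256d y = _mm256_sub_pd(x, comp); // y = x - c
__m256d t = _mm256_add_pd(sum, y); // t = sum + y
comp = _mm256_sub_pd(_mm256_sub_pd(t, sum), y); // c = (t - sum) - y
sum = t; // four independent compensated sums, one per lane
}
alignas(32) double lane[4];
_mm256_store_pd(lane, sum);
return lane[0] + lane[1] + lane[2] + lane[3]; // a final compensated pass over the lanes would be more careful
}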
Generating random numbers on the fly with your fun() is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.
If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.
Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)
Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.
And BTW, clang managed to implement csum(double&, double, double&): with only two movapd instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output). https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid any.
And BTW, movaps is 1 byte smaller and should be used instead. No CPUs have had separate data domains / forwarding networks for double vs. float, just vec-FP vs. vec-int (vs. GP integer)
But really, by far your best bet is to get GCC to compile a file or function without -ffast-math. https://gcc.gnu.org/wiki/DontUseInlineAsm. That lets the compiler avoid the movaps instructions when AVX is available, besides letting it optimize better when unrolling.
If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum in a separate file. (Hopefully link-time optimization respects -fno-fast-math for one file, perhaps by not inlining that function.)
But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.
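A minimal sketch of that split (the file names and helper are hypothetical):
// kahan.cpp -- compiled with -O3 but without -ffast-math
#include <cstddef>
double kahan_sum(const double* data, std::size_t n)
{
double sum = 0.0, c = 0.0;
for (std::size_t i = 0; i < n; ++i) {
double y = data[i] - c;
double t = sum + y;
c = (t - sum) - y; // survives because this translation unit is not fast-math
sum = t;
}
return sum;
}
// The rest of the program can be built with -O3 -ffast-math and linked against kahan.o;
// with LTO, check that this function isn't re-inlined into fast-math code.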
Ideally compile all of your code with -O3 -march=native, and profile-guided optimization. Also -flto link-time optimization to enable cross-file inlining.
It's not surprising that -ffast-math breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math like -fno-math-errno and -fno-trapping-math so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno after calling sqrt so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily. GCC's -ftrapping-math is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any) so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation, it only tells GCC that the number of exceptions isn't a visible side-effect.
Or maybe try -ffast-math -fno-associative-math for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.
Another way to get optimizations that normally require fast-math is #pragma omp simd to enable auto-vectorization with OpenMP even in files compiled without auto-vectorization. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.
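A sketch of that approach (assuming GCC or clang with -fopenmp or -fopenmp-simd, and no -ffast-math):
#include <cstddef>
double sum_reduce(const double* data, std::size_t n)
{
double s = 0.0;
#pragma omp simd reduction(+:s)
for (std::size_t i = 0; i < n; ++i)
s += data[i]; // the reduction clause lets the compiler reorder these additions and vectorize
return s;
}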

Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky, and the generated code (g++ -O3, version 9.1) is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?
For what it's worth, clang -O3, version 8.0.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1, never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs (see footnote 1 below).
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
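To make that concrete, here is a sketch of a full adder in plain C++ (a hypothetical helper, not from the question or this answer); two comparisons are needed precisely because either addition can carry:
unsigned add_carry(unsigned a, unsigned b, unsigned carry_in, unsigned* carry_out)
{
unsigned s1 = a + b;
unsigned c1 = s1 < a; // carry out of the first addition
unsigned s2 = s1 + carry_in;
unsigned c2 = s2 < s1; // carry out of adding the carry-in (e.g. 0xff...ff + 1 wrapping to 0)
*carry_out = c1 | c2; // at most one of the two can be set
return s2;
}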
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
Footnote 1: Regarding uop count: equal on Intel Haswell and earlier, where adc is 2 uops, except with a zero immediate, which Sandybridge-family's decoders special-case as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/

Run-time overhead with boost.units?

I'm seeing some 10% run-time overhead when using a clone of a constexpr enhanced boost.units with the float value type using clang and -O3 level optimization. This is showing up with some of the more elaborate applications of a library that I've been working on. Given this situation, I have two questions that I'd really like to solve and would love help with:
Boost units is supposed to be a zero-overhead library so why am I seeing the overhead?
More importantly, besides not using boost.units, how can I get the overhead to go away?
Details...
I've been working on an interactive physics engine written in C++14. With the many different physical quantities and units it uses, I love using the compile-time enforced units and quantities that boost.units provides. Unfortunately enabling boost units seems to be coming with this run-time cost. The engine comes with a benchmark application that uses google's benchmark library to provide this insight and it takes some of the more elaborate simulations to see the overhead.
At present, due to the overhead, the engine builds by default without using boost units. By defining the right preprocessor macro name, the engine can be built with boost units. I achieved this switching using code like the following:
// #define USE_BOOST_UNITS
#if defined(USE_BOOST_UNITS)
...
#include <boost/units/systems/si/time.hpp>
...
#endif // defined(USE_BOOST_UNITS)
#if defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) boost::units::quantity<BoostDimension, float>
#define UNIT(Quantity, BoostUnit) Quantity{BoostUnit * float{1}}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) Quantity{BoostUnit * float{Ratio}}
#else // defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) float
#define UNIT(Quantity, BoostUnit) float{1}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) float{Ratio}
#endif // defined(USE_BOOST_UNITS)
using Time = QUANTITY(boost::units::si::time);
constexpr auto Second = UNIT(Time, boost::units::si::second);
What I did with the UNIT macro feels a bit suspect to me in that it's taking a boost unit type and turning it into a value. That makes switching between using or not using boost units easier however since either way expressions like 3.0f * Second compile without warning. Checking what clang and gcc do with expressions like these appeared to confirm that they were smart enough to avoid run-time multiplying 3.0f * 1.0f and just recognized the expression as 3.0f. I wonder anyway if that's the cause of the overhead or if it's something else that I've done.
I've also wondered if maybe the problem is rooted in the constexpr enhancement code I'm using, or if the author(s) of that code had any idea about this overhead. On searching the internet, I found a mention of overhead with the normal boost.units library, so it seems safe to assume the enhanced units are not at fault. A suggestion that came out of my inquiring though (and my thanks go to GitHub user muggenhor for it) was the following:
I expect this is likely caused by the amount of inlining done by the compiler. Because of the wrapper functions for the operators this adds at least one function call that needs to be inlined per operation. For expressions depending on the result of sub-expressions this requires the sub-expressions to be inlined first. As a result I expect the minimum amount of inlining passes to be able to properly optimize your code to be equal to the depth of the produced expression tree...
This sounds like a pretty viable theory to me. Unfortunately, I don't know how to test it and admittedly I'm more fond of digging into my own code at the moment than into clang/LLVM code. I've tried using -inline-threshold=10000 but that doesn't seem to make the overhead go away. To my understanding of clang at least, I don't believe that specifically increases the number of inlining passes. Is there another command line argument that does? Or are there parameters within clang's sources that someone can point me to looking at as a starting point to maybe recompiling clang and trying the modified compiler?
Another theory I've had is whether using float is the problem. I can rebuild my physics engine to use double instead and compare benchmark results between building with and without the boost units support enabled. What I find when using double is that the overhead at least seems to decrease. I've wondered if maybe boost units is somewhere using double even when I use float in its quantity template and maybe that's causing the overhead.
Lastly, I built boost unit's performance example with the constexpr enhancements and ran it with both double and float. Got no reliable sign of any overhead which seems to eliminate my theory of float being the problem.
Update With Data & Code
Got some more isolated data and code on this where it seems I'm seeing significantly more than 10% overhead...
Some benchmark data where Length is basically boost::units::si::length:
LesserLength/1000 953 ns 953 ns 724870
LesserFloat/1000 590 ns 590 ns 1093647
LesserDouble/1000 619 ns 618 ns 1198938
What the related code looks like:
static void LesserLength(benchmark::State& state)
{
const auto vals = RandPairs(static_cast<unsigned>(state.range()),
-100.0f * playrho::Meter, 100.0f * playrho::Meter);
auto c = 0.0f * playrho::Meter;
for (auto _: state)
{
for (const auto& val: vals)
{
const auto a = std::get<0>(val);
const auto b = std::get<1>(val);
static_assert(std::is_same<decltype(b), const playrho::Length>::value, "not Length");
const auto v = (a < b)? a: b;
benchmark::DoNotOptimize(c = v);
}
}
}
static void LesserFloat(benchmark::State& state)
{
const auto vals = RandPairs(static_cast<unsigned>(state.range()),
-100.0f, 100.0f);
auto c = 0.0f;
for (auto _: state)
{
for (const auto& val: vals)
{
const auto a = std::get<0>(val);
const auto b = std::get<1>(val);
const auto v = (a < b)? a: b;
static_assert(std::is_same<decltype(v), const float>::value, "not float");
benchmark::DoNotOptimize(c = v);
}
}
}
static void LesserDouble(benchmark::State& state)
{
const auto vals = RandPairs(static_cast<unsigned>(state.range()),
-100.0, 100.0);
auto c = 0.0;
for (auto _: state)
{
for (const auto& val: vals)
{
const auto a = std::get<0>(val);
const auto b = std::get<1>(val);
const auto v = (a < b)? a: b;
static_assert(std::is_same<decltype(v), const double>::value, "not double");
benchmark::DoNotOptimize(c = v);
}
}
}
With this as a hint to me, I checked Godbolt with the following code to see what clang 5.0.0 and gcc 7.2 would generate:
#include <algorithm>
#include <boost/units/systems/si/length.hpp>
#include <boost/units/cmath.hpp>
using length = boost::units::quantity<boost::units::si::length, float>;
float f(float a, float b)
{
return a < b? a: b;
}
length f(length a, length b)
{
return a < b? a: b;
}
I see that the generated assembly looks quite different between the two functions and between clang and gcc. Here's a gist of the relevant assembly from clang (with the boost stuff here simply shown as length):
f(float, float): # #f(float, float)
minss xmm0, xmm1
ret
f(length, length)
movss xmm0, dword ptr [rdx] # xmm0 = mem[0],zero,zero,zero
ucomiss xmm0, dword ptr [rsi]
cmova rdx, rsi
mov eax, dword ptr [rdx]
mov dword ptr [rdi], eax
mov rax, rdi
ret
Shouldn't both of these compilers using -O3 optimization be returning the same assembly though for the length version as they do for the float version? Is the problem that they're not quite optimizing down all the way to the same code as for float? Seems like this is the problem and if so that's progress but I still want to figure out what can be done to really get zero overhead.

Get member of __m128 by index?

I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
assert( i <= 3 );
return V.m128_f32[i];
}
The error I get is as follows:
Member reference base type '__m128' is not a structure or union.
I've looked around and found that Clang (and maybe GCC) has a problem with treating __m128 as a struct or union. However I haven't managed to find a straight answer as to how I can get these values back. I've tried using the subscript operator and couldn't do that, and I've glanced around the huge list of SSE intrinsics functions and haven't yet found an appropriate one.
A union is probably the most portable way to do this:
union U {
__m128 v; // SSE vector of 4 floats
float a[4]; // scalar array of 4 floats
};
float vectorGetByIndex(__m128 V, unsigned int i)
{
U u;
assert(i <= 3);
u.v = V;
return u.a[i];
}
As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:
template<unsigned i>
float vectorGetByIndex( __m128 V)
{
// shuffle V so that the element that you want is moved to the least-
// significant element of the vector (V[0])
V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
// return the value in V[0]
return _mm_cvtss_f32(V);
}
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions. This will inline as just a shufps (or nothing for i==0).
Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13) so no need for an if (i). https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0 which is 1 byte shorter than shufps and produces the same low element. You could manually do this with switch/case for other compilers since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
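A quick usage sketch:
__m128 v = _mm_setr_ps(10.f, 20.f, 30.f, 40.f);
float third = vectorGetByIndex<2>(v); // 30.0f: a single shufps (movhlps with clang)
float first = vectorGetByIndex<0>(v); // 10.0f: the shuffle optimizes away entirely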
Note that SSE4.1 _mm_extract_epi32(V, i); is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)
The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For i!=0, otherwise it would use movss). To leave the result in an XMM register as a scalar float, shufps is good.
(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)
Use
template<unsigned i>
float vectorGetByIndex( __m128 V) {
union {
__m128 v;
float a[4];
} converter;
converter.v = V;
return converter.a[i];
}
which will work regardless of the available instruction set.
Note: Even if SSE4.1 is available and i is a compile time constant, you can't use pextract etc. this way, because these instructions extract a 32-bit integer, not a float:
// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
return _mm_extract_epi32(V, i);
}
// broken code ends here
I don't delete it because it is a useful reminder of how not to do things.
The way I use it is
union vec { __m128 sse; float f[4]; };
float accessmember(__m128 v, int index)
{
vec u;
u.sse = v;
return u.f[index];
}
Seems to work out pretty well for me.
Late to this party but found that this works for me in MSVC where z is a variable of type __m128.
#define _mm_extract_f32(v, i) _mm_cvtss_f32(_mm_shuffle_ps(v, v, i))
__m128 z = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);
float f = _mm_extract_f32(z, 2);
OR even simpler
__m128 z;
float f = z.m128_f32[2]; // to get the 3rd float value in the vector