Does casting and assignment really strip away any extra precision of floats?

I am reading through the new C++ FAQ and I see that even if x == y for double x, y; it is still possible for:
std::cos(x) == std::cos(y)
to evaluate to false. This is because the machine can have a processor which supports extended precision, such that one part of the == is a 64 bit number while the other one is an 80 bit number.
However, the next example seems to be incorrect:
void foo(double x, double y)
{
double cos_x = cos(x);
double cos_y = cos(y);
// the behavior might depend on what's in here
if (cos_x != cos_y) {
std::cout << "Huh?!?\n"; // You might end up here when x == y!!
}
}
As far as I read on en.cppreference.com here:
Cast and assignment strip away any extraneous range and precision:
this models the action of storing a value from an extended-precision
FPU register into a standard-sized memory location.
Hence, assigning:
double cos_x = cos(x);
double cos_y = cos(y);
should trim away any extra precision and make the program perfectly predictable.
So, who is right? The C++ FAQ, or the en.cppreference.com?

Neither the ISOCPP FAQ nor cppreference is an authoritative source. cppreference is essentially the equivalent of Wikipedia: it contains good information, but anybody can add anything without sources, so you have to read it with a grain of salt. So, are the following three statements true?
Did you catch that? Your particular installation might store the result of one of the cos() calls out into RAM, truncating it in the process, then later compare that truncated value with the untruncated result of the second cos() call. Depending on lots of details, those two values might not be equal.
This is because the machine can have a processor which supports extended precision, such that one part of the == is a 64 bit number while the other one is an 80 bit number.
Cast and assignment strip away any extraneous range and precision: this models the action of storing a value from an extended-precision FPU register into a standard-sized memory location.
Maybe. It depends on the platform, your compiler, the compiler options, or any number of things. Realistically, the above statements only hold true for GCC in 32-bit mode, which defaults to the legacy x87 FPU (where everything is 80 bits internally) and rounds results down to 64 bits when storing to memory. 64-bit mode uses SSE, so double temporaries are always 64-bit. You can force SSE in 32-bit mode with -mfpmath=sse and -msse2. Either way, the assembly might look something like this:
movsd QWORD PTR [rsp+8], xmm1
call cos
movsd xmm1, QWORD PTR [rsp+8]
movsd QWORD PTR [rsp], xmm0
movapd xmm0, xmm1
call cos
movsd xmm2, QWORD PTR [rsp]
ucomisd xmm2, xmm0
As you can see, there is no truncation of any sort, nor are any 80-bit registers used here.
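For reference, a hedged sketch of the two builds being contrasted (the source file name is just a placeholder, and distro defaults can vary):
g++ -m32 -O2 foo.cpp                        # 32-bit, defaults to x87 math, extra precision can leak
g++ -m32 -O2 -mfpmath=sse -msse2 foo.cpp    # 32-bit forced onto SSE2, doubles stay 64-bit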

I wish that assignment or casting would toss away the extra precision. I have worked with "clever" compilers for which this does not happen. Many do, but if you want truly machine-independent code, you need to do extra work.
You can force the compiler to always use the value in memory with the expected precision by declaring the variable with the "volatile" keyword.
Some compilers pass parameters in registers rather than on the stack. For these compilers, it could be that x or y could have unintended extra precision. To be totally safe, you could do
volatile double db_x = x, db_y = y;
volatile double cos_x = cos(db_x), cos_y = cos(db_y);
if (cos_x != cos_y)...


Alignment and SSE strange behaviour

I am trying to work with SSE and I ran into some strange behaviour.
I wrote simple code for comparing two strings with SSE intrinsics, ran it, and it worked. But later I realized that in my code one of the pointers is still not aligned, even though I use the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary.
#include <immintrin.h>  // SSE/SSE4.1/AVX intrinsics
#include <cstdint>
#include <cstdio>

//Compare two different, not overlapping pieces of memory
__attribute((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
//Skip tail for right alignment of pointer [head_1]
const char* head_1 = (const char*)src_1;
const char* head_2 = (const char*)src_2;
size_t tail_n = 0;
while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
{
if (*head_1 != *head_2)
return 0;
head_1++, head_2++, tail_n++;
}
//Vectorized part: check equality of memory with SSE4.1 instructions
//src1 - aligned, src2 - NOT aligned
const __m128i* src1 = (const __m128i*)head_1;
const __m128i* src2 = (const __m128i*)head_2;
const size_t n = (size - tail_n) / 32;
for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
{
printf("src1 align: %u, src2 align: %u\n", (unsigned)((uintptr_t)src1 % 16), (unsigned)((uintptr_t)src2 % 16));
__m128i mm11 = _mm_load_si128(src1);
__m128i mm12 = _mm_load_si128(src1 + 1);
__m128i mm21 = _mm_load_si128(src2);
__m128i mm22 = _mm_load_si128(src2 + 1);
__m128i mm1 = _mm_xor_si128(mm11, mm21);
__m128i mm2 = _mm_xor_si128(mm12, mm22);
__m128i mm = _mm_or_si128(mm1, mm2);
if (!_mm_testz_si128(mm, mm))
return 0;
}
//Check tail with scalar instructions
const size_t rem = (size - tail_n) % 32;
const char* tail_1 = (const char*)src1;
const char* tail_2 = (const char*)src2;
for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
{
if (*tail_1 != *tail_2)
return 0;
}
return 1;
}
I print the alignment of the two pointers, and one of them was aligned but the second wasn't. And the program still runs correctly and fast.
Then I created a synthetic test like this:
//printChars128(...) function just print 16 byte values from __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
{
__m128i A1 = _mm_load_si128(A);
__m128i B1 = _mm_load_si128(B);
printChars128(A1);
printChars128(B1);
}
And it crashes, as expected, on the first iteration, when it tries to load pointer B.
An interesting fact is that if I switch the target to sse4.2, then my implementation of is_equal will crash.
Another interesting fact is that if I align the second pointer instead of the first (so the first pointer is not aligned and the second is aligned), then is_equal will crash.
So, my question is: why does the is_equal function work fine with only the first pointer aligned if I enable AVX instruction generation?
UPD: This is C++ code. I compile my code with MinGW64/g++, gcc version 4.9.2 under Windows, x86.
Compile string: g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe
TL:DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands to other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for specifically-aligned load/store instructions like vmovdqa.
In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]) , unaligned 128 bit memory operands will fault except with the special unaligned load/store instructions (like movdqu / movups).
The VEX-encoding of vector instructions (like vpxor xmm1, xmm0, [src1]) doesn't fault with unaligned memory, except with the alignment-required load/store instructions (like vmovdqa, or vmovntdq).
The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction (or anything at all if it already has the data in a register, just like dereferencing a scalar pointer).
The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code-density reasons, and also fewer uops to track in parts of the CPU thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.
(Conversely, this means _mm_loadu_* can only fold into a memory operand with AVX, but not with SSE. So even on CPUs where movdqu is as fast as movdqa when the pointer does happen to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end to issue while pxor xmm0, [rsi] is only 1, and doesn't need a scratch register. See also Micro fusion and addressing modes.)
The interpretation of the as-if rule in this case is that it's ok for the program to not fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build).
This is opposite from the rules for floating-point exceptions, where the compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.
Note that since stores can't fold into memory operands for ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers even when compiling for an AVX target.
To be specific: consider this code fragment:
// aligned version:
y = ...; // assume it's in xmm1
x = _mm_load_si128(Aptr); // Aligned pointer
res = _mm_or_si128(y, x);
// unaligned version: the same thing with _mm_loadu_si128(Uptr)
When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use
movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.
When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128 bit) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means that this actually happens except when the original value loaded is used multiple times).
Only one operand to ALU instructions can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice-versa.
You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest :( clang-3.8 does a significantly better job.
.L36:
add rdi, 32
add rsi, 32
cmp rdi, rcx
je .L9
.L10:
vmovdqa xmm0, XMMWORD PTR [rdi] # first arg: loads that will fault on unaligned
xor eax, eax
vpxor xmm1, xmm0, XMMWORD PTR [rsi] # second arg: loads that don't care about alignment
vmovdqa xmm0, XMMWORD PTR [rdi+16] # first arg
vpxor xmm0, xmm0, XMMWORD PTR [rsi+16] # second arg
vpor xmm0, xmm1, xmm0
vptest xmm0, xmm0
sete al # generate a boolean in a reg
test eax, eax
jne .L36 # then test&branch on it. /facepalm
Note that your is_equal is essentially memcmp (reduced to an equal / not-equal result). I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and others which handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.
The compiler-generated byte-loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compared some bytes you've already checked.
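A hedged sketch of that idea (mine, not from the original answer), assuming size >= 16 and SSE4.1 for _mm_testz_si128:
#include <immintrin.h>
#include <cstddef>
static int tails_equal(const char* a, const char* b, size_t size)
{
    // Compare the last 16 bytes of each buffer with one unaligned load each;
    // re-checking bytes the main loop already covered is harmless.
    __m128i va = _mm_loadu_si128((const __m128i*)(a + size - 16));
    __m128i vb = _mm_loadu_si128((const __m128i*)(b + size - 16));
    __m128i diff = _mm_xor_si128(va, vb);
    return _mm_testz_si128(diff, diff);   // 1 only if every byte matched
}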
Anyway, definitely benchmark your code with the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.
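For example, a minimal drop-in wrapper (a sketch, keeping the same is_equal signature as above) would be:
#include <cstring>
int is_equal(const void* src_1, const void* src_2, size_t size)
{
    // Let the library (or gcc's builtin memcmp) pick the best strategy.
    return std::memcmp(src_1, src_2, size) == 0;
}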

How to optimize function return values in C and C++ on x86-64?

The x86-64 ABI specifies two return registers: rax and rdx, both 64-bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
/* Calculate a and b. */
*secondReturnValue = b;
return a;
}
std::pair<uint64_t, uint64_t> g() {
/* Calculate a and b, same as in f() above. */
return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code to write a value to memory, e.g:
movq b, (%rdi)
movq a, %rax
retq
But in case of g() it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
Since the ABI specifies that in some particular cases two registers have to be used for the 2-word result, any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2 using link-time optimizations.
I guess that the second function (returning a pair through 2 registers) might be slightly faster, and that perhaps in some situations the GCC compiler could inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing any small struct into registers for passing/returning (if it contains only integer types). This means that returning a std::pair<uint32_t, uint32_t> means the values have to be shift+ORed into rax.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):
#include <cstdint>
#include <utility>
#define type_t uint32_t
type_t f(type_t * const secondReturnValue, type_t x) {
*secondReturnValue = x+4;
return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.
If you're specifically optimizing for x86-64, and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32bits. Don't assume pointers are 64bit when optimizing for x86-64, if you care about that ABI.
If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.
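As a hedged illustration (my own sketch, not from the answer), a caller consuming the pair-returning version never needs to touch memory; with optimization the two members arrive in rax and rdx and are combined register-to-register:
#include <cstdint>
#include <utility>
std::pair<uint64_t, uint64_t> g();   // as declared in the question
uint64_t caller()
{
    auto [a, b] = g();   // C++17 structured bindings; a comes back in rax, b in rdx
    return a ^ b;        // no stack round trip needed
}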

What is faster than std::pow?

My program spends 90% of CPU time in the std::pow(double,int) function. Accuracy is not a primary concern here, so I was wondering if there were any faster alternatives. One thing I was thinking of trying is casting to float, performing the operation and then back to double (haven't tried this yet); I am concerned that this is not a portable way of improving performance (don't most CPUs operate on doubles intrinsically anyway?)
Cheers
It looks like Martin Ankerl has a few articles on this; Optimized Approximative pow() in C / C++ is one, and it presents two fast versions, one of which is as follows:
inline double fastPow(double a, double b) {
union {
double d;
int x[2];
} u = { a };
u.x[1] = (int)(b * (u.x[1] - 1072632447) + 1072632447);
u.x[0] = 0;
return u.d;
}
which relies on type punning through a union, which is undefined behavior in C++. From the draft standard section 9.5 [class.union]:
In a union, at most one of the non-static data members can be active at any time, that is, the value of at
most one of the non-static data members can be stored in a union at any time. [...]
but most compilers, including gcc, support this with well-defined behavior:
The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common. Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type
but this is not universal, as this article points out, and as I point out in my answer here, using memcpy should generate identical code and does not invoke undefined behavior.
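A hedged sketch of that memcpy variant (mine, not copied from the linked answer), with the same little-endian IEEE-754 assumption as the union version:
#include <cstring>
#include <cstdint>
inline double fastPowMemcpy(double a, double b) {
    std::int32_t x[2];
    std::memcpy(x, &a, sizeof x);   // read the double's bit pattern without UB
    x[1] = (std::int32_t)(b * (x[1] - 1072632447) + 1072632447);
    x[0] = 0;
    double d;
    std::memcpy(&d, x, sizeof d);   // write the adjusted bits back
    return d;
}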
He also links to a second one Optimized pow() approximation for Java, C / C++, and C#.
The first article also links to his microbenchmarks here.
Depending on what you need to do, operating in the log domain might work — that is, you replace all of your values with their logarithms; multiplication becomes addition, division becomes subtraction, and exponentiation becomes multiplication. But now addition and subtraction become expensive and somewhat error-prone operations.
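A tiny hedged sketch of the idea (not from the answer): for x > 0 you carry log(x) around, exponentiation collapses to a multiply, and you convert back only when the real value is needed:
#include <cmath>
double pow_via_logs(double x, double n)   // assumes x > 0
{
    double lx = std::log(x);   // pay the log once per value
    double ly = n * lx;        // "pow" in the log domain is just a multiply
    return std::exp(ly);       // leave the log domain at the end
}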
How big are your integers? Are they known at compile time? It's far better to compute x^2 as x*x as opposed to pow(x,2). Note: Almost all applications of pow() to an integer power involve raising some number to the second or third power (or the multiplicative inverse in the case of negative exponents). Using pow() is overkill in such cases. Use a template for these small integer powers, or just use x*x.
If the integers are small, but not known at compile time, say between -12 and +12, multiplication will still beat pow() and won't lose accuracy. You don't need eleven multiplications to compute x^12. Four will do. Use the fact that x^(2n) = (x^n)^2 and x^(2n+1) = x*((x^n)^2). For example, x^12 is ((x*x*x)^2)^2. Two multiplications to compute x^3 (x*x*x), one more to compute x^6, and one final one to compute x^12.
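A hedged sketch of that squaring scheme (mine, not from the answer), for a non-negative integer exponent known only at run time:
double ipow(double x, unsigned n)
{
    double result = 1.0;
    while (n) {
        if (n & 1)        // odd exponent: x^(2k+1) = x * (x^k)^2
            result *= x;
        x *= x;           // square the base for the next bit
        n >>= 1;
    }
    return result;
}
For example, ipow(x, 12) needs only a handful of multiplications instead of a library pow() call.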
YES! Very fast if you only need the exponent 'y'/'n' as a long/int, which allows you to avoid the slow FPU FSCALE function. This is Agner Fog's x86 hand-optimized version for the case where the exponent is an INT. I upgraded it to __fastcall/__declspec(naked) for speed/size and made use of ECX to pass 'n' (floating-point arguments are always passed on the stack for 32-bit MSVC++), so these are very minor tweaks on my part; it's mostly Agner's work. It was tested/debugged/compiled on MS Visual VC++ 2005 Express/Pro, so it should be OK to slip into newer versions. Accuracy against the universal CRT pow() function is very good.
extern double __fastcall fs_power(double x, long n);
// Raise 'x' to the power 'n' (INT-only) in ASM by the great Agner Fog!
__declspec(naked) double __fastcall fs_power(double x, long n) { __asm {
MOV EAX, ECX ;// Move 'n' to eax
;// abs(n) is calculated by inverting all bits and adding 1 if n < 0:
CDQ ;// Get sign bit into all bits of edx
XOR EAX, EDX ;// Invert bits if negative
SUB EAX, EDX ;// Add 1 if negative. Now eax = abs(n)
JZ RETZERO ;// End if n = 0
FLD1 ;// ST(0) = 1.0 (FPU push1)
FLD QWORD PTR [ESP+4] ;// Load 'x' : ST(0) = 'x', ST(1) = 1.0 (FPU push2)
JMP L2 ;// Jump into loop
L1: ;// Top of loop
FMUL ST(0), ST(0) ;// Square x
L2: ;// Loop entered here
SHR EAX, 1 ;// Get each bit of n into carry flag
JNC L1 ;// No carry. Skip multiplication, goto next
FMUL ST(1), ST(0) ;// Multiply by x squared i times for bit # i
JNZ L1 ;// End of loop. Stop when nn = 0
FSTP ST(0) ;// Discard ST(0) (FPU Pop1)
TEST EDX, EDX ;// Test if 'n' was negative
JNS RETPOS ;// Finish if 'n' was positive
FLD1 ;// ST(0) = 1.0, ST(1) = x^abs(n)
FDIVR ;// Reciprocal
RETPOS: ;// Finish, success!
RET 4 ;//(FPU Pop2 occurs by compiler on assignment
RETZERO:
FLDZ ;// Ret 0.0, fail, if n was 0
RET 4
}}

Should I use SIMD or vector extensions or something else?

I'm currently developing an open source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question.
Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is on. I can also use the "vector extensions" of GCC. But none of this is really portable.
I know that I have more control when I use my own SSE code, but often this control is unnecessary.
One big problem of SSE is the use of dynamic memory, which I limit as much as possible with the help of memory pools and data-oriented design.
Now to my question:
Should I use naked SSE? Perhaps encapsulated.
__m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
__m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);
__m128 res = _mm_mul_ps(v1, v2);
Or should the compiler do the dirty work?
float v1[4] = {0.5f, 2, 4, 0.25f};
float v2[4] = {2, 0.5f, 0.25f, 4};
float res[4];
res[0] = v1[0]*v2[0];
res[1] = v1[1]*v2[1];
res[2] = v1[2]*v2[2];
res[3] = v1[3]*v2[3];
Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional load and store instructions.
Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);
Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
The above example uses an imaginary class which uses float[4] internally and uses store and load in each method like multiplyElements(...). The methods use SSE internally.
I don't want to use another library, because I want to learn more about SIMD and large scale software design. But library examples are welcome.
PS: This is not a real problem more a design question.
Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (of course, by all means stay away from inline assembly, but fortunately you didn't list it as an alternative anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:
struct aligned_storage
{
//overload new and delete for 16-byte alignment
};
class vec4 : public aligned_storage
{
public:
vec4(float x, float y, float z, float w)
{
data_[0] = x; ... data_[3] = w; //don't use _mm_set_ps, it will do the same, followed by a _mm_load_ps, which is unnecessary
}
vec4(float *data)
{
data_[0] = data[0]; ... data_[3] = data[3]; //don't use _mm_loadu_ps, unaligned just doesn't pay
}
vec4(const vec4 &rhs)
: xmm_(rhs.xmm_)
{
}
...
vec4& operator*=(const vec4 v)
{
xmm_ = _mm_mul_ps(xmm_, v.xmm_);
return *this;
}
...
private:
union
{
__m128 xmm_;
float data_[4];
};
};
Now the nice thing is, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work) you can use the standard float array whenever necessary (like operator[] or initialization (don't use _mm_set_ps)) and only use SSE when appropriate. With a modern inlining compiler the encapsulation probably comes at no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class, no fear of unnecessary moves into temporary memory variables, as VC8 seemed to like even without encapsulation).
The only disadvantage is, that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE. But fortunately the alignment requirement of the __m128 will propagate into the vec4 (and any surrounding class) and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose operator new and operator delete functions (in all flavours of course) are overloaded properly and from which your vector class will derive. To use your type with standard containers you of course also need to specialize std::allocator (and maybe std::get_temporary_buffer and std::return_temporary_buffer for the sake of completeness), as it will use the global operator new otherwise.
But the real disadvantage is, that you need to also care for the dynamic allocation of any class that has your SSE vector as member, which may be tedious, but can again be automated a bit by also deriving those classes from aligned_storage and putting the whole std::allocator specialization mess into a handy macro.
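A hedged sketch of what the aligned_storage placeholder above could contain (my own; the answer only names it). It uses C++17's std::aligned_alloc for brevity, where a C++11 build would typically reach for _mm_malloc/_mm_free or _aligned_malloc instead:
#include <cstdlib>
#include <new>
struct aligned_storage
{
    static void* operator new(std::size_t size)
    {
        // aligned_alloc wants the size to be a multiple of the alignment
        std::size_t rounded = (size + 15) & ~std::size_t(15);
        if (void* p = std::aligned_alloc(16, rounded))
            return p;
        throw std::bad_alloc();
    }
    static void operator delete(void* p) noexcept { std::free(p); }
    // operator new[] / operator delete[] would be overloaded the same way
};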
JamesWynn has a point that those operations often come together in some special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard float[4] implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take them all at once (which should IMHO not be any slower than moving a single value if properly aligned) and compute in parallel. Thus you can freely switch out an SSE implementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).
But if the ensuring of alignment for all classes having vec4 as member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE-vector type which you use for computations and use a standard non-SSE vector for storage.
EDIT: Ok, to look at the overhead argument, that goes around here (and looks quite reasonable at first), let's take a bunch of computations, which look very clean, due to overloaded operators:
#include "vec.h"
#include <iostream>
int main(int argc, char *argv[])
{
math::vec<float,4> u, v, w = u + v;
u = v + dot(v, w) * w;
v = abs(u-w);
u = 3.0f * w + v;
w = -w * (u+v);
v = min(u, w) + length(u) * w;
std::cout << v << std::endl;
return 0;
}
and see what VC10 thinks about it:
...
; 6 : math::vec<float,4> u, v, w = u + v;
movaps xmm4, XMMWORD PTR _v$[esp+32]
; 7 : u = v + dot(v, w) * w;
; 8 : v = abs(u-w);
movaps xmm3, XMMWORD PTR __xmm#0
movaps xmm1, xmm4
addps xmm1, XMMWORD PTR _u$[esp+32]
movaps xmm0, xmm4
mulps xmm0, xmm1
haddps xmm0, xmm0
haddps xmm0, xmm0
shufps xmm0, xmm0, 0
mulps xmm0, xmm1
addps xmm0, xmm4
subps xmm0, xmm1
movaps xmm2, xmm3
; 9 : u = 3.0f * w + v;
; 10 : w = -w * (u+v);
xorps xmm3, xmm1
andnps xmm2, xmm0
movaps xmm0, XMMWORD PTR __xmm#1
mulps xmm0, xmm1
addps xmm0, xmm2
; 11 : v = min(u, w) + length(u) * w;
movaps xmm1, xmm0
mulps xmm1, xmm0
haddps xmm1, xmm1
haddps xmm1, xmm1
sqrtss xmm1, xmm1
addps xmm2, xmm0
mulps xmm3, xmm2
shufps xmm1, xmm1, 0
; 12 : std::cout << v << std::endl;
mov edi, DWORD PTR __imp_?cout#std##3V?$basic_ostream#DU?$char_traits#D#std###1#A
mulps xmm1, xmm3
minps xmm0, xmm3
addps xmm1, xmm0
movaps XMMWORD PTR _v$[esp+32], xmm1
...
Even without thoroughly analyzing every single instruction and its use, I'm pretty confident to say that there aren't any unnecessary loads or stores, except the ones at the beginning (OK, I left the variables uninitialized), which are necessary anyway to get them from memory into computing registers, and at the end, which is necessary as in the following expression v is going to be put out. It didn't even store anything back into u and w, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it leaving the XMM register, although the dot function returns a float using an actual _mm_store_ss after the haddps.
So even I, being usually a bit over-suspicious of the compiler's abilities, have to say that handcrafting your own intrinsics into special functions doesn't really pay compared to the clean and expressive code you gain by encapsulation. Though you may be able to create killer examples where handcrafting the intrinsics may indeed save you a few instructions, but then again you first have to outsmart the optimizer.
EDIT: Ok, Ben Voigt pointed out another problem of the union besides the (most probably not problematic) memory layout incompatibility, which is that it is violating strict aliasing rules and the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I haven't thought about that yet. I don't know if it makes any problems in practice, it certainly needs investigation.
If it really is a problem, we unfortunately need to drop the data_[4] member and use the __m128 alone. For initialization we now have to resort to _mm_set_ps and _mm_loadu_ps again. The operator[] gets a bit more complicated and might need some combination of _mm_shuffle_ps and _mm_store_ss. But for the non-const version you have to use some kind of proxy object delegating an assignment to the corresponding SSE instructions. It has to be investigated in which way the compiler can optimize this additional overhead in the specific situations then.
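A hedged sketch of what that could look like (my own guess at the shape, not the answer's code), keeping only the __m128 member:
#include <xmmintrin.h>
class vec4_sse
{
public:
    float operator[](int i) const          // const access: spill once, read one lane
    {
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, xmm_);
        return tmp[i];
    }
    struct element_proxy                   // non-const access goes through a proxy
    {
        __m128& v;
        int     i;
        operator float() const
        {
            alignas(16) float tmp[4];
            _mm_store_ps(tmp, v);
            return tmp[i];
        }
        element_proxy& operator=(float f)
        {
            alignas(16) float tmp[4];
            _mm_store_ps(tmp, v);          // round-trip through a temporary;
            tmp[i] = f;                    // a shuffle/_mm_store_ss version could
            v = _mm_load_ps(tmp);          // avoid it, as suggested above
            return *this;
        }
    };
    element_proxy operator[](int i) { return {xmm_, i}; }
private:
    __m128 xmm_ = _mm_setzero_ps();
};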
Or you only use the SSE-vector for computations and just make an interface for conversion to and from non-SSE vectors at a whole, which is then used at the peripherals of computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.
But however you tackle it, there is still no need to handcraft SSE intrinsics without using the benefits of operator overloading.
I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.
I would suggest using the naked SIMD code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per data-oriented design. Where there's one, there are many.

C++: Structs slower to access than basic variables?

I found some code that had "optimization" like this:
void somefunc(SomeStruct param){
float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
float y = param.y;
float z = param.z;
}
And the comments said that it will make the variable access faster, but I've always thought struct element access is as fast as if it weren't in a struct at all.
Could someone clear this up for me?
The usual rules for optimization (Michael A. Jackson) apply:
1. Don't do it.
2. (For experts only:) Don't do it yet.
That being said, let's assume it's the innermost loop that takes 80% of the time of a performance-critical application. Even then, I doubt you will ever see any difference. Let's use this piece of code for instance:
struct Xyz {
float x, y, z;
};
float f(Xyz param){
return param.x + param.y + param.z;
}
float g(Xyz param){
float x = param.x;
float y = param.y;
float z = param.z;
return x + y + z;
}
Running it through LLVM shows: Only with no optimizations do the two act as expected (g copies the struct members into locals, then sums those; f sums the values fetched from param directly). With standard optimization levels, both result in identical code (extracting the values once, then summing them).
For short code, this "optimization" is actually harmful, as it copies the floats needlessly. For longer code using the members in several places, it might help a teensy bit if you actively tell your compiler to be stupid. A quick test with 65 (instead of 2) additions of the members/locals confirms this: With no optimizations, f repeatedly loads the struct members while g reuses the already extracted locals. The optimized versions are again identical and both extract the members only once. (Surprisingly, there's no strength reduction turning the additions into multiplications even with LTO enabled, but that just indicates the LLVM version used isn't optimizing too aggressively anyway - so it should work just as well in other compilers.)
So, the bottom line is: Unless you know your code will have to be compiled by a compiler that's so outrageously stupid and/or ancient that it won't optimize anything, you now have proof that the compiler will make both ways equivalent and can thus do away with this crime against readability and brevity committed in the name of performance. (Repeat the experiment for your particular compiler if necessary.)
Rule of thumb: it's not slow, unless profiler says it is. Let the compiler worry about micro-optimisations (they're pretty smart about them; after all, they've been doing it for years) and focus on the bigger picture.
I'm no compiler guru, so take this with a grain of salt. I'm guessing that the original author of the code is assuming that by copying the values from the struct into local variables, the compiler has "placed" those variables into floating point registers which are available on some platforms (e.g., x86). If there aren't enough registers to go around, they'd be put in the stack.
That being said, unless this code was in the middle of an intensive computation/loop, I'd strive for clarity rather than speed. It's pretty rare that anyone is going to notice a few instructions difference in timing.
You'd have to look at the compiled code on a particular implementation to be sure, but there's no reason in principle why your preferred code (using the struct members) should necessarily be any slower than the code you've shown (copying into variables and then using the variables).
someFunc takes a struct by value, so it has its own local copy of that struct. The compiler is perfectly at liberty to apply exactly the same optimizations to the struct members, as it would apply to the float variables. They're both automatic variables, and in both cases the "as-if" rule allows them to be stored in register(s) rather than in memory provided that the function produces the correct observable behavior.
This is unless of course you take a pointer to the struct and use it, in which case the values need to be written in memory somewhere, in the correct order, pointed to by the pointer. This starts to limit optimization, and other limits are introduced by the fact that if you pass around a pointer to an automatic variable, the compiler can no longer assume that the variable name is the only reference to that memory and hence the only way its contents can be modified. Having multiple references to the same object is called "aliasing", and does sometimes block optimizations that could be made if the object was somehow known not to be aliased.
Then again, if this is an issue, and the rest of the code in the function somehow does use a pointer to the struct, then of course you could be on dodgy ground copying the values into variables from the POV of correctness. So the claimed optimization is not quite so straightforward as it looks in that case.
Now, there may be particular compilers (or particular optimization levels) which fail to apply to structs all the optimizations that they're permitted to apply, but do apply equivalent optimizations to float variables. If so then the comment would be right, and that's why you have to check to be sure. For example, maybe compare the emitted code for this:
float somefunc(SomeStruct param){
float x = param.x; // param.x and x are both floats. supposedly this makes it faster access
float y = param.y;
float z = param.z;
for (int i = 0; i < 10; ++i) {
x += (y +i) * z;
}
return x;
}
with this:
float somefunc(SomeStruct param){
for (int i = 0; i < 10; ++i) {
param.x += (param.y +i) * param.z;
}
return param.x;
}
There may also be optimization levels where the extra variables make the code worse. I'm not sure I put much trust in code comments that say "supposedly this makes it faster access", sounds like the author doesn't really have a clear idea why it matters. "Apparently it makes it faster access - I don't know why but the tests to confirm this and to demonstrate that it makes a noticeable difference in the context of our program, are in source control in the following location" is a lot more like it ;-)
In unoptimised code:
function parameters (which are not passed by reference) are on the stack
local variables are also on the stack
Unoptimised access to local variables and function parameters in assembly language looks more or less like this:
mov eax, [ebp + compile-time-constant]
where ebp is the frame pointer (a sort of 'this' pointer for the function).
It makes no difference if you access a parameter or a local variable.
The fact that you are accessing an element from a struct makes absolutely no difference from the assembly/machine point of view. Structs are constructs made in C to make programmer's life easier.
So, ultimately, my answer is: No, there is absolutely no benefit in doing that.
There are good and valid reasons to do that kind of optimization when pointers are used, because consuming all inputs first frees the compiler from possible aliasing issues which prevent it from producing optimal code (there's restrict nowadays too, though).
For non-pointer types, there is in theory an overhead because every member is accessed via the struct's this pointer. This may in theory be noticeable within an inner loop and will in theory be a diminutive overhead otherwise. In practice, however, a modern compiler will almost always (unless there is a complex inheritance hierarchy) produce the exact same binary code.
I had asked myself the exact same question as you did about two years ago and did a very extensive test case using gcc 4.4. My findings were that unless you really try to throw sticks between the compiler's legs on purpose, there is absolutely no difference in the generated code.
The real answer is given by Piotr. This one is just for fun.
I have tested it. This code:
float somefunc(SomeStruct param, float &sum){
float x = param.x;
float y = param.y;
float z = param.z;
float xyz = x * y * z;
sum = x + y + z;
return xyz;
}
And this code:
float somefunc(SomeStruct param, float &sum){
float xyz = param.x * param.y * param.z;
sum = param.x + param.y + param.z;
return xyz;
}
Generate identical assembly code when compiled with g++ -O2. They do generate different code with optimization turned off, though. Here is the difference:
< movl -32(%rbp), %eax
< movl %eax, -4(%rbp)
< movl -28(%rbp), %eax
< movl %eax, -8(%rbp)
< movl -24(%rbp), %eax
< movl %eax, -12(%rbp)
< movss -4(%rbp), %xmm0
< mulss -8(%rbp), %xmm0
< mulss -12(%rbp), %xmm0
< movss %xmm0, -16(%rbp)
< movss -4(%rbp), %xmm0
< addss -8(%rbp), %xmm0
< addss -12(%rbp), %xmm0
---
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> mulss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> mulss %xmm1, %xmm0
> movss %xmm0, -4(%rbp)
> movss -32(%rbp), %xmm1
> movss -28(%rbp), %xmm0
> addss %xmm1, %xmm0
> movss -24(%rbp), %xmm1
> addss %xmm1, %xmm0
The lines marked < correspond to the version with "optimization" variables. It seems to me that the "optimized" version is even slower than the one with no extra variables. This is to be expected, though, as x, y and z are allocated on the stack, exactly like the param. What's the point of allocating more stack variables to duplicate existing ones?
If the one who did that "optimization" knew the language better, he would probably have declared those variables as register, but even that leaves the "optimized" version slightly slower and longer, at least on G++/x86-64.
The compiler may generate faster code for a float-to-float copy.
But when x is used, it will be converted to the internal FPU representation anyway.
When you specify a "simple" variable (not a struct/class) to be operated upon, the system only has to go to that place and fetch the data it wants.
But when you refer to a variable inside a struct or class, like A.B, the system needs to calculate where B is inside the area called A (because there may be other variables declared before it), and that calculation takes a bit more than the plainer access described above.