Most efficient standard-compliant way of reinterpreting int as float - c++

Assume I have guarantees that float is IEEE 754 binary32. Given a bit pattern that corresponds to a valid float, stored in a std::uint32_t, how does one reinterpret it as a float in the most efficient standard-compliant way?
float reinterpret_as_float(std::uint32_t ui) {
    return /* apply sorcery to ui */;
}
I've got a few ways that I know/suspect/assume have some issues:
Via reinterpret_cast,
float reinterpret_as_float(std::uint32_t ui) {
    return reinterpret_cast<float&>(ui);
}
or equivalently
float reinterpret_as_float(std::uint32_t ui) {
    return *reinterpret_cast<float*>(&ui);
}
which suffers from aliasing issues.
Via union,
float reinterpret_as_float(std::uint32_t ui) {
    union {
        std::uint32_t ui;
        float f;
    } u = {ui};
    return u.f;
}
which is not actually legal, as one is only allowed to read from the most recently written member. Yet, it seems some compilers (gcc) allow this.
Via std::memcpy,
float reinterpret_as_float(std::uint32_t ui) {
    float f;
    std::memcpy(&f, &ui, 4);
    return f;
}
which AFAIK is legal, but a function call to copy a single word seems wasteful, though it might get optimized away.
Via reinterpret_cast to char* and copying,
float reinterpret_as_float(std::uint32_t ui) {
    char* uip = reinterpret_cast<char*>(&ui);
    float f;
    char* fp = reinterpret_cast<char*>(&f);
    for (int i = 0; i < 4; ++i) {
        fp[i] = uip[i];
    }
    return f;
}
which AFAIK is also legal, as char pointers are exempt from aliasing issues and the manual byte-copying loop avoids a possible function call. The loop will almost certainly be unrolled, yet four possibly separate one-byte loads/stores are worrisome; I have no idea whether this is optimizable to a single four-byte load/store.
The 4th is the best I've been able to come up with.
Am I correct so far? Is there a better way to do this, particularly one that will guarantee a single load/store?

AFAIK, there are only two approaches that are compliant with strict aliasing rules: memcpy() and the cast to char* with copying. All others read a float from memory that belongs to a uint32_t, and the compiler is allowed to perform the read before the write to that memory location. It might even optimize away the write altogether, as it can prove the stored value is never used according to strict aliasing rules, resulting in a garbage return value.
It really depends on the compiler/optimizer whether memcpy() or the char* copy is faster. In both cases, an intelligent compiler might be able to figure out that it can just load and copy a uint32_t, but I would not trust any compiler to do so before I have seen it in the resulting assembly code.
Edit:
After some testing with gcc 4.8.1, I can say that the memcpy() approach is the best for this particular compiler; see below for details.
Compiling
#include <stdint.h>

float foo(uint32_t a) {
    float b;
    char* aPointer = (char*)&a, *bPointer = (char*)&b;
    for( int i = sizeof(a); i--; ) bPointer[i] = aPointer[i];
    return b;
}
with gcc -S -std=gnu11 -O3 foo.c yields this assembly code:
movl %edi, %ecx
movl %edi, %edx
movl %edi, %eax
shrl $24, %ecx
shrl $16, %edx
shrw $8, %ax
movb %cl, -1(%rsp)
movb %dl, -2(%rsp)
movb %al, -3(%rsp)
movb %dil, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is not optimal.
Doing the same with
#include <stdint.h>
#include <string.h>

float foo(uint32_t a) {
    float b;
    char* aPointer = (char*)&a, *bPointer = (char*)&b;
    memcpy(bPointer, aPointer, sizeof(a));
    return b;
}
yields (with all optimization levels except -O0):
movl %edi, -4(%rsp)
movss -4(%rsp), %xmm0
ret
This is optimal.

If the bit pattern in the integer variable is the same as a valid float value, then union is probably the best and most compliant way to go. And it's actually legal if you read the specification (don't remember the section at the moment).

memcpy is always safe but does involve a copy
casting may lead to problems
union - seems to be allowed in C99 and C11, not sure about C++
Take a look at:
What is the strict aliasing rule?
and
Is type-punning through a union unspecified in C99, and has it become specified in C11?

float reinterpret_as_float(std::uint32_t ui) {
    return *((float *)&ui);
}
As a plain function, its code is translated into assembly like this (Pelles C for Windows):
fld [esp+4]
ret
If defined as an inline function, then code like this (n being unsigned, x being float):
x = reinterpret_as_float (n);
is translated to assembly like this:
fld [ebp-4]             ;RHS of assignment. Read n as float
fstp dword ptr [ebp-8]  ;LHS of assignment

Related

Why is a volatile local variable optimised differently from a volatile argument, and why does the optimiser generate a no-op loop from the latter?

Background
This was inspired by this question/answer and ensuing discussion in the comments: Is the definition of “volatile” this volatile, or is GCC having some standard compliancy problems?. Based on others' and my interpretation of what should be happening, as discussed in the comments, I've submitted it to GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71793 Other relevant responses are still welcome.
Also, that thread has since given rise to this question: Does accessing a declared non-volatile object through a volatile reference/pointer confer volatile rules upon said accesses?
Intro
I know volatile isn't what most people think it is and is an implementation-defined nest of vipers. And I certainly don't want to use the below constructs in any real code. That said, I'm totally baffled by what's going on in these examples, so I'd really appreciate any elucidation.
My guess is this is due to either highly nuanced interpretation of the Standard or (more likely?) just corner-cases for the optimiser used. Either way, while more academic than practical, I hope this is deemed valuable to analyse, especially given how typically misunderstood volatile is. Some more data points - or perhaps more likely, points against it - must be good.
Input
Given this code:
#include <cstddef>

void f(void *const p, std::size_t n)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    volatile unsigned char const x = 42;
    // N.B. Yeah, const is weird, but it doesn't change anything
    while (n--) {
        *y++ = x;
    }
}

void g(void *const p, std::size_t n, volatile unsigned char const x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

void h(void *const p, std::size_t n, volatile unsigned char const &x)
{
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = x;
    }
}

int main(int, char **)
{
    int y[1000];
    f(&y, sizeof y);
    volatile unsigned char const x{99};
    g(&y, sizeof y, x);
    h(&y, sizeof y, x);
}
Output
g++ from gcc (Debian 4.9.2-10) 4.9.2 (Debian stable a.k.a. Jessie) with the command line g++ -std=c++14 -O3 -S test.cpp produces the below ASM for main(). Version Debian 5.4.0-6 (current unstable) produces equivalent code, but I just happened to run the older one first, so here it is:
main:
.LFB3:
.cfi_startproc
# f()
movb $42, -1(%rsp)
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L21:
subq $1, %rax
movzbl -1(%rsp), %edx
jne .L21
# x = 99
movb $99, -2(%rsp)
movzbl -2(%rsp), %eax
# g()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L22:
subq $1, %rax
jne .L22
# h()
movl $4000, %eax
.p2align 4,,10
.p2align 3
.L23:
subq $1, %rax
movzbl -2(%rsp), %edx
jne .L23
# return 0;
xorl %eax, %eax
ret
.cfi_endproc
Analysis
All 3 functions are inlined, and both that allocate volatile local variables do so on the stack for fairly obvious reasons. But that's about the only thing they share...
f() makes sure to read from x on each iteration, presumably due to its volatile - but just dumps the result to edx, presumably because the destination y isn't declared volatile and is never read, meaning changes to it can be nixed under the as-if rule. OK, makes sense.
Well, I mean... kinda. Like, not really, because volatile is really for hardware registers, and clearly a local value can't be one of those - and can't otherwise be modified in a volatile way unless its address is passed out... which it's not. Look, there's just not a lot of sense to be had out of volatile local values. But C++ lets us declare them and tries to do something with them. And so, confused as always, we stumble onwards.
g(): What. By moving the volatile source into a pass-by-value parameter, which is still just another local variable, GCC somehow decides it's not or less volatile, and so it doesn't need to read it every iteration... but it still carries out the loop, despite its body now doing nothing.
h(): By taking the passed volatile as pass-by-reference, the same effective behaviour as f() is restored, so the loop does volatile reads.
This case alone actually makes practical sense to me, for reasons outlined above against f(). To elaborate: Imagine x refers to a hardware register, of which every read has side-effects. You wouldn't want to skip any of those.
Adding #define volatile /**/ leads to main() being a no-op, as you'd expect. So, when present, even on a local variable volatile does do something... I just have no idea what in the case of g(). What on Earth is going on there?
Questions
Why does a local value declared in-body produce different results from a by-value parameter, with the former letting reads be optimised away? Both are declared volatile. Neither have an address passed out - and don't have a static address, ruling out any inline-ASM POKEry - so they can never be modified outwith the function. The compiler can see that each is constant, need never be re-read, and volatile just ain't true -
so (A) is either allowed to be elided under such constraints? (acting as-if they weren't declared volatile) -
and (B) why does only one get elided? Are some volatile local variables more volatile than others?
Setting aside that inconsistency for just a moment: After the read was optimised away, why does the compiler still generate the loop? It does nothing! Why doesn't the optimiser elide it as-if no loop was coded?
Is this a weird corner case due to order of optimising analyses or such? As the code is a daft thought-experiment, I wouldn't chastise GCC for this, but it'd be good to know for sure. (Or is g() the manual timing loop people have dreamt of all these years?) If we conclude there's no Standard bearing on any of this, I'll move it to their Bugzilla just for their information.
And of course, the more important question from a practical perspective, though I don't want that to overshadow the potential for compiler geekery... Which, if any of these, are well-defined/correct according to the Standard?
For f: GCC eliminates the non-volatile stores (but not the loads, which can have side-effects if the source location is a memory mapped hardware register). There is really nothing surprising here.
For g: Because of the x86_64 ABI, the parameter x of g is allocated in a register (i.e. rdx) and does not have a location in memory. Reading a general-purpose register does not have any observable side effects, so the dead read gets eliminated.
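One way to probe that explanation (my own sketch, not from the thread): taking the by-value parameter's address forces it to have a memory location, so one would expect the per-iteration volatile reads to come back, much as in h(). The observable behaviour is unchanged; only the codegen expectation differs:

```cpp
#include <cstddef>

// Variant of g() that takes the address of the by-value parameter.
// The compiler must then give x a real stack slot, so each read through
// xp is a read of a volatile object in memory.
void g2(void *const p, std::size_t n, volatile unsigned char const x)
{
    volatile unsigned char const *xp = &x;
    unsigned char *y = static_cast<unsigned char *>(p);
    while (n--) {
        *y++ = *xp;
    }
}
```

Whether a given GCC version actually emits the loads is something to verify in the assembly, as with the originals.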

Volatile not working as expected

Consider this code:
struct A {
    volatile int x;

    A() : x(12) {
    }
};

A foo() {
    A ret;
    //Do stuff
    return ret;
}

int main()
{
    A a;
    a.x = 13;
    a = foo();
}
Using g++ -std=c++14 -pedantic -O3 I get this assembly:
foo():
movl $12, %eax
ret
main:
xorl %eax, %eax
ret
According to my estimation the variable x should be written to at least three times (possibly four), yet it is not even written once (the function foo isn't even called!)
Even worse when you add the inline keyword to foo this is the result:
main:
xorl %eax, %eax
ret
I thought that volatile means that every single read or write must happen even if the compiler can not see the point of the read/write.
What is going on here?
Update:
Putting the declaration of A a; outside main like this:
A a;

int main()
{
    a.x = 13;
    a = foo();
}
Generates this code:
foo():
movl $12, %eax
ret
main:
movl $13, a(%rip)
xorl %eax, %eax
movl $12, a(%rip)
ret
a:
.zero 4
which is closer to what you would expect... I am even more confused than ever.
Visual C++ 2015 does not optimize away the assignments:
A a;
mov dword ptr [rsp+8],0Ch <-- write 1
a.x = 13;
mov dword ptr [a],0Dh <-- write2
a = foo();
mov dword ptr [a],0Ch <-- write3
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
mov eax,dword ptr [rsp+8]
mov dword ptr [rsp+8],eax
}
xor eax,eax
ret
The same happens both with /O2 (Maximize speed) and /Ox (Full optimization).
The volatile writes are also kept by gcc 3.4.4, using both -O2 and -O3:
_main:
pushl %ebp
movl $16, %eax
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
call __alloca
call ___main
movl $12, -4(%ebp) <-- write1
xorl %eax, %eax
movl $13, -4(%ebp) <-- write2
movl $12, -8(%ebp) <-- write3
leave
ret
Using both of these compilers, if I remove the volatile keyword, main() becomes essentially empty.
I'd say you have a case where the compiler over-aggressively (and incorrectly IMHO) decides that since 'a' is not used, operations on it aren't necessary, and overlooks the volatile member. Making 'a' itself volatile could get you what you want, but as I don't have a compiler that reproduces this, I can't say for sure.
Last (while this is admittedly Microsoft specific), https://msdn.microsoft.com/en-us/library/12a04hfd.aspx says:
If a struct member is marked as volatile, then volatile is propagated to the whole structure.
Which also points towards the behavior you are seeing being a compiler problem.
Finally, if you make 'a' a global variable, it is somewhat understandable that the compiler is less eager to deem it unused and drop it. Global variables are extern by default, so it is not possible to say that a global 'a' is unused just by looking at the main function. Some other compilation unit (.cpp file) might be using it.
GCC's page on Volatile access gives some insight into how it works:
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point. The use of volatile does not allow you to violate the restriction on updating objects multiple times between two sequence points.
In C standardese:
§5.1.2.3
2 Accessing a volatile object, modifying an object, modifying a file,
or calling a function that does any of those operations are all side
effects, 11) which are changes in the state of the
execution environment. Evaluation of an expression may produce side
effects. At certain specified points in the execution sequence called
sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have
taken place. (A summary of the sequence points is given in annex C.)
3 In the abstract machine, all expressions are evaluated as specified
by the semantics. An actual implementation need not evaluate part of
an expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
[...]
5 The least requirements on a conforming implementation are:
At sequence points, volatile objects are stable in the sense that previous accesses are complete and subsequent accesses have not yet
occurred. [...]
I chose the C standard because the language is simpler but the rules are essentially the same in C++. See the "as-if" rule.
Now on my machine, -O1 doesn't optimize away the call to foo(), so let's use -fdump-tree-optimized to see the difference:
-O1
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
a = foo ();
a ={v} {CLOBBER};
return 0;
}
And -O3:
*[definition to foo() omitted]*
;; Function int main() (main, funcdef_no=4, decl_uid=2131, cgraph_uid=4, symbol_order=4) (executed once)
int main() ()
{
struct A ret;
struct A a;
<bb 2>:
a.x ={v} 12;
a.x ={v} 13;
ret.x ={v} 12;
ret ={v} {CLOBBER};
a ={v} {CLOBBER};
return 0;
}
gdb reveals in both cases that a is ultimately optimized out, but we're worried about foo(). The dumps show us that GCC reordered the accesses so that foo() is not even necessary and subsequently all of the code in main() is optimized out. Is this really true? Let's see the assembly output for -O1:
foo():
mov eax, 12
ret
main:
call foo()
mov eax, 0
ret
This essentially confirms what I said above. Everything is optimized out: the only difference is whether or not the call to foo() is as well.

Avoiding strict aliasing violation in hash function

How can I avoid a strict aliasing rule violation when trying to modify the char* result of a sha256 function?
Compute hash value:
std::string sha = sha256("some text");
const char* sha_result = sha.c_str();
const unsigned long* mod_args = reinterpret_cast<const unsigned long*>(sha_result); // const needed to compile; still an aliasing violation
then getting two pieces:
unsigned long a = mod_args[1] ^ mod_args[3] ^ mod_args[5] ^ mod_args[7];
unsigned long b = mod_args[0] ^ mod_args[2] ^ mod_args[4] ^ mod_args[6];
then getting the result by concatenating the two pieces:
unsigned long long result = (((unsigned long long)a) << 32) | b;
As depressing as it might sound, the only truly portable, standard-conforming and efficient way of doing so is through memcpy(). Using reinterpret_cast is a violation of the strict aliasing rule, and using a union (as often suggested) triggers undefined behaviour when you read from the member you didn't write to.
However, since most compilers will optimize away memcpy() calls, this is not as depressing as it sounds.
For example, the following code with two memcpy()s:
char* foo() {
    char* sha = sha256("some text");
    unsigned int mod_args[8];
    memcpy(mod_args, sha, sizeof(mod_args));
    mod_args[5] = 0;
    memcpy(sha, mod_args, sizeof(mod_args));
    return sha;
}
produces the following optimized assembly:
foo(): # #foo()
pushq %rax
movl $.L.str, %edi
callq sha256(char const*)
movl $0, 20(%rax)
popq %rdx
retq
It is easy to see that no memcpy() remains - the value is modified 'in place'.
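Applied to the original std::string-based snippet, a memcpy() rework might look like the sketch below (my addition; the 32-byte digest is a stand-in for the asker's sha256() output, and uint32_t words are assumed since the code indexes eight of them and shifts by 32):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Folds a 32-byte digest into 64 bits without aliasing violations:
// the bytes are copied into a properly typed array first.
std::uint64_t fold_digest(const std::string& sha) {
    assert(sha.size() >= 32);
    std::uint32_t mod_args[8];
    std::memcpy(mod_args, sha.data(), sizeof(mod_args));
    std::uint32_t a = mod_args[1] ^ mod_args[3] ^ mod_args[5] ^ mod_args[7];
    std::uint32_t b = mod_args[0] ^ mod_args[2] ^ mod_args[4] ^ mod_args[6];
    return (static_cast<std::uint64_t>(a) << 32) | b;
}
```

As above, a decent optimizer will turn the memcpy() into a plain load rather than a library call.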

CPU overhead for struct?

In C/C++, is there any CPU overhead for accessing struct members in comparison to isolated variables?
For a concrete example, should something like the first code sample below use more CPU cycles than the second one? Would it make any difference if it were a class instead of a struct? (in C++)
1)
struct S {
    int a;
    int b;
};

struct S s;
s.a = 10;
s.b = 20;
s.a++;
s.b++;
2)
int a;
int b;
a = 10;
b = 20;
a++;
b++;
"Don't optimize yet." The compiler will figure out the best case for you. Write what makes sense first, and make it faster later if you need to. For fun, I ran the following in Clang 3.4 (-O3 -S):
void __attribute__((used)) StructTest() {
    struct S {
        int a;
        int b;
    };
    volatile struct S s;
    s.a = 10;
    s.b = 20;
    s.a++;
    s.b++;
}

void __attribute__((used)) NoStructTest() {
    volatile int a;
    volatile int b;
    a = 10;
    b = 20;
    a++;
    b++;
}

int main() {
    StructTest();
    NoStructTest();
}
StructTest and NoStructTest have identical ASM output:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $10, -4(%ebp)
movl $20, -8(%ebp)
incl -4(%ebp)
incl -8(%ebp)
addl $8, %esp
popl %ebp
ret
No. The size of all the types in the struct, and thus the offset to each member from the beginning of the struct, is known at compile-time, so the address used to fetch the values in the struct is every bit as knowable as the addresses of individual variables.
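The compile-time-offset point can be seen directly (my sketch; the exact offset of b assumes no padding between two ints, which holds on all common ABIs):

```cpp
#include <cstddef>  // offsetof

struct S {
    int a;
    int b;
};

// Member offsets are compile-time constants: accessing s.b is just
// "base address + constant", exactly like a named variable's address.
static_assert(offsetof(S, a) == 0, "a sits at the start of the struct");
static_assert(offsetof(S, b) == sizeof(int), "b follows a with no padding");
```

If either static_assert failed, the program simply would not compile, which is the point: the compiler knows every offset before any code runs.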
My understanding is that all of the values in a struct are adjacent in memory, and so are better able to take advantage of memory caching than separate variables.
The variables are probably adjacent in memory too, but they're not guaranteed to be adjacent like the struct.
That being said, cpu performance should not be a consideration when deciding whether to use or not use a struct in the first place.
The real answer is: it completely depends on your CPU architecture and your compiler. The best way is to compile and look at the assembly code.
Now for an x86 machine, I'm pretty sure there isn't. The offset is computed at compile time, and there is an addressing mode that takes an offset.
If you ask the compiler to optimize (e.g. compile with gcc -O2 or g++ -O2) then there is not much overhead (probably too small to be measurable, or perhaps a few percent).
However, if you use only local variables, the optimizing compiler might even not allocate slots for them in the local call frame.
Compile with gcc -O2 -fverbose-asm -S and look into the generated assembly code.
Using a class won't make any difference (of course, some class-es have costly constructors & destructors).
Such code can be useful in generated C or C++ code (like MELT does); such local struct-s or class-es contain the local call frame (as seen by the MELT language; see e.g. its gcc/melt/warmelt-genobj+01.cc generated file). I don't claim it is as efficient as real C++ local variables, but it gets optimized well enough.

Is extra time required for float vs int comparisons?

If you have a floating point number double num_float = 5.0; and the following two conditionals.
if (num_float > 3)
{
    //...
}

if (num_float > 3.0)
{
    //...
}
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
Because of the "as-if" rule, the compiler is allowed to do the conversion of the literal to a floating point value at compile time. A good compiler will do so if that results in better code.
In order to answer your question definitively for your compiler and your target platform(s), you'd need to check what the compiler emits, and how it performs. However, I'd be surprised if any mainstream compiler did not turn either of the two if statements into the most efficient code possible.
If the value is a constant, then there shouldn't be any difference, since the compiler will convert the constant to float as part of the compilation [unless the compiler decides to use a "compare float with integer" instruction].
If the value is an integer VARIABLE, then there will be an extra instruction to convert the integer value to a floating point [again, unless the compiler can use a "compare float with integer" instruction].
How much, if any, time that adds to the whole process depends HIGHLY on what processor, how the floating point instructions work, etc, etc.
As with anything where performance really matters, measure the alternatives. Preferably on more than one type of hardware (e.g. both AMD and Intel processors if it's a PC), and then decide which is the better choice. Otherwise, you may find yourself tuning the code to work well on YOUR hardware, but worse on some other hardware. Which isn't a good optimisation - unless the ONLY machine you ever run on is your own.
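A rough harness for such a measurement might look like the sketch below (my addition, not from the answer; time_compare is a made-up name). The volatile input forces a fresh load each iteration so the comparison cannot be hoisted out of the loop, but a serious benchmark would still need warm-up, repetition, and statistics:

```cpp
#include <chrono>
#include <cstdio>

volatile double input = 5.0;  // volatile: the comparison must be redone each pass

// Times 'iters' comparisons against a threshold of type T and returns the
// elapsed nanoseconds; accumulating into 'hits' keeps the loop observable.
template <typename T>
long long time_compare(T threshold, long iters) {
    long hits = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        hits += (input > threshold);
    auto t1 = std::chrono::steady_clock::now();
    if (hits != iters) std::puts("unexpected");  // never taken for input = 5.0
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Comparing the results of time_compare(3, N) and time_compare(3.0, N) for a large N then answers the question for your particular compiler and hardware, which is exactly the advice above.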
Note: This will need to be repeated with your target hardware. The code below just demonstrates nicely what has been said.
with constants:
bool with_int(const double num_float) {
    return num_float > 3;
}

bool with_float(const double num_float) {
    return num_float > 3.0;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
with_float(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
.LC0:
.long 0
.long 1074266112
clang 3.0 (-O3 -march=native):
.LCPI0_0:
.quad 4613937818241073152 # double 3.000000e+00
with_int(double): # #with_int(double)
ucomisd .LCPI0_0(%rip), %xmm0
seta %al
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3.000000e+00
with_float(double): # #with_float(double)
ucomisd .LCPI1_0(%rip), %xmm0
seta %al
ret
Conclusion: No difference if comparing against constants.
with variables:
bool with_int(const double a, const int b) {
    return a > b;
}

bool with_float(const double a, const float b) {
    return a > b;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double, int):
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float):
unpcklps %xmm1, %xmm1
cvtps2pd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
clang 3.0 (-O3 -march=native):
with_int(double, int): # #with_int(double, int)
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float): # #with_float(double, float)
cvtss2sd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
Conclusion: the emitted instructions differ when comparing against variables, as Mats Peterson's answer already explained.
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
A) Just specify it as an integer. Some chips have a special instruction for comparing against an integer at runtime, but that is not important since the compiler will choose what is best. In some cases it might convert it to 3.0 at compile time, depending on the target architecture. In other cases it will leave it as an int. But since you want 3 specifically, just write '3'.
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
A) The compiler will not do anything odd with such code. It will choose the best option, so there should be no time delay. With numeric constants the compiler is free to do whatever is best, as long as it produces the same result. However, you would not want to compound this type of comparison in a while loop at all. Rather, use an integer loop counter: a floating point loop counter is going to be much slower. If you have to use floats as a loop counter, prefer a single-precision 32-bit data type and compare as little as possible.
For example, you could break the problem down into multiple loops.
int x = 0;
float y = 0;
float finc = 0.1;
int total = 1000;
int num_times = total / finc;
num_times -= 2; // safety

// Run the loop in a safe zone using integer compares
while (x < num_times) {
    // Do stuff
    y += finc;
    x++;
}

// Now complete the loop using float compares
while (y < total) {
    y += finc;
}
And that can significantly improve comparison speed.