Is it possible to use the XOR swap algorithm with float values in C++? Wikipedia says that it uses the:
XOR bitwise operation to swap values of distinct variables having the same data type
but I am a bit confused. With this code:
#include <iostream>

void xorSwap(int* x, int* y) {
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}

int main() {
    float a = 255.33333f;
    float b = 0.123023f;
    xorSwap(reinterpret_cast<int*>(&a), reinterpret_cast<int*>(&b));
    std::cout << a << ", " << b << "\n";
    return 0;
}
it seems to work (at least under gcc), but is such a practice actually allowed?
Technically, what you ask is possible but, as IInspectable clearly commented, it causes UB (undefined behaviour).
Anyway, I suggest you use std::swap instead; it's a template, often specialized for specific data types, and designed to do a good job.
If int is the same size as float, it is going to work in practice on any reasonable architecture. The memory holding the float is a set of bits; you reinterpret those bits as an int, swap them completely using xor operations, and then use those bits as the proper floats again. The quote you refer to only says that the two values you swap need to be of the same type, and here both are ints.
However, on some architectures, this can result in a movement between different sorts of registers, or explicitly flushing registers to memory. What you will see on almost any sane architecture with a sane optimizing compiler these days, is that an explicit swap, using std::swap or an expression with a temporary variable, is in fact faster.
I.e. you should write:
float a = 255.33333f;
float b = 0.123023f;
float tmp = a;
a = b;
b = tmp;
or preferably:
float a = 255.33333f;
float b = 0.123023f;
std::swap(a,b);
If the standard library author for your architecture has determined xor swapping to indeed be beneficial, then you should hope that the last form will use it. XOR swapping is a typical bad idiom: it hides intent in an unnecessarily arcane implementation. It was only ever efficient in seriously register-starved cases with bad optimizers.
Your code there invokes undefined behavior. It's not legal in C or C++ to cast a float* to an int* and use it as such. reinterpret_cast should be used to convert between unrelated structs with compatible layouts, or to temporarily convert between a typed pointer and void*.
Oh, and in this particular case the UB is not of merely academic concern. A compiler may notice that xorSwap() doesn't touch any floats, perform optimizations allowed to it by the aliasing rules of the language, and print out the original values of a and b instead of the swapped values. And that's not even getting into architectures where int and float are of different sizes or alignments.
If you wanted to do this safely, you'd have to memcpy() from your floats into unsigned char arrays, do the XOR in a loop, then memcpy() back. Which would of course make the operation slower than a normal swap. Of course, xor-based swapping is ALREADY slower than normal swapping.
It is:
a) possible when the compiler allows it.
b) an operation for which the standard does not define behaviour (i.e. undefined behaviour)
c) on gcc, actually less efficient than stating exactly what you want:
given:
void xorSwap(unsigned int* x, unsigned int* y) {
    *x ^= *y;
    *y ^= *x;
    *x ^= *y;
}

void swapit3(float& a, float& b) {
    xorSwap(reinterpret_cast<unsigned int*>(&a),
            reinterpret_cast<unsigned int*>(&b));
}
results in this:
swapit3(float&, float&): # #swapit3(float&, float&)
mov eax, dword ptr [rdi]
xor eax, dword ptr [rsi]
mov dword ptr [rdi], eax
xor eax, dword ptr [rsi]
mov dword ptr [rsi], eax
xor dword ptr [rdi], eax
ret
whereas this:
void swapit2(float& a, float&b)
{
std::swap(a,b);
}
results in this:
swapit2(float&, float&): # #swapit2(float&, float&)
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rsi]
mov dword ptr [rdi], ecx
mov dword ptr [rsi], eax
ret
link: https://godbolt.org/g/K4cazx
This is the code in question:
struct Cell
{
    Cell* U;
    Cell* D;
    void Detach();
};

void Cell::Detach()
{
    U->D = D;
    D->U = U;
}
clang-14 -O3 produces:
mov rax, qword ptr [rdi] <-- rax = U
mov rcx, qword ptr [rdi + 8] <-- rcx = D
mov qword ptr [rax + 8], rcx <-- U->D = D
mov rcx, qword ptr [rdi + 8] <-- this queries the D field again
mov qword ptr [rcx], rax <-- D->U = U
gcc 11.2 -O3 produces almost the same, but leaves out one mov:
mov rdx, QWORD PTR [rdi]
mov rax, QWORD PTR [rdi+8]
mov QWORD PTR [rdx+8], rax
mov QWORD PTR [rax], rdx
Clang reads the D field twice, while GCC reads it only once and re-uses it. Apparently GCC is not afraid of the first assignment changing anything that has an impact on the second assignment. I'm trying to understand if/when this is allowed.
Checking correctness gets a bit complicated when U or D point at themselves, each other and/or the same target.
My understanding is that the shorter code of GCC is correct if it is guaranteed that the pointers point at the beginning of a Cell (never inside it), regardless of which Cell it is.
Following this line of thought further, this is the case when a) Cells are always aligned to their size, and b) no custom manipulation of such a pointer occurs (referencing and arithmetic are fine).
I suspect case a) is guaranteed by the compiler, and case b) would require invoking undefined behavior of some sort, and as such can be ignored.
This would explain why GCC allows itself this optimization.
Is my reasoning correct? If so, why does clang not make the same optimization?
There are many potential optimizations in C and C++ that are usually safe, but aren't quite sound. If one regards the -> operator as being usable to build a standard-layout object without having to use placement new on it first (an abstraction model that a lot of code relies upon, whether or not the Standard mandates support), removing the if (mode) in the following C and C++ functions would be such an optimization.
C version:
struct s { int x,y; }; /* Assume int is 4 bytes, and the struct is 8 */
void test(struct s *p1, struct s *p2, int mode)
{
    p1->y = 1;
    p2->x = 2;
    if (mode)
        p1->y = 1;
}
C++ version:
#include <new>
struct s { int x,y; };
void test(void *vp1, void *vp2, int mode)
{
    if (1)
    {
        struct s* p1 = new (vp1) struct s;
        p1->y = 1;
    }
    if (1)
    {
        struct s* p2 = new (vp2) struct s;
        p2->x = 2;
    }
    if (mode)
    {
        struct s* p3 = new (vp1) struct s;
        p3->y = 1;
    }
}
The optimization would be correct unless the address in p2 is four bytes higher than p1. Under the "traditional" abstraction model used in C or C++, if the address of p1 happens to be 0x1000 and that of p2 happens to be 0x1004, the first assignment would cause addresses 0x1000-0x1007 to hold a struct s, if it didn't already, whose second member (at address 0x1004) would equal 1. The second assignment, by overwriting that object, would end its lifetime and cause addresses 0x1004 to 0x100B to hold a struct s whose first member would equal 2. The third assignment, if executed, would end the lifetime of that second object and re-create the first.
If the third assignment is executed, there would be an object at address 0x1000 whose second field (at address 0x1004) would hold the readable value 1. If the assignment is skipped, there would be an object at address 0x1004 whose first field would hold the value 2. Behavior would be defined in both cases, and a compiler that didn't know which case would apply would have to accommodate both of them by making the value at 0x1004 depend upon mode.
As it happens, the authors of clang do not seem to have provided for that corner case, and thus omit the conditional check. While I think the Standard should use an abstraction model that would allow such optimization, while also supporting the common structure-creation pattern in situations that don't involve weird aliasing corner cases, I don't see any way of interpreting the Standard that would allow for such optimization without allowing compilers to arbitrarily break a large amount of existing code.
I don't think there's any general way of knowing when a decision by gcc or clang not to perform a particular optimization reflects a recognition of potential corner cases where the optimization would be incorrect (and an inability to prove that none of them apply), and when it simply represents an oversight which may later be "corrected" so as to replace correct behavior with an unsound optimization.
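Coming back to the Detach example: if the goal is simply to get the shorter four-instruction sequence out of both compilers, the loads can be hoisted into locals by hand. This is a sketch that is equivalent to the original under the question's own assumption that U and D always point at the start of a Cell:

```cpp
struct Cell
{
    Cell* U;
    Cell* D;
    void Detach();
};

// Manual version of GCC's optimization: load both fields exactly once,
// then perform the two stores from locals. No second read of D occurs,
// so the codegen no longer depends on the compiler's aliasing analysis.
void Cell::Detach()
{
    Cell* u = U;
    Cell* d = D;
    u->D = d;
    d->U = u;
}
```

Both clang and gcc compile this to two loads and two stores, since the reload question never arises in the source.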
How much would performance differ between these two situations?
int func(int a, int b) { return a + b; }
And
void func(int a, int b, int * c) { *c = a + b; }
Now, what if it's a struct?
typedef struct { int a; int b; char c; } my;
my func(int a, int b, char c) { my x; x.a = a; x.b = b; x.c = c; return x; }
And
void func(int a, int b, int c, my * x) { x->a = a; x->b = b; x->c = c; }
One thing I can think of is that a register cannot be used for this purpose, correct? Other than that, I am unaware of how this function would turn out after going through a compiler.
Which would be more efficient and speedy?
If the function can inline, there is often no difference between the first two.
Otherwise (no inlining because of no link-time optimization), returning an int by value is more efficient because it's just a value in a register that can be used right away. Also, the caller doesn't have to pass as many args, or find/make space to point at. If the caller does want to use the output value, it will have to reload it, introducing latency in the total dependency chain from inputs ready to output ready. (Store-forwarding latency is ~5 cycles on modern x86 CPUs, vs. 1 cycle latency for the lea eax, [rdi + rsi] that would implement that function for x86-64 System V.)
The exception is maybe for rare cases where the caller isn't going to use the value, just wants it in memory at some address. Passing that address to the callee (in a register) so it can be used there means the caller doesn't have to keep that address anywhere that will survive across the function call.
For the struct version:
a register cannot be used for this purpose, correct?
No, for some calling conventions, small structs can be returned in registers.
x86-64 System V will return your my struct by value in the RDX:RAX register pair because it's less than 16 bytes and all integer. (And trivially copyable.) Try it on https://godbolt.org/z/x73cEh:
# clang11.0 -O3 for x86-64 SysV
func_val:
shl rsi, 32
mov eax, edi
or rax, rsi # (uint64_t)b<<32 | a; the low 64 bits of the struct
# c was already in EDX, the low half of RDX; clang leaves it there.
ret
func_out:
mov dword ptr [rcx], edi
mov dword ptr [rcx + 4], esi # just store the struct members
mov byte ptr [rcx + 8], dl # to memory pointed-to by 4th arg
ret
GCC doesn't assume that char c is correctly sign-extended to EDX the way clang does (unofficial ABI feature). GCC does a really dumb byte store / dword reload that creates a store-forwarding stall, to get uninitialized garbage from memory instead of from high bytes of EDX. Purely a missed optimization, but see it in https://godbolt.org/z/WGcqKc. It also insanely uses SSE2 to merge the two integers into a 64-bit value before doing a movq rax, xmm0, or to memory for the output-arg.
You definitely want the struct version to inline if the caller uses the values, so this packing into return-value registers can be optimized away.
How does function ACTUALLY return struct variable in C? has an ARM example for a larger struct: return by value passes a hidden pointer to the caller's return-value object. From there, it may need to be copied by the caller if assigning to something that escape analysis can't prove is private. (e.g. through some pointer). What prevents the usage of a function argument as hidden pointer?
Also related: Why is tailcall optimization not performed for types of class MEMORY?
How do C compilers implement functions that return large structures? points out that code-gen may differ between C and C++.
I don't know how to explain any general rule of thumb that one could apply without understand asm and the calling convention you care about. Usually pass/return large structs by reference, but for small structs it's very much "it depends".
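As a rough illustration of the "small structs by value" side of that rule of thumb, here is a sketch (MinMax and minmax are made-up names): an 8-byte, all-integer, trivially copyable struct comes back in registers under x86-64 System V, so returning it costs no memory traffic at all.

```cpp
#include <cstdint>

// Small and trivially copyable: under x86-64 SysV this whole struct
// fits in a single 64-bit return register.
struct MinMax { std::int32_t min; std::int32_t max; };

MinMax minmax(std::int32_t a, std::int32_t b) {
    // Return by value; the caller receives both fields directly in
    // registers, with no hidden output pointer involved.
    return (a < b) ? MinMax{a, b} : MinMax{b, a};
}
```

For a struct larger than 16 bytes the same convention switches to a hidden pointer to caller-provided storage, which is where pass/return by reference starts to win.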
I have to implement a "bandpass" filter. Let a and b denote two integers that induce a half-open interval [a, b). If some argument x lies within this interval (i.e., a <= x < b), I return a pointer to a C string const char* high, otherwise I return a pointer const char* low. The vanilla implementation of this function looks like
const char* vanilla_bandpass(int a, int b, int x, const char* low,
                             const char* high)
{
    const bool withinInterval { (a <= x) && (x < b) };
    return (withinInterval ? high : low);
}
which when compiled with -O3 -march=znver2 on Godbolt gives the following assembly code
vanilla_bandpass(int, int, int, char const*, char const*):
mov rax, r8
cmp edi, edx
jg .L4
cmp edx, esi
jge .L4
ret
.L4:
mov rax, rcx
ret
Now, I've looked into creating a version without a jump/branch, which looks like this
#include <cstdint>
const char* funky_bandpass(int a, int b, int x, const char* low,
const char* high)
{
const bool withinInterval { (a <= x) && (x < b) };
const auto low_ptr = reinterpret_cast<uintptr_t>(low) * (!withinInterval);
const auto high_ptr = reinterpret_cast<uintptr_t>(high) * withinInterval;
const auto ptr_sum = low_ptr + high_ptr;
const auto* result = reinterpret_cast<const char*>(ptr_sum);
return result;
}
which is ultimately just a "blend" of the two pointers. Using the same options as before, this code compiles to
funky_bandpass(int, int, int, char const*, char const*):
mov r9d, esi
cmp edi, edx
mov esi, edx
setle dl
cmp esi, r9d
setl al
and edx, eax
mov eax, edx
and edx, 1
xor eax, 1
imul rdx, r8
movzx eax, al
imul rcx, rax
lea rax, [rcx+rdx]
ret
While at first glance, this function has more instructions, careful benchmarking shows that it's 1.8x to 1.9x faster than the vanilla_bandpass implementation.
Is this use of uintptr_t valid and free of undefined behavior? I'm well aware that the language around uintptr_t is vague and ambiguous to say the least, and that anything that isn't explicitly specified in the standard (like arithmetic on uintptr_t) is generally considered undefined behavior. On the other hand, in many cases, the standard explicitly calls out when something has undefined behavior, which it doesn't do in this case. I'm aware that the "blending" that happens when adding together low_ptr and high_ptr touches on topics such as pointer provenance, which is a murky topic in and of itself.
Is this use of uintptr_t valid and free of undefined behavior?
Yes. Conversion from pointer to integer (of sufficient size such as uintptr_t) is well defined, and integer arithmetic is well defined.
Another thing to be wary about is whether converting a modified uintptr_t back to a pointer gives back what you want. Only guarantee given by the standard is that pointer converted to integer converted back gives the same address. Luckily, this guarantee is sufficient for you, because you always use the exact value from a converted pointer.
If you were using something other than pointer to narrow character, I think you would need to use std::launder on the result of the conversion.
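As a side note, if the goal is just branch-free selection without any pointer-to-integer arithmetic at all, indexing a two-element array with the bool is a simpler variant that compilers commonly turn into branchless code. A sketch (lookup_bandpass is a made-up name; whether it actually beats the multiply version would need benchmarking):

```cpp
// Branch-free selection with no uintptr_t arithmetic: a bool converts
// to 0 or 1, so it can index a two-element table directly.
const char* lookup_bandpass(int a, int b, int x, const char* low,
                            const char* high)
{
    const char* table[2] = { low, high };
    return table[(a <= x) && (x < b)];
}
```

Since the pointers are never converted to integers, none of the provenance questions above arise.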
The Standard doesn't require that implementations process uintptr_t-to-pointer conversions in useful fashion even in cases where the uintptr_t values are produced from pointer-to-integer conversions. Given e.g.
extern int x[5],y[5];
int *px5 = x+5, *py0 = y;
the pointers px5 and py0 might compare equal, and regardless of whether they do or not, code may use px5[-1] to access x[4], or py0[0] to access y[0], but may not access px5[0] nor py0[-1]. If the pointers happen to be equal, and code attempts to access ((int*)(uintptr_t)px5)[-1], a compiler could replace (int*)(uintptr_t)px5) with py0, since that pointer would compare equal to px5, but then jump the rails when attempting to access py0[-1]. Likewise if code tries to access ((int*)(uintptr_t)py0)[0], a compiler could replace (int*)(uintptr_t)py0 with px5, and then jump the rails when attempting to access px5[0].
While it may seem obtuse for a compiler to do such a thing, clang gets even crazier. Consider:
#include <stdint.h>
extern int x[], y[];
int test(int i)
{
    y[0] = 1;
    uintptr_t px = (uintptr_t)(x+5);
    uintptr_t py = (uintptr_t)(y+i);
    int flag = (px==py);
    if (flag)
        y[i] = 2;
    return y[0];
}
If px and py are coincidentally equal and i is zero, that will cause clang to set y[0] to 2 but return 1. See https://godbolt.org/z/7Sa_KZ for generated code ("mov eax,1 / ret" means "return 1").
Moving a member variable to a local variable reduces the number of writes in this loop despite the presence of the __restrict keyword. This is using GCC -O3. Clang and MSVC optimise the writes in both cases. [Note that since this question was posted we observed that adding __restrict to the calling function caused GCC to also move the store out of the loop. See the godbolt link below and the comments]
class X
{
public:
    void process(float* __restrict d, int size)
    {
        for (int i = 0; i < size; ++i)
        {
            d[i] = v * c + d[i];
            v = d[i];
        }
    }

    void processFaster(float* __restrict d, int size)
    {
        float lv = v;
        for (int i = 0; i < size; ++i)
        {
            d[i] = lv * c + d[i];
            lv = d[i];
        }
        v = lv;
    }

    float c{0.0f};
    float v{0.0f};
};
With gcc -O3 the first one has an inner loop that looks like:
.L3:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rax, rdi
movss DWORD PTR x[rip+4], xmm0 ;<<< the extra store
jne .L3
.L1:
rep ret
The second here:
.L8:
mulss xmm0, xmm1
add rdi, 4
addss xmm0, DWORD PTR [rdi-4]
movss DWORD PTR [rdi-4], xmm0
cmp rdi, rax
jne .L8
.L7:
movss DWORD PTR x[rip+4], xmm0
ret
See https://godbolt.org/g/a9nCP2 for the complete code.
Why does the compiler not perform the lv optimisation here?
I'm assuming the 3 memory accesses per loop are worse than the 2 (assuming size is not a small number), though I've not measured this yet.
Am I right to make that assumption?
I think the observable behaviour should be the same in both cases.
This seems to be caused by the missing __restrict qualifier on the f_original function. __restrict is a GCC extension; it is not quite clear how it is expected to behave in C++. Maybe it is a compiler bug (missed optimization) that it appears to disappear after inlining.
The two methods are not identical. In the first, the value of v is updated multiple times during the execution. That may be or may not be what you want, but it is not the same as the second method, so it is not something the compiler can decide for itself as a possible optimization.
The restrict keyword says there is no aliasing with anything else, in effect same as if the value had been local (and no local had any references to it).
In the second case there is no externally visible effect of v, so it doesn't need to be stored.
In the first case there is a potential that something external might see it. The compiler doesn't at this time know that there will be no threads that could change it, but it knows that it doesn't have to read it, as it's neither atomic nor volatile. And the change of d[], another externally visible object, makes the store necessary.
If the compiler writers' reasoning were "neither d nor v is volatile or atomic, so we can just do it all under 'as-if'", then the compiler would have to be sure no one can touch v at all. I'm pretty sure this will come in one of the new versions, as there is no synchronisation before the return, and that will be the case in 99+% of uses anyway. Programmers would then have to mark variables that are changed as volatile or atomic, which I think I could live with.
I'd like keep track of what is essentially "type" information at compile time for a few functions which currently take arguments of the same type. Here's an example; Say I have two functions getThingIndex(uint64_t t) and getThingAtIndex(uint64_t tidx). The first function treats the argument as an encoding of the thing, does a non-trivial computation of an index, and returns it. One can then get the actual "thing" by calling getThingAtIndex. getThingAtIndex, on the other hand, assumes you're querying the structure and already have an index. The latter of the two methods is faster, but, more importantly, I want to avoid the headaches that might result from passing a thing to getThingAtIndex or by passing an index to getThingIndex.
I was thinking about creating types for thing and thing index sort of like so:
struct Thing { uint64_t thing; };
struct ThingIndex { uint64_t idx; };
And then changing the signatures of the functions above to be
getThingIndex(Thing t)
getThingAtIndex(ThingIndex idx)
Now, despite the fact that Thing and ThingIndex encode the same
underlying type, they are nonetheless distinct at compile time and I
have less opportunity to make stupid mistakes by passing an index to
getThingIndex or a thing to getThingAtIndex.
However, I'm concerned about the overhead of this approach. The functions
are called many (10s-100s of millions of) times, and I'm curious if the
compiler will optimize away the creation of these structures which essentially
do nothing but encode compile-time type information. If the compiler won't
perform such an optimization, is there a way to create these types of "rich types"
with zero overhead?
Take a look at the disassembly.
unsigned long long * x = new unsigned long long;
0110784E push 8
01107850 call operator new (01102E51h)
01107855 add esp,4
01107858 mov dword ptr [ebp-0D4h],eax
0110785E mov eax,dword ptr [ebp-0D4h]
01107864 mov dword ptr [x],eax
*x = 5;
01107867 mov eax,dword ptr [x]
0110786A mov dword ptr [eax],5
01107870 mov dword ptr [eax+4],0
And the struct.
struct Thing { unsigned long long a; };
Thing * thing = new Thing;
0133784E push 8
01337850 call operator new (01332E51h)
01337855 add esp,4
01337858 mov dword ptr [ebp-0D4h],eax
0133785E mov eax,dword ptr [ebp-0D4h]
01337864 mov dword ptr [thing],eax
thing->a = 5;
01337867 mov eax,dword ptr [thing]
0133786A mov dword ptr [eax],5
01337870 mov dword ptr [eax+4],0
There is no difference between the two instruction sequences. The compiler doesn't care that a is a member of a struct; it accesses thing->a just as if you had declared a plain unsigned long long.
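To connect this back to the question's proposed types, here is a minimal sketch of the two wrappers (the type names come from the question; the function bodies are made-up placeholders just so something compiles). Because each is a trivially copyable struct with a single uint64_t member, common calling conventions pass and return it exactly like a bare uint64_t, while mixing up a Thing and a ThingIndex becomes a compile-time error:

```cpp
#include <cstdint>

// Zero-overhead strong typedefs: one member, no virtuals, trivially
// copyable, so the ABI treats each exactly like a bare uint64_t.
struct Thing      { std::uint64_t value; };
struct ThingIndex { std::uint64_t value; };

// Placeholder index computation, standing in for the real one.
ThingIndex getThingIndex(Thing t) {
    return ThingIndex{ t.value >> 4 };
}

// Placeholder lookup, standing in for the real one.
std::uint64_t getThingAtIndex(ThingIndex idx) {
    return idx.value;
}

// getThingAtIndex(Thing{42});  // would not compile: wrong "rich type"
```

At any optimization level the wrappers dissolve entirely; only the type checking survives to compile time.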