I have to implement a "bandpass" filter. Let a and b denote two integers that induce a half-open interval [a, b). If some argument x lies within this interval (i.e., a <= x < b), I return a pointer to a C string const char* high, otherwise I return a pointer const char* low. The vanilla implementation of this function looks like
const char* vanilla_bandpass(int a, int b, int x, const char* low,
const char* high)
{
const bool withinInterval { (a <= x) && (x < b) };
return (withinInterval ? high : low);
}
which when compiled with -O3 -march=znver2 on Godbolt gives the following assembly code
vanilla_bandpass(int, int, int, char const*, char const*):
mov rax, r8
cmp edi, edx
jg .L4
cmp edx, esi
jge .L4
ret
.L4:
mov rax, rcx
ret
Now, I've looked into creating a version without a jump/branch, which looks like this
#include <cstdint>
const char* funky_bandpass(int a, int b, int x, const char* low,
const char* high)
{
const bool withinInterval { (a <= x) && (x < b) };
const auto low_ptr = reinterpret_cast<uintptr_t>(low) * (!withinInterval);
const auto high_ptr = reinterpret_cast<uintptr_t>(high) * withinInterval;
const auto ptr_sum = low_ptr + high_ptr;
const auto* result = reinterpret_cast<const char*>(ptr_sum);
return result;
}
which is ultimately just a "chord" between two pointers. Using the same options as before, this code compiles to
funky_bandpass(int, int, int, char const*, char const*):
mov r9d, esi
cmp edi, edx
mov esi, edx
setle dl
cmp esi, r9d
setl al
and edx, eax
mov eax, edx
and edx, 1
xor eax, 1
imul rdx, r8
movzx eax, al
imul rcx, rax
lea rax, [rcx+rdx]
ret
While at first glance, this function has more instructions, careful benchmarking shows that it's 1.8x to 1.9x faster than the vanilla_bandpass implementation.
Is this use of uintptr_t valid and free of undefined behavior? I'm well aware that the language around uintptr_t is vague and ambiguous to say the least, and that anything that isn't explicitly specified in the standard (like arithmetic on uintptr_t) is generally considered undefined behavior. On the other hand, in many cases, the standard explicitly calls out when something has undefined behavior, which it also doesn't do in this case. I'm aware that the "blending" that happens when adding together low_ptr and high_ptr touches on topics such as pointer provenance, which is a murky topic in and of itself.
Is this use of uintptr_t valid and free of undefined behavior?
Yes. Conversion from pointer to integer (of sufficient size such as uintptr_t) is well defined, and integer arithmetic is well defined.
Another thing to be wary of is whether converting a modified uintptr_t back to a pointer gives back what you want. The only guarantee given by the standard is that a pointer converted to an integer and then converted back yields the original pointer value. Luckily, this guarantee is sufficient for you, because you always use the exact value from a converted pointer.
If you were using something other than pointer to narrow character, I think you would need to use std::launder on the result of the conversion.
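A minimal sketch of the property this answer relies on: a bool converts to 0 or 1, so after the multiplications exactly one converted pointer value survives unmodified and the sum is always one of the two original integer values. The name select_ptr is hypothetical, and the sketch assumes the implementation provides the optional type std::uintptr_t. Whether a compiler emits branch-free code for it is not guaranteed.

```cpp
#include <cstdint>

// Hypothetical helper: returns `high` when pick_high is true, `low` otherwise,
// without a conditional expression in the source.
const char* select_ptr(bool pick_high, const char* low, const char* high)
{
    const auto low_bits  = reinterpret_cast<std::uintptr_t>(low);
    const auto high_bits = reinterpret_cast<std::uintptr_t>(high);
    // Multiplying by 1 keeps the value, multiplying by 0 discards it,
    // so `chosen` is exactly low_bits or exactly high_bits.
    const std::uintptr_t chosen =
        low_bits  * static_cast<std::uintptr_t>(!pick_high) +
        high_bits * static_cast<std::uintptr_t>(pick_high);
    return reinterpret_cast<const char*>(chosen);
}
```

Because the integer passed to the final reinterpret_cast is always the unmodified result of an earlier pointer conversion, the round-trip guarantee applies to it.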
The Standard doesn't require that implementations process uintptr_t-to-pointer conversions in useful fashion even in cases where the uintptr_t values are produced from pointer-to-integer conversions. Given e.g.
extern int x[5],y[5];
int *px5 = x+5, *py0 = y;
the pointers px5 and py0 might compare equal, and regardless of whether they do or not, code may use px5[-1] to access x[4], or py0[0] to access y[0], but may not access px5[0] nor py0[-1]. If the pointers happen to be equal, and code attempts to access ((int*)(uintptr_t)px5)[-1], a compiler could replace (int*)(uintptr_t)px5 with py0, since that pointer would compare equal to px5, but then jump the rails when attempting to access py0[-1]. Likewise if code tries to access ((int*)(uintptr_t)py0)[0], a compiler could replace (int*)(uintptr_t)py0 with px5, and then jump the rails when attempting to access px5[0].
While it may seem obtuse for a compiler to do such a thing, clang gets even crazier. Consider:
#include <stdint.h>
extern int x[],y[];
int test(int i)
{
y[0] = 1;
uintptr_t px = (uintptr_t)(x+5);
uintptr_t py = (uintptr_t)(y+i);
int flag = (px==py);
if (flag)
y[i] = 2;
return y[0];
}
If px and py are coincidentally equal and i is zero, that will cause clang to set y[0] to 2 but return 1. See https://godbolt.org/z/7Sa_KZ for generated code ("mov eax,1 / ret" means "return 1").
I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies. If you are writing optimised code then avoid any code that can cause branches, as you can cause unwanted cpu pipeline dumps due to failed branch prediction.
Bear in mind also that there are not bool and int types as such in assembler: just registers, so you will probably find that all conversions will be optimised out. Therefore
b += a;
wins for me; it's also clearer.
Compilers are allowed to assume that the underlying value of a bool isn't messed up, so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
if(a){b++;}
return b;
}
int with_if_char(unsigned char a, int b) {
if(a){b++;}
return b;
}
int without_if(bool a, int b) {
b += a;
return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, and instead generate actual comparisons with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc will instead treat bool just as if it were an unsigned char, without exploiting its properties, generating code similar to clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ will treat the bool and the unsigned char versions equally, just as gcc, although with more naive codegen (it uses a conditional move instead of performing arithmetic with the flags register, which IIRC traditionally used to be less efficient, don't know for current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler, and the results point to this specific code (or to some tight loop that involves it); in general you should write sensible, semantically correct code and focus on using the correct algorithms/data structures. Micro-optimization comes later.
In my program, this wouldn't work, as a is actually an operation of the type: b+=(a==c)
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.
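As a small illustration of the b += (a == c) pattern, here is a sketch that counts matching elements in a loop. The function name count_equal and its parameters are hypothetical; the generated code still depends on the compiler and flags.

```cpp
#include <cstddef>

// Hypothetical example: counts how many elements equal `target`.
// The comparison yields a bool, which converts to 0 or 1 when added,
// so the loop body contains no if-statement.
int count_equal(const int* values, std::size_t n, int target)
{
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += (values[i] == target);
    }
    return count;
}
```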
Consider the following two useless C++ functions.
Compiled with GCC (4.9.2, 32- or 64-bit), both functions return the same value, as expected.
Compiled with Visual Studio 2010 or Visual Studio 2017 (unmanaged code), the two functions return different values.
What I've tried:
brackets, brackets, brackets
explicit casts to char
sizeof(char) is evaluated to 1
debug / release version
32- / 64-bit
What's going on here? It seems to be a fundamental bug in VS.
char test1()
{
char buf[] = "The quick brown fox...", *pbuf = buf;
char value = (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0);
return value;
}
char test2()
{
char buf[] = "The quick brown fox...", *pbuf = buf;
char a = *(pbuf++) & 0x0F;
char b = *(pbuf++) & 0xF0;
char value = a | b;
return value;
}
Edit:
It's not an attempt to blame VS (as mentioned in the posts).
It's not a matter of signed or unsigned.
It's not a matter of the order of evaluation of the left and right sides of the or-operator. Changing the order of the assignments of a and b in test2() yields a third result.
But the simultaneity is a good point. It seems the ordering of evaluation is defined to be undefined. In a first step, the generated code evaluates the complete expression in test1() without incrementing any pointer. In a second step the pointers will be incremented. Since the incrementation has no effect and the data remains unchanged after this specific operation, the optimizer will remove the code.
Sorry for inconveniences, but this is not what i would expect. In no language.
For completeness, here the disassembled code of test1():
0028102A mov ecx,dword ptr [ebp-8]
0028102D movsx edx,byte ptr [ecx]
00281030 and edx,0Fh
00281033 mov eax,dword ptr [ebp-8]
00281036 movsx ecx,byte ptr [eax]
00281039 and ecx,0F0h
0028103F or edx,ecx
00281041 mov byte ptr [ebp-1],dl
00281044 mov edx,dword ptr [ebp-8]
00281047 add edx,1
0028104A mov dword ptr [ebp-8],edx
0028104D mov eax,dword ptr [ebp-8]
00281050 add eax,1
00281053 mov dword ptr [ebp-8],eax
The behaviour of (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0); is undefined. | (unlike ||) does not introduce a sequence point, so the two modifications of pbuf are unsequenced relative to each other.
Not a VS bug therefore. (Such things rarely are: a golden rule is not to blame the compiler.)
(Note also that char can be either signed or unsigned. That can introduce differences in code like yours.)
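A hedged sketch of a well-defined rewrite of test1(): each read-and-increment gets its own statement, which fixes the order, and unsigned char sidesteps the signedness question. The name test1_sequenced is hypothetical, and the sketch assumes the intent matches test2() (low nibble of the first character, high nibble of the second).

```cpp
// Hypothetical rewrite of test1(): the two increments are in separate
// statements, so they are sequenced one after the other.
unsigned char test1_sequenced()
{
    const unsigned char buf[] = "The quick brown fox...";
    const unsigned char* pbuf = buf;
    const unsigned char first  = *pbuf++;  // reads 'T', then advances
    const unsigned char second = *pbuf++;  // reads 'h', then advances
    return static_cast<unsigned char>((first & 0x0F) | (second & 0xF0));
}
```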
Is it possible to use XOR swapping algorithm with float values in c++?
However wikipedia says that:
XOR bitwise operation to swap values of distinct variables having the same data type
but I am bit confused. With this code:
#include <iostream>
void xorSwap (int* x, int* y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
int main() {
float a = 255.33333f;
float b = 0.123023f;
xorSwap(reinterpret_cast<int*>(&a), reinterpret_cast<int*>(&b));
std::cout << a << ", " << b << "\n";
return 0;
}
it seems to work (at least under gcc), but I am unsure whether such a practice is actually allowed.
Technically, what you ask is possible but, as clearly commented by IInspectable, it causes UB (Undefined Behaviour).
Anyway, I suggest you use std::swap instead; it's a template often specialized for specific data types and designed to do a good job.
If int is the same size as float, it is going to work in practice on any reasonable architecture. The memory in the float is a set of bits; you interpret those bits as ints and swap them completely using xor operations. You can then use those bits as the proper floats again. The quote you refer to only says that the two values you swap need to be of the same type, and both are ints.
However, on some architectures, this can result in a movement between different sorts of registers, or explicitly flushing registers to memory. What you will see on almost any sane architecture with a sane optimizing compiler these days, is that an explicit swap, using std::swap or an expression with a temporary variable, is in fact faster.
I.e. you should write:
float a = 255.33333f;
float b = 0.123023f;
float tmp = a;
a = b;
b = tmp;
or preferably:
float a = 255.33333f;
float b = 0.123023f;
std::swap(a,b);
If the standard library author for your architecture has determined xor swapping to indeed be beneficial, then you should hope that the last form will use it. xor swapping is a typical bad idiom in terms of hiding intent in an unnecessarily arcane implementation. It was only ever efficient in seriously register-starved cases with bad optimizers.
Your code there invokes undefined behavior. It's not legal in C or C++ to cast a float* to an int* and use it as such. reinterpret_cast should be used to convert between unrelated structs with compatible layouts, or to temporarily convert between a typed pointer and void*.
Oh, and in this particular case the UB is not of merely academic concern. A compiler may notice that xorSwap() doesn't touch any floats, perform optimizations allowed to it by the aliasing rules of the language, and print out the original values of a and b instead of the swapped values. And that's not even getting into architectures where int and float are of different sizes or alignments.
If you wanted to do this safely, you'd have to memcpy() from your floats into unsigned char arrays, do the XOR in a loop, then memcpy() back. Which would of course make the operation slower than a normal swap. Of course, xor-based swapping is ALREADY slower than normal swapping.
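The memcpy() approach described above could look roughly like the following sketch. The function name xor_swap_bytes is hypothetical; it is only meant to show the shape of a well-defined version, not to suggest it is faster than std::swap.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: XOR-swaps the object representations of two floats
// through unsigned char buffers, which avoids accessing a float as an int.
void xor_swap_bytes(float& a, float& b)
{
    unsigned char bytes_a[sizeof(float)];
    unsigned char bytes_b[sizeof(float)];
    std::memcpy(bytes_a, &a, sizeof(float));
    std::memcpy(bytes_b, &b, sizeof(float));
    // Classic three-step XOR swap, applied byte by byte to the copies.
    for (std::size_t i = 0; i < sizeof(float); ++i) {
        bytes_a[i] ^= bytes_b[i];
        bytes_b[i] ^= bytes_a[i];
        bytes_a[i] ^= bytes_b[i];
    }
    std::memcpy(&a, bytes_a, sizeof(float));
    std::memcpy(&b, bytes_b, sizeof(float));
}
```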
It is:
a) possible when the compiler allows it.
b) an operation for which the standard does not define behaviour (i.e. undefined behaviour)
c) on gcc, actually less efficient than stating exactly what you want:
given:
void xorSwap (unsigned int* x, unsigned int* y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
void swapit3(float& a, float&b)
{
xorSwap(reinterpret_cast<unsigned int*>(&a), reinterpret_cast<unsigned int*>(&b));
}
results in this:
swapit3(float&, float&): # #swapit3(float&, float&)
mov eax, dword ptr [rdi]
xor eax, dword ptr [rsi]
mov dword ptr [rdi], eax
xor eax, dword ptr [rsi]
mov dword ptr [rsi], eax
xor dword ptr [rdi], eax
ret
whereas this:
void swapit2(float& a, float&b)
{
std::swap(a,b);
}
results in this:
swapit2(float&, float&): # #swapit2(float&, float&)
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rsi]
mov dword ptr [rdi], ecx
mov dword ptr [rsi], eax
ret
link: https://godbolt.org/g/K4cazx
I was looking at the assembly Visual Studio generated for this simple x64 program:
struct Point {
int a, b;
Point() {
a = 0; b = 1;
}
};
int main(int argc, char* argv[])
{
Point arr[3];
arr[0].b = 2;
return 0;
}
And when it meets arr[0].b = 2, it generates this:
mov eax, 8
imul rax, rax, 0
mov dword ptr [rbp+rax+4],2
Why does it do imul rax, rax, 0 instead of a simple mov rax, 0, or even xor rax, rax? How is imul more efficient, if at all?
Kamac
The reason is that the assembly is calculating both the offset of the Point object in the array, which happens to be on the stack, and the offset of the member variable b.
Intel's documentation for imul with three (3) operands states:
Three-operand form — This form requires a destination operand (the
first operand) and two source operands (the second and the third
operands). Here, the first source operand (which can be a
general-purpose register or a memory location) is multiplied by the
second source operand (an immediate value). The intermediate product
(twice the size of the first source operand) is truncated and stored
in the destination operand (a general-purpose register).
In your case it is calculating the offset of the object in the array which results in addressing the first (zeroth) Point location on the stack. Having that resolved it is then adding the offset for .b which is the +4. So broken down:
mov eax,8 ; load sizeof(Point) (8 bytes), the stride of the Point array
imul rax, rax, 0 ; multiply by the index (0) to get the byte offset of arr[0]
mov dword ptr [rbp+rax+4],2 ; add the offset of b (+4) and store the value 2
All of which resolves to arr[0].b = 2.
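The same arithmetic can be written out in C++ as a sketch. The helper offset_of_b is hypothetical, and the values 8 and 4 assume a 4-byte int with no padding in Point, which matches the assembly shown.

```cpp
#include <cstddef>

struct Point {
    int a, b;
    Point() { a = 0; b = 1; }
};

// The sketch assumes the layout implied by the assembly in the question.
static_assert(sizeof(Point) == 8, "sketch assumes two 4-byte ints, no padding");

// Hypothetical helper mirroring the unoptimized address computation:
// byte offset of arr[index].b from the start of the array.
constexpr std::size_t offset_of_b(std::size_t index)
{
    return sizeof(Point) * index   // imul rax, rax, <index> with rax = 8
         + offsetof(Point, b);     // the +4 in [rbp+rax+4]
}
```

For index 0 the multiplication contributes nothing, which is why an optimizing build folds the whole computation into a constant displacement.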
I take it you did not compile with aggressive optimization. When going with a straight compile (no optimization, debug on, etc.) the compiler is not making any assumptions with respect to addressing.
Comparison to clang
On OS X (El Capitan) with clang 3.9.0 and no optimization flags, once the Point objects are instantiated in the array, the assignment of .b = 2 is simply:
mov dword ptr [rbp - 44], 2
In this case, clang is pretty smart about the offsets and resolves addressing during the default optimization.
I want dLower and dHigher to have the lower and higher of two double values, respectively, i.e. to sort them if they are the wrong way around. The most immediate answer, it seems, is:
void ascending(double& dFirst, double& dSecond)
{
if(dFirst > dSecond)
swap(dFirst,dSecond);
}
ascending(dFoo, dBar);
But it seems like such an obvious thing to do I wondered if I'm just not using the right terminology to find a standard routine.
Also, how would you make that generic?
This is a good way of approaching it. It is as efficient as you are going to get. I doubt that this specific function has a generally recognized name. This is apparently called comparison-swap.
Generalizing it on type is as easy as:
template <typename T>
void ascending(T& dFirst, T& dSecond)
{
if (dFirst > dSecond)
std::swap(dFirst, dSecond);
}
Corroborating this function:
int main() {
int a=10, b=5;
ascending(a, b);
std::cout << a << ", " << b << std::endl;
double c=7.2, d=3.1;
ascending(c, d);
std::cout << c << ", " << d << std::endl;
return 0;
}
This prints:
5, 10
3.1, 7.2
Playing the "extremely generic" game:
#include <functional> // std::less
#include <utility> // std::swap
template <typename T, typename StrictWeakOrdering>
void comparison_swap(T &lhs, T &rhs, StrictWeakOrdering cmp) {
using std::swap;
if (cmp(rhs, lhs)) {
swap(lhs, rhs);
}
}
template <typename T>
void comparison_swap(T &lhs, T &rhs) {
comparison_swap(lhs, rhs, std::less<T>());
}
This ticks the following boxes:
Uses a less-than comparator, which is more likely to be readily available for a user-defined type, since it's used in standard algorithms.
Comparator is optionally configurable, and defaults to something sensible (you could use std::greater<T> as the default if you prefer and modify accordingly). It's also guaranteed valid for arbitrary pointers of the same type, which operator< isn't.
Uses either a specialization of std::swap, or a swap function found by ADL, just in case the type T provides one but not the other.
There may be some boxes I've forgotten about, though.
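A short usage sketch for the two overloads above, repeated here so the block is self-contained. The helpers sort_descending and sort_by_length are hypothetical names used only to show a standard comparator and a lambda being passed in.

```cpp
#include <functional>
#include <string>
#include <utility>

template <typename T, typename StrictWeakOrdering>
void comparison_swap(T& lhs, T& rhs, StrictWeakOrdering cmp) {
    using std::swap;  // allow ADL to find a type-specific swap
    if (cmp(rhs, lhs)) {
        swap(lhs, rhs);
    }
}

template <typename T>
void comparison_swap(T& lhs, T& rhs) {
    comparison_swap(lhs, rhs, std::less<T>());
}

// Hypothetical helper: orders two ints so that the larger comes first.
void sort_descending(int& x, int& y) {
    comparison_swap(x, y, std::greater<int>());
}

// Hypothetical helper: orders two strings by length, shortest first.
void sort_by_length(std::string& s, std::string& t) {
    comparison_swap(s, t, [](const std::string& l, const std::string& r) {
        return l.size() < r.size();
    });
}
```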
Let me throw a special case in here that only applies if performance is an absolutely critical issue and if float accuracy is enough: you could consider the vector pipeline (if your target CPU has one).
Some CPUs can get you the min and max of each component of a vector with one instruction each, so you can process 4 values in one go - without any branches at all.
Again, this is a very special case and most likely not relevant for what you're doing, but I wanted to bring this up since "more efficient" was part of the question.
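A portable sketch of the min/max idea, written with std::min and std::max over arrays rather than with vendor intrinsics. The name ascending_pairs is hypothetical. A compiler may turn the loop into vector min/max instructions, but that depends on the target, the flags, and whether it can rule out aliasing, so treat that as an assumption to verify in the generated assembly.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: orders each pair (lo[i], hi[i]) so that lo[i] <= hi[i].
// Assumes no NaN inputs: with a NaN, std::min and std::max can both return
// the same argument, and the other value of the pair would be lost.
void ascending_pairs(float* lo, float* hi, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const float smaller = std::min(lo[i], hi[i]);
        const float larger  = std::max(lo[i], hi[i]);
        lo[i] = smaller;
        hi[i] = larger;
    }
}
```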
Why not use std::sort() with a lambda or a functor?
As you also asked,
Is there a more efficient way to sort
two numbers?
Considering efficiency, you may need to write your own swap function and test its performance against std::swap.
Here is Microsoft's implementation.
template<class _Ty> inline
void swap(_Ty& _Left, _Ty& _Right)
{ // exchange values stored at _Left and _Right
if (&_Left != &_Right)
{ // different, worth swapping
_Ty _Tmp = _Left;
_Left = _Right;
_Right = _Tmp;
}
}
If you feel the if (&_Left != &_Right) check is not required, you can omit it to improve the performance of the code. You can write your own swap like below.
template <class T>
inline void swap(T &left, T& right)
{
T temp = left;
left = right;
right = temp;
}
For me, this appeared to improve performance slightly over 10 crore (100 million) calls.
Anyways, you need to measure performance related changes properly. Don't assume.
Some library functions may not run so fast as they are written considering generic usage, error checking etc. If performance is not critical in your application, it is recommended to use Library functions as they are well tested.
If performance is critical, as in hard real-time systems, there is nothing wrong with writing your own and using it.
All the answers apart from EboMike's focus on programming generality and use the same underlying compare-and-swap approach. I'm interested in this question because it would be nice to have specialisations that avoid branching for pipelining efficiency. I'm sketching out some untested/unbenchmarked implementations that might compile more efficiently than the previous answers by exploiting conditional-move instructions (e.g., cmovl) to avoid branching. I have no idea if this manifests in real-world performance gains, however...
Programming generality could be added by making these specialisations, using the compare-and-swap approach in the generic case. This is a common enough problem that I would really love to see it correctly implemented as a set of architecture-tuned specialisations in a library.
I've included x86 assembly output from godbolt in comments.
/*
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rsi]
cmp eax, ecx
mov edx, ecx
cmovl edx, eax
cmovl eax, ecx
mov dword ptr [rdi], edx
mov dword ptr [rsi], eax
ret
*/
void ascending1(int &a, int &b)
{
bool const pred = a < b;
int const _a = pred ? a : b;
int const _b = pred ? b : a;
a = _a;
b = _b;
}
/*
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rsi]
mov edx, ecx
xor edx, eax
cmp eax, ecx
cmovle ecx, eax
mov dword ptr [rdi], ecx
xor ecx, edx
mov dword ptr [rsi], ecx
ret
*/
void ascending2(int &a, int &b)
{
bool const pred = a < b;
int const c = a^b;
a = pred ? a : b;
b = a^c;
}
/*
The following implementation changes to a function-style interface,
which I feel is more elegant, although admittedly always forces assignment
to occur, so will be more expensive if assignment is costly.
See foobar() to see that this rather nicely inlines.
mov eax, esi
mov ecx, esi
xor ecx, edi
cmp edi, esi
cmovle eax, edi
xor ecx, eax
shl rcx, 32
or rax, rcx
ret
*/
std::pair<int,int> ascending3(int const a, int const b)
{
bool const pred = a < b;
int const c = a^b;
int const x = pred ? a : b;
int const y = c^x;
return std::make_pair(x,y);
}
/*
This is to show that ascending3() inlines very nicely
to only 5 assembly instructions.
# inlined ascending3().
mov eax, esi
xor eax, edi
cmp edi, esi
cmovle esi, edi
xor eax, esi
# end of inlining.
add eax, esi
ret
*/
int foobar(int const a, int const b)
{
auto const [x,y] = ascending3(a,b);
return x+y;
}
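The idea of keeping compare-and-swap as the generic case and adding type-specific versions could be sketched with an ordinary overload set. This is an untested-for-performance illustration: the int overload reuses the select-based body of ascending1() above, and whether it compiles to conditional moves depends on the compiler and flags.

```cpp
#include <utility>

// Generic case: the compare-and-swap approach from the earlier answers.
template <typename T>
void ascending(T& a, T& b)
{
    using std::swap;
    if (b < a) {
        swap(a, b);
    }
}

// Non-template overload for int, preferred by overload resolution when
// both arguments are int. Same logic as ascending1().
inline void ascending(int& a, int& b)
{
    bool const pred = a < b;
    int const lo = pred ? a : b;
    int const hi = pred ? b : a;
    a = lo;
    b = hi;
}
```

Callers use the same name for every type, so a tuned version can be added or removed without touching call sites.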