RISC-V inline assembly struct optimized away [duplicate] - c++

Consider the following small function:
void foo(int* iptr) {
    iptr[10] = 1;
    __asm__ volatile ("nop"::"r"(iptr):);
    iptr[10] = 2;
}
Using gcc, this compiles to:
foo:
    nop
    mov     DWORD PTR [rdi+40], 2
    ret
Note in particular that the first write to iptr, iptr[10] = 1, doesn't occur at all: the inline asm nop is the first thing in the function, and only the final write of 2 appears (after the asm statement). Apparently the compiler decides that it only needs to provide an up-to-date value of iptr itself, but not the memory it points to.
I can tell the compiler that memory must be up to date with a memory clobber, like so:
void foo(int* iptr) {
    iptr[10] = 1;
    __asm__ volatile ("nop"::"r"(iptr):"memory");
    iptr[10] = 2;
}
which results in the expected code:
foo:
    mov     DWORD PTR [rdi+40], 1
    nop
    mov     DWORD PTR [rdi+40], 2
    ret
However, this is too strong of a condition, since it tells the compiler all memory has to be written. For example, in the following function:
void foo2(int* iptr, long* lptr) {
    iptr[10] = 1;
    lptr[20] = 100;
    __asm__ volatile ("nop"::"r"(iptr):);
    iptr[10] = 2;
    lptr[20] = 200;
}
The desired behavior is to let the compiler optimize away the first write to lptr[20], but not the first write to iptr[10]. The "memory" clobber cannot achieve this because it means both writes have to occur:
foo2:
    mov     DWORD PTR [rdi+40], 1
    mov     QWORD PTR [rsi+160], 100 ; lptr[20] written unnecessarily
    nop
    mov     DWORD PTR [rdi+40], 2
    mov     QWORD PTR [rsi+160], 200
    ret
Is there some way to tell compilers accepting gcc extended asm syntax that the input to the asm includes the pointer and anything it can point to?

That's correct; asking for a pointer as input to inline asm does not imply that the pointed-to memory is also an input or output or both. With a register input and register output, for all gcc knows your asm just aligns a pointer by masking off the low bits, or adds a constant to it. (In which case you would want it to optimize away a dead store.)
The simple option is asm volatile and a "memory" clobber (see footnote 1).
The narrower, more specific way you're asking for is to use a "dummy" memory operand as well as the pointer in a register. Your asm template doesn't reference this operand (except maybe inside an asm comment to see what the compiler picked). It tells the compiler which memory you actually read, write, or read+write.
Dummy memory input: "m" (*(const int (*)[]) iptr)
or output: "=m" (*(int (*)[]) iptr). Or of course "+m" with the same syntax.
That syntax is casting to a pointer-to-array and dereferencing, so the actual input is a C array. (If you actually have an array, not pointer, you don't need any casting and can just ask for it as a memory operand.)
If you leave the size unspecified with [], that tells GCC that any memory accessed relative to that pointer is an input, output, or in/out operand. If you use [10] or [some_variable], that tells the compiler the specific size. With runtime-variable sizes, gcc in practice misses the optimization that iptr[size+1] is not part of the input.
GCC documents this and therefore supports it. I think it's not a strict-aliasing violation if the array element type matches the pointer's target type, or maybe if it's char.
(from the GCC manual)
An x86 example where the string memory argument is of unknown length.
asm("repne scasb"
: "=c" (count), "+D" (p)
: "m" (*(const char (*)[]) p), "0" (-1), "a" (0));
If you can avoid using an early-clobber on the pointer input operand, the dummy memory input operand will typically pick a simple addressing mode using that same register.
But if you do use an early-clobber for strict correctness of an asm loop, sometimes a dummy operand will make gcc waste instructions (and an extra register) on a base address for the memory operand. Check the asm output of the compiler.
Background:
This is a widespread bug in inline-asm examples, one which often goes undetected because the asm is wrapped in a function that doesn't inline into any callers that would tempt the compiler into reordering or merging stores, or doing dead-store elimination.
GNU C inline asm syntax is designed around describing a single instruction to the compiler. The intent is that you tell the compiler about a memory input or memory output with a "m" or "=m" operand constraint, and it picks the addressing mode.
Writing whole loops in inline asm requires care to make sure the compiler really knows what's going on (or asm volatile plus a "memory" clobber), otherwise you risk breakage when changing the surrounding code, or enabling link-time optimization that allows for cross-file inlining.
See also Looping over arrays with inline assembly for using an asm statement as the loop body, still doing the loop logic in C. With actual (non-dummy) "m" and "=m" operands, the compiler can unroll the loop by using displacements in the addressing modes it chooses.
Footnote 1: A "memory" clobber gets the compiler to treat the asm like a non-inline function call (that could read or write any memory except for locals that escape analysis has proved have not escaped). The escape analysis includes input operands to the asm statement itself, but also any global or static variables that any earlier call could have stored pointers into. So usually local loop counters don't have to be spilled/reloaded around an asm statement with a "memory" clobber.
asm volatile is necessary to make sure the asm isn't optimized away even if its output operands are unused (because you require the undeclared side effect of writing memory to happen).
Or, for memory that is only read by the asm, you need the asm to run again if the same input buffer contains different input data. Without volatile, the asm statement could be CSEd out of a loop. (A "memory" clobber does not make the optimizer treat all memory as an input when considering whether the asm statement even needs to run.)
asm with no output operands is implicitly volatile, but it's a good idea to make it explicit. (The GCC manual has a section on asm volatile).
e.g. asm("... sum an array ..." : "=r"(sum) : "r"(pointer), "r"(end_pointer) : "memory") has an output operand so is not implicitly volatile. If you used it like
arr[5] = 1;
total += asm_sum(arr, len);
memcpy(arr, foo, len);
total += asm_sum(arr, len);
Without volatile the 2nd asm_sum could optimize away, assuming that the same asm with the same input operands (pointer and length) will produce the same output. You need volatile for any asm that's not a pure function of its explicit input operands. If it doesn't optimize away, then the "memory" clobber will have the desired effect of requiring memory to be in sync.
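For completeness, here is a hedged sketch of how an asm_sum could be written so that it is a pure function of its explicit operands: the dummy "m" input declares the whole array as read, so neither volatile nor a "memory" clobber is needed, and letting the compiler CSE the second call becomes correct. This is a hypothetical x86-64 implementation, assuming len > 0 and 8-byte long:
#include <stddef.h>

static inline long asm_sum(const long *p, size_t len) {
    long sum;
    asm ("xor %0, %0\n\t"        // sum = 0 (written early, hence the & below)
         "1:\n\t"
         "add (%1), %0\n\t"      // sum += *p
         "add $8, %1\n\t"        // ++p
         "dec %2\n\t"
         "jnz 1b"
         : "=&r" (sum), "+r" (p), "+r" (len)
         : "m" (*(const long (*)[]) p)   // dummy input: the whole array is read
         : "cc");
    return sum;
}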

Related

Is it possible to tell the compiler that an object reachable through a pointer has changed? [duplicate]


Test of 8 subsequent bytes isn't translated into a single compare instruction

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
#include <cstring>   // for std::memcmp

bool f1(const char *ptr)
{
    for (int i = 0; i < 8; i++)
        if (ptr[i])
            return false;
    return true;
}

bool f2(const char *ptr)
{
    bool res = true;
    for (int i = 0; i < 8; i++)
        res &= (ptr[i] == 0);
    return res;
}

bool f3(const char *ptr)
{
    static const char tmp[8]{};
    return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for all of GCC, Clang, and the Intel compiler with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with an additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops when ptr[i] is non-zero, so it is not always equivalent to considering all 8 elements, as the two other functions do (or as directly comparing an 8-byte word does), when the size of the array is less than 8 (the compiler does not know the size of the array):
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2, I agree that it can be replaced by an 8-byte comparison, on the condition that the CPU allows reading an 8-byte word from any address alignment, which is the case on x64; but that can introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm.
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
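A sketch of that C99 approach, using a hypothetical f1_static (not valid C++):
#include <stdbool.h>

// [static 8] is the C99 promise that the caller passes at least 8 valid
// bytes, so the compiler could legally invent a full 8-byte load here
// (whether current compilers exploit this is another matter).
bool f1_static(const char ptr[static 8])
{
    for (int i = 0; i < 8; i++)
        if (ptr[i])
            return false;
    return true;
}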
GCC and clang can't auto-vectorize loops whose trip count isn't known before the first iteration. So that rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks @Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
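A sketch of that memcpy approach (f5 is a hypothetical name): the fixed-size memcpy is an aliasing-safe unaligned load which mainstream compilers optimize to a single 8-byte cmp, matching f3's semantics:
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

bool f5(const char *ptr)
{
    uint64_t val;
    memcpy(&val, ptr, sizeof(val));  // compiles to one unaligned 8-byte load
    return val == 0;                 // true iff all 8 bytes are zero
}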
Or in GNU C++, use a typedef to express unaligned may-alias loads:
#include <cstdint>

bool f4(const char *ptr) {
    typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
    cmp     QWORD PTR [rdi], 0
    setne   al
    ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case, because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk containing at least 1 valid string byte doesn't cross a page boundary).
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>

inline bool EightBytesNull(const char *ptr)
{
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but will not work on ARM, which requires strict integer memory alignment.

Is it possible to write to an array second element by overflowing the first element in C?

In low-level languages it is possible to mov a dword (32-bit) to the first array element; this will overflow to write the second, third, and fourth elements. Or you can mov a word (16-bit) to the first element and it will overflow into the second.
How can the same effect be achieved in C? When trying, for example:
char txt[] = {0, 0};
txt[0] = 0x4142;
it gives a warning [-Woverflow], the value of txt[1] doesn't change, and txt[0] is set to 0x42.
How can I get the same behavior as in assembly:
mov word [txt], 0x4142
the previous assembly instruction will set the first element [txt+0] to 0x42 and the second element [txt+1] to 0x41.
EDIT
What about this suggestion? Define the array as a single variable:
uint16_t txt;
txt = 0x4142;
and access the elements with ((uint8_t*) &txt)[0] for the first element and ((uint8_t*) &txt)[1] for the second.
If you are totally sure this will not cause a segmentation fault (and you must be sure), you can use memcpy():
uint16_t n = 0x4142;
memcpy((void *)txt, (void *)&n, sizeof(uint16_t));
By using void pointers, this is the most versatile solution, generalizable to all the cases beyond this example.
txt[0] = 0x4142; is an assignment to a char object, so the right-hand side is implicitly converted to char after being evaluated.
The NASM equivalent is mov byte [rsp-4], 'BA'. Assembling that with NASM gives you the same warning as your C compiler:
foo.asm:1: warning: byte data exceeds bounds [-w+number-overflow]
Also, modern C is not a high-level assembler. C has types, NASM doesn't (operand-size is on a per-instruction basis only). Don't expect C to work like NASM.
C is defined in terms of an "abstract machine", and the compiler's job is to make asm for the target CPU which produces the same observable results as if the C was running directly on the C abstract machine. Unless you use volatile, actually storing to memory doesn't count as an observable side-effect. This is why C compilers can keep variables in registers.
And more importantly, things that are undefined behaviour according to the ISO C standard may still be undefined when compiling for x86. For example, x86 asm has well-defined behaviour for signed overflow: it wraps around. But in C, it's undefined behaviour, so compilers can exploit this to make more efficient code for for (int i = 0; i <= len; i++) arr[i] *= 2; without worrying that i <= len might always be true, giving an infinite loop. See What Every C Programmer Should Know About Undefined Behavior.
Type-punning by pointer-casting other than to char* or unsigned char* (or __m128i* and other Intel SSE/AVX intrinsic types, because they're also defined as may_alias types) violates the strict-aliasing rule. txt is a char array, but I think it's still a strict-aliasing violation to write it through a uint16_t* and then read it back via txt[0] and txt[1].
Some compilers may define the behaviour of *(uint16_t*)txt = 0x4142, or happen to produce the code you expect in some cases, but you shouldn't count on it always working and being safe when other code also reads and writes txt[].
Compilers (i.e. C implementations, to use the terminology of the ISO standard) are allowed to define behaviour that the C standard leaves undefined. But in a quest for higher performance, they choose to leave a lot of stuff undefined. This is why compiling C for x86 is not similar to writing in asm directly.
Many people consider modern C compilers to be actively hostile to the programmer, looking for excuses to "miscompile" your code. See the 2nd half of this answer on gcc, strict-aliasing, and horror stories, and also the comments. (The example in that answer is safe with a proper memcpy; the problem was a custom implementation of memcpy that copied using long*.)
Here's a real-life example of a misaligned pointer leading to a fault on x86 (because gcc's auto-vectorization strategy assumed that some whole number of elements would reach a 16-byte alignment boundary; i.e. it depended on the uint16_t* being aligned).
Obviously if you want your C to be portable (including to non-x86), you must use well-defined ways to type-pun. In ISO C99 and later, writing one union member and reading another is well-defined. (And in GNU C++, and GNU C89).
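For example, a sketch of the union approach (well-defined in ISO C99 and later; also a documented GNU extension in C++), applied to the 2-byte store from the question:
#include <stdint.h>

void set2bytes_union(char *p)   /* hypothetical name */
{
    union {
        uint16_t u16;
        char c[2];
    } pun;

    pun.u16 = 0x4142;
    p[0] = pun.c[0];   /* 0x42 on a little-endian target */
    p[1] = pun.c[1];   /* 0x41 on a little-endian target */
}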
In ISO C++, the only well-defined way to type-pun is with memcpy or other char* accesses, to copy object representations.
Modern compilers know how to optimize away memcpy for small compile-time constant sizes.
#include <string.h>
#include <stdint.h>

void set2bytes_safe(char *p) {
    uint16_t src = 0x4142;
    memcpy(p, &src, sizeof(src));
}

void set2bytes_alias(char *p) {
    *(uint16_t*)p = 0x4142;
}
Both functions compile to the same code with gcc, clang, and ICC for x86-64 System V ABI:
# clang++6.0 -O3 -march=sandybridge
set2bytes_safe(char*):
    mov     word ptr [rdi], 16706
    ret
Sandybridge-family doesn't have LCP decode stalls for 16-bit mov immediate, only for 16-bit immediates with ALU instructions. This is an improvement over Nehalem (See Agner Fog's microarch guide), but apparently gcc8.1 -march=sandybridge doesn't know about it because it still likes to:
# gcc and ICC
    mov     eax, 16706
    mov     WORD PTR [rdi], ax
    ret
define the array as a single variable.
... and accessing the elements with ((uint8_t*) &txt)[0]
Yes, that's fine, assuming that uint8_t is unsigned char, because char* is allowed to alias anything.
This is the case on almost any implementation that supports uint8_t at all, but it's theoretically possible to build one where it's not, and char is a 16 or 32-bit type, and uint8_t is implemented with a more expensive read/modify/write of the containing word.
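Put together, a minimal sketch of that suggestion (output assumes a little-endian target such as x86):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t txt = 0x4142;
    uint8_t *bytes = (uint8_t *)&txt;   // unsigned-char access may alias anything

    // On little-endian this prints "42 41", matching the NASM
    // "mov word [txt], 0x4142" from the question.
    printf("%02x %02x\n", bytes[0], bytes[1]);
    return 0;
}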
One option is to Trust Your Compiler(tm) and just write proper code.
With this test code:
#include <iostream>

int main() {
    char txt[] = {0, 0};   // note: not null-terminated after the writes below
    txt[0] = 0x41;
    txt[1] = 0x42;
    std::cout << txt;
}
Clang 6.0 produces:
int main() {
00E91020 push ebp
00E91021 mov ebp,esp
00E91023 push eax
00E91024 lea eax,[ebp-2]
char txt[] = {0, 0};
00E91027 mov word ptr [ebp-2],4241h <-- Combined write, without any tricks!
txt[0] = 0x41;
txt[1] = 0x42;
std::cout << txt;
00E9102D push eax
00E9102E push offset cout (0E99540h)
00E91033 call std::operator<<<std::char_traits<char> > (0E91050h)
00E91038 add esp,8
}
00E9103B xor eax,eax
00E9103D add esp,4
00E91040 pop ebp
00E91041 ret
You're looking to do a deep copy, which you'll need a loop to accomplish (or a function that does the loop for you internally: memcpy).
Simply assigning 0x4142 to a char means it will have to be truncated to fit in the char. This should produce a warning, as the outcome is implementation-specific, but typically the least significant bits are retained.
In any case, if you know the numbers you want to assign you could just construct using them: const char txt[] = { '\x41', '\x42' };
I'd suggest doing this with an initializer-list, obviously it's on you to make sure the initializer list is at least as long as size(txt). For example:
copy_n(begin({ '\x41', '\x42' }), size(txt), begin(txt));
Live Example

Can I create a union with formal parameter passed to a function in C++?

The function below calculates absolute value of 32-bit floating point value:
__forceinline static float Abs(float x)
{
    union {
        float x;
        int a;
    } u;
    //u.x = x;
    u.a &= 0x7FFFFFFF;
    return u.x;
}
The union u declared in the function holds a member x, which is different from the x passed as a parameter to the function. Is there any way to create a union with the function's argument x?
Any reason the function above, with that line uncommented, would execute longer than this one?
__forceinline float fastAbs(float a)
{
    int b = *((int *)&a) & 0x7FFFFFFF;
    return *((float *)(&b));
}
I'm trying to figure out the best way to take the absolute value of a floating-point value with as few reads/writes to memory as possible.
For the first question, I'm not sure why you can't just do what you want with an assignment. The compiler will do whatever optimizations can be done.
In your second code sample, you violate strict aliasing, so it isn't the same.
As for why it's slower:
It's because CPUs today tend to have separate integer and floating-point units. By type-punning like that, you force the value to be moved from one unit to the other. This has overhead. (This is often done through memory, so you have extra loads and stores.)
In the second snippet: a which is originally in the floating-point unit (either the x87 FPU or an SSE register), needs to be moved into the general purpose registers to apply the mask 0x7FFFFFFF. Then it needs to be moved back.
In the first snippet: The compiler is probably smart enough to load a directly into the integer unit. So you bypass the FPU in the first stage.
(I'm not 100% sure until you show us the assembly. It will also depend heavily on whether the parameter starts off in a register or on the stack. And whether the output is used immediately by another floating-point operation.)
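For reference, a sketch (abs_bits is a hypothetical name) that avoids both the uninitialized union read in the question's Abs and the strict-aliasing casts in fastAbs by type-punning through memcpy, which compilers optimize away:
#include <stdint.h>
#include <string.h>

static inline float abs_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));   // bit-copy the float (no aliasing violation)
    u &= 0x7FFFFFFF;             // clear the sign bit
    memcpy(&x, &u, sizeof(x));   // bit-copy back
    return x;
}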
Looking at the disassembly of the code compiled in release mode, the difference is quite clear!
I removed the inline and used two virtual functions to keep the compiler from optimizing too much, and to let us see the differences.
This is the first function.
013D1002 in al,dx
union {
float x;
int a;
} u;
u.x = x;
013D1003 fld dword ptr [x] // Loads a float on top of the FPU STACK.
013D1006 fstp dword ptr [x] // Pops a Float Number from the top of the FPU Stack into the destination address.
u.a &= 0x7FFFFFFF;
013D1009 and dword ptr [x],7FFFFFFFh // Execute a 32 bit binary and operation with the specified address.
return u.x;
013D1010 fld dword ptr [x] // Loads the result on top of the FPU stack.
}
This is the second function.
013D1020 push ebp // Standard function entry... i'm using a virtual function here to show the difference.
013D1021 mov ebp,esp
int b= *((int *)&a) & 0x7FFFFFFF;
013D1023 mov eax,dword ptr [a] // Load into eax our parameter.
013D1026 and eax,7FFFFFFFh // Execute 32 bit binary and between our register and our constant.
013D102B mov dword ptr [a],eax // Move the register value into our destination variable
return *((float *)(&b));
013D102E fld dword ptr [a] // Loads the result on top of the FPU stack.
The number of floating-point operations and the usage of the FPU stack are greater in the first case.
The functions are executing exactly what you asked, so no surprise.
So I expect the second function to be faster.
Now, removing the virtual and inlining, things are a little different. It's hard to show the disassembly here because the compiler does a good job, but I repeat: if the values are not constants, the compiler will use more floating-point operations in the first function.
Of course, integer operations are faster than floating point operations.
Are you sure that directly using the math.h fabs function is slower than your method?
If correctly inlined, abs function will just do this!
00D71016 fabs
Micro-optimizations like this are hard to see in long code, but if your function is called in a long chain of floating-point operations, fabs will work better, since values will already be in the FPU stack or in SSE registers; it will be faster and better optimized by the compiler.
You cannot measure the performance of optimizations by running a loop over a piece of code; you must see how the compiler mixes it all together in real code.
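For comparison, a sketch of the library route recommended above (abs_lib is a hypothetical wrapper); with optimization this typically inlines to a single fabs (x87) or an andps against a sign-bit mask (SSE):
#include <math.h>

static inline float abs_lib(float x)
{
    return fabsf(x);   // float variant of fabs from math.h
}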

Using bts assembly instruction with gcc compiler

I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;

#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))

inline void SetBit(LongWord array[], const int bit)
{
    array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}

inline bool TestBit(const LongWord array[], const int bit)
{
    return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
    __asm {
        mov eax, bit
        mov ecx, array
        bts [ecx], eax
    }
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?
BTS (and the other BT* insns) with a memory destination are slow. (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte, and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND can).
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). That looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits rather than CHAR_BIT * sizeof(unsigned long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cdqe instruction, due to the badly-written C which uses 1 instead of 1UL.
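A sketch of the bugfixed, width-clean pure-C version suggested here: fixed-width uint32_t elements match the /32 and %32 constants, and an unsigned shift avoids the 1-instead-of-1UL pitfall just mentioned:
#include <stdbool.h>
#include <stdint.h>

static inline void SetBit(uint32_t array[], unsigned bit)
{
    array[bit / 32] |= (uint32_t)1 << (bit % 32);
}

static inline bool TestBit(const uint32_t array[], unsigned bit)
{
    return (array[bit / 32] & ((uint32_t)1 << (bit % 32))) != 0;
}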
Also note that inline asm can't return the flags as a result (except with a new-in-gcc-v6 extension!), so using inline asm for TestBit would probably result in terrible code like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
inline void SetBit(LongWord *array, int bit)
{
    asm("bts %1, %0" : "+m" (*array) : "r" (bit));
}
This version efficiently returns the carry flag (via the gcc6 flag-output extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand, since using a memory operand is very slow, as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
    int oldbit;
    asm("bts %2, %0"
        : "+r" (n), "=@ccc" (oldbit)
        : "r" (bit));
    return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away and the produced assembly will have bts followed immediately by jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if (!wasSet) {
    *(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
As another, slightly indirect answer: GCC exposes a number of atomic builtins starting with version 4.1.
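For example, a sketch of an atomic test-and-set built on one of those legacy __sync builtins; depending on compiler version and how the result is used, this can compile to a lock bts or to a cmpxchg retry loop:
#include <stdbool.h>

// Atomically set the given bit in *word and report whether it was already
// set. Assumes bit < the number of value bits in unsigned long.
static inline bool atomic_test_and_set_bit(unsigned long *word, unsigned bit)
{
    unsigned long mask = 1UL << bit;
    unsigned long old = __sync_fetch_and_or(word, mask);
    return (old & mask) != 0;
}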