Using bts assembly instruction with gcc compiler - c++

I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;
#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))
inline void SetBit(LongWord array[], const int bit)
{
array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}
inline bool TestBit(const LongWord array[], const int bit)
{
return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
__asm {
mov eax, bit
mov ecx, array
bts [ecx], eax
}
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?

BTS (and the other BT* insns) with a memory destination are slow. (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte, and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND).
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). Looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits, not CHAR_BIT * sizeof(unsigned long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cdqe instruction, due to the badly-written C which uses 1 instead of 1UL.
Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
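For reference, a minimal sketch of such a bugfixed, 64-bit-clean version (assuming fixed 32-bit chunks via uint32_t; the names follow the question's code, and the comments describe what gcc typically emits):
#include <stdint.h>

typedef uint32_t LongWord;                      // fixed 32-bit chunks on every platform
#define DivLongWord(w) ((unsigned)(w) >> 5)
#define ModLongWord(w) ((unsigned)(w) & 31)

inline void SetBit(LongWord array[], const int bit)
{
    array[DivLongWord(bit)] |= (LongWord)1 << ModLongWord(bit);   // compiles to a memory-dest OR
}

inline bool TestBit(const LongWord array[], const int bit)
{
    return (array[DivLongWord(bit)] & ((LongWord)1 << ModLongWord(bit))) != 0;  // load + TEST
}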

inline void SetBit(LongWord *array, const int bit) {
    asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}

This version efficiently returns the carry flag (via the gcc-v6 extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand, since using a memory operand is very slow, as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
int oldbit;
asm("bts %2,%0"
: "+r" (n), "=#ccc" (oldbit)
: "r" (bit));
return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away and the produced assembly will have bts followed immediately by a jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;
int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if(!wasSet) {
*(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
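By analogy, the TestBit half of the original question can use bt with the same flag-output constraint; here is a sketch (register operands only, gcc 6 or later for the flag-output syntax, and the function name is just illustrative):
inline bool variable_test_bit64(const unsigned long long n, const unsigned long long bit) {
    int oldbit;
    asm("bt %2,%1"
        : "=@ccc" (oldbit)          // carry flag becomes the output
        : "r" (n), "r" (bit));
    return oldbit;
}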

Another, slightly indirect, answer: GCC exposes a number of atomic builtins starting with version 4.1.
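For example, a sketch of a test-and-set built on the documented __sync builtins (these generate a lock prefix, so they are only worth it if you actually need atomicity; the helper name here is just illustrative):
#include <stdint.h>

static inline int atomic_test_and_set_bit(uint32_t *array, int bit)
{
    uint32_t mask = (uint32_t)1 << (bit & 31);
    uint32_t old = __sync_fetch_and_or(&array[bit >> 5], mask);  // typically a lock or / lock bts
    return (old & mask) != 0;
}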

Related

Is there a better way to detect any bits that are set in a 16-byte array of flags?

ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 };
// Code that detects noise and sets noise_frame_flags omitted
__m128i xmm0 = _mm_load_si128((__m128i*)noise_frame_flags);
bool isNoiseToCancel = _mm_extract_epi64(xmm0, 0) | _mm_extract_epi64(xmm0, 1);
if (isNoiseToCancel)
cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);
This is a code snippet from my AV Capture tool on Linux. noise_frame_flags here is an array of flags for 16-channel audio. For each channel, the corresponding byte can be either 0 or 1; 1 indicates that the channel has some noise to cancel. For example, if noise_frame_flags[0] == 1, that means the first channel's noise flag is set (by the omitted code).
Even if only a single "flag" is set, I need to call cancelNoises, and this code seems to work fine for that. As you can see, I used _mm_load_si128 to load the whole array of flags (which is correctly aligned) and then two _mm_extract_epi64 to extract the "flags". My question: is there a better way to do this (using popcount maybe)?
Note: ALIGNTO(16) is a macro that expands to the correct GCC equivalent, just nicer looking.
Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.
In asm you just want a mov load and a memory-source or or add, which will set ZF just like you're doing now. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.
In C++, use memcpy to do strict-aliasing safe loads of uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).
add is even better than or: it can macro-fuse with most jcc instructions on Intel Sandybridge-family (but not AMD). or can't fuse with branch instructions on any CPUs. Since your values are 0 or 1, we can't have a case of two non-zero values adding to produce a zero, which is why you'd normally use or for the general case.
(Some addressing modes may defeat micro or macro-fusion on Intel. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load + add/sub-and-branch). Assuming it's about the same as cmp on my Skylake, except it does write the destination so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)
uint64_t a, b;
memcpy(&a, noise_frame_flags+0, sizeof(a)); // strict-aliasing-safe loads
memcpy(&b, noise_frame_flags+8, sizeof(b)); // which optimize to MOV qword
bool isNoiseToCancel = a + b; // equivalent to a | b for bool inputs
This should compile to 3 asm instructions which will decode to 2 uops total, or 3 on AMD CPUs where JCC can only fuse with cmp or test.
union { alignas(16) uint8_t flags[16]; uint64_t chunks[2];}; would be safe in C99, but not ISO C++. Most but not all C++ compilers that support Intel intrinsics define the behaviour of union type-punning. (I think @jww has said SunCC doesn't.)
In C++11, you don't need a custom macro for ALIGNTO(16), just use alignas(16). Also supported in C11 if you #include <stdalign.h>
Alternatives:
movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones or all-zeros constant in a register, ptest xmm0, [mem] would work to save a uop on Intel (depending on addressing mode), but that's still 3 total.
PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Surprisingly, vptest ymm only requires AVX1). Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax,ymm0 / test+jnz. See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM depending on how good their horizontal OR support is.
popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficient than _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.
__m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
vflags = _mm_slli_epi32(vflags, 7);
unsigned flagmask = _mm_movemask_epi8(vflags);
if (flagmask) {
    unsigned flagcount = __builtin_popcount(flagmask); // popcnt with -march=nehalem or higher
    unsigned first_setflag = __builtin_ctz(flagmask); // tzcnt if available, else BSF
    flagmask &= flagmask - 1; // clear lowest set bit. blsr if compiled with -march=haswell or bdver2 or newer.
    ...
}
(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but generally good to enable all ISA extensions that some CPU supports, so you don't miss out on useful stuff the compiler can use.)
Here's what I came up with for doing this:
#define VLEN 8
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
// Constants for 128 or 256 bit registers
#if VLEN == 8
#define V(a,b,c,d,e,f,g,h) a,b,c,d,e,f,g,h
#else
#define V(a,b,c,d,e,f,g,h) a,b,c,d
#endif
#define SWAP128 V(4,5,6,7, 0,1,2,3)
#define SWAP64 V(2,3, 0,1, 6,7, 4,5)
#define SWAP32 V(1, 0, 3, 2, 5, 4, 7, 6)
static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x,x, SWAP128);
    x |= __builtin_shufflevector(x,x, SWAP64);
    x |= __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}
With VLEN = 8, this will use 256-bit registers if the arch supports it. Change it to 4 to use 128-bit registers.
This should compile to a single vptest instruction.
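A hypothetical use with the question's 16-byte flag array, assuming VLEN is set to 4 so vNb is 16 bytes wide (memcpy keeps the load strict-aliasing safe):
#include <string.h>
#include <stdint.h>

bool anyNoise(const uint8_t flags[16]) {
    vNb v;                          // assumes VLEN == 4, i.e. a 16-byte vector
    memcpy(&v, flags, sizeof(v));
    return any(v);
}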

RISC-V inline assembly struct optimized away [duplicate]

Consider the following small function:
void foo(int* iptr) {
iptr[10] = 1;
__asm__ volatile ("nop"::"r"(iptr):);
iptr[10] = 2;
}
Using gcc, this compiles to:
foo:
nop
mov DWORD PTR [rdi+40], 2
ret
Note in particular, that the first write to iptr, iptr[10] = 1 doesn't occur at all: the inline asm nop is the first thing in the function, and only the final write of 2 appears (after the ASM call). Apparently the compiler decides that it only needs to provide an up-to-date version of the value of iptr itself, but not the memory it points to.
I can tell the compiler that memory must be up to date with a memory clobber, like so:
void foo(int* iptr) {
iptr[10] = 1;
__asm__ volatile ("nop"::"r"(iptr):"memory");
iptr[10] = 2;
}
which results in the expected code:
foo:
mov DWORD PTR [rdi+40], 1
nop
mov DWORD PTR [rdi+40], 2
ret
However, this is too strong of a condition, since it tells the compiler all memory has to be written. For example, in the following function:
void foo2(int* iptr, long* lptr) {
iptr[10] = 1;
lptr[20] = 100;
__asm__ volatile ("nop"::"r"(iptr):);
iptr[10] = 2;
lptr[20] = 200;
}
The desired behavior is to let the compiler optimize away the first write to lptr[20], but not the first write to iptr[10]. The "memory" clobber cannot achieve this because it means both writes have to occur:
foo2:
mov DWORD PTR [rdi+40], 1
mov QWORD PTR [rsi+160], 100 ; lptr[20] written unnecessarily
nop
mov DWORD PTR [rdi+40], 2
mov QWORD PTR [rsi+160], 200
ret
Is there some way to tell compilers accepting gcc extended asm syntax that the input to the asm includes the pointer and anything it can point to?
That's correct; asking for a pointer as input to inline asm does not imply that the pointed-to memory is also an input or output or both. With a register input and register output, for all gcc knows your asm just aligns a pointer by masking off the low bits, or adds a constant to it. (In which case you would want it to optimize away a dead store.)
The simple option is asm volatile and a "memory" clobber1.
The narrower more specific way you're asking for is to use a "dummy" memory operand as well as the pointer in a register. Your asm template doesn't reference this operand (except maybe inside an asm comment to see what the compiler picked). It tells the compiler which memory you actually read, write, or read+write.
Dummy memory input: "m" (*(const int (*)[]) iptr)
or output: "=m" (*(int (*)[]) iptr). Or of course "+m" with the same syntax.
That syntax is casting to a pointer-to-array and dereferencing, so the actual input is a C array. (If you actually have an array, not a pointer, you don't need any casting and can just ask for it as a memory operand.)
If you leave the size unspecified with [], that tells GCC that any memory accessed relative to that pointer is an input, output, or in/out operand. If you use [10] or [some_variable], that tells the compiler the specific size. With runtime-variable sizes, gcc in practice misses the optimization that iptr[size+1] is not part of the input.
GCC documents this and therefore supports it. I think it's not a strict-aliasing violation if the array element type is the same as the pointer, or maybe if it's char.
(from the GCC manual)
An x86 example where the string memory argument is of unknown length.
asm("repne scasb"
: "=c" (count), "+D" (p)
: "m" (*(const char (*)[]) p), "0" (-1), "a" (0));
If you can avoid using an early-clobber on the pointer input operand, the dummy memory input operand will typically pick a simple addressing mode using that same register.
But if you do use an early-clobber for strict correctness of an asm loop, sometimes a dummy operand will make gcc waste instructions (and an extra register) on a base address for the memory operand. Check the asm output of the compiler.
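As a concrete sketch, here is the question's foo2 with a dummy in/out operand covering the memory reachable through iptr; the first store to lptr[20] is not claimed by the asm, so it can still be treated as dead:
void foo2(int* iptr, long* lptr) {
    iptr[10] = 1;
    lptr[20] = 100;                               // still a dead store, can be optimized away
    __asm__ volatile ("nop"
                      : "+m" (*(int (*)[]) iptr)  // dummy operand: anything reachable via iptr
                      : "r" (iptr));
    iptr[10] = 2;
    lptr[20] = 200;
}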
Background:
This is a widespread bug in inline-asm examples which often goes undetected because the asm is wrapped in a function that doesn't inline into any callers that tempt the compiler into reordering stores for merging, or into doing dead-store elimination.
GNU C inline asm syntax is designed around describing a single instruction to the compiler. The intent is that you tell the compiler about a memory input or memory output with a "m" or "=m" operand constraint, and it picks the addressing mode.
Writing whole loops in inline asm requires care to make sure the compiler really knows what's going on (or asm volatile plus a "memory" clobber), otherwise you risk breakage when changing the surrounding code, or enabling link-time optimization that allows for cross-file inlining.
See also Looping over arrays with inline assembly for using an asm statement as the loop body, still doing the loop logic in C. With actual (non-dummy) "m" and "=m" operands, the compiler can unroll the loop by using displacements in the addressing modes it chooses.
Footnote 1: A "memory" clobber gets the compiler to treat the asm like a non-inline function call (that could read or write any memory except for locals that escape analysis has proved have not escaped). The escape analysis includes input operands to the asm statement itself, but also any global or static variables that any earlier call could have stored pointers into. So usually local loop counters don't have to be spilled/reloaded around an asm statement with a "memory" clobber.
asm volatile is necessary to make sure the asm isn't optimized away even if its output operands are unused (because you require the undeclared side-effect of writing memory to happen).
Or, for memory that is only read by the asm, you need the asm to run again if the same input buffer contains different input data. Without volatile, the asm statement could be CSEd out of a loop. (A "memory" clobber does not make the optimizer treat all memory as an input when considering whether the asm statement even needs to run.)
asm with no output operands is implicitly volatile, but it's a good idea to make it explicit. (The GCC manual has a section on asm volatile).
e.g. asm("... sum an array ..." : "=r"(sum) : "r"(pointer), "r"(end_pointer) : "memory") has an output operand so is not implicitly volatile. If you used it like
arr[5] = 1;
total += asm_sum(arr, len);
memcpy(arr, foo, len);
total += asm_sum(arr, len);
Without volatile the 2nd asm_sum could optimize away, assuming that the same asm with the same input operands (pointer and length) will produce the same output. You need volatile for any asm that's not a pure function of its explicit input operands. If it doesn't optimize away, then the "memory" clobber will have the desired effect of requiring memory to be in sync.

Is there anything special about -1 (0xFFFFFFFF) regarding ADC?

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:
constexpr unsigned X = 0;
unsigned f1(unsigned a, unsigned b) {
b += a;
unsigned c = b < a;
return c + b + X;
}
Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks like I got lucky and the (g++ -O3, version 9.1) generated code is this:
f1(unsigned int, unsigned int):
add %edi,%esi
mov %esi,%eax
adc $0x0,%eax
retq
For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:
f1(unsigned int, unsigned int):
xor %eax,%eax
add %edi,%esi
setb %al
lea -0x1(%rsi,%rax,1),%eax
retq
This seems less efficient than the initial code, as suggested by indirect measurements (not very scientific, though). Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?
For what is worth, clang -O3, version 8.8.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1 never does.
I've tried using the intrinsic _addcarry_u32 but it didn't help.
unsigned f2(unsigned a, unsigned b) {
b += a;
unsigned char c = b < a;
_addcarry_u32(c, b, X, &b);
return b;
}
I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)
I might, actually, be using it correctly. For X == 0 I'm happy:
f2(unsigned int, unsigned int):
add %esi,%edi
mov %edi,%eax
adc $0x0,%eax
retq
For X == -1 I'm unhappy :-(
f2(unsigned int, unsigned int):
add %esi,%edi
mov $0xffffffff,%eax
setb %dl
add $0xff,%dl
adc %edi,%eax
retq
I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)
mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.1
This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.
I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".
The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)
That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.
e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.
So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.
What's the point of using it since it's up to me to provide the carry flag?
You're using _addcarry_u32 correctly.
The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.
If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.
e.g. to add two 128-bit integers in 32-bit chunks, you can do this
// bad on x86-64 because it doesn't optimize the same as 2x _addcarry_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
unsigned char carry;
carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}
(On Godbolt with GCC/clang/ICC)
That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.
GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.
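For comparison, a sketch of the unsigned __int128 version mentioned above (GNU C / C++ on x86-64 only; memcpy keeps the accesses strict-aliasing safe):
#include <string.h>

void add_128bit(unsigned long long *__restrict dst,
                const unsigned long long *__restrict src)
{
    unsigned __int128 a, b;
    memcpy(&a, dst, sizeof(a));
    memcpy(&b, src, sizeof(b));
    a += b;                        // compilers typically lower this to add / adc
    memcpy(dst, &a, sizeof(a));
}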
Footnote 1: or for uop count: equal on Intel Haswell and earlier where adc is 2 uops, except with a zero immediate where Sandybridge-family's decoders special case that as 1 uop.
But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.
On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.
So equal total uop count but worse latency means that adc would still be a better choice.
https://agner.org/optimize/

Is it possible to write to an array second element by overflowing the first element in C?

In low-level languages it is possible to mov a dword (32 bits) into the first array element so that it overflows into the second, third and fourth elements, or to mov a word (16 bits) into the first element so that it overflows into the second.
How can I achieve the same effect in C? When trying, for example:
char txt[] = {0, 0};
txt[0] = 0x4142;
it gives a warning [-Woverflow]
and the value of txt[1] doesn't change and txt[0] is set to 0x42.
How to get the same behavior as in assembly:
mov word [txt], 0x4142
the previous assembly instruction will set the first element [txt+0] to 0x42 and the second element [txt+1] to 0x41.
EDIT
What about this suggestion?
define the array as a single variable.
uint16_t txt;
txt = 0x4142;
and accessing the elements with ((uint8_t*) &txt)[0] for the first element and ((uint8_t*) &txt)[1] for the second element.
If you are totally sure this will not cause a segmentation fault, which you must be, you can use memcpy()
uint16_t n = 0x4142;
memcpy((void *)txt, (void *)&n, sizeof(uint16_t));
By using void pointers, this is the most versatile solution, generalizable to all the cases beyond this example.
txt[0] = 0x4142; is an assignment to a char object, so the right hand side is implicitly cast to (char) after being evaluated.
The NASM equivalent is mov byte [rsp-4], 'BA'. Assembling that with NASM gives you the same warning as your C compiler:
foo.asm:1: warning: byte data exceeds bounds [-w+number-overflow]
Also, modern C is not a high-level assembler. C has types, NASM doesn't (operand-size is on a per-instruction basis only). Don't expect C to work like NASM.
C is defined in terms of an "abstract machine", and the compiler's job is to make asm for the target CPU which produces the same observable results as if the C was running directly on the C abstract machine. Unless you use volatile, actually storing to memory doesn't count as an observable side-effect. This is why C compilers can keep variables in registers.
And more importantly, things that are undefined behaviour according to the ISO C standard may still be undefined when compiling for x86. For example, x86 asm has well-defined behaviour for signed overflow: it wraps around. But in C, it's undefined behaviour, so compilers can exploit this to make more efficient code for for (int i=0 ; i<=len ;i++) arr[i] *= 2; without worrying that i<=len might always be true, giving an infinite loop. See What Every C Programmer Should Know About Undefined Behavior.
Type-punning by pointer-casting other than to char* or unsigned char* (or __m128i* and other Intel SSE/AVX intrinsic types, because they're also defined as may_alias types) violates the strict-aliasing rule. txt is a char array, but I think it's still a strict-aliasing violation to write it through a uint16_t* and then read it back via txt[0] and txt[1].
Some compilers may define the behaviour of *(uint16_t*)txt = 0x4142, or happen to produce the code you expect in some cases, but you shouldn't count on it always working and being safe if other code also reads and writes txt[].
Compilers (i.e. C implementations, to use the terminology of the ISO standard) are allowed to define behaviour that the C standard leaves undefined. But in a quest for higher performance, they choose to leave a lot of stuff undefined. This is why compiling C for x86 is not similar to writing in asm directly.
Many people consider modern C compilers to be actively hostile to the programmer, looking for excuses to "miscompile" your code. See the 2nd half of this answer on gcc, strict-aliasing, and horror stories, and also the comments. (The example in that answer is safe with a proper memcpy; the problem was a custom implementation of memcpy that copied using long*.)
Here's a real-life example of a misaligned pointer leading to a fault on x86 (because gcc's auto-vectorization strategy assumed that some whole number of elements would reach a 16-byte alignment boundary. i.e. it depended on the uint16_t* being aligned.)
Obviously if you want your C to be portable (including to non-x86), you must use well-defined ways to type-pun. In ISO C99 and later, writing one union member and reading another is well-defined. (And in GNU C++, and GNU C89).
In ISO C++, the only well-defined way to type-pun is with memcpy or other char* accesses, to copy object representations.
Modern compilers know how to optimize away memcpy for small compile-time constant sizes.
#include <string.h>
#include <stdint.h>
void set2bytes_safe(char *p) {
uint16_t src = 0x4142;
memcpy(p, &src, sizeof(src));
}
void set2bytes_alias(char *p) {
*(uint16_t*)p = 0x4142;
}
Both functions compile to the same code with gcc, clang, and ICC for x86-64 System V ABI:
# clang++6.0 -O3 -march=sandybridge
set2bytes_safe(char*):
mov word ptr [rdi], 16706
ret
Sandybridge-family doesn't have LCP decode stalls for 16-bit mov immediate, only for 16-bit immediates with ALU instructions. This is an improvement over Nehalem (See Agner Fog's microarch guide), but apparently gcc8.1 -march=sandybridge doesn't know about it because it still likes to:
# gcc and ICC
mov eax, 16706
mov WORD PTR [rdi], ax
ret
define the array as a single variable.
... and accessing the elements with ((uint8_t*) &txt)[0]
Yes, that's fine, assuming that uint8_t is unsigned char, because char* is allowed to alias anything.
This is the case on almost any implementation that supports uint8_t at all, but it's theoretically possible to build one where it's not, and char is a 16 or 32-bit type, and uint8_t is implemented with a more expensive read/modify/write of the containing word.
One option is to Trust Your Compiler(tm) and just write proper code.
With this test code:
#include <iostream>
int main() {
char txt[] = {0, 0};
txt[0] = 0x41;
txt[1] = 0x42;
std::cout << txt;
}
Clang 6.0 produces:
int main() {
00E91020 push ebp
00E91021 mov ebp,esp
00E91023 push eax
00E91024 lea eax,[ebp-2]
char txt[] = {0, 0};
00E91027 mov word ptr [ebp-2],4241h <-- Combined write, without any tricks!
txt[0] = 0x41;
txt[1] = 0x42;
std::cout << txt;
00E9102D push eax
00E9102E push offset cout (0E99540h)
00E91033 call std::operator<<<std::char_traits<char> > (0E91050h)
00E91038 add esp,8
}
00E9103B xor eax,eax
00E9103D add esp,4
00E91040 pop ebp
00E91041 ret
You're looking to do a deep copy, which you'll need a loop to accomplish (or a function that does the loop for you internally: memcpy).
Simply assigning 0x4142 to a char means the value has to be truncated to fit in the char. This should produce a warning, as the outcome is implementation-specific, but typically the least significant bits are retained.
In any case, if you know the numbers you want to assign you could just construct using them: const char txt[] = { '\x41', '\x42' };
I'd suggest doing this with an initializer-list, obviously it's on you to make sure the initializer list is at least as long as size(txt). For example:
copy_n(begin({ '\x41', '\x42' }), size(txt), begin(txt));
Live Example

making mistake in inline assembler in gcc [duplicate]

This question already has answers here:
How to get the CPU cycle count in x86_64 from C++?
(5 answers)
Closed 4 years ago.
I have successfully written some inline assembler in gcc to rotate right one bit
following some nice instructions: http://www.cs.dartmouth.edu/~sergey/cs108/2009/gcc-inline-asm.pdf
Here's an example:
static inline int ror(int v) {
asm ("ror %0;" :"=r"(v) /* output */ :"0"(v) /* input */ );
return v;
}
However, I want code to count clock cycles, and have seen some in the wrong (probably microsoft) format. I don't know how to do these things in gcc. Any help?
unsigned __int64 inline GetRDTSC() {
__asm {
; Flush the pipeline
XOR eax, eax
CPUID
; Get RDTSC counter in edx:eax
RDTSC
}
}
I tried:
static inline unsigned long long getClocks() {
asm("xor %%eax, %%eax" );
asm(CPUID);
asm(RDTSC : : %%edx %%eax); //Get RDTSC counter in edx:eax
but I don't know how to get the edx:eax pair to return as 64 bits cleanly, and don't know how to really flush the pipeline.
Also, the best source code I found was at: http://www.strchr.com/performance_measurements_with_rdtsc
and that was mentioning pentium, so if there are different ways of doing it on different intel/AMD variants, please let me know. I would prefer something that works on all x86 platforms, even if it's a bit ugly, to a range of solutions for each variant, but I wouldn't mind knowing about it.
The following does what you want:
inline unsigned long long rdtsc() {
unsigned int lo, hi;
asm volatile (
"cpuid \n"
"rdtsc"
: "=a"(lo), "=d"(hi) /* outputs */
: "a"(0) /* inputs */
: "%ebx", "%ecx"); /* clobbers*/
return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
It is important to put as little inline ASM in your code as possible, because the compiler can't optimize anything inside (or across) it. That's why I've done the shifting and ORing of the result in C code rather than coding that in ASM as well. Similarly, I use the "a" input of 0 to let the compiler decide when and how to zero out eax. It could be that some other code in your program already zeroed it out, and the compiler could save an instruction if it knows that.
Also, the "clobbers" above are very important. CPUID overwrites everything in eax, ebx, ecx, and edx. You need to tell the compiler that you're changing these registers so that it knows not to keep anything important there. You don't have to list eax and edx because you're using them as outputs. If you don't list the clobbers, there's a serious chance your program will crash and you will find it extremely difficult to track down the issue.
The following will store the result in value. Combining the results takes extra cycles, so the number of cycles between calls to this code will be a few less than the difference in results.
unsigned int hi, lo;
unsigned long long value;
asm volatile (
    "cpuid\n\t"
    "rdtsc"
    : "=d" (hi), "=a" (lo)   /* outputs */
    : "a" (0)                /* input: cpuid leaf 0 */
    : "%ebx", "%ecx"         /* cpuid also clobbers ebx and ecx */
);
value = (((unsigned long long)hi) << 32) | lo;