Strange bit operation result in C++ / disassembly?

In a C++ function, I have the following variables:
uint32_t buf = 229; //Member variable
int bufSize = 0; //Member variable
static constexpr const uint32_t all1 = ~((uint32_t)0);
And this line of code:
uint32_t result = buf & (all1 >> (32 - bufSize));
The value of result is then 229, both in console output and when debugging with gdb. The expected value is of course 0, and every attempt to create a minimal example reproducing the problem has failed.
So I went to the disassembly and executed it step by step. The bit shift operation is here:
0x44c89e <+0x00b4> 41 d3 e8 shr %cl,%r8d
The debugger registers show that before the instruction, rcx=0x20 and r8=0xFFFFFFFF
After the instruction, we still have r8=0xFFFFFFFF
I have a very poor knowledge of x86-64 assembly, but this instruction is supposed to be an unsigned shift, so why the heck isn't the result 0?
I am using mingw-gcc x86-64 4.9.1

Compile with -Wall and you will get a -Wshift-count-overflow warning, because you are shifting by 32, which is the width of unsigned int. Shifting a 32-bit value by 32 or more is undefined behaviour in C++, and on x86 the shr instruction masks the shift count to its low 5 bits, so a count of 0x20 becomes 0 and r8d is left unchanged, which is why result is still 229. To see the difference yourself, change the 32 to 31, compile, and compare the generated assembly; then you will know what went wrong.
The easiest fix for you would be to use a wider 64-bit type such as uint64_t for all1 and result (note that long is still 32 bits on MinGW).
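For illustration, a minimal sketch (my addition, not from the original answer; the helper name lowMask is mine) of a mask computation that stays well-defined for every bufSize in 0..32, because the shift count never reaches the width of the type being shifted:
#include <cstdint>

uint32_t lowMask(int bufSize)
{
    if (bufSize <= 0)
        return 0;                        // avoid shifting by the full width
    if (bufSize >= 32)
        return ~static_cast<uint32_t>(0);
    return (static_cast<uint32_t>(1) << bufSize) - 1;
}

// Usage, matching the original expression:
// uint32_t result = buf & lowMask(bufSize);   // yields 0 when bufSize == 0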


Bit fetching with type punning. Unexpected behaviour

Expected result
The program simply prints anyNum in its binary representation.
We accomplish this by "front popping" the first (most significant) bit of the value and sending it to standard output.
After at most 32 iterations i (initially anyNum) falls to zero and the loop ends.
Problem
The v1 version of this code produces the expected result (111...).
However, in v2, where I use a mask structure to get the first bit, it behaves as if the last bit were grabbed instead (1000...).
Maybe not the last? Anyway, why, and what is happening in the second version of the code?
#include <iostream>

typedef struct {
    unsigned b31:  1;
    unsigned rest: 31;
} mask;

int main()
{
    constexpr unsigned anyNum = -1; //0b111...
    for (unsigned i = anyNum; i; i <<= 1) {
        unsigned bit;
        //bit = (i >> 31);          //v1
        bit = ((mask*)&i)->b31;     //v2 (unexpected behaviour)
        std::cout << bit;
    }
}
Environment
IDE & platform: https://replit.com/
Platform: Linux-5.11.0-1029-gcp-x86_64-with-glibc2.27
Machine: x86_64
Compilation command: clang++-7 -pthread -std=c++17 -o main main.cpp
unsigned b31: 1; is the least significant bit.
Maybe not the last?
The last.
why
Because the compiler chose to do so. The order is whatever the compiler decides.
For example, in GCC the order of bits in bit-fields is controlled by the BITS_BIG_ENDIAN configuration option.
x86 ABI specifies that bit-fields are allocated from right to left.
what is happening in the second version of the code?
Undefined behavior, as in, the code is invalid. You should not expect the code to do anything sane. The compiler happens to generate code that is printing the least significant bit from i.
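As a quick check of the allocation order (my sketch, not part of the original answer), you can set only the first-declared field and inspect the object representation with memcpy instead of the invalid cast; on the x86-64 platform described above it prints 1, i.e. b31 occupies the least significant bit:
#include <cstdint>
#include <cstring>
#include <iostream>

struct mask { unsigned b31: 1; unsigned rest: 31; };

int main()
{
    mask m{};
    m.b31 = 1;                          // set only the first-declared field
    std::uint32_t raw;
    std::memcpy(&raw, &m, sizeof raw);  // well-defined way to read the bytes
    std::cout << raw << '\n';           // prints 1 here: b31 is the LSB
}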
Thanks to @KamilCuk, I found the answer.
Swapping two lines in the code was enough to fix it.
In the comments I've marked the memory layout of the struct (left-to-right) and the bit layout of unsigned int (right-to-left, little-endian).
Not taking them into account was my mistake.
#include <iostream>

//struct uses left-to-right memory layout
typedef struct {
    //I've swapped the next 2 lines as a correction.
    unsigned rest: 31;
    unsigned b31:  1; //NOW `b31` is the most significant bit.
} mask;

int main()
{
    //unsigned is little-endian on Linux-5.11.0-1029-gcp-x86_64-with-glibc2.27,
    //which is a right-to-left bit layout.
    constexpr unsigned anyNum = -1; //0b111...
    for (unsigned i = anyNum; i; i <<= 1) {
        unsigned bit;
        //bit = (i >> 31);          //v1
        bit = ((mask*)&i)->b31;     //v2 (NOW expected behaviour)
        std::cout << bit;
    }
}
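For completeness, a cast-free way to print the binary representation (my addition, not from the thread) is std::bitset, which sidesteps the bit-field layout question entirely:
#include <bitset>
#include <iostream>

int main()
{
    constexpr unsigned anyNum = -1; //0b111...
    std::cout << std::bitset<32>(anyNum) << '\n';   // prints all 32 bits, MSB first
}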

Incorrect double conversion by AIX compiler 13.1.3 for c++

We convert data from a char* array to double in the following function:
double getDouble(const char* szData, const size_t dataLength)
{
    double res = 0;
    if (dataLength == 8)
    {
        ub8 doubleData = *(ub8*)(szData);
        doubleData = ntohll(doubleData);
        double* pDoubleData = (double*)(&doubleData);
        res = *pDoubleData;
    }
    return res;
}
ub8 is an 8-byte unsigned long long.
The double value -1.1512299550195975 comes out as 3.6975400899608046, although input and output should be equal (both -1.1512299550195975). This only happens on AIX; on the other platforms we get the correct result.
We use optimization level -O2 for our project. If I use optimization level -O1 instead, the data is converted correctly.
Can you help me, please: why is the conversion correct at optimization level -O1 and incorrect at -O2? Maybe I should turn some flags off or on for AIX in the compilation? Thank you. The AIX compiler version is 13.1.3.
We get the double value as a char array from a binary file; we are parsing data from the transaction logs of an Oracle database. Oracle writes, for example, a double value as '40 0d 94 8f e6 10 3e 93', and we should convert it to the double value -1.1512299550195975. But only on AIX do we get the incorrect value 3.6975400899608046.
Your implementation is relying on Undefined Behaviour, particularly violating the Strict Aliasing rule, as somebody has commented.
That is why it may sometimes work (without optimization) and sometimes not.
To avoid violating the Strict Aliasing, you can use one of the following:
Only in C (not C++; more info here):
typedef union swap64 {
    unsigned long long u64;
    double d64;
} swap64;

swap64 s;
memcpy(&s.u64, szData, sizeof(s.u64));
s.u64 = ntohll(s.u64);
//Following line is UB in C++, but not in C:
return s.d64;
Or without UB (code valid in both C++ and C):
unsigned long long u64;
double d64;
//Most compilers will remove the memcpy calls and replace them by simpler instructions
memcpy(&u64, szData, sizeof(u64));
u64=ntohll(u64);
memcpy(&d64, &u64, sizeof(u64));
return d64;
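For clarity, here is a minimal sketch of the original getDouble rewritten with the second (aliasing-safe) approach; ntohll and the 8-byte width come from the question's code, the rest is my assumption:
#include <cstring>   // memcpy

double getDouble(const char* szData, const size_t dataLength)
{
    double res = 0;
    if (dataLength == 8)
    {
        unsigned long long u64;                   // same width as ub8 in the question
        std::memcpy(&u64, szData, sizeof(u64));   // copy the raw bytes, no aliasing violation
        u64 = ntohll(u64);                        // byte-swap, as in the original
        std::memcpy(&res, &u64, sizeof(res));     // reinterpret the bits as a double
    }
    return res;
}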

Getting the high half and low half of a full integer multiply

I start with three values A, B, C (unsigned 32-bit integers), and I have to obtain two values D, E (also unsigned 32-bit integers), where
D = high(A*C);
E = low(A*C) + high(B*C);
I expect that multiplying two 32-bit uints produces a 64-bit result; "high" and "low" are just my convention for the first 32 bits and the last 32 bits of that 64-bit product.
I am trying to optimize code that is already functional. A short part of the code inside a huge loop, just a few lines, consumes almost all of the computation time (a physical simulation running for a couple of hours). That is why I am trying to optimize this little part while the rest of the code stays more "user-well-arranged".
There are some SSE instructions that fit the routine described. The gcc compiler probably does an optimized job; however, I do not rule out writing a piece of the code directly in SSE instructions if it turns out to be necessary.
Please be patient with my limited SSE experience. I will try to write the SSE algorithm just symbolically; there will probably be some mistakes with shuffle masks or with my understanding of the structure.
1. Store four 32-bit integers into one 128-bit register in the order A, B, C, C.
2. Apply an instruction (probably pmuludq) to that register which multiplies pairs of 32-bit integers and returns the results as pairs of 64-bit integers. It should compute A*C and B*C simultaneously and return two 64-bit values.
3. I expect the 128-bit register now holds values P, Q, R, S (four 32-bit blocks), where P,Q is the 64-bit result of A*C and R,S is the 64-bit result of B*C. I then rearrange the values in the register into the order P, Q, 0, R.
4. Take the first 64 bits P,Q and add the second 64 bits 0,R. The result is a new 64-bit value.
5. Read the first 32 bits of the result as D and the last 32 bits as E.
This algorithm should return correct values for E and D.
My question:
Is there plain C++ code that generates an SSE routine similar to steps 1-5 above? I prefer solutions with higher performance. If the algorithm is problematic to express in standard C++, is there a way to write it in SSE directly?
I use TDM-GCC 4.9.2 64-bit compiler.
(note: Question was modified after advice)
(note 2: my inspiration for using SSE to obtain better performance is http://sci.tuomastonteri.fi/programming/sse)
You don't need vectors for this unless you have multiple inputs to process in parallel. clang and gcc already do a good job of optimizing the "normal" way to write your code: cast to twice the size, multiply, then shift to get the high half. Compilers recognize this pattern.
They notice that the operands started out as 32bit, so the upper halves are all zero after casting to 64b. Thus, they can use x86's mul insn to do a 32b*32b->64b multiply, instead of doing a full extended-precision 64b multiply. In 64bit mode, they do the same thing with a __uint128_t version of your code.
Both of these functions compile to fairly good code (one mul or imul per multiply). gcc -m32 doesn't support 128b types, but I won't get into that because 1. you only asked about full multiplies of 32bit values, and 2. you should always use 64bit code when you want something to run fast. If you are doing full-multiplies where the result doesn't fit in a register, clang will avoid a lot of extra mov instructions, because gcc is silly about this. This little test function made a good test-case for that gcc bug report.
That godbolt link includes a function that calls this in a loop, storing the result in an array. It auto-vectorizes with a bunch of shuffling, but still looks like a speedup if you have multiple inputs to process in parallel. A different output format might take less shuffling after the multiply, like maybe storing separate arrays for D and E.
I'm including the 128b version to show that compilers can handle this even when it's not trivial (e.g. just do a 64bit imul instruction to do a 64*64->64b multiply on the 32bit inputs, after zeroing any upper bits that might be sitting in the input registers on function entry.)
When targeting Haswell CPUs and newer, gcc and clang can use the mulx BMI2 instruction. (I used -mno-bmi2 -mno-avx2 in the godbolt link to keep the asm simpler. If you do have a Haswell CPU, just use -O3 -march=haswell.) mulx dest1, dest2, src1 does dest1:dest2 = rdx * src1 while mul src1 does rdx:rax = rax * src1. So mulx has two read-only inputs (one implicit: edx/rdx), and two write-only outputs. This lets compilers do full-multiplies with fewer mov instructions to get data into and out of the implicit registers for mul. This is only a small speedup, esp. since 64bit mulx has 4 cycle latency instead of 3, on Haswell. (Strangely, 64bit mul and mulx are slightly cheaper than 32bit mul and mulx.)
// compiles to good code: you can and should do this sort of thing:
#include <stdint.h>

struct DE { uint32_t D, E; };

struct DE f_structret(uint32_t A, uint32_t B, uint32_t C) {
    uint64_t AC = A * (uint64_t)C;
    uint64_t BC = B * (uint64_t)C;
    uint32_t D = AC >> 32;           // high half
    uint32_t E = AC + (BC >> 32);    // We could cast to uint32_t before adding, but don't need to
    struct DE retval = { D, E };
    return retval;
}

#ifdef __SIZEOF_INT128__ // IDK the "correct" way to detect __int128_t support
struct DE64 { uint64_t D, E; };

struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
    __uint128_t AC = A * (__uint128_t)C;
    __uint128_t BC = B * (__uint128_t)C;
    uint64_t D = AC >> 64;           // high half
    uint64_t E = AC + (BC >> 64);
    struct DE64 retval = { D, E };
    return retval;
}
#endif
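As a concrete illustration of the mulx path mentioned above (my addition, a sketch rather than part of the original answer), the _mulx_u64 intrinsic from <immintrin.h> expresses the same 64x64->128 multiply explicitly; it needs a BMI2-enabled target, e.g. -mbmi2 or -march=haswell:
#include <immintrin.h>
#include <cstdint>

// Returns the low 64 bits of a*b and stores the high 64 bits in *hi.
static inline uint64_t mul64_full(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned long long high;
    unsigned long long low = _mulx_u64(a, b, &high);  // mulx: two read-only inputs, two outputs
    *hi = high;
    return low;
}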
If I understand it correctly, you want to compute the number of potential overflows in A*B. If so, you have two good options: "use a twice-as-big variable" (write 128-bit math functions for uint64 - it's not that hard, or wait for me to post it tomorrow), and "use a floating-point type":
(float(A)*float(B))/float(C)
as the loss of precision is minimal (assuming float is 4 bytes, double 8 bytes, and long double 16 bytes long), and both float and uint32 require 4 bytes of memory (use double for uint64_t, as it should be 8 bytes long):
#include <iostream>
#include <conio.h>
#include <stdint.h>

using namespace std;

int main()
{
    uint32_t a(-1), b(-1);
    uint64_t result1;
    float result2;
    result1 = uint64_t(a)*uint64_t(b)/4294967296ull; // >>32 would be faster and less memory consuming
    result2 = float(a)*float(b)/4294967296.0f;
    cout.precision(20);
    cout << result1 << '\n' << result2;
    getch();
    return 0;
}
Produces:
4294967294
4294967296
But if you want a really precise and correct answer, I'd suggest using a twice-as-big type for the computation.
Now that I think of it, you could use long double for uint64 and double for uint32 instead of writing a function for uint64, but I don't think it's guaranteed that long double will be 128-bit, and you'll have to check it. I'd go for the more universal option.
EDIT:
You can write a function to calculate this without using anything more than A, B, and a result variable of the same type as A. Just add the rightmost bit of Z<<0, Z<<1, Z<<2 (...) Z<<X in the first pass (where Z equals B*(A>>pass_number&1)), then of Z<<-1, Z<<0, Z<<1 (...) Z<<(X-1) in the second (there should be X passes), while right-shifting the result by 1 each pass (the just-computed bit becomes irrelevant to us once computed, as it won't participate in the calculation anymore, and it would be erased anyway after dividing by 2^X, i.e. doing >>X).
(had to place in the "code" as I'm new here and couldn't find another way to prevent formatting script from eating half of it)
It's just a quick idea. You'll have to check its correctness (sorry, but I'm really tired right now), but the result shouldn't overflow at any point of the calculation, as the maximum carry would have a value of 2X if I'm correct, and the algorithm itself seems to be good.
I will write code for it tomorrow if you're still in need of help.
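If you do end up writing it by hand, here is a minimal sketch (my addition, using the standard split-into-16-bit-halves method rather than the pass-based scheme described above) that computes the high half of a 32x32 multiply with only 32-bit arithmetic:
#include <cstdint>

// High 32 bits of a * b, computed without any 64-bit arithmetic,
// by splitting each operand into 16-bit halves (schoolbook multiplication).
uint32_t mulhi32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t lo_lo = a_lo * b_lo;
    uint32_t hi_lo = a_hi * b_lo;
    uint32_t lo_hi = a_lo * b_hi;
    uint32_t hi_hi = a_hi * b_hi;

    // Cross terms plus the carry out of the low 32 bits; this sum fits in 32 bits.
    uint32_t cross = (lo_lo >> 16) + (hi_lo & 0xFFFF) + lo_hi;
    return hi_hi + (hi_lo >> 16) + (cross >> 16);
}

// The question's D and E then become (all arithmetic wraps modulo 2^32):
//   uint32_t D = mulhi32(A, C);
//   uint32_t E = A * C + mulhi32(B, C);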

making mistake in inline assembler in gcc [duplicate]

This question already has answers here:
How to get the CPU cycle count in x86_64 from C++?
(5 answers)
Closed 4 years ago.
I have successfully written some inline assembler in gcc to rotate right one bit
following some nice instructions: http://www.cs.dartmouth.edu/~sergey/cs108/2009/gcc-inline-asm.pdf
Here's an example:
static inline int ror(int v) {
    asm ("ror %0;" : "=r"(v) /* output */ : "0"(v) /* input */);
    return v;
}
However, I want code to count clock cycles, and the examples I have seen are in the wrong (probably Microsoft) format. I don't know how to do these things in gcc. Any help?
unsigned __int64 inline GetRDTSC() {
    __asm {
        ; Flush the pipeline
        XOR eax, eax
        CPUID
        ; Get RDTSC counter in edx:eax
        RDTSC
    }
}
I tried:
static inline unsigned long long getClocks() {
asm("xor %%eax, %%eax" );
asm(CPUID);
asm(RDTSC : : %%edx %%eax); //Get RDTSC counter in edx:eax
but I don't know how to get the edx:eax pair to return as 64 bits cleanly, and don't know how to really flush the pipeline.
Also, the best source code I found was at: http://www.strchr.com/performance_measurements_with_rdtsc
and that was mentioning pentium, so if there are different ways of doing it on different intel/AMD variants, please let me know. I would prefer something that works on all x86 platforms, even if it's a bit ugly, to a range of solutions for each variant, but I wouldn't mind knowing about it.
The following does what you want:
inline unsigned long long rdtsc() {
    unsigned int lo, hi;
    asm volatile (
        "cpuid \n"
        "rdtsc"
        : "=a"(lo), "=d"(hi)  /* outputs */
        : "a"(0)              /* inputs */
        : "%ebx", "%ecx");    /* clobbers */
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
It is important to put as little inline ASM as possible in your code, because it prevents the compiler from doing any optimizations. That's why I've done the shift and oring of the result in C code rather than coding that in ASM as well. Similarly, I use the "a" input of 0 to let the compiler decide when and how to zero out eax. It could be that some other code in your program already zeroed it out, and the compiler could save an instruction if it knows that.
Also, the "clobbers" above are very important. CPUID overwrites everything in eax, ebx, ecx, and edx. You need to tell the compiler that you're changing these registers so that it knows not to keep anything important there. You don't have to list eax and edx because you're using them as outputs. If you don't list the clobbers, there's a serious chance your program will crash and you will find it extremely difficult to track down the issue.
This will store the result in value. Combining the results takes extra cycles, so the number of cycles between calls to this code will be a few less than the difference in results.
unsigned int hi, lo;
unsigned long long value;
asm volatile (
    "cpuid\n\t"
    "rdtsc"
    : "=d" (hi), "=a" (lo)  /* outputs: rdtsc leaves the counter in edx:eax */
    : "a" (0)               /* input: leaf 0 for cpuid */
    : "%ebx", "%ecx"        /* cpuid also clobbers ebx and ecx */
);
value = (((unsigned long long)hi) << 32) | lo;
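If you do not need the serializing CPUID, note (my addition, not part of the original answers) that GCC also provides the __rdtsc() intrinsic in <x86intrin.h>, which avoids hand-written asm entirely; unlike the CPUID+RDTSC sequences above it does not flush the pipeline, so measurements are only approximate:
#include <x86intrin.h>
#include <cstdint>
#include <iostream>

int main()
{
    std::uint64_t start = __rdtsc();
    // ... code to measure ...
    std::uint64_t end = __rdtsc();
    std::cout << (end - start) << " cycles (approximate, no serialization)\n";
}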

Using bts assembly instruction with gcc compiler

I want to use the bts and bt x86 assembly instructions to speed up bit operations in my C++ code on the Mac. On Windows, the _bittestandset and _bittest intrinsics work well, and provide significant performance gains. On the Mac, the gcc compiler doesn't seem to support those, so I'm trying to do it directly in assembler instead.
Here's my C++ code (note that 'bit' can be >= 32):
typedef unsigned long LongWord;

#define DivLongWord(w) ((unsigned)w >> 5)
#define ModLongWord(w) ((unsigned)w & (32-1))

inline void SetBit(LongWord array[], const int bit)
{
    array[DivLongWord(bit)] |= 1 << ModLongWord(bit);
}

inline bool TestBit(const LongWord array[], const int bit)
{
    return (array[DivLongWord(bit)] & (1 << ModLongWord(bit))) != 0;
}
The following assembler code works, but is not optimal, as the compiler can't optimize register allocation:
inline void SetBit(LongWord* array, const int bit)
{
    __asm {
        mov eax, bit
        mov ecx, array
        bts [ecx], eax
    }
}
Question: How do I get the compiler to fully optimize around the bts instruction? And how do I replace TestBit by a bt instruction?
BTS (and the other BT* insns) with a memory destination are slow. (>10 uops on Intel). You'll probably get faster code from doing the address math to find the right byte, and loading it into a register. Then you can do the BT / BTS with a register destination and store the result.
Or maybe shift a 1 to the right position and use OR with a memory destination for SetBit, or AND with a memory source for TestBit. Of course, if you avoid inline asm, the compiler can inline TestBit and use TEST instead of AND, which is useful on some CPUs (since it can macro-fuse into a test-and-branch on more CPUs than AND).
This is in fact what gcc 5.2 generates from your C source (memory-dest OR or TEST). Looks optimal to me (fewer uops than a memory-dest bt). Actually, note that your code is broken because it assumes unsigned long is 32 bits, not CHAR_BIT * sizeof(unsigned long). Using uint32_t, or char, would be a much better plan. Note the sign-extension of eax into rax with the cdqe instruction, due to the badly-written C which uses 1 instead of 1UL.
Also note that inline asm can't return the flags as a result (except with a new-in-gcc v6 extension!), so using inline asm for TestBit would probably result in terrible code like:
... ; inline asm
bt reg, reg
setc al ; end of inline asm
test al, al ; compiler-generated
jz bit_was_zero
Modern compilers can and do use BT when appropriate (with a register destination). End result: your C probably compiles to faster code than what you're suggesting doing with inline asm. It would be even faster after being bugfixed to be correct and 64bit-clean. If you were optimizing for code size, and willing to pay a significant speed penalty, forcing use of bts could work, but bt probably still won't work well (because the result goes into the flags).
inline void SetBit(LongWord *array, const int bit) {
    asm("bts %1,%0" : "+m" (*array) : "r" (bit));
}
This version efficiently returns the carry flag (via the gcc-v6 extension mentioned by Peter in the top answer) for a subsequent test instruction. It only supports a register operand since use of a memory operand is very slow as he said:
int variable_test_and_set_bit64(unsigned long long &n, const unsigned long long bit) {
    int oldbit;
    asm("bts %2,%0"
        : "+r" (n), "=@ccc" (oldbit)   // "=@ccc": carry flag as an output
        : "r" (bit));
    return oldbit;
}
Use in code is then like so. The wasSet variable is optimized away, and the produced assembly will have bts followed immediately by a jb instruction checking the carry flag.
unsigned long long flags = *(memoryaddress);
unsigned long long bitToTest = someOtherVariable;

int wasSet = variable_test_and_set_bit64(flags, bitToTest);
if (!wasSet) {
    *(memoryaddress) = flags;
}
Although it seems a bit contrived, this does save me several instructions vs the "1ULL << bitToTest" version.
As another, slightly indirect answer: GCC exposes a number of atomic operations starting with version 4.1.
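For example, a sketch (the function name is mine, and it assumes a lock-prefixed read-modify-write is acceptable) of an atomic test-and-set built on the GCC 4.1+ __sync builtins, with no inline asm at all:
#include <cstdint>

// uint32_t is used instead of unsigned long to stay 64-bit-clean,
// as discussed in the accepted answer.
inline bool AtomicTestAndSetBit(uint32_t array[], int bit)
{
    uint32_t mask = uint32_t(1) << (bit & 31);
    // Atomically ORs the mask into the word and returns the previous value.
    uint32_t old = __sync_fetch_and_or(&array[unsigned(bit) >> 5], mask);
    return (old & mask) != 0;   // true if the bit was already set
}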