Why GCC generates strange way to move stack pointer - c++

I have observed that GCC's C++ compiler generates the following assembler code:
sub $0xffffffffffffff80,%rsp
This is equivalent to
add $0x80,%rsp
i.e. remove 128 bytes from the stack.
Why does GCC generate the first sub variant and not the add variant? The add variant seems way more natural to me than to exploit that there is an underflow.
This only occurred once in a quite large code base. I have no minimal C++ code example to trigger this. I am using GCC 7.5.0

Try assembling both and you'll see why.
0: 48 83 ec 80 sub $0xffffffffffffff80,%rsp
4: 48 81 c4 80 00 00 00 add $0x80,%rsp
The sub version is three bytes shorter.
This is because the add and sub immediate instructions on x86 has two forms. One takes an 8-bit sign-extended immediate, and the other a 32-bit sign-extended immediate. See https://www.felixcloutier.com/x86/add; the relevant forms are (in Intel syntax) add r/m64, imm8 and add r/m64, imm32. The 32-bit one is obviously three bytes larger.
The number 0x80 can't be represented as an 8-bit signed immediate; since the high bit is set, it would sign-extend to 0xffffffffffffff80 instead of the desired 0x0000000000000080. So add $0x80, %rsp would have to use the 32-bit form add r/m64, imm32. On the other hand, 0xffffffffffffff80 would be just what we want if we subtract instead of adding, and so we can use sub r/m64, imm8, giving the same effect with smaller code.
I wouldn't really say it's "exploiting an underflow". I'd just interpret it as sub $-0x80, %rsp. The compiler is just choosing to emit 0xffffffffffffff80 instead of the equivalent -0x80; it doesn't bother to use the more human-readable version.
Note that 0x80 is actually the only possible number for which this trick is relevant; it's the unique 8-bit number which is its own negative mod 2^8. Any smaller number can just use add, and any larger number has to use 32 bits anyway. In fact, 0x80 is the only reason that we couldn't just omit sub r/m, imm8 from the instruction set and always use add with negative immediates in its place. I guess a similar trick does come up if we want to do a 64-bit add of 0x0000000080000000; sub will do it, but add can't be used at all, as there is no imm64 version; we'd have to load the constant into another register first.

Related

When debugging with gdb, what is the meaning of the debugging information in the front of the assembly code?

When debugging c code with gdb, the displayed assembly code is
0x000000000040116c main+0 push %rbp
0x000000000040116d main+1 mov %rsp,%rbp
!0x0000000000401170 main+4 movl $0x0,-0x4(%rbp)
0x0000000000401177 main+11 jmp 0x40118d <main+33>
0x0000000000401179 main+13 mov -0x4(%rbp),%eax
0x000000000040117c main+16 mov %eax,%edx
0x000000000040117e main+18 mov -0x4(%rbp),%eax
Is the 0x000000000040116d in the front of the first assembly instruction the virtual address of this function? Is main+1 the offset of this assembly from the main function? The next assembly is main+4. Does it mean that the first mov %rsp,%rbp is three bytes? If so, why is movl $0x0,-0x4(%rbp) 7 bytes?
I am using a server. The version is:Linux version 4.15.0-122-generic (buildd#lcy01-amd64-010) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)) #124~16.04.1-Ubuntu SMP.
Pretty much yes. It's quite apparent that, for example, adding 4 to the first address gives you the address shown for main+4, and adding another 7 on top of that gives you the corresponding address for main+11.
As far as the two move instructions go: they are very different, and do completely different things. They are two very different kind of moves, and that's how many bytes each one requires in x86 machine language, so its not surprising that one takes many more bytes than the other. As far as the precise reason why, well, in general, that opens a very long, broad, and windy discussion about the underlying reasons, and the original design goals of the x86 machine language instruction set. Much of it actually no longer applies (and you would probably find quite boring, actually), since the modern x86 CPU is something quite radically different than its original generation. But it has to remain binary compatible. Hence, little oddities like that.
Just to give you a basic understanding: the first move is between two CPU registers. Doesn't take a long novel to specify from where, and to where. The second move has to specify a 32 bit value (0 to be precise), a CPU register, and a memory offset. That has to be specified somewhere. You have to find the bytes somewhere to specify all the little details for that.

How can I determine in c++ where one Instruction (In bytes) ends and another starts? [duplicate]

This question already has an answer here:
X86 Assembly - How to calculate instruction opcodes length in bytes [closed]
How to tell length of an x86-64 instruction opcode using CPU itself?
(1 answer)
Closed 3 years ago.
For example, at the address 0x762C51, there is the instruction call sub_E91E50.
In bytes, this is E8 FA F1 72 00.
Next, at the address 0x762C56, there is the instruction push 0. In bytes, this is 6A 00.
Now, when it comes to C++ reading a function like this, it would only have the bytes like: E8 FA F1 72 00 6A 00
How can I determine where the first instruction ends and the next begins.
For variable length instruction sets you can't really do this. Many tools will try but it is often trivial to mess them up if you try. Compiled code works better.
The best way which won't necessarily result in a complete disassembly is to go in execution order. So you have to have a known correct entry point and go from there following the execution paths and looking for collisions and setting those aside for a human to figure out.
Simulation is even better and it might give you better coverage in some areas, but also will leave gaps where an execution order disassembler wouldn't.
Even gnu assembler messes up with variable length instruction sets (next time please specify the target/isa as stated in the assembly tag). So whatever a "full disassembler" is can't possibly do it either, in general.
If someone has told you like it's a class assignment or you compile a function to an object and based on the label in the disassembly you feel comfortable that is a valid entry point then you can start disassembling there in execution order. Understanding that if it is an object there will be incomplete instructions to be filled in later, if you link then if you assume that a function label address is an entry point then can disassemble in execution order.
C++ really has nothing to do with this if you have a sequence of bytes and you are sure you know the entry point it doesn't much matter how that code was created, unless it intentionally has anti-disassembly items in it (hand written or compiler/tool created) generally this is not a problem, but technically a tool could do this and it is trivial for a human to do it by hand.

Fastest way to compare a double to exact 0 while both +0.0 or -0.0 are accepted

So far I have the following:
bool IsZero(const double x) {
return fabs(x) == +0.0;
}
Is this the fastest of correct ways to compare to exact 0, while both +0.0 and -0.0 are accepted?
If CPU-specific, lets consider x86-64. If compiler specific, lets consider MSVC++2017 toolset v141.
Since you said you want the fastest possible code, I'm going to make some important simplifying assumptions throughout this answer. These are legal, per the question. In particular, I'm assuming x86 and IEEE-754 representations of floating-point values. I'll also mention MSVC-specific quirks, where applicable, although the general discussion would apply to any compiler targeting this architecture.
The way you test whether a floating-point value is equal to zero is by testing all of its bits. If all of the bits are 0, then the value is zero. Actually, the value is +0.0. The sign bit can be either 0 or 1, since the representation allows such thing as positive and negative 0.0, as you mention in the question. But this difference doesn't actually exist (there's not really any such thing as +0.0 and −0.0), so what you really need is to test all bits except the sign bit.
This can be done quickly and efficiently with some bit-twiddling. On little-endian architectures like x86, the sign bit is the leading bit, so you simply shift it out and then test the remaining bits.
This trick is described by Agner Fog in his Optimizing Subroutines in Assembly Language. Specifically, example 17.4b (on page 156 in the current version).
For a single-precision floating-point value (i.e., float), which is 32-bits wide:
mov eax, DWORD PTR [floatingPointValue]
add eax, eax ; shift out the sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
Translating this into C code, you'd do something like:
const uint32_t bits = *(reinterpret_cast<uint32_t*>(&value));
return ((bits + bits) == 0);
Of course, this is formally unsafe because of the type punning. MSVC lets you get away with it, no problem. In fact, if you try to actually conform to the standard and play it safe, MSVC will tend to generate less efficient code, decreasing the effectiveness of this trick. If you want to do this safely, you'll need to verify the output of your compiler and make sure it's doing what you want. Some assertions are also recommended.
If you're okay with the unsafe nature of this approach, you will find that it is faster than a poorly-predicted conditional branch, so when you're dealing with random input values, it might be a performance win. For comparison purposes, here is what you'll see from MSVC if you just do a naive test for equality against 0.0:
;; assuming /arch:IA32, which is *not* the default in modern versions of MSVC
;; but necessary if you cannot assume SSE2 support
fld DWORD PTR [floatingPointValue]
fldz
fucompp
fnstsw ax
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
;; assuming /arch:SSE2, which *is* the default in modern versions of MSVC
movss xmm0, DWORD PTR [floatingPointValue]
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
Ugly, and potentially slow. There are branchless ways of doing this, but MSVC won't use them.
An obvious drawback to the "optimized" implementation described above is that it requires the floating-point value be loaded from memory in order to access its bits. There are no x87 instructions that can access the bits directly, and there's no way go directly from an x87 register to a GP register without going through memory. Since memory access is slow, this does incur a performance penalty, but in my tests, it's still faster than a mispredicted branch.
If you're using any of the standard calling conventions on 32-bit x86 (__cdecl, __stdcall, etc.), then all floating-point values are passed and returned in the x87 registers, so there's no difference in moving from an x87 register to a GP register versus moving from an x87 register to an SSE register.
The story is a bit different if you're targeting x86-64 or if you are using __vectorcall on x86-32. Then, you actually have floating-point values stored and passed in SSE registers, so you can take advantage of branchless SSE instructions. At least, theoretically. MSVC won't, unless you hold its hand. It would normally do the same branching comparison shown above, just without the extra memory load:
;; MSVC output for a __vectorcall function, targeting x86-32 with /arch:SSE2
;; and/or for x86-64 (which always uses a vector calling convention and SSE2)
;; The floating point value being compared is passed directly in XMM0
ucomiss xmm0, DWORD PTR [constantZero]
lahf
test ah, 44h
jp IsNonZero
mov al, 1
ret
IsNonZero:
xor al, al
ret
I've demonstrated the compiler output for a very simple bool IsZero(float val) function, but in my observations, MSVC always emits a UCOMISS+JP sequence for this type of comparison, no matter how the comparison is incorporated into the input code. Again, fine if the zero-ness of the input is predictable, but relatively lousy if branch prediction fails.
If you want to ensure you get branchless code, avoiding the possibility of branch-misprediction stalls, then you need to use intrinsics to do the comparison. These intrinsics will force MSVC to emit code closer to what you would expect:
return (_mm_ucomieq_ss(_mm_set_ss(floatingPointValue), _mm_setzero_ps()) != 0);
Unfortunately, the output is still not perfect. You suffer from general optimization deficiencies surrounding the use of intrinsics—namely, some redundant shuffling of the input value between various SSE registers—but that is (A) unavoidable, and (B) not a measurable performance problem.
I'll note here that other compilers, like Clang and GCC, don't need their hands held. You can just do value == 0.0. The exact sequence of code that they emit varies, depending on your optimization settings, but you'll see either COMISS+SETE, UCOMISS+SETNP+CMOVNE or CMPEQSS+MOVD+NEG (the latter is used exclusively by ICC). Your attempting to hold their hands with intrinsics would almost certainly result in less efficient output, so this probably needs to be #ifdef'ed to limit it to MSVC.
That's single-precision values, which have a width of 32 bits. What about double-precision values, which are twice as long? You'd think these would have 63 bits to test (since the sign bit is still ignored), but there's a twist. If you can rule out the possibility of denormal numbers, then you can get away with testing only the upper bits (again, assuming little-endian).
Agner Fog discusses this as well (example 17.4d). If you exclude the possibility of denormal numbers, then a value of 0 corresponds to the case where the exponent bits are all 0. The upper bits are the sign bit and the exponent bits, so you can just test these exactly as you did for single-precision values:
mov eax, DWORD PTR [floatingPointValue+4] ; load upper bits only
add eax, eax ; shift out sign bit to ignore -0.0
sete al ; set AL if the remaining bits were 0
In unsafe C:
const uint64_t bits = *(reinterpret_cast<uint64_t*>(&value);
const uint32_t upperBits = (bits & 0xFFFFFFFF00000000) >> 32;
return ((upperBits + upperBits) == 0);
If you do need to account for denormal values, then you aren't saving yourself anything. I haven't tested this, but you're probably no worse letting the compiler generate the code for a naive comparison. At least, not for x86-32. You might still gain on x86-64, where you have 64-bit-wide GP registers.
If you can assume SSE2 support (which would be all x86-64 systems, and all modern x86-32 builds as well), then you just use intrinsics, and you get denormal support for free (well, not really free; there are internal penalties in the CPU, I believe, but we'll ignore those):
return (_mm_ucomieq_sd(_mm_set_sd(floatingPointValue), _mm_setzero_pd()) != 0);
Again, as with single-precision values, the use of intrinsics is not necessary on compilers other than MSVC to get optimal code, and indeed may result in sub-optimal code, so should be avoided.
In plain and simple words, if you want to accept exactly +0.0 and -0.0, you just have to use:
x == 0.0
OR
From the cmath library you can use:
int fpclassify( double arg ) which will return "zero" for -0.0 or +0.0
If you open the assembler of the code you can find what kind of assembler instructions are used for different versions of your code. Having the assembler you can estimate which is better.
In GCC compiler you can keep intermediate files (including assembler version) by this way:
gcc -save-temps main.cpp

Getting "actual" registers from MCInsts (x86)

I'm using llvm-mc with the goal of making a relatively smart disassembler (identifying and tracking locals, easily following branches, etc), and part of that is creating a string representation of the disassembled instructions.
When I started this, I expected that I would be able to relatively easily identify registers and values used by MCInsts and whip out another representation myself with which I could easily work with. However, after some investigation, I realized that the correlation between the operands shown with the textual representation of an instruction and the operands that are actually present within the MCInst object is fairly low. Here are a few examples (Intel syntax):
Moving, say, 11587 as a 32-bit immediate into eax would be done with the MOV32ri opcode. The textual representation would be mov eax, 11587. The corresponding MCInst would have two operands, a register and an immediate. This works for me. This is great.
Adding 11587 to eax would be done with the ADD32ri opcode. The textual representation would be add eax, 11587. However, this time, the corresponding MCInst has three operands: eax is there twice and the immediate is in the end. This isn't so great. I can assume that this is an artifact of the lowering process, that the first instance of eax is the destination register and that the second one is there to be the source (even though x86 does not distinguish between the two), and I can hack around that.
Moving a 32-bits eip-relative value to eax would be done with the MOV32ao32 opcode. The textual representation would be mov eax, dword ptr [11587]. In this case, the MCInst doesn't even have an operand for eax, it can only be inferred from the operand type present in the opcode name. I can hack around that too, but things are getting less and less pretty and I've only tested 5-6 different instructions out of the 1300+ that x86 supports.
Obviously, for the purpose of showing text, I could get the textual representation with an MCInstPrinter, but the mapping between what's shown there and what the MCInst has is still muddy.
Is there a straightforward way to tell which operands appears in the textual representation of an instruction?
Add having three arguments sounds like a compiler builder preference for Three address code is bleeding through, since there is no justification for that in Intel assembler. (you can't add and store to a different register with the ADD instruction, you can with LEA though).
The opcodes run into the hundreds if you count all extensions (like SSE, FPU etc), and worse there are multiple variants of an opcode due to addressing modes and prefixes.
The NASM assembler has some tables in the source that you could try to mine if your llvm-mc system doesn't provide the functionality.
The MC level is very low and the operand layout depends on the opcode. That said, there are mapping tables that tell you what is where. MCInstrDesc and MCOperandInfo will tell you which operands and sources and destinations, whether they are immediates, registers, etc. and a set of flags.
You'll also need to get familiar with MCRegisterClass and MCRegisterInfo and a bunch of other stuff. It's a complicated interface because the task of representing arbitrary target information is complicated.
I would look at the code for the various MC-based tools to get started. You shouldn't need your own representation, MC should have everything you need.

Fast way to get a valid imm num for arm mov?

Arm mov has a limitation that the immediates must be an 8bit rotated by a multiple of 2, we can write:
mov ip, #0x5000
But we cannot write that:
mov ip, #0x5001
The 0x5000 can be split as 0x5000 + 1, my means, the sum of a valid immediate number and a small number.
So for a given 32bit number, how to find the closest valid immediate number fast? Like this:
uint32 find_imm(uint32 src, bool less_than_src) {
...
}
// x is 0x5000
uint32 x = find_imm(0x5001, true);
It is quite simple, look at the distance between the ones. 0x5001 = 0b101000000000001. 15 significant digits, so it will take you two instructions at 8 bits of immediate per. Also remember to put a rotate in your test, if there are enough zeros 0x80000001 and you rotate that around 0x88000000 or 0x00000003 that is only two significant digits from a distance between the ones measurement. So take the immediate, perform a distance between the ones type test, rotate one step, perform the test again, and repeat until all the possible (counter-)rotations have happened and go with one of the ones with the smallest number of instructions/immediates.
gnu as already does this and gas is open source so you can just go get their code if you prefer. When you use the load address trick:
ldr rd,=const
If that const can be resolved in a single move immediate instruction then it encodes it as a
mov rd,#const
if it cant then it tries to find a location to put the word and encodes it as a pc relative load:
ldr rd,[pc,#offset]
...
.word const
There is not a straightforward rule or function for finding ways to construct values. Once a value exceeds what can be loaded easily from immediate values, you usually load it by defining it in the data section and loading it from memory, rather than constructing it from immediate values.
If you do want to construct a value from two immediate values, you must consider a variety of operations, including:
Adding two immediates.
Subtracting two immediates.
Multiplying two immediates.
More esoteric instructions, such as some of the “SIMD” instructions that split 32-bit registers into multiple lanes.
If you must go to three immediate values, there are more combinations. One can find some patterns in the possibilities that reduce the search, but some portion of it remains a “brute force” search. Generally, there is no point in using complicated instructions sequences, since you can simply load the data from a prepared location in memory.
The ARM assembler has an instruction form to assist this:
LDR Rd, =const
When the assembler sees this, it places the const value in the literal pool and generates an instruction to load the value from the pool. If you are using a different assembler, it might not have the same instruction form, but you can write the necessary code manually.