Disassembling the following function using VS2010
int __stdcall modulo(int a, int b)
{
    return a % b;
}
gives you:
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
cdq
idiv eax,dword ptr [ebp+0Ch]
mov eax,edx
pop ebp
ret 8
which is pretty straightforward.
Now, trying the same assembly code as inline assembly fails with error C2414: illegal number of operands, pointing to idiv.
I read Intel's manual, and it says that idiv accepts only one operand:
Divides the (signed) value in the AX, DX:AX, or EDX:EAX (dividend) by
the source operand (divisor) and stores the result in the AX (AH:AL),
DX:AX, or EDX:EAX registers.
and sure enough, removing the extra eax operand makes it compile, and the function returns the correct result.
So, what is going on here? Why is VS2010 emitting erroneous code? (By the way, VS2012 emits exactly the same assembly.)
The difference, most likely, is that the author of the disassembler intended for its output to be read by a human rather than by an assembler. Since it's a disassembly, one assumes the underlying code has already been assembled/compiled, so why worry about whether it can be re-assembled?
There are two major possibilities here:
The author didn't realize or didn't remember that the dividend is implicit for idiv (in which case this could be considered a bug).
They felt that making it explicit in the output would be helpful for a reader trying to understand the flow of the disassembly (in which case it's a feature). Without the explicit operand, it would be easy to overlook that idiv both depends on and modifies eax when scanning the disassembly.
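For reference, here is the inline-assembly version with the extra operand removed, which is what the assembler actually accepts. This is a sketch assuming MSVC's 32-bit __asm syntax (the name modulo_inline is illustrative); MSVC may warn (C4035) that the function has no explicit return statement, since the result is simply left in EAX:
int modulo_inline(int a, int b)
{
    __asm {
        mov eax, a    ; load the dividend into EAX
        cdq           ; sign-extend EAX into EDX:EAX
        idiv b        ; one operand only: divide EDX:EAX by b (quotient in EAX, remainder in EDX)
        mov eax, edx  ; leave the remainder in EAX as the return value
    }
}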
Related
This experiment was done using GCC 6.3. There are two functions whose only difference is the order in which we assign i32 and i16 in the struct. We assumed that both functions should produce the same assembly; however, this is not the case. The "bad" function produces more instructions. Can anyone explain why this happens?
#include <inttypes.h>

union pack {
    struct {
        int32_t i32;
        int16_t i16;
    };
    void *ptr;
};
static_assert(sizeof(pack) == 8, "what?");

void *bad(const int32_t i32, const int16_t i16) {
    pack p;
    p.i32 = i32;
    p.i16 = i16;
    return p.ptr;
}

void *good(const int32_t i32, const int16_t i16) {
    pack p;
    p.i16 = i16;
    p.i32 = i32;
    return p.ptr;
}
...
bad(int, short):
movzx eax, si
sal rax, 32
mov rsi, rax
mov eax, edi
or rax, rsi
ret
good(int, short):
movzx eax, si
mov edi, edi
sal rax, 32
or rax, rdi
ret
The compiler flags were -O3 -fno-rtti -std=c++14
This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so no need to report a missed-optimization bug on GCC's bugzilla. (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)
# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
movzx esi, si # mov-elimination never works for 16->32 movzx
mov eax, edi # mov-elimination works between different regs
sal rsi, 32
or rax, rsi
ret
Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so¹. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.
Writing different parts of the same object and then reading the whole object is tricky for compilers in general so it's not a shock to see different asm when you write different parts of the full object in a different order.
Note that there's nothing "smart" about the bad asm, it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted mov missed optimizations like this are more common in tiny functions than when part of a larger function.
If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transformed through to get here. (Godbolt has a GCC tree-dump pane to help with this.)
Footnote 1: Or at least hope so, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer simply never came up until some other optimization pass changed its behaviour, exposing a poor end result for a case the author of that code wasn't thinking about when writing or tweaking it to improve some other case.
Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
Well, technically there is a read of uninitialized data, because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help, and doesn't change the asm (GCC happened to already zero the padding because that's convenient).
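If you need strict ISO conformance rather than the GNU dialect, a memcpy-based sketch of the same bit-packing (the name pack_memcpy is illustrative; it assumes 64-bit pointers and the same little-endian layout as the union):
#include <cstdint>
#include <cstring>

// ISO-conforming alternative to union type-punning: build the 64 bits
// explicitly, then memcpy them into a pointer object.
void *pack_memcpy(int32_t i32, int16_t i16) {
    uint64_t bits = ((uint64_t)(uint16_t)i16 << 32) | (uint32_t)i32;
    void *p;
    static_assert(sizeof(p) == sizeof(bits), "64-bit pointers assumed");
    std::memcpy(&p, &bits, sizeof(p));
    return p;
}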
Also note, both versions in the question are less efficient than possible:
(The bad output from GCC4.8 or GCC11-trunk avoiding the wasted mov looks optimal for that choice of strategy.)
mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.
hand_written:
movzx eax, si # 16->32 can't be eliminated, only 8->32 and 32->32 mov
shl rax, 32
mov ecx, edi # zero-extend into a different reg with 0 latency
or rax, rcx
ret
Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)
minimal_uops_worse_latency: # also more uops on AMD.
movzx eax, si
shl rdi, 32 # int32 bits to the top of RDI
shld rax, rdi, 32 # shift the high 32 bits of RDI into RAX.
ret
If your struct were ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX. That could be efficient on non-Intel CPUs, and on Haswell and later, which don't do partial-register renaming except for high-8 regs like AH.
Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or slightly less aggressive, you could compile it to just leave garbage for the padding part of the struct, the last 2 bytes.
high_garbage:
shl rsi, 32 # leaving high garbage = incoming high half of ESI
mov eax, edi # zero-extend into RAX
or rax, rsi
ret
Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)
I've been frustrated by passing parameters from a C++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and MASM to compile my assembly code.
This is a simplified version of my C++ file, where I call the assembly procedure set_clock:
int main()
{
    TimeInfo localTime;
    char clock[4] = { 0, 0, 0, 0 };
    set_clock(clock, &localTime);
    system("pause");
    return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
    array:qword, address:qword
    mov rdx, array    ; works fine, memory address: 0x1052440000616
    mov rdi, address  ; value of rdi is 14757395258967641292
    mov al, [rdx]
    mov [rdi], al     ; ERROR: can't access that memory location
    ret
set_clock ENDP
END
MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. It's always a good idea to check the real machine code by disassembling, or by using your debugger's disassembly view instead of source mode, whenever anything weird is happening with assembler macros.
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
    array:qword, address:qword
    movzx eax, byte ptr [array]
    mov [address], al
    ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.
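For completeness, here is a sketch of how the C++ side might declare the routine so the call matches the MASM procedure (assuming TimeInfo is your own struct; extern "C" prevents C++ name mangling so the linker finds the symbol):
// Hypothetical C++-side declaration matching the MASM PROC above.
struct TimeInfo;  // defined elsewhere in the caller's code
extern "C" void set_clock(char *clock, TimeInfo *localTime);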
Sometimes compilers generate code with weird instruction duplications that can safely be removed. Consider the following piece of code:
int gcd(unsigned x, unsigned y) {
    return x == 0 ? y : gcd(y % x, x);
}
Here is the assembly code (generated by clang 5.0 with optimizations enabled):
gcd(unsigned int, unsigned int): # #gcd(unsigned int, unsigned int)
mov eax, esi
mov edx, edi
test edx, edx
je .LBB0_1
.LBB0_2: # =>This Inner Loop Header: Depth=1
mov ecx, edx
xor edx, edx
div ecx
test edx, edx
mov eax, ecx
jne .LBB0_2
mov eax, ecx
ret
.LBB0_1:
ret
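Note that clang has converted the tail recursion into the loop at .LBB0_2. An equivalent iterative form of the source, sketched here purely for illustration (the name gcd_loop is not part of the original code):
// Iterative equivalent of the recursive gcd, matching the compiled loop.
int gcd_loop(unsigned x, unsigned y) {
    while (x != 0) {
        unsigned r = y % x;  // the div computes this remainder in edx
        y = x;               // old x becomes the new y
        x = r;
    }
    return y;
}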
In the following snippet:
mov eax, ecx
jne .LBB0_2
mov eax, ecx
If the jump doesn't happen, eax is reassigned for no obvious reason.
The other example is the two rets at the end of the function: one alone would work just as well.
Is the compiler simply not intelligent enough, or is there a reason not to remove the duplications?
Compilers can perform optimisations that are not obvious to people and removing instructions does not always make things faster.
A small amount of searching shows that various AMD processors have branch prediction problems when a RET is immediately after a conditional branch. By filling that slot with what is essentially a no-op, the performance problem is avoided.
Update:
Example reference, section 6.2 of the "Software Optimization Guide for AMD64 Processors" (see http://support.amd.com/TechDocs/25112.PDF) says:
Specifically, avoid the following two situations:
Any kind of branch (either conditional or unconditional) that has the single-byte near-return RET instruction as its target. See “Examples.”
A conditional branch that occurs in the code directly before the single-byte near-return RET instruction.
It also goes into detail on why jump targets should have alignment which is also likely to explain the duplicate RETs at the end of the function.
Any compiler will have a bunch of transformations for register renaming, unrolling, hoisting, and so on. Combining their outputs can lead to suboptimal cases such as what you have shown. Marc Glisse offers good advice: it's worth a bug report. You are describing an opportunity for a peephole optimizer to discard instructions that either
don't affect the state of registers & memory at all, or
don't affect state that matters for the function's post-conditions, i.e. state that won't be visible through its public API.
Sounds like an opportunity for symbolic execution techniques. If the constraint solver finds no branch points for a given MOV, perhaps it is really a NOP.
volatile int num = 0;
num = num + 10;
The above C++ code seems to produce the following x86 assembly (Intel syntax):
mov DWORD PTR [rbp-4], 0
mov eax, DWORD PTR [rbp-4]
add eax, 10
mov DWORD PTR [rbp-4], eax
If I change the C++ code to
volatile int num = 0;
num = num + 0;
why will the compiler not produce assembly code like this:
mov DWORD PTR [rbp-4], 0
mov eax, DWORD PTR [rbp-4]
add eax, 0
mov DWORD PTR [rbp-4], eax
gcc7.2 -O0 leaves out the add eax, 0, but all the other instructions are the same (Godbolt).
At which part of the compilation process does this kind of trivial code get removed? Is there any compiler flag that will make GCC not do these kinds of optimizations?
clang will emit add eax, 0 at -O0, but none of gcc, ICC, or MSVC will. See below.
gcc -O0 doesn't mean "no optimization". gcc doesn't have a "braindead literal translation" mode where it tries to transliterate every component of every C expression directly to an asm instruction.
GCC's -O0 is not intended to be totally un-optimized. It's intended to be "compile-fast" and make debugging give the expected results (even if you modify C variables with a debugger, or jump to a different line within the function). So it spills / reloads everything around every C statement, assuming that memory can be asynchronously modified by a debugger stopped before such a block. (Interesting example of the consequences, and a more detailed explanation: Why does integer division by -1 (negative one) result in FPE?)
There isn't much demand for gcc -O0 to make even slower code (e.g. forgetting that 0 is the additive identity), so nobody has implemented an option for that. And it might even make gcc slower if that behaviour was optional. (Or maybe there is such an option but it's on by default even at -O0, because it's fast, doesn't hurt debugging, and useful. Usually people like it when their debug builds run fast enough to be usable, especially for big or real-time projects.)
As #Basile Starynkevitch explains in Disable all optimization options in GCC, gcc always transforms through its internal representations on the way to making an executable. Just doing this at all results in some kinds of optimizations.
For example, even at -O0, gcc's "divide by a constant" algorithm uses a fixed-point multiplicative inverse or a shift (for powers of 2) instead of an idiv instruction. But clang -O0 will use idiv for x /= 2.
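To see this yourself, here is a tiny test case you can feed to both compilers on Godbolt (the name halve is just for illustration):
// gcc -O0 implements the signed division with shift-and-correct
// instructions even un-optimized; clang -O0 emits an actual idiv.
int halve(int x) { return x / 2; }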
Clang's -O0 optimizes less than gcc's in this case, too:
void foo(void) {
    volatile int num = 0;
    num = num + 0;
}
asm output on Godbolt for x86-64
push rbp
mov rbp, rsp
# your asm block from the question, but with 0 instead of 10
mov dword ptr [rbp - 4], 0
mov eax, dword ptr [rbp - 4]
add eax, 0
mov dword ptr [rbp - 4], eax
pop rbp
ret
As you say, gcc leaves out the useless add eax,0. ICC17 stores/reloads multiple times. MSVC is usually extremely literal in debug mode, but even it avoids emitting add eax,0.
Clang is also the only one of the 4 x86 compilers on Godbolt that will use idiv for return x/2;. The others all use SAR (plus CMOV or whatever) to implement C's signed division semantics.
As per the "as if" rule in C++, an implementation is freely allowed to do whatever it wants, provided that the observable behaviour matches the standard. Specifically, in C++17, 4.6/1 (as one example):
... conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
This provision is sometimes called the "as-if" rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program.
For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.
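As a concrete illustration of the as-if rule, consider the same code without volatile (a sketch; the assembly comment reflects typical optimized output):
// Without volatile, the store and the add have no observable behaviour,
// so an optimizing compiler may fold the whole thing away.
int f() {
    int num = 0;    // dead store once the next line is folded into it
    num = num + 0;  // adding zero changes nothing
    return num;     // at -O2 this typically compiles to: xor eax, eax ; ret
}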
As to how to control gcc, my first suggestion would be to turn off all optimisation by using the -O0 flag. You can get more fine-tuned control by using various -f<blah> options but -O0 should be a good start.
I am a teaching assistant for computer science and one of my students submitted the following code to check whether an integer is odd or even:
int is_odd(int i) {
    if ((i % 2 == 1) && (i % 2 == -1));
    else;
}
Surprisingly (at least for me), this code gives correct results. I tested numbers up to 100000000, and I honestly cannot explain why this code behaves as it does.
We are using gcc v6.2.1 and C++.
I know that this is not a typical question for SO, but I hope to find some help.
Flowing off the end of a function without returning anything is undefined behaviour, regardless of what actually happens with your compiler. Note that if you pass -O3 to GCC, or use Clang, then you get different results.
As for why you actually see the "correct" answer, this is the x86 assembly which GCC 6.2 produces at -O0:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
cdq
shr edx, 31
add eax, edx
and eax, 1
sub eax, edx
cmp eax, 1
nop
pop rbp
ret
Don't worry if you can't read x86. The important thing to note is that eax is used for the return value, and all the intermediate calculations for the if statement use eax as their destination. So when the function exits, eax just happens to hold i % 2, which is non-zero exactly when i is odd.
Of course, this is all a purely academic discussion; the student's code is wrong and I'd certainly give it zero marks, regardless of whether it passes whatever tests you run it through.
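For reference, a correct version that doesn't rely on undefined behaviour (note that the student's && condition can never be true, since i % 2 cannot be both 1 and -1 at once):
// Correct: i % 2 is non-zero (1 for positive i, -1 for negative i)
// exactly when i is odd.
int is_odd(int i) {
    return i % 2 != 0;
}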