In my function I use
__asm
{
mov ecx,dword ptr [0x28F14131]
mov ecx,ds:[0x28F14131]
}
which should produce the opcode bytes 0x8B0D (mov ecx, dword ptr [disp32]). However, the first instruction produces 0xB9 (mov ecx, 0x28F14131, i.e. a move of the immediate rather than a load) and the second one produces 0x3E:8B0D (the right opcode, but with a redundant DS segment-override prefix).
So my question is, what instruction should I use to get the desired result inside the C++ __asm?
If you know with 100% certainty what the byte sequence for your inlined assembly is supposed to be, you can always emit those bytes explicitly. The exact syntax escapes me, but if you are using GCC, you may try something like:
__asm {
.byte 0x##
.byte 0x##
...
}
This approach only works if you know with 100% certainty what the byte sequence for the whole instruction is. And if you are going to do this, be sure to comment it appropriately.
(For what it is worth, I have had to use this approach in the past to work around a compiler bug where, no matter what I did, it would otherwise use the wrong byte sequence for one of the instructions.)
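For what it's worth, with MSVC's __asm blocks (which is what the question appears to be using), the equivalent of .byte is the _emit pseudo-instruction. A sketch of hand-encoding the desired instruction might look like this (the operand bytes below assume the 0x28F14131 address from the question, stored little-endian):
__asm {
_emit 0x8B ; opcode: mov r32, r/m32
_emit 0x0D ; ModRM: ecx, [disp32]
_emit 0x31 ; disp32 = 0x28F14131, little-endian
_emit 0x41
_emit 0xF1
_emit 0x28
}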
Related
Consider the following declaration of local variables:
bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};
On x86-64 architectures, I'd expect the optimizer to reduce the initialization of those variables to something like mov qword ptr [rsp], 0. But instead, what I get from every compiler I've been able to try (regardless of optimization level) is some form of:
mov byte ptr [rsp + 7], 0
mov byte ptr [rsp + 6], 0
mov byte ptr [rsp + 5], 0
mov byte ptr [rsp + 4], 0
mov byte ptr [rsp + 3], 0
mov byte ptr [rsp + 2], 0
mov byte ptr [rsp + 1], 0
mov byte ptr [rsp], 0
That seems like a waste of CPU cycles. Using copy-initialization, value-initialization, or replacing the braces with parentheses made no difference.
But wait, that's not all. Suppose that I have this instead:
struct
{
bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};
} bools;
Then the initialization of bools generates exactly what I'd expect: mov qword ptr [rsp], 0. What gives?
You can try the code above yourself in this Compiler Explorer link.
The behavior of the different compilers is so consistent that I am forced to think there is some reason for the above inefficiency, but I have not been able to find it. Do you know why?
Compilers are dumb; this is a missed optimization. mov qword ptr [rsp], 0 would be optimal. Store forwarding from a qword store to a byte reload of any individual byte is efficient on modern CPUs. (https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/)
(Or even better, push 0 instead of sub rsp, 8 + mov, also a missed optimization because compilers don't bother looking for cases where that's possible.)
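For illustration, the zero-init of those eight bools could look like this sketch (not the output of any particular compiler, just what the two suggestions above would amount to):
sub rsp, 8
mov qword ptr [rsp], 0 ; one 8-byte store instead of eight 1-byte stores
; or, even smaller:
push 0 ; reserves 8 bytes and zeroes them in one instruction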
Presumably the optimization pass that looks for store merging runs before nailing down the locations of locals in the stack frame relative to each other. (Or before even deciding which locals can be kept in registers and which need memory addresses at all.)
Store merging aka coalescing was only recently reintroduced in GCC8 IIRC, after being dropped as a regression from GCC 2.95 to GCC 3, again IIRC. (I think other optimizations, like assuming no strict-aliasing violations to keep more vars in registers more of the time, were more useful.) So it's been missing for decades.
From one POV, you could say you should consider yourself lucky you're getting any store merging at all (with struct members, and array elements, that are known early to be adjacent). Of course, from another POV, compilers ideally should make good asm. But in practice missed optimizations are common. Fortunately we have beefy CPUs with wide superscalar out-of-order execution that can usually chew through this crap quickly enough to still see upcoming cache-miss loads and stores, so wasted instructions sometimes have time to execute in the shadow of other bottlenecks. That's not always true, though, and clogging up space in the out-of-order execution window is never a good thing.
Related: In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values? covers the general case for constants other than 0, re: what the optimal asm would be. (The difference between array vs. separate locals was only discussed in comments there.)
I've been frustrated by passing parameters from a C++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and MASM to compile my assembly code.
This is a simplified version of my C++ file where I call the assembly procedure set_clock:
int main()
{
TimeInfo localTime;
char clock[4] = { 0,0,0,0 };
set_clock(clock,&localTime);
system("pause");
return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
array:qword,address:qword
mov rdx,array ; works fine memory address: 0x1052440000616
mov rdi,address ; value of rdi is 14757395258967641292
mov al, [rdx]
mov [rdi],al ; ERROR: can't access that memory location
ret
set_clock ENDP
END
MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. It's always a good idea to check the real machine code by disassembling, or by using your debugger's disassembly instead of source mode, if anything weird is happening with assembler macros.
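For example, with the Visual Studio tools you can disassemble the object file that MASM produced (the file name here is just a placeholder):
dumpbin /DISASM set_clock.obj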
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
array:qword,address:qword
movzx eax, byte ptr [array]
mov [address], al
ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
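A sketch of that variant (untested, and assuming the Windows x64 convention described above):
set_clock PROC ; rcx = array, rdx = address (Windows x64 calling convention)
movzx eax, byte ptr [rcx] ; load the first byte of the array
mov [rdx], al ; store it through the second pointer
ret
set_clock ENDP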
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.
I was reading a book on microprocessors and saw this snippet of C++ code that prints a string using inline ASM.
str (char *string_adr[])
{
_asm
{
mov bx, string_adr
mov ah, 2
top:
mov dl, [bx]
inc bx
cmp al, 0
je bot
int 21h
jmp top
bot:
mov al, 20h
int 21h
}
}
Now I was wondering how the cmp al, 0 works, since al is never written before that...
You'll have to look at the procedure-linkage code the compiler generates. It may, for example, decide to put arguments not only on the stack but also a selected few (one?) into registers. That doesn't seem to be the case here, though: my impression is that dl was actually meant instead of both references to al. Had that been the case, the result would be that when the end of string (0) is reached, a space character is printed before termination (int 21h looks like MS-DOS function 2, which outputs the character whose ASCII code is in dl). That makes some sense, so I believe this is a typo in the example. I can even offer a guess where the typo could come from: in an earlier version of that example, LODSB was probably used, which is as close to an auto-incrementing indirect register read as an 8086 offers, and its target is al. Because the use of LODSB is deprecated, the example was updated and the indirect register read with increment was written out explicitly, this time reading directly into dl instead of al, because that's where DOS function 02h expects the character. Unluckily, the test for end of string was missed, as was the later load with the ASCII code of the space character, and both were kept as they were in the previous version of that example.
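Under that reading (dl meant in both places), the loop would look something like this sketch:
_asm
{
mov bx, string_adr
mov ah, 2 ; DOS function 02h: output the character in DL
top:
mov dl, [bx] ; read the next character
inc bx
cmp dl, 0 ; end of string?
je bot
int 21h ; print it
jmp top
bot:
mov dl, 20h ; print a space at the end
int 21h
}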
I guess the author wrote the code by hand. The letters 'a' and 'd' are very similar in handwriting, especially when sloppily written:
(http://en.wikipedia.org/wiki/File:Cursive.svg)
The transcriber confused the letters (or didn't know the different forms of the printed and the handwritten 'a') and wrote mov dl, [bx]. Correct is mov al, [bx].
Disassembling the following function using VS2010
int __stdcall modulo(int a, int b)
{
return a % b;
}
gives you:
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
cdq
idiv eax,dword ptr [ebp+0Ch]
mov eax,edx
pop ebp
ret 8
which is pretty straightforward.
Now, trying the same assembly code as inline assembly fails with error C2414: illegal number of operands, pointing to idiv.
I read Intel's manual and it says that idiv accepts only one operand:
Divides the (signed) value in the AX, DX:AX, or EDX:EAX (dividend) by
the source operand (divisor) and stores the result in the AX (AH:AL),
DX:AX, or EDX:EAX registers
And sure enough, removing the extra eax makes it compile, and the function returns the correct result.
So, what is going on here? Why is VS2010 emitting erroneous code? (BTW, VS2012 emits exactly the same assembly.)
The difference, most likely, is that the author of the disassembler intended for its output to be read by a human rather than by an assembler. Since it's a disassembly, one assumes the underlying code has already been assembled/compiled, so why worry about whether it can be re-assembled?
There are two major possibilities here:
The author didn't realize or didn't remember that the dividend is implicit for idiv (in which case this could be considered a bug).
They felt that making it explicit in the output would be helpful for a reader trying to understand the flow of the disassembly (in which case it's a feature). Without the explicit operand, it would be easy to overlook that idiv both depends on and modifies eax when scanning the disassembly.
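For reference, a version of the inline assembly that does compile might look like this sketch (untested; the result is passed back through a local variable instead of relying on the value left in EAX):
int modulo(int a, int b)
{
int result;
__asm
{
mov eax, a ; dividend (low half)
cdq ; sign-extend EAX into EDX:EAX
idiv b ; one explicit operand: the divisor
mov result, edx ; remainder is left in EDX
}
return result;
}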
While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
QLayout::invalidate();
minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 into %rbx+0x564 and %rbx+0x568, but then it loads those same 8 bytes from %rbx+0x564 back into a register just to write them out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by Ubuntu, so it was presumably compiled by GCC with optimization enabled. The minSize and szHint variables are normal member variables of type QSize.)
I'm not sure you're correct when you say it's stupid. I think the compiler might be trying to optimize for code size here. There is no 64-bit immediate-to-memory mov instruction, so the compiler would have to generate two more mov instructions just like the ones above, at 10 bytes each, whereas the two moves it did generate are 14 bytes. The location has just been written, so there is most likely no memory latency, and I don't think you'll take any performance hit here.
The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (47 bytes).
If you try to strike a balance between code size and performance, then you could hope that the first instruction (the one that sets the value in RAX) completes before the last instruction needs to know the value in RAX. That might look something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov qword [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now, let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are held in a buffer and later reads can get the value from this buffer instead of reading it from cache. However, this only works if the loaded data is no larger than the store and is completely contained within it. Store forwarding will not work for this code, as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache, and then has to read the value from cache, which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything), so that's another problem.
I'd break down the lines as follows (I think several others have already commented on the same steps).
These two lines come from the inline definition of QSize() (http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h), which sets each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint, which is set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines finally set minSize using a 64-bit operation, because the compiler now knows the size of a QSize object. The address of minSize is 0x56c(%rbx).
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note: the first part sets two separate fields, and the next part copies a QSize object (regardless of its content). The question then is whether the compiler should be smart enough to build a compound 64-bit value because it saw the preset values just before. I'm not sure about that...
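For reference, the default constructor of QSize in qsize.h is roughly the following (paraphrased), which is where the -1 (0xffffffff) values come from:
class QSize
{
public:
QSize() { wd = ht = -1; } // a default-constructed QSize is "invalid": both fields are -1
private:
int wd;
int ht;
};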
In addition to Guillaume's answer, the 64-bit load/store is not aligned. But according to the Intel optimization guide (p. 3-62):
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
This IMO implies that an unaligned load/store that does not cross a cache-line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cache line.
Normally Intel processors use store-to-load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load from several smaller stores does not get store-forwarded and stalls instead (p. 3-66, p. 3-68):
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address "MEM"
mov mem + 4, ebx ; Store dword to address "MEM + 4"
fld mem ; Load qword at address "MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.
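For anyone who wants to repeat that experiment outside of Qt, a minimal standalone sketch of the same pattern might look like this (the names and the amount of filler are made up, not the real Qt layout):
struct Size { int w = -1; int h = -1; }; // stands in for QSize: default is (-1, -1)
struct Layout
{
char pad[20]; // vary this filler to move the fields relative to a cache-line boundary
Size szHint;
Size minSize;
};
void invalidate(Layout *l)
{
l->minSize = l->szHint = Size(); // mirrors the Qt code: minSize = szHint = QSize();
}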