I have a C++ program that reports a crash. Analyzing it further, I am able to figure out the exact place where the data on stack gets corrupted. But, I am not able to understand why that happens or what would cause such corruption. At a point in time
value at [rsp+40h] is 00000249 15ed4de0
Then the following assembly instruction gets executed
mov rsi,qword ptr [rsp+40h]
Now, the value at
[rsp+40h] is 00000000 15ed4de0
The higher order DWORD has changed from 00000249 to 00000000. There are no changes in the addresses nearby. I do not understand why a read from [rsp+40h] would end up corrupting the value present at that address.
Any pointers are welcome. Thanks in advance.
Looking at PIN's Inspection API Page there is a method called INS_IsLea() which returns true if the current instruction is a LEA isntruction. This is useful, but I need to determine if one of the instruction operands are indirect references. For example:
MOV rax, (%rdi)
I'd like to be able to determine that the second operand is in fact an indirect reference, but there does not seem to be a method available for this.
INS_OperandIsMemory() if you just want to check if one of the operand is a memory operand.
Note that the second argument, n, indicates the operand number (0-indexed) in Intel syntax, e.g.
; rax = destination op; rdi = source op
mov rax, [rdi] ; rax = op #0 ; rdi = op #1
It looks like you are looking for two cases:
If the an operand is a register containing a memory address. This is used for indirect jumps or calls. Can be detected via INS_IsIndirectBranchOrCall(ins)
If the second operand is a memory location used for arithmetic or move operations. Can be detected by (!INS_IsBranchOrCall(ins) && (INS_MemoryOperandCount(ins) > 0))
NOTE:
lea instruction just loads a memory address into a register. It never actually access the memory address. Read here. INS_IsLea returns true is the instruction opcode is lea. Hence it will return false for MOV rax, (%rdi)
I am just guessing here because I have never heard about this library before reading this question, but...
I think it follows that if one of INS_IsMemoryRead(), INS_IsMemoryWrite() is true, it has to have an indirect memory operand.
Also, from different angle, INS_MemoryOperand*() functions inspects the operands.
I got the following code in ASM:
MOV EAX,DWORD PTR DS:[EBX+F8]
EBX containts an address, F8 is the offset added to this address. If I get this right, eax contains the dereferenced value from address+offset after the operation is executed.
What I want to do now is, to write some pattern in C++ while using inline asm, which allows me to fetch/retrieve F8 without changing the code in case F8 changes.
Is there any pattern search method (like regex) which I can use here? Is the offset possibly saved in any register? Or is this quite impossible to do?
Hopefully the information provided is enough, I could add some more lines of code if you wish.
You can add offset to the value in EBX register and then get the value from the updated address.
In my function I use
__asm
{
mov ecx,dword ptr [0x28F1431]
mov ecx,ds:[0x28F14131]
}
which should produce the following bytes: 0x8B0D (mov ecx, dword ptr []). However the first instruction produces 0xB9 (mov ecx,0x28F14131) and the second one 0x3E:8B0D
So my question is, what instruction should I use to get the desired result inside the C++ __asm?
If you know for 100% certain what your byte sequence is supposed to be for your inlined assembly, you can always explicitly use those bytes. The exact syntax escapes me, but if you are using GCC, you may try ....
__asm {
.byte 0x##
.byte 0x##
...
}
This approach only works if you know with 100% certainty what the byte sequences for the whole instruction are. AND if you are going to do this, be sure to comment appropriately.
(For what it is worth, I have had to use this approach in the past to work around a compiler bug where no matter what it would otherwise use the wrong byte sequence for one of the instructions.)
While stepping through some Qt code I came across the following. The function QMainWindowLayout::invalidate() has the following implementation:
void QMainWindowLayout::invalidate()
{
QLayout::invalidate()
minSize = szHint = QSize();
}
It is compiled to this:
<invalidate()> push %rbx
<invalidate()+1> mov %rdi,%rbx
<invalidate()+4> callq 0x7ffff4fd9090 <QLayout::invalidate()>
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
<invalidate()+43> pop %rbx
<invalidate()+44> retq
The assembly from invalidate+9 to invalidate+36 seems stupid. First the code writes -1 to %rbx+0x564 and %rbx+0x568, but then it loads that -1 from %rbx+0x564 back into a register just to write it out to %rbx+0x56c. This seems like something the compiler should easily be able to optimize into just another move immediate.
So is this stupid code (and if so, why wouldn't the compiler optimize it?) or is this somehow very clever and faster than using just another move immediate?
(Note: This code is from the normal release library build shipped by ubuntu, so it was presumably compiled by GCC in optimize mode. The minSize and szHint variables are normal variables of type QSize.)
Not sure you're correct when you're saying it's stupid. I think the compiler might be trying to optimize the code size here. There is no 64-bit immediate to memory mov instruction. So the compiler has to generate 2 mov instructions just like it did above. Each of them would be 10 bytes, the 2 moves generated are 14 bytes. It's been written to so there is most likely no memory latency so I do not think you'll take any performance hit here.
The code is "less than perfect".
For code size, those 4 instructions add up to 34 bytes. A much smaller sequence (19 bytes) is possible:
00000000 31C0 xor eax,eax
00000002 48F7D0 not rax
00000005 48898364050000 mov [rbx+0x564],rax
0000000C 4889836C050000 mov [rbx+0x56c],rax
;Note: XOR above clears RAX due to zero extension
For performance things aren't so simple. The CPU wants to do many instructions at the same time, and the code above breaks that. For example:
xor eax,eax
not rax ;Must wait until previous instruction finishes
mov [rbx+0x564],rax ;Must wait until previous instruction finishes
mov [rbx+0x56c],rax ;Must wait until "not" finishes
For performance you want to do this:
00000000 48C7C0FFFFFFFF mov rax,0xffffffff
00000007 C78364050000FFFFFFFF mov dword [rbx+0x564],0xffffffff
00000011 C78368050000FFFFFFFF mov dword [rbx+0x568],0xffffffff
0000001B C7836C050000FFFFFFFF mov dword [rbx+0x56c],0xffffffff
00000025 C78370050000FFFFFFFF mov dword [rbx+0x570],0xffffffff
;Note: first MOV sets RAX to 0xFFFFFFFFFFFFFFFF due to sign extension
This allows all of the instructions to be executed in parallel, with no dependencies anywhere. Sadly, it's also much larger (45 bytes).
If you try to get a balance between code size and performance; then you could hope that the first instruction (that sets the value in RAX) completes before the last instruction/s needs to know the value in RAX. This might be something like this:
mov rax,-1
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov dword [rbx+0x56c],rax
This is 34 bytes (the same size as the original code). This is likely to be a good compromise between code size and performance.
Now; let's look at the original code and see why it is bad:
mov dword [rbx+0x564],0xffffffff
mov dword [rbx+0x568],0xffffffff
mov rax,[rbx+0x564] ;Massive problem
mov [rbx+0x56C],rax ;Depends on previous instruction
Modern CPUs do have something called "store forwarding", where writes are stored in a buffer and future reads can get the value from this buffer to avoid reading the value from cache. Ironically, this only works if the size of the read is smaller than or equal to the size of the write. The "store forwarding" will not work for this code as there are 2 writes and the read is larger than both of them. This means that the third instruction has to wait until the first 2 instructions have written to cache and then has to read the value from cache; which could easily add up to a penalty of about 30 cycles or more. Then the fourth instruction must wait for the third instruction (and can't happen in parallel with anything) so that's another problem.
I'd break down the lines as this (think several have comment same steps)
These two lines comes from the inline definition of QSize() http://qt.gitorious.org/qt/qt/blobs/4.7/src/corelib/tools/qsize.h
which set each field separately. Also, my guess is that 0x564(%rbx) is the address of szHint which is also set at the same time.
<invalidate()+9> movl $0xffffffff,0x564(%rbx)
<invalidate()+19> movl $0xffffffff,0x568(%rbx)
These lines are finally setting minSize using 64bit operations because the compiler now know the size of a QSize object. And the address of minSize is 0x56c(%rbx)
<invalidate()+29> mov 0x564(%rbx),%rax
<invalidate()+36> mov %rax,0x56c(%rbx)
Note. First part is setting two separate fields, and next part is copying a QSize object (regardless content). The question then is, should the compiler be smart enough to build a compound 64bit value because it saw preset values just earlier? Not sure about that...
In addition to Guillaume's answer, the 64 bit load/store is not aligned. But according to the Intel optimization guide (p 3-62)
Misaligned data access can incur significant performance penalties.
This is particularly true for cache line splits. The size of a cache
line is 64 bytes in the Pentium 4 and other recent Intel processors,
including processors based on Intel Core microarchitecture.
An access to data unaligned on 64-byte boundary leads to two memory
accesses and requires several μops to be executed (instead of one).
Accesses that span 64-byte boundaries are likely to incur a large
performance penalty, the cost of each stall generally are greater on
machines with longer pipelines.
Which imo implies that an unaligned load/store that does not cross a cache line boundary is cheap. In this case the base pointer in the process I was debugging was 0x10f9bb0, so the two variables are 20 and 28 bytes into the cacheline.
Normally Intel processors use store to load forwarding, so a load of a value that was just stored doesn't even need to touch the cache. But the same guide also states that a large load of several smaller stores does not store-load-forward but stalls: (p 3-66, p 3-68)
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of
a load which is forwarded from a store must be completely contained
within the store data.
; A. Large load stall
mov mem, eax ; Store dword to address “MEM"
mov mem + 4, ebx ; Store dword to address “MEM + 4"
fld mem ; Load qword at address “MEM", stalls
So the code in question probably causes a stall, and therefore I'm inclined to believe it is not optimal. I wouldn't be very surprised if GCC does not take such limitations fully into account. Does anyone know if/how much modelling of store-to-load forwarding limitations GCC does?
EDIT: some experimenting with adding filler values before the minSize/szHint fields shows that GCC does not care at all where the cache line boundaries are, and neither does clang.