Calculating JMP instruction's address for trampoline hook

I am trying to calculate the relative address offset from one instruction to another.
I understand the basic calculation, and why I have to subtract 5 (to account for the size of the jmp instruction; see "Calculating JMP instruction's address").
The question is: what if I want to jump not to the start of a function but to a specific instruction after it?
For example:
[Image: original function]
I want to JMP to the highlighted instruction OPENGL32.dll+48195, while I only have the start address of OPENGL32.wglSwapBu.
From my code, I understand I can do
uintptr_t gatewayRelativeAddr = src - gateway - 5;
where src is the address of OPENGL32.wglSwapBu and gateway is the start address of my code.
// len is 5
BYTE* gateway = (BYTE*)VirtualAlloc(NULL, len+5, MEM_COMMIT | MEM_RESERVE,
PAGE_EXECUTE_READWRITE);
memcpy_s(gateway, len, src, len);
// Jump back to original function, but at the highlighted instruction
uintptr_t gatewayRelativeAddr = src - gateway - 5;
// add the jmp opcode to end of gateway
*(gateway + len) = 0xE9;
*(uintptr_t*)(gateway + len + 1) = gatewayRelativeAddr;
This is what I understand the code to do so far:
It calculates the relative distance in bytes from the start address of gateway to the start address of src (the original function). I also subtract 5 to account for the size of the jump.
However, when I viewed it in memory, the jump landed exactly where I wanted, even though nowhere in the code did I tell it to jump to the highlighted instruction.

This works because len equals exactly the number of bytes of instructions (5) that precede the desired one, which I presume was the whole point (you want to copy the instructions that will be jumped over, and maybe modify them later on?).
The jump instruction starts at gateway+len, and so EIP at the jump will be gateway+len+5. On the other hand, the address you want to jump to is src+len. So the relative address is (src+len)-(gateway+len+5), and len cancels out, so your formula is correct.
If you want to jump to an instruction that's not the next one after the ones you copied, you'll need to work out its offset from src by disassembly (call it ofs), and then set gatewayRelativeAddr to (src+ofs)-(gateway+len+5).
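The formula above can be sketched as a small helper. This is only an illustration under my own assumptions: the names Rel32 and WriteJmp are hypothetical and do not come from the question, and the displacement is written as exactly four bytes so that the same code also behaves on x64, where uintptr_t is eight bytes wide.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: displacement for an E9 jmp that sits at jmpAddr and
// should land on target. The CPU adds the displacement to the address of the
// NEXT instruction, i.e. jmpAddr + 5 (1 opcode byte + 4 displacement bytes).
inline std::int32_t Rel32(std::uintptr_t jmpAddr, std::uintptr_t target)
{
    const std::int64_t delta = static_cast<std::int64_t>(target)
                             - static_cast<std::int64_t>(jmpAddr + 5);
    // rel32 only reaches +/-2 GiB; this matters mainly in 64-bit processes.
    assert(delta >= INT32_MIN && delta <= INT32_MAX);
    return static_cast<std::int32_t>(delta);
}

// Hypothetical helper: writes "jmp target" into the 5 bytes starting at jmpAt.
inline void WriteJmp(std::uint8_t* jmpAt, const std::uint8_t* target)
{
    const std::int32_t rel = Rel32(reinterpret_cast<std::uintptr_t>(jmpAt),
                                   reinterpret_cast<std::uintptr_t>(target));
    jmpAt[0] = 0xE9;                          // jmp rel32 opcode
    std::memcpy(jmpAt + 1, &rel, sizeof rel); // exactly 4 bytes
}
```

With the names from the question, the call would look like WriteJmp(gateway + len, src + ofs), where ofs is the offset of the desired instruction found by disassembly. When ofs equals len, this reduces to the original src - gateway - 5.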

Related

Using memcpy on mmap'ed region crashes, a for loop does not

I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCI-e slot connecting to it. In that PCIe slot is an FPGA board which exposes some registers and a 64K memory area via PCIe.
On the ARM CPU of the Tegra board, a minimal Linux installation is running.
I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area.
The distinct register files and the memory block are all assigned addresses which are aligned and do not overlap with regards to 4KB memory pages.
I explicitly map whole pages with mmap, using the result of getpagesize(), which also is 4096.
I can read/write from/to those exposed registers just fine.
I can read from the memory area (64KB), doing uint32 word-by-word reads in a for loop, just fine. I.e. read contents are correct.
But if I use std::memcpy on the same address range, the Tegra CPU freezes, always. I do not see any error message. If GDB is attached, I also don't see anything in Eclipse when trying to step over the memcpy line; it just stops hard. I then have to reset the CPU using the hardware reset button, as the remote console is frozen.
This is debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise, I did not try that explicitly.
Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in the first place in this scenario - and I can just carry on using my for loops and think nothing of it?
EDIT: Another effect has been observed: The original code snippet was missing a "vital" printf in the copying for loop, that came before the memory read. That removed, I don't get back valid data. I now updated the code snippet to have an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.
Here are the (I think) important excerpts of what's going on, with minor modifications so that they make sense as shown, in this "de-fluffed" form.
// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped
//--------------------------------
// doing the memory mapping
//
const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );
const uint32_t physAddrNum = (uint32_t) physicalAddr;
const uint32_t offsetInPage = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize = mappedPagesCount * pageSize;
const off_t targetOffset = physAddrNum & ~(off_t)(pageSize - 1);
m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );
// addr passed as null means: the kernel chooses where to place the mapping. A non-null addr would only be taken as a "hint" where to place it.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
m_userSpaceMappedAddr = (volatile void*) ( uint32_t(mapAtPageStart) + offsetInPage );
}
//--------------------------------
// Accessing the mapped memory
//
//void* m_rawData: <== m_userSpaceMappedAddr
//uint32_t* destination: points to a stack object
//int length: size in 32bit words of the stack object (a struct with only U32's in it)
// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );
// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i=0; i<length; ++i)
{
// This extra read makes the data gotten in the 2nd read below valid.
// Commented out, the data read into destination will not be valid.
uint32_t tmp = ((const volatile uint32_t*)m_rawData)[i];
(void)tmp; //pacify compiler
destination[i] = ((const volatile uint32_t*)m_rawData)[i];
}

Based on the description, it looks like your FPGA logic is not responding correctly to load instructions that read from locations on your FPGA, and that is causing the CPU to lock up. It's not crashing; it is permanently stalled, hence the need for the hard reset. I also had this problem when debugging my PCIE logic on an FPGA.
Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.
Your loop is doing 32-bit loads, but memcpy is doing at least 64-bit loads, which changes how your logic has to respond. For example, the response to a 64-bit read may have to be split across two TLPs, with 32 bits of the response in the first 128 bits of the completion and the next 32 bits in the second 128-bit TLP of the completion.
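If you want to keep the accesses at 32 bits until the FPGA logic is fixed, the loop can be wrapped in a helper like the one below. This is a minimal sketch, not a fix for the underlying problem: CopyWords32 is a hypothetical name, and it relies on the common compiler behaviour (GCC included) of emitting one load of the declared width for each aligned volatile access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copies `words` 32-bit words out of a device mapping,
// issuing one 32-bit load per word. Unlike memcpy, the access width is not
// left to the library implementation.
inline void CopyWords32(std::uint32_t* dst, const volatile void* src,
                        std::size_t words)
{
    // The source must be 4-byte aligned for single 32-bit loads.
    assert(reinterpret_cast<std::uintptr_t>(src) % alignof(std::uint32_t) == 0);

    const volatile std::uint32_t* in =
        static_cast<const volatile std::uint32_t*>(src);
    for (std::size_t i = 0; i < words; ++i)
        dst[i] = in[i]; // volatile read: performed once, as written
}
```

In the question's terms this would be called as CopyWords32(destination, m_rawData, length). If the data is still only valid after an extra read, that again points at the completion logic rather than at the copy routine.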
What I found super-useful was to add logic to log all the PCIE transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIE TLP per line. It even has documentation.
When the PCIE interface is not working well enough, I stream the log to a UART in hex which can be decoded by pcieflat.
This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.
Alternatively, if you have integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to PCIE protocol.

ATmega8 doesn't support JMP instruction

I'm writing a bootloader which starts in the middle of memory. After it finishes, I need to go to the main app. I thought I would try jmp 0x00, but my chip doesn't support jmp. How should I start the main app?

I would use RJMP:
Relative jump to an address within PC - 2K +1 and PC + 2K (words). In
the assembler, labels are used instead of relative operands.
For example:
entry:
rjmp reset
.org 512
reset:
rjmp foo
.org 3072
foo:
rjmp entry
By the way, there are several other jump instructions (RJMP, IJMP, RCALL, ICALL, CALL, RET, RETI etc.) See this relevant discussion.

Well, take a look at the RET instruction. It pops a return address off the stack and jumps to it, so you can try:
clr r16 ; PUSH takes a register, not a constant, so zero a register first
push r16 ; r16 is an arbitrary choice
push r16
ret
This should work because a call pushes the return address onto the stack, and RET pops it and jumps there.
As far as I remember, the ATmega8 uses a 16-bit (two-byte) return address, but if I'm wrong you may need one more push.

Why not simply use IJMP?
Set Z to 0x0000 and use IJMP. It may be faster than two pushes and a RET.
EOR R30, R30 ; clear ZL
EOR R31, R31 ; clear ZH
IJMP ; set PC to Z
That should be 4 cycles and 3 instruction words (6 bytes of program memory).

ORG alternative for C++

In assembly we use the org instruction to set the location counter to a specific location in the memory. This is particularly helpful in making Operating Systems. Here's an example boot loader (From wikibooks):
org 7C00h
jmp short Start ;Jump over the data (the 'short' keyword makes the jmp instruction smaller)
Msg: db "Hello World! "
EndMsg:
Start: mov bx, 000Fh ;Page 0, colour attribute 15 (white) for the int 10 calls below
mov cx, 1 ;We will want to write 1 character
xor dx, dx ;Start at top left corner
mov ds, dx ;Ensure ds = 0 (to let us load the message)
cld ;Ensure direction flag is cleared (for LODSB)
Print: mov si, Msg ;Loads the address of the first byte of the message, 7C02h in this case
;PC BIOS Interrupt 10 Subfunction 2 - Set cursor position
;AH = 2
Char: mov ah, 2 ;BH = page, DH = row, DL = column
int 10h
lodsb ;Load a byte of the message into AL.
;Remember that DS is 0 and SI holds the
;offset of one of the bytes of the message.
;PC BIOS Interrupt 10 Subfunction 9 - Write character and colour
;AH = 9
mov ah, 9 ;BH = page, AL = character, BL = attribute, CX = character count
int 10h
inc dl ;Advance cursor
cmp dl, 80 ;Wrap around edge of screen if necessary
jne Skip
xor dl, dl
inc dh
cmp dh, 25 ;Wrap around bottom of screen if necessary
jne Skip
xor dh, dh
Skip: cmp si, EndMsg ;If we're not at end of message,
jne Char ;continue loading characters
jmp Print ;otherwise restart from the beginning of the message
times 0200h - 2 - ($ - $$) db 0 ;Zerofill up to 510 bytes
dw 0AA55h ;Boot Sector signature
;OPTIONAL:
;To zerofill up to the size of a standard 1.44MB, 3.5" floppy disk
;times 1474560 - ($ - $$) db 0
Is it possible to accomplish this task with C++? Is there any command, function, etc. like org with which I can change the location of the program?

No, it's not possible in any C compiler that I know of. You can, however, create your own linker script that places the code/data/bss segments at specific addresses.

Just for clarity, the org directive does not load the code at the specified address; it merely informs the assembler that the code will be loaded at that address. The code shown appears to be for Nasm (or similar). In AT&T syntax, the .org directive does something different: it pads the code to that address, similar to the times line in the Nasm code. Nasm can do this because in -f bin mode, it "acts as its own linker".
The important thing for the code to know is the address where Msg can be found. The jmps and jnes (and call and ret which your example doesn't have, but a compiler may generate) are relative addressing mode. We code jmp target but the bytes that are actually emitted say jmp distance_to_target (plus or minus) so the address doesn't matter.
Gas doesn't do this, it emits a linkable object file. To use ld without a linker script the command line looks something like:
ld -o boot.bin boot.o --oformat binary -Ttext=0x7C00
(don't quote me on that exact syntax but "something like that") If you can get a linkable object file from your (16-bit capable!) C++ compiler, you might be able to do the same.
In the case of a bootsector, the code is loaded by the BIOS (or fake BIOS) at 0x7C00 - one of the few things we can assume about the bootsector. The sane thing for a bootsector to do is not fiddle-faddle around printing a message, but to load something else. You'll need to know how to find the something else on the disk and where you want to load it to (perhaps where your C++ compiler wants to put it by default) - and jmp there. This jmp will want to be a far jmp, which does need to know the address.
I'm guessing it's going to be some butt-ugly C++!

What is the role of DC bit in GDT?

This is my code:
...
data_seg equ os_data-gdt_start
code_seg equ os_code-gdt_start
...
jmp code_seg:pm_start
[BITS 32]
pm_start:
mov ax,data_seg
mov ds,ax
mov word [ds:0xb8000],0xC341
It works correctly when the DC bit (the third bit of the access byte) in the GDT is zero.
I want to know why it does not work when it is 1.
I know that the DC bit is the direction bit for data selectors: when it's 0 the segment grows up, and when it's 1 the segment grows down. But I don't know exactly what "grows up" and "grows down" mean. To me they only mean something when I use the stack (ESP++ and ESP--).

"DC bit" is the name used by osdev.org; Intel's manual calls it the expansion-direction flag. A number can go in only two directions: it can increase or decrease. The DC bit is what selects between them.
If the size of a stack segment needs to be changed dynamically, the stack segment can be an expand-down data segment (expansion-direction flag set). Dynamically changing the segment limit then causes stack space to be added to the bottom of the stack.
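The practical effect of the flag is on which offsets pass the segment-limit check. The sketch below models that check as I understand it from the Intel manual: for an expand-up segment the valid offsets run from 0 to the limit, while for an expand-down segment they run from limit + 1 up to 0xFFFFFFFF (B flag set) or 0xFFFF (B flag clear). OffsetValid is a hypothetical name, limit is the effective byte-granular limit, and only single-byte accesses are modelled.

```cpp
#include <cstdint>

// Hypothetical model of the data-segment limit check for a one-byte access.
// limit      : effective limit in bytes (after applying the granularity flag)
// expandDown : the expansion-direction ("DC") bit
// big        : the B flag, which selects the upper bound of expand-down segments
constexpr bool OffsetValid(std::uint32_t offset, std::uint32_t limit,
                           bool expandDown, bool big)
{
    return !expandDown
        ? offset <= limit                                   // 0 .. limit
        : offset > limit && offset <= (big ? 0xFFFFFFFFu    // limit+1 .. top
                                           : 0xFFFFu);
}

// Assumption: the data descriptor uses the usual flat 4 GiB limit. The
// question does not show the GDT, so this is a guess at the setup.
static_assert(OffsetValid(0xB8000, 0xFFFFFFFFu, false, true),
              "expand-up flat segment: 0xB8000 is inside the limit");
static_assert(!OffsetValid(0xB8000, 0xFFFFFFFFu, true, true),
              "expand-down with the same limit: no offset is valid");
```

Under that assumption, setting the bit on a flat data segment leaves no valid offsets at all, so the write to 0xb8000 would fault. That would explain why the code only works with the bit cleared.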

What is the trick in pAddress & ~(PAGE_SIZE - 1) to get the page's base address

Following function is used to get the page's base address of an address which is inside this page:
void* GetPageAddress(void* pAddress)
{
return (void*)((ULONG_PTR)pAddress & ~(PAGE_SIZE - 1));
}
But I couldn't quite get it, what is the trick it plays here?
Conclusion:
Personally, I think Amardeep's explanation plus Alex B's example are the best answers. As Alex B's answer has already been voted up, I would like to accept Amardeep's answer as the official one to highlight it! Thank you all.

The function clears low bits of a given address, which yields the address of its page.
For example, if PAGE_SIZE is 4096, then in 32-bit binary:
PAGE_SIZE = 00000000000000000001000000000000b
PAGE_SIZE - 1 = 00000000000000000000111111111111b
~(PAGE_SIZE - 1) = 11111111111111111111000000000000b
If you bitwise-and it with a 32-bit address, it will turn the lower bits into zeros, rounding the address down to the nearest 4096-byte page boundary.
~(PAGE_SIZE - 1) = 11111111111111111111000000000000b
pAddress = 11010010100101110110110100100100b
~(PAGE_SIZE - 1) & pAddress = 11010010100101110110000000000000b
So, in decimal, the original address is 3533139236 and the page address (the address with the lower bits stripped) is 3533135872 = 862582 x 4096, which is a multiple of 4096.
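The worked example can be checked with a few lines of code. This is just a sketch: kPageSize and PageBase are names I made up, the page size of 4096 is the one assumed in the example above, and the address is handled as an integer rather than as a pointer.

```cpp
#include <cstdint>

constexpr std::uintptr_t kPageSize = 4096; // assumed, as in the example above

// The mask trick only works when the page size is a power of two.
static_assert((kPageSize & (kPageSize - 1)) == 0,
              "page size must be a power of two");

// Clears the low bits of addr, which rounds it down to its page boundary.
constexpr std::uintptr_t PageBase(std::uintptr_t addr)
{
    return addr & ~(kPageSize - 1);
}

// The numbers from the example: 0xD2976D24 rounds down to 0xD2976000.
static_assert(PageBase(3533139236u) == 3533135872u, "matches the example");
// Masking gives the same result as subtracting the remainder.
static_assert(PageBase(3533139236u) == 3533139236u - 3533139236u % kPageSize,
              "mask and modulo agree");
```

An address that is already page-aligned is returned unchanged, for instance PageBase(4096) is 4096.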

What it does is clear the bits of the address that fit within the mask created by the page size. Effectively it gets the first valid address of a block.
PAGE_SIZE must be a power of 2 and is represented by a single bit set in the address.
The mask is created by subtracting one from PAGE_SIZE. That effectively sets all the bits that are a lower order than the page size bit. The ~ then complements all those bits to zero and sets all the bits that are a higher order than the mask. The & then effectively strips all the lower bits away, leaving the actual base address of the page containing the original address.

When PAGE_SIZE is some power of 2 (say 4096 for example), this clears all the bits below those that specify the page.

This is just a tricky way of clearing the low order bits.
Another way to implement it might be
void* GetPageAddress(void* pAddress)
{
return (void*)((ULONG_PTR)pAddress - (ULONG_PTR)pAddress % PAGE_SIZE);
}
The casts to an integer type are needed because % is not defined for pointers, but the algorithm is the same.
Effectively it is getting the largest multiple of PAGE_SIZE that is less than or equal to pAddress.
