How to hook C++ functions with asm

I want to hook a C++ function, but I don't want to use the trampoline mechanism of MS Detours; instead I want to fully patch it. I can get a handle to the DLL where the function is located, and I have the right offset (image base stuff...). So how do I hook it? Also, I don't know the data types of the arguments (var_4 and arg_0), or aren't they needed? In general I want to replace the following function with my own (my function is nearly the same; only one line is changed):
sub_39001A40 proc near
var_4 = dword ptr -4
arg_0 = dword ptr 4
push ecx
cmp dword_392ADAB4, 0
jnz short loc_39001A4F
call loc_39024840
loc_39001A4F:
push esi
mov esi, [esp+8+arg_0]
lea eax, [esp+8+var_4]
push eax
push esi
call dword_392ADA98
mov ecx, [esp+10h+var_4]
add esp, 8
add dword_392ADA80, ecx
adc dword_392ADA84, 0
add dword_392ADA90, esi
pop esi
adc dword_392ADA94, 0
add dword_392ADA7C, 1
pop ecx
retn
sub_39001A40 endp
The drawback is that with MS Detours I can only hook functions whose names I know. I cannot hook those asm functions with Detours, because I need the data types of the arguments passed in order to create the function structures!
EDIT:
"What's wrong with detours, exactly?"
I wrote: "I don't want to use the trampoline mechanism of ms detours, instead of it I want to fully patch it." and "It's bad, that I only can hook functions, which names I know with ms detours. I cannot hook those asm functions with detours, cause I need the data types of the arguments passed for creating the function structures!" and I don't have the source code of the C++ files. I only have the hex-dump.
"Trampoline is an actual technical term :) I'm just wondering why #lua can't use it."
I write: Read my sentences again; if you still don't understand why, then my English is too bad.
"Overriding just the named function should work, of course you may need to re-implement the whole DLL (depending on if it is of any further use to you). Given your grasp of assembler you might get away with using a hex editor to edit (a copy of) the original DLL you are seeking to subvert."
I want to hook the function because I don't want to edit the file. I can't override the function, because I don't know the data types of the arguments or the function's name.
#asveikau: Thanks for your real help, but I don't want to use a trampoline mechanism; I want to overwrite the function.

A good trick is to replace the first few instructions with this:
push dword xxxx ; where xxx = new code location
ret
This is sort of like an obfuscated jmp. I write it this way because, in the assembled version, it is very easy to replace the push operand with your pointer at runtime. It assembles to:
68 XX XX XX XX c3
Where "XX XX XX XX" is your address in little-endian.
Then you can make a "call the old version of the function" code location, where the first few instructions are the ones you replaced with the sequence above, followed by a jump to the next valid instruction in the original code.
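As a minimal sketch of applying that patch at runtime on 32-bit Windows (patchFunction is an invented name, error handling is omitted, and you would save the overwritten bytes first if you want the "call the old version" location described above):

#include <windows.h>
#include <cstdint>
#include <cstring>

void patchFunction(void* target, void* hook)
{
    // 68 XX XX XX XX C3  =  push imm32 ; ret
    unsigned char stub[6] = { 0x68, 0x00, 0x00, 0x00, 0x00, 0xC3 };
    std::uint32_t addr = reinterpret_cast<std::uint32_t>(hook); // 32-bit code only
    std::memcpy(stub + 1, &addr, sizeof(addr)); // little-endian push operand

    DWORD oldProtect;
    VirtualProtect(target, sizeof(stub), PAGE_EXECUTE_READWRITE, &oldProtect);
    std::memcpy(target, stub, sizeof(stub));
    VirtualProtect(target, sizeof(stub), oldProtect, &oldProtect);
    FlushInstructionCache(GetCurrentProcess(), target, sizeof(stub));
}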

Overriding just the named function should work, of course you may need to re-implement the whole DLL (depending on if it is of any further use to you). Given your grasp of assembler you might get away with using a hex editor to edit (a copy of) the original DLL you are seeking to subvert.

Related

Mixing C++ and assembly: can't pass multiple parameters from a C++ function to assembly

I've been frustrated trying to pass parameters from a C++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and MASM to compile my assembly code.
This is a simplified version of my C++ file, where I call the assembly procedure set_clock:
#include <cstdlib>   // for system()

struct TimeInfo { char data[16]; };  // assumed layout; the real definition isn't shown
extern "C" void set_clock(char* clock, TimeInfo* localTime);  // the MASM procedure

int main()
{
    TimeInfo localTime;
    char clock[4] = { 0,0,0,0 };
    set_clock(clock, &localTime);
    system("pause");
    return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
    array:qword, address:qword
    mov rdx, array    ; works fine; memory address: 0x1052440000616
    mov rdi, address  ; value of rdi is 14757395258967641292
    mov al, [rdx]
    mov [rdi], al     ; ERROR: can't access that memory location
    ret
set_clock ENDP
END
MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. It's always a good idea to check the real machine code by disassembling, or by using your debugger's disassembly instead of source mode, if anything weird is happening with assembler macros.
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page, which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
    array:qword, address:qword
    movzx eax, byte ptr [array]
    mov [address], al
    ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.

Extract a static value which may change in the future

I got the following code in ASM:
MOV EAX,DWORD PTR DS:[EBX+F8]
EBX contains an address, and F8 is the offset added to that address. If I get this right, EAX contains the value read from address+offset after the instruction executes.
What I want to do now is write some pattern in C++ (using inline asm) that allows me to fetch/retrieve F8 without changing the code in case F8 changes.
Is there any pattern-search method (like regex) which I can use here? Is the offset possibly saved in any register? Or is this quite impossible to do?
Hopefully the information provided is enough; I can add some more lines of code if you wish.
You can add the offset to the value in the EBX register and then read the value from the resulting address.
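As a rough sketch of what that could look like in C++ without inline asm (kFieldOffset and readField are invented names; the offset lives in one named constant so a future change touches a single line):

#include <cstdint>

// Hypothetical names; 0xF8 is the offset from the question.
constexpr std::uintptr_t kFieldOffset = 0xF8;

// C++ equivalent of MOV EAX, DWORD PTR DS:[EBX+F8]:
// "base" plays the role of EBX, the return value the role of EAX.
std::uint32_t readField(std::uintptr_t base)
{
    return *reinterpret_cast<const std::uint32_t*>(base + kFieldOffset);
}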

Just-in-Time compilation of Java bytecode

We are currently working on the JIT compilation part of our own Java Virtual Machine implementation. Our idea is to do a simple translation of the given Java bytecode into opcodes, write them to executable memory, and CALL right to the start of the method.
Assuming the given Java code would be:
int a = 13372338;
int b = 32 * a;
return b;
Now, the following approach was devised (assuming that the given memory starts at 0x1000 and the return value is expected in eax):
0x1000: first local variable - accessible via [eip - 8]
0x1004: second local variable - accessible via [eip - 4]
0x1008: start of the code - accessible via [eip]
Java bytecode | Assembler code (NASM syntax)
--------------|------------------------------------------------------------------
| // start
| mov edx, eip
| push ebx
|
| // method content
ldc | mov eax, 13372338
| push eax
istore_0 | pop eax
| mov [edx - 8], eax
bipush | push 32
iload_0 | mov eax, [edx - 8]
| push eax
imul | pop ebx
| pop eax
| mul ebx
| push eax
istore_1 | pop eax
| mov [edx - 4], eax
iload_1 | mov eax, [edx - 4]
| push eax
ireturn | pop eax
|
| // end
| pop ebx
| ret
This would simply use the stack just like the virtual machine does itself.
The questions regarding this solution are:
Is this method of compilation viable?
Is it even possible to implement all the Java instructions this way? How could things like athrow/instanceof and similar commands be translated?
This method of compilation works, is easy to get up and running, and at least removes interpretation overhead. But it results in pretty large amounts of code and pretty awful performance. One big problem is that it transliterates the stack operations 1:1, even though the target machine (x86) is a register machine. As you can see in the snippet you posted (as well as in any other code), this always results in several stack manipulation opcodes for every single operation, so it uses the registers - heck, the whole ISA - about as ineffectively as possible.
You can also support complicated control flow such as exceptions. It's not very different from implementing it in an interpreter. If you want good performance, you don't want to perform work every time you enter or exit a try block. There are schemes to avoid this, used by both C++ and other JVMs (keyword: zero-cost or table-driven exception handling). These are quite complex to implement, understand, and debug, so you should go with a simpler alternative first. Just keep them in mind.
As for the generated code: the first optimization, one you will almost certainly need, is converting the stack operations into three-address code or some other representation that uses registers. There are several papers on this, and implementations too, so I won't elaborate unless you want me to. Then, of course, you need to map these virtual registers onto physical registers. Register allocation is one of the most well-researched topics in compiler construction, and there are at least half a dozen heuristics that are reasonably effective and fast enough to use in a JIT compiler. One example off the top of my head is linear scan register allocation (created specifically for JIT compilation).
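To make the stack-to-register conversion concrete, here is a toy sketch (not from the answer; emit_ldc and emit_imul are invented names) that turns stack bytecode into three-address code by simulating the operand stack with virtual register names at compile time:

#include <cstdio>
#include <stack>

static int next_reg = 0;
static std::stack<int> vstack; // which virtual register holds each operand-stack slot

// ldc: push a constant -> define a fresh virtual register
void emit_ldc(int value)
{
    int r = next_reg++;
    std::printf("v%d = %d\n", r, value);
    vstack.push(r);
}

// imul: pop two operands, push the product
void emit_imul()
{
    int b = vstack.top(); vstack.pop();
    int a = vstack.top(); vstack.pop();
    int r = next_reg++;
    std::printf("v%d = v%d * v%d\n", r, a, b);
    vstack.push(r);
}

// emit_ldc(13372338); emit_ldc(32); emit_imul();  prints "v2 = v0 * v1"

The virtual registers v0, v1, ... are then mapped onto physical registers by the register allocator.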
Beyond that, most JIT compilers focused on performance of the generated code (as opposed to quick compilation) use one or more intermediate formats and optimize the programs in this form. This is basically your run-of-the-mill compiler optimization suite, including veterans like constant propagation, value numbering, reassociation, loop-invariant code motion, etc. These things are not only simple to understand and implement, they've also been described in thirty years of literature, up to and including textbooks and Wikipedia.
The code you'll get with the above will be pretty good for straight-line code using primitives, arrays, and object fields. However, you won't be able to optimize method calls at all. Every method is virtual, which means inlining or even moving method calls (for example, out of a loop) is basically impossible except in very special cases. You mentioned that this is for a kernel. If you can accept using a subset of Java without dynamic class loading, you can do better (but it'll be nonstandard) by assuming the JIT knows all classes. Then you can, for example, detect leaf classes (or more generally, methods which are never overridden) and inline those.
If you do need dynamic class loading, but expect it to be rare, you can also do better, though it takes more work. The advantage is that this approach generalizes to other things, like eliminating logging statements completely. The basic idea is specializing the code based on some assumptions (for example, that this static does not change or that no new classes are loaded), then de-optimizing if those assumptions are violated. This means you'll sometimes have to re-compile code while it is running (this is hard, but not impossible).
If you go further down this road, its logical conclusion is trace-based JIT compilation, which has been applied to Java, but AFAIK it didn't turn out to be superior to method-based JIT compilers. It's more effective when you have to make dozens or hundreds of assumptions to get good code, as happens with highly dynamic languages.
Some comments about your JIT compiler (I hope I don't repeat things "delnan" already wrote):
Generic comments
I'm sure "real" JIT compilers work similar to your one. However you could do some optimization (example: "mov eax,nnn" and "push eax" could be replaced by "push nnn").
You should store local variables on the stack; typically "ebp" is used as local pointer:
push ebx
push ebp
sub esp, 8   ; 2 variables with 4 bytes each
mov ebp, esp
; Now local variables are addressed using [ebp+0] and [ebp+4]
...
add esp, 8   ; release the local variables
pop ebp
pop ebx
ret
This is necessary because functions may be recursive. Storing a variable at a fixed location (relative to EIP) would cause the variables to behave like "static" ones. (I'm assuming you do not compile a function multiple times in the case of a recursive function.)
Try/Catch
To implement try/catch, your JIT compiler has to look not only at the Java bytecode but also at the try/catch information that is stored in a separate attribute of the Java class file. Try/catch can be implemented in the following way:
; push all useful registers (the ones that must not be destroyed)
push eax
push ebp
...
; push the old "catch" pointers
push dword ptr catch_pointer
push dword ptr catch_stack
; set the "catch" pointers
mov catch_stack, esp
mov dword ptr catch_pointer, my_catch
... ; some code
; here some "throw" instruction...
push exception
jmp dword ptr catch_pointer
... ; some code
; end of the "try" section: pop all registers
pop dword ptr catch_stack
...
pop eax
...
; the "catch" block
my_catch:
pop ecx                ; pop the Exception from the stack
mov esp, catch_stack   ; restore the stack
; now restore all registers (same as at the end of the "try" section)
pop dword ptr catch_stack
...
pop eax
push ecx               ; push the Exception back onto the stack
In a multi-threaded environment, each thread requires its own catch_stack and catch_pointer variables!
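In C++ terms, those per-thread variables could simply be declared thread_local; a sketch with assumed types, since the answer's catch_stack and catch_pointer are pseudo-assembly symbols:

#include <cstdint>

// One exception-handling context per thread.
thread_local void*          catch_pointer = nullptr; // address of the current catch block
thread_local std::uintptr_t catch_stack   = 0;       // stack pointer saved at try entry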
Specific exception types can be handled by using "instanceof" in the following way:
try {
    // some code
} catch(MyException1 ex) {
    // code 1
} catch(MyException2 ex) {
    // code 2
}
... is actually compiled like this:
try {
    // some code
} catch(Throwable ex) {
    if(ex instanceof MyException1) {
        // code 1
    }
    else if(ex instanceof MyException2) {
        // code 2
    }
    else throw(ex); // not handled!
}
Objects
A JIT compiler for a simplified Java virtual machine without objects (and arrays) would be quite easy, but the objects in Java make the virtual machine very complex.
Objects are simply stored as pointers to the object, on the stack or in the local variables. Typically, JIT compilers are implemented like this: for each class, a piece of memory exists that contains information about the class (e.g. which methods exist, at which address the assembler code of each method is located, etc.). An object is a piece of memory that contains all instance variables and a pointer to the memory containing information about the class.
"Instanceof" and "checkcast" could be implemented by looking at the pointer to the memory containing information about the class. This information may contain a list of all parent classes and implemented interfaces.
The main problem with objects, however, is memory management in Java: unlike C++, there is a "new" but no "delete". You have to track how often an object is used; if an object is no longer used, it must be removed from memory and its finalizer must be called.
The problems here are local variables (the same local variable may contain an object or a number) and try/catch blocks (the "catch" block must take care of the local variables and the stack (!) containing objects before restoring the stack pointer).

Assembler push rdi, pop rdi around function call

What's the purpose of push rdi and pop rdi when calling a function in C++?
VS2010, x64, debug, no optimizations
C++
int calc()
{
    return 8 + 7;
}
Disassembly:
int calc()
{
000000013F0B1020 push rdi
return 8 + 7;
000000013F0B1022 mov eax,0Fh
}
000000013F0B1027 pop rdi
000000013F0B1028 ret
There is no purpose to it. This is a common artifact of unoptimized code. The code generator emits the push rdi instruction in anticipation of having to perform an addition. The RDI register must be preserved across function calls. But then, later, it figures out that the addition can be performed at compile time.
Getting rid of extraneous code like this requires "peephole optimization". But that optimization isn't enabled in the Debug build. To see what the real code looks like, you have to turn on the optimizer, best done by building the Release build. It will in fact completely eliminate the function; you can prevent it from doing so with:
__declspec(noinline) int calc()
{
    return 8 + 7;
}
Which produces in the Release build:
return 8 + 7;
000007F7038E1000 mov eax,0Fh
000007F7038E1005 ret
Have you heard of "caller-save" and "callee-save" registers?
Since your CPU only has a small, finite number of registers, it's usually impossible for caller/called functions to always use different registers. If a caller function and called function both want to use the same register, it means the value in the caller will have to be saved/restored before/after the call.
Saving/restoring register values can be done either by the caller or by the callee -- which one does so is a matter of convention. The benefit of "caller-save" registers is that if the caller knows it won't need the value in register XYZ after the call, it can omit the save/restore operations. The benefit of "callee-save" registers is that if the callee knows it won't modify the value in register XYZ, it can omit the save/restore operations.
I'm guessing that your compiler treats RDI as a callee-save register, but doesn't omit the unnecessary save/restore operations unless you have compiler optimizations turned on. (If someone knows this is incorrect, please post another answer!)
UPDATE: I found an article on x86 calling conventions: http://en.wikipedia.org/wiki/X86_calling_conventions
It seems to confirm that with most calling conventions, RDI would be callee-save. This doesn't explain why it isn't pushing and popping all the other callee-save registers. Maybe there is something else going on here.

Trouble with C and C++ compiler

I'm having a problem trying to convert a 32-bit product into a 64-bit product. I'm using Visual Studio 2008, and the code is in C and C++. I would like someone to look at the following two lines of code, one from a C source file and the other from a C++ source file. Both of these files are included in a DLL. I also include the disassembly of both lines of code.
ewxlcom.c
memcpy(pCM->pSecAccInfo->spUserID, userSecurityInfo.spUserID,
       sizeof(UserID));
000000000EF33BB9 mov r8d,80h
000000000EF33BBF mov rdx,qword ptr [rsp+828h]
000000000EF33BC7 mov rcx,qword ptr [rsp+1F8h]
000000000EF33BCF mov rcx,qword ptr [rcx+0BDEh]
000000000EF33BD6 call memcpy (0EF40352h)
tcputil.cpp
memcpy(serv_temp+INIT_MSG_USERID_OFFSET, pCM->pSecAccInfo->spUserID, INIT_MSG_USERID_LEN);
000000000EF3B8E6 lea rcx,[rsp+67h]
000000000EF3B8EB mov r8d,80h
000000000EF3B8F1 mov rdx,qword ptr [rsp+3B0h]
000000000EF3B8F9 mov rdx,qword ptr [rdx+0CBEh]
000000000EF3B900 call memcpy (0EF40352h)
As you can see, the first line copies some bytes into the memory pointed to by pCM->pSecAccInfo->spUserID. And the second line copies those same bytes into another place in memory. The ASM memcpy copies bytes from memory pointed to by register rdx to memory pointed to by register rcx. So in the first line a value is moved into register rcx. This I have verified to point to pCM. Then the value pointed to by rcx + 0BDEh is copied into rcx. And the memcpy is called. This works.
But later on in the second line a value is loaded into register rdx. This I have verified to also point to the same pCM as in the first line. It then loads the pointer residing in memory that is offset from pCM (rdx) by 0CBEh. That memory is all zeros, so memcpy crashes.
The question is: why would the compiler produce different code for the same source variable? I think it's an alignment problem. Is it the difference between a C file and a C++ file? Does VS use the same compiler for both C and C++? Are there any other things I should be looking at?
Any help would be appreciated.
If you're linking C and C++ code, you might need to be careful about different padding characteristics in your structs. Perhaps create a temporary function that prints the offset of each member of the struct, and copy that same code from a C source file (where you wrote it) to a C++ source file; a sketch of such a function appears after the code below. The two copies of the function can keep the same name, since the C++ one will be name-mangled, but I'd add a printf() at the top of each to say which version it is. Then call each one from somewhere before the crash so you can compare the offsets. If they're different, you'll need to look into compiler flags to fix that problem. Or perhaps you need to add lines like this...
#ifdef __cplusplus
extern "C" {
#endif

/* ...your struct definitions & variables go here... */

#ifdef __cplusplus
}
#endif
...around your struct definitions to get the C++ side to have the same padding behavior as the C side of your project.
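For the offset-printing diagnostic suggested above, a minimal sketch could look like this; SecAccInfo here is a stand-in for the project's real struct, and the same file is meant to be compiled once as C and once as C++:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for the real project struct; substitute your own definitions. */
struct SecAccInfo {
    char header;
    char spUserID[128];
};

void print_offsets(void)
{
#ifdef __cplusplus
    printf("C++ side:\n");
#else
    printf("C side:\n");
#endif
    printf("  spUserID offset: %u\n", (unsigned)offsetof(struct SecAccInfo, spUserID));
}

If the two builds print different offsets, a packing or alignment flag differs between the C and C++ compilations.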