Trouble with C and C++ compiler - c++

I'm having a problem trying to convert a 32 bit product into a 64 bit product. I'm using Visual Studio 2008 and the code is in C and C++. I would like anyone to look at the following two lines of code, one from a C source file and the other from a C++ source file. Both of these files are included in a DLL. I also include the disassembly of both lines of code.
ewxlcom.c
memcpy(pCM->pSecAccInfo->spUserID,userSecurityInfo.spUserID,
sizeof(UserID));
000000000EF33BB9 mov r8d,80h
000000000EF33BBF mov rdx,qword ptr [rsp+828h]
000000000EF33BC7 mov rcx,qword ptr [rsp+1F8h]
000000000EF33BCF mov rcx,qword ptr [rcx+0BDEh]
000000000EF33BD6 call memcpy (0EF40352h)
tcputil.cpp
memcpy(serv_temp+INIT_MSG_USERID_OFFSET, pCM->pSecAccInfo->spUserID, INIT_MSG_USERID_LEN);
000000000EF3B8E6 lea rcx,[rsp+67h]
000000000EF3B8EB mov r8d,80h
000000000EF3B8F1 mov rdx,qword ptr [rsp+3B0h]
000000000EF3B8F9 mov rdx,qword ptr [rdx+0CBEh]
000000000EF3B900 call memcpy (0EF40352h)
As you can see, the first line copies some bytes into the memory pointed to by pCM->pSecAccInfo->spUserID. And the second line copies those same bytes into another place in memory. The ASM memcpy copies bytes from memory pointed to by register rdx to memory pointed to by register rcx. So in the first line a value is moved into register rcx. This I have verified to point to pCM. Then the value pointed to by rcx + 0BDEh is copied into rcx. And the memcpy is called. This works.
But later on in the second line a value is loaded into register rdx. This I have verified to also point to the same pCM as in the first line. It then loads the pointer residing in memory that is offset from pCM (rdx) by 0CBEh. That memory is all zeros, so memcpy crashes.
The question is: why would the compiler produce different code for the same source variable? I think it's an alignment problem. Is it the difference between a C file and a C++ file? Does VS use the same compiler for both C and C++? Are there any other things I should be looking at?
Any help would be appreciated.

If you're linking C & C++ code, you might need to be careful about different padding characteristics in your structs. Perhaps create a temporary function to print the offsets of each member of the struct, and copy that same code from a C source file (where you wrote it) to a C++ source file. The two copies of the functions can remain the same, since the C++ one will be mangled, but I'd add a printf() at the top of each to say which version it is. Then call each one from somewhere before the crash so you can compare the offsets. If they're different, you'll need to look into compiler flags to fix that problem. OR... perhaps you need to add lines like this...
#ifdef __cplusplus
extern "C" {
#endif
.
. ...your struct definitions & variables go here...
.
#ifdef __cplusplus
}
#endif
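...around your struct definitions so that the C++ side treats them as C declarations. Note that extern "C" only changes linkage (name mangling), not struct layout, so if the offsets still differ, check that both files are compiled with the same packing settings (/Zp, #pragma pack) and the same preprocessor definitions.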
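The offset-printing check suggested above could look roughly like the sketch below. The struct and field names are hypothetical stand-ins (the real definitions are not shown in the question), and the byte-packed variant is included only to illustrate how a packing difference would show up in the printed offsets.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for the real structure; the field names are assumptions.
struct DefaultLayout {
    char  tag;
    void* spUserID;
};

// The same members compiled under 1-byte packing, as a mismatched
// #pragma pack or /Zp setting in one translation unit would produce.
#pragma pack(push, 1)
struct BytePacked {
    char  tag;
    void* spUserID;
};
#pragma pack(pop)

// Compile an identical copy of this function into the C file and the C++ file,
// call both before the crash, and compare the printed numbers.
void print_offsets(const char* who) {
    std::printf("%s: default=%lu packed=%lu\n", who,
                static_cast<unsigned long>(offsetof(DefaultLayout, spUserID)),
                static_cast<unsigned long>(offsetof(BytePacked, spUserID)));
}
```

Calling print_offsets("C++ side") from the .cpp file and an equivalent function from the .c file shows immediately whether the two compilers agree on the layout.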

Related

Assembly: Why is there empty memory on the stack?

I used an online compiler to write some simple C++ code:
int main()
{
int a = 4;
int&& b = 2;
}
and the assembly for the main function, compiled by gcc 11.20, is shown below:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 4
mov eax, 2
mov DWORD PTR [rbp-20], eax
lea rax, [rbp-20]
mov QWORD PTR [rbp-16], rax
mov eax, 0
pop rbp
ret
I notice that when initializing 'a', the instruction simply moves an immediate operand directly to memory, while for the r-value reference 'b', it first stores the immediate value in register eax and then moves it to memory. There is also unused memory between [rbp-8] and [rbp-4]. My guess is that an immediate value has to live somewhere, or that the CPU initializes it in some other way, but I want to know more about the underlying logic.
So my questions are:
Why does the initialization differ?
Why is there an unused 4-byte gap on the stack?
Let me address the second question first.
Note that there are actually three objects defined in this function: the int variable a, the reference b (implemented as a pointer), and the unnamed temporary int with a value of 2 that b points to. In unoptimized compilation, each of these objects needs to be stored at some unique location on the stack, and the compiler allocates stack space naively, processing the variables one by one and assigning each one space below the previous. It evidently chooses to handle them in the following order:
The variable a, an int needing 4 bytes. It goes in the first available stack slot, at [rbp-4].
The reference b, stored as a pointer needing 8 bytes. You might think it would go at [rbp-12], but the x86-64 ABI requires that pointers be naturally aligned on 8-byte boundaries. So the compiler moves down another 4 bytes to achieve this alignment, putting b at [rbp-16]. The 4 bytes at [rbp-8] are unused so far.
The temporary int, also needing 4 bytes. The compiler puts it right below the previously placed variable, at [rbp-20]. True, there was space at [rbp-8] that could have been used instead, which would be more efficient; but since you told the compiler not to optimize, it doesn't perform this optimization. It would if you used one of the -O flags.
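The placement described in the three steps above can be reproduced with a toy model. This is only a sketch of the naive "place each object below the previous one, then align" strategy as the answer describes it, not GCC's actual allocator; the function name place_below is made up for illustration.

```cpp
// Toy model of naive downward stack allocation.
// `cursor` is the offset (relative to rbp, zero or negative) of the lowest byte
// used so far. The new object gets `size` bytes below it, and the result is
// rounded down to a multiple of `align`.
long place_below(long cursor, long size, long align) {
    long offset = cursor - size;
    // C++ % truncates toward zero, so normalise the remainder to [0, align).
    long rem = ((offset % align) + align) % align;
    return offset - rem;
}
```

Feeding it the three objects in the order given above yields place_below(0, 4, 4) for a, place_below(-4, 8, 8) for b, and place_below(-16, 4, 4) for the temporary, which correspond to [rbp-4], [rbp-16] and [rbp-20] in the listing.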
As to why a is initialized with an immediate store to memory, whereas the temporary is initialized via a register: to really answer this, you'd have to read the details of the GCC source code, and frankly I don't think you'll find that there is anything very interesting behind it. Presumably there are different code paths in the compiler for creating and initializing named variables versus temporaries, and the code for temporaries may happen to be written as two steps.
It may be that for convenience, the programmer chose to create an extra object in the intermediate representation (GIMPLE or RTL), perhaps because it simplifies the compiler code in handling more general cases. They wouldn't take any trouble to avoid this, because they know that later optimization passes will clean it up. But if you have optimization turned off, this doesn't happen and you get actual instructions emitted for this unnecessary transfer.
In
int a = 4;
you declare a (typically) 4-byte variable and ask the compiler to fill it with the bit representation of 4.
In
int&& b = 2;
you declare a reference ("r-value reference") to, well, to what? To a literal? Is it possible? In C++ references are typically translated, on the assembly level, into pointers. So one can expect that b will be "a pointer in disguise", that is, without the * and -> semantics. But it will likely occupy 64 bits on a 64-bit machine. Now, pointers must point to some memory stored in RAM, not in registers, cache(s) etc. So the compiler most likely creates a temporary (unnamed) integer, initializes it with 2, and then binds its address to b. I write "most likely" because I doubt the standard standardizes this in such great detail. What we know for sure is that there is an extra unnamed variable involved in the initialization of b in int&& b = 2;.
As for the assembler, I have too little knowledge of it to dare explain anything to you. I guess, however, that the concept of a temporary variable and a pointer behind the && reference solves all your problems here.
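That the reference is bound to a real, addressable temporary object can be checked from C++ itself. The function below is a small sketch (the name bump_temporary is made up): it takes the address of the object behind b and modifies it through a pointer.

```cpp
// Shows that `int&& b = 2;` binds b to an unnamed temporary int that lives
// in memory for as long as b does.
int bump_temporary() {
    int a = 4;
    int&& b = 2;   // a temporary int holding 2 is created; b refers to it
    int* p = &b;   // the temporary has an address, so it must be stored somewhere
    *p += a;       // modify the temporary through the pointer: 2 + 4
    return b;      // reads the same temporary
}
```

In an unoptimized build, the pointer p here plays the same role as the hidden pointer the compiler stores at [rbp-16] in the listing from the question.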

Mixing C++ and assembly: can't pass multiple parameters from a C++ function to assembly

I've been frustrated by passing parameters from a c++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and masm to compile my assembly code.
This is a simplified version of my c++ file where I call the assembly procedure set_clock
int main()
{
TimeInfo localTime;
char clock[4] = { 0,0,0,0 };
set_clock(clock,&localTime);
system("pause");
return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
array:qword,address:qword
mov rdx,array ; works fine memory address: 0x1052440000616
mov rdi,address ; value of rdi is 14757395258967641292
mov al, [rdx]
mov [rdi],al ; ERROR: cant access that memory location
ret
set_clock ENDP
END
MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. It's always a good idea to check the real machine code by disassembling, or by using your debugger's disassembly view instead of source mode, if anything weird is happening with assembler macros.
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
array:qword,address:qword
movzx eax, byte ptr [array]
mov [address], al
ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.
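For comparison, here is what the fixed routine is meant to do, written as a C++ stand-in. The name set_clock_reference is hypothetical, and the second parameter is typed as void* because the definition of TimeInfo is not shown in the question. Under the Windows x64 convention described above, the first argument would arrive in rcx and the second in rdx.

```cpp
// Hypothetical C++ equivalent of the two-instruction asm body:
//   movzx eax, byte ptr [rcx]   ; load the first byte of `array`
//   mov   [rdx], al             ; store it at `address`
// extern "C" keeps the name unmangled, as a declaration of the real
// MASM procedure on the C++ side would also need.
extern "C" void set_clock_reference(const char* array, void* address) {
    *static_cast<char*>(address) = array[0];
}
```

Having such a reference version makes it easy to compare results in the debugger while the asm version is still being fixed.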

uint64 array to uint128 for SSE2

I have two similar issues when handling arrays when defined in the asm and when passed from c++ to asm. The code works fine inline but I need to separate them from the cpp into an asm file. The compiler may not throw an error or warning but the end result is random each run and should be constant like it was when inline.
The below code works when used in MMX (movq mm6,twosMask_W) but I need the equivalent for SSE2. I thought that this would work but I appear to be incorrect.
.data
align 16
twosMask_W qword 2 dup(0002000200020002h)
.code
...
movdqa xmm6,oword ptr twosMask_W
...
The second issue is when I pass my thresh128 array from C++ to asm (again for SSE2):
//C++
uint64_t thresh128[2];
thresh128[0] = ((thresh-1)<<8)+(thresh-1);
thresh128[0] += (thresh128[0]<<48)+(thresh128[0]<<32)+(thresh128[0]<<16);
thresh128[1] = thresh128[0];
sendToASM(thresh128)
//ASM
;There are more parameters that utilize the registers but not listed.
receivedFromCPP proc thresh:qword
public receivedFromCPP
...
movdqu xmm4,oword ptr thresh
...
I've tried having thresh as an oword parameter in the procedure but it yielded no results. I'm sure I've got some syntax or parameter type wrong. Any help would be greatly appreciated.
Note: Compiled using MASM in VS2013 for x86.
Well, I tested the first part and it seems to work - so I cannot say anything related to this particular issue.
Concerning the second problem: you seem to pass a 64-bit qword on the stack in 32-bit mode (where there is no direct opcode for 64-bit PUSHes), so it would be two PUSHes...
receivedFromCPP proc thresh:qword
but are expecting a pointer to a 128 bit value on the stack:
movdqu xmm4,oword ptr thresh
Also keep in mind the little-endianness of x86: depending on how the compiler chooses to PUSH the 2*64-bit array, the layout may differ from a little-endian 128-bit value, resulting in seemingly random values.
EDIT: Because the stack grows downward, a 128-bit value has to be PUSHed in reverse order (high part first) for referencing it by EBP.
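One point that can be checked on the C++ side: in the call sendToASM(thresh128), the array decays to a pointer, so what is passed is the address of the 16 bytes (a 4-byte value in a 32-bit build), not the data itself. The sketch below uses made-up names (make_thresh_word, receive_thresh), with memcpy standing in for a 16-byte load through that pointer.

```cpp
#include <cstdint>
#include <cstring>

// Same arithmetic as in the question: replicate (thresh-1) into every byte.
std::uint64_t make_thresh_word(std::uint64_t thresh) {
    std::uint64_t w = ((thresh - 1) << 8) + (thresh - 1);
    w += (w << 48) + (w << 32) + (w << 16);
    return w;
}

// Stand-in for the asm routine: it receives only a pointer and must
// dereference it to reach the 16 bytes (as a movdqu from [reg] would).
void receive_thresh(const std::uint64_t* thresh, unsigned char out[16]) {
    std::memcpy(out, thresh, 16);
}
```

If this reading is right, the asm procedure would declare the parameter as a pointer-sized value, load it into a register, and then use that register as the memory operand of movdqu, rather than applying oword ptr to the parameter slot itself.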

Just-in-Time compilation of Java bytecode

We are currently working on the JIT compilation part of our own Java Virtual Machine implementation. Our idea was now to do a simple translation of the given Java bytecode into opcodes, writing them to executable memory and CALLing right to the start of the method.
Assuming the given Java code would be:
int a = 13372338;
int b = 32 * a;
return b;
Now, the following approach was made (assuming that the given memory starts at 0x1000 & the return value is expected in eax):
0x1000: first local variable - accessible via [eip - 8]
0x1004: second local variable - accessible via [eip - 4]
0x1008: start of the code - accessible via [eip]
Java bytecode | Assembler code (NASM syntax)
--------------|------------------------------------------------------------------
| // start
| mov edx, eip
| push ebx
|
| // method content
ldc | mov eax, 13372338
| push eax
istore_0 | pop eax
| mov [edx - 8], eax
bipush | push 32
iload_0 | mov eax, [edx - 8]
| push eax
imul | pop ebx
| pop eax
| mul ebx
| push eax
istore_1 | pop eax
| mov [edx - 4], eax
iload_1 | mov eax, [edx - 4]
| push eax
ireturn | pop eax
|
| // end
| pop ebx
| ret
This would simply use the stack just like the virtual machine does itself.
The questions regarding this solution are:
Is this method of compilation viable?
Is it even possible to implement all the Java instructions this way? How could things like athrow/instanceof and similar commands be translated?
This method of compilation works, is easy to get up and running, and it at least removes interpretation overhead. But it results in pretty large amounts of code and pretty awful performance. One big problem is that it transliterates the stack operations 1:1, even though the target machine (x86) is a register machine. As you can see in the snippet you posted (as well as any other code), this always results in several stack manipulation opcodes for every single operation, so it uses the registers - heck, the whole ISA - about as ineffectively as possible.
You can also support complicated control flow such as exceptions. It's not very different from implementing it in an interpreter. If you want good performance you don't want to perform work every time you enter or exit a try block. There are schemes to avoid this, used by both C++ and other JVMs (keyword: zero-cost or table-driven exception handling). These are quite complex and complicated to implement, understand and debug, so you should go with a simpler alternative first. Just keep it in mind.
As for the generated code: The first optimization, one which you'll almost definitely need, is converting the stack operations into three-address code or some other representation that uses registers. There are several papers on this and implementations of this, so I won't elaborate unless you want me to. Then, of course, you need to map these virtual registers onto physical registers. Register allocation is one of the most well-researched topics in compiler construction, and there are at least half a dozen heuristics that are reasonably effective and fast enough to use in a JIT compiler. One example off the top of my head is linear scan register allocation (specifically created for JIT compilation).
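A minimal sketch of the stack-to-three-address conversion mentioned above, for a tiny made-up subset of bytecode (the Op and Insn types are illustrative, not part of any real JVM). The operand stack is simulated at compile time and holds names instead of values. As a simplification, a load pushes the local's name directly, which is only safe if that local is not stored to before the pushed value is consumed.

```cpp
#include <string>
#include <vector>

enum class Op { Const, Load, Store, Mul };
struct Insn { Op op; int arg; };  // arg: constant value or local slot (unused for Mul)

// Translate stack bytecode into three-address statements such as "t0 = 32 * v0".
std::vector<std::string> to_three_address(const std::vector<Insn>& code) {
    std::vector<std::string> out;
    std::vector<std::string> stack;  // compile-time model of the operand stack
    int next_tmp = 0;
    for (const Insn& i : code) {
        switch (i.op) {
        case Op::Const:
            stack.push_back(std::to_string(i.arg));
            break;
        case Op::Load:
            stack.push_back("v" + std::to_string(i.arg));
            break;
        case Op::Store:
            out.push_back("v" + std::to_string(i.arg) + " = " + stack.back());
            stack.pop_back();
            break;
        case Op::Mul: {
            std::string rhs = stack.back(); stack.pop_back();
            std::string lhs = stack.back(); stack.pop_back();
            std::string t = "t" + std::to_string(next_tmp++);
            out.push_back(t + " = " + lhs + " * " + rhs);
            stack.push_back(t);
            break;
        }
        }
    }
    return out;
}
```

For the bytecode in the question (ldc, istore_0, bipush 32, iload_0, imul, istore_1) this produces three statements: "v0 = 13372338", "t0 = 32 * v0" and "v1 = t0". The names v0, v1 and t0 are the virtual registers a register allocator would then map onto physical ones.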
Beyond that, most JIT compilers focused on performance of the generated code (as opposed to quick compilation) use one or more intermediate formats and optimize the programs in this form. This is basically your run of the mill compiler optimization suite, including veterans like constant propagation, value numbering, re-association, loop invariant code motion, etc. - these things are not only simple to understand and implement, they've also been described in thirty years of literature up to and including textbooks and Wikipedia.
The code you'll get with the above will be pretty good for straight-line code using primitives, arrays and object fields. However, you won't be able to optimize method calls at all. Every method is virtual, which means inlining or even moving method calls (for example out of a loop) is basically impossible except in very special cases. You mentioned that this is for a kernel. If you can accept using a subset of Java without dynamic class loading, you can do better (but it'll be nonstandard) by assuming the JIT knows all classes. Then you can, for example, detect leaf classes (or more generally methods which are never overridden) and inline those.
If you do need dynamic class loading, but expect it to be rare, you can also do better, though it takes more work. The advantage is that this approach generalizes to other things, like eliminating logging statements completely. The basic idea is specializing the code based on some assumptions (for example, that this static does not change or that no new classes are loaded), then de-optimizing if those assumptions are violated. This means you'll sometimes have to re-compile code while it is running (this is hard, but not impossible).
If you go further down this road, its logical conclusion is trace-based JIT compilation, which has been applied to Java, but AFAIK it didn't turn out to be superior to method-based JIT compilers. It's more effective when you have to make dozens or hundreds of assumptions to get good code, as it happens with highly dynamic languages.
Some comments about your JIT compiler (I hope I do not write things "delnan" already wrote):
Generic comments
I'm sure "real" JIT compilers work similarly to yours. However, you could do some optimization (example: "mov eax,nnn" and "push eax" could be replaced by "push nnn").
You should store local variables on the stack; typically "ebp" is used as local pointer:
push ebx
push ebp
sub esp, 8 // 2 variables with 4 bytes each
mov ebp, esp
// Now local variables are addressed using [ebp+0] and [ebp+4]
...
pop ebp
pop ebx
ret
This is necessary because functions may be recursive. Storing a variable at a fixed location (relative to EIP) would cause the variables to behave like "static" ones. (I'm assuming you do not compile a function multiple times in the case of a recursive function.)
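The effect is easy to reproduce in C++ by comparing a static local (one fixed storage location, like the EIP-relative slot) with an ordinary stack local. Both functions below are illustrative sketches that try to compute n + (n-1) + ... + 0.

```cpp
// Local kept at one fixed address: every recursive call overwrites it.
int sum_fixed(int n) {
    static int local;                        // single slot shared by all activations
    local = n;
    int rest = (n > 0) ? sum_fixed(n - 1) : 0;
    return local + rest;                     // `local` now holds the innermost call's value
}

// Local kept in the stack frame: each activation has its own copy.
int sum_stack(int n) {
    int local = n;
    int rest = (n > 0) ? sum_stack(n - 1) : 0;
    return local + rest;
}
```

With n = 3, sum_stack returns 6, while sum_fixed returns 0 because the innermost call leaves 0 in the shared slot before any of the outer calls read it back.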
Try/Catch
To implement Try/Catch your JIT compiler does not only have to look at the Java Bytecode but also at the Try/Catch information that is stored in a separate Attribute in the Java class. Try/catch can be implemented in the following way:
// push all useful registers (= the ones, that must not be destroyed)
push eax
push ebp
...
// push the "catch" pointers
push dword ptr catch_pointer
push dword ptr catch_stack
// set the "catch" pointers
mov catch_stack,esp
mov dword ptr catch_pointer, my_catch
... // some code
// Here some "throw" instruction...
push exception
jmp dword ptr catch_pointer
... //some code
// End of the "try" section: Pop all registers
pop dword ptr catch_stack
...
pop eax
...
// The "catch" block
my_catch:
pop ecx // pop the Exception from the stack
mov esp, catch_stack // restore the stack
// Now restore all registers (same as at the end of the "try" section)
pop dword ptr catch_stack
...
pop eax
push ecx // push the Exception to the stack
In a multi-thread environment each thread requires its own catch_stack and catch_pointer variable!
Specific exception types can be handled by using "instanceof" in the following way:
try {
// some code
} catch(MyException1 ex) {
// code 1
} catch(MyException2 ex) {
// code 2
}
... is actually compiled like this ...:
try {
// some code
} catch(Throwable ex) {
if(ex instanceof MyException1) {
// code 1
}
else if(ex instanceof MyException2) {
// code 2
}
else throw(ex); // not handled!
}
Objects
A JIT compiler of a simplified Java virtual machine not supporting objects (and arrays) would be quite easy but the objects in Java make the virtual machine very complex.
Objects are simply stored as pointers to the object on the stack or in the local variables. Typically JIT compilers will be implemented like this: For each class a piece of memory exists that contains information about the class (eg. which methods exist and at which address the assembler code of the method is located etc.). An object is some piece of memory that contains all instance variables and a pointer to the memory containing information about the class.
"Instanceof" and "checkcast" could be implemented by looking at the pointer to the memory containing information about the class. This information may contain a list of all parent classes and implemented interfaces.
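A minimal sketch of that lookup, assuming single inheritance only and a made-up layout (ClassInfo, Object and instance_of are illustrative names, not any real JVM's structures). Interfaces would need an additional list per class and are left out here.

```cpp
// Per-class metadata: one block of memory per class, linked to its parent.
struct ClassInfo {
    const char*      name;
    const ClassInfo* super;   // nullptr for the root class
};

// Every object starts with a pointer to its class metadata.
struct Object {
    const ClassInfo* klass;
};

// Walk the parent chain until the target class is found or the chain ends.
bool instance_of(const Object* obj, const ClassInfo* target) {
    if (obj == nullptr) return false;   // in Java, null is never an instance of anything
    for (const ClassInfo* c = obj->klass; c != nullptr; c = c->super) {
        if (c == target) return true;
    }
    return false;
}
```

The catch dispatch shown earlier would call such a routine once per catch clause, in source order, and rethrow if none matches.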
The main problem of objects however is the memory management in Java: Unlike C++ there is a "new" but no "delete". You have to track whether an object is still in use (for example by reference counting or with a tracing garbage collector). If an object is no longer used, it must be deleted from memory and its finalizer (Java's closest equivalent of a destructor), if any, must be run.
The problems here are local variables (the same local variable may contain an object or a number) and try/catch blocks (the "catch" block must take care about the local variables and the stack (!) containing objects before restoring the stack pointer).

How to hook C++ functions with asm

I want to hook a C++ function. But I don't want to use the trampoline mechanism of ms detours, instead of it I want to fully patch it. I can get the handle to the DLL, where the function is located and I have the right offset(imageBase stuff ...). So how to hook it? And I don't know the data types of the arguments(var_4 and arg_0), or aren't they needed? In general I want to replace following function with my own one(my function is nearly the same, there's only a line changed):
sub_39001A40 proc near
var_4 = dword ptr -4
arg_0 = dword ptr 4
push ecx
cmp dword_392ADAB4, 0
jnz short loc_39001A4F
call loc_39024840
loc_39001A4F:
push esi
mov esi, [esp+8+arg_0]
lea eax, [esp+8+var_4]
push eax
push esi
call dword_392ADA98
mov ecx, [esp+10h+var_4]
add esp, 8
add dword_392ADA80, ecx
adc dword_392ADA84, 0
add dword_392ADA90, esi
pop esi
adc dword_392ADA94, 0
add dword_392ADA7C, 1
pop ecx
retn
sub_39001A40 endp
It's bad, that I only can hook functions, which names I know with ms detours. I cannot hook those asm functions with detours, cause I need the data types of the arguments passed for creating the function structures!
EDIT::::
"What's wrong with detours, exactly?"
I wrote: "I don't want to use the trampoline mechanism of ms detours, instead of it I want to fully patch it." and "It's bad, that I only can hook functions, which names I know with ms detours. I cannot hook those asm functions with detours, cause I need the data types of the arguments passed for creating the function structures!" and I don't have the source code of the C++ files. I only have the hex-dump.
"Trampoline is an actual technical term :) I'm just wondering why @lua can't use it."
I write: Read my sentences again, if you still don't understand why, my english is bad.
"Overriding just the named function should work, of course you may need to re-implement the whole DLL (depending on if it is of any further use to you). Given your grasp of assembler you might get away with using a hex editor to edit (a copy of) the original DLL you are seeking to subvert."
I want to hook the function, because I don't want to edit the file. I can't overwrite my function, because I don't know the datatypes of the arguments and the function's name.
@asveikau: Thanks for your real help, but I don't want to use a trampoline mechanism, I want to overwrite the function.
A good trick is to replace the first few instructions with this:
push dword xxxx ; where xxxx = new code location
ret
This is sort of like an obfuscated jmp. I write it this way because, in the assembled version, it is very easy to replace the push operand with your pointer at runtime. It assembles to:
68 XX XX XX XX c3
Where "XX XX XX XX" is your address in little-endian.
Then you can make a "call the old version of the function" code location, where the first few instructions are the ones you replaced with the sequence above, followed by a jump to the next valid instruction in the original code.
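Building the six patch bytes described above can be sketched as follows. The helper name make_push_ret is made up, and it only produces the byte sequence. Actually writing it over the target function (changing page protection, saving the overwritten instructions) is not shown.

```cpp
#include <array>
#include <cstdint>

// Encode `push imm32; ret` for a 32-bit target address:
// opcode 0x68, the address in little-endian byte order, then 0xC3.
std::array<std::uint8_t, 6> make_push_ret(std::uint32_t target) {
    return {{
        0x68,
        static_cast<std::uint8_t>(target),
        static_cast<std::uint8_t>(target >> 8),
        static_cast<std::uint8_t>(target >> 16),
        static_cast<std::uint8_t>(target >> 24),
        0xC3
    }};
}
```

For an address such as 0x39001A40 the four operand bytes come out as 40 1A 00 39, matching the little-endian layout mentioned above.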
Overriding just the named function should work, of course you may need to re-implement the whole DLL (depending on if it is of any further use to you). Given your grasp of assembler you might get away with using a hex editor to edit (a copy of) the original DLL you are seeking to subvert.