Address of function is not actual code address - c++

Debugging some code in Visual Studio 2008 (C++), I noticed that the address in my function pointer variable is not the actual address of the function itself. This is an extern "C" function.
int main() {
    void (*printaddr)(const char *) = &print; // debug shows printaddr == 0x013C1429
}
Address: 0x013C4F10
void print(const char *text) {
...
}
The disassembly of taking the function address is:
void (*printaddr)(const char *) = &print;
013C7465 C7 45 BC 29 14 3C 01 mov dword ptr [printaddr],offset print (13C1429h)
EDIT: I viewed the code at address 013C4F10 and the compiler is apparently inserting a "jmp" instruction at that address.
013C4F10 E9 C7 3F 00 00 jmp print (013C1429h)
There is actually a whole jump table with an entry for every method in the .exe.
Can someone expound on why it does this? Is it a debugging "feature"?

That is caused by 'Incremental Linking'. If you disable it in your compiler/linker settings, the jumps will go away.
http://msdn.microsoft.com/en-us/library/4khtbfyf(VS.80).aspx
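In Visual Studio this setting lives under Project Properties → Configuration Properties → Linker → General → Enable Incremental Linking; setting it to No passes /INCREMENTAL:NO to the linker.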

I'm going to hazard a guess here, but it's possibly to enable Edit-and-Continue.
Say you need to recompile that function, you only need to change the indirection table, not all callers. That would dramatically reduce the amount of work to do when the Edit-and-Continue feature is being exercised.

The compiler is inserting a "jmp" instruction at that address to the real method.
013C4F10 E9 C7 3F 00 00 jmp print (013C1429h)
There is actually a whole jump table of every method in the .exe.
It is a debugging feature: when I switch to Release mode, the jump table goes away and the address is indeed the actual function address.
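As an aside (my own sketch, not part of the original answers): each incremental-linking thunk is a 5-byte jmp rel32 (opcode E9), so you can follow it at run time to recover the real code address, assuming the pointer really does land on such a thunk:
#include <cstdint>
#include <cstring>

// Follow a possible E9 (jmp rel32) incremental-linking thunk to the real code.
// If the address doesn't start with an E9 byte, it is returned unchanged.
void* follow_thunk(void* func)
{
    unsigned char* p = static_cast<unsigned char*>(func);
    if (p[0] == 0xE9) {                       // jmp rel32
        int32_t rel;
        std::memcpy(&rel, p + 1, sizeof rel); // little-endian displacement
        return p + 5 + rel;                   // target = next instruction + rel32
    }
    return func;
}
(Reading the bytes of a function through a data pointer like this is outside what standard C++ guarantees, but it works on x86 Windows in practice.)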


C++ odd assembly output query

Using Windows 10 Pro with Visual Studio 2022, Debug mode, X64 platform, I have the following code...
int main()
{
    int var = 1;
    int* varPtr = &var;
    *varPtr = 10;
    return 0;
}
In the disassembly window we see this...
int var = 1;
00007FF75F1D1A0D C7 45 04 01 00 00 00 mov dword ptr [var],1
int* varPtr = &var;
00007FF75F1D1A14 48 8D 45 04 lea rax,[var]
00007FF75F1D1A18 48 89 45 28 mov qword ptr [varPtr],rax
*varPtr = 10;
00007FF75F1D1A1C 48 8B 45 28 mov rax,qword ptr [varPtr]
00007FF75F1D1A20 C7 00 0A 00 00 00 mov dword ptr [rax],0Ah
return 0;
Upon stepping through the above, the RAX register is loaded with the memory address for the stack variable, var, via...
00007FF75F1D1A14 48 8D 45 04 lea rax,[var]
Since RAX is not changed after this, why is that same var address being loaded into RAX again, 2 instructions later with...
00007FF75F1D1A1C 48 8B 45 28 mov rax,qword ptr [varPtr]
The memory view window shows that the &var address is constant throughout. Am I missing something daft?
[Updated] - switching to Release mode with optimisation off produces the above in full. Turning on speed/size optimisation leaves only the "return 0" code. It would be interesting to see whether there's a way to force the compiler (with the fast switch on) to keep everything and not remove what it thought was redundant, for this example. This minimal example appears to be too minimal, lol.
Still concerned about that unneeded double load of RAX - primarily for such a small program, though yes, that's what 'optimisation' is all about. Still.
When compiling in Debug mode (i.e. with all optimisations disabled), the compiler generates code like this for a reason.
Suppose you are stepping through the code and you stop on the line that reads *varPtr = 10;. At that point, you decide that you loaded the wrong address into varPtr and would like to change it and continue debugging without stopping, rebuilding and restarting your program.
Well, in Debug mode, you can. Just change the address stored in varPtr (in the Watch window, say) and carry on debugging. Without the 'redundant' second load, this wouldn't work. When the compiler emits said load, it does.
So, to summarise, Debug mode is designed to make debugging easier, while Release mode is designed to make your code run as fast (or be as small) as possible, hopefully with the same semantics.
And just be grateful that compiler writers understand the need for these two modes of operation. Without them, our lives as developers would be much, much harder.
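As for the question's update about forcing the compiler to keep "redundant" code in Release mode: a hedged suggestion, but the usual tool is volatile, which forbids the optimizer from caching accesses in registers or deleting the stores:
int main()
{
    volatile int var = 1;        // volatile: every access must really touch memory
    volatile int* varPtr = &var; // so the optimizer can't keep &var cached in RAX
    *varPtr = 10;
    return var;                  // read it back so the stores stay observable
}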

Is it possible to write asm in C++ with opcode instead of shellcode

I'm curious if there's a way to use __asm in C++ and then write that into memory, instead of doing something like:
BYTE shell_code[] = { 0x48, 0x03 ,0x1c ,0x25, 0x0A, 0x00, 0x00, 0x00 };
write_to_memory(function, &shell_code, sizeof(shell_code));
So I would like to do:
asm_code = __asm("add rbx, &variable\n\t""jmp rbx") ;
write_to_memory(function, &asm_code , sizeof(asm_code ));
Worst case I can use GCC and objdump externally or something, but I'm hoping there's an internal way.
You can put an asm(""); statement at global scope, with start/end labels inside it, and declare those labels as extern char start_code[], end_code[0]; so you can access them from C. C char arrays work most like asm labels, in terms of being able to use the C name and have it work as an address.
// compile with gcc -masm=intel
// AFAIK, no way to do that with clang
asm(
".pushsection .rodata \n"          // we don't want to run this from here, it's just data
"start_code: \n"
"    add rax, OFFSET variable \n"  // *absolute* address as 32-bit sign-extended immediate
"end_code: \n"
".popsection"
);

__attribute__((used)) static int variable = 1;

extern char start_code[], end_code[0];  // C declarations for those asm labels

#include <string.h>
void copy_code(void *dst)
{
    memcpy(dst, start_code, end_code - start_code);
}
It would be fine to have the payload code in the default .text section, but we can put it in .rodata since we don't want to run it.
Is that the kind of thing you're looking for? asm output on Godbolt (without assembling + disassembling):
start_code:
        add rax, OFFSET variable
end_code:
copy_code(void*):
        mov edx, OFFSET FLAT:end_code
        mov esi, OFFSET FLAT:start_code
        sub rdx, OFFSET FLAT:start_code
        jmp [QWORD PTR memcpy@GOTPCREL[rip]]
To see if it actually assembles to what we want, I compiled with
gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -c foo.c to get a .o. objdump -drwC -Mintel shows:
0000000000000000 <copy_code>:
0: ba 00 00 00 00 mov edx,0x0 1: R_X86_64_32 .rodata+0x6
5: be 00 00 00 00 mov esi,0x0 6: R_X86_64_32 .rodata
a: 48 81 ea 00 00 00 00 sub rdx,0x0 d: R_X86_64_32S .rodata
11: ff 25 00 00 00 00 jmp QWORD PTR [rip+0x0] # 17 <end_code+0x11> 13: R_X86_64_GOTPCRELX memcpy-0x4
And with -D to see all sections, the actual payload is there in .rodata, still not linked yet:
Disassembly of section .rodata:
0000000000000000 <start_code>:
0: 48 05 00 00 00 00 add rax,0x0 2: R_X86_64_32S .data
-fno-pie -no-pie is only necessary for the 32-bit absolute address of variable to work. (Without it, we get two RIP-relative LEAs and a sub rdx, rsi. Unfortunately neither way of compiling gets GCC to subtract the symbols at build time with mov edx, OFFSET end_code - start_code, but that's just in the code doing the memcpy, not in the machine code being copied.)
In a linked executable
We can see how the linker filled in those relocations.
(I tested by using -nostartfiles instead of -c - I didn't want to run it, just look at the disassembly, so there was no point in actually writing a main.)
$ gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -nostartfiles foo.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
$ objdump -D -rwC -Mintel a.out
(manually edited to remove uninteresting sections)
Disassembly of section .text:
0000000000401000 <copy_code>:
401000: ba 06 20 40 00 mov edx,0x402006
401005: be 00 20 40 00 mov esi,0x402000
40100a: 48 81 ea 00 20 40 00 sub rdx,0x402000
401011: ff 25 e1 2f 00 00 jmp QWORD PTR [rip+0x2fe1] # 403ff8 <memcpy@GLIBC_2.14>
The linked payload:
0000000000402000 <start_code>:
402000: 48 05 18 40 40 00 add rax,0x404018 # from add rax, OFFSET variable
0000000000402006 <end_code>:
402006: 48 c7 c2 06 00 00 00 mov rdx,0x6
# this was from mov rdx, OFFSET end_code - start_code to see if that would assemble + link
Our non-zero-init dword variable that we're taking the address of:
Disassembly of section .data:
0000000000404018 <variable>:
404018: 01 00 add DWORD PTR [rax],eax
...
Your specific asm instruction is weird
&variable isn't valid asm syntax, but I'm guessing you wanted to add the address?
Since you're going to be copying the machine code somewhere, you must avoid RIP-relative addressing modes and any other relative references to things outside the block you're copying. Only mov can use 64-bit absolute addresses, like movabs rdi, OFFSET variable instead of the usual lea rdi, [rip + variable]. Loads/stores with a 64-bit absolute address are also possible, but only into/from RAX/EAX/AX/AL, e.g. movabs eax, [variable]. (mov-immediate can use any register; the absolute-address load/store form only works with the accumulator. https://www.felixcloutier.com/x86/mov)
(movabs is an AT&T mnemonic, but GAS allows it in .intel_syntax noprefix to force using 64-bit immediates instead of the default 32-bit sign-extended ones.)
This is kind of the opposite of normal position-independent code, which works when the whole image is loaded at an arbitrary base. This makes code that works when the image is loaded at a fixed base (or even a variable one, since runtime fixups should work for symbolic references) and is then copied around relative to the rest of your code. So all your memory references have to be absolute, except for references within the copied asm itself.
So we couldn't have made PIE-compatible machine code by using lea rdx, [RIP+variable] / add rax, rdx - that would only get the right address for variable when run from the linked location in .rodata, not from any copy. (Unless you manually fix up the code when copying it, but it's still only a rel32 displacement.)
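By the way - not something the question asked for, but once the bytes are copied somewhere, running them requires executable memory. A minimal sketch of that step, assuming Linux/POSIX mmap/mprotect and assuming the payload has been made a complete function ending in ret:
#include <cstring>
#include <sys/mman.h>

extern "C" char start_code[], end_code[];  // the asm labels from above

void* copy_code_executable()
{
    size_t len = end_code - start_code;
    // Map a fresh RW page, copy the payload in, then flip it to RX
    // (many systems enforce W^X, so don't request writable+executable up front).
    void* dst = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED)
        return nullptr;
    std::memcpy(dst, start_code, len);
    if (mprotect(dst, len, PROT_READ | PROT_EXEC) != 0)
        return nullptr;
    return dst; // castable to a function pointer once the payload is callable
}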
Terminology:
An opcode is part of a machine instruction, e.g. add ecx, 123 assembles to 3 bytes: 83 c1 7b. Those are the opcode, modrm, and imm8 respectively. (https://www.felixcloutier.com/x86/add).
"opcode" also gets misused (especially in shellcode usage) to describe the whole instruction represented as bytes.
Text names for instructions like add are mnemonics.
This is just a guess; I don't know if it will work. I'm sorry in advance for an ugly answer, since I don't have much time due to work.
I think you can enclose your asm code inside labels.
Get the address of that label and the size, treat it as a blob of data, and you can write it anywhere.
#include <stdlib.h>
#include <string.h>

typedef void func(void); // assumed signature of the copied code

void funcA() {
    // some code here.
labelStart:
    __asm__("# asm code here.");
labelEnd:
    // some code here.

    // --- treat the code between the labels as movable data
    // (GNU extension: &&label yields the address of a label).
    char* pDynamicProgram = (char*)&&labelStart;
    size_t sizeDP = (char*)&&labelEnd - (char*)&&labelStart;

    // --- write it to some memory (which would also have to be made
    // executable for the call below to work).
    char* someBuffer = (char*)malloc(sizeDP);
    memcpy(someBuffer, pDynamicProgram, sizeDP);

    // --- execute: cast to a function pointer, then call.
    ((func*)someBuffer)(/* parameters, if any */);
}
The sample code above is of course still not really compilable (label-address arithmetic like this isn't guaranteed to do what you want), but the logic is kind of like that. I have seen viruses do it that way, though I haven't seen the actual C++ code - only what disassemblers show. For the "return" logic after the call there are many ad-hoc ways to do it; just be creative.
Also, I think you first have to enable some settings for your program to be able to write to otherwise forbidden memory, in case you want to overwrite an existing function.
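Concretely (my assumption about what those "settings" mean in practice): on POSIX systems that's mmap/mprotect with PROT_EXEC, and on Windows VirtualAlloc/VirtualProtect with PAGE_EXECUTE_READWRITE - see the mmap sketch earlier in this thread.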

macOS - Reading part of other app library code and disassembling it to get offset

My application reads another application's memory in order to get a pointer. First I need to read an offset from a static library to start working with the application itself.
Some function in the dylib contains the offset to the pointer, "0x41b1110" - I know that this offset works when used manually, but I need my application to read it automatically, without checking the value by hand. If I do a simple memory read from address SomeAddressX as a uint64_t, I get a ridiculous value which is not equal to 0x41b1110; I'm pretty sure what I got is simply the instruction itself. I then tried reading it as a byte array, and that byte array was equal to the byte array at the same address in the plain binary. I'm wondering how to read simply "0x41b1110" rather than the entire instruction. Do I need to disassemble the byte code into an x64 instruction and then parse it to get the address, or is there a smarter way? I'm not very experienced with asm.
SomeAddressX - rax, qword [ds:0x41b1110]
Adding an example of byte code and the corresponding instruction:
lea rax, qword [ds:0x1043740]
which gives
48 8D 05 6F D9 99 00
The first three bytes, 48 8D 05, appear to be lea rax, qword, but the other part, 6F D9 99 00, doesn't look like 01 04 37 40 (0x1043740)?
It's x64, and OSX enforces PIC (position-independent code) - it doesn't allow non-PIC executables, as it uses ASLR.
So that disassembly is hiding an important bit of information from you. The true nature of that instruction is revealed here (ba dum ts):
lea rax,[rip+0x99d96f]
It's using the current instruction pointer rip to address its data relatively.
The 0x1043740 is the result of addressOfInstruction + 7 + 0x99d96f.
The 0x99d96f part is clearly visible in the byte code itself (the bytes 6F D9 99 00, read little-endian).
The +7 is the instruction's length in bytes. Now, I'm not 100% sure it's added at that stage too, so do your own math, as you know "addressOfInstruction".
And check out your debugger options, to see if you can switch between the friendly absolute memory display vs. true rip+offset disassembly.
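If you do end up decoding it yourself, the arithmetic for this particular instruction form is small enough to hand-roll. A sketch (my own, assuming you've already read the 7 instruction bytes out of the target process and know their address there):
#include <cstdint>
#include <cstring>

// Decode "lea rax, [rip+disp32]" (bytes 48 8D 05 xx xx xx xx) and return
// the absolute target: RIP-relative means address-of-next-instruction + disp.
uint64_t lea_target(const uint8_t* instr, uint64_t instr_addr)
{
    int32_t disp;                           // the displacement is signed
    std::memcpy(&disp, instr + 3, 4);       // skip REX.W + opcode + ModRM (3 bytes)
    return instr_addr + 7 + (int64_t)disp;  // 7 = total instruction length
}
For the example above, disp comes out as 0x0099d96f (the bytes 6F D9 99 00 read little-endian), matching the rip+0x99d96f in the true disassembly.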

Do macros in C++ improve performance?

I'm a beginner in C++ and I've just read that macros work by replacing text whenever needed. In this case, does this mean that it makes the .exe run faster? And how is this different than an inline function?
For example, if I have the following macro:
#define SQUARE(x) ((x) * (x))
and a normal function:
int Square(const int& x)
{
    return x*x;
}
and an inline function:
inline int Square(const int& x)
{
    return x*x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.
You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
PS: as a matter of style, avoid using const Type& for parameter types that are fundamental, like int or double. Simply use the type itself; in other words, use
int Square(int x)
since passing by value won't hurt performance here (the reference may even make it worse); see e.g. this question for more details.
Macros translate to: stupid replacement of pattern A with pattern B. This means everything happens before the compiler kicks in. Sometimes they come in handy, but in general they should be avoided, because you can do a lot of things with them and then, later on in the debugger, have no idea what is going on.
Besides: your approach to performance is, well, naive, to say it friendly. First you learn the language (which is hard for modern C++, because there are a ton of important concepts and things one absolutely needs to know and understand). Then you practice, practice, practice. And then, when you really come to a point where your existing application has performance problems, you do profiling to understand the real issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (like: potential bottlenecks) and configuration (in the sense of latency between different nodes in your system), and so on. Of course, you should apply common sense and not write code that obviously wastes memory or CPU cycles. But sometimes a piece of code that runs 50% slower might be 500% easier to read and maintain. And if execution time is then 500ms instead of 250ms, that might be totally OK (unless that specific part is called a thousand times per minute).
The difference between a macro and an inlined function is that a macro is dealt with before the compiler sees it.
On my compiler (clang++) without optimisation flags the square function won't be inlined. The code it generates looks like this
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
the imul is the assembly instruction doing the work, the rest is moving data around.
code that calls it looks like
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, that imul shows up in the main function, where the function is called from in the C++ code:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to get a basic handle on assembly language for your machine, and to use gcc -S on your source or objdump -D on your binary (as I did here) to see exactly what is going on.
Using the macro instead of the inlined function gets something very similar
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol@plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers here with macros: what does this do?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? Nope, 42 (on my compiler, anyway - modifying x twice without a sequence point is undefined behaviour, so any result is possible). It becomes
std::cout << ++x * ++x << std::endl;
which here evaluated as 6 * 7.
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.
Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get using macros is the same as can be achieved with inline functions - at the expense of increased chances of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros, they are typesafe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.
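A small sketch of my own, illustrating both the type-safety and the single-evaluation points:
#include <iostream>

#define SQUARE(x) ((x) * (x))

inline int Square(int x) { return x * x; }

int main()
{
    std::cout << SQUARE(1.5) << '\n'; // 2.25 - the macro silently yields a double
    std::cout << Square(1.5) << '\n'; // 1 - the function converts 1.5 to int first
    int i = 5;
    std::cout << Square(++i) << '\n'; // 36 - ++i is evaluated exactly once
    return 0;
}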
Using such a macro only makes sense if its argument is itself a #define'd constant, as the computation can then be folded at compile time. Even then, double-check that the result is the expected one.
When working on classic variables, the (inlined) function form should be preferred as:
It is type-safe;
It will handle expressions used as an argument in a consistent way. This not only covers the pre/post-increment case quoted by Peter: when the argument is itself some computation-intensive expression, the macro form forces that argument to be evaluated twice (and the two evaluations may not even yield the same value, by the way) vs. only once for the function.
I have to admit that I used to write such macros for quick prototyping of apparently simple functions, but the time they have made me lose over the years finally changed my mind!

Exception handler

There is this code:
char text[] = "zim";
int x = 777;
If I look at the stack where x and text are placed, the output is:
09 03 00 00 7a 69 6d 00
Where:
09 03 00 00 = 0x309 = 777 <- int x = 777
7a 69 6d 00 = char text[] = "zim" (ASCII code)
Now there is code with try..catch:
char text[] = "zim";
try {
    int x = 777;
}
catch(int) {
}
Stack:
09 03 00 00 **97 85 04 08** 7a 69 6d 00
Now a new 4-byte value is placed between text and x. If I add another catch, there will be something like:
09 03 00 00 **97 85 04 08** **xx xx xx xx** 7a 69 6d 00
and so on. I think this is some value connected with exception handling, used during stack unwinding to find the appropriate catch when an exception is thrown in the try block. The question, however, is: what exactly is this 4-byte value (maybe an address of some exception handler structure, or some ID)?
I use g++ 4.6 on 32 bit Linux machine.
AFAICT, that's a pointer to an "unwind table". Per the Itanium ABI implementation suggestions, the process "[uses] an unwind table, [to] find information on how to handle exceptions that occur at that PC, and in particular, get the address of the personality routine for that address range."
The idea behind unwind tables is that the data needed for stack unwinding is rarely used. Therefore, it's more efficient to put a pointer on the stack and store the rest of the data in another page. In the best case, that page can remain on disk and never even needs to be loaded into RAM. In comparison, C-style error handling often ends up in the L1 cache, because it's all inline.
Needless to say all this is platform-dependent and etc.
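If you want to poke at the unwind data yourself, you can dump what GCC emits from the binary and from the generated asm (exact output varies by toolchain and platform):
$ readelf --debug-dump=frames ./a.out   # the .eh_frame unwind entries
$ g++ -S file.cpp                       # look for the .cfi_* directives in the output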
This may be an address. It may point either to a code section (some handler address), to a data section (a pointer to a build-time-generated structure with frame info), or to the stack of the same thread (a pointer to a run-time-generated table of frame info).
Or it may be garbage, left over due to an alignment requirement that EH may impose.
For instance, on Win32/x86 there's no such gap. For every function that uses exception handling (has try/catch or __try/__except/__finally or objects with destructors), the compiler generates an EXCEPTION_REGISTRATION structure that is allocated on the stack (by the function prolog code). Then, whenever something changes within the function (an object is created/destroyed, a try/catch block is entered/exited), the compiler adds an instruction that modifies this structure (more correctly, modifies its extension). But nothing more is allocated on the stack.