(Intel x86; Turbo Assembler, Borland C++ compiler, Turbo Linker.)
My question is about how to modify my f1.asm (and possibly main1.cpp) code.
In main1.cpp I read integer values which I pass to the function in f1.asm; it adds them and returns the sum, which main1.cpp displays.
main1.cpp:
#include <iostream.h>
#include <stdlib.h>
#include <math.h>
extern "C" int f1(int, int, int);
int main()
{
int a,b,c;
cout<<"W = a+b+c" << endl ;
cout<<"a = " ;
cin>> a;
cout<<"b = " ;
cin>>b;
cout<<"c = " ;
cin>>c;
cout<<"\nW = "<< f1(a,b,c) ;
return 0;
}
f1.asm:
.model SMALL, C
.data
.code
PUBLIC f1
f1 PROC
push BP
mov BP, SP
mov ax,[bp+4]
add ax,[bp+6]
add ax,[bp+8]
pop BP
ret
f1 ENDP
.stack
db 100 dup(?)
END
I want to make such a function for an arbitrary number of variables by sending a pointer to the array of elements to the f1.asm.
QUESTION: If I make the int f1(int, int, int) function in main1.cpp into a int f1( int* ), and put into it the pointer to the array containing the to-be-added values, then how should my .asm code look to access the first (and subsequent) array elements?
How is the pointer stored? I tried treating it as an offset, and an offset of an offset, and a few other things, but I still couldn't access the array's elements.
(If I can just access the first few, I can take care of the rest of the problem.)
...Or should I, in this particular case, use something else from .cpp's side than a pointer?
Ouch, it's been a long time since I've seen a call from 16-bit C into assembly...
C and C++ allow passing a variable number of arguments, provided the callee can determine the count: the caller pushes all arguments in reverse order before calling the function and cleans up the stack after the function returns.
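The variadic-argument case can be sketched in portable C++ with <cstdarg>; here the count is passed explicitly as the first argument, which is one common way the callee can "determine the number" (the function name is illustrative):

```cpp
#include <cstdarg>

// Hypothetical variadic sum: the first argument tells the callee how
// many ints follow; cdecl itself gives it no other way to know.
int sum_varargs(int count, ...)
{
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(args, int); // each int the caller pushed
    va_end(args);
    return total;
}
```

Usage: `sum_varargs(3, 1, 2, 3)` walks the three pushed values and returns their sum.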
But passing an array is something totally different: you pass one single value, the address of the array (a pointer).
Assuming you pass an array of 3 ints in the 16-bit small model (ints, data pointers, and code addresses are all 16 bits):
C++
int arr[3] = {1, 2, 3};
int cr;
cr = f1(arr);
ASM
push BP
mov BP, SP
mov ax,[bp+4] ; get the address of the array
mov bp, ax ; BP now points to the array
mov ax, [bp] ; get value of first element
add ax,[bp+2] ; add remaining elements
add ax,[bp+4]
pop BP
ret
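For an arbitrary number of elements, the usual trick is to pass an element count alongside the pointer. A C++ sketch of what the asm would have to mirror (in the 16-bit small model, element i sits at [pointer + 2*i] since ints are 2 bytes; the pointer+count signature is an assumption, extending the question's int f1(int*)):

```cpp
// Hypothetical pointer+count version of f1: mirrors walking the array
// in asm, where each iteration would do  add ax, [bx]  then  add bx, 2.
int f1(const int *arr, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}
```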
Related
I want to link assembly code to C++, and below is my code.
This is .cpp
#include <iostream>
#include <time.h>
using namespace std;
extern "C" {int IndexOf(long searchVal, long array[], unsigned count); }
int main()
{
//Fill an array with pseudorandom integers.
const unsigned ARRAY_SIZE = 100000;
const unsigned LOOP_SIZE = 100000;
char F[] = "false";
char T[] = "true";
char* boolstr[] = {F,T};
long array[ARRAY_SIZE];
for (unsigned i = 0; i < ARRAY_SIZE; i++)
array[i] = rand();
long searchval;
time_t startTime, endTime;
cout << "Enter an integer value to find:";
cin >> searchval;
cout << "Please wait... \n";
//Test the Assembly language function.
time(&startTime);
int count = 0;
for (unsigned n = 0; n < LOOP_SIZE; n++)
count = IndexOf(searchval, array, ARRAY_SIZE); //Here
bool found = count != -1;
time(&endTime);
cout << "Elapsed ASM time: " << long(endTime - startTime)
<< " seconds. Found = " << boolstr[found] << endl;
return 0;
}
This is .asm
;IndexOf function (IndexOf.asm)
.586
.model flat, C
IndexOf PROTO,
srchval:DWORD, arrayPtr: PTR DWORD, count: DWORD
.code
;---------------------------------------------------------------------
IndexOf PROC USES ecx esi edi,
srchval: DWORD, arrayPtr : PTR DWORD, count: DWORD
;
;Performs a linear search of a 32-bit integer array,
;looking for a specific value. If the value is found,
;the matching index position is returned in EAX;
;otherwise, EAX equals -1.
;---------------------------------------------------------------------
NOT_FOUND = -1
mov eax, srchval ; search value
mov ecx, count ; array size
mov esi, arrayPtr ; pointer to array
mov edi, 0 ;index
L1:
cmp [esi+edi*4], eax
je found
inc edi
loop L1
notFound:
mov eax, NOT_FOUND ; must set all of EAX, not just AL
jmp short exit
found:
mov eax, edi
exit:
ret
IndexOf ENDP
END
It's the same code as in the textbook Assembly Language for x86 Processors by Irvine.
I already set the build customizations and checked the box next to masm.
I also opened the .asm file's Properties and changed the item type to Microsoft Macro Assembler.
I don't know how to link these two files together.
I'm also wondering whether the problem is in the .cpp code or the .asm code.
Please help me :(
Thanks!!
I am not giving an exact answer here, and this answer might be a little rough, but in the link below pay attention to the link command from the question; if you can get yourself two .obj files, one for the .cpp and one for the .asm, I think it should work. If you are learning, you could try making an nmake file; I think some assembly programmers do that. I have the Irvine book, and I don't remember him mixing C++ and asm.
How to write and combine multiple source files for a project in MASM?
It is a little more complicated than what I am saying here, but basically, to mix the two, both sides have to use the same calling convention. On Windows I think that is __stdcall; essentially the program needs to know how the two source files use the stack between them. Side note: most of the stuff in the Win32 API headers boils down to __stdcall.
https://learn.microsoft.com/en-us/cpp/cpp/stdcall?view=msvc-170
This is not a pro answer, but it might help.
PS: I didn't notice your comment about IndexOf.h. Try getting rid of the { } on the extern "C" line; that line is basically what you would find in a header file. Header files typically declare functions, but do not define them.
Edit additional info:
For your last comment, if you are asking what I think you are asking, it works like this: the extern "C" int IndexOf(long searchVal, long array[], unsigned count); line is called a function prototype. It tells the compiler that a function with this IndexOf signature exists; it does not mean the function is implemented, nor does it define what the function does. Your implementation is in asm, but it could just as easily be replaced with C++ as long as the signature stays the same. I am guessing one of the errors you got along the way was a linker error saying the definition of IndexOf was not found; that just means your C++ is saying: hey, you said we have an IndexOf function, I need it, but I cannot find it. That happens at the linking stage, where the object files for the C++ and the asm are linked to create the exe. So yes, if you are using VS you can use the debugger to step into your assembly implementation of IndexOf, or even put a breakpoint in the .asm file itself.
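To make the prototype/definition split concrete, here is a sketch where the asm implementation is replaced by a C++ one with the same signature; the linker is equally happy with either, which is the whole point of the extern "C" declaration:

```cpp
// Declaration (what the .cpp sees, e.g. via a header):
extern "C" int IndexOf(long searchVal, long array[], unsigned count);

// Definition: a C++ stand-in for the .asm implementation.
// Linear search; returns the matching index, or -1 like the asm's EAX.
extern "C" int IndexOf(long searchVal, long array[], unsigned count)
{
    for (unsigned i = 0; i < count; i++)
        if (array[i] == searchVal)
            return static_cast<int>(i);
    return -1; // NOT_FOUND
}
```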
Side note: on Windows (and I guess on Linux too), when programming against the API (the functions of Windows) you can step into the assembler code, but it is not really source code; it is just the machine code that has been linked into your exe. You can sometimes see this in the stack trace: for example, all Windows exes start in ntdll.dll, as that contains the process loader. As Linux is open source, you could step into the OS source if you had it loaded, and Microsoft can step into the code of Windows since they own it. It comes down to debugging symbols; look for the .pdb files (on Windows platforms with MASM/VS) in your project folder after building the project.
I'm trying to debug a rather large program with many variables. The code is set up this way:
while (condition1) {
//non timing sensitive code
while (condition2) {
//timing sensitive code
//many variables that change each iteration
}
}
I have many variables on the inner loop that I want to save for viewing. I want to write them to a text file each outer loop iteration. The inner loop executes a different number of times each iteration. It can be just 2 or 3, or it can be several thousands.
I need to see all the variables values from each inner iteration, but I need to keep the inner loop as fast as possible.
Originally, I tried just storing each data variable in its own vector where I just appended a value at each inner loop iteration. Then, when the outer loop iteration came, I would read from the vectors and write the data to a debug file. This quickly got out of hand as variables were added.
I thought about using a string buffer to store the information, but I'm not sure if this is the fastest way given strings would need to be created multiple times within the loop. Also, since I don't know the number of iterations, I'm not sure how large the buffer would grow.
With the information stored being in formats such as:
"Var x: 10\n
Var y: 20\n
.
.
.
Other Text: Stuff\n"
So, is there a cleaner option for writing large amounts of debug data quickly?
If it's really time-sensitive, then don't format strings inside the critical loop.
I'd go for appending records to a log buffer of binary records inside the critical loop. The outer loop can either write that directly to a binary file (which can be processed later), or format text based on the records.
This has the advantage that the loop only needs to track a couple extra variables (pointers to the end of used and allocated space of one std::vector), rather than two pointers for a std::vector for every variable being logged. This will have much lower impact on register allocation in the critical loop.
In my testing, it looks like you just get a bit of extra loop overhead to track the vector, and a store instruction for every variable you want to log. I didn't write a big enough test loop to expose any potential problems from keeping all the variables "alive" until the emplace_back(). If the compiler does a bad job with bigger loops where it needs to spill registers, see the section below about using a simple array without any size checking. That should remove any constraint on the compiler that makes it try to do all the stores into the log buffer at the same time.
Here's an example of what I'm suggesting. It compiles and runs, writing a binary log file which you can hexdump.
See the source and asm output with nice formatting on the Godbolt compiler explorer. It can even colourize source and asm lines so you can more easily see which asm comes from which source line.
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>
struct loop_log {
// Generally sort in order of size for better packing.
// Use as narrow types as possible to reduce memory bandwidth.
// e.g. logging an int loop counter into a short record is fine if you're sure it always fits in a short in practice, and it has zero performance downside
int64_t x, y, z;
uint64_t ux, uy, uz;
int32_t a, b, c;
uint16_t t, i, j;
uint8_t c1, c2, c3;
// isn't there a less-repetitive way to write this?
loop_log(int64_t x, int32_t a, int outer_counter, char c1)
: x(x), a(a), i(outer_counter), c1(c1)
// leaves other members *uninitialized*, not zeroed.
// note lack of gcc warning for initializing uint16_t i from an int
// and for not mentioning every member
{}
};
static constexpr size_t initial_reserve = 10000;
// take some args so gcc can't count the iterations at compile time
void foo(std::ostream &logfile, int outer_iterations, int inner_param) {
std::vector<struct loop_log> log;
log.reserve(initial_reserve);
int outer_counter = outer_iterations;
while (--outer_counter) {
//non timing sensitive code
int32_t a = inner_param - outer_counter;
while (a != 0) {
//timing sensitive code
a <<= 1;
int64_t x = outer_counter * (100LL + a);
char c1 = x;
// much more efficient code with gcc 5.3 -O3 than push_back( a struct literal );
log.emplace_back(x, a, outer_counter, c1);
}
const auto logdata = log.data();
const size_t bytes = log.size() * sizeof(*logdata);
// write group size, then a group of records
logfile.write( reinterpret_cast<const char *>(&bytes), sizeof(bytes) );
logfile.write( reinterpret_cast<const char *>(logdata), bytes );
// you could format the records into strings at this point if you want
log.clear();
}
}
#include <fstream>
int main() {
std::ofstream logfile("dbg.log");
foo(logfile, 100, 10);
}
gcc's output for foo() pretty much optimizes away all the vector overhead. As long as the initial reserve() is big enough, the inner loop is just:
## gcc 5.3 -masm=intel -O3 -march=haswell -std=gnu++11 -fverbose-asm
## The inner loop from the above C++:
.L59:
test rbx, rbx # log // IDK why gcc wants to check for a NULL pointer inside the hot loop, instead of doing it once after reserve() calls new()
je .L6 #,
mov QWORD PTR [rbx], rbp # log_53->x, x // emplace_back the 4 elements
mov DWORD PTR [rbx+48], r12d # log_53->a, a
mov WORD PTR [rbx+62], r15w # log_53->i, outer_counter
mov BYTE PTR [rbx+66], bpl # log_53->c1, x
.L6:
add rbx, 72 # log, // struct size is 72B
mov r8, r13 # D.44727, log
test r12d, r12d # a
je .L58 #, // a != 0
.L4:
add r12d, r12d # a // a <<= 1
movsx rbp, r12d # D.44726, a // x = ...
add rbp, 100 # D.44726, // x = ...
imul rbp, QWORD PTR [rsp+8] # x, %sfp // x = ...
cmp r14, rbx # log$D40277$_M_impl$_M_end_of_storage, log
jne .L59 #, // stay in this tight loop as long as we don't run out of reserved space in the vector
// fall through into code that allocates more space and copies.
// gcc generates pretty lame copy code, using 8B integer loads/stores, not rep movsq. Clang uses AVX to copy 32B at a time
// anyway, that code never runs as long as the reserve is big enough
// I guess std::vector doesn't try to realloc() to avoid the copy if possible (e.g. if the following virtual address region is unused) :/
An attempt to avoid repetitive constructor code:
I tried a version that uses a braced initializer list to avoid having to write a really repetitive constructor, but got much worse code from gcc:
#ifdef USE_CONSTRUCTOR
// much more efficient code with gcc 5.3 -O3.
log.emplace_back(x, a, outer_counter, c1);
#else
// Put the mapping from local var names to struct member names right here in with the loop
log.push_back( (struct loop_log) {
.x = x, .y =0, .z=0, // C99 designated-initializers are a GNU extension to C++,
.ux=0, .uy=0, .uz=0, // but gcc doesn't support leaving elements before the last initialized one uninitialized:
.a = a, .b=0, .c=0, // without all the ...=0, you get "sorry, unimplemented: non-trivial designated initializers not supported"
.t=0, .i = outer_counter, .j=0,
.c1 = (uint8_t)c1
} );
#endif
This unfortunately stores a struct onto the stack and then copies it 8B at a time with code like:
mov rax, QWORD PTR [rsp+72]
mov QWORD PTR [rdx+8], rax // rdx points into the vector's buffer
mov rax, QWORD PTR [rsp+80]
mov QWORD PTR [rdx+16], rax
... // total of 9 loads/stores for a 72B struct
So it will have more impact on the inner loop.
There are a few ways to push_back() a struct into a vector, but using a braced initializer list unfortunately seems to always result in a copy that doesn't get optimized away by gcc 5.3. It would be nice to avoid writing a lot of repetitive constructor code. And with designated initializer lists ({.x = val}), the code inside the loop wouldn't have to care much about the order in which the struct actually stores things. You could just write them in easy-to-read order.
BTW, .x= val C99 designated-initializer syntax is a GNU extension to C++. Also, you can get warnings for forgetting to initialize a member in a braced-list with gcc's -Wextra (which enables -Wmissing-field-initializers).
For more on syntax for initializers, have a look at Brace-enclosed initializer list constructor and the docs for member initialization.
This was a fun but terrible idea:
// Doesn't compile. Worse: hard to read, probably easy to screw up
while (outerloop) {
int64_t x=0, y=1;
struct loop_log {int64_t logx=x, logy=y;}; // loop vars as default initializers
// error: default initializers can't be local vars with automatic storage.
while (innerloop) { x+=y; y+=x; log.emplace_back(loop_log()); }
}
Lower overhead from using a flat array instead of a std::vector
Perhaps trying to get the compiler to optimize away any kind of std::vector operation is less good than just making a big array of structs (static, local, or dynamic) and keeping a count yourself of how many records are valid. std::vector checks whether you've used up the reserved space on every iteration, but you don't need anything like that if there is a fixed upper bound you can use to allocate enough space to never overflow. (Depending on the platform and how you allocate the space, a big chunk of memory that's allocated but never written to isn't really a problem. e.g. on Linux, malloc uses mmap(MAP_ANONYMOUS) for big allocations, and that gives you pages that are all copy-on-write mapped to a zeroed physical page. The OS doesn't need to allocate physical pages until you write to them. The same should apply to a large static array.)
So in your loop, you could just have code like
loop_log *current_record = logbuf;
while(inner_loop) {
int64_t x = ...;
current_record->x = x;
...
current_record->i = (short)outer_counter;
...
// or maybe
// *current_record = { .x = x, .i = (short)outer_counter };
// compilers will probably have an easier time avoiding any copying with a braced initializer list in this case than with vector.push_back
current_record++;
}
size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
// or size_t record_bytes = reinterpret_cast<char*>(current_record) - reinterpret_cast<char*>(logbuf);
logfile.write(reinterpret_cast<const char*>(logbuf), record_bytes);
Scattering the stores throughout the inner loop will require the array pointer to be live all the time, but OTOH doesn't require all the loop variables to be live at the same time. IDK if gcc would optimize an emplace_back to store each variable into the vector once the variable was no longer needed, or if it might spill variables to the stack and then copy them all into the vector in one group of instructions.
Using log[records++].x = ... might lead to the compiler keeping both the array base and the counter in registers, since we'd use the record count in the outer loop. We want the inner loop to be fast and can afford to do the subtraction in the outer loop, so I wrote it with pointer increments to encourage the compiler to use only one register for that piece of state. Besides register pressure, base+index store instructions are less efficient on Intel SnB-family hardware than single-register addressing modes.
You could still use a std::vector for this, but it's hard to get std::vector not to write zeroes into memory it allocates. reserve() just allocates without zeroing, but calling .data() and using the reserved space without telling the vector about it with .resize() kind of defeats the purpose. And of course .resize() will initialize all the new elements. So std::vector is a bad choice for getting your hands on a large allocation without dirtying it.
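A minimal sketch of the flat-array idea, assuming a trivially-constructible record type (so new[] leaves the memory untouched) and a caller-supplied capacity bound; the names are illustrative:

```cpp
#include <cstddef>

struct Record { long long x; int a; }; // trivial: no zeroing on new Record[n]

// Hypothetical logger: one pointer of loop state, no per-iteration size
// check; the caller guarantees logbuf has room for 'iterations' records.
size_t fill_log(Record *logbuf, int iterations)
{
    Record *current = logbuf;
    for (int i = 0; i < iterations; i++) {
        current->x = i * 100LL; // scatter the stores through the loop
        current->a = i;
        ++current;              // a single register tracks the position
    }
    return (current - logbuf) * sizeof(Record); // byte count for write()
}
```

Usage sketch: allocate once with `Record *logbuf = new Record[MAX_RECORDS];`, call fill_log each outer iteration, then `logfile.write(reinterpret_cast<const char*>(logbuf), bytes);`.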
It sounds like what you really want is to look at your program from within a debugger. You haven't specified a platform, but if you build with debug information (-g using gcc or clang) you should be able to step through the loop when starting the program from within the debugger (gdb on linux.) Assuming you are on linux, tell it to break at the beginning of the function (break ) and then run. If you tell the debugger to display all the variables you want to see after each step or breakpoint hit, you'll get to the bottom of your problem in no time.
Regarding performance: unless you do something fancy like set conditional breakpoints or watch memory, running the program through the debugger will not dramatically affect perf as long as the program is not stopped. You may need to turn down the optimization level to get meaningful information though.
I have read some questions about returning more than one value such as What is the reason behind having only one return value in C++ and Java?, Returning multiple values from a C++ function and Why do most programming languages only support returning a single value from a function?.
I agree with most of the arguments used to show that more than one return value is not strictly necessary, and I understand why such a feature hasn't been implemented, but I still can't understand why we can't use multiple caller-saved registers such as ECX and EDX to return such values.
Wouldn't it be faster to use the registers instead of creating a Class/Struct to store those values or passing arguments by reference/pointers, both of which use memory to store them? If it is possible to do such thing, does any C/C++ compiler use this feature to speed up the code?
Edit:
An ideal code would be like this:
(int, int) getTwoValues(void) { return 1, 2; }
int main(int argc, char** argv)
{
// a and b are actually returned in registers
// so future operations with a and b are faster
(int a, int b) = getTwoValues();
// do something with a and b
return 0;
}
Yes, this is sometimes done. If you read the Wikipedia page on x86 calling conventions under cdecl:
There are some variations in the interpretation of cdecl, particularly in how to return values. As a result, x86 programs compiled for different operating system platforms and/or by different compilers can be incompatible, even if they both use the "cdecl" convention and do not call out to the underlying environment. Some compilers return simple data structures with a length of 2 registers or less in the register pair EAX:EDX, and larger structures and class objects requiring special treatment by the exception handler (e.g., a defined constructor, destructor, or assignment) are returned in memory. To pass "in memory", the caller allocates memory and passes a pointer to it as a hidden first parameter; the callee populates the memory and returns the pointer, popping the hidden pointer when returning.
(emphasis mine)
Ultimately, it comes down to calling convention. It's possible for your compiler to optimize your code to use whatever registers it wants, but when your code interacts with other code (like the operating system), it needs to follow the standard calling conventions, which typically uses 1 register for returning values.
Returning on the stack isn't necessarily slower, because once the values are available in the L1 cache (which the top of the stack usually is), accessing them is very fast.
However, most architectures provide at least two registers for returning values up to twice (or more) the word size (edx:eax on x86, rdx:rax on x86_64, $v0 and $v1 on MIPS (Why MIPS assembler has more that one register for return value?), R0:R3 on ARM1, X0:X7 on ARM64...). The ones that don't are mostly microcontrollers with only one accumulator or a very limited number of registers.
1"If the type of value returned is too large to fit in r0 to r3, or whose size cannot be determined statically at compile time, then the caller must allocate space for that value at run time, and pass a pointer to that space in r0."
These registers can also be used for directly returning small structs that fit in two (or more, depending on architecture and ABI) registers.
For example with the following code
struct Point
{
int x, y;
};
struct shortPoint
{
short x, y;
};
struct Point3D
{
int x, y, z;
};
Point P1()
{
Point p;
p.x = 1;
p.y = 2;
return p;
}
Point P2()
{
Point p;
p.x = 1;
p.y = 0;
return p;
}
shortPoint P3()
{
shortPoint p;
p.x = 1;
p.y = 0;
return p;
}
Point3D P4()
{
Point3D p;
p.x = 1;
p.y = 2;
p.z = 3;
return p;
}
Clang emits the following instructions for x86_64 as you can see here
P1(): # #P1()
movabs rax, 8589934593
ret
P2(): # #P2()
mov eax, 1
ret
P3(): # #P3()
mov eax, 1
ret
P4(): # #P4()
movabs rax, 8589934593
mov edx, 3
ret
For ARM64:
P1():
mov x0, 1
orr x0, x0, 8589934592
ret
P2():
mov x0, 1
ret
P3():
mov w0, 1
ret
P4():
mov x1, 1
mov x0, 0
sub sp, sp, #16
bfi x0, x1, 0, 32
mov x1, 2
bfi x0, x1, 32, 32
add sp, sp, 16
mov x1, 3
ret
As you can see, no stack operations are involved. You can switch to other compilers to see that the values are mostly returned in registers.
Return data is put on the stack. Returning a struct by copy is effectively the same as returning multiple values: all of its data members are placed on the stack. If you want multiple return values, that is the simplest way; Lua, for example, handles it exactly like that, wrapping the values in a struct. Why was it never implemented? Probably because you can already do it with a struct, so why add a different mechanism? As for C++, it actually does support multiple return values, in the form of a special tuple-like class, much the way tuple classes are used in Java as well. So in the end it's all the same: either you copy the data raw (a non-pointer/non-reference struct/object) or you copy a pointer to a collection that stores multiple values.
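For what it's worth, the "special class" mentioned above is std::tuple (or std::pair), and with C++17 structured bindings the question's wished-for syntax is nearly literal; small tuples of scalars are typically returned in register pairs under the ABIs discussed above:

```cpp
#include <tuple>

// Sketch of the question's (int, int) getTwoValues() in standard C++.
std::tuple<int, int> getTwoValues()
{
    return {1, 2};
}

// usage (C++17):
//   auto [a, b] = getTwoValues();  // a == 1, b == 2
```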
I'm trying to port a C++ tool to x64 in VS2005. The problem is that the code contains inline assembly, which is not supported by the 64-bit compiler. My question is whether it takes much more effort to rewrite it in plain C++ or to use intrinsics. But in that case not all assembler instructions have intrinsic equivalents on x64, am I right? Say I have a simple program:
#include <stdio.h>
void main()
{
int a = 5;
int b = 3;
int res = 0;
_asm
{
mov eax,a
add eax,b
mov res,eax
}
printf("%d + %d = %d\n", a, b, res);
}
How must I change this code using intrinsics to make it run? I'm new to assembler and do not know most of its instructions.
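For this particular snippet no intrinsic is needed: the _asm block only performs an integer add, which the x64 compiler generates by itself from plain C++. A sketch:

```cpp
// The three instructions (mov eax,a / add eax,b / mov res,eax) collapse
// into a plain expression; the compiler allocates the registers.
inline int AddInt(int a, int b)
{
    return a + b;
}
```

Usage: in main, replace the _asm block with `int res = AddInt(a, b);` (or simply `int res = a + b;`).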
UPDATE:
I made changes to compile assembly with ml64.exe like Hans suggested.
; add.asm
; ASM function called from C++
.code
;---------------------------------------------
AddInt PROC,
a:DWORD, ; receives an integer
b:DWORD ; receives an integer
; Returns: sum of a and b, in EAX.
;----------------------------------------------
mov eax,a
add eax,b
ret
AddInt ENDP
END
main.cpp
#include <stdio.h>
extern "C" int AddInt(int a, int b);
void main()
{
int a = 5;
int b = 3;
int res = AddInt(a,b);
printf("%d + %d = %d\n", a, b, res);
}
but the result is not correct: 5 + 3 = -1717986920. I guess something goes wrong with a pointer. Where did I make a mistake?
Inline assembly isn't supported for 64-bit targets in VC.
Regarding the error in your non-inline code, at first glance the code seems fine. I would look at the assembly code the C++ compiler generates for the call, to see if it matches what the AddInt procedure expects.
Edit: 2 things to note:
Add extern AddInt:proc to your asm code.
I'm not aware of an assembly syntax for a procedure accepting parameters. The parameters are normally extracted via the stack pointer (SP register) according to your calling convention; see more here: http://courses.engr.illinois.edu/ece390/books/labmanual/c-prog-mixing.html
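For the specific -1717986920 result in the question: ml64 PROC parameters are addressed relative to the stack, but the Windows x64 calling convention passes the first two integer arguments in ECX and EDX, and this PROC never stores them to their stack homes, so it reads whatever garbage is in the shadow space. A sketch of a version that reads the registers directly (untested; Windows x64 and the same extern "C" prototype assumed):

```
; add.asm -- sketch for ml64, Windows x64 ABI
.code
AddInt PROC
    lea eax, [rcx + rdx]   ; a arrives in ECX, b in EDX; sum into EAX
    ret
AddInt ENDP
END
```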
I've been trying to use 'thunking' so I can pass member functions to legacy APIs that expect a C function. I'm trying to use a solution similar to this one. This is my thunk structure so far:
struct Thunk
{
byte mov; // ↓
uint value; // mov esp, 'value' <-- replace the return address with 'this' (since this thunk was called with 'call', we can replace the 'pushed' return address with 'this')
byte call; // ↓
int offset; // call 'offset' <-- we want to return here for ESP alignment, so we use call instead of 'jmp'
byte sub; // ↓
byte esp; // ↓
byte num; // sub esp, 4 <-- pop the 'this' pointer from the stack
//perhaps I should use 'ret' here as well/instead?
} __attribute__((packed));
The following code is a test of mine which uses this thunk structure (but it does not yet work):
#include <iostream>
#include <sys/mman.h>
#include <cstdio>
typedef unsigned char byte;
typedef unsigned short ushort;
typedef unsigned int uint;
typedef unsigned long ulong;
#include "thunk.h"
template<typename Target, typename Source>
inline Target brute_cast(const Source s)
{
static_assert(sizeof(Source) == sizeof(Target));
union { Target t; Source s; } u;
u.s = s;
return u.t;
}
void Callback(void (*cb)(int, int))
{
std::cout << "Calling...\n";
cb(34, 71);
std::cout << "Called!\n";
}
struct Test
{
int m_x = 15;
void Hi(int x, int y)
{
printf("X: %d | Y: %d | M: %d\n", x, y, m_x);
}
};
int main(int argc, char * argv[])
{
std::cout << "Begin Execution...\n";
Test test;
Thunk * thunk = static_cast<Thunk*>(mmap(nullptr, sizeof(Thunk),
PROT_EXEC | PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0));
thunk->mov = 0xBC; // mov esp
thunk->value = reinterpret_cast<uint>(&test);
thunk->call = 0xE8; // call
thunk->offset = brute_cast<uint>(&Test::Hi) - reinterpret_cast<uint>(thunk);
thunk->offset -= 10; // Adjust the relative call
thunk->sub = 0x83; // sub
thunk->esp = 0xEC; // esp
thunk->num = 0x04; // 'num'
// Call the function
Callback(reinterpret_cast<void (*)(int, int)>(thunk));
std::cout << "End execution\n";
}
If I use that code; I receive a segmentation fault within the Test::Hi function. The reason is obvious (once you analyze the stack in GDB) but I do not know how to fix this. The stack is not aligned properly.
The x argument contains garbage but the y argument contains the this pointer (see the Thunk code). That means the stack is misaligned by 8 bytes, but I still don't know why this is the case. Can anyone tell why this is happening? x and y should contain 34 and 71 respectively.
NOTE: I'm aware that this does not work in all scenarios (such as MI and VC++'s thiscall convention), but I want to see if I can get this to work, since I would benefit from it a lot!
EDIT: Obviously I also know that I can use static functions, but I see this more as a challenge...
Suppose you have a standalone (non-member, or maybe static) cdecl function:
void Hi_cdecl(int x, int y)
{
    printf("X: %d | Y: %d\n", x, y); // no m_x here: a standalone function has no object
}
Another function calls it this way:
push 71
push 34
push (return-address)
call (address-of-hi)
add esp, 8 (stack cleanup)
You want to replace this by the following:
push 71
push 34
push this
push (return-address)
call (address-of-hi)
add esp, 4 (cleanup of this from stack)
add esp, 8 (stack cleanup)
For this, you have to read the return-address from the stack, push this, and then, push the return-address. And for the cleanup, add 4 (not subtract) to esp.
Regarding the return address - since the thunk must do some cleanup after the callee returns, it must store the original return-address somewhere, and push the return-address of the cleanup part of the thunk. So, where to store the original return-address?
In a global variable - might be an acceptable hack (since you probably don't need your solution to be reentrant)
On the stack - requires moving the whole block of parameters (using a machine-language equivalent of memmove), whose length is pretty much unknown
Please also note that the resulting stack is not 16-byte aligned; this can lead to crashes if the function uses certain types (those that require 8-byte or 16-byte alignment, the SSE ones for example, and maybe double).
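As a portable (if non-reentrant) point of comparison, the "global variable" option mentioned above can be written entirely in C++: park this in a static before handing the legacy API a plain function. A sketch; Hi records its arguments instead of printing so the effect is observable, and all names are illustrative:

```cpp
struct Test
{
    int m_x = 15;
    int last_x = 0, last_y = 0;
    void Hi(int x, int y) { last_x = x; last_y = y; }
};

// Non-reentrant: only one object can be "current" at a time.
static Test *g_current = nullptr;

// Plain function with the exact signature the legacy API expects.
static void HiTrampoline(int x, int y)
{
    g_current->Hi(x, y);
}

// The legacy API from the question, unchanged in spirit.
void Callback(void (*cb)(int, int))
{
    cb(34, 71);
}
```

Usage: set `g_current = &test;`, then `Callback(&HiTrampoline);` delivers (34, 71) to test.Hi without any runtime-generated code.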