I am trying to figure out GCC inline assembly in C++. The following code works in Visual C++ (without the % operands), but I could not make it work with GCC:
void function(const char* text) {
DWORD addr = (DWORD)text;
DWORD fncAddr = 0x004169E0;
asm(
"push %0" "\n"
"call %1" "\n"
"add esp, 04" "\n"
: "=r" (addr) : "d" (fncAddr)
);
}
I am injecting a DLL into a process at runtime, and fncAddr is the address of a function in that process. It never changes. As I said, it works with Visual C++.
VC++ equivalent of that function:
void function(const char* text) {
DWORD addr = (DWORD)text;
DWORD fncAddr = 0x004169E0;
__asm {
push addr
call fncAddr
add esp, 04
}
}
Edit:
I changed my function to this, and now it crashes:
void sendPacket(const char* msg) {
DWORD addr = (DWORD)msg;
DWORD fncAddr = 0x004169E0;
asm(
".intel_syntax noprefix" "\n"
"pusha" "\n"
"push %0" "\n"
"call %1" "\n"
"add esp, 04" "\n"
"popa" "\n"
:
: "r" (addr) , "d"(fncAddr) : "memory"
);
}
Edit:
004169E0 /$ 8B0D B4D38100 MOV ECX,DWORD PTR DS:[81D3B4]
004169E6 |. 85C9 TEST ECX,ECX
004169E8 |. 74 0A JE SHORT client_6.004169F4
004169EA |. 8B4424 04 MOV EAX,DWORD PTR SS:[ESP+4]
004169EE |. 50 PUSH EAX
004169EF |. E8 7C3F0000 CALL client_6.0041A970
004169F4 \> C3 RETN
The function I'm calling is shown above. I changed the call to use a function-pointer cast:
char_func_t func = (char_func_t)0x004169E0;
func(text);
This crashed too, although surprisingly it sometimes works. I attached a debugger, and it reported an access violation at an address that does not exist.
The last call on the call stack is this:
004169EF |. E8 7C3F0000 CALL client_6.0041A970
LAST EDIT:
I gave up on inline assembly; instead, I wrote the instructions I wanted byte by byte, and it works like a charm:
void function(const char* text) {
DWORD fncAddr = 0x004169E0;
char *buff = new char[50]; //extra bytes for no reason
memset((void*)buff, 0x90, 50);
*((BYTE*)buff) = 0x68; // push
*((DWORD*)(buff + 1)) = ((DWORD)text);
*((BYTE*)buff+5) = 0xE8; //call
*((DWORD*)(buff + 6)) = ((DWORD)fncAddr) - ((DWORD)&(buff[5]) + 5);
*((BYTE*)(buff + 10)) = 0x83; // add esp, 04
*((BYTE*)(buff + 11)) = 0xC4;
*((BYTE*)(buff + 12)) = 0x04;
*((BYTE*)(buff + 13)) = 0xC3; // ret
typedef void(*char_func_t)(void);
char_func_t func = (char_func_t)buff;
func();
delete[] buff;
}
Thank you all
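The displacement written at buff + 6 follows the rule for the x86 E8 near call: the 32-bit operand is the target address minus the address of the instruction after the call (the call's own address plus 5). A minimal sketch of that arithmetic, assuming 32-bit addresses; the helper names rel32_for_call and encode_call are hypothetical, and the sample addresses in the note below come from the disassembly listing earlier in the question.

```cpp
#include <array>
#include <cstdint>

// Displacement stored after the E8 opcode: target minus the address of the
// instruction that follows the 5-byte call. Unsigned arithmetic wraps around,
// so backward calls produce the expected two's-complement value.
std::uint32_t rel32_for_call(std::uint32_t call_addr, std::uint32_t target_addr) {
    return target_addr - (call_addr + 5u);
}

// Encode "call target" as the 5 bytes that would sit at call_addr.
// The operand is written least-significant byte first (little-endian).
std::array<std::uint8_t, 5> encode_call(std::uint32_t call_addr, std::uint32_t target_addr) {
    const std::uint32_t rel = rel32_for_call(call_addr, target_addr);
    return {{0xE8,
             static_cast<std::uint8_t>(rel & 0xFF),
             static_cast<std::uint8_t>((rel >> 8) & 0xFF),
             static_cast<std::uint8_t>((rel >> 16) & 0xFF),
             static_cast<std::uint8_t>((rel >> 24) & 0xFF)}};
}
```

For the call at 004169EF to 0041A970 in the listing, rel32_for_call gives 0x3F7C, which matches the encoded bytes E8 7C3F0000 shown there.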
Your current version with pusha / popa looks correct (slow but safe), unless your calling convention depends on maintaining 16-byte stack alignment.
If it's crashing, your real problem is somewhere else, so you should use a debugger and find out where it crashes.
Declaring clobbers on eax / ecx / edx, or asking for the pointers in two of those registers and clobbering the third, would let you avoid pusha / popa. (Or whatever the call-clobbered regs are for the calling convention you're using.)
You should remove the .intel_syntax noprefix. You already depend on compiling with -masm=intel, because you don't restore the previous mode in case it was AT&T. (I don't think there is a way to save/restore the old mode, unfortunately, but there is a dialect-alternatives mechanism for using different templates for different syntax modes.)
You don't need and shouldn't use inline asm for this
Compilers already know how to make function calls when you're using a standard calling convention (in this case, stack args in 32-bit mode, which is normally the default).
It's valid C++ to cast an integer to a function pointer, and it's not even undefined behaviour if there really is a function there at that address.
void function(const char* text) {
typedef void (*char_func_t)(const char *);
char_func_t func = (char_func_t)0x004169E0;
func(text);
}
As a bonus, this compiles more efficiently with MSVC than your asm version, too.
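The same cast can be tried out portably by taking the address of an ordinary function at run time instead of hard-coding 0x004169E0. This is only a sketch: target_function, call_at_address, and demo_call_at_address are hypothetical names, and converting a function pointer to std::uintptr_t and back is assumed to be supported, as it is on the mainstream x86 compilers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the function that lives at a fixed address in the target
// process (hypothetical; it just records the length of its argument).
static std::size_t g_last_length = 0;
void target_function(const char* text) { g_last_length = std::strlen(text); }

typedef void (*char_func_t)(const char*);

// Call the function located at a raw integer address.
void call_at_address(std::uintptr_t address, const char* text) {
    char_func_t func = reinterpret_cast<char_func_t>(address);
    func(text);
}

// Obtain a real address at run time instead of hard-coding one, then call it.
std::size_t demo_call_at_address() {
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(&target_function);
    call_at_address(address, "hello world");
    return g_last_length;
}
```

The compiler handles argument passing and stack cleanup for the call, so no inline asm is involved.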
You can use GCC function attributes on function pointers to specify the calling convention explicitly, in case you compile with a different default. For example __attribute__((cdecl)) to explicitly specify stack args and caller-pops for calls using that function pointer. The MSVC equivalent is just __cdecl.
#ifdef __GNUC__
#define CDECL __attribute__((cdecl))
#define STDCALL __attribute__((stdcall))
#elif defined(_MSC_VER)
#define CDECL __cdecl
#define STDCALL __stdcall
#else
#define CDECL /*empty*/
#define STDCALL /*empty*/
#endif
// With STDCALL instead of CDECL, this function has to translate from one calling convention to another
// so it can't compile to just a jmp tailcall
void function(const char* text) {
typedef void (CDECL *char_func_t)(const char *);
char_func_t func = (char_func_t)0x004169E0;
func(text);
}
To see the compiler's asm output, I put this on the Godbolt compiler explorer. I used the "intel-syntax" option, so gcc output comes from gcc -S -masm=intel
# gcc8.1 -O3 -m32 (the 32-bit Linux calling convention is close enough to Windows)
# except it requires maintaining 16-byte stack alignment.
function(char const*):
mov eax, 4286944
jmp eax # tail-call with the args still where we got them
This test caller makes the compiler set up args and not just a tail-call, but function can inline into it.
int caller() {
function("hello world");
return 0;
}
.LC0:
.string "hello world"
caller():
sub esp, 24 # reserve way more stack than it needs to reach 16-byte alignment, IDK why.
mov eax, 4286944 # your function pointer
push OFFSET FLAT:.LC0 # addr becomes an immediate
call eax
xor eax, eax # return 0
add esp, 28 # add esp, 4 folded into this
ret
MSVC's -Ox output for caller is essentially the same:
caller PROC
push OFFSET $SG2661
mov eax, 4286944 ; 004169e0H
call eax
add esp, 4
xor eax, eax
ret 0
But a version using your inline asm is much worse:
;; MSVC -Ox on a caller() that uses your asm implementation of function()
caller_asm PROC
push ebp
mov ebp, esp
sub esp, 8
; store inline asm inputs to the stack
mov DWORD PTR _addr$2[ebp], OFFSET $SG2671
mov DWORD PTR _fncAddr$1[ebp], 4286944 ; 004169e0H
push DWORD PTR _addr$2[ebp] ; then reload as memory operands
call DWORD PTR _fncAddr$1[ebp]
add esp, 4
xor eax, eax
mov esp, ebp ; makes the add esp,4 redundant in this case
pop ebp
ret 0
MSVC inline asm syntax basically sucks, because unlike GNU C asm syntax the inputs always have to be in memory, not registers or immediates. So you could do better with GNU C, but not as good as you can do by avoiding inline asm altogether. https://gcc.gnu.org/wiki/DontUseInlineAsm.
Making function calls from inline asm is generally to be avoided; it's much safer and more efficient when the compiler knows what's happening.
Here's an example of inline assembly with gcc.
Routine "vazio" hosts the assembly code for routine "rotina" (vazio and rotina are simply labels). Note the use of Intel syntax by means of a directive; gcc defaults to AT&T.
I recovered this code from an old sub-directory; variables in the assembly code were prefixed with "_", as in "_str", which is the standard C convention. I confess that, here and now, I have no idea as to why the compiler is accepting "str" instead. Anyway:
It compiled correctly with gcc/g++ versions 5 and 7! Hope this helps. Simply call "gcc main.c", or "gcc -S main.c" if you want to see the asm result, and "gcc -S -masm=intel main.c" for Intel output.
#include <stdio.h>
char str[] = "abcdefg";
// C routine, acts as a container for "rotina"
void vazio (void) {
asm(".intel_syntax noprefix");
asm("rotina:");
asm("inc eax");
// EBX = address of str
asm("lea ebx, str");
// ++str[0]
asm("inc byte ptr [ebx]");
asm("ret");
asm(".att_syntax noprefix");
}
// global variables make things simpler
int a;
int main(void) {
a = -7;
puts ("antes");
puts (str);
printf("a = %d\n\n", a);
asm(".intel_syntax noprefix");
asm("mov eax, 0");
asm("call rotina");
// modify variable a
asm("mov a, eax");
asm(".att_syntax noprefix");
printf("depois: \n a = %d\n", a);
puts (str);
return 0;
}
Related
When I run the following program, it always prints "yes". However, when I change SOME_CONSTANT to -2, it always prints "no". Why is that? I am using the Visual Studio 2019 compiler with optimizations disabled.
#define SOME_CONSTANT -3
void func() {
static int i = 2;
int j = SOME_CONSTANT;
i += j;
}
void main() {
if (((bool(*)())func)()) {
printf("yes\n");
}
else {
printf("no\n");
}
}
EDIT: Here is the output assembly of func (IDA Pro 7.2):
sub rsp, 18h
mov [rsp+18h+var_18], 0FFFFFFFEh
mov eax, [rsp+18h+var_18]
mov ecx, cs:i
add ecx, eax
mov eax, ecx
mov cs:i, eax
add rsp, 18h
retn
Here is the first part of main:
sub rsp, 628h
mov rax, cs:__security_cookie
xor rax, rsp
mov [rsp+628h+var_18], rax
call ?func@@YAXXZ ; func(void)
test eax, eax
jz short loc_1400012B0
Here is main decompiled:
int __cdecl main(int argc, const char **argv, const char **envp)
{
int v3; // eax
func();
if ( v3 )
printf("yes\n");
else
printf("no\n");
return 0;
}
((bool(*)())func)()
This expression takes a pointer to func, casts the pointer to a different type of function, then invokes it. Invoking a function through a pointer-to-function whose function signature does not match the original function is undefined behavior which means that anything at all might happen. From the moment this function call happens, the behavior of the program cannot be reasoned about. You cannot predict what will happen with any certainty. Behavior might be different on different optimization levels, different compilers, different versions of the same compiler, or when targeting different architectures.
This is simply because the compiler is allowed to assume that you won't do this. When the compiler's assumptions and reality come into conflict, the result is a vacuum into which the compiler can insert whatever it likes.
The simple answer to your question "why is that?" is, quite simply: because it can. But tomorrow it might do something else.
What apparently happened is:
mov ecx, cs:i
add ecx, eax
mov eax, ecx ; <- final value of i is stored in eax
mov cs:i, eax ; and then also stored in i itself
Different registers could have been used, it just happened to work this way. There is nothing about the code that forces eax to be chosen. That mov eax, ecx is really redundant, ecx could have been stored straight to i. But it happened to work this way.
And in main:
call ?func@@YAXXZ ; func(void)
test eax, eax
jz short loc_1400012B0
rax (or part of it, like eax or al) is used for the return value for integer-ish types (such as booleans) in the WIN64 ABI, so that makes sense. That means the final value of i happens to be used as the return value, by accident.
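That reading is consistent with the observed output: 2 + (-3) = -1 is non-zero ("yes"), while 2 + (-2) = 0 ("no"). A minimal sketch of a well-defined version, assuming that "i is non-zero after the addition" is what the caller wanted to test; add_and_test and first_call_result are hypothetical names, not part of the original program.

```cpp
// Well-defined variant: the function really returns a bool, so the caller
// no longer depends on whatever happens to be left in eax.
bool add_and_test(int& i, int delta) {
    i += delta;
    return i != 0;
}

// Mirrors the first call in the original program, where the static i starts at 2.
bool first_call_result(int some_constant) {
    int i = 2;
    return add_and_test(i, some_constant);
}
```

Because the return type now matches the call, the result no longer depends on register allocation, optimization level, or compiler.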
I always get "no" printed, so the result varies from compiler to compiler; hence the best answer is UB (undefined behavior).
Background
I have a VS2013 solution containing many projects and numerous sources.
In my sources, I use the same macro thousands of times in different locations in the sources.
Something like:
#define MyMacro(X) X
where X is const char*
I have a DLL project that, with the above macro definition, results in an 800KB output DLL.
Problem
In some scenarios or modes, I wish to change my macro definition to the following:
#define MyMacro(X) Utils::MyFunc(X)
This change had a very unpleasant side effect: the DLL output file size increased by 100KB.
Notes
Utils::MyFunc() is used for the first time. So, naturally, I expect the binary to grow (a little), since new code is introduced.
Utils::MyFunc() does not include large header or libs.
Utils::MyFunc() does allocate string object.
All projects are compiled using definitions to favor small code.
Artificial example
#define M1(X) X
#define M2(X) ReturnString1(X)
#define M3(X) ReturnString2(X)
string ReturnString1(const char* c)
{
return string(c);
}
string ReturnString2(const string& s)
{
return string(s);
}
int _tmain(int argc, _TCHAR* argv[])
{
M3("TEST");
M3("TEST");
.
. // 5000 times
.
M3("TEST");
return 1;
}
In the above example, I've generated a small EXE project to try to mimic the problem I'm facing.
Using M1 exclusively in _tmain - compilation was instantaneous and output file was 88KB EXE.
Using M2 exclusively in _tmain - compilation took minutes and output file was 239KB EXE.
Using M3 exclusively in _tmain - compilation took a lot longer and output file was 587KB EXE.
I used IDA to compare between the binaries and extracted the function names from the binaries.
In M2 & M3, I see a lot more of the following functions than I see in M1:
... $basic_string@DU?$char_traits@D@std@@V?$allocator@...
I'm not too surprised about it since in M2 & M3 I'm allocating a string object.
But is it enough to justify a 151KB & 499KB increase?
Question
Is it expected from string allocation to have such a substantial impact on the output file size?
Here is another "artificial" example:
int main()
{
const char* p = M1("TEST");
std::cout << p;
string s = M3("TEST");
std::cout << s;
return 1;
}
I have commented one section at a time and looked at the generated ASM. For the M1 macro, I got:
012B1000 mov ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (012B204Ch)]
012B1006 call std::operator<<<std::char_traits<char> > (012B1020h)
012B100B mov eax,1
While for M3:
00DC1068 push 4
00DC106A push ecx
00DC106B lea ecx,[ebp-40h]
00DC106E mov dword ptr [ebp-2Ch],0Fh
00DC1075 mov dword ptr [ebp-30h],0
00DC107C mov byte ptr [ebp-40h],0
00DC1080 call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign (0DC1820h)
00DC1085 lea edx,[ebp-40h]
00DC1088 mov dword ptr [ebp-4],0
00DC108F lea ecx,[s]
00DC1092 call ReturnString2 (0DC1000h)
00DC1097 mov byte ptr [ebp-4],2
00DC109B mov eax,dword ptr [ebp-2Ch]
00DC109E cmp eax,10h
00DC10A1 jb main+6Dh (0DC10ADh)
00DC10A3 inc eax
00DC10A4 push eax
00DC10A5 push dword ptr [ebp-40h]
00DC10A8 call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10AD mov ecx,dword ptr [_imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (0DC3050h)]
00DC10B3 lea edx,[s]
00DC10B6 mov dword ptr [ebp-2Ch],0Fh
00DC10BD mov dword ptr [ebp-30h],0
00DC10C4 mov byte ptr [ebp-40h],0
00DC10C8 call std::operator<<<char,std::char_traits<char>,std::allocator<char> > (0DC1100h)
00DC10CD mov eax,dword ptr [ebp-14h]
00DC10D0 cmp eax,10h
00DC10D3 jb main+9Fh (0DC10DFh)
00DC10D5 inc eax
00DC10D6 push eax
00DC10D7 push dword ptr [s]
00DC10DA call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10DF mov eax,1
Looking at the first column (addresses), the M1 code size is 12 bytes, while the M3 code size is 119 bytes.
I will leave it as an exercise for the reader to figure out the difference between 5,000 * 12 and 5,000 * 119 :)
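For reference, that arithmetic can be written out as a small sketch. The per-call-site sizes are the figures quoted in this answer, and extra_code_bytes is a hypothetical helper name.

```cpp
#include <cstdint>

// Extra code emitted when every call site grows from small_site_bytes to
// large_site_bytes of machine code.
constexpr std::int64_t extra_code_bytes(std::int64_t call_sites,
                                        std::int64_t small_site_bytes,
                                        std::int64_t large_site_bytes) {
    return call_sites * (large_site_bytes - small_site_bytes);
}

// 5,000 call sites at 119 bytes instead of 12 bytes each.
static_assert(extra_code_bytes(5000, 12, 119) == 535000,
              "535,000 extra bytes of code");
```

535,000 bytes is roughly 522 KB, the same order of magnitude as the 499KB increase reported in the question for M3.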
Let's take two cases in a simple example:
int _tmain()
{
"TEST";
std::string("TEST");
}
The first statement has no effect and is trivially optimized away.
The second statement constructs a string, which requires a function call. But what function is called? Maybe it's the string constructor, but if that's inlined, it might actually be that malloc(), strlen(), and memcpy() are called directly from main (not explicitly, but those three functions might plausibly be used by a string constructor which could be inline).
Now if you have this:
std::string("TEST");
std::string("TEST");
std::string("TEST");
You can see it's not 3 function calls, but 9 (in our hypothetical). You could get it back to 3 if you make sure the function you're calling is not inline (either using __declspec(noinline) or by defining it in a separate translation unit, aka .cpp file).
You may find that enabling full optimizations (Release build) lets the compiler figure out that these strings are never used, and get rid of them. Maybe.
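A minimal sketch of the non-inline wrapper idea. The NOINLINE macro and the MakeString name are hypothetical, and whether this actually shrinks the binary would have to be measured: the temporary's destructor may still be expanded at each call site.

```cpp
#include <string>

// Portable spelling of "do not inline this function" (hypothetical macro).
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE /* empty: rely on the compiler's own heuristics */
#endif

// All string construction happens in this one out-of-line function, so each
// call site only needs to pass the pointer and emit a call.
NOINLINE std::string MakeString(const char* c) {
    return std::string(c);
}
```

A macro such as #define M2(X) MakeString(X) would then route every use through this single function.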
I have a problem with inline asm in C++. I'm trying to implement a fast strlen, but it is not working: when I use the __declspec(naked) keyword, the debugger shows the address of input as 0x000000; when I don't use that keyword, eax points to some garbage and the function returns various values.
Here's the code:
int fastStrlen(char *input) // I know that function does not calculate strlen
{ // properly, but I just want to know why it crashes
_asm // access violation when I try to write to variable x
{
mov ecx, dword ptr input
xor eax, eax
start:
mov bx, [ecx]
cmp bl, '\0'
je Sxend
inc eax
cmp bh, '\0'
je Sxend
inc eax
add ecx, 2
jmp start
Sxend:
ret
}
}
int _tmain(int argc, _TCHAR* argv[])
{
char* test = "test";
int x = fastStrlen(test);
cout << x;
return 0;
}
Can anybody point out what I am doing wrong?
Don't use __declspec(naked): in that case the compiler doesn't generate prologue and epilogue instructions, and you would need to write a prologue exactly as the compiler expects in order to access the argument of fastStrlen. Since you don't know what the compiler expects, you should just let it generate the prologue.
This also means you can't just use ret to return to the caller, because then you're supplying your own epilogue. Since you don't know what prologue the compiler used, you don't know what epilogue you need to implement to reverse it. Instead, assign the return value to a C variable that you declare inside the function before the inline assembly statement, and return that variable with a normal C return statement. For example:
int fastStrlen(char *input)
{
int retval;
_asm
{
mov ecx, dword ptr input
...
Sxend:
mov retval,eax
}
return retval;
}
As noted in your comments your code will not be able to improve on the strlen implementation in your compiler's runtime library. It also reads past the end of strings of even lengths, which will cause a memory fault if the byte past the end of a string isn't mapped into memory.
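To illustrate the over-read, here is a plain C++ sketch of the same two-characters-per-iteration loop that only reads the second byte of a pair after the first has been checked. The name pairwise_strlen is hypothetical, and this is not expected to beat the library strlen.

```cpp
#include <cstddef>

// Counts characters two at a time, like the asm loop, but never touches the
// byte after the terminator: the second byte of each pair is read only once
// the first byte is known to be non-zero.
std::size_t pairwise_strlen(const char* input) {
    std::size_t length = 0;
    for (;;) {
        if (input[length] == '\0') {
            return length;          // terminator is the first byte of the pair
        }
        if (input[length + 1] == '\0') {
            return length + 1;      // terminator is the second byte of the pair
        }
        length += 2;
    }
}
```

The asm version loads both bytes with a single 16-bit mov, so for "test" its final load covers the terminator and the byte after it.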
I'm using MASM and Visual C++, and I'm compiling in x64. This is my C++ code:
// include directive
#include "stdafx.h"
// external functions
extern "C" int Asm();
// main function
int main()
{
// call asm
Asm();
// get char, return success
_getch();
return EXIT_SUCCESS;
}
and my assembly code:
extern Sleep : proc
; code segment
.code
; assembly procedure
Asm proc
; sleep for 1 second
mov ecx, 1000 ; ecx = sleep time
sub rsp, 8 ; 8 bytes of shadow space
call Sleep ; call sleep
add rsp, 8 ; get rid of shadow space
; return
ret
Asm endp
end
Using breakpoints, I've pinpointed the line of code where the access violation occurs: right after the ret statement in my assembly code.
Extra info:
I'm using the fastcall convention to pass my parameters into Sleep (even though it is declared as stdcall), because from what I have read, x64 will always use the fastcall convention.
My Asm procedure compiles and executes with no errors when I get rid of the Sleep related code.
Even when I try to call Sleep with the stdcall convention, I still get an access violation error.
So obviously, my question is, how do I get rid of the access violation error, what am I doing wrong?
Edit:
This is the generated assembly for Sleep(500); in C++:
mov ecx,1F4h
call qword ptr [__imp_Sleep (13F54B308h)]
This generated assembly is confusing me... it looks like fastcall because it moves the parameter into ecx, but at the same time it doesn't create any shadow space. And I have no clue what this means: qword ptr [__imp_Sleep (13F54B308h)].
And again, edit, the full disassembly for main.
int main()
{
000000013F991020 push rdi
000000013F991022 sub rsp,20h
000000013F991026 mov rdi,rsp
000000013F991029 mov ecx,8
000000013F99102E mov eax,0CCCCCCCCh
000000013F991033 rep stos dword ptr [rdi]
Sleep(500); // this here is the asm generated by the compiler!
000000013F991035 mov ecx,1F4h
000000013F99103A call qword ptr [__imp_Sleep (13F99B308h)]
// call asm
Asm();
000000013F991040 call @ILT+5(Asm) (13F99100Ah)
// get char, return success
_getch();
000000013F991045 call qword ptr [__imp__getch (13F99B540h)]
return EXIT_SUCCESS;
000000013F99104B xor eax,eax
}
If Asm() were a normal C/C++ function, e.g.:
void Asm()
{
Sleep(1000);
}
The following is what my x64 compiler generates for it:
Asm proc
push rbp ; re-aligns the stack to a 16-byte boundary (CALL pushed 8 bytes for the caller's return address) as well as prepares for setting up a stack frame
sub rsp, 32 ; 32 bytes of shadow space
mov rbp, rsp ; finalizes the stack frame using the current stack pointer
; sleep for 1 second
mov ecx, 1000 ; ecx = sleep time
call Sleep ; call sleep
lea rsp, [rbp+32] ; get rid of shadow space
pop rbp ; clears the stack frame and sets the stack pointer back to the location of the caller's return address
ret ; return to caller
Asm endp
MSDN says:
The caller is responsible for allocating space for parameters to the callee, and must always allocate sufficient space for the 4 register parameters, even if the callee doesn’t have that many parameters.
Have a look at the following page for more information about how x64 uses the stack:
Stack Allocation
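The rule quoted above, combined with the 16-byte stack alignment that the Windows x64 convention requires at each call, determines how much to subtract from rsp. A rough sketch of that calculation follows; win64_frame_bytes is a hypothetical helper, and it assumes rsp is 8 bytes off a 16-byte boundary on entry because of the pushed return address.

```cpp
// Bytes to subtract from rsp in a function that calls other functions under
// the Windows x64 convention.
//   pushed_bytes: bytes pushed in the prologue before the subtraction
//   local_bytes:  space wanted for local variables
unsigned win64_frame_bytes(unsigned pushed_bytes, unsigned local_bytes) {
    const unsigned shadow = 32;                  // home space for 4 register parameters
    const unsigned needed = shadow + local_bytes;
    // On entry the stack is misaligned by the 8-byte return address.
    const unsigned misalign = (8 + pushed_bytes + needed) % 16;
    return needed + (misalign ? 16 - misalign : 0);
}
```

With no pushes and no locals this gives 40 (sub rsp, 28h); after a push rbp it gives 32, as in the listing above. The sub rsp, 8 in the question keeps the stack aligned but reserves no shadow space, so the callee's 32-byte home area overlaps Asm's own return address.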
I'm in a situation where I have to mock up a _stdcall function using C++ and inline ASM, but which uses a variable number of arguments. Normally it wouldn't know how many arguments to pop from the stack when it returns control to its parent, so wouldn't work, but I'm hoping to tell it via a global variable how many params it should have and then get it to pop them off like that.
Is that actually possible? If so, can someone start me off in the right direction? I'm specifically stuck with the epilog code I would need.
My objective is to make a function which can be used as a callback for any function that requires one (like EnumWindows), so long as the user tells it at runtime how long the args list has to be. The idea is for it to integrate with some code elsewhere so it basically runs a trigger each time the callback is called and provides a link to a place where the variables that were returned can be read and viewed by the user.
Does that make sense?
Doesn't make sense. __stdcall doesn't allow variadic parameters, as the total size of all parameters is decorated into the function name (from msdn):
Name-decoration convention
An underscore (_) is prefixed to the name. The name is followed by the at sign (@) followed by the number of bytes (in decimal) in the argument list. Therefore, the function declared as int func( int a, double b ) is decorated as follows: _func@12
This quote tells you how variadic __stdcall functions are implemented:
The __stdcall calling convention is used to call Win32 API functions. The callee cleans the stack, so the compiler makes vararg functions __cdecl. Functions that use this calling convention require a function prototype.
(emphasis mine)
So, there are no __stdcall functions with variadic parameters, they silently get changed to __cdecl. :)
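The decoration rule can be sketched as a small helper. stdcall_decorated_name is a hypothetical function for illustration only, and it assumes each argument occupies a multiple of 4 bytes on the 32-bit stack.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>

// Builds the 32-bit __stdcall decorated name from the undecorated name and
// the sizes of the arguments in bytes.
std::string stdcall_decorated_name(const std::string& name,
                                   std::initializer_list<std::size_t> arg_sizes) {
    std::size_t total = 0;
    for (std::size_t size : arg_sizes) {
        total += (size + 3) / 4 * 4;  // round each argument up to a 4-byte stack slot
    }
    return "_" + name + "@" + std::to_string(total);
}
```

For int func(int a, double b), the argument sizes 4 and 8 give the suffix 12. A variadic function has no fixed total to encode, which is why it cannot be __stdcall.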
You can do something like the following (hacked up code):
static int NumberOfParameters = 0;
__declspec(naked) void GenericCallback()
{
// prologue
__asm push ebp
__asm mov ebp, esp
// TODO: do something with parameters on stack
// manual stack unwinding for 2 parameters
// obviously you would adjust for the appropriate number of parameters
// (e.g. NumberOfParameters) instead of hard-coding it for 2
// fixup frame pointer
__asm mov eax, [ebp + 0]
__asm mov [ebp + 8], eax // NumberOfParameters * 4 (assuming dword-sized parameters)
// fixup return address
__asm mov eax, [ebp + 4]
__asm mov [ebp + 12], eax // (NumberOfParameters + 1) * 4
// return TRUE
__asm mov eax, 1
// epilogue
__asm mov esp, ebp
__asm pop ebp
// fixup stack pointer
__asm add esp, 8 // NumberOfParameters * 4
__asm ret 0
}
int main(int argc, _TCHAR* argv[])
{
NumberOfParameters = 2;
EnumWindows((WNDENUMPROC)GenericCallback, NULL);
return 0;
}
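The hard-coded constants in GenericCallback follow from the parameter count, as its comments indicate. A small sketch of that arithmetic, assuming dword-sized parameters; CallbackFixup and fixup_offsets are hypothetical names, and the sketch only computes the offsets without touching the stack.

```cpp
// Offsets used by the manual epilogue, for a callback taking n dword parameters.
struct CallbackFixup {
    unsigned saved_ebp_dest;    // [ebp + ...] slot that receives the saved frame pointer
    unsigned return_addr_dest;  // [ebp + ...] slot that receives the return address
    unsigned esp_adjust;        // bytes added to esp before ret
};

CallbackFixup fixup_offsets(unsigned n) {
    CallbackFixup fixup;
    fixup.saved_ebp_dest = n * 4;
    fixup.return_addr_dest = (n + 1) * 4;
    fixup.esp_adjust = n * 4;
    return fixup;
}
```

For the two parameters of an EnumWindows callback this yields 8, 12, and 8, the values hard-coded in the listing.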