I am currently working on using some ASM in C/C++.
I have the following:
__declspec(naked) unsigned long
someFunction( unsigned long inputDWord )
{
__asm
{
}
}
How, in asm, would I return the unsigned long?
Do I need to push something onto the stack and then call ret?
I haven't used asm in a long time, and never inside C++ before.
Thanks!
EDIT: Thanks to @Matteo Italia, I've corrected the usage of ret.
Put the return value in the eax register; both the __cdecl and __stdcall conventions return integer values there.
Then, depending on the calling convention, you should use the appropriate variant of the ret instruction:
With the __cdecl convention (or similar), use a plain ret. At machine level this pops the return address off the stack and jumps to it. The caller is responsible for removing all the function parameters from the stack.
With the __stdcall convention (or similar), use ret X, where X is the total size in bytes of all the function arguments. Here the callee removes its own arguments from the stack.
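A minimal sketch of both variants, assuming 32-bit MSVC (naked functions and __asm blocks are not available in x64 builds). The function names are hypothetical, and the plain C++ fallback is there only so that the snippet still builds with other compilers.

```cpp
#if defined(_MSC_VER) && defined(_M_IX86)

// __cdecl: the caller removes the argument, so a plain ret is enough.
__declspec(naked) unsigned long __cdecl returnInputCdecl(unsigned long inputDWord)
{
    __asm
    {
        mov eax, [esp + 4]  // the first argument sits just above the return address
        ret                 // pop the return address and jump to it
    }
}

// __stdcall: the callee removes the argument, so ret also releases 4 bytes.
__declspec(naked) unsigned long __stdcall returnInputStdcall(unsigned long inputDWord)
{
    __asm
    {
        mov eax, [esp + 4]
        ret 4               // 4 = size in bytes of the single argument
    }
}

#else

// Fallback (assumption: not 32-bit MSVC): the compiler emits the equivalent code.
unsigned long returnInputCdecl(unsigned long inputDWord) { return inputDWord; }
unsigned long returnInputStdcall(unsigned long inputDWord) { return inputDWord; }

#endif
```

Because a naked function has no prologue, the argument is read relative to esp rather than by name.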
I'm trying to mount a kind of timing attack on a Java Card. I need a way to measure the time elapsed between sending a command and getting the answer. I'm using the winscard.h interface, and the language is C++. I created a wrapper around the winscard.h interface to make my work easier. For example, to send an APDU I'm now using this code, which seems to work.
Based on this answer I updated my code
byte pbRecvBuffer[258];
long rv;
if (this->sessionHandle >= this->internal.vSessions.size())
throw new SmartCardException("There is no card inserted");
SCARD_IO_REQUEST pioRecvPci;
pioRecvPci.dwProtocol = (this->internal.vSessions)[sessionHandle].dwActiveProtocol;
pioRecvPci.cbPciLength = sizeof(pioRecvPci);
LPSCARD_IO_REQUEST pioSendPci;
if ((this->internal.vSessions)[sessionHandle].dwActiveProtocol == SCARD_PROTOCOL_T1)
pioSendPci = (LPSCARD_IO_REQUEST)SCARD_PCI_T1;
else
pioSendPci = (LPSCARD_IO_REQUEST)SCARD_PCI_T0;
word expected_length = 258;//apdu.getExpectedLen();
word send_length = apdu.getApduLength();
CardSession session = (this->internal.vSessions).operator[](sessionHandle);
byte * data = const_cast<Apdu&>(apdu).getNonConstantData();
auto start = Timer::now();
rv = SCardTransmit(session.hCard, pioSendPci,data,
send_length, &pioRecvPci, pbRecvBuffer,&expected_length);
auto end = Timer::now();
auto duration = (float)(end - start) / Timer::ticks();
return *new ApduResponse(pbRecvBuffer, expected_length,duration);
class Timer
{
public:
static inline int ticks()
{
LARGE_INTEGER ticks;
QueryPerformanceFrequency(&ticks);
return ticks.LowPart;
}
static inline __int64 now()
{
struct { __int32 low, high; } counter;
__asm cpuid
__asm push EDX
__asm rdtsc
__asm mov counter.low, EAX
__asm mov counter.high, EDX
__asm pop EDX
__asm pop EAX
return *(__int64 *)(&counter);
}
};
My code fails with the error "The value of ESP was not properly saved across a function call. This is usually a result of calling a function declared with one calling convention with a function pointer declared with a different calling convention." My guess is that the rdtsc instruction is not supported by my Intel processor. I have an Intel Broadwell 5500U.
I'm looking for a proper way to do this kind of measurement and eventually get responses with better accuracy.
The error message that you provided
The value of ESP was not properly saved across a function call. This
is usually a result of calling a function declared with one calling
convention with a function pointer declared with a different calling
convention.
indicates a mistake in the inline assembly function that you call. Assuming that the default calling convention is used when calling it, the function is fundamentally flawed: cpuid overwrites ebx, which is a callee-saved register. Furthermore, the code pushes only one value onto the stack but pops two; the second pop most likely removes the function's return address, or the base pointer saved as part of the stack frame. As a result, the function fails when it executes ret, since it no longer has a valid address to return to, or the runtime check detects that esp no longer matches the value it had at the beginning of the function.
This has nothing to do with the CPU that you're using: every x86 CPU since the original Pentium supports RDTSC. The clock that it counts may, however, behave differently depending on the CPU model and its current speed state, which is why using the instruction directly is discouraged. OS facilities should be favoured over it, as they compensate for the different implementations of the instruction on various steppings.
Seeing how you're using C++11 - judging by the use of auto - use std::chrono for measuring time intervals. If that doesn't work for some reason, use the facilities provided by your OS (this looks like Windows, so QueryPerformanceCounter is probably the one to use). If this still doesn't satisfy you, you can just generate the rdtsc by using the __rdtsc intrinsic function and not worry about inline assembly.
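A minimal sketch of the std::chrono route. The helper name secondsElapsed and the fakeTransmit workload are hypothetical; fakeTransmit merely stands in for the SCardTransmit call from the question.

```cpp
#include <chrono>

// Hypothetical stand-in for the SCardTransmit call being timed.
inline void fakeTransmit()
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000; ++i)
        sink = sink + i;
}

// Runs the operation once and returns the elapsed time in seconds.
// steady_clock is monotonic, so the difference can never be negative.
template <typename Operation>
double secondsElapsed(Operation operation)
{
    const auto start = std::chrono::steady_clock::now();
    operation();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

In the wrapper from the question, the SCardTransmit call would go inside a lambda passed to secondsElapsed, replacing the two Timer::now() calls.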
I'm working with a proprietary MCU that has a built-in library in metal (mask ROM). The compiler I'm using is clang, which uses GCC-like inline ASM. The issue I'm running into is calling the library, since the library does not have a consistent calling convention. While I found a solution, I've found that in some cases the compiler makes optimizations that clobber registers immediately before the call, so I think there is something wrong with how I'm doing things. Here is the code I'm using:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))(); //MASKROM_EchoByte is a 16-bit integer with the memory location of the function
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))();
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
}
This seems really ugly to me, and it feels like there should be a better way. I'm also worried that the compiler may do some nonsense in between, since the call itself gives no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work, as it follows the C convention, which puts arguments into R2+ (R1 is reserved as a scratch register).
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. In ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
push {lr}
mov r1, r0 // move our first parameter into r1
bl _MASKROM_EchoByte
pop {pc}
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
But if you really, really, really must do this via inline assembly in a C function, then (as @JasonD pointed out) all you need to do is add the link register lr to the clobber list.
void EchoByte(int hex)
{
register int r1 asm("r1") = hex;
asm volatile(
"bl _MASKROM_EchoByte"
:
: "r"(r1)
: "r1", "lr" // Compare the codegen with and without this "lr"!
);
}
I want to print the return value in my tracer. I have two questions:
How do I get the return address?
Is the return position updated before or after ~Tracer() runs?
struct Tracer
{
int* _retval;
~Tracer()
{ printf("return value is %d", *_retval); }
};
int foo()
{
Tracer __tracter = { __Question_1_how_to_get_return_address_here__ };
if(cond) {
return 0;
} else {
return 99;
}
//Question-2:
// return position is updated before OR after ~Tracer() called ???
}
I found some hints for Question 1; I'm checking the VC code now:
For gcc, __builtin_return_address
http://gcc.gnu.org/onlinedocs/gcc/Return-Address.html
For Visual C++, _ReturnAddress
You can't portably or reliably do this in C++. The return value may be in memory or in a register and may or may not be indirected in different cases.
You could probably use inline assembly to make something work on certain hardware/compilers.
One possible way is to make your Tracer a template that takes a reference to a return value variable (when appropriate) and prints that out before destructing.
Also note that identifiers with __ (double underscore) are reserved for the implementation.
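A rough sketch of the template idea, assuming the function can route every return through one local variable. The names ReturnTracer and result are hypothetical, and the tracer prints in its destructor so that it sees the final value.

```cpp
#include <iostream>

// Prints the referenced variable when the enclosing scope is left.
template <typename T>
struct ReturnTracer
{
    const T& result;
    explicit ReturnTracer(const T& r) : result(r) {}
    ~ReturnTracer() { std::cout << "return value is " << result << '\n'; }
};

int foo(bool cond)
{
    int result = 0;                    // declared first, so it outlives the tracer
    ReturnTracer<int> tracer(result);
    if (cond) {
        result = 0;
    } else {
        result = 99;
    }
    return result;                     // copied out before ~ReturnTracer runs
}
```

This also touches on the second question: the returned value is initialized from the return expression before the destructors of local objects run, so the tracer observes the same value that the caller receives.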
Your question is rather confusing: you're using the terms "address" and "value" as if they were interchangeable, and they are not.
The return value is what the function spits out. On x86(_64) it comes in the form of a 4/8-byte value in E/RAX, or in EDX:EAX, or XMM0, etc.; you can read more about it here.
The return address, on the other hand, is what E/RSP points to right after a call is made (i.e. the value on top of the stack). It holds the address that the function "jumps" back to when it's done (which is, by definition, what returning means).
Now I don't even know what a tracer is tbh, but I can tell you how you'd get either, it's all about hooks.
For the value, assuming you're doing it internally, just hook the function with one that has the same signature; once the original returns you'll have your result.
For the address it's a bit more complicated because you'll have to go a bit lower and possibly do some asm shenanigans. I really have no idea what exactly you are looking to accomplish, but I made a little "stub", if you will, to provide the callee with the return pointer.
Here it is:
void __declspec(noinline) __declspec(naked) __stdcall _replaceFirstArgWithRetPtrAndJump_() {
__asm { //let's call the function we jump to "callee", and the function that called us "caller"
push ebp //save ebp, ESP IS NOW -4
mov ebp, [esp + 4] //save return address
mov eax, [esp + 8] //get callee's address (which is the first param) - eax is volatile so iz fine
mov[esp + 8], ebp //put the return address where the callee's address was (to the callee, it will be the caller)
pop ebp //restore ebp
jmp eax //jump to callee
} }
#define CallFunc_RetPtr(Function, ...) ((decltype(&Function))_replaceFirstArgWithRetPtrAndJump_)(Function, __VA_ARGS__)
unsigned __declspec(noinline) __stdcall printCaller(void* caller, unsigned param1, unsigned param2) {
printf("I'm printCaller, Called By %p; Param1: %u, Param2: %u\n", caller, param1, param2);
return 20;
}
void __declspec(noinline) doshit() {
printf("us: %p\nFunction we're calling: %p\n", doshit, printCaller);
CallFunc_RetPtr(printCaller, 69, 420);
}
Now sure, you could and maybe should use _ReturnAddress() or another compiler's equivalent intrinsic, but if that's not available (which should be a really rare scenario depending on your work) and you know your ASM, the concept should carry over to other architectures. The details differ, though: some architectures, such as ARM, keep the return address in a link register rather than on the stack.
I wrote this more because I was looking for an answer for this quite a long time ago for a certain purpose, and I couldn't find a good one since most people just go "hurr durr it's not possible or portable or whatever", and I feel like this would have helped.
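For comparison, a small sketch of the intrinsic route, assuming MSVC (_ReturnAddress from intrin.h) or a GCC-compatible compiler (__builtin_return_address). The macro names and whoCalledMe are hypothetical.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>
#define TRACE_RETURN_ADDRESS() _ReturnAddress()
#define TRACE_NOINLINE __declspec(noinline)
#else
// Assumption: a GCC-compatible compiler (GCC or Clang).
#define TRACE_RETURN_ADDRESS() __builtin_return_address(0)
#define TRACE_NOINLINE __attribute__((noinline))
#endif

// Returns the address this function will return to, i.e. a location inside its caller.
// It is kept out of line so that it has a return address of its own.
TRACE_NOINLINE void* whoCalledMe()
{
    return TRACE_RETURN_ADDRESS();
}
```

Printing the result with printf("%p", whoCalledMe()) gives an address just past the call site in the calling function.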
I'm writing some inline assembly functions for fun, and one of them throws an exception I have never encountered before. The funny thing is that if I continue after the exception has stopped the program, it still returns the sum of the two integers.
__declspec(dllexport) int addintegers(int one, int two)
{
int answer = 0;
__asm
{
mov eax, 0
push two
push one
call add
mov answer, eax
}
return answer;
} // Debugger stops here with exception message
Exception Message:
Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call. This is usually a result of calling a function declared with one calling convention with a function pointer declared with a different calling convention.
// add function definition
int add(int one, int two)
{
return one + two;
}
I don't know much about assembler, and you don't show us the declaration of add(), but if it adheres to C's calling convention, you have to pop the arguments off the stack after the call returns to the caller (for two int arguments on 32-bit x86, that means adding 8 to esp).
Requiring the caller to clean up the stack, rather than the callee, is what allows C to have functions with a variable number of arguments, like printf().
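A sketch of the corrected function, assuming 32-bit MSVC and a callee that uses the default __cdecl convention. The callee is renamed to the hypothetical addInts so that it cannot be mistaken for the add instruction, and the dllexport attribute is left out for brevity.

```cpp
// Callee, assumed to use the default __cdecl convention.
int addInts(int one, int two)
{
    return one + two;
}

int addIntegers(int one, int two)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    int answer = 0;
    __asm
    {
        push two
        push one
        call addInts
        add esp, 8       // caller cleanup: remove the two 4-byte arguments
        mov answer, eax  // __cdecl returns the int result in eax
    }
    return answer;
#else
    // Fallback (assumption: not 32-bit MSVC): the compiler does the cleanup itself.
    return addInts(one, two);
#endif
}
```

With the extra add esp, 8 the stack pointer is back at its original value when the function returns, which is exactly what the run-time check verifies.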
The function below calculates absolute value of 32-bit floating point value:
__forceinline static float Abs(float x)
{
union {
float x;
int a;
} u;
//u.x = x;
u.a &= 0x7FFFFFFF;
return u.x;
}
The union u declared in the function holds a member x, which is distinct from the x passed as the function's parameter. Is there any way to create a union directly over the function's argument x?
Is there any reason why the function above, with the commented-out line enabled, would take longer to execute than this one?
__forceinline float fastAbs(float a)
{
int b= *((int *)&a) & 0x7FFFFFFF;
return *((float *)(&b));
}
I'm trying to figure out the best way to take the Abs of a floating-point value with as few memory reads and writes as possible.
For the first question, I'm not sure why you can't just do what you want with an assignment. The compiler will do whatever optimizations can be done.
Your second sample violates strict aliasing, so it isn't equivalent to the first.
As for why it's slower:
It's because CPUs today tend to have separate integer and floating-point units. By type-punning like that, you force the value to be moved from one unit to the other. This has overhead. (This is often done through memory, so you have extra loads and stores.)
In the second snippet: a which is originally in the floating-point unit (either the x87 FPU or an SSE register), needs to be moved into the general purpose registers to apply the mask 0x7FFFFFFF. Then it needs to be moved back.
In the first snippet: The compiler is probably smart enough to load a directly into the integer unit. So you bypass the FPU in the first stage.
(I'm not 100% sure until you show us the assembly. It will also depend heavily on whether the parameter starts off in a register or on the stack. And whether the output is used immediately by another floating-point operation.)
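One way to do the same bit manipulation without the aliasing violation is to copy the bits with std::memcpy, which compilers commonly reduce to a plain move. This is only a sketch: absViaBits is a hypothetical name, and the code assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

// Clears the sign bit of a float by going through an integer copy of its bits.
inline float absViaBits(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined, unlike the pointer cast
    bits &= 0x7FFFFFFFu;                  // drop the sign bit
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Whether this beats the other versions still depends on where the value lives before and after the call, as described above.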
Looking at the disassembly of the code compiled in release mode, the difference is quite clear!
I removed the inline and made the two functions virtual to keep the compiler from optimizing too much, so that the differences show.
This is the first function.
013D1002 in al,dx
union {
float x;
int a;
} u;
u.x = x;
013D1003 fld dword ptr [x] // Loads a float on top of the FPU STACK.
013D1006 fstp dword ptr [x] // Pops a Float Number from the top of the FPU Stack into the destination address.
u.a &= 0x7FFFFFFF;
013D1009 and dword ptr [x],7FFFFFFFh // Execute a 32 bit binary and operation with the specified address.
return u.x;
013D1010 fld dword ptr [x] // Loads the result on top of the FPU stack.
}
This is the second function.
013D1020 push ebp // Standard function entry... i'm using a virtual function here to show the difference.
013D1021 mov ebp,esp
int b= *((int *)&a) & 0x7FFFFFFF;
013D1023 mov eax,dword ptr [a] // Load into eax our parameter.
013D1026 and eax,7FFFFFFFh // Execute 32 bit binary and between our register and our constant.
013D102B mov dword ptr [a],eax // Move the register value into our destination variable
return *((float *)(&b));
013D102E fld dword ptr [a] // Loads the result on top of the FPU stack.
The number of floating-point operations and the usage of the FPU stack is greater in the first case.
The functions are executing exactly what you asked for, so no surprise.
So I expect the second function to be faster.
Now... with the virtual removed and the functions inlined, things are a little different. It is hard to reproduce the disassembly here because the compiler does a good job, but I repeat: if the values are not constants, the compiler will use more floating-point operations in the first function.
Integer operations are often cheaper than floating-point ones, though how much depends on the CPU.
Are you sure that directly using the fabs function from math.h is slower than your method?
If correctly inlined, fabs will do just this!
00D71016 fabs
Micro-optimizations like this are hard to evaluate in long code, but if your function is called in a long chain of floating-point operations, fabs will work better, since the values will already be on the FPU stack or in SSE registers! The library function would be faster and better optimized by the compiler.
You cannot measure the performance of such optimizations by running a loop over an isolated piece of code; you must see how the compiler mixes everything together in the real code.
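As an illustration of the library route inside a chain of floating-point work, here is a small sketch using std::fabs from cmath. The function sumOfMagnitudes is a hypothetical example, not something from the question.

```cpp
#include <cmath>
#include <cstddef>

// Sums the absolute values of count floats using the library routine,
// so each value can stay in a floating-point register throughout.
inline float sumOfMagnitudes(const float* values, std::size_t count)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        total += std::fabs(values[i]);
    return total;
}
```

Inspecting the generated assembly for a function like this, rather than for the abs helper alone, shows how the compiler combines the operations in practice.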