I want to measure the speed at which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
#include <chrono>
using namespace std;
auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)
This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your code, you do not define any constraint that requires the compiler to perform a loop at all. Because the loop has no observable effect, the optimizer removes it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right, because it imposes extra constraints that limit the compiler's freedom to implement optimal code. Volatile often forces the compiler to write the result back to memory each time the variable is modified.
E.g., this code:
void foo(int N) {
    for (volatile int i = 0; i < N; ++i)
    {
    }
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # Jump back to .L3 if i < N.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing, e.g. https://godbolt.org/z/59XUSu
You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile(""), but MSVC inline assembly has always been different and is disabled completely for 64-bit programs.
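For example, a minimal sketch of that trick (GCC/Clang only; the chrono boilerplate is just there so you can see the elapsed time):

#include <chrono>
#include <cstdio>

int main() {
    const long long N = 1000000000LL;
    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < N; ++i) {
        asm volatile("");   // empty barrier: the optimizer cannot prove the loop dead
    }
    auto end = std::chrono::steady_clock::now();
    std::printf("%f s\n", std::chrono::duration<double>(end - start).count());
    return 0;
}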
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
    sub ecx, 1      ; decrement the loop counter (expected in ECX)
    jnz _testfun    ; keep looping until it reaches zero
    ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.
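A rough sketch of the C++ side (this assumes the MASM file above is assembled and linked into the project, and that N arrives in ECX the way the routine expects):

#include <chrono>
#include <cstdio>

extern "C" void testfun(unsigned N);   // the MASM routine above

int main() {
    const unsigned N = 1000000000u;
    auto start = std::chrono::steady_clock::now();
    testfun(N);                        // runs N sub/jnz iterations
    auto end = std::chrono::steady_clock::now();
    std::printf("%f s\n", std::chrono::duration<double>(end - start).count());
    return 0;
}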
Try volatile int i = 0 in your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, and therefore it can't make the same assumptions about the variable's value in the future.
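Applied to the snippet from the question, something like this should keep the loop alive (just a sketch; N is whatever count you want to measure):

#include <chrono>
#include <iostream>
using namespace std;

int main() {
    const int N = 1000000000;
    auto start = chrono::steady_clock::now();
    for (volatile int i = 0; i < N; ++i)   // volatile stops the compiler from removing the loop
    {
    }
    auto end = chrono::steady_clock::now();
    cout << chrono::duration<double>(end - start).count() << " s" << endl;
    return 0;
}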
I have been learning IA-32 assembly programming, so I would like to write a function in assembly and call it from C++.
The tutorial I am following is actually for x64 assembly, but I am working on IA-32. For x64, it says function arguments are passed in registers such as RCX, RDX, R8, R9, etc.
But after searching a little, I understand that in IA-32, arguments are passed on the stack, not in registers.
Below is my C++ code :
#include <iostream>
#include <conio.h>
using namespace std;
extern "C" int PassParam(int a,int b);
int main()
{
    cout << "z is " << PassParam(15, 13) << endl;
    _getch();
    return 0;
}
Below is the assembly code for the PassParam() function (it just adds its two arguments; it is only for learning purposes):
PassParam() in assembly :
.model C,flat
.code
PassParam proc
    mov eax, [ebp-212]
    add eax, [ebp-216]
    ret
PassParam endp
end
In my assembly code, you can see I moved the first argument from [ebp-212] to eax. That value was obtained as follows:
I wrote the PassParam() function in C++ itself and disassembled it. Then I checked where ebp is and where the second argument is stored (arguments are stored from right to left). I could see there is a difference of 212, so that is how I got that value. Then, as usual, the first argument is stored 4 bytes later. And it works fine.
Question :
Is this the correct method to access arguments from assembly? I mean, is it always [ebp-212] where the arguments are stored?
If not, can anyone explain the correct method to pass arguments from C++ to assembly?
Note :
I am working with Visual C++ 2010, on Windows 7 machine.
On 32-bit architectures, it depends on the calling convention. Windows, for example, has both __fastcall and __thiscall, which use register and stack args, and __cdecl and __stdcall, which use stack args but differ in who does the cleanup. MSDN has a nice listing here (or the more assembly-oriented version). Note that FPU/SSE operations also have their own conventions.
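For illustration only (the function names are made up), this is roughly how the conventions show up in C++ declarations:

int __cdecl    add_cdecl(int a, int b);     // args on the stack, caller cleans up
int __stdcall  add_stdcall(int a, int b);   // args on the stack, callee cleans up (RET 8)
int __fastcall add_fastcall(int a, int b);  // a in ECX, b in EDX, callee cleans any stack args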
For ease and simplicity, try using __stdcall for everything. This allows you to use stack frames to access args via MOV r32,[EBP+4+(arg_index * 4)], or, if you aren't using stack frames, MOV r32,[ESP+local_stack_offset+(arg_index * 4)] (with arg_index starting at 1 in both cases). The annotated C++ -> x86 assembly example here should be of help.
So as a simple example, let's say we have the function MulAdd in assembly, with the C++ prototype int __stdcall MulAdd(int base, int mul, int add); it would look something like:
MOV EAX,[ESP+4] //get the first arg('base') off the stack
MOV ECX,[ESP+8] //get the second arg('mul') off the stack
IMUL EAX,ECX //base * mul
MOV ECX,[ESP+12] //get arg 3 off the stack
ADD EAX,ECX
RETN 12 //cleanup the 3 args and return
Or if you use a stack frame:
PUSH EBP
MOV EBP,ESP //save the stack
MOV EAX,[EBP+8] //get the first arg('base') off the stack
MOV ECX,[EBP+12] //get the second arg('mul') off the stack
IMUL EAX,ECX //base * mul
MOV ECX,[EBP+16] //get arg 3 off the stack
ADD EAX,ECX
MOV ESP,EBP //restore the stack
POP EBP
RETN 12 //cleanup the 3 args and return to caller
Using the stack frame avoids needing to adjust for changes made to the stack by PUSHing of args, spilling of registers, or stack allocations made for local variables. Its downside is that it reduces the number of registers you have to work with.
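A sketch of the matching C++ side (assuming the MulAdd routine above is assembled into the same project; the __stdcall in the prototype has to agree with the RETN 12 cleanup):

#include <cstdio>

extern "C" int __stdcall MulAdd(int base, int mul, int add);

int main() {
    std::printf("%d\n", MulAdd(3, 4, 5));   // 3 * 4 + 5 = 17
    return 0;
}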
Hi, I know this is a very silly/basic question, but what is the difference between code like this:
int *i;
for (j = 0; j < 10; j++)
{
    i = static_cast<int *>(getNthCon(j));
    i->xyz;
}
and something like this:
for (j = 0; j < 10; j++)
{
    int *i = static_cast<int *>(getNthCon(j));
    i->xyz;
}
I mean, are these two pieces of code exactly the same in logic, or would there be any difference because of the local nature of i in the second one?
One practical difference is the scope of i. In the first case, i continues to exist after the final iteration of the loop. In the second it does not.
There may be some case where you want to know the value of i after all of the computation is complete. In that case, use the second pattern.
A less practical difference is the nature of the = token in each case. In the first example i = ... indicates assignment. In the second example, int *i = ... indicates initialization. Some types (but not int* nor fp_ContainerObject*) might treat assignment and initialization differently.
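As a toy illustration (the class is made up purely to show the difference), a type can easily do different things in the two cases:

#include <iostream>

struct Tracer {
    Tracer()               { std::cout << "default-construct\n"; }
    Tracer(int)            { std::cout << "construct from int\n"; }
    Tracer& operator=(int) { std::cout << "assign from int\n"; return *this; }
};

int main() {
    Tracer a;       // "default-construct"
    a = 1;          // "assign from int"      (the first pattern)
    Tracer b = 2;   // "construct from int"   (the second pattern)
    return 0;
}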
There is very little difference between them.
In the first code sample, i is declared outside the loop, so you're re-using the same pointer variable on each iteration. In the second, i is local to the body of the loop.
Since i is not used outside the loop, and the value assigned to it in one iteration is not used in future iterations, it's better style to declare it locally, as in the second sample.
Incidentally, i is a bad name for a pointer variable; it's usually used for int variables, particularly ones used in for loops.
For any sane optimizing compiler there will be no difference in terms of memory allocation. The only difference will be the scope of i. Here is a sample program (and yes, I realize there is a leak here):
#include <iostream>

int *get_some_data(int value) {
    return new int(value);
}

int main(int argc, char *argv[]) {
    int *p;
    for (int i = 0; i < 10; ++i) {
        p = get_some_data(i);
        std::cout << *p;
    }
    return 0;
}
And the generated assembly output:
int main(int argc, char *argv[]){
01091000 push esi
01091001 push edi
int *p;
for(int i = 0; i < 10; ++i) {
01091002 mov edi,dword ptr [__imp_operator new (10920A8h)]
01091008 xor esi,esi
0109100A lea ebx,[ebx]
p = get_some_data(i);
01091010 push 4
01091012 call edi
01091014 add esp,4
01091017 test eax,eax
01091019 je main+1Fh (109101Fh)
0109101B mov dword ptr [eax],esi
0109101D jmp main+21h (1091021h)
0109101F xor eax,eax
std::cout << *p;
01091021 mov eax,dword ptr [eax]
01091023 mov ecx,dword ptr [__imp_std::cout (1092048h)]
01091029 push eax
0109102A call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (1092044h)]
01091030 inc esi
01091031 cmp esi,0Ah
01091034 jl main+10h (1091010h)
}
Now with the pointer declared inside of the loop:
int main(int argc, char *argv[]){
008D1000 push esi
008D1001 push edi
for(int i = 0; i < 10; ++i) {
008D1002 mov edi,dword ptr [__imp_operator new (8D20A8h)]
008D1008 xor esi,esi
008D100A lea ebx,[ebx]
int *p = get_some_data(i);
008D1010 push 4
008D1012 call edi
008D1014 add esp,4
008D1017 test eax,eax
008D1019 je main+1Fh (8D101Fh)
008D101B mov dword ptr [eax],esi
008D101D jmp main+21h (8D1021h)
008D101F xor eax,eax
std::cout << *p;
008D1021 mov eax,dword ptr [eax]
008D1023 mov ecx,dword ptr [__imp_std::cout (8D2048h)]
008D1029 push eax
008D102A call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (8D2044h)]
008D1030 inc esi
008D1031 cmp esi,0Ah
008D1034 jl main+10h (8D1010h)
}
As you can see, the output is identical. Note that, even in a debug build, the assembly remains identical.
Ed S. shows that most compilers will generate the same code for both cases. But, as Mahesh points out, they're not actually identical (even beyond the obvious fact that it would be legal to use i outside the loop scope in version 1 but not version 2). Let me try to explain how these can both be true, in a way that isn't misleading.
First, where does the storage for i come from?
The standard is silent on this—as long as storage is available for the entire lifetime of the scope of i, it can be anywhere the compiler likes. But the typical way to deal with local variables (technically, variables with automatic storage duration) is to expand the stack frame of the appropriate scope by sizeof(i) bytes, and store i as an offset into that stack frame.
A "teaching compiler" might always create a stack frame for each scope. But a real compiler usually won't bother, especially if nothing happens on entering and exiting the loop scope. (There's no way you can tell the difference, except by looking at the assembly or breaking in with a debugger, so of course it's allowed to do this.) So, both versions will probably end up with i referring to the exact same offset from the function's stack frame. (Actually, it's quite plausible i will end up in a register, but that doesn't change anything important here.)
Now let's look at the lifecycle.
In the first case, the compiler has to default-initialize i where it's declared at the function scope, copy-assign into it each time through the loop, and destroy it at the end of the function scope. In the second case, the compiler has to copy-initialize i at the start of each loop iteration and destroy it at the end of each iteration.
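Roughly, written out as a sketch (keeping the names from the question):

// First version:
int *i;                                        // default-init: a no-op for a pointer
for (j = 0; j < 10; j++) {
    i = static_cast<int *>(getNthCon(j));      // copy-assignment every iteration
}
// ... i destroyed at the end of the enclosing scope: also a no-op

// Second version:
for (j = 0; j < 10; j++) {
    int *i = static_cast<int *>(getNthCon(j)); // copy-initialization every iteration
}                                              // i destroyed here every iteration: a no-op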
If i were of class type, this would be a very significant difference. (See below if it's not obvious why.) But it's not, it's a pointer. This means default initialization and destruction are both no-ops, and copy-initialization and copy-assignment are identical.
So, the lifecycle-management code will be identical in both cases: it's a copy once each time through the loop, and that's it.
In other words, the storage is allowed to be, and probably will be, the same; the lifecycle management is required to be the same.
I promised to come back to why these would be different if i were of class type.
Compare this pseudocode:
i.IType();
for (j = 0; j < 10; j++) {
    i.operator=(static_cast<IType>(getNthCon(j)));
}
i.~IType();
to this:
for (j = 0; j < 10; j++) {
    i.IType(static_cast<IType>(getNthCon(j)));
    i.~IType();
}
At first glance, the first version looks "better", because it's 1 IType(), 10 operator=(IType&), and 1 ~IType(), while the second is 10 IType(IType&) and 10 ~IType(). And for some classes, this might be true. But if you think about how operator= works, it usually has to do at least the equivalent of a copy construction and a destruction.
So the real difference here is that the first version requires a default constructor and a copy-assignment operator, while the second doesn't. And if you take out that static_cast bit (so we're talking about a conversion constructor and assignment instead of copy), what you're looking at is equivalent to this:
for (j = 0; j < 10; j++) {
    std::ifstream i(filenames[j]);
}
Clearly, you wouldn't try to pull i out of the loop in that case.
But again, this is only true for "most" classes; you can easily design a class for which version 2 is ridiculously bad and version 1 makes more sense.
In the second case, a new pointer variable is conceptually created on the stack for every iteration, while in the first case the pointer variable is created only once (i.e., before entering the loop).
I'm trying to learn reverse engineering, and I'm stuck on this little thing. I have code like this:
.text:10003478 mov eax, HWHandle
.text:1000347D lea ecx, [eax+1829B8h] <------
.text:10003483 mov dword_1000FA64, ecx
.text:10003489 lea esi, [eax+166A98h]<------
.text:1000348F lea edx, [eax+11FE320h]
.text:10003495 mov dword_1000FCA0, esi
and I'm wondering what it looks like in C or C++, especially the two instructions marked by arrows. HWHandle is a variable which holds the value returned from the GetModuleHandle() function.
More interesting is that a couple of lines below these instructions, dword_1000FCA0 is used as a function:
.text:1000353C mov eax, dword_1000FCA0
.text:10003541 mov ecx, [eax+0A0h]
.text:10003547 push offset asc_1000C9E4 ; "\r\n========================\r\n"
.text:1000354C call ecx
This will draw this text in my game console. Have you got any ideas, guys?
LEA is nothing more than an arithmetic operation: in this case, ECX is just filled with EAX + offset (the address itself, not the pointed-to contents). If HWHandle pointed to a (very large) structure, ECX would just hold the address of one of its members.
This could be an associated source code:
extern A* HWHandle; // mov eax, HWHandle
B* ECX = &HWHandle->someStructure; // lea ecx, [eax+1829B8h]
and later, one of B’s members is used as a function.
ECX->ptrFunction(someArg);         // mov ecx, [eax+0A0h]
                                   // call ecx
Since HWHandle is a module handle, which is just the base address of a DLL, it looks as if the constants that are being added to this are offsets for functions or static data inside the DLL. The code is computing the addresses of these functions or data items and storing them for later use.
Since this is typically the job of a dynamic linker, I'm not sure that this assembly code corresponds to actual C++ code. It would be helpful to know what environment you're working in exactly -- since you refer to games consoles, is this Xbox code? Unfortunately, I don't know how exactly dynamic linking works on Xbox, but it looks as if this may be what is going on here.
In the specific case of dword_1000FCA0, it looks as if this is the location of a jump table (i.e. essentially a list of function pointers) inside the DLL. Your second code snippet is getting a function pointer from offset 0xA0 inside this table, then calling it -- apparently, the function being called outputs strings to the screen. (The pointer to the string to be output is pushed onto the stack, which is the usual x86 way of passing arguments.) The C++ code corresponding to this would be something like
my_print_function("\r\n========================\r\n");
Edit:
If you want to call functions in a DLL yourself, the canonical way of getting at the function pointer is to use GetProcAddress():
FARPROC func=GetProcAddress(HWHandle, "MyFunction");
However, the code you posted is calculating offsets itself, and if you really want to do the same, you could use something like this:
DWORD func=(DWORD)HWHandle + myOffset;
myOffset is the offset you want to use -- of course, you'd need to have some way of determining this offset, and it can change every time the DLL is recompiled, so it's not a technique I would recommend -- but it is, after all, what you were asking about.
Regardless of which of these two ways you use to get at the address of the function, you need to call it. To do this, you need to declare a function pointer -- and to do that, you need to know the signature of your function (its parameters and return types). For example:
typedef void (*print_func_type)(const char *);
print_func_type my_func_pointer=(print_func_type)func;
my_func_pointer("\r\n========================\r\n");
Beware -- if you get the address of the function or its signature wrong, your code will likely crash. All part of the fun of this kind of low-level work.
It looks like HWHandle is a pointer to some structure (a big one). The lea instruction is computing addresses within that structure, e.g.:
mov eax, HWHandle
lea ecx, [eax+1829B8h]
mov dword_1000FA64, ecx
means:
Compute the address HWHandle + 0x1829B8 and put it into ecx
Put that address (from ecx) into some (global) variable dword_1000FA64
The rest looks similar.
In C++, an lea can show up almost anywhere, and you really cannot predict where (it depends on the compiler and optimizations), e.g.:
int x;
int* pX = &x;
The second line may generate lea.
Another example:
struct s
{
int x;
int y;
};
s my_s;
int Y = my_s.y; // here: probably lea <something>, [address(my_s) + 0x4]
Hope that helps.
In C++ this is roughly equivalent to
char *ecx, *eax, *esi;
ecx = eax + 0x1829B8;   // lea ecx, [eax+1829B8h]
esi = eax + 0x166A98;   // lea esi, [eax+166A98h]
Under the assumption that eax, esi, and ecx are really holding pointers to memory locations. Of course, the lea instruction can be used to do simple arithmetic too, and in fact it is often used for addition by compilers. The advantage compared to a simple add: it can have up to three input operands and a different destination.
For example, foo = &bar->baz is the same as (simplified) foo = (char *)bar + offsetof(typeof(*bar), baz), which can be translated to lea foo, [bar+offsetofbaz].
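Purely as an illustration (the struct and its layout are made up), taking a member's address is exactly the kind of expression that tends to compile to a single lea:

struct Bar {
    char pad[0x1829B8];   // padding so baz ends up at a large offset
    int  baz;
};

int* baz_address(Bar* bar) {
    return &bar->baz;     // typically a single lea, e.g. lea eax, [<reg>+1829B8h]
}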
It really is compiler- and optimization-dependent, but IIRC, lea can be emitted just for additions... So lea ecx, [eax+1829B8h] can be understood as ecx = eax + 0x1829B8.
I'm new to assembly coding and embedding it in C++. The thing I'm trying to do is add up the integers in an array using assembly. This is the code I have so far:
#include <iostream>
#include <stdio.h>

int x[] = {5, 4, 3, 2, 1};
int sumArray(int [5]);

int main()
{
    sumArray(x);
    printf_s("The sum of the array is %d");
}

int sumArray(int [5])
{
    __asm
    {
        mov edi, OFFSET sumArray
        mov ecx, 5
        mov eax, 0
    L1:
        add eax, [edi]
        add edi, TYPE sumArray
        loop L1
    }
}
An initial problem I had was with mov ecx; I originally had it as
mov ecx,LENGTHOF sumArray
but it wouldn't compile, so I changed it to 5 and it compiled. Now when I run the program, it breaks. I used F11 in Visual Studio to go line by line to see at which line the program breaks, and it breaks while going through the loop a second time.
So if anyone can help me figure out how I can go about fixing it I would appreciate it.
It seems to me you have it broken quite a bit. First of all, you have a function named sumArray with an unnamed argument, but inside the function you refer to sumArray as if it were the name of the array argument. Then, you need to understand the way C(++) passes arrays as arguments: they are (always) passed by reference, as a pointer to the first array member. It also means the function (in general) does not know the length of the array (unless you fix it to a constant size), so you usually pass the length in another argument. Which means we have the following:
int sumArray(int arr[], int len)
{
    __asm
    {
        mov edi, arr
        mov ecx, len
        xor eax, eax
    L1:
        add eax, [edi]
        add edi, 4
        loop L1
    }
}
Note that we are not trying to take the offset of the array argument; that would get us the pointer variable itself, while we need the value of the pointer, i.e. the address of the first array item. Also, note I have hardcoded the element size (4); there is no point pretending we can work with any element size if, on the previous line, we add 32-bit words. (The xor eax, eax is just another way of setting a register to zero; to be honest, with today's CPUs I am not sure whether it is faster or not.)
And when testing this, do not forget to actually pass the result to the printf_s…
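For example, a caller matching the corrected prototype could look like this (still relying on the MSVC inline-asm convention that the result is left in EAX):

#include <stdio.h>

int sumArray(int arr[], int len);   // the inline-asm version from this answer

int x[] = {5, 4, 3, 2, 1};

int main() {
    printf_s("The sum of the array is %d\n", sumArray(x, 5));
    return 0;
}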
The problem with your code seems to be that you are using your sumArray function name instead of your actual array x, and that's why it crashes.
Isn't your asm supposed to look like this:
__asm
{
    mov edi, OFFSET x
    mov ecx, LENGTHOF x
    mov eax, 0
L1:
    add eax, [edi]
    add edi, TYPE x
    loop L1
}
? (Here I assume that you are not mistaken about the macro usage, as I have never compiled anything using MASM, which seems to be what is used here, but I think you get the idea.)
The other question is why you pass an unnamed argument to sumArray if you don't actually use it; you'd be better off passing the array as a named argument, along with its length, and making use of them in your assembly code.