I am trying to understand assembly a little better, so I've been paying attention to the assembly output from CDB when I debug my code. My platform is an Intel Xeon with Windows 7.
The following C++ code:
int main()
{
int a = 30;
int b = 0;
b = ++a;
return 0;
}
produces the following assembly for the line with the increment operator:
b = ++a;
0x13f441023 <+0x0013> mov eax,dword ptr [rsp]
0x13f441026 <+0x0016> inc eax
0x13f441028 <+0x0018> mov dword ptr [rsp],eax //Move eax to some memory address
0x13f44102b <+0x001b> mov eax,dword ptr [rsp] //Move it back to eax?
0x13f44102e <+0x001e> mov dword ptr [rsp+4],eax
My question is, what is the purpose of moving the value in eax to memory, then immediately moving the same value back into eax, as indicated by the comments? Is this for thread safety, or just some artifact of a debug build?
The compiler initially translates your code into an intermediate representation in static single assignment (SSA) form, meaning that each operation gets a fresh temporary value to store its result. Only at a later backend stage are these values mapped onto the machine registers of your target, and onto memory locations where necessary (either explicitly required, or spilled due to a lack of registers).
Between these stages the optimizer may eliminate intermediate values, but initially ++a is one operation, and assigning the incremented a into b is a second operation. Since a and b are both local variables, they are stored on the stack (and must be visible there, in case you step through with a debugger, for example): a resides at [rsp] and b at [rsp+4].
So your compiler, at some point, probably has (in some intermediate representation):
value1 = a
value2 = value1 + 1
a = value2 //self increment
b = a
Or something similar. a and b must be memory-resident, but the operations are done in registers, so at first the compiler emits:
value1 = a
value2 = value1 + 1
0x13f441023 <+0x0013> mov eax,dword ptr [rsp]
0x13f441026 <+0x0016> inc eax
a = value2
0x13f441028 <+0x0018> mov dword ptr [rsp],eax
b = a
0x13f44102b <+0x001b> mov eax,dword ptr [rsp]
0x13f44102e <+0x001e> mov dword ptr [rsp+4],eax
Note that the intermediate values were kept in a register; in a normal compilation they would probably have been eliminated altogether by one of the optimization passes (prior to register allocation and code generation).
just some artifact of a debug build?
Yes, just an artifact of a debug build (or, more precisely, of an unoptimized build).
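For comparison, here is a sketch of what an optimizing build does with the same computation (probe is a hypothetical function; exact output depends on compiler and flags):

// Hypothetical probe: returning b keeps the value live, so an optimized
// build still has a reason to compute it. With MSVC /O2 this typically
// collapses to something like
//     mov eax,1Fh   ; 31, folded at compile time
//     ret
// The stores to [rsp] and reloads from it vanish, because a and b no
// longer need to be memory-resident for debugger visibility.
int probe()
{
    int a = 30;
    int b = ++a;
    return b;
}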
Consider the following two useless C++ functions.
Compiled with GCC (4.9.2, 32- or 64-bit), both functions return the same value, as expected.
Compiled with Visual Studio 2010 or Visual Studio 2017 (unmanaged code), the two functions return different values.
What I've tried:
brackets, brackets, brackets
explicit casts to char
sizeof(char) evaluates to 1
debug / release version
32- / 64-bit
What's going on here? It seems to be a fundamental bug in VS.
char test1()
{
char buf[] = "The quick brown fox...", *pbuf = buf;
char value = (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0);
return value;
}
char test2()
{
char buf[] = "The quick brown fox...", *pbuf = buf;
char a = *(pbuf++) & 0x0F;
char b = *(pbuf++) & 0xF0;
char value = a | b;
return value;
}
Edit:
It's not an attempt to blame VS (as mentioned in the posts).
It's not a matter of signed or unsigned.
It's not a matter of the order of evaluation of the left and right sides of the or-operator. Changing the order of the assignments of a and b in test2() yields a third result.
But the simultaneity is a good point. It seems the order of evaluation is, by definition, undefined. As a first step, the generated code evaluates the complete expression in test1() without incrementing any pointer; as a second step, the pointer is incremented. Since the incrementation then has no observable effect and the data remains unchanged, the optimizer removes that code.
Sorry for the inconvenience, but this is not what I would expect, in any language.
For completeness, here the disassembled code of test1():
0028102A mov ecx,dword ptr [ebp-8]
0028102D movsx edx,byte ptr [ecx]
00281030 and edx,0Fh
00281033 mov eax,dword ptr [ebp-8]
00281036 movsx ecx,byte ptr [eax]
00281039 and ecx,0F0h
0028103F or edx,ecx
00281041 mov byte ptr [ebp-1],dl
00281044 mov edx,dword ptr [ebp-8]
00281047 add edx,1
0028104A mov dword ptr [ebp-8],edx
0028104D mov eax,dword ptr [ebp-8]
00281050 add eax,1
00281053 mov dword ptr [ebp-8],eax
The behaviour of (*(pbuf++) & 0x0F) | (*(pbuf++) & 0xF0); is undefined. | (unlike ||) is not a sequence point, so you have unsequenced reads and writes of pbuf within the same expression.
Not a VS bug therefore. (Such things rarely are: a golden rule is not to blame the compiler.)
(Note also that char can be either signed or unsigned. That can introduce differences in code like yours.)
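For reference, a minimal well-defined rewrite (test1_fixed is a hypothetical name) sequences the two reads as separate statements, which is essentially what test2() already does:

// Well-defined: each read-and-increment is a full statement, so the two
// modifications of pbuf are sequenced relative to each other.
char test1_fixed()
{
    char buf[] = "The quick brown fox...", *pbuf = buf;
    char lo = *pbuf++ & 0x0F;   // reads 'T', then increments pbuf
    char hi = *pbuf++ & 0xF0;   // reads 'h', then increments pbuf
    return lo | hi;
}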
I was looking at the assembly Visual Studio generated for this simple x64 program:
struct Point {
int a, b;
Point() {
a = 0; b = 1;
}
};
int main(int argc, char* argv[])
{
Point arr[3];
arr[0].b = 2;
return 0;
}
And when it reaches arr[0].b = 2, it generates this:
mov eax, 8
imul rax, rax, 0
mov dword ptr [rbp+rax+4],2
Why does it do imul rax, rax, 0 instead of a simple mov rax, 0, or even xor rax, rax? How is imul more efficient, if at all?
The reason is that the code is calculating both the offset of the Point object within the array (which happens to be on the stack) and the offset of the member b.
The Intel documentation for imul with three operands states:
Three-operand form — This form requires a destination operand (the first operand) and two source operands (the second and the third operands). Here, the first source operand (which can be a general-purpose register or a memory location) is multiplied by the second source operand (an immediate value). The intermediate product (twice the size of the first source operand) is truncated and stored in the destination operand (a general-purpose register).
In your case it is calculating the offset of the object within the array, which resolves to the first (zeroth) Point on the stack. With that resolved, it then adds the offset of .b, which is the +4. Broken down:
mov eax,8 ; prepare to offset into the Point array
imul rax, rax, 0 ; Calculate which Point object is being referred to
mov dword ptr [rbp+rax+4],2 ; Add the offset to b and move value 2 in
All of which resolves to arr[0].b = 2.
I take it you did not compile with aggressive optimization. With a straight compile (no optimization, debug on, etc.) the compiler makes no assumptions with respect to addressing.
Comparison to clang
On OS X (El Capitan) with clang 3.9.0 and no optimization flags, once the Point objects are instantiated in the array, the assignment of .b = 2 is simply:
mov dword ptr [rbp - 44], 2
In this case, clang is pretty smart about the offsets and resolves the addressing even at the default optimization level.
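To see why a multiply shows up at all, consider a hypothetical variant with a runtime index, where the scaling by sizeof(Point) is genuinely needed:

struct Point { int a, b; };   // same layout as in the question: 8 bytes

// Hypothetical variant: the index is only known at run time, so the
// address arr + i*sizeof(Point) + offsetof(Point, b) must actually be
// computed. An optimizing compiler can usually fold the multiply by 8
// into the addressing mode, e.g. (assuming the Windows x64 convention,
// with arr in rcx and i in edx):
//     movsxd rax,edx
//     mov dword ptr [rcx+rax*8+4],2
// An unoptimized build keeps the generic imul pattern shown above, with
// the runtime index in place of the constant 0.
void set_b(Point* arr, int i)
{
    arr[i].b = 2;
}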
Background
I have a VS2013 solution containing many projects and numerous source files.
In my sources, I use the same macro thousands of times in different locations in the sources.
Something like:
#define MyMacro(X) X
where X is a const char*
I have a DLL project that, with the above macro definition, results in an 800KB output DLL.
Problem
In some scenarios or modes, I wish to change my macro definition to the following:
#define MyMacro(X) Utils::MyFunc(X)
This change has a very unpleasant side effect: the DLL output file size increases by 100KB.
Notes
Utils::MyFunc() is used here for the first time, so naturally I expect the binary to grow a little, since new code is introduced.
Utils::MyFunc() does not include large headers or libraries.
Utils::MyFunc() does allocate a string object.
All projects are compiled with settings that favor small code.
Artificial example
#include <string>
#include <tchar.h>
using std::string;

#define M1(X) X
#define M2(X) ReturnString1(X)
#define M3(X) ReturnString2(X)
string ReturnString1(const char* c)
{
return string(c);
}
string ReturnString2(const string& s)
{
return string(s);
}
int _tmain(int argc, _TCHAR* argv[])
{
M3("TEST");
M3("TEST");
.
. // 5000 times
.
M3("TEST");
return 1;
}
In the above example, I've generated a small EXE project to try to mimic the problem I'm facing.
Using M1 exclusively in _tmain - compilation was instantaneous and output file was 88KB EXE.
Using M2 exclusively in _tmain - compilation took minutes and output file was 239KB EXE.
Using M3 exclusively in _tmain - compilation took a lot longer and output file was 587KB EXE.
I used IDA to compare between the binaries and extracted the function names from the binaries.
In M2 & M3, I see a lot more of the following functions than I see in M1:
... $basic_string@DU?$char_traits@D@std@@V?$allocator@...
I'm not too surprised about it since in M2 & M3 I'm allocating a string object.
But is it enough to justify a 151KB & 499KB increase?
Question
Is it expected for string allocation to have such a substantial impact on the output file size?
Here is another "artificial" example:
int main()
{
const char* p = M1("TEST");
std::cout << p;
string s = M3("TEST");
std::cout << s;
return 1;
}
I commented out one section at a time and looked at the generated ASM. For the M1 macro, I got:
012B1000 mov ecx,dword ptr [__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (012B204Ch)]
012B1006 call std::operator<<<std::char_traits<char> > (012B1020h)
012B100B mov eax,1
While for M3:
00DC1068 push 4
00DC106A push ecx
00DC106B lea ecx,[ebp-40h]
00DC106E mov dword ptr [ebp-2Ch],0Fh
00DC1075 mov dword ptr [ebp-30h],0
00DC107C mov byte ptr [ebp-40h],0
00DC1080 call std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign (0DC1820h)
00DC1085 lea edx,[ebp-40h]
00DC1088 mov dword ptr [ebp-4],0
00DC108F lea ecx,[s]
00DC1092 call ReturnString2 (0DC1000h)
00DC1097 mov byte ptr [ebp-4],2
00DC109B mov eax,dword ptr [ebp-2Ch]
00DC109E cmp eax,10h
00DC10A1 jb main+6Dh (0DC10ADh)
00DC10A3 inc eax
00DC10A4 push eax
00DC10A5 push dword ptr [ebp-40h]
00DC10A8 call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10AD mov ecx,dword ptr [__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A (0DC3050h)]
00DC10B3 lea edx,[s]
00DC10B6 mov dword ptr [ebp-2Ch],0Fh
00DC10BD mov dword ptr [ebp-30h],0
00DC10C4 mov byte ptr [ebp-40h],0
00DC10C8 call std::operator<<<char,std::char_traits<char>,std::allocator<char> > (0DC1100h)
00DC10CD mov eax,dword ptr [ebp-14h]
00DC10D0 cmp eax,10h
00DC10D3 jb main+9Fh (0DC10DFh)
00DC10D5 inc eax
00DC10D6 push eax
00DC10D7 push dword ptr [s]
00DC10DA call std::_Wrap_alloc<std::allocator<char> >::deallocate (0DC17C0h)
00DC10DF mov eax,1
Looking at the first column (the addresses), the M1 code size is 12 bytes, while M3's is 119.
I will leave it as an exercise for the reader to figure out the difference between 5,000 * 12 and 5,000 * 119 :)
Let's take two cases in a simple example:
int _tmain()
{
"TEST";
std::string("TEST");
}
The first statement has no effect and is trivially optimized away.
The second statement constructs a string, which requires a function call. But what function is called? Maybe it's the string constructor; but if that constructor is inlined, it may actually be malloc(), strlen(), and memcpy() that are called directly from main (not explicitly, but those three functions might plausibly be used by an inlined string constructor).
Now if you have this:
std::string("TEST");
std::string("TEST");
std::string("TEST");
You can see it's not 3 function calls, but 9 (in our hypothetical). You could get it back to 3 if you make sure the function you're calling is not inline (either using __declspec(noinline) or by defining it in a separate translation unit, aka .cpp file).
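A sketch of the noinline route, reusing ReturnString1 from the question (__declspec(noinline) is MSVC-specific):

#include <string>

// Forcing the construction out of line: each of the 5,000 call sites now
// pays for roughly one call instruction (plus the still-inlined handling
// of the returned temporary) instead of the whole inlined constructor body.
__declspec(noinline) std::string ReturnString1(const char* c)
{
    return std::string(c);
}

#define M2(X) ReturnString1(X)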
You may find that enabling full optimizations (Release build) lets the compiler figure out that these strings are never used, and get rid of them. Maybe.
I was curious to see what the cost is of accessing a data member through a pointer compared with not through a pointer, so I came up with this test:
#include <iostream>
struct X{
int a;
};
int main(){
X* xheap = new X();
std::cin >> xheap->a;
volatile int x = xheap->a;
X xstack;
std::cin >> xstack.a;
volatile int y = xstack.a;
}
The generated x86-64 is:
int main(){
push rbx
sub rsp,20h
X* xheap = new X();
mov ecx,4
call qword ptr [__imp_operator new (013FCD3158h)]
mov rbx,rax
test rax,rax
je main+1Fh (013FCD125Fh)
xor eax,eax
mov dword ptr [rbx],eax
jmp main+21h (013FCD1261h)
xor ebx,ebx
std::cin >> xheap->a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov rdx,rbx
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int x = xheap->a;
mov eax,dword ptr [rbx]
X xstack;
std::cin >> xstack.a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov dword ptr [x],eax
lea rdx,[xstack]
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int y = xstack.a;
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
It looks like the non-pointer access takes two instructions, compared to one instruction for the access through a pointer. Could somebody please tell me why this is, and which would take fewer CPU cycles?
I am trying to understand whether pointers incur more CPU instructions/cycles when accessing data members through them, as opposed to non-pointer access.
That's a terrible test.
The complete assignment to x is this:
mov eax,dword ptr [rbx]
mov dword ptr [x],eax
(the compiler is allowed to re-order the instructions somewhat, and has).
The assignment to y (which the compiler has given the same address as x) is
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
which is almost the same (read memory pointed to by register, write to the stack).
The first one would be more complicated except that the compiler kept xheap in register rbx after the call to new, so it doesn't need to re-load it.
In either case I would be more worried about whether any of those accesses misses the L1 or L2 caches than about the precise instructions. (The processor doesn't even directly execute those instructions, they get converted internally to a different instruction set, and it may execute them in a different order.)
Accessing via a pointer instead of directly accessing from the stack costs you one extra indirection in the worst case (fetching the pointer). This is almost always irrelevant in itself; you need to look at your whole algorithm and how it works with the processor's caches and branch prediction logic.
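A minimal sketch of that worst case (direct and indirect are hypothetical functions; the assembly comments assume an optimized build and the Windows x64 convention, with the argument in rcx):

struct X { int a; };   // same X as in the question

// One load: the object's address is already in a register.
//     mov eax,dword ptr [rcx]
int direct(X* x) { return x->a; }

// Two dependent loads: first fetch the pointer itself, then the member.
//     mov rax,qword ptr [rcx]
//     mov eax,dword ptr [rax]
int indirect(X** p) { return (*p)->a; }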
Say you do:
void something()
{
int* number = new int(16);
int* sixteen = number;
}
How does the CPU know the address that I want to assign to sixteen?
Thanks
There's no magic in your example code. Take this snippet, for example:
int x = 5;
int y = x;
Your code with pointers is exactly the same - the computer doesn't need to know any magic information, it's just copying whatever's in number into sixteen.
As to your comment below:
but how does it know where x or y are in memory. If I ask to copy x into y, how does it know where either of those are.
In practice, on most machines these days, probably neither of them will be in memory; they'll be in registers. But if they are in memory, then yes, the compiler will emit code that keeps track of all of those addresses as necessary. In this case they'd be on the stack, so the machine code would take the stack pointer register and dereference it with compiler-decided offsets that refer to the storage of each particular variable.
Here's an example. This simple function:
int f(void)
{
int x = 5;
int y = x;
return y;
}
When compiled with clang with no optimizations, it gives me the following output on my machine:
_f:
pushq %rbp ; save caller's base pointer
movq %rsp,%rbp ; copy stack pointer into base pointer
movl $5,0xfc(%rbp) ; store constant 5 to stack at rbp-4
movl 0xfc(%rbp),%eax ; copy value at rbp-4 to register eax
movl %eax,0xf8(%rbp) ; copy value from eax to stack at rbp-8
movl 0xf8(%rbp),%eax ; copy value off stack to return value register eax
popq %rbp ; restore caller's base pointer
ret ; return from function
I added some comments to explain what each line of the generated code does. The important things to see are that there are two variables on the stack - one at 0xf8(%rbp) (or rbp-8 to be clearer) and one at 0xfc(%rbp) (or rbp-4). The basic algorithm is just like the original code shows - the constant 5 gets saved into x at rbp-4, then that value gets copied over into y at rbp-8.
"But where does the stack come from?" you might ask. The answer to that question is operating system and compiler dependent, though. It's all set up prior to your program's main function being called, at the same time as other runtime setup required by your operating system takes place.
The CPU knows because your program tells it. The magic here is in the compiler. First, I built this program in Visual Studio 2010.
This is the disassembly that it generates (in DEBUG mode):
void something()
{
003A13C0 push ebp
003A13C1 mov ebp,esp
003A13C3 sub esp,0E8h
003A13C9 push ebx
003A13CA push esi
003A13CB push edi
003A13CC lea edi,[ebp-0E8h]
003A13D2 mov ecx,3Ah
003A13D7 mov eax,0CCCCCCCCh
003A13DC rep stos dword ptr es:[edi]
int* number = new int(16);
003A13DE push 4
003A13E0 call operator new (3A1186h)
After the call to operator new, EAX = 00097C58, which is the address the memory manager decided to give me on this run of the program. This is the address that will be used whenever you dereference number.
003A13E5 add esp,4
003A13E8 mov dword ptr [ebp-0E0h],eax
003A13EE cmp dword ptr [ebp-0E0h],0
003A13F5 je something+51h (3A1411h)
003A13F7 mov eax,dword ptr [ebp-0E0h]
003A13FD mov dword ptr [eax],10h
003A1403 mov ecx,dword ptr [ebp-0E0h]
003A1409 mov dword ptr [ebp-0E8h],ecx
003A140F jmp something+5Bh (3A141Bh)
003A1411 mov dword ptr [ebp-0E8h],0
003A141B mov edx,dword ptr [ebp-0E8h]
003A1421 mov dword ptr [number],edx
int* sixteen = number;
003A1424 mov eax,dword ptr [number]
003A1427 mov dword ptr [sixteen],eax
Here you're just copying the value of number into sixteen, so now they both hold the same address.
}
You can verify by inspecting them in the Locals debug window:
+ number 0x00097c58 int *
+ sixteen 0x00097c58 int *
You can do this experiment and step through the disassembly. It is often very enlightening.