Background:
I am new to assembly. When I was learning programming, I made a program that implements multiplication tables up to 1000 * 1000. The tables are formatted so that each answer sits on line factor1 << 10 | factor2 (I know, I know, it isn't pretty). These tables are then loaded into an array: int* tables. Empty lines are filled with 0. Here is a link to the file for the tables (7.3 MB). I know using assembly won't speed this up by much, but I just wanted to do it for fun (and for a bit of practice).
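To make the layout concrete, here is a rough sketch (my own illustration, not the author's loader) of how such a table could be built in memory, assuming the factors run from 0 to 1000:
#include <vector>

// Every (f1, f2) product lands at index f1 << 10 | f2; slots that no pair maps to
// stay 0, mirroring the "empty lines are filled with 0" rule.
std::vector<int> build_tables()
{
    std::vector<int> t(((1000 << 10) | 1000) + 1, 0);   // highest used index is 1000 << 10 | 1000
    for (int f1 = 0; f1 <= 1000; ++f1)
        for (int f2 = 0; f2 <= 1000; ++f2)
            t[(f1 << 10) | f2] = f1 * f2;
    return t;
}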
Question:
I'm trying to convert this code into inline assembly (tables is a global):
int answer;
// ...
answer = tables [factor1 << 10 | factor2];
This is what I came up with:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
My C++ code works fine, but my assembly fails. What is wrong with my assembly (especially the movl _tables(,%2,4), %0; part), compared to my C++?
What I have done to solve it:
I used some random numbers: 89 and 786 as factor1 and factor2. I know that there is an element at 89 << 10 | 786 (which is 91922); I verified this with C++. When I run it with gdb, I get a SIGSEGV:
Program received signal SIGSEGV, Segmentation fault.
at this line:
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
I added two methods around my asm, which is how I know where the asm block is in the disassembly.
Disassembly of my asm block:
The disassembly from objdump -M att -d looks fine (although I'm not sure, I'm new to assembly, as I said):
402096: 8b 45 08 mov 0x8(%ebp),%eax
402099: 8b 55 0c mov 0xc(%ebp),%edx
40209c: c1 e0 0a shl $0xa,%eax
40209f: 09 c2 or %eax,%edx
4020a1: 8b 04 95 18 e0 47 00 mov 0x47e018(,%edx,4),%eax
4020a8: 89 45 ec mov %eax,-0x14(%ebp)
The disassembly from objdump -M intel -d:
402096: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
402099: 8b 55 0c mov edx,DWORD PTR [ebp+0xc]
40209c: c1 e0 0a shl eax,0xa
40209f: 09 c2 or edx,eax
4020a1: 8b 04 95 18 e0 47 00 mov eax,DWORD PTR [edx*4+0x47e018]
4020a8: 89 45 ec mov DWORD PTR [ebp-0x14],eax
From what I understand, it's moving the first parameter of my void calc ( int factor1, int factor2 ) function into eax. Then it's moving the second parameter into edx. Then it shifts eax to the left by 10 and ors it with edx. A 32-bit integer is 4 bytes, so [edx*4+base_address]. It moves the result to eax and then puts eax into int answer (which, I'm guessing, is at -0x14 on the stack). I don't really see much of a problem.
Disassembly of the compiler's .exe:
When I replace the asm block with plain C++ (answer = tables [factor1 << 10 | factor2];) and disassemble it, this is what I get in Intel syntax:
402096: a1 18 e0 47 00 mov eax,ds:0x47e018
40209b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
40209e: c1 e2 0a shl edx,0xa
4020a1: 0b 55 0c or edx,DWORD PTR [ebp+0xc]
4020a4: c1 e2 02 shl edx,0x2
4020a7: 01 d0 add eax,edx
4020a9: 8b 00 mov eax,DWORD PTR [eax]
4020ab: 89 45 ec mov DWORD PTR [ebp-0x14],eax
AT&T syntax:
402096: a1 18 e0 47 00 mov 0x47e018,%eax
40209b: 8b 55 08 mov 0x8(%ebp),%edx
40209e: c1 e2 0a shl $0xa,%edx
4020a1: 0b 55 0c or 0xc(%ebp),%edx
4020a4: c1 e2 02 shl $0x2,%edx
4020a7: 01 d0 add %edx,%eax
4020a9: 8b 00 mov (%eax),%eax
4020ab: 89 45 ec mov %eax,-0x14(%ebp)
I am not really familiar with the Intel syntax, so I am just going to try and understand the AT&T syntax:
It first moves the base address of the tables array into %eax. Then it moves the first parameter into %edx. It shifts %edx to the left by 10, then ors it with the second parameter. Then, by shifting %edx to the left by two, it actually multiplies %edx by 4. Then it adds that to %eax (the base address of the array). So, basically, it just did this: [edx*4+0x47e018] (Intel syntax) or 0x47e018(,%edx,4) (AT&T). It moves the value of the element it got into %eax and puts it into int answer. This method is more "expanded", but it does the same thing as my hand-written assembly! So why is mine giving a SIGSEGV while the compiler's works fine?
I bet (from the disassembly) that tables is a pointer to an array, not the array itself.
So you need:
asm volatile ( "shll $10, %1;"
movl _tables,%%eax
"orl %1, %2;"
"movl (%%eax,%2,4)",
: "=r" (answer) : "r" (factor1), "r" (factor2) : "eax" )
(Don't forget the extra clobber in the last line).
There are, of course, variations; this may be more efficient if the code is in a loop:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl (%3,%2,4)",
: "=r" (answer) : "r" (factor1), "r" (factor2), "r"(tables) )
This is intended to be an addition to Mats Petersson's answer - I wrote it simply because it wasn't immediately clear to me why OP's analysis of the disassembly (that his assembly and the compiler-generated one were equivalent) was incorrect.
As Mats Petersson explains, the problem is that tables is actually a pointer to an array, so to access an element, you have to dereference twice. Now to me, it wasn't immediately clear where this happens in the compiler-generated code. The culprit is this innocent-looking line:
a1 18 e0 47 00 mov 0x47e018,%eax
To the untrained eye (that includes mine), this might look like the value 0x47e018 is moved to eax, but it's actually not. The Intel-syntax representation of the same opcodes gives us a clue:
a1 18 e0 47 00 mov eax,ds:0x47e018
Ah - ds: - so it's not actually a value, but an address!
For anyone who is wondering, the following would be the opcodes and AT&T-syntax assembly for moving the immediate value 0x47e018 into eax:
b8 18 e0 47 00 mov $0x47e018,%eax
I'm checking the Release build of my project, compiled with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to build the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: First up, there's an intrinsic for rep movsb, which Peter Cordes tells us would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to do things this way, see __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
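A hedged sketch of what using that intrinsic could look like for the loop above (my addition; the function name is made up, and the parameter names mirror the question's pDst, pSrc and ncbSzBuffDataUsed). It assumes MSVC and <intrin.h>:
#include <intrin.h>   // MSVC intrinsics, including __movsb

void copy_with_rep_movsb(unsigned char* pDst, const unsigned char* pSrc, size_t count)
{
    __movsb(pDst, pSrc, count);   // expands to rep movsb; the compiler sets up rdi/rsi/rcx itself
}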
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb, the compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function, it will, and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
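As a hedged aside (my addition, not part of the original answer): one way to tell the compiler that two ranges cannot overlap is to express the copy as memcpy, whose contract forbids overlap, which leaves the compiler free to vectorise it or expand it to rep movs. A sketch, with parameter names mirroring the question's variables:
#include <cstring>

void copy_buffer(unsigned char* pDst, const unsigned char* pSrc, size_t count)
{
    std::memcpy(pDst, pSrc, count);   // non-overlap is part of memcpy's contract
}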
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64
#pragma pack(push)
#pragma pack(1)
typedef struct {
ULONG64 ui0;
USHORT w0;
USHORT w1;
//Followed by:
// PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned somewhere in the answer on this page, it is clear that the compiler was starved of registers to have produced the following monstrosity (see how it reused rax for the loop end limit, or the movzx eax,word ptr [r13] instruction that could clearly have been hoisted out of the loop).
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
pDstClbks < pEndDstClbks;
pScrClbks++, pDstClbks++)
{
*pDstClbks = *pScrClbks;
}
this produces more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w
I have a simple polymorphic construction, with one pure virtual function Foo.
The only flaw in the big project it's used in is that the project uses a couple of global statics for centralized parameter loading and for event logging (can't easily get rid of that legacy code).
Project info:
Platform toolset: v110_xp
MFC in static library
MBCS charset
calling convention: __cdecl
All optimizations disabled
Warning level 4, no warnings on the whole project
Code:
class Base
{
public:
Base(){}
virtual ~Base(void){}
virtual void Foo(void) = 0;
};
class Derived
: public Base
{
public:
Derived(void) : Base(){}
virtual void Foo(void) override
{
double a = sqrt(4.9);
double b = -a;
}
};
Calling code (doesn't really matter, same behaviour everywhere)
BOOL MainMFCApp::InitInstance()
{
Derived* d = new Derived();
d->Foo();
delete d;
...
}
The problem is that when run in debug (not tested with release), and when we end up inside function Foo, the this pointer is 'corrupted':
this = 0xcccccccc
this.__vfptr = <unable to read memory>
When I dive into the assembly code entering the function I see the following:
13:
14:
15: void Derived::Foo(void)
16: {
015E3570 55 push ebp
015E3571 8B EC mov ebp,esp
015E3573 83 E4 F8 and esp,0FFFFFFF8h
015E3576 81 EC EC 00 00 00 sub esp,0ECh
015E357C 53 push ebx
015E357D 56 push esi
015E357E 57 push edi
015E357F 51 push ecx
015E3580 8D BD 14 FF FF FF lea edi,[ebp-0ECh]
015E3586 B9 3B 00 00 00 mov ecx,3Bh
015E358B B8 CC CC CC CC mov eax,0CCCCCCCCh
015E3590 F3 AB rep stos dword ptr es:[edi]
015E3592 59 pop ecx
015E3593 89 8C 24 F0 00 00 00 mov dword ptr [esp+0F0h],ecx
17: double a = sqrt(4.9);
015E359A F2 0F 10 05 00 55 09 02 movsd xmm0,mmword ptr ds:[2095500h]
015E35A2 E8 72 D4 FD FF call __libm_sse2_sqrt_precise (015C0A19h)
015E35A7 F2 0F 11 84 24 E0 00 00 00 movsd mmword ptr [esp+0E0h],xmm0
18: double b = -a;
015E35B0 F2 0F 10 84 24 E0 00 00 00 movsd xmm0,mmword ptr [esp+0E0h]
015E35B9 66 0F 57 05 10 55 09 02 xorpd xmm0,xmmword ptr ds:[2095510h]
015E35C1 F2 0F 11 84 24 D0 00 00 00 movsd mmword ptr [esp+0D0h],xmm0
19: return;
20: }
015E35CA 5F pop edi
015E35CB 5E pop esi
015E35CC 5B pop ebx
015E35CD 8B E5 mov esp,ebp
015E35CF 5D pop ebp
015E35D0 C3 ret
--- No source file -------------------------------------------------------------
015E35D1 CC int 3
...
015E35EF CC int 3
Breakpoint at line 17, right before entering the function body: Using the watch window to inspect the object behind register ecx (with a cast to Derived*) shows that ecx contains the pointer I need (to the object), but for some reason it is mov'ed to the seemingly random address [esp+0F0h].
And now the really interesting/flabbergasting part: When I change
double b = -a;
to
double b = -1.0 * a;
and compile again, everything magically works. The function assembly has now changed to:
13:
14:
15: void Derived::Foo(void)
16: {
00863570 55 push ebp
00863571 8B EC mov ebp,esp
00863573 81 EC EC 00 00 00 sub esp,0ECh
00863579 53 push ebx
0086357A 56 push esi
0086357B 57 push edi
0086357C 51 push ecx
0086357D 8D BD 14 FF FF FF lea edi,[ebp-0ECh]
00863583 B9 3B 00 00 00 mov ecx,3Bh
00863588 B8 CC CC CC CC mov eax,0CCCCCCCCh
0086358D F3 AB rep stos dword ptr es:[edi]
0086358F 59 pop ecx
00863590 89 4D F8 mov dword ptr [this],ecx
17: double a = sqrt(4.9);
00863593 F2 0F 10 05 00 55 31 01 movsd xmm0,mmword ptr ds:[1315500h]
0086359B E8 79 D4 FD FF call __libm_sse2_sqrt_precise (0840A19h)
008635A0 F2 0F 11 45 E8 movsd mmword ptr [a],xmm0
18: double b = -1.0 * a;
008635A5 F2 0F 10 05 10 55 31 01 movsd xmm0,mmword ptr ds:[1315510h]
008635AD F2 0F 59 45 E8 mulsd xmm0,mmword ptr [a]
008635B2 F2 0F 11 45 D8 movsd mmword ptr [b],xmm0
19: return;
20: }
008635B7 5F pop edi
008635B8 5E pop esi
008635B9 5B pop ebx
008635BA 81 C4 EC 00 00 00 add esp,0ECh
008635C0 3B EC cmp ebp,esp
008635C2 E8 32 CD FC FF call __RTC_CheckEsp (08302F9h)
008635C7 8B E5 mov esp,ebp
008635C9 5D pop ebp
008635CA C3 ret
--- No source file -------------------------------------------------------------
008635CB CC int 3
...
008635EF CC int 3
Now the generated code nicely moves the pointer in register ecx to this. Other differences:
different memory addresses/offsets
mulsd instead of xorpd to negate the variable
and esp,0FFFFFFF8h disappeared (?? used to align the stack pointer esp ??)
more cleanup (after the function body)?? (add cmp call)
The assembly part where the parameters get pushed to the stack is the same for both situations:
53: d->Foo();
011A500B 8B 45 E0 mov eax,dword ptr [d]
011A500E 8B 10 mov edx,dword ptr [eax]
011A5010 8B F4 mov esi,esp
011A5012 8B 4D E0 mov ecx,dword ptr [d]
011A5015 8B 42 04 mov eax,dword ptr [edx+4]
011A5018 FF D0 call eax
Of course, when I try to replicate this with a Minimal, Complete, and Verifiable example, everything works as intended. But in my big project, it fails consistently.
I'm not sure which parameters can influence compilation, and I don't know enough assembly to even see what's going on there; therefore I'm asking here in the hope that someone has seen this before or recognizes this behaviour.
Note: it also works again, when I remove the sqrt call.
Update:
No problems in release
VS2012 SP4 (v11.0.61030.00)
problem persists when Foo references member variables (instead of no member references, as above)
TODO: try without global statics
I am trying to write code that is as efficient as possible, and I encountered the following situation:
int foo(int a, int b, int c)
{
return (a + b) % c;
}
All good! But what if I want to check whether the result of the expression is different from a constant, let's say myConst? Let's say I can afford a temporary variable.
Which of the following methods is the fastest:
int foo(int a, int b, int c)
{
return (((a + b) % c) != myConst) ? (a + b) % c : myException;
}
or
int foo(int a, int b, int c)
{
int tmp = (a + b) % c;
return (tmp != myConst) ? tmp : myException;
}
I can't decide. Where is the 'line' at which recalculation becomes more expensive than allocating and deallocating a temporary variable, or the other way around?
Don't worry about it; write concise code and leave micro-optimizations to the compiler.
In your example, writing the same calculation twice is error prone - so do not do this. In your specific example, the compiler is more than likely to avoid creating a temporary on the stack at all!
Your example can (and does on my compiler) produce the following assembly (I have replaced myConst with constexpr 42 and myException with 0):
foo(int, int, int):
leal (%rdi,%rsi), %eax # this adds a and b, puts result to eax
movl %edx, %ecx # loads c
cltd
idivl %ecx # performs the division, puts the remainder (the % result) into edx
movl $0, %eax #, prepares to return exception value
cmpl $42, %edx #, compares result of division with magic const
cmovne %edx, %eax # overwrites pessimized exception if all is cool
ret
As you see, there is no temporary anywhere in sight!
Use the latter.
You're not computing the same value twice.
The code is more clear.
Creating local variables on the stack doesn't take any significant amount of time.
Check the assembler code this generates for both versions. You most likely want the highest optimization settings for your compiler.
You may very well find that the compiler itself can figure out that the intermediate value is used twice, but only inside the function, so it is safe to keep in a register.
To add to what has already been posted, ease of debugging is at least as important as code efficiency (if there is any effect on code efficiency at all, which, as others have posted, is unlikely with optimization on).
Go with the easiest to follow, test and debug.
Use a temp var.
If more developers used simpler, non-compound expressions and more temp vars, there would be far fewer 'Help - I cannot debug my code!' posts to SO.
Hard coded values
The following only applies to hard-coded values (even if they aren't const or constexpr).
In the following example, on MSVC 2015, the calls were optimized away completely and replaced with just mov edx,1 (1 being the result in this example):
#include <iostream>
#include <exception>
int myConst{4};
int myException{2};
int foo1(int a,int b,int c)
{
return (((a + b) % c) != myConst) ? (a + b) % c : myException;
}
int foo2(int a,int b,int c)
{
int tmp = (a + b) % c;
return (tmp != myConst) ? tmp : myException;
}
int main()
{
00007FF71F0E1000 48 83 EC 28 sub rsp,28h
auto test1{foo1(5,2,3)};
auto test2{foo2(5,2,3)};
std::cout << test1 <<'\n';
00007FF71F0E1004 48 8B 0D 75 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF71F0E3080h)]
00007FF71F0E100B BA 01 00 00 00 mov edx,1
00007FF71F0E1010 FF 15 72 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF71F0E3088h)]
00007FF71F0E1016 48 8B C8 mov rcx,rax
00007FF71F0E1019 E8 B2 00 00 00 call std::operator<<<std::char_traits<char> > (07FF71F0E10D0h)
std::cout << test2 <<'\n';
00007FF71F0E101E 48 8B 0D 5B 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF71F0E3080h)]
00007FF71F0E1025 BA 01 00 00 00 mov edx,1
00007FF71F0E102A FF 15 58 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF71F0E3088h)]
00007FF71F0E1030 48 8B C8 mov rcx,rax
00007FF71F0E1033 E8 98 00 00 00 call std::operator<<<std::char_traits<char> > (07FF71F0E10D0h)
return 0;
00007FF71F0E1038 33 C0 xor eax,eax
}
00007FF71F0E103A 48 83 C4 28 add rsp,28h
00007FF71F0E103E C3 ret
At this point others have pointed out that the optimization would not happen if the values were passed in, or if we had separate files, but it seems that even if the code is in separate compilation units the optimization is still done and we don't get any instructions for these functions:
#include <iostream>
#include <exception>
#include "Header.h"
int main()
{
00007FF667BF1000 48 83 EC 28 sub rsp,28h
int var1{5},var2{2},var3{3};
auto test1{foo1(var1,var2,var3)};
auto test2{foo2(var1,var2,var3)};
std::cout << test1 <<'\n';
00007FF667BF1004 48 8B 0D 75 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF667BF3080h)]
00007FF667BF100B BA 01 00 00 00 mov edx,1
00007FF667BF1010 FF 15 72 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF667BF3088h)]
00007FF667BF1016 48 8B C8 mov rcx,rax
00007FF667BF1019 E8 B2 00 00 00 call std::operator<<<std::char_traits<char> > (07FF667BF10D0h)
std::cout << test2 <<'\n';
00007FF667BF101E 48 8B 0D 5B 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF667BF3080h)]
00007FF667BF1025 BA 01 00 00 00 mov edx,1
00007FF667BF102A FF 15 58 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF667BF3088h)]
00007FF667BF1030 48 8B C8 mov rcx,rax
00007FF667BF1033 E8 98 00 00 00 call std::operator<<<std::char_traits<char> > (07FF667BF10D0h)
return 0;
00007FF667BF1038 33 C0 xor eax,eax
}
00007FF667BF103A 48 83 C4 28 add rsp,28h
00007FF667BF103E C3 ret
I have a spin lock implemented with the xchg instruction. The C++ function takes in the resource to be locked.
Following is the code:
void SpinLock::lock( u32& resource )
{
__asm__ __volatile__
(
"mov ebx, %0\n\t"
"InUseLoop:\n\t"
"mov eax, 0x01\n\t" /* 1=In Use*/
"xchg eax, [ebx]\n\t"
"cmp eax, 0x01\n\t"
"je InUseLoop\n\t"
:"=r"(resource)
:"r"(resource)
:"eax","ebx"
);
}
void SpinLock::unlock(u32& resource )
{
__asm__ __volatile__
(
/* "mov DWORD PTR ds:[%0],0x00\n\t" */
"mov ebx, %0\n\t"
"mov DWORD PTR [ebx], 0x00\n\t"
:"=r"(resource)
:"r"(resource)
: "ebx"
);
}
This code is compiled with gcc 4.5.2 and -masm=intel on a 64-bit Intel machine.
objdump produces the following assembly for the above functions:
0000000000490968 <_ZN8SpinLock4lockERj>:
490968: 55 push %rbp
490969: 48 89 e5 mov %rsp,%rbp
49096c: 53 push %rbx
49096d: 48 89 7d f0 mov %rdi,-0x10(%rbp)
490971: 48 8b 45 f0 mov -0x10(%rbp),%rax
490975: 8b 10 mov (%rax),%edx
490977: 89 d3 mov %edx,%ebx
0000000000490979 <InUseLoop>:
490979: b8 01 00 00 00 mov $0x1,%eax
49097e: 67 87 03 addr32 xchg %eax,(%ebx)
490981: 83 f8 01 cmp $0x1,%eax
490984: 74 f3 je 490979 <InUseLoop>
490986: 48 8b 45 f0 mov -0x10(%rbp),%rax
49098a: 89 10 mov %edx,(%rax)
49098c: 5b pop %rbx
49098d: c9 leaveq
49098e: c3 retq
49098f: 90 nop
0000000000490990 <_ZN8SpinLock6unlockERj>:
490990: 55 push %rbp
490991: 48 89 e5 mov %rsp,%rbp
490994: 53 push %rbx
490995: 48 89 7d f0 mov %rdi,-0x10(%rbp)
490999: 48 8b 45 f0 mov -0x10(%rbp),%rax
49099d: 8b 00 mov (%rax),%eax
49099f: 89 d3 mov %edx,%ebx
4909a1: 67 c7 03 00 00 00 00 addr32 movl $0x0,(%ebx)
4909a8: 48 8b 45 f0 mov -0x10(%rbp),%rax
4909ac: 89 10 mov %edx,(%rax)
4909ae: 5b pop %rbx
4909af: c9 leaveq
4909b0: c3 retq
4909b1: 90 nop
The code dumps core when executing the locking operation.
Is there something grossly wrong here?
Regards,
-J
First, why are you using truncated 32-bit addresses in your assembly code whereas the rest of the program is compiled to execute in 64-bit mode and operate with 64-bit addresses/pointers? I'm referring to ebx. Why is it not rbx?
Second, why are you trying to return a value from the assembly code with "=r"(resource)? Your functions change the in-memory value with xchg eax, [ebx] and mov DWORD PTR [ebx], 0x00 and return void. Remove "=r"(resource).
Lastly, if you look closely at the disassembly of SpinLock::lock(), can't you see something odd about ebx?:
mov %rdi,-0x10(%rbp)
mov -0x10(%rbp),%rax
mov (%rax),%edx
mov %edx,%ebx
<InUseLoop>:
mov $0x1,%eax
addr32 xchg %eax,(%ebx)
In this code, the ebx value, which is an address/pointer, does not come directly from the function's parameter (rdi); the parameter first gets dereferenced with mov (%rax),%edx - but why? If you throw away all the confusing C++ reference stuff, technically, the function receives a pointer to u32, not a pointer to a pointer to u32, and thus needs no extra dereference anywhere.
The problem is here: "r"(resource). It must be "r"(&resource).
A small 32-bit test app demonstrates this problem:
#include <iostream>
using namespace std;
void unlock1(unsigned& resource)
{
__asm__ __volatile__
(
/* "mov DWORD PTR ds:[%0],0x00\n\t" */
"movl %0, %%ebx\n\t"
"movl $0, (%%ebx)\n\t"
:
:"r"(resource)
:"ebx"
);
}
void unlock2(unsigned& resource)
{
__asm__ __volatile__
(
/* "mov DWORD PTR ds:[%0],0x00\n\t" */
"movl %0, %%ebx\n\t"
"movl $0, (%%ebx)\n\t"
:
:"r"(&resource)
:"ebx"
);
}
unsigned blah;
int main(void)
{
blah = 3456789012u;
cout << "before unlock2() blah=" << blah << endl;
unlock2(blah);
cout << "after unlock2() blah=" << blah << endl;
blah = 3456789012u;
cout << "before unlock1() blah=" << blah << endl;
unlock1(blah); // may crash here, but if it doesn't, it won't change blah
cout << "after unlock1() blah=" << blah << endl;
return 0;
}
Output:
before unlock2() blah=3456789012
after unlock2() blah=0
before unlock1() blah=3456789012
Exiting due to signal SIGSEGV
General Protection Fault at eip=000015eb
eax=ce0a6a14 ...
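Putting the fixes together, here is a hedged sketch (mine, not the answerer's code) of what a corrected lock() could look like, assuming x86-64 and GCC's default AT&T syntax rather than -masm=intel:
void SpinLock::lock( u32& resource )
{
    __asm__ __volatile__
    (
        "1:\n\t"
        "movl $1, %%eax\n\t"      /* 1 = in use */
        "xchg %%eax, (%0)\n\t"    /* atomic swap; xchg with a memory operand implies lock */
        "cmpl $1, %%eax\n\t"
        "je 1b\n\t"
        :
        : "r"(&resource)          /* pass the address, not the value */
        : "eax", "memory"
    );
}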
After some searching on our friend google, I could not get a clear view on the following point.
I'm used to calling class members with this->. Even if it's not needed, I find it more explicit, and it helps when maintaining some heavy piece of an algorithm with loads of vars.
As I'm working on a supposed-to-be-optimised algorithm, I was wondering whether using this-> would alter runtime performance or not.
Does it?
No, the call is exactly the same in both cases.
It doesn't make any difference. Here's a demonstration with GCC. The source is a simple class, but I've restricted this post to the difference for clarity.
% diff -s with-this.cpp without-this.cpp
7c7
< this->x = 5;
---
> x = 5;
% g++ -c with-this.cpp without-this.cpp
% diff -s with-this.o without-this.o
Files with-this.o and without-this.o are identical
The answer has been given by zennehoy; here's the assembly code (generated by the Microsoft C++ compiler) for a simple test class:
class C
{
int n;
public:
void boo(){n = 1;}
void goo(){this->n = 2;}
};
int main()
{
C c;
c.boo();
c.goo();
return 0;
}
The Disassembly window in Visual Studio shows that the assembly code is the same for both functions:
class C
{
int n;
public:
void boo(){n = 1;}
001B2F80 55 push ebp
001B2F81 8B EC mov ebp,esp
001B2F83 81 EC CC 00 00 00 sub esp,0CCh
001B2F89 53 push ebx
001B2F8A 56 push esi
001B2F8B 57 push edi
001B2F8C 51 push ecx
001B2F8D 8D BD 34 FF FF FF lea edi,[ebp-0CCh]
001B2F93 B9 33 00 00 00 mov ecx,33h
001B2F98 B8 CC CC CC CC mov eax,0CCCCCCCCh
001B2F9D F3 AB rep stos dword ptr es:[edi]
001B2F9F 59 pop ecx
001B2FA0 89 4D F8 mov dword ptr [ebp-8],ecx
001B2FA3 8B 45 F8 mov eax,dword ptr [this]
001B2FA6 C7 00 01 00 00 00 mov dword ptr [eax],1
001B2FAC 5F pop edi
001B2FAD 5E pop esi
001B2FAE 5B pop ebx
001B2FAF 8B E5 mov esp,ebp
001B2FB1 5D pop ebp
001B2FB2 C3 ret
...
--- ..\main.cpp -----------------------------
void goo(){this->n = 2;}
001B2FC0 55 push ebp
001B2FC1 8B EC mov ebp,esp
001B2FC3 81 EC CC 00 00 00 sub esp,0CCh
001B2FC9 53 push ebx
001B2FCA 56 push esi
001B2FCB 57 push edi
001B2FCC 51 push ecx
001B2FCD 8D BD 34 FF FF FF lea edi,[ebp-0CCh]
001B2FD3 B9 33 00 00 00 mov ecx,33h
001B2FD8 B8 CC CC CC CC mov eax,0CCCCCCCCh
001B2FDD F3 AB rep stos dword ptr es:[edi]
001B2FDF 59 pop ecx
001B2FE0 89 4D F8 mov dword ptr [ebp-8],ecx
001B2FE3 8B 45 F8 mov eax,dword ptr [this]
001B2FE6 C7 00 02 00 00 00 mov dword ptr [eax],2
001B2FEC 5F pop edi
001B2FED 5E pop esi
001B2FEE 5B pop ebx
001B2FEF 8B E5 mov esp,ebp
001B2FF1 5D pop ebp
001B2FF2 C3 ret
And the code in the main:
C c;
c.boo();
001B2F0E 8D 4D F8 lea ecx,[c]
001B2F11 E8 00 E4 FF FF call C::boo (1B1316h)
c.goo();
001B2F16 8D 4D F8 lea ecx,[c]
001B2F19 E8 29 E5 FF FF call C::goo (1B1447h)
The Microsoft compiler uses the __thiscall calling convention by default for class member calls, and the this pointer is passed via the ECX register.
There are several layers involved in the compilation of a language.
The difference between accessing a member as member, this->member, MyClass::member, etc. is a syntactic difference.
More precisely, it's a matter of name lookup, and how the front-end of the compiler will "find" the exact element you are referring to. Therefore, you might speed up compilation by being more precise... though it will be unnoticeable (there are much more time-consuming tasks involved in C++, like opening all those includes).
Since (in this case) you are referring to the same element, it should not matter.
Now, an interesting parallel can be drawn with interpreted languages. In an interpreted language, the name lookup will be delayed to the moment where the line (or function) is called. Therefore, it could have an impact at runtime (though, once again, probably not really noticeable).
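To make the "syntactic difference" point concrete, here is a tiny illustration (mine, not the answerer's): all three spellings name the same member, so after name lookup they compile to identical code.
class MyClass
{
    int member;
public:
    void set(int v)
    {
        member = v;            // plain member access
        this->member = v;      // explicit access through this
        MyClass::member = v;   // qualified name
    }
};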