Which D compilers will perform tail-call optimization on this function?

Which D compilers will perform tail-call optimization on this function? - d

string reverse(string str) pure nothrow
{
string reverse_impl(string temp, string str) pure nothrow
{
if (str.length == 0)
{
return temp;
}
else
{
return reverse_impl(str[0] ~ temp, str[1..$]);
}
}
return reverse_impl("", str);
}
As far as I know, this code should be subject to tail-call optimization, but I can't tell if DMD is doing it or not. Which of the D compilers support tail-call optimization, and will they perform it on this function?

From looking at the disassembly, DMD performs TCO on your code:
_D4test7reverseFNaNbAyaZAya12reverse_implMFNaNbAyaAyaZAya comdat
assume CS:_D4test7reverseFNaNbAyaZAya12reverse_implMFNaNbAyaAyaZAya
L0: sub ESP,0Ch
push EBX
push ESI
cmp dword ptr 018h[ESP],0
jne L1C
LC: mov EDX,024h[ESP]
mov EAX,020h[ESP]
pop ESI
pop EBX
add ESP,0Ch
ret 010h
L1C: push dword ptr 024h[ESP]
mov EAX,1
mov EDX,offset FLAT:_D12TypeInfo_Aya6__initZ
push dword ptr 024h[ESP]
mov ECX,024h[ESP]
push ECX
push EAX
push EDX
call near ptr __d_arraycatT
mov EBX,02Ch[ESP]
mov ESI,030h[ESP]
mov 034h[ESP],EAX
dec EBX
lea ECX,1[ESI]
mov 01Ch[ESP],EBX
mov 020h[ESP],ECX
mov 02Ch[ESP],EBX
mov 030h[ESP],ECX
mov 038h[ESP],EDX
add ESP,014h
cmp dword ptr 8[ESP],0
jne L1C
jmp short LC
_D4test7reverseFNaNbAyaZAya12reverse_implMFNaNbAyaAyaZAya ends
end

A very good resource for quickly looking at the code generated by gdc is http://d.godbolt.org/. We currently don't have a dmd equivalent.

Related

Finding max number between two, which implementation to choose

I am trying to figure out, which implementation has edge over other while finding max number between two. As an example let's examine two implementation:
Implementation 1:
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4 .L2:
mov eax, DWORD PTR [rbp-8] .L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produced more assembly instructions but first one has conditional jump. Just trying to understand if both are equally good.

First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
int findMax2(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
int findMax3(int a,int b){
return std::max(a,b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in identical assembly as std::max, while your second variant is doing more. Actually when trying to optimize you need to specify what you optimize for. There are several options that typically require a trade-off to be made: Runtime, memory usage, size of executable, readability of code, etc. Typically you cannot get it all at once.
When in doubt, do not reinvent a wheel but use existing already optimzied std::max. And do not forget that code you write is not instructions for your CPU, rather it is a high level abstract description of what the program should do. Its the compilers job to figure out how that can be achieved best.
Last but not least, your second variant is actually broken. See example here compiled with -O2 -fsanitize=signed-integer-overflow, results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. And because of that, readability is next on the list. Code that is difficult to read and understand is also difficult to proove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a,b) is unlikely to cause undefined behavior (and even if it does, at least it isnt your fault ;).

For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately. Although I conject that a compare and jump is probably the best way on the whole, even at the expense of a pipeline dump.
Using std::max gives the compiler the best optimisation hint.

Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 performs well on a RISC GPU like from nVIDIA or AMD Graphics.
The term "performs well" is only significant in a tight loop.

C++ inline asm move WCHAR in 32-bit register

I am trying to practice the inline ASM in C++ :) Maybe outdated, but it is interesting, to know how CPU is executing the code.
So, what I am trying to do here, is to loop through processes and get a handle of needed one :) I am using for that already created methods from tlhelp32
I have this code:
HANDLE RetHandle = nullptr, snap;
int SizeOfPE = sizeof(PROCESSENTRY32), pid; PROCESSENTRY32 pe;
int PA = PROCESS_ALL_ACCESS;
const char* Pname = "explorer.exe";
__asm
{
mov eax, pe
mov ebx, this
mov ecx, [ebx]pe.dwSize
mov ecx, SizeOfPE
mov[ebx]pe.dwSize, ecx
mov eax, PA
mov ebx,0
call CreateToolhelp32Snapshot
mov eax,snap
label1:
mov eax, snap
mov ebx, [pe]
call Process32First
cmp eax,1
jne exitLabel
Process32NextLoop:
mov eax, snap
mov ebx, [pe]
call Process32Next
cmp eax, 1
jne Process32NextLoop
mov edx, pe
mov ecx, [edx].szExeFile
cmp ecx, Pname
je ExitLoop
jne Process32NextLoop
ExitLoop:
mov eax, [ebx].th32ProcessID
mov pid, eax
ExitLabel:
ret
}
Apparently, it is throwing error in th32ProcessID as well, however, it is just regular int.
Have been searching, but haven't found the equivalent for movl in C++

Trying to understand ASM code

EDIT
I switched from memcmp to a home brewed 13 byte compare function and the homebrew doesnt have the extra instructions. So all I can guess is that the extra assembly is just a flaw in the optimizer.
if (!EQ13(&ti, &m_ti)) { // in 2014, memcmp was not being optimzied here
000007FEF91B2CFE mov rdx,qword ptr [rsp]
000007FEF91B2D02 movzx eax,byte ptr [rsp+0Ch]
000007FEF91B2D07 mov ecx,dword ptr [rsp+8]
000007FEF91B2D0B cmp rdx,qword ptr [r10+28h]
000007FEF91B2D0F jne TSccIter::SetTi+9Dh (7FEF91B2D1Dh)
000007FEF91B2D11 cmp ecx,dword ptr [r10+30h]
000007FEF91B2D15 jne TSccIter::SetTi+9Dh (7FEF91B2D1Dh)
000007FEF91B2D17 cmp al,byte ptr [r10+34h]
000007FEF91B2D1B je TSccIter::SetTi+0B1h (7FEF91B2D31h)
My homebrew isn't perfect in this case since it does 3 movs at the start even though it is unlikely to ever check past the first mov. I need to work on that part.
ORIGINAL QUESTION
Here is asm code from msvc 2010 showing how it can optimze a small, fixed-sized memcmp (in this case, 13 bytes). I've seen this type of optimization a lot in our code, but never with the last 6 lines. Can anyone tell me why the last 6 lines of assembly are there? TransferItem is 13 bytes so that explains the QWORD, DWORD, then BYTE cmps.
struct TransferItem {
char m_szCxrMkt1[3];
char m_szCxrOp1[3];
char m_chDelimiter;
char m_szCxrMkt2[3];
char m_szCxrOp2[3];
};
...
if (memcmp(&ti, &m_ti, sizeof(TransferItem))) {
2B8E lea rax,[rsp]
2B92 mov rdx,qword ptr [rax]
2B95 cmp rdx,qword ptr [r10+28h]
2B99 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2B9B mov edx,dword ptr [rax+8]
2B9E cmp edx,dword ptr [r10+30h]
2BA2 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BA4 movzx edx,byte ptr [rax+0Ch]
2BA8 cmp dl,byte ptr [r10+34h]
2BAC jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BAE xor eax,eax
2BB0 jmp TSccIter::SetTi+0A7h (7FEF9302BB7h)
2BB2 sbb eax,eax
2BB4 sbb eax,0FFFFFFFFh
2BB7 test eax,eax
2BB9 je TSccIter::SetTi+0CCh (7FEF9302BDCh)
Also what is the point of xor eax,eax which we know will be zero and then testing that for that known to be zero on line 2bb7?
Here is the whole function
// fWildCard means match certain fields to '**' in the db
// szCxrMkt1,2 are required and cannot be null, ' ', or '\0\0'.
// szCxrOp1,2 can be null, ' ', or '\0\0'.
TSccIter& SetTi(bool fWildCard, LPCSTR szCxrMkt1, LPCSTR szCxrOp1, LPCSTR szCxrMkt2, LPCSTR szCxrOp2) {
if (m_fSkipSet)
return *this;
m_iSid = -1; // resets the iterator to search from the start
// Pad the struct to 16 bytes so we can clear it with 2 QWORDS
// We use a temp, ti, to detect if the new transferitem has changed
class TransferItemPadded : public TransferItem {
char padding[16 - sizeof(TransferItem)]; // get us to 16 bytes
} ti;
U8(&ti) = U8(BUMP(&ti, 8)) = 0x2020202020202020; // 8 spaces
// copy in the params
CPY2(ti.m_szCxrMkt1, szCxrMkt1);
if (szCxrOp1 && *szCxrOp1)
CPY2(ti.m_szCxrOp1, szCxrOp1);
ti.m_chDelimiter = (fWildCard) ? '*' : ':'; // this controls wild card matching
CPY2(ti.m_szCxrMkt2, szCxrMkt2);
if (szCxrOp2 && *szCxrOp2)
CPY2(ti.m_szCxrOp2, szCxrOp2);
// see if different
if (memcmp(&ti, &m_ti, sizeof(TransferItem))) {
memcpy(&m_ti, &ti, sizeof(TransferItem));
m_fQryChanged = true;
}
return *this;
}
typedef unsigned __int64 U8;
#define CPY2(a,b) ((*(WORD*)a) = (*(WORD*)b))
And here's the whole asm
TSccIter& SetTi(bool fWildCard, LPCSTR szCxrMkt1, LPCSTR szCxrOp1, LPCSTR szCxrMkt2, LPCSTR szCxrOp2) {
2B10 sub rsp,18h
if (m_fSkipSet)
2B14 cmp byte ptr [rcx+0EAh],0
2B1B mov r10,rcx
return *this;
2B1E jne TSccIter::SetTi+0CCh (7FEF9302BDCh)
m_iSid = -1;
class TransferItemPadded : public TransferItem {
char padding[16 - sizeof(TransferItem)];
} ti;
U8(&ti) = U8(BUMP(&ti, 8)) = 0x2020202020202020;
2B24 mov rax,2020202020202020h
2B2E mov byte ptr [rcx+36h],0FFh
2B32 mov qword ptr [rsp],rax
2B36 mov qword ptr [rsp+8],rax
CPY2(ti.m_szCxrMkt1, szCxrMkt1);
2B3B movzx eax,word ptr [r8]
2B3F mov word ptr [rsp],ax
if (szCxrOp1 && *szCxrOp1)
2B43 test r9,r9
2B46 je TSccIter::SetTi+47h (7FEF9302B57h)
2B48 cmp byte ptr [r9],0
2B4C je TSccIter::SetTi+47h (7FEF9302B57h)
CPY2(ti.m_szCxrOp1, szCxrOp1);
2B4E movzx eax,word ptr [r9]
2B52 mov word ptr [rsp+3],ax
ti.m_chDelimiter = (fWildCard) ? '*' : ':';
2B57 mov eax,3Ah
2B5C mov ecx,2Ah
2B61 test dl,dl
2B63 cmovne eax,ecx
2B66 mov byte ptr [rsp+6],al
CPY2(ti.m_szCxrMkt2, szCxrMkt2);
2B6A mov rax,qword ptr [szCxrMkt2]
2B6F movzx ecx,word ptr [rax]
if (szCxrOp2 && *szCxrOp2)
2B72 mov rax,qword ptr [szCxrOp2]
2B77 mov word ptr [rsp+7],cx
2B7C test rax,rax
2B7F je TSccIter::SetTi+7Eh (7FEF9302B8Eh)
2B81 cmp byte ptr [rax],0
2B84 je TSccIter::SetTi+7Eh (7FEF9302B8Eh)
CPY2(ti.m_szCxrOp2, szCxrOp2);
2B86 movzx eax,word ptr [rax]
2B89 mov word ptr [rsp+0Ah],ax
if (memcmp(&ti, &m_ti, sizeof(TransferItem))) {
2B8E lea rax,[rsp]
2B92 mov rdx,qword ptr [rax]
2B95 cmp rdx,qword ptr [r10+28h]
2B99 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2B9B mov edx,dword ptr [rax+8]
2B9E cmp edx,dword ptr [r10+30h]
2BA2 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BA4 movzx edx,byte ptr [rax+0Ch]
2BA8 cmp dl,byte ptr [r10+34h]
2BAC jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BAE xor eax,eax
2BB0 jmp TSccIter::SetTi+0A7h (7FEF9302BB7h)
2BB2 sbb eax,eax
2BB4 sbb eax,0FFFFFFFFh
2BB7 test eax,eax
2BB9 je TSccIter::SetTi+0CCh (7FEF9302BDCh)
memcpy(&m_ti, &ti, sizeof(TransferItem));
2BBB mov rax,qword ptr [rsp]
m_fQryChanged = true;
2BBF mov byte ptr [r10+0E9h],1
2BC7 mov qword ptr [r10+28h],rax
2BCB mov eax,dword ptr [rsp+8]
2BCF mov dword ptr [r10+30h],eax
2BD3 movzx eax,byte ptr [rsp+0Ch]
2BD8 mov byte ptr [r10+34h],al
}
return *this;
2BDC mov rax,r10
}

2bb7 can be reached by different code paths: via taken jumps at 2b99, 2ba2 and 2bac, as well as directly when none of the conditional jumps is taken. The xor eax,eax is only executed at the last path, and it ensures that eax is 0 - which is apparently not the case otherwise.

The last 6 lines return the value in eax == 0 for a match, and also set the SF and ZF condition codes.

test eax, eax will test whether eax AND eax == 0. The following je will jump if zero.
And xor eax, eax is an efficient way to encode "eax = 0". It is more efficient than mov eax, 0
EDIT: Initially misread the question. It looks like something will happen at "TSccIter::SetTi+0A7h" which should change the value?
Also, the SBB trick to replicate the carry(2BB2-2BB4) is explained here:
http://compgroups.net/comp.lang.asm.x86/trick-with-sbb-instruction/20164

Need help decrypting my assembly program

I have this encryption program in C++ and ASM which has got encryption routines but I need to know
how the decryption routine for it should look like .
This is the code :
//-ENCRYPTION ROUTINES
void encrypt_chars (int length, char EKey)
{ char temp_char;
for (int i = 0; i < length; i++)
{ temp_char = OChars [i];
__asm {
push eax
push ecx
movzx ecx,temp_char
lea eax,EKey
call encrypt
mov temp_char,al
pop ecx
pop eax
}
EChars [i] = temp_char;
}
return;
// --- Start of Assembly code
__asm {
encrypt5: push eax
mov al,byte ptr [eax]
push ecx
and eax,0x7C
ror eax,1
ror eax,1
inc eax
mov edx,eax
pop ecx
pop eax
mov byte ptr [eax],dl
xor edx,ecx
mov eax,edx
rol al,1
ret
encrypt:
mov eax,ecx
inc eax
ret
}
//--- End of Assembly code
}

The best clue ever for decryption (and as general as the question):
undo everything
I guess every instruction in that code has an conservative opposite (unless it's destroyin the data, but hey)
So if the code ends with:
inc eax
ret
You start with
[load the return in eax]
dec eax
and so on.

Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call

#include<stdio.h>
int a[100];
int main(){
char UserName[100];
char *n=UserName;
char *q=NULL;
char Serial[200];
q=Serial;
scanf("%s",UserName);
//this is about
__asm{
pushad
mov eax,q
push eax
mov eax,n
push eax
mov EAX,EAX
mov EAX,EAX
CALL G1
LEA EDX,DWORD PTR SS:[ESP+10H]
jmp End
G1:
SUB ESP,400H
XOR ECX,ECX
PUSH EBX
PUSH EBP
MOV EBP,DWORD PTR SS:[ESP+40CH]
PUSH ESI
PUSH EDI
MOV DL,BYTE PTR SS:[EBP]
TEST DL,DL
JE L048
LEA EDI,DWORD PTR SS:[ESP+10H]
MOV AL,DL
MOV ESI,EBP
SUB EDI,EBP
L014:
MOV BL,AL
ADD BL,CL
XOR BL,AL
SHL AL,1
OR BL,AL
MOV AL,BYTE PTR DS:[ESI+1]
MOV BYTE PTR DS:[EDI+ESI],BL
INC ECX
INC ESI
TEST AL,AL
JNZ L014
TEST DL,DL
JE L048
MOV EDI,DWORD PTR SS:[ESP+418H]
LEA EBX,DWORD PTR SS:[ESP+10H]
MOV ESI,EBP
SUB EBX,EBP
L031:
MOV AL,BYTE PTR DS:[ESI+EBX]
PUSH EDI
PUSH EAX
CALL G2
MOV AL,BYTE PTR DS:[ESI+1]
ADD ESP,8
ADD EDI,2
INC ESI
TEST AL,AL
JNZ L031
MOV BYTE PTR DS:[EDI],0
POP EDI
POP ESI
POP EBP
POP EBX
ADD ESP,400H
RETN
L048:
MOV ECX,DWORD PTR SS:[ESP+418H]
POP EDI
POP ESI
POP EBP
MOV BYTE PTR DS:[ECX],0
POP EBX
ADD ESP,400H
RETN
G2:
MOVSX ECX,BYTE PTR SS:[ESP+4]
MOV EAX,ECX
AND ECX,0FH
SAR EAX,4
AND EAX,0FH
CMP EAX,0AH
JGE L009
ADD AL,30H
JMP L010
L009:
ADD AL,42H
L010:
MOV EDX,DWORD PTR SS:[ESP+8]
CMP ECX,0AH
MOV BYTE PTR DS:[EDX],AL
JGE L017
ADD CL,61H
MOV BYTE PTR DS:[EDX+1],CL
RETN
L017:
ADD CL,45H
MOV BYTE PTR DS:[EDX+1],CL
RETN
End:
mov eax,eax
popad
}
printf("%s\n",Serial);
return 0;
}
Can you help me?
this problem about Asm，I don't know why cause this result.
this program is very easy,and it about a program of internal code.
Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call. This is usually a result of calling a function declared with one calling convention with a function pointer declared with a different calling convention.

It seems the two parameters which are pushed onto the stack before the call to G1 are never popped from the stack.

Possibly it happens because at the beginning of the function G1 you SUB ESP,400H, after L031 you make ADD ESP,8 and at the end you ADD ESP,400H. It seems like ESP before the G1 call is by 8 less then after call.
EDIT: Regarding to the coding style of assembly function please see this. Here briefly described what are the caller's responsibilities and what are callee's responsibilities, that are regarded to ESP.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Which D compilers will perform tail-call optimization on this function? - d

A very good resource for quickly looking at the code generated by gdc is http://d.godbolt.org/. We currently don't have a dmd equivalent.

Related

Finding max number between two, which implementation to choose

C++ inline asm move WCHAR in 32-bit register

Trying to understand ASM code

Need help decrypting my assembly program

Run-Time Check Failure #0 - The value of ESP was not properly saved across a function call

Categories

Resources