Why is vzeroupper being inserted at the end of this code?

Why is vzeroupper being inserted at the end of this code? - c++

I noticed something strange when I compile this code on godbolt, with MSVC:
#include <intrin.h>
#include <cstdint>
void test(unsigned char*& pSrc) {
__m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pSrc));
int32_t mask = _mm256_movemask_epi8(data);
if (!mask) {
++pSrc;
}
else {
unsigned long v;
_BitScanForward(&v, mask);
pSrc += v;
}
}
I get this resulting assembly:
pSrc$ = 8
void test(unsigned char * &) PROC ; test, COMDAT
mov rdx, QWORD PTR [rcx]
vmovdqu ymm0, YMMWORD PTR [rdx]
vpmovmskb eax, ymm0
test eax, eax
jne SHORT $LN2#test
mov eax, 1
add rax, rdx
mov QWORD PTR [rcx], rax
vzeroupper ; Why is this being inserted?
ret 0
$LN2#test:
bsf eax, eax
add rax, rdx
mov QWORD PTR [rcx], rax
vzeroupper ; Why is this being inserted?
ret 0
void test(unsigned char * &) ENDP ; test
Why is vzeroupper being inserted at the end of each scope? I heard that it's because of switching between SSE and AVX, but I'm not doing that here. I'm using exclusively AVX code.
I was wondering, does this pose a performance problem?

Related

How do I convert C++ function to assembly(x86_64)?

This is my .CPP file
#include <iostream>
using namespace std;
extern "C" void KeysAsm(int arr[], int n, int thetha, int rho);
// Keep this and call it from assembler
extern "C"
void crim(int *xp, int *yp) {
int temp = *xp;
*xp = *yp;
*yp = temp+2;
}
// Translate this into Intel assembler
void KeysCpp(int arr[], int n, int thetha, int rho){
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
crim(&arr[j], &arr[j + 1]);
}
}
arr[i]= arr[i] + thetha / rho * 2 - 4;
}
}
// Function to print an array
void printArray(int arr[], int size){
int i;
for (i = 0; i < size; i++)
cout << arr[i] << "\n";
cout << endl;
}
int main() {
int gamma1[]{
9,
270,
88,
-12,
456,
80,
45,
123,
427,
999
};
int gamma2[]{
900,
312,
542,
234,
234,
1,
566,
123,
427,
111
};
printf("Array:\n");
printArray(gamma1, 10);
KeysAsm(gamma1, 10, 5, 6);
printf("Array Result Asm:\n");
printArray(gamma1, 10);
KeysCpp(gamma2, 10, 5, 6);
printf("Array Result Cpp:\n");
printArray(gamma2, 10);
}
What I want to do is, convert the KeysCpp function into assembly language and call it from this very .CPP file. I want to keep the crim function as it is in .CPP, while only converting the KeysCpp.
Here is my .ASM file
PUBLIC KeysAsm
includelib kernel32.lib
_DATA SEGMENT
EXTERN crim:PROC
_DATA ENDS
_TEXT SEGMENT
KeysAsm PROC
push rbp
mov rbp, rsp
sub rsp, 40
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov DWORD PTR [rbp-32], edx
mov DWORD PTR [rbp-36], ecx
mov DWORD PTR [rbp-4], 0
jmp L3
L3:
mov eax, DWORD PTR [rbp-28]
sub eax, 1
cmp DWORD PTR [rbp-4], eax
jl L7
L4:
mov eax, DWORD PTR [rbp-28]
sub eax, DWORD PTR [rbp-4]
sub eax, 1
cmp DWORD PTR [rbp-8], eax
jl L6
L5:
add DWORD PTR [rbp-8], 1
L6:
mov eax, DWORD PTR [rbp-8]
cdqe
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov edx, DWORD PTR [rax]
mov eax, DWORD PTR [rbp-8]
cdqe
add rax, 1
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rcx
mov eax, DWORD PTR [rax]
cmp edx, eax
jle L5
mov eax, DWORD PTR [rbp-8]
cdqe
add rax, 1
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rdx, rax
mov eax, DWORD PTR [rbp-8]
cdqe
lea rcx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rcx
mov rsi, rdx
mov rdi, rax
call crim
L7:
mov DWORD PTR [rbp-8], 0
jmp L4
KeysAsm ENDP
_TEXT ENDS
END
I am using Visual Studio 2017 to run this project.
I am getting next error when I run this code.
Unhandled exception at 0x00007FF74B0E429C in MatrixMultiplication.exe: Stack cookie instrumentation code detected a stack-based buffer overrun. occurred

Your asm looks like it's expecting the x86-64 System V calling convention, with args in RDI, ESI, EDX, ECX. But you said you're compiling with Visual Studio, so the compiler-generated code will use the Windows x64 calling convention: RCX, EDX, R8D, R9D.
And when you call crim, it can use shadow space (32 bytes above its return address, which you didn't reserve space for).
It looks like you got this asm from un-optimized compiler output, probably from https://godbolt.org/z/ea4MPh81r using GCC for Linux, without using -mabi=ms to override the default -mabi=sysv when compiling for non-Windows targets. And then you modified it to make the loop infinite, with a jmp at the bottom instead of a ret? Maybe a different GCC version than 12.2 since the label numbers and code don't match exactly.
(The signs of being un-optimized compiler output are all the reloads from [rbp-whatever], and redoing sign-extension before using an int to index an array with cdqe. A human would know the int must be non-negative. And being GCC specifically, the numbered label like .L1: etc. where you just removed the ., and of heavily using RAX for as much as possible in a debug build. And choices like lea rdx, [0+rax*4] to copy-and-shift, and the exact syntax it used to print that instruction in Intel syntax match GCC.)
To compile a single function for Windows x64, isolate it and give the compiler only prototypes for anything it calls
extern "C" void crim(int *xp, int *yp); // prototype only
void KeysCpp(int arr[], int n, int thetha, int rho){
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
crim(&arr[j], &arr[j + 1]);
}
}
arr[i]= arr[i] + thetha / rho * 2 - 4;
}
}
Then on Godbolt, use gcc -O3 -mabi=ms, or use MSVC which always targets Windows. https://godbolt.org/z/Mj5Gb54b5 shows both GCC and MSVC with optimization enabled.
KeysCpp(int*, int, int, int): ; demangled name
cmp edx, 1
jle .L11 ; "shrink wrap" optimization: early-out on n<=1 before saving regs
push r15 ; save some call-preserved regs
push r14
lea r14, [rcx+4] ; arr + 1
push r13
mov r13, rcx
Unfortunately GCC fails to hoist the thetha / rho * 2 - 4 loop-invariant, instead redoing idiv every time through the loop. Seems like an obvious optimization since those are local vars whose address hasn't been taken at all, and it keeps thetha (typo for theta?) and rho in registers. So MSVC is much more efficient here. Clang also misses this optimization.

Why aren't clang++ and g++ de-duplicating these instructions?

Consider the following function:
std::string get_value(const bool b)
{
if (b) {
return "Hello";
}
else {
return "World";
}
}
g++ 11.0.1 20210312 compiles this (as C++17 and with maximum optimization) into:
get_value[abi:cxx11](bool):
lea rdx, [rdi+16]
mov rax, rdi
mov QWORD PTR [rdi], rdx
test sil, sil
je .L2
mov DWORD PTR [rdi+16], 1819043144
mov BYTE PTR [rdx+4], 111
mov QWORD PTR [rax+8], 5
mov BYTE PTR [rax+21], 0
ret
.L2:
mov DWORD PTR [rdi+16], 1819438935
mov BYTE PTR [rdx+4], 100
mov QWORD PTR [rax+8], 5
mov BYTE PTR [rax+21], 0
ret
Why does it not move the two replicated mov instructions up before the jump, or even before the test, reducing the code size by two instructions?
The same thing happens with clang++ and libc++, except it only has one relevant instruction to move up.
(See this also on GodBolt)

Passing a r-value-reference to constructor to reduce copies

I have the following code lines
#include <stdio.h>
#include <utility>
class A
{
public: // member functions
explicit A(int && Val)
{
_val = std::move(Val); // \2\
}
virtual ~A(){}
private: // member variables
int _val = 0;
private: // member functions
A(const A &) = delete;
A& operator = (const A &) = delete;
A(A &&) = delete;
A&& operator = (A &&) = delete;
};
int main()
{
A a01{3}; // \1\
return 0;
}
I would like to ask how many copies did I make from \1\ to \2\?

Your code doesn't compile, but after making the changes needed for it to compile, it does nothing and compiles into this x86 assembly because none of it's values are ever used:
main:
xor eax, eax
ret
https://godbolt.org/z/q70EMb
Modifying the code so that it requires the output of the _val member variable (with a print statement) shows that with optimizations it simply moves the value 0x03 into a register and prints it:
.LC0:
.string "%d\n"
main:
sub rsp, 8
mov esi, 3
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
https://godbolt.org/z/JG73Ll
If you disable optimizations in an attempt to get the compiler to output a more verbose version of the program:
A::A(int&&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 0
mov rax, QWORD PTR [rbp-16]
mov rdi, rax
call std::remove_reference<int&>::type&& std::move<int&>(int&)
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
nop
leave
ret
.LC0:
.string "%d\n"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
lea rdx, [rbp-4]
lea rax, [rbp-8]
mov rsi, rdx
mov rdi, rax
call A::A(int&&)
mov eax, DWORD PTR [rbp-8]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
std::remove_reference<int&>::type&& std::move<int&>(int&):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
https://godbolt.org/z/ZTK40d
The answer to your question depends on how your program is compiled and how copy elision is enforced, as well as if there is any benefit in the case of an int to not "copying" a value, since an int* and int likely take up the same amount of memory.

your are merely assigning a value, not copying. Nevertheless, you can have a static member in your class that is incremented everytime this method is called!
class A
{
public: // member functions
static int counter = 0;
explicit A(int && Val)
{
_val = std::move(Val); // \2\
counter++;
}
....

Getting GCC/Clang to use CMOV

I have a simple tagged union of values. The values can either be int64_ts or doubles. I am performing addition on the these unions with the caveat that if both arguments represent int64_t values then the result should also have an int64_t value.
Here is the code:
#include<stdint.h>
union Value {
int64_t a;
double b;
};
enum Type { DOUBLE, LONG };
// Value + type.
struct TaggedValue {
Type type;
Value value;
};
void add(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
const Type type1 = arg1.type;
const Type type2 = arg2.type;
// If both args are longs then write a long to the output.
if (type1 == LONG && type2 == LONG) {
out->value.a = arg1.value.a + arg2.value.a;
out->type = LONG;
} else {
// Convert argument to a double and add it.
double op1 = type1 == LONG ? (double)arg1.value.a : arg1.value.b; // Why isn't CMOV used?
double op2 = type2 == LONG ? (double)arg2.value.a : arg2.value.b; // Why isn't CMOV used?
out->value.b = op1 + op2;
out->type = DOUBLE;
}
}
The output of gcc at -O2 is here: http://goo.gl/uTve18
Attached here in case the link doesn't work.
add(TaggedValue const&, TaggedValue const&, TaggedValue*):
cmp DWORD PTR [rdi], 1
sete al
cmp DWORD PTR [rsi], 1
sete cl
je .L17
test al, al
jne .L18
.L4:
test cl, cl
movsd xmm1, QWORD PTR [rdi+8]
jne .L19
.L6:
movsd xmm0, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 0
addsd xmm0, xmm1
movsd QWORD PTR [rdx+8], xmm0
ret
.L17:
test al, al
je .L4
mov rax, QWORD PTR [rdi+8]
add rax, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 1
mov QWORD PTR [rdx+8], rax
ret
.L18:
cvtsi2sd xmm1, QWORD PTR [rdi+8]
jmp .L6
.L19:
cvtsi2sd xmm0, QWORD PTR [rsi+8]
addsd xmm0, xmm1
mov DWORD PTR [rdx], 0
movsd QWORD PTR [rdx+8], xmm0
ret
It produced code with a lot of branches. I know that the input data is pretty random i.e it has a random combination of int64_ts and doubles. I'd like to have at least the conversion to a double done with an equivalent of a CMOV instruction. Is there any way I can coax gcc to produce that code? I'd ideally like to run some benchmark on real data to see how the code with a lot of branches does vs one with fewer branches but more expensive CMOV instructions. It might turn out that the code generated by default by GCC works better but I'd like to confirm that. I could inline the assembly myself but I'd prefer not to.
The interactive compiler link is a good way to check the assembly. Any suggestions?

Trying to understand ASM code

EDIT
I switched from memcmp to a home brewed 13 byte compare function and the homebrew doesnt have the extra instructions. So all I can guess is that the extra assembly is just a flaw in the optimizer.
if (!EQ13(&ti, &m_ti)) { // in 2014, memcmp was not being optimzied here
000007FEF91B2CFE mov rdx,qword ptr [rsp]
000007FEF91B2D02 movzx eax,byte ptr [rsp+0Ch]
000007FEF91B2D07 mov ecx,dword ptr [rsp+8]
000007FEF91B2D0B cmp rdx,qword ptr [r10+28h]
000007FEF91B2D0F jne TSccIter::SetTi+9Dh (7FEF91B2D1Dh)
000007FEF91B2D11 cmp ecx,dword ptr [r10+30h]
000007FEF91B2D15 jne TSccIter::SetTi+9Dh (7FEF91B2D1Dh)
000007FEF91B2D17 cmp al,byte ptr [r10+34h]
000007FEF91B2D1B je TSccIter::SetTi+0B1h (7FEF91B2D31h)
My homebrew isn't perfect in this case since it does 3 movs at the start even though it is unlikely to ever check past the first mov. I need to work on that part.
ORIGINAL QUESTION
Here is asm code from msvc 2010 showing how it can optimze a small, fixed-sized memcmp (in this case, 13 bytes). I've seen this type of optimization a lot in our code, but never with the last 6 lines. Can anyone tell me why the last 6 lines of assembly are there? TransferItem is 13 bytes so that explains the QWORD, DWORD, then BYTE cmps.
struct TransferItem {
char m_szCxrMkt1[3];
char m_szCxrOp1[3];
char m_chDelimiter;
char m_szCxrMkt2[3];
char m_szCxrOp2[3];
};
...
if (memcmp(&ti, &m_ti, sizeof(TransferItem))) {
2B8E lea rax,[rsp]
2B92 mov rdx,qword ptr [rax]
2B95 cmp rdx,qword ptr [r10+28h]
2B99 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2B9B mov edx,dword ptr [rax+8]
2B9E cmp edx,dword ptr [r10+30h]
2BA2 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BA4 movzx edx,byte ptr [rax+0Ch]
2BA8 cmp dl,byte ptr [r10+34h]
2BAC jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BAE xor eax,eax
2BB0 jmp TSccIter::SetTi+0A7h (7FEF9302BB7h)
2BB2 sbb eax,eax
2BB4 sbb eax,0FFFFFFFFh
2BB7 test eax,eax
2BB9 je TSccIter::SetTi+0CCh (7FEF9302BDCh)
Also what is the point of xor eax,eax which we know will be zero and then testing that for that known to be zero on line 2bb7?
Here is the whole function
// fWildCard means match certain fields to '**' in the db
// szCxrMkt1,2 are required and cannot be null, ' ', or '\0\0'.
// szCxrOp1,2 can be null, ' ', or '\0\0'.
TSccIter& SetTi(bool fWildCard, LPCSTR szCxrMkt1, LPCSTR szCxrOp1, LPCSTR szCxrMkt2, LPCSTR szCxrOp2) {
if (m_fSkipSet)
return *this;
m_iSid = -1; // resets the iterator to search from the start
// Pad the struct to 16 bytes so we can clear it with 2 QWORDS
// We use a temp, ti, to detect if the new transferitem has changed
class TransferItemPadded : public TransferItem {
char padding[16 - sizeof(TransferItem)]; // get us to 16 bytes
} ti;
U8(&ti) = U8(BUMP(&ti, 8)) = 0x2020202020202020; // 8 spaces
// copy in the params
CPY2(ti.m_szCxrMkt1, szCxrMkt1);
if (szCxrOp1 && *szCxrOp1)
CPY2(ti.m_szCxrOp1, szCxrOp1);
ti.m_chDelimiter = (fWildCard) ? '*' : ':'; // this controls wild card matching
CPY2(ti.m_szCxrMkt2, szCxrMkt2);
if (szCxrOp2 && *szCxrOp2)
CPY2(ti.m_szCxrOp2, szCxrOp2);
// see if different
if (memcmp(&ti, &m_ti, sizeof(TransferItem))) {
memcpy(&m_ti, &ti, sizeof(TransferItem));
m_fQryChanged = true;
}
return *this;
}
typedef unsigned __int64 U8;
#define CPY2(a,b) ((*(WORD*)a) = (*(WORD*)b))
And here's the whole asm
TSccIter& SetTi(bool fWildCard, LPCSTR szCxrMkt1, LPCSTR szCxrOp1, LPCSTR szCxrMkt2, LPCSTR szCxrOp2) {
2B10 sub rsp,18h
if (m_fSkipSet)
2B14 cmp byte ptr [rcx+0EAh],0
2B1B mov r10,rcx
return *this;
2B1E jne TSccIter::SetTi+0CCh (7FEF9302BDCh)
m_iSid = -1;
class TransferItemPadded : public TransferItem {
char padding[16 - sizeof(TransferItem)];
} ti;
U8(&ti) = U8(BUMP(&ti, 8)) = 0x2020202020202020;
2B24 mov rax,2020202020202020h
2B2E mov byte ptr [rcx+36h],0FFh
2B32 mov qword ptr [rsp],rax
2B36 mov qword ptr [rsp+8],rax
CPY2(ti.m_szCxrMkt1, szCxrMkt1);
2B3B movzx eax,word ptr [r8]
2B3F mov word ptr [rsp],ax
if (szCxrOp1 && *szCxrOp1)
2B43 test r9,r9
2B46 je TSccIter::SetTi+47h (7FEF9302B57h)
2B48 cmp byte ptr [r9],0
2B4C je TSccIter::SetTi+47h (7FEF9302B57h)
CPY2(ti.m_szCxrOp1, szCxrOp1);
2B4E movzx eax,word ptr [r9]
2B52 mov word ptr [rsp+3],ax
ti.m_chDelimiter = (fWildCard) ? '*' : ':';
2B57 mov eax,3Ah
2B5C mov ecx,2Ah
2B61 test dl,dl
2B63 cmovne eax,ecx
2B66 mov byte ptr [rsp+6],al
CPY2(ti.m_szCxrMkt2, szCxrMkt2);
2B6A mov rax,qword ptr [szCxrMkt2]
2B6F movzx ecx,word ptr [rax]
if (szCxrOp2 && *szCxrOp2)
2B72 mov rax,qword ptr [szCxrOp2]
2B77 mov word ptr [rsp+7],cx
2B7C test rax,rax
2B7F je TSccIter::SetTi+7Eh (7FEF9302B8Eh)
2B81 cmp byte ptr [rax],0
2B84 je TSccIter::SetTi+7Eh (7FEF9302B8Eh)
CPY2(ti.m_szCxrOp2, szCxrOp2);
2B86 movzx eax,word ptr [rax]
2B89 mov word ptr [rsp+0Ah],ax
if (memcmp(&ti, &m_ti, sizeof(TransferItem))) {
2B8E lea rax,[rsp]
2B92 mov rdx,qword ptr [rax]
2B95 cmp rdx,qword ptr [r10+28h]
2B99 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2B9B mov edx,dword ptr [rax+8]
2B9E cmp edx,dword ptr [r10+30h]
2BA2 jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BA4 movzx edx,byte ptr [rax+0Ch]
2BA8 cmp dl,byte ptr [r10+34h]
2BAC jne TSccIter::SetTi+0A2h (7FEF9302BB2h)
2BAE xor eax,eax
2BB0 jmp TSccIter::SetTi+0A7h (7FEF9302BB7h)
2BB2 sbb eax,eax
2BB4 sbb eax,0FFFFFFFFh
2BB7 test eax,eax
2BB9 je TSccIter::SetTi+0CCh (7FEF9302BDCh)
memcpy(&m_ti, &ti, sizeof(TransferItem));
2BBB mov rax,qword ptr [rsp]
m_fQryChanged = true;
2BBF mov byte ptr [r10+0E9h],1
2BC7 mov qword ptr [r10+28h],rax
2BCB mov eax,dword ptr [rsp+8]
2BCF mov dword ptr [r10+30h],eax
2BD3 movzx eax,byte ptr [rsp+0Ch]
2BD8 mov byte ptr [r10+34h],al
}
return *this;
2BDC mov rax,r10
}

2bb7 can be reached by different code paths: via taken jumps at 2b99, 2ba2 and 2bac, as well as directly when none of the conditional jumps is taken. The xor eax,eax is only executed at the last path, and it ensures that eax is 0 - which is apparently not the case otherwise.

The last 6 lines return the value in eax == 0 for a match, and also set the SF and ZF condition codes.

test eax, eax will test whether eax AND eax == 0. The following je will jump if zero.
And xor eax, eax is an efficient way to encode "eax = 0". It is more efficient than mov eax, 0
EDIT: Initially misread the question. It looks like something will happen at "TSccIter::SetTi+0A7h" which should change the value?
Also, the SBB trick to replicate the carry(2BB2-2BB4) is explained here:
http://compgroups.net/comp.lang.asm.x86/trick-with-sbb-instruction/20164

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Why is vzeroupper being inserted at the end of this code? - c++

Related

How do I convert C++ function to assembly(x86_64)?

Why aren't clang++ and g++ de-duplicating these instructions?

Passing a r-value-reference to constructor to reduce copies

Getting GCC/Clang to use CMOV

Trying to understand ASM code

Categories

Resources