Related
Assembly included. This weekend I tried to get my own small library running without any C libs and the thread local stuff is giving me problems. Below you can see I created a struct called Try1 (because it's my first attempt!) If I set the thread local variable and use it, the code seems to execute fine. If I call a const method on Try1 with a global variable it seems to run fine. Now if I do both, it's not fine. It segfaults despite me being able to access members and running the function with a global variable. The code will print Hello and Hello2 but not Hello3
I suspect the problem is the address of the variable. I tried using an if statement to print the first hello. if ((s64)&t1 > (s64)buf+1024*16) It was true so it means the pointer isn't where I thought it was. Also it isn't -8 as gdb suggest (it's a signed compare and I tried 0 instead of buf)
Assembly under the c++ code. First line is the first call to write
//test.cpp
//clang++ or g++ -std=c++20 -g -fno-rtti -fno-exceptions -fno-stack-protector -fno-asynchronous-unwind-tables -static -nostdlib test.cpp -march=native && ./a.out
#include <immintrin.h>
typedef unsigned long long int u64;
ssize_t my_write(int fd, const void *buf, size_t size) {
register int64_t rax __asm__ ("rax") = 1;
register int rdi __asm__ ("rdi") = fd;
register const void *rsi __asm__ ("rsi") = buf;
register size_t rdx __asm__ ("rdx") = size;
__asm__ __volatile__ (
"syscall"
: "+r" (rax)
: "r" (rdi), "r" (rsi), "r" (rdx)
: "cc", "rcx", "r11", "memory"
);
return rax;
}
void my_exit(int exit_status) {
register int64_t rax __asm__ ("rax") = 60;
register int rdi __asm__ ("rdi") = exit_status;
__asm__ __volatile__ (
"syscall"
: "+r" (rax)
: "r" (rdi)
: "cc", "rcx", "r11", "memory"
);
}
struct Try1
{
u64 val;
constexpr Try1() { val=0; }
u64 Get() const { return val; }
};
static char buf[1024*8]; //originally mmap but lets reduce code
static __thread u64 sanity_check;
static __thread Try1 t1;
static Try1 global;
extern "C"
int _start()
{
auto tls_size = 4096*2;
auto originalFS = _readfsbase_u64();
_writefsbase_u64((u64)(buf+4096));
global.val = 1;
global.Get(); //Executes fine
sanity_check=6;
t1.val = 7;
my_write(1, "Hello\n", sanity_check);
my_write(1, "Hello2\n", t1.val); //Still fine
my_write(1, "Hello3\n", t1.Get()); //crash! :/
my_exit(0);
return 0;
}
Asm:
4010b4: e8 47 ff ff ff call 401000 <_Z8my_writeiPKvm>
4010b9: 64 48 8b 04 25 f8 ff mov rax,QWORD PTR fs:0xfffffffffffffff8
4010c0: ff ff
4010c2: 48 89 c2 mov rdx,rax
4010c5: 48 8d 05 3b 0f 00 00 lea rax,[rip+0xf3b] # 402007 <_ZNK4Try13GetEv+0xeef>
4010cc: 48 89 c6 mov rsi,rax
4010cf: bf 01 00 00 00 mov edi,0x1
4010d4: e8 27 ff ff ff call 401000 <_Z8my_writeiPKvm>
4010d9: 64 48 8b 04 25 00 00 mov rax,QWORD PTR fs:0x0
4010e0: 00 00
4010e2: 48 05 f8 ff ff ff add rax,0xfffffffffffffff8
4010e8: 48 89 c7 mov rdi,rax
4010eb: e8 28 00 00 00 call 401118 <_ZNK4Try13GetEv>
4010f0: 48 89 c2 mov rdx,rax
4010f3: 48 8d 05 15 0f 00 00 lea rax,[rip+0xf15] # 40200f <_ZNK4Try13GetEv+0xef7>
4010fa: 48 89 c6 mov rsi,rax
4010fd: bf 01 00 00 00 mov edi,0x1
401102: e8 f9 fe ff ff call 401000 <_Z8my_writeiPKvm>
401107: bf 00 00 00 00 mov edi,0x0
40110c: e8 12 ff ff ff call 401023 <_Z7my_exiti>
401111: b8 00 00 00 00 mov eax,0x0
401116: c9 leave
401117: c3 ret
The ABI requires that fs:0 contains a pointer with the absolute address of the thread-local storage block, i.e. the value of fsbase. The compiler needs access to this address to evaluate expressions like &t1, which here it needs in order to compute the this pointer to be passed to Try1::Get().
It's tricky to recover this address on x86-64, since the TLS base address isn't in a convenient general register, but in the hidden fsbase. It isn't feasible to execute rdfsbase every time we need it (expensive instruction that may not be available) nor worse yet to call arch_prctl, so the easiest solution is to ensure that it's available in memory at a known address. See this past answer and sections 3.4.2 and 3.4.6 of "ELF Handling for Thread-Local Storage", which is incorporated by reference into the x86-64 ABI.
In your disassembly at 0x4010d9, you can see the compiler trying to load from address fs:0x0 into rax, then adding -8 (the offset of t1 in the TLS block) and moving the result into rdi as the hidden this argument to Try1::Get(). Obviously since you have zeros at fs:0 instead, the resulting pointer is invalid and you get a crash when Try1::Get() reads val, which is really this->val.
I would write something like
void *fsbase = buf+4096;
_writefsbase_u64((u64)fsbase);
*(void **)fsbase = fsbase;
(Or memcpy(fsbase, &fsbase, sizeof(void *)) might be more compliant with strict aliasing.)
I'm checking the Release build of my project done with the latest version of the VS 2017 C++ compiler. And I'm curious why did compiler choose to build the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here and I believe him (I guess I already did). If you want to force the compiler to do things this way, see: __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb The compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcxregisters might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practise, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64
#pragma pack(push)
#pragma pack(1)
typedef struct {
ULONG64 ui0;
USHORT w0;
USHORT w1;
//Followed by:
// PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned somewhere in the answer on this page, it is clear that the compiler was starved of registers to have produced the following monstrocity (see how it reused rax for the loop end limit, or movzx eax,word ptr [r13] instruction that could've been clearly left out of the loop.)
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
pDstClbks < pEndDstClbks;
pScrClbks++, pDstClbks++)
{
*pDstClbks = *pScrClbks;
}
this produces a more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w
I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.
One area I'd like to improve is a very hot loop:
__declspec(noinline) void pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
for (int i = 0; i < 16; ++i)
{
auto someIndex = arr[i];
output[i] = someGlobalArray[someIndex];
}
for (int i = 0; i < 16; ++i)
{
if (output[i] == 1)
{
return i;
}
}
return -1;
}
But of course, all 3 major compilers (msvc, gcc, clang) refuse to vectorize this. I can sort of understand why, but I wanted to get a confirmation.
If I had to vectorize this by hand, it would be:
(1) VectorLoad "arr", this brings in 16 4-byte integers let's say into zmm0
(2) 16 memory loads from the address pointed to by zmm0[0..3] into zmm1[0..3], load from address pointed into by zmm0[4..7] into zmm1[4..7] so on and so forth
(3) compare zmm0 and zmm1
(4) vector popcnt into the output to find out the most significant bit and basically divide that by 8 to get the index that matched
First of all, can vector instructions do these things? Like can they do this "gathering" operation, i.e. do a load from address pointing to zmm0?
Here is what clang generates:
0000000000400530 <_Z5superPiS_S_>:
400530: 48 63 07 movslq (%rdi),%rax
400533: 8b 04 86 mov (%rsi,%rax,4),%eax
400536: 89 02 mov %eax,(%rdx)
400538: 48 63 47 04 movslq 0x4(%rdi),%rax
40053c: 8b 04 86 mov (%rsi,%rax,4),%eax
40053f: 89 42 04 mov %eax,0x4(%rdx)
400542: 48 63 47 08 movslq 0x8(%rdi),%rax
400546: 8b 04 86 mov (%rsi,%rax,4),%eax
400549: 89 42 08 mov %eax,0x8(%rdx)
40054c: 48 63 47 0c movslq 0xc(%rdi),%rax
400550: 8b 04 86 mov (%rsi,%rax,4),%eax
400553: 89 42 0c mov %eax,0xc(%rdx)
400556: 48 63 47 10 movslq 0x10(%rdi),%rax
40055a: 8b 04 86 mov (%rsi,%rax,4),%eax
40055d: 89 42 10 mov %eax,0x10(%rdx)
400560: 48 63 47 14 movslq 0x14(%rdi),%rax
400564: 8b 04 86 mov (%rsi,%rax,4),%eax
400567: 89 42 14 mov %eax,0x14(%rdx)
40056a: 48 63 47 18 movslq 0x18(%rdi),%rax
40056e: 8b 04 86 mov (%rsi,%rax,4),%eax
400571: 89 42 18 mov %eax,0x18(%rdx)
400574: 48 63 47 1c movslq 0x1c(%rdi),%rax
400578: 8b 04 86 mov (%rsi,%rax,4),%eax
40057b: 89 42 1c mov %eax,0x1c(%rdx)
40057e: 48 63 47 20 movslq 0x20(%rdi),%rax
400582: 8b 04 86 mov (%rsi,%rax,4),%eax
400585: 89 42 20 mov %eax,0x20(%rdx)
400588: 48 63 47 24 movslq 0x24(%rdi),%rax
40058c: 8b 04 86 mov (%rsi,%rax,4),%eax
40058f: 89 42 24 mov %eax,0x24(%rdx)
400592: 48 63 47 28 movslq 0x28(%rdi),%rax
400596: 8b 04 86 mov (%rsi,%rax,4),%eax
400599: 89 42 28 mov %eax,0x28(%rdx)
40059c: 48 63 47 2c movslq 0x2c(%rdi),%rax
4005a0: 8b 04 86 mov (%rsi,%rax,4),%eax
4005a3: 89 42 2c mov %eax,0x2c(%rdx)
4005a6: 48 63 47 30 movslq 0x30(%rdi),%rax
4005aa: 8b 04 86 mov (%rsi,%rax,4),%eax
4005ad: 89 42 30 mov %eax,0x30(%rdx)
4005b0: 48 63 47 34 movslq 0x34(%rdi),%rax
4005b4: 8b 04 86 mov (%rsi,%rax,4),%eax
4005b7: 89 42 34 mov %eax,0x34(%rdx)
4005ba: 48 63 47 38 movslq 0x38(%rdi),%rax
4005be: 8b 04 86 mov (%rsi,%rax,4),%eax
4005c1: 89 42 38 mov %eax,0x38(%rdx)
4005c4: 48 63 47 3c movslq 0x3c(%rdi),%rax
4005c8: 8b 04 86 mov (%rsi,%rax,4),%eax
4005cb: 89 42 3c mov %eax,0x3c(%rdx)
4005ce: c3 retq
4005cf: 90 nop
Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
AVX2 / AVX512 have vpgatherdd which does use a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual which has a section on gather performance). The SIMD compare and bitscan are very cheap by comparison; single uop and fully pipelined.
gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. Sometimes possible after inlining, but for the non-inline version, you can promise this with int * __restrict output. Or if you make output a local temporary instead of a function arg. (General rule: storing through a non-_restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
gcc and clang never vectorize search loops; only loops where the trip-count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local so it doesn't have to do that as a side-effect of running the function), then uses SIMD packed-compare + bit-scan.
Compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512. 512-bit vectors can limit the max-turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization. In fact, they even store to the local array even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy. Especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle.
But using a wide vector load for the indices to feed a gather reduces load-port / cache access pressure significantly if your data was mostly hot in cache.)
A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part, ICC manages the SIMD compare part)
;; Windows x64 calling convention: rcx,rdx, r8,r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required
vmovdqu32 zmm16, [rcx/arr] ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that
kxnorw k1, k0,k0 ; k1 = -1. k0 false dep is likely not a problem.
; optional: vpxord xmm17, xmm17, xmm17 ; break merge-masking false dep
vpgatherdd zmm17{k1}, [rdx + zmm16 * 4] ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done
vmovdqu32 [r8/output], zmm17
vpcmpd k1, zmm17, zmm31, 0 ; 0->EQ. Outside the loop, do zmm31=set1_epi32(1)
; k1 = compare bitmap
kortestw k1, k1
jz .not_found ; early check for not-found
kmovw edx, k1
; tzcnt doesn't have a false dep on the output on Skylake
; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt eax, edx ; bit-scan for the first (lowest-address) set element
; input=0 produces output=32
; or avoid the branch and let 32 be the not-found return value.
; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret
.not_found:
mov eax, -1
ret
You can do this yourself with intrinsics:
Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.
#include <immintrin.h>
//__declspec(noinline) // I *hope* this was just to see the stand-alone asm version
// but it means the output array can't optimize away at all
//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
__m512i vindex = _mm512_load_si512(arr);
__m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4); // indexing by 4-byte int
*output = gather;
__mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
// Intrinsics make masks freely convert to integer
// even though it costs a `kmov` instruction either way.
int onepos = _tzcnt_u32(cmp);
if (onepos >= 16){
return -1;
}
return onepos;
}
All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices whe inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.
gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return-value stuff. Silly compiler should keep it around and use a different register for the 1.
All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16.31 are not accessible with legacy-SSE, so the SSE/AVX transition penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31). There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, no reason to avoid xmm0..15.
Oh, and in the Windows calling convention, xmm6..15 are call-preserved. (Not ymm/zmm, just the low 128 bits), so zmm16..31 are a good choice if you run out of xmm0..5 regs.
I decided to benchmark the realization of a swap function for simple types (like int, or struct, or a class that uses only simple types in its fields) with a static tmp variable in it to prevent memory allocation in each swap call. So I wrote this simple test program:
#include <iostream>
#include <chrono>
#include <utility>
#include <vector>
template<typename T>
void mySwap(T& a, T& b) //Like std::swap - just for tests
{
T tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
template<typename T>
void mySwapStatic(T& a, T& b) //Here with static tmp
{
static T tmp;
tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
class Test1 { //Simple class with some simple types
int foo;
float bar;
char bazz;
};
class Test2 { //Class with std::vector in it
int foo;
float bar;
char bazz;
std::vector<int> bizz;
public:
Test2()
{
bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
}
};
#define Test Test1 //choosing class
const static unsigned int NUM_TESTS = 100000000;
static Test a, b; //making it static to prevent throwing out from code by compiler optimizations
template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction ) //test function
{
std::chrono::system_clock::time_point t1, t2;
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
swapFunction(a, b);
}
t2 = std::chrono::system_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
int main()
{
std::chrono::system_clock::time_point t1, t2;
std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n"; //caling test function
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwap<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result slightly better then 1. why?!
std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwapStatic<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better then 3...
std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n"; //calling test function with std::swap for comparsion. Mostly similar to 1...
return 0;
}
Some results with Test defined as Test1 (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):
Test 1. MySwap Result: 625,105,480 nanoseconds
Test 2. MySwap2 Result: 528,701,547 nanoseconds
Test 3. MySwapStatic Result: 338,484,180 nanoseconds
Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds
Test 5. std::swap Result: 564,863,184 nanoseconds
My main question: is it good to use this implementation for swapping of simple types? I know that if you use it for swapping types with vectors, for example, then std::swap is better, and you can see it just by changing the Test define to Test2.
Second question: why are the results in test 1, 2, 3, and 4 so different? What am I doing wrong with the test function implementation?
An answer to your second question first : in your test 2 and 4, the compiler is inlining the functions, thus it gives better performances (there is even more on test 4, but I will cover this later).
Overall, it is probably a bad idea to use a static temp variable.
Why ? First, it should be noted that in x86 assembly, there is no instruction to copy from memory to memory. This means that when you swap, there is, not one, but two temporary variables in CPU registers. And these temporary variables MUST be in CPU registers, you can't copy mem to mem, thus a static variable will add a third memory location to transfer to and from.
One problem with your static temp is that it will hinder inlining. Imagine if the variables you swap are already in CPU registers. In this case, the compiler can inline the swapping, and never copy anything to memory, which is much faster. Now, if you forced a static temp, either the compiler removes it (useless), or is forced to add a memory copy. That's what happen in test 4, in which GCC removed all the reads to the static variable. It just pointlessly write updated values to it because you told it do to so. The read removal explains the good performance gain, but it could be even faster.
Your test cases are flawed because they don't show this point.
Now you may ask : Then why my static functions perform better ? I have no idea. (Answer at the end)
I was curious, so I compiled your code with MSVC, and it turns out MSVC is doing it right, and GCC is doing it weird. At O2 optimization levels, MSVC detects that two swaps is a no-op and shortcut them, but even at O1, the non-inlined generated code is faster than all the test case with GCC at O3. (EDIT: Actually, MSVC is not doing it right either, see explanation at the end.)
The assembly generated by MSVC looks indeed better, but when comparing the static and non-static assembly generated by GCC, I don't know why the static performs better.
Anyway, I think even if GCC is generating weird code, the inlining problem should be worth using the std::swap, because with bigger types the additional memory copy could be costly, and smaller types give better inlining.
Here are the assembly produced by all the test cases, if someone have an idea of why the GCC static performs better than the non-static, despite being longer and using more memory moves. EDIT: Answer at the end
GCC non-static (perf 570ms):
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402FA2 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FA7 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FAB 44 89 02 mov dword ptr [rdx],r8d
00402FAE F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FB3 88 42 08 mov byte ptr [rdx+8],al
GCC static and MSVC static (perf 275ms):
00402F10 48 8B 01 mov rax,qword ptr [rcx]
00402F13 48 89 05 66 11 00 00 mov qword ptr [404080h],rax
00402F1A 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F1E 88 05 64 11 00 00 mov byte ptr [404088h],al
00402F24 48 8B 02 mov rax,qword ptr [rdx]
00402F27 48 89 01 mov qword ptr [rcx],rax
00402F2A 0F B6 42 08 movzx eax,byte ptr [rdx+8]
00402F2E 88 41 08 mov byte ptr [rcx+8],al
00402F31 48 8B 05 48 11 00 00 mov rax,qword ptr [404080h]
00402F38 48 89 02 mov qword ptr [rdx],rax
00402F3B 0F B6 05 46 11 00 00 movzx eax,byte ptr [404088h]
00402F42 88 42 08 mov byte ptr [rdx+8],al
MSVC non-static (perf 215ms):
00000 f2 0f 10 02 movsdx xmm0, QWORD PTR [rdx]
00004 f2 0f 10 09 movsdx xmm1, QWORD PTR [rcx]
00008 44 8b 41 08 mov r8d, DWORD PTR [rcx+8]
0000c f2 0f 11 01 movsdx QWORD PTR [rcx], xmm0
00010 8b 42 08 mov eax, DWORD PTR [rdx+8]
00013 89 41 08 mov DWORD PTR [rcx+8], eax
00016 f2 0f 11 0a movsdx QWORD PTR [rdx], xmm1
0001a 44 89 42 08 mov DWORD PTR [rdx+8], r8d
std::swap versions are all identical to the non-static versions.
After having some fun investigating, I found the likely reason of the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding. This feature kicks in when a memory load match a previous memory store, and shortcut the memory operation to use the value already known. In this case GCC somehow use an asymmetric load/store for parameter A and B. A is copied using 4+4+1 bytes, and B is copied using 8+1 bytes. This means the 8 first bytes of the class won't be matched by the store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy by a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.
GCC patched code, longer but taking advantage of store forwarding (perf 220ms) :
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402F9C 44 8B 0A mov r9d,dword ptr [rdx]
00402F9F 44 89 09 mov dword ptr [rcx],r9d
00402FA2 44 8B 4A 04 mov r9d,dword ptr [rdx+4]
00402FA6 44 89 49 04 mov dword ptr [rcx+4],r9d
00402FAA 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FAF 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FB3 44 89 02 mov dword ptr [rdx],r8d
00402FB6 F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FBB 88 42 08 mov byte ptr [rdx+8],al
Actually, this copy instructions (symmetric 4+4+1) is the right way to do it. In these test we are only doing copies, in which case MSVC version is the best without doubt. The problem is that in a real case, the class member will be accessed individually, thus generating 4 bytes read/write. MSVC 8 bytes batch copy (also generated by GCC for one argument) will prevent future store forwarding for individual members. A new test I did with member operation beside copies show that the patched 4+4+1 versions indeed outperform all the others. And by a factor of nearly x2. Sadly, no modern compiler is generating this code.
Background:
I am new to assembly. When I was learning programming, I made a program that implements multiplication tables up to 1000 * 1000. The tables are formatted so that each answer is on the line factor1 << 10 | factor2 (I know, I know, it's isn't pretty). These tables are then loaded into an array: int* tables. Empty lines are filled with 0. Here is a link to the file for the tables (7.3 MB). I know using assembly won't speed up this by much, but I just wanted to do it for fun (and a bit of practice).
Question:
I'm trying to convert this code into inline assembly (tables is a global):
int answer;
// ...
answer = tables [factor1 << 10 | factor2];
This is what I came up with:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
My C++ code works fine, but my assembly fails. What is wrong with my assembly (especially the movl _tables(,%2,4), %0; part), compared to my C++
What I have done to solve it:
I used some random numbers: 89 796 as factor1 and factor2. I know that there is an element at 89 << 10 | 786 (which is 91922) – verified this with C++. When I run it with gdb, I get a SIGSEGV:
Program received signal SIGSEGV, Segmentation fault.
at this line:
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
I added two methods around my asm, which is how I know where the asm block is in the disassembly.
Disassembly of my asm block:
The disassembly from objdump -M att -d looks fine (although I'm not sure, I'm new to assembly, as I said):
402096: 8b 45 08 mov 0x8(%ebp),%eax
402099: 8b 55 0c mov 0xc(%ebp),%edx
40209c: c1 e0 0a shl $0xa,%eax
40209f: 09 c2 or %eax,%edx
4020a1: 8b 04 95 18 e0 47 00 mov 0x47e018(,%edx,4),%eax
4020a8: 89 45 ec mov %eax,-0x14(%ebp)
The disassembly from objdump -M intel -d:
402096: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
402099: 8b 55 0c mov edx,DWORD PTR [ebp+0xc]
40209c: c1 e0 0a shl eax,0xa
40209f: 09 c2 or edx,eax
4020a1: 8b 04 95 18 e0 47 00 mov eax,DWORD PTR [edx*4+0x47e018]
4020a8: 89 45 ec mov DWORD PTR [ebp-0x14],eax
From what I understand, it's moving the first parameter of my void calc ( int factor1, int factor2 ) function into eax. Then it's moving the second parameter into edx. Then it shifts eax to the left by 10 and ors it with edx. A 32-bit integer is 4 bytes, so [edx*4+base_address]. Move result to eax and then put eax into int answer (which, I'm guessing is on -0x14 of the stack). I don't really see much of a problem.
Disassembly of the compiler's .exe:
When I replace the asm block with plain C++ (answer = tables [factor1 << 10 | factor2];) and disassemble it this is what I get in Intel syntax:
402096: a1 18 e0 47 00 mov eax,ds:0x47e018
40209b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
40209e: c1 e2 0a shl edx,0xa
4020a1: 0b 55 0c or edx,DWORD PTR [ebp+0xc]
4020a4: c1 e2 02 shl edx,0x2
4020a7: 01 d0 add eax,edx
4020a9: 8b 00 mov eax,DWORD PTR [eax]
4020ab: 89 45 ec mov DWORD PTR [ebp-0x14],eax
AT&T syntax:
402096: a1 18 e0 47 00 mov 0x47e018,%eax
40209b: 8b 55 08 mov 0x8(%ebp),%edx
40209e: c1 e2 0a shl $0xa,%edx
4020a1: 0b 55 0c or 0xc(%ebp),%edx
4020a4: c1 e2 02 shl $0x2,%edx
4020a7: 01 d0 add %edx,%eax
4020a9: 8b 00 mov (%eax),%eax
4020ab: 89 45 ec mov %eax,-0x14(%ebp)
I am not really familiar with the Intel syntax, so I am just going to try and understand the AT&T syntax:
It first moves the base address of the tables array into %eax. Then, is moves the first parameter into %edx. It shifts %edx to the left by 10 then ors it with the second parameter. Then, by shifting %edx to the left by two, it actually multiplies %edx by 4. Then, it adds that to %eax (the base address of the array). So, basically it just did this: [edx*4+0x47e018] (Intel syntax) or 0x47e018(,%edx,4) AT&T. It moves the value of the element it got into %eax and puts it into int answer. This method is more "expanded", but it does the same thing as my hand-written assembly! So why is mine giving a SIGSEGV while the compiler's working fine?
I bet (from the disassembly) that tables is a pointer to an array, not the array itself.
So you need:
asm volatile ( "shll $10, %1;"
movl _tables,%%eax
"orl %1, %2;"
"movl (%%eax,%2,4)",
: "=r" (answer) : "r" (factor1), "r" (factor2) : "eax" )
(Don't forget the extra clobber in the last line).
There are of course variations, this may be more efficient if the code is in a loop:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl (%3,%2,4)",
: "=r" (answer) : "r" (factor1), "r" (factor2), "r"(tables) )
This is intended to be an addition to Mats Petersson's answer - I wrote it simply because it wasn't immediately clear to me why OP's analysis of the disassembly (that his assembly and the compiler-generated one were equivalent) was incorrect.
As Mats Petersson explains, the problem is that tables is actually a pointer to an array, so to access an element, you have to dereference twice. Now to me, it wasn't immediately clear where this happens in the compiler-generated code. The culprit is this innocent-looking line:
a1 18 e0 47 00 mov 0x47e018,%eax
To the untrained eye (that includes mine), this might look like the value 0x47e018 is moved to eax, but it's actually not. The Intel-syntax representation of the same opcodes gives us a clue:
a1 18 e0 47 00 mov eax,ds:0x47e018
Ah - ds: - so it's not actually a value, but an address!
For anyone who is wondering now, the following would be the opcodes and ATT syntax assembly for moving the value 0x47e018 to eax:
b8 18 e0 47 00 mov $0x47e018,%eax