Vectorizing indirect access through AVX instructions - C++

I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.
One area I'd like to improve is a very hot loop:
__declspec(noinline) int pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
    for (int i = 0; i < 16; ++i)
    {
        auto someIndex = arr[i];
        output[i] = someGlobalArray[someIndex];
    }

    for (int i = 0; i < 16; ++i)
    {
        if (output[i] == 1)
        {
            return i;
        }
    }
    return -1;
}
But of course, all 3 major compilers (msvc, gcc, clang) refuse to vectorize this. I can sort of understand why, but I wanted to get a confirmation.
If I had to vectorize this by hand, it would be:
(1) Vector-load "arr": this brings in 16 4-byte integers, let's say into zmm0
(2) 16 memory loads: load from the address pointed to by zmm0[0..3] into zmm1[0..3], from the address pointed to by zmm0[4..7] into zmm1[4..7], and so on and so forth
(3) compare zmm0 and zmm1
(4) vector popcnt into the output to find out the most significant bit and basically divide that by 8 to get the index that matched
First of all, can vector instructions do these things? Like, can they do this "gathering" operation, i.e. a load from the addresses held in zmm0?
Here is what clang generates:
0000000000400530 <_Z5superPiS_S_>:
400530: 48 63 07 movslq (%rdi),%rax
400533: 8b 04 86 mov (%rsi,%rax,4),%eax
400536: 89 02 mov %eax,(%rdx)
400538: 48 63 47 04 movslq 0x4(%rdi),%rax
40053c: 8b 04 86 mov (%rsi,%rax,4),%eax
40053f: 89 42 04 mov %eax,0x4(%rdx)
400542: 48 63 47 08 movslq 0x8(%rdi),%rax
400546: 8b 04 86 mov (%rsi,%rax,4),%eax
400549: 89 42 08 mov %eax,0x8(%rdx)
40054c: 48 63 47 0c movslq 0xc(%rdi),%rax
400550: 8b 04 86 mov (%rsi,%rax,4),%eax
400553: 89 42 0c mov %eax,0xc(%rdx)
400556: 48 63 47 10 movslq 0x10(%rdi),%rax
40055a: 8b 04 86 mov (%rsi,%rax,4),%eax
40055d: 89 42 10 mov %eax,0x10(%rdx)
400560: 48 63 47 14 movslq 0x14(%rdi),%rax
400564: 8b 04 86 mov (%rsi,%rax,4),%eax
400567: 89 42 14 mov %eax,0x14(%rdx)
40056a: 48 63 47 18 movslq 0x18(%rdi),%rax
40056e: 8b 04 86 mov (%rsi,%rax,4),%eax
400571: 89 42 18 mov %eax,0x18(%rdx)
400574: 48 63 47 1c movslq 0x1c(%rdi),%rax
400578: 8b 04 86 mov (%rsi,%rax,4),%eax
40057b: 89 42 1c mov %eax,0x1c(%rdx)
40057e: 48 63 47 20 movslq 0x20(%rdi),%rax
400582: 8b 04 86 mov (%rsi,%rax,4),%eax
400585: 89 42 20 mov %eax,0x20(%rdx)
400588: 48 63 47 24 movslq 0x24(%rdi),%rax
40058c: 8b 04 86 mov (%rsi,%rax,4),%eax
40058f: 89 42 24 mov %eax,0x24(%rdx)
400592: 48 63 47 28 movslq 0x28(%rdi),%rax
400596: 8b 04 86 mov (%rsi,%rax,4),%eax
400599: 89 42 28 mov %eax,0x28(%rdx)
40059c: 48 63 47 2c movslq 0x2c(%rdi),%rax
4005a0: 8b 04 86 mov (%rsi,%rax,4),%eax
4005a3: 89 42 2c mov %eax,0x2c(%rdx)
4005a6: 48 63 47 30 movslq 0x30(%rdi),%rax
4005aa: 8b 04 86 mov (%rsi,%rax,4),%eax
4005ad: 89 42 30 mov %eax,0x30(%rdx)
4005b0: 48 63 47 34 movslq 0x34(%rdi),%rax
4005b4: 8b 04 86 mov (%rsi,%rax,4),%eax
4005b7: 89 42 34 mov %eax,0x34(%rdx)
4005ba: 48 63 47 38 movslq 0x38(%rdi),%rax
4005be: 8b 04 86 mov (%rsi,%rax,4),%eax
4005c1: 89 42 38 mov %eax,0x38(%rdx)
4005c4: 48 63 47 3c movslq 0x3c(%rdi),%rax
4005c8: 8b 04 86 mov (%rsi,%rax,4),%eax
4005cb: 89 42 3c mov %eax,0x3c(%rdx)
4005ce: c3 retq
4005cf: 90 nop

Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
AVX2 / AVX512 have vpgatherdd which does use a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual which has a section on gather performance). The SIMD compare and bitscan are very cheap by comparison; single uop and fully pipelined.
gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. Sometimes possible after inlining, but for the non-inline version, you can promise this with int * __restrict output. Or if you make output a local temporary instead of a function arg. (General rule: storing through a non-__restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
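As a concrete sketch of that __restrict variant (the function name is made up; the loop is adapted from the question):

void gather16(const int* arr, const int* someGlobalArray, int* __restrict output)
{
    // Promising that output doesn't alias arr or someGlobalArray is what
    // lets gcc8.1 auto-vectorize this gather loop.
    for (int i = 0; i < 16; ++i)
        output[i] = someGlobalArray[arr[i]];
}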
gcc and clang never vectorize search loops; only loops where the trip-count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local so it doesn't have to do that as a side-effect of running the function), then uses SIMD packed-compare + bit-scan.
Compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512. 512-bit vectors can limit the max-turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization; in fact, they even store to the local array, even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy, especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle.
But using a wide vector load for the indices to feed a gather reduces load-port / cache access pressure significantly if your data was mostly hot in cache.)
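A sketch of that "compare inside the first loop" source layout (function name made up; this is the scalar-friendly rewrite suggested above, not what any of the compilers actually emit):

int find_first_1_scalar(const int* arr, const int* someGlobalArray)
{
    // Compare while gathering: an early hit skips the remaining loads, and
    // the output array disappears entirely if the caller only wants the index.
    for (int i = 0; i < 16; ++i) {
        if (someGlobalArray[arr[i]] == 1)
            return i;
    }
    return -1;
}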
A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part, ICC manages the SIMD compare part)
;; Windows x64 calling convention: rcx,rdx, r8,r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required
vmovdqu32 zmm16, [rcx/arr] ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that
kxnorw k1, k0,k0 ; k1 = -1. k0 false dep is likely not a problem.
; optional: vpxord xmm17, xmm17, xmm17 ; break merge-masking false dep
vpgatherdd zmm17{k1}, [rdx + zmm16 * 4] ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done
vmovdqu32 [r8/output], zmm17
vpcmpd k1, zmm17, zmm31, 0 ; 0->EQ. Outside the loop, do zmm31=set1_epi32(1)
; k1 = compare bitmap
kortestw k1, k1
jz .not_found ; early check for not-found
kmovw edx, k1
; tzcnt doesn't have a false dep on the output on Skylake
; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt eax, edx ; bit-scan for the first (lowest-address) set element
; input=0 produces output=32
; or avoid the branch and let 32 be the not-found return value.
; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret
.not_found:
mov eax, -1
ret
You can do this yourself with intrinsics:
Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.
#include <immintrin.h>

//__declspec(noinline)  // I *hope* this was just to see the stand-alone asm version
                        // but it means the output array can't optimize away at all
//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
    __m512i vindex = _mm512_load_si512(arr);
    __m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4);  // indexing by 4-byte int
    *output = gather;

    __mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
    // Intrinsics make masks freely convert to integer
    // even though it costs a `kmov` instruction either way.
    int onepos = _tzcnt_u32(cmp);
    if (onepos >= 16) {
        return -1;
    }
    return onepos;
}
All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices when inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.
gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return-value stuff. Silly compiler should keep it around and use a different register for the 1.
All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16..31 are not accessible with legacy-SSE, so the SSE/AVX transition penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31). There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, no reason to avoid xmm0..15.
Oh, and in the Windows calling convention, xmm6..15 are call-preserved. (Not ymm/zmm, just the low 128 bits), so zmm16..31 are a good choice if you run out of xmm0..5 regs.

Related

Compiler choice of not using REP MOVSB instruction for a byte array move

I'm checking the Release build of my project done with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here and I believe him (I guess I already did). If you want to force the compiler to do things this way, see: __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
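As a rough illustration (the wrapper name is made up; __movsb itself is the intrinsic documented at the link above, declared in <intrin.h>):

#include <intrin.h>   // __movsb (MSVC, clang-cl)

// Emits rep movsb: dst and src are byte pointers, count is a byte count.
static void copy_bytes_rep_movsb(unsigned char* dst, const unsigned char* src, size_t count)
{
    __movsb(dst, src, count);
}

// For the question's loop that would be something like:
//   __movsb(pDst, pSrc, (size_t)ncbSzBuffDataUsed);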
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb, the compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it would have used up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.
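For what it's worth, here is roughly what that register setup amounts to, written as GCC/Clang-style extended inline asm (a sketch only, not something MSVC emits or accepts; on x64 MSVC you'd use the __movsb intrinsic mentioned above instead):

#include <cstddef>

static void rep_movsb_copy(void* dst, const void* src, std::size_t n)
{
    // rdi = destination, rsi = source, rcx = count; all three are
    // advanced/consumed by the instruction, hence the "+" constraints.
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}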
Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64

#pragma pack(push)
#pragma pack(1)
typedef struct {
    ULONG64 ui0;
    USHORT w0;
    USHORT w1;
    //Followed by:
    // PCALLBACK[] 'array' - variable size array
} DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
    pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned somewhere in the answer on this page, it is clear that the compiler was starved of registers to have produced the following monstrosity (see how it reused rax for the loop end limit, or the movzx eax,word ptr [r13] instruction that clearly could've been left out of the loop).
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
    pDstClbks < pEndDstClbks;
    pScrClbks++, pDstClbks++)
{
    *pDstClbks = *pScrClbks;
}
this produces more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w

gcc '-m32' option changes floating-point rounding when not running valgrind

I am getting different floating-point rounding under different build/execute scenarios. Notice the 2498 in the second run below...
#include <iostream>
#include <cassert>
#include <typeinfo>

using std::cerr;

void domath( int n, double c, double & q1, double & q2 )
{
    q1=n*c;
    q2=int(n*c);
}

int main()
{
    int n=2550;
    double c=0.98, q1, q2;
    domath( n, c, q1, q2 );
    cerr<<"sizeof(int)="<<sizeof(int)<<", sizeof(double)="<<sizeof(double)<<", sizeof(n*c)="<<sizeof(n*c)<<"\n";
    cerr<<"n="<<n<<", int(q1)="<<int(q1)<<", int(q2)="<<int(q2)<<"\n";
    assert( typeid(q1) == typeid(n*c) );
}
Running as a 64-bit executable...
$ g++ -m64 -Wall rounding_test.cpp -o rounding_test && ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2499
Running as a 32-bit executable...
$ g++ -m32 -Wall rounding_test.cpp -o rounding_test && ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2498
Running as a 32-bit executable under valgrind...
$ g++ -m32 -Wall rounding_test.cpp -o rounding_test && valgrind --quiet ./rounding_test
sizeof(int)=4, sizeof(double)=8, sizeof(n*c)=8
n=2550, int(q1)=2499, int(q2)=2499
Why am I seeing different results when compiling with -m32, and why are the results different again when running valgrind?
My system is Ubuntu 14.04.1 LTS x86_64, and my gcc is version 4.8.2.
EDIT:
In response to the request for disassembly, I have refactored the code a bit so that I could isolate the relevant portion. The approach taken between -m64 and -m32 is clearly much different. I'm not too concerned about why these give a different rounding result since I can fix that by applying the round() function. The most interesting question is: why does valgrind change the result?
rounding_test: file format elf64-x86-64
<
000000000040090d <_Z6domathidRdS_>: <
40090d: 55 push %rbp <
40090e: 48 89 e5 mov %rsp,%rbp <
400911: 89 7d fc mov %edi,-0x4(%rbp <
400914: f2 0f 11 45 f0 movsd %xmm0,-0x10(%r <
400919: 48 89 75 e8 mov %rsi,-0x18(%rb <
40091d: 48 89 55 e0 mov %rdx,-0x20(%rb <
400921: f2 0f 2a 45 fc cvtsi2sdl -0x4(%rbp), <
400926: f2 0f 59 45 f0 mulsd -0x10(%rbp),%x <
40092b: 48 8b 45 e8 mov -0x18(%rbp),%r <
40092f: f2 0f 11 00 movsd %xmm0,(%rax) <
400933: f2 0f 2a 45 fc cvtsi2sdl -0x4(%rbp), <
400938: f2 0f 59 45 f0 mulsd -0x10(%rbp),%x <
40093d: f2 0f 2c c0 cvttsd2si %xmm0,%eax <
400941: f2 0f 2a c0 cvtsi2sd %eax,%xmm0 <
400945: 48 8b 45 e0 mov -0x20(%rbp),%r <
400949: f2 0f 11 00 movsd %xmm0,(%rax) <
40094d: 5d pop %rbp <
40094e: c3 retq <
| rounding_test: file format elf32-i386
> 0804871d <_Z6domathidRdS_>:
> 804871d: 55 push %ebp
> 804871e: 89 e5 mov %esp,%ebp
> 8048720: 83 ec 10 sub $0x10,%esp
> 8048723: 8b 45 0c mov 0xc(%ebp),%eax
> 8048726: 89 45 f8 mov %eax,-0x8(%ebp
> 8048729: 8b 45 10 mov 0x10(%ebp),%ea
> 804872c: 89 45 fc mov %eax,-0x4(%ebp
> 804872f: db 45 08 fildl 0x8(%ebp)
> 8048732: dc 4d f8 fmull -0x8(%ebp)
> 8048735: 8b 45 14 mov 0x14(%ebp),%ea
> 8048738: dd 18 fstpl (%eax)
> 804873a: db 45 08 fildl 0x8(%ebp)
> 804873d: dc 4d f8 fmull -0x8(%ebp)
> 8048740: d9 7d f6 fnstcw -0xa(%ebp)
> 8048743: 0f b7 45 f6 movzwl -0xa(%ebp),%ea
> 8048747: b4 0c mov $0xc,%ah
> 8048749: 66 89 45 f4 mov %ax,-0xc(%ebp)
> 804874d: d9 6d f4 fldcw -0xc(%ebp)
> 8048750: db 5d f0 fistpl -0x10(%ebp)
> 8048753: d9 6d f6 fldcw -0xa(%ebp)
> 8048756: 8b 45 f0 mov -0x10(%ebp),%e
> 8048759: 89 45 f0 mov %eax,-0x10(%eb
> 804875c: db 45 f0 fildl -0x10(%ebp)
> 804875f: 8b 45 18 mov 0x18(%ebp),%ea
> 8048762: dd 18 fstpl (%eax)
> 8048764: c9 leave
> 8048765: c3 ret
Edit: It would seem that, at least a long time back, valgrind's floating point calculations weren't quite as accurate as the "real" calculations. In other words, this MAY explain why you get different results. See this question and answer on the valgrind mailing list.
Edit2: And the current valgrind.org documentation has it in its "core limitations" section here - so I would expect that it is indeed "still valid". In other words, the documentation for valgrind says to expect differences between valgrind and x87 FPU calculations. "You have been warned!" (And as we can see, using SSE instructions to do the same math produces the same result as valgrind, confirming that it's a "rounding from 80 bits to 64 bits" difference.)
Floating point calculations WILL differ slightly depending on exactly how the calculation is performed. I'm not sure exactly what you want to have an answer to, so here's a long rambling "answer of a sort".
Valgrind DOES indeed change the exact behaviour of your program in various ways (it emulates certain instructions, rather than actually executing the real instructions - which may include saving the intermediate results of calculations). Also, floating point calculations are well known to "not be precise" - it's just a matter of luck/bad luck if the calculation comes out precise or not. 0.98 is one of many, many numbers that can't be described precisely in floating point format [at least not the common IEEE formats].
By adding:
cerr<<"c="<<std::setprecision(30)<<c <<"\n";
we see that the output is c=0.979999999999999982236431605997 (yes, the actual stored value is 0.979999...99982 or some such; the remaining digits are just residual, and since 0.98 is not an "even" binary number, there's always going to be something left over).
This is the n = 2550;, c = 0.98 and q = n * c part of the code as generated by gcc:
movl $2550, -28(%ebp) ; n
fldl .LC0
fstpl -40(%ebp) ; c
fildl -28(%ebp)
fmull -40(%ebp)
fstpl -48(%ebp) ; q - note that this is stored as a rounded 64-bit value.
This is the int(q) and int(n*c) part of the code:
fildl -28(%ebp) ; n
fmull -40(%ebp) ; c
fnstcw -58(%ebp) ; Save control word
movzwl -58(%ebp), %eax
movb $12, %ah
movw %ax, -60(%ebp) ; Save float control word.
fldcw -60(%ebp)
fistpl -64(%ebp) ; Store as integer (directly from 80-bit result)
fldcw -58(%ebp) ; restore float control word.
movl -64(%ebp), %ebx ; result of int(n * c)
fldl -48(%ebp) ; q
fldcw -60(%ebp) ; Load float control word as saved above.
fistpl -64(%ebp) ; Store as integer.
fldcw -58(%ebp) ; Restore control word.
movl -64(%ebp), %esi ; result of int(q)
Now, if the intermediate result is stored (and thus rounded) from the internal 80-bit precision in the middle of one of those calculations, the result may be subtly different from the result if the calculation happens without saving intermediate values.
I get identical results from both g++ 4.9.2 and clang++ with -mno-sse - but if I enable SSE in the clang case, it gives the same result as the 64-bit build. Using gcc -msse2 -m32 gives the 2499 answer everywhere. This indicates that the error is caused by "storing intermediate results" in some way or another.
Likewise, optimising in gcc to -O1 will give the 2499 in all places - but this is a coincidence, not a result of some "clever thinking". If you want correctly rounded integer values of your calculations, you're much better off rounding yourself, because sooner or later int(someDoubleValue) will come up "one short".
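A minimal sketch of "rounding yourself" (not taken from the question's program): round the product to the nearest integer instead of truncating it, so it no longer matters whether the intermediate comes out just above or just below the exact value.

#include <cmath>

long scaled(int n, double c)
{
    return std::lround(n * c);   // e.g. lround(2550 * 0.98) == 2499 either way
}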
Edit3: And finally, using g++ -mno-sse -m64 will also produce the same 2498 answer, where using valgrind on the same binary produces the 2499 answer.
The 32-bit version uses X87 floating point instructions. X87 internally uses 80-bit floating point numbers, which will cause trouble when numbers are converted to and from other precisions. In your case the 64-bit precision approximation for 0.98 is slightly less than the true value. When the CPU converts it to an 80-bit value you get the exact same numerical value, which is an equally bad approximation - having more bits doesn't get you a better approximation. The FPU then multiplies that number by 2550, and gets a figure that's slightly less than 2499. If the CPU used 64-bit numbers all the way it should compute exactly 2499, like it does in the 64-bit version.

Using .size() vs const variable for loops

I have a vector:
vector<Body*> Bodies;
And it contains pointers to Body objects that I have defined.
I also have an unsigned int const that contains the number of Body objects I wish to have in Bodies.
unsigned int const NumParticles = 1000;
I have populated Bodies with NumParticles Body objects.
Now if I wish to iterate through a loop, for example invoking each Body's Update() function in Bodies, I have two choices on what I can do:
First:
for (unsigned int i = 0; i < NumParticles; i++)
{
Bodies.at(i)->Update();
}
Or second:
for (unsigned int i = 0; i < Bodies.size(); i++)
{
Bodies.at(i)->Update();
}
There are pros and cons of each. I would like to know which one (if either) would be the better practice, in terms of safety, readability and convention.
I expect, given that the compiler (at least in this case) can inline all relevant code in the std::vector, it will be identical code [aside from 1000 being a true constant literal in the machine code, while Bodies.size() will be a "variable" value].
Short summary of findings:
The compiler doesn't call a function for size() of a vector on every iteration; it calculates that at the beginning of the loop and uses it as a "constant value" (roughly as in the sketch below).
Actual code IN the loop is identical, only the preparation of the loop is different.
As always: If performance is highly important, measure on your system with your data and your compiler. Otherwise, write the code that makes most sense for your design (I prefer using for(auto i : vec), as that is easy and straight forward [and works for all the containers])
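Roughly speaking, the size()-based loop ends up equivalent to this hand-hoisted version (a sketch using the question's Body type; this is not part of the test program below):

#include <vector>

void UpdateAll(std::vector<Body*>& bodies)
{
    const std::size_t n = bodies.size();   // read once, before the loop
    for (std::size_t i = 0; i < n; ++i)
        bodies[i]->Update();
}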
Supporting evidence:
After fetching coffee, I wrote this code:
class X
{
public:
    void Update() { x++; }
    operator int() { return x; }
private:
    int x = rand();
};

extern std::vector<X*> vec;
const size_t vec_size = 1000;

void Process1()
{
    for(auto i : vec)
    {
        i->Update();
    }
}

void Process2()
{
    for(size_t i = 0; i < vec.size(); i++)
    {
        vec[i]->Update();
    }
}

void Process3()
{
    for(size_t i = 0; i < vec_size; i++)
    {
        vec[i]->Update();
    }
}
(along with a main function that fills the array and calls Process1(), Process2() and Process3() - the main is in a separate file to avoid the compiler deciding to inline everything and making it hard to tell what is what)
Here's the code generated by g++ 4.9.2:
0000000000401940 <_Z8Process1v>:
401940: 48 8b 0d a1 18 20 00 mov 0x2018a1(%rip),%rcx # 6031e8 <vec+0x8>
401947: 48 8b 05 92 18 20 00 mov 0x201892(%rip),%rax # 6031e0 <vec>
40194e: 48 39 c1 cmp %rax,%rcx
401951: 74 14 je 401967 <_Z8Process1v+0x27>
401953: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401958: 48 8b 10 mov (%rax),%rdx
40195b: 48 83 c0 08 add $0x8,%rax
40195f: 83 02 01 addl $0x1,(%rdx)
401962: 48 39 c1 cmp %rax,%rcx
401965: 75 f1 jne 401958 <_Z8Process1v+0x18>
401967: f3 c3 repz retq
0000000000401970 <_Z8Process2v>:
401970: 48 8b 35 69 18 20 00 mov 0x201869(%rip),%rsi # 6031e0 <vec>
401977: 48 8b 0d 6a 18 20 00 mov 0x20186a(%rip),%rcx # 6031e8 <vec+0x8>
40197e: 31 c0 xor %eax,%eax
401980: 48 29 f1 sub %rsi,%rcx
401983: 48 c1 f9 03 sar $0x3,%rcx
401987: 48 85 c9 test %rcx,%rcx
40198a: 74 14 je 4019a0 <_Z8Process2v+0x30>
40198c: 0f 1f 40 00 nopl 0x0(%rax)
401990: 48 8b 14 c6 mov (%rsi,%rax,8),%rdx
401994: 48 83 c0 01 add $0x1,%rax
401998: 83 02 01 addl $0x1,(%rdx)
40199b: 48 39 c8 cmp %rcx,%rax
40199e: 75 f0 jne 401990 <_Z8Process2v+0x20>
4019a0: f3 c3 repz retq
00000000004019b0 <_Z8Process3v>:
4019b0: 48 8b 05 29 18 20 00 mov 0x201829(%rip),%rax # 6031e0 <vec>
4019b7: 48 8d 88 40 1f 00 00 lea 0x1f40(%rax),%rcx
4019be: 66 90 xchg %ax,%ax
4019c0: 48 8b 10 mov (%rax),%rdx
4019c3: 48 83 c0 08 add $0x8,%rax
4019c7: 83 02 01 addl $0x1,(%rdx)
4019ca: 48 39 c8 cmp %rcx,%rax
4019cd: 75 f1 jne 4019c0 <_Z8Process3v+0x10>
4019cf: f3 c3 repz retq
Whilst the assembly code looks slightly different for each of those cases, in practice, I'd say you'd be hard pushed to measure the difference between those loops, and in fact, a run of perf on the code shows that it's "the same time for all loops" [this is with 100000 elements and 100 calls to Process1, Process2 and Process3 in a loop, otherwise the time was dominated by new X in main]:
31.29% a.out a.out [.] Process1
31.28% a.out a.out [.] Process3
31.13% a.out a.out [.] Process2
Unless you think 1/10th of a percent is significant - and it may be for something that takes a week to run, but this is only a few tenths of a second [0.163 seconds on my machine], and probably more measurement error than anything else - and the shortest time is actually the one that in theory should be slowest, Process2, using vec.size(). I did another run with a higher loop count, and now the measurement for each of the loops is within 0.01% of the others - in other words identical in time spent.
Of course, if you look carefully, you will see that the actual loop content for all three variants is essentially identical, except for the early part of Process3, which is simpler because the compiler knows that we will do at least one iteration - Process1 and Process2 have to check for "is the vector empty" before the first iteration. This would make a difference for VERY short vector lengths.
I would vote for for range:
for (auto* body : Bodies)
{
body->Update();
}
NumParticles is not a property of the vector. It is some external constant relative to the vector. I would prefer to use the property size() of the vector. In this case the code is more safe and clear for the reader.
Usually, using some constant instead of size() signals to the reader that in general the constant may differ from the size().
Thus if you want to tell the reader that you are going to process all elements of the vector, it is better to use size(). Otherwise use the constant.
Of course there are exceptions to this implicit rule, when the emphasis is on the constant. In that case it is better to use the constant, but it depends on the context.
I would suggest using the .size() function instead of defining a new constant.
Why?
Safety : Since .size() does not throw any exceptions, it is perfectly safe to use .size().
Readability : IMHO, Bodies.size() conveys the size of the vector Bodies more clearly than NumParticles.
Convention : According to conventions too, it is better to use .size() as it is a property of the vector, instead of the variable NumParticles.
Performance: .size() is a constant complexity member function, so there is no significant performance difference between using a const int and .size().
I prefer this form:
for (auto const& it : Bodies)
{
it->Update();
}

why the compiler reserves just 0x10 bits for a int?

I have the following code:
#include <iostream>
using namespace std;
void f()
{
    cout << "hello" << endl;
}

void f(int i)
{
    cout << i << endl;
}

int main()
{
    f();
    f(0x123456);
}
I compiled it using g++, then disassembled it using objdump -Mintel -d and I got the following for the main function:
08048733 <main>:
8048733: 55 push ebp
8048734: 89 e5 mov ebp,esp
8048736: 83 e4 f0 and esp,0xfffffff0
8048739: 83 ec 10 sub esp,0x10
804873c: e8 9b ff ff ff call 80486dc <_Z1fv>
8048741: c7 04 24 56 34 12 00 mov DWORD PTR [esp],0x123456
8048748: e8 bb ff ff ff call 8048708 <_Z1fi>
804874d: b8 00 00 00 00 mov eax,0x0
8048752: c9 leave
8048753: c3 ret
Now, the reserved space on the stack is 16 bits (0x10, at line 8048739), while an int is (on my machine) 32 bits. This can't be because of optimization, because the number 0x123456 won't fit into 16 bits. So why doesn't the compiler reserve enough space?
So it's been pointed out that it's 0x10 bytes (not bits). It's 16 bytes because gcc keeps the stack 16-byte aligned for x86. From the GCC manual:
-mstackrealign
Realign the stack at entry. On the Intel x86, the -mstackrealign option generates an alternate prologue and epilogue that realigns the run-time stack if necessary. This supports mixing legacy codes that keep 4-byte stack alignment with modern codes that keep 16-byte stack alignment for SSE compatibility. See also the attribute force_align_arg_pointer, applicable to individual functions.
-mpreferred-stack-boundary=num
Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary. If -mpreferred-stack-boundary is not specified, the default is 4 (16 bytes or 128 bits).
I am not sure myself but I can try to help
sub esp,0x10 is not done to get the reserved space for the int (32 bits) on the stack. Instead, the first 4 assembly instructions are merely a compiler optimization used to free up the ebp register to be used as a general-purpose register.
Read more about it here.
The actual assembly involving the integer is mov DWORD PTR [esp],0x123456.
Hope it helps.
Digvijay

SIGSEGV When accessing array element using assembly

Background:
I am new to assembly. When I was learning programming, I made a program that implements multiplication tables up to 1000 * 1000. The tables are formatted so that each answer is on line factor1 << 10 | factor2 (I know, I know, it isn't pretty). These tables are then loaded into an array: int* tables. Empty lines are filled with 0. Here is a link to the file for the tables (7.3 MB). I know using assembly won't speed this up by much, but I just wanted to do it for fun (and a bit of practice).
Question:
I'm trying to convert this code into inline assembly (tables is a global):
int answer;
// ...
answer = tables [factor1 << 10 | factor2];
This is what I came up with:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
My C++ code works fine, but my assembly fails. What is wrong with my assembly (especially the movl _tables(,%2,4), %0; part), compared to my C++?
What I have done to solve it:
I used some random numbers: 89 and 796 as factor1 and factor2. I know that there is an element at 89 << 10 | 786 (which is 91922) – verified this with C++. When I run it with gdb, I get a SIGSEGV:
Program received signal SIGSEGV, Segmentation fault.
at this line:
"movl _tables(,%2,4), %0;" : "=r" (answer) : "r" (factor1), "r" (factor2) );
I added two methods around my asm, which is how I know where the asm block is in the disassembly.
Disassembly of my asm block:
The disassembly from objdump -M att -d looks fine (although I'm not sure, I'm new to assembly, as I said):
402096: 8b 45 08 mov 0x8(%ebp),%eax
402099: 8b 55 0c mov 0xc(%ebp),%edx
40209c: c1 e0 0a shl $0xa,%eax
40209f: 09 c2 or %eax,%edx
4020a1: 8b 04 95 18 e0 47 00 mov 0x47e018(,%edx,4),%eax
4020a8: 89 45 ec mov %eax,-0x14(%ebp)
The disassembly from objdump -M intel -d:
402096: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
402099: 8b 55 0c mov edx,DWORD PTR [ebp+0xc]
40209c: c1 e0 0a shl eax,0xa
40209f: 09 c2 or edx,eax
4020a1: 8b 04 95 18 e0 47 00 mov eax,DWORD PTR [edx*4+0x47e018]
4020a8: 89 45 ec mov DWORD PTR [ebp-0x14],eax
From what I understand, it's moving the first parameter of my void calc ( int factor1, int factor2 ) function into eax. Then it's moving the second parameter into edx. Then it shifts eax to the left by 10 and ORs it with edx. A 32-bit integer is 4 bytes, so [edx*4+base_address]. It moves the result to eax and then puts eax into int answer (which, I'm guessing, is at -0x14 on the stack). I don't really see much of a problem.
Disassembly of the compiler's .exe:
When I replace the asm block with plain C++ (answer = tables [factor1 << 10 | factor2];) and disassemble it this is what I get in Intel syntax:
402096: a1 18 e0 47 00 mov eax,ds:0x47e018
40209b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
40209e: c1 e2 0a shl edx,0xa
4020a1: 0b 55 0c or edx,DWORD PTR [ebp+0xc]
4020a4: c1 e2 02 shl edx,0x2
4020a7: 01 d0 add eax,edx
4020a9: 8b 00 mov eax,DWORD PTR [eax]
4020ab: 89 45 ec mov DWORD PTR [ebp-0x14],eax
AT&T syntax:
402096: a1 18 e0 47 00 mov 0x47e018,%eax
40209b: 8b 55 08 mov 0x8(%ebp),%edx
40209e: c1 e2 0a shl $0xa,%edx
4020a1: 0b 55 0c or 0xc(%ebp),%edx
4020a4: c1 e2 02 shl $0x2,%edx
4020a7: 01 d0 add %edx,%eax
4020a9: 8b 00 mov (%eax),%eax
4020ab: 89 45 ec mov %eax,-0x14(%ebp)
I am not really familiar with the Intel syntax, so I am just going to try and understand the AT&T syntax:
It first moves the base address of the tables array into %eax. Then it moves the first parameter into %edx. It shifts %edx to the left by 10, then ORs it with the second parameter. Then, by shifting %edx to the left by two, it actually multiplies %edx by 4. Then it adds that to %eax (the base address of the array). So basically it just did this: [edx*4+0x47e018] (Intel syntax) or 0x47e018(,%edx,4) (AT&T). It moves the value of the element it got into %eax and puts it into int answer. This method is more "expanded", but it does the same thing as my hand-written assembly! So why is mine giving a SIGSEGV while the compiler's works fine?
I bet (from the disassembly) that tables is a pointer to an array, not the array itself.
So you need:
asm volatile ( "shll $10, %1;"
movl _tables,%%eax
"orl %1, %2;"
"movl (%%eax,%2,4)",
: "=r" (answer) : "r" (factor1), "r" (factor2) : "eax" )
(Don't forget the extra clobber in the last line).
There are of course variations; this may be more efficient if the code is in a loop:
asm volatile ( "shll $10, %1;"
"orl %1, %2;"
"movl (%3,%2,4)",
: "=r" (answer) : "r" (factor1), "r" (factor2), "r"(tables) )
This is intended to be an addition to Mats Petersson's answer - I wrote it simply because it wasn't immediately clear to me why OP's analysis of the disassembly (that his assembly and the compiler-generated one were equivalent) was incorrect.
As Mats Petersson explains, the problem is that tables is actually a pointer to an array, so to access an element, you have to dereference twice. Now to me, it wasn't immediately clear where this happens in the compiler-generated code. The culprit is this innocent-looking line:
a1 18 e0 47 00 mov 0x47e018,%eax
To the untrained eye (that includes mine), this might look like the value 0x47e018 is moved to eax, but it's actually not. The Intel-syntax representation of the same opcodes gives us a clue:
a1 18 e0 47 00 mov eax,ds:0x47e018
Ah - ds: - so it's not actually a value, but an address!
For anyone who is wondering now, the following would be the opcodes and AT&T syntax assembly for moving the value 0x47e018 to eax:
b8 18 e0 47 00 mov $0x47e018,%eax