Why is native C++ performing poorly with C++ interop?

I have posted some code below to test the performance (time in milliseconds) of calling a method in native C++ and in C# from C++/CLI, using Visual Studio 2010. The native C++ code lives in a separate project that is compiled into a DLL. When I call C++ from C++, I get the expected result, which is much faster (about 4x) than the managed counterparts. However, when I call into native C++ from C++/CLI, the call is 10x slower.
Is this expected behavior when calling into native C++ from C++/CLI? I was under the impression that there shouldn't be a significant difference, but this simple test shows otherwise. Could this be an optimization difference between the C++ and C++/CLI compilers?
Update
I updated the .cpp so that I'm no longer calling a method in a tight loop (as Reed Copsey pointed out), and it turns out that the difference in performance is insignificant or very small, depending, of course, on how the inter-operation is done.
.h
#ifndef CPPOBJECT_H
#define CPPOBJECT_H
#ifdef CPLUSPLUSOBJECT_EXPORTING
#define CLASS_DECLSPEC __declspec(dllexport)
#else
#define CLASS_DECLSPEC __declspec(dllimport)
#endif
class CLASS_DECLSPEC CPlusPlusObject
{
public:
    CPlusPlusObject() {}
    ~CPlusPlusObject() {}

    void sayHello();
    double getSqrt(double n);
    // Update
    double wasteSomeTimeWithSqrt(double n);
};
#endif
.cpp
#include "CPlusPlusObject.h"
#include <iostream>
void CPlusPlusObject::sayHello(){std::cout << "Hello";}
double CPlusPlusObject::getSqrt(double n) {return std::sqrt(n);}
double CPlusPlusObject::wasteSomeTimeWithSqrt(double n)
{
double result = 0;
for (int x = 0; x < 10000000; x++)
{
result += std::sqrt(n);
}
return result;
}
c++/cli
const unsigned set = 100;
const unsigned repetitions = 1000000;

double cppcliTocpp()
{
    double n = 0;
    System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
    stopWatch->Start();
    while (stopWatch->ElapsedMilliseconds < 1200) { n += 0.001; }
    stopWatch->Reset();
    for (int x = 0; x < set; x++)
    {
        stopWatch->Start();
        CPlusPlusObject cplusplusObject;
        n += cplusplusObject.wasteSomeTimeWithSqrt(123.456);
        /*for (int i = 0; i < repetitions; i++)
        {
            n += cplusplusObject.getSqrt(123.456);
        }*/
        stopWatch->Stop();
        System::Console::WriteLine("c++/cli call to native c++ took " + stopWatch->ElapsedMilliseconds + "ms.");
        stopWatch->Reset();
    }
    return n;
}

double cppcliTocSharp()
{
    double n = 0;
    System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
    stopWatch->Start();
    while (stopWatch->ElapsedMilliseconds < 1200) { n += 0.001; }
    stopWatch->Reset();
    for (int x = 0; x < set; x++)
    {
        stopWatch->Start();
        CSharp::CSharpObject^ cSharpObject = gcnew CSharp::CSharpObject();
        for (int i = 0; i < repetitions; i++)
        {
            n += cSharpObject->GetSqrt(123.456);
        }
        stopWatch->Stop();
        System::Console::WriteLine("c++/cli call to c# took " + stopWatch->ElapsedMilliseconds + "ms.");
        stopWatch->Reset();
    }
    return n;
}

double cppcli()
{
    double n = 0;
    System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
    stopWatch->Start();
    while (stopWatch->ElapsedMilliseconds < 1200) { n += 0.001; }
    stopWatch->Reset();
    for (int x = 0; x < set; x++)
    {
        stopWatch->Start();
        CPlusPlusCliObject cPlusPlusCliObject;
        for (int i = 0; i < repetitions; i++)
        {
            n += cPlusPlusCliObject.getSqrt(123.456);
        }
        stopWatch->Stop();
        System::Console::WriteLine("c++/cli took " + stopWatch->ElapsedMilliseconds + "ms.");
        stopWatch->Reset();
    }
    return n;
}

int main()
{
    double n = 0;
    n += cppcliTocpp();
    n += cppcliTocSharp();
    n += cppcli();
    System::Console::WriteLine(n);
    System::Console::ReadKey();
}

However, when I call into c++ from c++/cli, the performance is 10x slower.
Bridging the CLR and native code requires marshaling. There is always going to be some overhead in each method call when going from C++/CLI into a native method call.
The only reason the overhead (in this case) seems so large is that you're calling a very fast method in a tight loop. If you were to batch the calls, or call a method that runs significantly longer, you'd find that the overhead is quite small.
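For illustration only, here is a sketch of what batching could look like: a hypothetical exported helper (getSqrtSum is not part of the original code) that runs the whole loop on the native side, so the managed-to-native transition is paid once per buffer instead of once per element.

#include <cmath>

// Hypothetical batched native entry point (a sketch, not part of the original
// DLL): the loop runs entirely in native code, so the C++/CLI caller pays the
// interop transition once for the whole buffer.
extern "C" __declspec(dllexport)
double getSqrtSum(const double* values, int count)
{
    double sum = 0;
    for (int i = 0; i < count; i++)
        sum += std::sqrt(values[i]);
    return sum;
}

This is essentially what the wasteSomeTimeWithSqrt() update in the question does: it moves the hot loop across the boundary instead of crossing the boundary inside the hot loop.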

These micro-benchmarks are very dangerous. You made an effort to avoid the typical benchmark mistakes but still fell into a classic trap. Your intention was to measure method call overhead, but that's not what's actually happening. The jitter optimizer is capable of standard code optimization techniques, like code hoisting and method inlining. You can only really see that when you look at the generated machine code. Debug + Windows + Disassembly window.
I tested this with VS2012, 32-bit Release build with the jitter optimizer enabled. The C++/CLI code was the fastest, taking ~128 msec:
000000bf fld qword ptr ds:[01212078h]
000000c5 fsqrt
000000c7 fstp qword ptr [ebp-20h]
//
// stopWatch->Start() call elided...
//
n += cPlusPlusCliObject.getSqrt(123.456);
000000f5 fld qword ptr [ebp-20h]
000000f8 fadd qword ptr [ebp-14h]
000000fb fstp qword ptr [ebp-14h]
for (int i = 0; i < repetitions; i++)
000000fe dec eax
000000ff jne 000000F5
In other words, the std::sqrt() call got hoisted out of the loop and the inner loop simply performs adds from the generated value. No method call. Also note how it didn't actually measure the time needed for the sqrt() call :)
The loop with the C# method call was a bit slower, taking ~180 msec:
000000ea fld qword ptr ds:[01211EC0h]
000000f0 fsqrt
000000f2 fadd qword ptr [ebp-14h]
000000f5 fstp qword ptr [ebp-14h]
for (int i = 0; i < repetitions; i++)
000000f8 dec eax
000000f9 jne 000000EA
Just the inlined call to Math::Sqrt(); it didn't get hoisted. I'm not actually sure why; the optimizations the jitter performs are subject to a time budget.
And I won't post the code for the interop call. But yes, it takes ~380 msec because an actual function call has to be made (unmanaged code cannot be inlined), plus the thunk that's required to keep the garbage collector from blundering into the unmanaged stack frame. The thunk is pretty fast, a handful of nanoseconds, but it just can't compete with the jitter optimizer directly inlining the fadd or fsqrt.

Related

OpenMP Strange Behavior - differences in performance

I want to speed up image-processing code using OpenMP and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result.
I'm not sure why the OpenMP code is in some situations much slower than in others. For example, the function divideImageDataWithParam(), or the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which takes the struct TSize as the parameter for the image data size. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why doesn't the difference show up with other radius sizes in the program?
I created a GitHub repository for this little testing project:
https://github.com/Tb45/OpenMP-Strange-Behavior
Here are all results summarized:
https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt
I haven't found any explanation for why this is happening or what I am doing wrong.
Thanks for your help.
I'm working on a faster box filter and other image-processing algorithms.
#include <stdint.h>    // for intptr_t

typedef intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};

void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}

void divideImageDataWithParam_OpenMP(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}
Results of divideImageDataWithParam():
generateRandomImageData :: 3840x2160
numberOfIterations = 100
With Visual C++ 2019:
      32-bit        64-bit
  336.906ms     344.251ms   divideImageDataWithParam
 1832.120ms    6395.861ms   divideImageDataWithParam_OpenMP single-threaded (parallel=false)
  387.152ms    1204.302ms   divideImageDataWithParam_OpenMP multi-threaded  (parallel=true)
With Intel C++ 19:
      32-bit        64-bit
   15.162ms       8.927ms   divideImageDataWithParam
  266.646ms     294.134ms   divideImageDataWithParam_OpenMP single-threaded (parallel=false)
  239.564ms    1195.556ms   divideImageDataWithParam_OpenMP multi-threaded  (parallel=true)
A screenshot from Intel VTune Amplifier shows that divideImageDataWithParam_OpenMP() with parallel=false spends most of its time in the mov instruction that stores to dst memory.
648trindade is right; it has to do with optimizations that cannot be done with OpenMP. But it's not loop unrolling or vectorization, it's inlining, which allows for a smart substitution.
Let me explain: integer divisions are incredibly slow (64-bit IDIV: ~40-100 cycles). So whenever possible, people (and compilers) try to avoid divisions. One trick is to substitute a division with a multiplication and a shift. That only works if the divisor is known at compile time. This is the case here because your function divideImageDataWithParam is inlined and param is known. You can verify this by marking it with __declspec(noinline); you will then get the timings you expected.
The OpenMP parallelization does not allow this trick, because the function cannot be inlined, so param is not known at compile time and an expensive IDIV instruction is generated.
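To make the substitution concrete, here is a small standalone sketch (my own illustration, not the compiler's output); the constant M is chosen for 8-bit pixel values and a divisor in 1..255, which matches the image data above:

#include <cassert>
#include <cstdint>

// Division of a byte value v by a divisor d that is known ahead of time can
// be replaced by one multiply and one shift: with M = 65536/d + 1,
// (v * M) >> 16 equals v / d for all v in 0..255 and d in 1..255.
static uint8_t divideByteByConst(uint8_t v, uint32_t M)
{
    return static_cast<uint8_t>((v * M) >> 16);   // mul + shr, no idiv
}

int main()
{
    const uint32_t d = 7;               // the "compile-time known" divisor
    const uint32_t M = 65536 / d + 1;   // precomputed reciprocal
    for (uint32_t v = 0; v < 256; v++)
        assert(divideByteByConst(static_cast<uint8_t>(v), M) == v / d);
    return 0;
}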
Compiler output of divideImageDataWithParam (WIN10, MSVC2017, x64):
0x7ff67d151480 <+ 336> movzx ecx,byte ptr [r10+r8]
0x7ff67d151485 <+ 341> mov rax,r12
0x7ff67d151488 <+ 344> mul rax,rcx <------- multiply
0x7ff67d15148b <+ 347> shr rdx,3 <------- shift
0x7ff67d15148f <+ 351> mov byte ptr [r8],dl
0x7ff67d151492 <+ 354> lea r8,[r8+1]
0x7ff67d151496 <+ 358> sub r9,1
0x7ff67d15149a <+ 362> jne test!main+0x150 (00007ff6`7d151480)
And the openmp-version:
0x7ff67d151210 <+ 192> movzx eax,byte ptr [r10+rcx]
0x7ff67d151215 <+ 197> lea rcx,[rcx+1]
0x7ff67d151219 <+ 201> cqo
0x7ff67d15121b <+ 203> idiv rax,rbp <------- idiv
0x7ff67d15121e <+ 206> mov byte ptr [rcx-1],al
0x7ff67d151221 <+ 209> lea rax,[r8+rcx]
0x7ff67d151225 <+ 213> mov rdx,qword ptr [rbx]
0x7ff67d151228 <+ 216> cmp rax,rdx
0x7ff67d15122b <+ 219> jl test!divideImageDataWithParam$omp$1+0xc0 (00007ff6`7d151210)
Note 1) If you try the Compiler Explorer (https://godbolt.org/) you will see that some compilers do the substitution for the OpenMP version too.
Note 2) As soon as the parameter is not known at compile time, this optimization cannot be done anyway. So if you put your function into a library, it will be slow. I'd do something like precomputing the division for all possible values and then doing a lookup. This is even faster, because the lookup table fits into 4-5 cache lines and L1 latency is only 3-4 cycles.
void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    uint8_t tbl[256];
    for (int i = 0; i < 256; i++) {
        tbl[i] = i / param;
    }
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
        }
    }
}
Also thanks for the interesting question, I learned a thing or two along the way! ;-)
This behavior is explained by compiler optimizations: when they are enabled, the sequential divideImageDataWithParam code is subjected to a series of optimizations (loop unrolling, vectorization, etc.) that the parallel divideImageDataWithParam_OpenMP code probably is not, since the compiler outlines the parallel region into a separate routine.
If you compile the same code without optimizations, you will find that the runtime of the sequential version is very similar to that of the parallel version with only one thread.
The maximum speedup of the parallel version in this case is limited to dividing the original, unoptimized workload among threads; such optimizations have to be written manually, as in the sketch below.
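A sketch of what writing the optimization manually could look like (the function name is hypothetical; it reuses the lookup-table idea from the answer above together with the question's int_t and TSize types):

void divideImageDataWithParam_OpenMP_tbl(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    // Do the division once per possible pixel value; the parallel loop then
    // only performs a load, a table lookup and a store per pixel.
    unsigned char tbl[256];
    for (int i = 0; i < 256; i++)
        tbl[i] = static_cast<unsigned char>(i / param);

    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
        for (int_t x = 0; x < size.width; x++)
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
}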

overriding function calls from SVML

The Xeon Phi Knights Landing cores have a fast exp2 instruction, vexp2pd (intrinsic _mm512_exp2a23_pd). The Intel C++ compiler can vectorize the exp function using the Short Vector Math Library (SVML) that comes with the compiler. Specifically, it calls the function __svml_exp8.
However, when I step through in a debugger, I don't see __svml_exp8 using the vexp2pd instruction. It is a complicated function with many FMA operations. I understand that vexp2pd is less accurate than exp, but if I use -fp-model fast=1 (the default) or -fp-model fast=2, I expect the compiler to use this instruction, yet it does not.
I have two questions.
Is there a way to get the compiler to use vexp2pd?
How do I safely override the call to __svml_exp8?
As to the second question, this is what I have done so far:
//exp(x) = exp2(log2(e)*x)
extern "C" __m512d __svml_exp8(__m512d x) {
    return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}
Is this safe? Is there a better solution, e.g. one that inlines the function? In the test code below, this is about 3 times faster than when I don't override.
//https://godbolt.org/g/adI11c
//icpc -O3 -xMIC-AVX512 foo.cpp
#include <math.h>
#include <stdio.h>
#include <x86intrin.h>

extern "C" __m512d __svml_exp8(__m512d x) {
    //exp(x) = exp2(log2(e)*x)
    return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}

void foo(double * __restrict x, double * __restrict y) {
    __assume_aligned(x, 64);
    __assume_aligned(y, 64);
    for(int i=0; i<1024; i++) y[i] = exp(x[i]);
}

int main(void) {
    double x[1024], y[1024];
    for(int i=0; i<1024; i++) x[i] = 1.0*i;
    for(int r=0; r<1000000; r++) foo(x,y);
    double sum=0;
    //for(int i=0; i<1024; i++) sum+=y[i];
    for(int i=0; i<8; i++) printf("%f ", y[i]); puts("");
    //printf("%lf",sum);
}
ICC will generate vexp2pd, but only under much relaxed math requirements as specified by the targeted -fimf* switches.
#include <math.h>

void vfoo(int n, double * a, double * r)
{
    int i;
    #pragma simd
    for ( i = 0; i < n; i++ )
    {
        r[i] = exp(a[i]);
    }
}
E.g. compile with -xMIC-AVX512 -fimf-domain-exclusion=1 -fimf-accuracy-bits=22
..B1.12:
vmovups (%rsi,%rax,8), %zmm0
vmulpd .L_2il0floatpacket.2(%rip){1to8}, %zmm0, %zmm1
vexp2pd %zmm1, %zmm2
vmovupd %zmm2, (%rcx,%rax,8)
addq $8, %rax
cmpq %r8, %rax
jb ..B1.12
Please be sure to understand the accuracy implications: not only is the end result only about 22 bits accurate, but vexp2pd also flushes denormalized results to zero irrespective of the FTZ/DAZ bits set in the MXCSR.
To the second question: "How do I safely override the call to __svml_exp8?"
Your approach is generally not safe. SVML routines are internal to the Intel compiler and rely on custom calling conventions, so a generic routine with the same name can potentially clobber more registers than the library routine would, and you may end up with a hard-to-debug ABI mismatch.
A better way to provide your own vector functions is to use #pragma omp declare simd (see https://software.intel.com/en-us/node/524514) and possibly the vector_variant attribute if you prefer coding with intrinsics (see https://software.intel.com/en-us/node/523350). Just don't try to override standard math names or you'll get an error.
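A sketch of the #pragma omp declare simd mechanism (my_exp and vbar are hypothetical names; this assumes OpenMP SIMD support is enabled at compile time and only shows the declaration mechanism, not a vexp2pd-based body):

#include <math.h>

// Declare a function for which the compiler also emits vector variants with a
// documented ABI, instead of overriding an internal SVML symbol.
#pragma omp declare simd simdlen(8) notinbranch
double my_exp(double x)
{
    return exp(x);               // scalar body; vector variants are generated too
}

void vbar(int n, double * a, double * r)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        r[i] = my_exp(a[i]);     // the simd loop can call the vector variant
}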

Poor performance of vector<bool> in 64-bit target with VS2012

Benchmarking this class:
struct Sieve {
    std::vector<bool> isPrime;
    Sieve (int n = 1) {
        isPrime.assign (n+1, true);
        isPrime[0] = isPrime[1] = false;
        for (int i = 2; i <= (int)sqrt((double)n); ++i)
            if (isPrime[i])
                for (int j = i*i; j <= n; j += i)
                    isPrime[j] = false;
    }
};
I'm getting over 3 times worse performance (CPU time) with the 64-bit binary than with the 32-bit version (release build) when calling the constructor for a large number, e.g.
Sieve s(100000000);
I tested sizeof(bool) and it is 1 for both versions.
When I substitute vector<char> for vector<bool>, the performance becomes the same for the 64-bit and 32-bit versions. Why is that?
Here are the run times for Sieve s(100000000) (release mode; 32-bit first, 64-bit second):
vector<bool> 0.97s 3.12s
vector<char> 0.99s 0.99s
vector<int> 1.57s 1.59s
I also did a sanity test with VS2010 (prompted by Wouter Huysentruit's response), which produced 0.98s / 0.88s. So there is something wrong with the VS2012 implementation.
I submitted a bug report to Microsoft Connect.
EDIT
Many answers below comment on the deficiencies of using int for indexing. This may be true, but even the Great Wizard himself uses a standard for (int i = 0; i < v.size(); ++i) in his books, so such a pattern should not incur a significant performance penalty. Additionally, this issue was raised at the Going Native 2013 conference, and the presiding group of C++ gurus called their early recommendation of using size_t for indexing and as the return type of size() a mistake. They said: "we are sorry, we were young..."
The title of this question could be rephrased as: over 3 times performance drop in this code when upgrading from VS2010 to VS2012.
EDIT
I made a crude attempt at finding the memory alignment of the indexes i and j and discovered that this instrumented version:
struct Sieve {
    vector<bool> isPrime;
    Sieve (int n = 1) {
        isPrime.assign (n+1, true);
        isPrime[0] = isPrime[1] = false;
        for (int i = 2; i <= sqrt((double)n); ++i) {
            if (i == 17) cout << ((int)&i)%16 << endl;
            if (isPrime[i])
                for (int j = i*i; j <= n; j += i) {
                    if (j == 4) cout << ((int)&j)%16 << endl;
                    isPrime[j] = false;
                }
        }
    }
};
auto-magically runs fast now (only 10% slower than the 32-bit version). This, together with the VS2010 performance, makes it hard to accept the theory that the optimizer has inherent problems dealing with int indexes instead of size_t.
std::vector<bool> is not directly at fault here. The performance difference is ultimately caused by your use of the signed 32-bit int type in your loops and some rather poor register allocation by the compiler. Consider, for example, your innermost loop:
for (int j = i*i; j <= n; j += i)
    isPrime[j] = false;
Here, j is a 32-bit signed integer. When it is used in isPrime[j], however, it must be promoted (and sign-extended) to a 64-bit integer, in order to perform the subscript computation. The compiler can't just treat j as a 64-bit value, because that would change the behavior of the loop (e.g. if n is negative). The compiler also can't perform the index computation using the 32-bit quantity j, because that would change the behavior of that expression (e.g. if j is negative).
So, the compiler needs to generate code for the loop using a 32-bit j then it must generate code to convert that j to a 64-bit integer for the subscript computation. It has to do the same thing for the outer loop with i. Unfortunately, it looks like the compiler allocates registers rather poorly in this case(*)--it starts spilling temporaries to the stack, causing the performance hit you see.
If you change your program to use size_t everywhere (which is 32-bit on x86 and 64-bit on x64), you will observe that the performance is on par with x86, because the generated code needs only work with values of one size:
Sieve (size_t n = 1)
{
    isPrime.assign (n+1, true);
    isPrime[0] = isPrime[1] = false;
    for (size_t i = 2; i <= static_cast<size_t>(sqrt((double)n)); ++i)
        if (isPrime[i])
            for (size_t j = i*i; j <= n; j += i)
                isPrime[j] = false;
}
You should do this anyway, because mixing signed and unsigned types, especially when those types are of different widths, is perilous and can lead to unexpected errors.
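As a small standalone illustration of the kind of surprise this can cause (my own example, not taken from the sieve code):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> v;          // empty, so v.size() == 0
    int i = -1;
    // For the comparison, i is converted to the unsigned type returned by
    // size(), so -1 becomes a huge value and the condition is false.
    if (i < v.size())
        std::printf("this line is never printed\n");
    // The classic loop form of the same bug:
    //   for (int i = 0; i < v.size() - 1; ++i) ...   // 0 - 1 wraps around
    return 0;
}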
Note that using std::vector<char> also "solves" the problem, but for a different reason: the subscript computation required for accessing an element of a std::vector<char> is substantially simpler than that for accessing an element of std::vector<bool>. The optimizer is able to generate better code for the simpler computations.
(*) I don't work on code generation, and I'm hardly an expert in either assembly or low-level performance optimization, but from looking at the generated code, and given that it is reported here that Visual C++ 2010 generates better code, I'd guess that there are opportunities for improvement in the compiler. I'll make sure the Connect bug you opened gets forwarded on to the compiler team so they can take a look.
I've tested this with vector<bool> in VS2010: 32-bit needs 1452ms while 64-bit needs 1264ms to complete on an i3.
The same test in VS2012 (on an i7 this time) needs 700ms (32-bit) and 2730ms (64-bit), so there is something wrong with the compiler in VS2012. Maybe you can report this test case as a bug to Microsoft.
UPDATE
The problem is that the VS2012 compiler uses a temporary stack variable for part of the code in the inner for-loop when int is used as the iterator. The assembly listings below are part of the code inside <vector>, in the += operator of std::vector<bool>::iterator.
size_t as iterator
When using size_t as iterator, a part of the code looks like this:
or rax, -1
sub rax, rdx
shr rax, 5
lea rax, QWORD PTR [rax*4+4]
sub r8, rax
Here, all instructions use CPU registers which are very fast.
int as iterator
When using int as iterator, that same part looks like this:
or rcx, -1
sub rcx, r8
shr rcx, 5
shl rcx, 2
mov rax, -4
sub rax, rcx
mov rdx, QWORD PTR _Tmp$6[rsp]
add rdx, rax
Here you see the _Tmp$6 stack variable being used, which causes the slowdown.
Point the compiler in the right direction
The funny part is that you can point the compiler in the right direction by using the vector<bool>::iterator directly.
struct Sieve {
    std::vector<bool> isPrime;
    Sieve (int n = 1) {
        isPrime.assign(n + 1, true);
        std::vector<bool>::iterator it1 = isPrime.begin();
        std::vector<bool>::iterator end = it1 + n;
        *it1++ = false;
        *it1++ = false;
        for (int i = 2; i <= (int)sqrt((double)n); ++it1, ++i)
            if (*it1)
                for (std::vector<bool>::iterator it2 = isPrime.begin() + i * i; it2 <= end; it2 += i)
                    *it2 = false;
    }
};
vector<bool> is a very special container that is specialized to use 1 bit per item rather than providing normal container semantics. I suspect that the bit-manipulation logic is much more expensive when compiling for 64 bits (either it still uses 32-bit chunks to hold the bits, or there is some other reason). vector<char> behaves just like a normal vector, so there is no special logic.
You could also use deque<bool> which doesn't have the specialization.
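For comparison, a sketch of the same sieve with std::vector<char> (SieveChar is just a renamed copy for illustration); the question reports this variant running equally fast in 32-bit and 64-bit builds, since element access is a plain byte store with none of vector<bool>'s bit-packing arithmetic:

#include <cmath>
#include <vector>

struct SieveChar {
    std::vector<char> isPrime;
    SieveChar (int n = 1) {
        isPrime.assign (n+1, 1);
        isPrime[0] = isPrime[1] = 0;
        for (int i = 2; i <= (int)std::sqrt((double)n); ++i)
            if (isPrime[i])
                for (int j = i*i; j <= n; j += i)
                    isPrime[j] = 0;
    }
};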

Extremely bizarre code generation in Visual C++, for nearly identical code; 3x speed difference

The code below (reduced from my larger code, after my astonishment at how its speed paled in comparison with that of std::vector) has two peculiar features:
It runs more than three times faster when I make a very tiny modification to the source code (always compiling it with /O2 with Visual C++ 2010).
Note: To make this a little more fun, I put a hint for the modification at the end, so you can spend some time figuring out the change yourself. The original code was ~500 lines, so it took me a heck of a lot longer to pin it down, since the fix looks pretty irrelevant to the performance.
It runs about 20% faster with /MTd than with /MT, even though the output loop looks the same!!!
The difference in the assembly code for the tiny-modification case is:
Loop without the modification (~300 ms):
00403383 mov esi,dword ptr [esp+10h]
00403387 mov edx,dword ptr [esp+0Ch]
0040338B mov dword ptr [edx+esi*4],eax
0040338E add dword ptr [esp+10h],ecx
00403392 add eax,ecx
00403394 cmp eax,4000000h
00403399 jl main+43h (403383h)
Loop with /MTd (looks identical! but ~270 ms):
00407D73 mov esi,dword ptr [esp+10h]
00407D77 mov edx,dword ptr [esp+0Ch]
00407D7B mov dword ptr [edx+esi*4],eax
00407D7E add dword ptr [esp+10h],ecx
00407D82 add eax,ecx
00407D84 cmp eax,4000000h
00407D89 jl main+43h (407D73h)
Loop with the modification (~100 ms!!):
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
Now my question is, why should the changes above have the effects they do? It's completely bizarre!
Especially the first one -- it shouldn't affect anything at all (once you see the difference in the code), and yet it lowers the speed dramatically.
Is there an explanation for this?
#include <cstdio>
#include <ctime>
#include <algorithm>
#include <memory>

template<class T, class Allocator = std::allocator<T> >
struct vector : Allocator
{
    T *p;
    size_t n;
    struct scoped
    {
        T *p_;
        size_t n_;
        Allocator &a_;
        ~scoped() { if (p_) { a_.deallocate(p_, n_); } }
        scoped(Allocator &a, size_t n) : a_(a), n_(n), p_(a.allocate(n, 0)) { }
        void swap(T *&p, size_t &n)
        {
            std::swap(p_, p);
            std::swap(n_, n);
        }
    };
    vector(size_t n) : n(0), p(0) { scoped(*this, n).swap(p, n); }
    void push_back(T const &value) { p[n++] = value; }
};

int main()
{
    int const COUNT = 1 << 26;
    vector<int> vect(COUNT);
    clock_t start = clock();
    for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
    printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
}
Hint (hover your mouse below):
It has to do with the allocator.
Answer:
Change Allocator &a_ to Allocator a_.
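For reference, here is the relevant fragment with that change applied; the struct is otherwise unchanged and still sits inside the question's vector template, so this is not a standalone snippet:

    struct scoped
    {
        T *p_;
        size_t n_;
        Allocator a_;        // was: Allocator &a_
        ~scoped() { if (p_) { a_.deallocate(p_, n_); } }
        scoped(Allocator &a, size_t n) : a_(a), n_(n), p_(a.allocate(n, 0)) { }
        void swap(T *&p, size_t &n)
        {
            std::swap(p_, p);
            std::swap(n_, n);
        }
    };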
For what it's worth, my speculation about the difference between /MT and /MTd is that the /MTd heap allocation paints the heap memory for debugging purposes, making it more likely to be paged in, and that happens before you start the clock.
If you 'pre-heat' the vector allocation, you get the same numbers for /MT and /MTd:
vector<int> vect(COUNT);
// make sure vect's memory is warmed up
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
vect.n = 0; // clear the vector
clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
It's strange that Allocator& will break the alias chain while Allocator will not.
You can try
for(int i=vect.n; i<COUNT;++i){
...
}
to make sure i and n stay synchronized.
This will make it much easier for VC to optimize.
Hmm... it seems that the "fastest" code
00403361 mov dword ptr [esi+eax*4],eax
00403364 inc eax
00403365 cmp eax,4000000h
0040336A jl main+21h (403361h)
is somewhat over-optimized. In this loop, vect.n is ignored altogether...
If an exception were thrown in the loop, vect.n would not be updated correctly.
So the answer might be: when you use Allocator by value, VC figures out that vect.n will never be used again, so it can be ignored. It's amazing, but in general this is not very useful, and it is dangerous.

Profiling SIMD Code

UPDATED - Check Below
Will keep this as short as possible. Happy to add any more details if required.
I have some SSE code for normalising a vector. I'm using QueryPerformanceCounter() (wrapped in a helper struct) to measure performance.
If I measure like this
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_sse);
NormaliseSSE( vectors_sse+j);
}
The results I get are often slower than just doing a standard normalise with 4 doubles representing a vector (testing in the same configuration).
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_dbl);
NormaliseDBL( vectors_dbl+j);
}
However, timing just the entirety of the loop like this
{
Timer t(norm_sse);
for( int j = 0; j < NUM_VECTORS; ++j ){
NormaliseSSE( vectors_sse+j );
}
}
shows the SSE code to be an order of magnitude faster, but doesn't really affect the measurements for the double version.
I've done a fair bit of experimentation and searching, and can't seem to find a reasonable answer as to why.
For example, I know there can be penalties when casting the results to float, but none of that is going on here.
Can anyone offer any insight? What is it about calling QueryPerformanceCounter between each normalise that slows the SIMD code down so much?
Thanks for reading :)
More details below:
Both normalise methods are inlined (verified in disassembly)
Running in release
32 bit compilation
Simple Vector struct
_declspec(align(16)) struct FVECTOR{
    typedef float REAL;
    union{
        struct { REAL x, y, z, w; };
        __m128 Vec;
    };
};
Code to Normalise SSE:
__m128 Vec = _v->Vec;
__m128 sqr = _mm_mul_ps( Vec, Vec ); // Vec * Vec
__m128 yxwz = _mm_shuffle_ps( sqr, sqr , 0x4e );
__m128 addOne = _mm_add_ps( sqr, yxwz );
__m128 swapPairs = _mm_shuffle_ps( addOne, addOne , 0x11 );
__m128 addTwo = _mm_add_ps( addOne, swapPairs );
__m128 invSqrOne = _mm_rsqrt_ps( addTwo );
_v->Vec = _mm_mul_ps( invSqrOne, Vec );
Code to normalise doubles
double len_recip = 1./sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
v->x *= len_recip;
v->y *= len_recip;
v->z *= len_recip;
Helper struct
struct Timer{
Timer( LARGE_INTEGER & a_Storage ): Storage( a_Storage ){
QueryPerformanceCounter( &PStart );
}
~Timer(){
LARGE_INTEGER PEnd;
QueryPerformanceCounter( &PEnd );
Storage.QuadPart += ( PEnd.QuadPart - PStart.QuadPart );
}
LARGE_INTEGER& Storage;
LARGE_INTEGER PStart;
};
Update
So, thanks to John's comments, I think I've managed to confirm that it is QueryPerformanceCounter that's doing bad things to my SIMD code.
I added a new timer struct that uses RDTSC directly, and it seems to give results consistent with what I would expect. The result is still far slower than timing the entire loop rather than each iteration separately, but I expect that's because getting the RDTSC value involves flushing the instruction pipeline (see http://www.strchr.com/performance_measurements_with_rdtsc for more info).
struct PreciseTimer{
    PreciseTimer( LARGE_INTEGER& a_Storage ) : Storage(a_Storage){
        StartVal.QuadPart = GetRDTSC();
    }
    ~PreciseTimer(){
        Storage.QuadPart += ( GetRDTSC() - StartVal.QuadPart );
    }
    unsigned __int64 inline GetRDTSC() {
        unsigned int lo, hi;
        __asm {
            ; Flush the pipeline
            xor eax, eax
            CPUID
            ; Get RDTSC counter in edx:eax
            RDTSC
            mov DWORD PTR [hi], edx
            mov DWORD PTR [lo], eax
        }
        // Cast before shifting: shifting the 32-bit hi left by 32 would lose the value
        return ((unsigned __int64)hi << 32) | lo;
    }
    LARGE_INTEGER StartVal;
    LARGE_INTEGER& Storage;
};
When it's only the SSE code running the loop, the processor should be able to keep its pipelines full and executing a huge number of SIMD instructions per unit time. When you add the timer code within the loop, now there's a whole bunch of non-SIMD instructions, possibly less predictable, between each of the easy-to-optimize operations. It's likely that the QueryPerformanceCounter call is either expensive enough to make the data manipulation part insignificant, or the nature of the code it executes wreaks havoc with the processor's ability to keep executing instructions at the maximum rate (possibly due to cache evictions or branches that are not well-predicted).
You might try commenting out the actual calls to QPC in your Timer class and see how it performs; this may help you discover whether it's the construction and destruction of the Timer objects that is the problem, or the QPC calls. Likewise, try just calling QPC directly in the loop instead of making a Timer and see how that compares.
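A sketch of those two experiments (my own illustration, not from the original answer; NormaliseSSE, vectors_sse, NUM_VECTORS and norm_sse refer to the question's code):

#include <windows.h>

// (1) A "null" timer: the same object construction and destruction as Timer,
//     but without any QueryPerformanceCounter calls.
struct NullTimer {
    explicit NullTimer(LARGE_INTEGER&) {}
    ~NullTimer() {}
};

// (2) Calling QPC directly in the loop, without a wrapper object:
//
//     LARGE_INTEGER s, e;
//     for (int j = 0; j < NUM_VECTORS; ++j) {
//         QueryPerformanceCounter(&s);
//         NormaliseSSE(vectors_sse + j);
//         QueryPerformanceCounter(&e);
//         norm_sse.QuadPart += e.QuadPart - s.QuadPart;
//     }
//
// If (1) is fast and (2) is still slow, the QPC calls themselves, rather than
// the Timer object's construction and destruction, are the problem.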
QPC is a kernel function, and calling it causes a context switch, which is inherently far more expensive and destructive than any equivalent user-mode function call, and will definitely annihilate the processor's ability to process at its normal speed. In addition, remember that QPC/QPF are abstractions and require their own processing, which likely involves the use of SSE itself.