Overriding function calls from SVML - C++

The Xeon Phi Knights Landing cores have a fast exp2 instruction, vexp2pd (intrinsic _mm512_exp2a23_pd). The Intel C++ compiler can vectorize the exp function using the Short Vector Math Library (SVML) which comes with the compiler. Specifically, it calls the function __svml_exp8.
However, when I step through with a debugger I don't see that __svml_exp8 uses the vexp2pd instruction. It is a complicated function with many FMA operations. I understand that vexp2pd is less accurate than exp, but if I use -fp-model fast=1 (the default) or -fp-model fast=2 I expect the compiler to use this instruction, yet it does not.
I have two questions.
Is there a way to get the compiler to use vexp2pd?
How do I safely override the call to __svml_exp8?
As to the second question, this is what I have done so far:
//exp(x) = exp2(log2(e)*x)
extern "C" __m512d __svml_exp8(__m512d x) {
    return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}
Is this safe? Is there a better solution e.g. one that inlines the function? In the test code below this is about 3 times faster than if I don't override.
//https://godbolt.org/g/adI11c
//icpc -O3 -xMIC-AVX512 foo.cpp
#include <math.h>
#include <stdio.h>
#include <x86intrin.h>

extern "C" __m512d __svml_exp8(__m512d x) {
    //exp(x) = exp2(log2(e)*x)
    return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}

void foo(double * __restrict x, double * __restrict y) {
    __assume_aligned(x, 64);
    __assume_aligned(y, 64);
    for(int i=0; i<1024; i++) y[i] = exp(x[i]);
}

int main(void) {
    double x[1024], y[1024];
    for(int i=0; i<1024; i++) x[i] = 1.0*i;
    for(int r=0; r<1000000; r++) foo(x,y);
    double sum=0;
    //for(int i=0; i<1024; i++) sum+=y[i];
    for(int i=0; i<8; i++) printf("%f ", y[i]); puts("");
    //printf("%lf",sum);
}

ICC will generate vexp2pd, but only under much relaxed math requirements as specified by targeted -fimf-* switches.
#include <math.h>

void vfoo(int n, double * a, double * r)
{
    int i;
#pragma simd
    for ( i = 0; i < n; i++ )
    {
        r[i] = exp(a[i]);
    }
}
E.g. compile with -xMIC-AVX512 -fimf-domain-exclusion=1 -fimf-accuracy-bits=22
..B1.12:
vmovups (%rsi,%rax,8), %zmm0
vmulpd .L_2il0floatpacket.2(%rip){1to8}, %zmm0, %zmm1
vexp2pd %zmm1, %zmm2
vmovupd %zmm2, (%rcx,%rax,8)
addq $8, %rax
cmpq %r8, %rax
jb ..B1.12
Please be sure to understand the accuracy implications: not only is the end result only about 22 bits accurate, vexp2pd also flushes any denormalized results to zero, irrespective of the FTZ/DAZ bits set in the MXCSR.
To the second question: "How do I safely override the call to __svml_exp8?"
Your approach is generally not safe. SVML routines are internal to the Intel compiler and rely on custom calling conventions, so a generic routine with the same name can potentially clobber more registers than the library routine would, and you may end up with a hard-to-debug ABI mismatch.
A better way of providing your own vector functions is to use #pragma omp declare simd (e.g. see https://software.intel.com/en-us/node/524514) and possibly the vector_variant attribute if you prefer coding with intrinsics (see https://software.intel.com/en-us/node/523350). Just don't try to override the standard math names or you'll get an error.

Related

OpenMP Strange Behavior - differences in performance

I want to speed up image-processing code using OpenMP, and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result.
I'm not sure why the code with OpenMP is in some situations much slower than in others. For example, the function divideImageDataWithParam(), or the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which uses struct TSize as the parameter for the image data size. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why doesn't it change with different radius sizes in the program?
I created github repository for this little testing project:
https://github.com/Tb45/OpenMP-Strange-Behavior
Here are all results summarized:
https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt
I didn't find any explanation of why this is happening or what I am doing wrong.
Thanks for your help.
I'm working on a faster box filter and other image-processing algorithms.
typedef intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};

void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}

void divideImageDataWithParam_OpenMP(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}
Results of divideImageDataWithParam():
generateRandomImageData :: 3840x2160
numberOfIterations = 100
With Visual C++ 2019:
      32bit       64bit
  336.906ms   344.251ms  divideImageDataWithParam
 1832.120ms  6395.861ms  divideImageDataWithParam_OpenMP single-threaded parallel=false
  387.152ms  1204.302ms  divideImageDataWithParam_OpenMP multi-threaded  parallel=true
With Intel C++ 19:
      32bit       64bit
   15.162ms     8.927ms  divideImageDataWithParam
  266.646ms   294.134ms  divideImageDataWithParam_OpenMP single-threaded parallel=false
  239.564ms  1195.556ms  divideImageDataWithParam_OpenMP multi-threaded  parallel=true
Screenshot from Intel VTune Amplifier, where divideImageDataWithParam_OpenMP() with parallel=false spends most of its time on the mov instruction that stores to dst memory.
648trindade is right; it has to do with optimizations that cannot be done with OpenMP. But it's not loop unrolling or vectorization, it's inlining, which allows for a smart substitution.
Let me explain: integer divisions are incredibly slow (64-bit IDIV: ~40-100 cycles). So whenever possible, people (and compilers) try to avoid divisions. One trick you can use is to substitute a division with a multiplication and a shift. That only works if the divisor is known at compile time. This is the case here because your function divideImageDataWithParam is inlined and param is known. You can verify this by prepending it with __declspec(noinline); you will then get the timings that you expected.
The OpenMP parallelization does not allow this trick because the function cannot be inlined, so param is not known at compile time and an expensive IDIV instruction is generated.
Compiler output of divideImageDataWithParam (WIN10, MSVC2017, x64):
0x7ff67d151480 <+ 336> movzx ecx,byte ptr [r10+r8]
0x7ff67d151485 <+ 341> mov rax,r12
0x7ff67d151488 <+ 344> mul rax,rcx <------- multiply
0x7ff67d15148b <+ 347> shr rdx,3 <------- shift
0x7ff67d15148f <+ 351> mov byte ptr [r8],dl
0x7ff67d151492 <+ 354> lea r8,[r8+1]
0x7ff67d151496 <+ 358> sub r9,1
0x7ff67d15149a <+ 362> jne test!main+0x150 (00007ff6`7d151480)
And the openmp-version:
0x7ff67d151210 <+ 192> movzx eax,byte ptr [r10+rcx]
0x7ff67d151215 <+ 197> lea rcx,[rcx+1]
0x7ff67d151219 <+ 201> cqo
0x7ff67d15121b <+ 203> idiv rax,rbp <------- idiv
0x7ff67d15121e <+ 206> mov byte ptr [rcx-1],al
0x7ff67d151221 <+ 209> lea rax,[r8+rcx]
0x7ff67d151225 <+ 213> mov rdx,qword ptr [rbx]
0x7ff67d151228 <+ 216> cmp rax,rdx
0x7ff67d15122b <+ 219> jl test!divideImageDataWithParam$omp$1+0xc0 (00007ff6`7d151210)
Note 1) If you try the Compiler Explorer (https://godbolt.org/) you will see that some compilers do the substitution for the OpenMP version too.
Note 2) As soon as the parameter is not known at compile time, this optimization cannot be done anyway. So if you put your function into a library, it will be slow. I'd do something like precomputing the division for all 256 possible values and then doing a lookup. This is even faster because the lookup table fits into 4-5 cache lines and L1 latency is only 3-4 cycles.
void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    uint8_t tbl[256];
    for(int i = 0; i < 256; i++) {
        tbl[i] = i / param;
    }
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
        }
    }
}
Also thanks for the interesting question, I learned a thing or two along the way! ;-)
This behavior is explained by the use of compiler optimizations: when they are enabled, the sequential divideImageDataWithParam code is subjected to a series of optimizations (loop unrolling, vectorization, etc.) that the parallel divideImageDataWithParam_OpenMP code probably is not, since the parallel region is outlined into a separate function by the compiler.
If you compile this same code without optimizations, you will find that the runtime of the sequential version is very similar to that of the parallel version with only one thread.
The maximum speedup of the parallel version in this case is limited by how the original, unoptimized workload is divided. Such optimizations need to be written manually.

Adding a print statement speeds up code by an order of magnitude

I've hit a piece of extremely bizarre performance behavior in a piece of C/C++ code, as suggested in the title, which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp], gg[pp];
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = gg[ii] = 1.0;
    }
    for(int it = 0; it < 1000; it++) {
        cdbl dual[pp];
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = 0.0;
        }
        for(int h1 = 0; h1 < pp; h1 ++) {
            for(int h2 = 0; h2 < pp; h2 ++) {
                cdbl avg_right = 0.0;
                for(int xx = 0; xx < pp; xx ++) {
                    int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
                }
                avg_right /= static_cast<cdbl>(pp);
                for(int xx = 0; xx < pp; xx ++) {
                    int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
                }
            }
        }
        for(int ii = 0; ii < pp; ii++) {
            dual[ii] = conj(dual[ii]) / static_cast<double>(pp*pp);
        }
        for(int ii = 0; ii < pp; ii++) {
            gg[ii] = dual[ii];
        }
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
        printf("%.15lf\n", gg[0].real());
#else // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
    }
    printf("%.15lf\n", gg[0].real());
    return 0;
}
Here are the results of running this on my system:
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me#mine $ time ./test.elf > /dev/null
real 0m7.329s
user 0m7.328s
sys 0m0.000s
me#mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me#mine $ time ./test.elf > /dev/null
real 0m0.492s
user 0m0.490s
sys 0m0.001s
me#mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g, printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what on earth could be causing this. Worse, it actually affects my life because I'm running related code a lot, and going around inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp];
    cdbl blah = 0.0;
    for(int ii = 0; ii < pp; ii++) {
        ff[ii] = 1.0;
    }
    for(int it = 0; it < 1000; it++) {
        cdbl xx = 0.0;
        for(int kk = 0; kk < 100; kk++) {
            for(int ii = 0; ii < pp; ii++) {
                for(int jj = 0; jj < pp; jj++) {
                    xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
                }
            }
        }
        blah += xx;
        printf("%.15lf\n", blah.real());
    }
    printf("%.15lf\n", blah.real());
    return 0;
}
I could make it even smaller but already the machine code is manageable. If I change five bytes of the binary corresponding to the callq instruction for that first printf, to 0x90, the execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function — including some foolish ones I made up that do nothing — for the first time, puts the execution into slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly the first time __muldc3() is called we go into slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo; // 1
    char str[10];
    sprintf(str, "%c\n", 'c');
    //fflush(stdout); // 2
    for(int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of __muldc3 (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf, among other things it zeros out the upper bits of the YMM registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.

Process unaligned part of a double array, vectorize the rest

I am generating SSE/AVX instructions and currently I have to use unaligned loads and stores. I operate on a float/double array and I will never know whether it will be aligned or not. So before vectorizing it, I would like to have a pre-loop and possibly a post-loop which take care of the unaligned part. The main vectorized loop then operates on the aligned part.
But how do I determine when an array is aligned? Can I check the pointer value? When should the pre-loop stop and the post-loop start?
Here is my simple code example:
void func(double * in, double * out, unsigned int size){
    for( as long as in unaligned part ){
        out[i] = do_something_with_array(in[i])
    }
    for( as long as aligned ){
        awesome avx code that loads operates and stores 4 doubles
    }
    for( remaining part of array ){
        out[i] = do_something_with_array(in[i])
    }
}
Edit:
I have been thinking about it. Theoretically the pointer to the i'th element should be divisible by the vector alignment (something like &a[i] % 16 == 0 for SSE, or % 32 for AVX, with the element count per vector depending on whether it is float or double). So the first loop should cover the elements whose addresses are not so divisible.
Practically, I will try out the compiler pragmas and flags to see what the compiler produces. If no one gives a good answer, I will post my solution (if any) at the weekend.
Here is some example C code that does what you want
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17;
    int c = 1;
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    if(p2>p3) p2 = p3;
    printf("%p %p %p %p\n", p1, p2, p3, p1+n);
    double *t;
    for(t=p1; t<p2; t+=1) {
        printf("a %p %f\n", t, *t);
    }
    puts("");
    for(;t<p3; t+=SIMD_WIDTH) {
        printf("b %p ", t);
        for(int i=0; i<SIMD_WIDTH; i++) printf("%f ", *(t+i));
        puts("");
    }
    puts("");
    for(;t<p1+n; t+=1) {
        printf("c %p %f\n", t, *t);
    }
}
This generates a 32-byte aligned buffer but then offsets it by one double so it's no longer 32-byte aligned. It loops over scalar values until reaching 32-byte alignment, loops over the 32-byte aligned values, and then finishes with another scalar loop for any remaining values which are not a multiple of the SIMD width.
I would argue that this kind of optimization only really makes a lot of sense for Intel x86 processors before Nehalem. Since Nehalem the latency and throughput of unaligned loads and stores are the same as for the aligned loads and stores. Additionally, since Nehalem the costs of the cache line splits is small.
There is one subtle point with SSE since Nehalem in that unaligned loads and stores cannot fold with other operations. Therefore, aligned loads and stores are not obsolete with SSE since Nehalem. So in principle this optimization could make a difference even with Nehalem but in practice I think there are few cases where it will.
However, with AVX unaligned loads and stores can fold so the aligned loads and store instructions are obsolete.
I looked into this with GCC, MSVC, and Clang. If GCC cannot assume a pointer is aligned to e.g. 16 bytes with SSE, it will generate code similar to the code above to reach 16-byte alignment, to avoid the cache-line splits when vectorizing.
Clang and MSVC don't do this, so they would suffer from the cache-line splits. However, the cost of the additional code roughly makes up for the cost of the cache-line splits, which probably explains why Clang and MSVC don't worry about it.
The only exception is before Nehalem. In that case GCC is much faster than Clang and MSVC when the pointer is not aligned. If the pointer is aligned and Clang knows it, then it will use aligned loads and stores and be as fast as GCC. MSVC's vectorization still uses unaligned stores and loads and is therefore slow pre-Nehalem, even when a pointer is 16-byte aligned.
Here is a version which I think is a bit clearer using pointer differences
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17, c = 1;
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    int n1 = p2-p1, n2 = p3-p2;
    if(n1>n2) n1=n2;
    printf("%d %d %d\n", n1, n2, n);
    int i;
    for(i=0; i<n1; i++) {
        printf("a %p %f\n", &p1[i], p1[i]);
    }
    puts("");
    for(;i<n2; i+=SIMD_WIDTH) {
        printf("b %p ", &p1[i]);
        for(int j=0; j<SIMD_WIDTH; j++) printf("%f ", p1[i+j]);
        puts("");
    }
    puts("");
    for(;i<n; i++) {
        printf("c %p %f\n", &p1[i], p1[i]);
    }
}

Why is native c++ performing poorly with c++ interop?

I have posted some code below to test the performance (time in milliseconds) of calling a method in native C++ and in C# from C++/CLI, using Visual Studio 2010. I have a separate native C++ project which is compiled into a DLL. When I call into C++ from C++, I get the expected result, which is much faster (about 4x) than the managed counterparts. However, when I call into C++ from C++/CLI, the performance is 10x slower.
Is this expected behavior when calling into native C++ from C++/CLI? I was under the impression that there shouldn't be a significant difference, but this simple test shows otherwise. Could this be an optimization difference between the C++ and C++/CLI compilers?
Update
I made some updates to the cpp so that I'm not calling a method in a tight loop (as Reed Copsey pointed out), and it turns out that the difference in performance is insignificant or very small, depending on how the inter-operation is being done, of course.
.h
#ifndef CPPOBJECT_H
#define CPPOBJECT_H

#ifdef CPLUSPLUSOBJECT_EXPORTING
#define CLASS_DECLSPEC __declspec(dllexport)
#else
#define CLASS_DECLSPEC __declspec(dllimport)
#endif

class CLASS_DECLSPEC CPlusPlusObject
{
public:
    CPlusPlusObject(){}
    ~CPlusPlusObject(){}

    void sayHello();
    double getSqrt(double n);

    // Update
    double wasteSomeTimeWithSqrt(double n);
};

#endif
.cpp
#include "CPlusPlusObject.h"
#include <cmath> // needed for std::sqrt
#include <iostream>

void CPlusPlusObject::sayHello() {std::cout << "Hello";}

double CPlusPlusObject::getSqrt(double n) {return std::sqrt(n);}

double CPlusPlusObject::wasteSomeTimeWithSqrt(double n)
{
    double result = 0;
    for (int x = 0; x < 10000000; x++)
    {
        result += std::sqrt(n);
    }
    return result;
}
c++/cli
const unsigned set = 100;
const unsigned repetitions = 1000000;

double cppcliTocpp()
{
    double n = 0;
    System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
    stopWatch->Start();
    while (stopWatch->ElapsedMilliseconds < 1200){n+=0.001;}
    stopWatch->Reset();
    for (int x = 0; x < set; x++)
    {
        stopWatch->Start();
        CPlusPlusObject cplusplusObject;
        n += cplusplusObject.wasteSomeTimeWithSqrt(123.456);
        /*for (int i = 0; i < repetitions; i++)
        {
            n += cplusplusObject.getSqrt(123.456);
        }*/
        stopWatch->Stop();
        System::Console::WriteLine("c++/cli call to native c++ took " + stopWatch->ElapsedMilliseconds + "ms.");
        stopWatch->Reset();
    }
    return n;
}

double cppcliTocSharp()
{
    double n = 0;
    System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
    stopWatch->Start();
    while (stopWatch->ElapsedMilliseconds < 1200){n+=0.001;}
    stopWatch->Reset();
    for (int x = 0; x < set; x++)
    {
        stopWatch->Start();
        CSharp::CSharpObject^ cSharpObject = gcnew CSharp::CSharpObject();
        for (int i = 0; i < repetitions; i++)
        {
            n += cSharpObject->GetSqrt(123.456);
        }
        stopWatch->Stop();
        System::Console::WriteLine("c++/cli call to c# took " + stopWatch->ElapsedMilliseconds + "ms.");
        stopWatch->Reset();
    }
    return n;
}

double cppcli()
{
    double n = 0;
    System::Diagnostics::Stopwatch^ stopWatch = gcnew System::Diagnostics::Stopwatch();
    stopWatch->Start();
    while (stopWatch->ElapsedMilliseconds < 1200){n+=0.001;}
    stopWatch->Reset();
    for (int x = 0; x < set; x++)
    {
        stopWatch->Start();
        CPlusPlusCliObject cPlusPlusCliObject;
        for (int i = 0; i < repetitions; i++)
        {
            n += cPlusPlusCliObject.getSqrt(123.456);
        }
        stopWatch->Stop();
        System::Console::WriteLine("c++/cli took " + stopWatch->ElapsedMilliseconds + "ms.");
        stopWatch->Reset();
    }
    return n;
}

int main()
{
    double n = 0;
    n += cppcliTocpp();
    n += cppcliTocSharp();
    n += cppcli();
    System::Console::WriteLine(n);
    System::Console::ReadKey();
}
However, when I call into c++ from c++/cli, the performance is 10x slower.
Bridging the CLR and native code requires marshaling. There is always going to be some overhead in each method call when going from C++/CLI into a native method call.
The only reason the overhead (in this case) seems so large is that you're calling a very fast method in a tight loop. If you were to batch the calls, or call a method with a significantly longer runtime, you'd find that the overhead is quite small.
These micro-benchmarks are very dangerous. You made an effort to avoid the typical benchmark mistakes but still fell into a classic trap. Your intention was to measure method call overhead, but that's not what's actually happening. The jitter optimizer is capable of standard code optimization techniques, like code hoisting and method inlining. You can only really see that when you look at the generated machine code. Debug + Windows + Disassembly window.
I tested this with VS2012, 32-bit Release build with the jitter optimizer enabled. The C++/CLI code was the fastest, taking ~128 msec:
000000bf fld qword ptr ds:[01212078h]
000000c5 fsqrt
000000c7 fstp qword ptr [ebp-20h]
//
// stopWatch->Start() call elided...
//
n += cPlusPlusCliObject.getSqrt(123.456);
000000f5 fld qword ptr [ebp-20h]
000000f8 fadd qword ptr [ebp-14h]
000000fb fstp qword ptr [ebp-14h]
for (int i = 0; i < repetitions; i++)
000000fe dec eax
000000ff jne 000000F5
In other words, the std::sqrt() call got hoisted out of the loop and the inner loop simply performs adds from the generated value. No method call. Also note how it didn't actually measure the time needed for the sqrt() call :)
The loop with the C# method call was a bit slower, taking ~180 msec:
000000ea fld qword ptr ds:[01211EC0h]
000000f0 fsqrt
000000f2 fadd qword ptr [ebp-14h]
000000f5 fstp qword ptr [ebp-14h]
for (int i = 0; i < repetitions; i++)
000000f8 dec eax
000000f9 jne 000000EA
Just the inlined method call to Math::Sqrt(); it didn't get hoisted. I'm not actually sure why, since the optimizations performed by the jitter optimizer do have a time budget.
And I won't post the code for the interop call. But yes, it takes ~380 msec because an actual function call has to be made: unmanaged code cannot be inlined, plus there's the thunk required to prevent the garbage collector from blundering into the unmanaged stack frame. The thunk is pretty fast, a handful of nanoseconds, but it just can't compete with the jitter optimizer directly inlining the fadd or fsqrt.

Helping GCC with auto-vectorisation

I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.
I have some very simple sample code. With the USE_SSE define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.
Compiler and platform is: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32 and Windows 7 on Ivy Bridge.
Here's the test code:
/*
  Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif

#include <cstdio>

#if defined(__GNUG__) || defined(__clang__)
/* GCC & Clang */
#define SSVEC_FINLINE __attribute__((always_inline))
#elif defined(_WIN32) && defined(_MSC_VER)
/* MSVC. */
#define SSVEC_FINLINE __forceinline
#else
#error Unsupported platform.
#endif

#ifdef USE_SSE

typedef __m128 vec4f;

inline void addvec4f(vec4f &a, vec4f const &b)
{
    a = _mm_add_ps(a, b);
}

#else

typedef float vec4f[4];

inline void addvec4f(vec4f &a, vec4f const &b)
{
    a[0] = a[0] + b[0];
    a[1] = a[1] + b[1];
    a[2] = a[2] + b[2];
    a[3] = a[3] + b[3];
}

#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;

#ifdef USE_SSE
    printf("Using SSE.\n");
#else
    printf("Not using SSE.\n");
#endif

    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};
    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }

    float result[4] = {0};

#ifdef USE_SSE
    _mm_store_ps(result, data);
#else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
#endif

    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);
    return 0;
}
This is compiled with:
g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe
Apart from the explicit SSE version being a bit quicker, there is no difference in output.
Here's the assembly for the loop, first explicit SSE:
.L3:
subl $1, %eax
addps %xmm1, %xmm0
jne .L3
It inlined the call. Nice, more or less just a straight up _mm_add_ps.
Array version:
.L3:
subl $1, %eax
addss %xmm0, %xmm1
addss %xmm0, %xmm2
addss %xmm0, %xmm3
addss %xmm0, %xmm4
jne .L3
It is using SSE math all right, but with a scalar addss on each array member. Not really desirable.
My question is, how can I help GCC so that it can better optimise the array version of vec4f?
Any Linux specific tips is helpful too, that's where the real code will run.
This LockLess article on auto-vectorization with GCC 4.7 is hands down the best article I have ever seen, and I spent a while looking for good articles on similar topics. They also have many other articles on all manner of low-level software development that you may find useful.
Here are some tips, based on your code, to make GCC auto-vectorization work:
Make the loop upper bound a constant. To vectorize, GCC needs to split the loop into 4-iteration chunks to fit the 128-bit SSE XMM registers. A constant loop upper bound helps GCC prove that the loop has enough iterations and that vectorization is profitable.
Remove the inline keyword. If the code is marked inline, GCC cannot know whether the start of the array is aligned without inter-procedural analysis, which is not turned on by -O3.
So, to get your code vectorized, your addvec4f function should be modified as follows:
void addvec4f(vec4f &a, vec4f const &b)
{
    for(int i = 0; i < 4; i++)
        a[i] = a[i] + b[i];
}
BTW:
GCC also has flags to help you find out whether a loop has been vectorized: -ftree-vectorizer-verbose=2 (a higher number prints more information; currently the value can be 0, 1, or 2). Here is the documentation for this flag, and for some other related flags.
Be careful with alignment. The address of the array should be aligned, and the compiler cannot know whether it is without running the code. Usually there will be a bus error if the data is not aligned. Here is the reason.