Performance AVX/SSE assembly vs. intrinsics - c++

I'm just trying to check the optimum approach to optimizing some basic routines. In this case I tried a very simple example of multiplying two float vectors together:
void Mul(float *src1, float *src2, float *dst)
{
for (int i=0; i<cnt; i++) dst[i] = src1[i] * src2[i];
};
Plain C implementation is very slow. I did some external ASM using AVX and also tried using intrinsics. These are the test results (time, smaller is better):
ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0
(compiled using MSVC 2013, SSE2, tried Intel Compiler, results were pretty much the same)
As you can see, my ASM code beat even Intel Performance Primitives (probably because I did lots of branches to make sure I can use the aligned AVX instructions). But I'd personally like to use the intrinsics approach: it's simply easier to manage, and I was thinking the compiler should do the best job of optimizing all the branches and such (my ASM code is weak in that respect imho, yet it is faster). So here's the code using intrinsics:
int i;
for (i=0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];
if ((MINTEGER)(src1 + i) % 32 == 0)
{
if ((MINTEGER)(src2 + i) % 32 == 0)
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_load_ps( src1 + i);
__m256 y = _mm256_load_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
}
else
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_load_ps( src1 + i);
__m256 y = _mm256_loadu_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
};
}
else
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_loadu_ps( src1 + i);
__m256 y = _mm256_loadu_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
};
for (; i<cnt; i++) dst[i] = src1[i] * src2[i];
Simple: First get to an address where dst is aligned to 32 bytes, then branch to check which sources are aligned.
One problem is that the C++ loops at the beginning and at the end don't use AVX unless I enable AVX in the compiler, which I do NOT want, because this should be just an AVX specialization; the software should still work on platforms where AVX is not available. And sadly there seem to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler generates. However, even when I enabled AVX in the compiler, it still didn't get below 0.14.
Any ideas how to optimize this to make the intrinsics reach the speed of the ASM code?

Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function was called with arguments Mul(p, p, p+1)? You'll get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:
void Mul(float *src1, float *src2, float *__restrict__ dst)
or even better
void Mul(const float *src1, const float *src2, float *__restrict__ dst)
(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
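A minimal sketch of that signature change (note the keyword spelling differs between compilers: MSVC accepts __restrict, GCC/Clang accept __restrict__, and C has restrict since C99; cnt is assumed to be visible to the function, as in the original code):
// Sketch only - MSVC spelling shown, use __restrict__ with GCC/Clang.
void Mul(const float *src1, const float *src2, float * __restrict dst)
{
    for (int i = 0; i < cnt; i++)
        dst[i] = src1[i] * src2[i];   // compiler may now vectorize without overlap checks
}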

On CPUs with AVX there is very little penalty for using misaligned loads - I would suggest trading this small penalty off against all the extra logic you're using to check for alignment etc and just have a single loop + scalar code to handle any residual elements:
for (i = 0; i <= cnt - 8; i += 8)
{
__m256 x = _mm256_loadu_ps(src1 + i);
__m256 y = _mm256_loadu_ps(src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_storeu_ps(dst + i, z);
}
for ( ; i < cnt; i++)
{
dst[i] = src1[i] * src2[i];
}
Better still, make sure that your buffers are all 32 byte aligned in the first place and then just use aligned loads/stores.
Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD - execution time will be largely dominated by loads and stores - you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.
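For example, if the surrounding algorithm also needs to add a third array, fusing that into the same pass performs two arithmetic operations per trip over memory instead of one. This is a hypothetical sketch (src3 and MulAdd are illustrative, not part of the original code):
#include <immintrin.h>

void MulAdd(const float *src1, const float *src2, const float *src3, float *dst, int cnt)
{
    int i;
    for (i = 0; i <= cnt - 8; i += 8)
    {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        __m256 z = _mm256_loadu_ps(src3 + i);
        // two arithmetic ops amortized over the same loads and one store
        _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_mul_ps(x, y), z));
    }
    for (; i < cnt; i++)
        dst[i] = src1[i] * src2[i] + src3[i];
}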

Related

fastest way to initialize huge array of floats

I need to initialize every node of a tree with something like:
this->values=(float*) _aligned_malloc(mem * sizeof(float), 32);
this->frequencies =(float*) _aligned_malloc(mem * sizeof(float), 32);
where mem is rather big (~100k-1m), values are all 0 and frequencies == 1/numChildren (an arbitrary float per node).
The fastest (although by a small amount) was std::fill_n:
std::fill_n(this->values, mem, 0);
std::fill_n(this->frequencies , mem,1/(float)numchildren);
I thought using AVX intrinsics would've made it faster, something like:
float v = 1 / (float)numchildren;
__m256 nc = _mm256_set_ps(v, v, v, v, v, v, v, v);
__m256 z = _mm256_setzero_ps();
for (long i = 0; i < mem; i += 8)
{
_mm256_store_ps(this->frequencies + i, nc);
_mm256_store_ps(this->values + i, z);
}
This was actually a bit slower, and about as slow as the naive loop:
for (auto i = 0; i < mem; i++)
{
this->values[i] = 0;
this->frequencies[i] = 1 / (float)numchildren;
}
I assume the intrinsics may actually copy their arguments on each call. Since all the values are the same, I want to load them into one register just once and then store to different memory locations multiple times, and I don't think that's what's happening here.
From _aligned_malloc I assume you're on Windows.
On Windows, you can allocate big amounts of memory with VirtualAlloc; it will be page-aligned (4096 bytes) and already zeroed by the OS, which is likely faster than manual zeroing.
Note that VirtualAlloc is always a kernel call, but a huge _aligned_malloc is very likely to be a kernel call anyway.
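A rough sketch of that approach (error handling omitted; mem is the element count from the question):
#include <windows.h>

// VirtualAlloc returns page-aligned (4096-byte) memory that the OS has already zeroed,
// so no explicit fill is needed for the zero-initialized array.
float* values = (float*)VirtualAlloc(nullptr, mem * sizeof(float),
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
// ... use the buffer; frequencies would still have to be filled by hand ...
VirtualFree(values, 0, MEM_RELEASE);   // size must be 0 with MEM_RELEASE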

NEON increasing run time

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one. (The second one is three times as large.)
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large
for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
float l_weight_f32 = *l_ptrGauss_pf32;
*l_ptrLaplace_pf32 *= l_weight_f32;
++l_ptrLaplace_pf32;
*l_ptrLaplace_pf32 *= l_weight_f32;
++l_ptrLaplace_pf32;
*l_ptrLaplace_pf32 *= l_weight_f32;
++l_ptrLaplace_pf32;
++l_ptrGauss_pf32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer.
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;
for( uint64_t k=0; k<(l_lastPixelInBlock_ui64/4); ++k)
{
l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32+4);
l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32+8);
l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);
vst1q_f32(l_ptrLaplace_pf32, l_laplElem1_f32x4);
vst1q_f32(l_ptrLaplace_pf32+4, l_laplElem2_f32x4);
vst1q_f32(l_ptrLaplace_pf32+8, l_laplElem3_f32x4);
l_ptrLaplace_pf32 += 12;
l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?
Your code contains relatively many vector load operations and only a few multiplications, so I would recommend optimizing the vector loads. There are two steps:
Use aligned memory in your arrays.
Use prefetch.
In order to do this I would recommend using the following helper function:
inline float32x4_t Load(const float * p)
{
// use prefetch:
__builtin_prefetch(p + 256);
// tell compiler that address is aligned:
float * _p = (float *)__builtin_assume_aligned(p, 16);
return vld1q_f32(_p);
}
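Applied to the loop from the question, that might look roughly like this (a sketch only; it assumes both buffers come from 16-byte-aligned allocations, and the loop-count parameter name is illustrative):
#include <arm_neon.h>
#include <cstdint>

void MulBlock(float* l_ptrLaplace_pf32, const float* l_ptrGauss_pf32, uint64_t l_numQuads_ui64)
{
    for (uint64_t k = 0; k < l_numQuads_ui64; ++k)
    {
        // aligned, prefetching loads via the Load() helper above
        float32x4_t g  = Load(l_ptrGauss_pf32);
        float32x4_t l1 = Load(l_ptrLaplace_pf32);
        float32x4_t l2 = Load(l_ptrLaplace_pf32 + 4);
        float32x4_t l3 = Load(l_ptrLaplace_pf32 + 8);
        vst1q_f32(l_ptrLaplace_pf32,     vmulq_f32(g, l1));
        vst1q_f32(l_ptrLaplace_pf32 + 4, vmulq_f32(g, l2));
        vst1q_f32(l_ptrLaplace_pf32 + 8, vmulq_f32(g, l3));
        l_ptrLaplace_pf32 += 12;
        l_ptrGauss_pf32   += 4;
    }
}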

Entrywise addition of two double arrays using AVX

I need a function to entrywise add the elements of two double arrays and store the result in a third array. Currently I use (simplified)
void add( double* result, const double* a, const double* b, size_t size) {
memcpy(result, a, size*sizeof(double));
for(size_t i = 0; i < size; ++i) {
result[i] += b[i];
}
}
As far as I know the memcpy function uses AVX. In order to improve the performance I would like to also enforce AVX use for the addition. This should be one of the most basic examples for AVX; however, I couldn't find any description of how to do this in C/C++. I would like to avoid the use of external libraries if possible.
You'll need something like this, assuming AVX-512:
void add( double* result, const double* a, const double* b, size_t size)
{
size_t i = 0;
// Note we are doing as many blocks of 8 as we can. If the size is not divisible by 8
// then we will have some left over that will then be performed serially.
// AVX-512 loop
for( ; i < (size & ~0x7); i += 8)
{
const __m512d kA8 = _mm512_load_pd( &a[i] );
const __m512d kB8 = _mm512_load_pd( &b[i] );
const __m512d kRes = _mm512_add_pd( kA8, kB8 );
_mm512_stream_pd( &result[i], kRes );
}
// AVX loop
for ( ; i < (size & ~0x3); i += 4 )
{
const __m256d kA4 = _mm256_load_pd( &a[i] );
const __m256d kB4 = _mm256_load_pd( &b[i] );
const __m256d kRes = _mm256_add_pd( kA4, kB4 );
_mm256_stream_pd( &result[i], kRes );
}
// SSE2 loop
for ( ; i < (size & ~0x1); i += 2 )
{
const __m128d kA2 = _mm_load_pd( &a[i] );
const __m128d kB2 = _mm_load_pd( &b[i] );
const __m128d kRes = _mm_add_pd( kA2, kB2 );
_mm_stream_pd( &result[i], kRes );
}
// Serial loop
for( ; i < size; i++ )
{
result[i] = a[i] + b[i];
}
}
(Though be warned I've just thrown that together off the top of my head).
Something to note from the above code is that I essentially process the remaining values using the next-best parallel code path. Primarily this is to illustrate the three possible ways you could do it in parallel. The loops will work perfectly well on their own: for example, if you can't support AVX-512 you'd jump straight to the AVX loop, and if you can't support AVX either you'd jump straight to the SSE2 loop, which would still be the most performant loop your hardware supports.
For best performance your arrays should be aligned to the relevant size used in the load. So for AVX-512 you would want 512-bit or 64 byte alignment. For AVX, 256-bit or 32 byte alignment. For SSE2, 128-bit or 16 byte alignment. If you use 64 byte alignment for all your arrays then you will always have good alignment, though you may want to go for 128 byte alignment to ease moving over to AVX-1024 when that appears ;)
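For example, with the MSVC runtime the buffers could be allocated with 64-byte alignment up front (an illustrative sketch; n is an assumed element count):
#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC CRT)

// 64-byte alignment satisfies SSE2 (16), AVX (32) and AVX-512 (64) at once.
double* a      = (double*)_aligned_malloc(n * sizeof(double), 64);
double* b      = (double*)_aligned_malloc(n * sizeof(double), 64);
double* result = (double*)_aligned_malloc(n * sizeof(double), 64);
// ... add(result, a, b, n); ...
_aligned_free(a);
_aligned_free(b);
_aligned_free(result);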

Why is this slower than memcmp

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128 x = _mm_load_ps((float*)(a + i));
__m128 y = _mm_load_ps((float*)(b + i));
__m128 cmp = _mm_cmpeq_ps(x, y);
if (_mm_movemask_ps(cmp) != 15) return i;
}
return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128i x = _mm_load_si128((__m128i*)(a + i));
__m128i y = _mm_load_si128((__m128i*)(b + i));
__m128i cmp = _mm_cmpeq_epi32(x, y);
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
}
return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is: what other tricks does the MS implementation have up its sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /O2 and all the optimization flags.
I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for(int i = 0; i < count; i+=2)
{
__m128i cmp = _mm_cmpeq_epi32(x, y);
__m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
__m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
cmp = _mm_cmpeq_epi32(x1, y1);
x = _mm_load_si128((__m128i*)(a + i + 2));
y = _mm_load_si128((__m128i*)(b + i + 2));
if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Roughly something like that.
You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function. It starts with some sanity checks and then checks whether the pointers are aligned:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned, it compares byte by byte until it reaches an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
if(*a++ != *b++) return -1;
--len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....
I cannot help you directly because I'm using a Mac, but there's an easy way to figure out what happens:
Just step into memcmp in debug mode and switch to the Disassembly view. As memcmp is a simple little function, you will easily figure out all the implementation tricks.

weird performance in C++ (VC 2010)

I have this loop written in C++ that, compiled with MSVC 2010, takes a long time to run (300 ms).
for (int i=0; i<h; i++) {
for (int j=0; j<w; j++) {
if (buf[i*w+j] > 0) {
const int sy = max(0, i - hr);
const int ey = min(h, i + hr + 1);
const int sx = max(0, j - hr);
const int ex = min(w, j + hr + 1);
float val = 0;
for (int k=sy; k < ey; k++) {
for (int m=sx; m < ex; m++) {
val += original[k*w + m] * ds[k - i + hr][m - j + hr];
}
}
heat_map[i*w + j] = val;
}
}
}
It seemed a bit strange to me, so I did some tests and then changed a few bits to inline assembly (specifically, the code that sums val):
for (int i=0; i<h; i++) {
for (int j=0; j<w; j++) {
if (buf[i*w+j] > 0) {
const int sy = max(0, i - hr);
const int ey = min(h, i + hr + 1);
const int sx = max(0, j - hr);
const int ex = min(w, j + hr + 1);
__asm {
fldz
}
for (int k=sy; k < ey; k++) {
for (int m=sx; m < ex; m++) {
float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
__asm {
fld val
fadd
}
}
}
float val1;
__asm {
fstp val1
}
heat_map[i*w + j] = val1;
}
}
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?
I suggest you try different floating-point calculation models supported by the compiler - precise, strict or fast (see /fp option) - with your original code before making any conclusions. I suspect that your original code was compiled with some overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at the first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with precise model, the intermediate results would have to be rounded to the precision of float type, even if the variable val was optimized away and the internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which is what could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast mode and see how their performances compare in that case.
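For example, with MSVC you could build the same source both ways and compare (the file name is illustrative):
cl /O2 /fp:precise heatmap.cpp
cl /O2 /fp:fast heatmap.cpp
With /fp:fast the compiler is allowed to keep the running sum in a register without rounding it to float precision after every addition, which is essentially what your hand-written FPU code does.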
A few things to check out:
You need to check that it actually is the same code. As in, are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val to zero, the second is the extra variable val1 (unlikely to matter since it will most likely just change the constant subtracted from the stack pointer), and the third is that your inline assembly version may not put the interim results back into val.
You need to make sure your sample space is large. You didn't mention whether you did only one run of each version or a hundred runs, but the more runs the better, so as to remove the effect of "noise" from your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.
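On Windows, one way to get CPU time rather than wall-clock time is GetProcessTimes (a sketch; error handling omitted):
#include <windows.h>

// Returns the user-mode CPU time this process has consumed so far, in seconds.
double ProcessUserSeconds()
{
    FILETIME creationTime, exitTime, kernelTime, userTime;
    GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime);
    ULARGE_INTEGER t;
    t.LowPart  = userTime.dwLowDateTime;
    t.HighPart = userTime.dwHighDateTime;
    return t.QuadPart * 100e-9;   // FILETIME counts 100-nanosecond intervals
}
Call it before and after the code under test and subtract the two values.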