the code doesn't speed up while using Intel Intrinsics - c++

I am using intrinsics to accelerate some OpenCV code. But after I replaced the code with intrinsics, the runtime is almost the same, or maybe even worse. I cannot figure out why this is happening. I have been searching for an answer for quite a long time, but nothing has changed. I would appreciate it if someone could help me out. Thank you very much! Here is my code:
// If useSSE is true, the intrinsics path runs and takes 1.45 ms on my computer;
// if not, the plain loop runs and takes the same time.
cv::Mat_<float> results(shape.rows, 2);
if (useSSE) {
    float* pshape = (float*)shape.data;     // operand missing in the post; shape.data assumed
    results = shape.clone();
    float* presults = (float*)results.data; // operand missing in the post; results.data assumed
    // use SSE
    __m128 xyxy_center = _mm_set_ps(bbox.center_y, bbox.center_x, bbox.center_y, bbox.center_x);
    float bbox_width = bbox.width/2;
    float bbox_height = bbox.height/2;
    __m128 xyxy_size = _mm_set_ps(bbox_height, bbox_width, bbox_height, bbox_width);
    gettimeofday(&start, NULL); // this is for counting time
    int shape_size = shape.rows*shape.cols;
    for (int i=0; i<shape_size; i +=4) {
        __m128 a = _mm_loadu_ps(pshape+i);
        __m128 result = _mm_div_ps(_mm_sub_ps(a, xyxy_center), xyxy_size);
        _mm_storeu_ps(presults+i, result);
    }
} else {
    for (int i = 0; i < shape.rows; i++) {
        results(i, 0) = (shape(i, 0) - bbox.center_x) / (bbox.width / 2.0);
        results(i, 1) = (shape(i, 1) - bbox.center_y) / (bbox.height / 2.0);
    }
}
gettimeofday(&end, NULL);
diff = 1000000*(end.tv_sec-start.tv_sec)+end.tv_usec-start.tv_usec;
return results;

Your SSE loop will corrupt memory next to the results buffer if shape.rows % 2 == 1: with two columns the element count is then not a multiple of 4, so the last 4-float store runs past the end.
Try to avoid indexing with i in the loop and use the pointers directly. The compiler may optimize away the extra addition, or it may not.
Use multiplication instead of division:
float bbox_width_inv = 2.f/bbox.width;
float bbox_height_inv = 2.f/bbox.height;
__m128 xyxy_size_inv = _mm_set_ps(bbox_height_inv, bbox_width_inv, bbox_height_inv, bbox_width_inv);
float* p_shape_end = pshape + shape.rows*shape.cols;
float* p_shape_end_batch = pshape + ((shape.rows*shape.cols) & ~3); // end of the last full group of 4
for (; pshape < p_shape_end_batch; pshape += 4, presults += 4) {
    __m128 a = _mm_loadu_ps(pshape);
    __m128 result = _mm_mul_ps(_mm_sub_ps(a, xyxy_center), xyxy_size_inv);
    _mm_storeu_ps(presults, result);
}
while (pshape < p_shape_end) { // scalar tail, one (x, y) pair at a time
    *presults++ = (*pshape++ - bbox.center_x) * bbox_width_inv;
    *presults++ = (*pshape++ - bbox.center_y) * bbox_height_inv;
}
Disassemble the code generated from the intrinsics and make sure there are enough registers for your operations, so that temporary results are not spilled to RAM.
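For reference, the two fixes above (multiply by a precomputed reciprocal, and finish the leftover elements in a scalar tail) can be put together in a self-contained sketch. The function name normalize_xy and its parameters are made up for illustration, the data is assumed to be interleaved (x, y) floats, and the SSE path is only compiled when the compiler reports SSE2 support:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define NORMALIZE_HAVE_SSE2 1
#endif

// Hypothetical helper: maps interleaved (x, y) floats in src to
// (v - center) / half_size and writes them to dst. n is the float count.
void normalize_xy(const float* src, float* dst, std::size_t n,
                  float cx, float cy, float half_w, float half_h) {
    assert(n % 2 == 0);  // assumption: complete (x, y) pairs only
    const float inv_w = 1.0f / half_w;  // divide once, multiply in the loop
    const float inv_h = 1.0f / half_h;
    std::size_t i = 0;
#ifdef NORMALIZE_HAVE_SSE2
    const __m128 center = _mm_set_ps(cy, cx, cy, cx);  // lanes 0..3: x, y, x, y
    const __m128 inv    = _mm_set_ps(inv_h, inv_w, inv_h, inv_w);
    const std::size_t n4 = n & ~static_cast<std::size_t>(3);  // largest multiple of 4
    for (; i < n4; i += 4) {
        const __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_sub_ps(v, center), inv));
    }
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i + 1 < n; i += 2) {
        dst[i]     = (src[i]     - cx) * inv_w;
        dst[i + 1] = (src[i + 1] - cy) * inv_h;
    }
}
```

With five points (ten floats) the vector loop handles the first eight floats and the tail handles the last pair, so nothing is written past the end of dst.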


How can I force the compiler to make critical variables in a register?

My self-set task was to experiment with optimising the ReLu activation function (for neural networks) where the function would activate an entire layer at a time, and rely on SIMD vectorisation and loop unrolling to get the job done much faster - with success! I've seen a consistent 4x performance increase from the standard c++ way of doing this task (that I could think of anyway.)
I'm still curious about pushing it faster and faster - and I've been looking at the disassembly of the program in Release mode, x64, msvc compiler, with all the optimisations and instruction sets on the highest setting - so in theory the compiler should be storing commonly used variables in the registers, however it refuses to do so - they're always in memory. This seems to me as highly inefficient, and one final bottleneck. I've been using intrinsics (because no inline assembly in x64 - if that was allowed, I could bypass this infuriating compiler issue entirely), and here is the code:
The global bitmask is just the MSB of a 64-bit value set to 1, and all else as zero - it's a mask to check if a value is negative - the ReLu is max(in, 0.0) - but I found that to be far slower, so I've been creating a mask, that is set to true when the value is non-negative (andNOT is my instruction) and then using a maskload, which zeros the destination out if the mask isn't true for that element. Likewise, I have the global ones and zeros vectors as pre-stored values, that I would hope would be stored in registers, as they are used often and would be more efficiently placed in the ymm registers, but instead are being put on the stack.
I have experimented with creating the bitmask and the ones vector as local const variables, in the hope that this would force them into registers, but there is no effect: they are put in a register for the first instruction, and then sent out to memory immediately afterwards.
The two loops for the activation and the derivative are separated, as they use different intrinsics - I've found that the max intrinsic is far faster than a maskload - but I can't see any other way to compute the derivative (1.0 if in > 0.0, else 0.0) with a faster intrinsic. I've tested it - counter-intuitively, the separation of the loops makes it far faster.
void ReluTesters::reluCompBothV3(double* in, double* out, int args) {
    int unrolled = args / 16;
    __m256d* inPtr = (__m256d*) in;
    __m256d* inPtr1 = (__m256d*) in + 1;
    __m256d* inPtr2 = (__m256d*) in + 2;
    __m256d* inPtr3 = (__m256d*) in + 3;
    __m256d* actOutPtr = (__m256d*) out;
    __m256d* actOutPtr1 = (__m256d*) out + 1;
    __m256d* actOutPtr2 = (__m256d*) out + 2;
    __m256d* actOutPtr3 = (__m256d*) out + 3;
    __m256d* delOutPtr = (__m256d*) (out + args);
    __m256d* delOutPtr1 = (__m256d*) (out + args + 4);
    __m256d* delOutPtr2 = (__m256d*) (out + args + 8);
    __m256d* delOutPtr3 = (__m256d*) (out + args + 12);
    const __m256d zeros = reluGlobalZeroVectorAVX;
    // Activation pass: max(in, 0.0), 16 doubles per iteration.
    for (int i = 0; i < unrolled; ++i) {
        *actOutPtr = _mm256_max_pd(*inPtr, zeros);
        *actOutPtr1 = _mm256_max_pd(*inPtr1, zeros);
        *actOutPtr2 = _mm256_max_pd(*inPtr2, zeros);
        *actOutPtr3 = _mm256_max_pd(*inPtr3, zeros);
        inPtr += 4;
        inPtr1 += 4;
        inPtr2 += 4;
        inPtr3 += 4;
        actOutPtr += 4;
        actOutPtr1 += 4;
        actOutPtr2 += 4;
        actOutPtr3 += 4;
    }
    // Rewind the input pointers for the derivative pass.
    inPtr = (__m256d*) in;
    inPtr1 = (__m256d*) in + 1;
    inPtr2 = (__m256d*) in + 2;
    inPtr3 = (__m256d*) in + 3;
    const __m256d ones = reluGlobalOneVectorAVX;
    const __m256i bitmask = reluGlobalBitmask;
    // Derivative pass: 1.0 where the sign bit is clear, else 0.0.
    for (int i = 0; i < unrolled ; ++i) {
        __m256i mask = _mm256_andnot_si256(*(__m256i*)inPtr, bitmask);
        __m256i mask1 = _mm256_andnot_si256(*(__m256i*)inPtr1, bitmask);
        __m256i mask2 = _mm256_andnot_si256(*(__m256i*)inPtr2, bitmask);
        __m256i mask3 = _mm256_andnot_si256(*(__m256i*)inPtr3, bitmask);
        *delOutPtr = _mm256_maskload_pd((double*)&ones, mask);
        *delOutPtr1 = _mm256_maskload_pd((double*)&ones, mask1);
        *delOutPtr2 = _mm256_maskload_pd((double*)&ones, mask2);
        *delOutPtr3 = _mm256_maskload_pd((double*)&ones, mask3);
        // Advance the inputs as well, so the tail loop below starts at the right element.
        inPtr += 4;
        inPtr1 += 4;
        inPtr2 += 4;
        inPtr3 += 4;
        delOutPtr += 4;
        delOutPtr1 += 4;
        delOutPtr2 += 4;
        delOutPtr3 += 4;
    }
    // Scalar tail for the remaining args % 16 elements.
    double* inD = (double*)inPtr;
    double* actOutD = (double*)actOutPtr;
    double* delOutD = (double*)delOutPtr;
    for (int i = inD - in; i < args; ++i, ++inD, ++actOutD, ++delOutD) {
        if ((*((long long*)inD)) & (*((long long*)&bitmask))) {
            *actOutD = 0.0;
            *delOutD = 0.0;
        }
        else {
            *actOutD = *inD;
            *delOutD = 1.0;
        }
    }
}
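One alternative for the derivative rule stated above (1.0 if in > 0.0, else 0.0) is a compare followed by a bitwise AND, which needs neither a maskload nor the ones vector in memory. The sketch below is not taken from the post: the name reluDerivative is hypothetical, and it uses 128-bit SSE2 so that it builds without extra compiler flags. The 256-bit analogue would be _mm256_cmp_pd with _CMP_GT_OQ followed by _mm256_and_pd. Whether this is actually faster than the maskload version would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define RELU_HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = 1.0 where in[i] > 0.0, else 0.0.
void reluDerivative(const double* in, double* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    std::size_t i = 0;
#ifdef RELU_HAVE_SSE2
    const __m128d zeros = _mm_setzero_pd();
    const __m128d ones  = _mm_set1_pd(1.0);
    for (; i + 2 <= n; i += 2) {
        const __m128d x    = _mm_loadu_pd(in + i);
        const __m128d mask = _mm_cmpgt_pd(x, zeros);     // all-ones bits where x > 0.0
        _mm_storeu_pd(out + i, _mm_and_pd(mask, ones));  // keeps 1.0 or yields 0.0
    }
#endif
    for (; i < n; ++i) out[i] = in[i] > 0.0 ? 1.0 : 0.0;  // scalar tail
}
```

Note one difference from the sign-bit test in the post: an input of exactly +0.0 gives 0.0 here, whereas the bitmask version gives 1.0.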

SIMD -> uint16_t array to float array work on float then back to uint16_t

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions.
The code using for loops is
int idx;
uint16_t *A, *B, *C;
float gAlpha = 0.8;
float alpha = 0.2;
for (size_t rw = 0; rw < height; rw++) {
    for (size_t cl = 0; cl < width; cl++) {
        idx = rw * width + cl;
        C[idx] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[idx]) + alpha * static_cast<float>(B[idx]));
    }
}
This loop is probably not perfect, but it does its job and my unit test gives me the expected results.
As I said, I am trying to convert these loops to SIMD intrinsics. This is my working code and, as you will see, it is not very pretty... We have access to intrinsics up to AVX2.
size_t n_pixels = height * width;
for (size_t px = 0; px < n_pixels; px += 8) {
    // A: 8 x uint16_t -> 8 x float
    __m128i xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
    __m128i xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
    __m128 ylo = _mm_cvtepi32_ps(xlo);
    __m128 yhi = _mm_cvtepi32_ps(xhi);
    __m256 pxMinFl = _mm256_castps128_ps256(ylo);
    pxMinFl = _mm256_insertf128_ps(pxMinFl, yhi, 1);
    // B: 8 x uint16_t -> 8 x float
    xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
    xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
    ylo = _mm_cvtepi32_ps(xlo);
    yhi = _mm_cvtepi32_ps(xhi);
    __m256 pxMaxFl = _mm256_castps128_ps256(ylo);
    pxMaxFl = _mm256_insertf128_ps(pxMaxFl, yhi, 1);
    // weighted sum
    __m256 avGain1 = _mm256_set1_ps(gAlpha);
    __m256 avGain2 = _mm256_set1_ps(alpha);
    __m256 prodUp = _mm256_mul_ps(pxMinFl, avGain1);
    __m256 prodBt = _mm256_mul_ps(pxMaxFl, avGain2);
    __m256 pxOutFl = _mm256_add_ps(prodUp, prodBt);
    // back to integers
    __m128 ylo_ps = _mm256_castps256_ps128(pxOutFl);
    __m128 yhi_ps = _mm256_extractf128_ps(pxOutFl, 1);
    __m128i xlo_ep = _mm_cvtps_epi32(ylo_ps);
    __m128i xhi_ep = _mm_cvtps_epi32(yhi_ps);        // <- POINT 1
    int* xl = reinterpret_cast<int*>(&xlo_ep);       // <- POINT 2
    for (int i=0; i < 8; i++) {                      // <- POINT 2
        C[px + i] = static_cast<uint16_t>(xl[i]);    // <- POINT 2
    }
}
There are probably tons of optimizations that could be done on this code, but I have checked that the output of pxOutFl corresponds to the expected value. Where it starts to look like black magic to me is how I had to save the data back into the output array C. First of all, the code doesn't work if I comment out the line at POINT 1 even though, as you can see, I don't use the variable. Secondly, I would have guessed that there is a better solution than the trick I used to store the data back into the uint16_t array (POINT 2), but I can't find one that works.
Could someone point me in the right direction? What am I missing? How could I improve this code?
Thanks in advance!
PS: We use the Intel compiler 2017 for the parallel studio professional edition 2117 on Linux (Fedora 25).
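You can rewrite all of POINT 2 by packing the two vectors of 32-bit results into one vector of eight uint16_t values (_mm_packus_epi32, SSE4.1) and storing that:
_mm_storeu_si128((__m128i *)&C[px], _mm_packus_epi32(xlo_ep, xhi_ep));
This is most likely also why POINT 1 matters: xl[4] to xl[7] read past the end of xlo_ep, which only appears to work when xhi_ep happens to sit directly behind it in memory.
Also note that all instances of _mm_load_si128 should probably be _mm_loadu_si128, since you don't seem to be guaranteeing alignment anywhere.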
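As a rough sketch of the whole round trip (uint16_t to float, weighted sum, back to uint16_t), here is a 128-bit variant that avoids the pointer cast entirely. The name blendU16 and its signature are invented for this example. It uses _mm_cvttps_epi32, which truncates like static_cast, rather than _mm_cvtps_epi32 from the question, which rounds according to the current rounding mode. The vector path is only compiled when SSE4.1 is enabled; otherwise the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

// Hypothetical helper: out[i] = uint16(gA * a[i] + gB * b[i]), truncating.
void blendU16(const std::uint16_t* a, const std::uint16_t* b, std::uint16_t* out,
              std::size_t n, float gA, float gB) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__SSE4_1__)
    const __m128 vA = _mm_set1_ps(gA);
    const __m128 vB = _mm_set1_ps(gB);
    const __m128i zero = _mm_setzero_si128();
    for (; i + 8 <= n; i += 8) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Zero-extend 8 x u16 into two vectors of 4 x i32, then convert to float.
        const __m128 aLo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(va, zero));
        const __m128 aHi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(va, zero));
        const __m128 bLo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(vb, zero));
        const __m128 bHi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(vb, zero));
        const __m128 rLo = _mm_add_ps(_mm_mul_ps(aLo, vA), _mm_mul_ps(bLo, vB));
        const __m128 rHi = _mm_add_ps(_mm_mul_ps(aHi, vA), _mm_mul_ps(bHi, vB));
        // Truncate to i32 and pack both halves into 8 x u16 with unsigned saturation.
        const __m128i packed = _mm_packus_epi32(_mm_cvttps_epi32(rLo), _mm_cvttps_epi32(rHi));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), packed);
    }
#endif
    for (; i < n; ++i)  // scalar tail, also the fallback without SSE4.1
        out[i] = static_cast<std::uint16_t>(gA * a[i] + gB * b[i]);
}
```

The loop bound i + 8 <= n keeps the vector path inside the arrays even when the pixel count is not a multiple of 8.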

NEON increasing run time

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one. (The second one is three times as large.)
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large
for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
    float l_weight_f32 = *l_ptrGauss_pf32++;
    // scale three consecutive values with the same weight
    *l_ptrLaplace_pf32++ *= l_weight_f32;
    *l_ptrLaplace_pf32++ *= l_weight_f32;
    *l_ptrLaplace_pf32++ *= l_weight_f32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer.
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;
for( uint64_t k=0; k<(l_lastPixelInBlock_ui64/4); ++k)
{
    l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
    l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
    l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32+4);
    l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32+8);
    l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
    l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
    l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);
    vst1q_f32(l_ptrLaplace_pf32, l_laplElem1_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+4, l_laplElem2_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+8, l_laplElem3_f32x4);
    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?
Your code contains relatively many vector loads and only a few multiplications, so I would recommend optimizing the loading of vectors. There are two steps:
Use aligned memory for your arrays.
Use prefetch.
To do this I would recommend the following function:
inline float32x4_t Load(const float * p)
{
    // use prefetch:
    __builtin_prefetch(p + 256);
    // tell compiler that address is aligned:
    float * _p = (float *)__builtin_assume_aligned(p, 16);
    return vld1q_f32(_p);
}
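The same two hints can be tried on the plain loop first, without any NEON intrinsics. The sketch below is only an illustration: scaleTriples is a made-up name, both arrays are assumed to be 16-byte aligned, the prefetch distance of 256 floats is a guess that needs tuning on the target device, and the builtins are only used with GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: multiplies each weight with three consecutive values.
void scaleTriples(const float* weights, float* values, std::size_t numPixels) {
    // Assumption: the caller allocated both arrays with 16-byte alignment.
    assert(reinterpret_cast<std::uintptr_t>(weights) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(values) % 16 == 0);
#if defined(__GNUC__) || defined(__clang__)
    const float* w = static_cast<const float*>(__builtin_assume_aligned(weights, 16));
    float* v = static_cast<float*>(__builtin_assume_aligned(values, 16));
#else
    const float* w = weights;
    float* v = values;
#endif
    const std::size_t total = 3 * numPixels;
    for (std::size_t k = 0; k < numPixels; ++k) {
#if defined(__GNUC__) || defined(__clang__)
        // Prefetch is only a hint; skip it near the end of the array.
        if (3 * k + 256 < total) __builtin_prefetch(v + 3 * k + 256);
#endif
        const float weight = w[k];
        v[3 * k + 0] *= weight;
        v[3 * k + 1] *= weight;
        v[3 * k + 2] *= weight;
    }
}
```

Comparing this against the intrinsics version under the same compiler flags is one way to see how much of the work the auto-vectorizer already does.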

Performance AVX/SSE assembly vs. intrinsics

I'm just trying to check the optimum approach to optimizing some basic routines. In this case I tried a very simple example of multiplying 2 float vectors together:
void Mul(float *src1, float *src2, float *dst)
{
    for (int i=0; i<cnt; i++) dst[i] = src1[i] * src2[i];
}
Plain C implementation is very slow. I did some external ASM using AVX and also tried using intrinsics. These are the test results (time, smaller is better):
ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0
(compiled using MSVC 2013, SSE2, tried Intel Compiler, results were pretty much the same)
As you can see, my ASM code beat even Intel Performance Primitives (probably because I added lots of branches to ensure I can use the AVX aligned instructions). But I'd personally like to use the intrinsic approach; it's simply easier to manage, and I was thinking the compiler should do the best job optimizing all the branches and stuff (my ASM code sucks in that respect imho, yet it is faster). So here's the code using intrinsics:
int i;
// scalar prologue until dst is 32-byte aligned
for (i=0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];
if ((MINTEGER)(src1 + i) % 32 == 0)
{
    if ((MINTEGER)(src2 + i) % 32 == 0)
    {
        for (; i<cnt-8; i+=8)
        {
            __m256 x = _mm256_load_ps( src1 + i);
            __m256 y = _mm256_load_ps( src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }
    else
    {
        for (; i<cnt-8; i+=8)
        {
            __m256 x = _mm256_load_ps( src1 + i);
            __m256 y = _mm256_loadu_ps( src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }
}
else
{
    for (; i<cnt-8; i+=8)
    {
        __m256 x = _mm256_loadu_ps( src1 + i);
        __m256 y = _mm256_loadu_ps( src2 + i);
        __m256 z = _mm256_mul_ps(x, y);
        _mm256_store_ps(dst + i, z);
    }
}
// scalar epilogue
for (; i<cnt; i++) dst[i] = src1[i] * src2[i];
Simple: First get to an address where dst is aligned to 32 bytes, then branch to check which sources are aligned.
One problem is that the C++ loops at the beginning and at the end do not use AVX unless I enable AVX in the compiler, which I do NOT want, because this should be just an AVX specialization, and the software should still work on a platform where AVX is not available. And sadly there seem to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler generates. However, even when I enabled AVX in the compiler, it still didn't get below 0.14.
Any ideas how to optimize this so that the intrinsics reach the speed of the ASM code?
Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function was called with arguments Mul(p, p, p+1)? You'll get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:
void Mul(float *src1, float *src2, float *__restrict__ dst)
or even better
void Mul(const float *src1, const float *src2, float *__restrict__ dst)
(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
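To make the Mul(p, p, p+1) case concrete, here is a small plain C++ sketch. Both function names are hypothetical; mulBlocked only imitates what a 4-wide SIMD loop does (load a whole block, then store it), so no intrinsics are needed to see the difference.

```cpp
#include <cassert>
#include <cstddef>

// Element by element, exactly as the plain C loop is written.
void mulScalar(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    for (std::size_t i = 0; i < cnt; ++i) dst[i] = src1[i] * src2[i];
}

// Imitates a 4-wide SIMD loop: all four products are computed from the old
// contents before any of them is written back.
void mulBlocked(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    assert(cnt % 4 == 0);  // assumption: no tail handling in this sketch
    for (std::size_t i = 0; i < cnt; i += 4) {
        float tmp[4];
        for (std::size_t k = 0; k < 4; ++k) tmp[k] = src1[i + k] * src2[i + k];
        for (std::size_t k = 0; k < 4; ++k) dst[i + k] = tmp[k];
    }
}
```

Starting from p = {2, 3, 4, 5, 6} and cnt = 4, mulScalar(p, p, p + 1, 4) leaves {2, 4, 16, 256, 65536} because every store feeds the next load, while mulBlocked leaves {2, 4, 9, 16, 25}. Without a no-overlap guarantee the compiler has to preserve the first behavior.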
On CPUs with AVX there is very little penalty for using misaligned loads - I would suggest trading this small penalty off against all the extra logic you're using to check for alignment etc and just have a single loop + scalar code to handle any residual elements:
for (i = 0; i <= cnt - 8; i += 8)
{
    __m256 x = _mm256_loadu_ps(src1 + i);
    __m256 y = _mm256_loadu_ps(src2 + i);
    __m256 z = _mm256_mul_ps(x, y);
    _mm256_storeu_ps(dst + i, z);
}
for ( ; i < cnt; i++)
{
    dst[i] = src1[i] * src2[i];
}
Better still, make sure that your buffers are all 32 byte aligned in the first place and then just use aligned loads/stores.
Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD - execution time will be largely dominated by loads and stores - you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.
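Putting the suggestions from this answer together, a complete function might look like the sketch below. The name MulUnaligned, the explicit cnt parameter and the MUL_RESTRICT macro are assumptions made for this example. The AVX loop is only compiled when the compiler is targeting AVX, so without that flag the function is just the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

#if defined(_MSC_VER)
#define MUL_RESTRICT __restrict
#else
#define MUL_RESTRICT __restrict__
#endif

// Hypothetical variant of Mul: dst must not overlap either input.
void MulUnaligned(const float* src1, const float* src2,
                  float* MUL_RESTRICT dst, std::size_t cnt) {
    assert(dst != src1 && dst != src2);
    std::size_t i = 0;
#if defined(__AVX__)
    // i + 8 <= cnt avoids the wrap-around that cnt - 8 would have for cnt < 8
    // when cnt is unsigned.
    for (; i + 8 <= cnt; i += 8) {
        const __m256 x = _mm256_loadu_ps(src1 + i);
        const __m256 y = _mm256_loadu_ps(src2 + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(x, y));
    }
#endif
    for (; i < cnt; ++i) dst[i] = src1[i] * src2[i];  // residual elements
}
```

For runtime dispatch on machines without AVX, this function would have to live in a separately compiled file built with AVX enabled and be called only after a CPU feature check; that part is left out here.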

weird performance in C++ (VC 2010)

I have this loop written in C++, that compiled with MSVC2010 takes a long time to run. (300ms)
for (int i=0; i<h; i++) {
    for (int j=0; j<w; j++) {
        if (buf[i*w+j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            float val = 0;
            for (int k=sy; k < ey; k++) {
                for (int m=sx; m < ex; m++) {
                    val += original[k*w + m] * ds[k - i + hr][m - j + hr];
                }
            }
            heat_map[i*w + j] = val;
        }
    }
}
It seemed a bit strange to me, so I did some tests then changed a few bits to inline assembly: (specifically, the code that sums "val")
for (int i=0; i<h; i++) {
    for (int j=0; j<w; j++) {
        if (buf[i*w+j] > 0) {
            const int sy = max(0, i - hr);
            const int ey = min(h, i + hr + 1);
            const int sx = max(0, j - hr);
            const int ex = min(w, j + hr + 1);
            __asm {
                fldz            // instruction missing in the post; assumed: start the sum at 0.0
            }
            for (int k=sy; k < ey; k++) {
                for (int m=sx; m < ex; m++) {
                    float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
                    __asm {
                        fld val
                        fadd    // instruction missing in the post; assumed: add to the sum and pop
                    }
                }
            }
            float val1;
            __asm {
                fstp val1
            }
            heat_map[i*w + j] = val1;
        }
    }
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?
I suggest you try different floating-point calculation models supported by the compiler - precise, strict or fast (see /fp option) - with your original code before making any conclusions. I suspect that your original code was compiled with some overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float value. If it was compiled with precise model, the intermediate results would have to be rounded to the precision of float type, even if the variable val was optimized away and the internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which is what could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast mode and see how their performances compare in that case.
A few things to check out:
You need to check that it actually is the same code. As in, are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val to zero, the second is the extra variable val1 (unlikely since it will most likely just change the constant subtraction of the stack pointer), the third is that your inline assembly version may not put the interim results back into val.
You need to make sure your sample space is large. You didn't mention whether you'd done only one run of each version or a hundred runs but, the more runs, the better, so as to remove the effect of "noise" in your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.
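The last two points (many runs, CPU time instead of elapsed time) can be combined in a small harness. This is only a sketch with a made-up name, averageCpuMs. It relies on std::clock, which reports processor time on POSIX systems; Microsoft's C runtime returns elapsed wall-clock time from clock(), so with MSVC a call such as GetProcessTimes would be needed to get real CPU time.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical harness: runs work() `runs` times and returns the average
// time per run in milliseconds, as measured by std::clock.
template <typename F>
double averageCpuMs(F work, int runs) {
    assert(runs > 0);
    const std::clock_t begin = std::clock();
    for (int r = 0; r < runs; ++r) work();
    const std::clock_t end = std::clock();
    return 1000.0 * static_cast<double>(end - begin) / CLOCKS_PER_SEC / runs;
}
```

Usage would be along the lines of averageCpuMs([&] { computeHeatMap(); }, 100), where computeHeatMap stands for whichever version of the loop is being measured.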