NEON increasing run time - c++

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one. (The second one is three times as large.)
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large
for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
float l_weight_f32 = *l_ptrGauss_pf32;
*l_ptrLaplace_pf32 *= l_weight_f32;
++l_ptrLaplace_pf32;
*l_ptrLaplace_pf32 *= l_weight_f32;
++l_ptrLaplace_pf32;
*l_ptrLaplace_pf32 *= l_weight_f32;
++l_ptrLaplace_pf32;
++l_ptrGauss_pf32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer.
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;
for( uint64_t k=0; k<(l_lastPixelInBlock_ui64/4); ++k)
{
l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32+4);
l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32+8);
l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);
vst1q_f32(l_ptrLaplace_pf32, l_laplElem1_f32x4);
vst1q_f32(l_ptrLaplace_pf32+4, l_laplElem2_f32x4);
vst1q_f32(l_ptrLaplace_pf32+8, l_laplElem3_f32x4);
l_ptrLaplace_pf32 += 12;
l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?

Your code performs many vector loads and stores relative to the few multiplications, so I would recommend optimizing the loads. There are two steps:
Use aligned memory in your arrays.
Use prefetch.
To do this, I would recommend using the following function:
inline float32x4_t Load(const float * p)
{
// use prefetch:
__builtin_prefetch(p + 256);
// tell compiler that address is aligned:
const float * _p = (const float *)__builtin_assume_aligned(p, 16);
return vld1q_f32(_p);
}
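The Load function above only helps if the buffers really are 16-byte aligned. A minimal sketch of one way to get such buffers, assuming a POSIX system (posix_memalign) and a GCC/Clang compiler (__builtin_assume_aligned). The helper names alloc_aligned_floats and scale_by_weights are made up for illustration, and the loop is the scalar loop from the question, left to the compiler to vectorize:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX, not standard C++)

// Hypothetical helper: returns a 16-byte-aligned float buffer, or nullptr on failure.
// Release the buffer with free().
float* alloc_aligned_floats(std::size_t count)
{
    void* p = nullptr;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(p);
}

// Multiplies three consecutive laplace values by one gauss weight.
// Both pointers must come from a 16-byte-aligned allocation.
void scale_by_weights(const float* gauss, float* laplace, std::size_t numPixels)
{
    assert(reinterpret_cast<std::uintptr_t>(gauss) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(laplace) % 16 == 0);
    // Promise the alignment to the compiler so it can pick aligned accesses.
    const float* g = static_cast<const float*>(__builtin_assume_aligned(gauss, 16));
    float* l = static_cast<float*>(__builtin_assume_aligned(laplace, 16));
    for (std::size_t k = 0; k < numPixels; ++k) {
        const float w = g[k];
        l[3 * k + 0] *= w;
        l[3 * k + 1] *= w;
        l[3 * k + 2] *= w;
    }
}
```

Whether this is faster than hand-written intrinsics depends on the compiler and target, so it is worth measuring both.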

Related

Fully Connected Layer (dot product) using AVX

I have the following C++ code to perform the multiply and accumulate steps of a fully connected layer (without the bias). Basically I just do a dot product using a vector (inputs) and a matrix (weights). I used AVX vectors to speed up the operation.
const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();
SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();
const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);
for(size_t n = 0; n < out_neurons; n++){
float accum = 0.0;
float temp[4] = {0,0,0,0};
float *p = temp;
__m128 in, ws, dp;
for(size_t i = 0; i < in_neurons; i+=4){
// read and save the weights correctly by applying the mask
temp[0] = scl[(i+0)*out_neurons + n];
temp[1] = scl[(i+1)*out_neurons + n];
temp[2] = scl[(i+2)*out_neurons + n];
temp[3] = scl[(i+3)*out_neurons + n];
// load input neurons sequentially
in = _mm_load_ps(&src[i]);
// load weights
ws = _mm_load_ps(p);
// dot product
dp = _mm_dp_ps(in, ws, 0xff);
// accumulator
accum += dp.m128_f32[0];
}
// save the final result
dst[n] = accum;
}
It works, but the speedup is far from what I expected. As you can see below, a convolutional layer with 24x more operations than my custom dot product layer takes less time. This makes no sense, and there should be much more room for improvement. What are my major mistakes when trying to use AVX? (I'm new to AVX programming, so I don't fully understand where I should start looking to fully optimize the code.)
**Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32
Input: 28x28x32 = 25K
Weights: (3*3*32)*32 = 9K
Number of MACs: 3*3*27*27*32*32 = 7M
Execution Time on OpenVINO framework: 0.049 ms
**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576
Weights: 3*3*64*512 = 295K
Number of MACs: 295K
Execution Time on OpenVINO framework: 0.197 ms
Thanks for all help in advance!
Addendum: What you are doing is actually a Matrix-Vector-product. It is well-understood how to implement this efficiently (although with caching and instruction-level parallelism it is not completely trivial). The rest of the answer just shows a very simple vectorized implementation.
You can drastically simplify your implementation by incrementing n+=8 and i+=1 (assuming out_neurons is a multiple of 8, otherwise, some special handling needs to be done for the last elements), i.e., you can accumulate 8 dst values at once.
A very simple implementation assuming FMA is available (otherwise use multiplication and addition):
void dot_product(const float* src, const float* scl, float* dst,
const int in_neurons, const int out_neurons)
{
for(size_t n = 0; n < out_neurons; n+=8){
__m256 accum = _mm256_setzero_ps();
for(size_t i = 0; i < in_neurons; i++){
accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
}
// save the result
_mm256_storeu_ps(dst+n ,accum);
}
}
This could still be optimized e.g., by accumulating 2, 4, or 8 dst packets inside the inner loop, which would not only save some broadcast operations (the _mm256_set1_ps instruction), but also compensate latencies of the FMA instruction.
Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi
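For checking a vectorized version against known-good results, a portable sketch with the same loop structure (eight outputs accumulated per pass, weights read contiguously) can be handy. It uses no intrinsics; the name dot_product_blocked is made up for illustration, and the row-major layout scl[i*out_neurons + n] is taken from the question. Whether the compiler turns the inner k loop into vector code is an assumption, not a guarantee:

```cpp
#include <cassert>
#include <cstddef>

// Matrix-vector product: dst[n] = sum over i of src[i] * scl[i*out_neurons + n].
// Assumes out_neurons is a multiple of 8, like the AVX version above.
void dot_product_blocked(const float* src, const float* scl, float* dst,
                         int in_neurons, int out_neurons)
{
    assert(out_neurons % 8 == 0);
    for (int n = 0; n < out_neurons; n += 8) {
        float accum[8] = {0.0f};  // plays the role of one __m256 accumulator
        for (int i = 0; i < in_neurons; ++i) {
            const float s = src[i];  // the broadcast value
            const float* w = &scl[static_cast<std::size_t>(i) * out_neurons + n];
            for (int k = 0; k < 8; ++k)
                accum[k] += w[k] * s;
        }
        for (int k = 0; k < 8; ++k)
            dst[n + k] = accum[k];
    }
}
```

Floating-point sums can differ slightly between this and an FMA version, so comparisons against it should use a small tolerance.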

SIMD -> uint16_t array to float array work on float then back to uint16_t

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions.
The code using for loops is
int idx;
uint16_t *A, *B, *C;
float gAlpha = 0.8;
float alpha = 0.2;
for (size_t rw = 0; rw < height; rw++) {
for (size_t cl = 0; cl < width; cl++) {
idx = rw * width + cl;
C[idx] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[idx]) + alpha * static_cast<float>(B[idx]));
}
}
This loop is probably not perfect, but it does its job and my unit test gives me the expected results.
As I said, I am trying to convert these loops using SIMD intrinsics. This is my working code and, as you will see, it is not very pretty... We have access to intrinsics up to AVX2.
size_t n_pixels = height * width;
for (size_t px = 0; px < n_pixels; px += 8) {
__m128i xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128i xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128 ylo = _mm_cvtepi32_ps(xlo);
__m128 yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMinFl = _mm256_castps128_ps256(ylo);
pxMinFl = _mm256_insertf128_ps(pxMinFl, yhi, 1);
xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
ylo = _mm_cvtepi32_ps(xlo);
yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMaxFl = _mm256_castps128_ps256(ylo);
pxMaxFl = _mm256_insertf128_ps(pxMaxFl, yhi, 1);
__m256 avGain1 = _mm256_set1_ps(gAlpha);
__m256 avGain2 = _mm256_set1_ps(alpha);
__m256 prodUp = _mm256_mul_ps(pxMinFl, avGain1);
__m256 prodBt = _mm256_mul_ps(pxMaxFl, avGain2);
__m256 pxOutFl = _mm256_add_ps(prodUp, prodBt);
__m128 ylo_ps = _mm256_castps256_ps128(pxOutFl);
__m128 yhi_ps = _mm256_extractf128_ps(pxOutFl, 1);
__m128i xlo_ep = _mm_cvtps_epi32(ylo_ps);
__m128i xhi_ep = _mm_cvtps_epi32(yhi_ps); <- POINT 1
int* xl = reinterpret_cast<int*>(&xlo_ep); <- POINT 2
for (int i=0; i < 8; i++) { <- POINT 2
C[px + i] = static_cast<uint16_t>(xl[i]); <- POINT 2
}
}
There are probably tons of optimizations that could be done on this code, but I have checked that the output of pxOutFl corresponds to the expected value. Where it starts to look like black magic to me is how I had to save the data back into the output array C. First of all, the code doesn't work if I comment out the line at POINT 1 even though, as you can see, I don't use the variable. Secondly, I would have guessed that there is a better solution than the trick I used to store the data back into the uint16_t array (POINT 2), but I can't find one that works.
Could someone point me in the right direction? What am I missing? How could I improve this code?
Thanks in advance!
PS: We use the Intel compiler 2017 for the parallel studio professional edition 2117 on Linux (Fedora 25).
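You can replace all of POINT 2 with a pack followed by a single store. xlo_ep and xhi_ep each hold four 32-bit integers, so they have to be narrowed to eight 16-bit values first (_mm_packus_epi32 is SSE4.1, which any AVX2 machine supports):
_mm_storeu_si128((__m128i *)&C[px], _mm_packus_epi32(xlo_ep, xhi_ep));
This is also the likely reason for POINT 1: the scalar loop reads eight ints through a pointer to xlo_ep, which holds only four. Elements 4 to 7 lie outside the variable, and the loop presumably only works because xhi_ep happens to be placed next to xlo_ep in memory.
Also note that all instances of _mm_load_si128 should probably be _mm_loadu_si128, since you don't seem to be guaranteeing alignment anywhere.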
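A self-contained sketch of the whole round trip (uint16_t to float, blend, back to uint16_t), written with SSE2 only so that it compiles on x86-64 without extra flags. The function name blend_u16 is made up for illustration. Two choices here are assumptions rather than the asker's code: the conversion truncates (_mm_cvttps_epi32) to match static_cast in the scalar loop, and the narrowing uses a shift-and-pack trick that keeps the low 16 bits, which is only correct while results stay within 0..65535:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// C[i] = uint16_t(gAlpha * A[i] + alpha * B[i]) for i in [0, n).
void blend_u16(const uint16_t* A, const uint16_t* B, uint16_t* C,
               std::size_t n, float gAlpha, float alpha)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128 g1 = _mm_set1_ps(gAlpha);
    const __m128 g2 = _mm_set1_ps(alpha);
    std::size_t px = 0;
    for (; px + 8 <= n; px += 8) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(A + px));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(B + px));
        // Zero-extend 8 x u16 to 2 x (4 x i32), then convert to float.
        const __m128 alo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(a, zero));
        const __m128 ahi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(a, zero));
        const __m128 blo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(b, zero));
        const __m128 bhi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(b, zero));
        const __m128 rlo = _mm_add_ps(_mm_mul_ps(alo, g1), _mm_mul_ps(blo, g2));
        const __m128 rhi = _mm_add_ps(_mm_mul_ps(ahi, g1), _mm_mul_ps(bhi, g2));
        // Truncating conversion, like static_cast<uint16_t>(float) for in-range values.
        __m128i ilo = _mm_cvttps_epi32(rlo);
        __m128i ihi = _mm_cvttps_epi32(rhi);
        // Sign-extend the low 16 bits so the signed pack below cannot saturate.
        ilo = _mm_srai_epi32(_mm_slli_epi32(ilo, 16), 16);
        ihi = _mm_srai_epi32(_mm_slli_epi32(ihi, 16), 16);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(C + px), _mm_packs_epi32(ilo, ihi));
    }
    // Scalar tail for the last n % 8 pixels.
    for (; px < n; ++px)
        C[px] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[px]) +
                                      alpha * static_cast<float>(B[px]));
}
```

With SSE4.1 or later available, the two shift lines and _mm_packs_epi32 can be replaced by a single _mm_packus_epi32, which also saturates instead of wrapping.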

the code doesn't speed up while using Intel Intrinsics

I am using intrinsics to accelerate some OpenCV code. But after I replaced the code with intrinsics, the runtime is almost the same or maybe even worse. I cannot figure out why this is happening. I have been looking into this issue for quite a long time, but nothing has changed. I would appreciate it if someone could help me out. Thank you very much! Here is my code:
// if useSSE is true,run the code with intrinsics and takes 1.45ms in my computer
// and if not run the general code and takes the same time.
cv::Mat_<float> results(shape.rows,2);
if (useSSE) {
float* pshape = (float*)shape.data;
results = shape.clone();
float* presults = (float*)results.data;
// use SSE
__m128 xyxy_center = _mm_set_ps(bbox.center_y, bbox.center_x, bbox.center_y, bbox.center_x);
float bbox_width = bbox.width/2;
float bbox_height = bbox.height/2;
__m128 xyxy_size = _mm_set_ps(bbox_height, bbox_width, bbox_height, bbox_width);
gettimeofday(&start, NULL); // this is for counting time
int shape_size = shape.rows*shape.cols;
for (int i=0; i<shape_size; i +=4) {
__m128 a = _mm_loadu_ps(pshape+i);
__m128 result = _mm_div_ps(_mm_sub_ps(a, xyxy_center), xyxy_size);
_mm_storeu_ps(presults+i, result);
}
}else {
//SSE TO BE DONE
for (int i = 0; i < shape.rows; i++){
results(i, 0) = (shape(i, 0) - bbox.center_x) / (bbox.width / 2.0);
results(i, 1) = (shape(i, 1) - bbox.center_y) / (bbox.height / 2.0);
}
}
gettimeofday(&end, NULL);
diff = 1000000*(end.tv_sec-start.tv_sec)+end.tv_usec-start.tv_usec;
std::cout<<diff<<"-----"<<std::endl;
return results;
Your SSE optimization will corrupt memory near the results variable if shape.rows % 2 == 1.
Try to avoid using the i variable in the loop; use pointers directly. The compiler may optimize away the extra addition, or it may not.
Use multiplication instead of division:
float bbox_width_inv = 2.f/bbox.width;
float bbox_height_inv = 2.f/bbox.height;
__m128 xyxy_size_inv = _mm_set_ps(bbox_height_inv, bbox_width_inv, bbox_height_inv, bbox_width_inv);
int shape_size = shape.rows*shape.cols;
float* pshape_end = pshape + shape_size;
// round the element count down to a multiple of 4 (parentheses needed: & binds weaker than +)
float* pshape_end_batch = pshape + (shape_size & ~3);
for (; pshape<pshape_end_batch; pshape+=4, presults+=4) {
__m128 a = _mm_loadu_ps(pshape);
__m128 result = _mm_mul_ps(_mm_sub_ps(a, xyxy_center), xyxy_size_inv);
_mm_storeu_ps(presults, result);
}
// scalar tail: at most one remaining (x, y) pair
while (pshape < pshape_end) {
*presults++ = (*pshape++ - bbox.center_x) * bbox_width_inv;
*presults++ = (*pshape++ - bbox.center_y) * bbox_height_inv;
}
Try to disassemble the code generated from the intrinsics, and make sure there are enough registers to perform your operations and that it doesn't spill temporary results to RAM.
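The points above (no out-of-bounds store for an odd number of rows, multiplication by a precomputed reciprocal) can be sketched without OpenCV as a standalone function. The name normalize_points and its parameter list are made up for illustration; the data is assumed to be interleaved x, y floats as in the question:

```cpp
#include <xmmintrin.h>  // SSE
#include <cstddef>

// results[2k]   = (shape[2k]   - cx) / (w / 2)
// results[2k+1] = (shape[2k+1] - cy) / (h / 2)
// n_floats is the total number of floats (2 per point) and must be even.
void normalize_points(const float* shape, float* results, std::size_t n_floats,
                      float cx, float cy, float w, float h)
{
    const float inv_w = 2.0f / w;
    const float inv_h = 2.0f / h;
    // _mm_set_ps takes the highest lane first, so lane 0 is cx / inv_w.
    const __m128 center = _mm_set_ps(cy, cx, cy, cx);
    const __m128 inv = _mm_set_ps(inv_h, inv_w, inv_h, inv_w);
    std::size_t i = 0;
    // Two points per iteration; stops before it could write past the end.
    for (; i + 4 <= n_floats; i += 4) {
        const __m128 a = _mm_loadu_ps(shape + i);
        _mm_storeu_ps(results + i, _mm_mul_ps(_mm_sub_ps(a, center), inv));
    }
    // Scalar tail: at most one remaining point.
    for (; i + 2 <= n_floats; i += 2) {
        results[i] = (shape[i] - cx) * inv_w;
        results[i + 1] = (shape[i + 1] - cy) * inv_h;
    }
}
```

Multiplying by a reciprocal can differ from a true division in the last bit, which is usually acceptable for coordinates but worth knowing when comparing against the scalar path.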

Performance AVX/SSE assembly vs. intrinsics

I'm just trying to check the optimum approach to optimizing some basic routines. In this case I tried a very simple example of multiplying 2 float vectors together:
void Mul(float *src1, float *src2, float *dst)
{
for (int i=0; i<cnt; i++) dst[i] = src1[i] * src2[i];
};
Plain C implementation is very slow. I did some external ASM using AVX and also tried using intrinsics. These are the test results (time, smaller is better):
ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0
(compiled using MSVC 2013, SSE2, tried Intel Compiler, results were pretty much the same)
As you can see, my ASM code beat even Intel Performance Primitives (probably because I did lots of branches to ensure I can use the AVX aligned instructions). But I'd personally like to use the intrinsics approach; it's simply easier to manage, and I was thinking the compiler should do the best job optimizing all the branches and stuff (my ASM code sucks in that matter imho, yet it is faster). So here's the code using intrinsics:
int i;
for (i=0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];
if ((MINTEGER)(src1 + i) % 32 == 0)
{
if ((MINTEGER)(src2 + i) % 32 == 0)
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_load_ps( src1 + i);
__m256 y = _mm256_load_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
}
else
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_load_ps( src1 + i);
__m256 y = _mm256_loadu_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
};
}
else
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_loadu_ps( src1 + i);
__m256 y = _mm256_loadu_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
};
for (; i<cnt; i++) dst[i] = src1[i] * src2[i];
Simple: First get to an address where dst is aligned to 32 bytes, then branch to check which sources are aligned.
One problem is that the C++ implementations in the beginning and at the end are not using AVX unless I enable AVX in the compiler, which I do NOT want, because this should be just AVX specialization, but the software should work even on a platform, where AVX is not available. And sadly there seems to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with SSE, which the compiler uses. However even if I enabled AVX in the compiler, it still didn't get below 0.14.
Any ideas how to optimize this to make the intrinsics reach the speed of the ASM code?
Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function was called with arguments Mul(p, p, p+1)? You'll get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
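The overlapping call Mul(p, p, p+1) can be reproduced in a small sketch. The function names are made up for illustration, and mul_blocked4 uses no intrinsics: it only imitates the load-everything-then-store order of a 4-wide SIMD loop:

```cpp
#include <cstddef>

// Element by element, exactly as written in the plain C version.
void mul_scalar(const float* src1, const float* src2, float* dst, std::size_t cnt)
{
    for (std::size_t i = 0; i < cnt; ++i)
        dst[i] = src1[i] * src2[i];
}

// Imitates a 4-wide vector loop: four products are computed from the
// current memory contents before any of them is stored.
void mul_blocked4(const float* src1, const float* src2, float* dst, std::size_t cnt)
{
    std::size_t i = 0;
    for (; i + 4 <= cnt; i += 4) {
        float t[4];
        for (std::size_t j = 0; j < 4; ++j)
            t[j] = src1[i + j] * src2[i + j];
        for (std::size_t j = 0; j < 4; ++j)
            dst[i + j] = t[j];
    }
    for (; i < cnt; ++i)
        dst[i] = src1[i] * src2[i];
}
```

Starting from {2, 1, 1, 1, 1} and calling each function with (p, p, p + 1, 4), the scalar version squares its own output at every step and ends with 65536 in the last element, while the blocked version leaves 1 there. Without a no-aliasing guarantee the compiler has to preserve the first behavior.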
If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:
void Mul(float *src1, float *src2, float *__restrict__ dst)
or even better
void Mul(const float *src1, const float *src2, float *__restrict__ dst)
(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
On CPUs with AVX there is very little penalty for using misaligned loads - I would suggest trading this small penalty off against all the extra logic you're using to check for alignment etc and just have a single loop + scalar code to handle any residual elements:
for (i = 0; i <= cnt - 8; i += 8)
{
__m256 x = _mm256_loadu_ps(src1 + i);
__m256 y = _mm256_loadu_ps(src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_storeu_ps(dst + i, z);
}
for ( ; i < cnt; i++)
{
dst[i] = src1[i] * src2[i];
}
Better still, make sure that your buffers are all 32 byte aligned in the first place and then just use aligned loads/stores.
Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD - execution time will be largely dominated by loads and stores - you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.
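As a rough illustration of combining operations, suppose the product is followed by an addition of a third array (src3 is a hypothetical extra input, not part of the question). Doing both in one pass reads each input once and writes the output once, instead of storing an intermediate product array and reloading it. The sketch is plain C++ and relies on the compiler to vectorize it:

```cpp
#include <cstddef>

// One fused pass: dst[i] = src1[i] * src2[i] + src3[i].
// The alternative (a Mul pass into a temporary, then an Add pass) would
// store and reload cnt intermediate floats for the same arithmetic.
void mul_add(const float* src1, const float* src2, const float* src3,
             float* __restrict__ dst, std::size_t cnt)
{
    for (std::size_t i = 0; i < cnt; ++i)
        dst[i] = src1[i] * src2[i] + src3[i];
}
```

How much this helps depends on whether the arrays fit in cache; for large arrays the saved memory traffic is typically what matters.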

Optimize a nearest neighbor resizing algorithm for speed

I'm using the following algorithm to perform nearest neighbor resizing. Is there any way to optimize its speed? Input and output buffers are in ARGB format, though images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
const int colors = 4;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
for (int x = 0; x < targetWidth; x++)
{
int x2 = ((x * x_ratio) >> 16) ;
int y2_x2_colors = (y2_xsource + x2) * colors;
int i_x_colors = (i_xdest + x) * colors;
output[i_x_colors] = input[y2_x2_colors];
output[i_x_colors + 1] = input[y2_x2_colors + 1];
output[i_x_colors + 2] = input[y2_x2_colors + 2];
output[i_x_colors + 3] = input[y2_x2_colors + 3];
}
}
}
The restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare additional pointers, pointerToOutput and pointerToInput, as uint32_t*, so that the four 8-bit copy-assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler didn't see that yet).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
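A simplified sketch of this idea, assuming 32-bit pixels. It uses a plain lookup table for the x mapping rather than run-length coding, and copies the previous output row with memcpy whenever two target rows map to the same source row. The function name resizeNearestTabulated is made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void resizeNearestTabulated(const uint32_t* input, uint32_t* output,
                            int sourceWidth, int sourceHeight,
                            int targetWidth, int targetHeight)
{
    const int x_ratio = (sourceWidth << 16) / targetWidth;
    const int y_ratio = (sourceHeight << 16) / targetHeight;

    // Precompute the source column for every target column once.
    std::vector<int> xmap(static_cast<std::size_t>(targetWidth));
    for (int x = 0; x < targetWidth; ++x)
        xmap[static_cast<std::size_t>(x)] = (x * x_ratio) >> 16;

    int prev_src_y = -1;
    for (int y = 0; y < targetHeight; ++y) {
        const int src_y = (y * y_ratio) >> 16;
        uint32_t* dst_row = output + static_cast<std::size_t>(y) * targetWidth;
        if (src_y == prev_src_y) {
            // Same source row as the previous target row: copy it instead of resampling.
            std::memcpy(dst_row, dst_row - targetWidth,
                        static_cast<std::size_t>(targetWidth) * sizeof(uint32_t));
        } else {
            const uint32_t* src_row = input + static_cast<std::size_t>(src_y) * sourceWidth;
            for (int x = 0; x < targetWidth; ++x)
                dst_row[x] = src_row[xmap[static_cast<std::size_t>(x)]];
            prev_src_y = src_y;
        }
    }
}
```

The row copy only pays off when enlarging vertically; when shrinking, every target row maps to a different source row and the memcpy branch is never taken.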
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for speedup, there remains the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might make it possible to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can utilize massive parallelization by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad can give you inherent nearest neighbor calculations. OpenGL could also give you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which would amount to a single GL command that could be an easy parameter to your resize function).
Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const uint32_t* input32 = (const uint32_t*)input;
uint32_t* output32 = (uint32_t*)output;
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
int x_ratio_with_color = x_ratio;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
int source_x_offset = 0;
int startingOffset = y2_xsource;
const uint32_t * inputLine = input32 + startingOffset;
for (int x = 0; x < targetWidth; x++)
{
int sourceOffset = source_x_offset >> 16;
output32[i_xdest] = inputLine[sourceOffset];
// advance only after the copy, so pixel 0 of each row is read and written
i_xdest += 1;
source_x_offset += x_ratio_with_color;
}
}
}