SSE reduction of float vector - c++

How can I get sum elements (reduction) of float vector using sse intrinsics?
Simple serial code:
void(float *input, float &result, unsigned int NumElems)
{
result = 0;
for(auto i=0; i<NumElems; ++i)
result += input[i];
}

Typically you generate 4 partial sums in your loop and then just sum horizontally across the 4 elements after the loop, e.g.
#include <cassert>
#include <cstdint>
#include <emmintrin.h>
float vsum(const float *a, int n)
{
float sum;
__m128 vsum = _mm_set1_ps(0.0f);
assert((n & 3) == 0);
assert(((uintptr_t)a & 15) == 0);
for (int i = 0; i < n; i += 4)
{
__m128 v = _mm_load_ps(&a[i]);
vsum = _mm_add_ps(vsum, v);
}
vsum = _mm_hadd_ps(vsum, vsum);
vsum = _mm_hadd_ps(vsum, vsum);
_mm_store_ss(&sum, vsum);
return sum;
}
Note: for the above example a must be 16 byte aligned and n must be a multiple of 4. If the alignment of a can not be guaranteed then use _mm_loadu_ps instead of _mm_load_ps. If n is not guaranteed to be a multiple of 4 then add a scalar loop at the end of the function to accumulate any remaining elements.

Related

How to properly access array with specific pointer arithmetic using SSE in convolution algorithm? [duplicate]

This question already exists:
How to implement convolution algorithm with SSE?
Closed 1 year ago.
My goal is to implement exactly that algorithm using only CPU and using SSE:
My array's sizes a multiple of 4 and they are aligned:
const int INPUT_SIGNAL_ARRAY_SIZE = 256896;
const int IMPULSE_RESPONSE_ARRAY_SIZE = 318264;
const int OUTPUT_SIGNAL_ARRAY_SIZE = INPUT_SIGNAL_ARRAY_SIZE + IMPULSE_RESPONSE_ARRAY_SIZE;
__declspec(align(16)) float inputSignal_dArray[INPUT_SIGNAL_ARRAY_SIZE];
__declspec(align(16)) float impulseResponse_dArray[IMPULSE_RESPONSE_ARRAY_SIZE];
__declspec(align(16)) float outputSignal_dArray[OUTPUT_SIGNAL_ARRAY_SIZE];
I have written CPU "method" and it works correctly:
//#pragma optimize( "", off )
void computeConvolutionOutputCPU(float* inputSignal, float* impulseResponse, float* outputSignal) {
float* pInputSignal = inputSignal;
float* pImpulseResponse = impulseResponse;
float* pOutputSignal = outputSignal;
#pragma loop(no_vector)
for (int i = 0; i < OUTPUT_SIGNAL_ARRAY_SIZE; i++)
{
*(pOutputSignal + i) = 0;
#pragma loop(no_vector)
for (int j = 0; j < IMPULSE_RESPONSE_ARRAY_SIZE; j++)
{
if (i - j >= 0 && i - j < INPUT_SIGNAL_ARRAY_SIZE)
{
*(pOutputSignal + i) = *(pOutputSignal + i) + *(pImpulseResponse + j) * (*(pInputSignal + i - j));
}
}
}
}
//#pragma optimize( "", on )
On the other hand I should use function with SSE. I tried the following code:
void computeConvolutionOutputSSE(float* inputSignal, float* impulseResponse, float* outputSignal) {
__m128* pInputSignal = (__m128*) inputSignal;
__m128* pImpulseResponse = (__m128*) impulseResponse;
__m128* pOutputSignal = (__m128*) outputSignal;
int nOuterLoop = OUTPUT_SIGNAL_ARRAY_SIZE / 4;
int nInnerLoop = IMPULSE_RESPONSE_ARRAY_SIZE / 4;
int quarterOfInputSignal = INPUT_SIGNAL_ARRAY_SIZE / 4;
__m128 m0 = _mm_set_ps1(0);
for (int i = 0; i < nOuterLoop; i++)
{
*(pOutputSignal + i) = m0;
for (int j = 0; j < nInnerLoop; j++)
{
if ((i - j) >= 0 && (i - j) < quarterOfInputSignal)
{
*(pOutputSignal + i) = _mm_add_ps(
*(pOutputSignal + i),
_mm_mul_ps(*(pImpulseResponse + j), *(pInputSignal + i - j))
);
}
}
}
}
And function above works not correct and produces not the same values like CPU.
The problem was specified on stackoverflow with following comment :
*(pInputSignal + i - j) is incorrect in case of SSE, because it's not an i-j offset away from current value, it's (i-j) * 4 . THe thing is,
as I remember it, the idea of using pointer that way is incorrect
unless intrinsics had changed since then - in my time one had to
"load" values into an instance of __m128 in this case, as H(J) and
X(I-J) are in unaligned location (and sequence breaks).
and
Since you care about individual floats and their order, probably best
to use const float*, with _mm_loadu_ps instead of just dereferencing
(which is like _mm_load_ps). That way you can easily do unaligned
loads that get the floats you want into the vector element positions
you want, and the pointer math works the same as for scalar. You just
have to take into account that load(ptr) actually gets you a vector of
elements from ptr+0..3.
But I can't use this information because have no idea how to properly access array with SSE in this case.
you need 128-bit float32 value , not msvc float.
see _mm_broadcast_ss

An accumulated computing error in SSE version of algorithm of the sum of squared differences

I was trying to optimize following code (sum of squared differences for two arrays):
inline float Square(float value)
{
return value*value;
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
float sum = 0;
for(size_t i = 0; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
So I performed optimization with using of SSE instructions of CPU:
inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
__m128 _a = _mm_loadu_ps(a + i);
__m128 _b = _mm_loadu_ps(b + i);
__m128 _d = _mm_sub_ps(_a, _b);
sum = _mm_add_ps(sum, _mm_mul_ps(_d, _d));
}
inline float ExtractSum(__m128 a)
{
float _a[4];
_mm_storeu_ps(_a, a);
return _a[0] + _a[1] + _a[2] + _a[3];
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
size_t i = 0, alignedSize = size/4*4;
__m128 sums = _mm_setzero_ps();
for(; i < alignedSize; i += 4)
SquaredDifferenceSum(a, b, i, sums);
float sum = ExtractSum(sums);
for(; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
This code works fine if the size of the arrays is not too large.
But if the size is big enough then there is a large computing error between results given by base function and its optimized version.
And so I have a question: Where is here a bug in SSE optimized code, which leads to the computing error.
The error follows from finite precision floating point numbers.
Each addition of two floating point numbers is has an computing error proportional to difference between them.
In your scalar version of algorithm the resulting sum is much greater then each term (if size of arrays is big enough of course).
So it leads to accumulation of big computing error.
In the SSE version of algorithm actually there is four sums for results accumulation. And difference between these sums and each term is lesser in four times relative to scalar code.
So this leads to the lesser computing error.
There are two ways to solve this error:
1) Using of floating point numbers of double precision for accumulating sum.
2) Using of the the Kahan summation algorithm (also known as compensated summation) which significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach.
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
With using of Kahan summation algorithm your scalar code will look like:
inline void KahanSum(float value, float & sum, float & correction)
{
float term = value - correction;
float temp = sum + term;
correction = (temp - sum) - term;
sum = temp;
}
float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
float sum = 0, correction = 0;
for(size_t i = 0; i < size; ++i)
KahanSum(Square(a[i] - b[i]), sum, correction);
return sum;
}
And SSE optimized code will look as follow:
inline void SquaredDifferenceKahanSum(const float * a, const float * b, size_t i,
__m128 & sum, __m128 & correction)
{
__m128 _a = _mm_loadu_ps(a + i);
__m128 _b = _mm_loadu_ps(b + i);
__m128 _d = _mm_sub_ps(_a, _b);
__m128 term = _mm_sub_ps(_mm_mul_ps(_d, _d), correction);
__m128 temp = _mm_add_ps(sum, term);
correction = _mm_sub_ps(_mm_sub_ps(temp, sum), term);
sum = temp;
}
float SquaredDifferenceKahanSum(const float * a, const float * b, size_t size)
{
size_t i = 0, alignedSize = size/4*4;
__m128 sums = _mm_setzero_ps(), corrections = _mm_setzero_ps();
for(; i < alignedSize; i += 4)
SquaredDifferenceKahanSum(a, b, i, sums, corrections);
float sum = ExtractSum(sums), correction = 0;
for(; i < size; ++i)
KahanSum(Square(a[i] - b[i]), sum, correction);
return sum;
}

Structure of Arrays vs Array of Structures

From some comments that I have read in here, for some reason it is preferable to have Structure of Arrays (SoA) over Array of Structures (AoS) for parallel implementations like CUDA? If that is true, can anyone explain why?
Thanks in advance!
Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).
E.g. for RGB pixels versus separate RGB planes:
struct {
uint8_t r, g, b;
} AoS[N];
struct {
uint8_t r[N];
uint8_t g[N];
uint8_t b[N];
} SoA;
If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.
However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.
One further consideration is padding/alignment. For the RGB example above each element in an AoS layout is aligned to a multiple of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned which can be more convenient for certain algorithms/architectures.
For most image processing type applications the AoS scenario is much more common, but for other applications, or for specific image processing tasks this may not always be the case. When there is no obvious choice I would recommend AoS as the default choice.
See also this answer for more general discussion of AoS v SoA.
I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).
In the example, I'm considering three different versions of the same code:
SoA (v1)
Straight arrays (v2)
AoS (v3)
In particular, version 2 considers the use of straight arrays. The timings of versions 2 and 3 are the same for this example and result to be better than version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading from uniform cache could be enabled through const __restrict__ for this case.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <thrust\device_vector.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
#define BLOCKSIZE 1024
/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {
unsigned int x1;
unsigned int x2;
unsigned int code;
bool done;
};
/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {
unsigned int *x1;
unsigned int *x2;
unsigned int *code;
bool *done;
};
/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
cellAoS tempCell = d_cells[tid];
tempCell.x1 = tempCell.x1 + 10;
tempCell.x2 = tempCell.x2 + 10;
d_cells[tid] = tempCell;
}
}
/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
d_x1[tid] = d_x1[tid] + 10;
d_x2[tid] = d_x2[tid] + 10;
}
}
/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {
const int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < N) {
cell.x1[tid] = cell.x1[tid] + 10;
cell.x2[tid] = cell.x2[tid] + 10;
}
}
/********/
/* MAIN */
/********/
int main() {
const int N = 2048 * 2048 * 4;
TimingGPU timerGPU;
thrust::host_vector<cellAoS> h_cells(N);
thrust::device_vector<cellAoS> d_cells(N);
thrust::host_vector<unsigned int> h_x1(N);
thrust::host_vector<unsigned int> h_x2(N);
thrust::device_vector<unsigned int> d_x1(N);
thrust::device_vector<unsigned int> d_x2(N);
for (int k = 0; k < N; k++) {
h_cells[k].x1 = k + 1;
h_cells[k].x2 = k + 2;
h_cells[k].code = k + 3;
h_cells[k].done = true;
h_x1[k] = k + 1;
h_x2[k] = k + 2;
}
d_cells = h_cells;
d_x1 = h_x1;
d_x2 = h_x2;
cellSoA cell;
cell.x1 = thrust::raw_pointer_cast(d_x1.data());
cell.x2 = thrust::raw_pointer_cast(d_x2.data());
cell.code = NULL;
cell.done = NULL;
timerGPU.StartCounter();
AoSvsSoA_v1 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_cells.data()), N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());
//timerGPU.StartCounter();
//AoSvsSoA_v2 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
//gpuErrchk(cudaPeekAtLastError());
//gpuErrchk(cudaDeviceSynchronize());
//printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());
timerGPU.StartCounter();
AoSvsSoA_v3 << <iDivUp(N, BLOCKSIZE), BLOCKSIZE >> >(cell, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());
h_cells = d_cells;
h_x1 = d_x1;
h_x2 = d_x2;
// --- Check results
for (int k = 0; k < N; k++) {
if (h_x1[k] != k + 11) {
printf("h_x1[%i] not equal to %i\n", h_x1[k], k + 11);
break;
}
if (h_x2[k] != k + 12) {
printf("h_x2[%i] not equal to %i\n", h_x2[k], k + 12);
break;
}
if (h_cells[k].x1 != k + 11) {
printf("h_cells[%i].x1 not equal to %i\n", h_cells[k].x1, k + 11);
break;
}
if (h_cells[k].x2 != k + 12) {
printf("h_cells[%i].x2 not equal to %i\n", h_cells[k].x2, k + 12);
break;
}
}
}
The following are the timings (runs performed on a GTX960):
Array of struct 9.1ms (v1 kernel)
Struct of arrays 3.3ms (v3 kernel)
Straight arrays 3.2ms (v2 kernel)
SoA is effectly good for SIMD processing.
For several reason, but basically it's more efficient to load 4 consecutive floats in a register. With something like:
float v [4] = {0};
__m128 reg = _mm_load_ps( v );
than using:
struct vec { float x; float, y; ....} ;
vec v = {0, 0, 0, 0};
and create an __m128 data by accessing all member:
__m128 reg = _mm_set_ps(v.x, ....);
if your arrays are 16-byte aligned data load/store are faster and some op can be perform directly in memory.

how to use SSE to process array of ints, using a condition

I'm new to SSE, and limited in knowledge. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple.
I have an array of unsigned ints, and I only check for elements that are >=, or <= than some constant. As result, I need an array with elements that passed condition.
I'm thinking to use 'mm_cmpge_ps' as a mask, but this construct work over floats not ints!? :(
any suggestion, help is very much appreciated.
It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.
#include <emmintrin.h> // SSE2 intrinsics
for (int i = 0; i < N; i += 4)
{
__m128i v = _mm_load_si128(&a[i]);
__m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
__m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
__m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
v = _mm_and_si128(v, vcmp);
_mm_store_si128(&a[i], v);
}
Note that a needs to be 16 byte aligned and N needs to be a multiple of 4 - if these constraints are a problem then it's not too hard to extend the code to cope with this.
Here you go. Here are three functions.
The first function,foo_v1, is based on Paul R's answer.
The second function,foo_v2, is based on a popular question today Fastest way to determine if an integer is between two integers (inclusive) with known sets of values
The third function, foo_v3 uses Agner Fog's vectorclass which I added only to show how much easier and cleaner it is to use his class. If you don't have the class then just comment out the #include "vectorclass.h" line and the foo_v3 function. I used Vec8ui which means it will use AVX2 if available and break it into two Vec4ui otherwise so you don't have to change your code to get the benefit of AVX2.
#include <stdio.h>
#include <nmmintrin.h> // SSE4.2
#include "vectorclass.h"
void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
for (int i = 0; i < N; i += 4) {
__m128i v = _mm_load_si128((const __m128i*)&a[i]);
__m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
__m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
__m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
v = _mm_and_si128(v, vcmp);
_mm_store_si128((__m128i*)&a[i], v);
}
}
void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
//if ((unsigned)(number-lower) < (upper-lower))
for (int i = 0; i < N; i += 4) {
__m128i v = _mm_load_si128((const __m128i*)&a[i]);
__m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
__m128i min_ab = _mm_min_epu32(dv,_mm_set1_epi32(MAX_VAL-MIN_VAL));
__m128i vcmp = _mm_cmpeq_epi32(dv,min_ab);
v = _mm_and_si128(v, vcmp);
_mm_store_si128((__m128i*)&a[i], v);
}
}
void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
//if ((unsigned)(number-lower) < (upper-lower))
for (int i = 0; i < N; i += 8) {
Vec8ui va = Vec8ui().load(&a[i]);
va &= (va - MIN_VAL) <= (MAX_VAL-MIN_VAL);
va.store(&a[i]);
}
}
int main() {
const int N = 16;
int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
for(int i=0; i<N; i++) {
a[i] = i;
}
foo_v2(N, a, 7, 3);
for(int i=0; i<N; i++) {
printf("%d ", a[i]);
} printf("\n");
_mm_free(a);
}
First place to look might be IntelĀ® Intrinsics Guide

SIMD-able code?

What is the strict definition of what code can utilise SIMD instruction set? Is it anything where you can run calculations in parallel?
So if I had:
for(int i=0; i<100; i++){
sum += array[i];
}
this could take advantage of SIMD because we could run:
for(int i=0; i<100;i=i+4){
sum0 += array[i];
sum1 += array[i+1];
sum2 += array[i+2];
sum3 += array[i+3];
}
sum = sum0 + sum1 + sum2 + sum3;
?
Does it have to be float types, or could it be double and integer?
Assuming you're talking about x86 (SSE et al) then the supported types for arithmetic are 8, 16, 32 and 64 bit integers, and single and double precision floats. Note however that not all arithmetic operations are supported for all data types - SSE lacks orthogonality in this regard.
Assuming 32 bit ints and suitably aligned arrays (16 byte aligned) then you could implement your above loop example as:
#include <emmintrin.h> // SSE2 intrinsics
int32_t a[100] __attribute__ ((aligned(16)));
// suitably aligned array
__m128i vsum = _mm_set1_epi32(0); // init vsum = { 0, 0, 0, 0 }
for (int i = 0; i < 100; i += 4)
{
__m128i v = _mm_load_si128(&a[i]); // load 4 ints from a[i]..a[i+3]
vsum = _mm_add_epi32(vsum, v); // accumulate 4 partial sums
}
// final horizontal sum of partial sums
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
int32_t sum = _mm_cvtsi128_si32(vsum); // sum = scalar sum of a[]