I'm new to SSE, and limited in knowledge. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple.
I have an array of unsigned ints, and I only check for elements that are >=, or <= than some constant. As result, I need an array with elements that passed condition.
I'm thinking to use 'mm_cmpge_ps' as a mask, but this construct work over floats not ints!? :(
any suggestion, help is very much appreciated.
It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.
#include <emmintrin.h> // SSE2 intrinsics
for (int i = 0; i < N; i += 4)
{
__m128i v = _mm_load_si128(&a[i]);
__m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
__m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
__m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
v = _mm_and_si128(v, vcmp);
_mm_store_si128(&a[i], v);
}
Note that a needs to be 16 byte aligned and N needs to be a multiple of 4 - if these constraints are a problem then it's not too hard to extend the code to cope with this.
Here you go. Here are three functions.
The first function,foo_v1, is based on Paul R's answer.
The second function,foo_v2, is based on a popular question today Fastest way to determine if an integer is between two integers (inclusive) with known sets of values
The third function, foo_v3 uses Agner Fog's vectorclass which I added only to show how much easier and cleaner it is to use his class. If you don't have the class then just comment out the #include "vectorclass.h" line and the foo_v3 function. I used Vec8ui which means it will use AVX2 if available and break it into two Vec4ui otherwise so you don't have to change your code to get the benefit of AVX2.
#include <stdio.h>
#include <nmmintrin.h> // SSE4.2
#include "vectorclass.h"
void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
for (int i = 0; i < N; i += 4) {
__m128i v = _mm_load_si128((const __m128i*)&a[i]);
__m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
__m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
__m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
v = _mm_and_si128(v, vcmp);
_mm_store_si128((__m128i*)&a[i], v);
}
}
void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
//if ((unsigned)(number-lower) < (upper-lower))
for (int i = 0; i < N; i += 4) {
__m128i v = _mm_load_si128((const __m128i*)&a[i]);
__m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
__m128i min_ab = _mm_min_epu32(dv,_mm_set1_epi32(MAX_VAL-MIN_VAL));
__m128i vcmp = _mm_cmpeq_epi32(dv,min_ab);
v = _mm_and_si128(v, vcmp);
_mm_store_si128((__m128i*)&a[i], v);
}
}
void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
//if ((unsigned)(number-lower) < (upper-lower))
for (int i = 0; i < N; i += 8) {
Vec8ui va = Vec8ui().load(&a[i]);
va &= (va - MIN_VAL) <= (MAX_VAL-MIN_VAL);
va.store(&a[i]);
}
}
int main() {
const int N = 16;
int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
for(int i=0; i<N; i++) {
a[i] = i;
}
foo_v2(N, a, 7, 3);
for(int i=0; i<N; i++) {
printf("%d ", a[i]);
} printf("\n");
_mm_free(a);
}
First place to look might be IntelĀ® Intrinsics Guide
Related
This question already exists:
How to implement convolution algorithm with SSE?
Closed 1 year ago.
My goal is to implement exactly that algorithm using only CPU and using SSE:
My array's sizes a multiple of 4 and they are aligned:
const int INPUT_SIGNAL_ARRAY_SIZE = 256896;
const int IMPULSE_RESPONSE_ARRAY_SIZE = 318264;
const int OUTPUT_SIGNAL_ARRAY_SIZE = INPUT_SIGNAL_ARRAY_SIZE + IMPULSE_RESPONSE_ARRAY_SIZE;
__declspec(align(16)) float inputSignal_dArray[INPUT_SIGNAL_ARRAY_SIZE];
__declspec(align(16)) float impulseResponse_dArray[IMPULSE_RESPONSE_ARRAY_SIZE];
__declspec(align(16)) float outputSignal_dArray[OUTPUT_SIGNAL_ARRAY_SIZE];
I have written CPU "method" and it works correctly:
//#pragma optimize( "", off )
void computeConvolutionOutputCPU(float* inputSignal, float* impulseResponse, float* outputSignal) {
float* pInputSignal = inputSignal;
float* pImpulseResponse = impulseResponse;
float* pOutputSignal = outputSignal;
#pragma loop(no_vector)
for (int i = 0; i < OUTPUT_SIGNAL_ARRAY_SIZE; i++)
{
*(pOutputSignal + i) = 0;
#pragma loop(no_vector)
for (int j = 0; j < IMPULSE_RESPONSE_ARRAY_SIZE; j++)
{
if (i - j >= 0 && i - j < INPUT_SIGNAL_ARRAY_SIZE)
{
*(pOutputSignal + i) = *(pOutputSignal + i) + *(pImpulseResponse + j) * (*(pInputSignal + i - j));
}
}
}
}
//#pragma optimize( "", on )
On the other hand I should use function with SSE. I tried the following code:
void computeConvolutionOutputSSE(float* inputSignal, float* impulseResponse, float* outputSignal) {
__m128* pInputSignal = (__m128*) inputSignal;
__m128* pImpulseResponse = (__m128*) impulseResponse;
__m128* pOutputSignal = (__m128*) outputSignal;
int nOuterLoop = OUTPUT_SIGNAL_ARRAY_SIZE / 4;
int nInnerLoop = IMPULSE_RESPONSE_ARRAY_SIZE / 4;
int quarterOfInputSignal = INPUT_SIGNAL_ARRAY_SIZE / 4;
__m128 m0 = _mm_set_ps1(0);
for (int i = 0; i < nOuterLoop; i++)
{
*(pOutputSignal + i) = m0;
for (int j = 0; j < nInnerLoop; j++)
{
if ((i - j) >= 0 && (i - j) < quarterOfInputSignal)
{
*(pOutputSignal + i) = _mm_add_ps(
*(pOutputSignal + i),
_mm_mul_ps(*(pImpulseResponse + j), *(pInputSignal + i - j))
);
}
}
}
}
And function above works not correct and produces not the same values like CPU.
The problem was specified on stackoverflow with following comment :
*(pInputSignal + i - j) is incorrect in case of SSE, because it's not an i-j offset away from current value, it's (i-j) * 4 . THe thing is,
as I remember it, the idea of using pointer that way is incorrect
unless intrinsics had changed since then - in my time one had to
"load" values into an instance of __m128 in this case, as H(J) and
X(I-J) are in unaligned location (and sequence breaks).
and
Since you care about individual floats and their order, probably best
to use const float*, with _mm_loadu_ps instead of just dereferencing
(which is like _mm_load_ps). That way you can easily do unaligned
loads that get the floats you want into the vector element positions
you want, and the pointer math works the same as for scalar. You just
have to take into account that load(ptr) actually gets you a vector of
elements from ptr+0..3.
But I can't use this information because have no idea how to properly access array with SSE in this case.
you need 128-bit float32 value , not msvc float.
see _mm_broadcast_ss
Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume number of buffers are <= 255, so that we can make the sum uint8)? Basically I want to know how many bits are set at the i'th position of each buffer.
Ex: For 2 buffers
uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1...
are there any BLAS or boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE:
Following is a naive impl of the requirement
for (auto&& buf: buffers){
for (int i = 0; i < buf_len; i++){
for (int b = 0; b < 8; ++b) {
sum[i*8 + b] += (buf[i] >> b) & 1;
}
}
}
An alternative to OP's naive code:
Perform 8 additions at once. Use a lookup table to expand the 8 bits to 8 bytes with each bit to a corresponding byte - see ones[].
void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
/* 249 more pre-computed constants */ 0x0101010101010101};
uint64_t sum[k];
memset(sum, 0, sizeof sum):
for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
for (size_t int i = 0; i < k; i++) {
sum[i] += ones(buf[buf_index][i]);
}
}
for (size_t int i = 0; i < k; i++) {
for (size_t bit = 0; bit < 8; bit++) {
printf("%llu ", 0xFF & (sum[i] >> (8*bit)));
}
}
}
See also #Eric Postpischil.
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>
typedef uint8_t vec8x8 __attribute__((vector_size(8)));
void sumit(uint8_t number_of_buf,
uint8_t k,
const uint8_t buf[number_of_buf][k],
vec8x8 * restrict sums) {
static const vec8x8 shift = {0,1,2,3,4,5,6,7};
for (size_t i = 0; i < k; i++) {
sums[i] = (vec8x8){0};
for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
sums[i] += (buf[buf_index][i] >> shift) & 1;
}
}
}
Try it on godbolt.
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
ld1r {v0.8b}, [x1] // replicate buf[buf_index][i] to all elements of v0
add x0, x0, 1
add x1, x1, x20
ushl v0.8b, v0.8b, v2.8b // shift
and v0.8b, v0.8b, v3.8b // mask
add v1.8b, v1.8b, v0.8b // accumulate
cmp x0, x19
bne .L4
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.
I'm starting with SIMD programming but i don't know what to do at this moment. I'm trying to diminish runtime but its doing it the other way.
This is my basic code:
https://codepaste.net/a8ut89
void blurr2(double * u, double * r) {
int i;
double dos[2] = { 2.0, 2.0 };
for (i = 0; i < SIZE - 1; i++) {
r[i] = u[i] + u[i + 1];
}
}
blurr2: 0.43s
int contarNegativos(double * u) {
int i;
int contador = 0;
for (i = 0; i < SIZE; i++) {
if (u[i] < 0) {
contador++;
}
}
return contador;
}
negativeCount: 1.38s
void ord(double * v, double * u, double * r) {
int i;
for (i = 0; i < SIZE; i += 2) {
r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
}
}
ord: 0.33
And this is my SIMD code:
https://codepaste.net/fbg1g5
void blurr2(double * u, double * r) {
__m128d rp2;
__m128d rdos;
__m128d rr;
int i;
int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
double dos[2] = { 2.0, 2.0 };
rdos = *(__m128d*)dos;
for (i = 0; i < sizeAux; i += 2) {
rp2 = *(__m128d*)&u[i + 1];
rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
*((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
}
}
blurr2: 0.42s
int contarNegativos(double * u) {
__m128d rcero;
__m128d rr;
int i;
double cero[2] = { 0.0, 0.0 };
int contador = 0;
rcero = *(__m128d*)cero;
for (i = 0; i < SIZE; i += 2) {
rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
if (((__int64 *)&rr)[0]) {
contador++;
};
if (((__int64 *)&rr)[1]) {
contador++;
};
}
return contador;
}
negativeCount: 1.42s
void ord(double * v, double * u, double * r) {
__m128d rr;
int i;
for (i = 0; i < SIZE; i += 2) {
*((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
}
}
ord: 0.35s
**Differents solutions.
Can you explain me what i'm doing wrong? I'm a bit lost...
Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. Your code is guaranteed to segfault on gcc/clang where __m128d is assumed to be aligned.
blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).
#include <immintrin.h>
#include <stdint.h>
static const size_t SIZE = 1024;
uint64_t countNegative(double * u) {
__m128i counts = _mm_setzero_si128();
for (size_t i = 0; i < SIZE; i += 2) {
__m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
}
//return counts[0] + counts[1]; // GNU C only, and less efficient
// horizontal sum
__m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
counts = _mm_add_epi64(counts, hi64);
uint64_t scalarcount = _mm_cvtsi128_si64(counts);
return scalarcount;
}
To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.
(source + asm on the Godbolt compiler explorer)
From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:
$LL4#countNegat:
movups xmm0, XMMWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+16]
cmpltpd xmm0, xmm2
psubq xmm1, xmm0
sub rax, 1
jne SHORT $LL4#countNegat
It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was actually branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.
Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointers onto SIMD vectors and expect it to work. If it does, you got lucky. Compilers usually generate similar code for that or intrinsics, and usually not worse for proper intrinsics.
Do you know that ord function with SIMD is not 1:1 to ord function without using SIMD instructions ?
In ord function without using SIMD, result of OR operation is calculated for even indexes
r[0] = v[0] | u[0],
r[2] = v[2] | u[2],
r[4] = v[4] | u[4]
what with odd indexes? maybe, if OR operations are calculated for all indexes, it will take more time than now.
I've recently found that my program spend most time in the following simple function:
void SumOfSquaredDifference(
const uint8_t * a, size_t aStride, const uint8_t * b, size_t bStride,
size_t width, size_t height, uint64_t * sum)
{
*sum = 0;
for(size_t row = 0; row < height; ++row)
{
int rowSum = 0;
for(size_t col = 0; col < width; ++col)
{
int d = a[col] - b[col];
rowSum += d*d;
}
*sum += rowSum;
a += aStride;
b += bStride;
}
}
This function finds a sum of squared difference of two 8-bit gray images.
I think that there is the way to improve its performance with using SSE, but I don't have an experience in this area.
Could anybody help me?
Of course, you can improve your code.
This an example of optimization of your function with using SSE2:
const __m128i Z = _mm_setzero_si128();
const size_t A = sizeof(__m128i);
inline __m128i SquaredDifference(__m128i a, __m128i b)
{
const __m128i aLo = _mm_unpacklo_epi8(a, Z);
const __m128i bLo = _mm_unpacklo_epi8(b, Z);
const __m128i dLo = _mm_sub_epi16(aLo, bLo);
const __m128i aHi = _mm_unpackhi_epi8(a, Z);
const __m128i bHi = _mm_unpackhi_epi8(b, Z);
const __m128i dHi = _mm_sub_epi16(aHi, bHi);
return _mm_add_epi32(_mm_madd_epi16(dLo, dLo), _mm_madd_epi16(dHi, dHi));
}
inline __m128i HorizontalSum32(__m128i a)
{
return _mm_add_epi64(_mm_unpacklo_epi32(a, Z), _mm_unpackhi_epi32(a, Z));
}
inline uint64_t ExtractSum64(__m128i a)
{
uint64_t _a[2];
_mm_storeu_si128((__m128i*)_a, a);
return _a[0] + _a[1];
}
void SumOfSquaredDifference(
const uint8_t *a, size_t aStride, const uint8_t *b, size_t bStride,
size_t width, size_t height, uint64_t * sum)
{
assert(width%A == 0 && width < 0x10000);
__m128i fullSum = Z;
for(size_t row = 0; row < height; ++row)
{
__m128i rowSum = Z;
for(size_t col = 0; col < width; col += A)
{
const __m128i a_ = _mm_loadu_si128((__m128i*)(a + col));
const __m128i b_ = _mm_loadu_si128((__m128i*)(b + col));
rowSum = _mm_add_epi32(rowSum, SquaredDifference(a_, b_));
}
fullSum = _mm_add_epi64(fullSum, HorizontalSum32(rowSum));
a += aStride;
b += bStride;
}
*sum = ExtractSum64(fullSum);
}
This example is a few simplified (it doesn't work if the image width isn't multiple of 16).
Full version of the algorithm is here.
And some magic from SSSE3 version:
const __m128i K_1FF = _mm_set1_epi16(0x1FF);
inline __m128i SquaredDifference(__m128i a, __m128i b)
{
const __m128i lo = _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), K_1FF);
const __m128i hi = _mm_maddubs_epi16(_mm_unpackhi_epi8(a, b), K_1FF);
return _mm_add_epi32(_mm_madd_epi16(lo, lo), _mm_madd_epi16(hi, hi));
}
The magic description (see _mm_maddubs_epi16):
K_1FF -> {-1, 1, -1, 1, ...};
_mm_unpacklo_epi8(a, b) -> {a0, b0, a1, b1, ...};
_mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), K_1FF) -> {b0 - a0, b1 - a1, ...};
GCC has switches that encourage it to vectorize the code. For example, the -mfma switch gives me about 25% speed increase on simple loops like this, using doubles. I imagine it's even better with 8-bit integers. I prefer that over hand-written optimizations because your code stays readable.
That said, there are a few old tricks that can speed up your loop:
Don't index, increment your pointer in every loop iteration. You do this in the outer loop, you should do the same in the inner loop. You can create a new pointer before going into the inner loop, so the +=stride stays valid.
Don't assign into the sum pointer inside your loop, use a local variable to accumulate and copy to the output when done. You use rowSum, but only in the inner loop. Use that variable across both loops instead.
How can I get sum elements (reduction) of float vector using sse intrinsics?
Simple serial code:
void(float *input, float &result, unsigned int NumElems)
{
result = 0;
for(auto i=0; i<NumElems; ++i)
result += input[i];
}
Typically you generate 4 partial sums in your loop and then just sum horizontally across the 4 elements after the loop, e.g.
#include <cassert>
#include <cstdint>
#include <emmintrin.h>
float vsum(const float *a, int n)
{
float sum;
__m128 vsum = _mm_set1_ps(0.0f);
assert((n & 3) == 0);
assert(((uintptr_t)a & 15) == 0);
for (int i = 0; i < n; i += 4)
{
__m128 v = _mm_load_ps(&a[i]);
vsum = _mm_add_ps(vsum, v);
}
vsum = _mm_hadd_ps(vsum, vsum);
vsum = _mm_hadd_ps(vsum, vsum);
_mm_store_ss(&sum, vsum);
return sum;
}
Note: for the above example a must be 16 byte aligned and n must be a multiple of 4. If the alignment of a can not be guaranteed then use _mm_loadu_ps instead of _mm_load_ps. If n is not guaranteed to be a multiple of 4 then add a scalar loop at the end of the function to accumulate any remaining elements.