efficient bitwise sum calculation - c++

Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume the number of buffers is <= 255, so that each sum fits in a uint8_t)? Basically I want to know, for each bit position, how many of the buffers have that bit set.
Ex: For 2 buffers
uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1...
are there any BLAS or boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE:
Following is a naive impl of the requirement
for (auto&& buf : buffers) {
    for (int i = 0; i < buf_len; i++) {
        for (int b = 0; b < 8; ++b) {
            sum[i*8 + b] += (buf[i] >> b) & 1;
        }
    }
}

An alternative to OP's naive code:
Perform 8 additions at once: use a lookup table to expand the 8 bits of each source byte into 8 bytes, one bit per byte - see ones[].
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
    static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
        /* 249 more pre-computed constants */ 0x0101010101010101};
    uint64_t sum[k];
    memset(sum, 0, sizeof sum);
    for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
        for (size_t i = 0; i < k; i++) {
            sum[i] += ones[buf[buf_index][i]];
        }
    }
    for (size_t i = 0; i < k; i++) {
        for (size_t bit = 0; bit < 8; bit++) {
            printf("%llu ", (unsigned long long)(0xFF & (sum[i] >> (8 * bit))));
        }
    }
}
See also Eric Postpischil.
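The elided ones[] entries follow a fixed pattern: entry b places bit j of b into byte j. A minimal sketch of a table generator (my illustration, not part of the original answer):

#include <stdint.h>

/* Hypothetical helper: fills ones[] so that ones[b] spreads the 8 bits of b
   into 8 bytes, one bit per byte (ones[3] == 0x101, ones[255] == 0x0101010101010101). */
static uint64_t ones[256];

static void init_ones(void) {
    for (unsigned b = 0; b < 256; b++) {
        uint64_t v = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            v |= (uint64_t)((b >> bit) & 1) << (8 * bit);
        ones[b] = v;
    }
}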

As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>
typedef uint8_t vec8x8 __attribute__((vector_size(8)));
void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}
Try it on godbolt.
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
        ld1r {v0.8b}, [x1]        // replicate buf[buf_index][i] to all elements of v0
        add  x0, x0, 1
        add  x1, x1, x20
        ushl v0.8b, v0.8b, v2.8b  // shift
        and  v0.8b, v0.8b, v3.8b  // mask
        add  v1.8b, v1.8b, v0.8b  // accumulate
        cmp  x0, x19
        bne  .L4
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
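One possible shape of that exercise, as an untested sketch using the same GCC vector extensions (sumit2 is my name, and k is assumed even for brevity; neither is from the original answer):

#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec16x8 __attribute__((vector_size(16)));

void sumit2(uint8_t number_of_buf,
            uint8_t k,                  // assumed even here
            const uint8_t buf[number_of_buf][k],
            vec16x8 * restrict sums) {  // sums[i/2] holds the counts for bytes i and i+1
    static const vec16x8 shift = {0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i += 2) {
        vec16x8 acc = (vec16x8){0};
        for (size_t b = 0; b < number_of_buf; b++) {
            uint8_t lo = buf[b][i], hi = buf[b][i + 1];
            // Replicate each source byte into its own 8-lane half,
            // then shift/mask/accumulate 16 lanes at once.
            vec16x8 pair = {lo,lo,lo,lo,lo,lo,lo,lo, hi,hi,hi,hi,hi,hi,hi,hi};
            acc += (pair >> shift) & 1;
        }
        sums[i / 2] = acc;
    }
}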
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.

Related

Processing of integers on different CPUs

My task is to design a function that fulfils those requirements:
Function shall sum members of a given one-dimensional array. However, it should sum only members whose number of ones in the binary representation is higher than a defined threshold (e.g. if the threshold is 4, the number 255 will be counted and 15 will not)
The array length is arbitrary
The function shall utilize as little memory as possible and shall be written in an efficient way
The production function code (‘sum_filtered(){..}’) shall not use any standard C library functions (or any other libraries)
The function shall return 0 on success and error code on error
The array elements are of type 16-bit signed integer, and an overflow during calculation shall be regarded as a failure
Use data types that ensure portability between different CPUs (so the calculations will be the same on 8/16/32-bit MCU)
The function code should contain a reasonable amount of comments in doxygen annotation
Here is my solution:
#include <iostream>
using namespace std;

int sum_filtered(short array[], int treshold)
{
    // return 1 if invalid input parameters
    if((treshold < 0) || (treshold > 16)){ return(1); }

    int sum = 0;
    int bitcnt = 0;
    for(int i = 0; i < sizeof(array); i++)
    {
        // Count one bits of integer
        bitcnt = 0;
        for (int pos = 0; pos < 16; pos++) { if (array[i] & (1 << pos)) { bitcnt++; } }
        // Add integer to sum if bitcnt > treshold
        if(bitcnt > treshold){ sum += array[i]; }
    }
    return(0);
}

int main()
{
    short array[5] = {15, 2652, 14, 1562, -115324};
    int result = sum_filtered(array, 14);
    cout << result << endl;

    short array2[5] = {15, 2652, 14, 1562, 15324};
    result = sum_filtered(array2, -2);
    cout << result << endl;
}
However, I'm not sure whether this code is portable between different CPUs.
I also don't know how an overflow can occur during the calculation, or what other errors could arise when processing arrays with this function.
Can somebody more experienced give me their opinion?
Well, I can foresee one problem:
for(int i=0; i < sizeof(array); i++)
array in this context is a pointer, so sizeof(array) will likely be 4 on 32-bit systems, or 8 on 64-bit systems. You really do want to be passing a count variable (in this case 5) into the sum_filtered function (the caller can compute it as sizeof(array) / sizeof(short)).
Anyhow, this code:
// Count one bits of integer
bitcnt = 0;
for (int pos = 0 ; pos < 16 ; pos++) {if (array[i] & (1 << pos)) {bitcnt++;}}
Effectively you are doing a popcount here (which can be done using __builtin_popcount on gcc/clang, or __popcnt on MSVC; they are compiler specific, but usually boil down to a single popcount CPU instruction on most CPUs).
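For example, the whole inner bit-counting loop collapses to a single call (GCC/Clang spelling; the cast keeps just the 16-bit pattern before the usual promotions):

int bitcnt = __builtin_popcount((uint16_t)array[i]);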
If you do want to do this the slow way, then an efficient approach is to treat the computation as a form of bitwise SIMD operation:
#include <cstdint> // or stdint.h if you have a rubbish compiler :)

uint16_t popcount(uint16_t s)
{
    // perform 8x 1bit adds
    uint16_t a0 = s & 0x5555;
    uint16_t b0 = (s >> 1) & 0x5555;
    uint16_t s0 = a0 + b0;

    // perform 4x 2bit adds
    uint16_t a1 = s0 & 0x3333;
    uint16_t b1 = (s0 >> 2) & 0x3333;
    uint16_t s1 = a1 + b1;

    // perform 2x 4bit adds
    uint16_t a2 = s1 & 0x0F0F;
    uint16_t b2 = (s1 >> 4) & 0x0F0F;
    uint16_t s2 = a2 + b2;

    // perform 1x 8bit add
    uint16_t a3 = s2 & 0x00FF;
    uint16_t b3 = (s2 >> 8) & 0x00FF;
    return a3 + b3;
}
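A quick sanity check of the function above (test values are mine, not from the original answer):

#include <cassert>

int main() {
    assert(popcount(0x0000) == 0);   // no bits set
    assert(popcount(0x8001) == 2);   // top and bottom bits
    assert(popcount(0xFFFF) == 16);  // all bits set
    return 0;
}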
I know it says you can't use stdlib functions (your 4th point), but surely that shouldn't apply to the standardised integer types (e.g. uint16_t)? If it does, then there is no way to guarantee portability across platforms, and you're out of luck.
Personally I'd just use a 64-bit integer for the sum. That should remove any realistic risk of overflow (i.e. if the threshold is zero and every value is -32768, the sum could only overflow once the array length exceeded 2^48 elements, which is 281,474,976,710,656 in decimal).
#include <cstdint>
#include <cstddef> // for size_t

int64_t sum_filtered(int16_t array[], uint16_t threshold, size_t array_length)
{
    // changing the type on threshold to be unsigned means we don't need to test
    // for negative numbers.
    if(threshold > 16) { return 1; }

    int64_t sum = 0;
    for(size_t i = 0; i < array_length; i++)
    {
        if (popcount(array[i]) > threshold)
        {
            sum += array[i];
        }
    }
    return sum;
}
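A hedged usage sketch of the reworked function (input values are mine; assumes the popcount above is in scope):

#include <cstdint>
#include <cstdio>

int main() {
    int16_t data[] = {15, 2652, 14, 1562, 15324};
    // popcounts: 4, 6, 3, 5, 10 -> with threshold 4, only 2652, 1562 and 15324 qualify
    long long s = sum_filtered(data, 4, sizeof data / sizeof data[0]);
    printf("%lld\n", s); // prints 19538
    return 0;
}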

SIMD Program slow runtime

I'm starting with SIMD programming, but I don't know what to do at this point. I'm trying to reduce the runtime, but it's going the other way.
This is my basic code:
https://codepaste.net/a8ut89
void blurr2(double * u, double * r) {
    int i;
    double dos[2] = { 2.0, 2.0 };
    for (i = 0; i < SIZE - 1; i++) {
        r[i] = u[i] + u[i + 1];
    }
}
blurr2: 0.43s
int contarNegativos(double * u) {
    int i;
    int contador = 0;
    for (i = 0; i < SIZE; i++) {
        if (u[i] < 0) {
            contador++;
        }
    }
    return contador;
}
negativeCount: 1.38s
void ord(double * v, double * u, double * r) {
    int i;
    for (i = 0; i < SIZE; i += 2) {
        r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
    }
}
ord: 0.33
And this is my SIMD code:
https://codepaste.net/fbg1g5
void blurr2(double * u, double * r) {
    __m128d rp2;
    __m128d rdos;
    __m128d rr;
    int i;
    int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
    double dos[2] = { 2.0, 2.0 };
    rdos = *(__m128d*)dos;
    for (i = 0; i < sizeAux; i += 2) {
        rp2 = *(__m128d*)&u[i + 1];
        rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
        *((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
    }
}
blurr2: 0.42s
int contarNegativos(double * u) {
    __m128d rcero;
    __m128d rr;
    int i;
    double cero[2] = { 0.0, 0.0 };
    int contador = 0;
    rcero = *(__m128d*)cero;
    for (i = 0; i < SIZE; i += 2) {
        rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
        if (((__int64 *)&rr)[0]) {
            contador++;
        }
        if (((__int64 *)&rr)[1]) {
            contador++;
        }
    }
    return contador;
}
negativeCount: 1.42s
void ord(double * v, double * u, double * r) {
    __m128d rr;
    int i;
    for (i = 0; i < SIZE; i += 2) {
        *((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
    }
}
ord: 0.35s
Different solutions are welcome.
Can you explain what I'm doing wrong? I'm a bit lost...
Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. Your code is guaranteed to segfault on gcc/clang where __m128d is assumed to be aligned.
blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
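For example, a minimal sketch of blurr2 along those lines (blurr2_fixed and the size parameter are my names; it keeps the averaging semantics of your SIMD version, which the scalar one lacks):

#include <emmintrin.h> // SSE2
#include <stddef.h>

void blurr2_fixed(const double *u, double *r, size_t size) {
    if (size == 0) return;
    const __m128d half = _mm_set1_pd(0.5);
    size_t i = 0;
    for (; i + 2 <= size - 1; i += 2) {
        __m128d a = _mm_loadu_pd(&u[i]);      // unaligned-safe loads
        __m128d b = _mm_loadu_pd(&u[i + 1]);
        _mm_storeu_pd(&r[i], _mm_mul_pd(_mm_add_pd(a, b), half)); // *0.5, not /2
    }
    for (; i < size - 1; i++)                 // scalar tail
        r[i] = (u[i] + u[i + 1]) * 0.5;
}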
negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static const size_t SIZE = 1024;

uint64_t countNegative(double * u) {
    __m128i counts = _mm_setzero_si128();
    for (size_t i = 0; i < SIZE; i += 2) {
        __m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
        counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
    }
    //return counts[0] + counts[1]; // GNU C only, and less efficient

    // horizontal sum
    __m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
    counts = _mm_add_epi64(counts, hi64);
    uint64_t scalarcount = _mm_cvtsi128_si64(counts);
    return scalarcount;
}
To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.
(source + asm on the Godbolt compiler explorer)
From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:
$LL4@countNegat:
        movups  xmm0, XMMWORD PTR [rcx]
        lea     rcx, QWORD PTR [rcx+16]
        cmpltpd xmm0, xmm2
        psubq   xmm1, xmm0
        sub     rax, 1
        jne     SHORT $LL4@countNegat
It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was actually branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.
Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointers onto SIMD vectors and expect it to work. If it does, you got lucky. Compilers usually generate similar code for that or intrinsics, and usually not worse for proper intrinsics.
Do you realize that your SIMD ord function is not 1:1 with the ord function that doesn't use SIMD instructions?
In the non-SIMD ord function, the result of the OR operation is calculated only for even indexes:
r[0] = v[0] | u[0],
r[2] = v[2] | u[2],
r[4] = v[4] | u[4]
What about the odd indexes? If the OR were calculated for all indexes, it would take more time than it does now.

Remove nth bit from buffer, and shift the rest

Given a uint8_t buffer of length x, I am trying to come up with a function or a macro that can remove the nth bit (or bits n to n+i), then left-shift the remaining bits.
example #1:
for input 0b76543210 0b76543210 ... the output should be 0b76543217 0b654321 ... (each digit labels an original bit position: bit 0 of every byte is removed and the remaining bits are packed left)
example #2: if the input is:
uint8_t input[8] = {
    0b00110011,
    0b00110011,
    ...
};
the output, without the first bit, should be
uint8_t output[8] = {
    0b00110010,
    0b01100100,
    ...
};
I have tried the following to remove the first bit, but it did not work for the second group of bits.
/* A macro to extract (a-b) range of bits without shifting */
#define BIT_RANGE(N,x,y) ((N) & ((0xff >> (7 - (y) + (x))) << ((x))))
void removeBit0(uint8_t *n) {
for (int i=0; i < 7; i++) {
n[i] = (BIT_RANGE(n[i], i + 1, 7)) << (i + 1) |
(BIT_RANGE(n[i + 1], 1, i + 1)) << (7 - i); /* This does not extract the next element bits */
}
n[7] = 0;
}
Update #1
In my case, the input will be a uint64_t number, and I will use memmove to shift it one place to the left.
Update #2
The solution can be in C/C++, assembly(x86-64) or inline assembly.
This is really 2 subproblems: remove bits from each byte, then pack the results. That is the flow of the code below. I wouldn't use a macro for this - too much is going on. Just inline the function if you're worried about performance at that level.
#include <stdio.h>
#include <stdint.h>

// Remove bits n to n+k-1 from x.
unsigned scrunch_1(unsigned x, int n, int k) {
    unsigned hi_bits = ~0u << n;
    return (x & ~hi_bits) | ((x >> k) & hi_bits);
}

// Remove bits n to n+k-1 from each byte in the buffer,
// then pack left. Return number of packed bytes.
size_t scrunch(uint8_t *buf, size_t size, int n, int k) {
    size_t i_src = 0, i_dst = 0;
    unsigned src_bits = 0; // Scrunched source bit buffer.
    int n_src_bits = 0;    // Initially it's empty.
    for (;;) {
        // Get scrunched bits until the buffer has at least 8.
        while (n_src_bits < 8) {
            if (i_src >= size) { // Done when source bytes exhausted.
                // If there are left-over bits, add one more byte to output.
                if (n_src_bits > 0) buf[i_dst++] = src_bits << (8 - n_src_bits);
                return i_dst;
            }
            // Pack 'em in.
            src_bits = (src_bits << (8 - k)) | scrunch_1(buf[i_src++], n, k);
            n_src_bits += 8 - k;
        }
        // Write the highest 8 bits of the buffer to the destination byte.
        n_src_bits -= 8;
        buf[i_dst++] = src_bits >> n_src_bits;
    }
}

int main(void) {
    uint8_t x[] = { 0xaa, 0xaa, 0xaa, 0xaa };
    size_t n = scrunch(x, 4, 2, 3);
    for (size_t i = 0; i < n; i++) {
        printf("%x ", x[i]);
    }
    printf("\n");
    return 0;
}
This writes b5 ad 60, which by my reckoning is correct. A few other test cases work as well.
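To check by hand: each 0xaa is 10101010; removing bits 2 to 4 (which hold 010) leaves the five bits 10110. Four bytes therefore yield the 20-bit stream 10110 10110 10110 10110, which regrouped MSB-first and zero-padded is 10110101 10101101 01100000, i.e. b5 ad 60.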
Oops, I coded it the first time shifting the wrong way, but I include that version here in case it's useful to someone.
#include <stdio.h>
#include <stdint.h>

// Remove bits n to n+k-1 from x.
unsigned scrunch_1(unsigned x, int n, int k) {
    unsigned hi_bits = 0xffu << n;
    return (x & ~hi_bits) | ((x >> k) & hi_bits);
}

// Remove bits n to n+k-1 from each byte in the buffer,
// then pack right. Return number of packed bytes.
size_t scrunch(uint8_t *buf, size_t size, int n, int k) {
    size_t i_src = 0, i_dst = 0;
    unsigned src_bits = 0; // Scrunched source bit buffer.
    int n_src_bits = 0;    // Initially it's empty.
    for (;;) {
        // Get scrunched bits until the buffer has at least 8.
        while (n_src_bits < 8) {
            if (i_src >= size) { // Done when source bytes exhausted.
                // If there are left-over bits, add one more byte to output.
                if (n_src_bits > 0) buf[i_dst++] = src_bits;
                return i_dst;
            }
            // Pack 'em in.
            src_bits |= scrunch_1(buf[i_src++], n, k) << n_src_bits;
            n_src_bits += 8 - k;
        }
        // Write the lower 8 bits of the buffer to the destination byte.
        buf[i_dst++] = src_bits;
        src_bits >>= 8;
        n_src_bits -= 8;
    }
}

int main(void) {
    uint8_t x[] = { 0xaa, 0xaa, 0xaa, 0xaa };
    size_t n = scrunch(x, 4, 2, 3);
    for (size_t i = 0; i < n; i++) {
        printf("%x ", x[i]);
    }
    printf("\n");
    return 0;
}
This writes d6 5a b. A few other test cases work as well.
Something similar to this should work:
template<typename S> void removeBit(S* buffer, size_t length, size_t index)
{
const size_t BITS_PER_UNIT = sizeof(S)*8;
// first we find which data unit contains the desired bit
const size_t unit = index / BITS_PER_UNIT;
// and which index has the bit inside the specified unit, starting counting from most significant bit
const size_t relativeIndex = (BITS_PER_UNIT - 1) - index % BITS_PER_UNIT;
// then we unset that bit
buffer[unit] &= ~(1 << relativeIndex);
// now we have to shift what's on the right by 1 position
// we create a mask such that if 0b00100000 is the bit removed we use 0b00011111 as mask to shift the rest
const S partialShiftMask = (1 << relativeIndex) - 1;
// now we keep all bits left to the removed one and we shift left all the others
buffer[unit] = (buffer[unit] & ~partialShiftMask) | ((buffer[unit] & partialShiftMask) << 1);
for (int i = unit+1; i < length; ++i)
{
//we set rightmost bit of previous unit according to last bit of current unit
buffer[i-1] |= buffer[i] >> (BITS_PER_UNIT-1);
// then we shift current unit by one
buffer[i] <<= 1;
}
}
I just tested it on some basic cases, so maybe something is not exactly correct, but this should move you onto the right track.
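A hedged usage sketch (the test values and expected output are my own check; note this removeBit removes a single bit from the whole stream, with the index counted from the most significant bit of buffer[0]):

#include <cstdint>
#include <cstdio>

int main() {
    uint8_t buf[2] = {0b00110011, 0b00110011};
    removeBit(buf, 2, 0);                    // remove the stream's most significant bit
    printf("%02x %02x\n", buf[0], buf[1]);   // prints 66 66 (stream shifted left by one)
    return 0;
}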

Generate all combinations in bit version

I'd like to generate all possible combinations (without repetitions) in bit representation. I can't use any library like boost or stl::next_combination - it has to be my own code (computation time is very important).
Here's my code (modified from another StackOverflow user's):
int combination = (1 << k) - 1;
int new_combination = 0;
int change = 0;
while (true)
{
    // return next combination
    cout << combination << endl;
    // find first index to update
    int indexToUpdate = k;
    while (indexToUpdate > 0 && GetBitPositionByNr(combination, indexToUpdate) >= n - k + indexToUpdate)
        indexToUpdate--;
    if (indexToUpdate == 1) change = 1; // move all bits to the left by one position
    if (indexToUpdate <= 0) break; // done
    // update combination indices
    new_combination = 0;
    for (int combIndex = GetBitPositionByNr(combination, indexToUpdate) - 1; indexToUpdate <= k; indexToUpdate++, combIndex++)
    {
        if (change)
        {
            new_combination |= (1 << (combIndex + 1));
        }
        else
        {
            combination = combination & (~(1 << combIndex));
            combination |= (1 << (combIndex + 1));
        }
    }
    if (change) combination = new_combination;
    change = 0;
}
where n is the total number of elements and k is the number of elements in a combination.
GetBitPositionByNr returns the position of the k-th set bit.
GetBitPositionByNr(13,2) = 3 because 13 is 1101 and its second set bit, counting 1-based from the least significant end, is at position 3.
It gives me correct output for n=4, k=2 which is:
0011 (3 - decimal representation - printed value)
0101 (5)
1001 (9)
0110 (6)
1010 (10)
1100 (12)
It also gives me correct output for k=1 and k=4, but gives me wrong output for k=3, which is:
0111 (7)
1011 (11)
1001 (9) - wrong, should be 1101 (13)
1110 (14)
I guess the problem is in the inner while condition (the second one), but I don't know how to fix it.
Maybe some of you know a better (faster) algorithm to achieve what I want? It can't use additional memory (arrays).
Here is code to run on ideone: IDEONE
When in doubt, use brute force: generate all variations with repetition, then filter out the unnecessary patterns:
#include <vector>

unsigned bit_count(unsigned n)
{
    unsigned i = 0;
    while (n) {
        i += n & 1;
        n >>= 1;
    }
    return i;
}

int main()
{
    std::vector<unsigned> combs;
    const unsigned N = 4;
    const unsigned K = 3;
    for (int i = 0; i < (1 << N); i++) {
        if (bit_count(i) == K) {
            combs.push_back(i);
        }
    }
    // and print 'combs' here
}
Edit: Someone else already pointed out a solution without filtering and brute force, but I'm still going to give you a few hints about this algorithm:
most compilers offer some sort of intrinsic population count function. I know of GCC and Clang, which have __builtin_popcount(). Using this intrinsic function, I was able to double the speed of the code.
Since you seem to be working on GPUs, you can parallelize the code. I have done it using C++11's standard threading facilities, and I've managed to count all 32-bit patterns with the arbitrarily-chosen popcounts 1, 16 and 19 in 7.1 seconds on my 8-core Intel machine.
Here's the final code I've written:
#include <vector>
#include <cstdio>
#include <thread>
#include <utility>
#include <future>

unsigned popcount_range(unsigned popcount, unsigned long min, unsigned long max)
{
    unsigned n = 0;
    for (unsigned long i = min; i < max; i++) {
        n += __builtin_popcount(i) == popcount;
    }
    return n;
}

int main()
{
    const unsigned N = 32;
    const unsigned K = 16;
    const unsigned N_cores = 8;
    const unsigned long Max = 1ul << N;
    const unsigned long N_per_core = Max / N_cores;

    std::vector<std::future<unsigned>> v;
    for (unsigned core = 0; core < N_cores; core++) {
        unsigned long core_min = N_per_core * core;
        unsigned long core_max = core_min + N_per_core;
        auto fut = std::async(
            std::launch::async,
            popcount_range,
            K,
            core_min,
            core_max
        );
        v.push_back(std::move(fut));
    }

    unsigned final_count = 0;
    for (auto &fut : v) {
        final_count += fut.get();
    }
    printf("%u\n", final_count);
    return 0;
}
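For reference, the filter-free enumeration alluded to in the edit above is commonly done with Gosper's hack, which computes the next larger integer with the same number of set bits. A minimal sketch (my code, assuming GCC/Clang for __builtin_ctz):

#include <stdio.h>

// Gosper's hack: next larger integer with the same popcount as v.
unsigned next_comb(unsigned v) {
    unsigned t = v | (v - 1);  // set all bits below the lowest set bit
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}

int main(void) {
    const unsigned n = 4, k = 3;
    for (unsigned v = (1u << k) - 1; v < (1u << n); v = next_comb(v))
        printf("%u ", v);      // prints 7 11 13 14
    printf("\n");
    return 0;
}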

how to use SSE to process array of ints, using a condition

I'm new to SSE, and limited in knowledge. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple.
I have an array of unsigned ints, and I only check for elements that are >= or <= some constants. As a result, I need an array with the elements that passed the condition.
I'm thinking of using _mm_cmpge_ps as a mask, but that construct works over floats, not ints! :(
Any suggestion or help is very much appreciated.
It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.
#include <emmintrin.h> // SSE2 intrinsics

for (int i = 0; i < N; i += 4)
{
    __m128i v = _mm_load_si128((const __m128i *)&a[i]);
    __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
    __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
    __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
    v = _mm_and_si128(v, vcmp);
    _mm_store_si128((__m128i *)&a[i], v);
}
Note that a needs to be 16 byte aligned and N needs to be a multiple of 4 - if these constraints are a problem then it's not too hard to extend the code to cope with this.
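A sketch of one way to lift both constraints (unaligned loads plus a scalar tail; clamp_range and the parameter names are mine, and like the loop above it assumes MIN_VAL - 1 and MAX_VAL + 1 don't overflow int):

#include <emmintrin.h> // SSE2

void clamp_range(int *a, int n, int min_val, int max_val) {
    const __m128i vlo = _mm_set1_epi32(min_val - 1);
    const __m128i vhi = _mm_set1_epi32(max_val + 1);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)&a[i]); // no alignment required
        __m128i keep = _mm_and_si128(_mm_cmpgt_epi32(v, vlo),
                                     _mm_cmplt_epi32(v, vhi));
        _mm_storeu_si128((__m128i *)&a[i], _mm_and_si128(v, keep));
    }
    for (; i < n; i++)  // scalar tail handles the n % 4 leftover elements
        if (a[i] < min_val || a[i] > max_val) a[i] = 0;
}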
Here you go - three functions.
The first function, foo_v1, is based on Paul R's answer.
The second function, foo_v2, is based on a popular question today: Fastest way to determine if an integer is between two integers (inclusive) with known sets of values.
The third function, foo_v3, uses Agner Fog's vectorclass, which I added only to show how much easier and cleaner it is to use his class. If you don't have the class, just comment out the #include "vectorclass.h" line and the foo_v3 function. I used Vec8ui, which means it will use AVX2 if available, and break the work into two Vec4ui otherwise, so you don't have to change your code to get the benefit of AVX2.
#include <stdio.h>
#include <nmmintrin.h> // SSE4.2
#include "vectorclass.h"

void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
        __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
        __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    //if ((unsigned)(number-lower) < (upper-lower))
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
        __m128i min_ab = _mm_min_epu32(dv, _mm_set1_epi32(MAX_VAL - MIN_VAL));
        __m128i vcmp = _mm_cmpeq_epi32(dv, min_ab);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    //if ((unsigned)(number-lower) < (upper-lower))
    for (int i = 0; i < N; i += 8) {
        Vec8ui va = Vec8ui().load(&a[i]);
        va &= (va - MIN_VAL) <= (MAX_VAL - MIN_VAL);
        va.store(&a[i]);
    }
}

int main() {
    const int N = 16;
    int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    foo_v2(N, a, 7, 3);
    for (int i = 0; i < N; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    _mm_free(a);
}
First place to look might be Intel® Intrinsics Guide