Fast search/replace of matching single bytes in an 8-bit array, on ARM - c++

I develop image processing algorithms (using GCC, targeting ARMv7 (Raspberry Pi 2B)).
In particular I use a simple algorithm which replaces one index value with another in a mask:
void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    for(size_t i = 0; i < size; ++i)
    {
        if(mask[i] == oldIndex)
            mask[i] = newIndex;
    }
}
Unfortunately it has poor performance for the target platform.
Is there any way to optimize it?

The ARMv7 platform supports SIMD instructions called NEON.
Using them, you can make your code faster:
#include <arm_neon.h>
void ChangeIndex(uint8_t * mask, size_t size, uint8_t oldIndex, uint8_t newIndex)
{
    size_t alignedSize = size/16*16, i = 0;
    uint8x16_t _oldIndex = vdupq_n_u8(oldIndex);
    uint8x16_t _newIndex = vdupq_n_u8(newIndex);
    for(; i < alignedSize; i += 16)
    {
        uint8x16_t oldMask = vld1q_u8(mask + i); // loading of 128-bit vector
        uint8x16_t condition = vceqq_u8(oldMask, _oldIndex); // compare two 128-bit vectors
        uint8x16_t newMask = vbslq_u8(condition, _newIndex, oldMask); // selective copying of 128-bit vector
        vst1q_u8(mask + i, newMask); // saving of 128-bit vector
    }
    for(; i < size; ++i)
    {
        if(mask[i] == oldIndex)
            mask[i] = newIndex;
    }
}
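For reference, a minimal usage sketch, assuming the ChangeIndex above is visible; the image dimensions and index values are made up, and on ARMv7 you typically need to compile with NEON enabled (e.g. something like g++ -O2 -mfpu=neon, exact flags depend on your toolchain):
#include <vector>
#include <cstdint>

int main()
{
    std::vector<uint8_t> mask(640 * 480, 3); // hypothetical 640x480 index mask filled with label 3
    ChangeIndex(mask.data(), mask.size(), 3, 7); // relabel every occurrence of 3 as 7
}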

Related

What is the best way to loop AVX for an uneven, non-aligned array?

If the array size is not divisible by 8 (for 32-bit integers), what is the best way to write the loop for it? The only way I have figured out so far is to split it into 2 separate loops: a main loop for almost all elements, and a tail loop with maskload/maskstore for the remaining 1-7 elements. But that doesn't look like the best way.
for (auto i = 0; i < vec.size() - 8; i += 8) {
    __m256i va = _mm256_loadu_si256((__m256i*) &vec[i]);
    //do some work
    _mm256_storeu_si256((__m256i*) &vec[i], va);
}
for (auto i = vec.size() - vec.size() % 8; i < vec.size(); i += 8) {
    auto tmp = (vec.size() % 8) + 1;
    char chArr[8] = {};
    for (auto j = 0; j < 8; ++j) {
        chArr[j] -= --tmp;
    }
    __m256i mask = _mm256_setr_epi32(chArr[0], chArr[1], chArr[2], chArr[3],
                                     chArr[4], chArr[5], chArr[6], chArr[7]);
    __m256i va = _mm256_maskload_epi32(&vec[i], mask);
    //do some work
    _mm256_maskstore_epi32(&vec[i], mask, va);
}
Could it be made to look better without hurting performance? Just removing the second for-loop in favour of a single load doesn't help much, because it only saves 1 line out of a dozen.
If I put maskload/maskstore in the main loop it slows it down significantly. There is also no maskloadu/maskstoreu, so I can't use this for an unaligned array.
To expand on Yves' idea of prebuilding masks, here is one way to structure it:
#include <vector>
#include <immintrin.h>
void foo(std::vector<int>& vec)
{
    std::size_t size = vec.size();
    int* data = vec.data();
    std::size_t i;
    for(i = 0; i + 8 <= size; i += 8) {
        __m256i va = _mm256_loadu_si256((__m256i*) (data + i));
        asm volatile ("" : : : "memory"); // more work here
        _mm256_storeu_si256((__m256i*) (data + i), va);
    }
    static const int maskarr[] = {
        -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0
    };
    if(i < size) {
        __m256i mask = _mm256_loadu_si256((const __m256i*)(maskarr + (i + 8 - size)));
        __m256i va = _mm256_maskload_epi32(data + i, mask);
        asm volatile ("" : : : "memory"); // more work here
        _mm256_maskstore_epi32(data + i, mask, va);
    }
}
A few notes:
As mentioned in my comment, i + 8 <= vec.size() is safer as it avoids a possible wrap-around if vec.size() is 7 or lower
Use size_t or ptrdiff_t instead of int for such loop counters
The if to skip over the last part is important. Masked memory operations with an all-zero mask are very slow
The static mask array can be slimmed by two elements since we know we never access an all-filled or all-zero mask array
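For illustration, the trimmed variant from that last note might look roughly like this (untested sketch; only the tail part of foo changes):
// 14 entries suffice because the all-ones and all-zeros windows are never used
static const int maskarr[] = {
    -1, -1, -1, -1, -1, -1, -1,
     0,  0,  0,  0,  0,  0,  0
};
if(i < size) {
    // size - i is between 1 and 7 here, so the offset i + 7 - size stays within 0..6
    __m256i mask = _mm256_loadu_si256((const __m256i*)(maskarr + (i + 7 - size)));
    __m256i va = _mm256_maskload_epi32(data + i, mask);
    asm volatile ("" : : : "memory"); // more work here
    _mm256_maskstore_epi32(data + i, mask, va);
}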

efficient bitwise sum calculation

Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume the number of buffers is <= 255, so that the sums fit in a uint8)? Basically I want to know, for the i'th bit position, how many of the buffers have that bit set.
Ex: For 2 buffers
uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1...
Are there any BLAS or boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE:
The following is a naive implementation of the requirement:
for (auto&& buf: buffers){
    for (int i = 0; i < buf_len; i++){
        for (int b = 0; b < 8; ++b) {
            sum[i*8 + b] += (buf[i] >> b) & 1;
        }
    }
}
An alternative to the OP's naive code: perform 8 additions at once. Use a lookup table to expand the 8 bits into 8 bytes, mapping each bit to a corresponding byte - see ones[].
#include <stdint.h>
#include <string.h>
#include <stdio.h>

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
    static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
        /* 249 more pre-computed constants */ 0x0101010101010101};
    uint64_t sum[k];
    memset(sum, 0, sizeof sum);
    for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
        for (size_t i = 0; i < k; i++) {
            sum[i] += ones[buf[buf_index][i]];
        }
    }
    for (size_t i = 0; i < k; i++) {
        for (size_t bit = 0; bit < 8; bit++) {
            printf("%llu ", (unsigned long long)(0xFF & (sum[i] >> (8*bit))));
        }
    }
}
See also @Eric Postpischil.
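The ones[] table doesn't have to be written out by hand; a sketch of how it could be generated once at startup (it would then live at file scope or be passed in, instead of the static const array inside sumit):
uint64_t ones[256];
for (int i = 0; i < 256; i++) {
    uint64_t v = 0;
    for (int b = 0; b < 8; b++)
        if (i & (1 << b))
            v |= (uint64_t)1 << (8 * b); // bit b of the index becomes byte b of the entry
    ones[i] = v;
}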
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x8 __attribute__((vector_size(8)));

void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}
Try it on godbolt.
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
        ld1r    {v0.8b}, [x1]         // replicate buf[buf_index][i] to all elements of v0
        add     x0, x0, 1
        add     x1, x1, x20
        ushl    v0.8b, v0.8b, v2.8b   // shift
        and     v0.8b, v0.8b, v3.8b   // mask
        add     v1.8b, v1.8b, v0.8b   // accumulate
        cmp     x0, x19
        bne     .L4
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.
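For what it's worth, here is a rough sketch of that unrolled variant (untested; the function name sumit16 and the flat row-major buf parameter are my own choices, and an odd trailing byte could reuse the 8-byte version above):
#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec16x8 __attribute__((vector_size(16)));

// sums[i/2] accumulates the per-bit counts for bytes i and i+1
void sumit16(size_t number_of_buf, size_t k,
             const uint8_t *buf,              // number_of_buf rows of k bytes, row-major
             vec16x8 * __restrict__ sums) {
    static const vec16x8 shift = {0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7};
    static const vec16x8 one   = {1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1};
    for (size_t i = 0; i + 1 < k; i += 2) {
        vec16x8 acc = {0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0};
        for (size_t b = 0; b < number_of_buf; b++) {
            uint8_t lo = buf[b * k + i], hi = buf[b * k + i + 1];
            // broadcast byte i into the low 8 lanes and byte i+1 into the high 8 lanes
            vec16x8 bytes = {lo, lo, lo, lo, lo, lo, lo, lo,
                             hi, hi, hi, hi, hi, hi, hi, hi};
            acc += (bytes >> shift) & one;
        }
        sums[i / 2] = acc;
    }
}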

crc32 with lookup table

Currently, the hardware built-ins (__builtin_ia32_crc32qi and __builtin_ia32_crc32di) are used for crc32, with __builtin_ia32_crc32di returning 64 bits that are then truncated to 32 bits. Existing data is based on this logic.
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/X86-Built-in-Functions.html
uint32_t calculateCrc32(uint32_t init, const uint8_t* buf, size_t size) {
    uint32_t crc32 = init;
    const uint8_t* pos = buf;
    const uint8_t* end = buf + size;
    // byte-wise crc
    while (((uint64_t)pos) % sizeof(uint64_t) && pos < end) {
        crc32 = __builtin_ia32_crc32qi(crc32, *pos);
        ++pos;
    }
    // 8-bytes-wise
    while (((uint64_t)pos) <
           (((uint64_t)end) / sizeof(uint64_t)) * sizeof(uint64_t)) {
        crc32 = __builtin_ia32_crc32di(crc32, *(uint64_t*)pos);
        pos += sizeof(uint64_t);
    }
    // byte-wise crc for remaining
    while (pos < end) {
        crc32 = __builtin_ia32_crc32qi(crc32, *pos);
        ++pos;
    }
    return crc32;
}
I am trying to implement a lookup-table version. What I am doing is: 1) first generate a lookup table, 2) then do the table lookup:
uint8_t kCrc32tab[256];

for (int i=0; i < 256; ++i) {
    uint8_t buf = i;
    kCrc32tab[i] = calculateCrc32(0xFF, &buf, 1);
}

uint32_t crc32WithLookup(uint32_t crc32_init, const uint8_t* buf, size_t size) {
    uint32_t crc32 = crc32_init;
    for (std::size_t i = 0; i < size; i++) {
        uint8_t key = (crc32 ^ buf[i]) & 0xFF;
        crc32 = kCrc32tab[key] ^ (crc32 >> 8);
    }
    return crc32;
}
However, the crc32 outcome differs between crc32WithLookup and calculateCrc32. Any suggestions?
lookup example in redis:
https://github.com/redis/redis/blob/unstable/src/crc16.c
That CRC-32 is commonly referred to as the CRC-32C (where outside the provided code the initial value and final exclusive-or is 0xffffffff).
There are two errors in your code. The table must be 32-bit values, and the initial value for your CRCs is zero. So you need uint32_t kCrc32tab[256]; and kCrc32tab[i] = calculateCrc32(0, &buf, 1);.
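With those two fixes applied, the table generation would look something like this (a sketch based on the question's own code):
uint32_t kCrc32tab[256];

for (int i = 0; i < 256; ++i) {
    uint8_t buf = i;
    kCrc32tab[i] = calculateCrc32(0, &buf, 1); // 32-bit entries, zero initial value
}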
This answer provides more advanced and faster code for both the hardware and software versions of that CRC calculation.

Looking for a short value in a C++ array, fast SIMD version

I have an algorithm in my program which works fine. I wonder whether it's possible to speed things up a little bit:
unsigned short c;
bool found = false;
unsigned short* arrIterator = arr;
while(( c = *arrIterator & mask) != stopValue)
{
    if(c == next)
    {
        found = true;
        break;
    }
    arrIterator++;
}
Is it possible to rewrite such an algorithm with SIMD instructions?
Assuming arr is 16-byte aligned (make it so), you could do something like this (not tested):
__m128i vstop = _mm_set1_epi16(stopValue);
__m128i vnext = _mm_set1_epi16(next);
int found_mask = 0;
int stop_mask = 0;
do
{
    __m128i data = _mm_load_si128((const __m128i*)arrIterator); // load 8 shorts
    arrIterator += 8;
    __m128i contains_next = _mm_cmpeq_epi16(data, vnext);
    __m128i contains_stop = _mm_cmpeq_epi16(data, vstop);
    found_mask = _mm_movemask_epi8(contains_next);
    stop_mask = found_mask | _mm_movemask_epi8(contains_stop);
} while (stop_mask == 0);
You can then tell the index where next was found by doing a bitscan over found_mask and some stuff with the current value of the iterator.
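For example, a sketch of that last step, reusing arr and found from the question's code and assuming arrIterator is advanced by 8 elements per load as above (it ignores the corner case where the stop value precedes the match inside the same 8-element block):
if (found_mask != 0)
{
    found = true;
    // _mm_movemask_epi8 yields one bit per byte, so each 16-bit element contributes two bits
    int lane = __builtin_ctz(found_mask) / 2;
    // arrIterator already points one block past the data that was just tested
    size_t index = (size_t)(arrIterator - arr) - 8 + lane;
}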

how to use SSE to process array of ints, using a condition

I'm new to SSE, and limited in knowledge. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple.
I have an array of unsigned ints, and I only check for elements that are >= or <= some constants. As a result, I need an array with the elements that passed the condition.
I'm thinking of using _mm_cmpge_ps as a mask, but this construct works over floats, not ints!? :(
Any suggestion or help is very much appreciated.
It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.
#include <emmintrin.h> // SSE2 intrinsics
for (int i = 0; i < N; i += 4)
{
    __m128i v = _mm_load_si128((__m128i*)&a[i]);
    __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
    __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
    __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
    v = _mm_and_si128(v, vcmp);
    _mm_store_si128((__m128i*)&a[i], v);
}
Note that a needs to be 16 byte aligned and N needs to be a multiple of 4 - if these constraints are a problem then it's not too hard to extend the code to cope with this.
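For example, the multiple-of-4 requirement can be relaxed with a scalar cleanup loop for the tail (a sketch; for an unaligned array you could likewise switch the loads/stores above to _mm_loadu_si128/_mm_storeu_si128):
for (int i = N - (N % 4); i < N; ++i)
{
    if (a[i] < MIN_VAL || a[i] > MAX_VAL)
        a[i] = 0; // same masking effect as the vector loop, element by element
}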
Here you go. Here are three functions.
The first function, foo_v1, is based on Paul R's answer.
The second function, foo_v2, is based on a popular question today: Fastest way to determine if an integer is between two integers (inclusive) with known sets of values.
The third function, foo_v3, uses Agner Fog's vectorclass, which I added only to show how much easier and cleaner it is to use his class. If you don't have the class, just comment out the #include "vectorclass.h" line and the foo_v3 function. I used Vec8ui, which means it will use AVX2 if available and break the operation into two Vec4ui otherwise, so you don't have to change your code to get the benefit of AVX2.
#include <stdio.h>
#include <nmmintrin.h> // SSE4.2
#include "vectorclass.h"

void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
        __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
        __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    //if ((unsigned)(number-lower) < (upper-lower))
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
        __m128i min_ab = _mm_min_epu32(dv, _mm_set1_epi32(MAX_VAL-MIN_VAL));
        __m128i vcmp = _mm_cmpeq_epi32(dv, min_ab);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    //if ((unsigned)(number-lower) < (upper-lower))
    for (int i = 0; i < N; i += 8) {
        Vec8ui va = Vec8ui().load(&a[i]);
        va &= (va - MIN_VAL) <= (MAX_VAL-MIN_VAL);
        va.store(&a[i]);
    }
}
int main() {
    const int N = 16;
    int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
    for(int i=0; i<N; i++) {
        a[i] = i;
    }
    foo_v2(N, a, 7, 3);
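    // with MIN_VAL = 3 and MAX_VAL = 7, elements outside 3..7 are zeroed,
    // so this should print: 0 0 0 3 4 5 6 7 0 0 0 0 0 0 0 0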
    for(int i=0; i<N; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
    _mm_free(a);
}
The first place to look might be the Intel® Intrinsics Guide.