How to speed up bit testing - c++

I'm wondering how to speed up bit testing in the following routine:
void histSubtractFromBits(uint64* cursor, uint16* hist){
//traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
std::bitset<64> a(*cursor);
std::bitset<64> b(*(cursor+1));
std::bitset<64> c(*(cursor+2));
std::bitset<64> d(*(cursor+3));
for(int bit = 0; bit < 64; bit++){
hist[bit] -= a.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+64] -= b.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+128] -= c.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+192] -= d.test(bit);
}
}
The actual gcc implementation does a range-check for the bit argument, then &-s with a bitmask. I could do it without the bitsets and with my own bit-shifting / masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong and why).
I'm not really familiar with the x86-64 assembly, but I am aware of a certain bit test instruction, and I am aware that it's theoretically possible to do inline assembly with gcc.
1) Do you think it at all worthwhile to write an inline-assembly analogue for the above code?
2) If yes, then how would I go about doing it, i.e. could you show me some basic starter code / samples to point me in the right direction?

As far as I can tell, you basically iterate over each bit. As such, I'd imagine simply shifting and masking off the LSB every time should provide good performance. Something like:
uint64_t a = *cursor;
for(int bit = 0; a != 0; bit++, a >>= 1) {
hist[bit] -= (a & 1);
}
Alternatively, if you expect only very few bits to be set and are happy with gcc-specific stuff, you could use __builtin_ffsll:
uint64_t a = *cursor;
int next;
for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
bit += next;
hist[bit - 1] -= 1;
a = (a >> (next - 1)) >> 1; // shift in two steps: a single shift by 64 (when next == 64) would be undefined
}
The idea should be fine, but no warranty for the actual code :)
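A minimal sketch of the same sparse-bits idea, assuming GCC or Clang (for `__builtin_ctzll`). The name `histSubtractSetBits` and the `base` parameter are made up for illustration. Instead of shifting, it clears the lowest set bit on each pass, so no shift count can ever reach 64:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: subtracts 1 from hist[base + i] for every set bit i of word.
inline void histSubtractSetBits(uint64_t word, uint16_t* hist, int base) {
    assert(hist != nullptr);
    while (word != 0) {
        int bit = __builtin_ctzll(word); // index of the lowest set bit (GCC/Clang builtin)
        hist[base + bit] -= 1;
        word &= word - 1;                // clear that bit and continue with the rest
    }
}
```

Calling it four times with `base` set to 0, 64, 128 and 192 would cover the 256-bit string from the question. Whether it beats the plain shift loop depends on how sparse the bits are, so it is worth measuring.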
Update: code using vector extensions:
typedef short v8hi __attribute__ ((vector_size (16)));
static v8hi table[256];
void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
{
uint8_t* cursor_tmp = (uint8_t*)cursor;
v8hi* hist_tmp = (v8hi*)hist;
for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
{
*hist_tmp -= table[*cursor_tmp];
}
}
void setup_table()
{
for(int i = 0; i < 256; i++)
{
for(int j = 0; j < 8; j++)
{
table[i][j] = (i >> j) & 1;
}
}
}
This will be compiled to SSE instructions if available, for example I get:
leaq 32(%rdi), %rdx
.p2align 4,,10
.p2align 3
.L2:
movzbl (%rdi), %eax
addq $1, %rdi
movdqa (%rsi), %xmm0
salq $4, %rax
psubw table(%rax), %xmm0
movdqa %xmm0, (%rsi)
addq $16, %rsi
cmpq %rdx, %rdi
jne .L2
Of course this approach relies on the table being in cache.
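If the table lookup turns out to be the problem, a table-free variant is possible. The following is only a sketch, assuming SSE2 and a little-endian x86 target; the function name `histSubtractFromBitsSse2` is hypothetical and I have not timed it against the table version. It broadcasts each byte, isolates one bit per 16-bit lane, and adds the resulting 0 / -1 mask to the histogram:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>

// Hypothetical table-free variant; unaligned loads/stores, so hist needs no special alignment.
void histSubtractFromBitsSse2(const uint64_t* cursor, uint16_t* hist) {
    assert(cursor != nullptr && hist != nullptr);
    uint8_t bytes[32];
    std::memcpy(bytes, cursor, sizeof bytes); // little-endian: byte i holds bits 8*i .. 8*i+7
    const __m128i bitMask = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
    for (int i = 0; i < 32; ++i) {
        __m128i v = _mm_set1_epi16(bytes[i]);
        // 0xFFFF (i.e. -1) in every lane whose bit is set, 0 elsewhere
        __m128i isSet = _mm_cmpeq_epi16(_mm_and_si128(v, bitMask), bitMask);
        __m128i* h = reinterpret_cast<__m128i*>(hist + 8 * i);
        // adding -1 subtracts one from the matching histogram bins
        _mm_storeu_si128(h, _mm_add_epi16(_mm_loadu_si128(h), isSet));
    }
}
```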

Another suggestion is to combine data caching, registers and loop unrolling:
// Assuming your processor has 64-bit words
// Needs <climits> for CHAR_BIT and <cstdint> for the fixed-width types
void histSubtractFromBits(uint64_t const * cursor, uint16_t* hist)
{
register uint64_t a = *cursor++;
register uint64_t b = *cursor++;
register uint64_t c = *cursor++;
register uint64_t d = *cursor++;
register unsigned int i = 0;
for (i = 0; i < (sizeof(*cursor) * CHAR_BIT); ++i)
{
hist[i + 0] -= a & 1;
hist[i + 64] -= b & 1;
hist[i + 128] -= c & 1;
hist[i + 192] -= d & 1;
a >>= 1;
b >>= 1;
c >>= 1;
d >>= 1;
}
}
I'm not sure if you gain any more performance by reordering the instructions like this:
hist[i + 0] -= a & 1;
a >>= 1;
You could try both ways and compare the assembly language for both.
One of the ideas here is to maximize the register usage. The values to test are loaded into registers and then the testing begins.

Related

efficient bitwise sum calculation

Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume number of buffers are <= 255, so that we can make the sum uint8)? Basically I want to know how many bits are set at the i'th position of each buffer.
Ex: For 2 buffers
uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1...
are there any BLAS or boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE:
Following is a naive impl of the requirement
for (auto&& buf: buffers){
for (int i = 0; i < buf_len; i++){
for (int b = 0; b < 8; ++b) {
sum[i*8 + b] += (buf[i] >> b) & 1;
}
}
}
An alternative to OP's naive code:
Perform 8 additions at once. Use a lookup table to expand the 8 bits to 8 bytes with each bit to a corresponding byte - see ones[].
void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
/* 249 more pre-computed constants */ 0x0101010101010101};
uint64_t sum[k];
memset(sum, 0, sizeof sum);
for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
for (size_t i = 0; i < k; i++) {
sum[i] += ones[buf[buf_index][i]];
}
}
for (size_t i = 0; i < k; i++) {
for (size_t bit = 0; bit < 8; bit++) {
printf("%llu ", (unsigned long long)(0xFF & (sum[i] >> (8*bit))));
}
}
}
See also @Eric Postpischil.
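The lookup table does not have to be typed out by hand. As a sketch (the helper name `makeOnes` is hypothetical), it can be generated once at startup by spreading bit b of each value into byte b of a 64-bit word:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: entry v has byte b equal to bit b of v.
std::array<uint64_t, 256> makeOnes() {
    std::array<uint64_t, 256> ones{};
    for (unsigned v = 0; v < 256; ++v) {
        uint64_t spread = 0;
        for (unsigned b = 0; b < 8; ++b) {
            spread |= static_cast<uint64_t>((v >> b) & 1u) << (8 * b);
        }
        ones[v] = spread;
    }
    // sanity check against the last constant of the hand-written table
    assert(ones[255] == 0x0101010101010101ULL);
    return ones;
}
```

Because every byte lane counts one bit position, this only works while no lane exceeds 255, which matches the limit on the number of buffers stated in the question.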
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>
typedef uint8_t vec8x8 __attribute__((vector_size(8)));
void sumit(uint8_t number_of_buf,
uint8_t k,
const uint8_t buf[number_of_buf][k],
vec8x8 * restrict sums) {
static const vec8x8 shift = {0,1,2,3,4,5,6,7};
for (size_t i = 0; i < k; i++) {
sums[i] = (vec8x8){0};
for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
sums[i] += (buf[buf_index][i] >> shift) & 1;
}
}
}
Try it on godbolt.
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
ld1r {v0.8b}, [x1] // replicate buf[buf_index][i] to all elements of v0
add x0, x0, 1
add x1, x1, x20
ushl v0.8b, v0.8b, v2.8b // shift
and v0.8b, v0.8b, v3.8b // mask
add v1.8b, v1.8b, v0.8b // accumulate
cmp x0, x19
bne .L4
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.

SIMD Program slow runtime

I'm starting out with SIMD programming, but I don't know what to do at this point. I'm trying to reduce the runtime, but it's going the other way.
This is my basic code:
https://codepaste.net/a8ut89
void blurr2(double * u, double * r) {
int i;
double dos[2] = { 2.0, 2.0 };
for (i = 0; i < SIZE - 1; i++) {
r[i] = u[i] + u[i + 1];
}
}
blurr2: 0.43s
int contarNegativos(double * u) {
int i;
int contador = 0;
for (i = 0; i < SIZE; i++) {
if (u[i] < 0) {
contador++;
}
}
return contador;
}
negativeCount: 1.38s
void ord(double * v, double * u, double * r) {
int i;
for (i = 0; i < SIZE; i += 2) {
r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
}
}
ord: 0.33s
And this is my SIMD code:
https://codepaste.net/fbg1g5
void blurr2(double * u, double * r) {
__m128d rp2;
__m128d rdos;
__m128d rr;
int i;
int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
double dos[2] = { 2.0, 2.0 };
rdos = *(__m128d*)dos;
for (i = 0; i < sizeAux; i += 2) {
rp2 = *(__m128d*)&u[i + 1];
rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
*((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
}
}
blurr2: 0.42s
int contarNegativos(double * u) {
__m128d rcero;
__m128d rr;
int i;
double cero[2] = { 0.0, 0.0 };
int contador = 0;
rcero = *(__m128d*)cero;
for (i = 0; i < SIZE; i += 2) {
rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
if (((__int64 *)&rr)[0]) {
contador++;
};
if (((__int64 *)&rr)[1]) {
contador++;
};
}
return contador;
}
negativeCount: 1.42s
void ord(double * v, double * u, double * r) {
__m128d rr;
int i;
for (i = 0; i < SIZE; i += 2) {
*((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
}
}
ord: 0.35s
**Different solutions.
Can you explain what I'm doing wrong? I'm a bit lost...
Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. Your code is guaranteed to segfault on gcc/clang where __m128d is assumed to be aligned.
blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).
#include <immintrin.h>
#include <stdint.h>
static const size_t SIZE = 1024;
uint64_t countNegative(double * u) {
__m128i counts = _mm_setzero_si128();
for (size_t i = 0; i < SIZE; i += 2) {
__m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
}
//return counts[0] + counts[1]; // GNU C only, and less efficient
// horizontal sum
__m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
counts = _mm_add_epi64(counts, hi64);
uint64_t scalarcount = _mm_cvtsi128_si64(counts);
return scalarcount;
}
To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.
(source + asm on the Godbolt compiler explorer)
From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:
$LL4@countNegat:
movups xmm0, XMMWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+16]
cmpltpd xmm0, xmm2
psubq xmm1, xmm0
sub rax, 1
jne SHORT $LL4@countNegat
It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was actually branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.
Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointers onto SIMD vectors and expect it to work. If it does, you got lucky. Compilers usually generate similar code for that or intrinsics, and usually not worse for proper intrinsics.
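For blurr2, the first two points (unaligned loads, multiplying by 0.5) could look roughly like this. It is a sketch only: it assumes the intended result is the average of neighbouring elements, as in the SIMD version from the question, and it takes the length as a parameter `n` instead of the global SIZE:

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>

// Writes r[i] = (u[i] + u[i+1]) * 0.5 for i in [0, n-1).
void blurr2(const double* u, double* r, std::size_t n) {
    assert(u != nullptr && r != nullptr);
    const __m128d half = _mm_set1_pd(0.5);
    std::size_t i = 0;
    // Two outputs per iteration; this reads u[i], u[i+1] and u[i+2].
    for (; i + 2 < n; i += 2) {
        __m128d lo = _mm_loadu_pd(&u[i]);     // unaligned load, no alignment assumption
        __m128d hi = _mm_loadu_pd(&u[i + 1]); // never 16-byte aligned if &u[i] is
        _mm_storeu_pd(&r[i], _mm_mul_pd(_mm_add_pd(lo, hi), half));
    }
    // Scalar tail for the last output, if any.
    for (; i + 1 < n; ++i) {
        r[i] = (u[i] + u[i + 1]) * 0.5;
    }
}
```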
Are you aware that the SIMD version of ord is not equivalent to the version without SIMD instructions?
In the non-SIMD ord, the result of the OR operation is calculated only for even indexes:
r[0] = v[0] | u[0],
r[2] = v[2] | u[2],
r[4] = v[4] | u[4]
What about the odd indexes? If the OR were calculated for all indexes, the scalar version would probably take more time than it does now.
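To compare like with like, both versions should process every index. A sketch of a SIMD ord that does so, with a scalar tail that uses memcpy instead of pointer casts to stay clear of the aliasing problem (the name `ordAll` and the length parameter `n` are assumptions, not from the question):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>

// Bitwise OR of the raw bit patterns of v[i] and u[i], for every i in [0, n).
void ordAll(const double* v, const double* u, double* r, std::size_t n) {
    assert(v != nullptr && u != nullptr && r != nullptr);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        _mm_storeu_pd(&r[i], _mm_or_pd(_mm_loadu_pd(&v[i]), _mm_loadu_pd(&u[i])));
    }
    for (; i < n; ++i) { // odd leftover element
        std::uint64_t a, b;
        std::memcpy(&a, &v[i], sizeof a);
        std::memcpy(&b, &u[i], sizeof b);
        a |= b;
        std::memcpy(&r[i], &a, sizeof a);
    }
}
```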

_mm_load_ps caused segment fault

I have a code snippet. The snippet just loads 2 arrays and calculates dot product between them using SSE.
Code here:
using namespace std;
long long size = 3200000;
float* _random()
{
unsigned int seed = 123;
// float *t = malloc(size*sizeof(float));
float *t = new float[size];
int i;
float num = 0.0;
for(i=0; i < size; i++) {
num = rand()/(RAND_MAX+1.0);
t[i] = num;
}
return t;
}
float _dotProductVectorSSE(float *s1, float *s2)
{
float prod;
int i;
__m128 X, Y, Z;
for(i=0; i<size; i+=4)
{
X = _mm_load_ps(&s1[i]);
Y = _mm_load_ps(&s2[i]);
X = _mm_mul_ps(X, Y);
Z = _mm_add_ps(X, Z);
}
float *v = new float[4];
_mm_store_ps(v,Z);
for(i=0; i<4; i++)
{
// prod += Z[i];
std::cout << v[i] << endl;
}
return prod;
}
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
time_t start, stop;
double avg_time = 0;
double cur_time;
float* s1 = NULL;
float* s2 = NULL;
for(int i = 0; i < 100; i++)
{
s1 = _random();
s2 = _random();
start = clock();
float sse_product = _dotProductVectorSSE(s1, s2);
stop = clock();
cur_time = ((double) stop-start) / CLOCKS_PER_SEC;
avg_time += cur_time;
}
std::cout << "Averagely used " << avg_time/100 << " seconds." << endl;
return a.exec();
}
When I run it, I get a segmentation fault. Here is the backtrace:
(gdb) bt
0 0x0804965f in _mm_load_ps (__P=0xb6b56008) at /usr/lib/gcc/i586-suse-linux/4.6/include/xmmintrin.h:899
1 _dotProductVectorSSE (s1=0xb6b56008, s2=0xb5f20008) at ../simd/simd.cpp:37
2 0x0804987f in main (argc=1, argv=0xbfffee84) at ../simd/simd.cpp:80
Disassembly:
0x8049b30 push %ebp
0x8049b31 <+0x0001> push %edi
0x8049b32 <+0x0002> push %esi
0x8049b33 <+0x0003> push %ebx
0x8049b34 <+0x0004> sub $0x2c,%esp
0x8049b37 <+0x0007> mov 0x804c0a4,%esi
0x8049b3d <+0x000d> mov 0x40(%esp),%edx
0x8049b41 <+0x0011> mov 0x44(%esp),%ecx
0x8049b45 <+0x0015> mov 0x804c0a0,%ebx
0x8049b4b <+0x001b> cmp $0x0,%esi
0x8049b4e <+0x001e> jl 0x8049b7a <_Z20_dotProductVectorSSEPfS_+74>
0x8049b50 <+0x0020> jle 0x8049c10 <_Z20_dotProductVectorSSEPfS_+224>
0x8049b56 <+0x0026> add $0xffffffff,%ebx
0x8049b59 <+0x0029> adc $0xffffffff,%esi
0x8049b5c <+0x002c> xor %eax,%eax
0x8049b5e <+0x002e> shrd $0x2,%esi,%ebx
0x8049b62 <+0x0032> add $0x1,%ebx
0x8049b65 <+0x0035> shl $0x2,%ebx
**0x8049b68 <+0x0038> movaps (%edx,%eax,4),%xmm0**
0x8049b6c <+0x003c> mulps (%ecx,%eax,4),%xmm0
0x8049b70 <+0x0040> add $0x4,%eax
0x8049b73 <+0x0043> cmp %ebx,%eax
0x8049b75 <+0x0045> addps %xmm0,%xmm1
0x8049b78 <+0x0048> jne 0x8049b68 <_Z20_dotProductVectorSSEPfS_+56>
0x8049b7a <+0x004a> movaps %xmm1,0x10(%esp)
0x8049b7f <+0x004f> xor %ebx,%ebx
I am using QtCreator and defined in .pro file:
QMAKE_CXXFLAGS += -msse -msse2
DEFINES += __SSE__
DEFINES += __SSE2__
DEFINES += __MMX__
Please tell me how to fix this problem!
You are not ensuring that your data is 16 byte aligned (malloc/new are not sufficient in general) - you will either need to use _mm_loadu_ps instead of _mm_load_ps to deal with your potentially misaligned data, or preferably use a suitable method to allocate aligned memory (e.g. posix_memalign on Linux).
Note that you should use _mm_load_ps and 16-byte-aligned memory if you possibly can; otherwise use _mm_loadu_ps, but note that this may reduce performance significantly on some (older) CPUs.
Try the link below.
http://flyeater.wordpress.com/2010/11/29/memory-allocation-and-data-alignment-custom-mallocfree/
You basically allocate a bit more memory than you need, then round the returned address up to the next multiple of 16 and load/store data starting from that address.
Take care with the pointer arithmetic.
Most of the code at ideone.com/fXKQhR is taken from the above link, with sample usage.
I think _mm_malloc may be helpful to you.
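A minimal sketch of the aligned-allocation route, assuming a compiler whose <xmmintrin.h> provides _mm_malloc/_mm_free (GCC and Clang do). The helper names `allocAlignedFloats` and `dotProductSse` are made up for illustration. It also zeroes the accumulator, which the code in the question leaves uninitialized:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: 16-byte-aligned float buffer; release it with _mm_free.
float* allocAlignedFloats(std::size_t n) {
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

// Dot product with aligned loads; n must be a multiple of 4.
float dotProductSse(const float* s1, const float* s2, std::size_t n) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(s1) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(s2) % 16 == 0);
    __m128 acc = _mm_setzero_ps(); // start from zero
    for (std::size_t i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(s1 + i), _mm_load_ps(s2 + i)));
    }
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc); // horizontal sum once, outside the loop
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With buffers from `allocAlignedFloats`, the alignment asserts hold and `_mm_load_ps` is safe; buffers from plain `new float[size]` carry no such guarantee.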

Can counting byte matches between two strings be optimized using SIMD?

Profiling suggests that this function here is a real bottleneck for my application:
static inline int countEqualChars(const char* string1, const char* string2, int size) {
int r = 0;
for (int j = 0; j < size; ++j) {
if (string1[j] == string2[j]) {
++r;
}
}
return r;
}
Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be faster. Any ideas on how to speed things up? Target architecture is x86-64.
Of course it can.
pcmpeqb compares two vectors of 16 bytes and produces a vector with zeros where they differed, and -1 where they match. Use this to compare 16 bytes at a time, adding the result to an accumulator vector (make sure to accumulate the results of at most 255 vector compares to avoid overflow). When you're done, there are 16 results in the accumulator. Sum them and negate to get the number of equal elements.
If the lengths are very short, it will be hard to get a significant speedup from this approach. If the lengths are long, then it will be worth pursuing.
Compiler flags for vectorization:
-ftree-vectorize
-ftree-vectorize -march=<your_architecture> (uses all instruction-set extensions available on your computer, not just the baseline, such as SSE2 for x86-64). Use -march=native to optimize for the machine the compiler is running on. -march=<foo> also sets -mtune=<foo>, which is also a good thing.
Using SSEx intrinsics:
Pad and align the buffer to 16 bytes (according to the vector size you're actually going to use)
Create an accumulator countU8 with _mm_set1_epi8(0)
For all n/16 input (sub) vectors, do:
Load 16 chars from both strings with _mm_load_si128 or _mm_loadu_si128 (for unaligned loads)
Compare the octets in parallel with _mm_cmpeq_epi8. Each match yields 0xFF (-1), each mismatch 0x00.
Subtract the above result vector from countU8 using _mm_sub_epi8 (minus -1 -> +1)
After at most 255 iterations, the 16 8-bit counters must be extracted into a larger integer type to prevent overflow. See unpack and horizontal add in this nice answer for how to do that: https://stackoverflow.com/a/10930706/1175253
Code:
#include <iostream>
#include <vector>
#include <cassert>
#include <cstdint>
#include <climits>
#include <cstring>
#include <emmintrin.h>
#ifdef __SSE2__
#if !defined(UINTPTR_MAX) || !defined(UINT64_MAX) || !defined(UINT32_MAX)
# error "Limit macros are not defined"
#endif
#if UINTPTR_MAX == UINT64_MAX
#define PTR_64
#elif UINTPTR_MAX == UINT32_MAX
#define PTR_32
#else
# error "Current UINTPTR_MAX is not supported"
#endif
template<typename T>
void print_vector(std::ostream& out,const __m128i& vec)
{
static_assert(sizeof(vec) % sizeof(T) == 0,"Invalid element size");
std::cout << '{';
const T* const end = reinterpret_cast<const T*>(&vec)-1;
const T* const upper = end+(sizeof(vec)/sizeof(T));
for(const T* elem = upper;
elem != end;
--elem
)
{
if(elem != upper)
std::cout << ',';
std::cout << +(*elem);
}
std::cout << '}' << std::endl;
}
#define PRINT_VECTOR(_TYPE,_VEC) do{ std::cout << #_VEC << " : "; print_vector<_TYPE>(std::cout,_VEC); } while(0)
///@note SSE2 required (macro: __SSE2__)
///@warning Not tested!
size_t counteq_epi8(const __m128i* a_in,const __m128i* b_in,size_t count)
{
assert(a_in != nullptr && (uintptr_t(a_in) % 16) == 0);
assert(b_in != nullptr && (uintptr_t(b_in) % 16) == 0);
//assert(count > 0);
/*
//maybe not so good with all that branching and additional loop variables
__m128i accumulatorU8 = _mm_set1_epi8(0);
__m128i sum2xU64 = _mm_set1_epi8(0);
for(size_t i = 0;i < count;++i)
{
//this operation could also be unrolled, where multiple result registers would be accumulated
accumulatorU8 = _mm_sub_epi8(accumulatorU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
if(i % 255 == 0)
{
//before overflow of uint8, the counter will be extracted
__m128i sum2xU16 = _mm_sad_epu8(accumulatorU8,_mm_set1_epi8(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
//reset accumulatorU8
accumulatorU8 = _mm_set1_epi8(0);
}
}
//blindly accumulate remaining values
__m128i sum2xU16 = _mm_sad_epu8(accumulatorU8,_mm_set1_epi8(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
//do a horizontal addition of the two counter values
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#if defined PTR_64
return _mm_cvtsi128_si64(sum2xU64);
#elif defined PTR_32
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
*/
__m128i sum2xU64 = _mm_set1_epi32(0);
while(count--)
{
__m128i matches = _mm_sub_epi8(_mm_set1_epi32(0),_mm_cmpeq_epi8(*a_in++,*b_in++));
__m128i sum2xU16 = _mm_sad_epu8(matches,_mm_set1_epi32(0));
sum2xU64 = _mm_add_epi64(sum2xU64,sum2xU16);
#ifndef NDEBUG
PRINT_VECTOR(uint16_t,sum2xU64);
#endif
}
//do a horizontal addition of the two counter values
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#ifndef NDEBUG
std::cout << "----------------------------------------" << std::endl;
PRINT_VECTOR(uint16_t,sum2xU64);
#endif
#if !defined(UINTPTR_MAX) || !defined(UINT64_MAX) || !defined(UINT32_MAX)
# error "Limit macros are not defined"
#endif
#if defined PTR_64
return _mm_cvtsi128_si64(sum2xU64);
#elif defined PTR_32
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
}
#endif
int main(int argc, char* argv[])
{
std::vector<__m128i> a(64); // * 16 bytes
std::vector<__m128i> b(a.size());
const size_t nBytes = a.size() * sizeof(std::vector<__m128i>::value_type);
char* const a_out = reinterpret_cast<char*>(a.data());
char* const b_out = reinterpret_cast<char*>(b.data());
memset(a_out,0,nBytes);
memset(b_out,0,nBytes);
a_out[1023] = 1;
b_out[1023] = 1;
size_t equalBytes = counteq_epi8(a.data(),b.data(),a.size());
std::cout << "equalBytes = " << equalBytes << std::endl;
return 0;
}
The fastest SSE implementation I got for large and small arrays:
size_t counteq_epi8(const __m128i* a_in,const __m128i* b_in,size_t count)
{
assert((count > 0 ? a_in != nullptr : true) && (uintptr_t(a_in) % sizeof(__m128i)) == 0);
assert((count > 0 ? b_in != nullptr : true) && (uintptr_t(b_in) % sizeof(__m128i)) == 0);
//assert(count > 0);
const size_t maxInnerLoops = 255;
const size_t nNestedLoops = count / maxInnerLoops;
const size_t nRemainderLoops = count % maxInnerLoops;
const __m128i zero = _mm_setzero_si128();
__m128i sum16xU8 = zero;
__m128i sum2xU64 = zero;
for(size_t i = 0;i < nNestedLoops;++i)
{
for(size_t j = 0;j < maxInnerLoops;++j)
{
sum16xU8 = _mm_sub_epi8(sum16xU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
}
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_sad_epu8(sum16xU8,zero));
sum16xU8 = zero;
}
for(size_t j = 0;j < nRemainderLoops;++j)
{
sum16xU8 = _mm_sub_epi8(sum16xU8,_mm_cmpeq_epi8(*a_in++,*b_in++));
}
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_sad_epu8(sum16xU8,zero));
sum2xU64 = _mm_add_epi64(sum2xU64,_mm_srli_si128(sum2xU64,64/8));
#if UINTPTR_MAX == UINT64_MAX
return _mm_cvtsi128_si64(sum2xU64);
#elif UINTPTR_MAX == UINT32_MAX
return _mm_cvtsi128_si32(sum2xU64);
#else
# error "macro PTR_(32|64) is not set"
#endif
}
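The function above expects aligned __m128i pointers and a count in whole vectors. To plug it into the original countEqualChars signature, something along these lines might do. It is a sketch, assuming x86-64 (for _mm_cvtsi128_si64) and using unaligned loads plus a scalar tail; the name `countEqualBytes` is made up:

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>

std::size_t countEqualBytes(const char* s1, const char* s2, std::size_t size) {
    assert(size == 0 || (s1 != nullptr && s2 != nullptr));
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero; // two 64-bit partial sums
    std::size_t i = 0;
    while (size - i >= 16) {
        __m128i acc8 = zero; // sixteen 8-bit counters
        // At most 255 compares per batch, so no 8-bit counter can overflow.
        for (int n = 0; n < 255 && size - i >= 16; ++n, i += 16) {
            __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s1 + i));
            __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s2 + i));
            acc8 = _mm_sub_epi8(acc8, _mm_cmpeq_epi8(a, b)); // minus -1 adds one per match
        }
        total = _mm_add_epi64(total, _mm_sad_epu8(acc8, zero));
    }
    total = _mm_add_epi64(total, _mm_srli_si128(total, 8)); // fold high half onto low half
    std::size_t count = static_cast<std::size_t>(_mm_cvtsi128_si64(total));
    for (; i < size; ++i) { // fewer than 16 bytes left
        count += (s1[i] == s2[i]);
    }
    return count;
}
```

Unaligned loads cost little on recent CPUs, but for very short strings the scalar tail dominates and the speedup will be small, as noted earlier.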
Auto-vectorization in current gcc is a matter of helping the compiler understand that the code is easy to vectorize. In your case, it will vectorize the loop if you remove the conditional and rewrite the code in a more imperative way:
static inline int count(const char* string1, const char* string2, int size) {
int r = 0;
bool b;
for (int j = 0; j < size; ++j) {
b = (string1[j] == string2[j]);
r += b;
}
return r;
}
In this case:
movdqa 16(%rsp), %xmm1
movl $.LC2, %esi
pxor %xmm2, %xmm2
movzbl 416(%rsp), %edx
movdqa .LC1(%rip), %xmm3
pcmpeqb 224(%rsp), %xmm1
cmpb %dl, 208(%rsp)
movzbl 417(%rsp), %eax
movl $1, %edi
pand %xmm3, %xmm1
movdqa %xmm1, %xmm5
sete %dl
movdqa %xmm1, %xmm4
movzbl %dl, %edx
punpcklbw %xmm2, %xmm5
punpckhbw %xmm2, %xmm4
pxor %xmm1, %xmm1
movdqa %xmm5, %xmm6
movdqa %xmm5, %xmm0
movdqa %xmm4, %xmm5
punpcklwd %xmm1, %xmm6
(etc.)

Branching elimination using bitwise operators

I have some critical branching code inside a loop that's run about 2^26 times. Branch prediction is not optimal because m is random. How would I remove the branching, possibly using bitwise operators?
bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
if(a == 0)
a = (m ? (a+1) : (k));
else if(a == k)
a = (m ? 0 : (a-1));
else
a = (m ? (a+1) : (a-1));
And here is the relevant assembly generated by gcc -O3:
.cfi_startproc
movl 4(%esp), %edx
movb 8(%esp), %cl
movl (%edx), %eax
testl %eax, %eax
jne L15
cmpb $1, %cl
sbbl %eax, %eax
andl $638, %eax
incl %eax
movl %eax, (%edx)
ret
L15:
cmpl $639, %eax
je L23
testb %cl, %cl
jne L24
decl %eax
movl %eax, (%edx)
ret
L23:
cmpb $1, %cl
sbbl %eax, %eax
andl $638, %eax
movl %eax, (%edx)
ret
L24:
incl %eax
movl %eax, (%edx)
ret
.cfi_endproc
The branch-free division-free modulo could have been useful, but testing shows that in practice, it isn't.
const unsigned int k = 639;
void f(bool m, unsigned int &a)
{
a += m * 2 - 1;
if (a == -1u)
a = k;
else if (a == k + 1)
a = 0;
}
Testcase:
unsigned a = 0;
f(false, a);
assert(a == 639);
f(false, a);
assert(a == 638);
f(true, a);
assert(a == 639);
f(true, a);
assert(a == 0);
f(true, a);
assert(a == 1);
f(false, a);
assert(a == 0);
Actually timing this, using a test program:
int main()
{
for (int i = 0; i != 10000; i++)
{
unsigned int a = k / 2;
while (a != 0) f(rand() & 1, a);
}
}
(Note: there's no srand, so results are deterministic.)
My original answer: 5.3s
The code in the question: 4.8s
Lookup table: 4.5s (static unsigned lookup[2][k+1];)
Lookup table: 4.3s (static unsigned lookup[k+1][2];)
Eric's answer: 4.2s
This version: 4.0s
The fastest I've found is now the table implementation
Timings I got (UPDATED for new measurement code)
HVD's most recent: 9.2s
Table version: 7.4s (with k=693)
Table creation code:
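unsigned int table[2*(k+1)]; // two entries (m = 0 and m = 1) for each state a in 0..k
table_ptr = table;
for(unsigned int i = 0; i <= k; i++){
unsigned int a = i;
f(0, a);
table[i<<1] = a;
a = i;
f(1, a);
table[(i<<1) + 1] = a; // parentheses needed: + binds tighter than <<
}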
Table runtime loop:
void f(bool m, unsigned int &a){
a = table_ptr[a<<1 | m];
}
With HVD's measurement code, I saw the cost of the rand() dominating the runtime, so that the runtime for a branchless version was about the same range as these solutions. I changed the measurement code to this (UPDATED to keep random branch order, and pre-computing random values to prevent rand(), etc. from trashing the cache)
int main(){
unsigned int a = k / 2;
int m[100000];
for(int i = 0; i < 100000; i++){
m[i] = rand() & 1;
}
for (int i = 0; i != 10000; i++)
{
for(int j = 0; j != 100000; j++){
f(m[j], a);
}
}
}
I don't think you can remove the branches entirely, but you can reduce the number by branching on m first.
if (m){
if (a==k) {a = 0;} else {++a;}
}
else {
if (a==0) {a = k;} else {--a;}
}
Adding to Antimony's rewrite:
if (a==k) {a = 0;} else {++a;}
looks like an increase with wraparound. You can write this as
a=(a+1)%(k+1); // a ranges over 0..k, so the modulus must be k+1
which, of course, only makes sense if divisions are actually faster than branches.
Not sure about the other one; too lazy to think about what the (~0)%k will be.
This has no branches. Because k is constant, the compiler might be able to optimize the modulo depending on its value. And if k is 'small' then a full lookup table solution would probably be even faster.
bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
const unsigned int inc[2] = {k, 1}; // m == false: add k, i.e. -1 modulo k+1; m == true: add 1
a = (a + inc[m]) % (k+1);
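The same modulo-(k+1) idea can be written without a division or a table. This is a sketch under the assumption that a always stays in 0..k; the function name `stepBranchless` is hypothetical and k = 639 is taken from the assembly in the question. Whether the compiler really emits branch-free code for the comparison is worth checking in the assembly:

```cpp
#include <cassert>

constexpr unsigned int k = 639;

// Returns the successor state: +1 if m, -1 otherwise, wrapping within 0..k.
inline unsigned int stepBranchless(bool m, unsigned int a) {
    assert(a <= k);
    // 1 if m is true, k otherwise (adding k is -1 modulo k+1)
    const unsigned int step = k - static_cast<unsigned int>(m) * (k - 1);
    const unsigned int t = a + step; // 1 <= t <= 2k
    // all-ones mask if t left the range 0..k, zero otherwise
    const unsigned int wrap = 0u - static_cast<unsigned int>(t > k);
    return t - ((k + 1) & wrap);
}
```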
If k isn't large enough to cause overflow, you could do something like this:
int a; // Note: not unsigned int
int plusMinus = 2 * m - 1;
a += plusMinus;
if(a == -1)
a = k;
else if (a == k+1)
a = 0;
Still branches, but the branch prediction should be better, since the edge conditions are rarer than m-related conditions.