I am trying to multiply two uint64_ts and store the result in a uint64_t. I found an existing answer on Stack Overflow which splits the inputs into their four uint32_t halves and joins the result later:
https://stackoverflow.com/a/28904636/1107474
I have created a full example using the code and pasted it below.
However, for 37 x 5 I am getting the result 0 instead of 185?
#include <cstdint>
#include <iostream>
int main()
{
uint64_t a = 37; // Input 1
uint64_t b = 5; // Input 2
uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;
uint64_t a_x_b_hi = a_hi * b_hi;
uint64_t a_x_b_mid = a_hi * b_lo;
uint64_t b_x_a_mid = b_hi * a_lo;
uint64_t a_x_b_lo = a_lo * b_lo;
uint64_t carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
(uint64_t)(uint32_t)b_x_a_mid +
(a_x_b_lo >> 32) ) >> 32;
uint64_t multhi = a_x_b_hi +
(a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
carry_bit;
std::cout << multhi << std::endl; // Outputs 0 instead of 185?
}
I'm merging your code with another answer in the original link.
#include <cstdint>
#include <iostream>
int main()
{
uint64_t a = 37; // Input 1
uint64_t b = 5; // Input 2
uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;
uint64_t a_x_b_hi = a_hi * b_hi;
uint64_t a_x_b_mid = a_hi * b_lo;
uint64_t b_x_a_mid = b_hi * a_lo;
uint64_t a_x_b_lo = a_lo * b_lo;
/*
This is implementing schoolbook multiplication:
x1 x0
X y1 y0
-------------
00 LOW PART
-------------
00
10 10 MIDDLE PART
+ 01
-------------
01
+ 11 11 HIGH PART
-------------
*/
// 64-bit product + two 32-bit values
uint64_t middle = a_x_b_mid + (a_x_b_lo >> 32) + uint32_t(b_x_a_mid);
// 64-bit product + two 32-bit values
uint64_t carry = a_x_b_hi + (middle >> 32) + (b_x_a_mid >> 32);
// Add LOW PART and lower half of MIDDLE PART
uint64_t result = (middle << 32) | uint32_t(a_x_b_lo);
std::cout << result << std::endl;
std::cout << carry << std::endl;
}
This results in
Program stdout
185
0
Godbolt link: https://godbolt.org/z/97xhMvY53
Or you could use __uint128_t which is non-standard but widely available.
static inline void mul64(uint64_t a, uint64_t b, uint64_t& result, uint64_t& carry) {
__uint128_t va(a);
__uint128_t vb(b);
__uint128_t vr = va * vb;
result = uint64_t(vr);
carry = uint64_t(vr >> 64);
}
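For reuse, the schoolbook version above can be wrapped in a function that returns both halves. This is only a sketch in portable C++; the names Product128 and mul64_portable are made up for illustration and do not come from the linked answers.

```cpp
#include <cstdint>

// Both halves of a 64x64 -> 128-bit product (illustrative type name).
struct Product128 {
    std::uint64_t lo;  // low 64 bits of the product
    std::uint64_t hi;  // high 64 bits of the product
};

// Schoolbook multiplication on 32-bit halves, no compiler extensions needed.
Product128 mul64_portable(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = static_cast<std::uint32_t>(a), a_hi = a >> 32;
    const std::uint64_t b_lo = static_cast<std::uint32_t>(b), b_hi = b >> 32;

    const std::uint64_t lo_lo = a_lo * b_lo;
    const std::uint64_t hi_lo = a_hi * b_lo;
    const std::uint64_t lo_hi = a_lo * b_hi;
    const std::uint64_t hi_hi = a_hi * b_hi;

    // A 64-bit product of two 32-bit values plus two 32-bit values
    // cannot overflow 64 bits.
    const std::uint64_t middle =
        hi_lo + (lo_lo >> 32) + static_cast<std::uint32_t>(lo_hi);

    return { (middle << 32) | static_cast<std::uint32_t>(lo_lo),
             hi_hi + (middle >> 32) + (lo_hi >> 32) };
}
```

For 37 x 5 this gives lo = 185 and hi = 0; the high half only becomes non-zero once the product needs more than 64 bits.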
In the title of this question, you said you wanted to multiply two integers. But the code you found on that other Q&A (Getting the high part of 64 bit integer multiplication) isn't trying to do that, it's only trying to get the high half of the full product. For a 64x64 => 128-bit product, the high half is product >> 64.
37 x 5 = 185
185 >> 64 = 0
It's correctly emulating multhi = (37 * (unsigned __int128)5) >> 64, and you're forgetting about the >> 64 part.
__int128 is a GNU C extension; it's much more efficient than emulating it manually with pure ISO C, but only supported on 64-bit targets by current compilers. See my answer on the same question. (ISO C23 adds _BitInt(N) for a width you specify, up to the implementation-defined BITINT_MAXWIDTH.)
In comments you were talking about floating-point mantissas. In an FP multiply, you have two n-bit mantissas (usually with their leading bits set), so the high half of the 2n-bit product will have n significant bits (more or less; maybe actually one place to the right IIRC).
Something like 37 x 5 would only happen with tiny subnormal floats, where the product would indeed underflow to zero. But in that case, it would be because you only get subnormals at the limits of the exponent range, and (37 * 2^-1022) * (5 * 2^-1022) would be 185 * 2^-2044, an exponent way too small to be represented in an FP format like IEEE binary64 (aka double), where -1022 is the minimum exponent.
You're using integers where a>>63 isn't 1, in fact they're both less than 2^32 so there are no significant bits outside the low 64 bits of the full 128-bit product.
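To make the difference concrete, here is a small sketch of the two halves using the GNU unsigned __int128 extension (so it assumes GCC or Clang on a 64-bit target). The function names mulhi64 and mullo64 are illustrative only.

```cpp
#include <cstdint>

// High 64 bits of the full 128-bit product (GNU extension, not ISO C++).
std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(a) * b) >> 64);
}

// Low 64 bits: this is what a plain uint64_t multiply already gives you.
std::uint64_t mullo64(std::uint64_t a, std::uint64_t b) {
    return a * b;
}
```

With small inputs like 37 and 5 the whole product lives in mullo64 and mulhi64 is 0. With mantissa-like inputs that have their top bit set, for example both equal to 2^63, the situation flips: the low half is 0 and the high half is 2^62, one place to the right of the top bit.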
I have a long byte array and I want to remove the lower nibble (the lower 4 bits) of every byte and move the rest together such that the result occupies half the space as the input.
For example, if my input is 057ABC23, my output should be 07B2.
My current approach looks like this:
// in is unsigned char*
size_t outIdx = 0;
for(size_t i = 0; i < input_length; i += 8)
{
in[outIdx++] = (in[i ] & 0xF0) | (in[i + 1] >> 4);
in[outIdx++] = (in[i + 2] & 0xF0) | (in[i + 3] >> 4);
in[outIdx++] = (in[i + 4] & 0xF0) | (in[i + 5] >> 4);
in[outIdx++] = (in[i + 6] & 0xF0) | (in[i + 7] >> 4);
}
... where I basically process 8 bytes of input in every loop, to illustrate that I can assume input_length to be divisible by 8 (even though it's probably not faster than processing only 2 bytes per loop). The operation is done in-place, overwriting the input array.
Is there a faster way to do this? For example, since I can read in 8 bytes at a time anyway, the operation could be done on 4-byte or 8-byte integers instead of individual bytes, but I cannot think of a way to do that. The compiler doesn't come up with something itself either, as I can see the output code still operates on bytes (-O3 seems to do some loop unrolling, but that's it).
I don't have control over the input, so I cannot store it differently to begin with.
There is a general technique for bit-fiddling to swap bits around. Suppose you have a 64-bit number, containing the following nibbles:
HxGxFxExDxCxBxAx
Here by x I denote a nibble whose value is unimportant (you want to delete it). The result of your bit-operation should be a 32-bit number HGFEDCBA.
First, delete all the x nibbles:
HxGxFxExDxCxBxAx & *_*_*_*_*_*_*_*_ = H_G_F_E_D_C_B_A_
Here I denote 0 by _, and binary 1111 by * for clarity.
Now, replicate your data:
H_G_F_E_D_C_B_A_ << 4 = _G_F_E_D_C_B_A__
H_G_F_E_D_C_B_A_ | _G_F_E_D_C_B_A__ = HGGFFEEDDCCBBAA_
Notice how some of your target nibbles are together. You need to retain these places, and delete duplicate data.
HGGFFEEDDCCBBAA_ & **__**__**__**__ = HG__FE__DC__BA__
From here, you can extract the result bytes directly, or do another iteration or two of the technique.
Next iteration:
HG__FE__DC__BA__ << 8 = __FE__DC__BA____
HG__FE__DC__BA__ | __FE__DC__BA____ = HGFEFEDCDCBABA__
HGFEFEDCDCBABA__ & ****____****____ = HGFE____DCBA____
Last iteration:
HGFE____DCBA____ << 16 = ____DCBA________
HGFE____DCBA____ | ____DCBA________ = HGFEDCBADCBA____
HGFEDCBADCBA____ >> 32 = ________HGFEDCBA
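The three iterations translate almost line by line into code. A minimal sketch, assuming the input has already been loaded into a 64-bit integer; the function name keep_high_nibbles is made up here.

```cpp
#include <cstdint>

// Keep the high nibble of each of the 8 bytes and squeeze them together.
std::uint32_t keep_high_nibbles(std::uint64_t x) {
    x &= 0xF0F0F0F0F0F0F0F0ULL;                  // H_G_F_E_D_C_B_A_
    x = (x | (x << 4)) & 0xFF00FF00FF00FF00ULL;  // HG__FE__DC__BA__
    x = (x | (x << 8)) & 0xFFFF0000FFFF0000ULL;  // HGFE____DCBA____
    x |= x << 16;                                // HGFEDCBADCBA____
    return static_cast<std::uint32_t>(x >> 32);  // HGFEDCBA
}
```

For example, keep_high_nibbles(0x1A2B3C4D5E6F7081) yields 0x12345678. Note that the nibble order follows the significance within the integer, so after loading bytes from memory on a little-endian machine the first shift step may need to be adjusted to get the byte order you want.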
All x86-64 (and most x86) CPUs have SSE2.
For each 16-bit lane do
t = (x & 0x00F0) | (x >> 12).
Then use the pack instruction to truncate each 16-bit lane to 8 bits.
For example, 0xABCD1234 would become 0x00CA0031 then the pack would make it 0xCA31.
#include <emmintrin.h>
void squish_32bytesTo16 (unsigned char* src, unsigned char* dst) {
const __m128i mask = _mm_set1_epi16(0x00F0);
__m128i src0 = _mm_loadu_si128((__m128i*)(void*)src);
__m128i src1 = _mm_loadu_si128((__m128i*)(void*)(src + sizeof(__m128i)));
__m128i t0 = _mm_or_si128(_mm_and_si128(src0, mask), _mm_srli_epi16(src0, 12));
__m128i t1 = _mm_or_si128(_mm_and_si128(src1, mask), _mm_srli_epi16(src1, 12));
_mm_storeu_si128((__m128i*)(void*)dst, _mm_packus_epi16(t0, t1));
}
Just to put the resulting code here for future reference, it now looks like this (assuming the system is little endian, and the input length is a multiple of 8 bytes):
void compress(unsigned char* in, size_t input_length)
{
unsigned int* inUInt = reinterpret_cast<unsigned int*>(in);
unsigned long long* inULong = reinterpret_cast<unsigned long long*>(in);
for(size_t i = 0; i < input_length / 8; ++i)
{
unsigned long long value = inULong[i] & 0xF0F0F0F0F0F0F0F0;
value = (value >> 4) | (value << 8);
value &= 0xFF00FF00FF00FF00;
value |= (value << 8);
value &= 0xFFFF0000FFFF0000;
value |= (value << 16);
inUInt[i] = static_cast<unsigned int>(value >> 32);
}
}
Benchmarked very roughly it's around twice as fast as the code in the question (using MSVC19 /O2).
Note that this is basically the solution anatolyg posted before (just put into code), so upvote that answer instead if you found this helpful.
I need to perform the following operation:
w[i] = scale * v[i] + point
scale and point are fixed, whereas v[] is a vector of 4-bit integers.
I need to compute w[] for the arbitrary input vector v[] and I want to speed up the process using AVX intrinsics. However, v[i] is a vector of 4-bit integers.
The question is how to perform operations on 4-bit integers using intrinsics? I could use 8-bit integers and perform operations that way, but is there a way to do the following:
[a,b] + [c,d] = [a+b,c+d]
[a,b] * [c,d] = [a * b,c * d]
(Ignoring overflow)
Using AVX intrinsics, where [...,...] is an 8-bit integer and a,b,c,d are 4-bit integers?
If yes, would it be possible to give a short example on how this could work?
Just a partial answer (only addition) and in pseudo code (should be easy to extend to AVX2 intrinsics):
uint8_t a, b; // input containing two nibbles each
uint8_t c = a + b; // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10; // only bit 4 is of interest
c -= x; // undo carry of lower to upper nibble
If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), it can be left out of the computation of x.
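As a runnable scalar version of the pseudo code, something like the following should work. It is only a sketch for a single byte; the name add_nibbles is invented for this example, and the same sequence of add, xor, and, subtract would be applied per byte lane in a SIMD version.

```cpp
#include <cstdint>

// Add the two nibbles of a to the two nibbles of b independently,
// each nibble wrapping modulo 16.
std::uint8_t add_nibbles(std::uint8_t a, std::uint8_t b) {
    const std::uint8_t c = static_cast<std::uint8_t>(a + b);   // carry may leak into bit 4
    const std::uint8_t carry =
        static_cast<std::uint8_t>((a ^ b ^ c) & 0x10);         // carry that entered bit 4
    return static_cast<std::uint8_t>(c - carry);               // undo that carry
}
```

For instance, add_nibbles(0x1F, 0x11) gives 0x20: the low nibbles wrap from 0xF + 0x1 to 0x0 without disturbing the high nibbles, which sum to 0x2.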
As for multiplication: if scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessary). Otherwise, I'm afraid you need to mask out 4 bits of each 16-bit word, do the operation, and fiddle them together at the end. Pseudo code (there is no 8-bit multiplication in AVX, so we need to operate with 16-bit words):
uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble
uint16_t a, b; // input containing 4 nibbles each.
uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;
uint16_t result = p0 | p1 | p2 | p3; // join results together
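The same pseudo code as a compilable scalar function, as a sketch. The name mul_nibbles is made up, and the operands are widened to 32-bit unsigned first, because multiplying two uint16_t values directly promotes them to int and can overflow.

```cpp
#include <cstdint>

// Multiply the four nibbles of a with the corresponding nibbles of b,
// keeping only the low 4 bits of each product.
std::uint16_t mul_nibbles(std::uint16_t a16, std::uint16_t b16) {
    const std::uint32_t a = a16, b = b16;  // widen to avoid signed int overflow
    const std::uint32_t p0 = (a * b) & 0x000Fu;  // lowest nibble needs no input masking
    const std::uint32_t p1 = ((a >> 4)  * (b & 0x00F0u)) & 0x00F0u;
    const std::uint32_t p2 = ((a >> 8)  * (b & 0x0F00u)) & 0x0F00u;
    const std::uint32_t p3 = ((a >> 12) * (b & 0xF000u)) & 0xF000u;
    return static_cast<std::uint16_t>(p0 | p1 | p2 | p3);
}
```

As a quick check, mul_nibbles(0x1234, 0x2222) gives 0x2468, and mul_nibbles(0x00F3, 0x00F5) gives 0x001F because 15 * 15 = 225 keeps only its low nibble 1.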
For fixed a, b in w[i]=v[i] * a + b, you can simply use a lookup table w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the LSB. Split the input to even and odd nibbles, with the odd LUT preshifted by 4.
auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];
How to generate those LUTs dynamically, is another issue, if at all.
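Generating the two tables for a fixed scale and point is cheap in scalar code. The sketch below assumes unsigned nibbles and that each result is truncated to 4 bits; NibbleLuts, make_luts and scale_byte are hypothetical names, and a SIMD version would load the two 16-entry tables into registers for _mm_shuffle_epi8.

```cpp
#include <array>
#include <cstdint>

struct NibbleLuts {
    std::array<std::uint8_t, 16> low;   // result for the low nibble, in bits 0..3
    std::array<std::uint8_t, 16> high;  // result for the high nibble, pre-shifted to bits 4..7
};

// Precompute (scale * v + point) mod 16 for all 16 nibble values.
NibbleLuts make_luts(unsigned scale, unsigned point) {
    NibbleLuts luts{};
    for (unsigned v = 0; v < 16; ++v) {
        const std::uint8_t w = static_cast<std::uint8_t>((scale * v + point) & 0xFu);
        luts.low[v]  = w;
        luts.high[v] = static_cast<std::uint8_t>(w << 4);
    }
    return luts;
}

// Apply the tables to both nibbles of one input byte.
std::uint8_t scale_byte(const NibbleLuts& luts, std::uint8_t input) {
    return static_cast<std::uint8_t>(luts.low[input & 15] | luts.high[(input >> 4) & 15]);
}
```

With scale = 3 and point = 1, the byte 0x25 maps to 0x70: the low nibble 5 gives 16, which truncates to 0, and the high nibble 2 gives 7.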
4-bit additions/multiplications can be done using AVX2, particularly if you want to apply those computations on larger vectors (say more than 128 elements). However, if you want to add just 4 numbers, use straight scalar code.
We have done extensive work on how to deal with 4-bit integers, and we have recently developed a library to do it: Clover: 4-bit Quantized Linear Algebra Library (with focus on quantization). The code is also available at GitHub.
As you mentioned only 4-bit integers, I would assume that you are referring to signed integers (i.e. two's complement), and base my answer accordingly. Note that handling unsigned is in fact much simpler.
I would also assume that you would like to take a vector int8_t v[n/2] that contains n 4-bit integers, and produce int8_t v_sum[n/4] having n/2 4-bit integers. All the code related to the description below is available as a gist.
Packing / Unpacking
Obviously AVX2 does not offer any instructions to perform additions / multiplications on 4-bit integers; therefore, you must resort to the available 8- or 16-bit instructions. The first step in dealing with 4-bit arithmetic is to devise methods for placing the 4-bit nibbles into larger 8-, 16-, or 32-bit chunks.
For the sake of clarity, let's assume that you want to unpack a given nibble from a 32-bit chunk that stores multiple 4-bit signed values into a corresponding 32-bit integer (figure below). This can be done with two bit shifts:
a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
an arithmetic right shift is used to shift the nibble to the lowest order 4-bits of the 32-bit entity.
The arithmetic right shift has sign extension, filling the high-order 28 bits with the sign bit of the nibble, yielding a 32-bit integer with the same value as the two’s complement 4-bit value.
The goal of packing (left part of figure above) is to revert the unpacking operation. Two bit shifts can be used to place the lowest order 4 bits of a 32-bit integer anywhere within a 32-bit entity.
a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
a logical right shift is used to shift the nibble to somewhere within the 32-bit entity.
The first sets the bits lower-ordered than the nibble to zero, and the second sets the bits higher-ordered than the nibble to zero. A bitwise OR operation can then be used to store up to eight nibbles in the 32-bit entity.
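In scalar form, the two shift pairs described above look roughly like this. It is a sketch only: unpack_nibble and pack_nibble are illustrative names, index 0 is assumed to be the lowest-order nibble, and the signed right shift is assumed to be arithmetic (guaranteed since C++20, and what mainstream compilers do anyway).

```cpp
#include <cstdint>

// Extract nibble `index` (0 = lowest-order) as a sign-extended 32-bit value.
std::int32_t unpack_nibble(std::uint32_t word, int index) {
    const std::uint32_t at_top = word << (4 * (7 - index));  // nibble now in bits 28..31
    return static_cast<std::int32_t>(at_top) >> 28;          // arithmetic shift sign-extends
}

// Place the low 4 bits of `value` at nibble position `index`, zero elsewhere.
std::uint32_t pack_nibble(std::int32_t value, int index) {
    const std::uint32_t at_top = static_cast<std::uint32_t>(value) << 28;  // clears lower bits
    return at_top >> (4 * (7 - index));                                    // clears higher bits
}
```

Up to eight pack_nibble results with different indices can then be OR-ed into one 32-bit word, and unpack_nibble recovers each value in the [-8, 7] range.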
How to apply this in practice?
Let's assume that you have 64 x 32-bit integer values stored in 8 AVX registers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8. Let's also assume that each value is in the [-8, 7] range. If you want to pack them into a single AVX register of 64 x 4-bit values, you can do as follows:
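//
// Transpose the 8x8 registers. Note: _mm256_transpose8_epi32 is not a
// built-in intrinsic; it is a helper assumed to be defined as in the gist.
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);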
//
// Shift values left
//
q_1 = _mm256_slli_epi32(q_1, 28);
q_2 = _mm256_slli_epi32(q_2, 28);
q_3 = _mm256_slli_epi32(q_3, 28);
q_4 = _mm256_slli_epi32(q_4, 28);
q_5 = _mm256_slli_epi32(q_5, 28);
q_6 = _mm256_slli_epi32(q_6, 28);
q_7 = _mm256_slli_epi32(q_7, 28);
q_8 = _mm256_slli_epi32(q_8, 28);
//
// Shift values right (zero-extend)
//
q_1 = _mm256_srli_epi32(q_1, 7 * 4);
q_2 = _mm256_srli_epi32(q_2, 6 * 4);
q_3 = _mm256_srli_epi32(q_3, 5 * 4);
q_4 = _mm256_srli_epi32(q_4, 4 * 4);
q_5 = _mm256_srli_epi32(q_5, 3 * 4);
q_6 = _mm256_srli_epi32(q_6, 2 * 4);
q_7 = _mm256_srli_epi32(q_7, 1 * 4);
q_8 = _mm256_srli_epi32(q_8, 0 * 4);
//
// Pack together
//
__m256i t1 = _mm256_or_si256(q_1, q_2);
__m256i t2 = _mm256_or_si256(q_3, q_4);
__m256i t3 = _mm256_or_si256(q_5, q_6);
__m256i t4 = _mm256_or_si256(q_7, q_8);
__m256i t5 = _mm256_or_si256(t1, t2);
__m256i t6 = _mm256_or_si256(t3, t4);
__m256i t7 = _mm256_or_si256(t5, t6);
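Shifts usually take 1 cycle of throughput and 1 cycle of latency, so you can assume that they are in fact quite inexpensive. If you have to deal with unsigned 4-bit values that are already in the [0, 15] range, each pair of shifts can be replaced by a single left shift that moves the nibble straight to its target position.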
To reverse the procedure, you can apply the same method. Let's assume that you have loaded 64 4-bit values into a single AVX register __m256i qu_64. In order to produce 64 x 32-bit integers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8, you can execute the following:
//
// Shift values left
//
const __m256i qu_1 = _mm256_slli_epi32(qu_64, 4 * 7);
const __m256i qu_2 = _mm256_slli_epi32(qu_64, 4 * 6);
const __m256i qu_3 = _mm256_slli_epi32(qu_64, 4 * 5);
const __m256i qu_4 = _mm256_slli_epi32(qu_64, 4 * 4);
const __m256i qu_5 = _mm256_slli_epi32(qu_64, 4 * 3);
const __m256i qu_6 = _mm256_slli_epi32(qu_64, 4 * 2);
const __m256i qu_7 = _mm256_slli_epi32(qu_64, 4 * 1);
const __m256i qu_8 = _mm256_slli_epi32(qu_64, 4 * 0);
//
// Shift values right (sign-extent) and obtain 8x8
// 32-bit values
//
__m256i q_1 = _mm256_srai_epi32(qu_1, 28);
__m256i q_2 = _mm256_srai_epi32(qu_2, 28);
__m256i q_3 = _mm256_srai_epi32(qu_3, 28);
__m256i q_4 = _mm256_srai_epi32(qu_4, 28);
__m256i q_5 = _mm256_srai_epi32(qu_5, 28);
__m256i q_6 = _mm256_srai_epi32(qu_6, 28);
__m256i q_7 = _mm256_srai_epi32(qu_7, 28);
__m256i q_8 = _mm256_srai_epi32(qu_8, 28);
//
// Transpose the 8x8 values
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);
If dealing with unsigned 4-bit values, the arithmetic right shifts (_mm256_srai_epi32) should be replaced with logical right shifts (_mm256_srli_epi32), so that the values are zero-extended rather than sign-extended.
To see more details, have a look at the gist here.
Adding Odd and Even 4-bit entries
Let's assume that you load from the vector using AVX:
const __m256i qv = _mm256_loadu_si256( ... );
Now, we can easily extract the odd and the even parts. Life would have been much easier if there were 8-bit shifts in AVX2, but there are none, so we have to deal with 16-bit shifts:
const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
const __m256i qv_odd_dirty = _mm256_slli_epi16(qv, 4);
const __m256i qv_odd_shift = _mm256_and_si256(hi_mask_08, qv_odd_dirty);
const __m256i qv_evn_shift = _mm256_and_si256(hi_mask_08, qv);
At this point in time, you have essentially separated the odd and the even nibbles, in two AVX registers that hold their values in the high 4-bits (i.e. values in the range [-8 * 2^4, 7 * 2^4]). The procedure is the same even when dealing with unsigned 4-bit values. Now it is time to add the values.
const __m256i qv_sum_shift = _mm256_add_epi8(qv_odd_shift, qv_evn_shift);
This will work with both signed and unsigned values, as binary addition works the same way in two's complement. However, if you want to avoid overflows or underflows, you can also consider addition with saturation, already supported in AVX2 (for both signed and unsigned):
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
qv_sum_shift will be in the range [-8 * 2^4, 7 * 2^4]. To set it to the right value, we need to shift it back (Note that if qv_sum has to be unsigned, we can use _mm256_srli_epi16 instead):
const __m256i qv_sum = _mm256_srai_epi16(qv_sum_shift, 4);
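To see what happens inside a single byte lane, here is a scalar model of the steps so far, written as a sketch. The function name add_nibble_pair is made up; it mirrors the shift, mask, add and shift-back sequence for one byte holding two signed nibbles, and it assumes the usual two's complement narrowing and arithmetic right shift (both guaranteed since C++20).

```cpp
#include <cstdint>

// Add the two signed 4-bit values stored in one byte, wrapping modulo 16.
std::int8_t add_nibble_pair(std::uint8_t packed) {
    // Low nibble moved into the high 4 bits (value * 2^4).
    const std::uint8_t odd_shift  = static_cast<std::uint8_t>(packed << 4);
    // High nibble kept in place, low 4 bits cleared (value * 2^4).
    const std::uint8_t even_shift = static_cast<std::uint8_t>(packed & 0xF0);
    // 8-bit addition; the low 4 bits of the sum stay zero.
    const std::uint8_t sum_shift  = static_cast<std::uint8_t>(odd_shift + even_shift);
    // Divide by 2^4 with sign extension.
    return static_cast<std::int8_t>(static_cast<std::int8_t>(sum_shift) >> 4);
}
```

For the byte 0x21 this returns 3, for 0xF1 (that is, -1 and 1) it returns 0, and for 0x8F (-8 and -1) the sum wraps around to 7, which is where saturating addition would behave differently.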
The summation is now complete. Depending on your use case, this could as well be the end of the program, assuming that you want to produce 8-bit chunks of memory as a result. But let's assume that you want to solve a harder problem. Let's assume that the output is again a vector of 4-bit elements, with the same memory layout as the input one. In that case, we need to pack the 8-bit chunks into 4-bit chunks. However, the problem is that instead of having 64 values, we will end up with 32 values (i.e. half the size of the vector).
From this point there are two options. We either look ahead in the vector, processing 128 x 4-bit values, such that we produce 64 x 4-bit values. Or we revert to SSE, dealing with 32 x 4-bit values. Either way, the fastest way to pack the 8-bit chunks into 4-bit chunks would be to use the vpackuswb (or packuswb for SSE) instruction:
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
This instruction converts packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and stores the results in dst. This means that we have to interleave the odd and even 4-bit values, such that they reside in the 8 low bits of a 16-bit memory chunk. We can proceed as follows:
const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);
const __m256i qv_sum_lo = _mm256_and_si256(lo_mask_16, qv_sum);
const __m256i qv_sum_hi_dirty = _mm256_srli_epi16(qv_sum_shift, 8);
const __m256i qv_sum_hi = _mm256_and_si256(hi_mask_16, qv_sum_hi_dirty);
const __m256i qv_sum_16 = _mm256_or_si256(qv_sum_lo, qv_sum_hi);
The procedure will be identical for both signed and unsigned 4-bit values. Now, qv_sum_16 contains two consecutive 4-bit values, stored in the low-bits of a 16-bit memory chunk. Assuming that we have obtained qv_sum_16 from the next iteration (call it qv_sum_16_next), we can pack everything with:
const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16, qv_sum_16_next);
const __m256i result = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
Alternatively, if we want to produce only 32 x 4-bit values, we can do the following:
const __m128i lo = _mm256_extractf128_si256(qv_sum_16, 0);
const __m128i hi = _mm256_extractf128_si256(qv_sum_16, 1);
const __m128i result = _mm_packus_epi16(lo, hi);
Putting it all together
Assuming signed nibbles and a vector size n, such that n is larger than 128 elements and is a multiple of 128, we can perform the odd-even addition, producing n/2 elements as follows:
void add_odd_even(uint64_t n, int8_t * v, int8_t * r)
{
//
// Make sure that the vector size is a multiple of 128
//
assert(n % 128 == 0);
const uint64_t blocks = n / 64;
//
// Define constants that will be used for masking operations
//
const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);
for (uint64_t b = 0; b < blocks; b += 2) {
//
// Calculate the offsets
//
const uint64_t offset0 = b * 32;
const uint64_t offset1 = b * 32 + 32;
const uint64_t offset2 = b * 32 / 2;
//
// Load 128 values in two AVX registers. Each register will
// contain 64 x 4-bit values in the range [-8, 7].
//
const __m256i qv_1 = _mm256_loadu_si256((__m256i *) (v + offset0));
const __m256i qv_2 = _mm256_loadu_si256((__m256i *) (v + offset1));
//
// Extract the odd and the even parts. The values will be split in
// two registers qv_odd_shift and qv_evn_shift, each of them having
// 32 x 8-bit values, such that each value is multiplied by 2^4
// and resides in the range [-8 * 2^4, 7 * 2^4]
//
const __m256i qv_odd_dirty_1 = _mm256_slli_epi16(qv_1, 4);
const __m256i qv_odd_shift_1 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_1);
const __m256i qv_evn_shift_1 = _mm256_and_si256(hi_mask_08, qv_1);
const __m256i qv_odd_dirty_2 = _mm256_slli_epi16(qv_2, 4);
const __m256i qv_odd_shift_2 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_2);
const __m256i qv_evn_shift_2 = _mm256_and_si256(hi_mask_08, qv_2);
//
// Perform addition. Overflows / underflows are not detected: the 8-bit
// add wraps around. Values are still in the range [-8 * 2^4, 7 * 2^4].
//
const __m256i qv_sum_shift_1 = _mm256_add_epi8(qv_odd_shift_1, qv_evn_shift_1);
const __m256i qv_sum_shift_2 = _mm256_add_epi8(qv_odd_shift_2, qv_evn_shift_2);
//
// Divide by 2^4. At this point in time, each of the two AVX registers holds
// 32 x 8-bit values that are in the range of [-8, 7]. Summation is complete.
//
const __m256i qv_sum_1 = _mm256_srai_epi16(qv_sum_shift_1, 4);
const __m256i qv_sum_2 = _mm256_srai_epi16(qv_sum_shift_2, 4);
//
// Now, we want to take the even numbers of the 32 x 4-bit register, and
// store them in the high-bits of the odd numbers. We do this with
// logical right shifts that shift in zeros, and 16-bit masks. This operation
// results in two registers qv_sum_lo and qv_sum_hi that hold 32
// values. However, each two consecutive 4-bit values reside in the
// low-bits of a 16-bit chunk.
//
const __m256i qv_sum_1_lo = _mm256_and_si256(lo_mask_16, qv_sum_1);
const __m256i qv_sum_1_hi_dirty = _mm256_srli_epi16(qv_sum_shift_1, 8);
const __m256i qv_sum_1_hi = _mm256_and_si256(hi_mask_16, qv_sum_1_hi_dirty);
const __m256i qv_sum_2_lo = _mm256_and_si256(lo_mask_16, qv_sum_2);
const __m256i qv_sum_2_hi_dirty = _mm256_srli_epi16(qv_sum_shift_2, 8);
const __m256i qv_sum_2_hi = _mm256_and_si256(hi_mask_16, qv_sum_2_hi_dirty);
const __m256i qv_sum_16_1 = _mm256_or_si256(qv_sum_1_lo, qv_sum_1_hi);
const __m256i qv_sum_16_2 = _mm256_or_si256(qv_sum_2_lo, qv_sum_2_hi);
//
// Pack the two registers of 32 x 4-bit values, into a single one having
// 64 x 4-bit values. Use the unsigned version, to avoid saturation.
//
const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16_1, qv_sum_16_2);
//
// Interleave the 64-bit chunks.
//
const __m256i qv_sum = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
//
// Store the result
//
_mm256_storeu_si256((__m256i *)(r + offset2), qv_sum);
}
}
A self-contained tester and validator of this code is available in the gist here.
Multiplying Odd and Even 4-bit entries
For the multiplication of the odd and even entries, we can use the same strategy as described above to extract the 4-bits into larger chunks.
AVX2 does not offer 8-bit multiplication, only 16-bit. However, we can implement 8-bit multiplication following the method implemented in Agner Fog's C++ vector class library:
static inline Vec32c operator * (Vec32c const & a, Vec32c const & b) {
// There is no 8-bit multiply in SSE2. Split into two 16-bit multiplies
__m256i aodd = _mm256_srli_epi16(a,8); // odd numbered elements of a
__m256i bodd = _mm256_srli_epi16(b,8); // odd numbered elements of b
__m256i muleven = _mm256_mullo_epi16(a,b); // product of even numbered elements
__m256i mulodd = _mm256_mullo_epi16(aodd,bodd); // product of odd numbered elements
mulodd = _mm256_slli_epi16(mulodd,8); // put odd numbered elements back in place
__m256i mask = _mm256_set1_epi32(0x00FF00FF); // mask for even positions
__m256i product = selectb(mask,muleven,mulodd); // interleave even and odd
return product;
}
I would suggest however to extract the nibbles into 16-bit chunks first and then use _mm256_mullo_epi16 to avoid performing unnecessary shifts.
I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason why I'm doing it this way is because then computing the product of such a matrix and an 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing a uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
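The four steps fit in a few lines of intrinsics. A minimal sketch, assuming an x86 target with SSE2 and a 16-row by 8-column input stored one row per byte; the function name transpose16x8 and the std::array interface are choices made for this example, not part of the blog post.

```cpp
#include <array>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Transpose a 16x8 bit matrix: rows[k] is one 8-bit row, and bit k of
// the returned columns[i] is bit (7 - i) of rows[k].
std::array<std::uint16_t, 8> transpose16x8(const std::array<std::uint8_t, 16>& rows) {
    std::array<std::uint16_t, 8> columns{};
    __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rows.data()));
    for (int i = 0; i < 8; ++i) {
        // Collect the current MSB of all 16 bytes into one 16-bit value.
        columns[i] = static_cast<std::uint16_t>(_mm_movemask_epi8(x));
        // Move the next lower bit of every byte into the MSB position.
        x = _mm_slli_epi64(x, 1);
    }
    return columns;
}
```

The 64-bit shift also moves bits across byte boundaries, but only into the low bits of the neighbouring byte, which reach the MSB position only after all 8 iterations, so they never affect the result.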
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPU (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest treating an 8x8 block as a uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply by such that the MSBs of each byte end up next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
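The mask-and-multiply step can be sketched as below. The function name gather_bits is made up, and the 32x32 to 64-bit product is written with a uint64_t cast, on the assumption that the compiler turns it into a single widening multiply on the target. Only n from 0 to 6 is handled here, since shifting the magic number by 7 would push its top bit out of 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Gather bit (7 - n) of each of the 4 packed bytes into a 4-bit value.
// Bit k of the result comes from byte k (byte 0 = least significant).
std::uint32_t gather_bits(std::uint32_t packed, unsigned n) {
    assert(n <= 6);
    const std::uint32_t mask  = 0x80808080u >> n;  // selects one bit per byte
    const std::uint32_t magic = 0x02040810u << n;  // lines the 4 bits up at bits 32..35
    const std::uint64_t product =
        static_cast<std::uint64_t>(packed & mask) * magic;
    return static_cast<std::uint32_t>(product >> 32) & 0xFu;
}
```

For example, gather_bits(0x00800080, 0) returns 5 (bytes 0 and 2 have their MSB set). The partial products all land on distinct bit positions, so no carries disturb bits 32 to 35.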
My suggestion is that, you don't do the transposition, rather you add one bit information to your matrix data, indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix with a vector, it will be the same as multiplying the matrix on the left by the vector (and then transposing). This is easy: just some xor operations of your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.
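A sketch of that XOR-based product, using the example matrix from the question. The name mul_transposed, the constant kExampleRows, and the convention that bit i of v selects row i are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// The 3-row example matrix from the question, one byte per row.
constexpr std::uint8_t kExampleRows[3] = {0x55, 0x33, 0x0F};

// Compute (M^T * v) over GF(2) without transposing M: XOR together
// every row i of M whose selector bit i is set in v (n <= 32 rows).
std::uint8_t mul_transposed(const std::uint8_t* rows, std::size_t n, std::uint32_t v) {
    std::uint8_t result = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if ((v >> i) & 1u) {
            result ^= rows[i];
        }
    }
    return result;
}
```

With v = 0b011, rows 0 and 1 are combined, giving 0x55 ^ 0x33 = 0x66. The result uses the same column layout as the rows themselves.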
Here is the text of Jay Foad's email to me regarding fast Boolean matrix transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
return
(word & 0x0100000000000000) >> 49 // top right corner
| (word & 0x0201000000000000) >> 42
| ...
| (word & 0x4020100804020100) >> 7 // just above diagonal
| (word & 0x8040201008040201) // leading diagonal
| (word & 0x0080402010080402) << 7 // just below diagonal
| ...
| (word & 0x0000000000008040) << 42
| (word & 0x0000000000000080) << 49; // bottom left corner
}
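The elided middle terms follow the same pattern, one mask and one shift per diagonal. Rather than guessing the exact constants that were left out, the sketch below builds each diagonal mask in a loop; it is not the code from the email, just one way to fill in the idea, assuming the same packing (bit 63 is row 0, column 0).

```cpp
#include <cstdint>

// Transpose an 8x8 bit matrix packed row-major from MSB to LSB:
// element (row, col) lives at bit 63 - (8 * row + col).
std::uint64_t transpose8x8(std::uint64_t word) {
    std::uint64_t result = 0;
    for (int d = -7; d <= 7; ++d) {  // d = col - row selects one diagonal
        std::uint64_t mask = 0;
        for (int row = 0; row < 8; ++row) {
            const int col = row + d;
            if (col >= 0 && col < 8) {
                mask |= std::uint64_t{1} << (63 - (8 * row + col));
            }
        }
        // Moving (row, col) to (col, row) changes the bit index by -7 * d.
        if (d >= 0) {
            result |= (word & mask) >> (7 * d);   // above the diagonal: shift right
        } else {
            result |= (word & mask) << (7 * -d);  // below the diagonal: shift left
        }
    }
    return result;
}
```

For d = 1 the loop produces the mask 0x4020100804020100 with a right shift of 7, matching the "just above diagonal" line. A compiler that unrolls both loops can fold the masks into constants.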
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
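To try the shuffle version on a machine without BMI2, PDEP can be emulated in software. The sketch below does that; pdep_soft is a slow stand-in written for illustration, and on hardware with BMI2 you would call _pdep_u64 from <immintrin.h> instead. The names outer_shuffle and transpose8x8_by_shuffle are likewise made up.

```cpp
#include <cstdint>

// Software model of PDEP: deposit the low bits of src, in order,
// at the positions of the set bits in mask.
std::uint64_t pdep_soft(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // lowest set bit of mask
        if (src & bit) {
            result |= lowest;
        }
        mask &= mask - 1;  // clear that bit
    }
    return result;
}

// Outer perfect shuffle: interleave the high and low 32-bit halves,
// with bits of the high half landing on the odd positions.
std::uint64_t outer_shuffle(std::uint64_t word) {
    return pdep_soft(word >> 32, 0xAAAAAAAAAAAAAAAAULL) |
           pdep_soft(word, 0x5555555555555555ULL);
}

// Three shuffles rotate the 6-bit bit index by 3, swapping row and column.
std::uint64_t transpose8x8_by_shuffle(std::uint64_t word) {
    return outer_shuffle(outer_shuffle(outer_shuffle(word)));
}
```

Each shuffle rotates the 6-bit position index of every bit by one place, so after three of them the 3 row bits and the 3 column bits have traded places.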
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is with the current definition of your matrix the maximum size will be 8x8 bits. This fits into a uint64_t so we can use this to our advantage especially when using a 64-bit platform.
I have worked out a simple example using a lookup table which you can find below and run using: http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>
using std::cout;
using std::endl;
using std::bitset;
/* Static lookup table */
static uint64_t lut[256];
/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
for(int i=0; i < N; ++i){
cout << bitset<8>(arr[i]) << endl;
}
}
/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
assert(N <= 8);
uint64_t value = 0;
for(int i=0; i < N; ++i){
value = (value << 1) + lut[matrix[i]];
}
/* Ensure safe copy to prevent misalignment issues */
/* Can be removed if input array can be treated as uint64_t directly */
for(int i=0; i < 8; ++i){
transposed[i] = (value >> (i * 8)) & 0xFF;
}
}
/* Calculate lookup table */
void calculate_lut(void){
/* For all byte values */
for(uint64_t i = 0; i < 256; ++i){
auto b = std::bitset<8>(i);
auto v = std::bitset<64>(0);
/* For all bits in current byte */
for(int bit=0; bit < 8; ++bit){
if(b.test(bit)){
v.set((7 - bit) * 8);
}
}
lut[i] = v.to_ullong();
}
}
int main()
{
calculate_lut();
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
uint8_t transposed[8];
transpose_bitmatrix(matrix, transposed);
print_arr(transposed);
return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits from your "horizontal" representation to a vertical one, spread over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We can use a uint64_t to write the 8-byte array, and we can also add, xor, etc. on it, because basic arithmetic operations work on a 64-bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide. To optimize processing we first generate a 256-entry lookup table (one entry for each byte value). Each entry is a uint64_t and contains a rotated version of the bits: every bit of the byte value ends up as the least significant bit of its own byte.
Example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 } (array listed from the least significant byte)
Now for the calculation:
For the calculations we use the uint64_t, but keep in mind that under the hood it represents a uint8_t[8] array. We simply shift the current value (starting with 0) left by one, look up the next input byte and add its table entry to the current value.
The 'magic' here is that each byte of a uint64_t in the lookup table is either 1 or 0, so only the least significant bit of each byte is ever set. Shifting the uint64_t shifts every byte at once; as long as we do this no more than 8 times, no bit crosses into the neighbouring byte and we can operate on each byte individually.
Issues
As someone noted in the comments: Transpose(Transpose(M)) != M, so if you need this property you need some additional work.
Performance can be improved by directly mapping uint64_t's instead of uint8_t[8] arrays, since that omits the "safe copy" used to prevent alignment issues.
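A minimal sketch of that uint64_t variant, assuming the same table layout as in the example code. The names `lut64`, `build_lut64` and `transpose_to_u64` are hypothetical and only serve the illustration; the caller gets the packed value and can decide how to store it.

```cpp
#include <cassert>
#include <cstdint>

static uint64_t lut64[256];

// Same table as in the example: bit b of the byte value sets the lowest bit
// of byte (7 - b) of the 64-bit entry.
inline void build_lut64() {
    for (unsigned v = 0; v < 256; ++v) {
        uint64_t entry = 0;
        for (int bit = 0; bit < 8; ++bit)
            if (v & (1u << bit)) entry |= 1ull << ((7 - bit) * 8);
        lut64[v] = entry;
    }
}

// Returns the transposed rows packed into one integer. Byte i of the result,
// counted from the least significant byte, equals transposed[i] in the example.
inline uint64_t transpose_to_u64(const uint8_t* rows, int n) {
    assert(n >= 0 && n <= 8);
    uint64_t value = 0;
    for (int i = 0; i < n; ++i)
        value = (value << 1) + lut64[rows[i]];
    return value;
}
```

For the three rows used in main() above (0b01010101, 0b00110011, 0b00001111) this yields 0x0703050106020400, which matches the eight bytes the original function writes one by one.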
I have added a new answer instead of editing my original one to make this more visible (no comment rights, unfortunately).
In your own answer you add an additional requirement not present in the first one: it has to work on ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it, as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M3/M4 parts have a bit-banding region which can be used for exactly what you need: it expands each bit into a 32-bit word, and the same region can be used to perform atomic bit operations.
If you put your array in a bit-banded region, it will have an 'exploded' mirror in the bit-band alias region, where you can just use move operations on the bits themselves. If you write a loop, the compiler will likely be able to unroll it and optimize it down to plain move operations.
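The address arithmetic behind the 'exploded' mirror can be sketched as follows. This is an illustration only: the base addresses are those of the Cortex-M3/M4 SRAM bit-band region and its alias, bit-banding is an optional feature so the reference manual of the actual part has to confirm it, and `bitband_alias` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t kSramBase  = 0x20000000u;  // start of the 1 MB SRAM bit-band region
const uint32_t kAliasBase = 0x22000000u;  // start of the matching alias region

// Address of the 32-bit alias word that mirrors one bit of one byte in SRAM.
inline uint32_t bitband_alias(uint32_t byte_addr, uint32_t bit) {
    assert(bit < 8);
    assert(byte_addr >= kSramBase && byte_addr < kSramBase + 0x100000u);
    return kAliasBase + (byte_addr - kSramBase) * 32u + bit * 4u;
}
```

On the target, reading `*(volatile uint32_t*)bitband_alias(addr, bit)` returns 0 or 1, and writing 0 or 1 to it changes only that bit, so a transpose becomes a series of word copies between alias addresses.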
If you really want to, you can even set up a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the CPU :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained about a 10x speedup compared to naive coding, on an x86.
Here's what I posted on GitHub (mischasan/sse2/ssebmx.src)
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>
#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]
void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
// Do the main body in [16 x 8] blocks:
for (rr = 0; rr <= nrows - 16; rr += 16)
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
*(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
if (rr == nrows) return;
// The remainder is a row of [8 x 16]* [8 x 8]?
// Do the [8 x 16] blocks:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 8; i--;)
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
tmp.b[i + 8] = h >> 8;
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
OUT(rr, cc + i + 8) = h >> 8;
}
if (cc == ncols) return;
// Do the remaining [8 x 8] block:
for (i = 8; i--;)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Roberts answer, polynomial multiplication in Arm Neon can be used to scatter the bits:
inline poly8x16_t mull_lo(poly8x16_t a) {
auto b = vget_low_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
auto b = vget_high_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
The vsli (shift left and insert) instruction can then be used to combine the bits pairwise; the intrinsic that takes an immediate shift count is spelled vsli_n_p8.
auto ab = vsli_n_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_n_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_n_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_n_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_n_p8(ab, cd, 2);
auto efgh = vsli_n_p8(ef, gh, 2);
return vsli_n_p8(abcd, efgh, 4);
Clang optimizes this code to avoid the vmull2 instructions, relying heavily on ext q0,q0,8 to implement vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers and vectorize well for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H r B|t D|v F|x H | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
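Put together, the three steps above form a complete 8x8 transpose. The sketch below wraps them in a function; the name `transpose8x8` and the convention that bit `8*row + col` of `x` holds the matrix element at (row, col) are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Transposes an 8x8 bit matrix stored in x as bit (8*row + col).
inline uint64_t transpose8x8(uint64_t x) {
    // Step 1: swap the two off-diagonal bits inside every 2x2 block.
    uint64_t a = x & 0x00aa00aa00aa00aaull;
    uint64_t b = x & 0x5500550055005500ull;
    uint64_t c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
    // Step 2: swap the off-diagonal 2x2 blocks inside every 4x4 block.
    uint64_t d = c & 0x0000cccc0000ccccull;
    uint64_t e = c & 0x3333000033330000ull;
    uint64_t f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
    // Step 3: swap the two off-diagonal 4x4 blocks.
    uint64_t g = f & 0x00000000f0f0f0f0ull;
    uint64_t h = f & 0x0f0f0f0f00000000ull;
    return (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
}

// Small self-check: a full row 0 becomes a full column 0.
inline void transpose8x8_self_check() {
    assert(transpose8x8(0xFFull) == 0x0101010101010101ull);
}
```

Because a transpose is its own inverse, `transpose8x8(transpose8x8(x)) == x` is a cheap sanity test for any change to the masks.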
On ARM, each step can now be done in 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full
I have run into an interesting problem lately:
Let's say I have an array of bytes (uint8_t to be exact) of length at least one. Now I need a function that gets a subsequence of bits from this array, starting at bit X (zero-based index, inclusive) and having length L, and returns it as a uint32_t. If L is smaller than 32, the remaining high bits should be zero.
Although this is not very hard to solve, my current thoughts on how to do this seem a bit cumbersome to me. I'm thinking of a table of all the possible masks for a given byte (start with bit 0-7, take 1-8 bits) and then constructing the number one byte at a time using this table.
Can somebody come up with a nicer solution? Note that I cannot use Boost or STL for this. And no, it is not homework; it's a problem I ran into at work, and we do not use Boost or STL in the code where this goes. You can assume that 0 < L <= 32 and that the byte array is large enough to hold the subsequence.
One example of correct input/output:
array: 00110011 1010 1010 11110011 01 101100
subsequence: X = 12 (zero based index), L = 14
resulting uint32_t = 00000000 00000000 00 101011 11001101
Only the first and last bytes in the subsequence will involve some bit slicing to get the required bits out, while the intermediate bytes can be shifted in whole into the result. Here's some sample code that does what I described. Bit 0 is the most significant bit of the first byte, as in the question's example, and the end position is computed from the last bit actually wanted (X + L - 1), which avoids being off by one:
uint32_t bit_subsequence(const uint8_t *bytes, int X, int L)
{
    uint32_t result;
    int last = X + L - 1;        /* index of the last bit wanted */
    int startByte = X / 8,       /* starting byte number */
        startBit = 7 - X % 8,    /* bit index within starting byte, from LSB */
        endByte = last / 8,      /* ending byte number */
        endBit = 7 - last % 8;   /* bit index within ending byte, from LSB */
    /* Special case where start and end are within same byte:
       just get bits from startBit down to endBit, inclusive */
    if (startByte == endByte) {
        uint8_t byte = bytes[startByte];
        result = (byte >> endBit) & ((1u << (startBit - endBit + 1)) - 1);
    }
    /* All other cases: get ending bits of starting byte,
       all other bytes in between,
       starting bits of ending byte */
    else {
        uint8_t byte = bytes[startByte];
        result = byte & ((1u << (startBit + 1)) - 1);
        for (int i = startByte + 1; i < endByte; i++)
            result = (result << 8) | bytes[i];
        byte = bytes[endByte];
        result = (result << (8 - endBit)) | (byte >> endBit);
    }
    return result;
}
Take a look at std::bitset and boost::dynamic_bitset.
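I would be thinking of something like loading the relevant bytes into a uint64_t and then shifting left and right to lose the uninteresting bits. The bytes are assembled one at a time, most significant first, so the result does not depend on the machine's endianness or alignment and nothing is read beyond the last byte that holds wanted bits.
uint32_t extract_bits(const uint8_t* bytes, int start, int count)
{
    const uint8_t *p = bytes + start / 8;      /* first byte holding wanted bits */
    int skip = start % 8;                      /* unwanted leading bits in that byte */
    int nbytes = (skip + count + 7) / 8;       /* bytes touched, at most 5 */
    uint64_t hold = 0;
    for (int i = 0; i < nbytes; ++i)
        hold = (hold << 8) | p[i];
    hold <<= 64 - 8 * nbytes + skip;           /* push the leading bits off the top */
    hold >>= 64 - count;                       /* drop the trailing bits */
    return (uint32_t)hold;
}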
For the sake of completeness, I am adding my solution, inspired by the comments and answers here. Thanks to all who bothered to think about the problem.
static const uint8_t firstByteMasks[8] = { 0xFF, 0x7F, 0x3F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
uint32_t getBits( const uint8_t *buf, const uint32_t bitoff, const uint32_t len, const uint32_t bitcount )
{
uint64_t result = 0;
int32_t startByte = bitoff / 8; // starting byte number
int32_t endByte = ((bitoff + bitcount) - 1) / 8; // ending byte number
int32_t rightShift = 16 - ((bitoff + bitcount) % 8 );
if ( endByte >= len ) return -1;
if ( rightShift == 16 ) rightShift = 8;
result = buf[startByte] & firstByteMasks[bitoff % 8];
result = result << 8;
for ( int32_t i = startByte + 1; i <= endByte; i++ )
{
result |= buf[i];
result = result << 8;
}
result = result >> rightShift;
return (uint32_t)result;
}
A few notes: I tested the code and it seems to work just fine; however, there may be bugs. If I find any, I will update the code here. Also, there are probably better solutions!
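One way to look for such bugs is to compare the function against a deliberately slow reference that copies one bit at a time. The following is a sketch, not part of the original solution: `getBitsNaive` is a hypothetical name, it does no bounds checking on the buffer, and it assumes the question's convention that bit 0 is the most significant bit of the first byte.

```cpp
#include <cassert>
#include <cstdint>

// Reference implementation: appends one bit per iteration.
// Bit 0 is the most significant bit of buf[0].
inline uint32_t getBitsNaive(const uint8_t* buf, uint32_t bitoff, uint32_t bitcount) {
    assert(bitcount >= 1 && bitcount <= 32);
    uint32_t result = 0;
    for (uint32_t i = 0; i < bitcount; ++i) {
        uint32_t pos = bitoff + i;
        uint32_t bit = (buf[pos / 8] >> (7 - pos % 8)) & 1u;
        result = (result << 1) | bit;
    }
    return result;
}
```

With the bytes from the question (0x33, 0xAA, 0xF3, 0x6C), an offset of 12 and a length of 14, this returns 0x2BCD, the expected 101011 11001101. Looping over all offsets and lengths and comparing the two functions would cover the corner cases.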