Parsing multiple ints from string at once using SSE/AVX - c++

I am given a string of the following form:
Each line contains two ints separated by a single space. The line ending is a single "\n"
The number of lines is a multiple of 2
The ints are of a nice form: they are all positive, have no leading zeros, no '+' or '-' sign, and all have 1 to 7 digits
An example would be:
"5531 1278372\n461722 1278373\n1022606 1278374\n224406 1278375\n1218709 1278376\n195903 1278377\n604672 1278378\n998322 1278379\n"
I have a pointer to the beginning as well as to the ending of the string.
I want to parse this string by extracting all the integers from it as fast as possible. The first idea that comes to mind is a loop in which we always extract the first integer of the string using SSE and advance the pointer to the start of the next integer (which in this case begins two characters after the last digit of the current one, since all delimiters have size 1). As I have a pointer to the end of the string, the function that reads the first int does not have to check for '\0'; it only gets called when there really is another integer in the string. One could, for example, adapt the solution from How to implement atoi using SIMD? to obtain the following function, which returns the first integer of the string and then advances the pointer past the delimiter after the int (so it points to the beginning of the next int):
#include <immintrin.h>

inline uint32_t strToUintSSE(char*& sta) {
    //Set up constants
    __m128i zero = _mm_setzero_si128();
    __m128i multiplier1 = _mm_set_epi16(1000,100,10,1,1000,100,10,1);
    __m128i multiplier2 = _mm_set_epi32(0, 100000000, 10000, 1);
    //Compute length of string
    __m128i string = _mm_lddqu_si128((__m128i*)sta);
    __m128i digitRange = _mm_setr_epi8('0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0);
    int len = _mm_cmpistri(digitRange, string, _SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_NEGATIVE_POLARITY);
    sta += len + 1;
    //Reverse order of digits (bytes past the number become zero via the shuffle's sign bit)
    __m128i permutationMask = _mm_set1_epi8(len);
    permutationMask = _mm_add_epi8(permutationMask, _mm_set_epi8(-16,-15,-14,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1));
    string = _mm_shuffle_epi8(string, permutationMask);
    //Shift ASCII digits down to 0-9
    __m128i zeroChar = _mm_set1_epi8('0');
    string = _mm_subs_epu8(string, zeroChar);
    //Multiply with the right power of 10 and add up
    __m128i stringLo = _mm_unpacklo_epi8(string, zero);
    __m128i stringHi = _mm_unpackhi_epi8(string, zero);
    stringLo = _mm_madd_epi16(stringLo, multiplier1);
    stringHi = _mm_madd_epi16(stringHi, multiplier1);
    __m128i intermediate = _mm_hadd_epi32(stringLo, stringHi);
    intermediate = _mm_mullo_epi32(intermediate, multiplier2);
    //Hadd the rest up
    intermediate = _mm_add_epi32(intermediate, _mm_shuffle_epi32(intermediate, 0b11101110));
    intermediate = _mm_add_epi32(intermediate, _mm_shuffle_epi32(intermediate, 0b01010101));
    return _mm_cvtsi128_si32(intermediate);
}
Also, since we know that the string only contains '0'-'9', ' ' and '\n', we can calculate len using
int len = _mm_tzcnt_32(_mm_movemask_epi8(_mm_cmpgt_epi8(zeroChar, string)));
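The surrounding loop I have in mind would look something like this (a sketch; start and finish are the two given pointers, and the buffer is assumed to be readable 16 bytes past the last digit because of the unaligned load):
#include <vector>
#include <cstdint>

std::vector<uint32_t> parseAll(char* start, char* finish) {
    std::vector<uint32_t> values;
    char* p = start;
    while (p < finish)                      // one integer per iteration
        values.push_back(strToUintSSE(p));  // advances p past the delimiter
    return values;
}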
However, the requirements imply that an XMM register always fits two integers, so I would like to modify the function to extract both of them from "string". The idea is to transform "string" so that the first int starts at byte 0 and the second int starts at byte 8. We reversed the digits before because otherwise the zero bytes after the number would act as extra digits at the low end, making the number bigger; reversing turns them into leading zeros, which are harmless. Another possibility would be to have the first int end at byte 7 (inclusive) and the second at byte 15, so we essentially align each int with the right edge of its half of the register. This way the zeros also end up in the higher digits of the number. To summarize: if we have e.g. the string "2035_71582\n" (I'm using '_' to visualize the ' ' better), we want the XMM register to look like
'5','3','0','2',0,0,0,0,'2','8','5','1','7',0,0,0
0,0,0,0,'2','0','3','5',0,0,0,'7','1','5','8','2'
Note: these two possibilities are the same, just with each half reversed
(Of course, multiplying by the right power of 10 and then adding up the digits can also be optimized, since we now only have 7 digits instead of 16.)
To perform this transformation, we must first extract the lengths of the two integers. This can be done with
int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(zeroChar, string)); //Instead of _mm_cmpistrm
int len1 = _mm_tzcnt_32(mask);
int combinedLen = _mm_tzcnt_32(mask & (mask-1)); //Clears the lowest bit of mask first, will probably emit a BLSR
To implement the transform, I can think of multiple ways:
Use a shuffle like before. One could try to compute the mask like this:
__m128i permutationMask = _mm_setr_epi8(len1, len1, len1, len1, len1, len1, len1, len1,
                                        combinedLen, combinedLen, combinedLen, combinedLen, combinedLen, combinedLen, combinedLen, combinedLen);
permutationMask = _mm_add_epi8(permutationMask, _mm_set_epi8(-8,-7,-6,-5,-4,-3,-2,-1,-8,-7,-6,-5,-4,-3,-2,-1));
However, this runs into the problem that when reversing the second int, we run backwards into the first int: e.g. "2035_71582\n" -> '5','3','0','2',0,0,0,0,'2','8','5','1','7',' ','5','3' (we have an extra 53 from the first int at the end).
If we right-align instead of reversing, we can compute the mask analogously (only the summand is reversed)
__m128i permutationMask = _mm_setr_epi8(len1, len1, len1, len1, len1, len1, len1, len1,
                                        combinedLen, combinedLen, combinedLen, combinedLen, combinedLen, combinedLen, combinedLen, combinedLen);
permutationMask = _mm_add_epi8(permutationMask, _mm_setr_epi8(-8,-7,-6,-5,-4,-3,-2,-1,-8,-7,-6,-5,-4,-3,-2,-1));
but run into the same problem: "2035_71582\n" -> 0,0,0,0,'2','0','3','5', '3','5',' ','7','1','5','8','2'
It seems to me that computing a good shuffle mask is pretty hard to do. Maybe the best solution with this approach would be to first use a shuffle and then zero out the wrong bytes (there are many possibilities for this).
Instead of a shuffle, use two pslldq to shift each int into place and then combine them (one in the upper half, one in the lower half), for example using a blend. However, one would still need to zero out bytes, as the first int would also possibly appear in the second half.
Use a gather. However, we would still need to zero out the wrong bytes.
Something different entirely, e.g. using AVX512-VBMI (vpexpandb or vpcompressb maybe?). Maybe one wouldn't even have to compute len1 and combinedLen but could use the mask directly?
The first three don't feel very optimal yet, while I have no clue about the last. Can you think of a good way to do this? This can also be extended to using YMM registers to parse 4 ints at once (or even ZMM for 8 ints), which complicates things again, since the first two approaches become infeasible due to the inability to shuffle/shift across the 128-bit lanes, so the last approach looks the most promising to me. Sadly, I don't really have any experience with AVX512. You are free to use any version of SSE, AVX, AVX2, and also AVX512 as a last resort (I can't run AVX512, but if you find a nice solution with it, I would be interested as well).
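To make the AVX512 idea a bit more concrete, here is an untested sketch of what I imagine the VBMI2 variant could look like: _mm_maskz_compress_epi8 (AVX512-VBMI2 + VL) left-packs just the digit bytes of each int, with the compress masks derived from the delimiter bitmask. Note that the digits then sit at the bottom of each half in string order (most significant digit first), so the multiply constants would have to be adapted to that layout; pack_two_ints is a made-up name:
#include <immintrin.h>

inline __m128i pack_two_ints(__m128i str, uint32_t mask) {
    // mask has one bit set per delimiter byte, as computed above
    uint32_t len1 = _tzcnt_u32(mask);
    uint32_t combinedLen = _tzcnt_u32(mask & (mask - 1));
    __mmask16 digits1 = (__mmask16)((1u << len1) - 1);          // bytes 0..len1-1
    __mmask16 digits2 = (__mmask16)(((1u << combinedLen) - 1)   // bytes len1+1..combinedLen-1
                                    & ~((2u << len1) - 1));
    __m128i lo = _mm_maskz_compress_epi8(digits1, str);         // first int at byte 0
    __m128i hi = _mm_maskz_compress_epi8(digits2, str);         // second int at byte 0
    return _mm_or_si128(lo, _mm_slli_si128(hi, 8));             // move second int to byte 8
}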

Here's a strategy from over here.
Other References:
Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number?
How to find the position of the only-set-bit in a 64-bit value using bit manipulation efficiently?
See also:
http://0x80.pl/articles/simd-parsing-int-sequences.html
#include <tmmintrin.h> // SSSE3
#include <string.h>    // memcpy
#include <stdint.h>

static inline
uint64_t swar_parsedigits (uint8_t* src, uint32_t* res) {
    // assumes digit group len max is 7
    // assumes each group is separated by a single space or '\n'
    uint64_t v;
    memcpy(&v, src, 8); // assumes little endian
    v -= 0x3030303030303030ULL;
    uint64_t t = v & 0x8080808080808080ULL; // assumes "valid" input...
    uint64_t next = ((t & (-t)) * 0x20406080a0c0e1ULL) >> 60; // index of the delimiter + 1
    v <<= (9 - next) * 8; // shift off trash chars, left-aligning the digits
    v = ((v * 0x0000000000000A01ULL) >> 8) & 0x00FF00FF00FF00FFULL;  // 10*a+b per byte pair
    v = ((v * 0x0000000000640001ULL) >> 16) & 0x0000FFFF0000FFFFULL; // 100*ab+cd per word pair
    v = (v * 0x0000271000000001ULL) >> 32;                           // 10000*abcd+efgh
    *res = v;
    return next;
}
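A hypothetical usage of the SWAR helper: the return value is the number of characters consumed (including the delimiter), so a pair of ints takes two calls. The buffer must have 8 readable bytes from the start of each group:
uint32_t a, b;
uint8_t* p = src;
p += swar_parsedigits(p, &a);
p += swar_parsedigits(p, &b);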
static inline
uint64_t ssse3_parsedigits (uint8_t* src, uint32_t* res) {
    // assumes digit group len max is 7
    // assumes each group is separated by a single space or '\n'
    const __m128i mul1 = _mm_set1_epi64x(0x010A0A6414C82800);
    const __m128i mul2 = _mm_set1_epi64x(0x0001000A01F461A8);
    const __m128i x30 = _mm_set1_epi8(0x30);
    __m128i v;
    // get delimiters
    v = _mm_loadu_si128((__m128i *)(void *)src);
    v = _mm_sub_epi8(v, x30);
    uint32_t m = _mm_movemask_epi8(v);
    // find first 2 group lengths
    int len0 = __builtin_ctzl(m);
    m &= m - 1; // clear the lowest set bit
    int next = __builtin_ctzl(m);
    int len1 = next - (len0 + 1);
    // gather groups
    uint64_t x0, x1;
    memcpy(&x0, src, 8);
    memcpy(&x1, &src[len0 + 1], 8);
    // pad out to 8 bytes
    x0 <<= (8 - len0) * 8;
    x1 <<= (8 - len1) * 8;
    // back into the xmm register...
    v = _mm_set_epi64x(x1, x0);
    v = _mm_subs_epu8(v, x30);
    v = _mm_madd_epi16(_mm_maddubs_epi16(mul1, v), mul2);
    v = _mm_hadd_epi32(v, v);
    _mm_storel_epi64((__m128i*)(void *)res, v);
    return next + 1;
}
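And a sketch of a driver that walks the whole buffer with the SSSE3 version, two ints per call (parse_all is a made-up name; end points one past the final '\n', and the buffer is assumed to be padded so the 16-byte load cannot fault):
static void parse_all(uint8_t* src, uint8_t* end, uint32_t* out) {
    while (src < end) {
        src += ssse3_parsedigits(src, out); // writes out[0] and out[1]
        out += 2;
    }
}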

Related

How would you transpose a binary matrix?

I have binary matrices in C++ that I represent with a vector of 8-bit values.
For example, the following matrix:
0 1 0 1 0 1 0 1
0 0 1 1 0 0 1 1
0 0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
    0b01010101,
    0b00110011,
    0b00001111,
};
The reason why I'm doing it this way is that computing the product of such a matrix and an 8-bit vector then becomes really simple and efficient (just one bitwise AND and a parity computation per row), which is much better than calculating each bit individually.
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
    0b00000000,
    0b00000100,
    0b00000010,
    0b00000110,
    0b00000001,
    0b00000101,
    0b00000011,
    0b00000111,
};
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.
I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
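Spelled out, the whole thing fits in a few lines (a sketch assuming rows are packed MSB-first, one byte per row, as in the question; note the output rows come out LSB-first: bit j of out[i] is input row j, column i):
#include <emmintrin.h>
#include <stdint.h>

// Transpose a 16x8 bit matrix: 16 rows of 8 bits -> 8 rows of 16 bits.
void transpose16x8(const uint8_t rows[16], uint16_t out[8]) {
    __m128i x = _mm_loadu_si128((const __m128i*)rows);
    for (int i = 0; i < 8; ++i) {
        out[i] = (uint16_t)_mm_movemask_epi8(x); // MSBs of all 16 rows = one output row
        x = _mm_slli_epi64(x, 1);                // bring the next bit into the MSB position
    }
}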
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas in any answers that suggest doing it by treating an 8x8 block as an uint64_t. Most microcontrollers with a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply with so that the MSBs of the four bytes end up next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants, so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as for the first series of 4 bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.
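As a concrete sketch of the multiply trick (written with 64-bit arithmetic for clarity; on the Cortex-M you would keep the magic in 32 bits, take the high half of a 32x32 multiply, and special-case the last series of bits as described above; row 0 in the least significant byte is my assumption):
#include <stdint.h>

// Gather bit (7-n) of each of the four packed rows into 4 adjacent output bits.
static inline uint8_t gather_column(uint32_t rows4, int n) {
    uint32_t m = rows4 & (0x80808080u >> n);     // isolate bit (7-n) of every byte
    uint64_t magic = (uint64_t)0x02040810u << n; // magic number, shifted with the mask
    return (uint8_t)(((m * magic) >> 32) & 0x0F); // 4 LSBs of the upper 32 bits
}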
My suggestion is that you don't do the transposition; rather, you add one bit of information to your matrix data, indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposed matrix with a vector, it is the same as multiplying the vector by the matrix from the left (and then transposing the result). This is easy: just some XOR operations on your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comments you say that multiplication is exactly what you want to optimize.
Here is the text of Jay Foad's email to me regarding fast Boolean matrix
transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
    return (word & 0x0100000000000000) >> 49 // top right corner
         | (word & 0x0201000000000000) >> 42
         | ...
         | (word & 0x4020100804020100) >> 7  // just above diagonal
         | (word & 0x8040201008040201)       // leading diagonal
         | (word & 0x0080402010080402) << 7  // just below diagonal
         | ...
         | (word & 0x0000000000008040) << 42
         | (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
    return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)
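For reference, the elided diagonal terms above follow the obvious pattern. Filled in completely (my reconstruction, assuming the MSB-first row-major packing described in the email), the plain C function reads:
#include <stdint.h>

uint64_t transpose8x8(uint64_t word) {
    return (word & 0x0100000000000000) >> 49   // top right corner
         | (word & 0x0201000000000000) >> 42
         | (word & 0x0402010000000000) >> 35
         | (word & 0x0804020100000000) >> 28
         | (word & 0x1008040201000000) >> 21
         | (word & 0x2010080402010000) >> 14
         | (word & 0x4020100804020100) >> 7    // just above diagonal
         | (word & 0x8040201008040201)         // leading diagonal
         | (word & 0x0080402010080402) << 7    // just below diagonal
         | (word & 0x0000804020100804) << 14
         | (word & 0x0000008040201008) << 21
         | (word & 0x0000000080402010) << 28
         | (word & 0x0000000000804020) << 35
         | (word & 0x0000000000008040) << 42
         | (word & 0x0000000000000080) << 49;  // bottom left corner
}
And with BMI2, the perfect-shuffle variant maps directly onto _pdep_u64:
#include <immintrin.h> // BMI2

static inline uint64_t outer_shuffle(uint64_t w) {
    return _pdep_u64(w >> 32, 0xAAAAAAAAAAAAAAAAull)  // high half -> odd bits
         | _pdep_u64(w, 0x5555555555555555ull);       // low half  -> even bits
}
static inline uint64_t transpose8x8_pdep(uint64_t w) {
    return outer_shuffle(outer_shuffle(outer_shuffle(w)));
}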
My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is that with the current definition of your matrix, the maximum size is 8x8 bits. This fits into a uint64_t, so we can use this to our advantage, especially when using a 64-bit platform.
I have worked out a simple example using a lookup table, which you can find below and run using the http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>

using std::cout;
using std::endl;
using std::bitset;

/* Static lookup table */
static uint64_t lut[256];

/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
    for(int i=0; i < N; ++i){
        cout << bitset<8>(arr[i]) << endl;
    }
}

/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
    assert(N <= 8);
    uint64_t value = 0;
    for(int i=0; i < N; ++i){
        value = (value << 1) + lut[matrix[i]];
    }
    /* Ensure safe copy to prevent misalignment issues */
    /* Can be removed if input array can be treated as uint64_t directly */
    for(int i=0; i < 8; ++i){
        transposed[i] = (value >> (i * 8)) & 0xFF;
    }
}

/* Calculate lookup table */
void calculate_lut(void){
    /* For all byte values */
    for(uint64_t i = 0; i < 256; ++i){
        auto b = std::bitset<8>(i);
        auto v = std::bitset<64>(0);
        /* For all bits in current byte */
        for(int bit=0; bit < 8; ++bit){
            if(b.test(bit)){
                v.set((7 - bit) * 8);
            }
        }
        lut[i] = v.to_ullong();
    }
}

int main()
{
    calculate_lut();
    const uint8_t matrix[] = {
        0b01010101,
        0b00110011,
        0b00001111,
    };
    uint8_t transposed[8];
    transpose_bitmatrix(matrix, transposed);
    print_arr(transposed);
    return 0;
}
How it works
Your 3x8 matrix will be transposed to an 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits from your "horizontal" representation to a vertical one, divided over several bytes.
As mentioned above, we can take advantage of the fact that the output (8x8) always fits into a uint64_t. We will use this to our advantage, because now we can use a uint64_t to write the 8-byte array, but we can also use it to add, xor, etc., because we can perform basic arithmetic operations on a 64-bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide, to optimize processing we first generate 256 entry lookup table (for each byte value). The entry itself is a uint64_t and will contain a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0101010100000100 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }
Now for the calculation:
For the calculations we use the uint64_t, but keep in mind that under the hood it represents a uint8_t[8] array. We simply shift the current value (starting with 0) left by one, look up the next byte and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table is either 1 or 0, so it only sets the least significant bit of each byte. Shifting the uint64_t shifts each byte, and as long as we do not do this more than 8 times we can operate on each byte individually.
Issues
As someone noted in the comments: Transpose(Transpose(M)) != M here, so if you need that you need some additional work.
Performance can be improved by mapping uint64_t's directly instead of uint8_t[8] arrays, since it omits the "safe copy" that prevents alignment issues.
I have added a new answer instead of editing my original one to make this more visible (no comment rights unfortunately).
In your own answer you add an additional requirement not present in the first one: it has to work on an ARM Cortex-M.
I did come up with an alternative solution for ARM in my original answer but omitted it, as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M 3/4 devices have a bit-banding region, which can be used for exactly what you need: it expands bits into 32-bit fields, and this region can be used to perform atomic bit operations.
If you put your array in a bit-banded region, it will have an 'exploded' mirror in the bit-band region where you can just use move operations on the bits themselves. If you make a loop, the compiler will surely be able to unroll and optimize it into just move operations.
If you really want to, you can even set up a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the CPU :)
Perhaps this might still help you.
This is a bit late, but I just stumbled across this interchange today. If you look at Hacker's Delight, 2nd Edition, there are several algorithms for efficiently transposing Boolean arrays, starting on page 141. They are quite efficient: a colleague of mine obtained a factor of about 10X speedup compared to naive coding, on an x86.
Here's what I posted on github (mischasan/sse2/ssebmx.src).
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>

#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]

void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
    int rr, cc, i, h;
    union { __m128i x; uint8_t b[16]; } tmp;

    // Do the main body in [16 x 8] blocks:
    for (rr = 0; rr <= nrows - 16; rr += 16)
        for (cc = 0; cc < ncols; cc += 8) {
            for (i = 0; i < 16; ++i)
                tmp.b[i] = INP(rr + i, cc);
            for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
                *(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
        }
    if (rr == nrows) return;

    // The remainder is a row of [8 x 16]* [8 x 8]?
    // Do the [8 x 16] blocks:
    for (cc = 0; cc <= ncols - 16; cc += 16) {
        for (i = 8; i--;)
            tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
            tmp.b[i + 8] = h >> 8;
        for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
            OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
            OUT(rr, cc + i + 8) = h >> 8;
    }
    if (cc == ncols) return;

    // Do the remaining [8 x 8] block:
    for (i = 8; i--;)
        tmp.b[i] = INP(rr + i, cc);
    for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
        OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.
Inspired by Robert's answer, polynomial multiplication in ARM NEON can be utilised to scatter the bits:
inline poly8x16_t mull_lo(poly8x16_t a) {
    auto b = vget_low_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
    auto b = vget_high_p8(a);
    return vreinterpretq_p8_p16(vmull_p8(b,b));
}

auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, making heavy use of ext q0,q0,8 for vget_high_p8.
An iterative approach would possibly be not only faster, but would also use fewer registers and vectorize for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows shown (left: input, right: output)
// x = a b|c d|e f|g h      a i|c k|e m|g o    | byte 0
//     i j|k l|m n|o p  ->  b j|d l|f n|h p    | byte 1
//     q r|s t|u v|w x      q A|s C|u E|w G    | byte 2
//     A B|C D|E F|G H      r B|t D|v F|x H    | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
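Collected into one self-contained function (same masks and shifts as the three steps above):
#include <stdint.h>

uint64_t transpose8x8_swar(uint64_t x) {
    auto a = x & 0x00aa00aa00aa00aaull;
    auto b = x & 0x5500550055005500ull;
    x = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);      // 2x2 bit blocks
    auto d = x & 0x0000cccc0000ccccull;
    auto e = x & 0x3333000033330000ull;
    x = (x & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);    // 2x2 blocks of those
    auto g = x & 0x00000000f0f0f0f0ull;
    auto h = x & 0x0f0f0f0f00000000ull;
    return (x & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28); // 4x4 halves
}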
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full

Optimize blockwise bit operations: base-4 numbers

This should be a fun question, at least for me.
My intent is to manipulate base-4 numbers, encoded in an unsigned integer. Each two-bit block then represents a single base-4 digit, starting from the least significant bit:
01 00 11 = base4(301)
I'd like to optimize my code using SSE instructions, because I'm not sure how well I scored here; maybe poorly.
The code starts from strings (and uses them to check the correctness), and implements:
convert string to binary
convert binary to string
reverse the number
Any hints are more than welcome!
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <cstdint>

uint32_t tobin(std::string s)
{
    uint32_t v, bin = 0;
    // Convert string to binary
    for (size_t i = 0; i < s.size(); i++)
    {
        switch (s[i])
        {
        case '0': v = 0; break;
        case '1': v = 1; break;
        case '2': v = 2; break;
        case '3': v = 3; break;
        default: throw "UNKNOWN!";
        }
        bin = bin | (v << (i << 1));
    }
    return bin;
}
std::string tostr(int size, const uint32_t v)
{
    std::string b;
    // Convert binary to string
    for (int i = 0; i < size; i++)
    {
        uint32_t shl = 3u << (i << 1);
        uint32_t shr = i << 1;
        uint32_t q = (v & shl) >> shr;
        switch (static_cast<unsigned char>(q))
        {
        case 0: b += '0'; break;
        case 1: b += '1'; break;
        case 2: b += '2'; break;
        case 3: b += '3'; break;
        default: throw "UNKNOWN!";
        }
    }
    return b;
}
uint32_t revrs(int size, const uint32_t v)
{
    uint32_t bin = 0;
    // Reverse the base-4 digits
    for (int i = 0; i < size; i++)
    {
        uint32_t q = (v >> (i << 1)) & 3;
        bin = bin | (q << ((size - i - 1) << 1));
    }
    return bin;
}

bool ckrev(std::string s1, std::string s2)
{
    std::reverse(s1.begin(), s1.end());
    return s1 == s2;
}

int main(int argc, char* argv[])
{
    // Binary representation of base-4 number
    uint32_t binr;
    std::vector<std::string> chk { "123", "2230131" };
    for (const auto &s : chk)
    {
        std::string b, r;
        uint32_t c;
        binr = tobin(s);
        b = tostr(s.size(), binr);
        c = revrs(s.size(), binr);
        r = tostr(s.size(), c);
        std::cout << "orig " << s << std::endl;
        std::cout << "binr " << std::hex << binr << " string " << b << std::endl;
        std::cout << "revs " << std::hex << c << " string " << r << std::endl;
        std::cout << ">>> CHK " << (s == b) << " " << ckrev(r, b) << std::endl;
    }
    return 0;
}
This is a little challenging with SSE because there is little provision for bit packing (you want to take two bits from every character and pack them contiguously). Anyway, the special instruction _mm_movemask_epi8 can help you.
For the string-to-binary conversion, you can proceed as follows:
load the 16 characters string (pad with zeroes or clear after the load if necessary);
subtract the ASCII '0' bytewise;
compare bytewise 'unsigned greater than' against a string of 16 '3' bytes; this will set a byte to 0xFF wherever there is an invalid character;
use _mm_movemask_epi8 to detect such a character in the packed short value.
If all is fine, you now need to pack the bit pairs. For this you need to
duplicate the 16 bytes
shift the bits of weight 1 and 2, left by 7 or 6 positions, to make them most significant (_mm_sll_epi16. There is no epi8 version, but bits from one element becoming garbage in the low bits of another element isn't important for this.)
interleave them (_mm_unpack..._epi8, once with lo and once with hi)
store the high bits of those two vectors into shorts with _mm_movemask_epi8.
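Put together, the conversion could look like this (a sketch of the steps above; since SSE2 has no unsigned byte compare, the range check is done with a saturating subtract instead, and str16_tobin is just a name I made up; input is 16 characters '0'..'3', output uses the same digit order as tobin):
#include <emmintrin.h> // SSE2
#include <stdint.h>

static inline bool str16_tobin(const char* s, uint32_t* out) {
    __m128i v = _mm_loadu_si128((const __m128i*)s);
    __m128i d = _mm_sub_epi8(v, _mm_set1_epi8('0'));
    // validate: every byte must be <= 3 after the subtraction
    __m128i ok = _mm_cmpeq_epi8(_mm_subs_epu8(d, _mm_set1_epi8(3)), _mm_setzero_si128());
    if (_mm_movemask_epi8(ok) != 0xFFFF) return false;
    // move bit 0 and bit 1 of every byte into that byte's MSB
    __m128i t0 = _mm_slli_epi16(d, 7); // cross-byte garbage stays below the MSBs
    __m128i t1 = _mm_slli_epi16(d, 6);
    // interleave so movemask reads the (bit0, bit1) pairs in string order
    uint32_t lo = (uint32_t)_mm_movemask_epi8(_mm_unpacklo_epi8(t0, t1));
    uint32_t hi = (uint32_t)_mm_movemask_epi8(_mm_unpackhi_epi8(t0, t1));
    *out = lo | (hi << 16);
    return true;
}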
For the binary-to-string conversion, I can't think of an SSE implementation that makes sense, as there is no counterpart of _mm_movemask_epi8 that would allow you to unpack efficiently.
I'll solve the problem of converting 32-bit integer to base4 string on SSE.
The problem of removing leading zeros is not considered, i.e. base4 strings always have length 16.
General thoughts
Clearly, we have to extract pairs of bits in vectorized form.
In order to do it, we can perform some byte manipulations and bitwise operations.
Let's see what we can do with SSE:
A single intrinsic _mm_shuffle_epi8 (from SSSE3) allows you to shuffle the 16 bytes in absolutely any way you desire.
Clearly, some well-structured shuffles and register mixing can be done with simpler instructions from SSE2,
but it's important to remember that any in-register byte shuffling can be done with one cheap instruction.
Shuffling does not help to change the position of bits within a byte.
In order to move chunks of bits around, we usually use bit shifts.
Unfortunately, there is no way in SSE to shift different elements of an XMM register by different amounts.
As @PeterCordes mentioned in the comments, there are such instructions in AVX2 (e.g. _mm_sllv_epi32), but they operate on at least 32-bit granularity.
From ancient times we have been taught that bit shifts are fast and multiplication is slow. Today arithmetic has been accelerated so much that this is no longer true: in SSE, shifts and multiplications seem to have equal throughput, although multiplications have more latency.
Using multiplication by powers of two we can shift left different elements of single XMM register by different amounts. There are many instructions like _mm_mulhi_epi16, which allow 16-bit granularity. Also one instruction _mm_maddubs_epi16 allows 8-bit granularity of shifts.
Right shift can be done via left shift just the same way people do division via multiplication: shift left by 16-k, then shift right by two bytes (recall that any byte shuffling is cheap).
We actually want to do 16 different bit shifts. If we use multiplication with 16-bit granularity, then we'll have to use at least two XMM registers for shifting, then they can be merged together. Also, we can try to use multiplication with 8-bit granularity to do everything in a single register.
16-bit granularity
First of all, we have to move 32-bit integer to the lower 4 bytes of XMM register. Then we shuffle bytes so that each 16-bit part of XMM register contains one byte of input:
|abcd|0000|0000|0000| before shuffle (little-endian)
|a0a0|b0b0|c0c0|d0d0| after shuffle (to low halves)
|0a0a|0b0b|0c0c|0d0d| after shuffle (to high halves)
Then we can call _mm_mulhi_epi16 to shift each part right by k = 1..16. Actually, it is more convenient to put input bytes into high halves of 16-bit elements, so that we can shift left by k = -8..7. As a result, we want to see some bytes of XMM register containing the pairs of bits defining some base4 digits (as their lower bits). After that we can remove unnecessary high bits by _mm_and_si128, and shuffle valuable bytes to proper places.
Since only 8 shifts can be done at once with 16-bit granularity, we have to do the shifting part twice. Then we combine the two XMM registers into one.
Below you can see the code using this idea. It is a bit optimized: there is no byte shuffling after the bit shifts.
__m128i reg = _mm_cvtsi32_si128(val);
__m128i bytes = _mm_shuffle_epi8(reg, _mm_setr_epi8(-1, 0, -1, 0, -1, 1, -1, 1, -1, 2, -1, 2, -1, 3, -1, 3));
__m128i even = _mm_mulhi_epu16(bytes, _mm_set1_epi32(0x00100100)); //epi16: 1<<8, 1<<4 x4 times
__m128i odd = _mm_mulhi_epu16(bytes, _mm_set1_epi32(0x04004000)); //epi16: 1<<14, 1<<10 x4 times
even = _mm_and_si128(even, _mm_set1_epi16(0x0003));
odd = _mm_and_si128(odd , _mm_set1_epi16(0x0300));
__m128i res = _mm_xor_si128(even, odd);
res = _mm_add_epi8(res, _mm_set1_epi8('0'));
_mm_storeu_si128((__m128i*)s, res);
8-bit granularity
First of all we move our 32-bit integer into XMM register of course. Then we shuffle bytes so that each byte of result equals the input byte containing the two bits wanted at that place:
|abcd|0000|0000|0000| before shuffle (little-endian)
|aaaa|bbbb|cccc|dddd| after shuffle
Now we use _mm_and_si128 to filter bits: in each byte, only the two wanted bits must remain. After that we only need to shift each byte right by 0/2/4/6 bits. This should be achieved with the intrinsic _mm_maddubs_epi16, which allows 16 bytes to be shifted at once. Unfortunately, I do not see how to shift all the bytes properly with this instruction alone, but at least we can shift each odd byte right by 2 bits (even bytes remain as they are). Then the bytes with indices 4k+2 and 4k+3 can be shifted right by 4 bits with a single _mm_madd_epi16 instruction.
Here is the resulting code:
__m128i reg = _mm_cvtsi32_si128(val);
__m128i bytes = _mm_shuffle_epi8(reg, _mm_setr_epi8(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3));
__m128i twobits = _mm_and_si128(bytes, _mm_set1_epi32(0xC0300C03)); //epi8: 3<<0, 3<<2, 3<<4, 3<<6 x4 times
twobits = _mm_maddubs_epi16(twobits, _mm_set1_epi16(0x4001)); //epi8: 1<<0, 1<<6 x8 times
__m128i res = _mm_madd_epi16(twobits, _mm_set1_epi32(0x10000001)); //epi16: 1<<0, 1<<12 x4 times
res = _mm_add_epi8(res, _mm_set1_epi8('0'));
_mm_storeu_si128((__m128i*)s, res);
P.S.
Both solutions use a lot of compile-time constant 128-bit values. They are not encoded into x86 instructions, so the processor has to load them from memory (most likely L1 cache) each time they are used. However, if you are going to run many conversions in a loop, the compiler should load all these constants into registers before the loop (I hope).
Here you can find the full code (without timing), including the implementation of the str2bin solution by @YvesDaoust.

Get Integer From Bits Inside `std::vector<char>`

I have a vector<char> and I want to be able to get an unsigned integer from a range of bits within the vector, but I can't seem to write the correct operations to get the desired output. My intended algorithm goes like this:
& the first byte with (0xff >> unused bits in byte on the left)
<< the result left the number of output bytes * number of bits in a byte
| this with the final output
For each subsequent byte:
<< left by the (byte width - index) * bits per byte
| this byte with the final output
| the final byte (not shifted) with the final output
>> the final output by the number of unused bits in the byte on the right
And here is my attempt at coding it, which does not give the correct result:
#include <vector>
#include <iostream>
#include <cstdint>
#include <bitset>

template<class byte_type = char>
class BitValues {
private:
    std::vector<byte_type> bytes;
public:
    static const auto bits_per_byte = 8;

    BitValues(std::vector<byte_type> bytes) : bytes(bytes) {
    }

    template<class return_type>
    return_type get_bits(int start, int end) {
        auto byte_start = (start - (start % bits_per_byte)) / bits_per_byte;
        auto byte_end = (end - (end % bits_per_byte)) / bits_per_byte;
        auto byte_width = byte_end - byte_start;
        return_type value = 0;

        unsigned char first = bytes[byte_start];
        first &= (0xff >> start % 8);
        return_type first_wide = first;
        first_wide <<= byte_width;
        value |= first_wide;

        for(auto byte_i = byte_start + 1; byte_i <= byte_end; byte_i++) {
            auto byte_offset = (byte_width - byte_i) * bits_per_byte;
            unsigned char next_thin = bytes[byte_i];
            return_type next_byte = next_thin;
            next_byte <<= byte_offset;
            value |= next_byte;
        }

        value >>= (((byte_end + 1) * bits_per_byte) - end) % bits_per_byte;
        return value;
    }
};

int main() {
    BitValues<char> bits(std::vector<char>({'\x78', '\xDA', '\x05', '\x5F', '\x8A', '\xF1', '\x0F', '\xA0'}));
    std::cout << bits.get_bits<unsigned>(15, 29) << "\n";
    return 0;
}
(In action: http://coliru.stacked-crooked.com/a/261d32875fcf2dc0)
I just can't seem to wrap my head around these bit manipulations, and I find debugging very difficult! If anyone can correct the above code, or help me in any way, it would be much appreciated!
Edit:
My bytes are 8 bits long
The integer to return could be 8, 16, 32 or 64 bits wide
The integer is stored in big endian
You made two primary mistakes. The first is here:
first_wide <<= byte_width;
You should be shifting by a bit count, not a byte count. Corrected code is:
first_wide <<= byte_width * bits_per_byte;
The second mistake is here:
auto byte_offset = (byte_width - byte_i) * bits_per_byte;
It should be
auto byte_offset = (byte_end - byte_i) * bits_per_byte;
The value in parentheses needs to be the number of bytes to shift left by, which is also the number of bytes byte_i is away from the end. The value byte_width - byte_i has no semantic meaning (one is a delta, the other is an index).
The rest of the code is fine. Though, this algorithm has two issues with it.
First, when using your result type to accumulate bits, you assume you have room on the left to spare. This isn't the case if there are set bits near the right boundary and the choice of range causes them to be shifted out. For example, try running
bits.get_bits<uint16_t>(11, 27);
You'll get the result 42, which corresponds to the bit string 00000000 00101010. The correct result is 53290 with the bit string 11010000 00101010. Notice how the rightmost 4 bits got zeroed out. This is because you start off by overshifting your value variable, causing those four bits to be shifted out of the variable. When shifting back at the end, this results in the bits being zeroed out.
The second problem has to do with the right shift at the end. If the leftmost bit of the value variable happens to be 1 before the final right shift, and the template parameter is a signed type, then the shift performed is an 'arithmetic' right shift, which 1-fills the bits on the left, leaving you with an incorrect negative value.
Example, try running:
bits.get_bits<int16_t>(5, 21);
The expected result should be 6976 with the bit string 00011011 01000000, but the current implementation returns -1216 with the bit string 11111011 01000000.
I've put my implementation below; it builds the bit string from right to left, placing bits in their correct positions to start with, so that the above two problems are avoided:
template<class ReturnType>
ReturnType get_bits(int start, int end) {
    int max_bits = kBitsPerByte * sizeof(ReturnType);
    if (end - start > max_bits) {
        start = end - max_bits;
    }
    int inclusive_end = end - 1;
    int byte_start = start / kBitsPerByte;
    int byte_end = inclusive_end / kBitsPerByte;

    // Put in the partial-byte on the right
    uint8_t first = bytes_[byte_end];
    int bit_offset = (inclusive_end % kBitsPerByte);
    first >>= 7 - bit_offset;
    bit_offset += 1;
    ReturnType ret = 0 | first;

    // Add the rest of the bytes
    for (int i = byte_end - 1; i >= byte_start; i--) {
        ReturnType tmp = (uint8_t) bytes_[i];
        tmp <<= bit_offset;
        ret |= tmp;
        bit_offset += kBitsPerByte;
    }

    // Mask out the partial byte on the left
    int shift_amt = (end - start);
    if (shift_amt < max_bits) {
        ReturnType mask = (ReturnType(1) << shift_amt) - 1; // widen before shifting
        ret &= mask;
    }
    return ret;
}
There is one thing you may have missed, I think: the way you index the bits in the vector is different from what you were given in the problem. I.e. with the algorithm you outlined, the order of the bits will be like 7 6 5 4 3 2 1 0 | 15 14 13 12 11 10 9 8 | 23 22 21 .... Frankly, I didn't read through your whole algorithm, but this was already off in the very first step.
Interesting problem. I've done something similar for some systems work.
Your char is 8 bits wide? Or 16? How big is your integer? 32 or 64?
Ignore the vector complexity for a minute.
Think about it as just an array of bits.
How many bits do you have? You have 8*number of chars
You need to calculate a starting char, number of bits to extract, ending char, number of bits there, and number of chars in the middle.
You will need bitwise-and & for the first partial char
you will need bitwise-and & for the last partial char
you will need left-shift << (or right-shift >>), depending upon which order you start from
what is the endian-ness of your Integer?
At some point you will calculate an index into your array that is bitindex/char_bit_width, you gave the value 171 as your bitindex, and 8 as your char_bit_width, so you will end up with these useful values calculated:
171/8 = 23 //location of first byte
171%8 = 3 //bits in first char/byte
8 - 171%8 = 5 //bits in last char/byte
sizeof(integer) = 4
sizeof(integer) + ( (171%8)>0?1:0 ) // how many array positions to examine
Some assembly required...

Fast way to "down-scale" a three-dimensional tensor index

This is a bit twiddling question for C or C++. I am running GCC 4.6.3 under Ubuntu 12.04.2.
I have a memory access index p for a three-dimensional tensor which has the form:
p = (i<<(2*N)) + (j<<N) + k
Here 0 <= i,j,k < (1<<N) and N some positive integer.
Now I want to compute a "down-scaled" memory access index for i>>S, j>>S, k>>S with 0 < S < N, which would be:
q = ((i>>S)<<(2*(N-S))) + ((j>>S)<<(N-S)) + (k>>S)
What is the fastest way to compute q from p (without knowing i,j,k beforehand)? We can assume that 0 < N <= 10 (i.e. p is a 32 bit integer). I would be especially interested in a fast approach for N=8 (i.e. i,j,k are 8 bit integers). N and S are both compile time constants.
An example for N=8 and S=4:
unsigned int p = 240407; // this is (3<<16) + (171<<8) + 23;
unsigned int q = 161; // this is (0<<8) + (10<<4) + 1
Straightforward way, 8 operations (others are operations on constants):
M = (1<<(N-S)) - 1; // A mask with S lowest bits.
q = ( ((p & (M<<(2*N+S))) >> (3*S)) // Mask 'i', shift to new position.
+ ((p & (M<<( N+S))) >> (2*S)) // Likewise for 'j'.
+ ((p & (M<< S)) >> S)); // Likewise for 'k'.
Looks complicated, but really isn't; it's just not easy (for me at least) to get all the constants correct.
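Plugging in the example values confirms the constants (N = 8, S = 4):
#include <cassert>
#include <cstdint>

int main() {
    constexpr unsigned N = 8, S = 4;
    constexpr uint32_t M = (1u << (N - S)) - 1;     // 0x0F
    uint32_t p = 240407;                            // (3<<16) + (171<<8) + 23
    uint32_t q = ((p & (M << (2*N + S))) >> (3*S))
               + ((p & (M << (N + S)))   >> (2*S))
               + ((p & (M << S))         >> S);
    assert(q == 161);                               // (0<<8) + (10<<4) + 1
}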
To create a formula with fewer operations, we observe that shifting a number left by U bits is the same as multiplying by 1<<U. Thus, due to distributivity, multiplying by ((1<<U1) + (1<<U2) + ...) is the same as shifting left by U1, U2, ... and adding everything together.
So, we could try to mask needed portions of i, j and k, "shift" them all to the correct positions relative to each other with one multiplication and then shift result to the right, to the final destination. This gives us three operations to compute q from p.
Unfortunately, there are limitations, especially for the case where we try to get all three at once. When we add numbers together (indirectly, by adding together several multipliers), we have to make sure that each bit can be set in only one of the numbers, or else we'll get a wrong result. If we try to add (indirectly) three properly shifted copies at once, we have this:
iiiii.....jjjjj.....kkkkk.....     (field widths N-S, separated by gaps of S)
.....jjjjj.....kkkkk..........     (shifted left by S more; the i bits fall off the top)
..........kkkkk...............     (shifted left by 2S more; the i and j bits fall off)
Note that farther to the left in the second and third numbers there are bits of i and j, but we ignore them. To do this, we assume that multiplication works as on x86: multiplying two values of type T gives a number of type T containing only the lowest bits of the exact result (equal to the exact result if there is no overflow).
So, to make sure that the k bits from the third number do not overlap with the j bits from the first, we need 3*(N-S) <= N, i.e. S >= 2*N/3, which for N = 8 limits us to S >= 6 (just one or two bits per component after shifting; I don't know if you would ever use that low a precision).
However, if S >= 2*N/3, we can use just 3 operations:
// Constant multiplier to perform three shifts at once.
F = (1<<(32-3*N)) + (1<<(32-3*N+S)) + (1<<(32-3*N+2*S));
// Mask, shift/combine with multiplier, right shift to destination.
q = (((p & ((M<<(2*N+S)) + (M<<(N+S)) + (M<<S))) * F)
    >> (32-3*(N-S)));
If the constraint for S is too strict (which it probably is), we can combine the first and second formulas: compute i and k with the second approach, then add j from the first formula. Here we need the bits not to overlap in the following two numbers:
iiiii...............kkkkk.......     (i and k masked from one copy; field widths N-S)
..........kkkkk...............       (the second copy's k, 2*(N-S) from the left)
I.e. 3*(N-S) <= 2*N, which gives S >= N/3, or, for N = 8, the much less strict S >= 3. The formula is as follows:
// Constant multiplier to perform two shifts at once.
F = (1<<(32-3*N)) + (1<<(32-3*N+2*S));
// Mask, shift/combine with multiplier, right shift to destination
// and then add 'j' from the straightforward formula.
q = ((((p & ((M<<(2*N+S)) + (M<<S))) * F) >> (32-3*(N-S)))
    + ((p & (M<<(N+S))) >> (2*S)));
This formula also works for your example where S = 4.
Whether this is faster than the straightforward approach depends on the architecture. Also, C++ only guarantees the assumed truncating overflow behavior for unsigned types, so you need to make sure the values are unsigned and exactly 32 bits wide for the formulas to work.
If you don't care about compatibility, for N = 8 you can get i, j, k like this:
int p = ....
unsigned char *bytes = (unsigned char *)&p;
Now k is bytes[0], j is bytes[1] and i is bytes[2] (I found little endian on my machine). But I think the better way is something like this (we have N_MASK = 2^N - 1):
int q;
q = ( p & N_MASK ) >> S;
p >>= N;
q |= ( ( p & N_MASK ) >> S ) << (N-S);   // scaled digits are N-S bits wide
p >>= N;
q |= ( ( p & N_MASK ) >> S ) << (2*(N-S)); // (<< S and << 2*S only coincide with this when N-S == S, as in N=8, S=4)
Does it meet your requirements?
#include <cstdint>
#include <iostream>

uint32_t to_q_from_p(uint32_t p, uint32_t N, uint32_t S)
{
    uint32_t mask = ~(~0u << N);
    uint32_t k = p & mask;
    uint32_t j = (p >> N) & mask;
    uint32_t i = (p >> 2*N) & mask;
    return ((i>>S) << (2*(N-S))) + ((j>>S) << (N-S)) + (k>>S);
}

int main()
{
    uint32_t p = 240407;
    uint32_t q = to_q_from_p(p, 8, 4);
    std::cout << q << '\n';
}
If you assume that N is always 8 and integers are little-endian, then it can be
uint32_t to_q_from_p(uint32_t p, uint32_t S)
{
    auto ptr = reinterpret_cast<uint8_t*>(&p);
    return ((ptr[2]>>S) << (2*(8-S))) + ((ptr[1]>>S) << (8-S)) + (ptr[0]>>S);
}

Vectorized extraction of a specific pattern of shorts from an array, and also insertion into a new array

I have an array of shorts where I want to grab half of the values and put them in a new array that is half the size. I want to grab particular values in this sort of pattern, where each block is 128 bits (8 shorts). This is the only pattern I will use; it doesn't need to be "any generic pattern"!
The values in white are discarded. My array sizes will always be a power of 2. Here's the vague idea of it, unvectorized:
unsigned short size = 1 << 8;
unsigned short* data = new unsigned short[size];
...
unsigned short* newdata = new unsigned short[size >>= 1];
unsigned int* uintdata = (unsigned int*) data;
unsigned int* uintnewdata = (unsigned int*) newdata;
for (unsigned short uintsize = size >> 1, i = 0; i < uintsize; ++i)
{
    uintnewdata[i] = (uintdata[i * 2] & 0xFFFF0000) | (uintdata[(i * 2) + 1] & 0x0000FFFF);
}
I started out with something like this:
static const __m128i startmask128 = _mm_setr_epi32(0xFFFF0000, 0x00000000, 0xFFFF0000, 0x00000000);
static const __m128i endmask128 = _mm_setr_epi32(0x00000000, 0x0000FFFF, 0x00000000, 0x0000FFFF);
__m128i* data128 = (__m128i*) data;
__m128i* newdata128 = (__m128i*) newdata;
and I can iteratively perform _mm_and_si128 with the masks to get the values I'm looking for, combine with _mm_or_si128, and put the results in newdata128[i]. However, I don't know how to "compress" things together and remove the values in white. And it seems if I could do that, I wouldn't need the masks at all.
How can that be done?
Anyway, eventually I will also want to do the opposite of this operation: create a new array of twice the size and spread the current values out within it.
I will also have new values to insert in the white blocks, which I would have to compute with each pair of shorts in the original data, iteratively. This computation would not be vectorizable, but the insertion of the resulting values should be. How could I "spread out" my current values into the new array, and what would be the best way to insert my computed values? Should I compute them all for each 128-bit iteration and put them into their own temp block (64 bit? 128 bit?), then do something to insert in bulk? Or should they be emplaced directly into my target __m128i, as it seems the cost should be equivalent to putting in a temp? If so, how could that be done without messing up my other values?
I would prefer to use SSE2 operations at most for this.
Here's an outline you can try:
Use the interleave instruction (_mm_unpackhi/lo_epi16) with a register containing zero to "spread out" your 16-bit values. Now you'll have two registers looking like B_R_B_R_.
Shift right creating _B_R_B_R
AND the R's out of the first version B___B___
AND the B's out of the second version ___R___R
OR together B__RB__R
In the other direction use _mm_packs_epi32 in the end after setting it up with shift/and/or.
Each direction should be 10 SSE instructions (not counting the constants setup, zero and the AND masks, and the load/store).
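For the compressing direction the question starts from, here is a sketch with SSE2 only (function names are mine). Note the sign-extension before _mm_packs_epi32: it keeps the saturating pack lossless for arbitrary shorts. This version keeps the selected shorts in their original order; the question's scalar sketch ORs the two halves in place, which swaps each pair, so add a _mm_shufflelo_epi16/_mm_shufflehi_epi16 at the end if you want that exact layout:
#include <emmintrin.h>

// Per 32-bit lane: even lanes keep their high short, odd lanes their low short,
// each sign-extended to 32 bits so the pack below cannot saturate.
static inline __m128i select_lane(__m128i v) {
    __m128i hi  = _mm_srli_epi32(v, 16);        // high short -> low half of lane
    __m128i sel = _mm_set_epi32(-1, 0, -1, 0);  // all-ones in lanes 1 and 3
    __m128i r   = _mm_or_si128(_mm_andnot_si128(sel, hi), _mm_and_si128(sel, v));
    return _mm_srai_epi32(_mm_slli_epi32(r, 16), 16);
}

// 16 shorts in (two registers), 8 shorts out.
static inline __m128i compress_pair(__m128i a, __m128i b) {
    return _mm_packs_epi32(select_lane(a), select_lane(b));
}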