Vectorizing bit packing in C++

I'm writing a tool for operations on long strings over an alphabet of 6 letters (e.g. more than 1,000,000 letters), so I'd like to encode each letter in fewer than eight bits (for 6 letters, 3 bits suffice).
Here is my code:
Rcpp::RawVector pack(Rcpp::RawVector UNPACKED,
                     const unsigned short ALPH_SIZE) {
  const unsigned int IN_LEN = UNPACKED.size();
  Rcpp::RawVector ret((ALPH_SIZE * IN_LEN + BYTE_SIZE - 1) / BYTE_SIZE);
  unsigned int out_byte = ZERO;
  unsigned short bits_left = BYTE_SIZE;
  for (int i = ZERO; i < IN_LEN; i++) {
    if (bits_left >= ALPH_SIZE) {
      ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
      bits_left -= ALPH_SIZE;
    } else {
      ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
      bits_left = ALPH_SIZE - bits_left;
      out_byte++;
      ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
      bits_left = BYTE_SIZE - bits_left;
    }
  }
  return ret;
}
I'm using Rcpp, an R interface to C++; a RawVector is essentially a vector of chars.
This code works perfectly, except that it is too slow: it operates bit by bit when it could presumably be vectorized somehow. Hence my question: is there any library or tool to do this? I'm not familiar with the C++ ecosystem.
Thanks in advance!

This code works just perfectly - except it is too slow.
Then you probably want to try 4 bits per letter, trading space for time. If 4 bits meets your compression needs (output just 33.3% larger than with 3 bits), your code can work on nibbles, which is much faster and simpler than handling 3-bit fields that straddle byte boundaries.

You need to unroll your loop so the optimizer can make something useful out of it. That will also get rid of your if, which kills any chance of good performance. Something like this:
int i = 0;
for (i = 0; i + 8 <= IN_LEN; i += 8) {
  // Same "first symbol in the high bits" layout as the remainder loop below,
  // so the two loops produce a consistent stream (assignment to a byte
  // truncates any bits shifted above bit 7).
  ret[out_byte]     = (UNPACKED[i]     << 5) | (UNPACKED[i + 1] << 2) | (UNPACKED[i + 2] >> 1);
  ret[out_byte + 1] = (UNPACKED[i + 2] << 7) | (UNPACKED[i + 3] << 4) | (UNPACKED[i + 4] << 1) | (UNPACKED[i + 5] >> 2);
  ret[out_byte + 2] = (UNPACKED[i + 5] << 6) | (UNPACKED[i + 6] << 3) |  UNPACKED[i + 7];
  out_byte += 3;
}
for (; i < IN_LEN; i++) {
  if (bits_left >= ALPH_SIZE) {
    ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
    bits_left -= ALPH_SIZE;
  } else {
    ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
    bits_left = ALPH_SIZE - bits_left;
    out_byte++;
    ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
    bits_left = BYTE_SIZE - bits_left;
  }
}
This will allow the optimizer to vectorize the whole thing (assuming it's smart enough). With your current implementation I doubt any compiler can work out that your code's bit pattern repeats every 3 written bytes and exploit that.
EDIT:
with sufficient constexpr / template magic you might be able to write a universal handler for the loop body, or just cover all the small values (e.g. a specialized template function for every bit count from 1 to, say, 16). Packing values bitwise beyond 16 bits is overkill.
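For instance, a sketch of one such specialized handler (my own illustration, not from the original answer): with the bit count as a template parameter, all the divisions, moduli and branches fold away at compile time and the loop fully unrolls.

```cpp
// Pack eight BITS-bit values into BITS output bytes (the bit pattern
// repeats after eight inputs), first value in the high bits, matching
// the layout of the question's loop. out must be zero-initialized.
template <unsigned BITS>
void pack_block(const unsigned char* in, unsigned char* out) {
    for (unsigned k = 0; k < 8; ++k) {     // unrolled by the compiler
        unsigned pos  = k * BITS;          // absolute output bit position
        unsigned byte = pos / 8, off = pos % 8;
        unsigned left = 8 - off;           // bits still free in this byte
        if (left >= BITS) {
            out[byte] |= in[k] << (left - BITS);
        } else {                           // value straddles two bytes
            out[byte]     |= in[k] >> (BITS - left);
            out[byte + 1] |= in[k] << (8 - (BITS - left));
        }
    }
}
```

Instantiating `pack_block<3>` gives the 3-bit packer from the question; `pack_block<4>` degenerates into simple nibble writes.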

Related

Extract and combine bits from different bytes in C/C++

I have declared an array of bytes:
uint8_t memory[123];
which I have filled with:
memory[0]=0xFF;
memory[1]=0x00;
memory[2]=0xFF;
memory[3]=0x00;
memory[4]=0xFF;
And now I get requests from the user for specific bits. For example, I receive a request for the bits in positions 10:35, and I must return those bits combined into bytes. In that case I need 4 bytes, which contain:
response[0]=0b11000000;
response[1]=0b00111111;
response[2]=0b11000000;
response[3]=0b00000011; //padded with zeros for excess bits
This will be used for Modbus, which is a big-endian protocol. I have come up with the following code:
for (int j = findByteINIT; j < findByteFINAL; j++) {
  aux[0] = (unsigned char)(memory[j] >> (startingbit - (8 * findByteINIT)));
  aux[1] = (unsigned char)(memory[j + 1] << (startingbit - (8 * findByteINIT)));
  response[h] = (unsigned char)(aux[0] | aux[1]);
  h++;
  aux[0] = 0x00; // clean aux
  aux[1] = 0x00;
}
which does not work but should be close to the ideal solution. Any suggestions?
I think this should do it:
int start_bit = 10, end_bit = 35; // input
int start_byte = start_bit / CHAR_BIT;
int shift = start_bit % CHAR_BIT;
int response_size = (end_bit - start_bit + (CHAR_BIT - 1)) / CHAR_BIT;
int zero_padding = response_size * CHAR_BIT - (end_bit - start_bit + 1);
for (int i = 0; i < response_size; ++i) {
  // Note: reads one byte past the last byte of the range, so the source
  // buffer must extend at least one byte beyond it.
  response[i] =
      static_cast<uint8_t>((memory[start_byte + i] >> shift) |
                           (memory[start_byte + i + 1] << (CHAR_BIT - shift)));
}
response[response_size - 1] &= static_cast<uint8_t>(~0) >> zero_padding;
If the input is a starting bit and a bit count instead of a starting bit and an (inclusive) end bit, you can use exactly the same code, computing the above end_bit as:
int start_bit = 10, count = 9; // input
int end_bit = start_bit + count - 1;
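As a quick sanity check, the logic above can be wrapped in a function and run against the example from the question (memory bytes FF 00 FF 00 FF, bits 10:35). The harness is my own sketch:

```cpp
#include <climits>
#include <cstdint>

// The answer's extraction, wrapped in a function for testing. response must
// have room for one byte per started CHAR_BIT of the requested range, and
// memory must extend one byte past the last byte of the range.
void extract_bits(const uint8_t* memory, int start_bit, int end_bit,
                  uint8_t* response) {
    int start_byte = start_bit / CHAR_BIT;
    int shift = start_bit % CHAR_BIT;
    int response_size = (end_bit - start_bit + (CHAR_BIT - 1)) / CHAR_BIT;
    int zero_padding = response_size * CHAR_BIT - (end_bit - start_bit + 1);
    for (int i = 0; i < response_size; ++i) {
        response[i] = static_cast<uint8_t>(
            (memory[start_byte + i] >> shift) |
            (memory[start_byte + i + 1] << (CHAR_BIT - shift)));
    }
    response[response_size - 1] &= static_cast<uint8_t>(~0) >> zero_padding;
}
```

Running it on the question's data reproduces the four expected response bytes, including the zero-padded last one.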

Reading binary integers

const unsigned char* p;
int64_t u = ...; // ??
What's the recommended way to read a 64-bit binary little endian integer from the 8 bytes pointed to by p?
On x64 a single machine instruction should do, but on big-endian hardware swaps are needed.
How does one do this both optimally and portably?
Carl's solution is good and portable enough, but not optimal. Which raises the question: why don't C and C++ provide a better, standardized way to do this? It's not an uncommon construct.
The commonly seen:
u = (int64_t)(((uint64_t)p[0] <<  0)
            + ((uint64_t)p[1] <<  8)
            + ((uint64_t)p[2] << 16)
            + ((uint64_t)p[3] << 24)
            + ((uint64_t)p[4] << 32)
            + ((uint64_t)p[5] << 40)
            + ((uint64_t)p[6] << 48)
            + ((uint64_t)p[7] << 56));
is pretty much the only game in town for portability: it's otherwise tough to avoid potential alignment problems.
This answer does assume an 8-bit char. If you might need to support different sized chars, you'll need a preprocessor definition that checks CHAR_BIT and does the right thing for each.
Carl Norum is right. If you also want readability, you can write a loop (the compiler will unroll it anyway). This also deals nicely with non-8-bit chars.
u = 0;
const int n = 64 / CHAR_BIT + !!(64 % CHAR_BIT);
for (int i = 0; i < n; i++) {
  u += (uint64_t)p[i] << (i * CHAR_BIT);
}
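For example, the loop packaged as a function (a sketch; the test values assume 8-bit chars):

```cpp
#include <climits>
#include <cstdint>

// Portable read of a 64-bit little-endian integer, one byte at a time.
// Works regardless of the host's endianness or alignment requirements.
uint64_t read_le64(const unsigned char* p) {
    uint64_t u = 0;
    const int n = 64 / CHAR_BIT + !!(64 % CHAR_BIT);
    for (int i = 0; i < n; i++) {
        u += (uint64_t)p[i] << (i * CHAR_BIT);
    }
    return u;
}
```

On mainstream compilers at -O2 this loop typically compiles down to a single unaligned load on little-endian targets.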
I used the following code to reverse the byte order of any variable; I use it to convert between different endiannesses.
// Reverses the order of bytes in the specified data
void ReverseBytes(LPBYTE pData, int nSize)
{
int i, j;
for (i = 0, j = nSize - 1; i < j; i++, j--)
{
BYTE nTemp = pData[i];
pData[i] = pData[j];
pData[j] = nTemp;
}
}
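Setting the Windows-specific BYTE/LPBYTE types aside, the same swap is a one-liner with the standard library (a portable sketch):

```cpp
#include <algorithm>
#include <cstdint>

// Portable equivalent of ReverseBytes above, using std::reverse.
inline void reverse_bytes(uint8_t* data, int size) {
    std::reverse(data, data + size);
}
```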

Is there any code optimization method for the following C++ program

BYTE *srcData;
BYTE *pData;
int i, j;
int srcPadding;
// some variable initialization
for (int r = 0; r < h; r++, srcData += srcPadding)
{
  for (int col = 0; col < w; col++, pData += 4, srcData += 3)
  {
    memcpy(pData, srcData, 3);
  }
}
I've tried loop unrolling, but it helps little.
int segs = w / 4;
int remain = w - segs * 4;
for (int r = 0; r < h; r++, srcData += srcPadding)
{
  int idx = 0;
  for (idx = 0; idx < segs; idx++, pData += 16, srcData += 12)
  {
    memcpy(pData, srcData, 3);
    *(pData + 3) = 0xFF;
    memcpy(pData + 4, srcData + 3, 3);
    *(pData + 7) = 0xFF;
    memcpy(pData + 8, srcData + 6, 3);
    *(pData + 11) = 0xFF;
    memcpy(pData + 12, srcData + 9, 3);
    *(pData + 15) = 0xFF;
  }
  for (idx = 0; idx < remain; idx++, pData += 4, srcData += 3)
  {
    memcpy(pData, srcData, 3);
    *(pData + 3) = 0xFF;
  }
}
Depending on your compiler, you may not want memcpy at all for such a small copy. Here is a variant version for the body of your unrolled loop; see if it's faster:
uint32_t in0 = *(uint32_t*)(srcData);
uint32_t in1 = *(uint32_t*)(srcData + 4);
uint32_t in2 = *(uint32_t*)(srcData + 8);
uint32_t out0 = UINT32_C(0xFF000000) | (in0 & UINT32_C(0x00FFFFFF));
uint32_t out1 = UINT32_C(0xFF000000) | (in0 >> 24) | ((in1 & 0xFFFF) << 8);
uint32_t out2 = UINT32_C(0xFF000000) | (in1 >> 16) | ((in2 & 0xFF) << 16);
uint32_t out3 = UINT32_C(0xFF000000) | (in2 >> 8);
*(uint32_t*)(pData) = out0;
*(uint32_t*)(pData + 4) = out1;
*(uint32_t*)(pData + 8) = out2;
*(uint32_t*)(pData + 12) = out3;
You should also declare srcData and pData as BYTE * restrict pointers so the compiler knows they don't alias (restrict is a C99 keyword; in C++ use your compiler's extension, e.g. __restrict).
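One caveat: the pointer casts above technically violate strict aliasing. A sketch of the same trick with memcpy for the loads and stores (compilers turn these into single word moves; the bit layout still assumes a little-endian target, as the original does):

```cpp
#include <cstdint>
#include <cstring>

// Expand 4 RGB pixels (12 bytes) into 4 RGBA pixels (16 bytes, alpha 0xFF)
// using word-wide operations, without aliasing casts. Little-endian layout.
inline void rgb4_to_rgba4(const unsigned char* src, unsigned char* dst) {
    uint32_t in0, in1, in2;
    std::memcpy(&in0, src, 4);
    std::memcpy(&in1, src + 4, 4);
    std::memcpy(&in2, src + 8, 4);
    uint32_t out0 = UINT32_C(0xFF000000) | (in0 & UINT32_C(0x00FFFFFF));
    uint32_t out1 = UINT32_C(0xFF000000) | (in0 >> 24) | ((in1 & 0xFFFF) << 8);
    uint32_t out2 = UINT32_C(0xFF000000) | (in1 >> 16) | ((in2 & 0xFF) << 16);
    uint32_t out3 = UINT32_C(0xFF000000) | (in2 >> 8);
    std::memcpy(dst, &out0, 4);
    std::memcpy(dst + 4, &out1, 4);
    std::memcpy(dst + 8, &out2, 4);
    std::memcpy(dst + 12, &out3, 4);
}
```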
I don't see much that you're doing that isn't necessary. You could change the post-increments to pre-increments (idx++ to ++idx, for instance), but that won't have a measurable effect.
Additionally, you could use std::copy instead of memcpy. std::copy has more information available to it and in theory can pick the most efficient way to copy things. Unfortunately I don't believe that many STL implementations actually take advantage of the extra information.
The only thing that I expect would make a difference is that there's no reason to wait for one memcpy to finish before starting the next. You could use OpenMP or Intel Threading Building Blocks (or a thread queue of some kind) to parallelize the loops.
Don't call memcpy, just do the copy by hand. The function call overhead isn't worth it unless you can copy more than 3 bytes at a time.
As far as this particular loop goes, you may want to look at a technique called Duff's device, which is a loop-unrolling technique that takes advantage of the switch construct.
Maybe changing to a single while loop instead of nested for loops:
BYTE *src = srcData;
BYTE *dest = pData;
BYTE *srcEnd = srcData + h * (w * 3 + srcPadding);
int offset = 0;
int maxoffset = w * 3;
while (src + offset < srcEnd) {
  *dest++ = *(src + offset++);
  *dest++ = *(src + offset++);
  *dest++ = *(src + offset++);
  dest++;
  if (offset >= maxoffset) {      // end of row: skip the padding
    src += maxoffset + srcPadding;
    offset = 0;
  }
}

Packing data into arrays as fast as possible

I'm starting with an array of 100,000 bytes where only the lower 6 bits of each byte carry useful data. I need to pack that data into an array of 75,000 bytes as fast as possible, preserving the order of the data.
unsigned int Joinbits(unsigned int in) {}
// 00111111 00111111 00111111 00111111
// 000000 001111 111122 222222
void pack6(
    register unsigned char *o,
    register unsigned char const *i,
    unsigned char const *end
)
{
  while (i != end)
  {
    *o++ = *i << 2u | *(i + 1) >> 4u;          ++i;
    *o++ = (*i & 0xfu) << 4u | *(i + 1) >> 2u; ++i;
    *o++ = (*i & 0x3u) << 6u | *(i + 1);       i += 2;
  }
}
This will fail if the input length is not divisible by 4, and it assumes the high 2 bits of each input byte are zero.
Completely portable. It performs 6 reads for every 4 input bytes, so 50% more reads than strictly needed; however, the processor cache and the compiler's optimizer may help. Trying to cache reads in a variable may even be counter-productive; only an actual measurement can tell.
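A minimal harness exercising pack6 (repeated here, with the output parameter as a pointer, so the snippet is self-contained):

```cpp
// Pack groups of four 6-bit values into three bytes, first value in the
// high bits. The input length must be divisible by 4.
void pack6(unsigned char* o, unsigned char const* i,
           unsigned char const* end) {
    while (i != end) {
        *o++ = *i << 2u | *(i + 1) >> 4u;          ++i;
        *o++ = (*i & 0xfu) << 4u | *(i + 1) >> 2u; ++i;
        *o++ = (*i & 0x3u) << 6u | *(i + 1);       i += 2;
    }
}
```

Feeding it the pattern 0x3F, 0x00, 0x3F, 0x00 makes the byte-straddling visible: the 6-bit fields land at bit offsets 0, 6, 12 and 18 of the 24-bit output group.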
for (int pos = 0; pos < 100000; pos += 4)
{
  // Note: writes 4 bytes but advances out by only 3, so consecutive stores
  // overlap; assumes a little-endian target and one spare byte at the end.
  *(int*)out = (in[0] & 0x3F) | ((in[1] & 0x3F) << 6) | ((in[2] & 0x3F) << 12) | ((in[3] & 0x3F) << 18);
  in += 4;
  out += 3;
}
This is C; I don't know C++. It's probably still full of bugs and by no means the fastest way, but I wanted to have a go because it seemed like a fun challenge to learn something, so please hit me with what I did wrong! :D
unsigned char unpacked[100000];
unsigned char packed[75000];
int out = 0;
for (int i = 0; i < 100000; i += 8) {
  /* 8 six-bit values -> 48 bits -> 4 + 2 output bytes */
  unsigned int fourBytes = unpacked[i];
  fourBytes += (unsigned int)unpacked[i + 1] << 6;
  fourBytes += (unsigned int)unpacked[i + 2] << 12;
  fourBytes += (unsigned int)unpacked[i + 3] << 18;
  fourBytes += (unsigned int)unpacked[i + 4] << 24;
  fourBytes += (unsigned int)unpacked[i + 5] << 30; /* low 2 bits only */
  unsigned short twoBytes = unpacked[i + 5] >> 2;
  twoBytes += unpacked[i + 6] << 4;
  twoBytes += unpacked[i + 7] << 10;
  memcpy(packed + out, &fourBytes, 4);     /* little-endian layout assumed */
  memcpy(packed + out + 4, &twoBytes, 2);
  out += 6;
}

Fast way to determine the rightmost set bit in a 64-bit value

I'm trying to determine the position of the rightmost set bit:
if (value & (1ULL << 0)) { return 0; }
if (value & (1ULL << 1)) { return 1; }
if (value & (1ULL << 2)) { return 2; }
...
if (value & (1ULL << 63)) { return 63; }
The comparison may need to be done up to 64 times. Is there any faster way?
If you're using GCC, use the __builtin_ctz or __builtin_ffs function. (http://gcc.gnu.org/onlinedocs/gcc-4.4.0/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fffs-2894)
If you're using MSVC, use the _BitScanForward function. See How to use MSVC intrinsics to get the equivalent of this GCC code?.
In POSIX there's also a ffs function. (http://linux.die.net/man/3/ffs)
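For example, with the GCC/Clang builtin (note that __builtin_ctzll is undefined for a zero argument, so that case has to be guarded):

```cpp
// Index of the lowest set bit via the GCC/Clang count-trailing-zeros
// intrinsic; returns -1 when no bit is set.
int lowest_set_bit(unsigned long long v) {
    return v ? __builtin_ctzll(v) : -1;
}
```

On x86-64 this compiles to a single bsf/tzcnt instruction plus the zero check.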
There's a little trick for this:
value & -value
This uses the two's complement representation of negative numbers.
Edit: this doesn't quite give the exact result asked for in the question; it isolates the bit rather than returning its index. The rest can be done with a small lookup table.
You could use a loop:
unsigned int value;
unsigned int temp_value;
const unsigned int BITS_IN_INT = sizeof(int) * CHAR_BIT;
unsigned int index = 0;

// Make a copy of the value, to alter.
temp_value = value;
for (index = 0; index < BITS_IN_INT; ++index)
{
  if (temp_value & 1)
  {
    break;
  }
  temp_value >>= 1;
}
return index;
This takes up less code space than the chain of if statements, with similar functionality.
KennyTM's suggestions are good if your compiler supports them. Otherwise, you can speed it up using a binary search, something like:
int result = 0;
if (!(value & 0xffffffff)) {
  result += 32;
  value >>= 32;
}
if (!(value & 0xffff)) {
  result += 16;
  value >>= 16;
}
and so on. This will do 6 comparisons (in general, log(N) comparisons, versus N for a linear search).
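Fleshed out into a full function (a sketch; it returns 64 for a zero input so callers can detect "no bit set"):

```cpp
#include <cstdint>

// Binary search for the index of the lowest set bit: 6 tests instead of
// up to 64. Returns 64 when no bit is set.
int rightmost_set_bit(uint64_t value) {
    if (value == 0) return 64;
    int result = 0;
    if (!(value & 0xffffffffULL)) { result += 32; value >>= 32; }
    if (!(value & 0xffffULL))     { result += 16; value >>= 16; }
    if (!(value & 0xffULL))       { result += 8;  value >>= 8; }
    if (!(value & 0xfULL))        { result += 4;  value >>= 4; }
    if (!(value & 0x3ULL))        { result += 2;  value >>= 2; }
    if (!(value & 0x1ULL))        { result += 1; }
    return result;
}
```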
b = n & (-n); // isolates the lowest set bit
b--;          // this gives 1's to the right: just the trailing bits that need counting
b = (b & 0x5555555555555555) + ((b >> 1) & 0x5555555555555555);  // 2-bit sums of 1-bit numbers
b = (b & 0x3333333333333333) + ((b >> 2) & 0x3333333333333333);  // 4-bit sums of 2-bit numbers
b = (b & 0x0f0f0f0f0f0f0f0f) + ((b >> 4) & 0x0f0f0f0f0f0f0f0f);  // 8-bit sums of 4-bit numbers
b = (b & 0x00ff00ff00ff00ff) + ((b >> 8) & 0x00ff00ff00ff00ff);  // 16-bit sums of 8-bit numbers
b = (b & 0x0000ffff0000ffff) + ((b >> 16) & 0x0000ffff0000ffff); // 32-bit sums of 16-bit numbers
b = (b & 0x00000000ffffffff) + ((b >> 32) & 0x00000000ffffffff); // sum of the 32-bit halves
b &= 63; // otherwise an input of 0 would produce 64 for a result
This is in C of course.
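Assembled into a function for reference (my own sketch; note the single decrement, the unsigned 64-bit type, and the ULL suffixes on the masks):

```cpp
#include <cstdint>

// Count trailing zeros via the chain above: isolate the lowest set bit,
// subtract 1 to get trailing 1's, then popcount them bitwise.
// The final mask maps the n == 0 case (64 ones) back to 0.
int lowest_bit_index(uint64_t n) {
    uint64_t b = n & (~n + 1); // two's-complement trick: lowest set bit
    b -= 1;                    // 1's in every position to the right of it
    b = (b & 0x5555555555555555ULL) + ((b >> 1) & 0x5555555555555555ULL);
    b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL);
    b = (b & 0x0f0f0f0f0f0f0f0fULL) + ((b >> 4) & 0x0f0f0f0f0f0f0f0fULL);
    b = (b & 0x00ff00ff00ff00ffULL) + ((b >> 8) & 0x00ff00ff00ff00ffULL);
    b = (b & 0x0000ffff0000ffffULL) + ((b >> 16) & 0x0000ffff0000ffffULL);
    b = (b & 0x00000000ffffffffULL) + ((b >> 32) & 0x00000000ffffffffULL);
    return (int)(b & 63);
}
```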
Here's another method that takes advantage of short-circuiting in logical OR operations and of the instruction pipeline.
unsigned long long value;
unsigned long long temp_value = value;
bool bit_found = false;
unsigned int index = 0;

bit_found = bit_found || ((temp_value & (1ULL << index++)) != 0); // bit 0
bit_found = bit_found || ((temp_value & (1ULL << index++)) != 0); // bit 1
bit_found = bit_found || ((temp_value & (1ULL << index++)) != 0); // bit 2
bit_found = bit_found || ((temp_value & (1ULL << index++)) != 0); // bit 3
//...
bit_found = bit_found || ((temp_value & (1ULL << index++)) != 0); // bit 63

return index - 1; // index stops advancing once a set bit is found
The advantage of this method is that there are no loops and the instruction pipeline is disturbed very little. It is fast on processors that perform conditional execution of instructions.
Works for Visual C++ 6:
int toErrorCodeBit(__int64 value) {
  const int low_double_word = value;
  int result = 0;
  __asm
  {
    bsf eax, low_double_word
    jz  low_double_value_0
    mov result, eax
  }
  return result;

low_double_value_0:
  const int upper_double_word = value >> 32;
  __asm
  {
    bsf eax, upper_double_word
    mov result, eax
  }
  result += 32;
  return result;
}