I'm starting with an array of 100,000 bytes where only the lower 6 bits in each byte have useful data. I need to pack that data into an array of 75,000 bytes as fast as possible, preserving the order of the data.
unsigned int Joinbits(unsigned int in) {}
// 00111111 00111111 00111111 00111111
// 000000 001111 111122 222222
void pack6(
    unsigned char *o,
    unsigned char const *i,
    unsigned char const *end
)
{
    while(i!=end)
    {
        *o++ = *i << 2u | *(i+1) >> 4u; ++i;           // 6 bits of byte 0 + top 2 bits of byte 1
        *o++ = (*i & 0xfu) << 4u | *(i+1) >> 2u; ++i;  // low 4 bits of byte 1 + top 4 bits of byte 2
        *o++ = (*i & 0x3u) << 6u | *(i+1); i+=2;       // low 2 bits of byte 2 + 6 bits of byte 3
    }
}
Will fail if input length is not divisible by 4. Assumes high 2 bits of input are zero.
Completely portable. It performs 6 reads for every 4 input bytes, so 50% more reads than strictly necessary; however, the processor cache and the compiler's optimiser may hide that cost. Caching each byte in a variable to save the repeated read may be counter-productive; only an actual measurement can tell.
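For comparison when measuring, here is a minimal sketch of a variant that reads each input byte exactly once and assembles a 24-bit group before storing it. The function name pack6_once and the length parameter n are made up for this illustration; the sketch assumes n is a multiple of 4 and produces the same byte layout as pack6 above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: packs groups of four 6-bit values into three bytes,
// first value in the highest bits. Assumes n is a multiple of 4. The top two
// bits of each input byte are masked off rather than trusted.
void pack6_once(std::uint8_t* out, const std::uint8_t* in, std::size_t n) {
    for (std::size_t k = 0; k + 4 <= n; k += 4) {
        const std::uint32_t a = in[k] & 0x3Fu;
        const std::uint32_t b = in[k + 1] & 0x3Fu;
        const std::uint32_t c = in[k + 2] & 0x3Fu;
        const std::uint32_t d = in[k + 3] & 0x3Fu;
        const std::uint32_t v = (a << 18) | (b << 12) | (c << 6) | d;  // 24 bits
        *out++ = static_cast<std::uint8_t>(v >> 16);
        *out++ = static_cast<std::uint8_t>(v >> 8);
        *out++ = static_cast<std::uint8_t>(v);
    }
}
```

Whether this beats the version above depends on the compiler and the target, so it is worth timing both.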
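for(int pos=0; pos<100000; pos+=4)
{
    // Four 6-bit values go into the low 24 bits, first value in the lowest bits.
    unsigned int v = (in[0] & 0x3F) | ((in[1] & 0x3F)<<6) | ((in[2] & 0x3F)<<12) | ((in[3] & 0x3F)<<18);
    // Store exactly 3 bytes, low byte first. A 4-byte store through (int*)out
    // would be unaligned and would write one byte past the end on the last pass.
    out[0] = v & 0xFF;
    out[1] = (v >> 8) & 0xFF;
    out[2] = (v >> 16) & 0xFF;
    in += 4;
    out += 3;
}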
This is C, I don't know C++. And is probably filled with bugs, and is by no means the fastest way, it probably isn't even fast. But I wanted to just have a go, because it seemed like a fun challenge to learn something, so please hit me with what I did wrong! :D
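#include <string.h>

unsigned char unpacked[100000];
unsigned char packed[75000];
/* 8 six-bit values = 48 bits = 6 output bytes per iteration */
for (int i = 0, o = 0; i < 100000; i += 8, o += 6) {
    unsigned int fourBytes = unpacked[i];
    fourBytes |= (unsigned int)unpacked[i + 1] << 6;
    fourBytes |= (unsigned int)unpacked[i + 2] << 12;
    fourBytes |= (unsigned int)unpacked[i + 3] << 18;
    fourBytes |= (unsigned int)unpacked[i + 4] << 24;
    fourBytes |= (unsigned int)unpacked[i + 5] << 30;  /* only the low 2 bits fit */
    unsigned short twoBytes = unpacked[i + 5] >> 2;    /* the remaining 4 bits */
    twoBytes |= unpacked[i + 6] << 4;
    twoBytes |= unpacked[i + 7] << 10;
    memcpy(packed + o, &fourBytes, 4);      /* byte order in memory follows the host */
    memcpy(packed + o + 4, &twoBytes, 2);
}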
I had already asked how to get 4 int8_t values into a 32-bit int, and I was told that I have to cast each int8_t to a uint8_t first in order to pack it into a 32-bit integer with bit shifting.
int8_t offsetX = -10;
int8_t offsetY = 120;
int8_t offsetZ = -60;
using U = std::uint8_t;
int toShader = (U(offsetX) << 24) | (U(offsetY) << 16) | (U(offsetZ) << 8) | (0 << 0);
std::cout << (int)(toShader >> 24) << " "<< (int)(toShader >> 16) << " " << (int)(toShader >> 8) << std::endl;
My output is:
-10 -2440 -624444
It's not what I expected, of course. Does anyone have a solution?
In the shader I want to unpack the int16 later, and that is only possible with a 32-bit integer because core GLSL does not have smaller integer types.
int offsetX = data[gl_InstanceID * 3 + 2] >> 24;
int offsetY = data[gl_InstanceID * 3 + 2] >> 16 ;
int offsetZ = data[gl_InstanceID * 3 + 2] >> 8 ;
What is written in the square brackets does not matter; the question is about the correct shifting of the bits, or the casting, after the brackets.
If any of the offsets is negative, then the shift results in undefined behaviour.
Solution: Convert the offsets to an unsigned type first.
However, this brings another potential problem: if you convert to a wide unsigned type, negative numbers become very large values with bits set in the most significant bytes, and an OR with those bits will always produce 1 there regardless of offsetX and offsetY. One solution is to convert to a small unsigned type (std::uint8_t), and another is to mask off the unused bytes. The former is probably simpler:
using U = std::uint8_t;
int third = U(offsetX) << 24u
| U(offsetY) << 16u
| U(offsetZ) << 8u
| 0u << 0u;
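As a sketch of the same idea with the round trip included, the helpers below pack three signed bytes and read one back. The names pack_offsets and unpack_offset are invented for this example, and the word is kept as std::uint32_t (an assumption made here, not something the question requires) so that no shift ever touches the sign bit of an int.

```cpp
#include <cstdint>

// Hypothetical helper: packs three signed 8-bit offsets into bits 31..8.
// Going through uint8_t keeps only the low 8 bits of each value, and widening
// to uint32_t before shifting avoids shifting into the sign bit of int.
std::uint32_t pack_offsets(std::int8_t x, std::int8_t y, std::int8_t z) {
    return (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x)) << 24) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(y)) << 16) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(z)) << 8);
}

// Hypothetical helper: recovers one signed byte; shift is 24, 16 or 8.
// The uint8_t -> int8_t conversion is implementation-defined before C++20,
// but wraps as expected on mainstream compilers.
std::int8_t unpack_offset(std::uint32_t packed, int shift) {
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(packed >> shift));
}
```

Printing the unpacked values requires a cast to int, otherwise the stream treats an int8_t as a character.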
I think you're forgetting to mask the bits that you care about before shifting them.
Perhaps this is what you're looking for:
int32 offsetX = (data[gl_InstanceID * 3 + 2] & 0xFF000000) >> 24;
int32 offsetY = (data[gl_InstanceID * 3 + 2] & 0x00FF0000) >> 16 ;
int32 offsetZ = (data[gl_InstanceID * 3 + 2] & 0x0000FF00) >> 8 ;
if (offsetX & 0x80) offsetX |= 0xFFFFFF00;
if (offsetY & 0x80) offsetY |= 0xFFFFFF00;
if (offsetZ & 0x80) offsetZ |= 0xFFFFFF00;
Without the bit mask, the X part will end up in offsetY, and the X and Y part in offsetZ.
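The mask-then-sign-extend step can also be written without a branch. The sketch below is C++ rather than GLSL, uses a made-up function name (extract_signed_byte), and assumes the packed word is available as an unsigned 32-bit value.

```cpp
#include <cstdint>

// Hypothetical helper: extracts the byte at the given shift (24, 16 or 8) and
// sign-extends it to 32 bits. For an 8-bit field v, (v ^ 0x80) - 0x80 maps
// 0..127 to itself and 128..255 to -128..-1.
std::int32_t extract_signed_byte(std::uint32_t word, int shift) {
    const std::int32_t v = static_cast<std::int32_t>((word >> shift) & 0xFFu);
    return (v ^ 0x80) - 0x80;
}
```

The same expression uses only integer operators, so it should carry over to shader code, though that is untested here.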
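On the CPU side you can use a union to avoid bit shifts, bit masking and branches. (Reading a union member other than the one last written is allowed in C but formally undefined in C++, although compilers such as GCC document it as supported.)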
int8_t x,y,z,w; // your 8bit ints
int32_t i; // your 32bit int
union my_union // just helper union for the casting
{
int8_t i8[4];
int32_t i32;
} a;
// 4x8bit -> 32bit
a.i8[0]=x;
a.i8[1]=y;
a.i8[2]=z;
a.i8[3]=w;
i=a.i32;
// 32bit -> 4x8bit
a.i32=i;
x=a.i8[0];
y=a.i8[1];
z=a.i8[2];
w=a.i8[3];
If you do not like unions, the same can be done with pointers.
Beware that neither is possible on the GLSL side (it has no unions and no pointers), so there you have to use bit shifts and masks as in the other answer.
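A third CPU-side option, not mentioned above, is std::memcpy, which sidesteps the type-punning question entirely. This is a minimal sketch with invented function names; like the union, it produces a byte order that depends on the host.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterprets four signed bytes as one 32-bit int.
// b[0] lands in the lowest-addressed byte, so the numeric value depends on
// the host's endianness.
std::int32_t bytes_to_int32(const std::int8_t (&b)[4]) {
    std::int32_t i;
    std::memcpy(&i, b, sizeof i);
    return i;
}

// Hypothetical helper: the inverse of bytes_to_int32.
void int32_to_bytes(std::int32_t i, std::int8_t (&b)[4]) {
    std::memcpy(b, &i, sizeof i);
}
```

Compilers typically turn a fixed-size memcpy like this into a single load or store.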
I have a vector which holds byte data (chars) received from a socket. This data holds different data types that I want to extract. For example, the first 8 elements (8 bytes) of the vector are a uint64_t. Now I want to convert these first 8 bytes to a single uint64_t.
A workaround I've found is:
// recv_buffer is the vector containing the received Bytes
std::vector<uint64_t> frame_number(recv_buffer.begin(), recv_buffer.begin() + sizeof(uint64_t));
uint64_t frame_num = frame_number.at(0);
Is there a way to extract the data without creating a new vector?
This is an effective method:
C/C++:
uint64_t hexToUint64(char *data, int32_t offset){
uint64_t num = 0;
for (int32_t i = offset; i < offset + 8; i++) {
num = (num << 8) + (data[i] & 0xFF);
}
return num;
}
Java:
long hexToUint64(byte[] data, int offset){
return
((long)data[offset++] << 56 & 0xFF00000000000000L) |
((long)data[offset++] << 48 & 0xFF000000000000L) |
((long)data[offset++] << 40 & 0xFF0000000000L) |
((long)data[offset++] << 32 & 0xFF00000000L) |
((long)data[offset++] << 24 & 0xFF000000L) |
((long)data[offset++] << 16 & 0xFF0000L) |
((long)data[offset++] << 8 & 0xFF00L) |
((long)data[offset++] & 0xFFL);
}
JavaScript:
function hexToUint64(data, offset) {
let num = 0;
let multiple = 0x100000000000000;
for (let i = offset; i < offset + 8; i++ , multiple /= 0x100) {
num += (data[i] & 0xFF) * multiple;
}
return num;
}
One normally uses memcpy or similar to copy into a properly aligned variable, and then a function such as ntohl to convert the number from network byte order to host byte order. ntohl is not part of the C++ specification, but exists on Linux, Windows and others regardless. Note that ntohl itself only handles 32-bit values; a 64-bit value needs a 64-bit counterpart such as be64toh on Linux.
uint64_t frame_num;
std::copy(recv_buffer.begin(), recv_buffer.begin() + sizeof(uint64_t), reinterpret_cast<char*>(&frame_num));
//or memcpy(&frame_num, recv_buffer.data(), sizeof(frame_num));
frame_num = be64toh(frame_num); // 64-bit counterpart of ntohl, from <endian.h> on Linux
It is tempting to do this for a struct that represents an entire network header, but since C++ compilers can inject padding bytes into structs, and it's undefined to write to the padding, it's better to do this one primitive at a time.
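If depending on a platform-specific byte-swap function is undesirable, the conversion can be done with shifts, which works the same on any host. A minimal sketch, assuming the buffer is a std::vector<char> holding the value in network (big-endian) order; read_be64 is a name made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reads 8 bytes starting at offset as a big-endian
// uint64_t. The caller must ensure offset + 8 <= buf.size().
std::uint64_t read_be64(const std::vector<char>& buf, std::size_t offset) {
    std::uint64_t v = 0;
    for (std::size_t k = 0; k < 8; ++k) {
        // Go through uint8_t so a negative char does not sign-extend.
        v = (v << 8) | static_cast<std::uint8_t>(buf[offset + k]);
    }
    return v;
}
```

Reading one primitive at a time like this also avoids the struct padding issue described above.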
You could perform the conversion byte by byte like this:
#include <cstdint>
#include <iostream>

int main()
{
    // Little-endian input: bytesArray[0] is the least significant byte
    unsigned char bytesArray[8] = {0x05, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
    uint64_t intVal = 0;
    // Start with the most significant byte; each step shifts it up by 8 bits
    for (int i = 7; i >= 0; --i)
        intVal = (intVal << 8) + bytesArray[i];
    std::cout << intVal; // prints 5
    return 0;
}
I suggest doing the following:
uint64_t frame_num = *((uint64_t*)recv_buffer.data());
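You should of course first verify that the amount of data you have in recv_buffer is at least sizeof(frame_num) bytes. Note that the cast also assumes the buffer is suitably aligned for uint64_t and that the bytes are already in host byte order; memcpy avoids the alignment and aliasing concerns.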
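The size check and the copy can be combined in one small function. This sketch uses std::memcpy in place of the pointer cast and leaves the bytes in host order; the name read_u64_host is an assumption for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies sizeof(uint64_t) bytes from the front of the
// buffer, in host byte order. Returns false and leaves out untouched when the
// buffer is too short.
bool read_u64_host(const std::vector<char>& recv_buffer, std::uint64_t& out) {
    if (recv_buffer.size() < sizeof out) return false;
    std::memcpy(&out, recv_buffer.data(), sizeof out);
    return true;
}
```

If the sender uses network byte order, the result still needs converting on little-endian hosts.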
I'm writing a tool for operations on long strings of 6 different letters (e.g. >1000000 letters), so I'd like to encode each letter in fewer than eight bits (for 6 letters, 3 bits are sufficient).
Here is my code:
Rcpp::RawVector pack(Rcpp::RawVector UNPACKED,
const unsigned short ALPH_SIZE) {
const unsigned int IN_LEN = UNPACKED.size();
Rcpp::RawVector ret((ALPH_SIZE * IN_LEN + BYTE_SIZE - 1) / BYTE_SIZE);
unsigned int out_byte = ZERO;
unsigned short bits_left = BYTE_SIZE;
for (int i = ZERO; i < IN_LEN; i++) {
if (bits_left >= ALPH_SIZE) {
ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
bits_left -= ALPH_SIZE;
} else {
ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
bits_left = ALPH_SIZE - bits_left;
out_byte++;
ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
bits_left = BYTE_SIZE - bits_left;
}
}
return ret;
}
I'm using Rcpp, which is an R interface for C++. RawVector is in effect a vector of chars.
This code works just perfectly - except it is too slow. I'm performing operations bit by bit while I could vectorize it somehow. So here is my question: is there any library or tool to do this? I'm not familiar with C++ tools.
Thanks in advance!
"This code works just perfectly - except it is too slow."
Then you probably want to try 4 bits per letter, trading space for time. If 4 bits meets your compression needs (the packed data is just 33.3% larger than at 3 bits), then your code works on nibbles, which will be much faster and simpler than 3-bit groups.
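A minimal sketch of the nibble approach follows. It uses std::vector<std::uint8_t> as a stand-in for Rcpp::RawVector, which is an assumption made to keep the example self-contained, and pack_nibbles is an invented name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs values in 0..15 two per byte, first value in the
// high nibble. An odd trailing value goes in the high nibble with a zero low
// nibble.
std::vector<std::uint8_t> pack_nibbles(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out((in.size() + 1) / 2, 0);
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        out[i / 2] = static_cast<std::uint8_t>(((in[i] & 0x0F) << 4) | (in[i + 1] & 0x0F));
    }
    if (in.size() % 2 != 0) {
        out.back() = static_cast<std::uint8_t>((in.back() & 0x0F) << 4);
    }
    return out;
}
```

Every output byte depends on exactly two inputs and there is no carried state, which is the kind of loop compilers tend to vectorize well.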
You need to unroll your loop so that the optimizer can make something useful out of it. Unrolling also gets rid of your if, which kills any chance of good performance. Something like this:
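// Fast path for ALPH_SIZE == 3: 8 letters (24 bits) become 3 bytes per iteration.
// Note: this puts the first letter in the lowest bits, whereas the original loop
// fills each byte from the highest bits down; pick one order and use it throughout.
int i = 0;
for(i = 0; i + 8 <= IN_LEN; i += 8) {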
ret[out_byte ] = (UNPACKED[i] ) | (UNPACKED[i + 1] << 3) | (UNPACKED[i + 2] << 6);
ret[out_byte + 1] = (UNPACKED[i + 2] >> 2) | (UNPACKED[i + 3] << 1) | (UNPACKED[i + 4] << 4) | (UNPACKED[i + 5] << 7);
ret[out_byte + 2] = (UNPACKED[i + 5] >> 1) | (UNPACKED[i + 6] << 2) | (UNPACKED[i + 7] << 5);
out_byte += 3;
}
for (; i < IN_LEN; i++) {
if (bits_left >= ALPH_SIZE) {
ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
bits_left -= ALPH_SIZE;
} else {
ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
bits_left = ALPH_SIZE - bits_left;
out_byte++;
ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
bits_left = BYTE_SIZE - bits_left;
}
}
This allows the optimizer to vectorize the whole thing (assuming it is smart enough). With your current implementation I doubt any current compiler can work out that your code repeats the same pattern after every 3 written bytes and exploit that.
EDIT:
With sufficient constexpr / template magic you might be able to write a universal handler for the body of the loop. Or just cover all small values (for example, write a specialized template function for every bit count from 1 to, say, 16). Packing values bitwise beyond 16 bits is overkill.
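One possible shape for such a handler is a function template with the bit count as a compile-time parameter, so every shift amount is a constant the optimizer can see. This is a rough sketch, not a tuned implementation: it uses std::vector<std::uint8_t> instead of Rcpp::RawVector, only covers 1 to 8 bits per value, and the name pack_bits is invented. It packs from the highest bits down, like the original loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs BITS-wide values, first value in the highest
// bits of the first byte, using a small accumulator.
template <unsigned BITS>
std::vector<std::uint8_t> pack_bits(const std::vector<std::uint8_t>& in) {
    static_assert(BITS >= 1 && BITS <= 8, "sketch only covers 1..8 bits per value");
    std::vector<std::uint8_t> out((in.size() * BITS + 7) / 8, 0);
    std::uint32_t acc = 0;  // pending bits, right-aligned (stale bits above are ignored)
    unsigned pending = 0;   // number of valid bits in acc
    std::size_t o = 0;
    for (std::uint8_t v : in) {
        acc = (acc << BITS) | (v & ((1u << BITS) - 1u));
        pending += BITS;
        while (pending >= 8) {  // emit every completed byte
            pending -= 8;
            out[o++] = static_cast<std::uint8_t>(acc >> pending);
        }
    }
    if (pending > 0) {  // flush the last partial byte, padded with zero bits
        out[o] = static_cast<std::uint8_t>(acc << (8 - pending));
    }
    return out;
}
```

For example, pack_bits<3> on the letters 1, 2, 3, 4, 5, 0, 1, 2 yields the three bytes 0x29, 0xCA, 0x0A.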
I have declared an array of bytes:
uint8_t memory[123];
which I have filled with:
memory[0]=0xFF;
memory[1]=0x00;
memory[2]=0xFF;
memory[3]=0x00;
memory[4]=0xFF;
And now I get requests from the user for specific bits. For example, when I receive a request to send the bits in positions 10 to 35 (inclusive), I must return those bits combined into bytes. In that case I would need 4 bytes, which contain:
response[0]=0b11000000;
response[1]=0b00111111;
response[2]=0b11000000;
response[3]=0b00000011; //padded with zeros for excess bits
This will be used for Modbus which is a big-endian protocol. I have come up with the following code:
for(int j=findByteINIT;j<(findByteFINAL);j++){
aux[0]=(unsigned char) (memory[j]>>(startingbit-(8*findByteINIT)));
aux[1]=(unsigned char) (memory[j+1]<<(startingbit-(8*findByteINIT)));
response[h]=(unsigned char) (aux[0] | aux[1] );
h++;
aux[0]=0x00;//clean aux
aux[1]=0x00;
}
which does not work but should be close to the ideal solution. Any suggestions?
I think this should do it.
int start_bit = 10, end_bit = 35; // input
int start_byte = start_bit / CHAR_BIT;
int shift = start_bit % CHAR_BIT;
int response_size = (end_bit - start_bit + CHAR_BIT) / CHAR_BIT; // ceil(bit_count / CHAR_BIT), where bit_count = end_bit - start_bit + 1
int zero_padding = response_size * CHAR_BIT - (end_bit - start_bit + 1);
for (int i = 0; i < response_size; ++i) {
response[i] =
static_cast<uint8_t>((memory[start_byte + i] >> shift) |
(memory[start_byte + i + 1] << (CHAR_BIT - shift)));
}
response[response_size - 1] &= static_cast<uint8_t>(~0) >> zero_padding;
If the input is a starting bit and a number of bits instead of a starting bit and an (inclusive) end bit, then you can use exactly the same code, but compute the above end_bit using:
int start_bit = 10, count = 9; // input
int end_bit = start_bit + count - 1;
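Wrapped up as a function, the approach might look like the sketch below. The name extract_bits and the memory_size parameter are assumptions added here so that the read of the following byte never runs past the array; bit 0 is taken to be the least significant bit of memory[0], as in the example above, and end_bit must not be less than start_bit.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns bits start_bit..end_bit (inclusive) packed into
// bytes, with the unused high bits of the last byte zeroed.
std::vector<std::uint8_t> extract_bits(const std::uint8_t* memory, std::size_t memory_size,
                                       std::size_t start_bit, std::size_t end_bit) {
    const std::size_t bit_count = end_bit - start_bit + 1;
    const std::size_t start_byte = start_bit / CHAR_BIT;
    const unsigned shift = static_cast<unsigned>(start_bit % CHAR_BIT);
    std::vector<std::uint8_t> response((bit_count + CHAR_BIT - 1) / CHAR_BIT);
    for (std::size_t i = 0; i < response.size(); ++i) {
        const std::size_t lo_index = start_byte + i;
        // Bytes beyond the end of memory are treated as zero.
        const unsigned lo = lo_index < memory_size ? memory[lo_index] : 0u;
        const unsigned hi = lo_index + 1 < memory_size ? memory[lo_index + 1] : 0u;
        response[i] = static_cast<std::uint8_t>((lo >> shift) | (hi << (CHAR_BIT - shift)));
    }
    const unsigned padding = static_cast<unsigned>(response.size() * CHAR_BIT - bit_count);
    response.back() &= static_cast<std::uint8_t>(0xFFu >> padding);
    return response;
}
```

With the memory contents from the question, extract_bits(memory, 123, 10, 35) gives 0b11000000, 0b00111111, 0b11000000, 0b00000011.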
BYTE * srcData;
BYTE * pData;
int i,j;
int srcPadding;
//some variable initialization
for (int r = 0;r < h;r++,srcData+= srcPadding)
{
for (int col = 0;col < w;col++,pData += 4,srcData += 3)
{
memcpy(pData,srcData,3);
}
}
I've tried loop unrolling, but it helps little.
int segs = w / 4;
int remain = w - segs * 4;
for (int r = 0;r < h;r++,srcData+= srcPadding)
{
int idx = 0;
for (idx = 0;idx < segs;idx++,pData += 16,srcData += 12)
{
memcpy(pData,srcData,3);
*(pData + 3) = 0xFF;
memcpy(pData + 4,srcData + 3,3);
*(pData + 7) = 0xFF;
memcpy(pData + 8,srcData + 6,3);
*(pData + 11) = 0xFF;
memcpy(pData + 12,srcData + 9,3);
*(pData + 15) = 0xFF;
}
for (idx = 0;idx < remain;idx++,pData += 4,srcData += 3)
{
memcpy(pData,srcData,3);
*(pData + 3) = 0xFF;
}
}
Depending on your compiler, you may not want memcpy at all for such a small copy. Here is a variant of the body of your unrolled loop (it assumes a little-endian machine that tolerates unaligned 32-bit loads and stores); see if it's faster:
uint32_t in0 = *(uint32_t*)(srcData);
uint32_t in1 = *(uint32_t*)(srcData + 4);
uint32_t in2 = *(uint32_t*)(srcData + 8);
uint32_t out0 = UINT32_C(0xFF000000) | (in0 & UINT32_C(0x00FFFFFF));
uint32_t out1 = UINT32_C(0xFF000000) | (in0 >> 24) | ((in1 & 0xFFFF) << 8);
uint32_t out2 = UINT32_C(0xFF000000) | (in1 >> 16) | ((in2 & 0xFF) << 16);
uint32_t out3 = UINT32_C(0xFF000000) | (in2 >> 8);
*(uint32_t*)(pData) = out0;
*(uint32_t*)(pData + 4) = out1;
*(uint32_t*)(pData + 8) = out2;
*(uint32_t*)(pData + 12) = out3;
You should also declare srcData and pData as BYTE * restrict pointers (restrict is a C99 keyword; C++ compilers such as GCC, Clang and MSVC offer __restrict instead) so the compiler will know they don't alias.
I don't see much that you're doing that isn't necessary. You could change the post-increments to pre-increments (idx++ to ++idx, for instance), but that won't have a measurable effect.
Additionally, you could use std::copy instead of memcpy. std::copy has more information available to it and in theory can pick the most efficient way to copy things. Unfortunately I don't believe that many STL implementations actually take advantage of the extra information.
The only thing that I expect would make a difference is that there's no reason to wait for one memcpy to finish before starting the next. You could use OpenMP or Intel Threading Building Blocks (or a thread queue of some kind) to parallelize the loops.
Don't call memcpy, just do the copy by hand. The function call overhead isn't worth it unless you can copy more than 3 bytes at a time.
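A hand-written copy could look like the following sketch. It uses std::uint8_t in place of BYTE and an invented function name, and it assumes, as in the unrolled version from the question, that the fourth byte of each output pixel should be 0xFF.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: expands w*h 3-byte pixels into 4-byte pixels, setting
// the fourth byte to 0xFF. src_padding is the number of extra bytes at the
// end of each source row.
void expand_rgb_to_rgba(const std::uint8_t* src, std::uint8_t* dst,
                        int w, int h, int src_padding) {
    for (int r = 0; r < h; ++r, src += src_padding) {
        for (int col = 0; col < w; ++col, src += 3, dst += 4) {
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
            dst[3] = 0xFF;
        }
    }
}
```

Many compilers already inline a 3-byte memcpy, so the gain from this is worth measuring rather than assuming.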
As far as this particular loop goes, you may want to look at a technique called Duff's device, which is a loop-unrolling technique that takes advantage of the switch construct.
Maybe changing to a while loop instead of nested for loops:
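BYTE *src = srcData;
BYTE *dest = pData;
BYTE *srcEnd = srcData + h*(w*3+srcPadding); // one past the last source row
int offset = 0;
int maxoffset = w*3;                         // bytes of pixel data per row
while (src+offset < srcEnd) {
    *dest++ = *(src+offset++);
    *dest++ = *(src+offset++);
    *dest++ = *(src+offset++);
    dest++;                                  // skip the 4th byte of the output pixel
    if (offset >= maxoffset) {               // end of row: skip the copied pixels plus padding
        src += maxoffset + srcPadding;
        offset = 0;
    }
}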