BYTE * srcData;
BYTE * pData;
int i,j;
int srcPadding;
//some variable initialization
for (int r = 0;r < h;r++,srcData+= srcPadding)
{
for (int col = 0;col < w;col++,pData += 4,srcData += 3)
{
memcpy(pData,srcData,3);
}
}
I've tried loop unrolling, but it helps little.
int segs = w / 4;
int remain = w - segs * 4;
for (int r = 0;r < h;r++,srcData+= srcPadding)
{
int idx = 0;
for (idx = 0;idx < segs;idx++,pData += 16,srcData += 12)
{
memcpy(pData,srcData,3);
*(pData + 3) = 0xFF;
memcpy(pData + 4,srcData + 3,3);
*(pData + 7) = 0xFF;
memcpy(pData + 8,srcData + 6,3);
*(pData + 11) = 0xFF;
memcpy(pData + 12,srcData + 9,3);
*(pData + 15) = 0xFF;
}
for (idx = 0;idx < remain;idx++,pData += 4,srcData += 3)
{
memcpy(pData,srcData,3);
*(pData + 3) = 0xFF;
}
}
Depending on your compiler, you may not want memcpy at all for such a small copy. Here is a variant version for the body of your unrolled loop; see if it's faster:
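// Note: this assumes a little-endian target and that unaligned 32-bit loads/stores are acceptable.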
uint32_t in0 = *(uint32_t*)(srcData);
uint32_t in1 = *(uint32_t*)(srcData + 4);
uint32_t in2 = *(uint32_t*)(srcData + 8);
uint32_t out0 = UINT32_C(0xFF000000) | (in0 & UINT32_C(0x00FFFFFF));
uint32_t out1 = UINT32_C(0xFF000000) | (in0 >> 24) | ((in1 & 0xFFFF) << 8);
uint32_t out2 = UINT32_C(0xFF000000) | (in1 >> 16) | ((in2 & 0xFF) << 16);
uint32_t out3 = UINT32_C(0xFF000000) | (in2 >> 8);
*(uint32_t*)(pData) = out0;
*(uint32_t*)(pData + 4) = out1;
*(uint32_t*)(pData + 8) = out2;
*(uint32_t*)(pData + 12) = out3;
You should also declare srcData and pData as BYTE * restrict pointers so the compiler will know they don't alias.
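For example, a sketch of that (standard C++ has no restrict keyword, so this uses the common __restrict extension; BYTE is assumed to be a typedef for unsigned char):
void convertRows(BYTE * __restrict pData, const BYTE * __restrict srcData,
                 int w, int h, int srcPadding)
{
    for (int r = 0; r < h; ++r, srcData += srcPadding)
    {
        for (int col = 0; col < w; ++col, pData += 4, srcData += 3)
        {
            pData[0] = srcData[0];  // copy the three colour bytes
            pData[1] = srcData[1];
            pData[2] = srcData[2];
            pData[3] = 0xFF;        // fill the fourth byte
        }
    }
}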
I don't see much that you're doing that isn't necessary. You could change the post-increments to pre-increments (idx++ to ++idx, for instance), but that won't have a measurable effect.
Additionally, you could use std::copy instead of memcpy. std::copy has more information available to it and in theory can pick the most efficient way to copy things. Unfortunately I don't believe that many STL implementations actually take advantage of the extra information.
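For example (needs <algorithm>), the per-pixel copy becomes:
std::copy(srcData, srcData + 3, pData);  // same effect as memcpy(pData, srcData, 3)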
The only thing that I expect would make a difference is that there's no reason to wait for one memcpy to finish before starting the next. You could use OpenMP or Intel Threading Building Blocks (or a thread queue of some kind) to parallelize the loops.
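A sketch of the OpenMP variant, parallelizing over rows; the running pointers have to become per-row offsets so the iterations are independent (this assumes the destination rows are exactly w*4 bytes with no extra padding):
#pragma omp parallel for
for (int r = 0; r < h; ++r)
{
    const BYTE *s = srcData + r * (w * 3 + srcPadding);
    BYTE *d = pData + r * (w * 4);
    for (int col = 0; col < w; ++col, s += 3, d += 4)
    {
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = 0xFF;
    }
}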
Don't call memcpy; just do the copy by hand. The function call overhead isn't worth it unless you can copy more than 3 bytes at a time.
As far as this particular loop goes, you may want to look at a technique called Duff's device, which is a loop-unrolling technique that takes advantage of the switch construct.
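For illustration, a rough sketch of Duff's device applied to the per-row pixel loop (COPY_PIXEL is a made-up helper macro here; assumes w > 0):
#define COPY_PIXEL() do { pData[0] = srcData[0]; pData[1] = srcData[1]; \
                          pData[2] = srcData[2]; pData[3] = 0xFF;       \
                          pData += 4; srcData += 3; } while (0)

int n = (w + 3) / 4;          // number of passes through the switch body
switch (w % 4)
{
case 0: do { COPY_PIXEL();
case 3:      COPY_PIXEL();
case 2:      COPY_PIXEL();
case 1:      COPY_PIXEL();
        } while (--n > 0);
}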
Maybe change to a while loop instead of nested for loops:
BYTE *src = srcData;
BYTE *dest = pData;
const BYTE *maxsrc = srcData + h*(w*3 + srcPadding);  // end of the source image
int offset = 0;
const int maxoffset = w*3;                            // bytes of pixel data per row
while (src + offset < maxsrc) {
    *dest++ = *(src + offset++);
    *dest++ = *(src + offset++);
    *dest++ = *(src + offset++);
    dest++;                       // skip the destination's 4th byte
    if (offset >= maxoffset) {    // end of row: step over the padding
        src += maxoffset + srcPadding;
        offset = 0;
    }
}
I've written the code below to convert and store the data from a string (an array of chars) called str into an array of 16-bit integers called arr16bit.
The code works. However, I'd say that there's a better or cleaner way to implement this logic, using fewer variables, etc.
I don't want to use the index i to get the modulus % 2, because for little endian I have the same algorithm but i starts at the last index of the string and counts down instead of up. Any recommendations are appreciated.
// assuming str had already been initialised before this ..
int strLength = CalculateStringLength(str); // function implementation not shown
uint16_t* arr16bit = new uint16_t[ (strLength /2) + 1]; // the only C++ feature used here, so I didn't want to tag it
int indexWrite = 0;
int counter = 0;
for(int i = 0; i < strLength; ++i)
{
arr16bit[indexWrite] <<= 8;
arr16bit[indexWrite] |= str[i];
if ( (counter % 2) != 0)
{
indexWrite++;
}
counter++;
}
Yes, there are some redundant variables here.
You have both counter and i which do exactly the same thing and always hold the same value. And you have indexWrite which is always exactly half (per integer division) of both of them.
You're also shifting too far (16 bits rather than 8).
const std::size_t strLength = CalculateStringLength(str);
std::vector<uint16_t> arr16bit((strLength/2) + 1);
for (std::size_t i = 0; i < strLength; ++i)
{
arr16bit[i/2] <<= 8;
arr16bit[i/2] |= str[i];
}
Though I'd probably do it more like this to avoid N redundant |= operations:
const std::size_t strLength = CalculateStringLength(str);
std::vector<uint16_t> arr16bit((strLength/2) + 1);
for (std::size_t i = 0; i + 1 < strLength; i += 2)
{
arr16bit[i/2] = (str[i] << 8) | str[i+1];
}
if (strLength % 2 != 0)
{
arr16bit[strLength/2] = (str[strLength-1] << 8);  // odd tail: high byte only
}
You may also wish to consider a simple std::copy over the whole dang buffer, if your endianness is right for it.
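For reference, that bulk copy would look something like this (a raw memcpy rather than an element-wise std::copy, which would widen each char into its own uint16_t; needs <cstring>):
std::vector<uint16_t> arr16bit((strLength / 2) + 1);
std::memcpy(arr16bit.data(), str, strLength);  // only valid if the byte order already matches; odd lengths half-fill the last element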
I have a vector which holds byte data (chars) received from a socket. This data holds different data types I want to extract. E.g. the first 8 elements (8 bytes) of the vector are a uint64_t. Now I want to convert these first 8 bytes to a single uint64_t.
A workaround I've found is:
// recv_buffer is the vector containing the received Bytes
std::vector<uint64_t> frame_number(recv_buffer.begin(), recv_buffer.begin() + sizeof(uint64_t));
uint64_t frame_num = frame_number.at(0);
Is there a way to extract the data without creating a new vector?
This is an effective method:
C/C++:
uint64_t hexToUint64(char *data, int32_t offset){
uint64_t num = 0;
for (int32_t i = offset; i < offset + 8; i++) {
num = (num << 8) + (data[i] & 0xFF);
}
return num;
}
Java:
long hexToUint64(byte[] data, int offset){
return
((long)data[offset++] << 56 & 0xFF00000000000000L) |
((long)data[offset++] << 48 & 0xFF000000000000L) |
((long)data[offset++] << 40 & 0xFF0000000000L) |
((long)data[offset++] << 32 & 0xFF00000000L) |
((long)data[offset++] << 24 & 0xFF000000L) |
((long)data[offset++] << 16 & 0xFF0000L) |
((long)data[offset++] << 8 & 0xFF00L) |
((long)data[offset++] & 0xFFL);
}
JavaScript:
function hexToUint64(data, offset) {
let num = 0;
let multiple = 0x100000000000000;
for (let i = offset; i < offset + 8; i++ , multiple /= 0x100) {
num += (data[i] & 0xFF) * multiple;
}
return num;
}
One normally uses memcpy or similar to a properly aligned structure, and then ntohl to convert a number from network byte order to computer byte order. ntohl is not part of the C++ specification, but exists in Linux and Windows and others regardless.
uint64_t frame_num;
std::copy(recv_buffer.begin(), recv_buffer.begin() + sizeof(uint64_t),
          reinterpret_cast<char*>(&frame_num));
// or: memcpy(&frame_num, recv_buffer.data(), sizeof(frame_num));
frame_num = be64toh(frame_num); // 64-bit byte-order swap; ntohl only handles 32 bits (be64toh is glibc/BSD, Windows has ntohll)
It is tempting to do this for a struct that represents an entire network header, but since C++ compilers can inject padding bytes into structs, and it's undefined to write to the padding, it's better to do this one primitive at a time.
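For illustration, a sketch of the field-at-a-time approach for a hypothetical header layout (the struct and field names are made up; be64toh is the glibc/BSD 64-bit swap, Windows calls it ntohll):
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>   // ntohl, ntohs
#include <endian.h>      // be64toh

struct Header { uint64_t frame; uint32_t length; uint16_t flags; };  // hypothetical layout

Header parseHeader(const unsigned char *p) {
    Header h;
    std::memcpy(&h.frame,  p,      sizeof h.frame);   h.frame  = be64toh(h.frame);
    std::memcpy(&h.length, p + 8,  sizeof h.length);  h.length = ntohl(h.length);
    std::memcpy(&h.flags,  p + 12, sizeof h.flags);   h.flags  = ntohs(h.flags);
    return h;
}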
You could perform the conversion byte by byte like this:
#include <cstdint>
#include <iostream>

int main()
{
    unsigned char bytesArray[8];
    bytesArray[0] = 0x05;
    bytesArray[1] = 0x00;
    bytesArray[2] = 0x00;
    bytesArray[3] = 0x00;
    bytesArray[4] = 0x00;
    bytesArray[5] = 0x00;
    bytesArray[6] = 0x00;
    bytesArray[7] = 0x00;

    // bytesArray[0] is taken as the least significant byte (little-endian order)
    uint64_t intVal = 0;
    intVal = (intVal << 8) + bytesArray[7];
    intVal = (intVal << 8) + bytesArray[6];
    intVal = (intVal << 8) + bytesArray[5];
    intVal = (intVal << 8) + bytesArray[4];
    intVal = (intVal << 8) + bytesArray[3];
    intVal = (intVal << 8) + bytesArray[2];
    intVal = (intVal << 8) + bytesArray[1];
    intVal = (intVal << 8) + bytesArray[0];

    std::cout << intVal;  // prints 5
    return 0;
}
I suggest doing the following:
uint64_t frame_num = *((uint64_t*)recv_buffer.data());
You should of course first verify that the amount of data you have in recv_buffer is at least sizeof(frame_num) bytes.
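A minimal sketch of that check combined with memcpy, which also avoids the alignment and strict-aliasing questions raised by the pointer cast (needs <cstring>):
uint64_t frame_num = 0;
if (recv_buffer.size() >= sizeof(frame_num))
    std::memcpy(&frame_num, recv_buffer.data(), sizeof(frame_num));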
I'm writing a tool for operations on long strings of 6 different letters (e.g. >1000000 letters), so I'd like to encode each letter in fewer than eight bits (for 6 letters, 3 bits is sufficient).
Here is my code:
Rcpp::RawVector pack(Rcpp::RawVector UNPACKED,
const unsigned short ALPH_SIZE) {
const unsigned int IN_LEN = UNPACKED.size();
Rcpp::RawVector ret((ALPH_SIZE * IN_LEN + BYTE_SIZE - 1) / BYTE_SIZE);
unsigned int out_byte = ZERO;
unsigned short bits_left = BYTE_SIZE;
for (int i = ZERO; i < IN_LEN; i++) {
if (bits_left >= ALPH_SIZE) {
ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
bits_left -= ALPH_SIZE;
} else {
ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
bits_left = ALPH_SIZE - bits_left;
out_byte++;
ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
bits_left = BYTE_SIZE - bits_left;
}
}
return ret;
}
I'm using Rcpp, which is an R interface for C++. RawVector is in fact a vector of chars.
This code works just perfectly - except that it is too slow. I'm performing operations bit by bit when I could vectorize it somehow. And here is the question - is there any library or tool to do it? I'm not well acquainted with C++ tools.
Thanks in advance!
This code works just perfectly - except it is too slow.
Then you probably want to try out 4 bits/letter, trading space for time. If 4 bits meets your compression needs (the output is just 33.3% larger), then your code can work on nibbles, which will be much faster and simpler than tri-bits.
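A minimal sketch of the nibble variant (not the original code; assumes each code is already in 0..15 and packs the first letter of each pair into the high nibble, like the MSB-first layout above):
Rcpp::RawVector pack_nibbles(Rcpp::RawVector unpacked) {
    const int n = unpacked.size();
    Rcpp::RawVector ret((n + 1) / 2);
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0)
            ret[i / 2] = unpacked[i] << 4;      // high nibble: first letter of the pair
        else
            ret[i / 2] |= unpacked[i] & 0x0F;   // low nibble: second letter
    }
    return ret;
}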
You need to unroll your loop so the optimizer can make something useful out of it. It will also get rid of your if, which kills any chance of good performance. Something like this (assuming ALPH_SIZE is 3):
int i = 0;
for(i = 0; i + 8 <= IN_LEN; i += 8) {
// 8 three-bit codes -> 3 output bytes, packed MSB-first to match the original loop
ret[out_byte ] = (UNPACKED[i] << 5) | (UNPACKED[i + 1] << 2) | (UNPACKED[i + 2] >> 1);
ret[out_byte + 1] = (UNPACKED[i + 2] << 7) | (UNPACKED[i + 3] << 4) | (UNPACKED[i + 4] << 1) | (UNPACKED[i + 5] >> 2);
ret[out_byte + 2] = (UNPACKED[i + 5] << 6) | (UNPACKED[i + 6] << 3) | (UNPACKED[i + 7]);
out_byte += 3;
}
for (; i < IN_LEN; i++) {
if (bits_left >= ALPH_SIZE) {
ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
bits_left -= ALPH_SIZE;
} else {
ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
bits_left = ALPH_SIZE - bits_left;
out_byte++;
ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
bits_left = BYTE_SIZE - bits_left;
}
}
This will allow the optimizer to vectorize the whole thing (assuming it's smart enough). With your current implementation I doubt any current compiler can work out that your code repeats after every 3 written bytes and exploit that.
EDIT:
With sufficient constexpr / template magic you might be able to write a universal handler for the body of the loop. Or just cover all small values (e.g. write a specialized template function for every bit count from 1 to, say, 16). Packing values bitwise beyond 16 bits is overkill.
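A rough sketch of that idea (my own, not tested against the original): make the bit width a template parameter so the compiler sees a fixed pattern it can unroll. It packs MSB-first like the question's loop and assumes out is zero-initialized and each value fits in BITS bits:
#include <cstddef>

template <unsigned BITS>
void pack_bits(const unsigned char *in, std::size_t n, unsigned char *out) {
    static_assert(BITS >= 1 && BITS <= 8, "at most one byte boundary crossed per value");
    std::size_t bitpos = 0;                          // absolute write position, in bits
    for (std::size_t i = 0; i < n; ++i, bitpos += BITS) {
        const std::size_t byte = bitpos / 8;
        const unsigned used = bitpos % 8;            // bits already occupied in this byte
        if (used + BITS <= 8) {
            out[byte] |= in[i] << (8 - used - BITS);          // value fits entirely
        } else {
            out[byte]     |= in[i] >> (used + BITS - 8);      // high part
            out[byte + 1] |= in[i] << (16 - used - BITS);     // low part spills into the next byte
        }
    }
}
// e.g. pack_bits<3>(letters, len, packed);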
According to cachegrind this checksum calculation routine is one of the greatest contributors to instruction-cache load and instruction-cache misses in the entire application:
#include <stdint.h>
namespace {
uint32_t OnesComplementSum(const uint16_t * b16, int len) {
uint32_t sum = 0;
uint32_t a = 0;
uint32_t b = 0;
uint32_t c = 0;
uint32_t d = 0;
// helper for the loop unrolling
auto run8 = [&] {
a += b16[0];
b += b16[1];
c += b16[2];
d += b16[3];
b16 += 4;
};
for (;;) {
if (len > 32) {
run8();
run8();
run8();
run8();
len -= 32;
continue;
}
if (len > 8) {
run8();
len -= 8;
continue;
}
break;
}
sum += (a + b) + (c + d);
auto reduce = [&]() {
sum = (sum & 0xFFFF) + (sum >> 16);
if (sum > 0xFFFF) sum -= 0xFFFF;
};
reduce();
while ((len -= 2) >= 0) sum += *b16++;
if (len == -1) sum += *(const uint8_t *)b16; // add the last byte
reduce();
return sum;
}
} // anonymous namespace
uint32_t get(const uint16_t* data, int length)
{
return OnesComplementSum(data, length);
}
See asm output here.
Maybe it's caused by the loop unrolling, but the generated object code doesn't seem too excessive.
How can I improve the code?
Update
Because the checksum function was in an anonymous namespace it was inlined and duplicated by two functions that resided in the same cpp file.
The loop unrolling is still beneficial. Removing it slowed down the code.
Improving the infinite loop speeds up the code (but for some reason I get opposite results on my Mac).
Before fixes: here you can see the two checksums and 17210 L1 IR misses
After fixes: after fixing the inlining problem and the infinite loop, the L1 instruction cache misses dropped to 8324.
"InstructionFetch" is higher in the fixed example. I'm not sure how to interpret that. Does it simply mean that's where most activity occurred? Or does it hint at a problem?
replace the main loop with just:
const int quick_len=len/8;
const uint16_t * const the_end=b16+quick_len*4;
len -= quick_len*8;
for (; b16+4 <= the_end; b16+=4)
{
a += b16[0];
b += b16[1];
c += b16[2];
d += b16[3];
}
There seems to be no need to manually unroll the loop if you use -O3.
Also, the test case allowed for too much optimization, since the input was static and the results unused; printing out the result helps verify that optimized versions don't break anything.
Full test I used:
int main(int argc, char *argv[])
{
using namespace std::chrono;
auto start_time = steady_clock::now();
int ret=OnesComplementSum((const uint8_t*)(s.data()+argc), s.size()-argc, 0);
auto elapsed_ns = duration_cast<nanoseconds>(steady_clock::now() - start_time).count();
std::cout << "loop=" << loop << " elapsed_ns=" << elapsed_ns << " = " << ret<< std::endl;
return ret;
}
Comparison with this (CLEAN LOOP) and your improved version (UGLY LOOP), using a longer test string:
loop=CLEAN_LOOP elapsed_ns=8365 = 14031
loop=CLEAN_LOOP elapsed_ns=5793 = 14031
loop=CLEAN_LOOP elapsed_ns=5623 = 14031
loop=CLEAN_LOOP elapsed_ns=5585 = 14031
loop=UGLY_LOOP elapsed_ns=9365 = 14031
loop=UGLY_LOOP elapsed_ns=8957 = 14031
loop=UGLY_LOOP elapsed_ns=8877 = 14031
loop=UGLY_LOOP elapsed_ns=8873 = 14031
Verification here: http://coliru.stacked-crooked.com/a/52d670039de17943
EDIT:
In fact the whole function can be reduced to:
uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum)
{
const uint16_t * b16 = reinterpret_cast<const uint16_t *>(inData);
const uint16_t * const the_end=b16+len/2;
for (; b16 < the_end; ++b16)
{
sum += *b16;
}
sum = (sum & uint16_t(-1)) + (sum >> 16);
return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum;
}
Which does better than the OP's with -O3 but worse with -O2:
http://coliru.stacked-crooked.com/a/bcca1e94c2f394c7
loop=CLEAN_LOOP elapsed_ns=5825 = 14031
loop=CLEAN_LOOP elapsed_ns=5717 = 14031
loop=CLEAN_LOOP elapsed_ns=5681 = 14031
loop=CLEAN_LOOP elapsed_ns=5646 = 14031
loop=UGLY_LOOP elapsed_ns=9201 = 14031
loop=UGLY_LOOP elapsed_ns=8826 = 14031
loop=UGLY_LOOP elapsed_ns=8859 = 14031
loop=UGLY_LOOP elapsed_ns=9582 = 14031
So mileage may vary, and unless the exact architecture is known, I'd just go simpler.
I am trying to write an Android app which needs to calculate Gaussian and Laplacian pyramids for multiple full-resolution images. I wrote it in C++ with the NDK. The most critical part of the code is applying the Gaussian filter to the images, and I am applying this filter horizontally and vertically.
The filter is (0.0625, 0.25, 0.375, 0.25, 0.0625)
Since I am working with integers, I am calculating (1, 4, 6, 4, 1)/16:
dst[index] = ( src[index-2] + src[index-1]*4 + src[index]*6 + src[index+1]*4 + src[index+2] ) / 16;
I have made a few simple optimizations, however it is still working slower than expected, and I was wondering if there are any other optimization options that I am missing.
PS: I should mention that I have tried to write this filter part with inline ARM assembly, however it gives 2x slower results.
//horizontal filter
for(unsigned y = 0; y < height; y++) {
for(unsigned x = 2; x < width-2; x++) {
int index = y*width+x;
dst[index].r = (src[index-2].r+ src[index+2].r + (src[index-1].r + src[index+1].r)*4 + src[index].r*6)>>4;
dst[index].g = (src[index-2].g+ src[index+2].g + (src[index-1].g + src[index+1].g)*4 + src[index].g*6)>>4;
dst[index].b = (src[index-2].b+ src[index+2].b + (src[index-1].b + src[index+1].b)*4 + src[index].b*6)>>4;
}
}
//vertical filter
for(unsigned y = 2; y < height-2; y++) {
for(unsigned x = 0; x < width; x++) {
int index = y*width+x;
dst[index].r = (src[index-2*width].r + src[index+2*width].r + (src[index-width].r + src[index+width].r)*4 + src[index].r*6)>>4;
dst[index].g = (src[index-2*width].g + src[index+2*width].g + (src[index-width].g + src[index+width].g)*4 + src[index].g*6)>>4;
dst[index].b = (src[index-2*width].b + src[index+2*width].b + (src[index-width].b + src[index+width].b)*4 + src[index].b*6)>>4;
}
}
The index multiplication can be factored out of the inner loop since the multiplication only occurs when y changes:
for (unsigned y ...
{
int index = y * width;
for (unsigned int x...
You may gain some speed by loading variables before you use them. This would make the processor load them in the cache:
for (unsigned x = ...
{
register YOUR_DATA_TYPE a, b, c, d, e;
a = src[index - 2].r;
b = src[index - 1].r;
c = src[index + 0].r; // The " + 0" is to show a pattern.
d = src[index + 1].r;
e = src[index + 2].r;
dest[index].r = (a + e + (b + d) * 4 + c * 6) >> 4;
// ...
Another trick would be to "cache" the values of the src so that only a new one is added each time because the value in src[index+2] may be used up to 5 times.
So here is an example of the concepts:
//horizontal filter
for(unsigned y = 0; y < height; y++)
{
    const int row = y*width;
    register YOUR_DATA_TYPE a, b, c, d, e;
    // prime the sliding window with the first four samples of the row
    a = src[row + 0].r;
    b = src[row + 1].r;
    c = src[row + 2].r;
    d = src[row + 3].r;
    for(unsigned x = 2; x < width-2; x++)
    {
        e = src[row + x + 2].r;                        // only one new load per pixel
        dest[row + x].r = (a + e + (b + d) * 4 + c * 6) >> 4;
        a = b;
        b = c;
        c = d;
        d = e;
    }
    // ... repeat for the .g and .b channels
}
I'm not sure how your compiler would optimize all this, but I tend to work in pointers. Assuming your struct is 3 bytes... You can start with pointers in the right places (the edge of the filter for source, and the destination for target), and just move them through using constant array offsets. I've also put in an optional OpenMP directive on the outer loop, as this can also improve things.
#pragma omp parallel for
for(unsigned y = 0; y < height; y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex+2];
    const unsigned char * spos = (const unsigned char*)&src[rowindex];
    const unsigned char * end = (const unsigned char*)&src[rowindex+width-4]; // last pixel whose 5-tap window fits
    for( ; spos != end; spos++, dpos++) {
        // same-channel neighbours are 3 bytes apart in a 3-byte struct
        *dpos = (spos[0] + spos[12] + ((spos[3] + spos[9])<<2) + spos[6]*6) >> 4;
    }
}
Similarly for the vertical loop.
const int scanwidth = width * 3;
const int row1 = scanwidth;
const int row2 = row1+scanwidth;
const int row3 = row2+scanwidth;
const int row4 = row3+scanwidth;
#pragma omp parallel for
for(unsigned y = 2; y < height-2; y++) {
    const int rowindex = y * width;
    unsigned char * dpos = (unsigned char*)&dest[rowindex];
    const unsigned char * spos = (const unsigned char*)&src[rowindex - 2*width]; // start two rows above
    const unsigned char * end = spos + scanwidth;
    for( ; spos != end; spos++, dpos++) {
        *dpos = (spos[0] + spos[row4] + ((spos[row1] + spos[row3])<<2) + spos[row2]*6) >> 4;
    }
}
This is how I do convolutions, anyway. It sacrifices readability a little, and I've never tried measuring the difference. I just tend to write them that way from the outset. See if that gives you a speed-up. The OpenMP definitely will if you have a multicore machine, and the pointer stuff might.
I like the comment about using SSE for these operations.
Some of the more obvious optimizations are exploiting the symmetry of the kernel:
a=*src++; b=*src++; c=*src++; d=*src++; e=*src++; // init
LOOP (n/5) times:
z=(a+e)+((b+d)<<2)+c*6; *dst++=z>>4; // then reuse the local variables
a=*src++;
z=(b+a)+((c+e)<<2)+d*6; *dst++=z>>4; // registers have been read only once...
b=*src++;
z=(c+b)+((d+a)<<2)+e*6; *dst++=z>>4;
c=*src++;
z=(d+c)+((e+b)<<2)+a*6; *dst++=z>>4;
d=*src++;
z=(e+d)+((a+c)<<2)+b*6; *dst++=z>>4;
e=*src++;
The second thing is that one can perform multiple additions using a single integer. When the values to be filtered are unsigned, one can fit two channels in a single 32-bit integer (or 4 channels in a 64-bit integer); it's the poor man's SIMD.
a=    0x[0011][0034] <-- split to two
b=    0x[0031][008a]
----------------------
sum     0042   00be
>>4     0004   200b  <-- mask off
mask    00ff   00ff
-------------------
        0004   000b  <-- result
(The Simulated SIMD shows one addition followed by a shift by 4)
Here's a kernel that calculates 3 rgb operations in parallel (easy to modify for 6 rgb operations in 64-bit architectures...)
#define MASK (255+(255<<10)+(255<<20))
#define KERNEL(a,b,c,d,e) { \
a=((a+e+(c<<1))>>2) & MASK; a=(a+b+c+d)>>2 & MASK; *DATA++ = a; a=DATA[4]; }
void calc_5_rgbs(unsigned int *DATA)
{
register unsigned int a = DATA[0], b=DATA[1], c=DATA[2], d=DATA[3], e=DATA[4];
KERNEL(a,b,c,d,e);
KERNEL(b,c,d,e,a);
KERNEL(c,d,e,a,b);
KERNEL(d,e,a,b,c);
KERNEL(e,a,b,c,d);
}
Works best on ARM and on 64-bit IA with 16 registers... Needs heavy assembler optimization to overcome the register shortage in 32-bit IA (e.g. use ebp as a GPR). And just because of that it's an in-place algorithm...
There are just 2 guard bits between every 8 bits of data, which is just enough to get exactly the same result as the plain integer calculation.
And BTW: it's faster to just run through the array byte by byte than by r,g,b elements:
unsigned char *s=(unsigned char *) source_array;
unsigned char *d=(unsigned char *) dest_array;
for (int j=0;j<3*N;j++) d[j]=(s[j]+s[j+16]+s[j+8]*6+s[j+4]*4+s[j+12]*4)>>4;