How to transform this memcpy into a for? - c++

What is the difference between these two?
for (int i = 0; i < numSamples; i++) {
mData[sampleIndex++] = *buffer++ * (1.0f / 32768);
}
and
memcpy(&mData[sampleIndex], buffer, (numSamples * sizeof(float)));
If I understood correctly, the first copies numSamples float values to mData, one by one. The second copies numSamples * sizeof(float) bytes to mData. Since we're copying numSamples times the number of bytes in a float, I think they do the same thing, except that the first one also multiplies each value before storing it in mData.
So, is there a way to transform the memcpy into a for? Something like:
for (int i = 0; i < numSamples * sizeof(float); i++) {
//What to put here?
}
Context:
const int32_t mChannelCount;
const int32_t mMaxFrames;
int32_t sampleIndex;
float *mData;
float *buffer;

What is the difference between these two?
for (int i = 0; i < numSamples; i++) {
mData[sampleIndex++] = *buffer++ * (1.0f / 32768);
}
// and
memcpy(&mData[sampleIndex], buffer, (numSamples * sizeof(float)));
These are quite different, given the * (1.0f / 32768). I assume the comparison sets that scaling difference aside. @Thomas Matthews.
Important: buffer and sampleIndex have different values after the for loop.
*buffer++ needs no code change should the type of buffer change; * sizeof(float) obliges a code change. Could have used * sizeof *buffer instead.
memcpy() is code optimized for the platform; for() loops can only do so much. In particular, memcpy() assumes mData and buffer do not overlap, an optimization the for() loop may not be able to make.
This for uses int indexing where memcpy() uses size_t. That makes a difference with huge arrays.
memcpy() tolerates unaligned pointers; mData[sampleIndex++] = *buffer++ does not.
"The first copies numSamples float values to mData, one by one" is not certain. A smart compiler may perform some of the copies in parallel, depending on the context, and merely act as if they were done one by one.
Post the entire block of code/function that uses these 2 approaches for a better compare.
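A type-agnostic spelling of the size argument, using sizeof *buffer as mentioned above, might look like this sketch (same variables as the question):
memcpy(&mData[sampleIndex], buffer, numSamples * sizeof *buffer); // sizeof *buffer tracks the element type, so a type change needs no edit here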

I gather from your post that you want to make a memcpy-like copy but using a for loop. That being the case, you just need to use the same for loop but without the multiplication part:
for (int i = 0; i < numSamples; i++){
mData[sampleIndex++] = *buffer++;
}
Note that memcpy can be more efficient than a for loop under the right conditions (see Maxim Egorushkin's and Jeremy Friesner's comments below), so you may want to keep it that way.
Another, more idiomatic, and, I would argue, better way to implement the operation you are performing is to use the methods provided by the C++ standard library, as suggested by Ted Lyngmo and rustyx.
Disclaimer: As I was writing my answer, Martin York posted a comment with a similar solution; that being the case, credit to him as well.

What is the difference between these two?
The former performs a calculation on the source array while copying the results into another array, one float at a time.
The latter copies the contents of the array into another byte by byte, without any calculation.
So, is there a way to transform the memcpy into a for?
Yes. Here is a naïve way to transform it:
auto dest_c = reinterpret_cast<unsigned char*>(mData + sampleIndex); // float* to byte pointer requires reinterpret_cast, not static_cast
auto src_c = reinterpret_cast<const unsigned char*>(buffer);
auto end = src_c + numSamples * sizeof(float);
for (; src_c < end;) { // or while(src_c < end)
*dest_c++ = *src_c++;
}
The actual implementation of the standard function is likely more complex, involving optimisations related to copying long sequences.
Since you don't appear to need the generic reinterpretation aspect of std::memcpy, perhaps a simpler alternative would suffice:
auto dest = mData + sampleIndex;
auto src = buffer;
auto end = src + numSamples;
for (; src < end;) {
*dest++ = *src++;
}
Or perhaps another standard algorithm:
std::copy(buffer, buffer + numSamples, mData + sampleIndex);
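And if the scaling from the original loop is wanted as well, std::transform expresses it directly; a sketch, assuming <algorithm> is included and the same variables as above:
std::transform(buffer, buffer + numSamples, mData + sampleIndex,
[](float s) { return s * (1.0f / 32768); }); // scale each sample while copying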

Related

casting a pointer to an array to a structure in efficient C++11 way

I have a large amount of point cloud data that I read from a file into
char * memblock = new char [size];
where size is the size of data. Then I cast my data to float numbers
float * file_content = reinterpret_cast<float *>(memblock);
Now I would like to change the data from a pointer to an array and place it in a certain structure like std::vector<PointXYZ>.
vector.clear();
for (int i = 2; i < file_content_size; i+=3) {
vector.push_back(
PointXYZ(file_content[i-2], file_content[i-1], file_content[i] )
);
}
But I feel there must be a better way than just looping through the whole data, considering that the size of the vector is more than 1e6.
std::vector has a range constructor that you can use to copy the elements into the vector.
auto first = reinterpret_cast<const PointXYZ*>(memblock);
std::vector<PointXYZ> vec(first, first + size / sizeof(PointXYZ));
I believe this will be faster because you are not reallocating memory on every push_back; however, you will still be copying every element in memblock.
I think you may have alignment problems when you cast your char* raw data to a float*.
Generally you should arrange things so that you cast other types to char*, because char is allowed to alias everything else, and that way you also ensure correct alignment.
// create your array in the target type (float)
std::vector<float> file_content(size/sizeof(float));
// read the data in (cast to char* here)
file.read(reinterpret_cast<char*>(file_content.data()), size);
I honestly don't think you can get away from copying all the data.
std::vector<PointXYZ> points;
points.reserve(file_content.size() / 3);
for (auto i = 0ULL; i + 2 < file_content.size(); i += 3)
points.emplace_back(file_content[i], file_content[i + 1], file_content[i + 2]);

C++ GDI+ bitmap manipulation needs speed up on byte operations

I'm using GDI+ in C++ to manipulate some Bitmap images, changing the colour and resizing the images. My code is very slow at one particular point and I was looking for some potential ways to speed up the line that's been highlighted in the VS2013 Profiler
for (UINT y = 0; y < 3000; ++y)
{
//one scanline at a time because bitmaps are stored wrong way up
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
for (UINT x = 0; x < 4000; ++x)
{
//get grey value from 0.114*Blue + 0.299*Red + 0.587*Green
byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE
//rest of manipulation code
}
}
Any handy hints on how to handle this arithmetic line better? It's causing massive slowdowns in my code.
Thanks in advance!
Optimization depends heavily on the compiler used and the target system, but there are some hints which may be useful. Avoid multiplications:
Instead of:
byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE
use...
//get grey value from 0.114*Blue + 0.299*Red + 0.587*Green
float grey_f = (*oRow) * .114f;
oRow++;
grey_f += (*oRow) * .587f;
oRow++;
grey_f += (*oRow) * .299f;
oRow++;
byte grey = (byte)grey_f; // accumulate in float so nothing is truncated until the final cast
You can put the increment of the pointer on the same line; I put it on a separate line for better understanding.
Also, instead of the float multiplications you can use a lookup table, which can be faster than the arithmetic. This depends on the CPU and the table size, but you can give it a shot:
// somewhere global, or as class attributes
byte tred[256];
byte tgreen[256];
byte tblue[256];
...at startup...
// Only init once at startup
// I am ignoring the warnings, you should not :-)
for(int i=0;i<256;i++) // 256 iterations, so the value 255 gets an entry too
{
tblue[i]=i*.114;
tgreen[i]=i*.587;
tred[i]=i*.299;
}
...in the loop...
byte grey = tblue[*oRow]; // first byte is blue in BGR order
oRow++;
grey += tgreen[*oRow];
oRow++;
grey += tred[*oRow];
oRow++;
Also: you could build one big table over whole 256*256*256 BGR values instead, but as such a table would be far larger than the usual CPU cache, I would not expect much speed gain from it.
As suggested, you could do the math in integers, but you could also try floats instead of doubles (.114f instead of .114), which are usually quicker, and you don't need the precision.
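For the integer route, one common trick is 8-bit fixed point; a sketch (my own illustration, not from the answer above), with the weights scaled by 256 and chosen so they sum to exactly 256:
// 0.114*256 ~ 29 (blue), 0.587*256 ~ 150 (green), 0.299*256 ~ 77 (red); 29+150+77 == 256
byte grey = (byte)((29 * oRow[x * 3] + 150 * oRow[x * 3 + 1] + 77 * oRow[x * 3 + 2]) >> 8);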
Do the loop like this instead, to save on pointer math. Creating a temporary pointer like this won't cost anything, because the compiler will understand what you're up to:
for(UINT x = 0; x < 12000; x+=3)
{
byte* pVal = &oRow[x];
....
}
This code is also easily threadable - the compiler can do it for you automatically in various ways; here's one, using parallel for:
https://msdn.microsoft.com/en-us/library/dd728073.aspx
If you have 4 cores, that's a 4x speedup, just about.
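A minimal sketch of that parallel for, assuming the PPL from the linked page (<ppl.h>) and the same bitmapData1 as above:
#include <ppl.h>
concurrency::parallel_for(0u, 3000u, [&](UINT y) {
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
// ...per-pixel work for this row, as in the original inner loop...
});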
Also be sure to check release vs debug build - you don't know the perf until you run it in release/optimized mode.
You could premultiply values like oRow[x * 3] * .114 and put them into an array. oRow[x*3] has 256 possible values, so you can easily create an array aMul1 of 256 entries for 0..255, each multiplied by .114. Then use aMul1[oRow[x * 3]] to look up the premultiplied value. And do the same for the other components.
Actually, you could even create such an array for whole RGB values; your pixel is 888, so you would need an array of size 256*256*256 = 16777216, i.e. ~16MB. Whether this would speed up your process, you would have to check yourself with the profiler.
In general I've found that more direct pointer management, fewer intermediate instructions (on most CPUs, instructions are all roughly equal cost these days), and fewer memory fetches (e.g. tables are not the answer more often than they are) is the usual optimum, short of dropping to direct assembly. Vectorization, especially explicit vectorization, is also helpful, as is dumping the assembly of the function and confirming that the inner bits conform to your expectations. Try this:
for (UINT y = 0; y < 3000; ++y)
{
//one scanline at a time because bitmaps are stored wrong way up
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
byte *p = oRow;
byte *pend = p + 4000 * 3;
for(; p != pend; p+=3){
const float grey = p[0] * .114f + p[1] * .587f + p[2] * .299f;
}
//alternatively with an autovectorizing compiler
for(; p != pend; p+=3){ // (reset p = oRow before running this variant instead of the first)
// make sure vectorization and relevant instruction sets are enabled;
// this is effectively a dot product, so the following intrinsic fits the bill:
// https://msdn.microsoft.com/en-us/library/bb514054.aspx
// vector types or compiler intrinsics are often more reliable too, but they are
// compiler specific or architecture dependent respectively.
float grey = 0;
const float w[3] = {.114f, .587f, .299f};
#pragma unroll // or use a compiler option to unroll loops
for(int c = 0; c < 3; ++c){
grey += w[c] * p[c];
}
}
}
Consider fooling around with OpenCL, targeting your CPU, to see how fast you could solve this with CPU-specific optimizations and easy use of multiple cores; OpenCL covers this for you pretty well and provides built-in vector ops and a dot product.

Optimize a nearest neighbor resizing algorithm for speed

I'm using the following algorithm to perform nearest neighbor resizing. Is there any way to optimize its speed? Input and output buffers are in ARGB format, though the images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
const int colors = 4;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
for (int x = 0; x < targetWidth; x++)
{
int x2 = ((x * x_ratio) >> 16) ;
int y2_x2_colors = (y2_xsource + x2) * colors;
int i_x_colors = (i_xdest + x) * colors;
output[i_x_colors] = input[y2_x2_colors];
output[i_x_colors + 1] = input[y2_x2_colors + 1];
output[i_x_colors + 2] = input[y2_x2_colors + 2];
output[i_x_colors + 3] = input[y2_x2_colors + 3];
}
}
}
restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare pointers to the output and input as uint32_t, so that the four 8-bit copy-assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
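A sketch of that idea, reusing the names from the question (and assuming both buffers really are 4-byte aligned):
const uint32_t* in32 = reinterpret_cast<const uint32_t*>(input);
uint32_t* out32 = reinterpret_cast<uint32_t*>(output);
out32[i_xdest + x] = in32[y2_xsource + x2]; // one whole ARGB pixel per assignment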
There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler hasn't already done that for you).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
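A sketch of the precomputed column mapping (illustrative only; names reuse the question's variables):
#include <vector>
std::vector<int> xmap(targetWidth); // destination column -> source column, computed once
for (int x = 0; x < targetWidth; ++x)
xmap[x] = (x * x_ratio) >> 16; // reused for every row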
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for speedup, there remains the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might enable you to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can exploit massive parallelization by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad gives you inherent nearest neighbor sampling. OpenGL could also give you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which amounts to a single gl command and could be an easy parameter to your resize function).
Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that's all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const uint32_t* input32 = (const uint32_t*)input;
uint32_t* output32 = (uint32_t*)output;
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
int x_ratio_with_color = x_ratio;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
int source_x_offset = 0;
int startingOffset = y2_xsource;
const uint32_t * inputLine = input32 + startingOffset;
for (int x = 0; x < targetWidth; x++)
{
int sourceOffset = source_x_offset >> 16;
output32[i_xdest] = inputLine[sourceOffset]; // write through output32: one whole ARGB pixel per assignment
i_xdest += 1;
source_x_offset += x_ratio_with_color;
}
}
}

Modifying data from uint8_t array very slow?

I'm currently trying to optimize my code to run a bit faster. Currently it is taking about 30ms to update about 3776000 bytes. If I remove the outPx updates inside my function, it runs in about 3ms, meaning that the updates to outPx are what is making the function slower.
Any potential feedback on how to improve the speed of my function below would be greatly appreciated.
uint8_t* outPx = (uint8_t*)out.data;
for (int px=0; px<pxSize; px+=4)
{
newTopAlpha = (alpha*inPx[px+3]);
if (0xff == newTopAlpha)
{
// top is opaque covers entire bottom
// set copy over BGR colors
outPx[px] = inPx[px];
outPx[px+1] = inPx[px+1];
outPx[px+2] = inPx[px+2];
outPx[px+3] = 0xff; //Fully opaque
}
else if (0x00 != newTopAlpha)
{
// top is not completely transparent
topAlpha = newTopAlpha/(float)0xff;
bottomAlpha = outPx[px+3]/(float)0xff;
newAlpha = topAlpha + bottomAlpha*(1-topAlpha);
alphaChange = bottomAlpha*(1-topAlpha);
outPx[px] = (uint8_t)((inPx[px]*topAlpha + outPx[px]*alphaChange)/newAlpha);
outPx[px+1] = (uint8_t)((inPx[px+1]*topAlpha + outPx[px+1]*alphaChange)/newAlpha);
outPx[px+2] = (uint8_t)((inPx[px+2]*topAlpha + outPx[px+2]*alphaChange)/newAlpha);
outPx[px+3] = (uint8_t)(newAlpha*0xff);
}
}
uint8_t is an exact-width integer type, meaning that you demand that the compiler allocate exactly that much memory for your type. If your system has an alignment requirement, this may cause the code to run slower.
Change uint8_t to uint_fast8_t. This tells the compiler that you want this variable to be 8 bits if possible, but that it is ok to use a larger size if it makes the code faster.
Apart from that, there are lots of things that could cause bad performance, in which case you need to state what system and compiler you are using.
Your code is doing floating point divides, and conversions from byte to float and back again. If you use integer math, it will very likely be more efficient.
Even doing this simple conversion to multiply instead of divide may help quite a bit:
newAlpha = 1/(topAlpha + bottomAlpha*(1-topAlpha));
...
outPx[px] = (uint8_t)((inPx[px]*topAlpha + outPx[px]*alphaChange)*newAlpha);
Multiply tends to be much faster than divide.
OK, if this really is the bottleneck, and you can't use the GPU / built-in methods for some random reason, then there is a lot you can do:
uint8_t *outPx = (uint8_t*) out.data;
const int cAlpha = (int) (alpha * 256.0f + 0.5f);
for( int px = 0; px < pxSize; px += 4 ) {
const int topAlpha = (cAlpha * (int) inPx[px|3]) >> 8; // note | not + for tiny speed boost
if( topAlpha == 255 ) {
memcpy( &outPx[px], &inPx[px], 4 ); // might be slower than per-component copying; benchmark!
} else if( topAlpha ) {
const int bottomAlpha = (int) outPx[px|3];
const int alphaChange = (bottomAlpha * (255 - topAlpha)) / 255;
const int newAlpha = topAlpha + alphaChange;
outPx[px ] = (uint8_t) ((inPx[px ]*topAlpha + outPx[px ]*alphaChange) / newAlpha);
outPx[px|1] = (uint8_t) ((inPx[px|1]*topAlpha + outPx[px|1]*alphaChange) / newAlpha);
outPx[px|2] = (uint8_t) ((inPx[px|2]*topAlpha + outPx[px|2]*alphaChange) / newAlpha);
outPx[px|3] = (uint8_t) newAlpha;
}
}
The main change is that there is no floating point arithmetic any more (I might have missed a /255 or something, but you get the idea). I also removed repeated calculations and used bit operators where possible. Another optimisation would be to use fixed-precision arithmetic to change the 3 divides into a single divide and 3 multiply/bitshifts. But you'd have to benchmark to confirm that actually helps. The memcpy might be faster. Again, you need to benchmark.
Finally, if you know something about the images, you could give the compiler hints about the branching. For example, in GCC you can say if( __builtin_expect( topAlpha == 255, 1 ) ) if you know that most of the image is solid colour, and alpha is 1.0.
Update based on comments:
And for the love of sanity, never (never) benchmark with optimisations turned off.

exchanging 2 memory positions

I am working with OpenCV and Qt. OpenCV uses BGR while Qt uses RGB, so I have to swap those 2 bytes for very big images.
There is a better way of doing the following?
I cannot think of anything faster, but it looks so simple and lame...
int width = iplImage->width;
int height = iplImage->height;
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
int limit = height * width;
for (int y = 0; y < limit; ++y) {
buf = iplImagePtr[2];
iplImagePtr[2] = iplImagePtr[0];
iplImagePtr[0] = buf;
iplImagePtr += 3;
}
QImage img((uchar *) iplImage->imageData, width, height,
QImage::Format_RGB888);
We are currently dealing with this issue in a Qt application. We've found the Intel Performance Primitives to be the fastest way to do this; they have extremely optimized code. In the HTML help files at Intel, the ippiSwapChannels documentation has an example of exactly what you are looking for.
There are a couple of downsides:
One is the size of the library, but you can statically link just the routines you need.
The other is running on AMD CPUs: Intel's libs run VERY slowly by default on AMD. Check out www.agner.org/optimize/asmlib.zip for details on a workaround.
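From memory, the IPP call referred to above looks roughly like this; treat the exact name and signature as an assumption and verify against the linked documentation:
// Hedged sketch: swap channels 0 and 2 of 3-channel 8-bit data (BGR <-> RGB)
const int dstOrder[3] = {2, 1, 0};
IppiSize roi = {width, height};
ippiSwapChannels_8u_C3R(src, srcStep, dst, dstStep, roi, dstOrder);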
I think this looks absolutely fine. That the code is simple is not something negative. If you want to make it shorter you could use std::swap:
std::swap(iplImagePtr[0], iplImagePtr[2]);
You could also do the following:
uchar* end = iplImagePtr + height * width * 3;
for ( ; iplImagePtr != end; iplImagePtr += 3) {
std::swap(iplImagePtr[0], iplImagePtr[2]);
}
There's cvConvertImage to do the whole thing in one line, but I doubt it's any faster either.
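For reference, that one-liner would look something like this, assuming the legacy C API flag CV_CVTIMG_SWAP_RB:
cvConvertImage(iplImage, iplImage, CV_CVTIMG_SWAP_RB); // swap R and B in place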
Couldn't you use one of the following methods?
void QImage::invertPixels ( InvertMode mode = InvertRgb )
or
QImage QImage::rgbSwapped () const
Hope this helps a bit!
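For example, a minimal sketch using rgbSwapped, reusing the QImage construction from the question:
QImage img((uchar *) iplImage->imageData, width, height, QImage::Format_RGB888);
QImage rgb = img.rgbSwapped(); // returns a copy with the R and B channels exchanged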
I would be inclined to do something like the following, on the basis of the RGB data being in three-byte blocks.
int i = 0;
int limit = width * height * 3; // total number of bytes: three per pixel
while(i != limit)
{
buf = iplImagePtr[i]; // should be blue colour byte
iplImagePtr[i] = iplImagePtr[i + 2]; // save the red colour byte in the blue space
iplImagePtr[i + 2] = buf; // save the blue color byte into what was the red slot
// i++;
i += 3;
}
I doubt it is any 'faster', but at the end of the day, you just have to go through the entire image, pixel by pixel.
You could always do this:
int width = iplImage->width;
int height = iplImage->height;
uchar *start = (uchar *) iplImage->imageData;
uchar *end = start + width * height * 3; // three bytes per pixel
for (uchar *p = start ; p < end ; p += 3)
{
uchar buf = *p;
*p = *(p+2);
*(p+2) = buf;
}
but a decent compiler would do this anyway.
Your biggest overhead in these sorts of operations is going to be memory bandwidth.
If you're using Windows then you can probably do this conversion using the BitBlt and two appropriately set up DIBs. If you're really lucky then this could be done in the graphics hardware.
I hate to ruin anyone's day, but if you don't want to go the IPP route (see photo_tom) or pull in an optimized library, you might get better performance from the following (modifying Andreas' answer):
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);
}
Now hold on, folks, I hear you yelling "but all those extra multiplies and adds!" The thing is, this form of the loop is far easier for a compiler to optimize, especially if it gets smart enough to multithread this sort of algorithm, because each pass through the loop is independent of those before or after it. In the other form, the value of iplImagePtr depended on its value in the previous pass. In this form, it is constant throughout the whole loop; only y changes, and that is in a very, very common "count from 0 to N-1" loop construct, so it's easier for an optimizer to digest.
Or maybe it doesn't make a difference these days because optimizers are insanely smart (are they?). I wonder what a benchmark would say...
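As one illustration of why that independence matters (my addition, not part of the original answer), the loop can be annotated for OpenMP, assuming it is enabled (/openmp or -fopenmp):
// each iteration touches a disjoint pixel, so the iterations can be split across cores
#pragma omp parallel for
for (long long y = 0; y < (long long)limit; ++y)
std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);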
P.S. If you actually benchmark this, I'd also like to see how well the following performs:
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
uchar *pixel = iplImagePtr + y * 3;
std::swap(pixel[0], pixel[2]);
}
Again, pixel is defined in the loop to limit its scope and keep the optimizer from thinking there's a cycle-to-cycle dependency. If the compiler increments and decrements the stack pointer each time through the loop to "create" and "destroy" pixel, well, it's stupid and I'll apologize for wasting your time.
cvCvtColor(iplImage, iplImage, CV_BGR2RGB);