Zero-copy exporting image using png++ or libpng - c++

Suppose I have a 2000 x 3000 24-bit RGB image stored in row-major, column, color-minor order. I can export this with png++ as follows:
void export_png(const std::vector<uint8_t>& pixels, const std::string& pngfile) {
png::image<png::rgb_pixel> image(2000,3000);
for (int y = 0; y < 3000; y++)
for (int x = 0; x < 2000; x++)
{
image[y][x].red = pixels[y*2000*3 + x*3 + 0];
image[y][x].green = pixels[y*2000*3 + x*3 + 1];
image[y][x].blue = pixels[y*2000*3 + x*3 + 2];
}
image.write(pngfile);
}
However this is inefficient because it copies the image data before exporting it, requiring at least an additional ~20 megabytes of memory more than should be needed, not to mention the time and memory bandwidth it takes to perform the copy.
Is there a way of exporting it directly from the pixels buffer without copying the data into a png::image first? With png++ or libpng? If so, what is the code for that?
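One direction that may be worth trying: libpng's write functions (png_write_row, png_write_image) consume pointers to rows rather than owning the pixels, so in principle they can be pointed straight at the existing buffer. The sketch below only builds that row-pointer table; makeRowPointers and its parameters are hypothetical names, and the actual libpng setup and write calls are left out. Depending on the libpng version, the pointers may need a const_cast because older headers declare rows as non-const.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds one pointer per image row, each pointing into the existing buffer.
// No pixel data is copied; only `height` pointers are allocated.
// Hypothetical helper, assuming tightly packed rows (no padding between rows).
std::vector<const std::uint8_t*> makeRowPointers(const std::vector<std::uint8_t>& pixels,
                                                 std::size_t width,
                                                 std::size_t height,
                                                 std::size_t bytesPerPixel = 3)
{
    const std::size_t stride = width * bytesPerPixel;
    std::vector<const std::uint8_t*> rows;
    if (pixels.size() < stride * height)
        return rows; // buffer too small for the stated dimensions: return an empty table
    rows.reserve(height);
    for (std::size_t y = 0; y < height; ++y)
        rows.push_back(pixels.data() + y * stride);
    return rows;
}
```

For the 2000 x 3000 image this table costs 3000 pointers (about 24 KB on a 64-bit machine) instead of a second ~18 MB copy of the pixels.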

Related

Optimize a nearest neighbor resizing algorithm for speed

I'm using the following algorithm to perform nearest neighbor resizing. Is there any way to optimize its speed? Input and output buffers are in ARGB format, though images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
const int colors = 4;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
for (int x = 0; x < targetWidth; x++)
{
int x2 = ((x * x_ratio) >> 16) ;
int y2_x2_colors = (y2_xsource + x2) * colors;
int i_x_colors = (i_xdest + x) * colors;
output[i_x_colors] = input[y2_x2_colors];
output[i_x_colors + 1] = input[y2_x2_colors + 1];
output[i_x_colors + 2] = input[y2_x2_colors + 2];
output[i_x_colors + 3] = input[y2_x2_colors + 3];
}
}
}
restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare additional pointers, pointerToOutput and pointerToInput, as uint32_t*, so that the four 8-bit copy-assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler didn't see that yet).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
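A minimal sketch of the tabulation idea, assuming 32-bit pixels and tightly packed rows: the source column for each destination column is computed once, and a destination row that maps to the same source row as the previous one is copied with memcpy instead of being recomputed. The name resizeWithTable is hypothetical, and this uses a plain lookup table rather than true run-length coding.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Nearest-neighbor resize of 32-bit pixels using a precomputed column table.
// Hypothetical helper; not taken from the answers in this thread.
void resizeWithTable(const std::uint32_t* input, std::uint32_t* output,
                     int sourceWidth, int sourceHeight,
                     int targetWidth, int targetHeight)
{
    // Tabulate the source column for every destination column once.
    std::vector<int> xmap(targetWidth);
    for (int x = 0; x < targetWidth; x++)
        xmap[x] = static_cast<int>((static_cast<std::int64_t>(x) * sourceWidth) / targetWidth);

    int previousSourceRow = -1;
    for (int y = 0; y < targetHeight; y++)
    {
        const int sourceRow =
            static_cast<int>((static_cast<std::int64_t>(y) * sourceHeight) / targetHeight);
        std::uint32_t* dst = output + static_cast<std::size_t>(y) * targetWidth;
        if (sourceRow == previousSourceRow)
        {
            // Same source row as the previous destination row: duplicate that row.
            std::memcpy(dst, dst - targetWidth,
                        static_cast<std::size_t>(targetWidth) * sizeof(std::uint32_t));
            continue;
        }
        const std::uint32_t* src = input + static_cast<std::size_t>(sourceRow) * sourceWidth;
        for (int x = 0; x < targetWidth; x++)
            dst[x] = src[xmap[x]];
        previousSourceRow = sourceRow;
    }
}
```

The row-copy shortcut only pays off when enlarging vertically; when shrinking, every destination row maps to a different source row and the memcpy branch is never taken.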
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for a speedup, there remains the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might let you control the replication (or decimation) of the pixels. But because of the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can exploit massive parallelism by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad gives you nearest neighbor sampling for free. OpenGL also gives you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which amounts to a single gl command and could be an easy parameter to your resize function).
Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry, the program becomes almost trivial to write. It would be a little more involved than your algorithm, but it could perform orders of magnitude faster, depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const uint32_t* input32 = (const uint32_t*)input;
uint32_t* output32 = (uint32_t*)output;
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
int x_ratio_with_color = x_ratio;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
int source_x_offset = 0;
int startingOffset = y2_xsource;
const uint32_t * inputLine = input32 + startingOffset;
for (int x = 0; x < targetWidth; x++)
{
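int sourceOffset = source_x_offset >> 16;
output32[i_xdest] = inputLine[sourceOffset]; // write through the 32-bit pointer: one whole ARGB pixel
i_xdest += 1; // advance only after the write so pixel 0 of each row is filled
source_x_offset += x_ratio_with_color;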
}
}
}

C++ - Convert uint8_t* image data to double** image data

I am working on a C++ function (inside my iOS app) where I have image data in the form uint8_t*.
I obtained the image data using the CVPixelBufferGetBaseAddress() method of the iOS SDK:
uint8_t *bPixels = (uint8_t *)CVPixelBufferGetBaseAddress(imageBuffer);
I have another function (from a third-party source) that does some of the image processing I would like to apply to my image data, but these functions take the image data as double**.
Does anyone have any idea how to go about converting this?
What other information can I provide?
The constructor prototype for the class that use double** look like:
Image(double **iPixels, unsigned int iWidth, unsigned int iHeight);
Your uint8_t *bPixels seems to hold the image data as a 1-dimensional contiguous array of length height*width. So to access the pixel in the x-th row and y-th column you have to write bPixels[x*width+y].
Image() seems to work on 2-dimensional arrays. To access pixel like above you would have to write iPixels[x][y].
So you need to copy your existing 1-dimensional array to a 2-dimensional:
double **mypixels = new double* [height];
for (int x=0; x<height; x++)
{
mypixels[x] = new double [width];
for (int y=0; y<width; y++)
mypixels[x][y] = bPixels[x*width+y]; // attention here, maybe normalization is necessary
// e.g. mypixels[x][y] = bPixels[x*width+y] / 255.0
}
Because your 1-dimensional array has pixels of type uint8_t and the 2-dimensional one has pixels of type double, you must allocate new memory. Otherwise, if both had the same pixel type, the more elegant solution (a simple map) would be:
uint8_t **mypixels = new uint8_t* [height];
for (int x=0; x<height; x++)
mypixels[x] = bPixels+x*width;
Attention: besides the normalization that may be necessary, there is also a problem with index compatibility! My examples assume that the 1-dimensional array is stored row by row and that the functions working on the 2-dimensional array index it with [x][y] (that is, row first, then column). The declaration of Image(), however, suggests that it may need its arrays indexed with [y][x].
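The conversion above can also be sketched with one contiguous allocation plus a table of row pointers, so nothing has to be freed by hand. This assumes one byte per pixel, as in the answer above, and normalization to [0, 1]; DoubleImage, toDoubleRows and the bytesPerRow parameter are hypothetical names (bytesPerRow is there because source rows may be padded).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Owns the converted samples and a table of row pointers into them.
struct DoubleImage {
    std::vector<double> samples;  // height * width values, stored row by row
    std::vector<double*> rows;    // rows[r] points at the start of row r in `samples`
    DoubleImage() = default;
    DoubleImage(DoubleImage&&) = default;
    DoubleImage(const DoubleImage&) = delete; // a copy's row pointers would still point into the original
    double** data() { return rows.data(); }
};

// Converts single-channel 8-bit data to doubles in [0, 1].
// Assumption: one byte per pixel; bytesPerRow >= width allows for row padding.
DoubleImage toDoubleRows(const std::uint8_t* src, std::size_t width,
                         std::size_t height, std::size_t bytesPerRow)
{
    DoubleImage img;
    img.samples.resize(width * height);
    img.rows.resize(height);
    for (std::size_t r = 0; r < height; ++r)
    {
        img.rows[r] = img.samples.data() + r * width;
        for (std::size_t c = 0; c < width; ++c)
            img.rows[r][c] = src[r * bytesPerRow + c] / 255.0;
    }
    return img;
}
```

The result of data() could then be passed as the double** argument, provided the DoubleImage object outlives whatever uses the pointers.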
I'm going to take a giant bunch of guesses here in hopes that this will lead you towards getting at the documentation and answering back. If there's no further documentation, well, here's a starting point.
Guess 1) The Image constructor requires a doubly dimensioned array where each component is an R,G,B,Alpha channel in that order. So iPixels[0] is the red data, iPixels[1] is the green data, etc.
Guess 2) Because it's not integer data, the values range from 0 to 1.
Guess 3) All of this must be pre-allocated.
Guess 4) Image data is row-major
Guess 5) Source data is BGRA
So with that in mind, starting with bPixels
double *redData = new double[width*height];
double *greenData = new double[width*height];
double *blueData = new double[width*height];
double *alphaData = new double[width*height];
double **iPixels = new double*[4];
iPixels[0] = redData;
iPixels[1] = greenData;
iPixels[2] = blueData;
iPixels[3] = alphaData;
for(int y = 0;y < height;y++)
{
for(int x = 0;x < width;x++)
{
int alpha = bPixels[(y*width + x)*4 + 3];
int red = bPixels[(y*width +x)*4 + 2];
int green = bPixels[(y*width + x)*4 + 1];
int blue = bPixels[(y*width + x)*4];
redData[y*width + x] = red/255.0;
greenData[y*width + x] = green/255.0;
blueData[y*width + x] = blue/255.0;
alphaData[y*width + x] = alpha/255.0;
}
}
Image newImage(iPixels,width,height);
Some of the things that can go wrong:
Source is not BGRA but RGBA, which will make the colors all wrong.
Not row major or destination is not in slices which will make things look all screwed up and/or seg-fault

How to efficiently render a 24-bpp image on a 32-bpp display?

First of all, I'm programming in the kernel context so no existing libraries exist. In fact this code is going to go into a library of my own.
Two questions, one more important than the other:
1. As the title suggests, how can I efficiently render a 24-bpp image onto a 32-bpp device, assuming that I have the address of the frame buffer?
Currently I have this code:
void BitmapImage::Render24(uint16_t x, uint16_t y, void (*r)(uint16_t, uint16_t, uint32_t))
{
uint32_t imght = Math::AbsoluteValue(this->DIB->GetBitmapHeight());
uint64_t ptr = (uint64_t)this->ActualBMP + this->Header->BitmapArrayOffset;
uint64_t rowsize = ((this->DIB->GetBitsPerPixel() * this->DIB->GetBitmapWidth() + 31) / 32) * 4;
uint64_t oposx = x;
uint64_t posx = oposx;
uint64_t posy = y + (this->DIB->Type == InfoHeaderV1 && this->DIB->GetBitmapHeight() < 0 ? 0 : this->DIB->GetBitmapHeight());
for(uint32_t d = 0; d < imght; d++)
{
for(uint32_t w = 0; w < rowsize / (this->DIB->GetBitsPerPixel() / 8); w++)
{
r(posx, posy, (*((uint32_t*)ptr) & 0xFFFFFF));
ptr += this->DIB->GetBitsPerPixel() / 8;
posx++;
}
posx = oposx;
posy--;
}
}
r is a function pointer to a PutPixel-esque thing that accepts x, y, and colour parameters.
Obviously this code is terribly slow, since plotting pixels one at a time is never a good idea.
For my 32-bpp rendering code (which I also have a question about, more on that later) I can easily Memory::Copy() the bitmap array (I'm loading bmp files here) to the frame buffer.
However, how do I do this with 24bpp images? On a 24bpp display this would be fine but I'm working with a 32bpp one.
One solution I can think of right now is to create another bitmap array that contains values of the form 0x00(colour) and then use that to draw to the screen. I don't think this is very good though, so I'm looking for a better alternative.
Next question:
2. Given that, for obvious reasons, one cannot simply Memory::Copy() the entire array at once onto the frame buffer, the next best thing would be to copy it row by row.
Is there a better way?
Basically something like this:
for (uint32_t l = 0; l < h; ++l) // l line index in pixels
{
// srcPitch is the distance between source lines in bytes; trgPitch is in 32-bit pixels
const unsigned char* srcLine = (const unsigned char*)srcBuffer + l * srcPitch; // unsigned avoids sign extension
unsigned* trgLine = ((unsigned*)trgBuffer) + l * trgPitch;
for (uint32_t c = 0; c < w; ++c) // c is column index in pixels
{
// build target pixel. arrange indexes to fit your render target (0, 1, 2)
*trgLine++ = ((unsigned)srcLine[0] << 16) | ((unsigned)srcLine[1] << 8)
| srcLine[2] | (0xffu << 24);
srcLine += 3;
}
}
A few notes:
- better to write to a different buffer than the render buffer so the image is displayed at once.
- using functions for pixel placement like you did is very (very very) slow.
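Putting the row-by-row idea together for a BMP source, a sketch might look like the following. It assumes a bottom-up 24-bpp bitmap whose rows are padded to 4 bytes and stored as B, G, R bytes, and a frame buffer that expects 0x00RRGGBB values with its pitch given in pixels; expandRow24To32 and blit24To32 are hypothetical names, and only <cstddef> and <cstdint> are used so it should be usable without a standard library.

```cpp
#include <cstddef>
#include <cstdint>

// Expands one row of packed 24-bpp B,G,R bytes into 32-bpp 0x00RRGGBB pixels.
inline void expandRow24To32(const std::uint8_t* src, std::uint32_t* dst, std::uint32_t width)
{
    for (std::uint32_t i = 0; i < width; ++i, src += 3)
        dst[i] = std::uint32_t(src[0])           // blue
               | (std::uint32_t(src[1]) << 8)    // green
               | (std::uint32_t(src[2]) << 16);  // red
}

// Blits a bottom-up 24-bpp bitmap to a 32-bpp frame buffer at (x, y).
// Assumptions: fbPitchPixels is the frame buffer pitch in 32-bit pixels,
// and the caller has already clipped the image to the screen.
void blit24To32(const std::uint8_t* bitmap, std::uint32_t width, std::uint32_t height,
                std::uint32_t* frameBuffer, std::uint32_t fbPitchPixels,
                std::uint32_t x, std::uint32_t y)
{
    // BMP rows are padded to a multiple of 4 bytes.
    const std::size_t rowSize = ((24u * std::size_t(width) + 31u) / 32u) * 4u;
    for (std::uint32_t row = 0; row < height; ++row)
    {
        // Bottom-up storage: the first row in the file is the bottom line on screen.
        const std::uint8_t* src = bitmap + std::size_t(height - 1 - row) * rowSize;
        std::uint32_t* dst = frameBuffer + std::size_t(y + row) * fbPitchPixels + x;
        expandRow24To32(src, dst, width);
    }
}
```

Compared with calling a PutPixel-style function per pixel, this keeps the inner loop free of indirect calls and recomputed bitmap properties; passing an off-screen buffer as frameBuffer gives the "display at once" behaviour mentioned in the notes.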

Memory error while using memcpy?

I'm using the dcmtk library to modify the pixel data of a multi-frame compressed DICOM image. To do that, at one stage in a for loop I take the pixel data of each decompressed frame, modify it as I wish, and try to concatenate the modified pixel data into one big memory buffer, frame by frame. The core of this for loop is shown below.
The problem is that after the first iteration it gives a memory error at the line of code where I call the function getUncompressedFrame. I think it's happening because of the line memcpy(fullBuffer+(i*sizeF),newBuffer,sizeF);, because when I remove that line there's no error and the whole for loop works absolutely fine.
Could you please tell me if I'm making a mistake in working with memcpy? Thanks.
Uint32 sizeF=828072;// I just wrote it to show what is the data type.
Uint8 * fullBuffer = new Uint8(int(sizeF*numOfFrames));//The big memory buffer
for(int i=0;i<numOfFrames;i++)
{
Uint8 * buffer = new Uint8[int(sizeF)];//Buffer for each frame
Uint8 * newBuffer = new Uint8[int(sizeF)];//Buffer in which the modified frame data is stored
DcmFileCache * cache=NULL;
OFCondition cond=element->getUncompressedFrame(dataset,i,startFragment,buffer,sizeF,decompressedColorModel,cache);
//I get the uncompressed individual frame pixel data
if(buffer != NULL)
{
for(unsigned long y = 0; y < rows; y++)
{
for(unsigned long x = 0; x < cols; x++)
{
if(planarConfiguration==0)
{
if(x>xmin && x<xmax && y>ymin && y<ymax)
{
index=(x + y + y*(cols-1))*samplePerPixel;
if(index<sizeF-2)
{
newBuffer[index] = 0;
newBuffer[index + 1] = 0;
newBuffer[index +2] = 0;
}
}
else
{
index=(x + y + y*(cols-1))*samplePerPixel;
if(index<sizeF-2)
{
newBuffer[index] = buffer[index];
newBuffer[index + 1] = buffer[index + 1];
newBuffer[index + 2] = buffer[index + 2];
}
}
}
}
}
memcpy(fullBuffer+(i*sizeF),newBuffer,sizeF);
//concatenate the modified frame by frame pixel data
}
Change the declaration of fullBuffer to this:
Uint8 * fullBuffer = new Uint8[int(sizeF*numOfFrames)];
Your code didn't allocate an array, it allocated a single Uint8 with the value int(sizeF*numOfFrames).
Uint8 * fullBuffer = new Uint8(int(sizeF*numOfFrames));
This allocates a single byte, giving it an initial value of sizeF*numOfFrames (after truncating it first to int and then to Uint8). You want an array, and you don't want to truncate the size to int:
Uint8 * fullBuffer = new Uint8[sizeF*numOfFrames];
                              ^                 ^
or, to fix the likely memory leaks in your code:
std::vector<Uint8> fullBuffer(sizeF*numOfFrames);
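As a sketch of how the vector-based version might look, the per-frame copy can be wrapped in a small bounds-checked helper. storeFrame is a hypothetical name, plain std::uint8_t stands in for dcmtk's Uint8, and std::copy is used in place of memcpy; none of this comes from dcmtk itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies one frame into its slot of the concatenated buffer.
// Returns false instead of writing out of bounds when the slot does not fit.
// Assumption: every frame has the same size, so slot i starts at i * frame.size().
bool storeFrame(std::vector<std::uint8_t>& fullBuffer, std::size_t frameIndex,
                const std::vector<std::uint8_t>& frame)
{
    const std::size_t offset = frameIndex * frame.size();
    if (offset + frame.size() > fullBuffer.size())
        return false;
    std::copy(frame.begin(), frame.end(), fullBuffer.begin() + offset);
    return true;
}
```

With the original single-byte allocation, the equivalent check would already have failed for the first frame, which makes this kind of bug visible instead of corrupting the heap.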
If the method getUncompressedFrame does an internal memcpy into cache, that could also explain the error, since you are passing a null pointer as the cache argument, with no memory allocated for it.

How to compress YUYV raw data to JPEG using libjpeg?

I'm looking for an example of how to save a YUYV format frame to a JPEG file using the libjpeg library.
In typical computer APIs, "YUV" actually means YCbCr, and "YUYV" means "YCbCr 4:2:2" stored as Y0, Cb01, Y1, Cr01, Y2 ...
Thus, if you have a "YUV" image, you can save it to libjpeg using the JCS_YCbCr color space.
When you have a 422 image (YUYV) you have to duplicate the Cb/Cr values to the two pixels that need them before writing the scanline to libjpeg. Thus, this write loop will do it for you:
// "base" is an unsigned char const * with the YUYV data
// jrow is a JSAMPROW array holding a single row pointer
cinfo.image_width = width & ~1; // round down to an even width: YUYV packs pixels in pairs (& -1 was a no-op)
cinfo.image_height = height;
cinfo.input_components = 3;
cinfo.in_color_space = JCS_YCbCr;
jpeg_set_defaults(&cinfo);
jpeg_set_quality(&cinfo, 92, TRUE);
jpeg_start_compress(&cinfo, TRUE);
unsigned char *buf = new unsigned char[width * 3];
while (cinfo.next_scanline < height) {
for (int i = 0; i < cinfo.image_width; i += 2) {
buf[i*3] = base[i*2];
buf[i*3+1] = base[i*2+1];
buf[i*3+2] = base[i*2+3];
buf[i*3+3] = base[i*2+2];
buf[i*3+4] = base[i*2+1];
buf[i*3+5] = base[i*2+3];
}
jrow[0] = buf;
base += width * 2;
jpeg_write_scanlines(&cinfo, jrow, 1);
}
jpeg_finish_compress(&cinfo);
delete[] buf;
Use your favorite auto-ptr to avoid leaking "buf" if your error or write function can throw / longjmp.
Providing YCbCr to libjpeg directly is preferable to converting to RGB, because it will store it directly in that format, thus saving a lot of conversion work. When the image comes from a webcam or other video source, it's also usually most efficient to get it in YCbCr of some sort (such as YUYV).
Finally, "U" and "V" mean something slightly different in analog component video, so the naming of YUV in computer APIs that really mean YCbCr is highly confusing.
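The per-row expansion done in the write loop above can be isolated into a small helper, which makes the byte shuffling easier to check on its own. This is a sketch with a hypothetical name (yuyvRowToYCbCr) that assumes an even width and the Y0, Cb, Y1, Cr byte order described above; it does not call libjpeg.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands one row of packed YUYV (Y0 Cb Y1 Cr) into interleaved Y,Cb,Cr triplets.
// Each pair of pixels shares one Cb and one Cr sample, so both are duplicated.
// Assumption: width is even; a trailing odd pixel would be left as zeros.
std::vector<std::uint8_t> yuyvRowToYCbCr(const std::uint8_t* yuyv, std::size_t width)
{
    std::vector<std::uint8_t> out(width * 3);
    for (std::size_t i = 0; i + 1 < width; i += 2)
    {
        const std::uint8_t y0 = yuyv[i * 2];
        const std::uint8_t cb = yuyv[i * 2 + 1];
        const std::uint8_t y1 = yuyv[i * 2 + 2];
        const std::uint8_t cr = yuyv[i * 2 + 3];
        out[i * 3 + 0] = y0; out[i * 3 + 1] = cb; out[i * 3 + 2] = cr; // first pixel of the pair
        out[i * 3 + 3] = y1; out[i * 3 + 4] = cb; out[i * 3 + 5] = cr; // second pixel of the pair
    }
    return out;
}
```

The returned row has the layout that a 3-component JCS_YCbCr scanline expects, so its data pointer could serve as the single row handed to jpeg_write_scanlines.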
libjpeg also has a raw data mode, whereby you can directly supply the raw downsampled data (which is almost what you have in the YUYV format). This is more efficient than duplicating the UV values only to have libjpeg downscale them again internally.
To do so, you use jpeg_write_raw_data instead of jpeg_write_scanlines, and by default it will process exactly 16 scanlines at a time. JPEG expects the U and V planes to be 2x downsampled by default. YUYV format already has the horizontal dimension downsampled but not the vertical, so I skip U and V every other scanline.
Initialization:
cinfo.image_width = /* width in pixels */;
cinfo.image_height = /* height in pixels */;
cinfo.input_components = 3;
cinfo.in_color_space = JCS_YCbCr;
jpeg_set_defaults(&cinfo);
cinfo.raw_data_in = true;
JSAMPLE y_plane[16][cinfo.image_width];
JSAMPLE u_plane[8][cinfo.image_width / 2];
JSAMPLE v_plane[8][cinfo.image_width / 2];
JSAMPROW y_rows[16];
JSAMPROW u_rows[8];
JSAMPROW v_rows[8];
for (int i = 0; i < 16; ++i)
{
y_rows[i] = &y_plane[i][0];
}
for (int i = 0; i < 8; ++i)
{
u_rows[i] = &u_plane[i][0];
}
for (int i = 0; i < 8; ++i)
{
v_rows[i] = &v_plane[i][0];
}
JSAMPARRAY rows[] { y_rows, u_rows, v_rows };
Compressing:
jpeg_start_compress(&cinfo, true);
while (cinfo.next_scanline < cinfo.image_height)
{
for (JDIMENSION i = 0; i < 16; ++i)
{
auto offset = (cinfo.next_scanline + i) * cinfo.image_width * 2;
for (JDIMENSION j = 0; j < cinfo.image_width; j += 2)
{
y_plane[i][j] = image.data[offset + j * 2 + 0];
y_plane[i][j + 1] = image.data[offset + j * 2 + 2];
if (i % 2 == 0)
{
u_plane[i / 2][j / 2] = image.data[offset + j * 2 + 1];
v_plane[i / 2][j / 2] = image.data[offset + j * 2 + 3];
}
}
}
jpeg_write_raw_data(&cinfo, rows, 16);
}
jpeg_finish_compress(&cinfo);
I was able to get about a 33% decrease in compression time with this method compared to the one in @JonWatte's answer. This solution isn't for everyone though; some caveats:
- You can only compress images with dimensions that are a multiple of 8. If you have different-sized images, you will have to write code to pad in the edges. If you're getting the images from a camera though, they will most likely be this way.
- The quality is somewhat impaired by the fact that I simply skip color values for alternating scanlines instead of something fancier like averaging them. For my application though, speed was more important than quality.
- The way it's written right now, it allocates a ton of memory on the stack. This was acceptable for me because my images were small (640x480) and enough memory was available.
Documentation for libjpeg-turbo: https://raw.githubusercontent.com/libjpeg-turbo/libjpeg-turbo/master/libjpeg.txt