C++ improving palette indexing algorithm - c++

I have a game engine that indexes colors of some bitmaps which allows using some of the crazy effects of olde (color strobing etc.). Sadly, the indexing algorithm is neither slow nor fast, but since the spritesheets these days are gigantic it really adds up. Currently, loading a single large spritesheet can take 150+ milliseconds, which is an eternity, relatively speaking.
This is the algorithm:
auto& palette = p->pal; // vector
auto& lookup = p->lookup; // vector_map
palette.reserve(200); // There are on average ~100 unique colors
palette.push_back(0); // Index zero is the blank color
uint32_t lastColor = 0;
uint32_t lastPalette = 0;
for (size_t i = 0; i < pixels; i++)
{
const auto color = data[i];
if (color == lastColor)
{
data[i] = lastPalette;
continue;
}
else if (color == 0)
{
continue;
}
uint32_t j = 0;
const auto& it = lookup.find(color);
if (it != lookup.end()) {
j = it->second;
}
else
{
j = palette.size();
palette.push_back(color);
lookup.emplace(color, j);
}
lastColor = color;
// Write the index back to the bitmap:
// (this is just a GPU texture encoding, don't mind it)
data[i] = (j & 255) | ((j >> 8) << (8 + 6));
lastPalette = data[i];
}
The base algorithm is fairly straight-forward:
Go through each pixel, find or create an entry for it (the color), write the index back to the image.
Now, can you parallelize this? Probably not. I have tried with OMP and regular threads. It's simply not going to be fast because regardless of how much time you save by going through each portion of the image separately, at the end you have to have a common set of indexes that apply throughout the whole image, and those indexes have to be written back to the image. Sadly, finding the unique colors first and then writing back using parallelization is also slower than doing it once, sequentially. Makes sense, doesn't it?
Using a bitset has no function here. Knowing whether a color exists is useful, but the colors are 32-bit, which makes for 2^32 bits (aka. 530MB). In contrast, 24-bits is only ~2MB, which might be a micro-optimization. I'm not really looking for that anyway. I need to cut the time by 10x.
So, any ideas? Would it be possible to process 4 or 8 colors at the same time using SSE/AVX?

Related

How can I pixelate a 1d array

I want to pixelate an image stored in a 1d array, although i am not sure how to do it, this is what i have comeup with so far...
the value of pixelation is currently 3 for testing purposes.
currently it just creates a section of randomly coloured pixels along the left third of the image, if i increase the value of pixelation the amount of random coloured pixels decreases and vice versa, so what am i doing wrong?
I have also already implemented the rotation, reading of the image and saving of a new image this is just a separate function which i need assistance with.
picture pixelate( const std::string& file_name, picture& tempImage, int& pixelation /* TODO: OTHER PARAMETERS HERE */)
{
picture pixelated = tempImage;
RGB tempPixel;
tempPixel.r = 0;
tempPixel.g = 0;
tempPixel.b = 0;
int counter = 0;
int numtimesrun = 0;
for (int x = 1; x<tempImage.width; x+=pixelation)
{
for (int y = 1; y<tempImage.height; y+=pixelation)
{
//RGB tempcol;
//tempcol for pixelate
for (int i = 1; i<pixelation; i++)
{
for (int j = 1; j<pixelation; j++)
{
tempPixel.r +=tempImage.pixel[counter+pixelation*numtimesrun].colour.r;
tempPixel.g +=tempImage.pixel[counter+pixelation*numtimesrun].colour.g;
tempPixel.b +=tempImage.pixel[counter+pixelation*numtimesrun].colour.b;
counter++;
//read colour
}
}
for (int k = 1; k<pixelation; k++)
{
for (int l = 1; l<pixelation; l++)
{
pixelated.pixel[numtimesrun].colour.r = tempPixel.r/pixelation;
pixelated.pixel[numtimesrun].colour.g = tempPixel.g/pixelation;
pixelated.pixel[numtimesrun].colour.b = tempPixel.b/pixelation;
//set colour
}
}
counter = 0;
numtimesrun++;
}
cout << x << endl;
}
cout << "Image successfully pixelated." << endl;
return pixelated;
}
I'm not too sure what you really want to do with your code, but I can see a few problems.
For one, you use for() loops with variables starting at 1. That's certainly wrong. Arrays in C/C++ start at 0.
The other main problem I can see is the pixelation parameter. You use it to increase x and y without knowing (at least in that function) whether it is a multiple of width and height. If not, you will definitively be missing pixels on the right edge and at the bottom (which edges will depend on the orientation, of course). Again, it very much depends on what you're trying to achieve.
Also the i and j loops start at the position defined by counter and numtimesrun which means that the last line you want to hit is not tempImage.width or tempImage.height. With that you are rather likely to have many overflows. Actually that would also explain the problems you see on the edges. (see update below)
Another potential problem, cannot tell for sure without seeing the structure declaration, but this sum using tempPixel.c += <value> may overflow. If the RGB components are defined as unsigned char (rather common) then you will definitively get overflows. So your average sum is broken if that's the fact. If that structure uses floats, then you're good.
Note also that your average is wrong. You are adding source data for pixelation x pixalation and your average is calculated as sum / pixelation. So you get a total which is pixalation times larger. You probably wanted sum / (pixelation * pixelation).
Your first loop with i and j computes a sum. The math is most certainly wrong. The counter + pixelation * numtimesrun expression will start reading at the second line, it seems. However, you are reading i * j values. That being said, it may be what you are trying to do (i.e. a moving average) in which case it could be optimized but I'll leave that out for now.
Update
If I understand what you are doing, a representation would be something like a filter. There is a picture of a 3x3:
.+. *
+*+ =>
.+.
What is on the left is what you are reading. This means the source needs to be at least 3x3. What I show on the right is the result. As we can see, the result needs to be 1x1. From what I see in your code you do not take that in account at all. (the varied characters represent varied weights, in your case all weights are 1.0).
You have two ways to handle that problem:
The resulting image has a size of width - pixelation * 2 + 1 by height - pixelation * 2 + 1; in this case you keep one result and do not care about the edges...
You rewrite the code to handle edges. This means you use less source data to compute the resulting edges. Another way is to compute the edge cases and save that in several output pixels (i.e. duplicate the pixels on the edges).
Update 2
Hmmm... looking at your code again, it seems that you compute the average of the 3x3 and save it in the 3x3:
.+. ***
+*+ => ***
.+. ***
Then the problem is different. The numtimesrun is wrong. In your k and l loops you save the pixels pixelation * pixelation in the SAME pixel and that advanced by one each time... so you are doing what I shown in my first update, but it looks like you were trying to do what is shown in my 2nd update.
The numtimesrun could be increased by pixelation each time:
numtimesrun += pixelation;
However, that's not enough to fix your k and l loops. There you probably need to calculate the correct destination. Maybe something like this (also requires a reset of the counter before the loop):
counter = 0;
... for loops ...
pixelated.pixel[counter+pixelation*numtimesrun].colour.r = ...;
... (take care of g and b)
++counter;
Yet again, I cannot tell for sure what you are trying to do, so I do not know why you'd want to copy the same pixel pixelation x pixelation times. But that explains why you get data only at the left (or top) of the image (very much depends on the orientation, one side for sure. And if that's 1/3rd then pixelation is probably 3.)
WARNING: if you implement the save properly, you'll experience crashes if you do not take care of the overflows mentioned earlier.
Update 3
As explained by Mark in the comment below, you have an array representing a 2d image. In that case, your counter variable is completely wrong since this is 100% linear whereas the 2d image is not. The 2nd line is width further away. At this point, you read the first 3 pixels at the top-left, then the next 3 pixels on the same, and finally the next 3 pixels still on the same line. Of course, it could be that your image is thus defined and these pixels are really one after another, although it is not very likely...
Mark's answer is concise and gives you the information necessary to access the correct pixels. However, you will still be hit by the overflow and possibly the fact that the width and height parameters are not a multiple of pixelation...
I don't do a lot of C++, but here's a pixelate function I wrote for Processing. It takes an argument of the width/height of the pixels you want to create.
void pixelateImage(int pxSize) {
// use ratio of height/width...
float ratio;
if (width < height) {
ratio = height/width;
}
else {
ratio = width/height;
}
// ... to set pixel height
int pxH = int(pxSize * ratio);
noStroke();
for (int x=0; x<width; x+=pxSize) {
for (int y=0; y<height; y+=pxH) {
fill(p.get(x, y));
rect(x, y, pxSize, pxH);
}
}
}
Without the built-in rect() function you'd have to write pixel-by-pixel using another two for loops:
for (int px=0; px<pxSize; px++) {
for (int py=0; py<pxH; py++) {
pixelated.pixel[py * tempImage.width + px].colour.r = tempPixel.r;
pixelated.pixel[py * tempImage.width + px].colour.g = tempPixel.g;
pixelated.pixel[py * tempImage.width + px].colour.b = tempPixel.b;
}
}
Generally when accessing an image stored in a 1D buffer, each row of the image will be stored as consecutive pixels and the next row will follow immediately after. The way to address into such a buffer is:
image[y*width+x]
For your purposes you want both inner loops to generate coordinates that go from the top and left of the pixelation square to the bottom right.

Optimizing the Dijkstra's algorithm

I need a graph-search algorithm that is enough in our application of robot navigation and I chose Dijkstra's algorithm.
We are given the gridmap which contains free, occupied and unknown cells where the robot is only permitted to pass through the free cells. The user will input the starting position and the goal position. In return, I will retrieve the sequence of free cells leading the robot from starting position to the goal position which corresponds to the path.
Since executing the dijkstra's algorithm from start to goal would give us a reverse path coming from goal to start, I decided to execute the dijkstra's algorithm backwards such that I would retrieve the path from start to goal.
Starting from the goal cell, I would have 8 neighbors whose cost horizontally and vertically is 1 while diagonally would be sqrt(2) only if the cells are reachable (i.e. not out-of-bounds and free cell).
Here are the rules that should be observe in updating the neighboring cells, the current cell can only assume 8 neighboring cells to be reachable (e.g. distance of 1 or sqrt(2)) with the following conditions:
The neighboring cell is not out of bounds
The neighboring cell is unvisited.
The neighboring cell is a free cell which can be checked via the 2-D grid map.
Here is my implementation:
#include <opencv2/opencv.hpp>
#include <algorithm>
#include "Timer.h"
/// CONSTANTS
static const int UNKNOWN_CELL = 197;
static const int FREE_CELL = 255;
static const int OCCUPIED_CELL = 0;
/// STRUCTURES for easier management.
struct vertex {
cv::Point2i id_;
cv::Point2i from_;
vertex(cv::Point2i id, cv::Point2i from)
{
id_ = id;
from_ = from;
}
};
/// To be used for finding an element in std::multimap STL.
struct CompareID
{
CompareID(cv::Point2i val) : val_(val) {}
bool operator()(const std::pair<double, vertex> & elem) const {
return val_ == elem.second.id_;
}
private:
cv::Point2i val_;
};
/// Some helper functions for dijkstra's algorithm.
uint8_t get_cell_at(const cv::Mat & image, int x, int y)
{
assert(x < image.rows);
assert(y < image.cols);
return image.data[x * image.cols + y];
}
/// Some helper functions for dijkstra's algorithm.
bool checkIfNotOutOfBounds(cv::Point2i current, int rows, int cols)
{
return (current.x >= 0 && current.y >= 0 &&
current.x < cols && current.y < rows);
}
/// Brief: Finds the shortest possible path from starting position to the goal position
/// Param gridMap: The stage where the tracing of the shortest possible path will be performed.
/// Param start: The starting position in the gridMap. It is assumed that start cell is a free cell.
/// Param goal: The goal position in the gridMap. It is assumed that the goal cell is a free cell.
/// Param path: Returns the sequence of free cells leading to the goal starting from the starting cell.
bool findPathViaDijkstra(const cv::Mat& gridMap, cv::Point2i start, cv::Point2i goal, std::vector<cv::Point2i>& path)
{
// Clear the path just in case
path.clear();
// Create working and visited set.
std::multimap<double,vertex> working, visited;
// Initialize working set. We are going to perform the djikstra's
// backwards in order to get the actual path without reversing the path.
working.insert(std::make_pair(0, vertex(goal, goal)));
// Conditions in continuing
// 1.) Working is empty implies all nodes are visited.
// 2.) If the start is still not found in the working visited set.
// The Dijkstra's algorithm
while(!working.empty() && std::find_if(visited.begin(), visited.end(), CompareID(start)) == visited.end())
{
// Get the top of the STL.
// It is already given that the top of the multimap has the lowest cost.
std::pair<double, vertex> currentPair = *working.begin();
cv::Point2i current = currentPair.second.id_;
visited.insert(currentPair);
working.erase(working.begin());
// Check all arcs
// Only insert the cells into working under these 3 conditions:
// 1. The cell is not in visited cell
// 2. The cell is not out of bounds
// 3. The cell is free
for (int x = current.x-1; x <= current.x+1; x++)
for (int y = current.y-1; y <= current.y+1; y++)
{
if (checkIfNotOutOfBounds(cv::Point2i(x, y), gridMap.rows, gridMap.cols) &&
get_cell_at(gridMap, x, y) == FREE_CELL &&
std::find_if(visited.begin(), visited.end(), CompareID(cv::Point2i(x, y))) == visited.end())
{
vertex newVertex = vertex(cv::Point2i(x,y), current);
double cost = currentPair.first + sqrt(2);
// Cost is 1
if (x == current.x || y == current.y)
cost = currentPair.first + 1;
std::multimap<double, vertex>::iterator it =
std::find_if(working.begin(), working.end(), CompareID(cv::Point2i(x, y)));
if (it == working.end())
working.insert(std::make_pair(cost, newVertex));
else if(cost < (*it).first)
{
working.erase(it);
working.insert(std::make_pair(cost, newVertex));
}
}
}
}
// Now, recover the path.
// Path is valid!
if (std::find_if(visited.begin(), visited.end(), CompareID(start)) != visited.end())
{
std::pair <double, vertex> currentPair = *std::find_if(visited.begin(), visited.end(), CompareID(start));
path.push_back(currentPair.second.id_);
do
{
currentPair = *std::find_if(visited.begin(), visited.end(), CompareID(currentPair.second.from_));
path.push_back(currentPair.second.id_);
} while(currentPair.second.id_.x != goal.x || currentPair.second.id_.y != goal.y);
return true;
}
// Path is invalid!
else
return false;
}
int main()
{
// cv::Mat image = cv::imread("filteredmap1.jpg", CV_LOAD_IMAGE_GRAYSCALE);
cv::Mat image = cv::Mat(100,100,CV_8UC1);
std::vector<cv::Point2i> path;
for (int i = 0; i < image.rows; i++)
for(int j = 0; j < image.cols; j++)
{
image.data[i*image.cols+j] = FREE_CELL;
if (j == image.cols/2 && (i > 3 && i < image.rows - 3))
image.data[i*image.cols+j] = OCCUPIED_CELL;
// if (image.data[i*image.cols+j] > 215)
// image.data[i*image.cols+j] = FREE_CELL;
// else if(image.data[i*image.cols+j] < 100)
// image.data[i*image.cols+j] = OCCUPIED_CELL;
// else
// image.data[i*image.cols+j] = UNKNOWN_CELL;
}
// Start top right
cv::Point2i goal(image.cols-1, 0);
// Goal bottom left
cv::Point2i start(0, image.rows-1);
// Time the algorithm.
Timer timer;
timer.start();
findPathViaDijkstra(image, start, goal, path);
std::cerr << "Time elapsed: " << timer.getElapsedTimeInMilliSec() << " ms";
// Add the path in the image for visualization purpose.
cv::cvtColor(image, image, CV_GRAY2BGRA);
int cn = image.channels();
for (int i = 0; i < path.size(); i++)
{
image.data[path[i].x*cn*image.cols+path[i].y*cn+0] = 0;
image.data[path[i].x*cn*image.cols+path[i].y*cn+1] = 255;
image.data[path[i].x*cn*image.cols+path[i].y*cn+2] = 0;
}
cv::imshow("Map with path", image);
cv::waitKey();
return 0;
}
For the algorithm implementation, I decided to have two sets namely the visited and working set whose each elements contain:
The location of itself in the 2D grid map.
The accumulated cost
Through what cell did it get its accumulated cost (for path recovery)
And here is the result:
The black pixels represent obstacles, the white pixels represent free space and the green line represents the path computed.
On this implementation, I would only search within the current working set for the minimum value and DO NOT need to scan throughout the cost matrix (where initially, the initially cost of all cells are set to infinity and the starting point 0). Maintaining a separate vector of the working set I think promises a better code performance because all the cells that have cost of infinity is surely to be not included in the working set but only those cells that have been touched.
I also took advantage of the STL which C++ provides. I decided to use the std::multimap since it can store duplicating keys (which is the cost) and it sorts the lists automatically. However, I was forced to use std::find_if() to find the id (which is the row,col of the current cell in the set) in the visited set to check if the current cell is on it which promises linear complexity. I really think this is the bottleneck of the Dijkstra's algorithm.
I am well aware that A* algorithm is much faster than Dijkstra's algorithm but what I wanted to ask is my implementation of Dijkstra's algorithm optimal? Even if I implemented A* algorithm using my current implementation in Dijkstra's which is I believe suboptimal, then consequently A* algorithm will also be suboptimal.
What improvement can I perform? What STL is the most appropriate for this algorithm? Particularly, how do I improve the bottleneck?
You're using a std::multimap for 'working' and 'visited'. That's not great.
The first thing you should do is change visited into a per-vertex flag so you can do your find_if in constant time instead of linear times and also so that operations on the list of visited vertices take constant instead of logarithmic time. You know what all the vertices are and you can map them to small integers trivially, so you can use either a std::vector or a std::bitset.
The second thing you should do is turn working into a priority queue, rather than a balanced binary tree structure, so that operations are a (largish) constant factor faster. std::priority_queue is a barebones binary heap. A higher-radix heap---say quaternary for concreteness---will probably be faster on modern computers due to its reduced depth. Andrew Goldberg suggests some bucket-based data structures; I can dig up references for you if you get to that stage. (They're not too complicated.)
Once you've taken care of these two things, you might look at A* or meet-in-the-middle tricks to speed things up even more.
Your performance is several orders of magnitude worse than it could be because you're using graph search algorithms for what looks like geometry. This geometry is much simpler and less general than the problems that graph search algorithms can solve. Also, with a vertex for every pixel your graph is huge even though it contains basically no information.
I heard you asking "how can I make this better without changing what I'm thinking" but nevertheless I'll tell you a completely different and better approach.
It looks like your robot can only go horizontally, vertically or diagonally. Is that for real or just a side effect of you choosing graph search algorithms? I'll assume the latter and let it go in any direction.
The algorithm goes like this:
(0) Represent your obstacles as polygons by listing the corners. Work in real numbers so you can make them as thin as you like.
(1) Try for a straight line between the end points.
(2) Check if that line goes through an obstacle or not. To do that for any line, show that all corners of any particular obstacle lie on the same side of the line. To do that, translate all points by (-X,-Y) of one end of the line so that that point is at the origin, then rotate until the other point is on the X axis. Now all corners should have the same sign of Y if there's no obstruction. There might be a quicker way just using gradients.
(3) If there's an obstruction, propose N two-segment paths going via the N corners of the obstacle.
(4) Recurse for all segments, culling any paths with segments that go out of bounds. That won't be a problem unless you have obstacles that go out of bounds.
(5) When it stops recursing, you should have a list of locally optimised paths from which you can choose the shortest.
(6) If you really want to restrict bearings to multiples of 45 degrees, then you can do this algorithm first and then replace each segment by any 45-only wiggly version that avoids obstacles. We know that such a version exists because you can stay extremely close to the original line by wiggling very often. We also know that all such wiggly paths have the same length.

What is the fastest way to access a pixel in QImage?

I would like to know what is the fastest way to modify a portion of a QImage.
I have this piece of code that has to be executed with a frequency of 30Hz. It displays an image through a sort of keyhole. It is not possible to see the entire image but only a portion inside a circle. The first for-loop erases the previous "keyhole portion displayed" and the second updates the position of the "displayed keyhole".
for (int i = (prev_y - r_y); i < (prev_y + r_y); i++){
QRgb *line = (QRgb *)backgrd->scanLine(i);
for(int j = (prev_x - r_x); j < (prev_x + r_x) ; j++){
if((i >= 0 && i < this->backgrd->height()) && (j >= 0 && j < this->backgrd->width()))
line[j] = qRgb(0,0,0);
}
}
prev_x = new_x; prev_y = new_y;
for (int i = (new_y - r_y); i < (new_y + r_y); i++){
QRgb *line = (QRgb *)backgrd->scanLine(i);
QRgb *line2 = (QRgb *)this->picture->scanLine(i);
for(int j = (new_x - r_x); j < (new_x + r_x) ; j++){
if ((((new_x - j)*(new_x - j)/(r_x*r_x) + (new_y - i)*(new_y - i)/(r_y*r_y)) <= 1) && (i >= 0) && (i < this->picture->height())&& (j >= 0) && (j < this->picture->width()))
line[j] = line2[j];
}
}
this->current_img = this->backgrd;
}
this->update(); //Display QImage* this->current_img
If I analyse the timestamps of the program I find a delay in the flow of execution every time it is executed...
Is it so high consuming to access a pixel in a QImage? Am I doing something wrong?
Is there a better alternative to QImage for a Qt program?
How about prerendering your 'keyhole' in an array/qimage and doing a bitwise AND with the source?
Original pixel && black => black
Original pixel && white => original pixel
You have a lot of conditions in the innermost loop (some can be moved out though), but the circle radius calculation with the multiplies and divides looks costly. You can reuse the keyhole mask for every frame, so no calculations need be performed.
You could move some of the conditions at least to the outer loop, and maybe pre-compute some of the terms inside the conditions, though this may be optimized anyway.
Call update only for the rectangle(s) you modified
Where do you get the time stamp? Maybe you lose time somewhere else?
Actually I understood it wasn't pixel acces that was slow, but the rendering.
During the tests I did I used plain color images, but these kind of images are much faster to render than complex images loaded from file. With other tests I realized was the rendering that was slow.
The fastest way to render an QImage is first of all to transform it using
public: static QImage QGLWidget::convertToGLFormat(const QImage &img)
then the image can be fastly manipulated (it preserves bits() scanline() width() and height() functions)
and can be displayed very fast by openGL (no further conversions are necessary)
QPainter painter(this);
glDrawPixels(img.width(), img.height(), GL_RGBA, GL_UNSIGNED_BYTE, img.bits());
painter.end();
As far as I know the fastest way to access the data of a QImage is to use QImage::bits() which give you direct access to the data of QImage.
For your problem, A better approch will be to do as Bgie suggested : using a array representing the keyhole and doing only a bitwise AND operation.
it will help to choose the correct format for your Image, the format RGB32 and ARG32_Premultiplied_ARGB32 are the fastest. Don't use ARGB32 if you don't need it.

Drastic performance differences: debug vs release

I have a simple algorithm which converts a bayer image channel (BGGR,RGGB,GBRG,GRBG) to rgb (demosaicing but without neighbors). In my implementation I have pre-set offset vectors which help me to translate the bayer channel index to its corresponding rgb channel indices.
Only problem is I'm getting awful performance in debug mode with MSVC11. Under release, for an input of 3264X2540 size the function completes in ~60ms. For the same input in debug, the function completes in ~20,000ms. That's more than X300 difference and since some developers are runnig my application in debug, it's unacceptable.
My code:
void ConvertBayerToRgbImageDemosaic(int* BayerChannel, int* RgbChannel, int Width, int
Height, ColorSpace ColorSpace)
{
int rgbOffsets[4]; //translates color location in Bayer block to it's location in RGB block. So R->0, G->1, B->2
std::vector<int> bayerToRgbOffsets[4]; //the offsets from every color in the Bayer block to (bayer) indices it will be copied to (R,B are copied to all indices, Gr to R and Gb to B).
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
/*
B G
G R
*/
rgbOffsets[0] = 2; //B->0
rgbOffsets[1] = 1; //G->1
rgbOffsets[2] = 1; //G->1
rgbOffsets[3] = 0; //R->0
//B is copied to every pixel in it's block
bayerToRgbOffsets[0].push_back(0);
bayerToRgbOffsets[0].push_back(1);
bayerToRgbOffsets[0].push_back(Width);
bayerToRgbOffsets[0].push_back(Width + 1);
//Gb is copied to it's neighbouring B
bayerToRgbOffsets[1].push_back(-1);
bayerToRgbOffsets[1].push_back(0);
//GR is copied to it's neighbouring R
bayerToRgbOffsets[2].push_back(0);
bayerToRgbOffsets[2].push_back(1);
//R is copied to every pixel in it's block
bayerToRgbOffsets[3].push_back(-Width - 1);
bayerToRgbOffsets[3].push_back(-Width);
bayerToRgbOffsets[3].push_back(-1);
bayerToRgbOffsets[3].push_back(0);
break;
... other color spaces
}
for (auto row = 0; row < Height; row++)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto colorIndex = (row%2)*2 + (col%2); //0...3, For example in BGGR: 0->B, 1->Gb, 2->Gr, 3->R
//iteration over bayerToRgbOffsets is O(1) since it is either sized 2 or 4.
std::for_each(bayerToRgbOffsets[colorIndex].begin(), bayerToRgbOffsets[colorIndex].end(),
[&](int colorOffset)
{
auto rgbIndex = (bayerIndex + colorOffset) * 3 + rgbOffsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
});
}
}
}
What I've tried:
I tried turing on optimization (/O2) for the debug build with no significant differences.
I tried replacing the inner for_each statement with a plain old for loop but to no avail. I have a very similar algorithm which converts bayer to "green" rgb (without copying the data to neighboring pixels in the block) in which I'm not using the std::vector and there there is the expected runtime difference between debug and release (X2-X3). So, could the std::vector be the problem? If so, how do I overcome it?
As you use std::vector, It will help to disable iterator debugging.
MSDN shows how to do it.
In simple terms, make this #define before you include any STL headers:
#define _HAS_ITERATOR_DEBUGGING 0
In my experience, this gives a major boost in performance of Debug builds, although of course you do lose some Debugging functionality.
In VS you can use below settings for debugging, Disabled (/Od). Choose one of the other options (Minimum Size(/O1), Maximum Speed(/O2), Full Optimization(/Ox), or Custom). Along with iterator optimization which Roger Rowland mentioned...

std::map performance c++

I have a problem with std::map performance. In my C++ project I have a list of GUIObjects which also includes Windows. I draw everything in for loop, like this:
unsigned int guiObjectListSize = m_guiObjectList.size();
for(unsigned int i = 0; i < guiObjectListSize; i++)
{
GUIObject* obj = m_guiObjectList[i];
if(obj->getParentId() < 0)
obj->draw();
}
In this case when I run a project, it works smoothly. I have 4 windows and few other components like buttons etc.
But I would like to take care of drawing windows separately, so after modifications, my code looks like this:
// Draw all objects except windows
unsigned int guiObjectListSize = m_guiObjectList.size();
for(unsigned int i = 0; i < guiObjectListSize; i++)
{
GUIObject* obj = m_guiObjectList[i];
if((obj->getParentId() < 0) && (dynamic_cast<Window*>(obj) == nullptr))
obj->draw(); // GUIManager should only draw objects which don't have parents specified
// And those that aren't instances of Window class
// Rest objects will be drawn by their parents
// But only if that parent is able to draw children (i.e. Window or Layout)
}
// Now draw windows
for(int i = 1; i <= m_windowList.size(); i++)
{
m_windowList[i]->draw(); // m_windowList is a map!
}
So I created a std::map<int, Window*>, because I need z-indexes of Windows to be set as keys in a map. But the problem is that when I run this code, it's really slow. Even though I have only 4 windows (map size is 4), I can see that fps rate is very low. I can't say an exact number, because I don't have such counter implemented yet.
Could anyone tell me why this approach is so slow?
This is what virtual functions are for. Not only do you eliminate the slow dynamic_cast, but you get a more flexible type check.
// Draw all objects except windows
unsigned int guiObjectListSize = m_guiObjectList.size();
for(unsigned int i = 0; i < guiObjectListSize; i++)
{
GUIObject* obj = m_guiObjectList[i];
if(obj->getParentId() < 0)
obj->drawFirstChance();
}
// Now draw windows
for(int i = 1; i <= m_windowList.size(); i++)
{
m_windowList[i]->drawSecondChance();
}
Where drawFirstChance doesn't do anything for windows and other floating objects.
The next optimization opportunity is to make the window list a vector and perform z-order sorting only when it changes (assuming windows are created/destroyed/reordered much less often than they are drawn).
The problem with this code doesn't seem to be with the use of std::map. Instead, the bottleneck is rather the use of dynamic_cast, which is a very expensive operation as it needs to wander through the inheritance tree of the given class.
This tree is most likely rather large for your GUI components, which would definitely explain why doing so in each iteration slows down the approach as a whole.