Procedurally generate seamless fractal noise textures


I have been generating noise textures to use as height maps for terrain generation. In this application, initially there is a 256x256 noise texture that is used to create a block of land that the user is free to roam around. When the user reaches a certain boundary in-game the application generates a new texture and thus another block of terrain.
In the code, a 64x64 table of random values is generated. The values in the texture are the result of interpolating between these points at various 'frequencies' and 'wavelengths' using a smoothstep function; the layers are then combined to form the final noise texture, and finally the values in the texture are divided by the largest value to normalize them. When the player is at the boundary and a new texture is created, the new random number table re-uses the values from the appropriate edge of the previous texture (e.g. if the new texture is for a block of land on the +X side of the previous one, the last value in every row of the previous texture is used as the first value in every row of random numbers in the next).
My problem is this: even though the same values are being used across the edges of adjacent textures, they are nowhere near seamless. Some neighboring points on the terrain are mismatched by many metres. My guess is that the changing frequencies used to sample the random number table are having a significant effect on all areas of the texture. So how might one generate fractal noise procedurally, i.e. as needed, AND have it look continuous with adjacent values?
Here is a section of the code that returns a value interpolated between the points on the random number table given a point P:
float MainApp::assessVal(glm::vec2 P){
//Integer component of P
int xi = (int)P.x;
int yi = (int)P.y;
//Fractional component of P
float xr = P.x - xi;
float yr = P.y - yi;
//Find the grid square P lies inside of
int x0 = xi % randX;
int x1 = (xi + 1) % randX;
int y0 = yi % randY;
int y1 = (yi + 1) % randY;
//Get random values for the 4 nodes
float r00 = randNodes->randNodes[y0][x0];
float r10 = randNodes->randNodes[y0][x1];
float r01 = randNodes->randNodes[y1][x0];
float r11 = randNodes->randNodes[y1][x1];
//Smoother interpolation so
//texture appears less blocky
float sx = smoothstep(xr);
float sy = smoothstep(yr);
//Find the weighted value of the 4
//random values. This will be the
//final value in the noise texture
float sx0 = mix(r00, r10, sx);
float sx1 = mix(r01, r11, sx);
return mix(sx0, sx1, sy);
}
Where randNodes is a 2 dimensional array containing the random values.
And here is the code that takes all the values returned from the above function and constructs texture data:
int layers = 5;
float wavelength = 1, frequency = 1;
for (int k = 0; k < layers; k++) {
for (int i = 0; i < stepsY; i++) {
for(int j = 0; j < stepsX; j++){
//Compute value for (stepsX * stepsY) interpolation points
//across the grid of random numbers
glm::vec2 P = glm::vec2((float)j/stepsX * randX, (float)i/stepsY * randY);
buf[i * stepsY + j] += assessVal(P * wavelength) * frequency;
}
}
//repeat (layers) times with different signals
wavelength *= 0.5;
frequency *= 2;
}
for(int i = 0; i < buf.size(); i++){
//divide all data by the largest value.
//this normalises the data to avoid saturation
buf[i] /= largestVal;
}
Finally, here is an example of two textures generated by these functions that should be seamless, but aren't:
The 2 images placed side by side as they are now are obviously mis-matched.

Your code wraps the values only in the domain of the noise texture you read from, but not in the domain of the texture being generated.
For the texture T of size stepsX to be repeatable (consider the 1-D case for simplicity) you must have
T(0) == T(stepsX)
Or, in your case (substituting j = 0 and j = stepsX):
assessVal(0) == assessVal(randX * wavelength)
For k >= 1 this is clearly not true in your code, because
(randX / pow(2, k)) % randX != 0
One solution is to decrease randX and randY as you go up the frequencies, so that each layer still spans a whole number of periods of the table.
My typical approach, though, would be to start from a 2x2 random texture, upscale it to 4x4 with GL_REPEAT, add a bit more per-pixel noise, continue upscaling to 8x8, and so on until I reach the desired size.
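As a rough illustration of the first suggestion, here is a minimal sketch in which every layer wraps its lattice with a period that exactly matches the span being sampled, so T(0) == T(size) holds for each layer. It is not the asker's code: latticeValue, periodicNoise and tileableFractal are hypothetical names, an integer hash stands in for the random number table, and I assume the usual fractal convention of doubling the lattice period and halving the amplitude per layer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the random table: a hashed lattice value in [0, 1].
static float latticeValue(int x, int y, std::uint32_t seed) {
    std::uint32_t h = seed ^ (static_cast<std::uint32_t>(x) * 374761393u)
                           ^ (static_cast<std::uint32_t>(y) * 668265263u);
    h = (h ^ (h >> 13)) * 1274126177u;
    h ^= h >> 16;
    return static_cast<float>(h & 0xFFFFFFu) / 16777215.0f;
}

static float smooth(float t) { return t * t * (3.0f - 2.0f * t); }
static float mixf(float a, float b, float t) { return a + (b - a) * t; }

// Value noise whose lattice repeats every `period` cells along both axes.
float periodicNoise(float x, float y, int period, std::uint32_t seed) {
    assert(period > 0);
    const int xi = static_cast<int>(std::floor(x));
    const int yi = static_cast<int>(std::floor(y));
    const float sx = smooth(x - xi);
    const float sy = smooth(y - yi);
    // Wrap the lattice indices in the domain of the generated texture.
    const int x0 = ((xi % period) + period) % period;
    const int y0 = ((yi % period) + period) % period;
    const int x1 = (x0 + 1) % period;
    const int y1 = (y0 + 1) % period;
    const float top = mixf(latticeValue(x0, y0, seed), latticeValue(x1, y0, seed), sx);
    const float bottom = mixf(latticeValue(x0, y1, seed), latticeValue(x1, y1, seed), sx);
    return mixf(top, bottom, sy);
}

// Sums `layers` layers into a size x size texture that tiles with itself.
std::vector<float> tileableFractal(int size, int basePeriod, int layers, std::uint32_t seed) {
    assert(size > 0 && basePeriod > 0 && layers > 0);
    std::vector<float> buf(static_cast<std::size_t>(size) * size, 0.0f);
    float amplitude = 1.0f, total = 0.0f;
    int period = basePeriod;
    for (int k = 0; k < layers; ++k) {
        for (int i = 0; i < size; ++i) {
            for (int j = 0; j < size; ++j) {
                // The texture spans exactly `period` lattice cells in this layer.
                const float x = static_cast<float>(j) / size * period;
                const float y = static_cast<float>(i) / size * period;
                buf[i * size + j] += periodicNoise(x, y, period, seed + k) * amplitude;
            }
        }
        total += amplitude;
        amplitude *= 0.5f;
        period *= 2;
    }
    for (float& v : buf) v /= total;  // normalise by the sum of amplitudes
    return buf;
}
```

Note that this only makes a texture tile with copies of itself. For neighbouring blocks that differ, one option would be to drop the wrap and feed global lattice coordinates into the hash, so that both blocks see the same lattice values along the shared edge.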

The root cause, of course, is that your smoothing changes pixels to match their neighbors, but you later add new neighbors and do not re-smooth the pixels that got them.
One simple and common workaround is to keep an edge of invisible pixels, the width of which is half that of your smoothing kernel. Now, when expanding the area, you can resmooth those invisible pixels just before they're revealed. Don't forget to add a new edge of invisible pixels!

Related

Increase the resolution of a geometry shape

The task is to increase the resolution of a geometry shape like this:
So that the shape becomes this by adding data points:
The shape is defined by data points, each of which has x and y coordinates and an index. The index represents the order in which to connect them.
What type of algorithm should I use to achieve this?

You can use linear interpolation between segment ends.
It is not clear yet how many new points you want to insert in each segment. Does it depend on segment length or on something else? Your picture seems to show a mixed approach: the number of parts n is at least nmin, and the part length is at most lmax, like this (rounding up, so that no part is longer than lmax):
n = max(nmin, (int)ceil(seglength / lmax));
For XStart, XEnd referring to starting and ending coordinates of segment, we insert n-1 points, dividing segment into n equal parts:
for (int i = 1; i < n; i++) {
X[i] = XStart + (XEnd - XStart) * i / n;
Y[i] = YStart + (YEnd - YStart) * i / n;
}
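Putting the pieces together, a minimal sketch of the whole pass over an open polyline might look like the following. Point and densify are hypothetical names, and the points are assumed to be stored already sorted by their index. The part count is rounded up with ceil so that no part exceeds lmax.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Returns the polyline with extra points inserted by linear interpolation.
// Each segment is split into n equal parts, n = max(nmin, ceil(length / lmax)).
std::vector<Point> densify(const std::vector<Point>& pts, int nmin, double lmax) {
    assert(nmin >= 1 && lmax > 0.0);
    std::vector<Point> out;
    if (pts.empty()) return out;
    for (std::size_t s = 0; s + 1 < pts.size(); ++s) {
        const Point a = pts[s];
        const Point b = pts[s + 1];
        const double len = std::hypot(b.x - a.x, b.y - a.y);
        const int n = std::max(nmin, static_cast<int>(std::ceil(len / lmax)));
        out.push_back(a);                       // segment start
        for (int i = 1; i < n; ++i) {           // n - 1 interior points
            const double t = static_cast<double>(i) / n;
            out.push_back({a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t});
        }
    }
    out.push_back(pts.back());                  // final end point
    return out;
}
```

For a closed shape, append a copy of the first point to the input so the closing segment is subdivided too.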

Suggestions to Compute the Intersections of Multiple Convex 2D Polygons

I am writing this question fishing for any state-of-the-art software or methods that can quickly compute the intersection of N 2D polygons (the convex hulls of projected convex polyhedrons) and M 2D polygons, where typically N >> M. N may be on the order of at least 1M polygons and M on the order of 50k. I've searched for some time now, but I keep coming up with the same answer shown below.
Use boost and a loop to
compute the projection of the polyhedron (not the bottleneck)
compute the convex hull of said polyhedron (bottleneck)
compute the intersection of the projected polyhedron and existing 2D polygon (major bottleneck).
This loop is repeated NK times where typically K << M, and K is the average number of 2D polygons intersecting a single projected polyhedron. This is done to reduce the number of computations.
The problem with this is that if I have N=262144 and M=19456 it takes about 129 seconds (when multithreaded by polyhedron), and this must be done about 300 times. Ideally, I would like to reduce the computation time to about 1 second for the above sizes, so I was wondering if someone could help point to some software or literature that could improve efficiency.
[EDIT]
At @sehe's request I'm posting the most relevant parts of the code. I haven't compiled it, so this is just to give the gist. The code assumes there are voxels and pixels, but the shapes can be anything. The points in the grid can be in any order, but the indices of where the points reside in the grid are the same.
#include <boost/geometry/geometry.hpp>
#include <boost/geometry/geometries/point.hpp>
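#include <boost/geometry/geometries/ring.hpp>
#include <boost/geometry/geometries/polygon.hpp>
#include <boost/geometry/geometries/box.hpp>
#include <algorithm>
#include <cmath>
#include <vector>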
const std::size_t Dimension = 2;
typedef boost::geometry::model::point<float, Dimension, boost::geometry::cs::cartesian> point_2d;
typedef boost::geometry::model::polygon<point_2d, false /* is cw */, true /* closed */> polygon_2d;
typedef boost::geometry::model::box<point_2d> box_2d;
std::vector<float> getOverlaps(std::vector<float> & projected_grid_vx, // projected voxels
std::vector<float> & pixel_grid_vx, // pixels
std::vector<int> & projected_grid_m, // number of voxels in each dimension
std::vector<int> & pixel_grid_m, // number of pixels in each dimension
std::vector<float> & pixel_grid_omega, // size of the pixel grid in cm
int projected_grid_size, // total number of voxels
int pixel_grid_size) { // total number of pixels
std::vector<float> overlaps(projected_grid_size * pixel_grid_size);
std::vector<float> h(pixel_grid_m.size());
for(int d=0; d < pixel_grid_m.size(); d++) {
h[d] = (pixel_grid_omega[2*d+1] - pixel_grid_omega[2*d]) / pixel_grid_m[d];
}
for(int i=0; i < projected_grid_size; i++){
std::vector<float> point_indices(8);
point_indices[0] = i;
point_indices[1] = i + 1;
point_indices[2] = i + projected_grid_m[0];
point_indices[3] = i + projected_grid_m[0] + 1;
point_indices[4] = i + projected_grid_m[0] * projected_grid_m[1];
point_indices[5] = i + projected_grid_m[0] * projected_grid_m[1] + 1;
point_indices[6] = i + (projected_grid_m[1] + 1) * projected_grid_m[0];
point_indices[7] = i + (projected_grid_m[1] + 1) * projected_grid_m[0] + 1;
std::vector<float> vx_corners(8 * projected_grid_m.size());
for(int vn = 0; vn < 8; vn++) {
for(int d = 0; d < projected_grid_m.size(); d++) {
vx_corners[vn + d * 8] = projected_grid_vx[point_indices[vn] + d * projected_grid_size];
}
}
polygon_2d proj_voxel;
for(int vn = 0; vn < 8; vn++) {
point_2d poly_pt(vx_corners[2 * vn], vx_corners[2 * vn + 1]);
boost::geometry::append(proj_voxel, poly_pt);
}
boost::geometry::correct(proj_voxel);
polygon_2d proj_voxel_hull;
boost::geometry::convex_hull(proj_voxel, proj_voxel_hull);
box_2d bb_proj_vox;
boost::geometry::envelope(proj_voxel_hull, bb_proj_vox);
point_2d min_pt = bb_proj_vox.min_corner();
point_2d max_pt = bb_proj_vox.max_corner();
// then get min and max indices of intersecting bins
std::vector<float> min_idx(projected_grid_m.size() - 1),
max_idx(projected_grid_m.size() - 1);
// compute min and max indices of incidence on the pixel grid
// this is easy assuming you have a regular grid of pixels
min_idx[0] = std::min( (float) std::max( std::floor((min_pt.get<0>() - pixel_grid_omega[0]) / h[0] - 0.5 ), 0.), pixel_grid_m[0]-1);
min_idx[1] = std::min( (float) std::max( std::floor((min_pt.get<1>() - pixel_grid_omega[2]) / h[1] - 0.5 ), 0.), pixel_grid_m[1]-1);
max_idx[0] = std::min( (float) std::max( std::floor((max_pt.get<0>() - pixel_grid_omega[0]) / h[0] + 0.5 ), 0.), pixel_grid_m[0]-1);
max_idx[1] = std::min( (float) std::max( std::floor((max_pt.get<1>() - pixel_grid_omega[2]) / h[1] + 0.5 ), 0.), pixel_grid_m[1]-1);
// iterate only over pixels which intersect the projected voxel
for(int iy = min_idx[1]; iy <= max_idx[1]; iy++) {
for(int ix = min_idx[0]; ix <= max_idx[0]; ix++) {
int idx = ix + iy * pixel_grid_m[0]; // `first' index of pixel corner point
polygon_2d pix_poly;
for(int pn = 0; pn < 4; pn++) {
point_2d pix_corner_pt(
pixel_grid_vx[idx + pn % 2 + (pn / 2) * pixel_grid_m[0]],
pixel_grid_vx[idx + pn % 2 + (pn / 2) * pixel_grid_m[0] + pixel_grid_size]
);
boost::geometry::append(pix_poly, pix_corner_pt);
}
boost::geometry::correct( pix_poly );
//make this into a convex hull since the order of the point may be any
polygon_2d pix_hull;
boost::geometry::convex_hull(pix_poly, pix_hull);
// on to perform intersection
std::vector<polygon_2d> vox_pix_ints;
polygon_2d vox_pix_int;
try {
boost::geometry::intersection(proj_voxel_hull, pix_hull, vox_pix_ints);
} catch ( const std::exception & e ) {
// skip since these may coincide at a point or line
continue;
}
// both are convex, so at most one intersection polygon is expected
if (vox_pix_ints.empty()) continue;
vox_pix_int = vox_pix_ints[0];
overlaps[i + idx * projected_grid_size] = boost::geometry::area(vox_pix_int);
}
} // end intersection for
} //end projected_voxel for
return overlaps;
}

You could compute the ratio of polygon area to bounding box (BB) area:
This could be computed once to arrive at a constant R, the average ratio of poly area to BB area.
Or you could derive it geometrically, using a circle bounded by its BB, since you're only using projected polyhedra:
R = 0.0;
count = 0;
for (each poly) {
count++;
R += polyArea / itsBoundingBoxArea;
}
R = R/count;
Then calculate the summation of intersection of bounding boxes.
Sbb = 0.0;
for (box1, box2 where box1.isIntersecting(box2)) {
Sbb += box1.intersect(box2);
}
Then:
Approximation = R * Sbb
None of this would work if concave polys were allowed, because a concave poly can occupy less than 1% of its bounding box. You will still have to find the convex hull.
Alternatively, if you can find a polygon's area more quickly than its hull, you could use the actual computed average poly area. This would give you a decent approximation as well, while avoiding both poly intersection and wrapping.

Hm, the problem seems similar to doing "collision detection" in game engines, or to "potentially visible sets".
While I don't know much about the current state of the art, I remember that one optimization was to enclose objects in spheres, since checking overlaps between spheres (or circles in 2D) is really cheap.
To speed up collision checks, objects were often put into search structures (e.g. a sphere tree, or a circle tree in the 2D case). These organize the space into a hierarchical structure to make overlap queries fast.
So my suggestion boils down to this: try looking at algorithms for collision detection in game engines.

Assumption
I'm assuming that you mean "intersections" and not "intersection". Moreover, I assume it is not the expected use case that most of the individual polys from M and N overlap at the same time. If this assumption is true then:
Answer
The way this is done in 2D game engines is with a scene graph where every object has a bounding box. Place each polygon into a node of a quadtree according to the location of its bounding box. The task then becomes parallel, because each node can be processed separately for intersections.
Here is the wiki for quadtree:
Quadtree Wiki
An octree could be used when in 3D.
It actually doesn't even have to be a tree. You could get the same results with any space partition. You could find the maximum separation of the polys (let's call it S) and create, say, partitions of width S/10. You would then have 10 separate spaces to process in parallel. Not only would it be concurrent, but it would no longer take M * N time, since not every poly must be compared against every other poly.
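As a minimal sketch of such a partition, the following bins the bounding boxes of one set into a uniform grid and then looks up candidates for each box of the other set. Box, overlaps and candidatePairs are hypothetical names, and the cell size is an assumption you would tune (for example to the typical pixel size). Only the surviving pairs would then go on to the exact polygon intersection.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Box { float minx, miny, maxx, maxy; };

inline bool overlaps(const Box& a, const Box& b) {
    return a.minx <= b.maxx && b.minx <= a.maxx &&
           a.miny <= b.maxy && b.miny <= a.maxy;
}

// Returns (index in A, index in B) for every pair of overlapping bounding boxes.
std::vector<std::pair<int, int>> candidatePairs(const std::vector<Box>& A,
                                                const std::vector<Box>& B,
                                                float cell) {
    assert(cell > 0.0f);
    auto cellOf = [cell](float v) { return static_cast<int>(std::floor(v / cell)); };
    auto key = [](int cx, int cy) {
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
               static_cast<std::uint32_t>(cy);
    };

    // Register every box of B in each grid cell it touches.
    std::unordered_map<std::uint64_t, std::vector<int>> grid;
    for (int b = 0; b < static_cast<int>(B.size()); ++b)
        for (int cy = cellOf(B[b].miny); cy <= cellOf(B[b].maxy); ++cy)
            for (int cx = cellOf(B[b].minx); cx <= cellOf(B[b].maxx); ++cx)
                grid[key(cx, cy)].push_back(b);

    // Each box of A only inspects the cells it touches. This loop is
    // independent per `a`, so it could be split across threads.
    std::vector<std::pair<int, int>> out;
    for (int a = 0; a < static_cast<int>(A.size()); ++a) {
        std::vector<int> hits;
        for (int cy = cellOf(A[a].miny); cy <= cellOf(A[a].maxy); ++cy)
            for (int cx = cellOf(A[a].minx); cx <= cellOf(A[a].maxx); ++cx) {
                auto it = grid.find(key(cx, cy));
                if (it == grid.end()) continue;
                for (int b : it->second)
                    if (overlaps(A[a], B[b])) hits.push_back(b);
            }
        // A pair can be found in several shared cells; report it once.
        std::sort(hits.begin(), hits.end());
        hits.erase(std::unique(hits.begin(), hits.end()), hits.end());
        for (int b : hits) out.emplace_back(a, b);
    }
    return out;
}
```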

Convolutional network filter always negative

I asked a question last week about a network which I've been building, and I iterated on the suggestions, which led me to find a few problems. I've come back to this project, fixed all the issues and learnt a lot more about CNNs in the process. Now I'm stuck on an issue where all of my weights move to massively negative values, which, coupled with the RELU, ends in the output image always being completely black (making it impossible for the classifier to do its job).
On two labeled images:
These are passed into a two layer network, one classifier (which gets 100% on its own) and a one filter 3*3 convolutional layer.
On the first iteration the output from the conv layer looks like (images in same order as above):
The filter is 3*3*3, due to the images being RGB. The weights are all random numbers between 0.0f-1.0f. On the next iteration the images are completely black, printing the filters shows that they are now in range of -49678.5f (the highest I can see) and -61932.3f.
This issue in turn is due to the gradients being passed back from the Logistic Regression/Linear layer being crazy high for the cross (label 0, prediction 0). For the circle (label 1, prediction 0) the values are between roughly -12 and -5, but for the cross they are all in the positive high 1000 to high 2000 range.
The code which sends these back looks something like (some parts omitted):
void LinearClassifier::Train(float * x,float output, float y)
{
float h = output - y;
float average = 0.0f;
for (int i =1; i < m_NumberOfWeights; ++i)
{
float error = h*x[i-1];
m_pGradients[i-1] = error;
average += error;
}
average /= static_cast<float>(m_NumberOfWeights-1);
for (int theta = 1; theta < m_NumberOfWeights; ++theta)
{
m_pWeights[theta] = m_pWeights[theta] - learningRate*m_pGradients[theta-1];
}
// Bias
m_pWeights[0] -= learningRate*average;
}
This is passed back to the single convolution layer:
// This code is in three nested for loops (for layer,for outWidth, for outHeight)
float gradient = 0.0f;
// ReLu Derivative
if ( m_pOutputBuffer[outputIndex] > 0.0f)
{
gradient = outputGradients[outputIndex];
}
for (int z = 0; z < m_InputDepth; ++z)
{
for ( int u = 0; u < m_FilterSize; ++u)
{
for ( int v = 0; v < m_FilterSize; ++v)
{
int x = outX + u - 1;
int y = outY + v - 1;
int inputIndex = x + y*m_OutputWidth + z*m_OutputWidth*m_OutputHeight;
int kernelIndex = u + v*m_FilterSize + z*m_FilterSize*m_FilterSize;
m_pGradients[inputIndex] += m_Filters[layer][kernelIndex]*gradient;
m_GradientSum[layer][kernelIndex] += input[inputIndex]*gradient;
}
}
}
This code is run by passing in the images one at a time. The gradients are obviously going in the right direction, but how do I stop the huge gradients from throwing off the prediction function?

RELU activations are notorious for doing this. You usually have to use a low learning rate. The reasoning behind this is that when the RELU returns positive numbers it can continue to learn freely, but if a unit gets in a position where the signal coming into it is always negative it can become a "dead" neuron and never activate again.
Also initializing your weights is more delicate with RELU. It appears that you are initializing to range 0-1 which creates a huge bias. Two tips here - Use a range centered around 0, and a range that is much smaller. A normal distribution with mean 0 and std 0.02 usually works well.
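A minimal sketch of that initialisation using the standard library follows. initWeights is a hypothetical helper, not part of the asker's classes, and the fixed seed is only there to make runs repeatable.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draws `count` weights from a normal distribution with mean 0 and the given
// standard deviation (0.02 by default, as suggested above).
std::vector<float> initWeights(std::size_t count, float stddev = 0.02f, unsigned seed = 42u) {
    assert(stddev > 0.0f);
    std::mt19937 rng(seed);
    std::normal_distribution<float> dist(0.0f, stddev);
    std::vector<float> weights(count);
    for (float& w : weights) w = dist(rng);
    return weights;
}
```

For the 3*3*3 filter in the question this would be initWeights(27), giving roughly as many negative weights as positive ones instead of all-positive values in 0-1.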

I fixed it by downscaling the gradients in the CNN layer, but now I'm confused as to why this works or is needed, so if anyone has any intuition about that, it would be great.

Improving C++ algorithm for finding all points within a sphere of radius r

Language/Compiler: C++ (Visual Studio 2013)
Experience: ~2 months
I am working in a rectangular grid in 3D space (size: xdim by ydim by zdim), where "xgrid, ygrid, and zgrid" are 3D arrays of the x, y, and z coordinates, respectively. I am interested in finding all points that lie within a sphere of radius "r" centered about the point "(vi,vj,vk)". I want to store the index locations of these points in the vectors "xidx,yidx,zidx". For a single point this algorithm works and is fast enough, but when I iterate over many points within the 3D space I run into very long run times.
Does anyone have any suggestions on how I can improve the implementation of this algorithm in C++? After running some profiling software I found online (very sleepy, Luke stackwalker) it seems that the "std::vector::size" and "std::vector::operator[]" member functions are bogging down my code. Any help is greatly appreciated.
Note: Since I do not know a priori how many voxels are within the sphere, I set the length of vectors xidx,yidx,zidx to be larger than necessary and then erase all the excess elements at the end of the function.
void find_nv(int vi, int vj, int vk, vector<double> &xidx, vector<double> &yidx, vector<double> &zidx, double*** &xgrid, double*** &ygrid, double*** &zgrid, int r, double xdim,double ydim,double zdim, double pdim)
{
double xcor, ycor, zcor,xval,yval,zval;
vector<double>xyz(3);
xyz[0] = xgrid[vi][vj][vk];
xyz[1] = ygrid[vi][vj][vk];
xyz[2] = zgrid[vi][vj][vk];
int counter = 0;
// Confine loop to be within boundaries of sphere
int istart = vi - r;
int iend = vi + r;
int jstart = vj - r;
int jend = vj + r;
int kstart = vk - r;
int kend = vk + r;
if (istart < 0) {
istart = 0;
}
if (iend > xdim-1) {
iend = xdim-1;
}
if (jstart < 0) {
jstart = 0;
}
if (jend > ydim - 1) {
jend = ydim-1;
}
if (kstart < 0) {
kstart = 0;
}
if (kend > zdim - 1)
kend = zdim - 1;
//-----------------------------------------------------------
// Begin iterating through all points
//-----------------------------------------------------------
for (int k = kstart; k <= kend; ++k)
{
for (int j = jstart; j <= jend; ++j)
{
for (int i = istart; i <= iend; ++i)
{
if (i == vi && j == vj && k == vk)
continue;
else
{
xcor = pow((xgrid[i][j][k] - xyz[0]), 2);
ycor = pow((ygrid[i][j][k] - xyz[1]), 2);
zcor = pow((zgrid[i][j][k] - xyz[2]), 2);
double rsqr = pow(r, 2);
double sphere = xcor + ycor + zcor;
if (sphere <= rsqr)
{
xidx[counter]=i;
yidx[counter]=j;
zidx[counter] = k;
counter = counter + 1;
}
else
{
}
//cout << "counter = " << counter - 1;
}
}
}
}
// erase all appending zeros that are not voxels within sphere
xidx.erase(xidx.begin() + (counter), xidx.end());
yidx.erase(yidx.begin() + (counter), yidx.end());
zidx.erase(zidx.begin() + (counter), zidx.end());
}

You already appear to have used my favourite trick for this sort of thing, getting rid of the relatively expensive square root functions and just working with the squared values of the radius and center-to-point distance.
One other possibility which may speed things up (a) is to replace all the:
xyzzy = pow (plugh, 2)
calls with the simpler:
xyzzy = plugh * plugh
You may find the removal of the function call could speed things up, however marginally.
Another possibility, if you can establish the maximum size of the target array, is to use a real array rather than a vector. I know they make the vector code as insanely optimal as possible, but it still won't match a fixed-size array for performance (since it has to do everything the fixed-size array does plus handle possible expansion).
Again, this may only offer very marginal improvement at the cost of more memory usage but trading space for time is a classic optimisation strategy.
Other than that, ensure you're using the compiler optimisations wisely. The default build in most cases has a low level of optimisation to make debugging easier. Ramp that up for production code.
(a) As with all optimisations, you should measure, not guess! These suggestions are exactly that: suggestions. They may or may not improve the situation, so it's up to you to test them.

One of your biggest problems, and one that is probably preventing the compiler from making a lot of optimisations is that you are not using the regular nature of your grid.
If you are really using a regular grid then
xgrid[i][j][k] = x_0 + i * dxi + j * dxj + k * dxk
ygrid[i][j][k] = y_0 + i * dyi + j * dyj + k * dyk
zgrid[i][j][k] = z_0 + i * dzi + j * dzj + k * dzk
If your grid is axis aligned then
xgrid[i][j][k] = x_0 + i * dxi
ygrid[i][j][k] = y_0 + j * dyj
zgrid[i][j][k] = z_0 + k * dzk
Replacing these inside your core loop should result in significant speedups.
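For the axis-aligned case, a minimal sketch might look like this. Grid, Index3 and findNeighbours are hypothetical names, the spacings dx, dy, dz are assumed to be known, and r is taken in the same physical units as the spacings. The grid origin cancels out of every coordinate difference, so the three coordinate arrays are never read.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Index3 { int i, j, k; };

// Hypothetical description of an axis-aligned regular grid.
struct Grid {
    int nx, ny, nz;      // number of points along each axis
    double dx, dy, dz;   // spacing along each axis
};

// Indices of all grid points within distance r of point (vi, vj, vk),
// excluding the centre point itself.
std::vector<Index3> findNeighbours(const Grid& g, int vi, int vj, int vk, double r) {
    assert(g.dx > 0.0 && g.dy > 0.0 && g.dz > 0.0 && r >= 0.0);
    // Index extent of the sphere's bounding box, clamped to the grid.
    const int ri = static_cast<int>(std::ceil(r / g.dx));
    const int rj = static_cast<int>(std::ceil(r / g.dy));
    const int rk = static_cast<int>(std::ceil(r / g.dz));
    const int i0 = std::max(0, vi - ri), i1 = std::min(g.nx - 1, vi + ri);
    const int j0 = std::max(0, vj - rj), j1 = std::min(g.ny - 1, vj + rj);
    const int k0 = std::max(0, vk - rk), k1 = std::min(g.nz - 1, vk + rk);

    const double r2 = r * r;
    std::vector<Index3> out;
    for (int k = k0; k <= k1; ++k) {
        const double z = (k - vk) * g.dz;
        for (int j = j0; j <= j1; ++j) {
            const double y = (j - vj) * g.dy;
            const double yz2 = y * y + z * z;
            if (yz2 > r2) continue;             // whole row lies outside the sphere
            for (int i = i0; i <= i1; ++i) {
                if (i == vi && j == vj && k == vk) continue;
                const double x = (i - vi) * g.dx;
                if (x * x + yz2 <= r2) out.push_back({i, j, k});
            }
        }
    }
    return out;
}
```

Using push_back on a single vector of index triples also avoids pre-sizing three vectors and erasing the excess afterwards.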

You could do two things. Reduce the number of points you are testing for inclusion and simplify the problem to multiple 2d tests.
If you take the sphere and look at it down the z axis, you have all the points from y+r to y-r in the sphere. Using each of these y values you can slice the sphere into circles that contain all the points in the x/z plane, limited to the circle radius at the specific y you are testing. Calculating the radius of the circle is a simple "solve for the length of the base of a right-angled triangle" problem.
Right now you are testing all the points in a cube, but the upper ranges of the sphere exclude most points. The idea behind the above algorithm is that you can limit the points tested at each level of the sphere to the square containing the circle at that height.
Here is a simple hand-drawn graphic, showing the sphere from the side.
Here we are looking at the slice of the sphere that has the radius ab. Since you know the lengths ac and bc of the right-angled triangle, you can calculate ab using Pythagoras' theorem. Now you have a simple circle that you can test the points in; then move down, reduce the length ac, recalculate ab and repeat.
Once you have that, you can do a little more optimization. Firstly, you do not need to test every point against the circle; you only need to test one quarter of the points. If you test the points in the upper left quadrant of the circle (the slice of the sphere), then the points in the other three quadrants are just mirror images of that same point, offset to the right, to the bottom or diagonally from the point determined to be in the first quadrant.
Finally, you only need to do the circle slices of the top half of the sphere, because the bottom half is just a mirror of the top half. In the end you only test about an eighth of the points for containment in the sphere (a quarter of each slice, for half of the slices). This should be a huge performance boost.
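Here is a minimal sketch of the slicing idea, under the assumption that distances are measured in index units (unit grid spacing). It is a related variant rather than the mirrored-quadrant scheme itself: instead of testing a quadrant and mirroring, it computes the half-width of each row directly with Pythagoras, so no point is tested at all. Offset, isqrtFloor and sphereOffsets are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Offset { int i, j, k; };

// Largest integer m with m * m <= v, for v >= 0.
static int isqrtFloor(int v) {
    int m = static_cast<int>(std::sqrt(static_cast<double>(v)));
    while (m * m > v) --m;                    // guard against rounding up
    while ((m + 1) * (m + 1) <= v) ++m;       // guard against rounding down
    return m;
}

// Index offsets (relative to the centre) of every grid point inside a sphere
// of radius r. The centre itself, offset (0, 0, 0), is included.
std::vector<Offset> sphereOffsets(int r) {
    assert(r >= 0);
    std::vector<Offset> out;
    const int r2 = r * r;
    for (int dk = -r; dk <= r; ++dk) {
        const int slice2 = r2 - dk * dk;                   // squared radius of this circular slice
        const int jmax = isqrtFloor(slice2);
        for (int dj = -jmax; dj <= jmax; ++dj) {
            const int imax = isqrtFloor(slice2 - dj * dj); // half-width of this row
            for (int di = -imax; di <= imax; ++di)
                out.push_back({di, dj, dk});
        }
    }
    return out;
}
```

Because the offsets depend only on r, they can be computed once and reused for every centre (vi,vj,vk) by adding them to the centre index, skipping any result that falls outside the grid.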
I hope that makes sense, I am not at a machine now that I can provide a sample.

A simple option here would be a 3D flood fill from the center of the sphere, rather than iterating over the enclosing cube, as you would need to visit fewer points. Moreover, you should implement the iterative version of the flood fill for better efficiency.
Flood Fill

Optimized float Blur variations

I am looking for optimized functions in C++ for calculating areal averages of floats. The function is passed a source float array, a destination float array (same size as the source array), the array width and height, and the "blurring" area width and height.
The function should "wrap-around" edges for the blurring/averages calculations.
Here is example code that blur with a rectangular shape:
/*****************************************
* Find averages extended variations
*****************************************/
void findaverages_ext(float *floatdata, float *dest_data, int fwidth, int fheight, int scale, int aw, int ah, int weight, int xoff, int yoff)
{
printf("findaverages_ext scale: %d, width: %d, height: %d, weight: %d \n", scale, aw, ah, weight);
float total = 0.0;
int spos = scale * fwidth * fheight;
int apos;
int w = aw;
int h = ah;
float* f_temp = new float[fwidth * fheight];
// Horizontal
for(int y=0;y<fheight ;y++)
{
Sleep(10); // Do not burn your processor
total = 0.0;
// Process entire window for first pixel (including wrap-around edge)
for (int kx = 0; kx <= w; ++kx)
if (kx >= 0 && kx < fwidth)
total += floatdata[y*fwidth + kx];
// Wrap
for (int kx = (fwidth-w); kx < fwidth; ++kx)
if (kx >= 0 && kx < fwidth)
total += floatdata[y*fwidth + kx];
// Store first window
f_temp[y*fwidth] = (total / (w*2+1));
for(int x=1;x<fwidth ;x++) // x width changes with y
{
// Subtract pixel leaving window (wrapping around at the left edge)
if (x-w-1 >= 0)
total -= floatdata[y*fwidth + x-w-1];
else
total -= floatdata[y*fwidth + x-w-1+fwidth];
// Add pixel entering window
if (x+w < fwidth)
total += floatdata[y*fwidth + x+w];
else
total += floatdata[y*fwidth + x+w-fwidth];
// Store average
apos = y * fwidth + x;
f_temp[apos] = (total / (w*2+1));
}
}
// Vertical
for(int x=0;x<fwidth ;x++)
{
Sleep(10); // Do not burn your processor
total = 0.0;
// Process entire window for first pixel
for (int ky = 0; ky <= h; ++ky)
if (ky >= 0 && ky < fheight)
total += f_temp[ky*fwidth + x];
// Wrap
for (int ky = fheight-h; ky < fheight; ++ky)
if (ky >= 0 && ky < fheight)
total += f_temp[ky*fwidth + x];
// Store first if not out of bounds
dest_data[spos + x] = (total / (h*2+1));
for(int y=1;y< fheight ;y++) // y width changes with x
{
// Subtract pixel leaving window (wrapping around at the top edge)
if (y-h-1 >= 0)
total -= f_temp[(y-h-1)*fwidth + x];
else
total -= f_temp[(y-h-1+fheight)*fwidth + x];
// Add pixel entering window
if (y+h < fheight)
total += f_temp[(y+h)*fwidth + x];
else
total += f_temp[(y+h-fheight)*fwidth + x];
// Store average
apos = y * fwidth + x;
dest_data[spos+apos] = (total / (h*2+1));
}
}
delete[] f_temp;
}
What I need is similar functions that, for each pixel, find the average (blur) of pixels over shapes other than rectangular.
The specific shapes are: "S" (sharp edges), "O" (rectangular but hollow), "+" and "X", where the average float is stored at the center pixel in the destination data array. The size of the blur shape (width and height) should be variable.
The functions do not need to be pixel-perfect, only optimized for performance. There could be separate functions for each shape.
I would also be happy if anyone could tell me how to optimize the example function above for rectangular blurring.

What you are trying to implement are various sorts of digital filters for image processing. This is equivalent to convolving two signals, where the second one is the filter's impulse response. So far, you have recognized that a "rectangular average" is separable. By separable I mean that you can split the filter into two parts: one that operates along the X axis and one that operates along the Y axis, each of them a 1D filter. This is nice and can save you lots of cycles. But not every filter is separable. Averaging along other shapes (S, O, +, X) is not separable. You need to actually compute a 2D convolution for these.
As for performance, you can speed up your 1D averages by properly implementing a "moving average". A proper "moving average" implementation only requires a fixed amount of little work per pixel regardless of the averaging "window". This can be done by recognizing that neighbouring pixels of the target image are computed by an average of almost the same pixels. You can reuse these sums for the neighbouring target pixel by adding one new pixel intensity and subtracting an older one (for the 1D case).
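A minimal sketch of such a moving average for one row, with wrap-around at both ends, could look like this. boxBlurWrap1D is a hypothetical name, and the window is assumed to be 2*w+1 samples wide and no wider than the row.

```cpp
#include <cassert>
#include <vector>

// 1-D box average of radius w with wrap-around edges.
// Each output costs one addition and one subtraction, whatever w is.
std::vector<float> boxBlurWrap1D(const std::vector<float>& src, int w) {
    const int n = static_cast<int>(src.size());
    assert(n > 0 && w >= 0 && 2 * w + 1 <= n);
    auto wrap = [n](int i) { return ((i % n) + n) % n; };

    std::vector<float> dst(src.size());
    const double inv = 1.0 / (2 * w + 1);

    // Full window sum for the first sample only.
    double total = 0.0;
    for (int k = -w; k <= w; ++k) total += src[wrap(k)];
    dst[0] = static_cast<float>(total * inv);

    // Slide the window: add the sample entering, subtract the one leaving.
    for (int x = 1; x < n; ++x) {
        total += src[wrap(x + w)];
        total -= src[wrap(x - w - 1)];
        dst[x] = static_cast<float>(total * inv);
    }
    return dst;
}
```

Running this over every row and then over every column of the intermediate result gives the separable rectangular blur.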
In the case of arbitrary non-separable filters, your best bet performance-wise is "fast convolution", which is FFT-based. Check out www.dspguide.com. If I recall correctly, there is even a chapter on how to properly do "fast convolution" using the FFT algorithm. Although they explain it for 1-dimensional signals, it also applies to 2-dimensional signals. For images you have to perform 2D FFT/iFFT transforms.

To add to sellibitze's answer, you can use a summed area table for your O, S and + kernels (not for the X one though). That way you can convolve a pixel in constant time, and it's probably the fastest method to do it for kernel shapes that allow it.
Basically, a SAT is a data structure that lets you calculate the sum of any axis-aligned rectangle. For the O kernel, after you've built a SAT, you'd take the sum of the outer rect's pixels and subtract the sum of the inner rect's pixels. The S and + kernels can be implemented similarly.
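A minimal sketch of a SAT with the "O" kernel built on top of it follows. SummedAreaTable, rectSum and hollowAverage are hypothetical names. For brevity it does not wrap around the edges, so it assumes the kernel lies fully inside the image (padding the image with wrapped copies of its borders would be one way to lift that restriction).

```cpp
#include <cassert>
#include <vector>

struct SummedAreaTable {
    int width, height;
    std::vector<double> sums;  // (width + 1) x (height + 1); first row and column are zero

    SummedAreaTable(const std::vector<float>& img, int w, int h)
        : width(w), height(h), sums((w + 1) * (h + 1), 0.0) {
        assert(w > 0 && h > 0 && static_cast<int>(img.size()) == w * h);
        const int s = w + 1;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                sums[(y + 1) * s + (x + 1)] = img[y * w + x]
                    + sums[y * s + (x + 1)] + sums[(y + 1) * s + x] - sums[y * s + x];
    }

    // Sum of the inclusive rectangle [x0, x1] x [y0, y1] in constant time.
    double rectSum(int x0, int y0, int x1, int y1) const {
        assert(0 <= x0 && x0 <= x1 && x1 < width);
        assert(0 <= y0 && y0 <= y1 && y1 < height);
        const int s = width + 1;
        return sums[(y1 + 1) * s + (x1 + 1)] - sums[y0 * s + (x1 + 1)]
             - sums[(y1 + 1) * s + x0] + sums[y0 * s + x0];
    }

    // "O" kernel: average over the outer square minus the inner square.
    // `outer` and `inner` are radii around the centre pixel (cx, cy).
    double hollowAverage(int cx, int cy, int outer, int inner) const {
        assert(0 <= inner && inner < outer);
        const double outerSum = rectSum(cx - outer, cy - outer, cx + outer, cy + outer);
        const double innerSum = rectSum(cx - inner, cy - inner, cx + inner, cy + inner);
        const int o = 2 * outer + 1, i = 2 * inner + 1;
        return (outerSum - innerSum) / (o * o - i * i);
    }
};
```

A "+" kernel can be put together the same way: the sum of a horizontal bar plus the sum of a vertical bar, minus the overlapping square that was counted twice.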
For the X kernel you can use a different approach. A skewed box filter is separable:
You can convolve with two long, thin skewed box filters, then add the two resulting images together. The center of the X will be counted twice, so you will need to convolve with another skewed box filter and subtract that.
Apart from that, you can optimize your box blur in many ways.
Remove the two ifs from the inner loop by splitting that loop into three loops - two short loops that do checks, and one long loop that doesn't. Or you could pad your array with extra elements from all directions - that way you can simplify your code.
Calculate values like h * 2 + 1 outside the loops.
An expression like f_temp[ky*fwidth + x] does two adds and one multiplication. You can initialize a pointer to &f_temp[ky*fwidth] outside the loop, and just increment that pointer in the loop.
Don't do the division by h * 2 + 1 in the horizontal step. Instead, divide by the square of that in the vertical step.
