Why is my hand-tuned, SSE-enabled code so slow?

Long story short: I'm developing a compute-intensive image processing application in C++. It needs to calculate many variants of image warps on small blocks of pixels extracted from larger images. The program doesn't run as fast as I would like. Profiling (OProfile) showed that the warping/interpolation function consumes more than 70% of the CPU time, so it seemed obvious to try to optimize it.
I was using the OpenCV image processing library for the task until now:
// some parameters for the image warps (position, stretch, skew)
struct WarpParams;

void Image::get(const WarpParams &params)
{
    // fills matrices mapX_ and mapY_ with x and y coordinates of points to be
    // interpolated.
    updateCoordMaps(params);
    // perform interpolation to obtain pixels at point locations
    // P(mapX_[i], mapY_[i]) based on the original data and put the
    // result in pixels_. Use bicubic interpolation.
    cv::remap(image_->data(), pixels_, mapX_, mapY_, CV_INTER_CUBIC);
}
I wrote my own interpolation function and put it in a test harness to ensure correctness while I experiment and to benchmark it in relation to the old one.
My function ran very slow, which was to be expected. Generally, the idea is to (a scalar sketch of these steps follows the list):
Iterate over the mapX_, mapY_ coordinate maps, extract (real-valued) coordinates of the next pixel to be interpolated;
Retrieve a 4x4 neighborhood of pixel values (at integer coordinates) from the original image surrounding the interpolated pixel;
Calculate the coefficients of the convolution kernel for each of these 16 pixels;
Calculate the value of the interpolated pixel as a linear combination of the 16 pixel values and the kernel coefficients.
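For illustration, here is a minimal scalar sketch of these four steps, using the Catmull-Rom kernel coefficients that appear in the code further below (all names are mine):
#include <cmath>
#include <cstddef>

// 4-tap cubic kernel weight for a neighbor at (absolute) distance t.
static float cubicWeight(float t)
{
    t = std::fabs(t);
    if (t <= 1.0f) return 1.5f*t*t*t - 2.5f*t*t + 1.0f;           // inner pair
    if (t <  2.0f) return -0.5f*t*t*t + 2.5f*t*t - 4.0f*t + 2.0f; // outer pair
    return 0.0f;
}

// Interpolate one pixel at real-valued (x, y) from image img of width w.
static float bicubicAt(const float *img, ptrdiff_t w, float x, float y)
{
    float xint, yint;
    const float fx = std::modf(x, &xint), fy = std::modf(y, &yint);
    float sum = 0.0f;
    for (int r = -1; r <= 2; ++r)     // the 4 rows of the 4x4 neighborhood
        for (int c = -1; c <= 2; ++c) // the 4 columns
            sum += img[((ptrdiff_t)yint + r) * w + (ptrdiff_t)xint + c]
                 * cubicWeight(fx - c) * cubicWeight(fy - r);
    return sum;
}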
The old function timed at 25us on my Wolfdale Core2 Duo. The new one took 587us (!). I eagerly put my wizard hat on and started hacking the code. I managed to remove all branches, eliminate some duplicated calculations, and transform 3 nested loops into just one over the coordinate maps. This is what I came up with:
void Image::getConvolve(const WarpParams &params)
{
    __declspec(align(16)) static float kernelX[4], kernelY[4];
    // grab pointers to coordinate map matrices and the original image
    const float
        *const mapX = mapX_.ptr<float>(),
        *const mapY = mapY_.ptr<float>(),
        *const img  = image_->data().ptr<float>();
    // grab pointer to the output image
    float *const subset = pixels_.ptr<float>();
    float x, y, xint, yint;
    const ptrdiff_t imgw = image_->width();
    ptrdiff_t imgoffs;
    __m128 v_px, v_kernX, v_kernY, v_val;
    // iterate over the coordinate matrices as linear buffers
    for (size_t idx = 0; idx < pxCount; ++idx)
    {
        // retrieve coordinates of next pixel from precalculated maps,
        // break up each into fractional and integer part
        x = modf(mapX[idx], &xint);
        y = modf(mapY[idx], &yint);
        // obtain offset of the top left pixel from the required 4x4
        // neighborhood of the current pixel in the image's
        // buffer (sadly, the position will be unaligned)
        imgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1;
        // calculate all 4 convolution kernel values for every row and
        // every column
        tap4Kernel(x, kernelX);
        tap4Kernel(y, kernelY);
        // load the kernel values for the columns, these don't change
        v_kernX = _mm_load_ps(kernelX);
        // process a row of the 4x4 neighborhood
        // get set of convolution kernel values for the current row
        v_kernY = _mm_set_ps1(kernelY[0]);
        v_px = _mm_loadu_ps(img + imgoffs); // load the pixel values
        // calculate the linear combination of the pixels with kernelX
        v_px = _mm_mul_ps(v_px, v_kernX);
        v_px = _mm_mul_ps(v_px, v_kernY); // and kernel Y
        v_val = v_px; // add result to the final value
        imgoffs += imgw;
        // offset points now to next row of the 4x4 neighborhood
        v_kernY = _mm_set_ps1(kernelY[1]);
        v_px = _mm_loadu_ps(img + imgoffs);
        v_px = _mm_mul_ps(v_px, v_kernX);
        v_px = _mm_mul_ps(v_px, v_kernY);
        v_val = _mm_add_ps(v_val, v_px);
        imgoffs += imgw;
        /* ... same for kernelY[2] and kernelY[3] ... */
        // store resulting interpolated pixel value in the subset's
        // pixel matrix
        subset[idx] = horizSum(v_val);
    }
}
// Calculate all 4 values of the 4-tap convolution kernel for 4 neighbors
// of a pixel and store them in an array. Ugly but fast.
// The "arg" parameter is the fractional part of a pixel's coordinate, i.e.
// a number in the range <0,1)
void Image::tap4Kernel(const float arg, float *out)
{
    // chaining intrinsics was slower, so this is done in separate steps
    // load the argument into 4 cells of a XMM register
    __m128
        v_arg = _mm_set_ps1(arg),
        v_coeff = _mm_set_ps(2.0f, 1.0f, 0.0f, -1.0f);
    // subtract vector of [-1, 0, 1, 2] to obtain coordinates of 4 neighbors
    // for kernel calculation
    v_arg = _mm_sub_ps(v_arg, v_coeff);
    // clear sign bits, this is equivalent to fabs() on all 4
    v_coeff = _mm_set_ps1(-0.f);
    v_arg = _mm_andnot_ps(v_coeff, v_arg);
    // calculate values of abs(argument)^3 and ^2
    __m128
        v_arg2 = _mm_mul_ps(v_arg, v_arg),
        v_arg3 = _mm_mul_ps(v_arg2, v_arg),
        v_val, v_temp;
    // calculate the 4 kernel values as
    // arg^3 * A + arg^2 * B + arg * C + D, using
    // (A,B,C,D) = (-0.5, 2.5, -4, 2) for the outside pixels and
    // (1.5, -2.5, 0, 1) for inside
    v_coeff = _mm_set_ps(-0.5f, 1.5f, 1.5f, -0.5f);
    v_val = _mm_mul_ps(v_coeff, v_arg3);
    v_coeff = _mm_set_ps(2.5f, -2.5f, -2.5f, 2.5f);
    v_temp = _mm_mul_ps(v_coeff, v_arg2);
    v_val = _mm_add_ps(v_val, v_temp);
    v_coeff = _mm_set_ps(-4.0f, 0.0f, 0.0f, -4.0f);
    v_temp = _mm_mul_ps(v_coeff, v_arg);
    v_val = _mm_add_ps(v_val, v_temp);
    v_coeff = _mm_set_ps(2.0f, 1.0f, 1.0f, 2.0f);
    v_val = _mm_add_ps(v_val, v_coeff);
    _mm_store_ps(out, v_val);
}
I was pleased to have managed to get the run time down to below 40us, even before introducing SSE to the main loop, which I saved for last. I was expecting at least a 3-fold speedup, but it ran only barely faster, at 36us, slower than the old get() which I was trying to improve upon. Even worse, when I changed the benchmark loop to do more runs, the old function had the same mean run time, while mine stretched to over 127us, meaning it takes even longer for some extreme warp parameter values (which makes sense, because more warping means I need to reach for widely dispersed pixel values from the original image to calculate the result).
I figured the reason must be the unaligned loads, but those can't be helped (I need to reach for unpredictable pixel values). I couldn't see anything more I could do in the optimizing department, so I decided to look at the cv::remap() function to see how they do it. Imagine my surprise at finding it to contain a mess of nested loops and plenty of branches. They also do a lot of argument verification which I didn't need to bother with. As far as I can tell (there are no comments in the code), SSE (with unaligned loads as well!) is only used for extracting the values from the coordinate maps and rounding them into integers; then a function is called that does the actual interpolation with regular float arithmetic.
My question is, why did I fail so miserably (why is my code so slow and why is theirs faster even though it looks like a mess) and what can I do to improve my code?
I'm not pasting the OpenCV code here because this is already too long, you can check it out at pastebin.
I tested and compiled my code in Release mode under VC++2010. The OpenCV used was a precompiled binary bundle of v2.3.1.
EDIT: The pixel values are floats in the range 0..1. Profiling showed the tap4Kernel() function wasn't relevant, most time is spent inside getConvolve().
EDIT2: I pasted the disassembly of the generated code to pastebin. This is compiled on an old Banias Celeron processor (has SSE2), but looks more or less the same.
EDIT3: After reading What Every Programmer Should Know About Memory I realized I was incorrectly assuming the OpenCV function implements more or less the same algorithm as I did, which must not be the case. For every pixel I interpolate, I need to retrieve its 4x4 neighborhood, whose pixels are non-sequentially placed inside the image buffer. I am misusing the CPU caches, and OpenCV probably doesn't. VTune profiling would seem to agree as my function has 5,800,000 memory accesses, while OpenCV does only 400,000. Their function is a mess and could probably be further optimized but it still manages to have an edge over me, probably due to some smarter approach to memory and cache usage.
UPDATE: I managed to improve the way pixel values are loaded into XMM registers. I allocate a buffer in the object which holds 16-element cells for every pixel of the image. At image load, I fill this cell buffer with pre-arranged sequences of 4x4-neighborhoods for every pixel. Not very space-efficient (image takes 16x the space), but that way, the loads are always aligned (no more _mm_loadu_ps()), and I am avoiding having to do scattered reads of pixels from the image buffer, since the required pixels are stored sequentially. To my surprise, there was hardly any improvement at all. I heard unaligned loads could be 10x slower, but clearly this is not the problem here. But by commenting out parts of the code, I found out that the modf() calls are responsible for 75% of the runtime! I'll focus on eliminating those and post an answer.

First a few observations.
You use function static variables, which can incur synchronization overhead (though I don't think it does here).
The assembly mixes x87 and sse code.
tap4Kernel() is inlined, which is good, but the profile may be inaccurate as a result.
modf is not inlined.
The assembly uses _ftol2_sse for the float-to-int conversions (are there faster options?).
The assembly moves the registers around a lot.
Try doing the following:
Ensure that your compiler is optimizing aggressively for an SSE2 based architecture.
Make modf available for inlining.
Avoid using function static variables.
If the assembly still uses x87 instructions, try avoiding the float-to-int cast in imgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1; and making the float variables __m128 (a sketch of one branch-free alternative follows this list).
It can possibly be optimized further by prefetching the maps (prefetch about 4 KB ahead).
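To make the modf()/float-to-int suggestions concrete, here is a minimal sketch (my naming; it assumes non-negative coordinates, which pixel positions are) of a branch-free split into integer and fractional parts using only SSE2 conversions, with no library call and no x87 code:
#include <emmintrin.h> // SSE2

// Truncate 4 packed floats toward zero and derive the fractional parts.
static inline void splitIntFrac(const float *coords, int *ipart, float *fpart)
{
    __m128  v    = _mm_loadu_ps(coords);       // 4 coordinates
    __m128i vi   = _mm_cvttps_epi32(v);        // truncated integer parts
    __m128  vint = _mm_cvtepi32_ps(vi);        // converted back to float
    _mm_storeu_si128((__m128i*)ipart, vi);     // store the 4 integer parts
    _mm_storeu_ps(fpart, _mm_sub_ps(v, vint)); // fractional parts = v - trunc(v)
}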

Related

OpenGL Terrain System, small height difference between GPU and CPU

A quick summary:
I have a simple quadtree-based terrain rendering system that builds terrain patches which then sample a heightmap in the vertex shader to determine the height of each vertex.
The exact same calculation is done on the CPU for object placement and the like.
Super straightforward, but after adding some systems to procedurally place objects, I've discovered that they seem to be misplaced by just a small amount. To debug this I render a few crosses as single models over the terrain. The crosses (red, green, blue lines) represent the height read from the CPU, while the terrain mesh uses a shader to translate the vertices.
(I've also added a simple odd/even gap over each height value to rule out a simple offset issue. So those ugly cliffs are expected, the submerged crosses are the issue)
I'm explicitly using GL_NEAREST to be able to display the "raw" height value.
As you can see the crosses are sometimes submerged under the terrain instead of representing its exact height.
The heightmap is just a simple array of floats on the CPU and on the GPU.
How the data is stored
A simple vector<float> which is uploaded into a GL_RGB32F GL_FLOAT buffer. The floats are not normalized and my terrain usually contains values between -100 and 500.
How is the data accessed in the shader
I've tried a few things to rule out errors, the initial version:
vec2 terrain_heightmap_uv(vec2 position, Heightmap heightmap)
{
    return (position + heightmap.world_offset) / heightmap.size;
}
float terrain_read_height(vec2 position, Heightmap heightmap)
{
    return textureLod(heightmap.heightmap, terrain_heightmap_uv(position, heightmap), 0).r;
}
Basics of the vertex shader (the full shader code is very long, so I've extracted the part that actually reads the height):
void main()
{
    vec4 world_position = a_model * vec4(a_position, 1.0);
    vec4 final_position = world_position;
    // snap vertex to grid
    final_position.x = floor(world_position.x / a_quad_grid) * a_quad_grid;
    final_position.z = floor(world_position.z / a_quad_grid) * a_quad_grid;
    final_position.y = terrain_read_height(final_position.xz, heightmap);
    gl_Position = projection * view * final_position;
}
To account for the slightly different way the position is determined, I also tested it using hardcoded values that are identical to how the C++ side reads the height:
return texelFetch(heightmap.heightmap, ivec2((position / 8) + vec2(1024, 1024)), 0).r;
Which gives the exact same result...
How is the data accessed in the application
In C++ the height is read like this:
inline float get_local_height_safe(uint32_t x, uint32_t y)
{
    // this macro simply clips x and y to the heightmap bounds;
    // it does not interfere with the result
    BB_TERRAIN_HEIGHTMAP_BOUND_XY_TO_SAFE;
    uint32_t i = (y * _size1d) + x;
    return buffer->data[i];
}
inline float get_height_raw(glm::vec2 position)
{
    position = position + world_offset;
    uint32_t x = static_cast<int>(position.x);
    uint32_t y = static_cast<int>(position.y);
    return get_local_height_safe(x, y);
}
float BB::Terrain::get_height(const glm::vec3 position)
{
    return heightmap->get_height_raw({position.x / heightmap_unit_scale, position.z / heightmap_unit_scale});
}
What have I tried:
Comparing the Buffers
I've dumped the first few hundred values from the vector and compared them with the floating-point buffer uploaded to the GPU using Nvidia Nsight; they are equal, so no rounding/precision errors there.
Sampling method
I've tried texture, textureLod and texelFetch to rule out some issue there, they all give me the same result.
Rounding
The super strange thing: when I round all the height values, they are perfectly aligned, which just screams floating-point precision issues.
Position snapping
I've tried rounding, flooring and ceiling the position, to ensure the position always maps to the same texel. I also tried adding an epsilon offset to rule out a positional precision error (probably stupid because the terrain is stable...)
Heightmap sizes
I've tried various heightmaps, also of different sizes.
Heightmap patterns
I've created a heightmap containing a pattern to ensure the position is not just offset.

What are the fastest algorithms for rendering the mandelbrot set?

I've tried many algorithms for rendering the Mandelbrot set, including the naive escape-time algorithm as well as the optimized escape-time algorithm. But are there faster algorithms, like the ones used to produce the really deep zooms we see on YouTube? Also, I would love to get some ideas on how to increase my precision beyond the C/C++ double.
Even a high-end CPU will be much slower than an average GPU. You can get to real-time rendering even with the naive iteration algorithm on a GPU, so using better algorithms on a GPU could get to very high zooms. However, for any decent algorithm you need:
multi-pass rendering, as we cannot self-modify a texture on the GPU;
high-precision floating point, as floats/doubles are not enough.
Here are a few related QAs:
GLSL RT Mandelbrot
Interior Distance Estimate algorithm for the Mandelbrot set
I get infinitely small numbers for fractals
Perturbation theory
which might get you kick-started...
One way to speed things up is to use fractional escape, as I did in the first link. It improves image quality while keeping the max iteration count low.
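For reference, a minimal sketch of that fractional (smooth) escape value using the usual log-log formula (the naming is mine; zz is |Z|^2 at the moment of bailout):
#include <cmath>

// Smooth (fractional) iteration count: i + 1 - log2(log2(|Z|)).
// Passing zz = |Z|^2 avoids a square root, since log2(|Z|) = 0.5*log2(zz).
double smoothIteration(int i, double zz)
{
    return (double)i + 1.0 - std::log2(0.5 * std::log2(zz));
}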
The second link will get you an approximation of which parts of the fractal are inside or outside, and how far; it's not very accurate, but it can be used to avoid computing iterations for parts that are "outside for sure".
The next link will show you how to achieve better precision.
The last link is about perturbation. The idea is that you use high-precision math only for some reference points, and use those to compute their neighboring points with low-precision math without losing precision. I have never used it, but it looks promising.
And finally once you achieved fast rendering you might want to aim for this:
How to adjust panning while zooming Mandelbrot set
Here is a small example of 3 x 64-bit doubles used for a single value in GLSL:
// high precision float (very slow)
dvec3 fnor(dvec3 a)
{
    dvec3 c=a;
    if (abs(c.x)>1e-5){ c.y+=c.x; c.x=0.0; }
    if (abs(c.y)>1e+5){ c.z+=c.y; c.y=0.0; }
    return c;
}
double fget(dvec3 a){ return a.x+a.y+a.z; }
dvec3 fset(double a){ return fnor(dvec3(a,0.0,0.0)); }
dvec3 fadd(dvec3 a,double b){ return fnor(a+fset(b)); }
dvec3 fsub(dvec3 a,double b){ return fnor(a-fset(b)); }
dvec3 fmul(dvec3 a,double b){ return fnor(a*b); }
dvec3 fadd(dvec3 a,dvec3 b){ return fnor(a+b); }
dvec3 fsub(dvec3 a,dvec3 b){ return fnor(a-b); }
dvec3 fmul(dvec3 a,dvec3 b)
{
    dvec3 c;
    c =fnor(a*b.x);
    c+=fnor(a*b.y);
    c+=fnor(a*b.z);
    return fnor(c);
}
So each high-precision value is a dvec3... The thresholds in fnor can be changed to any ranges. You can convert this to vec3 and float as well...
[Edit1] "fast" C++ example
OK, I wanted to try my new SSD1306 driver along with my AVR32 MCU to compute the Mandelbrot set, so I could compare speed with this Arduino + 3D + Pong + Mandelbrot. I used an AT32UC3A3256 at ~66MHz with no FPU, no GPU, and a 128x64x1bpp display; no external memory, only the internal 16+32+32 KByte. The naive Mandelbrot was way too slow (~2.5sec per frame), so I put together something like this (taking advantage of the fact that the position and zoom of the view are more or less continuous):
reduce the resolution by 2
to make room for dithering, as my output is just B&W
use a variable max iteration count n based on zoom
On a change of n, invalidate the last frame to enforce a full recompute. I know this is slow, but it happens only 3 times, on the transitions between zoom ranges.
Scaling the counts from the last frame does not look good, as the scaling is not linear.
It would be possible to reuse the last counts, but that would also require remembering the complex variables used for the iteration, and that would take too much memory.
remember the last frame, and also which x,y screen coordinate mapped to which Mandelbrot coordinate
On each frame, compute the mapping between screen coordinates and Mandelbrot coordinates.
remap the last frame to adjust to the new position and zoom
Simply look at the data from #3 and #4, and where the same positions are present in both the last and the current frame (closer than half a pixel), copy the pixels; recompute the rest.
This will hugely improve performance if your view is smooth (so position and zoom do not change much from frame to frame).
I know this is a bit of a vague description, so here is the C++ code, from which you can resolve any doubts:
//---------------------------------------------------------------------------
//--- Fast Mandelbrot set ver: 1.000 ----------------------------------------
//---------------------------------------------------------------------------
template<int xs,int ys,int sh> void mandelbrot_draw(float mx,float my,float zoom)
{
    // xs,ys - screen resolution
    // sh    - log2(pixel_size) ... dithering pixel size
    // mx,my - Mandelbrot position (center of view) <-1.5,+0.5>,<-1.0,+1.0>
    // zoom  - zoom
    // ----------------
    // (previous/actual) frame
    static U8 p[xs>>sh][ys>>sh];           // intensities (raw Mandelbrot image)
    static int n0=0;                       // max iterations
    static float px[(xs>>sh)+1]={-1000.0}; // pixel x position in Mandelbrot
    static float py[(ys>>sh)+1];           // pixel y position in Mandelbrot
    // temp variables
    U8 shd;                                // just pattern for dithering
    int ix,iy,i,n,jx,jy,kx,ky,sz;          // index variables
    int nx=xs>>sh,ny=ys>>sh;               // real Mandelbrot resolution
    float fx,fy,fd;                        // floating Mandelbrot position and pixel step
    float x,y,xx,yy,q;                     // Mandelbrot iteration stuff (this needs to be high precision)
    int qx[xs>>sh],qy[ys>>sh];             // mapping of pixels between last and actual frame
    float px0[xs>>sh],py0[ys>>sh];         // pixel position in Mandelbrot from last frame
    // init vars
    if (zoom<  10.0) n= 31;
    else if (zoom< 100.0) n= 63;
    else if (zoom<1000.0) n=127;
    else n=255;
    sz=1<<sh;
    ix=xs; if (ix>ys) ix=ys; ix/=sz;
    fd=2.0/(float(ix-1)*zoom);
    mx-=float(xs>>(1+sh))*fd;
    my-=float(ys>>(1+sh))*fd;
    // init buffers
    if ((px[0]<-999.0)||(n0!=n))
    {
        n0=n;
        for (ix=0;ix<nx;ix++) px[ix]=-999.0;
        for (iy=0;iy<ny;iy++) py[iy]=-999.0;
        for (ix=0;ix<nx;ix++)
            for (iy=0;iy<ny;iy++)
                p[ix][iy]=0;
    }
    // store old and compute new float positions of pixels in Mandelbrot to px[],py[],px0[],py0[]
    for (fx=mx,ix=0;ix<nx;ix++,fx+=fd){ px0[ix]=px[ix]; px[ix]=fx; qx[ix]=-1; }
    for (fy=my,iy=0;iy<ny;iy++,fy+=fd){ py0[iy]=py[iy]; py[iy]=fy; qy[iy]=-1; }
    // match old and new x coordinates to qx[]
    for (ix=0,jx=0;(ix<nx)&&(jx<nx);)
    {
        x=px[ix]; y=px0[jx];
        xx=(x-y)/fd; if (xx<0.0) xx=-xx;
        if (xx<=0.5){ qx[ix]=jx; px[ix]=y; }
        if (x<y) ix++; else jx++;
    }
    // match old and new y coordinates to qy[]
    for (ix=0,jx=0;(ix<ny)&&(jx<ny);)
    {
        x=py[ix]; y=py0[jx];
        xx=(x-y)/fd; if (xx<0.0) xx=-xx;
        if (xx<=0.5){ qy[ix]=jx; py[ix]=y; }
        if (x<y) ix++; else jx++;
    }
    // remap p[][] by qx[]
    for (ix=0,jx=nx-1;ix<nx;ix++,jx--)
    {
        i=qx[ix]; if ((i>=0)&&(i>=ix)) for (iy=0;iy<ny;iy++) p[ix][iy]=p[i][iy];
        i=qx[jx]; if ((i>=0)&&(i<=jx)) for (iy=0;iy<ny;iy++) p[jx][iy]=p[i][iy];
    }
    // remap p[][] by qy[]
    for (iy=0,jy=ny-1;iy<ny;iy++,jy--)
    {
        i=qy[iy]; if ((i>=0)&&(i>=iy)) for (ix=0;ix<nx;ix++) p[ix][iy]=p[ix][i];
        i=qy[jy]; if ((i>=0)&&(i<=jy)) for (ix=0;ix<nx;ix++) p[ix][jy]=p[ix][i];
    }
    // Mandelbrot
    for (iy=0,ky=0,fy=py[iy];iy<ny;iy++,ky+=sz,fy=py[iy]) if ((fy>=-1.0)&&(fy<=+1.0))
     for (ix=0,kx=0,fx=px[ix];ix<nx;ix++,kx+=sz,fx=px[ix]) if ((fx>=-1.5)&&(fx<=+0.5))
     {
        // invalid qx,qy ... recompute Mandelbrot
        if ((qx[ix]<0)||(qy[iy]<0))
        {
            for (x=0.0,y=0.0,xx=0.0,yy=0.0,i=0;(i<n)&&(xx+yy<4.0);i++)
            {
                q=xx-yy+fx;
                y=(2.0*x*y)+fy;
                x=q;
                xx=x*x;
                yy=y*y;
            }
            i=(16*i)/(n-1); if (i>16) i=16; if (i<0) i=0;
            i=16-i; p[ix][iy]=i;
        }
        // use stored intensity
        else i=p[ix][iy];
        // render point with intensity i corresponding to ix,iy position in map
        for (i<<=3,jy=0;jy<sz;jy++)
            for (shd=shade8x8[i+(jy&7)],jx=0;jx<sz;jx++)
                lcd.pixel(kx+jx,ky+jy,shd&(1<<(jx&7)));
     }
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
The lcd and shade8x8 stuff can be found in the linked SSD1306 QA. However, you can ignore it; it's just dithering and pixel output, so you can instead output the i value directly (even without the scaling to <0..16>).
Here is a preview (on the PC, as I was too lazy to connect a camera...):
So that is 64x32 Mandelbrot pixels displayed as a 128x64 dithered image. On my AVR32 this is maybe 8x faster than the naive method (maybe 3-4 fps). The code might be optimized further; however, keep in mind the Mandelbrot set is not the only thing running, as I have some ISR handlers in the background to handle the LCD, and also my TTS engine based on this, which I have upgraded a lot since then and use for debugging (yes, it can speak in parallel with the rendering). Also, I am low on memory, as my 3D engine takes a lot, ~11 KByte, away (mostly the depth buffer).
The preview was done with this code (inside a timer):
static float zoom=1.0;
mandelbrot_draw<128,64,1>(+0.37,-0.1,zoom);
zoom*=1.02; if (zoom>100000) zoom=1.0;
Also, for a non-AVR32 C++ environment, use this:
//------------------------------------------------------------------------------------------
#ifndef _AVR32_compiler_h
#define _AVR32_compiler_h
//------------------------------------------------------------------------------------------
#include <stdint.h>
typedef int32_t S32;
typedef int16_t S16;
typedef int8_t S8;
typedef uint32_t U32;
typedef uint16_t U16;
typedef uint8_t U8;
//------------------------------------------------------------------------------------------
#endif
//------------------------------------------------------------------------------------------
[Edit2] higher float precision in GLSL
The main problem with Mandelbrot is that it needs to add numbers with a very big exponent difference. For +,- operations we need to align the mantissas of both operands, add them as integers, and normalize back to scientific notation. However, if the exponent difference is big, then the result's mantissa needs more bits than fit into a 32-bit float, so only the 24 most significant bits are preserved. This creates the rounding errors causing your pixelation. If you look at a 32-bit float in binary you will see this:
float a=1000.000,b=0.00001,c=a+b;
//012345678901234567890123456789 ... just to easily count bits
a=1111101000b // a=1000
b= 0.00000000000000001010011111000101101011b // b=0.00000999999974737875
c=1111101000.00000000000000001010011111000101101011b // not rounded result
c=1111101000.00000000000000b // c=1000 rounded to 24 bits of mantissa
Now the idea is to enlarge the number of mantissa bits. The easiest trick is to have 2 floats instead of one:
//012345678901234567890123456789 ... just to easily count bits
a=1111101000b                                         // a=1000
b=          0.00000000000000001010011111000101101011b // b=0.00000999999974737875
c=1111101000.00000000000000b                          // c=1000, the high 24 bits of the result
 +          .00000000000000001010011111000101101011b  // the low bits, stored in the second float
So some part of the result is in one float and the rest in the other... The more floats per single value we have, the bigger the mantissa. However, doing this as a bit-exact division of a big mantissa into 24-bit chunks would be complicated and slow in GLSL (if even possible, due to GLSL limitations). Instead, we can select some range of exponents for each of the floats (just like in the example above).
So in the example we got 3 floats (vec3) per single (float) value. Each of the coordinates represents a different range:
abs(x) <= 1e-5
1e-5 < abs(y) <= 1e+5
1e+5 < abs(z)
and value = (x+y+z), so we kind of have a 3*24-bit mantissa; however, the ranges do not exactly match 24 bits. For that, the exponent range should be divided by:
log10(2^24)=7.2247198959355486851297334733878
instead of by 10... for example, something like this:
abs(x) <= 1e-7
1e-7 < abs(y) <= 1e+0
1e+0 < abs(z)
Also, the ranges must be selected so they cover the ranges of values you actually use, otherwise it would all be for nothing. So if your numbers are < 4, it's pointless to have a range > 10^+5. So first you need to see what bounds of values you have, then dissect them into exponent ranges (as many as you have floats per value).
Beware: some rounding still occurs (but much less than with a native float)!!!
Now, doing operations on such numbers is slightly more complicated than on normal floats, as you need to handle each value as a bracketed sum of all components, so:
(x0+y0+z0) + (x1+y1+z1) = (x0+x1 + y0+y1 + z0+z1)
(x0+y0+z0) - (x1+y1+z1) = (x0-x1 + y0-y1 + z0-z1)
(x0+y0+z0) * (x1+y1+z1) = x0*(x1+y1+z1) + y0*(x1+y1+z1) + z0*(x1+y1+z1)
And do not forget to normalize the values back into the defined ranges. Avoid adding small and big values (in absolute terms), so avoid x0+z0 etc.
[Edit3] new win32 demo CPU vs. GPU
win32 Mandelbrot demo 64bit floats
Both executables are preset to the same location and zoom, to show where the doubles start to round off. I had to slightly upgrade the way the px,py coordinates are computed, as around 10^9 the y axis started to deviate at this location (the threshold might still be too big for other locations).
Here is a preview of CPU vs. GPU at high zoom (n=1600):
RT GIF capture of the CPU version (n=62++, GIF scaled down 4x):
The optimized escape algorithm should be fast enough to draw the Mandelbrot set in real time. You can use multiple threads to make your implementation faster (this is very easy using OpenMP, for example). You can also manually vectorize your code using SIMD instructions to make it even faster if needed. You could even run this directly on the GPU using shaders and/or GPU computing frameworks (OpenCL or CUDA) if this is still not fast enough for you (although this is a bit complex to do efficiently). Finally, you should tune the number of iterations so it is rather small.
Zooming should not have any direct impact on the performance. It just changes the input window of the computation. However, it does have an indirect impact since the actual number of iterations will change. Points outside the window should not be computed.
Double precision should also be enough for drawing the Mandelbrot set correctly. But if you really want more precise calculations, you can use double-double precision, which gives quite good precision at not-too-bad performance. However, implementing double-double precision manually is a bit tricky, and it is still significantly slower than using plain double precision.
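For reference, the core building block of double-double arithmetic is the error-free two-sum step; a minimal sketch (Knuth's standard algorithm; the naming is mine):
#include <utility>

// Error-free addition: returns s = a + b rounded, plus the exact rounding
// error e, so that a + b == s + e holds exactly in IEEE arithmetic.
static inline std::pair<double, double> twoSum(double a, double b)
{
    double s  = a + b;
    double bv = s - a;                     // the part of b absorbed into s
    double e  = (a - (s - bv)) + (b - bv); // what was lost to rounding
    return std::make_pair(s, e);
}

A double-double value keeps (s, e) as its high and low words; chaining this step through +, -, and * is what makes the technique tricky to implement by hand.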
My fastest solutions avoid iterating over large areas of the same depth by following a contour boundary and filling. There is a penalty in that it is possible to nip off small buds instead of going around them, but all in all that's a small price to pay for a quick zoom.
One possible efficiency gain: if a zoom doubles the scale, you already have ¼ of the points.
For animation, I file each frame's values, doubling the scale each time, and interpolate the in-between frames on playback in real time, so the animation doubles once per second. The double type allows more than 50 key frames to be stored, giving an animation that lasts more than a minute (in and then back out).
The actual iteration is done by hand-crafted assembler, so one pixel is iterated entirely in the FPU.

Fast, good quality pixel interpolation for extreme image downscaling

In my program, I am downscaling an image of 500px or larger to an extreme level of approx 16px-32px. The source image is user-specified so I do not have control over its size. As you can imagine, few pixel interpolations hold up and inevitably the result is heavily aliased.
I've tried bilinear, bicubic and square average sampling. The square average sampling actually provides the most decent results but the smaller it gets, the larger the sampling radius has to be. As a result, it gets quite slow - slower than the other interpolation methods.
I have also tried an adaptive square average sampling so that the smaller it gets the greater the sampling radius, while the closer it is to its original size, the smaller the sampling radius. However, it produces problems and I am not convinced this is the best approach.
So the question is: What is the recommended type of pixel interpolation that is fast and works well on such extreme levels of downscaling?
I do not wish to use a library so I will need something that I can code by hand and isn't too complex. I am working in C++ with VS 2012.
Here's some example code I've tried as requested (hopefully without errors from my pseudo-code cut and paste). This performs a 7x7 average downscale and although it's a better result than bilinear or bicubic interpolation, it also takes quite a hit:
// Sizing control
ctl(0): "Resize",Range=(0,800),Val=100
// Variables
float fracx,fracy;
int Xnew,Ynew,p,q,Calc;
int x,y,z,p1,q1,i,j;
// New image dimensions
Xnew=image->width*ctl(0)/100;
Ynew=image->height*ctl(0)/100;
for (y=0; y<image->height; y++){ // rows
    for (x=0; x<image->width; x++){ // columns
        p1=(int)x*image->width/Xnew;
        q1=(int)y*image->height/Ynew;
        for (z=0; z<3; z++){ // channels
            Calc=0; // reset the accumulator for each channel
            for (i=-3;i<=3;i++) {
                for (j=-3;j<=3;j++) {
                    Calc += (int)(src(p1-i,q1-j,z));
                } //j
            } //i
            Calc /= 49;
            pset(x, y, z, Calc);
        } // channels
    } // columns
} // rows
Thanks!
The first point is to use pointers to your data. Never use indexed access at every pixel. When you write src(p1-i,q1-j,z) or pset(x, y, z, Calc), how much computation is being done? Use pointers to the data and manipulate those.
Second: your algorithm is wrong. You don't want an average filter; you want to make a grid over your source image and, for every grid cell, compute the average and put it in the corresponding pixel of the output image.
The specific solution should be tailored to your data representation, but it could be something like this:
#include <algorithm>  // std::transform
#include <cstring>    // memset
#include <functional> // std::divides
#include <vector>

std::vector<uint32_t> accum(Xnew);
std::vector<uint32_t> count(Xnew);
uint32_t *paccum, *pcount;
uint8_t* pin = /*pointer to input data*/;
uint8_t* pout = /*pointer to output data*/;
for (int dr = 0, sr = 0, w = image->width, h = image->height; sr < h; ++dr) {
    memset(paccum = accum.data(), 0, Xnew*4);
    memset(pcount = count.data(), 0, Xnew*4);
    while (sr * Ynew / h == dr) {
        paccum = accum.data();
        pcount = count.data();
        for (int dc = 0, sc = 0; sc < w; ++sc) {
            *paccum += *pin; // accumulate the source pixel into the current cell
            *pcount += 1;
            ++pin;
            if (sc * Xnew / w > dc) {
                ++dc;
                ++paccum;
                ++pcount;
            }
        }
        sr++;
    }
    std::transform(begin(accum), end(accum), begin(count), pout, std::divides<uint32_t>());
    pout += Xnew;
}
This was written using my own library (still in development) and it seems to work, but I later changed the variable names in order to make it simpler here, so I don't guarantee anything!
The idea is to have a local buffer of 32-bit ints which can hold the partial sums of all pixels in the source rows which fall into one row of the output image. Then you divide by the cell counts and save the output to the final image.
The first thing you should do is to set up a performance evaluation system to measure how much any change impacts on the performance.
As said previously, you should not use indexes but pointers, for a (probably) substantial speed-up, and you should not simply average, as a basic averaging of pixels is basically a blur filter.
I would highly advise you to rework your code to use "kernels". A kernel is the matrix representing the weight of each pixel used. That way, you will be able to test different strategies and optimize quality.
Example of kernels:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
Upsampling/downsampling kernel:
http://www.johncostella.com/magic/
Note: from the code it seems you actually apply a 7x7 box kernel. The equivalent 3x3 box kernel would be:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]
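For illustration, here is a minimal sketch of applying such a normalized n x n kernel to one pixel of a single-channel image (border handling omitted; all names are mine):
#include <vector>

// Weighted sum of the n x n neighborhood around (x, y); the kernel's
// weights are assumed to sum to 1 (e.g. the 1/9 box kernel above).
float applyKernel(const std::vector<float> &img, int w, int x, int y,
                  const std::vector<float> &kernel, int n)
{
    const int r = n / 2;
    float sum = 0.0f;
    for (int j = -r; j <= r; ++j)
        for (int i = -r; i <= r; ++i)
            sum += img[(y + j) * w + (x + i)] * kernel[(j + r) * n + (i + r)];
    return sum;
}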

C++ plotting Mandlebrot set, bad performance

I'm not sure if there is an actual performance increase to achieve, or if my computer is just old and slow, but I'll ask anyway.
So I've tried making a program to plot the Mandelbrot set using the cairo library.
The loop that draws the pixels looks as follows:
vector<point_t*>::iterator it;
for(unsigned int i = 0; i < iterations; i++){
    it = points->begin();
    //cout << points->size() << endl;
    double r,g,b;
    r = (double)i+1 / (double)iterations;
    g = 0;
    b = 0;
    while(it != points->end()){
        point_t *p = *it;
        p->Z = (p->Z * p->Z) + p->C;
        if(abs(p->Z) > 2.0){
            cairo_set_source_rgba(cr, r, g, b, 1);
            cairo_rectangle (cr, p->x, p->y, 1, 1);
            cairo_fill (cr);
            it = points->erase(it);
        } else {
            it++;
        }
    }
}
The idea is to color all points that just escaped the set, and then remove them from the list to avoid evaluating them again.
It does render the set correctly, but it seems that the rendering takes a lot longer than needed.
Can someone spot any performance issues with the loop? Or is it as good as it gets?
Thanks in advance :)
SOLUTION
Very nice answers, thanks :) - I ended up with a kind of hybrid of the answers. Thinking about what was suggested, I realized that calculating each point, putting them all in a vector and then extracting them was a huge waste of CPU time and memory. So instead, the program now just calculates the Z value of each point without even using the point_t or the vector. It now runs A LOT faster!
Edit: I think the suggestion in the answer of kuroi neko is also a very good idea if you do not care about "incremental" computation but have a fixed number of iterations.
You should use vector<point_t> instead of vector<point_t*>.
A vector<point_t*> is a list of pointers to point_t. Each point is stored at some random location in memory, so if you iterate over the points, the pattern in which memory is accessed looks completely random. You will get a lot of cache misses.
In contrast, vector<point_t> uses contiguous memory to store the points. Thus the next point is stored directly after the current point, which allows efficient caching.
You should not call erase(it); in your inner loop.
Each call to erase has to move all elements after the one you remove, so it has O(n) runtime. For example, you could instead add a flag to point_t to indicate that it should not be processed any longer. It may be even faster to remove all the "inactive" points after each iteration.
It is probably not a good idea to draw individual pixels using cairo_rectangle. I would suggest you create an image and store the color for each pixel. Then draw the whole image with one draw call.
Your code could look like this:
for(unsigned int i = 0; i < iterations; i++){
double r,g,b;
r = (double)i+1 / (double)iterations;
g = 0;
b = 0;
for(vector<point_t>::iterator it=points->begin(); it!=points->end(); ++it) {
point_t& p = *it;
if(!p.active) {
continue;
}
p.Z = (p.Z * p.Z) + p.C;
if(abs(p.Z) > 2.0) {
cairo_set_source_rgba(cr, r, g, b, 1);
cairo_rectangle (cr, p.x, p.y, 1, 1);
cairo_fill (cr);
p.active = false;
}
}
// perhaps remove all points where p.active = false
}
If you can not change point_t, you can use an additional vector<char> to store if a point has become "inactive".
The Zn divergence computation is what makes the algorithm slow (depending on the area you're working on, of course). In comparison, pixel drawing is mere background noise.
Your loop is flawed because it makes the Zn computation slow.
The way to go is to compute divergence for each point in a tight, optimized loop, and then take care of the display.
Besides, it's useless and wasteful to store Z permanently.
You just need C as an input and the number of iterations as an output.
Assuming your points array only holds C values (basically you don't need all this vector crap, but it won't hurt performance either), you could do something like this:
// assuming point_t has members x, y and a complex<double> member C, as in the question
for(vector<point_t>::iterator it=points->begin(); it!=points->end(); ++it)
{
    complex<double> Z = 0;
    complex<double> C = it->C;
    unsigned int i;
    for(i = 0; i < iterations; i++) // <-- this is the CPU burner
    {
        Z = Z * Z + C;
        if(abs(Z) > 2.0) break;
    }
    cairo_set_source_rgba(cr, (double)(i+1) / (double)iterations, 0, 0, 1);
    cairo_rectangle (cr, it->x, it->y, 1, 1);
    cairo_fill (cr);
}
Try to run this with and without the cairo calls and you should see no noticeable difference in execution time (unless you're looking at an empty spot of the set).
Now if you want to go faster, try to break the Z = Z * Z + C computation down into its real and imaginary parts and optimize it. You could even use MMX/SSE or the like to do parallel computations.
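For illustration, a minimal sketch of that decomposition (my naming; cr_ and ci stand for the real and imaginary parts of C):
// One escape-time loop on plain doubles; the squares are cached so the
// bailout test |Z|^2 > 4 reuses them instead of calling abs().
double zr = 0.0, zi = 0.0, zr2 = 0.0, zi2 = 0.0;
unsigned int i;
for (i = 0; i < iterations && zr2 + zi2 <= 4.0; ++i)
{
    zi = 2.0 * zr * zi + ci; // imaginary part of Z*Z + C (uses the old zr, zi)
    zr = zr2 - zi2 + cr_;    // real part of Z*Z + C (uses the old squares)
    zr2 = zr * zr;
    zi2 = zi * zi;
}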
And of course the way to gain another significant speed factor is to parallelize your algorithm over the available CPU cores (i.e. split your display area into subsets and have different worker threads compute these parts in parallel).
This is not as obvious as it might seem, though, since each sub-picture will have a different computation time (black areas are very slow to compute while white areas are computed almost instantly).
One way to do it is to split the area into a large number of rectangles and have all worker threads pick a random rectangle from a common pool until all rectangles have been processed.
This simple load-balancing scheme makes sure no CPU core will be left idle while its buddies are busy on other parts of the display.
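A minimal C++11 sketch of that rectangle-pool scheme (Rect and renderRect are hypothetical placeholders for your own types):
#include <atomic>
#include <thread>
#include <vector>

struct Rect { int x, y, w, h; }; // one sub-area of the display
void renderRect(const Rect &r);  // hypothetical: iterate all pixels of r

// Each worker pulls the next unprocessed rectangle from a shared atomic
// counter until the pool is exhausted, so no core sits idle.
void renderAll(std::vector<Rect> &pool, unsigned numThreads)
{
    std::atomic<size_t> next(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
        workers.emplace_back([&] {
            for (size_t i; (i = next.fetch_add(1)) < pool.size(); )
                renderRect(pool[i]);
        });
    for (auto &w : workers) w.join();
}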
The first step to optimizing performance is to find out what is slow. Your code mixes three tasks: iterating to calculate whether a point escapes, manipulating a vector of points to test, and plotting the point.
Separate these three operations and measure their contribution. You can optimize the escape calculation by parallelizing it with SIMD operations. You can optimize the vector operations by not erasing from the vector when you want to remove a point, but adding it to another vector when you want to keep it (since erase is O(n) and addition is O(1)), and improve locality by having a vector of points rather than pointers to points. And if the plotting is slow, use an off-screen bitmap and set points by manipulating the backing memory rather than using cairo functions.
(I was going to post this but @Werner Henze already made the same point in a comment, hence community wiki)
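For illustration, a minimal sketch of the "copy the keepers instead of erasing" idea from the previous paragraph (it assumes a plain vector<point_t> named points; hasEscaped() is a hypothetical test):
#include <vector>

// Build the next generation in one O(n) pass instead of calling erase()
// (O(n) per call) for every escaped point, then swap the buffers.
std::vector<point_t> survivors;
survivors.reserve(points.size());
for (const point_t &p : points)
    if (!hasEscaped(p)) // hypothetical: true once |Z| exceeded 2
        survivors.push_back(p);
points.swap(survivors);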

Ray picking with depth buffer: horribly inaccurate?

I'm trying to implement a ray picking algorithm, for painting and selecting blocks (thus I need a fair amount of accuracy). Initially I went with a ray casting implementation, but I didn't feel it was accurate enough (although the fault may have been with my intersection testing). Regardless, I decided to try picking by using the depth buffer, and transforming the mouse coordinates to world coordinates. Implementation below:
glm::vec3 Renderer::getMouseLocation(glm::vec2 coordinates) {
    float depth = deferredFBO->getDepth(coordinates);
    // Calculate the width and height of the deferredFBO
    float viewPortWidth = deferredArea.z - deferredArea.x;
    float viewPortHeight = deferredArea.w - deferredArea.y;
    // Calculate homogeneous coordinates for mouse x and y
    float windowX = (2.0f * coordinates.x) / viewPortWidth - 1.0f;
    float windowY = 1.0f - (2.0f * coordinates.y) / viewPortHeight;
    // cameraToClip = projection matrix
    glm::vec4 cameraCoordinates = glm::inverse(cameraToClipMatrix)
        * glm::vec4(windowX, windowY, depth, 1.0f);
    // Normalize
    cameraCoordinates /= cameraCoordinates.w;
    glm::vec4 worldCoordinates = glm::inverse(worldToCameraMatrix)
        * cameraCoordinates;
    return glm::vec3(worldCoordinates);
}
The problem is that the values are easily ±3 units (blocks are 1 unit wide), only getting accurate enough when very close to the near clipping plane.
Does the inaccuracy stem from using single-precision floats, or maybe some step in my calculations? Would it help if I used double-precision values, and does OpenGL even support that for depth buffers?
And lastly, if this method doesn't work, am I best off using colour IDs to accurately identify which polygon was picked?
Colors are the way to go. The depth buffer's accuracy depends on the near/far plane distances and the resolution of the FBO texture, and also on the normal or slope of the surface. The same precision problem happens with standard shadow mapping. (Using colors is also a bit easier, because with the depth intersection test one object has many "colors", i.e. depth values; it's more accurate if one object has one color.)
Also, maybe it's just me, but I like to avoid rather complex matrix calculations if they're not necessary. It's enough for the poor CPU to do the other stuff.
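For reference, a minimal sketch of the color-ID picking approach (it assumes OpenGL and GLM headers are available; the function names are mine):
#include <cstdint>
#include <glm/glm.hpp>

// Encode a 24-bit object ID as a unique RGB color (one color per pickable object).
glm::vec3 idToColor(uint32_t id)
{
    return glm::vec3(((id >>  0) & 0xFF) / 255.0f,
                     ((id >>  8) & 0xFF) / 255.0f,
                     ((id >> 16) & 0xFF) / 255.0f);
}

// After rendering each object flat-shaded with its ID color into an RGBA8 FBO,
// read back the pixel under the mouse and decode it (y is flipped because
// OpenGL's window origin is bottom-left).
uint32_t pickObject(int x, int y, int viewportHeight)
{
    unsigned char rgba[4];
    glReadPixels(x, viewportHeight - y - 1, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, rgba);
    return (uint32_t)rgba[0] | ((uint32_t)rgba[1] << 8) | ((uint32_t)rgba[2] << 16);
}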
As for double-precision values, those could drop performance badly. I've encountered this kind of performance drop; it was about 3x slower for me to use doubles rather than floats:
my post: GLSL performance - function return value/type
and an article about this: https://superuser.com/questions/386456/why-does-a-geforce-card-perform-4x-slower-in-double-precision-than-a-tesla-card
So yep, you can use 64-bit floats (double):
http://www.opengl.org/registry/specs...hader_fp64.txt,
and http://www.opengl.org/registry/specs...trib_64bit.txt,
but you should not.
All in all, use colored polys. I like colors, khmm...
EDIT: more about double-precision depth: http://www.opengl.org/discussion_boards/showthread.php/173450-Double-Precision, it's a pretty good discussion.