What are the fastest algorithms for rendering the Mandelbrot set? - c++

I've tried many algorithms for rendering the Mandelbrot set, including the naive escape-time algorithm and the optimized escape-time algorithm. But are there faster algorithms used to produce really deep zooms efficiently, like the ones we see on YouTube? I would also love some ideas on how to increase my precision beyond the C/C++ double.

Even a high-end CPU will be much slower than an average GPU. You can get real-time rendering even with the naive iteration algorithm on a GPU, so using better algorithms on the GPU could reach very high zooms. However, for any decent algorithm you need:
multi-pass rendering, as we cannot self-modify a texture on the GPU
high-precision floating point, as floats/doubles are not enough.
Here are a few related QAs:
GLSL RT Mandelbrot
Interior Distance Estimate algorithm for the Mandelbrot set
I get infinitely small numbers for fractals
Perturbation theory
which might get you kick-started...
One way to speed things up is to use fractional escape, like I did in the first link. It improves image quality while keeping the max iteration count low.
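For reference, the usual form of the fractional (smooth) escape value is mu = i + 1 - log2(log2(|z|)); here is a generic C++ sketch of that formula (not necessarily exactly what the linked shader does):

#include <cmath>

// Smooth (fractional) escape time for one point; a larger bailout radius
// makes the logarithmic correction behave better.
float mandelbrot_smooth(float cx, float cy, int max_iter)
{
    float x = 0.0f, y = 0.0f;
    int i = 0;
    while ((i < max_iter) && (x*x + y*y < 256.0f))
    {
        float t = x*x - y*y + cx;
        y = 2.0f*x*y + cy;
        x = t;
        i++;
    }
    if (i >= max_iter) return (float)max_iter;      // treated as inside the set
    float r = sqrtf(x*x + y*y);                     // |z| after escape
    return (float)i + 1.0f - log2f(log2f(r));       // fractional iteration count
}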
The second link gives you an approximation of which parts of the fractal are inside or outside, and how far. It's not very accurate, but it can be used to avoid computing iterations for parts that are "outside for sure".
The next link shows you how to achieve better precision.
The last link is about perturbation. The idea is that you use high-precision math only for some reference points, and then use those to compute their neighboring points with low-precision math without losing precision. I have never used it, but it looks promising.
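To illustrate the perturbation idea (just my own sketch, not taken from the linked answer): you iterate one reference point C in arbitrary precision, store its orbit Z[n] as plain doubles, and every other pixel c = C + d0 only iterates its small offset with d[n+1] = 2*Z[n]*d[n] + d[n]^2 + d0, which needs no high-precision math:

#include <complex>
#include <vector>

int iterate_perturbed(const std::vector<std::complex<double>>& Z, // reference orbit, Z[0] = 0
                      std::complex<double> d0,                    // pixel offset from the reference point
                      int max_iter)
{
    std::complex<double> d = 0.0;                   // d_0 = 0 because z_0 = Z_0 = 0
    for (int n = 0; n < max_iter && n < (int)Z.size(); n++)
    {
        if (std::norm(Z[n] + d) > 4.0) return n;    // escape test on the full value Z_n + d_n
        d = 2.0*Z[n]*d + d*d + d0;                  // low-precision delta iteration
    }
    return max_iter;                                // not escaped (or reference orbit exhausted;
                                                    // a real implementation would rebase here)
}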
And finally, once you have achieved fast rendering, you might want to aim for this:
How to adjust panning while zooming Mandelbrot set
Here is a small example of 3 × 64-bit doubles used for a single value in GLSL:
// high precision float (very slow)
dvec3 fnor(dvec3 a)
{
dvec3 c=a;
if (abs(c.x)>1e-5){ c.y+=c.x; c.x=0.0; }
if (abs(c.y)>1e+5){ c.z+=c.y; c.y=0.0; }
return c;
}
double fget(dvec3 a){ return a.x+a.y+a.z; }
dvec3 fset(double a){ return fnor(dvec3(a,0.0,0.0)); }
dvec3 fadd(dvec3 a,double b){ return fnor(a+fset(b)); }
dvec3 fsub(dvec3 a,double b){ return fnor(a-fset(b)); }
dvec3 fmul(dvec3 a,double b){ return fnor(a*b); }
dvec3 fadd(dvec3 a,dvec3 b){ return fnor(a+b); }
dvec3 fsub(dvec3 a,dvec3 b){ return fnor(a-b); }
dvec3 fmul(dvec3 a,dvec3 b)
{
dvec3 c;
c =fnor(a*b.x);
c+=fnor(a*b.y);
c+=fnor(a*b.z);
return fnor(c);
}
So each high-precision value is a dvec3... the thresholds in fnor can be changed to any ranges. You can also convert this to vec3 and float...
[Edit1] "fast" C++ example
OK, I wanted to try my new SSD1306 driver along with my AVR32 MCU to compute the Mandelbrot set, so I can compare speed with this Arduino + 3D + Pong + Mandelbrot. I used an AT32UC3A3256 at ~66 MHz, no FPU, no GPU, and a 128x64x1bpp display. No external memory, only internal 16+32+32 KByte. The naive Mandelbrot was way too slow (~2.5 s per frame), so I put together something like this (taking advantage of the fact that the position and zoom of the view change more or less continuously):
reduce resolution by 2
to make room for dithering, as my output is just B&W
use variable max iteration n based on zoom
When n changes, invalidate the last frame to force a full recompute. I know this is slow, but it happens only 3 times, on the transitions between zoom ranges.
Scaling the iteration count from the last frame does not look good, as it is not linear.
It would be possible to reuse the last counts, but that would also require remembering the complex variables used for the iteration, and that would take too much memory.
remember the last frame and also which x,y screen coordinate mapped to which Mandelbrot coordinate.
On each frame compute the mapping between screen coordinates and Mandelbrot coordinates.
remap the last frame to adjust to the new position and zoom
So simply look at the data from #3 and #4, and where the same position occurs in both the last and the current frame (closer than half a pixel), copy the pixel. Recompute the rest.
This will hugely improve performance if your view is smooth (so position and zoom do not change much from frame to frame).
I know this is a bit of a vague description, so here is the C++ code where you can resolve any doubts:
//---------------------------------------------------------------------------
//--- Fast Mandelbrot set ver: 1.000 ----------------------------------------
//---------------------------------------------------------------------------
template<int xs,int ys,int sh> void mandelbrot_draw(float mx,float my,float zoom)
{
// xs,ys - screen resolution
// sh - log2(pixel_size) ... dithering pixel size
// mx,my - Mandelbrot position (center of view) <-1.5,+0.5>,<-1.0,+1.0>
// zoom - zoom
// ----------------
// (previous/actual) frame
static U8 p[xs>>sh][ys>>sh]; // intensities (raw Mandelbrot image)
static int n0=0; // max iterations
static float px[(xs>>sh)+1]={-1000.0}; // pixel x position in Mandelbrot
static float py[(ys>>sh)+1]; // pixel y position in Mandelbrot
// temp variables
U8 shd; // just pattern for dithering
int ix,iy,i,n,jx,jy,kx,ky,sz; // index variables
int nx=xs>>sh,ny=ys>>sh; // real Mandelbrot resolution
float fx,fy,fd; // floating Mandelbrot position and pixel step
float x,y,xx,yy,q; // Mandelbrot iteration stuff (this needs to be high precision)
int qx[xs>>sh],qy[ys>>sh]; // mapping of pixels between last and actual frame
float px0[xs>>sh],py0[ys>>sh]; // pixel position in Mandelbrot from last frame
// init vars
if (zoom< 10.0) n= 31;
else if (zoom< 100.0) n= 63;
else if (zoom< 1000.0) n=127;
else n=255;
sz=1<<sh;
ix=xs; if (ix>ys) ix=ys; ix/=sz;
fd=2.0/(float(ix-1)*zoom);
mx-=float(xs>>(1+sh))*fd;
my-=float(ys>>(1+sh))*fd;
// init buffers
if ((px[0]<-999.0)||(n0!=n))
{
n0=n;
for (ix=0;ix<nx;ix++) px[ix]=-999.0;
for (iy=0;iy<ny;iy++) py[iy]=-999.0;
for (ix=0;ix<nx;ix++)
for (iy=0;iy<ny;iy++)
p[ix][iy]=0;
}
// store old and compute new float positions of pixels in Mandelbrot to px[],py[],px0[],py0[]
for (fx=mx,ix=0;ix<nx;ix++,fx+=fd){ px0[ix]=px[ix]; px[ix]=fx; qx[ix]=-1; }
for (fy=my,iy=0;iy<ny;iy++,fy+=fd){ py0[iy]=py[iy]; py[iy]=fy; qy[iy]=-1; }
// match old and new x coordinates to qx[]
for (ix=0,jx=0;(ix<nx)&&(jx<nx);)
{
x=px[ix]; y=px0[jx];
xx=(x-y)/fd; if (xx<0.0) xx=-xx;
if (xx<=0.5){ qx[ix]=jx; px[ix]=y; }
if (x<y) ix++; else jx++;
}
// match old and new y coordinates to qy[]
for (ix=0,jx=0;(ix<ny)&&(jx<ny);)
{
x=py[ix]; y=py0[jx];
xx=(x-y)/fd; if (xx<0.0) xx=-xx;
if (xx<=0.5){ qy[ix]=jx; py[ix]=y; }
if (x<y) ix++; else jx++;
}
// remap p[][] by qx[]
for (ix=0,jx=nx-1;ix<nx;ix++,jx--)
{
i=qx[ix]; if ((i>=0)&&(i>=ix)) for (iy=0;iy<ny;iy++) p[ix][iy]=p[i][iy];
i=qx[jx]; if ((i>=0)&&(i<=jx)) for (iy=0;iy<ny;iy++) p[jx][iy]=p[i][iy];
}
// remap p[][] by qy[]
for (iy=0,jy=ny-1;iy<ny;iy++,jy--)
{
i=qy[iy]; if ((i>=0)&&(i>=iy)) for (ix=0;ix<nx;ix++) p[ix][iy]=p[ix][i];
i=qy[jy]; if ((i>=0)&&(i<=jy)) for (ix=0;ix<nx;ix++) p[ix][jy]=p[ix][i];
}
// Mandelbrot
for (iy=0,ky=0,fy=py[iy];iy<ny;iy++,ky+=sz,fy=py[iy]) if ((fy>=-1.0)&&(fy<=+1.0))
for (ix=0,kx=0,fx=px[ix];ix<nx;ix++,kx+=sz,fx=px[ix]) if ((fx>=-1.5)&&(fx<=+0.5))
{
// invalid qx,qy ... recompute Mandelbrot
if ((qx[ix]<0)||(qy[iy]<0))
{
for (x=0.0,y=0.0,xx=0.0,yy=0.0,i=0;(i<n)&&(xx+yy<4.0);i++)
{
q=xx-yy+fx;
y=(2.0*x*y)+fy;
x=q;
xx=x*x;
yy=y*y;
}
i=(16*i)/(n-1); if (i>16) i=16; if (i<0) i=0;
i=16-i; p[ix][iy]=i;
}
// use stored intensity
else i=p[ix][iy];
// render point with intensity i corresponding to ix,iy position in map
for (i<<=3 ,jy=0;jy<sz;jy++)
for (shd=shade8x8[i+(jy&7)],jx=0;jx<sz;jx++)
lcd.pixel(kx+jx,ky+jy,shd&(1<<(jx&7)));
}
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
The lcd and shade8x8 stuff can be found in the linked SSD1306 QA. However, you can ignore it; it's just dithering and outputting a pixel, so you can instead output the i directly (even without the scaling to <0..16>).
Here is a preview (on PC, as I was too lazy to connect a camera...):
So it is 64x32 Mandelbrot pixels displayed as a 128x64 dithered image. On my AVR32 this is maybe 8x faster than the naive method (maybe 3-4 fps)... The code could be optimized further; however, keep in mind that the Mandelbrot is not the only thing running, as I have some ISR handlers in the background to handle the LCD, plus my TTS engine based on this, which I have upgraded a lot since then and use for debugging (yes, it can speak in parallel with the rendering). Also, I am low on memory, as my 3D engine takes away a lot, ~11 KByte (mostly the depth buffer).
The preview was done with this code (inside a timer):
static float zoom=1.0;
mandelbrot_draw<128,64,1>(+0.37,-0.1,zoom);
zoom*=1.02; if (zoom>100000) zoom=1.0;
Also, for a non-AVR32 C++ environment, use this:
//------------------------------------------------------------------------------------------
#ifndef _AVR32_compiler_h
#define _AVR32_compiler_h
//------------------------------------------------------------------------------------------
#include <stdint.h>
typedef int32_t S32;
typedef int16_t S16;
typedef int8_t S8;
typedef uint32_t U32;
typedef uint16_t U16;
typedef uint8_t U8;
//------------------------------------------------------------------------------------------
#endif
//------------------------------------------------------------------------------------------
[Edit2] higher float precision in GLSL
The main problem with the Mandelbrot set is that it needs to add numbers with a very big exponent difference. For +,- operations we need to align the mantissas of both operands, add them as integers, and normalize back to scientific notation. However, if the exponent difference is big, the result mantissa needs more bits than can fit into a 32-bit float, so only the 24 most significant bits are preserved. This creates the rounding errors causing your pixelation. If you look at a 32-bit float in binary you will see this:
float a=1000.000,b=0.00001,c=a+b;
//012345678901234567890123456789 ... just to easily count bits
a=1111101000b // a=1000
b= 0.00000000000000001010011111000101101011b // b=0.00000999999974737875
c=1111101000.00000000000000001010011111000101101011b // not rounded result
c=1111101000.00000000000000b // c=1000 rounded to 24 bits of mantissa
Now the idea is to increase the number of mantissa bits. The easiest trick is to have 2 floats instead of one:
//012345678901234567890123 ... just to easily count bits
c=1111101000.00000000000000001010011111000101101011b // unrounded result
c=1111101000.00000000000000b                         // high part: c rounded to 24 bits of mantissa
 +          .00000000000000001010011111000101101011b // low part: the remainder stored in a second float
So part of the result is in one float and the rest in the other... The more floats per single value we have, the bigger the mantissa. However, doing this as a bit-exact division of a big mantissa into 24-bit chunks would be complicated and slow in GLSL (if even possible, due to GLSL limitations). Instead we can select some range of exponents for each of the floats (just like in the example above).
So in the example we have 3 floats (vec3) per single (float) value. Each of the coordinates represents a different range:
abs(x) <= 1e-5
1e-5 < abs(y) <= 1e+5
1e+5 < abs(z)
and value = (x+y+z), so we kind of have a 3*24-bit mantissa; however, the ranges do not exactly match 24 bits. For that, the exponent range should be divided by:
log10(2^24)=7.2247198959355486851297334733878
instead of 10... for example something like this:
abs(x) <= 1e-7
1e-7 < abs(y) <= 1e+0
1e+0 < abs(z)
Also, the ranges must be selected so they cover the range of values you actually use, otherwise it is all for nothing. If your numbers are < 4, it is pointless to have a range > 10^+5. So first you need to see what bounds your values have, then dissect them into exponent ranges (as many as you have floats per value).
Beware: some rounding still occurs (but much less than with a native float)!!!
Now, doing operations on such numbers is slightly more complicated than on normal floats, as you need to handle each value as a bracketed sum of all components, so:
(x0+y0+z0) + (x1+y1+z1) = (x0+x1 + y0+y1 + z0+z1)
(x0+y0+z0) - (x1+y1+z1) = (x0-x1 + y0-y1 + z0-z1)
(x0+y0+z0) * (x1+y1+z1) = x0*(x1+y1+z1) + y0*(x1+y1+z1) + z0*(x1+y1+z1)
And do not forget to normalize the values back into the defined ranges. Avoid adding values with small and big magnitudes, so avoid x0+z0 etc...
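To make the bookkeeping concrete, here is a plain C++ analogue of the GLSL helpers above (the 1e-5/1e+5 thresholds are the same arbitrary ones as in the snippet, not tuned), showing one z = z*z + c step written with the split representation:

#include <cmath>

struct f3 { double x, y, z; };                      // value = x + y + z, split by magnitude

f3 fnor(f3 a)                                       // push parts back into their ranges
{
    if (std::fabs(a.x) > 1e-5) { a.y += a.x; a.x = 0.0; }
    if (std::fabs(a.y) > 1e+5) { a.z += a.y; a.y = 0.0; }
    return a;
}
f3 fset(double a)       { return fnor({ a, 0.0, 0.0 }); }
f3 fadd(f3 a, f3 b)     { return fnor({ a.x + b.x, a.y + b.y, a.z + b.z }); }
f3 fsub(f3 a, f3 b)     { return fnor({ a.x - b.x, a.y - b.y, a.z - b.z }); }
f3 fmul(f3 a, double b) { return fnor({ a.x * b, a.y * b, a.z * b }); }
f3 fmul(f3 a, f3 b)     { return fadd(fadd(fmul(a, b.x), fmul(a, b.y)), fmul(a, b.z)); }

// one Mandelbrot step: z' = z*z + c, with z = (zx,zy) and c = (cx,cy) stored as split values
void mandel_step(f3 &zx, f3 &zy, f3 cx, f3 cy)
{
    f3 xx = fmul(zx, zx), yy = fmul(zy, zy);
    f3 nx = fadd(fsub(xx, yy), cx);                 // zx*zx - zy*zy + cx
    f3 ny = fadd(fmul(fmul(zx, zy), 2.0), cy);      // 2*zx*zy + cy
    zx = nx; zy = ny;
}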
[Edit3] new win32 demo CPU vs. GPU
win32 Mandelbrot demo 64bit floats
Both executables are preset to the same location and zoom to show where the doubles start to round off. I had to slightly improve the way the px,py coordinates are computed, as around zoom 10^9 the y axis started to deviate at this location (the threshold still might be too big for other locations).
Here is a preview of CPU vs. GPU for high zoom (n=1600):
RT GIF capture of CPU (n=62++, GIF 4x scaled down):

The optimized escape-time algorithm should be fast enough to draw the Mandelbrot set in real time. You can use multiple threads so that your implementation will be faster (this is very easy using OpenMP, for example). You can also manually vectorize your code using SIMD instructions to make it even faster if needed. You could even run this directly on the GPU using shaders and/or GPU computing frameworks (OpenCL or CUDA) if this is still not fast enough for you (although this is a bit complex to do efficiently). Finally, you should tune the number of iterations so that it stays rather small.
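As an illustration of the multithreading suggestion, here is a minimal OpenMP sketch of the escape-time loop (the function name and parameters are made up); schedule(dynamic) helps because rows have very uneven cost:

#include <omp.h>
#include <vector>

void mandelbrot_omp(std::vector<int>& out, int w, int h,          // out must hold w*h entries
                    double cx, double cy, double scale, int max_iter)
{
    #pragma omp parallel for schedule(dynamic)                    // rows are independent
    for (int py = 0; py < h; py++)
    {
        for (int px = 0; px < w; px++)
        {
            double x0 = cx + (px - w/2) * scale;
            double y0 = cy + (py - h/2) * scale;
            double x = 0.0, y = 0.0;
            int i = 0;
            while (i < max_iter && x*x + y*y < 4.0)
            {
                double t = x*x - y*y + x0;
                y = 2.0*x*y + y0;
                x = t;
                i++;
            }
            out[py*w + px] = i;
        }
    }
}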
Zooming should not have any direct impact on the performance. It just changes the input window of the computation. However, it does have an indirect impact since the actual number of iterations will change. Points outside the window should not be computed.
Double precision should also be enough for drawing the Mandelbrot set correctly. But if you really want more precise calculations, you can use double-double precision, which gives quite good precision and not-too-bad performance. However, implementing double-double precision manually is a bit tricky, and it is still significantly slower than using just double precision.
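For a flavor of what double-double arithmetic looks like, here is a sketch of the addition building block based on Knuth's two-sum (a real library such as QD handles more corner cases, and this requires strict IEEE arithmetic, so no -ffast-math):

// a double-double value is an unevaluated sum hi + lo with |lo| much smaller than |hi|
struct dd { double hi, lo; };

static dd two_sum(double a, double b)               // exact: a + b == s + e
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return { s, e };
}

static dd dd_add(dd a, dd b)                        // "sloppy" double-double addition
{
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return two_sum(s.hi, s.lo);                     // renormalize so |lo| stays small
}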

My fastest solutions avoid iterating over large areas of the same depth by following a contour boundary and then filling. There is a penalty: it is possible to nip off small buds instead of going around them, but all in all it is a small price to pay for a quick zoom.
One possible efficiency is that if a zoom doubles the scale, you already have ¼ of the points.
For animation, I file each frame's values, doubling the scale each time, and interpolate the in-between frames on playback in real time, so the animation doubles once per second. The double type allows more than 50 key frames to be stored, giving an animation that lasts more than a minute (in and then back out).
The actual iteration is done by hand-crafted assembler, so one pixel is iterated entirely in the FPU.

Related

Fast, good quality pixel interpolation for extreme image downscaling

In my program, I am downscaling an image of 500px or larger to an extreme level of approx 16px-32px. The source image is user-specified so I do not have control over its size. As you can imagine, few pixel interpolations hold up and inevitably the result is heavily aliased.
I've tried bilinear, bicubic and square average sampling. The square average sampling actually provides the most decent results but the smaller it gets, the larger the sampling radius has to be. As a result, it gets quite slow - slower than the other interpolation methods.
I have also tried an adaptive square average sampling so that the smaller it gets the greater the sampling radius, while the closer it is to its original size, the smaller the sampling radius. However, it produces problems and I am not convinced this is the best approach.
So the question is: What is the recommended type of pixel interpolation that is fast and works well on such extreme levels of downscaling?
I do not wish to use a library so I will need something that I can code by hand and isn't too complex. I am working in C++ with VS 2012.
Here's some example code I've tried as requested (hopefully without errors from my pseudo-code cut and paste). This performs a 7x7 average downscale and although it's a better result than bilinear or bicubic interpolation, it also takes quite a hit:
// Sizing control
ctl(0): "Resize",Range=(0,800),Val=100
// Variables
float fracx,fracy;
int Xnew,Ynew,p,q,Calc;
int x,y,z,p1,q1,i,j;
//New image dimensions
Xnew=image->width*ctl(0)/100;
Ynew=image->height*ctl(0)/100;
for (y=0; y<image->height; y++){ // rows
for (x=0; x<image->width; x++){ // columns
p1=(int)x*image->width/Xnew;
q1=(int)y*image->height/Ynew;
for (z=0; z<3; z++){ // channels
Calc = 0; // reset accumulator for this output pixel/channel
for (i=-3;i<=3;i++) {
for (j=-3;j<=3;j++) {
Calc += (int)(src(p1-i,q1-j,z));
} //j
} //i
Calc /= 49;
pset(x, y, z, Calc);
} // channels
} // columns
} // rows
Thanks!
The first point is to use pointers to your data. Never use indexing at every pixel. When you write src(p1-i,q1-j,z) or pset(x, y, z, Calc), how much computation is being done? Use pointers to the data and manipulate those.
Second: your algorithm is wrong. You don't want an average filter; you want to lay a grid over your source image and, for every grid cell, compute the average and put it in the corresponding pixel of the output image.
The specific solution should be tailored to your data representation, but it could be something like this:
std::vector<uint32_t> accum(Xnew);
std::vector<uint32_t> count(Xnew);
uint32_t *paccum, *pcount;
uint8_t* pin = /*pointer to input data*/;
uint8_t* pout = /*pointer to output data*/;
for (int dr = 0, sr = 0, w = image->width, h = image->height; sr < h; ++dr) {
memset(paccum = accum.data(), 0, Xnew*4);
memset(pcount = count.data(), 0, Xnew*4);
while (sr * Ynew / h == dr) {
paccum = accum.data();
pcount = count.data();
for (int dc = 0, sc = 0; sc < w; ++sc) {
*paccum += *pin;
*pcount += 1;
++pin;
if (sc * Xnew / w > dc) {
++dc;
++paccum;
++pcount;
}
}
sr++;
}
std::transform(begin(accum), end(accum), begin(count), pout, std::divides<uint32_t>());
pout += Xnew;
}
This was written using my own library (still in development) and it seems to work, but I later changed the variable names to make it simpler here, so I don't guarantee anything!
The idea is to have a local buffer of 32 bit ints which can hold the partial sum of all pixels in the rows which fall in a row of the output image. Then you divide by the cell count and save the output to the final image.
The first thing you should do is to set up a performance evaluation system to measure how much any change impacts performance.
As said previously, you should not use indexes but pointers, for a (probably) substantial speed-up, and you should not simply average, as basic averaging of pixels is essentially a blur filter.
I would highly advise you to rework your code to use "kernels". A kernel is the matrix representing the weight of each pixel used. That way, you will be able to test different strategies and optimize quality.
Example of kernels:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
Upsampling/downsampling kernel:
http://www.johncostella.com/magic/
Note: from the code it seems you apply a 3x3 kernel, but it was initially done on a 7x7 kernel. The equivalent 3x3 kernel, as posted, would be:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]
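As a rough sketch of what using such a kernel looks like in code (get_pixel is a placeholder for however you access your image data, not part of your posted code):

// weights of a 3x3 box kernel; they already sum to 1, so no extra division is needed
static const float kernel3x3[3][3] = {
    { 1/9.0f, 1/9.0f, 1/9.0f },
    { 1/9.0f, 1/9.0f, 1/9.0f },
    { 1/9.0f, 1/9.0f, 1/9.0f },
};

float get_pixel(int x, int y, int channel);          // placeholder: your image access

float apply_kernel3x3(int x, int y, int channel)
{
    float sum = 0.0f;
    for (int j = -1; j <= 1; j++)
        for (int i = -1; i <= 1; i++)
            sum += kernel3x3[j + 1][i + 1] * get_pixel(x + i, y + j, channel);
    return sum;
}

Swapping the weight matrix then lets you compare a box filter against, say, a Gaussian or the magic kernel from the second link without touching the loop.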

Ray picking with depth buffer: horribly inaccurate?

I'm trying to implement a ray picking algorithm, for painting and selecting blocks (thus I need a fair amount of accuracy). Initially I went with a ray casting implementation, but I didn't feel it was accurate enough (although the fault may have been with my intersection testing). Regardless, I decided to try picking by using the depth buffer, and transforming the mouse coordinates to world coordinates. Implementation below:
glm::vec3 Renderer::getMouseLocation(glm::vec2 coordinates) {
float depth = deferredFBO->getDepth(coordinates);
// Calculate the width and height of the deferredFBO
float viewPortWidth = deferredArea.z - deferredArea.x;
float viewPortHeight = deferredArea.w - deferredArea.y;
// Calculate homogeneous coordinates for mouse x and y
float windowX = (2.0f * coordinates.x) / viewPortWidth - 1.0f;
float windowY = 1.0f - (2.0f * coordinates.y) / viewPortHeight;
// cameraToClip = projection matrix
glm::vec4 cameraCoordinates = glm::inverse(cameraToClipMatrix)
* glm::vec4(windowX, windowY, depth, 1.0f);
// Normalize
cameraCoordinates /= cameraCoordinates.w;
glm::vec4 worldCoordinates = glm::inverse(worldToCameraMatrix)
* cameraCoordinates;
return glm::vec3(worldCoordinates);
}
The problem is that the values are easily ±3 units (blocks are 1 unit wide), only getting accurate enough when very close to the near clipping plane.
Does the inaccuracy stem from using single-precision floats, or maybe some step in my calculations? Would it help if I used double-precision values, and does OpenGL even support that for depth buffers?
And lastly, if this method doesn't work, am I best off using colour IDs to accurately identify which polygon was picked?
Colors are the way to go. The depth buffer's accuracy depends on the near/far plane distances, on the resolution of the FBO texture, and also on the normal or slope of the surface. The same precision problem happens with standard shadow mapping. (Using colors is a bit easier because with the depth intersection test one object has many "colors", i.e. depth values; it's more accurate if one object has one color.)
Also, maybe it's just me, but I like to avoid rather complex matrix calculations if they're not necessary. The poor CPU has enough other stuff to do.
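A minimal sketch of the color-ID picking idea (the flat-color render pass is up to your engine; glReadPixels itself is standard OpenGL): draw each pickable object with a unique 24-bit ID encoded as its RGB color into an offscreen buffer, then read back the one pixel under the mouse:

#include <GL/gl.h>
#include <cstdint>

uint32_t pick_object(int mouse_x, int mouse_y, int viewport_height)
{
    // ... first render the scene into the currently bound framebuffer with lighting
    // and texturing off, drawing object i in color (i&255, (i>>8)&255, (i>>16)&255) ...
    unsigned char rgb[3];
    glReadPixels(mouse_x, viewport_height - mouse_y - 1, 1, 1,
                 GL_RGB, GL_UNSIGNED_BYTE, rgb);     // window Y is bottom-up in GL
    return rgb[0] | (rgb[1] << 8) | (rgb[2] << 16);  // decoded object ID
}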
As for double-precision values, those could hurt performance badly. I've encountered this kind of performance drop; it was about 3x slower for me to use doubles rather than floats:
my post: GLSL performance - function return value/type
and an article about this: https://superuser.com/questions/386456/why-does-a-geforce-card-perform-4x-slower-in-double-precision-than-a-tesla-card
So yep, you can use 64-bit floats (double):
http://www.opengl.org/registry/specs...hader_fp64.txt,
and http://www.opengl.org/registry/specs...trib_64bit.txt,
but you should not.
All in all, use colored polys. I like colors, khmm...
EDIT: more about double-precision depth: http://www.opengl.org/discussion_boards/showthread.php/173450-Double-Precision, it's a pretty good discussion.

Why is my hand-tuned, SSE-enabled code so slow?

Long story short: I'm developing a computing-intensive image processing application in C++. It needs to calculate many variants of image warps on small blocks of pixels extracted from larger images. The program doesn't run as fast as I would like. Profiling (OProfile) showed the warping/interpolation function consumes more than 70% of the CPU time so it seemed obvious to try and optimize it.
I was using the OpenCV image processing library for the task until now:
// some parameters for the image warps (position, stretch, skew)
struct WarpParams;
void Image::get(const WarpParams &params)
{
// fills matrices mapX_ and mapY_ with x and y coordinates of points to be
// interpolated.
updateCoordMaps(params);
// perform interpolation to obtain pixels at point locations
// P(mapX_[i], mapY_[i]) based on the original data and put the
// result in pixels_. Use bicubic interpolation.
cv::remap(image_->data(), pixels_, mapX_, mapY_, CV_INTER_CUBIC);
}
I wrote my own interpolation function and put it in a test harness to ensure correctness while I experiment and to benchmark it in relation to the old one.
My function ran very slow which was to be expected. Generally, the idea is to:
Iterate over the mapX_, mapY_ coordinate maps, extract (real-valued) coordinates of the next pixel to be interpolated;
Retrieve a 4x4 neighborhood of pixel values (integer coordinates) from the original image surrounding the interpolated pixel;
Calculate the coefficients of the convolution kernel for each of these 16 pixels;
Calculate the value of the interpolated pixel as a linear combination of the 16 pixel values and the kernel coefficients.
The old function timed at 25us on my Wolfdale Core2 Duo. The new one took 587us (!). I eagerly put my wizard hat on and started hacking the code. I managed to remove all branches, omit some duplicating calculations and transform 3 nested loops into just one over the coordinate maps. This is what I came up with:
void Image::getConvolve(const WarpParams &params)
{
__declspec(align(16)) static float kernelX[4], kernelY[4];
// grab pointers to coordinate map matrices and the original image
const float
*const mapX = mapX_.ptr<float>(),
*const mapY = mapY_.ptr<float>(),
*const img = image_->data().ptr<float>();
// grab pointer to the output image
float *const subset = pixels_.ptr<float>(),
x, y, xint, yint;
const ptrdiff_t imgw = image_->width();
ptrdiff_t imgoffs;
__m128 v_px, v_kernX, v_kernY, v_val;
// iterate over the coordinate matrices as linear buffers
for (size_t idx = 0; idx < pxCount; ++idx)
{
// retrieve coordinates of next pixel from precalculated maps,
// break up each into fractional and integer part
x = modf(mapX[idx], &xint);
y = modf(mapY[idx], &yint);
// obtain offset of the top left pixel from the required 4x4
// neighborhood of the current pixel in the image's
// buffer (sadly, the position will be unaligned)
imgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1;
// calculate all 4 convolution kernel values for every row and
// every column
tap4Kernel(x, kernelX);
tap4Kernel(y, kernelY);
// load the kernel values for the columns, these don't change
v_kernX = _mm_load_ps(kernelX);
// process a row of the 4x4 neighborhood
// get set of convolution kernel values for the current row
v_kernY = _mm_set_ps1(kernelY[0]);
v_px = _mm_loadu_ps(img + imgoffs); // load the pixel values
// calculate the linear combination of the pixels with kernelX
v_px = _mm_mul_ps(v_px, v_kernX);
v_px = _mm_mul_ps(v_px, v_kernY); // and kernel Y
v_val = v_px; // add result to the final value
imgoffs += imgw;
// offset points now to next row of the 4x4 neighborhood
v_kernY = _mm_set_ps1(kernelY[1]);
v_px = _mm_loadu_ps(img + imgoffs);
v_px = _mm_mul_ps(v_px, v_kernX);
v_px = _mm_mul_ps(v_px, v_kernY);
v_val = _mm_add_ps(v_val, v_px);
imgoffs += imgw;
/*... same for kernelY[2] and kernelY[3]... */
// store resulting interpolated pixel value in the subset's
// pixel matrix
subset[idx] = horizSum(v_val);
}
}
// Calculate all 4 values of the 4-tap convolution kernel for 4 neighbors
// of a pixel and store them in an array. Ugly but fast.
// The "arg" parameter is the fractional part of a pixel's coordinate, i.e.
// a number in the range <0,1)
void Image::tap4Kernel(const float arg, float *out)
{
// chaining intrinsics was slower, so this is done in separate steps
// load the argument into 4 cells of a XMM register
__m128
v_arg = _mm_set_ps1(arg),
v_coeff = _mm_set_ps(2.0f, 1.0f, 0.0f, -1.0f);
// subtract vector of [-1, 0, 1, 2] to obtain coordinates of 4 neighbors
// for kernel calculation
v_arg = _mm_sub_ps(v_arg, v_coeff);
// clear sign bits, this is equivalent to fabs() on all 4
v_coeff = _mm_set_ps1(-0.f);
v_arg = _mm_andnot_ps(v_coeff, v_arg);
// calculate values of abs(argument)^3 and ^2
__m128
v_arg2 = _mm_mul_ps(v_arg, v_arg),
v_arg3 = _mm_mul_ps(v_arg2, v_arg),
v_val, v_temp;
// calculate the 4 kernel values as
// arg^3 * A + arg^2 * B + arg * C + D, using
// (A,B,C,D) = (-0.5, 2.5, -4, 2) for the outside pixels and
// (1.5, -2.5, 0, 1) for inside
v_coeff = _mm_set_ps(-0.5f, 1.5f, 1.5f, -0.5f);
v_val = _mm_mul_ps(v_coeff, v_arg3);
v_coeff = _mm_set_ps(2.5f, -2.5f, -2.5f, 2.5f);
v_temp = _mm_mul_ps(v_coeff, v_arg2);
v_val = _mm_add_ps(v_val, v_temp);
v_coeff = _mm_set_ps(-4.0f, 0.0f, 0.0f, -4.0f),
v_temp = _mm_mul_ps(v_coeff, v_arg);
v_val = _mm_add_ps(v_val, v_temp);
v_coeff = _mm_set_ps(2.0f, 1.0f, 1.0f, 2.0f);
v_val = _mm_add_ps(v_val, v_coeff);
_mm_store_ps(out, v_val);
}
I was pleased to have managed to get the run time on this to below 40us, even before introducing SSE to the main loop, which I saved for last. I was expecting at least a 3-fold speedup but it ran only barely faster at 36us, slower than the old get() which I was trying to improve upon. Even worse was the fact that when I changed the benchmark loop to do more runs, the old function had the same mean run time, while mine stretched to over 127us, meaning it takes even longer for some extreme warp parameter values (makes sense because more warps means I need to reach for widely-dispersed pixel values from the original image to calculate the result).
I figured the reason must be the unaligned loads, but that can't be helped (I need to reach for unpredictable pixel values). I couldn't see anything more I could do in the optimizing department so I decided to look at the cv::remap() function to see how they do it. Imagine my surprise at finding it contain a mess of nested loops and plenty of branches. They also do a lot of argument verification which I didn't need to bother with. As far as I can tell (no comments in the code), SSE (with unaligned loads as well!) is only used for extracting the values from the coordinate maps and rounding them into integers, then a function is called that does the actual interpolating with regular float arithmetics.
My question is, why did I fail so miserably (why is my code so slow and why is theirs faster even though it looks like a mess) and what can I do to improve my code?
I'm not pasting the OpenCV code here because this is already too long, you can check it out at pastebin.
I tested and compiled my code in Release mode under VC++2010. The OpenCV used was a precompiled binary bundle of v2.3.1.
EDIT: The pixel values are floats in the range 0..1. Profiling showed the tap4Kernel() function wasn't relevant, most time is spent inside getConvolve().
EDIT2: I pasted the disassembly of the generated code to pastebin. This is compiled on an old Banias Celeron processor (has SSE2), but looks more or less the same.
EDIT3: After reading What Every Programmer Should Know About Memory I realized I was incorrectly assuming the OpenCV function implements more or less the same algorithm as I did, which must not be the case. For every pixel I interpolate, I need to retrieve its 4x4 neighborhood, whose pixels are non-sequentially placed inside the image buffer. I am misusing the CPU caches, and OpenCV probably doesn't. VTune profiling would seem to agree as my function has 5,800,000 memory accesses, while OpenCV does only 400,000. Their function is a mess and could probably be further optimized but it still manages to have an edge over me, probably due to some smarter approach to memory and cache usage.
UPDATE: I managed to improve the way pixel values are loaded into XMM registers. I allocate a buffer in the object which holds 16-element cells for every pixel of the image. At image load, I fill this cell buffer with pre-arranged sequences of 4x4-neighborhoods for every pixel. Not very space-efficient (image takes 16x the space), but that way, the loads are always aligned (no more _mm_loadu_ps()), and I am avoiding having to do scattered reads of pixels from the image buffer, since the required pixels are stored sequentially. To my surprise, there was hardly any improvement at all. I heard unaligned loads could be 10x slower, but clearly this is not the problem here. But by commenting out parts of the code, I found out that the modf() calls are responsible for 75% of the runtime! I'll focus on eliminating those and post an answer.
First a few observations.
You use function-static variables, which can incur synchronization (I don't think it does here, though).
The assembly mixes x87 and SSE code.
tap4Kernel is inlined, which is good, but the profile may be inaccurate.
modf is not inlined.
The assembly uses _ftol2_sse (are there faster options?).
The assembly moves the registers around a lot.
Try doing the following:
Ensure that your compiler is optimizing aggressively for an SSE2 based architecture.
Make modf available for inlining.
Avoid using function static variables.
If the assembly still uses x87 instructions, try avoiding the float-int cast in imgoffs = (((ptrdiff_t)yint - 1) * imgw) + (ptrdiff_t)xint - 1; and making the float variables __m128.
It can possibly be optimized further by prefetching the maps (prefetch about 4kb ahead)
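For that last point, a sketch of what the prefetch could look like with the SSE intrinsic (mapX, mapY, and idx are the names from your loop; the 4 KB distance is just the suggested starting point, worth tuning):

#include <cstddef>
#include <xmmintrin.h>

// call this every few iterations inside the pixel loop, e.g. when (idx & 15) == 0
static inline void prefetch_maps(const float *mapX, const float *mapY, size_t idx)
{
    _mm_prefetch((const char*)(mapX + idx) + 4096, _MM_HINT_T0);
    _mm_prefetch((const char*)(mapY + idx) + 4096, _MM_HINT_T0);
}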

How to efficiently determine the minimum necessary size of a pre-rendered sine wave audio buffer for looping?

I've written a program that generates a sine-wave at a user-specified frequency, and plays it on a 96kHz audio channel. To save a few CPU cycles I employ the old trick of pre-rendering a short section of audio into a buffer, and then playing back the buffer in a loop, so that I can avoid calling the sin() function 96000 times per second for the duration of the program and just do simple memory-copying instead.
My problem is efficiently determining what the minimum usable size of this pre-rendered buffer would be. For some frequencies it is easy -- for example, an 8kHz sine wave can be perfectly represented by generating a 12-sample buffer and playing it in a loop, because (8000*12 == 96000). For other frequencies, however, a single cycle of the sine wave requires a non-integral number of samples to represent, and therefore looping a single cycle's worth of samples would cause unacceptable glitching.
For some of those frequencies, however, it's possible to get around that problem by pre-rendering more than one cycle of the sine wave and looping that -- if I can figure out how many cycles are required so that the number of cycles present in the buffer will be integral, while also guaranteeing that the number of samples in the buffer are integral. For example, a sine-wave frequency of 12.8kHz translates to a single-cycle buffer-size of 7.5 samples, which won't loop cleanly, but if I render two consecutive cycles of the sine wave into a 15-sample buffer, then I can cleanly loop the result.
My current approach to solving this issue is brute force: I try all possible cycle-counts and see if any of them result in a buffer size with an integral number of samples in it. I think that approach is unsatisfactory for the following reasons:
1) It's very inefficient. For example, the program shown below (which prints buffer-size results for 480,000 possible frequency values between 0Hz and 48kHz) takes 35 minutes to complete on my 2.7GHz machine. I think there must be a much faster way to do this.
2) I suspect that the results are not 100% accurate, due to floating-point errors.
3) The algorithm gives up if it can't find an acceptable buffer size less than 10 seconds long. (I could make the limit higher, but of course that would make the algorithm even slower).
So, is there any way to calculate the minimum-usable-buffer-size analytically, preferably in O(1) time? It seems like it should be easy, but I haven't been able to figure out what kind of math I should use.
Thanks in advance for any advice!
#include <stdio.h>
#include <math.h>
static const long long SAMPLES_PER_SECOND = 96000;
static const long long MAX_ALLOWED_BUFFER_SIZE_SAMPLES = (SAMPLES_PER_SECOND * 10);
// Returns the number of cycles that must be pre-rendered so the buffer
// loops cleanly at the given frequency, or -1 on failure.
static int GetNumCyclesNeededForPreRenderedBuffer(float freqHz)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
for (int count=1; (count*oneCycleLengthSamples) < MAX_ALLOWED_BUFFER_SIZE_SAMPLES; count++)
{
double remainder = fmod(oneCycleLengthSamples*count, 1.0);
if (remainder > 0.5) remainder = 1.0-remainder;
if (remainder <= 0.0) return count;
}
return -1;
}
int main(int, char **)
{
for (int i=0; i<48000*10; i++)
{
double freqHz = ((double)i)/10.0f;
int numCyclesNeeded = GetNumCyclesNeededForPreRenderedBuffer(freqHz);
if (numCyclesNeeded >= 0)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
printf("For %.1fHz, use a pre-render-buffer size of %f samples (%i cycles, %f samples/cycle)\n", freqHz, (numCyclesNeeded*oneCycleLengthSamples), numCyclesNeeded, oneCycleLengthSamples);
}
else printf("For %.1fHz, there was no suitable pre-render-buffer size under the allowed limit!\n", freqHz);
}
return 0;
}
number_of_cycles/size_of_buffer = frequency/samples_per_second
This implies that if you can simplify your frequency/samples_per_second fraction, you can find the size of your buffer and the number of cycles in the buffer. If frequency and samples_per_second are integers, you can simplify the fraction by finding the greatest common divisor, otherwise you can use the method of continued fractions.
Example:
Say your frequency is 1234.5, and your samples_per_second is 96000. We can make these into two integers by multiplying by 10, so we get the ratio:
frequency/samples_per_second = 12345/960000
The greatest common divisor is 15, so it can be reduced to 823/64000.
So you would need 823 cycles in a 64000 sample buffer to reproduce the frequency exactly.
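A sketch of the GCD route in C++ for the case where the frequency is given in tenths of a Hz (as in your loop), so both sides of the ratio are exact integers:

#include <cstdio>

static long long gcd_ll(long long a, long long b)
{
    while (b) { long long t = a % b; a = b; b = t; }
    return a;
}

int main()
{
    const long long samples_per_second = 96000;
    long long freq_tenths = 12345;                        // 1234.5 Hz, scaled by 10
    // frequency / samples_per_second == freq_tenths / (samples_per_second * 10)
    long long num = freq_tenths, den = samples_per_second * 10;
    long long g = gcd_ll(num, den);
    printf("%lld cycles in a %lld-sample buffer\n", num / g, den / g); // 823 cycles, 64000 samples
    return 0;
}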

Interpretation of DirectSound buffer elements from mic capture device

I am doing some maintenance work involving DirectSound buffers. I would like to know how to interpret the elements in the buffer, that is, to know what each value in the buffer represents. This data is coming from a microphone.
This wave format is being used:
WAVEFORMATEXTENSIBLE format = {
{ WAVE_FORMAT_EXTENSIBLE, 1, sample_rate, sample_rate * 4, 4, 32, 22 },
{ 32 }, 0, KSDATAFORMAT_SUBTYPE_IEEE_FLOAT
};
My goal is to detect microphone silence. I am currently accomplishing this by simply determining if all values in the buffer fail to exceed some threshold volume value, assuming that the intensity of each buffer element directly corresponds to volume.
This what I am currently trying:
bool is_mic_silent(float * data, unsigned int num_samples, float threshold)
{
float * max_iter = std::max_element(data, data + num_samples);
if(max_iter == data + num_samples) { // empty buffer
return true;
}
float max = *max_iter;
if(max < threshold) {
return true;
}
return false; // At least one value is sufficiently loud.
}
As MSN said, the samples are 32-bit floats. To detect silence you would normally calculate the RMS value: take the average of the squared sample values over some time interval (say 20-50 ms) and compare the square root of this average to a threshold.
The noise inherent in the microphone signal may let single samples reach above the threshold even while the ambient sound would still be considered silence. Averaging over a short interval results in a value that corresponds better to our perception.
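A sketch of that RMS test against the same float buffer as in your function (threshold is a linear RMS level here):

#include <cmath>

bool is_mic_silent_rms(const float* data, unsigned int num_samples, float threshold)
{
    if (num_samples == 0) return true;
    double sum_sq = 0.0;
    for (unsigned int i = 0; i < num_samples; i++)
        sum_sq += (double)data[i] * data[i];        // squaring handles negative samples too
    double rms = std::sqrt(sum_sq / num_samples);
    return rms < threshold;
}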
From here, floating point PCM values are from [-1, 1].
In addition to Han's suggestion to average samples, also consider calibrating your threshold value. In different environments, with different microphones and different audio channels, "silence" can mean a lot of things.
The simple way would be allowing the user to configure the threshold. Alternatively, allow a "noise floor measurement" where you acquire a threshold value.
Note that the samples are linear, but levels in audio processing are usually given in dB. So depending on your target audience, you may want to convert readings and inputs to/from dB.
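If you do convert to and from dB, the usual mapping for these full-scale [-1, 1] samples is the standard formula (not tied to DirectSound in any way):

#include <cmath>

float linear_to_db(float v)  { return 20.0f * std::log10(v); }        // e.g. 0.1 -> -20 dBFS
float db_to_linear(float db) { return std::pow(10.0f, db / 20.0f); }  // e.g. -20 dBFS -> 0.1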