CoreML custom layer: Pixelwise Normalization with Metal Shaders - c++

I'm converting Nvidia's Progressive Growing of GANs' Generator to coreML. I've managed to get everything transferred to coreML with the exception of the Pixelwise Normalization (Lambda) layer, which I plan on implementing as a custom coreML layer in Swift/Metal.
In TensorFlow.Keras, I have implemented pixel norm as
def pixelwise_norm(a):
    return a / tf.sqrt(tf.reduce_mean(a * a, axis=3, keep_dims=True) + 1e-8)
Now, I've barely ever worked with shaders/Metal, but following the instructions here: http://machinethink.net/blog/coreml-custom-layers/, I have a custom layer set up to use Metal for feedforward operations. I am using an MTLComputePipelineState that (calls? encodes?) the following shader for the layer's operations:
#include <metal_stdlib>
using namespace metal;
kernel void pixelwise_norm(
    texture2d_array<half, access::read> inTexture [[texture(0)]],
    texture2d_array<half, access::write> outTexture [[texture(1)]],
    ushort3 gid [[thread_position_in_grid]])
{
    if (gid.x >= outTexture.get_width() ||
        gid.y >= outTexture.get_height()) {
        return;
    }
    const float4 x = float4(inTexture.read(gid.xy, gid.z));
    const float4 y = 0.0000001f + (x / sqrt(pow(x,2)));
    outTexture.write(half4(y), gid.xy, gid.z);
}
I'm having trouble figuring out the Metal equivalent of "reduce_mean"; right now this shader implements something like the TensorFlow operation
return a / tf.sqrt((a * a) + 1e-8)
Does anyone have any pointers?
Thanks

If I'm reading this correctly, for every pixel in the feature map this divides that pixel by the L2 norm over that pixel's channels?
In that case you'll need to use a for loop to read all the channel slices for that pixel, sum up the squared values, and divide by the number of channels to get the mean. (You only need the loop if the number of channels is more than 4.)
Also note that your 1e-8 needs to be inside the sqrt() or at least in the denominator.
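In case it helps, here is a minimal sketch of what that could look like (not a drop-in implementation). It assumes the channel count is exactly 4 * get_array_size(), i.e. no padded channels in the last slice, and every thread redundantly recomputes the same per-pixel mean for its own slice, which wastes some reads but keeps the dispatch identical to the original kernel:
#include <metal_stdlib>
using namespace metal;

kernel void pixelwise_norm(
    texture2d_array<half, access::read> inTexture [[texture(0)]],
    texture2d_array<half, access::write> outTexture [[texture(1)]],
    ushort3 gid [[thread_position_in_grid]])
{
    if (gid.x >= outTexture.get_width() ||
        gid.y >= outTexture.get_height()) {
        return;
    }
    // Mean of the squared activations over all channels at this pixel:
    // each slice of the texture array holds 4 channels, so loop over every slice.
    float sum = 0.0f;
    const uint slices = inTexture.get_array_size();
    for (uint s = 0; s < slices; s++) {
        const float4 v = float4(inTexture.read(gid.xy, s));
        sum += dot(v, v);
    }
    const float mean = sum / float(slices * 4);
    // Normalize this thread's own slice; the 1e-8 sits inside the sqrt,
    // matching a / sqrt(reduce_mean(a*a, axis=3) + 1e-8).
    const float4 x = float4(inTexture.read(gid.xy, gid.z));
    const float4 y = x / sqrt(mean + 1e-8f);
    outTexture.write(half4(y), gid.xy, gid.z);
}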

Related

OpenGL Terrain System, small height difference between GPU and CPU

A quick summary:
I have a simple quad-tree-based terrain rendering system that builds terrain patches which then sample a heightmap in the vertex shader to determine the height of each vertex.
The exact same calculation is done on the CPU for object placement and the like.
Super straightforward, but now after adding some systems to procedurally place objects I've discovered that they seem to be misplaced by just a small amount. To debug this I render a few crosses as single models over the terrain. The crosses (red, green, blue lines) represent the height read on the CPU, while the terrain mesh uses a shader to translate the vertices.
(I've also added a simple odd/even gap over each height value to rule out a simple offset issue. So those ugly cliffs are expected, the submerged crosses are the issue)
I'm explicitly using GL_NEAREST to be able to display the "raw" height value. As you can see, the crosses are sometimes submerged under the terrain instead of representing its exact height.
The heightmap is just a simple array of floats on the CPU and on the GPU.
How the data is stored
A simple vector<float> which is uploaded into a GL_RGB32F GL_FLOAT buffer. The floats are not normalized and my terrain usually contains values between -100 and 500.
How is the data accessed in the shader
I've tried a few things to rule out errors; the initial version:
vec2 terrain_heightmap_uv(vec2 position, Heightmap heightmap)
{
    return (position + heightmap.world_offset) / heightmap.size;
}
float terrain_read_height(vec2 position, Heightmap heightmap)
{
    return textureLod(heightmap.heightmap, terrain_heightmap_uv(position, heightmap), 0).r;
}
Basics of the vertex shader (the full shader code is very long, so I've extracted the part that actually reads the height):
void main()
{
    vec4 world_position = a_model * vec4(a_position, 1.0);
    vec4 final_position = world_position;
    // snap vertex to grid
    final_position.x = floor(world_position.x / a_quad_grid) * a_quad_grid;
    final_position.z = floor(world_position.z / a_quad_grid) * a_quad_grid;
    final_position.y = terrain_read_height(final_position.xz, heightmap);
    gl_Position = projection * view * final_position;
}
To account for the slightly different way the position is determined, I also tested it using hardcoded values that are identical to how the C++ side reads the height:
return texelFetch(heightmap.heightmap, ivec2((position / 8) + vec2(1024, 1024)), 0).r;
Which gives the exact same result...
How is the data accessed in the application
In C++ the height is read like this:
inline float get_local_height_safe(uint32_t x, uint32_t y)
{
    // this macro simply clips x and y to the heightmap bounds
    // it does not interfere with the result
    BB_TERRAIN_HEIGHTMAP_BOUND_XY_TO_SAFE;
    uint32_t i = (y * _size1d) + x;
    return buffer->data[i];
}
inline float get_height_raw(glm::vec2 position)
{
    position = position + world_offset;
    uint32_t x = static_cast<int>(position.x);
    uint32_t y = static_cast<int>(position.y);
    return get_local_height_safe(x, y);
}
float BB::Terrain::get_height(const glm::vec3 position)
{
    return heightmap->get_height_raw({position.x / heightmap_unit_scale, position.z / heightmap_unit_scale});
}
What have I tried:
Comparing the Buffers
I've dumped the first few hundred values from the vector and compared them with the floating-point buffer uploaded to the GPU using Nvidia Nsight; they are equal, no rounding/precision errors there.
Sampling method
I've tried texture, textureLod and texelFetch to rule out some issue there, they all give me the same result.
Rounding
The super strange thing: when I round all the height values, they are perfectly aligned, which just screams floating-point precision issues.
Position snapping
I've tried rounding, flooring and ceiling the position, to ensure the position always maps to the same texel. I also tried adding an epsilon offset to rule out a positional precision error (probably stupid because the terrain is stable...)
Heightmap sizes
I've tried various heightmaps, also of different sizes.
Heightmap patterns
I've created a heightmap containing a pattern to ensure the position is not just offset.

What are the fastest algorithms for rendering the Mandelbrot set?

I've tried many algorithms for rendering the Mandelbrot set, including the naive escape-time algorithm as well as the optimized escape-time algorithm. But are there faster algorithms used to produce really deep zooms efficiently, like the ones we see on YouTube? Also, I would love to get some ideas on how to increase my precision beyond the C/C++ double.
Even a high-end CPU will be much slower than an average GPU. You can get to real-time rendering even with the naive iteration algorithm on a GPU, so using better algorithms on the GPU could get you to high zooms. However, for any decent algorithm you need:
multi-pass rendering, as we cannot self-modify a texture on the GPU
high-precision floating point, as floats/doubles are not enough.
Here are a few related QAs:
GLSL RT Mandelbrot
Interior Distance Estimate algorithm for the Mandelbrot set
I get infinitely small numbers for fractals
Perturbation theory
which might get you kick-started...
One way to speed things up is to use fractional escape like I did in the first link. It improves image quality while keeping the max iteration count low.
The second link will get you an approximation of which parts of the fractal are in and out and how far. It is not very accurate, but it can be used to avoid computing iterations for parts that are "outside for sure".
The next link will show you how to achieve better precision.
The last link is about perturbation. The idea is that you use high-precision math only for some reference points and use them to compute their neighboring points with low-precision math without losing precision. I have never used it myself, but it looks promising.
And finally, once you have achieved fast rendering, you might want to aim for this:
How to adjust panning while zooming Mandelbrot set
Here is a small example of 3 * 64-bit doubles used for a single value in GLSL:
// high precision float (very slow)
dvec3 fnor(dvec3 a)
{
    dvec3 c=a;
    if (abs(c.x)>1e-5){ c.y+=c.x; c.x=0.0; }
    if (abs(c.y)>1e+5){ c.z+=c.y; c.y=0.0; }
    return c;
}
double fget(dvec3 a){ return a.x+a.y+a.z; }
dvec3 fset(double a){ return fnor(dvec3(a,0.0,0.0)); }
dvec3 fadd(dvec3 a,double b){ return fnor(a+fset(b)); }
dvec3 fsub(dvec3 a,double b){ return fnor(a-fset(b)); }
dvec3 fmul(dvec3 a,double b){ return fnor(a*b); }
dvec3 fadd(dvec3 a,dvec3 b){ return fnor(a+b); }
dvec3 fsub(dvec3 a,dvec3 b){ return fnor(a-b); }
dvec3 fmul(dvec3 a,dvec3 b)
{
    dvec3 c;
    c =fnor(a*b.x);
    c+=fnor(a*b.y);
    c+=fnor(a*b.z);
    return fnor(c);
}
So each high-precision value is a dvec3 ... the thresholds in fnor can be changed to any ranges. You can convert this to vec3 and float ...
[Edit1] "fast" C++ example
OK, I wanted to try my new SSD1306 driver along with my AVR32 MCU to compute the Mandelbrot set so I can compare speed with this Arduino + 3D + Pong + Mandelbrot. I used an AT32UC3A3256 at ~66 MHz, with no FPU, no GPU and a 128x64x1bpp display. No external memory, only internal 16+32+32 KByte. The naive Mandelbrot was way too slow (~2.5 sec per frame), so I put together something like this (taking advantage of the fact that the position and zoom of the view are more or less continuous):
1. Reduce the resolution by 2, to make room for dithering, as my output is just B&W.
2. Use a variable max iteration count n based on zoom.
   On a change of n, invalidate the last frame to enforce a full recompute. I know this is slow, but it happens only 3 times, on the transitions between zoom ranges.
   Scaling the counts from the last frame does not look good, as it is not linear.
   It would be possible to reuse the last counts, but that would also require remembering the complex variables used for the iteration, and that would take too much memory.
3. Remember the last frame and also which x,y screen coordinate mapped to which Mandelbrot coordinate.
4. On each frame, compute the mapping between screen coordinates and Mandelbrot coordinates.
5. Remap the last frame to adjust to the new position and zoom.
   Simply look at the data from #3 and #4, and if the same positions are present in both the last and the actual frame (closer than half a pixel size), copy the pixels. Recompute the rest.
This will hugely improve performance if your view is smooth (so the position and zoom do not change a lot on a per-frame basis).
I know it is a bit of a vague description, so here is the C++ code from which you can infer all the details:
//---------------------------------------------------------------------------
//--- Fast Mandelbrot set ver: 1.000 ----------------------------------------
//---------------------------------------------------------------------------
template<int xs,int ys,int sh> void mandelbrot_draw(float mx,float my,float zoom)
{
// xs,ys - screen resolution
// sh - log2(pixel_size) ... dithering pixel size
// mx,my - Mandelbrot position (center of view) <-1.5,+0.5>,<-1.0,+1.0>
// zoom - zoom
// ----------------
// (previous/actual) frame
static U8 p[xs>>sh][ys>>sh]; // intensities (raw Mandelbrot image)
static int n0=0; // max iteraions
static float px[(xs>>sh)+1]={-1000.0}; // pixel x position in Mandlebrot
static float py[(ys>>sh)+1]; // pixel y position in Mandlebrot
// temp variables
U8 shd; // just pattern for dithering
int ix,iy,i,n,jx,jy,kx,ky,sz; // index variables
int nx=xs>>sh,ny=ys>>sh; // real Mandelbrot resolution
float fx,fy,fd; // floating Mandlebrot position and pixel step
float x,y,xx,yy,q; // Mandelbrot iteration stuff (this need to be high precision)
int qx[xs>>sh],qy[ys>>sh]; // maping of pixels between last and actual frame
float px0[xs>>sh],py0[ys>>sh]; // pixel position in Mandlebrot from last frame
// init vars
if (zoom< 10.0) n= 31;
else if (zoom< 100.0) n= 63;
else if (zoom< 1000.0) n=127;
else n=255;
sz=1<<sh;
ix=xs; if (ix>ys) ix=ys; ix/=sz;
fd=2.0/(float(ix-1)*zoom);
mx-=float(xs>>(1+sh))*fd;
my-=float(ys>>(1+sh))*fd;
// init buffers
if ((px[0]<-999.0)||(n0!=n))
{
n0=n;
for (ix=0;ix<nx;ix++) px[ix]=-999.0;
for (iy=0;iy<ny;iy++) py[iy]=-999.0;
for (ix=0;ix<nx;ix++)
for (iy=0;iy<ny;iy++)
p[ix][iy]=0;
}
// store old and compute new float positions of pixels in Mandelbrot to px[],py[],px0[],py0[]
for (fx=mx,ix=0;ix<nx;ix++,fx+=fd){ px0[ix]=px[ix]; px[ix]=fx; qx[ix]=-1; }
for (fy=my,iy=0;iy<ny;iy++,fy+=fd){ py0[iy]=py[iy]; py[iy]=fy; qy[iy]=-1; }
// match old and new x coordinates to qx[]
for (ix=0,jx=0;(ix<nx)&&(jx<nx);)
{
x=px[ix]; y=px0[jx];
xx=(x-y)/fd; if (xx<0.0) xx=-xx;
if (xx<=0.5){ qx[ix]=jx; px[ix]=y; }
if (x<y) ix++; else jx++;
}
// match old and new y coordinates to qy[]
for (ix=0,jx=0;(ix<ny)&&(jx<ny);)
{
x=py[ix]; y=py0[jx];
xx=(x-y)/fd; if (xx<0.0) xx=-xx;
if (xx<=0.5){ qy[ix]=jx; py[ix]=y; }
if (x<y) ix++; else jx++;
}
// remap p[][] by qx[]
for (ix=0,jx=nx-1;ix<nx;ix++,jx--)
{
i=qx[ix]; if ((i>=0)&&(i>=ix)) for (iy=0;iy<ny;iy++) p[ix][iy]=p[i][iy];
i=qx[jx]; if ((i>=0)&&(i<=jx)) for (iy=0;iy<ny;iy++) p[jx][iy]=p[i][iy];
}
// remap p[][] by qy[]
for (iy=0,jy=ny-1;iy<ny;iy++,jy--)
{
i=qy[iy]; if ((i>=0)&&(i>=iy)) for (ix=0;ix<nx;ix++) p[ix][iy]=p[ix][i];
i=qy[jy]; if ((i>=0)&&(i<=jy)) for (ix=0;ix<nx;ix++) p[ix][jy]=p[ix][i];
}
// Mandelbrot
for (iy=0,ky=0,fy=py[iy];iy<ny;iy++,ky+=sz,fy=py[iy]) if ((fy>=-1.0)&&(fy<=+1.0))
for (ix=0,kx=0,fx=px[ix];ix<nx;ix++,kx+=sz,fx=px[ix]) if ((fx>=-1.5)&&(fx<=+0.5))
{
// invalid qx,qy ... recompute Mandelbrot
if ((qx[ix]<0)||(qy[iy]<0))
{
for (x=0.0,y=0.0,xx=0.0,yy=0.0,i=0;(i<n)&&(xx+yy<4.0);i++)
{
q=xx-yy+fx;
y=(2.0*x*y)+fy;
x=q;
xx=x*x;
yy=y*y;
}
i=(16*i)/(n-1); if (i>16) i=16; if (i<0) i=0;
i=16-i; p[ix][iy]=i;
}
// use stored intensity
else i=p[ix][iy];
// render point with intensity i coresponding to ix,iy position in map
for (i<<=3 ,jy=0;jy<sz;jy++)
for (shd=shade8x8[i+(jy&7)],jx=0;jx<sz;jx++)
lcd.pixel(kx+jx,ky+jy,shd&(1<<(jx&7)));
}
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
The lcd and shade8x8 stuff can be found in the linked SSD1306 QA. However, you can ignore it; it's just dithering and outputting a pixel, so you can instead output the i directly (even without the scaling to <0..16>).
Here is a preview (on PC, as I was too lazy to connect a camera ...):
So that is 64x32 Mandelbrot pixels displayed as a 128x64 dithered image. On my AVR32 this is maybe 8x faster than the naive method (maybe 3-4 fps)... The code might be optimized further; however, keep in mind that the Mandelbrot is not the only thing running, as I have some ISR handlers in the background to handle the LCD and also my TTS engine based on this, which I have upgraded a lot since then and use for debugging (yes, it can speak in parallel to the rendering). Also I am low on memory, as my 3D engine takes a lot away, ~11 KByte (mostly the depth buffer).
The preview was done with this code (inside timer):
static float zoom=1.0;
mandelbrot_draw<128,64,1>(+0.37,-0.1,zoom);
zoom*=1.02; if (zoom>100000) zoom=1.0;
Also, for a non-AVR32 C++ environment, use this:
//------------------------------------------------------------------------------------------
#ifndef _AVR32_compiler_h
#define _AVR32_compiler_h
//------------------------------------------------------------------------------------------
#include <stdint.h>     // fixed-width integer types used below
typedef int32_t S32;
typedef int16_t S16;
typedef int8_t S8;
typedef uint32_t U32;
typedef uint16_t U16;
typedef uint8_t U8;
//------------------------------------------------------------------------------------------
#endif
//------------------------------------------------------------------------------------------
[Edit2] higher float precision in GLSL
The main problem with the Mandelbrot set is that it needs to add numbers with a very big exponent difference. For +,- operations we need to align the mantissas of both operands, add them as integers and normalize back to scientific notation. However, if the exponent difference is big, then the resulting mantissa needs more bits than can fit into a 32-bit float, so only the 24 most significant bits are preserved. This creates the rounding errors causing your pixelation. If you look at a 32-bit float in binary you will see this:
float a=1000.000,b=0.00001,c=a+b;
//012345678901234567890123456789 ... just to easy count bits
a=1111101000b                                            // a=1000
b=         0.00000000000000001010011111000101101011b     // b=0.00000999999974737875
c=1111101000.00000000000000001010011111000101101011b     // not rounded result
c=1111101000.00000000000000b                             // c=1000 rounded to 24 bits of mantissa
Now the idea is to enlarge the number of mantissa bits. The easiest trick is to have 2 floats instead of one:
//012345678901234567890123456789 ... just to easy count bits
a=1111101000b                                            // a=1000
b=         0.00000000000000001010011111000101101011b     // b=0.00000999999974737875
c=1111101000.00000000000000b                             // c=1000 rounded to 24 bits of mantissa
 +          .00000000000000001010011111000101101011b     // the rest stored in a second float
So some part of the result is in one float and the rest in the other... The more floats per single value we have, the bigger the mantissa. However, doing this with a bit-exact division of the big mantissa into 24-bit chunks would be complicated and slow in GLSL (if even possible due to GLSL limitations). Instead we can select for each of the floats some range of exponents (just like in the example above).
So in the example we get 3 floats (vec3) per single (float) value. Each of the coordinates represents a different range:
abs(x) <= 1e-5
1e-5 < abs(y) <= 1e+5
1e+5 < abs(z)
and value = (x+y+z), so we kind of have a 3*24-bit mantissa; however, the ranges do not exactly match 24 bits. For that, the exponent range should be divided by:
log10(2^24)=7.2247198959355486851297334733878
instead of 10... for example something like this:
abs(x) <= 1e-7
1e-7 < abs(y) <= 1e+0
1e+0 < abs(z)
Also the ranges must be selected so they cover the range of values you use, otherwise it would be for nothing. So if your numbers are <4, it is pointless to have a range >10^+5. So first you need to see what bounds of values you have, then dissect them into exponent ranges (as many as you have floats per value).
Beware some rounding still occurs (but much less than with a native float)!!!
Now doing operations on such numbers is slightly more complicated than on normal floats, as you need to handle each value as a bracketed sum of all components, so:
(x0+y0+z0) + (x1+y1+z1) = (x0+x1 + y0+y1 + z0+z1)
(x0+y0+z0) - (x1+y1+z1) = (x0-x1 + y0-y1 + z0-z1)
(x0+y0+z0) * (x1+y1+z1) = x0*(x1+y1+z1) + y0*(x1+y1+z1) + z0*(x1+y1+z1)
And do not forget to normalize the values back to the defined ranges. Avoid adding small and big (abs) values, so avoid x0+z0, etc...
[Edit3] new win32 demo CPU vs. GPU
win32 Mandelbrot demo 64bit floats
Both executables are preset to the same location and zoom to show where the doubles start to round off. I had to slightly upgrade the way the px,py coordinates are computed, as around 10^9 the y axis started to deviate at this location (the threshold might still be too big for other locations).
Here preview CPU vs. GPU for high zoom (n=1600):
RT GIF capture of CPU (n=62++, GIF 4x scaled down):
The optimized escape algorithm should be fast enough to draw the Mandelbrot set in real time. You can use multiple threads so that your implementation will be faster (this is very easy using OpenMP, for example). You can also manually vectorize your code using SIMD instructions to make it even faster if needed. You could even run this directly on the GPU using either shaders and/or GPU computing frameworks (OpenCL or CUDA) if this is still not fast enough for you (although this is a bit complex to do efficiently). Finally, you should tune the number of iterations so it is rather small.
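To illustrate the multithreading point, here is a minimal sketch (not the answerer's code) of a naive escape-time loop parallelized over image rows with OpenMP; the image size, iteration limit and view rectangle are made-up parameters:
#include <vector>

// Naive escape-time rendering, parallelized across rows with OpenMP.
// Returns the iteration count for each pixel.
std::vector<int> render(int width, int height, int max_iter,
                        double x0, double y0, double x1, double y1)
{
    std::vector<int> iters(width * height);
    #pragma omp parallel for schedule(dynamic)
    for (int py = 0; py < height; py++)
    {
        const double cy = y0 + (y1 - y0) * py / (height - 1);
        for (int px = 0; px < width; px++)
        {
            const double cx = x0 + (x1 - x0) * px / (width - 1);
            double x = 0.0, y = 0.0, xx = 0.0, yy = 0.0;
            int i = 0;
            for (; (i < max_iter) && (xx + yy < 4.0); i++)
            {
                y = (2.0 * x * y) + cy;   // imaginary part of z^2 + c (uses the old x)
                x = xx - yy + cx;         // real part of z^2 + c
                xx = x * x;
                yy = y * y;
            }
            iters[py * width + px] = i;
        }
    }
    return iters;
}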
Zooming should not have any direct impact on the performance. It just changes the input window of the computation. However, it does have an indirect impact since the actual number of iterations will change. Points outside the window should not be computed.
Double precision should also be enough for drawing the Mandelbrot set correctly. But if you really want more precise calculations, you can use double-double precision, which gives quite good precision and not too bad performance. However, implementing double-double precision manually is a bit tricky, and it is still significantly slower than using just double precision.
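For the double-double idea, here is a minimal sketch (illustrative only, not a full implementation): each value is stored as an unevaluated sum hi + lo of two doubles, and error-free transformations recover the bits a plain double addition or multiplication would drop. A complete library would also need proper renormalization and a full dd*dd product built from two_prod.
#include <cmath>

// A double-double value: an unevaluated sum hi + lo of two doubles,
// giving roughly twice the precision of a plain double.
struct dd { double hi, lo; };

// Error-free sum of two doubles (Knuth's TwoSum).
static dd two_sum(double a, double b)
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return { s, e };
}

// Error-free product of two doubles, using FMA to recover the rounding error.
static dd two_prod(double a, double b)
{
    double p = a * b;
    double e = std::fma(a, b, -p);
    return { p, e };
}

// Simplified double-double addition (no full renormalization).
static dd dd_add(dd a, dd b)
{
    dd s = two_sum(a.hi, b.hi);
    double lo = s.lo + a.lo + b.lo;
    double hi = s.hi + lo;              // quick two-sum to renormalize
    return { hi, lo - (hi - s.hi) };
}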
My fastest solutions avoid iterating over large areas of the same depth by following a contour boundary and filling. There is a penalty that it is possible to nip off small buds instead of going around them, but all-in-all a small price to pay for a quick zoom.
One possible efficiency is that if a zoom doubles the scale, you already have ¼ of the points.
For animation, I file each frame's values, doubling the scale each time, and interpolate the in-between frames on playback in real time, so the animation doubles once per second. The double type allows more than 50 key frames to be stored, giving an animation that lasts more than a minute (in and then back out).
The actual iteration is done by hand-crafted assembler, so one pixel is iterated entirely in the FPU.

casting gl_VertexID from int to float very slow

I am rendering an octree that contains points to an FBO.
I want a way to identify the points I am rendering.
To do so, I set an ID on each of the octree nodes (a 16-bit integer), and I use gl_VertexID to identify a point within a node (no more than 65k points per node).
I output this to an RGBA texture with the octree node identifier written to the rg color components and the vertex ID written to the ba color components.
vec4 getIdColor() {
    float r = mod(nodeID, 256.0) / 255.0;
    float g = (nodeID / 256.0) / 255.0;
    float b = mod(gl_VertexID, 256.0) / 255.0;
    float a = (gl_VertexID / 256.0) / 255.0;
    return vec4(r, g, b, a);
}
The problem is that the gl_VertexID cast from int to float is really slow (I go from 60FPS to 2-3 FPS when rendering 2 million points).
EDIT: I also have the exact same problem when just using gl_VertexID. If I remove the mods and just write
return vec4(gl_VertexID);
I get the same hit on the framerate. So the problem comes from gl_VertexID, not the mod.
Is there a workaround (and also, what causes this)?
I found the problem. In the shader, I was using an if/else cascade (I know it's not good practice, but it was a test shader).
It seems that I went over some cache size. Generating the shader code on the fly, with only the sections whose conditions evaluate to true, fixed the issue. It was both the number of conditions and the access to gl_VertexID that slowed the rendering down.
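As a rough sketch of that approach (not the poster's actual code; the flag and the GLSL strings here are made up for illustration), the idea is to assemble the shader source in C++ so that only the branch that actually applies ever reaches the compiler:
#include <string>

// Build a fragment shader variant containing only the code path we need,
// instead of one big shader with an if/else cascade.
std::string buildIdShaderSource(bool writeIds)
{
    std::string src =
        "#version 330 core\n"
        "flat in int vVertexID;\n"      // passed down from the vertex shader
        "uniform float nodeID;\n"
        "out vec4 fragColor;\n"
        "void main() {\n";
    if (writeIds)
        src +=
            "    float b = mod(float(vVertexID), 256.0) / 255.0;\n"
            "    float a = floor(float(vVertexID) / 256.0) / 255.0;\n"
            "    fragColor = vec4(mod(nodeID, 256.0) / 255.0,\n"
            "                     floor(nodeID / 256.0) / 255.0, b, a);\n";
    else
        src += "    fragColor = vec4(1.0);\n";
    src += "}\n";
    return src;   // compile with glShaderSource / glCompileShader as usual
}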

Physically based camera values too small

I am currently working on a physically based camera model and came across this blog: https://placeholderart.wordpress.com/2014/11/21/implementing-a-physically-based-camera-manual-exposure/
So I tried to implement it myself in OpenGL. I thought of calculating the exposure using the function getSaturationBasedExposure and passing that value to a shader, where I multiply the final color by that value:
float getSaturationBasedExposure(float aperture,
                                 float shutterSpeed,
                                 float iso)
{
    float l_max = (7800.0f / 65.0f) * Sqr(aperture) / (iso * shutterSpeed);
    return 1.0f / l_max;
}
colorOut = color * exposure;
But the values I get from that function are way too small (around 0.00025 etc.), so I guess I am misunderstanding the returned value of that function.
In the blog a test scene is mentioned in which the scene luminance is around 4000, but I haven't seen a shader implementation working with a color range from 0 to 4000+ (not even HDR goes that high, right?).
So could anyone explain to me how to apply the calculations correctly to an OpenGL scene, or help me understand the meaning behind the calculations?

Blur on Windows Phone 8 too slow

I'm implementing a blur effect on Windows Phone using native C++ with DirectX, but it looks like even the simplest blur with a small kernel causes a visible FPS drop.
float4 main(PixelShaderInput input) : SV_TARGET
{
    float4 source = screen.Sample(LinearSampler, input.texcoord);
    float4 sum = float4(0, 0, 0, 0);
    float2 sizeFactor = float2(0.00117, 0.00208);
    for (int x = -2; x <= 2; x++)
    {
        float2 offset = float2(x, 0) * sizeFactor;
        sum += screen.Sample(LinearSampler, input.texcoord + offset);
    }
    return ((sum / 5) + source);
}
I'm currently using this pixel shader for a 1D blur, and it's visibly slower than without the blur. Is it really the case that WP8 phone hardware is that slow, or am I making some mistake? If so, could you point me to where to look for the error?
Thank you.
Phones often don't have the best fill-rate, and blur is one of the worst things you can do if you're fill-rate bound. Using some numbers from gfxbench.com's Fill test, a typical phone fill rate is around 600MTex/s. With some rough math:
(600M texels/s) / (1280*720 texels/op) / (60 frames/s) ~= 11 ops/frame
So in your loop, if your surface is the entire screen, and you're doing 5 reads and 1 write, that's 6 of your 11 ops used, just for the blur. So I would say a framerate drop is expected. One way around this is to dynamically lower your resolution, and do a single linear upscale - you'll get a different kind of natural blur from the linear interpolation, which might be passable depending on the visual effect you're going for.