Drawing circle, OpenGL style

Drawing circle, OpenGL style - c++

I have a 13 x 13 array of pixels, and I am using a function to draw a circle onto them. (The screen is 13 * 13, which may seem strange, but its an array of LED's so that explains it.)
unsigned char matrix[13][13];
const unsigned char ON = 0x01;
const unsigned char OFF = 0x00;
Here is the first implementation I thought up. (It's inefficient, which is a particular problem as this is an embedded systems project, 80 MHz processor.)
// Draw a circle
// mode is 'ON' or 'OFF'
inline void drawCircle(float rad, unsigned char mode)
{
for(int ix = 0; ix < 13; ++ ix)
{
for(int jx = 0; jx < 13; ++ jx)
{
float r; // Radial
float s; // Angular ("theta")
matrix_to_polar(ix, jx, &r, &s); // Converts polar coordinates
// specified by r and s, where
// s is the angle, to index coordinates
// specified by ix and jx.
// This function just converts to
// cartesian and then translates by 6.0.
if(r < rad)
{
matrix[ix][jx] = mode; // Turn pixel in matrix 'ON' or 'OFF'
}
}
}
}
I hope that's clear. It's pretty simple, but then I programmed it so I know how it's supposed to work. If you'd like more info / explanation then I can add some more code / comments.
It can be considered that drawing several circles, eg 4 to 6, is very slow... Hence I'm asking for advice on a more efficient algorithm to draw the circles.
EDIT: Managed to double the performance by making the following modification:
The function calling the drawing used to look like this:
for(;;)
{
clearAll(); // Clear matrix
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] += rad_incr_step;
drawRing(rad[ix], rad[ix] - rad_width);
}
if(rad[5] >= 7.0)
{
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] = rad_space_step * (float)(-ix);
}
}
writeAll(); // Write
}
I added the following check:
if(rad[ix] - rad_width < 7.0)
drawRing(rad[ix], rad[ix] - rad_width);
This increased the performance by a factor of about 2, but ideally I'd like to make the circle drawing more efficient to increase it further. This checks to see if the ring is completely outside of the screen.
EDIT 2: Similarly adding the reverse check increased performance further.
if(rad[ix] >= 0.0)
drawRing(rad[ix], rad[ix] - rad_width);
Performance is now pretty good, but again I have made no modifications to the actual drawing code of the circles and this is what I was intending to focus on with this question.
Edit 3: Matrix to polar:
inline void matrix_to_polar(int i, int j, float* r, float* s)
{
float x, y;
matrix_to_cartesian(i, j, &x, &y);
calcPolar(x, y, r, s);
}
inline void matrix_to_cartesian(int i, int j, float* x, float* y)
{
*x = getX(i);
*y = getY(j);
}
inline void calcPolar(float x, float y, float* r, float* s)
{
*r = sqrt(x * x + y * y);
*s = atan2(y, x);
}
inline float getX(int xc)
{
return (float(xc) - 6.0);
}
inline float getY(int yc)
{
return (float(yc) - 6.0);
}
In response to Clifford that's actually a lot of function calls if they are not inlined.
Edit 4: drawRing just draws 2 circles, firstly an outer circle with mode ON and then an inner circle with mode OFF. I am fairly confident that there is a more efficient method of drawing such a shape too, but that distracts from the question.

You're doing a lot of calculations that aren't really needed. For example, you're calculating the angle of the polar coordinates, but never use it. The square root can also easily be avoided by comparing the square of the values.
Without doing anything fancy, something like this should be a good start:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; jx <= intRad; ++jx)
{
if (ix * ix + jx * jx <= radSqr)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}
}
This does all the math in integer format, and takes advantage of the circle symmetry.
Variation of the above, based on feedback in the comments:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; ix * ix + jx * jx <= radSqr; ++jx)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}

Don't underestimate the cost of even basic arithmetic using floating point on a processor with no FPU. It seems unlikely that floating point is necessary, but the details of its use are hidden in your matrix_to_polar() implementation.
Your current implementation considers every pixel as a candidate - that is also unnecessary.
Using the equation y = cy ± √[rad2 - (x-cx)2] where cx, cy is the centre (7, 7 in this case), and a suitable integer square root implementation, the circle can be drawn thus:
void drawCircle( int rad, unsigned char mode )
{
int r2 = rad * rad ;
for( int x = 7 - rad; x <= 7 + rad; x++ )
{
int dx = x - 7 ;
int dy = isqrt( r2 - dx * dx ) ;
matrix[x][7 - dy] = mode ;
matrix[x][7 + dy] = mode ;
}
}
In my test I used the isqrt() below based on code from here, but given that the maximum r2 necessary is 169 (132, you could implement a 16 or even 8 bit optimised version if necessary. If your processor is 32 bit, this is probably fine.
uint32_t isqrt(uint32_t n)
{
uint32_t root = 0, bit, trial;
bit = (n >= 0x10000) ? 1<<30 : 1<<14;
do
{
trial = root+bit;
if (n >= trial)
{
n -= trial;
root = trial+bit;
}
root >>= 1;
bit >>= 2;
} while (bit);
return root;
}
All that said, on such a low resolution device, you will probably get better quality circles and faster performance by hand generating bitmap lookup tables for each radius required. If memory is an issue, then a single circle needs only 7 bytes to describe a 7 x 7 quadrant that you can reflect to all three quadrants, or for greater performance you could use 7 x 16 bit words to describe a semi-circle (since reversing bit order is more expensive than reversing array access - unless you are using an ARM Cortex-M with bit-banding). Using semi-circle look-ups, 13 circles would need 13 x 7 x 2 bytes (182 bytes), quadrant look-ups would be 7 x 8 x 13 (91 bytes) - you may find that is fewer bytes that the code space required to calculate the circles.

For a slow embedded device with only a 13x13 element display, you should really just make a look-up table. For example:
struct ComputedCircle
{
float rMax;
char col[13][2];
};
Where the draw routine uses rMax to determine which LUT element to use. For example, if you have 2 elements with one rMax = 1.4f, the other = 1.7f, then any radius between 1.4f and 1.7f will use that entry.
The column elements would specify zero, one, or two line segments per row, which can be encoded in the lower and upper 4 bits of each char. -1 can be used as a sentinel value for nothing-at-this-row. It is up to you how many look-up table entries to use, but with a 13x13 grid you should be able to encode every possible outcome of pixels with well under 100 entries, and a reasonable approximation using only 10 or so. You can also trade off compression for draw speed as well, e.g. putting the col[13][2] matrix in a flat list and encoding the number of rows defined.

I would accept MooseBoy's answer if only he explained the method he proposes better. Here's my take on the lookup table approach.
Solve it with a lookup table
The 13x13 display is quite small, and if you only need circles which are fully visible within this pixel count, you will get around with a quite small table. Even if you need larger circles, it should be still better than any algorithmic way if you need it to be fast (and have the ROM to store it).
How to do it
You basically need to define how each possible circle looks like on the 13x13 display. It is not sufficient to just produce snapshots for the 13x13 display, as it is likely you would like to plot the circles at arbitrary positions. My take for a table entry would look like this:
struct circle_entry_s{
unsigned int diameter;
unsigned int offset;
};
The entry would map a given diameter in pixels to offsets in a large byte table containing the shape of the circles. For example for diameter 9, the byte sequence would look like this:
0x1CU, 0x00U, /* 000111000 */
0x63U, 0x00U, /* 011000110 */
0x41U, 0x00U, /* 010000010 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x41U, 0x00U, /* 010000010 */
0x63U, 0x00U, /* 011000110 */
0x1CU, 0x00U, /* 000111000 */
The diameter specifies how many bytes of the table belong to the circle: one row of pixels are generated from (diameter + 7) >> 3 bytes, and the number of rows correspond to the diameter. The output code of these can be made quite fast, while the lookup table is sufficiently compact to get even larger than the 13x13 display circles defined in it if needed.
Note that defining circles this way for odd and even diameters may or may not appeal you when output by a centre location. The odd diameter circles will appear to have a centre in the "middle" of a pixel, while the even diameter circles will appear to have their centre on the "corner" of a pixel.
You may also find it nice later to refine the overall method so having multiple circles of different apparent sizes, but having the same pixel radius. Depends on what is your goal: if you want some kind of smooth animation, you may get there eventually.
Algorithmic solutions I think mostly will perform poorly here, since with this limited display surface really every pixel's state counts for the appearance.

Related

FFT Spectrum not displaying correctly

I'm currently trying to display an audio spectrum using FFTW3 and SFML. I've followed the directions found here and looked at numerous references on FFT and spectrums and FFTW yet somehow my bars are almost all aligned to the left like below. Another issue I'm having is I can't find information on what the scale of the FFT output is. Currently I'm dividing it by 64 yet it still reaches beyond that occasionally. And further still I have found no information on why the output of the from FFTW has to be the same size as the input. So my questions are:
Why is the majority of my spectrum aligned to the left unlike the image below mine?
Why isn't the output between 0.0 and 1.0?
Why is the input sample count related to the fft output count?
What I get:
What I'm looking for:
const int bufferSize = 256 * 8;
void init() {
sampleCount = (int)buffer.getSampleCount();
channelCount = (int)buffer.getChannelCount();
for (int i = 0; i < bufferSize; i++) {
window.push_back(0.54f - 0.46f * cos(2.0f * GMath::PI * (float)i / (float)bufferSize));
}
plan = fftwf_plan_dft_1d(bufferSize, signal, results, FFTW_FORWARD, FFTW_ESTIMATE);
}
void update() {
int mark = (int)(sound.getPlayingOffset().asSeconds() * sampleRate);
for (int i = 0; i < bufferSize; i++) {
float s = 0.0f;
if (i + mark < sampleCount) {
s = (float)buffer.getSamples()[(i + mark) * channelCount] / (float)SHRT_MAX * window[i];
}
signal[i][0] = s;
signal[i][1] = 0.0f;
}
}
void draw() {
int inc = bufferSize / 2 / size.x;
int y = size.y - 1;
int max = size.y;
for (int i = 0; i < size.x; i ++) {
float total = 0.0f;
for (int j = 0; j < inc; j++) {
int index = i * inc + j;
total += std::sqrt(results[index][0] * results[index][0] + results[index][1] * results[index][1]);
}
total /= (float)(inc * 64);
Rectangle2I rect = Rectangle2I(i, y, 1, -(int)(total * max)).absRect();
g->setPixel(rect, Pixel(254, toColor(BLACK, GREEN)));
}
}

All of your questions are related to the FFT theory. Study the properties of FFT from any standard text/reference book and you will be able to answer your questions all by yourself only.
The least you can start from is here:
https://en.wikipedia.org/wiki/Fast_Fourier_transform.

Many FFT implementations are energy preserving. That means the scale of the output is linearly related to the scale and/or size of the input.
An FFT is a DFT is a square matrix transform. So the number of outputs will always be equal to the number of inputs (or half that by ignoring the redundant complex conjugate half given strictly real input), unless some outputs are thrown away. If not, it's not an FFT. If you want less outputs, there are ways to downsample the FFT output or post process it in other ways.

Weird but close fft and ifft of image in c++

I wrote a program that loads, saves, and performs the fft and ifft on black and white png images. After much debugging headache, I finally got some coherent output only to find that it distorted the original image.
input:
fft:
ifft:
As far as I have tested, the pixel data in each array is stored and converted correctly. Pixels are stored in two arrays, 'data' which contains the b/w value of each pixel and 'complex_data' which is twice as long as 'data' and stores real b/w value and imaginary parts of each pixel in alternating indices. My fft algorithm operates on an array structured like 'complex_data'. After code to read commands from the user, here's the code in question:
if (cmd == "fft")
{
if (height > width) size = height;
else size = width;
N = (int)pow(2.0, ceil(log((double)size)/log(2.0)));
temp_data = (double*) malloc(sizeof(double) * width * 2); //array to hold each row of the image for processing in FFT()
for (i = 0; i < (int) height; i++)
{
for (j = 0; j < (int) width; j++)
{
temp_data[j*2] = complex_data[(i*width*2)+(j*2)];
temp_data[j*2+1] = complex_data[(i*width*2)+(j*2)+1];
}
FFT(temp_data, N, 1);
for (j = 0; j < (int) width; j++)
{
complex_data[(i*width*2)+(j*2)] = temp_data[j*2];
complex_data[(i*width*2)+(j*2)+1] = temp_data[j*2+1];
}
}
transpose(complex_data, width, height); //tested
free(temp_data);
temp_data = (double*) malloc(sizeof(double) * height * 2);
for (i = 0; i < (int) width; i++)
{
for (j = 0; j < (int) height; j++)
{
temp_data[j*2] = complex_data[(i*height*2)+(j*2)];
temp_data[j*2+1] = complex_data[(i*height*2)+(j*2)+1];
}
FFT(temp_data, N, 1);
for (j = 0; j < (int) height; j++)
{
complex_data[(i*height*2)+(j*2)] = temp_data[j*2];
complex_data[(i*height*2)+(j*2)+1] = temp_data[j*2+1];
}
}
transpose(complex_data, height, width);
free(temp_data);
free(data);
data = complex_to_real(complex_data, image.size()/4); //tested
image = bw_data_to_vector(data, image.size()/4); //tested
cout << "*** fft success ***" << endl << endl;
void FFT(double* data, unsigned long nn, int f_or_b){ // f_or_b is 1 for fft, -1 for ifft
unsigned long n, mmax, m, j, istep, i;
double wtemp, w_real, wp_real, wp_imaginary, w_imaginary, theta;
double temp_real, temp_imaginary;
// reverse-binary reindexing to separate even and odd indices
// and to allow us to compute the FFT in place
n = nn<<1;
j = 1;
for (i = 1; i < n; i += 2) {
if (j > i) {
swap(data[j-1], data[i-1]);
swap(data[j], data[i]);
}
m = nn;
while (m >= 2 && j > m) {
j -= m;
m >>= 1;
}
j += m;
};
// here begins the Danielson-Lanczos section
mmax = 2;
while (n > mmax) {
istep = mmax<<1;
theta = f_or_b * (2 * M_PI/mmax);
wtemp = sin(0.5 * theta);
wp_real = -2.0 * wtemp * wtemp;
wp_imaginary = sin(theta);
w_real = 1.0;
w_imaginary = 0.0;
for (m = 1; m < mmax; m += 2) {
for (i = m; i <= n; i += istep) {
j = i + mmax;
temp_real = w_real * data[j-1] - w_imaginary * data[j];
temp_imaginary = w_real * data[j] + w_imaginary * data[j-1];
data[j-1] = data[i-1] - temp_real;
data[j] = data[i] - temp_imaginary;
data[i-1] += temp_real;
data[i] += temp_imaginary;
}
wtemp = w_real;
w_real += w_real * wp_real - w_imaginary * wp_imaginary;
w_imaginary += w_imaginary * wp_real + wtemp * wp_imaginary;
}
mmax=istep;
}}
My ifft is the same only with the f_or_b set to -1 instead of 1. My program calls FFT() on each row, transposes the image, calls FFT() on each row again, then transposes back. Is there maybe an error with my indexing?

Not an actual answer as this question is Debug only so some hints instead:
your results are really bad
it should look like this:
first line is the actual DFFT result
Re,Im,Power is amplified by a constant otherwise you would see a black image
the last image is IDFFT of the original not amplified Re,IM result
the second line is the same but the DFFT result is wrapped by half size of image in booth x,y to match the common results in most DIP/CV texts
As you can see if you IDFFT back the wrapped results the result is not correct (checker board mask)
You have just single image as DFFT result
is it power spectrum?
or you forget to include imaginary part? to view only or perhaps also to computation somewhere as well?
is your 1D **DFFT working?**
for real data the result should be symmetric
check the links from my comment and compare the results for some sample 1D array
debug/repair your 1D FFT first and only then move to the next level
do not forget to test Real and complex data ...
your IDFFT looks BW (no gray) saturated
so did you amplify the DFFT results to see the image and used that for IDFFT instead of the original DFFT result?
also check if you do not round to integers somewhere along the computation
beware of (I)DFFT overflows/underflows
If your image pixel intensities are big and the resolution of image too then your computation could loss precision. Newer saw this in images but if your image is HDR then it is possible. This is a common problem with convolution computed by DFFT for big polynomials.

Thank you everyone for your opinions. All that stuff about memory corruption, while it makes a point, is not the root of the problem. The sizes of data I'm mallocing are not overly large, and I am freeing them in the right places. I had a lot of practice with this while learning c. The problem was not the fft algorithm either, nor even my 2D implementation of it.
All I missed was the scaling by 1/(M*N) at the very end of my ifft code. Because the image is 512x512, I needed to scale my ifft output by 1/(512*512). Also, my fft looks like white noise because the pixel data was not rescaled to fit between 0 and 255.

Suggest you look at the article http://www.yolinux.com/TUTORIALS/C++MemoryCorruptionAndMemoryLeaks.html
Christophe has a good point but he is wrong about it not being related to the problem because it seems that in modern times using malloc instead of new()/free() does not initialise memory or select best data type which would result in all problems listed below:-
Possibly causes are:
Sign of a number changing somewhere, I have seen similar issues when a platform invoke has been used on a dll and a value is passed by value instead of reference. It is caused by memory not necessarily being empty so when your image data enters it will have boolean maths performed on its values. I would suggest that you make sure memory is empty before you put your image data there.
Memory rotating right (ROR in assembly langauge) or left (ROL) . This will occur if data types are being used which do not necessarily match, eg. a signed value entering an unsigned data type or if the number of bits is different in one variable to another.
Data being lost due to an unsigned value entering a signed variable. Outcomes are 1 bit being lost because it will be used to determine negative or positive, or at extremes if twos complement takes place the number will become inverted in meaning, look for twos complement on wikipedia.
Also see how memory should be cleared/assigned before use. http://www.cprogramming.com/tutorial/memory_debugging_parallel_inspector.html

why are Cubic Bezier functions not accurate compared too windows api PolyBezier?

I have been trying to find a way to draw a curved line/cubic bezier line using a custom function. However, all the examples and such found on the internet, differ a little from each other and usually produce different results, why? . None of the ones i have tried produce the same result as windows api PolyBezier which is what i need.
This is my current code for drawing cubic bezier lines:
double Factorial(int number)
{
double factorial = 1;
if (number > 1)
{
for (int count = 1; count <= number; count++) factorial = factorial * count;
}
return factorial;
}
double choose(double a, double b)
{
return Factorial(a) / (Factorial(b) * Factorial(a - b));
}
VOID MyPolyBezier(HDC hdc, PPOINT Pts, int Total)
{
float x, y;
MoveToEx(hdc, Pts[0].x, Pts[0].y, 0);
Total -= 1;
//for (float t = 0; t <= 1; t += (1./128.))
for (float t = 0; t <= 1; t += 0.0078125)
{
x = 0;
y = 0;
for (int I = 0; I <= Total; I++)
{
x += Pts[I].x * choose(Total, I) * pow(1 - t, Total - I) * pow(t, I);
y += Pts[I].y * choose(Total, I) * pow(1 - t, Total - I) * pow(t, I);
}
LineTo(hdc, x, y);
}
}
And here is the code for testing it.
POINT TestPts[4];
BYTE TestType[4] = {PT_MOVETO, PT_BEZIERTO, PT_BEZIERTO, PT_BEZIERTO};
//set x, y points for the curved line.
TestPts[0].x = 50;
TestPts[0].y = 200;
TestPts[1].x = 100;
TestPts[1].y = 100;
TestPts[2].x = 150;
TestPts[2].y = 200;
TestPts[3].x = 200;
TestPts[3].y = 200;
//Draw using custom function.
MyPolyBezier(hdc, TestPts, 4);
//Move the curve down some.
TestPts[0].y += 10;
TestPts[1].y += 10;
TestPts[2].y += 10;
TestPts[3].y += 10;
//Draw using windows api.
//PolyDraw(hdc, TestPts, TestType, 4); //PolyDraw gives the same result as PolyBezier.
PolyBezier(hdc, TestPts, 4);
And an attached image of my bad results:
Note: the bottom bezier line is windows(PolyBezier) version.
Edit:
the final goal, Windows(On the left) VS custom funtion. Hopefully this helps in some way.

So a cubic bezier is a mathematical curve. The cubic bezier is a specific case of a more general curve.
The cubic bezier is defined by 4 control points -- a start and end point, and 2 control points. In general, a bezier has n control points in order.
The line is drawn as a time parameter t goes from 0 to 1.
To find out where a general bezier of degree n is at time t:
For each adjacent pair of control points in your bezier, find the weighted average of them, as controlled by t. So at + b(1-t) for control points a before b.
Use these n-1 points to form a degree n-1 bezier.
Solve the new bezier at time t.
when you hit a degree 1 bezier, stop. That is your point.
Try writing an algorithm based off the true definition of bezier, and see where it differs from the windows curve. This may ne less frustrating than taking some approximation and having two sets of errors to reconcile.

C++AMP Computing gradient using texture on a 16 bit image

I am working with depth images retrieved from kinect which are 16 bits. I found some difficulties on making my own filters due to the index or the size of the images.
I am working with Textures because allows to work with any bit size of images.
So, I am trying to compute an easy gradient to understand what is wrong or why it doesn't work as I expected.
You can see that there is something wrong when I use y dir.
For x:
For y:
That's my code:
typedef concurrency::graphics::texture<unsigned int, 2> TextureData;
typedef concurrency::graphics::texture_view<unsigned int, 2> Texture
cv::Mat image = cv::imread("Depth247.tiff", CV_LOAD_IMAGE_ANYDEPTH);
//just a copy from another image
cv::Mat image2(image.clone() );
concurrency::extent<2> imageSize(640, 480);
int bits = 16;
const unsigned int nBytes = imageSize.size() * 2; // 614400
{
uchar* data = image.data;
// Result data
TextureData texDataD(imageSize, bits);
Texture texR(texDataD);
parallel_for_each(
imageSize,
[=](concurrency::index<2> idx) restrict(amp)
{
int x = idx[0];
int y = idx[1];
// 65535 is the maxium value that can take a pixel with 16 bits (2^16 - 1)
int valX = (x / (float)imageSize[0]) * 65535;
int valY = (y / (float)imageSize[1]) * 65535;
texR.set(idx, valX);
});
//concurrency::graphics::copy(texR, image2.data, imageSize.size() *(bits / 8u));
concurrency::graphics::copy_async(texR, image2.data, imageSize.size() *(bits) );
cv::imshow("result", image2);
cv::waitKey(50);
}
Any help will be very appreciated.

Your indexes are swapped in two places.
int x = idx[0];
int y = idx[1];
Remember that C++AMP uses row-major indices for arrays. Thus idx[0] refers to row, y axis. This is why the picture you have for "For x" looks like what I would expect for texR.set(idx, valY).
Similarly the extent of image is also using swapped values.
int valX = (x / (float)imageSize[0]) * 65535;
int valY = (y / (float)imageSize[1]) * 65535;
Here imageSize[0] refers to the number of columns (the y value) not the number of rows.
I'm not familiar with OpenCV but I'm assuming that it also uses a row major format for cv::Mat. It might invert the y axis with 0, 0 top-left not bottom-left. The Kinect data may do similar things but again, it's row major.
There may be other places in your code that have the same issue but I think if you double check how you are using index and extent you should be able to fix this.

Applying Matrix To Image, Seeking Performance Improvements

Edited: Working on Windows platform.
Problem: Less of a problem, more about advise. I'm currently not incredibly versed in low-level program, but I am attempting to optimize the code below in an attempt to increase the performance of my overall code. This application depends on extremely high speed image processing.
Current Performance: On my computer, this currently computes at about 4-6ms for a 512x512 image. I'm trying to cut that in half if possible.
Limitations: Due to this projects massive size, fundamental changes to the application are very difficult to do, so things such as porting to DirectX or other GPU methods isn't much of an option. The project currently works, I'm simply trying to figure out how to make it work faster.
Specific information about my use for this: Images going into this method are always going to be exactly square and some increment of 128. (Most likely 512 x 512) and they will always come out the same size. Other than that, there is not much else to it. The matrix is calculated somewhere else, so this is just the applying of the matrix to my image. The original image and the new image are both being used, so copying the image is necessary.
Here is my current implementation:
void ReprojectRectangle( double *mpProjMatrix, unsigned char *pDstScan0, unsigned char *pSrcScan0,
int NewBitmapDataStride, int SrcBitmapDataStride, int YOffset, double InversedAspect, int RectX, int RectY, int RectW, int RectH)
{
int i, j;
double Xnorm, Ynorm;
double Ynorm_X_ProjMatrix4, Ynorm_X_ProjMatrix5, Ynorm_X_ProjMatrix7;;
double SrcX, SrcY, T;
int SrcXnt, SrcYnt;
int SrcXec, SrcYec, SrcYnvDec;
unsigned char *pNewPtr, *pSrcPtr1, *pSrcPtr2, *pSrcPtr3, *pSrcPtr4;
int RectX2, RectY2;
/* Compensate (or re-center) the Y-coordinate regarding the aspect ratio */
RectY -= YOffset;
/* Compute the second point of the rectangle for the loops */
RectX2 = RectX + RectW;
RectY2 = RectY + RectH;
/* Clamp values (be careful with aspect ratio */
if (RectY < 0) RectY = 0;
if (RectY2 < 0) RectY2 = 0;
if ((double)RectY > (InversedAspect * 512.0)) RectY = (int)(InversedAspect * 512.0);
if ((double)RectY2 > (InversedAspect * 512.0)) RectY2 = (int)(InversedAspect * 512.0);
/* Iterate through each pixel of the scaled re-Proj */
for (i=RectY; i<RectY2; i++)
{
/* Normalize Y-coordinate and take the ratio into account */
Ynorm = InversedAspect - (double)i / 512.0;
/* Pre-compute some matrix coefficients */
Ynorm_X_ProjMatrix4 = Ynorm * mpProjMatrix[4] + mpProjMatrix[12];
Ynorm_X_ProjMatrix5 = Ynorm * mpProjMatrix[5] + mpProjMatrix[13];
Ynorm_X_ProjMatrix7 = Ynorm * mpProjMatrix[7] + mpProjMatrix[15];
for (j=RectX; j<RectX2; j++)
{
/* Get a pointer to the pixel on (i,j) */
pNewPtr = pDstScan0 + ((i+YOffset) * NewBitmapDataStride) + j;
/* Normalize X-coordinates */
Xnorm = (double)j / 512.0;
/* Compute the corresponding coordinates in the source image, before Proj and normalize source coordinates*/
T = (Xnorm * mpProjMatrix[3] + Ynorm_X_ProjMatrix7);
SrcY = (Xnorm * mpProjMatrix[0] + Ynorm_X_ProjMatrix4)/T;
SrcX = (Xnorm * mpProjMatrix[1] + Ynorm_X_ProjMatrix5)/T;
// Compute the integer and decimal values of the coordinates in the sources image
SrcXnt = (int) SrcX;
SrcYnt = (int) SrcY;
SrcXec = 64 - (int) ((SrcX - (double) SrcXnt) * 64);
SrcYec = 64 - (int) ((SrcY - (double) SrcYnt) * 64);
// Get the values of the four pixels up down right left
pSrcPtr1 = pSrcScan0 + (SrcXnt * SrcBitmapDataStride) + SrcYnt;
pSrcPtr2 = pSrcPtr1 + 1;
pSrcPtr3 = pSrcScan0 + ((SrcXnt+1) * SrcBitmapDataStride) + SrcYnt;
pSrcPtr4 = pSrcPtr3 + 1;
SrcYnvDec = (64-SrcYec);
(*pNewPtr) = (unsigned char)(((SrcYec * (*pSrcPtr1) + SrcYnvDec * (*pSrcPtr2)) * SrcXec +
(SrcYec * (*pSrcPtr3) + SrcYnvDec * (*pSrcPtr4)) * (64 - SrcXec)) >> 12);
}
}
}

Two things that could help: multiprocessing and SIMD. With multiprocessing you could break up the output image into tiles and have each processor work on the next available tile. You can use SIMD instructions (like SSE, AVX, AltiVec, etc.) to calculate multiple things at the same time, such as doing the same matrix math to multiple coordinates at the same time. You can even combine the two - use multiple processors running SIMD instructions to do as much work as possible. You didn't mention what platform you're working on.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js