DSP - How to apply gain in frequency domain?

DSP - How to apply gain in frequency domain? - c++

I’m a beginner in DSP and I have to make an audio equalizer.
I’ve done some research and tried a lot of thing in the past month but in the end, it’s not working and I’m a bit overwhelmed with all those informations (that I certainly don’t interpret well).
I have two main classes : Broadcast (which generate pink noise, and apply gain to it) and Record (which analyse the input of the microphone et deduct the gain from it).
I have some trouble with both, but I’m gonna limit this post to the Broadcast side.
I’m using Aquila DSP Library, so I used this example and extended the logic of it.
/* Constructor */
Broadcast::Broadcast() :
_Info(44100, 2, 2), // 44100 Hz, 2 channels, sample size : 2 octet
_pinkNoise(_Info.GetFrequency()), // Init the Aquila::PinkNoiseGenerator
_thirdOctave() // list of “Octave” class, containing min, center, and max frequency of each [⅓ octave band](http://goo.gl/365ZFN)
{
_pinkNoise.setAmplitude(65536);
}
/* This method is called in a loop and fills the buffer with the pink noise */
bool Broadcast::BuildBuffer(char * Buffer, int BufferSize, int & BufferCopiedSize)
{
if (BufferSize < 131072)
return false;
int SampleCount = 131072 / _Info.GetSampleSize();
int signalSize = SampleCount / _Info.GetChannelCount();
_pinkNoise.generate(signalSize);
auto fft = Aquila::FftFactory::getFft(signalSize);
Aquila::SpectrumType spectrum = fft->fft(_pinkNoise.toArray());
Aquila::SpectrumType ampliSpectrum(signalSize);
std::list<Octave>::iterator it;
double gain, fl, fh;
/* [1.] - The gains are applied in this loop */
for (it = _thirdOctave.begin(); it != _thirdOctave.end(); it++)
{
/* Test values */
if ((*it).getCtr() >= 5000)
gain = 6.0;
else
gain = 0.0;
fl = (signalSize * (*it).getMin() / _Info.GetFrequency());
fh = (signalSize * (*it).getMax() / _Info.GetFrequency());
/* [2.] - THIS is the part that I think is wrong */
for (int i = 0; i < signalSize; i++)
{
if (i >= fl && i < fh)
ampliSpectrum[i] = std::pow(10, gain / 20);
else
ampliSpectrum[i] = 1.0;
}
/* [3.] - Multiply each bin of spectrum with ampliSpectrum */
std::transform(
std::begin(spectrum),
std::end(spectrum),
std::begin(ampliSpectrum),
std::begin(spectrum),
[](Aquila::ComplexType x, Aquila::ComplexType y) { return x * y; }); // Aquila::ComplexType is an std::complex<double>
}
/* Put the IFFT result in a new buffer */
boost::scoped_ptr<double> s(new double[signalSize]);
fft->ifft(spectrum, s.get());
int val;
for (int i = 0; i < signalSize; i++)
{
val = int(s.get()[i]);
/* Fills the two channels with the same value */
reinterpret_cast<int*>(Buffer)[i * 2] = val;
reinterpret_cast<int*>(Buffer)[i * 2 + 1] = val;
}
BufferCopiedSize = SampleCount * _Info.GetSampleSize();
return true;
}
I’m using the pink noise of gStreamer along with the equalizer-nbands module to compare my output.
With all gain set to 0.0 the outputs are the same.
But as soon as I add some gain, the outputs sound different (even though my output still sound like a pink noise, and seems to have gain in the right spot).
So my question is :
How can I apply my gains to each ⅓ Octave band in the frequency domain.
My research shows that I should do a filter bank of band-pass filters, but how to do that with the result of an FFT ?
Thanks for your time.

Related

Detecting linear interpolation of two frequnecies on embedded system

I am trying to recognise a sequence of audio frames on an embedded system - an audio frame being a frequency or interpolation of two frequencies for a variable amount of time. I know the sounds I am trying to recognise (i.e. the start and end frequencies which are being linearly interpolated and the duration of each audio frame), but they are produced by a another embedded system so the microphone and speaker are cheap and somewhat inaccurate. The output is a square wave. Any suggestions how to go about doing this?
What I am trying to do now is to use FFT to get the magnitude of all frequencies, detect the peaks, look at the detection duration/2 ms ago and check if that somewhat matches an audio frame, and finally just checking if any sound I am looking for matched the sequence.
So far I used the FFT to process the microphone input - after applying a Hann window - and then assigning each frequency bin a coefficient that it's a peak based on how many standard deviations is away from the mean. This hasn't worked great since it thought there are peaks when it was silence in the room. Any ideas on how to more accurately detect the peaks? Also I think there are a lot of harmonics because of the square wave / interpolation? Can I do harmonic product spectrum if the peaks don't really line up at double the frequency?
Here I graphed noise (almost silent room) with somewhere in the interpolation of 2226 and 1624 Hz.
https://i.stack.imgur.com/R5Gs2.png
I sample at 91 microseconds -> 10989 Hz. Should I sample more often?
I added here samples of how the interpolation sounds when recorded on my laptop and on the embedded system.
https://easyupload.io/m/5l72b0
#define MIC_SAMPLE_RATE 10989 // Hz
#define AUDIO_SAMPLES_NUMBER 1024
MicroBitAudioProcessor::MicroBitAudioProcessor(DataSource& source) : audiostream(source)
{
arm_rfft_fast_init_f32(&fft_instance, AUDIO_SAMPLES_NUMBER);
buf = (float *)malloc(sizeof(float) * (AUDIO_SAMPLES_NUMBER * 2));
output = (float *)malloc(sizeof(float) * AUDIO_SAMPLES_NUMBER);
mag = (float *)malloc(sizeof(float) * AUDIO_SAMPLES_NUMBER / 2);
}
float henn(int i){
return 0.5 * (1 - arm_cos_f32(2 * 3.14159265 * i / AUDIO_SAMPLES_NUMBER));
}
int MicroBitAudioProcessor::pullRequest()
{
int s;
int result;
auto mic_samples = audiostream.pull();
if (!recording)
return DEVICE_OK;
int8_t *data = (int8_t *) &mic_samples[0];
int samples = mic_samples.length() / 2;
for (int i=0; i < samples; i++)
{
s = (int) *data;
result = s;
data++;
buf[(position++)] = (float)result;
if (position % AUDIO_SAMPLES_NUMBER == 0)
{
position = 0;
float maxValue = 0;
uint32_t index = 0;
// Apply a Henn window
for(int i=0; i< AUDIO_SAMPLES_NUMBER; i++)
buf[i] *= henn(i);
arm_rfft_fast_f32(&fft_instance, buf, output, 0);
arm_cmplx_mag_f32(output, mag, AUDIO_SAMPLES_NUMBER / 2);
}
}
return DEVICE_OK;
}
uint32_t frequencyToIndex(int freq) {
return (freq / ((uint32_t)MIC_SAMPLE_RATE / AUDIO_SAMPLES_NUMBER));
}
float MicroBitAudioProcessor::getFrequencyIntensity(int freq){
uint32_t index = frequencyToIndex(freq);
if (index <= 0 || index >= (AUDIO_SAMPLES_NUMBER / 2) - 1) return 0;
return mag[index];
}

Getting values for specific frequencies in a short time fourier transform

I'm trying to use C++ to recreate the spectrogram function used by Matlab. The function uses a Short Time Fourier Transform (STFT). I found some C++ code here that performs a STFT. The code seems to work perfectly for all frequencies but I only want a few. I found this post for a similar question with the following answer:
Just take the inner product of your data with a complex exponential at
the frequency of interest. If g is your data, then just substitute for
f the value of the frequency you want (e.g., 1, 3, 10, ...)
Having no background in mathematics, I can't figure out how to do this. The inner product part seems simple enough from the Wikipedia page but I have absolutely no idea what he means by (with regard to the formula for a DFT)
a complex exponential at frequency of interest
Could someone explain how I might be able to do this? My data structure after the STFT is a matrix filled with complex numbers. I just don't know how to extract my desired frequencies.
Relevant function, where window is Hamming, and vector of desired frequencies isn't yet an input because I don't know what to do with them:
Matrix<complex<double>> ShortTimeFourierTransform::Calculate(const vector<double> &signal,
const vector<double> &window, int windowSize, int hopSize)
{
int signalLength = signal.size();
int nOverlap = hopSize;
int cols = (signal.size() - nOverlap) / (windowSize - nOverlap);
Matrix<complex<double>> results(window.size(), cols);
int chunkPosition = 0;
int readIndex;
// Should we stop reading in chunks?
bool shouldStop = false;
int numChunksCompleted = 0;
int i;
// Process each chunk of the signal
while (chunkPosition < signalLength && !shouldStop)
{
// Copy the chunk into our buffer
for (i = 0; i < windowSize; i++)
{
readIndex = chunkPosition + i;
if (readIndex < signalLength)
{
// Note the windowing!
data[i][0] = signal[readIndex] * window[i];
data[i][1] = 0.0;
}
else
{
// we have read beyond the signal, so zero-pad it!
data[i][0] = 0.0;
data[i][1] = 0.0;
shouldStop = true;
}
}
// Perform the FFT on our chunk
fftw_execute(plan_forward);
// Copy the first (windowSize/2 + 1) data points into your spectrogram.
// We do this because the FFT output is mirrored about the nyquist
// frequency, so the second half of the data is redundant. This is how
// Matlab's spectrogram routine works.
for (i = 0; i < windowSize / 2 + 1; i++)
{
double real = fft_result[i][0];
double imaginary = fft_result[i][1];
results(i, numChunksCompleted) = complex<double>(real, imaginary);
}
chunkPosition += hopSize;
numChunksCompleted++;
} // Excuse the formatting, the while ends here.
return results;
}

Look up the Goertzel algorithm or filter for example code that uses the computational equivalent of an inner product against a complex exponential to measure the presence or magnitude of a specific stationary sinusoidal frequency in a signal. Performance or resolution will depend on the length of the filter and your signal.

Drawing circle, OpenGL style

I have a 13 x 13 array of pixels, and I am using a function to draw a circle onto them. (The screen is 13 * 13, which may seem strange, but its an array of LED's so that explains it.)
unsigned char matrix[13][13];
const unsigned char ON = 0x01;
const unsigned char OFF = 0x00;
Here is the first implementation I thought up. (It's inefficient, which is a particular problem as this is an embedded systems project, 80 MHz processor.)
// Draw a circle
// mode is 'ON' or 'OFF'
inline void drawCircle(float rad, unsigned char mode)
{
for(int ix = 0; ix < 13; ++ ix)
{
for(int jx = 0; jx < 13; ++ jx)
{
float r; // Radial
float s; // Angular ("theta")
matrix_to_polar(ix, jx, &r, &s); // Converts polar coordinates
// specified by r and s, where
// s is the angle, to index coordinates
// specified by ix and jx.
// This function just converts to
// cartesian and then translates by 6.0.
if(r < rad)
{
matrix[ix][jx] = mode; // Turn pixel in matrix 'ON' or 'OFF'
}
}
}
}
I hope that's clear. It's pretty simple, but then I programmed it so I know how it's supposed to work. If you'd like more info / explanation then I can add some more code / comments.
It can be considered that drawing several circles, eg 4 to 6, is very slow... Hence I'm asking for advice on a more efficient algorithm to draw the circles.
EDIT: Managed to double the performance by making the following modification:
The function calling the drawing used to look like this:
for(;;)
{
clearAll(); // Clear matrix
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] += rad_incr_step;
drawRing(rad[ix], rad[ix] - rad_width);
}
if(rad[5] >= 7.0)
{
for(int ix = 0; ix < 6; ++ ix)
{
rad[ix] = rad_space_step * (float)(-ix);
}
}
writeAll(); // Write
}
I added the following check:
if(rad[ix] - rad_width < 7.0)
drawRing(rad[ix], rad[ix] - rad_width);
This increased the performance by a factor of about 2, but ideally I'd like to make the circle drawing more efficient to increase it further. This checks to see if the ring is completely outside of the screen.
EDIT 2: Similarly adding the reverse check increased performance further.
if(rad[ix] >= 0.0)
drawRing(rad[ix], rad[ix] - rad_width);
Performance is now pretty good, but again I have made no modifications to the actual drawing code of the circles and this is what I was intending to focus on with this question.
Edit 3: Matrix to polar:
inline void matrix_to_polar(int i, int j, float* r, float* s)
{
float x, y;
matrix_to_cartesian(i, j, &x, &y);
calcPolar(x, y, r, s);
}
inline void matrix_to_cartesian(int i, int j, float* x, float* y)
{
*x = getX(i);
*y = getY(j);
}
inline void calcPolar(float x, float y, float* r, float* s)
{
*r = sqrt(x * x + y * y);
*s = atan2(y, x);
}
inline float getX(int xc)
{
return (float(xc) - 6.0);
}
inline float getY(int yc)
{
return (float(yc) - 6.0);
}
In response to Clifford that's actually a lot of function calls if they are not inlined.
Edit 4: drawRing just draws 2 circles, firstly an outer circle with mode ON and then an inner circle with mode OFF. I am fairly confident that there is a more efficient method of drawing such a shape too, but that distracts from the question.

You're doing a lot of calculations that aren't really needed. For example, you're calculating the angle of the polar coordinates, but never use it. The square root can also easily be avoided by comparing the square of the values.
Without doing anything fancy, something like this should be a good start:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; jx <= intRad; ++jx)
{
if (ix * ix + jx * jx <= radSqr)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}
}
This does all the math in integer format, and takes advantage of the circle symmetry.
Variation of the above, based on feedback in the comments:
int intRad = (int)rad;
int intRadSqr = (int)(rad * rad);
for (int ix = 0; ix <= intRad; ++ix)
{
for (int jx = 0; ix * ix + jx * jx <= radSqr; ++jx)
{
matrix[6 - ix][6 - jx] = mode;
matrix[6 - ix][6 + jx] = mode;
matrix[6 + ix][6 - jx] = mode;
matrix[6 + ix][6 + jx] = mode;
}
}

Don't underestimate the cost of even basic arithmetic using floating point on a processor with no FPU. It seems unlikely that floating point is necessary, but the details of its use are hidden in your matrix_to_polar() implementation.
Your current implementation considers every pixel as a candidate - that is also unnecessary.
Using the equation y = cy ± √[rad2 - (x-cx)2] where cx, cy is the centre (7, 7 in this case), and a suitable integer square root implementation, the circle can be drawn thus:
void drawCircle( int rad, unsigned char mode )
{
int r2 = rad * rad ;
for( int x = 7 - rad; x <= 7 + rad; x++ )
{
int dx = x - 7 ;
int dy = isqrt( r2 - dx * dx ) ;
matrix[x][7 - dy] = mode ;
matrix[x][7 + dy] = mode ;
}
}
In my test I used the isqrt() below based on code from here, but given that the maximum r2 necessary is 169 (132, you could implement a 16 or even 8 bit optimised version if necessary. If your processor is 32 bit, this is probably fine.
uint32_t isqrt(uint32_t n)
{
uint32_t root = 0, bit, trial;
bit = (n >= 0x10000) ? 1<<30 : 1<<14;
do
{
trial = root+bit;
if (n >= trial)
{
n -= trial;
root = trial+bit;
}
root >>= 1;
bit >>= 2;
} while (bit);
return root;
}
All that said, on such a low resolution device, you will probably get better quality circles and faster performance by hand generating bitmap lookup tables for each radius required. If memory is an issue, then a single circle needs only 7 bytes to describe a 7 x 7 quadrant that you can reflect to all three quadrants, or for greater performance you could use 7 x 16 bit words to describe a semi-circle (since reversing bit order is more expensive than reversing array access - unless you are using an ARM Cortex-M with bit-banding). Using semi-circle look-ups, 13 circles would need 13 x 7 x 2 bytes (182 bytes), quadrant look-ups would be 7 x 8 x 13 (91 bytes) - you may find that is fewer bytes that the code space required to calculate the circles.

For a slow embedded device with only a 13x13 element display, you should really just make a look-up table. For example:
struct ComputedCircle
{
float rMax;
char col[13][2];
};
Where the draw routine uses rMax to determine which LUT element to use. For example, if you have 2 elements with one rMax = 1.4f, the other = 1.7f, then any radius between 1.4f and 1.7f will use that entry.
The column elements would specify zero, one, or two line segments per row, which can be encoded in the lower and upper 4 bits of each char. -1 can be used as a sentinel value for nothing-at-this-row. It is up to you how many look-up table entries to use, but with a 13x13 grid you should be able to encode every possible outcome of pixels with well under 100 entries, and a reasonable approximation using only 10 or so. You can also trade off compression for draw speed as well, e.g. putting the col[13][2] matrix in a flat list and encoding the number of rows defined.

I would accept MooseBoy's answer if only he explained the method he proposes better. Here's my take on the lookup table approach.
Solve it with a lookup table
The 13x13 display is quite small, and if you only need circles which are fully visible within this pixel count, you will get around with a quite small table. Even if you need larger circles, it should be still better than any algorithmic way if you need it to be fast (and have the ROM to store it).
How to do it
You basically need to define how each possible circle looks like on the 13x13 display. It is not sufficient to just produce snapshots for the 13x13 display, as it is likely you would like to plot the circles at arbitrary positions. My take for a table entry would look like this:
struct circle_entry_s{
unsigned int diameter;
unsigned int offset;
};
The entry would map a given diameter in pixels to offsets in a large byte table containing the shape of the circles. For example for diameter 9, the byte sequence would look like this:
0x1CU, 0x00U, /* 000111000 */
0x63U, 0x00U, /* 011000110 */
0x41U, 0x00U, /* 010000010 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x80U, 0x80U, /* 100000001 */
0x41U, 0x00U, /* 010000010 */
0x63U, 0x00U, /* 011000110 */
0x1CU, 0x00U, /* 000111000 */
The diameter specifies how many bytes of the table belong to the circle: one row of pixels are generated from (diameter + 7) >> 3 bytes, and the number of rows correspond to the diameter. The output code of these can be made quite fast, while the lookup table is sufficiently compact to get even larger than the 13x13 display circles defined in it if needed.
Note that defining circles this way for odd and even diameters may or may not appeal you when output by a centre location. The odd diameter circles will appear to have a centre in the "middle" of a pixel, while the even diameter circles will appear to have their centre on the "corner" of a pixel.
You may also find it nice later to refine the overall method so having multiple circles of different apparent sizes, but having the same pixel radius. Depends on what is your goal: if you want some kind of smooth animation, you may get there eventually.
Algorithmic solutions I think mostly will perform poorly here, since with this limited display surface really every pixel's state counts for the appearance.

Real time audio processing

I would like to make real time audio processing with Qt and display the fundamental frequency using FFTW3.
What I've done in steps:
I capture any sound from computer device and fill it into the buffer.
I assign sound samples to double array
I compute the fundamental frequency.
Problem
My code always returns 0 as fundamental frequency.
QByteArray *buffer;
QAudioInput *audioInput;
audioInput = new QAudioInput(format, this);
//Check the number of samples in input buffer
qint64 len = audioInput->bytesReady();
//Limit sample size
if(len > 4096)
len = 4096;
//Read sound samples from input device to buffer
qint64 l = input->read(buffer.data(), len);
if(l > 0)
{
int input_size = BufferSize;
// Compute corresponding number of complex output samples
int output_size = (input_size/2 + 1);
double *input_buffer = static_cast<double*>(fftw_malloc(input_size * sizeof(double)));
fftw_complex *out = static_cast<fftw_complex*>(fftw_malloc(output_size * sizeof(fftw_complex)));
//Assign sound samples to double array
input_buffer = (double*)buffer.data();
fftw_plan p3;
//Create plan
p3 = fftw_plan_dft_r2c_1d(input_size, input_buffer, out, FFTW_ESTIMATE);
fftw_execute(p3);
double reout[BufferSize];
double imgout[BufferSize];
double magnitude[BufferSize/2];
long ffond = 0.0; // Position of the frequency
double max = 0; // Maximal amplitude
for (int i = 0; i < BufferSize/2; i++)
{
reout[i] = out[i][0];
imgout[i] = out[i][1];
cout << imgout[i] << endl;
magnitude[i] = sqrt(reout[i]*reout[i] + imgout[i]*imgout[i]); //Calculate magnitude of first
double t = sqrt(reout[i]*reout[i] + imgout[i]*imgout[i]);
if(t > max)
{
max = t;
ffond = i;
}
}
qDebug() << "fundamental frequency is :" << QString::number(ffond*static_cast<double>);
fftw_destroy_plan(p3);

You have two immediate problems that I can see:
you are not applying a window function, so there will be considerable spectral leakage and associated "smearing" of the spectrum (and probably a large DC (0 Hz) component with associated "skirt")
you are assuming that the largest magnitude in the spectrum is the fundamental frequency, which will most likely be incorrect for two reasons: (a) you may well have a large 0 Hz component which is larger than your fundamental or harmonics and (b) depending on the nature of the sound you are trying to analyse, the fundamental may be smaller in magnitude than the harmonics (it may even be missing completely)
I suggest you do the following:
apply a suitable window function prior to the FFT - this should make your peaks better defined and should reduce the artefacts at 0 Hz and just above
start your search at an appropriate bin rather than 0, e.g. if the minimum fundamental frequency you are interested in is say 50 Hz then start at the corresponding bin for 50 Hz rather than at 0
add a debug option to display the spectrum graphically - this visual debugging aid will help greatly when you are wondering why your results do not make sense
if what you are really trying to measure is pitch rather than fundamental frequency, then read up on pitch detection algorithms, e.g. Harmonic Product Spectrum - this will work a lot better than the naïve approach of trying to identify a fundamental (whose frequency will not be the same as the pitch in the general case)

Applying Matrix To Image, Seeking Performance Improvements

Edited: Working on Windows platform.
Problem: Less of a problem, more about advise. I'm currently not incredibly versed in low-level program, but I am attempting to optimize the code below in an attempt to increase the performance of my overall code. This application depends on extremely high speed image processing.
Current Performance: On my computer, this currently computes at about 4-6ms for a 512x512 image. I'm trying to cut that in half if possible.
Limitations: Due to this projects massive size, fundamental changes to the application are very difficult to do, so things such as porting to DirectX or other GPU methods isn't much of an option. The project currently works, I'm simply trying to figure out how to make it work faster.
Specific information about my use for this: Images going into this method are always going to be exactly square and some increment of 128. (Most likely 512 x 512) and they will always come out the same size. Other than that, there is not much else to it. The matrix is calculated somewhere else, so this is just the applying of the matrix to my image. The original image and the new image are both being used, so copying the image is necessary.
Here is my current implementation:
void ReprojectRectangle( double *mpProjMatrix, unsigned char *pDstScan0, unsigned char *pSrcScan0,
int NewBitmapDataStride, int SrcBitmapDataStride, int YOffset, double InversedAspect, int RectX, int RectY, int RectW, int RectH)
{
int i, j;
double Xnorm, Ynorm;
double Ynorm_X_ProjMatrix4, Ynorm_X_ProjMatrix5, Ynorm_X_ProjMatrix7;;
double SrcX, SrcY, T;
int SrcXnt, SrcYnt;
int SrcXec, SrcYec, SrcYnvDec;
unsigned char *pNewPtr, *pSrcPtr1, *pSrcPtr2, *pSrcPtr3, *pSrcPtr4;
int RectX2, RectY2;
/* Compensate (or re-center) the Y-coordinate regarding the aspect ratio */
RectY -= YOffset;
/* Compute the second point of the rectangle for the loops */
RectX2 = RectX + RectW;
RectY2 = RectY + RectH;
/* Clamp values (be careful with aspect ratio */
if (RectY < 0) RectY = 0;
if (RectY2 < 0) RectY2 = 0;
if ((double)RectY > (InversedAspect * 512.0)) RectY = (int)(InversedAspect * 512.0);
if ((double)RectY2 > (InversedAspect * 512.0)) RectY2 = (int)(InversedAspect * 512.0);
/* Iterate through each pixel of the scaled re-Proj */
for (i=RectY; i<RectY2; i++)
{
/* Normalize Y-coordinate and take the ratio into account */
Ynorm = InversedAspect - (double)i / 512.0;
/* Pre-compute some matrix coefficients */
Ynorm_X_ProjMatrix4 = Ynorm * mpProjMatrix[4] + mpProjMatrix[12];
Ynorm_X_ProjMatrix5 = Ynorm * mpProjMatrix[5] + mpProjMatrix[13];
Ynorm_X_ProjMatrix7 = Ynorm * mpProjMatrix[7] + mpProjMatrix[15];
for (j=RectX; j<RectX2; j++)
{
/* Get a pointer to the pixel on (i,j) */
pNewPtr = pDstScan0 + ((i+YOffset) * NewBitmapDataStride) + j;
/* Normalize X-coordinates */
Xnorm = (double)j / 512.0;
/* Compute the corresponding coordinates in the source image, before Proj and normalize source coordinates*/
T = (Xnorm * mpProjMatrix[3] + Ynorm_X_ProjMatrix7);
SrcY = (Xnorm * mpProjMatrix[0] + Ynorm_X_ProjMatrix4)/T;
SrcX = (Xnorm * mpProjMatrix[1] + Ynorm_X_ProjMatrix5)/T;
// Compute the integer and decimal values of the coordinates in the sources image
SrcXnt = (int) SrcX;
SrcYnt = (int) SrcY;
SrcXec = 64 - (int) ((SrcX - (double) SrcXnt) * 64);
SrcYec = 64 - (int) ((SrcY - (double) SrcYnt) * 64);
// Get the values of the four pixels up down right left
pSrcPtr1 = pSrcScan0 + (SrcXnt * SrcBitmapDataStride) + SrcYnt;
pSrcPtr2 = pSrcPtr1 + 1;
pSrcPtr3 = pSrcScan0 + ((SrcXnt+1) * SrcBitmapDataStride) + SrcYnt;
pSrcPtr4 = pSrcPtr3 + 1;
SrcYnvDec = (64-SrcYec);
(*pNewPtr) = (unsigned char)(((SrcYec * (*pSrcPtr1) + SrcYnvDec * (*pSrcPtr2)) * SrcXec +
(SrcYec * (*pSrcPtr3) + SrcYnvDec * (*pSrcPtr4)) * (64 - SrcXec)) >> 12);
}
}
}

Two things that could help: multiprocessing and SIMD. With multiprocessing you could break up the output image into tiles and have each processor work on the next available tile. You can use SIMD instructions (like SSE, AVX, AltiVec, etc.) to calculate multiple things at the same time, such as doing the same matrix math to multiple coordinates at the same time. You can even combine the two - use multiple processors running SIMD instructions to do as much work as possible. You didn't mention what platform you're working on.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js