How to perform FFT on WAV file data? - c++

I'm trying to analyse the audio quality of a file by detecting the highest frequency present (compressed audio will generally be filtered to something less than 20KHz).
I'm reading WAV file data using a class from the soundstretch library which returns PCM samples as floats, then performing FFT on those samples with the fftw3 library. Then for each frequency (rounded to the nearest KHz), I am totalling up the amplitude for that frequency.
So for a low quality file that doesn't contain frequencies above 16KHz, I would expect there to be none or very little amplitude above 16KHz, however I'm not getting the results I would expect. Below is my code:
#include <iostream>
#include <math.h>
#include <fftw3.h>
#include <soundtouch/SoundTouch.h>
#include "include/WavFile.h"
using namespace std;
using namespace soundtouch;
#define BUFF_SIZE 6720
#define MAX_FREQ 22//KHz
static float freqMagnitude[MAX_FREQ];
static void calculateFrequencies(fftw_complex *data, size_t len, int Fs) {
for (int i = 0; i < len; i++) {
int re, im;
float freq, magnitude;
int index;
re = data[i][0];
im = data[i][1];
magnitude = sqrt(re * re + im * im);
freq = i * Fs / len;
index = freq / 1000;//round(freq);
if (index <= MAX_FREQ) {
freqMagnitude[index] += magnitude;
}
}
}
int main(int argc, char *argv[]) {
if (argc < 2) {
cout << "Incorrect args" << endl;
return -1;
}
SAMPLETYPE sampleBuffer[BUFF_SIZE];
WavInFile inFile(argv[1]);
fftw_complex *in, *out;
fftw_plan p;
in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * BUFF_SIZE);
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * BUFF_SIZE);
p = fftw_plan_dft_1d(BUFF_SIZE, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
while (inFile.eof() == 0) {
size_t samplesRead = inFile.read(sampleBuffer, BUFF_SIZE);
for (int i = 0; i < BUFF_SIZE; i++) {
in[i][0] = (double) sampleBuffer[i];
}
fftw_execute(p); /* repeat as needed */
calculateFrequencies(out, samplesRead, inFile.getSampleRate());
}
for (int i = 0; i < MAX_FREQ; i += 2) {
cout << i << "KHz magnitude: " << freqMagnitude[i] << std::endl;
}
fftw_destroy_plan(p);
fftw_free(in);
fftw_free(out);
}
Can compile with: - (you'll need the soundtouch library and fftw3 library)
g++ -g -Wall MP3.cpp include/WavFile.cpp -lfftw3 -lm -lsoundtouch -I/usr/local/include -L/usr/local/lib
And here is the spectral analysis of the file I am testing on:
As you can see it's clipped at 16KHz, however my results are as follows:
0KHz magnitude: 4.61044e+07
2KHz magnitude: 5.26959e+06
4KHz magnitude: 4.68766e+06
6KHz magnitude: 4.12703e+06
8KHz magnitude: 12239.6
10KHz magnitude: 456
12KHz magnitude: 3
14KHz magnitude: 650468
16KHz magnitude: 1.83266e+06
18KHz magnitude: 1.40232e+06
20KHz magnitude: 1.1477e+06
I would expect there to be no amplitude over 16KHz, am I doing this right?
Is my calculation for frequency correct? (I robbed it off another stackoverflow answer)
Could it be something to do with there being 2 channels and I'm not separating the channels?
Cheers for any help guys.

You are likely measuring the interleave difference between two stereo channels, which can include high frequencies due to unequal mix and pan. Try again with the channels separated or mixed down to mono, and use a smooth window function to reduce FFT aperture edge artifacts, which can also introduce a small amount of high frequency noise due to your rectangular window.

An FFT foundamental requirement is the equally time spacing of samples and their congruence.
In your case a stereo signal supply to the FFT algorithm double the number of samples uncorrelated between themself. What is mathematically seen is the natural phase difference between the two cannels, but, more important, two samples that, because unrelated, can have such a big difference to wrongly represent a square wave (in the time domain it would be represented by an extremely high signal slew rate).
As a solution you have to separate the two channels and perform FFT on one single series of samples, or two different FFT.
I don't think that there could be any aliasing problem because this is normally related to the sampling process and performed using analog filter having bandpass frequency < 1/2 the sampling frequence (Nyquist or antialias filter). If this filtering is missed there are almost no way to remove ghosts (alias spectrum) after.

I speak as someone with very slight real-world experience and book learning over a decade ago so this answer might be evidence of a little knowledge being a dangerous thing but I think the problem you're seeing is just aliasing.
Imagine a perfect square wave. You've never heard a perfect square wave because it would require a sound source instantly to transition from one position to another, while still pushing air particles about.
You also can't describe a square wave with a finite number of harmonics. However, you can trivially describe a square wave with any frequency of PCM audio. Therefore any source PCM audio can appear to contain an infinite number of harmonics.
What you can probably do is just sit atop Nyquist and say that if the input audio is N Mhz then the highest-frequency part that can be actual signal is at N/2 Mhz; therefore you can resample the input wave down to twice the first rate less than or equal to N/2 Mhz that shows significant signal without losing meaningful content.

Related

Intel integrated performance primitives Fourier Transform magnitudes

When I am using Intel IPP's ippsFFTFwd_RToCCS_64f and then ippsMagnitude_64fc I get a massive peak at zero index in magnitudes array.
My sine wave is long and main component I am interested is somewhere between 0.15 Hz and 0.25 Hz. I take the sample with 500Hz sampling frequency. If I reduce mean from the signal before FFT I get really small zero component not that peak anymore. Below is a pic of magnitudes array head:
Also the magnitude scaling seems to be 10 times the magnitude I see in the time series of the signal e.g. if amplitude is 29 in magnitudes it is 290.
I Am not sure why this is so and my question is 1. Do I really need to address the zero index peak with mean reduction and 2. Where does this scale of 10 come?
void CalculateForwardTransform(array<double> ^signal, array<double> ^transformedSignal, array<double> ^magnitudes)
{
// source signal
pin_ptr<double> pinnedSignal = &signal[0];
double *pSignal = pinnedSignal;
int order = (int)Math::Round(Math::Log(signal->Length, 2));
// get sizes
int sizeSpec = 0, sizeInit = 0, sizeBuf = 0;
int status = ippsFFTGetSize_R_64f(order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone, &sizeSpec, &sizeInit, &sizeBuf);
// memory allocation
IppsFFTSpec_R_64f* pSpec;
Ipp8u *pSpecMem = (Ipp8u*)ippMalloc(sizeSpec);
Ipp8u *pMemInit = (Ipp8u*)ippMalloc(sizeInit);
// FFT specification structure initialized
status = ippsFFTInit_R_64f(&pSpec, order, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone, pSpecMem, pMemInit);
// transform
pin_ptr<double> pinnedTransformedSignal = &transformedSignal[0];
double *pDst = pinnedTransformedSignal;
Ipp8u *pBuffer = (Ipp8u*)ippMalloc(sizeBuf);
status = ippsFFTFwd_RToCCS_64f(pSignal, pDst, pSpec, pBuffer);
// get magnitudes
pin_ptr<double> pinnedMagnitudes = &magnitudes[0];
double *pMagn = pinnedMagnitudes;
status = ippsMagnitude_64fc((Ipp64fc*)pDst, pMagn, magnitudes->Length); // magnitudes is half of signal len
// free memory
ippFree(pSpecMem);
ippFree(pMemInit);
ippFree(pBuffer);
}
Do I really need to address the zero index peak with mean reduction?
For low frequency signal analysis a small bias can really interfere (especially due to spectral leakage). For sake of illustration, consider the following low-frequency signal tone and another one with a constant bias tone_with_bias:
fs = 1;
f0 = 0.15;
for (int i = 0; i < N; i++)
{
tone[i] = 0.001*cos(2*M_PI*i*f0/fs);
tone_with_bias[i] = 1 + tone[i];
}
If we plot the frequency spectrum of a 100 sample window of these signals, you should notice that the spectrum of tone_with_bias completely drowns out the spectrum of tone:
So yes it's better if you can remove that bias. However, it should be emphasized that this is possible provided that you know the nature of the bias. If you know that the bias is indeed a constant, removing it will reveal the low-frequency component. Otherwise, removing the mean from the signal may not achieve the desired effect if the bias is just a very low-frequency component.
Where does this scale of 10 come?
Scaling of the magnitude by the FFT should be expected, as described in this answer of approximately 0.5*N (where N is the FFT size). If you were processing a small chunk of 20 samples, then you should get such a factor of 10 scaling. If you scale the output of the FFT by 2/N (or equivalently scale by 2 while also using the IPP_FFT_DIV_FWD_BY_N flag) you should get results that have similar magnitudes as the time-domain signal.

Fast, good quality pixel interpolation for extreme image downscaling

In my program, I am downscaling an image of 500px or larger to an extreme level of approx 16px-32px. The source image is user-specified so I do not have control over its size. As you can imagine, few pixel interpolations hold up and inevitably the result is heavily aliased.
I've tried bilinear, bicubic and square average sampling. The square average sampling actually provides the most decent results but the smaller it gets, the larger the sampling radius has to be. As a result, it gets quite slow - slower than the other interpolation methods.
I have also tried an adaptive square average sampling so that the smaller it gets the greater the sampling radius, while the closer it is to its original size, the smaller the sampling radius. However, it produces problems and I am not convinced this is the best approach.
So the question is: What is the recommended type of pixel interpolation that is fast and works well on such extreme levels of downscaling?
I do not wish to use a library so I will need something that I can code by hand and isn't too complex. I am working in C++ with VS 2012.
Here's some example code I've tried as requested (hopefully without errors from my pseudo-code cut and paste). This performs a 7x7 average downscale and although it's a better result than bilinear or bicubic interpolation, it also takes quite a hit:
// Sizing control
ctl(0): "Resize",Range=(0,800),Val=100
// Variables
float fracx,fracy;
int Xnew,Ynew,p,q,Calc;
int x,y,p1,q1,i,j;
//New image dimensions
Xnew=image->width*ctl(0)/100;
Ynew=image->height*ctl(0)/100;
for (y=0; y<image->height; y++){ // rows
for (x=0; x<image->width; x++){ // columns
p1=(int)x*image->width/Xnew;
q1=(int)y*image->height/Ynew;
for (z=0; z<3; z++){ // channels
for (i=-3;i<=3;i++) {
for (j=-3;j<=3;j++) {
Calc += (int)(src(p1-i,q1-j,z));
} //j
} //i
Calc /= 49;
pset(x, y, z, Calc);
} // channels
} // columns
} // rows
Thanks!
The first point is to use pointers to your data. Never use indexes at every pixel. When you write: src(p1-i,q1-j,z) or pset(x, y, z, Calc) how much computation is being made? Use pointers to data and manipulate those.
Second: your algorithm is wrong. You don't want an average filter, but you want to make a grid on your source image and for every grid cell compute the average and put it in the corresponding pixel of the output image.
The specific solution should be tailored to your data representation, but it could be something like this:
std::vector<uint32_t> accum(Xnew);
std::vector<uint32_t> count(Xnew);
uint32_t *paccum, *pcount;
uint8_t* pin = /*pointer to input data*/;
uint8_t* pout = /*pointer to output data*/;
for (int dr = 0, sr = 0, w = image->width, h = image->height; sr < h; ++dr) {
memset(paccum = accum.data(), 0, Xnew*4);
memset(pcount = count.data(), 0, Xnew*4);
while (sr * Ynew / h == dr) {
paccum = accum.data();
pcount = count.data();
for (int dc = 0, sc = 0; sc < w; ++sc) {
*paccum += *i;
*pcount += 1;
++pin;
if (sc * Xnew / w > dc) {
++dc;
++paccum;
++pcount;
}
}
sr++;
}
std::transform(begin(accum), end(accum), begin(count), pout, std::divides<uint32_t>());
pout += Xnew;
}
This was written using my own library (still in development) and it seems to work, but later I changed the variables names in order to make it simpler here, so I don't guarantee anything!
The idea is to have a local buffer of 32 bit ints which can hold the partial sum of all pixels in the rows which fall in a row of the output image. Then you divide by the cell count and save the output to the final image.
The first thing you should do is to set up a performance evaluation system to measure how much any change impacts on the performance.
As said precedently, you should not use indexes but pointers for (probably) a substantial
speed up & not simply average as a basic averaging of pixels is basically a blur filter.
I would highly advise you to rework your code to be using "kernels". This is the matrix representing the ratio of each pixel used. That way, you will be able to test different strategies and optimize quality.
Example of kernels:
https://en.wikipedia.org/wiki/Kernel_(image_processing)
Upsampling/downsampling kernel:
http://www.johncostella.com/magic/
Note, from the code it seems you apply a 3x3 kernel but initially done on a 7x7 kernel. The equivalent 3x3 kernel as posted would be:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]

fftw analysing frequencies from mic input on pc

I am using fftw to analyse the frequency spectrum of audio input to a computer from the mic input. I am using portaudio c++ libraries to capture the windows of time-domain audio data and then fftw to do a real to complex r2c transformation of this data to the frequency domain. Below is my function which I call everytime I receive the block of data.
The sample rate is 44100 samples per second , the sample type is short (signed 16 bit integer)and I am taking 250ms blocks of data in each window. The fft resolution is therefore 4Hz.
The problem is , i'm not sure how to interpret the data which I am receiving after the transformation. When no audio is played , I am getting amplitudes of around 1000 to 4000 for every frequency component, as soon as audio is played from an instrument for example, all of the amplitudes go negative.
I have tried doing a normalisation before the fft, by dividing by the average power and then the data makes more sense. All amplitudes are from 200 to 500 when nothing is played, then for example if I play a tone of 76Hz, the amplitude for this component increases to around 2000. So that is something along the lines of what I expect, but still not sure if this process can be implemented better.
My question is, am I doing the right thing here? Must the data be normalised and am I doing it right? Why am I still receiving high amplitudes on the frequencies that are not being played. Has anyone any experience of doing something similar and maybe give some tips. Many thanks in advance.
void AudioProcessor::GetFFT(void* inputData, void* freqSpectrum)
{
double* input = (double*)inputData;
short* freq_spectrum = (short*)freqSpectrum;
fftPlan = fftw_plan_dft_r2c_1d(FRAMES_PER_BUFFER, input, complexOut, FFTW_ESTIMATE);
fftw_execute(fftPlan);
////
for (int k = 0; k < (FRAMES_PER_BUFFER + 1) / 2; ++k)
{
freq_spectrum[k] = (short)(sqrt(complexOut[k][0] * complexOut[k][0] + complexOut[k][1] * complexOut[k][1]));
}
if (FRAMES_PER_BUFFER % 2 == 0) /* frames per buffer is even number */
{
freq_spectrum[FRAMES_PER_BUFFER / 2] = (short)(sqrt(complexOut[FRAMES_PER_BUFFER / 2][0] * complexOut[FRAMES_PER_BUFFER / 2][0] + complexOut[FRAMES_PER_BUFFER / 2][1] * complexOut[FRAMES_PER_BUFFER / 2][1])); /* Nyquist freq. */
}
}

reconstruct signal from fft

Im reconstructing signal from amplitude, frequency and phase obtained fft. After I do fft, I picked some of its frequencies and reconstructed time line signal from those fft data. I know IFFT is for this but, I dont want to use IFFT.
Reconstruction seems fine but theres some time lag between two signals. This image shows this problem. Black one is the original signal and red one is that reconstructed.
If I know correctly, amplitude of frequency bin t is sqrt(real[t]*real[t] + imag[t]*imag[t] and phase is atan2(imag[t], real[t]).
So, I used formula amplitude * cos(2*π*x / frequency + phase) for a frequency bin. And I summed those regenerated waves. As far as I know, this should generate intact signal fits to original signal. But it ends up always with some time lag from original signal.
Yeah, I think its about phase but thats so simple to calculate and its working correctly. If it has error, reconstructed signal would not fit to its original signal in shape.
This is the code to generate cosine wave. I generated cosine wave from sin(x + π/2).
std::vector<short> encodeSineWavePCM(
const double frequency,
const double amplitude,
const double offSetPhase)
{
const double pi = 3.1415926535897932384626;
const int N = 44100; // 1 sec length wave
std::vector<short> s(N);
const double wavelength = 1.0 * N / frequency;
const double unitlength = 2 * pi / wavelength;
for (int i = 0; i<N; i ++) {
double val = sin(offSetPhase + i * unitlength);
val *= amplitude;
s[i] = (short)val;
}
return s;
}
What am I missing?
Quite normal. You're doing a frame-by-frame transform. That means the FFT frame is produced after one time frame. When transforming back, you have the inverse effect: your time frame starts after the FFT frame has been decoded.

How to efficiently determine the minimum necessary size of a pre-rendered sine wave audio buffer for looping?

I've written a program that generates a sine-wave at a user-specified frequency, and plays it on a 96kHz audio channel. To save a few CPU cycles I employ the old trick of pre-rendering a short section of audio into a buffer, and then playing back the buffer in a loop, so that I can avoid calling the sin() function 96000 times per second for the duration of the program and just do simple memory-copying instead.
My problem is efficiently determining what the minimum usable size of this pre-rendered buffer would be. For some frequencies it is easy -- for example, an 8kHz sine wave can be perfectly represented by generating a 12-sample buffer and playing it in a looping, because (8000*12 == 96000). For other frequencies, however, a single cycle of the sine wave requires a non-integral number of samples to represent, and therefore looping a single cycle's worth of samples would cause unacceptable glitching.
For some of those frequencies, however, it's possible to get around that problem by pre-rendering more than one cycle of the sine wave and looping that -- if I can figure out how many cycles are required so that the number of cycles present in the buffer will be integral, while also guaranteeing that the number of samples in the buffer are integral. For example, a sine-wave frequency of 12.8kHz translates to a single-cycle buffer-size of 7.5 samples, which won't loop cleanly, but if I render two consecutive cycles of the sine wave into a 15-sample buffer, then I can cleanly loop the result.
My current approach to solving this issue is brute force: I try all possible cycle-counts and see if any of them result in a buffer size with an integral number of samples in it. I think that approach is unsatisfactory for the following reasons:
1) It's very inefficient. For example, the program shown below (which prints buffer-size results for 480,000 possible frequency values between 0Hz and 48kHz) takes 35 minutes to complete on my 2.7GHz machine. I think there must be a much faster way to do this.
2) I suspect that the results are not 100% accurate, due to floating-point errors.
3) The algorithm gives up if it can't find an acceptable buffer size less than 10 seconds long. (I could make the limit higher, but of course that would make the algorithm even slower).
So, is there any way to calculate the minimum-usable-buffer-size analytically, preferably in O(1) time? It seems like it should be easy, but I haven't been able to figure out what kind of math I should use.
Thanks in advance for any advice!
#include <stdio.h>
#include <math.h>
static const long long SAMPLES_PER_SECOND = 96000;
static const long long MAX_ALLOWED_BUFFER_SIZE_SAMPLES = (SAMPLES_PER_SECOND * 10);
// Returns the length of the pre-render buffer needed to properly
// loop a sine wave at the given frequence, or -1 on failure.
static int GetNumCyclesNeededForPreRenderedBuffer(float freqHz)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
for (int count=1; (count*oneCycleLengthSamples) < MAX_ALLOWED_BUFFER_SIZE_SAMPLES; count++)
{
double remainder = fmod(oneCycleLengthSamples*count, 1.0);
if (remainder > 0.5) remainder = 1.0-remainder;
if (remainder <= 0.0) return count;
}
return -1;
}
int main(int, char **)
{
for (int i=0; i<48000*10; i++)
{
double freqHz = ((double)i)/10.0f;
int numCyclesNeeded = GetNumCyclesNeededForPreRenderedBuffer(freqHz);
if (numCyclesNeeded >= 0)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
printf("For %.1fHz, use a pre-render-buffer size of %f samples (%i cycles, %f samples/cycle)\n", freqHz, (numCyclesNeeded*oneCycleLengthSamples), numCyclesNeeded, oneCycleLengthSamples);
}
else printf("For %.1fHz, there was no suitable pre-render-buffer size under the allowed limit!\n", freqHz);
}
return 0;
}
number_of_cycles/size_of_buffer = frequency/samples_per_second
This implies that if you can simplify your frequency/samples_per_second fraction, you can find the size of your buffer and the number of cycles in the buffer. If frequency and samples_per_second are integers, you can simplify the fraction by finding the greatest common divisor, otherwise you can use the method of continued fractions.
Example:
Say your frequency is 1234.5, and your samples_per_second is 96000. We can make these into two integers by multiplying by 10, so we get the ratio:
frequency/samples_per_second = 12345/960000
The greatest common divisor is 15, so it can be reduced to 823/64000.
So you would need 823 cycles in a 64000 sample buffer to reproduce the frequency exactly.