FFTW analysing frequencies from mic input on PC - C++

I am using FFTW to analyse the frequency spectrum of audio coming into a computer from the mic input. I am using the PortAudio C++ library to capture windows of time-domain audio data, and then FFTW to do a real-to-complex (r2c) transformation of this data to the frequency domain. Below is the function I call every time I receive a block of data.
The sample rate is 44100 samples per second, the sample type is short (signed 16-bit integer), and I am taking 250 ms blocks of data in each window. The FFT resolution is therefore 4 Hz.
The problem is, I'm not sure how to interpret the data I am receiving after the transformation. When no audio is played, I get amplitudes of around 1000 to 4000 for every frequency component; as soon as audio is played, from an instrument for example, all of the amplitudes go negative.
I have tried normalising before the FFT by dividing by the average power, and then the data makes more sense. All amplitudes are between 200 and 500 when nothing is played; then, for example, if I play a 76 Hz tone, the amplitude of that component increases to around 2000. So that is along the lines of what I expect, but I am still not sure whether this process could be implemented better.
My question is: am I doing the right thing here? Must the data be normalised, and am I doing it right? Why am I still receiving high amplitudes at frequencies that are not being played? Has anyone any experience of doing something similar who could give some tips? Many thanks in advance.
void AudioProcessor::GetFFT(void* inputData, void* freqSpectrum)
{
    double* input = (double*)inputData;
    short* freq_spectrum = (short*)freqSpectrum;

    fftPlan = fftw_plan_dft_r2c_1d(FRAMES_PER_BUFFER, input, complexOut, FFTW_ESTIMATE);
    fftw_execute(fftPlan);

    /* Magnitude of each positive-frequency bin. */
    for (int k = 0; k < (FRAMES_PER_BUFFER + 1) / 2; ++k)
    {
        freq_spectrum[k] = (short)(sqrt(complexOut[k][0] * complexOut[k][0]
                                      + complexOut[k][1] * complexOut[k][1]));
    }
    if (FRAMES_PER_BUFFER % 2 == 0) /* frames per buffer is an even number */
    {
        /* Nyquist frequency bin. */
        freq_spectrum[FRAMES_PER_BUFFER / 2] =
            (short)(sqrt(complexOut[FRAMES_PER_BUFFER / 2][0] * complexOut[FRAMES_PER_BUFFER / 2][0]
                       + complexOut[FRAMES_PER_BUFFER / 2][1] * complexOut[FRAMES_PER_BUFFER / 2][1]));
    }
}
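As an aside, a common convention (not used in the code above) is to scale the r2c magnitudes by 1/N and double every bin except DC and Nyquist, so that a sinusoid's reading matches its time-domain peak amplitude, rather than normalising the input by average power. A minimal sketch of that scaling, reusing the names from the function above:

    /* Sketch only: conventional amplitude scaling for an r2c FFT. */
    const double scale = 1.0 / FRAMES_PER_BUFFER;
    for (int k = 0; k <= FRAMES_PER_BUFFER / 2; ++k)  /* floor(N/2)+1 output bins */
    {
        double mag = scale * sqrt(complexOut[k][0] * complexOut[k][0]
                                + complexOut[k][1] * complexOut[k][1]);
        if (k != 0 && 2 * k != FRAMES_PER_BUFFER)     /* all bins except DC and Nyquist */
            mag *= 2.0;                               /* fold in the negative-frequency half */
        freq_spectrum[k] = (short)mag;
    }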

FFT window causing unequal amplification across frequency spectrum

I am using FFTW to create a spectrum analyzer in C++.
After applying any window function to an input signal, the output amplitude suddenly seems to scale with frequency.
[Graphs: rectangular window vs. Exact-Blackman window.]
The graphs are scaled logarithmically, with a sampling frequency of 44100 Hz. All harmonics are generated at the same level, peaking at 0 dB in the rectangular case. The Exact-Blackman output was amplified by 7.35 dB to attempt to make up for the window's processing gain.
Here is my code for generating the input table...
freq = 1378.125f;
for (int i = 0; i < FFT_LOGICAL_SIZE; i++)
{
    float term = 2 * PI * i / FFT_ORDER;
    for (int h = 1; freq * h < FREQ_NYQST; h += 1) // Harmonics up to Nyquist
    {
        fftInput[i] += sinf(freq * h * K_PI * i / K_SAMPLE_RATE); // Generate sine
        fftInput[i] *= (7938 / 18608.f) - ((9240 / 18608.f) * cosf(term)) + ((1430 / 18608.f) * cosf(term * 2)); // Exact-Blackman window
    }
}
fftwf_execute(fftwR2CPlan);
fftwf_execute(fftwR2CPlan);
Increasing or decreasing the window size changes nothing. I tested with the Hamming window as well; same problem.
Here is my code for grabbing the output:
float val; // Used elsewhere
for (int i = 1; i < K_FFT_COMPLEX_BINS_NOLAST; i++) // Skips the DC and Nyquist bins
{
    real = fftOutput[i][0];
    complex = fftOutput[i][1];
    // Grabs the values and scales based on the window size
    val = sqrtf(real * real + complex * complex) / FFT_LOGICAL_SIZE_OVER_2;
    val *= powf(10, 7.35f / 20); // dB to linear is 10^(dB/20); only applied during the Exact-Blackman test
}
Curiously, I attempted the following to try to flatten out the response in the Exact-Blackman case. Scaling back down this way resulted in a nearly, but still not perfectly, flat response. Neat, but it still doesn't explain to me why this is happening.
float x = (float)(FFT_COMPLEX_BINS - i) / FFT_COMPLEX_BINS; // Linear from 0 to 1
x = log10f((x * 9) + 1.3591409f); // Now logarithmic from 0 to 1, offset by half of Euler's number e
val = sqrt(real * real + complex * complex) / (FFT_LOGICAL_SIZE_OVER_2 / x); // Division by x added to this line
This might be a bug: you seem to be applying your window function multiple times per sample. Windowing should be removed from the input compositing loop and applied to the input vector just once, right before the FFT.
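A minimal sketch of that separation, reusing the question's own names and expressions (any quirks in those expressions are carried over unchanged): sum the harmonics first, then multiply each input sample by the window exactly once.

for (int i = 0; i < FFT_LOGICAL_SIZE; i++)
{
    fftInput[i] = 0.0f;
    for (int h = 1; freq * h < FREQ_NYQST; h += 1) // Harmonics up to Nyquist, as before
        fftInput[i] += sinf(freq * h * K_PI * i / K_SAMPLE_RATE);
}
for (int i = 0; i < FFT_LOGICAL_SIZE; i++)
{
    float term = 2 * PI * i / FFT_ORDER;
    fftInput[i] *= (7938 / 18608.f) - ((9240 / 18608.f) * cosf(term)) + ((1430 / 18608.f) * cosf(term * 2)); // Exact-Blackman, applied once
}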
I was not able to reproduce your code because I do not have the library on hand. However, this may be a consequence of spectral leakage: https://en.wikipedia.org/wiki/Spectral_leakage
Leakage is an inevitability of window functions, as well as of sampling itself. If you look at the trade-offs section of that article, a window can either be adaptable to a wide range of frequencies or focused on a particular one. Since the frequency of your signal is increasing, perhaps the lower-frequency content outside your target is more subject to spectral leakage.

Reconstruct signal from FFT

I'm reconstructing a signal from the amplitude, frequency, and phase obtained from an FFT. After performing the FFT, I pick some of its frequencies and reconstruct the time-domain signal from that data. I know the IFFT is meant for this, but I don't want to use the IFFT.
The reconstruction seems fine, but there is some time lag between the two signals. The image shows the problem: the black trace is the original signal and the red one is the reconstruction.
If I understand correctly, the amplitude of frequency bin t is sqrt(real[t]*real[t] + imag[t]*imag[t]) and the phase is atan2(imag[t], real[t]).
So I used the formula amplitude * cos(2*π*x / frequency + phase) for each frequency bin, and summed the regenerated waves. As far as I know, this should generate a signal that fits the original, but it always ends up with some time lag from the original.
Yes, I think it's about phase, but phase is simple to calculate and it is working correctly; if it had an error, the reconstructed signal would not fit the original signal in shape.
This is the code to generate the cosine wave. I generate the cosine from sin(x + π/2).
std::vector<short> encodeSineWavePCM(
    const double frequency,
    const double amplitude,
    const double offSetPhase)
{
    const double pi = 3.1415926535897932384626;
    const int N = 44100; // 1 sec length wave
    std::vector<short> s(N);
    const double wavelength = 1.0 * N / frequency;
    const double unitlength = 2 * pi / wavelength;
    for (int i = 0; i < N; i++) {
        double val = sin(offSetPhase + i * unitlength);
        val *= amplitude;
        s[i] = (short)val;
    }
    return s;
}
What am I missing?
Quite normal. You're doing a frame-by-frame transform, which means each FFT frame is produced only after one full time frame has elapsed. When transforming back you get the inverse effect: your time frame starts after the FFT frame has been decoded.
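If the goal is simply to line the two signals up, one sketch of a fix (assuming the analysed frame starts at sample index frameStart, a name introduced here for illustration) is to reconstruct against the original timeline rather than from t = 0:

#include <cmath>
#include <vector>

// Sketch: regenerate one frequency component so that output sample i
// aligns with absolute sample (frameStart + i) of the original recording.
std::vector<short> reconstructComponent(double binFreqHz, double amplitude,
                                        double phase, int frameStart,
                                        int numSamples, double sampleRate)
{
    const double pi = 3.1415926535897932384626;
    std::vector<short> s(numSamples);
    for (int i = 0; i < numSamples; ++i) {
        double t = (frameStart + i) / sampleRate; // time on the original timeline
        s[i] = (short)(amplitude * std::cos(2 * pi * binFreqHz * t + phase));
    }
    return s;
}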

How to perform FFT on WAV file data?

I'm trying to analyse the audio quality of a file by detecting the highest frequency present (compressed audio will generally be filtered to something less than 20KHz).
I'm reading the WAV file data using a class from the soundstretch library, which returns PCM samples as floats, and then performing an FFT on those samples with the fftw3 library. Then, for each frequency (rounded to the nearest KHz), I total up the amplitude at that frequency.
So for a low-quality file that doesn't contain frequencies above 16KHz, I would expect little or no amplitude above 16KHz; however, I'm not getting the results I would expect. Below is my code:
#include <iostream>
#include <math.h>
#include <fftw3.h>
#include <soundtouch/SoundTouch.h>
#include "include/WavFile.h"

using namespace std;
using namespace soundtouch;

#define BUFF_SIZE 6720
#define MAX_FREQ 22 // KHz

static float freqMagnitude[MAX_FREQ];

static void calculateFrequencies(fftw_complex *data, size_t len, int Fs) {
    for (size_t i = 0; i < len; i++) {
        double re = data[i][0]; /* double, not int: an int would truncate the values */
        double im = data[i][1];
        float magnitude = sqrt(re * re + im * im);
        float freq = (float)i * Fs / len;
        int index = freq / 1000; // bin rounded down to the nearest KHz
        if (index < MAX_FREQ) {  // '<', not '<=', to stay inside the array
            freqMagnitude[index] += magnitude;
        }
    }
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        cout << "Incorrect args" << endl;
        return -1;
    }
    SAMPLETYPE sampleBuffer[BUFF_SIZE];
    WavInFile inFile(argv[1]);
    fftw_complex *in, *out;
    fftw_plan p;
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * BUFF_SIZE);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * BUFF_SIZE);
    p = fftw_plan_dft_1d(BUFF_SIZE, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    while (inFile.eof() == 0) {
        size_t samplesRead = inFile.read(sampleBuffer, BUFF_SIZE);
        for (int i = 0; i < BUFF_SIZE; i++) {
            in[i][0] = (i < (int)samplesRead) ? (double) sampleBuffer[i] : 0.0; /* zero-pad a short final read */
            in[i][1] = 0.0;                                                     /* the imaginary part must be zeroed */
        }
        fftw_execute(p); /* repeat as needed */
        calculateFrequencies(out, samplesRead, inFile.getSampleRate());
    }
    for (int i = 0; i < MAX_FREQ; i += 2) {
        cout << i << "KHz magnitude: " << freqMagnitude[i] << std::endl;
    }
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
}
It can be compiled with the following (you'll need the soundtouch library and the fftw3 library):
g++ -g -Wall MP3.cpp include/WavFile.cpp -lfftw3 -lm -lsoundtouch -I/usr/local/include -L/usr/local/lib
And here is the spectral analysis of the file I am testing on:
As you can see, the content is cut off at 16KHz; however, my results are as follows:
0KHz magnitude: 4.61044e+07
2KHz magnitude: 5.26959e+06
4KHz magnitude: 4.68766e+06
6KHz magnitude: 4.12703e+06
8KHz magnitude: 12239.6
10KHz magnitude: 456
12KHz magnitude: 3
14KHz magnitude: 650468
16KHz magnitude: 1.83266e+06
18KHz magnitude: 1.40232e+06
20KHz magnitude: 1.1477e+06
I would expect there to be no amplitude above 16KHz. Am I doing this right?
Is my calculation of the frequency correct? (I robbed it off another Stack Overflow answer.)
Could it be something to do with there being 2 channels that I'm not separating?
Cheers for any help guys.
You are likely measuring the interleaved difference between the two stereo channels, which can include high frequencies due to unequal mix and pan. Try again with the channels separated or mixed down to mono, and use a smooth window function to reduce FFT aperture edge artifacts, which can also add a small amount of high-frequency noise due to your rectangular window.
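A minimal sketch of the mono mixdown, assuming sampleBuffer holds interleaved L/R floats as in the question's read loop:

size_t monoSamples = samplesRead / 2;
for (size_t i = 0; i < monoSamples; ++i) {
    in[i][0] = 0.5 * (sampleBuffer[2 * i] + sampleBuffer[2 * i + 1]); /* (L + R) / 2 */
    in[i][1] = 0.0;
}
/* The FFT plan length should then match monoSamples rather than BUFF_SIZE. */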
A fundamental requirement of the FFT is that the samples be equally spaced in time and belong to the same signal.
In your case, a stereo signal supplies the FFT algorithm with twice the number of samples, uncorrelated with each other. What is mathematically seen is the natural phase difference between the two channels but, more importantly, two adjacent samples that, being unrelated, can differ so much that they wrongly represent a square wave (in the time domain this would appear as an extremely high signal slew rate).
As a solution, you have to separate the two channels and perform the FFT on a single series of samples, or perform two different FFTs.
I don't think there can be any aliasing problem here, because aliasing is normally related to the sampling process and is prevented with an analog filter whose passband is below half the sampling frequency (the Nyquist, or anti-alias, filter). If that filtering is missing, there is almost no way to remove the ghosts (alias spectra) afterwards.
I speak as someone with very slight real-world experience and book learning from over a decade ago, so this answer might be evidence of a little knowledge being a dangerous thing, but I think the problem you're seeing is just aliasing.
Imagine a perfect square wave. You've never heard one, because it would require a sound source to transition instantly from one position to another while still pushing air particles about.
You also can't describe a square wave with a finite number of harmonics. However, you can trivially describe a square wave in PCM audio at any sample rate. Therefore any source PCM audio can appear to contain an infinite number of harmonics.
What you can probably do is just sit atop Nyquist and say that if the input audio is sampled at N Hz, then the highest frequency that can be actual signal is N/2 Hz; you can therefore resample the input down to twice the highest rate at or below N/2 Hz that shows significant signal, without losing meaningful content.

How to calculate number of samples in audio given some parameters?

Given following parameters:
Sample size: 16
Channel count: 2
Codec: audio/pcm
Byte order: little endian
Sample rate: 11025
Sample type: signed int
How can I determine the number of samples for N milliseconds of recorded audio? I'm new to audio processing. The codec is PCM, so I guess it's uncompressed audio.
I'm using Qt 4.8 on Windows 7 Ultimate x64.
/**
 * Converts milliseconds to samples of buffer.
 * @param ms the time in milliseconds
 * @return the size of the buffer in samples
 */
int msToSamples( int ms, int sampleRate, int channels ) {
    return (int)(((long) ms) * sampleRate * channels / 1000);
}

/* get size of a buffer to hold nSamples */
int samplesToBytes(int nSamples, int sampleSizeBits) {
    return nSamples * (sampleSizeBits / 8);
}
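For example, with the parameters from the question (11025 Hz, 2 channels, 16-bit samples), the functions above give:

int samples = msToSamples(250, 11025, 2);  // 250 ms -> 5512 samples (both channels)
int bytes   = samplesToBytes(samples, 16); // 5512 samples -> 11024 bytes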
Reference
I think it is important here for you to understand what each of these terms means, so that you can then write the code that gives you what you want.
The sample rate is the number of samples per second of audio; in your case 11025 (sometimes expressed as 11.025 KHz). This is quite low compared to something like CD audio, which is 44.1 KHz (a sample rate of 44100), and there are higher standards such as 48 KHz and 96 KHz.
Next you have the number of bits used for each sample; this is typically 8/16/24/32 bits.
Next, you can have an arbitrary number of channels for each sample.
So the code sample already posted shows how to apply these numbers together: multiplying the number of channels by the sample size by the sample rate gives you the data size for a single second of audio, and dividing that by 1000 gives you milliseconds.
This can get quite tricky when you start applying this to video, which deals in frames that run either at nice numbers like 25/30/50/60 frames a second or at the NTSC-based rates of 23.98/29.97/59.94 frames a second, in which case you have to do horrible calculations to make sure they align correctly.
Hope this helps.
Here is a solution in pseudocode:
Given the
    dur = 20     ... duration in milliseconds
    sr  = 11025  ... sampling rate in Hz
the number of samples N is
    N = sr * dur / 1000 = 220.5
You will need to round that to the nearest integer.

Interpretation of DirectSound buffer elements from mic capture device

I am doing some maintenance work involving DirectSound buffers. I would like to know how to interpret the elements in the buffer; that is, to know what each value in the buffer represents. This data is coming from a microphone.
This wave format is being used:
WAVEFORMATEXTENSIBLE format = {
    /* wFormatTag, nChannels, nSamplesPerSec, nAvgBytesPerSec, nBlockAlign, wBitsPerSample, cbSize */
    { WAVE_FORMAT_EXTENSIBLE, 1, sample_rate, sample_rate * 4, 4, 32, 22 },
    { 32 },                         /* Samples.wValidBitsPerSample */
    0,                              /* dwChannelMask */
    KSDATAFORMAT_SUBTYPE_IEEE_FLOAT /* SubFormat */
};
My goal is to detect microphone silence. I am currently accomplishing this by simply determining whether all values in the buffer fail to exceed some threshold volume, assuming that the intensity of each buffer element directly corresponds to volume.
This is what I am currently trying:
#include <algorithm>

bool is_mic_silent(float * data, unsigned int num_samples, float threshold)
{
    if (num_samples == 0) { // max_element never returns null; guard the empty range instead
        return true;
    }
    float max = *std::max_element(data, data + num_samples);
    if (max < threshold) {
        return true;
    }
    return false; // At least one value is sufficiently loud.
}
As MSN said, the samples are 32-bit floats. To detect silence you would normally calculate the RMS value: take the average of the squared sample values over some time interval (say 20-50 ms) and compare (the square root of) this average to a threshold.
The noise inherent in the microphone signal may push single samples above the threshold even when the ambient sound would still be considered silence. Averaging over a short interval results in a value that corresponds better to our perception.
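A minimal sketch of that RMS check (the function name and signature are illustrative, not from the original code):

#include <cmath>
#include <cstddef>

// Sketch: RMS of one interval of samples (e.g. 20-50 ms worth) vs. a threshold.
bool is_silent_rms(const float * data, std::size_t num_samples, float threshold)
{
    if (num_samples == 0)
        return true;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < num_samples; ++i)
        sum_sq += (double)data[i] * data[i]; // squaring removes the sign
    return std::sqrt(sum_sq / num_samples) < threshold;
}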
From here, floating-point PCM values are in the range [-1, 1].
In addition to Han's suggestion to average samples, also consider calibrating your threshold value. Under different environments, with different microphones and different audio channels, "silence" can mean a lot of things.
The simple way would be allowing the user to configure the threshold. Alternatively, allow a "noise floor measurement" from which you acquire the threshold value.
Note that the samples are linear, but levels in audio processing are usually given in dB. So depending on your target audience, you may want to convert readings and inputs to/from dB.
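As a sketch of that conversion, using the usual 20·log10 convention for amplitude (names are illustrative):

#include <cmath>

// Sketch: convert between linear amplitude (full scale = 1.0) and dBFS.
double amplitude_to_db(double amplitude) { return 20.0 * std::log10(amplitude); }
double db_to_amplitude(double db)        { return std::pow(10.0, db / 20.0); }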