I am trying to recognise a sequence of audio frames on an embedded system - an audio frame being a single frequency, or a linear interpolation between two frequencies, held for a variable amount of time. I know the sounds I am trying to recognise (i.e. the start and end frequencies being interpolated and the duration of each audio frame), but they are produced by another embedded system, so the microphone and speaker are cheap and somewhat inaccurate. The output is a square wave. Any suggestions on how to go about doing this?
What I am trying to do now is to use an FFT to get the magnitude of all frequencies, detect the peaks, look at the detection from duration/2 ms ago and check whether that roughly matches an audio frame, and finally check whether any of the sounds I am looking for matches the detected sequence.
So far I have used the FFT to process the microphone input - after applying a Hann window - and then assigned each frequency bin a coefficient for how likely it is to be a peak, based on how many standard deviations it is away from the mean. This hasn't worked great, since it detected peaks even when the room was silent. Any ideas on how to detect the peaks more accurately? Also, I think there are a lot of harmonics because of the square wave / interpolation? Can I do a harmonic product spectrum if the peaks don't really line up at double the frequency?
Here I graphed noise (an almost silent room) and a tone somewhere in the interpolation between 2226 Hz and 1624 Hz.
https://i.stack.imgur.com/R5Gs2.png
I sample every 91 microseconds -> 10989 Hz. Should I sample more often?
I added here samples of how the interpolation sounds when recorded on my laptop and on the embedded system.
https://easyupload.io/m/5l72b0
#define MIC_SAMPLE_RATE 10989 // Hz
#define AUDIO_SAMPLES_NUMBER 1024
MicroBitAudioProcessor::MicroBitAudioProcessor(DataSource& source) : audiostream(source)
{
arm_rfft_fast_init_f32(&fft_instance, AUDIO_SAMPLES_NUMBER);
buf = (float *)malloc(sizeof(float) * (AUDIO_SAMPLES_NUMBER * 2));
output = (float *)malloc(sizeof(float) * AUDIO_SAMPLES_NUMBER);
mag = (float *)malloc(sizeof(float) * AUDIO_SAMPLES_NUMBER / 2);
}
float hann(int i){
return 0.5 * (1 - arm_cos_f32(2 * 3.14159265 * i / AUDIO_SAMPLES_NUMBER));
}
int MicroBitAudioProcessor::pullRequest()
{
int s;
int result;
auto mic_samples = audiostream.pull();
if (!recording)
return DEVICE_OK;
int8_t *data = (int8_t *) &mic_samples[0];
int samples = mic_samples.length() / 2;
for (int i=0; i < samples; i++)
{
s = (int) *data;
result = s;
data++;
buf[(position++)] = (float)result;
if (position % AUDIO_SAMPLES_NUMBER == 0)
{
position = 0;
float maxValue = 0;
uint32_t index = 0;
// Apply a Hann window
for(int i=0; i< AUDIO_SAMPLES_NUMBER; i++)
buf[i] *= hann(i);
arm_rfft_fast_f32(&fft_instance, buf, output, 0);
arm_cmplx_mag_f32(output, mag, AUDIO_SAMPLES_NUMBER / 2);
}
}
return DEVICE_OK;
}
uint32_t frequencyToIndex(int freq) {
// The bin width is MIC_SAMPLE_RATE / AUDIO_SAMPLES_NUMBER (~10.73 Hz here); doing that
// division in integer arithmetic truncates it to 10 and skews every lookup, so compute
// the index in floating point and round instead.
return (uint32_t)((float)freq * AUDIO_SAMPLES_NUMBER / MIC_SAMPLE_RATE + 0.5f);
}
float MicroBitAudioProcessor::getFrequencyIntensity(int freq){
uint32_t index = frequencyToIndex(freq);
if (index <= 0 || index >= (AUDIO_SAMPLES_NUMBER / 2) - 1) return 0;
return mag[index];
}
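For reference, a minimal sketch of the peak-scoring step described above (flagging bins that are local maxima and sit several standard deviations above the mean magnitude). This is a hypothetical helper, not part of the actual MicroBitAudioProcessor, and the thresholds are illustrative:
#include <math.h>

// Flag bins that are both a local maximum and more than `sigmas` standard deviations
// above the mean magnitude. Returns the number of peaks found; the peak bin indices
// are written into `peaks` (at most maxPeaks of them).
int findPeaks(const float *mag, int bins, float sigmas, int *peaks, int maxPeaks)
{
    float mean = 0.0f, var = 0.0f;
    for (int i = 1; i < bins; i++)          // skip bin 0 (DC)
        mean += mag[i];
    mean /= (bins - 1);
    for (int i = 1; i < bins; i++)
        var += (mag[i] - mean) * (mag[i] - mean);
    float sd = sqrtf(var / (bins - 1));

    int found = 0;
    for (int i = 2; i < bins - 1 && found < maxPeaks; i++)
        if (mag[i] > mean + sigmas * sd &&
            mag[i] > mag[i - 1] && mag[i] > mag[i + 1])   // local maximum
            peaks[found++] = i;
    return found;
}
In practice an absolute magnitude floor (a calibrated silence level that the peak itself must exceed) on top of the sigma test is usually what stops a silent room from producing false peaks, because in near-silence the spectrum is flat and a few bins will always sit a couple of standard deviations above the mean.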
Related
I'm currently trying to display an audio spectrum using FFTW3 and SFML. I've followed the directions found here and looked at numerous references on FFTs, spectrums, and FFTW, yet somehow my bars are almost all aligned to the left like below. Another issue I'm having is that I can't find information on what the scale of the FFT output is. Currently I'm dividing it by 64, yet it still occasionally reaches beyond that. And further still, I have found no information on why the output from FFTW has to be the same size as the input. So my questions are:
Why is the majority of my spectrum aligned to the left unlike the image below mine?
Why isn't the output between 0.0 and 1.0?
Why is the input sample count related to the fft output count?
What I get:
What I'm looking for:
const int bufferSize = 256 * 8;
void init() {
sampleCount = (int)buffer.getSampleCount();
channelCount = (int)buffer.getChannelCount();
for (int i = 0; i < bufferSize; i++) {
window.push_back(0.54f - 0.46f * cos(2.0f * GMath::PI * (float)i / (float)bufferSize));
}
plan = fftwf_plan_dft_1d(bufferSize, signal, results, FFTW_FORWARD, FFTW_ESTIMATE);
}
void update() {
int mark = (int)(sound.getPlayingOffset().asSeconds() * sampleRate);
for (int i = 0; i < bufferSize; i++) {
float s = 0.0f;
if (i + mark < sampleCount) {
s = (float)buffer.getSamples()[(i + mark) * channelCount] / (float)SHRT_MAX * window[i];
}
signal[i][0] = s;
signal[i][1] = 0.0f;
}
}
void draw() {
int inc = bufferSize / 2 / size.x;
int y = size.y - 1;
int max = size.y;
for (int i = 0; i < size.x; i ++) {
float total = 0.0f;
for (int j = 0; j < inc; j++) {
int index = i * inc + j;
total += std::sqrt(results[index][0] * results[index][0] + results[index][1] * results[index][1]);
}
total /= (float)(inc * 64);
Rectangle2I rect = Rectangle2I(i, y, 1, -(int)(total * max)).absRect();
g->setPixel(rect, Pixel(254, toColor(BLACK, GREEN)));
}
}
All of your questions are related to FFT theory. Study the properties of the FFT from any standard text/reference book and you will be able to answer these questions yourself.
The least you can start from is here:
https://en.wikipedia.org/wiki/Fast_Fourier_transform.
Many FFT implementations are energy preserving. That means the scale of the output is linearly related to the scale and/or size of the input.
An FFT is a DFT, which is a square matrix transform. So the number of outputs will always be equal to the number of inputs (or half that, ignoring the redundant complex-conjugate half given strictly real input), unless some outputs are thrown away. If not, it's not an FFT. If you want fewer outputs, there are ways to downsample the FFT output or post-process it in other ways.
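As a concrete sketch of the scaling point: FFTW computes an unnormalized transform, so a unit-amplitude sine shows up in its bin with a magnitude of roughly bufferSize/2 (times the coherent gain of the window, about 0.54 for the Hamming window used above). Dividing by bufferSize/2 therefore brings the bars into roughly [0, 1]; the fftwf types are the same as in the question:
#include <fftw3.h>
#include <cmath>

// Magnitude of one bin, normalized so a full-scale tone lands near 1.0.
// FFTW's forward transform is unnormalized, so the raw magnitude scales with bufferSize.
float magnitudeNormalized(const fftwf_complex *results, int bin, int bufferSize)
{
    float re = results[bin][0];
    float im = results[bin][1];
    return std::sqrt(re * re + im * im) / (bufferSize / 2.0f);
}
Whether you divide by bufferSize or bufferSize/2 is a convention (it depends on whether you fold the negative-frequency half into the positive bins); either way the scale grows with the transform size, which is why a fixed divisor of 64 occasionally overflows.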
I’m a beginner in DSP and I have to make an audio equalizer.
I've done some research and tried a lot of things over the past month, but in the end it's not working and I'm a bit overwhelmed by all that information (which I certainly don't interpret well).
I have two main classes: Broadcast (which generates pink noise and applies gain to it) and Record (which analyses the microphone input and deduces the gain from it).
I have some trouble with both, but I’m gonna limit this post to the Broadcast side.
I’m using Aquila DSP Library, so I used this example and extended the logic of it.
/* Constructor */
Broadcast::Broadcast() :
_Info(44100, 2, 2), // 44100 Hz, 2 channels, sample size: 2 bytes
_pinkNoise(_Info.GetFrequency()), // Init the Aquila::PinkNoiseGenerator
_thirdOctave() // list of "Octave" objects, each containing the min, center, and max frequency of a [⅓ octave band](http://goo.gl/365ZFN)
{
_pinkNoise.setAmplitude(65536);
}
/* This method is called in a loop and fills the buffer with the pink noise */
bool Broadcast::BuildBuffer(char * Buffer, int BufferSize, int & BufferCopiedSize)
{
if (BufferSize < 131072)
return false;
int SampleCount = 131072 / _Info.GetSampleSize();
int signalSize = SampleCount / _Info.GetChannelCount();
_pinkNoise.generate(signalSize);
auto fft = Aquila::FftFactory::getFft(signalSize);
Aquila::SpectrumType spectrum = fft->fft(_pinkNoise.toArray());
Aquila::SpectrumType ampliSpectrum(signalSize);
std::list<Octave>::iterator it;
double gain, fl, fh;
/* [1.] - The gains are applied in this loop */
for (it = _thirdOctave.begin(); it != _thirdOctave.end(); it++)
{
/* Test values */
if ((*it).getCtr() >= 5000)
gain = 6.0;
else
gain = 0.0;
fl = (signalSize * (*it).getMin() / _Info.GetFrequency());
fh = (signalSize * (*it).getMax() / _Info.GetFrequency());
/* [2.] - THIS is the part that I think is wrong */
for (int i = 0; i < signalSize; i++)
{
if (i >= fl && i < fh)
ampliSpectrum[i] = std::pow(10, gain / 20);
else
ampliSpectrum[i] = 1.0;
}
/* [3.] - Multiply each bin of spectrum with ampliSpectrum */
std::transform(
std::begin(spectrum),
std::end(spectrum),
std::begin(ampliSpectrum),
std::begin(spectrum),
[](Aquila::ComplexType x, Aquila::ComplexType y) { return x * y; }); // Aquila::ComplexType is an std::complex<double>
}
/* Put the IFFT result in a new buffer */
boost::scoped_array<double> s(new double[signalSize]);
fft->ifft(spectrum, s.get());
int val;
for (int i = 0; i < signalSize; i++)
{
val = int(s.get()[i]);
/* Fills the two channels with the same value */
reinterpret_cast<int*>(Buffer)[i * 2] = val;
reinterpret_cast<int*>(Buffer)[i * 2 + 1] = val;
}
BufferCopiedSize = SampleCount * _Info.GetSampleSize();
return true;
}
I'm using the pink noise of GStreamer along with the equalizer-nbands module to compare my output.
With all gains set to 0.0, the outputs are the same.
But as soon as I add some gain, the outputs sound different (even though my output still sounds like pink noise, and seems to have gain in the right spots).
So my question is: how can I apply my gains to each ⅓-octave band in the frequency domain?
My research suggests I should build a filter bank of band-pass filters, but how do I do that with the result of an FFT?
Thanks for your time.
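One hedged observation on step [2.] above, as a generic C++ sketch rather than Aquila-specific code: when bins of a full-length complex spectrum are scaled and then fed to an inverse FFT, the mirrored negative-frequency bins (index signalSize - i) must be scaled by the same factor, otherwise the time-domain output is no longer purely real. The band edges fl and fh are assumed to already lie within [1, signalSize/2]:
#include <complex>
#include <vector>
#include <cmath>

// Apply a gain of `gainDb` decibels to the bins [fl, fh) of a full-length complex
// spectrum, keeping the conjugate symmetry intact so the IFFT stays real.
void applyBandGain(std::vector<std::complex<double>> &spectrum,
                   size_t fl, size_t fh, double gainDb)
{
    const double g = std::pow(10.0, gainDb / 20.0);
    const size_t n = spectrum.size();
    for (size_t i = fl; i < fh && i <= n / 2; ++i)
    {
        spectrum[i] *= g;
        if (i != 0 && i != n - i)       // scale the mirrored bin identically
            spectrum[n - i] *= g;
    }
}
Scaling each generated block's spectrum independently and IFFT-ing it also means the blocks do not join smoothly (circular-convolution edge effects); overlap-add with a window is the usual remedy, and that may explain why the result sounds different from the gstreamer equalizer even when the per-band gains look right.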
I am using the v4l2 API to grab images from a Microsoft Lifecam and then transferring these images over TCP to a remote computer. I am also encoding the video frames into MPEG2VIDEO using the ffmpeg API. These recorded videos play too fast, which is probably because not enough frames are being captured and due to incorrect FPS settings.
The following is the code which converts a YUV422 source to a RGB888 image. This code fragment is the bottleneck in my code as it takes nearly 100 - 150 ms to execute which means I can't log more than 6 - 10 FPS at 1280 x 720 resolution. The CPU usage is 100% as well.
for (int line = 0; line < image_height; line++) {
for (int column = 0; column < image_width; column++) {
*dst++ = CLAMP((double)*py + 1.402*((double)*pv - 128.0)); // R - first byte
*dst++ = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0)); // G - next byte
*dst++ = CLAMP((double)*py + 1.772*((double)*pu - 128.0)); // B - next byte
vid_frame->data[0][line * frame->linesize[0] + column] = *py;
// increment py, pu, pv here
}
'dst' is then compressed as JPEG and sent over TCP, and 'vid_frame' is saved to disk.
How can I make this code fragment faster so that I can get at least 30 FPS at 1280x720 resolution, compared to the present 5-6 FPS?
I've tried parallelizing the for loop across three threads using pthreads, processing one third of the rows in each thread.
for (int line = 0; line < image_height/3; line++) // thread 1
for (int line = image_height/3; line < 2*image_height/3; line++) // thread 2
for (int line = 2*image_height/3; line < image_height; line++) // thread 3
This gave me only a minor improvement of 20-30 milliseconds per frame.
What would be the best way to parallelize such loops? Can I use GPU computing or something like OpenMP? Say, spawning some 100 threads to do the calculations?
I also noticed higher frame rates with my laptop webcam as compared to the Microsoft USB Lifecam.
Here are other details:
Ubuntu 12.04, ffmpeg 2.6
AMD A8 quad-core processor with 6GB RAM
Encoder settings:
codec: AV_CODEC_ID_MPEG2VIDEO
bitrate: 4000000
time_base: (AVRational){1, 20}
pix_fmt: AV_PIX_FMT_YUV420P
gop: 10
max_b_frames: 1
If all you care about is fps and not ms per frame (latency), another option would be a separate thread per frame.
Threading is not the only option for speed improvements. You could also perform integer operations as opposed to floating point, and SIMD is an option. Using an existing library like sws_scale will probably give you the best performance (a sketch is shown after this list).
Make sure you are compiling with -O3 (or -Os).
Make sure debug symbols are disabled.
Move repeated operations outside the loop e.g.
// compiler can't optimize this because another thread could change frame->linesize[0]
int row = line * frame->linesize[0];
for (int column = 0; column < image_width; column++) {
...
vid_frame->data[0][row + column] = *py;
You can precompute tables, so there is no math in the loop:
init() {
for(int py = 0; py <= 255 ; ++py)
for(int pv = 0; pv <= 255 ; ++pv)
ytable[pv][py] = CLAMP(py + 1.402*(pv - 128.0)); // so that ytable[*pv][*py] matches the R formula above
}
for (int column = 0; column < image_width; column++) {
*dst++ = ytable[*pv][*py];
Just to name a few options.
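For the sws_scale suggestion above, here is a rough sketch of the general shape (not a drop-in for the exact code in the question; it assumes the camera delivers packed YUYV 4:2:2, which is what the Lifecam typically reports to V4L2, and that the resolution does not change between calls):
#include <cstdint>
extern "C" {
#include <libswscale/swscale.h>
}

// Convert one packed YUYV 4:2:2 frame to RGB24 using libswscale (SIMD-optimized).
void convertFrame(const uint8_t *yuyv, uint8_t *rgb, int width, int height)
{
    // Created once and reused; assumes a fixed resolution.
    static SwsContext *ctx = sws_getContext(
        width, height, AV_PIX_FMT_YUYV422,
        width, height, AV_PIX_FMT_RGB24,
        SWS_BILINEAR, NULL, NULL, NULL);

    const uint8_t *srcSlice[1] = { yuyv };
    int srcStride[1] = { 2 * width };   // 2 bytes per pixel in packed YUYV
    uint8_t *dst[1] = { rgb };
    int dstStride[1] = { 3 * width };   // 3 bytes per pixel in RGB24

    sws_scale(ctx, srcSlice, srcStride, 0, height, dst, dstStride);
}
It may also be worth letting sws_scale produce the AV_PIX_FMT_YUV420P frame that the MPEG-2 encoder wants directly, so that only the JPEG path needs RGB at all.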
I think unless you want to reinvent the painful wheel, using pre-existing options (ffmpeg's libswscale or ffmpeg's scale filter, gstreamer's scale plugin, etc.) is a much better option.
But if you want to reinvent the wheel for whatever reason, show the code you used. For example, thread startup is expensive, so you'd want to create the threads before measuring your looptime and reuse threads from frame-to-frame. Better yet is frame-threading, but that adds latency. This is usually ok but depends on your use case. More importantly, don't write C code, learn to write x86 assembly (simd), all previously mentioned libraries use simd for such conversions, and that'll give you a 3-4x speedup (since it allows you to do 4-8 pixels instead of 1 per iteration).
You could build blocks of x lines and convert each block in a separate thread
do not mix integer and floating point arithmetic!
char x;
char y=((double)x*1.5); /* ouch casting double<->int is slow! */
char z=(x*3)>>1; /* fixed point arithmetic rulez */
use SIMD (though this would be easier if both input and output data were properly aligned...e.g. by using RGB8888 as output)
use openMP
An alternative that does not require coding the processing yourself would be to do your entire processing using a framework that does proper timestamping throughout the pipeline (starting at image acquisition time) and is hopefully optimized enough to deal with big data, e.g. gstreamer.
Would something like this not work?
#pragma omp parallel for
for (int line = 0; line < image_height; line++) {
for (int column = 0; column < image_width; column++) {
dst[ ( image_width*line + column )*3 ] = CLAMP((double)*py + 1.402*((double)*pv - 128.0)); // R - first byte
dst[ ( image_width*line + column )*3 + 1] = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0)); // G - next byte
dst[ ( image_width*line + column )*3 + 2] = CLAMP((double)*py + 1.772*((double)*pu - 128.0)); // B - next byte
vid_frame->data[0][line * frame->linesize[0] + column] = *py;
// increment py, pu, pv here
}
Of course, you also have to handle the incrementing of py, pu, pv accordingly.
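Regarding that incrementing: once OpenMP splits the rows across threads, running pointers no longer work, so one option is to compute the sample positions from line and column instead. A rough sketch, assuming the capture format is packed YUYV 4:2:2 (Y0 U0 Y1 V0 per pixel pair) and reusing the question's names (src is assumed to point at the start of the captured frame; CLAMP, dst, vid_frame and frame are as in the question):
// Per-pixel addressing into a packed YUYV 4:2:2 buffer, so every iteration is
// independent and safe to run under "#pragma omp parallel for".
#pragma omp parallel for
for (int line = 0; line < image_height; line++) {
    for (int column = 0; column < image_width; column++) {
        const unsigned char *pair = src + (line * image_width + (column & ~1)) * 2;
        int y = pair[(column & 1) ? 2 : 0];   // Y0 or Y1 of the macropixel
        int u = pair[1];
        int v = pair[3];
        unsigned char *out = dst + (line * image_width + column) * 3;
        out[0] = CLAMP(y + 1.402 * (v - 128.0));                        // R
        out[1] = CLAMP(y - 0.344 * (u - 128.0) - 0.714 * (v - 128.0));  // G
        out[2] = CLAMP(y + 1.772 * (u - 128.0));                        // B
        vid_frame->data[0][line * frame->linesize[0] + column] = (unsigned char)y;
    }
}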
Usually pixel format transformation is performed using only integer variables.
This avoids conversions between floating point and integer values.
It also allows more effective use of the SIMD extensions of modern CPUs.
For example, this is code for converting YUV to BGR:
const int Y_ADJUST = 16;
const int UV_ADJUST = 128;
const int YUV_TO_BGR_AVERAGING_SHIFT = 13;
const int YUV_TO_BGR_ROUND_TERM = 1 << (YUV_TO_BGR_AVERAGING_SHIFT - 1);
const int Y_TO_RGB_WEIGHT = int(1.164*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_BLUE_WEIGHT = int(2.018*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_GREEN_WEIGHT = -int(0.391*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_GREEN_WEIGHT = -int(0.813*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_RED_WEIGHT = int(1.596*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
inline int RestrictRange(int value, int min = 0, int max = 255)
{
return value < min ? min : (value > max ? max : value);
}
inline int YuvToBlue(int y, int u)
{
return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
U_TO_BLUE_WEIGHT*(u - UV_ADJUST) +
YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToGreen(int y, int u, int v)
{
return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
U_TO_GREEN_WEIGHT*(u - UV_ADJUST) +
V_TO_GREEN_WEIGHT*(v - UV_ADJUST) +
YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToRed(int y, int v)
{
return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
V_TO_RED_WEIGHT*(v - UV_ADJUST) +
YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
This code is taken from here (http://simd.sourceforge.net/). There you can also find code optimized for different SIMD instruction sets.
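As a usage sketch (not part of the original answer), the helpers above drop into a loop of the same shape as the one in the question; py, pu, pv and dst are the question's pointers. Note that these weights assume limited-range (16-235) luma, unlike the full-range formula in the question, so the output will differ slightly:
// Using the integer helpers above in the question's inner loop (BGR order;
// swap the Blue and Red calls if RGB output is needed).
for (int line = 0; line < image_height; line++) {
    for (int column = 0; column < image_width; column++) {
        int y = *py, u = *pu, v = *pv;
        *dst++ = YuvToBlue(y, u);
        *dst++ = YuvToGreen(y, u, v);
        *dst++ = YuvToRed(y, v);
        // increment py, pu, pv here, as in the original loop
    }
}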
I would like to make real time audio processing with Qt and display the fundamental frequency using FFTW3.
What I've done in steps:
I capture sound from the computer's input device and fill it into a buffer.
I assign the sound samples to a double array.
I compute the fundamental frequency.
Problem
My code always returns 0 as fundamental frequency.
QByteArray buffer;
QAudioInput *audioInput;
audioInput = new QAudioInput(format, this);
//Check the number of bytes available in the input buffer
qint64 len = audioInput->bytesReady();
//Limit sample size
if(len > 4096)
len = 4096;
//Read sound samples from input device to buffer
qint64 l = input->read(buffer.data(), len);
if(l > 0)
{
int input_size = BufferSize;
// Compute corresponding number of complex output samples
int output_size = (input_size/2 + 1);
double *input_buffer = static_cast<double*>(fftw_malloc(input_size * sizeof(double)));
fftw_complex *out = static_cast<fftw_complex*>(fftw_malloc(output_size * sizeof(fftw_complex)));
//Assign sound samples to double array
input_buffer = (double*)buffer.data();
fftw_plan p3;
//Create plan
p3 = fftw_plan_dft_r2c_1d(input_size, input_buffer, out, FFTW_ESTIMATE);
fftw_execute(p3);
double reout[BufferSize];
double imgout[BufferSize];
double magnitude[BufferSize/2];
long ffond = 0.0; // Position of the frequency
double max = 0; // Maximal amplitude
for (int i = 0; i < BufferSize/2; i++)
{
reout[i] = out[i][0];
imgout[i] = out[i][1];
cout << imgout[i] << endl;
magnitude[i] = sqrt(reout[i]*reout[i] + imgout[i]*imgout[i]); //Calculate the magnitude of this bin
double t = sqrt(reout[i]*reout[i] + imgout[i]*imgout[i]);
if(t > max)
{
max = t;
ffond = i;
}
}
// Convert the peak bin index to Hz: bin * sample rate / FFT size
qDebug() << "fundamental frequency is :" << QString::number(ffond * static_cast<double>(format.sampleRate()) / input_size);
fftw_destroy_plan(p3);
You have two immediate problems that I can see:
you are not applying a window function, so there will be considerable spectral leakage and associated "smearing" of the spectrum (and probably a large DC (0 Hz) component with associated "skirt")
you are assuming that the largest magnitude in the spectrum is the fundamental frequency, which will most likely be incorrect for two reasons: (a) you may well have a large 0 Hz component which is larger than your fundamental or harmonics and (b) depending on the nature of the sound you are trying to analyse, the fundamental may be smaller in magnitude than the harmonics (it may even be missing completely)
I suggest you do the following:
apply a suitable window function prior to the FFT - this should make your peaks better defined and should reduce the artefacts at 0 Hz and just above
start your search at an appropriate bin rather than 0, e.g. if the minimum fundamental frequency you are interested in is say 50 Hz then start at the corresponding bin for 50 Hz rather than at 0
add a debug option to display the spectrum graphically - this visual debugging aid will help greatly when you are wondering why your results do not make sense
if what you are really trying to measure is pitch rather than fundamental frequency, then read up on pitch detection algorithms, e.g. Harmonic Product Spectrum - this will work a lot better than the naïve approach of trying to identify a fundamental (whose frequency will not be the same as the pitch in the general case)
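Putting those suggestions together, a minimal sketch using the same FFTW r2c setup as the question (the Hann window, the 50 Hz lower bound and the 3-term harmonic product are illustrative choices, not requirements):
#include <fftw3.h>
#include <cmath>
#include <vector>

// Estimate pitch from N real samples: Hann window, r2c FFT, 3-term harmonic
// product spectrum, peak search starting at roughly 50 Hz instead of at DC.
double estimatePitch(const double *samples, int N, double sampleRate)
{
    const double PI = 3.14159265358979323846;
    std::vector<double> in(N);
    for (int i = 0; i < N; i++)                        // apply a Hann window
        in[i] = samples[i] * 0.5 * (1.0 - std::cos(2.0 * PI * i / (N - 1)));

    int bins = N / 2 + 1;
    fftw_complex *out = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * bins);
    fftw_plan p = fftw_plan_dft_r2c_1d(N, in.data(), out, FFTW_ESTIMATE);
    fftw_execute(p);

    std::vector<double> mag(bins);
    for (int i = 0; i < bins; i++)
        mag[i] = std::sqrt(out[i][0] * out[i][0] + out[i][1] * out[i][1]);

    int minBin = (int)(50.0 * N / sampleRate);         // skip DC and anything below ~50 Hz
    if (minBin < 1) minBin = 1;
    int best = minBin;
    double bestVal = 0.0;
    for (int i = minBin; i < bins / 3; i++)            // harmonic product spectrum
    {
        double hps = mag[i] * mag[2 * i] * mag[3 * i];
        if (hps > bestVal) { bestVal = hps; best = i; }
    }

    fftw_destroy_plan(p);
    fftw_free(out);
    return best * sampleRate / N;                      // bin index -> Hz
}
A plot of mag[] per frame is also the quickest way to get the visual debugging aid mentioned above.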
I am trying to create a very simple C++ program that, given an argument in the range [0-100], applies a low-pass filter to a grayscale image that should "compress" it proportionally to the value of the given argument.
I am using the FFTW library.
I have some doubts about how I define the frequency threshold, cut. Is there a more effective way to define such a value?
//fftw_complex *fft
//double[] magnitude
// . . .
int percent = 100;
if (percent < 0 || percent > 100) {
cerr << "Compression rate must be a value between 0 and 100." << endl;
return -1;
}
double cut =(double)(w*h) * ((double)percent / (double)100);
for (i = 0; i < (w * h); i++) {
magnitude[i] = sqrt(pow(fft[i][0], 2.0) + pow(fft[i][1], 2.0));
if (magnitude[i] < cut) {
fft[i][0] = 0.0;
fft[i][1] = 0.0;
}
}
Update1:
I've changed my code to this, but again I'm not sure this is a proper way to filter frequencies. The image is surely compressed, but non-square images get messed up, and setting the compression to 100% isn't the real maximum available (I can go up to ~140%).
Here you can find an image of what I see now.
int cX = w/2;
int cY = h/2;
cout<<"TEST "<<((double)percent/(double)100)*h<<endl;
for(i = 0; i<(w*h);i++){
int row = i/s;
int col = i%s;
int distance = sqrt((col-cX)*(col-cX)+(row-cY)*(row-cY));
if(distance<((double)percent/(double)100)*min(cX,cY)){
fft[i][0] = 0.0;
fft[i][1] = 0.0;
}
}
This is not a low-pass filter at all. A low-pass filter passes low frequencies, i.e. it removes fine details (blurring). You obviously need a 2D FFT for that.
This code just removes random bits, essentially.
[edit]
The new code looks a lot more like a low-pass filter. The 141% setting is expected: the diagonal of a square is sqrt(2)=1.41 times its side. Converting an index into a row/column pair should use the image width, not some random unexplained s.
I don't know where your zero frequency is located. That should be easy to spot (the largest value), but it might be in (0,0) instead of (w/2,h/2).
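A hedged sketch of the masking loop with the two points above addressed: row and column are derived from the image width w, and the zero frequency is assumed to be at bin (0,0), as FFTW leaves it, so frequency distances are measured with wrap-around rather than from the centre of the array. Mapping percent to the cutoff radius via (1 - percent/100) is just one possible convention; fft, w, h and percent are the question's variables:
// Low-pass mask on an unshifted 2D spectrum of a w x h image (DC at bin (0,0)).
for (int i = 0; i < w * h; i++) {
    int row = i / w;                               // use the image width, not s
    int col = i % w;
    double fx = (col <= w / 2) ? col : col - w;    // wrapped horizontal frequency
    double fy = (row <= h / 2) ? row : row - h;    // wrapped vertical frequency
    double distance = sqrt(fx * fx + fy * fy);
    double cutoff = (1.0 - percent / 100.0) *
                    sqrt((w / 2.0) * (w / 2.0) + (h / 2.0) * (h / 2.0));
    if (distance > cutoff) {                       // zero everything above the cutoff
        fft[i][0] = 0.0;
        fft[i][1] = 0.0;
    }
}
With the wrap-around distance, 100% really is the maximum, and non-square images work because rows and columns come from w rather than a single side length.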