Parameters to improve a music frequency analyzer - C++

I'm using a FFT on audio data to output an analyzer, like you'd see in Winamp or Windows Media Player. However the output doesn't look that great. I'm plotting using a logarithmic scale and I average the linear results from the FFT into the corresponding logarithmic bins. As an example, I'm using bins like:
16k, 8k, 4k, 2k, 1k, 500, 250, 125, 62, 31, 15 [Hz]
Then I plot the magnitude (dB) against frequency [Hz]. The graph definitely 'reacts' to the music, and I can see the response to a drum sample or a high-pitched voice. But the graph is very 'saturated' near the lower frequencies, and overall it doesn't look much like what you see in applications, which tend to be more evenly distributed. I suspect that apps which display visual output do various things to the data to make it look better.
What things could I do to the data to make it look more like the typical music player app?
Some useful information:
I downsample to a single channel at 32 kHz and specify a time window of 35 ms. That means the FFT gets ~1100 points. I vary these values to experiment (e.g. tried 16 kHz, and increased/decreased the interval length), but I get similar results.
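For concreteness, here is a minimal sketch of the binning scheme described above (the function name, the octave edges starting at 15.625 Hz, and the averaging of magnitudes are illustrative choices, not the exact code in question):

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

// Average linear FFT magnitudes into octave-spaced display bands matching
// the (rounded) list in the question: 15.6, 31.2, 62.5, ..., 16000 Hz.
// 'spectrum' holds the positive-frequency half of the FFT; 'sampleRate'
// is the downsampled rate, e.g. 32000 Hz.
std::vector<double> logBands(const std::vector<std::complex<double>>& spectrum,
                             double sampleRate)
{
    const double binWidth = sampleRate / (2.0 * spectrum.size());
    std::vector<double> bands;
    for (double lo = sampleRate / 2.0 / 1024.0; lo < sampleRate / 2.0; lo *= 2.0) {
        const double hi = lo * 2.0;
        double sum = 0.0;
        int count = 0;
        for (std::size_t k = 0; k < spectrum.size(); ++k) {
            const double f = k * binWidth;
            if (f >= lo && f < hi) { sum += std::abs(spectrum[k]); ++count; }
        }
        const double avg = count ? sum / count : 0.0;
        bands.push_back(20.0 * std::log10(std::max(avg, 1e-12))); // to dB
    }
    return bands;
}
```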

With an FFT of 1100 points, you probably aren't able to capture the low frequencies with a lot of frequency resolution.
Think about it: 30 Hz corresponds to a period of 33 ms, which at 32 kHz is roughly 1000 samples, so you'll only be able to capture about one period in that window.
Thus, you'll need a longer FFT window to capture those low frequencies with sharp frequency resolution.
You'll likely need a time window of 4000 samples or more to start getting noticeably more frequency resolution at the low frequencies. This will be fine too, since you'll still get about 8-10 spectrum updates per second.
One option, if you want very fast updates for the high frequency bins but good frequency resolution at the low frequencies, is to update the high frequency bins more quickly (such as with the windows you're currently using) but compute the low frequency bins less often (and with the larger windows necessary for good frequency resolution).
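As a rough check of those numbers, the bin spacing of an N-point FFT is fs / N, so the window needed for a given low-frequency resolution follows directly (a sketch; the 8 Hz target is an arbitrary example):

```cpp
#include <cstdio>

int main()
{
    const double sampleRate = 32000.0;   // Hz, as in the question
    const double resolution = 8.0;       // desired bin spacing in Hz

    // Bin spacing of an N-point FFT is fs / N, so N = fs / resolution.
    const double samples   = sampleRate / resolution;   // 4000 samples
    const double windowSec = samples / sampleRate;      // 0.125 s
    std::printf("window: %.0f samples (%.0f ms) -> %.0f updates/s\n",
                samples, windowSec * 1000.0, 1.0 / windowSec);
    return 0;
}
```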

I think a lot of these applications have variable FFT bins.
What you could do is start with very wide, evenly spaced FFT bins like you have, and then keep track of the number of elements that are placed in each bin. If some of the bins are barely used at all (usually the higher frequencies), widen those bins so that they are larger (and thus receive more frequency entries), and shrink the low frequency bins.
I have worked on projects where we just spent a lot of time tuning bins for specific input sources, but it is much nicer to have the software adjust in real time.
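As an illustration of that idea, here is one possible rebalancing rule (hypothetical names; merging underpopulated bands is just one way to "widen" them):

```cpp
#include <cstddef>
#include <vector>

// Given display band edges (Hz) and FFT parameters, count how many FFT
// bins fall into each band; merge any band with fewer than 'minBins' bins
// into its neighbour. This is one simple rebalancing rule, not the only one.
std::vector<double> rebalance(std::vector<double> edges, std::size_t fftSize,
                              double sampleRate, std::size_t minBins = 2)
{
    const double binWidth = sampleRate / fftSize;
    for (std::size_t i = 0; i + 1 < edges.size(); ) {
        const std::size_t count =
            static_cast<std::size_t>((edges[i + 1] - edges[i]) / binWidth);
        if (count < minBins && edges.size() > 2)
            edges.erase(edges.begin() + i + 1);   // widen by merging bands
        else
            ++i;
    }
    return edges;
}
```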

A typical visualizer would use constant-Q bandpass filters, not a single FFT.
You could emulate a set of constant-Q bandpass filters by multiplying the FFT results by a set of constant-Q filter responses in the frequency domain, then sum. For low frequencies, you should use an FFT longer than the significant impulse response of the lowest frequency filter. For high frequencies, you can use shorter FFTs for better responsiveness. You can slide any length FFTs along at any desired update rate by overlapping (re-using) data, or you might consider interpolation. You might also want to pre-window each FFT to reduce "spectral leakage" between frequency bands.
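A minimal sketch of that frequency-domain weighting for a single band (the Gaussian-shaped response and the Q value are illustrative assumptions; any constant-Q shape works):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Collapse FFT magnitudes into one constant-Q band centred at 'fc'.
// The band's width is proportional to its centre frequency
// (constant Q = fc / bandwidth); a Gaussian shape is used here
// purely for illustration.
double constantQBand(const std::vector<std::complex<double>>& spectrum,
                     double sampleRate, double fc, double Q = 4.0)
{
    const double binWidth = sampleRate / (2.0 * spectrum.size());
    const double sigma = fc / Q;   // bandwidth grows with centre frequency
    double sum = 0.0;
    for (std::size_t k = 0; k < spectrum.size(); ++k) {
        const double f = k * binWidth;
        const double w = std::exp(-0.5 * std::pow((f - fc) / sigma, 2.0));
        sum += w * std::abs(spectrum[k]);   // weight, then sum
    }
    return sum;
}
```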

Related

RF Divider function in SDR

I have what may be an odd question for the SDR gurus out there.
What would be the physical implementation (in software) of a broadband frequency divider?
For example, say I want to capture a signal at 1 GHz, with a 10 MHz bandwidth, then divide it by a factor of 10.
I would expect to get a down-sampled signal at 100 MHz with a 1 MHz bandwidth.
Yes, I know I would lose information, but assume this would be presented as a spectrum analysis, not full audio, video, etc.
Conceptually, could this be accomplished by sampling the RF at 2+ times the highest frequency component, say at 2.5 GHz, then discarding 9 out of 10 samples, i.e. decimating the input stream?
Thanks,
Dave
Well, as soon as you've digitized your signal, it loses the property "bandwidth", which is a real-world concept, not one attached to the inherently meaningless stream of numbers that we're talking about in DSP and SDR. So there's no signal with a bandwidth of 10 MHz (without looking at the contents of the samples), only a stream of numbers that we remember being produced by sampling an analog signal at 20 MS/s (if you're doing real sampling; if you have an I/Q downconverter and sample I and Q simultaneously, you get complex samples, of which 10 MS/s is enough to represent 10 MHz of bandwidth).
Now, if you just throw away 9 out of 10 samples, which is decimation, you'll get aliasing: you can no longer tell whether a sine that took 10 samples in the original signal is actually a sine or just a constant, and the same goes for any sine with a frequency above your new sampling rate's Nyquist limit. That is a loss of information, but since you've said you can accept that, yes, it would work.
I think, however, you have something specific in mind, namely scaling the signal along the frequency axis. Let's take a quick excursion into Fourier analysis:
There is the well-known correspondence for frequency scaling: let $G$ be the Fourier transform of $g$; then

$$g(at) \longleftrightarrow \frac{1}{|a|}\, G\!\left(\frac{f}{a}\right).$$
As you can see, compressing something in the frequency domain actually means "speeding it up" in the time domain, i.e. decimation!
So, in order to do this meaningfully, you could imagine taking a length-N DFT of your signal and setting 9 out of 10 bins to zero by multiplying the spectrum with a comb of 1's. Now, multiplication with a signal in the frequency domain is convolution with the Fourier transform of that signal in the time domain. The Fourier transform of such a comb is, to little surprise, the complement of a Nyquist-M filter, and thus a filter itself; you will end up with a multi-band-passed version of your signal, which you can then decimate without aliasing.
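A sketch of that procedure using a naive O(N²) DFT (in practice you'd use an FFT library; the names are illustrative):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;
const double PI = 3.141592653589793;

// Naive DFT/IDFT, O(N^2) -- fine for a sketch; use an FFT library in practice.
cvec dft(const cvec& x, bool inverse)
{
    const std::size_t n = x.size();
    const double sign = inverse ? 1.0 : -1.0;
    cvec y(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t m = 0; m < n; ++m)
            y[k] += x[m] * std::polar(1.0, sign * 2.0 * PI * k * m / n);
    if (inverse)
        for (auto& v : y) v /= static_cast<double>(n);
    return y;
}

// "Multiply by a comb of 1's": zero every DFT bin except each M-th one,
// transform back, then keep every M-th sample. The comb filtering removes
// the content that would otherwise alias when decimating.
cvec combAndDecimate(const cvec& x, std::size_t M)
{
    cvec X = dft(x, false);
    for (std::size_t k = 0; k < X.size(); ++k)
        if (k % M != 0) X[k] = 0.0;
    cvec filtered = dft(X, true);
    cvec out;
    for (std::size_t i = 0; i < filtered.size(); i += M)
        out.push_back(filtered[i]);
    return out;
}
```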
Hope that was what you're after!

Predict smearing of FFT

After applying a Fourier transform to a signal, the energy of a single sine wave is often spread out across multiple bins (aka smearing). (The original question included an image illustrating this.)
I want to extract a list of peak frequencies. Just finding the highest bin is easy. But after that smearing becomes a problem.
I would like to have a heuristic which tells me if the magnitude of a specific bin is possibly the result of smearing or if there has to be another peak frequency in order to explain the signal. (It is better if I miss some than have false positives)
My naive approach would be to just calculate a few thousand examples and take the maximum of these to get an envelope curve so that any smearing is likely below that envelope.
But is there a better way to do this?
The FFT result of any rectangularly windowed, pure, unmodulated sinusoid is a sampled sinc function. This sinc function, $\sin(\pi x)/(\pi x)$, is zero at every bin except one (the one at the peak magnitude) only when the input sinusoid completes an exact integer number of periods within the FFT width.
For all other frequencies that are not at the exact center of an FFT result bin, if you know that exact frequency and magnitude (which won't be in any single FFT bin), you can calculate all the other bins by sampling the Sinc function.
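A sketch of that calculation, using the exact Dirichlet-kernel form of the leakage for a rectangular window (the normalization against the peak bin and the margin factor are illustrative assumptions, as is the prior estimation of the peak's frequency and magnitude):

```cpp
#include <cmath>

const double PI = 3.141592653589793;

// Predicted magnitude at DFT bin k for a rectangularly windowed complex
// sinusoid whose frequency lies at (possibly fractional) bin f0. This is
// the Dirichlet kernel; for large N it approaches N * sinc(k - f0).
double predictedBinMagnitude(double f0, int k, int N)
{
    const double d = f0 - k;
    const double den = std::sin(PI * d / N);
    if (std::fabs(den) < 1e-15) return N;    // exactly on the bin centre
    return std::fabs(std::sin(PI * d) / den);
}

// Heuristic: given an already estimated peak (frequency f0, measured peak
// magnitude a0), bin k is "explainable by smearing" if its magnitude does
// not exceed the predicted leakage by more than a margin factor.
bool explainedBySmearing(double binMag, double f0, double a0, int k, int N,
                         double margin = 1.5)
{
    const int peakBin = static_cast<int>(f0 + 0.5);
    const double leak = a0 * predictedBinMagnitude(f0, k, N)
                           / predictedBinMagnitude(f0, peakBin, N);
    return binMag <= margin * leak;
}
```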
And, of course, if the input sinusoid isn't perfectly pure, but modulated in any way (amplitude, frequency or phase), this modulation will produce various sidebands in the FFT result as well.

Estimate Gaussian height from its area

We (my colleague and I) were given a device which sends us a large amount of discrete integer data (intensities) each second; these tend to follow a Gaussian distribution. The pseudo-Gaussians flow in one by one, and we are supposed to pick the largest intensity from the center of each one as fast as possible. Moreover, the data contain noise, so we cannot assume that each Gaussian separates into two monotone parts; in other words, we cannot rely on the simple fact that once the data start to decline, we have found the maximum.
My colleague came up with an idea:
introduce an intensity threshold to separate the Gaussians from each other
sum the intensities of each Gaussian to estimate its area, and then estimate its height from that
But the question is, how can I quickly estimate the height of this pseudo-Gaussian from its area?
UPDATE:
To be more clear: the intensities I get represent "function values" of a Gaussian, or better, they represent the heights of histogram bins.
You could use a moving average filter, and when that starts to decline, take the maximum value in that window as your height. As long as the noise in the signal is fairly low amplitude and high frequency, that should work reasonably well. You can always combine it with thresholding if required. The people on the DSP site will probably have much better ideas though, so I'd ask there.
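A minimal sketch of that scheme (the window length and the turn-over test are illustrative choices):

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Smooth with a moving average; when the smoothed value turns over from
// rising to falling, report the raw maximum seen in the current window.
std::vector<int> detectPeaks(const std::vector<int>& intensities,
                             std::size_t window = 16)
{
    std::vector<int> peaks;
    std::deque<int> buf;
    long long sum = 0;
    double prevAvg = 0.0;
    bool rising = false;
    for (int v : intensities) {
        buf.push_back(v);
        sum += v;
        if (buf.size() > window) { sum -= buf.front(); buf.pop_front(); }
        const double avg = static_cast<double>(sum) / buf.size();
        if (avg > prevAvg) {
            rising = true;
        } else if (rising && avg < prevAvg) {
            // Smoothed curve turned over: take the raw max in the window.
            peaks.push_back(*std::max_element(buf.begin(), buf.end()));
            rising = false;
        }
        prevAvg = avg;
    }
    return peaks;
}
```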

Compute frequency of sinusoidal signal, C++

I have a sinusoidal-like shaped signal, and I would like to compute its frequency.
I tried to implement something, but it looks very difficult. Any ideas?
So far I have a vector of timestep/value pairs; how can I get the frequency from this?
Thank you.
If the input signal is a perfect sinusoid, you can calculate the frequency using the time between positive 0 crossings. Find 2 consecutive instances where the signal goes from negative to positive and measure the time between, then invert this number to convert from period to frequency. Note this is only as accurate as your sample interval and it does not account for any potential aliasing.
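A minimal sketch of that zero-crossing estimate, averaging over all full periods between the first and last crossing for a bit more accuracy (the (time, value) pair layout matches the vector described in the question):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Estimate frequency from positive-going zero crossings. Counts the full
// periods between the first and last crossing and divides by the elapsed
// time. Returns 0 if fewer than two crossings are found.
double zeroCrossingFrequency(const std::vector<std::pair<double, double>>& samples)
{
    double first = -1.0, last = -1.0;
    int periods = 0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        if (samples[i - 1].second < 0.0 && samples[i].second >= 0.0) {
            if (first < 0.0) first = samples[i].first;   // first crossing
            else { last = samples[i].first; ++periods; } // one more period
        }
    }
    if (periods < 1) return 0.0;
    return periods / (last - first);   // periods per second = Hz
}
```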
You could try autocorrelating the signal. An autocorrelation can be rapidly calculated by following these steps:
Perform an FFT of the audio.
Multiply each complex value by its complex conjugate.
Perform the inverse FFT of the result.
The leftmost peak will always be the highest (as the signal always correlates best with itself). The second-highest peak, however, can be used to calculate the sinusoid's frequency.
For example, if the second peak occurs at an offset (lag) of 50 samples, the sample rate is 16 kHz and the window is 1 second, then the frequency is 16000 / 50 = 320 Hz. You can even use interpolation to get a more accurate estimate of the peak position, and thus a more accurate sinusoid frequency. This method is computationally intensive, but it is very good for estimating the frequency after significant amounts of noise have been added!
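For illustration, here is the same idea with a direct O(N²) autocorrelation instead of the FFT route (slower, but self-contained; the peak-picking rule is a simple assumption):

```cpp
#include <cstddef>
#include <vector>

// Direct autocorrelation; the FFT route above computes the same thing
// faster. Returns the estimated frequency in Hz, or 0 on failure.
double autocorrFrequency(const std::vector<double>& x, double sampleRate)
{
    const std::size_t n = x.size();
    std::vector<double> r(n, 0.0);
    for (std::size_t lag = 0; lag < n; ++lag)
        for (std::size_t i = 0; i + lag < n; ++i)
            r[lag] += x[i] * x[i + lag];

    // Skip down the slope of the zero-lag peak, then take the highest
    // remaining peak; its lag gives the period in samples.
    std::size_t lag = 1;
    while (lag < n && r[lag] < r[lag - 1]) ++lag;
    std::size_t best = lag;
    for (std::size_t k = lag; k < n; ++k)
        if (r[k] > r[best]) best = k;
    return (best > 0 && best < n) ? sampleRate / best : 0.0;
}
```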

detecting pauses in a spoken word audio file using pymad, pcm, vad, etc

First I am going to broadly state what I'm trying to do and ask for advice. Then I will explain my current approach and ask for answers to my current problems.
Problem
I have an MP3 file of a person speaking. I'd like to split it up into segments roughly corresponding to a sentence or phrase. (I'd do it manually, but we are talking hours of data.)
If you have advice on how to do this programmatically, or pointers to existing utilities, I'd love to hear it. (I'm aware of voice activity detection and I've looked into it a bit, but I didn't see any freely available utilities.)
Current Approach
I thought the simplest thing would be to scan the MP3 at certain intervals and identify places where the average volume was below some threshold. Then I would use some existing utility to cut up the mp3 at those locations.
I've been playing around with pymad and I believe I've successfully extracted the PCM (pulse code modulation) data for each frame of the mp3. Now I am stuck, because I can't quite wrap my head around how the PCM data translates to relative volume. I'm also aware of other complicating factors like multiple channels, big-endian vs. little-endian, etc.
Advice on how to map a group of pcm samples to relative volume would be key.
Thanks!
PCM is a time-frame-based encoding of sound. For each time frame, you get a peak level. (If you want a physical reference for this: the peak level corresponds to the distance the microphone membrane was moved out of its resting position at that given time.)
Let's forget that PCM can use unsigned values for 8-bit samples, and focus on signed values. If the value is > 0, the membrane was on one side of its resting position; if it is < 0, it was on the other side. The bigger the displacement from rest (no matter to which side), the louder the sound.
Most voice classification methods start with one very simple step: They compare the peak level to a threshold level. If the peak level is below the threshold, the sound is considered background noise.
Looking at the parameters in Audacity's Silence Finder, the silence level should be that threshold. The next parameter, Minimum silence duration, is obviously the length of a silence period that is required to mark a break (or in your case, the end of a sentence).
If you want to code a similar tool yourself, I recommend the following approach:
Divide your sound sample into discrete sets of a specific duration. I would start with 1/10, 1/20 or 1/100 of a second.
For each of these sets, compute the maximum peak level
Compare this maximum peak to a threshold (the silence level in Audacity). The threshold is something you have to determine yourself, based on the specifics of your sound sample (loudness, background noise, etc.). If the max peak is below your threshold, this set is silence.
Now analyse the series of classified sets: Calculate the length of silence in your recording. (length = number of silent sets * length of a set). If it is above your Minimum silence duration, assume that you have the end of a sentence here.
The main point of coding this yourself instead of continuing to use Audacity is that you can improve your classification by using more advanced analysis methods. One very simple metric you can apply is the zero crossing rate: it just counts how often the sign switches in a given set of peak levels (i.e. how often your values cross the zero line). There are many more, all of them more complex, but it may be worth the effort. Have a look at discrete cosine transforms, for example...
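A sketch of steps 1-3 plus the zero crossing rate metric (the sample format, set length and all thresholds are left open, as above):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct SetStats { double maxPeak; double zeroCrossingRate; };

// Classify fixed-length sets of signed mono PCM samples in [-1, 1].
// 'setLen' might be sampleRate / 100 for sets of 1/100 s each.
std::vector<SetStats> analyseSets(const std::vector<float>& samples,
                                  std::size_t setLen)
{
    std::vector<SetStats> stats;
    for (std::size_t start = 0; start + setLen <= samples.size();
         start += setLen) {
        double peak = 0.0;
        std::size_t crossings = 0;
        for (std::size_t i = start; i < start + setLen; ++i) {
            peak = std::max(peak, std::fabs(static_cast<double>(samples[i])));
            if (i > start && (samples[i - 1] < 0.0f) != (samples[i] < 0.0f))
                ++crossings;                       // sign change = crossing
        }
        stats.push_back({peak, static_cast<double>(crossings) / setLen});
    }
    return stats;
}

// A set is "silent" if its peak is below the threshold; consecutive silent
// sets longer than the minimum duration mark a likely sentence break.
bool isSilent(const SetStats& s, double threshold) { return s.maxPeak < threshold; }
```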
Just wanted to update this. I'm having moderate success using Audacity's Silence Finder. However, I'm still interested in this problem. Thanks.
PCM is a way of encoding a waveform: it is stored as a series of numbers, each representing the amplitude of the signal at one sample instant.
To estimate amplitude, look at the magnitude of those sample values over time (normalizing them first if needed). Once you've done that, you should be able to pick out the spots where the amplitude is lower.
You may also try to use a Fourier transform to estimate where the signals are most distinct.