Programmatically change the speed of an audio file in real-time - c++

Environment
Hardware: Raspberry Pi x
O.S.: Raspbian Jessie Lite
Language: Qt5 / C++
Goal
Execute an audio file (wav or better mp3) changing its speed smoothly and countinuosly. The pitch should change according to the speed (playback rate).
My application updates several times per second a variable that contains the desired speed: i.e. 1.0 = normal speed. Required range is about 0.2 .. 3.0, with a resolution of 0.01.
The audio is likely music, expected format: mono, 16-bit, 11.025 Hz.
No specific constraints about latency: below 500 ms is acceptable.
Some thougths
QMediaPlayer in QtMultimedia has the playbackRate property that should do exactly this. Unfortunately I have never be able to make QtMultimedia work in my systems.
It's ok to use also an external player, and send data using pipes or any IPC.
How would you achieve this?

I don't know how much of this translates to C++. The work I did on this problem uses Java. Still, something of the algorithm should be of help.
Example data (made up):
sample value
0 0.0
1 0.3
2 0.5
3 0.6
4 0.2
5 -0.1
6 -0.4
With normal speed, we send the output line a series of values where the sample number increments by 1 per output frame.
If we were going slower, say half speed, we should output twice as many values before reaching the same point in the media data. In other words, we need to include, in our output, values that are at the non-existent, intermediate sample frame locations 0.5, 1.5, 2.5, ...
To do this, it turns out that linear interpolation works quite well for audio. It is possible to use a more sophisticated curve fitting algorithm but the increase in fidelity is not considered to be worth the trouble.
So, we end up with a stream as follows (for half speed):
sample value
0 0.0
0.5 0.15
1 0.3
1.5 0.4
2 0.5
2.5 0.55
3 0.6
etc.
If you want to play back 3/4 speed, then the positions and values used in the output would be this:
sample value
0 0.0
0.75 0.225
1.5 0.4
2.25 0.525
3 0.6
3.75 0.525
etc.
I code this via a "cursor" that is incremented each sample frame, where the increment amount determines the "speed" of the playback. The cursor points into an array, like an integer index would, but instead, is a float (or double). If there is a fractional part to the cursor's value, the fraction is used to interpolate between sample values pointed to by the integer part and the integer part plus one.
For example, if the cursor was 6.25, and the value of soundData[6] was A and the value of soundData[6+1] was B, the sound value would be:
audioValue = A * 0.75 + B * 0.25
The degree of precision with which you can define your speed increment is quite high. I think Java's floats are considered sufficient for this purpose.
As for keeping a dynamically changing speed increment smooth, I am spreading out the changes to new speeds over a series of 4096 steps (roughly 1/10th of a second, at 44100 fps). Change requests are often asynchronous, e.g., from a GUI, and are spread out over time in a somewhat unpredictable way. The smoothing algorithm should be able to recalculate and update itself with each new speed request.
Following is a link that demonstrates both strategies, where a sound's playback speed is altered in real time via a slider control.
SlidersTest.jar
This is a runnable copy of the jar file that also contains the source code, and executes via Java 8. You can also rename the file SlidersTest.zip and then drill in to view the source code, in context.
But links to the source files can also be navigated to directly in the two following sections of a page I posted for this code I recently wrote and made open source:
see AudioCue.java
see SlidersTest.java
AudioCue.java is a long file. The relevant parts are in the inner class at the end of the file: class AudioCuePlayer, and for the smoothing algorithm, check the setter method setSpeed which is about 3/4's of the way down. Sorry I don't have line numbers.

Related

Backpropagation 2-Dimensional Neuron Network C++

I am learning about Two Dimensional Neuron Network so I am facing many obstacles but I believe it is worth it and I am really enjoying this learning process.
Here's my plan: To make a 2-D NN work on recognizing images of digits. Images are 5 by 3 grids and I prepared 10 images from zero to nine. For Example this would be number 7:
Number 7 has indexes 0,1,2,5,8,11,14 as 1s (or 3,4,6,7,9,10,12,13 as 0s doesn't matter) and so on. Therefore, my input layer will be a 5 by 3 neuron layer and I will be feeding it zeros OR ones only (not in between and the indexes depends on which image I am feeding the layer).
My output layer however will be one dimensional layer of 10 neurons. Depends on which digit was recognized, a certain neuron will fire a value of one and the rest should be zeros (shouldn't fire).
I am done with implementing everything, I have a problem in computing though and I would really appreciate any help. I am getting an extremely high error rate and an extremely low (negative) output values on all output neurons and values (error and output) do not change even on the 10,000th pass.
I would love to go further and post my Backpropagation methods since I believe the problem is in it. However to break down my work I would love to hear some comments first, I want to know if my design is approachable.
Does my plan make sense?
All the posts are speaking about ranges ( 0->1, -1 ->+1, 0.01 -> 0.5 etc ), will it work for either { 0 | .OR. | 1 } on the output layer and not a range? if yes, how can I control that?
I am using TanHyperbolic as my transfer function. Does it make a difference between this and sigmoid, other functions.. etc?
Any ideas/comments/guidance are appreciated and thanks in advance
Well, by the description given above, I think that the design and approach taken it's correct! With respect to the choice of the activation function, remember that those functions help to get the neurons which have the largest activation number, also, their algebraic properties, such as an easy derivative, help with the definition of Backpropagation. Taking this into account, you should not worry about your choice of activation function.
The ranges that you mention above, correspond to a process of scaling of the input, it is better to have your input images in range 0 to 1. This helps to scale the error surface and help with the speed and convergence of the optimization process. Because your input set is composed of images, and each image is composed of pixels, the minimum value and and the maximum value that a pixel can attain is 0 and 255, respectively. To scale your input in this example, it is essential to divide each value by 255.
Now, with respect to the training problems, Have you tried checking if your gradient calculation routine is correct? i.e., by using the cost function, and evaluating the cost function, J? If not, try generating a toy vector theta that contains all the weight matrices involved in your neural network, and evaluate the gradient at each point, by using the definition of gradient, sorry for the Matlab example, but it should be easy to port to C++:
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
% Set perturbation vector
perturb(p) = e;
loss1 = J(theta - perturb);
loss2 = J(theta + perturb);
% Compute Numerical Gradient
numgrad(p) = (loss2 - loss1) / (2*e);
perturb(p) = 0;
end
After evaluating the function, compare the numerical gradient, with the gradient calculated by using backpropagation. If the difference between each calculation is less than 3e-9, then your implementation shall be correct.
I recommend to checkout the UFLDL tutorials offered by the Stanford Artificial Intelligence Laboratory, there you can find a lot of information related to neural networks and its paradigms, it's worth to take look at it!
http://ufldl.stanford.edu/wiki/index.php/Main_Page
http://ufldl.stanford.edu/tutorial/

Drawing audio spectrum with Bass library

How can I draw an spectrum for an given audio file with Bass library?
I mean the chart similar to what Audacity generates:
I know that I can get the FFT data for given time t (when I play the audio) with:
float fft[1024];
BASS_ChannelGetData(chan, fft, BASS_DATA_FFT2048); // get the FFT data
That way I get 1024 values in array for each time t. Am I right that the values in that array are signal amplitudes (dB)? If so, how the frequency (Hz) is associated with those values? By the index?
I am an programmer, but I am not experienced with audio processing at all. So I don't know what to do, with the data I have, to plot the needed spectrum.
I am working with C++ version, but examples in other languages are just fine (I can convert them).
From the documentation, that flag will cause the FFT magnitude to be computed, and from the sounds of it, it is the linear magnitude.
dB = 10 * log10(intensity);
dB = 20 * log10(pressure);
(I'm not sure whether audio file samples are a measurement of intensity or pressure. What's a microphone output linearly related to?)
Also, it indicates the length of the input and the length of the FFT match, but half the FFT (corresponding to negative frequencies) is discarded. Therefore the highest FFT frequency will be one-half the sampling frequency. This occurs at N/2. The docs actually say
For example, with a 2048 sample FFT, there will be 1024 floating-point values returned. If the BASS_DATA_FIXED flag is used, then the FFT values will be in 8.24 fixed-point form rather than floating-point. Each value, or "bin", ranges from 0 to 1 (can actually go higher if the sample data is floating-point and not clipped). The 1st bin contains the DC component, the 2nd contains the amplitude at 1/2048 of the channel's sample rate, followed by the amplitude at 2/2048, 3/2048, etc.
That seems pretty clear.

Check for similarity on different size images

I have a video source that produce many streams for different devices (such as: HD television, Pads, smart phones, etc.), every of them has to be checked within each other for similarity. The video stream release 50 images per second, one image every 20 milliseconds.
Lets take for instance img1 coming from stream1 at time ts1=1, img2 coming from stream2 at ts2=1 and img1.1 taken from stream1 at ts=2 (20 milliseconds later than ts=1), the comparison result should look something like this:
compare(img1, img1) = 1 same image same size
compare(img1, img2) = 0.9 same image different size
compare(img1, img1.1) = 0.8 different images same size
ideally this should be done real time, so within 20 millisecond, the goal is to understand if the streams are out of synchronization, I already implemented some compare methods (nobody of them works for this case yet):
1) histogram (SSE and OpenCV cuda), result compare(img1, img2) ~= compare(img1, img1.1)
2) pnsr (SSE and OCV cuda), result compare(img1, img2) < compare(img1, img1.1)
3) ssim (SSE and OCV cuda), resulting the same as pnsr
Maybe I get bad results because of the resize interpolation method?
Is it possible to realize a comparison method that fulfill my requirements, any ideas?
I'm afraid that you're running into a Real Problem (TM). This is not a trivial lets-give-it-to-the-intern problem.
The main challenge is that you can't do a brute-force comparison. HD images are 3 MB or more, and you're talking about O(N*M) comparisons (in time and across streams).
What you essentially need is a fingerprint that's robust against resizing but time-variant. And as you didn't realize that (the histogram idea for instance is quite time-stable, for instance) you didn't include the necessary information in this question.
So this isn't a C++ question, really. You need to understand your inputs.

Digital signal decimation using gnuradio lib

I write application where I must process digital signal - array of double. I must the signal decimate, filter etc.. I found a project gnuradio where are functions for this problem. But I can't figure how to use them correctly.
I need signal decimate (for example from 250Hz to 200Hz). The function should be similar to resample function in Matlab. I found, the classes for it are:
rational_resampler_base_fff Class source
fir_filter_fff Class source
...
Unfortunately I can't figure how to use them.
gnuradio and shared library I have installed
Thanks for any advice
EDIT to #jcoppens
Thank you very much for you help.
But I must process signal in my code. I find classes in gnuradio which can solve my problem, but I need help how set them.
Functions which I must set are:
low_pass(doub gain, doub sampling_freq, doub cutoff_freq, doub transition_width, window, beta)
where:
use "window method" to design a low-pass FIR filter
gain: overall gain of filter (typically 1.0)
sampling_freq: sampling freq (Hz)
cutoff_freq: center of transition band (Hz)
transition_width: width of transition band (Hz).
The normalized width of the transition band is what sets the number of taps required. Narrow –> more taps
window_type: What kind of window to use. Determines maximum attenuation and passband ripple.
beta: parameter for Kaiser window
I know, I must use window = KAISER and beta = 5, but for the rest I'm not sure.
The func which I use are: low_pass and pfb_arb_resampler_fff::filter
UPDATE:
I solved the resampling using libsamplerate
I need signal decimate (for example from 250Hz to 200Hz)
WARNING: I expressed the original introductory paragraph incorrectly - my apologies.
As 250 Hz is not related directly to 200 Hz, you have to do some tricks to convert 250Hz into 200Hz. Inserting 4 interpolated samples in between the 250Hz samples, lowers the frequency to 50Hz. Then you can raise the frequency to 200Hz again by decimating by a factor 4.
For this you need the "Rational Resampler", where you can define the subsample and decimate factors. Something like this:
This means you would have to do something similar if you use the library. Maybe it's even simpler to do it without the library. Interpolate linearly between the 250 Hz samples (i.e. insert 4 extra samples between each), then decimate by selecting each 4th sample.
Note: There is a Signal Processing forum on stackexchange - maybe this question might fall in that category...
More information: If you only have to resample your input data, and you do not need the actual gnuradio program, then have a look at this document:
https://ccrma.stanford.edu/~jos/resample/resample.pdf
There are several links to other documents, and a link to libresample, libresample4, and others, which may be of use to you. Another, very interesting, page is:
http://www.dspguru.com/dsp/faqs/multirate/resampling
Finally, from the same source as the pdf above, check their snd program. It may solve your problem without writing any software. It can load floating point samples, resample, and save again:
http://ccrma.stanford.edu/planetccrma/software/soundapps.html#SECTION00062100000000000000
EDIT: And yet another solution - maybe the simplest of all: Use Matlab (or the free Octave version):
pkg load signal
t = linspace(0, 10*pi, 50); % Generate a timeline - 5 cycles
s = sin(t); % and the sines -> 250 Hz
tr = resample(s, 5, 4); % Convert to 200 Hz
plot(t, s, 'r') % Plot 250 Hz in red
hold on
plot(t, tr(1:50)) % and resampled in blue
Will give you:

How to find why a RBM does not work correctly?

I'm trying to implement a RBM and I'm testing it on MNIST dataset. However, it does not seems to converge.
I've 28x28 visible units and 100 hidden units. I'm using mini-batches of size 50. For each epoch, I traverse the whole dataset. I've a learning rate of 0.01 and a momentum of 0.5. The weights are randomly generated based on a Gaussian distribution of mean 0.0 and stdev of 0.01. The visible and hidden biases are initialized to 0. I'm using a logistic sigmoid function as activation.
After each epoch, I compute the average reconstruction error of all mini-batches, here are the errors I get:
epoch 0: Reconstruction error average: 0.0481795
epoch 1: Reconstruction error average: 0.0350295
epoch 2: Reconstruction error average: 0.0324191
epoch 3: Reconstruction error average: 0.0309714
epoch 4: Reconstruction error average: 0.0300068
I plotted the histograms of the weights to check (left to right: hiddens, weights, visibles. top: weights, bottom: updates):
Histogram of the weights after epoch 3
Histogram of the weights after epoch 3 http://baptiste-wicht.com/static/finals/histogram_epoch_3.png
Histogram of the weights after epoch 4
Histogram of the weights after epoch 4 http://baptiste-wicht.com/static/finals/histogram_epoch_4.png
but, except for the hidden biases that seem a bit weird, the remaining seems OK.
I also tried to plot the hidden weights:
Weights after epoch 3
Weights after epoch 3 http://baptiste-wicht.com/static/finals/hiddens_weights_epoch_3.png
Weights after epoch 4
Weights after epoch 4 http://baptiste-wicht.com/static/finals/hiddens_weights_epoch_4.png
(they are plotted in two colors using that function:
static_cast<size_t>(value > 0 ? (static_cast<size_t>(value * 255.0) << 8) : (static_cast<size_t>(-value * 255.)0) << 16) << " ";
)
And here, they do not make sense at all...
If I go further, the reconstruction error falls a bit more, but do no go further than 0.025. Even if I change the momentum after sometime, it goes higher and then goes down a bit but not interestingly. Moreover, the weights do no make more sense after more epochs. In most example implementations I've seen, the weights were making some sense after iterating through the complete data set two or three times.
I've also tried to reconstruct an image from the visible units, but the results seems almost random.
What could I do to check what goes wrong in my implementation ? Should the weights be within some range ? Does something seems really strange in the data ?
Complete code: https://github.com/wichtounet/dbn/blob/master/include/rbm.hpp
You are using a very small learning rate. In most NNs trained by SGD you start out with a higher learning rate and decay it over time. Search for learning rate or adaptive learning rate to find more information on that.
Second, when implementing a new algorithm I would recommend finding the paper that introduced it and reproducing their results. A good paper should include most of the settings used - or the method used to determine the settings.
If a paper is unavailable, or it was tested on a data set you don't have access to - go find a working implementation and compare the outputs when using the same settings. If the implementations are not feature compatible, turn off as many features as you can that are not shared.