How to implement batch backpropagation - c++

I have successfully implemented stochastic backpropagation and I am trying to increase its accuracy. I've noticed batched backpropagation seems to be more popular, so I wanted to try it and see whether it improves the network's accuracy; however, I can't figure out how to implement it. By "batched backpropagation" I mean backpropagation where the weights and biases are updated only after a mini-batch or epoch completes, instead of after each input.
My understanding is that you sum up the changes that are needed to be made to each weight and bias and apply the change at the end of the batch of training examples. I basically changed nothing from my original stochastic backprop code except instead of applying the change directly to the weights and biases I apply the change to a buffer which is then used to update the weights and biases later. Or am I supposed to sum up the cost from each training example and then at the end of the batch run backpropagation? If this is the case then what do I use for the intermediate results (the output vectors of each layer) if the cost is a combination of the cost for a batch of inputs?
//Called after each calculation on a training example
void ML::NeuralNetwork::learnBatch(const Matrix & calc, const Matrix & real) const {
    ML::Matrix cost = 2 * (calc - real);
    for (int i = weights.size() - 1; i >= 0; --i) {
        //Each element in results is the column vector output for each layer
        //ElementMultiply() returns Hadamard Product
        ML::Matrix dCdB = cost.elementMultiply(ML::sigDerivative(weights[i] * results[i] + biases[i]));
        ML::Matrix dCdW = dCdB * results[i].transpose();
        cost = weights[i].transpose() * dCdB;
        sumWeights[i] += learningRate * dCdW; //Scalar multiplication
        sumBiases[i] += learningRate * dCdB;
        /* Original Code:
         * weights[i] -= learningRate * dCdW;
         * biases[i] -= learningRate * dCdB;
         */
    }
}

//Called at the end of a batch
void ML::NeuralNetwork::update() {
    for (int i = 0; i < weights.size(); ++i) {
        weights[i] -= sumWeights[i];
        biases[i] -= sumBiases[i];
        //Sets all elements in the matrix to 0
    }
}
Besides the addition of an update() function I really haven't changed much from my working stochastic backprop code. With my current batch backprop code the neural network never learns and consistently gets 0 correct outputs even after iterating over 200 batches. Is there something I'm not understanding?
All help will be greatly appreciated.

In batch backpropagation, you sum the contributions of each sample's backpropagation.
In other words, the resulting gradient is the sum of the per-sample gradients.
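To make the accumulate-then-update pattern concrete, here is a minimal sketch on a toy model rather than on your Matrix class. `TinyModel`, `accumulate`, `update` and `trainFullBatch` are hypothetical names made up for this illustration. Dividing the summed gradient by the batch size is my own assumption (it keeps the step size independent of the batch length); the plain sum described above also works, but usually needs a smaller learning rate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-weight, one-bias model y = w * x + b with squared-error
// cost, used only to show the accumulate-then-update pattern.
struct TinyModel {
    double w = 0.0;
    double b = 0.0;
    double sumW = 0.0;     // gradient buffer for w
    double sumB = 0.0;     // gradient buffer for b
    std::size_t seen = 0;  // samples accumulated since the last update

    // Called once per training example: add this sample's gradient to the
    // buffers, but leave w and b untouched.
    void accumulate(double x, double target) {
        const double dCdY = 2.0 * ((w * x + b) - target);
        sumW += dCdY * x;
        sumB += dCdY;
        ++seen;
    }

    // Called at the end of a batch: apply the averaged gradient once, then
    // clear the buffers so the next batch starts from zero.
    void update(double learningRate) {
        assert(seen > 0);
        w -= learningRate * sumW / static_cast<double>(seen);
        b -= learningRate * sumB / static_cast<double>(seen);
        sumW = 0.0;
        sumB = 0.0;
        seen = 0;
    }
};

// Runs `epochs` passes over the data, updating once per pass (full batch).
void trainFullBatch(TinyModel& m, const std::vector<double>& xs,
                    const std::vector<double>& ys, int epochs, double learningRate) {
    assert(xs.size() == ys.size());
    for (int e = 0; e < epochs; ++e) {
        for (std::size_t i = 0; i < xs.size(); ++i) m.accumulate(xs[i], ys[i]);
        m.update(learningRate);
    }
}
```

Note that every sample in a batch is evaluated with the same weights, and that the buffers are reset after each update; a buffer that is never cleared keeps re-applying old gradients.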


Is there a way to set an "unknown" variable like "x" inside a sine equation, and change its value afterwards?

I want to write audio code in C++ for my microcontroller-based synthesizer which should allow me to generate a sampled square wave signal using the Fourier series equation.
My question in general is: is there a way to set an "unknown" variable like "x" inside a sine equation, and change its value afterwards?
What do I mean by that?
If you take a look at the code I've written so far, you see the following:
void SquareWave(int mHarmonics){
    char x;
    for(int k = 0; k <= mHarmonics; k++){
        mFourier += 1/((2*k)+1)*sin(((2*k)+1)*2*M_PI*x/SAMPLES_TOTAL);
    }
    for(x = (int)0; x < SAMPLES_TOTAL; x++){
        mWave[x] = mFourier;
    }
}
Inside the first for loop, mFourier sums weighted sine signals, depending on the number of harmonics "mHarmonics". So a note on my keyboard should set up the harmonic spectrum automatically.
In this equation I've declared x as a char, and this is the center of my problem: I want x to be an "unknown" variable whose value I set afterwards. If x were an integer, it would have some default value like 0, which would make the whole equation incorrect.
In the bottom loop I want to write this Fourier series sum into an array mWave, which will be the resulting output. Is it possible to assign the sum to mWave[x], where x is an "unknown" multiplier inside the sine signal at first, and then change its value afterwards inside the second loop?
Sorry if this is a stupid question; I don't have much experience with C++, but I try to learn it by making these stupid mistakes!
@Useless told you what to do, but I am going to try to spell it out for you.
This is how I would do it:
#include <cmath>
#include <vector>

/**
 * Perform a rectangular window in the frequency domain of a time domain square
 * wave. This should be a sinc impulse response.
 * @param x The time domain sample within the period of the signal.
 * @param harmonic_count The number of harmonics to aggregate in the result.
 * @param sample_count The number of samples across the square wave period.
 * @return double The time domain result of the combined harmonics at point x.
 */
double box_car(unsigned int x,
               unsigned int harmonic_count,
               unsigned int sample_count)
{
    double mFourier = 0.0;
    for (unsigned int k = 0; k <= harmonic_count; k++)
    {
        mFourier += 1.0 / ((2 * k) + 1) * sin(((2 * k) + 1) * 2.0 * M_PI * x / sample_count);
    }
    return mFourier;
}

/**
 * Calculate the square wave samples across the time domain where the samples
 * are filtered to only include the harmonic_count.
 * @param harmonic_count The number of harmonics to aggregate in the result.
 * @param sample_count The number of samples across the square wave period.
 * @return std::vector<double>
 */
std::vector<double> box_car_samples(unsigned int harmonic_count,
                                    unsigned int sample_count)
{
    std::vector<double> square_wave;
    for (unsigned int x = 0; x < sample_count; x++)
    {
        double sample = box_car(x, harmonic_count, sample_count);
        square_wave.push_back(sample); // store the sample so the vector is not returned empty
    }
    return square_wave;
}
So mWave[x] is returned as a std::vector of doubles (floating point).
The function box_car_samples() is f(k, x) as stated before.
So since I can't use vectors inside Arduino IDE anyhow I've tried the following solution:
void ComputeBandlimitedSquareWave(int mHarmonics){
    for(int i = 0; i < sample_count; i++){
        mWavetable[i] = ComputeFourierSeriesSquare(x);
        if (x < sample_count) x++;
    }
}

float ComputeFourierSeriesSquare(int x){
    for(int k = 0; k <= mHarmonics; k++){
        mFourier += 1/((2*k)+1)*sin(((2*k)+1)*2*M_PI*x/sample_count);
    }
    return mFourier;
}
A minute ago I thought this must be right, but my monitors prove me wrong...
At first it sounds like a completely messed up sum of signals, but after about 2 seconds the true characteristic square-wave sound comes through. I'll try to figure out what I'm overlooking and keep you guys updated if I can isolate that last part coming through my speakers, because it actually has a really decent sound. Only the messy overlays in the beginning are making me desperate right now...
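Two things in the snippet above look suspicious, though I can only guess because the declarations are not shown. First, mFourier appears to be shared between calls and is never reset, so each table entry would also contain the sums of all previous entries. Second, 1/((2*k)+1) is integer division in C++, which yields 0 for every k >= 1. Below is a minimal sketch without std::vector that avoids both; `kSampleCount`, `fourierSquare` and `fillSquareWavetable` are hypothetical names, and the table length of 256 is an assumption.

```cpp
#include <math.h>
#include <stddef.h>

// Hypothetical table length; the post's sample_count is not shown.
constexpr size_t kSampleCount = 256;
constexpr double kPi = 3.14159265358979323846;

// Partial Fourier sum of a square wave at sample index x. The accumulator is
// a local variable, so every call starts from zero.
float fourierSquare(size_t x, int harmonics) {
    double sum = 0.0;
    for (int k = 0; k <= harmonics; ++k) {
        const double n = 2.0 * k + 1.0;  // odd harmonic number 1, 3, 5, ...
        // 1.0 / n is floating-point division, unlike 1/((2*k)+1) with ints.
        sum += (1.0 / n) * sin(n * 2.0 * kPi * static_cast<double>(x) / kSampleCount);
    }
    return static_cast<float>(sum);
}

// Fills a fixed-size wavetable with one period of the band-limited wave.
void fillSquareWavetable(float (&table)[kSampleCount], int harmonics) {
    for (size_t x = 0; x < kSampleCount; ++x) {
        table[x] = fourierSquare(x, harmonics);
    }
}
```

With harmonics set to 0 the table holds a plain sine; as harmonics grows, the value a quarter of the way through the period approaches pi/4, which is a quick sanity check before listening to the result.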

study of FFT - Why it's not fast?

I am not sure if this is more of a math or a programming question. If it's math, please tell me.
I know there are a lot of free, ready-to-use FFT projects, but I am trying to understand the FFT method, just for fun and for study. So I wrote both algorithms, DFT and FFT, to compare them.
But I have a problem with my FFT: there seems to be no big difference in efficiency. My FFT is only a little bit faster than the DFT (in some cases it's two times faster, but that's the maximum speedup).
Most articles about the FFT mention bit reversal, but I don't see the reason to use it. That is probably the issue: I don't understand it. Please help me. What am I doing wrong?
This is my code (you can copy it here and see how it works - online compiler):
#include <complex>
#include <iostream>
#include <math.h>
#include <cmath>
#include <vector>
#include <chrono>
#include <ctime>
float _Pi = 3.14159265;
float sampleRate = 44100;
float resolution = 4;
float _SRrange = sampleRate / resolution; // I divide the sample rate to make the loop smaller,
//just to perform tests faster
float bufferSize = 512;
// Clock class measures the time taken to execute the whole loop:
class Clock
{
public:
    Clock() { start = std::chrono::high_resolution_clock::now(); }
    ~Clock() {}
    float secondsElapsed() // note: despite the name, the value returned is in microseconds
    {
        auto stop = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }
    void reset() { start = std::chrono::high_resolution_clock::now(); }
private:
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
};
// Function to calculate magnitude of complex number:
float _mag_Hf(std::complex<float> sf);
// Function to calculate exp(-j*2*PI*n*k / sampleRate) - where "j" is imaginary number:
std::complex<float> _Wnk_Nc(float n, float k);
// Function to calculate exp(-j*2*PI*k / sampleRate):
std::complex<float> _Wk_Nc(float k);
int main() {
    float scaleFFT = 512; // divide and conquer - if it's "1" then the whole algorithm is simply a DFT
                          // I wonder what the maximum of that value is. I always thought it should be equal to
                          // the buffer size (number of samples), but above some value it starts to work slower than the DFT
    std::vector<float> inputSignal; // array of input signal
    inputSignal.resize(bufferSize); // how many samples we will use to calculate the Fourier Transform
    std::vector<std::complex<float>> _Sf; // array to store the Fourier Transform value for each measured frequency bin
    _Sf.resize(scaleFFT); // resize it to the size we need
    std::vector<std::complex<float>> _Hf_Db_vect; // array to store the magnitude (in logarithmic dB scale)
                                                  // for each measured frequency bin
    _Hf_Db_vect.resize(_SRrange); // resize it so it can store a value for each measured frequency
    std::complex<float> _Sf_I_half; // complex to calculate the first half of the frequency range
                                    // from 1 to Nyquist (sampleRate/2)
    std::complex<float> _Sf_II_half; // complex to calculate the second half of the frequency range
                                     // from Nyquist to sampleRate
    for(int i=0; i<(int)_Sf.size(); i++)
        inputSignal[i] = cosf((float)i/_Pi); // fill the input signal with some data, no matter what
    Clock _time; // start measuring time
    for(int freqBinK=0; freqBinK < _SRrange/2; freqBinK++) // calculate all frequencies (divide by 2 for the two halves)
    {
        for(int i=0; i<(int)_Sf.size(); i++) _Sf[i] = 0.0f; // clear all values, the next loop needs them to be zero
        for (int n=0; n<bufferSize/_Sf.size(); ++n) // Here I take all samples in the buffer
        {
            std::complex<float> _W = _Wnk_Nc(_Sf.size()*(float)n, freqBinK);
            for(int i=0; i<(int)_Sf.size(); i++) // Finally here is my divide and conquer
                _Sf[i] += inputSignal[_Sf.size()*n +i] * _W; // And I see no reason to use any bit reversal, how it shoul be????
        }
        std::complex<float> _Wk = _Wk_Nc(freqBinK);
        _Sf_I_half = 0.0f;
        _Sf_II_half = 0.0f;
        for(int z=0; z<(int)_Sf.size()/2; z++) // here I calculate the Fourier transform for each frequency
        {
            _Sf_I_half += _Wk_Nc(2.0f * (float)z * freqBinK) * (_Sf[2*z] + _Wk * _Sf[2*z+1]); // First half - to Nyquist
            _Sf_II_half += _Wk_Nc(2.0f * (float)z *freqBinK) * (_Sf[2*z] - _Wk * _Sf[2*z+1]); // Second half - to SampleRate
            // also don't see need to use reversal bit, where it shoul be??? :)
        }
        // Calculate magnitude in dB scale
        _Hf_Db_vect[freqBinK] = _mag_Hf(_Sf_I_half); // First half
        _Hf_Db_vect[freqBinK + _SRrange/2] = _mag_Hf(_Sf_II_half); // Second half
    }
    std::cout << _time.secondsElapsed() << std::endl; // time measured after execution of the whole loop
}
float _mag_Hf(std::complex<float> sf)
{
    float _Re_2;
    float _Im_2;
    _Re_2 = sf.real() * sf.real();
    _Im_2 = sf.imag() * sf.imag();
    return 20*log10(pow(_Re_2 + _Im_2, 0.5f)); // transform magnitude to logarithmic dB scale
}

std::complex<float> _Wnk_Nc(float n, float k)
{
    std::complex<float> _Wnk_Ncomp;
    _Wnk_Ncomp.real(cosf(-2.0f * _Pi * (float)n * k / sampleRate));
    _Wnk_Ncomp.imag(sinf(-2.0f * _Pi * (float)n * k / sampleRate));
    return _Wnk_Ncomp;
}

std::complex<float> _Wk_Nc(float k)
{
    std::complex<float> _Wk_Ncomp;
    _Wk_Ncomp.real(cosf(-2.0f * _Pi * k / sampleRate));
    _Wk_Ncomp.imag(sinf(-2.0f * _Pi * k / sampleRate));
    return _Wk_Ncomp;
}
One huge mistake you are making is calculating the butterfly weights (which involves sin and cos) on the fly (in _Wnk_Nc()). sin and cos typically cost 10s to 100s of clock cycles, whereas the other butterfly operations are just mul and add, which only take a few cycles, hence the need to factor these out. All fast FFT implementations do this as part of an initialisation step (usually called "plan creation" or similar). See e.g. FFTW and KissFFT.
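As an illustration of factoring the trigonometry out of the inner loop, here is a minimal sketch of a radix-2 FFT that reads its weights from a table built once up front. It is not how FFTW or KissFFT are implemented internally; `makeTwiddles`, `fftRec` and `fft` are hypothetical names, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// "Plan" step: build W[k] = exp(-2*pi*i*k/n) for k = 0 .. n/2 - 1 once, so
// the transform itself only performs multiplies and adds.
std::vector<cd> makeTwiddles(std::size_t n) {
    assert(n >= 2 && (n & (n - 1)) == 0);  // power of two
    const double pi = std::acos(-1.0);
    std::vector<cd> w(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        w[k] = std::polar(1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(n));
    }
    return w;
}

// Recursive decimation-in-time step. `stride` selects every stride-th input
// sample; the weight for a sub-transform of length n is table entry k * stride.
void fftRec(const cd* in, cd* out, std::size_t n, std::size_t stride,
            const std::vector<cd>& w) {
    if (n == 1) {
        out[0] = in[0];
        return;
    }
    fftRec(in, out, n / 2, 2 * stride, w);                   // even samples
    fftRec(in + stride, out + n / 2, n / 2, 2 * stride, w);  // odd samples
    for (std::size_t k = 0; k < n / 2; ++k) {
        const cd t = w[k * stride] * out[k + n / 2];
        const cd e = out[k];
        out[k] = e + t;          // butterfly: no sin/cos calls here
        out[k + n / 2] = e - t;
    }
}

// Transforms x using a table created by makeTwiddles(x.size()).
std::vector<cd> fft(const std::vector<cd>& x, const std::vector<cd>& w) {
    assert(x.size() == 2 * w.size());
    std::vector<cd> out(x.size());
    fftRec(x.data(), out.data(), x.size(), 1, w);
    return out;
}
```

The table would be created once per transform size and reused for every buffer, e.g. `const auto w = makeTwiddles(512);` followed by repeated calls to `fft(buffer, w)`.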
Apart from the above-mentioned "pre-calculating butterfly weights" optimization, most FFT implementations also use SIMD instructions to vectorize the code.
// also don't see need to use reversal bit, where it shoul be?
The very first butterfly loop should be reverse-bit indexed. Those indexes are usually calculated inside the recursion, but in a loop-based solution calculating them is also costly, so it's better to pre-calculate them in the plan as well.
Combining those optimization approaches can result in a speedup of approximately 100x.
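The reverse-bit indexes mentioned above can be computed once per transform size. A minimal sketch, assuming a power-of-two length; `makeBitReversal` and `permuteInPlace` are hypothetical helper names, not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns rev[i] = i with its log2(n) low bits reversed, for power-of-two n.
std::vector<std::size_t> makeBitReversal(std::size_t n) {
    assert(n >= 1 && (n & (n - 1)) == 0);
    std::size_t bits = 0;
    while ((std::size_t{1} << bits) < n) ++bits;
    std::vector<std::size_t> rev(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        // Reuse the entry for i / 2: shift it right by one and move the
        // lowest bit of i to the top position.
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (bits - 1));
    }
    return rev;
}

// Reorders data into bit-reversed order using a precomputed table.
template <typename T>
void permuteInPlace(std::vector<T>& data, const std::vector<std::size_t>& rev) {
    assert(data.size() == rev.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i < rev[i]) std::swap(data[i], data[rev[i]]);  // swap each pair once
    }
}
```

For n = 8 the table is 0, 4, 2, 6, 1, 5, 3, 7, which is the order in which an iterative decimation-in-time FFT expects its input.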
Most fast FFT implementations either use a lookup table of precomputed twiddle factors, or a simple recursion to rotate the twiddle factors on the fly, instead of calling trigonometric math library functions inside the FFT inner loop.
For large FFTs, using a trig recursion formula is less likely to thrash the data caches on contemporary processors.
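A rough sketch of the "rotate on the fly" alternative: one sin/cos pair gives the step, and every further twiddle factor is obtained by a complex multiplication. `rotateTwiddles` is a hypothetical name, and the function stores the values in a vector only so they can be inspected; inside a real inner loop one would keep just the running value.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Generates W^0, W^1, ..., W^(count-1) for W = exp(-2*pi*i/n) using a single
// trigonometric evaluation followed by complex multiplications only.
std::vector<std::complex<double>> rotateTwiddles(std::size_t n, std::size_t count) {
    const double pi = std::acos(-1.0);
    const std::complex<double> step = std::polar(1.0, -2.0 * pi / static_cast<double>(n));
    std::vector<std::complex<double>> w(count);
    std::complex<double> current(1.0, 0.0);
    for (std::size_t k = 0; k < count; ++k) {
        w[k] = current;
        current *= step;  // rotate by one more step
    }
    return w;
}
```

Rounding error grows slowly with each multiplication, so for long runs it is common to restart the recursion from an exactly computed value every so often.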

Cross entropy applied to backpropagation in neural network

I watched this awesome video by Dave Miller on making a neural network from scratch in C++ here:
Here is the full source code referenced in the video:
It uses mean squared error as the cost function. I'm interested in using a neural network for binary classification though and so would like to use cross-entropy as the cost function. I was hoping to add this to this code if possible, since I've already been playing around with it.
How would that be applied specifically here?
Would the only difference be in how the error is calculated for the output layer...or do the equations change all the way through backpropagation?
Does anything change at all? Is MSE versus cross-entropy solely used to get an idea of the overall error and not independently relevant to backpropagation?
Edit for clarity:
Here are the relevant functions.
//output layer - seems like error is just target value minus calculated value
void Neuron::calcOutputGradients(double targetVal)
{
    double delta = targetVal - m_outputVal;
    m_gradient = delta * Neuron::transferFunctionDerivative(m_outputVal);
}

double Neuron::sumDOW(const Layer &nextLayer) const
{
    double sum = 0.0;
    // Sum our contributions of the errors at the nodes we feed.
    for (unsigned n = 0; n < nextLayer.size() - 1; ++n) {
        sum += m_outputWeights[n].weight * nextLayer[n].m_gradient;
    }
    return sum;
}

void Neuron::calcHiddenGradients(const Layer &nextLayer)
{
    double dow = sumDOW(nextLayer);
    m_gradient = dow * Neuron::transferFunctionDerivative(m_outputVal);
}

void Neuron::updateInputWeights(Layer &prevLayer)
{
    // The weights to be updated are in the Connection container in the neurons in the preceding layer
    for (unsigned n = 0; n < prevLayer.size(); ++n) {
        Neuron &neuron = prevLayer[n];
        double oldDeltaWeight = neuron.m_outputWeights[m_myIndex].deltaWeight;
        //calculate new weight for neuron with momentum
        double newDeltaWeight = eta * neuron.getOutputVal() * m_gradient + alpha * oldDeltaWeight;
        neuron.m_outputWeights[m_myIndex].deltaWeight = newDeltaWeight;
        neuron.m_outputWeights[m_myIndex].weight += newDeltaWeight;
    }
}
Finally found the answer here:
You only have to change how the error at the output layer is calculated.
The relevant function to be changed is:
void Neuron::calcOutputGradients(double targetVal)
For mean square errors use:
double delta = targetVal - m_outputVal;
m_gradient = delta * Neuron::transferFunctionDerivative(m_outputVal);
For cross entropy just use:
m_gradient = targetVal - m_outputVal;
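The reason the derivative factor disappears is that, for a logistic sigmoid output y = 1/(1 + exp(-z)) with binary cross-entropy, the derivative of the cost with respect to z works out to y - t. That the output neuron uses a logistic sigmoid is an assumption on my part; with a different transfer function the simplification does not apply in this form. The sketch below checks the closed form against a finite-difference estimate; all function names are hypothetical.

```cpp
#include <cmath>

// Logistic sigmoid, assumed here as the output activation.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Binary cross-entropy for a target t in {0, 1} and a prediction y in (0, 1).
double crossEntropy(double t, double y) {
    return -(t * std::log(y) + (1.0 - t) * std::log(1.0 - y));
}

// Output-layer gradient in the sign convention of the code above, where the
// weights are updated with +=, i.e. this is the negative of dC/dz.
double outputDeltaCrossEntropy(double t, double y) { return t - y; }

// Central finite-difference estimate of -dC/dz at pre-activation z.
double numericDelta(double t, double z, double h = 1e-6) {
    const double up = crossEntropy(t, sigmoid(z + h));
    const double down = crossEntropy(t, sigmoid(z - h));
    return -(up - down) / (2.0 * h);
}
```

Comparing `outputDeltaCrossEntropy(t, sigmoid(z))` with `numericDelta(t, z)` for a few values of z is a cheap way to confirm the sign and the formula before changing the network code.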

Linear regression poor gradient descent performance

I have implemented a simple Linear Regression (single variate for now) example in C++ to help me get my head around the concepts. I'm pretty sure that the key algorithm is right but my performance is terrible.
This is the method which actually performs the gradient descent:
void LinearRegression::BatchGradientDescent(std::vector<std::pair<int,int>> & data,float& theta1,float& theta2)
{
    float weight = (1.0f/static_cast<float>(data.size()));
    float theta1Res = 0.0f;
    float theta2Res = 0.0f;
    for(auto p: data)
    {
        float cost = Hypothesis(p.first,theta1,theta2) - p.second;
        theta1Res += cost;
        theta2Res += cost*p.first;
    }
    theta1 = theta1 - (m_LearningRate*weight* theta1Res);
    theta2 = theta2 - (m_LearningRate*weight* theta2Res);
}
With the other key functions given as:
float LinearRegression::Hypothesis(float x,float theta1,float theta2) const
{
    return theta1 + x*theta2;
}

float LinearRegression::CostFunction(std::vector<std::pair<int,int>> & data,
                                     float theta1,
                                     float theta2) const
{
    float error = 0.0f;
    for(auto p: data)
    {
        float prediction = (Hypothesis(p.first,theta1,theta2) - p.second) ;
        error += prediction*prediction;
    }
    error *= 1.0f/(data.size()*2.0f);
    return error;
}

void LinearRegression::Regress(std::vector<std::pair<int,int>> & data)
{
    for(unsigned int itr = 0; itr < MAX_ITERATIONS; ++itr)
    {
        //Some visualisation code
    }
}
Now the issue is that if the learning rate is greater than around 0.000001 the value of the cost function after gradient descent is higher than it is before. That is to say, the algorithm is working in reverse. The line forms into a straight line through the origin pretty quickly but then takes millions of iterations to actually reach a reasonably well fit line.
With a learning rate of 0.01, after six iterations the output is: (where difference is costAfter-costBefore)
Cost before 102901.945312, cost after 517539430400.000000, difference 517539332096.000000
Cost before 517539430400.000000, cost after 3131945127824588800.000000, difference 3131944578068774912.000000
Cost before 3131945127824588800.000000, cost after 18953312418560698826620928.000000, difference 18953308959796185006080000.000000
Cost before 18953312418560698826620928.000000, cost after 114697949347691988409089177681920.000000, difference 114697930004878874575022382383104.000000
Cost before 114697949347691988409089177681920.000000, cost after inf, difference inf
Cost before inf, cost after inf, difference nan
In this example the thetas are set to zero, the learning rate to 0.000001, and there are 8,000,000 iterations! The visualisation code only updates the graph after every 100,000 iterations.
Function which creates the data points:
static void SetupRegressionData(std::vector<std::pair<int,int>> & data)
{
    srand (time(NULL));
    for(int x = 50; x < 750; x += 3)
        data.push_back(std::pair<int,int>(x+(rand() % 100), 400 + (rand() % 100) ));
}
In short, if my learning rate is too high, the gradient descent algorithm effectively runs backwards and tends to infinity, and if it is lowered to the point where it actually converges towards a minimum, the number of iterations required to do so is unacceptably high.
Have I missed anything/made a mistake in the core algorithm?
Looks like everything is behaving as expected, but you are having problems selecting a reasonable learning rate. That's not a totally trivial problem, and there are many approaches ranging from pre-defined schedules that progressively reduce the learning rate (see e.g. this paper) to adaptive methods such as AdaGrad or AdaDelta.
For your vanilla implementation with fixed learning rate you should make your life easier by normalising the data to zero mean and unit standard deviation before you feed it into the gradient descent algorithm. That way you will be able to reason about the learning rate more easily. Then you can just rescale your prediction accordingly.
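To see why the raw data forces such a tiny rate: with x values in the hundreds, the mean of x squared is on the order of 10^5, and for this quadratic cost a fixed step roughly has to stay below 2 divided by that, i.e. around 10^-5. After standardising, the same quantity is 1, so rates like 0.1 become usable. Here is a minimal sketch of that idea; `Scaler`, `fitScaler`, `Line` and `fitLine` are hypothetical names and not part of the code in the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Scaler { double mean; double stddev; };

// Mean and (population) standard deviation of v.
Scaler fitScaler(const std::vector<double>& v) {
    assert(!v.empty());
    double mean = 0.0;
    for (double a : v) mean += a;
    mean /= static_cast<double>(v.size());
    double var = 0.0;
    for (double a : v) var += (a - mean) * (a - mean);
    var /= static_cast<double>(v.size());
    return {mean, std::sqrt(var)};
}

struct Line { double intercept; double slope; };

// Batch gradient descent on standardised x and y; the fitted line is then
// mapped back to the original units.
Line fitLine(const std::vector<double>& x, const std::vector<double>& y,
             double learningRate, int iterations) {
    assert(x.size() == y.size());
    const Scaler sx = fitScaler(x);
    const Scaler sy = fitScaler(y);
    assert(sx.stddev > 0.0 && sy.stddev > 0.0);
    const double m = static_cast<double>(x.size());
    double t0 = 0.0, t1 = 0.0;  // parameters in standardised space
    for (int it = 0; it < iterations; ++it) {
        double g0 = 0.0, g1 = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            const double xs = (x[i] - sx.mean) / sx.stddev;
            const double ys = (y[i] - sy.mean) / sy.stddev;
            const double err = t0 + t1 * xs - ys;
            g0 += err;
            g1 += err * xs;
        }
        t0 -= learningRate * g0 / m;
        t1 -= learningRate * g1 / m;
    }
    // Undo the scaling: y = sy.mean + sy.stddev * (t0 + t1 * (x - sx.mean) / sx.stddev)
    const double slope = t1 * sy.stddev / sx.stddev;
    const double intercept = sy.mean + sy.stddev * t0 - slope * sx.mean;
    return {intercept, slope};
}
```

With this scaling a call such as `fitLine(x, y, 0.1, 500)` should converge in hundreds of iterations rather than millions, although the exact count depends on the data.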

generating correct spectrogram using fftw and window function

For a project I need to be able to generate a spectrogram from a .WAV file. I've read the following should be done:
Get N (transform size) samples
Apply a window function
Do a Fast Fourier Transform using the samples
Normalise the output
Generate spectrogram
In the image below you see two spectrograms of a 10000 Hz sine wave, both using the Hanning window function. On the left is a spectrogram generated by Audacity and on the right is my version. As you can see, my version has a lot more lines/noise. Is this leakage into different bins? How would I get a clear image like the one Audacity generates? Should I do some post-processing? I have not yet done any normalisation because I do not fully understand how to do so.
I found this tutorial explaining how to generate a spectrogram in c++. I compiled the source to see what differences I could find.
My math is very rusty to be honest so I'm not sure what the normalisation does here:
for(i = 0; i < half; i++){
    out[i][0] *= (2./transform_size);
    out[i][6] *= (2./transform_size);
    processed[i] = out[i][0]*out[i][0] + out[i][7]*out[i][8];
    //sets values between 0 and 1?
    processed[i] =10. * (log (processed[i] + 1e-6)/log(10)) /-60.;
}
After doing this I got this image (btw I've inverted the colors):
I then took a look at the difference between the input samples provided by my sound library and those of the tutorial. Mine were way higher, so I manually normalised them by dividing by the factor 32767.9. I then got this image, which looks pretty OK I think. But dividing by this number seems wrong, and I would like to see a different solution.
Here is the full relevant source code.
void Spectrogram::process(){
    int i;
    int transform_size = 1024;
    int half = transform_size/2;
    int step_size = transform_size/2;
    double in[transform_size];
    double processed[half];
    fftw_complex *out;
    fftw_plan p;
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * transform_size);
    for(int x=0; x < wavFile->getSamples()/step_size; x++){
        int j = 0;
        for(i = step_size*x; i < (x * step_size) + transform_size - 1; i++, j++){
            in[j] = wavFile->getSample(i)/32767.9;
        }
        //apply window function
        for(i = 0; i < transform_size; i++){
            in[i] *= windowHanning(i, transform_size);
            // in[i] *= windowBlackmanHarris(i, transform_size);
        }
        p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE);
        fftw_execute(p); /* repeat as needed */
        for(i = 0; i < half; i++){
            out[i][0] *= (2./transform_size);
            out[i][11] *= (2./transform_size);
            processed[i] = out[i][0]*out[i][0] + out[i][12]*out[i][13];
            processed[i] =10. * (log (processed[i] + 1e-6)/log(10)) /-60.;
        }
        for (i = 0; i < half; i++){
            if(processed[i] > 0.99)
                processed[i] = 1;
        }
    }
}
This is not exactly an answer as to what is wrong but rather a step by step procedure to debug this.
What do you think this line does? processed[i] = out[i][0]*out[i][0] + out[i][12]*out[i][13] Likely that is incorrect: fftw_complex is typedef double fftw_complex[2], so you only have out[i][0] and out[i][1], where the first is the real and the second the imaginary part of the result for that bin. If the array is contiguous in memory (which it is), then out[i][12] is likely the same as out[i+6][0] and so forth. Some of these will go past the end of the array, adding random values.
Is your window function correct? Print out windowHanning(i, transform_size) for every i and compare with a reference version (for example numpy.hanning or the matlab equivalent). This is the most likely cause, what you see looks like a bad window function, kind of.
Print out processed, and compare with a reference version (given the same input, of course you'd have to print the input and reformat it to feed into pylab/matlab etc). However, the -60 and 1e-6 are fudge factors which you don't want, the same effect is better done in a different way. Calculate like this:
power_in_db[i] = 10 * log(out[i][0]*out[i][0] + out[i][1]*out[i][1])/log(10)
Print out the values of power_in_db[i] for the same i but for all x (a horizontal line). Are they approximately the same?
If everything so far is good, the remaining suspect is setting the pixel values. Be very explicit about clipping to range, scaling and rounding.
int pixel_value = (int)round( 255 * (power_in_db[i] - min_db) / (max_db - min_db) );
if (pixel_value < 0) { pixel_value = 0; }
if (pixel_value > 255) { pixel_value = 255; }
Here, again, print out the values in a horizontal line, and compare with the grayscale values in your pgm (by hand, using the colorpicker in photoshop or gimp or similar).
At this point, you will have validated everything from end to end, and likely found the bug.
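The dB conversion and pixel mapping from the steps above could be wrapped in two small helpers, sketched below. `powerDb` and `dbToPixel` are hypothetical names, and the -120 dB floor is an arbitrary assumption whose only purpose is to avoid taking the logarithm of zero for silent bins.

```cpp
#include <algorithm>
#include <cmath>

// Power of one FFT bin in dB. A bin with zero power returns floorDb instead
// of -infinity, and no result is ever below floorDb.
double powerDb(double re, double im, double floorDb = -120.0) {
    const double power = re * re + im * im;
    if (power <= 0.0) return floorDb;
    return std::max(floorDb, 10.0 * std::log10(power));
}

// Maps a dB value linearly onto 0..255, clipping outside [minDb, maxDb].
int dbToPixel(double db, double minDb, double maxDb) {
    const double t = (db - minDb) / (maxDb - minDb);
    const double clipped = std::min(1.0, std::max(0.0, t));  // clip before rounding
    return static_cast<int>(std::lround(255.0 * clipped));
}
```

For instance, `dbToPixel(powerDb(out[i][0], out[i][1]), min_db, max_db)` would give the gray value for bin i, with `min_db` and `max_db` chosen from the range observed in the printed values.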
The code you produced was almost correct, so you didn't leave me much to correct:
void Spectrogram::process(){
    int transform_size = 1024;
    int half = transform_size/2;
    int step_size = transform_size/2;
    double in[transform_size];
    double processed[half];
    fftw_complex *out;
    fftw_plan p;
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * transform_size);
    for (int x=0; x < wavFile->getSamples()/step_size; x++) {
        // Fill the transformation array with a sample frame and apply the window function.
        // Normalization is performed later
        // (One error was here: you didn't set the last value of the array in)
        for (int j = 0, i = x * step_size; i < x * step_size + transform_size; i++, j++)
            in[j] = wavFile->getSample(i) * windowHanning(j, transform_size);
        p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE);
        fftw_execute(p); /* repeat as needed */
        for (int i=0; i < half; i++) {
            // (Here were some flaws concerning the access of the complex values)
            out[i][0] *= (2./transform_size); // real part
            out[i][1] *= (2./transform_size); // imaginary part
            processed[i] = out[i][0]*out[i][0] + out[i][1]*out[i][1]; // power spectrum
            processed[i] = 10./log(10.) * log(processed[i] + 1e-6); // dB
            // The resulting spectral values in 'processed' are in dB and related to a maximum
            // value of about 96dB. Normalization to a value range between 0 and 1 can be done
            // in several ways. I would suggest to set values below 0dB to 0dB and divide by 96dB:
            // Transform all dB values to a range between 0 and 1:
            if (processed[i] <= 0) {
                processed[i] = 0;
            } else {
                processed[i] /= 96.; // Reduce the divisor if you prefer darker peaks
                if (processed[i] > 1)
                    processed[i] = 1;
            }
        }
        // This should be called each time fftw_plan_dft_r2c_1d()
        // was called to avoid a memory leak:
        fftw_destroy_plan(p);
    }
    fftw_free(out); // release the buffer obtained from fftw_malloc()
}
The two corrected bugs were most probably responsible for the slight variation of successive transformation results. The Hanning window is very well suited to minimizing the "noise", so a different window would not have solved the problem. (Actually, @Alex I already pointed to the 2nd bug in his point 2. But in his point 3 he added a -Inf bug, as log(0) is not defined, which can happen if your wave file contains a stretch of exact 0 values. To avoid this, the constant 1e-6 is good enough.)
Not asked, but there are some optimizations:
put p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE); outside the main loop,
precalculate the window function outside the main loop,
abandon the array processed and just use a temporary variable to hold one spectral line at a time,
the two multiplications of out[i][0] and out[i][1] can be abandoned in favour of one multiplication with a constant in the following line. I left this (and other things) for you to improve
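As a sketch of the "precalculate the window" point: the table is built once and then applied to each frame with plain multiplications. The `windowHanning` function from the question is not shown, so the symmetric Hann formula below is an assumption about what it computes; `makeHannWindow` and `applyWindow` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / (size - 1)).
std::vector<double> makeHannWindow(std::size_t size) {
    assert(size >= 2);
    const double pi = std::acos(-1.0);
    std::vector<double> w(size);
    for (std::size_t n = 0; n < size; ++n) {
        w[n] = 0.5 - 0.5 * std::cos(2.0 * pi * static_cast<double>(n) /
                                    static_cast<double>(size - 1));
    }
    return w;
}

// Multiplies one frame by the precomputed window, in place. The frame must
// hold at least window.size() samples.
void applyWindow(double* frame, const std::vector<double>& window) {
    for (std::size_t n = 0; n < window.size(); ++n) {
        frame[n] *= window[n];
    }
}
```

The table would be created before the main loop, e.g. `const auto window = makeHannWindow(transform_size);`, and `applyWindow(in, window)` called once per frame after the samples are copied.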
Thanks to @Maxime Coorevits, a memory leak could additionally be avoided: "Each time you call fftw_plan_dft_r2c_1d(), memory is allocated by FFTW3. In your code, you only call fftw_destroy_plan() outside the outer loop. But in fact, you need to call this each time you request a plan."
Audacity typically doesn't map one frequency bin to one horizontal line, nor one sample period to one vertical line. The visual effect in Audacity may be due to resampling of the spectrogram picture in order to fit the drawing area.