While searching for ways to improve my waveform generation code, I came across SIMD and the libsimdpp library, but I have no idea how to use it. If I understand correctly, using raw SIMD would require me to write code for each architecture, whereas libsimdpp handles that for me.
I need to calculate the squared sum and the RMS value of a chunk of samples. I managed to speed this up with vectorization, which worked perfectly until I introduced the same calculation for both the left and right channels of an audio file.
So my question is: how can I use libsimdpp (or any library that makes SIMD easier for me) to improve the code below?
// START: vector containing all the audio samples (left channel shown)
std::vector<double> samplesL;
int nb_samples = samplesL.size();
// END
// START: Loop through the samples vector, incrementing the index each time by samples_per_pixel
for (int i = 0; i < nb_samples; i += samples_per_pixel)
{
// START: Create chunk of samples with the size of samples_per_pixel
double* chunk = &samplesL[i];
// END
// START: Calculate rms and squared sum
float sumL = 0;
float squaredsumL = 0;
/// there are multiple definitions of the above for both channels but I won't include them
//// to make the code easier to read
for (int j = 0; j < samples_per_pixel; j++)
{
if (chunk[j] < 0)
sumL += -chunk[j];
else
sumL += chunk[j];
squaredsumL += chunk[j] * chunk[j];
}
/// average
float average_point = (sumL * 2) / samples_per_pixel;
// rms
float meanL = squaredsumL / samples_per_pixel;
float rms_pointL = qSqrt(meanL);
/// Drawing of both average point and rms
//// [...]
// END
}
I am trying to utilize sparse matrices in Armadillo, and am noticing a significant difference in access times with SpMat<int> compared to equivalent code using Mat<int>.
Description:
Below are two methods, which are identical in every respect except that Method_One uses regular matrices and Method_Two uses sparse matrices.
Both methods take the following arguments:
WS, DS: Pointers to a NN dimensional array
WW: 13 K [max(WS)]
DD: 1.7 K [max(DS)]
NN: 2.3 M
TT: 50
I am using Visual Studio 2017 for compiling the code into a .mexw64 executable which can be called from Matlab.
Code:
void Method_One(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
Mat<int> WP(WW, TT, fill::zeros); // (13000 x 50) matrix
Mat<int> DP(DD, TT, fill::zeros); // (1700 x 50) matrix
Col<int> ZZ(NN, fill::zeros); // 2,300,000 column vector
for (int n = 0; n < NN; n++)
{
int w_n = (int) WS[n] - 1;
int d_n = (int) DS[n] - 1;
int t_n = rand() % TT;
WP(w_n, t_n)++;
DP(d_n, t_n)++;
ZZ(n) = t_n + 1;
}
return;
}
void Method_Two(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
SpMat<int> WP(WW, TT); // (13000 x 50) matrix
SpMat<int> DP(DD, TT); // (1700 x 50) matrix
Col<int> ZZ(NN, fill::zeros); // 2,300,000 column vector
for (int n = 0; n < NN; n++)
{
int w_n = (int) WS[n] - 1;
int d_n = (int) DS[n] - 1;
int t_n = rand() % TT;
WP(w_n, t_n)++;
DP(d_n, t_n)++;
ZZ(n) = t_n + 1;
}
return;
}
Timing:
I am timing both methods using wall_clock timer object in Armadillo. For example,
wall_clock timer;
timer.tic();
Method_One(WW, DD, TT, NN, WS, DS);
double t = timer.toc();
Results:
Timing elapsed for Method_One using Mat<int>: 0.091 sec
Timing elapsed for Method_Two using SpMat<int>: 30.227 sec (almost 300 times slower)
Any insights into this are highly appreciated!
UPDATE:
This issue has been fixed with newer version (8.100.1) of Armadillo.
Here are the new results:
Timing elapsed for Method_One using Mat<int>: 0.141 sec
Timing elapsed for Method_Two using SpMat<int>: 2.127 sec (15 times slower, which is acceptable!)
Thanks to Conrad and Ryan.
As hbrerkere already mentioned, the problem stems from the fact that the values of the matrix are stored in a packed format (CSC) that makes it time-consuming to do two things:
Find the index of an already existing entry: Depending on whether the column entries are sorted by their row index you need either linear or binary search.
Insert a value that was previously zero: Here you need to find the insertion point for your new value and move all elements after that, leading to Ω(n) worst case time for a single insertion!
All these operations are constant-time operations for dense matrices, which mostly explains the runtime difference.
My usual solution was to use a separate sparse matrix type for assembly (where you usually access an element multiple times) based on the coordinate format (storing triples (i, j, value)) that uses a map like std::map or std::unordered_map to store the triple index corresponding to a position (i,j) in the matrix.
Some similar approaches are also discussed in this question about matrix assembly
Example from my most recent use:
class DynamicSparseMatrix {
using Number = double;
using Index = std::size_t;
using Entry = std::pair<Index, Index>;
std::vector<Number> values;
std::vector<Index> rows;
std::vector<Index> cols;
std::map<Entry, Index> map; // unordered_map might be faster,
// but you need a suitable hash function
// like boost::hash<Entry> for this.
Index num_rows;
Index num_cols;
...
Number& value(Index row, Index col) {
// just to prevent misuse
assert(row >= 0 && row < num_rows);
assert(col >= 0 && col < num_cols);
// Find the entry in the matrix
Entry e{row, col};
auto it = map.find(e);
// If the entry hasn't previously been stored
if (it == map.end()) {
// Add a new entry by adding its value and coordinates
// to the end of the storage vectors.
it = map.insert(make_pair(e, values.size())).first;
rows.push_back(row);
cols.push_back(col);
values.push_back(0);
}
// Return the value
return values[(*it).second];
}
...
};
After assembly you can store all the values from rows, cols, values (which actually represent the matrix in Coordinate format), possibly sort them and do a batch insertion into your Armadillo matrix.
Sparse matrices are stored in a compressed format (CSC). Every time a non-zero element is inserted into a sparse matrix, the entire internal representation has to be updated. This is time-consuming.
It's much faster to construct the sparse matrix using batch constructors.
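To illustrate why batch construction is cheaper, here is a minimal sketch. It is not Armadillo's implementation: it takes all (row, col, value) triplets at once, sorts them a single time, sums duplicates, and fills the CSC arrays in one pass, so no existing entries are ever shifted. The names Triplet, CscMatrix and build_csc are made up for this example; Armadillo's batch constructors take location and value arrays instead, so check its documentation for the exact signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One (row, col, value) entry in coordinate (COO) format.
struct Triplet {
    std::size_t row;
    std::size_t col;
    int value;
};

// Compressed sparse column storage: values and row indices ordered by column;
// the entries of column c live in positions col_ptr[c] .. col_ptr[c + 1] - 1.
struct CscMatrix {
    std::size_t n_rows = 0;
    std::size_t n_cols = 0;
    std::vector<int> values;
    std::vector<std::size_t> row_idx;
    std::vector<std::size_t> col_ptr;
};

// Build the CSC arrays in one batch: sort once, merge duplicates by summing,
// and append each distinct position exactly once.
CscMatrix build_csc(std::vector<Triplet> entries, std::size_t n_rows, std::size_t n_cols) {
    std::sort(entries.begin(), entries.end(), [](const Triplet& a, const Triplet& b) {
        return a.col != b.col ? a.col < b.col : a.row < b.row;
    });

    CscMatrix m;
    m.n_rows = n_rows;
    m.n_cols = n_cols;
    m.col_ptr.assign(n_cols + 1, 0);

    for (std::size_t i = 0; i < entries.size(); ++i) {
        const Triplet& e = entries[i];
        assert(e.row < n_rows && e.col < n_cols);
        const bool duplicate =
            i > 0 && entries[i - 1].row == e.row && entries[i - 1].col == e.col;
        if (duplicate) {
            m.values.back() += e.value;  // same position as the previous entry
        } else {
            m.values.push_back(e.value);
            m.row_idx.push_back(e.row);
            ++m.col_ptr[e.col + 1];      // count the entries of this column
        }
    }
    // A prefix sum turns the per-column counts into column start offsets.
    for (std::size_t c = 0; c < n_cols; ++c) {
        m.col_ptr[c + 1] += m.col_ptr[c];
    }
    return m;
}
```

In the loop from the question you would push one triplet {w_n, t_n, 1} per iteration into a std::vector and convert once at the end, which costs one sort instead of millions of insertions into the compressed structure.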
I'm trying to use C++ to recreate the spectrogram function used by Matlab. The function uses a Short Time Fourier Transform (STFT). I found some C++ code here that performs a STFT. The code seems to work perfectly for all frequencies but I only want a few. I found this post for a similar question with the following answer:
Just take the inner product of your data with a complex exponential at
the frequency of interest. If g is your data, then just substitute for
f the value of the frequency you want (e.g., 1, 3, 10, ...)
Having no background in mathematics, I can't figure out how to do this. The inner product part seems simple enough from the Wikipedia page but I have absolutely no idea what he means by (with regard to the formula for a DFT)
a complex exponential at frequency of interest
Could someone explain how I might be able to do this? My data structure after the STFT is a matrix filled with complex numbers. I just don't know how to extract my desired frequencies.
Relevant function, where window is Hamming, and vector of desired frequencies isn't yet an input because I don't know what to do with them:
Matrix<complex<double>> ShortTimeFourierTransform::Calculate(const vector<double> &signal,
const vector<double> &window, int windowSize, int hopSize)
{
int signalLength = signal.size();
int nOverlap = hopSize;
int cols = (signal.size() - nOverlap) / (windowSize - nOverlap);
Matrix<complex<double>> results(window.size(), cols);
int chunkPosition = 0;
int readIndex;
// Should we stop reading in chunks?
bool shouldStop = false;
int numChunksCompleted = 0;
int i;
// Process each chunk of the signal
while (chunkPosition < signalLength && !shouldStop)
{
// Copy the chunk into our buffer
for (i = 0; i < windowSize; i++)
{
readIndex = chunkPosition + i;
if (readIndex < signalLength)
{
// Note the windowing!
data[i][0] = signal[readIndex] * window[i];
data[i][1] = 0.0;
}
else
{
// we have read beyond the signal, so zero-pad it!
data[i][0] = 0.0;
data[i][1] = 0.0;
shouldStop = true;
}
}
// Perform the FFT on our chunk
fftw_execute(plan_forward);
// Copy the first (windowSize/2 + 1) data points into your spectrogram.
// We do this because the FFT output is mirrored about the nyquist
// frequency, so the second half of the data is redundant. This is how
// Matlab's spectrogram routine works.
for (i = 0; i < windowSize / 2 + 1; i++)
{
double real = fft_result[i][0];
double imaginary = fft_result[i][1];
results(i, numChunksCompleted) = complex<double>(real, imaginary);
}
chunkPosition += hopSize;
numChunksCompleted++;
} // Excuse the formatting, the while ends here.
return results;
}
Look up the Goertzel algorithm or filter for example code that uses the computational equivalent of an inner product against a complex exponential to measure the presence or magnitude of a specific stationary sinusoidal frequency in a signal. Performance or resolution will depend on the length of the filter and your signal.
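As a rough sketch of what that looks like in code: the first function below is the literal inner product of one frame with a complex exponential at the frequency of interest, and the second is the Goertzel recurrence, which yields the squared magnitude of the same quantity using only real arithmetic in the loop. The function names, the frame argument (one chunk of the signal, already multiplied by the window) and the sample_rate_hz parameter are assumptions for illustration; they are not part of the STFT code in the question.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Inner product of one frame with a complex exponential at freq_hz:
// X = sum over n of frame[n] * exp(-j * w * n), with w in radians per sample.
std::complex<double> dft_at_frequency(const std::vector<double>& frame,
                                      double freq_hz, double sample_rate_hz) {
    const double pi = std::acos(-1.0);
    const double w = 2.0 * pi * freq_hz / sample_rate_hz;
    std::complex<double> acc(0.0, 0.0);
    for (std::size_t n = 0; n < frame.size(); ++n) {
        acc += frame[n] * std::polar(1.0, -w * static_cast<double>(n));
    }
    return acc;
}

// Goertzel recurrence: returns |X|^2 for the same frequency.
double goertzel_power(const std::vector<double>& frame,
                      double freq_hz, double sample_rate_hz) {
    const double pi = std::acos(-1.0);
    const double w = 2.0 * pi * freq_hz / sample_rate_hz;
    const double coeff = 2.0 * std::cos(w);
    double s_prev = 0.0;   // s[n - 1]
    double s_prev2 = 0.0;  // s[n - 2]
    for (double x : frame) {
        const double s = x + coeff * s_prev - s_prev2;
        s_prev2 = s_prev;
        s_prev = s;
    }
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2;
}
```

For a spectrogram restricted to a few frequencies, you would call one of these once per windowed chunk and per desired frequency instead of running the full FFT, and store the result in the corresponding row and column of the output matrix.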
I wrote this simple code, which reads a distance from the Sharp infrared sensor and prints the average distance in cm over serial.
When I upload this code to the Arduino Mega board, the Arduino starts blinking the LED (pin 13) and the program does nothing. Where is the bug in this code?
#include <QueueList.h>
const int ANALOG_SHARP = 0; //Set pin data from sharp.
QueueList <float> queuea;
float cm;
float qu1;
float qu2;
float qu3;
float qu4;
float qu5;
void setup() {
Serial.begin(9600);
}
void loop() {
cm = read_gp2d12_range(ANALOG_SHARP); //Convert to cm (unit).
queuea.push(cm); //Add item to queue, when I add only this line Arduino crash.
if ( 5 <= queuea.peek()) {
Serial.println(average());
}
}
float read_gp2d12_range(byte pin) { //Function converting to cm (unit).
int tmp;
tmp = analogRead(pin);
if (tmp < 3)
return -1; // Invalid value.
return (6787.0 /((float)tmp - 3.0)) - 4.0;
}
float average() { //Calculate average length
qu1 += queuea.pop();
qu2 += queuea.pop();
qu3 += queuea.pop();
qu4 += queuea.pop();
qu5 += queuea.pop();
float aver = ((qu1+qu2+qu3+qu4+qu5)/5);
return aver;
}
I agree with the peek() -> count() error listed by vhallac. But I'll also point out that you should consider averaging by powers of 2 unless there is a strong case to do otherwise.
The reason is that on microcontrollers, division is slow. By averaging over a power of 2 (2,4,8,16,etc.) you can simply calculate the sum and then bitshift it.
To calculate the average of 2: (v1 + v2) >> 1
To calculate the average of 4: (v1 + v2 + v3 + v4) >> 2
To calculate the average of n values (where n is a power of 2), just right-shift the sum by log2(n) bits.
As long as the datatype for your sum variable is big enough and won't overflow, this is much easier and much faster.
Note: this won't work for floats in general. In fact, microcontrollers aren't optimized for floats. You should consider converting from int (which I'm assuming is what your ADC is reading) to float at the end, after the averaging, rather than before.
By converting from int to float and then averaging floats, you lose more precision than by averaging ints and then converting the result to a float.
Other:
You're using the += operator without initializing the variables (qu1, qu2, etc.) -- it's good practice to initialize them if you're going to use += but it looks as if = would work fine.
For floats, I'd have written the average function as:
float average(QueueList<float> & q, int n)
{
float sum = 0;
for(int i=0; i<n; i++)
{
sum += q.pop();
}
return (sum / (float) n);
}
And called it: average(queuea, 5);
You could use this to average any number of sensor readings, and later reuse the same code to average floats in a completely different QueueList. Passing the number of readings to average as a parameter will really come in handy if you need to tweak it.
TL;DR:
Here's how I would have done it:
#include <QueueList.h>
const int ANALOG_SHARP=0; // set pin data from sharp
const int AvgPower = 2; // 1 for 2 readings, 2 for 4 readings, 3 for 8, etc.
const int AvgCount = 1 << AvgPower; // 2^AvgPower, without going through floating-point pow()
QueueList <int> SensorReadings;
void setup(){
Serial.begin(9600);
}
void loop()
{
int reading = analogRead(ANALOG_SHARP);
SensorReadings.push(reading);
if(SensorReadings.count() >= AvgCount)
{
int avg = average2(SensorReadings, AvgPower);
Serial.println(gp2d12_to_cm(avg));
}
}
float gp2d12_to_cm(int reading)
{
if(reading <= 3){ return -1; }
return((6787.0 /((float)reading - 3.0)) - 4.0);
}
int average2(QueueList<int> & q, int AvgPower)
{
int AvgCount = 1 << AvgPower; // 2^AvgPower as an exact integer
long sum = 0;
for(int i=0; i<AvgCount; i++)
{
sum += q.pop();
}
return (sum >> AvgPower);
}
You are using queuea.peek() to obtain the count. peek() only returns the next item in the queue (without removing it), not the number of items. You should use queuea.count() instead.
Also you might consider changing the condition tmp < 3 to tmp <= 3. If tmp is 3, you divide by zero.
Great improvement jedwards, however the first question I have is why use queuelist instead of an int array.
As an example I would do the following:
int average(int analog_reading)
{
#define NUM_OF_AVG 5
static int readings[NUM_OF_AVG];
static int next_position;
if (++next_position >= NUM_OF_AVG)
{
next_position=0;
}
readings[next_position]=analog_reading;
// Recompute the sum on every call so it does not keep growing across calls.
long sum = 0;
for(int i=0; i<NUM_OF_AVG; i++)
{
sum += readings[i];
}
return sum/NUM_OF_AVG;
}
Now I compute a new rolling average with every reading, and it eliminates all the issues related to dynamic memory allocation (memory fragmentation, no available memory, memory leaks) in an embedded device.
I appreciate and understand the use of shifting for a division by 2, 4 or 8; however, I would stay away from that technique for two reasons.
First, I think readability and maintainability of the source code are more important than saving a little bit of time with a shift instead of a divide, unless you can test and verify that the divide is a bottleneck.
Second, I believe most current optimizing compilers will do a shift if possible; I know GCC does.
I will leave refactoring out the for loop for the next guy.
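For whoever picks that up, one possible way to drop the for loop is to keep a running sum: subtract the reading that is about to be overwritten, add the new one, and divide. This is only a sketch in plain C++; the class name RollingAverage and the constant kWindow are made up here, and unlike the function above it divides by the number of readings seen so far until the buffer is full.

```cpp
#include <cstddef>

// Fixed-size rolling average with a running sum, so each update is O(1)
// and no dynamic memory is used.
class RollingAverage {
public:
    static constexpr std::size_t kWindow = 5;  // hypothetical window size

    // Store a new reading and return the average of the readings held so far.
    int update(int reading) {
        sum_ -= readings_[next_];   // drop the oldest reading (0 while filling)
        readings_[next_] = reading;
        sum_ += reading;
        next_ = (next_ + 1) % kWindow;
        if (count_ < kWindow) {
            ++count_;
        }
        return static_cast<int>(sum_ / static_cast<long>(count_));
    }

private:
    int readings_[kWindow] = {};
    std::size_t next_ = 0;
    std::size_t count_ = 0;
    long sum_ = 0;
};
```

On an Arduino you would keep one global instance and call update(analogRead(pin)) from loop(). If kWindow is changed to a power of two, the compiler is free to turn the modulo into a mask.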
So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. to read all the pixel values of the geotiff one pixel at a time takes around 30 seconds. (which involves a lot of complex math to properly georeference the pixel values to a corresponding point), to calculate the statistics of the entire matrix it takes around 6 minutes.
void CalculateStats()
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
for(size_t row = 0; row < vals.size(); row++)
{
for(size_t col = 0; col < vals.at(row).size(); col++)
{
double mean_prev = new_mean;
T value = get(row, col);
new_mean += (value - new_mean) / (cnt + 1);
new_standard_dev += (value - new_mean) * (value - mean_prev);
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
int n = 0;
double delta = 0.0;
double mean2 = 0.0;
std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
std::vector<double>::const_iterator ignore_end = ignore_values.end();
for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
{
for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
{
// This method of calculation is based on Knuth's algorithm.
T value = *col;
if(std::find(ignore_begin, ignore_end, value) != ignore_end)
continue;
n++;
delta = value - new_mean;
new_mean = new_mean + (delta / n);
mean2 = mean2 + (delta * (value - new_mean));
// Find new max/min's.
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
}
}
stats_standard_dev = sqrt(mean2 / (n - 1)); // mean2 / (n - 1) is the sample variance
stats_min = new_min;
stats_max = new_max;
stats_mean = new_mean;
}
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache the results of the size() calls, but I don't know if that's the issue.
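A timing statement can be as small as the helper below. This is a sketch using std::chrono from the standard library; the name time_ms is made up for this example.

```cpp
#include <chrono>
#include <utility>

// Run `work` once and return the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You would wrap the suspect call in a lambda, for example time_ms([&] { CalculateStats(); }), and print the result. Wrapping smaller pieces (the loop only, or the loop without the get() call) narrows down where the time goes.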
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
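Roughly like this. It is a sketch that assumes the matrix is a std::vector<std::vector<int>> as described in the question; the Stats struct and the function name are made up, and the standard deviation uses the n (population) version.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Stats {
    double mean;
    double std_dev;
    int min;
    int max;
    std::size_t count;
};

Stats two_pass_stats(const std::vector<std::vector<int>>& vals) {
    Stats s{0.0, 0.0, 0, 0, 0};
    long long sum = 0;
    bool first = true;

    // Pass 1: sum, min, max and count; no division in the loop.
    for (const auto& row : vals) {
        for (int v : row) {
            sum += v;
            if (first) {
                s.min = v;
                s.max = v;
                first = false;
            } else {
                s.min = std::min(s.min, v);
                s.max = std::max(s.max, v);
            }
            ++s.count;
        }
    }
    if (s.count == 0) {
        return s;
    }
    s.mean = static_cast<double>(sum) / static_cast<double>(s.count);

    // Pass 2: sum of squared deviations from the now-known mean.
    double sq_dev = 0.0;
    for (const auto& row : vals) {
        for (int v : row) {
            const double d = v - s.mean;
            sq_dev += d * d;
        }
    }
    s.std_dev = std::sqrt(sq_dev / static_cast<double>(s.count));
    return s;
}
```

The second pass costs another sweep over memory, but the inner loops contain only additions, one subtraction and one multiplication.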
First thing I spotted is that you evaluate vals.at(row).size() in the loop, which, obviously, isn't supposed to improve performance. It also applies to vals.size(), but of course inner loop is worse. If vals is a vector of vector, you better use iterators or at least keep reference for the outer vector (because get() with indices parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
T value = *ii;
// the rest
}
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help
Second, make your row < vals.size into some const comparison instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size.
what is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case. Normally it's less than a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch prediction. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't get all the information you want when you try to step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it runs 230 times slower than that. Are you on a 10 MHz processor?
There are other suggestions for making it faster ( such as moving it to use SSE, if all the values are representable using 8-bit types ), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or are running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard
deviation) the only thing required is to compute the sum
of value and the sum of squared value. From these
two sums the mean and standard deviation can be computed
after the outer loop (together with a third value, the
number of samples, n in your new/updated code). The
equations can be derived from the definitions or found
on the web, e.g. Wikipedia. For instance the mean is
just sum of value divided by n. For the n version (in
contrast to the n-1 version - however n is large in
this case so it doesn't matter which one is used) the
standard deviation is: sqrt( n * sumOfSquaredValue -
sumOfValue * sumOfValue) / n. Thus only two floating point
additions and one multiplication are needed in the
inner loop. Overflow is not a problem with these sums as
the range for doubles is about 10^308. In particular you will
get rid of the expensive floating point division that
the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are
rewritten every time (the compiler may or may not
prevent this). As the minimum quickly becomes small and
the maximum quickly becomes large, only the two comparisons
should happen for the majority of loop iterations: use
if statements instead to be sure. It can be argued, but
on the other hand it is trivial to do.
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
vector<T>::const_iterator value;
vector<T>::const_iterator value_end = row->end();
for(value = row->begin(); value < value_end; ++value)
{
double mean_prev = new_mean;
new_mean += (*value - new_mean) / (cnt + 1);
new_standard_dev += (*value - new_mean) * (*value - mean_prev);
// find new max/min's
new_min = min(*value, new_min);
new_max = max(*value, new_max);
cnt++;
}
}
The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.
If your container type is a list, this will be significantly faster, because the lookup time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm not sure what type vals is, but vals.at(row).size() could take a long time if it iterates through the collection itself. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²).
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.
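A small sketch of that idea, assuming int elements: the class name FlatMatrix and its member names are made up for illustration. All elements live in one contiguous std::valarray, and rows or columns are selected with std::slice (start, length, stride).

```cpp
#include <cstddef>
#include <valarray>

// Row-major matrix stored in a single contiguous valarray.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}  // elements start at 0

    int& at(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }

    // Copy of row r: cols_ consecutive elements starting at r * cols_.
    std::valarray<int> row(std::size_t r) const {
        return data_[std::slice(r * cols_, cols_, 1)];
    }

    // Copy of column c: rows_ elements, one every cols_ positions.
    std::valarray<int> col(std::size_t c) const {
        return data_[std::slice(c, rows_, cols_)];
    }

    const std::valarray<int>& data() const { return data_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::valarray<int> data_;
};
```

With this layout, data().min() and data().max() scan the whole matrix in memory order. Note that valarray's sum() accumulates in the element type, so for a 10,000 by 15,000 matrix of ints you would accumulate into a wider type in your own loop to avoid overflow.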
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
//OHGOD
double accum_f;
double accum_sq_f;
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
const int oku = 100000000;
int accum_ichi = 0;
int accum_oku = 0;
int accum_sq_ichi = 0;
int accum_sq_oku = 0;
size_t cnt = 0;
int v1 = 0;
int v2 = 0;
v1 = vals.size();
for(size_t row = 0; row < v1; row++)
{
v2 = vals.at(row).size();
for(size_t col = 0; col < v2; col++)
{
T value = get(row, col);
accum_ichi += value;
accum_sq_ichi += (value * value);
// perform carries
accum_oku += (accum_ichi / oku);
accum_ichi %= oku;
accum_sq_oku += (accum_sq_ichi / oku);
accum_sq_ichi %= oku;
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
// now, and only now, do we use floating-point arithmetic
accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
new_mean = accum_f / (double)(cnt);
// standard deviation formula from Wikipedia
stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
std::cout << stats_standard_dev << std::endl;
}