I've written a program that approximates Pi using the Monte Carlo method. It works fine, but I wonder if I can make it work better and faster, because when inserting something like n = 100000000 and bigger, it takes quite some time to do the calculations and print the result.
I imagined I could improve the estimate by taking the median of n results, but considering how slow my algorithm is for big numbers, I decided against it.
Basically, the question is: how can I make this function work faster?
Here is the code I've got so far:
double estimate_pi(double n)
{
    int i;
    double x, y, distance;
    srand((unsigned)time(0));
    double num_point_circle = 0;
    double num_point_total = 0;
    double final;
    for (i = 0; i < n; i++)
    {
        x = (double)rand() / RAND_MAX;
        y = (double)rand() / RAND_MAX;
        distance = sqrt(x * x + y * y);
        if (distance <= 1)
        {
            num_point_circle += 1;
        }
        num_point_total += 1;
    }
    final = ((4 * num_point_circle) / num_point_total);
    return final;
}
An obvious (small) speedup is to get rid of the square root operation.
sqrt(x*x + y*y) is smaller than or equal to 1 exactly when x*x + y*y is, because sqrt is monotonic and sqrt(1) = 1.
So you can just write:
double distance2 = x*x + y*y;
if (distance2 <= 1) {
    ...
}
That should gain something, as computing the square root is an expensive operation that takes much more time (roughly 4-20 times slower than one addition, depending on CPU, architecture, optimization level, ...).
You should also use integer types for the variables that count, like num_point_circle, n and num_point_total. Use int, long or long long for them.
The Monte Carlo algorithm is also embarrassingly parallel. So an idea to speed it up would be to run the algorithm with multiple threads, each processing a fraction of n, and then at the end sum up the number of points that are inside the circle.
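For illustration, here is a minimal sketch of that idea with C++11 threads; rand() is not thread-safe, so each thread gets its own generator, and all names here are illustrative, not part of the original code:

#include <random>
#include <thread>
#include <vector>

// Count points inside the quarter circle for one slice of the work.
static long long count_in_circle(long long iterations, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    long long inside = 0;
    for (long long i = 0; i < iterations; i++) {
        double x = dist(gen);
        double y = dist(gen);
        inside += (x * x + y * y <= 1.0);
    }
    return inside;
}

double estimate_pi_parallel(long long n, unsigned num_threads)
{
    std::vector<std::thread> threads;
    std::vector<long long> counts(num_threads, 0);
    long long per_thread = n / num_threads;
    for (unsigned t = 0; t < num_threads; t++)
        threads.emplace_back([&counts, t, per_thread] {
            counts[t] = count_in_circle(per_thread, std::random_device{}());
        });
    for (auto& th : threads)
        th.join();
    long long total_inside = 0;
    for (long long c : counts)
        total_inside += c;
    return 4.0 * total_inside / ((double)per_thread * num_threads);
}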
Additionally you could try to improve the speed with SIMD instructions.
And lastly, there are much more efficient ways of computing Pi.
With the Monte Carlo method you can do millions of iterations and only receive an accuracy of a couple of digits. With better algorithms it's possible to compute thousands, millions or billions of digits.
E.g. you could read up on the website https://www.i4cy.com/pi/
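To give a feel for the difference, here is a hedged sketch (not taken from the site above) using Machin's formula pi = 16*arctan(1/5) - 4*arctan(1/239); a dozen series terms already get close to full double precision, which Monte Carlo can never reach:

#include <stdio.h>

// Evaluate arctan(1/x) by its Taylor series; converges quickly for x > 1.
static double arctan_inv(double x, int terms)
{
    double inv = 1.0 / x, inv2 = inv * inv;
    double term = inv, sum = 0.0;
    for (int k = 0; k < terms; k++) {
        sum += (k % 2 == 0 ? term : -term) / (2 * k + 1);
        term *= inv2;
    }
    return sum;
}

int main(void)
{
    // Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    double pi = 16.0 * arctan_inv(5.0, 12) - 4.0 * arctan_inv(239.0, 12);
    printf("%.15f\n", pi); // ~15 correct digits from 12 terms
    return 0;
}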
First of all, you should modify the code to consider how much the estimate changes after a certain number of iterations and stop once it has reached a satisfying accuracy.
For your code as is, with a fixed number of iterations, there isn't much to optimize that the compiler cannot do better than you. One minor thing you can improve is to drop computing the square root in every iteration. You don't need it, because for non-negative x, sqrt(x) <= 1 is true exactly when x <= 1.
Very roughly speaking, you can expect a good estimate from your Monte Carlo method once your (x, y) points are uniformly distributed in the quarter circle. Considering that, there is a much simpler way to get points uniformly distributed without randomness: use a uniform grid and count how many points are inside the circle and how many are outside.
I would expect this to converge much better (when decreasing the grid spacing) than Monte Carlo. However, you can also use it to speed up your Monte Carlo code: start from counts of num_point_circle and num_point_total obtained from points on a uniform grid, and then keep incrementing the counters by continuing with randomly distributed points.
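A minimal sketch of the grid idea (the function name is mine, for illustration):

// Estimate pi from a uniform m-by-m grid of points in the unit square.
double estimate_pi_grid(int m)
{
    long long inside = 0;
    for (int i = 0; i < m; i++) {
        double x = (i + 0.5) / m; // cell-centered grid points
        for (int j = 0; j < m; j++) {
            double y = (j + 0.5) / m;
            inside += (x * x + y * y <= 1.0);
        }
    }
    return 4.0 * (double)inside / ((double)m * m);
}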
As the error decreases as the inverse square root of the number of samples, gaining a single digit requires one hundred times more samples. No micro-optimization will help significantly. Just consider that this method is of no practical use for computing π.
If you insist: https://en.wikipedia.org/wiki/Variance_reduction.
A first-level improvement would be to sample on a regular grid, using a brute-force algorithm, as suggested by 463035818_is_not_a_number.
The next-level improvement would be to "draw" just a semicircle for each x, counting how many points lie below y = sqrt(radius*radius - x*x).
This reduces the complexity from O(N^2) to O(N).
With a grid size (== radius) of 10, 100, 1000, 10000, etc., one should gain about one digit at each step.
One of the problems with the rand() function is that the numbers soon begin to repeat; with a regular grid and the problem turned into this O(N) form, we can even simulate a grid of size 2^32 in quite a reasonable time.
With ideas from Bresenham's circle drawing algorithm we can even quickly evaluate whether the candidate (x+1, y_previous) or (x+1, y_previous-1) should be selected, as only one of those is inside the circle in the first octant. For the second octant, other tricks are needed to avoid sqrt.
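Here is a sketch of the O(N) column-counting idea in its plain form, still with one sqrt per column (the Bresenham-style refinement above would remove even that); the error behaves like O(1/r), matching the one-digit-per-step claim:

#include <math.h>

// For each integer x in [0, r), count the lattice points (x, y) with
// x*x + y*y <= r*r, i.e. y in [0, floor(sqrt(r*r - x*x))].
double estimate_pi_columns(long long r)
{
    long long inside = 0;
    for (long long x = 0; x < r; x++)
        inside += (long long)sqrt((double)(r * r - x * x)) + 1;
    return 4.0 * (double)inside / ((double)r * r);
}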
The question was "How can I make this function work faster?", not more precise!
Since Monte Carlo is an estimate anyway and the final multiplication is by a power of two (2, 4, 8 and so on), you can also use bit operations. Any if statement makes it slow, so try to get rid of it. Since we increment by either 1 or nothing (0) anyway, the branch can be reduced to simple math, which should be faster. And since i is initialized before the loop and counted up inside it, the loop can also be a while loop.
#include <time.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

double estimate_alt_pi(double n)
{
    uint64_t num_point_total = 0;
    double x, y;
    srand((unsigned)time(0));
    uint64_t num_point_circle = 0;
    while (num_point_total < n) {
        x = (double)rand() / RAND_MAX;
        y = (double)rand() / RAND_MAX;
        num_point_circle += (sqrt(x * x + y * y) <= 1); // branch-free count
        num_point_total++; // we don't need a separate 'i'
    }
    return ((num_point_circle << 2) / (double)num_point_total); // x << 2 == x * 4
}
Benchmarked with Xcode on a Mac, it looks like this:
extern uint64_t dispatch_benchmark(size_t count, void (^block)(void));

int main(int argc, const char * argv[]) {
    size_t count = 1000000;
    double input = 1222.52764523423;
    __block double resultA;
    __block double resultB;
    uint64_t t = dispatch_benchmark(count, ^{
        resultA = estimate_pi(input);
    });
    uint64_t t2 = dispatch_benchmark(count, ^{
        resultB = estimate_alt_pi(input);
    });
    fprintf(stderr, "estimate_____pi=%f time used=%llu\n", resultA, t);
    fprintf(stderr, "estimate_alt_pi=%f time used=%llu\n", resultB, t2);
    return 0;
}
It is ~1.35 times faster, i.e. it takes ~73% of the original time,
but there is significantly less difference when the given number is lower.
Also, the whole algorithm works properly only up to inputs with at most 7 digits, because of the data types used. Below 2 digits it is even slower. So the whole Monte Carlo algorithm is indeed not really worth digging deeper into when it's only about speed while keeping some kind of reliability.
Literally nothing will be faster than using a #define or a static with a fixed value for Pi, but that was not your question.
Your code is also bad from a performance aspect, as:
x = (double)rand()/RAND_MAX;
y = (double)rand()/RAND_MAX;
converts from int to double and performs a division for every sample ... Why not use Random() instead? Also this:
for (i=0; i<n; i++)
is a bad idea, as n is a double and i is an int, so either store n in an int variable at the start or change the header to take an int n. By the way, why are you computing num_point_total when you already have n? Also:
num_point_circle += (sqrt(x*x + y*y) <= 1);
is a bad idea. Why sqrt? You know 1^2 = 1, so you can simply do:
num_point_circle += (x*x + y*y <= 1);
Why not do continuous computing? What you need to implement is:
load of state at app start
computation either in a timer or in the OnIdle event, so in each iteration/event you do N iterations of the Pi computation (adding to some global sum and count)
save of state at app exit
Beware: Monte Carlo Pi computation converges very slowly, and you will hit floating point accuracy problems once the sum grows too big.
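The save/load of state can be as simple as serializing the two counters; a minimal portable sketch (the file name is arbitrary):

#include <stdio.h>

long long pi_inside = 0, pi_total = 0; // global accumulated state

void load_state(void) // call at app start
{
    FILE *f = fopen("pi_state.txt", "r");
    if (f) { fscanf(f, "%lld %lld", &pi_inside, &pi_total); fclose(f); }
}

void save_state(void) // call at app exit
{
    FILE *f = fopen("pi_state.txt", "w");
    if (f) { fprintf(f, "%lld %lld\n", pi_inside, pi_total); fclose(f); }
}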
Here is a small example I did a long time ago doing continuous Monte Carlo...
Form cpp:
//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
#include "performance.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
int i=0,n=0,n0=0;
//---------------------------------------------------------------------------
void __fastcall TForm1::Idleloop(TObject *Sender, bool &Done)
{
    int j;
    double x, y;
    for (j = 0; j < 10000; j++, n++)
    {
        x = Random(); x *= x;
        y = Random(); y *= y;
        if (x + y <= 1.0) i++;
    }
}
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner) : TForm(Owner)
{
    tbeg();
    Randomize();
    Application->OnIdle = Idleloop;
}
//-------------------------------------------------------------------------
void __fastcall TForm1::Timer1Timer(TObject *Sender)
{
    double dt;
    AnsiString txt;
    txt = "ref = 3.1415926535897932384626433832795\r\n";
    if (n) txt += AnsiString().sprintf("Pi  = %.20lf\r\n", 4.0 * double(i) / double(n));
    txt += AnsiString().sprintf("i/n = %i / %i\r\n", i, n);
    dt = tend();
    if (dt > 1e-100) txt += AnsiString().sprintf("IPS = %8.0lf\r\n", double(n - n0) * 1000.0 / dt);
    tbeg(); n0 = n;
    mm_log->Text = txt;
}
//---------------------------------------------------------------------------
Form h:
//$$---- Form HDR ----
//---------------------------------------------------------------------------
#ifndef Unit1H
#define Unit1H
//---------------------------------------------------------------------------
#include <Classes.hpp>
#include <Controls.hpp>
#include <StdCtrls.hpp>
#include <Forms.hpp>
#include <ExtCtrls.hpp>
//---------------------------------------------------------------------------
class TForm1 : public TForm
{
__published: // IDE-managed Components
    TMemo *mm_log;
    TTimer *Timer1;
    void __fastcall Timer1Timer(TObject *Sender);
private: // User declarations
public:  // User declarations
    __fastcall TForm1(TComponent* Owner);
    void __fastcall Idleloop(TObject *Sender, bool &Done);
};
//---------------------------------------------------------------------------
extern PACKAGE TForm1 *Form1;
//---------------------------------------------------------------------------
#endif
Form dfm:
object Form1: TForm1
Left = 0
Top = 0
Caption = 'Project Euler'
ClientHeight = 362
ClientWidth = 619
Color = clBtnFace
Font.Charset = OEM_CHARSET
Font.Color = clWindowText
Font.Height = 14
Font.Name = 'System'
Font.Pitch = fpFixed
Font.Style = [fsBold]
OldCreateOrder = False
PixelsPerInch = 96
TextHeight = 14
object mm_log: TMemo
Left = 0
Top = 0
Width = 619
Height = 362
Align = alClient
ScrollBars = ssBoth
TabOrder = 0
end
object Timer1: TTimer
Interval = 100
OnTimer = Timer1Timer
Left = 12
Top = 10
end
end
So you should add the save/load of state as described above ...
As mentioned, there are much, much better ways of obtaining Pi, like BBP.
Also, the code above uses my time measurement header, so here it is:
//---------------------------------------------------------------------------
//--- Performance counter time measurement: 2.01 ----------------------------
//---------------------------------------------------------------------------
#ifndef _performance_h
#define _performance_h
//---------------------------------------------------------------------------
const int performance_max=64;           // stack depth for nested tbeg()..tend() pairs
double performance_Tms=-1.0,            // counter period [ms]
       performance_tms=0.0,             // measured time after tend [ms]
       performance_t0[performance_max]; // measured start times [ms]
int performance_ix=-1;                  // index of the current time
//---------------------------------------------------------------------------
void tbeg(double *t0=NULL) // measure start time
{
double t;
LARGE_INTEGER i;
if (performance_Tms<=0.0)
{
for (int j=0;j<performance_max;j++) performance_t0[j]=0.0;
QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart);
}
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t0[0]=t; return; }
performance_ix++;
if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t;
}
//---------------------------------------------------------------------------
void tpause(double *t0=NULL) // stop counting time between tbeg()..tend() calls
{
double t;
LARGE_INTEGER i;
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t0[0]=t-t0[0]; return; }
if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t-performance_t0[performance_ix];
}
//---------------------------------------------------------------------------
void tresume(double *t0=NULL) // resume counting time between tbeg()..tend() calls
{
double t;
LARGE_INTEGER i;
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t0[0]=t-t0[0]; return; }
if ((performance_ix>=0)&&(performance_ix<performance_max)) performance_t0[performance_ix]=t-performance_t0[performance_ix];
}
//---------------------------------------------------------------------------
double tend(double *t0=NULL) // return duration [ms] between matching tbeg()..tend() calls
{
double t;
LARGE_INTEGER i;
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { t-=t0[0]; performance_tms=t; return t; }
if ((performance_ix>=0)&&(performance_ix<performance_max)) t-=performance_t0[performance_ix]; else t=0.0;
performance_ix--;
performance_tms=t;
return t;
}
//---------------------------------------------------------------------------
double tper(double *t0=NULL) // return duration [ms] between tper() calls
{
double t,tt;
LARGE_INTEGER i;
if (performance_Tms<=0.0)
{
for (int j=0;j<performance_max;j++) performance_t0[j]=0.0;
QueryPerformanceFrequency(&i); performance_Tms=1000.0/double(i.QuadPart);
}
QueryPerformanceCounter(&i); t=double(i.QuadPart); t*=performance_Tms;
if (t0) { tt=t-t0[0]; t0[0]=t; performance_tms=tt; return tt; }
performance_ix++;
if ((performance_ix>=0)&&(performance_ix<performance_max))
{
tt=t-performance_t0[performance_ix];
performance_t0[performance_ix]=t;
}
else { t=0.0; tt=0.0; };
performance_ix--;
performance_tms=tt;
return tt;
}
//---------------------------------------------------------------------------
AnsiString tstr()
{
AnsiString s;
s=s.sprintf("%8.3lf",performance_tms); while (s.Length()<8) s=" "+s; s="["+s+" ms]";
return s;
}
//---------------------------------------------------------------------------
AnsiString tstr(int N)
{
AnsiString s;
s=s.sprintf("%8.3lf",performance_tms/double(N)); while (s.Length()<8) s=" "+s; s="["+s+" ms]";
return s;
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
#endif
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
And here is the result after a few seconds: IPS is iterations per second, i and n are the global variables holding the actual state of the computation, Pi is the current approximation and ref is a reference value of Pi for comparison. As I compute in OnIdle and display in OnTimer, there is no need for any locks, as it's all a single thread. For more speed you could launch more computing threads; however, then you would need to add multithreading lock mechanisms and make the global state volatile.
In case you are doing a console app (no forms) you can still do this: just convert your code to an infinite loop, recompute and output the Pi value once every ??? iterations, and stop on some key hit like escape.
I am implementing PID control in C++ to make a differential drive robot turn an accurate number of degrees, but I am having many issues.
Exiting control loop early due to fast loop runtime
If the robot measures its error to be less than .5 degrees, it exits the control loop and considers the turn "finished" (the .5 is an arbitrary value that I might change at some point). It appears that the control loop is running so quickly that the robot can turn at a very high speed, pass the setpoint, and exit the loop/cut motor powers, because it was at the setpoint for a short instant. I know that this is the entire purpose of PID control, to accurately reach the setpoint without overshooting, but this problem is making it very difficult to tune the PID constants. For example, I try to find a value of kp such that there is steady oscillation, but there is never any oscillation because the robot thinks it has "finished" once it passes the setpoint. To fix this, I have implemented a system where the robot has to be at the setpoint for a certain period of time before exiting, and this has been effective, allowing oscillation to occur, but the issue of exiting the loop early seems like an unusual problem and my solution may be incorrect.
D term has no effect due to fast runtime
Once I had the robot oscillating in a controlled manner using only P, I tried to add D to prevent overshoot. However, this had no effect for the majority of the time, because the control loop runs so quickly that in 19 loops out of 20, the rate of change of error is 0: the robot did not move, or did not move enough to be measured, in that time. I printed the change in error and the derivative term each loop to confirm this, and I could see that both would be 0 for around 20 loop cycles before taking a reasonable value and then going back to 0 for another 20 cycles. Like I said, I think this is because the loop cycles are so quick that the robot literally hasn't moved enough for any noticeable change in error. This was a big problem because it meant that the D term had essentially no effect on robot movement, since it was almost always 0. To fix this problem, I tried using the last non-zero value of the derivative in place of any 0 values, but this didn't work well, and the robot would oscillate randomly if the last derivative didn't represent the current rate of change of error.
Note: I am also using a small feedforward for the static coefficient of friction, and I call this feedforward "f".
Should I add a delay?
I realized that the likely source of both of these issues is the loop running very, very quickly, so something I thought of was adding a wait statement at the end of the loop. However, intentionally slowing down a loop seems like a bad solution overall. Is this a good idea?
void turnHeading(double finalAngle, double kp, double ki, double kd, double f)
{
    std::clock_t timer;
    timer = std::clock();
    double pastTime = 0;
    double currentTime = ((std::clock() - timer) / (double)CLOCKS_PER_SEC);
    const double initialHeading = getHeading();
    finalAngle = angleWrapDeg(finalAngle);
    const double initialAngleDiff = initialHeading - finalAngle;
    double error = angleDiff(getHeading(), finalAngle);
    double pastError = error;
    double firstTimeAtSetpoint = 0;
    double timeAtSetPoint = 0;
    bool atSetpoint = false;
    double integral = 0;
    double derivative = 0;
    double lastNonZeroD = 0;
    while (timeAtSetPoint < .05)
    {
        updatePos(encoderL.read(), encoderR.read());
        error = angleDiff(getHeading(), finalAngle);
        currentTime = ((std::clock() - timer) / (double)CLOCKS_PER_SEC);
        double dt = currentTime - pastTime;
        double proportional = error / fabs(initialAngleDiff);
        integral += dt * ((error + pastError) / 2.0);
        double derivative = (error - pastError) / dt;
        // FAILED METHOD OF USING LAST NON-0 VALUE OF DERIVATIVE
        // if(epsilonEquals(derivative, 0))
        // {
        //     derivative = lastNonZeroD;
        // }
        // else
        // {
        //     lastNonZeroD = derivative;
        // }
        double power = kp * proportional + ki * integral + kd * derivative;
        if (power > 0)
        {
            setMotorPowers(-power - f, power + f);
        }
        else
        {
            setMotorPowers(-power + f, power - f);
        }
        if (fabs(error) < 2)
        {
            if (!atSetpoint)
            {
                atSetpoint = true;
                firstTimeAtSetpoint = currentTime;
            }
            else // at setpoint
            {
                timeAtSetPoint = currentTime - firstTimeAtSetpoint;
            }
        }
        else // no longer at setpoint
        {
            atSetpoint = false;
            timeAtSetPoint = 0;
        }
        pastTime = currentTime;
        pastError = error;
    }
    setMotorPowers(0, 0);
}
turnHeading(90, .37, 0, .00004, .12);
I am not sure if this is more of a math or a programming question; if it's math, please tell me.
I know there are a lot of ready-to-use, free FFT projects. But I am trying to understand the FFT method, just for fun and for study. So I made both algorithms, DFT and FFT, to compare them.
But I have a problem with my FFT. There doesn't seem to be a big difference in efficiency: my FFT is only a little bit faster than the DFT (in some cases it's two times faster, but that's the maximum speedup).
Most articles about the FFT mention bit reversal, but I don't see the reason to use bit reversing. Probably that's the issue; I don't understand it. Please help me: what am I doing wrong?
This is my code (you can copy it into an online compiler and see how it works):
#include <complex>
#include <iostream>
#include <math.h>
#include <cmath>
#include <vector>
#include <chrono>
#include <ctime>
float _Pi = 3.14159265;
float sampleRate = 44100;
float resolution = 4;
float _SRrange = sampleRate / resolution; // I divide the sample rate to make the loop smaller,
// just to perform tests faster
float bufferSize = 512;
// The Clock class measures the time taken to execute the whole loop:
class Clock
{
public:
    Clock() { start = std::chrono::high_resolution_clock::now(); }
    ~Clock() {}
    float secondsElapsed()
    {
        auto stop = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }
    void reset() { start = std::chrono::high_resolution_clock::now(); }
private:
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
};
// Function to calculate magnitude of complex number:
float _mag_Hf(std::complex<float> sf);
// Function to calculate exp(-j*2*PI*n*k / sampleRate) - where "j" is imaginary number:
std::complex<float> _Wnk_Nc(float n, float k);
// Function to calculate exp(-j*2*PI*k / sampleRate):
std::complex<float> _Wk_Nc(float k);
int main() {
    float scaleFFT = 512; // divide and conquer - if it's "1" then the whole algorithm is just a plain DFT
    // I wonder what the maximum of that value is. I always thought it should be equal to the
    // buffer size (number of samples), but above some value it starts to work slower than the DFT
    std::vector<float> inputSignal; // array of input signal
    inputSignal.resize(bufferSize); // how many samples we will use to calculate the Fourier transform
    std::vector<std::complex<float>> _Sf; // array to store the Fourier transform value for each measured frequency bin
    _Sf.resize(scaleFFT); // resize it to the size we need
    std::vector<std::complex<float>> _Hf_Db_vect; // array to store magnitude (in logarithmic dB scale)
    // for each measured frequency bin
    _Hf_Db_vect.resize(_SRrange); // resize it to be able to store a value for each measured frequency
    std::complex<float> _Sf_I_half; // complex to calculate the first half of the frequency range,
    // from 1 to Nyquist (sampleRate/2)
    std::complex<float> _Sf_II_half; // complex to calculate the second half of the frequency range,
    // from Nyquist to sampleRate
    for(int i=0; i<(int)_Sf.size(); i++)
        inputSignal[i] = cosf((float)i/_Pi); // fill the input signal with some data, no matter what

    Clock _time; // start measuring time
    for(int freqBinK=0; freqBinK < _SRrange/2; freqBinK++) // calculate all frequencies (divide by 2 for the two halves)
    {
        for(int i=0; i<(int)_Sf.size(); i++) _Sf[i] = 0.0f; // reset all values; for the next loop we need all values to be zero
        for (int n=0; n<bufferSize/_Sf.size(); ++n) // here I take all samples in the buffer
        {
            std::complex<float> _W = _Wnk_Nc(_Sf.size()*(float)n, freqBinK);
            for(int i=0; i<(int)_Sf.size(); i++) // finally, here is my divide and conquer
                _Sf[i] += inputSignal[_Sf.size()*n +i] * _W; // and I see no reason to use any bit reversal; where should it be?
        }
        std::complex<float> _Wk = _Wk_Nc(freqBinK);
        _Sf_I_half = 0.0f;
        _Sf_II_half = 0.0f;
        for(int z=0; z<(int)_Sf.size()/2; z++) // here I calculate the Fourier transform for each frequency
        {
            _Sf_I_half += _Wk_Nc(2.0f * (float)z * freqBinK) * (_Sf[2*z] + _Wk * _Sf[2*z+1]); // first half - to Nyquist
            _Sf_II_half += _Wk_Nc(2.0f * (float)z * freqBinK) * (_Sf[2*z] - _Wk * _Sf[2*z+1]); // second half - to sampleRate
            // again, I don't see where the bit reversal should be :)
        }
        // calculate magnitude in dB scale
        _Hf_Db_vect[freqBinK] = _mag_Hf(_Sf_I_half); // first half
        _Hf_Db_vect[freqBinK + _SRrange/2] = _mag_Hf(_Sf_II_half); // second half
    }
    std::cout << _time.secondsElapsed() << std::endl; // time measured after execution of the whole loop
}
float _mag_Hf(std::complex<float> sf)
{
    float _Re_2;
    float _Im_2;
    _Re_2 = sf.real() * sf.real();
    _Im_2 = sf.imag() * sf.imag();
    return 20*log10(pow(_Re_2 + _Im_2, 0.5f)); // transform magnitude to logarithmic dB scale
}
std::complex<float> _Wnk_Nc(float n, float k)
{
    std::complex<float> _Wnk_Ncomp;
    _Wnk_Ncomp.real(cosf(-2.0f * _Pi * (float)n * k / sampleRate));
    _Wnk_Ncomp.imag(sinf(-2.0f * _Pi * (float)n * k / sampleRate));
    return _Wnk_Ncomp;
}
std::complex<float> _Wk_Nc(float k)
{
    std::complex<float> _Wk_Ncomp;
    _Wk_Ncomp.real(cosf(-2.0f * _Pi * k / sampleRate));
    _Wk_Ncomp.imag(sinf(-2.0f * _Pi * k / sampleRate));
    return _Wk_Ncomp;
}
One huge mistake you are making is calculating the butterfly weights (which involve sin and cos) on the fly (in _Wnk_Nc()). sin and cos typically cost tens to hundreds of clock cycles, whereas the other butterfly operations are just mul and add, which only take a few cycles, hence the need to factor these out. All fast FFT implementations do this as part of an initialisation step (usually called "plan creation" or similar). See e.g. FFTW and KissFFT.
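For illustration, a sketch of what such an initialisation step might look like, assuming a table of N/2 twiddle factors computed once up front and then only indexed inside the transform:

#include <cmath>
#include <complex>
#include <vector>

// Precompute the twiddle factors W_N^k = exp(-2*pi*i*k/N), k = 0..N/2-1.
// Done once per FFT size; the transform itself then only multiplies and adds.
std::vector<std::complex<float>> make_twiddles(int N)
{
    const float pi = 3.14159265f;
    std::vector<std::complex<float>> w(N / 2);
    for (int k = 0; k < N / 2; k++)
        w[k] = std::complex<float>(cosf(-2.0f * pi * k / N),
                                   sinf(-2.0f * pi * k / N));
    return w;
}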
Apart from the above-mentioned precalculation of the butterfly weights, most FFT implementations also use SIMD instructions to vectorize the code.
// and I see no reason to use any bit reversal; where should it be?
The very first butterfly loop should be reverse-bit indexed. Those indexes are usually calculated inside the recursion, but for a loop-based solution calculating those indexes is also costly, so it's better to precompute them in the plan as well.
Combining those optimization approaches results in approximately a 100x speedup.
Most fast FFT implementations either use a lookup table of precomputed twiddle factors, or a simple recursion to rotate the twiddle factors on the fly, instead of calling trigonometric math library functions inside the FFT inner loop.
For large FFTs, using a trig recursion formula is less likely to thrash the data caches on contemporary processors.
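The recursion in question is just a repeated complex multiplication by a fixed rotation; a sketch (in float you may need to renormalize occasionally to limit drift):

#include <cmath>
#include <complex>

// Rotate w by the fixed step exp(-2*pi*i/N) each iteration instead of
// calling cos/sin for every k inside the inner loop.
void twiddle_recursion_demo(int N)
{
    const double pi = 3.14159265358979323846;
    std::complex<double> step(std::cos(-2.0 * pi / N), std::sin(-2.0 * pi / N));
    std::complex<double> w(1.0, 0.0); // w == exp(-2*pi*i*k/N) for k = 0
    for (int k = 0; k < N; k++) {
        // ... use w as the k-th twiddle factor here ...
        w *= step; // advance to k+1 with one complex multiply
    }
}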
I started a similar question in another thread, but there I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I want.
I have two matrices. Matrix a is 2782x128 and matrix b is 4000x128, both with unsigned char values, each stored in a single array. For each vector in a, I need the index of the vector in b with the closest Euclidean distance.
OK, now my code to achieve this:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <cstdio>
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <climits> // for INT_MAX
#include <iostream>
#include <fstream>
#include "main.h"

using namespace std;

int main(int argc, char* argv[]) // main must return int, not void
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size, "matrixa");
    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size, "matrixb");

    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );

    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);

    QueryPerformanceCounter( &liEnd );
    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl;

    if (a)
        delete []a;
    if (b)
        delete []b;
    if (indexes)
        delete []indexes;
    return 0;
}
void read_matrix(unsigned char** matrix, int& matrix_size, char* matrixPath)
{
    FILE * pFile;
    pFile = fopen (matrixPath, "r");
    fscanf (pFile, "%d", &matrix_size);
    *matrix = new unsigned char[matrix_size*128];
    for (int i=0; i<matrix_size*128; ++i)
    {
        unsigned int matPtr;
        fscanf (pFile, "%u", &matPtr);
        (*matrix)[i] = (unsigned char)matPtr; // index into *matrix, not the matrix pointer itself
    }
    fclose (pFile);
}
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX; // min_distance is an int, so use INT_MAX
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply*multiply;
                // if the distance already exceeds the best so far, exit early
                if (distance > min_distance)
                    break;
            }
            // if the distance is smaller, remember it
            if (distance < min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
And attached are the files with sample matrices.
matrixa
matrixb
I am using windows.h just to measure the elapsed time, so if you want to test the code on a platform other than Windows, just change the windows.h header and change the way of measuring the elapsed time.
This code takes about 0.5 seconds on my computer. The problem is that I have other code, in Matlab, that does this same thing in 0.05 seconds. In my experiments, I receive several matrices like matrix a every second, so 0.5 seconds is too much.
Now the matlab code to calculate this:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);
OK. The Matlab code is using the identity (x-a)^2 = x^2 + a^2 - 2*a*x.
So my next attempt was to do the same thing. I deleted my own code and made the same calculations, but it took approximately 1.2 seconds.
Then, I tried to use different external libraries. The first attempt was Eigen:
const int descrSize = 128;
MatrixXi a(a_size, descrSize);
MatrixXi b(b_size, descrSize);
MatrixXi ab(a_size, b_size);
unsigned char* dataPtr = matrixa;
for (int i=0; i<nframes; ++i)
{
for (int j=0; j<descrSize; ++j)
{
a(i,j)=(int)*dataPtr++;
}
}
unsigned char* vocPtr = matrixb;
for (int i=0; i<vocabulary_size; ++i)
{
for (int j=0; j<descrSize; ++j)
{
b(i,j)=(int)*vocPtr ++;
}
}
ab = a*b.transpose();
a.cwiseProduct(a);
b.cwiseProduct(b);
MatrixXi aa = a.rowwise().sum();
MatrixXi bb = b.rowwise().sum();
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs2();
int* index = NULL;
index = (int*)malloc(nframes*sizeof(int));
for (int i=0; i<nframes; ++i)
{
d.row(i).minCoeff(&index[i]);
}
This Eigen code costs approximately 1.2 seconds just for the line that says ab = a*b.transpose();
A similar code using OpenCV was also tried, and there the cost of ab = a*b.transpose(); was 0.65 seconds.
So, it is really annoying that Matlab is able to do this same thing so quickly and I am not able to in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what is really annoying me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome. I mean, any external library (free if possible), loop unrolling, template tricks, SSE instructions (I know they exist), cache tricks. As I said, my main purpose is to increase my knowledge to be able to code things like this with faster performance.
Thanks in advance
EDIT: more code, as suggested by David Hammen. I cast the arrays to int before making any calculations. Here is the code:
void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;
    int* a_int;
    int* b_int;
    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );
    a_int = (int*)malloc(a_size*descrSize*sizeof(int));
    b_int = (int*)malloc(b_size*descrSize*sizeof(int));
    for(int i=0; i<descrSize*a_size; ++i)
        a_int[i] = (int)a[i];
    for(int i=0; i<descrSize*b_size; ++i)
        b_int[i] = (int)b[i];
    QueryPerformanceCounter( &liEnd );
    cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / long double(liPerfFreq.QuadPart) << "s." << endl;

    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex = 0;
    int vocIndex = 0;
    int min_distance;
    int distance;
    int multiply;
    int* dataPtr;
    int* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX; // min_distance is an int, so use INT_MAX
        for (int j=0; j<b_size; ++j)
        {
            distance = 0;
            dataPtr = &a_int[dataIndex];
            vocPtr = &b_int[vocIndex];
            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++ - *vocPtr++;
                distance += multiply*multiply;
                // if the distance already exceeds the best so far, exit early
                if (distance > min_distance)
                    break;
            }
            // if the distance is smaller, remember it
            if (distance < min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex += descrSize;
        }
        dataIndex += descrSize;
        vocIndex = 0;
    }
}
The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?
EDIT2: Anything about Eigen? When I look for external libs they always talk about Eigen and its speed. Did I do something wrong? Here is a simple code using Eigen that shows it is not so fast. Maybe I am missing some config or some flag, or ...
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X;
X = A * B;
This code takes about 0.9 seconds.
As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. This represents a rate of 57 GFLOPS, showing that it is using not only vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5-year-old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best of Eigen:
compile in 64-bit mode
use floats (doubles are twice as slow owing to vectorization)
enable OpenMP
if your CPU has hyper-threading, then either disable it or define the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important; otherwise the performance will be very bad!)
if you have other tasks running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
use the most recent compiler that you can; GCC, clang and ICC are best, MSVC is usually slower.
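For example, a typical GCC invocation might look like this (the flags are the usual ones, assumed rather than taken from the question; -DNDEBUG additionally disables Eigen's runtime assertions):

g++ -O3 -march=native -fopenmp -DNDEBUG main.cpp -o main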
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.
Another problem might be your use of int versus long. I don't work on Windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is now 64 bits. 32 bits is more than enough, because 255*255*128 < 256*256*128 = 2^23.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so removes a branch from an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
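For illustration, the inner loop with the test removed might look like this (a sketch; the function name is mine): always doing all 128 iterations leaves a straight-line loop the compiler can unroll and vectorize:

// Branch-free squared Euclidean distance over 128 unsigned chars.
static int dist2_128(const unsigned char *p, const unsigned char *q)
{
    int distance = 0;
    for (int k = 0; k < 128; ++k) {
        int diff = (int)p[k] - (int)q[k];
        distance += diff * diff;
    }
    return distance;
}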
Getting back to the cache problem: you need to get rid of this branch so that you can split the operations over the a and b matrices into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into a modern Intel chip's 32KB L1 data cache. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break; back in, but do so at the chunk level rather than at the byte-by-byte level.
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, you are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior), and pass your compiler whatever options it has for enabling auto-vectorization.
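In this particular case both matrices already store one 128-element descriptor per row, so the dot-product form of the loop walks both operands sequentially in memory; a sketch (names mine):

// ab[i*b_rows + j] = dot(a_i, b_j). Both a and b are row-major with 128
// values per row, so both inner accesses are contiguous in memory.
void inner_products(const int *a, int a_rows,
                    const int *b, int b_rows, int *ab)
{
    for (int i = 0; i < a_rows; ++i)
        for (int j = 0; j < b_rows; ++j) {
            int sum = 0;
            for (int k = 0; k < 128; ++k)
                sum += a[i * 128 + k] * b[j * 128 + k];
            ab[i * b_rows + j] = sum;
        }
}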
I wrote this simple code which reads a length from the Sharp infrared sensor and presents the average distance in cm over serial.
When I run this code on the Arduino Mega board, the Arduino starts blinking the LED (pin 13) and the program does nothing. Where is the bug in this code?
#include <QueueList.h>

const int ANALOG_SHARP = 0; // pin for data from the Sharp sensor
QueueList <float> queuea;
float cm;
float qu1;
float qu2;
float qu3;
float qu4;
float qu5;

void setup() {
    Serial.begin(9600);
}

void loop() {
    cm = read_gp2d12_range(ANALOG_SHARP); // convert to cm
    queuea.push(cm); // add item to queue; adding only this line makes the Arduino crash
    if ( 5 <= queuea.peek()) {
        Serial.println(average());
    }
}
float read_gp2d12_range(byte pin) { // function converting the reading to cm
    int tmp;
    tmp = analogRead(pin);
    if (tmp < 3)
        return -1; // invalid value
    return (6787.0 / ((float)tmp - 3.0)) - 4.0;
}

float average() { // calculate average length
    qu1 += queuea.pop();
    qu2 += queuea.pop();
    qu3 += queuea.pop();
    qu4 += queuea.pop();
    qu5 += queuea.pop();
    float aver = ((qu1+qu2+qu3+qu4+qu5)/5);
    return aver;
}
I agree with the peek() -> count() error noted by vhallac. But I'll also point out that you should consider averaging over powers of 2 unless there is a strong case to do otherwise.
The reason is that on microcontrollers, division is slow. By averaging over a power of 2 (2, 4, 8, 16, etc.) you can simply calculate the sum and then bitshift it.
To calculate the average of 2: (v1 + v2) >> 1
To calculate the average of 4: (v1 + v2 + v3 + v4) >> 2
To calculate the average of n values (where n is a power of 2), just right-shift the sum by log2(n).
As long as the datatype for your sum variable is big enough and won't overflow, this is much easier and much faster.
Note: this won't work for floats in general. In fact, microcontrollers aren't optimized for floats. You should consider converting from int (which is what I'm assuming your ADC is reading) to float at the end, after the averaging, rather than before.
By converting from int to float and then averaging floats, you lose more precision than by averaging ints and then converting the result to a float.
Other:
You're using the += operator without initializing the variables (qu1, qu2, etc.). It's good practice to initialize them if you're going to use +=, but it looks as if plain = would work fine here.
For floats, I'd have written the average function as:
float average(QueueList<float> & q, int n)
{
float sum = 0;
for(int i=0; i<n; i++)
{
sum += q.pop();
}
return (sum / (float) n);
}
And called it: average(queuea, 5);
You could use this to average any number of sensor readings and later use the same code to later average floats in a completely different QueueList. Passing the number of readings to average as a parameter will really come in handy in the case that you need to tweak it.
TL;DR:
Here's how I would have done it:
#include <QueueList.h>

const int ANALOG_SHARP = 0;  // set pin data from sharp
const int AvgPower = 2;      // 1 for 2 readings, 2 for 4 readings, 3 for 8, etc.
const int AvgCount = 1 << AvgPower; // integer power of two (avoids float pow())

QueueList <int> SensorReadings;

void setup(){
    Serial.begin(9600);
}

void loop()
{
    int reading = analogRead(ANALOG_SHARP);
    SensorReadings.push(reading);
    if(SensorReadings.count() > AvgCount)
    {
        int avg = average2(SensorReadings, AvgPower);
        Serial.println(gp2d12_to_cm(avg));
    }
}

float gp2d12_to_cm(int reading)
{
    if(reading <= 3){ return -1; }
    return((6787.0 / ((float)reading - 3.0)) - 4.0);
}

int average2(QueueList<int> & q, int AvgPower)
{
    int AvgCount = 1 << AvgPower;
    long sum = 0;
    for(int i=0; i<AvgCount; i++)
    {
        sum += q.pop();
    }
    return (sum >> AvgPower);
}
You are using queuea.peek() to obtain the count. This will only return the last element in queue. You should use queuea.count() instead.
Also you might consider changing the condition tmp < 3 to tmp <= 3. If tmp is 3, you divide by zero.
Great improvement, jedwards. However, the first question I have is why use a QueueList instead of an int array.
As an example, I would do the following:
int average(int analog_reading)
{
#define NUM_OF_AVG 5
    static int readings[NUM_OF_AVG];
    static int next_position;
    if (++next_position >= NUM_OF_AVG)
    {
        next_position = 0;
    }
    readings[next_position] = analog_reading;
    int sum = 0; // recompute the sum from the ring buffer each call
    for (int i = 0; i < NUM_OF_AVG; i++)
    {
        sum += readings[i];
    }
    return sum / NUM_OF_AVG;
}
Now I compute a new rolling average with every reading, and it eliminates all the issues related to dynamic memory allocation (memory fragmentation, no available memory, memory leaks) on an embedded device.
I appreciate and understand the use of shifting for a division by 2, 4 or 8; however, I would stay away from that technique for two reasons.
First, I think readability and maintainability of the source code are more important than saving a little bit of time with a shift instead of a divide, unless you can test and verify that the divide is a bottleneck.
Second, I believe most current optimizing compilers will emit a shift when possible; I know GCC does.
I will leave refactoring out the for loop for the next guy.
I want to measure the FPS count and set its limit to 60; however, I've been looking through some code via Google and I completely don't get it.
If you want 60 FPS, you need to figure out how much time you have on each frame. In this case, 16.67 milliseconds. So you want a loop that completes every 16.67 milliseconds.
Usually it goes (simply put): Get input, do physics stuff, render, pause until 16.67ms has passed.
It's usually done by saving the time at the top of the loop and then calculating the difference at the end, and sleeping or looping doing nothing for the remaining duration.
This article describes a few different ways of doing game loops, including the one you want, although I'd use one of the more advanced alternatives from the article.
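A minimal sketch of that pattern with std::chrono (C++11; the 16667 microseconds approximate one 60 FPS frame):

#include <chrono>
#include <thread>

void game_loop()
{
    using clock = std::chrono::steady_clock;
    const auto frame_duration = std::chrono::microseconds(16667); // ~1/60 s
    auto next_frame = clock::now();
    while (true) {
        // get input, do physics stuff, render ...
        next_frame += frame_duration;
        std::this_thread::sleep_until(next_frame); // pause out the remainder
    }
}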
delta time is the final time minus the original time:
dt = t - t0
This delta time, though, is simply the amount of time that passes while the velocity is changing.
The derivative of a function represents an infinitesimal change in the function with respect to one of its variables.
The derivative of a function with respect to the variable is defined as
f(x + h) - f(x)
f'(x) = lim -----------------
h->0 h
http://mathworld.wolfram.com/Derivative.html
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <windows.h>
#pragma comment(lib, "winmm.lib")

void gotoxy(int x, int y);
void StepSimulation(float dt);

int main(){
    int NewTime = 0;
    int OldTime = 0;
    float dt = 0;
    float TotalTime = 0;
    int FrameCounter = 0;
    int RENDER_FRAME_COUNT = 60;

    while(true){
        NewTime = timeGetTime();
        dt = (float)(NewTime - OldTime) / 1000; // delta time in seconds
        OldTime = NewTime;
        if (dt > (0.016f)) dt = (0.016f); // clamp the delta time
        if (dt < 0.001f) dt = 0.001f;
        TotalTime += dt;
        if(TotalTime > 1.1f){
            TotalTime = 0;
            StepSimulation(dt);
        }
        if(FrameCounter >= RENDER_FRAME_COUNT){
            // draw stuff
            //Render();
            gotoxy(1,2);
            printf(" \n");
            printf("OldTime      = %d \n", OldTime);
            printf("NewTime      = %d \n", NewTime);
            printf("dt           = %f \n", dt);
            printf("TotalTime    = %f \n", TotalTime);
            printf("FrameCounter = %d fps\n", FrameCounter);
            printf(" \n");
            FrameCounter = 0;
        }
        else{
            gotoxy(22,7);
            printf("%d ", FrameCounter);
            FrameCounter++;
        }
    }
    return 0;
}

void gotoxy(int x, int y){
    COORD coord;
    coord.X = x; coord.Y = y;
    SetConsoleCursorPosition(GetStdHandle(STD_OUTPUT_HANDLE), coord);
}

void StepSimulation(float dt){
    // calculate stuff
    //vVelocity += Ae * dt;
}
You shouldn't try to limit the fps. The only reason to do so is if you are not using delta time and you expect each frame to be the same length. Even the simplest game cannot guarantee that.
You can however take your delta time and slice it into fixed sizes and then hold onto the remainder.
Here's some code I wrote recently. It's not thoroughly tested.
void GameLoop::Run()
{
    m_Timer.Reset();
    while(!m_Finished())
    {
        Time delta = m_Timer.GetDelta();
        Time frameTime(0);
        unsigned int loopCount = 0;
        while (delta > m_TickTime && loopCount < m_MaxLoops)
        {
            m_SingTick();
            delta -= m_TickTime;
            frameTime += m_TickTime;
            ++loopCount;
        }
        m_Independent(frameTime);

        // add an exception flag later.
        // This is if the game hangs
        if(loopCount >= m_MaxLoops)
        {
            delta %= m_TickTime;
        }
        m_Render(delta);
        m_Timer.Unused(delta);
    }
}
The member objects are Boost slots, so different code can register with different timing methods. The Independent slot is for things like key mapping or changing music, things that don't need to be so precise. SingTick is good for physics, where it is easier if you know every tick will be the same, but you don't want to run through a wall. Render takes the delta so animations run smooth, but it must remember to account for it on the next SingTick.
Hope that helps.
There are many good reasons why you should not limit your frame rate in such a way. One reason being as stijn pointed out, not every monitor may run at exactly 60fps, another reason being that the resolution of timers is not sufficient, yet another reason being that even given sufficient resolutions, two separate timers (monitor refresh and yours) running in parallel will always get out of sync with time (they must!) due to random inaccuracies, and the most important reason being that it is not necessary at all.
Note that the default timer resolution under Windows is 15ms, and the best possible resolution you can get (by using timeBeginPeriod) is 1ms. Thus, you can (at best) wait 16ms or 17ms. One frame at 60fps is 16.6666ms. How do you wait 16.6666ms?
If you want to limit your game's speed to the monitor's refresh rate, enable vertical sync. This will do what you want, precisely, and without sync issues. Vertical sync does have its pecularities too (such as the funny surprise you get when a frame takes 16.67ms), but it is by far the best available solution.
If the reason why you wanted to do this was to fit your simulation into the render loop, this is a must read for you.
check this one out:
// Creating a digital watch in C++
#include <iostream>
#include <Windows.h>
using namespace std;

struct time{
    int hr, min, sec;
};

int main()
{
    time a;
    a.hr = 0;
    a.min = 0;
    a.sec = 0;
    for(int i = 0; i < 24; i++)
    {
        if(a.hr == 24) // wrap at the end of the day (was 23, which skipped an hour)
        {
            a.hr = 0;
        }
        for(int j = 0; j < 60; j++)
        {
            if(a.min == 60) // wrap minutes (was 59)
            {
                a.min = 0;
            }
            for(int k = 0; k < 60; k++)
            {
                if(a.sec == 60) // wrap seconds (was 59)
                {
                    a.sec = 0;
                }
                cout << a.hr << " : " << a.min << " : " << a.sec << endl;
                a.sec++;
                Sleep(1000);
                system("cls");
            }
            a.min++;
        }
        a.hr++;
    }
}