I am implementing PID control in C++ to make a differential drive robot turn an accurate number of degrees, but I am having many issues.
Exiting control loop early due to fast loop runtime
If the robot measures its error to be less than .5 degrees, it exits the control loop and considers the turn "finished" (the .5 is an arbitrary value that I might change at some point). It appears that the control loop is running so quickly that the robot can turn at a very high speed, turn past the setpoint, and exit the loop/cut motor powers, because it was at the setpoint for a short instant. I know that this is the entire purpose of PID control, to accurately reach the setpoint without overshooting, but this problem is making it very difficult to tune the PID constants. For example, I try to find a value of kp such that there is steady oscillation, but there is never any oscillation because the robot thinks it has "finished" once it passes the setpoint. To fix this, I have implemented a system where the robot has to be at the setpoint for a certain period of time before exiting. This has been effective and allows oscillation to occur, but exiting the loop early seems like an unusual problem and my solution may be incorrect.
D term has no effect due to fast runtime
Once I had the robot oscillating in a controlled manner using only P, I tried to add D to prevent overshoot. However, this was having no effect for the majority of the time, because the control loop is running so quickly that 19 loops out of 20, the rate of change of error is 0: the robot did not move or did not move enough for it to be measured in that time. I printed the change in error and the derivative term each loop to confirm this and I could see that these would both be 0 for around 20 loop cycles before taking a reasonable value and then back to 0 for another 20 cycles. Like I said, I think that this is because the loop cycles are so quick that the robot literally hasn't moved enough for any sort of noticeable change in error. This was a big problem because it meant that the D term had essentially no effect on robot movement because it was almost always 0. To fix this problem, I tried using the last non-zero value of the derivative in place of any 0 values, but this didn't work well, and the robot would oscillate randomly if the last derivative didn't represent the current rate of change of error.
Note: I am also using a small feedforward for the static coefficient of friction, and I call this feedforward "f"
Should I add a delay?
I think the source of both of these issues is the loop running very quickly, so one thing I thought of was adding a wait statement at the end of the loop. However, intentionally slowing down a loop seems like a bad solution overall. Is this a good idea?
void turnHeading(double finalAngle, double kp, double ki, double kd, double f){
    std::clock_t timer;
    timer = std::clock();
    double pastTime = 0;
    double currentTime = ((std::clock() - timer) / (double)CLOCKS_PER_SEC);
    const double initialHeading = getHeading();
    finalAngle = angleWrapDeg(finalAngle);
    const double initialAngleDiff = initialHeading - finalAngle;
    double error = angleDiff(getHeading(), finalAngle);
    double pastError = error;
    double firstTimeAtSetpoint = 0;
    double timeAtSetPoint = 0;
    bool atSetpoint = false;
    double integral = 0;
    double derivative = 0;
    double lastNonZeroD = 0;
    // Keep running until the error has stayed inside the tolerance for 0.05 s
    while (timeAtSetPoint < .05)
    {
        updatePos(encoderL.read(), encoderR.read());
        error = angleDiff(getHeading(), finalAngle);
        currentTime = ((std::clock() - timer) / (double)CLOCKS_PER_SEC);
        double dt = currentTime - pastTime;
        // P is normalised by the size of the requested turn
        double proportional = error / fabs(initialAngleDiff);
        // Trapezoidal integration of the error
        integral += dt * ((error + pastError) / 2.0);
        double derivative = (error - pastError) / dt;
        //FAILED METHOD OF USING LAST NON-0 VALUE OF DERIVATIVE
        // if(epsilonEquals(derivative, 0))
        // {
        //     derivative = lastNonZeroD;
        // }
        // else
        // {
        //     lastNonZeroD = derivative;
        // }
        double power = kp * proportional + ki * integral + kd * derivative;
        // Add the static-friction feedforward f in the direction of travel
        if (power > 0)
        {
            setMotorPowers(-power - f, power + f);
        }
        else
        {
            setMotorPowers(-power + f, power - f);
        }
        if (fabs(error) < 2)
        {
            if (!atSetpoint)
            {
                atSetpoint = true;
                firstTimeAtSetpoint = currentTime;
            }
            else //at setpoint
            {
                timeAtSetPoint = currentTime - firstTimeAtSetpoint;
            }
        }
        else //no longer at setpoint
        {
            atSetpoint = false;
            timeAtSetPoint = 0;
        }
        pastTime = currentTime;
        pastError = error;
    }
    setMotorPowers(0, 0);
}
turnHeading(90, .37, 0, .00004, .12);
I have a rendering function that runs hundreds of times per second, and it tells me how many milliseconds each frame takes to draw.
I made a function to calculate the current render speed average of all the frames, which uses an std::vector to hold all the previous frames.
However, every time I run my program the vector that stores the averages becomes huge and takes up an increasing amount of memory, along with slowing down my program by almost 10 times (draw speed).
Averaging function (please note I am a C++ beginner):
double average(std::vector<double> input_vector)
{
    double total = 0;
    for(unsigned int i = 0; i < input_vector.size(); i++)
    {
        total += input_vector.at(i);
    }
    return (total / (double)input_vector.size());
}
Can someone help me fix this?
Thank you
Given the definition of arithmetic mean is sum( n ) / count( n ) you don't need to store every value of n in order to recompute the running mean, you only need the current sum and the current count, like so:
double runningMean(double newValue) {
    static double sum = 0;
    static double count = 0;
    count++;
    sum += newValue;
    return sum / count;
}
No vector needed at all.
I am running this code (full code here: http://codepad.org/5OJBLqIA) to time repeated daxpy function calls with and without flushing the operands from cache beforehand:
#define KB 1024
int main()
{
    int cache_size = 32*KB;
    double alpha = 42.5;
    int operand_size = cache_size/(sizeof(double)*2);
    double* X = new double[operand_size];
    double* Y = new double[operand_size];
    //95% confidence interval
    double max_risk = 0.05;
    //Interval half width
    double w;
    int n_iterations = 50000;
    students_t dist(n_iterations-1);
    double T = boost::math::quantile(complement(dist,max_risk/2));
    accumulator_set<double, stats<tag::mean,tag::variance> > unflushed_acc;
    for(int i = 0; i < n_iterations; ++i)
    {
        fill(X,operand_size);
        fill(Y,operand_size);
        double seconds = wall_time();
        daxpy(alpha,X,Y,operand_size);
        seconds = wall_time() - seconds;
        unflushed_acc(seconds);
    }
    w = T*sqrt(variance(unflushed_acc))/sqrt(count(unflushed_acc));
    printf("Without flush: time=%g +/- %g ns\n",mean(unflushed_acc)*1e9,w*1e9);
    //Using clflush instruction
    //We need to put the operands back in cache
    accumulator_set<double, stats<tag::mean,tag::variance> > clflush_acc;
    for(int i = 0; i < n_iterations; ++i)
    {
        fill(X,operand_size);
        fill(Y,operand_size);
        flush_array(X,operand_size);
        flush_array(Y,operand_size);
        double seconds = wall_time();
        daxpy(alpha,X,Y,operand_size);
        seconds = wall_time() - seconds;
        clflush_acc(seconds);
    }
    w = T*sqrt(variance(clflush_acc))/sqrt(count(clflush_acc));
    printf("With clflush: time=%g +/- %g ns\n",mean(clflush_acc)*1e9,w*1e9);
    return 0;
}
This code measures the rate and the uncertainty averaged over the given number of iterations. Averaging over lots of iterations successfully minimizes the variance caused by contention for memory access from various cores (discussed in my previous question here), but the average value thus obtained varies by a huge amount between separate invocations of the same executable:
$ ./variance
Without flush: time=3107.76 +/- 0.268198 ns
With clflush: time=5862.33 +/- 9.84313 ns
$ ./variance
Without flush: time=3105.71 +/- 0.237823 ns
With clflush: time=7802.66 +/- 12.3163 ns
These were run immediately after one another. Why do the timings for the flushed case (but not the unflushed case) vary so much between processes, but so little within a given process?
Appendix
Code is run on Mac OS X 10.8 on an Intel Xeon 5650.
How could I estimate the instantaneous throughput? For example, in a way similar to what the browser does when downloading a file. It's not just a mean throughput, but rather an instantaneous estimate, maybe with a 'moving average'. I'm looking for the algorithm, but you can specify it in C++. Ideally, it would not involve a thread (i.e., being continuously refreshed, say every second) but rather be evaluated only when the value is asked for.
You can use an exponential moving average, as explained here, but I'll repeat the formula:
accumulator = (alpha * new_value) + (1.0 - alpha) * accumulator
To achieve an estimation, suppose you intend to query the computation every second, but you want an average over the last minute. Then, here would be one way to get that estimate:
struct AvgBps {
    double rate_;            // The average rate
    double last_;            // Accumulates bytes added until average is computed
    time_t prev_;            // Time of previous update
    AvgBps () : rate_(0), last_(0), prev_(time(0)) {}
    void add (unsigned bytes) {
        time_t now = time(0);
        if (now - prev_ < 60) {       // The update is within the last minute
            last_ += bytes;           // Accumulate bytes into last
            if (now > prev_) {        // More than a second elapsed from previous
                // exponential moving average
                // the more time that has elapsed between updates, the more
                // weight is assigned for the accumulated bytes
                double alpha = (now - prev_)/60.0;
                // alpha weights the new value, as in the formula above
                rate_ = alpha * last_ + (1 - alpha) * rate_;
                last_ = 0;            // Reset last_ (it has been averaged in)
                prev_ = now;          // Update prev_ to current time
            }
        } else {                      // The update is longer than a minute ago
            rate_ = bytes;            // Current update is average rate
            last_ = 0;                // Reset last_
            prev_ = now;              // Update prev_
        }
    }
    double rate () {
        add(0);                       // Compute rate by doing an update of 0 bytes
        return rate_;                 // Return computed rate
    }
};
You should actually use a monotonic clock instead of time(), because the wall-clock time it returns can jump when the system clock is adjusted.
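A minimal sketch of reading a monotonic clock, assuming C++11's <chrono> is available. secondsSince is a hypothetical helper name and is not part of the struct above; the idea would be to store a steady_clock time point in place of the time_t.

```cpp
#include <chrono>

// Seconds elapsed since `start`, measured on a clock that never goes backwards.
// Hypothetical helper for illustration only.
inline double secondsSince(std::chrono::steady_clock::time_point start) {
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

A side effect is sub-second resolution, which time() does not offer.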
You probably want a boxcar average.
Just keep the last n values, and average them. For each subsequent block, subtract out the oldest and add in the most recent. Note that for floating point values, you may get some aggregated error, in which case you might want to recalculate the total from scratch every m values. For integer values of course, you don't need something like that.
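A minimal sketch of such a boxcar average, assuming a window of at least one sample. BoxcarAverage, add and resync are hypothetical names; resync is the "recalculate from scratch" step and would be called every m values if drift matters.

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Hypothetical boxcar (sliding window) average over the last `window` samples.
// Assumes window >= 1.
class BoxcarAverage {
public:
    explicit BoxcarAverage(std::size_t window) : window_(window) {}

    // Adds a sample, evicts the oldest one once the window is full,
    // and returns the mean of the samples currently held.
    double add(double value) {
        samples_.push_back(value);
        sum_ += value;
        if (samples_.size() > window_) {
            sum_ -= samples_.front();
            samples_.pop_front();
        }
        return sum_ / static_cast<double>(samples_.size());
    }

    // Rebuilds the running sum from the stored samples to discard
    // accumulated floating-point error.
    void resync() {
        sum_ = std::accumulate(samples_.begin(), samples_.end(), 0.0);
    }

private:
    std::size_t window_;
    std::deque<double> samples_;
    double sum_ = 0.0;
};
```

Memory stays bounded at n samples no matter how long the program runs.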
So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. reading all the pixel values of the geotiff one pixel at a time takes around 30 seconds (which involves a lot of complex math to properly georeference the pixel values to a corresponding point), while calculating the statistics of the entire matrix takes around 6 minutes.
void CalculateStats()
{
    //OHGOD
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    size_t cnt = 0;
    for(size_t row = 0; row < vals.size(); row++)
    {
        for(size_t col = 0; col < vals.at(row).size(); col++)
        {
            double mean_prev = new_mean;
            T value = get(row, col);
            new_mean += (value - new_mean) / (cnt + 1);
            new_standard_dev += (value - new_mean) * (value - mean_prev);
            // find new max/min's
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
            cnt++;
        }
    }
    stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
    std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
    //OHGOD
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    size_t cnt = 0;
    int n = 0;
    double delta = 0.0;
    double mean2 = 0.0;
    std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
    std::vector<double>::const_iterator ignore_end = ignore_values.end();
    for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
    {
        for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
        {
            // This method of calculation is based on Knuth's algorithm.
            T value = *col;
            if(std::find(ignore_begin, ignore_end, value) != ignore_end)
                continue;
            n++;
            delta = value - new_mean;
            new_mean = new_mean + (delta / n);
            mean2 = mean2 + (delta * (value - new_mean));
            // Find new max/min's.
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
        }
    }
    // mean2 / (n - 1) is the sample variance; its square root is the standard deviation.
    stats_standard_dev = sqrt(mean2 / (n - 1));
    stats_min = new_min;
    stats_max = new_max;
    stats_mean = new_mean;
}
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!
Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
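A minimal sketch of such a timing statement, assuming a C++11 compiler (older toolchains would need a platform timer instead). elapsedMs is a hypothetical helper name.

```cpp
#include <chrono>

// Runs `work` once and returns how long it took, in milliseconds.
// Hypothetical helper; requires C++11 for <chrono>.
template <typename Work>
double elapsedMs(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the whole of CalculateStats() first, and then only the inner loop body, should show fairly quickly where the time goes.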
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results to size() but I don't know if that's the issue.
I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);
You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.
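A minimal sketch of that two-pass approach, assuming int values and a non-empty matrix. MatrixStats and twoPassStats are hypothetical names, and the standard deviation here is the population version (dividing by n).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical result type for illustration.
struct MatrixStats {
    double mean;
    double stddev;  // population standard deviation (divides by n)
    int min;
    int max;
};

// Two passes over a vector-of-vectors matrix; assumes at least one value.
inline MatrixStats twoPassStats(const std::vector<std::vector<int> >& vals) {
    // Pass 1: sum, count, min and max, with integer arithmetic only.
    long long sum = 0;
    std::size_t count = 0;
    int lo = std::numeric_limits<int>::max();
    int hi = std::numeric_limits<int>::min();
    for (const std::vector<int>& row : vals) {
        for (int v : row) {
            sum += v;
            lo = std::min(lo, v);
            hi = std::max(hi, v);
            ++count;
        }
    }
    const double mean = static_cast<double>(sum) / count;

    // Pass 2: sum of squared deviations from the now-known mean.
    double sqDiff = 0.0;
    for (const std::vector<int>& row : vals) {
        for (int v : row) {
            const double d = v - mean;
            sqDiff += d * d;
        }
    }
    return MatrixStats{mean, std::sqrt(sqDiff / count), lo, hi};
}
```

The only division left is the single one between the two loops.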
First thing I spotted is that you evaluate vals.at(row).size() in the loop, which, obviously, isn't supposed to improve performance. It also applies to vals.size(), but of course inner loop is worse. If vals is a vector of vector, you better use iterators or at least keep reference for the outer vector (because get() with indices parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
    for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
        T value = *ii;
        // the rest
    }
}
First, change your row++ to ++row. A minor thing, but you want speed, so that will help.
Second, make your row < vals.size() into some const comparison instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size().
What is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).
It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case. Normally it's less than a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch prediction. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't get all the information you want when you try to step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running 230 times slower than that. Are you on a 10 MHz processor?
There are other suggestions for making it faster ( such as moving it to use SSE, if all the values are representable using 8bit types ), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or are running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )
There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard deviation) the only thing required is to compute the sum of value and the sum of squared value. From these two sums the mean and standard deviation can be computed after the outer loop (together with a third value, the number of samples, n in your new/updated code). The equations can be derived from the definitions or found on the web, e.g. Wikipedia. For instance the mean is just sum of value divided by n. For the n version (in contrast to the n-1 version; n is large in this case so it doesn't matter which one is used) the standard deviation is: sqrt( n * sumOfSquaredValue - sumOfValue * sumOfValue) / n. Thus only two floating point additions and one multiplication are needed in the inner loop. Overflow is not a problem with these sums as the range for doubles is about 10^308. In particular you will get rid of the expensive floating point division that the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are rewritten every time (the compiler may or may not prevent this). As the minimum quickly becomes small and the maximum quickly becomes large, only the two comparisons should happen for the majority of loop iterations: use if statements instead to be sure. It can be argued, but on the other hand it is trivial to do.
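A minimal sketch of the two-sums approach, assuming int values and a non-empty matrix. SumStats and sumOfSquaresStats are hypothetical names. The clamp to zero is my own precaution, since rounding can push the difference slightly negative when the spread is tiny compared to the mean.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration.
struct SumStats {
    double mean;
    double stddev;  // population ("n") standard deviation
};

// Single pass: accumulate the sum and the sum of squares only.
// Assumes the matrix holds at least one value.
inline SumStats sumOfSquaresStats(const std::vector<std::vector<int> >& vals) {
    double sum = 0.0;
    double sumSq = 0.0;
    std::size_t n = 0;
    for (const std::vector<int>& row : vals) {
        for (int v : row) {
            const double x = v;
            sum += x;         // addition 1
            sumSq += x * x;   // multiplication and addition 2
            ++n;
        }
    }
    const double count = static_cast<double>(n);
    // sqrt(n * sumSq - sum * sum) / n, clamped so rounding cannot yield NaN.
    const double spread = std::max(0.0, count * sumSq - sum * sum);
    return SumStats{sum / count, std::sqrt(spread) / count};
}
```

The if-based min/max update would slot into the same inner loop without adding any floating point work.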
I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
    vector<T>::const_iterator value;
    vector<T>::const_iterator value_end = row->end();
    for(value = row->begin(); value < value_end; ++value)
    {
        double mean_prev = new_mean;
        new_mean += (*value - new_mean) / (cnt + 1);
        new_standard_dev += (*value - new_mean) * (*value - mean_prev);
        // find new max/min's
        new_min = min(*value, new_min);
        new_max = max(*value, new_max);
        cnt++;
    }
}
The advantage of this is that in your inner loop you aren't consulting the outer vector, just the inner one.
If your container type is a list, this will be significantly faster, because the lookup time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.
Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.
If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.
I'm not sure what type vals is, but vals.at(row).size() could take a long time if it iterates through the collection itself. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²).
I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.
As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.
Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.
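A minimal sketch of what a flat layout with valarray and slices could look like, assuming int values. FlatMatrix and its member names are hypothetical; only std::valarray and std::slice come from the standard library.

```cpp
#include <cstddef>
#include <valarray>

// Hypothetical row-major matrix stored in one contiguous std::valarray.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}  // zero-initialised

    int& at(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }

    // A row is a contiguous run: start r * cols, length cols, stride 1.
    int rowSum(std::size_t r) const {
        const std::valarray<int> row = data_[std::slice(r * cols_, cols_, 1)];
        return row.sum();
    }

    // A column is every cols-th element: start c, length rows, stride cols.
    int colSum(std::size_t c) const {
        const std::valarray<int> col = data_[std::slice(c, rows_, cols_)];
        return col.sum();
    }

    int total() const { return data_.sum(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::valarray<int> data_;
};
```

For a matrix the size of yours the int sum would overflow, so a real version would need a wider accumulator; the point here is only the layout and the slicing.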
In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.
I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
    //OHGOD
    double accum_f;
    double accum_sq_f;
    double new_mean = 0;
    double new_standard_dev = 0;
    int new_min = 256;
    int new_max = 0;
    // The sums are kept as two integers: a low part below oku and a carry count
    const int oku = 100000000;
    int accum_ichi = 0;
    int accum_oku = 0;
    int accum_sq_ichi = 0;
    int accum_sq_oku = 0;
    size_t cnt = 0;
    size_t v1 = 0;
    size_t v2 = 0;
    v1 = vals.size();
    for(size_t row = 0; row < v1; row++)
    {
        v2 = vals.at(row).size();
        for(size_t col = 0; col < v2; col++)
        {
            T value = get(row, col);
            accum_ichi += value;
            accum_sq_ichi += (value * value);
            // perform carries
            accum_oku += (accum_ichi / oku);
            accum_ichi %= oku;
            accum_sq_oku += (accum_sq_ichi / oku);
            accum_sq_ichi %= oku;
            // find new max/min's
            new_min = value < new_min ? value : new_min;
            new_max = value > new_max ? value : new_max;
            cnt++;
        }
    }
    // now, and only now, do we use floating-point arithmetic
    accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
    accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
    new_mean = accum_f / (double)(cnt);
    // standard deviation formula from Wikipedia
    stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
    std::cout << stats_standard_dev << std::endl;
}