I have a function that calculates weights according to a Gaussian distribution:
const float dx = 1.0f / static_cast<float>(points - 1);
const float sigma = 1.0f / 3.0f;
const float norm = 1.0f / (sqrtf(2.0f * static_cast<float>(M_PI)) * sigma);
const float divsigma2 = 0.5f / (sigma * sigma);
m_weights[0] = 1.0f;
for (int i = 1; i < points; i++)
{
float x = static_cast<float>(i)* dx;
m_weights[i] = norm * expf(-x * x * divsigma2) * dx;
m_weights[0] -= 2.0f * m_weights[i];
}
The actual numbers in the calculation above do not matter. The only thing that matters is that I start with m_weights[0] = 1.0f; and each time I calculate m_weights[i] I subtract it twice from m_weights[0], like this:
m_weights[0] -= 2.0f * m_weights[i];
to ensure that w[0] + 2 * w[i] (for i = 1..N) sums to exactly 1.0f. But it does not. This assert fails:
float wSum = 0.0f;
for (size_t i = 0; i < m_weights.size(); ++i)
{
float w = m_weights[i];
if (i == 0) {
wSum += w;
} else {
wSum += (w + w);
}
}
assert(wSum == 1.0 && "Weights sum is not 1.");
How can I ensure the sum to be 1.0f on all platforms?
You can't. Floating point isn't like that. Even adding the same values can produce different results according to the cpu used.
All you can do is define some accuracy value and ensure that you end up with 1.0 +/- that value.
See: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Because a float has only a 24-bit significand (23 bits stored explicitly; see e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format ), rounding error accumulates quickly. So even if the rest of the code is correct, your sum becomes something like 1.0000001 or 0.9999999 (have you watched it in the debugger or tried to print it to the console, by the way?). To improve precision you can replace float with double, but the sum still will not be exactly 1.0: the error will just be smaller, something like 1e-16 instead of 1e-7.
The second thing to do is to replace strict comparison to 1.0 with a range comparison, like:
assert(fabs(wSum - 1.0) <= 1e-13 && "Weights sum is not 1.");
Here 1e-13 is the epsilon within which you consider two floating-point numbers equal. If you choose to stay with float (not double), you may need an epsilon like 1e-6.
Depending on how large your weights are and how many points there are, accumulated error can become larger than that epsilon. In that case you would need special algorithms for keeping the precision higher, such as sorting the numbers by their absolute values prior to summing them up starting with the smallest numbers.
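A minimal sketch of both ideas, assuming the weights live in a std::vector<float>. The names sumSmallestFirst and nearlyEqual are illustrative and do not appear in the original code; the accumulation in double is an extra assumption on top of the sorting.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sum the values in order of increasing magnitude, accumulating in double,
// so that small terms are not swallowed by a large running total.
// Takes the vector by value because it sorts its own copy.
double sumSmallestFirst(std::vector<float> values)
{
    std::sort(values.begin(), values.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    double sum = 0.0;
    for (float v : values)
        sum += v;
    return sum;
}

// Tolerance comparison to use in place of operator==.
bool nearlyEqual(double a, double b, double eps)
{
    assert(eps >= 0.0);
    return std::fabs(a - b) <= eps;
}
```

With these helpers the failing check could be written as assert(nearlyEqual(wSum, 1.0, 1e-6)); the right epsilon still depends on the number of points.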
How can I ensure the sum to be 1.0f on all platforms?
As the other answers (and comments) have stated, you can't achieve this, due to the inexactness of floating point calculations.
One solution is, instead of using double, to use a fixed-point or multi-precision library such as GMP, the Boost Multiprecision Library, or one of the many others out there.
I'm not sure if this is the right place to ask and I'm certainly not sure
if this has been answered elsewhere so don't shoot me if it is. Maybe
I'm not using the right words when searching.
I'm trying to create a generic float method to calculate related values in c++ but limited by their min, max values. Math isn't my
thing so I'd love some help at this stage.
To the point: Let's say that we have a float variable named "health" that could get any value from 0 to 100. We also
have a second float variable named "walkingSpeed" that can get
any value from 100 to 200.
I need a method to calculate and return what "walkingSpeed" would be in relation to "health" while always taking in account the min and
max values of these two variables ( if Health = HealthMax then
walkingSpeed = walkingSpeedMax, etc. ). I tried to work with % but I
couldn't figure it out. Any ideas?
Thanks in advance!
EDIT:
Generic method based on @Felipe's suggestion:
/*
* Returns valueY based on a relative value X
*
* @RelXMin - Minimum relative value
* @RelXMax - Maximum relative value
* @relX - Relative value
* @valueYMin - Minimum value
* @valueYMax - Maximum value
* @bInverse - set true if y is inversely related to x
*/
float calcRelativeFloatValue(float RelXMin, float RelXMax, float relX, float valueYMin, float valueYMax, bool bInverse)
{
float xRange, yRelative;
xRange = RelXMax - RelXMin;
yRelative = (valueYMax - valueYMin) / xRange;
if (bInverse)
{
return valueYMax - (yRelative * (relX - RelXMin)); // offset from RelXMin, so a nonzero minimum works
}
return valueYMin + (yRelative * (relX - RelXMin));
}
It looks like you are asking for linear interpolation.
The two known points are the minimum and the maximum values. So you can use a general interpolation function like this (just the equation from wikipedia written as code):
float interpolate(float x, float x_0, float x_1, float y_0, float y_1) {
return y_0 + (y_1 - y_0) * (x - x_0) / (x_1 - x_0);
}
Your example would then be
walkingSpeed=interpolate(health, minHealth, maxHealth, minWalkingSpeed, maxWalkingSpeed);
In this specific case, the task is pretty trivial:
float estimated_walking_speed = health + 100;
For the more generic task of mapping a value in one range to the corresponding value in some other range, you'd be looking at something like this:
auto input_range = input_upper - input_lower;
auto relative_input_loc = (input_value - input_lower) / input_range;
auto output_range = output_upper - output_lower;
auto output = output_lower + (relative_input_loc * output_range);
[This is open to simplification/optimization--I'm writing it all out to keep it as understandable as possible.]
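The four steps above can be folded into one function. This is a minimal sketch; the name mapRange and its parameter names are illustrative, and it extrapolates rather than clamps when the input lies outside its range.

```cpp
#include <cassert>

// Map `value` from [inLower, inUpper] to the corresponding point in
// [outLower, outUpper] by linear interpolation. Values outside the input
// range are extrapolated, not clamped.
float mapRange(float value, float inLower, float inUpper,
               float outLower, float outUpper)
{
    assert(inUpper != inLower);  // avoid division by zero
    const float t = (value - inLower) / (inUpper - inLower);
    return outLower + t * (outUpper - outLower);
}
```

For the health example, mapRange(health, 0.0f, 100.0f, 100.0f, 200.0f) gives the walking speed. Swapping the two output bounds gives the inverse relation.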
Try this.
float calculateSpeed(float minHealth, float maxHealth, float currentHealth, float minSpeed, float maxSpeed) {
float healthLength = maxHealth - minHealth;
float speedPerHealth = (maxSpeed - minSpeed) / healthLength;
float partialSpeed = speedPerHealth * (currentHealth - minHealth); // measure health from its minimum
float currentSpeed = partialSpeed + minSpeed;
return currentSpeed;
}
I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
75.63      5.43     5.43       40   135.75   135.75  compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08      6.80     1.37                             readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes their variance in each dimension (using Knuth's algorithm).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would be really happy to trade accuracy for speed and accept an approximate computation of the variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Note that before the edit I had used -o3, with a lowercase 'o'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure there is a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
compute_variances() is called 40 times.
For each group of 10 calls, one of the following holds:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: points[i] is just the i-th element of the std::vector; the second [] is implemented like this in the Point class:
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in compute_variances() is for sure something heavy! However, Knuth's algorithm is a numerically stable one, and I was not able to find another algorithm that would be both numerically stable and free of division.
Note that I am not interested in parallelism right now.
[EDIT.2]
Minimal example of the Point class (I don't think I forgot to show anything):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* @param i - index of the coordinate
* @return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* @param i - index of the coordinate
* @return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped up by doing the computation for four elements at once with SSE. Because each loop iteration processes a single element independently of the others, nothing stands in the way of vectorizing it. If you go down to 16-bit short instead of 32-bit float (which is then an approximation), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
This will not solve your performance problem on its own, but it is one of several measures that can be combined with others.
2. Micro-parallelism
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta and n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The value for the latter depends on your CPU, but nowadays -march=core2 is typically fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
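As a rough sketch of what these suggestions could look like together, the function below assumes the points are stored in one flat, row-major std::vector<float> instead of a vector of Point objects. The function name, the layout, and the use of std::vector for avg and var are assumptions made for illustration; like the original, it leaves var as the running sum of squared deviations rather than dividing by n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat layout: point i, dimension d lives at data[i * dim + d].
// Runs the same Knuth/Welford update as the question, but over contiguous
// memory, with no per-access bounds check and one division per point
// instead of one per coordinate.
void computeVariancesFlat(const std::vector<float>& data, std::size_t dim,
                          std::vector<float>& avg, std::vector<float>& var)
{
    assert(dim > 0 && data.size() % dim == 0);
    avg.assign(dim, 0.0f);
    var.assign(dim, 0.0f);
    const std::size_t count = data.size() / dim;
    for (std::size_t i = 0; i < count; ++i) {
        const float invN = 1.0f / static_cast<float>(i + 1);  // hoisted division
        const float* row = &data[i * dim];  // contiguous coordinates of point i
        for (std::size_t d = 0; d < dim; ++d) {
            const float delta = row[d] - avg[d];
            avg[d] += delta * invN;
            var[d] += delta * (row[d] - avg[d]);
        }
    }
}
```

Multiplying by the reciprocal rounds slightly differently from dividing, so results may differ from the original in the last bits.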
Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen the impact of cache misses, run the inner loop alternately forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE 754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense to try to reduce the number of divisions: use a partial, non-cumulative average calculation, which would need only Dim division operations for N elements (instead of N x Dim), where N < points.size().
A huge speedup could be achieved using CUDA or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors favors data items that are very close to each other: there is less chance of a data cache miss and a better chance of loading all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to extracting the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is converted back to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integral data processing instructions than floating point ones. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
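A minimal sketch of a power-of-2 fixed-point type, assuming a base of 65536 (16 fractional bits in a 32-bit integer). The names Fixed, kFracBits, toFixed, toDouble and fixedMul are illustrative and not taken from any library.

```cpp
#include <cassert>
#include <cstdint>

// Q16.16 fixed point: the stored integer is the real value times 2^16.
typedef std::int32_t Fixed;
const int kFracBits = 16;

// Convert at input; the fractional remainder is truncated toward zero.
Fixed toFixed(double x)
{
    return static_cast<Fixed>(x * (1 << kFracBits));
}

// Convert back at output.
double toDouble(Fixed f)
{
    return static_cast<double>(f) / (1 << kFracBits);
}

// The raw product carries 32 fractional bits, so widen to 64 bits and
// shift back down by 16. Right-shifting a negative value is
// implementation-defined before C++20 (arithmetic on mainstream compilers).
Fixed fixedMul(Fixed a, Fixed b)
{
    return static_cast<Fixed>((static_cast<std::int64_t>(a) * b) >> kFracBits);
}
```

Addition and subtraction of two Fixed values are plain integer operations; only multiplication and division need the extra shift. Whether this actually beats float has to be measured on the target processor.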
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
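A minimal sketch of the two-loop version described in Point 1, for a single dimension. The struct MeanVar and the function twoPassVariance are illustrative names, and the population variance (dividing by n) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct MeanVar {
    double mean;
    double variance;  // population variance (divides by n)
};

MeanVar twoPassVariance(const std::vector<double>& x)
{
    assert(!x.empty());
    const double n = static_cast<double>(x.size());

    // First pass: the average.
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i];
    const double mean = sum / n;

    // Second pass: sum of squared differences from the average.
    double sumSq = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double diff = x[i] - mean;
        sumSq += diff * diff;
    }

    MeanVar result = { mean, sumSq / n };
    return result;
}
```

Each pass is a simple streaming loop with no division inside it, at the cost of reading the data twice.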
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.
I'm having an issue with floating point arithmetic in c++ (using doubles) that I've never had before, and so I'm wondering how people usually deal with this type of problem.
I'm trying to represent a curve in polar coordinates as a series of Point objects (Point is just a class that holds the coordinates of a point in 3D). The collection of Points representing the curve are stored in a vector (of Point*). The curve I'm representing is a function r(theta), which I can compute. This function is defined on the range of theta contained in [0,PI]. I am representing PI as 4.0*atan(1.0), storing it as a double.
To represent the surface, I specify the desired number of points (n+1), for which I am currently using n=80, and then I determine the interval in theta required to divide [0,PI] into 80 equal intervals (represented by n+1=81 Points). So dTheta = PI / n. dTheta is a double. I next assign coordinates to my Points. (See sample code below.)
double theta0 = 0.0; // Beginning of interval in theta
double thetaF = PI; // End of interval in theta
double dTheta = (thetaF - theta0)/double(nSegments); // segment width
double theta = theta0; // Initialize theta to start at beginning of interval
vector<Point*> pts; // Declare a variable to hold the Points.
while (theta <= thetaF)
{
// Store Point corresponding to current theta and r(theta) in the vector.
pts.push_back(new Point(theta, rOfTheta(theta), 0.0));
theta += dTheta; // Increment theta
}
rOfTheta(theta) is some function that computes r(theta). Now the problem is that the very last point somehow doesn't satisfy the (theta <= thetaF) requirement to enter the loop one final time. Actually, after the last pass through the loop, theta is very slightly greater than PI (it's like PI + 1e-15). How should I deal with this? The function is not defined for theta > PI. One idea is to just test for ((theta > PI) and (theta < (PI+delta))) where delta is very small. If that's true, I could just set theta=PI, get and set the coordinates of the corresponding Point, and exit the loop. This seems like a reasonable problem to have, but interestingly I have never faced such a problem before. I had been using gcc 4.4.2, and now I'm using gcc 4.8.2. Could that be the problem? What is the normal way to handle this kind of problem? Thanks!
Never iterate over a range with a floating point value (theta) by adding increments if you have the alternative of computing the next value by
theta = theta0 + idx*dTheta.
Control the iteration using the integer number of steps and compute the float as indicated.
If dTheta is small compared to the entire interval, you'll accumulate errors.
Do not insert the computed last value of the range [theta0, thetaF]. That value is actually theta = theta0 + n * (dTheta + error). Skip that last calculated value and use thetaF instead.
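A minimal sketch of this advice, assuming only the sample positions are needed. The function name sampleAngles is illustrative, and each returned value would then be passed to rOfTheta to build the Points.

```cpp
#include <cassert>
#include <vector>

// Return nSegments + 1 sample positions covering [theta0, thetaF].
// The loop is controlled by an integer, each theta is recomputed from the
// index, and the endpoint is pinned to thetaF instead of being computed.
std::vector<double> sampleAngles(double theta0, double thetaF, int nSegments)
{
    assert(nSegments > 0);
    const double dTheta = (thetaF - theta0) / nSegments;
    std::vector<double> thetas;
    thetas.reserve(nSegments + 1);
    for (int idx = 0; idx < nSegments; ++idx)
        thetas.push_back(theta0 + idx * dTheta);  // no accumulated error
    thetas.push_back(thetaF);  // last value is exactly thetaF
    return thetas;
}
```

With nSegments = 80 this yields exactly 81 values, and the last one can never exceed thetaF.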
What I might try:
while (theta <= thetaF)
{
// Store Point corresponding to current theta and r(theta) in the vector.
pts.push_back(new Point(theta, rOfTheta(theta), 0.0));
theta += dTheta; // Increment theta
}
if (theta >= thetaF) {
pts.push_back(new Point(thetaF, rOfTheta(thetaF), 0.0));
}
You might want to guard the if statement with pts.size() == nSegments instead; just experiment and see which produces the better results.
If you know that there would be 81 values of theta, then why not run a for loop 81 times?
int i;
theta = theta0;
for(i = 0; i <= nSegments; i++) { // nSegments + 1 points, i.e. 81 for 80 segments
pts.push_back(new Point(theta, rOfTheta(theta), 0.0));
theta += dTheta;
}
First of all: get rid of the naked pointer :-)
You know the number of segments you have, so instead of using the value of theta in the while-block:
for (auto idx = 0; idx != nSegments; ++idx) { // nSegments points here, plus the final one below
// Store Point corresponding to current theta and r(theta) in the vector.
pts.emplace_back(theta, rOfTheta(theta), 0.0);
theta += dTheta; // Increment theta
}
pts.emplace_back(thetaF, /* rOfTheta(PI) calculated exactly */, 0.0);
for (int i = 0; i <= nSegments; ++i)
{
theta = (double) i / nSegments * PI;
…
}
This:
produces the correct number of iterations (since the loop counter is maintained as an integer),
does not accumulate any error (since theta is calculated freshly each time), and
produces exactly the desired value (well, PI, not π) in the final iteration (since (double) i / nSegments will be exactly one).
Unfortunately, it contains a division, which is typically a time-consuming instruction.
(The loop counter could also be a double, which avoids the cast from int to double inside the loop. As long as integer values are used, the arithmetic for it will be exact, until you get beyond 2^53 iterations.)