Split range into uniform intervals - C++

I want to split a range with double endpoints into N >= 2 equal or nearly equal intervals.
I found a suitable function in the GNU Scientific Library:
void
make_uniform (double range[], size_t n, double xmin, double xmax)
{
  size_t i;

  for (i = 0; i <= n; i++)
    {
      double f1 = ((double) (n-i) / (double) n);
      double f2 = ((double) i / (double) n);
      range[i] = f1 * xmin + f2 * xmax;
    }
}
However, when
xmin = 241141 (binary 0x410D6FA800000000)
xmax = 241141.0000000001 (binary 0x410D6FA800000003)
N = 3
the function produces
[0x410D6FA800000000,
0x410D6FA800000000,
0x410D6FA800000002,
0x410D6FA800000003]
instead of desired
[0x410D6FA800000000,
0x410D6FA800000001,
0x410D6FA800000002,
0x410D6FA800000003]
How can I achieve uniformity without resorting to long arithmetic? (I already have a long-arithmetic solution, but it is ugly and slow.) Bit twiddling and x86 (x86-64, so no extended precision) assembly routines are acceptable.
UPDATE:
A general solution is needed, without the premise that xmin and xmax have equal exponent and sign:
xmin and xmax may be of any value except infinity and NaN (possibly also excluding denormalized values for the sake of simplicity).
xmin < xmax
(1<<11)-1 >= N >= 2
I'm ready for a major (2-3 orders of magnitude) performance loss.

x87 still exists in x86-64, and 64-bit kernels for mainstream OSes do correctly save/restore the x87 state for 64-bit processes. Despite what you may have read, x87 is fully usable in 64-bit code.
Outside of Windows (i.e. in the x86-64 System V ABI used everywhere else), long double is the native 80-bit x87 format. This will probably solve your precision problem for x86 / x86-64 only, if you don't care about portability to ARM / PowerPC / whatever else that only has 64-bit precision in hardware.
Probably best to only use long double for temporaries inside the function.
I'm not sure what you have to do on Windows to get a compiler to emit 80-bit extended FP math. It's certainly possible in asm, and supported by the kernel, but the toolchain and ABI make it inconvenient to use.
x87 is only somewhat slower than scalar SSE math on current CPUs. 80-bit load/store is extra slow, though, like 4 uops on Skylake instead of 1 (https://agner.org/optimize/) and a few cycles extra latency for fld m80.
For your loop, which has to convert int to FP (by storing the integer and loading it with x87 fild), it might be at most something like a factor of 2 slower than what a good compiler could do with SSE2 for 64-bit double.
And of course long double will prevent auto-vectorization.
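Something like the following sketch (my illustration of the idea, not verified against the exact bit patterns above): keep the GSL loop, but compute the blend in long double, so the two products and their sum carry 64-bit mantissas and only the final store rounds to double.
void make_uniform_ld(double range[], size_t n, double xmin, double xmax)
{
    for (size_t i = 0; i <= n; i++)
    {
        // 80-bit x87 temporaries (long double outside Windows)
        long double f1 = (long double) (n - i) / n;
        long double f2 = (long double) i / n;
        range[i] = (double) (f1 * xmin + f2 * xmax); // single final rounding
    }
}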

I see two choices: reordering the operations as xmin + (i * (xmax - xmin)) / n, or dealing directly with the binary representations. Here is an example of both.
#include <iostream>
#include <iomanip>
#include <limits>

int main() {
    double xmin = 241141;
    double xmax = 241141.0000000001;
    size_t n = 3, i;
    double range[4];
    std::cout << std::setprecision(std::numeric_limits<double>::digits10) << std::fixed;

    for (i = 0; i <= n; i++) {
        range[i] = xmin + (i * (xmax - xmin)) / n;
        std::cout << range[i] << "\n";
    }
    std::cout << "\n";

    auto uxmin = reinterpret_cast<unsigned long long&>(xmin);
    auto uxmax = reinterpret_cast<unsigned long long&>(xmax);
    for (i = 0; i <= n; i++) {
        auto rangei = ((n-i) * uxmin + i * uxmax) / n;
        range[i] = reinterpret_cast<double&>(rangei);
        std::cout << range[i] << "\n";
    }
}

Related

Simpson's Composite Rule giving too large values when n is very large

I am using Simpson's Composite Rule to calculate the integral from 2 to 1,000 of 1/ln(x); however, when using a large n (usually around 500,000), I start to get results that vary from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. I am wondering why this is, since as n gets larger, the accuracy is supposed to increase, not decrease.
#include <iostream>
#include <cmath>

// the function f(x)
float f(float x)
{
    return (float) 1 / std::log(x);
}

float my_simpson(float a, float b, long int n)
{
    if (n % 2 == 1) n += 1; // since n has to be even
    float area, h = (b-a)/n;
    float x, y, z;
    for (int i = 1; i <= n/2; i++)
    {
        x = a + (2*i - 2)*h;
        y = a + (2*i - 1)*h;
        z = a + 2*i*h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area*h/3;
}

int main()
{
    std::cout.precision(20);
    int upperBound = 1'000;
    int subsplits = 1'000'000;
    float approx = my_simpson(2, upperBound, subsplits);
    std::cout << "Output: " << approx << std::endl;
    return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real (in the mathematical sense) number, a float has limited precision.
A typical IEEE 754 32-bit (single precision) floating-point number dedicates only 24 bits (one of which is implicit) to the mantissa, which translates into roughly fewer than 8 significant decimal digits (please take this as a gross simplification).
A double, on the other hand, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations these days.
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweet spot, but after that the accumulation of rounding errors prevails and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, due to the fact that area becomes much greater than f(x) + 4*f(y) + f(z) (e.g. 224678.937 vs. 0.3606823). The bigger n is, the sooner this becomes relevant, making the result diverge from the real one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).
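A minimal sketch of both fixes (an illustration, not OP's final code): initialize the accumulator and carry out the arithmetic in double.
#include <cmath>

double f(double x)                    // f(x) in double precision
{
    return 1.0 / std::log(x);
}

double my_simpson(double a, double b, long n)
{
    if (n % 2 == 1) n += 1;           // n has to be even
    double area = 0.0;                // initialized, unlike the original
    double h = (b - a) / n;
    for (long i = 1; i <= n / 2; i++)
    {
        double x = a + (2*i - 2) * h;
        double y = a + (2*i - 1) * h;
        double z = a + 2*i * h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area * h / 3;
}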

Returning 'inf' when finite result is expected

The outputs are not what Python/Mathematica previously calculated when calling test_func(3000, 10). Instead, my code returns inf, and I am not sure why.
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <chrono>
#include <cmath>

using namespace std;

// I0
double I0(double y0, double y)
{
    return log( (exp(y0+y)-1) / (exp(y0-y)-1) ) - y;
}

// test_func
double test_func(double k, double M)
{
    double k0 = sqrt(pow(k, 2.0) + pow(M, 2.0));
    cout << "k0: " << k0 << endl;
    double Izero = I0(k0, k);
    cout << "I0: " << Izero << endl;
    double k3 = pow(k, 3.0);
    cout << "k3: " << k3 << endl;
    return Izero / k3;
}

int main()
{
    cout << test_func(3000, 10) << endl;
    return 0;
}
The output I get is
k0: 3000.02
I0: inf
k3: 2.7e+10
inf
but I0 should be 3004.1026690762033, while the result of the function should be test_func(3000, 10)=1.1126306181763716e-07. I am puzzled. Do you know what is wrong with it? I am a C++ beginner, so any help is very welcome.
double has a limited range. On most modern implementations, doubles are in the standard IEEE floating point binary64 format, with range up to about 1.8e308. Anything bigger (such as your exp(6000)) rounds up to infinity.
Switching to a larger type like long double may not be the best idea. Though the range of floating point types roughly grows exponentially with size, the fact that you use the exp function makes it easy to exceed the extra range. E.g. on an implementation with an 80-bit long double (with range up to about 1.2e4932), modifying test_func and I0 to use long double still fails to evaluate test_func(5700, 10).
It is instead possible to redesign I0 to avoid huge numbers. Let's start by splitting the log.
double I0(double y0, double y) {
    return log(exp(y0+y)-1) - log(exp(y0-y)-1) - y;
}
When you're computing log(exp(y0+y) - 1), if exp(y0+y) gives infinity you can recover the computation by using y0+y instead of the log. I.e. we're ignoring the -1, because if our numbers are that large the precision of double isn't enough to actually register the difference in the final result. Also, you may want to replace both exp(x) - 1s with expm1. This is because when x is close to 0, exp(x) - 1 will tend to lose precision. E.g. exp(1e-16) - 1 == 0 but expm1(1e-16) > 0 assuming IEEE. I suspect that's not as important to you.
double I0(double y0, double y) {
    double num = expm1(y0 + y), den = expm1(y0 - y);
    num = isinf(num) ? y0 + y : log(num);
    den = isinf(den) ? y0 - y : log(den); // though I suspect the domain of I0 is such that you don't actually need this
    return num - den - y;
}
This is only a very rudimentary correction. Squeezing the most correctness out of floating point is very difficult in general and is its own whole field of programming. However, this is enough to make your case work, and even works on those large inputs where the naive long double also fails. (I'm no expert, but I also notice that y0-y is itself problematic, since y0 is apparently chosen to be close to y. Subtraction between them loses precision that may (or may not) destroy the result.)
If you don't want to deal with rewriting your formulas to fit within the limitations of floating point (completely understandable, given the potential for bugs!), I would suggest following @M.M's advice and using an arbitrary-precision math library like MPFR. That's similar to replacing double with long double, but now you can keep throwing bits at the problems until they go away, whereas you will eventually run out of built-in floating point types.

SSE rounds down when it should round up

I am working on an application that converts float samples in the range of -1.0 to 1.0 to signed 16-bit. To ensure the output of the optimized (SSE) routines is accurate, I have written a set of tests that runs the non-optimized version against the SSE version and compares their output.
Before I start I have confirmed that the SSE rounding mode is set to nearest.
In my test case the formula is:
ratio = 65536 / 2
output = round(input * ratio)
For the most part the results are accurate, but I am seeing a failure for one particular input: -0.8499908447265625.
-0.8499908447265625 * (65536 / 2) = -27852.5
The normal code correctly rounds this to -27853, but the SSE code rounds this to -27852.
Here is the SSE code in use:
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
    static float ratio = 65536.0f / 2.0f;
    static __m128 mul = _mm_set_ps1(ratio);

    for(unsigned int i = 0; i < samples; i += 4, in += 4, out += 4)
    {
        __m128 xin;
        __m128i con;

        xin = _mm_load_ps(in);
        xin = _mm_mul_ps(xin, mul);
        con = _mm_cvtps_epi32(xin);

        out[0] = _mm_extract_epi16(con, 0);
        out[1] = _mm_extract_epi16(con, 2);
        out[2] = _mm_extract_epi16(con, 4);
        out[3] = _mm_extract_epi16(con, 6);
    }
}
Self-contained example, as requested:
/* standard math */
float ratio = 65536.0f / 2.0f;
float in [4] = {-1.0, -0.8499908447265625, 0.0, 1.0};
int16_t out[4];

for(int i = 0; i < 4; ++i)
    out[i] = round(in[i] * ratio);

/* sse math */
static __m128 mul = _mm_set_ps1(ratio);
__m128 xin;
__m128i con;

xin = _mm_load_ps(in);
xin = _mm_mul_ps(xin, mul);
con = _mm_cvtps_epi32(xin);

int16_t outSSE[4];
outSSE[0] = _mm_extract_epi16(con, 0);
outSSE[1] = _mm_extract_epi16(con, 2);
outSSE[2] = _mm_extract_epi16(con, 4);
outSSE[3] = _mm_extract_epi16(con, 6);

printf("Standard = %d, SSE = %d\n", out[1], outSSE[1]);
Although the SSE rounding mode defaults to "round to nearest", it's not the old familiar rounding method that we all learned in school, but a slightly more modern variation known as Banker's rounding (aka unbiased rounding, convergent rounding, statistician's rounding, Dutch rounding, Gaussian rounding or odd-even rounding), which rounds halfway cases to the nearest even integer value. This rounding method is supposedly better than the more traditional method, from a statistical perspective. You will see the same behaviour with functions such as rint(), and it is also the default rounding mode for IEEE 754.
Note also that whereas the standard library function round() uses the traditional rounding method, the SSE instruction ROUNDPS (_mm_round_ps) uses banker's rounding.
That's the default behaviour for all floating point processing, not just SSE. Round half to even or banker's rounding is the default rounding mode according to the IEEE 754 standard.
The reason this is used is that consistently rounding up (or down) results in a half-point error that accumulates when applied over even a moderate number of operations. The half points can result in some pretty significant errors - significant enough that they became a plot point in Superman 3.
Round half to even or odd though, results in both negative and positive half-point errors that eliminate each other when applied over many operations.
This is also desirable in SSE operations. SSE operations are typically used in signal processing (audio, image), engineering and statistical scenarios where a consistent rounding error would appear as noise and require additional processing to remove (if possible). Banker's rounding ensures this noise is eliminated.
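If the round()-style half-away-from-zero behaviour is really required in the SSE path, one common workaround is to add ±0.5 (carrying the sign of the input) and then convert with truncation. A sketch of that idea (my assumption of what fits here; note that for inputs just below a .5 boundary, adding 0.5 can itself round in single precision):
#include <emmintrin.h>

/* Emulate round-half-away-from-zero: add 0.5 with the input's sign,
   then truncate toward zero. */
static inline __m128i cvt_round_half_away(__m128 x)
{
    const __m128 half      = _mm_set1_ps(0.5f);
    const __m128 sign_mask = _mm_set1_ps(-0.0f);             /* sign bit only */
    __m128 signed_half = _mm_or_ps(half, _mm_and_ps(x, sign_mask));
    return _mm_cvttps_epi32(_mm_add_ps(x, signed_half));     /* truncates */
}
For the failing input, -27852.5 + (-0.5) = -27853.0, which truncates to -27853, matching round().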

Faster computation of (approximate) variance needed

I can see with the CPU profiler that compute_variances() is the bottleneck of my project.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
75.63      5.43     5.43       40   135.75   135.75  compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08      6.80     1.37                             readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims) {
    for (size_t d = 0; d < points[0].dim(); d++) {
        avg[d] = 0.0;
        var[d] = 0.0;
    }
    float delta, n;
    for (size_t i = 0; i < points.size(); ++i) {
        n = 1.0 + i;
        for (size_t d = 0; d < points[0].dim(); ++d) {
            delta = (points[i][d]) - avg[d];
            avg[d] += delta / n;
            var[d] += delta * ((points[i][d]) - avg[d]);
        }
    }
    /* Find t dimensions with largest scaled variance. */
    kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Point is a class I have implemented) and computes their variance in each dimension (using Knuth's algorithm).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is: can I do better? I would be really happy to trade speed for an approximate computation of the variances. Or maybe I could make the code more cache-friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Note that before this edit I had used -o3, without a capital 'O'. Thanks to ypnos, I now compile with the optimization flag -O3. I am sure there was a difference between them, since I performed time measurements with both of these methods in my pseudo-site.
Note that now compute_variances is dominating the overall project's time!
[EDIT]
compute_variances() is called 40 times.
For each group of 10 calls, one of the following holds:
points.size() = 1000 and points[0].dim() = 10000
points.size() = 10000 and points[0].dim() = 100
points.size() = 10000 and points[0].dim() = 10000
points.size() = 100000 and points[0].dim() = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: points[i] is just the i-th element of a std::vector, while the second [] is implemented like this in the Point class:
const FT& operator [](const int i) const {
    if (i < (int) coords.size() && i >= 0)
        return coords.at(i);
    else {
        std::cout << "Error at Point::[]" << std::endl;
        exit(1);
    }
    return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start.) Moreover, std::vector::at() is supposed to be constant time (as said in the reference). I changed this to have only .at() in the body of the function and the time measurements remained pretty much the same.
The division in compute_variances() is for sure something heavy! However, Knuth's algorithm is numerically stable and I was not able to find another algorithm that would be both numerically stable and free of division.
Note that I am not interested in parallelism right now.
[EDIT.2]
Minimal example of the Point class (I think I didn't forget to show anything):
class Point {
public:
    typedef float FT;
    ...
    /**
     * Get dimension of point.
     */
    size_t dim() const {
        return coords.size();
    }

    /**
     * Operator that returns the coordinate at the given index.
     * @param i - index of the coordinate
     * @return the coordinate at index i
     */
    FT& operator [](const int i) {
        return coords.at(i);
        // it's the same if I have the commented code below
        /*if (i < (int) coords.size() && i >= 0)
            return coords.at(i);
        else {
            std::cout << "Error at Point::[]" << std::endl;
            exit(1);
        }
        return coords[0]; // Clear -Wall warning*/
    }

    /**
     * Operator that returns the coordinate at the given index. (constant)
     * @param i - index of the coordinate
     * @return the coordinate at index i
     */
    const FT& operator [](const int i) const {
        return coords.at(i);
        /*if (i < (int) coords.size() && i >= 0)
            return coords.at(i);
        else {
            std::cout << "Error at Point::[]" << std::endl;
            exit(1);
        }
        return coords[0]; // Clear -Wall warning*/
    }

private:
    std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE and AVX instructions. Based on your word length and processor you can get speedups of about 4x or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
    delta = (points[i][d]) - avg[d];
    avg[d] += delta / n;
    var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped up by doing the computation for four elements at once with SSE. Since each loop iteration processes a single independent element, there is no bottleneck preventing this. If you go down to 16-bit short instead of 32-bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the complete solution to your performance problem, just one piece that can be combined with the others.
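As an illustration of this point (a sketch only: it assumes D is a multiple of 4, a raw pointer to a point's contiguous float coordinates, and the division already hoisted into a reciprocal), the inner loop could look like:
#include <emmintrin.h>

// Update running mean/variance for one point, four dimensions at a time.
void update_point_sse(const float* p, float* avg, float* var,
                      float inv_n, int D)
{
    const __m128 vinv = _mm_set1_ps(inv_n);
    for (int d = 0; d < D; d += 4)                  // assumes D % 4 == 0
    {
        __m128 x = _mm_loadu_ps(p + d);
        __m128 a = _mm_loadu_ps(avg + d);
        __m128 v = _mm_loadu_ps(var + d);
        __m128 delta = _mm_sub_ps(x, a);
        a = _mm_add_ps(a, _mm_mul_ps(delta, vinv));              // avg += delta / n
        v = _mm_add_ps(v, _mm_mul_ps(delta, _mm_sub_ps(x, a)));  // var += delta * (x - avg)
        _mm_storeu_ps(avg + d, a);
        _mm_storeu_ps(var + d, v);
    }
}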
2. Micro-parallelism
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB; others might suggest OpenMP instead. For this you would probably have to change the loop order: parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quad-core with HT you might get a speed-up of 25-30x for the combination without any loss in accuracy.
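A minimal OpenMP sketch of that loop interchange (my assumption of how it could look; with d outermost, each thread owns avg[d] and var[d] exclusively, so no synchronization is needed):
// Knuth's update per dimension, dimensions distributed across threads.
#pragma omp parallel for
for (ptrdiff_t d = 0; d < (ptrdiff_t) points[0].dim(); ++d)
{
    float a = 0.0f, v = 0.0f;
    for (size_t i = 0; i < points.size(); ++i)
    {
        const float delta = points[i][d] - a;
        a += delta / (1.0f + i);
        v += delta * (points[i][d] - a);
    }
    avg[d] = a;
    var[d] = v;
}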
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta and n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The argument to the latter depends on your CPU; nowadays -march=core2 is typically fine (also for recent AMDs) and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen the impact of cache misses, run the inner loop alternately forwards and backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetching works on the reverse order of iteration, as it well should.
On modern C++ compilers, std::fill and std::copy can leverage type alignment and have a chance of being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[] below is equivalent to the single line return coords.at(i) only in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
    assert(i >= 0 && i < (int)coords.size());
    return coords[i];
}

const FT * Point::constData() const {
    return &coords[0];
}

void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
                       float* var, size_t* split_dims)
{
    assert(points.size() > 0);
    const int D = points[0].dim();

    // i = 0, i_n = 1
    assert(D > 0);
#if __cplusplus >= 201103L
    std::copy_n(points[0].constData(), D, avg);
#else
    std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif

    // i = 1, i_n = 0.5
    if (points.size() >= 2) {
        assert(points[1].dim() == D);
        for (int d = D - 1; d >= 0; --d) {
            float const delta = points[1][d] - avg[d];
            avg[d] += delta * 0.5f;
            var[d] = delta * (points[1][d] - avg[d]);
        }
    } else {
        std::fill_n(var, D, 0.0f);
    }

    // i = 2, ...
    for (size_t i = 2; i < points.size(); ) {
        {
            const float i_n = 1.0f / (1.0f + i);
            assert(points[i].dim() == D);
            for (int d = 0; d < D; ++d) {
                float const delta = points[i][d] - avg[d];
                avg[d] += delta * i_n;
                var[d] += delta * (points[i][d] - avg[d]);
            }
        }
        ++ i;
        if (i >= points.size()) break;
        {
            const float i_n = 1.0f / (1.0f + i);
            assert(points[i].dim() == D);
            for (int d = D - 1; d >= 0; --d) {
                float const delta = points[i][d] - avg[d];
                avg[d] += delta * i_n;
                var[d] += delta * (points[i][d] - avg[d]);
            }
        }
        ++ i;
    }

    /* Find t dimensions with largest scaled variance. */
    kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
    avg[d] = 0.0;
    var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE 754 representation of 0.0 in 32 bits is 0x00000000. If the dimension is big, it is worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
Division operations are a lot more expensive (from a clock-cycle point of view) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense to try to reduce the number of divisions: use a partial non-cumulative average calculation, which would result in Dim division operations for N elements (instead of N x Dim); N < points.size().
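A sketch of the simplest variant of that idea: hoist the division out of the inner loop, so each point costs one division instead of one per dimension.
for (size_t i = 0; i < points.size(); ++i)
{
    const float inv_n = 1.0f / (1.0f + i);   // one division per point
    for (size_t d = 0; d < points[0].dim(); ++d)
    {
        const float delta = points[i][d] - avg[d];
        avg[d] += delta * inv_n;             // multiply instead of divide
        var[d] += delta * (points[i][d] - avg[d]);
    }
}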
A huge speedup could be achieved using CUDA or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization, including both the data cache and the instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
    // Perform loading all at once.
    register const float p1 = points[i][d + 0];
    register const float p2 = points[i][d + 1];
    register const float p3 = points[i][d + 2];
    register const float p4 = points[i][d + 3];

    register const float delta1 = p1 - avg[d+0];
    register const float delta2 = p2 - avg[d+1];
    register const float delta3 = p3 - avg[d+2];
    register const float delta4 = p4 - avg[d+3];

    // Perform calculations
    avg[d + 0] += delta1 / n;
    var[d + 0] += delta1 * ((p1) - avg[d + 0]);

    avg[d + 1] += delta2 / n;
    var[d + 1] += delta2 * ((p2) - avg[d + 1]);

    avg[d + 2] += delta3 / n;
    var[d + 2] += delta3 * ((p3) - avg[d + 2]);

    avg[d + 3] += delta4 / n;
    var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var arrays into one structure. This ensures that the two arrays are next to each other in memory, sans padding. Processors' data-fetching mechanisms favor data that are very close to each other: there is less chance of a data cache miss and a better chance of loading all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to extracting the parts, performing the math, then reassembling the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is converted to Floating Point before being output.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to use Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division; division and multiplication take more operations than shifting, and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
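A minimal sketch of the idea, using a hypothetical Q16.16 format (16 integer bits, 16 fractional bits); whether it pays off for this particular variance computation would need measuring:
#include <cstdint>

constexpr int FRAC_BITS = 16;                 // Q16.16

int32_t to_fixed(float x)   { return (int32_t)(x * (1 << FRAC_BITS)); }
float   to_float(int32_t q) { return (float)q / (1 << FRAC_BITS); }

// Multiply in a wider type, then shift the extra fractional bits away.
int32_t fx_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}

// Widen the numerator first so the quotient keeps its fractional bits.
int32_t fx_div(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a << FRAC_BITS) / b);
}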
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
    double xi = x[i];
    sum += xi;
    sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n;
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

On a float rounding error

I do not understand the output of the following program:
int main()
{
    float x = 14.567729f;
    float sqr = x * x;
    float diff1 = sqr - x * x;
    double diff2 = double(sqr) - double(x) * double(x);
    std::cout << diff1 << std::endl;
    std::cout << diff2 << std::endl;
    return 0;
}
Output:
6.63225e-006
6.63225e-006
I use VS2010, x86 compiler.
I expect to get a different output
0
6.63225e-006
Why is diff1 not equal to 0?
To calculate sqr - x * x, the compiler increases the float precision to double. Why?
The floating point registers are 80 bits (on most modern CPUs)
During an expression the result is an 80-bit value. It only gets truncated to 32 bits (float) or 64 bits (double) when it gets assigned to a location in memory. If you hold everything in registers (try compiling with -O3) you may see a different result.
Compiled with -O3:
> ./a.out
0
6.63225e-06
float diff1 = sqr - x * x;
double diff2 = double(sqr) - double(x) * double(x);
Why is diff1 not equal to 0?
Because you have already cached sqr = x*x and forced its representation to be a float.
To calculate sqr - x * x, the compiler increases the float precision to double. Why?
Because that is how C did things back before there was a C standard. I don't think modern compilers are bound to that convention, but many still do follow it. If this is the case, the right-hand sides of the calculations of diff1 and diff2 will be identical. The only difference is that after calculating the right-hand side of float diff1 = ..., the double result is converted back to a float.
Apparently the standard allows floats to be automatically promoted to double in expressions like that. See here
Do a find on that page for "automatically promoted" and check out the first paragraph with that phrase in it.
If we go by that paragraph, as I understand it, your sqr = x*x is initially treated as if it were a double as well, but once it is stored it is rounded to a float. Then, in your diff1 = sqr - x*x, x*x is again treated like a double, and so is sqr, although it's already rounded. Therefore, it yields the same result as casting them all to doubles: sqr is a double then, but already rounded to float precision, and again x*x is double precision.
On x86/x64 architectures it is common for compilers to promote all 32-bit floats to 64-bit doubles for computations; check the output assembly to see if the two variants produce the same instructions. The only difference between the types is the storage.
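For instance (a sketch; the exact outcome depends on compiler and flags, as discussed above), storing each product into a named float forces the rounding, after which the subtraction is exact:
float x    = 14.567729f;
float sqr  = x * x;        // rounded to float on store
float prod = x * x;        // rounded the same way
float diff = sqr - prod;   // exactly 0: both operands are the same float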