Is there a way to optimize this function? - c++

For an application I'm working on, I need to take two integers and add them together using a particular mathematical formula. This ends up looking like this:
int16_t add_special(int16_t a, int16_t b) {
float limit = std::numeric_limits<int16_t>::max();//32767 as a floating point value
float a_fl = a, b_fl = b;
float numerator = a_fl + b_fl;
float denominator = 1 + a_fl * b_fl / std::pow(limit, 2);
float final_value = numerator / denominator;
return static_cast<int16_t>(std::round(final_value));
}
Any readers with a passing familiarity with physics will recognize that this formula is the same as what is used to calculate the sum of near-speed-of-light velocities, and the calculation here intentionally mirrors that computation.
The code as-written gives the results I need: for low numbers, they nearly add together normally, but for high numbers, they converge to the maximum value of 32767, i.e.
add_special(10, 15) == 25
add_special(100, 200) == 300
add_special(1000, 3000) == 3989
add_special(10000, 25000) == 28390
add_special(30000, 30000) == 32640
Which all appears to be correct.
The problem, however, is that the function as-written involves first transforming the numbers into floating point values before transforming them back into integers. This seems like a needless detour for numbers that I know, as a principle of its domain, will never not be integers.
Is there a faster, more optimized way to perform this computation? Or is this the most optimized version of this function I can create?
I'm building for x86-64, using MSVC 14.X, although methods that also work for GCC would be beneficial. Also, I'm not interested in SSE/SIMD optimizations at this stage; I'm mostly just looking at the elementary operations being performed on the data.

You might avoid floating number and does all computation in integral type:
constexpr int16_t add_special(int16_t a, int16_t b) {
std::int64_t limit = std::numeric_limits<int16_t>::max();
std::int64_t a_fl = a;
std::int64_t b_fl = b;
return static_cast<int16_t>(((limit * limit) * (a_fl + b_fl)
+ ((limit * limit + a_fl * b_fl) / 2)) /* Handle round */
/ (limit * limit + a_fl * b_fl));
}
Demo
but according to Benchmark, it is not faster for those values.

As noted by Johannes Overmann, a big performance boost is gained by avoiding std::round, at the cost of some (little) discrepancies in the results, though.
I tried some other little changes HERE, where it seems that the following is a faster approach (at least for that architecture)
constexpr int32_t i_max = std::numeric_limits<int16_t>::max();
constexpr int64_t i_max_2 = static_cast<int64_t>(i_max) * i_max;
int16_t my_add_special(int16_t a, int16_t b)
{
// integer multipication instead of floating point division
double numerator = (a + b) * i_max_2;
double denominator = i_max_2 + a * b;
// Approximated rounding instead of std::round
return 0.5 + numerator / denominator;
}

Suggestions:
Use 32767.0*32767.0 (which is a constant) instead of std::pow(limit, 2).
Use integer values as much as possible, potentially with fixed points. Just the two divisions are a problem. Use floats just form them, if necessary (depends on the input data ranges).
Make it inline if the function is small and if it is appropriate.
Something like:
int16_t add_special(int16_t a, int16_t b) {
float numerator = int32_t(a) + int32_t(b); // Cannot overflow.
float denominator = 1 + (int32_t(a) * int32_t(b)) / (32767.0 * 32767.0); // Cannot overflow either.
return (numerator / denominator) + 0.5; // Relying on implementation defined rounding. Not good but potentially faster than std::round().
}
The only risk with the above is the omission of the explicit rounding, so you will get some implicit rounding.

Related

Simpson's Composite Rule giving too large values for when n is very large

Using Simpson's Composite Rule to calculate the integral from 2 to 1,000 of 1/ln(x), however when using a large n (usually around 500,000), I start to get results that vary from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. Wondering why this is, since as n gets larger, the accuracy is supposed to increase and not decrease.
#include <iostream>
#include <cmath>
// the function f(x)
float f(float x)
{
return (float) 1 / std::log(x);
}
float my_simpson(float a, float b, long int n)
{
if (n % 2 == 1) n += 1; // since n has to be even
float area, h = (b-a)/n;
float x, y, z;
for (int i = 1; i <= n/2; i++)
{
x = a + (2*i - 2)*h;
y = a + (2*i - 1)*h;
z = a + 2*i*h;
area += f(x) + 4*f(y) + f(z);
}
return area*h/3;
}
int main()
{
std::cout.precision(20);
int upperBound = 1'000;
int subsplits = 1'000'000;
float approx = my_simpson(2, upperBound, subsplits);
std::cout << "Output: " << approx << std::endl;
return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real (in mathematical sense) number, a float has a limited precision.
A typical IEEE 754 32-bit (single precision) floating-point number binary representation dedicates only 24 bits (one of which is implicit) to the mantissa and that translates in roughly less than 8 decimal significant digits (please take this as a gross semplification).
A double on the other end, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations, these days.
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweat spot, but after that the accumulation of rounding errors prevales and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, due to the fact that area becomes much greater than f(x) + 4*f(y) + f(z) (e.g 224678.937 vs. 0.3606823). The bigger n is, the sooner this gets relevant, making the result diverging from the real one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).

Precise conversion of 32-bit unsigned integer into a float in range (-1;1)

According to articles like this, half of the floating-point numbers are in the interval [-1,1]. Could you suggest how to make use of this fact so to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?
Naive code:
uint32_t i = /* randomly generated */;
float f = (float)i / (1ui32<<31) - 1.0f;
The problem here is that first the number i is converted into float losing up to 8 lower bits of precision. Only then the number is scaled to [0;2) interval, and then to [-1;1) interval.
Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.
Update: the solution with a double is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?
You can do the calculation using double instead so you don't lose any precision on the uint32_t value, then assign the result to a float.
float f = (double)i / (1ui32<<31) - 1.0;
In case you drop the uniform distribution constraint its doable on 32bit integer arithmetics alone:
//---------------------------------------------------------------------------
float i32_to_f32(int x)
{
int exp;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0x00000000) return 0.0f;
if (x< -0x1FFFFFFF) return -1.0f;
if (x> +0x1FFFFFFF) return +1.0f;
// conversion
y.u=0; // reset bits
if (x<0){ y.u|=0x80000000; x=-x; } // sign (31 bits left)
exp=((x>>23)&63)-64; // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
y.u|=(exp+127)<<23; // exponent bias and bit position
y.u|=x&0x007FFFFF; // mantissa
return y.f;
}
//---------------------------------------------------------------------------
int f32_to_i32(float x)
{
int exp,man,i;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0.0f) return 0x00000000;
if (x<=-1.0f) return -0x1FFFFFFF;
if (x>=+1.0f) return +0x1FFFFFFF;
// conversion
y.f=x;
exp=(y.u>>23)&255; exp-=127; // exponent bias and bit position
if (exp<-64) return 0.0f;
man=y.u&0x007FFFFF; // mantissa
i =(exp<<23)&0x1F800000;
i|= man;
if (y.u>=0x80000000) i=-i; // sign
return i;
}
//---------------------------------------------------------------------------
I chose to use only 29 bits + sign = ~ 30 bits of integer to avoid denormalized numbers havoc which I am too lazy to encode (it would get you 30 or even 31 bits but much slower and complicated).
But the distribution is not linear nor uniform at all:
in Red is the float in range <-1,+1> and Blue is integer in range <-1FFFFFFF,+1FFFFFFF>.
On the other hand there is no rounding at all in both conversions ...
PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).
The thing to realize is while (float)i does lose 8-bit of precision (so it has 24 bits of precision), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing (this is actually more complicated, because if i is smaller, it will lose less than 8-bits. But things will work out well).
So we just need to fix the range, so the originally non-negative value gets mapped to INT_MIN..INT_MAX.
This expression works: (float)(int)(value^0x80000000)/0x80000000.
Here's how it works:
The (int)(value^0x80000000) part flips the sign bit, so 0x0 gets mapped to INT_MIN, and 0xffffffff gets mapped to INT_MAX.
Then there is conversion to float. This is where some rounding happens, and we lose precision (but it is not a problem).
Then just divide by 0x80000000 to get into the range [-1..1]. As this division just adjusts the exponent part, this division doesn't lose any precision.
So, there is only one rounding, the other operations doesn't lose precision. These chain of operations should have the same effect, as calculating the result in infinite precision, then doing the rounding to float (this theoretical rounding has the same effect as the rounding at the 2. step)
But, to be absolutely sure, I've verified with brute force checking all the 32-bit values that this expression results in the same value as (float)((double)value/0x80000000-1.0).
I suggest (if yout want to avoid division and use an accurately float-representable start value of 1.0*2^-32):
float e = i * ldexp(1.0,-32) - 1.0;
Any ideas how to achieve this without using double-precision floating-point?
Without assuming too much about the insides of float:
Shift u until the most significant bit is set, halving the float conversion value.
"keeping the uniform distribution"
50% of the uint32_t values will be in the [0.5 ... 1.0)
25% of the uint32_t values will be in the [0.25 ... 0.5)
12.5% of the uint32_t values will be in the [0.125 ... 0.25)
6.25% of the uint32_t values will be in the [0.0625 ... 0.125)
...
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
float ui32to0to1(uint32_t u) {
if (u) {
float band = 1.0f/(1llu<<32);
while ((u & 0x80000000) == 0) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
return 0.0f;
}
Some test code to show functional equivalence to double.
int test(uint32_t u) {
volatile float f0 = (float) ((double)u / (1llu<<32));
volatile float f1 = ui32to0to1(u);
if (f0 != f1) {
printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
return 1;
}
return 0;
}
int main(void) {
for (int i=0; i<100000000; i++) {
test(rand()*65535u ^ rand());
}
return 0;
}
Various optimizations are possible, especially with assuming properties of float. Yet for an initial answer, I'll stick to a general approach.
For improved efficiency, the loop needs only to iterate from 32 down to FLT_MANT_DIG which is usually 24.
float ui32to0to1(uint32_t u) {
float band = 1.0f/(1llu<<32);
for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
This answers maps [0 to 232-1] to [0.0 to 1.0)
To map to [0 to 232-1] to (-1.0 to 1.0). It can form -0.0.
if (u >= 0x80000000) {
return ui32to0to1((u - 0x80000000)*2);
} else
return -ui32to0to1((0x7FFFFFFF - u)*2);
}

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Points is a class I have implemented) and computes the variance of them, in each dimension (with regard to the algorithm of Knuth).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would really happy to pay the trade-off between speed and approximate computation of variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice, that before edit, I had used -o3, not with a capital 'o'. Thanks to ypnos, I compiled now with the optimization flag -O3. I am sure that there was a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
copute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: point[i] is just the i-th element of std::vector, where the second [], is implemented as this, in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in the compute_variances() is for sure something heavy! However, Knuth's algorithm was a numerical stable one and I was not able to find another algorithm, that would de both numerical stable and without division.
Note that I am not interesting in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* #param i - index of the coordinate
* #return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* #param i - index of the coordinate
* #return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};
1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped-up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16bit short instead of 32bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the solution to your performance problem, but just one of them that can also be combined with others.
2. Micro-parallelizm
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The option to the latter depends on your CPU, but nowadays typically -march core2 is fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).
The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.
Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen impact of cache misses, run the inner loop alternatively forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32bits is 0x00000000. If the dimension is big, it worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense, to try to reduce the number of divisions: use partial non-cumulative average calculation, that would result in Dim division operation for N elements (instead of N x Dim); N < points.size()
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).
Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors like datums that are very close to each other. Less chance for data cache miss and better chance to load all of the data into the cache.
You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassemblying the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is convert to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to used Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.
Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

How can I check whether a double has a fractional part?

Basically I have two variables:
double halfWidth = Width / 2;
double halfHeight = Height / 2;
As they are being divided by 2, they will either be a whole number or a decimal. How can I check whether they are a whole number or a .5?
You can use modf, this should be sufficient:
double intpart;
if( modf( halfWidth, &intpart) == 0 )
{
// your code here
}
First, you need to make sure that you're using double-precision floating-point math:
double halfWidth = Width / 2.0;
double halfHeight = Height / 2.0;
Because one of the operands is a double (namely, 2.0), this will force the compiler to convert Width and Height to doubles before doing the math (assuming they're not already doubles). Once converted, the division will be done in double-precision floating-point. So it will have a decimal, where appropriate.
The next step is to simply check it with modf.
double temp;
if(modf(halfWidth, &temp) != 0)
{
//Has fractional part.
}
else
{
//No fractional part.
}
You may discard a fractional part and compare the result with the original value using floor():
if (floor(halfWidth) == halfWidth) {
// halfWidth is a whole number
} else {
// halfWidth has a non-zero fractional part
}
As rightly pointed out by #Dávid Laczkó, it's a better solution than modf() because there's no need for an additional variable.
And according to my benchmarks (Linux, gcc 8.3.0, optimizations -O0...-O3), the floor() call consumes less CPU time than modf() on the modern notebook and server processors. The difference even growing with compiler optimizations enabled. Probably it's because the modf() has two arguments when the floor() has only one argument.

Unsigned long long arithmetic into double

I'm making a function that takes in 3 unsigned long longs, and applies the law of cosines to find out if the triangle is obtuse, acute or a right triangle. Should I just cast the variables to doubles before I use them?
void triar( unsigned long long& r,
unsigned long long x,
unsigned long long y,
unsigned long long z )
{
if(x==0 || y==0 || z==0) die("invalid triangle sides");
double t=(x*x + y*y -z*z)/(2*x*y);
t=acos (t) * (180.0 / 3.14159265);
if(t > 90) {
cout<<"Obtuse Triangle"<<endl;
r=t;
} else if(t < 90){
cout<<"Acute Triangle"<<endl;
r=t;
} else if(t == 90){
cout<<"Right Traingle"<<endl;
r=t;
}
}
There is generally no reason why you could not cast if you need floating point arithmetics. However, there is also an implicit conversion from unsigned long to double, so you can also often do completely without casting.
In many cases, including yours, you can cast only one of the arguments to force double arithmetics on a particular operation only. For example,
double t = (double)(x*x + y*y - z*z) / (2*x*y)
This way, all operations except for the division are computed in integer arithmetics and are therefore slighly faster. The cast is still necessary to avoid truncation during division.
Your code contains a comparison of floating point arguments. Floating point arithmetics however almost inevitably reduces accuracy. Avoid limited accuracy, or analyze and control accuracy.
Prefer an integer only solution as described in an excellent sister answer if you have a wide enough integral type at your disposal
Always avoid conversion from radians to degrees except for presentation to humans
Take the value of π from your mathematical library header files (unfortunately, this is platform dependent - try _USE_MATH_DEFINES + M_PI or, if already using boost libraries, boost::math::constants::pi<double>()), or express it analytically. For example, std::atan(1)*2 is the right angle.
If you choose double precision, and the ultimate difference value is less than, say, std::numeric_limits<double>::min() * 8, you can probably not tell anything about the triangle and the classification you return is basically bogus. (I made up the value of 8, you will possibly lose way more bits than three.)
You have a problem with obtuse triangles, x*x + y*y - z*z would mathematically give a negative result, that is then reduced modulo 2^WIDTH (where WIDTH is the number of value bits in unsigned long long, at least 64 and probably exactly that) yielding a - probably large - positive value (or in rare cases 0). Then the computed result of t = (x*x + y*y - z*z)/(2*x*y) can be larger than 1, and acos(t) would return a NaN.
The correct way to find out whether the triangle is obtuse/acute/right-angled with the given argument type is to check whether x*x + y*y < /* > / == */ z*z - if you can be sure the mathematical results don't exceed the unsigned long long range.
If you can't be sure of that, you can either convert the variables to double before the computation,
double xd = x, yd = y, zd = z;
double t = (xd*xd + yd*yd - zd*zd)/(2*xd*yd);
with possible loss of precision and incorrect results for nearly right-angled triangles (e.g. for the slightly obtuse triangle x = 2^29, y = 2^56-1, z = 2^56+2, both y and z would be converted to 2^56 with standard 64-bit doubles, xd*xd + yd*yd = 2^58 + 2^112 would be evaluated to 2^112, subtracting zd*zd then results in 0).
Or you can compare x*x + y*y to z*z - or x*x to z*z - y*y - using only integer arithmetic. If x*x is representable as an unsigned long long (I assume that 0 < x <= y <= z), it's relatively easy, first check whether (z - y)*(z + y) would exceed ULLONG_MAX, if yes, the triangle is obtuse, otherwise calculate and compare. If x*x is not representable, it becomes complicated, I think the easiest way (except for using a big integer library, of course) would be to compute the high and if necessary low 64 (or whatever width unsigned long long has) bits separately by splitting the numbers at half the width and compare those.
Further note: Your value for π, 3.14159265 is too inaccurate, right-angled triangles will be reported as obtuse.