Tiny numbers in place of zero? - c++

I have been making a matrix class (as a learning exercise) and I have come across an issue while testing my inverse function.
I input an arbitrary matrix such as:
2 1 1
1 2 1
1 1 2
And got it to calculate the inverse and I got the correct result:
0.75 -0.25 -0.25
-0.25 0.75 -0.25
-0.25 -0.25 0.75
But when I tried multiplying the two together to make sure I got the identity matrix I get:
1 5.5111512e-017 0
0 1 0
-1.11022302e-016 0 1
Why am I getting these results? I would understand it if I were multiplying awkward numbers where some rounding error is expected, but the sum it's doing is:
2 * -0.25 + 1 * 0.75 + 1 * -0.25
which is clearly 0, not 5.5111512e-017.
If I manually get it to do the calculation, e.g.:
std::cout << (2 * -0.25 + 1 * 0.75 + 1 * -0.25) << "\n";
I get 0 as expected?
All the numbers are represented as doubles.
Here's my multiplication overload:
Matrix operator*(const Matrix& A, const Matrix& B)
{
    if(A.get_cols() == B.get_rows())
    {
        Matrix temp(A.get_rows(), B.get_cols());
        for(unsigned m = 0; m < temp.get_rows(); ++m)
        {
            for(unsigned n = 0; n < temp.get_cols(); ++n)
            {
                // Inner dimension: A has as many columns as B has rows.
                for(unsigned i = 0; i < A.get_cols(); ++i)
                {
                    temp(m, n) += A(m, i) * B(i, n);
                }
            }
        }
        return temp;
    }
    throw std::runtime_error("Bad Matrix Multiplication");
}
and the access functions:
double& Matrix::operator()(unsigned r, unsigned c)
{
    return data[cols * r + c];
}

double Matrix::operator()(unsigned r, unsigned c) const
{
    return data[cols * r + c];
}
Here's the function to find the inverse:
Matrix Inverse(Matrix& M)
{
    if(M.rows != M.cols)
    {
        throw std::runtime_error("Matrix is not square");
    }
    int r = 0;
    int c = 0;
    // Build the augmented matrix [M | I]
    Matrix augment(M.rows, M.cols*2);
    augment.copy(M);
    for(r = 0; r < M.rows; ++r)
    {
        for(c = M.cols; c < M.cols * 2; ++c)
        {
            augment(r, c) = (r == (c - M.cols) ? 1.0 : 0.0);
        }
    }
    // Gauss-Jordan elimination (no pivoting)
    for(int R = 0; R < augment.rows; ++R)
    {
        double n = augment(R, R);
        for(c = 0; c < augment.cols; ++c)
        {
            augment(R, c) /= n;
        }
        for(r = 0; r < augment.rows; ++r)
        {
            if(r == R) { continue; }
            double a = augment(r, R);
            for(c = 0; c < augment.cols; ++c)
            {
                augment(r, c) -= a * augment(R, c);
            }
        }
    }
    // The right half of the augmented matrix is now the inverse
    Matrix inverse(M.rows, M.cols);
    for(r = 0; r < M.rows; ++r)
    {
        for(c = M.cols; c < M.cols * 2; ++c)
        {
            inverse(r, c - M.cols) = augment(r, c);
        }
    }
    return inverse;
}

Please read this paper: What Every Computer Scientist Should Know About Floating-Point Arithmetic

You've got numbers like 0.250000000000000005 in your inverted matrix; they're just rounded for display, so you see nice little round numbers like 0.25.

You shouldn't have any problems with these numbers, since with this particular matrix the inverse is all powers of 2 and may be represented accurately. In general, operations on floating point numbers introduce small errors that may accumulate, and the results may be surprising.
In your case, I'm pretty sure the inverse is inaccurate and you're just displaying the first few digits. That is, the entries aren't exactly 0.25 (= 1/4), 0.75 (= 3/4), etc.
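A quick way to confirm this is to print with more than the default six significant digits and to compare against the identity with a tolerance rather than with ==. A small standalone sketch (the 0.75000000000000011 value is made up for illustration; print your own matrix entries the same way):

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    // With the exact literals 0.25 and 0.75 the sum really is 0...
    std::cout << std::setprecision(17)
              << (2 * -0.25 + 1 * 0.75 + 1 * -0.25) << "\n";   // prints 0

    // ...but an entry that came out of the elimination may only be
    // *close* to 0.75; printing 17 significant digits exposes that.
    double entry = 0.75000000000000011;       // hypothetical stored value
    std::cout << std::setprecision(17) << entry << "\n";

    // When checking A * inverse(A) against the identity, compare with a
    // tolerance instead of expecting exact 0s and 1s.
    double off_diagonal = 5.5111512e-17;      // value from the question
    std::cout << std::boolalpha
              << (std::fabs(off_diagonal) < 1e-12) << "\n";    // true
}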

You're always going to run into floating-point rounding errors like this, especially when working with numbers that do not have exact binary representations (i.e., your numbers are not equal to 2^N or 1/(2^N), where N is some integer value).
That being said, there are a number of ways to increase the precision of your results, and you may want to do a Google search for numerically stable Gaussian elimination algorithms using fixed-precision floating point values.
You can also, if you are willing to take a speed hit, incorporate an infinite-precision math library that uses rational numbers; if you take that route, just avoid the use of roots, which can create irrational numbers. There are a number of libraries out there that can help you work with rational numbers, such as GMP. You can also write a rational class yourself, although beware: it's relatively easy to overflow the results of multiple math operations if you are only using unsigned 64-bit values along with an extra sign-flag variable for the components of your rational numbers. That's where GMP, with its unlimited-length integers, comes in handy.
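For illustration only, a bare-bones rational type along those lines might look like the sketch below. It normalizes with std::gcd (C++17) and makes no attempt to detect overflow, which is exactly the weakness that GMP's big integers address:

#include <cstdint>
#include <iostream>
#include <numeric>   // std::gcd, C++17

struct Rational
{
    std::int64_t num;   // sign lives in the numerator
    std::int64_t den;   // always kept positive

    Rational(std::int64_t n, std::int64_t d = 1) : num(n), den(d)
    {
        if (den < 0) { num = -num; den = -den; }
        const std::int64_t g = std::gcd(num < 0 ? -num : num, den);
        if (g != 0) { num /= g; den /= g; }
    }
};

// Products and sums can overflow int64_t; a real implementation needs
// big integers (e.g. GMP) or explicit overflow checks.
Rational operator*(Rational a, Rational b) { return {a.num * b.num, a.den * b.den}; }
Rational operator+(Rational a, Rational b) { return {a.num * b.den + b.num * a.den, a.den * b.den}; }

int main()
{
    // The sum from the question, done exactly: 2*(-1/4) + 3/4 + (-1/4)
    Rational r = Rational(2) * Rational(-1, 4) + Rational(3, 4) + Rational(-1, 4);
    std::cout << r.num << "/" << r.den << "\n";   // prints 0/1 -- exactly zero
}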

It's just simple floating-point error. Even doubles on computers aren't 100% accurate. There is simply no way to represent most base-10 decimal fractions exactly in binary with a finite number of bits.
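The classic standalone demonstration of this, independent of any matrix code, is that 0.1 and 0.2 have no exact binary representation, so their sum is not exactly 0.3:

#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::setprecision(17) << 0.1 + 0.2 << "\n";    // 0.30000000000000004
    std::cout << std::boolalpha << (0.1 + 0.2 == 0.3) << "\n";  // false
}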

Related

Composite Simpson's Rule in C++

I've been trying to write a function to approximate the value of an integral using the composite Simpson's rule.
template <typename func_type>
double simp_rule(double a, double b, int n, func_type f){
    int i = 1; double area = 0;
    double n2 = n;
    double h = (b-a)/(n2-1), x=a;
    while(i <= n){
        area = area + f(x)*pow(2,i%2 + 1)*h/3;
        x+=h;
        i++;
    }
    area -= (f(a) * h/3);
    area -= (f(b) * h/3);
    return area;
}
What I do is multiply each value of the function by either 2 or 4 (and by h/3) via pow(2, i%2 + 1), and subtract off the edges, as these should only have a weight of 1.
At first I thought it worked just fine; however, when I compared it to my trapezoidal method function it was way more inaccurate, which shouldn't be the case.
This is a simpler version of code I previously wrote which had the same problem. I thought that if I cleaned it up a little the problem would go away, but alas. From another post, I get the idea that there's something going on with the types and the operations I'm doing on them which results in loss of precision, but I just don't see it.
Edit:
For completeness, I was running it for e^x from 0 to 1.
// function to be approximated
double f(double x){ double a = exp(x); return a; }

int main() {
    int n = 11; //this method works best for odd values of n
    double e = exp(1);
    double exact = e-1; //value of integral of e^x from 0 to 1
    cout << simp_rule(0,1,n,f) - exact;
}
Simpson's Rule uses this approximation to estimate a definite integral:
    integral from a to b of f(x) dx ≈ (h/3) * [f(x_0) + 4 f(x_1) + 2 f(x_2) + 4 f(x_3) + ... + 2 f(x_{n-2}) + 4 f(x_{n-1}) + f(x_n)]
where
    h = (b - a) / n    and    x_i = a + i*h,
so that there are n + 1 equally spaced sample points x_i.
In the posted code, the parameter n passed to the function appears to be the number of points where the function is sampled (while in the previous formula n is the number of intervals; that's not a problem in itself).
The (constant) distance between the points is calculated correctly
double h = (b - a) / (n - 1);
The while loop used to sum the weighted contributions of all the points iterates from x = a up to a point with an abscissa close to b, but probably not exactly b, due to rounding errors. This implies that the last calculated value of f, f(x_n), may be slightly different from the expected f(b).
This is nothing, though, compared to the error caused by the fact that those end points are summed inside the loop with the starting weight of 4 and then subtracted after the loop with weight 1, while all the inner points have their weights switched. As a matter of fact, this is what the code calculates (with n sample points x_0, ..., x_{n-1}):
    (h/3) * [3 f(x_0) + 2 f(x_1) + 4 f(x_2) + 2 f(x_3) + ... + 4 f(x_{n-3}) + 2 f(x_{n-2}) + 3 f(x_{n-1})]
Also, using
    pow(2, i%2 + 1)
to generate the sequence 4, 2, 4, 2, ..., 4 is a waste in terms of efficiency, and may add (depending on the implementation) other unnecessary rounding errors.
The following algorithm shows how to obtain the same (fixed) result, without a call to that library function.
template <typename func_type>
double simpson_rule(double a, double b,
                    int n, // Number of intervals
                    func_type f)
{
    double h = (b - a) / n;

    // Internal sample points, there should be n - 1 of them
    double sum_odds = 0.0;
    for (int i = 1; i < n; i += 2)
    {
        sum_odds += f(a + i * h);
    }
    double sum_evens = 0.0;
    for (int i = 2; i < n; i += 2)
    {
        sum_evens += f(a + i * h);
    }
    return (f(a) + f(b) + 2 * sum_evens + 4 * sum_odds) * h / 3;
}
Note that this function requires the number of intervals to be passed (e.g. use 10 instead of 11 to obtain the same results as OP's function), not the number of points.
Testable here.
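As a concrete check, a small driver reproducing the OP's test case (integral of e^x over [0, 1]) might look like this; the simpson_rule body is repeated, slightly compressed, so the snippet stands alone, and the exact printed digits will vary a little by platform:

#include <cmath>
#include <iostream>

template <typename func_type>
double simpson_rule(double a, double b, int n, func_type f)  // n = number of intervals
{
    double h = (b - a) / n;
    double sum_odds = 0.0, sum_evens = 0.0;
    for (int i = 1; i < n; i += 2) sum_odds  += f(a + i * h);
    for (int i = 2; i < n; i += 2) sum_evens += f(a + i * h);
    return (f(a) + f(b) + 2 * sum_evens + 4 * sum_odds) * h / 3;
}

int main()
{
    double exact  = std::exp(1.0) - 1.0;   // integral of e^x over [0, 1]
    double approx = simpson_rule(0.0, 1.0, 10,
                                 [](double x) { return std::exp(x); });
    std::cout << "error = " << approx - exact << "\n";  // on the order of 1e-6 for n = 10
}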
The above excellent and accepted solution could benefit from liberal use of std::fma() and from being templatized on the floating-point type.
https://en.cppreference.com/w/cpp/numeric/math/fma
#include <cmath>

template <typename fptype, typename func_type>
fptype simpson_rule(fptype a, fptype b,
                    int n, // Number of intervals
                    func_type f)
{
    fptype h = (b - a) / n;

    // Internal sample points, there should be n - 1 of them
    fptype sum_odds = 0.0;
    for (int i = 1; i < n; i += 2)
    {
        sum_odds += f(std::fma(i, h, a));
    }
    fptype sum_evens = 0.0;
    for (int i = 2; i < n; i += 2)
    {
        sum_evens += f(std::fma(i, h, a));
    }
    return (std::fma(2, sum_evens, f(a)) +
            std::fma(4, sum_odds, f(b))) * h / 3;
}

How to improve the precision of computing with float numbers?

I wrote a code snippet in Microsoft Visual Studio Community 2019 in C++ like this:
int m = 11;
int p = 3;
float step = 1.0 / (m - 2 * p);
The variable step is 0.200003, but 0.2 is what I wanted. Is there any suggestion to improve the precision?
This problem comes from a UNIFORM KNOT VECTOR. A knot vector is a concept in NURBS. You can think of it as just an array of numbers like this: U[] = {0, 0.2, 0.4, 0.6, 0.8, 1.0}. The span between two adjacent numbers is a constant. The size of the knot vector can change according to some condition, but the range is in [0, 1].
The whole function is:
typedef float NURBS_FLOAT;

void CreateKnotVector(int m, int p, bool clamped, NURBS_FLOAT* U)
{
    if (clamped)
    {
        for (int i = 0; i <= p; i++)
        {
            U[i] = 0;
        }
        NURBS_FLOAT step = 1.0 / (m - 2 * p);
        for (int i = p+1; i < m-p; i++)
        {
            U[i] = U[i - 1] + step;
        }
        for (int i = m-p; i <= m; i++)
        {
            U[i] = 1;
        }
    }
    else
    {
        U[0] = 0;
        NURBS_FLOAT step = 1.0 / m;
        for (int i = 1; i <= m; i++)
        {
            U[i] = U[i - 1] + step;
        }
    }
}
Let's follow what's going on in your code:
The expression 1.0 / (m - 2 * p) yields 0.2, to which the closest representable double value is 0.200000000000000011102230246251565404236316680908203125. Notice how precise it is – to 16 significant decimal digits. It's because, due to 1.0 being a double literal, the denominator is promoted to double, and the whole calculation is done in double precision, thus yielding a double value.
The value obtained in the previous step is written to step, which has type float. So the value has to be rounded to the closest representable value, which happens to be 0.20000000298023223876953125.
So your cited result of 0.200003 is not what you should get. Instead, it should be closer to 0.200000003.
Is there any suggestion to improve the precision?
Yes. Store the value in a higher-precision variable. E.g., instead of float step, use double step. In this case the value you've calculated won't be rounded once more, so precision will be higher.
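A minimal illustration of the difference between the two declarations (the printed digits may vary slightly by implementation):

#include <iomanip>
#include <iostream>

int main()
{
    int m = 11, p = 3;
    float  fstep = 1.0 / (m - 2 * p);   // double result rounded again, to float
    double dstep = 1.0 / (m - 2 * p);   // keeps the double result

    std::cout << std::setprecision(12) << fstep << "\n";  // ~0.200000002980
    std::cout << std::setprecision(12) << dstep << "\n";  // 0.2 (only off in the 17th digit)
}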
Can you get the exact 0.2 value to work with in the subsequent calculations? With binary floating-point arithmetic, unfortunately, no. In binary, the number 0.2 is a periodic fraction:
0.2 (base 10) = 0.001100110011... (base 2), with the block 0011 repeating forever.
See the question "Is floating point math broken?" and its answers for more details.
If you really need decimal calculations, you should use a library solution, e.g. Boost's cpp_dec_float. Or, if you need arbitrary-precision calculations, you can use e.g. cpp_bin_float from the same library. Note that both variants will be orders of magnitude slower than using built-in C++ binary floating-point types.
When dealing with floating-point math, a certain amount of rounding error is to be expected.
For starters, values like 0.2 aren't exactly represented by a float, or even a double:
std::cout << std::setprecision(60) << 0.2 << '\n';
// ^^^ It outputs something like: 0.200000000000000011102230246251565404236316680908203125
Besides, the errors may accumulate when a sequence of operations is performed on imprecise values. Some operations, like summation and subtraction, are more sensitive to this kind of error than others, so it'd be better to avoid them if possible.
That seems to be the case here, where we can rewrite OP's function into something like the following:
#include <iostream>
#include <iomanip>
#include <vector>
#include <algorithm>
#include <cassert>
#include <type_traits>

template <typename T = double>
auto make_knots(int m, int p = 0)  // <- Note that I've changed the signature.
{
    static_assert(std::is_floating_point_v<T>);
    std::vector<T> knots(m + 1);
    int range = m - 2 * p;
    assert(range > 0);
    for (int i = 1; i < m - p; i++)
    {
        knots[i + p] = T(i) / range;  // <- Less prone to accumulate rounding errors
    }
    std::fill(knots.begin() + m - p, knots.end(), 1.0);
    return knots;
}

template <typename T>
void verify(std::vector<T> const& v)
{
    bool sum_is_one = true;
    for (int i = 0, j = v.size() - 1; i <= j; ++i, --j)
    {
        if (v[i] + v[j] != 1.0)  // <- That's a bold request for a floating point type
        {
            sum_is_one = false;
            break;
        }
    }
    std::cout << (sum_is_one ? "\n" : "Rounding errors.\n");
}

int main()
{
    // For presentation purposes only
    std::cout << std::setprecision(60) << 0.2 << '\n';
    std::cout << std::setprecision(60) << 0.4 << '\n';
    std::cout << std::setprecision(60) << 0.6 << '\n';
    std::cout << std::setprecision(60) << 0.8 << "\n\n";

    auto k1 = make_knots(11, 3);
    for (auto i : k1)
    {
        std::cout << std::setprecision(60) << i << '\n';
    }
    verify(k1);

    auto k2 = make_knots<float>(10);
    for (auto i : k2)
    {
        std::cout << std::setprecision(60) << i << '\n';
    }
    verify(k2);
}
Testable here.
One solution to avoid drift (which I guess is your worry?) is to manually use rational numbers. For example, in this case you might have:
// your input values for determining step
int m = 11;
int p = 3;
// pre-calculate any intermediate values, which won't have rounding issues
int divider = (m - 2 * p); // could be float or double instead of int
// input
int stepnumber = 1234; // could also be float or double instead of int
// output
float stepped_value = stepnumber * 1.0f / divider;
In other words, formulate your problem so that the step of your original code is always 1 (or whatever rational number you can represent exactly using 2 integers) internally, so there is no rounding issue. If you need to display the value for the user, then you can do it just for display: compute 1.0 / divider and round to a suitable number of digits.
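Applied to the OP's clamped branch, that amounts to computing each knot directly from integers instead of accumulating step. A sketch, keeping the OP's NURBS_FLOAT typedef and naming (the non-clamped branch would change the same way):

typedef float NURBS_FLOAT;

void CreateKnotVectorClamped(int m, int p, NURBS_FLOAT* U)
{
    const int range = m - 2 * p;               // number of interior spans
    for (int i = 0; i <= p; i++)
        U[i] = 0;
    for (int i = p + 1; i < m - p; i++)
        U[i] = NURBS_FLOAT(i - p) / range;     // one rounding per knot, no accumulated drift
    for (int i = m - p; i <= m; i++)
        U[i] = 1;
}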

Dividing cv::Mat by a number using integer division

In OpenCV, if a cv::Mat (CV_8U) is divided by a number (int), the result is rounded to the nearest integer. For example:
cv::Mat temp(1, 1, CV_8UC1, cv::Scalar(5));
temp /= 3;
std::cout <<"OpenCV Integer Division:" << temp;
std::cout << "\nNormal Integer Division:" << 5 / 3;
The result is:
OpenCV Integer Division: 2
Normal Integer Division: 1
It is obvious that OpenCV does not use integer division even if the type of the cv::Mat is CV_8U.
My questions are:
Why? Aren't integers supposed to be divided as integers? Why this strange behaviour from OpenCV?
Can I obtain integer division without iterating pixel by pixel and dividing each one?
My current solution is:
for (size_t r = 0; r < temp.rows; r++){
    auto row_ptr = temp.ptr<uchar>(r);
    for (size_t c = 0; c < temp.cols; c++){
        row_ptr[c] /= 3;
    }
}
Firstly: the overloaded division operator does the operation by converting the elements of the matrix to double; internally it uses the multiplication operator, as Mat / a = Mat * (1/a).
Secondly: a very easy way exists to do this with one small for loop:
for(int i = 0; i < temp.total(); i++)
    ((unsigned char*)temp.data)[i] /= 3;
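If the OpenCV build is 3.x or newer, cv::Mat::forEach can express the same per-pixel integer division without spelling out the index arithmetic by hand; a sketch, assuming a single-channel CV_8U matrix like the one in the question:

#include <opencv2/core.hpp>

void divide_in_place(cv::Mat& m, int divisor)
{
    // Runs the lambda over every pixel; for CV_8UC1 the element type is uchar.
    m.forEach<uchar>([divisor](uchar& pixel, const int* /*position*/) {
        pixel /= divisor;   // plain integer division, truncating toward zero
    });
}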
The solution I used (based on @Afshine's answer and @Miki's comment) is:
if (frame.isContinuous()){
    for(int i = 0; i < frame.total(); i++){
        frame.data[i] /= 3;
    }
}
else{
    for (size_t r = 0; r < frame.rows; r++){
        auto row_ptr = frame.ptr<uchar>(r);
        for (size_t c = 0; c < 3 * frame.cols; c++){
            row_ptr[c] /= 3;
        }
    }
}

What is the fastest way to calculate determinant?

I'm writing a library where I want to have some basic NxN matrix functionality that doesn't have any dependencies, and it is a bit of a learning project. I'm comparing my performance to Eigen. I've been able to be pretty equal, and even beat its performance on a couple of fronts with SSE2, and with AVX2 I beat it on quite a few fronts (it only uses SSE2, so not super surprising).
My issue is that I'm using Gaussian elimination to create an upper triangular matrix and then multiplying the diagonal to get the determinant. I beat Eigen for N < 300, but after that Eigen blows me away and it just gets worse as the matrices get bigger. Given that all the memory is accessed sequentially and the compiler disassembly doesn't look terrible, I don't think it is an optimization issue.
There is more optimization that can be done, but the timings look much more like an algorithmic time-complexity issue, or there is a major SSE advantage I'm not seeing. Simply unrolling the loops a bit hasn't done much for me.
Is there a better algorithm for calculating determinants?
Scalar code
/*
    Warning: Creates Temporaries!
*/
template<typename T, int ROW, int COLUMN> MML_INLINE T matrix<T, ROW, COLUMN>::determinant(void) const
{
    /*
        This method assumes square matrix
    */
    assert(row() == col());
    /*
        We need to create a temporary
    */
    matrix<T, ROW, COLUMN> temp(*this);
    /* We convert the temporary to upper triangular form */
    uint N = row();
    T det = T(1);
    for (uint c = 0; c < N; ++c)
    {
        det = det * temp(c, c);
        for (uint r = c + 1; r < N; ++r)
        {
            T ratio = temp(r, c) / temp(c, c);
            for (uint k = c; k < N; k++)
            {
                temp(r, k) = temp(r, k) - ratio * temp(c, k);
            }
        }
    }
    return det;
}
AVX2
template<> float matrix<float>::determinant(void) const
{
    /*
        This method assumes square matrix
    */
    assert(row() == col());
    /*
        We need to create a temporary
    */
    matrix<float> temp(*this);
    /* We convert the temporary to upper triangular form */
    float det = 1.0f;
    const uint N = row();
    const uint Nm8 = N - 8;
    const uint Nm4 = N - 4;
    uint c = 0;
    for (; c < Nm8; ++c)
    {
        det *= temp(c, c);
        float8 Diagonal = _mm256_set1_ps(temp(c, c));
        for (uint r = c + 1; r < N; ++r)
        {
            float8 ratio1 = _mm256_div_ps(_mm256_set1_ps(temp(r, c)), Diagonal);
            uint k = c + 1;
            for (; k < Nm8; k += 8)
            {
                float8 ref = _mm256_loadu_ps(temp._v + c*N + k);
                float8 r0  = _mm256_loadu_ps(temp._v + r*N + k);
                _mm256_storeu_ps(temp._v + r*N + k, _mm256_fmsub_ps(ratio1, ref, r0));
            }
            /* We go scalar for the last few elements to handle non-multiples of 8 */
            for (; k < N; ++k)
            {
                _mm_store_ss(temp._v + index(r, k), _mm_sub_ss(_mm_set_ss(temp(r, k)), _mm_mul_ss(_mm256_castps256_ps128(ratio1), _mm_set_ss(temp(c, k)))));
            }
        }
    }
    for (; c < Nm4; ++c)
    {
        det *= temp(c, c);
        float4 Diagonal = _mm_set1_ps(temp(c, c));
        for (uint r = c + 1; r < N; ++r)
        {
            float4 ratio = _mm_div_ps(_mm_set1_ps(temp[r*N + c]), Diagonal);
            uint k = c + 1;
            for (; k < Nm4; k += 4)
            {
                float4 ref = _mm_loadu_ps(temp._v + c*N + k);
                float4 r0  = _mm_loadu_ps(temp._v + r*N + k);
                _mm_storeu_ps(temp._v + r*N + k, _mm_sub_ps(r0, _mm_mul_ps(ref, ratio)));
            }
            float fratio = _mm_cvtss_f32(ratio);
            for (; k < N; ++k)
            {
                temp(r, k) = temp(r, k) - fratio * temp(c, k);
            }
        }
    }
    for (; c < N; ++c)
    {
        det *= temp(c, c);
        float Diagonal = temp(c, c);
        for (uint r = c + 1; r < N; ++r)
        {
            float ratio = temp[r*N + c] / Diagonal;
            for (uint k = c + 1; k < N; ++k)
            {
                temp(r, k) = temp(r, k) - ratio * temp(c, k);
            }
        }
    }
    return det;
}
Algorithms that reduce an n by n matrix to upper (or lower) triangular form by Gaussian elimination generally have complexity O(n^3) (where ^ represents "to the power of").
There are alternative approaches to computing the determinant, such as evaluating the set of eigenvalues (the determinant of a square matrix is equal to the product of its eigenvalues). For general matrices, computation of the complete set of eigenvalues is also, practically, O(n^3).
In theory, however, calculation of the set of eigenvalues has complexity n^w, where w is between 2 and 2.376, which means that for (much) larger matrices it will be faster than using Gaussian elimination. Have a look at the article "Fast linear algebra is stable" by James Demmel, Ioana Dumitriu, and Olga Holtz in Numerische Mathematik, Volume 108, Issue 1, pp. 59-91, November 2007. If Eigen uses an approach with complexity less than O(n^3) for larger matrices (I don't know, never having had reason to investigate such things), that would explain your observations.
The answer in most places seems to be to use block LU factorization to create a lower triangular and an upper triangular matrix in the same memory space. It is ~O(n^2.5), depending on the block size you use.
Here is a PowerPoint from Rice University that explains the algorithm:
www.caam.rice.edu/~timwar/MA471F03/Lecture24.ppt
Division by a matrix means multiplication by its inverse.
The idea seems to be to increase the number of n^2 operations significantly while reducing the number of m^3 operations, which in effect lowers the complexity of the algorithm, since m is of a fixed small size.
It's going to take me a little while to write this up in an efficient manner, since doing it efficiently requires 'in place' algorithms I don't have written yet.
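Independent of blocking, the diagonal-product idea is usually written as Gaussian elimination (LU) with partial pivoting, where every row swap flips the determinant's sign; here is a plain, non-vectorized sketch on raw row-major storage (not the OP's matrix class), for reference only:

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Determinant via in-place Gaussian elimination with partial pivoting.
// a is a row-major n x n matrix; it is modified in place.
double determinant(std::vector<double>& a, std::size_t n)
{
    double det = 1.0;
    for (std::size_t c = 0; c < n; ++c)
    {
        // Pick the row with the largest entry in column c (numerical stability).
        std::size_t pivot = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(a[r * n + c]) > std::fabs(a[pivot * n + c]))
                pivot = r;
        if (a[pivot * n + c] == 0.0)
            return 0.0;                       // singular matrix
        if (pivot != c)
        {
            for (std::size_t k = 0; k < n; ++k)
                std::swap(a[c * n + k], a[pivot * n + k]);
            det = -det;                       // a row swap flips the sign
        }
        det *= a[c * n + c];
        for (std::size_t r = c + 1; r < n; ++r)
        {
            const double ratio = a[r * n + c] / a[c * n + c];
            for (std::size_t k = c + 1; k < n; ++k)
                a[r * n + k] -= ratio * a[c * n + k];
        }
    }
    return det;
}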

Get the positions of the 'ones' digits in a base-2 representation of a C float

Say I have a floating point number. I would like to extract the positions of all the ones digits in the number's base 2 representation.
For example, 10.25 = 2^-2 + 2^1 + 2^3, so its base-2 ones positions are {-2, 1, 3}.
Once I have the list of base-2 powers of a number n, the following should always return true (in pseudocode).
sum = 0
for power in powers:
    sum += 2.0 ** power
return n == sum
However, it is somewhat difficult to perform bit logic on floats in C and C++, and even more difficult to do so portably.
How would one implement this in either of the languages with a small number of CPU instructions?
Give up on portability, assume IEEE float and 32-bit int.
// Doesn't check for NaN or denormalized.
// Left as an exercise for the reader.
void pbits(float x)
{
    union {
        float f;
        unsigned i;
    } u;
    int sign, mantissa, exponent, i;
    u.f = x;
    sign = u.i >> 31;
    exponent = ((u.i >> 23) & 255) - 127;
    mantissa = (u.i & ((1 << 23) - 1)) | (1 << 23);
    for (i = 0; i < 24; ++i) {
        if (mantissa & (1 << (23 - i)))
            printf("2^%d\n", exponent - i);
    }
}
This will print out the powers of two that sum to the given floating point number. For example,
$ ./a.out 156
2^7
2^4
2^3
2^2
$ ./a.out 0.3333333333333333333333333
2^-2
2^-4
2^-6
2^-8
2^-10
2^-12
2^-14
2^-16
2^-18
2^-20
2^-22
2^-24
2^-25
You can see how 1/3 is rounded up, which is not intuitive since we would always round it down in decimal, no matter how many decimal places we use.
Footnote: Don't do the following:
float x = ...;
unsigned i = *(unsigned *) &x; // no
The trick with the union is far less likely to generate warnings or confuse the compiler.
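Note that in C++ (as opposed to C), even reading the inactive union member is formally undefined behaviour, although compilers support it as an extension; std::memcpy is well-defined in both languages and typically compiles to the same instructions. A sketch:

#include <cstring>

// Same bit extraction via std::memcpy, well-defined in both C and C++
// (assumes sizeof(float) == sizeof(unsigned) == 4, as the answer already does).
unsigned float_bits(float x)
{
    unsigned i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}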
There is no need to work with the encoding of floating-point numbers. C provides routines for working with floating-point values in a portable way. The following works.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* This should be replaced with proper allocation for the floating-point
       type.
    */
    int powers[53];

    double x = atof(argv[1]);
    if (x <= 0)
    {
        fprintf(stderr, "Error, input must be positive.\n");
        return 1;
    }

    // Find value of highest bit.
    int e;
    double f = frexp(x, &e) - .5;
    powers[0] = --e;
    int p = 1;

    // Find remaining bits.
    for (; 0 != f; --e)
    {
        printf("e = %d, f = %g.\n", e, f);
        if (.5 <= f)
        {
            powers[p++] = e;
            f -= .5;
        }
        f *= 2;
    }

    // Display.
    printf("%.19g =", x);
    for (int i = 0; i < p; ++i)
        printf(" + 2**%d", powers[i]);
    printf(".\n");

    // Test.
    double y = 0;
    for (int i = 0; i < p; ++i)
        y += ldexp(1, powers[i]);
    if (x == y)
        printf("Reconstructed number equals original.\n");
    else
        printf("Reconstructed number is %.19g, but original is %.19g.\n", y, x);

    return 0;
}
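For the question's own example, 10.25, this program should report the expected set {-2, 1, 3}; ignoring the diagnostic "e = ..., f = ..." lines printed inside the loop, the output would be roughly:
$ ./a.out 10.25
10.25 = + 2**3 + 2**1 + 2**-2.
Reconstructed number equals original.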