Consider
x_min as -12.5,
x_max as 12.5,
bits as 8,
x is any value between -12.5 to +12.5 ,
Can someone explain the math of this snippet?
int float_to_uint(float x, float x_min, float x_max, unsigned int bits)
{
    float span = x_max - x_min;
    return (int) ((x - x_min) * ((float)((1 << bits) / span)));
}
If we ignore rounding, types and other little details, you could rearrange the separate parts a bit:
(x-x_min) / (x_max-x_min) * (1<<bits)
This is basically scaling x to values of 0..2^bits (=256) depending on where x is within x_min..x_max.
  x   | result
------+--------
-12.5 | 0
 ...  |
  0   | 128
 ...  |
 12.5 | 256
The goal of the function is to map values in the range x_min to x_max to values 0 to 2^bits.
(int) ((x- x_min) / span * (1<<bits));
But there is some trickery being used here to help the optimizer. The last two values are re-arranged and computed first. Mathematically it's the same, but with floats it will round differently. The difference is so minor that there is actually a compiler flag allowing the compiler to ignore it (fast-math).
(int) ((x- x_min) * ((1<<bits) / span));
The cast to float is pointless as arithmetic promotion already turns 1<<bits into a float and float / float remains float.
Now you might ask: What is the point of this transformation? The result is (about) the same.
Here is my thought on that: In the source, bits, x_min and x_max will be literals or constants, so span is known at compile time too. The transformation allows the compiler to inline the function and compute (1<<bits) / span at compile time. That leaves only one float subtraction and one multiplication at runtime. It will therefore generate code that runs noticeably faster on something like an Arduino that has no FPU.
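For illustration, here is a hypothetical sketch of what the folded version effectively looks like for the constants from the question (x_min = -12.5, x_max = 12.5, bits = 8); the function name and the precomputed constant are mine, not from the original code:
// Hypothetical folded version for x_min = -12.5, x_max = 12.5, bits = 8.
// (1 << 8) / (12.5f - (-12.5f)) == 256.0f / 25.0f == 10.24f, known at compile time.
static inline int float_to_uint_folded(float x)
{
    const float scale = 10.24f;              // precomputed (1 << bits) / span
    return (int)((x - (-12.5f)) * scale);    // one subtraction and one multiplication at runtime
}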
double MyClass::dx = ?????;
double MyClass::f(double x)
{
return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}
double MyClass::fp(double x) // derivative of f(x), that is f'(x)
{
return (f(x + dx) - f(x)) / dx;
}
When using the finite difference method for differentiation, it is critical to choose an optimal dx value. Mathematically, dx should be as small as possible. However, I'm not sure whether it is correct to choose the smallest positive double precision number (i.e., 2.2250738585072014 x 10^-308).
Is there an optimal numeric interval or exact value to choose a dx in to make the calculation error as small as possible?
(I'm using 64-bit compiler. I will run my program on a Intel i5 processor.)
Choosing the smallest possible value is almost certainly wrong: if dx were that smallest number, then f(x + dx) would be exactly equal to f(x) due to rounding.
So you have a tradeoff: Choose dx too small, and you lose precision to rounding errors. Choose it too large, and your result will be imprecise due to changes in the derivative as x changes.
To judge the numeric errors, consider (f(x + dx) - f(x)) / f(x) [1] mathematically. The numerator denotes the difference you want to compute, but the denominator denotes the magnitude of numbers you're dealing with. If that fraction is about 2^-k, then you can expect approximately k bits of precision in your result.
If you know your function, you can compute what error you'd get from choosing dx too large. You can then balance things so that the error incurred from this is about the same as the error incurred from rounding. But if you know the function, you might be better off providing a function that directly computes the derivative, as in your example with the polynomial f.
The Wikipedia section that pogorskiy pointed out suggests a value of sqrt(ε)x, or approximately 1.5e-8 * x. Without any more detailed knowledge about the function, such a rule of thumb will provide a reasonable default. Also note that that same section suggests not dividing by dx, but instead by (x + dx) - x, as this takes rounding errors incurred by computing x + dx into account. But I guess that whole article is full of suggestions you might use.
[1] This formula really should divide by f(x), not by dx, even though a past editor thought differently. I'm attempting to compare the amount of significant bits remaining after the division, not the slope of the tangent.
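As a rough sketch of that rule of thumb (the function name is made up; it assumes f is reasonably smooth near x):
#include <cmath>
#include <limits>

// Sketch: step of about sqrt(epsilon) * |x|, and divide by (x + dx) - x
// rather than by dx, as suggested above.
double numeric_derivative(double (*f)(double), double x)
{
    double dx = std::sqrt(std::numeric_limits<double>::epsilon()) * std::fabs(x);
    if (dx == 0.0)                                   // guard against x == 0
        dx = std::sqrt(std::numeric_limits<double>::epsilon());
    volatile double xph = x + dx;                    // keep the rounded sum from being optimized away
    return (f(xph) - f(x)) / (xph - x);              // divide by the step that was actually taken
}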
Why not just use the power rule to take the derivative? You'll get an exact answer:
f(x) = 3x^3 - 2x^2 + x - 5
f'(x) = 9x^2 - 4x + 1
Therefore:
f(x) = 3.0 * x * x * x - 2.0 * x * x + x - 5.0
fp(x) = 9.0 * x * x - 4.0 * x + 1.0
I need to be able to convert a C SInt32 integer to a float in the range [-1, 1] and back. I've seen discussions of this question regarding 24 bit integers:
C/C++ - Convert 24-bit signed integer to float
And I've tried something similar:
// Convert int - float
SInt32 integer = 1;
Float32 factor = 1;
Float32 f = integer / (0x7FFFFFF + 0.5);
// Perform some processing on the float
Process(f);
// Scale the float
f = f * factor;
// Convert float - int
integer = f * (0x7FFFFFF + 0.5);
However, this doesn't work. I know it doesn't work because the work I'm doing involves audio programming, and the conversion causes a hissing sound.
I'm pretty sure it is a conversion problem, because when I make the float smaller by setting the factor to 0.0001 the crackling disappears. Maybe the back conversion is putting the int outside its limits, causing it to be truncated.
Any advice would be greatly appreciated.
Read up on IEEE floating point formats. The IEEE 32-bit float only supports 24 significant bits, so if you convert a 32-bit integer you will lose the low 8 bits.
const float recip = 1.0 / (32768.0*65536.0);
// hope that compiler will calculate this in advance
// From the expression a semi-advanced programmer can also immediately spot
// where the value comes from
float value = int_value * recip;
int value2 = value * (32768.0*65536.0);
The process is not reversible: one can lose up to 7 bits of accuracy.
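A minimal sketch that makes the loss visible (the constant 0x12345678 is just an arbitrary value needing more than 24 significant bits):
#include <cstdint>
#include <cstdio>

int main()
{
    const double scale = 32768.0 * 65536.0;       // 2^31, as above
    int32_t original = 0x12345678;                // needs 29 significant bits
    float   f        = original / scale;          // rounded to a 24-bit mantissa here
    int32_t back     = (int32_t)(f * scale);
    std::printf("original: 0x%08X  round trip: 0x%08X\n", original, back);  // low bits differ
    return 0;
}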
I'm making a function that takes in 3 unsigned long longs, and applies the law of cosines to find out if the triangle is obtuse, acute or a right triangle. Should I just cast the variables to doubles before I use them?
void triar( unsigned long long& r,
unsigned long long x,
unsigned long long y,
unsigned long long z )
{
if(x==0 || y==0 || z==0) die("invalid triangle sides");
double t=(x*x + y*y -z*z)/(2*x*y);
t=acos (t) * (180.0 / 3.14159265);
if(t > 90) {
cout<<"Obtuse Triangle"<<endl;
r=t;
} else if(t < 90){
cout<<"Acute Triangle"<<endl;
r=t;
} else if(t == 90){
cout<<"Right Traingle"<<endl;
r=t;
}
}
There is generally no reason why you could not cast if you need floating point arithmetic. However, there is also an implicit conversion from unsigned long long to double, so you can often do completely without casting.
In many cases, including yours, you can cast only one of the arguments to force double arithmetics on a particular operation only. For example,
double t = (double)(x*x + y*y - z*z) / (2*x*y)
This way, all operations except for the division are computed in integer arithmetic and are therefore slightly faster. The cast is still necessary to avoid truncation during division.
Your code contains an exact comparison of floating point values. Floating point arithmetic, however, almost inevitably reduces accuracy. Either avoid relying on that limited accuracy, or analyze and control it.
Prefer an integer-only solution, as described in the excellent sister answer, if you have a wide enough integral type at your disposal.
Always avoid converting from radians to degrees except for presentation to humans.
Take the value of π from your mathematical library header files (unfortunately, this is platform dependent; try _USE_MATH_DEFINES + M_PI or, if already using the boost libraries, boost::math::constants::pi<double>()), or express it analytically. For example, std::atan(1)*2 is the right angle.
If you choose double precision, and the ultimate difference value is less than, say, std::numeric_limits<double>::min() * 8, you can probably not tell anything about the triangle and the classification you return is basically bogus. (I made up the value of 8; you may well lose far more than three bits.)
You have a problem with obtuse triangles: x*x + y*y - z*z would mathematically give a negative result, which is then reduced modulo 2^WIDTH (where WIDTH is the number of value bits in unsigned long long, at least 64 and probably exactly that), yielding a (probably large) positive value, or in rare cases 0. Then the computed result of t = (x*x + y*y - z*z)/(2*x*y) can be larger than 1, and acos(t) would return a NaN.
The correct way to find out whether the triangle is obtuse, acute or right-angled with the given argument type is to check whether x*x + y*y is less than, greater than, or equal to z*z, provided you can be sure the mathematical results don't exceed the unsigned long long range.
If you can't be sure of that, you can either convert the variables to double before the computation,
double xd = x, yd = y, zd = z;
double t = (xd*xd + yd*yd - zd*zd)/(2*xd*yd);
with possible loss of precision and incorrect results for nearly right-angled triangles (e.g. for the slightly obtuse triangle x = 2^29, y = 2^56-1, z = 2^56+2, both y and z would be converted to 2^56 with standard 64-bit doubles, xd*xd + yd*yd = 2^58 + 2^112 would be evaluated to 2^112, subtracting zd*zd then results in 0).
Or you can compare x*x + y*y to z*z - or x*x to z*z - y*y - using only integer arithmetic. If x*x is representable as an unsigned long long (I assume that 0 < x <= y <= z), it's relatively easy, first check whether (z - y)*(z + y) would exceed ULLONG_MAX, if yes, the triangle is obtuse, otherwise calculate and compare. If x*x is not representable, it becomes complicated, I think the easiest way (except for using a big integer library, of course) would be to compute the high and if necessary low 64 (or whatever width unsigned long long has) bits separately by splitting the numbers at half the width and compare those.
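A sketch of that integer-only approach for the easy case (the names are mine; it assumes 0 < x <= y <= z and that x*x and z + y fit in unsigned long long, as discussed above):
#include <climits>

enum triangle_kind { ACUTE, RIGHT, OBTUSE };

triangle_kind classify(unsigned long long x, unsigned long long y, unsigned long long z)
{
    unsigned long long diff = z - y;                 // z*z - y*y == (z - y) * (z + y)
    unsigned long long sum  = z + y;                 // assumed not to overflow
    if (diff > ULLONG_MAX / sum)                     // product would exceed ULLONG_MAX,
        return OBTUSE;                               // so it certainly exceeds x*x
    unsigned long long rhs = diff * sum;             // exact value of z*z - y*y
    unsigned long long lhs = x * x;                  // assumed representable
    if (lhs > rhs) return ACUTE;                     // x*x + y*y > z*z
    if (lhs < rhs) return OBTUSE;                    // x*x + y*y < z*z
    return RIGHT;
}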
Further note: Your value for π, 3.14159265, is too inaccurate; right-angled triangles will be reported as obtuse.
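A small, purely illustrative sketch of the π suggestions above:
#include <cmath>
#include <cstdio>

int main()
{
    const double pi          = 4.0 * std::atan(1.0);   // analytic, full double precision
    const double right_angle = 2.0 * std::atan(1.0);   // pi / 2, as noted above
    std::printf("%.17g (vs the truncated literal %.17g)\n", pi, 3.14159265);
    std::printf("right angle in radians: %.17g\n", right_angle);
    return 0;
}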
I do not understand the output of the following program:
#include <iostream>

int main()
{
    float x = 14.567729f;
    float sqr = x * x;
    float diff1 = sqr - x * x;
    double diff2 = double(sqr) - double(x) * double(x);
    std::cout << diff1 << std::endl;
    std::cout << diff2 << std::endl;
    return 0;
}
Output:
6.63225e-006
6.63225e-006
I use VS2010, x86 compiler.
I expect to get a different output
0
6.63225e-006
Why is diff1 not equal to 0?
To calculate sqr - x * x, the compiler increases float precision to double. Why?
The x87 floating point registers are 80 bits wide, and many x86 compilers use them for float math.
During evaluation of an expression the result is an 80-bit value. It only gets truncated to 32 bits (float) or 64 bits (double) when it gets assigned to a location in memory. If you hold everything in registers (try compiling with -O3) you may see a different result.
Compiled with -O3:
> ./a.out
0
6.63225e-06
float diff1 = sqr - x * x;
double diff2 = double(sqr) - double(x) * double(x);
Why is diff1 not equal to 0?
Because you have already cached sqr = x*x and forced its representation to be a float.
To calculate sqr - x * x, the compiler increases float precision to double. Why?
Because that is how C did things back before there was a C standard. I don't think modern compilers are bound to that convention, but many still do follow it. If this is the case, the right-hand sides of the calculations of diff1 and diff2 will be identical. The only difference is that after calculating the right-hand side of float diff1 = ..., the double result is converted back to a float.
Apparently the standard allows floats to be automatically promoted to double in expressions like that. See here
Do a find on that page for "automatically promoted" and check out the first paragraph with that phrase in it.
If we go by that paragraph, as I understand it, your sqr=x*x is initially being treated as if it were a double as well, but once it is stored it is being rounded to a float. Then, in your diff1=sqr-x*x, x*x is again being treated like a double, and so is sqr although it's already rounded. Therefore, it yields the same result as casting them all to doubles: sqr is a double then but already rounded to float precision, and again x*x is double precision.
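A minimal sketch of that reading (my own, just to illustrate): compute the product in double explicitly and round it to float once; the difference then matches what the question prints.
#include <iostream>

int main()
{
    float  x    = 14.567729f;
    double wide = double(x) * double(x);              // the unrounded product
    float  sqr  = float(wide);                        // what "float sqr = x * x;" ends up storing
    std::cout << (double(sqr) - wide) << std::endl;   // about 6.63225e-06, as in the question
    return 0;
}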
On x86/x64 architectures it is common for compilers to promote all 32-bit floats to 64-bit doubles for computations; check the output assembly to see if the two variants produce the same instructions. The only difference between the types is the storage.
I understand why floating point numbers can't be compared for exact equality, and I know about the mantissa and exponent binary representation, but I'm no expert, and today I came across something I don't get.
Namely, let's say you have something like:
float denominator, numerator, resultone, resulttwo;
resultone = numerator / denominator;
float buff = 1 / denominator;
resulttwo = numerator * buff;
To my knowledge, different floating point operations can yield different results, and this is not unusual. But in some edge cases these two results seem to be vastly different. To be more specific, in my GLSL code calculating the Beckmann facet slope distribution for the Cook-Torrance lighting model:
float a = 1 / (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
float b = clampedCosHalfNormal * clampedCosHalfNormal - 1.0;
float c = facetSlopeRMS * facetSlopeRMS * clampedCosHalfNormal * clampedCosHalfNormal;
facetSlopeDistribution = a * exp(b/c);
yields very very different results to
float a = (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
facetSlopeDistribution = exp(b/c) / a;
Why does it? The second form of the expression is the problematic one.
If I, say, try to add the second form of the expression to a color, I get blacks, even though the expression should always evaluate to a positive number. Am I getting an infinity? A NaN? If so, why?
I didn't go through your mathematics in detail, but you must be aware that small errors get pumped up easily by all these powers and exponentials. You should try substituting every variable var with var + e(var) (on paper, yes) and derive an expression for the total error, without simplifying in between steps, because that's where the error comes from!
This is also a very common problem in computational fluid dynamics, where you can observe things like 'numeric diffusion' if your grid isn't properly aligned with the simulated flow.
So get a clear grip on where the biggest errors come from, and rewrite equations where possible to minimize the numeric error.
edit: to clarify, an example
Say you have some variable x and an expression y=exp(x). The error in x is denoted e(x) and is small compared to x (say e(x)/x < 0.0001, but note that this depends on the type you are using). Then you could say that
e(y) = y(x+e(x)) - y(x)
e(y) ~ dy/dx * e(x) (for small e(x))
e(y) = exp(x) * e(x)
So the absolute error is magnified by a factor of exp(x), meaning that around x=0 there's really no issue (not a surprise, since at that point the slope of exp(x) equals that of x), but for big x you will notice this.
The relative error would then be
e(y)/y = e(y)/exp(x) = e(x)
whilst the relative error in x was
e(x)/x
so you added a factor of x to the relative error.
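A quick numeric check of that (the perturbation size is arbitrary):
#include <cmath>
#include <cstdio>

int main()
{
    const double rel_in = 1e-7;                        // relative error injected into x
    const double xs[] = { 0.5, 5.0, 50.0 };
    for (double x : xs) {
        double y       = std::exp(x);
        double y_pert  = std::exp(x * (1.0 + rel_in)); // exp of the perturbed input
        double rel_out = (y_pert - y) / y;             // resulting relative error in exp(x)
        std::printf("x = %4.1f  in: %.1e  out: %.1e  ratio: %.2f\n",
                    x, rel_in, rel_out, rel_out / rel_in);   // ratio comes out close to x
    }
    return 0;
}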