Here's what I want to do:
Take a double (which is between -1 and 1) and cast it to a float. But I want to make sure that the float is ALWAYS less than the double.
Is there any straightforward way to do this?
For reference, here's something I came up with.
float DoubleToSmallerFloat (double X) // ex. X = 0.79828470019999997
{
float Y = X; // 0.79828471 -> note this is greater than X
double Diff = X - Y;
return Y - Abs (Diff) * 10;
}
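If you are able to use C++11, you can use nextafter() from <cmath> for this:
#include <cmath>
float doubleToSmallerFloat(double x) {
    float f = static_cast<float>(x); // rounds to nearest, so f may land above or below x
    // Step toward -infinity rather than -1.0f, so that x == -1.0 also yields a smaller float.
    return f < x ? f : std::nextafter(f, -INFINITY);
}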
I think that is a good question. Look at the IEEE 754 single-precision and double-precision binary floating-point formats.
The real value represented by a 32-bit binary32 datum with sign s (+1 or -1), biased exponent e (an 8-bit unsigned integer), and a 23-bit fraction (mantissa) is
s * m * (2 ^(e-127)),
where m is 1 + fraction / (2 ^ 23) for normalized numbers.
For double use 1023 instead of 127: s * m * (2 ^(e-1023))
First case: the sign s and the unbiased exponent keep their values after the double-to-float cast. Then the float mantissa is essentially the leading bits of the double mantissa, possibly rounded up. You need to slightly decrease the value of the float mantissa.
Second case: the unbiased exponent (e-127) of the float is greater than the unbiased exponent (e-1023) of the double, because rounding carried into the exponent. Then the fraction part should be 23 zeros. Decrease the exponent part and set the fraction part to 23 ones. To access the fields, use a union.
union {
float fl;
uint32_t dw;
} f;
int s = ( f.dw >> 31 ) ? -1 : 1; /* sign */
int e = ( f.dw >> 23 ) & 0xFF; /* exponent */
int fract = f.dw & 0x7FFFFF; /* fraction */
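The two cases above collapse into one operation on the raw bit pattern: for a positive float, subtracting 1 from the 32-bit pattern either decreases the fraction or, when the fraction is all zeros, borrows from the exponent and leaves 23 ones. Here is a minimal sketch of that idea. The name `float_below` is hypothetical, and it uses `memcpy` instead of a union because union type punning is not well-defined in C++. It assumes a finite input and IEEE 754 binary32 floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns a float strictly below x (finite x assumed).
float float_below(double x) {
    assert(std::isfinite(x));
    float f = static_cast<float>(x);      // round to nearest
    if (f < x) return f;                  // rounding already went down
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // read the raw binary32 pattern
    if ((bits & 0x7FFFFFFFu) == 0) {
        bits = 0x80000001u;               // +0 or -0: step to the smallest negative subnormal
    } else if (bits & 0x80000000u) {
        ++bits;                           // negative: a larger magnitude is a smaller value
    } else {
        --bits;                           // positive: borrows into the exponent when fraction == 0
    }
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For the value from the question, `float_below(0.79828470019999997)` steps one unit in the last place below the rounded float, which is much closer to X than subtracting `Abs(Diff) * 10`.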
According to articles like this, half of the floating-point numbers lie in the interval [-1,1]. Could you suggest how to make use of this fact to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?
Naive code:
uint32_t i = /* randomly generated */;
float f = (float)i / (1ui32<<31) - 1.0f;
The problem here is that first the number i is converted into float losing up to 8 lower bits of precision. Only then the number is scaled to [0;2) interval, and then to [-1;1) interval.
Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.
Update: the solution with a double is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?
You can do the calculation using double instead so you don't lose any precision on the uint32_t value, then assign the result to a float.
float f = (double)i / (1ui32<<31) - 1.0;
In case you drop the uniform distribution constraint, it's doable using 32-bit integer arithmetic alone:
//---------------------------------------------------------------------------
float i32_to_f32(int x)
{
int exp;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0x00000000) return 0.0f;
if (x< -0x1FFFFFFF) return -1.0f;
if (x> +0x1FFFFFFF) return +1.0f;
// conversion
y.u=0; // reset bits
if (x<0){ y.u|=0x80000000; x=-x; } // sign (31 bits left)
exp=((x>>23)&63)-64; // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
y.u|=(exp+127)<<23; // exponent bias and bit position
y.u|=x&0x007FFFFF; // mantissa
return y.f;
}
//---------------------------------------------------------------------------
int f32_to_i32(float x)
{
int exp,man,i;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0.0f) return 0x00000000;
if (x<=-1.0f) return -0x1FFFFFFF;
if (x>=+1.0f) return +0x1FFFFFFF;
// conversion
y.f=x;
exp=(y.u>>23)&255; exp-=127; // exponent bias and bit position
if (exp<-64) return 0; // too small to encode (the function returns int, not float)
man=y.u&0x007FFFFF; // mantissa
i =(exp<<23)&0x1F800000;
i|= man;
if (y.u>=0x80000000) i=-i; // sign
return i;
}
//---------------------------------------------------------------------------
I chose to use only 29 bits + sign = ~ 30 bits of integer to avoid denormalized numbers havoc which I am too lazy to encode (it would get you 30 or even 31 bits but much slower and complicated).
But the distribution is neither linear nor uniform at all.
[Plot: red is the float in range <-1,+1>, blue is the integer in range <-1FFFFFFF,+1FFFFFFF>.]
On the other hand there is no rounding at all in both conversions ...
PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).
The thing to realize is that while (float)i does lose 8 bits of precision (so it has 24 bits of precision), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing. (It is actually more complicated, because if i is smaller, it will lose fewer than 8 bits, but things still work out.)
So we just need to fix the range, so that the originally non-negative value gets mapped to INT_MIN..INT_MAX.
This expression works: (float)(int)(value^0x80000000)/0x80000000.
Here's how it works:
The (int)(value^0x80000000) part flips the sign bit, so 0x0 gets mapped to INT_MIN, and 0xffffffff gets mapped to INT_MAX.
Then there is conversion to float. This is where some rounding happens, and we lose precision (but it is not a problem).
Then just divide by 0x80000000 to get into the range [-1..1]. As this division just adjusts the exponent part, this division doesn't lose any precision.
So there is only one rounding; the other operations don't lose precision. This chain of operations should have the same effect as calculating the result in infinite precision and then rounding to float (this theoretical rounding has the same effect as the rounding in step 2).
But, to be absolutely sure, I've verified with brute force checking all the 32-bit values that this expression results in the same value as (float)((double)value/0x80000000-1.0).
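The check described above can be sketched as follows. This is an illustration, not the answerer's actual test code: the function names are made up, and `count_mismatches` walks a strided sample rather than all 2^32 values so that it finishes quickly. It assumes a two's-complement `int32_t` conversion and IEEE 754 floats evaluated in their own precision (as on x86_64 with SSE).

```cpp
#include <cassert>
#include <cstdint>

// Single-rounding conversion using float arithmetic only (the expression from the text).
float u32_to_unit_float(std::uint32_t value) {
    // Flipping the top bit maps 0 to INT32_MIN and 0xFFFFFFFF to INT32_MAX.
    std::int32_t centered = static_cast<std::int32_t>(value ^ 0x80000000u);
    // The int-to-float conversion rounds once; dividing by 2^31 is exact.
    return static_cast<float>(centered) / 2147483648.0f;
}

// Reference conversion that goes through double precision.
float u32_to_unit_float_ref(std::uint32_t value) {
    return static_cast<float>(static_cast<double>(value) / 2147483648.0 - 1.0);
}

// Compares both conversions on every stride-th input; returns the number of mismatches.
std::uint32_t count_mismatches(std::uint32_t stride) {
    assert(stride != 0);
    std::uint32_t bad = 0;
    for (std::uint64_t v = 0; v <= 0xFFFFFFFFull; v += stride) {
        std::uint32_t u = static_cast<std::uint32_t>(v);
        if (u32_to_unit_float(u) != u32_to_unit_float_ref(u)) ++bad;
    }
    return bad;
}
```

Note that the largest inputs round up to exactly 1.0f in both versions, so the output range is the closed interval [-1, 1].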
I suggest (if you want to avoid division and use an exactly float-representable scale factor of 1.0*2^-31, which maps i into [0, 2) before the subtraction):
float e = i * ldexp(1.0,-31) - 1.0;
Any ideas how to achieve this without using double-precision floating-point?
Without assuming too much about the insides of float:
Shift u until the most significant bit is set, halving the float conversion value.
"keeping the uniform distribution"
50% of the uint32_t values will be in the [0.5 ... 1.0)
25% of the uint32_t values will be in the [0.25 ... 0.5)
12.5% of the uint32_t values will be in the [0.125 ... 0.25)
6.25% of the uint32_t values will be in the [0.0625 ... 0.125)
...
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
float ui32to0to1(uint32_t u) {
if (u) {
float band = 1.0f/(1llu<<32);
while ((u & 0x80000000) == 0) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
return 0.0f;
}
Some test code to show functional equivalence to double.
int test(uint32_t u) {
volatile float f0 = (float) ((double)u / (1llu<<32));
volatile float f1 = ui32to0to1(u);
if (f0 != f1) {
printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
return 1;
}
return 0;
}
int main(void) {
for (int i=0; i<100000000; i++) {
test(rand()*65535u ^ rand());
}
return 0;
}
Various optimizations are possible, especially with assuming properties of float. Yet for an initial answer, I'll stick to a general approach.
For improved efficiency, the loop needs only to iterate from 32 down to FLT_MANT_DIG which is usually 24.
float ui32to0to1(uint32_t u) {
float band = 1.0f/(1llu<<32);
for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
This answer maps [0 to 2^32-1] to [0.0 to 1.0).
To map [0 to 2^32-1] to (-1.0 to 1.0), use the following. It can form -0.0.
float ui32tom1to1(uint32_t u) { // wrapper name is illustrative
if (u >= 0x80000000) {
return ui32to0to1((u - 0x80000000)*2);
} else
return -ui32to0to1((0x7FFFFFFF - u)*2);
}
So at the moment I'm stuck with my calculator. It is only allowed to use the following methods:
int succ(int x){
return ++x;
}
int neg(int x){
return -x;
}
What I already have is +, - and *, both iterative and recursive (so I can also use them if needed).
Now I'm stuck on the divide method, because I don't know how to deal with the decimal places ("commas") and the logic behind it. To give an idea of what working with succ() and neg() looks like, here is an example of a subtraction, iterative and recursive:
int sub(int x, int y){
if (y > 0){
y = neg(y);
x = add(x, y);
return x;
}
else if (y < 0){
y = neg(y);
x = add(x, y);
return x;
}
else if (y == 0) {
return x;
}
}
int sub_recc(int x, int y){
if (y < 0){
y = neg(y);
x = add_recc(x, y);
return x;
} else if (y > 0){
x = sub_recc(x, y - 1);
x = x - 1;
return x;
}else if( y == 0) {
return x;
}
}
If you can subtract and add, then you can handle integer division. In pseudocode it is just:
division y/x is:
First handle the signs, because we will only divide positive integers
set sign = 0
if y < 0 then y = neg(y), sign = 1 - sign
if x < 0 then x = neg(x), sign = 1 - sign
ok, if sign is 0 there is nothing to do; if sign is 1, we will negate the result
Now the quotient is just the number of times you can subtract the divisor:
set quotient = 0
while y >= x do
y = y - x
quotient = quotient + 1
Ok, we have the absolute value of the quotient, now for the sign:
if sign == 1, then quotient = neg(quotient)
The correct translation into C++ as well as the recursive part are left as an exercise...
Hint for recursion: y/x == 1 + (y-x)/x while y >= x
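For readers who want to compare their attempt, here is one possible iterative translation of the pseudocode. It is only a sketch: `pred`, `add`, `sub` and `divide` are names chosen here, not the asker's implementations, and comparisons against 0 are assumed to be allowed as in the asker's own `sub`. Arithmetic goes through `succ` and `neg` only, so it is slow for large operands.

```cpp
#include <cassert>

int succ(int x) { return ++x; }
int neg(int x) { return -x; }

// Predecessor built from the two primitives: -((-x) + 1) == x - 1.
int pred(int x) { return neg(succ(neg(x))); }

// Addition by repeated succ/pred, moving y toward zero.
int add(int x, int y) {
    while (y > 0) { x = succ(x); y = pred(y); }
    while (y < 0) { x = pred(x); y = succ(y); }
    return x;
}

int sub(int x, int y) { return add(x, neg(y)); }

// Integer division y / x, truncating toward zero; x must not be 0.
int divide(int y, int x) {
    assert(x != 0);
    bool negative = false;
    if (y < 0) { y = neg(y); negative = !negative; }
    if (x < 0) { x = neg(x); negative = !negative; }
    int quotient = 0;
    while (y >= x) {              // count how often the divisor fits
        y = sub(y, x);
        quotient = succ(quotient);
    }
    return negative ? neg(quotient) : quotient;
}
```

After the loop, y holds the remainder, which is what a later fractional-digit step would continue from.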
Above was the integer part. Integers are nice and easy because they give exact operations. A floating point representation in a given base is always something close to mantissa * base^exp, where the mantissa is either an integer with a maximum number of digits or a number between 0 and 1 (the so-called normalized representation). You can pass from one representation to the other by changing the exponent part by the number of digits of the mantissa: 2.5 is 25 * 10^-1 (integer mantissa) or .25 * 10^1 (0 <= mantissa < 1).
So if you want to operate on base 10 floating point numbers you should:
convert an integer to a floating point (mantissa + exponent) representation
for addition and subtraction, the result exponent is a priori the greater of the exponents. Both mantissas shall be scaled to that exponent and added/subtracted. Then the final exponent must be adjusted, because the operation may have added an additional digit (7 + 9 = 16) or have caused the highest order ones to vanish (101 - 98 = 3)
for product, you add the exponents and multiply the mantissas, and then normalize (adjust the exponent of) the result
for division, you scale the mantissa by the maximum number of digits, do the division with the integer division algorithm, and again normalize. For example 1/3 with a precision of 6 digits is obtained with:
1/3 = (1 * 10^6 / 3) * 10^-6 = (1000000/3) * 10^-6
it gives 333333 * 10^-6, so .333333 in normalized form
Ok, it will be a lot of boilerplate code, but nothing really hard.
Long story short: just remember how you learned that with paper and pencil...
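The division step can be sketched like this. The `Dec` struct and `divide_dec` are hypothetical names, the built-in `/` stands in for the succ/neg integer division, and no normalization of trailing zeros is done. It assumes mantissas below 10^9 in magnitude so that the scaled numerator fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Decimal floating point value: mantissa * 10^exponent (integer mantissa form).
struct Dec {
    std::int64_t mantissa;
    int exponent;
};

// Divides a by b, keeping `digits` extra decimal digits in the quotient.
Dec divide_dec(Dec a, Dec b, int digits) {
    assert(b.mantissa != 0 && digits >= 0 && digits <= 9);
    std::int64_t scale = 1;
    for (int i = 0; i < digits; ++i) scale *= 10;   // scale = 10^digits
    Dec r;
    r.mantissa = (a.mantissa * scale) / b.mantissa; // integer division on the scaled mantissa
    r.exponent = a.exponent - b.exponent - digits;  // undo the scaling in the exponent
    return r;
}
```

With `digits = 6`, dividing `Dec{1, 0}` by `Dec{3, 0}` gives mantissa 333333 and exponent -6, matching the worked example.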
I have a problem converting a double (say N) to p/q form (rational form). For this I have the following strategy:
Multiply double N by a large number, say $k = 10^{10}$
then p = N*k and q = k
Take gcd(p,q) and set p = p/gcd(p,q) and q = q/gcd(p,q)
when N = 8.2 , Answer is correct if we solve using pen and paper, but as 8.2 is represented as 8.19999999 in N (double), it causes problem in its rational form conversion.
I tried it doing other way as : (I used a large no. 10^k instead of 100)
if(abs(y*100 - round(y*100)) < 0.000001) y = round(y*100)/100
But this approach also doesn't give right representation all the time.
Is there any way I could carry out the equivalent conversion from double to p/q ?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
To 11 binary places:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation since, like 0.12 above, 0.2 cannot be exactly represented in binary.
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
Here is a good article on floating points:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
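A minimal sketch of the epsilon idea, assuming a relative tolerance is what is wanted. The name `nearly_equal` and the default of 1e-9 are arbitrary choices for illustration, and the linked article discusses cases (values near zero in particular) where a purely relative test is not enough.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when a and b differ by at most rel_eps times the larger magnitude.
bool nearly_equal(double a, double b, double rel_eps = 1e-9) {
    assert(rel_eps > 0.0);
    double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_eps * scale;
}
```

For example, `0.1 + 0.2 == 0.3` is false with IEEE 754 doubles, while `nearly_equal(0.1 + 0.2, 0.3)` is true.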
have problem of converting a double (say N) to p/q form
... when N = 8.2
A typical double cannot encode 8.2 exactly. Instead the closest representable double is about
8.19999999999999928945726423989981412887573...
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
It will be the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. The first step should be to determine whether the double is large, in which case it is a whole number.
Do not multiply by some power of 10, as double almost certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits (@Richard Critten). This is often 53 binary digits.
Determine the exponent of the double. If large enough or x == 0.0, the double is a whole number.
Otherwise, scale a numerator and denominator by DBL_MANT_DIG. While the numerator is even, halve both the numerator and denominator. As the denominator is a power-of-2, no other prime values are needed for simplification consideration.
#include <float.h>
#include <math.h>
#include <stdio.h>
void form_ratio(double x) {
double numerator = x;
double denominator = 1.0;
if (isfinite(numerator) && x != 0.0) {
int expo;
frexp(numerator, &expo);
if (expo < DBL_MANT_DIG) {
expo = DBL_MANT_DIG - expo;
numerator = ldexp(numerator, expo);
denominator = ldexp(1.0, expo);
while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
numerator /= 2.0;
denominator /= 2.0;
}
}
}
int pre = DBL_DECIMAL_DIG;
printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}
int main(void) {
form_ratio(123456789012.0);
form_ratio(42.0);
form_ratio(1.0 / 7);
form_ratio(867.5309);
}
Output
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104
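If integer p and q are needed rather than doubles, the same idea can be sketched with 64-bit integers. This variant is an assumption-laden illustration: `Ratio` and `to_ratio` are hypothetical names, a 53-bit significand is assumed, and the input must be finite with a magnitude below 2^53 and not so small that the denominator exceeds 2^62.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Ratio {
    std::int64_t p;  // numerator (carries the sign)
    std::int64_t q;  // denominator, always a power of two
};

// Exact conversion of a double to p/q in lowest terms.
Ratio to_ratio(double x) {
    assert(std::isfinite(x));
    if (x == 0.0) return {0, 1};
    int expo;
    double m = std::frexp(x, &expo);  // x == m * 2^expo with 0.5 <= |m| < 1
    // m * 2^53 is an integer because a double holds at most 53 significant bits.
    std::int64_t p = static_cast<std::int64_t>(std::ldexp(m, 53));
    int shift = 53 - expo;            // now x == p / 2^shift
    while (shift > 0 && p % 2 == 0) { // cancel common factors of 2
        p /= 2;
        --shift;
    }
    assert(shift >= 0 && shift <= 62); // denominator must fit in int64_t
    return {p, std::int64_t{1} << shift};
}
```

Because q is a power of two and p has at most 53 bits, dividing p by q as doubles reproduces the original value exactly.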
I am a newbie in C++ and this question will be probably so easy to answer for you. I can't find the actual meaning of this kind of syntax. So I have:
struct Vec {
double x, y, z;
Vec(double x_=0, double y_=0, double z_=0){ x=x_; y=y_; z=z_; }
};
int w = 1024, h = 768;
Vec cx = Vec(w*.5135/h);
What is happening in the last row? I am creating a new struct of type Vec and, what else ?
Thanks in advance.
It's a short way of writing floating point numbers. You can do that both ways (it has to be either a double or a float, of course). A decimal number is divided into 3 parts (excluding the sign, that is):
123 . 456
| | \_fractional part
| |
| \_decimal point
|
integer part
When the integer part is equal to 0 but the fractional part is not:
double x = .123; // the same as writing 0.123
When the fractional part is equal to 0 but the integer part is not:
double x = 123.; // the same as writing 123.0
The * is just your standard multiplication here. You are just multiplying an integer number w with a decimal number .5135 that has its integer part equal to 0.
it is equivalent to:
Vec cx = Vec(w*0.5135/h);
In the last row you're initializing cx with a newly constructed instance of type Vec by calling its constructor Vec(double x_=0, double y_=0, double z_=0).
Vec cx = Vec(w*.5135/h);
Is the same as:
Vec cx = Vec(w*0.5135/h, 0, 0);
Because of the default values for the parameters defined by the constructor.
A floating point number doesn't have to start with a 0 in C++.
assert(0.5135 == .5135); // True
So w*.5135 is just a multiplication of integer w and double 0.5135.
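To see both points together (the `.5135` literal and the defaulted constructor arguments), here is a small sketch built around the struct from the question. The helper `make_cx` is a name introduced only for this illustration.

```cpp
#include <cassert>

struct Vec {
    double x, y, z;
    Vec(double x_ = 0, double y_ = 0, double z_ = 0) { x = x_; y = y_; z = z_; }
};

// Builds the vector from the question; only the first argument is supplied,
// so y and z take their default value of 0.
Vec make_cx(int w, int h) {
    assert(h != 0);
    // int * double gives double, and double / int gives double: no integer division here.
    return Vec(w * .5135 / h);
}
```

Calling `make_cx(1024, 768)` yields a Vec whose x is `1024 * 0.5135 / 768` and whose y and z are both 0.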
I know how to get the fractional part of a float but I don't know how to set it. I have two integers returned by a function, one holds the integer and the other holds the fractional part.
For example:
int a = 12;
int b = 2; // This can never be 02, 03 etc
float c;
How do I get c to become 12.2? I know I could add something like (float)b / 10, but then what if b is >= 10? Then I would have to divide by 100, and so on. Is there a function or something where I can do setfractional(c, b)?
Thanks
edit: The more I think about this problem the more I realize how illogical it is. if b == 1 then it would be 12.1 but if b == 10 it would also be 12.1 so I don't know how I'm going to handle this. I'm guessing the function never returns a number >= 10 for fractional but I don't know.
Something like:
float IntFrac(int integer, int frac)
{
float integer2 = integer;
float frac2 = frac;
float log10 = log10f(frac2 + 1.0f);
float ceil = ceilf(log10);
float pow = powf(10.0f, -ceil);
float res = abs(integer);
res += frac2 * pow;
if (integer < 0)
{
res = -res;
}
return res;
}
Ideone: http://ideone.com/iwG8UO
It's like saying, for integer = 12 and frac = 98: log10(98 + 1) = log10(99) = 1.995, ceilf(1.995) = 2, powf(10, -2) = 0.01, 98 * 0.01 = 0.98, and then 12 + 0.98 = 12.98, and then we check for the sign.
And let's hope the vagaries of IEEE 754 float math won't hit too hard :-)
I'll add that it would probably be better to use double instead of float. Other than 3D graphics, there are very few fields where using float is a good idea nowadays.
The most trivial method would be counting the digits of b and then divide accordingly:
int i = 10;
while(b >= i) // rather slow, there are faster ways; >= so that b == 10 is divided by 100
i*= 10;
c = a + static_cast<float>(b)/i;
Note that due to the nature of float the result might not be what you expected. Also, if you want something like 3.004 you can modify the initial value of i to another power of ten.
Kindly try the code below after including math.h and stdlib.h:
int a=12;
int b=22;
int d=b;
int i=0;
float c;
while(d>0)
{
d/=10;
i++;
}
c=a+(float)b/pow(10,i);
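The digit-counting answers can be wrapped into one function. This is a sketch under stated assumptions: `join_parts` and its `min_digits` parameter are invented here, the fractional part must be non-negative, and double is used instead of float. `min_digits` covers the leading-zero case mentioned above (for example 3.004), which cannot be recovered from the integer 4 alone.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Combines whole and frac into whole.frac, e.g. (12, 22) -> 12.22.
// min_digits pads the fraction with leading zeros, e.g. (3, 4, 3) -> 3.004.
double join_parts(int whole, int frac, int min_digits = 1) {
    assert(frac >= 0 && min_digits >= 1);
    double scale = 1.0;
    int digits = 0;
    for (int d = frac; d > 0; d /= 10) {  // one factor of 10 per decimal digit of frac
        scale *= 10.0;
        ++digits;
    }
    for (; digits < min_digits; ++digits) scale *= 10.0;
    double magnitude = std::abs(whole) + frac / scale;
    return whole < 0 ? -magnitude : magnitude;
}
```

As the question's edit points out, `frac = 1` and `frac = 10` both produce 12.1 with `whole = 12`, so the caller has to know how many digits the fractional integer is meant to have.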