Converting floating point to fixed point

Converting floating point to fixed point - c++

In C++, what's the generic way to convert any floating point value (float) to fixed point (int, 16:16 or 24:8)?
EDIT: For clarification, fixed-point values have two parts: an integer part and a fractional part. The integer part can be represented by a signed or unsigned integer data type. The fractional part is represented by an unsigned integer data type.
Let's make an analogy with money for the sake of clarity. The fractional part may represent cents -- a fractional part of a dollar. The range of the 'cents' data type would be 0 to 99. If an 8-bit unsigned integer were to be used for fixed-point math, then the fractional part would be split into 256 equal parts.
I hope that clears things up.

Here you go:
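#include <cmath>
#include <limits>
// A signed fixed-point 16:16 class
class FixedPoint_16_16
{
short intPart;
unsigned short fracPart;
public:
FixedPoint_16_16(double d)
{
*this = d; // calls operator=
}
FixedPoint_16_16& operator=(double d)
{
// floor() keeps the remainder in [0, 1), also for negative d
const double whole = std::floor(d);
intPart = static_cast<short>(whole);
// scale the remainder to the 16-bit fraction (no overflow check on intPart)
fracPart = static_cast<unsigned short>(
(std::numeric_limits<unsigned short>::max() + 1.0) * (d - whole));
return *this;
}
// Other operators can be defined here
};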
EDIT: Here's a more general class based on another common way to deal with fixed-point numbers (which KPexEA pointed out):
template <class BaseType, size_t FracDigits>
class fixed_point
{
const static BaseType factor = 1 << FracDigits;
BaseType data;
public:
fixed_point(double d)
{
*this = d; // calls operator=
}
fixed_point& operator=(double d)
{
data = static_cast<BaseType>(d*factor);
return *this;
}
BaseType raw_data() const
{
return data;
}
// Other operators can be defined here
};
fixed_point<int, 8> fp1(1.5); // Will be signed 24:8 (if int is 32-bits)
fixed_point<unsigned int, 16> fp2(1.5); // Will be unsigned 16:16 (if int is 32-bits)

A cast from float to integer throws away the fractional portion, so if you want to keep that fraction as fixed point, multiply the float by the scale factor before casting it. Note that the code below does not check for overflow.
If you want 16:16:
double f = 1.2345;
int n;
n=(int)(f*65536);
If you want 24:8:
double f = 1.2345;
int n;
n=(int)(f*256);
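Since the snippets above skip the overflow check, here is a rough sketch of how one could be added. The helper name to_fixed and its bool/out-parameter interface are my own invention, not something from the answers, and it assumes a 32-bit signed result. It also rounds to nearest instead of truncating, which is a design choice rather than a requirement.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: converts d to a signed 32-bit fixed-point value with
// frac_bits fractional bits (16 for 16:16, 8 for 24:8).
// Returns false if d is NaN or the scaled value does not fit in 32 bits.
inline bool to_fixed(double d, int frac_bits, std::int32_t& out)
{
    assert(frac_bits >= 0 && frac_bits < 31);
    // ldexp(d, n) computes d * 2^n; round() then picks the nearest integer.
    const double scaled = std::round(std::ldexp(d, frac_bits));
    const double lo = std::numeric_limits<std::int32_t>::min();
    const double hi = std::numeric_limits<std::int32_t>::max();
    // Written as a negated range test so that NaN is rejected as well.
    if (!(scaled >= lo && scaled <= hi))
        return false;
    out = static_cast<std::int32_t>(scaled);
    return true;
}
```

With this, to_fixed(1.5, 16, n) stores 98304 in n, while a value such as 40000.0 is rejected for 16:16 because it needs more than 15 integer bits.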

Edit: My first comment applies to the answer before Kevin's edit, but I'll leave it here for posterity. Answers change so quickly here sometimes!
The problem with Kevin's approach is that with fixed point you are normally packing into a guaranteed word size (typically 32 bits). Declaring the two parts separately leaves you at the whim of your compiler's structure packing. Yes, you could force it, but it does not work for anything other than a 16:16 representation.
KPexEA is closer to the mark by packing everything into int, although I would use "signed long" to try and be explicit about 32 bits. Then you can use his approach for generating the fixed point value, and bit slicing to extract the component parts again. His suggestion also covers the 24:8 case.
( And everyone else who suggested just static_cast.....what were you thinking? ;) )
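To illustrate the "bit slicing" mentioned above, here is a minimal sketch for a signed 16:16 value held in a single 32-bit integer. The function names are made up for this example, and it assumes the usual two's complement conversion between unsigned and signed 16-bit values (guaranteed only since C++20, but universal in practice).

```cpp
#include <cassert>
#include <cstdint>

// Packs a double into 16:16 by scaling and truncating (no overflow check).
inline std::int32_t fixed_from_double(double d)
{
    return static_cast<std::int32_t>(d * 65536.0);
}

// Upper 16 bits: the integer part, rounded toward negative infinity.
inline std::int16_t int_part(std::int32_t fx)
{
    return static_cast<std::int16_t>(static_cast<std::uint32_t>(fx) >> 16);
}

// Lower 16 bits: the fraction, in units of 1/65536.
inline std::uint16_t frac_part(std::int32_t fx)
{
    return static_cast<std::uint16_t>(static_cast<std::uint32_t>(fx) & 0xFFFFu);
}

// Converts the packed value back to floating point.
inline double fixed_to_double(std::int32_t fx)
{
    return fx / 65536.0;
}
```

Note that for negative values the two slices follow the two's complement layout: -1.25 is stored as integer part -2 and fraction 0xC000 (0.75), which add up to -1.25.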

I gave the answer to the guy that wrote the best answer, but I really used a related question's code that points here.
It used templates and was easy to ditch dependencies on the boost lib.

This is fine for converting from floating point to integer, but the O.P. also wanted fixed point.
Now how you'd do that in C++, I don't know (C++ not being something I can think in readily). Perhaps try a scaled-integer approach, i.e. use a 32 or 64 bit integer and programmatically allocate the last, say, 6 digits to what's on the right hand side of the decimal point.

There isn't any built in support in C++ for fixed point numbers. Your best bet would be to write a wrapper 'FixedInt' class that takes doubles and converts them.
As for a generic method to convert... the int part is easy enough, just grab the integer part of the value and store it in the upper bits... decimal part would be something along the lines of:
for (int i = 1; i <= precision; i++)
{
// bit i of the fraction has weight 2^-i
if (decimal_part >= 1.f/(float)(1 << i))
{
decimal_part -= 1.f/(float)(1 << i);
fixint_value |= (1 << (precision - i));
}
}
although this is likely to contain bugs still

Related

Is there any way to convert a float into an int without losing the decimal values?

I'm trying to convert a float value to an integer, modify the int value, then reconvert back to a float value. However, the decimals' value gets lost and I'm pretty sure I used the static_cast<>() function wrong in my code.
My code is a binary multiplier, which shifts the binary value f times to left. For example, when I'm doing something like 1.2 x 2, I'm only getting 2 instead of 2.4.
int mantissa;
int f;
int exp;
float result = mantissa + 0x800000;
int resultInt = static_cast<int>(result);
int expF = log2(abs(f));
int expM = exp + expF;
int newExp = (127 + 23 - expM);
resultInt >>= newExp;
float result2 = resultInt;

Bit shifting will not work for floating point values because the bits are laid out differently. They have to preserve the decimal location as well as the digits (hence the floating "point" value).
An integer, on the other hand, works well with bit shifting due to how well it maps from decimal-to-binary, but does not store a decimal point anywhere. Thus, when casting, you lose that information.
In short, it is impossible to multiply a decimal value directly using bit shifting the same way you can with an integer.
However, you can multiply the floating point by 10 until all digits are on the left side of the decimal, then cast to an integer. It may eat up performance depending on how it's implemented, but it's certainly possible to preserve all information this way. It's difficult to answer the question beyond that without understanding your intentions.
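If the goal is only to multiply a float by a power of two, the standard library already offers the floating point counterpart of a shift: std::ldexp adjusts the exponent and leaves the mantissa alone. A small sketch follows; the wrapper name is hypothetical, and this may or may not match what the asker ultimately needs.

```cpp
#include <cassert>
#include <cmath>

// Multiplies x by 2^shift by adjusting the exponent, not by shifting bits.
// A negative shift divides by a power of two.
inline float scale_by_power_of_two(float x, int shift)
{
    return std::ldexp(x, shift);
}
```

For example, scale_by_power_of_two(1.2f, 1) gives the same value as 1.2f * 2.0f, so the fractional digits are preserved.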

Converting double to array of bits for genetic algorithm in C(++) [duplicate]

Closed 10 years ago.
Possible Duplicate:
Floating Point to Binary Value(C++)
Currently I'm working on a genetic algorithm for my thesis and I'm trying to optimize a problem which takes three doubles to be the genome for a particular solution. For the breeding of these doubles I would like to use a binary representation of these doubles and for this I'll have to convert the doubles to their binary representation. I've searched for this, but can't find a clear solution, unfortunately.
How to do this? Is there a library function to do this, as there is in Java? Any help is greatly appreciated.

What about:
double d = 1234;
unsigned char *b = (unsigned char *)&d;
Assuming a double consists of 8 bytes you could use b[0] ... b[7].
Another possibility is:
long long x = *(long long *)&d;
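A word of caution on the pointer cast to long long: reading a double through a long long lvalue breaks the strict aliasing rules, while the unsigned char access is allowed. A safer sketch copies the bytes with std::memcpy instead. The function names are made up for this example, and it assumes a 64-bit double.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit double");

// Copies the object representation of d into an integer.
inline std::uint64_t double_to_bits(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// The inverse: reinterprets 64 bits as a double.
inline double bits_to_double(std::uint64_t bits)
{
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

On an IEEE 754 platform double_to_bits(1.0) is 0x3FF0000000000000, and the round trip through bits_to_double returns the original value.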

Since you tag the question C++ I would use a reinterpret_cast
For the genetic algorithm what you probably really want is treating the mantissa, exponent and sign of your doubles independently. See "how can I extract the mantissa of a double"

Why do you want to use a binary representation? Just because something is more popular, does not mean that it is the solution to your specific problem.
There is a known genome representation called real that you can use to solve your problem without being subject to several issues of the binary representation, such as Hamming cliffs and different mutation values.
Please notice that I am not talking about cutting-edge, experimental stuff. This 1991 paper already describes the issue I am talking about. If you speak Spanish or Portuguese, I could point you to my personal book on GA, but there are beautiful references in English, such as Melanie Mitchell's or Eiben's books, that describe this issue in more depth.
The important thing to have in mind is that you need to tailor the genetic algorithm to your problem, not modify your needs in order to be able to use a specific type of GA.

I wouldn't convert it into an array. I guess if you do genetic stuff it should be performant. If I were you I would use an integer type (as suggested by irrelephant) and then do the mutation and crossover with integer operations.
If you don't do that, you're always converting back and forth. And for crossover you have to iterate through the 64 elements.
Here is an example for crossover:
__int64 crossover(__int64 a, __int64 b, int x) {
__int64 mask1 = ...; // left most x bits
__int64 mask2 = ...; // right most 64-x bits
return (a & mask1) + (b & mask2);
}
And for selection, you can just cast it back to a double.
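The masks are left open in the snippet above. One possible way to fill them in is sketched below; this is my own guess at a single-point crossover, not necessarily what the answer had in mind. It uses std::uint64_t rather than __int64 and a different function name to keep the two apart.

```cpp
#include <cassert>
#include <cstdint>

// Single-point crossover: takes the leftmost x bits from a and the
// remaining 64 - x bits from b.
inline std::uint64_t crossover_bits(std::uint64_t a, std::uint64_t b, int x)
{
    assert(x >= 0 && x <= 64);
    // Shifting a 64-bit value by 64 is undefined, so x == 0 is handled separately.
    const std::uint64_t left_mask =
        (x == 0) ? 0 : (~std::uint64_t(0) << (64 - x));
    return (a & left_mask) | (b & ~left_mask);
}
```

Keep in mind that mixing the raw bits of two doubles can produce NaN or infinity patterns, so the offspring probably need to be checked before they are used as doubles again.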

You could do it like this:
// Assuming a DOUBLE is 64bits
double d = 42.0; // just a random double
unsigned char* bits = (unsigned char*)&d; // access my double byte-by-byte
int array[64]; // result
for (int i = 0, k = 63; i < 8; ++i) // for each byte of my double
for (int j = 0; j < 8; ++j, --k) // for each bit of each byte of my double
array[k] = (bits[i] >> j) & 1; // is the Jth bit of the current byte 1?
Good luck

Either start with a binary representation of the genome and then use one-point or two-point crossover operators, or, if you want to use a real encoding for your GA, use the simulated binary crossover (SBX) operator for crossover. Most modern GA implementations use a real-coded representation and corresponding crossover and mutation operators.

You could use an int (or variant thereof).
The trick is to encode a float of 12.34 as an int of 1234.
Therefore you just need to cast to a float & divide by 100 during the fitness function, and do all your mutation & crossover on an integer.
Gotchas:
Beware the loss of precision if you actually need the nth bit.
Beware the sign bit.
Beware the difference in range between floats & ints.
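As a small sketch of that encoding, assuming two decimal places are enough (the scale of 100 and the function names are only for illustration):

```cpp
#include <cassert>
#include <cmath>

// Encodes a value with two decimal places as an integer gene, e.g. 12.34 -> 1234.
// lround() rounds to nearest, so small floating point errors do not drop a digit.
inline int encode_gene(double value)
{
    return static_cast<int>(std::lround(value * 100.0));
}

// Decodes the gene back to a double for the fitness function.
inline double decode_gene(int gene)
{
    return gene / 100.0;
}
```

Mutation and crossover then work on the int, and decode_gene is only called when the fitness is evaluated.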

Converting variable type (or workaround)

The class below is supposed to represent a musical note. I want to be able to store the length of the note (e.g. 1/2 note, 1/4 note, 3/8 note, etc.) using only integers. However, I also want to be able to store the length using a floating point number for the rare case that I deal with notes of irregular lengths.
class note{
string tone;
int length_numerator;
int length_denominator;
public:
void set_length(int numerator, int denominator){
length_numerator=numerator;
length_denominator=denominator;
}
void set_length(double d){
length_numerator=d; // unfortunately truncates everything past decimal point
length_denominator=1;
}
};
The reason it is important for me to be able to use integers rather than doubles to store the length is that in my past experience with floating point numbers, sometimes the values are unexpectedly inaccurate. For example, a number that is supposed to be 16 occasionally gets mysteriously stored as 16.0000000001 or 15.99999999999 (usually after enduring some operations) with floating point, and this could cause problems when testing for equality (because 16!=15.99999999999).
Is it possible to convert a variable from int to double (the variable, not just its value)? If not, then what else can I do to be able to store the note's length using either an integer or a double, depending on what I need the type to be?

If your only problem is comparing floats for equality, then I'd say to use floats, but read "Comparing floating point numbers" / Bruce Dawson first. It's not long, and it explains how to compare two floating numbers correctly (by checking the absolute and relative difference).
When you have more time, you should also look at "What Every Computer Scientist Should Know About Floating Point Arithmetic" to understand why 16 occasionally gets "mysteriously" stored as 16.0000000001 or 15.99999999999.
Attempts to use integers for rational numbers (or for fixed point arithmetic) are rarely as simple as they look.
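For reference, a comparison that checks both the absolute and the relative difference could look roughly like this. The function name and the default tolerances are arbitrary choices for this sketch and would have to be tuned for the application.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true if a and b are equal within an absolute tolerance (useful near
// zero) or within a relative tolerance (useful for larger magnitudes).
inline bool nearly_equal(double a, double b,
                         double abs_tol = 1e-12, double rel_tol = 1e-9)
{
    const double diff = std::fabs(a - b);
    if (diff <= abs_tol)
        return true;
    return diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}
```

With these defaults, 16.0 and 15.99999999999 compare as equal, while 16.0 and 16.1 do not.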

I see several possible solutions: the first is just to use double. It's
true that extended computations may result in inaccurate results, but in
this case, your divisors are normally powers of 2, which will give exact
results (at least on all of the machines I've seen); you only risk
running into problems when dividing by some unusual value (which is the
case where you'll have to use double anyway).
You could also scale the results, e.g. representing the notes as
multiples of, say 64th notes. This will mean that most values will be
small integers, which are guaranteed exact in double (again, at least
in the usual representations). A number that is supposed to be 16 does
not get stored as 16.000000001 or 15.99999999 (but a number that is
supposed to be .16 might get stored as .1600000001 or .1599999999).
Before the appearance of long long, decimal arithmetic classes often
used double as a 52 bit integral type, ensuring at each step that the
actual value was exactly an integer. (Only division might cause a problem.)
Or you could use some sort of class representing rational numbers.
(Boost has one, for example, and I'm sure there are others.) This would
allow any strange values (5th notes, anyone?) to remain exact; it could
also be advantageous for human readable output, e.g. you could test the
denominator, and then output something like "3 quarter notes", or the
like. Even something like "a 3/4 note" would be more readable to a
musician than "a .75 note".
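A bare-bones sketch of the rational number idea, in case pulling in Boost is not an option. All names here are hypothetical; fractions are kept in lowest terms so that equality is an exact integer comparison, and overflow of the products is not handled.

```cpp
#include <cassert>

// A note length stored as a reduced fraction num/den with den > 0.
struct NoteLength {
    int num;
    int den;
};

// Greatest common divisor by Euclid's algorithm (inputs must be non-negative).
inline int gcd_int(int a, int b)
{
    while (b != 0) {
        const int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Builds a fraction in lowest terms, so 6/16 and 3/8 compare equal.
inline NoteLength make_length(int num, int den)
{
    assert(den > 0);
    const int g = gcd_int(num < 0 ? -num : num, den);
    return NoteLength{num / g, den / g};
}

inline bool same_length(NoteLength a, NoteLength b)
{
    return a.num == b.num && a.den == b.den;
}

// Adds two lengths, e.g. to tie a quarter note to an eighth note.
inline NoteLength add_lengths(NoteLength a, NoteLength b)
{
    return make_length(a.num * b.den + b.num * a.den, a.den * b.den);
}
```

Because the denominator stays available, output such as "a 3/8 note" falls out directly, and odd values like fifth notes remain exact.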

It is not possible to convert a variable from int to double; it is only possible to convert a value from int to double. I'm not completely certain which you are asking for, but maybe you are looking for a union:
union DoubleOrInt
{
double d;
int i;
};
DoubleOrInt length_numerator;
DoubleOrInt length_denominator;
Then you can write
void set_length(int numerator, int denominator){
length_numerator.i=numerator;
length_denominator.i=denominator;
}
void set_length(double d){
length_numerator.d=d;
length_denominator.d=1.0;
}
The problem with this approach is that you absolutely must keep track of whether you are currently storing ints or doubles in your unions. Bad things will happen if you store an int and then try to access it as a double. Preferably you would do this tracking inside your class.
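One way to do that bookkeeping inside the class is a tag stored next to the union, as in the sketch below. The class and member names are made up for illustration. In C++17, std::variant<int, double> offers the same kind of tracking out of the box.

```cpp
#include <cassert>

// Hypothetical tagged wrapper: kind_ records which union member is live.
class Length {
public:
    enum Kind { Fraction, Real };

    Length(int numerator, int denominator) : kind_(Fraction)
    {
        value_.frac.num = numerator;
        value_.frac.den = denominator;
    }
    explicit Length(double d) : kind_(Real)
    {
        value_.real = d;
    }

    Kind kind() const { return kind_; }

    // Reads go through the tag, so the inactive member is never accessed.
    double as_double() const
    {
        return kind_ == Real
            ? value_.real
            : static_cast<double>(value_.frac.num) / value_.frac.den;
    }

private:
    struct Frac { int num; int den; };
    union Value { Frac frac; double real; };
    Kind kind_;
    Value value_;
};
```

Callers never touch the union directly, so the "store an int, read a double" mistake cannot happen from outside the class.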

This is normal behavior for floating point variables. They are always rounded, and the last digits may change value depending on the operations you do. I suggest reading up on floating point somewhere (e.g. http://floating-point-gui.de/), especially about comparing fp values.
I normally subtract them, take the absolute value and compare this against an epsilon, e.g. if (abs(x-y) < epsilon)

Given you have a set_length(double d), my guess is that you actually need doubles. Note that the conversion from a double to a fraction of integers is fragile and complex, and will most probably not solve your equality problems (is 0.24999999 equal to 1/4?). It would be better for you to choose either to always use fractions or to always use doubles. Then, just learn how to use them. I must say, for music, it makes sense to use fractions, as that is how notes are described anyway.

If it were me, I would just use an enum. To turn something into a note would be pretty simple using this system also. Here's a way you could do it:
#include <iostream>
class Note {
public:
enum Type {
// In this case, 16 represents a whole note, but it could be larger
// if demisemiquavers were used or something.
Semiquaver = 1,
Quaver = 2,
Crotchet = 4,
Minim = 8,
Semibreve = 16
};
static float GetNoteLength(const Type &note)
{ return static_cast<float>(note)/16.0f; }
static float TieNotes(const Type &note1, const Type &note2)
{ return GetNoteLength(note1)+GetNoteLength(note2); }
};
int main()
{
// Make a semiquaver
Note::Type sq = Note::Semiquaver;
// Make a quaver
Note::Type q = Note::Quaver;
// Dot it with the semiquaver from before
float dottedQuaver = Note::TieNotes(sq, q);
std::cout << "Semiquaver is equivalent to: " << Note::GetNoteLength(sq) << " beats\n";
std::cout << "Dotted quaver is equivalent to: " << dottedQuaver << " beats\n";
return 0;
}
Those 'irregular' notes you speak of can be retrieved using TieNotes.

Emulated Fixed Point Division/Multiplication

I'm writing a Fixedpoint class, but have run into a bit of a snag... I am not sure how to emulate the multiplication and division portions. I took a very rough stab at the division operator but I am sure it's wrong. Here's what it looks like so far:
class Fixed
{
Fixed(short int _value, short int _part) :
value(long(_value + (_part >> 8))), part(long(_part & 0x0000FFFF)) {};
...
inline Fixed operator -() const // example of some of the bitwise it's doing
{
return Fixed(-value - 1, (~part)&0x0000FFFF);
};
...
inline Fixed operator / (const Fixed & arg) const // example of how I'm probably doing it wrong
{
long int tempInt = value<<8 | part;
long int tempPart = tempInt;
tempInt /= arg.value<<8 | arg.part;
tempPart %= arg.value<<8 | arg.part;
return Fixed(tempInt, tempPart);
};
long int value, part; // members
};
I... am not a very good programmer, haha!
The class's part is 16 bits wide (but expressed as a 32-bit long since I imagine it'd need the room for possible overflows before they're fixed) and the same goes for value, which is the integer part. When the 'part' goes over 0xFFFF in one of its operations, the highest 16 bits are added to 'value', and then the part is masked so only its lowest 16 bits remain. That's done in the init list.
I hate to ask, but if anyone would know where I could find documentation for something like this, or even just the 'trick' or how to do those two operators, I would be very happy for it! I am a dimwit when it comes to math, and I know someone has had to do/ask this before, but searching google has for once not taken me to the promised land...

As Jan says, use a single integer. Since it looks like you're specifying 16 bit integer and fractional parts, you could do this with a plain 32 bit integer.
The "trick" is to realise what happens to the "format" of the number when you do operations on it. Your format would be described as 16.16. When you add or subtract, the format stays the same. When you multiply, you get 32.32 -- So you need a 64 bit temporary value for the result. Then you do a >>16 shift to get down to 48.16 format, then take the bottom 32 bits to get your answer in 16.16.
I'm a little rusty on the division -- In DSP, where I learned this stuff, we avoided (expensive) division wherever possible!
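To make the format bookkeeping concrete, here is a minimal sketch of 16.16 multiplication and division on a raw 32-bit value with a 64-bit intermediate. The type and function names are my own, overflow of the final result is not checked, and the scaling is written as division and multiplication by 65536 to sidestep shifting negative numbers.

```cpp
#include <cassert>
#include <cstdint>

typedef std::int32_t fix16; // raw 16.16 value: real value = raw / 65536

inline fix16 fix16_from_double(double d) { return static_cast<fix16>(d * 65536.0); }
inline double fix16_to_double(fix16 f) { return f / 65536.0; }

// 16.16 * 16.16 gives a 32.32 product, so it is computed in 64 bits and then
// scaled back down by 2^16 (truncating toward zero).
inline fix16 fix16_mul(fix16 a, fix16 b)
{
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<fix16>(wide / 65536);
}

// For division the dividend is scaled up by 2^16 first, so the quotient
// comes out in 16.16 again.
inline fix16 fix16_div(fix16 a, fix16 b)
{
    assert(b != 0);
    const std::int64_t wide = static_cast<std::int64_t>(a) * 65536;
    return static_cast<fix16>(wide / b);
}
```

For example, multiplying the raw values of 1.5 and 2.25 yields the raw value of 3.375, and dividing 3.0 by 2.0 yields 1.5.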

I'd recommend using one integer value instead of separate whole and fractional parts. Then addition and subtraction are directly their integer counterparts, and you can simply use 64-bit support, which all common compilers have these days:
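Multiplication:
Fixed operator*(const Fixed &other) const {
// the 64-bit product has 32 fractional bits; shift back down to 16
return Fixed(((int64_t)value * (int64_t)other.value) >> 16);
}
Division:
Fixed operator/(const Fixed &other) const {
// pre-shift the dividend so the quotient keeps 16 fractional bits
return Fixed(((int64_t)value << 16) / (int64_t)other.value);
}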
64-bit integers are available as follows:
On gcc, stdint.h (or cstdint, which places them in std:: namespace) should be available, so you can use the types I mentioned above. Otherwise it's long long on 32-bit targets and long on 64-bit targets.
On Windows, it's always long long or __int64.

To get things up and running, first implement the (unary) inverse(x) = 1/x, and then implement a/b as a*inverse(b). You'll probably want to represent the intermediates as a 32.32 format.

Packing 32bit floats into 30 bits (c++)

Here are the goals I'm trying to achieve:
I need to pack 32 bit IEEE floats into 30 bits.
I want to do this by decreasing the size of mantissa by 2 bits.
The operation itself should be as fast as possible.
I'm aware that some precision will be lost, and this is acceptable.
It would be an advantage, if this operation would not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this over speed.
I guess this questions consists of two parts:
1) Can I just simply clear the least significant bits of mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:
float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;
2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?
Thanks in advance

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on the compiler optimizations.
C++ standard, section 3.10 paragraph 15 says:
If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined
the dynamic type of the object,
a cv-qualified version of the dynamic type of the object,
a type similar to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
a char or unsigned char type.
Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten myself by this. The program I wrote stopped working after turning on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int which is a fair assumption by 3.10/15. The instructions got shuffled around by the optimizer under the as-if rule exploiting 3.10/15 and it stopped working.
Under the following assumptions
float really corresponds to a 32bit IEEE-float,
sizeof(float)==sizeof(int)
unsigned int has no padding bits or trap representations
you should be able to do it like this:
#include <cstring>
/// returns a 30 bit number
unsigned int pack_float(float x) {
unsigned r;
std::memcpy(&r,&x,sizeof r);
return r >> 2;
}
float unpack_float(unsigned int x) {
x <<= 2;
float r;
std::memcpy(&r,&x,sizeof r);
return r;
}
This doesn't suffer from the "3.10-violation" and is typically very fast. At least GCC treats memcpy as an intrinsic function. In case you don't need the functions to work with NaNs, infinities or numbers with extremely high magnitude you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":
unsigned int pack_float(float x) {
unsigned r;
std::memcpy(&r,&x,sizeof r);
return (r+1) >> 2;
}
This works even if it changes the exponent due to a mantissa overflow because the IEEE-754 coding maps consecutive floating point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.
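The assumptions listed above can be checked at compile time, which turns a silent misbehavior into a build error. The sketch below repeats the two functions so that it is self-contained and adds a hypothetical helper, round_trip_error, to measure the loss; the padding-bits assumption cannot be checked this way and is simply taken on trust.

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "float must be an IEEE 754 type");
static_assert(sizeof(float) == sizeof(unsigned int),
              "float and unsigned int must have the same size");

// Drops the two least significant mantissa bits; the result fits in 30 bits.
inline unsigned int pack_float(float x)
{
    unsigned int r;
    std::memcpy(&r, &x, sizeof r);
    return r >> 2;
}

// Restores a float with the two dropped bits set to zero.
inline float unpack_float(unsigned int x)
{
    x <<= 2;
    float r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}

// Relative error of one pack/unpack round trip (x must be finite and nonzero).
inline double round_trip_error(float x)
{
    assert(x != 0.0f);
    const double back = unpack_float(pack_float(x));
    return std::fabs(back - x) / std::fabs(static_cast<double>(x));
}
```

For normal (non-denormal) inputs the truncating version loses at most three units in the last place, so the relative error stays below 2^-21.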

Blindly dropping the 2 LSBs of the float may fail for a small number of unusual NaN encodings.
A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantissa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!
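Here is a small sketch that demonstrates the problem on raw bit patterns, assuming 32-bit IEEE 754 floats. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Builds a float from a raw bit pattern without violating aliasing rules.
inline float float_from_bits(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Clears the two least significant mantissa bits.
inline std::uint32_t drop_two_lsbs(std::uint32_t bits)
{
    return bits & ~std::uint32_t(3);
}

// True if the pattern is a NaN whose mantissa becomes zero after truncation,
// i.e. the NaN would silently turn into an infinity.
inline bool nan_becomes_infinity(std::uint32_t bits)
{
    const bool is_nan = (bits & 0x7F800000u) == 0x7F800000u &&
                        (bits & 0x007FFFFFu) != 0;
    return is_nan && (drop_two_lsbs(bits) & 0x007FFFFFu) == 0;
}
```

The pattern 0x7F800001 (a NaN with mantissa 1) is affected, while the common quiet NaN 0x7FC00000 is not, because its top mantissa bit survives the truncation. A packer that cares about this could test nan_becomes_infinity first and substitute a quiet NaN.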

You should encapsulate it in a struct, so that you don't accidentally mix the usage of the tagged float with regular "unsigned int":
#include <cstring>
#include <iostream>
using namespace std;
struct TypedFloat {
private:
union {
unsigned int raw : 32;
struct {
unsigned int num : 30;
unsigned int type : 2;
};
};
public:
TypedFloat(unsigned int type=0) : num(0), type(type) {}
operator float() const {
unsigned int tmp = num << 2;
float f;
memcpy(&f, &tmp, sizeof f); // copy the bits instead of aliasing through a cast
return f;
}
void operator=(float newnum) {
unsigned int tmp;
memcpy(&tmp, &newnum, sizeof tmp);
num = tmp >> 2; // keep the upper 30 bits
}
unsigned int getType() const {
return type;
}
void setType(unsigned int type) {
this->type = type;
}
};
int main() {
const unsigned int TYPE_A = 1;
TypedFloat a(TYPE_A);
a = 3.4;
cout << a + 5.4 << endl;
float b = a;
cout << a << endl;
cout << b << endl;
cout << a.getType() << endl;
return 0;
}
I can't guarantee its portability though.

How much precision do you need? If 16-bit float is enough (sufficient for some types of graphics), then ILM's 16-bit float ("half"), part of OpenEXR is great, obeys all kinds of rules (http://www.openexr.com/), and you'll have plenty of space left over after you pack it into a struct.
On the other hand, if you know the approximate range of values they're going to take, you should consider fixed point. They're more useful than most people realize.

I can't select any of the answers as the definite one, because most of them have valid information, but not quite what I was looking for. So I'll just summarize my conclusions.
The method for conversion I've posted in part 1) of my question is clearly wrong by the C++ standard, so other methods to extract a float's bits should be used.
And most important... as far as I understand from reading the responses and other sources about IEEE754 floats, it's OK to drop the least significant bits from the mantissa. It will mostly affect only precision, with one exception: sNaN. Since sNaN is represented by an exponent set to 255 and mantissa != 0, there can be situations where the mantissa is <= 3, and dropping the last two bits would convert the sNaN to +/-Infinity. But since sNaNs are not generated during floating point operations on the CPU, it's safe in a controlled environment.
