My C++ lectures at university have just begun, and I've already run into my first problems. Our task was to implement a self-made structure for IEEE 754 floating point numbers in C++:
Create a data structure that allows you to store a float, read its raw byte representation and its internal representation as s, e and m. Use a combination of union and bit-field-struct.
Write a program where a float number is assigned to the float part of the structure and the raw and s/e/m representation is printed. Use hexadecimal output for raw and m.
What I have so far is the following:
#include <stdio.h>
#include <math.h>
union {
struct KFloat {
//Using bit fields for our self made float. s sign, e exponent, m mantissa
//It should be unsigned because we simply use 0 and 1
unsigned int s : 1, e : 8, m : 23;
};
//One bit will be wasted for our '.'
char internal[33];
};
float calculateRealFloat(KFloat kfloat) {
if(kfloat.s == 0) {
return (1.0+kfloat.m)*pow(2.0, (kfloat.e-127.0));
} else if (kfloat.s == 1) {
return (-1.0)*((1.0+kfloat.m)*pow(2.0, (kfloat.e-127.0)));
}
//Error case when s is bigger than 1
return 0.0;
}
int main(void) {
KFloat kf_pos = {0, 128, 1.5707963705062866};//This should be Pi (rounded) aka 3.1415927
KFloat kf_neg = {1, 128, 1.5707963705062866};//Pi negative
float f_pos = calculateRealFloat(kf_pos);
float f_neg = calculateRealFloat(kf_neg);
printf("The positive float is %f or ",f_pos);
printf("%e\n", f_pos);
printf("The negative float is %f or ",f_neg);
printf("%e", f_neg);
return 0;
}
The first error in this code is clearly that the mantissa is absolutely wrong, but I have no idea how to fix it.
Please reread the task:
Create a data structure that allows you to store a float,
read its raw byte representation
and its internal representation as s, e and m.
This does not mean that you should store a string.
I would do it the following way:
union MyFloat
{
unsigned char rawByteDataRep[4];
unsigned int rawDataRep;
float floatRep;
struct { //not checked; this part is just copied from your code
unsigned s : 1;
unsigned e : 8;
unsigned m : 23;
} componentesRep;
};
but be careful!
Besides the fact that this union-conversion pattern is widely used, the C++ standard states that the result is undefined behaviour if you read a different union member than the one that was last written (C, by contrast, allows it and simply reinterprets the stored bytes).
Edit:
added uint32 rep
void testMyfloat()
{
MyFloat mf;
mf.floatRep = 3.14;
printf("The float %f is assembled from sign %i magnitude 0x%08x and exponent %i and looks in memory like that 0x%08x.\n",
mf.floatRep,
(int)mf.componentesRep.s,
(unsigned int)mf.componentesRep.m,
(int)mf.componentesRep.e,
mf.rawDataRep);
}
Bruce Dawson has an excellent series of blog posts on floating point representation and arithmetic. The latest in the series, which has a bunch of links to previous posts that discusses this subject matter in detail, is here.
Let's assume we have a representation of -63 as signed seven-bit integer within a uint16_t. How can we convert that number to float and back again, when we don't know the representation type (like two's complement).
An application for such an encoding could be that several numbers are stored in one int16_t. The bit-count could be known for each number and the data is read/written from a third-party library (see for example the encoding format of tivxDmpacDofNode() here: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/group__group__vision__function__dmpac__dof.html --- but this is just an example). An algorithm should be developed that makes the compiler create the right encoding/decoding independent from the actual representation type. Of course it is assumed that the compiler uses the same representation type as the library does.
One way that seems to work well, is to shift the bits such that their sign bit coincides with the sign bit of an int16_t and let the compiler do the rest. Of course this makes an appropriate multiplication or division necessary.
Please see this example:
#include <iostream>
#include <cstdint> // for uint16_t and int16_t
#include <cmath>
int main()
{
// -63 as signed seven-bits representation
uint16_t data = 0b1000001;
// Shift 9 bits to the left
int16_t correct_sign_data = static_cast<int16_t>(data << 9);
float f = static_cast<float>(correct_sign_data);
// Undo effect of shifting
f /= pow(2, 9);
std::cout << f << std::endl;
// Now back to signed bits
f *= pow(2, 9);
uint16_t bits = static_cast<uint16_t>(static_cast<int16_t>(f)) >> 9;
std::cout << "Equals: " << (data == bits) << std::endl;
return 0;
}
I have two questions:
This example uses actually a number with known representation type (two's complement) converted by https://www.exploringbinary.com/twos-complement-converter/. Is the bit-shifting still independent from that and would it work also for other representation types?
Is this the canonical and/or most elegant way to do it?
Clarification:
I know the bit width of the integers I would like to convert (please check the link to the TIOVX example above), but the integer representation type is not specified.
The intention is to write code that can be recompiled without changes on a system with another integer representation type and still correctly converts from int to float and/or back.
My claim is that the example source code above does exactly that (except that the example input data is hardcoded and it would have to be different if the integer representation type were not two's complement). Am I right? Could such a "portable" solution be written also with a different (more elegant/canonical) technique?
Your question is ambiguous as to whether you intend to truly store odd-bit integers, or odd-bit floats represented by custom-encoded odd-bit integers. I'm assuming by "not knowing" the bit-width of the integer, that you mean that the bit-width isn't known at compile time, but is discovered at runtime as your custom values are parsed from a file, for example.
Edit by author of original post:
The assumption in the original question that the presented code is independent from the actual integer representation type, is wrong (as explained in the comments). Integer types are not specified, for example it is not clear that the leftmost bit is the sign bit. Therefore the presented code also contains assumptions, they are just different (and most probably worse) than the assumption "integer representation type is two's complement".
Here's a simple example of storing an odd-bit integer. I provide a simple struct that lets you decide how many bits are in your integer. However, for simplicity in this example, I used uint8_t, which has a maximum of 8 bits, obviously. There are several different assumptions and simplifications made here, so if you want help on any specific nuance, please specify more in the comments and I will edit this answer.
One key detail is to properly mask off your n-bit integer after performing 2's complement conversions.
Also please note that I have basically ignored overflow concerns and bit-width switching concerns that may or may not be a problem depending on how you intend to use your custom-width integers and the maximum bit-width you intend to support.
#include <iostream>
#include <cstdint> // for uint8_t
struct CustomInt {
int bitCount = 7;
uint8_t value;
uint8_t mask = 0;
CustomInt(int _bitCount, uint8_t _value) {
bitCount = _bitCount;
value = _value;
mask = 0;
for (int i = 0; i < bitCount; ++i) {
mask |= (1 << i);
}
}
bool isNegative() {
return (value >> (bitCount - 1)) & 1;
}
int toInt() {
bool negative = isNegative();
uint8_t tempVal = value;
if (negative) {
tempVal = ((~tempVal) + 1) & mask;
}
int ret = tempVal;
return negative ? -ret : ret;
}
float toFloat() {
return toInt(); //Implied truncation!
}
void setFromFloat(float f) {
int intVal = f; //Implied truncation!
bool negative = f < 0;
if (negative) {
intVal = -intVal;
}
value = intVal;
if (negative) {
value = ((~value) + 1) & mask;
}
}
};
int main() {
CustomInt test(7, 0b01001110); // -50. Would be 78 if this were a normal 8-bit integer
std::cout << test.toFloat() << std::endl;
}
Consider the following C++ code:
unsigned char* data = readData(...); //let's say data consists of 12 characters
unsigned int dataSize = getDataSize(...); //the size in bytes of the data is also known (let's say 12 bytes)
struct Position
{
float pos_x; //remember that float is 4 bytes
double pos_y; //remember that double is 8 bytes
};
Now I want to fill a Position variable/instance with data.
Position pos;
pos.pos_x = ? //data[0:4[ The first 4 bytes of data should be set to pos_x, since pos_x is of type float which is 4 bytes
pos.pos_y = ? //data[4:12[ The remaining 8 bytes of data should be set to pos_y, which is of type double (8 bytes)
I know that in data the first bytes correspond to pos_x and the rest to pos_y. That means the first 4 bytes/characters of data should be used to fill pos_x and the 8 remaining bytes fill pos_y, but I don't know how to do that.
Any idea? Thanks. P.S.: I'm limited to C++11.
You can use plain memcpy as another answer advises. I suggest wrapping memcpy in a function that also does error checking for you, for more convenient and type-safe usage.
Example:
#include <cstring>
#include <stdexcept>
#include <type_traits>
struct ByteStreamReader {
unsigned char const* begin;
unsigned char const* const end;
template<class T>
operator T() {
static_assert(std::is_trivially_copyable<T>::value,
"The type you are using cannot be safely copied from bytes.");
if(end - begin < static_cast<decltype(end - begin)>(sizeof(T)))
throw std::runtime_error("ByteStreamReader");
T t;
std::memcpy(&t, begin, sizeof t);
begin += sizeof t;
return t;
}
};
struct Position {
float pos_x;
double pos_y;
};
int main() {
unsigned char data[12] = {};
unsigned dataSize = sizeof data;
ByteStreamReader reader{data, data + dataSize};
Position p;
p.pos_x = reader;
p.pos_y = reader;
}
One thing that you can do is to copy the data byte by byte. There is a standard function to do that: std::memcpy. Example usage:
assert(sizeof pos.pos_x == 4);
std::memcpy(&pos.pos_x, data, 4);
assert(sizeof pos.pos_y == 8);
std::memcpy(&pos.pos_y, data + 4, 8);
Note that simply copying the data only works if the data is in the same representation as the CPU uses. Understand that different processors use different representations. Therefore, if your readData receives the data over the network, for example, a simple copy is not a good idea. The least that you would have to do in such a case is convert the endianness of the data to the native endianness (probably from big endian, which is conventionally used as the network byte order). Converting from one floating point representation to another is much trickier, but luckily IEEE 754 is fairly ubiquitous.
Note: This question started with a faulty premise: the values that appeared to be 0.0 were in fact very small numbers. But it morphed into a discussion of different ways of reinterpreting the bits of one type as a different type. TL;DR: until C++20 arrives with its new std::bit_cast function, the standard, portable solution is memcpy.
Update 3: here's a very short app that demonstrates the problem. The bits interpreted as float always give a value of 0.0000:
#include <cstdio>
#include <cstdint>
#include <cstring>
int main() {
uint32_t i = 0x04501234; // arbitrary bits should be a valid float
uint32_t bitPattern;
uint32_t* ptr_to_uint = &bitPattern;
volatile float bitPatternAsFloat;
volatile float* ptr_to_float = &bitPatternAsFloat;
do {
bitPattern = i;
memcpy( (void*)ptr_to_float, (const void*)ptr_to_uint, 4 );
// The following 2 lines both print the float value as 0.00000
//printf( "bitPattern: %0X, bitPatternAsFloat: %0X, as float: %f \r\n", bitPattern, *(unsigned int*)&bitPatternAsFloat, bitPatternAsFloat );
printf( "bitPattern: %0X, bitPatternAsFloat: %0X, as float: %f \r\n", bitPattern, *(unsigned int*)&bitPatternAsFloat, *ptr_to_float );
i++;
} while( i < 0x04501254 );
return 0;
}
(original post)
float bitPatternAsFloat;
for (uint32_t i = 0; i <= 0xFFFFFFFF; i = (i & 0x7F800000) == 0 ? i | 0x00800000 : i + 1 )
{
bitPatternAsFloat = *(float*)&i;
...
The loop steps through every bit pattern, skipping those in which the float exponent field is 0. Then I try to interpret the bits as a float. The values of i look OK (printed in hex), but bitPatternAsFloat always comes back as 0.0. What's wrong?
Thinking that maybe i is held in a register and thus &i returns nothing, I tried copying i to another uint32_t variable first, but the result is the same.
I know I'm modifying the loop variable with the ternary in the for statement, but as I understand it this is OK.
Update: after reading about aliasing, I tried this:
union FloatBits {
uint32_t uintVersion;
float floatVersion;
};
float bitPatternAsFloat;
// inside test method:
union FloatBits fb; // also tried it without the 'union' keyword
// inside test loop (i is a uint32_t incrementing through all values)
fb.uintVersion = i;
bitPatternAsFloat = fb.floatVersion;
When I print the values, fb.uintVersion prints the expected value of i, but fb.floatVersion still prints as 0.000000. Am I close? What am I missing? The compiler is g++ 6, which allows "type punning".
Update 2: here's my attempt at using memcpy. It doesn't work (the float value is always 0.0):
uint32_t i = 0;
uint32_t bitPattern;
float bitPatternAsFloat;
uint32_t* ptr_to_uint = &bitPattern;
float* ptr_to_float = &bitPatternAsFloat;
do {
bitPattern = i; //incremented from 0x00000000 to 0xFFFFFFFF
memcpy( (void*)ptr_to_float, (const void*)ptr_to_uint, 4 );
// bitPatternAsFloat always reads as 0.0
...
i++;
} while( i );
Thanks to @geza for pointing out that the "failures" were an artifact of printf not being able to show very small numbers in %f format. With %e format the correct values are displayed.
In my tests, all of the various conversion methods discussed work, even the undefined behavior. Considering simplicity, performance, and portability it seems the solution using a union is best:
uint32_t bitPattern;
volatile float bitPatternAsFloat;
union UintFloat {
uint32_t asUint;
float asFloat;
} uintFloat;
do {
bitPattern = i; // i is loop index
uintFloat.asUint = bitPattern;
bitPatternAsFloat = uintFloat.asFloat;
An interesting question is whether the float member in the union should be declared volatile: it is never explicitly written, so might the compiler cache the initial value and reuse it, failing to notice that the corresponding uint32_t is being updated? It appears to work OK on my current compiler without the volatile declaration, but is this a potential 'gotcha'?
P.S. I just discovered reinterpret_cast and wonder if that's the best solution.
I've been stumped on this one for days. I've written this program from a book called Write Great Code, Volume 1: Understanding the Machine, chapter four.
The project is to do Floating Point operations in C++. I plan to implement the other operations in C++ on my own; the book uses HLA (High Level Assembly) in the project for other operations like multiplication and division.
I wanted to display the exponent and other field values after they've been extracted from the FP number, for debugging. Yet I have a problem: when I look at these values in memory they are not what I think they should be. Key words: what I think. I believe I understand the IEEE FP format; it's fairly simple, and I understand all I've read so far in the book.
The big problem is why the Rexponent variable seems to be almost unpredictable; in this example, with the given values, it's 5. Why is that? My guess was two, because the decimal point is two digits right of the implied one.
I've commented the actual values that are produced in the program into the code so you don't have to run the program to get a sense of what's happening (at least in the important parts).
It is unfinished at this point. The entire project has not been created on my computer yet.
Here is the code (quoted from the file which I copied from the book and then modified):
#include <iostream>
#include <cstdio> // for printf
typedef long unsigned real; //typedef our long unsigned ints to the label "real" so we don't confuse them with other datatypes.
using namespace std; //Just so I don't have to type out std::cout any more!
#define asreal(x) (*((float *) &x)) //Treat the address of x as a float pointer and dereference it, so the compiler doesn't convert our FP values when assigning.
inline int extractExponent(real from) {
return ((from >> 23) & 0xFF) - 127; //Shift right 23 bits, mask with eight ones (0xFF == 1111_1111) and remove the bias by subtracting 127.
}
void fpadd ( real left, real right, real *dest) {
//Left operand field containers
long unsigned int Lexponent = 0;
long unsigned Lmantissa = 0;
int Lsign = 0;
//RIGHT operand field containers
long unsigned int Rexponent = 0;
long unsigned Rmantissa = 0;
int Rsign = 0;
//Resulting operand field containers
long int Dexponent = 0;
long unsigned Dmantissa = 0;
int Dsign = 0;
std::cout << "Size of datatype: long unsigned int is: " << sizeof(long unsigned int); //For debugging
//Properly initialize the above variable's:
//Left
Lexponent = extractExponent(left); //Zero. This value is NOT a flat zero when displayed because we subtract 127 from the exponent after extracting it! //Value is: 0xffffff81
Lmantissa = extractMantissa (left); //Zero. We don't do anything to this number except add a whole number one to it. //Value is: 0x00000000
Lsign = extractSign(left); //Simple.
//Right
Rexponent = extractExponent(right); //Value is: 0x00000005 <-- why???
Rmantissa = extractMantissa (right);
Rsign = extractSign(right);
}
int main (int argc, char *argv[]) {
real a, b, c;
asreal(a) = -0.0;
asreal(b) = 45.67;
fpadd(a,b, &c);
printf("Sum of A and B is: %f", c);
std::cin >> a;
return 0;
}
Help would be much appreciated; I'm several days in to this project and very frustrated!
in this example with the given values it's 5. Why is that?
The floating point number 45.67 is internally represented (as a double) as
2^5 * 1.0110110101011100001010001111010111000010100011110110
which actually represents the number
45.6700000000000017053025658242404460906982421875
This is as close as you can get to 45.67 in a double.
If all you are interested in is the exponent of a number, simply compute its base 2 logarithm and round down. Since 45.67 is between 32 (2^5) and 64 (2^6), the exponent is 5.
Computers use binary representation for all numbers. Hence, the exponent is for base two, not base ten. int(log2(45.67)) = 5.
Based on the question convert from float-point to custom numeric type, I figured out a portable, safe way to convert a floating-point type into an array of integers, and the code works fine. But for some values, when converting from double to unsigned long long with precision that can be safely represented by unsigned long long, the conversion fails, not with a compile-time error but with an invalid value: the minimum representable value for signed long long, or zero. The conversion fails on Visual C++ 2008, Intel XE 2013 and GCC 4.7.2.
Here is the code (notice the first statement inside the while loop in the main function):
#ifndef CHAR_BIT
#include <limits.h>
#endif
#include <float.h>
#include <math.h>
typedef signed int int32;
typedef signed long long int64;
typedef unsigned int uint32;
typedef unsigned long long uint64;
typedef float float32;
typedef double float64;
// get size of type in bits corresponding to CHAR_BIT.
template<typename t>
struct sizeof_ex
{
static const uint32 value = sizeof(t) * CHAR_BIT;
};
// factorial function
float64 fct(int32 i)
{
float64 r = 1;
do r *= i; while(--i > 1);
return r;
}
int main()
{
// maximum 2 to power that can be stored in uint32
const uint32 power_2 = uint32(~0);
// number of binary digits in power_2
const uint32 digit_cnt = sizeof_ex<uint32>::value;
// number of array elements that will store expanded value
const uint32 comp_count = DBL_MAX_EXP / digit_cnt + uint32((DBL_MAX_EXP / digit_cnt) * digit_cnt < DBL_MAX_EXP);
// array elements
uint32 value[comp_count];
// get factorial for 23
float64 f = fct(23);
// save sign for later correction
bool sign = f < 0;
// remove sign from float-point if exists
if (sign) f *= -1;
// get number of binary digits in f
uint32 actual_digits = 0;
frexp(f, (int32*)&actual_digits);
// get start index in array for little-endian format
uint32 start_index = (actual_digits / digit_cnt) + uint32((actual_digits / digit_cnt) * digit_cnt < actual_digits) - 1;
// get all parts but the last
while (start_index > 0)
{
// store current part
// in this line the compiler fails
value[start_index] = uint64(f / power_2);
// exclude it from f
f -= power_2 * float64(value[start_index]);
// decrement index
--start_index;
}
// get last part
value[0] = uint32(f);
}
The conversion code above gives different results from one compiler to another: when the parameter of the factorial function is, say, 20, all compilers return a valid result; when the value is greater than 20, some compilers get part of the result and others don't; and when it gets bigger, e.g. 35, the result becomes zero.
Please tell me why these errors occur.
Thank you.
I don't think your conversion logic makes any sense.
You have a value called "power_2" which is not actually a power of 2, despite commenting that it is.
You extract bits of a very large (>64-bit) number by dividing by something less than 32-bits. Obviously the result of that will be >32 bits, but you store it into a 32-bit value, truncating it. Then you remultiply that by the original divisor and subtract from your float. However as the number was truncated, you are subtracting much less than the original value, which almost certainly wasn't what you expected.
I think there's more wrong than just that - you don't really always want the top 32 bits; for a number which is not a multiple of 32 bits long, you want the actual length mod 32.
Here's a somewhat lazy hack on your code that does what I think you're trying to do. Note that the pow() could be optimised out.
while (start_index > 0)
{
float64 fpow = pow(2., 32. * start_index);
// store current part
// in this line the compiler fails
value[start_index] = f / fpow;
// exclude it from f
f -= fpow * float64(value[start_index]);
// decrement index
--start_index;
}
That's pretty much untested, but hopefully illustrates what I mean.