I've been stumped on this one for days. I've written this program from a book called Write Great Code, Volume 1: Understanding the Machine, Chapter 4.
The project is to do floating-point operations in C++. I plan to implement the other operations in C++ on my own; the book uses HLA (High Level Assembly) in the project for the other operations, like multiplication and division.
I wanted to display the exponent and other field values after they've been extracted from the FP number, for debugging. Yet I have a problem: when I look at these values in memory they are not what I think they should be. Key words: what I think. I believe I understand the IEEE FP format; it's fairly simple, and I understand everything I've read so far in the book.
The big problem is that the Rexponent variable seems to be almost unpredictable; in this example, with the given values, it's 5. Why is that? By my guess it should be two: two because the decimal point is two digits to the right of the implied one.
I've commented the actual values that are produced in the program into the code, so you don't have to run the program to get a sense of what's happening (at least in the important parts).
It is unfinished at this point; the entire project has not been created on my computer yet.
Here is the code (quoted from the file, which I copied from the book and then modified):
#include <iostream>
#include <cstdio>  // for printf used in main

typedef long unsigned real; // typedef long unsigned int to the label "real" so we don't confuse it with other datatypes.

using namespace std; // Just so I don't have to type out std::cout any more!

#define asreal(x) (*((float *) &x)) // Treat the bits of x as a float by casting its address to a float pointer, so the compiler doesn't convert/truncate our FP values on assignment.

inline int extractExponent(real from) {
    return ((from >> 23) & 0xFF) - 127; // Shift right 23 bits, mask with eight ones (0xFF == 1111_1111), and remove the bias by subtracting 127.
}
void fpadd(real left, real right, real *dest) {
    // Left operand field containers
    long unsigned int Lexponent = 0;
    long unsigned Lmantissa = 0;
    int Lsign = 0;

    // Right operand field containers
    long unsigned int Rexponent = 0;
    long unsigned Rmantissa = 0;
    int Rsign = 0;

    // Resulting operand field containers
    long int Dexponent = 0;
    long unsigned Dmantissa = 0;
    int Dsign = 0;

    std::cout << "Size of datatype: long unsigned int is: " << sizeof(long unsigned int); // For debugging

    // Properly initialize the above variables:
    // Left
    Lexponent = extractExponent(left); // Zero. This value is NOT a flat zero when displayed because we subtract 127 from the exponent after extracting it! // Value is: 0xffffff81
    Lmantissa = extractMantissa(left); // Zero. We don't do anything to this number except add a whole number one to it. // Value is: 0x00000000
    Lsign = extractSign(left); // Simple.

    // Right
    Rexponent = extractExponent(right); // Value is: 0x00000005 <-- why???
    Rmantissa = extractMantissa(right);
    Rsign = extractSign(right);
}
int main(int argc, char *argv[]) {
    real a, b, c;
    asreal(a) = -0.0;
    asreal(b) = 45.67;
    fpadd(a, b, &c);
    printf("Sum of A and B is: %f", c);
    std::cin >> a;
    return 0;
}
Help would be much appreciated; I'm several days into this project and very frustrated!
in this example, with the given values, it's 5. Why is that?
The floating point number 45.67 is internally represented as
2^5 * 1.0110110101011100001010001111010111000010100011110110
which actually represents the number
45.6700000000000017053025658242404460906982421875
This is as close as you can get to 45.67 in a double (a float rounds it slightly further, to about 45.6699982, but its exponent is still 5).
If all you are interested in is the exponent of a number, simply compute its base 2 logarithm and round down. Since 45.67 is between 32 (2^5) and 64 (2^6), the exponent is 5.
Computers use binary representation for all numbers. Hence, the exponent is for base two, not base ten. int(log2(45.67)) = 5.
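To see this in code, here is a minimal sketch (not from the book, and using memcpy instead of the pointer-cast macro so the bit pattern is read in a well-defined way) that prints the raw fields of 45.67f; the exponent field holds 132, i.e. 5 after subtracting the bias of 127:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 45.67f;
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);       // copy the raw IEEE-754 pattern

    unsigned sign     = bits >> 31;            // 1 bit
    unsigned exponent = (bits >> 23) & 0xFF;   // 8 bits, biased by 127
    unsigned mantissa = bits & 0x7FFFFF;       // 23 bits, implied leading 1

    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                sign, exponent, (int)exponent - 127, mantissa);
    // Expected output: sign=0 exponent=132 (unbiased 5) mantissa=0x36AE14
    return 0;
}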
Let's assume we have a representation of -63 as a signed seven-bit integer within a uint16_t. How can we convert that number to float and back again, when we don't know the representation type (such as two's complement)?
An application for such an encoding could be that several numbers are stored in one int16_t. The bit count could be known for each number, and the data is read/written by a third-party library (see for example the encoding format of tivxDmpacDofNode() here: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/group__group__vision__function__dmpac__dof.html --- but this is just an example). An algorithm should be developed that makes the compiler create the right encoding/decoding independently of the actual representation type. Of course, it is assumed that the compiler uses the same representation type as the library does.
One way that seems to work well is to shift the bits so that their sign bit coincides with the sign bit of an int16_t and let the compiler do the rest. Of course, this makes an appropriate multiplication or division necessary.
Please see this example:
#include <iostream>
#include <cmath>
#include <cstdint>

int main()
{
    // -63 as signed seven-bit representation
    uint16_t data = 0b1000001;

    // Shift 9 bits to the left so the 7-bit sign bit lands on the int16_t sign bit
    int16_t correct_sign_data = static_cast<int16_t>(data << 9);
    float f = static_cast<float>(correct_sign_data);

    // Undo the effect of shifting
    f /= pow(2, 9);
    std::cout << f << std::endl;

    // Now back to signed bits
    f *= pow(2, 9);
    uint16_t bits = static_cast<uint16_t>(static_cast<int16_t>(f)) >> 9;
    std::cout << "Equals: " << (data == bits) << std::endl;

    return 0;
}
I have two questions:
This example actually uses a number with a known representation type (two's complement), converted with https://www.exploringbinary.com/twos-complement-converter/. Is the bit-shifting still independent of that, and would it also work for other representation types?
Is this the canonical and/or most elegant way to do it?
Clarification:
I know the bit width of the integers I would like to convert (please check the link to the TIOVX example above), but the integer representation type is not specified.
The intention is to write code that can be recompiled without changes on a system with another integer representation type and still correctly converts from int to float and/or back.
My claim is that the example source code above does exactly that (except that the example input data is hardcoded and would have to be different if the integer representation type were not two's complement). Am I right? Could such a "portable" solution also be written with a different (more elegant/canonical) technique?
Your question is ambiguous as to whether you intend to truly store odd-bit integers, or odd-bit floats represented by custom-encoded odd-bit integers. I'm assuming that by "not knowing" the bit width of the integer you mean the bit width isn't known at compile time but is discovered at runtime, for example as your custom values are parsed from a file.
Edit by author of original post:
The assumption in the original question that the presented code is independent of the actual integer representation type is wrong (as explained in the comments). Integer types are not fully specified; for example, it is not guaranteed that the leftmost bit is the sign bit. Therefore the presented code also contains assumptions; they are just different (and most probably worse) than the assumption that the integer representation type is two's complement.
Here's a simple example of storing an odd-bit integer. I provide a simple struct that lets you decide how many bits are in your integer. However, for simplicity in this example, I used uint8_t, which obviously has a maximum of 8 bits. There are several different assumptions and simplifications made here, so if you want help on any specific nuance, please say more in the comments and I will edit this answer.
One key detail is to properly mask off your n-bit integer after performing two's-complement conversions.
Also, please note that I have basically ignored overflow concerns and bit-width-switching concerns, which may or may not be a problem depending on how you intend to use your custom-width integers and the maximum bit width you intend to support.
#include <iostream>
#include <cstdint>

struct CustomInt {
    int bitCount = 7;
    uint8_t value;
    uint8_t mask = 0;

    CustomInt(int _bitCount, uint8_t _value) {
        bitCount = _bitCount;
        value = _value;
        mask = 0;
        for (int i = 0; i < bitCount; ++i) {
            mask |= (1 << i);
        }
    }

    bool isNegative() {
        return (value >> (bitCount - 1)) & 1;
    }

    int toInt() {
        bool negative = isNegative();
        uint8_t tempVal = value;
        if (negative) {
            tempVal = ((~tempVal) + 1) & mask; // two's-complement magnitude, masked to bitCount bits
        }
        int ret = tempVal;
        return negative ? -ret : ret;
    }

    float toFloat() {
        return toInt(); // implicit conversion to float
    }

    void setFromFloat(float f) {
        int intVal = f; // implied truncation!
        bool negative = f < 0;
        if (negative) {
            intVal = -intVal;
        }
        value = intVal;
        if (negative) {
            value = ((~value) + 1) & mask; // back to two's complement, masked to bitCount bits
        }
    }
};

int main() {
    CustomInt test(7, 0b01001110); // -50. Would be 78 if this were a normal 8-bit integer
    std::cout << test.toFloat() << std::endl;
}
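For comparison, here is a branch-free way to sign-extend an n-bit two's-complement field, as an alternative sketch (not part of the answer above); it assumes the field is already masked to its low n bits, and signExtend is just an illustrative helper name:

#include <cstdint>
#include <iostream>

// Sign-extend the low 'bitCount' bits of 'value', assuming two's complement.
// XOR-ing with the sign bit's weight and then subtracting it turns a set sign
// bit into a borrow that propagates into all the upper bits.
int signExtend(uint8_t value, int bitCount) {
    const int signBit = 1 << (bitCount - 1);
    return (static_cast<int>(value) ^ signBit) - signBit;
}

int main() {
    std::cout << signExtend(0b1000001, 7) << '\n';  // -63
    std::cout << signExtend(0b01001110, 7) << '\n'; // -50 (same input as above)
    std::cout << signExtend(0b0110010, 7) << '\n';  // 50
}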
I have a sensor that stores the recorded information as a .pcap file. I have managed to load the file into an unsigned char array. The sensor stores information in a unique format: for instance, to represent an angle of 290.16, it stores the binary equivalent of 0x58 0x71.
What I have to do to get the correct angle is concatenate 0x71 and 0x58, convert the resulting hex value to decimal, divide it by 100, and then store it for further analysis.
My current approach is this:
// all header files are included
int main()
{
    unsigned char data[50]; // I actually have the data loaded into this from a file
    data[40] = 0x58;
    data[41] = 0x71;
    // The above may be incorrect. What I am trying to imply is that if I use the statement
    //     printf("%.2x %.2x", data[40], data[41]);
    // the resultant output you see on screen is
    //     58 71

    // I get the decimal value I wanted using the statement below
    float gar = hex2Dec(dec2Hex(data[41]) + dec2Hex(data[40])) / 100.0;
}
hex2Dec and dec2Hex are my own functions.
unsigned int hex2Dec(const string Hex)
{
    unsigned int DecimalValue = 0;
    for (unsigned int i = 0; i < Hex.size(); ++i)
    {
        DecimalValue = DecimalValue * 16 + hexChar2Decimal(Hex[i]);
    }
    return DecimalValue;
}

string dec2Hex(unsigned int Decimal)
{
    string Hex = "";
    while (Decimal != 0)
    {
        int HexValue = Decimal % 16;
        // convert decimal value to a hex digit
        char HexChar = (HexValue <= 9 && HexValue >= 0) ?
            static_cast<char>(HexValue + '0') : static_cast<char>(HexValue - 10 + 'A');
        Hex = HexChar + Hex;
        Decimal = Decimal / 16;
    }
    return Hex;
}

int hexChar2Decimal(char Ch)
{
    Ch = toupper(Ch); // Change the character to upper case
    if (Ch >= 'A' && Ch <= 'F')
    {
        return 10 + Ch - 'A';
    }
    else
        return Ch - '0';
}
The pain is that I have to do this conversion billions of times, which really slows down the process. Is there any more efficient way to deal with this case?
A MATLAB script that my friend developed for a similar sensor took him 3 hours to extract data worth only 1 minute of real time. I really need it to be as fast as possible.
As far as I can tell this does the same as
float gar = ((data[45]<<8)+data[44])/100.0;
For:
unsigned char data[50];
data[44] = 0x58;
data[45] = 0x71;
the value of gar will be 290.16.
Explanation:
It is not necessary to convert the value of an integer to a string to get the hex value, because decimal, hexadecimal, binary, etc. are only different representations of the same value. data[45]<<8 shifts the value of data[45] eight bits to the left. Before the operation is performed, the type of the operand is promoted to int (except on some unusual implementations where it might be unsigned int), so the resulting type is large enough not to overflow. Shifting eight bits to the left is equivalent to shifting two digits to the left in hexadecimal representation, so the result is 0x7100. Then the value of data[44] is added to that and you get 0x7158. That int result is then converted for the division by 100.0, and the quotient is stored in the float.
In general, int might be too small to apply the shift operation without shifting into the sign bit if it is only 16 bits long. If you want to cover that case, then explicitly cast to unsigned int:
float gar = (((unsigned int)data[45]<<8)+data[44])/100.0;
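If the conversion really has to run billions of times, a tight loop over the raw buffer using the expression above should be close to optimal. Here is a rough sketch; the record size, angle offset, and function name are hypothetical placeholders, not taken from your sensor's actual format:

#include <cstddef>
#include <vector>

// Hypothetical record layout: the two angle bytes sit at a fixed offset inside
// each fixed-size record (adjust RECORD_SIZE and ANGLE_OFFSET to match the
// real .pcap payload).
const std::size_t RECORD_SIZE  = 50;
const std::size_t ANGLE_OFFSET = 40;

void decodeAngles(const unsigned char *buf, std::size_t numRecords,
                  std::vector<float> &out) {
    out.reserve(out.size() + numRecords);
    for (std::size_t i = 0; i < numRecords; ++i) {
        const unsigned char *rec = buf + i * RECORD_SIZE;
        // High byte at offset+1, low byte at offset, as in your example.
        unsigned raw = (static_cast<unsigned>(rec[ANGLE_OFFSET + 1]) << 8)
                     |  static_cast<unsigned>(rec[ANGLE_OFFSET]);
        out.push_back(raw / 100.0f); // e.g. 0x7158 -> 29016 -> 290.16
    }
}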
In the question "C convert hex to decimal format", Emil H posted some sample code that looks very similar to what you want.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *hex_value_string = "deadbeef";
    unsigned int out;

    sscanf(hex_value_string, "%x", &out);

    printf("%o %o\n", out, 0xdeadbeef);
    printf("%x %x\n", out, 0xdeadbeef);

    return 0;
}
Your conversion functions don't look particularly efficient, so hopefully this is faster.
My C++ lectures at university have only just begun, and I've already hit my first problem. Our task was to implement a self-made structure in C++ for floating-point numbers following the IEEE 754 standard:
Create a data structure that allows you to store a float, read its raw byte representation and its internal representation as s, e and m. Use a combination of union and bit-field-struct.
Write a program where a float number is assigned to the float part of the structure and the raw and s/e/m representation is printed. Use hexadecimal output for raw and m.
What I have so far is the following:
#include <stdio.h>
#include <math.h>

union {
    struct KFloat {
        // Using bit fields for our self-made float: s sign, e exponent, m mantissa
        // It should be unsigned because we simply use 0 and 1
        unsigned int s : 1, e : 8, m : 23;
    };
    // One bit will be wasted for our '.'
    char internal[33];
};

float calculateRealFloat(KFloat kfloat) {
    if (kfloat.s == 0) {
        return (1.0 + kfloat.m) * pow(2.0, (kfloat.e - 127.0));
    } else if (kfloat.s == 1) {
        return (-1.0) * ((1.0 + kfloat.m) * pow(2.0, (kfloat.e - 127.0)));
    }
    // Error case when s is bigger than 1
    return 0.0;
}

int main(void) {
    KFloat kf_pos = {0, 128, 1.5707963705062866}; // This should be Pi (rounded) aka 3.1415927
    KFloat kf_neg = {1, 128, 1.5707963705062866}; // Pi negative

    float f_pos = calculateRealFloat(kf_pos);
    float f_neg = calculateRealFloat(kf_neg);

    printf("The positive float is %f or ", f_pos);
    printf("%e\n", f_pos);
    printf("The negative float is %f or ", f_neg);
    printf("%e", f_neg);

    return 0;
}
The first error with this code is clearly that the mantissa is absolutely wrong, but I have no idea how to fix this.
please reread the task:
Create a data structure that allows you to store a float,
read its raw byte representation
and its internal representation as s, e and m.
This does not mean that you should store a string.
I would do it the following way:
union MyFloat
{
    unsigned char rawByteDataRep[4];
    unsigned int rawDataRep;
    float floatRep;
    struct { // not checked this part, just copied from you
        unsigned s : 1;
        unsigned e : 8;
        unsigned m : 23;
    } componentesRep;
};
But be careful!
Even though this union-conversion pattern is widely used, the C++ standard states that the result is undefined behaviour if you read a different union member than the one that was last written.
Edit: added the uint32 raw representation (rawDataRep).
void testMyfloat()
{
    MyFloat mf;
    mf.floatRep = 3.14;
    printf("The float %f is assembled from sign %i, mantissa 0x%08x and exponent %i, and looks in memory like 0x%08x.\n",
           mf.floatRep,
           (int)mf.componentesRep.s,
           (unsigned int)mf.componentesRep.m,
           (int)mf.componentesRep.e,
           mf.rawDataRep);
}
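If you want to sidestep that undefined-behaviour caveat entirely, the usual well-defined alternative in C++ is to memcpy the float into an integer of the same size and pick the fields apart with shifts and masks. Here is a minimal sketch of that approach (not part of the original answer; printFields is an illustrative name):

#include <cstdint>
#include <cstdio>
#include <cstring>

void printFields(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // well-defined way to read the raw pattern

    unsigned s = bits >> 31;             // sign: 1 bit
    unsigned e = (bits >> 23) & 0xFFu;   // exponent: 8 bits, biased by 127
    unsigned m = bits & 0x7FFFFFu;       // mantissa: 23 bits

    std::printf("%f -> raw 0x%08X, s=%u, e=%u, m=0x%06X\n", f, bits, s, e, m);
}

int main() {
    printFields(3.14f);
    return 0;
}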
Bruce Dawson has an excellent series of blog posts on floating point representation and arithmetic. The latest in the series, which has a bunch of links to previous posts that discusses this subject matter in detail, is here.
I need to be able to read in a float or double from binary data in C++, similarly to Python's struct.unpack function. My issue is that the data I am receiving will always be big-endian. I have dealt with this for integer values as described here, but working byte by byte does not work with floating point values. I need a way to extract floating point values (both 32-bit floats and 64-bit doubles) in C++, similar to how you would use struct.unpack(">f", num) or struct.unpack(">d", num) in Python.
here's an example of what I have tried:
struct.unpack("d", num) ==> *(double*) str; // if str is a char* containing the data
That works fine if str is little-endian, but not if it is big-endian, as I know it will always be. The problem is that I do not know what the native endianness of the environment will be, so I need to be able to extract the binary data as big-endian at all times.
If you look at the linked question, you'll see this is easily done using bitwise ORs and bit shifts for integer values, but that method does not work for floating point.
NOTE: I should have pointed this out earlier, but I cannot use C++11 or any third-party libraries other than Boost.
Why does working byte by byte not work with floating-point values?
Just extract a 32-bit integer as usual, then reinterpret it as a float: float f = *(float*)&i
And do the same for 64-bit integers and double.
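Spelled out, and using memcpy rather than the pointer cast (which formally breaks strict aliasing), that idea could look roughly like the sketch below. It assumes the host uses IEEE 754 floats and sticks to <stdint.h>/<string.h>, so C++11 is not required; the function names are illustrative:

#include <stdint.h>
#include <string.h>

// Assemble a big-endian 32-bit float from 4 raw bytes, regardless of the
// host's native byte order.
float unpack_be_float(const unsigned char *p) {
    uint32_t bits = (uint32_t(p[0]) << 24) |
                    (uint32_t(p[1]) << 16) |
                    (uint32_t(p[2]) << 8)  |
                     uint32_t(p[3]);
    float f;
    memcpy(&f, &bits, sizeof f); // reinterpret the bit pattern as a float
    return f;
}

// Same idea for a big-endian 64-bit double.
double unpack_be_double(const unsigned char *p) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | p[i];   // most significant byte first
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}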
#include <algorithm> // for std::swap

void ByteSwap(void * data, int size)
{
    char * ptr = (char *) data;
    for (int i = 0; i < size/2; ++i)
        std::swap(ptr[i], ptr[size-1-i]);
}
bool LittleEndian()
{
int test = 1;
return *((char *)&test) == 1;
}
if (LittleEndian())
ByteSwap(&my_double, sizeof(double));
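Putting those two helpers together, reading a big-endian double straight out of a raw byte buffer could look roughly like this (a usage sketch built on the ByteSwap and LittleEndian functions above; read_be_double and the buffer layout are illustrative, not from the original answer):

#include <cstring>

// Reads a big-endian double starting at 'src', using the helpers above.
double read_be_double(const unsigned char *src) {
    double d;
    std::memcpy(&d, src, sizeof d); // copy the raw bytes into the double
    if (LittleEndian())             // swap only on little-endian hosts
        ByteSwap(&d, sizeof d);
    return d;
}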
I've encountered some strange behaviour when trying to promote a short to an int where the upper 2 bytes are 0xFFFF after promotion. AFAIK the upper bytes should always remain 0. See the following code:
unsigned int test1 = proxy0->m_collisionFilterGroup;
unsigned int test2 = proxy0->m_collisionFilterMask;
unsigned int test3 = proxy1->m_collisionFilterGroup;
unsigned int test4 = proxy1->m_collisionFilterMask;
if( test1 & 0xFFFF0000 || test2 & 0xFFFF0000 || test3 & 0xFFFF0000 || test4 & 0xFFFF0000 )
{
std::cout << "test";
}
The values of the involved variables once cout is hit are shown in a debugger screenshot; note the two highlighted values. I also looked at the disassembly, which also looks fine to me.
My software is targeting x64 compiled with VS 2008 SP1. I also link in an out of the box version of Bullet Physics 2.80. The proxy objects are bullet objects.
The proxy class definition is as follows (with some functions trimmed out):
///The btBroadphaseProxy is the main class that can be used with the Bullet broadphases.
///It stores collision shape type information, collision filter information and a client object, typically a btCollisionObject or btRigidBody.
ATTRIBUTE_ALIGNED16(struct) btBroadphaseProxy
{
    BT_DECLARE_ALIGNED_ALLOCATOR();

    ///optional filtering to cull potential collisions
    enum CollisionFilterGroups
    {
        DefaultFilter = 1,
        StaticFilter = 2,
        KinematicFilter = 4,
        DebrisFilter = 8,
        SensorTrigger = 16,
        CharacterFilter = 32,
        AllFilter = -1 //all bits sets: DefaultFilter | StaticFilter | KinematicFilter | DebrisFilter | SensorTrigger
    };

    //Usually the client btCollisionObject or Rigidbody class
    void* m_clientObject;
    short int m_collisionFilterGroup;
    short int m_collisionFilterMask;
    void* m_multiSapParentProxy;
    int m_uniqueId; //m_uniqueId is introduced for paircache. could get rid of this, by calculating the address offset etc.
    btVector3 m_aabbMin;
    btVector3 m_aabbMax;

    SIMD_FORCE_INLINE int getUid() const
    {
        return m_uniqueId;
    }

    //used for memory pools
    btBroadphaseProxy() : m_clientObject(0), m_multiSapParentProxy(0)
    {
    }

    btBroadphaseProxy(const btVector3& aabbMin, const btVector3& aabbMax, void* userPtr, short int collisionFilterGroup, short int collisionFilterMask, void* multiSapParentProxy = 0)
        : m_clientObject(userPtr),
          m_collisionFilterGroup(collisionFilterGroup),
          m_collisionFilterMask(collisionFilterMask),
          m_aabbMin(aabbMin),
          m_aabbMax(aabbMax)
    {
        m_multiSapParentProxy = multiSapParentProxy;
    }
};
I've never had this issue before and only started getting it after upgrading to 64-bit and integrating Bullet. The only place I am getting issues is where Bullet is involved, so I suspect the issue is related to that somehow, but I am still super confused about what could make assignments between primitive types not behave as expected.
Thanks
You are requesting a conversion from signed to unsigned. This is pretty straightforward:
Your source value is -1. Since the type is short int, on your platform that has the bit pattern 0xFFFF.
The target type is unsigned int. -1 cannot be represented as an unsigned int, but the conversion rule is defined by the standard: pick the positive value that is congruent to -1 modulo 2^N, where N is the number of value bits of the unsigned type.
On your platform, unsigned int has 32 value bits, so the modular representative of -1 modulo 2^32 is 0xFFFFFFFF.
If your own imaginary rules were to apply, you would want the result 0x0000FFFF, which is 65535, and not related to -1 in any obvious or useful way.
If you do want that conversion, you must perform the modular wrap-around on the short type manually:
short int mo = -1;
unsigned int weird = static_cast<unsigned short int>(mo);
Nutshell: C++ is about values, not about representations.
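A minimal sketch that prints both results (assuming the usual 16-bit short and 32-bit unsigned int; the variable names mirror the snippet above):

#include <cstdio>

int main() {
    short int mo = -1;

    unsigned int direct = mo;                                  // wraps modulo 2^32
    unsigned int weird  = static_cast<unsigned short int>(mo); // wraps modulo 2^16 first

    std::printf("direct: 0x%08X\nweird:  0x%08X\n", direct, weird);
    // Expected: direct: 0xFFFFFFFF
    //           weird:  0x0000FFFF
    return 0;
}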
AFAIK the upper bytes should always remain 0
When a short is promoted to int, the value is sign-extended: the new upper bits are filled with copies of the sign bit (this is what is sometimes described in terms of an arithmetic, or signed, shift);
example:
short b = -1; // bit pattern 0xFFFF
int a = b;    // here promotion is performed; the value is sign-extended to 0xFFFFFFFF, which is still -1
It is important to notice that in computer memory the representation of signed and unsigned values can be the same; the only difference is in the instructions generated by the compiler:
example:
unsigned short i = 65535; // bit pattern 0xffff
short j = -1;             // bit pattern 0xffff
So the bit pattern alone does not tell you whether a value is negative; it is the declared type of the source that decides whether the promotion sign-extends (signed short, as in your case) or zero-extends (unsigned short).
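To see the difference the declared source type makes, here is a small sketch (assuming a 16-bit short and a 32-bit int; not part of the original answer):

#include <cstdio>

int main() {
    short          s = -1;      // bit pattern 0xFFFF
    unsigned short u = 0xFFFF;  // same bit pattern

    int from_signed   = s;      // sign-extended -> 0xFFFFFFFF (-1)
    int from_unsigned = u;      // zero-extended -> 0x0000FFFF (65535)

    std::printf("from_signed   = 0x%08X (%d)\n", (unsigned)from_signed, from_signed);
    std::printf("from_unsigned = 0x%08X (%d)\n", (unsigned)from_unsigned, from_unsigned);
    return 0;
}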