int16_t to float conversion weirdness - c++

I am at a loss with what is happening here. I need to convert a float to an int16_t and back. Here is the syntax:
int16_t val = (int16_t)round((float)0xFFFE/100 * angle);
//and back
float angle = (float)100/0xFFFE * val;
When I use an initial angle value of -0.093081, it converts back. But when I use 182.241211 it converts back to -17.764824?
Any idea what is going on?

0xFFFE is almost the maximum 16-bit number; and only for an unsigned 16-bit number, at that. If you divide it by 100 and then multiply by 182, it's definitely going to overflow.
Let's do it fully in base 10 for clarity (0xFFFE is 65534):
65534 / 100 * -0.093081 = -60.99970254
65534 / 100 * 182.241211 = 119429.95521674
The full range of your signed 16-bit integer is almost certainly [-32768, 32767]. That last result won't fit.
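One way to guard against this is to check the scaled value before the cast. The sketch below is only an illustration: the names `kScale`, `angle_to_i16`, and `i16_to_angle` are made up here, and saturating at the int16_t limits is an assumption (rejecting out-of-range input would be equally reasonable). With a scale of 0xFFFE/100, only angles within roughly ±50 fit without clamping.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Scale factor from the question: 0xFFFE (65534) counts per 100 units of angle.
constexpr float kScale = 65534.0f / 100.0f;

// Convert an angle to int16_t, saturating instead of overflowing.
// (Saturation is an assumption of this sketch, not something the question asks for.)
std::int16_t angle_to_i16(float angle) {
    const float scaled = std::round(kScale * angle);
    if (std::isnan(scaled)) return 0;  // arbitrary choice for NaN input
    const float lo = std::numeric_limits<std::int16_t>::min();
    const float hi = std::numeric_limits<std::int16_t>::max();
    if (scaled <= lo) return std::numeric_limits<std::int16_t>::min();
    if (scaled >= hi) return std::numeric_limits<std::int16_t>::max();
    return static_cast<std::int16_t>(scaled);
}

// Convert the stored value back to an angle.
float i16_to_angle(std::int16_t val) {
    return static_cast<float>(val) / kScale;
}
```

With this, -0.093081 still round-trips to within the quantization step, while 182.241211 clamps to the largest representable value instead of wrapping to a negative angle.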

Related

What is this line supposed to mean?

float sqrt_approx(float z) {
int val_int = *(int*)&z; /* Same bits, but as an int */
/*
* To justify the following code, prove that
*
* ((((val_int / 2^m) - b) / 2) + b) * 2^m = ((val_int - 2^m) / 2) + ((b + 1) / 2) * 2^m
*
* where
*
* b = exponent bias
* m = number of mantissa bits
*
* .
*/
val_int -= 1 << 23; /* Subtract 2^m. */
val_int >>= 1; /* Divide by 2. */
val_int += 1 << 29; /* Add ((b + 1) / 2) * 2^m. */
return *(float*)&val_int; /* Interpret again as float */
}
I was reading a wiki article on methods of computing square roots. I came to this code and stared at this line.
int val_int = *(int*)&z; /* Same bits, but as an int */
Why are they casting z to an int pointer and then dereferencing it? Why not directly say val_int = z;
Why use pointers at all? PS: I'm a beginner.
This is called type punning. This particular usage violates strict aliasing rules.
By taking the address of the float value z and reinterpreting it as the address of an integer value, the author is trying to get access to the in-memory bytes representing this float, but in the convenience of an int.
It's not the same as int val_int = z; which would convert the float value to an integer, resulting in different bits in memory.
A big problem here, apart from the strict aliasing issue, is that the code makes assumptions about the size of int on any target system and the endianness. As a result, the code is not portable.
The correct way to access the bytes of z is as char array:
const uint8_t* zb = (const uint8_t*)&z;
You could then construct an appropriately-sized integer from these with a specific endianness:
uint32_t int_val = ((uint32_t)zb[0]) |
(((uint32_t)zb[1]) << 8) |
(((uint32_t)zb[2]) << 16) |
(((uint32_t)zb[3]) << 24);
This is similar to a simpler call, assuming you are on a little-endian system:
uint32_t int_val;
memcpy(&int_val, &z, sizeof(int_val));
But this isn't the full picture: IEEE 754 (which your code is targeting) standardizes the bit layout of a float but not its byte order in memory, and the byte order of int is system-dependent as well. On most platforms the two match, but that is not guaranteed.
At this point, the whole example breaks down. At the fundamental level the original code is a (supposedly) fast approximation based on tricks. If you want to do these tricks "correctly", it becomes a bit of a mess.
What happens is that the line int val_int = *(int*)&z reinterprets the float's bits as an integer (or rather a bitfield), so the code can operate directly on the sign, exponent, and mantissa of the floating-point number instead of relying on the processor's floating-point operations.
int val_int = z would apply conversion from float to int - a completely different operation.
Generally, such operations are ill-advised, since different platforms may have different conventions for the interpretation and location of the mantissa, exponent, and sign. Also, int may be of a different size. And native operations are almost certainly more efficient and reliable.

How to convert a float into uint8_t?

I am trying to send multiple float values from an Arduino using the LMIC LoRa library. The LMIC function only takes a uint8_t as its transmission argument type.
temp contains my temperature value as a float and I can print the measured temperature as such without problem:
Serial.println((String)"Temp C: " + temp);
There is an example that shows this code being used to do the conversion:
uint16_t payloadTemp = LMIC_f2sflt16(temp);
// int -> bytes
byte tempLow = lowByte(payloadTemp);
byte tempHigh = highByte(payloadTemp);
payload[0] = tempLow;
payload[1] = tempHigh;
I am not sure if this would work; it doesn't seem to. The resulting data that gets sent is: FF 7F
I don't believe this is what I am looking for.
I have also tried the following conversion procedure:
uint8_t *array;
array = (unit8_t*)(&f);
Using Arduino, this will not even compile.
Something that does work, but creates a much too long result, is:
String toSend = String(temp);
toSend.toCharArray(payload, toSend.length());
payloadActualLength = toSend.length();
Serial.print("the payload is: ");
Serial.println(payload);
but the resulting hex is far too long once I add the other values that I want to send.
So how do I convert a float into a uint8_t value, and why doesn't my original conversion work the way I expect?
Sounds like you are trying to figure out a minimally sized representation for these numbers that you can transmit in some very small packet format. If the range is suitably limited, this can often best be done by using an appropriate fixed-point representation.
For example, if your temperatures are always in the range 0..63, you could use a 6.2 fixed point format in a single byte:
if (value < 0.0 || value > 63.75) {
// out of range for 6.2 fixed point, so do something else.
} else {
uint8_t bval = (uint8_t)(value * 4 + 0.5);
// output this byte value
}
When you read the byte back, you just multiply it by 0.25 to get the (approximate) float value back.
Of course, since 8 bits is pretty limited for precision (about 2 digits), it will get rounded a bit to fit -- your 23.24 value will be rounded to 23.25. If you need more precision, you'll need to use more bits.
If you only need a little precision but a wider range, you can use a custom floating point format. IEEE 16-bit floats (S5.10) are pretty good (they give you 3 digits of precision and around 10 orders of magnitude range), but you can go even smaller, particularly if you don't need negative values. A U4.4 float format gives you 1 digit of precision and 5 orders of magnitude range in 8 bits (positive only).
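The 6.2 fixed-point scheme described above can be wrapped into a pair of helpers. This is a minimal sketch: the names `encode_u6_2` and `decode_u6_2` and the -1 "out of range" return value are made up for illustration.

```cpp
#include <cstdint>

// Encode a value in [0, 63.75] as unsigned 6.2 fixed point (steps of 0.25).
// Returns -1 when the value does not fit (this also rejects NaN).
int encode_u6_2(float value) {
    if (!(value >= 0.0f && value <= 63.75f)) {
        return -1;  // out of range for 6.2 fixed point, so do something else
    }
    return static_cast<std::uint8_t>(value * 4.0f + 0.5f);  // round to nearest step
}

// Decode the byte back to an (approximate) float.
float decode_u6_2(std::uint8_t byte) {
    return byte * 0.25f;
}
```

For the 23.24 example this produces the byte 93, which decodes to 23.25.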
If you know that both sender and receiver use the same fp binary representation and both use the same endianness then you can just memcpy:
float a = 23.24;
uint8_t buffer[sizeof(float)];
::memcpy(buffer, &a, sizeof(float));
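A round trip using that approach might look like the sketch below. The helper names are hypothetical, and the same caveat applies: it only works if sender and receiver agree on float format and byte order.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using FloatBytes = std::array<std::uint8_t, sizeof(float)>;

// Copy the in-memory representation of a float into a byte buffer.
FloatBytes float_to_bytes(float value) {
    FloatBytes bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// Rebuild the float on the receiving side from the same bytes.
float bytes_to_float(const FloatBytes& bytes) {
    float value;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}
```

On the sending side the payload would be filled from `float_to_bytes(temp)`; the receiver passes the same bytes, in the same order, to `bytes_to_float`.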
In Arduino one can convert the float into a String
float ds_temp=sensors.getTempCByIndex(0); // DS18b20 Temp sensor
then convert the String into a char array:
String ds_str = String(ds_temp);
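char ds_char[ds_str.length() + 1]; // room for the terminating '\0'
ds_str.toCharArray(ds_char, ds_str.length() + 1);
uint8_t* data = (uint8_t*)ds_char;
The uint8_t values are then available through data; there are ds_str.length() of them (sizeof(data) would only give the size of the pointer).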
A variable of type uint8_t can carry only 256 values. If you actually want to squeeze a temperature into a single byte, you have to use a fixed-point approach, or equivalently a least-significant-bit (LSB) value approach:
Define the working range, T0 to T1.
Divide T1 - T0 by 256 (2^8, the number of possible values).
The result is a float constant, the LSB value (working with a flexible LSB value is possible), by which you divide the offset float value X: R = (X - T0)/LSB. Round the result and it fits into a byte, provided you clamp it to 255 at the top of the range.
On the receiving side, multiply the integer value by the same constant and add the offset: X = R*LSB + T0.
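The steps above can be sketched as follows. The function names and the clamping of out-of-range inputs are assumptions of this example; the working range is passed in explicitly.

```cpp
#include <cmath>
#include <cstdint>

// LSB value for a working range [t0, t1] spread over 256 byte values.
constexpr float lsb_value(float t0, float t1) {
    return (t1 - t0) / 256.0f;
}

// R = (X - T0) / LSB, rounded and clamped so it always fits in a byte.
std::uint8_t quantize(float x, float t0, float t1) {
    float r = std::round((x - t0) / lsb_value(t0, t1));
    if (!(r > 0.0f)) r = 0.0f;    // clamps negative results and NaN
    if (r > 255.0f) r = 255.0f;   // clamps the top of the range
    return static_cast<std::uint8_t>(r);
}

// X = R * LSB + T0 on the receiving side.
float dequantize(std::uint8_t r, float t0, float t1) {
    return r * lsb_value(t0, t1) + t0;
}
```

For example, with a range of -40 to 88 the LSB is 0.5, so 23.24 is sent as 126 and comes back as 23.0.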

Convert double to uint and back

I am trying to convert a value from Modbus.
The device shows "-1.0"; the returned value is 65535 (uint16).
I am now trying to convert this value back to a double.
I have tried it with different casts.
It always gives me 65535.00 :(
How do we convert negative uint values to double?
typedef unsigned short uint16;
int main() {
double dRmSP = -1.0; //-1.0000 ok
uint16 tSP = static_cast<uint16>(dRmSP); // = 65535 ok
// and back
double _dRmSP = static_cast<double>(tSP); // = 65535.0000 why??
// try
double _dRmSP_ = static_cast<double>(static_cast<int>(tSP)); // =65535.0000 why??
return 0;
}
You're taking the uint16 value 65535 and turning it into a double. This is 65535.0.
There is no other valid expectation.
The variable tSP does not "remember" that its value originally came from a double of value -1.0. tSP is the unsigned integer value 65535; period.
How do we convert negative uint values to double?
There are no "negative uint values". The "u" stands for unsigned which means negative values are not in the domain of values of that type.
If you wish to use dRmSP then use dRmSP, not some other variable with a different type and value.
Negative unsigned values, by definition do not exist. So you can't convert one to anything.
Your actual situation is that, in getting data from your device, the value of -1.0 is converted to an unsigned value first. Since -1.0 is outside the range of values that an unsigned type can represent, the result follows modulo arithmetic. (Strictly, C++ only guarantees this wraparound for integer-to-unsigned conversions; converting an out-of-range double directly to an unsigned type is undefined behaviour, even though 65535 is what you observed.)
The way this works, for a negative input value (like -1.0) and an unsigned type with maximum value 65535 (a 16-bit unsigned), is to keep adding 65536 = 65535 + 1 until the result lies between 0 and 65535. For -1.0 this produces 65535.0, so the value converted to unsigned is 65535.
That explains why you are getting a value of 65535 when your device displays -1.0.
What you are trying to do with the conversion back is reverse the process. It is not enough to convert an unsigned to a double (as you are doing), since a double can represent 65535.0 (at least, within the limits of numerical precision) and so the value is simply preserved.
The first step is to convert your value to a double, which will convert 65535 to 65535.0 because a double can represent values like that (again within the limits of floating point precision).
The next step, which you are not performing, requires you to have some idea of the minimum (or maximum) value your device actually supports, which you need to get from its documentation. For example, if the minimum value your device can represent is -100.0 (so the maximum is 65435.0), then you reverse the process: keep subtracting 65536.0 until the result lies between -100.0 and 65435.0.
In code, this might be done by
double dRmSP = -1.0; //-1.0000 ok
uint16 tSP = static_cast<uint16>(dRmSP); // = 65535 ok
// and back
dRmSP = static_cast<double>(tSP); // = 65535.0000 - as described above
while (dRmSP > 65435.0) dRmSP -= 65536.0; // voila! -1.0 obtained
First of all, there are no negative unsigned int values. Unsigned means there is no sign bit.
What you did was:
uint16 t1(-1.0); // wraps around to positive 65535
auto t2 = static_cast<double>(t1); // turns 65535 to 65535.0 (no wrapping)
If you want this to work for negative values use an int or comparable (non unsigned integral) type. But if you do this then remember that you will lose a bit for the value (if you use int16).
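If the device really stores a signed 16-bit two's-complement value in that register (an assumption; check the Modbus documentation for your device, including any scaling factor), the raw word can be reinterpreted as signed before converting to double. The function name below is made up for this sketch.

```cpp
#include <cstdint>

// Interpret a raw 16-bit register as a signed two's-complement value,
// then convert it to double. Values 0x8000..0xFFFF map to -32768..-1.
double register_to_double(std::uint16_t raw) {
    const std::int32_t wide = raw;  // widen first so the subtraction cannot overflow
    const std::int32_t signed_value = (wide >= 0x8000) ? wide - 0x10000 : wide;
    return static_cast<double>(signed_value);
}
```

With this, the raw value 65535 converts to -1.0, while values up to 32767 are left unchanged.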

Could anyone tell me why float can't hold 3153600000?

I know this is stupid, but I'm quite a noob in the programming world. Here is my code.
This one works perfectly:
#include <stdio.h>
int main() {
float x = 3153600000 ;
printf("%f", x);
return 0;
}
But this one has a problem:
#include <stdio.h>
int main() {
float x = 60 * 60 * 24 * 365 * 100 ;
printf("%f", x);
return 0;
}
So 60 * 60 * 24 * 365 * 100 is 3153600000, right? If yes, then why does it produce different results? I got an overflow in the second one; it printed "-1141367296.000000" as a result. Could anyone tell me why?
You're multiplying integers, then putting the result in a float. By that time, it has already overflowed.
Try float x = 60.0f * 60.0f * 24.0f * 365.0f * 100.0f;. You should get the result you want.
60 is an integer, as are 24, 365, and 100. Therefore, the entire expression 60 * 60 * 24 * 365 * 100 is carried out using integer arithmetic (the compiler evaluates the expression before it sees what type of variable you're assigning it into).
In a typical 32-bit architecture, a signed integer can only hold values up to 2,147,483,647. So the value would get truncated to 32 bits before it gets assigned into your float variable.
If you tell the compiler to use floating-point arithmetic, e.g. by writing the first value as a float literal (60.0f), then you'll get the expected result. (A float times an int is a float, so the float propagates to the entire expression.) E.g.:
float x = 60.0f * 60 * 24 * 365 * 100;
Doesn't your compiler spit this warning? Mine does:
warning: integer overflow in expression
The overflow occurs before the all-integer expression is converted to a float before being stored in x. Add a .0f to all numbers in the expression to make them floats.
If you multiply two integers, the result will be an integer too.
60 * 60 * 24 * 365 * 100 is an integer.
Since 32-bit integers can only go up to 2^31-1 (2147483647), the value overflows and becomes -1141367296, which is only then converted to float.
Try multiplying float numbers, instead of integral ones.
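As a small illustration of the fixes suggested in these answers, the sketch below computes the same product once in float and once in a 64-bit integer type. The function names are invented for the example.

```cpp
// Making the first operand a float forces the whole left-to-right
// product to be evaluated in floating point, so nothing overflows.
float seconds_per_century_float() {
    return 60.0f * 60 * 24 * 365 * 100;
}

// Alternatively, stay in integers but use a type wide enough for the result.
long long seconds_per_century_integer() {
    return 60LL * 60 * 24 * 365 * 100;
}
```

Both return 3153600000; the value happens to be exactly representable in a float, so no precision is lost here.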

Getting the fractional part of a float without using modf()

I'm developing for a platform without a math library, so I need to build my own tools. My current way of getting the fraction is to convert the float to fixed point (multiply with (float)0xFFFF, cast to int), get only the lower part (mask with 0xFFFF) and convert it back to a float again.
However, the imprecision is killing me. I'm using my Frac() and InvFrac() functions to draw an anti-aliased line. Using modf I get a perfectly smooth line. With my own method pixels start jumping around due to precision loss.
This is my code:
const float fp_amount = (float)(0xFFFF);
const float fp_amount_inv = 1.f / fp_amount;
inline float Frac(float a_X)
{
return ((int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
inline float InvFrac(float a_X)
{
return (0xFFFF - (int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
Thanks in advance!
If I understand your question correctly, you just want the part after the decimal right? You don't need it actually in a fraction (integer numerator and denominator)?
So we have some number, say 3.14159 and we want to end up with just 0.14159. Assuming our number is stored in float f;, we can do this:
f = f-(long)f;
Which, if we insert our number, works like this:
0.14159 = 3.14159 - 3;
What this does is remove the whole number portion of the float, leaving only the decimal portion. When you convert the float to a long, it drops the decimal portion. Then when you subtract that from your original float, you're left with only the decimal portion. We use a long here because of the range of the float type: an int (only 4 bytes on many systems) isn't necessarily large enough to hold the integer part, while a long is wider on many platforms. Note that a float itself is typically 4 bytes (a double is 8), and its range still far exceeds that of any integer type, so this only works while the integer part fits in a long.
As I suspected, modf does not use any arithmetic per se -- it's all shifts and masks, take a look here. Can't you use the same ideas on your platform?
I would recommend taking a look at how modf is implemented on the systems you use today. Check out uClibc's version.
http://git.uclibc.org/uClibc/tree/libm/s_modf.c
(For legal reasons, it appears to be BSD licensed, but you'd obviously want to double check)
Some of the macros are defined here.
There's a bug in your constants. You're basically trying to do a left shift of the number by 16 bits, mask off everything but the lower bits, then right shift by 16 bits again. Shifting is the same as multiplying by a power of 2, but you're not using a power of 2 - you're using 0xFFFF, which is off by 1. Replacing this with 0x10000 will make the formula work as intended.
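Applied to the question's code, the change would look something like this sketch. It keeps the original names and assumes a non-negative input small enough that a_X * 65536 still fits in an int (so a_X below 32768 with a 32-bit int).

```cpp
// 0x10000 = 2^16, so multiplying is a true 16-bit left shift.
const float fp_amount = 65536.0f;
const float fp_amount_inv = 1.0f / fp_amount;

// Fractional part of a non-negative float via 16.16 fixed point.
inline float Frac(float a_X)
{
    return ((int)(a_X * fp_amount) & 0xFFFF) * fp_amount_inv;
}
```

Because both constants are now exact powers of two, the multiplications only change the exponent and do not add rounding error of their own; the result is still limited to 16 fractional bits.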
I'm not completely sure, but I think that what you are doing is wrong, since you are only considering the mantissa and forgetting the exponent completely.
You need to use the exponent to shift the value in the mantissa to find the actual integer part.
For a description of the storage mechanism of 32bit floats, take a look here.
Why go to floating point at all for your line drawing? You could just stick to your fixed point version and use an integer/fixed point based line drawing routine instead - Bresenham's comes to mind. While this version isn't anti-aliased, I know there are others that are.
Bresenham's line drawing
Seems like maybe you want this.
float f = something;
float fractionalPart = f - floor(f);
Your method is assuming that there are 16 bits in the fractional part (and as Mark Ransom notes, that means you should shift by 16 bits, i.e. multiply by 0x10000). That might not be true. The exponent is what determines how many bits there are in the fractional part.
To put this in a formula, your method works by calculating (x mod 1.0) as ((x << 16) mod (1 << 16)) >> 16, and it's that hardcoded 16 which should depend on the exponent; the exact replacement depends on your float format.
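For a 32-bit IEEE 754 float, the exponent-dependent version could be sketched like this, using only shifts, masks, and one subtraction. The names `trunc_bits` and `frac_bits` are made up for the example, and the float format is an assumption; infinities and NaN are not handled specially (the result for them is NaN).

```cpp
#include <cstdint>
#include <cstring>

// Truncate toward zero by clearing the mantissa bits that lie below
// the binary point; how many there are depends on the exponent.
float trunc_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const int exponent = static_cast<int>((bits >> 23) & 0xFF) - 127;  // remove bias
    if (exponent < 0) {
        bits &= 0x80000000u;  // |x| < 1: integer part is zero, keep only the sign
    } else if (exponent < 23) {
        bits &= ~((1u << (23 - exponent)) - 1u);  // clear the fractional mantissa bits
    }
    // exponent >= 23: every mantissa bit is already integral
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// Fractional part with the same sign as x, like modf.
float frac_bits(float x) {
    return x - trunc_bits(x);
}
```

Unlike the fixed 16-bit version, this keeps every fractional bit the float actually has, so values such as 3.25 or -1.5 come back as exactly 0.25 and -0.5.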
double frac(double val)
{
return val - trunc(val);
}
// frac(1.0) = 1.0 - 1.0 = 0.0 correct
// frac(-1.0) = -1.0 - -1.0 = 0.0 correct
// frac(1.4) = 1.4 - 1.0 = 0.4 correct
// frac(-1.4) = -1.4 - -1.0 = -0.4 correct
Simple and works for -ve and +ve
One option is to use fmod(x, 1).