Rounding a float number to a certain precision - c++

I want to round a float to at most 4 decimal places. That means 0.333333333 becomes 0.3333, but 0.33 stays 0.33.

Use the std::round() function
The C++ standard library offers functions for performing rounding. For floats, it is:
float round ( float arg );
This rounds arg to the nearest integral value. You want a different decimal resolution, so don't round the value itself; round the value times 10000, so that the ones digit is the former 0.0001 digit. Or, more generally:
float my_round(
float x,
int num_decimal_precision_digits)
{
float power_of_10 = std::pow(10, num_decimal_precision_digits);
return std::round(x * power_of_10) / power_of_10;
}
Note that there may be accuracy issues, as floating-point computations and representations are only accurate to within a certain number of digits, and in my_round we have at least four sources of such inaccuracy: the power-of-10 calculation, the multiplication, the division and the actual rounding.
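To make the accuracy caveat concrete, here is a minimal sketch. It repeats my_round so the block compiles on its own; rounding_error is a hypothetical helper added only for illustration, and the sketch assumes float is an IEEE 754 single-precision type.

```cpp
#include <cmath>

// Same idea as my_round above: scale, round to an integer, scale back.
float my_round(float x, int num_decimal_precision_digits)
{
    float power_of_10 = std::pow(10.0f, num_decimal_precision_digits);
    return std::round(x * power_of_10) / power_of_10;
}

// Hypothetical helper: distance between the rounded float and the
// decimal value we were aiming for, measured in double precision.
double rounding_error(float x, int digits, double expected)
{
    return std::fabs(static_cast<double>(my_round(x, digits)) - expected);
}
```

With these definitions, my_round(0.333333333f, 4) compares equal to 0.3333f, yet rounding_error(0.333333333f, 4, 0.3333) is small but not zero, because 0.3333 has no exact binary representation. The result is the nearest float to 0.3333, not 0.3333 itself.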

Cast it into a fixed-point type
If you want to have your results rounded, with a fixed number of decimal digits, you're hinting that you don't really need the "floating" aspect of floating point numbers. Well, in this case, cast your value to a type which represents such numbers. Essentially, that would be a (run-time-variable) integer numerator and a compile-time-fixed denominator (which in your case would be 10,000).
There's an old question here on the site about doing fixed-point math:
What's the best way to do fixed-point math?
but I would suggest you consider the CNL library as something recent/popular. Also, several proposals have been made to add fixed-point types to the standard library. I don't know which one is furthest along, but have a look at this one: Fixed-Point Real Numbers by John McFarlane.
Back to your specific case: Fixed-point types can typically be constructed from floating-point ones. Just do that.
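As a rough illustration of the idea (not the CNL API and not the proposal's interface), a fixed-point value with a denominator of 10,000 could be sketched like this. Fixed4 and its members are hypothetical names made up for this example.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical fixed-point type: the value is numerator / 10000.
struct Fixed4 {
    static constexpr std::int64_t denominator = 10000; // compile-time fixed
    std::int64_t numerator;                            // run-time variable

    // Constructing from floating point rounds to the nearest 0.0001.
    explicit Fixed4(double value)
        : numerator(std::llround(value * denominator)) {}

    double to_double() const
    {
        return static_cast<double>(numerator) / denominator;
    }
};

// Comparison is exact because only integers are compared.
inline bool operator==(Fixed4 a, Fixed4 b)
{
    return a.numerator == b.numerator;
}
```

Here Fixed4(0.333333333) stores the numerator 3333 and Fixed4(0.33) stores 3300, so the "rounded" value is held exactly and only converted back to floating point when needed.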

Here is one solution, for example:
float ret = float(round(0.333333333 * 10000)) / 10000;
You can wrap it in a function. Maybe there is a better way?

Assuming you only need to print the rounded number, this is one proper solution:
cout << setprecision(4) << x << '\n';
std::setprecision documentation.
Until more details are provided, it is impossible to give a better answer.
Please note that if you plan to round the number x and then print it, you are in for a headache, since some corner cases can produce much longer output than expected.
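One thing to keep in mind: in the default floating-point notation, std::setprecision(4) counts significant digits, not decimal places, so 12.34567 would print as 12.35. If "at most 4 decimals, without trailing zeros" is what is wanted, one possible sketch is to format with std::fixed and then trim. The function name format_max_decimals is made up for this example.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format x with at most max_decimals decimal places,
// dropping trailing zeros (and a dangling decimal point).
std::string format_max_decimals(double x, int max_decimals = 4)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(max_decimals) << x;
    std::string s = out.str();
    if (s.find('.') != std::string::npos) {
        s.erase(s.find_last_not_of('0') + 1); // drop trailing zeros
        if (s.back() == '.') s.pop_back();    // "2." becomes "2"
    }
    return s;
}
```

With this, format_max_decimals(0.333333333) gives "0.3333" and format_max_decimals(0.33) gives "0.33". The value itself is never rounded, only its textual representation.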

Use _vsnprintf
I think the best solution is to format the number as a string. We may not want to print it to the console at all, but store it in a std::string, a char[] or something similar. This approach is flexible: the text is the same as what you would get by printing with std::setprecision(), but it also ends up in a char[] buffer.
For this I used _vsnprintf (the Microsoft CRT name; the standard function is vsnprintf) and va_list. All it does is format the string as needed, in this case a float value.
#include <cstdarg> // va_list, va_start, va_end
#include <cstdio> // _vsnprintf (Microsoft CRT)
int FormatString(char* buf, size_t buf_size, const char* fmt, ...) {
va_list args;
va_start(args, fmt);
int w = _vsnprintf(buf, buf_size, fmt, args);
va_end(args);
if (buf == NULL)
return w;
if (w == -1 || w >= (int)buf_size)
w = (int)buf_size - 1;
buf[w] = 0;
return w;
}
int FormatStringFloat(char* buf, int buf_size, const void* p_data, const char* format) {
return FormatString(buf, buf_size, format, *(const float*)p_data);
}
Example
#include <iostream>
#include <string>
#define ARRAY_SIZE(_ARR) ((int)(sizeof(_ARR) / sizeof(*(_ARR))))
int main() {
float value = 3.343535f;
char buf[64];
FormatStringFloat(buf, ARRAY_SIZE(buf), (void*)&value, "%.2f");
std::cout << std::string(buf) << std::endl;
return 0;
}
So with "%.2f", the value 3.343535 is formatted as 3.34.
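If portability matters, the same idea can be sketched with the standard std::snprintf instead of the Microsoft-specific _vsnprintf. The function format_float below is a hypothetical name for this example, and the 64-byte buffer is an arbitrary choice.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format a float with a printf-style format string
// and return the text as a std::string.
std::string format_float(float value, const char* fmt = "%.2f")
{
    char buf[64];
    // std::snprintf never writes past the buffer and null-terminates it.
    int written = std::snprintf(buf, sizeof buf, fmt, value);
    if (written < 0) {
        return std::string(); // formatting failed
    }
    return std::string(buf);  // cut off at 63 characters if the text is longer
}
```

Calling format_float(3.343535f) yields "3.34", and format_float(3.343535f, "%.4f") yields "3.3435".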

Related

How to deal with the sign bit of integer representations with odd bit counts?

Let's assume we have a representation of -63 as a signed seven-bit integer within a uint16_t. How can we convert that number to float and back again when we don't know the representation type (such as two's complement)?
An application for such an encoding could be that several numbers are stored in one int16_t. The bit-count could be known for each number and the data is read/written from a third-party library (see for example the encoding format of tivxDmpacDofNode() here: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/group__group__vision__function__dmpac__dof.html --- but this is just an example). An algorithm should be developed that makes the compiler create the right encoding/decoding independent from the actual representation type. Of course it is assumed that the compiler uses the same representation type as the library does.
One way that seems to work well, is to shift the bits such that their sign bit coincides with the sign bit of an int16_t and let the compiler do the rest. Of course this makes an appropriate multiplication or division necessary.
Please see this example:
#include <cmath>
#include <cstdint> // uint16_t, int16_t
#include <iostream>
int main()
{
// -63 as signed seven-bits representation
uint16_t data = 0b1000001;
// Shift 9 bits to the left
int16_t correct_sign_data = static_cast<int16_t>(data << 9);
float f = static_cast<float>(correct_sign_data);
// Undo effect of shifting
f /= pow(2, 9);
std::cout << f << std::endl;
// Now back to signed bits
f *= pow(2, 9);
uint16_t bits = static_cast<uint16_t>(static_cast<int16_t>(f)) >> 9;
std::cout << "Equals: " << (data == bits) << std::endl;
return 0;
}
I have two questions:
This example actually uses a number with a known representation type (two's complement), converted with https://www.exploringbinary.com/twos-complement-converter/. Is the bit-shifting still independent of that, and would it also work for other representation types?
Is this the canonical and/or most elegant way to do it?
Clarification:
I know the bit width of the integers I would like to convert (please check the link to the TIOVX example above), but the integer representation type is not specified.
The intention is to write code that can be recompiled without changes on a system with another integer representation type and still correctly converts from int to float and/or back.
My claim is that the example source code above does exactly that (except that the example input data is hardcoded and it would have to be different if the integer representation type were not two's complement). Am I right? Could such a "portable" solution be written also with a different (more elegant/canonical) technique?
Your question is ambiguous as to whether you intend to truly store odd-bit integers, or odd-bit floats represented by custom-encoded odd-bit integers. I'm assuming by "not knowing" the bit-width of the integer, that you mean that the bit-width isn't known at compile time, but is discovered at runtime as your custom values are parsed from a file, for example.
Edit by author of original post:
The assumption in the original question that the presented code is independent from the actual integer representation type, is wrong (as explained in the comments). Integer types are not specified, for example it is not clear that the leftmost bit is the sign bit. Therefore the presented code also contains assumptions, they are just different (and most probably worse) than the assumption "integer representation type is two's complement".
Here's a simple example of storing an odd-bit integer. I provide a simple struct that lets you decide how many bits are in your integer. However, for simplicity in this example, I used uint8_t, which obviously has a maximum of 8 bits. There are several different assumptions and simplifications made here, so if you want help on any specific nuance, please specify more in the comments and I will edit this answer.
One key detail is to properly mask off your n-bit integer after performing 2's complement conversions.
Also please note that I have basically ignored overflow concerns and bit-width switching concerns that may or may not be a problem depending on how you intend to use your custom-width integers and the maximum bit-width you intend to support.
#include <cstdint> // uint8_t
#include <iostream>
#include <string>
struct CustomInt {
int bitCount = 7;
uint8_t value;
uint8_t mask = 0;
CustomInt(int _bitCount, uint8_t _value) {
bitCount = _bitCount;
value = _value;
mask = 0;
for (int i = 0; i < bitCount; ++i) {
mask |= (1 << i);
}
}
bool isNegative() {
return (value >> (bitCount - 1)) & 1;
}
int toInt() {
bool negative = isNegative();
uint8_t tempVal = value;
if (negative) {
tempVal = ((~tempVal) + 1) & mask;
}
int ret = tempVal;
return negative ? -ret : ret;
}
float toFloat() {
return toInt(); //Implied truncation!
}
void setFromFloat(float f) {
int intVal = f; //Implied truncation!
bool negative = f < 0;
if (negative) {
intVal = -intVal;
}
value = intVal;
if (negative) {
value = ((~value) + 1) & mask;
}
}
};
int main() {
CustomInt test(7, 0b01001110); // -50. Would be 78 if this were a normal 8-bit integer
std::cout << test.toFloat() << std::endl;
}
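For comparison, here is a compact sketch of the same decoding without a struct. It assumes, like the code above, that the field is stored in two's complement in the low bits of a uint16_t and that 1 <= bits <= 16; sign_extend and encode are hypothetical names. It uses only unsigned arithmetic and well-defined conversions, so it does not depend on how the compiler itself represents negative integers.

```cpp
#include <cstdint>

// Decode the low `bits` bits of raw as a two's complement number.
int sign_extend(std::uint16_t raw, int bits)
{
    const unsigned mask = (1u << bits) - 1u;   // e.g. 0x7F for 7 bits
    const unsigned sign = 1u << (bits - 1);    // e.g. 0x40 for 7 bits
    const unsigned value = raw & mask;
    // (value ^ sign) - sign maps [0, 2^bits) onto [-2^(bits-1), 2^(bits-1)).
    return static_cast<int>(value ^ sign) - static_cast<int>(sign);
}

// Encode value back into the low `bits` bits (two's complement).
std::uint16_t encode(int value, int bits)
{
    const unsigned mask = (1u << bits) - 1u;
    // Converting a negative int to unsigned is defined as reduction modulo 2^N.
    return static_cast<std::uint16_t>(static_cast<unsigned>(value) & mask);
}
```

For the seven-bit pattern 1000001 (0x41), sign_extend(0x41, 7) returns -63, which can then be converted with static_cast<float>, and encode(-63, 7) returns 0x41 again.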

How to round off the total cost (double) to the nearest integer

I have tried to search for similar questions all over the net, but none was useful for me.
The closest I got was "If the number before the .5 is odd, round up; if even, round down: 13.5 turns into 14 but 12.5 turns into 12".
Coming to the question:
I simply need to calculate the total amount after a meal with the formula:
total amount = mealamount + mealamount*tip% + mealamount*tax%
I came up with this piece of code (rough only):
#include<iostream>
#include<math.h>
#include <iomanip>
using namespace std;
int main () {
double mealCost;
double total=0;
double tipp,taxx=0;
cin>>mealCost;
int tip;
cin>>tip;
tipp=(tip*mealCost)/100;
int tax;
cin>>tax;
taxx=(tax*mealCost)/100;
total=mealCost+tipp+taxx;
cout << fixed << showpoint;
cout << setprecision(2);
cout<<total<<endl;
return 0;
}
But with the inputs 10.75 (meal amount), 17 (tip %) and 5 (tax %),
the value I am getting is 12.50.
If I use
int totalSum = std::round(total);
it gets converted to 12,
but my requirement is 13.
How can I achieve that?
I cannot find any duplicate of this question; if one exists,
please mention it.
There are multiple ways to convert doubles to integers. There are several rounding functions: std::round, std::lround and std::llround.
On the other hand, if what you want is not rounding but discarding the fraction in one direction, then you have std::ceil, which always goes up, and std::floor, which always goes down.
Remember to include <cmath>, not <math.h>. The latter is the C header, not the C++ one.
You can achieve your goal with std::ceil() and std::floor(), which are declared in the <cmath> header.
You are trying to always round up, so you would need the ceil() function. Ceil is short for ceiling, and there is also a floor function. Ceiling is up, floor is down. Here is a C snippet to try out.
#include <stdio.h> /* printf */
#include <math.h> /* ceil */
int main ()
{
printf ( "ceil of 2.3 is %.1f\n", ceil(2.3) );
printf ( "ceil of 3.8 is %.1f\n", ceil(3.8) );
printf ( "ceil of -2.3 is %.1f\n", ceil(-2.3) );
printf ( "ceil of -3.8 is %.1f\n", ceil(-3.8) );
return 0;
}
For rounding to the nearest integer, math.h has nearbyint:
printf ( "nearbyint (2.3) = %.1f\n", nearbyint(2.3) );
printf ( "nearbyint (3.8) = %.1f\n", nearbyint(3.8) );
Output:
nearbyint (2.3) = 2.0
nearbyint (3.8) = 4.0
Or, if you want to override the default rounding behavior at exactly .5:
int totalSum= (total - floor(total) ) >= 0.5 ? ceil(total) : floor(total);
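The difference between the two kinds of "round to nearest" mentioned in this thread can be sketched as follows. The wrapper names are made up for this example, and the second one assumes the floating-point rounding mode has not been changed from its default.

```cpp
#include <cmath>

// Halfway cases go away from zero: 12.5 -> 13, 13.5 -> 14, -2.5 -> -3.
long round_half_away(double x)
{
    return std::lround(x);
}

// Uses the current rounding mode. By default that is round-to-nearest
// with ties to even ("banker's rounding"): 12.5 -> 12, 13.5 -> 14.
double round_half_even(double x)
{
    return std::nearbyint(x);
}
```

So the rule quoted in the question ("13.5 turns into 14 but 12.5 turns into 12") describes nearbyint under the default mode, while std::round and std::lround turn 12.5 into 13.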
1) 10.75 + 17*10.75/100 + 5*10.75/100 = 13.115 ... so how can you get 12.50?
2) How do you know it's 12.50? How do you check the value of the result? (It may be only 12.4999..., which becomes 12.50 when formatted to two decimal places.) Make sure you check the actual value (ideally in a debugger, or dump the memory content in bytes and reconstruct the value by hand), not some string-formatted intermediate.
3) This is not production code, right? Amounts are not calculated with doubles in real financial software, as doubles are not accurate enough and you would run into all kinds of hard problems with rounding to VAT, etc. If this is a real thing, you are not up to the task; ask a professional for help.
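To illustrate point 3, here is a minimal sketch that keeps everything in integers. It assumes the meal amount is given in cents, the percentages are whole numbers, and all inputs are non-negative; total_rounded is a hypothetical name, and halves are rounded up.

```cpp
#include <cstdint>

// Hypothetical helper: total = meal + meal*tip% + meal*tax%,
// rounded to the nearest whole currency unit (halves round up).
std::int64_t total_rounded(std::int64_t meal_cents, int tip_percent, int tax_percent)
{
    // Total in hundredths of a cent: meal_cents * (100 + tip + tax).
    const std::int64_t total_x100 = meal_cents * (100 + tip_percent + tax_percent);
    // One currency unit is 10000 hundredths of a cent; add half before dividing.
    return (total_x100 + 5000) / 10000;
}
```

For the inputs from the question, total_rounded(1075, 17, 5) computes 131150 hundredths of a cent (13.115) and returns 13. A total of exactly 12.50 also comes out as 13, with no dependence on how a double happens to store the value.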
Answer:
std::round should normally do what you need. If it ends up as 12, that's because the value is less than 12.5.
If it shows as 12.50 when rounded to two decimal places, you are hitting one of those "all kinds of hard problems" of real financial software.
Then you should create your own rounding using the string representation, like this example (not handling negative numbers, and probably reinventing the wheel):
#include <cstdio> // printf
#include <iostream>
#include <string>
/**
* Rounds floating point value in std::string type.
* Works only for positive values, and without "+" sign!
* ie. regexp ~\d*\.?\d*~ formatting.
* For malformed input the output is undefined (but should not crash).
**/
std::string financialRound(const std::string & amount) {
const size_t dotPos = amount.find_first_of('.');
if (std::string::npos == dotPos) return amount; // no decimals found
// copy whole part into temporary result
std::string result = (0 < dotPos) ? amount.substr(0, dotPos) : "0";
const size_t firstDecimalPos = dotPos + 1;
// see if there is 5 to 9 digit after dot
if (amount.size() <= firstDecimalPos) return result; // no char
const char firstDecimal = amount.at(firstDecimalPos);
if (firstDecimal < '5' || '9' < firstDecimal) return result; //not 5-9
// add 1 to the result
int patchPos = (int)result.size();
while (0 <= --patchPos) {
++result.at(patchPos);
if ('9'+1 == result.at(patchPos)) result.at(patchPos) = '0';
else break;
}
// check if additional 1 is required (add overflow)
if (-1 == patchPos) result.insert(result.begin(), '1');
return result;
}
void tryValue(const std::string & amount) {
printf("\"%s\" is rounded to \"%s\"\n", amount.c_str(), financialRound(amount).c_str());
}
int main()
{
printf("Trying normal values...\n");
tryValue("12.50");
tryValue("12.49");
tryValue(".49");
tryValue(".50");
tryValue("9.49");
tryValue("9.50");
printf("Missing decimals...\n");
tryValue("12");
tryValue("12.");
printf("Malformed...\n");
tryValue("");
tryValue("a.4");
tryValue("a.5");
tryValue("12..");
}

FP number's exponent field is not what I expected, why?

I've been stumped on this one for days. I've written this program from chapter four of the book Write Great Code, Volume 1: Understanding the Machine.
The project is to do floating-point operations in C++. I plan to implement the other operations in C++ on my own; the book uses HLA (High Level Assembly) in the project for other operations like multiplication and division.
I wanted to display the exponent and other field values after they've been extracted from the FP number, for debugging. Yet I have a problem: when I look at these values in memory they are not what I think they should be. Key words: what I think. I believe I understand the IEEE FP format; it's fairly simple and I understand all I've read so far in the book.
The big problem is why the Rexponent variable seems to be almost unpredictable; in this example with the given values it's 5. Why is that? By my guess it should be two. Two because the decimal point is two digits right of the implied one.
I've commented the actual values that are produced in the program into the code so you don't have to run the program to get a sense of what's happening (at least in the important parts).
It is unfinished at this point. The entire project has not been created on my computer yet.
Here is the code (quoted from the file which I copied from the book and then modified):
#include<iostream>
typedef long unsigned real; //typedef our long unsigned ints in to the label "real" so we don't confuse it with other datatypes.
using namespace std; //Just so I don't have to type out std::cout any more!
#define asreal(x) (*((float *) &x)) //Reinterpret the storage of x as a float through a float pointer, so the compiler doesn't convert (and truncate) our FP values.
inline int extractExponent(real from) {
return ((from >> 23) & 0xFF) - 127; //Shift right 23 bits; & with eight ones (0xFF == 1111_1111 ) and make bias with the value by subtracting all ones from it.
}
void fpadd ( real left, real right, real *dest) {
//Left operand field containers
long unsigned int Lexponent = 0;
long unsigned Lmantissa = 0;
int Lsign = 0;
//RIGHT operand field containers
long unsigned int Rexponent = 0;
long unsigned Rmantissa = 0;
int Rsign = 0;
//Resulting operand field containers
long int Dexponent = 0;
long unsigned Dmantissa = 0;
int Dsign = 0;
std::cout << "Size of datatype: long unsigned int is: " << sizeof(long unsigned int); //For debugging
//Properly initialize the above variable's:
//Left
Lexponent = extractExponent(left); //Zero. This value is NOT a flat zero when displayed because we subtract 127 from the exponent after extracting it! //Value is: 0xffffff81
Lmantissa = extractMantissa (left); //Zero. We don't do anything to this number except add a whole number one to it. //Value is: 0x00000000
Lsign = extractSign(left); //Simple.
//Right
Rexponent = extractExponent(right); //Value is: 0x00000005 <-- why???
Rmantissa = extractMantissa (right);
Rsign = extractSign(right);
}
int main (int argc, char *argv[]) {
real a, b, c;
asreal(a) = -0.0;
asreal(b) = 45.67;
fpadd(a,b, &c);
printf("Sum of A and B is: %f", c);
std::cin >> a;
return 0;
}
Help would be much appreciated; I'm several days into this project and very frustrated!
in this example with the given values its 5. Why is that?
The floating point number 45.67 is internally represented as
2^5 * 1.0110110101011100001010001111010111000010100011110110
which actually represents the number
45.6700000000000017053025658242404460906982421875
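This is as close as you can get to 45.67 in a double (the 52 fraction bits shown above are double precision). A float keeps only 23 fraction bits, so its stored value differs slightly, but the exponent is 5 either way.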
If all you are interested in is the exponent of a number, simply compute its base 2 logarithm and round down. Since 45.67 is between 32 (2^5) and 64 (2^6), the exponent is 5.
Computers use binary representation for all numbers. Hence, the exponent is for base two, not base ten. int(log2(45.67)) = 5.
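Both views can be checked with a short sketch. It assumes float is a 32-bit IEEE 754 type (the static_assert only guards the size), and the function names are made up for this example. std::memcpy is used instead of a pointer cast to read the bits.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");

// Extract the unbiased exponent from the bit pattern, as in the question.
int exponent_from_bits(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // copy the raw bytes of the float
    return static_cast<int>((bits >> 23) & 0xFF) - 127;
}

// Ask the math library for the same thing: floor(log2(|f|)) for normal values.
int exponent_from_math(float f)
{
    return std::ilogb(f);
}
```

Both functions return 5 for 45.67f, since 32 <= 45.67 < 64. For -0.0f the exponent field is all zeros, so exponent_from_bits returns -127, which is the 0xffffff81 seen in the question.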

How to print a double value that is just less than another double value?

I am working on range expressions in C++. What I want is this: if I have an expression like
x<1
Then my
double getMax(...);
should return a double value that is just before 1.000 (double precision) on a number line.
I tried doing this
double getMax(double& a)
{
return (a-numeric_limits<double>::min());
}
But I still get the same value as a from the return statement.
I think C++ is converting it to nearest double in cout statement.
int main()
{
double a = 32;
cout<<scientific<<getMax(a)<<endl;
return 0;
}
output:
3.200000e+001
First of all, you need to ensure that you actually print sufficiently many digits to ensure all representable values of double are displayed. You can do this as follows (make sure you #include <iomanip> for this):
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::max_digits10) << getMax(a) << std::endl;
Secondly, numeric_limits<>::min is not appropriate for this: for double it is the smallest positive normalized value (about 2.2e-308), which is far smaller than the spacing between doubles near your value, so subtracting it changes nothing. If your starting value is 1.0, you can use numeric_limits<double>::epsilon, which is the difference between 1.0 and the next representable value above it.
However, in your code example, the starting value is 32. Epsilon does not necessarily work for that. Calculating the right epsilon in this case is difficult.
However, if you can use C++11(*), there is a function in the <cmath> header that does what you need, std::nextafter:
#include <iostream>
#include <limits>
#include <iomanip>
#include <cmath>
double getMax(double a)
{
return std::nextafter(a,std::numeric_limits<double>::lowest());
}
int main()
{
double a = 32;
std::cout << std::scientific
<< std::setprecision(std::numeric_limits<double>::max_digits10)
<< getMax(a)
<< std::endl;
return 0;
}
To explain:
double nextafter(double from, double to);
returns the next representable value of from in the direction of to. So I specified std::numeric_limits<double>::lowest() in my call to ensure you get the next representable value less than the argument.
(*)See Tony D's comment below. You may have access to nextafter() without C++11.
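To see why a fixed constant does not work, here is a small sketch that measures the gap between a value and its neighbour below. It assumes double is an IEEE 754 64-bit type; just_below and gap_below are hypothetical names.

```cpp
#include <cmath>
#include <limits>

// Next representable double below a.
double just_below(double a)
{
    return std::nextafter(a, std::numeric_limits<double>::lowest());
}

// Distance from a down to that neighbour; it depends on the magnitude of a.
double gap_below(double a)
{
    return a - just_below(a);
}
```

Under that assumption, gap_below(32.0) is 2^-48, while epsilon is only 2^-52. That is why 32.0 - epsilon and 32.0 - min() both round back to exactly 32.0, whereas just_below(32.0) really is smaller than 32.0.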
I think you've got the right idea.
Check out "Setting the precision of a double without using stream (ios_base::precision)", not so much for the question, but for the examples they give of using precision. You might want to try something like printing with a precision of 53.
The way I usually see "close to but not quite" involves setting a difference threshold (typically called epsilon). In that case, you wouldn't use a getMax function, but have an epsilon used in your usage of less than. (You could do a class with the epsilon value and operator overloading. I tend to avoid operator overloading like a plague.)
Basically, you'd need:
bool lessThanEpsilon(double number, double lessThan, double epsilon)
{
return (lessThan - number >= epsilon);
}
There are other varieties, of course. An equality check would test whether std::fabs(number - equals) < epsilon.
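The equality variety could be sketched like this. approxEqual is a hypothetical name, and choosing a suitable epsilon is left to the caller, since it depends on the magnitude of the values being compared.

```cpp
#include <cmath>

// True when number and equals differ by less than epsilon.
bool approxEqual(double number, double equals, double epsilon)
{
    return std::fabs(number - equals) < epsilon;
}
```

For instance, 0.1 + 0.2 == 0.3 is false with doubles, but approxEqual(0.1 + 0.2, 0.3, 1e-9) is true.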

C++ variable types limits

Here is a fairly simple question (I think): is there an STL library method that provides the limits of a variable type (e.g. integer)? I know these limits differ between computers, but there must be a way to get them through a method, right?
Also, would it be really hard to write a method to calculate the limit of a variable type?
I'm just curious! :)
Thanks ;).
Use std::numeric_limits:
// numeric_limits example
// from the page I linked
#include <iostream>
#include <limits>
using namespace std;
int main () {
cout << boolalpha;
cout << "Minimum value for int: " << numeric_limits<int>::min() << endl;
cout << "Maximum value for int: " << numeric_limits<int>::max() << endl;
cout << "int is signed: " << numeric_limits<int>::is_signed << endl;
cout << "Non-sign bits in int: " << numeric_limits<int>::digits << endl;
cout << "int has infinity: " << numeric_limits<int>::has_infinity << endl;
return 0;
}
I see that the 'correct' answer has already been given: Use <limits> and let the magic happen. I happen to find that answer unsatisfying, since the question is:
would it be really hard to write a method to calculate the limit of a variable type?
The answer is: easy for integer types, hard for float types. There are 3 basic kinds of types you need to handle: signed, unsigned, and floating point. Each has a different algorithm for getting the min and max, the actual code involves some bit twiddling, and in the case of floating point you have to loop unless you have a known integer type that is the same size as the float type.
So, here it is.
Unsigned is easy. The min is when all bits are 0's, the max is when all bits are 1's.
const unsigned type unsigned_type_min = (unsigned type)0;
const unsigned type unsigned_type_max = ~(unsigned type)0;
For signed, the min is when the sign bit is set but all of the other bits are zeros, and the max is when all bits except the sign bit are set. Without knowing the size of the type, we don't know where the sign bit is, but we can use some bit tricks to get this to work.
const signed type signed_type_max = (signed type)(unsigned_type_max >> 1);
const signed type signed_type_min = (signed type)(~(signed_type_max));
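The same bit tricks can be written as templates and checked against std::numeric_limits. This is only a sketch: the function names are made up, and the signed minimum assumes a two's complement representation, as the description above does.

```cpp
#include <limits>
#include <type_traits>

// All bits set: the largest value of an unsigned type.
template <typename U>
constexpr U unsigned_max()
{
    static_assert(std::is_unsigned<U>::value, "U must be an unsigned type");
    return static_cast<U>(~static_cast<U>(0));
}

// Every bit except the sign bit set: shift the unsigned max right by one.
template <typename S>
constexpr S signed_max()
{
    using U = typename std::make_unsigned<S>::type;
    return static_cast<S>(unsigned_max<U>() >> 1);
}

// Only the sign bit set: bitwise inverse of the max (two's complement assumed).
template <typename S>
constexpr S signed_min()
{
    return static_cast<S>(~signed_max<S>());
}
```

On a two's complement machine, signed_max<int>() and signed_min<int>() should agree with std::numeric_limits<int>::max() and std::numeric_limits<int>::min().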
For floating point, there are 4 limits, although knowing only the positive limits is sufficient; the negative limits are just the sign-inverted positive limits. There are potentially many ways to represent floating point numbers, but for those that use binary (rather than base 10) floating point, nearly everyone uses IEEE representations.
For IEEE floats, the smallest positive (normalized) floating point value is when the low bit of the exponent is 1 and all other bits are 0's. The largest negative floating point value (the negative value of largest magnitude) is the bitwise inverse of this. However, without an integer type that is known to be the same size as the given floating point type, there isn't any way to do this bit manipulation other than executing a loop. If you have an integer type that you know is the same size as your floating point type, you can do this as a single operation.
typedef unsigned char byte; // assumed definition; float_type stands for float, double, etc.
const float_type get_float_type_smallest() {
const float_type float_1 = (float_type)1.0;
const float_type float_2 = (float_type)0.5;
union {
byte ab[sizeof(float_type)];
float_type fl;
} u;
for (int ii = 0; ii < (int)sizeof(float_type); ++ii) // visit every byte of the value
u.ab[ii] = ((byte*)&float_1)[ii] ^ ((byte*)&float_2)[ii];
return u.fl;
}
const float_type get_float_type_largest() {
union {
byte ab[sizeof(float_type)];
float_type fl;
} u;
u.fl = get_float_type_smallest();
for (int ii = 0; ii < (int)sizeof(float_type); ++ii) // visit every byte of the value
u.ab[ii] = ~u.ab[ii];
return -u.fl; // Need to re-invert the sign bit.
}
(related to C, but I think this also applies for C++)
You can also try "enquire", a program that can re-create limits.h for your compiler. A quote from the project's home page:
This is a program that determines many
properties of the C compiler and
machine that it is run on, such as
minimum and maximum [un]signed
char/int/long, many properties of
float/ [long] double, and so on.
As an option it produces the ANSI C
float.h and limits.h files.
As a further option, it even checks
that the compiler reads the header
files correctly.
It is a good test-case for compilers,
since it exercises them with many
limiting values, such as the minimum
and maximum floating-point numbers.
#include <limits>
std::numeric_limits<type>::max() // min() etc