This question already has answers here:
Why does integer division in C# return an integer and not a float?
(8 answers)
Closed 6 years ago.
The code I used in C++ is:
float y;
y=360/100;
cout<<y;
the output is 3.
Even if I don't output y and instead pass it to a function like left(y), the value 3 is used instead of 3.6. But if I define y=360.0/100, it works fine.
By the way, left() is a function included in a package made by our CS prof. left(x) changes the direction by an angle of x degrees towards the left. It is a Logo-based package.
This is the way the language is defined. You divide an int by an int, so the calculation is performed resulting in an int, giving 3 (truncating, rather than rounding).
You then store 3 into a float, giving 3.0.
If you want the division performed using floats, make (at least) one of the arguments a float, e.g. 360.0f / 100 (in C++ the f suffix needs a decimal point or an exponent, so 360f does not compile). This way, the other argument will be converted to a float before the division is performed.
360 / 100.0f will work equally well, but I think it probably makes sense to make it clear as early as possible that this is a floating point calculation, rather than an integral one; that's a human consideration, rather than a technical one.
(Note that 360.0 is actually a double, although using that will work as well. The division would be performed as a double, then the result converted to a float for the assignment).
360/100 is computed in integer arithmetic before it's assigned to the float.
Reworking to 360.0f / 100 is my favourite way of fixing this.
Finally, these days there is rarely a good reason to reach for float on modern hardware. Use a double instead, which will probably be no slower, and write 360.0 / 100.
This question already has answers here:
Floating point inaccuracy examples
(7 answers)
Closed 3 years ago.
I have a FLOAT column in a SQL Server database that appears as follows in SQL Server Management Studio.
18.001
When I read that value into a float variable, and format it using sprintf() ("%f"), it appears as:
18.000999
When I read that value into a double variable, and format it using sprintf(), it appears as:
18.001000
Could I get some suggestions on this? The values being stored are generally under 100, with up to 3 decimal places. What is the best SQL Server type? What is the best C++ type? And should I be using some rounding technique to get it in the format I want?
Note: I'm not actually using sprintf(), I'm using CString.Format(), but the expected behavior is the same.
The values being stored are generally under 100, with up to 3 decimal places.
SQL databases support the numeric/decimal types (the two are synonyms) for fixed-point values. For your specific case, you could use decimal(6, 3). That is six significant digits, with three of them to the right of the decimal point. These two values are called precision and scale respectively.
If the values can differ a bit from this, you might want a wider range.
With decimal/numeric, what-you-see-is-what-you-get. I would recommend storing them in the database as fixed-point numbers.
Answering the question at face value, assuming floating point should be used and fixed point is not applicable.
Unless you are really tight on memory, there is really no reason to use anything but double for floating point numbers in C++. Float loses precision without giving you much in return. You can also try long double, but in my experience it is rather overkill. Also, if your compiler is MSVC, I have heard its long double is the same as double.
As an alternative to the fixed-point decimals proposed already, just use ordinary integers!
Instead of storing 18.001 seconds, you'd store 18001 milliseconds; you wouldn't store Euro, Pound or Dollar, but tenths of a cent or penny, ...
The type in C++ would be an integer as well, large enough to hold the maximum numbers you need, e.g. uint32_t, int64_t, ...
I took a programming test, and my teacher took points off for dividing by 3 instead of 3.0, even though it appears I get the same answer.
volu = PI*(BASE*BASE)*(HEIGHT/3);
Does the .0 even matter in a C++ program? If so, why?
Thank you!
EDIT: I used doubles for PI and BASE
(For what it's worth I find the way you have it to be rather prudent.)
If HEIGHT was an integral type then it would have been absolutely essential that you wrote 3.0 rather than 3. (3.0 is a literal of type double, cf. 3 which is an int.)
Otherwise the division would have taken place in integer arithmetic, causing the result to go off.
Personally I'd remove all the redundant parentheses and write it as
PI * BASE * BASE * HEIGHT / 3;
Yes. Two integers are divided using integer division, which discards the fractional part (truncating toward zero). Adding the .0 turns it into a floating-point division, because you're no longer dividing two integers but an integer and a double.
This question already has answers here:
Is there a C++ equivalent to Java's BigDecimal?
(9 answers)
Closed 7 years ago.
I found myself needing to compute the exponential of a large number, e.g. exp(709). Such a number would be represented, in floating point precision, as 8.2184074615549724e+307.
It seems that numbers with exponents larger than that are simply translated into Inf, which creates problems in my code. I can only guess that things can be fixed using more bits to represent the exponent, but I am not aware of a practical way to proceed.
Here is a code snippet:
double expon = exp(500); /*here I also tried `long double`, with no effect */
printf("%e\n", expon ); /*gives INF*/
double Wa = LambertW<0>( expon); /*gives error, as it can't handle inf*/
Is there a way to compute this?
This problem has been debated in general, but I did not find a useful answer. Also, it seems that GCC has supported multiple-precision floating-point arithmetic since version 4.3. How does it help?
Edit: The suggested possible-duplicate questions turned out to be irrelevant because I need huge decimals, not exact decimals. This is not a duplicate.
You should be able to perform your computation with adequate precision using long double arithmetic:
The maximum value for 80 bit long double is 1.18×10^4932, much larger than e^709.
In order for the computation to be performed as long double, you must use expl instead of exp:
long double expon = expl(500);
printf("%Le\n", expon);
The LambertW function will handle the long double if it is properly overloaded for this type, otherwise expon will be converted to double and produce inf and the computation will fail as you mentioned.
I don't know which implementation of the Lambert W function you use. Darko Veberic's does not support long double arguments, but it might be feasible to extend the implementation to type long double, as it is available in source form: https://github.com/DarkoVeberic/LambertW . You might want to contact him directly.
Another approach is to consider that exp(709) is just too close to the maximum value of the double type, about 1.8×10^308. If you can restructure your computation to use smaller exponents and a different formula, it might be done with regular double types.
This question already has answers here:
Why are floating point numbers inaccurate?
(5 answers)
Closed 7 years ago.
In my project I have to read some numeric data from an XML file, use it, and save it to disk in another directory.
Skipping the file parsing, it comes down to the problem of std::string to float conversion:
std::string sFloatNumber;
float fNumber = std::atof(sFloatNumber.c_str());
This works fine, but I noticed small deviations between the value written in the std::string and the one received after conversion to float (about ~0.0001).
The deviation is small, but after a number of such operations it can accumulate into a big inaccuracy later on.
So I ask if there is some conversion between std::string and float that has 1:1 accuracy?
You really can't make the conversion more precise than what you'd achieve by using the built-in functions for this. The reason is that floats can't represent all numbers. What they can represent is the number closest to the one you input, and I guess that is what you are seeing. So, in general, there is no way to convert a decimal string exactly into a float.
If you want more accuracy, I suggest you use double. However, that also has a limit on the accuracy, but much better than float nonetheless. The reason for this is that double uses 64 bits to represent a number, whereas a float uses 32 bits. But their method of storing a number is similar, and so the same restrictions apply.
In the tutorial I'm reading for OGRE3d, the programmer is constantly adding f at the end of any variable he initializes, like 200.00f or 0.00f. I decided to erase the f and see if it compiles, and it compiles just fine. What is the point of adding f at the end of the variable?
EDIT: So you're saying if I initialize a variable with 200.03 it won't initialize it as a floating point but if I were to do so with 200.03f it would? If not where does the f become useful then?
It's a way to specify that the number has to be interpreted as a "float", not a "double" (which is the default type for C++ floating-point literals and typically uses twice the memory).
This discussion could be of help:
http://www.cplusplus.com/forum/beginner/24483/
Quoted from http://msdn.microsoft.com/en-us/library/w9bk1wcy.aspx
A floating-point constant without an f, F, l, or L suffix has type
double. If the letter f or F is the suffix, the constant has type
float. If suffixed by the letter l or L, it has type long double. For
example:
200.00f is not a variable. It can't vary.
It's a compile-time constant, with float representation. The f signifies that it's a float.
By comparison, 200.00 would be interpreted as a double.
The C standard states that unsuffixed floating-point constants are doubles, which promotes the whole operation to double.
float a,b,c;
...
a = b+7.1; // this is a double precision operation
...
a = b+7.1f; // this is a single precision operation
...
c = 7.1; //double
a = b + c; //single all the way
The double precision version requires more storage for the constant, plus a conversion from single to double for the variable operand, then a conversion from double to single to assign the result. With all these conversions going on, if you are not in tune with how floating point works (rounding and such), you might not get the result you were expecting. The compiler may optimize some of this behavior out at some point, making the real problems harder to understand, and the FPU in the hardware might accept mixed-mode operands, which also hides what is really going on.
It is not just a speed problem but also an accuracy problem. There was a recent SO question with pretty much the same problem: why does this comparison work with one number and not another? Take the fraction 5/11 for example, 0.454545.... Let's say, hypothetically, you had a base 10 FPU with single precision of 3 significant digits and double precision of 6 significant digits.
float a = 0.45454545454;
...
if(a>0.4545454545) b=1;
...
Well, in our hypothetical system we can only store three digits into a, so a = .455 because we are using a round-up rounding mode by default. But our comparison will be performed in double because we didn't put the f at the end of the number. The double version is 0.454545. a is converted to a double, which results in 0.455000, so:
if(0.455000>0.454545) b = 1;
0.455 is greater than 0.454545 so b would be a 1.
float a = 0.45454545454;
...
if(a>0.4545454545f) b=1;
...
So now the comparison is single precision: we are comparing 0.455 to 0.455, which is not greater, so b=1 does not happen.
When you write floating point constants, they are in base 10, but the floating point numbers in the computer are base 2, and the two don't always convert smoothly, just as 5/11 would work just fine in base 11 but in base 10 gives an infinitely repeating pattern. 0.1 in decimal, for example, creates a repeating pattern in binary. Depending on where the mantissa cuts off, the rounding can make the least significant bit of the mantissa round up or not (this also depends on the rounding mode you are using, if the floating point format you are using even has rounding). That in itself creates problems depending on how you use the variable, as the comparison above shows.
For non-floating point the compiler usually saves you, but sometimes doesn't:
unsigned long long a;
...
a = ~3;
a = ~(3ULL);
...
Depending on the compiler and computer, the two assignments can give you different results: one MIGHT give you 0x00000000FFFFFFFC, the other MIGHT give 0xFFFFFFFFFFFFFFFC.
If you want something specific, you should be quite clear when you tell the compiler what you want; otherwise the compiler takes a guess, and it doesn't always make the guess that you wanted.
It means that the value is to be interpreted as a single-precision floating point constant (type float). Without the f suffix, it is interpreted as a double-precision floating point constant (type double).
This is usually done to shut up compiler warnings about possible loss of precision when assigning a double value to a float variable. If you didn't receive such a warning, you may have switched off warnings in your compiler settings (which is bad!).
But it can also have a subtle semantic effect. As you know, C++ allows functions which have the same name but differ by the types of their parameters. In that case the f suffix can determine which function is called.