Whole and fraction to float/double? - c++

If I have a signed long variable that holds the whole number part of a decimal and another long variable that holds the fraction part, how would I convert that to a float or double type?
The fraction part is scaled to the 9th place.
Example:
signed long h = -5;
long f = 200073490;
Result should be -5.20007349
Example 2:
signed long h = 3;
long f = 500100;
Result should be 3.0005001
Edit
Also: looking for a mathematical solution. Converting to string and scanning it back into float/double will not work in my project.

Since the long int f, representing the fractional part, is scaled by 1,000,000,000, you just need to divide it by 1000000000.0 and correct for the sign when the integer portion of the pair is negative. That is, you add the scaled fractional part when the base number is non-negative and subtract it when the base number is negative. Given that h is the integer portion and f is the fractional portion, an expression that combines these to produce a double is:
double result=h + (1-2*(h < 0)) * f/1000000000.0;
The expression (1-2*(h < 0)) yields a 1 when h is not negative otherwise a -1.
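As a minimal sketch, the same expression can be wrapped in a function. The names combine_parts and kScale are made up for illustration, and the code assumes f is non-negative and below 1,000,000,000. Because the sign is taken from h, a value such as -0.5 (h = 0) cannot be expressed with this pair of inputs.

```cpp
#include <cassert>

// Hypothetical helper: h is the whole part, f the fraction scaled by 1e9.
double combine_parts(long h, long f) {
    const double kScale = 1000000000.0;  // fraction is scaled to 9 places
    assert(f >= 0 && f < 1000000000L);   // assumption: f carries no sign
    const double frac = f / kScale;
    // Add the fraction for non-negative h, subtract it for negative h.
    return h < 0 ? h - frac : h + frac;
}
```

For the two examples in the question, combine_parts(-5, 200073490) and combine_parts(3, 500100) give values close to -5.20007349 and 3.0005001, up to the usual rounding of double.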

Related

When I use the pow(base, exponent) function in C++ and the exponent is a fraction, I always get 1 as output [duplicate]

I was writing this code:
public static void main(String[] args) {
double g = 1 / 3;
System.out.printf("%.2f", g);
}
The result is 0. Why is this, and how do I solve this problem?
The two operands (1 and 3) are integers, therefore integer arithmetic (division here) is used. Declaring the result variable as double just causes an implicit conversion to occur after division.
Integer division of course returns the true result of division rounded towards zero. The result of 0.333... is thus rounded down to 0 here. (Note that the processor doesn't actually do any rounding, but you can think of it that way still.)
Also, note that if both operands (numbers) are given as floats; 3.0 and 1.0, or even just the first, then floating-point arithmetic is used, giving you 0.333....
1/3 uses integer division as both sides are integers.
You need at least one of them to be float or double.
If you are entering the values in the source code like your question, you can do 1.0/3 ; the 1.0 is a double.
If you get the values from elsewhere you can use (double) to turn the int into a double.
int x = ...;
int y = ...;
double value = ((double) x) / y;
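The same rule applies in C++, the language used in the other threads on this page, so here is a small C++ sketch of the cast-one-operand fix. The function names are made up for illustration.

```cpp
#include <cassert>

// Integer division: both operands are int, so the quotient is truncated.
double truncated_ratio(int x, int y) {
    assert(y != 0);
    return x / y;  // division happens in int, then converts to double
}

// Converting one operand first makes the division floating-point.
double real_ratio(int x, int y) {
    assert(y != 0);
    return static_cast<double>(x) / y;
}
```

With x = 1 and y = 3, the first function returns 0 and the second returns roughly 0.333.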
Explicitly cast it as a double
double g = 1.0/3.0;
This happens because Java uses the integer division operation for 1 and 3 since you entered them as integer constants.
Because you are doing integer division.
As @Noldorin says, if both operands are integers, then integer division is used.
The result 0.33333333 can't be represented as an integer, therefore only the integer part (0) is assigned to the result.
If any of the operands is a double / float, then floating point arithmetic will take place. But you'll have the same problem if you do that:
int n = 1.0 / 3.0;
The easiest solution is to just do this
double g = (double) 1 / 3;
Because you wrote 1 / 3 rather than 1.0 / 3.0, Java treats it as integer division. The cast operator (double) converts the 1 to type double before the division takes place, so the division is carried out in floating point.
Here we cast only one operand, and this is enough to avoid integer division (rounding towards zero)
The result is 0. Why is this, and how do I solve this problem?
TL;DR
You can solve it by doing:
double g = 1.0/3.0;
or
double g = 1.0/3;
or
double g = 1/3.0;
or
double g = (double) 1 / 3;
The last of these options is required when you are using variables e.g. int a = 1, b = 3; double g = (double) a / b;.
A more complete answer
double g = 1 / 3;
This results in 0 because
first, the dividend < divisor;
both operands are of type int, therefore the result is an int (5.6.2. JLS), which naturally cannot represent a floating point value such as 0.333333...
"Integer division rounds toward 0." 15.17.2 JLS
Why double g = 1.0/3.0; and double g = ((double) 1) / 3; work?
From Chapter 5. Conversions and Promotions one can read:
One conversion context is the operand of a numeric operator such as +
or *. The conversion process for such operands is called numeric
promotion. Promotion is special in that, in the case of binary
operators, the conversion chosen for one operand may depend in part on
the type of the other operand expression.
and 5.6.2. Binary Numeric Promotion
When an operator applies binary numeric promotion to a pair of
operands, each of which must denote a value that is convertible to a
numeric type, the following rules apply, in order:
If any operand is of a reference type, it is subjected to unboxing
conversion (§5.1.8).
Widening primitive conversion (§5.1.2) is applied to convert either or
both operands as specified by the following rules:
If either operand is of type double, the other is converted to double.
Otherwise, if either operand is of type float, the other is converted
to float.
Otherwise, if either operand is of type long, the other is converted
to long.
Otherwise, both operands are converted to type int.
you should use
double g=1.0/3;
or
double g=1/3.0;
Integer division returns integer.
Make the 1 a float and float division will be used
public static void main(String d[]){
double g=1f/3;
System.out.printf("%.2f",g);
}
The conversion in Java is quite simple but needs some understanding. As explained in the JLS for integer operations:
If an integer operator other than a shift operator has at least one operand of type long, then the operation is carried out using 64-bit precision, and the result of the numerical operator is of type long. If the other operand is not long, it is first widened (§5.1.5) to type long by numeric promotion (§5.6).
And an example is always the best way to translate the JLS ;)
int + long -> long
int(1) + long(2) + int(3) -> long(1+2) + long(3)
Otherwise, the operation is carried out using 32-bit precision, and the result of the numerical operator is of type int. If either operand is not an int, it is first widened to type int by numeric promotion.
short + int -> int + int -> int
A small example using Eclipse to show that even an addition of two shorts will not be that easy :
short s = 1;
s = s + s; // <- compile error
//possible loss of precision
// required: short
// found: int
This requires a cast, with a possible loss of precision.
The same is true for the floating point operators
If at least one of the operands to a numerical operator is of type double, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double. If the other operand is not a double, it is first widened (§5.1.5) to type double by numeric promotion (§5.6).
So a float operand is promoted to double.
And mixing integer and floating-point values results in a floating-point value, as stated:
If at least one of the operands to a binary operator is of floating-point type, then the operation is a floating-point operation, even if the other is integral.
This is true for binary operators but not for "Assignment Operators" like +=
A simple working example is enough to prove this
int i = 1;
i += 1.5f;
The reason is that there is an implicit cast done here; this is executed like
i = (int) (i + 1.5f)
i = (int) 2.5f
i = 2
1 and 3 are integer constants, so Java does an integer division, whose result is 0. If you want to write double constants you have to write 1.0 and 3.0.
I did this.
double g = 1.0/3.0;
System.out.printf("%gf", g);
Use .0 while doing double calculations or else Java will assume you are using integers. If a calculation uses any double values, then the output will be a double value. If they are all integers, then the output will be an integer.
Because it treats 1 and 3 as integers, therefore rounding the result down to 0, so that it is an integer.
To get the result you are looking for, explicitly tell java that the numbers are doubles like so:
double g = 1.0/3.0;
(1/3) means integer division; that's why you cannot get a decimal value from this division. To solve this problem use:
public static void main(String[] args) {
double g = 1.0 / 3;
System.out.printf("%.2f", g);
}
public static void main(String[] args) {
double g = 1 / 3;
System.out.printf("%.2f", g);
}
Since both 1 and 3 are ints, the result is not rounded but truncated. So the fraction is dropped and only the whole part is kept.
To avoid this have at least one of your numbers 1 or 3 as a decimal form 1.0 and/or 3.0.
My code was:
System.out.println("enter weight: ");
int weight = myObj.nextInt();
System.out.println("enter height: ");
int height = myObj.nextInt();
double BMI = weight / (height * height);
System.out.println("BMI is: " + BMI);
If the user enters weight = 5 and height = 7, the numerator (5) is smaller than the denominator (height * height = 49),
so the integer division returns 0 instead of 5 / 49 ≈ 0.10, and BMI is 0 (without decimal values).
Solution :
Option 1:
double BMI = (double) weight / ((double)height * (double)height);
Option 2:
double BMI = (double) weight / (height * height);
I noticed that this is somehow not mentioned in the many replies, but you can also do 1.0 * 1 / 3 to get floating point division. This is more useful when you have variables that you can't just add .0 after it, e.g.
import java.io.*;
public class Main {
public static void main(String[] args) {
int x = 10;
int y = 15;
System.out.println(1.0 * x / y);
}
}
Do "double g=1.0/3.0;" instead.
Many others have failed to point out the real issue:
An operation on only integers casts the result of the operation to an integer.
This necessarily means that floating point results, that could be displayed as an integer, will be truncated (lop off the decimal part).
What is casting (typecasting / type conversion) you ask?
It varies on the implementation of the language, but Wikipedia has a fairly comprehensive view, and it does talk about coercion as well, which is a pivotal piece of information in answering your question.
http://en.wikipedia.org/wiki/Type_conversion
Try this out:
public static void main(String[] args) {
double a = 1.0;
double b = 3.0;
double g = a / b;
System.out.printf(""+ g);
}

Strange result using long double in C++

Using long double arithmetic in C++, the number 50,000,056,019,485.52605438232421875 squared yields 2,500,005,601,951,690,788,240,883,712. Meanwhile, the number 50,000,056,019,485.526050567626953125 (which differs from the first number by less than 0.001) squared yields 2,500,005,601,951,690,787,704,012,800, which differs from the first square by almost 1 billion. With the differences highlighted:
50,000,056,019,485.526054382324218750 ^ 2 = 2,500,005,601,951,690,788,240,883,712
50,000,056,019,485.526050567626953125 ^ 2 = 2,500,005,601,951,690,787,704,012,800
Long double's range in my machine goes from ~1e-300 to ~1e+300. I fully understand the inability to represent all the numbers in the range, but I didn't expect such a big difference. Could anybody shed some light on this?
Math
Given d as a small value compared to a
The expected difference of (a+d)*(a+d) - (a)*(a) is about 2*a*d.
Here a is about 50,000,056,000,000.0, d is about 0.00000381... and 2*a*d is about 381,000,000.0. That is within a factor of two of the difference seen using long double: 536,870,912.0.
"I didn't expect such a big difference." is only "off" less than a factor of 2. Given the squares represent rounded products, the difference of the squares seen here is reasonable.
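The 2*a*d estimate above can be checked with a few lines of C++. This is only a sketch of the arithmetic in this answer; the function name square_gap is made up, and d is assumed to be the positive gap between the two inputs.

```cpp
#include <cassert>

// Estimate of (a + d)^2 - a^2 for small d: 2*a*d + d*d.
long double square_gap(long double a, long double d) {
    assert(d >= 0.0L);  // assumption: d is the positive gap
    return 2.0L * a * d + d * d;
}
```

With a = 50000056019485.526L and d = 0.000003814697265625L the estimate comes out near 3.8e8, the same order of magnitude as the observed difference of 536,870,912.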
Some details:
long double
Given the following and long double as 10-byte, a and b are consecutive long double values. They differ by 1 ULP or about 1 part in 2^64. Both decimal code constants are exactly saved as a, b.
long double b = 50000056019485.526054382324218750L; // 0x1.6bcc5c9f50ec355cp+45
long double a = 50000056019485.526050567626953125L; // 0x1.6bcc5c9f50ec355ap+45
// difference
long double d = 0.000003814697265625L; // 0x0.0000000000000002p+45
The squares of a and b are not exactly representable as long double. Instead the reported squares are the rounded values. They differ by 2 ULP.
long double bb = 2500005601951690788240883712.0L; // 0x1.027e98e7c774ede8cp+91
long double aa = 2500005601951690787704012800.0L; // 0x1.027e98e7c774ede88p+91
// difference 536870912.0 .... 0x0.00000000000000004p+91
When multiplying 2 floating point numbers, the error in the product is expected to be within +/-0.5 ULP. Here the ULP of the squares is 268,435,456 (the reported difference of 536,870,912 is 2 ULP) and indeed the squares are correct to within 0.5*

Multiply 2 large numbers in C++ have wrong result

There are 2 large integer numbers. When I multiply them the result is always wrong, even though I use long double and the result should be in the valid range of long double:
long double f = 1000000000 * 99999;
I debugged, and the result is strange: -723552768.00000000. Did I miss something? How can I multiply them?
Thanks and regards!
from the C++ standards:
4 An unsuffixed floating constant has type double. If suffixed by the
letter f or F, it has type float. If suffixed by the letter l or L, it
has type long double
auto fl = 1000000000.L * 99999.L;
std::cout << fl << "\n";
or
long double fl = 1000000000L * 99999.L;
std::cout <<"\n"<< fl << "\n";
Integer literals without a suffix have type int in C++ when the value fits in an int. Thus, the expression 1000000000 * 99999 is the multiplication of two ints, and the result of the * operator is also an int. This int is only converted to long double after the multiplication has taken place, when it is stored in f. Depending on your platform, int is usually 4 bytes, with a range from -2147483648 to 2147483647. However, the product 1000000000 x 99999 is 9.9999 x 10^13, which falls outside this range, so overflow occurs because an int is not large enough to hold the value.
To avoid this, at least one of the numbers the * operator operates on should be declared as a long double literal with the suffix .l or .L as follows:
long double f = 1000000000.L * 99999;
In the above expression , the * operator will return a long double which is large enough to hold the resulting product before being assigned to f.
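When the operands are variables rather than literals, a suffix is not available, but converting one operand before the multiplication has the same effect. The sketch below assumes non-negative inputs for the overflow check; both function names are made up for illustration.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: widen one operand before multiplying so the
// product is computed in long double rather than in int.
long double wide_product(int a, int b) {
    return static_cast<long double>(a) * b;
}

// Returns true when a * b would not fit in int (non-negative inputs only).
bool int_product_overflows(int a, int b) {
    assert(a >= 0 && b >= 0);  // assumption: keeps the check simple
    if (a == 0 || b == 0) return false;
    return a > std::numeric_limits<int>::max() / b;
}
```

On a platform with a 32-bit int, int_product_overflows(1000000000, 99999) reports true, while wide_product(1000000000, 99999) yields 99999000000000.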
Agree with @feiXiang. You are basically multiplying two ints. To do correct calculations, you have to define large numbers as long double. See the code below:
#include <iostream>
using namespace std;
int main()
{
long double a = 1000000000;
long double b = 99999;
long double f = a * b;
cout<<f;
return 0;
}
Output:
9.9999e+13
Actually you invoke undefined behavior with:
long double f = 1000000000 * 99999;
First, evaluate 1000000000 * 99999, which is a multiplication of two int objects. Multiplying two int objects is always an int. Since int is not big enough to represent the result (most likely 32 bits), the upper bits are lost.
Since overflow in signed integer types is undefined, you just triggered undefined behavior. But in this case it is possible to explain what happened, even though it is UB.
The computation keeps only the lowest 32 bits, which should be (1000000000 * 99999) modulo (2**32) == 3571414528. But this value is too big for int. Since on PC int negatives are represented by two's complement, we have to subtract 2**32, every time 2**31<= result < 2**32. This gives -723552768
Now, the last step is:
long double f = -723552768;
And that is what you see.
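The wraparound described above can be reproduced without invoking undefined behaviour by doing the multiplication in 64 bits and reducing by hand. This is only an illustration of the arithmetic, assuming a 32-bit two's-complement int; the function name is made up.

```cpp
#include <cassert>
#include <cstdint>

// Multiply in 64 bits, keep the low 32 bits, then reinterpret them as a
// signed 32-bit value by subtracting 2^32 where needed.
std::int64_t wrapped_int32_product(std::int32_t a, std::int32_t b) {
    const std::int64_t exact = static_cast<std::int64_t>(a) * b;
    const std::uint64_t low = static_cast<std::uint64_t>(exact) & 0xFFFFFFFFu;
    const std::int64_t two32 = std::int64_t{1} << 32;
    // Values of 2^31 and above map to negatives.
    const std::int64_t result = low >= (std::uint64_t{1} << 31)
                                    ? static_cast<std::int64_t>(low) - two32
                                    : static_cast<std::int64_t>(low);
    assert(result >= INT32_MIN && result <= INT32_MAX);  // fits in 32 bits
    return result;
}
```

For the operands in the question, wrapped_int32_product(1000000000, 99999) gives -723552768, matching the value seen in the debugger.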
To overcome the issue, either use long long like this:
long double f = 1000000000LL * 99999;
Or double:
long double f = 1000000000.0 * 99999;
1000000000 and 99999 are integer numbers, then the result of 1000000000 * 99999 will be an integer before it is assigned to your variable, and the result is out of range of integer.
You should make sure that the result is a long double first:
long double f = (long double) 1000000000 * 99999;
Or
long double f = 1000000000LL * 99999;

Can n %= m ever return negative value for very large nonnegative n and m?

This question is regarding the modulo operator %. We know in general a % b returns the remainder when a is divided by b and the remainder is greater than or equal to zero and strictly less than b. But does the above hold when a and b are of magnitude 10^9 ?
I seem to be getting a negative output for the following code for input:
74 41 28
However changing the final output statement does the work and the result becomes correct!
#include<iostream>
using namespace std;
#define m 1000000007
int main(){
int n,k,d;
cin>>n>>k>>d;
if(d>n)
cout<<0<<endl;
else
{
long long *dp1 = new long long[n+1], *dp2 = new long long[n+1];
//build dp1:
dp1[0] = 1;
dp1[1] = 1;
for(int r=2;r<=n;r++)
{
dp1[r] = (2 * dp1[r-1]) % m;
if(r>=k+1) dp1[r] -= dp1[r-k-1];
dp1[r] %= m;
}
//build dp2:
for(int r=0;r<d;r++) dp2[r] = 0;
dp2[d] = 1;
for(int r = d+1;r<=n;r++)
{
dp2[r] = ((2*dp2[r-1]) - dp2[r-d] + dp1[r-d]) % m;
if(r>=k+1) dp2[r] -= dp1[r-k-1];
dp2[r] %= m;
}
cout<<dp2[n]<<endl;
}
}
changing the final output statement to:
if(dp2[n]<0) cout<<dp2[n]+m<<endl;
else cout<<dp2[n]<<endl;
does the work, but why was it required?
By the way, the code is actually my solution to this question
This is a limit imposed by the range of int.
On most platforms, int can only hold values from –2,147,483,648 to 2,147,483,647.
Consider using long long for your m, n, k, d & r variables. If possible use unsigned long long if your calculations should never have a negative value.
long long can hold values from –9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
while unsigned long long can hold values from 0 to 18,446,744,073,709,551,615 (2^64 - 1).
The range of positive values is approximately halved in signed types compared to unsigned types, due to the fact that the most significant bit is used for the sign; When you try to assign a positive value greater than the range imposed by the specified Data Type the most significant bit is raised and it gets interpreted as a negative value.
Well, no, modulo with positive operands does not produce negative results.
However .....
The int type is only guaranteed by the C standards to support values in the range -32767 to 32767, which means your macro m is not necessarily expanding to a literal of type int. It will fit in a long though (which is guaranteed to have a large enough range).
If that's happening (e.g. a compiler that has a 16-bit int type and a 32-bit long type) the results of your modulo operations will be computed as long, and may have values that exceed what an int can represent. Converting that value to an int (as will be required with statements like dp1[r] %= m since dp1 is a pointer to int) gives undefined behaviour.
Mathematically, there is nothing special about big numbers, but computers only have a limited width to write down numbers in, so when things get too big you get "overflow" errors. A common analogy is the counter of miles traveled on a car dashboard - eventually it will show as all 9s and roll round to 0. Because of the way negative numbers are handled, standard signed integers don't roll round to zero, but to a very large negative number.
You need to switch to larger variable types so that they overflow less quickly - "long int" or "long long int" instead of just "int", the range doubling with each extra bit of width. You can also use unsigned types for a further doubling, since no range is used for negatives.
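One more thing worth checking in the posted code: the subtractions (dp1[r] -= ..., dp2[r] -= ..., and the expression assigned to dp2[r]) can leave a negative value just before % m is applied. In C++ (since C++11) integer division truncates toward zero, so a negative left operand gives a zero or negative remainder. Whether this is what happens for the input 74 41 28 has not been verified here; the sketch below only shows a common way to normalise the remainder, with a made-up function name.

```cpp
#include <cassert>

// Maps x into the range [0, m) even when x is negative.
long long mod_nonneg(long long x, long long m) {
    assert(m > 0);
    const long long r = x % m;  // may be negative when x is negative
    return r < 0 ? r + m : r;
}
```

This is the same adjustment as the `dp2[n] + m` workaround in the question, applied wherever a subtraction precedes the modulo.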

Getting the decimal point value

I have a function which should convert a double value into a string:
inline static string StringFromNumber(double val) // suppose val = 34.5678
{
long integer = (long)val; // integer = 34
long pointPart; // should be = 5678 how do I get it?
}
How do I get a long value for both integer and pointPart?
Add: I want a precision of 17 digits, discarding the trailing zeros. More examples:
val = 3.14 integer = 3 pointPart = 14
val = 134.4566425814748 integer = 134 pointPart = 4566425814748
I have not got any solution so far. How can I get it?
A stringstream won't get you the decimal point in particular, but it will convert the entire number to a string.
std::stringstream ss;
ss << val;
/*access the newly-created string with str()*/
return ss.str();
long pointPart = static_cast<long>(val*10)%10;
10 for 2 decimal places...
100 for 3 etc...
std::string realPoint = std::to_string(pointPart);
Plus a 32-bit long cannot hold 17 digits; it holds about 10.
so you probably want a float variable
You can use modf to separate the integer and fractional parts. You can then multiply the fractional part by 1.0e17, and call round to round the result to its integer component, and then cast to an unsigned long (for a non-negative val the fractional part will never be negative, and this allows you to maximize the number of bits in the integral type). Finally run through a loop to trim off the zeros on the unsigned long. For instance:
inline static string StringFromNumber(double val)
{
double intpart, fracpart;
fracpart = round((modf(val, &intpart)) * 1.0e17);
long int_long = static_cast<long>(intpart);
unsigned long frac_long = static_cast<unsigned long>(fracpart);
//trim off the zeros (skipped when frac_long is 0, which would otherwise never terminate)
for(unsigned long divisor = 10; frac_long != 0; divisor *= 10)
{
if ((frac_long / divisor) * divisor != frac_long)
{
frac_long = frac_long / (divisor / 10);
break;
}
}
//...more code for converting to string
}
Note that this code will only work up to 17 decimal places if you are on a 64-bit platform and unsigned long is defined as a 64-bit integer-type. Otherwise you will want to change unsigned long to uint64_t. Also keep in mind that since floating point numbers are approximations, and there's a multiplier by 1.0e17, the value of fracpart may not be exactly the value of the point-part of val ... in other words there may be some additional digits after any necessary rounding.
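A more compact variant of the same idea is sketched below. It is not the code from the answer: split_number and trim_trailing_zeros are hypothetical helpers, the number of digits is a parameter capped at 15 to stay within what a double can hold reliably, and long long is used instead of unsigned long so the width does not depend on the platform.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical helper: splits val into its whole part and the first
// `digits` decimal places, both returned as integers.
std::pair<long long, long long> split_number(double val, int digits) {
    assert(digits >= 0 && digits <= 15);  // assumption: stay within double precision
    double scale = 1.0;
    for (int i = 0; i < digits; ++i) scale *= 10.0;
    double intpart = 0.0;
    const double fracpart = std::modf(val, &intpart);  // keeps the sign of val
    return {static_cast<long long>(intpart),
            std::llround(std::fabs(fracpart) * scale)};
}

// Strips trailing decimal zeros; 0 is returned unchanged.
unsigned long long trim_trailing_zeros(unsigned long long n) {
    while (n != 0 && n % 10 == 0) n /= 10;
    return n;
}
```

For example, split_number(34.5678, 4) yields the pair (34, 5678). Two caveats: rounding can carry (a fraction like 0.99996 with 4 digits rounds to 10000), and leading zeros of the fraction are lost once it is stored as an integer (3.05 and 3.5 both give 5 after trimming), so a real implementation would need to handle both.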