For example,
      SUBROUTINE DoSomething (Z, L)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      D = Z*77.1234567D0
      L = D
      RETURN
      END
For the sake of discussion, let us assume D is equal to -1.5. Would L in this case be equal to -1 or -2? In other words, does the conversion round up or round down?
Thanks in advance.
Conversion to an integer type on assignment follows the rules of the intrinsic function INT. The effect is defined as (F2008 13.7.81):
If A is of type real, there are two cases: if |A| < 1, INT(A) has the value 0; if |A| ≥ 1, INT(A) is the integer whose magnitude is the largest integer that does not exceed the magnitude of A and whose sign is the same as the sign of A.
In this case, then, L will take the value -1.
Either use NINT(), which rounds to the nearest integer, or INT(), which returns only the signed integer part of a number (truncation toward zero). NINT() works as follows:
If a is greater than zero, NINT(a) has the value INT(a + 0.5); if a is less than or equal to zero, NINT(a) has the value INT(a - 0.5).
Specifically NINT(0.5d0) = 1
In C++, the conversion of an integer value of type I to a floating-point type F will be exact (that is, static_cast<I>(static_cast<F>(i)) == i) if the range of I is a subset of the integers exactly representable in F.
Is it possible, and if yes how, to calculate the loss of precision of static_cast<F>(i) (without using another floating point type with a wider range)?
As a start, I tried to code a function that would return if a conversion is safe or not (safe, meaning no loss of precision), but I must admit I am not so sure about its correctness.
template <class F, class I>
bool is_cast_safe(I value)
{
    return std::abs(value) < std::numeric_limits<F>::digits;
}
std::cout << is_cast_safe<float>(4) << std::endl; // true
std::cout << is_cast_safe<float>(0x1000001) << std::endl; // false
Thanks in advance.
is_cast_safe can be implemented with:
static const F One = 1;
F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
I U = std::max(ULP, One);
return value % U == 0;
This sets ULP to the value of the least digit position in the result of converting value to F. ilogb returns the position (as an exponent of the floating-point radix) for the highest digit position, and subtracting one less than the number of digits adjusts to the lowest digit position. Then scalbn gives us the value of that position, which is the ULP.
Then value can be represented exactly in F if and only if it is a multiple of the ULP. To test that, we convert the ULP to I (but substitute 1 if it is less than 1), and then take the remainder of value divided by the ULP (or 1).
Also, if one is concerned the conversion to F might overflow, code can be inserted to handle this as well.
Calculating the actual amount of the change is trickier. The conversion to floating-point could round up or down, and the rule for choosing is implementation-defined, although round-to-nearest-ties-to-even is common. So the actual change cannot be calculated from the floating-point properties we are given in numeric_limits. It must involve performing the conversion and doing some work in floating-point. This definitely can be done, but it is a nuisance. I think an approach that should work is:
Assume value is non-negative. (Negative values can be handled similarly but are omitted for now for simplicity.)
First, test for overflow in conversion to F. This in itself is tricky, as the behavior is undefined if the value is too large. Some similar considerations were addressed in this answer to a question about safely converting from floating-point to integer (in C).
If the value does not overflow, then convert it. Let the result be x. Divide x by the floating-point radix r, producing y. If y is not an integer (which can be tested using fmod or trunc) the conversion was exact.
Otherwise, convert y to I, producing z. This is safe because y is less than the original value, so it must fit in I.
Then the change due to conversion (the converted value minus the original) is (z - value/r)*r - value%r.
I loss = abs(static_cast<I>(static_cast<F>(i)) - i) should do the job. The only exception is when i's magnitude is so large that static_cast<F>(i) produces a value outside the range of I.
(I assumed here that an abs for type I is available.)
What I'm doing is very straightforward. Here are the relevant declarations:
USE, INTRINSIC :: ISO_Fortran_env, dp=>REAL64 !modern DOUBLE PRECISION
REAL(dp), PARAMETER :: G_H2_alpha = 1.57D+04, G_H2_beta = 5.3D+03, G_H2_gamma = 4.5D+03
REAL(dp) :: E_total_alpha, E_total_beta, E_total_gamma, P_H2_sed
Usage:
P_H2_sed = G_H2_alpha * E_total_alpha + G_H2_beta * E_total_beta * G_H2_gamma * E_total_gamma
where E_total_alpha, E_total_beta, and E_total_gamma are just running dp totals inside various loops. I ask for the nearest integer with NINT(P_H2_sed) and get -2147483648, which looks like mixed-mode arithmetic. The float P_H2_sed is 2529548272025.4888, so I would expect NINT to return 2529548272026. I didn't think it was possible to get this kind of result from an intrinsic function; I haven't seen anything like it since my days with the old F77 compiler. I'm doing something bad, but what is the question.
NINT, by default, returns an integer of default kind, which is usually equivalent to int32.
An integer of that kind cannot represent a number as large as 2529548272026. The maximum representable value is 2^31 - 1, that is 2147483647. The result you are getting looks similar to that, but it is the lowest representable value, -2147483648 (in two's complement, the sign bit set and all other bits zero).
To get a result of another kind from NINT, pass the optional argument named kind, like this: NINT(P_H2_sed, kind=int64) (with int64 also available from ISO_FORTRAN_ENV).
I have multiple kinds I am using in Fortran and would like to add a real valued number where the real number is cast as that kind.
For example, something like:
program illsum
  implicit none

#if defined(USE_SINGLE)
  integer, parameter :: rkind = selected_real_kind(6,37)
#elif defined(USE_DOUBLE)
  integer, parameter :: rkind = selected_real_kind(15,307)
#elif defined(USE_QUAD)
  integer, parameter :: rkind = selected_real_kind(33, 4931)
#endif

  integer :: Nmax = 100
  integer :: i
  real(kind = rkind) :: mysum = 0.0

  do i = 1, Nmax
    mysum = mysum + kind(rkind, 1.0)/kind(rkind, i)
  enddo
end program illsum
So I want to make sure that 1.0 and the real valued expression of i are expressed as the proper kind that I have chosen before performing the division and addition.
How can I cast 1.0 as rkind?
To convert a numeric value to a real value there is the real intrinsic function. Further, this takes a second argument which determines the kind value of the result. So, for your named constant rkind:
real(i, rkind) ! Returns a real valued i of kind rkind
real(1.0, rkind) ! Returns a real valued 1 of kind rkind
which I think is what you are meaning with kind(rkind, 1.0). kind itself, however, is an intrinsic which returns the kind value of a numeric object.
However, there are other things to note.
First, there is the literal constant 1._rkind (note the . in there; it may be clearer written as 1.0_rkind), which is of kind rkind and has a value approximating 1.
There's no comparable expression i_rkind, though, so the conversion above would be necessary for a real result of kind rkind with value approximating i.
That said, for your example there is no need for such casting of the integer value. Under the rules of Fortran, the expression 1._rkind/i involves that implicit conversion of i and is equivalent to 1._rkind/real(i,rkind) (and to real(1.0,rkind)/real(i,rkind)).
I would like to ask what happens if we use a fractional expression as an array index in C or C++. An example of what I mean:
int arr1[5], arr2[3]; /* sizes added so the snippet is well-formed */
int i;
for (i = 0; i < 5; ++i)
{
    if (i % 2 == 0)
        arr1[i] = i;
    else
        arr2[i / 2] = i;
}
What would the compiler do when it sees arr2[3/2]?
i/2 is integer division. The result of this division is again an integer, namely the quotient truncated toward 0 (3/2 == 1; -5/2 == -2). (As a side note, there is no separate divide-then-truncate step: integer division produces the truncated quotient directly as a single operation.) So you will not be passing a fraction as an array index.
If you try to use a type which can hold a fraction (for example a double) as the index, the compiler will generate an error.
The division would happen first, and the answer would then be used as the array index. So, in your example, 3/2 would resolve to 1 (truncation), and then it would assign arr2[1]=i.
3/2 yields an integer result equal to 1. There is no 'fraction' in such a line, ever.
arr2[3/2] is the same as arr2[1]
An array index must be an integer. If you use a floating-point expression as the index, the compiler rejects it rather than converting it.
integer1 / integer2 yields another integer in C/C++.
What would the compiler do when it sees arr2[3/2]?
Nothing special. The expression 3/2 is a valid constant expression; it evaluates to the integer 1, which is then used as the index.
I want to implement greatest integer function. [The "greatest integer function" is a quite standard name for what is also known as the floor function.]
int x = 5/3;
My question is: with greater numbers, could there be a loss of precision, as 5/3 would produce a double?
EDIT: Greatest integer function is integer less than or equal to X.
Example:
4.5 = 4
4 = 4
3.2 = 3
3 = 3
What I want to know is 5/3 going to produce a double? Because if so I will have loss of precision when converting to int.
Hope this makes sense.
You will lose the fractional portion of the quotient. The size of the operands does not change that: 5000/3000 discards the same fraction (2/3) as 5/3.
However, 5 / 3 will return an integer, not a double. To force floating-point division, cast one operand: static_cast<double>(5) / 3.
Integer division gives integer results, so 5 / 3 is 1 and 5 % 3 is 2 (the remainder operator). However, this doesn't necessarily hold with negative numbers. In the original C++ standard, -5 / 3 could be either -1 (rounding towards zero) or -2 (the floor), but -1 was recommended. In the latest C++0x draft (which is almost certainly very close to the final standard), it is -1, so finding the floor with negative numbers is more involved.
5/3 will always produce 1 (an integer), if you do 5.0/3 or 5/3.0 the result will be a double.
As far as I know, there is no predefined function for this purpose.
It might be necessary to use such a function if, for some reason, floating-point calculations are out of the question (e.g. int64_t values can exceed what double represents without error).
We could define this function as follows:
#include <cstdlib> /* for ldiv; <cmath> is not needed here */

inline long
floordiv (long num, long den)
{
    if (0 < (num ^ den))  /* non-negative result: truncation equals floor */
        return num / den;
    else
    {
        ldiv_t res = ldiv (num, den);
        return res.rem ? res.quot - 1 : res.quot;
    }
}
The idea is to use the normal integer division, but adjust downward when the result would be negative, to match the behaviour of the double floor(double) function. The point is to always round toward the next lower integer, irrespective of the position of the zero point. This can be very important if the intention is to create evenly sized intervals.
Timing measurements show that this function adds only a small overhead compared with the built-in / operator, but of course the floating-point-based floor function is significantly faster.
Since in C and C++, as others have said, / is integer division, 5 / 3 will return an int. For non-negative operands the truncated quotient equals the floor, so 5/3 is exactly what you want.
It differs for negatives: -5/3 truncates to -1, while the floor is -2, which may or may not be what you want...