INT vs FLOOR in Fortran

According to the gfortran documentation, INT(x) and FLOOR(x) both take an input x and convert it to integer type. FLOOR apparently only allows input of type REAL, whereas INT takes input of type INTEGER, REAL, and COMPLEX.
Is the allowed input type the only difference between INT and FLOOR? If so, can anyone explain why FLOOR exists, since it would then appear superfluous?
The "Similar Questions" box showed similar Stack Overflow questions for C, C++, and Python 3, but apparently no one has asked this question for Fortran yet, which is how I ended up this deep into asking it.
Including Fortran in my quick searches on Google and Stack Overflow turned up nothing useful. So this is admittedly a near-duplicate (unless Fortran's INT and FLOOR have quirks separating them from C/C++/Python), but I think it will have utility in making the answer more easily and quickly searchable.

INT is defined to round towards zero for REAL input, while FLOOR always rounds down (towards negative infinity). Consequently, for negative input the results differ: for example, INT(-2.7) is -2 while FLOOR(-2.7) is -3.
Unlike some of the other languages you reference, the result of calling FLOOR in Fortran is of type INTEGER.
Consider FLOOR in the context of its cousins NINT and CEILING.
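For comparison with those other languages, a minimal C++ sketch of the same distinction (note that C++'s std::floor returns a floating-point value, whereas Fortran's FLOOR returns an INTEGER):
#include <cmath>
#include <iostream>

int main()
{
    double x = -2.7;
    int    toward_zero  = static_cast<int>(x); // truncation toward zero, like Fortran's INT: -2
    double rounded_down = std::floor(x);       // rounds toward negative infinity, like FLOOR: -3.0
    std::cout << toward_zero << " " << rounded_down << "\n";
    return 0;
}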

Related

How to express float constants precisely in source code

I have some C++11 code generated via a code generator that contains a large array of floats, and I want to make sure that the compiled values are precisely the same as the values in the generator (assuming that both use the same IEEE 754 float format).
So I figured the best way to do it is to store the values as hex representations and interpret them as float in the code.
Edit for Clarification: The code generator takes the float values and converts them to their corresponding hex representations. The target code is supposed to convert back to float.
It looks something like this:
const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU};
float const* ptr = reinterpret_cast<float const*>(&data[0]);
This works and gives me access to all the data elements as floats, but I recently stumbled upon the fact that this is actually undefined behavior and only works because my compiler resolves it the way I intended:
https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8
https://en.cppreference.com/w/cpp/language/reinterpret_cast.
The standard basically says that accessing the data through a reinterpret_cast pointer of a different type is not defined.
So basically I have three options:
1) Use memcpy and hope that the compiler will be able to optimize this.
2) Store the data not as hex values but in a different way.
3) Use std::bit_cast from C++20.
I cannot use 3) because I'm stuck with C++11.
I don't have the resources to store the data array twice, so I would have to rely on the compiler to optimize this. Due to this, I don't particularly like 1) because it could stop working if I changed compilers or compiler settings.
So that leaves me with 2):
Is there a standardized way to express float values in source code so that they map to the exact float value when compiled? Does the ISO float standard define this in a way that guarantees that any compiler will follow the interpretation? I imagine if I deviate from the way the compiler expects, I could run the risk that the float "neighbor" of the number I actually want is used.
I would also take alternative ideas if there is an option 4 I forgot.
Use hexadecimal floating point literals. Assuming some endianness for the hex values you presented:
float floats[] = { 0x1.27e80ep-5, 0x1.44f108p-2, -0x1.0e5bbap-3 };
If you have the generated code produce the full representation of the floating-point value—all of the decimal digits needed to show its exact value—then a C++11 compiler is required to parse the number exactly.
C++11 draft N3092, 2.14.4 1, says of a floating literal:
… The exponent, if present, indicates the power of 10 by which the significant [likely typo, should be “significand”] part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner…
Thus, if the floating literal does not have all the digits needed to show the exact value, the implementation may round it either upward or downward, as the implementation defines. But if it does have all the digits, then the value represented by the floating literal is representable in the floating-point format, and so its value must be the result of the parsing.
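As an illustration of the first suggestion (my own sketch, not from the answer), the generator side can emit such literals with printf's %a conversion, which has printed exact hexadecimal floating-point representations since C99; note, as discussed further below, that the emitted literals themselves need C++17 (or C99) to compile:
#include <cstdio>

// Emit a hexadecimal floating-point literal for each value.
// %a prints the exact binary value of the (promoted) float argument,
// so nothing is lost in the round trip.
void emit_literals(const float* values, int n)
{
    std::printf("float floats[] = {");
    for (int i = 0; i < n; ++i)
        std::printf("%s %af", i ? "," : "", static_cast<double>(values[i]));
    std::printf(" };\n");
}

int main()
{
    const float sample[] = { 0.036122f, 0.317325f, -0.131996f }; // arbitrary example values
    emit_literals(sample, 3);
    return 0;
}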
I have read some very valuable information here and would like to throw in an option that does not strictly answer the question, but could be a solution.
It might be problematic, but if so, I would like to discuss it.
The simple solution would be: Leave it as it is.
A short rundown of why I am hesitant about the suggested options:
memcpy relies on the compiler to optimize away the actual copy and to understand that I only want to read the values. Since I have large arrays of data, I want to avoid a surprise where a changed compiler setting suddenly increases runtime and requires a fix on short notice.
bit_cast is only available from C++20. There are reference implementations but they basically use memcpy under the hood (see above).
hex float literals are only available from C++17.
Directly writing the floats precisely... I don't know, it seems to be somewhat dangerous, because if I make a slight mistake I may end up with a data block that is slightly off and could have an impact on my classification results. A mistake like that would be a nightmare to spot.
So why do I think I can get away with an implementation that is, strictly speaking, undefined? The rationale is that the standard may not define it, but compiler manufacturers likely do; at least the ones I have worked with so far gave me exact results. The code has been running without major problems for a fairly long time, across dozens of code generator runs, and I would expect a failed reinterpret_cast to break the conversion so severely that I would spot it in my classification results right away.
Still not robust enough though. So my idea was to write a unit test that contains a significant number of hex-floats, do the reinterpret_cast and compare to reference float values for exact correspondence to tell me if a setting or compiler failed in this regard.
I have one doubt though: Is the assumption somewhat reasonable that a failed reinterpret_cast would break things spectacularly, or are the bets totally off when it comes to undefined behavior?
I am a bit worried that the compiler implementation might define the undefined behavior in a way that picks a float close to the hex value instead of the precise one (although I would wonder why), and that this happens only sporadically, so that my unit test misses the problem.
So the endgame would be to unit test every single data entry against the corresponding reference float. Since the code is generated, I can generate the test as well. I think that should put all my worries to rest and make sure that I can get this to work across all possible compilers and compiler settings or be notified if anything breaks.
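A minimal sketch of what one such generated check could look like (my own illustration; the helper name is hypothetical). It compares the bit pattern read through the reinterpret_cast pointer against a memcpy-based decoding, which is well defined:
#include <cassert>
#include <cstring>

static_assert(sizeof(float) == sizeof(unsigned int), "float and unsigned int sizes must match");

// Well-defined way to recover the bit pattern of a float.
unsigned int float_to_bits(float f)
{
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

int main()
{
    static const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU };

    // The convenient-but-undefined access pattern used in the generated code.
    float const* ptr = reinterpret_cast<float const*>(&data[0]);

    // Verify, entry by entry, that it reproduces the original bit patterns.
    for (unsigned int i = 0; i < 3; ++i)
        assert(float_to_bits(ptr[i]) == data[i]);
    return 0;
}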

In which of these examples is conversion necessary?

Here are three examples, where I get a number from a function with some general non-double type (could be some sort of int, some sort of size_t, etc), and need to store that in a double.
My question is, is the code fine as is in all three examples, or do I need to do some conversion?
double x = getNotDouble(); //Set x = some number.
//Set x equal to division between two non-doubles:
double x = getNotDouble() / getAnotherNotDouble();
//Take non-double in constructor
class myClass
{
    double x;
    myClass(someType notDoublex) : x(notDoublex) {}
};
Strictly speaking, a conversion is used whenever you assign a value of one type to a variable of another type. In this respect, a conversion is needed in all three cases, since all three cases assign a non-double value to a double variable.
However, needing a conversion is not the same as needing to specify a conversion. Some conversions are provided automagically by the compiler. When this happens, you do not need to specify a conversion unless the automatic conversion is not the one you wanted. So whether or not a conversion needs to be specified depends on what you want to achieve.
Each of your three cases is correct in certain situations, but not necessarily in all situations. At the same time, each of your three cases could be enhanced with an explicit conversion, which would at least serve as a reminder to future programmers (including you!) that the conversion is intentional. This could be particularly useful when there are integers and division involved, since an explicit conversion could confirm that the intent is to convert to double after the integer division (dropping the fractional part).
In the end, what you need to do depends upon what you want to accomplish. One program's feature is another program's bug, simply because the programs seek to accomplish different goals.
Note that I have taken the following statement at face value:
I get a number from a function [...] and need to store that in a double.
For the second example, the value being stored in a double is getNotDouble() / getAnotherNotDouble(). To make this fit the statement, I needed to interpret "function" in the mathematical sense, not the programming sense. That is, the division is the "function" producing the value to store in a double. Otherwise I would have two numbers from two C++ functions, and that is inconsistent with "a number from a function". So I read the question as asking whether or not a conversion is needed after the division.
If the intent was to ask if a conversion is needed before the division, the answer still depends upon what you want to accomplish. The behavior of division depends on its operands, not on what is done with the result. So if the operands are integers, then integer division is performed, and the result is an integer even if that resulting integer is then assigned to a floating point variable. Sometimes this is desired. Often not.
If you are storing the result of the division in a double because you want to store the fractional part of the quotient, then you would need to make sure at least one of the operands is a floating point value before the division is performed. (There are floating point types other than double, so "not double" is not enough to know if an explicit conversion is needed.) However, this is really a separate topic than what this question is nominally about since this is about the division operator, while the question is nominally about storing values.
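As a small illustration of that point (my own example, not from the question), converting before or after the division gives different results:
#include <iostream>

int main()
{
    int a = 7, b = 2;

    double after  = a / b;                       // integer division happens first: 3.0
    double before = static_cast<double>(a) / b;  // convert an operand first: 3.5

    std::cout << after << " " << before << "\n";
    return 0;
}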
Your first and third examples result in no loss of data, so I assume they're fine.
Your second example is where some loss of data takes place (integer division discards the fractional part), which you can avoid by converting one operand before the division:
double x = static_cast<double>(getNotDouble()) / getAnotherNotDouble();
One of the operands has to be a double in order for the result of the division to also be a double.

Defining constants of sufficient precision in Fortran [duplicate]

I have the following Fortran code:
Program Strange
Real(Kind=8)::Pi1=3.1415926535897932384626433832795028841971693993751058209;
Real(Kind=8)::Pi2=3.1415926535897932384626433832795028841971693993751058209_8;
Print*, "Pi1=", Pi1;
Print*, "Pi2=", Pi2;
End Program Strange
I compile with gfortran, and the output is:
Pi1= 3.1415927410125732
Pi2= 3.1415926535897931
Of course the second is correct, but should this be the case? It seems like Pi1 is being input to memory as a single precision number, and then put into a double precision memory slot. But this seems like an error to me. Am I correct?
I do know a bit of Fortran! @Dougal's answer is correct, though the snippet he quotes from is not: embedding the letter d into a real literal constant is not required (since Fortran 90); indeed, many Fortran programmers now regard that approach as archaic. The snippet is also misleading in advising the use of 3.1415926535d+0 to initialise a 64-bit floating-point value for pi, since it doesn't set enough of the digits to their correct values.
The statement:
Real(Kind=8)::Pi1=3.1415926535897932384626433832795028841971693993751058209
defines Pi1 to be a real variable of kind 8. The literal real value 3.1415926535897932384626433832795028841971693993751058209 is, however, a real value of default kind, most likely to be a 4-byte real on most current compilers. That seems to explain your output but do check your documentation.
On the other hand, the literal real value 3.1415926535897932384626433832795028841971693993751058209_8 assigned to Pi2 is, by the suffixing of the kind specification, of kind=8, which is the same as the kind of the variable it is assigned to.
Three more points:
1) Don't fall into the trap of thinking that kind=8 means the same thing as 64-bit floating-point number or double. For many compilers it does, for some it doesn't. Kind numbers are not portable between Fortran implementations. They are, according to the standard, arbitrary positive integers. Better, with a modern compiler, would be to use the predefined constants from the intrinsic module iso_fortran_env, e.g.
use, intrinsic :: iso_fortran_env
...
real(real64) :: pi = 3.14159265358979323846264338_real64
There are other portable approaches to setting variable kinds using functions such as selected_real_kind.
2) Since the value of pi is unlikely to change during the execution of your program you might care to make it a parameter thus:
real(real64), parameter :: pi = 3.14159265358979323846264338_real64
3) It isn't necessary (or usual) to end Fortran statements with a ';' unless you want to have more than one statement on the same line in the source file.
I don't really know Fortran, but this page says:
The letter "d" must be embedded in the literal, otherwise, the compiler's pre-processor would round it off to be a Single Precision literal. For example, 3.1415926535 would be read as 3.141593 while 3.1415926535d+0 would be stored with all the digits intact. The letter "d" for double precision numbers has the same meaning as "e" for single precision numbers.
So it seems like your guess is correct.

Is this an unavoidable signed and unsigned integer comparison?

Probably not, but I can't think of a good solution. I'm no expert in C++ yet.
Recently I've converted a lot of ints to unsigned ints in a project. Basically everything that should never be negative is made unsigned. This removed a lot of these warnings by MinGW:
warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
I love it. It makes the program more robust and the code more descriptive. However, there is one place where they still occur. It looks like this:
unsigned int subroutine_point_size = settings->get<unsigned int>("subroutine_point_size");
...
for(int dx = -subroutine_point_size;dx <= subroutine_point_size;dx++) //Fill pixels in the point's radius.
{
    for(int dy = -subroutine_point_size;dy <= subroutine_point_size;dy++)
    {
        //Do something with dx and dy here.
    }
}
In this case I can't make dx and dy unsigned. They start out negative and depend on comparing which is lesser or greater.
I don't like making subroutine_point_size signed either, though this is the lesser evil. It indicates the size of a kernel in a pass over an image, and the kernel size can't be negative (it's probably unwise for a user ever to set this kernel size to anything more than 100, but the settings file allows for numbers up to 2^32 - 1).
So it seems there is no way to cast any of the variables to fix this. Is there a way to get rid of this warning and solve this neatly?
We're using C++11, compiling with GCC for Windows, Mac and various Unix distributions.
Cast the variables to a long int or long long int type, giving you at the same time the range of unsigned int (0..2^32-1) and a sign.
You're making a big mistake.
Basically you like the name "unsigned" and intend it to mean "not negative", but that is not the semantics associated with the type.
Consider the statement:
when adding a signed integer and an unsigned integer, the result is unsigned
Clearly this makes no sense if you read "unsigned" as "not negative", yet it is what the language does: adding -3 to the unsigned value 2 gives you a huge nonsense number instead of the correct answer -1.
Indeed, the choice of using an unsigned type for the size of containers is a design mistake of C++, a mistake that is too late to fix now because of backward compatibility. By the way, the reason it happened has nothing to do with "non-negativeness", but just with the ability to use the 16th bit when computers were that small (i.e. being able to use 65535 elements instead of 32767). Even back then I don't think the price of the wrong semantics was worth the gain (and if 32767 is not enough now, then 65535 won't be enough for long anyway).
Do not repeat the same mistake in your programs... the name is irrelevant; what counts is the semantics, and for unsigned in C and C++ that is "member of the modulo ring Z_n with n = 2^k".
You don't want the size of a container to be the member of a modulo ring. Do you?
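A minimal illustration of that modulo-ring semantic (assuming a 32-bit unsigned int):
#include <iostream>

int main()
{
    unsigned int u = 2;
    int s = -3;

    // s is converted to unsigned before the addition, so instead of -1
    // the result wraps around to 2^32 - 1.
    std::cout << u + s << "\n"; // prints 4294967295 with a 32-bit unsigned int
    return 0;
}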
Instead of the current
for(int dx = -subroutine_point_size;dx <= subroutine_point_size;dx++) //Fill pixels in the point's radius.
you can do this:
for(int dx = -int(subroutine_point_size);dx <= int(subroutine_point_size);dx++) //Fill pixels in the point's radius.
where the first int cast is (1) technically redundant, but is there for consistency, because the second cast removes the signed/unsigned warning that presumably is the issue here.
However, I strongly advise you to undo the work of converting signed to unsigned types everywhere. A good rule of thumb is to use signed types for numbers, and unsigned types for bit-level stuff. That avoids the problems with wrap-around due to implicit conversions, where e.g. std::string("Bah").length() < -5 is guaranteed to be true (very silly), and because it does away with actual problems, it also reduces spurious warnings.
Note that where you want to indicate that some value will never be negative, you can just use a suitably descriptive name.
1) Technically redundant in practice, for two's complement representation of signed integers, with no trapping inserted by the compiler. As far as I know no extant C++ compiler behaves otherwise.
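To see the wrap-around behind the std::string comparison mentioned above, a short sketch:
#include <iostream>
#include <string>

int main()
{
    // length() returns an unsigned type, so -5 is converted to a huge
    // unsigned value before the comparison; the expression is true
    // (and, fittingly, triggers the -Wsign-compare warning).
    std::cout << std::boolalpha
              << (std::string("Bah").length() < -5) << "\n"; // prints true
    return 0;
}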
Firstly, without knowing the range of values that will be stored in the variables, your claim that changing from signed to unsigned variables makes the program more robust is unsubstantiated - there are circumstances where that claim is false.
Second, the compiler is not issuing a warning only as a result of changing variables (and I assume calls of template functions like settings.get()) to be unsigned. It is warning about the fact you have expressions involving both signed and unsigned variables. Compilers typically issue warnings about such expressions because - in practice - they are more likely to indicate a programming error or to potentially involve some behaviour that the programmer may not have anticipated (e.g. instances of undefined behaviour, expressions where a negative result is expected but a large positive result is what will occur, etc).
A rule of thumb is that, if you need to have expressions involving both signed and unsigned types, you are better off making all the relevant variables signed. While there are exceptions where that rule of thumb isn't needed, you wouldn't have asked this question if you understood how to decide that.
On that basis, I suggest the most appropriate action is to unwind your changes.

Integer vs floating division -> Who is responsible for providing the result?

I've been programming for a while in C++, but suddenly had a doubt and wanted to clarify with the Stackoverflow community.
When an integer is divided by another integer, we all know the result is an integer and, likewise, a float divided by a float is also a float.
But who is responsible for providing this result? Is it the compiler or DIV instruction?
That depends on whether or not your architecture has a DIV instruction. If your architecture has both integer and floating-point divide instructions, the compiler will emit the right instruction for the case specified by the code. The language standard specifies the rules for type promotion and whether integer or floating-point division should be used in each possible situation.
If you have only an integer divide instruction, or only a floating-point divide instruction, the compiler will inline some code or generate a call to a math support library to handle the division. Divide instructions are notoriously slow, so most compilers will try to optimize them out if at all possible (eg, replace with shift instructions, or precalculate the result for a division of compile-time constants).
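For example, division by a compile-time power-of-two constant is a typical case where no divide instruction is needed (a sketch; the actual code generated depends on the compiler and its settings):
#include <iostream>

// For unsigned operands, x / 8 and x >> 3 give identical results,
// so the compiler is free to use a shift instead of a divide instruction.
unsigned int divide_by_8(unsigned int x)
{
    return x / 8;
}

int main()
{
    std::cout << divide_by_8(100u) << "\n"; // 12
    return 0;
}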
Hardware divide instructions almost never include conversion between integer and floating point. If you get divide instructions at all (they are sometimes left out, because a divide circuit is large and complicated), they're practically certain to be "divide int by int, produce int" and "divide float by float, produce float". And it'll usually be that both inputs and the output are all the same size, too.
The compiler is responsible for building whatever operation was written in the source code, on top of these primitives. For instance, in C, if you divide a float by an int, the compiler will emit an int-to-float conversion and then a float divide.
(Wacky exceptions do exist. I don't know, but I wouldn't put it past the VAX to have had "divide float by int" type instructions. The Itanium didn't really have a divide instruction, but its "divide helper" was only for floating point, you had to fake integer divide on top of float divide!)
The compiler will decide at compile time what form of division is required based on the types of the variables being used - at the end of the day a DIV (or FDIV) instruction of one form or another will get involved.
Your question doesn't really make sense. The DIV instruction doesn't do anything by itself. No matter how loud you shout at it, even if you try to bribe it, it doesn't take responsibility for anything.
When you program in a programming language [X], it is the sole responsibility of the [X] compiler to make a program that does what you described in the source code.
If a division is requested, the compiler decides how to make a division happen. That might happen by generating the opcode for the DIV instruction, if the CPU you're targeting has one. It might be by precomputing the division at compile-time, and just inserting the result directly into the program (assuming both operands are known at compile-time), or it might be done by generating a sequence of instructions which together emulate a divison.
But it is always up to the compiler. Your C++ program doesn't have any effect unless it is interpreted according to the C++ standard. If you interpret it as a plain text file, it doesn't do anything. If your compiler interprets it as a Java program, it is going to choke and reject it.
And the DIV instruction doesn't know anything about the C++ standard. A C++ compiler, on the other hand, is written with the sole purpose of understanding the C++ standard, and transforming code according to it.
The compiler is always responsible.
One of the most important rules in the C++ standard is the "as if" rule:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
Which in relation to your question means it doesn't matter what component does the division, as long as it gets done. It may be performed by a DIV machine code, it may be performed by more complicated code if there isn't an appropriate instruction for the processor in question.
It can also:
Replace the operation with a bit-shift operation if appropriate and likely to be faster.
Replace the operation with a literal if it is computable at compile time, or with a plain assignment if, e.g., when processing x / y, it can be shown at compile time that y will always be 1.
Replace the operation with an exception throw if it can be shown at compile time that it will always be an integer division by zero.
Practically
The C99 standard defines: "When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded", and adds in a footnote that "this is often called 'truncation toward zero'."
History
Historically, the language specification is responsible.
Pascal defines its operators so that using / for division always returns a real (even if you use it to divide 2 integers), and if you want to divide integers and get an integer result, you use the div operator instead. (Visual Basic has a similar distinction and uses the \ operator for integer division that returns an integer result.)
In C, it was decided that the same distinction should be made by casting one of the integer operands to a float if you wanted a floating point result. It's become convention to treat integer versus floating point types the way you describe in many C-derived languages. I suspect this convention may have originated in Fortran.