Clarifying a fortran implied loop - fortran

I used FORTRAN a little, many years ago, and have recently been tasked to maintain an old FORTRAN program (F77). The following code was unfamiliar:
      READ(FILE_LOG_UNIT, IOSTAT=FILE_STATUS) NUM_WORDS,
     .     ( BUFFER(BIX), BIX=1, NUM_WORDS )
Reviewing some on-line forums revealed that the part that was confusing me, the continuation line, is an implied loop. Since my program is giving me trouble right here, I want to convert this to a conventional DO-loop. Converting it might also help the next poor slob that picks this thing up cold 5 years from now! Anyway, my best guess at the DO-loop equivalent is
      READ(FILE_LOG_UNIT, IOSTAT=FILE_STATUS) NUM_WORDS
      DO BIX=1, NUM_WORDS
        READ(FILE_LOG_UNIT, IOSTAT=FILE_STATUS) BUFFER(BIX)
      ENDDO
But when I made only this change, test cases which were working stopped working. I still felt that what was going on here was two different READs (the first to get NUM_WORDS, and the second to loop through the data), so I tried a less drastic change, converting it to two statements but retaining the implied loop:
      READ(FILE_LOG_UNIT, IOSTAT=FILE_STATUS) NUM_WORDS
      READ(FILE_LOG_UNIT, IOSTAT=FILE_STATUS) ( BUFFER(BIX), BIX=1, NUM_WORDS )
But just this change also causes the good test cases to fail. In both of my alterations, the value of NUM_WORDS was coming through as expected, so it seems that the loop is where it is failing.
So, what is the equivalent DO-loop for the original implied loop? Or am I on the wrong track altogether?
Thanks

How is the file opened? I.e. is it ACCESS='sequential', access='direct', or access='stream' (well, the last is unlikely as it's a F2003 addition). Secondly, is it formatted or unformatted? I'm going to assume it's unformatted sequential since there is no REC= specifier nor a format specifier in your read statements.
The reason what you're trying fails is that each READ statement reads a separate record. Prior to the introduction of access='stream' in F2003, Fortran I/O was completely record based, which is slightly alien to those of us used to the stream-type files seen in most other languages.
Records for unformatted sequential files are typically separated by "record markers" at each end of the record, typically 4 bytes specifying the length of the record. So the record (on disk) likely looks something like
| length (4 bytes) | num_words (4 bytes?) | buffer(1) | buffer(2) | ... | length |
Now if you try to read, say, num_words with one READ statement, it will correctly read num_words from the file, THEN it will skip forward until the start of the next record. And when you then try to read buffer with a separate READ statement you're already too far ahead in the file.
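To make the record layout concrete, here is a small C++ sketch (C++ rather than Fortran only so that it can model the raw bytes directly) of how a record-based reader walks such a file. It assumes 4-byte length markers in native byte order and 4-byte integers, which is common but compiler-dependent. append_record and read_record are made-up helper names for this illustration, not part of any Fortran runtime.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Append one record: leading length marker, payload, trailing length marker.
// ASSUMPTION: 4-byte markers, native byte order (compiler-dependent in practice).
void append_record(Bytes& file, const std::vector<std::int32_t>& words) {
    const std::uint32_t len =
        static_cast<std::uint32_t>(words.size() * sizeof(std::int32_t));
    const unsigned char* marker = reinterpret_cast<const unsigned char*>(&len);
    const unsigned char* payload =
        reinterpret_cast<const unsigned char*>(words.data());
    file.insert(file.end(), marker, marker + sizeof len);
    file.insert(file.end(), payload, payload + len);
    file.insert(file.end(), marker, marker + sizeof len);
}

// Read one whole record starting at pos and leave pos at the start of the
// next record. This mirrors what one unformatted sequential READ does: the
// file position always moves past the entire record, even if the READ list
// uses only part of the payload.
std::vector<std::int32_t> read_record(const Bytes& file, std::size_t& pos) {
    std::uint32_t len = 0;
    if (pos + sizeof len > file.size()) throw std::runtime_error("end of file");
    std::memcpy(&len, file.data() + pos, sizeof len);
    if (pos + 2 * sizeof len + len > file.size())
        throw std::runtime_error("truncated record");
    std::vector<std::int32_t> words(len / sizeof(std::int32_t));
    if (!words.empty())
        std::memcpy(words.data(), file.data() + pos + sizeof len,
                    words.size() * sizeof(std::int32_t));
    pos += 2 * sizeof len + len;  // skip payload and both markers
    return words;
}
```

If the first record holds {3, 10, 20, 30} (NUM_WORDS followed by three buffer words), one read_record call returns all four values and leaves the position at the second record. Splitting the READ in two corresponds to calling read_record twice, so the second call sees the next record rather than the rest of BUFFER.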
If you cheat a bit and use F90+ array syntax, you might get away with
      READ(FILE_LOG_UNIT, IOSTAT=FILE_STATUS) NUM_WORDS, BUFFER(1:NUM_WORDS)
(though I'm not sure if you're allowed to reference num_words in the same statement where it's being written to)

Related

what is the meaning of '*' in the hello world program of Fortran? [duplicate]

This question has been covered somewhat in previous SO questions. However, previous discussions seem somewhat incomplete.
Fortran has several I/O statements. There is READ(*,*) and WRITE(*,*), etc. The first asterisk (*) is the standard asterisk designating an input or output from the keyboard to/from the screen. My question is about the second asterisk:
The second asterisk designates the format of the I/O elements, the data TYPE which is being used. If this asterisk is left unchanged, the Fortran compiler uses the default format (whatever that may be, based on the compiler). Users must use a number of format descriptors to designate the data type, precision, and so forth.
(1) Are these format descriptors universal for all Fortran compilers and for all versions of Fortran?
(2) Where can I find the standard list of these format descriptors? For example, F8.3 means that the number should be printed using fixed point notation with field width 8 and 3 decimal places.
EDIT: A reference for edit descriptors can be found here: http://fortranwiki.org/fortran/show/Edit+descriptors
First, as a clarification, the 1st asterisk in the READ/WRITE statement has a slightly different meaning than you state. For write, it means write to the default file unit (in linux world generally standard out), for read it means read from the default file unit (in linux world generally standard in), either of which may not necessarily be connected to a terminal screen or a keyboard.
The 2nd asterisk means use list-directed I/O. For read statements this is generally useful because you don't need a specified format for your input. It breaks up the line into fields separated by spaces or commas (maybe a couple of others that aren't commonly used), and reads each field in turn into the variable associated with that field in the argument list, ignoring unread fields, and continuing onto the next line if not enough fields were read in (unless a terminating slash / is explicitly included).
For writes, it means the compiler is allowed to determine what format to write the variables out (I believe with no separator). I believe it is allowed to do this at run time, so that you are all but guaranteed that the value it is trying to write will fit into the format specifier used, so you can be assured that you won't get ******* written out. The down side is you have to manually include a separator character in your argument list, or all your numbers will run together.
In general, using list directed read is more of a convenience to the user, so they don't have to fit their inputs into rigidly defined fields, and list directed writes are a convenience to the programmer, in case they're not sure what the output will look like.
When you have a data transfer statement like read(*,*) ... it's helpful to understand exactly what this means. read(*,*) is equivalent to the more verbose read(unit=*, fmt=*). This second asterisk, as you have it, makes this read statement (or corresponding write statement) list-directed.
List-directed input/output, as described elsewhere, is a convenience for the programmer. The Fortran standards specify lots of constraints that the compiler must follow, but their wording includes phrases like "reasonable values", which allows the output to vary by compiler, settings, and so on.
Again, as described elsewhere, fine user control over the output (or input) comes with giving a format specification. Instead of read(*,fmt=*), something like read(*,fmt=1014) or read(*,fmt=format_variable_or_literal). I take it your question is: what is this format specification?
I won't go into details of all of the possible edit descriptors, but I will say in response to (2): you can find the list of those edit descriptors in the Fortran standard (Clause 10 of Fortran 2008 goes into all the detail) or a good reference book.
To answer (1): no, edit descriptors are not universal. Even across Fortran standards. Of note are:
The introduction of I0 (and other minimal-width specifiers) for output in Fortran 95;
The removal of the H edit descriptor in Fortran 95;
The introduction of the DT edit descriptor in Fortran 2003.

C++ how to check if the std::cin buffer is empty

The title is misleading because I'm more interested in finding an alternate solution. My gut feeling is that checking whether the buffer is empty is not the most ideal solution (at least in my case).
I'm new to C++ and have been following Bjarne Stroustrup's Programming Principles and Practices using C++. I'm currently on Chapter 7, where we are "refining" the calculator from Chapter 6. (I'll put the links for the source code at the end of the question.)
Basically, the calculator can take multiple inputs from the user, delimited by semi-colons.
> 5+2; 10*2; 5-1;
= 7
> = 20
> = 4
>
But I'd like to get rid of the prompt character ('>') for the last two answers, and display it again only when the user input is asked for. My first instinct was to find a way to check if the buffer is empty, and if so, cout the character and if not, proceed with couting the answer. But after a bit of googling I realized the task is not as easy as I initially thought... And also that maybe that wasn't a good idea to begin with.
I guess essentially my question is how to get rid of the '>' characters for the last two answers when there are multiple inputs. But if checking the cin buffer is possible and is not a bad idea after all, I'd love to know how to do it.
Source code: https://gist.github.com/Spicy-Pumpkin/4187856492ccca1a24eaa741d7417675
Header file: http://www.stroustrup.com/Programming/PPP2code/std_lib_facilities.h
^ You need this header file. I assume it is written by the author himself.
Edit: I did look around the web for some solutions, but to be honest none of them made any sense to me. It's been like 4 days since I picked up C++ and I have a very thin background in programming, so sometimes even googling is a little tough..
As you've discovered, this is a deceptively complicated task. This is because there are multiple issues here at play, both the C++ library, and the actual underlying file.
C++ library
std::cin, and C++ input streams, use an intermediate buffer, a std::streambuf. Input from the underlying file, or an interactive terminal, is not read character by character, but rather in moderately sized chunks, where possible. Let's say:
int n;
std::cin >> n;
Let's say that when this is done and over with, n contains the number 42. Well, what actually happened is that std::cin, more than likely, did not read just two characters, '4' and '2', but whatever additional characters, beyond that, were available on the std::cin stream. The remaining characters were stored in the std::streambuf, and the next input operation will read them, before actually reading the underlying file.
And it is equally likely that the above >> did not actually read anything from the file, but rather fetched the '4' and the '2' characters from the std::streambuf, that were left there after the previous input operation.
It is possible to examine the underlying std::streambuf, and determine whether there's anything unread there. But this doesn't really help you.
If you are about to execute the above >> operator, look at the underlying std::streambuf, and discover that it contains a single character '4', that also doesn't tell you much. You need to know what the next character is in std::cin. It could be a space or a newline, in which case all you'll get from the >> operator is 4. Or, the next character could be '2', in which case >> will swallow at least '42', and possibly more digits.
You can certainly implement all this logic yourself, look at the underlying std::streambuf, and determine whether it will satisfy your upcoming input operation. Congratulations: you've just reinvented the >> operator. You might as well just parse the input, a character at a time, yourself.
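For reference, "examining the underlying std::streambuf" can be as simple as the sketch below. buffered_chars is a made-up helper name. It wraps std::streambuf::in_avail(), which reports how many characters are known to be available without touching the underlying source. For std::cin the result is implementation-dependent (it may well be 0 even when input is pending), which is part of why this does not help much in practice.

```cpp
#include <ios>
#include <istream>
#include <sstream>

// Number of characters the stream's buffer can hand out right now without
// going back to the underlying source. An estimate only: the standard lets
// implementations report 0 (or -1) even when more input could be read.
std::streamsize buffered_chars(std::istream& in) {
    return in.rdbuf()->in_avail();
}
```

With a std::istringstream holding "42 7", this reports 4 before any extraction and 2 after `in >> n` has consumed "42". Note that the 2 remaining characters still do not say whether the next >> will succeed, which is the point made above.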
The underlying file
You determined that std::cin does not have sufficient input to satisfy your next input operation. Now, you need to know whether or not input is available on std::cin.
This now becomes an operating system-specific subject matter. This is no longer covered by the standard C++ library.
Conclusion
This is doable, but in all practical situations, the best solution here is to use an operating system-specific approach, instead of C++ input streams, and read and buffer your input yourself. On Linux, for example, the classical approach is to set fd 0 to non-blocking mode, so that read() does not block, and to determine whether or not there's available input, just try to read() it. If you did read something, put it into a buffer that you can look at later. Once you've consumed all previously-read buffered input, and you truly need to wait for more input to be read, poll() the file descriptor, until it's there.
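A minimal sketch of the poll() part of that approach, assuming a POSIX system. input_ready is a hypothetical helper name. It only answers "would read() on this descriptor return immediately?" and knows nothing about characters already sitting in std::cin's own buffer, so it is only meaningful if you also do the reading with read() yourself.

```cpp
#include <poll.h>
#include <unistd.h>

// True if read() on fd would return immediately (data available, or EOF /
// hang-up). timeout_ms = 0 means "check and return at once"; a negative
// value waits indefinitely.
bool input_ready(int fd, int timeout_ms = 0) {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    const int rc = poll(&p, 1, timeout_ms);
    return rc > 0 && (p.revents & (POLLIN | POLLHUP)) != 0;
}
```

In the calculator, something like `if (!input_ready(STDIN_FILENO)) std::cout << "> ";` would print the prompt only when nothing is waiting on fd 0, again assuming the program does not mix this with buffered std::cin reads.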

Old fortran: Hollerith edit descriptor syntax for Format statement

I'm attempting to modernize an old code (or at least make it a bit more understandable) but I've run into an odd format for a, uh, FORMAT statement.
Specifically, it's a FORMAT statement with Hollerith constants in it (the nH where n is a number):
      FORMAT(15H  ((C(I,J),J=1,I3,12H),(D(J),J=1,I3, 6H),I=1,I3,') te'
     1,'xt' )
This messes with the syntax highlighting as it appears this has unclosed parenthesis. It compiles fine with this format statement as is, but closing the parenthesis causes a compiling error (using either the intel or gfortran compiler).
As I understand it, Hollerith constants were a creature of Fortran 66 and were replaced with the advent of the CHARACTER type in Fortran 77. I generally understand them when used as something like a character, but their use in a FORMAT confuses me.
Further, if I change 15H  ((... to 15H ((... (i.e. I remove one space) it won't compile. In fact, it won't compile even if I change the code to this:
      FORMAT(15H  ((C(I,J),J=1,I3,12H),(D(J),J=1,I3, 6H),I=1,I3,') text' )
I would like this to instead be in a more normal (F77+) format. Any help is appreciated.
What you have are actually Hollerith edit descriptors, not constants (which would occur in a DATA or CALL statement), although they use the same syntax. F77 replaced Hollerith constants outright; it added char-literal edit descriptor as a (much!) better alternative, but H edit descriptor remained in the standard until F95 (and even then some compilers still accepted it as a compatibility feature).
In any case, the number before the H takes that number of characters after the H, without any other delimiter; that's why deleting (or adding) a character after the H screws it up. Parsing your format breaks it into these pieces
15H  ((C(I,J),J=1,
I3,
12H),(D(J),J=1,
I3,
6H),I=1,
I3,
') te'
'xt'
and thus a modern equivalent (with optional spaces for clarity) is
nn    FORMAT( '  ((C(I,J),J=1,', I3, '),(D(J),J=1,', I3, '),I=1,', I3
     1,') text' )
or if you prefer you can put that text after continuation (including the parens) in a CHARACTER value, variable or parameter, used in the I/O statement instead of a FORMAT label, but since you must double all the quote characters to get them in a CHARACTER value that's less convenient.
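The counting rule for nH can be sketched as a tiny tokenizer. This is written in C++ purely as an illustration of the parsing rule, not as anything a Fortran compiler exposes; split_format and quote are made-up names. It takes the text between the outer FORMAT parentheses, copies exactly n characters after each nH regardless of what they are, and rewrites them as quoted literals. It assumes the leading blanks of the first Hollerith item are intact (15 characters after 15H), and it makes no attempt to validate the other edit descriptors.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Wrap text in apostrophes, doubling any apostrophe inside it.
std::string quote(const std::string& text) {
    std::string out = "'";
    for (char ch : text) {
        out += ch;
        if (ch == '\'') out += '\'';
    }
    out += '\'';
    return out;
}

// Split a FORMAT body into top-level items, turning nH... into '...'.
std::vector<std::string> split_format(const std::string& body) {
    auto is_digit = [](char ch) {
        return std::isdigit(static_cast<unsigned char>(ch)) != 0;
    };
    std::vector<std::string> items;
    std::string cur;
    int depth = 0;  // parenthesis nesting outside of character data
    std::size_t i = 0;
    while (i < body.size()) {
        const char c = body[i];
        if (cur.empty() && is_digit(c)) {
            std::size_t j = i, n = 0;
            while (j < body.size() && is_digit(body[j]))
                n = n * 10 + static_cast<std::size_t>(body[j++] - '0');
            if (j < body.size() && (body[j] == 'H' || body[j] == 'h')) {
                // Hollerith: the next n characters are data, whatever they are.
                items.push_back(quote(body.substr(j + 1, n)));
                i = j + 1 + n;
                while (i < body.size() && body[i] == ' ') ++i;
                if (i < body.size() && body[i] == ',') ++i;  // optional comma
            } else {
                cur.append(body, i, j - i);  // a repeat count, e.g. 2 in 2I3
                i = j;
            }
        } else if (c == '\'') {
            // Quoted literal: copy through the closing apostrophe.
            const std::size_t start = i++;
            while (i < body.size()) {
                if (body[i] == '\'') {
                    if (i + 1 < body.size() && body[i + 1] == '\'') { i += 2; continue; }
                    break;
                }
                ++i;
            }
            if (i < body.size()) ++i;
            cur.append(body, start, i - start);
        } else if (c == ',' && depth == 0) {
            items.push_back(cur);
            cur.clear();
            ++i;
        } else {
            if (c == '(') ++depth;
            if (c == ')') --depth;
            if (c != ' ') cur += c;  // blanks are not significant here
            ++i;
        }
    }
    if (!cur.empty()) items.push_back(cur);
    return items;
}
```

Fed the one-line format from the question, this yields seven items: three quoted literals from the Hollerith descriptors, three I3 descriptors, and the final ') text' literal. The parentheses that confuse the syntax highlighter all end up inside character data, which is why they do not need to balance.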
Your all-on-one-line version probably didn't compile because you were using fixed-form, perhaps by default, and only the first 72 characters of each source line are accepted in fixed-form, of which the first 6 are reserved for statement number and continuation indicator, leaving only 66 and that statement is 71 by my count. Practically any compiler you will find today also accepts free-form, which allows longer lines and has other advantages too for new code, but may require changes in existing code, sometimes extensive changes.

How does the compiler optimize getline() so effectively?

I know a lot of a compiler's optimizations can be rather esoteric, but my example is so straightforward I'd like to see if I can understand, if anyone has an idea what it could be doing.
I have a 500 MB text file. I declare and initialize an fstream:
std::fstream file(path, std::ios::in);
I need to read the file sequentially. It's tab delimited but the field lengths are not known and vary line to line. The actual parsing I need to do to each line added very little time to the total (which really surprised me since I was doing string::find on each line from getline. I thought that'd be slow).
In general I want to search each line for a string, and abort the loop when I find it. I also have it incrementing and spitting out the line numbers for my own curiosity, I confirmed this adds little time (5 seconds or so) and lets me see how it blows past short lines and slows down on long lines.
I have the text to be found as the unique string tagging the eof, so it needs to search every line. I'm doing this on my phone so I apologize for formatting issues but it's pretty straightforward. I have a function taking my fstream as a reference and the text to be found as a string and returning a std::size_t.
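std::string line;
long long int lineNum = 0;
while (std::getline(file, line))
{
    // find() returns std::string::npos when the text is not in this line
    std::size_t pos = line.find(text);
    lineNum += 1;
    std::cout << std::to_string(lineNum) << std::endl;  // progress output
    if (pos != std::string::npos)
        return static_cast<std::size_t>(file.tellg());  // position just past the matching line
}
return std::string::npos;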
Edit: lingxi pointed out the to_string isn't necessary here, thanks. As mentioned, entirely omitting the line number calculation and output saves a few seconds, which in my preoptimized example is a small percent of the total.
This successfully runs through every line, and returns the end position in 408 seconds. I had minimal improvement trying to put the file in a stringstream, or omitting everything in the whole loop (just getline until the end, no checks, searches, or displaying). Also pre-reserving a huge space for the string didn't help.
Seems like the getline is entirely the driver. However...if I compile with the /O2 flag (MSVC++) I get a comically faster 26 seconds. In addition, there is no apparent slow down on long lines vs short. Clearly the compiler is doing something very different. No complaints from me, but any thoughts as to how it's achieved? As an exercise I'd like to try and get my code to execute faster before compiler optimization.
I bet it has something to do with the way getline manipulates the string. Would it be faster (alas, I can't test for a while) to just reserve the whole file size for the string, and read character by character, incrementing my line number when I pass a \n? Also, would the compiler employ things like mmap?
UPDATE: I'll post code when I get home this evening. It looks like just turning off runtime checks dropped the execution from 400 seconds to 50! I tried performing the same functionality using raw c style arrays. I'm not super experienced, but it was easy enough to dump the data into a character array, and loop through it looking for newlines or the first letter of my target string.
Even in full debug mode it gets to the end and correctly finds the string in 54 seconds. 26 seconds with checks off, and 20 seconds optimized. So from my informal, ad-hoc experiments it looks like the string and stream functions are victimized by the runtime checks? Again, I'll double check when I get home.
The reason for this dramatic speedup is that the iostream class hierarchy is based on templates (std::ostream is actually a typedef of a template called std::basic_ostream), and a lot of its code is in headers. C++ iostreams take several function calls to process every byte in the stream. However, most of those functions are fairly trivial. By turning on optimization, most of these calls are inlined, exposing to the compiler the fact that std::getline essentially copies characters from one buffer to another until it finds a newline - normally this is "hidden" under several layers of function calls. This can be further optimized, reducing the overhead per byte by orders of magnitude.
Buffering behavior actually doesn't change between the optimized and non-optimized version, otherwise the speedup would be even higher.
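For comparison, the "copy characters until a newline" loop that the optimizer effectively recovers can be written by hand, along the lines of the raw-array experiment in the question's update. The sketch below assumes the whole file has already been read into one in-memory string; find_line is a hypothetical name, and it needs C++17 for std::string_view. It is an illustration of the technique, not a measurement of how much faster it would be.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Return the 1-based number of the first line containing `text`,
// or 0 if no line contains it. Lines are delimited by '\n'.
std::size_t find_line(const std::string& data, const std::string& text) {
    std::size_t line_no = 0;
    std::size_t start = 0;
    while (start < data.size()) {
        // Locate the end of the current line without copying it anywhere.
        const void* nl =
            std::memchr(data.data() + start, '\n', data.size() - start);
        const std::size_t end =
            nl ? static_cast<std::size_t>(static_cast<const char*>(nl) - data.data())
               : data.size();
        ++line_no;
        // Search only inside [start, end) so each byte is scanned once per pass.
        const std::string_view line(data.data() + start, end - start);
        if (line.find(text) != std::string_view::npos) return line_no;
        start = end + 1;  // step over the newline
    }
    return 0;
}
```

Because no per-line std::string is built and no stream layers are involved, there are far fewer small function calls for a debug build's runtime checks to slow down.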