Compiler run-time error reporting with location of error - c++

I'm writing a compiler in C++ (Ubuntu 12.04 with gcc). So far, cumulative error/warning reporting with a fairly precise line and column number for each error/warning location works fine.
My project goals include simply learning how to do this, so I'm adding a preprocessing stage (as a first step doing only minimal things like string concatenation, comment removal, etc.) that writes its result to a tmp file. This is not strictly necessary at this point, as I could concatenate strings in my lexer while parsing, and the lexer already handles comments fine, but I'd like to understand how to handle it as efficiently and elegantly as I can.
Compile time errors are not hard:   
(1) do error check (-> report compile-time errors)
(2) if no errors, preprocess -> tmp file
(3) run parser, etc., on tmp file (which is compile-time error free)
However, I also report run-time errors with line numbers (e.g., for out-of-bounds checks on arrays whose bounds are integer expressions). The error checks are added to the byte code of my IR only while parsing the tmp file, and this file can differ significantly from the source file (in particular once we start allowing header files to be pasted in, say), so how on earth can you reasonably report a helpful error location? Is there a standard trick that gcc, say, uses to handle this? The kind of bounds check mentioned of course doesn't happen for C; but run-time error reporting applies to, say, dynamic resolution of pointers in a hierarchy in C++, and gcc gets the line numbers just fine.

You can record line number information in the temporary file produced by your preprocessor, in the way cpp does with its line control directives (#line). From the cpp documentation:
The C preprocessor informs the C compiler of the location in your source code where each token came from. Presently, this is just the file name and line number.
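A minimal sketch of that idea, assuming a toy preprocessor that only drops lines starting with "//". The function names preprocess and originalLines are hypothetical, and the marker format is a simplified "#line N" without a file name. The first function writes a marker whenever the numbering of the tmp text would drift from the source, and the second shows how the later stage can recover the original line number for every line it reads.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical toy preprocessor: drops lines that start with "//" and
// emits a "#line N" marker whenever the next kept line is not the
// direct successor of the previously kept one.
std::string preprocess(const std::string& source) {
    std::istringstream in(source);
    std::ostringstream out;
    std::string line;
    int srcLine = 0;    // line number in the original source
    int expected = 1;   // line number the consumer would assume next
    while (std::getline(in, line)) {
        ++srcLine;
        if (line.rfind("//", 0) == 0) continue;   // comment line: drop it
        if (srcLine != expected) out << "#line " << srcLine << '\n';
        out << line << '\n';
        expected = srcLine + 1;
    }
    return out.str();
}

// Consumer side (lexer/parser): returns the original line number of
// every non-marker line in the tmp text.
std::vector<int> originalLines(const std::string& tmp) {
    std::istringstream in(tmp);
    std::vector<int> result;
    std::string line;
    int current = 1;
    while (std::getline(in, line)) {
        if (line.rfind("#line ", 0) == 0) {
            current = std::stoi(line.substr(6));  // resynchronise the counter
            continue;
        }
        result.push_back(current++);
    }
    return result;
}
```

For the five-line source "a = 1", "// comment", "// more", "b = a[5]", "c = 2", the tmp text contains three code lines and one "#line 4" marker, and originalLines maps them back to lines 1, 4 and 5. A run-time check generated while parsing the tmp file can then store that recovered number instead of the tmp-file line.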

Related

Preserving line numbers when pre-processing Fortran code

I have inherited a 10K-lines Fortran code that uses fpx3 for pre-processing. At this point I should say that I don't have much Fortran experience, and this is my first time dealing with a code that has a pre-processor.
The code works fine, except that when the pre-processing runs, it creates secondary Fortran files (e.g., main.f90 creates t.main.f90) that do not preserve the total number of lines. The reason, of course, is that some code is lost when pre-processing (IF-ELSE clauses, pre-proc. directives, etc.).
This would be fine, except that when I get a bug in the code, the line number I get from the bug refers to the pre-processed code (e.g., t.main.f90) instead of the original code.
This isn't a major problem, but for nearly every bug I have to check the line (say, line 80 of t.main.f90) and manually try to find this line in the original (let's say it ends up being line 92 of main.f90). I've tried to find a way around this by telling fpx3 to comment the unused lines instead of throwing them out, but I couldn't find much info on fpx3 online.
What's the best way around this?
P.S.: I don't know if it matters, but I'm using ifort to compile.
Try -fixed: a .f90 file is assumed to be free form (-free), whereas fixed form preserves the first few columns and the continuation marker in column 6.
If the numbers are on the right-hand side, then use -72 or -80 rather than -132, so that the numbers are not treated as compilable code. I use -132 myself, so you will have to look up the right switch.

Parentheses around multiple values in implied DO loop - "error: expected a right parenthesis in expression"

When migrating from GNU FORTRAN Compiler 4.3.2 to 4.8.5 a user got an error
p.for:227.25:
write(3,446) ((r(i),ERC(i),EIC(i),ERp(i),EIp(i)),i=1,I1)
1
Error: Expected a right parenthesis in expression at (1)
Here the output list is built using the implied DO-loop construct. The error persists even when compiling with --std=legacy.
The problem is easy to fix. When I remove the parentheses around the list of array elements the code compiles and seems to work as expected.
write(3,446) (r(i),ERC(i),EIC(i),ERp(i),EIp(i),i=1,I1)
I would be content with this fix, however our users often store multiple copies of code on different systems and exchange code between them seemingly at random. On the other hand, the makefiles and build scripts are usually specific to our cluster.
It would be nice to find that there is a command line option that will let such code pass syntax checking. I could not find anything like -ffixed-parenthesized-values-in-implied-loop in the list of available GNU Fortran Compiler Language Options.
Q: Is it possible to tweak a build script to let such statement compile without modifying the source code?

How do c/c++ compilers know which line an error is on

There is probably a very obvious answer to this, but I was wondering how the compiler knows which line of code my error is on. In some cases it even knows the column.
The only way I can think to do this is to tokenize the input string into a 2D array. This would store [lines][tokens].
C/C++ could be tokenized into 1 long 1D array which would probably be more efficient. I am wondering what the usual parsing method would be that would keep line information.
Actually, most of it is covered in the Dragon Book.
Compilers do lexing and parsing, i.e. they transform the source code into a tree representation.
While doing so, each keyword, variable, etc. is associated with a line and column number.
However, during parsing the exact origin of the failure might get lost, and the reported location might be off.
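As a rough sketch of what "associated with a line and column number" can look like, here is a tiny hand-written scanner. The Token struct and the tokenize function are made up for illustration and are not taken from any real compiler. It keeps all tokens in one flat array, and each token simply carries the 1-based position where it started, so no 2D [lines][tokens] structure is needed.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: the text plus the 1-based position where it starts.
struct Token {
    std::string text;
    int line;
    int column;
};

// Splits the input into identifier/number words and one-character symbols,
// counting lines and columns as characters are consumed.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    int line = 1, column = 1;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (c == '\n') { ++line; column = 1; ++i; continue; }
        if (std::isspace(c)) { ++column; ++i; continue; }
        const std::size_t start = i;
        const int startColumn = column;
        if (std::isalnum(c) || c == '_') {
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) ||
                    input[i] == '_')) {
                ++i;
                ++column;
            }
        } else {
            ++i;        // any other character is a one-character token
            ++column;
        }
        tokens.push_back({input.substr(start, i - start), line, startColumn});
    }
    return tokens;
}
```

For the two-line input "int x;" followed by "  y = 42;", this yields seven tokens; "y" is recorded at line 2, column 3 and "42" at line 2, column 7. A parser that fails on a token can report that token's stored position directly.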
This is the first step in the long, complicated path towards "Engineering a Compiler" or Compilers Theory
The short answer to that is: there's a module called "front-end" that usually takes care of many phases:
Scanning
Parsing
IR generator
IR optimizer ...
The structure isn't fixed so each compiler will have its own set of modules but more or less the steps involved in the front-end processing are
Scanning - maps character streams into words (also ignores whitespaces/comments) or tokens
Parsing - this is where syntax and (some) semantic analysis take place and where syntax errors are reported
To sum this up: the compiler knows the location of your error because when something doesn't fit into a structure called an "abstract syntax tree" (i.e. the tree cannot be constructed) or doesn't follow any of the syntax-directed translation rules, well, there's something wrong, and the compiler reports the location where the mismatch occurred. If there's a grammar error on just one word/token, then even a precise column location can be returned, since nothing matched a terminal: a basic token like the if keyword in the C/C++ language.
If you want to know more about this topic, my suggestion is to start with the classic academic approach of the "Compiler Book" or "Dragon Book" and then, later on, possibly study an open-source front-end like Clang.

How does GCC know what line an error is on when the compiler takes all whitespace and comments out of the code?

I'm sure this applies to other compilers as well, but I've only used GCC. If the compiler optimizes the code by removing everything extraneous that isn't code (comments, whitespace, etc.), how does it correctly show what line an error is on in the original file? Does it only optimize the code after checking for errors? Or does it somehow insert tags so that if an error is found it knows what line it's on?
mycode.cpp: In function ‘foo(int bar)’:
mycode.cpp:59: error: no matching function for call to ‘bla(int bar)’
The compiler converts source code to an object format, or more
correctly, here, an intermediate format which will later be used
to generate object format. I've not looked into the internals
of g++, but typically, a compiler will tokenize the input and
build a tree structure. When doing so, it will annotate the
nodes of the tree with the position in the file where it read
the token which the node represents. Many errors are detected
during this very parsing, but for those that aren't, the
compiler will use the information on the annotated node in the
error message.
With regard to "removing everything extraneous that isn't
code", that's true in the sense that the compiler tokenizes the
input, and converts it into the tree. But when doing so, it is
reading the files; at every point, it is either reading the
file, or accessing a node which was annotated while the file was
being read.
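To illustrate the annotated-node idea, here is a small sketch. The names SourceLoc, Node and check are hypothetical and say nothing about how g++ is actually organised. Each node remembers where the parser read it, and a later pass that no longer touches the source text can still produce a message with the right file and line.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical source position attached to every tree node.
struct SourceLoc {
    std::string file;
    int line;
};

// Hypothetical tree node; loc is filled in by the parser at creation time.
// (A vector of the enclosing type as a member relies on C++17 rules.)
struct Node {
    std::string kind;            // e.g. "block" or "call"
    std::string name;            // callee name for "call" nodes
    SourceLoc loc;
    std::vector<Node> children;
};

// A later semantic pass: it only walks the tree, yet every error it
// reports carries the position stored on the offending node.
void check(const Node& node, const std::set<std::string>& knownFunctions,
           std::vector<std::string>& errors) {
    if (node.kind == "call" && knownFunctions.count(node.name) == 0) {
        errors.push_back(node.loc.file + ":" + std::to_string(node.loc.line) +
                         ": error: no matching function for call to '" +
                         node.name + "'");
    }
    for (const Node& child : node.children) {
        check(child, knownFunctions, errors);
    }
}
```

With a "call" node for bla recorded at mycode.cpp line 59 and only foo known, the pass reports "mycode.cpp:59: error: no matching function for call to 'bla'", even though whitespace and comments were discarded long before.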
The preprocessor (conceptually) adds #line directives, to tell the compiler which source file and line number correspond to each line of preprocessed source. They look like
// set the current line number to 100, in the current source file
#line 100
// set the current line number to 1, in a header file
#line 1 "header.h"
(Of course, a modern preprocessor usually isn't a separate program, and usually doesn't generate an intermediate text representation, so these are actually some kind of metadata passed to the compiler along with the stream of preprocessed tokens; but it may be simpler, and not significantly incorrect, to think in terms of preprocessed source).
You can add these yourself if you want. Possible uses are testing macros that use the __FILE__ and __LINE__ definitions, and laying traps for maintenance programmers.
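A small sketch of the effect on __FILE__ and __LINE__. The Location struct, the HERE macro and the functions first and second are made up for this example; only #line, __FILE__ and __LINE__ are standard.

```cpp
#include <string>

// Hypothetical helper type holding what __FILE__ and __LINE__ expanded to.
struct Location {
    std::string file;
    int line;
};

// A macro, so that __FILE__ and __LINE__ expand at the point of use.
#define HERE() (Location{__FILE__, __LINE__})

// From here on the compiler believes it is reading "header.h",
// and the line directly after the directive counts as line 100.
#line 100 "header.h"
Location first() { return HERE(); }    // reported as header.h, line 100

Location second() { return HERE(); }   // reported as header.h, line 102
```

Diagnostics for code after the directive are attributed to header.h in the same way, which is exactly what makes the directive useful for preprocessed or generated code.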

PGI Fortran integer format

I have an input text file that contains an integer record like:
1
which is read in Fortran code as:
read(iunit,'(i4)') int_var
which works fine with Gfortran, but the same code compiled with PGI Fortran Compiler expects a field 4 characters wide (the actual record is just 1 character) and throws an error. Now I know that the format specifies the width and this may or may not be correct behavior according to the Fortran standard, but my question is - is there a compiler option for PGI that would make it behave like Gfortran in this respect?
This 3rd party code I'm using has a lot (hundreds or thousands) of read statements like this and input data has a lot of records with "wrong" width so both modifying the code or the input data would require significant effort.
I don't think this is connected to blank. This read should not cause an error unless you opened the file iunit with pad="no". The default is always pad="yes", which causes the input record to be padded with blanks if it is too short.
Are you sure that you are using correct input files, with correct line endings? Text files that originate on Windows can cause problems on Unix, where the CR ends up being read as part of the input record. In that case the dos2unix utility might help. You may try to read a character(4) string using the a4 edit descriptor to test for this.
Does PGI Fortran support the open keyword blank="null"? I think that this will change the read to the behavior that you want and minimize the modifications to the code. blank="null" versus blank="zero" doesn't seem to make a difference in gfortran 4.7.