According to various sources (for example, the SE radio episode with Kevlin Henney, if I remember correctly), "C with classes" was implemented with preprocessor technology (with the output then being fed to a C compiler), whereas C++ has always been implemented with a compiler (that just happened to spit out C in the early days). This seems to cause some confusion, so I was wondering:
Where exactly is the boundary between a preprocessor and a compiler? When do you call a piece of software that implements a language "a preprocessor", and when do you call it "a compiler"?
By the way, is "a compiled language" an established term? If so, what exactly does it mean?
This is an interesting question. I don't know a definitive answer, but would say this, if pressed for one:
A preprocessor doesn't parse the code, but instead scans it for embedded patterns and expands them.
A compiler actually parses the code, builds an AST (abstract syntax tree), and then transforms that into a different language.
The language of the output of a preprocessor is a subset of the language of the input.
The language of the output of a compiler is (usually) very different (machine code) from the language of the input.
From a simplified, personal, point of view:
I consider a preprocessor to be any form of textual manipulation that has no concept of the underlying language (i.e., its semantics or constructs), and thus relies only on its own set of rules to perform its duties.
The compiler starts when rules and regulations are applied to what is being processed (yes, that makes 'my' preprocessor a compiler, but why not :P); this includes semantic and lexical checking, and the transformation from x (textual) to y (binary/intermediate form). As one of my professors would say: "it's a system with inputs, processes and outputs".
The C/C++ compiler cares about type-correctness while the preprocessor simply expands symbols.
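A minimal sketch of that difference (the macro and names are made up for illustration): the preprocessor expands SQUARE no matter what it is applied to; only the compiler checks whether the expanded text type-checks.

#include <string>

// The preprocessor rewrites text blindly; it has no notion of types.
#define SQUARE(x) ((x) * (x))

int main() {
    int n = SQUARE(4);  // expands to ((4) * (4)); the compiler accepts it
    // std::string s = SQUARE(std::string("a"));
    // The line above would expand just as happily, but the compiler
    // would then reject the expanded text as a type error.
    return n != 16;  // 0 (success) if the expansion computed 16
}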
A compiler consists of several processes (components). The preprocessor is only one of these, and relatively the simplest one.
From the Wikipedia article, Division of compiler processes:
All but the smallest of compilers have more than two phases. However, these phases are usually regarded as being part of the front end or the back end. The point at which these two ends meet is open to debate.
The front end is generally considered to be where syntactic and semantic processing takes place, along with translation to a lower level of representation (than source code).
The middle end is usually designed to perform optimizations on a form other than the source code or machine code. This source code/machine code independence is intended to enable generic optimizations to be shared between versions of the compiler supporting different languages and target processors.
The back end takes the output from the middle end. It may perform more analysis, transformations and optimizations that are for a particular computer. Then, it generates code for a particular processor and OS.
Preprocessing is only a small part of the front end's job.
The first C++ compiler was made by attaching an additional process in front of an existing C compiler toolset, not because that was a good design but because of limited time and resources.
Nowadays, I don't think such a non-native C++ compiler could survive in the commercial field.
I dare say a cfront for C++11 would be impossible to make.
The answer is pretty simple.
A preprocessor works on text as input and has text as output. Examples are the old Unix commands m4 and cpp (the C preprocessor), and also Unix programs like roff, nroff and troff, which were used (and still are) to format man pages (the Unix command "man") or to format text for printing or typesetting.
Preprocessors are very simple; they don't know anything about the "language of the text" they process. In other words, they usually process natural languages. The C preprocessor, despite its name, only recognizes #define, #include, #ifdef, #ifndef, #else, etc., and if you use #define MACRO it tries to "expand" that macro everywhere it finds it. But the input does not need to be C or C++ program text; it can just as well be a novel written in Italian or Greek.
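For instance, given a plain text file containing the two lines below (a made-up example), running cpp over it prints, apart from some line markers, "Odysseus sailed for home."; the preprocessor needs no knowledge of C to do this.

#define HERO Odysseus
HERO sailed for home.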
Compilers that compile into a different high-level language are usually called translators. So the old cfront "compiler" for C++, which emitted C code, was a C++ translator.
Preprocessors, and later translators, were historically used because old machines simply lacked the memory to do everything in one program; instead, the work was done by specialized programs, from disk to disk.
A typical C program would be compiled from various sources, and the build process would be managed with make. These days the C preprocessor is usually built directly into the C/C++ compiler. A typical make run would call the CPP on the *.c files and write the output to a different directory; from there, either the C compiler CC would compile straight to machine code or, more commonly, would output assembler code as text. Note: the classic C compiler only checked syntax; it did not really care about type safety etc. The assembler would then take that assembler code and output a *.o file, which later can be linked with other *.o files and *.lib files into an executable program. OTOH you likely had a make rule that would not call the C compiler but the lint command, the C language analyser, which looks for typical mistakes and errors (which are ignored by the C compiler).
It is quite interesting to look up lint, nroff, troff, m4, etc. on Wikipedia (or on your machine's terminal, using man) ;D
My understanding is that one step of the compilation of a program (irrespective of the language, I guess) is parsing the source file into some kind of tokens (this tokenization would be done by what's referred to as a scanner in this answer). For instance, I understand that at some point in the compilation process, a line containing x += fun(nullptr); is separated into something like
x
+=
fun
(
nullptr
)
;
Is this true? If so, is there a way to have access to this tokenization of a C++ source code?
I'm asking this question mostly out of curiosity, and I do not intend to write a lexer myself.
And the reason I'm curious to know whether one can leverage the compiler is that, to give an example, before meeting [[noreturn]] & Co. I would never have considered [[ as a valid token if I were to write a lexer myself.
Do we necessarily need a true, actual use case? I think we don't, if I'm simply curious about whether an existing tool can do something.
However, if we really need a use case,
let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of. Clearly, a requirement is that concatenating the elements of the output should make up the whole text again, including line breaks and every other byte of it.
With the restriction mentioned in the comment (tokenization keeping __DATE__) it seems rather manageable. You need the preprocessing tokens. The Boost::Wave preprocessor necessarily creates a token list, because it has to work on those tokens.
Basile correctly points out that it's hard to assign a meaning to those tokens.
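For instance, one could use Boost.Wave's C++ lexer directly, along the lines of its lexed_tokens sample. The following is only a rough sketch (exact types and constructor arguments vary across Boost versions), not a drop-in solution:

#include <boost/wave.hpp>
#include <boost/wave/cpplexer/cpp_lex_token.hpp>
#include <boost/wave/cpplexer/cpp_lex_iterator.hpp>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Collect the raw preprocessing tokens of a source file (before macro
// expansion, so __DATE__ stays as written). Concatenating the results
// should reproduce the original text, comments and whitespace included.
std::vector<std::string> lex_file(const std::string& filename)
{
    std::ifstream in(filename.c_str());
    std::string src((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());

    typedef boost::wave::cpplexer::lex_token<> token_type;
    typedef boost::wave::cpplexer::lex_iterator<token_type> lexer_type;

    token_type::position_type pos(filename.c_str());
    lexer_type first(src.begin(), src.end(), pos,
        boost::wave::language_support(boost::wave::support_cpp));
    lexer_type last;

    std::vector<std::string> lexemes;
    for (; first != last; ++first)
        lexemes.push_back(first->get_value().c_str());
    return lexemes;
}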
C++ is a very complex programming language.
Be sure to read the C++11 draft standard n3337 before even attempting to parse C++ code.
Look inside the source code of existing open source C++ compilers, such as GCC (at least GCC 10 in October 2020) or Clang (at least Clang 10 in October 2020)
If you have to write your C++ parser from scratch, be sure to have the budget for at least a full person year of work.
Look also into existing C++ static source code analyzers, such as Frama-C++ or Clang static analyzer. Consider adapting one of them to your needs, but do document in writing your needs before starting coding. Be aware of Rice's theorem.
If you want to parse a small subset of C++ (you'll need to document and specify that subset), consider using parser generators like ANTLR or GNU bison.
Most compilers are building some internal representations, in particular some abstract syntax tree. Read the Dragon book for more.
I would suggest instead writing your own GCC plugin.
Indeed, it would be tied to some major version of GCC, but you'll win months of work.
Is this true? If so, is there a way to have access to this tokenization of a C++ source code?
Yes, by patching some existing opensource C++ compiler, or extending it with your plugin (there are licensing conditions related to both approaches).
let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of.
The above specification is ambiguous.
Do you want the lexemes before or after the C++ preprocessing phase? In other words, what would be the lexeme for e.g. __DATE__ or __TIME__? Read e.g. the documentation of GNU cpp. If you happen to use GCC on Linux (see gcc(1)) and have some C++ translation unit foo.cc, try running g++ -C -E -Wall foo.cc > foo.ii and look (using less(1)) into the generated preprocessed form foo.ii. And what about template expansion, or preprocessor conditionals, or preprocessor stringizing?
I would suggest writing your GCC plugin working on GENERIC representations. You could also start a PhD work related to your goals.
Notice that generating C++ code is a lot easier than parsing it.
Look inside Qt for an example of software generating C++ code. You could consider using GNU m4, GNU gawk, GNU autoconf, GPP, or your own C++ source generator (perhaps with the help of GNU bison or of ANTLR) to generate some of your C++ code.
PS. On my home page you'll find a hyperlink to some draft report related to your question, and another hyperlink to an open source program generating C++ code. It sadly seems that I am forbidden to give these hyperlinks here, but you could find them in two mouse clicks. You might also look into two European H2020 projects funding that draft report: CHARIOT & DECODER.
When a program written in C++ has comments, are those comments translated into machine language, or do they never get that far? If I write a C++ program with an entire book's worth of comments between two commands, will my program take longer to compile or run any slower?
Comments are normally stripped out during preprocessing, so the compiler itself never sees them at all.
They can (and normally do) slow compilation a little, though: the preprocessor has to read through the entire comment to find its end, so that the subsequent code can be passed through to the compiler. Unless you include truly gargantuan comments (e.g., megabytes), the difference probably won't be very noticeable.
Although I've never seen (or heard of) a C or C++ compiler that did it, there have been compilers (e.g., for Pascal) that used specially formatted comments to pass directives to the compiler. For example, Turbo Pascal allowed (and its successor probably still allows) the user to turn range checking on and off using a compiler directive in a comment. In this case, the comment didn't (at least in the cases of which I'm aware) generate any machine code itself, but it could and did affect the machine code that was generated for the code outside the comment.
No, they are simply ignored by the compiler. Comments' sole purpose is for human readers, not machines.
The preprocessor eliminates comments. Why should the compiler read them anyway? They are there to make it easier for people to understand the code.
Haven't you heard the joke "It's hard to be a comment, you always get ignored" :p
In the 3rd translation phase
The source file is decomposed into comments, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and preprocessing tokens.
Each comment is replaced by one space character.
See this cppreference article for more information about the phases of translation.
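A tiny, contrived illustration of that rule: because each comment becomes a single space, the line below compiles exactly as if it read int x = 1 + 2;.

// In translation phase 3 the comment collapses to one space:
int x = 1/*an entire book's worth of comments could go here*/+ 2;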
No, they are removed by the preprocessor. You can check this by using cpp, the C preprocessor: just write a simple C program with a comment, and then run cpp comment.c | grep "your comment".
I'm trying to emulate C asserts in Fortran in order to enforce pre- and post-conditions of all my procedures. This way, I get to provide the user with more detailed information about run-time errors than I could reasonably be expected to maintain otherwise.
In order to implement this, I used the preprocessor directives __FILE__ and __LINE__ and defined an assert macro which expands to a Fortran subroutine call. Rather than try to describe it here, I made a git repo with a little bit of example code in it. If you build it with
make test
./test
the function hangs, because you called a function that expects a positive argument with a negative one. However, if you build with
make test DEBUG=1
./test
the error is caught by an assertion.
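(For concreteness, such a macro might look roughly like the sketch below; the names are hypothetical, not taken from the repo, and assert_failed would be an ordinary Fortran subroutine that prints the location and stops.)

#define assert(cond) if (.not. (cond)) call assert_failed(__FILE__, __LINE__)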
This works with both gfortran and the Intel Fortran compiler. I don't have access to other Fortran compilers. Can I reasonably expect other compilers to do the necessary source pre-processing if the file extension is .F90? Or should I be relying on the -cpp flag? What's the most portable way to do this? Should I even do this at all?
It is reasonable to expect that a suitable "C-like" preprocessor is available, though undoubtedly there will be exceptions for some compilers or tools.
The definition of portability can depend on your perspective, but given that a large number of systems have case-insensitive file systems, it is not reasonable to rely on file-name case alone to specify that the preprocessor needs to be run. Most Fortran-related build systems will have some way of making that specification explicit.
Whether this is a good idea is a bit more subjective. Perhaps the impact is only nominal, but requiring the preprocessor still represents a reduction in portability and an increase in build complexity. Depending on the compiler, use of the preprocessor may hinder the use of things like standard-conformance diagnostics.
Consequently, my preference for relatively simple use cases like this is to have the assertion coded as normal Fortran source: an if statement testing a named constant from a debug module or similar, invoking [ERROR] STOP (not exit) with a descriptive message if the assertion expression fails.
USE DebuggingFlags
! debug_flag is a compile-time constant, so an optimizing compiler
! can drop this whole block from release builds.
IF (debug_flag) THEN
  IF (x <= 0) ERROR STOP 'negative or zero x in sqrrt!'
END IF
This won't give you file and line information, but as long as you are somewhat selective with the STOP message the relevant source shouldn't be too hard to locate.
"Release" builds are made with the debugging flag constant defined as false (or your chosen equivalent) and with any sort of reasonable compiler optimisation active the object code associated with the assertion should be identified and eliminated as dead.
But there are pros and cons.
Fortran 95+ has conditional compilation (CoCo) defined in a standard, but only a few compilers support it. The de facto standard (since Fortran 90, I believe) is still cpp and fpp, which most compilers support. Both are highly compatible, but not 100%. Given these facts, relying on a cpp/fpp-style preprocessor should be safe in most cases.
Adding to the answer of @LeleDumbo:
As far as I can tell, the Fortran 2008 standard does not specify any form of preprocessing. This also means that there is no standard way of specifying how to invoke the preprocessor. However, it is common practice to use the .F90 extension to indicate the need for preprocessing.
Concerning CoCo: the third part of the Fortran standard, ISO 1539-3, which should have specified conditional compilation, was withdrawn.
When reading this document, at the end of it, there is one sentence:
Historically, compilers for many languages, including C++ and Fortran, have been implemented as “preprocessors” which emit another high level language such as C.
I have no idea about these preprocessors; is there any documentation? Does it mean all these languages were translated into C source code?
I think it would have been better to use the term source-to-source translator instead of "preprocessor", which is ambiguous in meaning, but it isn't wrong to use it either.
Basically, a compiler is a computer program that translates source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). But the document in the question says:
Historically, compilers for many languages, including C++ and Fortran,
have been implemented as “preprocessors” which emit another high level
language such as C.
As per this description, it can be said that, earlier, compilers were implemented as source-to-source translators. A translator is also a form of preprocessor, but it is different from the preprocessor used within a program.
A translator is a computer program that translates a program written
in a given programming language into a functionally equivalent program
in a different language.
Now, coming to the preprocessor used in a program, let's take an example:
#include <stdio.h> // a PREPROCESSOR directive
A preprocessor is a program that processes a source file before the main compilation takes place (similar to a translator), but the difference lies in the fact that here it handles directives whose names begin with #.
Here #include is a directive. This directive causes the preprocessor to add the contents of the stdio.h file to your program. This is a typical preprocessor action: adding or replacing text in the source code before it's compiled.
Some languages have been implemented by having the compiler generate C code which is then compiled by the C compiler. Notable examples include:
C++ in the earliest days (and C with Classes before that) — cfront generated C code from the C++ code. It ceased to be practical once C++ supported exceptions (read Stroustrup The Design and Evolution of C++ for more information), but not all C++ compilers used the technique (in fact, I don't know of any other compiler than cfront that did it).
Yacc is compiled to C code. Bison can be compiled to C or C++ code.
Lex is compiled to C code. Flex can be compiled to C or C++ code, I believe.
Informix ESQL/C converts Embedded SQL into pure C.
Informix 4GL converts I4GL source into ESQL/C, and then uses the ESQL/C compiler to create C code (and the C compiler to create object code and executables), so it has a multi-stage compiler (and I'm simplifying a bit).
The phrase "preprocessor" now has a totally different meaning, and is confusing to be used here. But, yes, here it means some compilers emits its source to another language.
It should be called source to source compiler. One of the examples is Cfront (designed by Bjarne Stroustrup himself), which converted C++ to C.
For the normal meaning of the phrase "preprocessor" in C++, see here.
No, not necessarily. Many C++ compilers, as the GCC document said (but not gcc/g++ itself), produce C code as output. Why do they do this? So they can piggyback on all the back-end code generation that C compilers already provide for the various targets (x86, AMD, etc.). By having C as their destination code, they save a lot of low-level coding on the back end. Such compilers include the original Cfront and Comeau C/C++.
I'm just starting to explore C++, so forgive the newbiness of this question. I also beg your indulgence on how open ended this question is. I think it could be broken down, but I think that this information belongs in the same place.
(FYI -- I am working predominantly with the QT SDK and mingw32-make right now and I seem to have configured them correctly for my machine.)
I knew that there was a lot in the language which is compiler-driven (I've heard about pre-compiler directives), but it seems like someone could write books about the different C++ compilers and their respective parameters. In addition, there are commands which apparently precede make (like qmake, for example; is this something specific to Qt?).
I would like to know if there is any place which gives me an overview of what compilers are out there, and what their different options are. I'd also like to know how each of them views Makefiles (it seems that there is a difference in syntax between them?).
If there is no website regarding, "Everything you need to know about C++ compilers but were afraid to ask," what would be the best way to go about learning the answers to these questions?
Concerning the "numerous options of the various compilers"
A piece of good news: you needn't worry about the detail of most of these options. You will, in due time, delve into this, only for the very compiler you use, and maybe only for the options that pertain to a particular set of features. But as a novice, generally trust the default options or the ones supplied with the make files.
The broad categories of these features (and I may be missing a few) are:
pre-processor defines (now, you may need a few of these)
code generation (target CPU, FPU usage...)
optimization (hints for the compiler to favor speed over size and such)
inclusion of debug info (which is extra data left in the object/binary and which enables the debugger to know where each line of code starts, what the variables names are etc.)
directives for the linker
output type (exe, library, memory maps...)
C/C++ language compliance and warnings (compatibility with previous version of the compiler, compliance to current and past C Standards, warning about common possible bug-indicative patterns...)
compile-time verbosity and help
Concerning an inventory of compilers with their options and features
I know of no such list, but I'm sure one probably exists on the web. However, I suggest that, as a novice, you worry little about these "details" and use whatever free compiler you can find (gcc is certainly a great choice), building experience with the language and the build process. C++ professionals may well argue, with good reason and at length, on the merits of various compilers and their associated runtimes etc., but for generic purposes (and then some) the free stuff is all that is needed.
Concerning the build process
The most trivial applications, such as those made of a single unit of compilation (read: a single C/C++ source file), can be built with a simple batch file where the various compiler and linker options are hardcoded and the name of the file is specified on the command line.
For all other cases, it is very important to codify the build process so that it can be done a) automatically and b) reliably, i.e. with repeatability.
The "recipe" associated with this build process is often encapsulated in a make file or as the complexity grows, possibly several make files, possibly "bundled together in a script/bat file.
This (make file syntax) you need to get familiar with, even if you use alternatives to make/nmake, such as Apache Ant; the reason is that many (most?) source code packages include a make file.
In a nutshell, make files are text files that allow defining targets and the associated commands to build each target. Each target is associated with its dependencies, which allows the make logic to decide which targets are out of date and should be rebuilt and, before rebuilding them, which dependencies should themselves be rebuilt first. That way, when you modify, say, an include file (and if the make file is properly configured), any C file that uses this header will be recompiled, and any binary which links with the corresponding obj file will be rebuilt as well; see the sketch below. make also includes options to force all targets to be rebuilt, which is sometimes handy to be sure that you truly have a current build (for example, in case some dependencies of a given object are not declared in the make file).
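For illustration, a minimal make file might look like this (the file names are hypothetical); touching util.h would cause both object files, and then the executable, to be rebuilt:

CXX = g++
CXXFLAGS = -Wall -O2

# each rule: a target, its dependencies, then the command to rebuild it
# (note: in a real make file the command lines must start with a TAB)
app: main.o util.o
	$(CXX) $(CXXFLAGS) -o app main.o util.o

main.o: main.cpp util.h
	$(CXX) $(CXXFLAGS) -c main.cpp

util.o: util.cpp util.h
	$(CXX) $(CXXFLAGS) -c util.cpp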
On the Pre-processor:
The pre-processor is the first step toward compiling, although it is technically not part of the compilation. The purposes of this step are:
to remove any comments and extraneous whitespace
to substitute any macro reference with the relevant C/C++ syntax. Some macros, for example, are used to define constant values, such as, say, some email address used in the program; during pre-processing, any reference to this constant value (by the way, by convention such constants are named with ALL_CAPS_AND_UNDERSCORES) is replaced by the actual C string literal containing the email address.
to exclude all conditional compilation branches that are not relevant (#ifdef and the like)
What's important to know about the pre-processor is that its directives are NOT part of the C language proper, and that they serve several important functions, such as the conditional compilation mentioned earlier (used, for example, to have multiple versions of the program, say for different operating systems, or indeed for different compilers); a short sketch follows.
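To make the last two bullets concrete, here is a minimal sketch (the macro names are made up): during pre-processing, ADMIN_EMAIL is replaced by the string literal, and only one branch of the #ifdef survives.

#define ADMIN_EMAIL "admin@example.com"

#ifdef _WIN32
const char* path_sep = "\\";   /* kept only when compiling for Windows */
#else
const char* path_sep = "/";    /* kept everywhere else */
#endif

const char* contact = ADMIN_EMAIL;  /* becomes "admin@example.com" */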
Taking it from there...
After this manifesto of mine... I encourage you to read just a little more and then dive into programming and building binaries. It is a very good idea to try to get a broad picture of the framework etc., but this can be overdone, a bit akin to the exchange student who stays in his/her room reading the Webster dictionary to be "prepared" for meeting native speakers, rather than just "doing it!".
Ideally you shouldn't need to care which C++ compiler you are using. Conformance to the standard has become much better in recent years (even from Microsoft).
Compiler flags obviously differ, but the same features are generally available; it's just a differently named option to, e.g., set the warning level on GCC and MS cl.
The build system is independent of the compiler; you can use any make with any compiler.
That is a lot of questions in one.
C++ compilers are a lot like hammers: They come in all sizes and shapes, with different abilities and features, intended for different types of users, and at different price points; ultimately they all are for doing the same basic task as the others.
Some are intended for highly specialized applications, like high-performance graphics, and have numerous extensions and libraries to assist the engineer with those types of problems. Others are meant for general purpose use, and aren't necessarily always the greatest for extreme work.
The technique for using each type of hammer varies from model to model—and version to version—but they all have a lot in common. The macro preprocessor is a standard part of C and C++ compilers.
A brief comparison of many C++ compilers is here. Also check out the list of C compilers, since many programs don't use any C++ features and can be compiled by ordinary C.
C++ compilers don't "view" makefiles. The rules of a makefile may invoke a C++ compiler, but also may "compile" assembly language modules (assembling), process other languages, build libraries, link modules, and/or post-process object modules. Makefiles often contain rules for cleaning up intermediate files, establishing debug environments, obtaining source code, etc., etc. Compilation is one link in a long chain of steps to develop software.
Also, many development environments abstract the makefile into a "project file" which is used by an integrated development environment (IDE) in an attempt to simplify or automate many programming tasks. See a comparison here.
As for learning: choose a specific problem to solve and dive in. The target platform (Linux/Windows/etc.) and problem space will narrow the choices pretty well. Which you choose is often linked to other considerations, such as working for a particular company, or being part of a team. C++ has something like 95% commonality among all its flavors. Learn any one of them well, and learning the next is a piece of cake.