Do comments get translated to machine code? C++

When a program written in C++ has comments, are those comments translated into machine language or do they never get that far? If I write a C++ program with a book's worth of comments between two commands, will my program take longer to compile or run any slower?

Comments are normally stripped out during preprocessing, so the compiler itself never sees them at all.
They can (and normally do) slow compilation a little though: the preprocessor has to read through the entire comment to find its end (so subsequent code will be passed through to the compiler). Unless you include truly gargantuan comments (e.g., megabytes), the difference probably won't be very noticeable.
Although I've never seen (or heard of) a C or C++ compiler that did it, there have been compilers (e.g., for Pascal) that used specially formatted comments to pass directives to the compiler. For example, Turbo Pascal allowed (and its successor probably still allows) the user to turn range checking on and off using a compiler directive in a comment. In this case, the comment didn't (at least in the cases of which I'm aware) generate any machine code itself, but it could and did affect the machine code that was generated for the code outside the comment.

No, they are simply ignored by the compiler. Comments exist solely for human readers, not for the machine.

The preprocessor eliminates comments. Why should the compiler read them anyway? They are there to make it easier for people to understand the code.
Haven't you heard the joke "It's hard to be a comment, you always get ignored"? :p

In the 3rd translation phase:
The source file is decomposed into comments, sequences of whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), and preprocessing tokens.
Each comment is replaced by one space character.
See this cppreference article for more information about the phases of translation.
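To make the "each comment is replaced by one space" rule concrete, here is a rough sketch of that step. It is not how any real compiler does it: the function name strip_comments is made up for illustration, and it deliberately ignores raw string literals, trigraphs, line splicing and digit separators.

```cpp
#include <cassert>
#include <string>

// Simplified sketch of the comment-removal part of translation phase 3.
// Handles // and /* */ comments and ordinary string/character literals only.
std::string strip_comments(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    const std::size_t n = src.size();
    while (i < n) {
        const char c = src[i];
        if (c == '"' || c == '\'') {
            // Copy a literal verbatim so comment markers inside it survive.
            const char quote = c;
            out += src[i++];
            while (i < n && src[i] != quote) {
                if (src[i] == '\\' && i + 1 < n) out += src[i++];  // keep the backslash
                out += src[i++];
            }
            if (i < n) out += src[i++];  // closing quote
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {
            // Line comment: skip up to (not including) the newline.
            while (i < n && src[i] != '\n') ++i;
            out += ' ';
        } else if (c == '/' && i + 1 < n && src[i + 1] == '*') {
            // Block comment: skip through the closing */ (or to the end if unclosed).
            const std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;
            out += ' ';
        } else {
            out += src[i++];
        }
    }
    // A comment is at least two characters and becomes one space,
    // so the output can never be longer than the input.
    assert(out.size() <= src.size());
    return out;
}
```

A book-sized comment costs the time needed to scan past it, and what reaches the later phases is a single space.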

No, they are removed by the preprocessor. You can check this by using cpp, the C preprocessor. Just write a simple C program with a comment and then run cpp comment.c | grep "your comment" (grep should find nothing).

Related

Accessing tokenization of a C++ source file

My understanding is that one step of the compilation of a program (irrespective of the language, I guess) is parsing the source file into some kind of space-separated tokens (this tokenization would be done by what's referred to as a scanner in this answer). For instance, I understand that at some point in the compilation process, a line containing x += fun(nullptr); is separated into something like
x
+=
fun
(
nullptr
)
;
Is this true? If so, is there a way to have access to this tokenization of a C++ source code?
I'm asking this question mostly for curiosity, and I do not intend to write a lexer myself
And the reason I'm curious to know whether one can leverage the compiler is that, to give an example, before meeting [[noreturn]] & Co. I wouldn't have ever considered [[ as a valid token, if I was to write a lexer myself.
Do we necessarily need a true, actual use case? I think we don't, if I am curious about whether there's an existing tool or not to do something.
However, if we really need a use case,
let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of. Clearly, a requirement is that concatenating the elements of the output should make up the whole text again, including line breaks and every other byte of it.
With the restriction mentioned in the comment (tokenization keeping __DATE__) it seems rather manageable. You need the preprocessing tokens. The Boost::Wave preprocessor necessarily creates a token list, because it has to work on those tokens.
Basile correctly points out that it's hard to assign a meaning to those tokens.
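Just to illustrate the round-trip requirement from the question (concatenating the pieces gives back the original text), here is a toy sketch. It is nowhere near a real C++ lexer: toy_lex and its short punctuator table are invented for this example, and literals, comments and most operators are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Toy lexer: splits text into identifier/number runs, whitespace runs and
// punctuators, keeping every byte so the pieces concatenate back to the input.
std::vector<std::string> toy_lex(const std::string& src) {
    // Longest entries first, so the first match is also the longest one.
    static const char* const multi[] = {"<<=", ">>=", "+=", "-=", "==",
                                        "->",  "::",  "++", "--"};
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        std::size_t len = 1;
        if (std::isalnum(c) || c == '_') {
            while (i + len < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i + len])) ||
                    src[i + len] == '_'))
                ++len;
        } else if (std::isspace(c)) {
            while (i + len < src.size() &&
                   std::isspace(static_cast<unsigned char>(src[i + len])))
                ++len;
        } else {
            // Maximal munch over the small table above.
            for (const char* p : multi) {
                const std::string s(p);
                if (src.compare(i, s.size(), s) == 0) { len = s.size(); break; }
            }
        }
        assert(len > 0);  // every step consumes at least one byte
        out.push_back(src.substr(i, len));
        i += len;
    }
    return out;
}
```

For x += fun(nullptr); this yields x, +=, fun, (, nullptr, ), ; with the whitespace runs kept as separate elements in between.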
C++ is a very complex programming language.
Be sure to read the C++11 draft standard n3337 before even attempting to parse C++ code.
Look inside the source code of existing open source C++ compilers, such as GCC (at least GCC 10 in October 2020) or Clang (at least Clang 10 in October 2020)
If you have to write your C++ parser from scratch, be sure to have the budget for at least a full person year of work.
Look also into existing C++ static source code analyzers, such as Frama-C++ or Clang static analyzer. Consider adapting one of them to your needs, but do document in writing your needs before starting coding. Be aware of Rice's theorem.
If you want to parse a small subset of C++ (you'll need to document and specify that subset), consider using parser generators like ANTLR or GNU bison.
Most compilers are building some internal representations, in particular some abstract syntax tree. Read the Dragon book for more.
I would suggest instead writing your own GCC plugin.
Indeed, it would be tied to some major version of GCC, but you'll win months of work.
Is this true? If so, is there a way to have access to this tokenization of a C++ source code?
Yes, by patching some existing opensource C++ compiler, or extending it with your plugin (there are licensing conditions related to both approaches).
let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of.
The above specification is ambiguous.
Do you want the lexemes before or after the C++ preprocessing phase? In other words, what would be the lexeme for e.g. __DATE__ or __TIME__? Read e.g. the documentation of GNU cpp. If you happen to use GCC on Linux (see gcc(1)) and have some C++ translation unit foo.cc, try running g++ -C -E -Wall foo.cc > foo.ii and look (using less(1)...) into the generated preprocessed form foo.ii. And what about template expansion, preprocessor conditionals, or preprocessor stringizing?
I would suggest writing your GCC plugin working on GENERIC representations. You could also start a PhD work related to your goals.
Notice that generating C++ code is a lot easier than parsing it.
Look inside Qt for an example of software generating C++ code. You could consider using GNU m4, or GNU gawk, or GNU autoconf, or GPP, or your own C++ source generator (perhaps with the help of GNU bison or of ANTLR) to generate some of your C++ code.
PS. On my home page you'll find a hyperlink to some draft report related to your question, and another hyperlink to an open source program generating C++ code. It sadly seems that I am forbidden here to give these hyperlinks, but you could find them in two mouse clicks. You might also look into two European H2020 projects funding that draft report: CHARIOT & DECODER.

What should I do if I want to use functions that are not standard and are due to a compiler-specific API?

I have been using the very very old Turbo C++ 3.0 compiler.
During the usage of this compiler, I have become used to functions like getch(), getche() and most importantly clrscr().
Now I have started using Visual C++ 2010 Express. This is causing a lot of problems, as most of these functions (I found this out now) are non-standard and are not available in Visual C++.
What am I to do now?
Always try to avoid them if possible, or use their alternatives:
for getch(): cin.get()
for clrscr(): system("cls") (but try to avoid system commands)
For any others, you can search for alternatives.
The real question is what you are trying to do, globally.
getch and clrscr have never been portable. If you're trying
to create masks or menus in a console window, you should look
into curses or ncurses: these offer a portable solution for
such things. If it's just paging, you can probably get away
with simply outputting a large number of '\n' (for clrscr),
and std::cin.get() for getch. (But beware that this will only
return once the user has entered a new line, and will only read
one character of the line, leaving the rest in the buffer. It
is definitely not a direct replacement for getch. In fact,
std::getline or std::cin.ignore might be better choices.)
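A minimal sketch of the std::getline route, assuming you only want the first character of whatever the user typed. The names read_key and read_key_example are made up for illustration. It takes any std::istream so it can be tried with scripted input; in a program you would pass std::cin.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads one whole line and returns its first character, or '\0' for an
// empty line or end of input. The rest of the line is consumed, so nothing
// is left in the buffer for the next read. Unlike getch(), the call still
// only returns once the user has pressed Enter.
char read_key(std::istream& in) {
    std::string line;
    if (!std::getline(in, line)) return '\0';
    return line.empty() ? '\0' : line.front();
}

// Usage with scripted input standing in for std::cin.
void read_key_example() {
    std::istringstream in("yes\n\nq");
    assert(read_key(in) == 'y');   // "es" is discarded along with the line
    assert(read_key(in) == '\0');  // empty line
    assert(read_key(in) == 'q');   // last line without a trailing newline
    assert(read_key(in) == '\0');  // end of input
}
```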
Edit:
Adding some more possibilities:
First, as Joachim Pileborg suggested in his comment, if
portability is an issue, there may be platform specific
functions for much of what you are trying to do. If all you're
concerned about is Windows (and it probably is, since system(
"cls" ) and getch() don't work elsewhere), then his comment
may be a sufficient answer.
Second, for many consoles (including xterm and a console
window under Windows), the escape sequence "\x1b[2J" should
clear the screen. (Note that if a hex digit followed the escape
directly, as in "\x1b""2J", you would have to enter it as two
separate string literals, since otherwise it would be
interpreted as two characters, the first with the impossible hex
value of 0x1b2.) Don't forget about possible issues of
redirection and flushing, however.
Finally, if you're doing anything non-trivial, you should look
into curses (or ncurses, they're the same thing, but with
different implementations). It's a bit more effort to put into
action (you need explicit initialization, etc.), but it has
a getch function which does exactly what you want, and it also
has functions for explicitly positioning the cursor, etc. which
may also make your code simpler. (The original curses was
developed to support the original vi editor, at UCB. Any
editor-like task not being developed in its own window would
benefit enormously from it.)
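For the escape-sequence route, a small sketch might look like this. It assumes the terminal interprets ANSI escapes (older Windows consoles may not), and the function names are hypothetical. Writing to a std::ostream parameter keeps the flushing explicit and makes the output easy to inspect.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>

// Writes the ANSI sequences "erase display" (ESC [ 2 J) and "cursor home"
// (ESC [ H), then flushes so the effect is not delayed by buffering.
void clear_screen(std::ostream& out) {
    out << "\x1b[2J" << "\x1b[H" << std::flush;
}

// Usage: capture the bytes instead of sending them to a terminal.
void clear_screen_example() {
    std::ostringstream captured;
    clear_screen(captured);
    assert(captured.str() == "\x1b[2J\x1b[H");
}
```

In a real program you would call clear_screen(std::cout). If the output is redirected to a file, the escape bytes end up in the file, which is the redirection issue mentioned above.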
Well,
People, I have found the one best solution that can be used everywhere.
I simply googled the definitions of clrscr() and gotoxy(), created a header file, and added these definitions to it. Thus, I can include this file and do everything that I was doing before.
But I have a query too.
windows.h is used in the definitions. Suppose I compile the file and make an exe file. Will I then be able to run it on a Linux machine?
As far as I can tell the answer has to be yes. But please tell me if I am wrong, and also tell me why I am wrong.

Does using leading underscores actually cause trouble?

The C/C++ standard reserves all identifiers that either lead with an underscore (plus an uppercase letter if not in the global namespace) or contain two or more adjacent underscores. Example:
int _myGlobal;
namespace _mine
{
void Im__outta__control() {}
int _LivingDangerously;
}
But what if I just don't care? What if I decide to live dangerously and use these "reserved" identifiers anyway? Just how dangerously would I be living?
Have you ever actually seen a compiler or linker problem resulting from the use of reserved identifiers by user code?
The answers below, so far, amount to, "Why break the rules when doing so might cause trouble?" But imagine that you already had a body of code that broke the rules. At what point would the cost of trouble from breaking the rules outweigh the cost of refactoring the code to comply? Or what if a programmer had developed a personal coding style that called for wild underscores (perhaps by coming from another language, for instance)? Assuming that changing their coding style was more or less painful to them, what would motivate them to overcome the pain?
Or I could ask the same question in reverse. What is it concretely that C/C++ libraries are doing with reserved words that a user is liable to fall afoul of? Are they declaring globals that might create name clashes? Functions? Classes? Each library is different, naturally, but how in general might this collision manifest?
I teach software students who come to me with these kinds of questions, and all I can tell them is, "It's against the rules." It's a superstitious, hand-waving answer. Moreover, in twenty years of C++ programming, I've never seen a compiler or linker error that resulted from breaking the reserved word rules.
A good skeptic, faced with any rule, asks, "Why should I care?" So: why should I care?
I now care because I just encountered a failure with underscores in a large, old codebase, mostly aimed at Windows and compiled with VS2005, though some of it is also cross-compiled to Linux. While analyzing updates to a newer gcc, I rebuilt some under cygwin just for ease of syntax checking. I got totally unintelligible errors (to my tiny brain) out of a line like:
template< size_t _N = 0 > class CSomeForwardRef;
That produced an error like:
error: expected ‘>’ before numeric constant
Google on that error turned up https://svn.boost.org/trac/boost/ticket/2245 and https://svn.boost.org/trac/boost/ticket/7203 both of which hinted that a stray #define could get in the way. Sure enough, an examination of the preprocessed source via -E and a hunt thru the include path turned up some bits-related .h (forget which) that defined _N. Later in that same odyssey I encountered a similar problem with _L.
Edit: Not bits-related but char-related: /usr/include/ctype.h -- here are some samples together with how ctype.h uses them:
#define _L 02
#define _N 04
.
.
.
#define isalpha(__c) (__ctype_lookup(__c)&(_U|_L))
#define isupper(__c) ((__ctype_lookup(__c)&(_U|_L))==_U)
#define islower(__c) ((__ctype_lookup(__c)&(_U|_L))==_L)
#define isdigit(__c) (__ctype_lookup(__c)&_N)
#define isxdigit(__c) (__ctype_lookup(__c)&(_X|_N))
I'll be scanning the source for all underscored identifiers and weeding out via rename all those we created in error ...
Jon
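To spell out the mechanism in that story, here is a small sketch. DEMO_N is a hypothetical stand-in for the header's _N (so the example itself does not use a reserved name), and the struct body is invented just so there is something to check.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for what a system header may do. The real case used the reserved
// name _N; DEMO_N plays that role here.
#define DEMO_N 04

// With the macro visible, a declaration spelled with the same name, e.g.
//   template <std::size_t DEMO_N = 0> class CSomeForwardRef;
// reaches the compiler as
//   template <std::size_t 04 = 0> class CSomeForwardRef;
// which is where "expected '>' before numeric constant" comes from.

// A non-reserved parameter name is one the implementation should not define
// as a macro, so the declaration survives preprocessing unchanged.
template <std::size_t N = 0>
struct CSomeForwardRef {
    static constexpr std::size_t value = N;
};

void forward_ref_example() {
    assert(CSomeForwardRef<>::value == 0);
    assert(CSomeForwardRef<DEMO_N>::value == 4);  // the macro is just the octal literal 04
}
```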
The results may vary according to the specific compiler you use.
Regarding the "danger level": every time you get a bug, you will have to wonder whether it originates from your implemented logic or from the fact that you are not following the standard.
But that is not all... let's assume someone tells you: "it is perfectly safe!"
So you can do that with no problem at all (only assuming...).
Will it change your thinking when you hit a bug, or will you still be wondering whether there is a slight chance he was wrong? :)
So, you see, no matter which answer you will get it can never be a good one.
(which makes me actually like your question)
Just how dangerously would I be living?
Dangerous enough to break your code in next compiler upgrade.
Think of the future, your code might not be portable and might break in future because future enhancement releases from your implementation might have exactly the same symbol name as you use.
Since the question has a pinch of: "It can be wrong yet how wrong can it be and when ever has it been wrong" flavor, I think Murphy's law answers this rather aptly:
"Anything that can go wrong will go wrong (When you are least expecting it)".[#]
[#] The parenthetical is my invention, not Murphy's.
If you try to build your code somewhere where there's actually a conflict you will see strange build errors, or worse, no build error at all and incorrect runtime behavior.
I have seen someone use a reserved identifier which had to be changed when it caused build problems on a new platform.
It's not all that likely, but there's no reason to do it.

Where exactly is the boundary between a preprocessor and a compiler?

According to various sources (for example, the SE radio episode with Kevlin Henney, if I remember correctly), "C with classes" was implemented with preprocessor technology (with the output then being fed to a C compiler), whereas C++ has always been implemented with a compiler (that just happened to spit out C in the early days). This seems to cause some confusion, so I was wondering:
Where exactly is the boundary between a preprocessor and a compiler? When do you call a piece of software that implements a language "a preprocessor", and when do you call it "a compiler"?
By the way, is "a compiled language" an established term? If so, what exactly does it mean?
This is an interesting question. I don't know a definitive answer, but would say this, if pressed for one:
A preprocessor doesn't parse the code, but instead scans for embedded patterns and expands them.
A compiler actually parses the code by building an AST (abstract syntax tree) and then transforms that into a different language.
The language of the output of the preprocessor is a subset of the language of the input.
The language of the output of the compiler is (usually) very different (machine code) from the language of the input.
From a simplified, personal, point of view:
I consider the preprocessor to be any form of textual manipulation that has no concepts of the underlying language (ie: semantics or constructs), and thus only relies on its own set of rules to perform its duties.
The compiler starts when rules and regulations are applied to what is being processed (yes, it makes 'my' preprocessor a compiler, but why not :P). This includes semantic and lexical checking, and the included transforms from x (textual) to y (binary/intermediate form). As one of my professors would say: "it's a system with inputs, processes and outputs".
The C/C++ compiler cares about type-correctness while the preprocessor simply expands symbols.
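A classic way to see the difference, as a small sketch (the names SQUARE_TEXT and square are made up for the example):

```cpp
#include <cassert>

// Textual expansion: SQUARE_TEXT(1 + 2) becomes 1 + 2 * 1 + 2.
// The preprocessor neither evaluates nor type-checks the argument.
#define SQUARE_TEXT(x) x * x

// A real function: the compiler checks the argument's type and evaluates it
// once before the multiplication. square("a") would be rejected.
constexpr int square(int x) { return x * x; }

void expansion_example() {
    assert(SQUARE_TEXT(1 + 2) == 5);  // 1 + (2 * 1) + 2
    assert(square(1 + 2) == 9);
}
```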
A compiler consists of several processes (components). The preprocessor is only one of these, and a relatively simple one.
From the Wikipedia article, Division of compiler processes:
All but the smallest of compilers have more than two phases. However,
these phases are usually regarded as being part of the front end or
the back end. The point at which these two ends meet is open to
debate.
The front end is generally considered to be where syntactic
and semantic processing takes place, along with translation to a lower
level of representation (than source code).
The middle end is usually
designed to perform optimizations on a form other than the source code
or machine code. This source code/machine code independence is
intended to enable generic optimizations to be shared between versions
of the compiler supporting different languages and target processors.
The back end takes the output from the middle. It may perform more
analysis, transformations and optimizations that are for a particular
computer. Then, it generates code for a particular processor and OS.
Preprocessing is only a small part of the front end's job.
The first C++ compiler was made by attaching an additional process in front of an existing C compiler toolset, not because it was good design but because of limited time and resources.
Nowadays, I don't think such a non-native C++ compiler could survive in the commercial field.
I dare say a cfront for C++11 is impossible to make.
The answer is pretty simple.
A preprocessor works on text as input and has text as output. Examples are the old Unix commands m4 and cpp (the C preprocessor), and also Unix programs like roff, nroff and troff, which were used (and still are) to format man pages (Unix command "man") or to format text for printing or typesetting.
Preprocessors are very simple; they don't know anything about the "language of the text" they process. In other words, they could just as well process natural language. The C preprocessor, despite its name, only recognizes directives such as #define, #include, #ifdef, #ifndef, #else etc., and if you use #define MACRO it tries to "expand" that macro everywhere it finds it. But the input does not need to be C or C++ program text; it can as well be a novel written in Italian or Greek.
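As a rough illustration of "text in, text out", here is a toy expander for object-like macros. It is only a sketch: the function expand is invented for this example, and the real cpp works on preprocessing tokens, rescans its output, and handles function-like macros and directives, none of which is attempted here.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Replaces every whole-word occurrence of a macro name with its replacement
// text. The input can be any text at all; nothing here knows about C or C++.
std::string expand(const std::string& text,
                   const std::map<std::string, std::string>& macros) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isalnum(c) || c == '_') {
            // Collect the whole word, then look it up.
            std::size_t j = i + 1;
            while (j < text.size() &&
                   (std::isalnum(static_cast<unsigned char>(text[j])) || text[j] == '_'))
                ++j;
            const std::string word = text.substr(i, j - i);
            const auto it = macros.find(word);
            out += (it != macros.end()) ? it->second : word;
            i = j;
        } else {
            out += text[i++];
        }
    }
    return out;
}

void expand_example() {
    const std::map<std::string, std::string> macros = {{"HERO", "Renzo"}};
    assert(expand("HERO loves Lucia; HEROINE is untouched.", macros) ==
           "Renzo loves Lucia; HEROINE is untouched.");
}
```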
Compilers that cross compile into a different language are usually called translators. So the old cfront "compiler" for C++ which emitted C code was a C++ translator.
Preprocessors and later translators were historically used because old machines simply lacked the memory to do everything in one program; instead the work was done by specialized programs, from disk to disk.
A typical C program would be compiled from various sources, and the build process would be managed with make. These days the C preprocessor is usually built directly into the C/C++ compiler. A typical make run would call cpp on the *.c files and write the output to a different directory; from there the C compiler cc would either compile it straight to machine code or, more commonly, output assembler code as text. Note: the C compiler only checked syntax; it did not really care about type safety etc. Then the assembler would take that assembler code and output a *.o file which could later be linked with other *.o files and *.lib files into an executable program. OTOH, you likely had a make rule that would call not the C compiler but the lint command, the C language analyser, which looks for typical mistakes and errors (which are ignored by the C compiler).
It is quite interesting to look up lint, nroff, troff, m4 etc. on Wikipedia (or in your machine's terminal using man) ;D

Are there C/C++ compilers that do not require standard library includes?

All applicants to our company must pass a simple quiz using C as part of early screening process.
It consists of a C source file that must be modified to provide the desired functionality. We clearly state that we will attempt to compile the file as-is, with no changes.
Almost all applicants use "strlen" but half of them do not include "string.h", so it does not compile until I include it.
Are they just lazy or are there compilers that do not require you to include standard library files, such as "string.h"?
GCC will happily compile the following code as is:
main()
{
printf("%u\n",strlen("Hello world"));
}
It will complain about incompatible implicit declaration of built-in function ‘printf’ and strlen(), but it will still produce an executable.
If you compile with -Werror it won't compile.
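For contrast, here is a sketch of the same call with the declarations actually in scope, written as C++ (where the headers are mandatory anyway). The function name describe_length is made up. It also uses %zu, which is the conversion that matches the size_t returned by strlen, whereas the quiz snippet's %u expects an unsigned int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>   // std::snprintf
#include <cstring>  // std::strlen
#include <string>

// Formats the length of a string roughly the way the snippet above prints it,
// but with both functions properly declared by their headers.
std::string describe_length(const char* s) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%zu", std::strlen(s));
    return buf;
}

void describe_length_example() {
    assert(describe_length("Hello world") == "11");
}
```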
I'm pretty sure it's non-conformant for a compiler to include headers that aren't asked for. The reason for this is that the C standard says that various names are reserved, if the relevant header is included. I think this implies they aren't reserved if they aren't included, since compilers aren't allowed to reserve names the standard doesn't say are reserved (unless of course they include a non-standard header which happens to be provided by the compiler, and is documented elsewhere to reserve extra names. Which is what happens when you use POSIX).
This doesn't fully answer your question - there do exist non-conformant compilers. As for your applicants, maybe they're just used to including "windows.h", and so have never thought before about what header strlen might be defined in by the C standard. I assume without testing that MSVC does in principle require you to include "string.h". But since "windows.h" does that for you, for the vast majority of practical Windows programs you don't need to know that you have to include "string.h".
They might be lazy, but you can't tell. Standard library implementations (as opposed to compilers, though of course each compiler usually has "its own" stdlib impl.) are allowed to include other headers. For example, #include <stdlib.h> could include every other library described in the standard. (I'm talking in the context of "C/C++", not strictly C.)
As a result, programmers get accustomed to such things, even if not strictly guaranteed, and it's easy to forget whether some function comes from a general catch-all like stdlib.h or something else—many people forget that memcpy is from string.h too.
If they do not include any headers, I would count them as wrong. If you don't allow them to test it with a particular implementation, however, it's hard to say they're wrong. And if you don't provide them with man pages (which represent the resources they'll need to know how to use on the job), then you're wrong.
At that point, you can certainly say they don't follow the exact letter of the standard; but do you want coders that get things done and know how to fix problems when they see them, or coders that worry about minutiae that won't matter?
If you provide a C file to start working with, make it have all the headers that could be needed from the beginning and ask the applicants to remove the unused ones.
The most common engineering experience is to add (or delete) a few lines of code to/from an application with thousands of lines already working correctly. It would be extremely rare in such a case to need another header file when adding a call to printf() or strlen().
It would be interesting to look over the shoulder of experienced engineers—not just graduated from school, but with extensive experience in the trenches—to see if they simply add strlen() and try compiling, or if they check to see if stdlib.h or string.h is already included before compiling. I bet the overwhelming majority do the former.
C implementations usually still allow implicit function declarations.
Anyway, I wouldn't consider all the boilerplate a required part of an interview, unless you specifically ask for it (e.g. "please don't omit anything you'd normally have in a source file").
(And with Visual Assist's "add ... include" I know less and less where they come from ;))
Most compilers provide some kind of option to force headers inclusion.
E.g. the GCC compiler has the -include option, which is the equivalent of the #include preprocessor directive.
TCC will also happily compile a file such as the accepted answer's example:
int main()
{
printf("%u\n", strlen("hello world"));
}
without any warnings (unless you pass -Wall); as an added bonus, you can just call tcc -run file.c to execute file.c without compiling to an output file first.
In C89 (the most commonly applied standard), a function that is not declared will be assumed to return an int and have unknown arguments, and the code will compile (probably with warnings). If it does not, the compiler is not compliant. If, on the other hand, you compile the code as C++, it will fail, and must do so if the C++ compiler is compliant.
Why not simply specify in the question that all necessary headers must be included (and perhaps irrelevant ones omitted)?
Alternatively, sit the candidates at a machine with a compiler and have them check their own code, and state that the code must compile at the maximum warning level, without warnings.
I've been doing C/C++ for 20 years (even taught classes) and I guess there's a 50% probability that I'd forget the include too (99% of the time when I code, the include is already there somewhere). I know this isn't exactly answering your question, but if someone knows strlen() they will know within seconds what to do with that specific compiler error, so from a job qualification standpoint the slip is practically irrelevant.
Rather than putting emphasis on stuff like that, checking for the subtleties that require real understanding of the language should be far more relevant, i.e. buffer overruns, strncpy not appending a final \0 when hitting the limits, asking someone to make a safe (truncating) copy to a buffer of limited length, etc. Especially in C/C++ programming, the errors that do not generate a compiler error are the ones which will cause you/your company the real trouble.
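As an example of the kind of exercise meant here, a safe (truncating) copy could be sketched like this. The name copy_truncating and its bool return value are my own choices for the illustration, not a standard API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies src into dst, truncating if needed, and always terminates dst.
// dst_size is the full size of the buffer, including room for the '\0'.
// Returns true if the whole string fit. By contrast, strncpy(dst, src, n)
// leaves dst unterminated whenever src has n or more characters.
bool copy_truncating(char* dst, std::size_t dst_size, const char* src) {
    if (dst_size == 0) return false;  // no room even for the terminator
    const std::size_t len = std::strlen(src);
    const std::size_t n = (len < dst_size) ? len : dst_size - 1;
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return n == len;
}

void copy_example() {
    char buf[4];
    assert(!copy_truncating(buf, sizeof buf, "hello"));  // truncated
    assert(std::strcmp(buf, "hel") == 0);
    assert(copy_truncating(buf, sizeof buf, "hi"));      // fits
    assert(std::strcmp(buf, "hi") == 0);
}
```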