What's the maximum length of a source line all compilers are required to accept? Did it change in C++11? If so, what was the old value?
I'm asking this question because I'm doing some heavy preprocessor voodoo (unfortunately, templates won't cut it), and doing so has a tendency to make the lines big very quickly. I want to stay on the safe side, so I won't have to worry about the possibility of compiler X on platform Y rejecting my code because of too long lines.
C++ 2003, Annex B (informative), Implementation quantities (sorry, I don't have C++ 2011 handy):
2) The limits may constrain quantities that include those described below or others. The bracketed number following each quantity is recommended as the minimum for that quantity. However, these quantities are only guidelines and do not determine compliance.
…
Characters in one logical source line [65 536].
You didn't ask about these, but they might be useful, also:
Nesting levels of parenthesized expressions within a full expression [256].
Macro identifiers simultaneously defined in one translation unit [65 536].
Arguments in one macro invocation [256].
Number of characters in an internal identifier or macro name [1 024].
Parameters in one macro definition [256].
Postscript: It is worth noting what "one logical source line" is. A logical source line is what you have after:
Physical source file characters are mapped to the basic source character set
Trigraph sequences (2.3) are replaced by corresponding single-character internal representations
Each instance of a new-line character and an immediately preceding backslash character is deleted
The logical source line is what you have before:
The source file is decomposed into preprocessing tokens
Preprocessing directives are executed and macro invocations are expanded.
[quotes from C++ 2003, 2.1 Phases of Translation]
So, if the OP's concern is that the macros expand to beyond a reasonable line length, my answer is irrelevant. If the OP's concern is that his source code (after dealing with \, \n) might be too long, my answer stands.
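As a concrete illustration (a minimal sketch; the macro name is invented for the example), several physical lines joined by backslash-newline splicing count as a single logical source line, and it is the spliced result that the [65 536] guideline refers to:

// Four physical source lines, but only ONE logical source line after the
// backslash-newline pairs are deleted in translation phase 2; the 65 536
// character guideline applies to the spliced line, not to each physical line.
#define DECLARE_GETTER(type, name) \
    type get_##name() const {      \
        return name##_;            \
    }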
This is more for clarity or a better understanding of the internal workings of compilers: I'm beginning to study compiler design and compiler theory.
Typically, when declaring an array on the stack, its size has to be known at compile time; this much is understood, but it is not always the case.
What I would like to know is: when does this evaluation occur? Does it happen during preprocessing, tokenization, syntax analysis, etc.? Also, does it depend on the particular compiler being used? Finally, do the language standards specify that this evaluation happens at any particular stage of the compiler?
Pseudo-code snippet (C or C++):
int main() {
    int x[5]; // When does the evaluation of the 5 for the array's size take place
              // during the compilation process?
              // Does it take place in the preprocessor, or at normal compilation time?
    return 0;
}
The C standard specifies eight phases of translation:
1. Physical source multibyte characters and trigraph sequences are mapped to characters of the source character set.
2. Each backslash followed by a new-line is deleted (splicing together two lines).
3. The source characters are grouped into preprocessing tokens, and each sequence of white-space characters is replaced by one space, except that new-lines are kept.
4. Preprocessing directives and _Pragma operators are executed, and macro invocations are expanded.
5. Source characters in strings and character constants are converted to the execution character set.
6. Adjacent string literals are concatenated.
7. Each preprocessing token is converted into a grammar token, and white-space characters separating tokens are discarded. The resulting tokens are analyzed and translated (compiled).
8. All external references are resolved (the program is linked).
Resolution of constant array dimensions thus occurs in phase 7. However, the phases are largely conceptual. The phases explain how the C language is understood, not how a compiler must be implemented.
For compilers that produce object modules, the sizes of arrays with static storage duration are necessarily resolved before the object information is written, since an array's size affects data layout, which must be completely described in the object module. Handling of the sizes of arrays with automatic storage duration could theoretically be deferred until the program is actually executing the code that needs them, as is necessarily the case for variable-length arrays. However, it would be wasteful to do so, since constant array sizes are easily handled during compilation, and it is preferable for the necessary values (such as the amount of stack space to reserve when entering a function) to be calculated at compilation time rather than during execution. So we can expect that normal compilers resolve all constant array sizes during compilation (that is, before completing an object module for the translation unit) and conceptually after phase 6.
Pinpointing more precisely where in the translation process array sizes are resolved depends on internal details of the compiler implementation (or of a C implementation generally).
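A minimal C++ sketch of the distinction drawn above (the variable names are mine): a constant array size is a compile-time property of the stack frame layout, while a runtime size can only be resolved during execution.

#include <cstdio>

int main(int argc, char**) {
    constexpr int n = 5;       // constant expression: resolved during compilation
    int fixed[n] = {};         // size, and hence the frame layout, known at compile time

    int m = argc + 1;          // a value known only at run time
    int* dynamic = new int[m]; // this size is resolved only while the program executes
    std::printf("%zu %d %d\n", sizeof fixed / sizeof fixed[0], fixed[0], m);
    delete[] dynamic;
    return 0;
}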
After reading through the comments from various users, I have come to the conclusion that there is nothing specific about this within the standard. C and C++ also differ on whether variable-sized arrays on the stack are permitted at all.
When it comes to compiler design, the preprocessor plays no part in this evaluation. Which stage of the compiler it belongs to depends on the language, and the standards appear to be agnostic about compiler design: the stage at which this evaluation takes place is left up to the implementation of the compiler.
Some C++ compilers may do this during syntax analysis while others may do this during tokenization. So in the end, the only true way to determine when this actually takes place is to know the particular compiler inside and out and to read through its own source code to see how it was designed and to step through the process of the compilation stages or phases.
Thank you everyone for your input and feedback. Please, by all means, correct me if I'm wrong by leaving a comment under this answer.
I am trying to use an #ifndef partway through a one-line setter, and I received this error:
"Error 20 error C2014: preprocessor command must start as first nonwhite space"
I am aware of what the error means; I am just curious why it is like that. Is it a compiler choice? What is the reasoning behind it? That it is easier for the user to notice?
Here is the code if someone is wondering:
inline void SetSomething(int value) { #ifndef XBOX ASSERT(value <= 1); #endif test = value; };
At first C did not have a standard preprocessor; people started out using preprocessing as an external tool. You might note that the # is the same character used for comments in Unix-land shell scripts.
As the language evolved, the preprocessor became integrated with the compiler and more a part of the language proper, but it kept its totally different structure, namely that it is line-oriented while the C and C++ core languages are free-form.
Since then the lines have blurred a bit more. Preprocessing now typically adds #line directives (or the equivalent) for use by the core language compiler, #pragma directives are passed through to the core language compiler, and in the other direction we now have _Pragma (IIRC). Still, the structure is mostly as it was originally. C and C++ are evolved languages, not designed languages.
Taking a look into the standard (section 16, "Preprocessing directives"): starting with # as the first non-whitespace character is what makes a preprocessing directive, by definition.
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints:
The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character.
If you want the most important reason, it's because the standard says so.
If you want to know why the standard says so, it's the easiest way to get the necessary functionality.
Remember that preprocessing and compiling are two potentially completely separate tasks, and the preprocessor has no idea at all about the language of its output.
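Given that rule, here is a hedged sketch of how the setter from the question could be rearranged so that each directive starts its own line (ASSERT, XBOX and test are taken from the question and assumed to be defined elsewhere):

inline void SetSomething(int value)
{
#ifndef XBOX              // the # is now the first non-white-space
    ASSERT(value <= 1);   // character on its line, as required
#endif
    test = value;
}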
This is a question about the C99/C11 (and maybe also C++) preprocessor and its standard compliance.
Let's consider two source files:
/* I'm
* multiline
* comment
*/
and
/* I'm
* multiline
* comment
*/
i_am_a_token;
If we preprocess both files with gcc or clang (several versions were tested), there is a difference: in the first case the preprocessor does not keep the newlines from the multiline comment, while in the second case all newlines are kept.
All of the mentioned standards say (somewhere inside "Translation phases"):
Each comment is replaced by one space character. New-line characters are retained.
Why is there a difference in the handling of multiline comments at the end of a file? And is this behaviour standard-compliant?
The reason is simple: line numbers and error reporting. Since the compiler reports errors with line numbers, it is convenient for line numbers in the pre-processed file to correspond to line numbers in the original file. That's why the lines occupied by the comment are preserved when they are followed by code, whereas they don't have to be preserved at the end of the file.
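A small illustration of that point about error reporting (the file contents and line numbers here are hypothetical): if the newlines inside the comment were not retained, the diagnostic for the faulty line would point at the wrong place in the original file.

/* A long
 * explanatory
 * comment
 */
int f(void) { return undeclared_name; }  // the compiler reports line 5;
                                         // collapsing the comment's newlines
                                         // would make it report line 2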
As for the standards: the standards
C99: ISO/IEC 9899:1999
C11: ISO/IEC 9899:2011
specify the language, preprocessing macros, etc., but they don't specify how the language should be processed. You can see this in the scope definition of C11:
ISO/IEC 9899:2011 does not specify
the mechanism by which C programs are transformed for use by a data-processing system;
which means that preprocessor output is pretty much an internal issue, out of the scope of the standard.
I'm trying to understand Universal Character Names in the C11 standard and found that the N1570 draft of the C11 standard has much less detail than the C++11 standard with respect to Translation Phases 1 and 5 and the formation and handling of UCNs within them. This is what each has to say:
Translation Phase 1
N1570 Draft C11 5.1.1.2p1.1:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
C++11 2.2p1.1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Trigraph sequences (2.4) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
Translation Phase 5
N1570 Draft C11 5.1.1.2p1.5:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; [...]
C++ 2.2p1.5:
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set; [...]
(emphasis was added on differences)
The Questions
1. In the C++11 standard, it is very clear that source file characters not in the basic source character set are converted to UCNs, and that they are treated exactly as would have been a UCN in that same place, with the sole exception of raw strings. Is the same true of C11? When a C11 compiler sees a multi-byte UTF-8 character such as °, does it too translate this to \u00b0 in phase 1, and treat it just as if \u00b0 had appeared there instead?
2. To put it in a different way: at the end of which translation phase, if any, are the following snippets of code transformed into textually equivalent forms for the first time in C11?
const char* hell° = "hell°";
and
const char* hell\u00b0 = "hell\u00b0";
3. If the answer to 2 is "none", then during which translation phase are those two identifiers first understood to refer to the same thing, despite being textually different?
4. In C11, are UCNs in character/string literals also converted in phase 5? If so, why omit this from the draft standard?
5. How are UCNs in identifiers (as opposed to in character/string literals as already mentioned) handled in both C11 and C++11? Are they also converted in phase 5? Or is this something implementation-defined? Does GCC, for instance, print out such identifiers in UCN-coded form, or in actual UTF-8?
Comments turned into an answer
Interesting question!
The C standard can leave more of the conversions unstated because they are implementation-defined (and C has no raw strings to confuse the issue).
What it says in the C standard is sufficient — except that it leaves your question 1 unanswerable.
Q2 has to be 'Phase 5', I think, with the caveat that what becomes equivalent is the token stream.
Q3 is strictly N/A, but Phase 7 is probably the answer.
Q4 is 'yes', and it says so because it mentions 'escape sequences' and UCNs are escape sequences.
Q5 is 'Phase 5' too.
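For what it's worth, here is a small sketch of the Q4/Q5 point, assuming (and this is an assumption, not something either standard requires) that both the source and the execution character sets are UTF-8:

const char a[] = "\u00b0"; // the UCN escape is converted in phase 5
const char b[] = "°";      // under the UTF-8 assumption, both arrays end up
                           // holding the same bytes 0xC2 0xB0 followed by '\0'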
Can the C++11-mandated processes in phases 1 and 5 be taken as compliant within the wording of C11 (putting aside raw strings)?
I think they are effectively the same; the difference arises primarily from the raw literal issue, which is specific to C++. Generally, the C and C++ standards try not to make things gratuitously different, and in particular try to keep the workings of the preprocessor and the low-level character parsing the same in both (which has been easier since C99 added support for C++ // comments, but which evidently got harder with the addition of raw literals to C++11).
One day, I'll have to study the raw literal notations and their implications more thoroughly.
First, please note that this distinction has existed since 1998: UCNs were first introduced in C++98, a new standard at the time (ISO/IEC 14882, 1st edition: 1998), and then made their way into the C99 revision of the C standard. But the C committee (and existing implementers, with their pre-existing implementations) did not feel the C++ way was the only way to achieve the trick, particularly with corner cases and the use of character sets smaller than Unicode, or just different from it; for example, the requirement to ship mapping tables from whatever-supported-encodings to Unicode was a preoccupation for C vendors in 1998.
The C standard (consciously) avoids deciding this, and lets the compiler choose how to proceed. While your reasoning obviously takes place in the context of UTF-8 character sets used for both source and execution, there is a large (and pre-existing) range of different C99/C11 compilers available which use different sets, and the committee felt it should not restrict the implementers too much on this issue. In my experience, most compilers keep the two forms distinct in practice (for performance reasons).
Because of this freedom, some compilers can have the two forms identical after phase 1 (as a C++ compiler shall), while others can leave them distinct as late as phase 7 for the first occurrence of the degree character (the one in the identifier); the second occurrence (the one in the string) ought to be the same after phase 5, assuming the degree character is part of the extended execution character set supported by the implementation.
For the other answers, I won't add anything to Jonathan's.
As for your additional question about whether the more deterministic C++ process is compliant with the C standard: it is clearly a goal for it to be so, and if you find a corner case which shows otherwise (a C++11-compliant preprocessor which would not conform to the C99 and C11 standards), then you should consider asking the WG14 committee about a potential defect.
Obviously, the reverse is not true: it is possible to write a preprocessor whose handling of UCNs complies with C99/C11 but not with the C++ standards; the most obvious difference is with
#define str(t) #t
#define str_is(x, y) const char * x = y " is " str(y)
str_is(hell°, "hell°");
str_is(hell\u00B0, "hell\u00B0");
which a C-compliant preprocessor can render in a way similar to your examples (and most do so), and which as such will have distinct renderings; but I am under the impression that a C++-compliant preprocessor is required to transform it into the (strictly equivalent)
const char* hell° = "hell°" " is " "\"hell\\u00b0\"";
const char* hell\u00b0 = "hell\\u00b0" " is " "\"hell\\u00b0\"";
Last, but not least, I believe not many compilers are fully compliant at this level of detail!
At the risk of asking a question deemed too nit-picky, I have spent a long time trying to justify (as a single example of something that occurs throughout the standard in different contexts) the following definition of an integer literal in §2.14.2 of the C++11 standard, specifically in regards to one detail, the presence of whitespace in the syntax notation itself.
(Note that this example - the definition of an integer literal - is not the point of my question. The point of my question is to ask about the syntax description notation used by the C++ standard itself, specifically in regards to whitespace between grammatical category names. The example I give here - the definition of an integer literal - is specifically chosen only because it acts as an example that is simple and clear-cut.)
(Abbreviated for concision, from §2.14.2):
integer-literal:
decimal-literal integer-suffix_opt
decimal-literal:
nonzero-digit
decimal-literal digit
(with nonzero-digit and digit as expected, [0] 1 ... 9). (Note: The above text is all in italics in the standard.)
This all makes sense to me, assuming that the SPACE between the syntax category descriptives decimal-literal and digit is understood to NOT be present in the actual source code, but is only present in the syntax description itself as it appears here in section §2.14.2.
This convention - placing a space between category descriptives within the notation, where it is understood that the space is not to be present in the source code - is used in other places in the specification. The example here is just a clear-cut case where the space is clearly not supposed to be present in the source code. (See addendum to this question for counterexamples from the standard where whitespace or other separator/s must be present, or is optional, between category descriptives when those category descriptives are replaced by actual tokens in the source code.)
Again, at the risk of being nit-picky, I cannot find anywhere in the standard a statement of convention that spaces are NOT to be present in the source code when interpreting notation such as in this example.
The standard does discuss notational convention in §1.6.1 (and thereafter). The only relevant text that I can find regarding this is:
In the syntax notation used in this International Standard, syntactic
categories are indicated by italic type, and literal words and
characters in constant width type. Alternatives are listed on separate
lines except in a few cases where a long set of alternatives is marked
by the phrase “one of.”
I would not be so nit-picky; however, I find the notation used within the standard to be somewhat tricky, so I would like to be clear on all of the details. I appreciate anyone willing to take the time to fill me in on this.
ADDENDUM In response to comments in which a claim is made similar to "it's obvious that whitespace should not be included in the final source code, so there's no need for the standard to explicitly state this": I have chosen a trivial example in this question, where it is obvious. There are many cases in the standard where it isn't obvious without a priori knowledge of the language (in my opinion), such as §8.0.4, discussing "const" and "volatile":
cv-qualifier-seq:
cv-qualifier cv-qualifier-seq_opt
... Note the opposite assumption here (whitespace, or another separator or separators, is required in the final source code), but that's not possible to deduce from the syntax notation itself.
There are also cases where a space is optional, such as:
noptr-abstract-declarator:
noptr-abstract-declarator_opt parameters-and-qualifiers
(In this example, to make a point, I won't give the section number or paraphrase what is being discussed; I'll just ask if it's obvious from the grammar notation itself that, in this context, whitespace in the final source code is optional between the tokens.)
I suspect that the comments along these lines - "it's obvious, so that's what it must be" - are the result of the fact that the example I've chosen is so obvious. That's exactly why I chose the example.
§2.7.1
There are five kinds of tokens: identifiers, keywords, literals, operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens.
So, if a literal is a token, and whitespace serves to separate tokens, then digits with a space between them would be interpreted as two separate tokens and therefore cannot be part of the same literal.
I'm reasonably certain there is no more direct explanation of this fact in the standard.
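A concrete example of that consequence (a trivial sketch):

int a = 123;      // a single pp-number token, one integer-literal
// int b = 12 3;  // ill-formed: "12" and "3" are two separate tokens, so no
                  // production of the grammar can combine them into one literal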
The notation used is similar enough to typical BNF that they take many of the same general conventions for granted, including the fact that whitespace in the notation has no significance beyond separating the tokens of the BNF itself. If and when whitespace has significance in the source code beyond separating tokens, they include notation to specify it directly. For example, for most preprocessing directives, the new-line is specified directly:
# ifdef identifier new-line group_opt
or:
# include <h-char-sequence> new-line
The blame for that probably goes back to the Algol 68 standard, which went so far overboard in its attempts at precisely specifying syntax that it was essentially impossible for anybody to read without weeks of full-time study [1]. Since then, any more than the most cursory explanation of the syntax description language leads to rejection on the basis that it's too much like Algol 68 and will undoubtedly fail because it's too formal and nobody will ever read or understand it.
[1] How could it be that bad, you ask? It basically went like this: they started with a formal English description of a syntax description language. That wasn't used to define Algol 68, though -- it was used to specify (even more precisely) another syntax description language. That second syntax description language was then used to specify the syntax of Algol 68 itself. So, you had to learn two separate syntax description languages before you could start to read the Algol 68 syntax itself at all. As you can undoubtedly guess, almost nobody ever did.
As you say, the standard says:
literal words and characters in constant width type
So, if a literal space were to be included in a rule, it would have to be rendered in a constant width type. Close examination of the standard will reveal that the space in the production you refer to is narrower than the constant width type. (Also your attempt to quote the standard is a misrepresentation because it renders in constant-width type that which should be rendered in italics, with a consequent semantic change.)
Ok, that was the "aspiring language lawyer" answer; furthermore, it doesn't really work because it fails on all the productions which are of the form:
One of:
0 1 2 3 4 5 6 7 8 9
I think, in reality, the answer is that whitespace is not part of the formal grammar, because it serves only to separate tokens; furthermore, that statement is mostly true of the grammar itself, whose tokens are separated by whitespace without that whitespace being a token, except that indentation in the grammar matters, unlike indentation in a program.
Addendum to answer the addendum
It's not actually true that const and volatile need to be separated by whitespace. They simply need to be separate tokens. Example:
#define A(x)x
A(const)A(volatile)A(int)A(x)A(;)
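// After macro expansion, the line above is the token sequence
// const volatile int x ;  -- five separate tokens, even though no
// whitespace character appears anywhere on that source line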
Again, more seriously, Chapter 2 (with particular reference to 2.2 and 2.5, but you have to read the entire text) describes how the program text is processed in order to produce a stream of tokens. All of the rules in which you claim whitespace must be ignored are in this part of the grammar, and all of the rules in which you claim whitespace might be required are not.
These are really two separate grammars, but the lexical grammar is necessarily incomplete because you need to consider the operation of the preprocessor in order to apply it.
I believe that everything I said can be gleaned from the standard. Here are some excerpts:
2.2(3) The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments)… The process of dividing a source file’s characters into preprocessing tokens is context-dependent.
…
2.2(7) White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
I think that all this makes it clear that there are two grammars, one lexical -- that is, it produces a lexeme (token) from a sequence of graphemes (characters) -- and the other syntactic -- that is, it produces an abstract syntax tree from a sequence of lexemes (tokens). In neither case (with a small exception, which I'll get to in a minute) is whitespace considered anything other than something which stops two lexemes from running into each other if the lexical grammar would otherwise allow that. (See the algorithm in 2.5(3).)
C++ is not syntactically pretty, so there are almost always exceptions. One of these, inherited from C, is the difference between:
#define A(X)(X)
and
#define A (X)(X)
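The practical difference can be sketched like this (the macro names F and O are mine, chosen to avoid redefining A):

#define F(X)(X)   // function-like macro: F(1) expands to (1)
#define O (X)(X)  // object-like macro whose replacement list starts after the
                  // space: O(1) expands to (X)(X)(1)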
Preprocessing directives have their own parsing rules, and this one is typified by the definition:
lparen:
a ( character not immediately preceded by white-space
This, I would say, is the exception that proves the rule [Note 1]. The fact that it is necessary to say that this ( is not preceded by white-space shows that the normal use of the token ( in a syntactic rule does not say anything about its blancospatial context.
So, to paraphrase Ray Cummings (not Albert Einstein, as is sometimes claimed), "time and white-space are all that separate one token from another." [Note 2]
[Note 1] I use the phrase here in its original legal sense, as per Cicero.
[Note 2]:
"Time," said George, "why I can give you a definition of time. It's what keeps everything from happening at once."
A ripple of laughter went about the little group of men.
"Quite so," agreed the Chemist. "And, gentlemen, that's not so funny as it sounds. As a matter of fact, it is really not a bad scientific definition. Time and space are all that separate one event from another…
-- From The man who mastered time, by Ray Cummings, 1929, Ace Books. See first page, in Google books
The Standard actually has two separate grammars.
The preprocessor grammar, described in sections 2 and 16, defines how a sequence of source characters is converted to a sequence of preprocessing tokens and whitespace characters, in translation phases 1-6. In some of these phases and parts of this grammar, whitespace is significant.
Whitespace characters which are not part of preprocessing tokens stop being significant after translation phase 4. The Standard explicitly says at the start of translation phase 7 to discard whitespace characters between preprocessing tokens.
The language grammar defines how a sequence of tokens (converted from preprocessing tokens) are syntactically and semantically interpreted in translation phase 7. There is no such thing as whitespace in this grammar. (By this point, ' ' is a character-literal just like 'c' is.)
In both grammars, the space between grammar components visible in the Standard has nothing to do with source or execution whitespace characters, it's just there to make the Standard legible. When the preprocessor grammar depends on whitespace, it spells it out with words, for example:
c-char:
any member of the source character set except the single-quote ', backslash \, or new-line character
escape-sequence
universal-character-name
and
control-line:
...
# define identifier lparen identifier-list_opt ) replacement-list new-line
...
lparen:
a ( character not immediately preceded by white-space
So there may not be whitespace between digits of an integer-literal because the preprocessor grammar does not allow it.
One other important rule here is from C++11 2.5p3:
If the input stream has been parsed into preprocessing tokens up to a given character:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. ...
Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
So there must be whitespace between const and volatile tokens because otherwise, the longest-token-possible rule would convert that to a single identifier token constvolatile.
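A final sketch of that last point:

const volatile int cv = 0;     // two keyword tokens, separated by white-space
// constvolatile int bad = 0;  // ill-formed: maximal munch lexes a single
                               // identifier "constvolatile", not two keywords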