C++ is a whitespace-independent language: exceptions to the rule

This Wikipedia page defines C++ as a "whitespace-independent language". While that is mostly true, as with all languages there are exceptions to the rule. The only one I can think of at the moment is this:
vector<vector<double> >
The space is required here; otherwise the compiler interprets the >> as a single operator (right shift, overloaded as the stream operator) rather than two closing angle brackets. What other exceptions are around? It would be interesting to compile a list of them.

Following that logic, you can use any two-character lexeme to produce such "exceptions" to the rule. For example, += and + = would be interpreted differently. I wouldn't call them exceptions though. In C++ in many contexts "no space at all" is quite different from "one or more spaces". When someone says that C++ is space-independent they usually mean that "one space" in C++ is typically the same as "more than one space".
This is reflected in the language specification, which states (see 2.1/1) that at phase 3 of translation the implementation is allowed to replace sequences of multiple whitespace characters with one space character.

The syntax and semantic rules for parsing C++ are indeed quite complex (I'm trying to be nice; I think one is authorized to say "a mess"). Proof of this fact is that for YEARS different compiler authors were arguing over what was legal C++ and what was not.
In C++ for example you may need to parse an unbounded number of tokens before deciding what is the semantic meaning of the first of them (the dreaded "most vexing parse rule" that also often bites newcomers).
Your objection IMO however doesn't really make sense... for example ++ has a different meaning from + +, and in Pascal begin is not the same as beg in. Does this make Pascal a space-dependent language? Is there any space-independent language (except brainf*ck)?
The only problem about C++03 >>/> > is that this mistake when typing was very common so they decided to add even more complexity to the language definition to solve this issue in C++11.
The cases in which one whitespace character instead of several can make a difference (the thing that distinguishes space-dependent languages, and that plays no role in the > > / >> case anyway) are indeed few:
inside double-quoted strings (but everyone wants that, and every language I know of that supports string literals does the same)
inside single quotes (the same, even if something that not many C++ programmers know is that there can be more than one char inside single quotes)
in preprocessor directives, because they work on a line basis (newline is a whitespace character, and it makes a difference there)
in line continuation, as noticed by stefanv: to continue a line you put a backslash right before a newline, and the language ignores both characters (you can do this even in the middle of an identifier, even if the typical use is just to make long preprocessor macros readable). If you put other whitespace characters after the backslash and before the newline, however, the line continuation is not recognized (some compilers accept it anyway and simply check whether the last non-whitespace character of the line is a backslash). Line continuation can also be specified using the trigraph equivalent ??/ of the backslash (any reasonable compiler should IMO emit a warning when finding a trigraph, as it most probably was not intended by the programmer).
inside single-line comments //, because there too, adding a newline (as opposed to other whitespace) in the middle of a comment makes a difference

Like it or not, macros are also part of C++, and multi-line macros must be continued with a backslash followed immediately by the EOL; no whitespace may appear between the backslash and the EOL.
Not a big issue, but still a whitespace exception.

This was because of limitations in the parser; as of C++11 it is no longer the case.
The reason being that it was hard to parse >> as the end of a nested template compared to operator >>.

While C++03 did interpret >> as the shift operator in all cases (an operator that is overloaded for streams, but it's still the shift operator), the language parser in C++11 will now treat >> as closing two template argument lists when that reading is reasonable.

Nested template parameters: set<set<int> >.
Character literals: ' '.
String literals: " ".
Juxtaposition of keywords and identifiers: else return x;, void foo(){}, etc.


Why does C++ allow a declaration with no space between the type and a parenthesized variable name?

A previous C++ question asked why int (x) = 0; is allowed. However, I noticed that even int(x) = 0; is allowed, i.e. without a space before the (x). I find the latter quite strange, because it causes things like this:
using Oit = std::ostream_iterator<int>;
Oit bar(std::cout);
*bar = 6; // * is optional
*Oit(bar) = 7; // * is NOT optional!
where the final line is because omitting the * makes the compiler think we are declaring bar again and initializing to 7.
Am I interpreting this correctly, that int(x) = 0; is indeed equivalent to int x = 0, and Oit(bar) = 7; is indeed equivalent to Oit bar = 7;? If yes, why specifically does C++ allow omitting the space before the parentheses in such a declaration + initialization?
(My guess is that the C++ compiler does not care about any space before a left paren, since it treats that parenthesized expression as its own "token" [excuse me if I'm butchering the terminology], i.e. in all cases qux(baz) is equivalent to qux (baz).)
It is allowed in C++ because it is allowed in C and requiring the space would be an unnecessary C-compatibility breaking change. Even setting that aside, it would be surprising to have int (x) and int(x) behave differently, since generally (with few minor exceptions) C++ is agnostic to additional white-space as long as tokens are properly separated. And ( (outside a string/character literal) is always a token on its own. It can't be part of a token starting with int(.
In C int(x) has no other potential meaning for which it could be confused, so there is no reason to require white-space separation at all. C also is generally agnostic to white-space, so it would be surprising there as well to have different behavior with and without it.
One requirement when defining the syntax of a language is that its elements can be separated. According to the C++ syntax rules, a space separates things. But also according to the C++ syntax rules, parentheses separate things.
When C++ is compiled, the first step is parsing. And one of the first steps of parsing is separating all the elements of the language. Often this step is called tokenizing or lexing. But this is just the technical background. The user does not have to know it; he or she only has to know that things in C++ must be clearly separated from each other, so that there is a sequence "*", "Oit", "(", "bar", ")", "=", "7", ";".
As explained, the rule that a parenthesis always separates is established at a very low level of the compiler. The compiler determines, even before knowing what the purpose of the parenthesis is, that a parenthesis separates things. Therefore an extra space would be redundant.
If you ever use parser generators, you will see that most of them simply ignore spaces: once the lexer has produced the list of tokens, the spaces no longer exist. See the sequence above; there are no spaces any more. So you have no way to specify something that explicitly requires a space.

Escaping a single character using a character class

In a regular expression, the normal way to use a special character (\^$.|?*+()[]{}) as a literal is, of course, to escape it with a backslash:
\+\.
But I have occasionally seen code that uses a character class to achieve the same thing:
[+][.]
Now obviously that isn't the primary purpose of a character class, which is normally used to match one of several characters. While the second example uses more keystrokes, you could argue that it's also more readable.
So is there any good reason not do this (performance or otherwise)? Or does it simply come down to personal stylistic preference?
I know this isn't an earth-shattering issue—it's just a little question that has been niggling away at the back of my mind for a while, and I've not been able to find any specific mention of it elsewhere.
I tend to view using a character class as a means of escaping a single character as a side-effect of character classes, which is not their primary purpose. The main reason for a character class is to represent a range of characters, not just a single character.
So, one possibly negative thing about the pattern [+][.] is that it might leave a future reader of your regex wondering if you did not intend to include more than one character in the character class. And perhaps, given certain conditions, that reader might even change the pattern to "fix" it, by adding characters to the class which he perceives as having been wrongfully omitted.
There might be a slight performance advantage to using \+ over [+], in that the latter might require the regex engine to build a character class (with just one character in it). But I would expect any performance difference to be minimal.

In the C++ standard, where does it indicate the spacing protocol for the replacement of category descriptives by the source code it represents?

At the risk of asking a question deemed too nit-picky, I have spent a long time trying to justify (as a single example of something that occurs throughout the standard in different contexts) the following definition of an integer literal in §2.14.2 of the C++11 standard, specifically in regards to one detail, the presence of whitespace in the syntax notation itself.
(Note that this example - the definition of an integer literal - is not the point of my question. The point of my question is to ask about the syntax description notation used by the C++ standard itself, specifically in regards to whitespace between grammatical category names. The example I give here - the definition of an integer literal - is specifically chosen only because it acts as an example that is simple and clear-cut.)
(Abbreviated for concision, from §2.14.2):
integer-literal:
    decimal-literal integer-suffix_opt
decimal-literal:
    nonzero-digit
    decimal-literal digit
(with nonzero-digit and digit as expected, [0] 1 ... 9). (Note: The above text is all in italics in the standard.)
This all makes sense to me, assuming that the SPACE between the syntax category descriptives decimal-literal and digit is understood to NOT be present in the actual source code, but is only present in the syntax description itself as it appears here in section §2.14.2.
This convention - placing a space between category descriptives within the notation, where it is understood that the space is not to be present in the source code - is used in other places in the specification. The example here is just a clear-cut case where the space is clearly not supposed to be present in the source code. (See addendum to this question for counterexamples from the standard where whitespace or other separator/s must be present, or is optional, between category descriptives when those category descriptives are replaced by actual tokens in the source code.)
Again, at the risk of being nit-picky, I cannot find anywhere in the standard a statement of convention that spaces are NOT to be present in the source code when interpreting notation such as in this example.
The standard does discuss notational convention in §1.6.1 (and thereafter). The only relevant text that I can find regarding this is:
In the syntax notation used in this International Standard, syntactic categories are indicated by italic type, and literal words and characters in constant width type. Alternatives are listed on separate lines except in a few cases where a long set of alternatives is marked by the phrase "one of."
I would not be so nit-picky; however, I find the notation used within the standard to be somewhat tricky, so I would like to be clear on all of the details. I appreciate anyone willing to take the time to fill me in on this.
ADDENDUM In response to comments in which a claim is made similar to "it's obvious that whitespace should not be included in the final source code, so there's no need for the standard to explicitly state this": I have chosen a trivial example in this question, where it is obvious. There are many cases in the standard where it isn't obvious without a priori knowledge of the language (in my opinion), such as §8.0.4 discussing "const" and "volatile":
cv-qualifier-seq:
    cv-qualifier cv-qualifier-seq_opt
... Note the opposite assumption here (whitespace, or another separator or separators, is required in the final source code), but that's not possible to deduce from the syntax notation itself.
There are also cases where a space is optional, such as:
noptr-abstract-declarator:
    noptr-abstract-declarator_opt parameters-and-qualifiers
(In this example, to make a point, I won't give the section number or paraphrase what is being discussed; I'll just ask if it's obvious from the grammar notation itself that, in this context, whitespace in the final source code is optional between the tokens.)
I suspect that the comments along these lines - "it's obvious, so that's what it must be" - are the result of the fact that the example I've chosen is so obvious. That's exactly why I chose the example.
§2.7.1
There are five kinds of tokens: identifiers, keywords, literals, operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, "white space"), as described below, are ignored except as they serve to separate tokens.
So, if a literal is a token, and whitespace serves to separate tokens, a space in between the digits of a literal would be interpreted as producing two separate tokens, which therefore cannot be part of the same literal.
I'm reasonably certain there is no more direct explanation of this fact in the standard.
The notation used is similar enough to typical BNF that they take many of the same general conventions for granted, including the fact that whitespace in the notation has no significance beyond separating the tokens of the BNF itself -- that if/when whitespace has significance in the source code beyond separating tokens, they'll include notation to specify it directly (e.g., for most preprocessing directives, the new-line is specified directly:
# ifdef identifier new-line group_opt
or:
# include <h-char-sequence> new-line
The blame for that probably goes back to the Algol 68 standard, which went so far overboard in its attempts at precisely specifying syntax that it was essentially impossible for anybody to read without weeks of full-time study1. Since then, any more than the most cursory explanation of the syntax description language leads to rejection on the basis that it's too much like Algol 68 and will undoubtedly fail because it's too formal and nobody will ever read or understand it.
1 How could it be that bad you ask? It basically went like this: they started with a formal English description of a syntax description language. That wasn't used to define Algol 68 though -- it was used to specify (even more precisely) another syntax description language. That second syntax description language was then used to specify the syntax of Algol 68 itself. So, you had to learn two separate syntax description languages before you could start to read the Algol 68 syntax itself at all. As you can undoubtedly guess, almost nobody ever did.
As you say, the standard says:
literal words and characters in constant width type
So, if a literal space were to be included in a rule, it would have to be rendered in a constant width type. Close examination of the standard will reveal that the space in the production you refer to is narrower than the constant width type. (Also your attempt to quote the standard is a misrepresentation because it renders in constant-width type that which should be rendered in italics, with a consequent semantic change.)
Ok, that was the "aspiring language lawyer" answer; furthermore, it doesn't really work because it fails on all the productions which are of the form:
One of:
0 1 2 3 4 5 6 7 8 9
I think, in reality, the answer is that whitespace is not part of the formal grammar, because it serves only to separate tokens; furthermore, that statement is mostly true of the grammar itself, whose tokens are separated by whitespace without that whitespace being a token, except that indentation in the grammar matters, unlike indentation in a program.
Addendum to answer the addendum
It's not actually true that const and volatile need to be separated by whitespace. They simply need to be separate tokens. Example:
#define A(x)x
A(const)A(volatile)A(int)A(x)A(;)
Again, more seriously, Chapter 2 (with particular reference to 2.2 and 2.5, but you have to read the entire text) describe how the program text is processed in order to produce a stream of tokens. All of the rules in which you claim whitespace must be ignored are in this part of the grammar, and all of the rules in which you claim whitespace might be required are not.
These are really two separate grammars, but the lexical grammar is necessarily incomplete because you need to consider the operation of the preprocessor in order to apply it.
I believe that everything I said can be gleaned from the standard. Here are some excerpts:
2.2(3) The source file is decomposed into preprocessing tokens (2.5) and sequences of white-space characters (including comments)… The process of dividing a source file’s characters into preprocessing tokens is context-dependent.
…
2.2(7) White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. (2.7). The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
I think that all this makes it clear that there are two grammars, one lexical -- that is, it produces a lexeme (token) from a sequence of graphemes (characters) -- and the other syntactic -- that is, it produces an abstract syntax tree from a sequence of lexemes (tokens). In neither case (with a small exception, which I'll get to in a minute) is whitespace considered anything other than something which stops two lexemes from running into each other if the lexical grammar would otherwise allow that. (See the algorithm in 2.5(3).)
C++ is not syntactically pretty, so there are almost always exceptions. One of these, inherited from C, is the difference between:
#define A(X)(X)
and
#define A (X)(X)
Preprocessing directives have their own parsing rules, and this one is typified by the definition:
lparen:
  a ( character not immediately preceded by white-space
This, I would say, is the exception that proves the rule [Note 1]. The fact that it is necessary to say that this ( is not preceded by white-space shows that the normal use of the token ( in a syntactic rule does not say anything about its blancospatial context.
So, to paraphrase Ray Cummings (not Albert Einstein, as is sometimes claimed), "time and white-space are all that separate one token from another." [Note 2]
[Note 1] I use the phrase here in its original legal sense, as per Cicero.
[Note 2]:
"Time," said George, "why I can give you a definition of time. It's what keeps everything from happening at once."
A ripple of laughter went about the little group of men.
"Quite so," agreed the Chemist. "And, gentlemen, that's not so funny as it sounds. As a matter of fact, it is really not a bad scientific definition. Time and space are all that separate one event from another…
-- From The man who mastered time, by Ray Cummings, 1929, Ace Books. See first page, in Google books
The Standard actually has two separate grammars.
The preprocessor grammar, described in sections 2 and 16, defines how a sequence of source characters is converted to a sequence of preprocessing tokens and whitespace characters, in translation phases 1-6. In some of these phases and parts of this grammar, whitespace is significant.
Whitespace characters which are not part of preprocessing tokens stop being significant after translation phase 4. The Standard explicitly says at the start of translation phase 7 to discard whitespace characters between preprocessing tokens.
The language grammar defines how a sequence of tokens (converted from preprocessing tokens) are syntactically and semantically interpreted in translation phase 7. There is no such thing as whitespace in this grammar. (By this point, ' ' is a character-literal just like 'c' is.)
In both grammars, the space between grammar components visible in the Standard has nothing to do with source or execution whitespace characters, it's just there to make the Standard legible. When the preprocessor grammar depends on whitespace, it spells it out with words, for example:
c-char:
    any member of the source character set except the single-quote ', backslash \, or new-line character
    escape-sequence
    universal-character-name
and
control-line:
    ...
    # define identifier lparen identifier-list_opt ) replacement-list new-line
    ...
lparen:
    a ( character not immediately preceded by white-space
So there may not be whitespace between digits of an integer-literal because the preprocessor grammar does not allow it.
One other important rule here is from C++11 2.5p3:
If the input stream has been parsed into preprocessing tokens up to a given character:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. ...
Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessor token by itself and not as the first character of the alternative token <:.
Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
So there must be whitespace between const and volatile tokens because otherwise, the longest-token-possible rule would convert that to a single identifier token constvolatile.

Is the backslash acceptable in C and C++ #include directives?

There are two path separators in common use: the Unix forward-slash and the DOS backslash. Rest in peace, Classic Mac colon. If used in an #include directive, are they equal under the rules of the C++11, C++03, and C99 standards?
C99 says (§6.4.7/3):
If the characters ', \, ", //, or /* occur in the sequence between the < and > delimiters, the behavior is undefined. Similarly, if the characters ', \, //, or /* occur in the sequence between the " delimiters, the behavior is undefined.
(footnote: Thus, sequences of characters that resemble escape sequences cause undefined behavior.)
C++03 says (§2.8/2):
If either of the characters ’ or \, or either of the character sequences /* or // appears in a q-char-sequence or a h-char-sequence, or the character " appears in a h-char-sequence, the behavior is undefined.
(footnote: Thus, sequences of characters that resemble escape sequences cause undefined behavior.)
C++11 says (§2.9/2):
The appearance of either of the characters ’ or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.
(footnote: Thus, a sequence of characters that resembles an escape sequence might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation.)
Therefore, although any compiler might choose to support a backslash in a #include path, it is unlikely that any compiler vendor won't support forward slash, and backslashes are likely to trip some implementations up by virtue of forming escape codes. (Edit: apparently MSVC previously required backslash. Perhaps others on DOS-derived platforms were similar. Hmmm… what can I say.)
C++11 seems to loosen the rules, but "conditionally supported" is not meaningfully better than "causes undefined behavior." The change does more to reflect the existence of certain popular compilers than to describe a portable standard.
Of course, nothing in any of these standards says that there is such a thing as paths. There are filesystems out there with no paths at all! However, many libraries assume pathnames, including POSIX and Boost, so it is reasonable to want a portable way to refer to files within subdirectories.
Forward slash is the correct way; the preprocessor will do whatever it takes on each platform to get to the correct file.
It depends on what you mean by "acceptable".
There are two senses in which slashes are acceptable and backslashes are not.
If you're writing C99, C++03, or C1x, backslashes are undefined, while slashes are legal, so in this sense, backslashes are not acceptable.
But this is irrelevant for most people. If you're writing C++1x, where backslashes are conditionally-supported, and the platform you're coding for supports them, they're acceptable. And if you're writing an "extended dialect" of C99/C++03/C1x that defines backslashes, same deal. And, more importantly, this notion of "acceptable" is pretty meaningless in most cases anyway. None of the C/C++ standards define what slashes mean (or what backslashes mean when they're conditionally-supported). Header names are mapped to source files in an implementation-defined manner, period. If you've got a hierarchy of files, and you're asking whether to use backslashes or slashes to refer to them portably in #include directives, the answer is: neither is portable. If you want to write truly portable code, you can't use hierarchies of header files—in fact, arguably, your best bet is to write everything in a single source file, and not #include anything except standard headers.
However, in the real world, people often want "portable-enough", not "strictly portable". The POSIX standard mandates what slashes mean, and even beyond POSIX, most modern platforms—including Win32 (and Win64), the cross-compilers for embedded and mobile platforms like Symbian, etc.—treat slashes the POSIX way, at least as far as C/C++ #include directives. Any platform that doesn't, probably won't have any way for you to get your source tree onto it, process your makefile/etc., and so on, so #include directives will be the least of your worries. If that's what you care about, then slashes are acceptable, but backslashes are not.
Backslash is undefined behavior, and even with a slash you have to be careful. The C99 standard states:
If the characters ', \, ", //, or /* occur in the sequence between the < and > delimiters, the behavior is undefined. Similarly, if the characters ', \, //, or /* occur in the sequence between the " delimiters, the behavior is undefined.
Always use forward slashes - they work on more platforms. Backslash technically causes undefined behaviour in C++03 (2.8/2 in the standard).
The standard says for #include that it:
searches a sequence of implementation-defined places for a header identified uniquely by the specified sequence between the delimiters, and causes the replacement of that directive by the entire contents of the header. How the places are specified or the header identified is implementation-defined.
Note the last sentence.

What exactly is a token, in relation to parsing?

I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help.
So: I am opening a text file with text in it, all words lowercased. How do I go about checking whether it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens, or, what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, (int, float, 5.523, const) will be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.