why does C++ allow a declaration with no space between the type and a parenthesized variable name? [duplicate] - c++

This question already has an answer here:
Is white space considered a token in C like languages?
(1 answer)
Closed 8 months ago.
A previous C++ question asked why int (x) = 0; is allowed. However, I noticed that even int(x) = 0; is allowed, i.e. without a space before the (x). I find the latter quite strange, because it causes things like this:
using Oit = std::ostream_iterator<int>;
Oit bar(std::cout);
*bar = 6; // * is optional
*Oit(bar) = 7; // * is NOT optional!
where the final line is because omitting the * makes the compiler think we are declaring bar again and initializing to 7.
Am I interpreting this correctly, that int(x) = 0; is indeed equivalent to int x = 0, and Oit(bar) = 7; is indeed equivalent to Oit bar = 7;? If yes, why specifically does C++ allow omitting the space before the parentheses in such a declaration + initialization?
(my guess is because the C++ compiler does not care about any space before a left paren, since it treats that parenthesized expression as it's own "token" [excuse me if I'm butchering the terminology], i.e. in all cases, qux(baz) is equivalent to qux (baz))

It is allowed in C++ because it is allowed in C and requiring the space would be an unnecessary C-compatibility breaking change. Even setting that aside, it would be surprising to have int (x) and int(x) behave differently, since generally (with few minor exceptions) C++ is agnostic to additional white-space as long as tokens are properly separated. And ( (outside a string/character literal) is always a token on its own. It can't be part of a token starting with int(.
In C int(x) has no other potential meaning for which it could be confused, so there is no reason to require white-space separation at all. C also is generally agnostic to white-space, so it would be surprising there as well to have different behavior with and without it.

One requirement when defining the syntax of a language is that elements of the language can be separated. According to the C++ syntax rules, a space separates things. But also according to the C++ syntax rules, parentheses also separate things.
When C++ is compiled, the first step is the parsing. And one of the first steps of the parsing is separating all the elements of the language. Often this step is called tokenizing or lexing. But this is just the technical background. The user does not have to know this. He or she only has to know that things in C++ must be clearly separted from each others, so that there is a sequence "*", "Oit", "(", "bar", ")", "=", "7", ";".
As explained, the rule that the parenthesis always separates is established on a very low level of the compiler. The compiler determines even before knowing what the purpose of the parenthesis is, that a parenthesis separates things. And therefore an extra space would be redundant.
When you ever use parser generators, you will see that most of them just ignore spaces. That means, when the lexer has produced the list of tokens, the spaces do not exist any more. See above in the list. There are no spaces any more. So you have no chance to specify something that explicitly requires a space.

Related

C++ single quotes in the middle of a number [duplicate]

As of C++14, thanks to n3781 (which in itself does not answer this question) we may write code like the following:
const int x = 1'234; // one thousand two hundred and thirty four
The aim is to improve on code like this:
const int y = 100000000;
and make it more readable.
The underscore (_) character was already taken in C++11 by user-defined literals, and the comma (,) has localisation problems — many European countries bafflingly† use this as the decimal separator — and conflicts with the comma operator, though I do wonder what real-world code could possibly have been broken by allowing e.g. 1,234,567.
Anyway, a better solution would seem to be the space character:
const int z = 1 000 000;
These adjacent numeric literal tokens could be concatenated by the preprocessor just as are string literals:
const char x[5] = "a" "bc" "d";
Instead, we get the apostrophe ('), not used by any writing system I'm aware of as a digit separator.
Is there a reason that the apostrophe was chosen instead of a simple space?
† It's baffling because all of those languages, within text, maintain the notion of a comma "breaking apart" an otherwise atomic sentence, with a period functioning to "terminate" the sentence — to me, at least, this is quite analogous to a comma "breaking apart" the integral part of a number and a period "terminating" it ready for the fractional input.
There is a previous paper, n3499, which tell us that although Bjarne himself suggested spaces as separators:
While this approach is consistent with one common typeographic style, it suffers from some compatibility problems.
It does not match the syntax for a pp-number, and would minimally require extending that syntax.
More importantly, there would be some syntactic ambiguity when a hexadecimal digit in the range [a-f] follows a space. The preprocessor would not know whether to perform symbol substitution starting after the space.
It would likely make editing tools that grab "words" less reliable.
I guess the following example is the main problem noted:
const int x = 0x123 a;
though in my opinion this rationale is fairly weak. I still can't think of a real-world example to break it.
The "editing tools" rationale is even worse, since 1'234 breaks basically every syntax highlighter known to mankind (e.g. that used by Markdown in the above question itself!) and makes updated versions of said highlighters much harder to implement.
Still, for better or worse, this is the rationale that led to the adoption of apostrophes instead.
The obvious reason for not using white space is that a new line is also
white space, and that C++ treats all white space identically. And off
hand, I don't know of any language which accepts arbitrary white space
as a separator.
Presumably, Unicode 0xA0 (non-breaking space) could be used—it is
the most widely used solution when typesetting. I see two problems with
that, however: first, it's not in the basic character set, and second,
it's not visually distinctive; you can't see that it isn't a space by
just looking at the text in a normal editor.
Beyond that, there aren't many choices. You can't use the comma, since
that is already a legal token (and something like 1,234 is currently
legal C++, with the meaning 234). And in a context where it could occur
in legal code, e.g. a[1,234]. While I can't quite imagine any real
code actually using this, there is a basic rule that no legal program,
regardless how absurd, should silently change semantics.
Similar considerations mean that _ can't be used either; if there is a
#define _234 * 2, then a[1_234] would silently change the meaning of
the code.
I can't say that I'm particularly pleased with the choice of ', but it
does have the advantage of being used in continental Europe, at least in
some types of texts. (I seem to remember having seen it in German, for
example, although in typical running text, German, like most other
languages, will use a point or a non breaking space. But maybe it was
Swiss German.) The problem with ' is parsing; the sequence '1' is
already legal, as is '123'. So something like 1'234 could be a 1,
followed by the start of a character constant; I'm not sure how far you
have to look-ahead to make the decision. There is no sequence of legal
C++ in which an integral constant can be followed by a character
constant, so there's no problem with breaking legal code, but it means
that lexical scanning suddenly becomes very context dependent.
(With regards to your comment: there is no logic in the choice of a
decimal or a thousands separator. A decimal separator, for example, is
certainly not a full stop. They are just arbitrary conventions.)
From wiki, we have a nice example:
auto floating_point_literal = 0.000'015'3;
Here, we have the . operator and then if another operator would be to be met, my eyes would wait for something visible, like a comma or something, not a whitespace.
So an apostrophe does much better here than a whitespace would do.
With whitespaces it would be
auto floating_point_literal = 0.000 015 3;
which doesn't feel as right as the case with the apostrophes.
In the same spirit of Albert Renshaw's answer, I think that the apostrophe is more clear than the space the Lightness Races in Orbit proposes.
type a = 1'000'000'000'000'000'544'445'555;
type a = 1 000 000 000 000 000 544 445 555;
Space is used for many things, like the strings concatenation the OP mentions, unlike the apostrophe, which in this case makes it clear for someone that is used separating the digits.
When the lines of code become many, I think that this will improve readability, but I doubt that is the reason they choose it.
About the spaces, it might worth taking a look at this C question, which says:
The language doesn't allow int i = 10 000; (an integer literal is one token, the intervening whitespace splits it into two tokens) but there's typically little to no expense incurred by expressing the initializer as an expression that is a calculation of literals:
int i = 10 * 1000; /* ten thousand */
It is true I see no practical meaning to:
if (a == 1 1 1 1 1) ...
so digits might be merged without real ambiguity
but what about an hexadecimal number?
0 x 1 a B 2 3
There is no way to disambiguate from a typo doing so (normally we should see an error)
I would assume it's because, while writing code, if you reach the end of a "line" (the width of your screen) an automatic line-break (or "word wrap") occurs. This would cause your int to get split in half, one half of it would be on the first line, the second half on the second... this way it all stays together in the event of a word-wrap.
float floating_point_literal = 0.0000153; /* C, C++*/
auto floating_point_literal = 0.0000153; // C++11
auto floating_point_literal = 0.000'015'3; // C++14
Commenting does not hurt:
/* 0. 0000 1530 */
float floating_point_literal = 0.00001530;
Binary strings can be hard to parse:
long bytecode = 0b1111011010011001; /* gcc , clang */
long bytecode = 0b1111'0110'1001'1001; //C++14
// 0b 1111 0110 1001 1001 would be better, really.
// It is how humans think.
A macro for consideration:
#define B(W,X,Y,Z) (0b##W##X##Y##Z)
#define HEX(W,X,Y,Z) (0x##W##X##Y##Z)
#define OCT(O) (0##O)
long z = B(1001, 1001, 1020, 1032 );
// result : long z = (0b1001100110201032);
long h = OCT( 35);
// result : long h = (035); // 35_oct => 29_dec
long h = HEX( FF, A6, 3B, D0 );
// result : long h = (0xFFA6BD0);
It has to do with how the language is parsed. It would have been difficult for the compiler authors to rewrite their products to accept space delimited literals.
Also, I don't think seperating digits with spaces is very common. That i've seen, it's always non-whitespace characters, even in different countries.

Why initializer list with dangle comma is allowed in C++11? [duplicate]

Maybe I am not from this planet, but it would seem to me that the following should be a syntax error:
int a[] = {1,2,}; //extra comma in the end
But it's not. I was surprised when this code compiled on Visual Studio, but I have learnt not to trust MSVC compiler as far as C++ rules are concerned, so I checked the standard and it is allowed by the standard as well. You can see 8.5.1 for the grammar rules if you don't believe me.
Why is this allowed? This may be a stupid useless question but I want you to understand why I am asking. If it were a sub-case of a general grammar rule, I would understand - they decided not to make the general grammar any more difficult just to disallow a redundant comma at the end of an initializer list. But no, the additional comma is explicitly allowed. For example, it isn't allowed to have a redundant comma in the end of a function-call argument list (when the function takes ...), which is normal.
So, again, is there any particular reason this redundant comma is explicitly allowed?
It makes it easier to generate source code, and also to write code which can be easily extended at a later date. Consider what's required to add an extra entry to:
int a[] = {
1,
2,
3
};
... you have to add the comma to the existing line and add a new line. Compare that with the case where the three already has a comma after it, where you just have to add a line. Likewise if you want to remove a line you can do so without worrying about whether it's the last line or not, and you can reorder lines without fiddling about with commas. Basically it means there's a uniformity in how you treat the lines.
Now think about generating code. Something like (pseudo-code):
output("int a[] = {");
for (int i = 0; i < items.length; i++) {
output("%s, ", items[i]);
}
output("};");
No need to worry about whether the current item you're writing out is the first or the last. Much simpler.
It's useful if you do something like this:
int a[] = {
1,
2,
3, //You can delete this line and it's still valid
};
Ease of use for the developer, I would think.
int a[] = {
1,
2,
2,
2,
2,
2, /*line I could comment out easily without having to remove the previous comma*/
}
Additionally, if for whatever reason you had a tool that generated code for you; the tool doesn't have to care about whether it's the last item in the initialize or not.
I've always assumed it makes it easier to append extra elements:
int a[] = {
5,
6,
};
simply becomes:
int a[] = {
5,
6,
7,
};
at a later date.
Everything everyone is saying about the ease of adding/removing/generating lines is correct, but the real place this syntax shines is when merging source files together. Imagine you've got this array:
int ints[] = {
3,
9
};
And assume you've checked this code into a repository.
Then your buddy edits it, adding to the end:
int ints[] = {
3,
9,
12
};
And you simultaneously edit it, adding to the beginning:
int ints[] = {
1,
3,
9
};
Semantically these sorts of operations (adding to the beginning, adding to the end) should be entirely merge safe and your versioning software (hopefully git) should be able to automerge. Sadly, this isn't the case because your version has no comma after the 9 and your buddy's does. Whereas, if the original version had the trailing 9, they would have automerged.
So, my rule of thumb is: use the trailing comma if the list spans multiple lines, don't use it if the list is on a single line.
I am surprised after all this time no one has quoted the Annotated C++ Reference Manual(ARM), it says the following about [dcl.init] with emphasis mine:
There are clearly too many notations for initializations, but each seems to serve a particular style of use well. The ={initializer_list,opt} notation was inherited from C and serves well for the initialization of data structures and arrays. [...]
although the grammar has evolved since ARM was written the origin remains.
and we can go to the C99 rationale to see why this was allowed in C and it says:
K&R allows a trailing comma in an initializer at the end of an
initializer-list. The Standard has retained this syntax, since it
provides flexibility in adding or deleting members from an initializer
list, and simplifies machine generation of such lists.
Trailing comma I believe is allowed for backward compatibility reasons. There is a lot of existing code, primarily auto-generated, which puts a trailing comma. It makes it easier to write a loop without special condition at the end.
e.g.
for_each(my_inits.begin(), my_inits.end(),
[](const std::string& value) { std::cout << value << ",\n"; });
There isn't really any advantage for the programmer.
P.S. Though it is easier to autogenerate the code this way, I actually always took care not to put the trailing comma, the efforts are minimal, readability is improved, and that's more important. You write code once, you read it many times.
I see one use case that was not mentioned in other answers,
our favorite Macros:
int a [] = {
#ifdef A
1, //this can be last if B and C is undefined
#endif
#ifdef B
2,
#endif
#ifdef C
3,
#endif
};
Adding macros to handle last , would be big pain. With this small change in syntax this is trivial to manage. And this is more important than machine generated code because is usually lot of easier to do it in Turing complete langue than very limited preprocesor.
One of the reasons this is allowed as far as I know is that it should be simple to automatically generate code; you don't need any special handling for the last element.
It makes code generators that spit out arrays or enumerations easier.
Imagine:
std::cout << "enum Items {\n";
for(Items::iterator i(items.begin()), j(items.end); i != j; ++i)
std::cout << *i << ",\n";
std::cout << "};\n";
I.e., no need to do special handling of the first or last item to avoid spitting the trailing comma.
If the code generator is written in Python, for example, it is easy to avoid spitting the trailing comma by using str.join() function:
print("enum Items {")
print(",\n".join(items))
print("}")
The reason is trivial: ease of adding/removing lines.
Imagine the following code:
int a[] = {
1,
2,
//3, // - not needed any more
};
Now, you can easily add/remove items to the list without having to add/remove the trailing comma sometimes.
In contrast to other answers, I don't really think that ease of generating the list is a valid reason: after all, it's trivial for the code to special-case the last (or first) line. Code-generators are written once and used many times.
It allows every line to follow the same form. Firstly this makes it easier to add new rows and have a version control system track the change meaningfully and it also allows you to analyze the code more easily. I can't think of a technical reason.
The only language where it's - in practice* - not allowed is Javascript, and it causes an innumerable amount of problems. For example if you copy & paste a line from the middle of the array, paste it at the end, and forgot to remove the comma then your site will be totally broken for your IE visitors.
*In theory it is allowed but Internet Explorer doesn't follow the standard and treats it as an error
It's easier for machines, i.e. parsing and generation of code.
It's also easier for humans, i.e. modification, commenting-out, and visual-elegance via consistency.
Assuming C, would you write the following?
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
puts("Line 1");
puts("Line 2");
puts("Line 3");
return EXIT_SUCCESS
}
No. Not only because the final statement is an error, but also because it's inconsistent. So why do the same to collections? Even in languages that allow you to omit last semicolons and commas, the community usually doesn't like it. The Perl community, for example, doesn't seem to like omitting semicolons, bar one-liners. They apply that to commas too.
Don't omit commas in multiline collections for the same reason you don't ommit semicolons for multiline blocks of code. I mean, you wouldn't do it even if the language allowed it, right? Right?
This is allowed to protect from mistakes caused by moving elements around in a long list.
For example, let's assume we have a code looking like this.
#include <iostream>
#include <string>
#include <cstddef>
#define ARRAY_SIZE(array) (sizeof(array) / sizeof *(array))
int main() {
std::string messages[] = {
"Stack Overflow",
"Super User",
"Server Fault"
};
size_t i;
for (i = 0; i < ARRAY_SIZE(messages); i++) {
std::cout << messages[i] << std::endl;
}
}
And it's great, as it shows the original trilogy of Stack Exchange sites.
Stack Overflow
Super User
Server Fault
But there is one problem with it. You see, the footer on this website shows Server Fault before Super User. Better fix that before anyone notices.
#include <iostream>
#include <string>
#include <cstddef>
#define ARRAY_SIZE(array) (sizeof(array) / sizeof *(array))
int main() {
std::string messages[] = {
"Stack Overflow",
"Server Fault"
"Super User",
};
size_t i;
for (i = 0; i < ARRAY_SIZE(messages); i++) {
std::cout << messages[i] << std::endl;
}
}
After all, moving lines around couldn't be that hard, could it be?
Stack Overflow
Server FaultSuper User
I know, there is no website called "Server FaultSuper User", but our compiler claims it exists. Now, the issue is that C has a string concatenation feature, which allows you to write two double quoted strings and concatenate them using nothing (similar issue can also happen with integers, as - sign has multiple meanings).
Now what if the original array had an useless comma at end? Well, the lines would be moved around, but such bug wouldn't have happened. It's easy to miss something as small as a comma. If you remember to put a comma after every array element, such bug just cannot happen. You wouldn't want to waste four hours debugging something, until you would find the comma is the cause of your problems.
Like many things, the trailing comma in an array initializer is one of the things C++ inherited from C (and will have to support for ever). A view totally different from those placed here is mentioned in the book "Deep C secrets".
Therein after an example with more than one "comma paradoxes" :
char *available_resources[] = {
"color monitor" ,
"big disk" ,
"Cray" /* whoa! no comma! */
"on-line drawing routines",
"mouse" ,
"keyboard" ,
"power cables" , /* and what's this extra comma? */
};
we read :
...that trailing comma after the final initializer is not a typo, but a blip in the syntax carried over from aboriginal C. Its presence or absence is allowed but has no significance. The justification claimed in the ANSI C rationale is that it makes automated generation of C easier. The claim would be more credible if trailing commas were permitted in every comma-sepa-rated list, such as in enum declarations, or multiple variable declarators in a single declaration. They are not.
... to me this makes more sense
In addition to code generation and editing ease, if you want to implement a parser, this type of grammar is simpler and easier to implement. C# follows this rule in several places that there's a list of comma-separated items, like items in an enum definition.
It makes generating code easier as you only need to add one line and don't need to treat adding the last entry as if it's a special case. This is especially true when using macros to generate code. There's a push to try to eliminate the need for macros from the language, but a lot of the language did evolve hand in hand with macros being available. The extra comma allows macros such as the following to be defined and used:
#define LIST_BEGIN int a[] = {
#define LIST_ENTRY(x) x,
#define LIST_END };
Usage:
LIST_BEGIN
LIST_ENTRY(1)
LIST_ENTRY(2)
LIST_END
That's a very simplified example, but often this pattern is used by macros for defining things such as dispatch, message, event or translation maps and tables. If a comma wasn't allowed at the end, we'd need a special:
#define LIST_LAST_ENTRY(x) x
and that would be very awkward to use.
So that when two people add a new item in a list on separate branches, Git can properly merge the changes, because Git works on a line basis.
It makes editing the code a lot easier.
I'm comparing editinc c/c++ array elements with editing json documents - if you forget to remove the last comma, the JSON will not parse. (Yes, I know JSON is not meant to be edited manually)
If you use an array without specified length,VC++6.0 can automaticly identify its length,so if you use "int a[]={1,2,};"the length of a is 3,but the last one hasn't been initialized,you can use "cout<

spaces not needed in case labels?

May be this is a silly question, but did anyone already know whether or not there would be a mandatory space between the case keyword and its constant expression in the switch statement?
The standard seems not say anything about...
Consider the following code:
switch(int_expression)
{
case1: /*anything*/ // no space before 1
caseZERO: /*anything*/ // no space before ZERO
// ZERO being defined as 0
// by the pre-processor
}
Both my reference compilers accept this code and once run it works nicely.
How does the preprocessor recognize that ZERO must be substituted?
Note also that if the control expression was a char type instead of a int type the following is also compiled but this time it no longer works
switch(char_expression)
{
case'a': /* anything */ // NO SPACES embedded
caseZERO: /* anything */ // NO SPACES again
// ZERO being defined as '0'
// by the pre-processor
}
THIS COMPILES but even if the value of the char_expression was '0'
no statement in caseZERO is executed.
Can anyone explain this?
These are general labels for use with goto and won't actually match the expression as you intend. The syntax is valid, but the semantics are different.
You don't specify whether this is a C compiler or a C++ compiler, so here's the documentation for goto in C as well.
Also, if you look in the actual grammar provided by the ISO C standard in Appendix A.2.3 (6.8.1) defines a labeled-statement as follows:
(6.8.1) labeled-statement:
identifier : statement
case constant-expression : statement
default : statement
Note specifically that the case is a single token which must be followed by at least one token separating character in the context of the grammar. A numeral as in your example of case1 does not count as a token separating character, so your example falls into the first branch of the grammar which is an identifier not a case statement, and thus can't be used as the target of a switch statement.
To answer your other question regarding ZERO being defined in a preprocessor macro, the substitution is not happening as you believe. Again, the preprocessor operates on tokens, and thus caseZERO is a single token which does not match the ZERO macro and so will not be substituted at all. Again, this is just defining a labeled-statement using the identifier branch where the identifier is the entire token caseZERO and not case0 or case'0' as you believe. The preprocessor does have means of doing "token pasting" using the ## operator, but that would require you to use case ## ZERO. However, this still would not have the behavior that you probably intend.
The type of the case expression is not relevant. It's the syntactic form that matters.
switch (expr) {
caseZERO: /*...*/;
case0: /*...*/;
case'a': /*...*/;
}
caseZERO and case0 are both valid identifiers. In this context, they are labels, but not case labels. If you had a statement
goto caseZERO;
or
goto case0;
in the same function, it would branch to the corresponding labeled statement.
(This is one of the many cases in C where a typo results in code that's still syntactically valid, but with a substantially different meaning.)
case'a', on the other hand, is two tokens, because the 'a' is a character constant. (But it should still be written as case 'a': for the benefit of the human reader.)
The rules for splitting source code into tokens require white space between an identifier, keyword, or numeric literal and another identifier, keyword, or numeric literal, because otherwise it would be ambiguous. They do not require white space in other contexts. That's why, for example, you can write:
x=y+func(42);
rather than
x = y + func ( 42 ) ;
Adding some white space will make the code more legible to human readers:
x = y + func(42);
but the compiler doesn't care.
(Another case where whitespace is important is in the definition of a function-like macro. The ( must immediately follow the macro name; otherwise it's treated as the first token of the expansion rather than introducing the parameter list.)

c++ is a white space independent language, exception to the rule

This wikepedia page defines c++ as a "white space independent language". While mostly true as with all languages there are exceptions to the rule. The only one I can think of at the moment is this:
vector<vector<double> >
Must have a space otherwise the compiler interprets the >> as a stream operator. What other ones are around. It would be interesting to compile a list of the exceptions.
Following that logic, you can use any two-character lexeme to produce such "exceptions" to the rule. For example, += and + = would be interpreted differently. I wouldn't call them exceptions though. In C++ in many contexts "no space at all" is quite different from "one or more spaces". When someone says that C++ is space-independent they usually mean that "one space" in C++ is typically the same as "more than one space".
This is reflected in the language specification, which states (see 2.1/1) that at phase 3 of translation the implementation is allowed to replace sequences of multiple whitespace characters with one space character.
The syntax and semantic rules for parsing C++ are indeed quite complex (I'm trying to be nice, I think one is authorized to say "a mess"). Proof of this fact is that for YEARS different compiler authors where just arguing on what was legal C++ and what it was not.
In C++ for example you may need to parse an unbounded number of tokens before deciding what is the semantic meaning of the first of them (the dreaded "most vexing parse rule" that also often bites newcomers).
Your objection IMO however doesn't really make sense... for example ++ has a different meaning from + +, and in Pascal begin is not the same as beg in. Does this make Pascal a space-dependent language? Is there any space-independent language (except brainf*ck)?
The only problem about C++03 >>/> > is that this mistake when typing was very common so they decided to add even more complexity to the language definition to solve this issue in C++11.
The cases in which one whitespace instead of more whitespaces can make a difference (something that differentiates space-dependent languages and that however plays no role in the > > / >> case) are indeed few:
inside double-quoted strings (but everyone wants that and every language that supports string literals that I know does the same)
inside single quotes (the same, even if something that not many C++ programmers know is that there can be more that one char inside single quotes)
in the preprocessor directives because they work on a line basis (newline is a whitespace and it makes a difference there)
in line continuation as noticed by stefanv: to continue a single line you can put a backslash right before a newline and in that case the language will ignore both characters (you can do this even in the middle of an identifier, even if the typical use is just to make long preprocessor macros readable). If you put other whitespace characters after the backslash and before the newline however the line continuation is not recognized (some compiler accepts it anyway and simply checks if last non-whitespace of a line is a backslash). Line continuation can also be specified using trigraph equivalent ??/ of backslash (any reasonable compiler should IMO emit a warning when finding a trigraph as they most probably were not indented by the programmer).
inside single-line comments // because also there adding a newline to other whitespaces in the middle of a comment makes a difference
Like it or not, but macro's are also part of C++ and multi-line macro's should be separated with a backslash followed by EOL, no whitespace should be in between the backslash and the EOL.
Not a big issue, but still a whitespace exception.
This is because of limitations in the parser pre c++11 this is no longer the case.
The reason being that it was hard to parse >> as end of a template compared to operator >>
While C++03 did interpret >> as the shift operator in all cases (which was overridden for use in streams, but it's still the shift operator), the language parser in C++11 will now attempt to close a brace when reasonable.
Nested template parameters: set<set<int> >.
Character literals: ' '.
String literals: " ".
Justoposition of keywords and identifiers: else return x;, void foo(){}, etc.

Why can't variable names start with numbers?

I was working with a new C++ developer a while back when he asked the question: "Why can't variable names start with numbers?"
I couldn't come up with an answer except that some numbers can have text in them (123456L, 123456U) and that wouldn't be possible if the compilers were thinking everything with some amount of alpha characters was a variable name.
Was that the right answer? Are there any more reasons?
string 2BeOrNot2Be = "that is the question"; // Why won't this compile?
Because then a string of digits would be a valid identifier as well as a valid number.
int 17 = 497;
int 42 = 6 * 9;
String 1111 = "Totally text";
Well think about this:
int 2d = 42;
double a = 2d;
What is a? 2.0? or 42?
Hint, if you don't get it, d after a number means the number before it is a double literal
It's a convention now, but it started out as a technical requirement.
In the old days, parsers of languages such as FORTRAN or BASIC did not require the uses of spaces. So, basically, the following are identical:
10 V1=100
20 PRINT V1
and
10V1=100
20PRINTV1
Now suppose that numeral prefixes were allowed. How would you interpret this?
101V=100
as
10 1V = 100
or as
101 V = 100
or as
1 01V = 100
So, this was made illegal.
Because backtracking is avoided in lexical analysis while compiling. A variable like:
Apple;
the compiler will know it's a identifier right away when it meets letter 'A'.
However a variable like:
123apple;
compiler won't be able to decide if it's a number or identifier until it hits 'a', and it needs backtracking as a result.
Compilers/parsers/lexical analyzers was a long, long time ago for me, but I think I remember there being difficulty in unambiguosly determining whether a numeric character in the compilation unit represented a literal or an identifier.
Languages where space is insignificant (like ALGOL and the original FORTRAN if I remember correctly) could not accept numbers to begin identifiers for that reason.
This goes way back - before special notations to denote storage or numeric base.
I agree it would be handy to allow identifiers to begin with a digit. One or two people have mentioned that you can get around this restriction by prepending an underscore to your identifier, but that's really ugly.
I think part of the problem comes from number literals such as 0xdeadbeef, which make it hard to come up with easy to remember rules for identifiers that can start with a digit. One way to do it might be to allow anything matching [A-Za-z_]+ that is NOT a keyword or number literal. The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P.
When I was first learning C, I remember feeling the rules for variable names were arbitrary and restrictive. Worst of all, they were hard to remember, so I gave up trying to learn them. I just did what felt right, and it worked pretty well. Now that I've learned alot more, it doesn't seem so bad, and I finally got around to learning it right.
It's likely a decision that came for a few reasons, when you're parsing the token you only have to look at the first character to determine if it's an identifier or literal and then send it to the correct function for processing. So that's a performance optimization.
The other option would be to check if it's not a literal and leave the domain of identifiers to be the universe minus the literals. But to do this you would have to examine every character of every token to know how to classify it.
There is also the stylistic implications identifiers are supposed to be mnemonics so words are much easier to remember than numbers. When a lot of the original languages were being written setting the styles for the next few decades they weren't thinking about substituting "2" for "to".
Variable names cannot start with a digit, because it can cause some problems like below:
int a = 2;
int 2 = 5;
int c = 2 * a;
what is the value of c? is 4, or is 10!
another example:
float 5 = 25;
float b = 5.5;
is first 5 a number, or is an object (. operator)
There is a similar problem with second 5.
Maybe, there are some other reasons. So, we shouldn't use any digit in the beginnig of a variable name.
The restriction is arbitrary. Various Lisps permit symbol names to begin with numerals.
COBOL allows variables to begin with a digit.
Use of a digit to begin a variable name makes error checking during compilation or interpertation a lot more complicated.
Allowing use of variable names that began like a number would probably cause huge problems for the language designers. During source code parsing, whenever a compiler/interpreter encountered a token beginning with a digit where a variable name was expected, it would have to search through a huge, complicated set of rules to determine whether the token was really a variable, or an error. The added complexity added to the language parser may not justify this feature.
As far back as I can remember (about 40 years), I don't think that I have ever used a language that allowed use of a digit to begin variable names. I'm sure that this was done at least once. Maybe, someone here has actually seen this somewhere.
As several people have noticed, there is a lot of historical baggage about valid formats for variable names. And language designers are always influenced by what they know when they create new languages.
That said, pretty much all of the time a language doesn't allow variable names to begin with numbers is because those are the rules of the language design. Often it is because such a simple rule makes the parsing and lexing of the language vastly easier. Not all language designers know this is the real reason, though. Modern lexing tools help, because if you tried to define it as permissible, they will give you parsing conflicts.
OTOH, if your language has a uniquely identifiable character to herald variable names, it is possible to set it up for them to begin with a number. Similar rule variations can also be used to allow spaces in variable names. But the resulting language is likely to not to resemble any popular conventional language very much, if at all.
For an example of a fairly simple HTML templating language that does permit variables to begin with numbers and have embedded spaces, look at Qompose.
Because if you allowed keyword and identifier to begin with numberic characters, the lexer (part of the compiler) couldn't readily differentiate between the start of a numeric literal and a keyword without getting a whole lot more complicated (and slower).
C++ can't have it because the language designers made it a rule. If you were to create your own language, you could certainly allow it, but you would probably run into the same problems they did and decide not to allow it. Examples of variable names that would cause problems:
0x, 2d, 5555
One of the key problems about relaxing syntactic conventions is that it introduces cognitive dissonance into the coding process. How you think about your code could be deeply influenced by the lack of clarity this would introduce.
Wasn't it Dykstra who said that the "most important aspect of any tool is its effect on its user"?
The compiler has 7 phase as follows:
Lexical analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation
Symbol Table
Backtracking is avoided in the lexical analysis phase while compiling the piece of code. The variable like Apple, the compiler will know its an identifier right away when it meets letter ‘A’ character in the lexical Analysis phase. However, a variable like 123apple, the compiler won’t be able to decide if its a number or identifier until it hits ‘a’ and it needs backtracking to go in the lexical analysis phase to identify that it is a variable. But it is not supported in the compiler.
When you’re parsing the token you only have to look at the first character to determine if it’s an identifier or literal and then send it to the correct function for processing. So that’s a performance optimization.
Probably because it makes it easier for the human to tell whether it's a number or an identifier, and because of tradition. Having identifiers that could begin with a digit wouldn't complicate the lexical scans all that much.
Not all languages have forbidden identifiers beginning with a digit. In Forth, they could be numbers, and small integers were normally defined as Forth words (essentially identifiers), since it was faster to read "2" as a routine to push a 2 onto the stack than to recognize "2" as a number whose value was 2. (In processing input from the programmer or the disk block, the Forth system would split up the input according to spaces. It would try to look the token up in the dictionary to see if it was a defined word, and if not would attempt to translate it into a number, and if not would flag an error.)
Suppose you did allow symbol names to begin with numbers. Now suppose you want to name a variable 12345foobar. How would you differentiate this from 12345? It's actually not terribly difficult to do with a regular expression. The problem is actually one of performance. I can't really explain why this is in great detail, but it essentially boils down to the fact that differentiating 12345foobar from 12345 requires backtracking. This makes the regular expression non-deterministic.
There's a much better explanation of this here.
it is easy for a compiler to identify a variable using ASCII on memory location rather than number .
I think the simple answer is that it can, the restriction is language based. In C++ and many others it can't because the language doesn't support it. It's not built into the rules to allow that.
The question is akin to asking why can't the King move four spaces at a time in Chess? It's because in Chess that is an illegal move. Can it in another game sure. It just depends on the rules being played by.
Originally it was simply because it is easier to remember (you can give it more meaning) variable names as strings rather than numbers although numbers can be included within the string to enhance the meaning of the string or allow the use of the same variable name but have it designated as having a separate, but close meaning or context. For example loop1, loop2 etc would always let you know that you were in a loop and/or loop 2 was a loop within loop1.
Which would you prefer (has more meaning) as a variable: address or 1121298? Which is easier to remember?
However, if the language uses something to denote that it not just text or numbers (such as the $ in $address) it really shouldn't make a difference as that would tell the compiler that what follows is to be treated as a variable (in this case).
In any case it comes down to what the language designers want to use as the rules for their language.
The variable may be considered as a value also during compile time by the compiler
so the value may call the value again and again recursively
Backtracking is avoided in lexical analysis phase while compiling the piece of code. The variable like Apple; , the compiler will know its a identifier right away when it meets letter ‘A’ character in the lexical Analysis phase. However, a variable like 123apple; , compiler won’t be able to decide if its a number or identifier until it hits ‘a’ and it needs backtracking to go in the lexical analysis phase to identify that it is a variable. But it is not supported in compiler.
Reference
There could be nothing wrong with it when comes into declaring variable.but there is some ambiguity when it tries to use that variable somewhere else like this :
let 1 = "Hello world!"
print(1)
print(1)
print is a generic method that accepts all types of variable. so in that situation compiler does not know which (1) the programmer refers to : the 1 of integer value or the 1 that store a string value.
maybe better for compiler in this situation to allows to define something like that but when trying to use this ambiguous stuff, bring an error with correction capability to how gonna fix that error and clear this ambiguity.