C++ single quotes in the middle of a number [duplicate] - c++

As of C++14, thanks to n3781 (which in itself does not answer this question) we may write code like the following:
const int x = 1'234; // one thousand two hundred and thirty four
The aim is to improve on code like this:
const int y = 100000000;
and make it more readable.
The underscore (_) character was already taken in C++11 by user-defined literals, and the comma (,) has localisation problems — many European countries bafflingly† use this as the decimal separator — and conflicts with the comma operator, though I do wonder what real-world code could possibly have been broken by allowing e.g. 1,234,567.
Anyway, a better solution would seem to be the space character:
const int z = 1 000 000;
These adjacent numeric literal tokens could be concatenated by the preprocessor just as are string literals:
const char x[5] = "a" "bc" "d";
Instead, we get the apostrophe ('), not used by any writing system I'm aware of as a digit separator.
Is there a reason that the apostrophe was chosen instead of a simple space?
† It's baffling because all of those languages, within text, maintain the notion of a comma "breaking apart" an otherwise atomic sentence, with a period functioning to "terminate" the sentence — to me, at least, this is quite analogous to a comma "breaking apart" the integral part of a number and a period "terminating" it ready for the fractional input.

There is a previous paper, n3499, which tell us that although Bjarne himself suggested spaces as separators:
While this approach is consistent with one common typeographic style, it suffers from some compatibility problems.
It does not match the syntax for a pp-number, and would minimally require extending that syntax.
More importantly, there would be some syntactic ambiguity when a hexadecimal digit in the range [a-f] follows a space. The preprocessor would not know whether to perform symbol substitution starting after the space.
It would likely make editing tools that grab "words" less reliable.
I guess the following example is the main problem noted:
const int x = 0x123 a;
though in my opinion this rationale is fairly weak. I still can't think of a real-world example to break it.
The "editing tools" rationale is even worse, since 1'234 breaks basically every syntax highlighter known to mankind (e.g. that used by Markdown in the above question itself!) and makes updated versions of said highlighters much harder to implement.
Still, for better or worse, this is the rationale that led to the adoption of apostrophes instead.

The obvious reason for not using white space is that a new line is also
white space, and that C++ treats all white space identically. And off
hand, I don't know of any language which accepts arbitrary white space
as a separator.
Presumably, Unicode 0xA0 (non-breaking space) could be used—it is
the most widely used solution when typesetting. I see two problems with
that, however: first, it's not in the basic character set, and second,
it's not visually distinctive; you can't see that it isn't a space by
just looking at the text in a normal editor.
Beyond that, there aren't many choices. You can't use the comma, since
that is already a legal token (and something like 1,234 is currently
legal C++, with the meaning 234). And in a context where it could occur
in legal code, e.g. a[1,234]. While I can't quite imagine any real
code actually using this, there is a basic rule that no legal program,
regardless how absurd, should silently change semantics.
Similar considerations mean that _ can't be used either; if there is a
#define _234 * 2, then a[1_234] would silently change the meaning of
the code.
I can't say that I'm particularly pleased with the choice of ', but it
does have the advantage of being used in continental Europe, at least in
some types of texts. (I seem to remember having seen it in German, for
example, although in typical running text, German, like most other
languages, will use a point or a non breaking space. But maybe it was
Swiss German.) The problem with ' is parsing; the sequence '1' is
already legal, as is '123'. So something like 1'234 could be a 1,
followed by the start of a character constant; I'm not sure how far you
have to look-ahead to make the decision. There is no sequence of legal
C++ in which an integral constant can be followed by a character
constant, so there's no problem with breaking legal code, but it means
that lexical scanning suddenly becomes very context dependent.
(With regards to your comment: there is no logic in the choice of a
decimal or a thousands separator. A decimal separator, for example, is
certainly not a full stop. They are just arbitrary conventions.)

From wiki, we have a nice example:
auto floating_point_literal = 0.000'015'3;
Here, we have the . operator and then if another operator would be to be met, my eyes would wait for something visible, like a comma or something, not a whitespace.
So an apostrophe does much better here than a whitespace would do.
With whitespaces it would be
auto floating_point_literal = 0.000 015 3;
which doesn't feel as right as the case with the apostrophes.
In the same spirit of Albert Renshaw's answer, I think that the apostrophe is more clear than the space the Lightness Races in Orbit proposes.
type a = 1'000'000'000'000'000'544'445'555;
type a = 1 000 000 000 000 000 544 445 555;
Space is used for many things, like the strings concatenation the OP mentions, unlike the apostrophe, which in this case makes it clear for someone that is used separating the digits.
When the lines of code become many, I think that this will improve readability, but I doubt that is the reason they choose it.
About the spaces, it might worth taking a look at this C question, which says:
The language doesn't allow int i = 10 000; (an integer literal is one token, the intervening whitespace splits it into two tokens) but there's typically little to no expense incurred by expressing the initializer as an expression that is a calculation of literals:
int i = 10 * 1000; /* ten thousand */

It is true I see no practical meaning to:
if (a == 1 1 1 1 1) ...
so digits might be merged without real ambiguity
but what about an hexadecimal number?
0 x 1 a B 2 3
There is no way to disambiguate from a typo doing so (normally we should see an error)

I would assume it's because, while writing code, if you reach the end of a "line" (the width of your screen) an automatic line-break (or "word wrap") occurs. This would cause your int to get split in half, one half of it would be on the first line, the second half on the second... this way it all stays together in the event of a word-wrap.

float floating_point_literal = 0.0000153; /* C, C++*/
auto floating_point_literal = 0.0000153; // C++11
auto floating_point_literal = 0.000'015'3; // C++14
Commenting does not hurt:
/* 0. 0000 1530 */
float floating_point_literal = 0.00001530;
Binary strings can be hard to parse:
long bytecode = 0b1111011010011001; /* gcc , clang */
long bytecode = 0b1111'0110'1001'1001; //C++14
// 0b 1111 0110 1001 1001 would be better, really.
// It is how humans think.
A macro for consideration:
#define B(W,X,Y,Z) (0b##W##X##Y##Z)
#define HEX(W,X,Y,Z) (0x##W##X##Y##Z)
#define OCT(O) (0##O)
long z = B(1001, 1001, 1020, 1032 );
// result : long z = (0b1001100110201032);
long h = OCT( 35);
// result : long h = (035); // 35_oct => 29_dec
long h = HEX( FF, A6, 3B, D0 );
// result : long h = (0xFFA6BD0);

It has to do with how the language is parsed. It would have been difficult for the compiler authors to rewrite their products to accept space delimited literals.
Also, I don't think seperating digits with spaces is very common. That i've seen, it's always non-whitespace characters, even in different countries.

Related

why does C++ allow a declaration with no space between the type and a parenthesized variable name? [duplicate]

This question already has an answer here:
Is white space considered a token in C like languages?
(1 answer)
Closed 8 months ago.
A previous C++ question asked why int (x) = 0; is allowed. However, I noticed that even int(x) = 0; is allowed, i.e. without a space before the (x). I find the latter quite strange, because it causes things like this:
using Oit = std::ostream_iterator<int>;
Oit bar(std::cout);
*bar = 6; // * is optional
*Oit(bar) = 7; // * is NOT optional!
where the final line is because omitting the * makes the compiler think we are declaring bar again and initializing to 7.
Am I interpreting this correctly, that int(x) = 0; is indeed equivalent to int x = 0, and Oit(bar) = 7; is indeed equivalent to Oit bar = 7;? If yes, why specifically does C++ allow omitting the space before the parentheses in such a declaration + initialization?
(my guess is because the C++ compiler does not care about any space before a left paren, since it treats that parenthesized expression as it's own "token" [excuse me if I'm butchering the terminology], i.e. in all cases, qux(baz) is equivalent to qux (baz))
It is allowed in C++ because it is allowed in C and requiring the space would be an unnecessary C-compatibility breaking change. Even setting that aside, it would be surprising to have int (x) and int(x) behave differently, since generally (with few minor exceptions) C++ is agnostic to additional white-space as long as tokens are properly separated. And ( (outside a string/character literal) is always a token on its own. It can't be part of a token starting with int(.
In C int(x) has no other potential meaning for which it could be confused, so there is no reason to require white-space separation at all. C also is generally agnostic to white-space, so it would be surprising there as well to have different behavior with and without it.
One requirement when defining the syntax of a language is that elements of the language can be separated. According to the C++ syntax rules, a space separates things. But also according to the C++ syntax rules, parentheses also separate things.
When C++ is compiled, the first step is the parsing. And one of the first steps of the parsing is separating all the elements of the language. Often this step is called tokenizing or lexing. But this is just the technical background. The user does not have to know this. He or she only has to know that things in C++ must be clearly separted from each others, so that there is a sequence "*", "Oit", "(", "bar", ")", "=", "7", ";".
As explained, the rule that the parenthesis always separates is established on a very low level of the compiler. The compiler determines even before knowing what the purpose of the parenthesis is, that a parenthesis separates things. And therefore an extra space would be redundant.
When you ever use parser generators, you will see that most of them just ignore spaces. That means, when the lexer has produced the list of tokens, the spaces do not exist any more. See above in the list. There are no spaces any more. So you have no chance to specify something that explicitly requires a space.

perl split strange behavior

I apologize in advance, this is probably a very stupid question with an obvious solution which is escaping the eye of a rather beginner in perl, or it may also have been in Stackoverflow as a solved question, but my lack of knowledge about what exactly to look for is preventing me from actually finding the answer.
I have a string like:
$s = FOO: < single blankspace> BAR <some whitespace character> some more text with whitespace that can span over multiple lines, i.e. has \n in them ;
#please excuse the lack of quotes, and large text describing the character in angular brackets, but in this example, but I have the string correctly defined, and in plase of <blankspace> I have the actual ASCII 32 character etc.
Now I want to split the $s, in this way:
($instType, $inst, $trailing) = split(/\s*/, $s, 3);
#please note that i do not use the my keyword as it is not in a subroutine
#but i tested with my, it does not change the behavior
I would expect, that $instType takes the value FOO: , without any surrounding space, in the actual test string there is a colon, and I believe, to the best of my knowledge, that it will remain in the $instType. Then it is rather obvious to expect that $inst takes similary the value BAR , without any surrounding spaces, and then finally one may also lean on $trail to take the rest of the string.
However, I am getting:
$instType takes F , that is just the single char,
$inst takes O, the single charater in the 2nd position in the string
$trail takes O: BAR and the rest.
How do I address the issue?
PS perl is 5.18.0
the problem is the quantifier * that allows zero space (zero or more), you must use + instead, that means 1 or more.
Note that there is exactly zero space between F and O.
You wrote:
#please note that i do not use the my keyword as it is not in a subroutine
#but i tested with my, it does not change the behavior
You can, and should, use my outside of subroutines, too. Using that in conjunction with use strict prevents silly errors like this:
$some_field = 'bar';
if ( $some_feild ) { ... }
If those statements were separated, it could be awfully hard to track down that bug.

std.algorithm.joiner(string[],string) - why result elements are dchar and not char?

I try to compile following code:
import std.algorithm;
void main()
{
string[] x = ["ab", "cd", "ef"]; // 'string' is same as 'immutable(char)[]'
string space = " ";
char z = joiner( x, space ).front(); // error
}
Compilation with dmd ends with error:
test.d(8): Error: cannot implicitly convert expression (joiner(x,space).front()) of type dchar to char
Changing char z to dchar z does fix the error message, but I'm interested why it appears in the first place.
Why result of joiner(string[],string).front() is dchar and not char?
(There is nothing on this in documentation http://dlang.org/phobos/std_algorithm.html#joiner)
All strings are treated as ranges of dchar. That's because a dchar is guaranteed to be a single code point, since in UTF-32, every code unit is a code point, whereas in UTF-8 (char) and UTF-16 (wchar), the number of code units per code point varies. So, if you were operating on individual chars or wchars, you'd be operating on pieces of characters rather than whole characters, which would be very bad. If you don't know much about unicode, I'd advise reading this article by Joel Spolsky. It explains things fairly well.
In any case, because operating on individual chars and wchars doesn't make sense, strings of char and wchar are treated as ranges of dchar (ElementType!string is dchar), meaning that as far as ranges are concerned, they don't have length (hasLength!string is false - walkLength needs to be used to get their length), aren't sliceable (hasSlicing!string is false), and aren't indexable (isRandomAccess!string is false). This also means that anything which builds a new range from any kind of string is going to result in a range of dchar. joiner is one of those. There are some functions which understand unicode and special case strings for efficiency, taking advantage of length, slicing, and indexing where they can, but unless their result is ultimately a slice of the original, any range they return is going to have to be made of dchars.
So, front on any range of characters will always be dchar, and popFront will always pop off a full code point.
If you don't know much about ranges, I'd advise reading this. It's a chapter in a book on D which is online and is currently the best tutorial on ranges that we have. We really should get a proper article on ranges (including on how they work with strings) onto dlang.org, but no one's gotten around to writing it yet. Regardless, you're going to need to have at least a basic grasp of ranges to be able to use a lot of D's standard library (especially std.algorithm), because it uses them very heavily.

Is this an acceptable use of "ASCII arithmetic"?

I've got a string value of the form 10123X123456 where 10 is the year, 123 is the day number within the year, and the rest is unique system-generated stuff. Under certain circumstances, I need to add 400 to the day number, so that the number above, for example, would become 10523X123456.
My first idea was to substring those three characters, convert them to an integer, add 400 to it, convert them back to a string and then call replace on the original string. That works.
But then it occurred to me that the only character I actually need to change is the third one, and that the original value would always be 0-3, so there would never be any "carrying" problems. It further occurred to me that the ASCII code points for the numbers are consecutive, so adding the number 4 to the character "0", for example, would result in "4", and so forth. So that's what I ended up doing.
My question is, is there any reason that won't always work? I generally avoid "ASCII arithmetic" on the grounds that it's not cross-platform or internationalization friendly. But it seems reasonable to assume that the code points for numbers will always be sequential, i.e., "4" will always be 1 more than "3". Anybody see any problem with this reasoning?
Here's the code.
string input = "10123X123456";
input[2] += 4;
//Output should be 10523X123456
From the C++ standard, section 2.2.3:
In both the source and execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the previous.
So yes, if you're guaranteed to never need a carry, you're good to go.
The C++ language definition requres that the code-point values of the numerals be consecutive. Therefore, ASCII Arithmetic is perfectly acceptable.
Always keep in mind that if this is generated by something that you do not entirely control (such as users and third-party system), that something can and will go wrong with it. (Check out Murphy's laws)
So I think you should at least put on some validations before doing so.
It sounds like altering the string as you describe is easier than parsing the number out in the first place. So if your algorithm works (and it certainly does what you describe), I wouldn't consider it premature optimization.
Of course, after you add 400, it's no longer a day number, so you couldn't apply this process recursively.
And, <obligatory Year 2100 warning>.
Very long time ago I saw some x86 processor instructions for ASCII and BCD.
Those are AAA (ASCII Adjust for Addition), AAS (subtraction), AAM (mult), AAD (div).
But even if you are not sure about target platform you can refer to specification of characters set you are using and I guess you'll find that first 127 characters of ASCII is always have the same meaning for all characters set (for unicode that is first characters page).

Why can't variable names start with numbers?

I was working with a new C++ developer a while back when he asked the question: "Why can't variable names start with numbers?"
I couldn't come up with an answer except that some numbers can have text in them (123456L, 123456U) and that wouldn't be possible if the compilers were thinking everything with some amount of alpha characters was a variable name.
Was that the right answer? Are there any more reasons?
string 2BeOrNot2Be = "that is the question"; // Why won't this compile?
Because then a string of digits would be a valid identifier as well as a valid number.
int 17 = 497;
int 42 = 6 * 9;
String 1111 = "Totally text";
Well think about this:
int 2d = 42;
double a = 2d;
What is a? 2.0? or 42?
Hint, if you don't get it, d after a number means the number before it is a double literal
It's a convention now, but it started out as a technical requirement.
In the old days, parsers of languages such as FORTRAN or BASIC did not require the uses of spaces. So, basically, the following are identical:
10 V1=100
20 PRINT V1
and
10V1=100
20PRINTV1
Now suppose that numeral prefixes were allowed. How would you interpret this?
101V=100
as
10 1V = 100
or as
101 V = 100
or as
1 01V = 100
So, this was made illegal.
Because backtracking is avoided in lexical analysis while compiling. A variable like:
Apple;
the compiler will know it's a identifier right away when it meets letter 'A'.
However a variable like:
123apple;
compiler won't be able to decide if it's a number or identifier until it hits 'a', and it needs backtracking as a result.
Compilers/parsers/lexical analyzers was a long, long time ago for me, but I think I remember there being difficulty in unambiguosly determining whether a numeric character in the compilation unit represented a literal or an identifier.
Languages where space is insignificant (like ALGOL and the original FORTRAN if I remember correctly) could not accept numbers to begin identifiers for that reason.
This goes way back - before special notations to denote storage or numeric base.
I agree it would be handy to allow identifiers to begin with a digit. One or two people have mentioned that you can get around this restriction by prepending an underscore to your identifier, but that's really ugly.
I think part of the problem comes from number literals such as 0xdeadbeef, which make it hard to come up with easy to remember rules for identifiers that can start with a digit. One way to do it might be to allow anything matching [A-Za-z_]+ that is NOT a keyword or number literal. The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P.
When I was first learning C, I remember feeling the rules for variable names were arbitrary and restrictive. Worst of all, they were hard to remember, so I gave up trying to learn them. I just did what felt right, and it worked pretty well. Now that I've learned alot more, it doesn't seem so bad, and I finally got around to learning it right.
It's likely a decision that came for a few reasons, when you're parsing the token you only have to look at the first character to determine if it's an identifier or literal and then send it to the correct function for processing. So that's a performance optimization.
The other option would be to check if it's not a literal and leave the domain of identifiers to be the universe minus the literals. But to do this you would have to examine every character of every token to know how to classify it.
There is also the stylistic implications identifiers are supposed to be mnemonics so words are much easier to remember than numbers. When a lot of the original languages were being written setting the styles for the next few decades they weren't thinking about substituting "2" for "to".
Variable names cannot start with a digit, because it can cause some problems like below:
int a = 2;
int 2 = 5;
int c = 2 * a;
what is the value of c? is 4, or is 10!
another example:
float 5 = 25;
float b = 5.5;
is first 5 a number, or is an object (. operator)
There is a similar problem with second 5.
Maybe, there are some other reasons. So, we shouldn't use any digit in the beginnig of a variable name.
The restriction is arbitrary. Various Lisps permit symbol names to begin with numerals.
COBOL allows variables to begin with a digit.
Use of a digit to begin a variable name makes error checking during compilation or interpertation a lot more complicated.
Allowing use of variable names that began like a number would probably cause huge problems for the language designers. During source code parsing, whenever a compiler/interpreter encountered a token beginning with a digit where a variable name was expected, it would have to search through a huge, complicated set of rules to determine whether the token was really a variable, or an error. The added complexity added to the language parser may not justify this feature.
As far back as I can remember (about 40 years), I don't think that I have ever used a language that allowed use of a digit to begin variable names. I'm sure that this was done at least once. Maybe, someone here has actually seen this somewhere.
As several people have noticed, there is a lot of historical baggage about valid formats for variable names. And language designers are always influenced by what they know when they create new languages.
That said, pretty much all of the time a language doesn't allow variable names to begin with numbers is because those are the rules of the language design. Often it is because such a simple rule makes the parsing and lexing of the language vastly easier. Not all language designers know this is the real reason, though. Modern lexing tools help, because if you tried to define it as permissible, they will give you parsing conflicts.
OTOH, if your language has a uniquely identifiable character to herald variable names, it is possible to set it up for them to begin with a number. Similar rule variations can also be used to allow spaces in variable names. But the resulting language is likely to not to resemble any popular conventional language very much, if at all.
For an example of a fairly simple HTML templating language that does permit variables to begin with numbers and have embedded spaces, look at Qompose.
Because if you allowed keyword and identifier to begin with numberic characters, the lexer (part of the compiler) couldn't readily differentiate between the start of a numeric literal and a keyword without getting a whole lot more complicated (and slower).
C++ can't have it because the language designers made it a rule. If you were to create your own language, you could certainly allow it, but you would probably run into the same problems they did and decide not to allow it. Examples of variable names that would cause problems:
0x, 2d, 5555
One of the key problems about relaxing syntactic conventions is that it introduces cognitive dissonance into the coding process. How you think about your code could be deeply influenced by the lack of clarity this would introduce.
Wasn't it Dykstra who said that the "most important aspect of any tool is its effect on its user"?
The compiler has 7 phase as follows:
Lexical analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation
Symbol Table
Backtracking is avoided in the lexical analysis phase while compiling the piece of code. The variable like Apple, the compiler will know its an identifier right away when it meets letter ‘A’ character in the lexical Analysis phase. However, a variable like 123apple, the compiler won’t be able to decide if its a number or identifier until it hits ‘a’ and it needs backtracking to go in the lexical analysis phase to identify that it is a variable. But it is not supported in the compiler.
When you’re parsing the token you only have to look at the first character to determine if it’s an identifier or literal and then send it to the correct function for processing. So that’s a performance optimization.
Probably because it makes it easier for the human to tell whether it's a number or an identifier, and because of tradition. Having identifiers that could begin with a digit wouldn't complicate the lexical scans all that much.
Not all languages have forbidden identifiers beginning with a digit. In Forth, they could be numbers, and small integers were normally defined as Forth words (essentially identifiers), since it was faster to read "2" as a routine to push a 2 onto the stack than to recognize "2" as a number whose value was 2. (In processing input from the programmer or the disk block, the Forth system would split up the input according to spaces. It would try to look the token up in the dictionary to see if it was a defined word, and if not would attempt to translate it into a number, and if not would flag an error.)
Suppose you did allow symbol names to begin with numbers. Now suppose you want to name a variable 12345foobar. How would you differentiate this from 12345? It's actually not terribly difficult to do with a regular expression. The problem is actually one of performance. I can't really explain why this is in great detail, but it essentially boils down to the fact that differentiating 12345foobar from 12345 requires backtracking. This makes the regular expression non-deterministic.
There's a much better explanation of this here.
it is easy for a compiler to identify a variable using ASCII on memory location rather than number .
I think the simple answer is that it can, the restriction is language based. In C++ and many others it can't because the language doesn't support it. It's not built into the rules to allow that.
The question is akin to asking why can't the King move four spaces at a time in Chess? It's because in Chess that is an illegal move. Can it in another game sure. It just depends on the rules being played by.
Originally it was simply because it is easier to remember (you can give it more meaning) variable names as strings rather than numbers although numbers can be included within the string to enhance the meaning of the string or allow the use of the same variable name but have it designated as having a separate, but close meaning or context. For example loop1, loop2 etc would always let you know that you were in a loop and/or loop 2 was a loop within loop1.
Which would you prefer (has more meaning) as a variable: address or 1121298? Which is easier to remember?
However, if the language uses something to denote that it not just text or numbers (such as the $ in $address) it really shouldn't make a difference as that would tell the compiler that what follows is to be treated as a variable (in this case).
In any case it comes down to what the language designers want to use as the rules for their language.
The variable may be considered as a value also during compile time by the compiler
so the value may call the value again and again recursively
Backtracking is avoided in lexical analysis phase while compiling the piece of code. The variable like Apple; , the compiler will know its a identifier right away when it meets letter ‘A’ character in the lexical Analysis phase. However, a variable like 123apple; , compiler won’t be able to decide if its a number or identifier until it hits ‘a’ and it needs backtracking to go in the lexical analysis phase to identify that it is a variable. But it is not supported in compiler.
Reference
There could be nothing wrong with it when comes into declaring variable.but there is some ambiguity when it tries to use that variable somewhere else like this :
let 1 = "Hello world!"
print(1)
print(1)
print is a generic method that accepts all types of variable. so in that situation compiler does not know which (1) the programmer refers to : the 1 of integer value or the 1 that store a string value.
maybe better for compiler in this situation to allows to define something like that but when trying to use this ambiguous stuff, bring an error with correction capability to how gonna fix that error and clear this ambiguity.