Preprocessing Number - c++

In Section 2.10 of the C++14 Standard ([lex.ppnumber]), preprocessing numbers are defined as
pp-number
digit
. digit
pp-number digit
pp-number ' digit
pp-number ' nondigit
pp-number identifier-nondigit
pp-number e sign
pp-number E sign
pp-number .
So this should include all integer-literal tokens and all floating-literal tokens. But as written in 2.14.4 ([lex.fcon]), the sign there is optional (if there is a way to format it as in the standard, feel free to improve):
exponent-part:
e sign_opt digit-sequence
E sign_opt digit-sequence
sign: one of
+ -
Why is the sign in the pp-number definition not optional? In fact, as written, the number 1e3 should be valid as a floating-literal but not as a pp-number, which contradicts the explanation given below section 2.10.
Is there something I do not get?

Quoting from here:
A preprocessing number has a rather bizarre definition. The category includes all the normal integer and floating point constants
one expects of C, but also a number of other things one might not
initially recognize as a number. Formally, preprocessing numbers begin
with an optional period, a required decimal digit, and then continue
with any sequence of letters, digits, underscores, periods, and
exponents. Exponents are the two-character sequences ‘e+’, ‘e-’, ‘E+’,
‘E-’, ‘p+’, ‘p-’, ‘P+’, and ‘P-’. (The exponents that begin with ‘p’
or ‘P’ are new to C99. They are used for hexadecimal floating-point
constants.)
The purpose of this unusual definition is to isolate the preprocessor from the full complexity of numeric constants. It does
not have to distinguish between lexically valid and invalid
floating-point numbers, which is complicated. The definition also
permits you to split an identifier at any position and get exactly two
tokens, which can then be pasted back together with the ‘##’ operator.
It's possible for preprocessing numbers to cause programs to be
misinterpreted. For example, 0xE+12 is a preprocessing number which
does not translate to any valid numeric constant, therefore a syntax
error. It does not mean 0xE + 12, which is what you might have
intended.

The number "1e3" is, in fact, a valid preprocessing number; you are misreading what the grammar implies. A digit, followed by an identifier-nondigit (that is, a letter, in this case "e"), followed by a digit, does indeed conform to the grammar.


Error subtracting hex constant when it ends in an 'E' [duplicate]

This question already has an answer here:
Why doesn't "0xe+1" compile?
int main()
{
    0xD-0; // Fine
    0xE-0; // Fails
}
This second line fails to compile on both clang and gcc. Any other hex constant ending is ok (0-9, A-D, F).
Error:
<source>:4:5: error: unable to find numeric literal operator 'operator""-0'
4 | 0xE-0;
| ^~~~~
I have a fix (adding a space after the constant and before the subtraction), so I'd mainly like to know why? Is this something to do with it thinking there's an exponent here?
https://godbolt.org/z/MhGT33PYP
Actually, this behaviour is mandated by the C++ standard (and documented), strange as it may seem. It arises from how C++ lexes source into preprocessing tokens (a.k.a. pp-tokens).
If we look closely at how the compiler generates a token for numbers:
A preprocessing number is made up of a digit, optionally preceded by a period, and may be followed by letters, underscores, digits, periods, and any one of: e+ e- E+ E-.
According to this, the compiler reads 0x, then E-, which it interprets as part of the number, since E- is allowed in a numeric pp-token and no space precedes it or sits between the E and the - (which is why adding a space is an easy fix).
This means that 0xE-0 is taken in as a single preprocessing token. In other words, the compiler interprets it as one number, instead of two numbers 0xE and 0 and an operation -. Therefore, the compiler is expecting E to represent an exponent for a floating-point literal.
Now let's take a look at how C++ interprets floating-point literals. Look at the section under "Examples". It gives this curious code sample:
<< "\n0x1p5 " << 0x1p5 // double
<< "\n0x1e5 " << 0x1e5 // integer literal, not floating-point
E is interpreted as part of the integer literal, and does not make the number a hexadecimal floating literal! Therefore, the compiler recognizes 0xE as a single hexadecimal integer. Then there is the -0, which is technically part of the same preprocessing token, and therefore is not an operator followed by another integer. Uh oh. This is now invalid, as there is no -0 suffix.
And so the compiler reports an error.

Valid syntax of calling pseudo-destructor for a floating constant

Consider the following demonstrative program.
#include <iostream>
int main()
{
typedef float T;
0.f.T::~T();
}
This program is compiled by Microsoft Visual Studio Community 2019.
But clang and gcc issue an error like this
prog.cc:7:5: error: unable to find numeric literal operator 'operator""f.T'
7 | 0.f.T::~T();
| ^~~~~
If the expression is written as ( 0.f ).T::~T(), then all three compilers compile the program.
So a question arises: is the expression 0.f.T::~T() syntactically valid? And if not, which syntactic rule is broken?
The parsing of numerical tokens is quite crude, and allows many things that aren't actually valid numbers. In C++98, the grammar for a "preprocessing number", found in [lex.ppnumber], is
pp-number:
digit
. digit
pp-number digit
pp-number nondigit
pp-number e sign
pp-number E sign
pp-number .
Here, a "nondigit" is any character that can be used in an identifier, other than digits, and a "sign" is either + or -. Later standards would expand the definition to allow single quotes (C++14), and sequences of the form p-, p+, P-, P+ (C++17).
The upshot is that, in any version of the standard, while a preprocessing number is required to start with a digit, or a period followed by a digit, after that an arbitrary sequence of digits, letters, and periods may follow. Using the maximal munch rule, it follows that 0.f.T::~T(); is required to be tokenized as 0.f.T :: ~ T ( ) ;, even though 0.f.T isn't a valid numerical token.
Thus, the code is not syntactically valid.
A user-defined literal suffix, ud-suffix, is an identifier. An identifier is a sequence of letters (including some non-ASCII characters), underscores, and digits that does not start with a digit. The period character is not included.
Therefore it is a compiler bug: MSVC is treating the non-identifier sequence f.T as an identifier.
The 0. is a fractional-constant, which can be followed by an optional exponent, then either a ud-suffix (for a user-defined literal) or a floating-point-suffix (one of fFlL). The f could be considered a ud-suffix as well, but since it matches another literal type it should be that and not the UDL. A ud-suffix is defined in the grammar as an identifier.

Using an escape sequence that can't fit in its related type

I read in C++ Primer that using a literal that is larger than its largest related type is an error, but found on the internet that this is not the case for escape sequences.
So, is using an escape sequence that can't fit in its related type an error or undefined behavior?
E.g. \x1234. Assuming the Latin-1 character set and that a byte is 8 bits, this can't fit in a char but is still a validly written literal.
This is implementation defined behavior for char and wchar_t, and undefined behavior for char8_t, char16_t and char32_t per [lex.ccon]/8
The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for character literals with no prefix) or wchar_t (for character literals prefixed by L). [ Note: If the value of a character literal prefixed by u, u8, or U is outside the range defined for its type, the program is ill-formed. — end note ]
emphasis mine
The note is non-normative but paragraphs 3, 4, and 5 from the same section cover the text from the note.

identifier character set (clang)

I never use clang.
And I accidentally discovered that this piece of code:
#include <iostream>
void функция(int переменная)
{
std::cout << переменная << std::endl;
}
int main()
{
int русская_переменная = 0;
функция(русская_переменная);
}
compiles fine: http://rextester.com/NFXBL38644 (clang 3.4 (clang++ -Wall -std=c++11 -O2)).
Is it a clang extension? And why?
Thanks.
UPD: I'm really asking why clang made this decision, because I have never found a discussion where anyone asked for more characters than the C++ standard currently allows (2.3, rev. 3691).
It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.
As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.
Here are the relevant parts of the standard (C11 draft):
5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.
4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.
5 The universal character name construct provides a way to name other characters.
And also:
5.2.1.2 Multibyte characters
1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:
— The basic character set shall be present and each character shall be encoded as a single byte.
— The presence, meaning, and representation of any additional members is locale- specific.
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.
— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.
2 For source files, the following shall hold:
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.
— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.
Given clang's usage of UTF-8 as the source encoding, this behavior is mandated by the standard:
C++ defines an identifier as the following:
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters
The important part here is that identifiers can include universal-character-names. The specification also lists the allowed UCNs:
Annex E (normative)
Universal character names for identifier characters [charname]
E.1 Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
The Cyrillic characters in your identifier are in the range 0100-167F.
The C++ specification further mandates that characters encoded in the source encoding be handled identically to UCNs:
Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently.)
— n3337 §2.2 Phases of translation [lex.phases]/1
So given clang's choice of UTF-8 as the source encoding, the spec mandates that these characters be converted to UCNs (or that clang's behavior be indistinguishable from performing such a conversion), and these UCNs are permitted by the spec to appear in identifiers.
It goes even further. Emoji characters happen to be in the ranges allowed by the C++ spec, so if you've seen some of those examples of Swift code with emoji identifiers and were surprised by such capability you might be even more surprised to know that C++ has exactly the same capability:
http://rextester.com/EPYJ41676
http://imgur.com/EN6uanB
Another fact that may be surprising is that this behavior isn't new with C++11; C++ has mandated it since C++98. It's just that compilers ignored it for a long time: Clang implemented this feature in version 3.3 in 2013, and according to this documentation Microsoft Visual Studio supports it as of 2015.
Even today GCC 6.1 only supports UCNs in identifiers when they are written literally, and does not obey the mandate that any character in its extended source character set must be treated identically with the corresponding universal-character-name. E.g. gcc allows int \u043a\u043e\u0448\u043a\u0430 = 10; but will not allow int кошка = 10; even with -finput-charset=utf-8.

Why is this C or C++ macro not expanded by the preprocessor?

Can someone point out the problem in this code when compiled with gcc 4.1.0?
#define X 10
int main()
{
    double a = 1e-X;
    return 0;
}
I am getting the error: "Exponent has no digits".
When I replace X with 10, it works fine. Also I checked with g++ -E command to see the file with preprocessors applied, it has not replaced X with 10.
I was under the impression that the preprocessor replaces every macro defined in the file with its replacement text without applying any intelligence. Am I wrong?
I know this is a really silly question but I am confused and I would rather be silly than confused :).
Any comments/suggestions?
The preprocessor is not a text processor; it works at the level of tokens. In your code, after the define, every occurrence of the token X would be replaced by the token 10. However, there is no token X in the rest of your code.
1e-X is syntactically invalid and cannot be turned into a token, which is basically what the error is telling you (it says that to make it a valid token -- in this case a floating point literal -- you have to provide a valid exponent).
When you write 1e-X all together like that, the X isn't a separate symbol for the preprocessor to replace - there needs to be whitespace (or certain other symbols) on either side. Think about it a little and you'll realize why. :)
Edit: "12-X" is valid because it gets parsed as "12", "-", "X" which are three separate tokens. "1e-X" can't be split like that because "1e-" doesn't form a valid token by itself, as Jonathan mentioned in his answer.
As for the solution to your problem, you can use token-concatenation:
#define E(X) 1e-##X
int main()
{
double a = E(10); // expands to 1e-10
return 0;
}
Several people have said that 1e-X is lexed as a single token, which is partially correct. To explain:
There are two classes of tokens during translation: preprocessing tokens and tokens. A source file is initially decomposed into preprocessing tokens; these tokens are then used in all of the preprocessing tasks, including macro replacement. After preprocessing, each preprocessing token is converted into a token; these resulting tokens are used during actual compilation.
There are fewer types of preprocessing tokens than there are types of tokens. For example, keywords (e.g. for, while, if) are not significant during the preprocessing phases, so there is no keyword preprocessing token. Keywords are simply lexed as identifiers. When the conversion from preprocessing tokens to tokens takes place, each identifier preprocessing token is inspected; if it matches a keyword, it is converted into a keyword token; otherwise it is converted into an identifier token.
There is only one type of numeric token during preprocessing: preprocessing number. This type of preprocessing token corresponds to two different types of tokens: integer literal and floating literal.
The preprocessing number preprocessing token is defined very broadly. Effectively it matches any sequence of characters that begins with a digit or a decimal point followed by any number of digits, nondigits (e.g. letters), and e+ and e-. So, all of the following are valid preprocessing number preprocessing tokens:
1.0e-10
.78
42
1e-X
1helloworld
The first two can be converted into floating literals; the third can be converted into an integer literal. The last two are not valid integer literals or floating literals; those preprocessing tokens cannot be converted into tokens. This is why you can preprocess the source without error but cannot compile it: the error occurs in the conversion from preprocessing tokens to tokens.
GCC 4.5.0 doesn't change the X either.
The answer is going to lie in how the preprocessor interprets preprocessing tokens - and in the 'maximal munch' rule. The 'maximal munch' rule is what dictates that 'x+++++y' is treated as 'x ++ ++ + y' and hence is erroneous, rather than as 'x ++ + ++ y' which is legitimate.
The issue is why the preprocessor interprets '1e-X' as a single preprocessing token. Clearly, it will treat '1e-10' as a single token. There is no valid interpretation for '1e-' unless it is followed by a digit once it passes the preprocessor. So, I have to guess that the preprocessor sees '1e-X' as a single (actually erroneous) token. I have not dissected the relevant clauses of the standard to see where this is required, but the definition of a 'preprocessing number' or 'pp-number' in the standard (see below) is somewhat different from the definition of a valid integer or floating-point constant, and it allows many pp-numbers that are not valid as either.
If it helps, the output of the Sun C Compiler for 'cc -E -v soq.c' is:
# 1 "soq.c"
# 2
int main()
{
"soq.c", line 4: invalid input token: 1e-X
double a = 1e-X ;
return 0;
}
#ident "acomp: Sun C 5.9 SunOS_sparc Patch 124867-09 2008/11/25"
cc: acomp failed for soq.c
So, at least one C compiler rejects the code in the preprocessor - it might be that the GCC preprocessor is a little slack (I tried to provoke a complaint with gcc -Wall -pedantic -std=c89 -Wextra -E soq.c, but it did not utter a squeak). Using three X's in both the macro and the '1e-XXX' notation showed that all three X's were consumed by both GCC and the Sun C compiler.
C Standard Definition of Preprocessing Number
From the C Standard - ISO/IEC 9899:1999 §6.4.8 Preprocessing Numbers:
pp-number:
digit
. digit
pp-number digit
pp-number identifier-nondigit
pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .
Given this, '1e-X' is a valid 'pp-number', and therefore the X is not a separate token (nor is the 'XXX' in '1e-XXX' a separate token). Therefore, the preprocessor cannot expand the X; it isn't a separate token subject to expansion.