Is it safe to use $ character as part of identifier in C/C++?
Like this,
int $a = 10;
struct $b;
class $c;
void $d();
No; it is a non-standard extension in some compilers.
No. The C standard only guarantees the use of uppercase and lowercase English letters, digits, _, and Unicode codepoints specified using \u (hex-quad) or \U (hex-quad) (hex-quad) (with a few exceptions). Specific compilers may allow other characters as an extension; however this is highly nonportable. (ISO/IEC 9899:1999 (E) 6.4.2.1, 6.4.3) Further note that the Unicode codepoint method is basically useless in identifiers, even though it is, strictly speaking, permitted (in C99), as it still shows up as a literal \uXXXX in your editor.
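As a rough illustration of the portable subset described above, here is a minimal sketch that checks whether a name uses only the letters, digits, and underscore that every implementation must accept. The function name is_portable_identifier is made up for this example. It deliberately ignores universal-character-names and keywords, and it assumes an ASCII-like execution character set.

```cpp
#include <string>

// Hypothetical helper: true if `name` uses only A-Z, a-z, 0-9 and _ and does
// not start with a digit. Universal-character-names and keywords are not
// considered, and letter ranges are assumed contiguous (as in ASCII).
bool is_portable_identifier(const std::string& name) {
    if (name.empty()) return false;
    auto is_nondigit = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    };
    auto is_digit = [](char c) { return c >= '0' && c <= '9'; };
    if (!is_nondigit(name.front())) return false;
    for (char c : name) {
        if (!is_nondigit(c) && !is_digit(c)) return false;
    }
    return true;
}
```

A name such as $a fails this check, which matches the point of the answer: it may compile on a given implementation, but nothing guarantees it elsewhere.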
This is not standard; it is a compiler extension. Microsoft Visual Studio allows the '$' character in identifiers, and GCC and Clang accept it on most targets as well, but no implementation is required to.
So if you ever want your code to be portable (or readable to others), I'd say no.
Related
I was experimenting with extern and extern "C" for a little, and accidentally had a typo in one of the identifiers - a $ had snuck in. When I compiled the code, got an undefined-symbol error, and eventually saw what caused it, it made me curious whether it would actually compile. And guess what - Clang actually did compile that.
According to documentation I had read previously, the rules for identifiers were basically:
No double underscore at the beginning - because those are reserved.
No single underscore and upper case letter - reserved too.
Must start with a letter, a non-digit.
Must not exceed 31 characters.
May contain a-z, A-Z or 0-9 and _.
But this compiled just fine, and no warning was shown either:
void __this$is$a$mess() {}
int main() { __this$is$a$mess(); }
When looking at it:
Ingwie#Ingwies-Macbook-Pro.local /tmp $ clang y.c
Ingwie#Ingwies-Macbook-Pro.local /tmp $ nm a.out
0000000100000f90 T ___this$is$a$mess
0000000100000000 T __mh_execute_header
0000000100000fa0 T _main
U dyld_stub_binder
I can see the symbol name very clearly.
So why is it that Clang will let me do this, although by ANSI standards, it should not? Even the GCC 6 I have installed did not warn or error about this.
Which compilers will allow what kinds of identifiers - and, why actually?
The rules in the 2018 C standard for identifiers include:
Per 6.4.2.1 1, an identifier is a sequence of identifier-nondigit and digit characters, starting with an identifier-nondigit.
An identifier-nondigit is _, a to z, A to Z, a universal-character-name, or “other implementation-defined characters”.
A digit is 0 to 9.
A universal-character-name is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits, which specify Unicode characters.
So, if an implementation allows $, that is a valid character for that implementation. You may use it, but it may not be portable to other implementations. The C standard requires implementations to accept the specific characters listed, but it allows them to accept more. Generally, the C standard should be viewed as an open field rather than a walled garden: The behavior is defined within the field, but you are not stopped at the barrier; you may go beyond it, at your own risk.
The rules you were taught were rules for what is portable, not rules for what the C standard requires implementations to restrict you to.
The C standard defines strictly conforming code, which is, roughly speaking, code that should work in any C implementation, and conforming code, which is code that works in at least one C implementation. Conforming code is still C code. So the rules you were taught were for strictly conforming code.
Generally, you should prefer to write strictly conforming code and only use additional features when the benefit (speed, ease of development on a particular platform, whatever) is worth the cost (loss of portability).
According to documentation I had read previously, the rules for
identifiers were basically:
No double underscore at the beginning - because those are reserved.
No single underscore and upper case letter - reserved too.
Such identifiers are indeed reserved, but that means that you must not declare or define them, not that they fail to be identifiers, or that they necessarily are not meaningful.
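To make the two reserved patterns concrete, here is a small sketch that flags names starting with a double underscore or with an underscore followed by an uppercase letter. The name has_reserved_prefix is hypothetical, and the check covers only these two patterns. Other reservations exist: for example, C++ reserves any identifier containing a double underscore anywhere, and both languages reserve other leading-underscore names at file or namespace scope.

```cpp
#include <string>

// Hypothetical helper: true if `name` begins with "__" or with '_' followed
// by an uppercase letter. Other reserved categories are not checked.
bool has_reserved_prefix(const std::string& name) {
    if (name.size() < 2 || name[0] != '_') return false;
    return name[1] == '_' || (name[1] >= 'A' && name[1] <= 'Z');
}
```

Note that __this$is$a$mess from the question is flagged because of its leading double underscore, independently of the $ characters.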
Must start with a letter, a non-digit.
Letters are indeed non-digits, but not all non-digits are letters. The _ character is a prime example.
Must not exceed 31 characters.
This is not a formal limit of the language. C requires that implementations support at least 31 significant characters in external identifiers. Two external identifiers that differ only at the 32nd character or later are not guaranteed to be recognized as distinct, but they do not fail to be identifiers. Furthermore, implementations must recognize at least 63 significant characters in internal identifiers, which, again, can be longer.
Some implementations recognize more significant characters, some even an unbounded number.
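The effect of a significant-character limit can be sketched as a comparison of truncated names. The helper below is purely illustrative (distinct_within is not a standard facility): it reports whether two names would still differ if an implementation looked only at the first `significant` characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the two names still differ when only the
// first `significant` characters are compared. The C minimums are 31 for
// external identifiers and 63 for internal ones.
bool distinct_within(const std::string& a, const std::string& b,
                     std::size_t significant) {
    return a.substr(0, significant) != b.substr(0, significant);
}
```

Two 32-character names that differ only in their last character are both valid identifiers, but with a limit of 31 they are not guaranteed to be distinct; with a limit of 63 they are.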
May contain a-z, A-Z or 0-9 and _.
Yes, but explicitly may also contain other implementation-defined characters. The $ character in particular is one that is fairly commonly allowed.
So why is it that Clang will let me do this, although by ANSI
standards, it should not? Even the GCC 6 I have installed did not warn
or error about this.
The standard does not by any means say that identifiers containing the $ character are disallowed. It explicitly permits implementations to accept that character and substantially any other in identifiers, though there are some that cannot pragmatically be allowed because allowing them would introduce ambiguity. Programs that use identifiers containing such characters do not for that reason fail to conform, and implementations that accept them do not for that reason fail to conform. Such programs do fail to strictly conform, however, as that term is defined by the standard.
Can we define variables in C/C++ using special characters such as:
double ε,µ,β,ϰ;
If yes, how can this be achieved?
As per the working draft of the C++ standard (N4713),
5.10 Identifiers [lex.name]
...
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Table 2. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in Table 3.
And when we look at table 3:
Table 3 — Ranges of characters disallowed initially (combining characters)
0300-036F 1DC0-1DFF 20D0-20FF FE20-FE2F
The symbols you have mentioned are the Greek Alphabet which ranges from U+0370 to U+03FF and the extended Greek set ranges from U+1F0x to U+1FFx as per wikipedia. Both these ranges are allowed as the initial element of an identifier.
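The Table 3 lookup above can be written as a short range check. This is only a sketch of the "disallowed initially" test using the four ranges quoted above. The function name disallowed_as_first is an assumption, and the allowed ranges of Table 2 are not checked here.

```cpp
// Hypothetical helper: true if code point `cp` lies in one of the
// "disallowed initially" ranges quoted from Table 3 above.
bool disallowed_as_first(char32_t cp) {
    struct Range { char32_t lo, hi; };
    static const Range table3[] = {
        {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F},
    };
    for (const Range& r : table3) {
        if (cp >= r.lo && cp <= r.hi) return true;
    }
    return false;
}
```

For example, U+03B5 (ε) and U+03B2 (β) are outside all four ranges, while the combining acute accent U+0301 is inside the first one.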
Note that not all compilers provide support for this.
GCC 8.2 with -std=c++17 option fails to compile.
However, Clang 7.0 with -std=c++17 option compiles.
Since the question is tagged Visual Studio: Just write the code as you'd expect it.
double β = 0.1;
When you save the file, Visual Studio will warn you that it needs to save the file as Unicode. Accept it, and it works. AFAICT, this also works in C mode, even though most other C99 extensions are unsupported in Visual Studio.
However, as of g++ 8.2, g++ still does not support non-ASCII characters used directly in identifiers, so the code is then effectively not portable.
Yes, you can use special characters, but not all of them.
You can find a detailed explanation of how to build an identifier (with the list of permitted Unicode characters) on the page Identifiers - cppreference.com.
An identifier is, quoting,
an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and most Unicode characters (see below for details). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character). Identifiers are case-sensitive (lowercase and uppercase letters are distinct), and every character is significant.
Furthermore, Unicode characters need to be escaped.
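For compilers that do not accept the character directly in the source file, the escaped spelling looks like the sketch below. The function name beta_value is made up for illustration; \u03b2 is the universal-character-name for the Greek letter β (U+03B2). Whether a given compiler accepts this form is still implementation-dependent in practice.

```cpp
// \u03b2 is the universal-character-name for the Greek letter β (U+03B2).
double beta_value() {
    double \u03b2 = 0.1;  // escaped spelling of the identifier β
    return \u03b2;
}
```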
I stumbled on some C++ code like this:
int $T$S;
First I thought that it was some sort of PHP code or something wrongly pasted in there but it compiles and runs nicely (on MSVC 2008).
What kind of characters are valid for variables in C++ and are there any other weird characters you can use?
The only legal characters according to the standard are alphanumerics
and the underscore. The standard does require that just about anything
Unicode considers alphabetic is acceptable (but only as single
code-point characters). In practice, implementations offer extensions
(i.e. some do accept a $) and restrictions (most don't accept all of the
required Unicode characters). If you want your code to be portable,
restrict symbols to the 26 unaccented letters, upper or lower case, the
ten digits, and the '_'.
It's an extension of some compilers and is not in the C standard.
MSVC:
Microsoft Specific
Only the first 2048 characters of Microsoft C++ identifiers are significant. Names for user-defined types are "decorated" by the compiler to preserve type information. The resultant name, including the type information, cannot be longer than 2048 characters. (See Decorated Names for more information.) Factors that can influence the length of a decorated identifier are:
Whether the identifier denotes an object of user-defined type or a type derived from a user-defined type.
Whether the identifier denotes a function or a type derived from a function.
The number of arguments to a function.
The dollar sign is also a valid identifier in Visual C++.
// dollar_sign_identifier.cpp
struct $Y1$ {
void $Test$() {}
};
int main() {
$Y1$ $x$;
$x$.$Test$();
}
https://web.archive.org/web/20100216114436/http://msdn.microsoft.com/en-us/library/565w213d.aspx
Newest version: https://learn.microsoft.com/en-us/cpp/cpp/identifiers-cpp?redirectedfrom=MSDN&view=vs-2019
GCC:
6.42 Dollar Signs in Identifier Names
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
http://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html#Dollar-Signs
To my knowledge, only letters (capital and small), digits (0 to 9) and _ are valid for variable names according to the standard (note: the variable name must not start with a digit though).
All other characters should be compiler extensions.
This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([a-zA-Z0-9_]).
Surface Level
Unlike some other languages (bash, Perl), C does not use $ to mark a variable reference, so the character has no special meaning in the language. In C it falls under the "other implementation-defined characters" that C11 6.4.2.1 permits in identifiers, which is why many modern compilers accept it.
As for your C++ question, let's test it!
int main(void) {
int $ = 0;
return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, put that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly/LLVM IR). Your question really only revolves around the first part (i.e. lexing).
The grammar (and thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, skinny arrows, etc. As a result, the lexer may produce output like the following for the C code below:
int i_love_$ = 0;
After the lexer, this becomes a token stream like this:
["int", "i_love_$", "=", "0", ";"]
If you were to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token stream like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0", ";"]
As you can see, because C/C++ doesn't treat characters like $ as special, it is processed differently than other characters like periods.
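The behaviour described above can be imitated with a toy tokenizer. This is a deliberately simplified sketch, not how GCC or Clang actually lex: toy_tokenize and is_ident_char are made-up names. Runs of letters, digits, _ and $ are grouped into one token (so numbers are handled by the same rule), and every other non-space character becomes a one-character token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: characters this toy lexer allows inside a word.
// '$' is included on purpose, mirroring the common compiler extension.
static bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '$';
}

// Splits `src` into word tokens and single-character punctuation tokens.
std::vector<std::string> toy_tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // whitespace only separates tokens
        } else if (is_ident_char(c)) {
            std::size_t start = i;
            while (i < src.size() && is_ident_char(src[i])) ++i;
            tokens.push_back(src.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, c));  // punctuation such as , . = ;
            ++i;
        }
    }
    return tokens;
}
```

With this rule, i_love_$ stays a single token, while the comma and the period split their surroundings into separate tokens.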
There are two path separators in common use: the Unix forward-slash and the DOS backslash. Rest in peace, Classic Mac colon. If used in an #include directive, are they equal under the rules of the C++11, C++03, and C99 standards?
C99 says (§6.4.7/3):
If the characters ', \, ", //, or /* occur in the sequence between the < and > delimiters, the behavior is undefined. Similarly, if the characters ', \, //, or /* occur in the sequence between the " delimiters, the behavior is undefined.
(footnote: Thus, sequences of characters that resemble escape sequences cause undefined behavior.)
C++03 says (§2.8/2):
If either of the characters ’ or \, or either of the character sequences /* or // appears in a q-char-sequence or a h-char-sequence, or the character " appears in a h-char-sequence, the behavior is undefined.
(footnote: Thus, sequences of characters that resemble escape sequences cause undefined behavior.)
C++11 says (§2.9/2):
The appearance of either of the characters ’ or \ or of either of the character sequences /* or // in a q-char-sequence or an h-char-sequence is conditionally supported with implementation-defined semantics, as is the appearance of the character " in an h-char-sequence.
(footnote: Thus, a sequence of characters that resembles an escape sequence might result in an error, be interpreted as the character corresponding to the escape sequence, or have a completely different meaning, depending on the implementation.)
Therefore, although any compiler might choose to support a backslash in a #include path, it is unlikely that any compiler vendor won't support forward slash, and backslashes are likely to trip some implementations up by virtue of forming escape codes. (Edit: apparently MSVC previously required backslash. Perhaps others on DOS-derived platforms were similar. Hmmm… what can I say.)
C++11 seems to loosen the rules, but "conditionally supported" is not meaningfully better than "causes undefined behavior." The change does more to reflect the existence of certain popular compilers than to describe a portable standard.
Of course, nothing in any of these standards says that there is such a thing as paths. There are filesystems out there with no paths at all! However, many libraries assume pathnames, including POSIX and Boost, so it is reasonable to want a portable way to refer to files within subdirectories.
Forward slash is the correct way; the preprocessor will do whatever it takes on each platform to get to the correct file.
It depends on what you mean by "acceptable".
There are two senses in which slashes are acceptable and backslashes are not.
If you're writing C99, C++03, or C1x, backslashes are undefined, while slashes are legal, so in this sense, backslashes are not acceptable.
But this is irrelevant for most people. If you're writing C++1x, where backslashes are conditionally-supported, and the platform you're coding for supports them, they're acceptable. And if you're writing an "extended dialect" of C99/C++03/C1x that defines backslashes, same deal. And, more importantly, this notion of "acceptable" is pretty meaningless in most cases anyway. None of the C/C++ standards define what slashes mean (or what backslashes mean when they're conditionally-supported). Header names are mapped to source files in an implementation-defined manner, period. If you've got a hierarchy of files, and you're asking whether to use backslashes or slashes to refer to them portably in #include directives, the answer is: neither is portable. If you want to write truly portable code, you can't use hierarchies of header files—in fact, arguably, your best bet is to write everything in a single source file, and not #include anything except standard headers.
However, in the real world, people often want "portable-enough", not "strictly portable". The POSIX standard mandates what slashes mean, and even beyond POSIX, most modern platforms—including Win32 (and Win64), the cross-compilers for embedded and mobile platforms like Symbian, etc.—treat slashes the POSIX way, at least as far as C/C++ #include directives. Any platform that doesn't, probably won't have any way for you to get your source tree onto it, process your makefile/etc., and so on, so #include directives will be the least of your worries. If that's what you care about, then slashes are acceptable, but backslashes are not.
Backslash is undefined behavior, and even with a slash you have to be careful. The C99 standard states:
If the characters ', \, ", //, or /*
occur in the sequence between the <
and > delimiters, the behavior is
undefined. Similarly, if the
characters ', \, //, or /* occur in
the sequence between the " delimiters,
the behavior is undefined.
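The C99 rule quoted above can be turned into a small check, for instance in a script that generates #include lines. This is a sketch only: has_undefined_header_chars is a hypothetical name, and it looks just at the characters and sequences the quote lists, not at whether the header exists.

```cpp
#include <string>

// Hypothetical helper: true if `name` (the text between the delimiters of an
// #include directive) contains a character or sequence that the quoted C99
// rule leaves undefined. `angle` selects the <...> form, where '"' is also
// on the list.
bool has_undefined_header_chars(const std::string& name, bool angle) {
    if (name.find('\'') != std::string::npos) return true;
    if (name.find('\\') != std::string::npos) return true;
    if (name.find("//") != std::string::npos) return true;
    if (name.find("/*") != std::string::npos) return true;
    if (angle && name.find('"') != std::string::npos) return true;
    return false;
}
```

A path such as sub/dir/file.h passes, sub\dir\file.h does not, and a doubled slash as in sub//file.h is also flagged, which is the "careful even with a slash" case.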
Always use forward slashes - they work on more platforms. Backslash technically causes undefined behaviour in C++03 (2.8/2 in the standard).
The standard says for #include that it:
searches a sequence of implementation-defined places for
a header identified uniquely by the specified sequence between
the delimiters, and causes the replacement of that directive by the
entire contents of the header. How the places are specified or the header
identified is implementation-defined.
Note the last sentence.
What does the C++ standard say about using dollar signs in identifiers, such as Hello$World? Are they legal?
A C++ identifier can be composed of any of the following: _ (underscore), the digits 0-9, and the letters a-z (both upper and lower case); it cannot start with a digit.
There are a number of exceptions, as implementations may offer extensions to the standard (e.g. Visual Studio).
They are illegal. The only legal characters in identifiers are letters, numbers, and _. Identifiers also cannot start with numbers.
In C++03, the answers given earlier are correct: they are illegal. In C++11 the situation changed however:
The answer here is "Maybe":
According to §2.11, identifiers may consist of digits and identifier-nondigits, starting with one of the latter. identifier-nondigits are the usual a-z, A-Z and underscore, in addition since C++11 they include universal-character-names (e.g. \uBEAF, \UC0FFEE32), and other implementation-defined characters. So it is implementation defined if using $ in an identifier is allowed. VC10 and up supports that, maybe earlier versions, too. It even supports identifiers like こんばんは.
But: I wouldn't use them. Make identifiers as readable and portable as possible. $ is implementation defined and thus not portable.
Not legal, but many if not most compilers support them. Note that this may depend on the platform: for example, gcc on ARM does not support them due to assembler restrictions.
The relevant section is "2.8 Identifiers [lex.name]". From the basic character set, the only valid characters are A-Z a-z 0-9 and _. However, characters like é (U+00E9) are also allowed. Depending on your compiler, you might need to enter é as \u00e9, though.
They are not legal in C++. However some C/C++ derived languages (such as Java and JavaScript) do allow them.
Illegal. I think the dollar sign and backtick are the only punctuation marks on my keyboard that aren't used in C++ somewhere (the "%" sign is in format strings, which are in C++ by reference to the C standard).