In the C++ language, can identifiers contain Unicode characters? [duplicate] - c++

This question already has answers here:
Unicode Identifiers and Source Code in C++11?
(5 answers)
Closed 2 years ago.
So far I thought that in C++ identifiers can only contain letters of the English alphabet, digits, and underscores, but I am reading the standard and cppreference.com, and they say that Unicode characters and special characters can be used as well. Is this true? Or is it true only for some compilers?
https://timsong-cpp.github.io/cppwp/lex.name
https://en.cppreference.com/w/cpp/language/identifiers

According to translation phase 1, the compiler can map each "character" in a source file either to a character in the basic source character set or to a universal-character-name (an escaped Unicode character).
How this mapping is performed, and how the escaped Unicode character is then handled, is implementation-defined. That is, it's up to the compiler implementation whether to handle Unicode characters or not. Some do, some don't. You must read the documentation for your compiler to learn what it does. And if you use Unicode characters in your source, be aware that they might not work reliably (or at all) if the source is compiled with other compilers.
In summary: if you want to write portable code, use only the basic source character set.
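As a quick test of what your toolchain accepts, here is a minimal sketch (whether it compiles at all is compiler-dependent; recent GCC, Clang, and MSVC accept it in UTF-8 encoded source, older compilers may not):
#include <iostream>
// Identifier containing a non-ASCII letter, written directly in the source.
int übung = 0;
// A similar identifier spelled with a universal-character-name (\u00FC is ü);
// this character is in the ranges the standard allows for identifiers.
int \u00FCbung2 = 0;
int main() {
    übung = 1;
    std::cout << übung + \u00FCbung2 << '\n';   // prints 1
}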

How to use Unicode character literal for Han characters in Clojure [duplicate]

This question already has answers here:
Using Emoji literals in Clojure source
(2 answers)
Closed 4 years ago.
I'm trying to create a Unicode character for U+20BB7, but I can't seem to figure out a way.
\uD842\uDFB7
The above doesn't work. I'm starting to think that you can't use literal Unicode character syntax for characters above \uFFFF.
Is my only option to use a string instead?
"\uD842\uDFB7"
Since as a string it works?
You can only use a string here - you're basically trying to shove two char (16-bit) values into one. See [1]:
Unicode Character Representations
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
[1]: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html

Euro Symbol for C++

I am creating a small currency-converting program in C++ in Visual Studio 2013.
For a pound sign, I have created a char variable using the Windows-1252 code for £ symbol.
const char poundSign(156);
I need to find a way to implement a symbol for €, the euro sign, when I run the program.
You should not use any of the legacy "extended ASCII" encodings, as they are long obsolete nowadays. As user1937198 noted, 156 is the character code of £ only in a particular archaic code page (the OEM code pages such as 437/850; in Windows-1252 the £ sign is actually 163). The appearance of non-ASCII characters in these encodings depends on the code page selected by the user, which makes it impossible to mix characters from different code pages under such a scheme. Additionally, your users will be very confused if they pick the wrong code page.
Consider using Unicode instead. Since you're on Windows, you should probably use UTF-16, in which case the correct way is to declare:
// make sure the source code is saved as UTF-16!
const wchar_t poundSign[] = L"£";
const wchar_t euroSign[] = L"€";
In UTF-16, there's no guarantee that a single character will occupy only one 16-bit code unit, due to the surrogate mechanism, hence it's best to store "characters"* as strings.
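To see the surrogate mechanism in action, here is a small sketch (assuming a C++11 compiler; char16_t strings are always UTF-16, and U+20BB7 is just an example of a character outside the Basic Multilingual Plane):
#include <iostream>
int main() {
    const char16_t euro[] = u"\u20AC";       // 1 code unit (+ terminator)
    const char16_t han[]  = u"\U00020BB7";   // 2 code units: a surrogate pair
    std::cout << sizeof(euro) / sizeof(char16_t) - 1 << '\n';  // prints 1
    std::cout << sizeof(han)  / sizeof(char16_t) - 1 << '\n';  // prints 2
}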
Keep in mind that this means the rest of your program should switch to Unicode as well: use the "W" versions of the Windows API (the WCHAR-versions).
[*] In technical lingo, each "Unicode character" is referred to as a code point.
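For example, a hedged sketch of passing such wide strings to the "W" Windows API (this assumes a Windows build; the caption text is made up for illustration):
#include <windows.h>
int main() {
    const wchar_t poundSign[] = L"\u00A3";   // £
    const wchar_t euroSign[]  = L"\u20AC";   // €
    // Wide (UTF-16) strings go straight into the W-suffixed API functions.
    MessageBoxW(nullptr, euroSign, L"Currency symbol", MB_OK);
    MessageBoxW(nullptr, poundSign, L"Currency symbol", MB_OK);
    return 0;
}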
You have to distinguish between the translation environment and the execution environment.
For example, which glyph is assigned to the character value you produce depends heavily on where your program is run (cf. 'code page').
You may want to check whether your terminal (or whatever the output goes to) is Unicode-aware, and then just use the appropriate code points.
Also, you should not create character constants the way you did. Use a character literal instead.
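For instance, a small sketch using character literals spelled with universal-character-names instead of raw code-page numbers like 156 (this assumes wchar_t is large enough to hold these values, as it is on Windows):
const wchar_t poundSign = L'\u00A3';   // £
const wchar_t euroSign  = L'\u20AC';   // €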

Literal "or" in c++ program? [duplicate]

This question already has answers here:
When were the 'and' and 'or' alternative tokens introduced in C++?
(8 answers)
Closed 8 years ago.
I'm translating a C++ function I wrote some time ago into Python when I noticed that my C++ code contains the following lines:
if(MIsScaledOut()) {
  if(DataType()==UnknownDataType or DataType()==h)
    Descriptor = Descriptor + DataTypeString() + "OverM";
There's an or in there! This was probably because I previously translated from python, and forgot to switch to ||.
This code compiles in various OSes, with various compilers, and I've never seen a problem with it. Is this standard, or have I just gotten lucky so far, and this is something I should worry about?
After remembering the right word to google, I now see that it is listed as a C++ keyword, along with various similar keywords like and that I'd never seen (noticed?) before in C++. These keywords exist because there are character sets/encodings that lack some of the punctuation characters used by the traditional operator spellings: {, }, [, ], #, \, ^, |, ~.
As #mafso points out, the alternative "spelled out" versions can be used in C by including the <iso646.h> header, which defines them as macros.
The question of which this has been marked duplicate also points out the existence of digraphs and trigraphs, which can be used to substitute for the missing characters. (That question also says "everybody knows about" them. Obviously, I did not...)
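For reference, a small sketch of the alternative tokens in standard C++ (no header needed; in C you would include <iso646.h> to get them as macros):
#include <iostream>
int main() {
    int x = 3;
    // 'or', 'and', and 'not' are standard C++ keywords,
    // equivalent to ||, &&, and !.
    if (x == 3 or x == 4)
        std::cout << "three or four\n";
    if (not (x < 0) and x < 10)
        std::cout << "single digit\n";
}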

Is C++ ASCII-aware?

A colleague told me:
C++ is not ASCII-aware.
The source character set of a C++ program is implementation-defined, so to what extent is my colleague incorrect?
The C++ compiler needs to be ASCII-aware when it maps the numeric value 48 to '0'. So yes, it needs to be ASCII-aware.
But does it always need to? Imagine you work with EBCDIC ('0' => 240). Then the compiler probably doesn't care about ASCII at all. Maybe that's what your colleague meant.
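One practical consequence (a small sketch, not specific to any compiler): the standard guarantees that the digits '0' through '9' are contiguous in the execution character set, but it does not guarantee that '0' has the value 48, so portable code spells out characters rather than hard-coding ASCII values:
// Portable: relies only on the guaranteed contiguity of '0'..'9'.
int digit_value(char c) {
    return c - '0';
}
// Not portable: holds on ASCII systems, fails on EBCDIC (where '0' is 240).
// static_assert('0' == 48, "execution character set is not ASCII-compatible");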
C++, generally speaking, does not really care about ASCII. It is an implementation detail.
The C++ standard text is "aware" of ASCII in that it makes non-normative mention in a footnote:
[C++11: Footnote 14]: The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
In doing so, it is declaring that the standardised language itself is not ASCII-aware.
"Seems like awareness of ASCII to me!" you might say. Well, no. The mere mention of "ASCII" in the language definition has not made the language ASCII-aware. This is in the same way that you can program a robot to say the words "I am not self-aware" without requiring that the robot become aware of self.

Using Unicode in C++ source code

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?
For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)
Furthermore, can I use Unicode for strings? For example:
std::wstring str = L"Strange chars: â Țđ ě €€";
Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11), and they all fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used. Here is what the standard says literally (C++98 version):
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
For gcc, you can change the input encoding using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The proper option for this is -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset (which defaults to either UTF-16 or UTF-32, depending on the size of wchar_t).
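To see what the execution character set actually affects, here is a small sketch (assuming a UTF-8 source file and gcc's default -fexec-charset of UTF-8, so the euro sign is stored as three bytes in a narrow string):
#include <cstdio>
int main() {
    const char euro[] = "\u20AC";   // encoded in the execution character set
    for (unsigned char c : euro) {
        if (c) std::printf("%02X ", c);
    }
    std::printf("\n");   // prints: E2 82 AC
}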
The C++ standard doesn't say anything about source-code file encoding, so far as I know.
The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at characters that used the high bit. There's no technical reason that Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.
EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes produce warnings like this for files containing 8-bit (non-ASCII) characters:
warning C4819: The file contains a character that cannot be represented
in the current code page (932). Save the file in Unicode format to prevent
data loss.
In addition to litb's post, MSVC++ supports Unicode too. I understand it gets the Unicode encoding from the BOM. It definitely supports code like int (*♫)(); or const std::set<int> ∅;
If you're really into code obfuscation:
typedef void ‼; // Also known as \u203C
class ooɟ {
    operator ‼() {}
};
There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.
As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work very well with some European character sets (and especially with some European keyboards that didn't have a few characters -- like square brackets -- available), so the concept of digraphs and trigraphs was introduced. Many compilers accept more than this character set at this time, but there isn't any guarantee.
As for strings and string literals, C++ has the concept of a wide character and wide character string. However, the encoding for that character set is undefined. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide character string literals look like L"string literal", and these can be assigned to std::wstring objects.
C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8, UTF-16, and UTF-32.
For encoding in strings I think you are meant to use the \u notation, e.g.:
std::wstring str = L"\u20AC"; // Euro character
It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 bits, but sometimes 32. This is implementation-defined, though; IIRC you can even have an 8-bit wchar_t. You have no real guarantee as to the encoding used in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type most suitable for your Unicode entities.
C++1x has additional Unicode support in the form of UTF-8 encoded string literals (u8"text"), and UTF-16 and UTF-32 data types (char16_t and char32_t, IIRC) as well as corresponding string literals (u"text" and U"text"). The encoding of characters specified without \uxxxx or \Uxxxxxxxx escapes is still implementation-defined, though (and there is no encoding support for complex string types outside the literals).
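A short sketch of those literal forms (assuming a C++11 compiler; note that in C++20 the u8 prefix yields char8_t instead of char):
#include <string>
std::string    s8  = u8"text \u20AC";   // UTF-8 encoded
std::u16string s16 = u"text \u20AC";    // UTF-16, elements are char16_t
std::u32string s32 = U"text \u20AC";    // UTF-32, elements are char32_t
std::wstring   sw  = L"text \u20AC";    // wchar_t; encoding is implementation-defined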
In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".
GCC 4.1 doesn't support this, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.
AFAIK it's not standardized, as you can put any type of character in wide strings.
You just have to check that your compiler is set up for Unicode source code to make it work right.