How can I find out what the current charset is in C++? - c++

How can I find out what the current charset is in C++?
In a console application (WinXP) I am getting negative values for some characters (like äöüé) with
(int)mystring[a]
and this surprises me. I was expecting the values to be between 127 and 256.
So is there something like GetCharset() or SetCharset() in c++?

It depends on how you look at the value you have at hand. char can be signed (e.g. on Windows) or unsigned, as on some other systems. So what you should do is print the value as unsigned to get what you are asking for.
C++ until now is char-set agnostic. For Windows console specifically, you can use: GetConsoleOutputCP.

Look at std::numeric_limits<char>::min() and max(). Or CHAR_MIN and CHAR_MAX if you don't like typing, or if you need an integer constant expression.
If CHAR_MAX == UCHAR_MAX and CHAR_MIN == 0 then chars are unsigned (as you expected). If CHAR_MAX != UCHAR_MAX and CHAR_MIN < 0 they are signed (as you're seeing).
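As a rough sketch of the check described above (the function name plain_char_is_signed is made up for illustration), the CHAR_MIN test can be wrapped in a helper and cross-checked at compile time against std::numeric_limits:

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: reports whether plain char is signed on this
// implementation, using the CHAR_MIN test.
constexpr bool plain_char_is_signed() {
    return CHAR_MIN < 0;
}

// Both ways of asking the question must agree.
static_assert(plain_char_is_signed() == std::numeric_limits<char>::is_signed,
              "CHAR_MIN and numeric_limits disagree");

// Plain char has the same range as either signed char or unsigned char.
static_assert(plain_char_is_signed() ? (CHAR_MAX == SCHAR_MAX)
                                     : (CHAR_MAX == UCHAR_MAX),
              "unexpected range for plain char");
```

Calling plain_char_is_signed() from main and printing the result shows which case your compiler picked.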
The standard, in 3.9.1/1, ensures that there are no other possibilities: "... a plain char can take on either the same values as a signed char or an unsigned char; which one is implementation-defined."
This tells you whether char is signed or unsigned, and that's what's confusing you. You certainly can't call anything to modify it: from the POV of a program it's baked into the compiler even if the compiler has ways of changing it (GCC certainly does: -fsigned-char and -funsigned-char).
The usual way to deal with this is if you're going to cast a char to int, cast it through unsigned char first. So in your example, (int)(unsigned char)mystring[a]. This ensures you get a non-negative value.
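A minimal sketch of that double cast, assuming 8-bit chars (byte_value is a hypothetical name, not something from the standard library):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the value of s[i] as a non-negative int
// (0..255 with 8-bit chars), regardless of whether plain char is signed.
int byte_value(const std::string& s, std::size_t i) {
    return static_cast<int>(static_cast<unsigned char>(s[i]));
}
```

With a string holding the single byte 0xE4, byte_value(s, 0) gives 228 whether char is signed or not, whereas (int)s[0] gives -28 when char is signed.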
It doesn't actually tell you what charset your implementation uses for char, but I don't think you need to know that. On Microsoft compilers, the answer is essentially that commonly-used character encoding "ISO-8859-mutter-mutter". This means that chars with 7-bit ASCII values are represented by that value, while values outside that range are ambiguous, and will be interpreted by a console or other recipient according to how that recipient is configured. ISO Latin 1 unless told otherwise.
Properly speaking, the way characters are interpreted is locale-specific, and the locale can be modified and interrogated using a whole bunch of stuff towards the end of the C++ standard that personally I've never gone through and can't advise on ;-)
Note that if there's a mismatch between the charset in effect, and the charset your console uses, then you could be in for trouble. But I think that's separate from your issue: whether chars can be negative or not is nothing to do with charsets, just whether char is signed.

char is signed by default on many compilers.
Try this.
cout << (int)(unsigned char) mystring[a] << endl;

The only guarantee that the standard provides is for members of the basic character set:
2.2 Character sets
3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
Further, the type char is supposed to hold:
3.9.1 Fundamental types
1 Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set.
So, there are no guarantees that you will get the correct value for the characters you have mentioned. However, try to use an unsigned int to hold this value (for all practical purposes, it never makes sense to use a signed type to hold char values if you are going to print them or pass them around).

Related

Why C++ returns wrong codes of some characters, and how to fix this?

I have a simple line of code:
std::cout << std::hex << static_cast<int>('©');
This character is the Copyright Sign; its code is a9, but the app writes c2a9. The same happens with lots of Unicode characters. Another example: ™ (which is 2122) unexpectedly gives e284a2. Why does C++ return the wrong codes for some characters, and how can I fix this?
Note: I'm using Microsoft Visual Studio, and the file with my code is saved in UTF-8.
An ordinary character literal (one without prefix) usually has type char and can store only elements of the execution character set that are representable as a single byte.
If the character is not representable in this way, the character literal is only conditionally-supported with type int and implementation-defined value. Compilers typically warn when this happens with some of the generic warning flags since it is a mistake most of the time. That might depend on what warning flags exactly you have enabled.
A byte is typically 8 bits and therefore it is impossible to store all of Unicode in it. I don't know what execution character set your implementation uses, but clearly neither © nor ™ is in it.
It also seems that your implementation chose to support the non-representable character by encoding it in UTF-8 and using that as the value of the literal. You are seeing a representation of the numeric value of the UTF-8 encoding of the two characters.
If you want the numeric value of the Unicode code point for the character, then you should use a character literal with the U prefix, which means the value of the character according to UTF-32 is given with type char32_t, which is large enough to hold all Unicode code points:
std::cout << std::hex << static_cast<std::uint_least32_t>(U'©');
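Here is a small sketch of the same idea as a reusable function (code_point is a made-up name). It is meant to be called with universal character names such as U'\u00A9', so the result does not depend on how the source file is saved:

```cpp
#include <cstdint>

// Hypothetical helper: returns the numeric value of a UTF-32 code unit,
// which for a char32_t literal is the Unicode code point.
constexpr std::uint_least32_t code_point(char32_t c) {
    return static_cast<std::uint_least32_t>(c);
}
```

For instance, std::cout << std::hex << code_point(U'\u2122'); should print 2122 rather than the UTF-8 bytes e284a2.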

How can force the user/OS to input an Ascii string

This is an extended question of this one: Is std::string suppose to have only Ascii characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am handling the input on the assumption that it is ASCII. For example, I am using something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I specify that I always want the input to be ASCII? Or have I missed a lot of concepts, and static_cast<unsigned int>(my_char - '0') is just a bad approach?
P.S. In ASCII (at least) the digits are in sequential order. However, I do not know whether they are in other encodings. (I am pretty sure they are, but there is no guarantee, right?)
How can force the user/OS to input an Ascii string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation used to serve std::cin translates key strokes like 0 to a specific number, and on what your toolchain expects to match that number against with its intrinsic translation for '0'.
You simply shouldn't rely on explicit ASCII values (e.g. magic numbers); use char literals instead to get portable code. The assumption that my_char - '0' yields the digit's actual value holds for all character sets. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
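A minimal sketch of the portable digit conversion this guarantee allows (digit_value is a hypothetical helper name). It compares against the literals '0' and '9' instead of the magic numbers 48 and 57:

```cpp
// Hypothetical helper: returns 0..9 for a decimal digit character and -1
// for anything else. It relies only on the guarantee that the digits
// '0'..'9' have consecutive, increasing values in the execution character set.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? (c - '0') : -1;
}
```

Note that nothing similar is guaranteed for letters, so a range check like c >= 'a' && c <= 'z' is not portable in the same way.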
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string happens to be ASCII-encoded as well.
Also, whatever platform-specific measure you take, the user can pipe in a UTF-16LE encoded file, or /dev/urandom.
You are mistaking string encoding for some magic property of an input stream, and it is not one. It is, well, an encoding, like JPEG or AVI, and must be handled in exactly the same way: read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw or exit with an error if you ever encounter a byte with a value outside the ASCII range.
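As a rough illustration of that byte-by-byte check (is_ascii is a made-up helper; how you report the error is up to you):

```cpp
#include <string>

// Hypothetical helper: returns true when every byte of the input lies in
// the ASCII range 0x00..0x7F. The cast to unsigned char avoids surprises
// on implementations where plain char is signed.
bool is_ascii(const std::string& input) {
    for (char c : input) {
        if (static_cast<unsigned char>(c) > 0x7F) {
            return false;
        }
    }
    return true;
}
```

A typical use would be to read a line with std::getline, call is_ascii on it, and reject the input with an error message when it returns false.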
However, if you later encounter a terminal providing data in some incompatible encoding, like UTF-16LE, you have no choice but to write a detection routine (based on the byte order mark) and a conversion routine.

Any downsides using '?' instead of L'?' with wchar_t?

Are there any downsides to using '?'-style character literals to compare against, or assign to, values known to be of type wchar_t, instead of using L'?'-style literals?
They have the wrong datatype and encoding, so that's a bad idea. The compiler will silently widen character literals (for strings you'd get a type mismatch compile error), using the standard integral conversions (such as sign-extension). But the value might not match.
For example, characters 0x80 through 0xff often map to different Unicode codepoints, and the exact mapping varies depending on the compiler's codepage.
Clearly, it's not possible for Unicode to map all the various codepages using an identity conversion. If merely widening were enough, there'd be no need for functions like mbstowcs.
WRT your specific question about '\xAB' vs L'\xAB', they probably are not equal. See http://ideone.com/b1E39
As I mentioned, the standard says
A char array (whether plain char, signed char, or unsigned char), char16_t array, char32_t array, or wchar_t array can be initialized by a narrow character literal...
However, in the section for the __STDC_MB_MIGHT_NEQ_WC__ preprocessor definition, it says
The integer constant 1, intended to indicate that, in the encoding for wchar_t, a member of the basic character set need not have a code value equal to its value when used as the lone character in an ordinary character literal.
And for __STDC_ISO_10646__:
An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character.
I am not exactly a professional at interpreting the standard, but I think that means the answer to your question is that they may have different representations, and you should always use the L.
The only downside is that your program might fail on stone-age systems using EBCDIC. On any real world system worth consideration, char and wchar_t values for the portable character set are all ASCII, and on increasingly many (but not all), wchar_t is a Unicode codepoint number.

Why is there no ASCII or UTF-8 character literal in C11 or C++11?

Why is there no UTF-8 character literal in C11 or C++11 even though there are UTF-8 string literals? I understand that, generally-speaking, a character literal represents a single ASCII character which is identical to a single-octet UTF-8 code point, but neither C nor C++ says the encoding has to be ASCII.
Basically, if I read the standard right, there's no guarantee that '0' will represent the integer 0x30, yet u8"0" must represent the char sequence 0x30 0x00.
EDIT:
I'm aware not every UTF-8 code point would fit in a char. Such a literal would only be useful for single-octet code points (aka, ASCII), so I guess calling it an "ASCII character literal" would be more fitting, so the question still stands. I just chose to frame the question with UTF-8 because there are UTF-8 string literals. The only way I can imagine portably guaranteeing ASCII values would be to write a constant for each character, which wouldn't be so bad considering there are only 128, but still...
It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.
If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.
_Static_assert('0' == 48, "must be ASCII-compatible");
Or, for pre-C11 compilers,
extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];
If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...
/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...
/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...
/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...
/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...
There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.
On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list are heap-allocated, etc. It's a nightmare.
Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.
For example, as far as the C standard is concerned, the following is a valid implementation of malloc:
void *malloc(size_t size) { (void)size; return NULL; }
Note that while u8"..." constants are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the encoding is 16-bits and 32-bits per character, respectively, and the actual encoding must be documented by the implementation.
Summary: Safe to assume ASCII compatibility in 2012.
A UTF-8 character literal would have to have variable length: for most characters, it's not possible to store a single character in a char or wchar_t, so what type should it have? As we don't have variable-length types in C or in C++, except for arrays of fixed-size types, the only reasonable type for it would be const char *, and C strings are required to be null-terminated, so it wouldn't change anything.
As for the edit:
Quote from the C++11 standard:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
(footnote at 2.3.1).
I think that's a good reason for not guaranteeing it. Although, as you noted in a comment here, for most (or every) mainstream compiler, the ASCII-ness of character literals is guaranteed by the implementation.
For C++ this has been addressed by Evolution Working Group issue 119: Adding u8 character literals whose Motivation section says:
We have five encoding-prefixes for string-literals (none, L, u8, u, U)
but only four for character literals -- the missing one is u8. If the
narrow execution character set is not ASCII, u8 character literals
would provide a way to write character literals with guaranteed ASCII
encoding (the single-code-unit u8 encodings are exactly ASCII). Adding
support for these literals would add a useful feature and make the
language slightly more consistent.
EWG discussed the idea of adding u8 character literals in Rapperswil and accepted the change. This paper provides wording for that
extension.
This was incorporated into the working draft using the wording from N4267: Adding u8 character literals. The wording can be found in N4527, the latest draft standard at the time of writing; note that, as section 2.14.3 says, these literals are limited to code points that fit into a single UTF-8 code unit:
A character literal that begins with u8, such as u8'w', is a character
literal of type char, known as a UTF-8 character literal. The value of
a UTF-8 character literal is equal to its ISO10646 code point value,
provided that the code point value is representable with a single
UTF-8 code unit (that is, provided it is a US-ASCII character). A
UTF-8 character literal containing multiple c-chars is ill-formed.
If you don't trust that your compiler will treat '0' as ASCII character 0x30, then you could use static_cast<char>(0x30) instead.
As you are aware, UTF-8-encoded characters need several octets, thus chars, so the natural type for them is char[], which is indeed the type for a u8-prefixed string literal! So C11 is right on track here, just that it sticks to its syntax conventions using " for a string, needing to be used as an array of char, rather than your implied semantic-based proposal to use ' instead.
About "0" versus u8"0", you are reading right, only the latter is guaranteed to be identical to { 0x30, 0 }, even on EBCDIC systems. By the way, the very fact the former is not can be handled conveniently in your code, if you pay attention to the __STDC_MB_MIGHT_NEQ_WC__ predefined identifier.
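As a small sketch of how the u8 string literal guarantee can be used to get a guaranteed-ASCII value without a u8 character literal (ascii_zero is a hypothetical name):

```cpp
// Hypothetical helper: returns the UTF-8 (and therefore ASCII) code unit
// for the digit zero by indexing a u8 string literal. The cast keeps this
// working whether the element type is char or, in newer standards, char8_t.
constexpr unsigned char ascii_zero() {
    return static_cast<unsigned char>(u8"0"[0]);
}
```

On an ASCII-compatible system ascii_zero() == '0' also holds; on an EBCDIC system it would not, which is the difference described above.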

Unicode conversion issues

Here is a beginner question on Unicode. I'm using Embarcadero C++ Builder 2009, where they supposedly changed the default strings to use Unicode.
I type various symbols in my source editor, that aren't part of the standard "7-bit ASCII".
My program is using the String type of C++ Builder to fetch user input.
I am also adding input manually by setting a value to a wchar_t.
It would seem that there are conflicts in how the symbols are interpreted. Sometimes I get a symbol with for example the code 0x00C7 ('Ç'), but sometimes the same symbol is coded as 0xFFC7, for example in the source code editor. To my understanding, the former is proper Unicode, the latter is "something else". Can someone confirm this?
I wonder where this "something else" encoding is coming from, and how to get rid of it?
EDIT: Further research: it seems that one place where the 0xFF** encoding appears is when I do something like this:
string str = ...;
wchar_t wch = (wchar_t)str[i];
Same result no matter if it is std::string or VCL String. Is wchar_t not the same as Unicode?
I'm guessing the problem is that in your compiler char is signed (the standard allows it to be either signed or unsigned, it's implementation-defined/specific). As such, whenever you convert chars that have bit 7 set to 1 (0x80 through 0xFF) into any larger integer type, it's treated as a negative value and it gets sign-extended to preserve the negative value, or, in other words, this bit 7 gets copied to bit 8, bit 9 and so on, into all higher bits of the bigger integer type. So, 0xC7 can turn into 0xFFC7 and 0xFFFFFFC7. To prevent that from happening, cast chars to unsigned chars first.
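A minimal sketch of that fix (widen_byte is a made-up name). Note it only zero-extends the byte; whether the result is the right Unicode code point still depends on the encoding of the narrow string, e.g. it matches for ISO Latin 1 but not for UTF-8:

```cpp
// Hypothetical helper: converts a narrow char to wchar_t without sign
// extension by going through unsigned char first, so the byte 0xC7
// becomes 0x00C7 instead of 0xFFC7.
wchar_t widen_byte(char c) {
    return static_cast<wchar_t>(static_cast<unsigned char>(c));
}
```

In the loop from the question this would replace (wchar_t)str[i] with widen_byte(str[i]).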
The wide character type wchar_t is implementation-defined and need not correspond to the concept of a Unicode character. Check out the description of wchar_t in the Unicode Standard.
If “Ç” is changed to 0xFFC7, it looks very much like sign extension, i.e. the character is internally stored as byte 0xC7 which is then taken as a signed 8-bit integer and converted to a 16-bit integer with sign extension.