Any downsides using '?' instead of L'?' with wchar_t? - c++

Are there any downsides to using '?'-style character literals to compare against, or assign to, values known to be of type wchar_t, instead of using L'?'-style literals?

They have the wrong type and, potentially, the wrong encoding, so it's a bad idea. The compiler will silently widen a narrow character literal (with a string literal you would get a type-mismatch compile error) using the standard integral conversions, including sign extension, but the resulting value might not match the value of the corresponding wide-character literal.
For example, characters 0x80 through 0xff often map to different Unicode codepoints, and the exact mapping varies depending on the compiler's codepage.
Clearly, Unicode cannot map all the various code pages with an identity conversion. If merely widening were enough, there would be no need for functions like mbstowcs.
Regarding your specific question about '\xAB' vs L'\xAB': they probably are not equal. See http://ideone.com/b1E39
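A minimal sketch of that comparison (the results are implementation-defined, since they depend on whether char is signed and on the narrow and wide execution character sets, so treat the output as illustrative only):

#include <iostream>

int main()
{
    wchar_t narrow_widened = '\xAB';  // integral conversion of the narrow literal; may sign-extend
    wchar_t wide_literal   = L'\xAB'; // value taken from the execution wide-character set

    std::wcout << L"narrow widened: " << (long)narrow_widened << L'\n'
               << L"wide literal:   " << (long)wide_literal << L'\n'
               << L"equal? " << (narrow_widened == wide_literal ? L"yes" : L"no") << L'\n';
}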

As I mentioned, the standard says
A char array (whether plain char, signed char, or unsigned char), char16_t array, char32_t array, or wchar_t array can be initialized by a narrow character literal...
However, in the section for the __STDC_MB_MIGHT_NEQ_WC__ preprocessor definition, it says
The integer constant 1, intended to indicate that, in the encoding for wchar_t, a member of the basic character set need not have a code value equal to its value when used as the lone character in an ordinary character literal.
And for __STDC_ISO_10646__:
An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character.
I am not exactly a professional at interpreting the standard, but I think that means the answer to your question is that they may have different representations, and you should always use the L.
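If it helps, here is a small sketch of how those two macros could be consulted; whether an implementation defines either of them at all is, of course, up to the implementation:

#include <iostream>

int main()
{
#ifdef __STDC_ISO_10646__
    // Every character in the Unicode required set stored in a wchar_t equals its short identifier.
    std::cout << "wchar_t holds Unicode code points (value " << __STDC_ISO_10646__ << ")\n";
#endif
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    // Even basic-set members may differ between 'x' and L'x', so always write the L prefix.
    std::cout << "'x' and L'x' may differ; use wide literals\n";
#endif
}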

The only downside is that your program might fail on stone-age systems using EBCDIC. On any real world system worth consideration, char and wchar_t values for the portable character set are all ASCII, and on increasingly many (but not all), wchar_t is a Unicode codepoint number.

Related

Is 16-bit wchar_t formally valid for representing full Unicode?

In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where sometimes two such values (called a “surrogate pair”) are needed for a single Unicode code point, is invalid for representing Unicode.
It's certainly inconvenient and in conflict with the assumption of the C and C++ standard libraries (e.g. character classification) that each code point is represented as a single value, although the Unicode consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does.
And certainly it seems as if the original intent was to have one wchar_t value per code point, consistent with the assumptions of the C and C++ standard libraries. E.g. in the web page “ISO C Amendment 1 (MSE)” over at ³unix.org, about the amendment that brought wchar_t into the C standard in 1995, the authors maintain that
” The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.
But as it turned out, the C and C++ standards seem to not talk about the largest supported character, but only about the largest extended character sets in the supported locales: that wchar_t must be large enough to represent every code point in the largest such extended character set – but not Unicode, when there is no Unicode locale.
C99 §7.17/2 (from the N869 draft):
” [the wchar_t type] is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.
This is almost identical to the wording in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can indeed be smallish, down to a single byte with UTF-8 encoding (a nightmare possibility where e.g. no standard library character classification function would work outside of ASCII's A through Z, but hey). Possibly the following is a requirement to be wider than that:
C99 §7.1.1/4:
” A wide character is a code value (a binary encoded integer) of an object of type wchar_t that corresponds to a member of the extended character set.
… since it refers to the extended character set, but that term seems to not be further defined anywhere.
And at least with Microsoft's C and C++ runtime there is no Unicode locale: with that implementation setlocale is restricted to character encodings that have at most 2 bytes per character:
MSDN ⁴documentation of setlocale:
” The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
So it seems that contrary to what I thought I knew, and contrary to my assertion, Windows' 16-bit wchar_t is formally OK. And mainly due to Microsoft's ingenious lack of support for UTF-8 locales, or any locale with more than 2 bytes per character. But is it really so, is 16-bit wchar_t OK?
Links:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
wchar_t is not now and never was a Unicode character/code point. The C++ standard does not declare that a wide-string literal will contain Unicode characters. The C++ standard does not declare that a wide-character literal will contain a Unicode character. Indeed, the standard doesn't say anything about what wchar_t will contain.
wchar_t can be used with locale-aware APIs, but those are only relative to the implementation-defined encoding, not any particular Unicode encoding. The standard library functions that take these use their knowledge of the implementation's encoding to do their jobs.
So, is a 16-bit wchar_t legal? Yes; the standard does not require that wchar_t be sufficiently large to hold a Unicode codepoint.
Is a string of wchar_t permitted to hold UTF-16 values (or variable width in general)? Well, you are permitted to make strings of wchar_t that store whatever you want (so long as it fits). So for the purposes of the standard, the question is whether standard-provided means for generating wchar_t characters and strings are permitted to use UTF-16.
Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char->wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.
If a compiler wishes to declare that the wide character set consists of the Base Multilingual Plane of Unicode, then a declaration like this L'\U0001F000' will produce a single wchar_t. But the value is implementation-defined, per [lex.ccon]/2:
The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.
And of course, C++ doesn't allow surrogate code points as a c-char; \uD800 is a compile error.
Where things get murky in the standard is the treatment of strings that contain characters outside of the character set. The above text would suggest that implementations can do what they want. And yet, [lex.string]/16 says this:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.
I say this is murky because nothing says what the behavior should be if a c-char in a string literal is outside the range of the destination character set.
Windows compilers (both VS and GCC-on-Windows) do indeed cause L"\U0001F000" to have an array size of 3 (two surrogate code units forming one surrogate pair, plus the NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside of the valid range for a character set?
I would say that this is a hole in the standard, rather than a deficiency in those compilers. The standard should make it clearer what the conversion behavior in this case ought to be.
In any case, wchar_t is not an appropriate tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard doesn't require this, you cannot rely on it.
Now obviously, you can stick whatever will fit inside of a wchar_t. So even on platforms where wchar_t is 32-bits, you could shove UTF-16 data into them, with each 16-bit word taking up 32-bits. But you couldn't pass such text to any API function that expects the wide character encoding unless you knew that this was the expected encoding for that platform.
Basically, never use wchar_t if you want to work with a Unicode encoding.
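As a small probe of the behavior described above (the printed sizes are implementation-defined; typical Windows compilers give 2 and 3, typical Linux compilers give 4 and 2):

#include <iostream>

int main()
{
    const wchar_t text[] = L"\U0001F000"; // one code point outside the BMP

    std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n'
              << "array elements:  " << sizeof text / sizeof text[0] << '\n';
}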
Let's start from first principles:
(§3.7.3) wide character: bit representation that fits in an object of type wchar_t, capable of representing any character in the current locale
(§3.7) character: 〈abstract〉 member of a set of elements used for the organization, control, or representation of data
That, right away, discards full Unicode as a character set (a set of elements/characters) representable in a 16-bit wchar_t.
But wait, Nicol Bolas quoted the following:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.
and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say about this issue:
(§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.8)
and further clarifies in a footnote that not all source characters need to map to the same execution character.
Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates as proper characters themselves, not as mere surrogates for other characters. AFAICT, this means you are in the clear as far as Clause 6 (Language) of ISO C99 cares.
Of course, don't expect Clause 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogate code units. And you'd get some nonsensical result, but that's fine, because you declared the surrogates themselves to be proper members of the execution character set.
After clarifying what the question is I will do an edit.
Q: Is the width of 16 bits for wchar_t in Windows conformant to the standard?
A: Well, let's see. We will start with the definition of wchar_t from the C99 draft.
... largest extended character set specified among the supported locales.
So, we should look at what the supported locales are. For that there are three steps:
We check the documentation for setlocale
We quickly open the documentation for the locale string. We see the format of the string
locale :: "locale_name"
| "language[_country_region[.code_page]]"
| ".code_page"
| "C"
| ""
| NULL
We see the list of supported code pages, and we see UTF-8, UTF-16, UTF-32 and whatnot. We're at a dead end.
If we start with the C99 definition, it ends with
... corresponds to a member of the extended character set.
The word "character set" is used. But if we say UTF-16 code units are our character set, then all is OK. Otherwise, it's not. It's kinda vague, and one should not care much. The standards were defined many years ago, when Unicode was not popular.
At the end of the day, we now have C++11 and C11 that define use cases for UTF-8, 16 and 32 with the additional types char16_t and char32_t.
You need to read about Unicode and you will answer the question yourself.
Unicode is a character set: a set of roughly 200,000 characters, or more precisely a mapping between numbers and characters. Unicode by itself does not imply any particular bit width.
Then there are the encodings UTF-7, UTF-8, UTF-16 and UTF-32; UTF stands for Unicode Transformation Format. Each format works with code points and code units. A code point is an actual character from Unicode and is encoded as one or more code units; only UTF-32 always uses one unit per point. Each code unit fits into a fixed-size integer, so UTF-7 units fit in 7 bits, UTF-16 units in 16 bits, and so on.
Therefore, a string of 16-bit wchar_t can hold Unicode text encoded in UTF-16; in UTF-16, each code point takes one or two units. So the final answer: a single 16-bit wchar_t cannot store every Unicode character, only those that fit in one code unit, but a string of wchar_t can store any Unicode text.
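To illustrate the unit/point distinction, here is a sketch that assumes the wchar_t string holds UTF-16 (as on Windows, where wchar_t is 16 bits); the code-point count is computed by skipping low (trailing) surrogates:

#include <cstddef>
#include <iostream>
#include <string>

std::size_t count_utf16_code_points(const std::wstring& s)
{
    std::size_t points = 0;
    for (wchar_t unit : s)
        if (unit < 0xDC00 || unit > 0xDFFF) // not a low surrogate, so it starts a code point
            ++points;
    return points;
}

int main()
{
    std::wstring s = L"A\U0001F000"; // 'A' plus one character outside the BMP

    std::cout << "code units:  " << s.size() << '\n'
              << "code points: " << count_utf16_code_points(s) << '\n';
}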

Understanding wchar_t type in c++

The Standard says N3797::3.9.1 [basic.fundamental]:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).
I can't imagine how we can use that type. Could you give an example where plain char doesn't work? I thought it might be helpful when we use two different languages simultaneously, but plain char is OK for Cyrillic and Latin:
#include <iostream>

char cp[] = "LATINICA_КИРИЛЛИЦА";

int main()
{
    std::cout << cp; // LATINICA_КИРИЛЛИЦА
}
In your example, you use Unicode. Indeed, you could type not only Latin or Cyrillic but also Thai, Arabic or Chinese characters, in other words any Unicode symbols, and the example would work the same way.
The key is the encoding. In your example you are using char to store Unicode symbols encoded in UTF-8. The main advantage of UTF-8 is backward compatibility with ASCII; the main disadvantage is variable symbol length.
There are other types of encoding for Unicode symbols. The most common (besides UTF-8) are UTF-16 and UTF-32. You should be aware that the UTF-16 encoding is still variable length; however, the code unit is now 16 bits. The UTF-32 encoding is fixed length.
The type wchar_t is usually used to store symbols in UTF-16 or UTF-32 encoding depending on the system.
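A small sketch of the difference, assuming the source and narrow execution encodings are UTF-8 (as in the question's demo) and that the wide literal uses one code unit per BMP character:

#include <iostream>

int main()
{
    const char    narrow[] = "КИРИЛЛИЦА";  // UTF-8: each Cyrillic letter takes 2 bytes
    const wchar_t wide[]   = L"КИРИЛЛИЦА"; // UTF-16 or UTF-32: one code unit per letter here

    std::cout << "narrow elements: " << sizeof narrow / sizeof narrow[0] << '\n'  // 19 under these assumptions
              << "wide elements:   " << sizeof wide   / sizeof wide[0]   << '\n'; // 10
}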
It depends on what encoding you decide to use. Any single UTF-8 value can be held in an 8-bit char (though one Unicode code point can take several char values to represent). It's impossible to tell from your question, but I'd guess that your editor and compiler are treating your strings as UTF-8, and that's fine if that's what you want.
Other common encodings include UTF-16, UTF-32, UCS-2 and UCS-4, which have 2-byte, 4-byte, 2-byte and 4-byte values respectively. You can't store these values in an 8-bit char.
The decision of what encoding to use for any given purpose is not straightforward. The main considerations are:
What other systems does your code have to interface to and what encoding do they use?
What libraries do you want to use and what encodings do they use? (e.g. xerces-c uses UTF-16 throughout)
The trade-off between complexity and storage size. UTF-32 and UCS-4 have the useful property that every possible displayed character is represented by one value, so you can tell the length of the string from how much memory it takes up without having to look at the values in it (though this assumes that you consider combining diacritic marks as separate characters). However, if all you're representing is ASCII, they take up four times as much memory as UTF-8.
I'd suggest Joel Spolsky's essay on Unicode as a good read.
wchar_t has its own problems, though. The standard didn't specify how big a wchar_t is, so, of course, different compilers have picked different sizes; VC++ uses two bytes and gcc (and most others) uses four bytes. Wide-character literals, such as L"Hello, world," are similarly confused, being UTF-16 strings in VC++ and UCS-4 in gcc.
To try to clean this up, C++11 introduced two new character types:
char16_t is a character type guaranteed to be 16 bits, with a literal form u"Hello, world."
char32_t is a character type guaranteed to be 32 bits, with a literal form U"Hello, world."
However, these have problems of their own; in particular, <iostream> doesn't provide console streams that can handle them (i.e. there is no u16cout or u32cerr).
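A brief sketch of the new types in use; on common platforms sizeof(char16_t) is 2 and sizeof(char32_t) is 4, and the strings cannot be streamed directly for the reason just mentioned:

#include <iostream>
#include <string>

int main()
{
    std::u16string s16 = u"Hello, world"; // char16_t code units
    std::u32string s32 = U"Hello, world"; // char32_t code units

    std::cout << "sizeof(char16_t): " << sizeof(char16_t) << '\n' // typically 2
              << "sizeof(char32_t): " << sizeof(char32_t) << '\n' // typically 4
              << "s16 length: " << s16.size() << '\n'
              << "s32 length: " << s32.size() << '\n';
    // There is no u16cout/u32cerr, so the strings themselves are not printed here.
}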
To be more specific, I'll provide a normative reference related to the question. N3797 §8.5.2/1 [dcl.init.string] says:
An array of narrow character type (3.9.1), char16_t array, char32_t array, or wchar_t array can be initialized by a narrow string literal, char16_t string literal, char32_t string literal, or wide string literal, respectively, or by an appropriately-typed string literal enclosed in braces (2.14.5). Successive characters of the value of the string literal initialize the elements of the array.
8.5.2/2:
There shall not be more initializers than there are array elements.
In the case of
#include <iostream>

char cp[] = "LATINICA_КИРИЛЛИЦА";

int main()
{
    std::cout << sizeof(cp) << std::endl; // 28
}
For some languages, like English, it's not necessary to use wchar_t, but for other languages, like Chinese, you'd better use wchar_t.
Although char is able to store a string, like char p[] = "你好",
it may show garbled text when you run your program on a different computer, especially one that uses a different language setting.
If you use wchar_t, you can avoid this.

Is the u8 string literal necessary in C++11

From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char[] str = "Test String";
or this?
const char[] str = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the test string?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥ 8, due to the range required for char in the C standard, which is (quoting C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix, it
affects the encoding of the string in the executable, but
unfortunately it does not affect the type.
Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for array size 9. While in the latter literal the encoding is guaranteed to be UTF-8, where the “ø” will be encoded with 2 bytes, for a slightly larger array size.
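A sketch of that size difference; the first number is implementation-defined (9 with a Latin-1-style narrow execution character set, 10 if the compiler already uses UTF-8), and it assumes a pre-C++20 compiler, where u8 literals still have type array-of-char:

#include <iostream>

int main()
{
    const char plain[] = "t\u00F8rrfisk";   // narrow execution encoding: implementation-defined
    const char utf8[]  = u8"t\u00F8rrfisk"; // always UTF-8: "ø" takes 2 bytes

    std::cout << "plain: " << sizeof plain << " bytes\n"
              << "utf8:  " << sizeof utf8  << " bytes\n"; // 10: 8 characters in 9 bytes, plus '\0'
}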
If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 code page (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:
std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII and something possibly depending on the environment's settings for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
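For instance, the classic byte-count-versus-character-count mismatch (a sketch assuming a pre-C++20 compiler, where u8"..." is an array of char, encoded as UTF-8):

#include <cstring>
#include <iostream>

int main()
{
    const char* s = u8"na\u00EFve"; // 5 characters, but "ï" occupies 2 bytes in UTF-8

    std::cout << std::strlen(s) << " bytes for 5 characters\n"; // prints 6
}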

Why is there no ASCII or UTF-8 character literal in C11 or C++11?

Why is there no UTF-8 character literal in C11 or C++11 even though there are UTF-8 string literals? I understand that, generally-speaking, a character literal represents a single ASCII character which is identical to a single-octet UTF-8 code point, but neither C nor C++ says the encoding has to be ASCII.
Basically, if I read the standard right, there's no guarantee that '0' will represent the integer 0x30, yet u8"0" must represent the char sequence 0x30 0x00.
EDIT:
I'm aware not every UTF-8 code point would fit in a char. Such a literal would only be useful for single-octet code points (aka, ASCII), so I guess calling it an "ASCII character literal" would be more fitting, so the question still stands. I just chose to frame the question with UTF-8 because there are UTF-8 string literals. The only way I can imagine portably guaranteeing ASCII values would be to write a constant for each character, which wouldn't be so bad considering there are only 128, but still...
It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.
If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.
_Static_assert('0' == 48, "must be ASCII-compatible");
Or, for pre-C11 compilers,
extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];
If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...
/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...
/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...
/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...
/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...
There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.
On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list are heap-allocated, etc. It's a nightmare.
Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.
For example, as far as the C standard is concerned, the following is a valid implementation of malloc:
void *malloc(size_t size) { return NULL; }
Note that while u8"..." constants are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the encoding is 16-bits and 32-bits per character, respectively, and the actual encoding must be documented by the implementation.
Summary: Safe to assume ASCII compatibility in 2012.
A UTF-8 character literal would have to have variable length: for most characters, it's not possible to store a single one in a char or wchar_t, so what type should it have? As we don't have variable-length types in C or C++, except for arrays of fixed-size types, the only reasonable type for it would be const char *, and C strings are required to be null-terminated, so it wouldn't change anything.
As for the edit:
Quote from the C++11 standard:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
(footnote at 2.3.1).
I think that's a good reason for not guaranteeing it. Although, as you noted in a comment, for most (or every) mainstream compiler, the ASCII-ness of character literals is guaranteed by the implementation.
For C++ this has been addressed by Evolution Working Group issue 119, "Adding u8 character literals", whose Motivation section says:
We have five encoding-prefixes for string-literals (none, L, u8, u, U) but only four for character literals -- the missing one is u8. If the narrow execution character set is not ASCII, u8 character literals would provide a way to write character literals with guaranteed ASCII encoding (the single-code-unit u8 encodings are exactly ASCII). Adding support for these literals would add a useful feature and make the language slightly more consistent.
EWG discussed the idea of adding u8 character literals in Rapperswil and accepted the change. This paper provides wording for that extension.
This was incorporated into the working draft using the wording from N4267, "Adding u8 character literals". We can find the wording in the latest draft standard at the time, N4527; note that section 2.14.3 says they are limited to code points that fit into a single UTF-8 code unit:
A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is a US-ASCII character). A UTF-8 character literal containing multiple c-chars is ill-formed.
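That wording landed in C++17, so the following sketch assumes a C++17 (or later) compiler; note that C++20 changes the literal's type to char8_t, but its value stays the ASCII/UTF-8 code for the character:

#include <iostream>

int main()
{
    constexpr char zero = u8'0'; // value is the UTF-8 (i.e. ASCII) code, regardless of the narrow execution set
    static_assert(zero == 0x30, "u8'0' is 0x30 even on an EBCDIC system");

    std::cout << (int)zero << '\n'; // 48
}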
If you don't trust that your compiler will treat '0' as ASCII character 0x30, then you could use static_cast<char>(0x30) instead.
As you are aware, UTF-8-encoded characters may need several octets, and thus several chars, so the natural type for them is char[], which is indeed the type of a u8-prefixed string literal. C11 is right on track here; it just sticks to its syntax convention of using " for a string that is used as an array of char, rather than your implied semantics-based proposal to use ' instead.
About "0" versus u8"0", you are reading right, only the latter is guaranteed to be identical to { 0x30, 0 }, even on EBCDIC systems. By the way, the very fact the former is not can be handled conveniently in your code, if you pay attention to the __STDC_MB_MIGHT_NEQ_WC__ predefined identifier.

How can I find out what the current charset is in C++?

How can I find out what the current charset is in C++?
In a console application (WinXP) I am getting negative values for some characters (like äöüé) with
(int)mystring[a]
and this surprises me. I was expecting the values to be between 127 and 256.
So is there something like GetCharset() or SetCharset() in c++?
It depends on how you look at the value you have at hand. char can be signed (e.g. on Windows) or unsigned, as on some other systems. So what you should do is print the value as unsigned to get what you are asking for.
C++ itself is character-set agnostic. For the Windows console specifically, you can use GetConsoleOutputCP.
Look at std::numeric_limits<char>::min() and max(). Or CHAR_MIN and CHAR_MAX if you don't like typing, or if you need an integer constant expression.
If CHAR_MAX == UCHAR_MAX and CHAR_MIN == 0 then chars are unsigned (as you expected). If CHAR_MAX != UCHAR_MAX and CHAR_MIN < 0 they are signed (as you're seeing).
The standard (3.9.1/1) ensures that there are no other possibilities: "... a plain char can take on either the same values as a signed char or an unsigned char; which one is implementation-defined."
This tells you whether char is signed or unsigned, and that's what's confusing you. You certainly can't call anything to modify it: from the POV of a program it's baked into the compiler even if the compiler has ways of changing it (GCC certainly does: -fsigned-char and -funsigned-char).
The usual way to deal with this is if you're going to cast a char to int, cast it through unsigned char first. So in your example, (int)(unsigned char)mystring[a]. This ensures you get a non-negative value.
It doesn't actually tell you what charset your implementation uses for char, but I don't think you need to know that. On Microsoft compilers, the answer is essentially that commonly-used character encoding "ISO-8859-mutter-mutter". This means that chars with 7-bit ASCII values are represented by that value, while values outside that range are ambiguous, and will be interpreted by a console or other recipient according to how that recipient is configured. ISO Latin 1 unless told otherwise.
Properly speaking, the way characters are interpreted is locale-specific, and the locale can be modified and interrogated using a whole bunch of stuff towards the end of the C++ standard that personally I've never gone through and can't advise on ;-)
Note that if there's a mismatch between the charset in effect, and the charset your console uses, then you could be in for trouble. But I think that's separate from your issue: whether chars can be negative or not is nothing to do with charsets, just whether char is signed.
chars are normally signed by default. Try casting through unsigned char before widening:
cout << (int)(unsigned char)mystring[a] << endl;
The only guarantees that the standard provides are for members of the basic character set:
2.2 Character sets
3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
Further, the type char is supposed to hold:
3.9.1 Fundamental types
1 Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set.
So, there are no guarantees that you will get the value you expect for the characters you have mentioned. However, try using an unsigned type to hold the value (for all practical purposes, it rarely makes sense to use a signed type to hold char values if you are going to print them or pass them around).
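Putting the advice from these answers together, a small sketch (the byte values shown assume a Latin-1-style narrow character set, where "äöü" is 0xE4 0xF6 0xFC):

#include <iostream>
#include <limits>
#include <string>

int main()
{
    std::cout << std::boolalpha
              << "char is signed: " << std::numeric_limits<char>::is_signed << '\n';

    std::string mystring = "\xE4\xF6\xFC"; // "äöü" in a Latin-1-style code page
    for (char c : mystring)
        // Go through unsigned char before widening, so 0x80..0xFF print as 128..255.
        std::cout << (int)(unsigned char)c << ' '; // 228 246 252
    std::cout << '\n';
}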