C++ and the Integer Value of Characters [duplicate] - c++

I didn't know that C and C++ allow multicharacter literals: not just 'c' (of type int in C and char in C++), but 'tralivali' (of type int!)
enum
{
ActionLeft = 'left',
ActionRight = 'right',
ActionForward = 'forward',
ActionBackward = 'backward'
};
Standard says:
C99 6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined."
I found they are widely used in the C4 engine, but I suppose they are not safe when we are talking about platform-independent serialization. They can also be confusing because they look like strings. So what is the scope of usage for multicharacter literals? Are they useful for anything? Are they in C++ just for compatibility with C code? Are they considered a bad feature, like the goto statement, or not?

It makes it easier to pick out values in a memory dump.
Example:
enum state { waiting, running, stopped };
vs.
enum state { waiting = 'wait', running = 'run.', stopped = 'stop' };
a memory dump after the following statement:
s = stopped;
might look like:
00 00 00 02 . . . .
in the first case, vs:
73 74 6F 70 s t o p
using multicharacter literals. (of course whether it says 'stop' or 'pots' depends on byte ordering)
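As a minimal sketch of the idea (using the enum above; the dump format is simplified), you can make the effect visible without a debugger:
#include <cstdio>
#include <cstring>

// Values are implementation-defined (C99 6.4.4.4p10), but typically
// pack the four characters into one 32-bit word.
enum state { waiting = 'wait', running = 'run.', stopped = 'stop' };

int main() {
    state s = stopped;
    unsigned char bytes[sizeof s];
    std::memcpy(bytes, &s, sizeof s);
    for (unsigned char b : bytes)
        std::printf("%02X ", b);  // e.g. "73 74 6F 70" -- or "70 6F 74 73",
    std::printf("\n");            // depending on byte ordering
}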

I don't know how extensively this is used, but "implementation-defined" is a big red-flag to me. As far as I know, this could mean that the implementation could choose to ignore your character designations and just assign normal incrementing values if it wanted. It may do something "nicer", but you can't rely on that behavior across compilers (or even compiler versions). At least "goto" has predictable (if undesirable) behavior...
That's my 2c, anyway.
Edit: on "implementation-defined":
From Bjarne Stroustrup's C++ Glossary:
implementation defined - an aspect of C++'s semantics that is defined for each implementation rather than specified in the standard for every implementation. An example is the size of an int (which must be at least 16 bits but can be longer). Avoid implementation defined behavior whenever possible. See also: undefined. TC++PL C.2.
also...
undefined - an aspect of C++'s semantics for which no reasonable behavior is required. An example is dereferencing a pointer with the value zero. Avoid undefined behavior. See also: implementation defined. TC++PL C.2.
I believe this means the comment is correct: it should at least compile, although anything beyond that is not specified. Note the advice in the definition, also.

Four-character literals I've seen and used. They map to 4 bytes, one 32-bit word. They're very useful for debugging purposes, as said above, and they can be used in a switch/case statement with ints, which is nice.
This (4 chars) is pretty well supported (by GCC and VC++ at least), although the results (the actual values compiled) may vary from one implementation to another.
But over 4 chars? I wouldn't use them.
UPDATE: From the C4 page: "For our simple actions, we'll just provide an enumeration of some values, which is done in C4 by specifying four-character constants". So they are using four-character literals, as was my case.
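As a sketch of the switch/case usage mentioned above (the chunk tags and names here are made up, and assume the same compiler writes and reads them):
// Dispatch on four-character tags read from a file:
void handle_chunk(int tag) {
    switch (tag) {
    case 'MATL': /* parse material */ break;
    case 'MESH': /* parse mesh     */ break;
    case 'SKEL': /* parse skeleton */ break;
    default:     /* skip unknown chunk */ break;
    }
}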

Multicharacter literals allow one to specify int values via the equivalent representation in characters. Useful for enums, FourCC codes and tags, and non-type template parameters. With a multicharacter literal, a FourCC code can be typed directly into the source, which is handy.
The implementation in gcc is described at https://gcc.gnu.org/onlinedocs/cpp/Implementation-defined-behavior.html . Note that the value is truncated to the size of the type int, so 'efgh' == 'abcdefgh' if your ints are 4 chars wide, although gcc will issue a warning on the literal that overflows.
Unfortunately, gcc will issue a warning on all multi-character literals if -pedantic is passed, as their behavior is implementation-defined. As you can see above, it is perhaps possible for equality of two multi-character literals to change if you switch implementations.
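Concretely, gcc builds the value by shifting in one character at a time, so on a 4-byte-int target the following hold. This is a sketch of gcc's documented behavior, not a portable guarantee:
static_assert('ab' == ('a' << 8 | 'b'), "assumes gcc's layout");
static_assert('abcd' == ('a' << 24 | 'b' << 16 | 'c' << 8 | 'd'),
              "assumes gcc's layout");
// 'abcdefgh' keeps only the trailing int-sized part (with a warning),
// so 'abcdefgh' == 'efgh' when int is 4 bytes wide.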

In the C++14 specification draft N4527, section 2.13.3, entry 2:
... An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Previous answers to your question pertained mostly to real machines that do support multicharacter literals. Specifically, on platforms where int is 4 bytes, a four-character literal is fine and can be used for convenience, as per Ferrucio's memory dump example. But as there is no guarantee that this will ever work, or work the same way, on other platforms, multicharacter literals should be avoided in portable programs.
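If you want the readable tags without the implementation-defined value, a portable alternative is to build the constant explicitly. A sketch (the helper name is invented):
#include <cstdint>

// Same value on every conforming implementation, unlike 'stop':
constexpr std::uint32_t fourcc(char a, char b, char c, char d) {
    return (std::uint32_t(std::uint8_t(a)) << 24) |
           (std::uint32_t(std::uint8_t(b)) << 16) |
           (std::uint32_t(std::uint8_t(c)) <<  8) |
            std::uint32_t(std::uint8_t(d));
}

constexpr std::uint32_t kStop = fourcc('s', 't', 'o', 'p');  // always 0x73746F70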

Unbelievable: every compiler I know places the first character of a four-character constant in the least significant byte of the resulting unsigned int (little-endian order), but Visual C++ does it in the opposite direction 🙄
// file signature
#define SFKFILE_SIGNATURE 'SFPK'  // 'S' == 0x53
// check header
if (out_FileHdr->Signature != SFKFILE_SIGNATURE)
fails on VC:
Borland: 4B504653 4B504653
Watcom: 4B504653 4B504653
VisualC: 4B504653 5346504B
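To see which layout your own compiler uses, a one-line check (a sketch; expect a -Wmultichar-style warning):
#include <cstdio>

int main() {
    // e.g. 5346504B with GCC, Clang, and Visual C++;
    // 4B504653 with Borland and Watcom per the table above.
    std::printf("%08X\n", (unsigned)'SFPK');
}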

Related

single vs double quotes C++ - interesting, unexpected behaviour [duplicate]


Is the u8 string literal necessary in C++11

From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char[] str = "Test String";
or this?
const char[] str = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII-Characters inside the TestString?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
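A sketch that makes the difference observable (pre-C++20: since C++20 the u8 literal's element type is char8_t rather than char):
#include <cstdio>

void dump(const char *p) {
    for (; *p; ++p)
        std::printf("%02X ", (unsigned char)*p);
    std::printf("\n");
}

int main() {
    dump("\u20AC");    // the euro sign in the narrow execution charset:
                       // implementation-defined (80 under Windows-1252)
    dump(u8"\u20AC");  // always the UTF-8 sequence: E2 82 AC
}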
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≄8, due to the range required for char in the C standard. Which is (quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix, it
affects the encoding of the string in the executable, but
unfortunately it does not affect the type.
Thus, in both cases "tĂžrrfisk" and u8"tĂžrrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin-1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a nullbyte, for array size 9. While in the latter literal the encoding is guaranteed to be UTF-8, where the "Ăž" (U+00F8) is encoded with 2 bytes, for a slightly larger array size.
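A quick check of those array sizes, as a sketch (this assumes a Latin-1 narrow execution charset and pre-C++20 u8 semantics):
const char a[] = "t\u00F8rrfisk";    // Ăž as one Latin-1 byte: sizeof(a) == 9
const char b[] = u8"t\u00F8rrfisk";  // Ăž as the two UTF-8 bytes C3 B8: sizeof(b) == 10
static_assert(sizeof(b) == 10, "8 characters, 9 code units, plus the terminator");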
If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 codepage (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:
std::string narrowJapanese("ă‚čă‚żăƒŒă‚Żăƒ©ăƒ•ăƒˆ");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
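Continuing that snippet, routing the same text through a u8 literal avoids the crash, because the bytes handed to from_bytes are then guaranteed to be valid UTF-8 (a pre-C++20 sketch; note that std::wstring_convert itself is deprecated since C++17):
std::string utf8Japanese(u8"ă‚čă‚żăƒŒă‚Żăƒ©ăƒ•ăƒˆ");
std::wstring wideOk = convertWindows.from_bytes(utf8Japanese); // succeeds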
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII and something possibly depending on environment's setting for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
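For example, the classic byte/character mismatch (a sketch, pre-C++20 again):
#include <cstdio>
#include <cstring>

int main() {
    const char *s = u8"\u65E5\u672C\u8A9E";  // "日本語": 3 characters
    std::printf("%zu\n", std::strlen(s));    // prints 9: strlen counts char objects
}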

Why is there no ASCII or UTF-8 character literal in C11 or C++11?

Why is there no UTF-8 character literal in C11 or C++11 even though there are UTF-8 string literals? I understand that, generally-speaking, a character literal represents a single ASCII character which is identical to a single-octet UTF-8 code point, but neither C nor C++ says the encoding has to be ASCII.
Basically, if I read the standard right, there's no guarantee that '0' will represent the integer 0x30, yet u8"0" must represent the char sequence 0x30 0x00.
EDIT:
I'm aware not every UTF-8 code point would fit in a char. Such a literal would only be useful for single-octet code points (aka, ASCII), so I guess calling it an "ASCII character literal" would be more fitting, so the question still stands. I just chose to frame the question with UTF-8 because there are UTF-8 string literals. The only way I can imagine portably guaranteeing ASCII values would be to write a constant for each character, which wouldn't be so bad considering there are only 128, but still...
It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.
If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.
_Static_assert('0' == 48, "must be ASCII-compatible");
Or, for pre-C11 compilers,
extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];
If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...
/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...
/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...
/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...
/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...
There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.
On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list are heap-allocated, etc. It's a nightmare.
Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.
For example, as far as the C standard is concerned, the following is a valid implementation of malloc:
void *malloc(size_t size) { return NULL; }
Note that while u8"..." constants are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the encoding is 16-bits and 32-bits per character, respectively, and the actual encoding must be documented by the implementation.
Summary: Safe to assume ASCII compatibility in 2012.
A UTF-8 character literal would have to have variable length: for most characters, it's not possible to store a single character in a char or wchar_t. So what type should it have? As we don't have variable-length types in C or C++, except for arrays of fixed-size types, the only reasonable type for it would be const char *, and C strings are required to be null-terminated, so it wouldn't change anything.
As for the edit:
Quote from the C++11 standard:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
(footnote at 2.3.1).
I think that's a good reason for not guaranteeing it. Although, as you noted in a comment here, for most (or every) mainstream compiler, the ASCII-ness of character literals is guaranteed by the implementation.
For C++ this has been addressed by Evolution Working Group issue 119: Adding u8 character literals whose Motivation section says:
We have five encoding-prefixes for string-literals (none, L, u8, u, U) but only four for character literals -- the missing one is u8. If the narrow execution character set is not ASCII, u8 character literals would provide a way to write character literals with guaranteed ASCII encoding (the single-code-unit u8 encodings are exactly ASCII). Adding support for these literals would add a useful feature and make the language slightly more consistent.
EWG discussed the idea of adding u8 character literals in Rapperswil and accepted the change. This paper provides wording for that extension.
This was incorporated into the working draft using the wording from N4267: Adding u8 character literals, and we can find the wording in the latest draft standard at this time, N4527, where section 2.14.3 says they are limited to code points that fit into a single UTF-8 code unit:
A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is a US-ASCII character). A UTF-8 character literal containing multiple c-chars is ill-formed.
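With that extension (it landed in C++17), the guarantee asked about in the question becomes expressible directly. A sketch:
// C++17: u8'...' has a guaranteed UTF-8 (i.e. ASCII) value even on an
// EBCDIC system; in C++20 its type changes from char to char8_t.
static_assert(u8'0' == 0x30, "single-code-unit u8 literals are exactly ASCII");
char zero = u8'0';  // 0x30 regardless of the execution character set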
If you don't trust that your compiler will treat '0' as ASCII character 0x30, then you could use static_cast<char>(0x30) instead.
As you are aware, UTF-8-encoded characters may need several octets, thus several chars, so the natural type for them is char[], which is indeed the type of a u8-prefixed string literal! So C11 is right on track here; it just sticks to its syntax conventions, using " for a string that has to be used as an array of char, rather than your implied semantics-based proposal to use ' instead.
About "0" versus u8"0", you are reading right, only the latter is guaranteed to be identical to { 0x30, 0 }, even on EBCDIC systems. By the way, the very fact the former is not can be handled conveniently in your code, if you pay attention to the __STDC_MB_MIGHT_NEQ_WC__ predefined identifier.

Multicharacter literal in C and C++


How can I find out what the current charset is in C++?

How can I find out what the current charset is in C++?
In a console application (WinXP) I am getting negative values for some characters (like Ă€Ă¶ĂŒĂ©) with
(int)mystring[a]
and this surprises me. I was expecting the values to be between 127 and 256.
So is there something like GetCharset() or SetCharset() in c++?
It depends on how you look at the value you have at hand. char can be signed (e.g. on Windows) or unsigned, as on some other systems. So what you should do is print the value as unsigned to get what you are asking for.
C++ is, so far, charset-agnostic. For the Windows console specifically, you can use GetConsoleOutputCP.
Look at std::numeric_limits<char>::min() and max(). Or CHAR_MIN and CHAR_MAX if you don't like typing, or if you need an integer constant expression.
If CHAR_MAX == UCHAR_MAX and CHAR_MIN == 0 then chars are unsigned (as you expected). If CHAR_MAX != UCHAR_MAX and CHAR_MIN < 0 they are signed (as you're seeing).
The standard (3.9.1/1) ensures that there are no other possibilities: "... a plain char can take on either the same values as a signed char or an unsigned char; which one is implementation-defined."
This tells you whether char is signed or unsigned, and that's what's confusing you. You certainly can't call anything to modify it: from the POV of a program it's baked into the compiler even if the compiler has ways of changing it (GCC certainly does: -fsigned-char and -funsigned-char).
The usual way to deal with this is if you're going to cast a char to int, cast it through unsigned char first. So in your example, (int)(unsigned char)mystring[a]. This ensures you get a non-negative value.
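Putting the two points together, a minimal sketch (the byte values assume the charset is Latin-1-ish, as on a Western European Windows box):
#include <climits>
#include <iostream>
#include <string>

int main() {
    // Signedness check: equivalent to asking whether CHAR_MIN < 0.
    std::cout << "char is " << (CHAR_MIN < 0 ? "signed" : "unsigned") << "\n";

    std::string mystring = "\xE4\xF6\xFC\xE9";  // Ă€Ă¶ĂŒĂ© in Latin-1
    for (unsigned char ch : mystring)
        std::cout << int(ch) << ' ';            // 228 246 252 233, never negative
    std::cout << "\n";
}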
It doesn't actually tell you what charset your implementation uses for char, but I don't think you need to know that. On Microsoft compilers, the answer is essentially that commonly-used character encoding "ISO-8859-mutter-mutter". This means that chars with 7-bit ASCII values are represented by that value, while values outside that range are ambiguous, and will be interpreted by a console or other recipient according to how that recipient is configured. ISO Latin 1 unless told otherwise.
Properly speaking, the way characters are interpreted is locale-specific, and the locale can be modified and interrogated using a whole bunch of stuff towards the end of the C++ standard that personally I've never gone through and can't advise on ;-)
Note that if there's a mismatch between the charset in effect, and the charset your console uses, then you could be in for trouble. But I think that's separate from your issue: whether chars can be negative or not is nothing to do with charsets, just whether char is signed.
chars are signed by default on most platforms.
Try this:
cout << (int)(unsigned char) mystring[a] << endl;
The only guarantees that the standard provides are for members of the basic character set:
2.2 Character sets
3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
Further, the type char is supposed to hold:
3.9.1 Fundamental types
1 Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set.
So, no guarantees whether you will get the correct values for the characters you have mentioned. However, try to use an unsigned type to hold the value (for all practical purposes, it never makes sense to use a signed type to hold char values if you are going to print them or pass them around).