Is this UTF-8 implementation implementation-defined or well-defined? - c++

I just browsed around looking for some implementation of UTF-8 code points (and no, not to plagiarize) and stumbled across this:
typedef unsigned char char8_t;
typedef std::basic_string<unsigned char> u8string;
Is this code ignoring the fact that CHAR_BIT is only required to be at least 8, but may be greater? Or does this not matter in this context and the code is fine? If so, then why is this?
Also, someone (presumably SO member #NicolBolas?) wrote this:
const char *str = u8"This is a UTF-8 string.";
This is pretty much how UTF-8 will be used in C++ for string literals.
I thought that a code unit in UTF-8 is always exactly eight bits!
From the Unicode Standard 8.0.0, Chapter 2.5:
In the Unicode character encoding model, precisely defined encoding
forms specify how each integer (code point) for a Unicode character is
to be expressed as a sequence of one or more code units. The Unicode
Standard provides three distinct encoding forms for Unicode
characters, using 8-bit, 16-bit, and 32-bit units. These are
named UTF-8, UTF-16, and UTF-32, respectively.
(Newlines removed, hyphen on line-break removed, emphasis added.)
So why does he claim const char* is used instead of const uint8_t* (or the suggested, hypothetical const char8_t*)?

uint8_t only exists on systems that have memory that's accessible as exactly 8 bits. UTF-8 doesn't have any such requirement. It uses values that fit into 8 bits, but does not impose any requirements on how those values are actually stored. Each 8-bit value could be stored as 16 bits or 32 bits or whatever makes sense for the system that it's running on; the only requirement is that the value must be correct.

So why does he claim const char* is used instead of const uint8_t* (or the suggested, hypothetical const char8_t*)?
Because that's what the standard says. A u8 literal string resolves to an array of type const char[N]. That's how UTF-8 literals in C++ are defined to work.
If char on a system has more than 8 bits... so be it. Each char in the string will still hold a value between 0 and 255, which covers the range of valid UTF-8 code units, even though char could hold larger values on such a system.
If char cannot hold 8 bits... then the implementation is invalid. By recent wording of the standard, char is required to hold enough bits to store every valid UTF-8 code unit. And technically, 255 is not a valid UTF-8 code unit.
And the fact of the matter is this: there's already a huge amount of code that accepts UTF-8 via char*. They aren't going to rewrite POSIX, filesystem APIs, and whatever else to adopt a different type.
That being said, manipulating a sequence of UTF-8 code units via const char* is... dubious. This is because they could be signed. However, the recent standard wording requires that a conversion between unsigned char and char work within the range of valid UTF-8 code units. That is, you can cast a const char* to a const unsigned char*, do your bit manipulation on that, and then cast it back, and you're guaranteed to work.
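As a minimal sketch of that pattern (not from the original answer; the function name is made up), here is one way to count code points in a char-based UTF-8 string by inspecting the bytes through unsigned char:
#include <cstddef>
// Sketch: count UTF-8 code points by looking at lead bytes through unsigned char.
// Assumes the input is valid, NUL-terminated UTF-8; continuation bytes match 10xxxxxx.
std::size_t count_code_points(const char* str)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
    std::size_t n = 0;
    for (; *p != 0; ++p)
        if ((*p & 0xC0u) != 0x80u)   // not a continuation byte, so it starts a code point
            ++n;
    return n;
}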
And what's the point of that super-complex "recent wording of the standard"?
The point of that is to allow UTF-8 strings to actually work. Because the standards committee, in their "infinite wisdom", decided not to include a special char8_t UTF-8 code unit type, they had to add wording to make char serve in that role. And that requires that the conversion between unsigned char and char not be able to mangle a UTF-8 code unit.
There was even a discussion topic on the C++ standard discussion forums, where the wording was discussed (search for 1759). The C++14 wording says:
For each value i of type unsigned char in the range 0 to 255 inclusive, there exists a value j of type char such that the result of an integral conversion (4.7) from i to char is j, and the result of an integral conversion from j to unsigned char is i.
This means in particular that char can only be signed by default if the signed representation satisfies the above. A one's complement signed char would not be sufficient, since its negative zero (the all-ones bit pattern) converts to unsigned char as an ordinary 0, so only 255 distinct values survive the round trip.
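A minimal sketch of what that round-trip guarantee means in code (an illustrative check, not standard text):
#include <cassert>
// Sketch: the C++14 wording requires this loop to pass for every value 0..255.
void check_round_trip()
{
    for (int i = 0; i <= 255; ++i) {
        unsigned char u = static_cast<unsigned char>(i);
        char j = static_cast<char>(u);               // integral conversion i -> char
        assert(static_cast<unsigned char>(j) == u);  // ...and back to unsigned char gives i
    }
}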
Should they have just defined a specific char8_t that is required to be unsigned and has at least 8 bits? Probably. But it's done and it ain't changing.

[lex.string]/8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
So, whatever else is true, a UTF-8 string literal is a sequence of chars.
As to uint8_t:
7.20.1.1
2 The typedef name uintN_t designates an unsigned integer type with width N and no padding bits. Thus, uint24_t denotes such an unsigned integer type with a width of exactly 24 bits.
3 These types are optional. However, if an implementation provides integer types with widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed types) that have a two’s complement representation, it shall define the corresponding typedef names.
On a hypothetical system with a char larger than 8 bits, uint8_t would not be defined.
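A hedged way to detect that at compile time (a sketch; it relies on the rule that UINT8_MAX is defined exactly when uint8_t is provided):
#include <cstdint>
#if defined(UINT8_MAX)
// uint8_t exists: exactly 8 bits and no padding, so CHAR_BIT is 8 here.
using code_unit = std::uint8_t;
#else
// No uint8_t (e.g. CHAR_BIT > 8): fall back to unsigned char, which always exists.
using code_unit = unsigned char;
#endif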

A code unit in UTF-8 is always exactly eight bits. unsigned char is specified to have at least 8 bits, so all UTF-8 code units fit in type unsigned char.
The rationale for u8"This is a UTF-8 encoded string constant" is not that the string is stored in 8-bit bytes, but that it is encoded as UTF-8, whereas the source file might use a different encoding. The u8string typedef is consistent with that, but a tad confusing if bytes have more than 8 bits.
Using unsigned char is a good way to remove the uncertainty regarding the signedness of type char.
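For example (a sketch assuming pre-C++20 rules, where a u8 literal is an array of const char), the code units of a u8 literal can be inspected through unsigned char regardless of whether plain char is signed:
#include <cstdio>
int main()
{
    const char* s = u8"é";   // two UTF-8 code units: 0xC3 0xA9
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s); *p != 0; ++p)
        std::printf("%02X ", *p);   // prints C3 A9
}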

char8_t was voted into C++20 at the San Diego meeting, so this code (which defines its own char8_t typedef) will not compile under C++20.
However, you will be able to use std::u8string, but remember that it only works with code units, not code points or grapheme clusters, so the safe approach is to treat it as an opaque blob and use third-party libraries to mutate it. At least for now.
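A short C++20 sketch of that distinction (the string chosen here is just an example):
#include <string>
int main()
{
    std::u8string s = u8"naïve";   // C++20: char8_t code units
    // s.size() is 6 code units, not 5 characters: "ï" occupies two UTF-8 code units.
    // Treat s as an opaque blob; use a Unicode library for code-point or grapheme work.
    return s.size() == 6 ? 0 : 1;
}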

Related

Confusing sizeof(char) by ISO/IEC in different character set encoding like UTF-16

Assuming that a program is running on a system with UTF-16 encoding character set. So according to The C++ Programming Language - 4th, page 150:
A char can hold a character of the machine’s character set.
→ I think that a char variable will have a size of 2 bytes.
But according to ISO/IEC 14882:2014:
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
or The C++ Programming Language - 4th, page 149:
"[...], so by definition the size of a char is 1"
→ So its size is fixed at 1.
Question: Is there a conflict between these statements above, or is sizeof(char) = 1 just a default (definitional) value that is implementation-defined and depends on each system?
The C++ standard (and C, for that matter) effectively define byte as the size of a char type, not as an eight-bit quantity1. As per C++11 1.7/1 (my bold):
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined.
Hence the expression sizeof(char) is always 1, no matter what.
If you want to see whether your baseline char variable (probably the unsigned variant would be best) can actually hold a 16-bit value, the item you want to look at is CHAR_BIT from <climits>. This holds the number of bits in a char variable.
1 Many standards, especially ones related to communications protocols, use the more exact term octet for an eight-bit value.
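A tiny sketch of that CHAR_BIT check:
#include <climits>
#include <cstdio>
int main()
{
    // CHAR_BIT is at least 8; it is 8 on most platforms, 16 on some DSPs.
    std::printf("bits per char: %d\n", CHAR_BIT);
    std::printf("a char can hold a 16-bit value: %s\n", CHAR_BIT >= 16 ? "yes" : "no");
}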
Yes, there are a number of serious conflicts and problems with the C++ conflation of rôles for char, but also the question conflates a few things. So a simple direct answer would be like answering “yes”, “no” or “don’t know” to the question “have you stopped beating your wife?”. The only direct answer is the buddhist “mu”, unasking the question.
So let's therefore start with a look at the facts.
Facts about the char type.
The number of bits per char is given by the implementation defined CHAR_BIT from the <limits.h> header. This number is guaranteed to be 8 or larger. With C++03 and earlier that guarantee came from the specification of that symbol in the C89 standard, which the C++ standard noted (in a non-normative section, but still) as “incorporated”. With C++11 and later the C++ standard explicitly, on its own, gives the ≥8 guarantee. On most platforms CHAR_BIT is 8, but on some probably still extant Texas Instruments digital signal processors it’s 16, and other values have been used.
Regardless of the value of CHAR_BIT the sizeof(char) is by definition 1, i.e. it's not implementation defined:
C++11 §5.3.3/1 (in [expr.sizeof]):
” sizeof(char), sizeof(signed char) and
sizeof(unsigned char) are 1.
That is, char and its variants are the fundamental unit of addressing of memory, which is the primary meaning of byte, both in common speech and formally in C++:
C++11 §1.7/1 (in [intro.memory]):
” The fundamental storage unit in the C++ memory model is the byte.
This means that on the aforementioned TI DSPs, there is no C++ way of obtaining pointers to individual octets (8-bit parts). And that in turn means that code that needs to deal with endianness, or in other ways needs to treat char values as sequences of octets, in particular for network communications, needs to do things with char values that are not meaningful on a system where CHAR_BIT is 8. It also means that ordinary C++ narrow string literals, if they adhere to the standard, and if the platform's standard software uses an 8-bit character encoding, will waste memory.
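As a hedged sketch of what such code has to do (the function name and the low-octet-first order are assumptions, not requirements):
#include <climits>
#include <cstddef>
#include <vector>
// Sketch: split each char of a buffer into 8-bit octets so that protocol code which
// must speak in octets still works when CHAR_BIT > 8 (e.g. 16-bit-char DSPs).
std::vector<unsigned> to_octets(const unsigned char* data, std::size_t n)
{
    std::vector<unsigned> out;
    for (std::size_t i = 0; i != n; ++i)
        for (int shift = 0; shift < CHAR_BIT; shift += 8)
            out.push_back((data[i] >> shift) & 0xFFu);   // low octet first (assumed order)
    return out;
}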
The waste aspect was (or is) directly addressed in the Pascal language, which differentiates between packed strings (multiple octets per byte) and unpacked strings (one octet per byte), where the former is used for passive text storage, and the latter is used for efficient processing.
This illustrates the basic conflation of three aspects in the single C++ type char:
unit of memory addressing, a.k.a. byte,
smallest basic type (it would be nice with an octet type), and
character encoding value unit.
And yes, this is a conflict.
Facts about UTF-16 encoding.
Unicode is a large set of 21-bit code points, most of which constitute characters on their own, but some of which are combined with others to form characters. E.g. a character with accent like “é” can be formed by combining code points for “e” and “´”-as-accent. And since that’s a general mechanism it means that a Unicode character can be an arbitrary number of code points, although it’s usually just 1.
UTF-16 encoding was originally a compatibility scheme for code based on original Unicode’s 16 bits per code point, when Unicode was extended to 21 bits per code point. The basic scheme is that code points in the defined ranges of original Unicode are represented as themselves, while each new Unicode code point is represented as a surrogate pair of 16-bit values. A small range of original Unicode is used for surrogate pair values.
At the time, examples of software based on 16 bits per code point included 32-bit Windows and the Java language.
On a system with an 8-bit byte, UTF-16 is an example of a wide text encoding, i.e. one with an encoding unit wider than the basic addressable unit. Byte-oriented text encodings are then known as narrow text. On such a system C++ char fits the latter, but not the former.
In C++03 the only built-in type suitable for the wide text encoding unit was wchar_t.
However, the C++ standard effectively requires wchar_t to be suitable for a code-point, which for modern 21-bits-per-code-point Unicode means that it needs to be 32 bits. Thus there is no C++03 dedicated type that fits the requirements of UTF-16 encoding values, 16 bits per value. Due to historical reasons the most prevalent system based on UTF-16 as wide text encoding, namely Microsoft Windows, defines wchar_t as 16 bits, which after the extension of Unicode has been in flagrant contradiction with the standard, but then, the standard is impractical regarding this issue. Some platforms define wchar_t as 32 bits.
C++11 introduced new types char16_t and char32_t, where the former is (designed to be) suitable for UTF-16 encoding values.
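A small sketch of the difference, using a code point outside the original 16-bit range (the emoji is just an example):
// Sketch: surrogate pairs are visible in the element count of a char16_t literal.
constexpr char16_t utf16[] = u"\U0001F600";   // one code point -> two code units + u'\0'
constexpr char32_t utf32[] = U"\U0001F600";   // one code point -> one code unit + U'\0'
static_assert(sizeof(utf16) / sizeof(char16_t) == 3, "surrogate pair plus terminator");
static_assert(sizeof(utf32) / sizeof(char32_t) == 2, "single code unit plus terminator");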
About the question.
Regarding the question’s stated assumption of
” a system with UTF-16 encoding character set
this can mean one of two things:
a system with UTF-16 as the standard narrow encoding, or
a system with UTF-16 as the standard wide encoding.
With UTF-16 as the standard narrow encoding, CHAR_BIT ≥ 16, and (by definition) sizeof(char) = 1. I do not know of any such system; it appears to be hypothetical. Yet it appears to be the meaning tacitly assumed in the other current answers.
With UTF-16 as the standard wide encoding, as in Windows, the situation is more complex, because the C++ standard is not up to the task. But, to use Windows as an example, one practical possibility is that sizeof(wchar_t) = 2. And one should just note that the standard is in conflict with existing practice and practical considerations on this issue, when the ideal is that standards instead standardize existing practice, where such exists.
Now finally we’re in a position to deal with the question,
” Is there a conflict between these statements above, or is sizeof(char) = 1 just a default (definitional) value that is implementation-defined and depends on each system?
This is a false dichotomy. The two possibilities are not opposites. We have
There is indeed a conflict between char as character encoding unit and as a memory addressing unit (byte). As noted, the Pascal language has the keyword packed to deal with one aspect of that conflict, namely storage versus processing requirements. And there is a further conflict between the formal requirements on wchar_t, and its use for UTF-16 encoding in the most widely used system that employs UTF-16 encoding, namely Windows.
sizeof(char) = 1 by definition: it's not system-dependent.
CHAR_BIT is implementation defined, and is guaranteed ≥ 8.
No, there's no conflict. These two statements refer to different definitions of byte.
UTF-16 implies that byte is the same thing as octet - a group of 8 bits.
In C++ language byte is the same thing as char. There's no limitation on how many bits a C++-byte can contain. The number of bits in C++-byte is defined by CHAR_BIT macro constant.
If your C++ implementation decides to use 16 bits to represent each character, then CHAR_BIT will be 16 and each C++-byte will occupy two UTF-16-bytes. sizeof(char) will still be 1 and sizes of all objects will be measured in terms of 16-bit bytes.
A char is defined as being 1 byte. A byte is the smallest addressable unit. This is 8 bits on common systems, but on some systems it is 16 bits, or 32 bits, or anything else (but must be at least 8 for C++).
It is somewhat confusing because in popular jargon byte is used for what is technically known as an octet (8 bits).
So, your second and third quotes are correct. The first quote is, strictly speaking, not correct.
As defined by [intro.memory]/1 in the C++ Standard, char only needs to be able to hold the basic execution character set which is approximately 100 characters (all of which appear in the 0 - 127 range of ASCII), and the octets that make up UTF-8 encoding. Perhaps that is what the author meant by machine character set.
On a system where the hardware is octet addressable but the character set is Unicode, it is likely that char will remain 8-bit. However there are types char16_t and char32_t (added in C++11) which are designed to be used in your code instead of char for systems that have 16-bit or 32-bit character sets.
So, if the system goes with char16_t then you would use std::basic_string<char16_t> instead of std::string, and so on.
Exactly how UTF-16 should be handled will depend on the detail of the implementation chosen by the system. Unicode is a 21-bit character set and UTF-16 is a multibyte encoding of it; so the system could go the Windows-like route and use std::basic_string<char16_t> with UTF-16 encoding for strings; or it could go for std::basic_string<char32_t> with raw Unicode code points as the characters.
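A minimal sketch of those two choices (variable names are illustrative):
#include <string>
int main()
{
    // Windows-like route: UTF-16 code units; a non-BMP character costs two elements.
    std::u16string utf16 = u"A\U0001F600";   // 3 code units
    // Raw code points: one element per code point.
    std::u32string utf32 = U"A\U0001F600";   // 2 code units
    return (utf16.size() == 3 && utf32.size() == 2) ? 0 : 1;
}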
Alf's post goes into more detail on some of the issues that can arise.
Without quoting the standard, it is easy to give a simple answer, because:
A byte is not defined as 8 bits. A byte can be any size, but it is the smallest addressable unit of memory. Most commonly it is 8 bits, but there is no reason not to have a 16-bit byte.
The C++ standard adds the restriction that it must be at least 8 bits.
So there is no problem with sizeof(char) always being 1, no matter what. Sometimes that one byte will be 8 bits, sometimes 16 bits, and so on.

Under what circumstances would one use a signed char in C++?

In most situations, one would declare a char object to assign one of the character values on the ASCII table, ranging from 0 to 127. Even the extended character sets range from 128 to 255 (still positive). So I'm assuming that when dealing with the printing of characters, one only needs to use an unsigned char.
Now, based on some research on SO, people use a signed char when they need to use really small integers, but for that we can utilize the [u]int8 type. So I'm having trouble coming to terms with why one would need to use a signed char? You can use it if you are dealing with the basic ASCII table (which unsigned char is already capable of handling), or you can use it to represent small integers (which [u]int8 already takes care of).
Can someone please provide a programming example in which a signed char is preferred over the other types?
The reason is that you don't know, at least portably, whether plain char variables are signed or unsigned. Different implementations take different approaches; a plain char may be signed on one platform and unsigned on another.
If you want to store negative values in a variable of type char, you absolutely must declare it as signed char, because only then can you be sure that every platform will be able to store negative values in it. Yes, you can use the [u]int8 type, but it was not always available (it was only introduced in C++11), and in fact int8_t is most likely an alias for signed char.
Moreover, uint8_t and int8_t are defined to be optional types, meaning you can't always rely on their existence (contrary to signed char). In particular, if a machine has a byte unit with more than 8 bits, it is not very likely that uint8_t and int8_t are defined (although they can be; a compiler is always free to provide them and do the appropriate calculations). See this related question: What is int8_t if a machine has > 8 bits per byte?
Is char signed or unsigned?
Actually it is neither; it's implementation-defined whether a variable of type char can hold negative values. So if you are looking for a portable way to store negative values in a narrow character type, explicitly declare it as signed char.
§ 3.9.1 - Fundamental Types - [basic.fundamental]
1 Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values.
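A tiny sketch of the portability point (the variable name is made up):
#include <cassert>
int main()
{
    signed char temperature = -40;   // guaranteed to hold negative values on every platform
    assert(temperature < 0);
    char c = -40;                    // if plain char is unsigned here, c becomes a large
    (void)c;                         // positive value rather than -40
}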
I'd like to use the smallest signed integer type available, which one is it?
C++11 introduced several fixed-width integer types, but a common misunderstanding is that these types are guaranteed to be available, which isn't true.
§ 18.4.1 - Header <cstdint> synopsis - [cstdint.syn]
typedef signed integer type int8_t; // optional
To preserve space in this post most of the section has been left out, but the optional rationale applies to all {,u}int{8,16,32,64}_t types. An implementation is not required to implement them.
The standard mandates that int_least8_t is available, but as the name implies this type is only guaranteed to have a width equal or larger than 8 bits.
However, the standard guarantees that even though signed char, char, and unsigned char are three distinct types[1] they must occupy the same amount of storage and have the same alignment requirements.
After inspecting the standard further we will also find that sizeof(char) is guaranteed to be 1[2], which means that this type is guaranteed to occupy the smallest amount of space that a C++ variable can occupy under the given implementation.
Conclusion
Remember that unsigned char and signed char must occupy the same amount of storage as a char?
The smallest signed integer type that is guaranteed to be available is therefore signed char.
[note 1]
§ 3.9.1 - Fundamental Types - [basic.fundamental]
1 Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types.
A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation.
[note 2]
§ 5.3.3 - Sizeof - [expr.sizeof]
sizeof(char), sizeof(signed char), and sizeof(unsigned char) are 1.
The result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined.
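A compile-time sketch of the conclusion above (that signed char is the smallest signed integer type guaranteed to exist):
#include <type_traits>
static_assert(sizeof(signed char) == 1, "exactly one byte, by definition");
static_assert(std::is_signed<signed char>::value, "and always signed");
// int_least8_t also always exists, but is only guaranteed to be at least 8 bits wide.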
You can use char for arithmetic operations with small integers. unsigned char will give you a greater positive range, while signed char gives a smaller positive range but the ability to work with negative numbers.
There are situations where char's small size is important and preferred for these operations, see here, so when one has negative numbers to deal with, signed char is the way to go.

int8_t vs char ; Which is the best one?

I know the two are different types (signed char and char); however, my company's coding guidelines specify using int8_t instead of char.
So I want to know why I have to use int8_t instead of the char type. Are there any best practices for using int8_t?
The use of int8_t is perfectly good for some circumstances - specifically when the type is used for calculations where a signed 8-bit value is required: calculations involving strictly sized data, e.g. data defined by external requirements to be exactly 8 bits in the result. (I used pixel colour levels in a comment above, but that really would be uint8_t, as negative pixel colours usually don't exist - except perhaps in a YUV-type colourspace.)
The type int8_t should NOT be used as a replacement for char in strings. This can lead to compiler errors (or warnings, but we don't really want to have to deal with warnings from the compiler either). For example:
int8_t *x = "Hello, World!\n";
printf(x);
may well compile fine on compiler A, but give errors or warnings for mixing signed and unsigned char values on compiler B. Or if int8_t isn't even using a char type. That's just like expecting
int *ptr = "Foo";
to compile in a modern compiler...
In other words, int8_t SHOULD be used instead of char if you are using 8-bit data for calculation. It is incorrect to wholesale replace all char with int8_t, as they are far from guaranteed to be the same.
If there is a need to use char for string/text/etc., and for some reason char is too vague (it can be signed or unsigned, etc.), then using typedef char mychar; or something like that should be used. (It's probably possible to find a better name than mychar!)
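Putting that advice together, a sketch with made-up names: int8_t for strictly 8-bit signed calculation data, char for text.
#include <cstdint>
#include <cstdio>
int main()
{
    std::int8_t delta = -5;                    // exact 8-bit signed calculation value
    const char* greeting = "Hello, World!";    // text stays char
    std::printf("%s (%d)\n", greeting, static_cast<int>(delta));
}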
Edit: I should point out that whether you agree with this or not, I think it would be rather foolish to simply walk up to whoever is in charge of this "principle" at the company, point at a post on SO and say "I think you're wrong". Try to understand what the motivation is. There may be more to it than meets the eye.
They simply make different guarantees:
char is guaranteed to exist, to be at least 8 bits wide, and to be able to represent either all integers between -127 and 127 inclusive (if signed) or between 0 and 255 (if unsigned).
int8_t is not guaranteed to exist (and yes, there are platforms on which it doesn't), but if it exists it is guaranteed to be an 8-bit two's-complement signed integer type with no padding bits; thus it is capable of representing all integers between -128 and 127, and nothing else.
When should you use which? When the guarantees made by the type line up with your requirements. It is worth noting, however, that large portions of the standard library require char * arguments, so avoiding char entirely seems short-sighted unless there’s a deliberate decision being made to avoid usage of those library functions.
int8_t is only appropriate for code that requires a signed integer type that is exactly 8 bits wide and should not compile if there is no such type. Such requirements are far rarer than the number of questions about int8_t and its brethren indicates. Most size requirements are that the type have at least a particular number of bits. signed char works just fine if you need at least 8 bits; int_least8_t also works.
int8_t is specified by the C99 standard to be exactly eight bits wide, and fits in with the other C99 guaranteed-width types. You should use it in new code where you want an exactly 8-bit signed integer. (Take a look at int_least8_t and int_fast8_t too, though.)
char is still preferred as the element type for single-byte character strings, just as wchar_t should be preferred as the element type for wide character strings.

conflicts: definition of wchar_t string in C++ standard and Windows implementation?

From C++2003 2.13
A wide string literal has type “array of n const wchar_t” and has static storage duration, where n is the size of the string as defined below
The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L’\0’.
From C++0x 2.14.5
A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.
The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.
The statement in C++2003 is quite vague. But in C++0x, when counting the length of the string, a wide string literal (wchar_t) shall be treated the same as char32_t, and differently from char16_t.
There's a post that states clearly how windows implements wchar_t in https://stackoverflow.com/questions/402283?tab=votes%23tab-top
In short, wchar_t in Windows is 16 bits and encoded using UTF-16. The statement in the standard apparently conflicts with the Windows implementation.
for example,
wchar_t kk[] = L"\U000E0005";
This code point does not fit in 16 bits, and UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.
However, from the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for the \0).
But in its internal storage, Windows needs 3 16-bit wchar_t objects to store it: 2 wchar_t for the surrogate pair, and 1 wchar_t for the \0. Therefore, by the array's definition, kk is an array of 3 wchar_t.
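For instance, a small sketch that prints the element count (the result depends on the implementation, as described above):
#include <cstdio>
int main()
{
    wchar_t kk[] = L"\U000E0005";
    // With a 32-bit wchar_t (e.g. typical Linux toolchains) this prints 2;
    // with a 16-bit wchar_t storing a surrogate pair (e.g. MSVC) it prints 3.
    std::printf("%u\n", static_cast<unsigned>(sizeof kk / sizeof kk[0]));
}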
These apparently conflict with each other.
I think the simplest solution for Windows would be to "ban" anything that requires a surrogate pair in wchar_t ("ban" any Unicode outside the BMP).
Is there anything wrong with my understanding?
Thanks.
The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.
Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.
The answer you linked to is somewhat misleading as well:
On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes
The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
Windows knows nothing about wchar_t, because wchar_t is a programming concept. Conversely, wchar_t is just storage, and it knows nothing about the semantic value of the data you store in it (that is, it knows nothing about Unicode or ASCII or whatever.)
If a compiler or SDK that targets Windows defines wchar_t to be 16 bits, then that compiler may be in conflict with the C++0x standard. (I don't know whether there are some get-out clauses that allow wchar_t to be 16 bits.) But in any case the compiler could define wchar_t to be 32 bits (to comply with the standard) and provide runtime functions to convert to/from UTF-16 for when you need to pass your wchar_t* to Windows APIs.

Why do C++ streams use char instead of unsigned char?

I've always wondered why the C++ Standard library has instantiated basic_[io]stream and all its variants using the char type instead of the unsigned char type. char means (depending on whether it is signed or not) you can have overflow and underflow for operations like get(), which will lead to implementation-defined values of the variables involved. Another example is when you want to output a byte, unformatted, to an ostream using its put function.
Any ideas?
Note: I'm still not really convinced. So if you know the definitive answer, you can still post it indeed.
Possibly I've misunderstood the question, but conversion from unsigned char to char isn't unspecified, it's implementation-dependent (4.7-3 in the C++ standard).
The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). Also for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)
Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness isn't defined. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"
I have always understood it this way: the purpose of the iostream classes is to read and/or write a stream of characters, which, if you think about it, are abstract entities that are only represented by the computer using a character encoding. The C++ standard takes great pains to avoid pinning down the character encoding, saying only that "Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set," because it doesn't need to force the "implementation basic character set" to define the C++ language; the standard can leave the decision of which character encoding is used to the implementation (compiler together with an STL implementation), and just note that char objects represent single characters in some encoding.
An implementation writer could choose a single-octet encoding such as ISO-8859-1 or even a double-octet encoding such as UCS-2. It doesn't matter. As long as a char object is "large enough to store any member of the implementation's basic character set" (note that this explicitly forbids variable-length encodings), then the implementation may even choose an encoding that represents basic Latin in a way that is incompatible with any common encoding!
It is confusing that the char, signed char, and unsigned char types share "char" in their names, but it is important to keep in mind that char does not belong to the same family of fundamental types as signed char and unsigned char. signed char is in the family of signed integer types:
There are four signed integer types: "signed char", "short int", "int", and "long int."
and unsigned char is in the family of unsigned integer types:
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," ...
The one similarity between the char, signed char, and unsigned char types is that "[they] occupy the same amount of storage and have the same alignment requirements". Thus, you can reinterpret_cast from char * to unsigned char * in order to determine the numeric value of a character in the execution character set.
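For instance (a sketch, assuming a UTF-8 narrow execution encoding for the literal below):
#include <cstdio>
int main()
{
    const char* text = "Aé";
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(text);
    for (; *bytes != 0; ++bytes)
        std::printf("%u ", static_cast<unsigned>(*bytes));   // e.g. 65 195 169
}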
To answer your question, the reason why the STL uses char as the default type is because the standard streams are meant for reading and/or writing streams of characters, represented by char objects, not integers (signed char and unsigned char). The use of char versus the numeric value is a way of separating concerns.
char is for characters, unsigned char for raw bytes of data, and signed chars for, well, signed data.
The standard does not specify whether signed or unsigned char will be used for the implementation of char - it is compiler-specific. It only specifies that "char" will be "enough" to hold characters on your system - the way characters were in those days, which is to say, no Unicode.
Using "char" for characters is the standard way to go. Using unsigned char is a hack, although it will match the compiler's implementation of char on most platforms.
I think this comment explains it well. To quote:
signed char and unsigned char are arithmetic, integral types just like int and unsigned int. On the other hand, char is expressly intended to be the "I/O" type that represents some opaque, system-specific fundamental unit of data on your platform. I would use them in this spirit.