What encoding does c32rtomb convert to? - c++

The functions c32rtomb and mbrtoc32 from <cuchar>/<uchar.h> are described in the C Unicode TR (draft) as performing conversions between UTF-32¹ and "multibyte characters".
(...) If s is not a null
pointer, the c32rtomb function determines the number of bytes needed to represent
the multibyte character that corresponds to the wide character given by c32
(including any shift sequences), and stores the multibyte character representation in
the array whose first element is pointed to by s. (...)
What is this "multibyte character representation"? I'm actually interested in the behaviour of the following program:
#include <cassert>
#include <cuchar>
#include <string>
int main() {
    std::u32string u32 = U"this is a wide string";
    std::string narrow = "this is a wide string";
    std::string converted(1000, '\0');
    char* ptr = &converted[0];
    std::mbstate_t state {};
    for (auto u : u32) {
        ptr += std::c32rtomb(ptr, u, &state);
    }
    converted.resize(ptr - &converted[0]);
    assert(converted == narrow);
}
Is the assertion in it guaranteed to hold¹?
¹ Working under the assumption that __STDC_UTF_32__ is defined.

For the assertion to be guaranteed to hold true it's necessary that the multibyte encoding used by c32rtomb() be the same as the encoding used for string literals, at least as far as the characters actually used in the string.
C99 7.11.1.1/2 specifies that setlocale() with the category LC_CTYPE affects the behavior of the character handling functions and the multibyte and wide character functions. I don't see any explicit acknowledgement that the effect is to set the multibyte and wide character encodings used; however, that is the intent.
So the multibyte encoding used by c32rtomb() is the multibyte encoding of the default "C" locale (the program never calls setlocale(), so the "C" locale remains installed).
C++11 2.14.3/2 specifies that the execution encoding, wide execution encoding, UTF-16, and UTF-32 are used for the corresponding character and string literals. Therefore std::string narrow uses the execution encoding to represent that string.
So is the "C" locale encoding of this string the same as the execution encoding of this string?
C99 7.11.1.1/3 specifies that the "C" locale provides "the minimal environment" for C translation. Such an environment would include not only character sets, but also the specific character codes used. So I believe this means not only that the "C" locale must support the characters required in translation (i.e., the basic character set), but additionally that those characters in the "C" locale must use the same character codes.
All of the characters in your string literals are members of the basic character set, and therefore converting the char32_t representation to the char "C" locale representation must produce the same sequence of values as the compiler produces for the char string literal; the assertion must hold true.
I don't see any suggestion that anything beyond the basic character set is supported in a compatible way between the execution encoding and the "C" locale, so if your string literal used any characters outside the basic character set then there would not be any guarantee that the assertion would hold. Even stipulating extended characters that exist in both the execution character set and the "C" locale, I don't see any requirement that the representations match each other.
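To make that caveat concrete, here is a small sketch of my own (not part of the answer above): 'é' (U+00E9) is outside the basic character set, so nothing here guarantees that c32rtomb() produces the same bytes the compiler would emit for an "é" literal; the result depends entirely on the installed locale.
#include <climits>  // MB_LEN_MAX
#include <cstddef>
#include <cstdio>
#include <cuchar>

int main() {
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    // U+00E9 is not in the basic character set, so in the default "C" locale
    // this conversion may fail or produce locale-dependent bytes.
    std::size_t n = std::c32rtomb(buf, U'\u00E9', &state);
    if (n == (std::size_t)-1)
        std::puts("not representable in the current locale");
    else
        std::printf("%zu byte(s)\n", n);
}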

The TR linked in the question says
At most MB_CUR_MAX bytes are stored.
which is defined (in C99) as
a positive integer expression with type size_t that is the maximum number of bytes in a multibyte character for the extended character set specified by the current locale
I believe this is sufficient evidence that the intent of the TR was to produce the multibyte characters as defined by the currently installed C locale: UTF-8 for en_US.utf8, GB18030 for zh_CN.gb18030, etc.
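A quick way to see that locale dependence is MB_CUR_MAX itself, which tracks the currently installed LC_CTYPE locale. A sketch of mine (the locale name "en_US.utf8" is an assumption and may not be installed on every system):
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cstdlib>  // MB_CUR_MAX

int main() {
    // In the default "C" locale MB_CUR_MAX is typically 1.
    std::printf("\"C\" locale:   MB_CUR_MAX = %zu\n", (std::size_t)MB_CUR_MAX);
    // In a UTF-8 locale it grows (4 or 6, depending on the C library).
    if (std::setlocale(LC_CTYPE, "en_US.utf8"))
        std::printf("UTF-8 locale: MB_CUR_MAX = %zu\n", (std::size_t)MB_CUR_MAX);
}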

In my tests on Linux and macOS, c32rtomb converts strings from UTF-32 to the locale-specific encoding. You can use nl_langinfo(CODESET) to get the encoding currently in use.
However, libc uses the "C" locale by default, which uses ISO-8859-1 as the encoding. To switch to the encoding the system environment specifies, usually UTF-8 but possibly something else, call setlocale(LC_CTYPE, "").
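A minimal sketch of that (POSIX only; it assumes a Linux/macOS system whose environment locale is UTF-8, and nl_langinfo lives in <langinfo.h>, which is POSIX rather than standard C++):
#include <climits>    // MB_LEN_MAX
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cuchar>
#include <langinfo.h> // nl_langinfo, CODESET (POSIX)

int main() {
    std::setlocale(LC_CTYPE, "");                        // adopt the environment's locale
    std::printf("codeset: %s\n", nl_langinfo(CODESET));  // e.g. "UTF-8"

    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    std::size_t n = std::c32rtomb(buf, U'\u00FF', &state);  // U+00FF 'ÿ'
    if (n != (std::size_t)-1) {
        for (std::size_t i = 0; i < n; ++i)
            std::printf("%02x ", (unsigned char)buf[i]);    // "c3 bf" in a UTF-8 locale
        std::printf("\n");
    }
}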
On Windows with VS2015 and later, however, c32rtomb always converts to UTF-8. Since vcruntime does not support UTF-8 locales (only legacy ANSI/OEM locales are supported), if it followed the standard, c32rtomb/c16rtomb would be completely identical to wcrtomb and of no use at all.

Related

Is 16-bit wchar_t formally valid for representing full Unicode?

In the ¹comp.lang.c++ Usenet group I recently asserted, based on what I thought I knew, that Windows' 16-bit wchar_t, with UTF-16 encoding where sometimes two such values (called a “surrogate pair”) are needed for a single Unicode code point, is invalid for representing Unicode.
It's certainly inconvenient and in conflict with the assumption of the C and C++ standard libraries (e.g. character classification) that each code point is represented as a single value, although the Unicode consortium's ²Technical Note 12 from 2004 makes a good case for using UTF-16 for internal processing, with an impressive list of software that does.
And certainly it seems as if the original intent was to have one wchar_t value per code point, consistent with the assumptions of the C and C++ standard libraries. E.g. in the web page “ISO C Amendment 1 (MSE)” over at ³unix.org, about the amendment that brought wchar_t into the C standard in 1995, the authors maintain that
” The primary advantage to the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform.
But as it turned out, the C and C++ standards seem to not talk about the largest supported character, but only about the largest extended character sets in the supported locales: that wchar_t must be large enough to represent every code point in the largest such extended character set – but not Unicode, when there is no Unicode locale.
C99 §7.17/2 (from the N869 draft):
” [the wchar_t type] is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales.
This is almost identical to the wording in the C++ standard. And it seems to mean that with a restricted set of supported locales, wchar_t can be smallish indeed, down to a single byte with UTF-8 encoding (a nightmare possibility where e.g. no standard library character classification function would work outside of ASCII's A through Z, but hey). Possibly the following is a requirement to be wider than that:
C99 §7.1.1/4:
” A wide character is a code value (a binary encoded integer) of an object of type wchar_t that corresponds to a member of the extended character set.
… since it refers to the extended character set, but that term seems to not be further defined anywhere.
And at least with Microsoft's C and C++ runtime there is no Unicode locale: with that implementation setlocale is restricted to character encodings that have at most 2 bytes per character:
MSDN ⁴documentation of setlocale:
” The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8, setlocale will fail, returning NULL.
So it seems that contrary to what I thought I knew, and contrary to my assertion, Windows' 16-bit wchar_t is formally OK. And mainly due to Microsoft's ingenious lack of support for UTF-8 locales, or any locale with more than 2 bytes per character. But is it really so, is 16-bit wchar_t OK?
Links:
¹ news:comp.lang.c++
² http://unicode.org/notes/tn12/#Software_16
³ http://www.unix.org/version2/whatsnew/login_mse.html
⁴ https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
wchar_t is not now and never was a Unicode character/code point. The C++ standard does not declare that a wide-string literal will contain Unicode characters. The C++ standard does not declare that a wide-character literal will contain a Unicode character. Indeed, the standard doesn't say anything about what wchar_t will contain.
wchar_t can be used with locale-aware APIs, but those are only relative to the implementation-defined encoding, not any particular Unicode encoding. The standard library functions that take these use their knowledge of the implementation's encoding to do their jobs.
So, is a 16-bit wchar_t legal? Yes; the standard does not require that wchar_t be sufficiently large to hold a Unicode codepoint.
Is a string of wchar_t permitted to hold UTF-16 values (or variable width in general)? Well, you are permitted to make strings of wchar_t that store whatever you want (so long as it fits). So for the purposes of the standard, the question is whether standard-provided means for generating wchar_t characters and strings are permitted to use UTF-16.
Well, the standard library can do whatever it wants to; the standard offers no guarantee that a conversion from any particular character encoding to wchar_t will be a 1:1 mapping. Even char->wchar_t conversion via wstring_convert is not required anywhere in the standard to produce a 1:1 character mapping.
If a compiler wishes to declare that the wide character set consists of the Basic Multilingual Plane of Unicode, then a declaration like L'\U0001F000' will produce a single wchar_t. But the value is implementation-defined, per [lex.ccon]/2:
The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char has no representation in the execution wide-character set, in which case the value is implementation-defined.
And of course, C++ doesn't allow you to use surrogate code points as a c-char; \uD800 is a compile error.
Where things get murky in the standard is the treatment of strings that contain characters outside of the character set. The above text would suggest that implementations can do what they want. And yet, [lex.string]/16 says this:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U’\0’ or L’\0’.
I say this is murky because nothing says what the behavior should be if a c-char in a string literal is outside the range of the destination character set.
Windows compilers (both VS and GCC-on-Windows) do indeed cause L"\U0001F000" to have an array size of 3 (one surrogate pair, i.e. two code units, plus a single NUL terminator). Is that legal C++ standard behavior? What does it mean to provide a c-char to a string literal that is outside of the valid range for a character set?
I would say that this is a hole in the standard, rather than a deficiency in those compilers. The standard should make it clearer what the conversion behavior in this case ought to be.
In any case, wchar_t is not an appropriate tool for processing Unicode-encoded text. It is not "formally valid" for representing any form of Unicode. Yes, many compilers implement wide-string literals as a Unicode encoding. But since the standard doesn't require this, you cannot rely on it.
Now obviously, you can stick whatever will fit inside of a wchar_t. So even on platforms where wchar_t is 32-bits, you could shove UTF-16 data into them, with each 16-bit word taking up 32-bits. But you couldn't pass such text to any API function that expects the wide character encoding unless you knew that this was the expected encoding for that platform.
Basically, never use wchar_t if you want to work with a Unicode encoding.
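If a guaranteed Unicode encoding is what you need, a minimal sketch of the C++11 alternative (my example, not the answer's): char16_t/char32_t literals are specified to be UTF-16/UTF-32, unlike wchar_t.
#include <string>

// One astral code point becomes a surrogate pair (two UTF-16 code units) plus NUL.
static_assert(sizeof(u"\U0001F000") / sizeof(char16_t) == 3,
              "u\"...\" literals are UTF-16 encoded");

int main() {
    std::u16string utf16 = u"\U0001F000";  // two code units
    std::u32string utf32 = U"\U0001F000";  // one code unit
}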
Let's start from first principles:
(§3.7.3) wide character: bit representation that fits in an object of type
wchar_t, capable of representing any character in the current locale
(§3.7) character: 〈abstract〉 member of a set of elements used for the
organization, control, or representation of data
That, right away, discards full Unicode as a character set (a set of elements/characters) representable on 16-bit wchar_t.
But wait, Nicol Bolas quoted the following:
The size of a char32_t or wide string literal is the total number of
escape sequences, universal-character-names, and other characters,
plus one for the terminating U’\0’ or L’\0’.
and then wondered about the behavior for characters outside the execution character set. Well, C99 has the following to say about this issue:
(§5.1.1.2) Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.8)
and further clarifies in a footnote that not all source characters need to map to the same execution character.
Armed with this knowledge, you can declare that your wide execution character set is the Basic Multilingual Plane, and that you consider surrogates as proper characters themselves, not as mere surrogates for other characters. AFAICT, this means you are in the clear as far as Clause 6 (Language) of ISO C99 is concerned.
Of course, don't expect Clause 7 (Library) to play along nicely with you. As an example, consider iswalpha(wint_t). You cannot pass astral characters (characters outside the BMP) to that function; you can only pass it the two surrogates. And you'd get some nonsensical result, but that's fine because you declared the surrogates themselves to be proper members of the execution character set.
After clarifying what the question is I will do an edit.
Q: Is the width of 16 bits for wchar_t in Windows conformant to the standard?
A: Well, let's see. We will start with the definition of wchar_t from the C99 draft.
... largest extended character set specified among the supported locales.
So, we should look at what the supported locales are. For that there are three steps:
We check the documentation for setlocale
We quickly open the documentation for the locale string. We see the format of the string
locale :: "locale_name"
| "language[_country_region[.code_page]]"
| ".code_page"
| "C"
| ""
| NULL
We see the list of supported code pages, and there we see UTF-8, UTF-16, UTF-32 and whatnot. We're at a dead end.
If we start with the C99 definition, it ends with
... corresponds to a member of the extended character set.
The word "character set" is used. But if we say UTF-16 code units are our character set, then all is OK. Otherwise, it's not. It's kinda vague, and one should not care much. The standards were defined many years ago, when Unicode was not popular.
At the end of the day, we now have C++11 and C11 that define use cases for UTF-8, 16 and 32 with the additional types char16_t and char32_t.
You need to read about Unicode and you will answer the question yourself.
Unicode is a character set: a set of about 200,000 characters. More precisely, it is a mapping between numbers and characters. Unicode by itself does not imply any particular bit width.
Then there are 4 encodings: UTF-7, UTF-8, UTF-16 and UTF-32. UTF stands for Unicode Transformation Format.
Each format defines code points and code units. A code point is an actual character from Unicode and can consist of one or more code units. Only UTF-32 has one unit per point.
On the other hand, each unit fits into a fixed-size integer. So UTF-7 units are at most 7 bits, UTF-16 units are at most 16 bits, etc.
Therefore, in a 16-bit wchar_t string we can hold Unicode text encoded in UTF-16. In particular, in UTF-16 each code point takes one or two units.
So the final answer: in a single wchar_t you cannot store every Unicode character, only the single-unit ones, but in a string of wchar_t you can store any Unicode text.
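A small sketch of that last point (my own helper; it assumes wchar_t is a 16-bit UTF-16 code unit as on Windows, and does no validation of inputs above U+10FFFF or in the surrogate range):
#include <string>

std::wstring to_utf16(char32_t cp) {
    if (cp < 0x10000)                          // BMP code point: one code unit
        return std::wstring(1, static_cast<wchar_t>(cp));
    cp -= 0x10000;                             // astral code point: surrogate pair
    wchar_t hi = static_cast<wchar_t>(0xD800 + (cp >> 10));
    wchar_t lo = static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
    return {hi, lo};
}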

How to use Loadstring to load Chinese Characters

I have a string table which defines a string in Chinese like this:
STRINGTABLE
LANGUAGE 0x0C04, 0x03
BEGIN
1000 "检查环境..."
...
END
I am trying to load that string into a wchar_t buffer as follows:
#define UNICODE
#define _UNICODE
#include <windows.h> // note: both macros must be defined before including <windows.h>
wchar_t buffer[512];
LoadString(DLL_HANDLE, (UINT) msg_num, buffer, 512);
MessageBox(NULL, buffer, NULL, NULL);
However, the string that is loaded into the buffer is different than the one that is in my string table.
It looks like this in my string table:
检查环境...
But this is how it turns out on screen:
環境をãƒã‚§ãƒƒã‚¯ä¸­...
Doesn't the 'MessageBox' function work on narrow strings by default? Wouldn't you need to use 'MessageBoxW'?
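A minimal sketch of that suggestion (my code; dll_handle and msg_num mirror the question's own names, and the caption string is just a placeholder):
#include <windows.h>

void show_message(HINSTANCE dll_handle, UINT msg_num) {
    wchar_t buffer[512];
    // Call the wide ("W") functions explicitly so the string stays UTF-16
    // regardless of whether UNICODE is defined.
    int len = LoadStringW(dll_handle, msg_num, buffer, 512);
    if (len > 0)
        MessageBoxW(NULL, buffer, L"Message", MB_OK);
}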
Edit:
A couple of things to check. The encoding of L"..." strings is implementation-defined. The standard makes no mention of the encoding of wchar_t characters; make sure you're using the same encoding as Windows expects. (If I recall correctly, Windows expects UTF-16 - but I very well may be wrong on this.)
In C++11, 3 new string literal types are introduced, with the prefixes "u8", "u" and "U", which specify UTF-8, UTF-16 and UTF-32, respectively. C++11 still makes no guarantees about the encoding of the "L" prefix, as far as I can tell, other than what is mentioned in §2.14.3:
A character literal that begins with the letter L, such as L’x’, is a wide-character literal. A wide-character
literal has type wchar_t.23 The value of a wide-character literal containing a single c-char has value equal
to the numerical value of the encoding of the c-char in the execution wide-character set, unless the c-char
has no representation in the execution wide-character set, in which case the value is implementation-defined.
[ Note: The type wchar_t is able to represent all members of the execution wide-character set (see 3.9.1).
—end note ]. The value of a wide-character literal containing multiple c-chars is implementation-defined.
Reference §3.9.1 P5 states:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest
extended character set specified among the supported locales (22.3.1). Type wchar_t shall have the same
size, signedness, and alignment requirements (3.11) as one of the other integral types, called its underlying
type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as
uint_least16_t and uint_least32_t, respectively, in <stdint.h>, called the underlying types.
Again, no mention of encoding. It is possible that Windows is expecting a different encoding than what your resource string is using, and thus the discrepancy.
You might verify by calling MessageBox with an L"" string literal that uses "\Uxxxxxxxx" escapes for your characters.
The MSDN documentation states that the format should be similar to
IDS_CHINESESTRING L"\x5e2e\x52a9". That's not the most formal of descriptions. I interpret it as stating that Unicode strings must be prefixed with L and encoded using \uxxxx escape codes.

C++11: Example of difference between ordinary string literal and UTF-8 string literal?

A string literal that does not begin with an encoding-prefix is an ordinary string
literal, and is initialized with the given characters.
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
I don't understand the difference between an ordinary string literal and a UTF-8 string literal.
Can someone provide an example of a situation where they are different? (Cause different compiler output)
(I mean from the POV of the standard, not any particular implementation)
Each source character set member in a character literal or a string literal, as well as each escape
sequence and universal-character-name in a character literal or a non-raw string literal, is converted to
the corresponding member of the execution character set.
The C and C++ languages allow a huge amount of latitude in their implementations. C was written long before UTF-8 was "the way to encode text in single bytes": different systems had different text encodings.
So what the byte values are for a string in C and C++ is really up to the compiler. 'A' is whatever the compiler's chosen encoding is for the character A, which may not agree with UTF-8.
C++ has added the requirement that real UTF-8 string literals must be supported by compilers. The bit value of u8"A"[0] is fixed by the C++ standard through the UTF-8 standard, regardless of the preferred encoding of the platform the compiler is targeting.
Now, much as most platforms C++ targets use 2's complement integers, most compilers have character encodings that are mostly compatible with UTF-8. So for strings like "hello world", u8"hello world" will almost certainly be identical.
For a concrete example, from man gcc
-fexec-charset=charset
Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
is an example of being able to change the execution and input character sets of C/C++.
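For instance, a sketch of what those flags change (the literal content is my own; built with the default -fexec-charset=UTF-8 the two arrays are byte-identical, while with g++ -fexec-charset=ISO8859-1 the ordinary literal holds Latin-1 bytes and the comparison fails):
#include <cstdio>
#include <cstring>

int main() {
    const char narrow[] = "h\u00E9";    // 'é' in the execution character set
    const char utf8[]   = u8"h\u00E9";  // 'é' always as UTF-8 (0xC3 0xA9)
    bool same = sizeof narrow == sizeof utf8
                && std::memcmp(narrow, utf8, sizeof narrow) == 0;
    std::printf("narrow: %zu bytes, u8: %zu bytes, identical: %d\n",
                sizeof narrow, sizeof utf8, same);
}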

Is the u8 string literal necessary in C++11

From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char[] str = "Test String";
or this?
const char[] str = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the test string?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥ 8, due to the range required for char in the C standard, which is (quoting C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix, it affects the encoding of the string in the executable, but unfortunately it does not affect the type.
Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a nullbyte, for array size 9. While in the latter literal the encoding is guaranteed to be UTF-8, where the “ø” is encoded with 2 bytes (0xC3 0xB8), for array size 10.
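To put numbers on that, a sketch (the size of the first array depends on the compiler's execution character set, which is an assumption here, so only the u8 case is asserted):
constexpr char a[] = "t\u00F8rrfisk";    // e.g. sizeof(a) == 9 with a Latin-1 execution charset
constexpr char b[] = u8"t\u00F8rrfisk";  // always 10: "ø" is 0xC3 0xB8 in UTF-8
static_assert(sizeof(b) == 10, "9 UTF-8 bytes plus the terminating null byte");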
If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However if the compilers execution character set is the system's non UTF8 codepage (default for e.g. Visual C++), then non ASCII characters might not properly handled when u8 is omitted. For example, the conversion to wide strings will crash e.g. in VS15:
std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII, possibly with something depending on the environment's settings for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
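To make that concrete, a small sketch (my own helper, not a standard facility): std::string::size() counts bytes, so counting code points in a UTF-8 string means skipping continuation bytes.
#include <cstddef>
#include <string>

std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)  // skip UTF-8 continuation bytes (10xxxxxx)
            ++n;
    return n;
}
// With a UTF-8 source file, count_code_points(u8"スタークラフト") == 7,
// while the string itself holds 21 bytes.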

Why does wide file-stream in C++ narrow written data by default?

Honestly, I just don't get the following design decision in the C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:
#include <fstream>
#include <string>
int main()
{
    using namespace std;
    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");
    file << someString; // the output file will consist of ASCII characters!
}
I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?
Also, are we going to get real Unicode streams with C++0x, or am I missing something here?
A very partial answer for the first question: A file is a sequence of bytes so, when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, so this is why this conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.
Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect that the conversion is made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift-JIS. By similarity, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the Basic Multilingual Plane), while the Linux implementation I know about decided to stick to ASCII.
For your second question:
Also, are we gonna get real unicode streams with C++0x or am I missing something here?
In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.
In the [locale.stdcvt] section, we find:
For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]
For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]
For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.
So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
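For instance, a minimal C++11 sketch using the codecvt_utf8 facet quoted above together with std::wstring_convert (these <codecvt> facilities shipped with C++11 but were later deprecated in C++17):
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // UTF-32 <-> UTF-8 via the standard facet.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8    = conv.to_bytes(U"Hello StackOverflow!");
    std::u32string back = conv.from_bytes(utf8);
}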
The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
Two main points:
IO is done in terms of char.
it is the job of the locale to determine how wide chars are serialized
the default locale (named "C") is very minimal (I don't remember the exact constraints from the standard; here it is able to handle only 7-bit ASCII as both the narrow and the wide character set).
there is an environment-determined locale named ""
So to get anything, you have to set the locale.
If I use the simple program
#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>
int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}
which uses the environment's locale and outputs the wide character with code 0x00FF to a file. If I ask it to use the "C" locale, I get
$ env LC_ALL=C ./a.out
Output failed
the locale has been unable to handle the wide character, and we are notified of the problem because the IO failed. If I instead ask for a UTF-8 locale, I get
$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003
(od -t x1 just dumps the file in hex), exactly what I expect for a UTF-8 encoded file.
I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16-encoded string literal, for example).
Check out the most recent C++0x draft (N2960).
For your first question, this is my guess.
The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it's assumed that:
Inside your program, you should use a (fixed-width) wide-character encoding.
Only external storage should use (variable-width) multibyte encodings.
I believe that is the reason for the existence of the two template specializations of std::codecvt. One that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding you should do it byte-by-byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.
Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character, ASCII-encoded (assuming this is the default on your platform and you did not switch locales) strings. The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one char) ASCII character. (The conversion exists and is straightforward.)
By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide-characters.
Check this out:
Class basic_filebuf
You can alter the default behavior by setting a wide char buffer, using pubsetbuf.
Once you do that, the output will be wchar_t and not char.
In other words for your example you will have:
wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!
wchar_t buffer[128];
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); //this is the BOM flag, UTF16 needs this, but mirosoft's UNICODE doesn't, so you can skip this line, if any.
file << someString; // the output file will consist of unicode characters! without the call to pubsetbuf, the out file will be ANSI (current regional settings)