Streams with default UTF-8 handling - C++

I have read that in some environments std::string internally uses UTF-8. Whereas, on my platform, Windows, std::string is ASCII only. This behavior can be changed by using std::locale. My version of STL doesn't have, or at least I can't find, a UTF-8 facet for use with strings. I do however have a facet for use with the fstream set of classes.
Edit:
When I say "use UTF-8 internally", I'm referring to methods like std::basic_filebuf::open(), which in some environments accept UTF-8 encoded strings. I know this isn't really an std::string issue but rather some OS's use UTF-8 natively. My question should be read as "how does your implementation handle code conversion of invalid sequences?".
How do these streams handle invalid code sequences on other platforms/implementations?
My UTF-8 facet for files simply returns an error, which in turn prevents any more of the stream from being read. I would have thought substituting the Unicode replacement character U+FFFD for the invalid sequence would be a better option.
My question isn't limited to UTF-8; what about invalid UTF-16 surrogate pairs?
Let's have an example. Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Or take a std::wstring containing a lone surrogate and print it to std::cout.

I have read that in some environments std::string internally uses UTF-8.
A C++ program can choose to use std::string to hold a UTF-8 string on any standard-compliant platform.
Whereas, on my platform, Windows, std::string is ASCII only.
That is not correct. On Windows you can use a std::string to hold a UTF-8 string if you want; std::string is not limited to holding ASCII on any standard-compliant platform.
This behavior can be changed by using std::locale.
No, the behaviour of std::string is not affected by the locale library.
A std::string is a sequence of chars. On most platforms, including Windows, a char is 8 bits. So you can use std::string to hold ASCII, Latin-1, UTF-8 or any character encoding whose code unit is 8 bits or fewer. std::string::length returns the number of code units held, and std::string::operator[] returns the ith code unit.
For holding UTF-16 you can use char16_t and std::u16string.
For holding UTF-32 you can use char32_t and std::u32string.
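A minimal sketch of that point (the escape sequences spell out the UTF-8 bytes, so it does not depend on the source file's encoding): length() counts code units of the element type, not code points.

#include <iostream>
#include <string>

int main()
{
    // "café": the é is one code point but two UTF-8 code units (0xC3 0xA9).
    std::string    utf8  = "caf\xC3\xA9";   // UTF-8 stored in a plain std::string
    std::u16string utf16 = u"caf\u00E9";    // the same text as UTF-16 code units
    std::u32string utf32 = U"caf\u00E9";    // and as UTF-32 code points

    std::cout << utf8.length()  << '\n';    // 5 -- char (UTF-8) code units
    std::cout << utf16.length() << '\n';    // 4 -- char16_t code units
    std::cout << utf32.length() << '\n';    // 4 -- char32_t code units (one per code point here)
}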

Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Typically no one bothers with converting to wchar_t or other wide char types on other platforms, but the standard facets that can be used for this all signal a read error that causes the stream to stop working until the error is cleared.
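To make that concrete, here is a small sketch using std::wstring_convert with the <codecvt> facets these answers discuss (deprecated since C++17) rather than a stream: the default policy is to signal an error, and anything like the U+FFFD substitution suggested above has to be opted into explicitly via the constructor's fallback strings.

#include <codecvt>    // std::codecvt_utf8 -- deprecated in C++17, but what is discussed here
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    const std::string bad = "abc\xFF" "def";   // 0xFF can never appear in valid UTF-8

    // Default policy: the facet signals an error and wstring_convert throws.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> strict;
    try {
        std::wstring w = strict.from_bytes(bad);
    } catch (const std::range_error&) {
        std::cerr << "invalid UTF-8 sequence\n";
    }

    // Opt-in fallback: on error the whole result is replaced by the supplied
    // string (here U+FFFD); the facet does not substitute per invalid sequence.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> lenient("?", L"\uFFFD");
    std::wcout << lenient.from_bytes(bad).size() << L'\n';   // 1
}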

std::string should be encoding agnostic: http://en.cppreference.com/w/cpp/string/basic_string - so it should not validate codepoints/data - you should be able to store any binary data in it.
The only places where encoding really makes a difference are calculating string length and iterating over the string character by character - and the locale has no effect in either of these cases.
Also, use of std::locale is probably not a good idea if it can be avoided at all - it's not thread-safe on all platforms or in all implementations of the standard library, so care must be taken when using it. Its effect is also very limited, and probably not at all what you expect it to be.

Related

How are character sets stored in strings and wstrings?

So, I've been trying to do a bit of research into strings and wstrings, as I need to understand how they work for a program I'm creating, so I also looked into ASCII and Unicode, and UTF-8 and UTF-16.
I believe I have an okay understanding of the concept of how these work, but what I'm still having trouble with is how they are actually stored in chars, strings, wchar_ts and wstrings.
So my questions are as follows:
Which character set and encoding are used for char and wchar_t? And are these types limited to using only those character sets / encodings?
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it decided automatically at compile time, for example, or do we have to explicitly tell it what to use?
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set, but can use more than 1 byte when using code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from string to wstring and then use it whenever a wstring is required, keeping my code exclusively string-based, or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology, as I'm trying to get to grips with this as best as I can.
I'm working in C++, by the way.
They use whatever character set and encoding you want. The types do not imply a specific character set or encoding. They do not even imply characters - you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using Unicode nowadays. On older systems anything might happen.
UTF-8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, in Visual Studio it is 16 bits (and could be used to store UTF-16), but in GCC it is 32 bits (and could thus be used to store Unicode code points directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF-8, and convert only when needed.
Which character set and encoding are used for char and wchar_t? And are these types limited to using only those character sets / encodings?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encodings, how is it decided what character set / encoding is used for a particular char or wchar_t? Is it decided automatically at compile time, for example, or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set, but can use more than 1 byte when using code point 128 and above. If so, how is this stored? For example, is it simply stored identically to ASCII if it only uses 1 byte? And how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just stores bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same goes for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, e.g. a string can't be used where a wstring is needed. But in a program that requires a wstring, would it be better practice to write a conversion function from string to wstring and then use it whenever a wstring is required, keeping my code exclusively string-based, or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
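A small sketch of those types (the sizes asserted assume the usual 8-bit byte):

#include <string>

int main()
{
    // C++11: character types with a fixed size and a well-defined Unicode encoding.
    std::u16string s16 = u"\u00E9\U0001F34C"; // UTF-16: é is one code unit, U+1F34C a surrogate pair
    std::u32string s32 = U"\u00E9\U0001F34C"; // UTF-32: one code unit per code point

    static_assert(sizeof(char16_t) == 2, "char16_t is one UTF-16 code unit");
    static_assert(sizeof(char32_t) == 4, "char32_t is one UTF-32 code unit");

    return (s16.size() == 3 && s32.size() == 2) ? 0 : 1;  // 1 + 2 surrogates vs. 2 code points
}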
Everything about the encoding used is implementation-defined. Check your compiler documentation. It depends on the default locale, the encoding of the source file and the OS console settings.
Types like string and wstring, operations on them, and C facilities like strcmp/wcscmp expect fixed-width encodings, so they will not work properly with variable-width ones like UTF-8 or UTF-16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful not to use fixed-width operations on them. C strings do have some functions for manipulating such strings in the standard library. You can use classes from the <codecvt> header to convert between different encodings for C++ strings.
I would avoid wstring and use the C++11 exact-width character strings instead: std::u16string or std::u32string.
As an example, here is some info on how Windows uses these types/encodings:
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16; note this means that some Unicode characters will use two wchar_ts
If you call a system function, e.g. puts, then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. if you are using Unicode).
So on Windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to convert them to UTF-16 and call the corresponding UTF-16 system functions.
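For illustration, here is a hedged, Windows-only sketch of that convert-then-call pattern, using the MultiByteToWideChar API (error handling trimmed for brevity; the message text is just an example):

#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to the UTF-16 std::wstring that the "W" APIs expect.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}

int main()
{
    std::string message = "Gr\xC3\xBC\xC3\x9F" "e";  // "Grüße" spelled out as UTF-8 bytes
    MessageBoxW(nullptr, utf8_to_utf16(message).c_str(), L"Demo", MB_OK);
}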

How well is Unicode supported in C++11?

I've read and heard that C++11 supports Unicode. A few questions on that:
How well does the C++ standard library support Unicode?
Does std::string do what it should?
How do I use it?
Where are potential problems?
How well does the C++ standard library support Unicode?
Terribly.
A quick scan through the library facilities that might provide Unicode support gives me this list:
Strings library
Localization library
Input/output library
Regular expressions library
I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.
Does std::string do what it should?
Yes. According to the C++ standard, this is what std::string and its siblings should do:
The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.
Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.
Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.
How do I use it?
Use it as a sequence of char objects; pretending it is something else is bound to end in pain.
Where are potential problems?
All over the place? Let's see...
Strings library
The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.
It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
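A rough sketch of that bridge (assuming the process locale uses UTF-8, e.g. an "en_US.UTF-8" locale is installed): mbrtoc16 converts the locale's multibyte encoding into char16_t code units one at a time.

#include <clocale>
#include <cstddef>
#include <cuchar>
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8");      // assumption: a UTF-8 locale is available

    std::string    in = "\xE2\x82\xAC";         // "€" (U+20AC) as UTF-8 bytes
    std::u16string out;

    std::mbstate_t state{};
    const char* p   = in.data();
    const char* end = in.data() + in.size();
    while (p < end) {
        char16_t c16;
        std::size_t rc = std::mbrtoc16(&c16, p, end - p, &state);
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2)) break;     // invalid or incomplete sequence
        out.push_back(c16);
        if (rc == static_cast<std::size_t>(-3)) continue;  // low surrogate from the previous call
        p += (rc == 0 ? 1 : rc);                           // rc == 0 means an embedded null byte
    }
    std::cout << out.size() << " UTF-16 code unit(s)\n";   // 1
}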
Localization library
The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.
Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:
template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...
How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.
This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.
However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦ but toupper can only return one character.
Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.
wstring_convert is used to convert between strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.
The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
wbuffer_convert performs a similar function but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.
The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).
UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>
no-op with codecvt<char, char, mbstate_t>.
Several of these are useful, but there is a lot of awkward stuff here.
First off—holy high surrogate! that naming scheme is messy.
Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.
I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
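A sketch of exactly that contrast (again using the deprecated <codecvt> facets): the UCS-2 facet rejects the surrogate pair, while codecvt_utf8_utf16 converts it.

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

int main()
{
    const std::u16string banana = u"\U0001F34C";   // U+1F34C as a UTF-16 surrogate pair

    try {
        // codecvt_utf8<char16_t> treats its input as UCS-2, so surrogates are errors.
        std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> ucs2;
        ucs2.to_bytes(banana);
    } catch (const std::range_error&) {
        std::cerr << "UCS-2 facet rejected the surrogate pair\n";
    }

    // codecvt_utf8_utf16<char16_t> performs a real UTF-16 <-> UTF-8 conversion.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16;
    std::cout << utf16.to_bytes(banana).size() << " UTF-8 bytes\n";   // 4
}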
Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of char16_t. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.
The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
There are some more interesting conversion possibilities absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.
And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.
Input/output library
The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.
Regular expressions library
I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
That's it?
Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen like normalization or text segmentation algorithms.
U+1F4A9. Is there any way to get some better Unicode support in C++?
The usual suspects: ICU and Boost.Locale.
† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.
Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".
‡ If you are about to say "but Windows!" hold your 🐎🐎. All versions of Windows since Windows 2000 use UTF-16.
☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there's still plenty of other cases where this would fail. Try uppercasing U+FB00 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟɪɢᴀᴛᴜʀᴇ ғғ. There is no ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟɪɢᴀᴛᴜʀᴇ ғғ; it just uppercases to two Fs. Or U+01F0 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴊ ᴡɪᴛʜ ᴄᴀʀᴏɴ; there's no precomposed capital; it just uppercases to a capital J and a combining caron.
Unicode is not supported by the Standard Library (for any reasonable meaning of "supported").
std::string is no better than std::vector<char>: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its content as a blob of bytes.
If you only need to store and concatenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes, etc.) you are out of luck.
The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.
You can safely store UTF-8 in a std::string (or in a char[] or char*, for that matter), due to the fact that a Unicode NUL (U+0000) is a null byte in UTF-8 and that this is the sole way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including std::cout and std::cerr, so long as your locale is UTF-8).
What you cannot do with std::string for UTF-8 is get length in code points. std::string::size() will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.
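Counting code points yourself is short, though; here is a minimal sketch that assumes the string already holds valid UTF-8:

#include <cstddef>
#include <iostream>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes (10xxxxxx).
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)   // every non-continuation byte starts a code point
            ++n;
    return n;
}

int main()
{
    std::string s = "na\xC3\xAFve";                         // "naïve" in UTF-8
    std::cout << s.size() << " bytes, "                     // 6
              << utf8_code_points(s) << " code points\n";   // 5
}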
If you need to operate on UTF-8 strings at the code point level (i.e. not just store and print them) or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.
C++11 has a couple of new literal string types for Unicode.
Unfortunately the support in the standard library for non-uniform encodings (like UTF-8) is still bad. For example, there is no nice way to get the length (in code points) of a UTF-8 string.
However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for std::string/std::wstring. It aims to fill the gap of the still-missing UTF-8 string container class.
This might be the most comfortable way of "dealing" with UTF-8 strings (that is, without Unicode normalization and similar stuff). You comfortably operate on code points, while your string stays encoded as variable-length UTF-8 chars.

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF-8 is variable length. When I use the at or substr string methods, would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8; otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 is variable length, all indexing is done in code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
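A short sketch of that difference (pre-C++20 semantics, where u8 literals are plain char arrays): indexing the UTF-8 string addresses bytes, indexing the UTF-32 string addresses code points.

#include <iostream>
#include <string>

int main()
{
    std::string    utf8  = u8"\u00FFabc";   // ÿ occupies two UTF-8 code units (0xC3 0xBF)
    std::u32string utf32 = U"\u00FFabc";    // one code unit per code point

    // at(1) on the UTF-8 string returns the second *byte* of ÿ, not a character:
    std::cout << std::hex
              << static_cast<unsigned>(static_cast<unsigned char>(utf8.at(1))) << '\n';  // bf

    // at(1) on the UTF-32 string really is the second code point:
    std::cout << (utf32.at(1) == U'a') << '\n';   // 1
}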
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally, and instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that you're trying to fit FF FD inside a single byte, since, as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you may get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at, you could end up with a single byte of a multi-byte sequence; with substr, you could break a sequence and end up with an invalid UTF-8 string (it would start or end with �, \uFFFD, the same character you're apparently trying to use, and the broken character would be lost).
I would recommend that you use wchar_t to store Unicode strings. Since the type is at least 16 bits, many more characters can fit in a single "unit".

When does encoding actually matter? (e.g., string storage, printing?)

Just curious about the encodings the system uses when storing strings (if it cares) and when printing them.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
Question 2: If positive, must std::string and std::wstring have knowledge of the encoding themselves (although another guy told me this is NOT true)? Otherwise, how are they able to translate the chars to the correct integers and store them? How does the system know the encoding?
Question 3: What is the default encoding on a particular system, and how do I change it (is it the so-called "locale")? I guess the same mechanism applies?
Question 4: If I print a string to the screen with std::cout, is it the same encoding?
(I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory)
Not quite. Make sure you understand one important distinction.
A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.
Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.
The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy from the pre-Unicode days when everyone (except East Asians) used single-byte encodings. But nowadays, we have Unicode, and its encoding schemes which have up to 4 bytes per character.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value depend on the encoding currently in use?
Yes, it will. Suppose you have std::string euro = "€"; Then:
With the windows-1252 encoding, the string will be encoded as the byte 0x80.
With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
Question 3: What is the default encoding in one particular system, and how to change it (Is it so-called "locale")?
Depends on the platform. On Unix, the encoding can be specified as part of the LANG environment variable.
~$ echo $LANG
en_US.utf8
Windows has a GetACP function to get the "ANSI" code page number.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.
Encoding and decoding are inherently the same process; i.e., they both transform one integral sequence into another integral sequence.
The difference between encoding and decoding is on the conceptual level. When you "decode" a character, you transform an integral sequence encoded in a known encoding ("string") into a system-specific integral sequence ("text"). And when you "encode", you're transforming a system-specific integral sequence ("text") into an integral sequence encoded in a particular encoding ("string").
This difference is conceptual, not physical; the memory still holds a decoded "text" as a "string". However, since a particular system always represents "text" in a particular encoding, text transformations do not need to deal with the specificities of the actual system encoding, and can safely assume to be working on a sequence of conceptual "characters" instead of "bytes".
Generally, however, the encoding used for "text" has properties that make it easy to work with (e.g. fixed-length characters, a simple one-to-one mapping between characters and byte sequences, etc.), while the encoded "string" uses an encoding that is efficient (e.g. variable-length characters, context-dependent encoding, etc.).
Joel On Software has a writeup on this: http://www.joelonsoftware.com/articles/Unicode.html
This one is a good one as well: http://www.jerf.org/programming/encoding.html
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use? (I remember that Bjarne says that encoding is the mapping between char and integer(s) so char should be stored as integer(s) in memory, and different encodings don't necessarily have the same mapping)
You're sort of thinking about this backwards. Different encodings interpret the underlying integers as different characters (or parts of characters, if we're talking about a multi-byte character set), depending on the encoding.
Question 2: If positive, std::string and std::wstring must have the knowledge of the encoding themselves (although another guy told me this is NOT true)? Otherwise, how is it able to translate the char to correct integers and store them? How does the system know the encoding?
Both std::string and std::wstring are completely encoding agnostic. As far as C++ is concerned, they simply store arrays of char objects and wchar_t objects respectively. The only requirement is that char is one-byte, and wchar_t is some implementation-defined width. (Usually 2 bytes on Windows and 4 on Linux/UNIX)
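A two-line check of those sizes (the results in the comments are the typical ones, not guarantees):

#include <iostream>

int main()
{
    std::cout << sizeof(char)    << '\n';  // always 1 by definition
    std::cout << sizeof(wchar_t) << '\n';  // implementation-defined: typically 2 on Windows, 4 on Linux/UNIX
}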
Question 3: What is the default encoding in one particular system, and how to change it (Is it so-called "locale")?
That depends on the platform. ISO C++ only talks about the global locale object, std::locale(), which generally refers to your current system-specific settings.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
Generally, if you output to the screen through stdout, the characters you see displayed are interpreted and rendered according to your system's current locale settings.
Any one working with encodings should read this Joel on Software article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I found it useful when I started working with encodings.
Question 1: If I store one-byte string in std::string or two-byte string in std::wstring, will the underlying integer value differ depending on the encoding currently in use?
C/C++ programmers are used to thinking of characters as bytes, because almost everyone starts out working with the ASCII character set (or one of its 8-bit extensions), which maps the integers 0-255 to symbols such as the letters of the alphabet and Arabic numerals. The fact that the C char data type is actually a byte doesn't help matters.
The std::string class stores data as 8-bit integers, and std::wstring stores data in 16-bit integers. Neither class contains any concept of encoding. You can use any 8-bit encoding such as ASCII, UTF-8, Latin-1, Windows-1252 with a std::string, and any 8-bit or 16-bit encoding, such as UTF-16, with a std::wstring.
Data stored in std::string and std::wstring must always be interpreted by some encoding. This generally comes into play when you interact with the operating system: reading or writing data from a file, a stream, or making OS API calls that interact with strings.
So to answer your question, if you store the same byte in a std::string and a std::wstring, the memory will contain the same value (except that the wstring element will also contain a zero high byte), but the interpretation of that byte will depend on the encoding in use.
If you store the same character in each of the strings, then the bytes may be different, again depending on the encoding. For example, the Euro symbol (€) might be stored in the std::string using a UTF-8 encoding, which corresponds to the bytes 0xE2 0x82 0xAC. In the std::wstring, it might be stored using the UTF-16 encoding, which would be the single 16-bit code unit 0x20AC.
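A sketch of that Euro-sign example, written with explicit byte escapes and char16_t (instead of wchar_t, whose size varies) so it behaves the same everywhere:

#include <cstdio>
#include <string>

int main()
{
    std::string    utf8  = "\xE2\x82\xAC";   // U+20AC encoded as three UTF-8 bytes
    std::u16string utf16 = u"\u20AC";        // U+20AC as a single UTF-16 code unit

    for (unsigned char b : utf8)
        std::printf("%02X ", static_cast<unsigned>(b));        // E2 82 AC
    std::printf("| %04X\n", static_cast<unsigned>(utf16[0]));  // 20AC
}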
Question 3: What is the default encoding in one particular system, and how to change it(Is it so-called "locale")? I guess the same mechanism matters?
Yes, the locale determines how the OS interprets strings at its API boundaries. Locales define more than just the encoding. They also include information on how money, dates, times, and other things should be formatted. On Linux or OS X, you can use the locale command in the terminal to see what the current locale is:
mch#bohr:/$ locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
So in this case, my locale is Canadian English. Each locale defines an encoding used to interpret strings. In this case the locale name makes it clear that it is using a UTF-8 encoding, but you can run locale -ck LC_CTYPE to see more information about the current encoding:
mch#bohr:/$ locale -ck LC_CTYPE
LC_CTYPE
ctype-class-names="upper";"lower";"alpha";"digit";"xdigit";"space";"print";"graph";"blank";"cntrl";"punct";"alnum";"combining";"combining_level3"
ctype-map-names="toupper";"tolower";"totitle"
ctype-width=16
ctype-mb-cur-max=6
charmap="UTF-8"
... output snipped ...
If you want to test a program using encodings, you can set the LC_ALL environment variable to the locale you want to use. You can also change the locale using setlocale. Permanently changing the locale depends on your distribution.
On Windows, most API functions come in a narrow and a wide format. For example, GetCurrentDirectory comes in GetCurrentDirectoryW (Unicode) and GetCurrentDirectoryA (ANSI) variants. Unicode, in this context, means UTF-16.
I don't know enough about Windows to tell you how to set the locale, other than to try the languages control panel.
Question 4: What if I print a string to the screen with std::cout, is it the same encoding?
When you print a string to std::cout, the OS will interpret that string in the encoding set by the locale. If your string is UTF-8 encoded and the OS is using Windows-1252, it will be necessary to convert it to that encoding. One way to do this is with the iconv library.
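As an illustration, here is a POSIX-flavoured sketch of that iconv approach (it assumes an iconv implementation that knows the "WINDOWS-1252" name; error handling is kept minimal):

#include <iconv.h>
#include <cstddef>
#include <cstdio>
#include <string>

int main()
{
    std::string utf8 = "caf\xC3\xA9";                  // "café" in UTF-8

    iconv_t cd = iconv_open("WINDOWS-1252", "UTF-8");  // to-encoding, from-encoding
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char        outbuf[64];
    char*       in       = &utf8[0];
    std::size_t in_left  = utf8.size();
    char*       out      = outbuf;
    std::size_t out_left = sizeof outbuf;

    if (iconv(cd, &in, &in_left, &out, &out_left) == static_cast<std::size_t>(-1))
        std::perror("iconv");
    iconv_close(cd);

    std::string cp1252(outbuf, out - outbuf);           // "caf\xE9": é is 0xE9 in Windows-1252
    std::printf("%zu byte(s)\n", cp1252.size());        // 4
}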

Why does wide file-stream in C++ narrow written data by default?

Honestly, I just don't get the following design decision in C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:
#include <fstream>
#include <string>
int main()
{
    using namespace std;
    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");
    file << someString; // the output file will consist of ASCII characters!
}
I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?
Also, are we going to get real Unicode streams with C++0x, or am I missing something here?
A very partial answer for the first question: A file is a sequence of bytes so, when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, so this is why this conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.
Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect that the conversion is made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By similarity, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the basic multilingual plane), while the Linux implementation I know about decided to stick to ASCII.
For your second question:
Also, are we gonna get real unicode streams with C++0x or am I missing something here?
In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.
In the [locale.stdcvt] section, we find:
For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]
For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]
For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.
So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
Two main points:
IO is done in terms of char.
it is the job of the locale to determine how wide chars are serialized
the default locale (named "C") is very minimal (I don't remember the constraints from the standard; here it is able to handle only 7-bit ASCII as the narrow and wide character set).
there is an environment-determined locale named ""
So to get anything, you have to set the locale.
If I use the simple program
#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>
int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}
which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask to use the "C" locale, I get
$ env LC_ALL=C ./a.out
Output failed
The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I ask for a UTF-8 locale instead, I get
$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003
(od -t x1 just dumps the file in hex), exactly what I expect for a UTF-8 encoded file.
I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).
Check out the most recent C++0x draft (N2960).
For your first question, this is my guess.
The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it's assumed that:
Inside your program, you should use a (fixed-width) wide-character encoding.
Only external storage should use (variable-width) multibyte encodings.
I believe that is the reason for the existence of the two template specializations of std::codecvt. One that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding you should do it byte-by-byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.
Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character ASCII-encoded strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one char) ASCII character. (The conversion exists and is straightforward.)
By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide-characters.
Check this out:
Class basic_filebuf
You can alter the default behavior by setting a wide char buffer, using pubsetbuf.
Once you do that, the output will be wchar_t and not char.
In other words, for your example you will have:
wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!
wchar_t buffer[128];
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // this is the BOM flag; UTF-16 needs this, but Microsoft's UNICODE doesn't, so you can skip this line, if any.
file << someString; // the output file will consist of Unicode characters! Without the call to pubsetbuf, the output file would be ANSI (current regional settings)