Why does wide file-stream in C++ narrow written data by default?

Honestly, I just don't get the following design decision in C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:
#include <fstream>
#include <string>
int main()
{
    using namespace std;
    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");
    file << someString; // the output file will consist of ASCII characters!
}
I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?
Also, are we going to get real Unicode streams with C++0x, or am I missing something here?

A very partial answer for the first question: A file is a sequence of bytes so, when dealing with wchar_t's, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, so this is why this conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.
Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect that the conversion is made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By similarity, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the basic multilingual plane), while the Linux implementation I know about decided to stick to ASCII.
For your second question:
Also, are we going to get real Unicode streams with C++0x, or am I missing something here?
In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.
In the [locale.stdcvt] section, we find:
For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]
For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]
For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.
So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
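As a minimal sketch of how those facets can be used (assuming the C++11 <codecvt> header, which was later deprecated in C++17), one can imbue a wofstream so that wide characters are written out as UTF-8 instead of being narrowed:
#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    std::wofstream file("Test.txt");
    // Replace the locale's codecvt<wchar_t, char> facet with one that writes UTF-8.
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));
    file << L"Hello StackOverflow!";   // the bytes on disk are now UTF-8
}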

The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
Two main points:
IO is done in terms of char.
It is the job of the locale to determine how wide chars are serialized.
The default locale (named "C") is very minimal (I don't remember the constraints from the standard; here it is able to handle only 7-bit ASCII as narrow and wide character set).
There is an environment-determined locale named "".
So to get anything, you have to set the locale.
If I use the simple program
#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>
int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}
which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask it to use the "C" locale, I get
$ env LC_ALL=C ./a.out
Output failed
the locale has been unable to handle the wide character, and we get notified of the problem as the IO failed. If I instead ask for a UTF-8 locale, I get
$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003
(od -t x1 just dumps the file represented in hex), exactly what I expect for a UTF-8 encoded file.

I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).
Check out the most recent C++0x draft (N2960).
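A minimal sketch of the new types and literals, assuming a C++0x/C++11 compiler and the usual 8-bit byte:
#include <string>

int main()
{
    // Both new types have an exact width (on platforms with 8-bit bytes).
    static_assert(sizeof(char16_t) == 2, "char16_t is 16 bits");
    static_assert(sizeof(char32_t) == 4, "char32_t is 32 bits");

    std::u16string s16 = u"Hello!";   // UTF-16 encoded string literal
    std::u32string s32 = U"Hello!";   // UTF-32 encoded string literal
    (void)s16; (void)s32;
}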

For your first question, this is my guess.
The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it's assumed that:
Inside your program, you should use a (fixed-width) wide-character encoding.
Only external storage should use (variable-width) multibyte encodings.
I believe that is the reason for the existence of the two template specializations of std::codecvt. One maps between char types (maybe you're simply working with ASCII) and the other maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding, you should do it byte by byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.
Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character, ASCII-encoded strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one-char) ASCII character. (The conversion exists and is straightforward.)
By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide-characters.
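Roughly, such a facet could look like the sketch below (the class name raw_wchar_codecvt is made up, error handling is omitted, and a real facet would also override do_in for reading). Instead of literally returning noconv, it copies the raw wchar_t bytes in do_out, which has the same net effect of writing the wide characters unconverted:
#include <cstddef>
#include <cstring>
#include <fstream>
#include <locale>

// Writes the raw bytes of each wchar_t instead of narrowing them.
struct raw_wchar_codecvt : std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* from_end, const wchar_t*& from_next,
                  char* to, char* to_end, char*& to_next) const override
    {
        // Copy as many whole wchar_t values as fit into the destination buffer.
        while (from != from_end &&
               to_end - to >= static_cast<std::ptrdiff_t>(sizeof(wchar_t))) {
            std::memcpy(to, from, sizeof(wchar_t));
            ++from;
            to += sizeof(wchar_t);
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return sizeof(wchar_t); }
    int do_max_length() const noexcept override { return sizeof(wchar_t); }
};

int main()
{
    std::wofstream file;
    file.imbue(std::locale(file.getloc(), new raw_wchar_codecvt));
    file.open("Test.txt", std::ios::binary);
    file << L"Hello StackOverflow!";   // the file now contains raw wchar_t bytes
}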

Check this out:
Class basic_filebuf
You can alter the default behavior by setting a wide char buffer, using pubsetbuf.
Once you have done that, the output will be wchar_t and not char.
In other words for your example you will have:
wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!
wchar_t buffer[128];
file.rdbuf()->pubsetbuf(buffer, 128);
file.put(0xFEFF); // this is the BOM; UTF-16 needs it, but Microsoft's UNICODE doesn't, so you can skip this line if you want
file << someString; // the output file will consist of Unicode characters! Without the call to pubsetbuf, the output file would be ANSI (current regional settings)

std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
When should I use std::wstring over std::string?
Can std::string hold the entire ASCII character set, including the special characters?
Is std::wstring supported by all popular C++ compilers?
What is exactly a "wide character"?
string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:
#include <cstring>
#include <iostream>
int main()
{
    const char text[] = "olé";
    std::cout << "sizeof(char) : " << sizeof(char) << "\n";
    std::cout << "text : " << text << "\n";
    std::cout << "sizeof(text) : " << sizeof(text) << "\n";
    std::cout << "strlen(text) : " << strlen(text) << "\n";
    std::cout << "text(ordinals) :";
    for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }
    std::cout << "\n\n";

    // - - -

    const wchar_t wtext[] = L"olé";
    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    //std::cout << "wtext : " << wtext << "\n"; <- error
    std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext : " << wtext << "\n";
    std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
    std::cout << "wtext(ordinals) :";
    for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
    {
        unsigned short wc = static_cast<unsigned short>(wtext[i]);
        std::cout << " " << static_cast<unsigned int>(wc);
    }
    std::cout << "\n\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed by four chars: 110, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, so std::string is already unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of applications working with char on the various charsets/codepages produced around the world, before the advent of Unicode.
So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine, which for a long time could not be UTF-8. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, that is, Unicode encoded in 2-byte units (or at the very least UCS-2, which just lacks surrogate pairs and thus characters outside the BMP (>= 64K)).
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low-level API function to set the label on a Win32 GUI).
Memory issues?
UTF-32 is 4 bytes per character, so there is not much to add, other than that UTF-8 and UTF-16 text will always use less than or the same amount of memory as UTF-32 text (and usually less).
If there is a memory issue, then you should know that for most Western languages, UTF-8 text will use less memory than the same text in UTF-16.
Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
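To make the comparison concrete, here is a minimal sketch (assuming a C++11 compiler) that prints the storage used by the same short text in the three encodings:
#include <iostream>
#include <string>

int main()
{
    std::string    s8  = u8"olé";   // UTF-8
    std::u16string s16 = u"olé";    // UTF-16
    std::u32string s32 = U"olé";    // UTF-32

    std::cout << s8.size()      << " bytes as UTF-8\n"    // 4
              << s16.size() * 2 << " bytes as UTF-16\n"   // 6
              << s32.size() * 4 << " bytes as UTF-32\n";  // 12
}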
Conclusion
When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell):
a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
a char from 0 to 127 will be held correctly
a char from 128 to 255 will have a meaning depending on your encoding (Unicode, non-Unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
In C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.
My view is summarized in http://utf8everywhere.org of which I am a co-author.
Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
And now, answering your questions:
A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how You treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
No.
Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.
So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not; that will not help anyway.
My solution, after in-depth investigation, much frustration and the resulting experience, is the following:
accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)
use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)
accept that such a UTF8String object is just a dumb but cheap container. Never access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those.)
use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) - this is a compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is sufficient for most of us (more on that later...).
use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.
add two utility functions to convert back & forth between UTF-8 and UCS-2:
UCS2String ConvertToUCS2( const UTF8String &str );
UTF8String ConvertToUTF8( const UCS2String &str );
The conversions are straightforward; a rough sketch is given below, and Google should help with the rest ...
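One possible sketch of those two helpers, assuming the UTF8String/UCS2String typedefs suggested above, restricted to the Basic Multilingual Plane and with no input validation (use ICU or the standard codecvt facilities for anything serious):
#include <string>

typedef std::string  UTF8String;
typedef std::wstring UCS2String;

UTF8String ConvertToUTF8( const UCS2String &str )
{
    UTF8String out;
    for (wchar_t wc : str) {
        unsigned cp = static_cast<unsigned>(wc) & 0xFFFF;   // BMP only
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

UCS2String ConvertToUCS2( const UTF8String &str )
{
    UCS2String out;
    for (size_t i = 0; i < str.size(); ) {
        unsigned char b = str[i];
        unsigned cp;
        size_t len;
        if      (b < 0x80) { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else               { cp = b & 0x0F; len = 3; } // 4-byte sequences not handled (outside UCS-2)
        for (size_t k = 1; k < len && i + k < str.size(); ++k)
            cp = (cp << 6) | (str[i + k] & 0x3F);
        out += static_cast<wchar_t>(cp);
        i += len;
    }
    return out;
}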
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
Alternatives & Improvements
conversions to and from single-byte character encodings (e.g. ISO-8859-1) can be realized with the help of plain translation tables, e.g. const wchar_t tt_iso88591[256] = {0,1,2,...}; and appropriate code for conversion to and from UCS-2.
if UCS-2 is not sufficient, then switch to UCS-4 (typedef std::basic_string<uint32_t> UCS4String)
ICU or other unicode libraries?
For advanced stuff.
When you want to have wide characters stored in your string. "Wide" depends on the implementation: Visual C++ defaults to 16 bits if I remember correctly, while GCC's default depends on the target; it's 32 bits here. Please note wchar_t (wide character type) has nothing to do with Unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and that it is at least as long as char. You can store Unicode strings fine in std::string using the UTF-8 encoding too, but it won't understand the meaning of Unicode code points. So str.size() won't give you the number of logical characters in your string, but merely the number of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle UTF-8.
If your wchar_t is 32 bits long, then you can use UTF-32 as a Unicode encoding, and you can store and handle Unicode strings using a fixed-length encoding (UTF-32 is fixed length). This means your wstring's s.size() function will then return the right number of wchar_t elements and logical characters.
Yes, char is always at least 8 bit long, which means it can store all ASCII values.
Yes, all major compilers support it.
I frequently use std::string to hold utf-8 characters without any problems at all. I heartily recommend doing this when interfacing with API's which use utf-8 as the native string type as well.
For example, I use utf-8 when interfacing my code with the Tcl interpreter.
The major caveat is that the length of the std::string is no longer the number of characters in the string.
A good question!
I think of data encoding (sometimes a charset is also involved) as a mechanism for expressing data in memory in order to save it to a file or transfer it over a network, so I answer these questions as follows:
1. When should I use std::wstring over std::string?
If the programming platform or API function is a single-byte one, and we want to process or parse some Unicode data, e.g. read from a Windows .REG file or a 2-byte network stream, we should declare a std::wstring variable to process it easily. E.g.: wstring ws = L"中国a" (6 octets of memory: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get '国' and ws[2] to get 'a', etc.
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But notice: original American ASCII covers only 0x00 to 0x7F, one octet per character, including printable text such as "123abc&*_&" and the "special" characters you mention, which are mostly printed as '.' to avoid confusing editors or terminals. Some other countries extend their own "ASCII" charsets, e.g. Chinese, using 2 octets to stand for one character.
3.Is std::wstring supported by all popular C++ compilers?
Mostly. I have used VC++ 6 and GCC 3.3, and yes, both support it.
4. What is exactly a "wide character"?
A wide character mostly means using 2 or 4 octets to hold any country's characters. 2-octet UCS-2 is a representative sample; e.g. for English 'a', its memory is the 2 octets 0x0061 (vs. ASCII, where 'a' takes the 1 octet 0x61).
When you want to store 'wide' (Unicode) characters.
Yes: 255 of them (excluding 0).
Yes.
Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why. Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this is not just when using the Windows API; there are three other major places where these strings interact with standard C++. These are string literals, output to std::cout using <<, and passing a filename to std::fstream.
I will be up front here that I am a programmer, not a language specialist. I appreciate that UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable, and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I upset anyone with my ignorance of this matter; I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage then VS stores them in your file with 1 byte per character encoding based on your codepage. Note that if you change your codepage or give your source to another developer using a different code page then I think (but haven't tested) that the character will end up different. If you run your code on a computer using a different code page then I'm not sure if the character will change too.
If you enter any string literals that cannot be represented by your codepage then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all Non ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to code page encoded text and any characters missing from the code page are replaced with ?.
The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to utf-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well have not used a wide string literal.
std::cout
When outputting to the console using << you can only use std::string, not std::wstring and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions and any characters not on your codepage get replaced by ? (maybe you can change the character, I can't remember).
std::fstream filenames
Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have files with any Unicode character. But this means that to access or create files with characters not on your codepage you must use std::wstring. There is no other way. This is a Microsoft specific extension to std::fstream so probably won't compile on other systems. If you use std::string then you can only utilise filenames that only include characters on your codepage.
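For illustration, a minimal sketch of that Microsoft-specific usage (this relies on the MSVC standard library's wide-filename overloads and will not compile with most other implementations; C++17 later added std::filesystem::path overloads for the same purpose):
#include <fstream>

int main()
{
    // Wide-filename constructor: a Microsoft extension to std::fstream,
    // needed to reach paths with characters outside the current codepage.
    std::ofstream out(L"Přetečení.txt");
    out << "hello\n";
}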
Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.
If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists may say use UTF8 then convert when needed, but why bother with the hassle.
If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on Windows then you need to be really careful with your string literals and output to the console. You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is a better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets for example.
Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code
#include <fstream>
#include <string>
#include <iostream>

#ifdef _WIN32
#include <Windows.h>
typedef std::wstring unicodestring;
#define UNI(text) L ## text

std::string formatForConsole(const unicodestring &str)
{
    std::string result;
    //Call WideCharToMultiByte to do the conversion
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text

std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif

int main()
{
    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}
would be fine on either platform I think.
Answers
So To answer your questions
1) If you are programming for Windows, then all the time. If cross-platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write some code with platform-specific #ifdefs to work around the differences. If just using Linux, then never.
2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows, you can only use it for all Unicode if you choose to manually encode using UTF-8. But the Windows API and the standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all of ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.
3) I believe so, but if not, it is just a simple typedef of 'std::basic_string' using wchar_t instead of char.
4) A wide character is a character type which is bigger than the 1-byte standard char type. On Windows it is 2 bytes; on Linux it is 4 bytes.
Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.
Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.
The data type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for processing text in the implementation defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.
If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
When you want to use Unicode strings and not just ASCII; helpful for internationalisation.
Yes, but it doesn't play well with 0.
Not aware of any that don't.
A wide character is the compiler-specific way of handling a fixed-length representation of a Unicode character; for MSVC it is a 2-byte character, for gcc I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
If you want to keep portability for strings, you can use tstring and tchar. It is a widely used technique from long ago. In this sample, I use a self-defined _TCHAR, but you can find a tchar.h implementation for Linux on the internet.
This idea means that wstring/wchar_t/UTF-16 is used on Windows and string/char/UTF-8 (or ASCII...) is used on Linux.
In the sample below, searching an English/Japanese multibyte mixed string works well on both Windows and Linux platforms.
#include <locale.h>
#include <stdio.h>
#include <algorithm>
#include <string>
using namespace std;
#ifdef _WIN32
#include <tchar.h>
#else
#define _TCHAR char
#define _T(x) x
#define _tprintf printf
#endif

#define tstring basic_string<_TCHAR>

int main() {
    setlocale(LC_ALL, "");
    tstring s = _T("abcあいうえおxyz");
    auto pos = s.find(_T("え"));
    auto r = s.substr(pos);
    _tprintf(_T("r=%s\n"), r.c_str());
}
1) As mentioned by Greg, wstring is helpful for internationalization; that's when you will be releasing your product in languages other than English.
4) Check this out for wide character
http://en.wikipedia.org/wiki/Wide_character
When should you NOT use wide-characters?
When you're writing code before the year 1990.
Obviously, I'm being flip, but really, it's the 21st century now. 127 characters have long since ceased to be sufficient. Yes, you can use UTF8, but why bother with the headaches?

Understanding wchar_t type in c++

The Standard says N3797::3.9.1 [basic.fundamental]:
Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.3.1).
I can't imagine how we can use that type. Could you give an example where plain char doesn't work? I thought it might be helpful if we used two different languages simultaneously. But plain char is OK in the case of Cyrillic and Latin:
#include <iostream>
char cp[] = "LATINICA_КИРИЛЛИЦА";
int main()
{
std::cout << cp; //LATINICA_КИРИЛЛИЦА
}
In your example, you use Unicode. Indeed you could type not only Latin or Cyrillic, but also Thai, Arabic, Chinese, in other words any Unicode symbol. Here is your example with some more symbols: link
The point is the encoding. In your example you are using char to store Unicode symbols encoded in UTF-8. See this for more details. The main advantage of UTF-8 is backward compatibility with ASCII. The main disadvantage of using UTF-8 is variable symbol length.
There are other types of encoding for Unicode symbols. The most common (besides UTF-8) are UTF-16 and UTF-32. You should be aware that the UTF-16 encoding is still variable length; however, the code unit is now 16 bits. UTF-32 is a constant-length encoding.
The type wchar_t is usually used to store symbols in UTF-16 or UTF-32 encoding depending on the system.
It depends what encoding you decide to use. Any single UTF-8 value can be held in an 8-bit char (though one Unicode code-point can take several char values to represent). It's impossible to tell from your question, but I'd guess that your editor and compiler are treating your strings as UTF-8 and that's fine if that's what you want.
Other common encodings include UTF-16, UTF-32, UCS-2 and UCS-4, which have 2-byte, 4-byte, 2-byte and 4-byte values respectively. You can't store these values in an 8-bit char.
The decision of what encoding to use for any given purpose is not straightforward. The main considerations are:
What other systems does your code have to interface to and what encoding do they use?
What libraries do you want to use and what encodings do they use? (eg xerces-c uses UTF-16 throughout)
The tradeoff between complexity and storage size. UTF-32 and UCS-4 have the useful property that every possible displayed character is represented by one value, so you can tell the length of the string from how much memory it takes up without having to look at the values in it (though this assumes that you consider combining diacritic marks as separate characters). However, if all you're representing is ASCII, they take up four times as much memory as UTF-8.
I'd suggest Joel Spolsky's essay on Unicode as a good read.
wchar_t has its own problems, though. The standard didn't specify how big a wchar_t is, so, of course, different compilers have picked different sizes; VC++ used two bytes and gcc (and most others) use four bytes. Wide-character literals, such as L"Hello, world," are similarly confused, being UTF-16 strings in VC++ and UCS-4 in gcc.
To try to clean this up, C++11 introduced two new character types:
char16_t is a character guaranteed to be 16-bits, and with a literal form u"Hello, world."
char32_t is a character guaranteed to be 32-bits, and with a literal form U"Hello, world."
However, these have problems of their own; in particular, <iostream> doesn't provide console streams that can handle them (ie there is no u16cout or u32cerr).
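A minimal sketch of one common workaround, assuming std::wstring_convert and std::codecvt_utf8_utf16 are available (C++11 features, deprecated since C++17): convert to UTF-8 and print with plain std::cout.
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::u16string s = u"Hello, world.";

    // There is no u16cout, so convert the UTF-16 string to UTF-8 bytes first.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::cout << conv.to_bytes(s) << '\n';
}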
To be more specific, I'll provide a normative reference related to the question. N3797 8.5.2/1 [dcl.init.string] says:
An array of narrow character type (3.9.1), char16_t array, char32_t
array, or wchar_t array can be initialized by a narrow string literal,
char16_t string literal, char32_t string literal, or wide string
literal, respectively, or by an appropriately-typed string literal
enclosed in braces (2.14.5). Successive characters of the value of the
string literal initialize the elements of the array.
8.5.2/2:
There shall not be more initializers than there are array elements.
In the case of
#include <iostream>
char cp[] = "LATINICA_КИРИЛЛИЦА";
int main()
{
std::cout << sizeof(cp) << std::endl; //28
}
For some languages, like English, it's not necessary to use wchar_t, but for other languages, like Chinese, you'd better use wchar_t.
Although char is able to store a string, like char p[] = "你好",
it may show garbled text when you run your program on a different computer, especially one using a different language.
If you use wchar_t, you can avoid this.

How can I make dynamic strings to work with UTF-8 in console?

Most of the answers and questions here on SO tend to put L before any UTF-8 string. I found no explanation of what it is; in the source code, the constant is, according to my IDE, defined in winnt.h.
This is how I use it, without knowing what it is:
std::wcout<<L"\"Přetečení zásobníku\" is Stack overflow in Czech.";
Obviously, constant concatenation cannot be applied on variables:
void printUTF8(const char* str) {
//Does not make the slightest bit of sense
std::wcout<<L str;
}
So what is it and how to add it to dynamic strings?
L"" is a WIDE string. That is to say, it's a a wchar_t[1]. UTF-8 strings can't be wide, since they are multi-byte (variable length). VC++ is slightly wrong and made wide strings variable length, UTF-16 to be precise. But usually they're UTF-32.
The problem with multi-byte strings is that there are many different encodings, and UTF-8 is only one of them. Windows does not in fact natively support UTF-8 encodings. MessageBoxA() for instance can use any encoding but UTF-8. There's just one exception to that, which is MultiByteToWideChar(CP_UTF8, ...) which is what you'd need here.
L is an indication to the C compiler that the string is composed of "wide characters". In Windows, these would be UTF-16 - each character that you put in the string is 16 bits, or two bytes, wide:
L"This is a wide string"
In contrast, a UTF-8 string is always a string composed of bytes. ASCII characters (A-Z 0-9 etc) are encoded the way they have always been - in the range 0x00 to 0x7F (or 0 to 127). International characters (like ř) are encoded using multiple bytes in the range 0x80 to 0xFF - there is a very good explanation on wikipedia. The advantage is that it can be represented using ordinary C strings.
"This is an ordinary string, but also a UTF-8 string"
"This is a C cedilla in UTF-8: \xc3\x87"
However, if you are typing these international characters in to actual code, your editor needs to know that you are typing in UTF-8 so it can encode the characters correctly - like the C cedilla above. Then the string will be passed correctly to your function.
In your case, your comment indicates that you are using UTF-16. In which case there are two other issues:
The console will, by default, not output Unicode characters correctly. You need to change the font to a truetype font like Lucida Console
You also need to change the output mode to a Unicode UTF-16 one. You can do this with:
_setmode(_fileno(stdout), _O_U16TEXT);
Code example:
#include <iostream>
#include <io.h>
#include <fcntl.h>
int wmain(int argc, wchar_t* argv[])
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Přetečení zásobníku is Stack overflow in Czech." << std::endl;
}
Re your actual question
” what is [the L prefix] and how to add it to dynamic strings?
This is very different from the title of the question at the time I’m writing this, namely “How can I make dynamic strings to work with UTF-8 in console?”
In short, UTF-8 is an encoding of Unicode where the basic encoding unit is 8 bits, commonly called a byte (more precisely it's an octet), while the L prefix forms a wide character or string literal, where the encoding unit typically is 16 or 32 bits – in Windows it’s 16 bits, as in original Unicode.
A wide character or string literal is based on the wchar_t type instead of char.
In Windows a wide string is encoded as UTF-16. The most common sixty thousand or so Unicode characters are represented with single wchar_t values, but some seldom used Chinese ideograms etc. require two successive wchar_t values, called a surrogate pair.
The use of a 16-bit encoding unit in Windows was established around 1992. I am not sure when UTF-16 was adopted (as an extension of the then UCS-2 encoding); it was just a bit later. So this was established long before C99 required that all characters of the wide character set be representable with single wchar_t values. That requirement appears to have been a pure political maneuver, ensuring that no Windows C compiler could be formally conforming to a general ISO programming language standard, one that then applied only to Unix-land. Unfortunately, since C++11 was based on C99, we now have that also in C++11, ensuring that no Windows C++ compiler can be fully conforming. Pure idiocy, if you ask me.
Errata, re deleted text above: according to Wikipedia’s article about it the wording about a single wchar_t being sufficient for any character in the “extended character set” was there already in C90. Which makes the incompatibility between Windows and the C and C++ standards the fault of Microsoft, not the fault of the C committee. It still appears to be political and fairly idiotic, but (enlightened) with others to blame than I maintained at first…
One way to work with wide dynamic strings is to use std::wstring, from the <string> header.
With Visual C++ you can use a wmain function instead of standard main, as an easy way to get wide command line arguments.
wmain is also supported by MinGW64 (IIRC) g++, although not yet by ordinary MinGW g++ as of g++ 4.8.something. It is however easy to implement in terms of the Windows API, unless you require strictly standard-conforming code that provides the special main function features such as the ability to declare it with or without arguments. But hey, let's be practical about things.
Example that compiles fine with both Visual C++ 12.0 and g++ 4.8.2:
// Source encoding: UTF-8 with BOM.
#include <io.h> // _setmode
#include <fcntl.h> // _O_WTEXT
#include <iostream> // std::wcout, std::endl
#include <string> // std::wstring
using namespace std;
auto main()
    -> int
{
    _setmode( _fileno( stdin ), _O_WTEXT );
    _setmode( _fileno( stdout ), _O_WTEXT );

    wcout << L"Hi, what’s your name? ";
    wstring username;
    getline( wcin, username );
    wcout << L"Welcome to Windows C++, " << username << "!" << endl;
}
Note that with Windows ANSI source this won’t compile with g++ unless you specify the source encoding with the appropriate compiler option.

streams with default utf8 handling

I have read that in some environments std::string internally uses UTF-8. Whereas, on my platform, Windows, std::string is ASCII only. This behavior can be changed by using std::locale. My version of STL doesn't have, or at least I can't find, a UTF-8 facet for use with strings. I do however have a facet for use with the fstream set of classes.
Edit:
When I say "use UTF-8 internally", I'm referring to methods like std::basic_filebuf::open(), which in some environments accept UTF-8 encoded strings. I know this isn't really an std::string issue but rather some OS's use UTF-8 natively. My question should be read as "how does your implementation handle code conversion of invalid sequences?".
How do these streams handle invalid code sequences on other platforms/implementations?
In my UTF-8 facet for files, it simply returns an error, which in turn prevents any more of the stream from being read. I would have thought changing the error to the Unicode "invalid char" value 0xFFFD would be a better option.
My question isn't limited to UTF-8, how about invalid UTF-16 surrogate pairs?
Let's have an example. Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Or, a std::wstring and print it to std::cout, this time with a lone surrogate.
I have read that in some environments std::string internally uses UTF-8.
A C++ program can chose to use std::string to hold a UTF-8 string on any standard-compliant platform.
Whereas, on my platform, Windows, std::string is ASCII only.
That is not correct. On Windows you can use a std::string to hold a UTF-8 string if you want; std::string is not limited to holding ASCII on any standard-compliant platform.
This behavior can be changed by using std::locale.
No, the behaviour of std::string is not affected by the locale library.
A std::string is a sequence of chars. On most platforms, including Windows, a char is 8-bits. So you can use std::string to hold ASCII, Latin1, UTF-8 or any character encoding that uses an 8-bit or less code unit. std::string::length returns the number of code units so held, and the std::string::operator[] will return the ith code unit.
For holding UTF-16 you can use char16_t and std::u16string.
For holding UTF-32 you can use char32_t and std::u32string.
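A minimal sketch of that point: std::string stores the UTF-8 bytes as-is, and size()/operator[] count code units (bytes), not characters.
#include <iostream>
#include <string>

int main()
{
    std::string s = "ol\xc3\xa9";    // "olé" spelled out as UTF-8 bytes
    std::cout << s << '\n';          // shows olé on a UTF-8 terminal
    std::cout << s.size() << '\n';   // 4 code units, though only 3 characters
    std::cout << int(static_cast<unsigned char>(s[2])) << '\n'; // 195, the first byte of é
}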
Say you open a UTF-8 encoded file with a UTF-8 to wchar_t locale. How are invalid UTF-8 sequences handled by your implementation?
Typically no one bothers with converting to wchar_t or other wide char types on other platforms, but the standard facets that can be used for this all signal a read error that causes the stream to stop working until the error is cleared.
std::string should be encoding agnostic: http://en.cppreference.com/w/cpp/string/basic_string - so it should not validate codepoints/data - you should be able to store any binary data in it.
The only places where encoding really makes a difference are in calculating string length and iterating over the string character by character, and locale should have no effect in either of these cases.
Also, use of std::locale is probably not a good idea if it can be avoided at all: it's not thread-safe on all platforms or all implementations of the standard library, so care must be taken when using it. The effect of it is also very limited, and probably not at all what you expect it to be.

std::wstring VS std::string

I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have got the following questions:
When should I use std::wstring over std::string?
Can std::string hold the entire ASCII character set, including the special characters?
Is std::wstring supported by all popular C++ compilers?
What is exactly a "wide character"?
string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
What about Unicode, then?
The problem is that neither char nor wchar_t is directly tied to unicode.
On Linux?
Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:
#include <cstring>
#include <iostream>
int main()
{
const char text[] = "olé";
std::cout << "sizeof(char) : " << sizeof(char) << "\n";
std::cout << "text : " << text << "\n";
std::cout << "sizeof(text) : " << sizeof(text) << "\n";
std::cout << "strlen(text) : " << strlen(text) << "\n";
std::cout << "text(ordinals) :";
for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
{
unsigned char c = static_cast<unsigned_char>(text[i]);
std::cout << " " << static_cast<unsigned int>(c);
}
std::cout << "\n\n";
// - - -
const wchar_t wtext[] = L"olé" ;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
//std::cout << "wtext : " << wtext << "\n"; <- error
std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
std::wcout << L"wtext : " << wtext << "\n";
std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
std::cout << "wcslen(wtext) : " << wcslen(wtext) << "\n";
std::cout << "wtext(ordinals) :";
for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
{
unsigned short wc = static_cast<unsigned short>(wtext[i]);
std::cout << " " << static_cast<unsigned int>(wc);
}
std::cout << "\n\n";
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed by four chars: 110, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, so std::string is already unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.
On Windows?
On Windows, this is a bit different. Win32 had to support a lot of application working with char and on different charsets/codepages produced in all the world, before the advent of Unicode.
So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which could not be UTF-8 for a long time. For example, "olé" would be "olé" in a French-localized Windows, but would be something different on an cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode based applications, Windows uses wchar_t, which is 2-bytes wide, and is encoded in UTF-16, which is Unicode encoded on 2-bytes characters (or at the very least, UCS-2, which just lacks surrogate-pairs and thus characters outside the BMP (>= 64K)).
Applications using char are said "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said "widechar" (because each glyph is composed of one or two wchar_t. See MultiByteToWideChar and WideCharToMultiByte Win32 conversion API for more info.
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted in wchar_t when using API like SetWindowText() (low level API function to set the label on a Win32 GUI).
Memory issues?
UTF-32 is 4 bytes per characters, so there is no much to add, if only that a UTF-8 text and UTF-16 text will always use less or the same amount of memory than an UTF-32 text (and usually less).
If there is a memory issue, then you should know than for most western languages, UTF-8 text will use less memory than the same UTF-16 one.
Still, for other languages (chinese, japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.
All in all, UTF-16 will mostly use 2 and occassionally 4 bytes per characters (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?), while UTF-8 will spend from 1 to 4 bytes.
See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
Conclusion
When I should use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
On cross-platform code? Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise
Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!
On Linux? Yes.
On Windows? Only special characters available for the current locale of the Windows user.
Edit (After a comment from Johann Gerell):
a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:
ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
a char from 0 to 127 will be held correctly
a char from 128 to 255 will have a signification depending on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
Is std::wstring supported by almost all popular C++ compilers?
Mostly, with the exception of GCC based compilers that are ported to Windows.
It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
What is exactly a wide character?
On C/C++, it's a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to put inside characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).
I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.
My view is summarized in http://utf8everywhere.org of which I am a co-author.
Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.
And now, answering your questions:
A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how You treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. It's a common practice on Linux, but I think Windows programs should do it also.
No.
Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for "any part of the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.
So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatical conclusion is shockingly simple: all that C++ (and STL) "character encoding" stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.
My solution, after in-depth investigation, much frustration and the consequential experiences is the following:
accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)
use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)
accept that such an UTF8String object is just a dumb, but cheap container. Do never ever access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really just really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense... just use the ICU library for those).
use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) - this is a compromise, and a concession to the mess that the WIN32 API introduced). UCS-2 is sufficient for most of us (more on that later...).
use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.
add two utility functions to convert back & forth between UTF-8 and UCS-2:
UCS2String ConvertToUCS2( const UTF8String &str );
UTF8String ConvertToUTF8( const UCS2String &str );
The conversions are straightforward, google should help here ...
That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.
Alternatives & Improvements
conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.
if UCS-2 is not sufficient, than switch to UCS-4 (typedef std::basic_string<uint32_t> UCS2String)
ICU or other unicode libraries?
For advanced stuff.
When you want to have wide characters stored in your string. wide depends on the implementation. Visual C++ defaults to 16 bit if i remember correctly, while GCC defaults depending on the target. It's 32 bits long here. Please note wchar_t (wide character type) has nothing to do with unicode. It's merely guaranteed that it can store all the members of the largest character set that the implementation supports by its locales, and at least as long as char. You can store unicode strings fine into std::string using the utf-8 encoding too. But it won't understand the meaning of unicode code points. So str.size() won't give you the amount of logical characters in your string, but merely the amount of char or wchar_t elements stored in that string/wstring. For that reason, the gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.
If your wchar_t is 32 bits long, then you can use UTF-32 as the Unicode encoding inside a wstring, and store and handle Unicode strings using a fixed-length encoding (UTF-32 uses one code unit per code point). Your wstring's s.size() will then return the number of wchar_t elements, which equals the number of code points.
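A tiny illustration of that difference, assuming the source/execution character set is UTF-8 (typical with GCC on Linux) and a 32-bit wchar_t:

#include <iostream>
#include <string>

int main()
{
    std::string  utf8 = "中国a";    // stored as UTF-8: 3 + 3 + 1 = 7 bytes
    std::wstring wide = L"中国a";   // 3 wchar_t elements, one per code point

    std::cout << utf8.size() << "\n";   // prints 7 - char elements (bytes), not logical characters
    std::cout << wide.size() << "\n";   // prints 3 - matches the code point count
}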
Yes, char is always at least 8 bit long, which means it can store all ASCII values.
Yes, all major compilers support it.
I frequently use std::string to hold UTF-8 characters without any problems at all. I heartily recommend doing this when interfacing with APIs that use UTF-8 as their native string type as well.
For example, I use UTF-8 when interfacing my code with the Tcl interpreter.
The major caveat is that the length of the std::string is no longer the number of characters in the string.
A good question!
I think of a data encoding (sometimes a charset is also involved) as a mechanism for expressing data in memory so it can be saved to a file or transferred over a network, so I answer these questions as follows:
1. When should I use std::wstring over std::string?
If the programming platform or API function is single-byte, and we want to process or parse Unicode data, e.g. read from a Windows .REG file or a 2-byte network stream, we should declare a std::wstring variable to process it easily. For example: wstring ws = L"中国a" (6 octets of memory: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get the character '中', ws[1] to get '国', and ws[2] to get 'a', and so on.
2. Can std::string hold the entire ASCII character set, including the special characters?
Yes. But note: plain ASCII covers only the octets 0x00~0x7F; in extended single-byte charsets each octet in 0x00~0xFF stands for one character, including printable text such as "123abc&*_&" and the "special" ones, which are usually printed as '.' to avoid confusing editors or terminals. Some other countries extend their own "ASCII-like" charsets, e.g. Chinese, which uses 2 octets per character.
3. Is std::wstring supported by all popular C++ compilers?
Mostly, if not all of them. I have used VC++ 6 and GCC 3.3: yes.
4. What is exactly a "wide character"?
A wide character mostly means using 2 or 4 octets to hold characters from all countries. 2-octet UCS-2 is a representative example; e.g. the English letter 'a' is stored as the 2-octet value 0x0061 (versus 1 octet, 0x61, in ASCII).
When you want to store 'wide' (Unicode) characters.
Yes: 255 of them (excluding 0).
Yes.
Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html
There are some very good answers here, but I think there are a couple of things I can add regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux, basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it gets more complex. Here is why: Windows expects strings stored using chars to be encoded using the locale codepage. This is almost always the ASCII character set followed by 128 other special characters depending on your location. Let me just state that this applies not only when using the Windows API; there are three other major places where these strings interact with standard C++: string literals, output to std::cout using <<, and passing a filename to std::fstream.
I will be upfront here that I am a programmer, not a language specialist. I appreciate that UCS-2 and UTF-16 are not the same, but for my purposes they are close enough to be interchangeable, and I use them as such here. I'm not actually sure which Windows uses, but I generally don't need to know either. I've stated UCS-2 in this answer, so sorry in advance if my imprecision on this matter upsets anyone; I'm happy to change it if I have things wrong.
String literals
If you enter string literals that contain only characters that can be represented by your codepage, then VS stores them in your file encoded at 1 byte per character based on your codepage. Note that if you change your codepage or give your source to another developer using a different codepage, then I think (but haven't tested) that the characters will end up different. If you run your code on a computer using a different codepage, I'm not sure whether the characters will change too.
If you enter any string literals that cannot be represented by your codepage, then VS will ask you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all non-ASCII characters (including those which are on your codepage) will be represented by 2 or more bytes. This means if you give your source to someone else, the source will look the same. However, before passing the source to the compiler, VS converts the UTF-8 encoded text to codepage-encoded text, and any characters missing from the codepage are replaced with ?.
The only way to guarantee correctly representing a Unicode string literal in VS is to precede the string literal with an L making it a wide string literal. In this case VS will convert the UTF-8 encoded text from the file into UCS2. You then need to pass this string literal into a std::wstring constructor or you need to convert it to utf-8 and put it in a std::string. Or if you want you can use the Windows API functions to encode it using your code page to put it in a std::string, but then you may as well have not used a wide string literal.
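For the conversion route, something along these lines should work (a Windows-only sketch; WideToUtf8 is just a made-up helper name wrapping the Win32 WideCharToMultiByte call):

#include <string>
#include <Windows.h>

std::string WideToUtf8(const std::wstring &wide)
{
    if (wide.empty())
        return std::string();
    // First call asks how many UTF-8 bytes are needed.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(),
                                    static_cast<int>(wide.size()),
                                    NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return std::string();
    std::string utf8(bytes, '\0');
    // Second call performs the actual conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), static_cast<int>(wide.size()),
                        &utf8[0], bytes, NULL, NULL);
    return utf8;
}

// Usage: std::string s = WideToUtf8(L"schöne Grüße");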
std::cout
When outputting to the console using << you can only use std::string, not std::wstring, and the text must be encoded using your locale codepage. If you have a std::wstring then you must convert it using one of the Windows API functions, and any characters not on your codepage get replaced by ? (maybe you can change the replacement character, I can't remember).
std::fstream filenames
The Windows OS uses UCS-2/UTF-16 for its filenames, so whatever your codepage, you can have files with any Unicode character in their names. But this means that to access or create files with characters not on your codepage, you must use std::wstring. There is no other way. This is a Microsoft-specific extension to std::fstream, so it probably won't compile on other systems. If you use std::string then you can only use filenames that include only characters on your codepage.
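For example, something like this (relying on the Microsoft-specific wide-filename overload mentioned above, so it is Windows-only):

#include <fstream>
#include <string>

int main()
{
    std::wstring fileName = L"日本語ファイル.txt";   // any Unicode filename, regardless of codepage
    std::ofstream fout;
    fout.open(fileName.c_str());                      // wide overload: MSVC extension, not portable
    fout << "hello\n";
    return 0;
}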
Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8 std::string everywhere.
If you are just working on Windows, just use UCS-2 std::wstring everywhere. Some purists may say use UTF-8 and then convert when needed, but why bother with the hassle?
If you are cross-platform, then it's a mess, to be frank. If you try to use UTF-8 everywhere on Windows, then you need to be really careful with your string literals and output to the console; you can easily corrupt your strings there. If you use std::wstring everywhere on Linux, then you may not have access to the wide version of std::fstream, so you have to do the conversion, but there is no risk of corruption. So personally I think this is the better option. Many would disagree, but I'm not alone - it's the path taken by wxWidgets, for example.
Another option could be to typedef unicodestring as std::string on Linux and std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and nothing on Linux, then the code
#include <fstream>
#include <string>
#include <iostream>

#ifdef _WIN32
#include <Windows.h>
typedef std::wstring unicodestring;
#define UNI(text) L ## text

std::string formatForConsole(const unicodestring &str)
{
    // Convert the UTF-16 string to the console's locale codepage;
    // characters that cannot be represented are replaced with the
    // codepage's default character (usually '?').
    if (str.empty())
        return std::string();
    int len = WideCharToMultiByte(CP_ACP, 0, str.c_str(),
                                  static_cast<int>(str.size()),
                                  NULL, 0, NULL, NULL);
    if (len <= 0)
        return std::string();
    std::string result(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, str.c_str(), static_cast<int>(str.size()),
                        &result[0], len, NULL, NULL);
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text

std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif

int main()
{
    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}
would be fine on either platform I think.
Answers
So, to answer your questions:
1) If you are programming for Windows, then all the time; if cross-platform, then maybe all the time, unless you want to deal with possible corruption issues on Windows or write code with platform-specific #ifdefs to work around the differences; if just using Linux, then never.
2) Yes. In addition, on Linux you can use it for all of Unicode too. On Windows you can only use it for all of Unicode if you choose to manually encode using UTF-8. But the Windows API and the standard C++ classes will expect the std::string to be encoded using the locale codepage. This includes all of ASCII plus another 128 characters which change depending on the codepage your computer is set up to use.
3) I believe so, but if not, it is just a simple typedef of 'std::basic_string' using wchar_t instead of char.
4) A wide character is a character type which is bigger than the 1-byte standard char type. On Windows it is 2 bytes; on Linux it is 4 bytes.
Applications that are not satisfied with only 256 different characters have the options of either using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding in C++ terminology) such as UTF-8. Wide characters generally require more space than a variable-length encoding, but are faster to process. Multi-language applications that process large amounts of text usually use wide characters when processing the text, but convert it to UTF-8 when storing it to disk.
The only difference between a string and a wstring is the data type of the characters they store. A string stores chars whose size is guaranteed to be at least 8 bits, so you can use strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about the character set or encoding.
Practically every compiler uses a character set whose first 128 characters correspond with ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be aware of when using strings in UTF-8 or some other variable-length encoding, is that the indices and lengths are measured in bytes, not characters.
The character type of a wstring is wchar_t, whose size is not defined in the standard, except that it has to be at least as large as a char; in practice it is usually 16 or 32 bits. wstring can be used for processing text in the implementation-defined wide-character encoding. Because the encoding is not defined in the standard, it is not straightforward to convert between strings and wstrings. One cannot assume wstrings to have a fixed-length encoding either.
If you don't need multi-language support, you might be fine with using only regular strings. On the other hand, if you're writing a graphical application, it is often the case that the API supports only wide characters. Then you probably want to use the same wide characters when processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you cannot assume length() to return the number of characters. If the API uses a fixed-length encoding, such as UCS-2, processing becomes easy. Converting between wide characters and UTF-8 is difficult to do in a portable way, but then again, your user interface API probably supports the conversion.
When you want to use Unicode strings and not just ASCII; helpful for internationalisation.
Yes, but it doesn't play well with 0.
Not aware of any that don't.
A wide character is the compiler-specific way of handling the fixed-length representation of a Unicode character; for MSVC it is a 2-byte character, for GCC I understand it is 4 bytes. And a +1 for http://www.joelonsoftware.com/articles/Unicode.html
If you want to keep strings portable, you can use tstring and tchar. It is a widely used technique from long ago. In this sample I use a self-defined _TCHAR, but you can find a tchar.h implementation for Linux on the internet.
The idea is that wstring/wchar_t/UTF-16 is used on Windows and string/char/UTF-8 (or ASCII...) is used on Linux.
In the sample below, searching a mixed English/Japanese multibyte string works well on both Windows and Linux.
#include <locale.h>
#include <stdio.h>
#include <algorithm>
#include <string>
using namespace std;

#ifdef _WIN32
#include <tchar.h>          // _TCHAR, _T(), _tprintf map to wide versions when _UNICODE is defined
#else
#define _TCHAR char         // on Linux, fall back to plain char / UTF-8
#define _T
#define _tprintf printf
#endif

#define tstring basic_string<_TCHAR>

int main() {
    setlocale(LC_ALL, "");                // pick up the environment's locale
    tstring s = _T("abcあいうえおxyz");
    auto pos = s.find(_T("え"));          // find the Japanese character え
    auto r = s.substr(pos);               // everything from え to the end of the string
    _tprintf(_T("r=%s\n"), r.c_str());
}
1) As mentioned by Greg, wstring is helpful for internationalization; that's when you will be releasing your product in languages other than English.
4) Check this out for wide character
http://en.wikipedia.org/wiki/Wide_character
When should you NOT use wide-characters?
When you're writing code before the year 1990.
Obviously, I'm being flip, but really, it's the 21st century now. 128 characters have long since ceased to be sufficient. Yes, you can use UTF-8, but why bother with the headaches?