Not being able to wrap my head around this one is a real source of shame...
I'm working with a French version of Visual Studio (2008), in a French Windows (XP). French accents put in strings sent to the output window get corrupted. Ditto input from the output window. Typical character encoding issue, I enter ANSI, get UTF-8 in return, or something to that effect. What setting can ensure that the characters remain in ANSI when showing a "hardcoded" string to the output window?
EDIT:
Example:
#include <iostream>
int main()
{
    std::cout << "àéêù" << std::endl;
    return 0;
}
Will show in the output:
óúÛ¨
(here encoded as HTML for your viewing pleasure)
I would really like it to show:
àéêù
Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in 2.2 which character sets are valid in source code. There isn't much in there, and all the characters used are in ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US locale machine).
To start with, you have 4 chars on your cout line and 4 glyphs on the output. So the issue is not one of UTF-8 encoding, as that would combine multiple source chars into fewer glyphs.
From your source string to the display on the console, all of these things play a part:
1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
2. What your compiler does with a string literal, and what source encoding it understands
3. How your << interprets the encoded string you're passing in
4. What encoding the console expects
5. How the console translates that output to a font glyph
Now...
1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in and decodes it to its internal representation. It then generates the data chunk for the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details on, or control over, this.
3 is even easier. Except for control codes, << just passes the data down for char *.
4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also find out which one you have with GetConsoleOutputCP (the input is controlled separately, through SetConsoleCP).
5 is a funny one. I banged my head to figure out why I could not get the é to show up properly, using CP1252 (western european, windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (capital Theta, the same I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a true type font).
Some interesting things I learned looking at this:
The encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0).
VC is picking a codepage for the string literals that I do not seem to control.
Controlling what the console shows is more painful than I was expecting.
So... what does this mean for you? Here are bits of advice:
Don't use non-ASCII characters in string literals. Use resources, where you control the encoding.
Make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send.
If you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. const char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] does show 233 for me, which happens to be the encoding in CP1252.
BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.
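To make that CP850 reading concrete, here is a minimal sketch. The helper names byte_values and as_cp850 are made up for this illustration, and the table covers only the four byte values from the question, not the whole code page. It dumps the CP1252 bytes of "àéêù" as integers and then looks up the glyph each byte selects in CP850.

```cpp
#include <map>
#include <string>
#include <vector>

// Numeric value of every byte in a narrow string (hypothetical helper).
std::vector<unsigned> byte_values(const std::string& s) {
    std::vector<unsigned> out;
    for (unsigned char c : s) out.push_back(c);
    return out;
}

// Partial CP850 lookup: only the four byte values discussed here.
// The mapped values are the UTF-8 spellings of the CP850 glyphs.
std::string as_cp850(const std::string& bytes) {
    static const std::map<unsigned char, std::string> table = {
        {0xE0, "\xC3\x93"},  // Ó (U+00D3)
        {0xE9, "\xC3\x9A"},  // Ú (U+00DA)
        {0xEA, "\xC3\x9B"},  // Û (U+00DB)
        {0xF9, "\xC2\xA8"},  // ¨ (U+00A8)
    };
    std::string out;
    for (unsigned char c : bytes) {
        auto it = table.find(c);
        out += (it != table.end()) ? it->second : std::string("?");
    }
    return out;
}
```

Feeding it "\xE0\xE9\xEA\xF9" (àéêù in CP1252) gives the byte values 224, 233, 234, 249 and the CP850 view ÓÚÛ¨.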
Because I was requested to, I’ll do some necromancy. The other answers were from 2009, but this article still came up on a search I did in 2018. The situation today is very different. Also, the accepted answer was incomplete even back in 2009.
The Source Character Set
Every compiler (including Microsoft’s Visual Studio 2008 and later, gcc, clang and icc) will read UTF-8 source files that start with BOM without a problem, and clang will not read anything but UTF-8, so UTF-8 with a BOM is the lowest common denominator for C and C++ source files.
The language standard doesn’t say what source character sets the compiler needs to support. Some real-world source files are even saved in a character set incompatible with ASCII. Microsoft Visual C++ in 2008 supported UTF-8 source files with a byte order mark, as well as both forms of UTF-16. Without a byte order mark, it would assume the file was encoded in the current 8-bit code page, which was always a superset of ASCII.
The Execution Character Sets
Visual Studio 2015 Update 2 added a /utf-8 switch to CL.EXE. Today, it also supports the /source-charset and /execution-charset switches, as well as /validate-charset to detect if your file is not actually UTF-8. This page on MSDN has a link to the documentation on Unicode support for every version of Visual C++.
Current versions of the C++ standard say the compiler must have both an execution character set, which determines the numeric value of character constants like 'a', and an execution wide-character set that determines the value of wide-character constants like L'é'.
To language-lawyer for a bit, there are very few requirements in the standard for how these must be encoded, and yet Visual C and C++ manage to break them. The basic execution character set must contain about 100 characters that cannot have negative values, and the encodings of the digits '0' through '9' must be consecutive. Neither capital nor lowercase letters have to be, because they weren’t on some old mainframes. (That is, '0'+9 must be the same as '9', but there is still a compiler in real-world use today whose default behavior is that 'a'+9 is not 'j' but '«', and this is legal.) The wide-character execution set must include the basic execution set and have enough bits to hold all the characters of any supported locale. Every mainstream compiler supports at least one Unicode locale and understands valid Unicode characters specified with \Uxxxxxxxx, but a compiler that didn’t could claim to be complying with the standard.
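The practical upshot of the digit guarantee is that arithmetic on '0' through '9' is portable while arithmetic on letters is not. A tiny sketch (digit_value is a name made up here):

```cpp
// Converts a decimal digit character to its numeric value, or -1 otherwise.
// Relies only on the guarantee that '0'..'9' have consecutive encodings.
int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

The same trick with 'a' through 'z' would only be correct on character sets where the letters happen to be contiguous.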
The way Visual C and C++ violate the language standard is by making their wchar_t UTF-16, which can only represent some characters as surrogate pairs, when the standard says wchar_t must be a fixed-width encoding. This is because Microsoft defined wchar_t as 16 bits wide back in the 1990s, before the Unicode committee figured out that 16 bits were not going to be enough for the entire world, and Microsoft was not going to break the Windows API. It does support the standard char32_t type as well.
UTF-8 String Literals
The third issue this question raises is how to get the compiler to encode a string literal as UTF-8 in memory. You’ve been able to write something like this since C++11:
constexpr unsigned char hola_utf8[] = u8"¡Hola, mundo!";
This will encode the string as its null-terminated UTF-8 byte representation regardless of whether the source character set is UTF-8, UTF-16, Latin-1, CP1252, or even IBM EBCDIC 1047 (which is a silly theoretical example but still, for backward-compatibility, the default on IBM’s Z-series mainframe compiler). That is, it’s equivalent to initializing the array with { 0xC2, 0xA1, 'H', /* ... , */ '!', 0 }.
If it would be too inconvenient to type a character in, or if you want to distinguish between superficially-identical characters such as space and non-breaking space or precomposed and combining characters, you also have universal character escapes:
constexpr unsigned char hola_utf8[] = u8"\u00a1Hola, mundo!";
You can use these regardless of the source character set and regardless of whether you’re storing the literal as UTF-8, UTF-16 or UCS-4. They were originally added in C99, but Microsoft supported them in Visual Studio 2015.
Edit: As reported by Matthew, u8" strings are buggy in some versions of MSVC, including 19.14. It turns out, so are literal non-ASCII characters, even if you specify /utf-8 or /source-charset:utf-8 /execution-charset:utf-8. The sample code above works properly in 19.22.27905.
There is another way to do this that worked in Visual C or C++ 2008, however: octal and hexadecimal escape codes. You would have encoded UTF-8 literals in that version of the compiler with:
const unsigned char hola_utf8[] = "\xC2\xA1Hello, world!";
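If you would rather compute those escape bytes than look them up, the following is a minimal sketch of a UTF-8 encoder for one code point. The name encode_utf8 is an assumption for this example, and it performs no validation (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Sketch of a UTF-8 encoder for a single code point.
// Assumes cp is a valid Unicode scalar value; no range checks are done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00A1 (¡) this yields the two bytes 0xC2 0xA1, which are the bytes used in the hexadecimal escapes above.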
Try this:
#include <iostream>
#include <locale>
int main()
{
    std::locale::global(std::locale(""));
    std::cout << "àéêù" << std::endl;
    return 0;
}
Using _setmode() works and is arguably better than changing the codepage or setting a locale, since it'll actually make your program output in Unicode and thus will be consistent - no matter which codepage or locale are currently set.
Example:
#include <iostream>
#include <io.h>
#include <fcntl.h>
int wmain()
{
    _setmode( _fileno(stdout), _O_U16TEXT );
    std::wcout << L"àéêù" << std::endl;
    return 0;
}
Inside Visual Studio, make sure you set up your project for Unicode (right-click the project -> Properties -> General -> Character Set = Use Unicode Character Set).
MinGW users:
Define both UNICODE and _UNICODE.
Add -finput-charset=iso-8859-1 to the compiler options to get around this error: "converting to execution character set: Invalid argument".
Add -municode to the linker options to get around "undefined reference to `WinMain@16'".
Edit: The equivalent call to set Unicode input is: _setmode( _fileno(stdin), _O_U16TEXT );
Edit 2: An important piece of information, especially considering that the question uses std::cout, which is not supported here. The MSDN docs state (emphasis mine):
Unicode mode is for wide print functions (for example, wprintf) and is
not supported for narrow print functions. Use of a narrow print
function on a Unicode mode stream triggers an assert.
So, don't use std::cout when the console output mode is _O_U16TEXT; similarly, don't use std::cin when the console input is _O_U16TEXT. You must use the wide version of these facilities (std::wcout, std::wcin).
And do note that mixing cout and wcout in the same output is not allowed (but I find it works if you call flush() and then _setmode() before switching between the narrow and wide operations).
I tried this code:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
int main()
{
    std::wstringstream wss;
    wss << L"àéêù";
    std::wstring s = wss.str();
    const wchar_t* p = s.c_str();
    std::wcout << wss.str() << std::endl;
    std::wofstream file("C:\\a.txt");
    file << p << std::endl;
    return 0;
}
The debugger showed that wss, s and p all had the expected values (i.e. "àéêù"), as did the output file. However, what appeared in the console was óúÛ¨.
The problem is therefore in the Visual Studio console, not the C++. Using Bahbar's excellent answer, I added:
SetConsoleOutputCP(1252);
as the first line, and the console output then appeared as it should.
// Save as Windows-1252
#include <iostream>
#include <windows.h>
int main()
{
    SetConsoleOutputCP(1252);
    std::cout << "àéêù" << std::endl;
}
Visual Studio does not support UTF-8 for C++, but partially supports it for C:
// Save as UTF-8 without signature
#include <stdio.h>
#include <windows.h>
int main()
{
    SetConsoleOutputCP(65001);
    printf("àéêù\n");
}
Make sure you do not forget to change the console's font to Lucida Console as mentioned by Bahbar: it was crucial in my case (French Win 7 64-bit with VC 2012).
Then, as mentioned by others, use SetConsoleOutputCP(1252) for C++. It may fail depending on the available code pages, so you might want to call GetConsoleOutputCP() to check that it worked, or at least check that SetConsoleOutputCP(1252) returns a nonzero value (zero means failure). Changing the global locale also works (for some reason there is no need to do cout.imbue(locale());), but it may break some libraries!
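A minimal sketch of that check, assuming a Windows build. The wrapper name set_console_output_cp_checked and the kBuiltForWindows constant are invented for this example; SetConsoleOutputCP and GetConsoleOutputCP are the real Win32 calls. On other platforms the wrapper simply reports failure so the snippet still compiles.

```cpp
#ifdef _WIN32
#include <windows.h>
constexpr bool kBuiltForWindows = true;
#else
constexpr bool kBuiltForWindows = false;  // the console calls below are Windows-only
#endif

// Hypothetical wrapper: switch the console output code page and verify it.
// Returns false on failure, and always false when not built for Windows.
bool set_console_output_cp_checked(unsigned code_page) {
#ifdef _WIN32
    if (SetConsoleOutputCP(code_page) == 0)    // zero means the call failed
        return false;
    return GetConsoleOutputCP() == code_page;  // confirm the switch took effect
#else
    (void)code_page;
    return false;
#endif
}
```

A program would call set_console_output_cp_checked(1252) once at startup and fall back to plain ASCII output if it returns false.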
In C, SetConsoleOutputCP(65001); or the locale-based approach worked for me once I had saved the source code as UTF8 without signature (scroll down, the sans-signature choice is way below in the list of pages).
Input using SetConsoleCP(65001); failed for me, apparently due to a bad implementation of code page 65001 in Windows. The locale approach failed too, both in C and C++. A more involved solution, relying on wchar_t rather than native chars, seems to be required.
I had the same problem with Chinese input. My source code is UTF-8 and I added /utf-8 to the compiler options. It works fine with C++ wide strings and wide chars, but not with narrow strings and chars, which show up as garbled characters in the Visual Studio 2019 debugger and in my SQL database. I have to use narrow characters because I convert to SQLAPI++'s SAString. Eventually, I found that checking the following option (Control Panel -> Region -> Administrative -> Change system locale) resolves the issue. I know it is not an ideal solution, but it does help me.
In Visual Studio, choose File -> Save yourSource.cpp As.
A dialog will pop up asking whether you want to replace the existing file; choose Yes.
Then this dialog pops up: select UTF-8 with signature. This solved my gibberish output problem both on the console and in files.
This also agrees with @Davislor's answer:
UTF-8 with a BOM is the lowest common denominator for C and C++ source files
General question
Is there a possibility to avoid character set conversion when writing to std::cout / std::cerr?
I do something like
std::cout << "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" << std::endl;
And I want the output to be written to the console maintaining the UTF-8 encoding (my console uses UTF-8 encoding, but my C++ Standard Library, GNUs libstdc++, doesn't think so for some reason).
If there's no possibility to forbid character encoding conversion: Can I set std::cout to use UTF-8, so it hopefully figures out itself that no conversion is needed?
Background
I used the Windows API function SetConsoleOutputCP(CP_UTF8); to set my console's encoding to UTF-8.
The problem seems to be that UTF-8 does not match the code page typically used for my system's locale, and libstdc++ therefore sets up std::cout with the default ANSI code page instead of correctly recognizing the switch.
Edit: Turns out I misinterpreted the issue and the solution is actually a lot simpler (or not...).
The "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" was just meant as a placeholder (and I shouldn't have used it as it has hidden the actual issue).
In my real code the "UTF-8 string" is a Glib::ustring, and those are by definition UTF-8 encoded.
However I did not realize that the output operator << was defined in glibmm in a way that forces character set conversion.
It uses g_locale_from_utf8() internally which in turn uses g_get_charset() to determine the target encoding.
Unfortunately the documentation for g_get_charset() states
On Windows the character set returned by this function is the so-called system default ANSI code-page. That is the character set used by the "narrow" versions of C library and Win32 functions that handle file names. It might be different from the character set used by the C library's current locale.
which simply means that glib will neither care for the C locale I set nor will it attempt to determine the encoding my console actually uses and basically makes it impossible to use many glib functions to create UTF-8 output. (As a matter of fact this also means that this issue has the exact same cause as the issue that triggered my other question: Force UTF-8 encoding in glib's "g_print()").
I'm currently considering this a bug in glib (or a serious limitation at best) and will probably open a report in the issue tracker for it.
You are looking at the wrong side: you are talking about a string literal included in your source code (not input from your keyboard), and for that to work properly you have to tell the compiler which encoding is used for those characters (I think the first C++ spec that mentions non-ASCII charsets is C++11).
As you are actually using Unicode characters, you would have to encode all of them in at least a wchar_t for them to be treated as such, or to agree with the translator (probably this is what happens) that Unicode characters will be UTF-8 encoded when used in string literals. This will commonly mean that they are printed as UTF-8 and, if you use a UTF-8 compliant console device, they will print fine without any other problem.
I know there's a gcc option to specify the encoding used in string literals for a source file, and there should be another in clang as well. Check the documentation; this will probably solve any issues. But the best way to be portable is not to depend on the codeset, or to use one like ISO-10646 (but be aware that UTF-8 is not all of Unicode: it is only one way to encode Unicode characters).
Another issue is that C++11 doesn't refer to the Unicode Consortium standard but to its ISO counterpart (ISO-10646, I think). The two are similar but not equal, and the character encodings are similar but not equal (the code size of the ISO standard is 32 bits while the Unicode Consortium's is 21 bits, for example). These and other differences between them require some tricks in C++ and cause problems when one is thinking in strict Unicode.
Of course, to output correct strings on a UTF-8 terminal, you have to encode Unicode code points as UTF-8 before sending them to the terminal. This is true even if you already have them encoded as UTF-8 in a string object. If you say they are already UTF-8, then no conversion is made at all. If you don't say so, the usual assumption is that the bytes are plain Unicode code points (limited to 8-bit values), and they get encoded to UTF-8 before printing. This leads to encoding errors (double encoding): something like ú (Unicode code point \u00fa) should be encoded in UTF-8 as the byte sequence { 0xc3, 0xba }, but if you don't say the string literal is already in UTF-8, both bytes will be handled as the character codes for Ã (\u00c3) and º (\u00ba), and will be re-encoded as { 0xc3, 0x83, 0xc2, 0xba }, which displays incorrectly. This is a very common error, and you have probably seen it wherever some encoding was done incorrectly. Source for the samples here.
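The double-encoding effect can be reproduced in a few lines. This is only a sketch: latin1_to_utf8 is a name made up here, and it models "treat every byte as an 8-bit code point and encode it as UTF-8", which is an assumption about what the misbehaving layer does.

```cpp
#include <string>

// Treats every input byte as a Latin-1 (ISO-8859-1) code point
// and encodes it as UTF-8. Applying it to text that is already
// UTF-8 produces the classic double-encoded result.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Calling it once on the single byte 0xFA (ú in Latin-1) gives { 0xc3, 0xba }; calling it again on that result gives { 0xc3, 0x83, 0xc2, 0xba }.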
I know that only positive character ASCII values are guaranteed cross platform support.
In Visual Studio 2015, I can do:
cout << '\xBA';
And it prints:
║
When I try that on http://ideone.com, nothing is printed.
If I try to directly print this using the literal character:
cout << '║';
Visual Studio gives the warning:
warning C4566: character represented by universal-character-name '\u2551' cannot be represented in the current code page (1252)
And then prints:
?
When this command is run on http://ideone.com I get:
14849425
I've read that wchars may provide a cross platform approach to this. Is that true? Or am I simply out of luck on extended ASCII?
There are two separate concepts in play here.
The first one is one of a locale, which is often called "code page" in Microsoft-ese. A locale defines which visual characters are represented by which byte sequence. In your first example, whatever locale your program gets executed as, it shows the "║" character, in response to the byte 0xBA.
Other locales, or code pages, will display different characters for the same bytes. Many locales are multibyte locales, where it can take several bytes to display a single character. In the UTF-8 locale, for example, the same character, ║, takes three bytes to display: 0xE2 0x95 0x91.
The second concept here is one of the source code character set, which comes from the locale in which the source code is edited, before it gets compiled. When you enter the ║ character in your source code, it may get represented, I suppose, either as the 0xBA character, or maybe 0xE2 0x95 0x91 sequence, if your editor uses the UTF-8 locale. The compiler, when it reads the source code, just sees the actual byte sequence. Everything gets reduced to bytes.
Fortunately, all C++ keywords use US-ASCII, so it doesn't matter what character set is used to write C++ code, until you start using non-Latin characters. Those result in a compiler warning informing you, basically, that you're using stuff that may or may not work, depending on the eventual locale the resulting program runs in.
First, your input source file has its own encoding. Your compiler needs to be able to read this encoding (maybe with the help of flags/settings).
With a simple string, the compiler is free to do what it wants, but it must yield a const char[]. Usually, the compiler keeps the source encoding when it can, so the string stored in your program will have the encoding of your input file. There are cases when the compiler will do a conversion, for example if your file is UTF-16 (you can't fit UTF-16 characters in chars).
When you use '\xBA', you write a raw character, and you chose yourself your encoding, so there is no encoding from the compiler.
When you use '║', the type of '║' is not necessarily char. If the character is not representable as a single byte in the compiler character set, its type will be int. In the case of Visual Studio with the Windows-1252 source file, '║' doesn't fit, so it will be of type int and printed as such by cout <<.
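The 14849425 seen on ideone fits this picture: it is 0xE29591, the three UTF-8 bytes of '║' packed into one int. A sketch, assuming the compiler packs the bytes of a multi-character literal with the first byte most significant (the value of such literals is implementation-defined, so this describes one plausible compiler behavior, not a guarantee; pack_bytes is a made-up name):

```cpp
#include <string>

// Packs the bytes of a short string into one integer,
// with the first byte ending up most significant.
unsigned long pack_bytes(const std::string& bytes) {
    unsigned long value = 0;
    for (unsigned char b : bytes)
        value = (value << 8) | b;
    return value;
}
```

pack_bytes("\xE2\x95\x91") evaluates to 14849425.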
You can force an encoding with prefixes on string literals. u8"" will force UTF-8, u"" UTF-16 and U"" UTF-32. Note that the L"" prefix will give you a wide char wchar_t string, but its encoding is still implementation dependent. Wide chars on Windows are 2 bytes each (UCS-2, or UTF-16 in practice), but UTF-32 (4 bytes per char) on Linux.
Printing to the console only depends on the type of the variable. cout << is overloaded for all common types, so what it does depends on the type. cout << will usually feed char strings as is to the console (actually stdout), and wcout << will usually feed wchar_t strings as is. Other combinations may involve conversions or interpretations (like feeding an int). UTF-8 strings are char strings, so cout << should always feed them correctly.
Next, there is the console itself. A console is a totally independent piece of software. You feed it some bytes, it displays them. It doesn't care one bit about your program. It uses its own encoding, and tries to print the bytes you fed it using this encoding.
The default console encoding on Windows is Code page 850 (not sure if it is always the case). In your case, your file is CP 1252 and your console is CP 850, which is why you can't print '║' directly (CP 1252 doesn't contain '║'), but you can using a raw character. You can change the console output encoding on Windows with SetConsoleOutputCP() (SetConsoleCP() changes the input encoding).
On Linux, the default encoding is UTF-8, which is more convenient because it supports the whole Unicode range. Ideone uses Linux, so it will use UTF-8. Note that there is the added layer of HTTP and HTML, but they also use UTF-8 for that.
I am writing a program that outputs Chinese characters, using Dev C++.
I've added
-finput-charset=big5
-fexec-charset=big5
to the compiler parameters. I also set the code page of the console to 950 (Traditional Chinese).
It works perfectly with a simple cout like this:
cout << "中文字";
But when it comes to character arrays, it goes wrong, as expected:
char chin[] = "中文字";
cout << chin[0]; // outputs nothing
cout << chin[0] << chin[1]; // outputs the first Chinese character, as one Chinese character occupies 2 bytes
So I decided to use wchar_t instead, and I have to use wcout with wchar_t or else a number will be shown.
However, wcout shows nothing in the console. All of the lines below show nothing:
wcout << L"中文字";
wchar_t chin2[] = L"中文字";
wcout << chin2[0];
What did I miss in using wchar_t to output Chinese (or other East Asian) characters? I really don't want to write 2 array members to show one single Chinese character.
There are subtle problems going on here.
The C++ compiler does not understand Big5 encoding. When you create a source code file and display it, you may see your familiar Chinese characters but the compiler sees a string of bytes. Big5 is a double byte charset so each input character will be represented by 2 bytes inside the compiler.
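To see what "a string of bytes" means for indexing, here is a sketch that walks a Big5 byte string one character at a time. The helper name split_big5 is made up, and the rule it uses (a lead byte in 0x81..0xFE starts a two-byte character, anything else is a single byte) is an assumption that holds for well-formed code page 950 text; it does not validate trail bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a Big5 byte string into characters, assuming well-formed input:
// a byte in 0x81..0xFE starts a two-byte character, anything else stands alone.
std::vector<std::string> split_big5(const std::string& bytes) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const bool two_bytes = lead >= 0x81 && lead <= 0xFE && i + 1 < bytes.size();
        const std::size_t len = two_bytes ? 2 : 1;
        chars.push_back(bytes.substr(i, len));  // one whole character
        i += len;
    }
    return chars;
}
```

Each element of the result can be sent to cout as a unit, which is what printing chin[0] and chin[1] together achieves by hand.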
When that string of bytes is fed to a suitable output device the Chinese characters appear again. Code page 950 is compatible with Big5 so you see the "right" thing. But then you try to build on this and confusion is the result. Your second code sample uses L"" strings, but I expect those strings will contain half a character in each short.
The only "safe" character set you can use is Unicode. Windows internals were historically UCS-2 (a char is a single short) but are now theoretically UTF-16 (a char is a short, but a character may take a multi-unit sequence). Not all existing software and older APIs fully support UTF-16 (or need to). Windows has very limited support for UTF-8 or other encodings. Everything gets converted into Unicode, so best to just leave it that way.
In practice, you should build your C++ code with Unicode settings, for UCS-2, and exercise caution if you need characters that would require multibyte sequences. You should ensure that any source code you write and any input text files are identified as whatever encoding they need to be, but are translated into Unicode internally. Leave your console as the default Unicode encoding, and everything will just work.
It is almost impossible to sensibly use Big5 as an internal encoding in a Windows program. Best not to try.
Most answers and questions here on SO put L before any UTF-8 string. I found no explanation of what it is; according to my IDE, the constant is defined in winnt.h.
This is how I use it, without knowing what it is:
std::wcout<<L"\"Přetečení zásobníku\" is Stack overflow in Czech.";
Obviously, constant concatenation cannot be applied to variables:
void printUTF8(const char* str) {
    // Does not make the slightest bit of sense
    std::wcout << L str;
}
So what is it, and how do I add it to dynamic strings?
L"" is a WIDE string. That is to say, it's an array of wchar_t. UTF-8 strings can't be wide, since they are multi-byte (variable length). VC++ is slightly wrong and made wide strings variable length, UTF-16 to be precise. But usually they're UTF-32.
The problem with multi-byte strings is that there are many different encodings, and UTF-8 is only one of them. Windows does not in fact natively support UTF-8 encodings. MessageBoxA() for instance can use any encoding but UTF-8. There's just one exception to that, MultiByteToWideChar(CP_UTF8, ...), which is what you'd need here.
L is an indication to the C compiler that the string is composed of "wide characters". In Windows, these would be UTF-16 - each character that you put in the string is 16 bits, or two bytes, wide:
L"This is a wide string"
In contrast, a UTF-8 string is always a string composed of bytes. ASCII characters (A-Z, 0-9 etc.) are encoded the way they have always been, in the range 0x00 to 0x7F (or 0 to 127). International characters (like ř) are encoded using multiple bytes in the range 0x80 to 0xFF - there is a very good explanation on Wikipedia. The advantage is that it can be represented using ordinary C strings.
"This is an ordinary string, but also a UTF-8 string"
"This is a C cedilla in UTF-8: \xc3\x87"
However, if you are typing these international characters in to actual code, your editor needs to know that you are typing in UTF-8 so it can encode the characters correctly - like the C cedilla above. Then the string will be passed correctly to your function.
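One consequence of the byte-based encoding is that the byte count of an ordinary C string and its character count differ as soon as non-ASCII characters appear. A small sketch (utf8_length is a made-up name; it assumes the input is valid UTF-8 and counts code points, not user-perceived characters):

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes valid UTF-8 input.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For the C cedilla above, std::string("\xc3\x87").size() is 2 while utf8_length reports 1.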
In your case, your comment indicates that you are using UTF-16. In which case there are two other issues:
The console will, by default, not output Unicode characters correctly. You need to change the font to a truetype font like Lucida Console
You also need to change the output mode to a Unicode UTF-16 one. You can do this with:
_setmode(_fileno(stdout), _O_U16TEXT);
Code example:
#include <iostream>
#include <io.h>
#include <fcntl.h>
int wmain(int argc, wchar_t* argv[])
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Přetečení zásobníku is Stack overflow in Czech." << std::endl;
}
Re your actual question
“what is [the L prefix] and how to add it to dynamic strings?”
This is very different from the title of the question at the time I’m writing this, namely “How can I make dynamic strings to work with UTF-8 in console?”
In short, UTF-8 is an encoding of Unicode where the basic encoding unit is 8 bits, commonly called a byte (more precisely it's an octet), while the L prefix forms a wide character or string literal, where the encoding unit typically is 16 or 32 bits – in Windows it’s 16 bits, as in original Unicode.
A wide character or string literal is based on the wchar_t type instead of char.
In Windows a wide string is encoded as UTF-16. The most common sixty thousand or so Unicode characters are represented with single wchar_t values, but some seldom used Chinese ideograms etc. require two successive wchar_t values, called a surrogate pair.
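For illustration, this is roughly how a code point maps to UTF-16 code units. The function name encode_utf16 is made up for this sketch, and it assumes the input is a valid code point (it does not reject values in the surrogate range or above U+10FFFF).

```cpp
#include <vector>

// Encodes one code point as UTF-16 code units.
// Values below U+10000 fit in a single unit; the rest need a surrogate pair.
std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp < 0x10000)
        return { static_cast<char16_t>(cp) };
    cp -= 0x10000;  // 20 bits remain, split 10/10 across the pair
    return { static_cast<char16_t>(0xD800 + (cp >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) }; // low surrogate
}
```

U+00E9 (é) comes out as the single unit 0x00E9, while U+1F600 comes out as the pair 0xD83D 0xDE00.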
The use of 16 bit encoding unit in Windows was established around 1992. I am not sure when UTF-16 was adopted (as an extension of then UCS-2 encoding), it was just a bit later. So this was established long before C99 required that all characters of the wide character set should be representable with single wchar_t values. That requirement appears to have been a pure political maneuver, ensuring that no Windows C compiler could be formally conforming, a general ISO programming language standard that applied only to Unix-land. Unfortunately, since C++11 was based on C99 we now have that also in C++11, ensuring that no Windows C++ compiler can be fully conforming. Pure idiocy. If you ask me.
Errata, re deleted text above: according to Wikipedia’s article about it the wording about a single wchar_t being sufficient for any character in the “extended character set” was there already in C90. Which makes the incompatibility between Windows and the C and C++ standards the fault of Microsoft, not the fault of the C committee. It still appears to be political and fairly idiotic, but (enlightened) with others to blame than I maintained at first…
One way to work with wide dynamic strings is to use std::wstring, from the <string> header.
With Visual C++ you can use a wmain function instead of standard main, as an easy way to get wide command line arguments.
wmain is also supported by MinGW64 (IIRC) g++, although not yet by ordinary MinGW g++, as of g++ 4.8.something. It is however easy to implement in terms of the Windows API. Unless you require strict standard-conforming code that provides the special main function features such as ability to declare it with or without arguments, but hey, let's be practical about things.
Example that compiles fine with both Visual C++ 12.0 and g++ 4.8.2:
// Source encoding: UTF-8 with BOM.
#include <stdio.h>      // _fileno, stdin, stdout
#include <io.h>         // _setmode
#include <fcntl.h>      // _O_WTEXT
#include <iostream>     // std::wcout, std::wcin, std::endl
#include <string>       // std::wstring, std::getline
using namespace std;
auto main()
    -> int
{
    _setmode( _fileno( stdin ), _O_WTEXT );
    _setmode( _fileno( stdout ), _O_WTEXT );
    wcout << L"Hi, what’s your name? ";
    wstring username;
    getline( wcin, username );
    wcout << L"Welcome to Windows C++, " << username << L"!" << endl;
}
Note that with Windows ANSI source this won’t compile with g++ unless you specify the source encoding with the appropriate compiler option.
I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:
I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.
I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?
edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.
edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them '：' (FULLWIDTH COLON, U+FF1A) and '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?
edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.
The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know), would be:
#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
// open as a byte stream
std::wifstream fin("text.txt", std::ios::binary);
// apply BOM-sensitive UTF-16 facet
fin.imbue(std::locale(fin.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
// read
for(wchar_t c; fin.get(c); )
std::cout << std::showbase << std::hex << c << '\n';
}
When you open a UTF-16 file, you must open it in binary mode. This is because in text mode certain bytes are interpreted specially: 0x0d is dropped when it forms part of a carriage return/line feed pair, and 0x1a marks the end of the file. Some UTF-16 characters have one of those bytes as half of a code unit, and that will mess up the reading of the file. For example, U+FF1A (FULLWIDTH COLON) is stored in UTF-16LE as the bytes 0x1a 0xff, so a text-mode read stops there. This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.
For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
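To make the text-mode problem concrete, here is a minimal sketch that scans raw UTF-16 bytes for the two byte values discussed above. The function name first_text_mode_hazard is made up for illustration, and the check is deliberately conservative: it flags every 0x0d byte, even though text mode may only act on it in some contexts.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): returns the offset of the
// first byte that a Windows text-mode read may treat specially, namely
// 0x1A (Ctrl-Z, end of file) or 0x0D (carriage return).
// Returns std::string::npos if neither byte occurs.
std::size_t first_text_mode_hazard(std::string const& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        unsigned char const b = static_cast<unsigned char>(bytes[i]);
        if (b == 0x1A || b == 0x0D)
        {
            return i;
        }
    }
    return std::string::npos;
}
```

Fed the UTF-16LE bytes FF FE 1A FF (a BOM followed by U+FF1A), it reports offset 2, which is where a text-mode read would see the end-of-file marker.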
Edit:
So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.
The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.
You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.
Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):
1. codecvt<char32_t,char,mbstate_t>
2. codecvt<char16_t,char,mbstate_t>
3. codecvt_utf8
4. codecvt_utf16
5. codecvt_utf8_utf16
6. c32rtomb/mbrtoc32
7. c16rtomb/mbrtoc16
And what each one does:
1. A codecvt facet that always converts between UTF-8 and UTF-32
2. converts between UTF-8 and UTF-16
3. converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
5. converts between UTF-8 and UTF-16
6. If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
7. If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16
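As an illustration of the fifth facility, codecvt_utf8_utf16, here is a minimal sketch that converts between UTF-16 code units and UTF-8 bytes through std::wstring_convert. It assumes a C++11 standard library that ships <codecvt> (these facilities were deprecated in C++17, and some toolchains have had trouble with the char16_t specializations). The function names are made up for the example.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
// Throws std::range_error if the input is not valid UTF-16.
std::string utf16_to_utf8(std::u16string const& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.to_bytes(utf16);
}

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
// Throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf8);
}
```

Because this facet works on UTF-16 rather than UCS-2, characters outside the BMP round-trip as surrogate pairs instead of being truncated.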
If __STDC_ISO_10646__ is defined then converting directly using codecvt_utf16<wchar_t> should be fine since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.
So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows specific facility, such as the _O_U16TEXT mode on files.
This should build and run anywhere, but makes a bunch of assumptions to actually work:
#include <cstring> // std::memcpy
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
int main ()
{
std::stringstream ss;
std::ifstream fin("filename", std::ios::binary); // binary mode: no 0x1a or newline translation
ss << fin.rdbuf(); // dump file contents into a stringstream
std::string const &s = ss.str();
if (s.size()%sizeof(wchar_t) != 0)
{
std::cerr << "file not the right size\n"; // on Windows: must be even, two bytes per code unit
return 1;
}
std::wstring ws;
ws.resize(s.size()/sizeof(wchar_t));
if (!ws.empty())
std::memcpy(&ws[0],s.c_str(),s.size()); // copy raw bytes into the wstring (the BOM is copied too)
}
You should probably at least add code to handle endianness and the BOM. Also, Windows newlines don't get converted automatically in binary mode, so you need to do that manually.
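A minimal sketch of that extra handling might look like the following. It assumes the bytes were read in binary mode, treats a missing BOM as little-endian (an assumption, chosen because that is the usual case on Windows), and collapses CR LF into LF. The function name is hypothetical, and the code sticks to C++03 so it should also build with VS2008.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turn raw file bytes into UTF-16 code units.
// A leading BOM selects the byte order and is dropped; without a BOM,
// little-endian is assumed. CR LF pairs are collapsed into a single LF.
// A trailing odd byte, if any, is ignored.
std::vector<unsigned short> utf16_units_from_bytes(std::string const& bytes)
{
    std::size_t pos = 0;
    bool big_endian = false;
    if (bytes.size() >= 2)
    {
        unsigned char const b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char const b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) { big_endian = true; pos = 2; }
        else if (b0 == 0xFF && b1 == 0xFE) { pos = 2; }
    }

    std::vector<unsigned short> units;
    units.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2)
    {
        unsigned int const first  = static_cast<unsigned char>(bytes[pos]);
        unsigned int const second = static_cast<unsigned char>(bytes[pos + 1]);
        unsigned short const unit = static_cast<unsigned short>(
            big_endian ? ((first << 8) | second) : ((second << 8) | first));

        if (unit == 0x000A && !units.empty() && units.back() == 0x000D)
            units.back() = 0x000A;   // collapse CR LF into LF
        else
            units.push_back(unit);
    }
    return units;
}
```

On Windows, where wchar_t is two bytes, the resulting units can be copied straight into a std::wstring with ws.assign(units.begin(), units.end()).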