How to use utf8 character arrays in c++? - c++

Is it possible to have char *s to work with utf8 encoding in C++ (VC2010)?
For example if my source file is saved in utf8 and I write something like this:
const char* c = "aäáéöő";
Is this possible to make it utf-8 encoded? And if yes, how is it possible to use
char* c2 = new char[strlen("aäáéöő")];
for dynamic allocation if characters can be variable length?

The encoding for narrow character string literals is implementation defined, so you'd really have to read the documentation (if you can find it). A quick experiment shows that both VC++ (VC8, anyway) and g++ (4.4.2, anyway) actually just copy the bytes from the source file; the string literal will be in whatever encoding your editor saved it in. (This is clearly in violation of the standard, but it seems to be common practice.)
C++11 has UTF-8 string literals, which would allow you to write u8"text", and be ensured that "text" was encoded in UTF-8. But I don't really expect it to work reliably: the problem is that in order to do this, the compiler has to know what encoding your source file has. In all probability, compiler writers will continue to ignore the issue, just copying the bytes from the source file, and achieve conformance simply by documenting that the source file must be in UTF-8 for these features to work.
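On the dynamic-allocation part of the question, here is a minimal sketch, assuming the string really is stored as UTF-8. The bytes are written as hex escapes so the example does not depend on the source file encoding. strlen counts bytes, not characters, so a buffer of strlen + 1 bytes is always large enough. The names count_code_points and duplicate_utf8 are hypothetical helpers, not library functions.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is valid, NUL-terminated UTF-8.
std::size_t count_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Copies a NUL-terminated UTF-8 string into a freshly allocated buffer.
std::unique_ptr<char[]> duplicate_utf8(const char* s) {
    const std::size_t bytes = std::strlen(s);           // bytes, not characters
    std::unique_ptr<char[]> copy(new char[bytes + 1]);  // +1 for the terminator
    std::memcpy(copy.get(), s, bytes + 1);
    return copy;
}
```

For the string from the question, written as "a\xC3\xA4\xC3\xA1\xC3\xA9\xC3\xB6\xC5\x91", strlen reports 11 bytes while count_code_points reports 6 characters.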

If the text you want to put in the string is in your source code, make sure your source code file is in UTF-8.
If that doesn't work, try using \u1234, where 1234 is the code point value.
You can also try to use UTF8-CPP maybe.
Take a look at this answer : Using Unicode in C++ source code

See this MSDN article which talks about converting between string types (that should give you examples on how to use them). The strings types that are covered include char *, wchar_t*, _bstr_t, CComBSTR, CString, basic_string, and System.String:
How to: Convert Between Various String Types

There is a hotfix for VisualStudio 2010 SP1 which can help: http://support.microsoft.com/kb/980263.
The hotfix adds a pragma to override Visual Studio's control of the character encoding for the char type:
#pragma execution_character_set("utf-8")
Without the pragma, char* based literals are typically interpreted as the default code page (typically 1252)
This should all be superseded eventually by the new string literal prefixes specified by C++0x (u8, u, and U for UTF-8, UTF-16, and UTF-32 respectively), which ideally will be supported in the next major version of Visual Studio after 2010.

It is possible, save the file in UTF-8 without BOM signature encoding.
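//Save As UTF8 without BOM signature
#include<stdio.h>
#include<string.h>
#include<windows.h>
int main(){
SetConsoleOutputCP(65001); // switch the console to UTF-8 output
const char *c1 = "aäáéöő";
char *c2 = new char[strlen(c1) + 1]; // strlen counts bytes; +1 for the terminating '\0'
strcpy(c2,c1);
printf("%s\n",c1);
printf("%s\n",c2);
delete[] c2;
}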
Result:
D:\Debug>program
aäáéöő
aäáéöő
Redirecting the program's output to a file really does produce a UTF-8 encoded file.
This answer is compiler-independent (when compiling on Windows).
(A similar question.)

Related

Convert from std::wstring to std::string

I'm converting wstring to string with std::codecvt_utf8 as described in this question, but when I tried Greek or Chinese alphabet symbols are corrupted, I can see it in the debug Locals window, for example 日本 became "日本"
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv; //also tried codecvt_utf8_utf16
std::string str = myconv.to_bytes(wstr);
What am I doing wrong?
std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.
Therefore when the contents of a std::string need to be presented, the presenter needs to make some guess about the intended encoding of the string, if that information is not provided in some other way.
I am assuming that the encoding you intend to convert to is UTF8, given that you are using std::codecvt_utf8.
But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF8, but I suppose probably code page 1252.
As verification, python gives the following:
>>> '日本'.encode('utf8').decode('cp1252')
'日本'
Your string does seem to be the UTF8 encoding of 日本 interpreted as if it was cp1252 encoded.
Therefore the conversion seems to have worked as intended.
As mentioned by @MarkTolonen in the comments, the encoding to assume for a string variable can be set to UTF8 in the Visual Studio debugger with the s8 specifier, as explained in the documentation.
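To check the result of a conversion without relying on how the debugger decodes the string, you can dump the raw bytes. The following is a minimal sketch; hex_dump is a hypothetical helper name, not part of the standard library.

```cpp
#include <cstddef>
#include <string>

// Returns the bytes of s as lowercase hex pairs separated by spaces.
std::string hex_dump(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

A correct UTF-8 encoding of 日本 dumps as "e6 97 a5 e6 9c ac". Read as cp1252, those six bytes are exactly the characters shown in the Locals window.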

Storing math symbols into string c++

Is there a way to store math symbols into strings in c++ ?
I notably need the union/intersection symbols.
Thanks in advance!
This seemingly simple question is actually a tangle of multiple questions:
What character set to use?
Unicode is almost certainly the best choice nowadays.
What encoding to use?
C++ std::strings are strings of chars, but you can decide how those chars correspond to "characters" in your character set. The default representation assumed by the language and the system could be ASCII, some code page like Latin-1 or Windows-1252, or UTF-8.
If you're on Linux or Mac, your best bet is to use UTF-8. If you're on Windows, you might choose to use wide strings instead (std::wstring), and to use UTF-16 as the encoding. But many people suggest that you always use UTF-8 in std::strings even on Windows, and simply convert from and to UTF-16 as needed to do I/O.
How to specify string literals in the code?
To store UTF-8 in older versions of C++ (before C++11), you could manually encode your string literals like this:
const std::string subset = "\xE2\x8A\x82";
To store UTF-8 in C++11 or newer, you use the u8 prefix to tell the compiler you want UTF-8 encoding. You can use escaped characters:
const std::string subset = u8"\u2282";
Or you can enter the character directly into the source code:
const std::string subset = u8"⊂";
I tend to use the escaped versions to avoid worrying about the encoding of the source file and whether all the editors and viewers and IDEs I use will consistently understand the source file encoding.
If you're on Windows and you choose to use UTF-16 instead, then, regardless of C++ version, you can specify wide string literals in your code like this:
const std::wstring subset = L"\u2282"; // or L"⊂";
How to display these strings?
This is very system dependent.
On Mac and Linux, I suspect things will generally just work.
In a console program on Windows (e.g., one that just uses <iostream> or printf to display in a command prompt), you're probably in trouble because the legacy command prompts don't have good Unicode and font support. (Maybe this is better on Windows 10?)
In a GUI program on Windows, you have to make sure you use the "Unicode" version of the API and to give it the wide string. ("Unicode" is in quotation marks here because the Windows API documentation often uses "Unicode" to mean a UTF-16 encoded wide character string, which isn't exactly what Unicode means.) So if you want to use an API like TextOut or MessageBox to display your string, you have to make sure you do two things: (1) call the "wide" version of the API, and (2) pass a UTF-16 encoded string.
You solve (1) by explicitly calling the wide versions (e.g., TextOutW or MessageBoxW) or by making sure you compile with "Unicode" selected in your project settings. (You can also do it by defining several C++ preprocessor macros instead, but this answer is already long enough.)
For (2), if you are using std::wstrings, you're already done. If you're using UTF-8, you'll need to make a wide copy of the string to pass to the output function. Windows provides MultiByteToWideChar for making such a copy. Make sure you specify CP_UTF8.
For (2), do not try to call the narrow versions of the API functions themselves (e.g., TextOutA or MessageBoxA). These will convert your string to a wide string automatically, but they do so assuming the string is encoded in the user's current code page. If the string is really in UTF-8, then these will do the wrong thing for all of the "interesting" (non-ASCII) characters.
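To make the "wide copy" step concrete, here is a portable sketch of the conversion that MultiByteToWideChar performs when given CP_UTF8. It is a hand-written stand-in for illustration, not the Windows API. utf8_to_utf16 is a hypothetical name, and the validation is deliberately minimal (overlong forms and encoded surrogates are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 and re-encodes it as UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF) throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

On Windows, the resulting code units can be copied into a std::wstring before calling a W function, since wchar_t is 16 bits wide there.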
How to read these strings from a file, a socket, or the user?
This is very system specific and probably worth a separate question.
Yes, you can, as follows:
std::string unionChar = "∪";
std::string intersectionChar = "∩";
They are just characters but don't expect this code to be portable. You could also use Unicode, as follows:
std::string unionChar = u8"\u222A";
std::string intersectionChar = u8"\u2229";
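If you would rather build such strings from code point numbers at run time, a small encoder is enough. This is a sketch under the assumption that the std::string is meant to hold UTF-8; encode_utf8 is a hypothetical helper, not a standard function.

```cpp
#include <stdexcept>
#include <string>

// Encodes one Unicode code point as UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogates are not code points");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
    return out;
}
```

With this, encode_utf8(0x222A) gives the three bytes E2 88 AA for the union sign and encode_utf8(0x2229) gives E2 88 A9 for the intersection sign.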

Why doesn't fstream support an em-dash in the file name?

I ported some code from C to C++ and have just found a problem with paths that contain em-dash, e.g. "C:\temp\test—1.dgn". A call to fstream::open() will fail, even though the path displays correctly in the Visual Studio 2005 debugger.
The weird thing is that the old code that used the C library fopen() function works fine. I thought I'd try my luck with the wfstream class instead, and then found that converting my C string using mbstowcs() loses the em-dash altogether, meaning it also fails.
I suppose this is a locale issue, but why isn't em-dash supported in the default locale? And why can't fstream handle an em-dash? I would have thought any byte character supported by the Windows filesystem would be supported by the file stream classes.
Given these limitations, what is the correct way to handle opening a file stream that may contain valid Windows file names that doesn't just fall over on certain characters?
The em-dash is code point U+2014, which is 0x2014 as a UTF-16 code unit (bytes 0x14 0x20 in little endian) and 0xE2 0x80 0x94 in UTF-8; other character sets and code pages use other codes or have no code for it at all. The Windows-1252 code page (very common for western European languages) has a dash character at 0x97 that we can consider equivalent.
Windows manages paths internally as UTF-16, so every time a function is called through its misnamed ANSI interface (functions ending in A), the path is converted to UTF-16 using the code page currently configured for the user.
On the other hand, the C and C++ runtime library (RTL) can be implemented on top of either the "ANSI" interface or the "Unicode" interface (functions ending in W). In the first case, the code page used to represent the string must be the same as the code page used by the system. In the second case, either we use UTF-16 strings directly from the beginning, or the functions used to convert to UTF-16 must be configured to use the code page of the source string for the mapping.
Yes, it is a complex problem. And there are several wrong (or problematic) proposals for solving it:
Use wfstream instead of fstream: wfstream handles paths no differently from fstream. Not at all. It just means "treat the stream of bytes as wchar_t". (And it does that differently from what one might expect, which makes the class useless in most cases, but that is another story.) To use the Unicode interface in the Visual Studio implementation, there are overloads of the constructor and of open() that accept const wchar_t*. These overloads exist for both fstream and wfstream. Use fstream with the right open().
mbstowcs(): The problem here is which locale (which determines the code page assumed for the string) to use. If the locale matches because the default locale matches the system one, fine. If not, you can try mbstowcs_l(). But these are unsafe C functions, so you have to be careful with the buffer size. In any case, this approach only makes sense if the path to convert is obtained at run time. If it is a static string known at compile time, it is better to write it directly in your code.
L"C:\\temp\\test—1.dgn": The L prefix on the string doesn't mean "convert this string to UTF-16" (source code is usually in 8-bit characters), at least not in the Visual Studio implementation. The L prefix means "add a 0x00 byte after each character between the quotes". So —, equivalent to byte 0x97 in a narrow (ordinary) string, becomes 0x97 0x00 in a wide (L-prefixed) string, not 0x14 0x20. It is better to use its universal character name instead: L"C:\\temp\\test\u20141.dgn"
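A small sketch of the universal-character-name form of this path follows. The backslashes that act as path separators are doubled, while the one that introduces \u2014 must stay single, otherwise the literal contains a backslash followed by the text "u2014". em_dash_path is a hypothetical helper name used only for this illustration.

```cpp
#include <string>

// Returns the path from the question as a wide string.
// \u takes exactly four hex digits, so the '1' after 2014 is an ordinary
// character and not part of the escape.
std::wstring em_dash_path() {
    return L"C:\\temp\\test\u20141.dgn";
}
```

The element at index 12 of the returned string is the single code unit 0x2014, independent of the encoding the source file was saved in.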
One popular approach is to always use either UTF-8 or UTF-16 in your code and to convert only when strictly necessary. When you receive a string in a specific code page, first identify the right code page and then convert the string to UTF-8 or UTF-16. Which conversion functions to use depends on where the string comes from. If you get your string from an XML file, the code page is usually stated there (and is usually UTF-8). If it comes from a Windows control, use a Windows API function such as MultiByteToWideChar. (CP_ACP or GetACP() usually works as the default code page.)
Always use fstream (not wfstream) and its wide interfaces (open and constructor), not its narrow ones. (You can again use MultiByteToWideChar to convert from UTF-8 to UTF-16.)
There are several articles and posts with advice on this approach. One that I recommend: http://www.nubaria.com/en/blog/?p=289.
This should work, provided that everything you do is in wide-char notation with wide-char functions. That is, use wfstream, but instead of using mbstowcs, use wide-string literals prefixed with the L char:
const wchar_t* filename = L"C:\\temp\\test—1.dgn";
Also, make sure your source file is saved as UTF-8 in Visual Studio. Otherwise the em-dash could run into locale issues.
Posting this solution for others who run into this. The problem is that Windows assigns the "C" locale on startup by default, and em-dash (0x97) is defined in the "Windows-1252" codepage but is unmapped in the normal ASCII table used by the "C" locale. So the simple solution is to call:
setlocale ( LC_ALL, "" );
Prior to fstream::open. This sets the current codepage to the OS-defined codepage. In my program, the file I wanted to open with fstream was defined by the user, so it was in the system-defined codepage (Windows-1252).
So while fiddling with unicode and wide chars may be a solution to avoid unmapped characters, it wasn't the root of the problem. The actual problem was that the input string's codepage ("Windows-1252") didn't match the active codepage ("C") used by default in Windows programs.

Correctly reading a utf-16 text file into a string without external libraries?

I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:
I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read, it will always be UTF16, it has a BOM and everything, and it can stay that way.
I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy for a solution that used the stl while relying on assumptions about Windows architecture, or even solutions that involved win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?
edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.
edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them ':' (FULLWIDTH COLON, U+FF1A) and ')' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?
edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.
The C++11 solution (supported, on your platform, by Visual Studio since 2010, as far as I know), would be:
#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
// open as a byte stream
std::wifstream fin("text.txt", std::ios::binary);
// apply BOM-sensitive UTF-16 facet
fin.imbue(std::locale(fin.getloc(),
new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
// read
for(wchar_t c; fin.get(c); )
std::cout << std::showbase << std::hex << c << '\n';
}
When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. There are some UTF-16 characters that will have one of those bytes as half of the character code and will mess up the reading of the file. This is not a bug, it is intentional behavior and is the sole reason for having separate text and binary modes.
For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
Edit:
So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.
The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.
You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.
Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):
1. codecvt<char32_t,char,mbstate_t>
2. codecvt<char16_t,char,mbstate_t>
3. codecvt_utf8
4. codecvt_utf16
5. codecvt_utf8_utf16
6. c32rtomb/mbrtoc32
7. c16rtomb/mbrtoc16
And what each one does:
1. A codecvt facet that always converts between UTF-8 and UTF-32
2. converts between UTF-8 and UTF-16
3. converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
5. converts between UTF-8 and UTF-16
6. If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
7. If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16
If __STDC_ISO_10646__ is defined then converting directly using codecvt_utf16<wchar_t> should be fine since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.
So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows specific facility, such as the _O_U16TEXT mode on files.
This should build and run anywhere, but makes a bunch of assumptions to actually work:
#include <cstring>
#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
int main ()
{
std::stringstream ss;
std::ifstream fin("filename", std::ios::binary); // binary mode: no newline or Ctrl-Z translation
ss << fin.rdbuf(); // dump file contents into a stringstream
std::string const &s = ss.str();
if (s.size()%sizeof(wchar_t) != 0)
{
std::cerr << "file not the right size\n"; // must be even, two bytes per code unit
return 1;
}
std::wstring ws;
ws.resize(s.size()/sizeof(wchar_t));
std::memcpy(&ws[0],s.c_str(),s.size()); // copy data into wstring (BOM included)
}
You should probably at least add code to handle endianness and the 'BOM'. Also, Windows newlines don't get converted automatically, so you need to do that manually.
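The endianness and BOM handling mentioned above could look roughly like this. It is a sketch that assumes the whole file has already been read into a std::string in binary mode. decode_utf16_with_bom is a hypothetical helper, and it does not convert Windows newlines.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Turns raw UTF-16 bytes (starting with a BOM) into UTF-16 code units.
std::u16string decode_utf16_with_bom(const std::string& bytes) {
    if (bytes.size() < 2 || bytes.size() % 2 != 0)
        throw std::invalid_argument("expected a BOM and an even byte count");
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    bool little_endian;
    if (b0 == 0xFF && b1 == 0xFE)      little_endian = true;   // FF FE
    else if (b0 == 0xFE && b1 == 0xFF) little_endian = false;  // FE FF
    else throw std::invalid_argument("missing UTF-16 BOM");

    std::u16string out;
    out.reserve(bytes.size() / 2 - 1);
    for (std::size_t i = 2; i < bytes.size(); i += 2) {  // skip the BOM
        const unsigned lo =
            static_cast<unsigned char>(bytes[little_endian ? i : i + 1]);
        const unsigned hi =
            static_cast<unsigned char>(bytes[little_endian ? i + 1 : i]);
        out += static_cast<char16_t>((hi << 8) | lo);
    }
    return out;
}
```

FULLWIDTH COLON (U+FF1A) is stored as the bytes 1A FF in little-endian UTF-16, so it contains the 0x1A byte that text mode treats as end of file. Decoding from bytes read in binary mode avoids that.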

C++ Visual Studio character encoding issues

Not being able to wrap my head around this one is a real source of shame...
I'm working with a French version of Visual Studio (2008), in a French Windows (XP). French accents put in strings sent to the output window get corrupted. Ditto input from the output window. Typical character encoding issue, I enter ANSI, get UTF-8 in return, or something to that effect. What setting can ensure that the characters remain in ANSI when showing a "hardcoded" string to the output window?
EDIT:
Example:
#include <iostream>
int main()
{
std:: cout << "àéêù" << std:: endl;
return 0;
}
Will show in the output:
óúÛ¨
(here encoded as HTML for your viewing pleasure)
I would really like it to show:
àéêù
Before I go any further, I should mention that what you are doing is not c/c++ compliant. The specification states in 2.2 what character sets are valid in source code. It ain't much in there, and all the characters used are in ascii. So... Everything below is about a specific implementation (as it happens, VC2008 on a US locale machine).
To start with, you have 4 chars on your cout line, and 4 glyphs on the output. So the issue is not one of UTF8 encoding, as it would combine multiple source chars to less glyphs.
From your source string to the display on the console, all these things play a part:
1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
2. What your compiler does with a string literal, and what source encoding it understands
3. How your << interprets the encoded string you're passing in
4. What encoding the console expects
5. How the console translates that output to a font glyph.
Now...
1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in, and decodes it to its internal representation. It generates the string literal corresponding data chunk in the current codepage no matter what the source encoding was. I have failed to find explicit details/control on this.
3 is even easier. Except for control codes, << just passes the data down for char *.
4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also figure out which one you have with GetConsoleOutputCP (the input is controlled differently, through SetConsoleCP)
5 is a funny one. I banged my head to figure out why I could not get the é to show up properly, using CP1252 (western european, windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (capital Theta, the same I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a true type font).
Some interesting things I learned looking at this:
the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF8 did not change the generated code. My "é" string was still encoded with CP1252 as 233 0 )
VC is picking a codepage for the string literals that I do not seem to control.
controlling what the console shows is more painful than what I was expecting
So... what does this mean to you ? Here are bits of advice:
don't use non-ascii in string literals. Use resources, where you control the encoding.
make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send.
if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] does show 233 for me, which happens to be the encoding in CP1252.
BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.
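Besides printing individual byte values, a quick structural check can tell you whether the bytes of a literal could be UTF-8 at all. This is only a heuristic sketch: looks_like_utf8 is a hypothetical helper, and it does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <string>

// True if every lead byte is followed by the right number of
// continuation bytes (10xxxxxx). Pure ASCII also passes.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (lead < 0x80)                extra = 0;
        else if ((lead & 0xE0) == 0xC0) extra = 1;
        else if ((lead & 0xF0) == 0xE0) extra = 2;
        else if ((lead & 0xF8) == 0xF0) extra = 3;
        else return false;                       // stray continuation or invalid byte
        if (extra > s.size() - i - 1) return false;  // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

The single byte 233 (0xE9) from the CP1252 experiment above fails this check, whereas the UTF-8 encoding of é, the two bytes C3 A9, passes.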
Because I was requested to, I’ll do some necromancy. The other answers were from 2009, but this article still came up on a search I did in 2018. The situation today is very different. Also, the accepted answer was incomplete even back in 2009.
The Source Character Set
Every compiler (including Microsoft’s Visual Studio 2008 and later, gcc, clang and icc) will read UTF-8 source files that start with BOM without a problem, and clang will not read anything but UTF-8, so UTF-8 with a BOM is the lowest common denominator for C and C++ source files.
The language standard doesn’t say what source character sets the compiler needs to support. Some real-world source files are even saved in a character set incompatible with ASCII. Microsoft Visual C++ in 2008 supported UTF-8 source files with a byte order mark, as well as both forms of UTF-16. Without a byte order mark, it would assume the file was encoded in the current 8-bit code page, which was always a superset of ASCII.
The Execution Character Sets
Visual Studio 2015 Update 2 added a /utf-8 switch to CL.EXE. Today, it also supports the /source-charset and /execution-charset switches, as well as /validate-charset to detect if your file is not actually UTF-8. This page on MSDN has a link to the documentation on Unicode support for every version of Visual C++.
Current versions of the C++ standard say the compiler must have both an execution character set, which determines the numeric value of character constants like 'a', and an execution wide-character set that determines the value of wide-character constants like L'é'.
To language-lawyer for a bit, there are very few requirements in the standard for how these must be encoded, and yet Visual C and C++ manage to break them. It must contain about 100 characters that cannot have negative values, and the encodings of the digits '0' through '9' must be consecutive. Neither capital nor lowercase letters have to be, because they weren’t on some old mainframes. (That is, '0'+9 must be the same as '9', but there is still a compiler in real-world use today whose default behavior is that 'a'+9 is not 'j' but '«', and this is legal.) The wide-character execution set must include the basic execution set and have enough bits to hold all the characters of any supported locale. Every mainstream compiler supports at least one Unicode locale and understands valid Unicode characters specified with \Uxxxxxxxx, but a compiler that didn’t could claim to be complying with the standard.
The way Visual C and C++ violate the language standard is by making their wchar_t UTF-16, which can only represent some characters as surrogate pairs, when the standard says wchar_t must be a fixed-width encoding. This is because Microsoft defined wchar_t as 16 bits wide back in the 1990s, before the Unicode committee figured out that 16 bits were not going to be enough for the entire world, and Microsoft was not going to break the Windows API. It does support the standard char32_t type as well.
UTF-8 String Literals
The third issue this question raises is how to get the compiler to encode a string literal as UTF-8 in memory. You’ve been able to write something like this since C++11:
constexpr unsigned char hola_utf8[] = u8"¡Hola, mundo!";
This will encode the string as its null-terminated UTF-8 byte representation regardless of whether the source character set is UTF-8, UTF-16, Latin-1, CP1252, or even IBM EBCDIC 1047 (which is a silly theoretical example but still, for backward-compatibility, the default on IBM’s Z-series mainframe compiler). That is, it’s equivalent to initializing the array with { 0xC2, 0xA1, 'H', /* ... , */ '!', 0 }.
If it would be too inconvenient to type a character in, or if you want to distinguish between superficially-identical characters such as space and non-breaking space or precomposed and combining characters, you also have universal character escapes:
constexpr unsigned char hola_utf8[] = u8"\u00a1Hola, mundo!";
You can use these regardless of the source character set and regardless of whether you’re storing the literal as UTF-8, UTF-16 or UCS-4. They were originally added in C99, but Microsoft supported them in Visual Studio 2015.
Edit: As reported by Matthew, u8" strings are buggy in some versions of MSVC, including 19.14. It turns out, so are literal non-ASCII characters, even if you specify /utf-8 or /source-charset:utf-8 /execution-charset:utf-8. The sample code above works properly in 19.22.27905.
There is another way to do this that worked in Visual C or C++ 2008, however: octal and hexadecimal escape codes. You would have encoded UTF-8 literals in that version of the compiler with:
const unsigned char hola_utf8[] = "\xC2\xA1Hello, world!";
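One pitfall with hex escapes is that an escape consumes every hex digit that follows it. If the text after the escape starts with a letter from a to f (or a digit), split the literal so the escape ends where you intend. A small sketch, with adios_utf8 as a hypothetical example function:

```cpp
#include <string>

// "\xC2\xA1" is the UTF-8 encoding of U+00A1 (inverted exclamation mark)
// and "\xC3\xB3" is the encoding of U+00F3 (o with acute accent).
// Written as one literal, "\xC2\xA1Adi..." would be parsed as the single
// escape \xA1Ad, which does not fit in a char. Adjacent string literals
// are concatenated after escapes are processed, so splitting is safe.
std::string adios_utf8() {
    return "\xC2\xA1" "Adi\xC3\xB3s, mundo!";
}
```

The returned string holds 16 bytes: two for the inverted exclamation mark, two for the accented o, and twelve ASCII characters.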
Try this:
#include <iostream>
#include <locale>
int main()
{
std::locale::global(std::locale(""));
std::cout << "àéêù" << std::endl;
return 0;
}
Using _setmode() works and is arguably better than changing the codepage or setting a locale, since it'll actually make your program output in Unicode and thus will be consistent - no matter which codepage or locale are currently set.
Example:
#include <iostream>
#include <io.h>
#include <fcntl.h>
int wmain()
{
_setmode( _fileno(stdout), _O_U16TEXT );
std::wcout << L"àéêù" << std::endl;
return 0;
}
Inside Visual Studio, make sure you set up your project for Unicode (right-click Project -> click General -> Character Set = Use Unicode Character Set).
MinGW users:
Define both UNICODE and _UNICODE
Add -finput-charset=iso-8859-1 to the compiler options to get around this error: "converting to execution character set: Invalid argument"
Add -municode to the linker options to get around "undefined reference to `WinMain@16" (read more).
Edit: The equivalent call to set Unicode input is: _setmode( _fileno(stdin), _O_U16TEXT );
Edit 2: An important piece of information, especially considering that the question uses std::cout: this is not supported. The MSDN docs state (emphasis mine):
Unicode mode is for wide print functions (for example, wprintf) and is
not supported for narrow print functions. Use of a narrow print
function on a Unicode mode stream triggers an assert.
So, don't use std::cout when the console output mode is _O_U16TEXT; similarly, don't use std::cin when the console input is _O_U16TEXT. You must use the wide version of these facilities (std::wcout, std::wcin).
And do note that mixing cout and wcout in the same output is not allowed (but I find it works if you call flush() and then _setmode() before switching between the narrow and wide operations).
I tried this code:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
int main()
{
std::wstringstream wss;
wss << L"àéêù";
std::wstring s = wss.str();
const wchar_t* p = s.c_str();
std::wcout << wss.str() << std::endl;
std::wofstream file("C:\\a.txt");
file << p << std::endl;
return 0;
}
The debugger showed that wss, s and p all had the expected values (i.e. "àéêù"), as did the output file. However, what appeared in the console was óúÛ¨.
The problem is therefore in the Visual Studio console, not the C++. Using Bahbar's excellent answer, I added:
SetConsoleOutputCP(1252);
as the first line, and the console output then appeared as it should.
//Save As Windows 1252
#include<iostream>
#include<windows.h>
int main()
{
SetConsoleOutputCP(1252);
std:: cout << "àéêù" << std:: endl;
}
Visual Studio does not support UTF-8 for C++, but partially supports it for C:
//Save As UTF8 without signature
#include<stdio.h>
#include<windows.h>
int main()
{
SetConsoleOutputCP(65001);
printf("àéêù\n");
}
Make sure you do not forget to change the console's font to Lucida Console, as mentioned by Bahbar: it was crucial in my case (French Win 7 64-bit with VC 2012).
Then, as mentioned by others, use SetConsoleOutputCP(1252) for C++, but it may fail depending on the available code pages, so you might want to use GetConsoleOutputCP() to check that it worked, or at least check that SetConsoleOutputCP(1252) returns a nonzero value. Changing the global locale also works (for some reason there is no need to do cout.imbue(locale());), but it may break some libraries!
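A sketch of that check follows, under the assumption that a helper returning bool is convenient. SetConsoleOutputCP returns a nonzero value on success and zero on failure. set_console_output_code_page is a hypothetical wrapper name, and the non-Windows branch exists only so the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Switches the console output code page and confirms the change took effect.
// On non-Windows builds this is a no-op that reports success.
bool set_console_output_code_page(unsigned code_page) {
#ifdef _WIN32
    if (!SetConsoleOutputCP(code_page)) return false;  // zero means failure
    return GetConsoleOutputCP() == code_page;          // verify the active page
#else
    (void)code_page;  // nothing to do outside Windows
    return true;
#endif
}
```

Calling set_console_output_code_page(1252) at the start of main and printing a warning when it returns false makes a missing code page visible instead of silently producing garbled output.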
In C, SetConsoleOutputCP(65001); or the locale-based approach worked for me once I had saved the source code as UTF8 without signature (scroll down, the sans-signature choice is way below in the list of pages).
Input using SetConsoleCP(65001); failed for me apparently due to a bad implementation of page 65001 in windows. The locale approach failed too both in C and C++. A more involved solution, not relying on native chars but on wchar_t seems required.
I had the same problem with Chinese input. My source code is UTF-8 and I added /utf-8 to the compiler options. It works fine with C++ wide strings and wide chars but not with narrow strings/chars, which show up as garbled characters in the Visual Studio 2019 debugger and in my SQL database. I have to use narrow characters because I convert to SQLAPI++'s SAString. Eventually, I found that checking the following option (Control Panel -> Region -> Administrative -> Change system locale) resolves the issue. I know it is not an ideal solution, but it does help me.
In Visual Studio, choose File -> Save yourSource.cpp As.
A dialog will pop up asking whether you want to replace the existing file; choose Yes.
Then this dialog pops up: select UTF-8 with signature. This solved my gibberish output problem both on the console and in files.
This also agrees with @Davislor's answer:
UTF-8 with a BOM is the lowest common denominator for C and C++ source
files