Streaming Out Extended ASCII - c++

I know that only positive character ASCII values are guaranteed cross platform support.
In Visual Studio 2015, I can do:
cout << '\xBA';
And it prints:
║
When I try that on http://ideone.com, nothing is printed.
If I try to directly print this using the literal character:
cout << '║';
Visual Studio gives the warning:
warning C4566: character represented by universal-character-name '\u2551' cannot be represented in the current code page (1252)
And then prints:
?
When this command is run on http://ideone.com I get:
14849425
I've read that wchars may provide a cross platform approach to this. Is that true? Or am I simply out of luck on extended ASCII?

There are two separate concepts in play here.
The first one is the locale, often called a "code page" in Microsoft-ese. A locale defines which visual characters are represented by which byte sequence. In your first example, whatever locale your program gets executed in happens to show the ║ character in response to the byte 0xBA.
Other locales, or code pages, will display different characters for the same bytes. Many locales are multibyte locales, where it can take several bytes to display a single character. In the UTF-8 locale, for example, the same character, ║, takes three bytes to display: 0xE2 0x95 0x91.
The second concept here is one of the source code character set, which comes from the locale in which the source code is edited, before it gets compiled. When you enter the ║ character in your source code, it may get represented, I suppose, either as the 0xBA character, or maybe 0xE2 0x95 0x91 sequence, if your editor uses the UTF-8 locale. The compiler, when it reads the source code, just sees the actual byte sequence. Everything gets reduced to bytes.
Fortunately, all C++ keywords use US-ASCII, so it doesn't matter what character set is used to write C++ code, until you start using non-Latin characters. Those result in a compiler warning informing you, basically, that you're using stuff that may or may not work, depending on the eventual locale the resulting program runs in.
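To make the two points concrete, here is a minimal sketch (assuming the console's locale actually matches each case) that emits the same box-drawing character once as the single CP437/CP850 byte and once as its three-byte UTF-8 sequence:
#include <iostream>

int main()
{
    // On a console using an OEM code page such as CP850, the single byte
    // 0xBA is displayed as the box-drawing character ║.
    std::cout << "\xBA" << '\n';

    // On a UTF-8 console (the usual case on Linux), the same character is
    // the three-byte sequence 0xE2 0x95 0x91.
    std::cout << "\xE2\x95\x91" << '\n';
}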

First, your input source file has its own encoding. Your compiler needs to be able to read this encoding (maybe with the help of flags/settings).
With a plain string literal, the compiler is free to do what it wants, but it must yield a const char[]. Usually, the compiler keeps the source encoding when it can, so the string stored in your program will have the encoding of your input file. There are cases where the compiler will do a conversion, for example if your file is UTF-16 (you can't fit UTF-16 code units in chars).
When you use '\xBA', you write a raw character, and you chose yourself your encoding, so there is no encoding from the compiler.
When you use '║', the type of '║' is not necessarily char. If the character is not representable as a single byte in the compiler character set, its type will be int. In the case of Visual Studio with the Windows-1252 source file, '║' doesn't fit, so it will be of type int and printed as such by cout <<.
You can force an encoding with prefixes on string literals. u8"" will force UTF-8, u"" UTF-16 and U"" UTF-32. Note that the L"" prefix will give you a wide wchar_t string, but that is still implementation dependent: wide chars on Windows are UTF-16 code units (2 bytes each, historically UCS-2), but UTF-32 (4 bytes per char) on Linux.
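A small sketch of the prefixes and the array types they produce, using the box-drawing character U+2551 (sizes are for a pre-C++20 compiler, where u8 literals are still plain char arrays):
#include <iostream>

int main()
{
    const char     u8s[] = u8"\u2551"; // UTF-8: 3 bytes + terminator
    const char16_t u16[] = u"\u2551";  // UTF-16: 1 code unit + terminator
    const char32_t u32[] = U"\u2551";  // UTF-32: 1 code unit + terminator
    const wchar_t  ws[]  = L"\u2551";  // implementation-defined width

    std::cout << sizeof(u8s) << ' '    // 4
              << sizeof(u16) << ' '    // 4 (2 units of 2 bytes)
              << sizeof(u32) << ' '    // 8 (2 units of 4 bytes)
              << sizeof(ws)  << '\n';  // 4 on Windows, 8 on Linux
}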
Printing to the console only depends on the type of the variable. cout << is overloaded for all common types, so what it does depends on the type. cout << will usually feed char strings as-is to the console (actually stdout), and wcout << will usually feed wchar_t strings as-is. Other combinations may involve conversions or interpretations (like feeding an int). UTF-8 strings are char strings, so cout << should always feed them through correctly.
Next, there is the console itself. A console is a totally independent piece of software. You feed it some bytes, it displays them. It doesn't care one bit about your program. It uses its own encoding and tries to print the bytes you fed it using that encoding.
The default console encoding on Windows is code page 850 (not sure if that is always the case). In your case, your file is CP 1252 and your console is CP 850, which is why you can't print '║' directly (CP 1252 doesn't contain '║') but you can using a raw character. You can change the console's output encoding on Windows with SetConsoleOutputCP() (SetConsoleCP() changes the input code page).
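As a rough Windows-only sketch (assuming a console font that can actually render the glyph), switching the console's output code page to UTF-8 lets you write the UTF-8 byte sequence directly:
#include <windows.h>
#include <iostream>

int main()
{
    // Tell the console to interpret the bytes we send as UTF-8.
    SetConsoleOutputCP(CP_UTF8);

    // U+2551 encoded as UTF-8.
    std::cout << "\xE2\x95\x91" << std::endl;
}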
On Linux, the default encoding is UTF-8, which is more convenient because it supports the whole Unicode range. Ideone runs on Linux, so it will use UTF-8. Note that there is the added layer of HTTP and HTML, but those use UTF-8 as well.

Related

Avoid / set character set conversion /encoding for std::cout / std::cerr

General question
Is there a possibility to avoid character set conversion when writing to std::cout / std::cerr?
I do something like
std::cout << "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" << std::endl;
And I want the output to be written to the console maintaining the UTF-8 encoding (my console uses UTF-8 encoding, but my C++ Standard Library, GNUs libstdc++, doesn't think so for some reason).
If there's no possibility to forbid character encoding conversion: Can I set std::cout to use UTF-8, so it hopefully figures out itself that no conversion is needed?
Background
I used the Windows API function SetConsoleOutputCP(CP_UTF8); to set my console's encoding to UTF-8.
The problem seems to be that UTF-8 does not match the code page typically used for my system's locale, and libstdc++ therefore sets up std::cout with the default ANSI code page instead of correctly recognizing the switch.
Edit: Turns out I misinterpreted the issue and the solution is actually a lot simpler (or not...).
The "Ȋ'ɱ ȁ ȖȚƑ-8 Șțȓȉɳɠ (in UTF-8 encoding)" was just meant as a placeholder (and I shouldn't have used it as it has hidden the actual issue).
In my real code the "UTF-8 string" is a Glib::ustring, and those are by definition UTF-8 encoded.
However I did not realize that the output operator << was defined in glibmm in a way that forces character set conversion.
It uses g_locale_from_utf8() internally which in turn uses g_get_charset() to determine the target encoding.
Unfortunately the documentation for g_get_charset() states
On Windows the character set returned by this function is the so-called system default ANSI code-page. That is the character set used by the "narrow" versions of C library and Win32 functions that handle file names. It might be different from the character set used by the C library's current locale.
which simply means that glib will neither respect the C locale I set nor attempt to determine the encoding my console actually uses, and thus basically makes it impossible to use many glib functions to create UTF-8 output. (As a matter of fact this also means that this issue has the exact same cause as the one that triggered my other question: Force UTF-8 encoding in glib's "g_print()".)
I'm currently considering this a bug in glib (or a serious limitation at best) and will probably open a report in the issue tracker for it.
You are looking at the wrong side: you are talking about a string literal included in your source code (not input from your keyboard), and for that to work properly you have to tell the compiler which encoding is used for those characters (I think the first C++ specification that mentions non-ASCII character sets is C++11).
Since you are actually using the Unicode character set, you would have to encode the characters in at least a wchar_t for them to be treated as such, or agree with the compiler (this is probably what happens) that Unicode characters in string literals are UTF-8 encoded. That commonly means they will be printed as UTF-8, and if you use a UTF-8 capable console they will be displayed correctly, without any further trouble.
There is a gcc option to specify the encoding used for string literals in a source file, and clang should have an equivalent one. Check the documentation; that will probably solve any issues. But the most portable approach is not to depend on the code set at all, or to use one like ISO 10646 (keeping in mind that full Unicode coverage is not the same thing as UTF-8: UTF-8 is only one way to encode Unicode characters, and as such, only one way to represent them).
Another issue is that C++11 does not refer to the Unicode consortium's standard but to its ISO counterpart (ISO/IEC 10646, I think). The two are similar but not equal, and the character encodings are similar but not equal (the ISO code space is 32 bits wide while the Unicode consortium's is 21 bits, for example). These and other differences make some tricks awkward in C++ and cause problems when one is thinking in strict Unicode terms.
Of course, to output correct strings on a UTF-8 terminal, you have to encode the code points to UTF-8 before sending them to the terminal. This is true even if you already have them UTF-8 encoded in a string object. If you say they are already UTF-8, no conversion is made at all... but if you don't, the normal assumption is that you are using plain code points limited to eight bits, and they get encoded to UTF-8 before printing. That leads to double-encoding errors: something like ú (code point U+00FA) should be encoded in UTF-8 as the byte sequence { 0xC3, 0xBA }, but if you don't say the string literal is already UTF-8, the two bytes will be treated as the code points for Ã (U+00C3) and º (U+00BA) and re-encoded as { 0xC3, 0x83, 0xC2, 0xBA }, which displays them incorrectly. This is a very common error, and you have probably seen its results wherever an encoding step was applied twice.
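A minimal sketch of that double-encoding trap (assuming a pre-C++20 compiler, where u8 literals are plain char arrays), printing the byte values of the correct and the doubly encoded form:
#include <cstdio>

int main()
{
    // Correct: the literal is declared as UTF-8, so ú is the two bytes C3 BA.
    const char ok[] = u8"\u00fa";

    // What double encoding produces: the two bytes were treated as the code
    // points U+00C3 and U+00BA and re-encoded, giving C3 83 C2 BA (mojibake).
    const char bad[] = u8"\u00c3\u00ba";

    for (const char* p = ok; *p; ++p)
        std::printf("%02X ", (unsigned char)*p);   // C3 BA
    std::printf("\n");

    for (const char* p = bad; *p; ++p)
        std::printf("%02X ", (unsigned char)*p);   // C3 83 C2 BA
    std::printf("\n");
}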

How am I able to store a Japanese character in a 1-byte normal string?

std::string str1="いい";
std::string str2="الحانةالريفية";
WriteToLog(str1.size());
WriteToLog(str2.size());
I get "2,13" in my log file which is the exact number of characters in those strings. But how the japanese and arabic characters fit into one byte. I hope str.size() is supposed to return no of bytes used by the string.
On my UTF-8-based locale, I get 6 and 26 bytes respectively.
You must be using a locale that uses the high 8-bit portion of the character set to encode these non-Latin characters, using one byte per character.
If you switch to a UTF-8 locale, you should get the same results as I did.
The answer is, you can't.
Those strings don't contain what you think they contain.
First make sure you save your source file as UTF-8 with BOM, or as UTF-16. (Visual Studio calls these UTF-8 with signature and Unicode).
Don't use any other encoding, as then the meaning of that string literal changes as you move your source file between computers with different language settings.
Then, you need to make sure the compiler uses a suitable character set to embed those strings in your binary. That's called the execution character set → see Does VC have a compile option like '-fexec-charset' in GCC to set the execution character set?
Or you can go for the portable solution, which is encoding the strings to UTF-8 yourself, and then writing the string literals as bytes: "\xe3\x81\x84\xe3\x81\x84".
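For example, assuming a pre-C++20 compiler, the hex-escape form below does not depend on the source file's encoding at all, while the u8 form relies on the compiler knowing the source encoding; both should yield the same 6 UTF-8 bytes for "いい":
#include <iostream>
#include <string>

int main()
{
    std::string a = "\xe3\x81\x84\xe3\x81\x84"; // UTF-8 bytes written out by hand
    std::string b = u8"いい";                    // compiler converts to UTF-8

    std::cout << a.size() << ' ' << b.size() << '\n'; // 6 6
    std::cout << (a == b) << '\n';                    // 1 if both really are UTF-8
}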
They're using MBCS (Multi-byte character set).
Underneath, while "Unicode" builds encode most characters in two bytes, MBCS encodes common characters in a single byte and uses a lead byte to signal that a character occupies more than one byte. Confusingly, depending on which characters you chose for that Japanese string, your size might have been 3 rather than 2 or 4.
MBCS is a bit dated; it's recommended to use Unicode for new development when possible. See the link below for more info.
https://msdn.microsoft.com/en-us/library/5z097dxa.aspx

c++ wchar_t array and char array in programming for win32 console

I am writing a program that outputs Chinese characters, using Dev-C++.
I've added
-finput-charset=big5
-fexec-charset=big5
in compiler parameters. I also set the code page of the console to be 950 (traditional chinese)
It works perfectly while in a simple cout like this:
cout << "中文字";
but when it comes to a character array it goes wrong, as expected:
char chin[] = "中文字";
cout << chin[0];              // outputs nothing
cout << chin[0] << chin[1];   // outputs the first Chinese character, since one Chinese character occupies 2 bytes
So I decided to use wchar_t instead and I have to use wcout with wchar_t or else a number will be shown.
However, wcout show nothing in the console. All of the below show nothing:
wcout << L"中文字";
wchar_t chin2[] = L"中文字";
wcout << chin2[0];
What did I miss in using wchar_t to output Chinese (or other East Asian) characters? I really don't want to have to write two array members to show a single Chinese character.
There are subtle problems going on here.
The C++ compiler does not understand Big5 encoding. When you create a source code file and display it, you may see your familiar Chinese characters but the compiler sees a string of bytes. Big5 is a double byte charset so each input character will be represented by 2 bytes inside the compiler.
When that string of bytes is fed to a suitable output device the Chinese characters appear again. Code page 950 is compatible with Big5 so you see the "right" thing. But then you try to build on this and confusion is the result. Your second code sample uses L"" strings, but I expect those strings will contain half a character in each short.
The only "safe" character set you can use is Unicode. Windows internals are historically UCS-2 (char is a single short) but is now theoretically UTF-16 (char is short, but may include multi-byte sequences). Not all existing software and older APIs fully support UTF-16 (or need to). Windows has very limited support for UTF-8 or other encodings. Everything gets converted into Unicode, so best to just leave it that way.
In practice, you should build your C++ code with Unicode settings, for UCS-2, and exercise caution if you need characters that would require multibyte sequences. You should ensure that any source code you write and any input text files are identified as whatever encoding they need to be, but are translated into Unicode internally. Leave your console as the default Unicode encoding, and everything will just work.
It is almost impossible to sensibly use Big5 as an internal encoding in a Windows program. Best not to try.
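If you do go the Unicode route, a commonly used Windows-specific sketch (known to work with MSVC and recent MinGW runtimes; whether Dev-C++'s toolchain supports it is an assumption you should verify) is to switch stdout into UTF-16 mode before using wcout:
#include <fcntl.h>
#include <io.h>
#include <iostream>

int main()
{
    // Put stdout into UTF-16 text mode so the console receives wide output.
    // After this call, only wide output (wcout/wprintf) may be used on stdout.
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wcout << L"中文字" << std::endl;
}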

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF-8 is variable length. When I use the at or substr string methods, would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 is variable length, all kinds of indexing index code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access you need a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
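A short sketch of the difference (assuming a pre-C++20 compiler, so the u8 literal is a plain char array):
#include <iostream>

int main()
{
    const char     utf8[]  = u8"\uFFFD"; // 3 code units: EF BF BD
    const char32_t utf32[] = U"\uFFFD";  // 1 code point

    // Indexing utf8 yields individual bytes of the encoding, not the character.
    std::cout << (sizeof(utf8) - 1) << " UTF-8 code units, "
              << (sizeof(utf32) / sizeof(char32_t) - 1) << " UTF-32 code point\n";
}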
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally, and instead you must use UCNs or hex escapes.
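For instance, under this scheme the narrow literal has to spell out the UTF-8 bytes by hand (or contain the character literally in a BOM-less UTF-8 file), while the wide literal has to use a UCN; the byte values below are just the standard UTF-8 and UTF-16 encodings of U+2551:
// Narrow string: write the UTF-8 encoding yourself with hex escapes.
const char narrow[] = "\xE2\x95\x91";

// Wide string: use a universal-character-name, which VC++ converts to UTF-16.
const wchar_t wide[] = L"\u2551";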
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
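If all you need is a code point count (not full grapheme handling), a small sketch that walks a well-formed UTF-8 std::string and skips continuation bytes looks like this:
#include <cstddef>
#include <string>

// Counts code points in a well-formed UTF-8 string by ignoring bytes of the
// form 10xxxxxx, which are always continuation bytes.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}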
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that you're trying to fit FF FD inside a single byte, since, as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you will possibly get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at, you could end up with a single byte of a multi-byte sequence; with substr, you could break a sequence and end up with an invalid UTF-8 string (it would start or end with �, \uFFFD, the same character you're apparently trying to use, and the broken character would be lost).
I would recommend that you use wchar_t to store Unicode strings. Since that type is usually at least 16 bits, many more characters can fit in a single "unit".

How to use Unicode in C++?

Assuming a very simple program that:
asks for a name,
stores the name in a variable,
displays the variable's content on the screen.
It's so simple that it's the first thing one learns.
But my problem is that I don't know how to do the same thing if I enter the name using japanese characters.
So, if you know how to do this in C++, please show me an example (that I can compile and test)
Thanks.
user362981: Thanks for your help. I compiled the code that you wrote without problems, then the console window appears and I cannot enter any Japanese characters in it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it will not display them either.
Svisstack : Also thanks for your help. But when I compile your code I get the following error:
warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:
With the wchar_t wide character type, ANSI/ISO C provides for
inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide
character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.
and that
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently,
programs that need to be portable across any C or C++ compiler should not use wchar_t
for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide
characters, which may be Unicode characters in some compilers.
So, it's implementation defined. Here are two implementations: on Linux, wchar_t is 4 bytes wide and represents text in the UTF-32 encoding (regardless of the current locale), either BE or LE depending on what is native for your system. Windows, however, has a 2-byte-wide wchar_t and represents UTF-16 code units with it. Completely different.
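A trivial check you can run on each platform (the values shown are what the two implementations above typically report):
#include <iostream>

int main()
{
    // Typically 4 on Linux (UTF-32) and 2 on Windows (UTF-16 code units).
    std::cout << sizeof(wchar_t) << '\n';
}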
A better path: Learn about locales, as you'll need to know that. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:
#include <iostream>
#include <locale.h>
#include <string>

int main()
{
    setlocale(LC_ALL, "");
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}
...
$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8
But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.
Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)
On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see those two bytes rendered as two unrelated ISO-8859-2 characters. This barfing of incorrect characters has been called Mojibake.
Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)
Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.
In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
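One common way to do that conversion on POSIX systems is iconv(3); here is a rough sketch (glibc-style prototypes, error handling omitted) that assumes the input is ISO-8859-2 and the output should be UTF-8:
#include <iconv.h>
#include <string>

// Converts an ISO-8859-2 string to UTF-8 using POSIX iconv.
// Sketch only: real code should check every return value and handle
// partial conversions.
std::string latin2_to_utf8(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");

    std::string out(in.size() * 4, '\0');          // generous output buffer
    char*  inbuf   = const_cast<char*>(in.data());
    size_t inleft  = in.size();
    char*  outbuf  = &out[0];
    size_t outleft = out.size();

    iconv(cd, &inbuf, &inleft, &outbuf, &outleft); // do the conversion
    iconv_close(cd);

    out.resize(out.size() - outleft);              // trim unused space
    return out;
}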
Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:
#include <iostream>
#include <string>
int main() {
    std::wstring name;
    std::wcout << L"Enter your name: ";
    std::wcin >> name;
    std::wcout << L"Hello, " << name << std::endl;
}
There are other ways, but this is sort of the "minimal change" answer.
Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html
The above article is a must-read; it explains what Unicode is, but a few lingering questions remain. Yes, UNICODE has a unique code point for every character in every language, and furthermore those code points can be encoded and stored in memory differently from the code value itself. This way we can save memory, for example by using the UTF-8 encoding, which is great if the only language supported is English, since its memory representation is essentially the same as ASCII; this of course requires knowing the encoding itself. In theory, if we know the encoding, we can store these longer UNICODE characters however we like and read them back. But the real world is a little different.
How do you store a UNICODE character/string in a C++ program? Which encoding do you use? The answer is that you don't use any encoding, but you directly store the UNICODE code points in a Unicode character string, just like you store ASCII characters in an ASCII string. The question is what character size you should use, since UNICODE characters have no fixed size. The simple answer is that you choose a character size wide enough to hold the highest code point (language) that you want to support.
The theory that a UNICODE character can take 2 bytes or more still holds true, and this can create some confusion. Shouldn't we be storing code points in 3 or 4 bytes then, which is what is really needed to represent every Unicode character? Why does Visual C++ store Unicode in wchar_t, which is only 2 bytes, clearly not enough to store every UNICODE code point?
The reason we store a UNICODE character code point in 2 bytes in Visual C++ is actually exactly the same reason we were storing ASCII (= English) characters in one byte. At that time, we were thinking only of English, so one byte was enough. Now we are thinking of most international languages out there, but not all, so we use 2 bytes, which is enough. Yes, it's true this representation will not let us represent those code points that take 3 bytes or more, but we don't care about those yet because those folks haven't even bought a computer yet. Yes, we are not using 3 or 4 bytes because we are still stingy with memory: why store the extra 0 (zero) byte with every character when we are never going to use it (that language)? Again, this is exactly the same reason ASCII stored each character in one byte: why store a character in 2 or more bytes when English can be represented in one byte, with room to spare for those extra special characters?
In theory 2 bytes are not enough to represent every Unicode code point, but they are enough to hold anything that we may ever care about for now. A true UNICODE string representation could store each character in 4 bytes, but we just don't care about those languages.
Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. A single Unicode character size will grow further, perhaps to 8 bytes, to accommodate all their code points. It doesn't mean we should start using 8 bytes for each Unicode character now. Memory is a limited resource; we allocate what we need.
Can I handle a UNICODE string as a C-style string?
In C++, an ASCII string can still be handled in the C style, and that's fairly common: grab it by its char* pointer and apply C functions to it. However, applying the current C-style string functions to a UNICODE string makes no sense, because the string can contain single NULL bytes in the middle, which would terminate a C string.
A UNICODE string is no longer a plain buffer of text. Well, it is, but it is now more complicated than a stream of single-byte characters terminated by a NULL byte. This buffer can still be handled through its pointer, even in C, but that requires UNICODE-compatible calls, or a C library that can read and write those strings and perform operations on them.
This is made easier in C++ with a specialized class that represents a UNICODE string. This class handles the complexity of the Unicode string buffer and provides an easy interface. It also decides whether each character of the Unicode string is 2 bytes or more; these are implementation details. Today it may use wchar_t (2 bytes), but tomorrow it may use 4 bytes for each character to support more (less well-known) languages. This is why it is always better to use TCHAR than a fixed size, since it maps to the right type when the implementation changes.
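As a small Windows-specific illustration (TCHAR and _T() come from <tchar.h>; which types they expand to depends on whether the build defines UNICODE/_UNICODE):
#include <tchar.h>
#include <windows.h>

int main()
{
    // Expands to wchar_t[] and an L"..." literal in a Unicode build,
    // or to char[] and a plain "..." literal in an ANSI/MBCS build.
    TCHAR greeting[] = _T("Hello");

    // lstrlen() resolves to lstrlenW or lstrlenA to match TCHAR.
    int len = lstrlen(greeting);

    return len > 0 ? 0 : 1;
}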
How do I index a UNICODE string?
It is also worth noting, particularly for C-style string handling, that indexes are used to traverse a string or find a substring in it. In an ASCII string this index corresponded directly to the position of the item in the string, but it has no such meaning in a UNICODE string and should be avoided.
What happens to the string terminating NULL byte?
Are UNICODE strings still terminated by a NULL byte? Is a single NULL byte enough to terminate the string? This is an implementation question, but a NULL terminator is still one Unicode code point and, like every other code point, it must be the same size as any other (especially when no encoding is applied). So the NULL character must be two bytes as well if the Unicode string implementation is based on wchar_t. All UNICODE code points are represented by the same size, whether it is the NULL code point or any other.
Does the Visual C++ debugger show UNICODE text?
Yes; if the text buffer is of type LPWSTR or any other type that supports UNICODE, Visual Studio 2005 and later support displaying the international text in the debugger watch window (provided fonts and language packs are installed, of course).
Summary:
C++ doesn’t use any encoding to store Unicode characters; it directly stores the UNICODE code points for each character in a string. It must pick a character size large enough to hold the largest character of the desirable languages (loosely speaking), and that character size will be fixed and used for all characters in the string.
Right now, 2 bytes are sufficient to represent most of the languages that we care about; this is why 2 bytes are used to represent a code point. In the future, if a new friendly space colony is discovered that we want to communicate with, we will have to assign new Unicode code points to their language and use a larger character size to store those strings.
You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main()
{
wchar_t name[256];
setlocale(LC_ALL, "");  /* use the environment's locale so wide I/O converts correctly */
wprintf(L"Type a name: ");
wscanf(L"%s", name);
wprintf(L"Typed name is: %s\n", name);
return 0;
}