Swedish characters don't compare correctly - c++

For some reason, if/else statements aren't working correctly for me in C++.
The problem is that when a variable is equal to "höger" (right), the if branch is not taken; execution goes on to the else branch instead. If I replace the letter 'ö' with, say, 'o', so that the word becomes "hoger", and then type "hoger", the if statement works. But whenever I type "höger", the comparison fails and the else branch runs. How can I make the if statement recognize "höger"? It's as if Swedish letters don't work.
My code looks like this:
#include <iostream>
#include <string>
using namespace std;
int main() {
setlocale(LC_ALL,"");
string test; // Define variable
cout << " Höger elle vänster"<<endl; // Right or left
cin >> test;
if(test == "höger") { // If right, then output this.
cout <<"Du valde höger"<<endl;
}
else if(test == "vänster") { // If left, then output this
cout <<"Du valde vänster"<<endl;
} else {
// Do this
}
}

The problem is almost certainly to do with encodings.
The C/C++ language specs do not automatically handle anything other than 7 bit ASCII. The o-umlaut character is outside that range, and the exact behaviour depends on the encoding of your source code file.
The most likely possibilities are ISO 8859-1, Windows ANSI-1252, UTF-8 or Windows OEM 850. The first two encode this character the same, but in each of the others it is different.
With a bit more information about the encoding and tool set you are using it may be possible to provide more specific diagnosis and advice.
[And by the way, if/else statements in C/C++ work just fine, thank you.]
If we assume for the moment that this is Windows and Visual C++, then this is what you're dealing with.
Source code written inside Visual Studio: code page 1252. Code point for the o-umlaut character is 0xf6.
Keyboard input read from the console: code page 850. Code point for the o-umlaut character is 0x94.
Obviously not a good match. However, Visual Studio can also quite happily edit source code files in many encodings, including UTF-8 (with byte order mark), UTF-16 (wide characters) and code page 850. So:
Source code written inside Visual Studio: code page 850. Code point for the o-umlaut character is 0x94. Now it works.
You can also change the code page for your console using the CHCP command.
Change Console to CHCP 1252 and it works.
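To make the mismatch concrete, here is a minimal sketch that spells out "höger" with the byte values quoted above, using \x escapes so that the result does not depend on how the source file is saved. The helper name hex_dump and the two constants are made up for illustration. Dumping the bytes that cin actually delivered is a quick way to see which encoding the console used.

```cpp
#include <string>

// "höger" with 'ö' written as an explicit byte, so the source file encoding
// plays no part. The byte values are the ones quoted in the answer.
const std::string hoger_cp1252 = "h\xF6" "ger"; // Windows-1252 / ISO 8859-1
const std::string hoger_cp850  = "h\x94" "ger"; // OEM 850 (console default)

// Hypothetical diagnostic helper: render each byte of s as two hex digits.
std::string hex_dump(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}
```

Printing hex_dump(test) right after cin >> test gives 68 94 67 65 72 when the console delivers cp850 and 68 F6 67 65 72 when it delivers cp1252; only the latter compares equal to a literal compiled as cp1252.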
The behaviour of the compiler when reading source code is obliged by the standard to be consistent with the execution character set. See n3797 S2.2.5:
Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set
S2.3/3:
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.
n3797 S2.14.3/1:
A character literal that does not begin with u, U, or L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.
n3797 S2.14.5/6:
a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
The execution character set is implementation-defined. Microsoft's statement regarding implementation-defined behaviour for the C compiler is here: http://msdn.microsoft.com/en-us/library/hx3yt8af.aspx. [I can't find a separate one for C++, so I assume this applies to both.]
The source character set is the set of legal characters that can appear in source files. For Microsoft C, the source character set is the standard ASCII character set.
Sorry about the language-lawyer stuff, but what this says is that the MSVC compiler is independent of locale/encoding and implements 8-bit ASCII, code page unspecified. Obviously the standard library functions may need to know the encoding for various purposes, but that is a whole other story.
As a final point, the Microsoft C compiler dates back around 30 years, to before Windows. It has always been possible to write source code in code page 850 and have it run correctly on the console, subject to careful handling of extended (8-bit) characters. Many people still do. The problem here is source code written in Windows ANSI or Unicode combined with keyboard input from an OEM (cp850) console. Change either one to get it to work correctly.

In practice this problem will only manifest itself in Windows, so I'll assume Windows.
Then the problem is that the C++ narrow extended execution character set(1) (encoding) does not match the encoding used by the console window. "Narrow" refers to the char type. "Execution character set" is a formal term employed by the C++ standard, and refers to the encoding that is assumed for text stored in the executable. The compiler translates source code literals to this encoding. It's also assumed for translation to/from any external encoding, such as translation to/from a console's encoding.
With Visual C++ the narrow encoding is always Windows ANSI(2), regardless of source code encoding, unless you trick the compiler. And assuming you're using Visual C++, this is then one encoding that you know.
The encoding in the console window is by default the one used for original IBM PC, in your case probably codepage 850 (a Western European variant of the original IBM PC English codepage 437). Run the Windows command interpreter cmd (Windows-key+R, type cmd, OK). Type chcp to check the current codepage. Type chcp 1252 to switch to Windows ANSI Western, which presumably is the Windows ANSI codepage on your machine. Run your program [.exe] file, e.g. by typing its full path, or by going to its directory and typing just its name, e.g.
[H:\dev\test\0046]
> cl /nologo /EHsc /GR encoding.cpp /Fe:b.exe
encoding.cpp
[H:\dev\test\0046]
> chcp & b
Active code page: 850
Höger elle vänster
höger
← No output here, didn't compare as equal.
[H:\dev\test\0046]
> chcp 1252
Active code page: 1252
[H:\dev\test\0046]
> b
Höger elle vänster
höger
Du valde höger
[H:\dev\test\0046]
> _
… where cl (short for original “Lattice C”) is the Visual C++ compiler.
You can change the console codepage more permanently by running regedit, going to this registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
and in the list in the right pane double-click the value named OEMCP (short for Original Equipment Manufacturer Code Page, referring to the IBM PC), change it to 1252, or more generally to the same value as the ACP value, and reboot the machine.
Oh, it's also necessary to change the console window font to a TrueType font such as Lucida Console, because the default is (an emulation of) a bitmapped font that only works correctly with the original console codepage. You can right click the console window title to get a menu, choose [Defaults], and configure the default font, size, colors etc. The changes won't affect the current console window, but they will apply to future console windows, except for those that have been configured individually(3).
An alternative to such console window configuration is to use the Console2 program. If you do, then in Windows 7 and later be sure to use the 64-bit version. Otherwise some things, such as invoking links to 64-bit programs, won't work.
Summing up, you can either
run the program from the command interpreter (using chcp to change the codepage), or
change the console codepage more permanently, as discussed above.
In either case it's a Good Idea™ to change the console window font to a TrueType font – and yes, this affects the functionality, not just the looks.
Note on additional Microsoft absurdity: in Windows 7 and later the "System" font used by default in console windows is actually, behind the scenes, a TrueType font with umpteen thousand glyphs, but it's used to emulate the old 16-bit Windows bitmapped fonts, with the same silly restrictions, so that you still have to change to some other TrueType font…
(1) See the C++11 standard §2.3/3.
(2) “Windows ANSI” depends on the Windows configuration and is always the codepage specified by the GetACP API function. In practice this function gets its value from the registry key/value referenced above. However, that's largely undocumented.
(3) In Windows XP Windows would ask if you wanted to save an individual console window configuration. Starting with Windows Vista it's saved with no question asked and no information that it's been saved. There is no user interface for removing such saved configurations, but they can be removed by programmatically altering shortcut files, and/or by registry editing, which however is both an impractical and brittle solution.
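If neither the console codepage nor the source encoding can be changed, one more possibility is to translate the input before comparing it. The following is only a sketch under the assumption that the console delivers codepage 850 and the literals were compiled as Windows-1252; the function name is made up, and the table covers just the six Swedish letters rather than the whole codepage.

```cpp
#include <string>

// Hypothetical helper: map the six Swedish letters from OEM codepage 850
// to Windows-1252. All other bytes are passed through unchanged.
std::string cp850_to_cp1252_swedish(std::string s) {
    for (char& ch : s) {
        switch (static_cast<unsigned char>(ch)) {
            case 0x86: ch = '\xE5'; break; // å
            case 0x84: ch = '\xE4'; break; // ä
            case 0x94: ch = '\xF6'; break; // ö
            case 0x8F: ch = '\xC5'; break; // Å
            case 0x8E: ch = '\xC4'; break; // Ä
            case 0x99: ch = '\xD6'; break; // Ö
            default: break;                // everything else stays as it is
        }
    }
    return s;
}
```

In the question's program this would be applied as test = cp850_to_cp1252_swedish(test); directly after cin >> test. It fixes the comparison only; what the console displays still depends on its codepage and font.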

The only change I made to your code was the following:
// setlocale(LC_ALL, "");
char *l = setlocale(LC_ALL, NULL);
cout << "Current Locale: " << l << endl;
Because I don't have an “ISO” keyboard layout, I used Alt codes to type the character I need. The following are the key combinations I used for the different code pages.
First run I had to type in Alt+246 for Code page 437
Second run, Alt+148 for Windows-1252
Below is the output when I change the code page between executions

It seems the problem is the encoding of your source file when your IDE compiles it. If you are using Visual Studio you can change your encoding setting like this:

Related

How do I use system("chcp 936") in my dialog based project?

The code below is supposed to convert the wstring "!" (the full-width exclamation mark, U+FF01) to a string and output it:
setlocale(LC_ALL, "Chinese_China.936");
//system("chcp 936");
std::wstring ws = L"!";
string as((ws.length()) * sizeof(wchar_t), '-');
auto rs = wcstombs((char*)as.c_str(), ws.c_str(), as.length());
as.resize(rs);
cout << rs << ":" << as << endl;
If you run it without system("chcp 936");, the converted string is displayed as "£¡" rather than "!". With system("chcp 936");, the result is correct in a console project.
But in my dialog-based project, system("chcp 936") is useless; even if it worked, I couldn't use it, because it would pop up a console.
PS: the IDE is Visual Studio 2019, and my source code is stored as UTF-8 with signature.
My operating system language is English, and the language for non-Unicode programs is English (United States).
Edit: interestingly, even with the "en-US" locale, "!" can be converted to an ASCII "!".
But I don't understand where the "£¡" in the dialog-based project comes from.
There are two distinct points to consider with locales:
you must tell the program which charset to use when converting Unicode characters to plain bytes (this is the role of setlocale)
you must tell the terminal which charset it should render (this is the role of chcp in the Windows console)
The first point depends on the language and, optionally, the libraries that you use in your program (here the C++ language and Standard Library).
The second point depends on the console application and the underlying system. The Windows console uses chcp, and you will find in that other post how you can configure xterm in a Unix-like system.
I found the cause: the wstring-to-string conversion was not the problem. The problem was that I used CA2T to convert the Chinese punctuation mark, and that failed, so "£¡" ended up in the UI.
Using mbstowcs, the counterpart of wcstombs, works.
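One thing the snippet in the question does not do is check the return value: wcstombs returns (size_t)-1 when a character cannot be converted, and as.resize(rs) would then request an enormous string. Below is a minimal sketch of both directions with that check added. The helper names to_narrow and to_wide are made up, and both functions convert according to whatever locale setlocale has selected.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Wide -> narrow using the current C locale (LC_CTYPE). Returns false if
// some character has no representation in that locale's narrow encoding.
bool to_narrow(const std::wstring& ws, std::string& out) {
    // MB_CUR_MAX is the largest number of bytes one character can need.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false; // conversion failed
    out.assign(buf.data(), n);
    return true;
}

// Narrow -> wide using the current C locale. Returns false on invalid input.
bool to_wide(const std::string& s, std::wstring& out) {
    // A wide string never has more elements than the narrow one has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false; // invalid sequence
    out.assign(buf.data(), n);
    return true;
}
```

Plain ASCII text converts in any locale, so a failure from either function points at the locale setting or at the input encoding rather than at the buffer handling.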

Text on MFC Controls - Unicode Characters such as Japanese get cut off

Background
I'm working on a C++/MFC application and we've been converting it to display unicode characters to support foreign languages. For the most part this has been successful and unicode characters are displayed correctly. But I've encountered an issue where certain text on certain controls gets cut off.
Example
Here you can see a button that should display "ログアウト/終了" but gets cut off and displays an unknown character in its place.
But if I pad the string with spaces it displays fine. The number of spaces needed varies by string. This string needed 4 spaces to display correctly, whereas another string with one less character needed 5 spaces; there doesn't seem to be a correlation or pattern with the number of spaces needed. And also, I don't want to pad strings randomly throughout the code, especially when other languages don't need this at all.
What I've tried (doesn't work)
Shrinking the font size
Resizing the control
Changing the font facename
Changing the font character set
Copying the control properties from another control in the application that does not have this issue
Add extra null terminators
Padding with zero-width characters
Using SetWindowTextW
Changing source and execution character sets
Changing system locale
The only thing I've found that works is padding with an arbitrary amount of spaces which is certainly not an ideal solution.
Other info
I've only noticed this issue for Japanese characters, but have only tested English, German, and Japanese.
Japanese characters use 3 bytes of data, which I suspect has something to do with this but I don't know what or why. English characters use 1 byte and certain German characters use 2 bytes.
A control (button/label/etc) in one place may have an issue whereas a control in a different place that contains the same text does not have the issue, even if they're both buttons..etc.
When the text is cut off, it typically either displays a question mark box (like the first image) or a random character/letter at the end. This character changes each time I run the application, but the question box is the most common.
For my padding "fix", it doesn't matter if the spaces are at the beginning or end of the string, as long as the number of spaces is enough. It also doesn't need to be spaces, any non-zero-width character works.
Compiled using MBCS (Multibyte Character Set) and the Windows 10 UTF-8 Unicode Support setting enabled. (As opposed to compiling with UNICODE defined which isn't an option. Large old codebase)
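The byte counts mentioned above (1, 2 and 3) are what UTF-8 uses for ASCII, for accented Latin letters and for Japanese characters respectively. As a small diagnostic sketch, with a made-up helper name, the following counts the characters in a UTF-8 string so the result can be compared with the byte length that narrow APIs work with. It is offered only as a way to check whether some length is being computed in the wrong unit, not as a diagnosis of the truncation.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n; // lead byte or ASCII starts a character
    }
    return n;
}
```

For example, the single character "ロ" is the three bytes E3 83 AD, so size() reports 3 while utf8_code_points reports 1.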
EDIT: Here is an example on how the text is set
GetDlgItem(IDC_SOME_CTRL_ID)->SetWindowText(GetTranslation("Some String"));
Where GetTranslation() is our own function to look up the translation of "Some String" (basically a lookup table) and return a CString. Using a debugger I can see the returned CString always has the correct string value. I can replace GetTranslation with a hardcoded Japanese string and the issue will still happen.
EDIT 2: I got complaints that this code wasn't enough.
myapp.rc
// Microsoft Visual C++ generated resource script.
//
#include "resource.h"
#define APSTUDIO_READONLY_SYMBOLS
#include "afxres.h"
#undef APSTUDIO_READONLY_SYMBOLS
IDD_VIEW_MENU DIALOGEX 0, 0, 50, 232
STYLE DS_SETFONT | WS_CHILD
FONT 14, "Verdana", 0, 0, 0x1
BEGIN
CONTROL "btn0",IDC_BUTTON_MENU_0,"Button",BS_3STATE | BS_PUSHLIKE,12,38,25,13
END
#endif
resource.h
#define IDC_BUTTON_MENU_0 6040
ViewMenu.cpp
#include "stdafx.h"
#include "ViewMenu.h"
CViewMenu::CViewMenu() : CFormView(CViewMenu::IDD)
{
}
void CViewMenu::DoDataExchange(CDataExchange* pDX)
{
CFormView::DoDataExchange(pDX);
DDX_Control(pDX, IDC_BUTTON_MENU_0, m_ctrlMenuButton0);
}
void CViewMenu::OnInitialUpdate()
{
CFormView::OnInitialUpdate();
}
void CViewMenu::OnDraw(CDC* pDC)
{
CFormView::OnDraw(pDC);
GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");
return;
}
ViewMenu.h
#include "resource.h"
class CViewMenu : public CFormView
{
protected:
CViewMenu();
public:
enum { IDD = IDD_VIEW_MENU };
CButton m_ctrlMenuButton0;
};
The following should work in Windows 10 versions 1903 and later, regardless of the default system locale, and fulfills OP's requirements (string literals, MBCS build, no Unicode windows etc). It was verified to work in version 2004 set to En-US locale, without "Beta: Use Unicode UTF-8 for worldwide language support" checked, using VS 2019 16.7.5 to build.
Save source files containing characters outside the active codepage in UTF-8 encoding, with or without BOM.
Compile with _MBCS defined (in the IDE: Properties / Advanced / Character Set = MBCS).
Compile with the /utf-8 switch (C/C++ / Command Line / Additional Options = /utf-8).
Create a manifest file declaring UTF-8 as the target codepage for the process (per the activeCodePage documentation).
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">
<asmv3:application>
<asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">
<activeCodePage>UTF-8</activeCodePage>
</asmv3:windowsSettings>
</asmv3:application>
</assembly>
Add the manifest file to the project (in the IDE: Manifest Tool / General / Input and Output / Additional Manifest Files = manifest file created at the previous step).
This ain't Python. With C++ you need to know why your code works. Otherwise it doesn't.
GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");
That's where you and your compiler start to disagree. You think this should be UTF-8. Your compiler, on the other hand, trusts you, and assumes that you are using the source character set.
While you are unaware of a concept called source character set, you get all confused about something that should be the norm: Garbage in, garbage out.
If you feel like fixing the "Garbage in" part (now, clearly, that is your job), read up on C++ string literals. In case you don't make it to the end, the quickest way to fix your ungodly workaround is to use a u8 prefix.
Seriously, though, the real solution is to use Windows' native character encoding. Which, oddly, you seem to reject, even though you could use it, given a string literal. I mean, it's not like you have to change anything global. Just call SetWindowTextW and use an L prefix.
Just saying, you know...

Writing unicode(?) character directly from source code to WriteConsoleOutput

I'm trying to use WriteConsoleOutput from the WinApi to write characters to the command prompt window buffer. The thing is, I'd really like to be able to write characters such as ☺ directly into the source code, as-is, instead of using some kind of encoding/notation like '\uFFFF' or '0xFF', since I don't understand them too well (differences between codepages/character sets/etc.)
The code below showcases the simplest form of my problem. Running this code does not print ☺ into the command prompt window, but a question mark (?) instead.
#include <Windows.h>
int main()
{
HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
CHAR_INFO c[1] = {0};
COORD cS = {1, 1};
COORD cH = {0, 0};
SMALL_RECT sr = {0, 0, 0, 0};
c[0].Attributes = FOREGROUND_INTENSITY;
c[0].Char.UnicodeChar = '☺';
WriteConsoleOutput(h, c, cS, cH, &sr);
Sleep(5000);
return 0;
}
It is vital for my code to display output identically between all Windows versions, regardless of the languages installed/used. So to my knowledge (which admittedly is absolutely minimal), I'd need to set a specific codepage (one which would hopefully be supported by the command prompt in any language Windows).
I've tried:
• Changing from using the CHAR_INFO.UnicodeChar to CHAR_INFO.AsciiChar
• Fiddling around with SetConsoleCP and SetConsoleOutputCP functions, but I haven't got a clue on how to utilize them to help me with this problem.
• Changing the Visual Studio -> Project -> Project properties.. -> Character Set setting to every possible value.
• Using specifically either WriteConsoleOutputA or WriteConsoleOutputW in addition to the aforementioned settings
• Changing the source code file encoding to UTF-8 with(/out) signature.
In my project I'm programmatically setting the command prompt font to 8x8 Terminal, which to my knowledge does not support actual unicode characters. The available characters are displayed here. Those characters do include '☺', so I'm not entirely sure my question is about unicode. I have no idea anymore. Please help.
C source has to be ASCII only. If you embed non-ASCII characters in a C source file, an IDE might show them in what appears to be the correct form, but the compiler quite likely treats them differently, and the function you pass them to at run time can treat them differently still. It's just not portable or reliable. But you can use the escape sequence \x to embed arbitrary bytes in C strings.
UTF-8 is good for internal use, but Windows APIs don't yet support it, so you need to convert to Windows 16-bit chars (UTF-16, nearly but not quite) to display extended characters. You also have to ensure that you are calling the wide-character version of the Windows API. Most Windows API functions that take strings come in an A and a W version (ANSI and wide) for binary backwards compatibility. If you query the identifier in the IDE (go to definition etc.) you should see which version you have.
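As a sketch of what producing those 16-bit units involves (the function name is made up): ☺ is U+263A and fits into a single unit, so in the question's code the assignment could presumably be written with a wide literal such as L'\x263A' instead of the narrow '☺'. Code points above U+FFFF need two units, a surrogate pair.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                     // single unit
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

encode_utf16(0x263A) yields the single unit 0x263A, which is the value a WCHAR field such as CHAR_INFO::Char.UnicodeChar would hold for ☺.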

c++ Lithuanian language, how to get more than ascii

I am trying to use Lithuanian in my C++ application, but every attempt is unsuccessful.
The multi-byte character set is used. I have tried everything I could think of; I am new to C++ and have never tried to do anything in Lithuanian before.
I tried every variant of setlocale(LC_ALL, "en_US.utf8"); setlocale(LC_ALL, "Lithuanian");...
I researched for 2 hours and didn't find proper examples or a solution.
I have an average-sized project which needs Lithuanian translations from a database, and it can't understand most of "ĄČĘĖĮŠŲŪąčęėįšųū".
Compiler - "Visual studio 2013"
Database - sqlite3.
I can't even get simple strings (defined by myself) to work and output as Lithuanian in a win32 application.
In Windows use wide character strings (UTF-16 encoding (1), wchar_t type) for internal text handling, and preferably UTF-8 for external text files and networking.
Note that Visual C++ will translate narrow text literals from the source encoding to Windows ANSI, which is a platform-dependent usually single-byte encoding (you can check which one via the GetACP API function), i.e., Visual C++ has the platform-specific Windows ANSI as its narrow C++ execution character set.
But also do note that for an app restricted to non-Windows platforms, i.e. Unix-land, it makes practical sense to do everything in UTF-8, based on char type.
For the database communication you may need to translate to and from the program's internal text representation.
This depends on what the database interface requires, which is not stated.
Example for console output in Windows:
#include <iostream>
#include <fcntl.h>
#include <io.h>
auto main() -> int
{
_setmode( _fileno( stdout ), _O_WTEXT );
using namespace std;
wcout << L"ĄČĘĖĮŠŲŪąčęėįšųū" << endl;
}
To make this compile by default with g++, the source code encoding needs to be UTF-8. Then, to make it produce correct results with Visual C++ the source code encoding needs to be UTF-8 with BOM, which happily is also accepted by modern versions of g++. For otherwise the Visual C++ compiler will assume the Windows ANSI encoding and produce an incorrect UTF-16 string.
Not coincidentally this is the default meaning of UTF-8 in Windows, e.g. in the Notepad editor, namely UTF-8 with BOM.
But note that while in Windows the problem is that the main system compiler requires a BOM for UTF-8, in Unix-land the problem is the opposite, that many old tools can't handle the BOM (for example, even MinGW g++ 4.9.1 isn't yet entirely up to speed: it sometimes includes the BOM bytes, then incorrectly interpreted, in error messages).
1) On other platforms wide character text can be encoded in other ways, e.g. with UTF-32. In fact the Windows convention is in direct conflict with the C and C++ standards which require that a single wchar_t should be able to encode any character in the extended character set. However, this requirement was, AFAIK, imposed after Windows adopted UTF-16, so the fault probably lies with the politics of the C and C++ standardization process, not yet another Microsoft'ism.
Complexity of internationalisation
There are several related but distinct topics that can cause mismatches between them, making a trial-and-error approach very tedious:
type used for storing strings and chars: Windows uses wchar_t by default, but for most APIs you also have char equivalent functions
character set encoding: this defines how the chars stored in the type are to be understood, for example Unicode (UTF-8, UTF-16, UTF-32), 7-bit ASCII, 8-bit ANSI. In Windows, by default it is UTF-16 for wchar_t and ANSI/Windows for char
locale: defines, among other things, the character set assumptions when processing strings. This permits using language-independent functions like isalpha(i, loc), islower(i, loc), ispunct(i, loc) to find out whether a given character is alphabetic, lower-case alphabetic, or punctuation, for example to break down a user text into words. C++ offers portable functions here.
output codepage or font used to show a character to the user. This assumes that the font used shows the characters using the same character set used in the code internals.
source code encoding. For example your editor could assume an ANSI encoding, with the Windows-1252 character set.
Most typical errors
Problem number one is Win32 console output, as Unicode is not well supported by the console. But this is not your problem here.
Another cause of mismatch is the encoding of your text editor. It might not be Unicode, but use a Windows code page. In this case, you type "Č", the editor displays it as such, but the editor might use the Windows-1257 encoding for Lithuanian and store 0xC8 in the file. If you then display this literal with a Windows Unicode function, it will interpret 0xC8 as "Latin capital letter E with grave accent" (È) and print something else, as the right Unicode code point for "Č" is U+010C!
It can be even worse: the compiler may have its own assumptions about the character set encoding used and convert your literals into Unicode using false assumptions (it happened to me when I used some exotic code generation switch).
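The 0xC8 example can be spelled out as a tiny sketch: the same byte yields two different Unicode code points depending on which code page the reader assumes. The function names are made up, and the Windows-1257 table is deliberately reduced to the two entries for Č and č.

```cpp
#include <string>

// Interpret a byte as ISO 8859-1 / Windows-1252 (identical for 0xA0..0xFF):
// the code point simply equals the byte value.
char32_t as_latin1(unsigned char b) { return b; }

// Interpret a byte as Windows-1257. Only the two entries needed for the
// example are filled in; this is not a complete table.
char32_t as_cp1257(unsigned char b) {
    switch (b) {
        case 0xC8: return 0x010C; // Č
        case 0xE8: return 0x010D; // č
        default:   return b < 0x80 ? b : 0xFFFD; // ASCII, otherwise "unknown"
    }
}
```

as_cp1257(0xC8) gives U+010C (Č), the character the author typed, while as_latin1(0xC8) gives U+00C8 (È), the character that ends up on screen when the wrong code page is assumed.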
What to do?
To figure out what goes wrong, proceed by elimination:
First, for plain Windows, use the native Unicode setting. OK, it's UTF-16 and wchar_t instead of UTF-8, and thus comes with some drawbacks, but it's native and well supported.
Then use explicit Unicode escapes in literals, for example TEXT("\u010C") instead of TEXT("Č"). This avoids editor and compiler mismatches.
If it's still not the right character, make sure that your font FULLY supports Unicode. The default system font, for instance, doesn't, while most others do. You can easily check with the Windows font panel (Windows key+R, fonts, then click on "search char") to display the character table of your font.
Set fonts explicitly in your code
For example, a very tiny experiment :
...
case WM_PAINT:
{
hdc = BeginPaint(hWnd, &ps);
auto hf = CreateFont(24, 0, 0, 0, 0, TRUE, 0, 0, 0, 0, 0, 0, 0, L"Times New Roman");
auto hfOld = SelectObject(hdc, hf); // if you comment this out, € and Č won't display
TextOut(hdc, 50, 50, L"Test with éç € \u010C special chars", 30);
SelectObject(hdc, hfOld);
DeleteObject(hf);
EndPaint(hWnd, &ps);
break;
}

Why printf can display non-ASCII characters when "C" locale is used?

Note: I'm asking about implementation-defined behavior, on Microsoft Visual C++ 2008 (possibly the same on 2005+). OS: simplified Chinese installation of Win7.
It surprised me when I was performing non-ASCII I/O with printf. E.g.
// This won't be necessary as it's the system default code page.
//system("chcp 936");
// NULL to show current locale, which is "C"
printf ("%s\n", setlocale(LC_ALL, NULL));
printf ("中\n");
printf ("%s\n", setlocale(LC_ALL, "English"));
printf ("中\n");
Output:
Active code page: 936
C
中
English_United States.1252
?D
The memory footprint in debugger shows that "中" is encoded in two bytes: 0xD6, 0xD0, which is the code point of that character in code page 936, for simplified Chinese. It shouldn't be in the code point range of "C" locale which, most likely, is 0x0 ~ 0x7F.
Question:
Why can it still display the character correctly in the "C" locale? I first guessed that the locale has no bearing on printf. But then why can't it display the character anymore after changing to the "English" locale, whose code page is also different from 936?
Edit:
I redirected the standard output to a file and ran some tests. They show that whatever locale is set, the correct character "中" is saved in the file. This suggests that setlocale() is connected to the way the console displays the character, which contradicts my understanding of how it works: printf puts the bytes/code points into the input buffer of the console, which interprets these bytes using its own code page (what chcp returns).
936 is a rather tricky code page: it allows two-byte characters (similar to what UTF-8 does). By contrast, the Cyrillic code page (866) doesn't allow two-byte characters, and its behavior will be the same as "English".
So when you use the default (936) code page, it knows how to process a two-byte character, while "English" deals with 0x0 ~ 0x7f only.
Let me also answer why wprintf(L"中") fails. There is a big difference between a console application and a Windows (GUI) application: they use different code pages.
The following table matches console (DOS) code pages to Windows code pages:
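What "knows how to process a two-byte character" means at the byte level can be sketched as follows. In code page 936 a byte in the range 0x81..0xFE starts a two-byte character, so the bytes D6 D0 from the question form one character. The helper names are made up, and the sketch does not validate trail bytes.

```cpp
#include <cstddef>
#include <string>

// In code page 936 (GBK) a byte in 0x81..0xFE starts a two-byte character.
bool is_cp936_lead_byte(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

// Count characters in a cp936 byte string. A lead byte at the very end
// (truncated input) is counted as one character.
std::size_t cp936_char_count(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        if (is_cp936_lead_byte(static_cast<unsigned char>(s[i])) && i + 1 < s.size())
            ++i; // skip the trail byte of a two-byte character
    }
    return n;
}
```

A single-byte code page has no lead bytes at all, so the same two bytes would be treated there as two separate characters.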
DOS | Windows
------+----------
850 | 1252
936 | 54936
866 | 1251
So if you would like to see the correct symbols in the console, use WideCharToMultiByte first; that provides the expected conversion to allow the console to work in 936.
The fact that the C locale prints out the string exactly as given is not surprising. That's what I would expect. What is surprising is that the English locale would do something different.
According to the locale documentation on MSDN, the only effect that the locale should have on printf is in determining the radix character for numeric values (i.e. the decimal point).
I suspect that it's a bug in Microsoft's compiler. Or at the very least it's undocumented behaviour.
For what it's worth, on my compiler (Borland) the locale has no effect on the output of those strings. It does affect the radix though.
OK. For the default "C" locale, the CRT assumes that characters passed to printf don't need any conversion. This is reasonable, because the ASCII characters almost always fall into the basic character set of the execution system (shared among different Windows code pages). When switched to "English", it assumes the input is encoded in code page 1252, and thus tries to perform a conversion from "English" to "Chinese", which is the locale used by the console. But the CRT just cannot find the character 中 in code page 1252. That's why it outputs a question mark.
When redirected to a file, CRT knows it and won't do the conversion, because the console code page is no longer used. It just passes through the bytes as-is. How those bytes are interpreted is up to the program you use(e.g., care about BOM or not) when you open the file.
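The substitution described above can be imitated with a short sketch. This is not the CRT's actual code, only an illustration of the effect: a character with no slot in the target single-byte code page is replaced by a question mark. The function name is made up, and ISO 8859-1 stands in for code page 1252 (the two agree outside 0x80..0x9F).

```cpp
#include <string>

// Convert code points to a single-byte string, using ISO 8859-1 as a
// stand-in for code page 1252. Anything that does not fit into one byte
// becomes '?'.
std::string to_latin1_lossy(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        out += cp <= 0xFF ? static_cast<char>(static_cast<unsigned char>(cp)) : '?';
    }
    return out;
}
```

With this stand-in, 中 (U+4E2D) comes out as "?", while a character such as ö (U+00F6) survives as the single byte F6.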
Refer to this MSDN forum link: Why printf can display non-ASCII characters when “C” locale is used?