Can I get a code page from a language preference? - c++

Windows seems to keep track of at least four dimensions of "current locale":
http://www.siao2.com/2005/02/01/364707.aspx
DEFAULT USER LOCALE
DEFAULT SYSTEM LOCALE
DEFAULT USER INTERFACE LANGUAGE
DEFAULT INPUT LOCALE
My brain hurts just trying to keep track of what the hell four separate locale's are useful for...
However, I don't grok the relationship between code page and locale (or LCID, or Language ID), all of which appear to be different (e.g. Japanese (Japan) is LANGID = 0x411 location code 1, but the code page for Japan is 932).
How can I configure our application to use the user's desired language as the default MBCS target when converting between Unicode and narrow strings?
That is to say, we used to be an MBCS application. Then we switched to Unicode. Things work well in English, but fail in Asian languages, apparently because Windows conversion functions WideCharToMultiByte and MultiByteToWideChar take an explicit code page (not a locale ID or language ID), which can be set to CP_ACP (default to ANSI code page), but don't appear to have a value for "default to user's default interface language's code page".
I mean, this is some seriously convoluted twaddle. Four separate dimensions of "current language", three different identifier types, as well as (different) string-identifiers for C library and C++ standard library.
In our previous MBCS builds, disk I/O and user I/O worked correctly: everything remained in the DEFAULT SYSTEM LOCALE (Windows XP term: "Language for non-Unicode Programs"). But now, in our UNICODE builds, everything tries to use "C" as the locale, and file I/O fails to properly transcode UNICODE to user's locale, and vice verse.
We want to have text files written out (when narrow) using the current user's language's code page. And when read in, the current user's language's code page should be converted back to UNICODE.
Help!!!
Clarification: I would ideally like to use the MUI language code page rather than the OS default code page. GetACP() returns the system default code page, but I am unaware of a function that returns the user's chosen MUI language (which auto-reverts to system default if no MUI specified / installed).

I agree with the comments by Jon Trauntvein, the GetACP function does reflect the user's language settings in the control panel. Also, based on the link to the "sorting it all out" blog, that you provided, DEFAULT USER INTERFACE LANGUAGE is the language that the Windows user interface will use, which is not the same as the language to be used by programs.
However, if you really want to use DEFAULT USER INTERFACE LANGUAGE then you get it by calling GetUserDefaultUILanguage and then you can map the language id to a code page, using the following table.
Language Identifiers and Locales
You can also use the GetLocaleInfo function to do the mapping, but first you would have to convert the language id that you got from GetUserDefaultUILanguage into a locale id, and I think you will get the name of the code page instead of a numeric value, but you could try it and see.

If all you want to be able to do is configure a locale object to use the currently selected locale settings, you should be able to do something like this:
std::locale loc = std::locale("");
You can also access the current code page in windows using the Win32 ::GetACP() function. Here is an example that I implemented in a string class to append multi-byte characters to a unicode string:
void StrUni::append_mb(char const *buff, size_t buff_len)
{
UINT current_code_page = ::GetACP();
int space_needed;
if(buff_len == 0)
return;
space_needed = ::MultiByteToWideChar(
current_code_page,
MB_PRECOMPOSED | MB_ERR_INVALID_CHARS,
buff,
buff_len,
0,
0);
if(space_needed > 0)
{
reserve(this->buff_len + space_needed + 1);
MultiByteToWideChar(
current_code_page,
MB_PRECOMPOSED | MB_ERR_INVALID_CHARS,
buff,
buff_len,
storage + this->buff_len,
space_needed);
this->buff_len += space_needed;
terminate();
}
}

Just use CW2A() or CA2W() which will take care of the conversion for you using the current system locale (or language used for non-Unicode applications).

FWIW, this is what I ended up doing:
#define _CONVERSION_DONT_USE_THREAD_LOCALE // force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-conveters!!!
In application startup, construct the desired locale: m_locale(FStringA(".%u", GetACP()).GetString(), LC_CTYPE)
force it to agree with GetACP(): // force C++ and C libraries based on setlocale() to use system locale for narrow strings
m_locale = ::std::locale::global(m_locale); // we store the previous global so we can restore before termination to avoid memory loss
This gives me relatively ideal use of MFC's built-in narrow<->wide conversions in CString to automatically use the user's default language when converting to or from MBCS strings for the current locale.
Note: m_locale is type ::std::locale

Related

How do I use system("chcp 936") in my dialog based project?

The code below is supposed to convert a wstring "!" to a string and output it,
setlocale(LC_ALL, "Chinese_China.936");
//system("chcp 936");
std::wstring ws = L"!";
string as((ws.length()) * sizeof(wchar_t), '-');
auto rs = wcstombs((char*)as.c_str(), ws.c_str(), as.length());
as.resize(rs);
cout << rs << ":" << as << endl;
If you run it without system("chcp 936");, the converted string is "£¡" rather than "!". If with system("chcp 936");, the result is correct in a console project.
But on my Dialog based project, system("chcp 936")is useless, even if it's workable, I can't use it, because it would popup a console.
PS: the IDE is Visual Studio 2019, and my source code is stored as in UTF-8 with signature.
My operation system language is English and language for non-unicode programs is English (United States).
Edit: it's interesting, even with "en-US" locale, "!" can be converted to an ASCII "!".
But I don't get where "£¡" I got in the dialog based project.
There are two distinct points to considere with locales:
you must tell the program what charset should be used when converting unicode characters to plain bytes (this is the role for setlocale)
you must tell the terminal what charset it should render (this is the role for chcp in Windows console)
The first point depends on the language and optionaly libraries that you use in your program (here the C++ language and Standard Library)
The second point depends on the console application and underlying system. Windows console uses chcp, and you will find in that other post how you can configure xterm in a Unix-like system.
I found out the cause, the wstring-to-string conversion is no problem, the problem was I used CA2T to convert the Chinese punctuation mark and it failed. So it showed "£¡" in the UI finally.
By means of mbstowcs, the counterpart of wcstombs, it would work.

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square then this indicates that the ä is encoded with some iso encoding that is not UTF-8, maybe something like latin1. To fix this you need to open it in the editor in which you can change the encoding of the file, change it to uft-8 (without BOM) and fix the ä in the title.
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win),"Nämen")
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8
Trusty gcc checks the system's settings1 and falls back on UTF-8, when it fails to determine one.
MSVC checks for BOM and uses the detected encoding; otherwise it assumes that file is encoded using the users current code page2.
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your systems code page. Arguably the worst solution.
Prefix your strings with u8, escape all unicode literals and convert the string to wide char before passing Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote ^3
const int l = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
wchar_t* buf = _malloca(l * sizeof(wchar_t));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, l);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest4 with the activeCodePage set to UTF-8. This way all Win32 functions with the A-suffix (e.g. SetWindowTextA) now accept and properly handle UTF-8 strings, if the running system is at least or newer than Windows Version 1903.
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
1 I suppose it reads the LC_ALL setting on linux. In the last 6 years I have never seen a distribution, that does NOT specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong on how gcc handles this now.
2 If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...].
3 GLFW performs the conversion as seen here.
4 More about Win32 ANSI-APIs, Manifest and UTF-8 can be found here.

Understanding Multibyte/Unicode

I'm just getting back into Programming C++, MFC, Unicode. Lots have changed over the past 20 years.
Code on another project compiled just fine, but had errors when I paste it into my code. It took me 1-1/2 days of wasted time to solve the function call below:
enter code here
CString CFileOperation::ChangeFileName(CString sFileName)
{
char drive[MAX_PATH], dir[MAX_PATH], name[MAX_PATH], ext[MAX_PATH];
_splitpath_s(sFileName, drive, dir, name, ext); //error
------- other code
}
After reading help, I changed the CString sFileName to use a cast:
enter code here
_splitpath_s((LPTCSTR)sFileName, drive, dir, name, ext); //error
This created an error too. So then I used GetBuffer() which is really the same as above.
enter code here
char* s = sFileName.GetBuffer(300);
_splitpath_s(s, drive, dir, name, ext); //same error for the 3rd time
sFileName.ReleaseBuffer();
At this point I was pretty upset, but finally realized that I needed to change the CString to Ascii (I think because I'm set up as Unicode).
hence;
enter code here
CT2A strAscii(sFileName); //convert CString to ascii, for splitpath()
then use strAscii.m_pz in the function _splitpath_s()
This finally worked. So after all this, to make a story short, I need help focusing on:
1. Unicode vs Mulit-Byte (library calls)
2. Variables to uses
I'm willing to purchase another book, please recommend.
Also, is there a way to filter my help on VS2015 so that when I'm on a variable and press F1, it only gives me help for Unicode and ways to convert old code to unicode or convert Mylti-Byte to Unicode.
Hope this is not to confusing, but I have some catching up to do. Be patient if my verbiage is not perfect.
Thanks in advance.
The documentation of _splitpath lists a Unicode (wchar_t based) version _wsplitpath. That's the one you should be using. Don't convert to ASCII or Windows ANSI, that will in general lose information and not produce a valid path when you recombine the pieces.
Modern Windows programming is Unicode based.
A Visual Studio C++ project is Unicode-based by default, in particular it defines the macro symbol UNICODE, which affects the declarations from <windows.h>.
All supported versions of Windows use Unicode internally throughout, and your application should, too. Windows uses UTF-16 encoding.
To make your application Unicode-enabled you need to perform the following steps:
Set up your project's Character Set to "Use Unicode Character Set" (if it's currently set to "Use Multi-Byte Character Set"). This is not strictly required, but it deals with those cases, where you aren't using the Unicode version explicitly.
Use wchar_t (in place of char or TCHAR) for your strings.
Use wide character string literals (L"..." in place of "...").
Use CStringW (in place of CStringA or CString) in an MFC project.
Explicitly call the Unicode version of the CRT (e.g. wcslen in place of strlen or _tcslen).
Explicitly call the Unicode version of any Windows API call where it exists (e.g. CreateWindowExW in place of CreateWindowExA or CreateWindowEx).
Try using _tsplitpath_s and TCHAR.
So the final code looks something like:
CString CFileOperation::ChangeFileName(CString sFileName)
{
TCHAR drive[MAX_PATH], dir[MAX_PATH], name[MAX_PATH], ext[MAX_PATH];
_tsplitpath_s(sFileName, drive, dir, name, ext); //error
------- other code
}
This will enable C++ compiler to use the correct character width during build time depending on the project settings

c++ Lithuanian language, how to get more than ascii

I am trying to use Lithuanian in my c++ application, but every try is unsuccesfull.
Multi-byte character set is used. I have tryed everything i have tought of, i am new in c++. Never ever tryed to do something in Lithuanian.
Tryed every setlocale(LC_ALL, "en_US.utf8"); setlocale(LC_ALL, "Lithuanian");...
Researched for 2 hours and didnt found proper examples, solution.
I do have a average sized project which needs Lithuanian translation from database and it cant understand most of "ĄČĘĖĮŠŲŪąčęėįšųū".
Compiler - "Visual studio 2013"
Database - sqlite3.
I cant get simple strings to work(defined myself), and output as Lithuanian to win32 application, even.
In Windows use wide character strings (1UTF-16 encoding, wchar_t type) for internal text handling, and preferably UTF-8 for external text files and networking.
Note that Visual C++ will translate narrow text literals from the source encoding to Windows ANSI, which is a platform-dependent usually single-byte encoding (you can check which one via the GetACP API function), i.e., Visual C++ has the platform-specific Windows ANSI as its narrow C++ execution character set.
But also do note that for an app restricted to non-Windows platforms, i.e. Unix-land, it makes practical sense to do everything in UTF-8, based on char type.
For the database communication you may need to translate to and from the program's internal text representation.
This depends on what the database interface requires, which is not stated.
Example for console output in Windows:
#include <iostream>
#include <fcntl.h>
#include <io.h>
auto main() -> int
{
_setmode( _fileno( stdout ), _O_WTEXT );
using namespace std;
wcout << L"ĄČĘĖĮŠŲŪąčęėįšųū" << endl;
}
To make this compile by default with g++, the source code encoding needs to be UTF-8. Then, to make it produce correct results with Visual C++ the source code encoding needs to be UTF-8 with BOM, which happily is also accepted by modern versions of g++. For otherwise the Visual C++ compiler will assume the Windows ANSI encoding and produce an incorrect UTF-16 string.
Not coincidentally this is the default meaning of UTF-8 in Windows, e.g. in the Notepad editor, namely UTF-8 with BOM.
But note that while in Windows the problem is that the main system compiler requires a BOM for UTF-8, in Unix-land the problem is the opposite, that many old tools can't handle the BOM (for example, even MinGW g++ 4.9.1 isn't yet entirely up to speed: it sometimes includes the BOM bytes, then incorrectly interpreted, in error messages).
1) On other platforms wide character text can be encoded in other ways, e.g. with UTF-32. In fact the Windows convention is in direct conflict with the C and C++ standards which require that a single wchar_t should be able to encode any character in the extended character set. However, this requirement was, AFAIK, imposed after Windows adopted UTF-16, so the fault probably lies with the politics of the C and C++ standardization process, not yet another Microsoft'ism.
Complexity of internationalisation
There are several related but distinct topics that can cause mismatches between them, making try and error approach very tedious:
type used for storing strings and chars: windows iuses wchar_t by default, but for most APIs you have also char equivalents functions
character set encoding this defines how the chars stored in the type are to be understood. For exemple unicode (UTF8, UTF16, UTF32), 7 bits ascii, 8 bit ansii. In windows, by default it is UTF16 for wchar_t and ansi/windows for char
locale defines, among other things, the character set asumptions, when processing strings. This permit to use language independent functions like isalpha(i, loc), islower(i, loc), ispunct(i, loc) to find out if a given character is alphanumeric, a lower case alphabetic, or a punctuation, for example to bereak down a user text into words. C++ offers here portable functions.
output codepage or font used to show a character to the user. This assumes that the font used shows the characters using the same character set used in the code internals.
source code encoding. For example your editor could assume an ansi encoding, with windows 1252 character set.
Most typical errors
The problem n°1 is Win32 console output, as unicode is not well supported by the console. But this is not your problem here.
Another cause of mismatch is the encoding of your text editor. It might not be unicode, but use a windows code page. In this case, you type "Č", the deditor displays it as such, but editor might use windows 1257 encoding for lithuanian and store 0xC8 in the file. If you then display this literal with a windows unicode function, it will interpret 0xC8 as "latin E grave accent" and print something else, as the right unicode encoding for "Č" is 0x010C !
I can be even worse: the compiler may have its own assumption about character set encoding used and convert your litterals into unicode using false assumptions (it happened to me when I used some exotic code generation switch).
How to do ?
To figure out what goes wront, proceed by elimination:
First, for plain windows, use the native unicode setting. Ok it's UTF16 and wchar_t instead of UTF8 and as thus comes with some drawbacks, but it's native and well supported.
Then use explict unicode coding in litterals, for example TEXT("\u010C") instead of TEXT("Č"). This avoids editor and compiler mismatch.
If it's still not the right character, make sure that your font FULLY supports unicode. The default system font for instance doesn't while most other do. You can easily check with the windows font pannel (WindowKey+R fonts then click on "search char") to display the character table of your font.
Set fonts explicitely in your code
For example, a very tiny experiment :
...
case WM_PAINT:
{
hdc = BeginPaint(hWnd, &ps);
auto hf = CreateFont(24, 0, 0, 0, 0, TRUE, 0, 0, 0, 0, 0, 0, 0, L"Times New Roman");
auto hfOld = SelectObject(hdc, hf); // if you comment this out, € and Č won't display
TextOut(hdc, 50, 50, L"Test with éç € \u010C special chars", 30);
SelectObject(hdc, hfOld);
DeleteObject(hf);
EndPaint(hWnd, &ps);
break;
}

Printing out Korean in console C++

I am having trouble with printing out korean.
I have tried various methods with no avail.
I have tried
1.
cout << "한글" << endl;
2.
wcout << "한글" << endl;
3.
wprintf(L"한글\n");
4.
setlocale(LC_ALL, "korean");
wprintf("한글");
and more. But all of those prints "한글".
I am using MinGW compiler, and my OS is windows 7.
P.S Strangely Java prints out Korean fine,
String kor = "한글";
System.out.println(kor);
works.
Set the console codepage to utf-8 before printing the text
::SetConsoleOutputCP(65001)
Since you are using Windows 7 you can use WriteConsoleW which is part of the windows API. #include <windows.h> and try the following code:
DWORD numCharsToWrite = str.length();
LPDWORD numCharsWritten = NULL;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), numCharsToWrite, numCharsWritten, NULL);
where str is the is a std::wstring
More on WriteConsoleW: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687401%28v=vs.85%29.aspx
After having tried other methods this worked for me.
Problem is that there are a lot of places where this could be broken.
Here is answer I've posted some time ago (covers Korean). Answear is for MSVC, but same applies to MinGW (compiler switches are different, locale name may be different).
Here are 5 traps which makes this hard:
Source code encoding. Source has to use encoding which supports all required characters. Nowadays UTF-8 is recommended. It is best to make sure your editor (IDE) is properly configure to enforce source encoding.
You have to inform compiler what is encoding of source file. For gcc it is: -finput-charset=utf-8 (it is default)
Encoding used by executable. You have to define what kind of encoding string literals should be encode in final executable. This encoding should also cover required characters. Here UTF-8 is also the best. Gcc option is -fexec-charset=utf-8
When you run application you have to inform standard library what kind of encoding your string literals are define in or what encoding in program logic is used. So somewhere in your code at beginning of execution you need something like this (here UTF-8 is enforced):
std::locale::global(std::locale{".utf-8"});
and finally you have to instruct stream what kind of encoding it should use. So for std::cout and std::cin you should set locale which is default for the system:
auto streamLocale = std::locale{""};
// this impacts date/time/floating point formats, so you may want tweak it just to use sepecyfic encoding and use C-loclae for formating
std::cout.imbue(streamLocale);
std::cin.imbue(streamLocale);
After this everything should work as desired without code which explicitly does conversions.
Since there are 5 places to make mistake, this is reason people have trouble with it and internet is full of "hack" solutions.
Note that if system is not configured for support all needed characters (for example wrong code page is set) then with thsi configuration characters which could not be converted will be replaced with question mark.