"C.UTF-8" C++ locale on Windows? - c++

I'm in the process of fixing a large open source cross-platform application such that it can handle file paths containing non-ANSI characters on Windows.
Update:
Based on answers and comments I got so far (thanks!) I feel like I should clarify some points:
I cannot modify the code of dozens of third-party libraries to use wchar_t. That is simply not an option. The solution has to work with plain ol' std::fopen(), std::ifstream, etc.
The solution I outline below works 99% of the way, at least on the system I'm developing on (Windows 10 version 1909, build 18363.535). I haven't tested on any other system yet.
The only remaining issue, at least on my system, is basically number formatting and I'm hopeful that replacing the std::numpunct facet does the trick (but I haven't succeeded yet).
My current solution involves:
Setting the C locale to .UTF-8 for the LC_CTYPE category on Windows (all other categories are set to the C locale as required by the application):
// Required by the application.
std::setlocale(LC_ALL, "C");

// On Windows, we want std::fopen() and other functions dealing with strings
// and file paths to accept narrow-character strings encoded in UTF-8.
#ifdef _WIN32
{
#ifndef NDEBUG
    char* new_ctype_locale =
#endif
        std::setlocale(LC_CTYPE, ".UTF-8");
    assert(new_ctype_locale != nullptr);
}
#endif
Configuring boost::filesystem::path to use the en_US.UTF-8 locale so that it too can deal with paths containing non-ANSI characters:
boost::filesystem::path::imbue(std::locale("en_US.UTF-8"));
The last missing bit is to fix file I/O using C++ streams such as
std::ifstream istream(filename);
The simplest solution is probably to set the global C++ locale at the beginning of the application:
std::locale::global(std::locale("en_US.UTF-8"));
However that messes up formatting of numbers, e.g. 1234.56 gets formatted as 1,234.56.
Is there a locale that just specifies the encoding to be UTF-8 without messing with number formatting (or other things)?
Basically I'm looking for the C.UTF-8 locale, but that doesn't seem to exist on Windows.
Update: I suppose one solution would be to reset some (most? all?) of the facets of the locale, but I'm having a hard time finding information on how to do that.

The Windows API does not respect the CRT locales, and the CRT implementations of fopen etc. call the narrow-char API directly, so changing the locale will not affect the encoding.
However, Windows 10 May 2019 Update (version 1903) introduced support for UTF-8 in its narrow-char APIs. It can be enabled by embedding an appropriate manifest into your executable. Unfortunately it's a very recent addition, and so it might not be an option if you need to target older systems.
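For reference, the manifest fragment Microsoft documents for opting a process into the UTF-8 code page looks like this (the activeCodePage element only takes effect on Windows 10 1903 and later):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SDK/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

With this in place, GetACP() returns CP_UTF8 and the narrow-char "A" APIs (and hence fopen, ifstream, etc.) interpret paths as UTF-8.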
Your other options include converting manually to wchar_t or using a layer that does that for you (like Boost.Filesystem, or even better, Boost.Nowide).

Never mind locales.
On Windows you should use Microsoft's extension that adds a constructor taking const wchar_t* (expected to point to UTF-16) to std::ifstream.
Hopefully all your strings are UTF-8, or otherwise some consistent and sane encoding.
So just grab a UTF-8 → UTF-16 converter (they're lightweight) and pass filenames to std::ifstream as UTF-16 (as a const wchar_t*).
(Be sure to #ifdef it out so it doesn't get attempted on any other platform.)
You should also use _wfopen instead of std::fopen, in the same way, for the same reason.
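A sketch of that pattern (the helper names widen and open_utf8 are made up for illustration; on Windows it relies on the documented MSVC wide-filename constructor, elsewhere it passes the UTF-8 bytes straight through):

```cpp
#include <fstream>
#include <string>

#ifdef _WIN32
#include <windows.h>

// UTF-8 -> UTF-16 via the Win32 converter.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &out[0], len);
    return out;
}
#endif

// Open a stream from a UTF-8 encoded path on any platform.
std::ifstream open_utf8(const std::string& utf8_path) {
#ifdef _WIN32
    // MSVC extension: wide-character filename constructor.
    return std::ifstream(widen(utf8_path));
#else
    // POSIX treats paths as opaque bytes; UTF-8 passes through unchanged.
    return std::ifstream(utf8_path);
#endif
}
```

The same #ifdef split works for _wfopen versus fopen.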
That's it.

Related

How to properly localize a cross platform program?

I am currently making a game engine that will eventually support all platforms. Currently I am working on Windows support with the Win32 API. Reading the documentation, it suggests that I use wide strings/chars and the Unicode versions of API functions so that my application can be localized. But if I use wide versions of everything (wcout, wstring, wchar_t, etc.), I will have to make my entire game engine use those wide types. That also means that when working with other platforms, I will also have to use wide types, or I will have to convert between them.
My idea is that my code could be compiled with wide string types on Windows and with normal string types on other platforms, perhaps with macro definitions. Is that the best way to do this? And how would I go about doing it?
I also don't really understand how Unicode works in C++. If I set my system locale to English, I get a compiler warning from MSVC if I have any Chinese characters stored in a normal string type. However, when I set my system locale to Chinese and enable UTF-8 encoding, I get no compiler warnings when storing Unicode characters in normal strings. I also have no idea how Unicode works on other platforms. Can somebody explain this for me?
Thanks in advance.

How does the OS understand C strings if its encoding is implementation-defined?

I read that C strings are just byte strings and the encoding used to store text is implementation-defined. If this is so, then how does the operating system know how to interpret a string when we make system call to open a file or execute a program? I noticed that the fopen function in the stdio.h takes a C string.
The other part of my question, who decides the encoding that C strings should use? Is this something decided arbitrarily by the compiler, or is it decided by the operating system?
Your OS and/or C/C++ standard library expect a certain encoding in the APIs they provide. They can also provide ways to change the encoding (certain std::locales on Windows enabling UTF-8 in file paths, etc).
The compiler may either trust you on the encoding (preserve the source file encoding in the strings), or may change the encoding to one appropriate for the target OS.
All of this is mostly theory. ASCII is ubiquitous, so both the sources and the APIs almost always use some superset of ASCII. Most of the time this superset is UTF-8, though Windows has some quirks: by default the terminal expects a legacy encoding instead of UTF-8 (last checked with MinGW on Windows 7; not sure if that's still the case), and the standard library functions don't understand UTF-8 paths by default (at least in the old C standard library some MinGW versions use).
Some compilers let you customize the encoding, e.g. GCC has -finput-charset=A -fexec-charset=B which converts from A to B. This option doesn't work in Clang though (as of Clang 14), so you shouldn't rely on it.
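If you can't rely on those flags, one portable fallback is to avoid non-ASCII source text entirely and spell the bytes out with escapes, which survive any source/execution charset combination (the constant names below are made up for illustration):

```cpp
#include <cstring>

// U+00E9 ("é") spelled two ways; with a UTF-8 execution charset
// (the GCC/Clang default) both produce the same two bytes.
const char* const kEscaped   = "\xC3\xA9";  // explicit UTF-8 bytes, charset-proof
const char* const kUniversal = "\u00E9";    // converted to the execution charset
```

The explicit-bytes form is immune even to compilers whose default execution charset is not UTF-8 (e.g. MSVC without /utf-8).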

What are the disadvantages to not using Unicode in Windows?

What are the disadvantages to not using Unicode on Windows?
By Unicode, I mean WCHAR and the wide API functions. (CreateWindowW, MessageBoxW, and so on)
What problems could I run into by not using this?
Your code won't be able to deal correctly with characters outside the currently selected codepage when calling system APIs.
Typical problems include unsupported characters being translated to question marks, inability to process text with special characters, in particular files with "strange characters" in their names/paths.
Also, several newer APIs are present only in the "wide" version.
Finally, each API call involving text will be marginally slower, since the "A" versions of the APIs are normally just thin wrappers around the "W" versions that convert the parameters to UTF-16 on the fly, so you pay some overhead compared to a plain "W" call.
Nothing stops you from working with a narrow-character Unicode encoding (i.e. UTF-8) inside your application, but the Windows "A" APIs don't speak UTF-8, so you'd have to convert to UTF-16 and call the "W" versions anyway.
I believe the gist of the original question was: should I compile all my Windows apps with _UNICODE defined, and what's the downside if I don't?
My original reply was: "Yeah, you should. We've moved past 8-bit ASCII, and _UNICODE is a reasonable default for any modern Windows code."
For Windows, I still believe that's reasonably good advice. But I've deleted my original reply. Because I didn't realize until I re-read my own links how much "UTF-16 is quite a sad state of affairs" (as Matteo Italia eloquently put it).
For example:
http://utf8everywhere.org/
Microsoft has ... mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for narrow string WinAPI, one must compile her code with _UNICODE rather than _MBCS. Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text.
I heartily recommend these three links:
The Absolute Minimum Every Software Developer Should Know about Unicode
Should UTF-16 Be Considered Harmful?
UTF-8 Everywhere
IMHO...

wopen calls when porting to Linux

I have an application which was developed under Windows, but for gcc. The code is mostly OS-independent, with very few classes which are Windows specific because a Linux port was always regarded as necessary.
The API, especially that which gets called as a direct result of user interaction, is using wide char arrays instead of char arrays (as a side note, I cannot change the API itself - at this point, std::wstring cannot be used). These are considered as encoded in UTF-16.
In some places, the code opens files, mostly using the windows-specific _wopen function call. The problem with this is there is no wopen-like substitute for Linux because Linux "only deals with bytes".
The question is: how do I port this code? If I wanted to open a file with the name "something™.log", how would I go about doing so on Linux? Is a cast to char* sufficient? Would the wide chars be picked up automatically based on the locale (probably not)? Do I need to convert manually? I'm a bit confused about this; perhaps someone could point me to some documentation regarding the matter.
The strategy I took on Mac hinges on the fact that Mac OS X uses UTF-8 in all its POSIX file I/O APIs.
I thus created a type "fschar" that's a char in Windows non-Unicode builds, wchar_t in Windows UNICODE builds, and char (again) when building for Mac OS.
I pass around all file system strings using this type. String literals are wrapped (TEXT("literal")) to get the correct encoding; all my data files store UTF-8 characters on disk which, on Windows UNICODE builds, I convert to UTF-16 with MultiByteToWideChar.
Linux does not support UTF-16 filenames. It does, however, support UTF-8 filenames, and those can be opened using plain old fopen().
What you should do is convert your wide strings to UTF-8.
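A sketch of that conversion, hand-rolled so it has no dependencies (it assumes well-formed UTF-16 input and does no error reporting; in production you'd likely use a library such as ICU or iconv):

```cpp
#include <string>

// Minimal UTF-16 -> UTF-8 converter. Handles surrogate pairs;
// assumes the input is valid UTF-16.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Emit 1-4 UTF-8 bytes depending on the code point's range.
        if (cp < 0x80) {
            out += char(cp);
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting narrow string can then be passed straight to fopen() or open() on Linux.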

Windows Codepage Interactions with Standard C/C++ filenames?

A customer is complaining that our code used to write files with Japanese characters in the filename but no longer works in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file with an embedded filename in it exported from our software, and it looks like the strings use hex characters 82 and 83 as the first character of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably SHIFT_JIS and/or Windows codepage 932.
It looks to me like what is happening is previously both fopen and ofstream::open accepted filenames using this codepage; now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.
In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.
Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a longshot for explaining the issue we are seeing.
Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and happened regardless of whether I used the default locale or the "C" locale.
I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.
Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page.
It means that it can be a Japanese character represented as Shift-JIS (cp932), or Simplified Chinese (GBK/cp936), Traditional Chinese (Big5/cp950), Korean, Arabic, Russian, you name it (as long as it matches the OS system code page).
It also means that it can use Japanese file names on a Japanese system only.
Change the system code page and the application "stops working".
I suspect this is what happens here (no big changes in Windows since Win 2000, in this area).
This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a
In the long run you might consider moving to Unicode (and using _wfopen, wofstream).
I'm not aware of any portable way of using unicode files using default system libraries. But there are some frameworks that provide portable functions, for example:
for C: glib uses filenames in UTF-8;
for C++: glibmm also uses filenames in UTF-8, requires glib;
for C++: boost can use wstring for filenames.
I'm pretty sure .NET/mono frameworks also do contain portable filesystem functions, but I don't know them.
Is somebody still watching this? I've just researched this question, found no answers anywhere, so I'll try to explain my findings here.
In VS2005 the fstream filename handling is the odd man out: it doesn't use the system default encoding (the one you get with GetACP and set in Control Panel / Region and Language / Administrative) but, I believe, always CP 1252.
This can cause big confusion, and Microsoft has removed this quirk in later VS versions.
All workarounds for VS2005 have their drawbacks:
Convert your code to use Unicode everywhere
Never open fstreams using narrow-character filenames; always convert them to Unicode yourself using the system default encoding, then use the wide-character filename open()/constructor
Retrieve the codepage using GetACP(), then do a matching setlocale:
setlocale(LC_ALL, ("." + lexical_cast<string>(GetACP())).c_str());
I'm nearly certain that on Linux, the filename string is a UTF-8 string (on the EXT3 filesystem, for example, the only disallowed characters are the slash and the NUL byte), stored in a normal char *. The man page doesn't seem to mention character encoding, which is what leads me to believe it is the system standard of UTF-8. OS X likely uses the same, since it comes from similar roots, but I am less sure about this.
You may have to set the thread locale to the system default locale.
See here for a possible reason for your problems:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100887
Mac OS X uses Unicode as its native character encoding. The basic string objects are CFString and NSString. They store arrays of characters as Unicode.