Windows Codepage Interactions with Standard C/C++ filenames? - c++

A customer is complaining that our code used to write files with Japanese characters in the filename but no longer works in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file with an embedded filename in it exported from our software, and it looks like the strings use the bytes 0x82 and 0x83 as the first byte of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably Shift-JIS and/or Windows code page 932.
It looks to me like what is happening is previously both fopen and ofstream::open accepted filenames using this codepage; now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.
In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.
Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a long shot for explaining the issue we are seeing.
Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and it happened regardless of whether I used the default locale or the "C" locale.
I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.

Functions like fopen or ofstream::open take the file name as a char *, but that string is interpreted as being in the system code page.
It means that it can be a Japanese name represented as Shift-JIS (cp932), or Chinese Simplified (GBK/cp936), Korean, Arabic, Russian, you name it (as long as it matches the OS system code page).
It also means that the application can use Japanese file names on a Japanese system only.
Change the system code page, and the application "stops working".
I suspect this is what happens here (there have been no big changes in this area of Windows since Windows 2000).
This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a
In the long run you might consider moving to Unicode (and using _wfopen, wofstream).
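For example (a minimal sketch; the wide-filename fstream constructors are a Microsoft extension, and the file name here is just an illustration):

#ifdef _WIN32
#include <cstdio>   // _wfopen is a Microsoft CRT function
#include <fstream>

void write_with_wide_names()
{
    // Wide-character filenames bypass the system code page entirely,
    // so a Japanese name works no matter what the machine's ACP is.
    std::FILE* fp = _wfopen(L"日本語.txt", L"wb");
    if (fp)
        std::fclose(fp);

    std::wofstream os(L"日本語.txt");  // MS extension: wchar_t* filename
}
#endif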

I'm not aware of any portable way of handling Unicode filenames using only the default system libraries. But there are some frameworks that provide portable functions, for example:
for C: glib uses filenames in UTF-8;
for C++: glibmm also uses filenames in UTF-8 (requires glib);
for C++: boost can use wstring for filenames (see the sketch below).
I'm pretty sure .NET/mono frameworks also do contain portable filesystem functions, but I don't know them.
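A small sketch of the Boost option from the list above (the file name is just an illustration):

#include <boost/filesystem/path.hpp>
#include <boost/filesystem/fstream.hpp>

void boost_demo()
{
    // boost::filesystem::path accepts wide strings and converts them to
    // the platform's native form, so the same code compiles everywhere.
    boost::filesystem::path p(L"日本語.txt");
    boost::filesystem::ofstream out(p);
}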

Is somebody still watching this? I've just researched this question and found no answers anywhere, so I can try to explain my findings here.
In VS2005 the fstream filename handling is the odd man out: it doesn't use the system default encoding, the one you get with GetACP and set in Control Panel/Region and Language/Administrative. Instead it always uses CP 1252 -- I believe.
This can cause big confusion, and Microsoft has removed this quirk in later VS versions.
All workarounds for VS2005 have their drawbacks:
Convert your code to use Unicode everywhere
Never open fstreams using narrow-character filenames; always convert them to Unicode using the system default encoding yourself, then use the wide-character filename open()/constructor (see the sketch after this list)
Retrieve the code page using GetACP(), then do a matching setlocale (lexical_cast here is boost::lexical_cast):
setlocale(LC_ALL, ("." + lexical_cast<string>(GetACP())).c_str());
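For the second workaround, the conversion might look roughly like this (a sketch only, in VS2005-era style; the helper name is mine):

#include <windows.h>
#include <fstream>
#include <string>

// Convert a narrow filename with the system default code page (CP_ACP),
// then open the stream through the wide-character filename constructor.
void open_with_acp_name(std::ofstream& os, const std::string& name)
{
    int len = MultiByteToWideChar(CP_ACP, 0, name.c_str(), -1, NULL, 0);
    if (len == 0)
        return;  // conversion failed
    std::wstring wide(len, L'\0');  // len includes the terminator
    MultiByteToWideChar(CP_ACP, 0, name.c_str(), -1, &wide[0], len);
    os.open(wide.c_str());
}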

I'm nearly certain that on Linux, the filename string is a UTF-8 string (on the ext3 filesystem, for example, the only disallowed characters are the slash and the NUL byte), stored in a normal char *. The man pages don't seem to mention character encoding, which is what leads me to believe it is the system standard of UTF-8. OS X likely does the same, since it comes from similar roots, but I am less sure about this.

You may have to set the thread locale to the system default locale.
See here for a possible reason for your problems:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100887

Mac OS X uses Unicode as its native character encoding. The basic string objects are CFString and NSString; they store arrays of characters as Unicode.

Related

"C.UTF-8" C++ locale on Windows?

I'm in the process of fixing a large open source cross-platform application such that it can handle file paths containing non-ANSI characters on Windows.
Update:
Based on answers and comments I got so far (thanks!) I feel like I should clarify some points:
I cannot modify the code of dozens of third-party libraries to use std::wchar_t. This is just not an option. The solution has to work with plain ol' std::fopen(), std::ifstream, etc.
The solution I outline below works 99% of the way, at least on the system I'm developing on (Windows 10 version 1909, build 18363.535). I haven't tested on any other system yet.
The only remaining issue, at least on my system, is basically number formatting, and I'm hopeful that replacing the std::numpunct facet does the trick (but I haven't succeeded yet).
My current solution involves:
Setting the C locale to .UTF-8 for the LC_CTYPE category on Windows (all other categories are set to the C locale as required by the application):
// Needs <clocale> for std::setlocale and <cassert> for assert.

// Required by the application.
std::setlocale(LC_ALL, "C");

// On Windows, we want std::fopen() and other functions dealing with strings
// and file paths to accept narrow-character strings encoded in UTF-8.
#ifdef _WIN32
{
#ifndef NDEBUG
    char* new_ctype_locale =
#endif
        std::setlocale(LC_CTYPE, ".UTF-8");
    assert(new_ctype_locale != nullptr);
}
#endif
Configuring boost::filesystem::path to use the en_US.UTF-8 locale so that it too can deal with paths containing non-ANSI characters:
boost::filesystem::path::imbue(std::locale("en_US.UTF-8"));
The last missing bit is to fix file I/O using C++ streams such as
std::ifstream istream(filename);
The simplest solution is probably to set the global C++ locale at the beginning of the application:
std::locale::global(std::locale("en_US.UTF-8"));
However, that messes up the formatting of numbers, e.g. 1234.56 gets formatted as 1,234.56.
Is there a locale that just specifies the encoding to be UTF-8 without messing with number formatting (or other things)?
Basically I'm looking for the C.UTF-8 locale, but that doesn't seem to exist on Windows.
Update: I suppose one solution would be to reset some (most? all?) of the facets of the locale, but I'm having a hard time finding information on how to do that.
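One option might be the std::locale combining constructor, which copies one locale but takes the facets of a given category from another. A sketch (I haven't verified this against the exact stream behavior needed here):

#include <locale>

void set_global_locale()
{
    // Everything from en_US.UTF-8 except the numeric facets, which come
    // from the classic "C" locale (no thousands separators).
    std::locale utf8_locale(std::locale("en_US.UTF-8"),
                            std::locale::classic(),
                            std::locale::numeric);
    std::locale::global(utf8_locale);
}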
The Windows API does not respect the CRT locales, and the CRT implementation of fopen etc. directly calls the narrow-char API, so changing the locale will not affect the encoding.
However, Windows 10 May 2019 Update (version 1903) introduced support for UTF-8 in its narrow-char APIs. It can be enabled by embedding an appropriate manifest into your executable. Unfortunately it is a very recent addition, so it might not be an option if you need to target older systems.
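For reference, the opt-in is a windowsSettings element in the application manifest; abbreviated, the documented fragment looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>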
Your other options include converting manually to wchar_t or using a layer that does that for you (like Boost.Filesystem, or even better, Boost.Nowide).
Never mind locales.
On Windows you should use Microsoft's extension that adds a constructor taking const wchar_t* (expected to point to UTF-16) to std::ifstream.
Hopefully all your strings are UTF-8, or otherwise some consistent and sane encoding.
So just grab a UTF-8 → UTF-16 converter (they're lightweight) and pass filenames to std::ifstream as UTF-16 (in a const wchar_t*).
(Be sure to #ifdef it out so it doesn't get attempted on any other platform.)
You should also use _wfopen instead of std::fopen, in the same way, for the same reason.
That's it.
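A sketch of the whole recipe (the helper name is mine; MultiByteToWideChar does the UTF-8 → UTF-16 conversion):

#ifdef _WIN32
#include <windows.h>
#include <fstream>
#include <string>

// Convert a UTF-8 path to UTF-16 and open it through Microsoft's
// wide-filename ifstream constructor.
std::ifstream open_utf8(const std::string& utf8_path)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8_path.c_str(), -1,
                                  nullptr, 0);
    std::wstring wide(len, L'\0');  // len includes the terminator
    if (len > 0)
        MultiByteToWideChar(CP_UTF8, 0, utf8_path.c_str(), -1, &wide[0], len);
    return std::ifstream(wide.c_str());
}
#endif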

C++ read and write UTF-32 files

I want to write a language learning app for myself using Visual Studio 2017, C++ and the Windows API (formerly known as Win32). The operating system is the latest Windows 10 insider build, and backwards compatibility is a non-issue. Since I assume English to be the mother tongue of the user and the language I am currently interested in is another European language, ASCII might suffice. But I want to future-proof it (more exotic languages) and I also want to try my hand at UTF-32. I have previously used both UTF-8 and UTF-16, though I have more experience with the latter.
Thanks to std::basic_string, it was easy to figure out how to get a UTF-32 string:
typedef std::basic_string<char32_t> stringUTF32;
Since I am using the WinAPI for all GUI stuff, I need to do some conversion between UTF-32 and UTF-16.
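For the UTF-32 → UTF-16 direction, the conversion can be done by hand with surrogate pairs. A sketch (input validation omitted; wchar_t is 16 bits wide on Windows):

#include <string>

std::wstring utf32_to_utf16(const std::u32string& in)
{
    std::wstring out;
    for (char32_t cp : in)
    {
        if (cp <= 0xFFFF)
        {
            // Code points in the BMP map to a single UTF-16 unit.
            out.push_back(static_cast<wchar_t>(cp));
        }
        else
        {
            // Everything else becomes a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}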
Now to my problem: Since UTF-32 is not widely used because of its inefficiencies, there is hardly any material about it on the web. To avoid unnecessary conversions, I want to save my vocabulary lists and other data as UTF-32 (for all UTF-8 advocates/evangelists, the alternative would be UTF-16). The problem is, I cannot find how to write and open files in UTF-32.
So my question is: How to write/open files in UTF-32? I would prefer if no third-party libraries are needed unless they are a part of Windows or are usually shipped with that OS.
If you have a char32_t sequence, you can write it to a file using a std::basic_ofstream<char32_t> (which I will refer to as u32_ofstream, but this typedef does not exist). This works exactly like std::ofstream, except that it writes char32_ts instead of chars. But there are limitations.
Most standard library types that have an operator<< overload are templated on the character type. So they will work with u32_ofstream just fine. The problem you will encounter is for user types. These almost always assume that you're writing char, and thus are defined as ostream &operator<<(ostream &os, ...);. Such stream output can't work with u32_ofstream without a conversion layer.
But the big issue you're going to face is endian issues. u32_ofstream will write char32_t as your platform's native endian. If your application reads them back through a u32_ifstream, that's fine. But if other applications read them, or if your application needs to read something written in UTF-32 by someone else, that becomes a problem.
The typical solution is to use a "byte order mark" as the first character of the file. Unicode even has a specific codepoint set aside for this: \U0000FEFF.
The way a BOM works is like this. When writing a file, you write the BOM before any other codepoints.
When reading a file of an unknown encoding, you read the first codepoint as normal. If it comes out equal to the BOM in your native encoding, then you can read the rest of the file as normal. If it doesn't, then you need to read the file and endian-convert it before you can process it. That process would look at bit like this:
constexpr char32_t native_bom = U'\U0000FEFF';

u32_ifstream is(...);
char32_t bom;
is >> bom;
if (native_bom == bom)
{
    process_stream(is);
}
else
{
    basic_stringstream<char32_t> char_stream;
    //Load the rest of `is` and endian-convert it into `char_stream`.
    process_stream(char_stream);
}
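If you just want a known byte layout on disk, an alternative sketch is to skip u32_ofstream and write the char32_t data in binary mode yourself; this produces native-endian UTF-32 with a leading BOM:

#include <fstream>
#include <string>

void write_utf32(const char* path, const std::u32string& text)
{
    std::ofstream out(path, std::ios::binary);
    const char32_t bom = U'\U0000FEFF';
    out.write(reinterpret_cast<const char*>(&bom), sizeof bom);
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(char32_t)));
}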
I am currently interested in is another European language, [so] ASCII might suffice
No. Even in plain English. You know how Microsoft Word creates “curly quotes”? Those are non-ASCII characters. All those letters with accents and umlauts in e.g. French or English are non-ASCII characters.
I want to future-proof it
UTF-8, UTF-16 and UTF-32 all can encode every Unicode code point. They’re all future-proof. UTF-32 does not have an advantage over the other two.
Also for future proofing: I’m quite sure some scripts use characters (the technical term is ‘grapheme clusters’) consisting of more than one code point. A cursory search turns up Playing around with Devanagari characters.
A downside of UTF-32 is support in other tools. Notepad won’t open your files. Beyond Compare won’t. Visual Studio Code… nope. Visual Studio will, but it won’t let you create such files.
And the Win32 API: it has a function MultiByteToWideChar which can convert UTF-8 to UTF-16 (which you need to pass in to all Win32 calls) but it doesn’t accept UTF-32.
So my honest answer to this question is, don’t. Otherwise follow Nicol’s answer.

Spanish characters in C++ Windows/Mac/iOS

I'm having some issues with Spanish characters displaying in an iOS app. The code in question is all C++ and shared between a Windows app and an iOS app. It is compiled on Windows using Visual Studio 2010 (character set: multi-byte) and on the Mac using Xcode 4.2.
Currently, the code is using char pointers, and my first thought was that I need to switch over to wchar_t pointers instead. However, I noticed that the Spanish characters I want to output display just fine in Windows using just char pointers. This made me think those characters are a part of the multi-byte character set and I don't need to go to all the trouble of updating everything to wchar_t until I'm ready to do some Japanese, Russian, Arabic, etc. translations.
Unfortunately, while the Spanish characters do display properly in the Windows app, they do not display right once they hit the Mac/iOS side. Experimenting with wchar_t there, I see that they will display properly if I convert everything over. But what I don't understand, and am hoping someone can enlighten me on, is why the characters are perfectly valid on the Windows machine with the same code, yet display as gibberish (requiring wchar_t instead) in the Mac environment.
Is visual studio doing something to my char pointers behind the scenes that the Mac is not doing? In other words, is the Microsoft environment simply being more forgiving to my architectural oversight when I used char pointers instead of wchar_t?
Seeing as how I already know my answer is to convert from char pointers to wchar_t pointers, my real question then is "Why does the Mac require wchar_t but in Windows I can use char for the same characters?"
Thanks.
Mac and Windows use different codepages--they both have Spanish characters available, but they show up as different character values, so the same bytes will appear differently on each platform.
The best way to deal with localization in a cross-platform codebase is UTF8. UTF8 is supported natively in NSString -stringWithUTF8String: and in Windows Unicode applications by calling MultiByteToWideChar with CP_UTF8. In fact, since it's Unicode, you can even use the same technique to handle more complicated languages like Chinese.
Don't use wide characters in cross-platform code if you can help it. This gets complicated because wchar_t is actually 32 bits wide on OS X. In fact, it's wasteful of memory for that reason as well.
http://en.wikipedia.org/wiki/UTF-8
None of char, wchar_t, string or wstring have any encoding attached to them. They just contain whatever binary soup your compiler decides to interpret the source files as. You have three variables that could be off:
What your code contains (in the actual file, between the '"' characters, on a binary level).
What your compiler thinks this is. For example, you may have a UTF-8 source file, but the compiler could turn wchar_t[] literals into proper UCS-4. (I wish MSVC 2010 could do this, but as far as I know, it does not support UTF-8 at all.)
What your rendering API expects. On Windows, this is usually little-endian UTF-16 (as an LPWSTR). For the old LPSTR APIs, it is usually the "current code page", which could be anything as far as I know. iOS and Mac OS use UTF-16 internally I think, but they are very explicit about what they accept and return.
No class or encoding can help you if there is a mismatch between any of these.
In an IDE like Xcode or Eclipse, you can see the encoding of a file in its property sheet. In Xcode 4, this is the right-most pane; bring it up with cmd+alt+0 if it's hidden. If the characters look right in the code editor, the encoding is correct. A first step is to make sure that both Xcode and MSVC are interpreting the same source files the same way. Then you need to figure out what they are turned into in memory right before rendering. And then you need to ensure that both rendering APIs expect the same character set as well.
Or, just move your strings into text files separate from your source code, in a well-defined encoding. UTF-8 is great for this, but anything that can encode all the necessary characters will work. Then only translate your strings for rendering (if necessary).
I just saw this answer which gives even more reasons for the latter option: https://stackoverflow.com/a/1866668/401925

Is it possible to use UTF-8 encoding by default in Visual Studio 2008? [duplicate]

Possible Duplicate:
How to create a UTF-8 string literal in Visual C++ 2008
Is it possible to force Visual Studio to use UTF-8 encoding for all strings by default?
For example have
char *txt = "hello";
encoded in utf8
This blog article looks promising: UTF-8 strings and Visual C++
Most of the important content is still there, even though some of the pictures are broken. In short:
First step, you have to make sure the source file is UTF-8 encoded with the byte order mark (BOM). The BOM is an extremely important thing, without it the C++ compiler will not behave correctly.
In Visual Studio 2008, this can be done directly from the IDE with the Advanced save command located in the File menu. A dialog box will pop up. Select UTF-8 with signature.
If you compile and run a test program, [you are not going to get the expected result.] What happens is that, although your text is properly encoded in UTF-8, for compatibility reasons the C/C++ runtime is by default set to the “C” locale. This locale assumes that all char are 1 byte. Erm. Not quite the case with UTF-8 my dear!
You need to change the locale with the setlocale function to have the string properly interpreted by the input output stream processors.
In our case, whatever locale the system is using is fine; you select it by passing "" as the second parameter.
To be rigorous, you must check the return value of setlocale: if it returns 0, an error occurred. In multi-language applications you will need to use setlocale with more precision, explicitly supplying the locale you want to use (for example, you may want your application to display Russian text on a Japanese computer).
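Putting that together, the call looks like this (a minimal sketch):

#include <clocale>
#include <cstdio>

int main()
{
    // "" means: pick up whatever locale the system is configured with.
    if (std::setlocale(LC_ALL, "") == NULL)
        std::fprintf(stderr, "setlocale failed\n");

    // ... rest of the program; narrow strings are now interpreted
    // according to the user's locale.
}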
I don't know of any good way to make this the default. I'm pretty sure it's not possible. Windows applications strongly prefer UTF-16, if you're compiling for Unicode. If at all possible, you should convert to that format.
Otherwise, the best possible option I can come up with is to define a simple macro (something akin to _T("string") defined in the Windows headers) that converts to UTF-8 using the above logic.

Nasty unicode and C++: Easy way to read ASCII/UTF-8/UTF-16 BE/LE text file

Sorry if the question is stupid and has been asked thousands of times, but I spent a few hours googling it and could not find an answer.
I want to read in a text file which can be any of these: ASCII/UTF-8/UTF-16 BE/LE
I assume that if the file is Unicode, then a BOM is always present.
Is there any automatic way (STL, Boost or something else) to use a file stream, or anything, to read the file in line by line without checking BOMs myself, always getting UTF-8 to put into std::string?
In this project I am using Windows only. It would also be good to know how to solve it for other platforms.
Thanks in advance!
libiconv
BOMs are often not present in UTF-8 files. As a consequence, you can't know if a file is ASCII or UTF-8 until after you have read the data and found a byte which isn't ASCII.
Furthermore, as you are on Windows, do you intend to handle ISO-8859-1 and Windows-1252 as well? The latter is often the default for files from things like Notepad and WordPad. In these cases things are even worse: one can only distinguish heuristically between such encodings, other encodings, and UTF-8.
The ICU library has a character set detection system that you can use to guess the likely character encoding of a file. I do not believe that iconv has such a function.
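With ICU, that detection looks roughly like this (a sketch of the ucsdet API; error handling kept minimal):

#include <unicode/ucsdet.h>  // ICU character set detection
#include <cstdio>

void print_guessed_charset(const char* data, int32_t len)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* det = ucsdet_open(&status);
    ucsdet_setText(det, data, len, &status);
    const UCharsetMatch* match = ucsdet_detect(det, &status);
    if (U_SUCCESS(status) && match != NULL)
        std::printf("best guess: %s\n", ucsdet_getName(match, &status));
    ucsdet_close(det);
}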
ICU is generally available: already installed on Mac and Linux but, alas, not on Windows. A similar routine might be available in the Win32 API as well.