C++ read and write UTF-32 files

I want to write a language learning app for myself using Visual Studio 2017, C++ and the Windows API (formerly known as Win32). The operating system is the latest Windows 10 insider build, and backwards compatibility is a non-issue. Since I assume English to be the mother tongue of the user and the language I am currently interested in is another European language, ASCII might suffice. But I want to future-proof it (more exotic languages) and I also want to try my hand at UTF-32. I have previously used both UTF-8 and UTF-16, though I have more experience with the latter.
Thanks to std::basic_string, it was easy to figure out how to get an UTF-32 string:
typedef std::basic_string<char32_t> stringUTF32;
Since I am using the WinAPI for all GUI stuff, I need to do some conversion between UTF-32 and UTF-16.
Now to my problem: Since UTF-32 is not widely used because of its inefficiencies, there is hardly any material about it on the web. To avoid unnecessary conversions, I want to save my vocabulary lists and other data as UTF-32 (for all UTF-8 advocates/evangelists, the alternative would be UTF-16). The problem is, I cannot find how to write and open files in UTF-32.
So my question is: How to write/open files in UTF-32? I would prefer if no third-party libraries are needed unless they are a part of Windows or are usually shipped with that OS.

If you have a char32_t sequence, you can write it to a file using a std::basic_ofstream<char32_t> (which I will refer to as u32_ofstream, but this typedef does not exist). This works exactly like std::ofstream, except that it writes char32_ts instead of chars. But there are limitations.
Most standard library types that have an operator<< overload are templated on the character type. So they will work with u32_ofstream just fine. The problem you will encounter is for user types. These almost always assume that you're writing char, and thus are defined as ostream &operator<<(ostream &os, ...);. Such stream output can't work with u32_ofstream without a conversion layer.
But the big issue you're going to face is endianness. u32_ofstream will write char32_t in your platform's native byte order. If your application reads them back through a u32_ifstream, that's fine. But if other applications read them, or if your application needs to read something written in UTF-32 by someone else, that becomes a problem.
The typical solution is to use a "byte order mark" as the first character of the file. Unicode even has a specific codepoint set aside for this: \U0000FEFF.
The way a BOM works is like this. When writing a file, you write the BOM before any other codepoints.
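For example, a minimal sketch of the writing side (write_utf32 is a hypothetical helper name; it writes the BOM followed by the raw code units through a plain binary ofstream, so the file ends up as native-endian UTF-32):

#include <fstream>
#include <string>

// Hypothetical helper: write native-endian UTF-32 with a leading BOM.
void write_utf32(const std::u32string &text, const char *path)
{
    std::ofstream out(path, std::ios::binary);
    const char32_t bom = U'\U0000FEFF';
    out.write(reinterpret_cast<const char *>(&bom), sizeof bom);
    out.write(reinterpret_cast<const char *>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(char32_t)));
}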
When reading a file of an unknown encoding, you read the first codepoint as normal. If it comes out equal to the BOM in your native encoding, then you can read the rest of the file as normal. If it doesn't, then you need to read the file and endian-convert it before you can process it. That process would look a bit like this:
constexpr char32_t native_bom = U'\U0000FEFF';

u32_ifstream is(...);
char32_t bom;
is >> bom;
if (native_bom == bom)
{
    process_stream(is);
}
else
{
    std::basic_stringstream<char32_t> char_stream;
    // Load the rest of `is` and endian-convert it into `char_stream`.
    process_stream(char_stream);
}
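The endian conversion mentioned in the comment is just a byte swap of every code unit; a hedged sketch (hypothetical swap_endian helper, no validation):

#include <cstdint>

// Reverse the four bytes of one UTF-32 code unit.
inline char32_t swap_endian(char32_t c)
{
    const std::uint32_t u = static_cast<std::uint32_t>(c);
    return static_cast<char32_t>(((u & 0x000000FFu) << 24) |
                                 ((u & 0x0000FF00u) <<  8) |
                                 ((u & 0x00FF0000u) >>  8) |
                                 ((u & 0xFF000000u) >> 24));
}

Apply it to every char32_t read from `is` before pushing it into `char_stream`.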

I am currently interested in is another European language, [so] ASCII might suffice
No. Even in plain English. You know how Microsoft Word creates “curly quotes”? Those are non-ASCII characters. All those letters with accents and umlauts in e.g. French or German are non-ASCII characters.
I want to future-proof it
UTF-8, UTF-16 and UTF-32 all can encode every Unicode code point. They’re all future-proof. UTF-32 does not have an advantage over the other two.
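As a quick illustration (a sketch using C++11 literals; note that u8 literals are still char-based in VS2017), the same non-BMP code point round-trips through all three encodings, and in this case even occupies the same four bytes in each:

#include <cassert>
#include <string>

int main()
{
    // U+1F600 in all three encodings; every representation is lossless.
    std::string    s8  = u8"\U0001F600"; // 4 code units x 1 byte  (UTF-8)
    std::u16string s16 = u"\U0001F600";  // 2 code units x 2 bytes (UTF-16 surrogate pair)
    std::u32string s32 = U"\U0001F600";  // 1 code unit  x 4 bytes (UTF-32)
    assert(s8.size() == 4 && s16.size() == 2 && s32.size() == 1);
}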
Also for future proofing: I’m quite sure some scripts use characters (the technical term is ‘grapheme clusters’) consisting of more than one code point. A cursory search turns up Playing around with Devanagari characters.
A downside of UTF-32 is support in other tools. Notepad won’t open your files. Beyond Compare won’t. Visual Studio Code… nope. Visual Studio will, but it won’t let you create such files.
And the Win32 API: it has a function MultiByteToWideChar which can convert UTF-8 to UTF-16 (which you need to pass in to all Win32 calls) but it doesn’t accept UTF-32.
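So a UTF-32 string has to be converted to UTF-16 by hand before it can reach any W function. A hedged sketch of such a conversion (hypothetical utf32_to_utf16 helper; assumes valid input, no error handling):

#include <string>

std::u16string utf32_to_utf16(const std::u32string &in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char32_t cp : in)
    {
        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));                    // BMP: one code unit
        }
        else
        {
            cp -= 0x10000;                                               // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
        }
    }
    return out;
}

On Windows, wchar_t is also 16 bits, so the same loop works for std::wstring if you need an LPCWSTR.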
So my honest answer to this question is, don’t. Otherwise follow Nicol’s answer.

Related

Converting ASCII strings to UTF-16 before passing them to Windows API functions

In my current project I've been using wide chars (UTF-16). But since my only input from the user is going to be a URL, which has to end up ASCII anyway, and one other string, I'm thinking about just switching the whole program to ASCII.
My question is, is there any benefit to converting the strings to utf16 before I pass them to a Windows API function?
After doing some research online, it seems like a lot of people recommend this if you're not working with UTF-16 on Windows.
In the Windows API, if you call a function like
int SomeFunctionA(const char*);
then it will automatically convert the string to UTF-16 and call the real, Unicode version of the function:
int SomeFunctionW(const wchar_t*);
The catch is, it converts the string to UTF-16 from the ANSI code page. That works OK if you actually have strings encoded in the ANSI code page. It doesn't work if you have strings encoded in UTF-8, which is increasingly common these days (e.g., nearly 70% of Web pages), and isn't supported as an ANSI code page.
Also, if you use the A API, you'll run into limitations like not (easily) being able to open files that have non-ANSI characters in their names (which can be arbitrary UTF-16 strings), and you won't have access to some of Windows' newer features.
Which is why I always call the W functions. Even though this means annoying explicit conversions (from the UTF-8 strings used in the non-Windows-specific parts of our software).
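Such a conversion is only a few lines. A sketch of a hypothetical widen helper built on MultiByteToWideChar (error handling omitted):

#include <string>
#include <windows.h>

// Hypothetical helper: UTF-8 std::string -> UTF-16 std::wstring for W calls.
std::wstring widen(const std::string &utf8)
{
    if (utf8.empty()) return std::wstring();
    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                        static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring out(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &out[0], len);
    return out;
}

Usage then looks like CreateFileW(widen(utf8_path).c_str(), ...).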
The main point is that on Windows UTF-16 is the native encoding, and all API functions that end in A are just wrappers around the W ones. The A functions are only kept around for compatibility with programs that were written for Windows 9x/ME, and indeed, no new program should ever use them (in my opinion).
Unless you're doing heavy processing of billions of large strings, I doubt there is any benefit to thinking about storing them in another (possibly more space-saving) encoding at all. Besides, even a URI can contain Unicode, if you think about IDN. So don't be too sure upfront about what data your users will pass to the program.

Designing an application for UTF-8 or UTF-16 usage

I am developing an application that will be primarily used by English and Spanish readers. However, in the future I would like to be able to support more extended languages, such as Japanese. While thinking about the design of the program I have hit a wall in the UTF-8 vs. UTF-16 vs. multibyte debate. I would like to compile my program to support either UTF-8 or UTF-16 (for when languages such as Chinese are used). For this to happen, I was thinking that I should have something such as
#if _UTF8
typedef char char_type;
#elif _UTF16
typedef unsigned short char_type;
#else
#error
#endif
That way, in the future when I use UTF-16, I can switch the #define (and of course, have the same type of #if/#endif for things such as sprintf, etc.). I have my own custom string type, so that would make use of this as well.
Would replacing every single use of "char" with my "char_type", using the scenario mentioned above, be considered a "bad idea"? If so, why is it considered a bad idea and how could I achieve what I mentioned above?
The reason I would like to use one or the other is due to memory efficiency. I would rather not use UTF-16 all the time if I am not using it.
UTF-8 can represent every Unicode character. If your application properly supports UTF-8, you are golden for any language.
Note that Windows' native controls do not have APIs to set UTF-8 text in them, if you are writing a Windows application. However, it's easy to make an application which uses UTF-8 internally for everything, and converts UTF-8 -> UTF-16 when setting text in Windows, and converts UTF-16 -> UTF-8 when getting text from Windows. I've done it, and it worked awesome and was MUCH nicer than writing a WCHAR application. It's trivial to convert UTF-8 <-> 16; Windows has APIs for it, or you can find a simple (one page) function to do it in your own code.
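For the other direction, i.e. getting text back out of Windows, the counterpart API is WideCharToMultiByte; a hedged sketch of a hypothetical narrow helper (error handling omitted):

#include <string>
#include <windows.h>

// Hypothetical helper: UTF-16 std::wstring -> UTF-8 std::string.
std::string narrow(const std::wstring &utf16)
{
    if (utf16.empty()) return std::string();
    const int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                        static_cast<int>(utf16.size()),
                                        nullptr, 0, nullptr, nullptr);
    std::string out(static_cast<std::size_t>(len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                        static_cast<int>(utf16.size()),
                        &out[0], len, nullptr, nullptr);
    return out;
}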
I believe choosing UTF-8 is just enough for your needs. Keep in mind that char_type as defined above is a single code unit, which is less than a full character in both encodings.
You may wish to have a look at this discussion: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful for the benefits of different types of popular encodings.
This is essentially what Windows does with TCHAR (except that the Windows API interprets char as the "ANSI" code page instead of UTF-8).
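Conceptually, that TCHAR machinery boils down to something like this (a simplified sketch of the idea, not the actual Windows headers):

#ifdef UNICODE
typedef wchar_t TCHAR;        // W functions, UTF-16
#define _T(x)      L##x
#define CreateFile CreateFileW
#else
typedef char TCHAR;           // A functions, "ANSI" code page
#define _T(x)      x
#define CreateFile CreateFileA
#endif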
I think it's a bad idea.

Spanish characters in C++ Windows/Mac/iOS

I'm having some issues with Spanish characters displaying in an iOS app. The code in question is all C++, and shared between a Windows app and an iOS app. It is compiled on Windows using Visual Studio 2010 (character set is multi-byte) and with Xcode 4.2 on the Mac.
Currently, the code is using char pointers, and my first thought was that I need to switch over to wchar_t pointers instead. However, I noticed that the Spanish characters I want to output display just fine in Windows using just char pointers. This made me think those characters are a part of the multi-byte character set and I don't need to go to all the trouble of updating everything to wchar_t until I'm ready to do some Japanese, Russian, Arabic, etc. translations.
Unfortunately, while the Spanish characters do display properly in the Windows app, they do not display right once they hit the Mac/iOS side. Experimenting with wchar_t there, I see that they will display properly if I convert everything over. But what I don't understand, and am hoping someone can enlighten me about, is why the characters are perfectly valid on the Windows machine, with the same code, yet display as gibberish (requiring wchar_t instead) in the Mac environment.
Is Visual Studio doing something to my char pointers behind the scenes that the Mac is not doing? In other words, is the Microsoft environment simply being more forgiving of my architectural oversight when I used char pointers instead of wchar_t?
Seeing as how I already know my answer is to convert from char pointers to wchar_t pointers, my real question then is "Why does the Mac require wchar_t but in Windows I can use char for the same characters?"
Thanks.
Mac and Windows use different code pages; they both have Spanish characters available, but they show up as different character values, so the same bytes will appear differently on each platform.
The best way to deal with localization in a cross-platform codebase is UTF8. UTF8 is supported natively in NSString -stringWithUTF8String: and in Windows Unicode applications by calling MultiByteToWideChar with CP_UTF8. In fact, since it's Unicode, you can even use the same technique to handle more complicated languages like Chinese.
Don't use wide characters in cross-platform code if you can help it. This gets complicated because wchar_t is actually 32 bits wide on OS X. In fact, it's wasteful of memory for that reason as well.
http://en.wikipedia.org/wiki/UTF-8
None of char, wchar_t, string or wstring have any encoding attached to them. They just contain whatever binary soup your compiler decides to interpret the source files as. You have three variables that could be off:
What your code contains (in the actual file, between the '"' characters, on a binary level).
What your compiler thinks this is. For example, you may have a UTF-8 source file, but the compiler could turn wchar_t[] literals into proper UCS-4. (I wish MSVC 2010 could do this, but as far as I know, it does not support UTF-8 at all.)
What your rendering API expects. On Windows, this is usually little-endian UTF-16 (as an LPWSTR). For the old LPSTR APIs, it is usually the "current codepage", which could be anything as far as I know. iOS and Mac OS use UTF-16 internally I think, but they are very explicit about what they accept and return.
No class or encoding can help you if there is a mismatch between any of these.
In an IDE like Xcode or Eclipse, you can see the encoding of a file in its property sheet. In Xcode 4, this is the right-most pane; bring it up with cmd+alt+0 if it's hidden. If the characters look right in the code editor, the encoding is correct. A first step is to make sure that both Xcode and MSVC are interpreting the same source files the same way. Then you need to figure out what they are turned into in memory right before rendering. And then you need to ensure that both rendering APIs expect the same character set.
Or, just move your strings into text files separate from your source code, in a well-defined encoding. UTF-8 is great for this, but anything that can encode all necessary characters will work. Then only translate your strings for rendering (if necessary).
I just saw this answer which gives even more reasons for the latter option: https://stackoverflow.com/a/1866668/401925

C++ and UTF8 - Why not just replace ASCII?

In my application I have to constantly convert strings between std::string and std::wstring due to different APIs (boost, win32, ffmpeg etc.). Especially with ffmpeg the strings end up UTF-8 -> UTF-16 -> UTF-8 -> UTF-16, just to open a file.
Since UTF-8 is backwards compatible with ASCII, I thought that I could consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.
This worked kind of well; I implemented to_lower, to_upper and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementation of all corresponding algorithms (including regex).
Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.
However, my question is: why are the default std::string and "" not simply changed to use UTF-8? Especially as UTF-8 is backwards compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted automatically.
I've looked at ICU, but it is not very compatible with APIs assuming basic_string, e.g. no begin/end/c_str, etc.
The main issue is the conflation of in-memory representation and encoding.
None of the Unicode encodings is really amenable to text processing. Users will in general care about graphemes (what's on the screen) while the encodings are defined in terms of code points... and some graphemes are composed of several code points.
As such, when one asks: what is the 5th character of "Hélène" (French first name) the question is quite confusing:
In terms of graphemes, the answer is n.
In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)
Depending on the source of the question (an end-user in front of her screen or an encoding routine) the response is completely different.
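A small sketch makes the ambiguity concrete (precomposed code point vs. base letter plus combining diacritic):

#include <iostream>
#include <string>

int main()
{
    // "é" as one precomposed code point (U+00E9) vs. "e" + combining acute accent (U+0301):
    std::u32string composed   = U"\u00E9";  // 1 code point
    std::u32string decomposed = U"e\u0301"; // 2 code points, but the same grapheme on screen
    std::cout << composed.size() << ' ' << decomposed.size() << '\n'; // prints: 1 2
}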
Therefore, I think that the real question is Why are we speaking about encodings here?
Today it does not make sense, and we would need two "views": Graphemes and Code Points.
Unfortunately the std::string and std::wstring interfaces were inherited from a time where people thought that ASCII was sufficient, and the progress made didn't really solve the issue.
I don't even understand why the in-memory representation should be specified, it is an implementation detail. All a user should want is:
to be able to read/write in UTF-* and ASCII
to be able to work on graphemes
to be able to edit a grapheme (to manage the diacritics)
... who cares how it is represented? I thought that good software was built on encapsulation?
Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
You cannot; the primary reason for this is Microsoft. They decided not to support Unicode as UTF-8, so the support for UTF-8 under Windows is minimal.
Under Windows you cannot use UTF-8 as a codepage, but you can convert from or to UTF-8.
There are two snags to using UTF8 on Windows.
You cannot tell how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4.
The Windows API uses UTF16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses a UTF8 Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)
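To make the first snag concrete: the size of a UTF-8 string gives you bytes (code units), not characters:

#include <cassert>
#include <string>

int main()
{
    std::string s = u8"na\u00EFve"; // "naïve": 5 characters
    assert(s.size() == 6);          // but 6 bytes, because 'ï' takes 2 bytes in UTF-8
}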
The big snag with UTF16 is that the binary representation of a string depends on the byte order in a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers where you cannot be sure that the other computer uses the same byte order.
So what to do? I use UTF16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted over a socket, I first convert it to UTF8.
This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF8 and UTF16 can be isolated to routines responsible for I/O.

C++: Making my project support unicode

My C++ project is currently about 16K lines of code, and I admit to having completely not thought about Unicode support in the first place.
All I did was add a custom typedef for std::string as String and jump into coding.
I have never really worked with unicode myself in programs I wrote.
How hard is it to switch my project to unicode now? Is it even a good idea?
Can I just switch to std::wchar without any major problems?
Probably the most important part of making an application unicode aware is to track the encoding of your strings and to make sure that your public interfaces are well specified and easy to use with the encodings that you wish to use.
Switching to a wider character type (in C++, wchar_t) is not necessarily the correct solution. In fact, I would say it is usually not the simplest solution. Some applications can get away with specifying that all strings and interfaces use UTF-8 and not need to change at all. std::string can perfectly well be used for UTF-8 encoded strings.
However, if you need to interpret the characters in a string or interface with non-UTF-8 interfaces, then you will have to put in more work, but without knowing more about your application it is impossible to recommend a single best approach.
There are some issues with using std::wstring. If your application will be storing text in Unicode, and it will be running on different platforms, you may run into trouble. std::wstring relies on wchar_t, which is compiler dependent. In Microsoft Visual C++, this type is 16 bits wide, and will thus only support UTF-16 encodings. The GNU C++ compiler specifies this type to be 32 bits wide, and will thus only support UTF-32 encodings. If you then store the text in a file from one system (say Windows/VC++), and then read the file from another system (Linux/GCC), you will have to prepare for this (in this case convert from UTF-16 to UTF-32).
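A quick way to see that difference (a sketch; the output depends on the compiler, as described above):

#include <iostream>

int main()
{
    // Prints 2 2 4 with Visual C++ and 4 2 4 with GCC on Linux.
    std::cout << sizeof(wchar_t)  << ' '   // compiler dependent
              << sizeof(char16_t) << ' '   // always 2 (C++11)
              << sizeof(char32_t) << '\n'; // always 4 (C++11)
}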
Can I just switch to [std::wchar_t] without any major problems?
No, it's not that simple.
The encoding of a wchar_t string is platform-dependent. Windows uses UTF-16. Linux usually uses UTF-32. (C++0x will mitigate this difference by introducing separate char16_t and char32_t types.)
If you need to support Unix-like systems, you don't have all the UTF-16 functions that Windows has, so you'd need to write your own _wfopen, etc.
Do you use any third-party libraries? Do they support wchar_t?
Although wide characters are commonly-used for an in-memory representation, on-disk and on-the-Web formats are much more likely to be UTF-8 (or other char-based encoding) than UTF-16/32. You'd have to convert these.
You can't just search-and-replace char with wchar_t because C++ confounds "character" and "byte", and you have to determine which chars are characters and which chars are bytes.