We have a set of applications that were developed for the ASCII character set. Now we're trying to install them in Iceland, and are running into problems where the Icelandic characters are getting screwed up.
We are working through our issues, but I was wondering: Is there a good "guide" out there for writing C++ code that is designed for 8-bit characters and which will work properly when UTF-8 data is given to it?
I can't expect everyone to read the whole Unicode standard, but if there is something more digestible available, I'd like to share it with the team so we don't run into these issues again.
Re-writing all the applications to use wchar_t or some other string representation is not feasible at this time. I'll also note that these applications communicate over networks with servers and devices that use 8-bit characters, so even if we did Unicode internally, we'd still have issues with translation at the boundaries. For the most part, these applications just pass data around; they don't "process" the text in any way other than copying it from place to place.
The operating systems used are Windows and Linux. We use std::string and plain-old C strings. (And don't ask me to defend any of the design decisions. I'm just trying to help fix the mess.)
Here is a list of what has been suggested:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
UTF-8 and Unicode FAQ for Unix/Linux
The Unicode HOWTO
Just be 8-bit clean, for the most part. However, be aware that any non-ASCII character is split across multiple bytes, so you must take this into account when line-breaking or truncating text for display.
UTF-8 has the advantage that you can always tell where you are in a multi-byte character: if bit 7 is set and bit 6 reset (byte is 0x80-0xBF) this is a trailing byte, while if bits 7 and 6 are set and 5 is reset (0xC0-0xDF) it is a lead byte with one trailing byte; if 7, 6 and 5 are set and 4 is reset (0xE0-0xEF) it is a lead byte with two trailing bytes, and so on. The number of consecutive bits set at the most-significant bit is the total number of bytes making up the character. That is:
110x xxxx = two-byte character
1110 xxxx = three-byte character
1111 0xxx = four-byte character
etc
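The bit patterns above translate directly into code. A minimal sketch (the function names are my own) that classifies a byte and uses that to truncate a UTF-8 string without splitting a character:

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte (0x80-0xBF) or invalid.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xC0) == 0x80) return 0;  // 10xxxxxx: continuation byte
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte character
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte character
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte character
    return 0;                             // 0xF8-0xFF: not valid in UTF-8
}

// Largest cut point <= max_bytes that does not split a multi-byte character.
std::size_t utf8_safe_truncation_point(const char* s, std::size_t len,
                                       std::size_t max_bytes) {
    if (len <= max_bytes) return len;
    std::size_t cut = max_bytes;
    // Back up over continuation bytes until we sit on a sequence start.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    return cut;
}
```

For example, truncating "aé" (bytes 0x61 0xC3 0xA9) to two bytes must back up to one byte, because 0xA9 is a continuation byte.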
The Icelandic alphabet is all contained in ISO 8859-1 and hence Windows-1252. If this is a console-mode application, be aware that the console uses IBM codepages, so (depending on the system locale) it might display in 437, 850, or 861. Windows has no native display support for UTF-8; you must transform to UTF-16 and use Unicode APIs.
Calling SetConsoleCP and SetConsoleOutputCP, specifying codepage 1252, will help with your problem, if it is a console-mode application. Unfortunately the console font selected has to be a font that supports the codepage, and I can't see a way to set the font. The standard bitmap fonts only support the system default OEM codepage.
This looks like a comprehensive quick guide:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Be aware that the full Unicode range doesn't fit in 16-bit characters, so either use 32-bit characters or a variable-width encoding (UTF-8 is the most popular).
UTF-8 was designed exactly with your problems in mind. One thing I would be careful about is that ASCII is really a 7-bit encoding, so if any part of your infrastructure is using the 8th bit for other purposes, that may be tricky.
You might want to check out ICU. It may have functions available that would make working with UTF-8 strings easier.
Icelandic uses ISO Latin 1, so eight bits should be enough. We need more details to figure out what's happening.
Icelandic, like French, German, and most other languages of Western Europe, can be supported using an 8-bit character set (CP1252 on Windows, ISO 8859-1 aka Latin1 on *x). This was the standard approach before Unicode was invented, and is still quite common. As you say, you have the constraint that you can't rewrite your app to use wchar_t, and you don't need to.
You shouldn't be surprised that UTF-8 is causing problems; UTF-8 encodes the non-ASCII characters (e.g. the accented Latin characters, thorn, eth, etc) as TWO BYTES each.
The only general advice that can be given is quite simple (in theory):
(1) decide what character set you are going to support (Unicode, Latin1, CP1252, ...) in your system
(2) if you are being supplied data encoded in some other fashion (e.g. UTF-8) then transcode it to your standard (e.g. CP1252) at the system border
(3) if you need to supply data encoded in some other fashion, ...
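As a sketch of step (2), here is what transcoding UTF-8 to Latin-1 at the system border might look like. The function name and the '?'-replacement policy are illustrative choices, well-formed input is assumed, and CP1252's extra characters in 0x80-0x9F are not handled:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: transcode UTF-8 input to Latin-1 (ISO 8859-1) at the system border.
// Only code points U+0000..U+00FF are representable; anything else becomes '?'.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char c = in[i];
        if (c < 0x80) {                           // one byte: ASCII
            out += static_cast<char>(c);
            i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < in.size()) {
            // two bytes: code point = ((c & 0x1F) << 6) | (next & 0x3F)
            unsigned cp = ((c & 0x1Fu) << 6)
                        | (static_cast<unsigned char>(in[i + 1]) & 0x3Fu);
            out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
            i += 2;
        } else {                                  // 3/4 bytes: beyond Latin-1
            out += '?';
            ++i;                                  // skip the lead byte
            while (i < in.size() &&
                   (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80)
                ++i;                              // skip its continuation bytes
        }
    }
    return out;
}
```

For instance, Icelandic thorn U+00FE arrives as the two bytes 0xC3 0xBE and becomes the single Latin-1 byte 0xFE, while the euro sign (three bytes in UTF-8) falls outside Latin-1 and is replaced.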
You may want to use wide characters (wchar_t instead of char and std::wstring instead of std::string). This doesn't automatically solve 100% of your problems, but it is a good first step.
Also use string functions which are Unicode-aware (refer to the documentation). If a function manipulates wide characters or strings, it is generally aware that they are wide.
Related
I want to write a language-learning app for myself using Visual Studio 2017, C++ and the Windows API (formerly known as Win32). The operating system is the latest Windows 10 insider build, and backwards compatibility is a non-issue. Since I assume English to be the mother tongue of the user and the language I am currently interested in is another European language, ASCII might suffice. But I want to future-proof it (more exotic languages), and I also want to try my hand at UTF-32. I have previously used both UTF-8 and UTF-16, though I have more experience with the latter.
Thanks to std::basic_string, it was easy to figure out how to get an UTF-32 string:
typedef std::basic_string<char32_t> stringUTF32;
Since I am using the WinAPI for all GUI stuff, I need to do some conversion between UTF-32 and UTF-16.
Now to my problem: Since UTF-32 is not widely used because of its inefficiencies, there is hardly any material about it on the web. To avoid unnecessary conversions, I want to save my vocabulary lists and other data as UTF-32 (for all UTF-8 advocates/evangelists, the alternative would be UTF-16). The problem is, I cannot find how to write and open files in UTF-32.
So my question is: How to write/open files in UTF-32? I would prefer if no third-party libraries are needed unless they are a part of Windows or are usually shipped with that OS.
If you have a char32_t sequence, you can write it to a file using a std::basic_ofstream<char32_t> (which I will refer to as u32_ofstream, but this typedef does not exist). This works exactly like std::ofstream, except that it writes char32_ts instead of chars. But there are limitations.
Most standard library types that have an operator<< overload are templated on the character type. So they will work with u32_ofstream just fine. The problem you will encounter is for user types. These almost always assume that you're writing char, and thus are defined as ostream &operator<<(ostream &os, ...);. Such stream output can't work with u32_ofstream without a conversion layer.
But the big issue you're going to face is endian issues. u32_ofstream will write char32_t as your platform's native endian. If your application reads them back through a u32_ifstream, that's fine. But if other applications read them, or if your application needs to read something written in UTF-32 by someone else, that becomes a problem.
The typical solution is to use a "byte order mark" as the first character of the file. Unicode even has a specific codepoint set aside for this: \U0000FEFF.
The way a BOM works is like this. When writing a file, you write the BOM before any other codepoints.
When reading a file of an unknown encoding, you read the first codepoint as normal. If it comes out equal to the BOM in your native encoding, then you can read the rest of the file as normal. If it doesn't, then you need to read the file and endian-convert it before you can process it. That process would look a bit like this:
constexpr char32_t native_bom = U'\U0000FEFF';

u32_ifstream is(...);
char32_t bom;
is >> bom;
if (native_bom == bom)
{
    process_stream(is);
}
else
{
    std::basic_stringstream<char32_t> char_stream;
    // Load the rest of `is` and endian-convert it into `char_stream`.
    process_stream(char_stream);
}
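For the writing side, a minimal sketch (the function name is mine) that sidesteps the stream limitations described above by emitting the char32_t units as raw bytes in native endian, BOM first, as the BOM convention requires:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Sketch: write a char32_t string to a file as native-endian UTF-32,
// preceded by a BOM so that readers can detect the byte order.
void write_utf32_file(const std::string& path, const std::u32string& text) {
    std::ofstream out(path, std::ios::binary);
    const char32_t bom = U'\U0000FEFF';
    out.write(reinterpret_cast<const char*>(&bom), sizeof bom);
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(char32_t)));
}
```

A reader on a machine with the same byte order sees U+FEFF as the first unit and proceeds as normal; a reader on the opposite byte order sees 0xFFFE0000 and knows to byte-swap every unit.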
I am currently interested in is another European language, [so] ASCII might suffice
No. Even in plain English. You know how Microsoft Word creates “curly quotes”? Those are non-ASCII characters. All those letters with accents and umlauts in e.g. French or German are non-ASCII characters.
I want to future-proof it
UTF-8, UTF-16 and UTF-32 all can encode every Unicode code point. They’re all future-proof. UTF-32 does not have an advantage over the other two.
Also for future proofing: I’m quite sure some scripts use characters (the technical term is ‘grapheme clusters’) consisting of more than one code point. A cursory search turns up Playing around with Devanagari characters.
A downside of UTF-32 is support in other tools. Notepad won’t open your files. Beyond Compare won’t. Visual Studio Code… nope. Visual Studio will, but it won’t let you create such files.
And the Win32 API: it has a function MultiByteToWideChar which can convert UTF-8 to UTF-16 (which you need to pass in to all Win32 calls) but it doesn’t accept UTF-32.
So my honest answer to this question is, don’t. Otherwise follow Nicol’s answer.
Why did the C++11 Standard introduce the types char16_t and char32_t? Isn't one byte enough to store characters? Is there any purpose in extending the size of the character type?
So after you've read Joel's article about Unicode, you should know about Unicode in general, but not in C++.
The problem with C++98 was that it didn't know about Unicode, really. (Except for the universal character reference escape syntax.) C++ just required the implementation to define a "basic source character set" (which is essentially meaningless, because it's about the encoding of the source file, and thus comes down to telling the compiler "this is it"), a "basic execution character set" (some set of characters representable by narrow strings, and an 8-bit (possibly multi-byte) encoding used to represent it at runtime, which has to include the most important characters in C++), and a "wide execution character set" (a superset of the basic set, and an encoding that uses wchar_t as its code unit to go along with it, with the requirement that a single wchar_t can represent any character from the set).
Nothing about actual values in these character sets.
So what happened?
Well, Microsoft switched to Unicode very early, back when it still had less than 2^16 characters. They implemented their entire NT operating system using UCS-2, which is the fixed-width 16-bit encoding of old Unicode versions. It made perfect sense for them to define their wide execution character set to be Unicode, make wchar_t 16 bits and use UCS-2 encoding. For the basic set, they chose "whatever the current ANSI codepage is", which made zero sense, but they pretty much inherited that. And because narrow string support was considered legacy, the Windows API is full of weird restrictions on that. We'll get to that.
Unix switched a little later, when it was already clear that 16 bits weren't enough. Faced with the choice of using 16-bit variable width encoding (UTF-16), a 32-bit fixed width encoding (UTF-32/UCS-4), or an 8-bit variable width encoding (UTF-8), they went with UTF-8, which also had the nice property that a lot of code written to handle ASCII and ISO-8859-* text didn't even need to be updated. For wchar_t, they chose 32 bits and UCS-4, so that they could represent every Unicode code point in a single unit.
Microsoft then upgraded everything they had to UTF-16 to handle the new Unicode characters (with some long-lingering bugs), and wchar_t remained 16 bits, because of backwards compatibility. Of course that meant that wchar_t could no longer represent every character from the wide set in a single unit, making the Microsoft compiler non-conformant, but nobody thought that was a big deal. It's not as if some C++ standard APIs are totally reliant on that property. (Well, yes, codecvt is. Tough luck.)
But still, they thought UTF-16 was the way to go, and the narrow APIs remained the unloved stepchildren. UTF-8 didn't get supported. You cannot use UTF-8 with the narrow Windows API. You cannot make the Microsoft compiler use UTF-8 as the encoding for narrow string literals. They just didn't feel it was worth it.
The result: extreme pain when trying to write internationalized applications for both Unix and Windows. Unix plays well with UTF-8, Windows with UTF-16. It's ugly. And wchar_t has different meanings on different platforms.
char16_t and char32_t, as well as the new string literal prefixes u, U and u8, are an attempt to give the programmer reliable tools to work with encodings. Sure, you still have to either do weird compile-time switching for multi-platform code, or else decide on one encoding and do conversions in some wrapper layer, but at least you now have the right tools for the latter choice. Want to go the UTF-16 route? Use u and char16_t everywhere, converting to UTF-8 near system APIs as needed. Previously you couldn't do that at all in Unix environments. Want UTF-8? Use char and u8, converting near UTF-16 system APIs (and avoid the standard library I/O and string manipulation stuff, because Microsoft's version still doesn't support UTF-8). Previously you couldn't do that at all in Windows. And now you can even use UTF-32, converting everywhere, if you really want to. That, too, wasn't possible before in Windows.
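The new prefixes pin the encoding per literal, and because sizeof counts code units (including the terminating NUL), the differences can be checked at compile time on any conforming compiler:

```cpp
#include <cassert>

// u8 / u / U fix the encoding of the literal regardless of platform.
static_assert(sizeof(u8"\u00FE") == 3,
              "UTF-8: U+00FE takes two bytes (0xC3 0xBE), plus NUL");
static_assert(sizeof(u"\u00FE") / sizeof(char16_t) == 2,
              "UTF-16: one code unit, plus NUL");
static_assert(sizeof(U"\u00FE") / sizeof(char32_t) == 2,
              "UTF-32: one code unit, plus NUL");
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "UTF-16: a code point outside the BMP needs a surrogate pair, plus NUL");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2,
              "UTF-32: always one code unit per code point, plus NUL");
```

Note that the sizeof checks hold in C++20 as well, even though u8 literals changed element type there (char8_t is still one byte).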
So that's why these things are in C++11: to give you some tools to work around the horrible SNAFU around character encodings in cross-platform code in an at least somewhat predictable and reliable fashion.
1 byte has never been enough. There are hundreds of ANSI 8-bit encodings in existence because people kept trying to stuff different languages into the confines of 8-bit limitations, so the same byte values have different meanings in different languages. Then Unicode came along to solve that problem, but it needed 16 bits to do it (UCS-2). Eventually, the needs of the world's languages exceeded 16 bits, so the UTF-8/16/32 encodings were created to extend the available values.
char16_t and char32_t (and their respective text prefixes) were created to handle UTF-16/32 in a uniform manner on all platforms. Originally there was wchar_t, but it was created when Unicode was new, and its byte size was never standardized, even to this day. On some platforms wchar_t is 16-bit (UTF-16), whereas on others it is 32-bit (UTF-32). This has caused plenty of interoperability issues over the years when exchanging Unicode data across platforms. char16_t and char32_t were finally introduced to have standardized sizes (16-bit and 32-bit, respectively) and semantics on all platforms.
There are around 100,000 characters (they call them code points) defined in Unicode. So in order to specify any one of them, one byte is not enough. One byte is just enough to enumerate the first 256 of them, which happen to be identical to ISO-8859-1. Two bytes are enough for the most important subset of Unicode, the so-called Basic Multilingual Plane, and many applications, e.g. Java, settle on 16-bit characters for Unicode. If you want truly every single Unicode character, you have to go beyond that and allow 4 bytes / 32 bits. And because different people have different needs, C++ allows different sizes here. UTF-8 is a variable-size encoding rarely used within programs, because different characters have different lengths. To some extent this also applies to UTF-16, but in most cases you can safely ignore this issue with char16_t.
In my application I have to constantly convert strings between std::string and std::wstring due to different APIs (Boost, Win32, FFmpeg, etc.). Especially with FFmpeg the strings end up UTF-8 → UTF-16 → UTF-8 → UTF-16, just to open a file.
Since UTF-8 is backwards compatible with ASCII, I thought I would consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.
This worked kind of well; I implemented to_lower, to_upper, and iequals for UTF-8. However, I then hit several dead ends: std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementations of all corresponding algorithms (including regex).

Basically my conclusion is that UTF-8 is not very good for general usage, and the current std::string/std::wstring situation is a mess.
However, my question is: why are the default std::string and "" literals not simply changed to use UTF-8, especially as UTF-8 is backward compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted automatically.

I've looked at ICU, but it is not very compatible with APIs assuming basic_string, e.g. no begin/end/c_str, etc.
The main issue is the conflation of in-memory representation and encoding.
None of the Unicode encodings is really amenable to text processing. Users will in general care about graphemes (what's on the screen), while the encodings are defined in terms of code points... and some graphemes are composed of several code points.
As such, when one asks: what is the 5th character of "Hélène" (French first name) the question is quite confusing:
In terms of graphemes, the answer is n.
In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)
Depending on the source of the question (an end-user in front of her screen or an encoding routine), the answer is completely different.
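The two representations of the name can be spelled out explicitly; the escape sequences below are the standard precomposed (NFC) and decomposed (NFD) forms, and the counts follow directly from them:

```cpp
#include <cassert>
#include <string>

// "Hélène" with precomposed letters (NFC): é = U+00E9, è = U+00E8.
// Six code points; the 5th (index 4) is 'n'.
const std::u32string nfc = U"H\u00E9l\u00E8ne";

// The same name with combining diacritics (NFD): e + U+0301 (combining acute),
// e + U+0300 (combining grave). Eight code points; the 5th is a plain 'e'.
const std::u32string nfd = U"He\u0301le\u0300ne";

// Both render identically on screen as six graphemes, yet "the 5th character"
// differs per representation; in UTF-8 the NFC form occupies 8 bytes.
```

So even a question as innocent as "how long is this string?" has at least three answers: bytes, code points, and graphemes.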
Therefore, I think that the real question is Why are we speaking about encodings here?
Today it does not make sense, and we would need two "views": Graphemes and Code Points.
Unfortunately the std::string and std::wstring interfaces were inherited from a time where people thought that ASCII was sufficient, and the progress made didn't really solve the issue.
I don't even understand why the in-memory representation should be specified, it is an implementation detail. All a user should want is:
to be able to read/write in UTF-* and ASCII
to be able to work on graphemes
to be able to edit a grapheme (to manage the diacritics)
... who cares how it is represented? I thought that good software was built on encapsulation?
Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so the support for UTF-8 under Windows is minimal.
Under Windows you cannot use UTF-8 as a codepage, but you can convert from or to UTF-8.
There are two snags to using UTF-8 on Windows.

You cannot tell how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4.

The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it uses an 8-bit Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)

The big snag with UTF-16 is that the binary representation of a string depends on the byte order in a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers where you cannot be sure that the other computer uses the same byte order.

So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted over a socket, I first convert it to UTF-8.

This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to routines responsible for I/O.
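The boundary conversion can be sketched portably. On Windows itself, MultiByteToWideChar / WideCharToMultiByte (with CP_UTF8) are the native route; the hand-rolled function below is an illustrative stand-in that shows the pattern and assumes well-formed input:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch: convert UTF-16 to UTF-8 at the I/O boundary, keeping UTF-16
// "inside" the program. Assumes valid input (no unpaired surrogates).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {                     // one UTF-8 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {             // two bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {           // three bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                             // four bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The UTF-8 output is byte-order independent, which is exactly why it suits files and sockets better than raw UTF-16.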
My C++ project is currently about 16K lines of code, and I admit I completely failed to think about Unicode support in the first place.
All I did was typedef std::string as String and jump into coding.
I have never really worked with Unicode in programs I wrote.
How hard is it to switch my project to Unicode now? Is it even a good idea?
Can I just switch to std::wchar without any major problems?
Probably the most important part of making an application unicode aware is to track the encoding of your strings and to make sure that your public interfaces are well specified and easy to use with the encodings that you wish to use.
Switching to a wider character (in C++, wchar_t) is not necessarily the correct solution. In fact, I would say it is usually not the simplest solution. Some applications can get away with specifying that all strings and interfaces use UTF-8 and not need to change at all. std::string can perfectly well be used for UTF-8 encoded strings.
However, if you need to interpret the characters in a string or interface with non-UTF-8 interfaces then you will have to put more work in but without knowing more about your application it is impossible to recommend a single best approach.
There are some issues with using std::wstring. If your application will be storing text in Unicode, and it will be running on different platforms, you may run into trouble. std::wstring relies on wchar_t, which is compiler dependent. In Microsoft Visual C++, this type is 16 bits wide, and will thus only support UTF-16 encodings. The GNU C++ compiler specifies this type to be 32 bits wide, and will thus only support UTF-32 encodings. If you then store the text in a file from one system (say Windows/VC++), and then read the file from another system (Linux/GCC), you will have to prepare for this (in this case convert from UTF-16 to UTF-32).
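The platform difference is easy to observe in code; a small sketch (the helper name is mine):

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Number of wchar_t code units needed for U+1F600, a code point outside the
// BMP: 1 where wchar_t is 32 bits (Linux/GCC, UTF-32), 2 where it is 16 bits
// (Windows/MSVC, a UTF-16 surrogate pair).
inline std::size_t wide_units_for_emoji() {
    return std::wcslen(L"\U0001F600");
}
```

The same wide literal thus has a different length, and a different on-disk byte sequence, depending on the compiler, which is the interoperability trap described above.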
Can I just switch to [std::wchar_t] without any major problems?
No, it's not that simple.
The encoding of a wchar_t string is platform-dependent. Windows uses UTF-16. Linux usually uses UTF-32. (C++0x will mitigate this difference by introducing separate char16_t and char32_t types.)
If you need to support Unix-like systems, you don't have all the UTF-16 functions that Windows has, so you'd need to write your own _wfopen, etc.
Do you use any third-party libraries? Do they support wchar_t?
Although wide characters are commonly-used for an in-memory representation, on-disk and on-the-Web formats are much more likely to be UTF-8 (or other char-based encoding) than UTF-16/32. You'd have to convert these.
You can't just search-and-replace char with wchar_t because C++ confounds "character" and "byte", and you have to determine which chars are characters and which chars are bytes.
Could someone give some info regarding the different character set options in Visual Studio's project property pages?
The options are:
None
Unicode
Multi byte
I would like to make an informed decision as to which to choose.
Thanks.
All new software should be Unicode enabled. For Windows apps that means the UTF-16 character set, and for pretty much everyone else UTF-8 is often the best choice. The other character set choices in Windows programming should only be used for compatibility with older apps. They do not support the same range of characters as Unicode.
With the multi-byte (MBCS) setting, characters take one or two bytes each depending on the codepage; with 'None', exactly one; with Unicode (UTF-16), two bytes per code unit (four for characters outside the BMP).

'None' is not good, as it doesn't support non-Latin symbols. It's very frustrating if a non-English user tries to input their name into an edit box. Do not use 'None'.

If you do not do custom computation of string lengths, then from a programmer's point of view multi-byte and Unicode do not differ, as long as you use the TEXT macro to wrap your string constants.
Some libraries explicitly require certain encoding (DirectShow etc.), just use what they want.
As Mr. Shiny recommended, Unicode is the right thing.
If you want to understand a bit more on what are the implications of that decision, take a look here: http://www.mihai-nita.net/article.php?artID=20050306b