Unicode - generally working with it in C++

Suppose we have an arbitrary string, s.
s has the property of being from just about anywhere in the world. People from the USA, Japan, Korea, Russia, China, and Greece all write into s from time to time. Fortunately, we don't have time travellers using Linear A, however.
For the sake of discussion, let's presume we want to do string operations such as:
reverse
length
capitalize
lowercase
index into
and, just because this is for the sake of discussion, let's presume we want to write these routines ourselves (instead of grabbing a library), and we have no legacy software to maintain.
There are three Unicode encoding forms: UTF-8, UTF-16, and UTF-32, each with pros and cons. But let's say I'm sorta dumb, and I want one Unicode to rule them all (because rolling a dynamically adapting library for three different kinds of string encodings that hides the difference from the API user sounds hard).
Which encoding is most general?
Which encoding is supported by wchar_t?
Which encoding is supported by the STL?
Are these encodings all (or none of them) null-terminated?
--
The point of this question is to educate myself and others in useful and usable information for Unicode: reading the RFCs is fine, but there's a 'stack' of information related to compilers, languages, and operating systems that the RFCs do not cover, but is vital to know to actually use Unicode in a real app.

Which encoding is most general
Probably UTF-32, though all three formats can store any character. UTF-32 has the property that every code point is encoded in a single code unit.
Which encoding is supported by wchar_t
None. That's implementation-defined. On most Windows platforms it's UTF-16; on most Unix platforms it's UTF-32.
Which encoding is supported by the STL
None, really. The STL can store any type of character you want. Just use the std::basic_string<T> template with a type large enough to hold your code units. Most operations (e.g. std::reverse) do not know about any sort of Unicode encoding, though.
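For illustration, something along these lines (a minimal sketch; it assumes a C++11 compiler for char32_t and the U"" literal):
#include <algorithm>
#include <string>

int main() {
    // One code unit per code point in UTF-32, so std::reverse is safe here
    // as long as the text contains no combining characters.
    std::basic_string<char32_t> s = U"abc\u03B1\u03B2\u03B3"; // "abc" plus Greek alpha, beta, gamma
    std::reverse(s.begin(), s.end());
}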
Are these encodings all (or none of them) null-terminated?
No. Null is a legal value in any of those encodings. Technically, NULL is a legal character in plain ASCII too. NULL termination is a C thing -- not an encoding thing.
Choosing how to do this has a lot to do with your platform. If you're on Windows, use UTF-16 and wchar_t strings, because that's what the Windows API uses to support Unicode. I'm not entirely sure what the best choice is for Unix platforms, but I do know that most of them use UTF-8.

Have a look at the open source library ICU, especially at the Docs & Papers section. It's an extensive library dealing with all sorts of unicode oddities.

In response to your final bullet, UTF-8 is guaranteed not to have NULL bytes in its encoding of any character (except NULL itself, of course). As a result, many functions that work with NULL-terminated strings also work with UTF-8 encoded strings.
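For example (a small illustration; the literal below is just hard-coded UTF-8 bytes):
#include <cstring>
#include <iostream>

int main() {
    // UTF-8 encodes every character other than NUL without zero bytes, so the
    // ordinary C string functions see a normal NUL-terminated string.
    const char *s = "Gr\xC3\xBC\xC3\x9F Gott";   // "Grüß Gott" encoded as UTF-8
    std::cout << std::strlen(s) << " bytes\n";   // 11 bytes for 9 characters
}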

Define "real app" :)
Seriously, the decision really depends a lot on the kind of software you are developing. If your target platform is the Win32 API (with or without wrappers such as MFC, WTL, etc.), you will probably want to use wstring types with the text encoded as UTF-16. That's simply because the entire Win32 API uses that encoding internally anyway.
On the other hand, if your output is something like XML/HTML and/or needs to be delivered over the internet, UTF-8 is pretty much the standard: it passes cleanly through protocols that assume characters are 8 bits.
As for UTF-32, I can't think of a single reason to use it, unless you need 1:1 mapping between code units and code points (that still does not mean 1:1 mapping between code units and characters!).
For more information, be sure to look at Unicode.org. This FAQ may be a good starting point.

Related

What are the essential resources for writing internationalized and localized applications in C++?

Essential concepts.
What string and character data-types to use?
What libraries / routines to use for input and output?
What translation mechanisms (I guess this would be gettext, libintl)?
Porting guidelines?
How far can the standard C/C++ library address the above concerns? How portable can I make my software across platforms? What are the standards / best-practices in this area?
I would avoid using wchar_t or std::wstring because this data type is not the same size in different environments. For example, on Windows it is 16 bits, while on Unix systems it is 32 bits. That is asking for trouble.
If you don't have the time/resources to implement the Unicode standard (the bare minimum) yourself, you are better off using std::string as a container for UTF-8 characters. Just be aware that with UTF-8 you will have to deal with a multibyte encoding (one character may correspond to one or more bytes).
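For illustration, counting code points in a UTF-8 std::string comes down to skipping continuation bytes (a rough sketch that assumes the input is valid UTF-8):
#include <cstddef>
#include <string>

// Counts Unicode code points, not bytes: continuation bytes have the
// bit pattern 10xxxxxx and do not start a new code point.
std::size_t utf8_length(const std::string &s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}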
As for libraries, ICU is something to consider; it will let you convert between encodings, transform to upper/lower/title case, and so on. It can also help with locales.
Translations, as Marius noted, are generally done through a function that looks up a table by the key you provide (be it a string, an id, or anything else) and returns the translated string.
Porting will go smoothly if you stick to data types that are the same size on every platform; ICU is a cross-platform library anyway, so that should work out.
wchar_t or std::wstring are your friends. Use them with the appropriate wide-character functions and objects like wcscpy() or std::wcout.
You would also use a string table for each locale, and a function like std::wstring getLocalizedString(MY_MESSAGE) that looks up the constant MY_MESSAGE in the string table for the current locale. How you implement the string table is up to you; storing such things in external files is always a good idea.
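A minimal sketch of that idea (getLocalizedString and MY_MESSAGE here are illustrative names, not a real API; a real implementation would load the table from an external file for the current locale):
#include <map>
#include <string>

enum MessageId { MY_MESSAGE, MY_OTHER_MESSAGE };

std::wstring getLocalizedString(MessageId id) {
    // Table for the current locale; hard-coded here for brevity.
    static std::map<MessageId, std::wstring> table;
    if (table.empty()) {
        table[MY_MESSAGE]       = L"Hello, world!";
        table[MY_OTHER_MESSAGE] = L"Goodbye!";
    }
    std::map<MessageId, std::wstring>::const_iterator it = table.find(id);
    return it != table.end() ? it->second : L"<missing translation>";
}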

Cross-platform C++: Use the native string encoding or standardise across platforms?

We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string type on Windows is UTF-16, and UTF-8 on Linux.
We can't decide whether the best approach:
Standardise on one of the two in all our application logic (and persistent data), and make the other platforms do the appropriate conversions
Use the natural format for the OS for application logic (and thus making calls into the OS), and convert only at the point of IPC and persistence.
To me they seem like they are both about as good as each other.
and UTF-8 on Linux.
That's mostly true for modern Linux. Actually, the encoding depends on which API or library is used. Some are hardcoded to use UTF-8, but some read the LC_ALL, LC_CTYPE, or LANG environment variables to detect which encoding to use (the Qt library, for example). So be careful.
We can't decide whether the best approach
As usual it depends.
If 90% of the code deals with a platform-specific API in a platform-specific way, it is obviously better to use platform-specific strings. As an example: a device driver or a native iOS application.
If 90% of the code is complex business logic that is shared across platforms, it is obviously better to use the same encoding on all platforms. As an example: a chat client or a browser.
In the second case you have a choice:
Use cross platform library that provides strings support (Qt, ICU, for example)
Use bare pointers (I consider std::string a "bare pointer" too)
If working with strings is a significant part of your application, choosing a good strings library is a smart move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.
When using a library for strings, you need to care about encoding only when working with external libraries, the platform API, or when sending strings over the net (or to disk). For example, a lot of Cocoa, C#, or Qt programmers (all of which have solid string support) know very little about encoding details, and that is good, since they can focus on their main task.
My experience with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (though one probably needs some experience and a Unicode background to appreciate that).
I agree that the bare-pointers approach is not for everyone. It is good when:
You work with entire strings, and splitting, searching, and comparing are rare tasks
You can use the same encoding in all components and need a conversion only when using the platform API
All your supported platforms have APIs to:
Convert from your encoding to the one used by the API
Convert from the API's encoding to the one used in your code
Pointers are not a problem for your team
In my somewhat specific experience, this is actually a very common case.
When working with bare pointers, it is good to choose one encoding that will be used across the entire project (or across all projects).
From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a strings library or the platform API for strings; it will save you a lot of time.
Advantages of UTF-8:
Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
C std library works great with UTF-8 strings. (*)
C++ std library works great with UTF-8 (std::string and friends). (*)
Legacy code works great with UTF-8.
Just about any platform supports UTF-8.
Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
No Little-Endian/Big-Endian mess.
You will not catch a classical bug "Oh, UTF-16 is not always 2 bytes?".
(*) Until you need to lexically compare them, transform case (toUpper/toLower), change the normalization form, or something like that; if you do, use a strings library or the platform API.
The disadvantages are questionable:
Less compact for Chinese (and other characters with large code point values) than UTF-16.
A little harder to iterate over code points (see the sketch below).
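A rough sketch of such iteration, assuming valid UTF-8 and omitting error handling:
#include <cstddef>
#include <stdint.h>
#include <string>

// Decodes the code point starting at s[i] and advances i past it.
uint32_t next_code_point(const std::string &s, std::size_t &i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                         // single-byte (ASCII)
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    uint32_t cp = lead & (0x3F >> extra);                 // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}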
So, I recommend using UTF-8 as the common encoding for projects that don't use any strings library.
But encoding is not the only question you need to answer.
There is such a thing as normalization. To put it simply, some letters can be represented in several ways: as a single code point or as a combination of several code points. The common problem with this is that most string-comparison functions treat these representations as different characters. If you are working on a cross-platform project, choosing one of the normalization forms as the standard is the right move. It will save you time.
For example, if a user's password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly prefers Normalization Form C). So if the user registered under Windows with such a password, they will have trouble logging in under a Mac.
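If ICU is available, normalizing at the boundaries might look roughly like this (a hedged sketch; error handling kept minimal):
#include <string>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Normalizes a UTF-8 string to NFC before comparing or hashing it,
// so NFC and NFD inputs of the same text end up byte-identical.
std::string to_nfc(const std::string &utf8) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return utf8;          // fall back to the input unchanged
    icu::UnicodeString normalized =
        nfc->normalize(icu::UnicodeString::fromUTF8(utf8), status);
    std::string out;
    normalized.toUTF8String(out);
    return out;
}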
In addition, I would not recommend using wchar_t (or use it only in Windows code as a UCS-2/UTF-16 character type). The problem with wchar_t is that there is no encoding associated with it. It's just an abstract wide character that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
I'd use the same encoding internally, and normalize the data at entry point. This will involve less code, less gotchas, and will allow you to use the same cross platform library for string processing.
I'd use UTF-16 because it's simpler to handle internally and should perform well, since most characters fit in a single 16-bit code unit. UTF-8 is ideal for output and storage because it's backward compatible with ASCII and only uses 8 bits for English characters. But inside the program, 16-bit code units are simpler to handle.
C++11 provides the new string types u16string and u32string. Depending on the support your compiler versions deliver and the expected lifetime of your project, it might be an idea to stay forward-compatible with those.
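For example (a tiny C++11 illustration):
#include <string>

std::u16string s16 = u"UTF-16 text";   // char16_t code units
std::u32string s32 = U"UTF-32 text";   // char32_t code units
std::string    s8  = u8"UTF-8 text";   // u8"" yields a narrow string encoded as UTF-8 (pre-C++20)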
Other than that, using the ICU library is probably your best shot at cross-platform compatibility.
This seems to be quite enlightening on the topic. http://www.utf8everywhere.org/
Programming with UTF-8 is difficult because byte lengths and character offsets get mixed up. For example:
std::string s = Something();
std::cout << s.substr(0, 4);
does not necessarily give you the first 4 characters.
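A concrete illustration of the problem (the string literal is just hard-coded UTF-8 bytes):
#include <iostream>
#include <string>

int main() {
    std::string s = "h\xC3\xA9llo";        // "héllo": the 'é' occupies two bytes
    std::cout << s.substr(0, 4) << '\n';   // prints "hél" -- 4 bytes, but only 3 characters
}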
I would use whatever a wchar_t is. On Windows that will be UTF-16. On some *nix platforms it might be UTF-32.
When saving to a file, I would recommend converting to UTF-8. That often makes the file smaller, and removes any platform dependencies due to differences in sizeof(wchar_t) or to byte order.
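One hedged way to do that conversion with just the standard library (std::wstring_convert is C++11 and deprecated in C++17, but still widely available):
#include <codecvt>
#include <locale>
#include <string>

// Converts a wide string to UTF-8 for writing to disk. Note that
// codecvt_utf8<wchar_t> treats wchar_t as UCS-2/UCS-4 depending on its size;
// on Windows, codecvt_utf8_utf16<wchar_t> handles surrogate pairs properly.
std::string to_utf8(const std::wstring &wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.to_bytes(wide);
}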

Arguments for and against supporting std::wstring exclusively in cross-platform library

I'm currently developing a cross-platform C++ library which I intend to be Unicode aware. I currently have compile-time support for either std::string or std::wstring via typedefs and macros. The disadvantage with this approach is that it forces you to use macros like L("string") and to make heavy use of templates based on character type.
What are the arguments for and against to support std::wstring only?
Would using std::wstring exclusively hinder the GNU/Linux user base, where UTF-8 encoding is preferred?
A lot of people would want to use Unicode with UTF-8 (std::string) and not UCS-2 (std::wstring). UTF-8 is the standard encoding on a lot of Linux distributions and databases, so not supporting it would be a huge disadvantage. On Linux, every call to a function in your library that takes a string argument would require the user to convert a (native) UTF-8 string to std::wstring.
On gcc/Linux each character of a std::wstring takes 4 bytes, while it takes 2 bytes on Windows. This can lead to strange effects when reading or writing files (and copying them from/to different platforms). I would rather recommend UTF-8/std::string for a cross-platform project.
What are the arguments for and against to support std::wstring only?
The argument in favor of using wide characters is that they can do everything narrow characters can, and more.
The arguments against it that I know of are:
wide characters need more space (which is hardly relevant; the Chinese do not, in principle, have more headaches over memory than Americans have)
using wide characters gives headaches to some Westerners who are used to all their characters fitting into 7 bits (and are unwilling to pay a bit of attention so as not to intermingle uses of the character type for actual characters vs. other uses)
As for being flexible: I have maintained a library (several kLoC) that could deal with both narrow and wide characters. Most of it was done through the character type being a template parameter; I don't remember any macros (other than UNICODE, that is). Not all of it was flexible, though; there was some code in there which ultimately required either a char or a wchar_t string. (No point in making internal key strings wide.)
Users could decide whether they wanted only narrow character support (in which case "string" was fine) or only wide character support (which required them to use L"string") or whether they wanted to support both, too (which required something like T("string")).
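A rough sketch of that kind of setup (USE_WIDE_CHARS and T() are illustrative names, not taken from the library described):
#include <string>

#ifdef USE_WIDE_CHARS
typedef wchar_t char_type;
#define T(x) L##x                  // make the literal wide
#else
typedef char char_type;
#define T(x) x                     // leave the literal narrow
#endif

typedef std::basic_string<char_type> string_type;

string_type greeting = T("hello world");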
For:
Joel Spolsky wrote The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. If you scroll to the bottom, you'll find that his crew uses wide character strings exclusively. If it's good enough for them, it's good enough for you. ;-)
Against:
You might have to interface with code that isn't i18n-aware. But like any good library writer, you'll just hide that mess behind an easy-to-use interface, right? Right?
I would say that using std::string or std::wstring is irrelevant.
Neither offers proper Unicode support anyway.
If you need internationalization, then you need proper Unicode support and should start investigating about libraries such as ICU.
After that, it's a matter of which encoding to use, and that depends on the platform you're on: wrap the OS-dependent facilities behind an abstraction layer and convert in the implementation layer when applicable.
Don't worry about the encoding used internally by the Unicode library you use (or build? hmm); it's a matter of performance and should not affect how the library is used.
Disadvantage:
Since wstring is really UCS-2 and not UTF-16, it will kick you in the shins one day. And it will kick hard.

Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a windows GUI

I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that.
I already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem.
The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes.
My feeling based on what I have read is that the HTML should be encoded as UTF-8. I know very little about GUI development, but the little bit I have read indicates that the GUI stuff is all based on UTF-16 encoded strings.
I'm trying to understand where this leaves me. Say we decide that all of our persisted data should be UTF-8 encoded XML. Does this mean that in order to display persisted data in a UI component, I should really be performing some sort of explicit UTF-8 to UTF-16 transcoding process?
I suspect my explanation could use clarification, so I'll try to provide that if you have any questions.
Windows from NT4 onwards is based on Unicode encoded strings, yes. Early versions were based on UCS-2, which is the predecessor of UTF-16, and thus does not support all of the characters that UTF-16 does. Later versions are based on UTF-16. Not all OSes are based on UTF-16/UCS-2, though. *nix systems, for instance, are based on UTF-8 instead.
UTF-8 is a very good choice for storing data persistently. It is a universally supported encoding in all Unicode environments, and it is a good balance between data size and lossless data compatibility.
Yes, you would have to parse the XML, extract the necessary information from it, and decode and transform it into something the UI can use.
On Windows, std::wstring is, in practice, treated as UCS-2: two bytes are used for each element, and the code tables mostly map to the Unicode format. It's important to understand that UCS-2 is not the same as UTF-16! UTF-16 allows "surrogate pairs" in order to represent characters which are outside of the two-byte range, but UCS-2 uses exactly two bytes for each character, period.
The best rule for your situation is to do your transcoding when you read and write to the disk. Once it's in memory, keep it in UCS-2 format. Windows APIs will read it as if it were UTF-16 (which is to say, while std::wstring doesn't understand the concept of surrogate pairs, if you manually create them (which you won't, if your only language is English), Windows will read them).
Whenever you're reading data in or out of serialization formats (such as XML) these days, you'll probably need to do transcoding. It's an unpleasant and very unfortunate fact of life, but inevitable, since the common Unicode encodings are variable-width and most character-based operations in C++ are done on arrays, which need consistent element sizes.
Higher-level frameworks, such as .NET, obscure most of the details, but behind the scenes, they're handling the transcoding in the same fashion: changing variable-width data to fixed-width strings, manipulating them, and then changing them back into variable-width encodings when required for output.
AFAIK when you work with std::wstring on Windows in C++ and store using UTF-8 in files (which sounds good and reasonable), then you have to convert the data to UTF-8 when writing to a file, and convert back to UTF-16 when reading from a file. Check out this link: Writing UTF-8 Files in C++.
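On Windows the conversion itself can also be done directly with the Win32 API; a minimal sketch (error handling omitted):
#include <string>
#include <windows.h>

// UTF-8 (as read from the file) to UTF-16 for use with std::wstring / the GUI.
// WideCharToMultiByte with CP_UTF8 performs the reverse conversion for writing.
std::wstring utf8_to_utf16(const std::string &utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}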
I would stick with the Visual Studio default of project -> Properties -> Configuration Properties -> General -> Character Set -> Use Unicode Character Set, use the wchar_t type (i.e. with std::wstring) and not use the TCHAR type. (E.g. I would just use the wcslen version of strlen and not _tcslen.)
One advantage to using std::wstring on Windows for GUI-related strings is that internally all Windows API calls use and operate on UTF-16. If you've ever noticed, there are two versions of every Win32 API call that takes string arguments. For example, "MessageBoxA" and "MessageBoxW". Both definitions exist in <windows.h>, and in fact you can call either one you want, but if <windows.h> is included with Unicode support enabled, then the following will happen:
#define MessageBox MessageBoxW
Then you get into TCHARs and other Microsoft tricks that try to make it easier to deal with APIs that have both an ANSI and a Unicode version. In short, you can call either, but under the hood the Windows kernel is Unicode-based, so you'll be paying the cost of converting to Unicode for each string-accepting Win32 API call if you don't use the wide-char version.
(See also: UTF-16 and Windows kernel use.)
Even if you say you only have English in your data, you're probably wrong: since we live in a global world now, names, addresses, etc. contain foreign characters. OK, I do not know what type of data you have, but generally I would say build your application to support Unicode for both storing data and displaying data to the user. That suggests using XML with UTF-8 for storage and the Unicode ("W") versions of the Windows calls when you do GUI work. And since the Windows GUI uses UTF-16, where each code unit is 16 bits, I would suggest storing the data in the application in a 16-bit wide string. And I would guess your compiler for Windows has a 16-bit std::wstring for just this purpose.
So then you have to do a lot of conversion between UTF-16 and UTF-8. Do that with an existing library, for instance ICU.

C++ strings: UTF-8 or 16-bit encoding?

I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it's a combination of both).
There are a few wishes/constraints:
It would be cool if it could run on limited hardware, such as computers with limited memory.
I want the code to run on Windows, Mac and (if resources allow) Linux.
I'll be using wxWidgets as my GUI layer, but I want the code that interacts with that toolkit confined in a corner of the codebase (I will have non-GUI executables).
I would like to avoid working with two different kinds of strings when working with user-visible text and with the application's data.
Currently, I'm working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.
If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?
UTF-16 is still a variable-length encoding (there are more than 2^16 Unicode code points), so you can't do O(1) string indexing operations. If you're doing lots of that sort of thing, you're not saving anything in speed over UTF-8. On the other hand, if your text includes a lot of code points in the U+0800 to U+FFFF range (three bytes in UTF-8 but only two in UTF-16), UTF-16 can be a substantial improvement in size. UCS-2 is a variation on UTF-16 that is fixed length, at the cost of prohibiting any code points above U+FFFF.
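To make the variable-length point concrete, a small sketch of counting code points in a UTF-16 string (assuming well-formed input and C++11's std::u16string):
#include <cstddef>
#include <string>

// Code points above U+FFFF are stored as two 16-bit code units (a surrogate
// pair), so counting characters requires a scan rather than just s.size().
std::size_t utf16_length(const std::u16string &s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            ++i;                      // skip the trailing (low) surrogate
        ++count;
    }
    return count;
}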
Without knowing more about your requirements, I would personally go for UTF-8. It's the easiest to deal with for all the reasons others have already listed.
I have never found any reasons to use anything else than UTF-8 to be honest.
If you decide to go with UTF-8 encoding, check out this library: http://utfcpp.sourceforge.net/
It may make your life much easier.
I've actually written a widely used application (5 million+ users), so every kilobyte used adds up, literally. Despite that, I just stuck to wxString. I've configured it to be derived from std::wstring, so I can pass them to functions expecting a wstring const&.
Please note that std::wstring is effectively UTF-32 on the Mac (no surrogate pairs needed for characters above U+FFFF), and therefore it uses 4 bytes per wchar_t. The big advantage of this is that i++ gets you the next character, always. On Win32 that is true in only 99.9% of the cases. As a fellow programmer, you'll understand how little 99.9% is.
But if you're not convinced, write the function to uppercase a std::string[UTF-8] and a std::wstring. Those two functions will tell you which way is insanity.
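For the std::wstring side, a per-code-unit sketch is at least straightforward to write (note that std::towupper is locale-dependent and knows nothing about surrogate pairs or multi-character case mappings; proper Unicode case conversion belongs in a library such as ICU):
#include <cstddef>
#include <cwctype>
#include <string>

// Upper-cases one wchar_t at a time. Good enough to make the point above;
// not correct for full Unicode case mapping (e.g. 'ß' -> "SS").
std::wstring to_upper(std::wstring s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        s[i] = static_cast<wchar_t>(std::towupper(static_cast<wint_t>(s[i])));
    return s;
}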
Your on-disk format is another matter. For portability, that should be UTF-8. There's no endianness concern in UTF-8, nor a discussion over the width (2/4). This may be why many programs appear to use UTF-8.
On a slightly unrelated note, please read up on Unicode string comparisons and normalization. Or you'll end up with the same bug as .NET, where you can have two variables föö and föö differing only in (invisible) normalization.
I would recommend UTF-16 for any kind of data manipulation and UI.
The Mac OS X and Win32 APIs use UTF-16, and the same goes for wxWidgets, Qt, ICU, Xerces, and others.
UTF-8 might be better for data interchange and storage.
See http://unicode.org/notes/tn12/.
But whatever you choose, I would definitely recommend against std::string with UTF-8 "only when necessary".
Go all the way with UTF-16 or UTF-8, but do not mix and match, that is asking for trouble.
MicroATX is pretty much a standard PC motherboard format, and most such boards are capable of 4-8 GB of RAM. If you're talking picoATX, maybe you're limited to 1-2 GB of RAM. Even then that's plenty for a development environment. I'd still stick with UTF-8 for the reasons mentioned above, but memory shouldn't be your concern.
From what I've read, it's better to use a 16-bit encoding internally unless you're short on memory. It fits almost all living languages in one code unit.
I'd also look at ICU. If you're not going to be using certain STL features of strings, using the ICU string types might be better for you.
Have you considered using wxString? If I remember correctly, it can do UTF-8 <-> Unicode conversions, and that will make things a bit easier when you have to pass strings to and from the UI.