I'm looking for a portable, easy-to-use string library for C/C++ that helps me work with Unicode input/output. Ideally, it would store its strings in memory as UTF-8 and let me convert strings between ASCII, UTF-8 and UTF-16. I don't need much more than that (OK, a liberal license won't hurt). I have seen that C++ comes with a <locale> header, but it seems to work on wchar_t only, which may or may not be UTF-16 encoded, and I'm not sure how good it actually is.
Use cases are, for example: on Windows, the Unicode APIs expect UTF-16 strings, and I need to convert ASCII or UTF-8 strings to pass them on to the API. The same goes for XML parsing, which may come in UTF-16 but which I actually only want to process internally as UTF-8 (or, for that matter, if I switch internally to UTF-16, I'll need a conversion to that anyway).
So far, I've taken a look at ICU, which is quite huge. Moreover, it wants to be built using its own project files, while I'd prefer a library for which there is either a CMake project or which is easy to build (something like "compile all these .c files, link, and good to go"), instead of shipping something as large as ICU along with my application.
Do you know such a library, which is also being maintained? After all, this seems to be a pretty basic problem.
UTF8-CPP seems to be exactly what you want.
I'd recommend that you look at the GNU iconv library.
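To give a feel for how iconv is typically driven, here is a rough sketch of a UTF-8 to UTF-16LE conversion; the helper name `utf8_to_utf16le` is my own, and error handling is reduced to asserts:

```cpp
#include <iconv.h>
#include <cassert>
#include <string>

// Convert a UTF-8 string to UTF-16LE bytes using POSIX iconv.
// Returns the raw little-endian byte sequence (2 bytes per BMP code point).
std::string utf8_to_utf16le(const std::string& in) {
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    assert(cd != (iconv_t)-1);

    std::string out(in.size() * 4 + 4, '\0');   // generous worst-case output size
    char* inbuf = const_cast<char*>(in.data()); // glibc iconv takes char**
    size_t inleft = in.size();
    char* outbuf = &out[0];
    size_t outleft = out.size();

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    assert(rc != (size_t)-1);

    out.resize(out.size() - outleft);           // trim the unused tail
    iconv_close(cd);
    return out;
}
```

A real implementation would loop on E2BIG to grow the buffer instead of over-allocating up front, but the call sequence (open, convert with in/out cursors, close) is the whole API.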
There is another portable C library for string conversion between UTF-8, UTF-16, UTF-32 and wchar_t: the mdz_unicode library.
I only have experience with processing ASCII (single byte characters) and have read a number of posts on how people process Unicode differently which present their own set of issues.
At this point of my very limited exposure to Unicode, I’ve read that internal processing with UTF-16 presents portability and other issues.
I feel that UTF-32 makes more sense than UTF-16 since all Unicode characters fit within 4 bytes but would consume more resources, especially if you are mainly dealing with ISO-8859-1 characters.
I humbly feel that UTF-8 could be an ideal format to work with internally (especially for cases where you deal mainly with English and Latin-based characters), since the ASCII range of characters would be handled byte by byte very efficiently. Accented Latin characters would consume two bytes, and other characters more bytes, of course.
Another advantage that I see is that UTF-8 strings could be stored within regular C++ std::string or C string arrays which seems so natural.
The disadvantage for using UTF-8 for me at least is that I have not found any libraries to support UTF-8 internally. For example, I have not found any libraries for UTF-8 case conversion and substring operations.
Another disadvantage for me is that I have not found functions to parse bytes within a UTF-8 string for character processing.
Would it be feasible to work with UTF-8 internally, and are there any support libraries available for this purpose? I do hope so, but if not, I think my best option would be to forget using UTF-8 internally and use Boost.Locale, since I've read that ICU is a mature library used by many to handle Unicode.
I would really like to hear your opinions on this matter.
I bumped into my very old answer and I'll tell you what I ended up doing. I decided to stick with UTF-8 and store my data in std::string or single byte char arrays. There was never a need for me to use multi-byte characters!
The first library that I used was UTF8-CPP which is very easy to bring into your app and use. But you soon find that you need more and more capability.
I really wanted to avoid using ICU because it is such a large library, but once you build it and get it installed, you begin to wish that you had done it in the first place because it has everything you need and much, much more.
What are my benefits, you may wonder:
I write truly portable code that builds under VC++ for Windows or GCC for Linux.
ICU has everything, and I mean everything, you need concerning Unicode.
I am able to stick with my beloved std::string and char arrays.
I use many open source libraries in my apps with zero issues. For example, I use RapidJson for my JSON to create in-memory JSON objects containing UTF-8 data. I'm able to pass them to a web server or write them to disk, etc. Really simple.
I store my data in Firebird SQL, but you need to declare your varchar and char field types as UTF8. This means that your strings will be stored as multi-byte in the database, but this is totally transparent to you, the developer. I am certain that this applies to other SQL databases as well.
Drawbacks:
Large library, very scary and confusing at first.
The C++ API was not written by C++ experts (like the Boost developers), but the code is totally stable and fast. You may not like the syntax used, though. What I've done is "wrap" common procedures with my own code; this pretty much means that I include my own UTF-8 library which wraps the ICU uglies. Don't let this bother you, because ICU is totally stable and fast.
I personally dynamically link ICU into my applications. This means that I first built ICU dynamically for my Win and Linux 64 bit environments. In the case of Windows, I store the dlls in a folder somewhere and add that to my Windows path so that any app that requires ICU can find the dlls.
When I looked at built-in language features, I found several lacking, such as lower/upper case conversion, word boundaries, counting characters, accent sensitivity, and string manipulation such as substrings. ICU's locale support is also totally amazing.
I guess that summarizes my entire exercise in UTF-8.
I'm writing a portable library that deals with files and directories. I want to use UTF-8 for my input (directory paths) and output (file paths). The problem is, Windows gives me a choice between UTF-16-that-used-to-be-UCS-2 and codepages. So I have to convert all my UTF-8 strings to UTF-16, pass them to the WinAPI, and convert the results back to UTF-8. C++11 seems to provide the <locale> library just for that, except that, from what I understood, none of the predefined specializations uses UTF-8 as the internal (i.e. my-side) encoding - the closest there is is UTF-16-to-UTF-8, which is the exact opposite of what I want. So here's the first question:
1) How to use codecvt thingamajigs to convert my UTF-8 strings to UTF-16 for WinAPI calls, and the UTF-16 results back to UTF-8?
Another problem: I'm also targeting Linux. On Linux, there is very good support for many different locales - and I don't want to be any different. Hopefully everyone will use UTF-8 on their Linux machines, but there is no strict guarantee of that. So I thought it would be a good idea to extend the above Windows-specific behavior and always convert between UTF-8 and the system locale's encoding. Except that I don't see any way in C++11's <locale> library to get the current system encoding! The std::locale constructor either builds a locale I specify by name myself or, by default, returns the classic "C" locale. And there are no other getters I'm aware of. So here's the second question:
2) How to detect current system locale? Something in <locale>? Maybe some standard C library function, or (less portable but okay in this case) something in POSIX API?
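(For reference, on POSIX systems the usual answer to this is setlocale() plus nl_langinfo(CODESET); a minimal sketch, with the helper name being illustrative only:)

```cpp
#include <clocale>     // std::setlocale
#include <langinfo.h>  // nl_langinfo, CODESET (POSIX)
#include <string>

// Returns the narrow-character encoding of the user's environment locale,
// e.g. "UTF-8" on most modern Linux desktops, or "ANSI_X3.4-1968" (ASCII)
// when running under the plain C/POSIX locale.
std::string system_codeset() {
    std::setlocale(LC_ALL, "");   // "" = adopt the locale from the environment
    return nl_langinfo(CODESET);  // CODESET names that locale's encoding
}
```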
The design of these facilities in the standard library assumes that multibyte character encodings (like UTF-8) are used only for external storage (i.e. byte sequences in files on disk) and that all characters in memory are uniform in size. This is so things like std::basic_string<T>::operator[] can behave in a manner consistent with the performance constraints imposed by the standard. So while you can use files encoded in UTF-8 or some other MBCS (like those for Japanese), your strings in memory should be char, char16_t, char32_t or wchar_t.
This is why you aren't finding a match in the standard library for what you want to do because strings in memory aren't intended to be stored in UTF-8. This is similar to other languages as well, such as Java, where data on disk is interpreted as a stream of bytes and to turn them into strings you need to tell some component the expected character encoding of the byte stream. Some operating systems may stuff a UTF-8 string into argv[], but this is non-standard. This is the reason that the Unicode enabled entry point for WinMain on Windows provides a NUL terminated pointer to wchar_t and not a char* pointing to a UTF-8 encoded string.
IBM's International Components for Unicode library provides a whole set of components that are complementary to, and designed to work with, the C++ standard library. I would look at their code conversion facilities. While the standard defines facilities in <locale> for code conversion, it doesn't guarantee the existence of a code conversion facility that maps from UTF-8 to char16_t, char32_t, or wchar_t. If such a thing exists, you'll only get it thanks to the details of your implementation. The ICU library provides this functionality portably for any C++ implementation. It is well supported, well used, and unlikely to have bugs decoding UTF-8 strings into the appropriate wider-than-char string.
Konrad mentioned the UTF-8 Everywhere manifesto in a comment. It was an interesting read, and it points you to the Boost.Nowide library (not officially a part of Boost yet) for solutions to the problems you cite above.
Please note that my answer is simply a description of the way the existing C++ standard library classes like std::basic_string<T> work. It is not advice against UTF-8, Unicode, or anything else. The manifesto cited agrees with me that these things simply don't work this way and if you want to use UTF-8 anywhere, then you need something else.
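For completeness: in practice, C++11 implementations do ship one relevant wrapper, std::wstring_convert with std::codecvt_utf8_utf16 (deprecated since C++17, which is itself an argument for ICU or Boost.Nowide). A sketch of the round trip asked about in question 1, with illustrative helper names:

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated in C++17, still shipped)
#include <locale>   // std::wstring_convert
#include <string>

// UTF-8 (std::string) -> UTF-16 (std::u16string), e.g. before a W-suffixed WinAPI call.
std::u16string utf8_to_utf16(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}

// UTF-16 -> UTF-8, for bringing API results back into the internal encoding.
std::string utf16_to_utf8(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}
```

On Windows you would reinterpret between char16_t* and the API's wchar_t*/WCHAR* (both are 16-bit UTF-16 code units there); on other platforms wchar_t is usually 32-bit, which is exactly why char16_t is used here.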
Essential concepts:
What string and character data-types to use?
What libraries / routines to use for input and output?
What translation mechanisms (I guess this would be gettext, libintl)?
Porting guidelines?
How far can the standard C/C++ library address the above concerns? How portable can I make my software across platforms? What are the standards / best-practices in this area?
I would avoid using wchar_t or std::wstring, because this data type is not the same size in different environments: on Windows it is 16 bits, while on Unix systems it is 32 bits. That is asking for trouble.
If you don't have the time/resources to implement the Unicode standard (the bare minimum) yourself, you are better off using std::string as a container for UTF-8 characters. You must be aware, though, that with UTF-8 you are dealing with a multibyte encoding (one character may correspond to one or more bytes).
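The multibyte point can be illustrated with a small helper (the name is mine): counting code points in a UTF-8 std::string works by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a (valid) UTF-8 string held in std::string:
// every byte except continuation bytes (0b10xxxxxx) starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    return count;
}
```

So `utf8_length` of a 5-byte string like "café " minus the trailing space reports 4 characters, while std::string::size() reports the byte count.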
As for libraries, ICU is something to consider: it will allow you to convert between encodings, transform to upper/lower/title case, etc. It can also help with locales.
Translations, as Marius noted, are generally done through a function that looks up a table by the key you provide (be it a string, an id, or anything else) and in the end returns the translated string.
Porting will go smoothly if you stick to data types that are the same on every platform; not to mention that ICU is a cross-platform library, so it should work out.
wchar_t or std::wstring are your friends. Use them with the appropriate wide-character functions and objects like wcscpy() or std::wcout.
You also would use a string-table for each locale, and a function like std::wstring getLocalizedString(MY_MESSAGE) that looks up the constant MY_MESSAGE in a string-table for the current locale. How you implement the string table is up to you, storing such things in external files is always a good idea.
We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string type on Windows is UTF-16, and UTF-8 on Linux.
We can't decide which is the best approach:
Standardise on one of the two in all our application logic (and persistent data), and make the other platforms do the appropriate conversions
Use the natural format for the OS for application logic (and thus making calls into the OS), and convert only at the point of IPC and persistence.
To me they seem like they are both about as good as each other.
and UTF-8 on Linux.
That's mostly true for modern Linux. Actually, the encoding depends on what API or library is used: some are hardcoded to use UTF-8, but some read the LC_ALL, LC_CTYPE or LANG environment variables to detect the encoding to use (like the Qt library). So be careful.
We can't decide which is the best approach
As usual it depends.
If 90% of the code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. As an example: a device driver or a native iOS application.
If 90% of the code is complex business logic that is shared across platforms, it is obviously better to use the same encoding on all platforms. As an example: a chat client or a browser.
In the second case you have a choice:
Use a cross-platform library that provides string support (Qt or ICU, for example)
Use bare pointers (I consider std::string a "bare pointer" too)
If working with strings is a significant part of your application, choosing a nice string library is a good move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.
When using a library for strings, you need to care about encoding only when working with external libraries, platform APIs, or when sending strings over the net (or to disk). For example, a lot of Cocoa, C# or Qt programmers (all of which have solid string support) know very little about encoding details (and that is good, since they can focus on their main task).
My experience in working with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (but one probably needs some experience and a Unicode background to feel that).
I agree that the bare-pointers approach is not for everyone. It is good when:
You work with entire strings, and splitting, searching and comparing are rare tasks
You can use the same encoding in all components and need a conversion only when using platform APIs
All your supported platforms have APIs to:
Convert from your encoding to the one used by the API
Convert from the API encoding to the one used in your code
Pointers are not a problem in your team
From my (admittedly somewhat specific) experience, this is actually a very common case.
When working with bare pointers, it is good to choose one encoding that will be used across the entire project (or across all projects).
From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a strings library or the platform API for strings - it will save you a lot of time.
Advantages of UTF-8:
Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
C std library works great with UTF-8 strings. (*)
C++ std library works great with UTF-8 (std::string and friends). (*)
Legacy code works great with UTF-8.
Almost every platform supports UTF-8.
Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
No Little-Endian/Big-Endian mess.
You will not catch a classical bug "Oh, UTF-16 is not always 2 bytes?".
(*) Until you need to lexically compare them, transform case (toUpper/toLower), change normalization form, or something like this - if you do, use a strings library or the platform API.
The disadvantages are questionable:
Less compact for Chinese (and other symbols with large code point numbers) than UTF-16.
A little harder to iterate over symbols.
So, I recommend using UTF-8 as the common encoding for projects that don't use any strings library.
But encoding is not the only question you need to answer.
There is such a thing as normalization. To put it simply, some letters can be represented in several ways: as one code point or as a combination of different code points. The common problem is that most string-compare functions treat these as different symbols. If you are working on a cross-platform project, choosing one of the normalization forms as your standard is the right move. It will save you time.
For example, if a user password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly likes Normalization Form C). So if a user registered under Windows with such a password, it will be a problem for them to log in under a Mac.
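The effect is easy to observe with raw byte literals (no library needed): the same rendered character in NFC and NFD compares unequal byte-for-byte.

```cpp
#include <string>

// "é" in NFC: one precomposed code point U+00E9 (two UTF-8 bytes).
const std::string nfc = "\xc3\xa9";
// "é" in NFD: 'e' followed by the combining acute accent U+0301 (three bytes total).
const std::string nfd = "e\xcc\x81";
// Both render identically, yet nfc != nfd and they even have different lengths -
// which is exactly why a project should pick one normalization form at its boundaries.
```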
In addition, I would not recommend using wchar_t (or would use it only in Windows code as a UCS-2/UTF-16 char type). The problem with wchar_t is that no encoding is associated with it: it's just an abstract wide char that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).
I'd use the same encoding internally, and normalize the data at entry point. This will involve less code, less gotchas, and will allow you to use the same cross platform library for string processing.
I'd use UTF-16 because it's simpler to handle internally and should perform better because of the (mostly) constant length of each character. UTF-8 is ideal for output and storage because it's backward compatible with ASCII, and only uses 8 bits for English characters. But inside the program, 16-bit is simpler to handle.
C++11 provides the new string types u16string and u32string. Depending on the support your compiler versions deliver, and the expected life expectancy, it might be an idea to stay forward-compatible to those.
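A small sketch of those types, which also illustrates why UTF-16 is not actually fixed-width (the "UTF-16 is not always 2 bytes" bug mentioned in another answer):

```cpp
#include <string>

// C++11 Unicode string types: u16string holds UTF-16 code units,
// u32string holds UTF-32 code points.
const std::u16string bmp     = u"\u00e9";     // 'é': one UTF-16 code unit
const std::u16string emoji16 = u"\U0001F600"; // one emoji, but TWO code units (surrogate pair)
const std::u32string emoji32 = U"\U0001F600"; // the same emoji, ONE code point in UTF-32
```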
Other than that, using the ICU library is probably your best shot at cross-platform compatibility.
This seems to be quite enlightening on the topic. http://www.utf8everywhere.org/
Programming with UTF-8 is difficult, as lengths and offsets get mixed up. For example:
std::string s = Something();
std::cout << s.substr(0, 4);
does not necessarily give the first 4 characters.
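To make the pitfall concrete (byte escapes avoid any source-charset assumptions; the sample string is my own):

```cpp
#include <string>

// "été" is three characters but five bytes in UTF-8 (each 'é' is 0xC3 0xA9).
const std::string s = "\xc3\xa9t\xc3\xa9";
// substr counts bytes, not characters: asking for "the first four" slices the
// second 'é' in half, leaving a lone lead byte 0xC3 - i.e. invalid UTF-8.
const std::string broken = s.substr(0, 4);
```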
I would use whatever a wchar_t is. On Windows that will be UTF-16. On some *nix platforms it might be UTF-32.
When saving to a file, I would recommend converting to UTF-8. That often makes the file smaller, and removes any platform dependencies due to differences in sizeof(wchar_t) or to byte order.
What is the best practice of Unicode processing in C++?
Use ICU for dealing with your data (or a similar library)
In your own data store, make sure everything is stored in the same encoding
Make sure you are always using your unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like is_alpha unless that is the definition you want.
I can't say it enough: never iterate over the indices of a string if you care about correctness, always use your unicode library for this.
If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf
So the truly best practice for Unicode processing in C++ would be to use the built in facilities for it. That isn't always a possibility with older code bases though, with the standard being so new at present.
EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.
Our company (and others) use the open-source International Components for Unicode (ICU) library, originally developed by Taligent.
It handles strings, locales, conversions, date/times, collation, transformations, et al.
Start with the ICU User Guide.
Here is a checklist for Windows programming:
All strings enclosed in _T("my string")
strlen() etc. functions replaced with _tcslen() etc.
Use LPTSTR and LPCTSTR instead of char * and const char *
When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
For C++ strings, use std::wstring instead of std::string
Look at
Case insensitive string comparison in C++
That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx
If you look on the left-hand navigation side on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx)
It has the following subsections:
The Code-Page Model
Double-Byte Character Sets in Windows
Unicode
Compatibility Issues in Mixed Environments
Unicode Data Conversion
Migrating Windows-Based Programs to Unicode
Summary
Although this may not be best practice for everyone, you can write your own C++ UNICODE routines if you want!
I just finished doing it over a weekend. I learned a lot, and though I don't guarantee it's 100% bug-free, I did a lot of testing and it seems to work correctly.
My code is under the New BSD license and can be found here:
http://code.google.com/p/netwidecc/downloads/list
It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and standard ASCII. If you throw away the main code, you've got a nice library for reading / writing Unicode.
As has been said above, a library is the best bet when building a large system. However, sometimes you do want to handle things yourself (maybe because the library would use too many resources, as on a microcontroller). In that case you want a simple library from which you can copy out just the parts you actually need.
Willow Schlanger's example code seems like a good one (see his answer for more details).
I also found another one that has less code, but lacks full error checking and only handles UTF-8; it was, however, simpler to take parts out of.
Here's a list of the embedded libraries that seem decent.
Embedded libraries
http://code.google.com/p/netwidecc/downloads/list (UTF8, UTF16LE, UTF16BE, UTF32)
http://www.cprogramming.com/tutorial/unicode.html (UTF8)
http://utfcpp.sourceforge.net/ (Simple UTF8 library)
Use IBM's International Components for Unicode
Have a look at the recommendations of UTF-8 Everywhere