How to use ICU with UTF-16? - c++

I'm looking into using ICU for Unicode string processing in a native Node.js module because it seems to me that v8::String (according to these docs) doesn't have a C++ API for this purpose.
To my knowledge V8 expects UTF-16 in ExternalStringResource and other APIs, so I'd like to use ICU for UTF-16 processing.
I specifically need to:
Iterate over the characters (not just the 16-bit code units) of an UTF-16 string
Tell the number of characters (not just the 16-bit code units) that an UTF-16 string contains
So I looked at the ICU documentation and found the UnicodeString and CharacterIterator classes. However, UnicodeString doesn't have a fromUTF16 method, only fromUTF8 and fromUTF32.
The other thing I'm unsure about is, does the UnicodeString constructor copy the data I give it or not? I'd very much prefer to use a zero-copy approach where I'd just work with an immutable object so it shouldn't perform any copy operations, just use the buffer I point it at.
I'm also unsure if I can just use UCharIterator (assuming I can somehow convert UChar* from my UTF-16 strings).
So my question is: How do I use ICU for the above purposes?
Thanks in advance for your answers!

UnicodeString uses UTF-16 for storage by default. That's why it only has fromUTF8 and fromUTF32: from UTF-16 there is no conversion to be made.
It does copy the data. It is an owning string, much like std::string.
You can use UCharIterator if you don't want to copy the data. UChar is a 16-bit value. You can force it to be whatever 16-bit type you prefer working with by defining the UCHAR_TYPE macro:
Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t), or wchar_t if that is 16 bits wide; always assumed to be unsigned.
If neither is available, then define UChar to be uint16_t.
This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.

Related

C++ using ICU and Nana GUI Library - String conversion?

I just did some successful tests with ICU in C/C++. I need to parse different CSV files with different encodings (might be UTF-8, UTF-16LE, ), do some modifications on the data and finally output everything as UTF-8 into a file. That's why my choice fell for ICU. Character set detection works pretty well usually, character handling and conversion to UTF-8 too.
Now I wanted to integrate that library part that does CSV loading, manipulation and so on with a GUI library, Nana. Nana seems to use std::string and std::wstring.
As ICU stores all data internally as UTF-16, so either I got UChars or UnicodeStrings when working with ICU. But how could I use either of them with Nana, that doesn't 'integrate' with ICU? Any way to transform UChar arrays to wstring, or a UnicodeString to wstring?
Didn't find any hints in the ICU documentation, so...maybe somebody else already worked on that?
Most nana functions expect std::string encoded in UTF-8.
You could use the ICU functions that take or return char * to do the conversion to UTF-8.
A few of nana functions, like widget::caption have overloads for std::wstring expected to be encoded in UTF-16 (in windows) or UTF-32 (in Linux) which could be used to pass to the OS what could be the string with the native character type and encoding.
In case you need conversions nana offers nana::charset which can manage (explicitly or implicitly) some of the most frequently needed conversions from/to UTF-8/UTF-16/UTF-32.
If you experiment passing static_cast<wchar_t *>(some_UChar*) to nana, please tell us about the result. I can't test.
The nana documentation about Unicode treatment urgently need to be updated (mea culpa)
According to ICU documentation, a UChar array is an array of 16 bits wide characters... meaning a wchar_t array in common implementations. That means that provided wchar_t is 16 bits wide in your system, you can safely cast the result of the getTerminatedBuffer() function to a const wchar_t * and either use it directly as a C wide chararcter string, or use it to build a std::wstring.

how character sets are stored in strings and wstrings?

So, i've been trying to do a bit of research of strings and wstrings as i need to understand how they work for a program i'm creating so I also looked into ASCII and unicode, and UTF-8 and UTF-16.
I believe i have an okay understanding of the concept of how these work, but what i'm still having trouble with is how they are actually stored in 'char's, 'string's, 'wchar_t's and 'wstring's.
So my questions are as follows:
Which character set and encoding is used for char and wchar_t? and are these types limited to using only these character sets / encoding?
If they are not limited to these character sets / encoding, how is it decided what character set / encoding is used for a particular char or wchar_t? is it automatically decided at compile for example or do we have to explicitly tell it what to use?
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so how is this stored? for example is it simply stored identically to ASCII if it only uses 1 byte? and how does the type (char or wchar_t or whatever) know how many bytes it is using?
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, eg. a string can't be used where a wstring is needed. But in a program that requires a wstring would it be better practice to write a conversion function from a string to a wstring and the use this when a wstring is required to make my code exclusively string-based or just use wstring where needed instead?
Thanks, and let me know if any of my questions are incorrectly worded or use the wrong terminology as i'm trying to get to grips with this as best as I can.
i'm working in C++ btw
They use whatever characterset and encoding you want. The types do not imply a specific characterset or encoding. They do not even imply characters - you could happily do math problems with them. Don't do that though, it's weird.
How do you output text? If it is to a console, the console decides which character is associated with each value. If it is some graphical toolkit, the toolkit decides. Consoles and toolkits tend to conform to standards, so there is a good chance they will be using unicode, nowadays. On older systems anything might happen.
UTF8 has the same values as ASCII for the range 0-127. Above that it gets a bit more complicated; this is explained here quite well: https://en.wikipedia.org/wiki/UTF-8#Description
wstring is a string made up of wchar_t, but sadly wchar_t is implemented differently on different platforms. For example, on Visual Studio it is 16 bits (and could be used to store UTF16), but on GCC it is 32 bits (and could thus be used to store unicode codepoints directly). You need to be aware of this if you want your code to be portable. Personally I chose to only store strings in UTF8, and convert only when needed.
Which character set and encoding is used for char and wchar_t? and are these types limited to using only these character sets / encoding?
This is not defined by the language standard. Each compiler will have to agree with the operating system on what character codes to use. We don't even know how many bits are used for char and wchar_t.
On some systems char is UTF-8, on others it is ASCII, or something else. On IBM mainframes it can be EBCDIC, a character encoding already in use before ASCII was defined.
If they are not limited to these character sets / encoding, how is it decided what character set / encoding is used for a particular char or wchar_t? is it automatically decided at compile for example or do we have to explicitly tell it what to use?
The compiler knows what is appropriate for each system.
From my understanding UTF-8 uses 1 byte when using the first 128 code points in the set but can use more than 1 byte when using code point 128 and above. If so how is this stored? for example is it simply stored identically to ASCII if it only uses 1 byte? and how does the type (char or wchar_t or whatever) know how many bytes it is using?
The first part of UTF-8 is identical to the corresponding ASCII codes, and stored as a single byte. Higher codes will use two or more bytes.
The char type itself just store bytes and doesn't know how many bytes we need to form a character. That's for someone else to decide.
The same thing for wchar_t, which is 16 bits on Windows but 32 bits on other systems, like Linux.
Finally, if my understanding is correct I get why UTF-8 and UTF-16 are not compatible, eg. a string can't be used where a wstring is needed. But in a program that requires a wstring would it be better practice to write a conversion function from a string to a wstring and the use this when a wstring is required to make my code exclusively string-based or just use wstring where needed instead?
You will likely have to convert. Unfortunately the conversion needed will be different for different systems, as character sizes and encodings vary.
In later C++ standards you have new types char16_t and char32_t, with the string types u16string and u32string. Those have known sizes and encodings.
Everything about used encoding is implementation defined. Check your compiler documentation. It depends on default locale, encoding of source file and OS console settings.
Types like string, wstring, operations on them and C facilities, like strcmp/wstrcmp expect fixed-width encodings. So the would not work properly with variable width ones like UTF8 or UTF16 (but will work with, e.g., UCS-2). If you want to store variable-width encoded strings, you need to be careful and not use fixed-width operations on it. C-string do have some functions for manipulation of such strings in standard library .You can use classes from codecvt header to convert between different encodings for C++ strings.
I would avoid wstring and use C++11 exact width character string: std::u16string or std::u32string
As an example here is some info on how windows uses these types/encodings.
char stores ASCII values (with code pages for non-ASCII values)
wchar_t stores UTF-16, note this means that some unicode characters will use 2 wchar_t's
If you call a system function, e.g. puts then the header file will actually pick either puts or _putws depending on how you've set things up (i.e. if you are using unicode).
So on windows there is no direct support for UTF-8, which means that if you use char to store UTF-8 encoded strings you have to covert them to UTF-16 and call the corresponding UTF-16 system functions.

What to use to store Unicode (UTF-16) strings? (C++11)

I ask this question in the light of the innovations that C++11 brings, namely uchar16_t/u16string.
I write an application that should have multilingual support. According to my plan the localization strings will be stored in XML as UTF-16, and retrieved with pugixml. THe strings will be used both for the GUI and generating HTML report of the computation results. Since I have understood wchar_t/wstring as being deprecated in favour of new u16string, I've planned to use u16string for storing language strings inside the program.
But since both pugixml and MFC's CString use wchar_t as underlining storage type for the Unicode, should I perhaps forget about u16string for now and instead use straightforwardly wstring?
Language-portability is crucial, platform portability doesn't matter.
I use MVS 2013 with Intel compiler.
The encoding used for storing the data outside the program is the only one that matters.
That data is likely to be used from other software. Someone will want to write those strings and they'll probably use some kind of specialised editor or gasp a general-purpose text editor. UTF-8 has much better support from other software than UTF-16, and that's what I would recommend and why.
Inside the program, what encoding you use doesn't matter, as long as you do it consistently and don't mix them up in stupid ways.
Obviously, if you use the same encoding inside the program as you do outside of it, you don't need to perform any conversions and the risk of mixing them up and producing mojibake is not there.
The thing with pugixml using wchar_t is that the encoding it uses then depends on the size of wchar_t. If the size is 2, it uses UTF-16; if the size is 4 it uses UTF-32. pugixml also has the option to use UTF-8 with char by setting the PUGIXML_WCHAR_MODE macro appropriately, so you can use that instead.
If you use wchar_t API, stick to wstring. Remember: since we're inside the program, it doesn't matter if it's going to be UTF-16 or UTF-32, as long as we're consistent. If you use the char API, stick to string. You could, I guess, perform conversions from wchar_t to char16_t and use u16strings, but that wouldn't give much benefit.
The saving and loading functions in pugixml take an xml_encoding parameter that lets you pick what encoding will be on the data outside the program, and that doesn't have to match what you use internally. Pick whichever you find the most convenient.

In C++ when to use WCHAR and when to use CHAR

I have a question:
Some libraries use WCHAR as the text parameter and others use CHAR (as UTF-8): I need to know when to use WCHAR or CHAR when I write my own library.
Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:
http://utf8everywhere.org/
It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.
WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.
CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.
Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (and better a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.
And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (instead with UTF-16 you have to consider both the UTF-16BE big-endian and the UTF-16LE little-endian cases).
You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and these can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.
The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.
The right answer to me, is use char and consider everything utf-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.
Make sure you make correct use of strings though. Some APIs like fopen(), would accept a char* string and treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is, when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.
In Windows we stick to WCHARS. std::wstring. Mainly because if you don't you end up having to convert because calling Windows functions.
I have a feeling that trying to use utf8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.
It is best recommended that, when developing a Windows application, resort to TCHARs. The good thing about TCHARs is that they can be either regular chars or wchars, depending whether the unicode setting is set or not. Once you resort to TCHARs, you make sure that all string manipulations that you use also start with the _t prefix (e.g. _tcslen for length of string). That way you will know that your code will work both in Unicode and ASCII environments.

How do I get STL std::string to work with unicode on windows?

At my company we have a cross platform(Linux & Windows) library that contains our own extension of the STL std::string, this class provides all sort of functionality on top of the string; split, format, to/from base64, etc. Recently we were given the requirement of making this string unicode "friendly" basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side since every thing is inherently UTF-8, however I am having trouble with the Windows side; is there a trick to getting the STL std::string to work as UTF-8 on windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on the std::string since that is what the string class is based on in Linux.
Thank you,
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signed-ness is left undefined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchat_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said does not conform to the above, since 2-bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (ยง5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::string's, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather, I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to Window's API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In Windows portion of the code, use "wide" versions of Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This allows to avoid dependency on build system configuration.
In the platfrom abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and std::tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platfrom-specific portion; use std::string in cross-platfrom portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and thus not always acceptable or convenient.
If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It really platform dependant, Unicode is headache. Depends on which compiler you use. For older ones from MS (VS2010 or older), you would need use API described in MSDN
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw, gcc, etc.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name...
You should consider using QString and QByteArray, it has good unicode support