UNICODE, UTF-8 and Windows mess - c++

I'm trying to implement text support in Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way but that doesn't seem to be easily accomplished when considering the two platforms in question. I have spent a considerable amount of time reading up on UNICODE, UTF-8 (and other encodings), widechars and such and here is what I have come to understand so far:
UNICODE, as the standard, describes the set of characters that are mappable and the order in which they occur. I refer to this as the "what": UNICODE specifies what will be available.
UTF-8 (and other encodings) specify the how: How each character will be represented in a binary format.
Now, on windows, they opted for a UCS-2 encoding originally, but that failed to meet the requirements, so UTF-16 is what they have, which is also multi-char when necessary.
So here is the delemma:
Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).

Just do UTF-8
There are lots of support libraries for UTF-8 in every plaftorm, also some are multiplaftorm too. The UTF-16 APIs in Win32 are limited and inconsistent as you've already noted, so it's better to keep everything in UTF-8 and convert to UTF-16 at last moment. There are also some handy UTF-8 wrappings for the windows API.
Also, at application-level documents, UTF-8 is getting more and more accepted as standard. Every text-handling application either accepts UTF-8, or at worst shows it as "ASCII with some dingbats", while there's only few applications that support UTF-16 documents, and those that don't, show it as "lots and lots of whitespace!"

Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.
Most of the time you will use regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They will work fine with UTF-8 characters. Either use char * for UTF-8 or you will have to cast everything.
Note that the underscore versions like _mbstowcs are not standard, they are normally named without an underscore, like mbstowcs.
It is difficult to come up with examples where you actually want to use operator[] on a Unicode string, my advice is to stay away from it. Likewise, iterating over a string has surprisingly few uses:
If you are parsing a string (e.g., the string is C or JavaScript code, maybe you want syntax hilighting) then you can do most of the work byte-by-byte and ignore the multibyte aspect.
If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).
If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.
Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking.
See: C++ iterate or split UTF-8 string into array of symbols?

Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
Yes, that's correct. The *A function variants interpret the string parameters according to the currently active code page (which is Windows-1252 on most computers in the US and Western Europe, but can often be other code pages) and convert them to UTF-16. There is a UTF-8 code page, however AFAIK there isn't a way to programmatically set the active code page (there's GetACP to get the active code page, but not corresponding SetACP).
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
The mbs* family of functions is almost never used, in my experience. With the exception of mbstowcs, mbsrtowcs, and mbsinit, those functions are not standard C.
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
I think that mbrtowc(3) would be the best option here for decoding a single code point of a multibyte string.
Overall, I think the best strategy for cross-platform Unicode compatibility is to do everything in UTF-8 internally using single-byte characters. When you need to call a Windows API function, convert it to UTF-16 and always call the *W variant. Most non-Windows platforms use UTF-8 already, so that makes using those a snap.

In Windows, you can call WideCharToMultiByte and MultiByteToWideChar to convert between UTF-8 string and UTF-16 string (wstring in Windows). Because Windows API do not use UTF-8, whenever you call any Windows API function that support Unicode, you have to convert string into wstring (Windows version of Unicode in UTF-16). And when you get output from Windows, you have to convert UTF-16 back to UTF-8. Linux uses UTF-8 internally, so you do not need such conversion. To make your code portable to Linux, stick to UTF-8 and provide something as below for conversion:
#if (UNDERLYING_OS==OS_WINDOWS)
using os_string = std::wstring;
std::string utf8_string_from_os_string(const os_string &os_str)
{
size_t length = os_str.size();
int size_needed = WideCharToMultiByte(CP_UTF8, 0, os_str, length, NULL, 0, NULL, NULL);
std::string strTo(size_needed, 0);
WideCharToMultiByte(CP_UTF8, 0, os_str, length, &strTo[0], size_needed, NULL, NULL);
return strTo;
}
os_string utf8_string_to_os_string(const std::string &str)
{
size_t length = os_str.size();
int size_needed = MultiByteToWideChar(CP_UTF8, 0, str, length, NULL, 0);
os_string wstrTo(size_needed, 0);
MultiByteToWideChar(CP_UTF8, 0, str, length, &wstrTo[0], size_needed);
return wstrTo;
}
#else
// Other operating system uses UTF-8 directly and such conversion is
// not required
using os_string = std::string;
#define utf8_string_from_os_string(str) str
#define utf8_string_to_os_string(str) str
#endif
To iterate over utf8 strings, two fundamental functions you need are: one to calculate the number of bytes for an utf8 character and the another to determine if the byte is the leading byte of a utf8 character sequence. The following code provides a very efficient way to test:
inline size_t utf8CharBytes(char leading_ch)
{
return (leading_ch & 0x80)==0 ? 1 : clz(~(uint32_t(uint8_t(leading_ch))<<24));
}
inline bool isUtf8LeadingByte(char ch)
{
return (ch & 0xC0) != 0x80;
}
Using these functions, it should not be difficult to implement your own iterator over utf8 strings, one is for forwarding iterator, and another is for backward iterator.

Related

How well is Unicode supported in C++11?

I've read and heard that C++11 supports Unicode. A few questions on that:
How well does the C++ standard library support Unicode?
Does std::string do what it should?
How do I use it?
Where are potential problems?
How well does the C++ standard library support unicode?
Terribly.
A quick scan through the library facilities that might provide Unicode support gives me this list:
Strings library
Localization library
Input/output library
Regular expressions library
I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.
Does std::string do what it should?
Yes. According to the C++ standard, this is what std::string and its siblings should do:
The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.
Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.
Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.
How do I use it?
Use it as a sequence of char objects; pretending it is something else is bound to end in pain.
Where are potential problems?
All over the place? Let's see...
Strings library
The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.
It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
Localization library
The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.
Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:
template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...
How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.
This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.
However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦ but toupper can only return one character code unit.
Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.
wstring_convert is used to convert between strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.
The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
wbuffer_convert performs a similar function but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.
The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).
UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
narrow ↔ wide with codecvt<wchar_t, char_t, mbstate_t>
no-op with codecvt<char, char, mbstate_t>.
Several of these are useful, but there is a lot of awkward stuff here.
First off—holy high surrogate! that naming scheme is messy.
Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought desirable to focus on an encoding that was superseded over 20 years ago, I don't know&ddagger;. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.
I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
Still on the UCS-2 front, there is no way to read from an UTF-16 byte stream into an UTF-16 string with these facets. If you have a sequence of UTF-16 bytes you can't deserialize it into a string of char16_t. This is surprising, because it is more or less an identity conversion. Even more suprising, though, is the fact that there is support for deserializing from an UTF-16 stream into an UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.
The UTF-16-as-bytes support is quite good, though: it supports detecting endianess from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
There are some more interesting conversion possibilities absent. There is no way to deserialize from an UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.
And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.
Input/output library
The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.
Regular expressions library
I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
That's it?
Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen like normalization or text segmentation algorithms.
U+1F4A9. Is there any way to get some better Unicode support in C++?
The usual suspects: ICU and Boost.Locale.
† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.
Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".
&ddagger; If you are about to say "but Windows!" hold your 🐎🐎. All versions of Windows since Windows 2000 use UTF-16.
☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there's still plenty of other cases where this would fail. Try uppercasing U+FB00 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟɪɢᴀᴛᴜʀᴇ ғғ. There is no ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟɪɢᴀᴛᴜʀᴇ ғғ; it just uppercases to two Fs. Or U+01F0 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴊ ᴡɪᴛʜ ᴄᴀʀᴏɴ; there's no precomposed capital; it just uppercases to a capital J and a combining caron.
Unicode is not supported by Standard Library (for any reasonable meaning of supported).
std::string is no better than std::vector<char>: it is completely oblivious to Unicode (or any other representation/encoding) and simply treat its content as a blob of bytes.
If you only need to store and catenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes etc) you are out of luck.
The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.
You can safely store UTF-8 in a std::string (or in a char[] or char*, for that matter), due to the fact that a Unicode NUL (U+0000) is a null byte in UTF-8 and that this is the sole way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including std::cout and std::cerr, so long as your locale is UTF-8).
What you cannot do with std::string for UTF-8 is get length in code points. std::string::size() will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.
If you need to operate on UTF-8 strings at the code point level (i.e. not just store and print them) or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.
C++11 has a couple of new literal string types for Unicode.
Unfortunately the support in the standard library for non-uniform encodings (like UTF-8) is still bad. For example there is no nice way to get the length (in code-points) of an UTF-8 string.
However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for std::string/std::wstring. It aims to fill the gap of the still missing utf8-string container class.
This might be the most comfortable way of 'dealing' with utf8 strings (that is, without unicode normalization and similar stuff). You comfortably operate on codepoints, while your string stays encoded in run-length-encoded chars.

Strings. TCHAR LPWCS LPCTSTR CString. Whats what here, simple quick

TCHAR szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
CString tmp;
lstrcpy(szExeFileName, tmp);
CString out;
out.Format("\nInstall32 at %s\n", tmp);
TRACE(tmp);
Error (At the Format):
error C2664: 'void ATL::CStringT<BaseType,StringTraits>::Format(const wchar_t
*,...)' : cannot convert parameter 1 from 'const char [15]' to 'const wchar_t
I'd just like to get the current path that this program was launched from and copy it into a CString so I can use it elsewhere. I am currently just try to get to see the path by TRACE'ing it out. But strings, chars, char arrays, I can't ever get all the strait. Could someone give me a pointer?
The accepted answer addresses the problem. But the question also asked for a better understanding of the differences among all the character types on Windows.
Encodings
A char on Windows (and virtually all other systems) is a single byte. A byte is typically interpreted as either an unsigned value [0..255] or a signed value [-128..127]. (Older C++ standards guarantees a signed range of only [-127..127], but most implementations give [-128..127]. I believe C++11 guarantees the larger range.)
ASCII is a character mapping for values in the range [0..127] to particular characters, so you can store an ASCII character in either a signed byte or an unsigned byte, and thus it will always fit in a char.
But ASCII doesn't have all the characters necessary for most languages, so the character sets were often extended by using the rest of the values available in a byte to represent the additional characters needed for certain languages (or families of languages). So, while [0..127] almost always mean the same thing, values like 150 can only be interpreted in the context of a particular encoding. For single-byte alphabets, these encodings are called code pages.
Code pages helped, but they didn't solve all the problems. You always had to know which code page a particular document used in order to interpret it correctly. Furthermore, you typically couldn't write a single document that used different languages.
Also, some languages have more than 256 characters, so there was no way to map one char to one character. This led to the development of multi-byte character encodings, where [0..127] is still ASCII, but some of the other values are "escapes" that mean you have to look at some number of following chars to figure out what character you really had. (It's best to think of multi-byte as variable-byte, as some characters require only one byte while other require two or more.) Multi-byte works, but it's a pain to code for.
Meanwhile, memory was becoming more plentiful, so a bunch of organizations got together and created Unicode, with the goal of making a universal mapping of values to characters (for appropriately vague definitions of "characters"). Initially, it was believed that all characters (or at least all the ones anyone would ever use) would fit into 16-bit values, which was nice because you wouldn't have to deal with multi-byte encodings--you'd just use two bytes per character instead of one. About this time, Microsoft decided to adopt Unicode as the internal representation for text in Windows.
WCHAR
So Windows has a type called WCHAR, a two-byte value that represents a "Unicode" "character". I'm using quotation marks here because Unicode evolved past the original two-byte encoding, so what Windows calls "Unicode" isn't really Unicode today--it's actually a particular encoding of Unicode called UTF-16. And a "character" is not as simple a concept in Unicode as it was in ASCII, because, in some languages, characters combine or otherwise influence adjacent characters in interesting ways.
Newer versions of Windows used these 16-bit WCHAR values for text internally, but there was a lot of code out there still written for single-byte code pages, and even some for multi-byte encodings. Those programs still used chars rather than WCHARs. And many of these programs had to work with people using older versions of Windows that still used chars internally as well as newer ones that use WCHAR. So a technique using C macros and typedefs was devised so that you could mostly write your code one way and--at compile time--choose to have it use either char or WCHAR.
TCHAR
To accomplish this flexibility, you use a TCHAR for a "text character". In some header file (often <tchar.h>), TCHAR would be typedef'ed to either char or WCHAR, depending on the compile time environment. Windows headers adopted conventions like this:
LPTSTR is a (long) pointer to a string of TCHARs.
LPWSTR is a (long) pointer to a string of WCHARs.
LPSTR is a (long) pointer to a string of chars.
(The L for "long" is a leftover from 16-bit days, when we had long, far, and near pointers. Those are all obsolete today, but the L prefix tends to remain.)
Most of the Windows API functions that take and return strings were actually replaced with two versions: the A version (for "ANSI" characters) and the W version (for wide characters). (Again, historical legacy shows in these. The code pages scheme was often called ANSI code pages, though I've never been clear if they were actually ruled by ANSI standards.)
So when you call a Windows API like this:
SetWindowText(hwnd, lptszTitle);
what you're really doing is invoking a preprocessor macro that expands to either SetWindowTextA or SetWindowTextW. It should be consistent with however TCHAR is defined. That is, if you want strings of chars, you'll get the A version, and if you want strings of WCHARs, you get the W version.
But it's a little more complicated because of string literals. If you write this:
SetWindowText(hwnd, "Hello World"); // works only in "ANSI" mode
then that will only compile if you're targeting the char version, because "Hello World" is a string of chars, so it's only compatible with the SetWindowTextA version. If you wanted the WCHAR version, you'd have to write:
SetWindowText(hwnd, L"Hello World"); // only works in "Unicode" mode
The L here means you want wide characters. (The L actually stands for long, but it's a different sense of long than the long pointers above.) When the compiler sees the L prefix on the string, it knows that string should be encoded as a series of wchar_ts rather than chars.
(Compilers targeting Windows use a two-byte value for wchar_t, which happens to be identical to what Windows defined a WCHAR. Compilers targeting other systems often use a four-byte value for wchar_t, which is what it really takes to hold a single Unicode code point.)
So if you want code that can compile either way, you need another macro to wrap the string literals. There are two to choose from: _T() and TEXT(). They work exactly the same way. The first comes from the compiler's library and the second from the OS's libraries. So you write your code like this:
SetWindowText(hwnd, TEXT("Hello World")); // compiles in either mode
If you're targeting chars, the macro is a no-op that just returns the regular string literal. If you're targeting WCHARs, the macro prepends the L.
So how do you tell the compiler that you want to target WCHAR? You define UNICODE and _UNICODE. The former is for the Windows APIs and the latter is for the compiler libraries. Make sure you never define one without the other.
My guess is you are compiling in Unicode mode.
Try enclosing your format string in the _T macro, which is designed to provide an always-correct method of providing constant string parameters, regardless of whether you're compiling in Unicode or ANSI mode:
out.Format(_T("\nInstall32 at %s\n"), tmp);

How can I use wstring(s) in Linux APIs?

I want to develope an application in Linux. I want to use wstring beacuse my application should supports unicode and I don't want to use UTF-8 strings.
In Windows OS, using wstring is easy. beacuse any ANSI API has a unicode form. for example there are two CreateProcess API, first API is CreateProcessA and second API is CreateProcessW.
wstring app = L"C:\\test.exe";
CreateProcess
(
app.c_str(), // EASY!
....
);
But it seems working with wstring in Linux is complicated! for example there is an API in Linux called parport_open (It just an example).
and I don't know how to send my wstring to this API (or APIs like parport_open that accept a string parameter).
wstring name = L"myname";
parport_open
(
0, // or a valid number. It is not important in this question.
name.c_str(), // Error: because type of this parameter is char* not wchat_t*
....
);
My question is how can I use wstring(s) in Linux APIs?
Note: I don't want to use UTF-8 strings.
Thanks
Linux APIs (on recent kernels and with correct locale setting) on almost every distribution use UTF-8 strings by default1. You too should use them inside your code. Resistance is futile.
The wchar_t (and thus wstring) on Windows were convenient only when Unicode was limited to 65536 characters (i.e. wchar_t were used for UCS-2), now that the 16-bit Windows wchar_t are used for UTF-16 the advantage of 1 wchar_t=1 Unicode character is long gone, so you have the same disadvantages of using UTF-8. Nowadays IMHO the Linux approach is the most correct. (Another answer of mine on UTF-16 and why Windows and Java use it)
By the way, both string and wstring aren't encoding-aware, so you can't reliably use any of these two to manipulate Unicode code points. I heard that wxString from the wxWidgets toolkit handles UTF-8 nicely, but I never did extensive research about it.
actually, as pointed out below, the kernel aims to be encoding-agnostic, i.e. it treats the strings as opaque sequences of (NUL-terminated?) bytes (and that's why encodings that use "larger" character types like UTF-16 cannot be used). On the other hand, wherever actual string manipulation is done, the current locale setting is used, and by default on almost any modern Linux distribution it is set to UTF-8 (which is a reasonable default to me).
I don't want to use UTF-8 strings.
Well, you will need to overcome that reluctance, at least when calling the APIs. Linux uses single byte string encodings, invariably UTF-8. Clearly you should use a single byte string type since you obviously can't pass wide characters to a function that expects char*. Use string rather than wstring.

Unicode Portability

I'm currently taking care of an application that uses std::string and char for string operations - which is fine on linux, since Linux is agnostic to Unicode (or so it seems; I don't really know, so please correct me if I'm telling stories here). This current style naturally leads to this kind of function/class declarations:
std::string doSomethingFunkyWith(const std::string& thisdata)
{
/* .... */
}
However, if thisdata contains unicode characters, it will be displayed wrongly on windows, since std::string can't hold unicode characters on Windows.
So I thought up this concept:
namespace MyApplication {
#ifdef UNICODE
typedef std::wstring string_type;
typedef wchar_t char_type;
#else
typedef std::string string_type;
typedef char char_type;
#endif
/* ... */
string_type doSomethingFunkyWith(const string_type& thisdata)
{
/* ... */
}
}
Is this a good concept to go with to support Unicode on windows?
My current toolchain consists of gcc/clang on Linux, and wine+MinGW for Windows support (crosstesting also happens via wine), if that matters.
Multiplatform issues comes from the fact that there are many encodings, and a wrong encoding pick will lead to encóding íssues. Once you tackle that problem, you should be able to use std::wstring on all your program.
The usual workflow is:
raw_input_data = read_raw_data()
input_encoding = "???" // What is your file or terminal encoding?
unicode_data = convert_to_unicode(raw_input_data, input_encoding)
// Do something with the unicode_data, store in some var, etc.
output_encoding = "???" // Is your terminal output encoding the same as your input?
raw_output_data = convert_from_unicode(unicode_data, output_encoding)
print_raw_data(raw_data)
Most Unicode issues comes from wrongly detecting the values of input_encoding and output_encoding. On a modern Linux distribution this is usually UTF-8. On Windows YMMV.
Standard C++ don't know about encodings, you should use some library like ICU to do the conversion.
How you store a string within your application is entirely up to you -- after all, nobody would know as long as the strings stay within your application. The problem starts when you try to read or write strings from the outside world (console, files, sockets etc.) and this is where the OS matters.
Linux isn't exactly "agnostic" to Unicode -- it does recognize Unicode but the standard library functions assume UTF-8 encoding, so Unicode strings fit into standard char arrays. Windows, on the other hand, uses UTF-16 encoding, so you need a wchar_t array to represent 16-bit characters.
The typedefs you proposed should work fine, but keep in mind that this alone doesn't make your code portable. As an example, if you want to store text in files in a portable manner, you should choose one encoding and stick to it across all platforms -- this could require converting between encodings on certain platforms.
Linux does support Unicode, it simply uses UTF-8. Probably a better way to make your system portable would be to make use of International Components for Unicode and treat all std::string objects as containing UTF-8 characters, and convert them to UTF-16 as needed when invoking Windows functions. It almost always makes sense to use UTF-8 over UTF-16, as UTF-8 uses less space for some of the most commonly used characters (e.g. English*) and more space for less frequent characters, whereas UTF-16 wastes space equally for all characters, no matter how frequently they are used.
While you can use your typedefs, this will mean that you have to write two copies of every single function that has to deal with strings. I think it would be more efficient to simply do all internal computations in UTF-8 and simply translate that to/from UTF-16 if necessary when inputting/outputting as needed.
*For HTML, XML, and JSON that use English as part of the encoding (e.g. "<html>, <body>, etc.) regardless of the language of the values, this can still be a win for foreign languages.
The problem for Linux and using Unicode is that all the IO and most system functions use UTF-8 and the wide character type is 32 bit. Then there is interfacing to Java and other programs which requires UTF-16.
As a suggestion for Unicode support, see the OpenRTL library at http://code.google.com/p/openrtl which supports all UTF-8, UTF-16 and UTF-32 on windows, Linux, Osx and Ios. The Unicode support is not just the character types, but also Unicode collation, normalization, case folding, title casing and about 64 different Unicode character properties per full unsigned 32 bit character.
The OpenRTL code is ready now to support char8_t, char16_t and char32_t for the new C++ standards as well, allthough the same character types are supported using macros for existing C and C++ compilers. I think for Unicode and strings processing that it might be what you want for your library.
The point is that if you use OpenRTL, you can build the system using the OpenRTL "char_t" type. This supports the notion that your entire library can be built in either UTF8, UTF16 or UTF32 mode, even on Linux, because OpenRTL is already handling all the interfacing to a lot of system functions like files and io stuff. It has its own print_f functions for example.
By default the char_t is mapping to the wide character type. So on windows it is 32 bit and on Linux it is 32 bit. But you can make it also make it 8 bit everywhere for example. Also it has the support to do fast UTF decoding inside loops using macros.
So instead of ifdeffing between wchar_t and char, you can build everything using char_t and OpenRTL takes care of the rest.

How do I get STL std::string to work with unicode on windows?

At my company we have a cross platform(Linux & Windows) library that contains our own extension of the STL std::string, this class provides all sort of functionality on top of the string; split, format, to/from base64, etc. Recently we were given the requirement of making this string unicode "friendly" basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side since every thing is inherently UTF-8, however I am having trouble with the Windows side; is there a trick to getting the STL std::string to work as UTF-8 on windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on the std::string since that is what the string class is based on in Linux.
Thank you,
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signed-ness is left undefined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchat_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said does not conform to the above, since 2-bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::string's, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather, I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to Window's API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In Windows portion of the code, use "wide" versions of Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This allows to avoid dependency on build system configuration.
In the platfrom abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and std::tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platfrom-specific portion; use std::string in cross-platfrom portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and thus not always acceptable or convenient.
If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It really platform dependant, Unicode is headache. Depends on which compiler you use. For older ones from MS (VS2010 or older), you would need use API described in MSDN
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw, gcc, etc.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name...
You should consider using QString and QByteArray, it has good unicode support