I am looking for a method to compare and sort UTF-8 strings in C++ in a case-insensitive manner to use it in a custom collation function in SQLite.
The method should ideally be locale-independent. However, I won't be holding my breath: as far as I know, collation is very language-dependent, so anything that works on languages other than English will do, even if it means switching locales.
Options include using the standard C or C++ library, or a small (suitable for an embedded system) and non-GPL (suitable for a proprietary system) third-party library.
What I have so far:
strcoll with C locales and std::collate/std::collate_byname are case-sensitive. (Are there case-insensitive versions of these?)
I tried to use POSIX strcasecmp, but it seems to be defined only for the "POSIX" locale:
In the POSIX locale, strcasecmp() and strncasecmp() do upper to lower conversions, then a byte comparison. The results are unspecified in other locales.
And, indeed, the result of strcasecmp does not change between locales on Linux with GLIBC.
#include <clocale>
#include <cstdio>
#include <cassert>
#include <cstring>
#include <strings.h>  // strcasecmp is declared here on POSIX systems

static const char *s1 = "Äaa";
static const char *s2 = "äaa";

int main() {
    printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
    printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));

    assert(setlocale(LC_ALL, "en_AU.UTF-8"));
    printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
    printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));

    assert(setlocale(LC_ALL, "fi_FI.UTF-8"));
    printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
    printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
}
This is printed:
strcasecmp('Äaa', 'äaa') == -32
strcoll('Äaa', 'äaa') == -32
strcasecmp('Äaa', 'äaa') == -32
strcoll('Äaa', 'äaa') == 7
strcasecmp('Äaa', 'äaa') == -32
strcoll('Äaa', 'äaa') == 7
P.S. And yes, I am aware of ICU, but we can't use it on the embedded platform due to its enormous size.
What you really want is logically impossible. There is no locale-independent, case-insensitive way of sorting strings. The simple counter-example: is "i" equal to "I"? The naive answer is yes, but in Turkish these strings are unequal, because "i" uppercases to "İ" (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE).
UTF-8 strings add extra complexity to the question. They're perfectly valid multi-byte char* strings, provided you have an appropriate locale. But neither the C nor the C++ standard defines such a locale; check with your vendor (there are too many embedded vendors for a general answer here). So you HAVE to pick a locale whose multi-byte encoding is UTF-8 for the multi-byte string functions to work. This of course influences the sort order, which is locale-dependent. And if you have NO locale in which char* is UTF-8, you can't use this trick at all. (As I understand it, Microsoft's CRT suffers from this: its multi-byte code only handles characters up to 2 bytes, while UTF-8 sequences can be up to 4.)
wchar_t is not the standard solution either. It is supposedly wide enough that you don't have to deal with multi-byte encodings, but your collation will still depend on locale (LC_COLLATE). On the other hand, using wchar_t means you can now choose locales that do not use UTF-8 for char*.
With this done, you can basically write your own ordering by converting strings to lowercase and comparing them. It's not perfect. Do you expect L"ß" == L"ss"? They're not even the same length, yet for a German you have to consider them equal. Can you live with that?
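For illustration, here is a minimal sketch of that approach (the function names are mine; it assumes the strings are already wchar_t and that setlocale() has been called with whatever locale you settled on):

#include <clocale>
#include <cwctype>   // towlower
#include <cwchar>    // wcscoll
#include <vector>

// Lowercase a copy of the string using the current LC_CTYPE.
static std::vector<wchar_t> fold_lower(const wchar_t *s) {
    std::vector<wchar_t> out;
    for (; *s; ++s)
        out.push_back(static_cast<wchar_t>(std::towlower(*s)));
    out.push_back(L'\0');
    return out;
}

// Case-insensitive, locale-dependent comparison: fold both sides to
// lowercase, then let wcscoll apply the locale's collation (LC_COLLATE).
int wcscoll_nocase(const wchar_t *a, const wchar_t *b) {
    return std::wcscoll(fold_lower(a).data(), fold_lower(b).data());
}

Note that the single-code-unit caveat applies here too: towlower can never map L"ß" to anything two characters long.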
I don't think there's a standard C/C++ library function you can use. You'll have to roll your own or use a 3rd-party library. The full Unicode specification for locale-specific collation can be found here: http://www.unicode.org/reports/tr10/ (warning: this is a long document).
On Windows you can fall back on the OS function CompareStringW with the NORM_IGNORECASE flag. You'll have to convert your UTF-8 strings to UTF-16 first. Otherwise, take a look at IBM's International Components for Unicode (ICU).
I believe you will need to roll your own or use a third-party library. I recommend a third-party library, because there are a lot of rules that need to be followed to get true international support; it's best to let someone who is an expert deal with them.
I have no definitive answer in the form of example code, but I should point out that a UTF-8 byte stream contains, in fact, Unicode characters, and you have to use the wchar_t versions of the C/C++ runtime library.
You have to convert those UTF-8 bytes into wchar_t strings first, though. This is not very hard, as the UTF-8 encoding is very well documented. I know this because I've done it, but I can't share that code with you.
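Just as a sketch of what that conversion looks like with nothing but the C runtime (assuming setlocale() has already selected a UTF-8 locale, which is exactly the vendor-dependent part discussed above):

#include <clocale>
#include <cstdlib>   // mbstowcs
#include <string>

// Convert a UTF-8 char* string to a wchar_t string via the C runtime.
// Only works if the current locale's multi-byte encoding is UTF-8.
std::wstring widen_utf8(const char *utf8) {
    std::size_t n = std::mbstowcs(nullptr, utf8, 0);   // measure first
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();                         // invalid sequence
    std::wstring out(n, L'\0');
    std::mbstowcs(&out[0], utf8, n);
    return out;
}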
If you are using it to do searching and sorting for your own locale only, I suggest having your function call a simple replace function that converts both multi-byte strings into one-byte-per-character ones, using a table like:
A -> a
à -> a
á -> a
ß -> ss
Ç -> c
and so on
Then simply call strcmp and return the result.
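A hedged sketch of what such a fold-then-compare function could look like (the table entries here are just the examples above; a real one would cover the whole repertoire of your language):

#include <cctype>
#include <cstring>
#include <string>

// Fold a UTF-8 string to a lowercase, accent-free single-byte form.
// Only a few table entries are shown; extend as needed.
static std::string fold(const char *s) {
    std::string out;
    while (*s) {
        if (std::strncmp(s, "\xC3\xA0", 2) == 0 ||        // à
            std::strncmp(s, "\xC3\xA1", 2) == 0) {        // á
            out += 'a'; s += 2;
        } else if (std::strncmp(s, "\xC3\x9F", 2) == 0) { // ß
            out += "ss"; s += 2;
        } else {
            out += static_cast<char>(
                std::tolower(static_cast<unsigned char>(*s)));
            ++s;
        }
    }
    return out;
}

int fold_compare(const char *a, const char *b) {
    return std::strcmp(fold(a).c_str(), fold(b).c_str());
}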
Related
I have a program that does various operations on char types in std::string, for example
if (my_string.front() == my_char) {
// do stuff with my_string
}
I'm looking for some practical advice on how to make my program support Unicode. I need the ability to compare characters to characters, and that means 4-byte characters are required so that even the largest Unicode code points can be processed without loss.
I'm on Windows with a GCC compiler and have read that in this case wchar_t is 2 bytes, so a std::wstring element can't hold every code point. C++11 has std::u32string with 4-byte elements, but it seems largely unsupported by the standard library.
What's the easiest solution in this case?
Even if you had a string of uint32 values, you could not just compare these integers one by one; you would have to normalize the strings first. As normalization is NOT simple, you will end up using a library like ICU anyway, so you may as well use it directly :)
http://site.icu-project.org/
Windows uses the UTF-16 encoding:
http://en.wikipedia.org/wiki/UTF-16
You don't need "four-byte characters" to support all Unicode symbols: UTF-16 is a variable-length encoding.
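To make "variable length" concrete, here is a small sketch (my own, not from the answer) of how a code point above U+FFFF turns into two UTF-16 code units, a surrogate pair:

#include <cstdio>

// Encode one code point as UTF-16. BMP code points take one unit;
// supplementary ones are split into a high and a low surrogate.
// (No validation; a real encoder would reject U+D800..U+DFFF.)
int utf16_encode(char32_t cp, char16_t out[2]) {
    if (cp < 0x10000) {
        out[0] = static_cast<char16_t>(cp);
        return 1;
    }
    cp -= 0x10000;
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}

int main() {
    char16_t u[2];
    int n = utf16_encode(0x1F34C, u);  // a code point outside the BMP
    std::printf("%d units: 0x%04X 0x%04X\n", n,
                static_cast<unsigned>(u[0]), static_cast<unsigned>(u[1]));
    // prints: 2 units: 0xD83C 0xDF4C
}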
Good reading material:
http://www.joelonsoftware.com/articles/Unicode.html
I've read and heard that C++11 supports Unicode. A few questions on that:
How well does the C++ standard library support Unicode?
Does std::string do what it should?
How do I use it?
Where are potential problems?
How well does the C++ standard library support Unicode?
Terribly.
A quick scan through the library facilities that might provide Unicode support gives me this list:
Strings library
Localization library
Input/output library
Regular expressions library
I think all but the first one provide terrible support. I'll get back to it in more detail after a quick detour through your other questions.
Does std::string do what it should?
Yes. According to the C++ standard, this is what std::string and its siblings should do:
The class template basic_string describes objects that can store a sequence consisting of a varying number of arbitrary char-like objects with the first element of the sequence at position zero.
Well, std::string does that just fine. Does that provide any Unicode-specific functionality? No.
Should it? Probably not. std::string is fine as a sequence of char objects. That's useful; the only annoyance is that it is a very low-level view of text and standard C++ doesn't provide a higher-level one.
How do I use it?
Use it as a sequence of char objects; pretending it is something else is bound to end in pain.
Where are potential problems?
All over the place? Let's see...
Strings library
The strings library provides us basic_string, which is merely a sequence of what the standard calls "char-like objects". I call them code units. If you want a high-level view of text, this is not what you are looking for. This is a view of text suitable for serialization/deserialization/storage.
It also provides some tools from the C library that can be used to bridge the gap between the narrow world and the Unicode world: c16rtomb/mbrtoc16 and c32rtomb/mbrtoc32.
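A minimal sketch of what that bridging looks like with mbrtoc32 (assuming <cuchar> is available on your toolchain and that the current locale's narrow encoding is UTF-8; the locale name is an assumption):

#include <cuchar>    // mbrtoc32 (C++11)
#include <cwchar>    // mbstate_t
#include <clocale>
#include <cstring>
#include <cstdio>

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");        // assumes this locale exists
    const char *s = "a\xC3\xA4\xF0\x9F\x8D\x8C";  // "aä🍌" in UTF-8 bytes
    const char *end = s + std::strlen(s);
    std::mbstate_t state{};
    char32_t c;
    std::size_t rc;
    while ((rc = std::mbrtoc32(&c, s, end - s + 1, &state)) != 0) {
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2))
            break;                                // invalid or incomplete input
        std::printf("U+%04X\n", static_cast<unsigned>(c));
        s += rc;
    }
}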
Localization library
The localization library still believes that one of those "char-like objects" equals one "character". This is of course silly, and makes it impossible to get lots of things working properly beyond some small subset of Unicode like ASCII.
Consider, for example, what the standard calls "convenience interfaces" in the <locale> header:
template <class charT> bool isspace (charT c, const locale& loc);
template <class charT> bool isprint (charT c, const locale& loc);
template <class charT> bool iscntrl (charT c, const locale& loc);
// ...
template <class charT> charT toupper(charT c, const locale& loc);
template <class charT> charT tolower(charT c, const locale& loc);
// ...
How do you expect any of these functions to properly categorize, say, U+1F34C ʙᴀɴᴀɴᴀ, as in u8"🍌" or u8"\U0001F34C"? There's no way it will ever work, because those functions take only one code unit as input.
This could work with an appropriate locale if you used char32_t only: U'\U0001F34C' is a single code unit in UTF-32.
However, that still means you only get the simple casing transformations with toupper and tolower, which, for example, are not good enough for some German locales: "ß" uppercases to "SS"☦, but toupper can only return one code unit.
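You can see this single-code-unit limitation directly with the wide-character functions, too (a tiny demo, assuming a German UTF-8 locale is installed):

#include <clocale>
#include <cwctype>
#include <cstdio>

int main() {
    std::setlocale(LC_ALL, "de_DE.UTF-8");
    // towupper can only return one code unit, so ß (U+00DF) cannot
    // become "SS"; it comes back unchanged (or as ẞ on some systems).
    wint_t up = std::towupper(L'\u00DF');
    std::printf("towupper(U+00DF) == U+%04X\n", static_cast<unsigned>(up));
}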
Next up, wstring_convert/wbuffer_convert and the standard code conversion facets.
wstring_convert is used to convert between strings in one given encoding into strings in another given encoding. There are two string types involved in this transformation, which the standard calls a byte string and a wide string. Since these terms are really misleading, I prefer to use "serialized" and "deserialized", respectively, instead†.
The encodings to convert between are decided by a codecvt (a code conversion facet) passed as a template type argument to wstring_convert.
wbuffer_convert performs a similar function but as a wide deserialized stream buffer that wraps a byte serialized stream buffer. Any I/O is performed through the underlying byte serialized stream buffer with conversions to and from the encodings given by the codecvt argument. Writing serializes into that buffer, and then writes from it, and reading reads into the buffer and then deserializes from it.
The standard provides some codecvt class templates for use with these facilities: codecvt_utf8, codecvt_utf16, codecvt_utf8_utf16, and some codecvt specializations. Together these standard facets provide all the following conversions. (Note: in the following list, the encoding on the left is always the serialized string/streambuf, and the encoding on the right is always the deserialized string/streambuf; the standard allows conversions in both directions).
UTF-8 ↔ UCS-2 with codecvt_utf8<char16_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 2;
UTF-8 ↔ UTF-32 with codecvt_utf8<char32_t>, codecvt<char32_t, char, mbstate_t>, and codecvt_utf8<wchar_t> where sizeof(wchar_t) == 4;
UTF-16 ↔ UCS-2 with codecvt_utf16<char16_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 2;
UTF-16 ↔ UTF-32 with codecvt_utf16<char32_t>, and codecvt_utf16<wchar_t> where sizeof(wchar_t) == 4;
UTF-8 ↔ UTF-16 with codecvt_utf8_utf16<char16_t>, codecvt<char16_t, char, mbstate_t>, and codecvt_utf8_utf16<wchar_t> where sizeof(wchar_t) == 2;
narrow ↔ wide with codecvt<wchar_t, char, mbstate_t>;
no-op with codecvt<char, char, mbstate_t>.
Several of these are useful, but there is a lot of awkward stuff here.
First off: holy high surrogate, that naming scheme is messy.
Then, there's a lot of UCS-2 support. UCS-2 is an encoding from Unicode 1.0 that was superseded in 1996 because it only supports the basic multilingual plane. Why the committee thought it desirable to focus on an encoding that was superseded over 20 years ago, I don't know‡. It's not like support for more encodings is bad or anything, but UCS-2 shows up too often here.
I would say that char16_t is obviously meant for storing UTF-16 code units. However, this is one part of the standard that thinks otherwise. codecvt_utf8<char16_t> has nothing to do with UTF-16. For example, wstring_convert<codecvt_utf8<char16_t>>().to_bytes(u"\U0001F34C") will compile fine, but will fail unconditionally: the input will be treated as the UCS-2 string u"\xD83C\xDF4C", which cannot be converted to UTF-8 because UTF-8 cannot encode any value in the range 0xD800-0xDFFF.
Still on the UCS-2 front, there is no way to read from a UTF-16 byte stream into a UTF-16 string with these facets. If you have a sequence of UTF-16 bytes, you can't deserialize it into a string of char16_t. This is surprising, because it is more or less an identity conversion. Even more surprising, though, is the fact that there is support for deserializing from a UTF-16 stream into a UCS-2 string with codecvt_utf16<char16_t>, which is actually a lossy conversion.
The UTF-16-as-bytes support is quite good, though: it supports detecting endianness from a BOM, or selecting it explicitly in code. It also supports producing output with and without a BOM.
Some more interesting conversion possibilities are absent. There is no way to deserialize from a UTF-16 byte stream or string into a UTF-8 string, since UTF-8 is never supported as the deserialized form.
And here the narrow/wide world is completely separate from the UTF/UCS world. There are no conversions between the old-style narrow/wide encodings and any Unicode encodings.
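To be fair, the conversions that do exist are easy enough to use. For example, UTF-8 ↔ UTF-16 through codecvt_utf8_utf16 looks like this (keep in mind this whole facility was later deprecated in C++17; this is C++11-era code):

#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    // Deserialize UTF-8 bytes into UTF-16 code units...
    std::u16string utf16 = conv.from_bytes(u8"\u00E4\U0001F34C");
    assert(utf16 == u"\u00E4\U0001F34C");   // ä plus one surrogate pair
    // ...and serialize back to UTF-8.
    std::string utf8 = conv.to_bytes(utf16);
    assert(utf8.size() == 6);               // 2 bytes for ä + 4 for U+1F34C
}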
Input/output library
The I/O library can be used to read and write text in Unicode encodings using the wstring_convert and wbuffer_convert facilities described above. I don't think there's much else that would need to be supported by this part of the standard library.
Regular expressions library
I have expounded upon problems with C++ regexes and Unicode on Stack Overflow before. I will not repeat all those points here, but merely state that C++ regexes don't have level 1 Unicode support, which is the bare minimum to make them usable without resorting to using UTF-32 everywhere.
That's it?
Yes, that's it. That's the existing functionality. There's lots of Unicode functionality that is nowhere to be seen, like normalization or text segmentation algorithms. U+1F4A9.
Is there any way to get some better Unicode support in C++?
The usual suspects: ICU and Boost.Locale.
† A byte string is, unsurprisingly, a string of bytes, i.e., char objects. However, unlike a wide string literal, which is always an array of wchar_t objects, a "wide string" in this context is not necessarily a string of wchar_t objects. In fact, the standard never explicitly defines what a "wide string" means, so we're left to guess the meaning from usage. Since the standard terminology is sloppy and confusing, I use my own, in the name of clarity.
Encodings like UTF-16 can be stored as sequences of char16_t, which then have no endianness; or they can be stored as sequences of bytes, which have endianness (each consecutive pair of bytes can represent a different char16_t value depending on endianness). The standard supports both of these forms. A sequence of char16_t is more useful for internal manipulation in the program. A sequence of bytes is the way to exchange such strings with the external world. The terms I'll use instead of "byte" and "wide" are thus "serialized" and "deserialized".
‡ If you are about to say "but Windows!" hold your 🐎🐎. All versions of Windows since Windows 2000 use UTF-16.
☦ Yes, I know about the großes Eszett (ẞ), but even if you were to change all German locales overnight to have ß uppercase to ẞ, there's still plenty of other cases where this would fail. Try uppercasing U+FB00 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟɪɢᴀᴛᴜʀᴇ ғғ. There is no ʟᴀᴛɪɴ ᴄᴀᴘɪᴛᴀʟ ʟɪɢᴀᴛᴜʀᴇ ғғ; it just uppercases to two Fs. Or U+01F0 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴊ ᴡɪᴛʜ ᴄᴀʀᴏɴ; there's no precomposed capital; it just uppercases to a capital J and a combining caron.
Unicode is not supported by the standard library (for any reasonable meaning of supported).
std::string is no better than std::vector<char>: it is completely oblivious to Unicode (or any other representation/encoding) and simply treats its content as a blob of bytes.
If you only need to store and concatenate blobs, it works pretty well; but as soon as you wish for Unicode functionality (number of code points, number of graphemes, etc.) you are out of luck.
The only comprehensive library I know of for this is ICU. The C++ interface was derived from the Java one though, so it's far from being idiomatic.
You can safely store UTF-8 in a std::string (or in a char[] or char*, for that matter), due to the fact that a Unicode NUL (U+0000) is a null byte in UTF-8 and that this is the sole way a null byte can occur in UTF-8. Hence, your UTF-8 strings will be properly terminated according to all of the C and C++ string functions, and you can sling them around with C++ iostreams (including std::cout and std::cerr, so long as your locale is UTF-8).
What you cannot do with std::string for UTF-8 is get length in code points. std::string::size() will tell you the string length in bytes, which is only equal to the number of code points when you're within the ASCII subset of UTF-8.
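Counting code points yourself is short, though, since UTF-8 continuation bytes are self-identifying (a sketch that assumes the string is valid UTF-8):

#include <string>
#include <cstddef>

// Number of code points in a valid UTF-8 string: count every byte that
// is not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t utf8_length(const std::string &s) {
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}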
If you need to operate on UTF-8 strings at the code point level (i.e. not just store and print them) or if you're dealing with UTF-16, which is likely to have many internal null bytes, you need to look into the wide character string types.
C++11 has a couple of new string literal types for Unicode.
Unfortunately, the support in the standard library for non-uniform encodings (like UTF-8) is still bad. For example, there is no nice way to get the length (in code points) of a UTF-8 string.
However, there is a pretty useful library called tiny-utf8, which is basically a drop-in replacement for std::string/std::wstring. It aims to fill the gap left by the still-missing UTF-8 string container class.
This might be the most comfortable way of 'dealing' with UTF-8 strings (that is, without Unicode normalization and similar stuff). You comfortably operate on code points, while your string stays encoded in run-length-encoded chars.
I'm trying to implement text support in Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way but that doesn't seem to be easily accomplished when considering the two platforms in question. I have spent a considerable amount of time reading up on UNICODE, UTF-8 (and other encodings), widechars and such and here is what I have come to understand so far:
UNICODE, as the standard, describes the set of characters that are mappable and the order in which they occur. I refer to this as the "what": UNICODE specifies what will be available.
UTF-8 (and other encodings) specify the how: How each character will be represented in a binary format.
Now, on Windows, they originally opted for the UCS-2 encoding, but that failed to meet the requirements, so UTF-16 is what they have, which uses multiple 2-byte units per character when necessary.
So here is the dilemma:
Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
Just do UTF-8
There are lots of support libraries for UTF-8 on every platform, and some are cross-platform too. The UTF-16 APIs in Win32 are limited and inconsistent as you've already noted, so it's better to keep everything in UTF-8 and convert to UTF-16 at the last moment. There are also some handy UTF-8 wrappers for the Windows API.
Also, for application-level documents, UTF-8 is getting more and more accepted as the standard. Every text-handling application either accepts UTF-8 or at worst shows it as "ASCII with some dingbats", while only a few applications support UTF-16 documents, and those that don't show them as "lots and lots of whitespace!"
Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.
Most of the time you will use regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They will work fine with UTF-8 characters. Either use char * for UTF-8 or you will have to cast everything.
Note that the underscore versions like _mbstowcs are not standard, they are normally named without an underscore, like mbstowcs.
It is difficult to come up with examples where you actually want to use operator[] on a Unicode string, my advice is to stay away from it. Likewise, iterating over a string has surprisingly few uses:
If you are parsing a string (e.g., the string is C or JavaScript code, and maybe you want syntax highlighting), then you can do most of the work byte-by-byte and ignore the multi-byte aspect.
If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).
If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.
Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking.
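As a sketch, that conversion can be done with the C++11 standard library itself (this facility was deprecated in C++17, so treat it as a stopgap):

#include <codecvt>
#include <locale>
#include <string>

// Decode a UTF-8 string into UTF-32 code points for processing.
// Throws std::range_error on invalid input.
std::u32string to_utf32(const std::string &utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}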
See: C++ iterate or split UTF-8 string into array of symbols?
Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
Yes, that's correct. The *A function variants interpret the string parameters according to the currently active code page (which is Windows-1252 on most computers in the US and Western Europe, but can often be other code pages) and convert them to UTF-16. There is a UTF-8 code page, but AFAIK there isn't a way to programmatically set the active code page (there's GetACP to get the active code page, but no corresponding SetACP).
In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
The mbs* family of functions is almost never used, in my experience. With the exception of mbstowcs, mbsrtowcs, and mbsinit, those functions are not standard C.
In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
I think that mbrtowc(3) would be the best option here for decoding a single code point of a multibyte string.
Overall, I think the best strategy for cross-platform Unicode compatibility is to do everything in UTF-8 internally using single-byte characters. When you need to call a Windows API function, convert it to UTF-16 and always call the *W variant. Most non-Windows platforms use UTF-8 already, so that makes using those a snap.
On Windows, you can call WideCharToMultiByte and MultiByteToWideChar to convert between UTF-8 strings and UTF-16 strings (wstring on Windows). Because the Windows API does not use UTF-8, whenever you call a Windows API function that supports Unicode you have to convert your string into a wstring (the Windows flavor of Unicode, UTF-16). And when you get output from Windows, you have to convert the UTF-16 back to UTF-8. Linux uses UTF-8 internally, so you do not need such conversions there. To make your code portable to Linux, stick to UTF-8 and provide something like the following for conversion:
#if (UNDERLYING_OS==OS_WINDOWS)

using os_string = std::wstring;

std::string utf8_string_from_os_string(const os_string &os_str)
{
    int length = static_cast<int>(os_str.size());
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, os_str.c_str(), length,
                                          NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, os_str.c_str(), length,
                        &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

os_string utf8_string_to_os_string(const std::string &str)
{
    int length = static_cast<int>(str.size());
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length,
                                          NULL, 0);
    os_string wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length,
                        &wstrTo[0], size_needed);
    return wstrTo;
}

#else

// Other operating systems use UTF-8 directly, so no conversion is required.
using os_string = std::string;
#define utf8_string_from_os_string(str) str
#define utf8_string_to_os_string(str) str

#endif
To iterate over UTF-8 strings, you need two fundamental functions: one to calculate the number of bytes in a UTF-8 character from its leading byte, and another to determine whether a byte is the leading byte of a UTF-8 character sequence. The following code provides a very efficient way to do both:
#include <cstdint>
#include <cstddef>

// __builtin_clz counts leading zero bits (GCC/Clang; MSVC would need
// _BitScanReverse instead).
inline std::size_t utf8CharBytes(char leading_ch)
{
    // ASCII is one byte; otherwise the count of leading 1-bits gives the length.
    return (leading_ch & 0x80) == 0
        ? 1 : __builtin_clz(~(std::uint32_t(std::uint8_t(leading_ch)) << 24));
}

inline bool isUtf8LeadingByte(char ch)
{
    return (ch & 0xC0) != 0x80;  // continuation bytes look like 10xxxxxx
}
Using these functions, it should not be difficult to implement your own iterators over UTF-8 strings: one forward iterator and one backward iterator.
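For example, the two movement primitives could look like this (assuming valid UTF-8 and that you never step past either end of the buffer):

// Advance to the first byte of the next character.
inline const char *utf8Next(const char *p) {
    return p + utf8CharBytes(*p);
}

// Step back to the first byte of the previous character.
inline const char *utf8Prev(const char *p) {
    do { --p; } while (!isUtf8LeadingByte(*p));
    return p;
}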
Popular software developers and companies (Joel Spolsky, Fog Creek Software) tend to use wchar_t for Unicode character storage when writing C or C++ code. When and how should one use char and wchar_t with respect to good coding practices?
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
When using wchar_t, you can look up characters in an array of wide characters on a per-character or per-array-element basis:
/* C code fragment */
const wchar_t *overlord = L"ov€rlord";
if (overlord[2] == L'€')
wprintf(L"Character comparison on a per-character basis.\n");
How can you compare Unicode bytes (or characters) when using char?
So far my preferred way of comparing strings and characters of type char in C often looks like this:
/* C code fragment */
const char *mail[] = { "ov€rlord@masters.lt", "ov€rlord@masters.lt" };
if (mail[0][2] == mail[1][2] && mail[0][3] == mail[1][3] && mail[0][4] == mail[1][4])
    printf("%s\n%zu", *mail, strlen(*mail));
This method scans for the byte equivalent of a Unicode character. The Unicode euro sign € takes up 3 bytes, so one needs to compare three char array bytes (here indices 2, 3, and 4) to know whether the Unicode characters match. Often you need to know the size of the character or string you want to compare, and the exact bytes it produces, for the solution to work. This does not look like a good way of handling Unicode at all. Is there a better way of comparing strings and character elements of type char?
In addition, when using wchar_t, how can you read file contents into an array? The function fread does not seem to produce valid results.
If you know that you're dealing with Unicode, neither char nor wchar_t is appropriate, as their sizes are compiler/platform-defined. For example, wchar_t is 2 bytes on Windows (MSVC) but 4 bytes on Linux (GCC). The C11 and C++11 standards are a bit more rigorous and define two new character types (char16_t and char32_t) with associated literal prefixes for creating UTF-{8, 16, 32} strings.
If you need to store and manipulate unicode characters, you should use a library that is designed for the job, as neither the pre-C11 nor pre-C++11 language standards have been written with unicode in mind. There are a few to choose from, but ICU is quite popular (and supports C, C++, and Java).
I am particularly interested in POSIX compliance when writing software that leverages Unicode.
In this case, you'll probably want to use UTF-8 (with char) as your preferred Unicode string type. POSIX doesn't have a lot of functions for working with wchar_t — that's mostly a Windows thing.
This method scans for the byte equivalent of a Unicode character. The Unicode euro sign € takes up 3 bytes, so one needs to compare three char array bytes to know whether the Unicode characters match. Often you need to know the size of the character or string you want to compare, and the exact bytes it produces, for the solution to work.
No, you don't. You just compare the bytes. If the bytes match, the strings match. strcmp works just as well with UTF-8 as it does with any other encoding.
Unless you want something like a case-insensitive or accent-insensitive comparison, in which case you'll need a proper Unicode library.
You should never, ever compare bytes, or even code points, to decide whether strings are equal. A lot of strings can be identical from the user's perspective without being identical from the code-point perspective: for example, "é" can be the single code point U+00E9 or the pair "e" + U+0301 (combining acute accent).
I have a problem. I'm writing an app in Polish (with, of course, Polish characters) for Linux, and I get 80 warnings when compiling. They are all "warning: multi-character character constant" and "warning: case label value exceeds maximum value for type". I'm using std::string.
Do I need to replace the std::string class?
std::string does not define a particular encoding. You can thus store any sequence of bytes in it. There are subtleties to be aware of:
.c_str() will return a null-terminated buffer. If your encoding allows embedded null bytes, don't pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
A char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in computing history. Note that wchar_t does not necessarily hold a full character either (think UTF-16 surrogate pairs).
.size() and .length() will return the number of bytes, not the number of characters.
Edit: the warnings about case labels are related to issue (2). You are using a switch statement with multi-byte character constants, and type char cannot hold more than one byte.
Therefore, you can use std::string in your application, provided that you respect these three rules. There are subtleties involving the STL, including std::find(), that are consequences of this: you need more clever string-matching algorithms to properly support Unicode, because of normalization forms.
However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, consider this anything outside [0, 128)), you need to be aware of encodings in different sources of textual data.
The source-file encoding might not be specified, and might be subject to change using compiler options. Any string literal will be subject to this rule. I guess this is why you are getting warnings.
You will get a variety of character encodings from external sources (files, user input, etc.). When that source specifies the encoding or you can get it from some external source (i.e. asking the user that imports the data), then this is easier. A lot of (newer) internet protocols impose ASCII or UTF-8 unless otherwise specified.
These two issues are not addressed by any particular string class. You just need to convert any external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of its native support. I strongly recommend placing your string literals in a message file, so you can forget about issue (1) and only deal with issue (2).
I don't suggest using std::wstring on Linux because 100% of native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and eventually get something wrong, on top of making everything slow(er).
If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.
Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading bout that trick in the Python reference implementation but the article was quite old. I'm not sure if that is true anymore.
On Linux you should use a multi-byte string class provided by a framework you use.
I'd recommend Glib::ustring, from the glibmm framework, which stores strings in the UTF-8 encoding.
If your source files are in UTF-8, then using multi-byte string literals in code is as easy as:
Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
But you cannot build a switch/case statement on multi-byte characters using char. I'd recommend using a series of ifs. You can use glibmm's gunichar, but it's not very readable (you can get the correct Unicode values for the characters from the table in the Wikipedia article on the Polish alphabet):
#include <glibmm.h>
#include <iostream>

using namespace std;

int main()
{
    Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
    int small_polish_vowels_with_diacritics_count = 0;
    for (size_t i = 0; i < alphabet.size(); i++) {
        switch (alphabet[i]) {
        case 0x0105: // ą
        case 0x0119: // ę
        case 0x00f3: // ó
            small_polish_vowels_with_diacritics_count++;
            break;
        default:
            break;
        }
    }
    cout << "There are " << small_polish_vowels_with_diacritics_count
         << " small Polish vowels with diacritics in this string.\n";
    return 0;
}
You can compile this using:
g++ progname.cc `pkg-config --cflags --libs glibmm-2.4` -o progname
std::string is for ASCII strings. Since your Polish strings don't fit, you should use std::wstring.