What is the difference between strcmp() and strcoll()? - c++

I tried to understand both of them, but I did not find any difference, except that for strcoll() this reference says that it
compares two null-terminated strings according to the current locale as defined by the LC_COLLATE category.
On second thought (and I know I am asking another question that needs a detailed answer): what exactly is this locale, for both C and C++?

strcmp() takes the bytes of the strings one by one and compares them as-is, whatever the bytes are.
strcoll() takes the bytes, transforms them using the locale, then compares the result. The transformation re-orders characters depending on the language. In French, accented letters come after the non-accented ones, so é is after e. However, é is before f. strcoll() gets this right; strcmp() does not.
However, in many cases strcmp() is enough because you don't need to show the result ordered according to the language (locale) in use. For example, if you just need quick access to a large amount of data indexed by a string, you'd use a map indexed by that string. It is probably useless to sort those keys using strcoll(), which is generally very slow (in comparison to strcmp() at least).
For details about characters you may also want to check out the Unicode website.
As for the locale: it is, roughly, the language and regional conventions in use. A program starts in the "C" locale (more or less, no locale) and only switches to the user's locale when it calls setlocale(LC_ALL, ""). The user's locale normally follows the language and region selected on the system, and can be overridden with environment variables such as LANG, LC_ALL, or per-category ones like LC_COLLATE. In general, though, you use predefined functions that automatically take those settings into account and do the right thing for you (e.g. format dates / times, format numbers / measures, compute upper / lower case, etc.).
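To make the difference concrete, here is a minimal sketch (not part of the original answer). The helper names sign, compare_bytes and compare_collated are made up for illustration, and whether a locale such as "fr_FR.UTF-8" is installed on your machine is an assumption you would have to check; if the requested locale is missing, the sketch simply keeps the current one.

```cpp
#include <clocale>
#include <cstring>
#include <string>

// Reduce a comparison result to -1, 0 or +1 so results are easy to compare.
int sign(int v) { return (v > 0) - (v < 0); }

// Byte-wise comparison: the locale is ignored entirely.
int compare_bytes(const std::string& a, const std::string& b) {
    return sign(std::strcmp(a.c_str(), b.c_str()));
}

// Locale-aware comparison using the LC_COLLATE category of `locale_name`.
// The previous collation locale is restored before returning.
int compare_collated(const std::string& a, const std::string& b,
                     const char* locale_name) {
    const std::string saved = std::setlocale(LC_COLLATE, nullptr);
    std::setlocale(LC_COLLATE, locale_name);  // returns nullptr if unavailable
    const int result = sign(std::strcoll(a.c_str(), b.c_str()));
    std::setlocale(LC_COLLATE, saved.c_str());
    return result;
}
```

With UTF-8 input, compare_bytes("é", "f") is positive because the first byte of é (0xC3) is larger than 'f'. compare_collated("é", "f", "C") gives the same answer, while a French locale, if available, should put é before f.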

For some reason, in all Unicode locales I tested, on several different versions of glibc, strcoll() returns zero for any two hiragana characters. This breaks sort, uniq, and everything that interacts with the ordering of strings in some way.
$ echo -e -n 'い\nろ\nは\nに\nほ\nへ\nと\n' | sort | uniq
い
which is simply broken beyond repair. People from different parts of the world might have different ideas on whether 'い' should be placed before or after 'ろ', but nobody sane would consider them the same.
And no, setting your locale to the Japanese one does not matter:
$ LC_ALL=ja_JP.utf8 LANG=ja_JP.utf8 LC_COLLATE=ja_JP.utf8 echo -e -n 'い\nろ\nは\nに\nほ\nへ\nと\n' | sort | uniq
い
There was a discussion on an official mailing list, but guess what, it was in 2002 and it was never fixed because people don't care: https://www.mail-archive.com/linux-utf8@nl.linux.org/msg02658.html
That bug bit us one day, and in the end our only way out was to set the collation locale to "C" and rely on the nice properties of the UTF-8 encoding. That was a horrible experience, since one shouldn't really have to work under the "C" locale when processing all-Japanese data.
So for your sanity's sake, do NOT directly use strcoll. A safer variant might be:
#include <string.h>

/* Use strcoll() first; if it claims equality, break the tie byte-wise. */
int safe_strcoll(const char *a, const char *b)
{
    int ret = strcoll(a, b);
    if (ret != 0) return ret;
    return strcmp(a, b);
}
just in case strcoll() decides to screw you...
EDIT: I just repeated the experiment out of curiosity, and my current system (with glibc 2.29) works without problems now. Locale doesn't matter either.
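The "nice property" of UTF-8 relied on above is that comparing the encoded bytes as unsigned values gives the same order as comparing code points. Here is a small sketch of a byte-wise sort-and-dedupe that does what `sort | uniq` does under the "C" locale; the function name sort_unique_bytewise is made up for this example.

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Sort strings by raw byte value (strcmp order) and drop exact duplicates.
// For valid UTF-8 input this is also code point order.
std::vector<std::string> sort_unique_bytewise(std::vector<std::string> v) {
    std::sort(v.begin(), v.end(),
              [](const std::string& a, const std::string& b) {
                  return std::strcmp(a.c_str(), b.c_str()) < 0;
              });
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}
```

Fed the seven hiragana from the shell example, this keeps all seven distinct, which is the behaviour the locale-aware sort failed to deliver on the affected glibc versions.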

Related

Microsoft's implementation of lstrcmpi and Unicode characters

I'm trying to understand whether what I'm seeing is a bug, or some accepted behaviour of Microsoft's lstrcmpi function.
I can illustrate it with the code:
WCHAR buff1[] = L"abc ";
WCHAR buff2[] = L"abc ";
buff1[3] = 0xFFFF;
buff2[3] = 0x0;
int res = lstrcmpi(buff1, buff2);
//res is 0 or equality!
EDIT: Addition for the comment below:
lstrcmpi calls CompareString with the current locale (from thread or user) and returns "a linguistically appropriate result".
From Michael Kaplan's blog:
... Now if the functions were named lstrcoll and lstrcolli then perhaps the function would not be so commonly misused
and:
Remember that when checking for equality, especially on an item like a registry value where OS semantics are involved, the best answer is CompareStringOrdinal, with a fallback to RtlCompareUnicodeString or even better RtlEqualUnicodeString or if you absolutely must wcsicmp (with awareness that there is one character it can be wrong about) for anything that has to run pre-Vista.
and finally:
Because if you are calling lstrcmpi for appropriate reasons (i.e. you wanted to get linguistically meaningful results, say in the sorting of a list in a user interface) but you wanted to have behavior that did not vary with different locales, then CompareString with LOCALE_INVARIANT is a good answer.
But if you wanted almost anything else, including all of the non-linguistic purposes hinted at earlier, then CompareStringOrdinal or RtlCompareUnicodeString is a much better choice.
How it handles non-characters has actually changed over time.
The Unicode FFFF character is a noncharacter in the Unicode spec, so it is probably being ignored during the string comparison. This results in both strings being equal.

Extracting wide chars w/ attributes in ncurses

[Please note I am using _XOPEN_SOURCE_EXTENDED 1 and setlocale(LC_CTYPE, "").]
Curses includes various functions for extracting characters from the screen; they can be divided into those which grab just the text and those which grab the text plus attributes (bold, color, etc.). The former use wchar_t (or char) and the latter curses' own chtype.
There are constants to mask a chtype to get just the character or just the attributes -- A_CHARTEXT and A_ATTRIBUTES. However, from the value of these, it is easy to see that there will be collisions with wchar_t values over 255. A_ATTRIBUTES is 64-bits and only the lower 8 are unset.
If the base type internally is chtype, this would mean ncurses was unworkable with most of unicode, but it isn't -- you can use hardcoded strings in UTF-8 source and write them out with attributes no problem. Where it gets interesting is getting them back again.
wchar_t s[] = L"\412";
This character has a value of 266 and displays as Ċ. However, when extracted into a chtype using, e.g., mvwinchnstr(), it is exactly the same as a space (10) with the COLOR_PAIR(1) attribute (256) set. And in fact, if you take the extracted chtype and redisplay it, you get just that -- a space with COLOR_PAIR(1) set.
But if you extract it instead into a wchar_t with, e.g., mvwinnwstr(), it's correct, as is a colored space. The problem with this, of course, is that the attributes are gone. This implies the attributes are being masked out correctly, which is demonstrably impossible with a chtype, since a chtype for both of these has the same value (266). In other words, the internal representation is obviously neither a chtype nor a wchar_t.
I do not use ncurses much, and I notice there are other curses implementations (e.g. Oracle's) with functions that imply the chtype there might not have this problem. In any case, is there a way w/ ncurses to unambiguously extract wide chars together with their attributes?
[I've tagged this C and C++ since it is applicable in both contexts.]
It is more complicated than that. But briefly:
In the SVr4 implementation, there was just chtype.
X/Open work for standardization added on the multibyte characters, represented in cchar_t.
Not blatantly obvious in the X/Open documentation, but seen in the corresponding Unix implementations, the chtype and cchar_t were not envisioned as possibly different views of the same data. You can only make 8-bit encodings with the former.
Not many applications really delve into Unix implementations to make it apparent (in fact, at least one vendor's XPG4 implementation never worked well enough to do useful testing — so much for the state of the art).
The integration (or lack of same) was overlooked in ncurses, where it seemed a natural thing to do.
ncurses accepts multibyte strings in addstr (none of the Unix implementations do).
ncurses attempts to provide the same information via either style of interface which was set via the other.
There are obviously limitations: chtype corresponds to a single cell on the screen, and can hold only an 8-bit character. Interfaces such as winnstr which return a string will work within that constraint. The winchnstr function does return an array of chtype values.
If you want the attributes for a cell which is not an 8-bit character, you are best off retrieving it via the analogous win_wchnstr.

How to compare a "basic_string" using an arbitrary locale

I'm re-posting a question I submitted earlier today but I'm now citing a specific example in response to the feedback I received. The original question can be found here (note that it's not a homework assignment):
I'm simply trying to determine if C++ makes it impossible to perform an (efficient) case-INsensitive comparison of a basic_string object that also factors in any arbitrary locale object. For instance, it doesn't appear to be possible to write an efficient function such as the following:
bool AreStringsEqualIgnoreCase(const string &str1, const string &str2, const locale &loc);
Based on my current understanding (but can someone confirm this), this function has to call both ctype::toupper() and collate::compare() for the given locale (extracted as always using use_facet()). However, because collate::compare() in particular requires 4 pointer args, you either need to pass these 4 args for every char you need to compare (after first calling ctype::toupper()), or alternatively, convert both strings to uppercase first and then make a single call to collate::compare().
The 1st approach is obviously inefficient (4 pointers to pass for each char tested), and the 2nd requires you to convert both strings to uppercase in their entirety (requiring allocation of memory and needless copying/converting of both strings to uppercase). Am I correct about this, i.e., is it not possible to do it efficiently (because there's no way around collate::compare())?
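For reference, a single whole-string call to collate::compare() with its four pointer arguments, as described above, might look roughly like this. The wrapper name collate_compare is made up for this sketch, and it defaults to the classic "C" locale only so that it behaves the same everywhere; any other std::locale can be passed in.

```cpp
#include <locale>
#include <string>

// Compare two strings with the collate facet of `loc`.
// Returns a negative value, 0, or a positive value, like strcoll().
int collate_compare(const std::string& a, const std::string& b,
                    const std::locale& loc = std::locale::classic()) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Note that this compares the strings as given; it does nothing about case.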
One of the little annoyances about trying to deal in a consistent way with all the world's writing systems is that practically nothing you think you know about characters is actually correct. This makes it tricky to do things like "case-insensitive comparison". Indeed, it is tricky to do any form of locale-aware comparison, and case-insensitivity is additionally thorny.
With some constraints, though, it is possible to accomplish. The algorithm needed can be implemented "efficiently" using normal programming practices (and precomputation of some static data), but it cannot be implemented as efficiently as an incorrect algorithm. It is often possible to trade off correctness for speed, but the results are not pleasant. Incorrect but fast locale implementations may appeal to those whose locales are implemented correctly, but are clearly unsatisfactory for the part of the audience whose locales produce unexpected results.
Lexicographical ordering doesn't work for human beings
Most locales (other than the "C" locale) for languages which have case already handle letter case in the manner expected, which is to use case differences only after all other differences have been taken into account. That is, if a list of words are sorted in the locale's collation order, then words in the list which differ only in case are going to be consecutive. Whether the words with upper case come before or after words with lower case is locale-dependent, but there won't be other words in between.
That result cannot be achieved by any single-pass left-to-right character-by-character comparison ("lexicographical ordering"). And most locales have other collation quirks which also don't yield to naïve lexicographical ordering.
Standard C++ collation should be able to deal with all of these issues, if you have appropriate locale definitions. But it cannot be reduced to lexicographical comparison just using a comparison function over pairs of wchar_t, and consequently the C++ standard library doesn't provide that interface.
The following are just a few examples of why locale-aware collation is complicated; a longer explanation, with a lot more examples, is found in Unicode Technical Standard 10.
Where do the accents go?
Most romance languages (and also English, when dealing with borrowed words) consider accents over vowels to be a secondary characteristic; that is, words are first sorted as though the accents weren't present, and then a second pass is made in which unaccented letters come before accented letters. A third pass is necessary to deal with case, which is ignored in the first two passes.
But that doesn't work for Northern European languages. The alphabets of Swedish, Norwegian and Danish have three extra vowels, which follow z in the alphabet. In Swedish, these vowels are written å, ä, and ö; in Norwegian and Danish, these letters are written å, æ, and ø, and in Danish å is sometimes written aa, making Aarhus the last entry in an alphabetical list of Danish cities.
In German, the letters ä, ö, and ü are generally alphabetised as with romance accents, but in German phonebooks (and sometimes other alphabetical lists), they are alphabetised as though they were written ae, oe and ue, which is the older style of writing the same phonemes. (There are many pairs of common surnames, such as "Müller" and "Mueller", which are pronounced the same and are often confused, so it makes sense to intercollate them. A similar convention was used for Scottish names in Canadian phonebooks when I was young; the spellings M', Mc and Mac were all clumped together since they are all phonetically identical.)
One symbol, two letters. Or two letters, one symbol
German also has the symbol ß which is collated as though it were written out as ss, although it is not quite identical phonetically. We'll meet this interesting symbol again a bit later.
In fact, many languages consider digraphs and even trigraphs to be single letters. The 44-letter Hungarian alphabet includes Cs, Dz, Dzs, Gy, Ly, Ny, Sz, Ty, and Zs, as well as a variety of accented vowels. However, the language most commonly referenced in articles about this phenomenon -- Spanish -- stopped treating the digraphs ch and ll as letters in 1994, presumably because it was easier to force Hispanic writers to conform to computer systems than to change the computer systems to deal with Spanish digraphs. (Wikipedia claims it was pressure from "UNESCO and other international organizations"; it took quite a while for everyone to accept the new alphabetization rules, and you still occasionally find "Chile" after "Colombia" in alphabetical lists of South American countries.)
Summary: comparing character strings requires multiple passes, and sometimes requires comparing groups of characters
Making it all case-insensitive
Since locales handle case correctly in comparison, it should not really be necessary to do case-insensitive ordering. It might be useful to do case-insensitive equivalence-class checking ("equality" testing), although that raises the question of what other imprecise equivalence classes might be useful. Unicode normalization, accent deletion, and even transcription to latin are all reasonable in some contexts, and highly annoying in others. But it turns out that case conversions are not as simple as you might think, either.
Because of the existence of di- and trigraphs, some of which have Unicode codepoints, the Unicode standard actually recognizes three cases, not two: lower-case, upper-case and title-case. The last is what you use to upper case the first letter of a word, and it's needed, for example, for the Croatian digraph dž (U+01C6; a single character), whose uppercase is DŽ (U+01C4) and whose title case is Dž (U+01C5). The theory of "case-insensitive" comparison is that we could transform (at least conceptually) any string in such a way that all members of the equivalence class defined by "ignoring case" are transformed to the same byte sequence. Traditionally this is done by "upper-casing" the string, but it turns out that that is not always possible or even correct; the Unicode standard prefers the use of the term "case-folding", as do I.
C++ locales aren't quite up to the job
So, getting back to C++, the sad truth is that C++ locales do not have sufficient information to do accurate case-folding, because C++ locales work on the assumption that case-folding a string consists of nothing more than sequentially and individually upper-casing each codepoint in the string using a function which maps a codepoint to another codepoint. As we'll see, that just doesn't work, and consequently the question of its efficiency is irrelevant. On the other hand, the ICU library has an interface which does case-folding as correctly as the Unicode database allows, and its implementation has been crafted by some pretty good coders so it is probably just about as efficient as possible within the constraints. So I'd definitely recommend using it.
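To see the assumption described above in code, here is a sketch of the naive per-character approach that C++ locales make possible. It is deliberately the "incorrect but fast" variant: it folds one char at a time with ctype::tolower() and therefore cannot handle anything where case conversion changes length or depends on context. The function name naive_equal_ignore_case is made up for this example.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Naive case-insensitive equality: one char in, one char out.
// Fine for plain ASCII in the "C" locale, wrong in general
// (for example, "straße" and "STRASSE" are reported as different).
bool naive_equal_ignore_case(const std::string& a, const std::string& b,
                             const std::locale& loc = std::locale::classic()) {
    if (a.size() != b.size()) return false;
    const std::ctype<char>& ct = std::use_facet<std::ctype<char>>(loc);
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ct.tolower(a[i]) != ct.tolower(b[i])) return false;
    }
    return true;
}
```

No allocation and no call to collate::compare() is needed for this, which is why it is tempting; the rest of this answer is about why it is not enough.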
If you want a good overview of the difficulty of case-folding, you should read sections 5.18 and 5.19 of the Unicode standard (PDF for chapter 5). The following are just a few examples.
A case transform is not a mapping from single character to single character
The simplest example is the German ß (U+00DF), which has no upper-case form because it never appears at the beginning of a word, and traditional German orthography didn't use all-caps. The standard upper-case transform is SS (or in some cases SZ) but that transform is not reversible; not all instances of ss are written as ß. Compare, for example, grüßen and küssen (to greet and to kiss, respectively). In Unicode 5.1, ẞ, an "upper-case ß", was added as U+1E9E, but it is not commonly used except in all-caps street signs, where its use is legally mandated. The normal expectation of upper-casing ß would be the two letters SS.
Not all graphemes (visible characters) are single character codes
Even when a case transform maps a single character to a single character, it may not be able to express that as a wchar→wchar mapping. For example, ǰ can easily be capitalized to J̌, but the former is a single combined glyph (U+01F0), while the second is a capital J with a combining caron (U+030C).
There is a further problem with glyphs like ǰ:
Naive character by character case-folding can denormalize
Suppose we upper-case ǰ as above. How do we capitalize ǰ̠ (which, in case it doesn't render properly on your system, is the same character with a bar underneath, another IPA convention)? That combination is U+01F0,U+0320 (j with caron, combining minus sign below), so we proceed to replace U+01F0 with U+004A,U+030C and then leave the U+0320 as is: J̠̌. That's fine, but it won't compare equal to a normalized capital J with caron and minus sign below, because in the normal form the minus sign diacritic comes first: U+004A,U+0320,U+030C (J̠̌, which should look identical). So sometimes (rarely, to be honest, but sometimes) it is necessary to renormalize.
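The renormalization step can be sketched as follows. This is only an illustration under heavy assumptions: the combining-class table is hand-written and knows just the two marks from the example (U+0320 has class 220, U+030C has class 230), and the function names are made up. A real implementation would take these values from the Unicode Character Database, for instance via ICU.

```cpp
#include <algorithm>
#include <string>

// Hand-written subset of the canonical combining class table.
// Everything not listed is treated as a starter (class 0).
int combining_class(char32_t c) {
    switch (c) {
        case 0x0320: return 220;  // COMBINING MINUS SIGN BELOW
        case 0x030C: return 230;  // COMBINING CARON
        default:     return 0;
    }
}

// Naive upper-casing: U+01F0 becomes U+004A U+030C, the rest is copied.
std::u32string naive_upper(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == 0x01F0) {
            out.push_back(0x004A);
            out.push_back(0x030C);
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Canonical reordering: stable-sort each run of combining marks by class.
std::u32string canonical_reorder(std::u32string s) {
    auto is_starter = [](char32_t c) { return combining_class(c) == 0; };
    auto it = s.begin();
    while (it != s.end()) {
        if (is_starter(*it)) { ++it; continue; }
        auto run_end = std::find_if(it, s.end(), is_starter);
        std::stable_sort(it, run_end, [](char32_t a, char32_t b) {
            return combining_class(a) < combining_class(b);
        });
        it = run_end;
    }
    return s;
}
```

Applied to U+01F0,U+0320, naive_upper() yields U+004A,U+030C,U+0320, and only after canonical_reorder() does it become U+004A,U+0320,U+030C and compare equal to the normalized form.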
Leaving aside Unicode weirdness, sometimes case-conversion is context-sensitive
Greek has a lot of examples of how marks get shuffled around depending on whether they are word-initial, word-final or word-interior -- you can read more about this in chapter 7 of the Unicode standard -- but a simple and common case is Σ, which has two lower-case versions: σ and ς. Non-greeks with some maths background are probably familiar with σ, but might not be aware that it cannot be used at the end of a word, where you must use ς.
In short
The best available correct way to case-fold is to apply the Unicode case-folding algorithm, which requires creating a temporary string for each source string. You could then do a simple bytewise comparison between the two transformed strings in order to verify that the original strings were in the same equivalence class. Doing a collation ordering on the transformed strings, while possible, is rather less efficient than collation ordering the original strings, and for sorting purposes, the untransformed comparison is probably as good or better than the transformed comparison.
In theory, if you are only interested in case-folded equality, you could do the transformations linearly, bearing in mind that the transformation is not necessarily context-free and is not a simple character-to-character mapping function. Unfortunately, C++ locales don't provide you the data you need to do this. The Unicode CLDR comes much closer, but it's a complex data structure.
All of this stuff is really complicated, and rife with edge cases. (See the note in the Unicode standard about accented Lithuanian i's, for example.) You're really better off just using a well-maintained existing solution, of which the best example is ICU.

How to parse numbers like "3.14" with scanf when locale expects "3,14"

Let's say I have to read a file, containing a bunch of floating-point numbers. The numbers can be like 1e+10, 5, -0.15 etc., i.e., any generic floating-point number, using decimal points (this is fixed!). However, my code is a plugin for another application, and I have no control over what's the current locale. It may be Russian, for example, and the LC_NUMERIC rules there call for a decimal comma to be used. Thus, Pi is expected to be spelled as "3,1415...", and
sscanf("3.14", "%f", &x);
returns "1", and x contains "3.0", since it refuses to parse past the '.' in the string.
I need to ignore the locale for such number-parsing tasks.
How does one do that?
I could write a parseFloat function, but this seems like a waste.
I could also save the current locale, reset it temporarily to "C", read the file, and restore to the saved one. What are the performance implications of this? Could setlocale() be very slow on some OS/libc combo, what does it really do under the hood?
Yet another way would be to use iostreams, but again their performance isn't stellar.
My personal preference is to never use LC_NUMERIC, i.e. just call setlocale with other categories, or, after calling setlocale with LC_ALL, use setlocale(LC_NUMERIC, "C");. Otherwise, you're completely out of luck if you want to use the standard library for printing or parsing numbers in a standard form for interchange.
If you're lucky enough to be on a POSIX 2008 conforming system, you can use the uselocale and *_l family of functions to make the situation somewhat better. There are at least 2 basic approaches:
Leave the default locale unset (at least the troublesome parts like LC_NUMERIC; LC_CTYPE should probably always be set), and pass a locale_t object for the user's locale to the appropriate *_l functions only when you want to present things to the user in a way that meets their own cultural expectations; otherwise use the default C locale.
Have your code that needs to work with data for interchange keep around a locale_t object for the C locale, and either switch back and forth using uselocale when you need to work with data in a standard form for interchange, or use the appropriate *_l functions (but there is no scanf_l).
Note that implementing your own floating point parser is not easy and is probably not the right solution to the problem unless you're an expert in numerical computing. Getting it right is very hard.
POSIX.1-2008 specifies isalnum_l(), isalpha_l(), isblank_l(), iscntrl_l(), isdigit_l(), isgraph_l(), islower_l(), isprint_l(), ispunct_l(), isspace_l(), isupper_l(), and isxdigit_l().
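The second approach (keep a "C" locale_t around and switch to it with uselocale) could be sketched like this. It assumes a POSIX.1-2008 system where newlocale, uselocale and freelocale are declared in <locale.h>, as on Linux with glibc; the function name parse_double_c_locale is made up, and creating the locale on every call is only done to keep the example short.

```cpp
#include <locale.h>  // newlocale, uselocale, freelocale (POSIX.1-2008)
#include <cmath>     // NAN
#include <cstdlib>   // strtod

// Parse a number written with a decimal point, whatever the process
// locale is. Returns NAN if nothing could be parsed.
double parse_double_c_locale(const char* text) {
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0) return NAN;

    // uselocale() only changes the locale of the calling thread.
    locale_t previous = uselocale(c_locale);
    char* end = nullptr;
    const double value = std::strtod(text, &end);
    uselocale(previous);
    freelocale(c_locale);

    return (end == text) ? NAN : value;
}
```

In a plugin you would probably create the locale_t once and reuse it. Because uselocale() is per-thread, this does not disturb the host application's other threads the way a temporary setlocale() would.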
Here's what I've done with this stuff in the past.
The goal is to use locale-dependent numeric converters with a C-locale numeric representation. The ideal, of course, would be to use non-locale-dependent converters, or not change the locale, etc., etc., but sometimes you just have to live with what you've got. Locale support is seriously broken in several ways and this is one of them.</rant>
First, extract the number as a string using something like the C grammar's simple pattern for numeric preprocessing tokens. For use with scanf, I do an even simpler one:
" %1[-+0-9.]%[-+0-9A-Za-z.]"
This could be simplified even more, depending on what else you might expect in the input stream. The only thing you need to do is to not read beyond the end of the number; as long as you don't allow numbers to be followed immediately by letters, without intervening whitespace, the above will work fine.
Now, get the struct lconv (man 7 locale) representing the current locale using localeconv(3). The first entry in that struct is const char* decimal_point; replace all of the '.' characters in your string with that value. (You might also need to replace '+' and '-' characters, although most locales don't change them, and the sign fields in the lconv struct are documented as only applying to currency conversions.) Finally, feed the resulting string through strtod and see if it passes.
This is not a perfect algorithm, particularly since it's not always easy to know how locale-compliant a given library actually is, so you might want to do some autoconf stuff to configure it for the library you're actually compiling with.
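The replace-then-strtod step described above might look like the following sketch. It assumes the numeric token has already been extracted into a string, and the function name parse_with_current_locale is made up for this example; sign characters are left alone.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Rewrite a C-style number ("3.14") into the current locale's spelling
// by swapping '.' for the locale's decimal point, then let the
// locale-dependent strtod() parse it.
double parse_with_current_locale(std::string text) {
    const std::string point = std::localeconv()->decimal_point;
    std::string::size_type pos = 0;
    while ((pos = text.find('.', pos)) != std::string::npos) {
        text.replace(pos, 1, point);
        pos += point.size();  // skip past what was just inserted
    }
    return std::strtod(text.c_str(), nullptr);
}
```

In the "C" locale the decimal point is already ".", so the function is a plain strtod() there; under a locale with a decimal comma it would hand "3,14" to strtod().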
I am not sure how to solve it in C.
But C++ streams (can) have a unique locale object.
std::stringstream dataStream;
dataStream.imbue(std::locale("C"));
// Note: You must imbue the stream before you do anything with it.
// If any operations have been performed then an imbue() can
// be silently ignored by the stream (which is a pain to debug).
dataStream << "3.14";
float x;
dataStream >> x;

How do you cope with signed char -> int issues with standard library?

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...
C naively defined all of its character test functions for an int:
int isspace(int ch);
But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.
And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.
So if you hand isspace(*pchar) you can end up with sign extension problems. They're hard to see, and thence they're hard to guard against in my experience.
Similarly, isspace() and its ilk all take ints, while the actual width of a character is often unknown without analysing the string. That means any modern character library should essentially never be carting around chars or wchar_ts, but only pointers/iterators, since only by analysing the character stream can you know how much of it composes a single logical character. So I am at a bit of a loss as to how best to approach the issues.
I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...
** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).
So, my question is:
"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues"
After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8 bit unit.
Please note:
No matter what size your char_type is, it's wrong for most character encoding schemes.
This problem is in the standard C library, as well as in the C++ standard libraries; which still tries to pass around char and wchar_t, rather than string-iterators in the various isspace, isprint, etc. implementations.
Actually, it's precisely those type of functions that break the genericity of std::string. If it only worked in storage-units, and didn't try to pretend to understand the meaning of the storage-units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...
Thank You
Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.
How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...
Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OSes have a 32-bit wchar_t which makes them able to correctly handle UTF-32, but that's an implementation detail, is not guaranteed by C++, and is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).
IBM ICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utf8proc
and many more at http://unicode.org/resources/libraries.html.
If the question is less about specific character testing and more about code practices in general: do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything internally in UTF-8. If you're coding with Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
I think you are confounding a whole host of unrelated concepts.
First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. Its name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.
The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.
The problem is that there are more codepoints than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: The native host interface (command line args, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between them (mbstowcs/wcstombs).
Unfortunately, there is nothing specific about "the system's character set" and "the system's multibyte encoding", so you, like so many SO users before you, are left puzzling what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding that we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 2^21 at the moment). Along with the text encoding comes a family of byte-string encodings, UTF-8, UTF-16 and UTF-32.
The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just as a sequence of raw codepoints (typically UTF-32, since we don't have a useful 21-bit data type).
Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do this. Since we don't know anything about our "system's character set", we also need the library to handle that.
One popular library of choice is iconv(); the typical sequence goes from input multibyte char* via mbstowcs() to a std::wstring or wchar_t* wide string, and then via iconv()'s WCHAR_T-to-UTF32 conversion to a std::u32string or uint32_t* raw Unicode codepoint sequence.
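A minimal sketch of the first step of that sequence (multibyte char* to wide string via mbstowcs()), assuming the caller has already selected the locale it wants and a C++11 compiler. The name to_wide is a hypothetical helper, not part of any library, and the iconv() step is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte string to a wide string
// using the current C locale's LC_CTYPE category.
std::wstring to_wide(const std::string& bytes)
{
    // mbstowcs stops at the first NUL, so embedded NULs are not supported.
    assert(bytes.find('\0') == std::string::npos);

    // A wide string never has more elements than the multibyte input has
    // bytes, so size() + 1 (for the terminator) is always large enough.
    std::vector<wchar_t> buf(bytes.size() + 1);
    std::size_t n = std::mbstowcs(buf.data(), bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    return std::wstring(buf.data(), n);
}
```

What the resulting wchar_t values mean still depends on the platform, which is why the second conversion step is needed before treating them as Unicode code points.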
At this point our journey ends. We can now either examine the text codepoint by codepoint (which might be enough to tell if something is a space); or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode codepoint stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and the realm of text processing specialists.
It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). The cast deals with the sign extension (by zero-extending instead). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you see it, it becomes easier to spot.
fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension issue for values from that.
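As a small illustration of the cast, here is a sketch that counts leading whitespace in a byte string. count_leading_spaces is a hypothetical helper name; the classification is whatever the current C locale says, so this only covers single-byte characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Hypothetical helper: number of leading whitespace bytes in s.
std::size_t count_leading_spaces(const char* s)
{
    assert(s != nullptr);
    std::size_t n = 0;
    // The cast keeps the argument in the range isspace accepts, even on
    // platforms where plain char is signed and the byte has its top bit set.
    // The loop stops at the terminating NUL, which is not whitespace.
    while (std::isspace(static_cast<unsigned char>(s[n])))
        ++n;
    return n;
}
```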
That's trivia really, though, since the standard character macros don't cover Unicode, or multi-byte encodings. If you want to handle Unicode properly then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string which sounds promising. Prior to that the answer is to use something implementation-specific or third-party. (Un)fortunately there are a lot of libraries to choose from.
It may be (I speculate) that a "complete" Unicode classification database is so large and so subject to change that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to an extent what operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly-moving target, so it's always going to be implementation-defined what code points the system knows about.
In general, there are three correct ways to handle Unicode text:
1) At all I/O (including system calls that return or accept strings), convert everything between an externally-used character encoding, and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to/from a byte stream, then you wouldn't mix up byte stream with the objects, or examine sections of byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal Unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on implementation. Just pretend the standard library doesn't provide strings, if it helps, or use a std::basic_string of something big as the container but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and suchlike, since even in a fixed-width Unicode encoding, there may be more than one code point per glyph.
2) Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, and hence although it's potentially correct, in practice it might just as easily come out wrong.
3) (Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks, and concatenate strings for output. Basically it works for programs where you don't need to understand anything with the top bit set, just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but actually are complex. Like collation.
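To illustrate the "UTF-8 for everything" case: every byte of a multi-byte UTF-8 sequence has its top bit set, so an ASCII delimiter can never appear in the middle of one. A sketch of splitting on such a delimiter follows; split_on_ascii is a hypothetical helper name, and the input is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits UTF-8 text on an ASCII delimiter and passes
// all other bytes through unchanged.
std::vector<std::string> split_on_ascii(const std::string& utf8, char delim)
{
    // Only safe for ASCII delimiters: bytes >= 0x80 occur inside sequences.
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(utf8.substr(start));
            break;
        }
        fields.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

The fields come back as opaque byte strings, which is exactly the point: the program never has to interpret anything with the top bit set.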
One comment up front: the old C functions like isspace took int for
a reason: they support EOF as input as well, so they need to be able
to support one more value than will fit in a char. The
“naïve” decision was allowing char to be signed—but
making it unsigned would have had severe performance implications on a
PDP-11.
Now to your questions:
1) Sign expansion
The C++ functions don't have this problem. In C++, the
“correct” way of testing things like whether a character is
a space is to grab the std::ctype facet from whatever locale you want,
and to use it. Of course, the C++ localization, in <locale>, has
been carefully designed to make it as hard as possible to use, but if
you're doing any significant text processing, you'll soon come up with
your own convenience wrappers: a functional object which takes a locale
and mask specifying which characteristic you want to test isn't hard.
Making it a template on the mask, and giving its locale argument a
default to the global locale isn't rocket science either. Throw in a
few typedef's, and you can pass things like IsSpace() to std::find.
The only subtlety is managing the lifetime of the std::ctype object
you're dealing with. Something like the following should work, however:
#include <locale>

template<std::ctype_base::mask mask>
class Is  // Must find a better name.
{
    std::locale myLocale;
        //< Needed to ensure no premature destruction of facet
    std::ctype<char> const* myCType;
public:
    Is( std::locale const& l = std::locale() )
        : myLocale( l )
        // use_facet returns a reference, so take its address
        , myCType( &std::use_facet<std::ctype<char> >( l ) )
    {
    }
    bool operator()( char ch ) const
    {
        return myCType->is( mask, ch );
    }
};
typedef Is<std::ctype_base::space> IsSpace;
//  ...
(Given the influence of the STL, it's somewhat surprising that the
standard didn't define something like the above as standard.)
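A usage sketch of such a predicate with std::find_if, repeated in full so that it compiles on its own. It assumes the facet is stored by taking the address of the reference that std::use_facet returns; first_space and check_first_space are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>

template<std::ctype_base::mask mask>
class Is
{
    std::locale myLocale;               // keeps the facet alive
    std::ctype<char> const* myCType;
public:
    explicit Is( std::locale const& l = std::locale() )
        : myLocale( l )
        , myCType( &std::use_facet<std::ctype<char> >( myLocale ) )
    {
    }
    bool operator()( char ch ) const
    {
        return myCType->is( mask, ch );
    }
};
typedef Is<std::ctype_base::space> IsSpace;

// Hypothetical helper: index of the first whitespace character, or npos.
std::string::size_type first_space( std::string const& s )
{
    std::string::const_iterator it =
        std::find_if( s.begin(), s.end(), IsSpace() );
    return it == s.end()
        ? std::string::npos
        : static_cast<std::string::size_type>( it - s.begin() );
}

// Quick self-check; call it from main() to try the predicate out.
void check_first_space()
{
    assert( first_space( "ab cd" ) == 2 );
    assert( first_space( "abcd" ) == std::string::npos );
}
```

Because the predicate holds its own copy of the locale, the algorithm can copy it freely without the facet pointer dangling.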
2) Variable width character issues.
There is no real answer. It all depends on what you need. For some
applications, just looking for a few specific single byte characters is
sufficient, and keeping everything in UTF-8, and ignoring the multi-byte
issues, is a viable (and simple) solution. Beyond that, it's often
useful to convert to UTF-32 (or depending on the type of text you're
dealing with, UTF-16), and use each element as a single code point. For
full text handling, on the other hand, you have to deal with
multi-code-point characters even if you're using UTF-32: the sequence
\u006D\u0302 is a single character (a small m with a circumflex over
it).
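A small sketch of that last point, assuming C++11 for std::u32string. Only the Combining Diacritical Marks block (U+0300 to U+036F) is treated as combining, which is a deliberate simplification; is_combining_mark and count_characters are hypothetical names, and real text needs the grapheme cluster rules from a Unicode library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F counts as a combining mark.
bool is_combining_mark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: counts code points that start a new character.
std::size_t count_characters(const std::u32string& text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        if (!is_combining_mark(cp))
            ++n;
    return n;
}

// Quick self-check: "m" followed by U+0302 is two code points,
// but a reader sees a single character.
void check_combining()
{
    const std::u32string s = U"m\u0302";
    assert(s.size() == 2);
    assert(count_characters(s) == 1);
}
```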
I haven't tested the internationalization capabilities of the Qt library much, but from what I know, QString is fully Unicode-aware, and uses QChars, which are Unicode characters. I don't know the internal implementation of those, but I expect that this implies QChars to be variable-size characters.
It would be weird to bind yourself to such big framework as Qt just to use strings though.
You seem to be confusing a function defined on 7-bit ascii with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign-extension, because the numbers these functions are defined on have no 8th bit. Providing a byte with this possibility is a mistake on your part.
Plan 9 attempts to solve this with a UTF library, and the assumption that all input data is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, but allows new programs to be written correctly.
The common notion in C, even now, is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter(/symbol/codepoint), so one can finally define a function isspacerune(), which would finally tell you which letters are spaces.
Work with arrays of Rune as you would with char arrays, to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write it out.
The sign extension issue is easy to deal with. You can either use:
isspace((unsigned char) ch)
isspace(ch & 0xFF)
the compiler option that makes char an unsigned type
As for the variable-length character issue (I'm assuming UTF-8), it depends on your needs.
If you just need to deal with the ASCII whitespace characters (space and \t\n\v\f\r), then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.
But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of
bool isspace_utf8(const char* pChar)
{
    // Pass the pointer, not *pChar: a sequence may span several bytes.
    uint32_t codePoint = decode_char(pChar);
    return is_unicode_space(codePoint);
}
Where decode_char converts a UTF-8 sequence to the corresponding Unicode code point, and is_unicode_space returns true for characters with category Z or for the Cc characters that are spaces. iswspace may or may not help with the latter, depending on how well your C++ library supports Unicode. It's best to use a dedicated Unicode library for the job.
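One possible shape for those two helpers, as a sketch rather than a reference implementation. decode_char and is_unicode_space are the hypothetical names used above; the decoder assumes a NUL-terminated string positioned at a lead byte and does not reject overlong forms or surrogates, and the space table simply hard-codes the code points listed earlier in this answer.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one UTF-8 sequence starting at p. Returns U+FFFD for a bad lead
// byte or a missing continuation byte.
std::uint32_t decode_char(const char* p)
{
    assert(p != nullptr);
    const unsigned char* s = reinterpret_cast<const unsigned char*>(p);
    if (s[0] < 0x80)
        return s[0];                                // plain ASCII

    int extra;
    std::uint32_t cp;
    if ((s[0] & 0xE0) == 0xC0)      { extra = 1; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { extra = 2; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { extra = 3; cp = s[0] & 0x07; }
    else return 0xFFFD;

    for (int i = 1; i <= extra; ++i) {
        if ((s[i] & 0xC0) != 0x80)                  // also stops at the NUL
            return 0xFFFD;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}

// Hard-coded table: ASCII whitespace plus the code points listed above.
bool is_unicode_space(std::uint32_t cp)
{
    if (cp == 0x20 || (cp >= 0x09 && cp <= 0x0D))
        return true;
    if (cp >= 0x2000 && cp <= 0x200A)
        return true;
    switch (cp) {
    case 0x85: case 0xA0: case 0x1680: case 0x180E: case 0x2028:
    case 0x2029: case 0x202F: case 0x205F: case 0x3000:
        return true;
    default:
        return false;
    }
}

bool isspace_utf8(const char* pChar)
{
    return is_unicode_space(decode_char(pChar));
}
```

A table like this goes stale as Unicode is revised, which is another argument for leaving the classification to a dedicated library.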
most strings in practice use a multibyte encoding such as UTF-7,
UTF-8, UTF-16, SHIFT-JIS, etc.
No programmer would use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with UTF-8, -16, or -32, and only convert as needed.
Your preamble argument is somewhat inaccurate, and arguably unfair: it is simply not in the library design to support Unicode encodings - certainly not multiple Unicode encodings.
Development of the C and C++ languages and much of the libraries pre-dates the development of Unicode. Also, as systems-level languages, they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately perhaps the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. It is history that has shown this to be flawed perhaps, but changing the language definition and indeed the library would break a large amount of legacy code, so such things are left to newer languages such as C#, which has an 8-bit byte and a distinct char type.
Moreover the variable encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this since you suggest that Unicode character operations should be performed on strings rather than machine word types. This would require library support and as you point out this is not provided by the standard library. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not generally universally supported by all target platforms from the deeply embedded to the super-computer. All such things must be provided by either system or third-party libraries.
Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue not a system issue.
"How do you test for whitespace, isprintable, etc., in a way that
doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues?"
isspace() considers only the lower 8 bits. Its definition explicitly states that if you pass an argument that is neither representable as an unsigned char nor equal to the value of the macro EOF, the behaviour is undefined. The problem does not arise if it is used as it was intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.
After all, all commonly used Unicode encodings are variable-width,
whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well
as older standards such as Shift-JIS
isspace() is not defined for Unicode. You'll need a library designed for whichever specific encoding you are using. The question "What is the best Unicode library for C?" may be relevant.