Related
I've been reading related questions on preparing for the upcoming "utf8" char type char8_t and its corresponding string type in C++20, and can say, up to a point, that it's about time. Also that it's a mess.
Feel free to correct me where I'm wrong:
C++, any standards, have no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor what Standards can it be compiled into (like say #!/bin/env g++ -std=c++14) .
Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
C++11 introduces u"text" and U"text" and associated char types (char16_t, char32_t) to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
C++11 also introduces u8"text" for producing an UTF8-encoded string...
but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.
Given all of the above, I have some questions regarding why are we in this weird status and if it'll ever get better. Historically Unicode support has been one of the lowest points of C++.
Similarly, am wondering how useful is a poor-man's-emulation of the whole concept (disclaimer: am the maintainer of cxxomfort, I already backport lots of things. Work needs: latest MSVC target at the office is MSVC 2012).
Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
If the former, is working up the way to eg.: typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available one can look into before writing my own?
What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 *for storage only*?
re: char8_t as unsigned char, this is more or less what I'm looking at in terms of pseudocode:
// this is here basically only for type-distinctiveness
class char8_t {
    unsigned char value;
public:
    // deliberately non-explicit
    constexpr char8_t (unsigned char ch = 0x00) noexcept;
    operator unsigned char () const noexcept;
    // implement all operators to mirror operations on unsigned char

    // public adapter jic (a friend has to be declared inside the class)
    friend unsigned char to_char (char8_t);
};
// note we're *not* using our new char-type here
namespace std {
typedef std::basic_string<unsigned char> u8string;
}
// unsure if these two would actually be needed
// (couldn't make a compelling case so far,
// even testing with Windows's broken conhost)
namespace std {
basic_istream<char8_t> u8cin;
basic_ostream<char8_t> u8cout;
}
// we work up operator<<, operator>> and string conversion from there
// adding utf8-validity checks where needed
std::ostream& operator<< (std::ostream&, std::u8string const&);
std::istream& operator>> (std::istream&, std::u8string&);
// likely a macro; we'll see
#define u8c(ch) static_cast<char8_t>(ch)
// char8_t ch = u8c('x');
// very likely not a macro pre-C++20; can't skip utf-8 validity check on [2]?
u8string u8s (char8_t const* str); // [1], likely trivial
u8string u8s (char const* str); // [2], non-trivial
// C++20 and up
#define u8s(str) u8##str // or something; not sure
// end result:
// no, I can't even think how would one spell this:
u8string text = u8s("H€łlo Ẅørλd");
// this wouldn't work without refactoring u8string into a full specialization,
// to add the required constructor, but doing so is a PITA because
// the basic_string interface is YAIM (yet another infamous mess):
u8string text = u8"H€łlo Ẅørλd";
I've tagged this C++ as a general, but this is more about (the value of) implementation for Standards pre-C++20. More importantly, I'm not looking for "perfect" solutions or justifications; given the context, poor-man's is more than good enough.
I'm the author of the P0482 and P1423 char8_t papers.
Also that it's a mess.
I completely agree. SG16 is working to improve all things Unicode and text related, but we're having to start near ground level, so it is going to take a while.
If you haven't seen it yet, the repository linked below provides some utilities for writing code that will work in C++17 and C++20.
https://github.com/tahonermann/char8_t-remediation
C++, any standards, have no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor what Standards can it be compiled into (like say #!/bin/env g++ -std=c++14).
This is correct, but not without precedent. IBM's xlC compiler supports a #pragma filetag directive that behaves similarly to Python's encoding declaration. I started on a paper exploring this space and had hoped to submit it for the Prague meeting, but did not complete it in time. I expect to submit it for the Varna meeting (in June).
Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
Correct, and this technically remained true for char16_t and char32_t string literals until C++20 and the adoption of P1041. Note though that there is no reparsing going on. In translation phase 1, the source code contents are converted to the compiler's internal encoding and then in translation phase 5, character and string literals are converted to the encoding of the appropriate execution character set.
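To make the phase 5 conversion a bit more tangible, here is a minimal sketch that inspects the code units a u8 literal ends up with. The helper name literal_bytes is made up for this illustration. It spells the character as a universal-character-name so the result does not depend on the source file's encoding, and it is templated on the element type so it should compile both before and after C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy the code units of a string literal (without the
// terminating NUL) into a vector of unsigned char. Templated on the element
// type so it accepts char (C++11..17) as well as char8_t (C++20) literals.
template <typename CharT, std::size_t N>
std::vector<unsigned char> literal_bytes(const CharT (&lit)[N]) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out.push_back(static_cast<unsigned char>(lit[i]));
    }
    return out;
}

// U+20AC EURO SIGN has to come out as the three UTF-8 code units E2 82 AC.
inline void euro_example() {
    const std::vector<unsigned char> bytes = literal_bytes(u8"\u20AC");
    assert(bytes.size() == 3);
    assert(bytes[0] == 0xE2 && bytes[1] == 0x82 && bytes[2] == 0xAC);
}
```

The same check against a plain "\u20AC" literal would depend on the execution character set chosen by the compiler, which is exactly the distinction being discussed.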
C++11 introduces u"text" and U"text" and associated char types (char16_t, char32_t) to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
Correct. P1629 is one of the more significant changes we're hoping to complete for C++23. The goal is to provide text encoders, decoders, and transcoders that facilitate working with text at the code unit and code point levels. We would also provide support for enumerating grapheme clusters.
C++11 also introduces u8"text" for producing an UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
Correct. The goal for C++20 was to 1) enable differentiating "text" and u8"text" in the type system, 2) enable separating locale dependent and UTF-8 text (with enforcement from the type system), 3) ensure use of an unsigned type for UTF-8 code units, and 4) avoid the char type aliasing penalty. That was all we had time to get done for C++20 (standardization is not a rapid process).
Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
Correct, char8_t was proposed as a breaking change; something not to be taken lightly. In this case, it was deemed acceptable because 1) code searches found little use of u8 character and string literals, 2) the options for addressing backward compatibility concerns as discussed in P1423 were considered adequate, and 3) a non-breaking proposal would have added long term baggage to the language for little gain.
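As a rough illustration of the breakage, and of the "cast back to char" style of remediation, the sketch below overloads on the literal's type. The names literal_kind and as_char are invented for this example and are not taken from P1423 or from the remediation repository; __cpp_char8_t is the feature-test macro for the new type.

```cpp
#include <cassert>
#include <string>

// Overload set that reveals which character type a literal has.
inline std::string literal_kind(const char*) { return "char"; }
#if defined(__cpp_char8_t)
inline std::string literal_kind(const char8_t*) { return "char8_t"; }
#endif

// Hypothetical remediation helper: view UTF-8 code units as plain char so
// that char-based APIs keep working. Reading any object through a char
// pointer is permitted by the aliasing rules.
template <typename CharT>
const char* as_char(const CharT* s) {
    return reinterpret_cast<const char*>(s);
}

inline void kind_example() {
    assert(literal_kind("text") == "char");
#if defined(__cpp_char8_t)
    assert(literal_kind(u8"text") == "char8_t");  // C++20: distinct type
#else
    assert(literal_kind(u8"text") == "char");     // C++11..17: plain char
#endif
    // Compiles in either mode, unlike `const char* p = u8"text";` in C++20.
    assert(std::string(as_char(u8"text")) == "text");
}
```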
Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.
Correct. We'll be working to improve this situation, but it will take time. codecvt has not been dropped (yet); the <codecvt> header and various UTF converters were deprecated in C++17. std::codecvt suffers from performance and usability issues, so is not considered something we can continue to build on. We believe P1629 is a superior direction.
Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
I asked one of the C++ committee members who was involved in that original effort. He told me that he asked the people working on Unicode at the time if a new type should be added and the response was, "eh, we don't need it".
Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
A different prefix was considered and at one point I briefly favored that approach. However, as mentioned earlier, that would have left us with two ways of spelling UTF-8 literals and related historical baggage. In the long run, it was felt that a breaking change, so long as we had reasonable means to mitigate the breakage, offered more benefits.
With regard to that simple test case, take a minute to think about what that code should do. Then go read this: What is the printf() formatting character for char8_t *?.
Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
char8_t is intentionally and explicitly not an alias (because that has negative performance implications) but is specified to have the same underlying representation as unsigned char. The reason for unsigned char over char is to avoid expressions like u8'\x80' < 0 ever evaluating to true (which may or may not be the case with char today).
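A small sketch of the signedness point, using nothing but the existing character types (the function name compares_negative is just for illustration):

```cpp
#include <cassert>

// Convert the byte value to T and report whether it compares below zero.
template <typename T>
bool compares_negative(unsigned char byte) {
    return static_cast<T>(byte) < 0;
}

inline void sign_example() {
    // An unsigned code unit can never compare below zero.
    assert(!compares_negative<unsigned char>(0x80));
    // With a signed 8-bit type the same bit pattern is -128 on two's
    // complement machines (guaranteed since C++20, implementation-defined
    // before that).
    assert(compares_negative<signed char>(0x80));
    // Plain char goes one way or the other depending on the platform,
    // which is the ambiguity an unsigned UTF-8 code unit type avoids.
}
```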
If the former, is working up the way to eg.: typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available one can look into before writing my own?
I won't comment on whether this approach is a good idea or not, but it has been done before. For example, EASTL has such a typedef (That project also provides a definition of char8_t if the native type isn't available)
What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 for storage only?
I don't think there is one right answer to this question. I've seen projects use unsigned char or provide a char8_t like type via a class.
With regard to your pseudocode, some tweaks to the code in the previously mentioned char8_t-remediation repository to provide unsigned char types instead of char should enable code like the following to work. See the definitions of the _as_char user-defined literals and U8 macro.
typedef std::basic_string<unsigned char> u8string;
u8string u8s(U8("text"));
We have a project where, for historical reasons, string handling is a cacophony of encodings and representations; we definitely have places that can only handle ASCII reliably, some places probably using UTF-8, a few places at the periphery that I suspect to be using platform-specific 8-bit encoding (of course varying between our different target platforms), various places designed to take UCS-2, and maybe also some that would be happy to operate on UTF-16 - all of which are sometimes passed around as C-style strings (char*, CHAR16*) and sometimes as C++ strings (std::string, std::basic_string<CHAR16>). Of course there is very little in terms of documentation.
As a first step towards untangling this mess, I want to set up a type system using genuinely different types for the different encodings.
One idea that crossed my mind was to use e.g. signed char as the basis for ASCII strings and unsigned char for UTF-8 strings, as well as char16_t for UCS-2 and short for UTF-16 (or something along these lines), but that would mean I won't be able to directly use string literals. Also, being able to simply feed ASCII strings to functions expecting UTF-8 (but not vice versa) would be neat.
Do you have any smart suggestions for how to go about this, or maybe even working code?
The code needs to be compatible with C++11.
Please refrain from any answers along the lines of "just use UTF-8 consistently throughout", because that's pretty much my end goal anyway; rather, this is about creating a tool that I think would help me a lot to get there.
-- addendum --
I should probably have mentioned that I presume we already have issues where string encoding doesn't "line up" properly, e.g. UTF-16 strings being passed to functions that can only handle UCS-2 strings, or platform-specific 8-bit strings being passed to functions that expect ASCII strings. Just yesterday I found dedicated conversion functions carrying "ASCII" in their name that de-facto would actually convert to/from Latin-1 instead of ASCII.
I think I'm onto something, at least as far as C++ strings (std::string, std::basic_string<char16_t>) are concerned; there, the key might be to use non-default character traits, like so:
using ASCII = char;
using LATIN1 = char;
using UTF8 = char;
using UCS2 = char16_t;
using UTF16 = char16_t;
class ASCIICharTraits : public std::char_traits<ASCII> {};
class Latin1CharTraits : public std::char_traits<LATIN1> {};
class UTF8CharTraits : public std::char_traits<UTF8> {};
class UCS2CharTraits : public std::char_traits<UCS2> {};
class UTF16CharTraits : public std::char_traits<UTF16> {};
using ASCIIString = std::basic_string<ASCII, ASCIICharTraits>;
using Latin1String = std::basic_string<LATIN1, Latin1CharTraits>;
using UTF8String = std::basic_string<UTF8, UTF8CharTraits>;
using UCS2String = std::basic_string<UCS2, UCS2CharTraits>;
using UTF16String = std::basic_string<UTF16, UTF16CharTraits>;
Using distinct types as the traits parameter to the std::basic_string template ensures that the string types are also treated as distinct types by the compiler, preventing any mixup of incompatibly encoded C++ strings, without having to write a wrapper framework.
Note that for this to work the custom trait types need to be subclassed, not simply aliased. (Theoretically I could write new trait types from scratch, but deriving from std::char_traits makes the job much easier, and should make sure I get binary compatibility, allowing trivial conversions (such as from ASCII to Latin-1 or UTF-8) to be implemented by means of a simple reinterpret_cast.)
(Fun fact: To the best of my knowledge this mechanism should even work with good old C++03, provided the using clauses are replaced with corresponding typedefs.)
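Here is a cut-down, C++11-compatible sketch of the traits idea with just two of the encodings, to show that the compiler really does keep the string types apart. The to_utf8 function is a hypothetical helper; it copies instead of using reinterpret_cast, since casting between distinct basic_string specializations is not something the standard guarantees to work.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Tag-like traits classes: identical behaviour, but distinct types.
struct ASCIICharTraits : std::char_traits<char> {};
struct UTF8CharTraits  : std::char_traits<char> {};

using ASCIIString = std::basic_string<char, ASCIICharTraits>;
using UTF8String  = std::basic_string<char, UTF8CharTraits>;

// The two string types cannot be mixed up by accident.
static_assert(!std::is_same<ASCIIString, UTF8String>::value,
              "string types must be distinct");
static_assert(!std::is_convertible<UTF8String, ASCIIString>::value,
              "UTF-8 must not silently become ASCII");

// ASCII is a subset of UTF-8, so this direction is a plain copy.
inline UTF8String to_utf8(const ASCIIString& s) {
    return UTF8String(s.data(), s.size());
}

inline void traits_example() {
    const ASCIIString ascii("hello");
    const UTF8String utf8 = to_utf8(ascii);
    assert(utf8 == UTF8String("hello"));
}
```

Note that nothing here checks that an ASCIIString actually contains only ASCII; the types only record intent.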
I recommend the standard suggestion: sandwich method.
Internally, use only one data type (the one of your language or, as in this case, of the standard library).
Decode (on input) and encode (on output) only at the outer layers. It should also be clear why you chose a given encoding there. Writing to a file? UTF-8 is good (ASCII is a subset, so keep it as UTF-8). That is also where you do the input validation. Should it be a number? Check that the characters are Unicode digits, and so on. Data validation and decoding (with its own validation) should be kept as close as possible to where the input is read. Apply the same rule to output (but in that case no validation should be needed).
So, for now, you may mark true strings with some prefix (try something unique) and look for the places where you encode/decode. Try to move that encoding to the outer layers. When you have finished, remove the prefix.
You may use other prefixes for the other encodings (just temporarily). Here too, try something unique. Mess with your variable names, not the types.
As an alternative, I think you can annotate variables and use external tools to check that the annotations do not mix. The Linux kernel uses something like that (e.g. to distinguish user-space and kernel pointers). I think it is overkill for your program.
Why the sandwich? By now you probably know a lot about UTF-8, UCS-2, UTF-16, etc. But it took time. The next coworker may not know all these details, and that would cause problems in the long term. We also use integers without worrying about whether they are one's complement, two's complement, or sign-and-magnitude, except when we are writing out data. Do the same for strings. Keep the semantics and forget the encoding inside the program. Only the outer layer must handle it.
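To give an idea of what the two slices of bread could look like, here is a minimal sketch that assumes UTF-8 outside and a std::u32string of code points inside. The function names decode_utf8 and encode_utf8 are hypothetical, and a real project might well prefer an existing library for this.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Input boundary: decode and validate UTF-8 into code points.
// Throws std::invalid_argument on malformed input.
inline std::u32string decode_utf8(const std::string& in) {
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Output boundary: encode code points as UTF-8. Assumes the internal data
// holds valid code points, so no validation is repeated here.
inline std::string encode_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Everything between those two calls then only ever sees code points and never has to care which encoding the file or socket used.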
It's a horrible experience for me to get understanding of unicodes, locales, wide characters and conversion.
I need to read a text file which contains Russian and English, Chinese and Ukrainian characters all at once
My approach is to read the file in byte-chunks, then operate on the chunk, on a separate thread for fast reading. (Link)
This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)
However, I understand that there is no way any character from my multi-lingual file can be represented via 255 combinations, if I stick to char.
For that matter I converted everything into wchar_t and hoped for the best.
I also know about Sys.setlocale(locale = "Russian") (Link) but doesn't it then interpret each character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.
On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save file and re-open with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?
My current understanding is: have everything as wchar_t (double-byte), interpret any file as UTF-16 (double-byte) - is it correct?
Also, I hope to keep the code cross-platform.
Sorry for noob
Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.
Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.
So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:
std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
    std::string utf8;
    f >> utf8;
    // ...
}
So far so good. That all looks easy enough.
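One caveat with the snippet above: formatted extraction with >> stops at the first whitespace, so it only yields the first word. If the whole file is wanted, something along these lines should do. This is a sketch; read_all is a made-up helper name, and it takes any std::istream so it can be tried without a file.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Read every remaining byte of a stream into a std::string, whitespace
// and newlines included. The bytes are not interpreted in any way.
inline std::string read_all(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

inline void read_all_example() {
    std::istringstream formatted("two words\nsecond line");
    std::string first;
    formatted >> first;               // formatted extraction: one token
    assert(first == "two");

    std::istringstream whole("two words\nsecond line");
    assert(read_all(whole) == "two words\nsecond line");
}
```

With a file, one would open it as std::ifstream f("myfile.txt", std::ios::binary) and call read_all(f).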
But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.
So, without further ado, here are your conversion functions (which is what you bought your ticket for):
#include <string>
#include <codecvt>
#include <locale>
std::string narrow (const std::wstring& wide_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (wide_string);
}

std::wstring widen (const std::string& utf8_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (utf8_string);
}
My, that was easy, why did those tickets cost so much in the first place?
I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do, you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet) but just in case there is any lingering confusion, once you do have a wide string you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.
Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.
Notes (added as an edit):
codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
Unfortunately standard c++ does not have any real support for your situation. (e.g. unicode in c++-11)
You will need to use a text-handling library that does support it. Something like this one
The most important question is, what encoding that text file is in. It is most likely not a byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine encoding (could be UTF-8, UTF-16, UTF-32, something-else-entirely), and act appropriately.
wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bits, so that is what they went for. When Unicode was extended to 21 bits, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_t). "The Unixes", on the other hand, made wchar_t 32 bits and use UTF-32 encoding, so...
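A quick sketch of what that difference means in practice, using a character outside the 16-bit range (U+1F600). The last check assumes the two common cases of a 2-byte wchar_t holding UTF-16 and a 4-byte wchar_t holding UTF-32; other platforms may differ.

```cpp
#include <cassert>
#include <string>

inline void width_example() {
    const std::u16string utf16 = u"\U0001F600";
    const std::u32string utf32 = U"\U0001F600";
    assert(utf16.size() == 2);  // surrogate pair: two 16-bit code units
    assert(utf16[0] == 0xD83D && utf16[1] == 0xDE00);
    assert(utf32.size() == 1);  // a single 32-bit code unit

    // wchar_t behaves like one or the other depending on the platform.
    const std::wstring wide = L"\U0001F600";
    assert(wide.size() == (sizeof(wchar_t) == 2 ? 2u : 1u));
}
```

So code that assumes "one wchar_t is one character" holds up on the 32-bit platforms and quietly breaks on the 16-bit ones.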
Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.
C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU.
And here's a second answer - about Microsoft's (lack of) standards compliance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.
Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).
So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):
The Standard says in [basic.fundamental]/5:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
Hmmm. "Among the supported locales." What's that all about?
Well, I for one don't know, and nor, I suspect, does the person who wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.
As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.
The c++ standard defines wchar_t as a type which will support any code point. On linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.
Therefore the only portable way to handle strings is to convert them from native strings to utf-8 on input and from utf-8 to native strings at the point of output.
You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
Non-adherence to standards is the reason we can't have nice things.
I have a for loop that looks at every character in a string, the purpose is to eliminate some characters. For example one comparison that works is...
if(str[i] == '!'){str[i] = NULL;}
I also need to eliminate the upside down question mark. I tried several things including some hex codes and the following.
if(str[i] == 191){str[i] = NULL;}
Here, I get an error that says, "comparison of constant 191 with expression of type 'value_type' is always false." What am I missing here? How can I catch the upside-down question mark?
Your string's value_type is most likely char, which might or might not be signed on your platform.
If it's signed, CHAR_MAX would be 127... you see the problem when comparing that with 191? That is what the compiler is complaining about.
There are several ways around this.
The most roughshod one would be to cast the constant to value_type.
More elegant (but depending on your compiler's features) would be to actually write '¿' in your source and make sure your editor and your compiler agree on the encoding used by the source file.
While the standard only requires support for a subset of the ASCII-7 characters in source (minus backticks, $ and @), implementations are free (and usually quite capable) of supporting other encodings.
For GCC, the option would be -finput-charset=..., which defaults to UTF-8.
All this is, of course, assuming that your source and your input are agreeing on their respective encodings as well. Being on the same codepage, so to speak. ;-)
All that being said, if you're handling international characters in your application, you might want to take a look at the ICU library and full Unicode support.
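For completeness, the cast-based workaround might look roughly like this. It is a sketch that assumes the input uses a one-byte-per-character encoding such as ISO-8859-1, where the inverted question mark is the single byte 0xBF (decimal 191); in UTF-8 it would be two bytes and this would not be enough. The function name strip_marks is made up, and it erases the unwanted characters rather than overwriting them with NULL.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Remove '!' and the Latin-1 inverted question mark (byte 0xBF) from text.
inline std::string strip_marks(std::string text) {
    text.erase(std::remove_if(text.begin(), text.end(),
                              [](char c) {
                                  // Compare as unsigned so 0xBF is reachable
                                  // even where plain char is signed.
                                  const unsigned char u =
                                      static_cast<unsigned char>(c);
                                  return u == '!' || u == 0xBF;
                              }),
               text.end());
    return text;
}

inline void strip_example() {
    assert(strip_marks("\xBFHola!") == "Hola");
}
```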
This is a really long-standing issue in my work, that I realize I still don't have a good solution to...
C naively defined all of its character test functions for an int:
int isspace(int ch);
But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.
And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.
So if you hand isspace(*pchar) you can end up with sign extension problems. They're hard to see, and thence they're hard to guard against in my experience.
Similarly, isspace() and its ilk all take ints, while the actual width of a character is often unknown without string analysis. That means any modern character library should essentially never be carting around chars or wchar_ts but only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character. So I am at a bit of a loss as to how best to approach the issues.
I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...
** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).
So, my question is:
"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues
After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8 bit unit.
Please note:
No matter what size your char_type is, it's wrong for most character encoding schemes.
This problem is in the standard C library, as well as in the C++ standard libraries; which still tries to pass around char and wchar_t, rather than string-iterators in the various isspace, isprint, etc. implementations.
Actually, it's precisely those type of functions that break the genericity of std::string. If it only worked in storage-units, and didn't try to pretend to understand the meaning of the storage-units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...
Thank You
Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.
How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...
Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OSes have a 32-bit wchar_t, which makes them able to correctly handle UTF-32, but that's an implementation detail, is not guaranteed by C++, and is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).
IBM ICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utf8proc
and many more at http://unicode.org/resources/libraries.html.
If the question is less about specific character testing and more about code practices in general: Do whatever your framework does. If you're coding for linux/QT/networking, keep everything internally in UTF-8. If you're coding with Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
I think you are confounding a whole host of unrelated concepts.
First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. Its name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.
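As far as the sign-extension half of the question goes, the usual guard is to go through unsigned char before calling the <cctype> functions, since they require an argument that is either EOF or representable as unsigned char. A minimal sketch (is_space_byte is a made-up name, and this only classifies single bytes in the current C locale, so it does nothing for the variable-width half):

```cpp
#include <cassert>
#include <cctype>

// Safe wrapper: never hands a negative value (other than EOF) to isspace.
inline bool is_space_byte(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

inline void space_example() {
    assert(is_space_byte(' '));
    assert(is_space_byte('\n'));
    assert(!is_space_byte('a'));
}
```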
The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.
The problem is that there are more codepoints than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: The native host interface (command line args, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between them (mbstowcs/wcstombs).
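As a rough illustration of the second route, here is a minimal sketch that widens a multibyte string with std::mbstowcs. The helper name widen is made up for this example. The result depends on the current C locale (see std::setlocale); in the default "C" locale only ASCII input can be relied on to convert.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a multibyte string (in the encoding of the
// current C locale) to a wide string. Returns an empty string on failure.
std::wstring widen(const std::string& bytes)
{
    // A wide string never has more elements than the multibyte input has
    // bytes, so bytes.size() + 1 leaves room for the terminating null.
    std::vector<wchar_t> buf(bytes.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), bytes.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();          // invalid sequence for this locale
    return std::wstring(buf.data(), n);
}
```

Note that this says nothing about which encoding the resulting wchar_t values use; that is still up to the implementation.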
Unfortunately, there is nothing specific about "the system's character set" and "the system's multibyte encoding", so you, like so many SO users before you, are left puzzling what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding that we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 2^21 at the moment). Along with the text encoding comes a family of byte-string encodings, UTF-8, UTF-16 and UTF-32.
The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just as a sequence of raw codepoints (typically UTF-32, since we don't have a useful 21-bit data type).
Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do this. Since we don't know anything about our "system's character set", we also need the library to handle that.
One popular library of choice is iconv(); the typical sequence goes from input multibyte char* via mbstowcs() to a std::wstring or wchar_t* wide string, and then via iconv()'s WCHAR_T-to-UTF32 conversion to a std::u32string or uint32_t* raw Unicode codepoint sequence.
At this point our journey ends. We can now either examine the text codepoint by codepoint (which might be enough to tell if something is a space); or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode codepoint stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and the realm of text processing specialists.
It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). This deals with the extension (by zero-extending). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you do see it, then it's less hard to see.
fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension issue for values from that.
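A minimal sketch of the cast in practice, wrapped so it cannot be forgotten. The names is_space_char and count_spaces are invented for this example, and classification still follows the current C locale only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical wrapper: safe for any char value, including negative ones,
// because the value is converted to unsigned char before the call.
inline bool is_space_char(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Count whitespace bytes in a string, as classified by the current C locale.
std::size_t count_spaces(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (is_space_char(s[i]))
            ++n;
    return n;
}
```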
That's trivia really, though, since the standard character macros don't cover Unicode, or multi-byte encodings. If you want to handle Unicode properly then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string which sounds promising. Prior to that the answer is to use something implementation-specific or third-party. (Un)fortunately there are a lot of libraries to choose from.
It may be (I speculate) that a "complete" Unicode classification database is so large and so subject to change that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to an extent what operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly-moving target, so it's always going to be implementation-defined what code points the system knows about.
In general, there are three correct ways to handle Unicode text:
At all I/O (including system calls that return or accept strings), convert everything between an externally-used character encoding, and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to/from a byte stream, then you wouldn't mix up byte stream with the objects, or examine sections of byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on implementation. Just pretend the standard library doesn't provide strings, if it helps, or use a std::basic_string of something big as the container but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and such like, since even in a fixed-width Unicode encoding, there may be more than one code point per glyph.
Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, and hence although it's potentially correct, in practice it might just as easily come out wrong.
(Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks, and concatenate strings for output. Basically it works for programs where you don't need to understand anything with the top bit set, just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but actually are complex. Like collation.
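To make the third option concrete, here is a small sketch of splitting UTF-8 text on an ASCII delimiter without decoding anything. It relies on a property of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, so a byte-wise search for ',' cannot cut a character in half. The function name split_ascii is an assumption of this example, and it is only valid for delimiters in the ASCII range.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text on an ASCII delimiter (delim < 0x80).
// Bytes with the top bit set are passed through unchanged.
std::vector<std::string> split_ascii(const std::string& utf8, char delim)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = utf8.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(utf8.substr(start));   // last field
            break;
        }
        fields.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```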
One comment up front: the old C functions like isspace took int for
a reason: they support EOF as input as well, so they need to be able
to support one more value than will fit in a char. The
“naïve” decision was allowing char to be signed—but
making it unsigned would have had severe performance implications on a
PDP-11.
Now to your questions:
1) Sign expansion
The C++ functions don't have this problem. In C++, the
“correct” way of testing things like whether a character is
a space is to grab the std::ctype facet from whatever locale you want,
and to use it. Of course, the C++ localization, in <locale>, has
been carefully designed to make it as hard as possible to use, but if
you're doing any significant text processing, you'll soon come up with
your own convenience wrappers: a functional object which takes a locale
and mask specifying which characteristic you want to test isn't hard.
Making it a template on the mask, and giving its locale argument a
default to the global locale isn't rocket science either. Throw in a
few typedefs, and you can pass things like IsSpace() to std::find_if.
The only subtlety is managing the lifetime of the std::ctype object
you're dealing with. Something like the following should work, however:
#include <locale>

template<std::ctype_base::mask mask>
class Is  // Must find a better name.
{
    std::locale myLocale;
        //< Needed to ensure no premature destruction of facet
    std::ctype<char> const* myCType;
public:
    Is( std::locale const& l = std::locale() )
        : myLocale( l )
        // use_facet returns a reference, so take its address
        , myCType( &std::use_facet<std::ctype<char> >( myLocale ) )
    {
    }
    bool operator()( char ch ) const
    {
        return myCType->is( mask, ch );
    }
};
typedef Is<std::ctype_base::space> IsSpace;
//  ...
(Given the influence of the STL, it's somewhat surprising that the
standard didn't define something like the above as standard.)
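For a one-off test, the facet can also be used directly, without a wrapper class. The sketch below assumes C++11 (for the lambda), and the name first_space_index is made up for this example. The facet reference stays valid here because the locale it came from outlives the call.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: index of the first whitespace char in s according to
// loc, or std::string::npos if there is none.
std::string::size_type first_space_index(const std::string& s,
                                         const std::locale& loc = std::locale())
{
    // The facet is owned by the locale; loc is alive for the whole call.
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    std::string::const_iterator it = std::find_if(
        s.begin(), s.end(),
        [&ct](char ch) { return ct.is(std::ctype_base::space, ch); });
    return it == s.end()
        ? std::string::npos
        : static_cast<std::string::size_type>(it - s.begin());
}
```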
2) Variable width character issues.
There is no real answer. It all depends on what you need. For some
applications, just looking for a few specific single byte characters is
sufficient, and keeping everything in UTF-8, and ignoring the multi-byte
issues, is a viable (and simple) solution. Beyond that, it's often
useful to convert to UTF-32 (or depending on the type of text you're
dealing with, UTF-16), and use each element as a single code point. For
full text handling, on the other hand, you have to deal with
multi-code-point characters even if you're using UTF-32: the sequence
\u006D\u0302 is a single character (a small m with a circumflex over
it).
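To see the three different "lengths" of that example, here is a small sketch that counts code points in a UTF-8 string by skipping continuation bytes. The name count_code_points is an assumption of this example, and the input is assumed to be well-formed UTF-8. For "m" followed by U+0302 it sees 3 bytes and 2 code points, while a reader sees 1 character; getting that last number needs grapheme segmentation from a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (bit pattern 10xxxxxx). Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}
```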
I haven't tested the internationalization capabilities of the Qt library much, but from what I know, QString is fully Unicode-aware and uses QChars, which are Unicode characters. I don't know their internal implementation, but I expect this implies that QChars are variable-size characters.
It would be weird to bind yourself to such big framework as Qt just to use strings though.
You seem to be confusing a function defined on 7-bit ASCII with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign extension, because the numbers these functions are defined on have no 8th bit. Providing a byte with this possibility is a mistake on your part.
Plan 9 attempts to solve this with a UTF library, and the assumption that all input data is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, but allows new programs to be written correctly.
The common notion in C, even now, is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter (or symbol, or code point), so one can finally define a function isspacerune(), which would tell you which letters are spaces.
Work with arrays of Rune as you would with char arrays, to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write it out.
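Outside Plan 9, the re-encoding step can be sketched in portable C++11 along these lines. This is not the Plan 9 API: append_utf8 and to_utf8 are names invented for this example, standing in for the role runetochar() plays, and replacing invalid values with U+FFFD is just one possible policy.

```cpp
#include <string>

// Hypothetical stand-in for the role of runetochar(): append the UTF-8
// encoding of one code point to out. Returns false (appending nothing) for
// surrogates and for values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                    // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;                        // beyond the Unicode range
    }
    return true;
}

// Encode a whole code point sequence; invalid values become U+FFFD.
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (std::u32string::size_type i = 0; i < text.size(); ++i)
        if (!append_utf8(out, text[i]))
            out += "\xEF\xBF\xBD";           // U+FFFD REPLACEMENT CHARACTER
    return out;
}
```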
The sign extension issue is easy to deal with. You can either use:
isspace((unsigned char) ch)
isspace(ch & 0xFF)
the compiler option that makes char an unsigned type
As for the variable-length character issue (I'm assuming UTF-8), it depends on your needs.
If you just need to deal with the ASCII whitespace characters \t\n\v\f\r, then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.
But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of
bool isspace_utf8(const char* pChar)
{
    // Pass the pointer, not *pChar: a code point may span several bytes.
    uint32_t codePoint = decode_char(pChar);
    return is_unicode_space(codePoint);
}
Where decode_char converts a UTF-8 sequence to the corresponding Unicode code point, and is_unicode_space returns true for characters with category Z or for the Cc characters that are spaces. iswspace may or may not help with the latter, depending on how well your C++ library supports Unicode. It's best to use a dedicated Unicode library for the job.
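One possible way to fill in the two helpers, as a sketch only. It assumes a null-terminated UTF-8 string, returns a made-up sentinel (kInvalid) for malformed input, and does not reject overlong forms or surrogates. The set of spaces is copied from the list above; the authoritative set changes between Unicode versions, which is another reason to prefer a real Unicode library.

```cpp
#include <cstdint>

// Hypothetical sentinel for malformed or truncated sequences.
const std::uint32_t kInvalid = 0xFFFFFFFFu;

// Decode the UTF-8 sequence starting at p (null-terminated input assumed).
// Simplified: overlong encodings and surrogates are not rejected.
std::uint32_t decode_char(const char* p)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(p);
    std::uint32_t cp;
    int extra;
    if (s[0] < 0x80)                { return s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else                            { return kInvalid; }
    for (int i = 1; i <= extra; ++i) {
        // The terminating null fails this check, so we never read past it.
        if ((s[i] & 0xC0) != 0x80)
            return kInvalid;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}

// Space test over the code points listed above, plus ASCII whitespace.
bool is_unicode_space(std::uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
           cp == 0xA0 || cp == 0x1680 || cp == 0x180E ||
           (cp >= 0x2000 && cp <= 0x200A) || cp == 0x2028 || cp == 0x2029 ||
           cp == 0x202F || cp == 0x205F || cp == 0x3000;
}

bool isspace_utf8(const char* pChar)
{
    return is_unicode_space(decode_char(pChar));
}
```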
most strings in practice use a multibyte encoding such as UTF-7,
UTF-8, UTF-16, SHIFT-JIS, etc.
No programmer would use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with UTF-8, -16, or -32, and only convert as needed.
Your preamble argument is somewhat inaccurate, and arguably unfair: it is simply not in the library's design to support Unicode encodings, and certainly not multiple Unicode encodings.
Development of the C and C++ languages and much of their libraries predates the development of Unicode. Also, as systems-level languages, they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately, perhaps, the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. History has perhaps shown this to be flawed, but changing the language definition and indeed the library would break a large amount of legacy code, so such things are left to newer languages such as C#, which has an 8-bit byte and a distinct char type.
Moreover the variable encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this since you suggest that Unicode character operations should be performed on strings rather than machine word types. This would require library support and as you point out this is not provided by the standard library. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not generally universally supported by all target platforms from the deeply embedded to the super-computer. All such things must be provided by either system or third-party libraries.
Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue not a system issue.
"How do you test for whitespace, isprintable, etc., in a way that
doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues?"
isspace() considers only the lower 8 bits. Its definition explicitly states that if you pass an argument that is neither representable as an unsigned char nor equal to the value of the macro EOF, the behavior is undefined. The problem does not arise if it is used as it was intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.
After all, all commonly used Unicode encodings are variable-width,
whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well
as older standards such as Shift-JIS
isspace() is not defined for Unicode. You'll need a library designed for the specific encoding you are using. The question "What is the best Unicode library for C?" may be relevant.