(Encoded) String handling in C++ - questions / best practices? - c++

What are the best practices for handling strings in C++? I'm wondering especially how to handle the following cases:
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
How to handle char* strings. After all, this can be unsigned or not, and I wonder how I determine what encoding they use (ANSI?), and how to convert to UTF-8? Is there any recommended reading on this, where the basic guarantees of C/C++ about strings are documented?
String algorithms for UTF-8 etc. strings -- computing the length, parsing, etc. How is this done best?
What character type is really portable? I've learned that wchar_t can be anything from 8-32 bit wide, making it no good choice if I want to be consistent across platforms (especially when moving data between different platforms - this seems to be a problem, as described for example in EASTL, look at item #13)
At the moment, I'm using std::string everywhere, with a small helper utility to convert to UTF-16 when calling Unicode-APIs, but I'm pretty sure that this is not really the best way. Using something like Qt's QString or the ICU String class seems to be right, but I wonder whether there is a more lightweight approach (i.e. if my char strings are ANSI encoded, and the subset of ANSI that is used is equal to UTF-8, then I can easily treat the data as UTF-8 and provide converters from/to UTF-8, and I'm done, as I can store it in std::string, unless there are problems with this approach).

For a shorter answer, I would just recommend using UTF-16 for simplicity; Java/C#/Python 3.0 switched to that model exactly for simplicity.
I've always expected wchar_t to be 16 or 32bit wide, and many platforms support that; indeed, APIs like wcrtomb() do not allow an implementation to support a shift state for wchar_t*, but since UTF-8 needs none, it may be used, while other encodings are ruled out.
Then, I answer the question about XML.
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
I'm not sure, but I don't think so.
Mixing two encodings in the same file is asking for trouble and data corruption.
Encoding a file in UTF-16 is usually a bad choice since most programs rely on using ASCII everywhere.
The issue is: an XML file might use any single encoding, maybe even UTF-16, but then also the initial encoding declaration has to use UTF-16, and even the tags then. The problem I see with UTF-16 is: how should one reliably parse the initial declaration? The answer comes in the specification, § 4.3.3:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
When reading that, note that also an XML file is an entity, called the document entity; in general, an entity is a storage unit for the document. From the whole specification, I'd say that only one encoding declaration is allowed for each entity, and I'd convert all entities to UTF-16 when reading them for easier handling.
Webography:
http://www.w3.org/TR/REC-xml/, XML spec.
http://www.xml.com/axml/testaxml.htm, Annotated XML spec.
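The rule quoted from § 4.3.3 means a reader can often pick a decoder by looking at the first few bytes before parsing anything. Here is a minimal sketch of that first step, checking only for a byte order mark. The enum `Bom` and the function `detect_bom` are hypothetical names for illustration, and a real XML parser does more than this (the spec's non-normative Appendix F also covers files without a BOM).

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; the names are illustrative only.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the first bytes of a buffer for a Unicode byte order mark.
Bom detect_bom(const std::string& bytes)
{
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    const std::size_t n = bytes.size();

    // UTF-32 is checked before UTF-16 because FF FE 00 00 starts with FF FE.
    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00)
        return Bom::Utf32LE;
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF)
        return Bom::Utf32BE;
    if (n >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;   // no BOM: per the quoted rule, assume UTF-8 unless declared otherwise
}
```

If no BOM is found, the encoding declaration itself still has to be read, which works because its characters are ASCII in most encodings.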

String algorithms for UTF-8 etc. strings -- computing the length, parsing, etc. How is this done best?
mbrlen gives you the length in bytes of the next multibyte character, so calling it repeatedly lets you count the characters in a C string. std::string can store a multibyte string, but its size() and indexing count bytes rather than characters; for wide strings you should use wstring.
In general, you should probably stick with UTF-16 inside your program and use UTF-8 only on I/O (I don't know well other options, but they are surely more complex and error-prone).
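For UTF-8 specifically, computing the length in code points does not need any locale machinery. Here is a minimal sketch, assuming the bytes are valid UTF-8 and are held in a std::string used purely as a byte container; `utf8_length` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Count code points in a string that is assumed to be valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining mark still counts as two.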
How to handle char* strings. After all, this can be unsigned or not, and I wonder how I determine what encoding they use (ANSI?), and how to convert to UTF-8? Is there any recommended reading on this, where the basic guarantees of C/C++ about strings are documented?
Basically, you can use any encoding, and you will usually end up using the native encoding of the system you are running on, as long as it's an 8-bit encoding. C was born for ASCII, and locale handling was an afterthought. For years, each system understood mostly one native encoding, say ISO-8859-x, and files from another encoding could even be non-representable.
Since for UTF-8 strings one byte is not always one character, I guess that the safest bet is to treat them as multibyte strings. The C manuals I used described multibyte strings in the abstract, without details on those issues (in particular, on the encoding used). For C, see functions like mbrlen and mbrtowc. On my Linux system, it is noted that their behaviour depends on LC_CTYPE, and this probably means that they assume the native encoding for multibyte strings. From the documentation it can be inferred that their API also supports encodings where you can shift from one-byte to two-byte characters and back.
How to handle char* strings. After all, this can be unsigned or not,
If you rely on signedness of char, you're doing it wrong. Signedness of chars only matters if you use char as a numeric type, and then you should always use either unsigned or signed chars; in fact, you should pretend that plain char is neither unsigned nor signed, and that an expression like a > 0 (if a is a char) has undefined semantics. But what would it be useful for, anyway?

Related

Read multi-language file - wchar_t vs char?

It's a horrible experience for me to get understanding of unicodes, locales, wide characters and conversion.
I need to read a text file which contains Russian and English, Chinese and Ukrainian characters all at once
My approach is to read the file in byte-chunks, then operate on the chunk, on a separate thread for fast reading. (Link)
This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)
However, I understand that there is no way any character from my multi-lingual file can be represented via 255 combinations, if I stick to char.
For that matter I converted everything into wchar_t and hoped for the best.
I also know about Sys.setlocale(locale = "Russian") (Link) but doesn't it then interpret each character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.
On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save file and re-open with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?
My current understanding is: have everything as wchar_t (double-byte), interpret any file as UTF-16 (double-byte) - is it correct?
Also, I hope to keep the code cross-platform.
Sorry for noob
Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.
Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.
So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:
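#include <fstream>
#include <string>

std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
    std::string utf8;
    std::getline (f, utf8);    // one line of raw UTF-8 bytes (f >> utf8 would stop at the first whitespace)
    // ...
}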
So far so good. That all looks easy enough.
But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.
So, without further ado, here are your conversion functions (which is what you bought your ticket for):
#include <string>
#include <codecvt>
#include <locale>

// Wide string -> UTF-8 encoded narrow string
std::string narrow (const std::wstring& wide_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (wide_string);
}

// UTF-8 encoded narrow string -> wide string
std::wstring widen (const std::string& utf8_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (utf8_string);
}
My, that was easy, why did those tickets cost so much in the first place?
I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do, you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet) but just in case there is any lingering confusion, once you do have a wide string you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.
Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.
Notes (added as an edit):
codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
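Since the notes mention both the deprecation and std::u32string, here is a minimal sketch of a hand-written UTF-8 to UTF-32 conversion that avoids the deprecated facilities. `utf8_to_utf32` is a hypothetical name, not a standard function, and the validation is deliberately simplified (overlong forms and surrogate code points are not rejected), so treat it as an illustration rather than a hardened decoder.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 into UTF-32 code points. Throws std::runtime_error on
// truncated or malformed sequences. Simplified: overlong forms and
// surrogate code points are not rejected.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;                 // number of continuation bytes
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Because char32_t is at least 32 bits on every platform, the result holds one element per code point regardless of how wide wchar_t happens to be.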
Unfortunately standard c++ does not have any real support for your situation. (e.g. unicode in c++-11)
You will need to use a text-handling library that does support it. Something like this one
The most important question is, what encoding that text file is in. It is most likely not a byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine encoding (could be UTF-8, UTF-16, UTF-32, something-else-entirely), and act appropriately.
wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bit, so that is what they went for. When Unicode was extended to 21 bit, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_t). "The Unixes", on the other hand, made wchar_t 32 bit and use UTF-32 encoding, so...
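To make the "breaks the wide nature" point concrete: with a 16-bit unit, any code point above U+FFFF has to be split into a surrogate pair, so one element no longer equals one character. A minimal sketch of the UTF-16 encoding step follows; `to_utf16` is a hypothetical helper name, and it assumes its argument is a valid Unicode scalar value.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16. Assumes cp is a valid
// scalar value (<= 0x10FFFF and not itself a surrogate).
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));                     // fits in one unit
    } else {
        cp -= 0x10000;                                                // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+041F (Cyrillic П) yields a single unit, while U+1F600 yields the two units 0xD83D 0xDE00.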
Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.
C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU.
And here's a second answer - about Microsoft's (lack of) standards compliance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.
Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).
So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):
The Standard says in [basic.fundamental]/5:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
Hmmm. "Among the supported locales." What's that all about?
Well, I for one don't know, and nor, I suspect, does the person that wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.
As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.
The c++ standard defines wchar_t as a type which will support any code point. On linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.
Therefore the only portable way to handle strings is to convert them from native strings to utf-8 on input and from utf-8 to native strings at the point of output.
You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
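As an illustration of that #ifdef selection, here is a minimal sketch of one boundary function: opening a file whose path is held as UTF-8. `open_utf8` is a hypothetical helper name. The Windows branch uses MultiByteToWideChar and _wfopen and is an untested sketch, and the non-Windows branch assumes the system accepts UTF-8 paths directly, which is typical on Linux but not guaranteed.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: open a file whose path is given as UTF-8.
std::FILE* open_utf8(const std::string& path, const char* mode)
{
#ifdef _WIN32
    // Convert UTF-8 to UTF-16 for the wide-character Windows API.
    auto to_wide = [](const std::string& s) {
        if (s.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                    static_cast<int>(s.size()), nullptr, 0);
        std::wstring w(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(),
                            static_cast<int>(s.size()), &w[0], n);
        return w;
    };
    return _wfopen(to_wide(path).c_str(), to_wide(mode).c_str());
#else
    // Assumption: the platform takes UTF-8 byte strings as paths.
    return std::fopen(path.c_str(), mode);
#endif
}
```

The rest of the program then only ever sees UTF-8 in std::string, and the platform difference stays inside this one function.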
Non-adherence to standards is the reason we can't have nice things.

C++: String with multiple languages

This is my first attempt at dealing with multiple languages in a program. I would really appreciate if someone could provide me with some study material and how to approach this type of issue.
The question is representing a string which has multiple languages. For example, think of a string that has "Hello" in many languages, all comma separated. What I want to do is to separate these words. So my questions are:
Can I use std::string for this or should I use std::wstring?
If I want to tokenize each of the words in the string and put them in to a char*, should I use wchar? But some encodings, such as UTF, can be bigger than what wchar can support.
Overall, what is the 'accepted' way of handling this type of case?
Thank you.
Can I use std::string for this or should I use std::wstring?
Both can be used. If you use std::string, the encoding should be UTF-8 so as to avoid null-bytes which you'd get if you were to use UTF-16, UCS-2 etc. If you use std::wstring, you can also use encodings that require larger numbers to represent the individual characters, i.e. UCS-2 and UCS-4 will typically be fine, but strictly speaking this is implementation-dependent. In C++11, there is also std::u16string (good for UTF-16 and UCS-2) and std::u32string (good for UCS-4).
So, which of these types to use depends on which encoding you prefer, not on the number or type of languages you want to represent.
As a rule of thumb, UTF-8 is great for storage of large texts, while UCS-4 is best if memory footprint does not matter so much, but you want character-level iterations and position-arithmetic to be convenient and fast. (Example: Skipping n characters in an UTF-8 string is an O(n) operation, while it is an O(1) operation in UCS-4.)
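The O(n) cost of skipping in UTF-8 comes from having to look at every byte on the way. Here is a minimal sketch, assuming valid UTF-8 input; `utf8_offset` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Return the byte offset of the n-th code point (0-based) in a string
// assumed to hold valid UTF-8, or s.size() if there are fewer than n+1
// code points. This walks the bytes, so the cost grows with n.
std::size_t utf8_offset(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    while (i < s.size()) {
        // Any byte that is not a 10xxxxxx continuation byte starts a code point.
        bool is_lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (is_lead) {
            if (n == 0) return i;
            --n;
        }
        ++i;
    }
    return s.size();
}

// With a std::u32string the same lookup is plain indexing: text[n].
```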
If I want to tokenize each of the words in the string and put them in to a char*, should I use wchar? But some encodings, such as UTF, can be bigger than what wchar can support.
I would use the same data type for the words as I would use for the text itself. I.e. words of a std::string text should also be std::string, and words from a std::wstring should be std::wstring.
(If there is really a good reason to switch from a string-datatype to a character-pointer datatype, of course char* is right for std::string and wchar_t* is right for std::wstring. Similarly for the C++11 types, there is char16_t* and char32_t*.)
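For the comma-separated case from the question, here is a minimal sketch that keeps both the text and the words as std::string holding UTF-8. It relies on a property of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, so searching for an ASCII delimiter cannot cut a character in half. `split_utf8` is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Split UTF-8 text on an ASCII delimiter. The delimiter must be an ASCII
// character (< 0x80) for the split to be safe on multi-byte text.
std::vector<std::string> split_utf8(const std::string& text, char delimiter = ',')
{
    std::vector<std::string> words;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type end = text.find(delimiter, start);
        if (end == std::string::npos) {
            words.push_back(text.substr(start));   // last (or only) token
            break;
        }
        words.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return words;
}
```

The same approach would not be safe for UTF-16 data stored in a byte-oriented string, since the byte value of an ASCII comma can appear as one half of another character's code unit.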
Overall, what is the 'accepted' way of handling this type of case?
The first question you need to answer to yourself is which encoding you want to use for storage and processing. In highly international settings, only Unicode encodings are truly eligible, but there are still more than one to choose from: UTF-8, UCS-2 and UCS-4 are the most common ones. As described above, which one you choose has implications for memory footprint and processing speed, so think carefully about what types of operations you need to perform. It may be required to convert from one encoding to another at certain points in your program for optimal space and time behavior. Once you know which encoding you want to use in each part of the program, choose the data type accordingly.
Once encoding and data types have been decided, you might also need to look into Unicode normalization. In many languages, the same character (or character/diacritics combination) can be represented by more than one sequence of Unicode code points (esp. when combining characters are used). To deal with these cases properly, you may need to apply Unicode normalizations (such as NFKC) to the strings. Note that there is no built-in support for this in the C++ Standard Library.
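A small demonstration of why normalization matters, using only the standard library. The two strings below are different code point sequences for the same visible character, and a plain comparison treats them as different. The names are hypothetical and the example only shows the problem; actually normalizing requires a Unicode library.

```cpp
#include <string>

// "é" as one precomposed code point (its NFC form) ...
const std::u32string precomposed = U"\u00E9";
// ... and as "e" followed by U+0301 COMBINING ACUTE ACCENT (its NFD form).
const std::u32string decomposed = U"e\u0301";

// A plain comparison looks only at code points, so these two spellings
// of the same visible character do not compare equal.
bool same_without_normalization()
{
    return precomposed == decomposed;
}
```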

How do you cope with signed char -> int issues with standard library?

This is a really long-standing issue in my work, that I realize I still don't have a good solution to...
C naively defined all of its character test functions for an int:
int isspace(int ch);
But chars are often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.
And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.
So if you hand isspace(*pchar) you can end up with sign extension problems. They're hard to see, and thence they're hard to guard against in my experience.
Similarly, because isspace() and its ilk all take ints, and because the actual width of a character is often unknown w/o string-analysis - meaning that any modern character library should essentially never be carting around char's or wchar_t's but only pointers/iterators, since only by analyzing the character stream can you know how much of it composes a single logical character, I am at a bit of a loss as to how best to approach the issues?
I keep expecting a genuinely robust library based around abstracting away the size-factor of any character, and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another simpler solution staring me in the face that all of you (who know what you're doing) use...
** These issues don't come up for fixed-sized character-encodings that can wholly contain a full character - UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).
So, my question is:
"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues
After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8 bit unit.
Please note:
No matter what size your char_type is, it's wrong for most character encoding schemes.
This problem is in the standard C library, as well as in the C++ standard libraries; which still tries to pass around char and wchar_t, rather than string-iterators in the various isspace, isprint, etc. implementations.
Actually, it's precisely those type of functions that break the genericity of std::string. If it only worked in storage-units, and didn't try to pretend to understand the meaning of the storage-units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...
Thank You
Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.
How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...
Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that C++03 standard library is not. The C++11 library is improved, but still not quite good enough for most usages. Yes, some OS' have a 32-bit wchar_t which makes them able to correctly handle UTF32, but that's an implementation, and is not guaranteed by C++, and is not remotely sufficient for many unicode tasks, such as iterating over Graphemes (letters).
IBM ICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utf8proc
and many more at http://unicode.org/resources/libraries.html.
If the question is less about specific character testing and more about code practices in general: Do whatever your framework does. If you're coding for linux/QT/networking, keep everything internally in UTF-8. If you're coding with Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what, you have to translate for some OS or other anyway.
I think you are confounding a whole host of unrelated concepts.
First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. Its name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.
The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.
The problem is that there are more codepoints than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: The native host interface (command line args, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between them (mbstowcs/wcstombs).
Unfortunately, there is nothing specific about "the system's character set" and "the system's multibyte encoding", so you, like so many SO users before you, are left puzzling what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding that we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 2^21 at the moment). Along with the text encoding comes a family of byte-string encodings, UTF-8, UTF-16 and UTF-32.
The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just as a sequence of raw codepoints (typically UTF-32, since we don't have a useful 21-bit data type).
Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do this. Since we don't know anything about our "system's character set", we also need the library to handle that.
One popular library of choice is iconv(); the typical sequence goes from input multibyte char* via mbstowcs() to a std::wstring or wchar_t* wide string, and then via iconv()'s WCHAR_T-to-UTF32 conversion to a std::u32string or uint32_t* raw Unicode codepoint sequence.
At this point our journey ends. We can now either examine the text codepoint by codepoint (which might be enough to tell if something is a space); or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode codepoint stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and the realm of text processing specialists.
It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). This deals with the extension (by zero-extending). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you do see it, then it's less hard to see.
fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension issue for values from that.
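The cast is easy to forget at each call site, so one option is to put it in a wrapper once. This is a minimal sketch; `is_space` and `count_space_bytes` are hypothetical helper names, and the result still depends on the current C locale, as std::isspace itself does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Wrapper that converts to unsigned char before calling std::isspace,
// so bytes >= 0x80 never reach it as negative ints.
inline bool is_space(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Count whitespace bytes in a string using the safe wrapper.
std::size_t count_space_bytes(const std::string& s)
{
    std::size_t n = 0;
    for (char c : s)
        if (is_space(c))
            ++n;
    return n;
}
```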
That's trivia really, though, since the standard character macros don't cover Unicode, or multi-byte encodings. If you want to handle Unicode properly then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string which sounds promising. Prior to that the answer is to use something implementation-specific or third-party. (Un)fortunately there are a lot of libraries to choose from.
It may be (I speculate) that a "complete" Unicode classification database is so large and so subject to change that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to an extent what operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly-moving target, so it's always going to be implementation-defined what code points the system knows about.
In general, there are three correct ways to handle Unicode text:
1) At all I/O (including system calls that return or accept strings), convert everything between an externally-used character encoding, and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to/from a byte stream, then you wouldn't mix up byte stream with the objects, or examine sections of byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on implementation. Just pretend the standard library doesn't provide strings, if it helps, or use a std::basic_string of something big as the container but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and such like, since even in a fixed-width Unicode encoding, there may be more than one code point per glyph.
2) Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, and hence although it's potentially correct, in practice it might just as easily come out wrong.
3) (Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks, and concatenate strings for output. Basically it works for programs where you don't need to understand anything with the top bit set, just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but actually are complex. Like collation.
One comment up front: the old C functions like isspace took int for
a reason: they support EOF as input as well, so they need to be able
to support one more value than will fit in a char. The
“naïve” decision was allowing char to be signed—but
making it unsigned would have had severe performance implications on a
PDP-11.
Now to your questions:
1) Sign expansion
The C++ functions don't have this problem. In C++, the
“correct” way of testing things like whether a character is
a space is to grab the std::ctype facet from whatever locale you want,
and to use it. Of course, the C++ localization, in <locale>, has
been carefully designed to make it as hard as possible to use, but if
you're doing any significant text processing, you'll soon come up with
your own convenience wrappers: a functional object which takes a locale
and mask specifying which characteristic you want to test isn't hard.
Making it a template on the mask, and giving its locale argument a
default to the global locale isn't rocket science either. Throw in a
few typedef's, and you can pass things like IsSpace() to std::find.
The only subtlety is managing the lifetime of the std::ctype object
you're dealing with. Something like the following should work, however:
#include <locale>

template<std::ctype_base::mask mask>
class Is  // Must find a better name.
{
    std::locale myLocale;
        //< Needed to ensure no premature destruction of facet
    std::ctype<char> const* myCType;
public:
    Is( std::locale const& l = std::locale() )
        : myLocale( l )
        , myCType( &std::use_facet<std::ctype<char> >( myLocale ) )  // use_facet returns a reference
    {
    }
    bool operator()( char ch ) const
    {
        return myCType->is( mask, ch );
    }
};
typedef Is<std::ctype_base::space> IsSpace;
//  ...
(Given the influence of the STL, it's somewhat surprising that the
standard didn't define something like the above as standard.)
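For occasional use, a lighter alternative to a dedicated functor is the two-argument std::isspace template from <locale>, which takes the character type directly and so sidesteps the int conversion. Here is a minimal sketch with a lambda; `count_whitespace` is a hypothetical helper name, and this form looks up the facet on every call, so the functor approach is likely preferable in hot loops.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>

// Count characters classified as whitespace by the given locale,
// using the <locale> overload std::isspace(charT, const std::locale&).
std::size_t count_whitespace(const std::string& s,
                             const std::locale& loc = std::locale())
{
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(),
                      [&loc](char ch) { return std::isspace(ch, loc); }));
}
```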
2) Variable width character issues.
There is no real answer. It all depends on what you need. For some
applications, just looking for a few specific single byte characters is
sufficient, and keeping everything in UTF-8, and ignoring the multi-byte
issues, is a viable (and simple) solution. Beyond that, it's often
useful to convert to UTF-32 (or depending on the type of text you're
dealing with, UTF-16), and use each element as a single code point. For
full text handling, on the other hand, you have to deal with
multi-code-point characters even if you're using UTF-32: the sequence
\u006D\u0302 is a single character (a small m with a circumflex over
it).
I haven't tested the internationalization capabilities of the Qt library much, but from what I know, QString is fully Unicode-aware and uses QChars, which are Unicode chars. I don't know the internal implementation of those, but I expect that this implies QChars to be variable-size characters.
It would be weird to bind yourself to such big framework as Qt just to use strings though.
You seem to be confusing a function defined on 7-bit ASCII with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign extension, because the numbers these functions are defined on have no 8th bit. Passing a byte that has one is a mistake on your part.
Plan 9 attempts to solve this with a UTF library, and the assumption that all input data is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, but allows new programs to be written correctly.
The common notion in C, even now, is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter (or symbol, or code point), so one can finally define a function isspacerune(), which tells you which letters are spaces.
Work with arrays of Rune as you would with char arrays, to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write it out.
The sign extension issue is easy to deal with. You can either use:
isspace((unsigned char) ch)
isspace(ch & 0xFF)
the compiler option that makes char an unsigned type
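The first option can be wrapped once so nobody forgets the cast. A minimal sketch; is_space_byte is a hypothetical name, not a standard function:

```cpp
#include <cctype>

// Convert to unsigned char before calling std::isspace, so that bytes
// >= 0x80 never reach it as negative ints (which would be undefined).
inline bool is_space_byte(char ch)
{
    return std::isspace(static_cast<unsigned char>(ch)) != 0;
}
```

In the default "C" locale this reports true only for the ASCII whitespace characters, whether char is signed or unsigned on your platform.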
As for the variable-length character issue (I'm assuming UTF-8), it depends on your needs.
If you just need to deal with the ASCII whitespace characters (space and \t\n\v\f\r), then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.
But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of
bool isspace_utf8(const char* pChar)
{
    // Pass the pointer, not a single byte: a code point may span several bytes.
    uint32_t codePoint = decode_char(pChar);
    return is_unicode_space(codePoint);
}
Where decode_char converts a UTF-8 sequence to the corresponding Unicode code point, and is_unicode_space returns true for characters with category Z or for the Cc characters that are spaces. iswspace may or may not help with the latter, depending on how well your C++ library supports Unicode. It's best to use a dedicated Unicode library for the job.
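For illustration only, here is one way the two helpers could look. This is a sketch, not a replacement for a Unicode library: decode_char and is_unicode_space are the hypothetical names from the snippet above, the decoder assumes NUL-terminated input and does not reject overlong forms or surrogates, and the space table is simply the list quoted in this answer.

```cpp
#include <cstdint>

// Decode the UTF-8 sequence starting at p into a code point.
// Returns U+FFFD for an invalid lead or continuation byte.
inline std::uint32_t decode_char(const char* p)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(p);
    std::uint32_t cp;
    int extra;
    if (s[0] < 0x80)                { return s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else                            { return 0xFFFD; }
    for (int i = 1; i <= extra; ++i) {
        // A NUL terminator fails this check, so we never read past it.
        if ((s[i] & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}

// Space test using the code points listed in the answer above.
inline bool is_unicode_space(std::uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85
        || cp == 0xA0 || cp == 0x1680 || cp == 0x180E
        || (cp >= 0x2000 && cp <= 0x200A)
        || cp == 0x2028 || cp == 0x2029 || cp == 0x202F
        || cp == 0x205F || cp == 0x3000;
}

inline bool isspace_utf8(const char* pChar)
{
    return is_unicode_space(decode_char(pChar));
}
```

With this, isspace_utf8("\xC2\xA0") (a no-break space) is true, while isspace_utf8("a") is false.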
most strings in practice use a multibyte encoding such as UTF-7,
UTF-8, UTF-16, SHIFT-JIS, etc.
No programmer would use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with UTF-8, -16, or -32, and only convert as needed.
Your preamble is somewhat inaccurate, and arguably unfair: it is simply not in the library's design to support Unicode encodings, and certainly not multiple Unicode encodings.
Development of the C and C++ languages and much of their libraries predates the development of Unicode. Also, as systems-level languages they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately, perhaps, the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. History has perhaps shown this to be flawed, but changing the language definition, and indeed the library, would break a large amount of legacy code, so such things are left to newer languages such as C#, which has an 8-bit byte and a distinct char type.
Moreover the variable encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this since you suggest that Unicode character operations should be performed on strings rather than machine word types. This would require library support and as you point out this is not provided by the standard library. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not generally universally supported by all target platforms from the deeply embedded to the super-computer. All such things must be provided by either system or third-party libraries.
Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue not a system issue.
"How do you test for whitespace, isprintable, etc., in a way that
doesn't suffer from two issues:
1) Sign expansion, and
2) variable-width character issues"
isspace() considers only the lower 8 bits. Its definition explicitly states that if you pass an argument that is neither representable as an unsigned char nor equal to the value of the macro EOF, the results are undefined. The problem does not arise if it is used as intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.
After all, all commonly used Unicode encodings are variable-width,
whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well
as older standards such as Shift-JIS
isspace() is not defined for Unicode. You'll need a library designed for the specific encoding you are using. The question "What is the best Unicode library for C?" may be relevant.

Strings and character encoding in C++

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:
typedef std::string string8;
typedef std::basic_string<uint32_t> string32;
The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.
The string32 class would be used for UTF-32 when a fixed character size is desired.
The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.
The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.
Furthermore, encoding is easy (it's a simple mapping from code points to bytes and back), while the main difficulty is actually manipulating the data.
With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.
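To illustrate the "cutting a character in the middle" risk, here is a sketch of a byte-limited truncation that at least respects UTF-8 code point boundaries. utf8_prefix is a made-up helper name, and the code assumes the input is valid UTF-8. It does nothing about combining sequences, which is the second problem mentioned above.

```cpp
#include <cstddef>
#include <string>

// Returns the longest prefix of `s` that is at most max_bytes long and
// does not end inside a multi-byte UTF-8 sequence.
inline std::string utf8_prefix(const std::string& s, std::size_t max_bytes)
{
    if (max_bytes >= s.size()) return s;
    std::size_t end = max_bytes;
    // Continuation bytes have the form 10xxxxxx; back up until `end`
    // points at the first byte of a code point.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        --end;
    return s.substr(0, end);
}
```

For the four-byte string "a", 0xC3, 0xA9, "b" (that is, "aéb"), a plain substr(0, 2) returns "a" plus half of the é, whereas utf8_prefix(s, 2) returns just "a".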
If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.
It's not specified what character encoding must be used for string, wstring, etc. The common way is to use Unicode in wide strings. What types and encodings should be used depends on your requirements.
If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with strings (extract, concat, sort, ...) choose std::wstring and as encoding UCS2/UTF-16 (BMP only) on Windows and UCS4/UTF-32 on Linux.
The benefit is the fixed size: each character occupies 2 bytes (or 4 for UCS4), while length() on a std::string holding UTF-8 returns the number of bytes, not the number of characters.
For conversion, you can check sizeof(std::wstring::value_type) == 2 or 4 to choose UCS2 or UCS4. I'm using the ICU library, but there may be simple wrapper libs.
Deriving from std::string is not recommended because basic_string is not designed for it (it has no virtual members, etc.). If you really, really need your own type like std::basic_string< my_char_type >, write a custom traits specialization for it.
The new C++0x standard (now C++11) defines wstring_convert<> and wbuffer_convert<> to convert, via a std::codecvt, from a narrow charset to a wide charset (for example UTF-8 to UCS2).
Visual Studio 2010 has already implemented this, afaik.
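A minimal sketch of what using wstring_convert looks like, assuming a standard library that ships <codecvt>. Note that these facilities were deprecated in C++17, so treat this as an illustration of the API rather than a recommendation. The two function names are made up for this example.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-32 code points.
// from_bytes throws std::range_error on malformed input.
inline std::u32string utf8_to_utf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// UTF-32 code points -> UTF-8 bytes.
inline std::string utf32_to_utf8(const std::u32string& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

Round-tripping the three bytes of the euro sign (0xE2 0x82 0xAC) through these gives a single element U+20AC and then the same three bytes back.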
The traits approach described here might be helpful. It's an old but useful technique.

How do I get STL std::string to work with unicode on windows?

At my company we have a cross-platform (Linux & Windows) library that contains our own extension of the STL std::string; this class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly": basically, it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side, since everything is inherently UTF-8; however, I am having trouble with the Windows side. Is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on std::string, since that is what the string class is based on in Linux.
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
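As a small illustration of the bytes-versus-characters point, here is a sketch that counts code points by skipping UTF-8 continuation bytes. utf8_code_points is a hypothetical helper, and it assumes the string holds valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is NOT a UTF-8
// continuation byte (continuation bytes have the form 10xxxxxx).
inline std::size_t utf8_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the two-character string 日本 (bytes E6 97 A5 E6 9C AC), length() is 6 while utf8_code_points returns 2. Keep in mind that code points are still not the same thing as user-perceived characters.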
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
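On Windows itself you would normally call MultiByteToWideChar for this. As a portable illustration of what such a conversion does, here is a hand-rolled sketch that produces a std::u16string. utf8_to_utf16 is a made-up name, and the code assumes valid UTF-8 input and does no error detection.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Portable UTF-8 -> UTF-16 conversion (illustration only).
inline std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte tells us how long the sequence is.
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A BMP character such as the euro sign becomes one UTF-16 code unit, while U+1F600 (bytes F0 9F 98 80) becomes the pair D83D DE00, which is why UTF-16 is also a variable-width encoding.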
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signed-ness is left undefined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
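A minimal sketch of the safe form of that check; is_ascii_code_unit is a hypothetical name:

```cpp
// True if the byte is a single-byte (ASCII) UTF-8 code unit.
// The cast makes the comparison behave the same whether plain char
// is signed or unsigned on the target platform.
inline bool is_ascii_code_unit(char c)
{
    return static_cast<unsigned char>(c) < 0x80;
}
```

Without the cast, a lead byte such as 0xC3 stored in a signed char is negative, so the naive comparison wrongly reports it as ASCII.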
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (I would have said that this does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This difference in size, and in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::strings, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This avoids a dependency on the build system configuration.
In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much, since wchar_t is 16 bits on Windows, so you either have to restrict yourself to the BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many respects superior to std::string (in my opinion). But it introduces an additional dependency and is thus not always acceptable or convenient.
If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It's really platform-dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described in MSDN.
For VS2015:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
This is according to their docs; I can't check that one.
For MinGW, GCC, etc.:
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
The output contains the proper file name.
You should consider using QString and QByteArray; they have good Unicode support.