Strings and character encoding in C++

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:
typedef std::string string8;
typedef std::basic_string<uint32_t> string32;
The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.
The string32 class would be used for UTF-32 when a fixed character size is desired.
The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.
The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.
Furthermore, encoding is easy (it's a simple transposition from code point to bytes and vice versa) while the main difficulty is actually about manipulating the data.
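To illustrate how mechanical the code point -> bytes step is, here is a minimal sketch of a UTF-8 encoder. The name append_utf8 is made up for this example and is not part of any library mentioned here; it only assumes the standard UTF-8 bit layout.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
// Returns false (and appends nothing) for values that are not Unicode scalar
// values: surrogates D800..DFFF and anything above 10FFFF.
inline bool append_utf8(std::string& out, std::uint32_t cp)
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return false;

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

For instance, append_utf8(s, 0x20AC) appends the three bytes E2 82 AC (the euro sign). The hard part, as said above, is everything that comes after this step.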
With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
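A small sketch of the diacritics problem, assuming a C++11 compiler. The two function names are invented for the illustration. The string holds "e" followed by U+0301 (combining acute accent), which a reader sees as a single character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "e" + COMBINING ACUTE ACCENT: one user-perceived character, two code points.
inline std::u32string decomposed_e_acute()
{
    return U"e\u0301";
}

// Naive truncation to the first `n` code points. Nothing in std::u32string
// knows that a combining mark belongs to the code point before it.
inline std::u32string first_code_points(const std::u32string& s, std::size_t n)
{
    return s.substr(0, n);
}
```

Here decomposed_e_acute().size() is 2, and first_code_points(decomposed_e_acute(), 1) silently drops the accent and leaves a bare "e". A fixed-width encoding removes the "cut in the middle of a code point" problem but not this one.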
The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.
If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

It's not specified what character encoding must be used for string, wstring etc. The common way is to use unicode in wide strings. What types and encodings should be used depends on your requirements.
If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with strings (extract, concat, sort, ...) choose std::wstring and as encoding UCS2/UTF-16 (BMP only) on Windows and UCS4/UTF-32 on Linux.
The benefit is the fixed size: each character has a size of 2 (or 4 for UCS4) bytes while std::string with UTF-8 returns wrong length() results.
For conversion, you can check sizeof(std::wstring::value_type) == 2 or 4 to choose UCS2 or UCS4. I'm using the ICU library, but there may be simple wrapper libs.
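The sizeof check can be done at compile time. A minimal sketch follows; the enum and function names are hypothetical, and the mapping from size to encoding is only the convention described above (2 bytes on Windows, 4 on Linux), not something the standard guarantees.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag for the encoding we *assume* a wide string uses.
enum class WideEncoding { Utf16, Utf32 };

// Picks the assumed encoding from the size of wchar_t.
// The size is only a hint; the standard does not tie wchar_t to Unicode.
constexpr WideEncoding wide_encoding()
{
    static_assert(sizeof(std::wstring::value_type) == 2 ||
                  sizeof(std::wstring::value_type) == 4,
                  "unexpected wchar_t size");
    return sizeof(std::wstring::value_type) == 2 ? WideEncoding::Utf16
                                                 : WideEncoding::Utf32;
}
```

A conversion layer could then branch on wide_encoding() to pick the matching converter.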
Deriving from std::string is not recommended because basic_string is not designed for it (it lacks virtual members, etc.). If you really, really need your own type like std::basic_string<my_char_type>, write a custom specialization for this.
The new C++0x standard defines wstring_convert<> and wbuffer_convert<> to convert with a std::codecvt from a narrow charset to a wide charset (for example UTF-8 to UCS2).
Visual Studio 2010 has already implemented this, afaik.

The traits approach described here might be helpful. It's an old but useful technique.

Related

ASCII and UTF-8 (or UCS-2 and UTF-16) strings in the same C++ project

We have a project where, for historical reasons, string handling is a cacophony of encodings and representations; we definitely have places that can only handle ASCII reliably, some places probably using UTF-8, a few places at the periphery that I suspect to be using platform-specific 8-bit encoding (of course varying between our different target platforms), various places designed to take UCS-2, and maybe also some that would be happy to operate on UTF-16 - all of which are sometimes passed around as C-style strings (char*, CHAR16*) and sometimes as C++ strings (std::string, std::basic_string<CHAR16>). Of course there is very little in terms of documentation.
As a first step towards untangling this mess, I want to set up a type system using genuinely different types for the different encodings.
One idea that crossed my mind was to use e.g. signed char as the basis for ASCII strings and unsigned char for UTF-8 strings, as well as char16_t for UCS-2 and short for UTF-16 (or something along these lines), but that would mean I won't be able to directly use string literals. Also, being able to simply feed ASCII strings to functions expecting UTF-8 (but not vice versa) would be neat.
Do you have any smart suggestions for how to go about this, or maybe even working code?
The code needs to be compatible with C++11.
Please refrain from any answers along the lines of "just use UTF-8 consistently throughout", because that's pretty much my end goal anyway; rather, this is about creating a tool that I think would help me a lot to get there.
-- addendum --
I should probably have mentioned that I presume we already have issues where string encoding doesn't "line up" properly, e.g. UTF-16 strings being passed to functions that can only handle UCS-2 strings, or platform-specific 8-bit strings being passed to functions that expect ASCII strings. Just yesterday I found dedicated conversion functions carrying "ASCII" in their name that de-facto would actually convert to/from Latin-1 instead of ASCII.
I think I'm onto something, at least as far as C++ strings (std::string, std::basic_string<char16_t>) are concerned; there, the key might be to use non-default character traits, like so:
using ASCII = char;
using LATIN1 = char;
using UTF8 = char;
using UCS2 = char16_t;
using UTF16 = char16_t;
class ASCIICharTraits : public std::char_traits<ASCII> {};
class Latin1CharTraits : public std::char_traits<LATIN1> {};
class UTF8CharTraits : public std::char_traits<UTF8> {};
class UCS2CharTraits : public std::char_traits<UCS2> {};
class UTF16CharTraits : public std::char_traits<UTF16> {};
using ASCIIString = std::basic_string<ASCII, ASCIICharTraits>;
using Latin1String = std::basic_string<LATIN1, Latin1CharTraits>;
using UTF8String = std::basic_string<UTF8, UTF8CharTraits>;
using UCS2String = std::basic_string<UCS2, UCS2CharTraits>;
using UTF16String = std::basic_string<UTF16, UTF16CharTraits>;
Using distinct types as the traits parameter to the std::basic_string template ensures that the string types are also treated as distinct types by the compiler, preventing any mixup of incompatibly encoded C++ strings, without having to write a wrapper framework.
Note that for this to work the custom trait types need to be subclassed, not simply aliased. (Theoretically I could write new trait types from scratch, but deriving from std::char_traits makes the job much easier, and should make sure I get binary compatibility, allowing me to implement trivial conversions, such as from ASCII to Latin-1 or UTF-8, by means of a simple reinterpret_cast.)
(Fun fact: To the best of my knowledge this mechanism should even work with good old C++03, provided the using clauses are replaced with corresponding typedefs.)
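A reduced sketch of the traits idea with just two of the encodings, to show what the compiler actually enforces. The helper to_utf8 is hypothetical; it copies the bytes instead of using reinterpret_cast, which is a deliberately more conservative choice than the one suggested above.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Empty subclasses: same behaviour as std::char_traits<char>, different type.
struct ASCIICharTraits : std::char_traits<char> {};
struct UTF8CharTraits  : std::char_traits<char> {};

using ASCIIString = std::basic_string<char, ASCIICharTraits>;
using UTF8String  = std::basic_string<char, UTF8CharTraits>;

// The two string types are unrelated as far as the compiler is concerned.
static_assert(!std::is_same<ASCIIString, UTF8String>::value,
              "distinct traits give distinct string types");
static_assert(!std::is_convertible<UTF8String, ASCIIString>::value,
              "UTF-8 cannot be passed where ASCII is expected");

// Hypothetical widening helper: every ASCII string is also valid UTF-8,
// so this direction is a plain byte copy.
inline UTF8String to_utf8(const ASCIIString& s)
{
    return UTF8String(s.data(), s.size());
}
```

One cost to be aware of: the standard stream operators require the stream and the string to share the same traits type, so std::cout << some_utf8_string will not compile without a small adapter.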
I recommend the standard suggestion: sandwich method.
Internally use only one data type (the one native to your language or, as in this case, to the standard library).
Decode (on input) and encode (on output) only at the outer layers. It should also be clear why you choose a given encoding there. Writing to a file? UTF-8 is good (ASCII is a subset, so keep it as UTF-8). That is also where you do the input validation. Should it be a number? Check that the characters are Unicode digits, etc. Data validation and encoding validation should be kept as close as possible to where the input is read. For output apply the same rule (but in that case there should be no validation).
So for now you may give true strings a name prefix (try something unique), and look for the places where you encode or decode. Try to move that encoding and decoding to the outer layers. When you have finished, remove the prefix.
You may use other prefixes for the other encodings (just temporarily). Again, pick something unique. Mess with your variable names, not the types.
As an alternative, I think you can annotate variables and use external tools to check that certain annotations do not mix. The Linux kernel uses something like that (e.g. to distinguish user-space and kernel pointers). I think it is overkill for your program.
Why the sandwich? By now you probably know a lot about UTF-8, UCS-2, UTF-16, etc., but it took time. The next coworker may not know all those details, and that would cause problems in the long term. We also use integers without worrying about whether they are ones' complement, two's complement, or sign-and-magnitude, except when we are writing out data. Do the same for strings. Keep the semantics and forget the encoding inside the program. Only the outer layer must handle it.
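As one possible shape for the input slice of the sandwich, here is a sketch that assumes the legacy input is Latin-1 and the internal representation is UTF-8 in std::string. The function name is made up. Latin-1 is an easy case because its byte values are exactly the code points U+0000 to U+00FF.

```cpp
#include <cassert>
#include <string>

// Hypothetical input-layer conversion: Latin-1 bytes in, UTF-8 bytes out.
// Bytes below 0x80 are ASCII and pass through unchanged; the rest become
// a two-byte UTF-8 sequence (110000xx 10xxxxxx).
inline std::string latin1_to_utf8(const std::string& latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Everything behind this call then only ever sees UTF-8, and the matching encoder lives at the output layer.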

Read multi-language file - wchar_t vs char?

It's been a horrible experience for me trying to understand Unicode, locales, wide characters and conversion.
I need to read a text file which contains Russian and English, Chinese and Ukrainian characters all at once.
My approach is to read the file in byte-chunks, then operate on the chunk, on a separate thread for fast reading. (Link)
This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)
However, I understand that there is no way any character from my multi-lingual file can be represented via 255 combinations, if I stick to char.
For that matter I converted everything into wchar_t and hoped for the best.
I also know about Sys.setlocale(locale = "Russian") (Link) but doesn't it then interpret each character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.
On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save file and re-open with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?
My current understanding is: have everything as wchar_t (double-byte), interpret any file as UTF-16 (double-byte) - is it correct?
Also, I hope to keep the code cross-platform.
Sorry for noob
Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.
Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.
So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:
std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
    std::string utf8;
    f >> utf8;    // note: reads one whitespace-delimited token; use std::getline for a whole line
    // ...
}
So far so good. That all looks easy enough.
But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.
So, without further ado, here are your conversion functions (which is what you bought your ticket for):
#include <string>
#include <codecvt>
#include <locale>
std::string narrow (const std::wstring& wide_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (wide_string);
}
std::wstring widen (const std::string& utf8_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (utf8_string);
}
My, that was easy, why did those tickets cost so much in the first place?
I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do, you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet) but just in case there is any lingering confusion, once you do have a wide string you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.
Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.
Notes (added as an edit):
The <codecvt> header and std::wstring_convert are deprecated in C++17 (not sure why), but if you limit their use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
Unfortunately standard c++ does not have any real support for your situation. (e.g. unicode in c++-11)
You will need to use a text-handling library that does support it. Something like this one
The most important question is, what encoding that text file is in. It is most likely not a byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine encoding (could be UTF-8, UTF-16, UTF-32, something-else-entirely), and act appropriately.
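What you would look for in the hex editor can also be checked in code. This is a sketch only: the enum and function names are invented, and it looks solely for a byte order mark at the start of the data. Plenty of UTF-8 files have no BOM at all, so Bom::None means "unknown", not "not Unicode".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: inspect the first bytes of a file for a byte order mark.
inline Bom detect_bom(const std::string& head)
{
    // True when `head` begins with the `n` bytes at `sig`.
    const auto starts_with = [&head](const char* sig, std::size_t n) {
        return head.size() >= n && head.compare(0, n, sig, n) == 0;
    };
    // The UTF-32 LE mark begins with the UTF-16 LE mark, so test it first.
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Bom::Utf8;
    if (starts_with("\xFF\xFE", 2))         return Bom::Utf16LE;
    if (starts_with("\xFE\xFF", 2))         return Bom::Utf16BE;
    return Bom::None;
}
```

Feed it the first four bytes read from the file in binary mode. One known ambiguity: a UTF-16 LE file whose first character is U+0000 looks like UTF-32 LE to this check.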
wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bits, so that is what they went for. When Unicode was extended to 21 bits, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_t). "The Unixes", on the other hand, made wchar_t 32 bits and use UTF-32 encoding, so...
Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.
C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU.
And here's a second answer - about Microsoft's (lack of) standards compliance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.
Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).
So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):
The Standard says in [basic.fundamental]/5:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
Hmmm. "Among the supported locales." What's that all about?
Well, I for one don't know, and nor, I suspect, does the person who wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.
As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.
The C++ standard defines wchar_t as a type which will support any code point. On Linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.
Therefore the only portable way to handle strings is to convert them from native strings to UTF-8 on input and from UTF-8 to native strings at the point of output.
You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
Non-adherence to standards is the reason we can't have nice things.

Unicode std::string class replacement

I'm looking for suggestions regarding unicode aware std::string library replacements. I have a bunch of code that uses std::string, its iterators etc, and would like to now support unicode strings (free or open source implementations preferred, regex capabilities would be great!).
I'm not sure at this point if I require a complete rewrite or if I can get away with dropping in a new string library that supports all of the std::string interfaces. The unicode world seems very complex and I'm just wanting to enable it in my applications not have to learn every single aspect of it.
By the way, how does the index operator work when it has to pass back a reference to a 1-, 2-, 3- or 4-byte structure which could in theory change to a structure of a different size? If a larger or smaller value is assigned, does the internal data get shifted back and forth in situ?
You don't need a complete rewrite if you are sure about what your std::string contains. For example, you could assume (and convert inputs to be sure) that your std::string contains UTF-8 encoded text (for the strings that need localization). Don't forget that std::string is only a container of raw data; it's not associated with an encoding (even in C++0x, it's only a possibility, not a requirement).
Then when you pass text to other libraries that require different encodings, you can use libraries like UTF8CPP to convert to the required encoding (but most of the time such libraries will do it themselves).
That keeps it simple: UTF-8 in plain std::string in your code, which lets you pass Unicode strings to everything else (with conversion if necessary).
There have been a lot of discussions about this in the boost community mailing list. Maybe reading it (if you have enough time...) can help you understand other possible solutions.
Depending on your needs, use std::wstring or the larger and more complex (but de facto standard) ICU: http://site.icu-project.org/
What Unicode encoding do you need? If UTF-8 is OK you can have a look at Glib::ustring:
Glib::ustring has much the same interface as std::string, but contains Unicode characters encoded as UTF-8.
Asking for "a type like std::string, but for Unicode" is like asking for "a type like unsigned, but for primes." std::string is perfectly capable of storing Unicode, in many encodings - the most generally useful being UTF-8.
What you need to replace is your iterators, not your storage type. The iterators should iterate over the codepoints of the string rather than the bytes. That is, ++i should advance one codepoint, and *i should return a codepoint (via uint32_t) rather than a char.
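A rough sketch of what "advance one code point" and "return a code point" mean in practice, written as a free function rather than a full iterator class. The names next_code_point and code_points are hypothetical. Malformed input is replaced with U+FFFD one byte at a time, which is one common policy among several.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode the code point that starts at s[i] and advance i past it.
// Precondition: i < s.size(). Malformed input yields U+FFFD and consumes one byte.
inline char32_t next_code_point(const std::string& s, std::size_t& i)
{
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;   // number of continuation bytes expected
    char32_t cp = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else { ++i; return 0xFFFD; }    // stray continuation or invalid lead byte

    if (i + extra >= s.size()) { ++i; return 0xFFFD; }   // truncated sequence
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    static const char32_t min_for[] = {0, 0x80, 0x800, 0x10000};
    if (cp < min_for[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++i;
        return 0xFFFD;
    }
    i += extra + 1;
    return cp;
}

// Convenience wrapper: decode a whole UTF-8 string into code points.
inline std::u32string code_points(const std::string& utf8)
{
    std::u32string out;
    for (std::size_t i = 0; i < utf8.size(); )
        out += next_code_point(utf8, i);
    return out;
}
```

The storage stays a plain std::string; only the way you walk over it changes. A real iterator would wrap next_code_point in operator++ and operator*.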
I've written my own C++ UTF-8 library, which is a drop-in replacement of std::wstring/string. The data type that is shown to the user is char32_t, but internally the wide characters are all packed into UTF-8 chars.
The whole thing is quite fast and its performance is best with few unicode codepoints within many ascii codepoints. All operations that are known from std::string are available with this class (except for substring find) and operate on codepoint indices, in contrast to byte indices.
As a bonus of defensive programming, the whole ANSI range of 0-255 can be used without multibytes :)
Hope this helps!

How do I get STL std::string to work with unicode on windows?

At my company we have a cross platform(Linux & Windows) library that contains our own extension of the STL std::string, this class provides all sort of functionality on top of the string; split, format, to/from base64, etc. Recently we were given the requirement of making this string unicode "friendly" basically it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side since every thing is inherently UTF-8, however I am having trouble with the Windows side; is there a trick to getting the STL std::string to work as UTF-8 on windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on the std::string since that is what the string class is based on in Linux.
Thank you,
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
Yes - by being more aware of locales and encodings.
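To make the bytes-versus-characters point concrete, here is a minimal sketch that counts code points rather than bytes. The function name is made up, and it assumes the string holds well-formed UTF-8 (it does no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8:
// every byte that is not a continuation byte (10xxxxxx) starts a new one.
inline std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

For the UTF-8 bytes of "Привет!", length() reports 13 while utf8_length reports 7. Note that this is still code points, not user-perceived characters.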
Windows has two function calls for everything that requires text, a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings, the *A() takes strings in the current codepage. However, Windows doesn't support a UTF-8 code page, so you can't directly use it in that sense with the *A() functions, nor would you want to depend on that being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there, Googling "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing such), and then pass it to Windows.
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character, it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as char's signed-ness is left undefined by the standards. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integral type with a range of 0-255. That maps to the C type of uint8_t exactly, although unsigned char works as well. Ideally then, I'd make a UTF-8 string an array of uint8_ts, but due to old APIs, this is rarely done.
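A sketch of the signedness pitfall and the usual way around it. The function name is hypothetical. The point is simply to convert to unsigned char before comparing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True when the byte at s[i] belongs to a multi-byte UTF-8 sequence.
// The cast matters: where plain char is signed, bytes 0x80..0xFF are
// negative, so `s[i] < 0x80` would be true for every byte.
inline bool is_multibyte_unit(const std::string& s, std::size_t i)
{
    assert(i < s.size());
    return static_cast<unsigned char>(s[i]) >= 0x80;
}
```

With the cast the result is the same whether the platform's char is signed or unsigned.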
Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:
which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit, and is only 2 bytes. (Which, I would have said, does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This size difference, and difference in data encoding, clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
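To show why a 2-byte unit is not enough on its own, here is a sketch of how a code point above U+FFFF is split into two UTF-16 code units (a surrogate pair). The function name is made up, and std::u16string is used instead of std::wstring so the example behaves the same on every platform.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one code point to a UTF-16 string.
// Assumes cp is a Unicode scalar value (<= 0x10FFFF and not a surrogate).
inline void append_utf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // fits in one unit
    } else {
        const char32_t v = cp - 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
}
```

U+20AC takes one unit, while U+1F600 becomes the pair D83D DE00, so on Windows one "wide character" is not always one code point.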
The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::strings, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
Putting UTF-8 code points into an std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8 -- it expects and works with UTF-16 instead. You can switch to an std::wstring which will store UTF-16 (at least on most Windows compilers) or you can write other routines that will accept UTF-8 (probably by converting to UTF-16, and then passing through to the OS).
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Use std::string in cross-platform portion of the code. Assume that it always contains UTF-8 strings.
In Windows portion of the code, use "wide" versions of Windows API explicitly, i.e. write e.g. CreateFileW instead of CreateFile. This allows to avoid dependency on build system configuration.
In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
Use std::wstring everywhere. Does not help much since wchar_t is 16 bit on Windows, so you either have to restrict yourself to BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and is thus not always acceptable or convenient.
If you want to avoid headache, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
Use UTF-8 as the default encoding for strings.
In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.
This is also the approach Poco has taken.
It's really platform dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described on MSDN.
for VS2015
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;
according to their docs. I can't check that one.
for mingw, gcc, etc.
std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();
output contains proper file name...
You should consider using QString and QByteArray; they have good Unicode support.

(Encoded) String handling in C++ - questions / best practices?

What are the best practices for handling strings in C++? I'm wondering especially how to handle the following cases:
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
How to handle char* strings. After all, this can be unsigned or not, and I wonder how I determine what encoding they use (ANSI?), and how to convert to UTF-8? Is there any recommended reading on this, where the basic guarantees of C/C++ about strings are documented?
String algorithms for UTF-8 etc. strings -- computing the length, parsing, etc. How is this done best?
What character type is really portable? I've learned that wchar_t can be anything from 8-32 bit wide, making it no good choice if I want to be consistent across platforms (especially when moving data between different platforms - this seems to be a problem, as described for example in EASTL, look at item #13)
At the moment, I'm using std::string everywhere, with a small helper utility to convert to UTF-16 when calling Unicode-APIs, but I'm pretty sure that this is not really the best way. Using something like Qt's QString or the ICU String class seems to be right, but I wonder whether there is a more lightweight approach (i.e. if my char strings are ANSI encoded, and the subset of ANSI that is used is equal to UTF-8, then I can easily treat the data as UTF-8 and provide converters from/to UTF-8, and I'm done, as I can store it in std::string, unless there are problems with this approach).
For a shorter answer, I would just recommend using UTF-16 for simplicity; Java/C#/Python 3.0 switched to that model exactly for simplicity.
I've always expected wchar_t to be 16 or 32bit wide, and many platforms support that; indeed, APIs like wcrtomb() do not allow an implementation to support a shift state for wchar_t*, but since UTF-8 needs none, it may be used, while other encodings are ruled out.
Then, I answer the question about XML.
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
I'm not sure, but I don't think so.
Mixing two encodings in the same file is asking for trouble and data corruption.
Encoding a file in UTF-16 is usually a bad choice since most programs rely on using ASCII everywhere.
The issue is: an XML file might use any single encoding, maybe even UTF-16, but then the initial encoding declaration also has to use UTF-16, and even the tags. The problem I see with UTF-16 is: how should one reliably parse the initial declaration? The answer comes in the specification, § 4.3.3:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
When reading that, note that also an XML file is an entity, called the document entity; in general, an entity is a storage unit for the document. From the whole specification, I'd say that only one encoding declaration is allowed for each entity, and I'd convert all entities to UTF-16 when reading them for easier handling.
Webography:
http://www.w3.org/TR/REC-xml/, XML spec.
http://www.xml.com/axml/testaxml.htm, Annotated XML spec.
String algorithms for UTF-8 etc. strings -- computing the length, parsing, etc. How is this done best?
mbrlen gives you the length in bytes of the next multibyte character in a C string (not the length of the whole string), so you call it repeatedly to count characters. I don't think std::string can be used for multibyte strings; you should use wstring for wide ones.
In general, you should probably stick with UTF-16 inside your program and use UTF-8 only on I/O (I don't know the other options well, but they are surely more complex and error-prone).
How to handle char* strings. After all, this can be unsigned or not, and I wonder how I determine what encoding they use (ANSI?), and how to convert to UTF-8? Is there any recommended reading on this, where the basic guarantees of C/C++ about strings are documented?
Basically, you can use any encoding, and you will end up using the native encoding of the system you are running on, as long as it's an 8-bit encoding. C was born for ASCII, and locale handling was an afterthought. For years, each system understood mostly one native encoding, say ISO-8859-x, and files from another encoding could even be non-representable.
Since for UTF-8 strings one byte is not always one character, I guess that the safest bet is to use multibyte strings for them. The C manuals I used described multibyte strings in the abstract, without details on those issues (in particular, on the encoding used). For C, see functions like mbrlen and mbrtowc. On my Linux system, it is noted that their behaviour depends on LC_CTYPE, and this probably means that multibyte strings use the native encoding of the current locale. From the documentation it can be inferred that their API also supports encodings where you can shift from one-byte to two-byte characters and back.
How to handle char* strings. After all, this can be unsigned or not,
If you rely on signedness of char, you're doing it wrong. Signedness of chars only matters if you use char as a numeric type, and then you should always use either unsigned or signed chars; in fact, you should pretend that plain char is neither unsigned nor signed, and that an expression like a > 0 (if a is a char) has undefined semantics. But what would it be useful for, anyway?