how to convert utf8 to std::string? - c++

I am working on this code which receives a cpprest sdk response containing a base64_encoded payload which is a json. here is my code snippet:
typedef std::wstring string_t; //defined in basic_types.h in cpprest lib
void demo() {
http_response response;
//code to handle respose ...
json::value output= response.extract_json();
string_t payload = output.at(L"payload").as_string();
vector<unsigned char> base64_encoded_payload = conversions::from_base64(payload);
std::string utf8_payload(base64_encoded_payload.begin(), base64_encoded_payload.end()); //in debugger I see the Japanese chars are garbled.
string_t utf16_payload = utf8_to_utf16(utf8_payload); //in debugger I see the Japanese chars are good here
//then I need to process the utf8_payload which is an xml.
//I have an API available to process the xml which takes an string
processXML(utf16_payload); //need to convert utf16_payload to a string here;
}
I also tried this and I see str contains garbled chars!
#include <codecvt> // for codecvt_utf8_utf16
#include <locale> // for wstring_convert
#include <string> // for string, wstring
void wstr2str(void) {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
std::wstring japanese = L"北島 美奈";
std::string str = conversion.to_bytes(japanese); //str is garbled:(
}
my questions is: can utf8 containing Japanese char be converted to std::string without being garbled?
Update: I gained access to the processXML() code and changed the input argument type to std::wstring and it worked.
I figured when the xml was getting created, it was converting the std::string to wstring; however, it was not turning out good!
void processXML(std::wstring xmlStrBuf) { //chaned xmlStrBuf to wstring and worked
// more code
CComBSTR xmlBuff = xmlStrBuf.c_str();
VARIANT_BOOL bSuccess = false;
xmlDoc->loadXML(xmlBuff, &bSuccess);
//more code
}
Thanks for the answers and they were helpful when mentioned the string is only a storage.

You are confusing different concepts here.
Storage
This is how we save/store/hold our data. A std::string is a collection of chars, which are bytes. A std::wstring is a collection of wchar_ts, which are sometimes 2-byte wide value (but this is not guaranteed!).
Encoding
This is what the data means, and how it should be interpreted. A std::string, a collection of bytes, could hold UTF-8, or UTF-16, or UTF-32, or ASCII, or ShiftJIS, or morse code, or a JPEG, or a movie, or my DNA (lucky string!).
There are some strong conventions in play in the world. For example, on Windows, a std::wstring is generally accepted to hold UTF-16 (because the two-byte storage is convenient for this, and also because that's how the Windows API does it).
Newer versions of C++ give us things like std::u16_string and std::u32_string as well, which still do not directly have any notion of encoding, but are intended to be used for UTF-16 and UTF-32 respectively because their names make that intention more obvious to readers of code. C++20 will introduce std::u8_string which is intended to signify a UTF-8 encoded string (and is otherwise more or less like a std::string).
But these are just conventions. Nothing about the type std::string says "UTF-8" or any other thing. It doesn't know about or care about or enforce any encoding. It just stores bytes.
So, your question about "converting UTF-8 to std::string" does not really make any sense; it's like asking how to convert a road into a car.
"What should I do, then?"
Well, Base64 is also not an encoding. Well, actually, it totally is, but it's an encoding on top of the string encoding. It's a way of transmitting/escaping/sanitising the raw bytes, not a way of describing how to interpret them later. By asking cpprest to convert from Base64, that's just transforming the way the raw bytes are provided. That's why it gives you a std::vector<char> rather than a std::string because, although (as discussed above) std::string doesn't care about encoding, we sometimes use a std::vector<char> to really, properly, completely say that "this collection does not have any particular encoding, so please don't try to guess from convention or whatever what the encoding is in this use case; all it knows is that it is a bunch of bytes". This is down to opinion. Some people will still use a std::string for that; the authors of cpprest decided not to.
The point is that the use of the function from_base64 cannot tell us anything about the encoding of the text that you've retrieved. For that, we have to go back to the documentation for the text. We have no access to that, and you did not tell us anything about it. If it were just a JSON string, the encoding would be down to the cpprest JSON library and so you'd already be done. However, it's not: it's something packed into a Base64 representation by whoever created the JSON object. Again, that information is not something that you shared with us.
But, based on the variable names you've chosen, the data you're looking at is already UTF-8. You've then attempted to convert it to UTF-16, which is rather the opposite of what you've described you wanted to do.
(Similarly, in your second example, you've taken a std::wstring that [probably] already stores UTF-16 thanks to the L"wide string literal", then told the computer that it's UTF-8 and to convert it "again" to UTF-16, then extracted the raw bytes into a std::string. None of that makes sense.)
Instead, why not literally just processXML(utf8_payload);?
General advice
Encoding can be quite complex, although it's significantly easier to deal with once you've wrapped your mind around the basic concepts of all these layers of abstraction. For the future, and for this question if you wish to clarify it, you will need to ensure that you are absolutely clear, at each stage of the "pipeline" of your data as it gets transmitted from place A to place B, and gets converted from type C to type D, and whatever else, about what encoding it should be at each of those steps. If you want to change the encoding at one of those steps, then do so (though this should be rare!). But before you write any code make sure that you know for sure what it is that you need, otherwise you'll get yourself in a massive tangle.
Eventually you'll start to detect patterns that can help, though. For example, if you were expecting some delicious non-ASCII output and instead see strange text with lots of "Å" characters in it, that's probably UTF-8 that's being interpreted as ASCII by mistake. That's because of the way that the special sequence denoting Unicode codepoints larger than one byte in UTF-8 often starts with a byte whose numerical value is the same as that of the letter "Å" in ASCII (well, ISO/IEC 8859, but close enough).
Similarly, if you get Japanese and didn't expect it, in my experience that's usually because you've given the computer some bytes and told it that they are a string in UTF-16 encoding, when actually they were UTF-8. You just get more experienced at recognising these patterns as you work more, and it can help you to fix your bugs faster.
Just last week the last example there saved me quite a bit of time: I knew immediately that my source data must have been UTF-8, and was therefore able to quickly decide to remove the byte-copy into a std::wstring that I'd been attempting. Examining the bytes in an encoding-agnostic way revealed the "Å" pattern as well and then that was that. This was important because I had no documentation for the data source and thus no way to just look up what the encoding was supposed to be. I had to guess/deduce it. Hopefully that won't be the case for you here.

std::string is just a container for 8-bit wide char, and does not know/care about the encoding. Always think in symbols (letters, numbers, punctuation, etc.) The first 128 characters (0-127) were defined per the ASCII standard, thus requiring a single char to store each symbol. With all the languages and symbols there is, we couldn't represent each of them with just 256 possibilities. The UTF-8 encoding introduces a way to deal with this problem by allowing a single symbol to take 1, 2, 3 or 4 char wide. But, for the std::string object, this is entirely transparent and it's still dealing with a series of chars.
The reason why you're thinking the string is garbled is probably because your debugger assumes the contents of the std::string is always 1 symbol per char (extended ASCII for example), and as such, it's displaying the wrong characters.
Edit: you might want to read this post also.

Related

Read multi-language file - wchar_t vs char?

It's a horrible experience for me to get understanding of unicodes, locales, wide characters and conversion.
I need to read a text file which contains Russian and English, Chinese and Ukrainian characters all at once
My approach is to read the file in byte-chunks, then operate on the chunk, on a separate thread for fast reading. (Link)
This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)
However, I understand that there is no way any character from my multi-lingual file can be represented via 255 combinations, if I stick to char.
For that matter I converted everything into wchar_t and hoped for the best.
I also know about Sys.setlocale(locale = "Russian") (Link) but doesn't it then interpret each character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.
On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save file and re-open with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?
My current understanding is: have everything as wchar_t (double-byte), interpret any file as UTF-16 (double-byte) - is it correct?
Also, I hope to keep the code cross-platform.
Sorry for noob
Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.
Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.
So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:
std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
std::string utf8;
f >> utf8;
// ...
}
So far so good. That all looks easy enough.
But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.
So, without further ado, here are your conversion functions (which is what you bought your ticket for):
#include <string>
#include <codecvt>
#include <locale>
std::string narrow (const std::wstring& wide_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.to_bytes (wide_string);
}
std::wstring widen (const std::string& utf8_string)
{
std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
return convert.from_bytes (utf8_string);
}
My, that was easy, why did those tickets cost so much in the first place?
I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do, you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet) but just in case there is any lingering confusion, once you do have a wide string you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert to back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.
Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.
Notes (added as an edit):
codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
Unfortunately standard c++ does not have any real support for your situation. (e.g. unicode in c++-11)
You will need to use a text-handling library that does support it. Something like this one
The most important question is, what encoding that text file is in. It is most likely not a byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine encoding (could be UTF-8, UTF-16, UTF-32, something-else-entirely), and act appropriately.
wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bit, so that is what they went for. When Unicode was extended to 21 bit, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_). "The Unixes", on the other hand, made wchar_t 32 bit and use UTF-32 encoding, so...
Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.
C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparison that correctly handles normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from Unicode to / from other encodings) in C/C++ is ICU.
And here's a second answer - about Microsoft's (lack of) standards compilance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.
Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).
So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):
The Standard says in [basic.fundamental]/5:
Type wchar_­t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_­t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_­t and char32_­t denote distinct types with the same size, signedness, and alignment as uint_­least16_­t and uint_­least32_­t, respectively, in <cstdint>, called the underlying types.
Hmmm. "Among the supported locales." What's that all about?
Well, I for one don't know, and nor, I suspect, is the person that wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.
As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.
The c++ standard defines wchar_t as a type which will support any code point. On linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.
Therefore the only portable way to handle strings is to convert them from native strings to utf-8 on input and from utf-8 to native strings at the point of output.
You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.
Non-adherence to standards is the reason we can't have nice things.

C++: Comparing strings or wstrings with special characters in them (á, é, ő, etc.)

I recently got an assignment that requires me to compare words. I don't want to describe it in full, but I have to compare the words character-by-character to see how similar two words are.
Now the problem is that the input text I have to use contains a lot of non-standard characters like á, é, ő, etc. I tried using string, wstring, char and wchar_t for representing my words, but nothing seems to work properly. An example:
setlocale(LC_ALL, "");
std::vector <Word::Word> words;
std::wfstream fileWrite("testout.txt");
std::wstring s = words[0].getString();
fileWrite << s;
Our string contains the word "Még" here. It outputs correctly. For the record, everything works the same if I use string instead of wstring. The following works too:
const wchar_t* wc = s.c_str();
fileWrite << wc;
But as soon as I try referencing to a char it gives me gibberish. Example:
fileWrite << wc[0] << " " << wc[1];
This outputs "ď »". I'm guessing the problem is that they use multiple bytes to store a char? I'm just wildly guessing here, but that would explain why
wcslen(wc);
returns 7.
I tried using the substr function with both string and wstring, but generally doesn't seem to be working. Anyone has any idea on how to tackle this problem? Am I missing something obvious here?
Also, I'm using codeblocks with a gcc compiler, I've read it somewhere that it doesn't handle wchar and wstring well, could that be the problem? Remember, I've tried everything above with string instead of wstring, and it was the same.
Thank you all very much for the help, it would be greatly appreciated!
These characters are not unusual. They are absolutely standard Unicode characters. Unfortunately, plain standard C++ doesn't have any support for the finer details of Unicode. Your choice is to either find a nice library supporting it (for example for code running on MacOS X or iOS you would just use what's built into the OS, other operating systems might have similar support), or to go to www.unicode.org and download their code tables. And read everything you can find out about it.
wchar and wstring are inherently non-portable. Your best bet is to use UTF-8 encoding and standard std::string. And understanding UTF-8 is absolutely essential for any programmer nowadays.
There was some discussion here about Notepad. Lots of software writes UTF-8 preceded by a byte-order marker (BOM) and lots of software uses that to recognise UTF-8. If that byte order marker is not present, they look at individual bytes. There is a possibility that a file consists only of ASCII characters, in which case it doesn't matter what encoding it is. If it's not only ASCII, the likelihood that for example a Windows-1252 encoded file containing non-ASCII characters is legal UTF-8 is practically zero.

Create UTF-16 string from char*

So I have standard C string:
char* name = "Jakub";
And I want to convert it to UTF-16. I figured out, that UTF-16 will be twice as long - one character takes two chars.
So I create another string:
char name_utf_16[10]; //"Jakub" is 5 characters
Now, I believe with ASCII characters I will only use lower bytes, so for all of them it will be like 74 00 for J and so on. With that belief, I can make such code:
void charToUtf16(char* input, char* output, int length) {
/*Todo: how to check if output is long enough?*/
for(int i=0; i<length; i+=2) //Step over 2 bytes
{
//Lets use little-endian - smallest bytes first
output[i] = input[i];
output[i+1] = 0; //We will never have any data for this field
}
}
But, with this process, I ended with "Jkb". I know no way to test this properly - I've just sent the string to Minecraft Bukkit Server. And this is what it said upon disconnecting:
13:34:19 [INFO] Disconnecting jkb?? [/127.0.0.1:53215]: Outdated server!
Note: I'm aware that Minecraft uses big-endian. Code above is just an example, in fact, I have my conversion implemented in class.
Before I answer your question, consider this:
This area of programming is full of man traps. It makes a lot of sense to understand the differences between ASCII, UTF7/8 and ANSI/'MultiByte Character Strings (MBCS)', all of which to an english speaking programmer will look and feel identical, but need very different handling if they are introduced to a european or asian user.
ASCII: Characters are in range 32-127. only ever one byte. The clue is in the name, they are great for Americans, but not fit for purpose in the rest of the world.
ANSI/MBCS: This is the reason for 'code pages'. Characters 32-127 are the same as ASCII, but it is possible to have characters in the range of 128-255 as well for additional characters, and some of the 128-255 range can be used as a flag to mark that the character continues into a second, third or even fourth byte. To process the string correctly, you need both the string bytes and the correct code page. If you try processing the string using the wrong code page you will not have the right characters, and misinterpret whether a character is a one, two or even 4 byte character.
UTF7/8: These are 8-bit wide formatting of 21-bit unicode character points. in UTF-7 and UTF-8 unicode characters can be between one and four bytes long. The advantage that UTF encodings have over ANSI/MBCS is that there is no ambiguity caused by code pages. Each glyph in every script has a unique unicode code point, which means it is not possible to mangle the character sets by interpreting the data on a different computer with different regional settings.
So to to start to answer your question:
Whilst you are making the assumption that your char* will only point to an ASCII string, that is a really dangerous choice to make, users are in control of data that is typed in, not the programmer. Windows programs will be storing this as MBCS by default.
You are making the second assumption is that a UTF-16 encoding will be twice the size of an 8 bit encoding. That is not generally a safe assumption. depending on the source encoding the UTF-16 encoding may be twice the size, may be less than twice the size, and in an extreme example may actually be shorter in length.
So, what is the safe solution?
The safe option is to implement your application internally as unicode. On windows, this is a compiler option, and then means your windows controls all use wchar_t* strings for their data type. On linux I'm less sure that you can always use unicide graphics and OS libraries. You must also use the wcslen() functions to get the length of strings etc. When you interact with the outside world, be precise in the character encodings used.
To answer to your question then becomes changing the question to, what do i do when I receive non UTF-16 data?
Firstly, be very clear about what assumptions on its formatting are you making? and secondly, accept the fact that sometimes the conversion to UTF-16 may fail.
If you are clear on the source formatting, you can then choose the appropriate win32 or the stl converter to convert the format, and you should then look for evidence the conversion failed before using the result. e.g. mbstowcs in or MultiByteToWideChar() on windows. However the use of both of these approaches safely means you need to understand ALL of the above answer.
All other options introduce risk. Use mbcs strings and you will have data strings mangled by being entered using one code page, and processed using a different code page. Assume ASCII data, and when you encounter a non ascii character your code will break, and you will 'blame' the user for your short comings.
Why do you want to make your own Unicode conversion functionality when theres's existing C/C++ functions for this, like mbstowcs() which is included in <cstdlib>.
If you still want to make your own stuff, then have a look at Unicode Consortium's open source code which can be found here:
Convert UTF-16 to UTF-8 under Windows and Linux, in C
output[i] = input[i];
This will assign every other byte of the input, because you increment i by 2. So no wonder that you obtain "Jkb".
You probably wanted to write:
output[i] = input[i / 2];

How to write a std::string to a UTF-8 text file

I just want to write some few simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simple way to do so?
The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.
And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.
If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).
You may want to start your file with a byte order mark so that other programs will know it is UTF-8.
There is nice tiny library to work with utf8 from c++: utfcpp
libiconv is a great library for all our encoding and decoding needs.
If you are using Windows you can use WideCharToMultiByte and specify that you want UTF8.
What is the easiest and simple way to do so?
The most intuitive and thus easiest handling of utf8 in C++ is for sure using a drop-in replacement for std::string.
As the internet still lacks of one, I went to implement the functionality on my own:
tinyutf8 (EDIT: now Github).
This library provides a very lightweight drop-in preplacement for std::string (or std::u32string if you will, because you iterate over codepoints rather that chars). Ity is implemented succesfully in the middle between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF8-sequences makes it (nearly completely) compatible with ANSI (0-255).
Hope this helps!
If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.
std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray(qstr.toUtf8());
std::string str_std( byteArray.constData(), byteArray.length());
My preference is to convert to and from a std::u32string and work with codepoints internally, then convert to utf8 when writing out to a file using these converting iterators I put on github.
#include <utf/utf.h>
int main()
{
using namespace utf;
u32string u32_text = U"ɦΈ˪˪ʘ";
// do stuff with string
// convert to utf8 string
utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());
u8string u8_text(pos, end);
// write out utf8 to file.
// ...
}
Use Glib::ustring from glibmm.
It is the only widespread UTF-8 string container (AFAIK). While glyph (not byte) based, it has the same method signatures as std::string so the port should be simple search and replace (just make sure that your data is valid UTF-8 before loading it into a ustring).
As to UTF-8 is multibite characters string and so you get some problems to work and it's a bad idea/ Instead use normal Unicode.
So by my opinion best is use ordinary ASCII char text with some codding set. Need to use Unicode if you use more than 2 sets of different symbols
(languages) in single.
It's rather rare case. In most cases enough 2 sets of symbols. For this common case use ASCII chars, not Unicode.
Effect of using multibute chars like UTF-8 you get only China traditional, arabic or some hieroglyphic text. It's very very rare case!!!
I don't think there are many peoples needs that. So never use UTF-8!!! It's avoid strong headache of manipulate such strings.

(Encoded) String handling in C++ - questions / best practices?

What are the best practices for handling strings in C++? I'm wondering especially how to handle the following cases:
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
How to handle char* strings. After all, this can be unsigned or not, and I wonder how I determine what encoding they use (ANSI?), and how to convert to UTF-8? Is there any recommended reading on this, where the basic guarantees of C/C++ about strings are documented?
String algorithms for UTF-8 etc. strings -- computing the length, parsing, etc. How is this done best?
What character type is really portable? I've learned that wchar_t can be anything from 8-32 bit wide, making it no good choice if I want to be consistent across platforms (especially when moving data between different platforms - this seems to be a problem, as described for example in EASTL, look at item #13)
At the moment, I'm using std::string everywhere, with a small helper utility to convert to UTF-16 when calling Unicode-APIs, but I'm pretty sure that this is not really the best way. Using something like Qt's QString or the ICU String class seems to be right, but I wonder whether there is a more lightweight approach (i.e. if my char strings are ANSI encoded, and the subset of ANSI that is used is equal to UFT-8, then I can easily treat the data as UTF-8 and provide converters from/to UTF-8, and I'm done, as I can store it in std::string, unless there are problems with this approach).
For a shorter answer, I would just recommend using UTF-16 for simplicity; Java/C#/Python 3.0 switched to that model exactly for simplicity.
I've always expected wchar_t to be 16 or 32bit wide, and many platforms support that; indeed, APIs like wcrtomb() do not allow an implementation to support a shift state for wchar_t*, but since UTF-8 needs none, it may be used, while other encodings are ruled out.
Then, I answer the question about XML.
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
I'm not sure, but I don't think so.
Mixing two encodings in the same file is asking for trouble and data corruption.
Encoding a file in UTF-16 is usually a bad choice since most programs rely on using ASCII everwhere.
The issue is: an XML file might use any single encoding, maybe even UTF-16, but then also the initial encoding declaration has to use UTF-16, and even the tags then. The problem I see with UTF-16 is: how should one reliable parse the initial declaration? The answer comes in the specification:, § 4.3.3:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
When reading that, note that also an XML file is an entity, called the document entity; in general, an entity is a storage unit for the document. From the whole specification, I'd say that only one encoding declaration is allowed for each entity, and I'd convert all entities to UTF-16 when reading them for easier handling.
Webography:
http://www.w3.org/TR/REC-xml/, XML spec.
http://www.xml.com/axml/testaxml.htm, Annotated XML spec.
String algorithms for UTF-8 etc. strings -- computing the length, parsing, etc. How is this done best?
mbrlen gives you the length of a C string. I don't think std::string can be used for multibyte strings, you should use wstring for wide ones.
In general, you should probaby stick with UTF-16 inside your program and use UTF-8 only on I/O (I don't know well other options, but they are surely more complex and error-prone).
How to handle char* strings. After all, this can be unsigned or not, and I wonder how I determine what encoding they use (ANSI?), and how to convert to UTF-8? Is there any recommended reading on this, where the basic guarantees of C/C++ about strings are documented?
Basically, you can use any encoding, and you will happen to use the native encoding of the system on which you are running on, as long as it's an 8-bit encoding. C was born for ASCII, and locale handling was an afterthought. For years, each system understood mostly one native encoding, say ISO-8859-x, and files from another encoding could even be non-representable.
Since for UTF-8 strings one byte is not always one character, I guess that the safest bet is to use multibyte string for them. The C manuals I used described multibyte string in abstract, without details on those issues (in particular, on the used encoding). For C, see functions like mbrlen and mbrtowc. On my Linux system, it is noted that their behaviour depends on LC_CTYPE, and this probably means that the native type of multibyte strings. From the documentation it can be inferred that their API supports also encodings where you can shift from one-byte to two-bytes and back.
How to handle char* strings. After all, this can be unsigned or not,
If you rely on signedness of char, you're doing it wrong. Signedness of chars only matters if you use char as a numeric type, and then you should always use either unsigned or signed chars; in fact, you should pretend that plain char is neither unsigned nor signed, and that an expression like a > 0 (if a is a char) has undefined semantics. But what would it be useful for, anyway?