How to convert char* into std::u8string? - c++

Intro
If I catch an exception, I want to convert the error message, which is returned as a C-style string by the what() method, into a std::u8string (a UTF-8 string). For example: std::u8string(error.what());
Problem
How can I convert a char* into a std::u8string?
Additional Information
I only catch exceptions from the standard library, boost and eigen.
My application is Windows dependent, so the solution doesn't need to be portable.

You can use the constructor that takes a beginning and an ending iterator for the sequence that defines the string.
#include <cstring>
// ...
auto cstr=error.what();
std::u8string str{cstr, cstr+std::strlen(cstr)};

Related

char8_t and utf8everywhere: How to convert to const char* APIs without invoking undefined behaviour?

As this question is some years old
Is C++20 'char8_t' the same as our old 'char'?
I would like to know: what is the recommended way to handle the char8_t and char conversion right now? boost::nowide (1.80.0) does not yet understand char8_t, nor (AFAIK) does boost::locale.
As Tom Honermann noted that
reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior.
So: How do I interact with APIs that just accept const char* or const wchar_t* (think Win32 API) if my application's "default" string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.
If I have a std::u8string and convert it to std::string (and back) by
std::u8string convert(std::string str)
{
return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
return std::string(reinterpret_cast<const char*>(str.data()), str.size());
}
This would invoke the same UB that Tom Honermann mentioned. This would be used when I talk to the Win32 API or any other API that wants a const char* or gives a const char* back. I could route all conversions through boost::nowide, but in the end I get a const char* back from boost::nowide::narrow() that I need to cast.
Is the current recommendation to just stay at char and ignore char8_t?
This would invoke the same UB that Tom Honermann mentioned.
As pointed out in the post you referred to, UB only happens when you cast from a char* to a char8_t*. The other direction is fine.
If you are given a char* that is encoded in UTF-8 (and you want to avoid the UB of just doing the cast), you can use std::transform to convert each char to a char8_t:
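#include <algorithm>
#include <string>

std::u8string convert(std::string str)
{
    // basic_string has no size-only constructor, so fill with null characters first.
    std::u8string ret(str.size(), u8'\0');
    std::ranges::transform(str, ret.begin(), [](char c) {return char8_t(c);});
    return ret;
}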
C++23's ranges::to will make using a named return variable unnecessary.
For dealing with wchar_t interfaces (which you shouldn't have to, since nowadays UTF-8 support exists through narrow character interfaces on Windows), you'll have to do an actual UTF-8->UTF-16 conversion. Which you would have had to do anyway.
Personally, I think all the char8_t stuff in C++ is unusable practically!
With the current standard combined with OS support, I would recommend to avoid it, if possible.
But that is not all. There is more to criticize:
Unfortunately the C++ standard itself deprecates its own conversion support before it offers a replacement!
For example, std::filesystem::u8path, which builds a path from a UTF-8 encoded std::string (not a std::u8string), is deprecated as of C++20. As a result, even using a UTF-8 encoded std::string is a pain, because you must constantly convert from one type to the other and back again!
To your questions: it depends on what you want to do. If you want a std::string that is UTF-8 encoded but you only have a std::u8string, you can simply do the following (no reinterpret_cast needed):
std::string convert( std::u8string str )
{
return std::string(str.begin(), str.end());
}
Here I would personally expect the standard to offer a converting move constructor in std::string that takes a std::u8string. Without one, you must always make a copy, with an extra allocation, of data that does not change.
Unfortunately the standard does not offer such simple things. They are forcing the users to do uncomfortable and expensive stuff.
The same is true in the other direction: if you have a std::string and you have verified 100% that it is valid UTF-8, you can convert it directly:
std::u8string convert( std::string str )
{
return std::u8string( str.begin(), str.end() );
}
While writing this long answer I realized that the situation is even worse than I thought when it comes to conversion! If you need to do a real conversion of the encoding, it turns out that std::u8string is not supported at all.
The only possible way (as far as my research goes) is to use std::string as the data holder for the conversion, since the available routines work on char and NOT on char8_t!
So, for the conversion from std::string to std::u8string you must do the following:
Use std::mbrtoc16 or std::mbrtoc32 to convert the narrow chars to either UTF-16 or UTF-32.
Use std::codecvt_utf8 to produce an UTF-8 encoded std::string.
Finally use the routine above to convert from UTF-8 encoded std::string to std::u8string.
For the other way round from std::u8string to std::string you must do the following:
Use the routine above to create a UTF-8 encoded std::string.
Use std::codecvt_utf8 to create an UTF-16/32 string.
And finally use std::c16rtomb or std::c32rtomb to produce a narrow encoded std::string.
But guess what? The codecvt routines are deprecated without a replacement...
So, personally, I would recommend to use the Windows API for it and use std::string only (or on Windows std::wstring). Usually only on Windows the std::string / char is encoded with a Windows code page and everywhere else you can normally expect it is UTF-8 (except maybe for Mainframes and some very rare old systems).
The conclusion can only be: Don't mess around with char8_t and std::u8string at all. It is practically unusable.

Change narrow string encoding or missing std::filesystem::path::imbue

I'm on Windows and I'm constructing std::filesystem::path from std::string. According to constructor reference (emphasis mine):
If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
If I understand correctly, this means string content will be treated as encoded in ANSI under Windows. To treat it as encoded in UTF-8, I need to use std::filesystem::u8path() function. See the demo: http://rextester.com/PXRH65151
I want constructor of path to treat contents of narrow string as UTF-8 encoded. For boost::filesystem::path I could use imbue() method to do this:
boost::filesystem::path::imbue(std::locale(std::locale(), new std::codecvt_utf8_utf16<wchar_t>()));
However, I do not see such method in std::filesystem::path. Is there a way to achieve this behavior for std::filesystem::path? Or do I need to spit u8path all over the place?
My solution to this problem is to fully alias std::filesystem in a different namespace named std::u8filesystem, with classes and methods that treat std::string as UTF-8 encoded. Each class inherits from its counterpart of the same name in std::filesystem, without adding any field or virtual method, to offer full API/ABI interoperability. Full proof of concept code here, tested only on Windows so far and far from complete. The following snippet shows the core workings of the helper:
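#include <filesystem>
#include <string>

std::wstring U8ToW(const std::string &string); // UTF-8 -> UTF-16, defined elsewhere
namespace std // note: adding declarations to namespace std is formally undefined behavior
{
namespace u8filesystem
{
#ifdef _WIN32
class path : public filesystem::path
{
public:
    path(const std::string &string)
        : filesystem::path(U8ToW(string))
    {
    }
    inline std::string string() const
    {
        // C++17 only: since C++20, u8string() returns std::u8string instead.
        return filesystem::path::u8string();
    }
};
#else
using namespace filesystem;
#endif
}
}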
For the sake of performance, path does not have a global way to define locale conversions. Since C++ before C++20 has no specific type for UTF-8 strings, the system assumes any char strings are in the native narrow encoding. So if you want to use UTF-8 strings, you have to spell it out explicitly, either by providing an appropriate conversion locale to the constructor or by using u8path.
C++20 gave us char8_t, which is always presumed to be UTF-8. So if you consistently use char8_t-based strings (like std::u8string), path's implicit conversion will pick up on it and work appropriately.

How to Deal with Varying String types?

I have to work with an API that uses Microsoft's TCHAR macros and such, so I was wondering if I could use C++ in a way that simplifies the task. Is there a way to support implicit conversion, and why doesn't std::string support converting from a smaller char size?
#include <Windows.h>
using String = std::basic_string<TCHAR>; // say TCHAR = wchar_t or equivalent
String someLiteralString = "my simple ansi string"; // Error here obviously
// some polymorphic class...
const TCHAR* MyOverriddenFunction() override { return someLiteralString.c_str(); }
// end some polymorphic class
The reason implicit conversion isn't supported is that conversion can be complicated. The simple case is when the string to convert is pure ASCII as in your example, but there's no way to guarantee that. The creators of the standard wisely stayed away from that problem.
If you don't know whether your strings are wide-character or not, you can use Microsoft's _T() macro around each string literal to generate the proper characters. But you say you don't want to do that.
Modern Windows programming always uses wide characters in the API. Chances are your program is too, otherwise the code you've shown would not cause an error. It's very unlikely that once you've used wide characters you'll switch back to narrow ones. A simple one-character change to your literals will make them wide-character to match the string type:
String someLiteralString = L"my simple ansi string";
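For narrow strings that are only known at run time, the "pure ASCII" simple case mentioned above could be handled with a sketch like the following. The name widen_ascii is hypothetical, and the code assumes that wchar_t uses the ASCII values for these characters, which holds on mainstream platforms. Anything non-ASCII is rejected because it would need a real encoding conversion.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: widens a narrow string that is known to be pure ASCII.
// Bytes outside 0x00..0x7F depend on the source encoding, so they are
// rejected instead of being guessed at.
std::wstring widen_ascii(std::string_view narrow)
{
    std::wstring wide;
    wide.reserve(narrow.size());
    for (const char c : narrow) {
        if (static_cast<unsigned char>(c) > 0x7F) {
            throw std::invalid_argument("widen_ascii: non-ASCII byte in input");
        }
        wide.push_back(static_cast<wchar_t>(c));
    }
    return wide;
}
```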
Use the (ATL/MFC) CStringT class, it will make your life much easier.
http://msdn.microsoft.com/en-us/library/ms174284(v=vs.80).aspx

Is there a typical pattern for handling wide character strings in exceptions?

Standard C++'s std::exception::what() returns a narrow character string. Therefore, if I want to put a wide character string message there, I can't.
Is there a common way/pattern/library of/for getting around this?
EDIT: To be clear, I could just write my own exception class and inherit from it -- but I'm curious if there's a more or less standard implementation of this. boost::exception seems to do most of what I was thinking of....
Based on this post Exceptions with Unicode what(), I decided to do something like this:
class uexception : public std::exception {
public:
uexception(LPCTSTR lpszMessage)
: std::exception(TCharToUtf8(lpszMessage)) { }
};
Everywhere in my code base, I am assuming that .what() will return a string that is encoded in UTF-8. My conversion routines from UTF-8 to TCHAR will skip unrecognized UTF-8 sequences, and replace them with ?. That way, if .what() returns something that isn't valid UTF-8, it won't be an epic fail.
The code has not been compiled (later today - have to fix some other things first! :). I also apologize for the MFC-isms in there, but I think the message gets across anyway.
You can put anything there, but if third-party code expects a const char* from what(), you should return const char* from it.
For your code - just derive from std::exception and add const wchar_t* wwhat() method.
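A minimal sketch of that suggestion. The class name wide_exception and the two-message constructor are assumptions for illustration; the narrow message keeps third-party code that calls what() working.

```cpp
#include <exception>
#include <string>
#include <utility>

// Sketch: what() stays narrow for code that only knows std::exception,
// while wwhat() exposes the wide-character message.
class wide_exception : public std::exception {
public:
    wide_exception(std::string narrow, std::wstring wide)
        : narrow_(std::move(narrow)), wide_(std::move(wide)) {}

    const char* what() const noexcept override { return narrow_.c_str(); }
    const wchar_t* wwhat() const noexcept { return wide_.c_str(); }

private:
    std::string narrow_;
    std::wstring wide_;
};
```

A handler that catches std::exception sees only what(); a handler that catches wide_exception can use wwhat() as well.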
Well, in Qt you get QString for strings, and that string is always in unicode. Not that you should go for Qt just for the sake of exceptions, but still :)

Exceptions with Unicode what()

Or, "how do Russians throw exceptions?"
The definition of std::exception is:
namespace std {
class exception {
public:
exception() throw();
exception(const exception&) throw();
exception& operator=(const exception&) throw();
virtual ~exception() throw();
virtual const char* what() const throw();
};
}
A popular school of thought for designing exception hierarchies is to derive from std::exception:
Generally, it's best to throw objects, not built-ins. If possible, you should throw instances of classes that derive (ultimately) from the std::exception class. By making your exception class inherit (ultimately) from the standard exception base-class, you are making life easier for your users (they have the option of catching most things via std::exception), plus you are probably providing them with more information (such as the fact that your particular exception might be a refinement of std::runtime_error or whatever).
But in the face of Unicode, it seems to be impossible to design an exception hierarchy that achieves both of the following:
Derives ultimately from std::exception for ease of use at the catch site
Provides Unicode compatibility so that diagnostics are not sliced or gibberish
Coming up with an exception class that can be constructed with Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called (if the source string uses characters not representable by 7-bit ASCII), it might be impossible to format the message without loss of fidelity.
How do you design an exception hierarchy that combines the seamless integration of a std::exception-derived class with lossless Unicode diagnostics?
char* does not mean ASCII. You could use an 8 bit Unicode encoding like UTF-8. char could also be 16 bit or more, you could then use UTF-16.
Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, it might have a hard time displaying the string though. (It can't know it's UTF-8, can it?)
On the other hand, for ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.) displaying a UTF-8 string will "just" display some gibberish, and you (or your user) might be fine with that if you cannot disambiguate between a char* in the locale character set and UTF-8.
Personally I think only low level error messages should go into what() strings and personally I think these should be english anyway. (Maybe combined with some error number or whatnot.)
The worst problem I see with what() is that it is not uncommon to include some contextual details in the what() message, for example a filename. Filenames are non ASCII rather often, so you are left with no choice but to use UTF-8 as the what() encoding.
Note also that your exception class (that's derived from std::exception) can obviously provide any access methods you like and so it might make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().
Edit: Regarding John's comment on how to return UTF-8:
If you have a const char* what() function this function essentially returns a bunch of bytes. On a western european windows platform, these bytes would usually be encoded as Win1252, but on a russian windows it might as well be Win1251.
What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and who is interpreting them). A string literal's encoding is defined at compile time, but at runtime it's still up to the application how to interpret these bytes.
So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:
The input message to your exception has a well defined encoding
You have a well defined encoding for the string member you use to hold the message.
You appropriately convert the encoding when what() is called
Example:
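struct MyExc : virtual public std::exception {
    MyExc(const char* msg)
        : exception(msg) // non-standard: only MSVC's std::exception has this constructor
    { }
    // const, so it can be called through the const reference in the catch block
    std::string what_utf8() const {
        return convert_iso8859_1_to_utf8( what() );
    }
};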
// In a ISO-8859-1 encoded source file
const char* my_err_msg = "ISO-8859-1 ... äöüß ...";
...
throw MyExc(my_err_msg);
...
catch(MyExc const& e) {
std::string iso8859_1_msg = e.what();
std::string utf_msg = e.what_utf8();
...
The conversion could also be placed in the (overridden) what() member function of MyExc() or you could define the exception to take an already UTF-8 encoded string or you could convert (from an expected input encoding, maybe wchar_t/UTF-16) in the ctor.
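The example above leaves convert_iso8859_1_to_utf8 undefined. One possible implementation, given here only as a sketch: ISO-8859-1 maps byte for byte onto U+0000..U+00FF, so every byte below 0x80 is copied and every other byte becomes a two-byte UTF-8 sequence.

```cpp
#include <string>
#include <string_view>

// Sketch of the helper used in the example above. Each ISO-8859-1 byte
// equals its Unicode code point, so bytes >= 0x80 need exactly two
// UTF-8 bytes: 110000xx 10xxxxxx.
std::string convert_iso8859_1_to_utf8(std::string_view latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size());
    for (const char c : latin1) {
        const unsigned char byte = static_cast<unsigned char>(c);
        if (byte < 0x80) {
            utf8.push_back(c);
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}
```

This only works for ISO-8859-1; other code pages such as Win1251 or Win1252 need a lookup table or a platform conversion routine.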
The first question is what do you intend to do with the what() string?
Do you plan to log the information somewhere?
If so, you should not be using the content of the what() string directly; you should use that string as a key to look up the correct locale-specific logging message. So to me the content of what() is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).
Now, it can be useful for the what() string to contain a human-readable message that helps developers with quick debugging (but highly readable, polished text is not required for this). As a result there is no reason to support anything more than ASCII. Obey the KISS principle.
A const char* doesn't have to point to an ASCII string; it can be in a multi-byte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.
I usually just define my own base exception class, which uses wstring instead of string in the constructor and returns a const wstring& from what(). It's not that big of a deal. The lack of a standard one is a pretty big oversight.
Another valid opinion is that exception strings should never be presented to the user, so localizing them isn't necessary and so you don't have to worry about any of the above.
The standard doesn't specify the encoding of the string returned by what(), nor is there any de facto standard. In my projects I just encode it as UTF-8 and return that from what(). Of course there may be incompatibilities with other libraries.
See also: https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
for why UTF-8 is good choice.
A better way is to deal with Unicode where the error is processed:
try
{
// some code
}
catch (std::exception & ex)
{
report_problem(ex.what());
}
And :
void report_problem(char const * const message)
{
// here we can convert char to wchar_t or do some more else
// log it, save to file or message to user
}
what() is generally not meant to display a message to a user. Among other things the text it returns is not localizable (even if it was Unicode). I'd just use what() to display something of value to you as the developer (like the source file and line number of the place where the exception was raised) and for that sort of text, ASCII is usually more than enough.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
Edit: Made CW, commenters may edit in why this link is relevant if they wish