Exceptions with Unicode what() - C++

Or, "how do Russians throw exceptions?"
The definition of std::exception is:
namespace std {
    class exception {
    public:
        exception() throw();
        exception(const exception&) throw();
        exception& operator=(const exception&) throw();
        virtual ~exception() throw();
        virtual const char* what() const throw();
    };
}
A popular school of thought for designing exception hierarchies is to derive from std::exception:
Generally, it's best to throw objects, not built-ins. If possible, you should throw instances of classes that derive (ultimately) from the std::exception class. By making your exception class inherit (ultimately) from the standard exception base-class, you are making life easier for your users (they have the option of catching most things via std::exception), plus you are probably providing them with more information (such as the fact that your particular exception might be a refinement of std::runtime_error or whatever).
But in the face of Unicode, it seems to be impossible to design an exception hierarchy that achieves both of the following:
Derives ultimately from std::exception for ease of use at the catch site
Provides Unicode compatibility so that diagnostics are not sliced or gibberish
Coming up with an exception class that can be constructed with Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called (if the source string uses characters not representable by 7-bit ASCII), it might be impossible to format the message without loss of fidelity.
How do you design an exception hierarchy that combines the seamless integration of a std::exception-derived class with lossless Unicode diagnostics?

char* does not mean ASCII. You could use an 8-bit Unicode encoding like UTF-8. char could also be 16 bits or more, in which case you could use UTF-16.

Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, it might have a hard time displaying the string though. (It can't know it's UTF-8, can it?)
On the other hand, for ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.) displaying a UTF-8 string will "just" display some gibberish, and you (or your user) might be fine with that if you cannot disambiguate between a char* in the locale character set and UTF-8.
Personally, I think only low-level error messages should go into what() strings, and these should be English anyway (maybe combined with some error number or whatnot).
The worst problem I see with what() is that it is not uncommon to include some contextual details in the what() message, for example a filename. Filenames are non-ASCII rather often, so you are left with no choice but to use UTF-8 as the what() encoding.
Note also that your exception class (that's derived from std::exception) can obviously provide any access methods you like and so it might make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().
Edit: Regarding John's comment on how to return UTF-8:
If you have a const char* what() function, this function essentially returns a bunch of bytes. On a Western European Windows platform, these bytes would usually be encoded as Win1252, but on a Russian Windows it might just as well be Win1251.
What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and on who is interpreting them). A string literal's encoding is defined at compile time, but at runtime it is still up to the application how to interpret those bytes.
So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:
The input message to your exception has a well-defined encoding,
You have a well-defined encoding for the string member you use to hold the message, and
You appropriately convert the encoding when what() is called.
Example:
struct MyExc : virtual public std::exception {
    explicit MyExc(const char* msg) : msg_(msg) {}
    virtual ~MyExc() throw() {}
    const char* what() const throw() { return msg_.c_str(); }
    std::string what_utf8() const {
        // convert_iso8859_1_to_utf8 is a user-supplied conversion helper
        return convert_iso8859_1_to_utf8(what());
    }
private:
    std::string msg_; // held in ISO-8859-1, as thrown below
};
// In an ISO-8859-1 encoded source file
const char* my_err_msg = "ISO-8859-1 ... äöüß ...";
...
throw MyExc(my_err_msg);
...
catch (MyExc const& e) {
    std::string iso8859_1_msg = e.what();
    std::string utf8_msg = e.what_utf8();
    ...
}
The conversion could also be placed in the (overridden) what() member function of MyExc, or you could define the exception to take an already UTF-8 encoded string, or you could convert (from an expected input encoding, maybe wchar_t/UTF-16) in the ctor.
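For illustration, here is a minimal sketch of that last variant: converting from a wide (assumed UTF-16) input in the ctor, so the what() inherited from std::runtime_error already returns UTF-8. The helper name and the use of std::wstring_convert (deprecated since C++17, but adequate for a sketch) are my choices, not from the answer above:
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Assumes wchar_t holds UTF-16 code units, as on Windows.
inline std::string wide_to_utf8(const std::wstring& w)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(w);
}

struct MyUtf8Exc : std::runtime_error {
    explicit MyUtf8Exc(const std::wstring& wide_msg)
        : std::runtime_error(wide_to_utf8(wide_msg)) {} // stored as UTF-8
};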

The first question is what do you intend to do with the what() string?
Do you plan to log the information somewhere?
If so, you should not be using the content of the what() string directly; you should be using that string as a key to look up the correct locale-specific logging message. So to me the content of what() is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).
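A minimal sketch of that lookup idea (the class name, key scheme, and inline catalog are illustrative; a real catalog would come from localized resources):
#include <exception>
#include <map>
#include <string>

class keyed_error : public std::exception {
public:
    explicit keyed_error(const char* key) : key_(key) {}
    const char* what() const throw() { return key_; }
private:
    const char* key_; // a stable, ASCII-only message ID
};

std::string localized_message(const std::exception& e)
{
    static const std::map<std::string, std::string> catalog = {
        { "ERR_FILE_NOT_FOUND", "Файл не найден" }, // UTF-8 display text
    };
    std::map<std::string, std::string>::const_iterator it = catalog.find(e.what());
    return it != catalog.end() ? it->second : e.what();
}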
Now, it can be useful for the what() string to contain a human-readable message to help developers debug quickly (but for this, highly polished text is not required). As a result, there is no reason to support anything more than ASCII. Obey the KISS principle.

A const char* doesn't have to point to an ASCII string; it can be in a multi-byte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.
I usually just define my own base exception class, which uses wstring instead of string in the constructor and returns a const wstring& from what(). It's not that big of a deal. The lack of a standard one is a pretty big oversight.
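Such a base class might look like the sketch below (names are mine). One caveat: if it also derives from std::exception, what() cannot be overridden with a wide return type, so this version adds a separate wide accessor and keeps a fixed narrow what() for generic catch sites:
#include <exception>
#include <string>

class wexception : public std::exception {
public:
    explicit wexception(const std::wstring& msg) : msg_(msg) {}
    virtual ~wexception() throw() {}
    const std::wstring& wwhat() const throw() { return msg_; }
    const char* what() const throw() { return "wexception (see wwhat())"; }
private:
    std::wstring msg_;
};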
Another valid opinion is that exception strings should never be presented to the user, so localizing them isn't necessary and so you don't have to worry about any of the above.

The standard doesn't specify the encoding of the string returned by what(), nor is there any de facto standard. In my projects, I just encode it as UTF-8 and return that from what(). Of course there may be incompatibility with other libraries.
See also https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful for why UTF-8 is a good choice.

A better way to add Unicode to error processing:
try
{
    // some code
}
catch (std::exception& ex)
{
    report_problem(ex.what());
}
And:
void report_problem(char const* const msg)
{
    // here we can convert char to wchar_t or do something else:
    // log it, save it to a file, or show a message to the user
}
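A hedged sketch of what report_problem might do on the conversion path, assuming the incoming bytes match the current locale (mbstowcs is locale-dependent) and glossing over the fact that a real program would commit to a single stream orientation:
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <vector>

void report_problem(char const* const msg)
{
    // Requires setlocale(LC_CTYPE, "") to have been called at startup.
    std::size_t len = std::mbstowcs(NULL, msg, 0); // measure only
    if (len == static_cast<std::size_t>(-1)) {
        std::cerr << msg << '\n'; // invalid sequence: fall back to raw bytes
        return;
    }
    std::vector<wchar_t> wide(len + 1);
    std::mbstowcs(&wide[0], msg, wide.size());
    std::wcerr << &wide[0] << L'\n'; // log / display the wide message
}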

what() is generally not meant to display a message to a user. Among other things, the text it returns is not localizable (even if it were Unicode). I'd just use what() to convey something of value to the developer (like the source file and line number of the place where the exception was raised), and for that sort of text, ASCII is usually more than enough.
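For instance, a small file-and-line macro (the macro and its formatting are my own illustration) covers that developer-facing use with plain ASCII:
#include <stdexcept>
#include <string>

#define THROW_HERE(msg) \
    throw std::runtime_error(std::string(__FILE__) + ":" + std::to_string(__LINE__) + ": " + (msg))

// Usage: THROW_HERE("config parse failed");
// what() then yields e.g. "parser.cpp:42: config parse failed" - ASCII, developer-only.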

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky

Related

How to convert char* into std::u8string?

Intro
If I catch an exception, I want to convert the error message, which is returned as a C-style string by the what() method, into a std::u8string (a UTF-8 string). For example: std::u8string(error.what());
Problem
How can I convert a char* into a std::u8string?
Additional Information
I only catch exceptions from the standard library, boost and eigen.
My application is Windows dependent, so the solution doesn't need to be portable.
You can use the constructor that takes a beginning and an ending iterator for the sequence that defines the string.
#include <cstring>
// ...
auto cstr = error.what();
std::u8string str{cstr, cstr + strlen(cstr)};
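Wrapped as a small helper (the name to_u8string is mine), with the caveat that this only relabels the bytes as char8_t; it does not validate or transcode them, so it is only meaningful if what() really produced UTF-8:
#include <cstring>
#include <string>

inline std::u8string to_u8string(const char* s)
{
    return std::u8string(s, s + std::strlen(s)); // copies bytes unchanged
}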

Change narrow string encoding or missing std::filesystem::path::imbue

I'm on Windows and I'm constructing std::filesystem::path from std::string. According to constructor reference (emphasis mine):
If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
If I understand correctly, this means string content will be treated as encoded in ANSI under Windows. To treat it as encoded in UTF-8, I need to use std::filesystem::u8path() function. See the demo: http://rextester.com/PXRH65151
I want constructor of path to treat contents of narrow string as UTF-8 encoded. For boost::filesystem::path I could use imbue() method to do this:
boost::filesystem::path::imbue(std::locale(std::locale(), new std::codecvt_utf8_utf16<wchar_t>()));
However, I do not see such method in std::filesystem::path. Is there a way to achieve this behavior for std::filesystem::path? Or do I need to spit u8path all over the place?
My solution to this problem is to fully alias std::filesystem into a different namespace named std::u8filesystem, with classes and methods that treat std::string as UTF-8 encoded. Classes inherit from their counterparts in std::filesystem with the same name, without adding any fields or virtual methods, to offer full API/ABI interoperability. Full proof-of-concept code here; tested only on Windows so far and far from complete. The following snippet shows the core of the helper:
std::wstring U8ToW(const std::string &string);

namespace std
{
    namespace u8filesystem
    {
#ifdef _WIN32
        class path : public filesystem::path
        {
        public:
            path(const std::string &string)
                : filesystem::path(U8ToW(string))
            {
            }

            inline std::string string() const
            {
                return filesystem::path::u8string();
            }
        };
#else
        using namespace filesystem;
#endif
    }
}
For the sake of performance, path does not have a global way to define locale conversions. Since C++ pre-20 does not have a specific type for UTF-8 strings, the system assumes any char strings are narrow character strings. So if you want to use UTF-8 strings, you have to spell it out explicitly, either by providing an appropriate conversion locale to the constructor or by using u8path.
C++20 gave us char8_t, which is always presumed to be UTF-8. So if you consistently use char8_t-based strings (like std::u8string), path's implicit conversion will pick up on it and work appropriately.
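A short C++20 sketch of that behavior (the filename is just an example):
#include <filesystem>
#include <string>

std::u8string name = u8"данные/файл.txt"; // UTF-8 by construction
std::filesystem::path p{name};            // decoded as UTF-8, even on Windows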

What is the purpose of std::exception::what()?

I can only think of the following situations where std::exception::what() is used:
For debug purpose. In my Visual Studio to see e.what() I have to manually add it to the watch list. Isn't it better to have a member std::string (so the debugger directly shows it in the object inspector), and only include it in non-NDEBUG builds? At least they should disable what() in NDEBUG builds.
Output it, e.g. MessageBox(e.what()) or cout << e.what(). As far as I know these messages are useless for many users. For example when I try to rename a file that doesn't exist:
boost::filesystem::rename: 系统找不到指定的文件。: "D:\MyDesktop\4", "D:\MyDesktop\5"
(The Chinese words mean "The system cannot find the file specified.") How are users supposed to decipher this mixture? Also, it is a const char* instead of something like const platform_char*, which may cause Unicode problems on Windows.
Extract data from it, e.g. std::regex_match(e.what()...). I think it's a terrible idea that shows design flaws.
So where should I use std::exception::what()? Is it useless?
A programmer is supposed to derive a class from std::exception and tailor what() to the specific requirements. Then it can be very useful.
It's also useful to report something back (e.g. in plain text for logging), which is why the standard mandates a concrete std::exception::what() rather than a pure virtual function.
what() is generic in the sense that it means what you want it to mean for your own exception classes. In many cases it is used only for logging, but in other cases it may provide information that can be used to recover from an exceptional situation.
So where should I use std::exception::what()? Is it useless?
The error message of std::exception is a char* because it's supposed to be an as-simple-as-possible user-facing diagnostics message giving details on the error.
For boost::system_error (and std::system_error), you can also get the OS-level error code (for which the user-friendly message is "file not found").
Valid uses:
if you want to identify the error type, catch the specialization std::system_error or boost::system_error and perform a switch/if on the code() function (i.e. instead of running a regex on the message).
if you want to display an explanation of the error for diagnostics (logging) or user friendliness (to console/GUI) just display the error message (what()).
std::exception is a base class (i.e. it has a virtual destructor). If you implement your own exception classes, follow the same pattern as std::system_error: use an error code to easily identify an error, and create your exception's base class (std::exception, std::runtime_error or std::logic_error usually) with a text message depending on the error type.
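A minimal sketch of that pattern (the enum and class names are illustrative, not from any standard library):
#include <stdexcept>
#include <string>

// Error identity lives in a code, not in the message text.
enum class my_errc { config_missing = 1, config_corrupt = 2 };

class my_error : public std::runtime_error {
public:
    my_error(my_errc code, const std::string& msg)
        : std::runtime_error(msg), code_(code) {}
    my_errc code() const noexcept { return code_; }
private:
    my_errc code_;
};

// At the catch site, branch on code() instead of parsing what():
//   catch (const my_error& e) { if (e.code() == my_errc::config_missing) ... }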
Ultimately, the type of the error message is char* instead of std::string, wstring or your-platform-specific-char-type because it sacrifices flexibility for reliability: with a char* everybody knows how to use it, that it has no encoding (except ASCII), that it's null terminated and once allocated, it does not fail (it doesn't generate new exceptions).
Using something that could fail (in any way) when constructing an exception would be disastrous: you would fail while creating/using a diagnostics message of your error handling code.
This means you could either not throw the exception (and your application would remain in an invalid state), or throw an exception while creating an exception instance. You really don't want the latter, as it would either discard the exception you wanted to throw (hiding the error) or throw an exception with a missing or corrupted error code (which could waste months of development time tracking down the wrong error).
Consider this example (unnecessarily complicated, but it makes a point):
#include <new>

int* p = new (std::nothrow) int(10); // I want to throw a complicated
                                     // string message exception
if (nullptr == p)
    throw complicated_exception(std::string("error message here"));
The throw line will presumably fail: the application has so little memory available that it won't even allocate an int, let alone a string. The result of this code is that in low-memory conditions I will still get a std::bad_alloc - one generated by the std::string constructor.
std::exception is a good implementation: it provides a minimal extendable interface with a good management/restriction of failure points and it gives a good interface for extensions (such as std::system_error).

WChars, Encodings, Standards and Portability

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning.
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa".
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.
The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR-T and UTF.
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:
                       my program
    <-- wcstombs ---  /==============\  --- iconv(UTF8, WCHAR_T) -->
CRT                   |  wchar_t[]   |                               <Disk>
    --- mbstowcs -->  \==============/  <-- iconv(WCHAR_T, UTF8) ---
                                        |
                                        +-- iconv(WCHAR_T, UCS-4) --+
                                                                    |
    ... <--- (adv. Unicode malarkey) ------------------ libicu ----+
Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:
// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char* argv[]); // use mbsrtowcs etc.

int wmain(const std::vector<std::wstring> args); // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
    setlocale(LC_CTYPE, "");
    int argc;
    wchar_t* const* const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char* argv[])
{
    setlocale(LC_CTYPE, "");
    return wmain(parse(argc, argv));
}
#endif
// Serialization utilities
#include <cstdint>
#include <iconv.h>

typedef std::basic_string<std::uint16_t> U16String;
typedef std::basic_string<std::uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
Updates
Following many very nice comments I'd like to add a few observations:
If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++
No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with
int main(int argc, char** argv)
you have already lost Unicode support for command line arguments. You have to write
int wmain(int argc, wchar_t** argv)
instead, or use the GetCommandLineW function, none of which is specified in the C standard.
More specifically,
any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know which encoding you are working with, or use a wrapper library that abstracts them away.
I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.
I would avoid the wchar_t type because it's platform-dependent (not "serializable" by your definition): UTF-16 on Windows and UTF-32 on most Unix-like systems. Instead, use the char16_t and/or char32_t types from C++0x/C1x. (If you don't have a new compiler, typedef them as uint16_t and uint32_t for now.)
DO define functions to convert between UTF-8, UTF-16, and UTF-32 (a signature sketch follows below).
DON'T write overloaded narrow/wide versions of every string function like the Windows API did with -A and -W. Pick one preferred encoding to use internally, and stick to it. For things that need a different encoding, convert as necessary.
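An interface sketch of those conversions (the signatures are my suggestion; the bodies, omitted here, could wrap iconv, ICU, or be hand-rolled):
#include <string>

// Declarations only; each returns the same text in the target encoding.
std::u16string utf8_to_utf16(const std::string& utf8);
std::string    utf16_to_utf8(const std::u16string& utf16);
std::u32string utf8_to_utf32(const std::string& utf8);
std::string    utf32_to_utf8(const std::u32string& utf32);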
The problem with wchar_t is that encoding-agnostic text processing is too difficult and should be avoided. If you stick with "pure C" as you say, you can use all of the w* functions like wcscat and friends, but if you want to do anything more sophisticated then you have to dive into the abyss.
Here are some things that are much harder with wchar_t than they are if you just pick one of the UTF encodings:
Parsing JavaScript: Identifiers can contain certain characters outside the BMP (and let's assume that you care about this kind of correctness).
HTML: How do you turn &#x10000; into a string of wchar_t?
Text editor: How do you find grapheme cluster boundaries in a wchar_t string?
If I know the encoding of a string, I can examine the characters directly. If I don't know the encoding, I have to hope that whatever I want to do with a string is implemented by a library function somewhere. So the portability of wchar_t is somewhat irrelevant as I don't consider it an especially useful data type.
Your program requirements may differ and wchar_t may work fine for you.
Given that iconv is not "pure standard C/C++", I don't think you are satisfying your own specifications.
There are new codecvt facets for char32_t and char16_t, so I don't see how you can go wrong as long as you are consistent and pick one char type + encoding.
The facets are described in 22.5 [locale.stdcvt] (from n3242).
I don't understand how this doesn't satisfy at least some of your requirements:
#include <codecvt>
#include <locale>
#include <string>

namespace ns {

typedef char32_t char_t;
typedef std::u32string string;
// paste the literal prefix, like the Windows TEXT() macro
#define LIT(lit) U ## lit

// Communicate with interface0, which wants UTF-8.
// This type doesn't need to be public at all; I just refactored it.
typedef std::wstring_convert<std::codecvt_utf8<char_t>, char_t> converter0;

inline std::string
to_interface0(string const& s)
{
    return converter0().to_bytes(s);
}

inline string
from_interface0(std::string const& s)
{
    return converter0().from_bytes(s);
}

// Communicate with interface1, which wants UTF-16 (as a byte string).
// Doesn't have to be public either.
typedef std::wstring_convert<std::codecvt_utf16<char_t>, char_t> converter1;

inline std::string
to_interface1(string const& s)
{
    return converter1().to_bytes(s);
}

inline string
from_interface1(std::string const& s)
{
    return converter1().from_bytes(s);
}

} // ns
Then your code can use ns::string, ns::char_t, LIT('A') and LIT("Hello, World!") with reckless abandon, without knowing the underlying representation. Then use from_interfaceX(some_string) whenever it's needed. It doesn't affect the global locale or streams either. The helpers can be as clever as needed; e.g. codecvt_utf8 can deal with "headers", which I assume is Standardese for tricky stuff like the BOM (ditto codecvt_utf16).
In fact I wrote the above to be as short as possible but you'd really want helpers like this:
template<typename... T>
inline ns::string
ns::from_interface0(T&&... t)
{
    return converter0().from_bytes(std::forward<T>(t)...);
}
which give you access to the 3 overloads for each [from|to]_bytes members, accepting things like e.g. const char* or ranges.
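Hypothetical usage of the wrappers above, to make the shape concrete: all internal text stays UTF-32 and conversions happen only at the edges.
ns::string greeting = LIT("Hello, World!");          // char32_t text internally
std::string utf8_out = ns::to_interface0(greeting);  // UTF-8 bytes at the boundary
ns::string round_trip = ns::from_interface0(utf8_out);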

Is there a typical pattern for handling wide character strings in exceptions?

Standard C++'s std::exception::what() returns a narrow character string. Therefore, if I want to put a wide character string message there, I can't.
Is there a common way/pattern/library of/for getting around this?
EDIT: To be clear, I could just write my own exception class and inherit from it -- but I'm curious if there's a more or less standard implementation of this. boost::exception seems to do most of what I was thinking of....
Based on this post Exceptions with Unicode what(), I decided to do something like this:
class uexception : public std::exception {
public:
    explicit uexception(LPCTSTR lpszMessage)
        : m_msg(TCharToUtf8(lpszMessage)) { } // TCharToUtf8 yields a UTF-8 std::string
    const char* what() const throw() { return m_msg.c_str(); }
private:
    std::string m_msg;
};
Everywhere in my code base, I am assuming that .what() will return a string that is encoded in UTF-8. My conversion routines from UTF-8 to TCHAR will skip unrecognized UTF-8 sequences, and replace them with ?. That way, if .what() returns something that isn't valid UTF-8, it won't be an epic fail.
The code has not been compiled (later today - have to fix some other things first! :). I also apologize for the MFC-isms in there, but I think the message gets across anyway.
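On Windows, the lenient decoding described above could lean on MultiByteToWideChar; a sketch, under the assumption that substituting U+FFFD (rather than '?') for bad sequences is acceptable. Without MB_ERR_INVALID_CHARS the call degrades instead of failing, which matches the "won't be an epic fail" goal:
#include <windows.h>
#include <string>

std::wstring Utf8ToWide(const char* utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(n, L'\0');           // n includes the terminator
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &out[0], n);
    out.resize(n - 1);                    // drop the embedded terminator
    return out;
}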
You can put anything there, but if third-party code expects a const char* from what(), you should return const char* from it.
For your code - just derive from std::exception and add const wchar_t* wwhat() method.
Well, in Qt you get QString for strings, and that string is always Unicode. Not that you should go for Qt just for the sake of exceptions, but still :)