Why is wstring_convert throwing a range_error? - c++

I'm writing some code that needs to convert between byte strings and wide strings, using the system locale. When reading from a file, this is incredibly easy to do. I can use std::wifstream, imbue it with std::locale(""), and then just use std::getline.
According to cppreference's codecvt page, wifstream just uses codecvt<wchar_t, char, mbstate_t>, so I thought that I might be able to convert between std::string and std::wstring by using that as well:
// utility wrapper to adapt locale-bound facets for wstring/wbuffer
convert
template<class Facet>
struct deletable_facet : Facet
{
template<class ...Args>
deletable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
~deletable_facet() {}
};
std::locale::global(std::locale(""));
std::wstring_convert<
deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> wconv;
std::wstring wstr = wconv.from_bytes(data);
However, when I try to run this, I get an range_error thrown from wstring_convert. I did some googling around, and apparently this is what happens when wstring_convert fails to convert the string.
However, these strings are clearly perfectly able to be converted using wfstream, which should be using the same codecvt as I am using with wstring_convert. So why does wifstream work, but wstring_convert not?
And is there a way that can I convert between strings and wstrings using the system's locale?
A full example of my problem, adapted from the codecvt page, is here, and the output is:
sizeof(char32_t) = 4
sizeof(wchar_t) = 4
The UTF-8 file contains the following UCS4 code points:
U+007a
U+00df
U+6c34
U+1f34c
The UTF-8 string contains the following UCS4 code points:
U+007a
U+00df
U+6c34
U+1f34c
terminate called after throwing an instance of 'std::range_error'
what(): wstring_convert
Aborted (core dumped)

Yourwifstream and wstring_convert are using different facets.
wifstream is using a locale-dependent conversion facet; it pulls it out of std::locale(""), with which it was imbued, via std::use_facet
wstring_convert was given a locale-independent, standalone codecvt facet, and the one provided by your implementation apparently does not convert UTF-8 into anything fitting; try calling in on it directly to see what it does.
An easy way to get a locale-dependent facet is to ask for it by name, as in
std::codecvt_byname

Related

Change narrow string encoding or missing std::filesystem::path::imbue

I'm on Windows and I'm constructing std::filesystem::path from std::string. According to constructor reference (emphasis mine):
If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
If I understand correctly, this means string content will be treated as encoded in ANSI under Windows. To treat it as encoded in UTF-8, I need to use std::filesystem::u8path() function. See the demo: http://rextester.com/PXRH65151
I want constructor of path to treat contents of narrow string as UTF-8 encoded. For boost::filesystem::path I could use imbue() method to do this:
boost::filesystem::path::imbue(std::locale(std::locale(), new std::codecvt_utf8_utf16<wchar_t>()));
However, I do not see such method in std::filesystem::path. Is there a way to achieve this behavior for std::filesystem::path? Or do I need to spit u8path all over the place?
My solution to this problem is to fully alias the std::filesystem to a different namespace named std::u8filesystem with classes and methods that treat std::string as UTF-8 encoded. Classes inherit their corresponding in std::filesystem with same name, without adding any field or virtual method to offer full API/ABI interoperability. Full proof of concept code here, tested only on Windows so far and far to be complete. The following snippet shows the core working of the helper:
std::wstring U8ToW(const std::string &string);
namespace std
{
namespace u8filesystem
{
#ifdef WIN32
class path : public filesystem::path
{
public:
path(const std::string &string)
: fs::path(U8ToW(path))
{
}
inline std::string string() const
{
return filesystem::path::u8string();
}
}
#else
using namespace filesystem;
#endif
}
}
For the sake of performance, path does not have a global way to define locale conversions. Since C++ pre-20 does not have a specific type for UTF-8 strings, the system assumes any char strings are narrow character strings. So if you want to use UTF-8 strings, you have to spell it out explicitly, either by providing an appropriate conversion locale to the constructor or by using u8path.
C++20 gave us char8_t, which is always presumed to be UTF-8. So if you consistently use char8_t-based strings (like std::u8string), path's implicit conversion will pick up on it and work appropriately.

Encode/Decode std::string to UTF-16

I have to handle a file format (both read from and write to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters out of the ASCII table are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded).
I'm looking for a library (searched in STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to file format (actually modeled as a bytestream) including the generation/recognition of surrogate pairs and all that Unicode stuffs (I'm admittedly no expert with)...
Any suggestions? Thanks!
EDIT: forgot to mention it should be cross-platform (Win / Mac) and cannot use C++11.
C++11 has this functionality:
std::string s = u8"Hello, World!";
// #include <codecvt>
std::wstring_convert<std::codecvt<char16_t,char,std::mbstate_t>,char16_t> convert;
std::u16string u16 = convert.from_bytes(s);
std::string u8 = convert.to_bytes(u16);
However to my knowledge the only implementation that has this so far is libc++. C++11 also has std::codecvt_utf8_utf16<char16_t> which some other implementations have. Specifically, codecvt_utf8_utf16 works in VS 2010 and above, and since wchar_t is used by Windows to represent UTF-16 you can use this to convert between UTF-8 and Windows' native encoding.
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding
schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and
UTF-8 encoding schemes.
— [locale.codecvt] 22.4.1.4/3
Oh, and std::codecvt specializations have protected destructors, and wstring_convert requires access to the destructor so you really need an adapter:
template <class Facet>
class usable_facet : public Facet {
public:
using Facet::Facet; // inherit constructors
~usable_facet() {}
// workaround for compilers without inheriting constructors:
// template <class ...Args> usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
};
template<typename internT, typename externT, typename stateT>
using codecvt = usable_facet<std::codecvt<internT, externT, stateT>>;
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>> convert;
Did you look at Boost.Locale? This page, in particular, describes how to do UTF to UTF conversions and how to integrate it with IOStreams.
I would suggest having a look at:
Convert C++ std::string to UTF-16-LE encoded string
And check out the iconv function. It's a C library, no requirements for C++11.
There's also a Win32 specific iconv library at https://github.com/win-iconv/win-iconv.

What unicode format do I pass tinyxml functions that take char*?

When I call a tinyxml function that takes a char*, what unicode format do I need to pass it?
TiXmlText *element_text = new TiXmlText(string);
The reason is that I am using a wxString object and there is a lot of different encodings I can give it. If I just do string.c_str(), the wxstring object will query the encoding for the current locale and create a char* string in that format. Or if I do string.utf8_str(), it will pass a utf-8 string but it seems like tinyxml will not realize that it's utf-8 encoded already and reencode the utf-8 string as utf-8 (yes, the result is double utf-8 encoding). So when I write out, if I set notepad++ to show utf-8, I see:
baÄŸlam instead of bağlam.
I'd like to do the encoding myself to utf_8 (string.utf8_str()) and not have tinyxml touch it and just write it out.
How do I do this? What format does tinyxml expect to be passed in the function parameter (constructor in the above code)? The answer from testing is not utf-8 though it eventually writes it out as utf-8 if that makes sense.
TinyXML only supports UTF-8 encoding. So if you want to provide characters outside of ASCII, you must provide them in UTF-8.
You may want to look at this section on http://www.grinninglizard.com/tinyxmldocs/index.html
TinyXML can be compiled to use or not use STL. When using STL, TinyXML uses the std::string class, and fully supports std::istream, std::ostream, operator <<, and operator >>. Many API methods have both 'const char*' and 'const std::string&' forms.
When STL support is compiled out, no STL files are included whatsoever. All the string classes are implemented by TinyXML itself. API methods all use the 'const char*' form for input.
Use the compile time define TIXML_USE_STL to compile one version or the other. This can be passed by the compiler, or set as the first line of "tinyxml.h".

Is there an STL string class that properly handles Unicode?

I know all about std::string and std::wstring but they don't seem to fully pay attention to extended character encoding of UTF-8 and UTF-16 (On windows at least). There is also no support for UTF-32.
So does anyone know of cross-platform drop-in replacement classes that provide full UTF-8, UTF-16 and UTF-32 support?
And let's not forget the lightweight, very user-friendly, header-only UTF-8 library UTF8-CPP. Not a drop-in replacement, but can easily be used in conjunction with std::string and has no external dependencies.
Well in C++0x there are classes std::u32string and std::u16string. GCC already partially supports them, so you can already use them, but streams support for unicode is not yet done Unicode support in C++0x.
It's not STL, but if you want proper Unicode in C++, then you should take a look at ICU.
There is no support of UTF-8 on the STL. As an alternative youo can use boost codecvt:
//...
// My encoding type
typedef wchar_t ucs4_t;
std::locale old_locale;
std::locale utf8_locale(old_locale,new utf8_codecvt_facet<ucs4_t>);
// Set a New global locale
std::locale::global(utf8_locale);
// Send the UCS-4 data out, converting to UTF-8
{
std::wstringstream oss;
oss.imbue(utf8_locale);
std::copy(ucs4_data.begin(),ucs4_data.end(),
std::ostream_iterator<ucs4_t,ucs4_t>(oss));
std::wcout << oss.str() << std::endl;
}
For UTF-8 support, there is the Glib::ustring class. It is modeled after std::string but is utf-8 aware,e.g. when you are scanning the string with an iterator. It also has some restrictions, e.g. the iterator is always const, as replacing a character can change the length of the string and so it can invalidate other iterators.
ustring does not automatically converts other encodings to utf-8, Glib library has various conversion functions for this. You can validate whether the string is a valid utf-8 though.
And also, ustring and std::string are interchangeable, i.e. ustring has a cast operator to std::string so you can pass a ustring as a parameter where an std::string is expected, and vice versa of course, as ustring can be constructed from std::string.
Qt has QString which uses UTF-16 internally, but has methods for converting to or from std::wstring, UTF-8, Latin1 or locale encoding. There is also the QTextCodec class which can convert QStrings to or from basically anything. But using Qt for just strings seems like an overkill to me.
Also look at http://grigory.info/UTF8Strings.About.html it is UTF8 native.

UnicodeString to char* (UTF-8)

I am using the ICU library in C++ on OS X. All of my strings are UnicodeStrings, but I need to use system calls like fopen, fread and so forth. These functions take const char* or char* as arguments. I have read that OS X supports UTF-8 internally, so that all I need to do is convert my UnicodeString to UTF-8, but I don't know how to do that.
UnicodeString has a toUTF8() member function, but it returns a ByteSink. I've also found these examples: http://source.icu-project.org/repos/icu/icu/trunk/source/samples/ucnv/convsamp.cpp and read about using a converter, but I'm still confused. Any help would be much appreciated.
call UnicodeString::extract(...) to extract into a char*, pass NULL for the converter to get the default converter (which is in the charset which your OS will be using).
ICU User Guide > UTF-8 provides methods and descriptions of doing that.
The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).
And extract() is not prefered now.
Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above.
This will work:
std::string utf8;
uStr.toUTF8String(utf8);