Is there an STL string class that properly handles Unicode? - c++

I know all about std::string and std::wstring, but they don't seem to properly handle extended character encodings such as UTF-8 and UTF-16 (on Windows at least). There is also no support for UTF-32.
So does anyone know of cross-platform drop-in replacement classes that provide full UTF-8, UTF-16 and UTF-32 support?

And let's not forget the lightweight, very user-friendly, header-only UTF-8 library UTF8-CPP. Not a drop-in replacement, but can easily be used in conjunction with std::string and has no external dependencies.
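As a rough sketch of how it pairs with std::string (the function names are from the library's documented API; the sample string is made up):

#include <string>
#include <iterator>
#include "utf8.h" // the single UTF8-CPP header

int main()
{
    std::string text = "\xC5\xBElu\xC5\xA5"; // "žluť" as raw UTF-8 bytes
    // Validate before processing untrusted input
    bool ok = utf8::is_valid(text.begin(), text.end());
    // Count code points rather than bytes
    std::ptrdiff_t cps = utf8::distance(text.begin(), text.end());
    // Convert to UTF-16, appending to a std::u16string
    std::u16string u16;
    utf8::utf8to16(text.begin(), text.end(), std::back_inserter(u16));
    return (ok && cps == 4) ? 0 : 1;
}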

Well, in C++0x there are the classes std::u32string and std::u16string. GCC already partially supports them, so you can use them today, but stream support for Unicode is not yet done; see the question Unicode support in C++0x.
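In short, with the new literals it looks like this:

#include <string>

std::u16string s16 = u"UTF-16 text";  // char16_t-based string
std::u32string s32 = U"UTF-32 text";  // char32_t-based string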

It's not STL, but if you want proper Unicode in C++, then you should take a look at ICU.

There is no UTF-8 support in the STL. As an alternative you can use Boost's utf8_codecvt_facet:
//...
// My encoding type
typedef wchar_t ucs4_t;

std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

// Set a new global locale
std::locale::global(utf8_locale);

// Send the UCS-4 data out, converting to UTF-8
{
    std::wstringstream oss;
    oss.imbue(utf8_locale);
    std::copy(ucs4_data.begin(), ucs4_data.end(),
              std::ostream_iterator<ucs4_t, ucs4_t>(oss));
    std::wcout << oss.str() << std::endl;
}

For UTF-8 support, there is the Glib::ustring class. It is modeled after std::string but is UTF-8 aware, e.g. when you are scanning the string with an iterator. It also has some restrictions, e.g. the iterator is always const, since replacing a character can change the length of the string and thus invalidate other iterators.
ustring does not automatically convert other encodings to UTF-8; the Glib library has various conversion functions for that. You can, however, validate whether the string is valid UTF-8.
Also, ustring and std::string are interchangeable: ustring has a cast operator to std::string, so you can pass a ustring as a parameter where an std::string is expected, and vice versa of course, since a ustring can be constructed from a std::string.
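A rough sketch of the points above (the method names are from glibmm's documented Glib::ustring API; the sample string is made up):

#include <glibmm/ustring.h>
#include <iostream>
#include <string>

int main()
{
    Glib::ustring u = "na\xC3\xAFve";   // "naïve" as raw UTF-8 bytes
    std::cout << u.length() << '\n';    // 5 code points
    std::cout << u.bytes()  << '\n';    // 6 bytes
    std::string s = u;                  // the cast operator mentioned above
    Glib::ustring back(s);              // and construction from std::string
    return u.validate() ? 0 : 1;        // check the bytes are valid UTF-8
}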

Qt has QString, which uses UTF-16 internally but has methods for converting to or from std::wstring, UTF-8, Latin-1 or the locale encoding. There is also the QTextCodec class, which can convert QStrings to or from basically anything. But using Qt just for strings seems like overkill to me.
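For illustration, a small sketch of those conversions (all methods are part of Qt's documented QString API):

#include <QString>
#include <QByteArray>
#include <string>

void sketch()
{
    QString q = QString::fromUtf8("\xC3\xA9t\xC3\xA9"); // "été" from UTF-8 bytes
    QByteArray utf8   = q.toUtf8();        // back to UTF-8
    QByteArray latin1 = q.toLatin1();      // to Latin-1
    QByteArray local  = q.toLocal8Bit();   // to the locale encoding
    std::wstring w    = q.toStdWString();  // to std::wstring
}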

Also look at http://grigory.info/UTF8Strings.About.html; it is UTF-8 native.

Related

char8_t and utf8everywhere: How to convert to const char* APIs without invoking undefined behaviour?

As this question is some years old:
Is C++20 'char8_t' the same as our old 'char'?
I would like to know: what is the recommended way to handle char8_t and char conversion right now? boost::nowide (1.80.0) doesn't yet understand char8_t, nor (AFAIK) does boost::locale.
As Tom Honermann noted:
reinterpret_cast<const char *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text"); // Undefined behavior.
So: how do I interact with APIs that just accept const char* or const wchar_t* (think Win32 API) if my application's "default" string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.
If I have a std::u8string and convert between it and std::string by
std::u8string convert(std::string str)
{
    return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
    return std::string(reinterpret_cast<const char*>(str.data()), str.size());
}
This would invoke the same UB that Tom Honermann mentioned. This would be used when I talk to the Win32 API or any other API that wants a const char* or gives a const char* back. I could route all conversions through boost::nowide, but in the end I get a const char* back from boost::nowide::narrow() that I need to cast.
Is the current recommendation to just stay with char and ignore char8_t?
This would invoke the same UB that Tom Honermann mentioned.
As pointed out in the post you referred to, UB only happens when you cast from a char* to a char8_t*. The other direction is fine.
If you are given a char* which is encoded in UTF-8 (and you care to avoid the UB of just doing the cast for some reason), you can convert the chars to char8_ts with std::ranges::transform:
// needs <algorithm> and <string>
std::u8string convert(std::string str)
{
    std::u8string ret(str.size(), u8'\0');
    std::ranges::transform(str, ret.begin(), [](char c) { return char8_t(c); });
    return ret;
}
C++23's ranges::to will make using a named return variable unnecessary.
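That might look like this sketch (assuming a C++23 compiler that ships std::ranges::to):

#include <ranges>
#include <string>

std::u8string convert(const std::string& str)
{
    return str
         | std::views::transform([](char c) { return char8_t(c); })
         | std::ranges::to<std::u8string>();
}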
For dealing with wchar_t interfaces (which you shouldn't have to, since nowadays UTF-8 support exists through narrow character interfaces on Windows), you'll have to do an actual UTF-8->UTF-16 conversion. Which you would have had to do anyway.
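On Windows, that UTF-8 to UTF-16 conversion might look like this minimal sketch using MultiByteToWideChar (error handling omitted; utf8_to_wide is a made-up helper name):

#include <windows.h>
#include <string>

// UTF-8 std::string -> UTF-16 std::wstring (Windows only)
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty()) return {};
    // First call computes the required length in wchar_ts
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call performs the conversion into the buffer
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}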
Personally, I think all the char8_t stuff in C++ is practically unusable!
With the current standard combined with OS support, I would recommend avoiding it, if possible.
But that is not all. There is more to criticize:
Unfortunately, the C++ standard itself deprecates its own conversion support before offering a replacement!
For example, std::filesystem's support for UTF-8 encoded standard strings (not u8string) is deprecated (std::filesystem::u8path). That makes even using a UTF-8 encoded std::string a pain, because you must always convert it back and forth!
To your questions. It depends on what you want to do. If you want to have a std::string which is UTF-8 encoded but you only have a std::u8string, then you can simply do the following (no reinterpret_cast needed):
std::string convert( std::u8string str )
{
    return std::string( str.begin(), str.end() );
}
But here I would personally expect the standard to offer a move constructor in std::string taking a std::u8string, because otherwise you must always make a copy, with an extra allocation, of unchanged data.
Unfortunately, the standard does not offer such simple things. It forces users to do uncomfortable and expensive stuff.
The same is true in the other direction: if you have a std::string and have 100% verified that it is valid UTF-8, you can directly convert it:
std::u8string convert( std::string str )
{
    return std::u8string( str.begin(), str.end() );
}
While writing this long answer I realized that it is even worse than I thought when it comes to conversion! If you need to do a real conversion of the encoding, it turns out that std::u8string is not supported at all.
The only way possible (that is my research result so far) is to use std::string as the data holder for the conversion, since the available routines work on char and NOT on char8_t!
So, for the conversion from std::string to std::u8string you must do the following (see the sketch after this list):
Use std::mbrtoc16 or std::mbrtoc32 to convert narrow chars to either UTF-16 or UTF-32.
Use std::codecvt_utf8 to produce a UTF-8 encoded std::string.
Finally, use the routine above to convert the UTF-8 encoded std::string to std::u8string.
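Here is a minimal sketch of those three steps (narrow_to_u8 is a hypothetical helper; it assumes the current locale for std::mbrtoc32, and std::codecvt_utf8 is itself deprecated, as noted below):

#include <cuchar>   // std::mbrtoc32
#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
#include <locale>
#include <string>

// narrow (locale-encoded) std::string -> std::u8string
std::u8string narrow_to_u8(const std::string& narrow)
{
    // Step 1: narrow chars -> UTF-32 via std::mbrtoc32
    std::u32string utf32;
    std::mbstate_t state{};
    const char* p = narrow.data();
    const char* end = p + narrow.size();
    while (p < end) {
        char32_t c32;
        std::size_t rc = std::mbrtoc32(&c32, p, static_cast<std::size_t>(end - p), &state);
        if (rc == 0 || rc > static_cast<std::size_t>(end - p))
            break; // embedded null or conversion error: give up in this sketch
        utf32.push_back(c32);
        p += rc;
    }
    // Step 2: UTF-32 -> UTF-8 bytes held in a std::string
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(utf32);
    // Step 3: the routine above, UTF-8 std::string -> std::u8string
    return std::u8string(utf8.begin(), utf8.end());
}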
For the other way round, from std::u8string to std::string, you must do the following:
Use the routine above to create a UTF-8 encoded std::string.
Use std::codecvt_utf8 to create a UTF-16/32 string.
And finally use std::c16rtomb or std::c32rtomb to produce a narrow encoded std::string.
But guess what? The codecvt routines are deprecated without a replacement...
So, personally, I would recommend using the Windows API for it and sticking with std::string only (or, on Windows, std::wstring). Usually only on Windows is std::string / char encoded with a Windows code page; everywhere else you can normally expect UTF-8 (except maybe on mainframes and some very rare old systems).
The conclusion can only be: Don't mess around with char8_t and std::u8string at all. It is practically unusable.

Encode/Decode std::string to UTF-16

I have to handle a file format (both reading from and writing to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters outside the ASCII table are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded).
I'm looking for a library (I searched the STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to the file format (actually modeled as a bytestream), including the generation/recognition of surrogate pairs and all that Unicode stuff (I'm admittedly no expert)...
Any suggestions? Thanks!
EDIT: forgot to mention it should be cross-platform (Win / Mac) and cannot use C++11.
C++11 has this functionality:
// #include <codecvt>
std::string s = u8"Hello, World!";
std::wstring_convert<std::codecvt<char16_t, char, std::mbstate_t>, char16_t> convert;
std::u16string u16 = convert.from_bytes(s);
std::string u8 = convert.to_bytes(u16);
However to my knowledge the only implementation that has this so far is libc++. C++11 also has std::codecvt_utf8_utf16<char16_t> which some other implementations have. Specifically, codecvt_utf8_utf16 works in VS 2010 and above, and since wchar_t is used by Windows to represent UTF-16 you can use this to convert between UTF-8 and Windows' native encoding.
The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.
— [locale.codecvt] 22.4.1.4/3
Oh, and std::codecvt specializations have protected destructors, and wstring_convert requires access to the destructor so you really need an adapter:
template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet; // inherit constructors

    ~usable_facet() {}

    // workaround for compilers without inheriting constructors:
    // template <class ...Args>
    // usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
};

template <typename internT, typename externT, typename stateT>
using codecvt = usable_facet<std::codecvt<internT, externT, stateT>>;

std::wstring_convert<codecvt<char16_t, char, std::mbstate_t>> convert;
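With that adapter in place, the conversion works as in the first snippet:

std::u16string u16 = convert.from_bytes("Hello, World!");
std::string back = convert.to_bytes(u16);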
Did you look at Boost.Locale? This page, in particular, describes how to do UTF to UTF conversions and how to integrate it with IOStreams.
I would suggest having a look at:
Convert C++ std::string to UTF-16-LE encoded string
And check out the iconv function. It's a C library with no C++11 requirement.
There's also a Win32-specific iconv library at https://github.com/win-iconv/win-iconv.
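As a rough illustration of the iconv approach (POSIX API; the 2x output buffer is a simplifying assumption that works for UTF-8 to UTF-16, and utf8_to_utf16le is a made-up helper name):

#include <iconv.h>
#include <stdexcept>
#include <string>

// UTF-8 -> UTF-16LE byte stream via POSIX iconv
std::string utf8_to_utf16le(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 2, '\0'); // every code point fits in <= 2x the UTF-8 bytes
    char* src = const_cast<char*>(in.data());
    char* dst = &out[0];
    std::size_t srcleft = in.size(), dstleft = out.size();

    std::size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - dstleft); // trim to the bytes actually produced
    return out;
}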

Is there any built-in function that convert wstring or wchar_t* to UTF-8 in Linux?

I want to convert a wstring to UTF-8 encoding, but I want to use built-in functions of Linux.
Is there any built-in function on Linux that converts a wstring or wchar_t* to UTF-8 with a simple invocation?
Example:
wstring str = L"file_name.txt";
wstring mode = L"a";
fopen([FUNCTION](str), [FUNCTION](mode)); // Simple invocation.
cout << [FUNCTION](str); // Simple invocation.
If/when your compiler supports enough of C++11, you could use wstring_convert
#include <iostream>
#include <codecvt>
#include <locale>

int main()
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
    std::wstring str = L"file_name.txt";
    std::cout << utf8_conv.to_bytes(str) << '\n';
}
tested with clang++ 2.9/libc++ on Linux and Visual Studio 2010 on Windows.
The C++ language standard has no notion of explicit encodings. It only contains an opaque notion of a "system encoding", for which wchar_t is a "sufficiently large" type.
To convert from the opaque system encoding to an explicit external encoding, you must use an external library. The library of choice would be iconv() (converting from WCHAR_T to UTF-8), which is part of POSIX and available on many platforms, although on Windows the WideCharToMultiByte function is guaranteed to produce UTF-8.
C++11 adds new UTF-8 literals in the form of std::string s = u8"Hello World: \U0010FFFF";. Those are already in UTF-8, but they cannot interface with the opaque wstring other than in the way I described.
See this question for a bit more background.
It's quite plausible that wcstombs will do what you need, if what you actually want is to convert from wide characters to the current locale.
If not, then you probably need to look at ICU, Boost or similar.
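For the wcstombs route, a minimal sketch (wide_to_locale is a made-up helper; the output is UTF-8 only when the active locale is a UTF-8 one, e.g. en_US.UTF-8):

#include <clocale>
#include <cstdlib>
#include <string>

// wide string -> multibyte string in the current locale's encoding
std::string wide_to_locale(const std::wstring& w)
{
    std::setlocale(LC_ALL, "");  // pick up the environment's locale
    std::size_t len = std::wcstombs(nullptr, w.c_str(), 0); // length query
    if (len == static_cast<std::size_t>(-1))
        return {};               // unconvertible character
    std::string out(len, '\0');
    std::wcstombs(&out[0], w.c_str(), len);
    return out;
}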
Certainly there is no function built into Linux as such, because the name Linux refers to the kernel only, which doesn't have anything to do with it. I seriously doubt that the libc that comes with gcc has such a function, and
$ man -k utf
supports this theory. But there are plenty of good UTF-8 libraries around. I personally recommend the iconv library for such conversions.

Convert ICU UnicodeString to platform dependent char * (or std::string)

In my application I use ICU's UnicodeString to store my strings. Since I use some libraries incompatible with ICU, I need to convert UnicodeString to its platform-dependent representation.
Basically, what I need to do is the reverse of creating a new UnicodeString object, i.e. new UnicodeString("string encoded in system locale").
I found this topic, so I know it can be done with a stringstream.
So my question is: can it be done in some other, simpler way, without using a stringstream to convert?
I use
std::string converted;
us.toUTF8String(converted);
where us is an (ICU) UnicodeString.
You could use UnicodeString::extract() with a codepage (or a converter). Actually passing NULL for the codepage will use what ICU detected as the default codepage.
You could use the functions in ucnv.h -- namely void ucnv_fromUnicode (UConverter *converter, char **target, const char *targetLimit, const UChar **source, const UChar *sourceLimit, int32_t *offsets, UBool flush, UErrorCode *err). It's not a nice C++ API like UnicodeString, but it will work.
I'd recommend just sticking with the operator<< you're already using if at all possible. It's the standard way to handle lexical conversions (i.e. string to/from integers) in C++ in any case.

UnicodeString to char* (UTF-8)

I am using the ICU library in C++ on OS X. All of my strings are UnicodeStrings, but I need to use system calls like fopen, fread and so forth. These functions take const char* or char* as arguments. I have read that OS X supports UTF-8 internally, so that all I need to do is convert my UnicodeString to UTF-8, but I don't know how to do that.
UnicodeString has a toUTF8() member function, but it works in terms of a ByteSink. I've also found these examples: http://source.icu-project.org/repos/icu/icu/trunk/source/samples/ucnv/convsamp.cpp and read about using a converter, but I'm still confused. Any help would be much appreciated.
Call UnicodeString::extract(...) to extract into a char*; pass NULL for the converter to get the default converter (which uses the charset your OS is using).
ICU User Guide > UTF-8 provides methods and descriptions of doing that.
The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).
And extract() is no longer preferred.
Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above.
This will work:
std::string utf8;
uStr.toUTF8String(utf8);
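For the opposite direction (UTF-8 to UnicodeString), the matching method from the same documented set is the static fromUTF8():

icu::UnicodeString uStr = icu::UnicodeString::fromUTF8(utf8);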