There is a question that confuses me: what exactly is the difference between std::codecvt and std::codecvt_utf8? According to the STL reference, std::codecvt_utf8 is a class derived from std::codecvt, but could you please tell me why the following code throws an exception?
std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> cvtUtf8 { new std::codecvt_byname<wchar_t, char, std::mbstate_t>(".65001") };
std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt_utf8;
std::string strUtf8 = cvt_utf8.to_bytes(L"你好");
std::string strUtf8Failed = cvtUtf8.to_bytes(L"你好"); // throws an exception: bad conversion
codecvt is a template intended to be used as a base of a conversion facet for converting strings between different encodings and different sizes of code units. It has a protected destructor, which practically prevents it from being used without inheritance.
codecvt<wchar_t, char, mbstate_t> specialization in particular is a conversion facet for "conversion between the system's native wide and the single-byte narrow character sets".
codecvt_utf8 inherits codecvt and is a facet for conversion between a "UTF-8 encoded byte string and UCS2 or UCS4 character string". It has a public destructor.
If the system's native wide encoding is not UCS2 or UCS4, or if the system's native narrow encoding isn't UTF-8, then the two facets do different things.
could you please tell me why this function would throw an exception?
Probably because the C++ source file was not encoded in the same encoding as the converter expects the input to be.
new std::codecvt<wchar_t, char, std::mbstate_t>(".65001")
codecvt has no constructor that accepts a string.
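For reference, a minimal sketch of how a named facet can be combined with std::wstring_convert. Unlike codecvt, codecvt_byname does accept a name string; whether ".65001" is recognized depends on the platform's locale support (other platforms may need a name like "en_US.UTF-8"), and the facet's protected destructor needs a small wrapper:

#include <codecvt>
#include <locale>
#include <string>

// wstring_convert deletes its facet on destruction, but codecvt_byname has a
// protected destructor; a thin wrapper with a public destructor makes it usable.
template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;  // inherit the constructors
    ~deletable_facet() = default;
};

int main() {
    using byname = deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
    // The constructor throws std::runtime_error if the locale name is not recognized.
    std::wstring_convert<byname> narrowCvt(new byname(".65001"));
    // to_bytes throws std::range_error ("bad conversion") if a character cannot be
    // represented in the target narrow encoding.
    std::string narrow = narrowCvt.to_bytes(L"你好");
}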
It might be worth noting that codecvt and wstring_convert have been deprecated since C++17.
Which one should be used instead of codecvt?
The standard committee chose to deprecate codecvt before providing an alternative. You can either keep using it - with the knowledge that it may be replaced by something else in future, and with the knowledge that it has serious shortcomings that are cause for deprecation - or you can do what you could do prior to C++11: implement the conversion yourself, or use a third party implementation.
C++11 introduced the c16rtomb()/c32rtomb() conversion functions, along with their inverses (mbrtoc16()/mbrtoc32()). The reference documentation for c16rtomb() here clearly states:
The multibyte encoding used by this function is specified by the currently active C locale
The documentation for c32rtomb() states the same. Both the C and C++ versions agree that these are locale-dependent conversions (as they should be, according to the naming convention of the functions themselves).
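A minimal sketch of what that locale dependence means in practice, assuming a UTF-8 locale such as "en_US.UTF-8" is available on the system:

#include <climits>   // MB_LEN_MAX
#include <clocale>
#include <cstdio>
#include <cuchar>

int main() {
    // Per the standard, the narrow encoding used by c32rtomb() is that of the
    // currently active C locale.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    std::mbstate_t state{};
    char out[MB_LEN_MAX] = {};
    std::size_t n = std::c32rtomb(out, U'你', &state);  // convert one UTF-32 code point

    if (n != static_cast<std::size_t>(-1))
        std::printf("wrote %zu bytes\n", n);  // 3 bytes in a UTF-8 locale
}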
However, MSVC seems to have taken a different approach and made them locale-independent (not using the current C locale) according to this document. These conversion functions are specified under the heading Locale-independent multibyte routines.
C++20 adds to the confusion by including the c8rtomb()/mbrtoc8() functions, which if locale-independent would basically do nothing, converting UTF-8 input to UTF-8 output.
Two questions arise from this:
Do any other compilers actually follow the standard and implement locale-dependent Unicode multibyte conversion routines? I couldn't find any concrete information after extensive searching.
Is this a bug in MSVC's implementation?
In the past, I used CT2W and CT2A to convert strings between Unicode and ANSI. Now it seems that CStringW and CStringA can also do the same task.
I wrote the following code snippet:
CString str = _T("Abc");
CStringW str1;
CStringA str2;
CT2W str3(str);
CT2A str4(str);
str1 = str;
str2 = str;
It seems CStringW and CStringA also perform conversions using WideCharToMultiByte when I assign str to them.
So what are the advantages of using CT2W/CT2A instead of CStringW/CStringA? I have never heard of the latter pair being used for conversion, nor does Microsoft recommend them for it.
CString offers a number of conversion constructors to convert between ANSI and Unicode encoding. They are as convenient as they are dangerous, often masking bugs. MFC allows you to disable implicit conversion by defining the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION preprocessor symbol (which you probably should). Conversions always involve the creation of a new CString object with heap-allocated storage (ignoring the short string optimization).
By contrast, the Cs2d macros (where s = source, d = destination) work on raw C-style strings; no CString instances are created in the process of converting between character encodings. A temporary buffer of 128 code units is always allocated on the stack, in addition to a heap-allocated buffer in case the conversion requires more space.
Both of the above perform a conversion with an implied ANSI codepage (either CP_THREAD_ACP or, in case the _CONVERSION_DONT_USE_THREAD_LOCALE preprocessor symbol is defined, CP_ACP). CP_ACP is particularly troublesome, as it's a process-global setting that any thread can change at any time.
Which one should you choose for your conversions? Neither of the above. Use the EX versions instead (see string and text classes for a full list). Those are implemented as class templates that give you the control you need to reliably perform your conversions. The template non-type parameter lets you control the size of the static buffer. More importantly, those class templates have constructors with an explicit codepage parameter, so you can perform the conversion you want (including from and to UTF-8), and document your intent in code.
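A minimal sketch of one of the EX classes with an explicit codepage (the buffer size 64, the UTF-8 source encoding, and the MessageBoxW call are just illustrative choices):

#include <Windows.h>
#include <atlconv.h>

// Convert a UTF-8 encoded C string to UTF-16, stating the codepage explicitly.
// The template argument controls the size of the stack buffer; larger inputs
// transparently fall back to a heap allocation.
void ShowUtf8(const char* utf8Text)
{
    CA2WEX<64> wide(utf8Text, CP_UTF8);
    ::MessageBoxW(nullptr, wide, L"Converted", MB_OK);
}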
I am checking the documentation for CString. It contains the following statements:
CStringT( LPCSTR lpsz ): Constructs a Unicode CStringT from an ANSI string. You can also use this constructor to load a string resource as shown in the example below.
CStringT( LPCWSTR lpsz ): Constructs a CStringT from a Unicode string.
CStringT( const unsigned char* psz ): Allows you to construct a CStringT from a pointer to unsigned char.
I have some questions:
Why are there two versions, one for const char* (LPCSTR) and one for unsigned char*? Which version should I use in different cases? For example, does CStringT("Hello") use the first or the second version? When getting a null-terminated string from a third party, such as sqlite3_column_text() (see here), should I convert it to char* or unsigned char*? I.e., should I use CString((LPCSTR)sqlite3_column_text(...)) or CString(sqlite3_column_text(...))? It seems that both will work; is that right?
Why does the char* version construct a "Unicode" CStringT but the unsigned char* version constructs a CStringT? CStringT is a templated class covering all 3 instantiations, i.e., CString, CStringA, and CStringW, so why the emphasis on a "Unicode" CStringT when constructing from LPCSTR (const char*)?
LPCSTR is just const char*, not const signed char*. char is signed or unsigned depending on compiler implementation, but char, signed char, and unsigned char are 3 distinct types for purposes of overloading. String literals in C++ are of type const char[], so CStringT("Hello") will always use the LPCSTR constructor, never the unsigned char* constructor.
sqlite3_column_text(...) returns unsigned char* because it returns UTF-8 encoded text. I don't know what the unsigned char* constructor of CStringT actually does (it has something to do with MBCS strings), but the LPCSTR constructor performs a conversion from ANSI to UNICODE using the user's default locale. That would destroy UTF-8 text that contains non-ASCII characters.
Your best option in that case is to convert the UTF-8 text to UTF-16 (using MultiByteToWideChar() or equivalent, or simply using sqlite3_column_text16() instead, which returns UTF-16 encoded text), and then use the LPCWSTR (const wchar_t*) constructor of CStringT, as Windows uses wchar_t for UTF-16 data.
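For illustration, a hedged sketch of the MultiByteToWideChar() route (the helper name Utf8ToCStringW is made up here; 'utf8' stands for the pointer returned by sqlite3_column_text(), after a cast from the unsigned char* SQLite returns):

#include <Windows.h>
#include <atlstr.h>

CStringW Utf8ToCStringW(const char* utf8)
{
    // The first call computes the required length in wchar_t units, including
    // the terminating null (because cbMultiByte is -1).
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    CStringW result;
    if (len > 0)
    {
        ::MultiByteToWideChar(CP_UTF8, 0, utf8, -1, result.GetBuffer(len), len);
        result.ReleaseBuffer();
    }
    return result;
}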
tl;dr: Use either of the following:
CStringW value( sqlite3_column_text16() ); (optionally setting SQLite's internal encoding to UTF-16), or
CStringW value( CA2WEX( sqlite3_column_text(), CP_UTF8 ) );
Everything else is just not going to work out, one way or another.
First things first: CStringT is a class template, parameterized (among others) on the character type it uses to represent the stored sequence. This is passed as the BaseType template type argument. There are 2 concrete template instantiations, CStringA and CStringW, that use char and wchar_t to store the sequence of characters, respectively¹.
CStringT exposes the following predefined types that describe the properties of the template instantiation:
XCHAR: Character type used to store the sequence.
YCHAR: Character type that an instance can be converted from/to.
The following table shows the concrete types for CStringA and CStringW:
         | XCHAR   | YCHAR
---------+---------+---------
CStringA | char    | wchar_t
CStringW | wchar_t | char
While the storage of the CStringT instantiations imposes no restrictions with respect to the character encoding being used, the conversion c'tors and operators are implemented based on the following assumptions:
char represents ANSI² encoded code units.
wchar_t represents UTF-16 encoded code units.
If your program doesn't match those assumptions, it is strongly advised to disable implicit wide-to-narrow and narrow-to-wide conversions. To do this, define the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION preprocessor symbol prior to including any ATL/MFC header files, as shown below. Doing so is recommended even if your program meets the assumptions, to prevent accidental conversions that are both costly and potentially destructive.
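For example, in a single translation unit (a project-wide preprocessor definition works just as well):

// Disable CStringT's implicit narrow<->wide conversion constructors and operators.
// Must appear before any ATL/MFC header is included.
#define _CSTRING_DISABLE_NARROW_WIDE_CONVERSION
#include <atlstr.h>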
With that out of the way, let's move on to the questions:
Why are there two versions, one for const char* (LPCSTR) and one for unsigned char*?
That's easy: Convenience. The overload simply allows you to construct a CString instance irrespective of the signedness of the character type³. The implementation of the overload taking a const unsigned char* argument 'forwards' to the c'tor taking a const char*:
CSTRING_EXPLICIT CStringT(_In_z_ const unsigned char* pszSrc) :
CThisSimpleString( StringTraits::GetDefaultManager() )
{
*this = reinterpret_cast< const char* >( pszSrc );
}
Which version should I use for different cases?
It doesn't matter, as long as you are constructing a CStringA, i.e. no conversion is applied. If you are constructing a CStringW, you shouldn't be using either of those (as explained above).
For example, does CStringT("Hello") use the first or second version?
"Hello" is of type const char[6], that decays into a const char* to the first element in the array, when passed to the CString c'tor. It calls the overload taking a const char* argument.
When getting a null-terminated string from a third party, such as sqlite3_column_text() (see here), should I convert it to char* or unsigned char*? I.e., should I use CString((LPCSTR)sqlite3_column_text(...)) or CString(sqlite3_column_text(...))?
SQLite assumes UTF-8 encoding in this case. CStringA can store UTF-8 encoded text, but it's really, really dangerous to do so. CStringA assumes ANSI encoding, and readers of your code will likely do so, too. It is recommended to change your SQLite database to store UTF-16 (and use sqlite3_column_text16) to construct a CStringW. If that is not feasible, manually convert from UTF-8 to UTF-16 before storing the data in a CStringW instance, using the CA2WEX macro:
CStringW data( CA2WEX( sqlite3_column_text(), CP_UTF8 ) );
It seems that both will work, is that right?
That's not correct. Neither one works as soon as you get non-ASCII characters from your database.
Why does the char* version construct a "Unicode" CStringT but the unsigned char* version constructs a CStringT?
That looks to be the result of documentation trying to be compact. CStringT is a class template; it is neither Unicode nor does it even exist as a concrete type. I'm guessing the remarks section on the constructors is meant to highlight the ability to construct Unicode strings from ANSI input (and vice versa). This is briefly mentioned, too ("Note that some of these constructors act as conversion functions.").
To sum this up, here is a list of generic advice when using MFC/ATL strings:
Prefer using CStringW. This is the only string type whose implied character encoding is unambiguous (UTF-16).
Use CStringA only, when interfacing with legacy code. Make sure to unambiguously note the character encoding used. Also make sure to understand that "currently active locale" can change at any time. See Keep your eye on the code page: Is this string CP_ACP or UTF-8? for more information.
Never use CString. Just by looking at the code, it's no longer clear what type this is (it could be either of 2 types). Likewise, when looking at a constructor invocation, it is no longer possible to see whether it is a copy or a conversion operation.
Disable implicit conversions for the CStringT class template instantiations.
¹ There's also CString, which uses the generic-text mapping TCHAR as its BaseType. TCHAR expands to either char or wchar_t, depending on preprocessor symbols. CString is thus an alias for either CStringA or CStringW, depending on those very same preprocessor symbols. Unless you are targeting Win9x, don't use any of the generic-text mappings.
² Unlike Unicode encodings, ANSI is not a self-contained representation. Interpretation of code units depends on external state (the currently active locale). Do not use it unless you are interfacing with legacy code.
³ It is implementation-defined whether char is interpreted as signed or unsigned. Either way, char, unsigned char, and signed char are 3 distinct types. By default, Visual Studio interprets char as signed.
In my program I have a std::string that contains text encoded using the "execution character set" (which is not guaranteed to be UTF-8 or even US-ASCII), and I want to convert that to a std::string that contains the same text, but encoded using UTF-8. How can I do that?
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object? What function or constructor must I use?
I assume the standard library provides some means for doing this (somewhere, somehow), because the compiler itself must know about UTF-8 (to support UTF-8 string literals) and the execution character set.
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object?
You can get a std::codecvt object only as a base-class instance (by inheriting from it) because its destructor is protected. That said, no, std::codecvt<char, char, std::mbstate_t> is not the facet you need, since it represents the identity conversion (i.e. no conversion at all).
At the moment, the C++ standard library has no functionality for conversion between the native (aka execution) character encoding (aka character set) and UTF-8. As such, you can implement the conversion yourself using the Unicode standard: https://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf
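One pragmatic workaround (not a standard-guaranteed solution) is to go through wchar_t using the runtime locale plus the deprecated <codecvt> machinery. The sketch below assumes the runtime locale's narrow encoding matches the execution character set and that wchar_t holds Unicode code points; the function name native_to_utf8 is made up here:

#include <clocale>
#include <codecvt>
#include <cstdlib>
#include <locale>
#include <stdexcept>
#include <string>

std::string native_to_utf8(const std::string& native)
{
    if (native.empty())
        return {};

    // Decode the narrow string with the current C locale (assumed to match the
    // execution character set); in real code, set the locale once at startup.
    std::setlocale(LC_ALL, "");
    std::wstring wide(native.size(), L'\0');
    std::size_t n = std::mbstowcs(&wide[0], native.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    wide.resize(n);

    // Re-encode the wide string as UTF-8.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8Cvt;
    return utf8Cvt.to_bytes(wide);
}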
To use an external library I guess you would need to know the "name" (or ID) of the execution character set. But how would you get that?
There is no standard library function for that either. On POSIX system for example, you can use nl_langinfo(CODESET).
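A minimal POSIX-only sketch (nl_langinfo is not available on Windows):

#include <clocale>
#include <cstdio>
#include <langinfo.h>  // POSIX

int main() {
    std::setlocale(LC_ALL, "");                  // adopt the environment's locale
    const char* codeset = nl_langinfo(CODESET);  // e.g. "UTF-8" or "ISO-8859-1"
    std::printf("narrow encoding: %s\n", codeset);
}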
This is hacky, but it worked for me in MS VS2019:
#pragma execution_character_set( "utf-8" )
It has been mentioned in several sources that C++0x will include better language-level support for Unicode (including types and literals).
If the language is going to add these new features, it's only natural to assume that the standard library will as well.
However, I am currently unable to find any references to the new standard library. I expected to find the answers to these questions:
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
Does the new library allowing writing UTF-8 to files, to the console (or from files, from the console). If so, can we use cout or will we need something else?
Does the new library include "basic" functionality such as: discovering the byte count and length of a UTF-8 string, converting to upper-case/lower-case (does this consider the influence of locales?)?
Finally, are any of these functions available in any popular compilers such as GCC or Visual Studio?
I have tried to look for information, but I can't seem to find anything. I am actually starting to think that maybe these things aren't even decided yet (I am aware that C++0x is a work in progress).
Does the new library provide standard methods to convert UTF-8 to UTF-16, etc.?
No. The new library does provide std::codecvt facets which do the conversion for you when dealing with iostream, however. ISO/IEC TR 19769:2004, the C Unicode Technical Report, is included almost verbatim in the new standard.
Does the new library allowing writing UTF-8 to files, to the console (or from files, from the console). If so, can we use cout or will we need something else?
Yes, you'd just imbue cout with the correct codecvt facet. Note, however, that the console is not required to display those characters correctly.
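For a file stream the mechanism looks roughly like this (a sketch using the codecvt_utf8 facet; as noted, console display is a separate matter, and <codecvt> was later deprecated in C++17):

#include <codecvt>
#include <fstream>
#include <locale>

int main() {
    std::wofstream out("hello.txt");
    // Replace the stream's codecvt facet so wide characters are written as UTF-8.
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out << L"你好\n";
}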
Does the new library include "basic" functionality such as: discovering the byte count and length of a UTF-8 string, converting to upper-case/lower-case (does this consider the influence of locales?)?
AFAIK that functionality exists with the existing C++03 standard. std::toupper and std::towupper of course function just as in previous versions of the standard. There aren't any new functions which specifically operate on Unicode for this.
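For instance, a sketch of the existing locale-sensitive machinery (assuming a "de_DE.UTF-8" locale is installed on the system):

#include <clocale>
#include <cwctype>

int main() {
    std::setlocale(LC_ALL, "de_DE.UTF-8");  // towupper consults the current locale
    wchar_t upper = std::towupper(L'ä');    // L'Ä' if the locale knows the character
    (void)upper;
}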
If you need these kinds of things, you're still going to have to rely on an external library -- <iostream> is the primary piece that was retrofitted.
What, specifically, is added for Unicode in the new standard?
Unicode literals, via u8"", u"", and U"" (a small sketch follows this list)
std::char_traits classes for UTF-8, UTF-16, and UTF-32
mbrtoc16, c16rtomb, mbrtoc32, and c32rtomb from ISO/IEC TR 19769:2004
std::codecvt facets for the locale library
std::wstring_convert, a class template which uses the codecvt mechanism for code set conversions
std::wbuffer_convert, which does the same as wstring_convert but for stream buffers rather than strings
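A small sketch of the literal forms listed above (note that from C++20 on, u8"" literals yield char8_t rather than char):

#include <string>

int main() {
    std::string    utf8  = u8"π";   // UTF-8 narrow literal (char in C++11/14/17)
    std::u16string utf16 = u"π";    // UTF-16 literal, char16_t
    std::u32string utf32 = U"π";    // UTF-32 literal, char32_t
}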