Use signed or unsigned char in constructing CString? - c++

I am checking the documentation for CString. In the following statements:
CStringT( LPCSTR lpsz ): Constructs a Unicode CStringT from an ANSI string. You can also use this constructor to load a string resource as shown in the example below.
CStringT( LPCWSTR lpsz ): Constructs a CStringT from a Unicode string.
CStringT( const unsigned char* psz ): Allows you to construct a CStringT from a pointer to unsigned char.
I have some questions:
Why are there two versions, one for const char* (LPCSTR) and one for unsigned char*? Which version should I use in different cases? For example, does CStringT("Hello") use the first or the second version? When getting a null-terminated string from a third-party library, such as sqlite3_column_text() (see here), should I convert it to char* or unsigned char*? I.e., should I use CString((LPCSTR)sqlite3_column_text(...)) or CString(sqlite3_column_text(...))? It seems that both will work, is that right?
Why does the char* version construct a "Unicode" CStringT while the unsigned char* version constructs just a CStringT? CStringT is a class template covering all 3 instantiations, i.e. CString, CStringA, and CStringW, so why the emphasis on "Unicode" CStringT when constructing from LPCSTR (const char*)?

LPCSTR is just const char*, not const signed char*. char is signed or unsigned depending on compiler implementation, but char, signed char, and unsigned char are 3 distinct types for purposes of overloading. String literals in C++ are of type const char[], so CStringT("Hello") will always use the LPCSTR constructor, never the unsigned char* constructor.
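For illustration, here is a small standalone sketch (not involving CStringT) showing that the three character types select different overloads, and that a string literal picks the const char* one:
#include <iostream>

void which(const char*)          { std::cout << "const char*\n"; }
void which(const signed char*)   { std::cout << "const signed char*\n"; }
void which(const unsigned char*) { std::cout << "const unsigned char*\n"; }

int main()
{
    which("Hello");                              // const char[6] decays to const char*
    const unsigned char raw[] = { 'H', 'i', 0 };
    which(raw);                                  // picks the unsigned char* overload
}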
sqlite3_column_text(...) returns unsigned char* because it returns UTF-8 encoded text. I don't know what the unsigned char* constructor of CStringT actually does (it has something to do with MBCS strings), but the LPCSTR constructor performs a conversion from ANSI to UNICODE using the user's default locale. That would destroy UTF-8 text that contains non-ASCII characters.
Your best option in that case is to convert the UTF-8 text to UTF-16 (using MultiByteToWideChar() or equivalent, or simply using sqlite3_column_text16() instead, which returns UTF-16 encoded text), and then use the LPCWSTR (const wchar_t*) constructor of CStringT, as Windows uses wchar_t for UTF-16 data.
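As a sketch of the MultiByteToWideChar() route (the helper name and the stmt/col parameters are just placeholders for however you access your statement):
#include <windows.h>
#include <atlstr.h>
#include <sqlite3.h>

// Sketch only: convert a UTF-8 text column to a CStringW (UTF-16).
CStringW Utf8ColumnToCStringW(sqlite3_stmt* stmt, int col)
{
    const char* utf8 = reinterpret_cast<const char*>(sqlite3_column_text(stmt, col));
    if (utf8 == nullptr)
        return CStringW();

    // First call determines the required length (including the terminator).
    int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    CStringW result;
    ::MultiByteToWideChar(CP_UTF8, 0, utf8, -1, result.GetBuffer(len), len);
    result.ReleaseBuffer();
    return result;
}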

tl;dr: Use either of the following:
CStringW value( sqlite3_column_text16() ); (optionally setting SQLite's internal encoding to UTF-16), or
CStringW value( CA2WEX( sqlite3_column_text(), CP_UTF8 ) );
Everything else is just not going to work out, one way or another.
First things first: CStringT is a class template, parameterized (among other things) on the character type it uses to represent the stored sequence. This is passed as the BaseType template type argument. There are 2 concrete template instantiations, CStringA and CStringW, that use char and wchar_t to store the sequence of characters, respectively1.
CStringT exposes the following predefined types that describe the properties of the template instantiation:
XCHAR: Character type used to store the sequence.
YCHAR: Character type that an instance can be converted from/to.
The following table shows the concrete types for CStringA and CStringW:
         | XCHAR   | YCHAR
---------+---------+---------
CStringA | char    | wchar_t
CStringW | wchar_t | char
While the storage of the CStringT instantiations imposes no restrictions with respect to the character encoding being used, the conversion c'tors and operators are implemented based on the following assumptions:
char represents ANSI2 encoded code units.
wchar_t represents UTF-16 encoded code units.
If your program doesn't match those assumptions, it is strongly advised to disable implicit wide-to-narrow and narrow-to-wide conversions. To do this, define the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION preprocessor symbol prior to including any ATL/MFC header files. Doing so is recommended even if your program meets the assumptions, to prevent accidental conversions that are both costly and potentially destructive.
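For example (a sketch; put the define wherever your project guarantees it is seen before the first ATL/MFC include, e.g. the precompiled header or the project settings):
#define _CSTRING_DISABLE_NARROW_WIDE_CONVERSION
#include <atlstr.h>

// With the symbol defined, conversion c'tors/assignments no longer compile:
// CStringW w("narrow text");   // error: implicit ANSI -> UTF-16 conversion disabled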
With that out of the way, let's move on to the questions:
Why are there two versions, one for const char* (LPCSTR) and one for unsigned char*?
That's easy: Convenience. The overload simply allows you to construct a CString instance irrespective of the signedness of the character type3. The implementation of the overload taking a const unsigned char* argument 'forwards' to the c'tor taking a const char*:
CSTRING_EXPLICIT CStringT(_In_z_ const unsigned char* pszSrc) :
    CThisSimpleString( StringTraits::GetDefaultManager() )
{
    *this = reinterpret_cast< const char* >( pszSrc );
}
Which version should I use for different cases?
It doesn't matter, as long as you are constructing a CStringA, i.e. no conversion is applied. If you are constructing a CStringW, you shouldn't be using either of those (as explained above).
For example, does CStringT("Hello") use the first or second version?
"Hello" is of type const char[6], that decays into a const char* to the first element in the array, when passed to the CString c'tor. It calls the overload taking a const char* argument.
When getting a null-terminated string from a third-party, such as sqlite3_column_text() (see here), should I convert it to char* or unsigned char *? ie, should I use CString((LPCSTR)sqlite3_column_text(...)) or CString(sqlite3_column_text(...))?
SQLite assumes UTF-8 encoding in this case. CStringA can store UTF-8 encoded text, but it's really, really dangerous to do so. CStringA assumes ANSI encoding, and readers of your code likely will, too. It is recommended to change your SQLite database to store UTF-16 (and use sqlite3_column_text16 to construct a CStringW). If that is not feasible, manually convert from UTF-8 to UTF-16 before storing the data in a CStringW instance, using the CA2WEX macro:
CStringW data( CA2WEX( sqlite3_column_text(), CP_UTF8 ) );
It seems that both will work, is that right?
That's not correct. Neither one works as soon as you get non-ASCII characters from your database.
Why does the char* version construct a "Unicode" CStringT but the unsigned char* version will construct a CStringT?
That looks to be the result of the documentation trying to be compact. CStringT is a class template; as such, it is neither Unicode nor does it even exist as a concrete type. I'm guessing the remarks section on the constructors is meant to highlight the ability to construct Unicode strings from ANSI input (and vice versa). This is briefly mentioned, too ("Note that some of these constructors act as conversion functions.").
To sum this up, here is a list of generic advice when using MFC/ATL strings:
Prefer using CStringW. This is the only string type whose implied character encoding is unambiguous (UTF-16).
Use CStringA only when interfacing with legacy code. Make sure to unambiguously note the character encoding used. Also make sure to understand that the "currently active locale" can change at any time. See Keep your eye on the code page: Is this string CP_ACP or UTF-8? for more information.
Never use CString. Just by looking at the code, it's no longer clear what type this is (it could be either of 2 types). Likewise, when looking at a constructor invocation, it is no longer possible to tell whether it is a copy or a conversion operation.
Disable implicit conversions for the CStringT class template instantiations.
1 There's also CString, which uses the generic-text mapping TCHAR as its BaseType. TCHAR expands to either char or wchar_t, depending on preprocessor symbols. CString is thus an alias for either CStringA or CStringW, depending on those very same preprocessor symbols. Unless you are targeting Win9x, don't use any of the generic-text mappings.
2 Unlike Unicode encodings, ANSI is not a self-contained representation. Interpretation of code units depends on external state (the currently active locale). Do not use unless you are interfacing with legacy code.
3 It is implementation-defined whether char is interpreted as signed or unsigned. Either way, char, unsigned char, and signed char are 3 distinct types. By default, Visual Studio interprets char as signed.

Related

Interfacing std::filesystem::path with libraries that expect UTF-8 char*?

I'm looking to use std::filesystem::path to easily manipulate paths, but the libraries I'm using expect a const char* in UTF-8 encoding on all platforms.
I see that I can get a u8string, but its c_str() returns a char8_t*.
Is there some way for me to go from filesystem::path to a UTF-8 encoded char* on all platforms?
A buffer of char8_t can be reasonably safely cast to a char const* pointer and passed to the other API.
char8_t is a distinct type whose underlying storage is identical to an unsigned char. Casting unsigned char bits to char is legal.
char may be signed or unsigned, so fiddling with it is somewhat dangerous in portable code. But simply passing it through (read only) to another API is very safe.
Usually aliasing one type as another is illegal in C++, but char is one of the types with special dispensation to alias.
Note it is not legal to cast a buffer of char directly into a pointer to char8_t. So if the library provides UTF-8 sequences as char data and you need them as a char8_t buffer, you'll have to copy the data over to a char8_t buffer (which can be done via memcpy or similar) to stay within standard-defined behavior.
The very reason why char8_t is different from char is to make sure its users are aware that this is not a plain char, and that separate encoding/decoding is required to process it. Other than that, it is the same as char.
It looks like the libraries you are using fail to recognize this or are pre-C++20 libraries. In either case, you can use reinterpret_cast to cast const char8_t* to const char* - this is one of the rare cases where such a cast is appropriate.
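A minimal sketch of that cast, assuming a hypothetical library_open() that takes a UTF-8 encoded const char*:
#include <filesystem>
#include <string>

// Hypothetical third-party API expecting UTF-8 in a const char*.
void library_open(const char* utf8_path);

void open_with_library(const std::filesystem::path& p)
{
    // In C++20, u8string() returns std::u8string, whose c_str() is const char8_t*.
    const std::u8string u8 = p.u8string();
    // Casting const char8_t* to const char* is well-defined; char may alias anything.
    library_open(reinterpret_cast<const char*>(u8.c_str()));
}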

What is the difference in using CStringW/CStringA and CT2W/CT2A to convert strings?

In the past, I used CT2W and CT2A to convert strings between Unicode and ANSI. Now it seems that CStringW and CStringA can also do the same task.
I write the following code snippet:
CString str = _T("Abc");
CStringW str1;
CStringA str2;
CT2W str3(str);
CT2A str4(str);
str1 = str;
str2 = str;
It seems CStringW and CStringA also perform conversions by using WideCharToMultiByte when I assign str to them.
So, what are the advantages of using CT2W/CT2A instead of CStringW/CStringA, given that I have never heard of the latter pair being used for this? Nor does MS recommend the latter pair for doing the conversion.
CString offers a number of conversion constructors to convert between ANSI and Unicode encoding. They are as convenient as they are dangerous, often masking bugs. MFC allows you to disable implicit conversion by defining the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION preprocessor symbol (which you probably should). Conversions always involve the creation of a new CString object with heap-allocated storage (ignoring the short string optimization).
By contrast, the Cs2d macros (where s = source, d = destination) work on raw C-style strings; no CString instances are created in the process of converting between character encodings. A temporary buffer of 128 code units is always allocated on the stack, in addition to a heap-allocated buffer in case the conversion requires more space.
Both of the above perform a conversion with an implied ANSI code page (CP_THREAD_ACP by default, or CP_ACP in case the _CONVERSION_DONT_USE_THREAD_LOCALE preprocessor symbol is defined). CP_ACP is particularly troublesome, as it's a process-global setting that any thread can change at any time.
Which one should you choose for your conversions? Neither of the above. Use the EX versions instead (see string and text classes for a full list). Those are implemented as class templates that give you the control you need to reliably perform your conversions. The template non-type parameter lets you control the size of the static buffer. More importantly, these class templates have constructors with an explicit code page parameter, so you can perform the conversion you want (including from and to UTF-8) and document your intent in the code.
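As a sketch (the buffer size and the CP_UTF8 code page are just example choices):
#include <atlconv.h>   // CA2WEX, CW2AEX and friends

void ConversionSketch(const char* utf8Input)
{
    // The template argument is the size of the stack buffer used before
    // falling back to the heap; the code page is stated explicitly.
    CA2WEX<64> wide(utf8Input, CP_UTF8);   // UTF-8  -> UTF-16
    CW2AEX<64> back(wide, CP_UTF8);        // UTF-16 -> UTF-8

    // `wide' converts to LPCWSTR and `back' to LPCSTR; both are only valid
    // while the converter objects are alive.
}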

Convert execution character set string to a UTF-8 string

In my program I have a std::string that contains text encoded using the "execution character set" (which is not guaranteed to be UTF-8 or even US-ASCII), and I want to convert that to a std::string that contains the same text, but encoded using UTF-8. How can I do that?
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object? What function or constructor must I use?
I assume the standard library provides some means for doing this (somewhere, somehow), because the compiler itself must know about UTF-8 (to support UTF-8 string literals) and the execution character set.
I guess I need a std::codecvt<char, char, std::mbstate_t> character-converter object, but where can I get hold of a suitable object?
You can get a std::codecvt object only as a base class instance (by inheriting from it), because the destructor is protected. That said, no: std::codecvt<char, char, std::mbstate_t> is not the facet you need, since it represents the identity conversion (i.e. no conversion at all).
At the moment, the C++ standard library has no functionality for converting between the native (aka execution) character encoding (aka character set) and UTF-8. As such, you can implement the conversion yourself using the Unicode standard: https://www.unicode.org/versions/Unicode11.0.0/UnicodeStandard-11.0.pdf
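For instance, the UTF-8 encoding step itself is small enough to write by hand; here is a sketch that encodes a single Unicode code point (mapping from the execution character set to code points, and validating the code point range, is still up to you):
#include <string>

std::string EncodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}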
To use an external library I guess you would need to know the "name" (or ID) of the execution character set. But how would you get that?
There is no standard library function for that either. On POSIX systems, for example, you can use nl_langinfo(CODESET).
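A sketch, assuming a POSIX system:
#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main()
{
    // Adopt the user's locale first; otherwise the "C" locale's codeset is reported.
    std::setlocale(LC_ALL, "");
    std::printf("native/execution codeset: %s\n", nl_langinfo(CODESET));
}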
This is hacky, but it worked for me in MS VS2019:
#pragma execution_character_set( "utf-8" )

what's the difference between std::codecvt and std::codecvt_utf8

There is a question that makes me feel confused. What exactly is the difference between std::codecvt and std::codecvt_utf8? According to the STL reference, std::codecvt_utf8 is a class derived from std::codecvt, but could you please tell me why this code throws an exception?
std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> cvtUtf8 { new std::codecvt_byname<wchar_t, char, std::mbstate_t>(".65001") };
std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt_utf8;
std::string strUtf8 = cvt_utf8.to_bytes(L"你好");
std::string strUtf8Failed = cvtUtf8.to_bytes(L"你好"); // throw out an exception. bad conversion
codecvt is a template intended to be used as a base of a conversion facet for converting strings between different encodings and different sizes of code units. It has a protected destructor, which practically prevents it from being used without inheritance.
codecvt<wchar_t, char, mbstate_t> specialization in particular is a conversion facet for "conversion between the system's native wide and the single-byte narrow character sets".
codecvt_utf8 inherits from codecvt and is a facet for conversion between a "UTF-8 encoded byte string and UCS2 or UCS4 character string". It has a public destructor.
If the system's native wide encoding is not UCS2 or UCS4, or if the system's native narrow encoding isn't UTF-8, then they do different things.
could you please tell me why this function would throw an exception?
Probably because the C++ source file was not encoded in the same encoding as the converter expects the input to be.
new std::codecvt<wchar_t, char, std::mbstate_t>(".65001")
codecvt has no constructor that accepts a string.
It might be worth noting that codecvt and wstring_convert have been deprecated since C++17.
Which one should be used instead of codecvt?
The standard committee chose to deprecate codecvt before providing an alternative. You can either keep using it - with the knowledge that it may be replaced by something else in future, and with the knowledge that it has serious shortcomings that are cause for deprecation - or you can do what you could do prior to C++11: implement the conversion yourself, or use a third party implementation.

_tcslen in Multibyte character set: how to convert WCHAR [1] to const char *?

I've searched the internet for about 2 hours and haven't found any working solution.
My program uses the multibyte character set, and in the code I have:
WCHAR value[1];
_tcslen(value);
And when compiling, I get this error:
'strlen' : cannot convert parameter 1
from 'WCHAR [1]' to 'const char *'
How do I convert this WCHAR[1] to a const char*?
I assume the not-very-useful 1-length WCHAR array is just an example...
_tcslen isn't a function; it's a macro that expands to strlen or wcslen according to the _UNICODE define. If you're using the multibyte character set (which means _UNICODE ISN'T defined) and you have a WCHAR array, you'll have to use wcslen explicitly.
(Unless you're using TCHAR specifically, you probably want to avoid the _t functions. There's a not-very-good overview in MSDN at http://msdn.microsoft.com/en-us/library/szdfzttz(VS.80).aspx; does anybody know of anything better?)
Try setting "use unicode" in your projects settings in VS if you want _tcslen to be wcslen, set it to "use multibyte" for _tcslen to be strlen. As some already pointed out, _t prefixed functions (as many others actually, e.g. MessageBox()) are macros that are "expanded" based on the _UNICODE precompiler define value.
Another option would be to use TCHAR instead of WCHAR in your code. Although I'd say a better idea would be to just stick to either wchar_t or char types and use appropriate functions with them.
And last but not least, as the question is tagged c++, consider using std::string (or std::wstring for that matter) and std::vector instead of char buffers. Using those with C API functions is trivial and generally a lot safer.
Use the conversion function WideCharToMultiByte().
http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx
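A sketch of that approach, converting a null-terminated WCHAR string into a std::string in the current ANSI code page (lossy for characters the code page cannot represent):
#include <windows.h>
#include <string>

std::string WideToAnsi(const WCHAR* wide)
{
    // First call computes the required size, including the terminator.
    int len = ::WideCharToMultiByte(CP_ACP, 0, wide, -1, nullptr, 0, nullptr, nullptr);
    std::string narrow(len, '\0');
    ::WideCharToMultiByte(CP_ACP, 0, wide, -1, &narrow[0], len, nullptr, nullptr);
    narrow.resize(len > 0 ? len - 1 : 0);   // drop the embedded terminator
    return narrow;
}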
_tcslen takes WCHAR pointers or arrays as input only when UNICODE is #defined in your environment.
From the error message, I'd say that it isn't.
The best way to define unicode is to pass it as a parameter to the compiler. For Microsoft C, you would use the /D switch
cl -c /DUNICODE myfile.cpp
or you could change your array declaration to TCHAR, which like _tcslen will be char when UNICODE is not #defined and WCHAR when it is.