Interfacing std::filesystem::path with libraries that expect UTF-8 char*? - c++

I'm looking to use std::filesystem::path to easily manipulate paths, but the libraries I'm using expect a const char* in UTF-8 encoding on all platforms.
I see that I can get a u8string, but its c_str() returns a char8_t*.
Is there some way for me to go from filesystem::path to a UTF-8 encoded char* on all platforms?

A buffer of char8_t can be reasonably safely cast to a char const* pointer and passed to the other API.
char8_t is a distinct type whose underlying representation is identical to that of unsigned char. Casting unsigned char bits to char is legal.
char may be signed or unsigned, so fiddling with it is somewhat dangerous in portable code. But simply passing it through (read only) to another API is very safe.
Usually aliasing one type as another is illegal in C++, but char is one of the types with special dispensation to alias.
Note that it is not legal to cast a buffer of char directly to a pointer to char8_t. So if an API provides UTF-8 sequences as char data and you need them as a char8_t buffer, you'll have to copy them into a char8_t buffer (which can be done via memcpy or similar) to stay within standard-defined behavior.

The very reason why char8_t is different from char is to make sure its users are aware that this is not a plain char, and that separate encoding/decoding is required to process it. Other than that, it has the same size and representation as char.
It looks like the libraries you are using fail to recognize this, or are pre-C++20 libraries. In either case, you can use reinterpret_cast to cast const char8_t* to const char* - this is one of the rare cases where such a cast is appropriate.
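As an illustration of the cast both answers describe, here is a minimal sketch; takes_utf8 is a hypothetical library function expecting a UTF-8 const char*:

#include <cstring>
#include <filesystem>
#include <string>
#include <vector>

void takes_utf8(const char* utf8);   // hypothetical library API

void example(const std::filesystem::path& p, const std::string& utf8_from_library)
{
    // path -> UTF-8 const char*: casting const char8_t* to const char* is well-defined
    std::u8string u8 = p.u8string();
    takes_utf8(reinterpret_cast<const char*>(u8.c_str()));

    // char -> char8_t: the reverse cast is not allowed, so copy the bytes instead
    std::vector<char8_t> buf(utf8_from_library.size());
    if (!utf8_from_library.empty())
        std::memcpy(buf.data(), utf8_from_library.data(), utf8_from_library.size());
}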

Related

Using UTF-8 string-literal prefixes portably between C++17 and C++20

I have a codebase written in C++17 that makes heavy use of UTF-8 and of the u8 string-literal prefix introduced in C++11 to indicate UTF-8 encoding. However, C++20 changes what a u8 literal produces in C++ from char / const char* to char8_t / const char8_t*, and the latter is not implicitly convertible to const char*.
I'd like for this project to support operating in both C++17 and C++20 mode without breakages; what can be done to support this?
Currently, the project uses a char8 alias that uses the type-result of a u8 literal:
// Produces 'char8_t' in C++20, 'char' in anything earlier
using char8 = decltype(u8' ');
But there are a few problems with this approach:
char is not guaranteed to be unsigned, which makes producing codepoints from numeric values not portable (e.g. char8{129} breaks with char, but not with char8_t).
char8 is not distinct from char in C++17, which can break existing code, and may cause errors.
Continuing from point-2, it's not possible to overload char with char8 in C++17 to handle different encodings because they are not unique types.
What can be done to support operating in both C++17 and C++20 mode, while avoiding the type-difference problem?
I would suggest simply declaring your own char8_t and u8string types in pre-C++20 builds, as aliases for unsigned char and std::basic_string<unsigned char>. Then, anywhere you run into conversion problems, you can write wrapper functions to handle them appropriately in each version.
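A minimal sketch of that approach, using the __cpp_char8_t feature-test macro to detect C++20 (the char8 and as_char_ptr names are made up for illustration):

#include <string>

#if defined(__cpp_char8_t)               // C++20 and later: use the real types
using char8    = char8_t;
using u8string = std::u8string;
#else                                    // C++17: roll our own, as suggested above
using char8    = unsigned char;
// Note: std::char_traits<unsigned char> is not guaranteed by the standard,
// though common implementations accept this instantiation.
using u8string = std::basic_string<unsigned char>;
#endif

// Wrapper hiding the conversion difference wherever a const char* is required.
inline const char* as_char_ptr(const char8* p)
{
    return reinterpret_cast<const char*>(p);
}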

Preparing for char8_t in C++ 17

I'm using Microsoft Visual C++ 16.1 (2019 Community) and am trying to write code that will be "proper" in C++20, which is expected to have a char8_t type that will be an unsigned char. I define a type like this:
using char8_t = unsigned char;
Code such as the following:
std::string data;
const char8_t* ptr = data.c_str();
does not compile, as it will not convert the char pointer to an unsigned char pointer without a reinterpret_cast. Is there something I can do to prepare for C++20 without having reinterpret_casts all over the place?
P1423 (char8_t backward compatibility remediation) documents a number of approaches that can be used to remediate the backward compatibility impact due to the adoption of char8_t via P0482 (char8_t: A type for UTF-8 characters and strings).
Because char8_t is a non-aliasing type, it is undefined behavior to use reinterpret_cast to, for example, treat a pointer to char as a pointer to char8_t, as in reinterpret_cast<const char8_t*>(data.c_str()). However, because char and unsigned char are allowed to alias any type, it is permissible to use reinterpret_cast in the other direction, e.g., reinterpret_cast<const char*>(u8"text").
None of the remediation approaches documented in P1423 are silver bullets. You'll need to evaluate what works best for your use cases. You might also appreciate the answers in C++20 with u8, char8_t and std::string.
With regard to char8_t not being a UTF-8 character and u8string not being a UTF-8 string, that is correct in that, char8_t is a code unit type (not a code point type) and that u8string does not enforce well-formed UTF-8 sequences. However, the intent is very much that these types only be used for UTF-8 data.
Thanks for the comments. The comments and further research have corrected a major misconception which prompted the original question. I now understand that a C++20 char8_t is not a UTF-8 character and a C++20 u8string is not a UTF-8 string. While they may be used in a "UTF-8 string" implementation, they are not such in themselves.
Thus, it appears the use of reinterpret_cast is unavoidable, but it can be hidden/isolated in a set of inline function overloads (or a set of function templates), as sketched below. Implementation of a utf8string object (perhaps as a template) as a distinct type is necessary (if such is not already available somewhere).
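A sketch of such inline overloads for when the code is compiled as C++20 (the from_u8 name is made up; the point is that the reinterpret_casts live in one place):

#include <string_view>

inline const char* from_u8(const char8_t* p)
{
    return reinterpret_cast<const char*>(p);   // legal: char may alias anything
}

inline std::string_view from_u8(std::u8string_view s)
{
    return { reinterpret_cast<const char*>(s.data()), s.size() };
}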

Use signed or unsigned char in constructing CString?

I am checking the documentation for CString. In the following statements:
CStringT( LPCSTR lpsz ): Constructs a Unicode CStringT from an ANSI string. You can also use this constructor to load a string resource as shown in the example below.
CStringT( LPCWSTR lpsz ): Constructs a CStringT from a Unicode string.
CStringT( const unsigned char* psz ): Allows you to construct a CStringT from a pointer to unsigned char.
I have some questions:
Why are there two versions, one for const char* (LPCSTR) and one for unsigned char*? Which version should I use in different cases? For example, does CStringT("Hello") use the first or the second version? When getting a null-terminated string from a third party, such as sqlite3_column_text() (see here), should I convert it to char* or unsigned char*? I.e., should I use CString((LPCSTR)sqlite3_column_text(...)) or CString(sqlite3_column_text(...))? It seems that both will work; is that right?
Why does the char* version construct a "Unicode" CStringT while the unsigned char* version constructs just a CStringT? CStringT is a class template covering all 3 instantiations, i.e. CString, CStringA, and CStringW, so why the emphasis on "Unicode" CStringT when constructing from LPCSTR (const char*)?
LPCSTR is just const char*, not const signed char*. char is signed or unsigned depending on compiler implementation, but char, signed char, and unsigned char are 3 distinct types for purposes of overloading. String literals in C++ are of type const char[], so CStringT("Hello") will always use the LPCSTR constructor, never the unsigned char* constructor.
sqlite3_column_text(...) returns unsigned char* because it returns UTF-8 encoded text. I don't know what the unsigned char* constructor of CStringT actually does (it has something to do with MBCS strings), but the LPCSTR constructor performs a conversion from ANSI to UNICODE using the user's default locale. That would destroy UTF-8 text that contains non-ASCII characters.
Your best option in that case is to convert the UTF-8 text to UTF-16 (using MultiByteToWideChar() or equivalent, or simply using sqlite3_column_text16() instead, which returns UTF-16 encoded text), and then use the LPCWSTR (const wchar_t*) constructor of CStringT, as Windows uses wchar_t for UTF-16 data.
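For illustration, a sketch of the MultiByteToWideChar() route (error handling omitted; stmt and col are placeholders for your prepared statement and column index):

// Convert the UTF-8 bytes from SQLite into a UTF-16 CStringW.
const unsigned char* utf8 = sqlite3_column_text(stmt, col);
int len = MultiByteToWideChar(CP_UTF8, 0,
                              reinterpret_cast<const char*>(utf8), -1,
                              nullptr, 0);               // required length, incl. null
CStringW value;
MultiByteToWideChar(CP_UTF8, 0,
                    reinterpret_cast<const char*>(utf8), -1,
                    value.GetBuffer(len), len);
value.ReleaseBuffer();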
tl;dr: Use either of the following:
CStringW value( sqlite3_column_text16() ); (optionally setting SQLite's internal encoding to UTF-16), or
CStringW value( CA2WEX( sqlite3_column_text(), CP_UTF8 ) );
Everything else is just not going to work out, one way or another.
First things first: CStringT is a class template, parameterized (among other things) on the character type it uses to represent the stored sequence. This is passed as the BaseType template type argument. There are 2 concrete template instantiations, CStringA and CStringW, that use char and wchar_t to store the sequence of characters, respectively1.
CStringT exposes the following predefined types that describe the properties of the template instantiation:
XCHAR: Character type used to store the sequence.
YCHAR: Character type that an instance can be converted from/to.
The following table shows the concrete types for CStringA and CStringW:
         | XCHAR   | YCHAR
---------+---------+---------
CStringA | char    | wchar_t
CStringW | wchar_t | char
While the storage of the CStringT instantiations places no restrictions on the character encoding being used, the conversion c'tors and operators are implemented based on the following assumptions:
char represents ANSI2 encoded code units.
wchar_t represents UTF-16 encoded code units.
If your program doesn't match those assumptions, it is strongly advised to disable implicit wide-to-narrow and narrow-to-wide conversions. To do this, define the _CSTRING_DISABLE_NARROW_WIDE_CONVERSION preprocessor symbol prior to including any ATL/MFC header files. Doing so is recommended even if your program meets the assumptions, to prevent accidental conversions that are both costly and potentially destructive.
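For example (a minimal sketch; atlstr.h stands in for whichever ATL/MFC header your project pulls in first):

// Define before including any ATL/MFC header to disable the implicit
// narrow <-> wide CStringT conversions.
#define _CSTRING_DISABLE_NARROW_WIDE_CONVERSION
#include <atlstr.h>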
With that out of the way, let's move on to the questions:
Why are there two versions, one for const char* (LPCSTR) and one for unsigned char*?
That's easy: Convenience. The overload simply allows you to construct a CString instance irrespective of the signedness of the character type3. The implementation of the overload taking a const unsigned char* argument 'forwards' to the c'tor taking a const char*:
CSTRING_EXPLICIT CStringT(_In_z_ const unsigned char* pszSrc) :
    CThisSimpleString( StringTraits::GetDefaultManager() )
{
    *this = reinterpret_cast< const char* >( pszSrc );
}
Which version should I use for different cases?
It doesn't matter, as long as you are constructing a CStringA, i.e. no conversion is applied. If you are constructing a CStringW, you shouldn't be using either of those (as explained above).
For example, does CStringT("Hello") use the first or second version?
"Hello" is of type const char[6], that decays into a const char* to the first element in the array, when passed to the CString c'tor. It calls the overload taking a const char* argument.
When getting a null-terminated string from a third-party, such as sqlite3_column_text() (see here), should I convert it to char* or unsigned char *? ie, should I use CString((LPCSTR)sqlite3_column_text(...)) or CString(sqlite3_column_text(...))?
SQLite assumes UTF-8 encoding in this case. CStringA can store UTF-8 encoded text, but it's really, really dangerous to do so. CStringA assumes ANSI encoding, and readers of your code likely will, too. It is recommended to change your SQLite database to store UTF-16 (and use sqlite3_column_text16()) to construct a CStringW. If that is not feasible, manually convert from UTF-8 to UTF-16 before storing the data in a CStringW instance, using the CA2WEX macro:
CStringW data( CA2WEX( sqlite3_column_text(), CP_UTF8 ) );
It seems that both will work, is that right?
That's not correct. Neither one works as soon as you get non-ASCII characters from your database.
Why does the char* version construct a "Unicode" CStringT but the unsigned char* version will construct a CStringT?
That looks to be the result of the documentation trying to be compact. CStringT is a class template; by itself it is neither Unicode nor ANSI. I'm guessing that the remarks section on the constructors is meant to highlight the ability to construct Unicode strings from ANSI input (and vice versa). This is briefly mentioned, too ("Note that some of these constructors act as conversion functions.").
To sum this up, here is a list of generic advice when using MFC/ATL strings:
Prefer using CStringW. This is the only string type whose implied character encoding is unambiguous (UTF-16).
Use CStringA only, when interfacing with legacy code. Make sure to unambiguously note the character encoding used. Also make sure to understand that "currently active locale" can change at any time. See Keep your eye on the code page: Is this string CP_ACP or UTF-8? for more information.
Never use CString. Just by looking at the code, it is no longer clear what type this is (it could be either of 2 types). Likewise, when looking at a constructor invocation, it is not possible to tell whether it is a copy or a conversion operation.
Disable implicit conversions for the CStringT class template instantiations.
1 There's also CString that uses the generic-text mapping TCHAR as its BaseType. TCHAR expands to either char or wchar_t, depending on preprocessor symbols. CString is thus an alias for either CStringA or CStringW, depending on those very same preprocessor symbols. Unless you are targeting Win9x, don't use any of the generic-text mappings.
2 Unlike Unicode encodings, ANSI is not a self-contained representation. Interpretation of code units depends on external state (the currently active locale). Do not use unless you are interfacing with legacy code.
3 It is implementation-defined whether char is interpreted as signed or unsigned. Either way, char, unsigned char, and signed char are 3 distinct types. By default, Visual Studio interprets char as signed.

Difference between static_cast<char*> and (char*)

this is my first question :)
I have a file, which I have opened as shown below:
std::ifstream in(filename, std::ios::binary | std::ios::in);
Then I wish to hold 2 bytes of data in an unsigned int named hold:
unsigned int hold;
in.read(static_cast<char *>(&hold), 2);
It seems correct to me. However, when I compile it with
g++ -ansi -pedantic-errors -Werror -Wall -o main main.cpp
the compiler emits this error:
error: invalid static_cast from type ‘unsigned int*’ to type ‘char*’
Actually, I have solved this problem by replacing static_cast with (char*), that is:
unsigned int hold;
in.read((char*)(&hold), 2);
My questions are :
What is the difference(s) between static_cast<char*> and (char*) ?
I am not sure whether using (char*) is safer or not. If you have enough knowledge, can you inform me about that topic?
NOTE: If you have a better idea, please help me so that I can improve my question.
static_cast is a safer cast than the implicit C-style cast. If you try to cast an entity which is not compatible with another, static_cast gives you a compile-time error, unlike the implicit C-style cast.
static_cast gives you an error here because you are asking it to treat an unsigned int* as a char*, which are unrelated pointer types; that conversion cannot be done in a manner static_cast can verify as safe.
If you still want to achieve this, you can use reinterpret_cast. It allows you to cast between two completely different data types, but it is not safe.
The only guarantee you get with reinterpret_cast is that if you cast the result back to the original type, you will get the same value, but there are no other safety guarantees.
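Applied to the code in the question, that looks like this (it still assumes the 2 bytes on disk match the low-order bytes of an in-memory unsigned int on your platform):

unsigned int hold = 0;
in.read(reinterpret_cast<char*>(&hold), 2);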
First: you can easily search for _cast and find any C++ cast. Searching for C-style casts is a lot harder.
Second: if you use C++ casts, you need to choose the correct one. In your case, it is reinterpret_cast.
The C-style cast does everything.
You can also check here: http://www.cplusplus.com/doc/tutorial/typecasting/ for the differences between the C++ casts. I strongly recommend using only C++ casts. This way you can easily find and check them later, and you are forced to think about what you are actually doing there. This improves code quality!
You should have used reinterpret_cast<char *> instead of static_cast<char *>, because the data types are not related: you can convert from a pointer to a subclass to a pointer to a superclass for instance, or between int and long, or between void * and any pointer, but unsigned int * to char * isn't "safe" and thus you cannot do it with static_cast.
The difference is that in C++ you have various types of casts:
static_cast which is for "safe" conversions;
reinterpret_cast which is for "unsafe" conversions;
const_cast which is for removing a const attribute;
dynamic_cast which is for downcasting (casting a pointer/reference from a superclass to a subclass).
The C-style cast (char *)x can mean all of these above, so it is not as clear as the C++ casts. Furthermore, it is easy to grep for a C++-style cast (just grep for _cast), but it's quite hard to search for all C-style casts.
The static_cast is illegal here; you're casting between unrelated pointer types. The solution to get it to compile would be to use reinterpret_cast (which is what (char*) resolves to in this case). Which, of course, tells you that the code isn't portable, and in fact, unless you're doing some very low level work, probably won't work correctly in all cases.
In this case, of course, you're reading raw data, and claiming it's an unsigned int. Which it isn't; what read inputs is raw data, which you still have to manually convert to whatever you need, according to the format used when writing the file. (There is no such thing as unformatted data, just data with an undocumented, unspecified, or unknown format. The “unformatted” input and output in iostream are designed to read and write char buffers which you format manually.) The need for reinterpret_cast here is a definite warning that something is wrong with your code. (There are exceptions, of course, but they're few and far between.)
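A sketch of what that manual conversion can look like, assuming (purely as an example) that the file stores the value as 2 bytes in little-endian order:

unsigned char bytes[2];
in.read(reinterpret_cast<char*>(bytes), 2);
unsigned int hold = bytes[0] | (static_cast<unsigned int>(bytes[1]) << 8);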
You will typically get these errors during binary file I/O using ifstream, ofstream, or fstream. The problem is that these streams have methods that take char* or const char*, while what you have is an array of some other type. You want to write your array as binary bits into the file.
The traditional way of doing this is using the old-style cast (char*), which basically just says: whatever pointer I have, treat it as (char*). Old-style casts are discouraged by pedantic/strict mode. To get rid of those warnings, the C++ equivalent is reinterpret_cast<const char*>.
I would say, if you are doing binary file I/O, then you already know things may or may not be portable depending on how you save the file in one OS and read it in another. That's a whole other issue; however, don't be scared of reinterpret_cast<const char*>, because that's what you have to do if you want to write bytes to a file.

static_cast wchar_t* to int* or short* - why is it illegal?

In both Microsoft VC2005 and g++ compilers, the following results in an error:
On win32 VC2005: sizeof(wchar_t) is 2
wchar_t *foo = 0;
static_cast<unsigned short *>(foo);
Results in
error C2440: 'static_cast' : cannot convert from 'wchar_t *' to 'unsigned short *' ...
On Mac OS X or Linux g++: sizeof(wchar_t) is 4
wchar_t *foo = 0;
static_cast<unsigned int *>(foo);
Results in
error: invalid static_cast from type 'wchar_t*' to type 'unsigned int*'
Of course, I can always use reinterpret_cast. However, I would like to understand why it is deemed illegal by the compiler to static_cast to the appropriate integer type. I'm sure there is a good reason...
You cannot cast between unrelated pointer types. The size of the type pointed to is irrelevant. Consider the case where the types have different alignment requirements; allowing a cast like this could generate illegal code on some processors. It is also possible for pointers to different types to have different sizes. This could result in the pointer you obtain being invalid and/or pointing at an entirely different location. reinterpret_cast is one of the escape hatches you have if you know that, for your program, compiler, architecture, and OS, you can get away with it.
As with char, the signedness of wchar_t is not defined by the standard. Put this together with the possibility of non-2's-complement integers, and for a wchar_t value c,
*reinterpret_cast<unsigned short *>(&c)
may not equal:
static_cast<unsigned short>(c)
In the second case, on implementations where wchar_t is a sign+magnitude or 1's complement type, any negative value of c is converted to unsigned using modulo 2^N, which changes the bits. In the former case the bit pattern is picked up and used as-is (if it works at all).
Now, if the results are different, then there's no realistic way for the implementation to provide a static_cast between the pointer types. What could it do, set a flag on the unsigned short* pointer, saying "by the way, when you load from this, you have to also do a sign conversion", and then check this flag on all unsigned short loads?
That's why it's not, in general, safe to cast between pointers to distinct integer types, and I believe this unsafety is why there is no conversion via static_cast between them.
If the type you're casting to happens to be the so-called "underlying type" of wchar_t, then the resulting code would almost certainly be OK for the implementation, but would not be portable. So the standard doesn't offer a special case allowing you a static_cast just for that type, presumably because it would conceal errors in portable code. If you know reinterpret_cast is safe, then you can just use it. Admittedly, it would be nice to have a straightforward way of asserting at compile time that it is safe, but as far as the standard is concerned you should design around it, since the implementation is not required even to dereference a reinterpret_casted pointer without crashing.
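For instance, a compile-time check of the kind this answer wishes for could be as simple as the following sketch (it only verifies size, not signedness or aliasing rules):

static_assert(sizeof(wchar_t) == sizeof(unsigned short),
              "wchar_t does not have the size this reinterpret_cast assumes");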
By the spec, static_cast is restricted to conversions between related types, e.g. std::ostream& to std::ofstream&. wchar_t is a distinct built-in type, even though it is widely treated as just another integer type.
Your case (if you really need it) should be handled by reinterpret_cast.
By the way, MSVC++ has an option to treat wchar_t either as a typedef for unsigned short or as a stand-alone built-in data type.
Pointers are not magic "no limitations, anything goes" tools.
They are, by the language specification actually very constrained. They do not allow you to bypass the type system or the rest of the C++ language, which is what you're trying to do.
You are trying to tell the compiler to "pretend that the wchar_t you stored at this address earlier is actually an int. Now read it."
That does not make sense. The object stored at that address is a wchar_t, and nothing else. You are working in a statically typed language, which means that every object has one, and just one, type.
If you're willing to wander into implementation-defined behavior-land, you can use a reinterpret_cast to tell the compiler to just pretend it's ok, and interpret the result as it sees fit. But then the result is not specified by the standard, but by the implementation.
Without that cast, the operation is meaningless. A wchar_t is not an int or a short.