Preparing for char8_t in C++17

I'm using Microsoft Visual C++ 16.1 (2019 Community) and am trying to write code which will be "proper" in C++ 2020 which is expected to have a char8_t type which will be an unsigned char. I define a type like this:
using char8_t = unsigned char;
Code such as the following:
std::string data;
const char8_t* ptr = data.c_str();
does not compile, as the compiler will not convert the char pointer to an unsigned char pointer without a reinterpret_cast. Is there something I can do to prepare for 2020 without having reinterpret_casts all over the place?

P1423 (char8_t backward compatibility remediation) documents a number of approaches that can be used to remediate the backward compatibility impact due to the adoption of char8_t via P0482 (char8_t: A type for UTF-8 characters and strings).
Because char8_t is a non-aliasing type, it is undefined behavior to access char data through a char8_t pointer obtained with reinterpret_cast, as in reinterpret_cast<const char8_t*>(data.c_str()). However, because char and unsigned char are allowed to alias any type, it is permissible to use reinterpret_cast in the other direction, e.g., reinterpret_cast<const char*>(u8"text").
None of the remediation approaches documented in P1423 are silver bullets. You'll need to evaluate what works best for your use cases. You might also appreciate the answers in C++20 with u8, char8_t and std::string.
With regard to char8_t not being a UTF-8 character and u8string not being a UTF-8 string, that is correct in that char8_t is a code unit type (not a code point type) and u8string does not enforce well-formed UTF-8 sequences. However, the intent is very much that these types only be used for UTF-8 data.

Thanks for the comments. The comments and further research have corrected a major misconception which prompted the original question. I now understand that a 2020 char8_t is not a UTF-8 character and a 2020 u8string is not a UTF-8 string. While they may be used in a "UTF-8 string" implementation, they are not such.
Thus, it appears that the use of reinterpret_cast is unavoidable, but it can be hidden/isolated in a set of inline function overloads (or a set of function templates). Implementing a utf8string object (perhaps as a template) as a distinct object is necessary (if such is not already available somewhere).
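The overload idea can be sketched as follows, assuming only the char8_t-to-char direction is needed (the direction the aliasing rules permit). The names as_char and example_as_char are hypothetical, and the char8_t overload is compiled only when the compiler defines the __cpp_char8_t feature-test macro:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: view UTF-8 code units as plain char.
// Before C++20, u8"..." already yields const char*, so this overload is a no-op.
inline const char* as_char(const char* s) noexcept { return s; }

#if defined(__cpp_char8_t)
// C++20: u8"..." yields const char8_t*. Reading it through const char* is
// permitted because char may alias any object type.
inline const char* as_char(const char8_t* s) noexcept {
    return reinterpret_cast<const char*>(s);
}
#endif

// Usage: the call site is identical in C++17 and C++20 mode.
inline void example_as_char() {
    const std::string s = as_char(u8"text");
    assert(s == "text");
}
```

With this shape the single reinterpret_cast lives in one function, and the rest of the code never spells the cast out.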

Related

Using UTF-8 string-literal prefixes portably between C++17 and C++20

I have a codebase written in C++17 that makes heavy use of UTF-8, and the u8 string literal introduced in C++11 to indicate UTF-8 encoding. However, C++20 changes what the u8 literal produces, from a char or const char* to a char8_t or const char8_t*; the latter is not implicitly convertible to const char*.
I'd like for this project to support operating in both C++17 and C++20 mode without breakages; what can be done to support this?
Currently, the project uses a char8 alias that uses the type-result of a u8 literal:
// Produces 'char8_t' in C++20, 'char' in anything earlier
using char8 = decltype(u8' ');
But there are a few problems with this approach:
1. char is not guaranteed to be unsigned, which makes producing codepoints from numeric values not portable (e.g. char8{129} breaks with char, but not with char8_t).
2. char8 is not distinct from char in C++17, which can break existing code, and may cause errors.
3. Continuing from point 2, it's not possible to overload char with char8 in C++17 to handle different encodings because they are not unique types.
What can be done to support operating in both C++17 and C++20 mode, while avoiding the type-difference problem?
I would suggest simply declaring your own char8_t and u8string types in pre-C++20 versions to alias unsigned char and basic_string<unsigned char>. And then anywhere you run into conversion problems, you can write wrapper functions to handle them appropriately in each version.
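A minimal sketch of that suggestion, under a few assumptions. The alias is named char8 rather than char8_t because char8_t is a keyword in C++20 and cannot be redeclared there. The namespace compat and the function to_narrow are hypothetical names. The sketch also stays away from basic_string<unsigned char>, since the standard only requires char_traits specializations for the character types, so that instantiation may not compile on every standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace compat {  // hypothetical namespace for this sketch

#if defined(__cpp_char8_t)
using char8 = char8_t;        // C++20: the real, distinct type
#else
using char8 = unsigned char;  // earlier standards: an unsigned stand-in
#endif

// Wrapper that copies UTF-8 code units into a std::string for char-based APIs.
// Reading the units through const char* is allowed in both language modes.
inline std::string to_narrow(const char8* data, std::size_t size) {
    return std::string(reinterpret_cast<const char*>(data), size);
}

}  // namespace compat

inline void example_to_narrow() {
    // U+00E9 encoded as two UTF-8 code units; the values are unsigned in both modes.
    const compat::char8 e_acute[] = {0xC3, 0xA9};
    const std::string s = compat::to_narrow(e_acute, 2);
    assert(s.size() == 2);
    assert(static_cast<unsigned char>(s[0]) == 0xC3);
}
```

Because the stand-in is unsigned, a value such as char8{129} behaves the same way in both modes, which addresses the first problem listed in the question. It does not make char8 a type distinct from unsigned char before C++20.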

Interfacing std::filesystem::path with libraries that expect UTF-8 char*?

I'm looking to use std::filesystem::path to easily manipulate paths, but the libraries I'm using expect a const char* in UTF-8 encoding on all platforms.
I see that I can get a u8string, but its c_str() returns a char8_t*.
Is there some way for me to go from filesystem::path to a UTF-8 encoded char* on all platforms?
A buffer of char8_t can be reasonably safely cast to a char const* pointer and passed to the other API.
char8_t is a distinct type whose underlying storage is identical to an unsigned char. Casting unsigned char bits to char is legal.
char may be signed or unsigned, so fiddling with it is somewhat dangerous in portable code. But simply passing it through (read only) to another API is very safe.
Usually aliasing one type as another is illegal in C++, but char is one of the types with special dispensation to alias.
Note it is not legal to cast a buffer of char directly into a pointer to char8_t. So if it is providing utf-8 sequences in char data and you need it as a char8_t buffer, you'll have to copy it over to a char8_t buffer (which can be done via memcpy or similar) to stay within standard-defined behavior.
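The copy described above might look like this. It is a sketch: utf8_unit and copy_to_utf8_units are hypothetical names, and the alias falls back to unsigned char when the compiler does not define __cpp_char8_t, so the same code builds in both modes.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = unsigned char;  // stand-in before C++20
#endif

// Copy UTF-8 bytes held in char storage into a buffer of UTF-8 code units.
// The bytes are copied, so no char object is ever read through a char8_t pointer.
inline std::vector<utf8_unit> copy_to_utf8_units(const std::string& bytes) {
    std::vector<utf8_unit> out(bytes.size());
    if (!bytes.empty()) {
        std::memcpy(out.data(), bytes.data(), bytes.size());
    }
    return out;
}

inline void example_copy() {
    // "\xC3\xA9" is the UTF-8 encoding of U+00E9.
    const std::vector<utf8_unit> units = copy_to_utf8_units("\xC3\xA9");
    assert(units.size() == 2);
    assert(units[0] == 0xC3 && units[1] == 0xA9);
}
```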
The very reason why char8_t is different from char is to make sure its users are aware that this is not a simple char, and separate encoding/decoding is required to process it. Other than that, it is the same as char.
It looks like the libraries you are using fail to recognize this or are pre-C++20 libraries. In either case, you can use reinterpret_cast to cast const char8_t* to const char*; this would be one of the rare cases where such a cast is appropriate.
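For the filesystem case, a helper along these lines keeps the cast in one place. This is a sketch, and path_to_utf8 is a hypothetical name. It relies on path::u8string() returning std::string in C++17 and std::u8string in C++20; the cast to const char* is valid for both.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Return the path as UTF-8 bytes in a std::string, in C++17 and C++20 alike.
inline std::string path_to_utf8(const std::filesystem::path& p) {
    const auto encoded = p.u8string();  // std::string (C++17) or std::u8string (C++20)
    return std::string(reinterpret_cast<const char*>(encoded.data()),
                       encoded.size());
}

inline void example_path() {
    const std::string utf8 = path_to_utf8(std::filesystem::path("file.txt"));
    assert(utf8 == "file.txt");
    // utf8.c_str() can be handed to a library expecting UTF-8 const char*,
    // as long as utf8 outlives the call.
    const char* for_c_api = utf8.c_str();
    (void)for_c_api;
}
```

Returning a std::string costs one copy but avoids handing out a pointer into a temporary u8string.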

PC-lint / Flexelint rule against plain char

Gimpel Software's PC-lint and Flexelint have a rule "971: Use of 'char' without 'signed' or 'unsigned'", that forbids using the plain char type without specifying signedness.
http://www.gimpel.com/html/pub/msg.txt
I think this is misguided. If char is used as an integer type then it might make sense to specify signedness explicitly, but not when it is used for textual characters. Standard library functions like printf take pointers to plain char, and using signed or unsigned char is a type mismatch. One can of course cast between the types, but that can lead to just the kind of mistakes that lint is trying to prevent.
Is this lint rule against the plain char type wrong?
The messages PC Lint provides in the 900-999 (and 1900-1999 for C++) range are called "Elective Notes" and off by default. They are intended for use if you have a coding guideline that is restrictive in some specific way. You may then activate one or more of these notes to help you find potential violations. I do not think that anyone has activated all the 9xx messages in a real development effort.
You are right about char: It is the use of a byte (almost always) for a real character. However, a char is treated by a C compiler as either signed or unsigned. For C++, char is different from both unsigned char and from signed char.
In many embedded C environments I have worked in, it was customary to have a coding rule stating that a plain char is not allowed. That's when this PC Lint message should be activated. Exceptions, like in interfacing with other libraries, had to be explicitly permitted, and then used a Lint comment to suppress the individual message.
I would assume the reason that they chose to force you to pick signed or unsigned is because the C standard does not. The C standard states char, unsigned char, and signed char are three unique types.
gcc, for example, makes char signed by default on many targets (x86, for instance), but that can be changed with the flag -funsigned-char.
So IMO I would say no, the rule against it is not wrong; it's just trying to tighten up the C spec.
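The points about plain char can be checked at compile time. The sketch below is in C++ (the thread is about C and C++); byte_value and plain_char_is_signed are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <climits>
#include <type_traits>

// char, signed char and unsigned char are three distinct types.
static_assert(!std::is_same<char, signed char>::value, "char is not signed char");
static_assert(!std::is_same<char, unsigned char>::value, "char is not unsigned char");

// Whether plain char is signed is up to the implementation (and compiler flags).
constexpr bool plain_char_is_signed = std::is_signed<char>::value;

// When a char is used as a number, go through unsigned char first so the
// result does not depend on the signedness of plain char.
inline unsigned int byte_value(char c) {
    return static_cast<unsigned char>(c);
}

inline void example_char() {
    assert(plain_char_is_signed == (CHAR_MIN < 0));
    assert(byte_value('\x81') == 0x81u);
}
```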

static_cast wchar_t* to int* or short* - why is it illegal?

In both Microsoft VC2005 and g++ compilers, the following results in an error:
On win32 VC2005: sizeof(wchar_t) is 2
wchar_t *foo = 0;
static_cast<unsigned short *>(foo);
Results in
error C2440: 'static_cast' : cannot convert from 'wchar_t *' to 'unsigned short *' ...
On Mac OS X or Linux g++: sizeof(wchar_t) is 4
wchar_t *foo = 0;
static_cast<unsigned int *>(foo);
Results in
error: invalid static_cast from type 'wchar_t*' to type 'unsigned int*'
Of course, I can always use reinterpret_cast. However, I would like to understand why it is deemed illegal by the compiler to static_cast to the appropriate integer type. I'm sure there is a good reason...
You cannot static_cast between unrelated pointer types. The size of the type pointed to is irrelevant. Consider the case where the types have different alignment requirements: allowing a cast like this could generate illegal code on some processors. It is also possible for pointers to different types to have different sizes. This could result in the pointer you obtain being invalid and/or pointing at an entirely different location. reinterpret_cast is one of the escape hatches you have if you know that, for your program, compiler, architecture and OS, you can get away with it.
As with char, the signedness of wchar_t is not defined by the standard. Put this together with the possibility of non-2's complement integers, and for a wchar_t value c,
*reinterpret_cast<unsigned short *>(&c)
may not equal:
static_cast<unsigned short>(c)
In the second case, on implementations where wchar_t is a sign+magnitude or 1's complement type, any negative value of c is converted to unsigned using modulo 2^N, which changes the bits. In the former case the bit pattern is picked up and used as-is (if it works at all).
Now, if the results are different, then there's no realistic way for the implementation to provide a static_cast between the pointer types. What could it do, set a flag on the unsigned short* pointer, saying "by the way, when you load from this, you have to also do a sign conversion", and then check this flag on all unsigned short loads?
That's why it's not, in general, safe to cast between pointers to distinct integer types, and I believe this unsafety is why there is no conversion via static_cast between them.
If the type you're casting to happens to be the so-called "underlying type" of wchar_t, then the resulting code would almost certainly be OK for the implementation, but would not be portable. So the standard doesn't offer a special case allowing you a static_cast just for that type, presumably because it would conceal errors in portable code. If you know reinterpret_cast is safe, then you can just use it. Admittedly, it would be nice to have a straightforward way of asserting at compile time that it is safe, but as far as the standard is concerned you should design around it, since the implementation is not required even to dereference a reinterpret_casted pointer without crashing.
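The difference between converting the value and reusing the bit pattern can be made explicit without any pointer cast. A sketch, with hypothetical names wide_bits, value_of and bits_of; std::make_unsigned picks an unsigned integer type of the same size as wchar_t, and the static_assert documents that assumption at compile time:

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// An unsigned integer type with the same size as wchar_t on this implementation.
using wide_bits = std::make_unsigned<wchar_t>::type;
static_assert(sizeof(wide_bits) == sizeof(wchar_t), "sizes must match");

// Value conversion: a negative value is reduced modulo 2^N.
inline wide_bits value_of(wchar_t c) {
    return static_cast<wide_bits>(c);
}

// Bit-pattern copy: the object representation is reused as-is.
inline wide_bits bits_of(wchar_t c) {
    wide_bits u;
    std::memcpy(&u, &c, sizeof u);
    return u;
}

inline void example_wide() {
    // For a non-negative value both routes give the same result.
    assert(value_of(L'A') == bits_of(L'A'));
}
```

Copying with memcpy sidesteps the aliasing question that dereferencing a reinterpret_cast pointer raises, at the cost of working on one value at a time.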
By the spec, static_cast between pointer or reference types is restricted to related types, e.g. std::ostream& to std::ofstream&. wchar_t is a distinct built-in type in C++, not an alias for an integer type.
Your case (if you really need it) should be handled with reinterpret_cast.
By the way, MSVC++ has an option (/Zc:wchar_t) to treat wchar_t either as a typedef for unsigned short or as a stand-alone data type.
Pointers are not magic "no limitations, anything goes" tools.
They are, by the language specification actually very constrained. They do not allow you to bypass the type system or the rest of the C++ language, which is what you're trying to do.
You are trying to tell the compiler to "pretend that the wchar_t you stored at this address earlier is actually an int. Now read it."
That does not make sense. The object stored at that address is a wchar_t, and nothing else. You are working in a statically typed language, which means that every object has one, and just one, type.
If you're willing to wander into implementation-defined behavior-land, you can use a reinterpret_cast to tell the compiler to just pretend it's ok, and interpret the result as it sees fit. But then the result is not specified by the standard, but by the implementation.
Without that cast, the operation is meaningless. A wchar_t is not an int or a short.

How to get string from an address?

I have a ULONG value that contains the address.
The address is basically that of a string (an array of wchar_t terminated by a null character).
I want to retrieve that string.
What is the best way to do that?
@KennyTM's answer is right on the money if by "basically of a string" you mean it's a pointer to an instance of the std::string class. If you mean it's a pointer to a C string, which I suspect may be more likely, you need:
char *s = reinterpret_cast<char *>(your_ulong);
Or, in your case:
wchar_t *s = reinterpret_cast<wchar_t *>(your_ulong);
Note also that you can't safely store pointers in any old integral type. I can make a compiler with a 32-bit long type and a 64-bit pointer type. If your compiler supports it, a proper way to store pointers in integers is to use the stdint.h types intptr_t and uintptr_t, which are guaranteed (by the C99 standard) to be big enough to store pointer types.
However, C99 isn't part of C++, and many C++ compilers (read: Microsoft) may not provide this kind of functionality (because who needs to write portable code?). Fortunately, stdint.h is useful enough that workarounds exist, and portable (and free) implementations of stdint.h for compatibility with older compilers can be found easily on the internet.
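On a compiler with C++11 or later, the <cstdint> header offers std::uintptr_t (formally optional, but widely available), so the round trip can be sketched as below. The names to_address and string_at are hypothetical, and the sketch assumes the address really does point at a null-terminated wchar_t array that is still alive.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Store a pointer in an integer type that is wide enough to hold it.
inline std::uintptr_t to_address(const wchar_t* s) {
    return reinterpret_cast<std::uintptr_t>(s);
}

// Recover the wide string from the stored address.
// Assumption: address is 0 or points at a live, null-terminated wchar_t array.
inline std::wstring string_at(std::uintptr_t address) {
    const wchar_t* s = reinterpret_cast<const wchar_t*>(address);
    return s ? std::wstring(s) : std::wstring();
}

inline void example_address() {
    const wchar_t* text = L"hello";
    const std::uintptr_t address = to_address(text);
    assert(string_at(address) == L"hello");
}
```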
string& s = *reinterpret_cast<string*>(your_ulong);