Is It Legal to Cast Away the Sign on a Pointer? - c++

I am working in an antiquated code base that used unsigned char*s to contain strings. For my functionality I've used strings however there is a rub:
I can't use anything in #include <cstring> in the old code. Copying from a string to an unsigned char* is a laborious process of:
unsigned char foo[12];
string bar{"Lorem Ipsum"};
transform(bar.cbegin(), bar.cbegin() + min(sizeof(foo) / sizeof(foo[0]), bar.size()), foo, [](auto i){return static_cast<unsigned char>(i);});
foo[sizeof(foo) / sizeof(foo[0]) - 1] = '\0';
Am I going to get into undefined behavior or aliasing problems if I just do:
strncpy(reinterpret_cast<char*>(foo), bar.c_str(), sizeof(foo) / sizeof(foo[0]) - 1);
foo[sizeof(foo) / sizeof(foo[0]) - 1] = '\0';

There is an explicit exception to the strict aliasing rule for [unsigned] char, so casting pointers between character types will just work.
Specifically in N3690 [basic.types] says that any trivially copyable object can be copied into an array of char or unsigned char, and if then copied back the value is identical. It also says if you copy the same array into a second object, the two objects are identical. (Paragraphs two and three)
[basic.lval] says it is legal to change an object through an lvalue of char or unsigned char type.
The concern expressed by BobTFish in the comments about whether values in char and unsigned char is misplaced I think. "Character" values are inherently of char type. You can store them in unsigned char and use them as char later - but that was happening already.
(I'd recommend writing a few in-line wrapper functions to make the whole thing less noisy, but I assume the code snippets were for exposition rather than actual usage.)
Edit: Remove erroneous recommendation to use static_cast.
Edit2: Chapter and verse.

Related

Incompatibility between char* and unsigned char*?

The following line of code produces a compiler warning with HP-UX's C++ compiler:
strcpy(var, "string")
Output:
error #2167: argument of type "unsigned char *"
is incompatible with parameter of type "char *"
Please note: var is the unsigned char * here - its data type is outside of my control.
Two questions:
What does incompatibility mean in the context of these two types? What would happen if the compiler was forced to accept the conversion? An example would be appreciated.
What would be a safe way to make the above line of code work, assuming I have to use strcpy?
C++ is being strict in checking the types where std::strcpy expects a char* and your variable var is an unsigned char*.
Fortunately, in this case, it is perfectly safe to cast the pointer to a char* like this:
std::strcpy(reinterpret_cast<char*>(var), "string");
That is because, according to the standard, char, unsigned char and signed char can each alias one another.
In C Standard, the char is Impementation Defined.
ANSI C provides three kinds of character types(All three take one byte): char, signed char, unsigned char. Not just like short, int, only two.
You can try:
char *str="abcd";
signed char *s_str = str;
The compiler will warn the second line of the code is error.It just like:
short num = 10;
unsigned short *p_num = &num;
The compiler will warn too. Because they are different type defined in compiler.
So, if you write 'strcpy( (char*)var, "string")',the code just copy the characters from "string"'s space to 'var's space. Whether there is a bug here depends on what do you do with 'var'. Because 'var' is not a 'char *'
char, signed char, and unsigned char are distinct types in C++. And pointers to them are incompatible - for example forcing a compiler to convert a unsigned char * to char * in order to pass it to strcpy() formally results in undefined behaviour - when the pointer is subsequently dereferenced - in several cases. Hence the warning.
Rather than using strcpy() (and therefore having to force conversions of pointers) you would be better off doing (C++11 and later)
const char thing[] = "string";
std::copy(std::begin(thing), std::end(thing), var);
which does not have undefined behaviour.
Even better, consider using standard containers, such as a std::vector<unsigned char> and a std::string, rather than working with raw arrays. All standard containers provide a means of accessing their data (e.g. for passing a suitable pointer to a function in a legacy API).

Is this type casting or some kind of pointer arithmetics?

I came across a line of code written in C++:
long *lbuf = (long*)spiReadBuffer;
And it turns out that "spiReadBuffer" is a byte array with 12 elements. But I am a little confused. I think I am familiar with defining pointers and I can see that "lbuf" is a type "long" pointer. Also I thought for casting we can do something like this:
y = (int) x;
But what if I put a "*" after the "int" just like my first example, where there is one after "long"?
I apologize if this is a really trivial question, but as I went through the type casting and pointers topics I did not come across my case and I did not really understand it.
I would appreciate it if you could guide me or introduce me to any relevant materials or resources.
This is called type punning. It tricks the compiler into reading the memory occupied by an object as if it was of another type.
In your case, the array spiReadBuffer decays to a pointer to its first element, then the pointer is cast and stored. When you dereference this pointer, you will access the beginning of the array as if it were a long.
The problem with this approach is that it triggers undefined behaviour (see strict aliasing). So even though it works in a lot of situations, it can also break without notice.
There are two ways (that I know of) to type-pun safely. The first one is standard-compliant : std::memcpy.
char spiReadBuffer[12];
long rbAsLong;
std::memcpy(&rbAsLong, &spiReadBuffer, sizeof rbAsLong);
// rbAsLong contains the first four bytes of spiReadBuffer, reinterpreted as a long.
The second one involves an extension that is often provided by compilers (but you should check), that extends the behaviour of unions.
union {
char buf[12];
long asLong;
} spiReadBuffer;
The standard states that writing to a member of a union then reading from another member is undefined behaviour. These compiler extensions choose to define it as a safe reinterpretation.
in C/C++ arrays are treated the same way by the compiler:
char spiReadBuffer[12];
char* pBuffer;
the compiler will treat both spiReadBuffer and pBuffer as pointers.
The code snippet
long *lbuf = (long*)spiReadBuffer;
is an example of type casting, only it's for pointer types. A char* is converted to a long*; You could say this is a type of pointer arithmetic because now, you can read sizeof(long) bytes from spiReadBuffer using the long* ( instead of one byte at a time ).
The second snippet you showed : y = (int) x; is also a cast, but not for pointers;
Consider this snippet:
char spiReadBuffer[] = {1,2,3,4,5,6,7,8};
long *lbuf = (long*)spiReadBuffer;
printf ("%08x\n", lbuf[0]);
It will print 04030201 on a little endian architecture or 01020304 on a little endian architecture.
After the long *lbuf = (long*)spiReadBuffer statement lBuf points to the beginning of the spiReadBuffer and lbuf[0] (or *lBuf) allows you to read the first 4 bytes of spiReadBuffer as a long.

Passing an "unsigned char" to an overloaded function expecting either char or const char* results in ambiguity

I'm using a QByteArray to store raw binary data. To store the data I use QByteArray's append function.
I like using unsigned chars to represent bytes since I think 255 is easier to interpret than -1. However, when I try to append a zero-valued byte to a QByteArray as follows:
command.append( (unsigned char) 0x00));
the compiler complains with call of overloaded append(unsigned char) is ambiguous. As I understand it this is because a zero can be interpreted as a null pointer, but why doesn't the compiler treat the unsigned char as a char, rather than wondering if it's a const char*? I would understand if the compiler complained about command.append(0) without any casting, of course.
Both overloads require a type conversion - one from unsigned char to char, the other from unsigned char to const char *. The compiler doesn't try to judge which one is better, it just tells you to make it explicit. If one were an exact match it would be used:
command.append( (char) 0x00));
unsigned char and char are two different, yet convertible types. unsigned char and const char * also are two different types, also convertible in this specific case. This means that neither of your overloaded functions is an exact match for the argument, yet in both cases the arguments are convertible to the parameter types. From the language point of view both functions are equally good candidates for the call. Hence the ambiguity.
You seem to believe that the unsigned char version should be considered a "better" match. But the language disagrees with you.
It is true that in this case the ambiguity stems from the fact that (unsigned char) 0x00 is a valid null-pointer constant. You can work around the problem by introducing an intermediate variable
unsigned char c = 0x0;
command.append(c);
c does not qualify as null-pointer constant, which eliminates the ambiguity. Although, as #David Rodríguez - dribeas noted in the comments you can eliminate the ambiguity by simply casting your zero to char instead of unsigned char.

Why can't I static_cast between char * and unsigned char *?

Apparently the compiler considers them to be unrelated types and hence reinterpret_cast is required. Why is this the rule?
They are completely different types see standard:
3.9.1 Fundamental types [basic.fundamental]
1 Objects declared as characters char) shall be large enough to
store any member of the implementation's basic character set. If a
character from this set is stored in a character object, the integral
value of that character object is equal to the value of the single
character literal form of that character. It is
implementation-defined whether a char object can hold negative
values. Characters can be explicitly declared unsigned or
signed. Plain char, signed char, and unsigned char are
three distinct types. A char, a signed char, and an unsigned char
occupy the same amount of storage and have the same alignment
requirements (basic.types); that is, they have the same object
representation. For character types, all bits of the object
representation participate in the value representation. For unsigned
character types, all possible bit patterns of the value representation
represent numbers. These requirements do not hold for other types. In
any particular implementation, a plain char object can take on either
the same values as a signed char or an unsigned char; which one is
implementation-defined.
So analogous to this is also why the following fails:
unsigned int* a = new unsigned int(10);
int* b = static_cast<int*>(a); // error different types
a and b are completely different types, really what you are questioning is why is static_cast so restrictive when it can perform the following without problem
unsigned int a = new unsigned int(10);
int b = static_cast<int>(a); // OK but may result in loss of precision
and why can it not deduce that the target types are the same bit-field width and can be represented? It can do this for scalar types but for pointers, unless the target is derived from the source and you wish to perform a downcast then casting between pointers is not going to work.
Bjarne Stroustrop states why static_cast's are useful in this link: http://www.stroustrup.com/bs_faq2.html#static-cast but in abbreviated form it is for the user to state clearly what their intentions are and to give the compiler the opportunity to check that what you are intending can be achieved, since static_cast does not support casting between different pointer types then the compiler can catch this error to alert the user and if they really want to do this conversion they then should use reinterpret_cast.
you're trying to convert unrelated pointers with a static_cast. That's not what static_cast is for. Here you can see: Type Casting.
With static_cast you can convert numerical data (e.g. char to unsigned char should work) or pointer to related classes (related by some inheritance). This is both not the case. You want to convert one unrelated pointer to another so you have to use reinterpret_cast.
Basically what you are trying to do is for the compiler the same as trying to convert a char * to a void *.
Ok, here some additional thoughts why allowing this is fundamentally wrong. static_cast can be used to convert numerical types into each other. So it is perfectly legal to write the following:
char x = 5;
unsigned char y = static_cast<unsigned char>(x);
what is also possible:
double d = 1.2;
int i = static_cast<int>(d);
If you look at this code in assembler you'll see that the second cast is not a mere re-interpretation of the bit pattern of d but instead some assembler instructions for conversions are inserted here.
Now if we extend this behavior to arrays, the case where simply a different way of interpreting the bit pattern is sufficient, it might work. But what about casting arrays of doubles to arrays of ints?
That's where you either have to declare that you simple want a re-interpretation of the bit patterns - there's a mechanism for that called reinterpret_cast, or you must do some extra work. As you can see simple extending the static_cast for pointer / arrays is not sufficient since it needs to behave similar to static_casting single values of the types. This sometimes needs extra code and it is not clearly definable how this should be done for arrays. In your case - stopping at \0 - because it's the convention? This is not sufficient for non-string cases (number). What will happen if the size of the data-type changes (e.g. int vs. double on x86-32bit)?
The behavior you want can't be properly defined for all use-cases that's why it's not in the C++ standard. Otherwise you would have to remember things like: "i can cast this type to the other as long as they are of type integer, have the same width and ...". This way it's totally clear - either they are related CLASSES - then you can cast the pointers, or they are numerical types - then you can cast the values.
Aside from being pointers, unsigned char * and char * have nothing in common (EdChum already mentioned the fact that char, signed char and unsigned char are three different types). You could say the same thing for Foo * and Bar * pointer types to any dissimilar structures.
static_cast means that a pointer of the source type can be used as a pointer of the destination type, which requires a subtype relationship. Hence it cannot be used in the context of your question; what you need is either reinterpret_cast which does exactly what you want or a C-style cast.

Is there a good way to convert from unsigned char* to char*?

I've been reading a lot those days about reinterpret_cast<> and how on should use it (and avoid it on most cases).
While I understand that using reinterpret_cast<> to cast from, say unsigned char* to char* is implementation defined (and thus non-portable) it seems to be no other way for efficiently convert one to the other.
Lets say I use a library that deals with unsigned char* to process some computations. Internaly, I already use char* to store my data (And I can't change it because it would kill puppies if I did).
I would have done something like:
char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
// We use it here
// processData() takes a unsigned char*
void processData(reinterpret_cast<unsigned char*>(mydata), mydatalen);
// I could have done this:
void processData((unsigned char*)mydata, mydatalen);
// But it would have resulted in a similar call I guess ?
If I want my code to be highly portable, it seems I have no other choice than copying my data first. Something like:
char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
unsigned char* mydata_copy = new unsigned char[mydatalen];
for (size_t i = 0; i < mydatalen; ++i)
mydata_copy[i] = static_cast<unsigned char>(mydata[i]);
void processData(mydata_copy, mydatalen);
Of course, that is highly suboptimal and I'm not even sure that it is more portable than the first solution.
So the question is, what would you do in this situation to have a highly-portable code ?
Portable is an in-practice matter. As such, reinterpret_cast for the specific usage of converting between char* and unsigned char* is portable. But still I'd wrap this usage in a pair of functions instead of doing the reinterpret_cast directly each place.
Don't go overboard introducing inefficiencies when using a language where nearly all the warts (including the one about limited guarantees for reinterpret_cast) are in support of efficiency.
That would be working against the spirit of the language, while adhering to the letter.
Cheers & hth.
The difference between char and an unsigned char types is merely data semantics. This only affects how the compiler performs arithmetic on data elements of either type. The char type signals the compiler that the value of the high bit is to be interpreted as negative, so that the compiler should perform twos-complement arithmetic. Since this is the only difference between the two types, I cannot imagine a scenario where reinterpret_cast <unsigned char*> (mydata) would generate output any different than (unsigned char*) mydata. Moreover, there is no reason to copy the data if you are merely informing the compiler about a change in data sematics, i.e., switching from signed to unsigned arithmetic.
EDIT: While the above is true from a practical standpoint, I should note that the C++ standard states that char, unsigned char and sign char are three distinct data types. § 3.9.1.1:
Objects declared as characters (char) shall be large enough to store
any member of the implementation’s basic character set. If a character
from this set is stored in a character object, the integral value of
that character object is equal to the value of the single character
literal form of that character. It is implementation-defined whether a
char object can hold negative values. Characters can be explicitly
declared unsigned or signed. Plain char, signed char, and unsigned
char are three distinct types, collectively called narrow character
types. A char, a signed char, and an unsigned char occupy the same
amount of storage and have the same alignment requirements (3.11);
that is, they have the same object representation. For narrow
character types, all bits of the object representation participate in
the value representation. For unsigned narrow character types, all
possible bit patterns of the value representation represent numbers.
These requirements do not hold for other types. In any particular
implementation, a plain char object can take on either the same values
as a signed char or an unsigned char; which one is
implementation-defined.
Go with the cast, it's OK in practice.
I just want to add that this:
for (size_t i = 0; i < mydatalen; ++i)
mydata_copy[i] = static_cast<unsigned char>(mydata[i]);
while not being undefined behaviour, could change the contents of your string on machines without 2-complement arithmetic. The reverse would be undefined behaviour.
For C compatibility, the unsigned char* and char* types have extra limitations. The rationale is that functions like memcpy() have to work, and this limits the freedom that compilers have. (unsigned char*) &foo must still point to object foo. Therefore, don't worry in this specific case.