Is there a good way to convert from unsigned char* to char*? - c++

I've been reading a lot these days about reinterpret_cast<> and how one should use it (and avoid it in most cases).
While I understand that using reinterpret_cast<> to cast from, say, unsigned char* to char* is implementation-defined (and thus non-portable), there seems to be no other way to efficiently convert one to the other.
Let's say I use a library that deals with unsigned char* to process some computations. Internally, I already use char* to store my data (and I can't change it, because it would kill puppies if I did).
I would have done something like:
char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
// We use it here
// processData() takes an unsigned char*
processData(reinterpret_cast<unsigned char*>(mydata), mydatalen);
// I could have done this:
processData((unsigned char*)mydata, mydatalen);
// But it would have resulted in a similar call, I guess?
If I want my code to be highly portable, it seems I have no choice but to copy my data first. Something like:
char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
unsigned char* mydata_copy = new unsigned char[mydatalen];
for (size_t i = 0; i < mydatalen; ++i)
    mydata_copy[i] = static_cast<unsigned char>(mydata[i]);
processData(mydata_copy, mydatalen);
Of course, that is highly suboptimal and I'm not even sure that it is more portable than the first solution.
So the question is: what would you do in this situation to have highly portable code?

Portability is an in-practice matter. As such, reinterpret_cast for the specific usage of converting between char* and unsigned char* is portable. But I'd still wrap this usage in a pair of functions instead of doing the reinterpret_cast directly in each place.
Don't go overboard introducing inefficiencies when using a language where nearly all the warts (including the one about limited guarantees for reinterpret_cast) are in support of efficiency.
That would be working against the spirit of the language, while adhering to the letter.
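For example, a minimal sketch of such a pair (the function names here are mine):

inline unsigned char* as_uchar(char* p) { return reinterpret_cast<unsigned char*>(p); }
inline char* as_char(unsigned char* p) { return reinterpret_cast<char*>(p); }

// Usage: processData(as_uchar(mydata), mydatalen);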
Cheers & hth.

The difference between the char and unsigned char types is merely data semantics. This only affects how the compiler performs arithmetic on data elements of either type. The char type signals the compiler that the value of the high bit is to be interpreted as negative, so that the compiler should perform two's complement arithmetic. Since this is the only difference between the two types, I cannot imagine a scenario where reinterpret_cast<unsigned char*>(mydata) would generate output any different from (unsigned char*)mydata. Moreover, there is no reason to copy the data if you are merely informing the compiler about a change in data semantics, i.e., switching from signed to unsigned arithmetic.
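A quick illustration of that point (a sketch with a made-up buffer):

char buf[] = "data";
unsigned char* a = reinterpret_cast<unsigned char*>(buf);
unsigned char* b = (unsigned char*)buf; // the C-style cast performs the same reinterpret_cast here
// a == b, and both point at exactly the same bytes as buf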
EDIT: While the above is true from a practical standpoint, I should note that the C++ standard states that char, unsigned char, and signed char are three distinct data types. § 3.9.1.1:
Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.

Go with the cast, it's OK in practice.
I just want to add that this:
for (size_t i = 0; i < mydatalen; ++i)
    mydata_copy[i] = static_cast<unsigned char>(mydata[i]);
while not being undefined behaviour, could change the contents of your string on machines without two's complement arithmetic. The reverse would be undefined behaviour.

For C compatibility, the unsigned char* and char* types come with extra limitations. The rationale is that functions like memcpy() have to work, and this limits the freedom that compilers have: (unsigned char*)&foo must still point to the object foo. Therefore, don't worry in this specific case.
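For illustration, a small sketch of what that guarantee buys you: you can walk the object representation of any object through an unsigned char* (the byte order you see depends on endianness):

#include <cstdio>

int main()
{
    int foo = 0x01020304;
    unsigned char* p = reinterpret_cast<unsigned char*>(&foo); // must point at foo's first byte
    for (unsigned i = 0; i < sizeof foo; ++i)
        std::printf("%02x ", p[i]); // e.g. "04 03 02 01" on a little-endian machine
    std::printf("\n");
}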

Related

The printf `h` length qualifier and variadic arguments [duplicate]

Aside from %hn and %hhn (where the h or hh specifies the size of the pointed-to object), what is the point of the h and hh modifiers for printf format specifiers?
Due to default promotions which are required by the standard to be applied for variadic functions, it is impossible to pass arguments of type char or short (or any signed/unsigned variants thereof) to printf.
According to 7.19.6.1(7), the h modifier:
Specifies that a following d, i, o, u, x, or X conversion specifier applies to a short int or unsigned short int argument (the argument will have been promoted according to the integer promotions, but its value shall be converted to short int or unsigned short int before printing); or that a following n conversion specifier applies to a pointer to a short int argument.
If the argument was actually of type short or unsigned short, then promotion to int followed by a conversion back to short or unsigned short will yield the same value as promotion to int without any conversion back. Thus, for arguments of type short or unsigned short, %d, %u, etc. should give identical results to %hd, %hu, etc. (and likewise for char types and hh).
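For instance, a trivial demonstration of that equivalence (assuming a conforming implementation):

short s = -42;
printf("%d\n", s);  // -42: s is promoted to int
printf("%hd\n", s); // -42: promoted to int, then converted back to short; same value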
As far as I can tell, the only situation where the h or hh modifier could possibly be useful is when the argument passed is an int outside the range of short or unsigned short, e.g.
printf("%hu", 0x10000);
but my understanding is that passing the wrong type like this results in undefined behavior anyway, so that you could not expect it to print 0.
One real world case I've seen is code like this:
char c = 0xf0;
printf("%hhx", c);
where the author expects it to print f0 despite the implementation having a plain char type that's signed (in which case, printf("%x", c) would print fffffff0 or similar). But is this expectation warranted?
(Note: What's going on is that the original type was char, which gets promoted to int and converted back to unsigned char instead of char, thus changing the value that gets printed. But does the standard specify this behavior, or is it an implementation detail that broken software might be relying on?)
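To spell out the behaviour I'm describing (assuming an implementation with a signed 8-bit plain char and 32-bit int):

char c = 0xf0;       // implementation-defined: typically stores -16 when plain char is signed
printf("%x\n", c);   // c promotes to int -16; read as unsigned, prints fffffff0
printf("%hhx\n", c); // the promoted value is converted back to unsigned char: prints f0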
One possible reason: for symmetry with the use of those modifiers in the formatted input functions? I know it wouldn't be strictly necessary, but maybe there was value seen for that?
Although they don't mention the importance of symmetry for the "h" and "hh" modifiers in the C99 Rationale document, the committee does mention it as a consideration for why the "%p" conversion specifier is supported for fscanf() (even though that wasn't new for C99 - "%p" support is in C90):
Input pointer conversion with %p was added to C89, although it is obviously risky, for symmetry with fprintf.
In the section on fprintf(), the C99 rationale document does discuss that "hh" was added, but merely refers the reader to the fscanf() section:
The %hh and %ll length modifiers were added in C99 (see §7.19.6.2).
I know it's a tenuous thread, but I'm speculating anyway, so I figured I'd give whatever argument there might be.
Also, for completeness, the "h" modifier was in the original C89 standard - presumably it would be there even if it wasn't strictly necessary because of widespread existing use, even if there might not have been a technical requirement to use the modifier.
In %...x mode, all values are interpreted as unsigned. Negative numbers are therefore printed as their unsigned conversions. In two's complement arithmetic, which most processors use, there is no difference in bit patterns between a signed negative number and its positive unsigned equivalent, which is defined by modulus arithmetic (adding the maximum value for the field plus one to the negative number, according to the C99 standard). Lots of software - especially the debugging code most likely to use %x - makes the silent assumption that the bit representation of a signed negative value and its unsigned cast is the same, which is only true on a two's complement machine.
The mechanics of this cast are such that hexadecimal representations of a value always imply, possibly inaccurately, that the number has been rendered in two's complement, as long as it didn't hit an edge condition where the different integer representations have different ranges. This even holds true for arithmetic representations where the value 0 is not represented with the binary pattern of all 0s.
A negative short displayed as an unsigned long in hexadecimal will therefore, on any machine, be padded with f, due to implicit sign extension in the promotion, which printf will print. The value is the same, but it is truly visually misleading as to the size of the field, implying a significant amount of range that simply isn't present.
%hx truncates the displayed representation to avoid this padding, exactly as you concluded from your real-world use case.
The behavior of printf is undefined when passed an int outside the range of short that should be printed as a short, but the easiest implementation by far simply discards the high bits with a raw downcast, so while the spec doesn't require any specific behavior, pretty much any sane implementation is going to just perform the truncation. There are generally better ways to do that, though.
If printf isn't padding values or displaying unsigned representations of signed values, %h isn't very useful.
The only use I can think of is for passing an unsigned short or unsigned char and using the %x conversion specifier. You cannot simply use a bare %x - the value may be promoted to int rather than unsigned int, and then you have undefined behaviour.
Your alternatives are either to explicitly cast the argument to unsigned; or to use %hx / %hhx with a bare argument.
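For example, either of these is well-defined:

unsigned char byte = 0xAB;
printf("%x\n", (unsigned int)byte); // explicit cast to unsigned int
printf("%hhx\n", byte);             // bare argument; %hhx does the conversion for you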
The variadic arguments to printf() et al are automatically promoted using the default conversions, so any short or char values are promoted to int when passed to the function.
In the absence of the h or hh modifiers, you would have to mask the values passed to get the correct behaviour reliably. With the modifiers, you no longer have to mask the values; the printf() implementation does the job properly.
Specifically, for the format %hx, the code inside printf() can do something like:
va_list args;
va_start(args, format);
// ...
int i = va_arg(args, int);
unsigned short s = (unsigned short)i;
// ...print s correctly, as 4 hex digits maximum,
// even on a machine with 64-bit int!
I'm blithely assuming that short is a 16-bit quantity; the standard does not actually guarantee that, of course.
I found it useful to avoid casting when formatting unsigned chars to hex:
sprintf_s(tmpBuf, 3, "%2.2hhx", *(CEKey + i));
It's a minor coding convenience, and looks cleaner than multiple casts (IMO).
Another place it's handy is snprintf size checking.
gcc 7 added a size check when using snprintf, so this will fail:
char arr[4];
char x = 'r';
snprintf(arr, sizeof(arr), "%d", x);
So it forces you to use a bigger char array when using %d to format a char.
Here is a commit that shows those fixes: instead of increasing the char array sizes, they changed %d to %h. This also gives a more accurate description:
https://github.com/Mellanox/libvma/commit/b5cb1e34a04b40427d195b14763e462a0a705d23#diff-6258d0a11a435aa372068037fe161d24
I agree with you that it is not strictly necessary, and for that reason alone it is no good in a C library function :)
It might be "nice" for the symmetry of the different flags, but it is mostly counter-productive because it hides the "conversion to int" rule.

Can I safely use std::string for binary data in C++11?

There are several posts on the internet that suggest that you should use std::vector<unsigned char> or something similar for binary data.
But I'd much rather prefer a std::basic_string variant for that, since it provides many convenient string manipulation functions. And AFAIK, since C++11, the standard guarantees what every known C++03 implementation already did: that std::basic_string stores its contents contiguously in memory.
At first glance then, std::basic_string<unsigned char> might be a good choice.
I don't want to use std::basic_string<unsigned char>, however, because almost all operating system functions only accept char*, making an explicit cast necessary. Also, string literals are const char*, so I would need an explicit cast to const unsigned char* every time I assigned a string literal to my binary string, which I would also like to avoid. Also, functions for reading from and writing to files or networking buffers similarly accept char* and const char* pointers.
This leaves std::string, which is basically a typedef for std::basic_string<char>.
The only potential remaining issue (that I can see) with using std::string for binary data is that std::string uses char (which can be signed).
char, signed char, and unsigned char are three different types and char can be either unsigned or signed.
So, when an actual byte value of 11111111b is returned from std::string::operator[] as char, and you want to check its value, it can be either 255 (if char is unsigned) or "something negative" (if char is signed, depending on your number representation).
Similarly, if you want to explicitly append the actual byte value 11111111b to a std::string, simply appending (char)(255) might be implementation-defined (and might even raise a signal) if char is signed and the int-to-char conversion results in an overflow.
So, is there a safe way around this, that makes std::string binary-safe again?
§3.10/15 states:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
[...]
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
[...]
a char or unsigned char type.
This, if I understand it correctly, seems to allow using an unsigned char* pointer to access and manipulate the contents of a std::string, and makes this well-defined. It just reinterprets the bit pattern as an unsigned char, without any change or information loss, the latter because all bits in a char, signed char, and unsigned char must participate in the value representation.
I could then use this unsigned char* interpretation of the contents of std::string as a means to access and change the byte values in the [0, 255] range, in a well-defined and portable manner, regardless of the signedness of char itself.
This should solve any problems arising from a potentially signed char.
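For illustration, a sketch of what I have in mind (the helper name bytes_of is my own, and this relies on the C++11 contiguity guarantee):

#include <string>

// Hypothetical helper: view a string's contents as unsigned bytes.
inline unsigned char* bytes_of(std::string& s)
{
    return reinterpret_cast<unsigned char*>(&s[0]);
}

int main()
{
    std::string s(4, '\0');
    bytes_of(s)[0] = 0xFF;           // store byte value 255, regardless of char's signedness
    unsigned value = bytes_of(s)[0]; // reads back 255 on any implementation
    return static_cast<int>(value);
}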
Are my assumptions and conclusions correct?
Also, is the unsigned char* interpretation of the same bit pattern (i.e. 11111111b or 10101010b) guaranteed to be the same on all implementations? Put differently, does the standard guarantee that "looking through the eyes of an unsigned char", the same bit pattern always leads to the same numerical value (assuming the number of bits in a byte is the same)?
Can I thus safely (that is, without any undefined or implementation-defined behavior) use std::string for storing and manipulating binary data in C++11?
The conversion static_cast<char>(uc), where uc is of type unsigned char, is always valid: according to 3.9.1 [basic.fundamental], the representations of char, signed char, and unsigned char are identical, with char being identical to one of the two other types:
Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
Converting values outside the range of unsigned char to char will, of course, be problematic and may cause undefined behavior. That is, as long as you don't try to store funny values into the std::string you'd be OK. With respect to bit patterns, you can rely on the nth bit translating into 2^n. There shouldn't be a problem storing binary data in a std::string when it is processed carefully.
That said, I don't buy into your premise: processing binary data mostly requires dealing with bytes, which are best manipulated using unsigned values. The few cases where you'd need to convert between char* and unsigned char* produce convenient errors when not treated explicitly, while accidentally messing up the use of char will be silent! That is, dealing with unsigned char will prevent errors. I also don't buy into the premise that you get all those nice string functions: for one, you are generally better off using the algorithms anyway, but also, binary data is not string data. In summary: the recommendation for std::vector<unsigned char> isn't just coming out of thin air! It is deliberate, to avoid building hard-to-find traps into the design!
The only mildly reasonable argument in favor of using char could be the one about string literals but even that doesn't hold water with user-defined string literals introduced into C++11:
#include <cstddef>

unsigned char const* operator""_u(char const* s, std::size_t)
{
    return reinterpret_cast<unsigned char const*>(s);
}

unsigned char const* hello = "hello"_u;
Yes, your assumptions are correct.
Store binary data as a sequence of unsigned char in std::string.
I've run into trouble using std::string to handle binary data in Microsoft Visual Studio. I've seen the strings get inexplicably truncated, so I wouldn't do this regardless of what the standards documents say.
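For what it's worth, a classic cause of such truncation on any implementation (I can't say whether it was the culprit here) is constructing the string from a bare char* without an explicit length, which stops at the first embedded NUL byte:

const char raw[] = { 'a', '\0', 'b' };
std::string s1(raw);              // stops at the embedded NUL: s1.size() == 1
std::string s2(raw, sizeof raw);  // explicit length keeps every byte: s2.size() == 3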

How can I convert a vector<uint8_t> into an unsigned char*

I have a const vector<uint8_t>, and I need to pass it to a function that takes a const unsigned char*. The two types are the same size, etc., so I'm guessing that there is a good way to coerce the types here. What's the idiomatic way of handling this type of problem?
My first instinct is to use reinterpret_cast, but after the cast the data isn't the same. Here is my code:
shared_ptr<const vector<uint8_t>> data = operation.getData();
const unsigned char* data2 = reinterpret_cast<const unsigned char*>(&data);
myFunction(data2, data->size());
Chances are I've confused a pointer for a value here, but maybe my entire approach is incorrect.
reinterpret_cast is almost never the right solution, unless you know exactly what you’re doing, and even then usually not.
In your case, you just want a pointer to the contiguous data storage inside the vector (but not to the vector itself, as you've noticed! That stores other data as well, such as the size and capacity). That's easy enough: it's the address of the first element:
&vector[0]
So your code would look as follows:
myFunction(&(*data)[0], data->size());
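Since C++11 you can also spell this with the vector's data() member, which yields the same pointer; assuming uint8_t is a typedef for unsigned char on your platform (as the next answer notes, it is wherever uint8_t exists), no cast is needed:

myFunction(data->data(), data->size()); // data() returns a pointer to the contiguous storage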
Just use vector<unsigned char> to begin with. On all platforms, unsigned char will have at least 8 bits. On platforms that have an 8-bit unsigned integral type, uint8_t will be a synonym for unsigned char; on platforms that do not have an 8-bit unsigned integral type, uint8_t will not exist, but unsigned char will.

Why can't I static_cast between char * and unsigned char *?

Apparently the compiler considers them to be unrelated types and hence reinterpret_cast is required. Why is this the rule?
They are completely different types; see the standard:
3.9.1 Fundamental types [basic.fundamental]
1 Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (basic.types); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
So analogous to this is also why the following fails:
unsigned int* a = new unsigned int(10);
int* b = static_cast<int*>(a); // error different types
a and b are completely different types. Really, what you are questioning is why static_cast is so restrictive when it can perform the following without problem:
unsigned int a = 10;
int b = static_cast<int>(a); // OK but may result in loss of precision
and why it cannot deduce that the target types are of the same bit width and can be represented. It can do this for scalar types, but for pointers, unless the target is derived from the source and you wish to perform a downcast, casting between pointers is not going to work.
Bjarne Stroustrup states why static_casts are useful in this link: http://www.stroustrup.com/bs_faq2.html#static-cast. In abbreviated form: it lets the user state clearly what their intentions are, and gives the compiler the opportunity to check that what you are intending can be achieved. Since static_cast does not support casting between unrelated pointer types, the compiler can catch this error to alert the user; if they really want to do this conversion, they should then use reinterpret_cast.
You're trying to convert unrelated pointers with a static_cast. That's not what static_cast is for. Here you can see: Type Casting.
With static_cast you can convert numerical data (e.g. char to unsigned char should work) or a pointer to a related class (related by some inheritance). Neither is the case here. You want to convert one unrelated pointer to another, so you have to use reinterpret_cast.
Basically, what you are trying to do is, from the compiler's point of view, the same as trying to convert a char* to a void*.
OK, here are some additional thoughts on why allowing this would be fundamentally wrong. static_cast can be used to convert numerical types into each other, so it is perfectly legal to write the following:
char x = 5;
unsigned char y = static_cast<unsigned char>(x);
What is also possible:
double d = 1.2;
int i = static_cast<int>(d);
If you look at this code in assembler you'll see that the second cast is not a mere re-interpretation of the bit pattern of d but instead some assembler instructions for conversions are inserted here.
Now if we extended this behavior to arrays, then in cases where a different way of interpreting the bit pattern is sufficient, it might work. But what about casting arrays of doubles to arrays of ints?
That's where you either have to declare that you simply want a re-interpretation of the bit patterns - there's a mechanism for that, called reinterpret_cast - or you must do some extra work. As you can see, simply extending static_cast to pointers/arrays is not sufficient, since it would need to behave similarly to static_casting single values of those types. This sometimes needs extra code, and it is not clearly definable how this should be done for arrays. In your case - stopping at \0, because it's the convention? This is not sufficient for non-string cases (numbers). What will happen if the size of the data type changes (e.g. int vs. double on 32-bit x86)?
The behavior you want can't be properly defined for all use cases; that's why it's not in the C++ standard. Otherwise you would have to remember things like: "I can cast this type to the other as long as they are of integer type, have the same width, and ...". This way it's totally clear: either they are related classes - then you can cast the pointers - or they are numerical types - then you can cast the values.
Aside from being pointers, unsigned char * and char * have nothing in common (EdChum already mentioned the fact that char, signed char and unsigned char are three different types). You could say the same thing for Foo * and Bar * pointer types to any dissimilar structures.
static_cast means that a pointer of the source type can be used as a pointer of the destination type, which requires a subtype relationship. Hence it cannot be used in the context of your question; what you need is either reinterpret_cast which does exactly what you want or a C-style cast.
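For completeness, a sketch of the legal routes (a two-step static_cast through void* is also well-formed, because every object pointer converts to and from void*):

char buf[4] = {};
unsigned char* a = reinterpret_cast<unsigned char*>(buf);                // the direct route
unsigned char* b = static_cast<unsigned char*>(static_cast<void*>(buf)); // two-step static_cast
unsigned char* c = (unsigned char*)buf;                                  // C-style cast, same effect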

Can "signed char" and "unsigned char" always be cast to each other without loss of data?

In C++ we can have signed char and unsigned char that are of same size but hold different ranges of values.
In the following code:
signed char signedChar = -10;
unsigned char unsignedChar = static_cast<unsigned char>( signedChar );
signedChar = static_cast<signed char>( unsignedChar );
will signed char retain its value regardless of what its original value was?
No, there's no such guarantee. The conversion from signed char to unsigned char is well-defined, as all signed-to-unsigned integral conversions in C++ (and C) are. However, the result of that conversion can easily turn out to be outside the bounds of the original signed type (as will happen in your example with -10).
The result of the reverse conversion - unsigned char to signed char - in that case is implementation-defined, as all overflowing unsigned-to-signed integral conversions in C++ (and C) are. This means that the result cannot be predicted from the language rules alone.
Normally, you should expect the implementation to "define" it so that the original signed char value is restored. But the language makes no guarantees about that.
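To make both directions concrete (a sketch assuming 8-bit chars):

signed char sc = -10;
unsigned char uc = static_cast<unsigned char>(sc); // well-defined: 256 + (-10) == 246
signed char back = static_cast<signed char>(uc);   // implementation-defined: 246 > SCHAR_MAX,
                                                   // but virtually every implementation yields -10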
I guess the meaning of your question is key. When you say "loss", do you mean losing bytes or something like that? You are not losing anything as such, since both are the same size; they just have different ranges.
The ranges of signed char and unsigned char are not guaranteed by the standard. When most people think of unsigned char, they are thinking of 0 to 255.
On most implementations (I have to caveat this, because there are differences), signed char and unsigned char are 1 byte, or 8 bits: signed char typically runs from -128 to +127, whereas unsigned char runs from 0 to +255.
As far as conversions go, it is left up to different implementations to come up with an answer. On the whole, I wouldn't recommend converting between the two. To me, it makes sense that it should give you the positive equivalent if the value is negative, and leave it unchanged if it is positive. For instance, in Borland C++ Builder 5, given signed char test = -1, casting it to unsigned char yields 255; positive values come through unchanged.
But as far as comparisons go, while the values may appear the same, they probably won't be evaluated as equal. This is a major trip-up when programmers compare signed and unsigned values, and wonder why the data all looks the same but the condition does not work properly. A good compiler should warn you about this.
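For example (assuming 8-bit, two's complement chars), the bit patterns below are identical, yet the comparison is false because both operands are promoted to int first:

signed char sc = -1;    // bit pattern 11111111
unsigned char uc = 255; // same bit pattern
if (sc == uc) { /* not taken: sc promotes to int -1, uc to int 255 */ }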
I'm of the opinion that there should be an implicit conversion between the signed and unsigned so that if you cast from one to the other, the compiler will take care of the conversion for you. It is up to the compiler's implementation on whether you lose the original meaning. Unfortunately there is no guarantee that it will always work.
Finally, per the standard, plain char behaves like either signed char or unsigned char, but which one it matches is implementation-defined:
3.9.1 Fundamental types [basic.fundamental]
1 Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (basic.types); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
AFAIK, this cast will never alter the byte; it just changes how it is interpreted.
My first guess would be "maybe."
Have you tried testing this with various inputs?