Convert basic_string<unsigned char> to basic_string<char> and vice versa - C++

From the following question:
Can I turn unsigned char into char and vice versa?
it appears that converting a basic_string<unsigned char> to a basic_string<char> (i.e. std::string) is a valid operation. But I can't figure out how to do it.
For example, what functions could perform the following conversions, filling in the functionality of these hypothetical stou and utos functions?
#include <string>
using namespace std;

typedef basic_string<unsigned char> u_string;

int main() {
    string s = "dog";
    u_string u = stou(s);  // hypothetical string -> u_string conversion
    string t = utos(u);    // hypothetical u_string -> string conversion
}
I've tried to use reinterpret_cast, static_cast, and a few others but my knowledge of their intended functionality is limited.

Assuming you want each character in the original copied across and converted, the solution is simple:
u_string u(s.begin(), s.end());
std::string t(u.begin(), u.end());
The formation of u is straightforward for any content of s, since conversion from char to unsigned char simply uses modulo arithmetic. So it will work whether char is actually signed or unsigned.
The formation of t is implementation-defined if char is actually signed and any of the individual characters in u have values outside the range a signed char can represent, because converting an out-of-range value to a signed type gives an implementation-defined result (see the quote of 4.7/3 further down). For your particular example, all the values are in range, so the problem does not arise.
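A minimal sketch filling in the question's hypothetical stou and utos with exactly this technique. Note that std::char_traits<unsigned char> is not a specialization the standard requires, although common implementations have historically provided a usable generic template:

#include <string>

typedef std::basic_string<unsigned char> u_string;

// Each element is converted individually, as described above.
u_string stou(const std::string& s) {
    return u_string(s.begin(), s.end());
}

std::string utos(const u_string& u) {
    return std::string(u.begin(), u.end());
}

int main() {
    std::string s = "dog";
    u_string u = stou(s);
    std::string t = utos(u);
    return t == s ? 0 : 1;  // round trip preserves the contents
}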

It is not a legal conversion to cast a basic_string<T> into any other basic_string<U>, even if it's legal to cast T to U. This is true for pretty much every template type.
If you want to create a new string that is a copy of the original, of a different type, that's easy:
basic_string<unsigned char> str(
    reinterpret_cast<const unsigned char*>(char_string.c_str()),
    char_string.size());
(Note that a static_cast cannot convert between unrelated pointer types such as const char* and const unsigned char*; reinterpret_cast is required here.)

Related

Incompatibility between char* and unsigned char*?

The following line of code produces a compiler warning with HP-UX's C++ compiler:
strcpy(var, "string");
Output:
error #2167: argument of type "unsigned char *"
is incompatible with parameter of type "char *"
Please note: var is the unsigned char * here - its data type is outside of my control.
Two questions:
What does incompatibility mean in the context of these two types? What would happen if the compiler was forced to accept the conversion? An example would be appreciated.
What would be a safe way to make the above line of code work, assuming I have to use strcpy?
C++ is being strict in checking the types: std::strcpy expects a char*, and your variable var is an unsigned char*.
Fortunately, in this case, it is perfectly safe to cast the pointer to a char* like this:
std::strcpy(reinterpret_cast<char*>(var), "string");
That is because, according to the standard, char, unsigned char and signed char can each alias one another.
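A self-contained sketch of that fix; the buffer size and contents here are invented for illustration:

#include <cstdio>
#include <cstring>

int main() {
    unsigned char var[16];  // stands in for the externally defined unsigned char* buffer
    // char, signed char and unsigned char may alias one another,
    // so this reinterpret_cast is safe:
    std::strcpy(reinterpret_cast<char*>(var), "string");
    std::printf("%s\n", reinterpret_cast<char*>(var));  // prints: string
}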
In the C standard, the signedness of plain char is implementation-defined.
ANSI C provides three kinds of character type (all occupying one byte): char, signed char, and unsigned char. This is unlike short or int, which come in only two varieties.
You can try:
char *str = "abcd";
signed char *s_str = str;
The compiler will warn that the second line is an error. It is just like:
short num = 10;
unsigned short *p_num = &num;
which the compiler will also warn about, because they are distinct types to the compiler.
So, if you write strcpy((char*)var, "string"), the code just copies the characters from "string"'s storage to var's storage. Whether there is a bug here depends on what you do with var afterwards, because var is not a char *.
char, signed char, and unsigned char are distinct types in C++, and pointers to them are incompatible. For example, forcing a compiler to convert an unsigned char * to char * in order to pass it to strcpy() formally results in undefined behaviour, when the pointer is subsequently dereferenced, in several cases. Hence the warning.
Rather than using strcpy() (and therefore having to force conversions of pointers) you would be better off doing (C++11 and later)
const char thing[] = "string";
std::copy(std::begin(thing), std::end(thing), var);  // needs <algorithm> and <iterator>
which does not have undefined behaviour.
Even better, consider using standard containers, such as a std::vector<unsigned char> and a std::string, rather than working with raw arrays. All standard containers provide a means of accessing their data (e.g. for passing a suitable pointer to a function in a legacy API).
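For example, a sketch of that approach; legacy_fill here is a hypothetical stand-in for a strcpy-style legacy interface:

#include <cstring>
#include <string>
#include <vector>

// Hypothetical legacy C-style function that expects raw char buffers.
void legacy_fill(char* dst, const char* src) { std::strcpy(dst, src); }

int main() {
    std::vector<unsigned char> buf(64);
    std::string msg = "string";
    // data() exposes the contiguous storage; the single cast lives
    // at the API boundary instead of raw arrays spreading everywhere.
    legacy_fill(reinterpret_cast<char*>(buf.data()), msg.c_str());
}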

How do I cast a const char* to a const unsigned char*

I want to take advantage of this post to understand in more detail how unsigned and signed work with regard to pointers. The problem I am having is that I have to use a function from OpenGL called glutBitmapString, which takes as parameters a void* and a const unsigned char*. I am trying to convert a string to a const unsigned char C-string.
Attempt:
string var = "foo";
glutBitmapString(font, var.c_str());
However, that's not quite right, because the newly generated C-string is signed. I want to stay away from casting because I think that will cause narrowing errors. I think that unsigned char and signed char are almost the same thing, but they map values differently. Using a reinterpret_cast comes to mind, but I don't know how it works.
I would use reinterpret_cast:
glutBitmapString(font, reinterpret_cast<const unsigned char*>(var.c_str()));
This is one of the rare cases where the strict aliasing rule is not broken.
Negative values will be interpreted as unsigned (becoming value + 256).
In this particular case (and almost all others), signed vs. unsigned refer to the content pointed at by the pointer.
unsigned char* == a pointer to (unsigned char(s))
signed char* == a pointer to (signed char(s))
Generally, no one is treating 0xFF as a numeric value at all, and signed vs. unsigned doesn't matter. That's not always the case, and with strings people sometimes sloppily use unsigned vs. signed to refer to one type over the other... but you're probably safe just casting the pointer.
If you're NOT safe casting the pointer, it means the data being pointed to is invalid or in the wrong format.
To clarify on unsigned char vs. signed char, check this out:
https://sqljunkieshare.files.wordpress.com/2012/01/extended-ascii-table.jpg
Is the char 0xA4 positive or negative? It's neither. It's ñ. It's not a number at all. So signed vs. unsigned doesn't really matter. Make sense?
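To see the "same bit pattern, different numeric reading" point concretely, here is a small sketch; the printed values assume a typical 8-bit, two's complement platform:

#include <cstdio>

int main() {
    unsigned char u = 0xA4;
    // Out-of-range conversion to a signed type is implementation-defined
    // (pre-C++20); on common platforms it keeps the bit pattern.
    signed char s = static_cast<signed char>(u);
    std::printf("%d %d\n", static_cast<int>(u), static_cast<int>(s));
    // Typically prints: 164 -92
    // Same bits, two numeric interpretations.
}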

Can I safely use std::string for binary data in C++11?

There are several posts on the internet that suggest that you should use std::vector<unsigned char> or something similar for binary data.
But I'd much rather prefer a std::basic_string variant for that, since it provides many convenient string manipulation functions. And AFAIK, since C++11, the standard guarantees what every known C++03 implementation already did: that std::basic_string stores its contents contiguously in memory.
At first glance then, std::basic_string<unsigned char> might be a good choice.
I don't want to use std::basic_string<unsigned char>, however, because almost all operating system functions only accept char*, making an explicit cast necessary. Also, string literals are const char*, so I would need an explicit cast to const unsigned char* every time I assigned a string literal to my binary string, which I would also like to avoid. Also, functions for reading from and writing to files or networking buffers similarly accept char* and const char* pointers.
This leaves std::string, which is basically a typedef for std::basic_string<char>.
The only potential remaining issue (that I can see) with using std::string for binary data is that std::string uses char (which can be signed).
char, signed char, and unsigned char are three different types and char can be either unsigned or signed.
So, when an actual byte value of 11111111b is returned from std::string::operator[] as char and you want to check its value, that value can be either 255 (if char is unsigned) or "something negative" (if char is signed, depending on your number representation).
Similarly, if you want to explicitly append the actual byte value 11111111b to a std::string, simply appending (char) (255) might be implementation-defined (and could even raise a signal) if char is signed and the int to char conversion results in an overflow.
So, is there a safe way around this, that makes std::string binary-safe again?
§3.10/15 states:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
[...]
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
[...]
a char or unsigned char type.
Which, if I understand it correctly, seems to allow using an unsigned char* pointer to access and manipulate the contents of a std::string, and makes this well-defined. It just reinterprets the bit pattern as an unsigned char, without any change or information loss; the latter is guaranteed because all bits in a char, signed char, and unsigned char must participate in the value representation.
I could then use this unsigned char* interpretation of the contents of std::string as a means to access and change the byte values in the [0, 255] range, in a well-defined and portable manner, regardless of the signedness of char itself.
This should solve any problems arising from a potentially signed char.
Are my assumptions and conclusions correct?
Also, is the unsigned char* interpretation of the same bit pattern (i.e. 11111111b or 10101010b) guaranteed to be the same on all implementations? Put differently, does the standard guarantee that "looking through the eyes of an unsigned char", the same bit pattern always leads to the same numerical value (assuming the number of bits in a byte is the same)?
Can I thus safely (that is, without any undefined or implementation-defined behavior) use std::string for storing and manipulating binary data in C++11?
The conversion static_cast<char>(uc), where uc is of type unsigned char, is always valid: according to 3.9.1 [basic.fundamental], the representations of char, signed char, and unsigned char are identical, with char being identical to one of the two other types:
Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
Converting values outside the range of unsigned char to char will, of course, be problematic (the result of such an out-of-range conversion to a signed type is implementation-defined). That is, as long as you don't try to store funny values into the std::string you'd be OK. With respect to bit patterns, you can rely on the nth bit translating into 2^n. There shouldn't be a problem storing binary data in a std::string when it is processed carefully.
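For instance, a sketch of storing and reading back the byte value 255 through exactly the unsigned char* view the question proposes:

#include <cassert>
#include <string>

int main() {
    std::string s(1, '\0');
    // Viewing the contiguous storage through unsigned char* is permitted
    // by the aliasing rules quoted in the question.
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&s[0]);
    bytes[0] = 255;           // write the byte value 11111111b
    assert(bytes[0] == 255);  // reads back as 255 regardless of whether
                              // plain char is signed
}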
That said, I don't buy into your premise: processing binary data mostly means dealing with bytes, which are best manipulated as unsigned values. The few cases where you'd need to convert between char* and unsigned char* produce convenient compiler errors when not treated explicitly, while accidentally misusing char will be silent! That is, dealing with unsigned char will prevent errors. I also don't buy the premise that you get all those nice string functions: for one thing, you are generally better off using the algorithms anyway, but also, binary data is not string data. In summary: the recommendation for std::vector<unsigned char> isn't just coming out of thin air! It deliberately avoids building hard-to-find traps into the design!
The only mildly reasonable argument in favor of using char could be the one about string literals but even that doesn't hold water with user-defined string literals introduced into C++11:
#include <cstddef>

unsigned char const* operator""_u(char const* s, std::size_t)
{
    return reinterpret_cast<unsigned char const*>(s);
}

unsigned char const* hello = "hello"_u;
Yes, your assumptions are correct.
Store binary data as a sequence of unsigned char in std::string.
I've run into trouble using std::string to handle binary data in Microsoft Visual Studio. I've seen the strings get inexplicably truncated, so I wouldn't do this regardless of what the standards documents say.

Is there a good way to convert from unsigned char* to char*?

I've been reading a lot these days about reinterpret_cast<> and how one should use it (and avoid it in most cases).
While I understand that using reinterpret_cast<> to cast from, say, unsigned char* to char* is implementation-defined (and thus non-portable), there seems to be no other way to efficiently convert one to the other.
Let's say I use a library that deals with unsigned char* to process some computations. Internally, I already use char* to store my data (and I can't change that, because it would kill puppies if I did).
I would have done something like:
char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
// We use it here; processData() takes an unsigned char*
processData(reinterpret_cast<unsigned char*>(mydata), mydatalen);
// I could also have done this, which I guess results in a similar call:
processData((unsigned char*)mydata, mydatalen);
If I want my code to be highly portable, it seems I have no other choice than to copy my data first. Something like:
char* mydata = getMyDataSomewhere();
size_t mydatalen = getMyDataLength();
unsigned char* mydata_copy = new unsigned char[mydatalen];
for (size_t i = 0; i < mydatalen; ++i)
    mydata_copy[i] = static_cast<unsigned char>(mydata[i]);
processData(mydata_copy, mydatalen);
Of course, that is highly suboptimal and I'm not even sure that it is more portable than the first solution.
So the question is, what would you do in this situation to have a highly-portable code ?
Portability is an in-practice matter. As such, reinterpret_cast for the specific usage of converting between char* and unsigned char* is portable. But I'd still wrap this usage in a pair of functions instead of doing the reinterpret_cast directly in each place, as sketched below.
Don't go overboard introducing inefficiencies when using a language where nearly all the warts (including the one about limited guarantees for reinterpret_cast) are there in support of efficiency.
That would be working against the spirit of the language, while adhering to the letter.
Cheers & hth.
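A sketch of such a pair of wrapper functions; the names are invented for illustration:

// Confine the reinterpret_cast to one named place; the rest of the
// code calls these helpers instead of casting inline.
inline unsigned char* as_uchar_ptr(char* p) {
    return reinterpret_cast<unsigned char*>(p);
}

inline char* as_char_ptr(unsigned char* p) {
    return reinterpret_cast<char*>(p);
}

The call from the question then reads processData(as_uchar_ptr(mydata), mydatalen); and a later change of strategy only has to touch the two helpers.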
The difference between the char and unsigned char types is merely data semantics. It only affects how the compiler performs arithmetic on data elements of either type. The char type signals the compiler that the value of the high bit is to be interpreted as negative, so that the compiler performs two's complement arithmetic. Since this is the only difference between the two types, I cannot imagine a scenario where reinterpret_cast<unsigned char*>(mydata) would generate output any different from (unsigned char*) mydata. Moreover, there is no reason to copy the data if you are merely informing the compiler about a change in data semantics, i.e., switching from signed to unsigned arithmetic.
EDIT: While the above is true from a practical standpoint, I should note that the C++ standard states that char, unsigned char and signed char are three distinct data types. § 3.9.1.1:
Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.
Go with the cast, it's OK in practice.
I just want to add that this:
for (size_t i = 0; i < mydatalen; ++i)
    mydata_copy[i] = static_cast<unsigned char>(mydata[i]);
while not being undefined behaviour, could change the contents of your string on machines without two's complement arithmetic. The reverse direction would give implementation-defined results for out-of-range values.
For C compatibility, the unsigned char* and char* types have extra limitations. The rationale is that functions like memcpy() have to work, and this limits the freedom that compilers have: (unsigned char*) &foo must still point to the object foo. Therefore, don't worry in this specific case.
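The (unsigned char*) &foo guarantee mentioned above is what makes byte-wise inspection of any object well-defined, as in this sketch:

#include <cstddef>
#include <cstdio>

int main() {
    int foo = 0x01020304;
    // Examining an object's bytes through unsigned char* relies on
    // the same freedom that memcpy() does.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&foo);
    for (std::size_t i = 0; i < sizeof foo; ++i)
        std::printf("%02x ", bytes[i]);
    std::printf("\n");  // byte order depends on the platform
                        // (e.g. "04 03 02 01" on little-endian x86)
}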

reinterpret casting to and from unsigned char* and char*

I'm wondering whether the reinterpret_cast in the function below is necessary. ITER_T might be a char*, an unsigned char*, a std::vector<unsigned char> iterator, or something else like that. It doesn't seem to hurt so far, but does the casting ever affect how the bytes are copied at all?
template <class ITER_T>
char* copy_binary(unsigned char length, const ITER_T& begin)
{
    // alloc_storage() returns a char*
    unsigned char* stg = reinterpret_cast<unsigned char*>(alloc_storage(length));
    std::copy(begin, begin + length, stg);
    return reinterpret_cast<char*>(stg);
}
reinterpret_cast is used for low-level, implementation-defined casts. According to the standard, reinterpret_cast can be used for the following conversions (C++03 5.2.10):
Pointer to an integral type
Integral type to pointer
A pointer to a function can be converted to a pointer to a function of a different type
A pointer to an object can be converted to a pointer to an object of a different type
A pointer to member function or pointer to data member can be converted to a pointer to a member of a different type. The result of such a pointer conversion is unspecified, except that a pointer converted back to its original type yields the original value.
An expression of type A can be converted to a reference to type B if a pointer to type A can be explicitly converted to type B using a reinterpret_cast.
That said, using reinterpret_cast is not a good solution in your case, since conversions between different pointer types are unspecified by the standard, though casting from char * to unsigned char * and back should work on most machines.
In your case I would not cast at all, by defining stg with type char * (a static_cast cannot convert between these pointer types anyway):
template <class ITER_T>
char* copy_binary(unsigned char length, const ITER_T& begin)
{
    // alloc_storage() returns a char*
    char* stg = alloc_storage(length);
    std::copy(begin, begin + length, stg);
    return stg;
}
The code as written works as intended according to 4.7 (2) of the standard, although this is guaranteed only for machines with a two's complement representation.
If alloc_storage returns a char* and char is signed, then, if I understand 4.7 (3) correctly, the result would be implementation-defined if the iterator's value type were unsigned and you dropped the cast and passed the char* to copy.
The short answer is yes, it could affect.
char and unsigned char are convertible types (3.9.1 in C++ Standard 0x n2800), so you can assign one to the other. You don't need the cast at all.
[3.9.1] ... A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements; that is, they have the same object representation.
[4.7] ...
2 If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two's complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). —end note ]
3 If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); otherwise, the value is implementation-defined.
Therefore, even in the worst case you will get the best (least implementation-defined) conversion. Anyway, in most implementations this will not change anything in the bit pattern, and you will not even see a conversion if you look at the generated assembler.
template <class ITER_T>
char* copy_binary(unsigned char length, const ITER_T& begin)
{
    char* stg = alloc_storage(length);
    std::copy(begin, begin + length, stg);
    return stg;
}
Using reinterpret_cast you depend on the compiler:
[5.2.10.3] The mapping performed by reinterpret_cast is implementation-defined. [ Note: it might, or might not, produce a representation different from the original value. —end note ]
Note: This is an interesting related post.
So if I get it right, the cast to unsigned char is to guarantee an unsigned byte-by-byte copy. But then you cast it back for the return. The function looks a bit dodgy; what exactly is the context/reason for setting it up this way? A quick fix might be to replace all of this with a memcpy() (but, as commented, do not use that on iterator objects); otherwise, just remove the redundant casts.
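For the raw-pointer case specifically, a memcpy() version might look like the sketch below; it applies only when the source really is a pointer into contiguous bytes, not a general iterator, and alloc_storage is the question's own hypothetical allocator:

#include <cstring>

char* alloc_storage(unsigned char length);  // from the question: returns a char*

// Only valid for raw pointers to contiguous storage; keep the
// std::copy version for general iterators.
char* copy_binary(unsigned char length, const char* begin)
{
    char* stg = alloc_storage(length);
    std::memcpy(stg, begin, length);
    return stg;
}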