Is there an easy way to convert between char and unsigned char if you don't know the default setting of the machine your code is running on?
(On most architectures, char is signed by default and thus has a range from -128 to +127. On some other architectures, such as ARM, char is unsigned by default and has a range from 0 to 255)
I am looking for a method to select the correct signedness or to convert between the two transparently, preferably one that doesn't involve too many steps since I would need to do this for all elements in an array.
Using a pre-processor definition would allow me to set this at the start of my code.
So would specifying an explicit form of char, such as signed char or unsigned char, since only plain char varies between platforms.
The reason is, there are library functions I would like to use (such as strtol) that take in char as an argument but not unsigned char.
I am looking for advice, or perhaps some pointers in the right direction, on a practical and efficient way to make the code portable, as I intend to run it on several machines with different default settings for char.
I don't see any actual issue here.
It's not a matter of the architecture being signed or unsigned by default. It's rather a matter of the compiler, and the default setting can be changed between the two options as you wish.
Also, there's no need to convert between the types. Both have the same representation in memory, on the same number of bits (usually 8). It's only a matter of your program and the libraries it uses to interpret the bits. If you're going to call strtol, then your data is a character array and you ought to use plain char.
If you ever use char to store not a character (A, b, f ...) but an actual value (-1, 0, 42 ...), then the range matters. In such cases you have to use signed char or unsigned char. However, in such a case there's little use for the library functions that want a char *.
For these libraries that do actually want a char * with an actual binary blob, there's no issue. Create your binary buffer with the type you prefer, signed, unsigned, or undecided, and send it, possibly with a cast. It will run perfectly.
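To make that concrete, here is a minimal sketch (my own example, not from the question) of handing an unsigned char buffer to a function such as strtol through a pointer cast:

#include <cstdio>
#include <cstdlib>

int main() {
    // Buffer declared unsigned, but holding ordinary digit characters.
    unsigned char buf[] = { '1', '2', '3', '4', '\0' };

    // strtol wants const char*; the bytes are the same either way,
    // so a pointer cast is all that is needed.
    long v = std::strtol(reinterpret_cast<const char*>(buf), nullptr, 10);
    std::printf("%ld\n", v); // prints 1234
    return 0;
}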
C++ has three char types, but only char is allowed to vary between compilers/architectures; the other two are explicit versions, while plain char is implicit and therefore allowed to default to either signed or unsigned.
To make your code portable, the most straightforward thing to do is to explicitly use either signed char or unsigned char as you require them. However, for readability you may prefer to redefine char as the type you need, or even make your own definition of a char (for demonstration purposes I will use RLChar):
1st version - un-define char and redefine (caution: defining a keyword as a macro is formally not allowed once you include any standard header)
#ifdef __arm__
#undef char
#define char signed char
#endif
2nd version - define your own custom char type to use in your code
#ifndef RLChar
#define RLChar signed char
#endif
(personally I would tend to do the second)
You can also create another macro to allow changes between the two:
#define CLAMP_VALUE_TO_255(v) ((v) > 255 ? 255 : ((v) < 0 ? 0 : (v)))
then you can use:
unsigned char clampedChar = CLAMP_VALUE_TO_255(pixel); // clamp first; casting pixel to unsigned char beforehand would already have wrapped the value
or use casts such as these (the way to go if all the compilers you will use support them):
signed char myChar = -100;
unsigned char mySecondChar;
mySecondChar = static_cast<unsigned char>(myChar); // uses a static cast
mySecondChar = reinterpret_cast<unsigned char&>(myChar); // uses a reinterpretation cast
so for your array scenario you could do
unsigned char* RLArray;
RLArray = reinterpret_cast<unsigned char*>(originalSignedCharArray);
Let me know if you need more info, especially C equivalents or more details, as this is just what I can remember off the top of my head. :)
Related
I use utf8 and have to save a constant in a char array:
const char s[] = {0xE2,0x82,0xAC, 0}; //the euro sign
However it gives me error:
test.cpp:15:40: error: narrowing conversion of ‘226’ from ‘int’ to ‘const char’ inside { } [-fpermissive]
I have to cast all the hex numbers to char, which feels tedious and doesn't smell good. Is there any other proper way of doing this?
char may be signed or unsigned (and the default is implementation specific). You probably want
const unsigned char s[] = {0xE2,0x82,0xAC, 0};
or
const char s[] = "\xe2\x82\xac";
or with many recent compilers (including GCC)
const char s[] = "€";
(a string literal is an array of char unless you give it some prefix)
See -funsigned-char (or -fsigned-char) option of GCC.
On some implementations a char is unsigned and CHAR_MAX is 255 (and CHAR_MIN is 0). On others chars are signed, so CHAR_MIN is -128 and CHAR_MAX is 127 (and, e.g., things are different on Linux/PowerPC/32 bits and Linux/x86/32 bits). AFAIK nothing in the standard prohibits 19-bit signed chars.
The short answer to your question is that you are overflowing a char. A signed char has the range [-128, 127], and 0xE2 = 226 > 127. What you need to use is an unsigned char, which has a range of [0, 255].
unsigned char s[] = {0xE2,0x82,0xAC, 0};
While it may well be tedious to be putting lots of casts in your code, it actually smells extremely GOOD to me to use as strong typing as possible.
As noted above, when you specify type "char" you are inviting a compiler to choose whatever the compiler writer preferred (signed or unsigned). I'm no expert on UTF-8, but there is no reason to make your code non-portable if you don't need to.
As far as your constants, I've used compilers that default constants written that way to signed ints, as well as compilers that consider the context and interpret them accordingly. Note that converting between signed and unsigned can overflow EITHER WAY. For the same number of bits, a negative overflows an unsigned (obviously) and an unsigned with the top bit set overflows a signed, because the top bit means negative.
In this case, your compiler is taking your constants as unsigned 8 bit--OR LARGER--which means they don't fit as signed 8 bit. And we are all grateful that the compiler complains (at least I am).
My perspective is, there is nothing at all bad about casting to show exactly what you intend to happen. And if a compiler lets you assign between signed and unsigned, it should require that you cast regardless of variables or constants. eg
const int8_t a = (int8_t) 0xFF; // will be -1
although in my example, it would be better to assign -1. When you are having to add extra casts, they either make sense, or you should code your constants so they make sense for the type you are assigning to.
Is there a way to mix these? I want a define macro FX_RGB(R,G,B) that makes a const string "\x01\xRR\xGG\xBB" so I can do the following:
const char* LED_text = "Hello " FX_RGB(0xff, 0xff, 0x80) "World";
and get a string: const char* LED_text = "Hello \x01\xff\xff\x80World";
Apparently there is a possibility that plain char can be either signed or unsigned by default. Stroustrup writes:
It is implementation-defined whether a plain char is considered signed or unsigned. This opens the possibility for some nasty surprises and implementation dependencies.
How do I check whether my chars are signed or unsigned? I might want to convert them to int later, and I don't want them to be negative. Should I always use unsigned char explicitly?
From header <limits>
std::numeric_limits<char>::is_signed
http://en.cppreference.com/w/cpp/types/numeric_limits/is_signed
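A minimal runnable check, for illustration:

#include <iostream>
#include <limits>

int main() {
    std::cout << std::boolalpha
              << "char is signed: "
              << std::numeric_limits<char>::is_signed << '\n';
}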
Some alternatives:
const bool char_is_signed = (char)-1 < 0;
#include <climits>
const bool char_is_signed = CHAR_MIN < 0;
And yes, some systems do make plain char an unsigned type. Examples I've encountered: Cray T90, Cray SV1, Cray T3E, SGI MIPS IRIX, IBM PowerPC AIX. And any system that uses EBCDIC pretty much has to make plain char unsigned so that all basic characters have non-negative values. (And some compilers have an option to control the signedness of char, such as gcc's -fsigned-char and -funsigned-char.)
But std::numeric_limits<char>::is_signed, as suggested by Benjamin Lindley's answer, probably expresses the intent more clearly.
(On the other hand, the methods I suggested can also be applied to C.)
Using unsigned char "always" could give you some interesting surprises, as the majority of C-style functions, like printf and fopen, use char, not unsigned char.
edit: Example of "fun" with C-style functions:
const unsigned char *cmd = "grep -r blah *.txt";
FILE *pf = popen(cmd, "r");
will give errors (in fact, I get one for the *cmd = line, and one error for the popen line). Using const char *cmd = ... will work fine. I picked popen because it's a function that isn't trivial to replace with some standard C++ functionality - obviously, printf or fopen can quite easily be replaced with some iostream or fstream type functionality, which generally has alternatives that take unsigned char as well as char.
However, if you are using > or < on characters that are beyond 127, then you will need to use unsigned char (or some other solution, such as casting to int and masking the lower 8 bits). It is probably better to try to avoid direct comparisons (in particular when it comes to non-ASCII characters - they are messy anyway, because there are often several variants depending on locale, character encodings, etc). Comparing for equality should work however.
Yes, if you want to use a char type and you always want it to be unsigned, use unsigned char. Note that unlike the other fundamental integer types, unsigned char is a different type from char -- even on systems where char is unsigned. Also, conversion from char to int ought to be lossless so if your result is incorrect, your source char value may also be incorrect.
The cleanest way to test whether char is unsigned depends on whether you need it to be a preprocessor test and on which version of C++ you are targeting.
To conditionally compile code using a preprocessor test, the value of CHAR_MIN should work:
#include <climits>
#if (CHAR_MIN==0)
// code that relies on char being unsigned
#endif
In C++17, I would use std::is_signed_v and std::is_unsigned_v:
#include <type_traits>
static_assert(std::is_unsigned_v<char>);
// code that relies on char being unsigned
If you are writing against C++11 or C++14 you need the slightly more verbose std::is_signed and std::is_unsigned:
#include <type_traits>
static_assert(std::is_unsigned<char>::value, "char is signed");
// code that relies on char being unsigned
For all revisions of C++, Benjamin Lindley's solution (std::numeric_limits<char>::is_signed) is a good alternative.
You can use preprocessor command:
#define is_type_signed(my_type) (((my_type)-1) < 0)
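For example (a small sketch that repeats the macro so the snippet stands alone; the static_assert line assumes C++11 or later):

#include <cstdio>

#define is_type_signed(my_type) (((my_type)-1) < 0)

int main() {
    static_assert(is_type_signed(int), "int must be signed");
    std::printf("char is signed: %d\n", is_type_signed(char)); // 1 or 0, per platform
}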
I need to write bytes to a file. The natural representation for a byte is std::uint8_t. The problem is that istream.read() and ostream.write() work with chars. I could convert between the two types, e.g.:
// reading
char c;
input.read(&c, 1);
uint8_t b = (uint8_t)c;

// writing
uint8_t b = …;
char c = (char)b;
output.write(&c, 1);
That could be a problem because char is often a signed type and AFAIK there is no guarantee that the bit pattern written will be the same as what the uint8_t originally contained.
I need to make sure this works across compilers and OSes, so that if I write something on one computer, it will be read the same on any other one.
Is there any standards-compliant way to do this?
The conversion from unsigned char to char and back is perfectly fine, and it's exactly what you should be doing. All three char types are layout compatible.
You only have to be careful with any non-char integral types and convert them to uint8_t first before converting to char, etc., and similarly in the other direction.
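As a sketch of that rule, here is a hypothetical helper (the name write_u32 and the little-endian byte order are my own choices) that narrows to uint8_t first and only then converts to char:

#include <cstdint>
#include <ostream>

// Write a uint32_t as four little-endian bytes.
void write_u32(std::ostream& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        std::uint8_t byte = static_cast<std::uint8_t>(v >> (8 * i)); // narrow to uint8_t first
        char c = static_cast<char>(byte);                            // then convert to char for write()
        out.write(&c, 1);
    }
}

Reading is symmetric: read into a char, convert to uint8_t, and only then widen to the target type.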
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Both printfs output 𤭢 correctly, where f0 a4 ad a2 is the encoding, in hex, for the Unicode code point U+24B62 (𤭢).
memcpy also correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seem to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to work fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
In view of the above fact, and given that my original example was on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.
Anything else?
In C the unsigned char data type is the only data type that has all the following three properties simultaneously
it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.
For the second property we need a type that is unsigned. For unsigned types all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Conversion of any wider value to unsigned char thereby just corresponds to truncation to the least significant byte.
The two other character types generally don't work the same way. signed char is signed, so conversion of values that don't fit is not well defined. char is not fixed to be signed or unsigned, and on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.
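A small illustration of the modulo property described above (assuming the usual 8-bit char):

#include <cstdio>

int main() {
    unsigned char u = static_cast<unsigned char>(300); // well-defined: 300 mod 256 == 44
    std::printf("%d\n", u);                            // prints 44
    // Converting 300 to signed char, by contrast, is implementation-defined (until C++20).
}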
You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
    printf("good\n");
}
else
{
    printf("bad\n");
}
can print "bad", because, depending on your compiler, c[0] will be sign extended to -1, which is not any way the same as 0xff
The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc.; int is always guaranteed to be signed.
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, the result of this conversion is implementation-defined in C, though practically you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF-8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
Signedness is always two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.
Bytes are usually intended as unsigned 8-bit integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on others it may be unsigned.
If I add a bit-shift operation to the code you wrote, I get implementation-defined behaviour (right-shifting a negative value is implementation-defined). The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during compilation: if char is signed, then you are trying to assign the value 0xF0, which cannot be represented in a signed char (range -128 to +127), so it will be converted to a signed value (-16).
Declaring the char as unsigned will remove the warning, and it is always good to have a clean build without any warnings.
The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
I am asking why something which seems to work fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.
Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which is not type-safe: printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with C++ iostreams, the result will be different (depending on the signedness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed means that the most significant bit of the data (on an 8-bit char, the 8th bit) represents the sign. Since you obviously do not need that, you should specify that your data is unsigned (so the "sign" bit carries data rather than the sign of the other bits).
I frequently work with libraries that use char when working with bytes in C++. The alternative is to define a "Byte" as unsigned char, but that's not the standard they decided to use. I frequently pass bytes from C# into the C++ DLLs and cast them to char to work with the library.
When casting ints to chars, or chars to other simple types, what are some of the side effects that can occur? Specifically, when has this broken code that you have worked on, and how did you find out it was because of char signedness?
Luckily I haven't run into this in my own code; I used a char signedness casting trick back in an embedded systems class in school. I'm looking to better understand the issue since I feel it is relevant to the work I am doing.
One major risk is if you need to shift the bytes. A signed char keeps the sign-bit when right-shifted, whereas an unsigned char doesn't.
Here's a small test program:
#include <stdio.h>
int main (void)
{
    signed char a = -1;
    unsigned char b = 255;

    printf("%d\n%d\n", a >> 1, b >> 1);
    return 0;
}
It should print -1 and 127, even though a and b start out with the same bit pattern (given 8-bit chars, two's-complement and signed values using arithmetic shift).
In short, you can't rely on shift working identically for signed and unsigned chars, so if you need portability, use unsigned char rather than char or signed char.
The most obvious gotchas come when you need to compare the numeric value of a char with a hexadecimal constant when implementing protocols or encoding schemes.
For example, when implementing telnet you might want to do this.
// Check for IAC (hex FF) byte
if (ch == 0xFF)
{
    // ...
Or when testing for UTF-8 multi-byte sequences.
if (ch >= 0x80)
{
    // ...
Fortunately these errors don't usually survive very long as even the most cursory testing on a platform with a signed char should reveal them. They can be fixed by using a character constant, converting the numeric constant to a char or converting the character to an unsigned char before the comparison operator promotes both to an int. Converting the char directly to an unsigned won't work, though.
if (ch == '\xff') // OK
if ((unsigned char)ch == 0xff) // OK, so long as char has 8-bits
if (ch == (char)0xff) // Usually OK, relies on implementation defined behaviour
if ((unsigned)ch == 0xff) // still wrong
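A minimal demonstration of the gotcha and the fix (output shown assumes a platform where plain char is signed and 8 bits wide):

#include <cstdio>

int main() {
    char ch = '\xff';                               // -1 where plain char is signed
    std::printf("%d\n", ch == 0xFF);                // prints 0: ch promotes to int -1, not 255
    std::printf("%d\n", (unsigned char)ch == 0xFF); // prints 1: the value 255 survives promotion
    return 0;
}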
I've been bitten by char signedness in writing search algorithms that used characters from the text as indices into state trees. I've also had it cause problems when expanding characters into larger types, and the sign bit propagates causing problems elsewhere.
I found out when I started getting bizarre results and segfaults when searching texts other than the ones I'd used during initial development (obviously characters with values >127 or <0 will cause this, and they won't necessarily be present in your typical text files).
Always check a variable's signedness when working with it. Generally now I make types signed unless I have a good reason otherwise, casting when necessary. This fits in nicely with the ubiquitous use of char in libraries to simply represent a byte. Keep in mind that the signedness of char is not defined (unlike the other types), so you should give it special treatment and be mindful.
The one that most annoys me:
typedef char byte;
byte b = 12;
cout << b << endl; // prints the form-feed control character (ASCII 12), not "12"
Sure it's cosmetics, but arrr...
When casting ints to chars or chars to other simple types
The critical point is, that casting a signed value from one primitive type to another (larger) type does not retain the bit pattern (assuming two's complement). A signed char with bit pattern 0xff is -1, while a signed short with the decimal value -1 is 0xffff. Casting an unsigned char with value 0xff to a unsigned short, however, yields 0x00ff. Therefore, always think of proper signedness before you typecast to a larger or smaller data type. Never carry unsigned data in signed data types if you don't need to - if an external library forces you to do so, do the conversion as late as possible (or as early as possible if the external code acts as data source).
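A short sketch of the difference (two's complement and a 16-bit short assumed):

#include <cstdio>

int main() {
    signed char sc = -1;                     // bit pattern 0xFF
    unsigned short us1 = (unsigned short)sc; // sign-extends: 0xFFFF
    unsigned char uc = 0xFF;
    unsigned short us2 = (unsigned short)uc; // zero-extends: 0x00FF
    std::printf("%04X %04X\n", us1, us2);    // prints FFFF 00FF
}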
The C and C++ language specifications define 3 data types for holding characters: char, signed char and unsigned char. The latter 2 have been discussed in other answers. Let's look at the char type.
The standard(s) say that the char data type may be signed or unsigned, and that this is an implementation decision. This means that different compilers, or versions of a compiler, can implement char differently. The implication is that the char data type is not well suited to arithmetic or Boolean operations. For arithmetic and Boolean operations, the signed and unsigned versions of char will work fine.
In summary, there are 3 versions of the char data type. The char data type performs well for holding characters, but is not suited for arithmetic across platforms and translators, since its signedness is implementation defined.
You will fail miserably when compiling for multiple platforms because the C++ standard doesn't define char to be of a certain "signedness".
Therefore GCC introduces -fsigned-char and -funsigned-char options to force certain behavior. More on that topic can be found here, for example.
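A quick way to see those flags in action is to compile the same probe twice, once with each flag (probe.cpp is just a placeholder name):

// Compile with: g++ -fsigned-char probe.cpp    -> prints CHAR_MIN = -128
//          or:  g++ -funsigned-char probe.cpp  -> prints CHAR_MIN = 0
#include <cstdio>
#include <climits>

int main() {
    std::printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
}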
EDIT:
As you asked for examples of broken code: there are plenty of possibilities to break code that processes binary data. For example, imagine you process 8-bit audio samples (range -128 to 127) and you want to halve the volume. Now imagine this scenario (in which the naive programmer assumes char == signed char):
char sampleIn;
// If the sample is -1 (= almost silent), and the compiler treats char as unsigned,
// then the value of 'sampleIn' will be 255
read_one_byte_sample(&sampleIn);
// Ok, halve the volume. The value will be 127!
char sampleOut = sampleIn / 2;
// And write the processed sample to the output file, for example.
// (unsigned char)127 has the exact same bit pattern as (signed char)127,
// so this will write a sample with the loudest volume!!
write_one_byte_sample_to_output_file(&sampleOut);
I hope you like that example ;-) But to be honest I've never really come across such problems, not even as a beginner, as far as I can remember...
Hope this answer is sufficient for you downvoters. What about a short comment?
Sign extension. The first version of my URL encoding function produced strings like "%FFFFFFA3".
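A minimal reconstruction of that bug (my own sketch, not the original function; output assumes a signed 8-bit char and 32-bit int):

#include <cstdio>

int main() {
    char c = '\xA3';                         // -93 where plain char is signed
    std::printf("%%%X\n", c);                // prints %FFFFFFA3: c sign-extends to int
                                             // (passing a negative value to %X is itself dubious,
                                             // which is part of the bug)
    std::printf("%%%X\n", (unsigned char)c); // prints %A3
}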