C++ char definition from binary string and overflow


I have a datatype that's more or less a character array. Each space in the array holds a char, which, as per my understanding, is a single byte (8 bits) of information. I need to be able to specify the char value through a binary string... for instance
char someChar = char(0b00110011);
What I don't understand is why the max value I can specify is 0b0XXXXXXX, where I have to leave that MSB set to zero. If I try setting the char like so
char someChar = char(0b11111111);
I get a decimal value: -2147483648, which looks very much like overflow. So I don't really get what's going on here. If I call the sizeof() operator on char, I get an answer of 1 (one byte). Doesn't that mean that I either get 0-255 if the char is unsigned, or -128-127 if the char is signed? Any advice/input would be appreciated.
In response to most of the comments -- I converted it to an int before printing it out:
std::cerr << int(someChar)
Thanks to all for the thorough explanations :)

char is signed in this case, so setting the top bit will give a negative value. Use unsigned char if you don't want to worry about positive/negative values.
As for the negative integer value - please show how you're converting/displaying the char.
NB. You can use signed char or unsigned char to tell the compiler explicitly what you want.
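A minimal sketch of the difference, assuming 8-bit chars and a compiler that wraps out-of-range conversions (GCC, Clang and MSVC all do; C++20 requires it). The helper names are made up for illustration:

```cpp
#include <cassert>

// Read the same 8 bits as a signed char, then widen to int.
// Before C++20 the out-of-range conversion is implementation-defined;
// mainstream compilers wrap modulo 256.
int as_signed(unsigned char bits) {
    return static_cast<int>(static_cast<signed char>(bits));
}

// Read the same 8 bits as an unsigned char, then widen to int.
int as_unsigned(unsigned char bits) {
    return static_cast<int>(bits);
}

// Small self-check of the two readings of the bit pattern 11111111.
void check_top_bit() {
    assert(as_signed(0xFF) == -1);    // top bit acts as the sign
    assert(as_unsigned(0xFF) == 255); // top bit is an ordinary value bit
}
```

Neither reading yields -2147483648, which is why the way the value is printed matters.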

-2147483648 in 32-bit two's complement binary is 10000000 00000000 00000000 00000000.
When you declare your char in binary, your compiler interprets it as a signed char, which is the case for most compilers. The leftmost bit is interpreted as the sign bit.
Upon conversion to int, the value is preserved by sign extension: on a two's complement machine a signed char holding the bits 11111111 is -1, and it is still -1 as an int. That conversion alone does not produce -2147483648, so the printing code is worth checking.
You have two main problems here :
First, it seems that you expect someChar to be unsigned. If that's the case, you should tell your compiler so: unsigned char someChar = static_cast<unsigned char>(0b11111111);
Second, the way you put it to the console (which is unknown to us) apparently involves a conversion to int. If it's not needed, there is likely a way to print someChar for what it is really, i.e. a signed char.

Related

Why subtract from 256 when assigning signed char to unsigned char in C++?

in Bjarne's "The C++ Programming Language" book, the following piece of code on chars is given:
signed char sc = -140;
unsigned char uc = sc;
cout << uc; // prints 't'
1Q) chars are 1byte (8 bits) in my hardware. what is the binary representation of -140? is it possible to represent -140 using 8 bits. I would think range is guaranteed to be at least [-127...127] when signed chars is considered. how is it even possible to represent -140 in 8 bits?
2Q) assume it's possible. why do we subtract 140 from uc when sc is assigned to uc? what is the logic behind this?
EDIT: I wrote cout << sizeof (signed char) and it produced 1 (1 byte). I added this to be exact about the size of signed char in bytes.
EDIT 2: cout << int {sc } gives the output 116. I don't understand what happens here?

First of all: Unless you're writing something very low-level that requires bit-representation manipulation - avoid writing this kind of code like the plague. It's hard to read, easy to get wrong, confusing, and often exhibits implementation-defined/undefined behavior.
To answer your question though:
The code assumed you're on a platform in which the types signed char and unsigned char have 8 bits (although theoretically they could have more). And that the hardware has "two's complement" behavior: The bit representation of the result of an arithmetic operation on an integer type with N bits is always modulo 2^N. That also specifies how the same bit-pattern is interpreted as signed or unsigned. Now, -140 modulo 2^8 is 116 (01110100), so that's the bit pattern sc will hold. Interpreted as a signed char (-128 through 127), this is still 116.
An unsigned char can represent 116 as well, so the second assignment results in 116 as well.
116 is the ASCII code of the character t; and std::cout interprets unsigned char values (under 128) as ASCII codes. So, that's what gets printed.

The result of assigning -140 to a signed char is implementation-defined, just like its range is (i.e. see the manual). A very common choice is to use wrap-around math: if it doesn't fit, add or subtract 256 (or the relevant max range) until it fits.
Since sc will have the value 116, and uc can also hold that value, that conversion is trivial. The unusual thing already happened when we assigned -140 to sc.
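The wrap-around arithmetic described above can be sketched without relying on the compiler's own conversion. The helper below is hypothetical and assumes 8-bit chars and an ASCII execution character set:

```cpp
#include <cassert>

// Hypothetical helper: reduce v modulo 256 into the range -128..127,
// which is what a wrap-around implementation stores in an 8-bit signed char.
int wrap_to_8bit_signed(int v) {
    int m = ((v % 256) + 256) % 256; // 0..255
    return m >= 128 ? m - 256 : m;
}

void check_wrap() {
    assert(wrap_to_8bit_signed(-140) == 116); // -140 + 256
    assert(static_cast<char>(116) == 't');    // assumes ASCII
}
```

So nothing is subtracted from uc; the only adjustment is the +256 applied when -140 is stored in sc.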

C/C++ Why to use unsigned char for binary data?

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
both the printf's output 𤭢 correctly, where f0 a4 ad a2 is the encoding for the Unicode code-point U+24B62 (𤭢) in hex.
Even memcpy also correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seems to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to be working as fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes (1.7) making up the object can be copied into
an array of char or unsigned char. If the content of the array of char
or unsigned char is copied back into the object, the object shall
subsequently hold its original value.
In view of the above fact and that my original example was on Intel machine where char defaults to signed char, am still not convinced if unsigned char should be preferred over char.
Anything else?

In C the unsigned char data type is the only data type that has all the following three properties simultaneously
it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
if these are the properties of a "binary" data type you are looking for, you definitively should use unsigned char.
For the second property we need a type that is unsigned. For these, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Every conversion of a wider value to unsigned char thereby corresponds to truncation to the least significant byte.
The two other character types generally don't work the same way. signed char is signed, so conversion of values that don't fit is not well defined. char is not fixed to be signed or unsigned, so on a platform to which your code is ported it might be signed even if it is unsigned on yours.
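As a rough illustration of the modulo rule, the following sketch (function names are invented here) checks that converting a wider value to unsigned char keeps only the low byte when bytes are 8 bits wide:

```cpp
#include <cassert>
#include <climits>

// Conversion to unsigned char is defined as reduction modulo UCHAR_MAX + 1.
unsigned char low_byte(unsigned int wide) {
    return static_cast<unsigned char>(wide);
}

void check_truncation() {
    assert(low_byte(0x1234u) == 0x1234u % (UCHAR_MAX + 1u));
    // With 8-bit bytes this is just the least significant byte.
    if (CHAR_BIT == 8) assert(low_byte(0x1234u) == 0x34);
}
```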

You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
printf("good\n");
}
else
{
printf("bad\n");
}
can print "bad", because, depending on your compiler, c[0] will be sign extended to -1, which is not in any way the same as 0xff
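To make the outcome independent of how plain char behaves on a given compiler, this sketch spells out both cases with signed char and unsigned char. The function names are illustrative only, and a wrap-around conversion is assumed for the signed case:

```cpp
#include <cassert>

// Both operands are promoted to int before ==, so a negative
// signed char can never equal the int constant 0xff (255).
bool equals_signed(signed char c, int k) { return c == k; }
bool equals_unsigned(unsigned char c, int k) { return c == k; }

void check_compare() {
    signed char s = static_cast<signed char>(0xff);     // -1 on wrap-around compilers
    unsigned char u = static_cast<unsigned char>(0xff); // 255
    assert(!equals_signed(s, 0xff));  // -1 != 255: the "bad" branch
    assert(equals_unsigned(u, 0xff)); // 255 == 255: the "good" branch
}
```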

The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc, int is always guaranteed to be signed.
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, the result of such a conversion is implementation-defined in C, though practically you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
Signedness is always two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.

Bytes are usually intended as unsigned 8 bit wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.
If I add a bit shift operation to the code you wrote, then I will have implementation-defined behaviour (right-shifting a negative signed value). The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during compilation: if the char is signed then you are trying to assign the value 0xf0, which cannot be represented in a signed char (range -128 to +127), so it will be converted to a signed value (-16).
Declaring the char as unsigned will remove the warning, and it is always good to have a clean build without any warnings.

The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
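A small sketch of the shifting difference, assuming 8-bit chars and a compiler that uses an arithmetic right shift for negative values (implementation-defined before C++20, required since). The function names are made up:

```cpp
#include <cassert>

// Right shift of a negative value: implementation-defined before C++20,
// arithmetic (sign-preserving) from C++20 on and on mainstream compilers.
int shift_signed(signed char c, int n) { return c >> n; }

// Right shift of an unsigned char always shifts in zero bits.
int shift_unsigned(unsigned char c, int n) { return c >> n; }

void check_shift() {
    assert(shift_signed(static_cast<signed char>(0xF0), 1) == -8); // -16 >> 1
    assert(shift_unsigned(0xF0, 1) == 0x78);                        // 240 >> 1
}
```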

I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.

Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which is not type-safe. That is, printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with c++ iostreams, the result will be different (depending on the signed-ness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed specifies that the most significant bit of the data (for unsigned char the 8-th bit) represents the sign. Since you obviously do not need that, you should specify your data is unsigned (the "sign" bit represents data, not the sign of the other bits).

Assigning negative value to char

Why does the following code print "?" ?
Also how can -1 be assigned to an unsigned char?
char test;
unsigned char testu; //isn't it supposed to hold values in range 0 - 255?
test = -1;
testu = -1;
cout<<"TEST CHAR = "<<test<<endl;
cout<<"TESTU CHAR = "<<testu<<endl;

unsigned simply affects how the internal representation of the number (chars are numbers, remember) is interpreted. So -1 is 1111 1111 in two's complement notation, which when put into an unsigned char changes the meaning (for the same bit representation) to 255.
The question mark is probably the result of your font/codepage not mapping the (extended) ASCII value 255 to a character it can display.
I don't think << discerns between an unsigned char and a signed char, since it interprets their values as ASCII codes, not plain numbers.
Also, it depends on your compiler whether chars are signed or unsigned by default; actually, the spec states there's three different char types (plain, signed, and unsigned).

When you assign a negative value to an unsigned variable, the result is that it wraps around. -1 becomes 255 in this case.

I don't know C or C++, but my intuition is telling me that it's wrapping -1 to 255 and printing ÿ, but since that's not in the first 128 characters it prints ? instead. Just a guess.
To test this, try assigning -191 and see if it prints A (or B if my math is off).
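The guess can be checked with a short sketch, assuming 8-bit chars and ASCII (to_byte is a name chosen here for illustration):

```cpp
#include <cassert>

// Converting to unsigned char adds or subtracts 256 (for 8-bit chars)
// until the value fits into 0..255.
unsigned char to_byte(int v) { return static_cast<unsigned char>(v); }

void check_wraparound() {
    assert(to_byte(-1) == 255);
    assert(to_byte(-2) == 254);
    assert(to_byte(-191) == 'A'); // -191 + 256 = 65; assumes ASCII
}
```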

Signed/unsigned is defined by the use of the highest order bit of that number.
You can assign a negative integer to it. The sign bit will be interpreted in the signed case (when you perform arithmetic with it). When you treat it like a character it will simply take the highest order bit as if it was an unsigned char and just produce a character code beyond 127 (decimal):
unsigned char c = -2;
is equivalent to:
unsigned char c = 254;
WHEN the c is treated as a character.
-1 follows the same rule: it has all 8 bits set and is treated as 255 dec.

Conversion from unsigned to signed type safety?

Is it safe to convert, say, from an unsigned char * to a signed char * (or just a char *)?

The access is well-defined, you are allowed to access an object through a pointer to signed or unsigned type corresponding to the dynamic type of the object (3.10/15).
Additionally, signed char is guaranteed not to have any trap values and as such you can safely read through the signed char pointer no matter what the value of the original unsigned char object was.
You can, of course, expect that the values you read through one pointer will be different from the values you read through the other one.
Edit: regarding sellibitze's comment, this is what 3.9.1/1 says.
A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.9); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers.
So indeed it seems that signed char may have trap values. Nice catch!

The conversion should be safe, as all you're doing is converting from one type of character to another, which should have the same size. Just be aware of what sort of data your code is expecting when you dereference the pointer, as the numeric ranges of the two data types are different. (i.e. if your number pointed by the pointer was originally positive as unsigned, it might become a negative number once the pointer is converted to a signed char* and you dereference it.)
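A minimal sketch of reading the same bytes through both pointer types, assuming 8-bit two's complement chars (the helper name is hypothetical):

```cpp
#include <cassert>

// Reading an unsigned char object through a signed char pointer is allowed;
// only the interpretation of the stored bits changes.
int read_as_signed(const unsigned char* p) {
    return *reinterpret_cast<const signed char*>(p);
}

void check_pointer_view() {
    const unsigned char bytes[] = {32, 192};
    assert(read_as_signed(&bytes[0]) == 32);  // below 128: same value
    assert(read_as_signed(&bytes[1]) == -64); // 192 - 256 on two's complement
}
```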

Casting changes the type, but does not affect the bit representation. Casting from unsigned char to signed char does not change the value at all, but it affects the meaning of the value.
Here is an example:
#include <stdio.h>
int main(int args, char** argv) {
/* example 1 */
unsigned char a_unsigned_char = 192;
signed char a_signed_char = a_unsigned_char; /* same bits, now read as signed */
printf("%d, %d\n", a_unsigned_char, a_signed_char); //192, -64
/* example 2 */
unsigned char b_unsigned_char = 32;
signed char b_signed_char = b_unsigned_char; /* value fits, so it is unchanged */
printf("%d, %d\n", b_unsigned_char, b_signed_char); //32, 32
return 0;
}
In the first example, you have an unsigned char with value 192, or 11000000 in binary. After the cast to signed char, the bit pattern is still 11000000, but that happens to be the 2s-complement representation of -64. Signed values are stored in 2s-complement representation.
In the second example, our unsigned initial value (32) is less than 128, so it seems unaffected by the cast. The binary representation is 00100000, which is still 32 in 2s-complement representation.
To "safely" cast from unsigned char to signed char, ensure the value is less than 128.

It depends on how you are going to use the pointer. You are just converting the pointer type.

You can safely convert an unsigned char* to a char * as the function you are calling will be expecting the behavior from a char pointer, but, if your char value goes over 127 then you will get a result that will not be what you expected, so just make certain that what you have in your unsigned array is valid for a signed array.

I've seen it go wrong in a few ways, converting to a signed char from an unsigned char.
One, if you're using it as an index to an array, that index could go negative.
Secondly, if inputted to a switch statement, it may result in a negative input which often is something the switch isn't expecting.
Third, it has different behavior on an arithmetic right shift
int x = ...;
char c = 128;
unsigned char u = 128;
c >> x;
has a different result than
u >> x;
Because the former is sign-extended and the latter isn't.
Fourth, a signed character causes underflow at a different point than an unsigned character.
So a common overflow check,
(c + x > c)
could return a different result than
(u + x > u)
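The array-index problem from the first point can be sketched like this. The functions only return the index that would be used against a hypothetical 256-entry table, and a wrap-around conversion with 8-bit chars is assumed:

```cpp
#include <cassert>

// Index computed straight from a signed char: may be negative.
int index_direct(signed char c) { return c; }

// Index computed after converting to unsigned char: always 0..255.
int index_safe(signed char c) { return static_cast<unsigned char>(c); }

void check_index() {
    signed char c = static_cast<signed char>(200); // -56 on wrap-around compilers
    assert(index_direct(c) == -56); // would point before the start of the table
    assert(index_safe(c) == 200);   // converting first restores 200
}
```

The same conversion is the usual precaution before passing a char to functions such as std::toupper.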

Safe if you are dealing with only ASCII data.

I'm astonished it hasn't been mentioned yet: Boost numeric cast should do the trick - but only for the data of course.
Pointers are always pointers. By casting them to a different type, you only change the way the compiler interprets the data pointed to.

What is an unsigned char?

In C/C++, what is an unsigned char used for? How is it different from a regular char?

In C++, there are three distinct character types:
char
signed char
unsigned char
If you are using character types for text, use the unqualified char:
it is the type of character literals like 'a' or '0' (in C++ only, in C their type is int)
it is the type that makes up C strings like "abcde"
It also works out as a number value, but it is unspecified whether that value is treated as signed or unsigned. Beware character comparisons through inequalities - although if you limit yourself to ASCII (0-127) you're just about safe.
If you are using character types as numbers, use:
signed char, which gives you at least the -127 to 127 range. (-128 to 127 is common)
unsigned char, which gives you at least the 0 to 255 range.
"At least", because the C++ standard only gives the minimum range of values that each numeric type is required to cover. sizeof (char) is required to be 1 (i.e. one byte), but a byte could in theory be for example 32 bits. sizeof would still report its size as 1, meaning that you could have sizeof (char) == sizeof (long) == 1.

This is implementation dependent, as the C standard does NOT define the signed-ness of char. Depending on the platform, char may be signed or unsigned, so you need to explicitly ask for signed char or unsigned char if your implementation depends on it. Just use char if you intend to represent characters from strings, as this will match what your platform puts in the string.
The difference between signed char and unsigned char is as you'd expect. On most platforms, signed char will be an 8-bit two's complement number ranging from -128 to 127, and unsigned char will be an 8-bit unsigned integer (0 to 255). Note the standard does NOT require that char types have 8 bits, only that sizeof(char) return 1. You can get at the number of bits in a char with CHAR_BIT in limits.h. There are few if any platforms today where this will be something other than 8, though.
There is a nice summary of this issue here.
As others have mentioned since I posted this, you're better off using int8_t and uint8_t if you really want to represent small integers.
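These properties can be queried at compile time. The sketch below uses only standard headers; plain_char_is_signed is a name picked for this example:

```cpp
#include <cassert>
#include <climits>
#include <limits>
#include <type_traits>

// char, signed char and unsigned char are three distinct types,
// even though plain char behaves like one of the other two.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");
static_assert(sizeof(char) == 1, "always 1 by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// True when plain char is signed on this implementation.
bool plain_char_is_signed() { return std::numeric_limits<char>::is_signed; }

void check_limits() {
    assert(plain_char_is_signed() == (CHAR_MIN < 0));
    assert(UCHAR_MAX >= 255);
}
```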

Because I feel it's really called for, I just want to state some rules of C and C++ (they are the same in this regard). First, all bits of unsigned char participate in determining the value of any unsigned char object. Second, unsigned char is explicitly stated unsigned.
Now, I had a discussion with someone about what happens when you convert the value -1 of type int to unsigned char. He refused the idea that the resulting unsigned char has all its bits set to 1, because he was worried about sign representation. But he didn't have to be. It's immediately following out of this rule that the conversion does what is intended:
If the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type. (6.3.1.3p2 in a C99 draft)
That's a mathematical description. C++ describes it in terms of modulo arithmetic, which yields the same rule. Anyway, what is not guaranteed is that all bits in the integer -1 are one before the conversion. So, what do we have so we can claim that the resulting unsigned char has all its CHAR_BIT bits turned to 1?
All bits participate in determining its value - that is, no padding bits occur in the object.
Adding only one time UCHAR_MAX+1 to -1 will yield a value in range, namely UCHAR_MAX
That's enough, actually! So whenever you want to have an unsigned char having all its bits one, you do
unsigned char c = (unsigned char)-1;
It also follows that a conversion is not just truncating higher order bits. The fortunate event for two's complement is that it is just a truncation there, but the same isn't necessarily true for other sign representations.
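A sketch of that argument in code (the helper names are invented for this example): the conversion gives UCHAR_MAX, and counting bits confirms that all CHAR_BIT bits are set.

```cpp
#include <cassert>
#include <climits>

// Converting -1 to unsigned char adds UCHAR_MAX + 1 once, giving UCHAR_MAX.
unsigned char all_bits_set() { return static_cast<unsigned char>(-1); }

// Count the bits that are 1 in a byte.
int count_ones(unsigned char b) {
    int n = 0;
    for (int i = 0; i < CHAR_BIT; ++i) n += (b >> i) & 1;
    return n;
}

void check_all_ones() {
    assert(all_bits_set() == UCHAR_MAX);
    assert(count_ones(all_bits_set()) == CHAR_BIT); // no padding bits, so every bit is 1
}
```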

As for example usages of unsigned char:
unsigned char is often used in computer graphics, which very often (though not always) assigns a single byte to each colour component. It is common to see an RGB (or RGBA) colour represented as 24 (or 32) bits, each an unsigned char. Since unsigned char values fall in the range [0,255], the values are typically interpreted as:
0 meaning a total lack of a given colour component.
255 meaning 100% of a given colour pigment.
So you would end up with RGB red as (255,0,0) -> (100% red, 0% green, 0% blue).
Why not use a signed char? Arithmetic and bit shifting becomes problematic. As explained already, a signed char's range is essentially shifted by -128. A very simple and naive (mostly unused) method for converting RGB to grayscale is to average all three colour components, but this runs into problems when the values of the colour components are negative. Red (255, 0, 0) averages to (85, 85, 85) when using unsigned char arithmetic. However, if the values were signed chars (127,-128,-128), we would end up with (-99, -99, -99), which would be (29, 29, 29) in our unsigned char space, which is incorrect.

signed char has range -128 to 127; unsigned char has range 0 to 255.
char will be equivalent to either signed char or unsigned char, depending on the compiler, but is a distinct type.
If you're using C-style strings, just use char. If you need to use chars for arithmetic (pretty rare), specify signed or unsigned explicitly for portability.

unsigned char takes only positive values....like 0 to 255
where as
signed char takes both positive and negative values....like -128 to +127

char and unsigned char aren't guaranteed to be 8-bit types on all platforms—they are guaranteed to be 8-bit or larger. Some platforms have 9-bit, 32-bit, or 64-bit bytes. However, the most common platforms today (Windows, Mac, Linux x86, etc.) have 8-bit bytes.

An unsigned char is an unsigned byte value (0 to 255). You may be thinking of char in terms of being a "character" but it is really a numerical value. The regular char is signed, so you have 128 values, and these values map to characters using ASCII encoding. But in either case, what you are storing in memory is a byte value.

In terms of direct values a regular char is used when the values are known to be between CHAR_MIN and CHAR_MAX while an unsigned char provides double the range on the positive end. For example, if CHAR_BIT is 8, the range of regular char is only guaranteed to be [0, 127] (because it can be signed or unsigned) while unsigned char will be [0, 255] and signed char will be [-127, 127].
In terms of what it's used for, the standards allow objects of POD (plain old data) to be directly converted to an array of unsigned char. This allows you to examine the representation and bit patterns of the object. The same guarantee of safe type punning doesn't exist for char or signed char.

unsigned char is the heart of all bit trickery. In almost all compilers for all platforms an unsigned char is simply a byte and an unsigned integer of (usually) 8 bits that can be treated as a small integer or a pack of bits.
In addition, as someone else has said, the standard doesn't define the sign of a char. So you have 3 distinct char types: char, signed char, unsigned char.

If you like using various types of specific length and signedness, you're probably better off with uint8_t, int8_t, uint16_t, etc simply because they do exactly what they say.
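One caveat worth sketching: on typical implementations uint8_t is a typedef for unsigned char, so iostreams print it as a character unless it is converted to int first. The example assumes that typedef and ASCII; the function names are made up:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// uint8_t is usually a typedef for unsigned char, so << prints a character.
std::string stream_raw(std::uint8_t v) {
    std::ostringstream os;
    os << v;
    return os.str();
}

// Converting to int first prints the number.
std::string stream_as_int(std::uint8_t v) {
    std::ostringstream os;
    os << static_cast<int>(v);
    return os.str();
}

void check_streaming() {
    assert(stream_raw(65) == "A");     // assumes ASCII and uint8_t == unsigned char
    assert(stream_as_int(65) == "65");
}
```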

Some googling found this, where people had a discussion about this.
An unsigned char is basically a single byte. So, you would use this if you need one byte of data (for example, maybe you want to use it to set flags on and off to be passed to a function, as is often done in the Windows API).

An unsigned char uses the bit that is reserved for the sign of a regular char as another number. This changes the range to [0 - 255] as opposed to [-128 - 127].
Generally unsigned chars are used when you don't want a sign. This will make a difference when doing things like shifting bits (shift extends the sign) and other things when dealing with a char as a byte rather than using it as a number.

unsigned char takes only positive values: 0 to 255 while
signed char takes positive and negative values: -128 to +127.

Quoted from "The C Programming Language" book:
The qualifier signed or unsigned may be applied to char or any integer. unsigned numbers
are always positive or zero, and obey the laws of arithmetic modulo 2^n, where n is the number
of bits in the type. So, for instance, if chars are 8 bits, unsigned char variables have values
between 0 and 255, while signed chars have values between -128 and 127 (in a two's
complement machine.) Whether plain chars are signed or unsigned is machine-dependent,
but printable characters are always positive.

signed char and unsigned char both represent 1byte, but they have different ranges.
Type | range
-------------------------------
signed char | -128 to +127
unsigned char | 0 to 255
With signed char, if we consider char letter = 'A', 'A' is stored as the binary value of 65 in ASCII/Unicode. If 65 can be stored, -65 can also be stored. There are no negative values in ASCII/Unicode, therefore there is no need to worry about negative values.
Example
#include <stdio.h>
int main()
{
signed char char1 = 255;
signed char char2 = -128;
unsigned char char3 = 255;
unsigned char char4 = -128;
printf("Signed char(255) : %d\n",char1);
printf("Unsigned char(255) : %d\n",char3);
printf("\nSigned char(-128) : %d\n",char2);
printf("Unsigned char(-128) : %d\n",char4);
return 0;
}
Output -:
Signed char(255) : -1
Unsigned char(255) : 255
Signed char(-128) : -128
Unsigned char(-128) : 128
