Why do C++ streams use char instead of unsigned char?

I've always wondered why the C++ Standard library has instantiated basic_[io]stream and all its variants using the char type instead of the unsigned char type. Using char means (depending on whether it is signed or not) you can get overflow and underflow for operations like get(), which leads to implementation-defined values for the variables involved. Another example is when you want to output a byte, unformatted, to an ostream using its put function.
Any ideas?
Note: I'm still not really convinced, so if you know the definitive answer, please do post it.

Possibly I've misunderstood the question, but conversion from unsigned char to char isn't unspecified, it's implementation-defined (4.7/3 in the C++ standard).
The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). Also for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)
Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness isn't defined. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"

I have always understood it this way: the purpose of the iostream classes is to read and/or write a stream of characters, which, if you think about it, are abstract entities that the computer only represents using a character encoding. The C++ standard takes great pains to avoid pinning down the character encoding, saying only that "Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set," because it doesn't need to pin down the implementation's basic character set in order to define the C++ language; the standard leaves the decision of which character encoding is used to the implementation (compiler together with an STL implementation), and just notes that char objects represent single characters in some encoding.
An implementation writer could choose a single-octet encoding such as ISO-8859-1 or even a double-octet encoding such as UCS-2. It doesn't matter. As long as a char object is "large enough to store any member of the implementation's basic character set" (note that this explicitly forbids variable-length encodings), then the implementation may even choose an encoding that represents basic Latin in a way that is incompatible with any common encoding!
It is confusing that the char, signed char, and unsigned char types share "char" in their names, but it is important to keep in mind that char does not belong to the same family of fundamental types as signed char and unsigned char. signed char is in the family of signed integer types:
There are four signed integer types: "signed char", "short int", "int", and "long int."
and unsigned char is in the family of unsigned integer types:
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," ...
The one similarity between the char, signed char, and unsigned char types is that "[they] occupy the same amount of storage and have the same alignment requirements". Thus, you can reinterpret_cast from char * to unsigned char * in order to determine the numeric value of a character in the execution character set.
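As a small illustration of that reinterpret_cast point (a minimal sketch; the printed value assumes an ASCII-based execution character set):

#include <iostream>

int main() {
    char c = 'A';
    // Same storage, reinterpreted through an unsigned narrow character type to read
    // the numeric value of the character in the execution character set.
    unsigned char *p = reinterpret_cast<unsigned char *>(&c);
    std::cout << static_cast<unsigned int>(*p) << '\n';  // 65 on ASCII-based platforms
}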
To answer your question, the reason why the STL uses char as the default type is because the standard streams are meant for reading and/or writing streams of characters, represented by char objects, not integers (signed char and unsigned char). The use of char versus the numeric value is a way of separating concerns.

char is for characters, unsigned char for raw bytes of data, and signed char for, well, signed data.
The Standard does not specify whether signed or unsigned char will be used for the implementation of char - it is compiler-specific. It only specifies that char will be "enough" to hold characters on your system - the way characters were in those days, that is, no Unicode.
Using char for characters is the standard way to go. Using unsigned char is a hack, although it'll match the compiler's implementation of char on most platforms.

I think this comment explains it well. To quote:
signed char and unsigned char are arithmetic, integral types just like int and unsigned int. On the other hand, char is expressly intended to be the "I/O" type that represents some opaque, system-specific fundamental unit of data on your platform. I would use them in this spirit.

Related

In the C++11 standard, why leave the char type implementation dependent?

Background
Several C++ source materials and Stack Overflow questions talk about the implementation-dependent nature of char. That is, char in C++ may be defined as either an unsigned char or a signed char, but the choice depends entirely on the compiler, according to the ARM Linux FAQ:
The above code is actually buggy in that it assumes that the type "char" is equivalent to "signed char". The C standards do say that "char" may either be a "signed char" or "unsigned char" and it is up to the compiler's implementation or the platform which is followed.
This leaves the door open for both ambiguity issues and bad practices, including mistaking the signedness of a char when used as an 8-bit number. The Rationale for C offers some reasoning for why this is the case, but does not address the issue of leaving open the possibility for ambiguity:
Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types.
It would seem advantageous to close the door on even the potential for ambiguity and leave only unsigned char and signed char as the two data types for the 8-bit unit. This prompted me to ask the question...
Question
Given the potential for ambiguity, why leave the char data type implementation dependent?
Some processors prefer signed char, and others prefer unsigned char. For example, POWER can load an 8-bit value from memory with zero extension, but not sign extension. But SuperH-3 can load an 8-bit value from memory with sign extension but not zero extension. C++ derives from C, and C leaves many details of the language implementation-defined so that each implementation can be tailored to be most efficient for its target environment.
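If you want to see which choice your own toolchain made, a minimal sketch using only standard facilities is:

#include <climits>
#include <iostream>
#include <limits>

int main() {
    // Both lines report the implementation's choice for plain char.
    std::cout << std::boolalpha
              << std::numeric_limits<char>::is_signed << '\n'   // true if char is signed
              << (CHAR_MIN < 0) << '\n';                        // same information via <climits>
}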

Is it better to use char or unsigned char array for storing raw data?

When I need to buffer some raw data in memory, for example from a stream, is it better to use an array of char or of unsigned char? I have always used char, but at work they say unsigned char is better, and I don't know why.
UPDATE: C++17 introduced std::byte, which is more suited to "raw" data buffers than using any manner of char.
For earlier C++ versions:
unsigned char emphasises that the data is not "just" text
if you've got what's effectively "byte" data from e.g. a compressed stream, a database table backup file, an executable image, a jpeg... then unsigned is appropriate for the binary-data connotation mentioned above
unsigned works better for some of the operations you might want to do on binary data, e.g. there are undefined and implementation-defined behaviours for some bit operations on signed types (see the sketch after this list), and unsigned values can be used directly as indices in arrays
you can't accidentally pass an unsigned char* to a function expecting char* and have it operated on as presumed text
in these situations it's usually more natural to think of the values as being in the range 0..255, after all - why should the "sign" bit have a different kind of significance to the other bits in the data?
if you're storing "raw data" that - at an application logic/design level happens to be 8-bit numeric data, then by all means choose either unsigned or explicitly signed char as appropriate to your needs
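A minimal sketch of the shift issue mentioned in the list above (the negative value assumes a typical two's-complement platform where char holding 0x90 is negative):

#include <iostream>

int main() {
    unsigned char u = 0x90;
    // After the usual promotion to int, shifting an unsigned byte is well defined:
    std::cout << ((u << 1) & 0xFF) << '\n';   // prints 32 (0x20)

    // With a signed char holding a negative value, the same shift operates on a
    // negative int, which is undefined behaviour before C++20:
    // signed char s = static_cast<signed char>(0x90);
    // std::cout << (s << 1) << '\n';          // don't do this
}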
As far as the structure of the buffer is concerned, there is no difference: in both cases you get an element size of one byte, mandated by the standard.
Perhaps the most important difference that you get is the behavior that you see when accessing the individual elements of the buffer, for example, for printing. With char you get implementation-defined signed or unsigned behavior; with unsigned char you always see unsigned behavior. This becomes important if you want to print the individual bytes of your "raw data" buffer.
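For example (a minimal sketch; the exact output of the plain-char line depends on your platform's choice):

#include <cstdio>

int main() {
    // Two raw bytes stored in a plain char buffer (the 0xFF byte typically becomes a
    // negative value on platforms where char is signed).
    char buf[] = { static_cast<char>(0xFF), 0x10 };

    // Read through plain char: typically prints -1 where char is signed, 255 where unsigned.
    std::printf("%d\n", buf[0]);

    // Read through unsigned char: prints 255 everywhere.
    std::printf("%d\n", static_cast<unsigned char>(buf[0]));
}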
Another good alternative for buffers is the exact-width integer uint8_t. It is guaranteed to have the same width as unsigned char, its name requires less typing, and it tells the reader that you do not intend to use the individual elements of the buffer as character-based information.
Internally, it is exactly the same: Each element is a byte. The difference is given when you operate with those values.
If your values range is [0,255] you should use unsigned char but if it is [-128,127] then you should use signed char.
Suppose you use the first range (unsigned char); then you can perform the operation 100+100 and store the result. With signed char that operation would overflow the type and give you an unexpected value.
Depending on your compiler or machine type, char may be unsigned or signed by default:
Is char signed or unsigned by default?
Thus char has one of the ranges described above, depending on that default.
If you are using this buffer just to store binary data without operating with it, there is no difference between using char or unsigned char.
EDIT
Note that you can even change the default signedness of char for a given machine and compiler using compiler flags:
-funsigned-char
Let the type char be unsigned, like unsigned char.
Each kind of machine has a default for what char should be. It is either like unsigned char by default or like signed char by default.
Ideally, a portable program should always use signed char or unsigned char when it depends on the signedness of an object. But many programs have been written to use plain char and expect it to be signed, or expect it to be unsigned, depending on the machines they were written for. This option, and its inverse, let you make such a program work with the opposite default.
The type char is always a distinct type from each of signed char or unsigned char, even though its behavior is always just like one of those two.
As Pablo said in his answer, the key reason is that if you're doing arithmetic on the bytes, you'll get the 'right' answers if you declare the bytes as unsigned char: you want (in Pablo's example) 100 + 100 to add to 200. If you do that sum with signed char (which you might do by accident if char on your compiler is signed), there's no guarantee of that; you're asking for trouble.
Another important reason is that it can help document your code, if you're explicit about what datatypes are what. It's useful to declare
typedef unsigned char byte;
or even better
#include <stdint.h>
typedef uint8_t byte;
Using byte thereafter makes it that little bit clearer what your program's intent is. Depending on how paranoid your compiler is (-Wall is your friend), this might even cause a type warning or error if you pass a byte* argument to a function expecting char*, thus prompting you to think slightly more carefully about whether you're doing the right thing.
A 'character' is fundamentally a pretty different thing from a 'byte'. C happens to blur the distinction (because at C's level, in a mostly ASCII world, the distinction doesn't matter in many cases). This blurring isn't always helpful, but it's at least good intellectual hygiene to keep the difference clear in your head.
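A rough sketch of that safety net, assuming uint8_t is available on your platform (the names byte, text_length, and data are just illustrative):

#include <cstdint>
#include <cstring>

typedef std::uint8_t byte;   // the alias suggested above

std::size_t text_length(const char *s) { return std::strlen(s); }

int main() {
    byte data[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
    // text_length(data);   // does not compile: uint8_t* (unsigned char*) will not
    //                      // silently convert to char*, which is exactly the kind
    //                      // of nudge toward re-thinking the call described above.
    (void)data;
    return 0;
}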
It is usually better to use char, but it makes so little difference that it does not matter. It's raw data, so you should simply be passing it around as such rather than trying to work with it via char pointers of one type or another. Since char is the native data type, it makes most sense to use it rather than imagining you are forcing your data into one type or another.
Note the different ranges: signed char runs from -128 to +127, while unsigned char runs from 0 to 255, so the plain ASCII characters (0 to 127) fit in either type.
You can find a complete comparison of char and unsigned char in this question:
difference between char and unsigned char
and you can see an ASCII table here:
ASCII table (complete tables of raw characters)
If you are able to work with C++17, there is a std::byte type that is more appropriate for working with raw data. It only has bitwise and shift operators defined for it, no arithmetic.
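For instance, a minimal C++17 sketch:

#include <cstddef>
#include <iostream>

int main() {
    // std::byte allows bitwise and shift operations but no arithmetic,
    // so accidental "number-like" use is a compile error.
    std::byte b{0x0F};
    b |= std::byte{0xF0};                           // now 0xFF
    b = b >> 4;                                     // now 0x0F
    // b = b + std::byte{1};                        // would not compile: no operator+
    std::cout << std::to_integer<int>(b) << '\n';   // prints 15
}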

Under what circumstances would one use a signed char in C++?

In most situations, one would declare a char object to assign one of the character values in the ASCII table, ranging from 0 to 127. Even the extended character sets range from 128 to 255 (still positive). So I'm assuming that when dealing with the printing of characters, one only needs to use an unsigned char.
Now, based on some research on SO, people use a signed char when they need to use really small integers, but for that we can utilize the [u]int8 type. So I'm having trouble coming to terms with why one would need to use a signed char. You can use it if you are dealing with the basic ASCII table (which unsigned char is already capable of handling), or you can use it to represent small integers (which [u]int8 already takes care of).
Can someone please provide a programming example in which a signed char is preferred over the other types?
The reason is that you don't know, at least portably, if plain char variables are signed or unsigned. Different implementations have different approaches, a plain char may be signed in one platform and unsigned in another.
If you want to store negative values in a variable of type char, you absolutely must declare it as signed char, because only then can you be sure that every platform will be able to store negative values in there. Yes, you can use the [u]int8 type, but that was not always the case (it was only introduced in C++11), and in fact int8 is most likely an alias for signed char.
Moreover, uint8_t and int8_t are defined to be optional types, meaning you can't always rely on their existence (contrary to signed char). In particular, if a machine has a byte unit with more than 8 bits, it is not very likely that uint8_t and int8_t are defined (although they can be; a compiler is always free to provide them and do the appropriate calculations). See this related question: What is int8_t if a machine has > 8 bits per byte?
Is char signed or unsigned?
Actually it is neither, it's implementation defined if a variable of type char can hold negative values. So if you are looking for a portable way to store negative values in a narrow character type explicitly declare it as signed char.
§ 3.9.1 - Fundamental Types - [basic.fundamental]
1 Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values.
I'd like to use the smallest signed integer type available, which one is it?
C++11 introduced several fixed-width integer types, but a common misunderstanding is that these types are guaranteed to be available, which isn't true.
§ 18.4.1 - Header <cstdint> synopsis - [cstdint.syn]
typedef signed integer type int8_t; // optional
To preserve space in this post most of the section has been left out, but the optional rationale applies to all {,u}int{8,16,32,64}_t types. An implementation is not required to implement them.
The standard mandates that int_least8_t is available, but as the name implies this type is only guaranteed to have a width equal or larger than 8 bits.
However, the standard guarantees that even though signed char, char, and unsigned char are three distinct types[1] they must occupy the same amount of storage and have the same alignment requirements.
After inspecting the standard further we will also find that sizeof(char) is guaranteed to be 1[2] , which means that this type is guaranteed to occupy the smallest amount of space that a C++ variable can occupy under the given implementation.
Conclusion
Remember that unsigned char and signed char must occupy the same amount of storage as a char?
The smallest signed integer type that is guaranteed to be available is therefore signed char.
[note 1]
§ 3.9.1 - Fundamental Types - [basic.fundamental]
1 Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types.
A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation.
[note 2]
§ 5.3.3 - Sizeof - [expr.sizeof]
sizeof(char), sizeof(signed char), and sizeof(unsigned char) are 1.
The result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined.
You can use char for arithmetic operations with small integers. unsigned char will give you greater range, while signed char will give you a smaller absolute range and the ability to work with negative numbers.
There are situations where char's small size is of importance and is preferred for these operations (see here), so when one has negative numbers to deal with, signed char is the way to go.
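As a small illustration (a sketch using only guaranteed facilities; the exact min/max values are whatever your platform reports):

#include <iostream>
#include <limits>

int main() {
    // signed char is the smallest signed integer type that is guaranteed to exist.
    signed char level = -42;                       // a small negative value in one byte
    std::cout << sizeof(level) << '\n'                                                // always 1
              << static_cast<int>(std::numeric_limits<signed char>::min()) << '\n'   // typically -128
              << static_cast<int>(std::numeric_limits<signed char>::max()) << '\n';  // typically 127
}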

What is practiced in C++ to use for byte manipulation uint8 or char?

I am coming from Java to C++ and I need something similar to byte[] from Java. I can use std::vector<> for easy array-like manipulation, but I need an answer: what is practiced in C++ for byte manipulation, uint8 or char? (I do a lot of packing of bigger integers into arrays with & 0xff and >> by some number, so it needs to be quick.)
Assuming that uint8 is an 8 bit unsigned integer type, the main difference on a "normal" C++ implementation is that char is not necessarily unsigned.
On "not normal" C++ implementations, there could be more significant differences -- char might not be 8 bits. But then, what would you define uint8 to be on such an implementation anyway?
Whether the sign difference matters or not depends how you're using it, but as a rule of thumb it's best to use unsigned types with bitwise operators. That said, they both get promoted to int in bitwise & anyway (again on a "normal" C++ implementation) and it really doesn't matter for &, it doesn't cause surprises in practice. But using << on a negative signed value results in undefined behavior, so avoid that.
So, use an unsigned type. If the most convenient way for you to write that is uint8, and you know that your code deals in octets and will only run on systems where char is an octet, then you may as well use it.
If you want to use a standard type, use unsigned char. Or uint8_t in order to deliberately prevent your code compiling on "not normal" implementations where char is not an octet.
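A minimal sketch of the kind of packing the question mentions, written with unsigned char (the helper names put_u32 and get_u32 are just illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value to a byte buffer, big-endian.
void put_u32(std::vector<unsigned char>& buf, std::uint32_t v) {
    buf.push_back(static_cast<unsigned char>((v >> 24) & 0xFF));
    buf.push_back(static_cast<unsigned char>((v >> 16) & 0xFF));
    buf.push_back(static_cast<unsigned char>((v >> 8) & 0xFF));
    buf.push_back(static_cast<unsigned char>(v & 0xFF));
}

// Read the value back from the same position.
std::uint32_t get_u32(const std::vector<unsigned char>& buf, std::size_t pos) {
    return (static_cast<std::uint32_t>(buf[pos]) << 24) |
           (static_cast<std::uint32_t>(buf[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(buf[pos + 2]) << 8) |
            static_cast<std::uint32_t>(buf[pos + 3]);
}

Because the buffer elements are unsigned, the masked and shifted values never pick up stray sign bits when they are widened back to std::uint32_t.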
The standard C++ character types are char, signed char, and unsigned char. uint8 is probably a typedef synonym for unsigned char.

Can someone explain how the signedness of char is platform specific?

I recently read that the differences between
char
unsigned char
and
signed char
is platform specific.
I can't quite get my head around this. Does it mean the bit sequence can vary from one platform to the next, i.e. on platform 1 the sign is the first bit, while on platform 2 the sign could be at the end? How would you code against this?
Basically my question comes from seeing this line:
typedef unsigned char byte;
I don't understand the relevance of the signedness.
Let's assume that your platform has eight-bit bytes, and suppose we have the bit pattern 10101010. To a signed char, that value is −86. For unsigned char, though, that same bit pattern represents 170. We haven't moved any bits around; it's the same bits, interpreted two different ways.
Now for char. The standard doesn't say which of those two interpretations should be correct. A char holding the bit pattern 10101010 could be either −86 or 170. It's going to be one of those two values, but you have to know the compiler and the platform before you can predict which it will be. Some compilers offer a command-line switch to control which one it will be. Some compilers have different defaults depending on what OS they're running on, so they can match the OS convention.
In most code, it really shouldn't matter. They are treated as three distinct types, for the purposes of overloading. Pointers to one of those types aren't compatible with pointers to another type. Try calling strlen with a signed char* or an unsigned char*; it won't work.
Use signed char when you want a one-byte signed numeric type, and use unsigned char when you want a one-byte unsigned numeric type. Use plain old char when you want to hold characters. That's what the programmer was thinking when writing the typedef you're asking about. The name "byte" doesn't have the connotation of holding character data, whereas the name "unsigned char" has the word "char" in its name, and that causes some people to think it's a good type for holding characters, or that it's a good idea to compare it with variables of type char.
Since you're unlikely to do general arithmetic on characters, it won't matter whether char is signed or unsigned on any of the platforms and compilers you use.
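A minimal sketch of the two readings of the same bits (assuming 8-bit bytes and the usual two's-complement representation):

#include <iostream>

int main() {
    unsigned char u = 0b10101010;                   // the bit pattern from the answer above
    signed char s = static_cast<signed char>(u);    // -86 on typical platforms

    std::cout << static_cast<int>(u) << ' '         // 170
              << static_cast<int>(s) << '\n';       // -86
}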
You misunderstood something. signed char is always signed. unsigned char is always unsigned. But whether plain char is signed or unsigned is implementation-specific - that means it depends on your compiler. This differs from the int types, which are all signed (int is the same as signed int, short is the same as signed short). A more interesting thing is that char, signed char, and unsigned char are treated as three distinct types for the purposes of function overloading. It means that you can have three function overloads in the same compilation unit:
void overload(char);
void overload(signed char);
void overload(unsigned char);
For the int types it is the contrary: you can't have
void overload(int);
void overload(signed int);
because int and signed int are the same type.
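A self-contained sketch of those overloads, showing which one a plain character literal selects:

#include <iostream>

void overload(char)          { std::cout << "char\n"; }
void overload(signed char)   { std::cout << "signed char\n"; }
void overload(unsigned char) { std::cout << "unsigned char\n"; }

int main() {
    overload('a');                               // a character literal has type char
    overload(static_cast<signed char>('a'));     // picks the signed char overload
    overload(static_cast<unsigned char>('a'));   // picks the unsigned char overload
}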
It's more correct to say that it's compiler-specific, and you should not count on char being signed or unsigned when using char without a signed or unsigned qualifier.
Otherwise you would face the following problem: you write and debug the program assuming that char is signed by default, and then it is recompiled with a compiler that assumes otherwise, and the program's behaviour changes drastically. If you rely on this assumption only occasionally in your code, you risk unintended behaviour in cases that are only triggered under specific conditions and are very hard to detect and debug.
Perhaps you are referring to the fact that the signedness of char is compiler / platform specific. Here is a blog entry that sheds some light on it:
Character types in C and C++
Having a signed char is more of a fluke of how all basic variable types are handled in C; generally it is not actually useful to have negative characters.
A signed char is at least 8 bits wide and reserves one of those bits as the sign bit.
An unsigned char is at least 8 bits wide and has no sign bit.
Whether a plain char behaves like the signed or the unsigned variant is up to the compiler and platform: many compilers default to signed char (for example GCC and MSVC on x86), while others default to unsigned char (for example GCC on ARM), which is why portable programs cannot rely on either choice.