Under what circumstances would one use a signed char in C++? - c++

In most situations, one would declare a char object to hold one of the character values in the ASCII table, ranging from 0 to 127. Even the extended character sets range from 128 to 255 (still positive). So I'm assuming that when dealing with the printing of characters, one only needs an unsigned char.
Now, based on some research on SO, people use a signed char when they need really small integers, but for that we can use the [u]int8_t types. So I'm having trouble coming to terms with why one would need a signed char. You can use it for the basic ASCII character table (which unsigned char is already capable of handling), or you can use it to represent small integers (which [u]int8_t already takes care of).
Can someone please provide a programming example in which a signed char is preferred over the other types?

The reason is that you don't know, at least portably, whether plain char is signed or unsigned. Different implementations take different approaches; a plain char may be signed on one platform and unsigned on another.
If you want to store negative values in a variable of type char, you absolutely must declare it as signed char, because only then can you be sure that every platform will be able to store negative values in it. Yes, you can use the [u]int8_t types, but those were not always available (they were only introduced in C++11), and in fact int8_t is most likely an alias for signed char.
Moreover, uint8_t and int8_t are defined to be optional types, meaning you can't always rely on their existence (contrary to signed char). In particular, on a machine whose byte has more than 8 bits, it is not very likely that uint8_t and int8_t are defined (although they can be; a compiler is always free to provide them and do the appropriate calculations). See this related question: What is int8_t if a machine has > 8 bits per byte?
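A minimal sketch of the point (nothing here beyond the standard library; the output of the first line depends on your platform):
#include <cstdio>
#include <limits>

int main() {
    // Implementation-defined: plain char may be signed on one platform and unsigned on another.
    std::printf("plain char is %s\n",
                std::numeric_limits<char>::is_signed ? "signed" : "unsigned");

    signed char sc = -42;  // portable: signed char is guaranteed to hold negative values
    std::printf("sc = %d\n", static_cast<int>(sc));
}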

Is char signed or unsigned?
Actually it is neither; it's implementation-defined whether a variable of type char can hold negative values. So if you are looking for a portable way to store negative values in a narrow character type, explicitly declare it as signed char.
§ 3.9.1 - Fundamental Types - [basic.fundamental]
1 Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values.
I'd like to use the smallest signed integer type available, which one is it?
C++11 introduced several fixed-width integer types, but a common misunderstanding is that these types are guaranteed to be available, which isn't true.
§ 18.4.1 - Header <cstdint> synopsis - [cstdint.syn]
typedef signed integer type int8_t; // optional
To preserve space in this post most of the section has been left out, but the optional rationale applies to all {,u}int{8,16,32,64}_t types. An implementation is not required to implement them.
The standard mandates that int_least8_t is available, but as the name implies this type is only guaranteed to have a width equal or larger than 8 bits.
However, the standard guarantees that even though signed char, char, and unsigned char are three distinct types[1] they must occupy the same amount of storage and have the same alignment requirements.
After inspecting the standard further we will also find that sizeof(char) is guaranteed to be 1[2] , which means that this type is guaranteed to occupy the smallest amount of space that a C++ variable can occupy under the given implementation.
Conclusion
Remember that unsigned char and signed char must occupy the same amount of storage as a char?
The smallest signed integer type that is guaranteed to be available is therefore signed char.
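As a quick sketch of that conclusion (assuming only the guarantees quoted above), the following should compile on any conforming C++11 implementation:
#include <climits>

// Guaranteed on every conforming implementation:
static_assert(sizeof(signed char) == 1, "narrow character types have size 1 by definition");
static_assert(CHAR_BIT >= 8, "a byte is at least 8 bits wide");

// int8_t, by contrast, is optional and may not be declared at all on exotic hardware.
int main() {}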
[note 1]
§ 3.9.1 - Fundamental Types - [basic.fundamental]
1 Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types.
A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation.
[note 2]
§ 5.3.3 - Sizeof - [expr.sizeof]
sizeof(char), sizeof(signed char), and sizeof(unsigned char) are 1.
The result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined.

You can use char for arithmetic operations with small integers. unsigned char will give you greater range, while signed char will give you a smaller absolute range and the ability to work with negative numbers.
There are situations where char's small size matters and is preferred for these operations (see here), so when one has negative numbers to deal with, signed char is the way to go.
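A small sketch of such arithmetic with negative numbers (the static_cast is there because the usual arithmetic conversions promote to int):
#include <cstdio>

int main() {
    signed char delta = -5;    // small signed quantity
    signed char level = 100;
    // Arithmetic promotes both operands to int; narrow the result back explicitly.
    level = static_cast<signed char>(level + delta);
    std::printf("level = %d\n", static_cast<int>(level));  // prints 95
}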

Related

Is this UTF-8 implementation implementation-defined or well-defined?

I just browsed around looking for some implementation of UTF-8 code points (and no, not to plagiarize) and stumbled across this:
typedef unsigned char char8_t;
typedef std::basic_string<unsigned char> u8string;
Is this code ignoring the fact that CHAR_BIT is only required to be at least 8, but may be greater? Or does this not matter in this context and the code is fine? If so, then why is this?
Also, someone (presumably SO member #NicolBolas?) wrote this:
const char *str = u8"This is a UTF-8 string.";
This is pretty much how UTF-8 will be used in C++ for string literals.
I thought that a code unit in UTF-8 is always exactly eight bits!
From the Unicode Standard 8.0.0, Chapter 2.5:
In the Unicode character encoding model, precisely defined encoding
forms specify how each integer (code point) for a Unicode character is
to be expressed as a sequence of one or more code units. The Unicode
Standard provides three distinct encoding forms for Unicode
characters, using 8-bit, 16-bit, and 32-bit units. These are
named UTF-8, UTF-16, and UTF-32, respectively.
(Newlines removed, hyphen on line-break removed, emphasis added.)
So why does he claim const char* is used instead of const uint8_t* (or the suggested, hypothetical const char8_t*)?
uint8_t only exists on systems that have memory that's accessible as exactly 8 bits. UTF-8 doesn't have any such requirement. It uses values that fit into 8 bits, but does not impose any requirements on how those values are actually stored. Each 8-bit value could be stored as 16 bits or 32 bits or whatever makes sense for the system that it's running on; the only requirement is that the value must be correct.
So why does he claim const char* is used instead of const uint8_t* (or the suggested, hypothetical const char8_t*)?
Because that's what the standard says. A u8 string literal will resolve to an array of type const char[N]. That's how UTF-8 literals in C++ are defined to work.
If char on a system has more than 8 bits... so be it. Each char in the string will still hold a value between 0 and 255, which is the range of valid UTF-8 code units, even though char could hold larger values on such a system.
If char cannot hold 8 bits... then the implementation is invalid. By recent wording of the standard, char is required to hold enough bits to store every valid UTF-8 code unit. And technically, 255 is not a valid UTF-8 code unit.
And the fact of the matter is this: there's already a huge amount of code that accepts UTF-8 via char*. They aren't going to rewrite POSIX, filesystem APIs, and whatever else to adopt a different type.
That being said, manipulating a sequence of UTF-8 code units via const char* is... dubious. This is because they could be signed. However, the recent standard wording requires that a conversion between unsigned char and char work within the range of valid UTF-8 code units. That is, you can cast a const char* to a const unsigned char*, do your bit manipulation on that, and then cast it back, and you're guaranteed to work.
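A sketch of that cast-and-inspect pattern (assuming pre-C++20, where a u8 literal is an array of const char; the \u00E9 just forces a two-byte code point into the string):
#include <cstdio>

int main() {
    const char* s = u8"h\u00E9llo";  // u8 literal: const char[N] through C++17
    // Do the bit tests on unsigned char; the round-trip guarantee lets you go back to char* afterwards.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    for (; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80)
            std::printf("lead byte:    0x%02X\n", static_cast<unsigned>(*p));  // starts a code point
        else
            std::printf("continuation: 0x%02X\n", static_cast<unsigned>(*p));  // 10xxxxxx byte
    }
}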
And what's the point of that super-complex "recent wording of the standard"?
The point of that is to allow UTF-8 strings to actually work. Because the standards committee, in their "infinite wisdom", decided not to include a special char8_t UTF-8 code unit type, they had to add wording to make char serve in that role. And that requires that the conversion to/from unsigned char and char to not be able to mangle a UTF-8 code unit.
There was even a discussion topic on the C++ standard discussion forums, where the wording was discussed (search for 1759). The C++14 wording says:
For each value i of type unsigned char in the range 0 to 255 inclusive, there exists a value j of type char such that the result of an integral conversion (4.7) from i to char is j, and the result of an integral conversion from j to unsigned char is i.
This means in particular that char could only be signed by default if the signed representation satisfies the above. A one's complement signed char would not be sufficient, since negative zero has a special representation (0x80), which when converted to unsigned becomes regular 0.
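A brute-force sketch of what that wording guarantees; on a conforming C++14 implementation the loop never reports a failure:
#include <cstdio>

int main() {
    // Round-trip every possible code unit value through char and back to unsigned char.
    for (unsigned i = 0; i <= 255; ++i) {
        unsigned char u = static_cast<unsigned char>(i);
        char          c = static_cast<char>(u);           // integral conversion to char
        unsigned char b = static_cast<unsigned char>(c);  // and back
        if (b != u)
            std::printf("round trip failed for %u\n", i);  // must not happen under the C++14 wording
    }
    std::printf("done\n");
}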
Should they have just defined a specific char8_t that is required to be unsigned and has at least 8 bits? Probably. But it's done and it ain't changing.
[lex.string]/8 Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
So, whatever else is true, a UTF-8 string literal is a sequence of chars.
As to uint8_t:
7.20.1.1
2 The typedef name uintN_t designates an unsigned integer type with width N and no padding bits. Thus, uint24_t denotes such an unsigned integer type with a width of exactly 24 bits.
3 These types are optional. However, if an implementation provides integer types with widths of 8, 16, 32, or 64 bits, no padding bits, and (for the signed types) that have a two’s complement representation, it shall define the corresponding typedef names.
On a hypothetical system with a char larger than 8 bits, uint8_t would not be defined.
A code unit in UTF-8 is always exactly eight bits. unsigned char is specified to have at least 8 bits, so every UTF-8 code unit fits in type unsigned char.
The rationale for the u8"This is a UTF-8 encoded string constant" is not the fact that it is stored in 8-bit bytes, but that it is encoded as UTF-8, whereas the source file might have a different encoding. The u8string typedef is consistent with that but a tad confusing if bytes have more than 8 bits.
Using unsigned char is a good way to remove the uncertainty regarding the signedness of type char.
char8_t was voted into C++20 at the San Diego meeting, so the typedef above will no longer compile (char8_t is now a keyword).
However, you will be able to use std::u8string, but remember that it only works with code units, not code points or grapheme clusters, so the safe approach is to treat it as an opaque blob and use third-party libraries to mutate it. At least for now.
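A tiny C++20 sketch of that; note that size() counts code units, which is exactly the "opaque blob" caveat above:
#include <cstdio>
#include <string>

int main() {
    // C++20: the u8 literal yields char8_t code units, stored in std::u8string.
    std::u8string s = u8"h\u00E9llo";
    std::printf("code units: %zu\n", s.size());  // 6 code units for 5 code points
}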

Is the size of built-in data types in C++ managed at the source code level?

When I open /usr/include/stdint.h I see things like
typedef signed char int8_t;
which means that every int8_t is to be treated like a signed char. This lets me suspect that on my system signed char is 8 bits in size. (However, the other way round would be more intuitive for me, i.e. every signed char has to be treated like an int8_t.) Where is the size of signed char defined?
Short answer: no.
The size of fundamental types like char (and therefore signed char) is defined by your compiler, based on the target system architecture. It is not defined in source code.
The typedef above means the opposite: it introduces the name int8_t as an alias for the predefined signed char type. On your system (as on most), char is 8 bits wide, so it's the natural way to define an 8-bit integer type.
The compiler implementation for a particular architecture chooses whatever its authors wanted for the "normal" data type sizes (within the limits the standard allows). So in a way it is set in code, but not in any part visible to your programs.
The fixed-width names are then formulated in terms of those standard types, so that the sizes match.
It's not. It's only guaranteed to be at least 8 bits (since it must be able to contain values in the range [-127, 127]). And int8_t is also not guaranteed to exist; it's only present on machines where char is an 8-bit two's complement type.
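If you want to see what your own platform does, a minimal sketch using the <climits> macros:
#include <climits>
#include <cstdio>

int main() {
    std::printf("CHAR_BIT = %d\n", CHAR_BIT);  // at least 8, usually exactly 8
    std::printf("signed char range: %d..%d\n", SCHAR_MIN, SCHAR_MAX);
}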

Unsigned narrow character type number representation

N3797::3.9.1/1 [basic.fundamental] says
For unsigned narrow character types, all possible bit patterns of the
value representation represent numbers.
That's a bit unclear for me. We have the following ranges for narrow character types:
unsigned char: 0 to 255
signed char: -128 to 127
For both unsigned char and signed char objects we have a one-to-one mapping from the bits of the object representation to the integral value they represent. The Standard says, N3797::3.9.1/1 [basic.fundamental]:
These requirements do not hold for other types.
Why does the requirement I cited not hold, say, for the signed char type?
Signed types can use one of three representations: two's complement, one's complement, or sign-magnitude. The last two each have one bit pattern (the negation of zero) which doesn't represent a number.
Two's complement is more or less universal for integer types these days; but the language still allows for the others.
A few machines have what are called "trap representations". This means (for example) that an int can contain an extra bit (or more than one) to signify whether it has been initialized or not.
If you try to read an int with that bit saying it hasn't been initialized, it can trigger some sort of trap/exception/fault that (for example) immediately shuts down your program with some sort of error message. Any time you write a value to the int, that trap representation is cleared, so reading from it can/will work.
So basically, when your program starts, it initializes all your ints to such trap representations. If you try to read from an uninitialized variable, the hardware will catch it immediately and give you an error message.
The standard mandates that for unsigned char, no such trap representation is possible--all the bits of an unsigned char must be "visible"--they must form part of the value. That means none of them can be hidden; no pattern of bits you put into an unsigned char can form a trap representation (or anything similar). Any bits you put into unsigned char must simply form some value.
Any other type, however, can have trap representations. If, for example, you take some (more or less) arbitrarily chosen 8 bits out of some other type, and read them as an unsigned char, they'll always form a value you can read, write to a file, etc. If, however, you attempt to read them as any other type (signed char, unsigned int, etc.) it's allowable for it to form a trap representation, and attempting to do anything with it can give undefined behavior.
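That is why inspecting an object's raw bytes is conventionally done through unsigned char; a minimal sketch:
#include <cstddef>
#include <cstdio>

int main() {
    int x = 0x12345678;
    // Every bit pattern of unsigned char is a value, never a trap representation,
    // so reading any object's bytes this way is always well-defined.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&x);
    for (std::size_t i = 0; i < sizeof x; ++i)
        std::printf("%02X ", static_cast<unsigned>(bytes[i]));  // byte order depends on the platform
    std::printf("\n");
}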

int8_t vs char ; Which is the best one?

I know both are different types (signed char and char), however my company's coding guidelines specify using int8_t instead of char.
So, I want to know why I have to use int8_t instead of the char type. Are there any best practices for using int8_t?
The use of int8_t is perfectly good in some circumstances, specifically when the type is used for calculations where a signed 8-bit value is required, i.e. calculations involving strictly sized data (e.g. data defined by external requirements to be exactly 8 bits in the result). (I used pixel colour levels in a comment above, but that really would be uint8_t, as negative pixel colours usually don't exist, except perhaps in a YUV-type colourspace.)
The type int8_t should NOT be used as a replacement for char in strings. That can lead to compiler errors (or warnings, but we don't really want to have to deal with compiler warnings either). For example:
int8_t *x = "Hello, World!\n";
printf(x);
may well compile fine on compiler A, but give errors or warnings on compiler B for mixing signed and unsigned char values, or fail outright if int8_t isn't based on a char type at all. That's just like expecting
int *ptr = "Foo";
to compile in a modern compiler...
In other words, int8_t SHOULD be used instead of char if you are using 8-bit data for calculations. It is incorrect to wholesale replace all char with int8_t, as they are far from guaranteed to be the same.
If there is a need to use char for string/text/etc., and for some reason char is too vague (it can be signed or unsigned, etc.), then using typedef char mychar; or something like that should be used. (It's probably possible to find a better name than mychar!)
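A sketch of that separation of roles (the alias names here are made up for illustration):
#include <cstdint>
#include <cstdio>

// Hypothetical aliases, only to illustrate keeping text and arithmetic types apart:
typedef char        text_char;  // string/text data
typedef std::int8_t counter8;   // genuinely 8-bit signed calculations (where int8_t exists)

int main() {
    const text_char* greeting = "Hello";  // fine: string literals are arrays of char
    counter8 delta = -12;                 // fine: small signed quantity
    std::printf("%s, %d\n", greeting, static_cast<int>(delta));
}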
Edit: I should point out that whether you agree with this or not, I think it would be rather foolish to simply walk up to whoever is in charge of this "principle" at the company, point at a post on SO and say "I think you're wrong". Try to understand what the motivation is. There may be more to it than meets the eye.
They simply make different guarantees:
char is guaranteed to exist, to be at least 8 bits wide, and to be able to represent either all integers between -127 and 127 inclusive (if signed) or between 0 and 255 (if unsigned).
int8_t is not guaranteed to exist (and yes, there are platforms on which it doesn't), but if it exists it is guaranteed to be an 8-bit two's complement signed integer type with no padding bits; thus it is capable of representing all integers between -128 and 127, and nothing else.
When should you use which? When the guarantees made by the type line up with your requirements. It is worth noting, however, that large portions of the standard library require char * arguments, so avoiding char entirely seems short-sighted unless there’s a deliberate decision being made to avoid usage of those library functions.
int8_t is only appropriate for code that requires a signed integer type that is exactly 8 bits wide and that should not compile if there is no such type. Such requirements are far rarer than the number of questions about int8_t and its brethren indicates. Most size requirements are that the type have at least a particular number of bits. signed char works just fine if you need at least 8 bits; int_least8_t also works.
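A sketch contrasting the two guarantees (the INT8_MAX check is how <cstdint> signals whether the optional type exists):
#include <cstdint>
#include <cstdio>

int main() {
    // int_least8_t is always available; int8_t exists only where the platform has an exact 8-bit type.
    std::int_least8_t a = -100;
    std::printf("least8: %d\n", static_cast<int>(a));
#if defined(INT8_MAX)
    std::int8_t b = -100;  // exactly 8 bits, two's complement, no padding
    std::printf("int8_t: %d\n", static_cast<int>(b));
#endif
}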
int8_t is specified by the C99 standard to be exactly eight bits wide, and fits in with the other C99 guaranteed-width types. You should use it in new code where you want an exactly 8-bit signed integer. (Take a look at int_least8_t and int_fast8_t too, though.)
char is still preferred as the element type for single-byte character strings, just as wchar_t should be preferred as the element type for wide character strings.

Why do C++ streams use char instead of unsigned char?

I've always wondered why the C++ Standard library has instantiated basic_[io]stream and all its variants using the char type instead of the unsigned char type. char means (depending on whether it is signed or not) you can have overflow and underflow for operations like get(), which leads to implementation-defined values of the variables involved. Another example is when you want to output a byte, unformatted, to an ostream using its put function.
Any ideas?
Note: I'm still not really convinced. So if you know the definitive answer, you can still post it indeed.
Possibly I've misunderstood the question, but conversion from unsigned char to char isn't unspecified, it's implementation-defined (4.7/3 in the C++ standard).
The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). Also for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)
Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness isn't defined. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"
I have always understood it this way: the purpose of the iostream classes is to read and/or write a stream of characters, which, if you think about it, are abstract entities that are only represented by the computer using a character encoding. The C++ standard takes great pains to avoid pinning down the character encoding, saying only that "Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set," because it doesn't need to force the "implementation's basic character set" in order to define the C++ language; the standard can leave the decision of which character encoding is used to the implementation (a compiler together with an STL implementation), and just note that char objects represent single characters in some encoding.
An implementation writer could choose a single-octet encoding such as ISO-8859-1 or even a double-octet encoding such as UCS-2. It doesn't matter. As long as a char object is "large enough to store any member of the implementation's basic character set" (note that this explicitly forbids variable-length encodings), then the implementation may even choose an encoding that represents basic Latin in a way that is incompatible with any common encoding!
It is confusing that the char, signed char, and unsigned char types share "char" in their names, but it is important to keep in mind that char does not belong to the same family of fundamental types as signed char and unsigned char. signed char is in the family of signed integer types:
There are four signed integer types: "signed char", "short int", "int", and "long int."
and unsigned char is in the family of unsigned integer types:
For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," ...
The one similarity between the char, signed char, and unsigned char types is that "[they] occupy the same amount of storage and have the same alignment requirements". Thus, you can reinterpret_cast from char * to unsigned char * in order to determine the numeric value of a character in the execution character set.
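A small sketch of that reinterpret_cast, printing each character's numeric value in the execution character set:
#include <cstdio>

int main() {
    char text[] = "Ab";
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    for (; *p != '\0'; ++p)
        std::printf("'%c' -> %u\n", static_cast<char>(*p), static_cast<unsigned>(*p));
}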
To answer your question, the reason why the STL uses char as the default type is because the standard streams are meant for reading and/or writing streams of characters, represented by char objects, not integers (signed char and unsigned char). The use of char versus the numeric value is a way of separating concerns.
char is for characters, unsigned char for raw bytes of data, and signed chars for, well, signed data.
The standard does not specify whether signed or unsigned char will be used for the implementation of char; it is compiler-specific. It only specifies that char will be "enough" to hold characters on your system, the way characters were in those days, which is to say, no Unicode.
Using char for characters is the standard way to go. Using unsigned char is a hack, although it'll match the compiler's implementation of char on most platforms.
I think this comment explains it well. To quote:
signed char and unsigned char are arithmetic, integral types just like int and unsigned int. On the other hand, char is expressly intended to be the "I/O" type that represents some opaque, system-specific fundamental unit of data on your platform. I would use them in this spirit.