How to check if plain chars are signed or unsigned? - c++

Apparently there is a possibility that plain char can be either signed or unsigned by default. Stroustrup writes:
It is implementation-defined whether a plain char is considered signed or unsigned. This opens the
possibility for some nasty surprises and implementation dependencies.
How do I check whether my chars are signed or unsigned? I might want to convert them to int later, and I don't want them to be negative. Should I always use unsigned char explicitly?

From the header <limits>:
std::numeric_limits<char>::is_signed
http://en.cppreference.com/w/cpp/types/numeric_limits/is_signed
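For example, a minimal sketch (the same constant also works in a static_assert, since it is constexpr):

#include <iostream>
#include <limits>

int main() {
    std::cout << "char is "
              << (std::numeric_limits<char>::is_signed ? "signed" : "unsigned")
              << "\n";
}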

Some alternatives:
const bool char_is_signed = (char)-1 < 0;
#include <climits>
const bool char_is_signed = CHAR_MIN < 0;
And yes, some systems do make plain char an unsigned type. Examples I've encountered: Cray T90, Cray SV1, Cray T3E, SGI MIPS IRIX, IBM PowerPC AIX. And any system that uses EBCDIC pretty much has to make plain char unsigned so that all basic characters have non-negative values. (And some compilers have an option to control the signedness of char, such as gcc's -fsigned-char and -funsigned-char.)
But std::numeric_limits<char>::is_signed, as suggested by Benjamin Lindley's answer, probably expresses the intent more clearly.
(On the other hand, the methods I suggested can also be applied to C.)

Using unsigned char "always" could give you some interesting surprises, as the majority of C-style functions, like printf and fopen, use char, not unsigned char.
edit: Example of "fun" with C-style functions:
const unsigned char *cmd = "grep -r blah *.txt";
FILE *pf = popen(cmd, "r");
will give errors (in fact, I get one for the *cmd = line, and one error for the popen line). Using const char *cmd = ... will work fine. I picked popen because it's a function that isn't trivial to replace with some standard C++ functionality - obviously, printf or fopen can quite easily be replaced with some iostream or fstream type functionality, which generally has alternatives that take unsigned char as well as char.
However, if you are using > or < on characters that are beyond 127, then you will need to use unsigned char (or some other solution, such as casting to int and masking the lower 8 bits). It is probably better to try to avoid direct comparisons (in particular when it comes to non-ASCII characters - they are messy anyway, because there are often several variants depending on locale, character encodings, etc). Comparing for equality should work however.
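For example, here is a small sketch of the problem ((char)0xA1 is just an arbitrary byte above 127):

#include <cstdio>

int main() {
    char c = (char)0xA1;                // byte 0xA1 = 161
    if (c >= 0x80)                      // false where char is signed: c is -95
        std::printf("high byte\n");
    if ((unsigned char)c >= 0x80)       // true regardless of char's signedness
        std::printf("high byte, portable check\n");
}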

Yes, if you want to use a char type and you always want it to be unsigned, use unsigned char. Note that unlike the other fundamental integer types, unsigned char is a different type from char -- even on systems where char is unsigned. Also, conversion from char to int ought to be lossless so if your result is incorrect, your source char value may also be incorrect.
The cleanest way to test whether char is unsigned depends on whether you need it to be a preprocessor test and on which version of C++ you are targeting.
To conditionally compile code using a preprocessor test, the value of CHAR_MIN should work:
#include <climits>
#if (CHAR_MIN==0)
// code that relies on char being unsigned
#endif
In C++17, I would use std::is_signed_v and std::is_unsigned_v:
#include <type_traits>
static_assert(std::is_unsigned_v<char>);
// code that relies on char being unsigned
If you are writing against C++11 or C++14 you need the slightly more verbose std::is_signed and std::is_unsigned:
#include <type_traits>
static_assert(std::is_unsigned<char>::value, "char is signed");
// code that relies on char being unsigned
For all revisions of C++, Benjamin Lindley's std::numeric_limits<char>::is_signed is a good alternative.

You can use a preprocessor macro:
#define is_type_signed(my_type) (((my_type)-1) < 0)
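Usage, for example:

#include <cstdio>

#define is_type_signed(my_type) (((my_type)-1) < 0)

int main() {
    std::printf("char is %s\n", is_type_signed(char) ? "signed" : "unsigned");
    std::printf("wchar_t is %s\n", is_type_signed(wchar_t) ? "signed" : "unsigned");
}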

Related

Why do Boost Format and printf behave differently on same format string

The Boost Format documentation says:
One of its goal is to provide a replacement for printf, that means
format can parse a format-string designed for printf, apply it to the
given arguments, and produce the same result as printf would have.
When I compare the output of boost::format and printf using the same format string, I get different outputs:
#include <cstdio>
#include <iostream>
#include <boost/format.hpp>

int main()
{
    boost::format f("BoostFormat:%d:%X:%c:%d");
    unsigned char cr = 65; // 'A'
    int cr2i = int(cr);
    f % cr % cr % cr % cr2i;
    std::cout << f << std::endl;
    printf("Printf:%d:%X:%c:%d\n", cr, cr, cr, cr2i);
}
The output is:
BoostFormat:A:A:A:65
Printf:65:41:A:65
The difference is when I want to display a char as integral type.
Why there is a difference? Is this a bug or wanted behavior?
This is expected behaviour.
The Boost manual says this about the classical type-specification flags you use:
But the classical type-specification flag of printf has a weaker
meaning in format. It merely sets the appropriate flags on the
internal stream, and/or formatting parameters, but does not require
the corresponding argument to be of a specific type.
Please note also that in the C library's printf call, all char arguments are automatically promoted to int because printf is a variadic function. So the generated code is identical to:
printf("Printf:%d:%X:%c:%d\n", cr2i, cr2i, cr2i, cr2i);
This automatic conversion is not done with the % operator.
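If you need printf-compatible output from boost::format, you can perform the promotion explicitly yourself. Reusing cr and cr2i from the question (a sketch):

f % static_cast<int>(cr) % static_cast<int>(cr) % cr % cr2i;
// now prints BoostFormat:65:41:A:65, matching printf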
Addition to the accepted answer:
This also happens to arguments of type wchar_t as well as unsigned short and other equivalent types, which may be unexpected, for example, when using members of structs in the Windows API (e.g., SYSTEMTIME), which are short integers of type WORD for historical reasons.
If you are using Boost Format as a replacement for printf and "printf-like" functions in legacy code, you may consider creating a wrapper, which overrides the % operator in such a way that it converts
char and short to int
unsigned char and unsigned short to unsigned int
to emulate the behavior of C variable argument lists. It will still not be 100% compatible, but most of the remaining incompatibilities are actually helpful for fixing potentially unsafe code.
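A minimal sketch of such a wrapper (C++17; the class name printf_like_format is made up for illustration, and it applies the promotion rules described above rather than the exact vararg rules):

#include <boost/format.hpp>
#include <string>
#include <type_traits>

class printf_like_format {
    boost::format f_;
public:
    explicit printf_like_format(const std::string& fmt) : f_(fmt) {}

    template <typename T>
    printf_like_format& operator%(const T& arg) {
        if constexpr (std::is_integral_v<T> && sizeof(T) < sizeof(int)) {
            if constexpr (std::is_signed_v<T>)
                f_ % static_cast<int>(arg);           // char, short -> int
            else
                f_ % static_cast<unsigned int>(arg);  // unsigned char/short -> unsigned int
        } else {
            f_ % arg;                                 // everything else passes through
        }
        return *this;
    }

    std::string str() const { return f_.str(); }
};

As the answer says, this still isn't 100% compatible; for example, %c arguments now arrive as int, so character output may differ.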
Newer code should probably not use Boost Format, but the standard std::format, which is not compatible with printf.

Why are stoi, stol not fixed width integers?

Since ints and longs and other integer types may be different sizes on different systems, why not have stouint8_t(), stoint64_t(), etc. so that portable string to int code could be written?
Because typing that would make me want to chop off my fingers.
Seriously, the basic integer types are int and long, and the std::stoX functions are just very simple wrappers around strtol etc. Note that C doesn't provide strtoi32 or strtoi64, or anything else that a std::stouint32_t could wrap.
If you want something more complicated you can write it yourself.
I could just as well ask "why do people use int and long, instead of int32_t and int64_t everywhere, so the code is portable?" and the answer would be because it's not always necessary.
But the actual reason is probably that no one ever proposed it for the standard. Things don't just magically appear in the standard; someone has to write a proposal, justify adding them, and convince the rest of the committee. So the answer to most questions of the form "why isn't this thing I just thought of in the standard?" is that no one proposed it.
Because it's usually not necessary.
stoll and stoull return results of type long long and unsigned long long respectively. If you want to convert a string to int64_t, you can just call stoll() and store the result in your int64_t object; the value will be implicitly converted.
This assumes that long long is the widest signed integer type. Like C (starting with C99), C++ permits extended integer types, some of which might be wider than [unsigned] long long. C provides the conversion functions strtoimax and strtoumax (operating on intmax_t and uintmax_t, respectively) in <inttypes.h>. For whatever reason, C++ doesn't provide wrappers for these functions (the logical names would be stoimax and stoumax).
But that's not going to matter unless you're using a C++ compiler that provides an extended integer type wider than [unsigned] long long, and I'm not aware that any such compilers actually exist. For any types no wider than 64 bits, the existing functions are all you need.
For example:
#include <iostream>
#include <string>
#include <cstdint>

int main() {
    const char *s = "0xdeadbeeffeedface";
    uint64_t u = std::stoull(s, NULL, 0);
    std::cout << u << "\n";
}
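If you do need intmax_t, a thin wrapper over the C function is straightforward. A sketch of the hypothetical stoimax mentioned above, following the error-handling pattern of std::stoll:

#include <cerrno>
#include <cinttypes>  // std::strtoimax, std::intmax_t
#include <cstddef>
#include <stdexcept>
#include <string>

std::intmax_t stoimax(const std::string& s, std::size_t* pos = nullptr, int base = 10) {
    errno = 0;
    char* end = nullptr;
    const std::intmax_t v = std::strtoimax(s.c_str(), &end, base);
    if (end == s.c_str())
        throw std::invalid_argument("stoimax: no conversion");
    if (errno == ERANGE)
        throw std::out_of_range("stoimax: out of range");
    if (pos)
        *pos = static_cast<std::size_t>(end - s.c_str());
    return v;
}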

Type "char" in C++

Is there an easy way to convert between char and unsigned char if you don't know the default setting of the machine your code is running on?
(On most architectures, char is signed by default and thus has a range from -128 to +127. On some other architectures, such as ARM, char is unsigned by default and has a range from 0 to 255)
I am looking for a method to select the correct signedness or to convert between the two transparently, preferably one that doesn't involve too many steps since I would need to do this for all elements in an array.
Using a pre-processor definition would allow the setting of this at the start of my code.
As would specifying an explicit form of char, such as signed char or unsigned char, since only plain char varies between platforms.
The reason is, there are library functions I would like to use (such as strtol) that take in char as an argument but not unsigned char.
I am looking for some advice or perhaps some pointers in the right direction as to what would be a practical efficient way to do this to make the code portable, as I intend to run the code on a few machines with different default settings for char.
I don't see any actual issue on this point.
It's not a matter of the architecture being signed or unsigned by default. It's rather a matter of the compiler, and the default setting can be changed between the two options as you wish.
Also, there's no need to convert between the types. Both have the same representation in memory, on the same number of bits (usually 8). It's only a matter of your program and the libraries it uses to interpret the bits. If you're going to call strtol, then your data is a character array and you ought to use plain char.
If you ever use char to store not a character (A, b, f ...) but an actual value (-1, 0, 42 ...), then the range matters. In such cases, you have to use signed char or unsigned char. However, in such a case there's little use for the library functions that want a char *.
For these libraries that do actually want a char * with an actual binary blob, there's no issue. Create your binary buffer with the type you prefer, signed, unsigned, or undecided, and send it, possibly with a cast. It will run perfectly.
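For example, parsing with strtol works the same whichever signedness plain char has (a minimal sketch):

#include <cstdio>
#include <cstdlib>

int main() {
    char buf[] = "1234";                    // plain char, as the C library expects
    long v = std::strtol(buf, nullptr, 10);
    std::printf("%ld\n", v);                // 1234, whether char is signed or unsigned
}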
C++ has three char types; however, only plain char is allowed to vary between compilers/architectures. The other two are explicit versions, while char is implicit, so it is allowed to default to signed or unsigned.
To make your code portable, the most straightforward thing to do is to use signed char or unsigned char explicitly as you require them. However, for readability you may prefer to redefine char as the type you need, or even make your own definition of a char type (for demonstration purposes I will use RLChar).
1st version - un-define char and redefine (note: char is a keyword, and defining a keyword as a macro is not strictly conforming, so the 2nd version below is safer)
#ifdef __arm__
#undef char
#define char signed char
#endif
2nd version - define your own custom char type to use in your code
#ifndef RLChar
#define RLChar signed char
#endif
(personally I would tend to do the second)
You can also create a helper macro to clamp wider values into the unsigned char range:
#define CLAMP_VALUE_TO_255(v) ((v) > 255 ? 255 : ((v) < 0 ? 0 : (v)))
then you can use:
unsigned char clampedChar = (unsigned char)CLAMP_VALUE_TO_255(pixel); // clamp first, then convert, or the clamp has no effect
or use casts such as these (the way to go if all the compilers you will use support them):
signed char myChar = -100;
unsigned char mySecondChar;
mySecondChar = static_cast<unsigned char>(myChar); // uses a static cast
mySecondChar = reinterpret_cast<unsigned char&>(myChar); // uses a reinterpretation cast
so for your array scenario you could do
unsigned char* RLArray;
RLArray = reinterpret_cast<unsigned char*>(originalSignedCharArray);
Let me know if you need more info as this is just what I can remember off the top of my head, especially if you need C equivalents or more details. :)

Is it safe to call the functions from <cctype> with char arguments?

The C programming language says that the functions from <ctype.h> follow a common requirement:
ISO C99, 7.4p1:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
This means that the following code is unsafe:
int upper(const char *s, size_t index) {
    return toupper(s[index]);
}
If this code is executed on an implementation where char has the same value space as signed char and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:
int upper(const char *s, size_t index) {
    return toupper((unsigned char) s[index]);
}
Nevertheless I see many examples in C++ that don't care about this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the above code will not lead to undefined behavior, or are all the examples wrong?
[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower]
For what it's worth, the Solaris Studio compilers (using stlport4) are one compiler suite that produces an unexpected result here. Compiling and running this:
#include <stdio.h>
#include <cctype>

int main() {
    char ch = '\xa1'; // '¡' in latin-1 locales + UTF-8
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}
gives me:
kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out
is whitespace: 8
For reference:
$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31
Of course, this behavior is as documented in the C++ standard, but it's definitely surprising.
EDIT: Since it was pointed out that the above version contained undefined behavior in the attempt to assign char ch = '\xa1' due to integer overflow, here's a version that avoids that and still retains the same output:
#include <stdio.h>
#include <cctype>

int main() {
    char ch = -95;
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}
And that does still print 8 on my Solaris VM:
kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out
is whitespace: 8
EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace():
#include <cstdio>
#include <cstring>
#include <cctype>

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace(str[i])) // oops!
            count += 1;
    return count;
}

int main() {
    const char* batman = "I am batman\xa1";
    int n = std::strlen(batman);
    std::printf("%i\n", count_whitespace(batman, n));
    return 0;
}
And, on my Solaris machine:
kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3
Note that depending on how you permute this program, you'll probably get the expected result of two whitespace characters; that is, there is almost certainly some compiler optimization kicking in that takes advantage of this UB to give you the wrong result faster.
You could imagine this biting you in the face if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i] to unsigned char.
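The corrected count_whitespace for the program above differs only in that cast:

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace((unsigned char)str[i])) // value now representable as unsigned char
            count += 1;
    return count;
}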
Sometimes most people are wrong. I think that's so here. Having said that, there's nothing to stop a standard library implementor from defining the behaviour that most people expect. So maybe that's why most people don't care, since they've never actually seen a bug resulting from this error.
The history behind the char type is that it was originally the type used to describe 7-bit ASCII characters. At the same time, C lacked a separate 8 bit integer type. So in the pre-standard days of the eighties, some compilers made char unsigned - since it doesn't make sense to have negative indices in a symbol table, while other compilers made char signed, to make it consistent with all the other integer types.
When the time came to standardize C, both versions existed. Unfortunately, the committee decided to let it remain that way, leaving the decision to the compiler. Instead they added two other types: signed char and unsigned char. signed char is part of the signed integer types, unsigned char is part of the unsigned integer types, and char is part of neither, though it must have the same representation as either signed char or unsigned char. (This is all described in C11 6.2.5)
Notably, char never was anything but 8 bits on all known implementations, save for some exotic oddball DSPs that worked with 16-bit bytes. When "extended" symbol tables were used, either the implementation changed from 7 to 8 bit characters, or wchar_t was used. Please note that wchar_t has been in the C language since the beginning, so assuming that char was at some point used for things like UTF-8 is probably incorrect (though theoretically possible).
Now if char is signed and you store a value larger than CHAR_MAX or smaller than CHAR_MIN inside it, you invoke undefined behavior, as per C11 6.5 §5. Period. So if you have an array of char and any item inside it violates the type boundaries, you have undefined behavior there already. Even though character types have no trap representations, undefined behavior could cause the code to misbehave in other ways, such as incorrect optimizations.
The ctype.h functions allow EOF as parameter, but should otherwise behave as if working with character types, even though the parameter is int to allow EOF. The text from 7.4 §1 is mostly saying that "if you pass some random int to this function, which is neither of the same representation as a char, nor EOF, the behavior is undefined".
But if you pass a char where you have already invoked signed integer overflow/underflow, you already have undefined behavior even before calling the function - this has nothing to do with the ctype.h functions or any other function. Thus your assumption that the posted "upper" function is unsafe is incorrect - this code is no different from any other code using the char type.
An example of undefined behavior caused by the cited ctype.h restrictions in 7.4 would rather be something like toupper(666).

Worst side effects from char's signedness (explanation of signedness effects on chars and casts)

I frequently work with libraries that use char when working with bytes in C++. The alternative is to define a "Byte" as unsigned char, but that's not the standard they decided to use. I frequently pass bytes from C# into the C++ DLLs and cast them to char to work with the library.
When casting ints to chars or chars to other simple types what are some of the side effects that can occur. Specifically, when has this broken code that you have worked on and how did you find out it was because of the char signedness?
Luckily I haven't run into this in my own code; I used a signed-char casting trick back in an embedded systems class in school. I'm looking to better understand the issue since I feel it is relevant to the work I am doing.
One major risk is if you need to shift the bytes. A signed char keeps the sign-bit when right-shifted, whereas an unsigned char doesn't.
Here's a small test program:
#include <stdio.h>

int main(void)
{
    signed char a = -1;
    unsigned char b = 255;
    printf("%d\n%d\n", a >> 1, b >> 1);
    return 0;
}
It should print -1 and 127, even though a and b start out with the same bit pattern (given 8-bit chars, two's-complement and signed values using arithmetic shift).
In short, you can't rely on shift working identically for signed and unsigned chars, so if you need portability, use unsigned char rather than char or signed char.
The most obvious gotchas come when you need to compare the numeric value of a char with a hexadecimal constant when implementing protocols or encoding schemes.
For example, when implementing telnet you might want to do this.
// Check for IAC (hex FF) byte
if (ch == 0xFF)
{
// ...
Or when testing for UTF-8 multi-byte sequences.
if (ch >= 0x80)
{
// ...
Fortunately these errors don't usually survive very long as even the most cursory testing on a platform with a signed char should reveal them. They can be fixed by using a character constant, converting the numeric constant to a char or converting the character to an unsigned char before the comparison operator promotes both to an int. Converting the char directly to an unsigned won't work, though.
if (ch == '\xff') // OK
if ((unsigned char)ch == 0xff) // OK, so long as char has 8-bits
if (ch == (char)0xff) // Usually OK, relies on implementation defined behaviour
if ((unsigned)ch == 0xff) // still wrong
I've been bitten by char signedness in writing search algorithms that used characters from the text as indices into state trees. I've also had it cause problems when expanding characters into larger types, and the sign bit propagates causing problems elsewhere.
I found out when I started getting bizarre results and segfaults arising from searching texts other than the ones I'd used during the initial development (obviously characters with values >127 or <0 are going to cause this, and they won't necessarily be present in your typical text files).
Always check a variable's signedness when working with it. Generally now I make types signed unless I have a good reason otherwise, casting when necessary. This fits in nicely with the ubiquitous use of char in libraries to simply represent a byte. Keep in mind that the signedness of char is not defined (unlike the other integer types), so you should give it special treatment and be mindful.
The one that most annoys me:
typedef char byte;
byte b = 12;
cout << b << endl; // prints the character with code 12 (form feed), not "12"
Sure it's cosmetics, but arrr...
When casting ints to chars or chars to other simple types
The critical point is that casting a signed value from one primitive type to another (larger) type does not retain the bit pattern (assuming two's complement). A signed char with bit pattern 0xff is -1, while a signed short with the decimal value -1 is 0xffff. Casting an unsigned char with value 0xff to an unsigned short, however, yields 0x00ff. Therefore, always think of proper signedness before you typecast to a larger or smaller data type. Never carry unsigned data in signed data types if you don't need to; if an external library forces you to do so, do the conversion as late as possible (or as early as possible if the external code acts as a data source).
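A short demonstration of that point (a sketch; assumes two's complement and 8-bit chars):

#include <cstdio>

int main() {
    signed char   sc = -1;    // bit pattern 0xFF
    unsigned char uc = 0xFF;  // same bit pattern, value 255
    unsigned short from_signed   = (unsigned short)(short)sc; // sign-extended
    unsigned short from_unsigned = uc;                        // zero-extended
    std::printf("%04hX %04hX\n", from_signed, from_unsigned); // FFFF 00FF
}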
The C and C++ language specifications define 3 data types for holding characters: char, signed char and unsigned char. The latter 2 have been discussed in other answers. Let's look at the char type.
The standard(s) say that the char data type may be signed or unsigned; which one is an implementation decision. This means that some compilers, or versions of compilers, can implement char differently. The implication is that the char data type is not well suited to arithmetic or Boolean operations. For arithmetic and Boolean operations, the signed and unsigned versions of char will work fine.
In summary, there are three versions of the char data type. The char data type performs well for holding characters, but is not suited for arithmetic across platforms and translators, since its signedness is implementation-defined.
You will fail miserably when compiling for multiple platforms because the C++ standard doesn't define char to be of a certain "signedness".
Therefore GCC introduces the -fsigned-char and -funsigned-char options to force certain behavior; the GCC manual describes them in more detail.
EDIT:
As you asked for examples of broken code: there are plenty of possibilities to break code that processes binary data. For example, imagine you process 8-bit audio samples (range -128 to 127) and you want to halve the volume. Now imagine this scenario (in which the naive programmer assumes char == signed char):
char sampleIn;
// If the sample is -1 (= almost silent), and the compiler treats char as unsigned,
// then the value of 'sampleIn' will be 255
read_one_byte_sample(&sampleIn);
// Ok, halve the volume. The value will be 127!
char sampleOut = sampleIn / 2;
// And write the processed sample to the output file, for example.
// (unsigned char)127 has the exact same bit pattern as (signed char)127,
// so this will write a sample with the loudest volume!!
write_one_byte_sample_to_output_file(&sampleOut);
I hope you like that example ;-) But to be honest, I've never really come across such problems, not even as a beginner, as far as I can remember...
Sign extension. The first version of my URL encoding function produced strings like "%FFFFFFA3".
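That is the classic symptom: a negative char is sign-extended when promoted to int, and a %X conversion then shows the extended bits. A sketch of the bug and the usual fix (the byte 0xA3 is '£' in Latin-1):

#include <cstdio>

int main() {
    char c = (char)0xA3;                        // negative if char is signed
    std::printf("%%%X\n", c);                   // typically prints %FFFFFFA3
    std::printf("%%%02X\n", (unsigned char)c);  // prints %A3
}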