The function std::isdigit is:
int isdigit(int ch);
The return (Non-zero value if the character is a numeric character, zero otherwise.) smells like the function was inherited from C, but even that does not explain why the parameter type is int not char while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigit takes an int, not a char?
The reason is to allow EOF as input. And EOF is (from here):
EOF integer constant expression of type int and negative value
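A minimal sketch of why the int parameter is convenient. It assumes std::istream::get() as a stand-in for getc(); for char streams it also returns an int that is either an unsigned char value or EOF, so the result can be handed to std::isdigit without any cast. The name count_digits is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts decimal digits in a stream.
// in.get() yields an int holding either an unsigned char value or EOF,
// which is exactly the domain std::isdigit accepts.
std::size_t count_digits(std::istream& in) {
    std::size_t n = 0;
    for (int ch = in.get(); ch != EOF; ch = in.get()) {
        if (std::isdigit(ch)) ++n;
    }
    return n;
}

// Convenience overload for an in-memory string.
std::size_t count_digits(const std::string& text) {
    std::istringstream in(text);
    return count_digits(in);
}
```

Passing EOF itself to std::isdigit is also well defined and reports "not a digit", so a loop does not strictly need to test for EOF before classifying.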
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof(char) is always 1, and why CHAR_BIT isn't named BYTE_BIT.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could--in theory--use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not 256 values ([-128..127]) like you'd get from two's complement. If you were language lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char even though that was a fundamental assumption throughout the design of the language and its run-time library. C++20 finally closed that apparent loophole by requiring two's complement for all signed integer types, so an 8-bit signed char now covers [-128..127].
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for a negative int (typically -1).
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x == EOF) { /* reached end of file */ }
else if (0 <= x && x < 128) { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
// Here it gets interesting.
bool b1 = isdigit(x); // OK
bool b2 = isdigit(static_cast<char>(x)); // NOT PORTABLE
bool b3 = isdigit(static_cast<unsigned char>(x)); // CORRECT!
}
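The "CORRECT!" line can be packaged once so that callers holding a plain char never have to remember the cast. This is only a sketch; is_digit is a hypothetical name, not a standard function.

```cpp
#include <cctype>

// Hypothetical wrapper: safe to call with any char, signed or not.
bool is_digit(char c) {
    // Convert to unsigned char first so the value handed to std::isdigit
    // is always in [0, UCHAR_MAX] and never a stray negative number.
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}
```

With this in place, is_digit('7') is true and is_digit('\xE9') is simply false in the default "C" locale, rather than undefined behavior on platforms where char is signed.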
I want to write a function
int char_to_int(char c);
that converts given char to int by zero extending the value. So if the char has N bits and int has M bits, M >= N, then the M-N most significant bits of the int value should be zero and the N least significant bits of the int value should match the bits of the char value.
This seems like a simple task, but I'm not sure how to write it relying only on standard behavior. No UB, no implementation-defined behavior. Without relying on char being 8 bit, int being 32 bit, char being unsigned and any other common assumptions I make that are not guaranteed by standard.
The reason I want to know this is that I have done this conversion several times in the past, but recently I became aware of the limited guarantees C++ gives about its data types. So now I'm curious what the correct, standard-compliant approach is.
I don't suppose
return (int) c;
is good enough, is it?
There's no hurt in being extra clear:
return int((unsigned char)c);
That way you tell the compiler exactly what you want: the int that contains the char value, read as unsigned. So char 255 will become int 255.
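Written out as the complete function from the question, the idea might look like the sketch below. The static_assert spells out the one assumption the cast relies on: every unsigned char value must fit in an int, which holds on mainstream platforms but is not guaranteed on exotic ones where sizeof(int) == 1.

```cpp
#include <climits>

// Assumption made explicit: int can represent every unsigned char value.
static_assert(UCHAR_MAX <= INT_MAX, "unsigned char values must fit in int");

// Zero-extends a char: the result is always in [0, UCHAR_MAX].
int char_to_int(char c) {
    // char -> unsigned char is a well-defined modular conversion;
    // unsigned char -> int then preserves the value.
    return static_cast<unsigned char>(c);
}
```

On a platform with 8-bit bytes, char_to_int('\xFF') yields 255 whether plain char is signed or unsigned.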
I have written the following code to test if the given input is a digit or not.
#include<iostream>
#include<ctype.h>
#include<stdio.h>
using namespace std;
int main()
{
char c;
cout<<"Please enter a digit: ";
cin>>c;
if(isdigit(c)) //int isdigit(int c) or char isdigit(char c)
{
cout<<"You entered a digit"<<endl;
}
else
{
cout<<"You entered a non-digit value"<<endl;
}
}
My question is: what should be the input variable type? char or int?
The situation is unfortunately a bit more complex than has been told by the other answers.
First of all: the first part of your code is correct (disregarding multiple-byte encodings); if you want to read a single char with cin, you'll have to use a char variable with >> operator.
Now, about isdigit: why does it take an int instead of a char?
It all comes from C; isdigit and its companions were born to be used along with functions like getchar(), which read a character from the stream and return an int. This in turn was done to provide both the character and an error code: getchar() can return EOF (which is defined as some implementation-defined negative constant) through its return value to signify that the input stream has ended.
So, the basic idea is: negative = error code; positive = actual character code.
Unfortunately, this poses interoperability problems with "regular" chars.
Short digression: char ultimately is just an integral type with a very small range, but a particularly stupid one. On most occasions - when working with bytes or character codes - you'd want it to be unsigned by default; OTOH, for consistency with the other integral types (int, short, long, ...), you may say that the right thing would be for plain char to be signed. The Standard chose the most stupid way: plain char is either signed or unsigned, depending on whatever the implementor of the compiler decides [1].
So, you have to be prepared for char being either signed or unsigned; in most implementations it's signed by default, which poses a problem with the getchar() arrangement above.
If char is used to read bytes and is signed it means that all bytes with the high bit set (AKA bytes that, read with an unsigned 8-bit type would be >127) turn out to be negative values. This obviously isn't compatible with the getchar() using negative values for EOF - there could be overlap between actual "negative" characters and EOF.
So, when C functions talk about receiving/providing characters in int variables, the contract is always that the character is assumed to be a char that has been cast to an unsigned char (so that it is always non-negative, with negative values wrapping around into the top half of its range) and then put into an int. Which brings us back to the isdigit function, which, along with its companion functions, has this contract as well:
The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
(C99, §7.4, ¶1)
So, long story short: your if should be at the very least:
if(isdigit((unsigned char)c))
The problem is not just a theoretical one: several widespread C library implementations use the provided value directly as an index into a lookup table, so negative values read outside the table and can crash your program or silently return garbage.
Also, you are not taking into account the fact that the stream may be closed, and thus >> returning without touching your variable (which will be at an uninitialized value); to take this into account, you should check if the stream is still in a valid state before working on c.
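Putting both points together (check the stream, then cast), a sketch of the corrected logic could look like this. The Input enum and classify_first_char are hypothetical names introduced for illustration only.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

enum class Input { Digit, NonDigit, NoInput };

// Reads one non-whitespace character and classifies it.
Input classify_first_char(std::istream& in) {
    char c = '\0';
    if (!(in >> c)) {
        return Input::NoInput;   // extraction failed, so c holds no input
    }
    // Cast to unsigned char to satisfy the <ctype.h> contract.
    return std::isdigit(static_cast<unsigned char>(c)) ? Input::Digit
                                                       : Input::NonDigit;
}

// Convenience overload for an in-memory string.
Input classify_first_char(const std::string& text) {
    std::istringstream in(text);
    return classify_first_char(in);
}
```

In the original program, the call would be classify_first_char(std::cin), with the NoInput case covering a closed or exhausted stream.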
[1] Of course this is a bit of an unfair rant; as @Pete Becker noted in the comment below, it's not like they were all morons, but just that the standard mostly tried to be compatible with existing implementations, which were probably evenly split between unsigned and signed char. Traces of this split can be found in most modern compilers, which can generally change the signedness of char through command line options (-fsigned-char/-funsigned-char for gcc/clang, /J in VC++).
If you want to read a single character and check whether it is a digit, then the variable should be a char.
If you declare it as an int, operator>> parses a whole number rather than a single character, and isdigit would then test that number as if it were a character code, which is not what you want.
The design here just seems "not of one mind", because 16-bit integer data and 16-bit character data are now distinguishable, but 8-bit integer data and 8-bit character data are not.
In C++ the only choice for 8-bit values has always been a 'char'. Recognizing wchar_t as an official type, distinct from unsigned short, enables improvements, but only for wide-string users. This does not seem coordinated; the language acts differently for 8-bit and 16-bit values.
I think there is clear value in having more distinct types; having a distinct 8-bit char and an 8-bit "byte" would be much nicer, e.g. for operator overloading. For example:
// This kind of sucks...
BYTE m = 59; // This is really 'unsigned char' because there is no other option
cout << m; // outputs character data ";" because it assumes 8-bits is char data.
// This is a consequence of limited ability to overload
// But for wide strings, the behavior is different and better...
unsigned short s = 59;
wcout << s; // Prints the number "59" like we expect
wchar_t w = L'C';
wcout << w; // Prints out "C" like we expect
The language would be more consistent if there were a new 8-bit integer type introduced, which would enable more intelligent overloads and overloads that behave more similarly irrespective of if you are using narrow or wide strings.
Yes, probably, but using single-byte integers that aren't characters is pretty rare and you can trivially get around your stated problem via integral promotion (try applying a unary + and see what happens).
It's also worth noting that your premise is flawed: wchar_t and unsigned short have always been distinct types, per paragraph 3.9.1/5 in C++98, C++03, C++11 and C++14.
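The unary + workaround mentioned above can be sketched as follows; byte_to_text is a hypothetical helper that writes to a std::ostringstream instead of cout so the result is easy to inspect.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats a byte as a number, not as a character.
std::string byte_to_text(unsigned char b) {
    std::ostringstream out;
    out << +b;   // unary + promotes b to int, selecting the numeric overload
    return out.str();
}
```

Here byte_to_text(59) produces the text 59, whereas streaming the unsigned char directly would emit the character with code 59 (";" in ASCII).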
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
both the printf's output 𤭢 correctly, where f0 a4 ad a2 is the encoding for the Unicode code-point U+24B62 (𤭢) in hex.
Even memcpy also correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seem to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to be working as fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes (1.7) making up the object can be copied into
an array of char or unsigned char. If the content of the array of char
or unsigned char is copied back into the object, the object shall
subsequently hold its original value.
In view of the above, and given that my original example ran on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.
Anything else?
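The guarantee quoted from N3337 can be exercised with a small round trip through a byte array. This is a sketch only; round_trip is a hypothetical helper, and it additionally assumes T is default-constructible.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical helper: copies an object's bytes out to an array and back.
template <typename T>
T round_trip(const T& x) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "the guarantee only covers trivially copyable types");
    unsigned char bytes[sizeof(T)];          // raw storage for the object bytes
    std::memcpy(bytes, &x, sizeof(T));       // object -> bytes
    T copy;
    std::memcpy(&copy, bytes, sizeof(T));    // bytes -> object
    return copy;
}
```

The buffer could equally be an array of plain char for this purpose; the quoted paragraph covers both, which is why pure copying never exposes the signedness problem.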
In C the unsigned char data type is the only data type that has all the following three properties simultaneously
it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.
For the second property we need a type that is unsigned. For these, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on the vast majority of architectures. Any conversion of a wider value to unsigned char thereby just corresponds to truncation to the least significant byte.
The two other character types generally don't work the same. signed char is signed, so converting a value that doesn't fit into it is implementation-defined. char is not fixed to be signed or unsigned, so on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.
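The modulo rule for unsigned char can be seen in a one-line helper. This is a sketch; low_byte is a hypothetical name.

```cpp
#include <climits>

// Hypothetical helper: keeps only the least significant byte of a value.
unsigned char low_byte(unsigned long value) {
    // Conversion to an unsigned type is defined as value modulo (UCHAR_MAX + 1).
    return static_cast<unsigned char>(value);
}
```

With 8-bit bytes, low_byte(0x1234) is 0x34, and the result is the same on every conforming implementation because no signedness is involved.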
You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
printf("good\n");
}
else
{
printf("bad\n");
}
This can print "bad" because, depending on your compiler, c[0] is sign-extended to -1 when it is promoted to int, which does not compare equal to 0xff (255).
The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined. This makes char different from int and the other integer types; int is always guaranteed to be signed.
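One portable way to make such a comparison behave, sketched with a hypothetical helper named byte_equals, is to convert the char to unsigned char before comparing:

```cpp
// Hypothetical helper: compares the byte stored in c with an expected value.
bool byte_equals(char c, unsigned char expected) {
    // Both operands are now non-negative before promotion to int,
    // so a stored 0xff compares equal to 0xff regardless of char's signedness.
    return static_cast<unsigned char>(c) == expected;
}
```

Applied to the snippet above, the condition would become byte_equals(c[0], 0xff), which takes the "good" branch on both signed-char and unsigned-char platforms with 8-bit bytes.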
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, the result of such a conversion is implementation-defined in C, though in practice you'd just get weird output if you printed the resulting (signed) char as an integer value.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
Signedness is always two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.
Bytes are usually intended as unsigned 8 bit wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.
If I add a bit shift operation to the code you wrote, its result becomes implementation-defined when char is signed. The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during compilation: if char is signed then you are trying to assign the value 0xf0, which cannot be represented in a signed char (range -128 to +127), so it will be converted to a negative value (-16).
Declaring the array as unsigned char will remove the warning, and it is always good to have a clean build without any warnings.
The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.
Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which is not type-safe. That is, printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with c++ iostreams, the result will be different (depending on the signed-ness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed specifies that the most significant bit of the data (for unsigned char the 8-th bit) represents the sign. Since you obviously do not need that, you should specify your data is unsigned (the "sign" bit represents data, not the sign of the other bits).
I frequently work with libraries that use char when working with bytes in C++. The alternative is to define a "Byte" as unsigned char, but that is not the standard they decided to use. I frequently pass bytes from C# into the C++ DLLs and cast them to char to work with the library.
When casting ints to chars or chars to other simple types, what are some of the side effects that can occur? Specifically, when has this broken code that you have worked on, and how did you find out it was because of the char signedness?
Luckily I haven't run into this in my code; I used a signed char casting trick back in an embedded systems class in school. I'm looking to better understand the issue since I feel it is relevant to the work I am doing.
One major risk is if you need to shift the bytes. A signed char keeps the sign-bit when right-shifted, whereas an unsigned char doesn't.
Here's a small test program:
#include <stdio.h>
int main (void)
{
signed char a = -1;
unsigned char b = 255;
printf("%d\n%d\n", a >> 1, b >> 1);
return 0;
}
It should print -1 and 127, even though a and b start out with the same bit pattern (given 8-bit chars, two's-complement and signed values using arithmetic shift).
In short, you can't rely on shift working identically for signed and unsigned chars, so if you need portability, use unsigned char rather than char or signed char.
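A typical place where this matters is assembling a multi-byte integer from a char buffer. The sketch below assumes 8-bit bytes and big-endian input; read_be32 is a hypothetical helper.

```cpp
#include <cstdint>

// Hypothetical helper: reads a 32-bit big-endian integer from four bytes.
std::uint32_t read_be32(const char* p) {
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        // Go through unsigned char so a byte such as 0xF0 contributes 0xF0,
        // not a sign-extended negative value that sets all the upper bits.
        result = (result << 8) | static_cast<unsigned char>(p[i]);
    }
    return result;
}
```

Without the cast, a platform with signed char would OR a negative int into the accumulator for any byte above 0x7F and overwrite the bytes already shifted in.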
The most obvious gotchas come when you need to compare the numeric value of a char with a hexadecimal constant when implementing protocols or encoding schemes.
For example, when implementing telnet you might want to do this.
// Check for IAC (hex FF) byte
if (ch == 0xFF)
{
// ...
Or when testing for UTF-8 multi-byte sequences.
if (ch >= 0x80)
{
// ...
Fortunately these errors don't usually survive very long as even the most cursory testing on a platform with a signed char should reveal them. They can be fixed by using a character constant, converting the numeric constant to a char or converting the character to an unsigned char before the comparison operator promotes both to an int. Converting the char directly to an unsigned won't work, though.
if (ch == '\xff') // OK
if ((unsigned char)ch == 0xff) // OK, so long as char has 8-bits
if (ch == (char)0xff) // Usually OK, relies on implementation defined behaviour
if ((unsigned)ch == 0xff) // still wrong
I've been bitten by char signedness in writing search algorithms that used characters from the text as indices into state trees. I've also had it cause problems when expanding characters into larger types, and the sign bit propagates causing problems elsewhere.
I found out when I started getting bizarre results and segfaults from searching texts other than the ones I'd used during initial development (characters with values >127 or <0 will obviously cause this, and won't necessarily be present in your typical text files).
Always check a variable's signedness when working with it. Generally now I make types signed unless I have a good reason otherwise, casting when necessary. This fits in nicely with the ubiquitous use of char in libraries to simply represent a byte. Keep in mind that the signedness of char is implementation-defined (unlike that of other integer types), so give it special treatment and be mindful.
The one that most annoys me:
typedef char byte;
byte b = 12;
cout << b << endl;
Sure it's cosmetics, but arrr...
When casting ints to chars or chars to other simple types
The critical point is that casting a signed value from one primitive type to another (larger) type does not retain the bit pattern (assuming two's complement). A signed char with bit pattern 0xff is -1, while a signed short with the decimal value -1 is 0xffff. Casting an unsigned char with value 0xff to an unsigned short, however, yields 0x00ff. Therefore, always think about proper signedness before you typecast to a larger or smaller data type. Never carry unsigned data in signed data types if you don't need to - if an external library forces you to do so, do the conversion as late as possible (or as early as possible if the external code acts as data source).
The C and C++ language specifications define 3 data types for holding characters: char, signed char and unsigned char. The latter 2 have been discussed in other answers. Let's look at the char type.
The standard(s) say that the char data type may be signed or unsigned and is an implementation decision. This means that some compilers or versions of compilers, can implement char differently. The implication is that the char data type is not conducive for arithmetic or Boolean operations. For arithmetic and Boolean operations, signed and unsigned versions of char will work fine.
In summary, there are 3 versions of the char data type. The plain char type performs well for holding characters, but is not suited for arithmetic across platforms and translators since its signedness is implementation-defined.
You will fail miserably when compiling for multiple platforms because the C++ standard doesn't define char to be of a certain "signedness".
Therefore GCC introduces -fsigned-char and -funsigned-char options to force certain behavior. More on that topic can be found here, for example.
EDIT:
As you asked for examples of broken code, there are plenty of possibilities to break code that processes binary data. For example, imagine you process 8-bit audio samples (range -128 to 127) and you want to halve the volume. Now imagine this scenario (in which the naive programmer assumes char == signed char):
char sampleIn;
// If the sample is -1 (= almost silent), and the compiler treats char as unsigned,
// then the value of 'sampleIn' will be 255
read_one_byte_sample(&sampleIn);
// Ok, halve the volume. The value will be 127!
char sampleOut = sampleIn / 2;
// And write the processed sample to the output file, for example.
// (unsigned char)127 has the exact same bit pattern as (signed char)127,
// so this will write a sample with the loudest volume!!
write_one_byte_sample_to_output_file(&sampleOut);
I hope you like that example ;-) But to be honest I've never really come across such problems, not even as a beginner as far as I can remember...
Hope this answer is sufficient for you downvoters. What about a short comment?
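For contrast, a sketch of the same halving step with the signedness spelled out; halve_sample is a hypothetical helper, and the explicit signed char removes the dependence on how the compiler treats plain char.

```cpp
// Hypothetical helper: halves an 8-bit signed audio sample.
signed char halve_sample(signed char sample) {
    // sample is promoted to int for the division; the quotient always
    // fits back into signed char, and division truncates toward zero.
    return static_cast<signed char>(sample / 2);
}
```

A sample of -1 now becomes 0 (silence) on every platform instead of turning into 127.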
Sign extension. The first version of my URL encoding function produced strings like "%FFFFFFA3".
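A plausible cause, assuming the original formatted a plain (signed) char with something like a %X conversion, is that the negative char was promoted to int before being printed. A sketch that avoids this by working on unsigned char values; percent_encode is a hypothetical helper, the unreserved set follows RFC 3986, and 8-bit bytes are assumed.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: percent-encodes everything except unreserved characters.
std::string percent_encode(const std::string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);  // 0..255, never negative
        if (std::isalnum(b) || b == '-' || b == '_' || b == '.' || b == '~') {
            out += c;                 // unreserved: copy through unchanged
        } else {
            out += '%';
            out += hex[b >> 4];       // high nibble
            out += hex[b & 0x0F];     // low nibble
        }
    }
    return out;
}
```

With this approach the byte 0xA3 is emitted as %A3, exactly two hex digits, whatever the signedness of char.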