What is the use of unsigned char stream.
If I am having an unsigned character stream as below , how can i print the values inside that stream.
For an ordinary character stream or string i can print it using %s, but what about this.
Thanks in advance.
unsigned char *ptr ="(some value)"
When it comes to string handling, there is no notable difference between unsigned char and plain char. The latter can be either signed or unsigned, it is implementation-defined. However, all symbol tables always have positive index. So when you are dealing with strings, you can always safely cast between unsigned char and char. (Pointer casts between the two types are formally poorly defined behavior by the standard, but in reality, it will always work on any system ever made.)
Thus printf("%s", ptr); will work fine.
An unsigned char array usually represents raw byte data and if you want to output that as a string by using %s, you'll probably get gibberish characters, that'll be of no meaning to you.
If you really wish to see the values, then i recommend conversion of a single byte to short and then output it with a %i.
Just my opinion.
Related
I have written the following code to test if the given input is a digit or not.
#include<iostream>
#include<ctype.h>
#include<stdio.h>
using namespace std;
main()
{
char c;
cout<<"Please enter a digit: ";
cin>>c;
if(isdigit(c)) //int isdigit(int c) or char isdigit(char c)
{
cout<<"You entered a digit"<<endl;
}
else
{
cout<<"You entered a non-digit value"<<endl;
}
}
My question is: what should be the input variable type? char or int?
The situation is unfortunately a bit more complex than has been told by the other answers.
First of all: the first part of your code is correct (disregarding multiple-byte encodings); if you want to read a single char with cin, you'll have to use a char variable with >> operator.
Now, about isdigit: why does it take an int instead of a char?
It all comes from C; isdigit and its companion were born to be used along with functions like getchar(), which read a character from the stream and return an int. This in turn was done to provide the character and an error code: getchar() can return EOF (which is defined as some implementation-defined negative constant) through its return code to signify that the input stream has ended.
So, the basic idea is: negative = error code; positive = actual character code.
Unfortunately, this poses interoperability problems with "regular" chars.
Short digression: char ultimately is just an integral type with a very small range, but a particularly stupid one. In most occasions - when working with bytes or character codes - you'd want it to be unsigned by default; OTOH, for coherency reasons with other integral types (int, short, long, ...), you may say that the right thing would be that plain char should be signed. The Standard chose the most stupid way: plain char is either signed or unsigned, depending from whatever the implementor of the compiler decides1.
So, you have to be prepared for char being either signed or unsigned; in most implementations it's signed by default, which poses a problem with the getchar() arrangement above.
If char is used to read bytes and is signed it means that all bytes with the high bit set (AKA bytes that, read with an unsigned 8-bit type would be >127) turn out to be negative values. This obviously isn't compatible with the getchar() using negative values for EOF - there could be overlap between actual "negative" characters and EOF.
So, when C functions talk about receiving/providing characters into int variables the contract is always that the character is assumed to be a char that has been cast to an unsigned char (so that it is always positive, negative values overflowing into the top half of its range) and then put into an int. Which brings us back to the isdigit function, which, along its companion functions, has this contract as well:
The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
(C99, §7.4, ¶1)
So, long story short: your if should be at the very least:
if(isdigit((unsigned char)c))
The problem is not just a theoretical one: several widespread C library implementations use the provided value straight as an index into a lookup table, so negative values will read into unallocated memory and segfault your program.
Also, you are not taking into account the fact that the stream may be closed, and thus >> returning without touching your variable (which will be at an uninitialized value); to take this into account, you should check if the stream is still in a valid state before working on c.
Of course this is a bit of an unfair rant; as #Pete Becker noted in the comment below, it's not like they were all morons, but just that the standard mostly tried to be compatible with existing implementations, which were probably evenly split between unsigned and signed char. Traces of this split can be found in most modern compilers, which can generally change the signedness of char through command line options (-fsigned-char/-funsigned-char for gcc/clang, /J in VC++).
If you want to read a single character and check whether it is a digit or not then it should be char.
If you set it as int then multiple characters will be read and the result of isDigit will always be true.
I know that you can store integers up to like 100 and something in a char data type, but what about storing char data as in 'a' or 'b' in an int? I tried it and it seemed to work, but I'm not too sure if it's safe. Is it? Can I create an array of ints and use this array to store data in the form of 'x', 'b'...etc?
Yes; char is guaranteed to be smaller than int.
Characters such as 'a' and 'b' simply represent integer values in the char data type. int is at least as wide as char, that is, it represents a superset of its values. So yes, perfectly safe.
Usually you will have sizeof(int) > sizeof(char), because the standard library wants to reserve one value for EOF (end-of-file). Technically, this luxury is optional, or at least unreliable, and you should use eofbit and such for compatibility with esoteric systems.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
both the printf's output 𤭢 correctly, where f0 a4 ad a2 is the encoding for the Unicode code-point U+24B62 (𤭢) in hex.
Even memcpy also correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seems to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to be working as fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes (1.7) making up the object can be copied into
an array of char or unsigned char. If the content of the array of char
or unsigned char is copied back into the object, the object shall
subsequently hold its original value.
In view of the above fact and that my original example was on Intel machine where char defaults to signed char, am still not convinced if unsigned char should be preferred over char.
Anything else?
In C the unsigned char data type is the only data type that has all the following three properties simultaneously
it has no padding bits, that it where all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
if these are the properties of a "binary" data type you are looking for, you definitively should use unsigned char.
For the second property we need a type that is unsigned. For these all conversion are defined with modulo arihmetic, here modulo UCHAR_MAX+1, 256 in most 99% of the architectures. All conversion of wider values to unsigned char thereby just corresponds to truncation to the least significant byte.
The two other character types generally don't work the same. signed char is signed, anyhow, so conversion of values that don't fit it is not well defined. char is not fixed to be signed or unsigned, but on a particular platform to which your code is ported it might be signed even it is unsigned on yours.
You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
printf("good\n");
}
else
{
printf("bad\n");
}
can print "bad", because, depending on your compiler, c[0] will be sign extended to -1, which is not any way the same as 0xff
The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc, int is always guaranteed to be signed.
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, this is undefined behavior in C, though practically you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbols tables larger than 8 bits.
Signedness is always two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.
Bytes are usually intended as unsigned 8 bit wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.
If I add a bit shift operation to the code you wrote, then I will have an undefined behaviour. The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during the compilation: if the char is signed then you are trying to assign the value 0xf0, which cannot be represented in the signed char (range -128 to +127), so it will be casted to a signed value (-16).
Declaring the char as unsigned will remove the warning, and is always good to have a clean build without any warning.
The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.
Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which not type-safe. That is, printf takes it's formatting cues from the format string and not from the data type. You could just as easily tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with c++ iostreams, the result will be different (depending on the signed-ness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed specifies that the most significant bit of the data (for unsigned char the 8-th bit) represents the sign. Since you obviously do not need that, you should specify your data is unsigned (the "sign" bit represents data, not the sign of the other bits).
C++ Standard §3.9.1 Fundamental types
Objects declared as characters (char)
shall be large enough to store any
member of the implementation’s basic
character set. If a character from
this set is stored in a character
object, the integral value of that
character object is equal to the value
of the single character literal form
of that character. It is
implementation-defined whether a char
object can hold negative values.
Characters can be explicitly declared
unsigned or signed. Plain char, signed
char, and unsigned char are three
distinct types.<...>
I could not make sense of unsigned char.
A number may be +1 or -1.
I can not think -A and +A in similar manner.
What is the Historical reason of introducing unsigned char.
A char is actually an integral type. It is just that the type is also used to represent a character too. Since it is an integral type, it is valid to talk about signedness.
(I don't know exactly about the historical reason. Probably to save a keyword for byte by conflating it with char.)
In C (and thus C++), char does not mean character. It means a byte (int_least8_t). This is a historical legacy from the pre-Unicode days when a characters could actually fit in a char, but is now a flaw in the language.
Since char is really a small integer, having signed char and unsigned char makes sense. There are actually three distinct char types: char, signed char, and unsigned char. A common convention is that unsigned char represents bytes while plain char represents characters UTF-8 code units.
Computers do not "understand" the concept of alphabets or characters; they only work on numbers. So a bunch of people got together and agreed on what number maps to what letter. The most common one in use is ASCII (although the language does not guarantee that).
In ASCII, the letter A has the code 65. In environments using ASCII, the letter A would be represented by the number 65.
The char datatype also serves as an integral type - meaning that it can hold just numbers, so unsigned and signed was allowed. On most platforms I've seen, char is a single 8-bit byte.
You're reading too much in to it. A character is a small integral type that can hold a character. End of story. Unsigned char was never introduced or intended, it's just how it is, because char is an integral type identical to int or long or short, it's just the size that's different. The fact is that there's little reason to use unsigned char, but people do if they want one-byte unsigned integral storage.
If you want a small memory foot print and want to store a number than signed and unsigned char are usefull.
unsigned char is needed if you want to use a value between 128-255
unsigned char score = 232;
signed char is usfull if you want to store the difference between two characters.
signed char diff = 'D' - 'A';
char is distinct from the other two because you can not assume it is either.
You can use the the overflow from 255 to 0? (I don't know. Just a guess)
Maybe it is not only about characters but also about numbers between -128 and 127, and 0 to 255.
Think of the ASCII character set.
Historically, all characters used for text in computing were defined by the ASCII character set. Each character was represented by an 8 bit byte, which was unsigned, hence each character had a value in the range of 0 - 255.
The word character was reduced to char for coding.
An 8 bit char used the same memory as an 8 bit byte and as such they were interchangeable as far as a compiler was concerned.
The compiler directive unsigned (all numbers were signed by default as twos compliment is used to represent negative numbers in binary) when applied to a byte or a char forced them to have a value in the range 0-255.
If unsigned then then had a value of -128 - +127.
Nowadays with the advent of UNICODE and multiple byte character sets this relationship between byte and char no longer exists.
Lets us consider this snippet:
int s;
scanf("%c",&s);
Here I have used int, and not char, for variable s, now for using s for character conversion safely I have to make it char again because when scanf reads a character it only overwrites one byte of the variable it is assigning it to, and not all four that int has.
For conversion I could use s = (char)s; as the next line, but is it possible to implement the same by subtracting something from s ?
What you've done is technically undefined behaviour. The %c format calls for a char*, you've passed it an int* which will (roughly speaking) be reinterpreted. Even assuming that the pointer value is still good after reinterpreting, storing an arbitrary character to the first byte of an int and then reading it back as int is undefined behaviour. Even if it were defined, reading an int when 3 bytes of it are uninitialized, is undefined behaviour.
In practice it probably does something sensible on your machine, and you just get garbage in the top 3 bytes (assuming little-endian).
Writing s = (char)s converts the value from int to char and then back to int again. This is implementation-defined behaviour: converting an out-of-range value to a signed type. On different implementations it might clean up the top 3 bytes, it might return some other result, or it might raise a signal.
The proper way to use scanf is:
char c;
scanf("%c", &c);
And then either int s = c; or int s = (unsigned char)c;, according to whether you want negative-valued characters to result in a negative integer, or a positive integer (up to 255, assuming 8-bit char).
I can't think of any good reason for using scanf improperly. There are good reasons for not using scanf at all, though:
int s = getchar();
Are you trying to convert a digit to its decimal value? If so, then
char c = '8';
int n = c - '0';
n should 8 at this point.
That's probably not a good idea; GCC gives me a warning for that code:
main.c:10: warning: format ‘%c’ expects type ‘char *’, but
argument 2 has type ‘int *’
In this case you're ok since you're passing a pointer to more space than you need (for most systems), but what if you did it the other way around? Could be crash city. If you really want to do something like what you have there, just do the typecast or mask it - the mask will be endian-dependent.
As written this won't work reliably . The argument, &s, to scanf is a pointer to int and scanf is expecting a pointer to char. The two data type (int and char) have different sizes (at least on most architectures) so the data may get put in the wrong spot in memeory, and the other part of s may not get properly cleared.
The answers suggesting manipulation of the result after using a pointer to int rely on unspecified behavior (i.e. that scanf will put the character value it has in the least significant byte of the int you're pointing to), and are not safe.
Not but you could use the following:
s = s & 0xFF
That will blank out all of the data except the first byte. But in general all these ideas (and the ones above) are bad ideas, since not all systems store the lowest part of the integer in memory first. So if you ever have to port this code to a big endian system, you'll be screwed.
True, you may never have to port the code, but why write unportable code to begin with?
See this for more info:
http://en.wikipedia.org/wiki/Endianness