Integer to Character conversion in C - c++

Lets us consider this snippet:
int s;
scanf("%c",&s);
Here I have used int, and not char, for variable s, now for using s for character conversion safely I have to make it char again because when scanf reads a character it only overwrites one byte of the variable it is assigning it to, and not all four that int has.
For conversion I could use s = (char)s; as the next line, but is it possible to implement the same by subtracting something from s ?

What you've done is technically undefined behaviour. The %c format calls for a char*, you've passed it an int* which will (roughly speaking) be reinterpreted. Even assuming that the pointer value is still good after reinterpreting, storing an arbitrary character to the first byte of an int and then reading it back as int is undefined behaviour. Even if it were defined, reading an int when 3 bytes of it are uninitialized, is undefined behaviour.
In practice it probably does something sensible on your machine, and you just get garbage in the top 3 bytes (assuming little-endian).
Writing s = (char)s converts the value from int to char and then back to int again. This is implementation-defined behaviour: converting an out-of-range value to a signed type. On different implementations it might clean up the top 3 bytes, it might return some other result, or it might raise a signal.
The proper way to use scanf is:
char c;
scanf("%c", &c);
And then either int s = c; or int s = (unsigned char)c;, according to whether you want negative-valued characters to result in a negative integer, or a positive integer (up to 255, assuming 8-bit char).
I can't think of any good reason for using scanf improperly. There are good reasons for not using scanf at all, though:
int s = getchar();

Are you trying to convert a digit to its decimal value? If so, then
char c = '8';
int n = c - '0';
n should 8 at this point.

That's probably not a good idea; GCC gives me a warning for that code:
main.c:10: warning: format ‘%c’ expects type ‘char *’, but
argument 2 has type ‘int *’
In this case you're ok since you're passing a pointer to more space than you need (for most systems), but what if you did it the other way around? Could be crash city. If you really want to do something like what you have there, just do the typecast or mask it - the mask will be endian-dependent.

As written this won't work reliably . The argument, &s, to scanf is a pointer to int and scanf is expecting a pointer to char. The two data type (int and char) have different sizes (at least on most architectures) so the data may get put in the wrong spot in memeory, and the other part of s may not get properly cleared.
The answers suggesting manipulation of the result after using a pointer to int rely on unspecified behavior (i.e. that scanf will put the character value it has in the least significant byte of the int you're pointing to), and are not safe.

Not but you could use the following:
s = s & 0xFF
That will blank out all of the data except the first byte. But in general all these ideas (and the ones above) are bad ideas, since not all systems store the lowest part of the integer in memory first. So if you ever have to port this code to a big endian system, you'll be screwed.
True, you may never have to port the code, but why write unportable code to begin with?
See this for more info:
http://en.wikipedia.org/wiki/Endianness

Related

Append to String a Signed Int (Converted to Bytes) in Big Endian

I have a 4 byte integer (signed), and (i) I want to reverse the byte order, (ii) I want to store the bytes (i.e. the 4 bytes) as bytes of the string. I am working in C++. In order to reverse the byte order in Big Endian, I was using the ntohl, but I cannot use that due the fact that my numbers can be also negative.
Example:
int32_t a = -123456;
string s;
s.append(reinterpret_cast<char*>(reinterpret_cast<void*>(&a))); // int to byte
Further, when I am appending these data, it seems that I am appending 8 bytes instead of 4, why?
I need to use the append (I cannot use memcpy or something else).
Do you have any suggestion?
I was using the ntohl, but I cannot use that due the fact that my numbers can be also negative.
It's unclear why you think that negative number would be a problem. It's fine to convert negative numbers with ntohl.
s.append(reinterpret_cast<char*>(reinterpret_cast<void*>(&a)));
std::string::append(char*) requires that the argument points to a null terminated string. An integer is not null terminated (unless it happens to contain a byte that incidentally represents a null terminator character). As a result of violating this requirement, the behaviour of the program is undefined.
Do you have any suggestion?
To fix this bug, you can use the std::string::append(char*, size_type) overload instead:
s.append(reinterpret_cast<char*>(&a), sizeof a);
reinterpret_cast<char*>(reinterpret_cast<void*>
The inner cast to void* is redundant. It makes no difference.

Why is parameter to isdigit integer?

The function std::isdigit is:
int isdigit(int ch);
The return (Non-zero value if the character is a numeric character, zero otherwise.) smells like the function was inherited from C, but even that does not explain why the parameter type is int not char while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigitstakes an int not a char?
The reaons is to allow EOF as input. And EOF is (from here):
EOF integer constant expression of type int and negative value
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof char is always 1, and why CHAR_BITS isn't named BYTE_BITS.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could--in theory--use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not 256 values ([-128..127]) like you'd get from two's complement. If you were language lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char even though that was a fundamental assumption throughout the design of the language and its run-time library. I think C++ finally closed that apparent loophole in C++17 or C++20 by, in effect, requiring that a signed char use two's complement even if the larger integral types use sign+magnitude.
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for -1.
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x != EOF) { /* reached end of file */ }
else if (0 <= x && x < 128) { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
// Here it gets interesting.
bool b1 = isdigit(x); // OK
bool b2 = isdigit(static_cast<char>(x)); // NOT PORTABLE
bool b3 = isdigit(static_cast<unsigned char>(x)); // CORRECT!
}

isdigit(c) - a char or int type?

I have written the following code to test if the given input is a digit or not.
#include<iostream>
#include<ctype.h>
#include<stdio.h>
using namespace std;
main()
{
char c;
cout<<"Please enter a digit: ";
cin>>c;
if(isdigit(c)) //int isdigit(int c) or char isdigit(char c)
{
cout<<"You entered a digit"<<endl;
}
else
{
cout<<"You entered a non-digit value"<<endl;
}
}
My question is: what should be the input variable type? char or int?
The situation is unfortunately a bit more complex than has been told by the other answers.
First of all: the first part of your code is correct (disregarding multiple-byte encodings); if you want to read a single char with cin, you'll have to use a char variable with >> operator.
Now, about isdigit: why does it take an int instead of a char?
It all comes from C; isdigit and its companion were born to be used along with functions like getchar(), which read a character from the stream and return an int. This in turn was done to provide the character and an error code: getchar() can return EOF (which is defined as some implementation-defined negative constant) through its return code to signify that the input stream has ended.
So, the basic idea is: negative = error code; positive = actual character code.
Unfortunately, this poses interoperability problems with "regular" chars.
Short digression: char ultimately is just an integral type with a very small range, but a particularly stupid one. In most occasions - when working with bytes or character codes - you'd want it to be unsigned by default; OTOH, for coherency reasons with other integral types (int, short, long, ...), you may say that the right thing would be that plain char should be signed. The Standard chose the most stupid way: plain char is either signed or unsigned, depending from whatever the implementor of the compiler decides1.
So, you have to be prepared for char being either signed or unsigned; in most implementations it's signed by default, which poses a problem with the getchar() arrangement above.
If char is used to read bytes and is signed it means that all bytes with the high bit set (AKA bytes that, read with an unsigned 8-bit type would be >127) turn out to be negative values. This obviously isn't compatible with the getchar() using negative values for EOF - there could be overlap between actual "negative" characters and EOF.
So, when C functions talk about receiving/providing characters into int variables the contract is always that the character is assumed to be a char that has been cast to an unsigned char (so that it is always positive, negative values overflowing into the top half of its range) and then put into an int. Which brings us back to the isdigit function, which, along its companion functions, has this contract as well:
The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
(C99, §7.4, ¶1)
So, long story short: your if should be at the very least:
if(isdigit((unsigned char)c))
The problem is not just a theoretical one: several widespread C library implementations use the provided value straight as an index into a lookup table, so negative values will read into unallocated memory and segfault your program.
Also, you are not taking into account the fact that the stream may be closed, and thus >> returning without touching your variable (which will be at an uninitialized value); to take this into account, you should check if the stream is still in a valid state before working on c.
Of course this is a bit of an unfair rant; as #Pete Becker noted in the comment below, it's not like they were all morons, but just that the standard mostly tried to be compatible with existing implementations, which were probably evenly split between unsigned and signed char. Traces of this split can be found in most modern compilers, which can generally change the signedness of char through command line options (-fsigned-char/-funsigned-char for gcc/clang, /J in VC++).
If you want to read a single character and check whether it is a digit or not then it should be char.
If you set it as int then multiple characters will be read and the result of isDigit will always be true.

Why is memset's value int and not uint8_t? [duplicate]

Why does memset take an int as the second argument instead of a char, whereas wmemset takes a wchar_t instead of something like long or long long?
memset predates (by quite a bit) the addition of function prototypes to C. Without a prototype, you can't pass a char to a function -- when/if you try, it'll be promoted to int when you pass it, and what the function receives is an int.
It's also worth noting that in C, (but not in C++) a character literal like 'a' does not have type char -- it has type int, so what you pass will usually start out as an int anyway. Essentially the only way for it to start as a char and get promoted is if you pass a char variable.
In theory, memset could probably be modified so it receives a char instead of an int, but there's unlikely to be any benefit, and a pretty decent possibility of breaking some old code or other. With an unknown but potentially fairly high cost, and almost no chance of any real benefit, I'd say the chances of it being changed to receive a char fall right on the line between "slim" and "none".
Edit (responding to the comments): The CHAR_BIT least significant bits of the int are used as the value to write to the target.
Probably the same reason why the functions in <ctypes.h> take ints and not chars.
On most platforms, a char is too small to be pushed on the stack by itself, so one usually pushes the type closest to the machine's word size, i.e. int.
As the link in #Gui13's comment points out, doing that also increases performance.
See fred's answer, it's for performance reasons.
On my side, I tried this code:
#include <stdio.h>
#include <string.h>
int main (int argc, const char * argv[])
{
char c = 0x00;
printf("Before: c = 0x%02x\n", c);
memset( &c, 0xABCDEF54, 1);
printf("After: c = 0x%02x\n", c);
return 0;
}
And it gives me this on a 64bits Mac:
Before: c = 0x00
After: c = 0x54
So as you see, only the last byte gets written. I guess this is dependent on the architecture (endianness).

array of char is equal to int?

I am trying to figure out why an array of char is assigned to a int value, now I am a little confused in using cast operator.
I didn't get what is in do statement, I hope somebody can explain
char *readword()
{
int c,i;
char t[255];
char *p;
//jump over chars who aren't letters
while ((c=getchar())<'A'|| (c>'Z' && c<'a') || c>'z')
if (c==EOF) return 0;
i=0;
do {
t[i++]=c;// shouldn't be like (char)c
} while ((c=getchar())>='A' && c<='Z' || c>='a' && c<='z');
//keep the word in heap memory
if ( c==EOF)
return 0;
t[i++]='\0';
if ((p=(char *)malloc(i))==0)
{
printf(" not enough memory\n");
exit(1);
}
strcpy(p,t);
return p;
}
The getchar() function returns an int type; and it is important to use an int to capture the getchar() return value. This is due to if getchar() fails, it returns an (int)(EOF)(as per chux comment. When it successfully returns, it will return a value that is suitable for a char.
The question code is building a char string or array, one char at a time:
t[i++]=c;
The above line could be written:
t[i++]=(char)c;
Either is suitable due to the compiler automatically converting the first case.
The mixture of char and int is fairly simple: EOF is intended as a file that can be distinguished from any value you could have read from the file.
To support that, you need to initially read the data from the file into something larger than a char, so it can accommodate at least one value that couldn't possibly have come from the file. The type they chose for that purpose was int.
So, you read a character from the file, into an int. You compare that to EOF to see if it's really a character that came from the file or not. If (and only if) you verify that it really came from the file, you save the value into a char, because you now know that's what it really represents.
That said, I'd consider it pretty poor code as it stands right now. Just for one particularly obvious example, instead of the c<'A'|| (c>'Z' && c<'a') || c>'z') type of code, you almost certainly want to use isalpha(c) instead.
It's also a lot easier to do this with scanf instead.
You can assign any int value to a char. Only the lowest 8 bits will be used. A cast would be more "correct" in terms of communicating your intent - people might not otherwise remember that anything larger than an 8-bit value will get truncated and results are likely to be unexpected.
Note that since you didn't say "unsigned char t[255]" that you actually get 7 bits and the most significant (8th) bit will be interpreted as a sign. So for example if you were to say
char t = 0xFF;
then you would in fact get -1 assigned to t.
If you assign numbers > 0xFF then all bits higher than the 8th bit will get stripped. So if you were to say:
char t = 0x101;
Than in fact you'd get the value 1 assigned to t.
The code in question is correct because getchar() returns an int and -1 is an error value so it's important to check it. For non-error cases the return will fit in an 8-bit char.