Characters in C and C++

What I'm asking is: is my understanding below completely correct?
char x = 'b'; (x has one value that can be represented in two ways, one as an int and the other as a char, in both languages, C and C++)
in C
if I let the compiler choose, it will choose (will give priority to) the int value, because the C standard makes a character literal an int (gives priority to int, not char), as in this code:
#include <stdio.h>

int main() {
    printf("%zu", sizeof('b'));
    return 0;
}
output is: 4 (the size of int), meaning the compiler treats 'b' as 98 first and takes its size as an int
but I can use the char value if I choose, as in this code:
#include <stdio.h>

int main() {
    char x = 'b';
    printf("%c", x);
    return 0;
}
output is: b (char)
in C++
if I let the compiler choose, it will choose (will give priority to) the char value, because the C++ standard makes a character literal a char (gives priority to char, not int), as in this code:
#include <cstdio>

int main() {
    printf("%zu", sizeof('b'));
    return 0;
}
output is: 1 (the size of char), meaning the compiler treats 'b' as a char
but I can use the int value if I choose, as in this code:
#include <cstdio>

int main() {
    char x = 'b';
    printf("%d", x);
    return 0;
}
output is: 98 (98 is the int that represents b in ASCII)

Character constants like 'A' are of type int in C and of type char in C++. The compiler simply uses the type specified by the respective language standard.
Declared character variables of type char are always char and 1 byte large (typically 8 bits on non-exotic systems).
printf("%c", some_char); is a variadic function (accepts any number of parameters) and those have special implicit type promotion rules. Something called default argument promotion will integer-promote the passed character variable to int. Read about integer promotion here: Implicit type promotion rules.
printf expects that promotion to happen, so %c will mean that parameter is converted back to char according to the internals of printf. This holds true in C++ as well, though stdio.h should be avoided.
printf("%d", x); In case x is a char this would pedantically be undefined behavior. But in practice the above mentioned integer promotion is likely to occur and so it prints an integer. Also note that there's nothing magic with char as such, they are just 1 byte large integers. So it already has the value 98 before conversion.

Some of the things you've said are true, some are mistaken. Let's go through them in turn.
char x='b'; (x has two values, one of them int and the other char, in the two languages (C and C++))
Well, no, not really. When you say char x you get a variable generally capable of holding one character, and this is perfectly, equally true in both C and C++. The only difference, as we'll see, is the type (not the value) of the character constant 'b'.
in C if I let the compiler choose
I'm not sure what you mean by "choose", because there's not really any choice in the matter.
it will choose (will give priority to) the int value, like in this code
printf("%zu", sizeof('b'));
output is: 4 (int size), meaning the compiler changes b to 98 first and gets its size as an int
Not exactly. You got 4 because, yes, the type of a character constant like 'b' is int. In ASCII, the character constant 'b' has the value 98 no matter what. The compiler didn't change a character to an int here, and the value didn't change from 'b' to 98 here. It was an int all along (because that's the type of character constants in C), and it had the value 98 all along (because that's the value of the letter b in ASCII).
but I can use the char value if I choose like in this code
printf("%c", x);
Right. But there's nothing magical about that. Consider:
char c1 = 'b';
int i1 = 'b';
char c2 = 98;
int i2 = 98;
printf("%c %c %c %c\n", c1, i1, c2, i2);
printf("%d %d %d %d\n", c1, i1, c2, i2);
This prints
b b b b
98 98 98 98
You can print an int, or a char, using %c, and you'll get the character with that value.
You can print an int, or a char, using %d, and you'll get a numeric value.
(More on this later.)
in C++ if I let the compiler choose it will choose (will give priority to) the char value, like in this code
printf("%zu", sizeof('b'));
output is: 1 (char size), meaning the compiler did not change b to an int
What you're seeing is one of the big differences between C and C++: character constants like 'b' are type int in C, but type char in C++.
(Why is it this way? I'll speculate on that later.)
but I can use the int value if I choose like in this code
char x = 'b';
printf("%d", x);
output is 98 (98 is the int that represents b in ASCII)
Right. And this would work exactly the same in C and C++. (Also my earlier printfs of c1, i1, c2, and i2 would work exactly the same in C and C++.)
So why are the types of character constants different? I'm not sure, but I believe it's like this:
C likes to promote everything to int. (Incidentally, that's why we were able to pass characters straight to printf and print them using %d: the characters all get promoted to int before being passed to printf, so it works just fine.) So there would be no point having character constants of type char, because any time you used one, for anything, it would get promoted to int. So character constants might as well start out being int.
In C++, on the other hand, the type of things matters more. And you might have two overloaded functions, f(int) and f(char), and if you called f('b'), clearly you want the version of f() called that accepts a char. So in C++ there was a reason, a good reason, to have character constants be type char, just like it looks like they are.
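To make that concrete, here is a minimal sketch (the overloaded function f is hypothetical, invented just for illustration):
#include <iostream>

void f(int)  { std::cout << "f(int)\n"; }
void f(char) { std::cout << "f(char)\n"; }

int main() {
    f('b');   // in C++ 'b' has type char, so this calls f(char)
    f(98);    // 98 has type int, so this calls f(int)
    return 0;
}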
Addendum:
The fundamental issue that you're asking about here, that we've been kind of dancing around in this answer and these comments, is that in C (as in most languages) there are several forms of constant, that let you write constants in forms that are convenient and meaningful to you. For any different form of constant you can write, there are several things of interest, but most importantly what is the actual value? and what is the type?.
It may be easier to show this by example. Here is a rather large number of ways of representing the constant value 98 in a C or C++ program:
Form of constant   base   type       value
98                 10     int        98
0142               8      int        98
0x62               16     int        98
98.                10     double     98
9.8e1              10     double     98
98.f               10     float      98
9.8e1f             10     float      98
98L                10     long       98
98U                10     unsigned   98
'b'                ASCII  int/char   98
This table is not even complete; there are more ways than these to write constants in C and C++.
Every row in this table is equally true of both C and C++, except the last: in C, the type of a character constant is int, while in C++ it is char.
The type of a constant determines what happens when a constant appears in a larger expression. In that respect the type of a constant functions analogously to the type of a variable. If I write
int a = 1, b = 3;
double c = a / b;
it doesn't work right, because the rule in C is that when you divide an int by an int, you get truncating integer division. But the point is that the type of an operand directly determines the meaning of an expression. So the type of a constant becomes very interesting, too, as seen by the different behavior of these two lines:
double c2 = 1 / 3;
double c3 = 1. / 3;
Similarly, it can make a difference whether a constant has type int or type char. An expression that depends on whether a character constant has type int or type char will behave slightly differently in C versus C++. (In practice, pretty much the only difference that can be easily seen concerns sizeof.)
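A minimal sketch of that sizeof difference; the same file can be compiled as either C or C++ (the size 4 assumes a typical platform with a 4-byte int):
#include <stdio.h>

int main(void) {
    /* Prints 4 when compiled as C ('b' has type int),
       but 1 when compiled as C++ ('b' has type char). */
    printf("%zu\n", sizeof 'b');
    return 0;
}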
For completeness, it may be useful to look at the several other forms of character constant, in the same framework:
Form of constant   base   type       value
'b'                ASCII  int/char   98
'\142'             8      int/char   98
'\x62'             16     int/char   98

Related

std::format behaving differently between signed char / unsigned char

Is this behavior expected or per the standard (using the VC compiler)?
Example 1 (signed char):
char s = 'R';
std::cout << s << std::endl; // Prints R.
std::cout << std::format("{}\n", s); // Prints R.
Example 2 (unsigned char):
unsigned char u = 'R';
std::cout << u << std::endl; // Prints R.
std::cout << std::format("{}\n", u); // Prints 82.
In the second example with std::format, u is printed as 82 instead of R. Is this a bug or expected behavior?
Without using std::format, if just by std::cout, I get R in both examples.
This is intentional and specified as such in the standard.
Both char and unsigned char are fundamentally numeric types. Normally only char has the additional meaning of representing a character. For example there are no unsigned char string literals. If unsigned char is used, often aliased to std::uint8_t, then it is normally supposed to represent a numeric value (or a raw byte of memory, although std::byte is a better choice for that).
So it makes sense to choose a numeric interpretation for unsigned char and a character interpretation for char by default. In both cases that can be overridden, with {:c} as the specifier for a character interpretation and {:d} for a numeric interpretation.
I think operator<<'s behavior is the non-intuitive one, but that has been around for much longer and probably can't be changed.
Also note that signed char is a completely distinct type from both char and unsigned char, and that it is implementation-defined whether char is a signed or an unsigned integer type (but it is always distinct from both signed char and unsigned char).
If you used signed char it would also be interpreted as numeric by default for the same reason as unsigned char is.
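A minimal sketch of overriding the defaults with {:c} and {:d} (requires a C++20 standard library with <format>):
#include <format>
#include <iostream>

int main() {
    char c = 'R';
    unsigned char u = 'R';
    std::cout << std::format("{} {:d}\n", c, c);   // "R 82": char defaults to character output
    std::cout << std::format("{} {:c}\n", u, u);   // "82 R": unsigned char defaults to numeric output
    return 0;
}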
In the second example with std::format, it's printed as 82 instead of 'R'.
Is it an issue or per the standard?
This is behavior defined by the standard, according to [format.string.std]:
Type   Meaning
...    ...
c      Copies the character static_cast<charT>(value) to the output. Throws format_error if value is not in the range of representable values for charT.
d      to_chars(first, last, value).
...    ...
none   The same as d. [Note 8: If the formatting argument type is charT or bool, the default is instead c or s, respectively. — end note]
For integer types, if type options are not specified, then d will be the default. Since unsigned char is an integer type, it will be interpreted as an integer, and its value will be the value converted by std::to_chars.
(Except for charT and bool, whose default type options are c and s respectively.)

Why is parameter to isdigit integer?

The function std::isdigit is:
int isdigit(int ch);
The return value (non-zero if the character is a numeric character, zero otherwise) smells like the function was inherited from C, but even that does not explain why the parameter type is int rather than char, while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigit takes an int, not a char?
The reason is to allow EOF as input. And EOF is (from here):
EOF integer constant expression of type int and negative value
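For illustration, here is a minimal sketch of the classic pattern that relies on this; getchar returns an int for the same reason:
#include <cstdio>
#include <cctype>

int main() {
    int ch;                                  // int, not char, so it can also hold EOF
    while ((ch = std::getchar()) != EOF) {
        if (std::isdigit(ch))                // safe: ch is EOF (excluded) or an unsigned char value
            std::putchar(ch);                // echo only the digits
    }
    return 0;
}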
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof(char) is always 1, and why CHAR_BIT isn't named BYTE_BIT.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could, in theory, use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not the 256 values ([-128..127]) you'd get from two's complement. If you were language-lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char, even though that was a fundamental assumption throughout the design of the language and its run-time library. C++20 finally closed that apparent loophole by requiring two's complement for all signed integer types.
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for -1.
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x == EOF) { /* reached end of file */ }
else if (0 <= x && x < 128) { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
// Here it gets interesting.
bool b1 = isdigit(x); // OK
bool b2 = isdigit(static_cast<char>(x)); // NOT PORTABLE
bool b3 = isdigit(static_cast<unsigned char>(x)); // CORRECT!
}

Memory interpretation while casting primitives

In languages like C/C++, when we do:
char c = 'A';
We allocate memory to store number 65 in binary:
stuff_to_the_left_01000001_stuff_to_the_right
Then if we do:
int i = (int) c;
As I understand it, we're telling the compiler that it should interpret the bit pattern laid out as stuff_to_the_left_01000001__00000000_00000000_00000000_stuff_to_the_right, which may or may not turn out to be 65.
The same happens when we perform a cast during an operation
cout << (int) c << endl;
In all of the above, I got 'A' for character and 65 in decimal. Am I being lucky or am I missing something fundamental?
Casts in C do not reinterpret anything. They are value conversions. (int)c means take the value of c and convert it to int, which is a no-op on essentially all systems. (The only way it could fail to be a no-op is if the range of char is larger than the range of int, for example if char and int are both 32-bit but char is unsigned.)
If you want to reinterpret the representation (bit pattern) underlying a value, that value must first exist as an object (lvalue), not just the value of an expression (typically called "rvalue" though this language is not used in the C standard). Then you can do something like:
*(new_type *)&object;
However, except in the case where new_type is a character type, this invokes undefined behavior by violating the aliasing rules. C++ has a sort of "reinterpret cast" to do this which can presumably avoid breaking aliasing rules, but as I'm not familiar with C++, I can't provide you with good details on it.
In your C++ example, the reason you get different results is operator overloading. (int)'A' does not change the value or how it's interpreted; rather, the expression having a different type causes a different overload of the operator<< function to be called. In C, on the other hand, (int)'A' is always a no-op, because 'A' has type int to begin with in C.
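If the goal really is to inspect the representation, a portable way that doesn't break the aliasing rules is to copy the object's bytes into an unsigned char array. A minimal sketch (the byte order shown assumes a little-endian machine with a 4-byte int):
#include <cstdio>
#include <cstring>
#include <cstddef>

int main() {
    int i = 65;
    unsigned char bytes[sizeof i];
    std::memcpy(bytes, &i, sizeof i);                  // examining an object's bytes via unsigned char is always allowed
    for (std::size_t k = 0; k < sizeof i; ++k)
        std::printf("%02x ", (unsigned)bytes[k]);      // prints "41 00 00 00" on a little-endian machine
    std::printf("\n");
    return 0;
}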
Am I being lucky or am I missing something fundamental?
Yes, you are missing something fundamental: the compiler does not read the char from the memory as if the memory represented an int. Instead, it reads a char as a char, and then sign-extends the value to fit in an int, so char -1 becomes int -1 as well. Sign-extending means adding 1s or 0s to the left of the most significant byte being extended, depending on the sign bit of that number. Unsigned types are always padded by zeros*.
Sign extension is usually done in a register by executing a dedicated hardware instruction, so it runs very fast.
* As Eric Postpischil noted in a comment, char type may be signed or unsigned, depending on the C implementation.
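A minimal sketch showing both cases (assumes a two's complement system, which is where the 0xFF pattern arises):
#include <cstdio>

int main() {
    signed char sc = -1;        // bit pattern 0xFF on a two's complement machine
    unsigned char uc = 0xFF;    // the same bit pattern
    int a = sc;                 // sign-extended: a == -1
    int b = uc;                 // zero-extended: b == 255
    std::printf("%d %d\n", a, b);   // prints "-1 255"
    return 0;
}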
When you allocate a char, there's no such thing as stuff to the left or right. It's eight bits, nothing more. So when you then cast an eight-bit value to 32 bits, you still get 65:
0100.0001 to 0000.0000 0000.0000 0000.0000 0100.0001
No magic, no luck.
In your code, i has its own address and c has its own. The value is 'copied' from c to i.
As for (int) c, the same is done again, though the compiler does it for us, as follows:
|--- i ---|- c-|
0x01 0x02 0x03 0x04
+--------------------......
| 00 | 00 | 08 | 08 |......
+--------------------......
You would have been correct if this were pointer-based access, e.g.:
0x01 0x02 0x03
+---------------......
| 07 | 10 | 08 |......
+---------------......
int *p;
char c = 10;
p = (int *)&c;   // needs a cast; dereferencing p is undefined behavior
print(*p);       // not a real function, just something that can print
Here *p would have combined the values from memory addresses 0x02 and 0x03.
Well, the thing is that this behavior can change depending on the platform you're compiling for and the compiler you're using.
The ISO standard defines (int) to be a cast.
In this case, your compiler will interpret (int)c like static_cast<int>(c) // in C++
Now, you're lucky: your compiler treats (int) as a simple value conversion. That is common behavior for any C/C++ compiler, but there might be some evil, no-name C++ compilers that do a reinterpret cast on that one, ending up with an unpredictable result (depending on the platform).
That is why you should use static_cast<int>(c) to be 100% sure,
and if you want to reinterpret it, of course, reinterpret_cast.
But, again, a C-style cast here usually performs the ordinary conversion, and therefore the char will be converted into an integer.

Char data type in C/C++

I am trying to call a C++ DLL in Java. In its C++ head file, there are following lines:
#define a '102001'
#define b '102002'
#define c '202001'
#define d '202002'
What kind of data type do a, b, c, and d have? Are they char or char array? And what is the corresponding data type in Java that I should convert them to?
As Mysticial pointed out, these are multicharacter literals. Their value is implementation-defined, and since each packs six 8-bit characters, i.e. 48 bits, the Java type to use is long.
In Java, you need to convert them to long manually:
static long toMulticharConst(String s) {
    long res = 0;
    for (char c : s.toCharArray()) {
        res <<= 8;
        res |= ((long) c) & 0xFF;
    }
    return res;
}

final long a = toMulticharConst("102001");
final long b = toMulticharConst("102002");
final long c = toMulticharConst("202001");
final long d = toMulticharConst("202002");
I might try to answer the first two questions. Not being familiar with Java, I have to leave the last question to others.
Single and double quotes mean very different things in C. A character enclosed in single quotes is just the same as the integer representing it in the collating sequence (e.g., in an ASCII implementation, 'a' means exactly the same as 97).
However, a string enclosed in double quotes is a short-hand way of writing a pointer to the initial character of a nameless array that has been initialized with the characters between the quotes and an extra character whose binary value is 0.
Because an integer is often large enough to hold several characters, some C compiler implementations allow multiple characters in a character constant as well as in a string constant, which means that writing 'abc' instead of "abc" may well go undetected. Yet "abc" means a pointer to an array containing 4 characters (a, b, c, and \0), while the meaning of 'abc' is platform-dependent. Many C compilers take it to mean "an integer that is composed somehow of the values of the characters a, b, and c".
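As a minimal sketch of that platform-dependence (gcc and clang happen to pack the bytes this way; the value is still implementation-defined):
#include <cstdio>

int main() {
    int m = 'ab';                             // multicharacter constant: type int, value implementation-defined
    std::printf("%d\n", m);                   // gcc/clang print 24930
    std::printf("%d\n", ('a' << 8) | 'b');    // 97*256 + 98 = 24930
    return 0;
}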
For more information, you might read chapter 1.4 of the book "C Traps and Pitfalls".

typecasting to unsigned in C

int a = -534;
unsigned int b = (unsigned int)a;
printf("%d, %d", a, b);
prints -534, -534
Why is the typecast not taking place?
I expected it to be -534, 534
If I modify the code to
int a = -534;
unsigned int b = (unsigned int)a;
if(a < b)
printf("%d, %d", a, b);
it's not printing anything... after all, a is less than b??
Because you use %d for printing. Use %u for unsigned. Since printf is a vararg function, it cannot know the types of the parameters and must rely on the format specifiers instead. Because of this, the cast you do has no effect.
First, you don't need the cast: the value of a is implicitly converted to unsigned int with the assignment to b. So your statement is equivalent to:
unsigned int b = a;
Now, an important property of unsigned integral types in C and C++ is that their values are always in the range [0, max], where max for unsigned int is UINT_MAX (defined in limits.h). If you assign a value that's not in that range, it is converted to that range: if the value is negative, UINT_MAX+1 is added repeatedly until the result is in [0, UINT_MAX]. For your code above, it is as if we wrote unsigned int b = (UINT_MAX + a) + 1; with a 32-bit unsigned int, that is 4294967296 - 534 = 4294966762. This is not equal to -a (534).
Note that the above is true whether the underlying representation is in two's complement, ones' complement, or sign-magnitude (or any other exotic encoding). One can see that with something like:
signed char c = -1;
unsigned int u = c;
printf("%u\n", u);
assert(u == UINT_MAX);
On a typical two's complement machine with a 4-byte int, c is 0xff, and u is 0xffffffff. The compiler has to make sure that when value -1 is assigned to u, it is converted to a value equal to UINT_MAX.
Now going back to your code, the printf format string is wrong for b. You should use %u. When you do, you will find that it prints the value of UINT_MAX - 534 + 1 instead of 534.
When used in the comparison operator <, since b is unsigned int, a is also converted to unsigned int. This, given with b = a; earlier, means that a < b is false: a as an unsigned int is equal to b.
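Putting those pieces together in a minimal sketch (the exact numbers assume a 32-bit unsigned int):
#include <cstdio>

int main() {
    int a = -534;
    unsigned int b = a;               // converted modulo UINT_MAX + 1
    std::printf("%d %u\n", a, b);     // prints "-534 4294966762"
    std::printf("%d\n", a < b);       // prints "0": a is converted to unsigned, so a == b
    return 0;
}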
Let's say you have a ones' complement machine, and you do:
signed char c = -1;
unsigned char uc = c;
Let's say a char (signed or unsigned) is 8 bits on that machine. Then c and uc will store the following values and bit patterns:
+----+------+-----------+
| c | -1 | 11111110 |
+----+------+-----------+
| uc | 255 | 11111111 |
+----+------+-----------+
Note that the bit patterns of c and uc are not the same. The compiler must make sure that c has the value -1, and uc has the value UCHAR_MAX, which is 255 on this machine.
There are more details on my answer to a question here on SO.
Your specifier in the printf is asking printf to print a signed integer, so the underlying bytes are interpreted as a signed integer.
You should specify that you want an unsigned integer by using %u.
edit: a == b is true for the comparison, which is odd behaviour, but it's perfectly valid. You haven't changed the underlying bits; you have only asked the compiler to treat the underlying bits in a certain way. Therefore a bitwise comparison yields true.
[speculation] I would suspect that behaviour might vary among compiler implementations -i.e., a fictitious CPU might not use the same logic for both signed and unsigned numerals in which case a bitwise comparison would fail. [/speculation]
C can be an ugly beast sometimes. The problem is that on a typical 32-bit two's complement machine, -534 is represented by the bit pattern 0xfffffdea whether it is stored in a variable of type unsigned int or signed int. To compare these variables they must have the same type, so one is implicitly converted to match the other. Once they are the same type, they are equal, because they represent the same value.
It seems likely that the behaviour you want is provided by the function abs:
int a = -534;
int b = abs(a);
printf("%d, %d", a, b);
I guess the first case, why b is printed as -534, has been sufficiently answered by Tronic and Hassan: you should not be using %d, you should be using %u.
As far as your second case is concerned, again, an implicit conversion happens, and a and b end up equal, which is why your comparison does not yield the expected result.
As far as I can see, the if fails because the compiler assumes the second variable should be considered the same type as the first. Try
if(b > a)
to see the difference.
Re 2nd question:
comparison never works between two different types - they are always implicitly cast to the "lowest common denominator", which in this case will be unsigned int. Nasty and counter-intuitive, I know.
Casting an integer type from signed to unsigned does not modify the bit pattern, it merely changes the interpretation of the bit pattern.
You also have a format specifier mismatch, %u should be used for unsigned integers, but even then the result will not be 534 as you expect, but 4294966762.
If you want to make a negative value positive, simply negate it:
unsigned b = (unsigned)-a;
printf("%d, %u", a, b);
As for the second example, operations between types with differing signed-ness involve arcane implicit conversion rules - avoid. You should set your compiler's warning level high to trap many of these errors. I suggest /W4 /WX in VC++ and -Wall -Werror -Wformat for GCC for example.