I am reading chapter 2 of Advanced Linux Programming:
http://www.advancedlinuxprogramming.com/alp-folder/alp-ch02-writing-good-gnu-linux-software.pdf
In the section 2.1.3 Using getopt_long, there is an example program that goes a bit like this:
int main (int argc, char* argv[]) {
    int next_option;
    // ...
    do {
        next_option = getopt_long (argc, argv, short_options, long_options, NULL);
        switch (next_option) {
        case 'h':   /* -h or --help */
            // ...
        }
        // ...
The bit that caught my attention is that next_option is declared as an int. The function getopt_long() apparently returns an int representing the short command line argument which is used in the following switch statement. How come that integer can be compared to a character in the switch statement?
Is there an implicit conversion from a char (a single character?) to an int? How is the code above valid? (see full code in linked pdf)
Neither C nor C++ has a type that stores "characters" as values with dedicated character-specific properties. In that sense, there is no "character" type in either C or C++.
In both C and C++, char is an integral type. It contains numbers. It is just the smallest (in terms of range) integral type. Conversion between char and int exists, just as it exists between int and long or between int and short. char has no special status among the other integral types (aside from the fact that the char type is distinct from both signed char and unsigned char).
A literal of the form 'h' in C++ has type char, but like any other integral type it is comparable to an int. That's why you can use it in a case label the way it is used in your original example.
In other words, your original code is as "strange" as
switch (next_option) {
case 1L: ...
// ...
}
would be. In this case the switch argument is an int, but the case label is a long. The code is valid. Do you find it surprising? Probably not. Your example with 'h' is not much different.
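To see the conversion spelled out, here is a minimal sketch (assuming an ASCII execution character set, where 'h' has the value 104):

int next_option = 'h';   // the char literal converts to its numeric value, 104 in ASCII
switch (next_option) {   // the switch condition is an int
case 'h':                // the label is converted to int as well, so it behaves exactly like `case 104:`
    /* handle -h / --help */
    break;
}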
You are mistaken -- getopt_long(3) returns an int.
Several functions return int in C, but char in C++. Returning an int when a char would make more sense is simply an old C cultural decision. Plus, in a few cases, it's necessary so that a function can return sentinels like EOF.
As the other answerer says, you're asking the wrong question here. But to answer the question you did ask:
No implicit conversion from char* to an int is available. On 32-bit x86 machines, both int and char* are 32 bits wide, so it's "safe" to cast explicitly:
int x = (int) &someChar;
BUT HIGHLY NOT RECOMMENDED!!!
On x64 machines, this will not work! int remains 32 bits long, but all pointers are now 64 bits long... so you'll lose data in the process!
According to the man page, getopt_long returns an int. And yes, there is an implicit conversion from char to int; a char is just a one-byte integer value.
So in this case, the conversion is not happening when assigning to next_option, but in the case statement, where a character constant is compared to an int. Of course, this is assuming you compile the code as C++. In C++, a character constant is of type char, but in C it's of type int, so if you compile this code as C then there's no type conversion at all.
(And in your question you mention char*, but you probably meant char; there are no pointers being used here.)
Think of a char as an 8-bit int. You can perform integer operations on chars, and you can even declare them unsigned. You wouldn't be surprised if you could compare a short and a long. Why should comparing a char and an int be any different?
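A quick illustration of that point (a sketch, assuming an ASCII character set):

char c = 'A';          // on an ASCII system this holds the number 65
int  i = 65;
if (c == i) {          // true: c is promoted to int before the comparison, just as a short would be
    /* ... */
}
char d = c + 1;        // plain integer arithmetic; d is now 'B' (66)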
Related
I use UTF-8 and have to save a constant in a char array:
const char s[] = {0xE2,0x82,0xAC, 0}; //the euro sign
However, it gives me an error:
test.cpp:15:40: error: narrowing conversion of ‘226’ from ‘int’ to ‘const char’ inside { } [-fpermissive]
I have to cast all the hex numbers to char, which feels tedious and doesn't smell good. Is there any other proper way of doing this?
char may be signed or unsigned (and the default is implementation specific). You probably want
const unsigned char s[] = {0xE2,0x82,0xAC, 0};
or
const char s[] = "\xe2\x82\xac";
or with many recent compilers (including GCC)
const char s[] = "€";
(a string literal is an array of char unless you give it some prefix)
See -funsigned-char (or -fsigned-char) option of GCC.
On some implementations a char is unsigned and CHAR_MAX is 255 (and CHAR_MIN is 0). On others, chars are signed, so CHAR_MIN is -128 and CHAR_MAX is 127 (and, e.g., things are different on Linux/PowerPC/32 bits and Linux/x86/32 bits). AFAIK nothing in the standard prohibits 19-bit signed chars.
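If you want to see what your own platform and flags do, here is a quick probe (a sketch using the standard <limits.h> macros):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* prints e.g. "CHAR_MIN=-128 CHAR_MAX=127 CHAR_BIT=8" where char is signed,
       or "CHAR_MIN=0 CHAR_MAX=255 CHAR_BIT=8" where it is unsigned;
       try it with and without -fsigned-char / -funsigned-char on GCC */
    printf("CHAR_MIN=%d CHAR_MAX=%d CHAR_BIT=%d\n", CHAR_MIN, CHAR_MAX, CHAR_BIT);
    return 0;
}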
The short answer to your question is that you are overflowing a signed char. A signed char has the range [-128, 127], and 0xE2 = 226 > 127. What you need to use is an unsigned char, which has the range [0, 255].
const unsigned char s[] = {0xE2, 0x82, 0xAC, 0};
While it may well be tedious to put lots of casts in your code, it actually smells extremely GOOD to me to use the strongest typing possible.
As noted above, when you specify type "char" you are inviting a compiler to choose whatever the compiler writer preferred (signed or unsigned). I'm no expert on UTF-8, but there is no reason to make your code non-portable if you don't need to.
As far as your constants go, I've used compilers that treat constants written that way as signed ints, as well as compilers that consider the context and interpret them accordingly. Note that converting between signed and unsigned can overflow EITHER WAY: for the same number of bits, a negative value overflows an unsigned type (obviously), and an unsigned value with the top bit set overflows a signed type, because the top bit means negative.
In this case, your compiler is taking your constants as unsigned 8-bit--OR LARGER--values, which means they don't fit as signed 8-bit values. And we are all grateful that the compiler complains (at least I am).
My perspective is that there is nothing at all bad about casting to show exactly what you intend to happen. And if a compiler lets you assign between signed and unsigned, it should require a cast, regardless of whether variables or constants are involved, e.g.
const int8_t a = (int8_t) 0xFF; // will be -1
although in my example, it would be better to assign -1. When you are having to add extra casts, they either make sense, or you should code your constants so they make sense for the type you are assigning to.
Is there a way to mix these? I want a define macro FX_RGB(R,G,B) that makes a const string "\x01\xRR\xGG\xBB" so I can do the following:
const char* LED_text = "Hello " FX_RGB(0xff, 0xff, 0x80) "World";
and get a string: const char* LED_text = "Hello \x01\xff\xff\x80World";
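The preprocessor cannot rewrite a token like 0xff into the escape "\xff", so one workaround (a hypothetical sketch, not the only option) is to pass the components as escape-string literals and rely on adjacent string literal concatenation:

#define FX_RGB(R, G, B) "\x01" R G B   /* R, G, B must be string literals such as "\xff" */

const char* LED_text = "Hello " FX_RGB("\xff", "\xff", "\x80") "World";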
The function std::isdigit is:
int isdigit(int ch);
The return value (non-zero if the character is a numeric character, zero otherwise) smells like the function was inherited from C, but even that does not explain why the parameter type is int rather than char, while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigit takes an int and not a char?
The reason is to allow EOF as input. And EOF is (from here):
EOF integer constant expression of type int and negative value
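That is also why the usual reading loop keeps the value in an int until it has been compared with EOF; a minimal sketch of the idiom:

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    int ch;                          /* int, not char, so EOF (-1) stays distinguishable */
    while ((ch = getchar()) != EOF) {
        if (isdigit(ch))             /* ch holds an unsigned-char value here, a valid argument */
            putchar(ch);             /* echo only the digit characters */
    }
    return 0;
}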
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof(char) is always 1, and why CHAR_BIT isn't named BYTE_BIT.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could--in theory--use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not 256 values ([-128..127]) like you'd get from two's complement. If you were language lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char even though that was a fundamental assumption throughout the design of the language and its run-time library. I think C++ finally closed that apparent loophole in C++17 or C++20 by, in effect, requiring that a signed char use two's complement even if the larger integral types use sign+magnitude.
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for -1.
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x == EOF) { /* reached end of file */ }
else if (0 <= x && x < 128) { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
    // Here it gets interesting.
    bool b1 = isdigit(x);                             // OK
    bool b2 = isdigit(static_cast<char>(x));          // NOT PORTABLE
    bool b3 = isdigit(static_cast<unsigned char>(x)); // CORRECT!
}
Why does memset take an int as the second argument instead of a char, whereas wmemset takes a wchar_t instead of something like long or long long?
memset predates (by quite a bit) the addition of function prototypes to C. Without a prototype, you can't pass a char to a function -- when/if you try, it'll be promoted to int when you pass it, and what the function receives is an int.
It's also worth noting that in C, (but not in C++) a character literal like 'a' does not have type char -- it has type int, so what you pass will usually start out as an int anyway. Essentially the only way for it to start as a char and get promoted is if you pass a char variable.
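One way to see that difference is a small probe like the following (the same file compiles as either language):

#include <stdio.h>

int main(void)
{
    /* Compiled as C, 'a' has type int, so this typically prints "4 1" where int is 4 bytes;
       compiled as C++, 'a' has type char, so it prints "1 1". */
    printf("%zu %zu\n", sizeof 'a', sizeof(char));
    return 0;
}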
In theory, memset could probably be modified so it receives a char instead of an int, but there's unlikely to be any benefit, and a pretty decent possibility of breaking some old code or other. With an unknown but potentially fairly high cost, and almost no chance of any real benefit, I'd say the chances of it being changed to receive a char fall right on the line between "slim" and "none".
Edit (responding to the comments): The CHAR_BIT least significant bits of the int are used as the value to write to the target.
Probably the same reason why the functions in <ctype.h> take ints and not chars.
On most platforms, a char is too small to be pushed on the stack by itself, so one usually pushes the type closest to the machine's word size, i.e. int.
As the link in #Gui13's comment points out, doing that also increases performance.
See fred's answer, it's for performance reasons.
On my side, I tried this code:
#include <stdio.h>
#include <string.h>
int main (int argc, const char * argv[])
{
    char c = 0x00;
    printf("Before: c = 0x%02x\n", c);
    memset( &c, 0xABCDEF54, 1);
    printf("After: c = 0x%02x\n", c);
    return 0;
}
And it gives me this on a 64-bit Mac:
Before: c = 0x00
After: c = 0x54
So as you see, only the last byte (0x54) gets written. (Note that this is not actually endianness-dependent: memset converts its int argument to unsigned char, so only the low-order byte of the value is ever used.)
42 as unsigned int is well defined as "42U".
unsigned int foo = 42U; // yeah!
How can I write "23" so that it is clear it is an unsigned short int?
unsigned short bar = 23; // booh! not clear!
EDIT so that the meaning of the question is more clear:
#include <iostream>
#include <cstdlib>

template <class T>
void doSomething(T) {
    std::cout << "unknown type" << std::endl;
}

template<>
void doSomething(unsigned int) {
    std::cout << "unsigned int" << std::endl;
}

template<>
void doSomething(unsigned short) {
    std::cout << "unsigned short" << std::endl;
}

int main(int argc, char* argv[])
{
    doSomething(42U);
    doSomething((unsigned short)23); // no other option than a cast?
    return EXIT_SUCCESS;
}
You can't. Numeric literals cannot have short or unsigned short type.
Of course in order to assign to bar, the value of the literal is implicitly converted to unsigned short. In your first sample code, you could make that conversion explicit with a cast, but I think it's pretty obvious already what conversion will take place. Casting is potentially worse, since with some compilers it will quell any warnings that would be issued if the literal value is outside the range of an unsigned short. Then again, if you want to use such a value for a good reason, then quelling the warnings is good.
In the example in your edit, where it happens to be a template function rather than an overloaded function, you do have an alternative to a cast: doSomething<unsigned short>(23). With an overloaded function, you could still avoid a cast with:
void (*f)(unsigned short) = &doSomething;
f(23);
... but I don't advise it. If nothing else, this only works if the unsigned short version actually exists, whereas a call with the cast performs the usual overload resolution to find the most compatible version available.
unsigned short bar = (unsigned short) 23;
or in new speak....
unsigned short bar = static_cast<unsigned short>(23);
At least in Visual Studio (2013 and newer, at least) you can write
23ui16
to get a constant of type unsigned short.
see definitions of INT8_MIN, INT8_MAX, INT16_MIN, INT16_MAX, etc. macros in stdint.h
I don't know at the moment whether this is part of standard C/C++.
There is no literal suffix for unsigned short. Integer literals have type int by default and are usually implicitly converted to the target type with no problems. But if you really want to indicate the type explicitly, you could write the following:
unsigned short bar = static_cast<unsigned short>(23);
As far as I can see, the only reason to need such an indication is for proper template type deduction:
func( static_cast<unsigned short>(23) );
But for such a case, a clearer option would be a call like the following:
func<unsigned short>( 23 );
There are multiple answers here, none of which are terribly satisfying. So here is a compilation answer with some added info to help explain things a little more thoroughly.
First, avoid shorts as suggested, but if you find yourself needing them, such as when working with indexed mesh data, where simply switching to shorts for your index size cuts your index data in half... then read on...
1 While it is technically true that there is no way to express an unsigned short literal in C or C++, you can easily sidestep this limitation by simply marking your literal as unsigned with a 'u'.
unsigned short myushort = 16u;
This works because it tells the compiler that 16 is an unsigned int; the compiler then goes looking for a way to convert it to unsigned short, finds one, and (in most compilers) checks for overflow before doing the conversion with no complaints. The "narrowing conversion" error/warning when the 'u' is left out is the compiler complaining that the code is throwing away the sign, so that if the literal is negative, such as -1, the result is undefined. Usually this means you will get a very large unsigned value that is then truncated to fit the short.
2 There are multiple suggestions on how to sidestep this limitation; most seasoned programmers will sum these up with a "don't do that".
unsigned short myshort = (unsigned short)16;
unsigned short myothershort = static_cast<unsigned short>(16);
While both of these work, they are undesirable for two major reasons. First, they are wordy; programmers get lazy, and typing all that just for a literal is easy to skip, which leads to basic errors that could have been avoided with a better solution. Second, they are not free: static_cast in particular can generate a little assembly code to do the conversion, and while an optimizer may (or may not) figure out that it can elide the conversion, it's better to just write good quality code from the start.
unsigned short myshort = 16ui16;
This solution is undesirable because it limits who can read and understand your code. It also means you are starting down the slippery slope of compiler-specific code, which can lead to your code suddenly not working because of the whims of some compiler writer, or some company that randomly decides to "make a right hand turn", or goes away and leaves you in the lurch.
unsigned short bar = L'\x17';
This is so unreadable that nobody has upvoted it. And unreadable should be avoided for many good reasons.
unsigned short bar = 0xf;
This too is unreadable. While being able to read, understand, and convert hex is something serious programmers really need to learn, it is still hard to read quickly: what number is this: 0xbad? Now convert it to binary... now octal.
3 Lastly, if you find all the above solutions undesirable, I offer up yet another solution that is available via a user-defined literal operator.
constexpr unsigned short operator ""_ushort(unsigned long long x)
{
    return (unsigned short)x;
}
and to use it
unsigned short x = 16_ushort;
Admittedly this too isn't perfect. First, it takes an unsigned long long and whacks it all the way down to an unsigned short, suppressing potential compiler warnings along the way, and it uses the C-style cast. But it is constexpr, which guarantees it is free in an optimized program yet can be stepped into during debugging. It is also short and sweet, so programmers are more likely to use it, and it is expressive, so it is easy to read and understand. Unfortunately it requires a fairly recent compiler, as what can legally be done with user-defined literal operators has changed over the various versions of C++.
So pick your trade-offs, but be careful, as you may regret them later. Happy programming.
Unfortunately, the only method defined for this is
One or two characters in single quotes
('), preceded by the letter L
According to http://cpp.comsci.us/etymology/literals.html
Which means you would have to represent your number as an ASCII escape sequence:
unsigned short bar = L'\x17';
Unfortunately, they can't. But if people just look two words behind the number, they should clearly see it is a short... It's not THAT ambiguous. But it would be nice.
If you express the quantity as a 4-digit hex number, the unsigned shortness might be clearer.
unsigned short bar = 0x0017;
You probably shouldn't use short, unless you have a whole lot of them. It's intended to use less storage than an int, but that int will have the "natural size" for the architecture. Logically it follows that a short probably doesn't. Similar to bitfields, this means that shorts can be considered a space/time tradeoff. That's usually only worth it if it buys you a whole lot of space. You're unlikely to have very many literals in your application, though, so there was no need foreseen to have short literals. The usecases simply didn't overlap.
In C++11 and beyond, if you really want an unsigned short literal conversion then it can be done with a user defined literal:
#include <cassert>
#include <climits>   // for USHRT_MAX

using uint16 = unsigned short;
using uint64 = unsigned long long;

constexpr uint16 operator""_u16(uint64 to_short) {
    // use your favorite value validation
    assert(to_short < USHRT_MAX); // USHRT_MAX from <climits>
    return static_cast<uint16>(to_short);
}

int main(void) {
    uint16 val = 26_u16;
}
I frequently work with libraries that use char when working with bytes in C++. The alternative is to define a "Byte" as unsigned char, but that's not the standard they decided to use. I frequently pass bytes from C# into the C++ DLLs and cast them to char to work with the library.
When casting ints to chars, or chars to other simple types, what are some of the side effects that can occur? Specifically, when has this broken code that you have worked on, and how did you find out it was because of char signedness?
Luckily I haven't run into this in my own code; I used a char signedness casting trick back in an embedded systems class in school. I'm looking to better understand the issue since I feel it is relevant to the work I am doing.
One major risk is if you need to shift the bytes. A signed char keeps the sign-bit when right-shifted, whereas an unsigned char doesn't.
Here's a small test program:
#include <stdio.h>

int main (void)
{
    signed char a = -1;
    unsigned char b = 255;
    printf("%d\n%d\n", a >> 1, b >> 1);
    return 0;
}
It should print -1 and 127, even though a and b start out with the same bit pattern (given 8-bit chars, two's-complement and signed values using arithmetic shift).
In short, you can't rely on shift working identically for signed and unsigned chars, so if you need portability, use unsigned char rather than char or signed char.
The most obvious gotchas come when you need to compare the numeric value of a char with a hexadecimal constant when implementing protocols or encoding schemes.
For example, when implementing telnet you might want to do this.
// Check for IAC (hex FF) byte
if (ch == 0xFF)
{
// ...
Or when testing for UTF-8 multi-byte sequences.
if (ch >= 0x80)
{
// ...
Fortunately these errors don't usually survive very long as even the most cursory testing on a platform with a signed char should reveal them. They can be fixed by using a character constant, converting the numeric constant to a char or converting the character to an unsigned char before the comparison operator promotes both to an int. Converting the char directly to an unsigned won't work, though.
if (ch == '\xff') // OK
if ((unsigned char)ch == 0xff) // OK, so long as char has 8-bits
if (ch == (char)0xff) // Usually OK, relies on implementation defined behaviour
if ((unsigned)ch == 0xff) // still wrong
I've been bitten by char signedness in writing search algorithms that used characters from the text as indices into state trees. I've also had it cause problems when expanding characters into larger types, and the sign bit propagates causing problems elsewhere.
I found out when I started getting bizarre results and segfaults arising from searching texts other than the ones I'd used during the initial development (obviously characters with values >127 or <0 are going to cause this, and they won't necessarily be present in your typical text files).
Always check a variable's signedness when working with it. Generally now I make types signed unless I have a good reason otherwise, casting when necessary. This fits in nicely with the ubiquitous use of char in libraries to simply represent a byte. Keep in mind that the signedness of char is implementation-defined (unlike with the other integer types), so you should give it special treatment and be mindful.
The one that most annoys me:
typedef char byte;
byte b = 12;
cout << b << endl;
Sure it's cosmetics, but arrr...
When casting ints to chars or chars to other simple types
The critical point is that converting a signed value from one primitive type to another (larger) type does not preserve the bit pattern (assuming two's complement). A signed char with bit pattern 0xFF is -1, while a signed short with the decimal value -1 is 0xFFFF. Converting an unsigned char with value 0xFF to an unsigned short, however, yields 0x00FF. Therefore, always think about proper signedness before you convert to a larger or smaller data type. Never carry unsigned data in signed data types if you don't need to -- if an external library forces you to do so, do the conversion as late as possible (or as early as possible if the external code acts as a data source).
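A small demonstration of that asymmetry (a sketch assuming an 8-bit char and two's complement):

#include <stdio.h>

int main(void)
{
    signed char   sc = (signed char)0xFF;   /* bit pattern 0xFF, value -1 */
    unsigned char uc = 0xFF;

    unsigned short from_signed   = (unsigned short)(short)sc;  /* sign-extended: 0xFFFF */
    unsigned short from_unsigned = uc;                         /* zero-extended: 0x00FF */

    printf("%04X %04X\n", from_signed, from_unsigned);         /* prints "FFFF 00FF" */
    return 0;
}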
The C and C++ language specifications define 3 data types for holding characters: char, signed char and unsigned char. The latter 2 have been discussed in other answers. Let's look at the char type.
The standard(s) say that the char data type may be signed or unsigned; it is an implementation decision. This means that some compilers, or versions of compilers, can implement char differently. The implication is that the plain char data type is not well suited to arithmetic or Boolean operations. For arithmetic and Boolean operations, the signed and unsigned versions of char will work fine.
In summary, there are 3 versions of the char data type. The plain char data type performs well for holding characters, but is not suited for arithmetic across platforms and translators, since its signedness is implementation defined.
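In C++ the "three distinct types" point can be seen directly through overload resolution; a small sketch:

#include <iostream>

// char, signed char and unsigned char are three distinct types,
// so all three overloads can coexist and are selected independently.
void show(char)          { std::cout << "char\n"; }
void show(signed char)   { std::cout << "signed char\n"; }
void show(unsigned char) { std::cout << "unsigned char\n"; }

int main()
{
    show('a');                             // always picks show(char), whatever char's signedness is
    show(static_cast<signed char>('a'));   // picks show(signed char)
    show(static_cast<unsigned char>('a')); // picks show(unsigned char)
    return 0;
}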
You will fail miserably when compiling for multiple platforms because the C++ standard doesn't define char to be of a certain "signedness".
Therefore GCC introduces -fsigned-char and -funsigned-char options to force certain behavior. More on that topic can be found here, for example.
EDIT:
As you asked for examples of broken code, there are plenty of possibilities to break code that processes binary data. For example, imagine you process 8-bit audio samples (range -128 to 127) and you want to halve the volume. Now imagine this scenario (in which the naive programmer assumes char == signed char):
char sampleIn;
// If the sample is -1 (= almost silent), and the compiler treats char as unsigned,
// then the value of 'sampleIn' will be 255
read_one_byte_sample(&sampleIn);
// Ok, halve the volume. The value will be 127!
char sampleOut = sampleIn / 2;
// And write the processed sample to the output file, for example.
// (unsigned char)127 has the exact same bit pattern as (signed char)127,
// so this will write a sample with the loudest volume!!
write_one_byte_sample_to_output_file(&sampleOut);
I hope you like that example ;-) But to be honest, I've never really come across such problems, not even as a beginner, as far as I can remember...
Hope this answer is sufficient for you downvoters. What about a short comment?
Sign extension. The first version of my URL encoding function produced strings like "%FFFFFFA3".
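For illustration, here is a sketch of how a string like "%FFFFFFA3" can arise and what the usual fix looks like (hypothetical code, assuming a platform where plain char is signed and int is 32 bits):

#include <stdio.h>

int main(void)
{
    char c = (char)0xA3;                  /* a byte with the high bit set; -93 where char is signed */

    printf("%%%02X\n", c);                /* c is sign-extended when promoted to int: may print "%FFFFFFA3" */
    printf("%%%02X\n", (unsigned char)c); /* converted to unsigned char first: prints "%A3" */
    return 0;
}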