A warning - comparison between signed and unsigned integer expressions - c++

I am currently working through Accelerated C++ and have come across an issue in exercise 2-3.
A quick overview of the program - it takes a name, then displays a greeting inside a frame of asterisks - i.e. Hello ! framed by *'s.
The exercise - In the example program, the authors use const int to determine the padding (blank spaces) between the greeting and the asterisks. They then ask the reader, as part of the exercise, to ask the user for input as to how big they want the padding to be.
All this seems easy enough: I go ahead and ask the user for two integers (int), store them, and change the program to use them in place of the author's constants. When compiling, though, I get the following warning:
Exercise2-3.cpp:46: warning: comparison between signed and unsigned integer expressions
After some research it appears to be because the code attempts to compare one of the above integers (int) to a string::size_type, which is fine. But I was wondering - does this mean I should change one of the integers to unsigned int? Is it important to explicitly state whether my integers are signed or unsigned?
cout << "Please enter the size of the frame between top and bottom you would like ";
int padtopbottom;
cin >> padtopbottom;
cout << "Please enter size of the frame from each side you would like: ";
int padsides;
cin >> padsides;
string::size_type c = 0; // definition of c in the program
if (r == padtopbottom + 1 && c == padsides + 1) { // where the error occurs
Above are the relevant bits of code, the c is of type string::size_type because we do not know how long the greeting might be - but why do I get this problem now, when the author's code didn't get the problem when using const int? In addition - to anyone who may have completed Accelerated C++ - will this be explained later in the book?
I am on Linux Mint using g++ via Geany, if that helps or makes a difference (as I read that it could when determining what string::size_type is).

It is usually a good idea to declare variables as unsigned or size_t if they will be compared to sizes, to avoid this issue. Whenever possible, use the exact type you will be comparing against (for example, use std::string::size_type when comparing with a std::string's length).
Compilers give warnings about comparing signed and unsigned types because the ranges of signed and unsigned ints are different, and when they are compared to one another, the results can be surprising. If you have to make such a comparison, you should explicitly convert one of the values to a type compatible with the other, perhaps after checking to ensure that the conversion is valid. For example:
unsigned u = GetSomeUnsignedValue();
int i = GetSomeSignedValue();

if (i >= 0)
{
    // i is nonnegative, so it is safe to cast to an unsigned value
    if ((unsigned)i >= u)
        iIsGreaterThanOrEqualToU();
    else
        iIsLessThanU();
}
else
{
    iIsNegative();
}

I had the exact same problem yesterday working through problem 2-3 in Accelerated C++. The key is to change all variables you will be comparing (with comparison operators) to compatible types. In this case, that means string::size_type (or unsigned int, but since this example is using the former, I will just stick with that, even though the two are technically compatible).
Notice that in their original code they did exactly this for the c counter (page 30 in Section 2.5 of the book), as you rightly pointed out.
What makes this example more complicated is that the different padding variables (padsides and padtopbottom), as well as all counters, must also be changed to string::size_type.
Getting to your example, the code that you posted would end up looking like this:
cout << "Please enter the size of the frame between top and bottom";
string::size_type padtopbottom;
cin >> padtopbottom;
cout << "Please enter size of the frame from each side you would like: ";
string::size_type padsides;
cin >> padsides;
string::size_type c = 0; // definition of c in the program
if (r == padtopbottom + 1 && c == padsides + 1) { // where the error no longer occurs
Notice that in the previous conditional, you would still get the warning if you didn't declare the variable r as a string::size_type in the for loop. So you need to set up the for loop using something like:
for (string::size_type r=0; r!=rows; ++r) //If r and rows are string::size_type, no error!
So, basically, once you introduce a string::size_type variable into the mix, any time you want to compare it with something, all operands must have a compatible type for the code to compile without warnings.
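To tie the pieces together, here is a minimal compilable sketch, assuming everything (padding, frame dimensions, and both loop counters) is declared as string::size_type. The frame-drawing body is only a placeholder, not the book's solution:

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string::size_type padtopbottom, padsides;
    cout << "Please enter the size of the frame between top and bottom: ";
    cin >> padtopbottom;
    cout << "Please enter size of the frame from each side you would like: ";
    cin >> padsides;

    const string greeting = "Hello, world!";
    const string::size_type rows = padtopbottom * 2 + 3;
    const string::size_type cols = greeting.size() + padsides * 2 + 2;

    // Every counter and padding variable shares string::size_type, so none of
    // the comparisons below mixes signed and unsigned operands.
    for (string::size_type r = 0; r != rows; ++r) {
        for (string::size_type c = 0; c != cols; ++c) {
            // The real exercise chooses between '*', spaces, and the greeting here.
            cout << '*';
        }
        cout << endl;
    }
    return 0;
}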

The important difference between signed and unsigned ints is the interpretation of the most significant bit. In signed types it represents the sign of the number. For example (using 4-bit values):
0001 is 1, both signed and unsigned
1001 is -1 signed and 9 unsigned
(I am avoiding the whole two's-complement issue for clarity of explanation; this is not exactly how ints are represented in memory!)
You can imagine that it makes a difference whether you compare with -1 or with +9. In many cases, programmers are just too lazy to declare counting ints as unsigned (it bloats the for loop header, for instance). It is usually not an issue because with ints you have to count up to 2^31 before your sign bit bites you. That's why it is only a warning: because we are too lazy to write 'unsigned' instead of 'int'.
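A tiny sketch of the surprise the warning guards against (illustrative code, not from the book): in a mixed comparison the signed operand is converted to unsigned first, so -1 suddenly compares greater than 1.

#include <iostream>

int main()
{
    int i = -1;
    unsigned int u = 1;

    // i is converted to unsigned int for the comparison and becomes a huge value,
    // so this branch is taken (and the compiler emits the sign-compare warning).
    if (i > u)
        std::cout << "surprise: -1 > 1u evaluates to true\n";

    return 0;
}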

At the top of its range, an unsigned int can hold values larger than any int can represent.
Therefore, the compiler generates a warning. If you are sure this is not a problem, feel free to cast the values to the same type so the warning disappears (use a C++-style cast so it is easy to spot).
Alternatively, make the variables the same type to stop the compiler from complaining.
I mean, is it possible to have a negative padding? If so, keep it as an int. Otherwise you should probably use unsigned int and let the stream catch the case where the user types in a negative number.
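A rough sketch of that last suggestion (hypothetical code; whether a leading minus sign actually sets the stream's failbit can vary between library implementations): read into an unsigned variable and reprompt while extraction fails.

#include <iostream>
#include <limits>
using namespace std;

int main()
{
    unsigned int padsides;
    cout << "Please enter size of the frame from each side you would like: ";
    while (!(cin >> padsides)) {
        // Extraction failed (non-numeric input, or a negative number on
        // implementations that reject it): reset the stream and try again.
        cin.clear();
        cin.ignore(numeric_limits<streamsize>::max(), '\n');
        cout << "Please enter a non-negative whole number: ";
    }
    cout << "Using a side padding of " << padsides << endl;
    return 0;
}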

The primary issue is that the underlying hardware, the CPU, only has instructions to compare two signed values or to compare two unsigned values. If you pass the unsigned comparison instruction a signed, negative value, it will treat it as a large positive number. So -1, the bit pattern with all bits on (two's complement), becomes the maximum unsigned value for the same number of bits.
8-bits: -1 signed is the same bits as 255 unsigned
16-bits: -1 signed is the same bits as 65535 unsigned
etc.
So, if you have the following code:
int fd;
fd = open( .... );

int cnt;
SomeType buf;
cnt = read( fd, &buf, sizeof(buf) );
if( cnt < sizeof(buf) ) {
    perror("read error");
}
you will find that if the read(2) call fails because the file descriptor has become invalid (or for some other error), cnt will be set to -1. When it is compared against sizeof(buf), an unsigned value, cnt is itself converted to unsigned, and the if() is false because 0xffffffff is not less than the sizeof() of any reasonable (i.e. not concocted to be maximum size) data structure.
Thus, to remove the signed/unsigned warning, you have to write the above if as:
if( cnt < 0 || (size_t)cnt < sizeof(buf) ) {
    perror("read error");
}
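A variant of the same fix (a sketch, not part of the original answer): give cnt the type read(2) actually returns, ssize_t, and keep the explicit negative check before any comparison against a size_t. It assumes the same fd and buf as the snippet above.

ssize_t cnt = read( fd, &buf, sizeof(buf) );  /* read() returns ssize_t, not int */
if( cnt < 0 ) {
    perror("read error");
} else if( (size_t)cnt < sizeof(buf) ) {
    /* short read: fewer bytes arrived than the buffer holds */
}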
This just speaks loudly to two problems:
1. The introduction of size_t and similar datatypes was crafted to mostly work, rather than engineered, with language changes, to be explicitly robust and foolproof.
2. Overall, C/C++ data types should just be signed, as Java correctly implemented.
If you have values so large that you can't find a signed type that works, you are using too small a processor or too large a magnitude of values for your language of choice. If, as with money, every digit counts, most languages offer systems that give you arbitrary digits of precision. C/C++ just doesn't do this well, and you have to be very explicit about everything around types, as mentioned in many of the other answers here.

or use this header library and write:
// |notEqual|less|lessEqual|greater|greaterEqual
if(sweet::equal(valueA,valueB))
and don't care about signed/unsigned or different sizes

Related

Why is parameter to isdigit integer?

The function std::isdigit is:
int isdigit(int ch);
The return value (non-zero if the character is a numeric character, zero otherwise) smells like the function was inherited from C, but even that does not explain why the parameter type is int rather than char, while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigit takes an int and not a char?
The reason is to allow EOF as input. And EOF is (per the reference documentation):
EOF - an integer constant expression of type int with a negative value
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof(char) is always 1, and why CHAR_BIT isn't named BYTE_BIT.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could--in theory--use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not 256 values ([-128..127]) like you'd get from two's complement. If you were language lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char even though that was a fundamental assumption throughout the design of the language and its run-time library. I think C++ finally closed that apparent loophole in C++17 or C++20 by, in effect, requiring that a signed char use two's complement even if the larger integral types use sign+magnitude.
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for -1.
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x == EOF)                 { /* reached end of file */ }
else if (0 <= x && x < 128)   { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
    // Here it gets interesting.
    bool b1 = isdigit(x);                              // OK
    bool b2 = isdigit(static_cast<char>(x));           // NOT PORTABLE
    bool b3 = isdigit(static_cast<unsigned char>(x));  // CORRECT!
}

C++ string.length() Strange Behavior

I just came across an extremely strange problem. The function I have is simply:
int strStr(string haystack, string needle) {
    for (int i = 0; i <= (haystack.length() - needle.length()); i++) {
        cout << "i " << i << endl;
    }
    return 0;
}
Then if I call strStr("", "a"), although haystack.length()-needle.length() should be -1, the loop runs anyway instead of the function immediately returning 0 - you can try it yourself...
This is because .length() (and .size()) return size_t, which is an unsigned integer type. You think you get a negative number, when in fact the subtraction wraps around to the maximum value of size_t (on my machine, that is 18446744073709551615). This means your for loop tries to run through all the possible values of size_t, instead of just exiting immediately like you expect.
To get the result you want, you can explicitly convert the sizes to ints rather than unsigned ints (see aslg's answer), although this may fail for strings with sufficient length (enough to overflow or underflow a standard int).
Edit:
Two solutions from the comments below:
(Nir Friedman) Instead of using int as in aslg's answer, include the <cstdint> header and use an int64_t, which will avoid the problem mentioned above.
(rici) Turn your for loop into for (int i = 0; needle.length() + i <= haystack.length(); i++) {, which avoids the problem altogether by rearranging the comparison so that no subtraction is needed.
(haystack.length()-needle.length())
length returns a size_t, in other words an unsigned integer type. Given the sizes of your strings, 0 and 1 respectively, calculating the difference wraps around and gives the maximum possible value of that unsigned type (approximately 4.2 billion for 4-byte storage, but it could be a different value).
i<=(haystack.length()-needle.length())
The index i is converted by the compiler to that unsigned type to match. So you would have to wait until i exceeds the maximum possible value of that unsigned type - the loop is not going to stop.
Solution:
You have to convert the result of each method to int, like so,
i <= ( (int)haystack.length() - (int)needle.length() )
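Alternatively, here is a sketch of rici's rearrangement from the edit above (illustrative, with a tiny main added so it can be run): keeping the addition on the left avoids the unsigned subtraction entirely.

#include <iostream>
#include <string>
using namespace std;

int strStr(string haystack, string needle) {
    // needle.length() + i stays in unsigned range for non-negative i,
    // so the comparison behaves as expected and never wraps.
    for (int i = 0; needle.length() + i <= haystack.length(); i++) {
        cout << "i " << i << endl;
    }
    return 0;
}

int main() {
    strStr("", "a");  // prints nothing: the loop condition is false immediately
    return 0;
}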

comparison between signed and unsigned integer expressions and 0x80000000

I have the following code:
#include <iostream>
using namespace std;

int main()
{
    int a = 0x80000000;
    if (a == 0x80000000)
        a = 42;
    cout << "Hello World! :: " << a << endl;
    return 0;
}
The output is
Hello World! :: 42
so the comparison works. But the compiler tells me
g++ -c -pipe -g -Wall -W -fPIE -I../untitled -I. -I../bin/Qt/5.4/gcc_64/mkspecs/linux-g++ -o main.o ../untitled/main.cpp
../untitled/main.cpp: In function 'int main()':
../untitled/main.cpp:8:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if(a == 0x80000000)
^
So the question is: Why is 0x80000000 an unsigned int? Can I make it signed somehow to get rid of the warning?
As far as I understand, 0x80000000 would be INT_MIN as it's out of range for a positive int. But why is the compiler assuming that I want a positive number?
I'm compiling with gcc version 4.8.1 20130909 on linux.
0x80000000 is an unsigned int because the value is too big to fit in an int and you did not add any L to specify it was a long.
The warning is issued because unsigned types in C/C++ have rather surprising semantics, and it is therefore very easy to make mistakes in code by mixing up signed and unsigned integers. This mixing is often a source of bugs, especially because the standard library, by historical accident, chose to use an unsigned type for the size of containers (size_t).
As an example I often use to show how subtle the problem is, consider:
// Draw connecting lines between the dots
for (int i = 0; i < pts.size() - 1; i++) {
    draw_line(pts[i], pts[i+1]);
}
This code seems fine but has a bug. If the pts vector is empty, pts.size() is 0 but, and here comes the surprising part, pts.size()-1 is a huge nonsense number (today often 4294967295, but it depends on the platform) and the loop will use invalid indexes (with undefined behavior).
Here changing the variable to size_t i will remove the warning but is not going to help as the very same bug remains...
The core of the problem is that with unsigned values a < b-1 and a+1 < b are not the same thing even for very commonly used values like zero; this is why using unsigned types for non-negative values like container size is a bad idea and a source of bugs.
Also note that your code is not correct, portable C++ on platforms where that value doesn't fit in an int, as behavior on overflow is defined for unsigned types but not for regular signed integers. C++ code that relies on what happens when a signed integer gets past its limits has undefined behavior.
Even if you know what happens on a specific hardware platform note that the compiler/optimizer is allowed to assume that signed integer overflow never happens: for example a test like a < a+1 where a is a regular int can be considered always true by a C++ compiler.
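A sketch of the usual repair for the pts loop above (hypothetical, applying the a+1 < b observation): move the +1 to the other side so no unsigned subtraction ever happens.

// Draw connecting lines between the dots (now safe when pts is empty)
for (size_t i = 0; i + 1 < pts.size(); i++) {
    draw_line(pts[i], pts[i + 1]);
}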
It seems you are confusing 2 different issues: The encoding of something and the meaning of something. Here is an example: You see a number 97. This is a decimal encoding. But the meaning of this number is something completely different. It can denote the ASCII 'a' character, a very hot temperature, a geometrical angle in a triangle, etc. You cannot deduce meaning from encoding. Someone must supply a context to you (like the ASCII map, temperature etc).
Back to your question: 0x80000000 is encoding, while INT_MIN is meaning. They are not interchangeable and not comparable. On specific hardware, in some contexts, they might be equal, just as 97 and 'a' are equal in the ASCII context.
Compiler warns you about ambiguity in the meaning, not in the encoding. One way to give meaning to a specific encoding is the casting operator. Like (unsigned short)-17 or (student*)ptr;
On a 32-bit system, or on a 64-bit system where int is still 32 bits, int and unsigned int have a 32-bit encoding such as 0x80000000, but on a platform with 64-bit ints, INT_MIN would not be equal to this number.
Anyway - the answer to your question: in order to remove the warning you must give identical context to both left and right expressions of the comparison.
You can do it in many ways. For example:
(unsigned int)a == (unsigned int)0x80000000 or (__int64)a == (__int64)0x80000000 or even a crazy (char *)a == (char *)0x80000000 or any other way as long as you maintain the following rules:
You don't demote the encoding (do not reduce the amount of bits it requires). Like (char)a == (char)0x80000000 is incorrect because you demote 32 bits into 8 bits
You must give both the left side and the right side of the == operator the same context. Like (char *)a == (unsigned short)0x80000000 is incorrect and will yield an error/warning.
I want to give you another example of how crucial is the difference between encoding and meaning. Look at the code
char a = -7;
bool b = (a==-7) ? true : false;
What is the result of 'b'? The answer may shock you: it depends on your compiler.
Some compilers (Microsoft Visual Studio, typically) will compile a program in which b ends up true, while with the Android NDK compilers b ends up false.
The reason is that Android NDK treats 'char' type as 'unsigned char', while Visual studio treats 'char' as 'signed char'. So on Android phones the encoding of -7 actually has a meaning of 249 and is not equal to the meaning of (int)-7.
The correct way to fix this problem is to specifically define 'a' as signed char:
signed char a = -7;
bool b = (a==-7) ? true : false;
0x80000000 is considered unsigned by default.
You can avoid the warning like this:
if (a == (int)0x80000000)
    a = 42;
Edit after a comment:
Another (perhaps better) way would be
if ((unsigned)a == 0x80000000)
    a = 42;

Why do I get errors when using unsigned integers in an expression with C++?

Given the following piece of (pseudo-C++) code:
float x = 100, a = 0.1;
unsigned int height = 63, width = 63;
unsigned int hw = 31;

for (int row = 0; row < height; ++row)
{
    for (int col = 0; col < width; ++col)
    {
        float foo = x + col - hw + a * (col - hw);
        cout << foo << " ";
    }
    cout << endl;
}
The values of foo are screwed up for half of the array, in the places where (col - hw) would be negative. I figured that because col is an int and comes first, this part of the expression would be converted to int and become negative. Apparently it isn't: I get an unsigned value that has wrapped around, and I have no idea why.
How should I resolve this problem? Use casts for the whole or part of the expression? What type of casts (C-style or static_cast<...>)? Is there any overhead to using casts (I need this to work fast!)?
EDIT: I changed all my unsigned ints to regular ones, but I'm still wondering why I got that overflow in this situation.
Unsigned integers implement unsigned arithmetic. Unsigned arithmetic is modulo arithmetic: all values are reduced modulo 2^N, where N is the number of bits in the value representation of the unsigned type.
In simple words, unsigned arithmetic always produces non-negative values. Every time the expression should result in negative value, the value actually "wraps around" 2^N and becomes positive.
When you mix a signed and an unsigned integer in a [sub-]expression, the unsigned arithmetic "wins", i.e. the calculation is performed in the unsigned domain. For example, when you do col - hw, it is interpreted as (unsigned) col - hw. This means that for col == 0 and hw == 31 you will not get -31 as the result. Instead you will get UINT_MAX - 31 + 1, which is normally a huge positive value.
Having said that, I have to note that in my opinion it is always a good idea to use unsigned types to represent inherently non-negative values. In fact, in practice most (or at least half) of the integer variables in a C/C++ program should have unsigned types. Your attempt to use unsigned types in your example is well justified (if I understand the intent correctly). Moreover, I'd use unsigned for col and row as well. However, you have to keep in mind the way unsigned arithmetic works (as described above) and write your expressions accordingly. Most of the time an expression can be rewritten so that it doesn't cross the bounds of the unsigned range, i.e. most of the time there's no need to explicitly cast anything to a signed type. Otherwise, if you do need to work with negative values eventually, a well-placed cast to a signed type should solve the problem.
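As a small illustration of that last point (a hypothetical rewrite, not the answerer's code): one well-placed conversion to a signed type keeps the unsigned declarations while computing the difference in signed arithmetic.

// Compute the offset in signed arithmetic, then use it twice.
int diff = static_cast<int>(col) - static_cast<int>(hw);
float foo = x + diff + a * diff;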
How about making height, width, and hw signed ints? What are you really gaining by making them unsigned? Mixing signed and unsigned integers is always asking for trouble. At first glance, at least, it doesn't look like you gain anything by using unsigned values here. So, you might as well make them all signed and save yourself the trouble.
You have the conversion rules backwards — when you mix signed and unsigned versions of the same type, the signed operand is converted to unsigned.
If you want this to be fast you should static_cast all unsigned values before you start looping and use their int versions rather than unsigned int. You can still require inputs being unsigned and then just cast them on the way to your algorithm to retain the required domain for your function.
Casts won't happen automatically - uncast arithmetic still has its uses. The usual example is int / int = int, even if data is lost by not converting to float. I'd use signed int unless it's impossible to do so because INT_MAX is too small.

Worst side effects from char signedness (explanation of signedness effects on chars and casts)

I frequently work with libraries that use char when working with bytes in C++. The alternative is to define a "Byte" as unsigned char, but that is not the convention they decided to use. I frequently pass bytes from C# into the C++ DLLs and cast them to char to work with the library.
When casting ints to chars or chars to other simple types what are some of the side effects that can occur. Specifically, when has this broken code that you have worked on and how did you find out it was because of the char signedness?
Luckily I haven't run into this in my own code; I only used a signed-char casting trick back in an embedded systems class in school. I'm looking to better understand the issue since I feel it is relevant to the work I am doing.
One major risk is if you need to shift the bytes. A signed char keeps the sign-bit when right-shifted, whereas an unsigned char doesn't.
Here's a small test program:
#include <stdio.h>

int main (void)
{
    signed char a = -1;
    unsigned char b = 255;
    printf("%d\n%d\n", a >> 1, b >> 1);
    return 0;
}
It should print -1 and 127, even though a and b start out with the same bit pattern (given 8-bit chars, two's-complement and signed values using arithmetic shift).
In short, you can't rely on shift working identically for signed and unsigned chars, so if you need portability, use unsigned char rather than char or signed char.
The most obvious gotchas come when you need to compare the numeric value of a char with a hexadecimal constant when implementing protocols or encoding schemes.
For example, when implementing telnet you might want to do this.
// Check for IAC (hex FF) byte
if (ch == 0xFF)
{
    // ...
Or when testing for UTF-8 multi-byte sequences.
if (ch >= 0x80)
{
    // ...
Fortunately these errors don't usually survive very long as even the most cursory testing on a platform with a signed char should reveal them. They can be fixed by using a character constant, converting the numeric constant to a char or converting the character to an unsigned char before the comparison operator promotes both to an int. Converting the char directly to an unsigned won't work, though.
if (ch == '\xff') // OK
if ((unsigned char)ch == 0xff) // OK, so long as char has 8-bits
if (ch == (char)0xff) // Usually OK, relies on implementation defined behaviour
if ((unsigned)ch == 0xff) // still wrong
I've been bitten by char signedness in writing search algorithms that used characters from the text as indices into state trees. I've also had it cause problems when expanding characters into larger types, and the sign bit propagates causing problems elsewhere.
I found out when I started getting bizarre results and segfaults when searching texts other than the ones I'd used during initial development (obviously characters with values >127 or <0 are going to cause this, and they won't necessarily be present in your typical text files).
Always check a variable's signedness when working with it. Generally now I make types signed unless I have a good reason otherwise, casting when necessary. This fits in nicely with the ubiquitous use of char in libraries to simply represent a byte. Keep in mind that the signedness of plain char is not specified (unlike the other integer types), so you should give it special treatment and be mindful.
The one that most annoys me:
typedef char byte;
byte b = 12;
cout << b << endl;  // streams b as the character with code 12, not as the number 12
Sure it's cosmetics, but arrr...
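The usual workaround (a quick sketch): convert to a wider integer type before streaming so the numeric value is printed instead of the character with that code.

cout << static_cast<int>(b) << endl;  // prints 12 rather than a control character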
When casting ints to chars or chars to other simple types
The critical point is that casting a signed value from one primitive type to another (larger) type does not retain the bit pattern (assuming two's complement). A signed char with bit pattern 0xff is -1, while a signed short with the decimal value -1 is 0xffff. Casting an unsigned char with value 0xff to an unsigned short, however, yields 0x00ff. Therefore, always think about the proper signedness before you typecast to a larger or smaller data type. Never carry unsigned data in signed data types if you don't need to - if an external library forces you to do so, do the conversion as late as possible (or as early as possible if the external code acts as a data source).
The C and C++ language specifications define 3 data types for holding characters: char, signed char and unsigned char. The latter 2 have been discussed in other answers. Let's look at the char type.
The standard(s) say that the char data type may be signed or unsigned, and that the choice is an implementation decision. This means that some compilers, or versions of compilers, can implement char differently. The implication is that the char data type is not well suited to arithmetic or Boolean operations. For arithmetic and Boolean operations, the signed and unsigned versions of char will work fine.
In summary, there are 3 versions of the char data type. The char data type performs well for holding characters, but is not suited to arithmetic across platforms and translators, since its signedness is implementation defined.
You will fail miserably when compiling for multiple platforms because the C++ standard doesn't define char to be of a certain "signedness".
Therefore GCC introduces -fsigned-char and -funsigned-char options to force certain behavior. More on that topic can be found here, for example.
EDIT:
As you asked for examples of broken code, there are plenty of possibilities to break code that processes binary data. For example, imagine you process 8-bit audio samples (range -128 to 127) and you want to halve the volume. Now imagine this scenario (in which the naive programmer assumes char == signed char):
char sampleIn;

// If the sample is -1 (= almost silent), and the compiler treats char as unsigned,
// then the value of 'sampleIn' will be 255
read_one_byte_sample(&sampleIn);

// Ok, halve the volume. The value will be 127!
char sampleOut = sampleIn / 2;

// And write the processed sample to the output file, for example.
// (unsigned char)127 has the exact same bit pattern as (signed char)127,
// so this will write a sample with the loudest volume!!
write_one_byte_sample_to_output_file(&sampleOut);
I hope you like that example ;-) But to be honest I've never really come across such problems, not even as a beginner, as far as I can remember...
Hope this answer is sufficient for you downvoters. What about a short comment?
Sign extension. The first version of my URL encoding function produced strings like "%FFFFFFA3".
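A sketch of how that happens (hypothetical code, not the original encoder): a byte with the high bit set is sign-extended when the char is promoted to int, unless it is converted to unsigned char first.

#include <stdio.h>

int main(void)
{
    char c = '\xA3';   // high bit set; negative on platforms where char is signed
    char buf[16];

    sprintf(buf, "%%%02X", c);                  // sign-extended: prints "%FFFFFFA3"
    printf("%s\n", buf);

    sprintf(buf, "%%%02X", (unsigned char)c);   // converted first: prints "%A3"
    printf("%s\n", buf);

    return 0;
}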