I am going through the book "Accelerated C++" by Andrew Koenig and Barbara E. Moo, and I have some questions about the main example in chapter 2. The code can be summarized as below, and it compiles without warnings/errors with g++:
#include <string>
using std::string;

int main()
{
    const string greeting = "Hello, world!";
    // OK
    const int pad = 1;
    // KO
    // int pad = 1;
    // OK
    // unsigned int pad = 1;
    const string::size_type cols = greeting.size() + 2 + pad * 2;
    string::size_type c = 0;
    if (c == 1 + pad)
    {;}
    return 0;
}
However, if I replace const int pad = 1; with int pad = 1;, the g++ compiler will return a warning:
warning: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
if (c == 1 + pad)
If I replace const int pad = 1; with unsigned int pad = 1;, the g++ compiler will not return a warning.
I understand why g++ returns the warning, but I am not sure about the three points below:
Is it safe to use an unsigned int in order to compare with a std::string::size_type? The compiler does not return a warning in that case but I am not sure if it is safe.
Why is the compiler not giving a warning with the original code const int pad = 1;? Is the compiler automatically converting the variable pad to an unsigned int?
I could also replace const int pad = 1; with string::size_type pad = 1;, but the meaning of the variable pad is not really linked to a string size, in my opinion. Still, would this be the best approach in that case, to avoid having different types in the comparison?
From the compiler's point of view:
It is unsafe to compare signed and unsigned variables (non-constants).
It is safe to compare two unsigned variables of different sizes.
It is safe to compare an unsigned variable with a signed constant if the compiler can check that the constant is in the allowed range for the type of the signed variable (e.g. for a 16-bit signed integer it is safe to use a constant in the range [0..32767]).
So the answers to your questions:
Yes, it is safe to compare unsigned int and std::string::size_type.
There is no warning because the compiler can perform the safety check (while compiling :)).
There is no problem with using different unsigned types in a comparison. Use unsigned int.
Comparing signed and unsigned values is "dangerous" in the sense that you may not get what you expect when the signed value is negative: it may well behave as a very large unsigned value, and thus a > b gives true when a = -1 and b = 100. (The use of const int works because the compiler knows the value isn't changing and thus can say "well, this value is always 1, so it works fine here".)
As long as the value you want to compare fits in an unsigned int (on typical machines, a little over 4 billion), you are fine.
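To make the pitfall concrete, here is a minimal sketch (my own example, not from the book) of how a negative int misbehaves in such a comparison:

#include <iostream>

int main()
{
    int a = -1;
    unsigned int b = 100;
    // a is converted to unsigned int before the comparison, so -1
    // becomes UINT_MAX and the condition is (surprisingly) true.
    if (a > b)
        std::cout << "-1 > 100 when compared as unsigned\n";
    return 0;
}

Compiled with g++ -Wsign-compare, this comparison triggers the same warning quoted above, yet it compiles and prints the message at run time.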
If you are using std::string with the default allocator (which is likely), then size_type is actually size_t.
[support.types]/6 defines that size_t is
an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object.
So it's not technically guaranteed to be an unsigned int, but I believe it is defined that way in most cases.
Now regarding your second question: if you use const int something = 2, the compiler sees that this integer is a) never negative and b) never changes, so it's always safe to compare this variable with size_t. In some cases the compiler may optimize the variable out completely and simply replace all its occurrences with 2.
I would say that it is better to use size_type everywhere you refer to the size of something, since it is more verbose.
What the compiler warns about is the comparison of unsigned and signed integer types. The signed integer can be negative, and the result is counter-intuitive: the signed value is converted to unsigned before the comparison, which means a negative number will compare greater than a positive one.
Is it safe to use an unsigned int in order to compare with a std::string::size_type? The compiler does not return a warning in that case but I am not sure if it is safe.
Yes, they are both unsigned, so the semantics are what you expect. If their ranges differ, the narrower type is converted to the wider one.
Why is the compiler not giving a warning with the original code const int pad = 1;? Is the compiler automatically converting the variable pad to an unsigned int?
This is because of how the compiler is constructed. The compiler parses and to some extent optimizes the code before warnings are issued. The important point is that at the point where this warning is being considered, the compiler knows that the signed integer is 1, and then it's safe to compare it with an unsigned integer.
I could also replace const int pad = 1; by string::size_type pad = 1;, but the meaning of the variable pad is not really linked to a string size in my opinion. Still, would this be the best approach in that case to avoid having different types in the comparison?
If you don't want it to be constant, the best solution would probably be to make it at least an unsigned integer type. However, you should be aware that there is no guaranteed relation between the normal integer types and the size types; for example, unsigned int may be narrower than, wider than, or equal to size_t and size_type (and the latter two may also differ).
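To illustrate the widening rule, a minimal sketch (mine, not from the book):

#include <iostream>
#include <string>

int main()
{
    std::string greeting = "Hello, world!";
    unsigned int narrow = 13;  // may be narrower than string::size_type
    // Both operands are unsigned; if the widths differ, the narrower
    // value is converted to the wider type without changing its value,
    // so the comparison is safe and no warning is issued.
    std::cout << (greeting.size() == narrow) << '\n';  // prints 1
    return 0;
}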
Here is a C++ class for reversing bits, from LeetCode discuss: https://leetcode.com/discuss/29324/c-solution-9ms-without-loop-without-calculation
For example, given input 43261596 (represented in binary as 00000010100101000001111010011100), return 964176192 (represented in binary as 00111001011110000010100101000000).
Can anyone explain it? Thank you so very much!!
class Solution {
public:
    uint32_t reverseBits(uint32_t n) {
        struct bs
        {
            unsigned int _00:1; unsigned int _01:1; unsigned int _02:1; unsigned int _03:1;
            unsigned int _04:1; unsigned int _05:1; unsigned int _06:1; unsigned int _07:1;
            unsigned int _08:1; unsigned int _09:1; unsigned int _10:1; unsigned int _11:1;
            unsigned int _12:1; unsigned int _13:1; unsigned int _14:1; unsigned int _15:1;
            unsigned int _16:1; unsigned int _17:1; unsigned int _18:1; unsigned int _19:1;
            unsigned int _20:1; unsigned int _21:1; unsigned int _22:1; unsigned int _23:1;
            unsigned int _24:1; unsigned int _25:1; unsigned int _26:1; unsigned int _27:1;
            unsigned int _28:1; unsigned int _29:1; unsigned int _30:1; unsigned int _31:1;
        } *b = (bs*)&n,
        c =
        {
            b->_31, b->_30, b->_29, b->_28
            , b->_27, b->_26, b->_25, b->_24
            , b->_23, b->_22, b->_21, b->_20
            , b->_19, b->_18, b->_17, b->_16
            , b->_15, b->_14, b->_13, b->_12
            , b->_11, b->_10, b->_09, b->_08
            , b->_07, b->_06, b->_05, b->_04
            , b->_03, b->_02, b->_01, b->_00
        };
        return *(unsigned int *)&c;
    }
};
Consider casting as providing a different layout stencil on memory.
Using this stencil picture, the code is a layout of a stencil of 32-bits on an unsigned integer memory location.
So instead of treating the memory as a uint32_t, it is treating the memory as 32 bits.
A pointer to the 32-bit structure is created.
The pointer is assigned to the same memory location as the uint32_t variable.
The pointer will allow different treatment of the memory location.
A temporary variable, of 32-bits (using the structure), is created.
The variable is initialized using an initialization list.
The bit fields in the initialization list are from the original variable, listed in reverse order.
So, in the list:
new bit 0 <-- old bit 31
new bit 1 <-- old bit 30
The foundation of this approach relies on initialization lists.
The author is letting the compiler reverse the bits.
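For contrast, here is a conventional, portable sketch of my own that reverses the bits with shifts and masks, without relying on how the compiler lays out bit fields:

#include <cstdint>

uint32_t reverseBitsPortable(uint32_t n)
{
    uint32_t result = 0;
    for (int i = 0; i < 32; ++i)
    {
        result = (result << 1) | (n & 1);  // append the lowest bit of n
        n >>= 1;                           // consume that bit
    }
    return result;
}

Unlike the bit-field version, this does not depend on the implementation-defined packing order of bit fields.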
The solution uses brute force to reverse the bits.
It declares a bitfield structure (that's when the members are followed by :1) with 32 bit fields of one bit each.
The 32-bit input is then seen as such a structure, by casting the address of the input to a pointer to the structure. Then c is declared as a variable of that type, which is initialized by reversing the order of the bits.
Finally, the bitfield represented by c is reinterpreted as an integer and you're done.
The assembler is not very interesting, as the gcc explorer shows:
https://goo.gl/KYHDY6
It doesn't convert per se; it just looks at the same memory address differently. It uses the value of n, but takes a pointer to that address, typecasts the pointer, and that way you can interpret the number as a struct of 32 individual bits. So through this struct b you have access to the individual bits of the number.
Then, of a new struct c, each bit is bluntly set by putting bit 31 of the number in bit 0 of the output struct c, bit 30 in bit 1, etcetera.
After that, the value at the memory location of the struct is returned.
First of all, the posted code has a small bug. The line
return *(unsigned int *)&c;
will not return an accurate number if sizeof(unsigned int) is not equal to sizeof(uint32_t).
That line should be
return *(uint32_t*)&c;
Coming to the question of how it works, I will try to explain it with a smaller type, uint8_t.
The function
uint8_t reverseBits(uint8_t n) {
    struct bs
    {
        unsigned int _00:1; unsigned int _01:1; unsigned int _02:1; unsigned int _03:1;
        unsigned int _04:1; unsigned int _05:1; unsigned int _06:1; unsigned int _07:1;
    } *b = (bs*)&n,
    c =
    {
        b->_07, b->_06, b->_05, b->_04
        , b->_03, b->_02, b->_01, b->_00
    };
    return *(uint8_t *)&c;
}
uses a local struct. The local struct is defined as:
struct bs
{
    unsigned int _00:1; unsigned int _01:1; unsigned int _02:1; unsigned int _03:1;
    unsigned int _04:1; unsigned int _05:1; unsigned int _06:1; unsigned int _07:1;
};
That struct has eight members. Each member of the struct is a bitfield of width 1. The space required for an object of type bs is 8 bits.
If you separate the definition of the struct and the variables of that type, the function will be:
uint8_t reverseBits(uint8_t n) {
    struct bs
    {
        unsigned int _00:1; unsigned int _01:1; unsigned int _02:1; unsigned int _03:1;
        unsigned int _04:1; unsigned int _05:1; unsigned int _06:1; unsigned int _07:1;
    };
    bs *b = (bs*)&n;
    bs c =
    {
        b->_07, b->_06, b->_05, b->_04
        , b->_03, b->_02, b->_01, b->_00
    };
    return *(uint8_t *)&c;
}
Now, let's say the input to the function is 0xB7, which is 1011 0111 in binary. The line
bs *b = (bs*)&n;
says:
Take the address of n ( &n )
Treat it like it is a pointer of type bs* ( (bs*)&n )
Assign the pointer to a variable. (bs *b =)
By doing that, we are able to pick each bit of n and get their values by using the members of b. At the end of that line,
The value of b->_00 is 1
The value of b->_01 is 0
The value of b->_02 is 1
The value of b->_03 is 1
The value of b->_04 is 0
The value of b->_05 is 1
The value of b->_06 is 1
The value of b->_07 is 1
The statement
bs c =
{
    b->_07, b->_06, b->_05, b->_04
    , b->_03, b->_02, b->_01, b->_00
};
simply creates c such that the bits of c are reversed from the bits of *b.
The line
return *(uint8_t *)&c;
says:
Take the address of c, whose value is the bit pattern 1110 1101.
Treat it like it is a pointer of type uint8_t*.
Dereference the pointer and return the resulting uint8_t.
That returns an uint8_t whose value is bitwise reversed from the input argument.
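To check the walkthrough, here is a small test harness of my own (paste the uint8_t reverseBits above it; it assumes a typical platform where the bit-field trick behaves as described):

#include <cstdint>
#include <cstdio>

// uint8_t reverseBits(uint8_t n) { ... }  // as defined above

int main()
{
    uint8_t in = 0xB7;              // 1011 0111
    uint8_t out = reverseBits(in);  // expected: 0xED, i.e. 1110 1101
    std::printf("0x%02X -> 0x%02X\n", (unsigned)in, (unsigned)out);
    return 0;
}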
This isn't exactly obfuscated but a comment or two would assist the innocent. The key is in the middle of the variable declarations, and the first step is to recognize that there is only one line of 'code' here, everything else is variable declarations and initialization.
Between declaration and initialization we find:
} *b = (bs*)&n,
c =
{
This declares a variable 'b' which is a pointer (*) to the struct "bs" just defined. It then casts the address of the function argument 'n', a uint32_t, to the type pointer-to-bs and assigns it to 'b', effectively creating a union of uint32_t and the bit array bs.
A second variable, an actual struct bs, named "c", is then declared, and it is initialized through the pointer 'b'. b->_31 initializes c._00, and so on.
So after "b" and "c" are created, in that order, there's nothing left to do but return the value of "c".
The author of the code, and the compiler, know that after a struct definition ends, variables of that type (or of types related to it) can be declared before the closing ";", and that's why @Thomas Matthews closes with, "The author is letting the compiler reverse the bits."
Is it possible that converting from size_t to unsigned int results in overflow?
size_t x = foo();                  // foo() returns a value of type size_t
unsigned int ux = (unsigned int)x;
ux == x                            // Is the result of that line always 1?
Language: C++. Platform: any.
Yes, it's possible; size_t and int don't necessarily have the same size. It's actually very common to have a 64-bit size_t and a 32-bit int.
C++11 draft N3290 says this in §18.2/6:
The type size_t is an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object.
unsigned int, on the other hand, is only required to be able to store values from 0 to UINT_MAX (defined in <climits> and following the C standard header <limits.h>), which is only guaranteed to be at least 65535 (2^16 - 1).
Yes, overflow can occur on some platforms. For example, size_t can be defined as unsigned long, which can easily be bigger than unsigned int.
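If you need the narrowing conversion anyway, a checked sketch (the function name is mine) could look like this:

#include <cstddef>
#include <limits>
#include <stdexcept>

unsigned int to_unsigned_int(std::size_t x)
{
    // Reject values that are not representable in unsigned int,
    // instead of silently truncating them.
    if (x > std::numeric_limits<unsigned int>::max())
        throw std::out_of_range("size_t value does not fit in unsigned int");
    return static_cast<unsigned int>(x);
}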
In case of integer overflow, what is the result of (unsigned int) * (int): unsigned or int? What type does the array index operator (operator[]) take for char*: int, unsigned int, or something else?
I was auditing the following function, and suddenly this question arose. The function has a vulnerability at line 17.
// Create a character array and initialize it with init[]
// repeatedly. The size of this character array is specified by
// w*h.
char *function4(unsigned int w, unsigned int h, char *init)
{
    char *buf;
    int i;

    if (w*h > 4096)
        return (NULL);
    buf = (char *)malloc(4096+1);
    if (!buf)
        return (NULL);
    for (i=0; i<h; i++)
        memcpy(&buf[i*w], init, w);   // line 17
    buf[4096] = '\0';
    return buf;
}
Consider the case where both w and h are very large unsigned integers. The multiplication at line 9 can then wrap around, so it has a chance to pass the validation.
Now the problem is at line 17. Multiplying int i with unsigned int w: if the result is int, it is possible that the product is negative, resulting in accessing a position before buf. If the result is unsigned int, the product will always be non-negative, resulting in accessing a position after buf.
It's hard to write code to demonstrate this: the range of int is too large. Does anyone have ideas on this?
Is there any documentation that specifies the type of the product? I have searched for it, but so far haven't found anything.
I suppose that as far as the vulnerability is concerned, whether (unsigned int) * (int) produces unsigned int or int doesn't matter, because in the compiled object file, they are just bytes. The following code works the same no matter the type of the product:
unsigned int x = 10;
int y = -10;
printf("%d\n", x * y); // print x * y in signed integer
printf("%u\n", x * y); // print x * y in unsigned integer
Therefore, it does not matter what type the multiplication returns. What matters is whether the consumer function takes int or unsigned.
The question here is not how bad the function is, or how to improve the function to make it better. The function undoubtedly has a vulnerability. The question is about the exact behavior of the function, based on the prescribed behavior from the standards.
Do the w*h calculation in long long and check whether the result is bigger than UINT_MAX.
EDIT: alternative: after the multiplication, check whether (w*h)/h != w (is this always the case when it overflows?! It should be, right?)
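A sketch of both suggestions (my code; it assumes unsigned long long is wider than unsigned int, which holds on common platforms):

#include <climits>

// Returns true when w*h is at most 4096 and did not wrap around.
bool size_ok(unsigned int w, unsigned int h)
{
    // Suggestion 1: do the multiplication in a wider type.
    unsigned long long wide = (unsigned long long)w * h;
    if (wide > 4096)
        return false;

    // Suggestion 2: the division test, no wider type needed. If the
    // unsigned product wrapped around, (w*h)/h can no longer equal w.
    if (h != 0 && (w * h) / h != w)
        return false;

    return true;
}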
To answer your question: the type of an expression multiplying an int and an unsigned int will be an unsigned int in C/C++.
To answer your implied question, one decent way to deal with possible overflow in integer arithmetic is to use the "IntSafe" set of routines from Microsoft:
http://blogs.msdn.com/michael_howard/archive/2006/02/02/523392.aspx
It's available in the SDK and contains inline implementations so you can study what they're doing if you're on another platform.
Ensure that w * h doesn't overflow by limiting w and h.
The type of w*i is unsigned in your case. If I read the standard correctly, the rule is that the operands are converted to the larger type (with its signedness), or to the unsigned type corresponding to the signed type (which is unsigned int in your case).
However, even if it's unsigned, that doesn't prevent the wraparound (writing to memory before buf), because it might be the case (on the i386 platform, it is) that p[-1] is the same as p[-1u]. Anyway, in your case, both buf[-1] and buf[big unsigned number] would be undefined behavior, so the signed/unsigned question is not that important.
Note that signed/unsigned matters in other contexts - eg. (int)(x*y/2) gives different results depending on the types of x and y, even in the absence of undefined behaviour.
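For instance, a small demonstration of my own (assuming 32-bit int):

#include <cstdio>

int main()
{
    int sx = -3;
    unsigned int ux = -3;   // wraps to 4294967293 (2^32 - 3)
    int y = 5;
    // Signed: -3 * 5 / 2 = -15 / 2 = -7 (truncates toward zero).
    std::printf("%d\n", (int)(sx * y / 2));
    // Unsigned: the product wraps modulo 2^32, so the quotient is
    // a large positive number instead: prints 2147483640.
    std::printf("%d\n", (int)(ux * y / 2));
    return 0;
}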
I would solve your problem by checking for overflow on line 9; since 4096 is a pretty small constant and 4096*4096 doesn't overflow on most architectures (you need to check), I'd do
if (w>4096 || h>4096 || w*h > 4096)
    return (NULL);
This leaves out the case when w or h are 0, you might want to check for it if needed.
In general, you could check for overflow like this:
if (w*h > 4096 || (w*h)/w != h || (w*h)%w != 0)
In C/C++ the p[n] notation is really a shortcut for writing *(p+n), and this pointer arithmetic takes the sign into account. So p[-1] is valid and refers to the value immediately before *p.
So the sign really matters here. The result of an arithmetic operator on integers follows a set of rules defined by the standard: the integer promotions and the usual arithmetic conversions.
Check out this page: INT02-C. Understand integer conversion rules
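A small example of my own showing that a negative subscript is legitimate pointer arithmetic as long as it stays inside the array, which is exactly what buf[i*w] cannot guarantee:

#include <cstdio>

int main()
{
    int arr[5] = {10, 20, 30, 40, 50};
    int *p = &arr[2];            // p points at the element 30
    // p[-1] is *(p - 1); well-defined here because it still lands
    // inside arr. With &buf[i*w] and a huge unsigned i*w, the
    // resulting address is outside the buffer, which is undefined.
    std::printf("%d\n", p[-1]);  // prints 20
    return 0;
}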
Two changes make it safer:
if (w >= 4096 || h >= 4096 || w*h > 4096) return NULL;
...
unsigned i;
Note also that writing to or reading from past the end of the buffer is no less bad an idea than accessing memory before it. So the question is not whether i*w may become negative, but whether 0 <= i*w + w <= 4096 holds.
So it's not the type that matters, but the result of i*w.
For example, it doesn't make a difference whether this is (unsigned)0x80000000 or (int)0x80000000, the program will seg-fault anyway.
For C, refer to "Usual arithmetic conversions" (C99: Section 6.3.1.8, ANSI C K&R A6.5) for details on how the operands of the mathematical operators are treated.
In your example the following rules apply:
C99:
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
Otherwise, both operands are converted to the unsigned integer type corresponding to the type of the operand with signed integer type.
ANSI C:
Otherwise, if either operand is unsigned int, the other is converted to unsigned int.
Why not just declare i as unsigned int? Then the problem goes away.
In any case, i*w is guaranteed to be <= 4096, as the code tests for this, so it's never going to overflow.
memcpy(&buf[iw > -1 ? iw < 4097? iw : 0 : 0], init, w);
I don't think the triple calculation of iw does degrade the perfomance)
w*h could overflow if w and/or h are sufficiently large and the following validation could pass.
9. if (w*h > 4096)
10. return (NULL);
On mixed int and unsigned int operations, the int is converted to unsigned int, in which case a negative value of 'i' would become a large positive value. In that case
&buf[i*w]
would be accessing an out-of-bounds location.
Unsigned arithmetic is done as modular (or wrap-around) arithmetic, so the product of two large unsigned ints can easily be less than 4096. The multiplication of int and unsigned int will result in an unsigned int (see the usual arithmetic conversions in clause 5 of the C++ standard).
Therefore, given large w and a suitable value of h, you can indeed get into trouble.
Making sure integer arithmetic doesn't overflow is difficult. One easy way is to convert to floating point, do a floating-point multiplication, and see whether the result is at all reasonable. As qwerty suggested, long long would be usable, if available on your implementation. (It's a common extension in C90 and C++, exists in C99, and will be in C++0x.)
There are 3 paragraphs in the current C1X draft on calculating (UNSIGNED TYPE1) x (SIGNED TYPE2) in 6.3.1.8 Usual arithmetic conversions, N1494,
WG 14: C - Project status and milestones
Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.
Otherwise, both operands are converted to the unsigned integer type corresponding to the type of the operand with signed integer type.
So if a is unsigned int and b is int, parsing of (a * b) should generate code (a * (unsigned int)b). Will overflow if b < 0 or a * b > UINT_MAX.
If a is unsigned int and b is long of greater size, (a * b) should generate ((long)a * (long)b). Will overflow if a * b > LONG_MAX or a * b < LONG_MIN.
If a is unsigned int and b is long of the same size, (a * b) should generate ((unsigned long)a * (unsigned long)b). Will overflow if b < 0 or a * b > ULONG_MAX.
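A short demonstration of the first case (my example; it assumes 32-bit unsigned int):

#include <cstdio>

int main()
{
    unsigned int a = 10;
    int b = -2;
    // b is converted to unsigned int (equal rank), so the product
    // wraps around modulo 2^32 instead of being -20.
    unsigned int p = a * b;
    std::printf("%u\n", p);  // prints 4294967276, i.e. 2^32 - 20
    return 0;
}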
On your second question, about the type expected by the "indexer", the answer appears to be "integer type", which allows for any (signed) integer index.
6.5.2.1 Array subscripting
Constraints
1 One of the expressions shall have type ‘‘pointer to complete object type’’, the other expression shall have integer type, and the result has type ‘‘type’’.
Semantics
2 A postfix expression followed by an expression in square brackets [] is a subscripted designation of an element of an array object. The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2))). Because of the conversion rules that apply to the binary + operator, if E1 is an array object (equivalently, a pointer to the initial element of an array object) and E2 is an integer, E1[E2] designates the E2-th element of E1 (counting from zero).
It is up to the compiler to perform static analysis and warn the developer about the possibility of a buffer overrun when the pointer expression is an array variable and the index may be negative. The same goes for warnings about possible array size overruns even when the index is positive or unsigned.
To actually answer your question, without specifying the hardware you're running on, you don't know, and in code intended to be portable, you shouldn't depend on any particular behavior.
Is it safe to convert, say, from an unsigned char * to a signed char * (or just a char *)?
The access is well-defined, you are allowed to access an object through a pointer to signed or unsigned type corresponding to the dynamic type of the object (3.10/15).
Additionally, signed char is guaranteed not to have any trap values and as such you can safely read through the signed char pointer no matter what the value of the original unsigned char object was.
You can, of course, expect that the values you read through one pointer may differ from the values you read through the other one.
Edit: regarding sellibitze's comment, this is what 3.9.1/1 says.
A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.9); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers.
So indeed it seems that signed char may have trap values. Nice catch!
The conversion should be safe, as all you're doing is converting from one type of character to another, and they have the same size. Just be aware of what sort of data your code is expecting when you dereference the pointer, as the numeric ranges of the two data types are different. (I.e. if the value pointed to was originally a large positive number as an unsigned char, it may appear as a negative number once the pointer is converted to a signed char* and you dereference it.)
Casting changes the type, but does not affect the bit representation. Casting from unsigned char to signed char does not change the value at all, but it affects the meaning of the value.
Here is an example:
#include <stdio.h>

int main(int argc, char** argv) {
    /* example 1 */
    unsigned char a_unsigned_char = 192;
    signed char a_signed_char = a_unsigned_char;
    printf("%d, %d\n", a_unsigned_char, a_signed_char); // 192, -64

    /* example 2 */
    unsigned char b_unsigned_char = 32;
    signed char b_signed_char = b_unsigned_char;
    printf("%d, %d\n", b_unsigned_char, b_signed_char); // 32, 32

    return 0;
}
In the first example, you have an unsigned char with value 192, or 11000000 in binary. After the cast to signed char, the value is still 11000000, but that happens to be the 2s-complement representation of -64. Signed values are stored in 2s-complement representation.
In the second example, our unsigned initial value (32) is less than 128, so it seems unaffected by the cast. The binary representation is 00100000, which is still 32 in 2s-complement representation.
To "safely" cast from unsigned char to signed char, ensure the value is less than 128.
It depends on how you are going to use the pointer. You are just converting the pointer type.
You can safely convert an unsigned char* to a char*, as the function you are calling will be expecting the behavior of a char pointer; but if your char value goes over 127, you will get a result that is not what you expected, so just make certain that what you have in your unsigned array is valid for a signed array.
I've seen it go wrong in a few ways, converting to a signed char from an unsigned char.
One, if you're using it as an index to an array, that index could go negative.
Secondly, if inputted to a switch statement, it may result in a negative input which often is something the switch isn't expecting.
Third, it has different behavior on an arithmetic right shift:
int x = ...;
char c = 128;           // implementation-defined if char is signed; typically -128
unsigned char u = 128;
c >> x;
has a different result than
u >> x;
Because the former is sign-extended and the latter isn't.
Fourth, a signed character overflows at a different point than an unsigned character.
So a common overflow check,
(c + x > c)
could return a different result than
(u + x > u)
Safe if you are dealing with only ASCII data.
I'm astonished it hasn't been mentioned yet: Boost numeric_cast should do the trick, but only for the data, of course.
Pointers are always pointers. By casting them to a different type, you only change the way the compiler interprets the data pointed to.
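For the data itself, a sketch of boost::numeric_cast (requires Boost):

#include <boost/numeric/conversion/cast.hpp>
#include <iostream>

int main()
{
    unsigned char big = 200;
    try {
        // Throws because 200 is not representable as signed char.
        signed char s = boost::numeric_cast<signed char>(big);
        std::cout << static_cast<int>(s) << '\n';
    } catch (const boost::numeric::bad_numeric_cast& e) {
        std::cout << "conversion failed: " << e.what() << '\n';
    }
    return 0;
}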