Longest string in an array of strings in C++ - c++

I was trying to get the maximum length of a string from an array of strings.
I wrote this code:
char a[100][100] = {"solol","a","1234567","123","1234"};
int max = -1;
for (int i = 0; i < 5; i++)
    if (max < strlen(a[i]))
        max = strlen(a[i]);
cout << max;
The output it gives is -1.
But when I initialize max with 0 instead of -1, the code works fine. Why is that?

The function strlen returns an unsigned integral type (size_t), so when you compare it to max, which is signed, max gets converted to unsigned, and that -1 turns into something like 0xffffffff (the actual value depends on your architecture/compiler), which is larger than the length of any of those strings.
Change max's type to size_t, and never compare unsigned with signed integrals.

size_t is an unsigned integer type.
From C++ Standard#4.6:
A prvalue of an integer type other than bool, char16_t, char32_t, or wchar_t whose integer conversion
rank (4.15) is less than the rank of int can be converted to a prvalue of type int if int can represent all the
values of the source type; otherwise, the source prvalue can be converted to a prvalue of type unsigned int.
The value of max is -1, which, converted to unsigned, represents the maximum value that the unsigned integer type can hold. In the expression max<strlen(a[i]), max is converted to the unsigned type that strlen returns and compares as that maximum value. Hence the condition max<strlen(a[i]) will never be true for the given input, and the value of max will not change.
The longest string in the array of strings a has length 7. When you initialize max with 0 or 1, the expression max<strlen(a[i]) evaluates to true for any string whose length is greater than the value of max. Hence you will get the expected output.

strlen returns an unsigned type (size_t), and when doing the comparison the -1 gets converted to the maximum value an unsigned type can hold.
To fix the code, change the max definition to unsigned int max = 0

Implicit type conversion (coercion)
When evaluating expressions, the compiler breaks each expression down into individual subexpressions. The arithmetic operators require their operands to be of the same type. To ensure this, the compiler uses the following rules:
If an operand is an integer that is narrower than an int, it undergoes integral promotion (as described above) to int or unsigned int.
If the operands still do not match, then the compiler finds the highest priority operand and implicitly converts the other operand to match.
The priority of operands is as follows:
long double (highest)
double
float
unsigned long long
long long
unsigned long
long
unsigned int
int (lowest)
In your case one of the operands has type int and the other has type size_t, so max was converted to type size_t, and the bit pattern of -1 is the largest possible size_t value.
You might also want to look in the following parts of the Standard [conv.prom],
[conv.integral], [conv.rank].

I think this might work for you.
#include <algorithm>
#include <cstring>
const size_t ARRAYSIZE = 100;
char stringArray[ARRAYSIZE][100] = {"solol","a","1234567","123","1234"};
auto longestString = std::max_element(stringArray, stringArray + ARRAYSIZE, [](const auto& s1, const auto& s2){
    return strlen(s1) < strlen(s2);
});
size_t maxSize = strlen(*longestString);
This works for me: max_element returns an iterator to the longest char array, so just call strlen on the element it points to and you're done. For your example, it returns 7.
Hope it helps.

Related

Comparing unsigned integer with negative literals

I have this simple C program.
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
bool foo (unsigned int a) {
return (a > -2L);
}
bool bar (unsigned long a) {
return (a > -2L);
}
int main() {
printf("foo returned = %d\n", foo(99));
printf("bar returned = %d\n", bar(99));
return 0;
}
Output when I run this -
foo returned = 1
bar returned = 0
My question is why does foo(99) return true but bar(99) return false.
To me it makes sense that bar would return false. For simplicity let's say longs are 8 bits; then (using two's complement for the signed value):
99 == 0110 0011
-2 == unsigned 254 == 1111 1110
So clearly the CMP instruction will see that 1111 1110 is bigger and return false.
But I don't understand what is going on behind the scenes in the foo function. The assembly for foo seems to be hardcoded to always return mov eax,0x1. I would have expected foo to do something similar to bar. What is going on here?
This is covered in C classes and is specified in the documentation. Here is how you use documents to figure this out.
In the 2018 C standard, you can look up > or “relational expressions” in the index to see they are discussed on pages 68-69. On page 68, you will find clause 6.5.8, which covers relational operators, including >. Reading it, paragraph 3 says:
If both of the operands have arithmetic type, the usual arithmetic conversions are performed.
“Usual arithmetic conversions” is listed in the index as defined on page 39. Page 39 has clause 6.3.1.8, “Usual arithmetic conversions.” This clause explains that operands of arithmetic types are converted to a common type, and it gives rules determining the common type. For two integer types of different signedness, such as the unsigned long and the long int in bar (a and -2L), it says that, if the unsigned type has rank greater than or equal to the rank of the other type, the signed type is converted to the unsigned type.
“Rank” is not in the index, but you can search the document to find it is discussed in clause 6.3.1.1, where it tells you the rank of long int is greater than the rank of int, and that any unsigned type has the same rank as the corresponding signed type.
Now you can consider a > -2L in bar, where a is unsigned long. Here we have an unsigned long compared with a long. They have the same rank, so -2L is converted to unsigned long. Conversion of a signed integer to unsigned is discussed in clause 6.3.1.3. It says the value is converted by wrapping it modulo ULONG_MAX+1, so converting the signed long −2 produces ULONG_MAX+1−2 = ULONG_MAX−1, which is a large integer. Then comparing a, which has the value 99, to that large integer with > yields false, so zero is returned.
For foo, we continue with the rules for the usual arithmetic conversions. When the unsigned type does not have rank greater than or equal to the rank of the signed type, but the signed type can represent all the values of the type of the operand with unsigned type, the operand with the unsigned type is converted to the operand of the signed type. In foo, a is unsigned int and -2L is long int. Presumably in your C implementation, long int is 64 bits, so it can represent all the values of a 32-bit unsigned int. So this rule applies, and a is converted to long int. This does not change the value. So the original value of a, 99, is compared to −2 with >, and this yields true, so one is returned.
In the first function
bool foo (unsigned int a) {
return (a > -2L);
}
both operands of the expression a > -2L have the type long (the first operand is converted to the type long by the usual arithmetic conversions, because the rank of the type long is greater than the rank of the type unsigned int and, on the system used, all values of the type unsigned int can be represented by the type long). And it is evident that the positive value 99L is greater than the negative value -2L.
The first function could produce the result 0 provided that sizeof( long ) is equal to sizeof( unsigned int ). In this case the type long is unable to represent all (positive) values of the type unsigned int. As a result, due to the usual arithmetic conversions, both operands will be converted to the type unsigned long.
For example, running the function foo using MS VS 2019, where sizeof( long ) is equal to 4, the same as sizeof( unsigned int ), you will get the result 0.
Here is a demonstration program written in C++ that visually shows the reason why the result of a call of the function foo using MS VS 2019 can be equal to 0.
#include <iostream>
#include <iomanip>
#include <type_traits>

int main()
{
    unsigned int x = 0;
    long y = 0;

    std::cout << "sizeof( unsigned int ) = " << sizeof( unsigned int ) << '\n';
    std::cout << "sizeof( long ) = " << sizeof(long) << '\n';

    std::cout << "std::is_same_v<decltype( x + y ), unsigned long> is "
              << std::boolalpha
              << std::is_same_v<decltype( x + y ), unsigned long>
              << '\n';
}
The program output is
sizeof( unsigned int ) = 4
sizeof( long ) = 4
std::is_same_v<decltype( x + y ), unsigned long> is true
That is, in general the result of the first function is implementation-defined.
In the second function
bool bar (unsigned long a) {
    return (a > -2L);
}
both operands have the type unsigned long (again due to the usual arithmetic conversions: the ranks of unsigned long and signed long are equal, so an object of the type signed long is converted to the type unsigned long), and -2L interpreted as unsigned long is greater than 99.
The reason for this has to do with the rules of integer conversions.
In the first case, you compare an unsigned int with a long using the > operator, and in the second case you compare a unsigned long with a long.
These operands must first be converted to a common type using the usual arithmetic conversions. These are spelled out in section 6.3.1.8p1 of the C standard, with the following excerpt focusing on integer conversions:
If both operands have the same type, then no further conversion is
needed.
Otherwise, if both operands have signed integer types or both have
unsigned integer types, the operand with the type of lesser integer
conversion rank is converted to the type of the operand with greater
rank.
Otherwise, if the operand that has unsigned integer type has rank
greater or equal to the rank of the type of the other operand, then
the operand with signed integer type is converted to the type of the
operand with unsigned integer type.
Otherwise, if the type of the operand with signed integer type can
represent all of the values of the type of the operand with unsigned
integer type, then the operand with unsigned integer type is converted
to the type of the operand with signed integer type.
Otherwise, both operands are converted to the unsigned integer type
corresponding to the type of the operand with signed integer type.
In the case of comparing an unsigned int with a long the second bolded paragraph applies. long has higher rank and (assuming long is 64 bits and int is 32 bits) can hold all values that an unsigned int can, so the unsigned int operand a is converted to a long. Since the value in question is in the range of long, section 6.3.1.3p1 dictates how the conversion happens:
When a value with integer type is converted to another integer type
other than _Bool, if the value can be represented by the new type, it
is unchanged
So the value is preserved and we're left with 99 > -2 which is true.
In the case of comparing an unsigned long with a long, the first bolded paragraph applies. Both types are of the same rank with different signs, so the long constant -2L is converted to unsigned long. -2 is outside the range of an unsigned long so a value conversion must happen. This conversion is specified in section 6.3.1.3p2:
Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type.
So the long value -2 will be converted to the unsigned long value 2^64 − 2, assuming unsigned long is 64 bits. So we're left with 99 > 2^64 − 2, which is false.
I think what is happening here is implicit conversion by the compiler. When you compare two different integer types, the compiler converts one of them to the type of the other; roughly, the type with the greater conversion rank wins, and at equal rank unsigned beats signed.
So in foo() your argument is implicitly converted to the signed long type and the comparison works as expected.
In bar() your argument is an unsigned long, which has the same rank as signed long. Here the compiler converts -2L to unsigned long, which turns into a very large number.

C++ can not calculate a formula with a vector's size in it?

#include <vector>
#include <cstdio>

int main() {
    std::vector<int> v;
    if (0 < v.size() - 1) {
        std::printf("true");
    } else {
        std::printf("false");
    }
}
It prints true, which seems to indicate that 0 < -1.
std::vector::size() returns an unsigned integer. If it is 0 and you subtract 1, it underflows and becomes a huge value (specifically std::numeric_limits<std::vector<int>::size_type>::max()). The comparison works fine, but the subtraction produces a value you did not expect.
For more about unsigned underflow (and overflow), see: C++ underflow and overflow
The simplest fix for your code is probably if (1 < v.size()).
v.size() returns a result of size_t, which is an unsigned type. An unsigned value minus 1 is still unsigned. And all non-zero unsigned values are greater than zero.
std::vector<int>::size() returns type size_t which is an unsigned type whose rank is usually at least that of int.
When a math operation puts together a signed type and an unsigned type, and the unsigned type doesn't have a lower rank, the signed type gets converted to the unsigned type (see 6.3.1.8 Usual arithmetic conversions; I'm linking to the C standard, but the rules for integer arithmetic are foundational and common to both languages).
In other words, assuming that size_t isn't unsigned char or unsigned short
(it's usually unsigned long and the C standard recommends it shouldn't be unsigned long long unless necessary)
(size_t)0 - 1
gets implicitly translated to
(size_t)0 - (size_t)1
which is a positive number equal to SIZE_MAX (-1 cannot be represented in an unsigned type, so it gets converted by formally "repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type" (6.3.1.3p2)).
0 is always less than SIZE_MAX.

Bitwise operation on unsigned char

I have a sample function as below:
int get_hash (unsigned char* str)
{
int hash = (str[3]^str[4]^str[5]) % MAX;
int hashVal = arr[hash];
return hashVal;
}
Here array arr has size as MAX. ( int arr[MAX] ).
My static code checker complains that there can be an out-of-bounds array access here, as hash could be in the range -255 to -1.
Is this correct? Can a bitwise operation on unsigned char produce a negative number? Should hash be declared as unsigned int?
Is this correct?
No, the static code checker is in error(1).
Can bitwise operation on unsigned char produce a negative number?
Some bitwise operations can - bitwise complement, for example - but not the exclusive or.
For the ^, the arguments, unsigned char here, are subject to the usual arithmetic conversions (6.3.1.8), they are first promoted according to the integer promotions; about those, clause 6.3.1.1, paragraph 2 says
If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions.
So, there are two possibilities:
An int can represent all possible values of unsigned char. Then all values obtained from the integer promotions are non-negative, the bitwise exclusive or of these values is also non-negative, and the remainder modulo MAX too. The value of hash is then in the range from 0 (inclusive) to MAX (exclusive) [-MAX if MAX < 0].
An int cannot represent all possible values of unsigned char. Then the values are promoted to type unsigned int, and the bitwise operations are carried out at that type. The result is of course non-negative, and the remainder modulo MAX will be non-negative too. However, in that case, the assignment to int hash might convert an out-of-range value to a negative value [the conversion of out-of-range integers to a signed integer type is implementation-defined]. (1)But in that case, the range of possible negative values is greater than -255 to -1, so even in that - very unlikely - case, the static code checker is wrong in part.
Should hash be declared as unsigned int?
That depends on the value of MAX. If there is the slightest possibility that a remainder modulo MAX is out-of-range for int, then that would be safer. Otherwise, int is equally safe.
As gx_ correctly remarked, the arithmetic is done in int. Just declare your hash variable as unsigned char again, to make sure everybody knows that you expect this to be non-negative in all cases.
And if MAX is effectively UCHAR_MAX you should just use that to improve readability.

On type of a literal, unsigned negative number

The C++ Primer says:
We can independently specify the signedness and the size of an integral
literal. If the suffix contains a U, then the literal has an unsigned
type, so a decimal, octal or hexadecimal literal with a U suffix has
the smallest type of unsigned int, unsigned long or unsigned long long
in which the literal's value fits.
When one declares
int i = -12U;
The way I understand it is that -12 is converted to the unsigned version of itself (4294967284) and then assigned to an int, making the result a very large positive number due to rollover.
This does not seem to happen. What am I missing, please?
cout << i << endl; // -12
You are assigning the unsigned int back to a signed int, so it gets converted again.
It's like you did this:
int i = (int)(unsigned int)(-12);
The U suffix is part of the literal, so it binds more tightly than the unary minus. You are getting -(12u).
12 has type int and the value 12.
12U has type unsigned int and the value 12.
-12U has type unsigned int and the value std::numeric_limits<unsigned int>::max() + 1 - 12.
int i = -12U; applies an implementation-defined conversion to convert -12U to type int.

In case of integer overflows what is the result of (unsigned int) * (int) ? unsigned or int?

In case of integer overflows what is the result of (unsigned int) * (int) ? unsigned or int? What type does the array index operator (operator[]) take for char*: int, unsigned int or something else?
I was auditing the following function, and suddenly this question arose. The function has a vulnerability at line 17.
// Create a character array and initialize it with init[]
// repeatedly. The size of this character array is specified by
// w*h.
char *function4(unsigned int w, unsigned int h, char *init)
{
    char *buf;
    int i;

    if (w*h > 4096)
        return (NULL);
    buf = (char *)malloc(4096+1);
    if (!buf)
        return (NULL);
    for (i=0; i<h; i++)
        memcpy(&buf[i*w], init, w); // line 17
    buf[4096] = '\0';
    return buf;
}
Consider both w and h as very large unsigned integers. The multiplication at line 9 has a chance to pass the validation.
Now the problem is at line 17. Multiply int i with unsigned int w: if the result is int, it is possible that the product is negative, resulting in accessing a position that is before buf. If the result is unsigned int, the product will always be positive, resulting in accessing a position that is after buf.
It's hard to write code to demonstrate this: the range of int is too large. Does anyone have ideas on this?
Is there any documentation that specifies the type of the product? I have searched for it, but so far haven't found anything.
I suppose that as far as the vulnerability is concerned, whether (unsigned int) * (int) produces unsigned int or int doesn't matter, because in the compiled object file, they are just bytes. The following code works the same no matter the type of the product:
unsigned int x = 10;
int y = -10;
printf("%d\n", x * y); // print x * y in signed integer
printf("%u\n", x * y); // print x * y in unsigned integer
Therefore, it does not matter what type the multiplication returns. What matters is whether the consumer function takes int or unsigned.
The question here is not how bad the function is, or how to improve the function to make it better. The function undoubtedly has a vulnerability. The question is about the exact behavior of the function, based on the prescribed behavior from the standards.
Do the w*h calculation in long long and check whether it is bigger than UINT_MAX.
EDIT: alternative: check whether it overflowed via (w*h)/h != w (is this always the case?! it should be, right?)
To answer your question: the type of an expression multiplying an int and an unsigned int will be an unsigned int in C/C++.
To answer your implied question, one decent way to deal with possible overflow in integer arithmetic is to use the "IntSafe" set of routines from Microsoft:
http://blogs.msdn.com/michael_howard/archive/2006/02/02/523392.aspx
It's available in the SDK and contains inline implementations so you can study what they're doing if you're on another platform.
Ensure that w * h doesn't overflow by limiting w and h.
The type of w*i is unsigned in your case. If I read the standard correctly, the rule is that the operands are converted to the larger type (with its signedness), or unsigned type corresponding to the signed type (which is unsigned int in your case).
However, even if it's unsigned, it doesn't prevent the wraparound (writing to memory before buf), because it might be the case (on i386 platform, it is), that p[-1] is the same as p[-1u]. Anyway, in your case, both buf[-1] and buf[big unsigned number] would be undefined behavior, so the signed/unsigned question is not that important.
Note that signed/unsigned matters in other contexts - eg. (int)(x*y/2) gives different results depending on the types of x and y, even in the absence of undefined behaviour.
I would solve your problem by checking for overflow on line 9; since 4096 is a pretty small constant and 4096*4096 doesn't overflow on most architectures (you need to check), I'd do
if (w>4096 || h>4096 || w*h > 4096)
return (NULL);
This leaves out the case when w or h are 0, you might want to check for it if needed.
In general, you could check for overflow like this:
if(w*h > 4096 || (w*h)/w!=h || (w*h)%w!=0)
In C/C++ the p[n] notation is really a shortcut for writing *(p+n), and this pointer arithmetic takes the sign into account. So p[-1] is valid and refers to the value immediately before *p.
So the sign really matters here. The results of arithmetic operators on integers follow a set of rules defined by the standard: the integer promotions and the usual arithmetic conversions.
Check out this page: INT02-C. Understand integer conversion rules
2 changes make it safer:
if (w >= 4096 || h >= 4096 || w*h > 4096) return NULL;
...
unsigned i;
Note also that it's no less a bad idea to write to or read from past the end of the buffer. So the question is not whether i*w may become negative, but whether 0 <= i*w + w <= 4096 holds.
So it's not the type that matters, but the result of i*w.
For example, it doesn't make a difference whether this is (unsigned)0x80000000 or (int)0x80000000, the program will seg-fault anyway.
For C, refer to "Usual arithmetic conversions" (C99: Section 6.3.1.8, ANSI C K&R A6.5) for details on how the operands of the mathematical operators are treated.
In your example the following rules apply:
C99:
Otherwise, if the type of the operand
with signed integer type can represent
all of the values of the type of the
operand with unsigned integer type,
then the operand with unsigned integer
type is converted to the type of the
operand with signed integer type.
Otherwise, both operands are converted
to the unsigned integer type
corresponding to the type of the
operand with signed integer type.
ANSI C:
Otherwise, if either operand is unsigned int, the other is converted to unsigned int.
Why not just declare i as unsigned int? Then the problem goes away.
In any case, i*w is guaranteed to be <= 4096, as the code tests for this, so it's never going to overflow.
memcpy(&buf[i*w > -1 ? i*w < 4097 ? i*w : 0 : 0], init, w);
I don't think the triple evaluation of i*w degrades the performance.
w*h could overflow if w and/or h are sufficiently large and the following validation could pass.
9. if (w*h > 4096)
10. return (NULL);
In mixed int / unsigned int operations, the int is converted to unsigned int, in which case a negative value of i would become a large positive value. In that case
&buf[i*w]
would be accessing an out-of-bounds position.
Unsigned arithmetic is done as modular (or wrap-around), so the product of two large unsigned ints can easily be less than 4096. The multiplication of int and unsigned int will result in an unsigned int (see section 4.5 of the C++ standard).
Therefore, given large w and a suitable value of h, you can indeed get into trouble.
Making sure integer arithmetic doesn't overflow is difficult. One easy way is to convert to floating-point and doing a floating-point multiplication, and seeing if the result is at all reasonable. As qwerty suggested, long long would be usable, if available on your implementation. (It's a common extension in C90 and C++, does exist in C99, and will be in C++0x.)
There are 3 paragraphs in the current C1X draft on calculating (UNSIGNED TYPE1) x (SIGNED TYPE2) in 6.3.1.8 Usual arithmetic conversions, N1494,
WG 14: C - Project status and milestones
Otherwise, if the operand that has unsigned integer type has rank greater or
equal to the rank of the type of the other operand, then the operand with
signed integer type is converted to the type of the operand with unsigned
integer type.
Otherwise, if the type of the operand with signed integer type can represent
all of the values of the type of the operand with unsigned integer type, then
the operand with unsigned integer type is converted to the type of the
operand with signed integer type.
Otherwise, both operands are converted to the unsigned integer type
corresponding to the type of the operand with signed integer type.
So if a is unsigned int and b is int, parsing of (a * b) should generate code (a * (unsigned int)b). Will overflow if b < 0 or a * b > UINT_MAX.
If a is unsigned int and b is long of greater size, (a * b) should generate ((long)a * (long)b). Will overflow if a * b > LONG_MAX or a * b < LONG_MIN.
If a is unsigned int and b is long of the same size, (a * b) should generate ((unsigned long)a * (unsigned long)b). Will overflow if b < 0 or a * b > ULONG_MAX.
On your second question about the type expected by the subscript operator, the answer appears to be "integer type", which allows any (signed) integer index.
6.5.2.1 Array subscripting
Constraints
1 One of the expressions shall have type ‘‘pointer to complete object type’’, the other
expression shall have integer type, and the result has type ‘‘type’’.
Semantics
2 A postfix expression followed by an expression in square brackets [] is a subscripted
designation of an element of an array object. The definition of the subscript operator []
is that E1[E2] is identical to (*((E1)+(E2))). Because of the conversion rules that
apply to the binary + operator, if E1 is an array object (equivalently, a pointer to the
initial element of an array object) and E2 is an integer, E1[E2] designates the E2-th
element of E1 (counting from zero).
It is up to the compiler to perform static analysis and warn the developer about the possibility of a buffer overrun when the pointer expression is an array variable and the index may be negative. The same goes for warnings about possible array-size overruns even when the index is positive or unsigned.
To actually answer your question, without specifying the hardware you're running on, you don't know, and in code intended to be portable, you shouldn't depend on any particular behavior.