C++ string.length() Strange Behavior - c++

I just came across an extremely strange problem. The function I have is simply:
int strStr(string haystack, string needle) {
for(int i=0; i<=(haystack.length()-needle.length()); i++){
cout<<"i "<<i<<endl;
}
return 0;
}
Then if I call strStr("", "a"), although haystack.length()-needle.length()=-1, this will not return 0, you can try it yourself...

This is because .length() (and .size()) return size_t, which is an unsigned int. You think you get a negative number, when in fact it underflows back to the maximum value for size_t (On my machine, this is 18446744073709551615). This means your for loop will loop through all the possible values of size_t, instead of just exiting immediately like you expect.
To get the result you want, you can explicitly convert the sizes to ints, rather than unsigned ints (See aslgs answer), although this may fail for strings with sufficient length (Enough to over/under flow a standard int)
Edit:
Two solutions from the comments below:
(Nir Friedman) Instead of using int as in aslg's answer, include the header and use an int64_t, which will avoid the problem mentioned above.
(rici) Turn your for loop into for(int i = 0;needle.length() + i <= haystack.length();i ++){, which avoid the problem all together by rearranging the equation to avoid the subtraction all together.

(haystack.length()-needle.length())
length returns a size_t, in other words an unsigned int. Given the size of your strings, 0 and 1 respectively, when you calculate the difference it underflows and becomes the maximum possible value for an unsigned int. (Which is approximately 4.2 billions for a storage of 4 bytes, but could be a different value)
i<=(haystack.length()-needle.length())
The indexer i is converted by the compiler into an unsigned int to match the type. So you're gonna have to wait until i is greater than the max possible value for an unsigned int. It's not going to stop.
Solution:
You have to convert the result of each method to int, like so,
i <= ( (int)haystack.length() - (int)needle.length() )

Related

What does size_t mean in c++? [duplicate]

I'm just wondering should I use std::size_t for loops and stuff instead of int?
For instance:
#include <cstdint>
int main()
{
for (std::size_t i = 0; i < 10; ++i) {
// std::size_t OK here? Or should I use, say, unsigned int instead?
}
}
In general, what is the best practice regarding when to use std::size_t?
A good rule of thumb is for anything that you need to compare in the loop condition against something that is naturally a std::size_t itself.
std::size_t is the type of any sizeof expression and as is guaranteed to be able to express the maximum size of any object (including any array) in C++. By extension it is also guaranteed to be big enough for any array index so it is a natural type for a loop by index over an array.
If you are just counting up to a number then it may be more natural to use either the type of the variable that holds that number or an int or unsigned int (if large enough) as these should be a natural size for the machine.
size_t is the result type of the sizeof operator.
Use size_t for variables that model size or index in an array. size_t conveys semantics: you immediately know it represents a size in bytes or an index, rather than just another integer.
Also, using size_t to represent a size in bytes helps making the code portable.
The size_t type is meant to specify the size of something so it's natural to use it, for example, getting the length of a string and then processing each character:
for (size_t i = 0, max = strlen (str); i < max; i++)
doSomethingWith (str[i]);
You do have to watch out for boundary conditions of course, since it's an unsigned type. The boundary at the top end is not usually that important since the maximum is usually large (though it is possible to get there). Most people just use an int for that sort of thing because they rarely have structures or arrays that get big enough to exceed the capacity of that int.
But watch out for things like:
for (size_t i = strlen (str) - 1; i >= 0; i--)
which will cause an infinite loop due to the wrapping behaviour of unsigned values (although I've seen compilers warn against this). This can also be alleviated by the (slightly harder to understand but at least immune to wrapping problems):
for (size_t i = strlen (str); i-- > 0; )
By shifting the decrement into a post-check side-effect of the continuation condition, this does the check for continuation on the value before decrement, but still uses the decremented value inside the loop (which is why the loop runs from len .. 1 rather than len-1 .. 0).
By definition, size_t is the result of the sizeof operator. size_t was created to refer to sizes.
The number of times you do something (10, in your example) is not about sizes, so why use size_t? int, or unsigned int, should be ok.
Of course it is also relevant what you do with i inside the loop. If you pass it to a function which takes an unsigned int, for example, pick unsigned int.
In any case, I recommend to avoid implicit type conversions. Make all type conversions explicit.
short answer:
Almost never. Use signed version ptrdiff_t or non-standard ssize_t. Use function std::ssize instead of std::size.
long answer:
Whenever you need to have a vector of char bigger that 2gb on a 32 bit system. In every other use case, using a signed type is much safer than using an unsigned type.
example:
std::vector<A> data;
[...]
// calculate the index that should be used;
size_t i = calc_index(param1, param2);
// doing calculations close to the underflow of an integer is already dangerous
// do some bounds checking
if( i - 1 < 0 ) {
// always false, because 0-1 on unsigned creates an underflow
return LEFT_BORDER;
} else if( i >= data.size() - 1 ) {
// if i already had an underflow, this becomes true
return RIGHT_BORDER;
}
// now you have a bug that is very hard to track, because you never
// get an exception or anything anymore, to detect that you actually
// return the false border case.
return calc_something(data[i-1], data[i], data[i+1]);
The signed equivalent of size_t is ptrdiff_t, not int. But using int is still much better in most cases than size_t. ptrdiff_t is long on 32 and 64 bit systems.
This means that you always have to convert to and from size_t whenever you interact with a std::containers, which not very beautiful. But on a going native conference the authors of c++ mentioned that designing std::vector with an unsigned size_t was a mistake.
If your compiler gives you warnings on implicit conversions from ptrdiff_t to size_t, you can make it explicit with constructor syntax:
calc_something(data[size_t(i-1)], data[size_t(i)], data[size_t(i+1)]);
if just want to iterate a collection, without bounds cheking, use range based for:
for(const auto& d : data) {
[...]
}
here some words from Bjarne Stroustrup (C++ author) at going native
For some people this signed/unsigned design error in the STL is reason enough, to not use the std::vector, but instead an own implementation.
size_t is a very readable way to specify the size dimension of an item - length of a string, amount of bytes a pointer takes, etc.
It's also portable across platforms - you'll find that 64bit and 32bit both behave nicely with system functions and size_t - something that unsigned int might not do (e.g. when should you use unsigned long
Use std::size_t for indexing/counting C-style arrays.
For STL containers, you'll have (for example) vector<int>::size_type, which should be used for indexing and counting vector elements.
In practice, they are usually both unsigned ints, but it isn't guaranteed, especially when using custom allocators.
Soon most computers will be 64-bit architectures with 64-bit OS:es running programs operating on containers of billions of elements. Then you must use size_t instead of int as loop index, otherwise your index will wrap around at the 2^32:th element, on both 32- and 64-bit systems.
Prepare for the future!
size_t is returned by various libraries to indicate that the size of that container is non-zero. You use it when you get once back :0
However, in the your example above looping on a size_t is a potential bug. Consider the following:
for (size_t i = thing.size(); i >= 0; --i) {
// this will never terminate because size_t is a typedef for
// unsigned int which can not be negative by definition
// therefore i will always be >= 0
printf("the never ending story. la la la la");
}
the use of unsigned integers has the potential to create these types of subtle issues. Therefore imho I prefer to use size_t only when I interact with containers/types that require it.
When using size_t be careful with the following expression
size_t i = containner.find("mytoken");
size_t x = 99;
if (i-x>-1 && i+x < containner.size()) {
cout << containner[i-x] << " " << containner[i+x] << endl;
}
You will get false in the if expression regardless of what value you have for x.
It took me several days to realize this (the code is so simple that I did not do unit test), although it only take a few minutes to figure the source of the problem. Not sure it is better to do a cast or use zero.
if ((int)(i-x) > -1 or (i-x) >= 0)
Both ways should work. Here is my test run
size_t i = 5;
cerr << "i-7=" << i-7 << " (int)(i-7)=" << (int)(i-7) << endl;
The output: i-7=18446744073709551614 (int)(i-7)=-2
I would like other's comments.
It is often better not to use size_t in a loop. For example,
vector<int> a = {1,2,3,4};
for (size_t i=0; i<a.size(); i++) {
std::cout << a[i] << std::endl;
}
size_t n = a.size();
for (size_t i=n-1; i>=0; i--) {
std::cout << a[i] << std::endl;
}
The first loop is ok. But for the second loop:
When i=0, the result of i-- will be ULLONG_MAX (assuming size_t = unsigned long long), which is not what you want in a loop.
Moreover, if a is empty then n=0 and n-1=ULLONG_MAX which is not good either.
size_t is an unsigned type that can hold maximum integer value for your architecture, so it is protected from integer overflows due to sign (signed int 0x7FFFFFFF incremented by 1 will give you -1) or short size (unsigned short int 0xFFFF incremented by 1 will give you 0).
It is mainly used in array indexing/loops/address arithmetic and so on. Functions like memset() and alike accept size_t only, because theoretically you may have a block of memory of size 2^32-1 (on 32bit platform).
For such simple loops don't bother and use just int.
I have been struggling myself with understanding what and when to use it. But size_t is just an unsigned integral data type which is defined in various header files such as <stddef.h>, <stdio.h>, <stdlib.h>, <string.h>, <time.h>, <wchar.h> etc.
It is used to represent the size of objects in bytes hence it's used as the return type by the sizeof operator. The maximum permissible size is dependent on the compiler; if the compiler is 32 bit then it is simply a typedef (alias) for unsigned int but if the compiler is 64 bit then it would be a typedef for unsigned long long. The size_t data type is never negative(excluding ssize_t)
Therefore many C library functions like malloc, memcpy and strlen declare their arguments and return type as size_t.
/ Declaration of various standard library functions.
// Here argument of 'n' refers to maximum blocks that can be
// allocated which is guaranteed to be non-negative.
void *malloc(size_t n);
// While copying 'n' bytes from 's2' to 's1'
// n must be non-negative integer.
void *memcpy(void *s1, void const *s2, size_t n);
// the size of any string or `std::vector<char> st;` will always be at least 0.
size_t strlen(char const *s);
size_t or any unsigned type might be seen used as loop variable as loop variables are typically greater than or equal to 0.
size_t is an unsigned integral type, that can represent the largest integer on you system.
Only use it if you need very large arrays,matrices etc.
Some functions return an size_t and your compiler will warn you if you try to do comparisons.
Avoid that by using a the appropriate signed/unsigned datatype or simply typecast for a fast hack.
size_t is unsigned int. so whenever you want unsigned int you can use it.
I use it when i want to specify size of the array , counter ect...
void * operator new (size_t size); is a good use of it.

For loop running infinitely with string length in C++

I am very surprised with this behavior of for loop :
program 1 :
#include<bits/stdc++.h>
using namespace std;
int main()
{
string s1,s2;
cin>>s1>>s2;
for(int i=0;i<(s1.length()-s2.length()+1);i++)
{
cout<<"Hello\n";
}
}
After giving input : s1 = "ab" , s2 = "abcdef"
This for loop of program 1 is running infinitely and printing "Hello" infinite times.
Whereas , Program 2 (below) is working fine for both same input of string s1 and s2.
program 2 :
#include<bits/stdc++.h>
using namespace std;
int main()
{
string s1,s2;
cin>>s1>>s2;
int len = (s1.length()-s2.length()+1);
for(int i=0;i<len;i++)
{
cout<<"Hello\n";
}
}
Why is the for loop of program 1 running infinite times?
In your example, s1.length() is evaluated to 2u (i.e. 2, but in an unsigned type), s2.length() is evaluated to 6u, s1.length() - s2.length() is most probably evaluated to 4294967292u (because there is no -4 in unsigned types), and s1.length() - s2.length() + 1 is evaluated to 4294967293u.
.length() return size_t in C++, which is unsigned value. Subtracting an unsigned value from another unsigned value yields an unsigned value, e.g. 1u - 2u may result in 4294967295.
When mixing signed and unsigned values (e.g. s.length() - 1 or i < s.length()), the signed value is converted to an unsigned, e.g. -1 > 1u is typically true because -1 is converted to 4294967295. Modern compilers will warn you about such type of comparisons if you enable warnings.
Knowing that, you may expect that your loop is running for 4 billion iterations, but that's not necessarily true, because i is a signed int, and if it's 32-bit (most probably), it cannot become bigger than 2147483647. And the moment your program increases it from 2147483647, signed overflow happens, which is undefined behavior in C++. So your loop may very well run infinitely.
I suspect you're doing competitive programming. My recommendation for competitive programming would be to always cast .length() to int whenever you want to calculation anything. You can create a macro like this:
#define sz(x) ((int)(x).size())
and then write sz(s) instead of s.length() everywhere to avoid such bugs.
However, that approach is highly frowned upon in any area of programming where code has to live longer than few hours. E.g. in the industry or open-source. For such cases, use explicit static_cast<int>(s.length())/static_cast<ssize_t>(s.length()) every time you need it. Or, even better, ask about during code review to get specific recommendations concerning your code, there are lots of possible caveats, see comments below for some examples.
I haven't had a chance to test this yet so I can't say for sure, but I strongly suspect that this is related to the fact that string::length() returns a size_t, which is an unsigned type. Unsigned types wrap around to the maximum value if they become negative, so 2-6+1=-3 which becomes 2^32-3 when interpreted as 32 bit unsigned. This causes your loop to iterate billions of times, hence seemingly not terminating. Whereas in your second program you are explicitly converting to a signed int so the result is -3 as expected.

C++ for-loop condition

I want to know why this loops runs even when result.bad_matches.size()=0
for (int i = 1; i <= result.badmatches.size() - 1; i++)
{
...
}
Also, is there any other way I could stop it from running when badmatches size is 0 without using an if condition?
This depends on the type size() returns. It is probably a standard container and thus will be an unsigned type and those types wrap around on overflow. That means it the result of subtracting one will be the maximum value of that type.
Either use a comparison that doesn't require you to subtract from the size (<, !=) or just use iterators or a for-auto loop. Under any circumstance you should at least use the same type for iterating as the nested size_type of the container and not int.
for(auto& x : result.badmatches) {
// ...
}
use while(result.badmatches.size()) to NOT execute it.
result.badmatches.size()-1 this will be converted to -1. If its an unsigned integer, then -1 is interpreted as 0xFFFFFFFF(on a 32 bit machine). This will make the loop run for 2^32 or 2^64 times. To avoid this, use while() as before IF you're certain that result.badmatches.size() will return 0.
size must be returning an unsigned so 0-1 is getting upgraded to unsigned and so is the left value.
So for int size of 4 bytes, -1 will be represented as 2^32 -1 in unsigned int.
If you don't want this behavior then just cast it like this : static_cast <signed int > (result.badmatches.size());
PS: I've not touched C++ for past 4 years pl. excuse little mistakes.
The right way is:
for (int i=0;i< result.badmatches.size() ;++i)
{
}
If you specifically don't want this loop to enter when the sise of the collection is zero then you could check for ! badmatches.empty() assuming that badmatches is an STL container. However, if you structure your code slightly differently, you'll probably overcome this issue without having to do that:
for (size_t i=0; i < result.badmatches.size(); i++)
{
}
I've changed the int to size_t which is the same type that size() returns (an unsigned integer), changed the initial value to 0 and the comparison so that it will exit if i >= result.badmatches.size() Generally, I'd say that this is the clearest way of presenting an indexed approach as it matches the natural indexing of collections and if you need 1, 2, 3 ... rather than 0, 1, 2 in your loop, then you can address that within it.
If you're still having problems, two questions:
Is there anything in your loop that might alter the value of result.badmatches.size()?
Is your code multithreaded with a possibility that result.badmatches.size() could change by actions on another thread?
After understanding the problem explained by #Prototype Stark #Aga , i came to a more simpler solution , using which i can keep my initial index to 1 .
for(int i=1;i+1<=result.badmatches.size();i++)
Thanks for all the help , it's much clearer now .

Finding the Sum of 2D vector

Having some trouble finding the sum of a 2D vector. Does this look ok?
int sumOfElements(vector<iniMatrix> &theBlocks)
{
int theSum = 0;
for(unsigned i=0; (i < theBlocks.size()); i++)
{
for(unsigned j=0; (j < theBlocks[i].size()); j++)
{
theSum +=theBlocks[i][j];
}
}
return theSum;
}
It returns a negative number, however, it should return a positive number..
Hope someone can help :)
The code looks proper in an abstract sense, but you may be overflowing theSum. You can try making theSum type double to see what value you get to help sort out the proper integral type to use for it.
double sumOfElements(vector<iniMatrix> &theBlocks)
{
double theSum = 0;
/* ... */
return theSum;
}
When you observe the returned value, you can see if it would fit in an int or if you need to use a wider long or long long type.
If all the values in the matrix are positive, you should consider using one of the unsigned integral types. which would double your range of allowed values.
The problem is obviously the int exceeds its boundary (like others said)
For signed data types it becomes negative when overflow, and for unsigned datatypes it starts from zero again after overflow.
If you want to detect overflow pragmatically, you can paste these lines instead of the additional line.
if( theSum > int(theSum + theBlocks[i][j]) )
//print error message, throw exception, break, ...
break;
else
theSum += theBlocks[i][j];
For more generic solution to work with more data types and more operations than addition, check this: How to detect integer overflow?
A solution would be using unsigned long long and if it exceeds its boundary too, you need to use third party libraries for big integers.
Like Mokhtar Ashour says, it's may be that the variable theSum overflows. Try making it either unsigned if no numbers are negative, or change the type from int (which is 32 bits) to long long (which is 64 bits).
I think it may be int overflow problem. to make sure, you may insert a condition after the inner loop finishes to see if your result exceeds the int range.
if(result>sizeof(int))
cout<<"hitting boundaries";
a better way to test if you exceed the int boundaries is to print the result after the inner loop ends and notice the result.
.if so, just use a bigger data type.

What's an efficient way to avoid integer overflow converting an unsigned int to int in C++?

Is the following an efficient and problem free way to convert an unsigned int to an int in C++:
#include <limits.h>
void safeConvert(unsigned int passed)
{
int variable = static_cast<int>(passed % (INT_MAX+1));
...
}
Or is there a better way?
UPDATE
As pointed out by James McNellis it is not undefined to assign an unsigned int > INT_MAX to an integer - rather this is implementation defined. As such the context here is now specifically on my preference is to ensure this integer resets to zero when the unsigned int exceeds INT_MAX.
Original Context
I have a number of unsigned int's used as counters, but want to pass them around as integers in a specific case.
Under normal operation these counts will remain within the bounds of INT_MAX. However to avoid running into undefined implementation specific behaviour should the abnormal (but valid) case occur I want some efficient conversion here.
This should also work:
int variable = passed & INT_MAX;
Under normal operation these counts will remain within the bounds of INT_MAX. However to avoid running into undefined behaviour should the abnormal (but valid) case occur I want some efficient conversion here.
Efficient conversion to what? If all the shared values for int and unsigned int correspond, and you want other unsigned values such as INT_MAX + 1 to each have distinct values, then you can only map them onto the negative integer values. This is done by default, and can be explicitly requested with static_cast<int>(my_unsigned). Otherwise, you could map them all to 0, or -1, or INT_MIN, or throw away the high bit... easiest way is simply: if (my_unsigned > INT_MAX) my_unsigned = XXX, or ...my_unsigned &= INT_MAX to clear the high bit. But will the called functions work properly if the int overflows? Perhaps a better solution would be to use 64-bit ints to begin with?