While reading this question, I've seen the first comment saying that:
size_t for length is not a great idea, the proper types are signed ones for optimization/UB reasons.
followed by another comment supporting the reasoning. Is it true?
The question is important, because if I were to write e.g. a matrix library, the image dimensions could be size_t, just to avoid checking if they are negative. But then all loops would naturally use size_t. Could this impact on optimization?
size_t being unsigned is mostly an historical accident - if your world is 16 bit, going from 32767 to 65535 maximum object size is a big win; in current-day mainstream computing (where 64 and 32 bit are the norm) the fact that size_t is unsigned is mostly a nuisance.
Although unsigned types have less undefined behavior (as wraparound is guaranteed), the fact that they have mostly "bitfield" semantics is often cause of bugs and other bad surprises; in particular:
difference between unsigned values is unsigned as well, with the usual wraparound semantics, so if you may expect a negative value you have to cast beforehand;
unsigned a = 10, b = 20;
// prints UINT_MAX-10, i.e. 4294967286 if unsigned is 32 bit
std::cout << a-b << "\n";
more in general, in signed/unsigned comparisons and mathematical operations unsigned wins (so the signed value is casted to unsigned implicitly) which, again, leads to surprises;
unsigned a = 10;
int b = -2;
if(a < b) std::cout<<"a < b\n"; // prints "a < b"
in common situations (e.g. iterating backwards) the unsigned semantics are often problematic, as you'd like the index to go negative for the boundary condition
// This works fine if T is signed, loops forever if T is unsigned
for(T idx = c.size() - 1; idx >= 0; idx--) {
// ...
}
Also, the fact that an unsigned value cannot assume a negative value is mostly a strawman; you may avoid checking for negative values, but due to implicit signed-unsigned conversions it won't stop any error - you are just shifting the blame. If the user passes a negative value to your library function taking a size_t, it will just become a very big number, which will be just as wrong if not worse.
int sum_arr(int *arr, unsigned len) {
int ret = 0;
for(unsigned i = 0; i < len; ++i) {
ret += arr[i];
}
return ret;
}
// compiles successfully and overflows the array; it len was signed,
// it would just return 0
sum_arr(some_array, -10);
For the optimization part: the advantages of signed types in this regard are overrated; yes, the compiler can assume that overflow will never happen, so it can be extra smart in some situations, but generally this won't be game-changing (as in general wraparound semantics comes "for free" on current day architectures); most importantly, as usual if your profiler finds that a particular zone is a bottleneck you can modify just it to make it go faster (including switching types locally to make the compiler generate better code, if you find it advantageous).
Long story short: I'd go for signed, not for performance reasons, but because the semantics is generally way less surprising/hostile in most common scenarios.
That comment is simply wrong. When working with native pointer-sized operands on any reasonable architectute, there is no difference at the machine level between signed and unsigned offsets, and thus no room for them to have different performance properties.
As you've noted, use of size_t has some nice properties like not having to account for the possibility that a value might be negative (although accounting for it might be as simple as forbidding that in your interface contract). It also ensures that you can handle any size that a caller is requesting using the standard type for sizes/counts, without truncation or bounds checks. On the other hand, it precludes using the same type for index-offsets when the offset might need to be negative, and in some ways makes it difficult to perform certain types of comparisons (you have to write them arranged algebraically so that neither side is negative), but the same issue comes up when using signed types, in that you have to do algebraic rearrangements to ensure that no subexpression can overflow.
Ultimately you should initially always use the type that makes sense semantically to you, rather than trying to choose a type for performance properties. Only if there's a serious measured performance problem that looks like it might be improved by tradeoffs involving choice of types should you consider changing them.
I stand by my comment.
There is a simple way to check this: checking what the compiler generates.
void test1(double* data, size_t size)
{
for(size_t i = 0; i < size; i += 4)
{
data[i] = 0;
data[i+1] = 1;
data[i+2] = 2;
data[i+3] = 3;
}
}
void test2(double* data, int size)
{
for(int i = 0; i < size; i += 4)
{
data[i] = 0;
data[i+1] = 1;
data[i+2] = 2;
data[i+3] = 3;
}
}
So what does the compiler generate? I would expect loop unrolling, SIMD... for something that simple:
Let's check godbolt.
Well, the signed version has unrolling, SIMD, not the unsigned one.
I'm not going to show any benchmark, because in this example, the bottleneck is going to be on memory access, not on CPU computation. But you get the idea.
Second example, just keep the first assignment:
void test1(double* data, size_t size)
{
for(size_t i = 0; i < size; i += 4)
{
data[i] = 0;
}
}
void test2(double* data, int size)
{
for(int i = 0; i < size; i += 4)
{
data[i] = 0;
}
}
As you want gcc
OK, not as impressive as for clang, but it still generates different code.
Related
I'm just wondering should I use std::size_t for loops and stuff instead of int?
For instance:
#include <cstdint>
int main()
{
for (std::size_t i = 0; i < 10; ++i) {
// std::size_t OK here? Or should I use, say, unsigned int instead?
}
}
In general, what is the best practice regarding when to use std::size_t?
A good rule of thumb is for anything that you need to compare in the loop condition against something that is naturally a std::size_t itself.
std::size_t is the type of any sizeof expression and as is guaranteed to be able to express the maximum size of any object (including any array) in C++. By extension it is also guaranteed to be big enough for any array index so it is a natural type for a loop by index over an array.
If you are just counting up to a number then it may be more natural to use either the type of the variable that holds that number or an int or unsigned int (if large enough) as these should be a natural size for the machine.
size_t is the result type of the sizeof operator.
Use size_t for variables that model size or index in an array. size_t conveys semantics: you immediately know it represents a size in bytes or an index, rather than just another integer.
Also, using size_t to represent a size in bytes helps making the code portable.
The size_t type is meant to specify the size of something so it's natural to use it, for example, getting the length of a string and then processing each character:
for (size_t i = 0, max = strlen (str); i < max; i++)
doSomethingWith (str[i]);
You do have to watch out for boundary conditions of course, since it's an unsigned type. The boundary at the top end is not usually that important since the maximum is usually large (though it is possible to get there). Most people just use an int for that sort of thing because they rarely have structures or arrays that get big enough to exceed the capacity of that int.
But watch out for things like:
for (size_t i = strlen (str) - 1; i >= 0; i--)
which will cause an infinite loop due to the wrapping behaviour of unsigned values (although I've seen compilers warn against this). This can also be alleviated by the (slightly harder to understand but at least immune to wrapping problems):
for (size_t i = strlen (str); i-- > 0; )
By shifting the decrement into a post-check side-effect of the continuation condition, this does the check for continuation on the value before decrement, but still uses the decremented value inside the loop (which is why the loop runs from len .. 1 rather than len-1 .. 0).
By definition, size_t is the result of the sizeof operator. size_t was created to refer to sizes.
The number of times you do something (10, in your example) is not about sizes, so why use size_t? int, or unsigned int, should be ok.
Of course it is also relevant what you do with i inside the loop. If you pass it to a function which takes an unsigned int, for example, pick unsigned int.
In any case, I recommend to avoid implicit type conversions. Make all type conversions explicit.
short answer:
Almost never. Use signed version ptrdiff_t or non-standard ssize_t. Use function std::ssize instead of std::size.
long answer:
Whenever you need to have a vector of char bigger that 2gb on a 32 bit system. In every other use case, using a signed type is much safer than using an unsigned type.
example:
std::vector<A> data;
[...]
// calculate the index that should be used;
size_t i = calc_index(param1, param2);
// doing calculations close to the underflow of an integer is already dangerous
// do some bounds checking
if( i - 1 < 0 ) {
// always false, because 0-1 on unsigned creates an underflow
return LEFT_BORDER;
} else if( i >= data.size() - 1 ) {
// if i already had an underflow, this becomes true
return RIGHT_BORDER;
}
// now you have a bug that is very hard to track, because you never
// get an exception or anything anymore, to detect that you actually
// return the false border case.
return calc_something(data[i-1], data[i], data[i+1]);
The signed equivalent of size_t is ptrdiff_t, not int. But using int is still much better in most cases than size_t. ptrdiff_t is long on 32 and 64 bit systems.
This means that you always have to convert to and from size_t whenever you interact with a std::containers, which not very beautiful. But on a going native conference the authors of c++ mentioned that designing std::vector with an unsigned size_t was a mistake.
If your compiler gives you warnings on implicit conversions from ptrdiff_t to size_t, you can make it explicit with constructor syntax:
calc_something(data[size_t(i-1)], data[size_t(i)], data[size_t(i+1)]);
if just want to iterate a collection, without bounds cheking, use range based for:
for(const auto& d : data) {
[...]
}
here some words from Bjarne Stroustrup (C++ author) at going native
For some people this signed/unsigned design error in the STL is reason enough, to not use the std::vector, but instead an own implementation.
size_t is a very readable way to specify the size dimension of an item - length of a string, amount of bytes a pointer takes, etc.
It's also portable across platforms - you'll find that 64bit and 32bit both behave nicely with system functions and size_t - something that unsigned int might not do (e.g. when should you use unsigned long
Use std::size_t for indexing/counting C-style arrays.
For STL containers, you'll have (for example) vector<int>::size_type, which should be used for indexing and counting vector elements.
In practice, they are usually both unsigned ints, but it isn't guaranteed, especially when using custom allocators.
Soon most computers will be 64-bit architectures with 64-bit OS:es running programs operating on containers of billions of elements. Then you must use size_t instead of int as loop index, otherwise your index will wrap around at the 2^32:th element, on both 32- and 64-bit systems.
Prepare for the future!
size_t is returned by various libraries to indicate that the size of that container is non-zero. You use it when you get once back :0
However, in the your example above looping on a size_t is a potential bug. Consider the following:
for (size_t i = thing.size(); i >= 0; --i) {
// this will never terminate because size_t is a typedef for
// unsigned int which can not be negative by definition
// therefore i will always be >= 0
printf("the never ending story. la la la la");
}
the use of unsigned integers has the potential to create these types of subtle issues. Therefore imho I prefer to use size_t only when I interact with containers/types that require it.
When using size_t be careful with the following expression
size_t i = containner.find("mytoken");
size_t x = 99;
if (i-x>-1 && i+x < containner.size()) {
cout << containner[i-x] << " " << containner[i+x] << endl;
}
You will get false in the if expression regardless of what value you have for x.
It took me several days to realize this (the code is so simple that I did not do unit test), although it only take a few minutes to figure the source of the problem. Not sure it is better to do a cast or use zero.
if ((int)(i-x) > -1 or (i-x) >= 0)
Both ways should work. Here is my test run
size_t i = 5;
cerr << "i-7=" << i-7 << " (int)(i-7)=" << (int)(i-7) << endl;
The output: i-7=18446744073709551614 (int)(i-7)=-2
I would like other's comments.
It is often better not to use size_t in a loop. For example,
vector<int> a = {1,2,3,4};
for (size_t i=0; i<a.size(); i++) {
std::cout << a[i] << std::endl;
}
size_t n = a.size();
for (size_t i=n-1; i>=0; i--) {
std::cout << a[i] << std::endl;
}
The first loop is ok. But for the second loop:
When i=0, the result of i-- will be ULLONG_MAX (assuming size_t = unsigned long long), which is not what you want in a loop.
Moreover, if a is empty then n=0 and n-1=ULLONG_MAX which is not good either.
size_t is an unsigned type that can hold maximum integer value for your architecture, so it is protected from integer overflows due to sign (signed int 0x7FFFFFFF incremented by 1 will give you -1) or short size (unsigned short int 0xFFFF incremented by 1 will give you 0).
It is mainly used in array indexing/loops/address arithmetic and so on. Functions like memset() and alike accept size_t only, because theoretically you may have a block of memory of size 2^32-1 (on 32bit platform).
For such simple loops don't bother and use just int.
I have been struggling myself with understanding what and when to use it. But size_t is just an unsigned integral data type which is defined in various header files such as <stddef.h>, <stdio.h>, <stdlib.h>, <string.h>, <time.h>, <wchar.h> etc.
It is used to represent the size of objects in bytes hence it's used as the return type by the sizeof operator. The maximum permissible size is dependent on the compiler; if the compiler is 32 bit then it is simply a typedef (alias) for unsigned int but if the compiler is 64 bit then it would be a typedef for unsigned long long. The size_t data type is never negative(excluding ssize_t)
Therefore many C library functions like malloc, memcpy and strlen declare their arguments and return type as size_t.
/ Declaration of various standard library functions.
// Here argument of 'n' refers to maximum blocks that can be
// allocated which is guaranteed to be non-negative.
void *malloc(size_t n);
// While copying 'n' bytes from 's2' to 's1'
// n must be non-negative integer.
void *memcpy(void *s1, void const *s2, size_t n);
// the size of any string or `std::vector<char> st;` will always be at least 0.
size_t strlen(char const *s);
size_t or any unsigned type might be seen used as loop variable as loop variables are typically greater than or equal to 0.
size_t is an unsigned integral type, that can represent the largest integer on you system.
Only use it if you need very large arrays,matrices etc.
Some functions return an size_t and your compiler will warn you if you try to do comparisons.
Avoid that by using a the appropriate signed/unsigned datatype or simply typecast for a fast hack.
size_t is unsigned int. so whenever you want unsigned int you can use it.
I use it when i want to specify size of the array , counter ect...
void * operator new (size_t size); is a good use of it.
This question already has answers here:
size_t vs int in C++ and/or C
(9 answers)
Closed 4 years ago.
I have encountered a problem while I was using a loop like this,
//vector<int> sides;
for (int i = 0; i < sides.size()-2; i++) {
if (sides[i] < sides[i+1] + sides[i+2]) {
...
}
}
The problem was that the size() method uses an unsigned number. So, vectors of size less than 2 make an undefined outcome.
I understand that I should have used an unsigned variable for the loop but it doesn't solve the problem. So I had to deal with it by typecasting or using some conditions.
My question is that why do STL uses an unsigned int to eliminate negative index access violation and generating more problems?
It was decided that containers would use unsigned indexes and sizes way back in the 90s.
On the surface this seemed wise; I mean, sizes and indexes into containers cannot be negative. It also permits slightly larger max values, especially on 16 bit systems.
It is now considered a mistake; your code is but one of the many reasons why. std2 will almost certainly use ptrdiff_t, the signed partner of size_t, for sizes and indexes.
Note that 1u-2 is defined behaviour; it is -1 cast to unsigned, which is guaranteed to be the max unsigned value of that type.
You can fix your code in many ways, including:
for (int i = 0; i+2 < sides.size(); i++) {
if (sides[i] < sides[i+1] + sides[i+2]) {
or
for (int i = 0; i < (std::ptrdiff_t)sides.size()-2; i++) {
if (sides[i] < sides[i+1] + sides[i+2]) {
this second one can break on containers near the size of your memory space limit; on 64 bit systems this is not a problem, on 32 bit systems and vectors of char there is a slim chance you could create a vector large enough.
The reason is that the operator sizeof that determines the size of any object returns a value of the type size_t. So for example also the standard C functiuon strlen has return type size_t.
As result it is adopted that standard containers also returns their sizes as values of the type size_t.
As for the loop then it can be rewritten for example the following way
for ( size_t i = 0; i + 2 < sides.size(); i++) {
if (sides[i] < sides[i+1] + sides[i+2]) {
...
}
}
I'm running a C++ program which adds numbers in a for loop:
int y = 0;
for (int i=0; i<NUM; i++) {
int pow = 1;
for (int j=0 j<i; j++) {
pow *= 10;
}
y+= vec[i]*pow; // where vec is a vector of digits
}
but I am not sure how to check if an overflow has occured. Is there a way?
As Peter rightly explained, overflow is undefined behavior in standard C++11 (or C99), and you really should be afraid of UB.
However, some compilers give you extensions to deal and detect integer overflow.
If you can restrict yourself to a recent GCC compiler, you could use its integer overflow builtins.
You could also restrict yourself to int32_t and compute arithmetic in int64_t and check against result above INT32_MAX or below INT32_MIN. This would be probably slower (but more portable) than using compiler specific extensions. BTW some compilers also have some __int128_t type (so you could compute arithmetic in __int128_t and check likewise that the result fits in int64_t).
If your goal is to have arbitrary precision arithmetic, better use some exising bignum library, like GMPlib (because efficient bignum arithmetic needs very clever algorithms).
Overflow of a signed integral variable results in undefined behaviour. A consequence is that, once overflow has occurred, it cannot be reliably detected in standard C++. Any behaviour is permitted. So your implementation (compiler, library, toolset, etc) may provide a specific means to detect integer overflow - but that means will not necessarily work with other implementations.
The only ways are to check for potential overflows before doing operations (or a sequence of operations). Or to design your approach (e.g. pick limits) to prevent overflow occurring.
There are different options and probably no one that suits every situation and need.
For example in this situation you could check that if pow is less or equal to INT_MAX/10 before multiplying. This works because INT_MAX/10 is the largest number such that it times 10 will be at most INT_MAX. Note that the compiler would normally precompute the INT_MAX/10 if you insert it into the code (otherwise you could put it in a local variable if it bothers you):
int y = 0;
for (int i=0; i<NUM; i++) {
int pow = 1;
for (int j=0 j<i; j++) {
if( pow > INT_MAX/10 )
throw OverflowException;
pow *= 10;
}
y+= vec[i]*pow; // where vec is a vector of digits
}
This solution has the disadvantage that if the other factor is not constant the compiler (and you) can't reasonably avoid the division INT_MAX/n to be done at runtime (and division is normally more expensive than multiplication).
Another possibility is that you could do the calculation in a wider type. For example:
int y = 0;
for (int i=0; i<NUM; i++) {
int32_t pow = 1;
for (int j=0 j<i; j++) {
int64_t tmp = (int64_t)pow * (int64_t)10;
if( tmp > (int64_t)INT32_MAX )
raise OverflowException;
pow = (int32_t)tmp;
}
y+= vec[i]*pow; // where vec is a vector of digits
}
This have the disadvantage that first a wider type might not be available, and if it is it may require software implementation of the multiplication (for example 64-bit on 32-bit systems) which means slower multiplication.
Another situation is where you add two integers. If overflow occurs the result will be less than both integers, especially if they're both positive the result will be negative.
In a lot of code examples, source code, libraries etc. I see the use of int when as far as I can see, an unsigned int would make much more sense.
One place I see this a lot is in for loops. See below example:
for(int i = 0; i < length; i++)
{
// Do Stuff
}
Why on earth would you use an int rather than an unsigned int? Is it just laziness - people can't be bothered with typing unsigned?
Using unsigned can introduce programming errors that are hard to spot, and it's usually better to use signed int just to avoid them. One example would be when you decide to iterate backwards rather than forwards and write this:
for (unsigned i = 5; i >= 0; i--) {
printf("%d\n", i);
}
Another would be if you do some math inside the loop:
for (unsigned i = 0; i < 10; i++) {
for (unsigned j = 0; j < 10; j++) {
if (i - j >= 4) printf("%d %d\n", i, j);
}
}
Using unsigned introduces the potential for these sorts of bugs, and there's not really any upside.
It's generally laziness or lack of understanding.
I aways use unsigned int when the value should not be negative. That also serves the documentation purpose of specifying what the correct values should be.
IMHO, the assertion that it is safer to use "int" than "unsigned int" is simply wrong and a bad programming practice.
If you have used Ada or Pascal you'd be accustomed to using the even safer practice of specifying specific ranges for values (e.g., an integer that can only be 1, 2, 3, 4, 5).
If length is also int, then you should use the same integer type, otherwise weird things happen when you mix signed and unsigned types in a comparison statement. Most compilers will give you a warning.
You could go on to ask, why should length be signed? Well, that's probably historical.
Also, if you decide to reverse the loop, ie
for(int i=length-1;i>=0 ;i--)
{
// do stuff
}
the logic breaks if you use unsigned ints.
I chose to be as explicit as possible while programming. That is, if I intend to use a variable whose value is always positive, then unsigned is used. Many here mention "hard to spot bugs" but few give examples. Consider the following advocate example for using unsigned, unlike most posts here:
enum num_things {
THINGA = 0,
THINGB,
THINGC,
NUM_THINGS
};
int unsafe_function(int thing_ID){
if(thing_ID >= NUM_THINGS)
return -1;
...
}
int safe_function(unsigned int thing_ID){
if(thing_ID >= NUM_THINGS)
return -1;
...
}
int other_safe_function(int thing_ID){
if((thing_ID >=0 ) && (thing_ID >= NUM_THINGS))
return -1;
...
}
/* Error not caught */
unsafe_function(-1);
/* Error is caught */
safe_function((unsigned int)-1);
In the above example, what happens if a negative value is passed in as thing_ID? In the first case, you'll find that the negative value is not greater than or equal to NUM_THINGS, and so the function will continue executing.
In the second case, you'll actually catch this at run-time because the signedness of thing_ID forces the conditional to execute an unsigned comparison.
Of course, you could do something like other_safe_function, but this seems more of a kludge to use signed integers rather than being more explicit and using unsigned to begin with.
I think the most important reason is if you choose unsigned int, you can get some logical errors. In fact, you often do not need the range of unsigned int, using int is safer.
this tiny code is usecase related, if you call some vector element then the prototype is int but there're much modern ways to do it in c++ eg. for(const auto &v : vec) {} or iterators, in some calculcation if there's no substracting/reaching a negative number you can and should use unsigned (explains better the range of values expected), sometimes as many posted examples here shows you actually need int but the truth is it's all about usecase and situation, no one strict rule apply to all usecases and it would be kinda dumb to force one over...
Is it better to cast the iterator condition right operand from size_t to int, or iterate potentially past the maximum value of int? Is the answer implementation specific?
int a;
for (size_t i = 0; i < vect.size(); i++)
{
if (some_func((int)i))
{
a = (int)i;
}
}
int a;
for (int i = 0; i < (int)vect.size(); i++)
{
if (some_func(i))
{
a = i;
}
}
I almost always use the first variation, because I find that about 80% of the time, I discover that some_func should probably also take a size_t.
If in fact some_func takes a signed int, you need to be aware of what happens when vect gets bigger than INT_MAX. If the solution isn't obvious in your situation (it usually isn't), you can at least replace some_func((int)i) with some_func(numeric_cast<int>(i)) (see Boost.org for one implementation of numeric_cast). This has the virtue of throwing an exception when vect grows bigger than you've planned on, rather than silently wrapping around to negative values.
I'd just leave it as a size_t, since there's not a good reason not to do so. What do you mean by "or iterate potentially up to the maximum value of type_t"? You're only iterating up to the value of vect.size().
For most compilers, it won't make any difference. On 32 bit systems, it's obvious, but even on 64 bit systems, both variables will probably be stored in a 64-bit register and pushed on the stack as a 64-bit value.
If the compiler stores int values as 32 bit values on the stack, the first function should be more efficient in terms of CPU-cycles.
But the difference is negligible (although the second function "looks" cleaner)