I have learned recently that size_t was introduced to help future-proof code against native bit count increases and increases in available memory. The specific use definition seems to be on the storing of the size of something, generally an array.
I now must wonder how far this future proofing should be taken. Surely it is pointless to have an array length defined using the future-proof and appropriately sized size_t if the very next task of iterating over the array uses, say, an unsigned int as the index variable:
void func(double* vector, size_t vectorLength) {
for (unsigned int i = 0; i < vectorLength; i++) {
//...
}
}
In fact, in this case I might expect the language to up-convert the unsigned int to a size_t for the relational operator.
Does this imply the iterator variable i should simply be a size_t?
Does this imply that any integer in any program must become functionally identified as to whether it will ever be used as an array index?
Does it imply any code using logic that develops the index programmatically should then create a new result value of type size_t, particularly if the logic relies on potentially signed integer values? i.e.
double foo[100];
//...
int a = 4;
int b = -10;
int c = 50;
int index = a + b + c;
double d = foo[(size_t)index];
Surely though since my code logic creates a fixed bound, up-converting to the size_t provides no additional protection.
You should keep in mind the automatic conversion rules of the language.
Does this imply the iterator variable i should simply be a size_t?
Yes it does, because if size_t is larger than unsigned int and your array is actually larger than can be indexed with an unsigned int, then your variable (i) can never reach the size of the array.
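For instance, a minimal sketch of the loop from the question with the index declared as size_t as well (the function name and body are placeholders):

#include <stddef.h>

void process(double* vector, size_t vectorLength) {  /* placeholder name */
    for (size_t i = 0; i < vectorLength; i++) {
        vector[i] *= 2.0;  /* whatever the body actually does */
    }
}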
Does this imply that any integer in any program must become functionally identified as to whether it will ever be used as an array index?
You try to make it sound drastic, while it's not. Why do you declare one variable as double and not float? Why would you make one variable unsigned and another not? Why would you make one variable short while another is int? Of course, you always know what your variables are going to be used for, so you decide what types they should get. The choice of size_t is one among many and it's similarly decided.
In other words, every variable in a program should be functionally identified and given the correct type.
Does it imply any code using logic that develops the index programmatically should then create a new result value of type size_t, particularly if the logic relies on potentially signed integer values?
Not at all. First, if the variable can never have negative values, then it could have been unsigned int or size_t in the first place. Second, if the variable can have negative values during computation, then you should definitely make sure that in the end it's non-negative, because you shouldn't index an array with a negative number.
That said, if you are sure your index is non-negative, then casting it to size_t doesn't make any difference. C11 at 6.5.2.1 says (emphasis mine):
A postfix expression followed by an expression in square brackets [] is a subscripted
designation of an element of an array object. The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2))). Because of the conversion rules that apply to the binary + operator, if E1 is an array object (equivalently, a pointer to the initial element of an array object) and E2 is an integer, E1[E2] designates the E2th element of E1 (counting from zero).
Which means whatever type of index for which some_pointer + index makes sense, is allowed to be used as index. In other words, if you know your int has enough space to contain the index you are computing, there is absolutely no need to cast it to a different type.
Surely it is pointless to have an array length defined using the future-proof and appropriately sized size_t if the very next task of iterating over the array uses, say, an unsigned int as the index variable
Yes it is. So don't do it.
In fact, in this case I might expect the language to up-convert the unsigned int to a size_t for the relational operator.
It will only be converted in that particular < operation. The upper limit of your unsigned int variable will not be changed, so the ++ operation will always work on an unsigned int, rather than a size_t.
Does this imply the iterator variable i should simply be a size_t?
Does this imply that any integer in any program must become functionally identified as to whether it will ever be used as an array index?
Yeah well, it is better than int... But there is a smarter way to write programs: use common sense. Whenever you declare an array, you can actually stop and consider in advance how many items the array would possibly need to store. If it will never contain more than 100 items, there is absolutely no reason for you to use int nor to use size_t to index it.
In the 100 items case, simply use uint_fast8_t. Then the program is optimized for size as well as speed, and 100% portable.
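A minimal sketch of that small, fixed-bound case (assuming the array really never exceeds 100 elements; names are hypothetical):

#include <stdint.h>

double items[100];   /* array with a known small bound */

void reset_items(void) {
    /* uint_fast8_t is guaranteed to hold at least 0..255, so it covers 0..99 */
    for (uint_fast8_t i = 0; i < 100; ++i) {
        items[i] = 0.0;
    }
}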
Whenever declaring a variable, a good programmer will activate their brain and consider the following:
What is the range of the values that I will store inside this variable?
Do I actually need to store negative numbers in it?
In the case of an array, how many values will I need in the worst-case? (If unknown, do I have to use dynamic memory?)
Are there any compatibility issues with this variable if I decide to port this program?
As opposed to a bad programmer, who does not activate their brain but simply types int all over the place.
As discussed by Neil Kirk, iterators are a future proof counterpart of size_t.
An additional point in your question is the computation of a position, and this typically includes an absolute position (e.g. a in your example) and possibly one or more relative quantities (e.g. b or c), potentially signed.
The signed counterpart of size_t is ptrdiff_t, and the analogue for an iterator type I is typename I::difference_type.
As you describe in your question, it is best to use the appropriate types everywhere in your code, so that no conversions are needed. For memory efficiency, if you have e.g. an array of one million positions into other arrays and you know these positions are in the range 0-255, then you can use unsigned char; but then a conversion is necessary at some point.
In such cases, it is best to name this type, e.g.
using pos = unsigned char;
and make all conversions explicit. Then the code will be easier to maintain, should the range 0-255 increase in the future.
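A brief sketch of that idea (the container and function names are hypothetical):

#include <cstddef>
#include <vector>

using pos = unsigned char;   // positions are known to fit in 0-255 today

double element_at(const std::vector<double>& data, pos p) {
    // The widening conversion is explicit, so it is easy to locate and revisit
    // if the range of pos ever has to grow.
    return data[static_cast<std::size_t>(p)];
}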
Yep, if you use int to index an array, you defeat the point of using size_t in other places. This is why you can use iterators with STL. They are future proof. For C arrays, you can use either size_t, pointers, or algorithms and lambdas or range-based for loops (C++11). If you need to store the size or index in variables, they will need to be size_t or other appropriate types, as will anything else they interact with, unless you know the size will be small. (For example, if you store the distance between two elements which will always be in a small range, you can use int).
constexpr std::size_t my_array_size = 100; // example size
double my_array[my_array_size]; // an actual array, so std::begin/std::end and the range-for below also work
for (double *it = my_array, *end_it = my_array + my_array_size; it != end_it; ++it)
{
// use *it
}
std::for_each(std::begin(my_array), std::end(my_array), [](double& x)
{
// use x
});
for (auto& x : my_array)
{
// use x
}
Does this imply that any integer in any program must become functionally identified as to whether it will ever be used as an array index?
I'll pick that point, and say clearly Yes. Besides, in most cases a variable used as an array index is only used as that (or something related to it).
And this rule does not only apply here, but also in other circumstances: There are many use cases where nowadays a special type exists: ptrdiff_t, off_t (which even may change depending on the configuration we use!), pid_t and a lot of others.
As an example, consider the following structure:
struct S {
int a[4];
int b[4];
} s;
Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?
Personally, I feel that it must be UB in C++, whereas I'm not sure about C.
However, I failed to find anything relevant in the standards of C and C++ languages.
Update
There are several answers suggesting ways to make sure there is no padding
between fields in order to make the code work reliably. I'd like to emphasize
that if such code is UB, then absence of padding is not enough. If it is UB,
then the compiler is free to assume that accesses to S.a[i] and S.b[j] do not
overlap and the compiler is free to reorder such memory accesses. For example,
int x = s.b[2];
s.a[6] = 2;
return x;
can be transformed to
s.a[6] = 2;
int x = s.b[2];
return x;
which always returns 2.
Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?
No. Because accessing an array out of bounds invokes undefined behaviour in C and C++.
C11 J.2 Undefined behavior
Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond
the array object and is used as the operand of a unary * operator that
is evaluated (6.5.6).
An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression
a[1][7] given the declaration int a[4][5]) (6.5.6).
C++ standard draft section 5.7 Additive operators paragraph 5 says:
When an expression that has integral type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integral expression.
[...] If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined.
Apart from the answer of @rsp (Undefined behavior for an array subscript that is out of range) I can add that it is not legal to access b via a because the C language does not specify how much padding space can be between the end of the area allocated for a and the start of b, so even if you can run it on a particular implementation, it is not portable.
instance of struct:
+-----------+----------------+-----------+---------------+
| array a | maybe padding | array b | maybe padding |
+-----------+----------------+-----------+---------------+
The trailing padding may be absent, since the alignment of the struct object is the alignment of a, which is the same as the alignment of b; but the C language does not require it to be absent either.
a and b are two different arrays, and a is defined as containing 4 elements. Hence, a[6] accesses the array out of bounds and is therefore undefined behaviour. Note that the array subscript a[6] is defined as *(a+6), so the proof of UB is actually given by the section "Additive operators" in conjunction with the way array subscripting decays to pointer arithmetic. See the following section of the C11-standard (e.g. this online draft version) describing this aspect:
6.5.6 Additive operators
When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integer expression.
In other words, if the expression P points to the i-th element of an
array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N
(where N has the value n) point to, respectively, the i+n-th and
i-n-th elements of the array object, provided they exist. Moreover, if
the expression P points to the last element of an array object, the
expression (P)+1 points one past the last element of the array object,
and if the expression Q points one past the last element of an array
object, the expression (Q)-1 points to the last element of the array
object. If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. If the result points one past the last element
of the array object, it shall not be used as the operand of a unary *
operator that is evaluated.
The same argument applies to C++ (though not quoted here).
Further, though it is clearly undefined behaviour due to the fact of exceeding array bounds of a, note that the compiler might introduce padding between members a and b, such that - even if such pointer arithmetics were allowed - a+6 would not necessarily yield the same address as b+2.
Is it legal? No. As others mentioned, it invokes Undefined Behavior.
Will it work? That depends on your compiler. That's the thing about undefined behavior: it's undefined.
On many C and C++ compilers, the struct will be laid out such that b will immediately follow a in memory and there will be no bounds checking. So accessing a[6] will effectively be the same as b[2] and will not cause any sort of exception.
Given
struct S {
int a[4];
int b[4];
} s;
and assuming no extra padding, the structure is really just a way of looking at a block of memory containing 8 integers. You could cast its address to (int*) and ((int*)&s)[6] would refer to the same memory as s.b[2].
Should you rely on this sort of behavior? Absolutely not. Undefined means that the compiler doesn't have to support this. The compiler is free to pad the structure which could render the assumption that &(s.b[2]) == &(s.a[6]) incorrect. The compiler could also add bounds checking on the array access (although enabling compiler optimizations would probably disable such a check).
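If you do decide to rely on it anyway, here is a hedged compile-time check of the layout (C++11 static_assert; C11 has _Static_assert). Note it only verifies the absence of padding on this particular implementation; it does not make the out-of-bounds access defined:

#include <cstddef>

struct S {
    int a[4];
    int b[4];
};

// Fails to compile if this implementation inserts padding between a and b.
static_assert(offsetof(S, b) == 4 * sizeof(int),
              "padding between a and b; s.a[6] would not even line up with s.b[2]");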
I have experienced the effects of this in the past. It's quite common to have a struct like this
struct Bob {
char name[16];
char whatever[64];
} bob;
strcpy(bob.name, "some name longer than 16 characters");
Now bob.whatever will be " than 16 characters". (which is why you should always use strncpy, BTW)
As @MartinJames mentioned in a comment, if you need to guarantee that a and b are in contiguous memory (or at least can be treated as such), you need to use a union. (Edit: unless your architecture/compiler uses an unusual memory block size/offset and forced alignment that would require padding to be added.)
union overlap {
char all[8]; /* all the bytes in sequence */
struct { /* (anonymous struct so its members can be accessed directly) */
char a[4]; /* padding may be added after this if the alignment is not a sub-factor of 4 */
char b[4];
};
};
You can't directly access b from a (e.g. a[6], like you asked), but you can access the elements of both a and b by using all (e.g. all[6] refers to the same memory location as b[2]).
(Edit: You could replace 8 and 4 in the code above with 2*sizeof(int) and sizeof(int), respectively, to be more likely to match the architecture's alignment, especially if the code needs to be more portable, but then you have to be careful to avoid making any assumptions about how many bytes are in a, b, or all. However, this will work on what are probably the most common (1-, 2-, and 4-byte) memory alignments.)
Here is a simple example:
#include <stdio.h>
union overlap {
char all[2*sizeof(int)]; /* all the bytes in sequence */
struct { /* anonymous struct so its members can be accessed directly */
char a[sizeof(int)]; /* low word */
char b[sizeof(int)]; /* high word */
};
};
int main()
{
union overlap testing;
testing.a[0] = 'a';
testing.a[1] = 'b';
testing.a[2] = 'c';
testing.a[3] = '\0'; /* null terminator */
testing.b[0] = 'e';
testing.b[1] = 'f';
testing.b[2] = 'g';
testing.b[3] = '\0'; /* null terminator */
printf("a=%s\n",testing.a); /* output: a=abc */
printf("b=%s\n",testing.b); /* output: b=efg */
printf("all=%s\n",testing.all); /* output: all=abc */
testing.a[3] = 'd'; /* makes printf keep reading past the end of a */
printf("a=%s\n",testing.a); /* output: a=abcdefg */
printf("b=%s\n",testing.b); /* output: b=efg */
printf("all=%s\n",testing.all); /* output: all=abcdefg */
return 0;
}
No, since accessing an array out of bounds invokes Undefined Behavior, both in C and C++.
Short Answer: No. You're in the land of undefined behavior.
Long Answer: No. But that doesn't mean that you can't access the data in other sketchier ways... if you're using GCC you can do something like the following (elaboration of dwillis's answer):
struct __attribute__((packed,aligned(4))) Bad_Access {
int arr1[3];
int arr2[3];
};
and then you could access via (Godbolt source+asm):
int x = ((int*)ba_pointer)[4];
But that cast violates strict aliasing so is only safe with g++ -fno-strict-aliasing. You can cast a struct pointer to a pointer to the first member, but then you're back in the UB boat because you're accessing outside the first member.
Alternatively, just don't do that. Save a future programmer (probably yourself) the heartache of that mess.
Also, while we're at it, why not use std::vector? It's not fool-proof, but on the back-end it has guards to prevent such bad behavior.
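For example, the checked accessor std::vector::at() turns the same out-of-range index into an exception instead of a silent read of neighbouring data:

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> arr1(3), arr2(3);
    try {
        int x = arr1.at(4);          // bounds-checked; throws std::out_of_range
        std::printf("%d\n", x);
    } catch (const std::out_of_range&) {
        std::printf("out-of-range access caught\n");
    }
}

(Plain operator[] stays unchecked, so the guard only applies where at() is used, or where the standard library is built in a checked/debug mode.)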
Addendum:
If you're really concerned about performance:
Let's say you have two same-typed pointers that you're accessing. The compiler will more than likely assume that both pointers have the chance to interfere, and will instantiate additional logic to protect you from doing something dumb.
If you solemnly swear to the compiler that you're not trying to alias, the compiler will reward you handsomely:
Does the restrict keyword provide significant benefits in gcc / g++
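A rough sketch of what that "promise" looks like (restrict is standard C; most C++ compilers, including g++, accept __restrict as an extension; names here are made up):

// With the no-aliasing promise, the compiler may keep values in registers
// instead of re-loading them in case out[] and in[] overlap.
void scale_into(float* __restrict out, const float* __restrict in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}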
Conclusion: Don't be evil; your future self, and the compiler will thank you.
Jed Schaff’s answer is on the right track, but not quite correct. If the compiler inserts padding between a and b, his solution will still fail. If, however, you declare:
typedef struct {
int a[4];
int b[4];
} s_t;
typedef union {
char bytes[sizeof(s_t)];
s_t s;
} u_t;
You may now access (int*)(bytes + offsetof(s_t, b)) to get the address of s.b, no matter how the compiler lays out the structure. The offsetof() macro is declared in <stddef.h>.
The expression sizeof(s_t) is a constant expression, legal in an array declaration in both C and C++. It will not give a variable-length array. (Apologies for misreading the C standard before. I thought that sounded wrong.)
In the real world, though, two consecutive arrays of int in a structure are going to be laid out the way you expect. (You might be able to engineer a very contrived counterexample by setting the bound of a to 3 or 5 instead of 4 and then getting the compiler to align both a and b on a 16-byte boundary.) Rather than convoluted methods to try to get a program that makes no assumptions whatsoever beyond the strict wording of the standard, you want some kind of defensive coding, such as static_assert(&both_arrays[4] == &s.b[0], "");. These add no run-time overhead and will fail if your compiler is doing something that would break your program, so long as you don't trigger UB in the assertion itself.
If you want a portable way to guarantee that both sub-arrays are packed into a contiguous memory range, or split a block of memory the other way, you can copy them with memcpy().
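A minimal sketch of that memcpy() approach (hypothetical helper, reusing the struct from the question):

#include <cstring>

struct S {
    int a[4];
    int b[4];
};

// Copies both sub-arrays into one contiguous buffer; flat[6] then reliably
// holds the same value as s.b[2], with no assumptions about struct padding.
void flatten(const S& s, int (&flat)[8]) {
    std::memcpy(flat, s.a, sizeof s.a);
    std::memcpy(flat + 4, s.b, sizeof s.b);
}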
The Standard does not impose any restrictions upon what implementations must do when a program tries to use an out-of-bounds array subscript in one structure field to access a member of another. Out-of-bounds accesses are thus "illegal" in strictly conforming programs, and programs which make use of such accesses cannot simultaneously be 100% portable and free of errors. On the other hand, many implementations do define the behavior of such code, and programs which are targeted solely at such implementations may exploit such behavior.
There are three issues with such code:
While many implementations lay out structures in predictable fashion, the Standard allows implementations to add arbitrary padding before any structure member other than the first. Code could use sizeof or offsetof to ensure that structure members are placed as expected, but the other two issues would remain.
Given something like:
if (structPtr->array1[x])
structPtr->array2[y]++;
return structPtr->array1[x];
it would normally be useful for a compiler to assume that the use of structPtr->array1[x] will yield the same value as the preceding use in the "if" condition, even though it would change the behavior of code that relies upon aliasing between the two arrays.
If array1[] has e.g. 4 elements, a compiler given something like:
if (x < 4) foo(x);
structPtr->array1[x]=1;
might conclude that since there would be no defined cases where x isn't less than 4, it could call foo(x) unconditionally.
Unfortunately, while programs can use sizeof or offsetof to ensure that there aren't any surprises with struct layout, there's no way by which they can test whether compilers promise to refrain from the optimizations of types #2 or #3. Further, the Standard is a little vague about what would be meant in a case like:
struct foo {char array1[4],array2[4]; };
int test(struct foo *p, int i, int x, int y, int z)
{
if (p->array2[x])
{
((char*)p)[x]++;
((char*)(p->array1))[y]++;
p->array1[z]++;
}
return p->array2[x];
}
The Standard is pretty clear that behavior would only be defined if z is in the range 0..3, but since the type of p->array1 in that expression is char* (due to decay) it's not clear the cast in the access using y would have any effect. On the other hand, since converting a pointer to the first element of a struct to char* should yield the same result as converting a struct pointer to char*, and the converted struct pointer should be usable to access all bytes therein, it would seem the access using x should be defined for (at minimum) x=0..7 [if the offset of array2 is greater than 4, it would affect the value of x needed to hit members of array2, but some value of x could do so with defined behavior].
IMHO, a good remedy would be to define the subscript operator on array types in a fashion that does not involve pointer decay. In that case, the expressions p->array1[x] and &(p->array1[x]) could invite a compiler to assume that x is 0..3, but p->array1+x and *(p->array1+x) would require a compiler to allow for the possibility of other values. I don't know if any compilers do that, but the Standard doesn't require it.
I still have only a basic understanding of meta-programming.
I am struggling to understand the difference, if any, between using the int type and the size_t type as a template parameter.
I understand the difference between both in standard c++ programming as explained here What's the difference between size_t and int in C++?
Then, when reading questions related to some template tricks, it seems that people tend to use them interchangeably.
For example, in this one: How can I get the index of a type in a variadic class template?
Barry is using std::integral_constant instantiated with the size_t type.
In this question: C++11 Tagged Tuple
ecatmur provides an answer whose index helpers use an int instantiation of std::integral_constant.
Replacing one with the other seems to have no impact in what I have tested. Since those template specializations are recursive, I presume that in practice the compiler would give up anyway if the index N were too big.
Is choosing int or size_t in this specific context only a question of coding style ?
std::size_t is an unsigned type that is at least as large as an unsigned int.
int is a signed type, whose upper bound is less than that of an unsigned int.
There are going to be values which cannot be represented by int that size_t can represent.
Passing -1 as an int results in a negative value. Passing -1 as a size_t results in a large positive value.
Overflow on int is undefined behavior; undefined behavior at compile time makes expressions non-constexpr in some contexts.
Overflow on size_t is defined behavior, it is mathematics modulo 2^n for some (unspecified) n.
Many containers in C++ use size_t for their indexing, and tuple uses it for the index of its get.
There are disadvantages to unsigned values, in that they behave strangely (unlike "real integers") "near zero", while int behaves strangely far from zero, and being far from zero is a rarer case than being near zero.
size_t cannot be negative, which seemingly makes sense for representing values that cannot be negative, but the wrap-around behavior can sometimes cause big problems. I find this happens less often with compile-time code, however.
You could use ptrdiff_t, which is basically the signed equivalent of size_t, as another choice.
There are consequences to both choices. Which of these consequences you want to deal with is up to you. Which is better, a matter of opinion.
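As a minimal illustration of the "only coding style?" question: both choices name the same compile-time value, but they are distinct types, which can matter if other templates match on the exact type (e.g. partial specializations written in terms of std::size_t):

#include <cstddef>
#include <type_traits>

using idx_int  = std::integral_constant<int, 3>;
using idx_size = std::integral_constant<std::size_t, 3>;

static_assert(idx_int::value == idx_size::value, "same value");
static_assert(!std::is_same<idx_int, idx_size>::value, "but different types");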
I was looking at this video. Bjarne Stroustrup says that unsigned ints are error prone and lead to bugs. So, you should only use them when you really need them. I've also read in one of the question on Stack Overflow (but I don't remember which one) that using unsigned ints can lead to security bugs.
How do they lead to security bugs? Can someone clearly explain it by giving a suitable example?
One possible aspect is that unsigned integers can lead to somewhat hard-to-spot problems in loops, because the underflow leads to large numbers. I cannot count (even with an unsigned integer!) how many times I made a variant of this bug
for(size_t i = foo.size(); i >= 0; --i)
...
Note that, by definition, i >= 0 is always true. (What causes this in the first place is that if i is signed, the compiler will warn about the signed/unsigned mismatch with the size_t returned by size().)
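One common repair that keeps the unsigned index is to decrement inside the condition, so the test still happens while i is in range (function name is made up):

#include <cstddef>
#include <vector>

void walk_backwards(const std::vector<int>& foo) {
    for (std::size_t i = foo.size(); i-- > 0; ) {
        // use foo[i]; the loop ends cleanly after the i == 0 iteration
    }
}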
There are other reasons mentioned in "Danger – unsigned types used here!", the strongest of which, in my opinion, is the implicit type conversion between signed and unsigned.
One big factor is that it makes loop logic harder: Imagine you want to iterate over all but the last element of an array (which does happen in the real world). So you write your function:
void fun (const std::vector<int> &vec) {
for (std::size_t i = 0; i < vec.size() - 1; ++i)
do_something(vec[i]);
}
Looks good, doesn't it? It even compiles cleanly with very high warning levels! (Live) So you put this in your code, all tests run smoothly and you forget about it.
Now, later on, somebody comes along and passes an empty vector to your function. Now with a signed integer, you hopefully would have noticed the sign-compare compiler warning, introduced the appropriate cast and not have published the buggy code in the first place.
But in your implementation with the unsigned integer, vec.size() - 1 wraps around and the loop condition becomes i < SIZE_MAX. Disaster, UB and most likely crash!
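A sketch of the usual fix: keep everything unsigned but rewrite the condition so nothing is ever subtracted from zero; for an empty vector the body is simply never entered:

#include <cstddef>
#include <vector>

void fun_fixed(const std::vector<int>& vec) {
    // i + 1 < vec.size() means the same as i < vec.size() - 1 for non-empty
    // vectors, but cannot wrap around when vec is empty.
    for (std::size_t i = 0; i + 1 < vec.size(); ++i) {
        // do_something(vec[i]);
    }
}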
I want to know how they lead to security bugs?
This is also a security problem, in particular it is a buffer overflow. One way to possibly exploit this would be if do_something would do something that can be observed by the attacker. They might be able to find what input went into do_something, and that way data the attacker should not be able to access would be leaked from your memory. This would be a scenario similar to the Heartbleed bug. (Thanks to ratchet freak for pointing that out in a comment.)
I'm not going to watch a video just to answer a question, but one issue is the confusing conversions which can happen if you mix signed and unsigned values. For example:
#include <iostream>
int main() {
unsigned n = 42;
int i = -42;
if (i < n) {
std::cout << "All is well\n";
} else {
std::cout << "ARITHMETIC IS BROKEN!\n";
}
}
The promotion rules mean that i is converted to unsigned for the comparison, giving a large positive number and a surprising result.
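Since C++20 there is a direct remedy for this particular trap: the std::cmp_* helpers in <utility> compare mixed signedness by mathematical value:

#include <iostream>
#include <utility>

int main() {
    unsigned n = 42;
    int i = -42;
    if (std::cmp_less(i, n)) {               // compares the actual values, no conversion surprise
        std::cout << "All is well\n";        // this branch is taken
    } else {
        std::cout << "ARITHMETIC IS BROKEN!\n";
    }
}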
Although it may only be considered as a variant of the existing answers: Referring to "Signed and unsigned types in interfaces," C++ Report, September 1995 by Scott Meyers, it's particularly important to avoid unsigned types in interfaces.
The problem is that it becomes impossible to detect certain errors that clients of the interface could make (and if they could make them, they will make them).
The example given there is:
template <class T>
class Array {
public:
Array(unsigned int size);
...
and a possible instantiation of this class
int f(); // f and g are functions that return
int g(); // ints; what they do is unimportant
Array<double> a(f()-g()); // array size is f()-g()
The difference of the values returned by f() and g() might be negative, for an awful number of reasons. The constructor of the Array class will receive this difference as a value that is implicitly converted to be unsigned. Thus, as the implementor of the Array class, one cannot distinguish between an erroneously passed value of -1, and a very large array allocation.
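A hedged sketch of one way to keep such errors detectable: accept a signed size and validate it, instead of letting a negative value convert silently (the class name reuses the quoted example; the rest is made up):

#include <cstddef>
#include <stdexcept>

template <class T>
class Array {
public:
    explicit Array(std::ptrdiff_t size) {
        if (size < 0)
            throw std::invalid_argument("negative Array size");
        // ... allocate static_cast<std::size_t>(size) elements ...
    }
};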
The big problem with unsigned int is that if you subtract 1 from an unsigned int 0, the result isn't a negative number, the result isn't less than the number you started with, but the result is the largest possible unsigned int value.
unsigned int x = 0;
unsigned int y = x - 1;
if (y > x) printf ("What a surprise! \n");
And this is what makes unsigned int error prone. Of course unsigned int works exactly as it is designed to work. It's absolutely safe if you know what you are doing and make no mistakes. But most people make mistakes.
If you are using a good compiler, you turn on all the warnings that the compiler produces, and it will tell you when you do dangerous things that are likely to be mistakes.
The problem with unsigned integer types is that depending upon their size they may represent one of two different things:
Unsigned types smaller than int (e.g. uint8_t) hold numbers in the range 0..2ⁿ-1, and calculations with them will behave according to the rules of integer arithmetic provided they don't exceed the range of the int type. Under present rules, if such a calculation exceeds the range of an int, a compiler is allowed to do anything it likes with the code, even going so far as to negate the laws of time and causality (some compilers will do precisely that!), and even if the result of the calculation would be assigned back to an unsigned type smaller than int.
Unsigned types unsigned int and larger hold members of the abstract wrapping algebraic ring of integers congruent mod 2ⁿ; this effectively means that if a calculation goes outside the range 0..2ⁿ-1, the system will add or subtract whatever multiple of 2ⁿ would be required to get the value back in range.
Consequently, given uint32_t x=1, y=2; the expression x-y may have one of two meanings depending upon whether int is larger than 32 bits.
If int is larger than 32 bits, the expression will subtract the number 2 from the number 1, yielding the number -1. Note that a variable of type uint32_t can't hold the value -1 regardless of the size of int; storing -1 into it would cause it to hold 0xFFFFFFFF. But unless or until the value of the expression is coerced to an unsigned type, it will behave like the signed quantity -1.
If int is 32 bits or smaller, the expression will yield a uint32_t value which, when added to the uint32_t value 2, will yield the uint32_t value 1 (i.e. the uint32_t value 0xFFFFFFFF).
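A minimal sketch of that dependence (on mainstream platforms, where int is 32 bits, the first comment describes what actually happens):

#include <cstdint>
#include <iostream>

int main() {
    std::uint32_t x = 1, y = 2;
    // If int is 32 bits or narrower, x - y stays unsigned and wraps to
    // 0xFFFFFFFF, so the comparison below is true (prints 1).
    // If int were wider than 32 bits, x and y would be promoted to int,
    // x - y would be -1, and the comparison would be false (prints 0).
    std::cout << ((x - y) > 0) << "\n";
}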
IMHO, this problem could be solved cleanly if C and C++ were to define new unsigned types [e.g. unum32_t and uwrap32_t] such that a unum32_t would always behave as a number, regardless of the size of int (possibly requiring the right-hand operand of a subtraction or unary minus to be promoted to the next larger signed type if int is 32 bits or smaller), while a uwrap32_t would always behave as a member of an algebraic ring (blocking promotions even if int were larger than 32 bits). In the absence of such types, however, it's often impossible to write code which is both portable and clean, since portable code will often require type coercions all over the place.
Numeric conversion rules in C and C++ are a byzantine mess. Using unsigned types exposes yourself to that mess to a much greater extent than using purely signed types.
Take for example the simple case of a comparison between two variables, one signed and the other unsigned.
If both operands are smaller than int then they will both be converted to int and the comparison will give numerically correct results.
If the unsigned operand is smaller than the signed operand then both will be converted to the type of the signed operand and the comparison will give numerically correct results.
If the unsigned operand is greater than or equal in size to the signed operand and also greater than or equal in size to int then both will be converted to the type of the unsigned operand. If the value of the signed operand is less than zero this will lead to numerically incorrect results.
To take another example consider multiplying two unsigned integers of the same size.
If the operand size is greater than or equal to the size of int then the multiplication will have defined wraparound semantics.
If the operand size is smaller than int but greater than or equal to half the size of int then there is the potential for undefined behaviour (a sketch of this case follows below).
If the operand size is less than half the size of int then the multiplication will produce numerically correct results. Assigning this result back to a variable of the original unsigned type will produce defined wraparound semantics.
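A hedged sketch of that middle case on a typical platform where int is 32 bits: two 16-bit unsigned operands are promoted to signed int, and 0xFFFF * 0xFFFF overflows it:

#include <cstdint>

std::uint32_t square(std::uint16_t v) {
    // return v * v;              // v promotes to int; 0xFFFF * 0xFFFF overflows
    //                            // a 32-bit int: undefined behaviour
    return std::uint32_t(v) * v;  // widening one operand first keeps the multiplication unsigned and defined
}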
In addition to the range/wrap issues with unsigned types, mixing unsigned and signed integer types can have a significant performance impact on the processor. It is less costly than a floating-point cast, but too much to simply ignore. Additionally, the compiler may insert range checks for the value, which can change the behaviour of further checks.
Suppose I'm writing a function which takes a float a[] and an offset, into this array, and returns the element at that offset. Is it reasonable to use the signature
float foo(float* a, off_t offset);
for it? Or is off_t only relevant to offsets in bytes, rather than pointer arithmetic with arbitrary element sizes? i.e. is it reasonable to say a[offset] when offset is of type off_t?
The GNU C Library Reference Manual says:
off_t
This is a signed integer type used to represent file sizes.
but that doesn't tell me much.
My intuition is that the answer is "no", since the actual address used in a[offset] is the address of a + sizeof(float) * offset, so "sizeof(float) * offset" is an off_t, and sizeof(float) is a size_t, and both are constants with 'dimensions'.
Note: The offset might be negative.
Is there any good reason why you just don't use int? It's the default type for integral values in C++, and should be used unless there is a good reason not to.
Of course, one good reason could be that it might overflow. If the context is such that you could end up with very large arrays, you might want to use ptrdiff_t, which is defined (in C and C++) as the type resulting from the subtraction of two pointers: in other words, it is guaranteed not to overflow (when used as an offset) for all types with a size greater than 1.
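A minimal sketch of the resulting signature (the offset may legitimately be negative if a points into the middle of a larger array):

#include <cstddef>

float foo(const float* a, std::ptrdiff_t offset) {
    return a[offset];
}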
You could use size_t or ptrdiff_t as the type of an index (your second parameter is more an index inside a float array than an offset).
Your use is an index, not an offset. Notice that the standard offsetof macro is defined to return byte offsets!
In practice, you could even use int or unsigned, unless you believe your array could have billions of components.
You may want to #include <stdint.h> (or <cstdint> with a recent C++) and have explicitly sized types like int32_t for your indexes.
For source readability reasons, you might define
typedef unsigned index_t;
and later use it, e.g.
float foo(float a[], index_t i);
My opinion is that you just should use int as the type of your indexes. (but handle out-of-bound indexes appropriately).
I would say it is not appropriate, since
off_t is (intended to be) used to represent file sizes
off_t is a signed type.
I would go for size_type (usually a "typedef"ed name for size_t), which is the one used by std containers.
Perhaps the answer is to use ptrdiff_t? It...
can be negative;
alludes to the difference not being in bytes, but in units of arbitrary size depending on the element type.
What do you think?
I think I understand the semantics of pointer arithmetic fairly well, but I only ever see examples when dealing with arrays. Does it have any other uses that can't be achieved by less opaque means? I'm sure you could find a way with clever casting to use it to access members of a struct, but I'm not sure why you'd bother. I'm mostly interested in C, but I'll tag with C++ because the answer probably applies there too.
Edit, based on answers received so far: I know pointers can be used in many non-array contexts. I'm specifically wondering about arithmetic on pointers, e.g. incrementing, taking a difference, etc.
Pointer arithmetic by definition in C happens only on arrays. However, as every object has a representation consisting of an overlaid unsigned char [sizeof object] array, it's also valid to perform pointer arithmetic on this representation. For example:
struct foo {
int a, b, c;
} bar;
/* Equivalent to: bar.c = 1; */
*(int *)((unsigned char *)&bar + offsetof(struct foo, c)) = 1;
Actually char * would work just as well.
If you follow the language standard to the letter, then pointer arithmetic is only defined when pointing to an array, and not in any other case.
A pointer may point to any element of an array, or one step past the end of the array.
Off the top of my head, I know it's used in XOR linked lists (very nifty) and I've seen it used in very hacky recursions.
On the other hand, it's very hard to find uses, since according to the standard pointer arithmetic is only defined within the bounds of an array.
a[n] is "just" syntactic sugar for *(a + n). For lulz, try the following
int a[2];
0[a] = 10;
1[a] = 20;
So one could argue that indexing and pointer arithmetic are merely interchangeable syntax.
Pointer arithmetic is only defined on arrays. Adding an integer to a pointer that does not point to an array element produces undefined behavior.
In embedded systems, pointers are used to represent addresses or locations. There may not be an array defined. (Although one could say that all of memory is one huge array.)
For example, a stack (holding variables and addresses) is manipulated by adding or subtracting values from the stack pointer. (In this case, the stack could be said to be an array based stack.)
Here's a case for pointer arithmetic outside of (strictly defined) arrays:
double d = 0.5;
unsigned char *bytes = (void *)&d;
for(size_t i = 0; i < sizeof d; i++)
printf("Byte %zu of d is %hhu\n", i, bytes[i]);
Why would you do this? I don't know. But if you want to look at the bitwise representation of an object (useful for things like memcpy and memcmp), you'll need to cast their addresses to unsigned char *s (or signed char *s if you like) and work with them byte-by-byte. (If your task isn't too difficult you can even write the code to work word-by-word, which most memcpy implementations will do. It's the same principle, though, just replace char with int32_t.)
Note that, in the standard, the exact values (or the number of values) that are printed are implementation-defined, but that this will always work as a way to access an object's internal bytewise representation. (It is not required to work for larger integer types, but almost always will - no processor I know of has had trap representations for integers in quite some time).