How does the compiler know how to increment different pointers? - c++

I understand that in general, pointers to any data type will have the same size: normally 2 bytes on a 16-bit system and 4 bytes on a 32-bit system.
Depending on what a pointer points to, incrementing it advances it by a different number of bytes: one byte for a char pointer, more for a long pointer, and so on.
My query is: how does the compiler know by how many bytes to increment a given pointer? Isn't a pointer just a variable stored in memory like any other? Are pointers stored in some symbol table with information about how much they should be incremented by? Thanks

That is why there are data types. Each pointer variable has an associated data type, and that data type has a defined size (see the footnote about complete/incomplete types). Pointer arithmetic takes place based on that data type.
To add to that, for pointer arithmetic to happen, the pointer(s) must be (quoting the C11 standard) a
pointer to a complete object type
So the size of the "object" the pointer points to is known and defined.
Footnote: FWIW, that is why pointer arithmetic on void pointers (an incomplete type) is not allowed / not defined by the standard. (Though GCC supports void pointer arithmetic via an extension.)
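As a small illustration of that footnote, here's a minimal sketch in standard C++ (the commented-out line is the one a conforming compiler must reject):
#include <cstdio>
int main() {
    char buffer[4] = {};
    char* cp = buffer;
    ++cp;   // OK: char is a complete object type; advances by sizeof(char) == 1 byte
    void* vp = buffer;
    (void)vp;
    // ++vp; // ill-formed in standard C++: void is an incomplete type, so the
              // compiler cannot know how far to advance (GCC allows it as an extension)
    std::printf("%p\n%p\n", static_cast<void*>(buffer), static_cast<void*>(cp));
    return 0;
}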

Re
” I understand that in general, pointers to any data type will have the same size
No. Different pointer sizes are unusual for ¹simple pointers to objects, but they can occur on word-addressed machines. On such machines char* is the largest pointer type, and void* is the same size.
C++14 §3.9.2/4
” An object of type cv void* shall have the same representation and alignment requirements as cv char*.
All pointers to class type objects are, however, the same size. For example, you wouldn't be able to use an array of pointers to a base type if this weren't the case.
Re
” how does the compiler know by how many bytes to increment this pointer
It knows the size of the type of object pointed to.
If it does not know the size of the object pointed to, i.e. that type is incomplete, then you can't increment the pointer.
For example, if p is a void*, then you can't do ++p.
Notes:
¹ In addition to ordinary pointers to objects, there are function pointers and pointers to members. The latter kind are more like offsets that must be combined with some specification of a relevant object to yield a reference.

The data type of the pointer variable defines how many bytes the pointer is incremented by.
For example:
1) Incrementing a character pointer advances it by 1 byte, since sizeof(char) is 1.
2) Likewise, incrementing an int pointer advances it by sizeof(int) bytes, commonly 4 on both 32-bit and 64-bit systems (the exact size is implementation-defined).
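A minimal sketch that makes the scaling visible by measuring the byte distance of one increment (the printed values are implementation-defined; a typical 64-bit Linux system prints 1, 4 and 8):
#include <iostream>
int main() {
    char c[2]; int i[2]; long l[2];
    char* pc = c; int* pi = i; long* pl = l;
    // p + 1 points one *element* further on; viewed as raw bytes,
    // the distance is exactly sizeof(*p).
    std::cout << reinterpret_cast<char*>(pc + 1) - reinterpret_cast<char*>(pc) << '\n'  // 1
              << reinterpret_cast<char*>(pi + 1) - reinterpret_cast<char*>(pi) << '\n'  // sizeof(int)
              << reinterpret_cast<char*>(pl + 1) - reinterpret_cast<char*>(pl) << '\n'; // sizeof(long)
    return 0;
}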

Related

C++ Undefined Behavior when subtracting pointers

The C++ standard says that subtracting pointers that do not point to elements of the same array is UB:
int a, b;
&a - &b; // UB, since pointers don't point to the same array.
But if both pointers are cast to uintptr_t, then both expressions are no longer pointer expressions, and subtracting them seems to be legal from the Standard's perspective:
int a, b;
reinterpret_cast<uintptr_t>(&a) - reinterpret_cast<uintptr_t>(&b);
Is this correct or am I missing something?
The subtraction of the integers is legal in the sense that the behaviour is not undefined.
But the standard technically makes no guarantees about the values of the converted integers, and consequently you have no guarantees about the value of the result of the subtraction either (except in relation to those unspecified integer values). You could get a large value or a small value depending on the system (and if the pointers had non-interconvertible types, subtraction of the converted integers of different objects could even yield zero).
Furthermore, if you did some other pointer arithmetic that would result in a pointer value, i.e. converted a pointer to an integer and added an offset, then there is technically no guarantee that converting the result of the addition back to a pointer type would yield the pointer value at that offset from the original. But it would probably work (assuming there actually exists an object of the correct type at that address), except perhaps on systems using segmented memory or something more exotic.
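A sketch of the kind of integer arithmetic described above; on mainstream flat-address-space platforms it prints 30, but nothing in the standard guarantees that result:
#include <cstdint>
#include <iostream>
int main() {
    int arr[4] = {10, 20, 30, 40};
    // Convert a pointer to an integer, add a byte offset, convert back.
    // The standard only guarantees the round trip for the unmodified value.
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(&arr[0]);
    int const* q = reinterpret_cast<int const*>(base + 2 * sizeof(int));
    std::cout << *q << '\n'; // 30 on common implementations; not guaranteed
    return 0;
}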
You are correct.
The behaviour on subtracting pointers that do not point to elements of the same array is undefined. For this purpose a pointer is allowed to be one beyond the final element of an array, and an object counts as an array of length one.
But once you cast a pointer to a suitable type, e.g. to std::uintptr_t (assuming your compiler supports it; it doesn't have to), you can apply any arithmetic you want to it, subject to the constraints imposed upon you by that type.
Although such rules may seem obtuse, they are related to a similar rule where you are not allowed to read a pointer that doesn't point to valid memory. All of this helps achieve greater portability of the language.
The UB allows an implementation to do anything. It does not prevent an implementation from simply computing the difference between the address values and dividing by the size. But it also allows an implementation that checks whether both pointers point into the same array (or one past its end) to raise an exception or crash.
Requiring both pointers to point into the same array allows the implementation to assume that any (correctly aligned) pointer value between the two is also valid and points inside the same array.

Why do you have to specify a type for pointers?

Why do you have to set a type for pointers? Aren't they just placeholders for addresses, and aren't all those addresses the same size? Therefore, won't all pointers, no matter what type is specified, occupy an equal amount of memory?
You don't have to specify a type for pointers. You can use void* everywhere, which would force you to insert an explicit type cast every single time you read something from the address the pointer points to, write something to that address, or simply increment/decrement or otherwise manipulate the pointer's value.
But people decided a long time ago that they were tired of this way of programming, and preferred typed pointers that
do not require casts
do not require always having to know the size of the pointed type (which is an issue that gets even more complicated when proper memory alignment has to be taken into consideration)
prevent you from accidentally accessing the wrong data type or advancing the pointer by the wrong number of bytes.
And yes, indeed, all data pointers, no matter what their type, occupy the same amount of memory, which is usually 4 bytes on 32-bit systems, and 8 bytes on 64-bit systems. The type of a data pointer has nothing to do with the amount of memory occupied by the pointer, and that's because no type information is stored with the pointer; the pointer type is only useful to humans and to the compiler, not to the machine.
Different types take up different amounts of memory. So when advancing a pointer (e.g. in an array), we need to take the type's size into account.
For example, because a char takes up only one byte, going to the next element means adding 0x01 to the address. But because an int takes up 4 bytes (on many architectures), getting to the next element requires adding 0x04 to the address stored in the pointer.
Now, we could have a single pointer type which simply describes an address without type information (in fact, this is what void* is for), but then every time we wanted to increment or decrement it, we'd need to give the type's size as well.
Here's some real C code which demonstrates the pains you'd go through:
#include <stdlib.h>
typedef void* pointer;
int main(void) {
    pointer numbers = calloc(10, sizeof(int));
    int i;
    for (i = 0; i < 10; i++)
        /* cast to char* for the byte arithmetic: standard C forbids
           arithmetic on void* (GCC allows it as an extension) */
        *(int*)((char*)numbers + i * sizeof(int)) = i;
    /* this could have been simply "numbers[i] = i;" */
    /* ... */
    free(numbers);
    return 0;
}
Three important things to notice here:
1) We have to multiply the index by sizeof(int) every time; adding simply i will not do: the first iteration would correctly access the first 4-byte integer, but the second iteration would look for the integer starting at the second byte of the first integer, the third at its third byte, and so on. It's very unlikely that this is desirable!
2) The compiler needs to know how much information it can store through a pointer when assigning to the address it points to. For example, if you try to store a number too large to fit in a char, the compiler should know to truncate the number and not overwrite the next few bytes of memory, which might extend into the next page (causing a segmentation fault) or, worse, be used to store other data in your program, resulting in a subtle bug.
3) Speaking of width, we know in our program above that numbers stores ints -- what if we didn't? What if, for example, we tried to store an int at an address inside an array of a larger data type (on some architectures), like a long? Then our generic functions would end up having to compare the widths of both types, probably using the minimum of the two, and then, if the type being stored is smaller than its container, you would start having to worry about endianness to make sure you align the value being stored with the correct end of the container.
If you want to evaluate the pointed-to value through a pointer, you have to specify the type of the pointed-to element when declaring the pointer, because otherwise the compiler does not know how many bytes the pointer refers to. The machine has to know the exact extent of memory to read in order to evaluate the value.

Why does C++ have the array type?

I am learning C++. I found that a pointer can serve the same function as an array: in a[4], a can be either a pointer or an array. But as C++ defines it, arrays of different lengths have different types. And in fact, when we pass an array to a function, it is converted into a pointer automatically; I think that is another sign that arrays could be replaced by pointers. So, my question is:
Why doesn't C++ replace all arrays with pointers?
In early C it was decided to represent the size of an array as part of its type, available via the sizeof operator. C++ has to be backward compatible with that. There's much wrong with C++ arrays, but having size as part of the type is not one of the wrong things.
Regarding
” pointer has the same function with array, like a[4], a can be both pointer and array
no, this is just an implicit conversion, from array expression to pointer to first item of that array.
As weird as it sounds, C++ does not provide indexing of built-in arrays. There's indexing for pointers, and p[i] just means *(p+i) by definition, so you can also write that as *(i+p) and hence as i[p]. And thus also i[a], because it's really the pointer (after the conversion) that's indexed. Weird indeed.
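For example, all four of these are the same expression by definition:
#include <iostream>
int main() {
    int a[4] = {10, 20, 30, 40};
    // a decays to a pointer; p[i] means *(p + i), which equals *(i + p), i.e. i[p].
    std::cout << a[2] << ' ' << *(a + 2) << ' ' << *(2 + a) << ' ' << 2[a] << '\n'; // 30 30 30 30
    return 0;
}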
The implicit conversion, called a “decay”, loses information, and is one of the things that are wrong about C++ arrays.
The indexing of pointers is a second thing that's wrong (even if it makes a lot of sense at the assembly language and machine code level).
But it needs to continue to be that way for backward compatibility.
Why array decay is Bad™: it causes an array of T to often be represented by simply a pointer to T.
You can't tell from such a pointer (e.g. as a formal argument) whether it points to a single T object or to the first item of an array of T.
But much worse: if T has a derived class TD with sizeof(TD) > sizeof(T), and you form an array of TD, then you can pass that array to a formal argument that's a pointer to T – because the array of TD decays to a pointer to TD, which converts implicitly to a pointer to T. Using that pointer to T as an array then yields incorrect address computations, due to the incorrect size assumption for the array items. And bang, crash (if you're lucky), or perhaps just incorrect results (if you're not so lucky).
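A sketch of that failure mode (the class and function names are made up for illustration; the p[1] access is undefined behavior, and on typical implementations it reads the wrong bytes):
#include <iostream>
struct Base    { int x; };
struct Derived : Base { int extra; }; // sizeof(Derived) > sizeof(Base) on typical platforms
// Looks like it sums "an array of Base", but really just receives a pointer.
int sumFirstTwo(Base* p) { return p[0].x + p[1].x; }
int main() {
    Derived d[2];
    d[0].x = 1; d[0].extra = 0;
    d[1].x = 2; d[1].extra = 0;
    // Compiles without complaint: Derived[2] decays to Derived*, which
    // converts implicitly to Base*. But p[1] advances by sizeof(Base),
    // landing inside d[0] instead of at d[1]: undefined behavior.
    std::cout << sumFirstTwo(d) << '\n'; // not the expected 3
    return 0;
}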
In C and C++, everything of a single type has the same size. An int[4] array is twice as big as an int[2] array, so they can't be of the same type.
But then you might ask, "Why should type imply size?" Well:
A local variable needs to take up a certain amount of memory. When you declare an array, it takes up memory that scales up with its length. When you declare a pointer, it is always the size of pointers on your machine.
Pointer arithmetic is determined by the size of the type it's pointing to: the distance between the address pointed to by p and that pointed to by p+1 is exactly the size of its type. If types didn't have fixed sizes, then p would need to carry around extra information, or C would have to give up arithmetic.
A function needs to know how big its arguments are, because functions are compiled to expect their variables to be in particular places, and having a parameter with an unknown size screws that up.
And you say, "Well, if I pass an array to a function, it just turns into a pointer anyway." True, but you can make new types that have arrays as members, and then you can pass THOSE types around. And in C++, you can in fact pass an array as an array.
int sum10(int (&arr)[10]) { // only accepts int arrays of size 10
    int result = 0;
    for (int i = 0; i < 10; i++)
        result += arr[i];
    return result;
}
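For instance, given the sum10 above and two hypothetical arrays, the size check happens at compile time:
int main() {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int b[5]  = {};
    (void)b;
    int s = sum10(a); // OK: the type int(&)[10] matches exactly; s == 55
    // sum10(b);      // error: int[5] is a different type -- the size survived
    return s == 55 ? 0 : 1;
}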
You can't use pointers in place of array declarations without having to use malloc/free or new/delete to create and destroy memory on the heap. You can declare an array as a variable, it gets created on the stack, and you do not have to worry about its destruction.
Well, an array is an easier way of dealing with data and manipulating it. However, in order to use a pointer you need a valid memory address to point to. The two concepts also behave alike when passed to a function: an array decays to a pointer to its first element, so in both cases the function receives an address and can reach the caller's data. Hope that helps
I'm not sure if I get your question, but assuming you're new to coding:
when you declare an array int a[4] you let the compiler know you need 4*sizeof(int) bytes of memory, and the compiler associates a with the address of the 'start' of that block. When you later use a[x], the [x] means compute (a + sizeof(int)*x) AND dereference that address to get the int.
In other words, it's always a pointer being passed around instead of an 'array', which is just an abstraction that makes it easier for you to code.

C++ pointer's suitable alignment

[basic.stc.dynamic.allocation]/2 about allocation functions:
The pointer returned shall be suitably aligned so that it can be converted to a pointer of any complete object type with a fundamental alignment requirement (3.11) and then used to access the object or array in the storage allocated (until the storage is explicitly deallocated by a call to a corresponding deallocation function).
It is a bit unclear. I thought that a pointer to any type (including void*) has alignment equal to 8. What is the point of The pointer returned shall be suitably aligned so...? Could you give an example of a pointer that is not suitably aligned?
Many systems require that dereferenced pointers be aligned to a multiple of the size of the type. For instance, pointers to shorts would be on multiples of 2 bytes, char pointers are unrestricted, etc. Not all systems have this requirement, but even on those that allow unaligned accesses such accesses are frequently very slow, so programmers typically try to keep everything aligned anyway.
You can find the alignment requirement for a type with alignof, if you want to poke around on your system. A pointer that isn't aligned properly for any type might be something like 0xFFFF0002, which wouldn't be aligned for any 4 byte or higher type.
In short, what that documentation is saying is that the memory returned will be aligned for any fundamental type.
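You can query those requirements directly with alignof; the exact values are implementation-defined, but a typical x86-64 implementation prints 1, 4, 8 and 16:
#include <cstddef>
#include <iostream>
int main() {
    std::cout << alignof(char)   << '\n'  // 1: a char may live at any address
              << alignof(int)    << '\n'  // typically 4
              << alignof(double) << '\n'  // typically 8
              << alignof(std::max_align_t) << '\n'; // strictest fundamental alignment;
                                                    // allocation functions must honor it
    return 0;
}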

Do I understand C/C++ strict-aliasing correctly?

I've read this article about C strict aliasing; I think the same applies to C++.
As I understand it, strict aliasing lets the compiler rearrange code for performance optimization. That's why two pointers of different (and, in the C++ case, unrelated) types cannot refer to the same memory location.
Does this mean that problems can occur only if memory is modified (apart from possible problems with memory alignment)?
For example, when handling a network protocol or doing de-serialization: I have a dynamically allocated byte array, and the packet struct is properly aligned. Can I reinterpret_cast it to my packet struct?
char const* buf = ...; // dynamically allocated
unsigned int i = *reinterpret_cast<unsigned int const*>(buf + shift); // [shift] satisfies alignment requirements
The problem here is not strict aliasing so much as structure representation requirements.
First, it is safe to alias between char, signed char, or unsigned char and any one other type (in your case, unsigned int). This allows you to write your own memory-copy loops, as long as they're defined in terms of a char type. This is authorized by the following language in C99 (§6.5):
6. The effective type of an object for an access to its stored value is the declared type of the object, if any. [Footnote: Allocated objects have no declared type] [...] If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one. For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.
7. An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [Footnote: The intent of this list is to specify those circumstances in which an object may or may not be aliased.]
a type compatible with the effective type of the object,
[...]
a character type.
Similar language can be found in the C++0x draft N3242 §3.11/10, although it is not as clear when the 'dynamic type' of an object is assigned (I'd appreciate any further references on what the dynamic type is of a char array, to which a POD object has been copied as a char array with proper alignment).
As such, aliasing is not a problem here. However, a strict reading of the standard indicates that a C++ implementation has a great deal of freedom in choosing a representation of an unsigned int.
As one random example, an unsigned int might be a 24-bit integer represented in four bytes, with 8 padding bits interspersed; if any of those padding bits does not match a certain (constant) pattern, the value is a trap representation, and dereferencing the pointer will result in a crash. Is this a likely implementation? Perhaps not. But there have been, historically, systems with parity bits and other oddities, so directly reading from the network into an unsigned int is, by a strict reading of the standard, not kosher.
Now, the problem of padding bits is mostly a theoretical issue on most systems today, but it's worth noting. If you plan to stick to PC hardware, you don't really need to worry about it (but don't forget your ntohls - endianness is still a problem!)
Structures make it even worse, of course - alignment representations depend on your platform. I have worked on an embedded platform in which all types have an alignment of 1 - no padding is ever inserted into structures. This can result in inconsistencies when using the same structure definitions on multiple platforms. You can either manually work out the byte offsets for data structure members and reference them directly, or use a compiler-specific alignment directive to control padding.
So you must be careful when directly casting from a network buffer to native types or structures. But the aliasing itself is not a problem in this case.
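For completeness, the usual portable workaround is to copy the bytes out with memcpy rather than dereference a reinterpreted pointer; a minimal sketch (read_u32 is just an illustrative name):
#include <cstddef>
#include <cstdint>
#include <cstring>
// The char-based copy is blessed by the effective-type rules quoted above,
// and it sidesteps the alignment problem as well.
std::uint32_t read_u32(char const* buf, std::size_t shift) {
    std::uint32_t value;
    std::memcpy(&value, buf + shift, sizeof value);
    return value; // still in network byte order -- apply ntohl() if needed
}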
Actually, this code already has UB at the point where you dereference the reinterpret_cast'd integer pointer, without even needing to invoke the strict-aliasing rules. Not only that, but if you aren't rather careful, reinterpreting directly to your packet structure could cause all sorts of issues depending on struct packing and endianness.
Given all that, and that you're already invoking UB, I suspect it's "likely to work" on multiple compilers, and you're free to take that (possibly measurable) risk.