In "Computer Systems: A Programmer's Perspective", section 2.1 (page 31), it says:
The value of a pointer in C is the virtual address of the first byte of some block of storage.
To me it sounds like the C pointer's value can take values from 0 to [size of virtual memory - 1]. Is that the case? If yes, I wonder if there is any mechanism that checks if all pointers in a program are assigned with legal values -- values at least 0 and at most [size of virtual memory - 1], and where such mechanism is built in -- in compiler? OS? or somewhere else?
There is no process that checks pointers for validity as use of invalid pointers has undefined effects anyway.
Usually it will be impossible for a pointer to hold a value outside of the addressable range as the two will have the same available range — e.g. both will be 32 bit. However some CPUs have rules about pointer alignment that may render some addresses invalid for some types of data. Some runtimes, such as 64-bit Objective-C, which is a strict superset of C, use incorrectly aligned pointers to disguise literal objects as objects on the heap.
There are also some cases where the complete address space is defined by the instruction set to be one thing but is implemented by that specific hardware to be another. An example from history is the original 68000 which defined a 32-bit space but had only 24 address lines. Very early versions of Mac OS used the spare 8 bits for flags describing the block of data, relying on the hardware to ignore them.
So:
there's no runtime checking of validity;
even if there were, the meaning of validity is often dependent on the specific model of CPU (not just the family) or specific version of the OS (ditto) so as to make checking a less trivial task than you might guess.
In practice, what will normally happen if your address is illegal per that hardware but is accessed as though legal is a processor exception.
A pointer in C is an abstract object. The only guarantee provided by the C standard is that pointers can point to all the things they need to within C: functions, objects, one past the end of an object, and NULL.
In typical C implementations, pointers can point to any address in virtual memory, and some C implementations deliberately support this in large part. However, there are complications. For example, the value used for NULL may be difficult to use as an address, and converting pointers created for one type to another type may fail (due to alignment problems). Additionally, there are legal non-typical C implementations where pointers do not directly correlate to memory addresses in a normal way.
You should not expect to use pointers to access memory arbitrarily without understanding the rules of the C standard and of the C implementations you use.
There is no mechanism in C which will check if pointers in a program are valid. The programmer is responsible for using them correctly.
For practical purposes a C pointer is either NULL or a memory address to something else. I've never heard of NULL being anything but zero in real life. If it's a memory address you're not supposed to "care" what the actual number is; just pass it around, dereference it etc.
I have heard quite a lot about storing external data in a pointer.
For example in SSO (short string optimization).
For example:
when we want to overload << for our SSO class, depending on the length of the string we want to print either the value of the pointer or the string.
Instead of creating a bool flag we could encode this flag inside the pointer itself. If I am not mistaken, this works thanks to the PC architecture, which adds padding to prevent unaligned memory access.
But I have yet to see an example of it. How could we detect such a flag, when binary operations such as & (to check whether the MSB or LSB is set to 1 as a flag) are not allowed on pointers? Also, wouldn't this mess up dereferencing the pointer?
All answers are appreciated.
It is quite possible to do such things (unlike what others have said). Most modern platforms (x86-64, for example) impose alignment requirements that let you assume the least significant bits of a pointer are zero, and make use of those bits for other purposes.
Let me pause for a second and say that what I'm about to describe is considered 'undefined behavior' by the C and C++ standards. You are going off-the-rails in a non-portable way by doing what I describe, but there are more standards governing the rules of a computer than the C++ standard (such as the processor's assembly reference and architecture docs). Caveat emptor.
With the assumption that we're working on x86_64, let us say that you have a class/structure that starts with a pointer member:
struct foo {
    bar * ptr;
    /* other stuff */
};
By the x86-64 alignment rules, that pointer in foo must be aligned on an 8-byte boundary. In this trivial example, you can therefore assume that every pointer to a struct foo is an address divisible by 8, meaning the lowest 3 bits of a foo * will be zero.
In order to take advantage of such a constraint, you must play some casting games to allow the pointer to be treated as a different type. There's a bunch of different ways of performing the casting, ranging from the old C method (not recommended) of casting it to and from a uintptr_t to cleaner methods of wrapping the pointer in a union. In order to access either the pointer or the ancillary data, you need to bitwise 'and' the datum with a bitmask that zeros out the part of the datum you don't want.
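The masking described above can be sketched roughly as follows. This is only an illustration using the uintptr_t cast route; Node, kTagMask, pack, pointer_of and tag_of are hypothetical names, and the sketch assumes an implementation where converting an 8-byte-aligned pointer to std::uintptr_t yields an integer whose low 3 bits are zero (typical on x86-64, not promised by the standard).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical node type; alignas(8) requests 8-byte alignment for every Node.
struct alignas(8) Node {
    int value;
};

// With 8-byte alignment the low 3 bits are assumed free for a tag.
constexpr std::uintptr_t kTagMask = 0x7;

// Pack a small tag (0..7) into the low bits of an aligned pointer.
inline std::uintptr_t pack(Node* p, unsigned tag) {
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & kTagMask) == 0 && "pointer must be 8-byte aligned");
    assert(tag <= kTagMask && "tag must fit in 3 bits");
    return bits | tag;
}

// Recover the original pointer by masking the tag bits off.
inline Node* pointer_of(std::uintptr_t packed) {
    return reinterpret_cast<Node*>(packed & ~kTagMask);
}

// Recover the tag by masking the pointer bits off.
inline unsigned tag_of(std::uintptr_t packed) {
    return static_cast<unsigned>(packed & kTagMask);
}
```

The packed value is kept as an integer rather than as a pointer, so nothing ever dereferences the tagged bit pattern; the pointer is only used after pointer_of has restored it.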
As an example of this explanation, I wrote an AVL tree a few years ago that sinks the balance book-keeping data into a pointer, and you can take a look at that example here: https://github.com/jschmerge/structures/blob/master/tree/avl_tree.h#L31 (everything you need to see is contained in the struct avl_tree_node at the line I referenced).
Swinging back to a topic you mentioned in your initial question... Short string optimization isn't implemented quite the same way. The implementations of it in Clang and GCC's standard libraries differ somewhat, but both boil down to using a union to overload a block of storage with either a pointer or an array of bytes, and play some clever tricks with the string's internal length field for differentiating whether the data is a pointer or local array. For more of the details, this blog post is rather good at explaining: https://shaharmike.com/cpp/std-string/
"encode this flag inside pointer itself"
No, you are not allowed to do this in either C or C++.
The behaviour on setting (let alone dereferencing) a pointer to memory you don't own is undefined in either language.
Sadly what you want to achieve is to be done at the assembler level, where the distinction between a pointer and integer is sufficiently blurred.
So I already know there are 'blocks' or units of memory called bytes, and that different variables take up different numbers of bytes. But my real question is: when you create a new program, say with the compiler, does the memory start storing at address one? And using a pointer, can you see what fills which blocks of memory? Also, is this RAM? Sorry for so much wondering; I'm trying to get a grasp on the lower-level part of C++ to get a hint of how memory is stored and such. Thanks.
Objects in C++ occupy memory, and if you can obtain the address of an object, you can inspect that memory. It's completely unspecified where and how that memory comes about; it's supposed to be provided by "the platform", i.e. the compiler knows how to generate machine code that interacts with the system's notion of memory in such a way that every object fits into some memory. You also have platform-provided services (malloc and operator new) to give you memory directly for your own use.
Since this question is likely to be closed fast (it fits in well with the original idea of SO, but not with current "policy") I'm adding this answer quickly so that I can continue writing it. I disagree strongly with current policy for this particular kind of case. So…
About the topic.
Memory management is an extremely large topic. However, your questions about it, e.g. “does the memory start storing at address one”, concern the very basics. And this is a small topic, possible to answer.
The C++ memory model.
/ Bytes.
As seen from the inside of a C++ program, memory is a not necessarily contiguous sequence of bytes. A byte is in this context the smallest addressable unit of electronic memory (or more generally of computer main memory, if other technologies should become popular), and corresponds to C++ char. The C++11 standard describes it thusly, in its §1.7/1:
“A byte is at least large enough to contain
any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8
encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined”
Essential facts about C++ bytes:
A byte is at least 8 bits.
In practice it’s either 8 bits or 16 bits. The latter size is used on some digital signal processors, e.g. from Texas Instruments.
The number of bits per byte is given by CHAR_BIT.
This macro symbol is defined by the <limits.h> C header. It yields a value that can be used at compile time. An alternative way to designate that value is std::numeric_limits<unsigned char>::digits, after including the <limits> C++ header.
unsigned char is commonly used as a byte type.
All three variants of char, namely plain char, unsigned char and signed char, are guaranteed to map to byte, but there is no dedicated standard C++ byte type.
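As a small illustration of the facts above, the two spellings of the bits-per-byte value can be checked against each other at compile time. The variable names are made up for this sketch.

```cpp
#include <climits>
#include <limits>

// Bits per byte as reported by the C macro from <limits.h> / <climits>.
constexpr int bits_per_byte_macro = CHAR_BIT;

// The same quantity via the C++ <limits> header.
constexpr int bits_per_byte_limits = std::numeric_limits<unsigned char>::digits;

// Both checks are evaluated by the compiler; a failure stops the build.
static_assert(bits_per_byte_macro >= 8, "a byte has at least 8 bits");
static_assert(bits_per_byte_macro == bits_per_byte_limits,
              "CHAR_BIT and numeric_limits<unsigned char>::digits agree");
```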
/ Locations.
A value of a built-in type such as double typically occupies a small number of bytes, contiguous in memory. The C++ standard, in its §1.7/3, refers to that, the bytes of a basic value, as a memory location. The essential fact about locations is that two threads can update separate memory locations without interfering with each other, but this is not guaranteed if they update separate bytes in the same memory location.
The sizeof operator produces the number of bytes of a value of a specified type.
By definition, in C++11 in §5.3.3/1, sizeof(char) is 1.
/ Addresses.
To quote the C++11 standard’s §1.7/1, “Every byte has a unique address.”.
The standard doesn’t define address further, but in practice, on modern machines the addresses that a C++ program deals with are bitpatterns of a fixed size, typically 32 or 64 bits.
When a C++ program deals directly with addresses it must do so via pointers, which are addresses with associated types. As a special case the pointer type void* represents untyped addresses, and as such must be able to store the largest address bitpatterns. Thus, on a modern machine CHAR_BIT*sizeof(void*) is in practice the number of bits of an address as seen from inside a C++ program.
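A minimal sketch of both points, with hypothetical names: address_bits computes the expression from the paragraph above, and round_trip shows an object pointer surviving a detour through void*.

```cpp
#include <climits>
#include <cstddef>

// In practice, the number of bits in an address as seen by the program.
constexpr std::size_t address_bits = CHAR_BIT * sizeof(void*);

// An object pointer converts implicitly to void* and back via static_cast
// without losing information.
inline int* round_trip(int* p) {
    void* untyped = p;                  // typed address -> untyped address
    return static_cast<int*>(untyped);  // untyped address -> typed address
}
```

On a typical 64-bit desktop system address_bits comes out as 64, but that is a property of the platform rather than something the language promises.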
Pointer values (addresses) are only guaranteed comparable via the built-in ==, < etc. if they point within the same array, extended with a hypothetical extra item at the end. However, the standard library offers a more general pointer comparison. C++11 §20.8.5/8:
“For templates greater, less, greater_equal, and less_equal, the specializations for any pointer type
yield a total order, even if the built-in operators <, >, <=, >= do not.”
Thus depending on the machine addresses, as seen from C++, either are or can be mapped to integer values. But this does not mean that they can be mapped to int. Depending on the C++ implementation type int may be too small to hold addresses.
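For instance, two unrelated variables can be ordered with std::less even though the built-in < gives no guarantee for them. The helper name ordered_before is invented for this sketch.

```cpp
#include <functional>

// Compare two pointers that need not point into the same array.
// std::less for pointer types is specified to yield a total order.
inline bool ordered_before(const int* a, const int* b) {
    return std::less<const int*>()(a, b);
}
```

Which of two separate variables comes first is not specified; the only thing to rely on is that the order is consistent, so exactly one of ordered_before(&x, &y) and ordered_before(&y, &x) is true for distinct objects x and y.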
There are very few guarantees about the direction in which addresses increase; e.g. nothing guarantees that subsequent variable declarations give you locations with increasing addresses. However, there is such a guarantee for non-static data members that (C++03) have no intervening access specifier or (C++11) have the same access, e.g. public. C++11 §9.2/14:
“Nonstatic data members of a (non-union) class with the same access control (Clause 11) are allocated so
that later members have higher addresses within a class object.”
There is also such a guarantee for items of an array.
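The member-ordering guarantee can be observed with a small check like the one below. Sample and members_ascend are hypothetical names; all three members are public, so they have the same access control.

```cpp
#include <functional>

// All members share the same access (public), so later members
// are required to have higher addresses within a Sample object.
struct Sample {
    int first;
    double second;
    char third;
};

// Returns true when the members appear at strictly increasing addresses.
inline bool members_ascend(const Sample& s) {
    std::less<const void*> before;
    return before(&s.first, &s.second) && before(&s.second, &s.third);
}
```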
The literal 0, used where a pointer value is expected, denotes the null pointer of the relevant type. For the built-in relational operators C++ supports comparing a non-0 pointer to 0 via == and !=, but does not support magnitude comparisons. For absolute safety pointer comparisons can be done via e.g. std::less etc., as noted above.
/ Objects.
An object is “a region of storage”, according to C++11 §1.8/1. That paragraph also notes that an object “has a type”, which determines how the bits in the memory region are interpreted. In order to create an object you can simply declare a variable (a variable is an object with a name) or e.g. use a new-expression.
Worth noting:
A region, in the formal sense of the C++ standard, is not necessarily contiguous.
As far as I can determine this fact is only implicit in the standard, in that an object can be a sub-object, which can be an object of a class with virtual inheritance (sharing a common base class sub-object), in a context of multiple inheritance, where that object – by definition a region of storage – is necessarily spread out in memory.
Dave Abrahams once contended that the intent was to support C++ implementations where objects could be spread around also in other situations than multiple virtual inheritance, but as far as I know no C++ implementations do that. In particular, a variable or any other most derived object (object that isn’t part of some other object) o is in practice a contiguous region of bytes, with all the bytes contained in the sizeof(o) bytes extending from and including the object’s start address.
/ Arrays.
An array, in the sense of an array created via the [] notation, is a contiguous sequence of objects of some fixed type T. Each item (object in the array) has an associated index, starting at 0 for the first item and contiguously increasing. To refer to the first item of an array a, you can use square bracket notation and write a[0].
If the first item has start address a, then item number n has start address a + n*sizeof(T).
In other words, addresses increase in the same direction as the item indices, with item 0 placed lowest in memory.
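The formula above counts in bytes, whereas C++ pointer arithmetic on a T* already scales by sizeof(T). A sketch that measures the byte distance directly (byte_offset_of_item is a made-up helper name):

```cpp
#include <cstddef>

// Byte distance from item 0 to item n of an array, measured by viewing
// both item addresses as pointers to bytes. Requires n < N.
template <typename T, std::size_t N>
std::size_t byte_offset_of_item(const T (&a)[N], std::size_t n) {
    const char* base = reinterpret_cast<const char*>(&a[0]);
    const char* item = reinterpret_cast<const char*>(&a[n]);
    return static_cast<std::size_t>(item - base);
}
```

For an array double a[4], byte_offset_of_item(a, 3) equals 3*sizeof(double), which is the a + n*sizeof(T) rule with the start address subtracted out.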
Operating system processes.
A C++ program can run on just about any kind of computer, from the smallest embedded chips to the largest supercomputers. At the small end of the scale there is not necessarily any operating system or memory management hardware, with the program accessing the computer’s physical memory and other hardware directly. But on e.g. a typical cell phone or desktop computer the program will be executed in an operating system process that isolates the program from direct access to the computer.
In particular, the addresses that an OS process sees and manages are not necessarily physical memory addresses. Instead they may be just logical addresses, which transparently to your C++ code are very efficiently mapped to physical addresses. Among other things this allows you to run two or more instances of your program at the same time, without their memory addressing clashing, because the instances’ logical addresses are mapped to different parts of physical memory.
Practical note: as a security measure, unless otherwise specified a C++ program for Windows, created with Microsoft’s tools, will have parts placed at different logical addresses in different instances, to make it more difficult for malware to exploit known locations. Thus you can’t even rely on fixed logical addresses. And so where objects will be placed, and so on, is not just compiler dependent and operating system dependent, but can depend on the particular instance of the program…
Still you have the guarantees discussed above, namely …
increasing addresses for sub-objects with the same access (e.g. public) within the same outer object, and
increasing addresses in the direction of higher indices in an array.
malloc and operator new are the library calls for allocating memory in a C++ program. It is important to note that they aren't provided by the platform; they are provided by the standard library. All that the C++ standard specifies is that these calls return the address of memory allocated for the program.
The platform usually has a different API for allocating memory from the OS: e.g. Linux has the mmap() and brk() system calls, and Windows has the VirtualAlloc() call. malloc and operator new use these system-specific calls to request memory from the OS, and then suballocate it to the program. In the OS kernel itself, these system calls usually modify MMU entries (on architectures that use an MMU).
How can I create a reserved pointer value?
The context is this: I have been thinking of how to implement a data structure for a dynamic scripting language (I am not planning on implementing this - just wondering how it would be done).
Strings may contain arbitrary bytes, including NUL. Thus, it is necessary to store the value separately. This requires a pointer (to point to the array) and a number. The first trick is that if the pointer is NULL, it cannot possibly be a valid string, so the number can be used for an actual integer.
If a second reserved pointer value could be created, this could be used to imply that the other field is now being used as a floating-point value. Can this be done?
One thought is to mmap() an address with no permissions, which could also be done to replace the usage of the NULL pointer.
On any modern system, you can just use the pointer values 1, 2, ... 4095 for such purposes. Another frequent choice is (uintptr_t)-1, which is technically inferior, but used more frequently than 1 nevertheless.
Why are these values "safe"?
Modern systems safeguard against NULL pointer accesses by making it impossible to map anything at virtual address zero. Almost any dereferencing of a NULL pointer will hit this nonexistent region, and the hardware will tell the OS that something bad happened, which triggers the OS to segfault the process.
Since virtual memory pages are page aligned (at least 4k on current hardware), and nothing is mapped to address zero, nothing can be mapped to the entire range 0, ..., 4095, protecting all these addresses in the same way, and you can use them as special purpose values.
How much virtual address space is reserved for this purpose is a system parameter. On Linux it is controlled by /proc/sys/vm/mmap_min_addr, and the root user can change it to zero, which would disable this protection (not a very smart idea). The default on Ubuntu is 64k (i.e. 16 pages).
This is also the reason why (uintptr_t)-1 is less safe than 1: even though any load of more than one byte will wrap around and hit the zero page, the address (uintptr_t)-1 itself is not necessarily protected in this way. Consequently, doing string operations on (char*)-1 does not necessarily segfault.
Edit:
My original explanation with the special mapping seems to have been a bit stale; probably this was the way things were handled on the old Mac/PPC platform. Even though the effect is pretty much the same, I changed the details of the answer to reflect modern Linux. Anyway, the important point is not how the null page protection is achieved; the important point is that any sane, modern system will have some null page protection that encompasses at least the mentioned address range. Some more details can be found in this SO answer: https://stackoverflow.com/a/12645890/2445184
In standard C (and standard C++), the approach that's 100% valid and works is simple: declare a variable, use its address as a magic value.
char *ptr;
char magic;
if (ptr == &magic) { ... }
This guarantees that magic will never have any overlap with another object.
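Applied to the scripting-language value from the question, the idea could look roughly like the sketch below. Every name here (Value, Kind, float_tag, make_string and so on) is hypothetical, and the layout is just one possible reading of "a pointer plus a number": a null pointer marks an integer, and the address of a private marker object marks a float.

```cpp
#include <cassert>
#include <cstddef>

// Marker object: its address serves as the second reserved pointer value.
// It is a distinct object, so no real string buffer can share its address.
static const char float_tag = 0;

enum class Kind { String, Integer, Float };

// One pointer plus one number-sized field, as described in the question.
struct Value {
    const char* data;        // string bytes, nullptr, or &float_tag
    union {
        std::size_t length;  // used when data points at a real string
        long integer;        // used when data == nullptr
        double real;         // used when data == &float_tag
    };
};

inline Value make_string(const char* bytes, std::size_t length) {
    assert(bytes != nullptr && bytes != &float_tag);
    Value v;
    v.data = bytes;
    v.length = length;
    return v;
}

inline Value make_integer(long i) {
    Value v;
    v.data = nullptr;
    v.integer = i;
    return v;
}

inline Value make_float(double d) {
    Value v;
    v.data = &float_tag;
    v.real = d;
    return v;
}

// Decide which union member is active by inspecting the pointer only.
inline Kind kind_of(const Value& v) {
    if (v.data == nullptr) return Kind::Integer;
    if (v.data == &float_tag) return Kind::Float;
    return Kind::String;
}
```

The marker costs one byte of static storage, and in exchange the check stays valid on any conforming implementation.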
Magic pointer values such as (char *) 1 have their advantages too, but they are easy to get wrong. Even if you disregard the theoretical implementations where (char *) 1 may be a valid object, if you use (int *) 1 as a magic pointer value and the optimiser assumes int * values are suitably aligned, it may remove checks that are no-ops only in 100% valid code, not in your code. I'd therefore recommend the standard approach, and optionally switching to magic pointer values temporarily only if you find they help you debug.
mmapping an address can fail if the address is already assigned. Probably it would be better to use the address of some static variable or function. Or to obtain a unique address via malloc(1).
I know NULL (0x00000000) is a pointer to nothing because the OS doesn't allow the process to allocate any memory at this location. But if I use 0x00000001 (Magic number or code-pointer), is it safe to assume as well that the OS won't allow memory to be allocated here?
If so then until where is it safe to assume that?
Standard (first)
The Standard only guarantees that 0 is a sentinel value as far as pointers go. The underlying memory representation is in no way guaranteed; it's implementation defined.
Using a pointer set to that sentinel value for anything other than reading the pointer's state or writing a new state is undefined behavior; that includes dereferencing it and doing pointer arithmetic on it.
Virtual Memory
In the days of virtual memory (ie, each process gets its own memory space, independent from the others), a null pointer is most often indeed represented as 0 in the process memory space. I don't know of any other architectures actually, though I imagine that in mainframes it may not be so.
Unix
In the Unix world, it is typical to reserve all the address space below 0x8000 for null values. The memory is not allocated, really, it is just protected (ie, placed in a special mode), so that the OS will trigger a segmentation fault should you ever try to read it or write to it.
The idea of using such a range is that a null pointer is not necessarily used as is. For example, if you have std::pair<int, int>* p = 0; which is null, and access p->second, then the compiler will perform the arithmetic necessary to point to second (ie +4 generally) and attempt to access the memory at 0x4 directly. The problem is obviously compounded by arrays.
In practice, this 0x8000 limit should be practical enough to detect most issues (and avoid memory corruption or others). In this case, this means that you avoid the undefined behavior and get a "proper" crash. However, should you be using a large array you could overshoot it, so it's not a silver bullet.
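The arithmetic involved can be made visible without dereferencing anything. The sketch below uses a plain struct IntPair as a stand-in for std::pair<int, int> (an assumption, since offsetof is only portable on simple struct types), and it assumes the null pointer is address 0; lands_in_guard_region is a hypothetical helper.

```cpp
#include <cstddef>

// Plain stand-in for std::pair<int, int>.
struct IntPair {
    int first;
    int second;
};

// Byte offset that p->second adds to the pointer value p.
constexpr std::size_t second_offset = offsetof(IntPair, second);

// True if a member access through a null pointer (assumed to be address 0)
// would still land inside a guard region of guard_size bytes starting at 0.
constexpr bool lands_in_guard_region(std::size_t member_offset,
                                     std::size_t guard_size) {
    return member_offset < guard_size;
}
```

A small member offset stays inside a 0x8000-byte guard region, while an index far into a large array can produce an offset beyond it, which is the overshoot mentioned above.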
The particular limit of your implementation or compiler/runtime stack can be determined either through documentation or by successive trials. There might even be a way to tweak it.
You should not assume anything about the actual values of pointers. Especially, the null pointer is not required to be represented by a zero address, even though the literal 0 does look like a zero.
The only valid range is supposed to be the range allocated to you by the OS. Anything else should be denied by the OS.
An exception to that rule is the shared memory.
The C++ standard doesn't "reserve" any pointer addresses other than zero (null). So it is not safe to use 1 or any other value as a "magic" pointer value. Of course, in practice, some implementations of C++ probably do not ever use certain values. But you don't get any guarantees from the language definition.
I will try to give a broad view about this:
You will probably never access real memory addresses, because of the multiple sandboxing mechanisms that every modern OS puts in place.
What is a NULL pointer from the software viewpoint? A NULL pointer is a pointer variable that stores a value chosen to act as a label meaning "this pointer goes nowhere". A NULL pointer does not point to 0x000000 by definition: the definition is not about where the pointer points but about the value of the macro called NULL, which becomes the value of the NULL pointer.
In C you can assume that NULL == 0, since NULL is a macro for a null pointer constant such as 0 or ((void*)0); C++ is stricter about how NULL may be defined.
Every variable has a type, a label and a value (or better, a representation of a value), at least for primitive types, and the same holds for pointers. A void pointer contains a memory address just like any other pointer; the only special thing about it is that in C++ it needs a cast before it can be used safely and effectively. It is a big mistake to think of void* as a pointer that points to nowhere, or to 0, or to NULL, or to 0x0000000.
by the way, i still don't get your problem ...
A modern OS is likely to reserve at least one page for NULL pointer. So 0x1 (or 0x4 if you want 32-bit alignment) is likely to work.
But remember this is not guaranteed by C/C++ language. You would have to rely on your OS and compiler for such behavior.
Furthermore, there's no guarantee about the actual value of the NULL pointer. It may or may not be all zeros. If it's not, your trick won't work at all.
Or are those addresses reserved for the operating system and things like that?
Thanks.
While it's unlikely that 0x00000001, etc. will be valid pointers (especially if you use odd numbers on many processors) using a pointer to store an integer value will be highly system dependent.
Are you really that strapped for space?
Edit:
You could make it portable like this:
char *base = malloc(NUM_MAGIC_VALUES);
#define MAGIC_VALUE_1 (base + 0)
#define MAGIC_VALUE_2 (base + 1)
...
Well, the OS is going to give each program its own virtual memory space, so when the application references memory addresses 0x0000001 or 0x0000002, it's actually referencing some other physical memory address. I would take a look at paging and virtual memory. So a program will never have access to memory the operating system is using. However, I would stay away from manually assigning a memory address to a pointer rather than using malloc(), because those memory addresses might be text or reserved space.
This depends on operating system layout. For User space applications running in general purpose operating systems, these are inaccessible addresses.
This problem is related to an architecture's virtual address space. Have a look at this http://web.cs.wpi.edu/~cs3013/c07/lectures/Section09.1-Intel.pdf
Of course, you can do this:
int* myPointer1 = (int*)0x000001;  /* explicit cast: an integer does not convert to a pointer implicitly */
int* myPointer2 = (int*)0x000032;
But do not try to dereference these addresses, because it will end in an access violation.
The OS gives you the memory; by the way, these addresses are just virtual.
The OS hides the details and shows memory as one big, continuous stripe.
Maybe the 0x000000-0x211501 part is on a web server and you read/write it through the network,
and the remainder is on your hard disk. Physical memory is just an illusion from your current viewpoint.
You tagged your question C++. I believe that in C++ the address at 0 is reserved and is normally referred to as NULL. Other than that you cannot assume anything. If you want to ask about a particular implementation on a particular OS then that would be a different question.
It depends on the compiler/platform, but many older compilers actually have something like the string "(null)" at address 0x00000000. This is a debug feature because that string will show up if a NULL pointer is ever used by accident. On newer systems like Windows, a pointer to this area will most likely cause a processor exception.
I can pretty much guarantee that address 1 and 2 will either be in use or will raise a processor exception if they're ever used. You can store any value you like in a pointer. But if you try and dereference a pointer with a random value, you're definitely asking for problems.
How about a nice integer instead?
Although the standard requires that NULL is 0, a pointer that is NULL does not have to consist of all zero bits, although it will do in many implementations. That is also something you have to beware of if you memset a POD struct that contains some pointers, and then rely on the pointers holding "NULL" as their value.
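One way to sidestep the memset concern is to let the language produce the null pointers. In the sketch below (Record and make_empty_record are invented names), value-initialization sets every pointer member to a null pointer whatever bit pattern the implementation uses for it.

```cpp
// Hypothetical POD-style record containing pointers.
struct Record {
    int id;
    const char* name;
    Record* next;
};

// Value-initialization zero-initializes the members, which for pointers
// means "null pointer", not necessarily "all bits zero".
inline Record make_empty_record() {
    return Record();
}
```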
If you want to use the same space as a pointer you could use a union, but I guess what you really want is something that doubles up as a pointer and something else, and you know it is not a pointer to a real address if it contains low-numbered values. (With a union you still need to know which type you have).
I'd be interested to know what the magic other value is really being used for. Is this some lazy-evaluation issue where the pointer gives an indication of how to load the data when it is not yet loaded and a genuine pointer when it is?
Yes, on some platforms address 0x00000001 and 0x00000002 are valid addresses. On other platforms they are not.
In the embedded systems world, the validity depends on what resides at those locations. Some platforms may put interrupt or reset vectors at those addresses. Other embedded platforms may place Position Independent executable code there.
There is no standard specification for the layout of addresses. One cannot assume anything. If you want your code to be portable then forget about accessing specific addresses and leave that to the OS.
Also, the structure of a pointer is platform dependent. So is the conversion of the value in a pointer to a physical address. Some systems may only decode a portion of the pointer, others use the entire pointer value. Some may use indirection (a.k.a. virtual addressing) to access real objects. Still no standardization here either.