How do pointers in C/C++ actually store addresses?

If an int is stored in memory in 4 bytes, each of which has a unique address, which of these four addresses does a pointer to that int store?

A pointer to int (an int*) stores the address of the first byte of the integer. The size of int is known to the compiler, so it just needs to know where it starts.
How the bytes of the int are interpreted depends on the endianness of your machine, but that doesn't change the fact that the pointer just stores the starting address (the endianness is also known to the compiler).

Those 4 int bytes are not stored in random locations - they are consecutive. So it is enough to store the address of the first byte of the object.

Depends on the architecture. On a big-endian architecture (M68K, IBM z series), it’s usually the address of the most significant byte. On a little-endian architecture (x86), it’s usually the address of the least-significant byte:
big-endian:       A     A+1   A+2   A+3
                +-----+-----+-----+-----+
                | msb |     |     | lsb |
                +-----+-----+-----+-----+
little-endian:   A+3   A+2   A+1    A
There may be other oddball addressing schemes I’m leaving out.
But basically it’s whatever the underlying architecture considers the “first” byte of the word.

The C Standard does not specify how addresses are represented inside pointers. Yet on most current architectures, a pointer to an int stores its address as the offset in the process' memory space of the first byte of memory used to store it, more precisely the byte with the lowest address.
Note however these remarks:
- the int may have more or fewer than 32 bits. The only constraint is that it must have at least 15 value bits and a sign bit.
- bytes may have more than 8 bits. Most current architectures use 8-bit bytes, but early Unix systems had 9-bit bytes, and some DSP systems have 16-, 24- or even 32-bit bytes.
- when an int is stored using multiple bytes, it is unspecified how its bits are split among these bytes. Many systems use little-endian representation, where the least-significant bits are in the first byte; other systems use big-endian representation, where the most-significant bits and the sign bit are in the first byte. Other representations are possible, but only in theory.
- many systems require that the address of the int be aligned on a multiple of its size.
- how pointers are stored in memory is also system-specific and unspecified. Addresses do not necessarily represent offsets in memory, real or virtual. For example, 64-bit pointers on some CPUs have a number of bits that can be ignored or that may contain a cryptographic signature verified on the fly by the CPU. Adding one to the stored value of a pointer does not necessarily produce a valid pointer.

If an int is stored in memory in 4 bytes, each of which has a unique address, which of these four addresses does a pointer to that int store?
A pointer to int usually stores the address value of the first byte (which is stored at the lowest memory address) of the int object.
Since the size of int is known and constant for a given implementation/architecture, and an int object is always stored in consecutive bytes (there are no gaps between them), it is clear that the following three bytes (if sizeof(int) == 4) belong to the same int object.
How the bytes of the int object are interpreted depends upon endianness*.
The first byte is usually automatically aligned on a multiple of the data word size, depending on the specific architecture, so that the CPU can work most efficiently.
In a 32-bit architecture for example, when the data word size is 4, the first byte lies on a 4-byte boundary - an address location with a multiple of 4.
sizeof(int) is not always 4 (although common) by the way.
*Endianness determines whether interpretation of the object starts at the most significant (first) or least significant (last) byte.

Related

Is int byte size fixed, or does it occupy bytes as needed, in C/C++?

I have seen some programs that use int instead of other types like int16_t or uint8_t even though there is no need to use int.
Let me give an example: when you assign 9 to an int, I know that 9 takes only 1 byte to store, so are the other 3 bytes free to use, or are they occupied?
What I'm asking is: does an int always take 4 bytes in memory, or does it take bytes as needed, with 4 bytes being the maximum size?
I hope you understand what I'm saying.
The size of all types is constant. The value that you store in an integer has no effect on the size of the type. If you store a positive value smaller than the maximum value representable by a single byte, then the more significant bytes (if any) will contain a zero value.
The size of int is not necessarily 4 bytes. The byte size of integer types is implementation defined.
The size of types is fixed at compile time. There is no "dynamic resizing". If you tell the compiler to use int, it will use an integer type that is guaranteed to have at least 16-bit width. However, it may be (and most of the time is) more, depending on the platform and compiler you are using. You can query the byte width on your platform by using sizeof(int).
There is a neat overview about the width of the different integer types at cppreference.
int16_t and uint8_t are not core language types but convenience types defined by the library; they can be used when an exact bit width is required (e.g. for bitwise arithmetic).
An int has no "free bytes". An int is at least 16 bits wide and the exact size depends on the target platform (see https://en.cppreference.com/w/cpp/language/types). sizeof(int) is a compile time constant though. It always occupies the same number of bytes, no matter what value it holds.
The fixed width integer types (https://en.cppreference.com/w/cpp/types/integer) are useful for code that assumes a certain size of integers, because assuming certain size of int is usually a bug. int16_t is exactly 16 bits wide and uint8_t is exactly 8 bits wide, independent of the target platform.
I have seen some programs that use int instead of other types like int16_t or uint8_t even though there is no need to use int
This is sometimes called "sloppy typing". int has the drawback that its size is implementation-defined, so it isn't portable. It can in theory even use an exotic signedness format (at least until the C23 standard).
when you assign 9 to an int, I know that 9 takes only 1 byte to store
That is not correct, and there are no free bytes anywhere. Given some code int x = 9;, the integer constant 9 is of type int and takes up as much space as any other int, unless the compiler decides to optimize it into a smaller type. The constant is typically encoded directly in the instruction stream or stored in read-only memory together with the executable code in the .text segment.
The variable x takes exactly sizeof(int) bytes (4 bytes on typical 32-bit systems) no matter the value stored. The compiler cannot do any sensible optimization regarding the size, other than removing the variable completely when that is possible.

Encode additional information in pointer

My problem:
I need to encode additional information about an object in a pointer to the object.
What I thought I could do is use part of the pointer to do so; that is, use a few bits to encode bool flags. As far as I know, the same thing is done with certain types of handles in the Windows kernel.
Background:
I'm writing a small memory management system that can garbage-collect unused objects. To reduce the memory consumption of object references and speed up copying, I want to use pointers with additional encoded data, e.g. the state of the object (alive or ready to be collected), a lock bit, and similar things that can be represented by a single bit.
My question:
How can I encode such information into a 64-bit pointer without actually overwriting the important bits of the pointer?
Since x64 Windows has a limited address space, I believe not all 64 bits of the pointer are used, so it should be possible. However, I wasn't able to find out which bits Windows actually uses for the pointer and which it does not. To clarify, this question is about user mode on 64-bit Windows.
Thanks in advance.
This is heavily dependent on the architecture, OS, and compiler used, but if you know those things, you can do some things with it.
x86_64 defines a 48-bit [1] byte-oriented virtual address space in the hardware, which means essentially all OSes and compilers will use that. What that means is:
- the top 17 bits of all valid addresses must be all the same (all 0s or all 1s)
- the bottom k bits of any 2^k-byte aligned address must be all 0s
- in addition, pretty much all OSes (Windows, Linux, and macOS at least) reserve the addresses with the upper bits set as kernel addresses: all user addresses must have the upper 17 bits all 0s
So this gives you a variety of ways of packing a valid pointer into less than 64 bits, and then later reconstructing the original pointer with shift and/or mask instructions.
If you only need 3 bits and always use 8-byte aligned pointers, you can use the bottom 3 bits to encode extra info, and mask them off before using the pointer.
If you need more bits, you can shift the pointer up (left) by 16 bits, and use those lower 16 bits for information. To reconstruct the pointer, just right shift by 16.
To do shifting and masking operations on pointers, you need to cast them to intptr_t or int64_t (those will be the same type on any 64-bit implementation of C or C++)
[1] Newer hardware extends this to 57 bits (5-level paging), so only the top 8 bits would need to be 0s or 1s, but it will be a while before all OSes support this.

How many bits are required to store the pointer value?

As far as I know, the size of a pointer is usually 4 bytes on 32-bit systems and 8 bytes on 64-bit systems. But as far as I know, not all the bits are used to store the address. If so, is it safe to use the free bits for other purposes? And if so, how, and how many free bits are available in a pointer on 32-bit and 64-bit systems?
At the time of writing, current 64-bit Intel chips use 48-bit pointers internally.
Every C++ compiler I've come across abstracts this 48-bit pointer to a 64-bit pointer with the most significant 16 bits set to zero.
But the behaviour on using any of the free bits is undefined.
Towards the end of 32-bit chips being the norm, it was possible to have 4 GB of physical memory, let alone virtual memory. All 32 bits were used for the pointer.
It is not portable to use any bits in a pointer value for a different purpose.
You can look at the documentation for your platform to see if it guarantees that any particular bits in a pointer value are available for use. It is likely that even if they are not directly involved in addressing, they are reserved for use by the platform.

how is word size in computer related to int or long

I have seen the link What does it mean by word size in computer? . It defines what word size is.
I am trying to represent a very long string in bits, where each character is represented by 4 bits, and to save it in a long or integer array so that I can extract my string when required.
I can save the bits either in integer array or long array.
If I use a long array (8 bytes per element) I will be able to save 8*8 = 64 bits, i.e. 16 four-bit characters, in one long element.
But if I use int I will be able to save only 4*8 = 32 bits, i.e. 8 characters.
Now, if I am told my word size is 32, is it the case that I should use int only and not long?
To answer your direct question: There is no guaranteed relationship between the natural word-size of the processor and the C and C++ types int or long. Yes, quite often int will be the same as the size of a register in the processor, but most 64-bit processors do not follow this rule, as it makes data unnecessarily large. On the other hand, an 8-bit processor would have a register size of 8 bits, but int according to the C and C++ standards needs to be at least 16 bits in size, so the compiler would have to use more than one register to represent one integer [in some fashion].
In general, if you want to KNOW how many bits or bytes some type is, it's best to NOT rely on int, long, size_t or void *, since they are all likely to be different for different processor architectures or even different compilers on the same architecture. An int and a long may be the same size or different sizes. The only rule the standard gives is that long is at least 32 bits.
So, to have control of the number of bits, use #include <cstdint> (or in C, stdint.h), and use the types for example uint16_t or uint32_t - then you KNOW that it will hold a given number of bits.
On a processor that has a 36-bit word size, the type uint32_t, for example, will most likely not exist, since there is no type that holds exactly 32 bits. Alternatively, the compiler may add extra instructions to behave as if it were a 32-bit type (in other words, sign-extending if necessary, and masking off the top bits as needed).

Bit size of GLib types and portability to more exotic (think 16 bit char) platforms

For example, given the definition at https://developer.gnome.org/glib/stable/glib-Basic-Types.html:
gint8
typedef signed char gint8;
A signed integer guaranteed to be 8 bits on all platforms. Values of
this type can range from G_MININT8 (= -128) to G_MAXINT8 (= 127)
-- what does GLib do to guarantee the type still being 8 bits on platforms where char is not 8 bits? Or is GLib x86 / etc. only (i.e. is this a known limitation)?
As Hans Passant said in his comment, glib guarantees that gint8 is 8 bits by not supporting platforms where signed char is any other size. There are only two types of systems that have ever had C compiler implementations where this requirement wasn't met.
The first is systems where the byte size is 9 bits. Today these are long obsolete, but systems like these had some of the earliest C compilers. In theory it would be possible for the compiler to emulate a restricted-range 8-bit type as an extension, but it would still be 9 bits long in memory, and wouldn't really get you anything.
The second is word-addressable systems, where the word size is either 16, 32 or 64 bits. In these computers the processor can only address memory at word boundaries: address 0 is the first word, address 1 is the second word, and so on, without any overlap between words. For the most part systems like these are obsolete now, but not nearly as much as 9-bit-byte machines. There is apparently still at least some use of word-addressable processors in embedded systems.
In C compilers targeting word-addressable systems, the size of a byte is either the word size or 8 bits, depending on the compiler. Some compilers gave a choice. Having word-sized bytes is the simple way to go. Implementing 8-bit bytes, on the other hand, requires a fair bit of work. Not only does the compiler have to use multiple instructions to access the separate 8-bit values contained in each word, it also has to emulate a byte-addressable pointer. This usually means char pointers have a different size than int pointers, as byte-addressable pointers need more room to store both the address and a byte offset.
Needless to say, the compilers that use word-sized bytes wouldn't be supported by glib, while the ones using 8-bit bytes would at least be able to implement gint8. They still probably wouldn't be supported for a number of other reasons, though; the fact that sizeof(char *) > sizeof(int *) can be true might be a problem.
I should also point out that there are a few other long-obsolete systems that, while having C compilers that used an 8-bit byte, still didn't have a type that meets the requirements of gint8. These are the systems that used ones' complement or sign-magnitude integers, meaning that signed char ranged from -127 to 127 instead of the -128 to 127 range guaranteed by glib.
gint8 (together with other platform dependent types) is declared in glibconfig.h, usually installed under /usr/lib/glib-2.0/include.
That file is generated at configure time, so, at least theoretically, gint8 could be something different.