Where are C++ literal constants stored in memory? - c++

Where are C++ literal constants stored in memory? On the stack or the heap?
int *p = &2; is wrong. I want to know why. Thanks.
-------------------------------------------------
To clarify: my question is "Where are C++ literal constants stored in memory?". Why "int *p = &2" is wrong is not my question.

The details depend on the machine, but assuming a commonest sort of machine and operating system... every executable file contains several "segments" - CODE, BSS, DATA and some others.
CODE holds all the executable opcodes. Actually, it's often named TEXT because somehow that made sense to people way back decades ago. Normally it's read-only.
BSS is uninitialized data - it actually doesn't need to exist in the executable file, but is allocated by the operating system's loader when the program is starting to run.
DATA holds the literal constants - the int8, int16, int32 etc along with floats, string literals, and whatever weird things the compiler and linker care to produce. This is what you're asking about. However, it holds only constants defined for use as variables, as in
const long x = 2;
but it is unlikely to hold literal constants that appear in your source code without being tightly associated with a variable. A lone '2' is dealt with directly by the compiler. For example, in C,
print("%d", 2);
would cause the compiler to build a subroutine call to printf(), writing opcodes to push a pointer to the string literal "%d" and the value 2, both as 64-bit integers on a 64-bit machine (you're not one of those laggards still using 32-bit hardware, are you? :), followed by the opcode to jump to the subroutine identified as 'printf'.
The "%d" literal goes into DATA. The 2 doesn't; it's built into the opcode that stuffs integers onto the stack. That might actually be a "load register RAX immediate" followed by the value 2, followed by a "push register RAX", or maybe a single opcode can do the job. So in the final executable file, the 2 will be found in the CODE (aka TEXT) segment.
It typically isn't possible to make a pointer to that value, or to any opcode. It just doesn't make sense in terms of what high level languages like C do (and C is "high level" when you're talking about opcodes and segments.) "&2" can only be an error.
Now, it's not entirely impossible to have a pointer to opcodes. Whenever you define a function in C, or an object method, constructor or destructor in C++, the name of the function can be thought of as a pointer to the first opcode of the machine code compiled from that function. For example, printf without the parentheses is a pointer to a function. Maybe if your example code were in a function and you guessed the right offset, pointer arithmetic could be used to point at that "immediate" value 2 nestled among the opcodes, but this is not easy on any contemporary CPU, and it certainly isn't a beginner's exercise.
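To make that last point concrete, here is a minimal sketch (using printf, as in the example above) of storing a function's address in a pointer and calling through it:
#include <cstdio>

int main() {
    // The function name without parentheses yields a pointer to the
    // function, i.e. to the first opcode of its compiled machine code.
    int (*fp)(const char *, ...) = std::printf;
    fp("%d\n", 2);  // calls printf through the pointer
}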

Let me quote relevant clauses of C++03 Standard.
5.3.1/2
The result of the unary & operator is a pointer to its operand. The
operand shall be an lvalue.
An integer literal is an rvalue (I haven't found a direct quote in the C++03 Standard, but C++11 mentions it as a side note in 3.10/1).
Therefore, it's not possible to take an address of an integer literal.
As for the exact place where the 2 is stored, it depends on usage. It might be part of a machine instruction, or it might be optimized away entirely; e.g. j=i*2 might become j=i+i. You should not rely upon it.
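In code, both points look like this (a minimal sketch; the function name f is made up for illustration):
int f(int i) {
    // int *p = &2;  // ill-formed: 2 is an rvalue, and unary & needs an lvalue
    return i * 2;    // a compiler may emit this as i + i (or a shift), so the
}                    // literal 2 may never exist anywhere in memory at all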

You have two questions:
Where are literal constants stored? With the exception of string literals (which are actual objects), pretty much wherever the implementation wants. It will usually depend on what you're doing with them, but on a lot of architectures, integral constants (and often some special floating point constants, like 0.0) will end up as part of a machine instruction. When this isn't possible, they'll usually be placed in the same logical segment as the code.
As to why taking the address of an rvalue is illegal, the main reason is because the standard says so. Historically, it's forbidden because such constants often never exist as a separate object in memory, and thus have no address. Today... one could imagine other solutions: compilers are smart enough to put them in memory if you took their address, and not otherwise; and rvalues of class type do have a memory address. The rules are somewhat arbitrary (and would be, regardless of what they were); hopefully, any rules which would allow taking the address of a literal would make its type int const*, and not int*.
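Note that C++ already does something very close to that hypothetical rule when you bind a literal to a reference-to-const; a tiny sketch:
const int &r = 2;   // the literal is materialized as a temporary with storage
const int *p = &r;  // now it has an address, and the type is const int*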

Related

Storage of literal constants in c++

I would like to know where literal constants are actually stored in the memory?
example:
int i = 5;
const char* data = "abcdefgh";
The storage sections of i and data depend on where they are declared.
But does the compiler store 5 and "abcdefgh" before actually copying it to the variables?
And here I can get the address of "abcdefgh" where it is stored, but why can't I get the address of 5?
Integer literals like 5 can be part of machine instructions. For example:
LD A, 5
would load the value 5 into processor register A on some imaginary architecture, and since the 5 is actually part of the instruction, it has no address. Few (if any) architectures can embed string literals inline in machine instructions, so string literals have to be stored elsewhere in memory and accessed via pointers. Exactly where that "elsewhere" is located is not specified by the C++ Standard.
On the language level, string literals and numeric literals are different beasts.
The C and C++ standard essentially specify that string literals are treated "as if" you defined a constant array of characters with the appropriate size and content, and then you used its name in place of the literal. IOW, when you write
const char *foo = "hello";
it's as if you wrote
// in global scope
const char hello_literal[6] = {'h', 'e', 'l', 'l', 'o', '\0'};
...
const char *foo = hello_literal;
(there are some backwards-compatibility exceptions that allow you to even write char *foo = "hello";, without the const, but that's deprecated and it's undefined behavior anyway to try to write through such a pointer)
So, given this equivalence, it's normal that you can take the address of a string literal. Integral literals, OTOH, are rvalues, for which the standard specifies that you cannot take an address - you can roughly think of them as values that the standard expects not to have a backing memory location in the conventional sense.
Now, this distinction actually descends from the fact that on the machine level they are usually implemented differently.
A string literal generally is stored as data somewhere in memory, typically in a read-only data section that gets mapped in memory straight from the executable. When the compiler needs its address it's easy to oblige, since it is data stuff that is already in memory, and thus it does have an address.
Instead, when you do something like
int a = 5;
the 5 does not really have a separate memory location like the "hello world" array above, but it's usually embedded into the machine code as an immediate value.
It's quite complicated to have a pointer to it: it would be a pointer pointing halfway into an instruction, and in general it would point to data in a different format than what would be expected for a regular int variable to which you can point. Think of x86, where small numbers use more compact encodings; of PowerPC/ARM and other RISC architectures, where some values are built from an immediate manipulated by the implicit barrel shifter, and some values can't be expressed as immediates at all and must be composed out of several instructions; or of Harvard architectures, where data and code live in different address spaces.
For this reason, you cannot take the address of numeric literals (nor of the results of numeric expressions and much other temporary stuff); if you want the address of a number, you have to first assign it to a variable (which provides in-memory storage), and then ask for the variable's address.
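In code, that workaround is simply (a trivial sketch):
// int *p = &5;        // error: the 5 has no memory location to point at
const int five = 5;    // give the value an object with storage...
const int *p = &five;  // ...and now it has an address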
Although the C and C++ standards don't dictate where literals are stored, common practice stores them in one of two places: in the code (see @NeilButterworth's answer), or in a "constants" segment.
Common executable files have a code section and a data section. The data section may be split up into read-only, uninitialized read-write, and initialized read-write parts. Often, the literals are placed into the read-only part of the executable.
Some tools may also place the literals into a separate data file. This data file may be used to program the data into read-only memory devices (ROM, PROM, Flash, etc.).
In summary, the placement of literals is implementation dependent. The C and C++ standards state that writing to the location of a literal is undefined behavior. Preferred practice with character literals is to declare the variable as const, so the compiler can generate warnings or errors when a write to a literal occurs.
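For example (a small sketch; the cast compiles only because of the backwards-compatibility exception mentioned earlier):
const char *msg = "hello";       // preferred: writes are rejected at compile time
// msg[0] = 'H';                 // error: assignment of read-only location
char *legacy = (char *)"hello";  // compiles, for backwards compatibility...
// legacy[0] = 'H';              // ...but writing through it is undefined behavior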

Is pointer conversion expensive or not?

Is pointer conversion considered expensive (e.g. how many CPU cycles does it take to convert a pointer/address), especially when you have to do it quite frequently? For instance (just an example to show the scale of frequency; I know there are better ways for this particular case):
unsigned long long *x;
/* fill data to x */
for (int i = 0; i < 1000*1000*1000; i++)
{
    A[i] = foo((unsigned char*)x + i);
}
(e.g. how many CPU cycles it takes to convert a pointer/address)
In most machine code languages there is only one "type" of pointer, so it doesn't cost anything to convert between them. Keep in mind that C++ types really only exist at compile time.
The real issue is that this sort of code can break strict aliasing rules. You can read more about this elsewhere, but essentially the compiler will either produce incorrect code through undefined behavior, or be forced to make conservative assumptions and thus produce slower code. (Note that char* and friends are somewhat exempt from the undefined behavior part.)
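A small sketch of that exemption (the helper name bits_of is made up, and it assumes sizeof(int) == sizeof(float); the commented-out line is the aliasing violation):
#include <cstring>

int bits_of(float f) {
    // Undefined behavior under strict aliasing: int and float are
    // unrelated types, so an int* may not be used to read f's storage.
    // return *reinterpret_cast<int*>(&f);

    // Fine: the char family (and std::memcpy) may inspect any object's bytes.
    int i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}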
Optimizers often have to make conservative assumptions about variables in the presence of pointers. For example, a constant propagation process that knows the value of variable x is 5 would not be able to keep using this information after an assignment to another variable (for example, *y = 10) because it could be that *y is an alias of x. This could be the case after an assignment like y = &x.
As an effect of the assignment to *y, the value of x would be changed as well, so propagating the information that x is 5 to the statements following *y = 10 would be potentially wrong (if *y is indeed an alias of x). However, if we have information about pointers, the constant propagation process could make a query like: can x be an alias of *y? Then, if the answer is no, x = 5 can be propagated safely.
Another optimization impacted by aliasing is code reordering. If the compiler decides that x is not aliased by *y, then code that uses or changes the value of x can be moved before the assignment *y = 10, if this would improve scheduling or enable more loop optimizations to be carried out.
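The scenario just described, in code (a minimal sketch):
#include <cstdio>

int main() {
    int x = 5;
    int *y = &x;             // y aliases x
    *y = 10;                 // this write also changes x, so the compiler
                             // must not propagate "x == 5" past this point
    std::printf("%d\n", x);  // must print 10, not a stale constant 5
}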
To enable such optimizations in a predictable manner, the ISO standard for the C programming language (including its newer C99 edition, see section 6.5, paragraph 7) specifies that it is illegal (with some exceptions) for pointers of different types to reference the same memory location. This rule, known as "strict aliasing", sometimes allows for impressive increases in performance,[1] but has been known to break some otherwise valid code. Several software projects intentionally violate this portion of the C99 standard. For example, Python 2.x did so to implement reference counting,[2] and required changes to the basic object structs in Python 3 to enable this optimisation. The Linux kernel does this because strict aliasing causes problems with optimization of inlined code.[3] In such cases, when compiled with gcc, the option -fno-strict-aliasing is invoked to prevent unwanted optimizations that could yield unexpected code.
http://en.wikipedia.org/wiki/Aliasing_(computing)#Conflicts_with_optimization
What is the strict aliasing rule?
On any architecture you're likely to encounter, all pointer types have the same representation, and so conversion between different pointer types representing the same address has no run-time cost. This applies to all pointer conversions in C.
In C++, some pointer conversions have a cost and some don't:
reinterpret_cast and const_cast (or an equivalent C-style cast, such as the one in the question) and conversion to or from void* will simply reinterpret the pointer value, with no cost.
Conversion between pointer-to-base-class and pointer-to-derived class (either implicitly, or with static_cast or an equivalent C-style cast) may require adding a fixed offset to the pointer value if there are multiple base classes.
dynamic_cast will do a non-trivial amount of work to look up the pointer value based on the dynamic type of the object pointed to.
Historically, some architectures (e.g. PDP-10) had different representations for pointer-to-byte and pointer-to-word; there may be some runtime cost for conversions there.
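To illustrate the base-class adjustment from the list above (the layout shown is typical, not mandated by the standard):
struct A { int a; };
struct B { int b; };
struct C : A, B { };

int main() {
    C c;
    B *pb = &c;  // implicit derived-to-base conversion: on typical layouts the
                 // compiler adds an offset of sizeof(A), so (void*)pb != (void*)&c,
                 // while an A* would compare equal to &c
    (void)pb;
}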
unsigned long long *x;
/* fill data to x */
for (int i = 0; i < 1000*1000*1000; i++)
{
    A[i] = foo((unsigned char*)x + i);  // bad cast
}
Remember, the machine knows only memory addresses, data and code. Everything else (such as types) is known only to the compiler, which aids the programmer: the compiler does all the pointer arithmetic, only the compiler knows the size of each type, and so on.
At runtime, there are no machine cycles wasted in converting one pointer type to another, because the conversion does not happen at runtime. All pointers are treated as 4 bytes long (on a 32-bit machine), nothing more and nothing less.
It all depends on your underlying hardware.
On most machine architectures, all pointers are byte pointers, and converting from one byte pointer to another is a no-op. On some architectures, a pointer conversion may under some circumstances require extra manipulation (there are machines that work with word-based addresses, for instance, and converting a word pointer to a byte pointer or vice versa requires extra manipulation).
Moreover, this is in general an unsafe technique, as the compiler can't perform any sanity checking on what you are doing, and you can end up overwriting data you didn't expect.

Binary: How The Processor Distinguishes Between Two Same-Byte-Size Variable Types

I'm trying to figure out how the computer distinguishes between two variable types that have the same byte size.
If I have a variable that is one byte in size, how is the computer able to tell that it is a character rather than a Boolean? Or a character versus half of a short integer?
The processor doesn't know. The compiler does, and generates the appropriate instructions for the processor to execute to manipulate bytes in memory in the appropriate manner, but to the processor itself a byte of data is a byte of data and it could be anything.
The language gives meaning to these things, but it's an abstraction the processor isn't really aware of.
The computer is not able to do that. The compiler is. You use the char or bool keyword to declare a variable and the compiler produces code that makes the computer treat the memory occupied by that variable in a way that makes sense for that particular type.
A 32-bit integer for example, takes up 4 bytes in memory. To increment it, the CPU has an instruction that says "increment a 32-bit integer at this address". That's what the compiler produces and the CPU blindly executes it. It doesn't care if the address is correct or what binary data is located there.
The size of the instruction for incrementing the variable is another matter. It may very well be another 4 or so bytes, but instructions (code) are stored separately from data. There may be many instructions generated for a program that deal with the same location in memory. It is not possible to formally specify the size of the instructions beforehand because of optimizations that may change the number of instructions used for a given operation. The only way to tell is to compile your program and look at the generated assembly code (the instructions).
Also, take a look at unions in C. They let you use the same memory location for different data types. The compiler lets you do that and produces code for it but you have to know what you're doing.
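A tiny union sketch (assuming a 4-byte int and IEEE-754 float; note that in C++, unlike C, reading the non-active member is technically undefined behavior, though major compilers support it):
#include <cstdio>

union Bits {
    int   i;
    float f;  // same 4 bytes, interpreted two different ways
};

int main() {
    Bits b;
    b.f = 1.0f;
    std::printf("%d\n", b.i);  // prints 1065353216: the same bytes, read as an int
}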
Because you specify the type. C++ is a strongly typed language. You can't write $x = 10. :)
It knows
char c = 0;
is a char because of... well, the char keyword.
The computer only sees 1s and 0s. You are in command of what the variable contains.
You can also cast that data into whatever you want.
char foo = 'a';
if ((bool)foo)  // true: 'a' is non-zero
{
    // 'byte' is not a standard C++ type; unsigned char does the job
    int sumA = (unsigned char)foo + (unsigned char)foo;
    // sumA == (97 + 97)
}
Also look into data casting to view the same memory location as different data types. This can be as small as a char or as large as an entire struct.
In general, it can't. Look at the restrictions of dynamic_cast<>, which tries to do exactly that. dynamic_cast can only work in the special case of objects derived from polymorphic base classes. That's because such objects (and only those) have extra data in them. Chars and ints do not have this information, so you can't use dynamic_cast on them.

What are the uses of pointer variables?

I've recently tried to really come to grips with references and pointers in C++, and I'm getting a little bit confused. I understand the * and & operators, which respectively get the value at an address and get the address of a value; however, why can't these simply be used with basic types like ints?
I don't understand why you can't, for example, do something like the following without any weird pointer-variable creation:
string x = "Hello";
int y = &x; //Set 'y' to the memory address of 'x'
cout << *y; //Output the value at the address 'y' (which is the memory address of 'x')
The code above should, theoretically in my mind, output the value of 'x': 'y' contains the memory address of 'x', and hence '*y' should be 'x'. Incidentally, on trying to compile it, it doesn't work; it tells me it can't convert from a string to an int, which doesn't make much sense, since you'd think a memory address could be stored in an int just fine.
Why do we need to use special pointer variable declarations (e.g. string *y = &x)?
And within this, if we take the * operator in the pointer declaration above literally, we appear to be setting the value of '*y' to the memory address of 'x', but then later, when we want to access the value at that memory address ('&x'), we use the same '*y' that we previously set to the memory address.
C and C++ resolve type information at compile-time, not runtime. Even runtime polymorphism relies on the compiler constructing a table of function pointers with offsets fixed at compile time.
For that reason, the only way the program can know that cout << *y; is printing a string is because y is strongly typed as a pointer-to-string (std::string*). The program cannot, from the address alone, determine that the object stored at address y is a std::string. (Even C++ RTTI does not allow this, you need enough type information to identify a polymorphic base class.)
In short, C is a typed language. You cannot store arbitrary things in variables.
Check the type safety article on Wikipedia. C/C++ prevents problematic operations and function calls at compilation time by checking the types of the operands and function parameters (but note that with explicit casts you can change the type of an expression).
It doesn't make sense to store a string in an integer, and in the same way it doesn't make sense to store a pointer in one.
Simply put, a memory address has a type, which is pointer. Pointers are not ints, so you can't store a pointer in an int variable. If you're curious why ints and pointers are not fungible, it's because the size of each is implementation defined (with certain restrictions) and there is no guarantee that they will be the same size.
For instance, as @Damien_The_Unbeliever pointed out, pointers on a 64-bit system must be 64 bits long, but it is perfectly legal for an int to be 32 bits, as long as it is no longer than a long and no shorter than a short.
As to why each data type has its own pointer type: that's because each type (especially user-defined types) is structured differently in memory. If we were to dereference typeless (or void) pointers, there would be no information indicating how that data should be interpreted. If, on the other hand, you were to create a universal pointer and do away with the "inconvenience" of specifying types, each entity in memory would probably have to be stored alongside its type information. While this is doable, it's far from efficient, and efficiency is one of C++'s design goals.
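A sketch of the "typeless pointer" problem just described:
#include <iostream>
#include <string>

int main() {
    std::string x = "Hello";
    void *v = &x;        // fine: any object pointer converts to void*
    // std::cout << *v;  // error: cannot dereference void* - no type information
    std::cout << *static_cast<std::string*>(v);  // you must supply the type yourself
}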
Some very low-level languages... like machine language... operate exactly as you describe. A number is a number, and it's up to the programmer to hold it in their heads what it represents. Generally speaking, the hope of higher level languages is to keep you from the concerns and potential for error that comes from that style of development.
You can actually disregard C++'s type-safety, at your peril. For instance, the gcc on a 32-bit machine I have will print "Hello" when I run this:
string x = "Hello";
int y = reinterpret_cast<int>(&x);
cout << *reinterpret_cast<string*>(y) << endl;
But as pretty much every other answerer has pointed out, there's no guarantee it would work on another computer. If I try this on a 64-bit machine, I get:
error: cast from ‘std::string*’ to ‘int’ loses precision
Which I can work around by changing it to a long:
string x = "Hello";
long y = reinterpret_cast<long>(&x);
cout << *reinterpret_cast<string*>(y) << endl;
The C++ standard specifies minimums for these types, but not maximums, so you really don't know what you're going to be dealing with when you face a new compiler. See: What does the C++ standard state the size of int, long type to be?
So the potential for writing non-portable code is high once you start going this route and "casting away" the safeties in the language. reinterpret_cast is the most dangerous type of casting...
When should static_cast, dynamic_cast, const_cast and reinterpret_cast be used?
But that's just technically drilling down into the "why not int" part specifically, in case you were interested. Note that, as @BenVoight points out in the comment below, there does exist an integer type as of C99 called intptr_t which is guaranteed to hold any pointer. So there are much larger problems when you throw away type information than losing precision... like accidentally casting back to the wrong type!
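With intptr_t, the earlier example no longer depends on guessing a wide-enough integer type (a sketch; intptr_t is optional in the standard, but nearly universal in practice):
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    std::string x = "Hello";
    std::intptr_t y = reinterpret_cast<std::intptr_t>(&x);  // guaranteed wide enough
    std::cout << *reinterpret_cast<std::string*>(y) << std::endl;
}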
C++ is a strongly typed language, and pointers and integers are different types. By making those separate types the compiler is able to detect misuses and tell you that what you are doing is incorrect.
At the same time, the pointer type maintains information on the type of the pointed-to object: if you obtain the address of a double, you have to store it in a double*, and the compiler knows that dereferencing that pointer will get you a double. In your example code (int y = &x; cout << *y;), the compiler would lose the information of what y points to; the type of the expression *y would be unknown, and it would not be able to determine which of the different overloads of operator<< to call. Compare that with std::string *y = &x;: when the compiler sees y, it knows it is a std::string*, and it knows that dereferencing it yields a std::string (and not a double or any other type), enabling the compiler to statically check all expressions that contain y.
Finally, while you might think that a pointer is just the address of the object, and that it should therefore be representable by an integral type (which on 64-bit architectures would have to be a 64-bit integer rather than int), that is not always the case. There are architectures on which pointers are not really representable by integral values. For example, on architectures with segmented memory, the address of an object can contain both a segment (an integral value) and an offset into the segment (another integral value). On other architectures, the size of pointers was different from the size of any integral type.
The language is trying to protect you from conflating two different concepts, even though at the hardware level they are both just sets of bits:
Outside of needing to pass values manually between various parts of a debugger, you never need to know the numerical value.
Outside of archaic uses of arrays, it doesn't make sense to "add 10" to a pointer - so you shouldn't treat them as numeric values.
By retaining type information, the compiler also prevents you from making mistakes - if all pointers were equal, it couldn't helpfully point out that what you're trying to dereference as an int is actually a pointer to a string.

What is the uintptr_t data type?

What is uintptr_t and what can it be used for?
First thing, at the time the question was asked, uintptr_t was not in C++. It's in C99, in <stdint.h>, as an optional type. Many C++03 compilers do provide that file. It's also in C++11, in <cstdint>, where again it is optional, and which refers to C99 for the definition.
In C99, it is defined as "an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer".
Take this to mean what it says. It doesn't say anything about size.
uintptr_t might be the same size as a void*. It might be larger. It could conceivably be smaller, although such a C++ implementation approaches perverse. For example on some hypothetical platform where void* is 32 bits, but only 24 bits of virtual address space are used, you could have a 24-bit uintptr_t which satisfies the requirement. I don't know why an implementation would do that, but the standard permits it.
uintptr_t is an unsigned integer type that is capable of storing a data pointer (whether it can hold a function pointer is unspecified). Which typically means that it's the same size as a pointer.
It is optionally defined in C++11 and later standards.
A common reason to want an integer type that can hold an architecture's pointer type is to perform integer-specific operations on a pointer, or to obscure the type of a pointer by providing it as an integer "handle".
It's an unsigned integer type big enough to hold a pointer. Whenever you need to do something unusual with a pointer - like for example invert all its bits (don't ask why) - you cast it to uintptr_t, manipulate it as a usual integer number, then cast it back.
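That round trip, sketched out (inverting twice just to show the value survives):
#include <cstdint>

int main() {
    int v = 42;
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(&v);
    bits = ~bits;  // invert all bits (don't ask why)...
    bits = ~bits;  // ...and invert them back
    int *p = reinterpret_cast<int*>(bits);  // round-trips to the original pointer
    return *p - v;  // 0
}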
There are already many good answers to "what is uintptr_t data type?". I will try to address the "what it can be used for?" part in this post.
Primarily for bitwise operations on pointers. Remember that in C++ one cannot perform bitwise operations on pointers. For reasons see Why can't you do bitwise operations on pointer in C, and is there a way around this?
Thus in order to do bitwise operations on pointers one would need to cast pointers to type uintptr_t and then perform bitwise operations.
Here is an example of a function I just wrote to take the bitwise exclusive-or of 2 pointers, for storage in an XOR linked list, so that we can traverse in both directions like a doubly linked list but without the penalty of storing 2 pointers in each node.
#include <cstdint>

template <typename T>
T* xor_ptrs(T* t1, T* t2)
{
    // XOR the two addresses as integers, then convert back to a pointer.
    return reinterpret_cast<T*>(reinterpret_cast<std::uintptr_t>(t1) ^
                                reinterpret_cast<std::uintptr_t>(t2));
}
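For context, a hypothetical node layout and traversal using xor_ptrs might look like this:
struct Node {
    int   value;
    Node *link;  // XOR of the previous and next nodes' addresses
};

// Walk the list from one end; swapping the roles of prev and curr
// at the start walks it from the other end.
void traverse(Node *curr, Node *prev = nullptr) {
    while (curr != nullptr) {
        Node *next = xor_ptrs(prev, curr->link);
        prev = curr;
        curr = next;
    }
}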
Running the risk of getting another Necromancer badge, I would like to add one very good use for uintptr_t (or even intptr_t) and that is writing testable embedded code.
I write mostly embedded code targeted at various ARM and, currently, Tensilica processors. These have various native bus widths, and the Tensilica is actually a Harvard architecture with separate code and data buses that can be different widths.
I use a test-driven development style for much of my code, which means I write unit tests for all the code units I write. Unit testing on actual target hardware is a hassle, so I typically write everything on an Intel-based PC, either on Windows or Linux, using Ceedling and GCC.
That being said, a lot of embedded code involves bit twiddling and address manipulation. Most of my Intel machines are 64-bit. So if you are going to test address-manipulation code, you need a generalized object to do math on. Thus uintptr_t gives you a machine-independent way of debugging your code before you try deploying it to target hardware.
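For example, this is the kind of address manipulation I mean (a hypothetical helper, testable on any host):
#include <cstdint>

// Is a pointer aligned to a 4-byte boundary?
bool is_aligned4(const void *p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0x3u) == 0;
}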
Another issue is that on some machines, or even with some memory models on some compilers, function pointers and data pointers have different widths. On those machines the compiler may not even allow casting between the two classes of pointer, but uintptr_t should be able to hold either.
-- Edit --
As @chux pointed out, this is not part of the standard, and functions are not objects in C. However, it usually works, and since many people don't even know about these types, I usually leave a comment explaining the trickery. Other searches on SO about uintptr_t will provide further explanation. Also, we do things in unit testing that we would never do in production, because breaking things is good.