Storage of literal constants in c++ - c++

I would like to know where literal constants are actually stored in the memory?
example:
int i = 5;
char* data = char* &("abcdefgh");
the storage sections of i and data depends on where they are declared.
But does the compiler store 5 and "abcdefgh" before actually copying it to the variables?
And here I can get the address of "abcdefgh" where it is stored, but why can't I get the address of 5?

Integer literals like 5 can be part of machine instructions. For example:
LD A, 5
would load the value 5 into processor register A for some imaginary architecture, and as the 5 is actually part of the instruction, it has no address. Few (if any) architectures have the ability to create string literals inline in the machine instructions, so these have to actually be stored elsewhere in memory and accessed via pointers. Exactly where "elsewhere" is is not specified by the C++ Standard.

On the language level, string literals and numeric literals are different beasts.
The C and C++ standard essentially specify that string literals are treated "as if" you defined a constant array of characters with the appropriate size and content, and then you used its name in place of the literal. IOW, when you write
const char *foo = "hello";
it's as if you wrote
// in global scope
const char hello_literal[6] = {'h', 'e', 'l', 'l', 'o', '\0'};
...
const char *foo = hello_literal;
(there are some backwards-compatibility exceptions that allow you to even write char *foo = "hello";, without the const, but that's deprecated and it's undefined behavior anyway to try to write through such a pointer)
So, given this equivalence it's normal that you can have the address of the string literal. Integral literals, OTOH, are rvalues, for which the standard specifies that you cannot take any address - you can roughly think of them as values that the standard expect not to have a backing memory location in the conventional sense.
Now, this distinction actually descends from the fact that on the machine level they are usually implemented differently.
A string literal generally is stored as data somewhere in memory, typically in a read-only data section that gets mapped in memory straight from the executable. When the compiler needs its address it's easy to oblige, since it is data stuff that is already in memory, and thus it does have an address.
Instead, when you do something like
int a = 5;
the 5 does not really have a separate memory location like the "hello world" array above, but it's usually embedded into the machine code as an immediate value.
It's quite complicated to have a pointer to it, since it would be a pointer pointing halfway into an instruction, and in general pointing to data in a different format than what be expected for a regular int variable to which you can point - think x86 where for small numbers you use more compact encodings, or PowerPC/ARM and other RISC architectures where some values are built from an immediate manipulated by the implicit barrel shifter and you cannot even have immediates for some values - you have to compose them out of several instructions, or Harvard architectures where data and code live in different address spaces.
For this reason, you cannot take the address of numeric literals (as well as of numeric expressions evaluation results and much other temporary stuff); if you want to have the address of a number you have to first assign it to a variable (which can provide an in-memory storage), and then ask for its address.

Although the C and C++ standards don't dictate where the literals are stored, common practice stores them in one of two places: in the code (see #NeilButterworth answer), or in a "constants" segment.
Common executable files have a code section and a data section. The data segment may be split up into read-only, uninitialized read/write and initialized read-write. Often, the literals are placed into the read-only section of the executable.
Some tools may also place the literals into a separate data file. This data file may be used to program the data into read-only memory devices (ROM, PROM, Flash, etc.).
In summary, the placement of literals is implementation dependent. The C and C++ standards state that writing to the location of literals is undefined behavior. Preferred practice with character literals is to declare the variable as const so compiler can generate warnings or errors when a write to a literal occurs.

Related

C++ reference of a string literal

This code outputs a random memory address can anyone explain why and how ?
#include<iostream>
using std::cout;
int main()
{
cout<<&"hello";
return 0;
}
output:
0x560d6984e048
...Program finished with exit code 0
Press ENTER to exit console.
A literal strings in C++ are really arrays of constant characters (including the null-terminator).
By using the pointer-to operator you get a pointer to that array.
It's somewhat equivalent to something like this:
#include <iostream>
char const hello_str[] = "hello";
int main()
{
std::cout << &hello_str;
}
In C++, a string literal has the type const char[N] where N is the length of the string plus one for the null terminator. That array is an lvalue so when you do &"some string", you are getting the address of the array that represents the string literal.
This does not work with other literals like integer literals because they are prvalues, and the address operator (&) requires an lvalue.
Like the other answers have already stated, in C and C++, a string is basically a pointer to an array of characters.
According to cppreference.com:
String literals have static storage duration, and thus exist in memory
for the life of the program.
The C++ standard doesn't describe how exactly executables are supposed to look like, or where and how things get stored in memory. It, e.g., describes storage duration and behaviors etc., but it is a platform independent standard. So this is implementation specific.
So why does this still work and what address do you see? It depends on your compiler and platform, but generally, your compiler will create an executable of your program[^1], e.g., an ELF (Executable and Linkable Format) on Linux, or a PE (portable executable) on Windows.
In an ELF binary, literals are stored in the .rodata section (read-only data) and the machine instructions in the .text section.
You can look at the binary the compiler spits out with certain compiler options, or online on Matt Godbolt's compiler explorer.
Let's look at the example you gave: https://gcc.godbolt.org/z/4s5Y66nY5
We see a label on top, .LC0 where your string is! It is part of the executable file. And this file has to be mapped into memory to be executed.
Line 5 loads the address of that label into a register (%esi) before the call to the stream operation.
So what you see is the location of the string in the read-only section mapped to memory (more precisely, the address in the virtual address space of your process).
This as all rather Linux specific, because that is what I know best, but it is very similar on Windows and Mac.
Here is a nice (student??) paper that goes into more details about how GCC deals with string literals on Unix.
[1]: The compiler could generate other things, of course. An object file. Maybe you want to create a static library, or a dynamic library. But let's keep it simple and talk about executables.

Which is more efficient way for storing strings, an array or a pointer in C/C++? [duplicate]

This question already has answers here:
What is the difference between char s[] and char *s?
(14 answers)
Closed 6 years ago.
We can store string using 2 methods.
Method 1: using array
char a[]="str";
Method 2:
char *b="str";
In method 1 the memory is used only in storing the string "str" so the memory used is 4 bytes.
In method 2 the memory is used in storing the string "str" on 'Read-Only-Memory' and then in storing the pointer to the 1st character of the string.
So the memory used must be 4 bytes for storing string in ROM and then 8 bytes for storing pointer (in 64-bit machine) to the first character.
In total the 1st method uses 4 bytes and the method 2 uses 12 bytes. So is the method 1 always better than method 2 for storing strings in C/C++.
Except if you use a highly resource limited system, you should not care too much for the memory used by a pointer. Anyway, optimizing compilers could lead to same code in both cases.
You should care more about Undefined Behaviour in second case!
char a[] = "str";
correctly declares a non const character array which is initialized to "str". That means that a[0] = 'S'; is perfectly allowed and will change a to "Str".
But with
char *b = "str";
you declare a non const pointer to a litteral char array which is implicitely const. That means that b[0] = 'S'; tries to modify a litteral string and is Undefined Behaviour => it can work, segfault or anything in between including not changing the string.
All of the numbers that you cite, and the type of the memory where the string literal is stored are platform specific.
Which is more efficient way for storing strings, an array or a pointer
Some pedantry about terminology: A pointer can not store a string; it stores an address. The string is always stored in an array, and a pointer can point to it. String literals in particular are stored in an array of static storage duration.
Method 1: using array char a[]="str";
This makes a copy of the content of the string literal, into a local array of automatic storage duration.
Method 2: char *b="str";
You may not bind a non const pointer to a string literal in standard C++. This is ill-formed in that language (since C++11; prior to that the conversion was merely deprecated). Even in C (and extensions of C++) where this conversion is allowed, this is quite dangerous, since you might accidentally pass the pointer to a function that might try to modify the pointed string. Const correctness replaces accidental UB with a compile time error.
Ignoring that, this doesn't make a copy of the literal, but points to it instead.
So is the method 1 always better than method 2 for storing strings in C/C++.
Memory use is not the only metric that matters. Method 1 requires copying of the string from the literal into the automatic array. Not making copies is usually faster than making copies. This becomes more and more important with increasingly longer strings.
The major difference between method 1 and 2, are that you may modify the local array of method 1 but you may not modify string literals. If you need a modifiable buffer, then method 2 doesn't give you that - regardless of its efficiency.
Additional considerations:
Suppose your system is not a RAM-based PC computer but rather a computer with true non-volatile memory (NVM), such as a microcontroller. The string literal "str" would then in both cases get stored in NVM.
In the array case, the string literal has to be copied down from NVM in run-time, whereas in the pointer case you don't have to make a copy, you can just point straight at the string literal.
This also means that on such systems, assuming 32 bit, the array version will occupy 4 bytes of RAM for the array, while the pointer version will occupy 4 bytes of RAM for the pointer. Both cases will have to occupy 4 bytes of NVM for the string literal.

Where the C++ literal-constant storage in memory?

Where the C++ literal-constant storage in memory? stack or heap?
int *p = &2 is wrong. I want know why? Thanks
-------------------------------------------------
My question is "Where the C++ literal-constant storage in memory", "int *p = &2 is wrong",not my question.
The details depend on the machine, but assuming a commonest sort of machine and operating system... every executable file contains several "segments" - CODE, BSS, DATA and some others.
CODE holds all the executable opcodes. Actually, it's often named TEXT because somehow that made sense to people way back decades ago. Normally it's read-only.
BSS is uninitialized data - it actually doesn't need to exist in the executable file, but is allocated by the operating system's loader when the program is starting to run.
DATA holds the literal constants - the int8, int16, int32 etc along with floats, string literals, and whatever weird things the compiler and linker care to produce. This is what you're asking about. However, it holds only constants defined for use as variables, as in
const long x = 2;
but unlikely to hold literal constants used in your source code but not tightly associated with a variable. Just a lone '2' is dealt with directly by the compiler. For example in C,
print("%d", 2);
would cause the compiler to build a subroutine call to print(), writing opcodes to push a pointer to the string literal "%d" and the value 2, both as 64-bit integers on a 64-bit machine (you're not one of those laggards still using 32-bit hardware, are you? :) followed by the opcode to jump to a subroutine at (identifier for 'print' subroutine).
The "%d" literal goes into DATA. The 2 doesn't; it's built into the opcode that stuffs integers onto the stack. That might actually be a "load register RAX immediate" followed by the value 2, followed by a "push register RAX", or maybe a single opcode can do the job. So in the final executable file, the 2 will be found in the CODE (aka TEXT) segment.
It typically isn't possible to make a pointer to that value, or to any opcode. It just doesn't make sense in terms of what high level languages like C do (and C is "high level" when you're talking about opcodes and segments.) "&2" can only be an error.
Now, it's not entirely impossible to have a pointer to opcodes. Whenever you define a function in C, or an object method, constructor or destructor in C++, the name of the function can be thought of as a pointer to the first opcode of the machine code compiled from that function. For example, print() without the parentheses is a pointer to a function. Maybe if your example code were in a function and you guess the right offset, pointer arithmetic could be used to point to that "immediate" value 2 nestled among the opcodes, but this is not going to be easy for any contemporary CPU, and certainly isn't for beginners.
Let me quote relevant clauses of C++03 Standard.
5.3.1/2
The result of the unary & operator is a pointer to its operand. The
operand shall be an lvalue.
An integer literal is an rvalue (however, I haven't found a direct quote in C++03 Standard, but C++11 mentiones that as a side note in 3.10/1).
Therefore, it's not possible to take an address of an integer literal.
What about the exact place where 2 is stored, it depends on usage. It might be a part of an machine instruction, or it might be optimized away, e.g. j=i*2 might become j=i+i. You should not rely upon it.
You have two questions:
Where are literal constants stored? With the exception of string
literals (which are actual objects), pretty much wherever the
implementation wants. It will usually depend on what you're doing with
them, but on a lot of architectures, integral constants (and often some
special floating point constants, like 0.0) will end up as part of a
machine instruction. When this isn't possible, they'll usually be
placed in the same logical segment as the code.
As to why taking the address of an rvalue is illegal, the main reason is
because the standard says so. Historically, it's forbidden because such
constants often never exist as a separate object in memory, and thus
have no address. Today... one could imagine other solutions: compilers
are smart enough to put them in memory if you took their address, and
not otherwise; and rvalues of class type do have a memory address.
The rules are somewhat arbitrary (and would be, regardless of what they
were)—hopefully, any rules which would allow taking the address of
a literal would make its type int const*, and not int*.

Binary How The Processor Distinguishes Between Two Same Byte Size Variable Types

I'm trying to figure out how it is that two variable types that have the same byte size?
If i have a variable, that is one byte in size.. how is it that the computer is able to tell that it is a character instead of a Boolean type variable? Or even a character or half of a short integer?
The processor doesn't know. The compiler does, and generates the appropriate instructions for the processor to execute to manipulate bytes in memory in the appropriate manner, but to the processor itself a byte of data is a byte of data and it could be anything.
The language gives meaning to these things, but it's an abstraction the processor isn't really aware of.
The computer is not able to do that. The compiler is. You use the char or bool keyword to declare a variable and the compiler produces code that makes the computer treat the memory occupied by that variable in a way that makes sense for that particular type.
A 32-bit integer for example, takes up 4 bytes in memory. To increment it, the CPU has an instruction that says "increment a 32-bit integer at this address". That's what the compiler produces and the CPU blindly executes it. It doesn't care if the address is correct or what binary data is located there.
The size of the instruction for incrementing the variable is another matter. It may very well be another 4 or so bytes, but instructions (code) are stored separately from data. There may be many instructions generated for a program that deal with the same location in memory. It is not possible to formally specify the size of the instructions beforehand because of optimizations that may change the number of instructions used for a given operation. The only way to tell is to compile your program and look at the generated assembly code (the instructions).
Also, take a look at unions in C. They let you use the same memory location for different data types. The compiler lets you do that and produces code for it but you have to know what you're doing.
Because you specify the type. C++ is a strongly typed language. You can't write $x = 10. :)
It knows
char c = 0;
is a char because of... well, the char keyword.
The computer only sees 1 and 0. You are in command of what the variable contains.
you can cast that data also into what ever you want.
char foo = 'a';
if ( (bool)(foo) ) // true
{
int sumA = (byte)(foo) + (byte)(foo);
// sumA == (97 + 97)
}
Also look into data casting to look at the memory location as different data types. This can be as small as a char or entire structs.
In general, it can't. Look at the restrictions of dynamic_cast<>, which tries to do exactly that. dynamic_cast can only work in the special case of objects derived from polymorphic base classes. That's because such objects (and only those) have extra data in them. Chars and ints do not have this information, so you can't use dynamic_cast on them.

Difference between Array initializations

Please see the following statements:
char a[5]="jgkl"; // let's call this Statement A
char *b="jhdfjnfnsfnnkjdf"; // let's call this Statement B , and yes i know this is not an Array
char c[5]={'j','g','k','l','\0'}; // let's call this Statement C
Now, is there any difference between Statements A and C?
I mean both should be on Stack dont they? Only b will be at Static location.
So wouldn't that make "jgkl" exist at the static location for the entire life of the program? Since it is supposed to be read-only/constant?
Please clarify.
No, because the characters "jgkl" from Statement A are used to initialize a, it does not create storage in the executable for a character string (other than the storage you created by declaring a). This declaration creates an array of characters in read-write memory which contain the bytes {'j','g','k','l','\0'}, but the string which was used to initialize it is otherwise not present in the executable result.
In Statement B, the string literal's address is used as an initializer. The variable char *b is a pointer stored in read-write memory. It points to the character string "jhdfjnfnsfnnkjdf". This string is present in your executable image in a segment often called ".sdata", meaning "static data." The string is usually stored in read-only memory, as allowed by the C standard.
That is one key difference between declaring an array of characters and a string constant: Even if you have a pointer to the string constant, you should not modify the contents.
Attempting to modify the string constant is "undefined behavior" according to ANSI C standard section 6.5.7 on initialization.
If a[] is static then so is c[] - the two are equivalent, and neither is a string literal. The two could equally well be declared so that they were on the stack - it depends where and how they are declared, not the syntax used to specify their contents.
the value "jgkl" may never be loaded into working memory. Before main is ever called, a function (often called cinit) is run. One of the things this function does is initialize static and file-scope variables. On the DSP compiler I use, the initial values are stored in a table which is part of the program image. The table's format is unrelated to the format of the variables that are being initialized. The initializer table remains part of the program image and is never copied to RAM. Simply put, there is nowhere in memory that I can reliably access "jgkl".
Small strings like a might not even be stored in that table at all. The optimizer may reduce that to the equivalent (pseudoinstruction) store reg const(152<<24|167<<16|153<<8|154)
I suspect most compilers are similar.
A and C are exactly equivalent. The syntax used in A is an abbreviation for the syntax in C.
Each of the objects named a and c is an array of bytes of length 5, stored at a certain location in memory which is fixed during execution. The program can change the element bytes at any time. It is the compiler's responsibility to decide how to initialize the objects. The compiler might generate something similar to a[0] = 'j'; a[1] = 'g'; ..., or something similar to memcpy(a, static_read_only_initialization_data[1729], 5), or whatever it chooses. The array is on the (conceptual) stack if the declaration occurs in a function, or in global writable memory if the declaration occurs at file scope.
The object named b is a pointer to a byte. Its initial value is a pointer to string literal memory, which is read-only on many implementations that have read-only memory, but not all. The value of b could change (for example to point to a different string, or to NULL). The program is not allowed to change the contents of jhdfjnfnsfnnkjdf", though as usual in C the implementation may not enforce this.
C-Literals always are read-only.
a) allocates 5 bytes memory and store the
CONTENT from literal incl. '\0' in it
b) allocates sizeof(size_t) bytes
memory and store the literal-address in it
c) allocates 5 bytes memory and
store the 5 character-values in it