Where is const data stored? - c++

For example:
In the file demo.c,
#include <stdio.h>
int a = 5;
int main() {
    int b = 5;
    int c = a;
    printf("%d", b + c);
    return 0;
}
For int a = 5, does the compiler translate this into something like "store 0x5 at a virtual memory address, for example 0x0000000f, in the const area", so that int c = a is translated to something like movl 0x0000000f, %eax?
Then for int b = 5, the number 5 is not put into the const area, but translated directly into an immediate in the assembly instruction, like mov $0x5, %ebx.

It depends. Your program has several constants:
int a = 5;
This is a "static" initialization (which occurs when the program text and data is loaded before running). The value is stored in the memory reserved by a which is in a read-write data "program section". If something changes a, the value 5 is lost.
int b=5;
This is a local variable with scope limited to main(). The storage could well be a CPU register or a location on the stack. The instructions generated for most architectures will place the value 5 in an instruction as "immediate data"; for an x86 example:
mov eax, 5
The ability for instructions to hold arbitrary constants is limited. Small constants are supported by most CPU instructions. "Large" constants are not usually directly supported. In that case the compiler would store the constant in memory and load it instead. For example,
.psect rodata
k1 dd 3141592653
.psect code
mov eax, k1
The ARM family has a powerful design for loading most constants directly: any 8-bit constant value rotated right by an even number of bit positions. See this page, 2-25.
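To make that rotation rule concrete, here is a small sketch of my own (not from the ARM documentation) that checks whether a 32-bit value can be encoded as such an immediate:

#include <cstdint>

// True if v equals some 8-bit value rotated right by an even amount,
// i.e. it can be an ARM data-processing immediate. Illustrative helper only.
bool arm_immediate_encodable(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // undo a rotate-right by `rot` bits by rotating left
        uint32_t undone = (v << rot) | (v >> ((32 - rot) & 31));
        if (undone <= 0xFF) return true;   // fits in 8 bits
    }
    return false;
}
// e.g. arm_immediate_encodable(0xFF000000) is true, arm_immediate_encodable(0x00012345) is false.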
One not-as-obvious but totally different item is in the statement:
printf("%d", b+c);
The string "%d" is, by modern C semantics, a constant array of three char. Most modern implementations will store it in read-only memory, so attempts to change it cause a SEGFAULT, a low-level fault which usually causes the program to abort instantly.
.psect rodata
s1 db '%', 'd', 0
.psect code
mov eax, s1
push eax
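In C++ terms, that read-only placement is why code like the following (a hedged sketch, not from OP's program) typically crashes at run time:

#include <cstdio>

int main() {
    char *s = const_cast<char *>("%d");   // the literal itself lives in read-only memory
    s[0] = 'X';                           // undefined behaviour; usually a SIGSEGV
    std::puts(s);
}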

In OP's program, a is an "initialized" "global". I expect it to be placed in the initialized part of the data segment. See https://en.wikipedia.org/wiki/File:Program_memory_layout.pdf and http://www.cs.uleth.ca/~holzmann/C/system/memorylayout.gif (for more info, see Memory layout of an executable program (process)). The location of a is decided by the compiler-linker duo.
On the other hand, being automatic (stack) variables, b and c are expected to be in the stack segment.
That being said, the compiler/linker has the liberty to perform any optimization as long as the observable behavior is not violated (What exactly is the "as-if" rule?). For example, if a is never referenced, it may be optimized out completely.
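To tie the segment names to code, here is a minimal annotated sketch (placements are the typical ones; the compiler-linker duo has the last word):

int a = 5;              // initialized global: data segment (read-write)
int zeroed;             // uninitialized global: BSS
const int table = 7;    // const object: typically read-only data, or folded into the code
static int s_count;     // internal linkage, still static storage: BSS

int main() {
    int b = 5;          // automatic: a stack slot or just a register
    static int calls;   // static storage duration despite the local scope: BSS/data
    return a + b + table + s_count + calls;
}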

Related

How can I make a single object larger than 2GB using the new operator?

I'm trying to make a single object larger than 2GB using the new operator.
But if the size of the object is larger than 0x7fffffff, the size of the memory to be allocated becomes strange.
I think this is done by the compiler, because the assembly code itself uses the strange allocation size.
I'm using Visual Studio 2015 and the configuration is Release, x64.
Is it a bug in VS2015? Otherwise, I want to know why the limitation exists.
The example code is below, together with the generated assembly.
struct chunk1MB
{
    char data[1024 * 1024];
};

class chunk1
{
    chunk1MB data1[1024];
    chunk1MB data2[1023];
    char data[1024 * 1024 - 1];
};

class chunk2
{
    chunk1MB data1[1024];
    chunk1MB data2[1024];
};
auto* ptr1 = new chunk1;
00007FF668AF1044 mov ecx,7FFFFFFFh
00007FF668AF1049 call operator new (07FF668AF13E4h)
auto* ptr2 = new chunk2;
00007FF668AF104E mov rcx,0FFFFFFFF80000000h // must be 080000000h
00007FF668AF1055 mov rsi,rax
00007FF668AF1058 call operator new (07FF668AF13E4h)
Use a compiler like clang-cl that isn't broken, or that doesn't have intentional signed-32-bit implementation limits on max object size, whichever it is for MSVC. (Could this be affected by a largeaddressaware option?)
Current MSVC (19.33 on Godbolt) has the same bug, although it does seem to handle 2GiB static objects. But not 3GiB static objects; adding another 1GiB member leads to wrong code when accessing a byte more than 2GiB from the object's start (Godbolt -
mov BYTE PTR chunk2 static_chunk2-1073741825, 2 - note the negative offset.)
GCC targeting Linux makes correct code for the case of a 3GiB object, using mov r64, imm64 to get the absolute address into a register, since a RIP-relative addressing mode isn't usable. (In general you'd need gcc -mcmodel=medium to work correctly when some .data / .bss addresses are linked outside the low 2GiB and/or more than 2GiB away from code.)
MSVC seems to have internally truncated the size to signed 32-bit, and then sign-extended. Note the arg it passes to new: mov rcx, 0FFFFFFFF80000000h instead of mov ecx, 80000000h (which would set RCX = 0000000080000000h by implicit zero-extension when writing a 32-bit register.)
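The difference between the two encodings is easy to reproduce in plain C++ (an illustrative sketch of the extension rules, not of MSVC's internals):

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t truncated = 0x80000000u;                     // sizeof(chunk2) truncated to 32 bits
    uint64_t zero_ext  = truncated;                       // 0x0000000080000000: what mov ecx, 80000000h would give
    int64_t  sign_ext  = static_cast<int32_t>(truncated); // 0xFFFFFFFF80000000: what the sign-extending 64-bit form gives
    std::printf("%llx %llx\n", (unsigned long long)zero_ext, (unsigned long long)sign_ext);
}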
In a function that returns sizeof(chunk2) as a size_t, it works correctly, but interestingly MSVC prints the size as a negative number in the asm source. That might be innocent, e.g. after realizing that the value fits in a 32-bit zero-extended value, MSVC's asm printing code might just always print 32-bit integers as signed decimal, with unsigned hex in a comment.
It's clearly different from how it passes the arg to new; in that case it used 64-bit operand-size in the machine code, so the same 32-bit immediate gets sign-extended to 64-bit, to a huge value near SIZE_MAX, which is of course vastly larger than any possible max object size for x86-64. (The 48-bit virtual address space is 1/65536th of the 64-bit value-range of size_t.)
unsigned __int64 sizeof_chunk2(void) PROC ; sizeof_chunk2, COMDAT
mov eax, -2147483648 ; 80000000H
ret 0
unsigned __int64 sizeof_chunk2(void) ENDP ; sizeof_chunk2
This looks like a compiler bug or intentional implementation limit; report it to Microsoft if it's not already known.
I'm not sure how to completely solve your issue, as I haven't seen it properly answered anywhere.
Memory models are tricky, and up until x64, 2GB was pretty much the limit.
No basic memory model in Windows supports large allocations as far as I know.
Huge pages support 1GB of memory.
However, I want to point you in a few different directions.
Three ways I found to achieve something similar:
1. The obvious answer: split your allocation into smaller chunks; it's more memory efficient.
2. Use a different kind of swap: you can write memory to files.
3. Use virtual memory (not sure if it's helpful to you), using the Windows API VirtualAlloc:
#include <windows.h>  // VirtualAlloc / VirtualFree

const static SIZE_T giga = 1024 * 1024 * 1024;
const static SIZE_T size = 4 * giga;
BYTE* ptr = static_cast<BYTE*>(VirtualAlloc(nullptr, size, MEM_COMMIT, PAGE_READWRITE));
// ... use the 4 GiB region (check ptr for nullptr first) ...
VirtualFree(ptr, 0, MEM_RELEASE);
Best of luck.

How to write Linux C++ debug information when performance is critical?

I'm trying to debug a rather large program with many variables. The code is setup in this way:
while (condition1) {
    //non timing sensitive code
    while (condition2) {
        //timing sensitive code
        //many variables that change each iteration
    }
}
I have many variables on the inner loop that I want to save for viewing. I want to write them to a text file each outer loop iteration. The inner loop executes a different number of times each iteration. It can be just 2 or 3, or it can be several thousands.
I need to see all the variables values from each inner iteration, but I need to keep the inner loop as fast as possible.
Originally, I tried just storing each data variable in its own vector where I just appended a value at each inner loop iteration. Then, when the outer loop iteration came, I would read from the vectors and write the data to a debug file. This quickly got out of hand as variables were added.
I thought about using a string buffer to store the information, but I'm not sure if this is the fastest way given strings would need to be created multiple times within the loop. Also, since I don't know the number of iterations, I'm not sure how large the buffer would grow.
With the information stored being in formats such as:
"Var x: 10\n
Var y: 20\n
.
.
.
Other Text: Stuff\n"
So, is there a cleaner option for writing large amounts of debug data quickly?
If it's really time-sensitive, then don't format strings inside the critical loop.
I'd go for appending records to a log buffer of binary records inside the critical loop. The outer loop can either write that directly to a binary file (which can be processed later), or format text based on the records.
This has the advantage that the loop only needs to track a couple extra variables (pointers to the end of used and allocated space of one std::vector), rather than two pointers for a std::vector for every variable being logged. This will have much lower impact on register allocation in the critical loop.
In my testing, it looks like you just get a bit of extra loop overhead to track the vector, and a store instruction for every variable you want to log. I didn't write a big enough test loop to expose any potential problems from keeping all the variables "alive" until the emplace_back(). If the compiler does a bad job with bigger loops where it needs to spill registers, see the section below about using a simple array without any size checking. That should remove any constraint on the compiler that makes it try to do all the stores into the log buffer at the same time.
Here's an example of what I'm suggesting. It compiles and runs, writing a binary log file which you can hexdump.
See the source and asm output with nice formatting on the Godbolt compiler explorer. It can even colourize source and asm lines so you can more easily see which asm comes from which source line.
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>

struct loop_log {
    // Generally sort in order of size for better packing.
    // Use as narrow types as possible to reduce memory bandwidth.
    // e.g. logging an int loop counter into a short log record is fine if you're sure
    // it always in-practice fits in a short, and has zero performance downside
    int64_t x, y, z;
    uint64_t ux, uy, uz;
    int32_t a, b, c;
    uint16_t t, i, j;
    uint8_t c1, c2, c3;

    // isn't there a less-repetitive way to write this?
    loop_log(int64_t x, int32_t a, int outer_counter, char c1)
        : x(x), a(a), i(outer_counter), c1(c1)
        // leaves other members *uninitialized*, not zeroed.
        // note lack of gcc warning for initializing uint16_t i from an int
        // and for not mentioning every member
    {}
};

static constexpr size_t initial_reserve = 10000;

// take some args so gcc can't count the iterations at compile time
void foo(std::ostream &logfile, int outer_iterations, int inner_param) {
    std::vector<struct loop_log> log;
    log.reserve(initial_reserve);

    int outer_counter = outer_iterations;
    while (--outer_counter) {
        //non timing sensitive code
        int32_t a = inner_param - outer_counter;
        while (a != 0) {
            //timing sensitive code
            a <<= 1;
            int64_t x = outer_counter * (100LL + a);
            char c1 = x;
            // much more efficient code with gcc 5.3 -O3 than push_back( a struct literal );
            log.emplace_back(x, a, outer_counter, c1);
        }
        const auto logdata = log.data();
        const size_t bytes = log.size() * sizeof(*logdata);
        // write group size, then a group of records
        logfile.write( reinterpret_cast<const char *>(&bytes), sizeof(bytes) );
        logfile.write( reinterpret_cast<const char *>(logdata), bytes );
        // you could format the records into strings at this point if you want
        log.clear();
    }
}

#include <fstream>
int main() {
    std::ofstream logfile("dbg.log");
    foo(logfile, 100, 10);
}
gcc's output for foo() pretty much optimizes away all the vector overhead. As long as the initial reserve() is big enough, the inner loop is just:
## gcc 5.3 -masm=intel -O3 -march=haswell -std=gnu++11 -fverbose-asm
## The inner loop from the above C++:
.L59:
test rbx, rbx # log // IDK why gcc wants to check for a NULL pointer inside the hot loop, instead of doing it once after reserve() calls new()
je .L6 #,
mov QWORD PTR [rbx], rbp # log_53->x, x // emplace_back the 4 elements
mov DWORD PTR [rbx+48], r12d # log_53->a, a
mov WORD PTR [rbx+62], r15w # log_53->i, outer_counter
mov BYTE PTR [rbx+66], bpl # log_53->c1, x
.L6:
add rbx, 72 # log, // struct size is 72B
mov r8, r13 # D.44727, log
test r12d, r12d # a
je .L58 #, // a != 0
.L4:
add r12d, r12d # a // a <<= 1
movsx rbp, r12d # D.44726, a // x = ...
add rbp, 100 # D.44726, // x = ...
imul rbp, QWORD PTR [rsp+8] # x, %sfp // x = ...
cmp r14, rbx # log$D40277$_M_impl$_M_end_of_storage, log
jne .L59 #, // stay in this tight loop as long as we don't run out of reserved space in the vector
// fall through into code that allocates more space and copies.
// gcc generates pretty lame copy code, using 8B integer loads/stores, not rep movsq. Clang uses AVX to copy 32B at a time
// anyway, that code never runs as long as the reserve is big enough
// I guess std::vector doesn't try to realloc() to avoid the copy if possible (e.g. if the following virtual address region is unused) :/
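For completeness, a matching reader for that binary format might look like this (my sketch, reusing the loop_log struct defined above; the reinterpret_cast is the usual quick-and-dirty aliasing shortcut for a trivially-copyable record):

#include <fstream>
#include <vector>
#include <iostream>

// Reads the <group size in bytes><records...> framing written by foo().
void dump_log(const char *path) {
    std::ifstream in(path, std::ios::binary);
    size_t bytes;
    while (in.read(reinterpret_cast<char *>(&bytes), sizeof(bytes))) {
        std::vector<char> buf(bytes);
        in.read(buf.data(), bytes);
        const loop_log *recs = reinterpret_cast<const loop_log *>(buf.data());
        for (size_t k = 0; k < bytes / sizeof(loop_log); ++k)
            std::cout << "x=" << recs[k].x << " a=" << recs[k].a << " i=" << recs[k].i << '\n';
    }
}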
An attempt to avoid repetitive constructor code:
I tried a version that uses a braced initializer list to avoid having to write a really repetitive constructor, but got much worse code from gcc:
#ifdef USE_CONSTRUCTOR
    // much more efficient code with gcc 5.3 -O3.
    log.emplace_back(x, a, outer_counter, c1);
#else
    // Put the mapping from local var names to struct member names right here in with the loop
    log.push_back( (struct loop_log) {
        .x = x, .y = 0, .z = 0,     // C99 designated-initializers are a GNU extension to C++,
        .ux = 0, .uy = 0, .uz = 0,  // but gcc doesn't support leaving uninitialized elements before the last initialized one:
        .a = a, .b = 0, .c = 0,     // without all the ...=0, you get "sorry, unimplemented: non-trivial designated initializers not supported"
        .t = 0, .i = outer_counter, .j = 0,
        .c1 = (uint8_t)c1
    } );
#endif
This unfortunately stores a struct onto the stack and then copies it 8B at a time with code like:
mov rax, QWORD PTR [rsp+72]
mov QWORD PTR [rdx+8], rax // rdx points into the vector's buffer
mov rax, QWORD PTR [rsp+80]
mov QWORD PTR [rdx+16], rax
... // total of 9 loads/stores for a 72B struct
So it will have more impact on the inner loop.
There are a few ways to push_back() a struct into a vector, but using a braced-initializer-list unfortunately seems to always result in a copy that doesn't get optimized away by gcc 5.3. It would be nice to avoid writing a lot of repetitive code for a constructor. And with designated initializer lists ({.x = val}), the code inside the loop wouldn't have to care much about what order the struct actually stores things. You could just write them in easy-to-read order.
BTW, .x= val C99 designated-initializer syntax is a GNU extension to C++. Also, you can get warnings for forgetting to initialize a member in a braced-list with gcc's -Wextra (which enables -Wmissing-field-initializers).
For more on syntax for initializers, have a look at Brace-enclosed initializer list constructor and the docs for member initialization.
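Side note (mine, not from the original answer): designated initializers became standard C++ in C++20, as long as the type is an aggregate (no user-provided constructor) and the members are named in declaration order. A sketch with a hypothetical aggregate record:

#include <cstdint>
#include <vector>

struct loop_rec {            // aggregate: no constructor needed
    int64_t x;
    int32_t a;
    uint16_t i;
    uint8_t c1;
};

void log_one(std::vector<loop_rec> &log, int64_t x, int32_t a, int outer_counter, char c1) {
    log.push_back({ .x = x, .a = a, .i = uint16_t(outer_counter), .c1 = uint8_t(c1) });
}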
This was a fun but terrible idea:
// Doesn't compile. Worse: hard to read, probably easy to screw up
while (outerloop) {
    int64_t x = 0, y = 1;
    struct loop_log { int64_t logx = x, logy = y; };  // loop vars as default initializers
    // error: default initializers can't be local vars with automatic storage.
    while (innerloop) { x += y; y += x; log.emplace_back(loop_log()); }
}
Lower overhead from using a flat array instead of a std::vector
Perhaps trying to get the compiler to optimize away any kind of std::vector operation is less good than just making a big array of structs (static, local, or dynamic) and keeping a count yourself of how many records are valid. std::vector checks to see if you've used up the reserved space on every iteration, but you don't need anything like that if there is a fixed upper bound you can use to allocate enough space to never overflow. (Depending on the platform and how you allocate the space, a big chunk of memory that's allocated but never written isn't really a problem. e.g. on Linux, malloc uses mmap(MAP_ANONYMOUS) for big allocations, and that gives you pages that are all copy-on-write mapped to a zeroed physical page. The OS doesn't need to allocate physical pages until you write them. The same should apply to a large static array.)
So in your loop, you could just have code like
loop_log *current_record = logbuf;
while (inner_loop) {
    int64_t x = ...;
    current_record->x = x;
    ...
    current_record->i = (short)outer_counter;
    ...
    // or maybe
    // *current_record = { .x = x, .i = (short)outer_counter };
    // compilers will probably have an easier time avoiding any copying with a braced initializer list in this case than with vector.push_back
    current_record++;
}
size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
// or size_t record_bytes = reinterpret_cast<char*>(current_record) - reinterpret_cast<char*>(logbuf);
logfile.write((const char*)logbuf, record_bytes);
Scattering the stores throughout the inner loop will require the array pointer to be live all the time, but OTOH doesn't require all the loop variables to be live at the same time. IDK if gcc would optimize an emplace_back to store each variable into the vector once the variable was no longer needed, or if it might spill variables to the stack and then copy them all into the vector in one group of instructions.
Using log[records++].x = ... might lead to the compiler keeping both the array pointer and the counter live, tying up two registers, since we'd use the record count in the outer loop. We want the inner loop to be fast, and can take the time to do the subtraction in the outer loop, so I wrote it with pointer increments to encourage the compiler to only use one register for that piece of state. Besides register pressure, base+index store instructions are less efficient on Intel SnB-family hardware than single-register addressing modes.
You could still use a std::vector for this, but it's hard to get std::vector not to write zeroes into memory it allocates. reserve() just allocates without zeroing, but calling .data() and using the reserved space without telling the vector about it with .resize() kind of defeats the purpose. And of course .resize() will initialize all the new elements. So std::vector is a bad choice for getting your hands on a large allocation without dirtying it.
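A sketch of that flat-array alternative, assuming a known upper bound and a record type that is a plain aggregate (so a static array needs no constructor calls):

#include <cstddef>

struct log_rec { long long x; int a; short i; unsigned char c1; };  // hypothetical aggregate record

constexpr std::size_t max_records = 1 << 20;   // assumed worst-case bound
static log_rec logbuf[max_records];            // BSS: backed by zero pages, committed only when touched
static std::size_t n_records = 0;              // count maintained by hand instead of vector bookkeeping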
It sounds like what you really want is to look at your program from within a debugger. You haven't specified a platform, but if you build with debug information (-g using gcc or clang) you should be able to step through the loop when starting the program from within the debugger (gdb on Linux). Assuming you are on Linux, tell it to break at the beginning of the function (break function_name) and then run. If you tell the debugger to display all the variables you want to see after each step or breakpoint hit, you'll get to the bottom of your problem in no time.
Regarding performance: unless you do something fancy like set conditional breakpoints or watch memory, running the program through the debugger will not dramatically affect perf as long as the program is not stopped. You may need to turn down the optimization level to get meaningful information though.

Does it take more time to access a value in a pointed struct than a local value?

I have a struct which holds values that are used as the arguments of a for loop:
struct ARGS {
    int endValue;
    int step;
    int initValue;
};

ARGS *arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < arg->endValue; i += arg->step) {
    //...
}
Since the values of initValue and step are checked each iteration, would it be faster if I move them to local values before using in the for loop?
int initValue = arg->initValue;
int endValue = arg->endValue;
int step = arg->step;
for (int i = initValue; i < endValue; i += step) {
    //...
}
The clear-cut answer is that in 99.9% of the cases it does not matter, and you should not be concerned with it. There might be micro differences, but they won't matter to almost anyone. The gory details depend on the architecture and the optimizer. But bear with me: understand that "does not matter" here means a very, very high probability that there is no difference.
// case 1
ARGS *arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < endValue; i += arg->step) {
    //...
}

// case 2
initValue = arg->initValue;
step = arg->step;
for (int i = initValue; i < endValue; i += step) {
    //...
}
In the case of initValue, there will not be a difference. The value will be loaded through the pointer and stored into the initValue variable, just to store it in i. Chances are that the optimizer will skip initValue and write directly to i.
The case of step is a bit more interesting, in that the compiler can prove that the local variable step is not shared with any other thread and can only change locally. If the register pressure is small, it can keep step in a register and never has to access the real variable. On the other hand, it cannot assume that arg->step is not changed by external means, and is required to go to memory to read the value. Understand that memory here most probably means L1 cache. An L1 cache hit on a Core i7 takes approximately 4 CPU cycles, which roughly means 2 * 10^-9 seconds (on a 2 GHz processor). And that is under the worst-case assumption that the compiler can maintain step in a register, which may not be the case. If step cannot be held in a register, you will pay for the access to memory (cache) in both versions.
Write code that is easy to understand, then measure. If it is slow, profile and figure out where the time is really spent. Chances are that this is not the place where you are wasting cpu cycles.
This depends on your architecture. If it is a RISC or CISC processor, then that will affect how memory is accessed, and on top of that the addressing modes will affect it as well.
On the ARM code I work with, typically the base address of a structure will be moved into a register, and then it will execute a load from that address plus an offset. To access a variable, it will move the address of the variable into the register, then execute the load without an offset. In this case it takes the same amount of time.
Here's what the example assembly code might look like on ARM for accessing the second int member of a structure, compared to directly accessing a variable.
ldr r0, =MyStruct   ; struct { int x; int y; } MyStruct
ldr r0, [r0, #4]    ; load MyStruct.y into r0
ldr r1, =MyIntY     ; int MyIntX, MyIntY
ldr r1, [r1]        ; directly load MyIntY into r1
If your architecture does not allow addressing with offsets, then it would need to move the address into a register and then perform the addition of the offset.
Additionally, since you've tagged this as C++ as well, if you overload the -> operator for the type, then this will invoke your own code which could take longer.
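For completeness, a tiny sketch of that overload case (hypothetical wrapper, not from the question):

struct ARGS;    // the struct from the question

struct ArgsHandle {
    ARGS *p;
    ARGS *operator->() const {
        // whatever you put here (logging, locking, validation, ...) runs on every handle->field access
        return p;
    }
};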
The problem is that the two versions are not identical. If the code in the ... part modifies the values in arg, then the two options will behave differently (the "optimized" one will use the original step and end values, not the updated ones).
If the optimizer can prove by looking at the code that this is not going to happen, then the performance will be the same, because moving things out of loops is a common optimization performed today. However, it's quite possible that something in ... will POTENTIALLY change the content of the structure, and in that case the optimizer must be paranoid and the generated code will reload the values from the structure at each iteration. How costly that is depends on the processor.
For example, if the arg pointer is received as a parameter and the code in ... calls any external function whose code is unknown to the compiler (including things like malloc), then the compiler must assume that MAYBE the external code knows the address of the structure and MAYBE it will change the end or step values, and thus the optimizer is forbidden to move those computations out of the loop, because doing so would change the behavior of the code.
Even if it's obvious to you that malloc is not going to change the contents of your structure, this is not obvious at all to the compiler, for which malloc is just an external function that will be linked in at a later step.
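Putting the whole point into one hedged sketch (the function names are illustrative):

struct ARGS { int endValue; int step; int initValue; };
void some_external_function();   // body unknown to the compiler

void run(ARGS *arg) {
    const int endValue = arg->endValue;   // read once; can live in a register for the whole loop
    const int step     = arg->step;
    for (int i = arg->initValue; i < endValue; i += step) {
        some_external_function();         // even if this modifies *arg, the cached copies are unaffected
    }
}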

Allocating local variables on the stack & using pointer arithmetic

I read that in a function, local variables are put on the stack as they are defined, after the parameters have been put there first.
This is also mentioned here:
5. All function arguments are placed on the stack.
6. The instructions inside of the function begin executing.
7. Local variables are pushed onto the stack as they are defined.
So I expect that if the C++ code is like this:
#include "stdafx.h"
#include <iostream>
int main()
{
int a = 555;
int b = 666;
int *p = &a;
std::cout << *(p+1);
return 0;
}
and if an int here has 4 bytes, and we call the place on the stack that holds the first 8 bits of the int 555 x, then 'moving' another 4 bytes toward the top of the stack via *(p+1), we should be looking at the memory at address x + 4.
However, the output of this is -858993460, and it is always like that no matter what value int b has. Evidently it is some standard value. Of course I am accessing memory which I should not, since I expected the variable b to be there. It was just an experiment.
How come I neither get the expected value nor an illegal access error?
Where is my assumption wrong?
What could -858993460 represent?
What everyone else has said (i.e. "don't do that") is absolutely true. Don't do that. However, to actually answer your question, p+1 is most likely pointing at either a pointer to the caller's stack frame or the return address itself. The system-maintained stack pointer is decremented when you push something on it. This is implementation dependent, officially speaking, but every stack pointer I've ever seen (this is since the 16-bit era) has been like this. Thus, if as you say, local variables are pushed on the stack as they are initialized, &a should == &b + 1.
Perhaps an illustration is in order. Suppose I compile your code for 32 bit x86 with no optimizations, and the stack pointer esp is 20 (this is unlikely, for the record) before I call your function. This is what memory looks like right before the line where you invoke cout:
4: 12 (value of p)
8: 666 (value of b)
12: 555 (value of a)
16: -858993460 (return address)
p+1, since p is an int*, is 16. The memory at this location isn't read protected because it's needed to return to the calling function.
Note that this answer is academic; it's possible that the compiler's optimizations or differences between processors caused the unexpected result. However, I would not expect p+1 to == &b on any processor architecture with any calling convention I've ever seen because the stack usually grows downward.
Your assumptions are true in theory (from the CS point of view).
In practice there is no guarantee that pointer arithmetic done that way produces those results.
For example, your assumption "all function arguments are placed on the stack" is not true: the allocation of function arguments is implementation-defined (depending on the architecture, it could use registers or the stack), and the compiler is also free to allocate local variables in registers if it feels that is necessary.
Also, the assumption "int size is 4 bytes, so adding 4 to the pointer goes to b" is false. The compiler could have added padding between a and b to ensure memory alignment.
The conclusion here is: don't use low-level tricks; they are implementation-defined. Even if you have to do it (regardless of our advice), you have to know how the compiler works and how it generates the code.
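If you want to see what your compiler actually did, print the addresses instead of dereferencing past an object (a small, hedged experiment; it proves nothing about other builds or compilers):

#include <iostream>

int main() {
    int a = 555;
    int b = 666;
    // Taking the addresses forces both variables into memory; their relative
    // position (and any padding) is entirely up to the compiler.
    std::cout << &a << '\n' << &b << '\n';
}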

Can anyone explain where constants or const variables are stored?

As we all know, C++'s memory model can be divided into five blocks: stack, heap, free blocks, global/static blocks, and const blocks. I can understand the first three blocks, and I also know variables like static int xx are stored in the 4th block, as is the "hello world" string constant, but what is stored in the 5th block, the const block? And for something like int a = 10, where is the "10" stored? Can someone explain this to me?
Thanks a lot.
There is a difference between string literals and primitive constants. String literals are usually stored with the code in a separate area (for historical reasons this block is often called the "text block"). Primitive constants, on the other hand, are somewhat special: they can be stored in the "text" block as well, but their values can also be "baked" into the code itself. For example, when you write
// Global integer constant
const int a = 10;

int add(int b) {
    return b + a;
}
the return expression could be translated into a piece of code that does not reference a at all. Instead of producing binary code that looks like this
LOAD R0, <stack>+offset(b)
LOAD R1, <address-of-a>
ADD R0, R1
RET
the compiler may produce something like this:
LOAD R0, <stack>+offset(b)
ADD R0, #10 ; <<== Here #10 means "integer number 10"
RET
Essentially, despite being stored with the rest of the constants, a is cut out of the compiled code.
As far as integer literal constants go, they have no address at all: they are always "baked" into the code. When you reference them, instructions that load explicit values are generated, in the same way as shown above.
And, like int a = 10, where is the "10" stored?
It's an implementation detail. It will likely be part of the generated code and be turned into something like
mov eax, 10
in assembly.
The same will happen to definitions like
const int myConst = 10;
unless you try to get the address of myConst, like this:
const int *ptr = &myConst;
in which case the compiler will have to put the value 10 into a dedicated block of memory (presumably the 5th in your numeration).
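A compact sketch of both cases side by side (an illustration of the answers above, not a statement about any particular compiler):

const int myConst = 10;

int folded(int b) {
    return b + myConst;     // typically becomes an immediate in the instruction, no memory load
}

const int *needs_storage() {
    return &myConst;        // taking the address forces myConst to occupy real (read-only) memory
}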