C++ reference of a string literal - c++

This code outputs what looks like a random memory address. Can anyone explain why and how?
#include<iostream>
using std::cout;
int main()
{
cout<<&"hello";
return 0;
}
output:
0x560d6984e048

String literals in C++ are really arrays of constant characters (including the null terminator).
By using the address-of operator (&) you get a pointer to that array.
It's somewhat equivalent to something like this:
#include <iostream>
char const hello_str[] = "hello";
int main()
{
std::cout << &hello_str;
}

In C++, a string literal has the type const char[N] where N is the length of the string plus one for the null terminator. That array is an lvalue so when you do &"some string", you are getting the address of the array that represents the string literal.
This does not work with other literals like integer literals because they are prvalues, and the address operator (&) requires an lvalue.
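As a rough illustration of the two points above, the following compile-time checks spell out the types involved. The helper name array_length is made up for this sketch and does not come from the answers; the checks assume a C++11 or later compiler.

```cpp
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to a const char array.
static_assert(std::is_same<decltype("hello"), const char (&)[6]>::value,
              "\"hello\" is an lvalue of type const char[6]");

// Taking its address yields a pointer to the whole array, not a pointer to char.
static_assert(std::is_same<decltype(&"hello"), const char (*)[6]>::value,
              "&\"hello\" has type const char (*)[6]");

// Hypothetical helper: number of elements in a char array, including the '\0'.
template <std::size_t N>
constexpr std::size_t array_length(const char (&)[N]) { return N; }

// int* p = &42;  // would not compile: 42 is a prvalue, not an lvalue
```

Because &"hello" is a pointer to an array rather than a const char*, the stream cannot treat it as text; it converts to const void* and is printed as an address, which is the output seen in the question.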

As the other answers have already stated, in C and C++ a string literal is basically an array of characters.
According to cppreference.com:
String literals have static storage duration, and thus exist in memory
for the life of the program.
The C++ standard doesn't describe what exactly executables are supposed to look like, or where and how things get stored in memory. It describes things such as storage duration and behavior, but it is a platform-independent standard. So this is implementation-specific.
So why does this still work and what address do you see? It depends on your compiler and platform, but generally, your compiler will create an executable of your program[1], e.g., an ELF (Executable and Linkable Format) file on Linux, or a PE (Portable Executable) file on Windows.
In an ELF binary, literals are stored in the .rodata section (read-only data) and the machine instructions in the .text section.
You can look at the binary the compiler spits out with certain compiler options, or online on Matt Godbolt's compiler explorer.
Let's look at the example you gave: https://gcc.godbolt.org/z/4s5Y66nY5
We see a label on top, .LC0 where your string is! It is part of the executable file. And this file has to be mapped into memory to be executed.
Line 5 loads the address of that label into a register (%esi) before the call to the stream operation.
So what you see is the location of the string in the read-only section mapped to memory (more precisely, the address in the virtual address space of your process).
This is all rather Linux-specific, because that is what I know best, but it is very similar on Windows and Mac.
Here is a nice (student??) paper that goes into more details about how GCC deals with string literals on Unix.
[1]: The compiler could generate other things, of course. An object file. Maybe you want to create a static library, or a dynamic library. But let's keep it simple and talk about executables.
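The static storage duration quoted above has a practical consequence that a small sketch can show: a pointer to a string literal stays valid after the function that produced it has returned. The function name greeting is hypothetical and only serves this example.

```cpp
// Returning a pointer to a string literal is safe: the array it points to has
// static storage duration and is not destroyed when the function returns.
constexpr const char* greeting() { return "hello"; }

// By contrast, returning a pointer to a local array would dangle:
// const char* broken() { char local[] = "hello"; return local; }  // do not do this
```

Printing static_cast<const void*>(greeting()) would show the same kind of address as in the question; the exact value depends on the compiler and platform and may change from run to run.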

Related

Storage of literal constants in c++

I would like to know where literal constants are actually stored in the memory?
example:
int i = 5;
char* data = (char*) &("abcdefgh");
The storage sections of i and data depend on where they are declared.
But does the compiler store 5 and "abcdefgh" before actually copying it to the variables?
And here I can get the address of "abcdefgh" where it is stored, but why can't I get the address of 5?
Integer literals like 5 can be part of machine instructions. For example:
LD A, 5
would load the value 5 into processor register A for some imaginary architecture, and as the 5 is actually part of the instruction, it has no address. Few (if any) architectures have the ability to create string literals inline in the machine instructions, so these have to actually be stored elsewhere in memory and accessed via pointers. Exactly where "elsewhere" is is not specified by the C++ Standard.
On the language level, string literals and numeric literals are different beasts.
The C and C++ standards essentially specify that string literals are treated "as if" you defined a constant array of characters with the appropriate size and content, and then used its name in place of the literal. In other words, when you write
const char *foo = "hello";
it's as if you wrote
// in global scope
const char hello_literal[6] = {'h', 'e', 'l', 'l', 'o', '\0'};
...
const char *foo = hello_literal;
(There are some backwards-compatibility exceptions that even allow you to write char *foo = "hello"; without the const, but in C++ that conversion is deprecated, and it's undefined behavior anyway to try to write through such a pointer.)
So, given this equivalence, it's normal that you can take the address of a string literal. Integral literals, on the other hand, are rvalues, for which the standard specifies that you cannot take the address; you can roughly think of them as values that the standard expects not to have a backing memory location in the conventional sense.
Now, this distinction actually descends from the fact that on the machine level they are usually implemented differently.
A string literal generally is stored as data somewhere in memory, typically in a read-only data section that gets mapped in memory straight from the executable. When the compiler needs its address it's easy to oblige, since it is data stuff that is already in memory, and thus it does have an address.
Instead, when you do something like
int a = 5;
the 5 does not really have a separate memory location like the "hello" array above, but it's usually embedded into the machine code as an immediate value.
It's quite complicated to have a pointer to it, since it would be a pointer pointing halfway into an instruction, and in general pointing to data in a different format than what would be expected for a regular int variable. Think of x86, where small numbers use more compact encodings; of PowerPC, ARM and other RISC architectures, where some values are built from an immediate manipulated by the implicit barrel shifter and other values cannot be encoded as immediates at all, so you have to compose them out of several instructions; or of Harvard architectures, where data and code live in different address spaces.
For this reason, you cannot take the address of numeric literals (as well as of numeric expressions evaluation results and much other temporary stuff); if you want to have the address of a number you have to first assign it to a variable (which can provide an in-memory storage), and then ask for its address.
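The workaround described above (give the number a named object first, then take that object's address) can be sketched as follows. The names five and address_of_five are invented for this illustration.

```cpp
// The literal 5 has no address, but a named object initialized from it does.
constexpr int five = 5;

// Returns the address of the object 'five', not of the literal itself.
constexpr const int* address_of_five() { return &five; }

// const int* bad = &5;  // would not compile: 5 is a prvalue
```

Whether the compiler actually reserves memory for five can still depend on how it is used; asking for its address is what obliges it to provide storage.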
Although the C and C++ standards don't dictate where literals are stored, common practice stores them in one of two places: in the code (see @NeilButterworth's answer), or in a "constants" segment.
Common executable files have a code section and a data section. The data segment may be split up into read-only, uninitialized read/write and initialized read-write. Often, the literals are placed into the read-only section of the executable.
Some tools may also place the literals into a separate data file. This data file may be used to program the data into read-only memory devices (ROM, PROM, Flash, etc.).
In summary, the placement of literals is implementation-dependent. The C and C++ standards state that writing to the location of a string literal is undefined behavior. The preferred practice with string literals is to declare the variable that points to them as const, so that the compiler can generate warnings or errors when a write to a literal occurs.
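A minimal sketch of that advice, assuming C++11 or later: point at the literal through a pointer to const, and make a copy when a modifiable string is needed. The names read_only and capitalized_copy are hypothetical.

```cpp
#include <string>

// Pointing at the literal: keep it const so that writes are rejected at compile time.
const char* const read_only = "hello";

// Hypothetical helper: copy the literal into a writable array before modifying it.
std::string capitalized_copy() {
    char buffer[] = "hello";  // a local, writable copy of the six chars
    buffer[0] = 'H';          // fine: modifies the copy, not the literal
    return buffer;
}

// read_only[0] = 'H';  // would not compile: assignment to a const char
```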

Integer pointer to char array

I am working on an application (C++ language on visual studio) where all the strings are referred with integer pointer.
For example, the class I am using has this integer pointer to the data and a variable for size.
{
..
..
unsigned short int *pData;
int iLen;
}
I would like to know
Are there any advantages of using int pointer instead of char pointer?
After thinking a lot, I suspect that the reason may be to avoid the application crash which may happen if the char pointer is used without a null termination. But I am not 100% sure.
During debugging, how can we check the content of the pointer, where the content is a char array or string (in Visual Studio)?
I can only see the address when I check the content during debugging. Because of this I am having difficulty debugging.
Using printf would work to display the content, but I can't do it in all places.
I suspect that the reason for using an integer pointer may be to avoid the application crash which may happen if the char pointer is used without a null termination. But I am not 100% sure.
Maybe it is taken as an integer pointer to avoid such programming errors.
The class I am using has this integer pointer to the data and a variable for size.
{
..
..
unsigned int *pData;
int iLen;
}
Please correct me if you think this can not be the reason.
Please let us know which language you're using so that we can help you a bit better. As for your questions:
There is no benefit to using an int array over a char array. In fact, an integer array takes up more space (if each int represents its own character). This is because a char takes up one byte, whereas an int typically takes up four.
As for printing things when debugging I'm not a master of visual studio since I haven't used it in some time but most modern IDEs allow you to cast things before you print them. For example in lldb you can do po (char)myIntArray[0] (po stands for print). In visual studio writing a custom visualizer should do the trick.
I am not sure why you would want to do this, but if you wanted to store an EOF character in your string for some reason, you would need a pointer to int.
MS Visual Studio often uses UTF-16 to store strings. UTF-16 requires a 16-bit data type, your application uses unsigned short for that, while a more correct name would be uint16_t (or maybe wchar_t, but I am not sure about that).
Another way to store strings uses UTF-8; if you use this way, use a pointer to char (not convenient) or std::string (more convenient). However, you don't have any choice here; you are not going to change how your application stores strings (it's probably too tedious).
To view your UTF-16-encoded string, use a dedicated format specifier. For example, in a quick-watch window, enter:
object->pData,su
instead of just
object->pData
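Outside the debugger, one way to make such a buffer easier to inspect is to copy it into a std::u16string. This is only a sketch: TextRef is a hypothetical stand-in for the class in the question, and it assumes that pData really holds iLen UTF-16 code units.

```cpp
#include <string>

// Hypothetical stand-in for the class in the question: UTF-16 code units plus a length.
struct TextRef {
    unsigned short* pData;
    int iLen;
};

// Copy the code units into a std::u16string, which is easier to inspect or log.
// No null terminator is needed because the length is passed explicitly.
std::u16string to_u16string(const TextRef& text) {
    return std::u16string(text.pData, text.pData + text.iLen);
}
```

Keeping an explicit length next to the pointer, as the class does, is also what makes the missing null terminator harmless here.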

Where are C++ literal constants stored in memory?

Where are C++ literal constants stored in memory? On the stack or the heap?
int *p = &2 is wrong. I want to know why. Thanks
-------------------------------------------------
My question is "Where are C++ literal constants stored in memory?"; why "int *p = &2" is wrong is not my question.
The details depend on the machine, but assuming the most common sort of machine and operating system... every executable file contains several "segments" - CODE, BSS, DATA and some others.
CODE holds all the executable opcodes. Actually, it's often named TEXT because somehow that made sense to people way back decades ago. Normally it's read-only.
BSS is uninitialized data - it actually doesn't need to exist in the executable file, but is allocated by the operating system's loader when the program is starting to run.
DATA holds the literal constants - the int8, int16, int32 etc along with floats, string literals, and whatever weird things the compiler and linker care to produce. This is what you're asking about. However, it holds only constants defined for use as variables, as in
const long x = 2;
but unlikely to hold literal constants used in your source code but not tightly associated with a variable. Just a lone '2' is dealt with directly by the compiler. For example in C,
print("%d", 2);
would cause the compiler to build a subroutine call to print(), writing opcodes to push a pointer to the string literal "%d" and the value 2, both as 64-bit integers on a 64-bit machine (you're not one of those laggards still using 32-bit hardware, are you? :) followed by the opcode to jump to a subroutine at (identifier for 'print' subroutine).
The "%d" literal goes into DATA. The 2 doesn't; it's built into the opcode that stuffs integers onto the stack. That might actually be a "load register RAX immediate" followed by the value 2, followed by a "push register RAX", or maybe a single opcode can do the job. So in the final executable file, the 2 will be found in the CODE (aka TEXT) segment.
It typically isn't possible to make a pointer to that value, or to any opcode. It just doesn't make sense in terms of what high level languages like C do (and C is "high level" when you're talking about opcodes and segments.) "&2" can only be an error.
Now, it's not entirely impossible to have a pointer to opcodes. Whenever you define a function in C, or an object method, constructor or destructor in C++, the name of the function can be thought of as a pointer to the first opcode of the machine code compiled from that function. For example, print() without the parentheses is a pointer to a function. Maybe if your example code were in a function and you guess the right offset, pointer arithmetic could be used to point to that "immediate" value 2 nestled among the opcodes, but this is not going to be easy for any contemporary CPU, and certainly isn't for beginners.
Let me quote relevant clauses of C++03 Standard.
5.3.1/2
The result of the unary & operator is a pointer to its operand. The
operand shall be an lvalue.
An integer literal is an rvalue (I haven't found a direct quote in the C++03 Standard, but C++11 mentions it as a side note in 3.10/1).
Therefore, it's not possible to take the address of an integer literal.
As for the exact place where 2 is stored, it depends on usage. It might be part of a machine instruction, or it might be optimized away, e.g. j=i*2 might become j=i+i. You should not rely upon it.
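The lvalue requirement quoted above can be checked at compile time. The following is a sketch for C++11 or later; IS_LVALUE is a macro name made up for this example and relies on the rule that decltype((expr)) is an lvalue reference type exactly when expr is an lvalue.

```cpp
#include <type_traits>

// decltype((expr)) is an lvalue reference type exactly when expr is an lvalue.
// IS_LVALUE is a name invented for this sketch.
#define IS_LVALUE(expr) (std::is_lvalue_reference<decltype((expr))>::value)

static_assert(IS_LVALUE("abc"), "a string literal is an lvalue, so & applies");
static_assert(!IS_LVALUE(2), "an integer literal is not, so &2 is rejected");
```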
You have two questions:
Where are literal constants stored? With the exception of string
literals (which are actual objects), pretty much wherever the
implementation wants. It will usually depend on what you're doing with
them, but on a lot of architectures, integral constants (and often some
special floating point constants, like 0.0) will end up as part of a
machine instruction. When this isn't possible, they'll usually be
placed in the same logical segment as the code.
As to why taking the address of an rvalue is illegal, the main reason is
because the standard says so. Historically, it's forbidden because such
constants often never exist as a separate object in memory, and thus
have no address. Today... one could imagine other solutions: compilers
are smart enough to put them in memory if you took their address, and
not otherwise; and rvalues of class type do have a memory address.
The rules are somewhat arbitrary (and would be, regardless of what they
were)—hopefully, any rules which would allow taking the address of
a literal would make its type int const*, and not int*.

Binary How The Processor Distinguishes Between Two Same Byte Size Variable Types

I'm trying to figure out how the processor distinguishes between two variable types that have the same byte size.
If I have a variable that is one byte in size, how is the computer able to tell that it is a character instead of a Boolean variable? Or a character rather than half of a short integer?
The processor doesn't know. The compiler does, and generates the appropriate instructions for the processor to execute to manipulate bytes in memory in the appropriate manner, but to the processor itself a byte of data is a byte of data and it could be anything.
The language gives meaning to these things, but it's an abstraction the processor isn't really aware of.
The computer is not able to do that. The compiler is. You use the char or bool keyword to declare a variable and the compiler produces code that makes the computer treat the memory occupied by that variable in a way that makes sense for that particular type.
A 32-bit integer for example, takes up 4 bytes in memory. To increment it, the CPU has an instruction that says "increment a 32-bit integer at this address". That's what the compiler produces and the CPU blindly executes it. It doesn't care if the address is correct or what binary data is located there.
The size of the instruction for incrementing the variable is another matter. It may very well be another 4 or so bytes, but instructions (code) are stored separately from data. There may be many instructions generated for a program that deal with the same location in memory. It is not possible to formally specify the size of the instructions beforehand because of optimizations that may change the number of instructions used for a given operation. The only way to tell is to compile your program and look at the generated assembly code (the instructions).
Also, take a look at unions in C. They let you use the same memory location for different data types. The compiler lets you do that and produces code for it but you have to know what you're doing.
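To make the "a byte is just a byte" point concrete, here is a small sketch that reads the bytes of a float as an integer. It uses std::memcpy rather than a union, since reading an inactive union member is allowed in C but not in C++. The function name bits_of is hypothetical, and the value mentioned in the note below assumes the common IEEE-754 float format.

```cpp
#include <cstdint>
#include <cstring>

// Copy the bytes of a float into a same-sized integer and return them unchanged.
// The memory content is identical; only the type used to interpret it differs.
std::uint32_t bits_of(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

On an IEEE-754 platform, bits_of(1.0f) is 0x3F800000: the same four bytes that the compiler treats as 1.0 in float instructions and as a large integer in integer instructions.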
Because you specify the type. C++ is a strongly typed language. You can't write $x = 10. :)
It knows
char c = 0;
is a char because of... well, the char keyword.
The computer only sees 1 and 0. You are in command of what the variable contains.
You can also cast that data into whatever you want.
char foo = 'a';
if ( (bool)(foo) ) // true
{
int sumA = (unsigned char)(foo) + (unsigned char)(foo);
// sumA == (97 + 97)
}
Also look into data casting to look at the memory location as different data types. This can be as small as a char or entire structs.
In general, it can't. Look at the restrictions of dynamic_cast<>, which tries to do exactly that. dynamic_cast can only work in the special case of objects derived from polymorphic base classes. That's because such objects (and only those) have extra data in them. Chars and ints do not have this information, so you can't use dynamic_cast on them.
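For comparison, this is the special case where a run-time type query does work. The class names are hypothetical; the virtual destructor is what makes the base class polymorphic and gives each object the extra type information that dynamic_cast consults.

```cpp
// Hypothetical class hierarchy for illustration.
struct Shape { virtual ~Shape() = default; };
struct Circle : Shape {};
struct Square : Shape {};

// Succeeds only if the object referred to really is a Circle (or derived from it).
bool is_circle(const Shape& shape) {
    return dynamic_cast<const Circle*>(&shape) != nullptr;
}

// A plain char or int carries no such type information, so there is nothing to query.
```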

64bit architecture - character pointer truncated while returning from function

Environment:
Windows x64 bit with 5GB RAM. My binary is a 64bit one built with compiler of version - "Microsoft (R) C/C++ Optimizing Compiler Version 14.00.50727.762 for x64"
Environment setting:
Microsoft suggests setting the registry key below to test 64-bit applications, and I have set it on my box. The problem doesn't occur if I don't set the registry key, because the program is then placed at a low address. The same registry key is mentioned in the discussion - As a programmer, what do I need to worry about when moving to 64-bit windows?
To force allocations to allocate from higher addresses before lower addresses for testing purposes, specify MEM_TOP_DOWN when calling VirtualAlloc or set the following registry value to 0x100000:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference
Sample code:
char *alloc_str()
{
char *temp;
temp = (char *) malloc(60);
/* copy some data to temp */
return temp;
}
main()
{
char *str;
str = (char *)alloc_str();
}
Analysis:
malloc returns the address 0x000007fffe999b40, which is stored in temp, but when the pointer is returned to main(), str gets only the lower 32 bits, sign-extended to 0xfffffffffe999b40, and I am not able to access the data at that location.
Two points about style that could help to diagnose this kind of problem. As you use malloc, I suppose that this is compiled as a C program. In that case, do not typecast when it is not necessary. Excessive typecasts may suppress the warnings telling you that you have not correctly declared your functions. If there is no prototype and the function definition is in another module, then your call str = (char *)alloc_str(); will suppress the warning that the function is used without a declaration, and the default declaration of C will be used, i.e. all parameters and the return value will be considered to be int.
Same thing with the malloc, if you forgot the right include, your typecast will suppress the warning and the compiler will assume the function returns an int. This might be already the cause of the truncation.
Another point, in C, an empty parameter list is declared with (void) not () which has another meaning (unspecified parameters).
These are 2 points that differ between C and C++.
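The arithmetic behind the truncation can be reproduced with the address from the question. This sketch only simulates what happens to the value when it passes through a 32-bit int; it assumes two's complement wraparound, as on mainstream compilers, and the function name through_int is made up.

```cpp
#include <cstdint>

// Simulates a 64-bit pointer value passing through a 32-bit int return type:
// keep the low 32 bits, reinterpret them as signed, then widen back to 64 bits.
constexpr std::uint64_t through_int(std::uint64_t address) {
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(
            static_cast<std::int32_t>(static_cast<std::uint32_t>(address))));
}

static_assert(through_int(0x000007fffe999b40ULL) == 0xfffffffffe999b40ULL,
              "matches the truncated value reported in the question");
```

Addresses below 2 GB survive this round trip unchanged, which is consistent with the problem only appearing once allocations are forced to high addresses.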
The registry setting you refer to selects top-down memory allocation. You state that this setting will "place program in high address space". It won't do that. What this does is force the Windows memory manager to allocate memory from the top of the address space.
What's more since you are running under 64 bit Windows, I fail to see where PAE comes into play. That would be something used on a 32 bit platform.
I would guess that you are compiling a 32 bit app and so inevitably your pointers are 32 bits wide. But that is only a guess due to the lack of information.
In short your question is impossible to answer because you haven't told us what you are doing.
David, thanks for your time and suggestion. The sample program was not working because I had not included stdlib.h in my C file, and so the pointer returned by malloc was treated as 32 bit. The problem I faced was with production code, which was a bit different from the sample code.
tristopia, Your explanation is absolutely correct. I was facing the same problem.
In my production C files, I was facing the problem as below
a.c
call_test1()
{
char* temp;
temp = (char *)test1();
}
b.c
#include <stdlib.h>
char* test1()
{
char *str;
str = (char *)malloc(60);
/* copy some data to str */
return str;
}
When str was returned by test1(), the pointer contained a 64-bit address (I checked the rax register, which stores the return value of a function, in windbg using "r rax"), but when it was assigned to temp, it got truncated to a 32-bit address.
The problem was due to
Not including signature of test1() in the a.c file
Typecasting the return value to char *
Modified source
a.c
char * test1();
call_test1()
{
char* temp;
temp = test1();
}
b.c
#include <stdlib.h>
char* test1()
{
char *str;
str = (char *)malloc(60);
/* copy some data to str */
return str;
}
I got the solution by trial and error but tristopia explained the reason.
64 bit notes
Use %p instead of %lx to print a pointer in a 64-bit environment.
Include function signature and try to avoid typecasting pointers
On Windows,
If you recompile your C/C++ code into a 64-bit EXE or DLL, you must test it with the AllocationPreference registry value set.
To force allocations to allocate from higher addresses before lower addresses for testing purposes, specify MEM_TOP_DOWN when calling VirtualAlloc or set the following registry value to 0x100000:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference
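The %p note above matters on 64-bit Windows in particular, where long is still 32 bits wide and %lx would therefore print only part of a pointer. A minimal sketch of the %p approach follows; the function name format_pointer is hypothetical, and the exact text that %p produces is implementation-defined.

```cpp
#include <cstdio>
#include <string>

// Format a pointer with %p, which is sized for the platform's pointers.
// The exact text produced (prefix, digits, case) is implementation-defined.
std::string format_pointer(const void* pointer) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%p", const_cast<void*>(pointer));
    return buffer;
}
```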
I would have to ask if you are using the correct compiler settings and linking against the correct C runtime library.