(C/C++) Do copies of string literals share memory in the TEXT section?

If I call a function like
myObj.setType("fluid");
many times in a program, how many copies of the literal "fluid" are saved in memory? Can the compiler recognize that this literal is already defined and just reference it again?

This has nothing to do with C++ (the language). Instead, it is an "optimization" that a compiler can do. So, the answer is yes and no, depending on the compiler/platform you are using.
@David This is from the latest draft of the language:
§ 2.14.6 (page 28)
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation defined. The effect of attempting to modify a string literal is undefined.
The emphasis is mine.
In other words, string literals in C++ are immutable because modifying a string literal is undefined behavior. So, the compiler is free to eliminate redundant copies.
BTW, I am talking about C++ only ;)

Yes, it can. Of course, it depends on the compiler. For VC++, it's even configurable:
http://msdn.microsoft.com/en-us/library/s0s0asdt(VS.80).aspx

Yes it can, but there's no guarantee that it will. Define a constant if you want to be sure.

This is a compiler implementation issue. Many compilers that I have used have an option to share or merge duplicate string literals. Allowing duplicate string literals speeds up the compilation process but produces larger executables.

I believe that C/C++ specifies no particular handling for this case, but in most cases the compiler would store multiple copies of that string.

2.13.4/2: "whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined".
This permits the optimisation you're asking about.
As an aside, there may be a slight ambiguity, at least locally within that section of the standard. The definition of string literal doesn't quite make clear to me whether the following code uses one string literal twice, or two string literals once each:
const char *a = "";
const char *b = "";
But the next paragraph says "In translation phase 6 adjacent narrow string literals are concatenated". Unless it means to say that something can be adjacent to itself, I think the intention is pretty clear that this code uses two string literals, which are concatenated in phase 6. So it's not one string literal twice:
const char *c = "a" "a";
Still, if you did read that "a" and "a" are the same string literal, then the standard requires the optimisation you're talking about. But I don't think they are the same literal, I think they're different literals that happen to consist of the same characters. This is perhaps made clear elsewhere in the standard, for instance in the general information on grammar and parsing.
Whether it's made clear or not, many compiler-writers have interpreted the standard the way I think it is, so I might as well be right ;-)

Related

Confusion between constants and literals?

I am currently reading about constants in the C++ tutorial from TutorialsPoint, where it says:
Constants refer to fixed values that the program may not alter and they are called literals.
(Source)
I do not really get this. If constants are called literals and literals are data represented directly in the code, how can constants be considered as literals? I mean variables preceded with the const keyword are constants, but they are not literals, so how can you say that constants are literals?
Here:
const int MEANING = 42;
the value MEANING is a constant, 42 is a literal. There is no real relationship between the two terms, as can be seen here:
int n = 42;
where n is not a constant, but 42 is still a literal.
The major difference is that a constant may have an address in memory (if you write some code that needs such an address), whereas a literal such as 42 never has an address. (String literals are the exception: they are lvalues stored as arrays.)
I disagree with the claim "...There wasn't a thing called const in C originally so this was fine." const is actually one of the 32 C keywords. Google to see.
With that rested, I think the man missed something at TP. To be fair to them at Tutorials Point, they had an article that explained the difference thus (full quote, verbatim):
https://www.tutorialspoint.com/questions/category/Cplusplus
A literal is a value that is expressed as itself. For example, the number 25 or the string "Hello World" are both literals.
A constant is a data type that substitutes a literal. Constants are used when a specific, unchanging value is used various times during the program. For example, if you have a constant named PI that you'll be using at various places in your program to find the area, circumference, etc of a circle, this is a constant as you'll be reusing its value. But when you'll be declaring it as:
const float PI = 3.141;
The 3.141 is a literal that you're using. It doesn't have any memory address of its own and just sits in the source code.
Please don't disparage those fellows doing what you call "random tutorials". Kids from poorer homes and the less developed world can't afford your "good C++ textbooks", e.g. Scott Meyers' Effective C++. These free online tutorials are what they can have, and most of these tutorials do a better job of explaining than the "good books".
By all means read them, guys. Get confused some, then come over here to Stack Overflow or Quora to have your confusion cleared. Happy coding, guys.
The author of the article is confused, and spreading that confusion to others (including you).
In C, literals are "constants". There wasn't a thing called const in C originally so this was fine.
C++ is a different language. In C++, literals are called "literals", and "constant" has a few meanings but generally is a const thing. The two concepts are different (although both kinds of things cannot be mutated after initial creation). We also have compile-time constants via constexpr which is yet another thing.
In general, read a good book rather than random tutorials written by randomers on the internet!
While the first part of the statement makes sense
Constants refer to fixed values that the program may not alter
the continuation
and they are called literals
is not really true.
Neil has already explained the semantical difference between the literal and the constant in his answer. But I would also like to add that the values of constant variables in C++ are not necessarily known at compile time.
#include <iostream>

// x might be obtained at runtime,
// for instance, from the user input
void print_square(int x)
{
    // const here only means "not modified after initialization"
    const int square = x*x;
    std::cout << square << '\n';
}
Literals are values that are known at compile time, which allows the compiler to put them in a separate read-only address space in the resulting binaries.
You can also require your variables to be known at compile time by applying the constexpr keyword (C++11).
constexpr int meaning = 42;
P.S. And I also do agree with a comment suggesting to use a good book instead of tutorialspoint.
If constants are called literals and literals are data represented directly in the code, how can constants be considered as literals?
The article from which you drew the quote is defining the word "constant" to be a synonym of "literal". The latter is the C++ standard's term for what it is describing. The former is what the C standard uses for the same concept.
I mean variables preceded with the const keyword are constants, but they are not literals, so how can you say that constants are literals?
And there you are providing an alternative definition for the term "constant", which, you are right, is inconsistent with the other. That's all. TP is using a different definition of the term than the one you are used to.
In truth, although the noun usage of "constant" appears in a couple of places in the C++ standard outside the defined term "null pointer constant", apparently with the meaning you propose here, I do not find an actual definition of that term, and especially not one matching yours. In fact, your definition is less plausible than TutorialsPoint's, because an expression having const-qualified type can nevertheless designate an object that is modifiable (via a different expression).
Constant is simply a variable declared constant by keyword 'const' whose value after being declared shouldn't be altered during the course of the program (and if tried to alter it will result in an error).
On the other hand, literal is simply what is used and represented as it is typed in. For example, 25 when used in an expression (x+4*y+25) will be termed as literal.
Whenever we use String values or directly supply it in double quotes ("hello"), then that value in double quotes is called literal.
For example, printf("This is literal");
If you assign a string value to a variable, you thereafter refer to the variable (which could be declared constant if desired) and not to the value stored in it. In other words, the value (of string type or any other type) is called a literal only at the point where you supply it to the variable; after that, it is the variable that is talked about whenever that value is referred to.
Once again, the value(25) in expression : x+4*y+25 is literal.
The value(4) in the term 4*y is also a literal (since it is exactly as we see it and is known to compiler beforehand).
--> The value(4) in the term 4*y is called numerical coefficient in algebraic terms and y is called literal coefficient in algebraic terms.
Hence, all of the above explanation is in computer terms only. The meanings of literals and constants in algebra are somewhat different from those used in computer terms.
"Constants refer to fixed values that the program may not alter and they are called literals. (Source)"
The sentence construction is weird which is leading to the confusion.
Here, the "they" refers to the fixed values and not to constants. I would phrase it as "Constants refer to fixed values, that the program may not alter, called literals." which is less confusing, I hope.
Constants are variables that can't vary, whereas Literals are literally numbers/letters that indicate the value of a variable or constant.
I can explain it this way.
Basically, constants are variables whose value cannot change.
Literals are notations that represent fixed values. These values can be strings, numbers, etc.
Literals can be assigned to variables
Code :
var a = 10;
var name = "Simba";
const pi = 3.14;
Here a and name are variables. pi is a constant. ( Constants are those variables whose value doesn't change. )
Here 10, "Simba" and 3.14 are literals.

Why is it wrong to modify the contents of a pointer to a string literal?

If I write:
char *aPtr = "blue"; //would be better const char *aPtr = "blue"
aPtr[0]='A';
I get a warning. The code above can work but isn't standard; it has undefined behavior because the pointer refers to a string literal in read-only memory. The question is:
Why is it like this?
with this code rather:
char a[]="blue";
char *aPtr=a;
aPtr[0]='A';
is ok. I want to understand under the hood what happens
The first is a pointer to a read-only value created by the compiler and placed in a read-only section of the program. You cannot modify the characters at that address because they are read-only.
The second creates an array and copies each element from the initializer (see this answer for more details on that). You can modify the contents of the array, because it's a simple variable.
The first one works the way it does because doing anything else would require dynamically-allocating a new variable, and would require garbage collection to free it. That is not how C and C++ work.
The primary reason that string literals can't be modified (without undefined behavior) is to support string literal merging.
Long ago, when memory was much tighter than today, compiler authors noticed that many programs had the same string literals repeated many times--especially things like mode strings being passed to fopen (e.g., f = fopen("filename", "r");) and simple format strings being passed to printf (e.g., printf("%d\n", a);).
To save memory, they'd avoid allocating separate memory for each instance of these strings. Instead, they'd allocate one piece of memory, and point all the pointers at it.
In a few cases, they got even trickier than that, to merge literals that weren't even entirely identical. For example, consider code like this:
printf("%s\t%d\n", name, a);
/* ... */
printf("%d\n", b);
In this case, the string literals aren't entirely identical, but the second one is identical to the end of the first. In this case, they'd still allocate one piece of memory. One pointer would point to the beginning of the memory, and the other to the position of the %d in that same block of memory.
With a possibility (but no requirement for) string literal merging, it's essentially impossible to say what behavior you'll get when you modify a string literal. If string literals are merged, modifying one string literal might modify others that are identical, or end identically. If string literals are not merged, modifying one will have no effect on any other.
MMUs added another dimension: they allowed memory to be marked as read-only, so attempting to modify a string literal would result in a signal of some sort--but only if the system had an MMU (which was often optional at one time) and also depending on whether the compiler/linker decided to put the string literals in memory they'd marked constant or not.
Since they couldn't define what the behavior would be when you modified a string literal, they decided that modifying a string literal would produce undefined behavior.
The second case is entirely different. Here you've defined an array of char. It's clear that if you define two separate arrays, they're still separate, regardless of content, so modifying one can't possibly affect the other. The behavior is clear and always has been, so doing so gives defined behavior. The fact that the array in question might be initialized from a string literal doesn't change that.

Does multiple #define of the same string use the same constant string?

Does multiple #define of the same string use the same constant string? Say I do the following in multiple places:
#define TEST @"test"
Compiler is smart enough to know it refers to the same constant string in the data section right?
Truly your question does not have much to do with what #define does, but rather about how string literals are treated by the compiler. The compiler inserts the string object into the program image, which is read-only and doesn't implement the retain count. This is an optimization so that the string doesn't need to be created at runtime.
Usually the compiler is smart enough to recognize that you are using the same string literal, and the same constant string will be used, but it will not be in the heap.
Also check this question: Authoritative description of ObjectiveC string literals?
The compiler does something called string interning. It is not a necessary operation so if your code relies on test being at the same address then you may have some problems. For the most part yes, it will try to reuse strings that are the same and just make them all point to the same string (in read only memory).

Why is a hard-coded string constant an lvalue? [duplicate]

C++03 5.1 Primary expressions §2 says:
A literal is a primary expression. Its type depends on its form (2.13). A string literal is an lvalue; all other literals are rvalues.
Similarly, C99 6.5.1 §4 says:
A string literal is a primary expression. It is an lvalue with type as detailed in 6.4.5.
What is the rationale behind this?
As I understand, string literals are objects, while all other literals are not. And an l-value always refers to an object.
But the question then is why are string literals objects while all other literals are not? This rationale seems to me more like an egg or chicken problem.
I understand the answer to this may be related to hardware architecture rather than C/C++ as programming languages, nevertheless I would like to hear the same.
A string literal is a literal with array type, and in C there is no way for an array type to exist in an expression except as an lvalue. String literals could have been specified to have pointer type (rather than array type that usually decays to a pointer) pointing to the string "contents", but this would make them rather less useful; in particular, the sizeof operator could not be applied to them.
Note that C99 introduced compound literals, which are also lvalues, so having a literal be an lvalue is no longer a special exception; it's closer to being the norm.
String literals are arrays - objects of inherently unpredictable size (i.e of user-defined and possibly large size). In general case, there's simply no other way to represent such literals except as objects in memory, i.e. as lvalues. In C99 this also applies to compound literals, which are also lvalues.
Any attempts to artificially hide the fact that string literals are lvalues at the language level would produce a considerable number of completely unnecessary difficulties, since the ability to point to a string literal with a pointer as well as the ability to access it as an array relies critically on its lvalue-ness being visible at the language level.
Meanwhile, literals of scalar types have fixed compile-time size. At the same time, such literals are very likely to be embedded directly into the machine commands on the given hardware architecture. For example, when you write something like i = i * 5 + 2, the literal values 5 and 2 become explicit (or even implicit) parts of the generated machine code. They don't exist and don't need to exist as standalone locations in data storage. There's simply no point in storing values 5 and 2 in the data memory.
It is also worth noting that on many (if not most, or all) hardware architectures floating-point literals are actually implemented as "hidden" lvalues (even though the language does not expose them as such). On platforms like x86 machine commands from floating-point group do not support embedded immediate operands. This means that virtually every floating-point literal has to be stored in (and read from) data memory by the compiler. E.g. when you write something like i = i * 5.5 + 2.1 it is translated into something like
const double unnamed_double_5_5 = 5.5;
const double unnamed_double_2_1 = 2.1;
i = i * unnamed_double_5_5 + unnamed_double_2_1;
In other words, floating-point literals often end up becoming "unofficial" lvalues internally. However, it makes perfect sense that language specification did not make any attempts to expose this implementation detail. At language level, arithmetic literals make more sense as rvalues.
I'd guess that the original motive was mainly a pragmatic one: a string
literal must reside in memory and have an address. The type of a string
literal is an array type (char[] in C, char const[] in C++), and
array types convert to pointers in most contexts. The language could
have found other ways to define this (e.g. a string literal could have
pointer type to begin with, with special rules concerning what it
pointed to), but just making the literal an lvalue is probably the
easiest way of defining what is concretely needed.
An lvalue in C++ does not always refer to an object. It can refer to a function too. Moreover, objects do not have to be referred to by lvalues. They may be referred to by rvalues, including for arrays (in C++ and C). However, in old C89, the array to pointer conversion did not apply for rvalues arrays.
Now, an rvalue denotes something with no lifetime, a limited lifetime, or a lifetime that is about to expire. A string literal, however, lives for the entire program.
So string literals being lvalues is exactly right.

Why isn't ("Maya" == "Maya") true in C++?

Any idea why I get "Maya is not Maya" as a result of this code?
if ("Maya" == "Maya")
printf("Maya is Maya \n");
else
printf("Maya is not Maya \n");
Because you are actually comparing two pointers - use e.g. one of the following instead:
if (std::string("Maya") == "Maya") { /* ... */ }
if (std::strcmp("Maya", "Maya") == 0) { /* ... */ }
This is because C++03, §2.13.4 says:
An ordinary string literal has type “array of n const char”
... and in your case a conversion to pointer applies.
See also this question on why you can't provide an overload for == for this case.
You are not comparing strings, you are comparing pointer address equality.
To be more explicit -
"foo baz bar" implicitly defines an anonymous const char[m]. It is implementation-defined as to whether identical anonymous const char[m] will point to the same location in memory (a concept referred to as interning).
The function you want - in C - is strcmp(const char*, const char*), which returns 0 on equality.
Or, in C++, what you might do is
#include <string>
std::string s1 = "foo";
std::string s2 = "bar";
and then compare s1 vs. s2 with the == operator, which is defined in an intuitive fashion for strings.
The output of your program is implementation-defined.
A string literal has the type const char[N] (that is, it's an array). Whether or not each string literal in your program is represented by a unique array is implementation-defined. (§2.13.4/2)
When you do the comparison, the arrays decay into pointers (to the first element), and you do a pointer comparison. If the compiler decides to store both string literals as the same array, the pointers compare true; if they each have their own storage, they compare false.
To compare strings, use std::strcmp(), like this:
if (std::strcmp("Maya", "Maya") == 0) // same
Typically you'd use the standard string class, std::string. It defines operator==. You'd need to make one of your literals a std::string to use that operator:
if (std::string("Maya") == "Maya") // same
What you are doing is comparing the address of one string with the address of another. Depending on the compiler and its settings, sometimes the identical literal strings will have the same address, and sometimes they won't (as apparently you found).
Any idea why i get "Maya is not Maya" as a result
Because in C, and thus in C++, string literals are of type const char[], which is implicitly converted to const char*, a pointer to the first character, when you try to compare them. And pointer comparison is address comparison.
Whether the two string literals compare equal or not depends on whether your compiler (using your current settings) pools string literals. It is allowed to do that, but it doesn't need to.
To compare the strings in C, use strcmp() from the <string.h> header. (It's std::strcmp() from <cstring> in C++.)
To do so in C++, the easiest is to turn one of them into a std::string (from the <string> header), which comes with all comparison operators, including ==:
#include <iostream>
#include <string>
// ...
if (std::string("Maya") == "Maya")
    std::cout << "Maya is Maya\n";
else
    std::cout << "Maya is not Maya\n";
C and C++ do this comparison via pointer comparison; looks like your compiler is creating separate resource instances for the strings "Maya" and "Maya" (probably due to having an optimization turned off).
My compiler says they are the same ;-)
even worse, my compiler is certainly broken. This very basic equation:
printf("23 - 523 = %d\n","23"-"523");
produces:
23 - 523 = 1
Indeed, "because your compiler, in this instance, isn't using string pooling," is the technically correct, yet not particularly helpful answer :)
This is one of the many reasons the std::string class in the Standard Template Library now exists to replace this earlier kind of string when you want to do anything useful with strings in C++, and is a problem pretty much everyone who's ever learned C or C++ stumbles over fairly early on in their studies.
Let me explain.
Basically, back in the days of C, all strings worked like this. A string is just a bunch of characters in memory. A string you embed in your C source code gets translated into a bunch of bytes representing that string in the running machine code when your program executes.
The crucial part here is that a good old-fashioned C-style "string" is an array of characters in memory. That block of memory is often referred to by means of a pointer -- the address of the start of the block of memory. Generally, when you're referring to a "string" in C, you're referring to that block of memory, or a pointer to it. C doesn't have a string type per se; strings are just a bunch of chars in a row.
When you write this in your code:
"wibble"
Then the compiler provides a block of memory that contains the bytes representing the characters 'w', 'i', 'b', 'b', 'l', 'e', and '\0' in that order (the compiler adds a zero byte at the end, a "null terminator". In C a standard string is a null-terminated string: a block of characters starting at a given memory address and continuing until the next zero byte.)
And when you start comparing expressions like that, what happens is this:
if ("Maya" == "Maya")
At the point of this comparison, the compiler -- in your case, specifically; see my explanation of string pooling at the end -- has created two separate blocks of memory, to hold two different sets of characters that are both set to 'M', 'a', 'y', 'a', '\0'.
When the compiler sees a string in quotes like this, "under the hood" it builds an array of characters, and the string itself, "Maya", acts as the name of the array of characters. Because the name of an array is converted to a pointer to its first character in most expressions, the expression "Maya" ends up being treated as a pointer to char in this comparison.
When you compare these two expressions using "==", what you're actually comparing is the pointers, the memory addresses of the beginning of these two different blocks of memory. Which is why the comparison is false, in your particular case, with your particular compiler.
If you want to compare two good old-fashioned C strings, you should use the strcmp() function. This will examine the contents of the memory pointed to by both "strings" (which, as I've explained, are just pointers to a block of memory) and go through the bytes, comparing them one-by-one, and tell you whether they're really the same.
Now, as I've said, this is the kind of slightly surprising result that's been biting C beginners on the arse since the days of yore. And that's one of the reasons the language evolved over time. Now, in C++, there is a std::string class, that will hold strings, and will work as you expect. The "==" operator for std::string will actually compare the contents of two std::strings.
By default, though, C++ is designed to be backwards-compatible with C, i.e. a C program will generally compile and work under a C++ compiler the same way it does in a C compiler, and that means that old-fashioned strings, "things like this in your code", will still end up as pointers to bits of memory that will give non-obvious results to the beginner when you start comparing them.
Oh, and that "string pooling" I mentioned at the beginning? That's where some more complexity might creep in. A smart compiler, to be efficient with its memory, may well spot that in your case, the strings are the same and can't be changed, and therefore only allocate one block of memory, with both of your names, "Maya", pointing at it. At which point, comparing the "strings" -- the pointers -- will tell you that they are, in fact, equal. But more by luck than design!
This "string pooling" behaviour will change from compiler to compiler, and often will differ between debug and release modes of the same compiler, as the release mode often includes optimisations like this, which will make the output code more compact (it only has to have one block of memory with "Maya" in, not two, so it's saved five -- remember that null terminator! -- bytes in the object code.) And that's the kind of behaviour that can drive a person insane if they don't know what's going on :)
If nothing else, this answer might give you a lot of search terms for the thousands of articles that are out there on the web already, trying to explain this. It's a bit painful, and everyone goes through it. If you can get your head around pointers, you'll be a much better C or C++ programmer in the long run, whether you choose to use std::string instead or not!