String literals that contain '\0' - why aren't they the same? - c++

So I did the following test:
char* a = "test";
char* b = "test";
char* c = "test\0";
And now the questions:
1) Is it guaranteed that a==b? I know I'm comparing addresses. This is not meant to compare the strings, but whether identical string literals are stored in a single memory location
2) Why doesn't a==c? Shouldn't the compiler be able to see that they're referring to the same string?
3) Is an extra \0 appended at the end of c, even though it already contains one?
I didn't want to ask 3 different questions for this because they seem somehow related, sorry 'bout that.
Note: The tag is correct, I'm interested in C++. (although please specify if the behavior is different for C)

Is it guaranteed that a==b?
No. But it is allowed by §2.14.5/12:
Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.
And as you can see from that last sentence using char* instead of char const* is a recipe for trouble (and your compiler should be rejecting it; make sure you have warnings enabled and high conformance levels selected).
Why doesn't a==c? Shouldn't the compiler be able to see that they're referring to the same string?
No, they're not required to be referring to the same array of characters. One has five elements, the other six. An implementation could store the two in overlapping storage, but that's not required.
Is an extra \0 appended at the end of c, even though it already contains one?
Yes.

1 - absolutely not. a might == b though if the compiler chooses to share the same static string.
2 - because they are NOT referring to the same string
3 - yes.
The behavior is no different between C and C++ here except that C++ compilers should reject the assignment to non-const char*.

1) Is it guaranteed that a==b?
It is not. Note that you are comparing addresses, and they could be pointing to different locations. Most smart compilers would fold this duplicate literal constant, so the pointers may compare equal, but again it's not guaranteed by the standard.
2) Why doesn't a==c? Shouldn't the compiler be able to see that they're referring to the same string?
You are trying to compare pointers, they point to different memory locations. Even if you were comparing the content of such pointers, they are still unequal (see next question).
3) Is an extra \0 appended at the end of c, even though it already contains one?
Yes, there is.

First note that this should be const char* as that's what string literals decay to.
Both create arrays initialized with 't' 'e' 's' 't' followed by a '\0' (length = 5). Comparing for equality will only tell you if they both start at the same address, not if they have the same contents (though logically, the first implies the second).
a isn't equal to c because the same rules apply: a = 't' 'e' 's' 't' '\0' and c = 't' 'e' 's' 't' '\0' '\0'.
Yes, the compiler always does it, and you shouldn't explicitly do it if you're making a string like this. If, however, you created an array and manually populated it, you need to ensure you add the \0.
Note that for my #3, const char s[] = "Hello World" would also automatically get the \0 at the end; I was referring to manually filling the array, not having the compiler work it out.

The problem here is you're mixing the concepts of pointer and textual equivalence.
When you say a == b or a == c you are asking if the pointers involved point to the same physical address. The test has nothing to do with the text the pointers point to.
To get textual equivalence you should use strcmp.

If you are doing pointer comparisons then a != b, b != c, and c != a, unless the compiler is smart enough to notice that your first two strings are the same.
If you call strcmp on any pair of these strings, they will all come back as matches.
I am not sure if the compiler will add an additional null termination to c, but I would guess that it would.

As has been said a few times in other answers, you are comparing pointers. However, I would add that strcmp(b, c) should report a match (that is, return 0), because it stops checking at the first \0.

Related

Confusion about the necessity of the null-character?

I am reading about why exactly there is a need for null characters, and I found this answer, which made some sense to me. It states that the terminator is needed because char arrays (for C strings) are often allocated much larger than the actual strings, so you need a way to mark the end.
But why aren't these array not just constructed with a size deduction based on the initializer (without the null-character that actually is implicitly added when assigning directly to string literals). Like, if the arrays holding the strings are constructed using size deduction, there would not be a need for the null-character because the array was not any bigger than the string, so of course, it would end at the end of that array.
I am reading about why exactly there is a need for null characters, and I found this answer, which made some sense to me. It states that the terminator is needed because char arrays (for C strings) are often allocated much larger than the actual strings, so you need a way to mark the end.
The answer is misleading. That's not really the reason for why null termination is needed. The accepted answer with more upvotes is better.
there would not be a need for the null-character because the array was not any bigger than the string, so of course, it would end at the end of that array.
Let us remind ourselves, that we cannot use arrays as function arguments. Even if we could, we wouldn't want to, because it would be slow to copy an entire array into the argument.
Therefore, there is a need to refer to an array indirectly. Indirection is commonly achieved using pointers (or references). Now, we could have a "pointer to character array of size 42", but that is not very useful because then the argument can only point to strings of one particular size.
Instead, the common approach is to use a pointer to the first element of the array. This is such a common pattern that the language has a rule that allows the name of the array to implicitly decay into a pointer to its first element.
But can you tell how big an array is, based on a pointer to an element of that array? You cannot. You need extra information. The accepted answer of the linked question explains the options that are available for representing the size, and that the designer of C chose the option that uses a terminating character (which was already the convention used by the BCPL language which C is based on).
TL;DR Size information is needed because there is a need to refer to the string indirectly, and that indirection hides the knowledge about the size of the array. Null termination is one way to encode the size information within the content of the string, and it is the way that was chosen by the designer of the C language.
Historically, string arrays are provided with termination symbol(s). The reason is simple: instead of passing two values (the head of the array and the array length) you need to pass just one value, the head of the array. This simplifies the calling signature but places some requirements on the caller.
In C/C++ itself, the null character is the termination symbol, so all runtime functions work on the assumption that the very first null char they meet is the end of the string. At the same time, in application-level logic the terminating symbol(s) may be different: for example, in HTTP headers a CR-LF-CR-LF sequence marks the end of the header, while a single CR-LF sequence is just the start of the next line.
But why aren't these array not just constructed with a size deduction
based on the initializer (without the null-character that actually is
implicitly added when assigning directly to string literals).
I suppose you mean why you can't write:
char t[] = "abracadabra";
and the compiler would deduce a size of 11?
Because you have 12 characters and not 11. If the array had size 11, then something would be lost: the byte used to contain the NUL would not be part of the array, and the compiler couldn't tell the difference between:
char t[] = "abracadabra"; // an array deduced from a C-string literal
and
char t[11] = { 'a', 'b', 'r', 'a', 'c', 'a', 'd', 'a', 'b', 'r', 'a' }; // a "real" array, not a C-string!
The first would have to release 12 bytes at the end of scope and the second 11.
Historically, arrays are just a kind of syntactic sugar on top of pointer arithmetic.
... because that char arrays ... are often allocated much larger than the actual strings
That answer is awful.
C strings can be dynamically allocated, meaning you don't know, before runtime, how long they should be. Instead of pre-allocating a massive array and filling most of it with zeroes, you can just malloc(required_size+1) and stick a single NUL character at the end.
Conversely, string literals, which are known at compile time, are definitely not "allocated much larger than the actual strings". There wouldn't be any point, since you know exactly how much space is needed in advance.
But why aren't these array not just constructed with a size deduction based on the initializer
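/* C, not C++; fd is assumed to be an open file descriptor */
size_t expected;
if (read(fd, &expected, sizeof(expected)) == (ssize_t)sizeof(expected)) {
    char *buf = malloc(expected + 1);   /* +1 leaves room for the terminator */
    if (buf && read(fd, buf, expected) == (ssize_t)expected) {
        buf[expected] = 0;              /* terminate the string ourselves */
        /* now do something with buf */
    }
    free(buf);                          /* free(NULL) is a no-op */
}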
There you go, a dynamically sized string. What would your "size deduction" be? What is the "initializer"?
I could have written a less-ugly example using std::string, since the question is tagged C++, but it's actually C strings you're specifically asking about, and it doesn't make any real difference.
Strings are often manipulated by creating a char array to hold intermediate results and modifying its contents:
char buffer[128];
strcpy(buffer, "Hello, ");
strcat(buffer, "world");
std::cout << buffer << '\n';
After the call to strcpy the buffer has 7 characters that we care about; after the call to strcat it has 12. So the number of characters in the buffer can change, and we need to have a way of indicating how many characters there are that matter. One convention is to put a character count in the first location in the array, and the actual characters after that. Another convention is to put a marker at the end of the characters that matter. There are tradeoffs here, but the decision in C, which was carried through into C++ was to go with an end marker.

Comparing a char variable to empty char does not work

Say x is a character.
Whenever I do if(x <> '') to know whether the variable is empty or not, it just does not work.
However, when I attempt to do this if(x <> chr(0)), it does work.
I have tried the same thing on two versions of the compiler : Free Pascal and Charm Pascal, but I am still facing the same problem.
There is no such thing as an "empty char". The Char type is always a single character.
That character could be 1 byte AnsiChar representing a value from 0..255. (In Delphi and fpc, it could also be a 2 byte WideChar representing a value from 0..65535.) Either way it is always represented as '<something>'. That "something" must be a character value.
When you compare x <> Chr(0) you are taking the byte value of 0 and converting it to a Char so a valid comparison can be performed.
Side Notes
For Char to reliably have the concept "no value" requires storing additional information. E.g. Databases may have a hidden internal bit field indicating the value is NULL. It's important to be aware that this is fundamentally different from any of the valid values it may have if it's not NULL. Libraries that interact with databases need to provide a way to determine if a value is NULL.
You haven't provided any information about the actual problem you're trying to solve but here are some thoughts that may yield progress:
If you're dealing with user input, it may be more appropriate to compare with a space character ' '.
If you're dealing with characters read from a file, you should probably be checking number of bytes/characters actually read.
If you're trying to determine the end of a string it's much more reliable to use the Length() of the string.
(Though there are some environments that use the convention of treating Char(0) as a special character meaning "end-of-string".) But the convention requires allocating an extra character making the string internally longer than its text length. So the technique is not usable if the environment doesn't support it.
Most importantly, from comments it seems you might be struggling with the difference between empty-string and how that's represented as a Char. And the point is that it isn't. You need to check the length of the string.
E.g. You can do the following:
if (s <> '') then
begin
{ You now know there is at least 1 character in the string so
you can safely read it and not worry about "if it has a value".}
x := s[1];
...
end;

Why is it wrong to modify the contents of a pointer to a string literal?

If I write:
char *aPtr = "blue"; //would be better const char *aPtr = "blue"
aPtr[0]='A';
I get a warning. The code above may work but isn't standard; it has undefined behavior because the pointer refers to a string literal, which lives in read-only memory. The question is:
Why is it like this?
with this code rather:
char a[]="blue";
char *aPtr=a;
aPtr[0]='A';
is ok. I want to understand under the hood what happens
The first is a pointer to a read-only value created by the compiler and placed in a read-only section of the program. You cannot modify the characters at that address because they are read-only.
The second creates an array and copies each element from the initializer (see this answer for more details on that). You can modify the contents of the array, because it's a simple variable.
The first one works the way it does because doing anything else would require dynamically-allocating a new variable, and would require garbage collection to free it. That is not how C and C++ work.
The primary reason that string literals can't be modified (without undefined behavior) is to support string literal merging.
Long ago, when memory was much tighter than today, compiler authors noticed that many programs had the same string literals repeated many times--especially things like mode strings being passed to fopen (e.g., f = fopen("filename", "r");) and simple format strings being passed to printf (e.g., printf("%d\n", a);).
To save memory, they'd avoid allocating separate memory for each instance of these strings. Instead, they'd allocate one piece of memory, and point all the pointers at it.
In a few cases, they got even trickier than that, to merge literals that weren't even entirely identical. For example consider code like this:
printf("%s\t%d\n", a);
/* ... */
printf("%d\n", b);
In this case, the string literals aren't entirely identical, but the second one is identical to the end of the first. In this case, they'd still allocate one piece of memory. One pointer would point to the beginning of the memory, and the other to the position of the %d in that same block of memory.
With a possibility (but no requirement for) string literal merging, it's essentially impossible to say what behavior you'll get when you modify a string literal. If string literals are merged, modifying one string literal might modify others that are identical, or end identically. If string literals are not merged, modifying one will have no effect on any other.
MMUs added another dimension: they allowed memory to be marked as read-only, so attempting to modify a string literal would result in a signal of some sort--but only if the system had an MMU (which was often optional at one time) and also depending on whether the compiler/linker decided to put the string literals in memory they'd marked constant or not.
Since they couldn't define what the behavior would be when you modified a string literal, they decided that modifying a string literal would produce undefined behavior.
The second case is entirely different. Here you've defined an array of char. It's clear that if you define two separate arrays, they're still separate, regardless of content, so modifying one can't possibly affect the other. The behavior is clear and always has been, so doing so gives defined behavior. The fact that the array in question might be initialized from a string literal doesn't change that.

Is the expression 'ab' == "ab" true in C++

My question probably sounds quite stupid, but I have to answer it while preparing for my bachelor's exam.
So, what do you think about the expression 'ab' == "ab" in C++? Is it false, or simply illegal and a compile error? I have googled a little and learned that 'ab' has type int and "ab" of course does not...
I need to go by what the formal description of the language says, not by what a compiler says.
It definitely generates a warning, but by default, gcc compiles it.
It should normally be false.
That being said, it should be theoretically possible, of course depending on the platform you're running this on, to have the compile-time constant "ab" at a memory location whose address is equal in numerical value to the numerical value of 'ab', case in which the expression would be true (although the comparison is of course meaningless).
In both C and C++ expression 'ab' == "ab" is invalid. It has no meaning. Neither language allows comparing arbitrary integral values with pointer values. For this reason, the matter of it being "true" or not does not even arise. In order to turn it into a compilable expression you have to explicitly cast the operands to comparable types.
The only loophole here is that the value of multi-char character constant is implementation-defined. If in some implementation the value of 'ab' happens to be zero, it can serve as a null-pointer constant. In that case 'ab' == "ab" becomes equivalent to 0 == "ab" and NULL == "ab". This is guaranteed to be false.
It is going to give you a warning, but it will build. What it will do is compare the multicharacter integer constant 'ab' with the address of the string literal "ab".
Bottom line, the result of the comparison won't reflect the choice of letters being the same or not.
The Standard has absolutely nothing to say about comparing an integral type with a pointer. All it says is the following (in section 5.9):
The operands shall have arithmetic, enumeration, or pointer type, or
type std::nullptr_t...
It then goes into a detailed description on what it means to compare two pointers, and mentions comparing two integers. So my interpretation of the lack of specification would be "whatever the compiler writer decides", which is either an error or a warning.
Let's consider this in parts. In plain C, 'c' is a single char; if you want to manipulate strings you have to use an array of chars. As a result 'ca' doesn't work the way you might expect, and the same holds in C++. If you want to use strings in C++ you would use the std::string class, which isn't a built-in type; it is a class with methods and typedefs that makes arrays of chars easier to handle. A character constant and a C-style string are different things, so 'ab' == "ab" is not going to give a meaningful boolean result. It's like trying to compare an int to a string, so this comparison will most likely produce a compile error.

Why isn't ("Maya" == "Maya") true in C++?

Any idea why I get "Maya is not Maya" as a result of this code?
if ("Maya" == "Maya")
printf("Maya is Maya \n");
else
printf("Maya is not Maya \n");
Because you are actually comparing two pointers - use e.g. one of the following instead:
if (std::string("Maya") == "Maya") { /* ... */ }
if (std::strcmp("Maya", "Maya") == 0) { /* ... */ }
This is because C++03, §2.13.4 says:
An ordinary string literal has type “array of n const char”
... and in your case a conversion to pointer applies.
See also this question on why you can't provide an overload for == for this case.
You are not comparing strings, you are comparing pointer address equality.
To be more explicit -
"foo baz bar" implicitly defines an anonymous const char[m]. It is implementation-defined as to whether identical anonymous const char[m] will point to the same location in memory(a concept referred to as interning).
The function you want - in C - is strcmp(const char*, const char*), which returns 0 on equality.
Or, in C++, what you might do is
#include <string>
std::string s1 = "foo";
std::string s2 = "bar";
and then compare s1 vs. s2 with the == operator, which is defined in an intuitive fashion for strings.
The output of your program is implementation-defined.
A string literal has the type const char[N] (that is, it's an array). Whether or not each string literal in your program is represented by a unique array is implementation-defined. (§2.13.4/2)
When you do the comparison, the arrays decay into pointers (to the first element), and you do a pointer comparison. If the compiler decides to store both string literals as the same array, the pointers compare true; if they each have their own storage, they compare false.
To compare strings, use std::strcmp(), like this:
if (std::strcmp("Maya", "Maya") == 0) // same
Typically you'd use the standard string class, std::string. It defines operator==. You'd need to make one of your literals a std::string to use that operator:
if (std::string("Maya") == "Maya") // same
What you are doing is comparing the address of one string with the address of another. Depending on the compiler and its settings, sometimes the identical literal strings will have the same address, and sometimes they won't (as apparently you found).
Any idea why i get "Maya is not Maya" as a result
Because in C++ string literals are of type const char[N] (plain char[N] in C), and such an array is implicitly converted to a pointer to its first character when you try to compare them. And pointer comparison is address comparison.
Whether the two string literals compare equal or not depends on whether your compiler (using your current settings) pools string literals. It is allowed to do that, but it doesn't need to.
To compare the strings in C, use strcmp() from the <string.h> header. (It's std::strcmp() from <cstring> in C++.)
To do so in C++, the easiest is to turn one of them into a std::string (from the <string> header), which comes with all comparison operators, including ==:
#include <iostream>
#include <string>
// ...
if (std::string("Maya") == "Maya")
    std::cout << "Maya is Maya\n";
else
    std::cout << "Maya is not Maya\n";
C and C++ do this comparison via pointer comparison; looks like your compiler is creating separate resource instances for the strings "Maya" and "Maya" (probably due to having an optimization turned off).
My compiler says they are the same ;-)
even worse, my compiler is certainly broken. This very basic equation:
printf("23 - 523 = %d\n","23"-"523");
produces:
23 - 523 = 1
Indeed, "because your compiler, in this instance, isn't using string pooling," is the technically correct, yet not particularly helpful answer :)
This is one of the many reasons the std::string class in the Standard Template Library now exists to replace this earlier kind of string when you want to do anything useful with strings in C++, and is a problem pretty much everyone who's ever learned C or C++ stumbles over fairly early on in their studies.
Let me explain.
Basically, back in the days of C, all strings worked like this. A string is just a bunch of characters in memory. A string you embed in your C source code gets translated into a bunch of bytes representing that string in the running machine code when your program executes.
The crucial part here is that a good old-fashioned C-style "string" is an array of characters in memory. That block of memory is often referred to by means of a pointer -- the address of the start of the block of memory. Generally, when you're referring to a "string" in C, you're referring to that block of memory, or a pointer to it. C doesn't have a string type per se; strings are just a bunch of chars in a row.
When you write this in your code:
"wibble"
Then the compiler provides a block of memory that contains the bytes representing the characters 'w', 'i', 'b', 'b', 'l', 'e', and '\0' in that order (the compiler adds a zero byte at the end, a "null terminator". In C a standard string is a null-terminated string: a block of characters starting at a given memory address and continuing until the next zero byte.)
And when you start comparing expressions like that, what happens is this:
if ("Maya" == "Maya")
At the point of this comparison, the compiler -- in your case, specifically; see my explanation of string pooling at the end -- has created two separate blocks of memory, to hold two different sets of characters that are both set to 'M', 'a', 'y', 'a', '\0'.
When the compiler sees a string in quotes like this, "under the hood" it builds an array of characters, and the string itself, "Maya", acts as the name of the array of characters. Because the name of an array is implicitly converted to a pointer to its first character in an expression like this, "Maya" ends up being treated as a pointer to char (const char in C++).
When you compare these two expressions using "==", what you're actually comparing is the pointers, the memory addresses of the beginning of these two different blocks of memory. Which is why the comparison is false, in your particular case, with your particular compiler.
If you want to compare two good old-fashioned C strings, you should use the strcmp() function. This will examine the contents of the memory pointed to by both "strings" (which, as I've explained, are just pointers to a block of memory) and go through the bytes, comparing them one-by-one, and tell you whether they're really the same.
Now, as I've said, this is the kind of slightly surprising result that's been biting C beginners on the arse since the days of yore. And that's one of the reasons the language evolved over time. Now, in C++, there is a std::string class, that will hold strings, and will work as you expect. The "==" operator for std::string will actually compare the contents of two std::strings.
By default, though, C++ is designed to be backwards-compatible with C, i.e. a C program will generally compile and work under a C++ compiler the same way it does in a C compiler, and that means that old-fashioned strings, "things like this in your code", will still end up as pointers to bits of memory that will give non-obvious results to the beginner when you start comparing them.
Oh, and that "string pooling" I mentioned at the beginning? That's where some more complexity might creep in. A smart compiler, to be efficient with its memory, may well spot that in your case, the strings are the same and can't be changed, and therefore only allocate one block of memory, with both of your names, "Maya", pointing at it. At which point, comparing the "strings" -- the pointers -- will tell you that they are, in fact, equal. But more by luck than design!
This "string pooling" behaviour will change from compiler to compiler, and often will differ between debug and release modes of the same compiler, as the release mode often includes optimisations like this, which will make the output code more compact (it only has to have one block of memory with "Maya" in, not two, so it's saved five -- remember that null terminator! -- bytes in the object code.) And that's the kind of behaviour that can drive a person insane if they don't know what's going on :)
If nothing else, this answer might give you a lot of search terms for the thousands of articles that are out there on the web already, trying to explain this. It's a bit painful, and everyone goes through it. If you can get your head around pointers, you'll be a much better C or C++ programmer in the long run, whether you choose to use std::string instead or not!