Understanding C++ strings - c++

I'm trying to understand how strings really work in C++ because I just got really confused after coming across an unexpected behavior.
Considering a string, I insert a character (not using append()) using [] operator:
string str;
str[0] = 'a';
Let's print the string:
cout << "str:" << str << endl;
I get an empty string as output:
str:
Ok, let's try printing the only character in the string:
cout << "str[0]:" << str[0] << endl;
Output:
str[0]:a
Q1. What happened there? Why was a not printed in the first case?
Now, I do something that should throw a compilation error but it doesn't and my question is again, why.
str = 'ABC';
Q2. How's that not an incorrect semantic i.e. assigning a character (which is not really a character but essentially a string in single quotes) to a string?
Now, worse when I print the string, it always prints last character i.e C (I was expecting first character i.e. A):
cout << "str:" << str << endl;
Output:
str:C
Q3. Why was the last character printed, not first?

Considering a string, I insert a character (not using append()) using [] operator:
string str;
str[0] = 'a';
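You did not insert a character. operator[](size_type pos) returns a reference to the already existing character at pos. If pos == size(), it returns a reference to the null terminator, and modifying that (to anything other than '\0') has undefined behaviour; before C++11, even the access through a non-const string was undefined. If pos > size(), the behaviour is undefined. Your string is empty, so size() == 0 and therefore str[0] = 'a' has undefined behaviour.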
Q1. What happened there? Why was a not printed in the first case?
The behaviour is undefined.
Now, I do something that should throw a compilation error but it doesn't and my question is again, why.
str = 'ABC';
Q2. How's that not an incorrect semantic i.e. assigning a character ... to a string?
Assigning a character to a string is not incorrect semantic. It sets the content of the string to that single character.
Q2. ... a character (which is not really a character but essentially a string in single quotes) ...
It is a multicharacter literal. The type of a multicharacter literal is int. If the compiler supports multicharacter literals, then the semantic is not incorrect.
There isn't an assignment operator for string that would accept an int. However, int is implicitly convertible to char, so the assignment operator that accepts a char is used after the conversion.
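char cannot necessarily represent all the values that int can, so the value may not fit after the conversion. If char is a signed type, the result of such an out-of-range conversion is implementation-defined (until C++20; since C++20 it wraps modulo 2^N, where N is the number of bits in char). If char is unsigned, the value always wraps modulo 2^N.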
Q3. Why was the last character printed, not first?
The value of a multicharacter literal is implementation-defined. You'll need to consult the manual of your compiler to find out whether multicharacter literals are supported, and what value you should expect. Furthermore, you'll need to consider the fact that the char that the value is converted to probably cannot represent all values of int.
but I didn't get any warnings
Then consider getting a better compiler. This is what GCC warns:
warning: multi-character character constant [-Wmultichar]
str = 'ABC';
warning: overflow in implicit constant conversion [-Woverflow]
str[0] = 'a' should work with string just like it does with char str[] = "" (but it doesn't as we saw). Can you help me understand why [] operator has different behavior in dealing with array of characters than string?
Because that's how the standard has defined the behaviour and requirements of std::string.
char str[] = "";
Creates an array of size 1, consisting of the null terminator. This element of the array is like any other, and you can freely modify it:
str[0] = 'a';
This is well defined and OK. But now str no longer contains a null-terminated string, so trying to use it as such has undefined behaviour:
cout << "str:" << str << endl; // oops, str is not a null terminated string
So, std::string has been designed such that you cannot mess with the final null terminator - as long as you obey the requirements of std::string. Not allowing touching the null terminator also allows the implementation to never allocate a memory buffer for an empty string. Not allocating memory may be faster than allocating memory, so this is a good thing.

You should take a look at http://en.cppreference.com/w/cpp/string/basic_string/operator_at. Namely, the portion about "If pos == size(), the behavior is undefined."
The following line creates an empty string:
string str;
so size() will return 0.

Your statement string str; str[0]='a' is undefined behaviour, though the reason differs between "before C++11" and "from C++11 on". Note that str is a non-const string. Before C++11, even a read access like str[pos] with pos == size() on a non-const string yields undefined behaviour. From C++11 on, a read access is permitted (yielding a reference to the '\0' character). A modification, however, still has undefined behaviour.
So much for the cppreference page on std::basic_string::operator_at.
But now let's explain the behaviour of a program similar to yours but with defined behaviour; (I'll use this then as analogy to describe the behaviour of your program):
string str = "bbbb";
const char* cstr = str.data();
printf("address: %p; content:%s\n", (const void*)cstr, cstr);
// yields "address: 0x7fff5fbff5d9; content:bbbb"
str[0] = 'a';
const char* cstr2 = &str[0];
printf("address: %p; content:%s\n", (const void*)cstr2, cstr2);
// yields "address: 0x7fff5fbff5d9; content:abbb"
cout << "str:" << str << endl;
// yields "str:abbb"
The program is almost self-explanatory, but note that str.data() gives a pointer to the internal data buffer, and str.data() returns the same address as &str[0].
If we now change the same program to your setting with an empty string, not much changes in the behaviour (although this behaviour is undefined, not safe, not guaranteed, and may differ from compiler to compiler):
string str; // is the same as string str = ""
const char* cstr = str.data();
printf("address: %p; content:%s\n", (const void*)cstr, cstr);
// yields "address: 0x7fff5fbff5c1; content:"
str[0] = 'a'; // undefined behaviour: the string is empty
const char* cstr2 = &str[0];
printf("address: %p; content:%s\n", (const void*)cstr2, cstr2);
// yields "address: 0x7fff5fbff5c1; content:a"
cout << "str:" << str << endl;
// yields "str:"
Note that str.data() returns the same address as &str[0] and that 'a' has actually been written to that address (with some luck we do not access unallocated memory, as an empty string is not guaranteed to have a buffer ready). So printing str.data() actually gives you a (with additional luck, the character after 'a' is a string-terminating '\0'). In any case, the statement str[0]='a' does not increase the string's size, which is still 0, so cout << str prints an empty string.
Hope this helps somehow.

string str;
Makes a string of length 0.
str[0] = 'a';
Attempts to set the first element of the string to 'a'. Note that the length of the string is still 0. Also note that there may not be any space allocated to hold this 'a'. The program is broken at this point, so any further analysis is guesswork.
cout << "str:" << str << endl;
Prints the contents of the string. The string is length 0, so nothing prints.
cout << "str[0]:" << str[0] << endl;
Reaches into undefined territory and tries to read back the previously stored 'a'. This is not guaranteed to work, and the result is undefined. In this case it gave the appearance of working, possibly the nastiest thing undefined behaviour can do.
str = 'ABC';
is not necessarily an error, because multicharacter literals exist in the language, but it will most likely (though it is not required to) produce a compiler warning, as it is probably a mistake.
cout << "str:" << str << endl;
Your guess is as good as mine as to what the compiler will do, since str = 'ABC'; was logically incorrect (although syntactically valid). The compiler seems to have truncated ABC to the last character, much like putting 257 into an 8-bit integer may preserve only the least significant 8 bits.

Related

In the C++ STL, does the string container actually contain a string with a terminating '\0'? [duplicate]

Will the below string contain the null terminator '\0'?
std::string temp = "hello whats up";
No, but if you say temp.c_str() a null terminator will be included in the return from this method.
It's also worth saying that you can include a null character in a string just like any other character.
string s("hello");
cout << s.size() << ' ';
s[1] = '\0';
cout << s.size() << '\n';
prints
5 5
and not 5 1 as you might expect if null characters had a special meaning for strings.
Not in C++03, and before C++11 it is not even guaranteed that a std::string is contiguous in memory. Only C strings (char arrays intended for storing strings) had the null terminator.
In C++11 and later, mystring.c_str() is equivalent to mystring.data() is equivalent to &mystring[0], and mystring[mystring.size()] is guaranteed to be '\0'.
In C++17 and later, mystring.data() also provides an overload that returns a non-const pointer to the string's contents, while mystring.c_str() only provides a const-qualified pointer.
This depends on your definition of 'contain' here. In
std::string temp = "hello whats up";
there are a few things to note:
temp.size() will return the number of characters from first h to last p (both inclusive)
But at the same time temp.c_str() or temp.data() will return with a null terminator
Or in other words int(temp[temp.size()]) will be zero
I know, I sound similar to some of the answers here but I want to point out that size of std::string in C++ is maintained separately and it is not like in C where you keep counting unless you find the first null terminator.
To add, the story is a little different if your string literal contains embedded \0 characters. In that case, construction of a std::string from the literal alone stops at the first null character, as follows:
std::string s1 = "ab\0\0cd"; // s1 contains "ab", using string literal
std::string s2{"ab\0\0cd", 6}; // s2 contains "ab\0\0cd", using the pointer-and-length constructor
std::string s3 = "ab\0\0cd"s; // s3 contains "ab\0\0cd", using the ""s literal operator (C++14, needs using namespace std::string_literals)
References:
https://akrzemi1.wordpress.com/2014/03/20/strings-length/
http://en.cppreference.com/w/cpp/string/basic_string/basic_string
Yes, if you call temp.c_str(), it will return a null-terminated C string.
However, the actual data stored in the object temp may not be null-terminated (before C++11), but it doesn't matter and shouldn't matter to the programmer, because when the programmer wants a const char*, they would call c_str() on the object, which is guaranteed to return a null-terminated string.
With C++ strings you don't have to worry about that, and it's possibly dependent on the implementation.
Using temp.c_str() you get a C representation of the string, which will definitely contain the \0 char. Other than that, I don't really see how it would be useful on a C++ string.
std::string internally keeps a count of the number of characters, and it works using this count. Like others have said, when you need the string for display or for whatever other reason, you can call its c_str() method, which will give you the string with the null terminator at the end.

Output non-null terminated char array behaviours?

char sentence[] = {'k','k','k','k','k','k','k','k'}; // 8 characters
std::cout << sentence << std::endl;
Then output just "kkkkkkkk".
But if we reduce the number of characters in the array (i.e. the array had 8 characters before and now has fewer than 8):
char sentence[] = {'k','k','k','k','k','k','k'}; // 7 characters
std::cout << sentence << std::endl;
output:
kkkkkkk`\363\277\357\376.
The operand sentence decays to a pointer to first element of the array. The stream insertion operator overload that accepts const char* parameter requires that the pointer is to a null terminated array. If that pre-condition is violated, then the behaviour of the program is undefined.
sentence does not contain the null terminator character. You insert it into a character stream. The behaviour of the program is undefined.
Then output just "kkkkkkkk".
That's one potential behaviour. This behaviour isn't guaranteed because the behaviour of the program is undefined.
But if we reduce the length of the array, then the output is kkkkkkk`\363\277\357\376.
That's one potential behaviour. This behaviour isn't guaranteed because the behaviour of the program is undefined.
char arrays are often used for null-terminated strings, but not always.
When you write
const char* str = "Hello";
then str is a pointer to the first element of a const char[6]: 5 elements for the characters, plus one for the terminating \0, which the rules of the language make room for.
On the other hand, sometimes a char array is just that: An array of characters. When you write
char sentence[] ={'H','e','l','l','o'};
Then sentence is an array of only 5 chars. It is not a null-terminated string.
These two worlds (general char array / null-terminated strings) clash when you call
std::cout << sentence << std::endl;
because the operator<< overload does expect a null-terminated string. Because sentence does not point to a null-terminated string, your code has undefined behavior.
If you want sentence to be a string, make it one:
char sentence[] = {'H','e','l','l','o','\0'};
Or treat it like a plain array of characters:
for (const auto& c : sentence) {
    std::cout << c;
}
Try using
char sentence[] = {'k','k','k','k','k','k','k','k','\0'};
You must always put a '\0' at the end when you declare a char array manually and want to print it as a string, because both cout and printf look for the '\0' character and stop printing only when they find it. Also make sure you don't put it in the middle of your array, or the complete array won't be printed.

Why does cout << &r give different output than cout << (void*)&r?

This might be a stupid question, but I'm new to C++ so I'm still fooling around with the basics. Testing pointers, I bumped into something that didn't produce the output I expected.
When I ran the following:
char r ('m');
cout << r << endl;
cout << &r << endl;
cout << (void*)&r << endl;
I expected this:
m
0042FC0F
0042FC0F
..but I got this:
m
m╠╠╠╠ôNh│hⁿB
0042FC0F
I was thinking that perhaps since r is of type char, cout would interpret &r as a char* and [for some reason] output the pointer value - the bytes comprising the address of r - as a series of chars, but then why would the first one be m, the content of the address pointed to, rather than the char representation of the first byte of the pointer address? It was as if cout interprets &r as r, but instead of just outputting 'm', it goes on to output more chars, interpreted from the byte values of the subsequent 11 memory addresses. Why? And why 11?
I'm using MSVC++ (Visual Studio 2013) on 64 bit Win7.
Postscript: I got a lot of correct answers here (as expected, given the trivial nature of the question). Since I can only accept one, I made it the first one I saw. But thanks, everyone.
So to summarize and expand on the instinctive theories mentioned in my question:
Yes, cout does interpret &r as char*, but since char* is a 'special thing' in C++ that essentially means a null terminated string (rather than a pointer [to a single char]), cout will attempt to print out that string by outputting chars (interpreted from the byte contents of the memory address of r onwards) until it encounters '\0'. Which explains the 11 extra characters (it just coincidentally took 11 more bytes to hit that NUL).
And for completeness - the same code, but with int instead of char, performs as expected:
int s (3);
cout << s << endl;
cout << &s << endl;
cout << (void*)&s << endl;
Produces:
3
002AF940
002AF940
A char * is a special thing in C++, inherited from C. It is, in most circumstances, a C-style string. It is supposed to point to an array of chars, terminated with a 0 (a NUL character, '\0').
So it tries to print this, following on into the memory after the 'm', looking for a terminating '\0'. This makes it print some random garbage. This is known as Undefined Behaviour.
There is an operator<< overload specifically for char* strings. This outputs the null-terminated string, not the address. Since the pointer you're passing this overload isn't a null-terminated string, you also get Undefined Behavior when operator<< runs past the end of the buffer.
Conversely, the void* overload will print the address.
Because operator<< is overloaded based on the data type.
If you give it a char, it assumes you want that character.
If you give it a void*, it assumes you want an address.
However, if you give it a char*, it takes that as a C-style string and attempts to output it as such. Since the original intent of C++ was "C with classes", handling of C-style strings was a necessity.
The reason you get all the rubbish at the end is simply because, despite your assertion to the compiler, it isn't actually a C-style string. Specifically, it is not guaranteed to have a string-terminating NUL character at the end so the output routines will just output whatever happens to be in memory after it.
This may work (if there's a NUL there), it may print gibberish (if there's a NUL nearby), or it may fall over spectacularly (if there's no NUL before it gets to memory it cannot read). It's not something you should rely on.
Because there's an overload of operator<< which takes a const char pointer as its second argument and prints out a string. The overload that takes a void pointer prints only the address.
A char * is often - usually even - a pointer to a C-style null-terminated string (or a string literal) and is treated as such by ostreams. A void * by contrast unambiguously indicates a pointer value is required.
The output operator (operator<<()) is overloaded for char const* and void const*. When passing a char* the overload for char const* is a better match and chosen. This overload expects a pointer to the start of a null terminated string. You give it a pointer to an individual char, i.e., you get undefined behavior.
If you want to try with a well-defined example you can use
char s[] = { 'm', 0 };
std::cout << s[0] << '\n';
std::cout << &s[0] << '\n';
std::cout << static_cast<void*>(&s[0]) << '\n';

C++ Concatenation Oops

So I've been going back and forth between C++, C# and Java lately, and while writing some C++ code I did something like this.
string LongString = "Long String";
char firstChar = LongString.at(0);
And then tried using a method that looks like this,
void MethodA(string str)
{
//some code
cout << str;
//some more code
}
Here's how I implemented it.
MethodA("1. "+ firstChar );
Though perfectly valid in C# and Java, this did something weird in C++.
I expected something like
//1. L
but it gave me part of some other string literal from later in the program.
What did I actually do?
I should note I've fixed the mistake so that it prints what I expect but I'm really interested in what I mistakenly did.
Thanks ahead of time.
C++ does not define addition on string literals as concatenation. Instead, a string literal decays to a pointer to its first element; a single character is interpreted as a numeric value so the result is a pointer offset from one location in the program's read-only memory segment to another.
To get addition as concatenation, use std::string:
MethodA(std::string() + "1. " + firstChar);
MethodA(std::string("1. ") + firstChar);
(since "1. " is a const char[4] and has no concatenation methods)
The problem is that "1. " is a string literal (array of characters), that will decay into a pointer. The character itself is a char that can be promoted to an int, and addition of a const char* and an int is defined as calculating a new pointer by offsetting the original pointer by that many positions.
Your code in C++ is calling MethodA with the result of adding (int)firstChar (the ASCII value of the character) to the string literal "1. ". If the value of firstChar is greater than 4 (which it almost certainly is), the resulting pointer lies outside the array and the behavior is undefined.
MethodA("1. "+ firstChar ); //your code
doesn't do what you want it to do. It is pointer arithmetic: it just adds an integral value (which is firstChar) to the address of the string literal "1. ", and the result (which is of type char const*) is passed to the function, where it is converted to string. Depending on the value of firstChar, it could invoke undefined behavior. In fact, in your case it does invoke undefined behavior, because the resulting pointer points beyond the string literal.
Write this:
MethodA(string("1. ")+ firstChar ); //my code
String literals in C++ are not instances of std::string, but rather constant arrays of chars. So adding a char to one causes an implicit conversion to a character pointer, which is then advanced by the numerical value of the character and, in your case, happened to point to another string literal stored in the program's data section.