STL basic_string length with null characters - c++

Why is it that you can insert a '\0' char in a std::basic_string and the .length() method is unaffected but if you call char_traits<char>::length(str.c_str()) you get the length of the string up until the first '\0' character?
e.g.
string str("abcdefgh");
cout << str.length(); // 8
str[4] = '\0';
cout << str.length(); // 8
cout << char_traits<char>::length(str.c_str()); // 4

Great question!
The reason is that a C-style string is defined as a sequence of bytes that ends with a null byte. When you use .c_str() to get a C-style string out of a C++ std::string, then you're getting back the sequence the C++ string stores with a null byte after it. When you pass this into strlen, it will scan across the bytes until it hits a null byte, then report how many characters it found before that. If the string contains a null byte, then strlen will report a value that's smaller than the whole length of the string, since it will stop before hitting the real end of the string.
An important detail is that strlen and char_traits<char>::length are NOT the same function. However, the C++ ISO spec for char_traits<charT>::length (§21.1.1) says that char_traits<charT>::length(s) returns the smallest i such that char_traits<charT>::eq(s[i], charT()) is true. For char_traits<char>, the eq function just returns if the two characters are equal by doing a == comparison, and constructing a character by writing char() produces a null byte, and so this is equal to saying "where is the first null byte in the string?" It's essentially how strlen works, though the two are technically different functions.
A C++ std::string, however, is a more general notion of "an arbitrary sequence of characters." The particulars of its implementation are hidden from the outside world, though it's probably represented either by a start and stop pointer or by a pointer and a length. Because this representation does not depend on which characters are being stored, asking the std::string for its length tells you how many characters are there, regardless of what those characters actually are.
Hope this helps!

Related

Is the size of an std::string always the number of printed characters?

I'm trying to understand what std::string::size() returns.
According to https://en.cppreference.com/w/cpp/string/basic_string/size it's the "number of CharT elements in the string", but I'm not sure how that relates to the number of printed characters, especially if string termination characters are involved somehow.
This code
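#include <iostream>
#include <string>
using namespace std;

int main()
{
    std::string str0 = "foo" "\0" "bar";  // const char* constructor stops at '\0'
    cout << str0 << endl;
    cout << str0.size() << endl;
    std::string str1 = "foo0bar";
    str1[3] = '\0';                       // in bounds, size stays 7
    cout << str1 << endl;
    cout << str1.size() << endl;
    return 0;
}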
prints
foo
3
foobar
7
In the case of str0, the size matches the number of printed characters. I assume the constructor iterates on the characters of the string literal until it reaches \0, which is why only 'f', 'o' and 'o' are put in the std::string, i.e. 3 characters, and the string termination character is not put in the std::string.
In the case of str1, the size doesn't match the number of printed characters. I assume the same went on as what I described above, but that I broke something by assigning a character. According to cppreference.com, "the behavior is undefined if this character is modified to any value other than CharT()", so I assume I've walked into undefined behavior here.
My question is this: outside of undefined behavior, is it possible that the size of a std::string doesn't match the number of printed characters, or is it actually something guaranteed by the standard?
(note: if the answer to that question changed between versions of the standard I'm interested in knowing that too)
In the case of str1 ... the behavior is undefined if this character is modified to any value other than CharT(), so I assume I've walked into undefined behavior here.
Your assumption is wrong. There is no UB for two reasons:
You did assign the element to '\0' which happens to be same as CharT() and thus it would be well defined to assign that value to str1[str1.size()].
Furthermore, str1.size() is 7 as you demonstrated and 3 is less than 7 and is therefore within bounds and it would be well defined to assign any value to that element.
is it possible that the size of a std::string doesn't match the number of printed characters
Yes, it is possible. std::string can contain non-printable characters as well, and thus the size is not necessarily the same as the number of printed characters. Your example str1 has no undefined behaviour and demonstrates how size can be different from number of printed characters.
Besides non-printable characters, in some character encodings (notably those of Unicode) a grapheme cluster, i.e. what a reader perceives as one character, may consist of multiple code points, each of which may consist of multiple code units (a code unit is a single char object). The size of the string is the number of chars, i.e. the number of code units. Thus, one should not expect the size of the string to match the number of printed characters.
or is it actually something guaranteed by the standard?
No such guarantee exists.
if the answer to that question changed between versions of the standard I'm interested in knowing that too
There has been no change regarding this.
std::string has several constructors, one of which receives a const char*, and that's the one that constructs str0. Because no length information is provided, the string is initialized only up to the first null terminator.
In the case of str1, the string length really is 7 characters. When you replace str1[3] with '\0', the string doesn't change its length, but the content is now "foo\0bar". Unlike a C string, a std::string can contain embedded nulls because it carries its own length information. Therefore when you cout << str1 << endl; exactly 7 bytes are written out. You just don't see the byte '\0' in the output because it's ASCII NUL, which isn't a printable character.
You can also use the s suffix (available since C++14 after using namespace std::string_literals;) to construct a std::string directly from a literal with embedded nulls, without resorting to another constructor, because the literal's length is passed along. Try auto str0 = "foo\0bar"s; and see.

Will cout function of c++ stop printing if it comes across a null character(positioned in middle of array) in character array

Will cout function of c++ stop printing if it comes across a null character(positioned in middle of array) in character array?
Firstly, std::cout isn't a function. It's an object (specifically an output stream object).
Its << operator is overloaded. The overloads which are particularly relevant are those which accept strings.
If we pass a std::string, then << will output every character in the string.
If we pass a C-style string (char*), then << will output every character up to, but not including, the first NUL character encountered. Note that for malformed input (containing no NUL), this can lead to undefined behaviour, due to reading beyond the bounds of the array.
The other (unformatted) output operation is write(), which accepts a character array and a length. That outputs exactly the number of characters specified, including any NULs encountered.

cout string and c_str gives different values in c++

In my code, I have a string variable named ChannelPacket.
When I print ChannelPacket in gdb, it gives the following string:
"\020\000B\001\237\246&\b\000\016\000\002\064\001\000\000\005\000\021\002\000\000\006\000\f\001\001\000\000sZK"
But if I print ChannelPacket.c_str(), it gives just "\020".
Please help me.
c_str() returns a pointer to char that's understood to be terminated by a NUL character ('\0').
Since your string contains an embedded '\0', it's seen as the end of the string when viewed as a pointer to char.
When viewed as an actual std::string, the string's length is known, so the whole thing is written out, regardless of the embedded NUL characters.
The second byte is a zero, which marks the end of the string when it is treated as a C string. If you want to output the raw bytes, rather than treating them as a null-terminated string, you can't use cout << ChannelPacket.c_str(); use cout << ChannelPacket instead.

What does '\0' mean?

I can't understand what the '\0' in the two different place mean in the following code:
string x = "hhhdef\n";
cout << x << endl;
x[3]='\0';
cout << x << endl;
cout<<"hhh\0defef\n"<<endl;
Result:
hhhdef
hhhef
hhh
Can anyone give me some pointers?
C++ std::strings are "counted" strings, i.e., their length is stored as an integer, and they can contain any character. When you replace the fourth character (x[3]) with a \0, nothing special happens: it's printed as if it were any other character (in particular, your console simply ignores it).
In the last line, instead, you are printing a C string, whose end is determined by the first \0 that is found. In such a case, cout goes on printing characters until it finds a \0, which, in your case, is after the third h.
C++ has two string types:
The built-in C-style null-terminated strings, which are really just byte arrays, and the C++ standard library std::string class, which does not rely on a null terminator to find its end.
Printing a null-terminated string prints everything up until the first null character. Printing a std::string prints the whole string, regardless of null characters in its middle.
\0 is the NUL character; you can find it in your ASCII table, where it has the value 0.
It is used to mark the end of C-style strings.
However, C++ class std::string stores its size as an integer, and thus does not rely on it.
You're representing strings in two different ways here, which is why the behaviour differs.
The second one is easier to explain; it's a C-style raw char array. In a C-style string, '\0' denotes the null terminator; it's used to mark the end of the string. So any functions that process/display strings will stop as soon as they hit it (which is why your last string is truncated).
The first example is creating a fully-formed C++ std::string object. These don't assign any special meaning to '\0' (their length is stored separately, so they don't depend on a null terminator).
\0 is the NUL character. It is used to mark the end of a string in C.
In C, a string is an array of characters with \0 at the end, usually accessed through a pointer. So the following are valid representations of strings in C:
char *c = "Hello"; // it is actually Hello\0
char c[] = {'Y','o','\0'};
The use of '\0' lies in determining the end of a string, e.g. when finding the length of a string.
\0 is the null terminator, which C uses to mark the end of a string. In simple words, it is the character whose value is zero, and it tells any code reading the string that this is where the string ends.
Let me give you an example:
When we write printf("Hello World"); /* stored as Hello World\0 */
Here we can clearly see \0 acting as the terminator, though printing the string shown in the comment would give the same output.

String going crazy if I don't give it a little extra room. Can anyone explain what is happening here?

First, I'd like to say that I'm new to C / C++, I'm originally a PHP developer so I am bred to abuse variables any way I like 'em.
C is a strict country, compilers don't like me here very much, I am used to breaking the rules to get things done.
Anyway, this is my simple piece of code:
char IP[15] = "192.168.2.1";
char separator[2] = "||";
puts( separator );
Output:
||192.168.2.1
But if I change the definition of separator to:
char separator[3] = "||";
I get the desired output:
||
So why did I need to give the man extra space, so he doesn't sleep with the man before him?
That's because you get a string that is not null-terminated when the length of separator is forced to 2.
Always remember to allocate an extra character for the null terminator. For a string of length N you need N+1 characters.
Once you violate this requirement any code that expects null-terminated strings (puts() function included) will run into undefined behavior.
Your best bet is to not force any specific length:
char separator[] = "||";
will allocate an array of exactly the right size.
Strings in C are NUL-terminated. This means that a string of two characters requires three bytes (two for the characters and the third for the zero byte that denotes the end of the string).
In your example it is possible to omit the size of the array and the compiler will allocate the correct amount of storage:
char IP[] = "192.168.2.1";
char separator[] = "||";
Lastly, if you are coding in C++ rather than C, you're better off using std::string.
If you're using C++ anyway, I'd recommend using the std::string class instead of C strings - much easier and less error-prone IMHO, especially for people with a scripting language background.
There is a hidden nul character '\0' at the end of each string. You have to leave space for that.
If you do
char separator[] = "||";
you will get an array of size 3, not size 2.
Because in C strings are nul terminated (their end is marked with a 0 byte). If you declare separator to be an array of two characters, and give them both non-zero values, then there is no terminator! Therefore when you puts the array pretty much anything could be tacked on the end (whatever happens to sit in memory past the end of the array - in this case, it appears that it's the IP array).
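Edit: the following paragraph is incorrect. With char separator[3] = "||" the third element is initialized to '\0' from the string literal, so the terminator is guaranteed.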
When you make the array length 3, the extra byte happens to have 0 in it, which terminates the string. However, you probably can't rely on that behavior - if the value is uninitialized it could really contain anything.
In C strings are ended with a special '\0' character, so your separator literal "||" is actually one character longer. puts function just prints every character until it encounters '\0' - in your case one after the IP string.
In C, strings include a (invisible) null byte at the end. You need to account for that null byte.
char ip[15] = "1.2.3.4";
In the code above, ip has enough space for 15 chars: 14 "regular characters" and the null byte. That's too short for the longest dotted-quad address ("255.255.255.255" is 15 characters), so it should be char ip[16] = "1.2.3.4";
ip[0] == '1';
ip[1] == '.';
/* ... */
ip[6] == '4';
ip[7] == '\0';
Since no one pointed it out so far: If you declare your variable like this, the strings will be automagically null-terminated, and you don't have to mess around with the array sizes:
const char* IP = "192.168.2.1";
const char* separator = "||";
Note however, that I assume you don't intend to change these strings.
But as already mentioned, the safe way in C++ would be using the std::string class.
A C "string" always ends in a NUL character, but there is no room for it in the array if you write
char separator[2] = "||". puts expects this \0 at the end; in the first case it keeps writing until it finds a \0, and here you can see that it is found at the end of the IP address. Interestingly enough, you can even see how the local variables are laid out on the stack.
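The line: char separator[2] = "||"; is the problem, since the literal (which includes the null at the end) needs 3 chars. C accepts it and silently drops the terminator, so passing the array to puts is undefined behaviour; C++ rejects the initialization outright.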
Also, what compiler have you compiled the above code with? I compiled with g++ and it flagged the above line as an error.
Strings in C/C++ are null terminated, i.e. they have a hidden zero at the end.
So your separator string would be:
{'|', '|', '\0'} = "||"