Are list-initialized char arrays still null-terminated? - c++

As I worked through the Lippman C++ Primer (5th ed, C++11), I came across this code:
char ca[] = {'C', '+', '+'}; //not null terminated
cout << strlen(ca) << endl; //disaster: ca isn't null terminated
Calling the library strlen function on ca, which is not null-terminated, results in undefined behavior. Lippman et al say that "the most likely effect of this call is that strlen will keep looking through the memory that follows ca until it encounters a null character."
A later exercise asks what the following code does:
const char ca[] = {'h','e','l','l','o'};
const char *cp = ca;
while (*cp) {
cout << *cp << endl;
++cp;
}
My analysis: ca is a char array that is not null-terminated. cp, a pointer to char, initially holds the address of ca[0]. The condition of the while loop dereferences pointer cp, contextually converts the resulting char value to bool, and executes the loop block only if the conversion results in 'true.' Since any non-null char converts to a bool value of 'true,' the loop block executes, incrementing the pointer by the size of a char. The loop then steps through memory, printing each char until a null character is reached. Since ca is not null-terminated, the loop may continue well past the address of ca[4], interpreting the contents of later memory addresses as chars and writing their values to cout, until it happens to come across a chunk of bits that happen to represent the null character (all 0's). This behavior would be similar to what Lippman et al suggested that strlen(ca) does in the earlier example.
However, when I actually execute the code (again compiling with g++ -std=c++11), the program consistently prints:
'h'
'e'
'l'
'l'
'o'
and terminates. Why?

Most likely explanation: On modern desktop/server operating systems like Windows and Linux, memory is zeroed out before it is mapped into the address space of a program. So as long as the program doesn't use the adjacent memory locations for something else, the array will look like a null-terminated string.
In your case, the adjacent bytes are probably just padding, as most variables are at least 4-byte aligned.
As far as the language is concerned, this is just one possible realization of undefined behavior.

Are list-initialized char arrays still null-terminated?
There is no implicit null-terminator.
A list-initialized char array contains a null-terminated string, if at least one of the characters is initialized with the null-terminator.
If none of the characters are the null-terminator, then the array does not contain a null-terminated string.
the program consistently prints ... and terminates. Why?
You analyzed that the array would be accessed out of bounds. Your analysis is correct. You should also know that accessing an array out of bounds has undefined behaviour. So the answer to why it behaves like this is: because the behaviour is undefined.
As I already mentioned, your analysis is correct. The only flaw is your (implied) assumption that, when the memory is accessed out of bounds, the first value must be non-zero. That assumption is wrong, because nothing guarantees it.

Related

How is char array stored in C++?

int main()
{
char c1[5]="abcde";
char c2[5]={'a','b','c','d','e'};
char *s1 = c1;
char *s2 = c2;
printf("%s",s1);
printf("%s",s2);
return 0;
}
In this code snippet, the char array C2 doesn't return any error but the char array C1 returns string too long. I know that C1 must require a size of 6 to store 5 characters as it stores the \0 (NULL char) in the last index. But I'm confused why C2 works just fine then?
Also, when C2 is printed using %s, the output is abcde# where # is a gibberish character. %s with printf prints all the characters starting from the given address till \0 is encountered. I don't understand why is it printing that extra character at the end?
You've created two unterminated strings. Make your arrays big enough to hold the null terminator and you'll avoid this undefined behaviour:
char c1[6] = "abcde";
char c2[6] = {'a','b','c','d','e','\0'};
Strictly speaking, the latter doesn't actually require the '\0'. This declaration is equivalent and will include the null terminator:
char c2[6] = {'a','b','c','d','e'};
I personally prefer the first form, but with the added convenience of being able to leave out the explicit length:
char c1[] = "abcde";
I know that C1 must require a size of 6 to store 5 characters as it stores the \0 (NULL char) in the last index. But I'm confused why C2 works just fine then?
The compiler does not complain about the initialization of c2 because initializing with {'a','b','c','d','e'} does not implicitly include a terminating null character.
In contrast, initializing with "abcde" does include a null character: The C standard defines a string literal to include a terminating null character, so char c1[5]="abcde"; nominally initializes a 5-element array with 6 values. The C standard does not require a warning or error in this case because C 2018 6.7.9 14 indicates that the null character may be neglected if the array does not have room for it. However, the compiler you are using [1] has chosen to issue a warning message because this form of initialization often indicates an error: The programmer attempted to initialize an array with a string, but there is not room for the full string.
In C, arrays of characters and strings are different things: An array is a sequence of values, and an array of characters can contain any arbitrary values of those characters, including no zero value at the end and possible zero values in the middle. For example, if we have a buffer of bytes from a binary file, the bytes are just integer values to us; their meaning as characters that might be printed is irrelevant. A string is a sequence of characters that is terminated by a null character. It cannot have internal zero values because the first null character marks the end.
So, when you define an array of characters such as char c1[5], the compiler does not automatically know whether you intend to use it to hold strings or you intended to use it as an array of arbitrary values. When you initialize the array with a string, your compiler is essentially figuring you intend to use the array to hold strings, and it warns you if the string you use to initialize the array does not fit. When you initialize the array with a list of values, your compiler essentially figures you may be using it to hold arbitrary values, and it does not warn you that there could be a missing terminator.
Also, when C2 is printed using %s, the output is abcde# where # is a gibberish character.
Because c2 does not have a terminating character, attempting to print it runs off the end of the array, resulting in behavior not defined by the C standard. Commonly, printf continues reading memory beyond the array, printing whatever happens to be there until it reaches a null character.
Footnote
1 This assumes you are indeed using a C compiler to compile this source code. C++ has different rules and does not permit an array being initialized with a string literal to be too short to include the terminating null character.

Is it Safe to strncpy Into a string That Doesn't Have Room for the Null Terminator?

Consider the following code:
const char foo[] = "lorem ipsum"; // foo is an array of 12 characters
const auto length = strlen(foo); // length is 11
string bar(length, '\0'); // bar was constructed with string(11, '\0')
strncpy(data(bar), foo, length);
cout << data(bar) << endl;
My understanding is that strings are always allocated with a hidden null element. If this is the case then bar really allocates 12 characters, with the 12th being a hidden '\0' and this is perfectly safe... If I'm wrong on that then the cout will result in undefined behavior because there isn't a null terminator.
Can someone confirm for me? Is this legal?
There have been a lot of questions about why to use strncpy instead of just using the string(const char*, const size_t) constructor. My intent has been to make my toy code close to my actual code which contains a vsnprintf. Unfortunately even after getting excellent answers here I've found that vsnprintf doesn't behave the same as strncpy, and I've asked a follow up question here: Why is vsnprintf Not Writing the Same Number of Characters as strncpy Would?
This is safe, as long as you copy [0, size()) characters into the string. Per [basic.string]/3:
In all cases, [data(), data() + size()] is a valid range, data() + size() points at an object with value charT() (a “null terminator”), and size() <= capacity() is true.
So string bar(length, '\0') gives you a string with a size() of 11, with an immutable null terminator at the end (for a total of 12 characters in actual size). As long as you do not overwrite that null terminator, or try to write past it, you're okay.
There are two different things here.
First, does strncpy add an additional \0 in this instance (11 non-\0 elements to be copied into a string of size 11)? The answer is no:
Copies at most count characters of the byte string pointed to by src (including the terminating null character) to character array pointed to by dest.
If count is reached before the entire string src was copied, the resulting character array is not null-terminated.
So the call is perfectly fine.
Then data() gives you a proper \0-terminated string:
c_str() and data() perform the same function. (since C++11)
So it seems that for C++11, you are safe. Whether the string allocates an additional \0 or not doesn't seem to be indicated in the documentation, but the API is clear that what you are doing is perfectly fine.
You have allocated an 11-character std::string. You are not trying to read nor write anything past that, so that part will be safe.
So the real question is whether you have messed up the internals of the string. Since you haven't done anything that isn't allowed, how would that be possible? If it's required for the string to internally keep a 12-byte buffer with a null padding at the end in order to fulfill its contract, that will be the case no matter what operations you performed.
Yes, it's safe according to the documentation of char * strncpy(char* destination, const char* source, size_t num):
Copy characters from string
Copies the first num characters of source to destination. If the end of the source C string (which is signaled by a null-character) is found before num characters have been copied, destination is padded with zeros until a total of num characters have been written to it.

Why C++ variable doesn't need defining properly when it's a pointer?

I'm completely new to the C++ language (pointers in particular, experience is mainly in PHP) and would love some explanation to the following (I've tried searching for answers).
How are both lines of code able to do exactly the same job in my program? The second line seems to go against everything I've learnt & understood so far about pointers.
char disk[3] = "D:";
char* disk = "D:";
How am I able to initialize a pointer to anything other than a memory address? Not only that, in the second line I'm not declaring the array properly either - but it's still working?
The usual way to initialize an array in C and C++ is:
int a[3] = { 0, 1, 2 };
Aside: And you can optionally leave out the array bound and have it deduced from the initializer list, or have a larger bound than there are initializers:
int aa[] = { 0, 1, 2 }; // another array of three ints
int aaa[5] = { 0, 1, 2 }; // equivalent to { 0, 1, 2, 0, 0}
For arrays of characters there is a special rule that allows an array to be initialized from a string literal, with each element of the array being initialized from the corresponding character in the string literal.
Your first example uses the string literal "D:" so each element of the array will be initialized to a character from that string, equivalent to:
char disk[3] = { 'D', ':', '\0' };
(The third character is the null terminator, which is implicitly present in all string literals).
Aside: Here too you can optionally leave out the array bound and have it deduced from the string literal, or have a larger bound than the string length:
char dd[] = "D:"; // another array of three chars
char ddd[5] = "D:"; // equivalent to { 'D', ':', '\0', '\0', '\0'}
Just like the aaa example above, the extra elements in ddd that don't have a corresponding character in the string will be zero-initialized.
Your second example works because the string literal "D:" will be output by the compiler and stored somewhere in the executable as an array of three chars. When the executable is run the segment that contains the array (and other constants) will be mapped into the process' address space. So your char* pointer is then initialized to point to the location of that array, wherever that happens to be. Conceptually it's similar to:
const char __some_array_created_by_the_compiler[3] = "D:";
const char* disk = __some_array_created_by_the_compiler;
For historical reasons (mostly that const didn't exist in the early days of C), C and the first C++ standard allow a non-const char* to point to a string literal, even though the array it refers to is really const:
const char __some_array_created_by_the_compiler[3] = "D:";
char* disk = (char*)__some_array_created_by_the_compiler;
This means that despite appearances your two examples are not exactly the same, because this is only allowed for the first one:
disk[0] = 'C';
For the first example that is OK, it alters the first element of the array.
For the second example it might compile, but it results in undefined behaviour, because what it's actually doing is modifying the first element of the __some_array_created_by_the_compiler which is read-only. In practice what will probably happen is that the process will crash, because trying to write to a read-only page of memory will raise a segmentation fault.
It's important to understand that there are lots of things in C++ (and even more in C) which the compiler will happily compile, but which cause Very Bad Things to happen when the code is executed.
char disk[3] = "D:";
Is treated as
char disk[3] = {'D',':','\0'};
Whereas in C++11 and above
char* disk = "D:";
Is an error as a string literal is of type const char[] and cannot be assigned to a char *. You can assign it to a const char * though.
String literals are actually read-only, zero-terminated arrays of characters, and using a string literal gives you a pointer to the first character in the array.
So in the second example
char* disk = "D:";
you initialize disk to point to the first character of an array of three characters.
Note that I said above that string literals are read-only arrays. Having a plain char* point to such an array could make you think that it's okay to modify the array when it's not (attempting to modify a string literal leads to undefined behavior). This is the reason that const char* is usually used:
const char* disk = "D:";
Since C++11 it's actually an error to not use a const char*, though most compilers still only warn about it instead of producing an error.
You are absolutely right to say that pointers can store only memory address. Then how is the second statement valid? Let me explain.
When you put a sequence of characters in double quotes, what happens behind the scenes is that the string gets stored in read-only memory and the address of the location where the string is stored is returned. So at run time, when the expression is evaluated, the string evaluates to that memory address, which is a character pointer. It is this pointer that is assigned to your pointer variable.
So what is the difference between the two statements? The string in the second case is a constant, while the string declared by the first statement can be changed.

storage of character pointer in memory

1a) There is this code
char *p;
p[0]='a';
p[1]='b';
printf("%s",p);
When I run this program on ideone.com (compiler: C++ 4.3.2), it displays "RUNTIME ERROR" every single time I run it.
1b) However, when I edit this code to
char *p;
//allocate memory using malloc
p[0]='a';
p[1]='b';
printf("%s",p);
it correctly runs and prints "ab". Shouldn't it require p[2]='\0' at the end?
2)
char *p;
p="abc";
printf("%s",p);
This correctly runs and prints "abc". Why does this work without allocation?
Can anyone please explain the rules regarding string storage?
1a) Undefined behavior, because you dereference an uninitialized pointer.
1b) Undefined behavior, since you call printf with %s on a string that is not null-terminated.
2) Works OK: there is an allocation, it just is a string literal (you can't modify it, because it is stored in the read-only portion of the program; you should declare the pointer as const char* for that reason).
Note:
In C++, use std::string and std::cout.
In the first example you declare a pointer to char and then assign values to undefined locations in memory. They are undefined because they are whatever the uninitialized pointer p happens to point to. You need to allocate memory for the sequence (with new[] in C++, not malloc). If you do not put a '\0' at the end, printing does not stop at the end of your data; it continues until the first 0 it happens to encounter in memory.
In the third example you declare a pointer to char and initialize it with the address of the string literal "abc". The literal is stored in the (read-only) data section of the executable, which gets mapped into the process address space. So that's a valid pointer and your printing works.
1a) Here you don't allocate memory, so the pointer p points to a random place, which causes a segfault when you write to that random location.
1b) If you allocate memory manually with malloc, it will work correctly. If the allocated memory happens to contain 0s, you don't have to add the terminator manually (but you should, because you can't count on the zero filling).
2) Here you assign the string literal "abc" to the pointer p, so p points to it, and the allocation is done by the compiler.

Strlen returns unreasonable number

If I write:
char lili [3];
cout<<strlen(lili)<<endl;
then what is printed is : 11
but if I write:
char lili [3];
lili [3]='\0';
cout<<strlen(lili)<<endl;
then I get 3.
I don't understand why it returns 11 on the first part?
Isn't strlen supposed to return 3, since I allocated 3 chars for lili?
It is because strlen works with "C-style" null-terminated strings. If you give it a plain pointer or an uninitialised buffer, as you did in your first example, it will keep marching through memory until it either a) finds a \0, at which point it returns the length of that "string", or b) reaches a protected memory location and generates an error.
Given that you've tagged this C++, perhaps you should consider using std::array or better yet, std::string. Both provide length-returning functions (size()) and both have some additional range checking logic that will help prevent your code from wandering into uninitialised memory regions as you're doing here.
The strlen function searches for a byte set to \0. If you run it on an uninitialized array then the behavior is undefined.
You have to initialize your array first. Otherwise there is random data in it.
strlen is looking for a string termination sign and will count until it finds it.
strlen calculates the number of characters till it reaches '\0' (which denotes "end-of-string").
In C and C++, an array such as char[3] is not the same type as char *, but it decays to a pointer to its first element when passed to a function. strlen therefore receives lili as a pointer to char and walks the memory it points to until it reaches a terminating '\0'. It just so happened that there was a 0 byte 11 bytes from the start of your array. You could have got a much stranger result.
In fact, when you write lili[3] = '\0'
you access memory outside your array. The valid indices for 3-element array in C/C++ are 0-2.