What does it mean to be "terminated by a zero"? - c++

I am getting into C/C++ and a lot of terms are popping up unfamiliar to me. One of them is a variable or pointer that is terminated by a zero. What does it mean for a space in memory to be terminated by a zero?

Take the string Hi in ASCII. Its simplest representation in memory is two bytes:
0x48
0x69
But where does that piece of memory end? Unless you're also prepared to pass around the number of bytes in the string, you don't know - pieces of memory don't intrinsically have a length.
So C has a standard that strings end with a zero byte, also known as a NUL character:
0x48
0x69
0x00
The string is now unambiguously two characters long, because there are two characters before the NUL.
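As a rough sketch of how code finds that length, the function below walks the bytes until it meets the zero byte. The name count_until_nul is made up for illustration; in real code the standard strlen() does this job.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bytes that come before the first zero byte.
   count_until_nul is a hypothetical name used only for this sketch. */
size_t count_until_nul(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0x00) {   /* stop at the terminator, do not count it */
        n++;
    }
    return n;
}
```

Given the three bytes 0x48, 0x69, 0x00 from above, count_until_nul returns 2.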

It's a reserved value to indicate the end of a sequence of (for example) characters in a string.
More correctly known as null (or NUL) terminated. This is because the value used is zero, rather than being the character code for '0'. To clarify the distinction check out a table of the ASCII character set.
This is necessary because languages like C have a char data type, but no string data type. Therefore it is left to the developer to decide how to manage strings in their application. The usual way of doing this is to have an array of chars with a null value used to terminate (i.e. signify the end of) the string.
Note that there is a distinction between the length of the string, and the length of the char array that was originally declared.
char name[50];
This declares an array of 50 characters. However, these values will be uninitialised. So if I want to store the string "Hello" (5 characters long) I really don't want to bother setting the remaining 45 characters to spaces (or some other value). Instead I store a NUL value after the last character in my string.
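A small sketch of that distinction, assuming a helper called name_sizes (a hypothetical name) that reports the array capacity and the string length separately:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores "Hello" in a 50-byte array and reports
   the array capacity and the string length through the two pointers. */
void name_sizes(size_t *capacity, size_t *length)
{
    assert(capacity != NULL && length != NULL);
    char name[50];
    strcpy(name, "Hello");     /* writes 5 characters plus the terminating NUL */
    *capacity = sizeof name;   /* size of the declared array: 50 */
    *length = strlen(name);    /* characters before the NUL: 5 */
}
```

The other 44 bytes of the array stay uninitialised, and nothing ever needs to look at them.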
Other languages such as Pascal, Java and C# have a specific string type defined. These have a header value to indicate the number of characters in the string. This has a couple of benefits: firstly, you don't need to walk to the end of the string to find out its length; secondly, your string can contain null characters.
Wikipedia has further information in the String (computer science) entry.

Arrays and strings in C are usually handled through a pointer to a memory location. The pointer tells you where the array starts, but not where it ends. For a character array holding a string, the end is marked by a zero byte.
So, in memory string hello is written as:
68 65 6c 6c 6f 00 |hello|

It refers to how C strings are stored in memory. The NUL character, written \0 in string literals, is present at the end of a C string in memory. There is no other metadata associated with a C string, such as its length. Note the different spelling of the NUL character and the NULL pointer.

There are two common ways to handle arrays that can have varying-length contents (like Strings). The first is to separately keep the length of the data stored in the array. Languages like Fortran and Ada and C++'s std::string do this. The disadvantage to doing this is that you somehow have to pass that extra information to everything that is dealing with your array.
The other way is to reserve an extra non-data element at the end of the array to serve as a sentinel. For the sentinel you use a value that should never appear in the actual data. For strings, 0 (or "NUL") is a good choice, as that is unprintable and serves no other purpose in ASCII. So what C (and many languages copied from C) does is assume that all strings end with (or "are terminated by") a 0.
There are several drawbacks to this. For one thing, it is slow. Any time a routine needs to know the length of the string, it is an O(n) operation (searching through the entire string looking for the 0). Another problem is that you may one day want to put a 0 in your string for some reason, so now you need a whole second set of routines that ignore the null and use a separate length anyway (e.g. memcpy() in place of strcpy()). The third big problem is that if someone forgets to put that 0 at the end (or it gets wiped out somehow), the next string operation to do a length check will go merrily marching through memory until it either happens to randomly find another 0, crashes, or the user loses patience and kills it. Such bugs can be a serious PITA to track down.
For all these reasons, the C approach is generally viewed with disfavor.
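One common defence against the missing-terminator problem is to bound the search by the size of the buffer. The following is a minimal sketch, assuming a hypothetical helper named bounded_length; it relies only on the standard memchr().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the string length if a NUL is found within
   the first `capacity` bytes, or `capacity` if no terminator is present. */
size_t bounded_length(const char *buf, size_t capacity)
{
    assert(buf != NULL);
    const char *end = memchr(buf, '\0', capacity);  /* never reads past capacity */
    return end ? (size_t)(end - buf) : capacity;
}
```

The search is still O(n), but it can no longer wander past the end of the array.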

C-style strings are terminated by a NUL character ('\0'). This provides a marker for functions that operate on strings (e.g. strlen, strcpy) to use to identify the end of the string.

While the classic example of "terminated by a zero" is that of strings in C, the concept is more general. It can be applied to any list of things stored in an array, the size of which is not known explicitly.
The trick is simply to avoid passing around an array size by appending a sentinel value to the end of the array. Typically, some form of a zero is used, but it can be anything else (like a NaN if the array contains floating point values).
Here are three examples of this concept:
C strings, of course. A single zero character is appended to the string: "Hello" is encoded as 48 65 6c 6c 6f 00.
Arrays of pointers naturally allow zero termination, because the null pointer (the one that points to address zero) is defined to never point to a valid object. As such, you might find code like this:
Foo *list[] = { somePointer, anotherPointer, NULL };
bar(list);
instead of
Foo *list[] = { somePointer, anotherPointer };
bar(sizeof(list)/sizeof(*list), list);
This is why the execvpe() only needs three arguments, two of which pass arrays of user defined length. Since all that's passed to execvpe() are (possibly lots of) strings, this little function actually sports two levels of zero termination: null pointers terminating the string lists, and null characters terminating the strings themselves.
Even when the element type of the array is a more complex struct, it may still be zero terminated. In many cases, one of the struct members is defined to be the one that signals the end of the list. I have seen such function definitions, but I can't unearth a good example of this right now, sorry. Anyway, the calling code would look something like this:
Foo list[] = {
{ someValue, somePointer },
{ anotherValue, anotherPointer },
{ 0, NULL }
};
bar(list);
or even
Foo list[] = {
{ someValue, somePointer },
{ anotherValue, anotherPointer },
{} // An empty initializer list zero-initializes the element (valid in C++ and C23; older C needs {0}).
};
bar(list);
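The receiving side of the last two examples could look roughly like this. It is only a sketch: struct option, count_strings and sum_values are hypothetical names, and the choice of the name member as the end marker is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element type: an entry whose name is NULL ends the list. */
struct option {
    const char *name;
    int value;
};

/* Count the strings in a NULL-terminated array of pointers. */
size_t count_strings(const char *const *list)
{
    assert(list != NULL);
    size_t n = 0;
    while (list[n] != NULL) {
        n++;
    }
    return n;
}

/* Sum the values of all entries that come before the sentinel entry. */
int sum_values(const struct option *opts)
{
    assert(opts != NULL);
    int total = 0;
    for (; opts->name != NULL; opts++) {
        total += opts->value;
    }
    return total;
}
```

Neither function takes a size argument; both trust the caller to have appended the sentinel.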

Related

Why doesn't strlen() count the byte of the terminating NUL-character, when the NUL-character is defined to be part of a string?

I know that strlen() does not count the terminating NUL character. I really know that this is a fact. Thus, this question is NOT asking why strlen() might "presumably" not return the right string length, which has already been asked and answered well here on StackOverflow, e.g. in this thread, or this one.
So lets go ahead to my question:
In ISO/IEC 9899:1990 (E); 7.1.1., is stated:
A string is a contiguous sequence of characters terminated by and including the first null character.
What is the reason why strlen() deviates from this standard definition and does not "want" to count a string's terminating NUL character?
Why?
Because you would expect this pseudocode's assertion to hold true:
str1 = "foo"
str2 = "bar"
str3 = concatenate(str1, str2)
Assert strlen(str1) + strlen(str2) == strlen(str3)
If terminating '\0' was counted by strlen, above assertion would not hold, which would be much more of overall headache, than what the current C string behavior is. More importantly, it would in my opinion be quite unintuitive and illogical.
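The pseudocode above can be sketched in C as follows. The helper concat_length and its parameters are hypothetical; it copies both strings into a caller-supplied buffer and returns the length of the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: concatenates a and b into out (capacity out_size)
   and returns strlen of the result, or (size_t)-1 if it does not fit. */
size_t concat_length(const char *a, const char *b, char *out, size_t out_size)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t la = strlen(a);
    size_t lb = strlen(b);
    if (la + lb + 1 > out_size) {    /* one terminator for the result, not two */
        return (size_t)-1;
    }
    memcpy(out, a, la);              /* a's characters, without its terminator */
    memcpy(out + la, b, lb + 1);     /* b's characters plus its terminator */
    return strlen(out);
}
```

For "foo" and "bar" the result is 6, the sum of the two lengths, while the buffer needs 7 bytes.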
Taking your doubt as a reasonable point we can state that: The C-string consists of two parts:
the string's useful content ("the text");
the null terminating character;
The null terminating character is purely a technical measure for determination of the end of the string by the C-originated library functions. Still, if one types a declaration:
char * str = "some string";
they logically would rather expect its length to be 11 which is as many as they can see in this statement. Hence the strlen() value yields only the length of the part 1. of the string.
Not really an answer to your question, but consider this example:
char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));
This prints
sizeof: 7
strlen: 6
So sizeof counts the \0, but strlen doesn't.
Questions like this, that ask why a certain age-old decision was made one way and not another way, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.
There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.
The physical representation, how the string is actually stored in memory or other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional piece of storage. In order to be a C style string the null character must be stored.
However the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer is wanting to manipulate.
I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined the character value of zero as the NUL character. Sitting among the low-valued teletype control codes, it seems to be the ASCII character least likely to appear in text. See ASCII Character Codes.
Another nice quality of using a binary zero as the string terminator is that zero is also the value that represents logical false. Iterating over a string is therefore often just a matter of incrementing an array index or a pointer while the current character is logically true, since every character other than the end-of-string indicator has a non-zero value.
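A sketch of that idiom, using a hand-written copy loop. The name copy_string is hypothetical; the standard strcpy() behaves the same way for valid, non-overlapping arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for strcpy: copies src, including its NUL, into dst.
   The caller must make sure dst has room for strlen(src) + 1 bytes. */
char *copy_string(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char *start = dst;
    /* The value of the assignment is the character just copied, so the
       loop ends right after the NUL (value 0, i.e. false) has been copied. */
    while ((*dst++ = *src++)) {
    }
    return start;
}
```

Note that the terminator is copied even though it is never counted as part of the text.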
Due to how close to the hardware that the C programming language is, the programmer needs to be concerned about both representations, the physical representation when allocating memory to store a string which includes the null character and the logical representation which is the string without the null character.
The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They perform their actions by using the null character as not being part of the text but rather as a special indicator character which indicates the end of the string. However as a part of their operations they need to be aware of the null character and its use as a special symbol. For instance when strcpy() or strcat() are used to copy strings, they must also copy the null character that indicates the end of the string even though it is not part of the actual text of the logical representation.
This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.
C++ is able to provide the std::string because of being object oriented and having the additional facilities of the language that allows for objects to be created and managed. The C programming language, due to its simple syntax and lack of object oriented facilities does not have this convenience.
The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.

Confusion about the necessity of the null-character?

I am reading about why exactly there is a need for null-characters, and then I found this answer, which made some sense to me. It states that it is needed because char arrays (for the C strings) are often allocated much larger than the actual strings, and you thereby need a way to symbolize the end.
But why aren't these array not just constructed with a size deduction based on the initializer (without the null-character that actually is implicitly added when assigning directly to string literals). Like, if the arrays holding the strings are constructed using size deduction, there would not be a need for the null-character because the array was not any bigger than the string, so of course, it would end at the end of that array.
I am reading about why exactly there is a need for null-characters, and then I found this answer, which made some sense to me. It states that it is needed because char arrays (for the C strings) are often allocated much larger than the actual strings, and you thereby need a way to symbolize the end.
The answer is misleading. That's not really the reason for why null termination is needed. The accepted answer with more upvotes is better.
there would not be a need for the null-character because the array was not any bigger than the string, so of course, it would end at the end of that array.
Let us remind ourselves, that we cannot use arrays as function arguments. Even if we could, we wouldn't want to, because it would be slow to copy an entire array into the argument.
Therefore, there is a need to refer to an array indirectly. Indirection is commonly achieved using pointers (or references). Now, we could have a "pointer to character array of size 42", but that is not very useful because then the argument can only point to strings of one particular size.
Instead, the common approach is to use a pointer to the first element of the array. This is such a common pattern that the language has a rule that allows the name of the array to implicitly decay into a pointer to its first element.
But can you tell how big an array is, based on a pointer to an element of that array? You cannot. You need extra information. The accepted answer of the linked question explains the options that are available for representing the size, and that the designer of C chose the option that uses a terminating character (which was already the convention used by the BCPL language which C is based on).
TL;DR Size information is needed because there is a need to refer to the string indirectly, and that indirection hides the knowledge about the size of the array. Null termination is one way to encode the size information within the content of the string, and it is the way that was chosen by the designer of the C language.
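To make the loss of size information concrete, here is a small sketch with two hypothetical functions. One sees the array itself, the other only a pointer to its first element.

```c
#include <assert.h>
#include <stddef.h>

/* Inside the callee, s is only a pointer: sizeof reports the size of the
   pointer, not the size of the array it points into. */
size_t size_seen_by_callee(const char *s)
{
    assert(s != NULL);
    return sizeof s;
}

/* In the scope where the array is declared, sizeof still knows its size. */
size_t size_seen_by_owner(void)
{
    char text[42] = "short";
    /* text decays to a pointer to its first element in this call */
    assert(size_seen_by_callee(text) == sizeof(char *));
    return sizeof text;   /* the full array: 42 bytes */
}
```

The callee has no way to recover the 42, which is why it needs either a length argument or a terminator.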
Historically, string arrays are provided with termination symbol(s). The reason is simple: instead of passing two values (the head of the array and the array length) you pass just one, the head of the array. This simplifies the calling signature but places some requirements on the caller.
In C/C++ itself, the null character is the termination symbol, so all runtime functions assume that the very first null char they meet is the end of the string. At the same time, in application logic, the terminating symbol(s) may be different: for example, in HTTP headers a CR-LF-CR-LF sequence marks the end of the headers, while a single CR-LF sequence just ends the current line.
But why aren't these array not just constructed with a size deduction
based on the initializer (without the null-character that actually is
implicitly added when assigning directly to string literals).
I suppose you mean why you can't write:
char t[] = "abracadabra";
and the compiler would deduce a size of 11?
Because you have 12 characters and not 11. If the array had size 11, something would be lost: the byte used to contain the NUL would not be accounted for, and the compiler couldn't tell the difference between:
char t[] = "abracadabra"; // an array deduced from a C-string literal
and
char t[11] = { 'a', 'b', 'r', 'a', 'c', 'a', 'd', 'a', 'b', 'r', 'a' }; // a "real" array not a C-string!
The first would have to release 12 bytes at the end of scope and the second 11.
Historically, arrays are just a kind of syntactic sugar on top of pointer arithmetic.
... because that char arrays ... are often allocated much larger than the actual strings
That answer is awful.
C strings can be dynamically allocated, meaning you don't know, before runtime, how long they should be. Instead of pre-allocating a massive array and filling most of it with zeroes, you can just malloc(required_size+1) and stick a single nul character at the end.
Conversely, string literals, which are known at compile time, are definitely not "allocated much larger than the actual strings". There wouldn't be any point, since you know exactly how much space is needed in advance.
But why aren't these array not just constructed with a size deduction based on the initializer
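size_t expected;
if (read(fd, &expected, sizeof(expected)) == (ssize_t)sizeof(expected)) {
    char *buf = malloc(expected + 1);   /* +1 leaves room for the terminator */
    if (buf && read(fd, buf, expected) == (ssize_t)expected) {
        buf[expected] = 0;
        /* now do something with buf */
    }
    free(buf);   /* free(NULL) is a no-op, so this is safe on either path */
}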
There you go, a dynamically-sized string. What would your "size deduction" be? What is the "initializer"?
I could have written a less-ugly example using std::string, since the question is tagged C++, but it's actually C strings you're specifically asking about, and it doesn't make any real difference.
Strings are often manipulated by creating a char array to hold intermediate results and modifying its contents:
char buffer[128];
strcpy(buffer, "Hello, ");
strcat(buffer, "world");
std::cout << buffer << '\n';
After the call to strcpy the buffer has 7 characters that we care about; after the call to strcat it has 12. So the number of characters in the buffer can change, and we need to have a way of indicating how many characters there are that matter. One convention is to put a character count in the first location in the array, and the actual characters after that. Another convention is to put a marker at the end of the characters that matter. There are tradeoffs here, but the decision in C, which was carried through into C++ was to go with an end marker.

Comparing a char variable to empty char does not work

Say x is a character.
Whenever I do if(x <> '') to know whether the variable is empty or not, it just does not work.
However, when I attempt to do this if(x <> chr(0)), it does work.
I have tried the same thing on two versions of the compiler : Free Pascal and Charm Pascal, but I am still facing the same problem.
There is no such thing as an "empty char". The Char type is always a single character.
That character could be 1 byte AnsiChar representing a value from 0..255. (In Delphi and fpc, it could also be a 2 byte WideChar representing a value from 0..65535.) Either way it is always represented as '<something>'. That "something" must be a character value.
When you compare x <> Chr(0) you are taking the byte value of 0 and converting it to a Char so a valid comparison can be performed.
Side Notes
For Char to reliably have the concept "no value" requires storing additional information. E.g. Databases may have a hidden internal bit field indicating the value is NULL. It's important to be aware that this is fundamentally different from any of the valid values it may have if it's not NULL. Libraries that interact with databases need to provide a way to determine if a value is NULL.
You haven't provided any information about the actual problem you're trying to solve but here are some thoughts that may yield progress:
If you're dealing with user input, it may be more appropriate to compare with a space character ' '.
If you're dealing with characters read from a file, you should probably be checking number of bytes/characters actually read.
If you're trying to determine the end of a string it's much more reliable to use the Length() of the string.
(Though there are some environments that use the convention of treating Char(0) as a special character meaning "end-of-string".) But the convention requires allocating an extra character making the string internally longer than its text length. So the technique is not usable if the environment doesn't support it.
Most importantly, from comments it seems you might be struggling with the difference between empty-string and how that's represented as a Char. And the point is that it isn't. You need to check the length of the string.
E.g. You can do the following:
if (s <> '') then
begin
  { You now know there is at least 1 character in the string so
    you can safely read it and not worry about "if it has a value". }
  x := s[1];
  ...
end;

What are the problems of a zero-terminated string that length-prefixed strings overcome?

I was reading the book Write Great Code vol. 1 and I had that question in mind.
One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:
char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");
On every call to strcat the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.
With a length-prefixed string the equivalent of the strcat function would know where the end is immediately, and would just update the length after appending to it.
There are pros and cons to each way of representing strings and whether they cause problems for you depend on what you are doing with strings, and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.
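Manually tracking the end might look like the sketch below. The helper append_at is hypothetical and, like strcat, does no bounds checking, so the caller has to guarantee that the buffer is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src at `end`, which must point at the
   current terminator, and returns a pointer to the new terminator.
   No bounds checking: the caller guarantees there is enough room. */
char *append_at(char *end, const char *src)
{
    assert(end != NULL && src != NULL);
    size_t n = strlen(src);
    memcpy(end, src, n + 1);   /* n characters plus the terminator */
    return end + n;
}
```

Starting with end = buf + strlen(buf) and then writing end = append_at(end, "second"); end = append_at(end, "third"); each call only scans the piece being appended, never the text already in the buffer.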
One problem is that you can not store null characters (value zero) in a zero terminated string. This makes it impossible to store some character encodings as well as encrypted data.
Length-prefixed strings do not suffer that limitation.
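As an illustration only, a length-prefixed string could be modelled in C like this. The struct lp_string, the fixed capacity of 16 and the function lp_from_bytes are all assumptions made for the sketch, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LP_CAPACITY = 16 };   /* arbitrary capacity chosen for the sketch */

/* Hypothetical length-prefixed string: the length is stored with the bytes. */
struct lp_string {
    size_t len;
    char data[LP_CAPACITY];
};

/* Build an lp_string from raw bytes, which may include zero bytes. */
struct lp_string lp_from_bytes(const char *bytes, size_t len)
{
    assert(bytes != NULL && len <= (size_t)LP_CAPACITY);
    struct lp_string s;
    s.len = len;
    memset(s.data, 0, sizeof s.data);
    memcpy(s.data, bytes, len);
    return s;
}
```

Built from the three bytes 'a', 0, 'b', the struct reports a length of 3, whereas strlen() on the same data stops at the embedded zero and reports 1.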
First, a clarification: C++ strings (i.e. std::string) weren't required to end with zero until C++11. They always provided access to a zero-terminated C string though.
C-style strings end with a 0 character for historical reasons.
The problems you're referring to are mainly bound to security issues: zero-ended strings need to have a zero terminator. If they lack it (for whatever reason), the string's length becomes unreliable and they can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be). DEP helps in mitigating these issues, but it's off-topic here.
It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.
Performance Costs: It is cheaper to manipulate memory in chunks, which cannot be done if you always have to look for the NUL character. In other words, if you know beforehand that you have a 129-character string, it would likely be more efficient to manipulate it in sections of 64, 64, and 1 bytes, instead of character by character.
Security: Marco A. already hit this pretty hard. Over and under-running string buffers is still a major route for attacks by hackers.
Compiler Development Costs: Big costs are associated with optimizing compilers for null terminating strings that would have been easier with the address and length format.
Hardware Development Costs: Hardware development costs are also large for string specific instructions associated with null terminating strings.
A few more bonus features that can be implemented with length-prefixed strings:
It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers.
One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:
An N-byte string
(Optional) An N-byte buffer holding a zero-length string
An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if B is 248 or more, the preceding B-247 bytes would store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's less than that the byte following the string will be unused and could be set to NUL.
Using such an approach, one would need to initialize string buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.
One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.
One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".
I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.
What are the problems of a zero-terminated string that length-prefixed strings overcome?
None whatsoever.
It's just eye candy.
Length-prefixed strings have, as part of their structure, information on how long the string is. If you want to do the same with zero-terminated strings you can use a helper variable;
lpstring = "foobar"; // saves '6' somewhere "inside" lpstring
ztstring = "foobar";
ztlength = 6; // saves '6' in a helper variable
Lots of C library functions work with zero-terminated strings and cannot use anything past the '\0' byte. That's an issue with the functions themselves, not the string structure. If you need functions which deal with zero-terminated strings with embedded zeroes, write your own.

how to make a not null-terminated c string?

I am wondering: given char *cs = .....; what will happen to strlen() and printf("%s",cs) if cs points to a huge memory block with no '\0' in it?
I wrote these lines:
char s2[3] = {'a','a','a'};
printf("str is %s,length is %zu",s2,strlen(s2));
I get the result "aaa" and "3", but I think that is only because a '\0' (a 0 byte) happens to reside at location s2+3.
How do I make a C string that is not null-terminated? strlen and the other C string functions rely heavily on the '\0' byte. What happens if there is no '\0'? I just want to understand this rule more deeply.
PS: my curiosity was aroused by studying the following post on SO.
How to convert a const char * to std::string
and these word in that post :
"This is actually trickier than it looks, because you can't call strlen unless the string is actually nul terminated."
If it's not null-terminated, then it's not a C string, and you can't use functions like strlen - they will march off the end of the array, causing undefined behaviour. You'll need to keep track of the length some other way.
You can still print a non-terminated character array with printf, as long as you give the length:
printf("str is %.3s",s2);
printf("str is %.*s",s2_length,s2);
or, if you have access to the array itself, not a pointer:
printf("str is %.*s", (int)(sizeof s2), s2);
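The same precision trick works with snprintf, which makes the result easy to inspect. In this sketch the helper name format_unterminated and its parameters are hypothetical; the only library call is the standard snprintf().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats the first `len` bytes of `data` into `out`.
   `data` does not need a terminator because the precision limits how many
   bytes %s may read. Returns the number of characters written, as snprintf does. */
int format_unterminated(char *out, size_t out_size, const char *data, int len)
{
    assert(out != NULL && data != NULL && len >= 0);
    return snprintf(out, out_size, "str is %.*s", len, data);
}
```

Note that the precision argument for %.*s must be an int, hence the cast from sizeof in the printf call above.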
You've also tagged the question C++: in that language, you usually want to avoid all this error-prone malarkey and use std::string instead.
A "C string" is, by definition, null-terminated. The name comes from the C convention of having null-terminated strings. If you want something else, it's not a C string.
So if you have a string that is not null-terminated, you cannot use the C string manipulation routines on it. You can't use strlen, strcpy or strcat. Basically, any function that takes a char* but no separate length is not usable.
Then what can you do? If you have a string that is not null-terminated, you will have the length separately. (If you don't, you're screwed. You need some way to find the length, either by a terminator or by storing it separately.) What you can do is allocate a buffer of the appropriate size, copy the string over, and append a null. Or you can write your own set of string manipulation functions that work with pointer and length. In C++ you can use std::string's constructor that takes a char* and a length; that one doesn't need the terminator.
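The "allocate, copy, append a null" option could be sketched like this. The name make_c_string is hypothetical, and the caller is assumed to free the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies `len` bytes from `data` into a freshly
   allocated buffer and appends the terminating NUL. Returns NULL if the
   allocation fails; otherwise the caller must free() the result. */
char *make_c_string(const char *data, size_t len)
{
    assert(data != NULL);
    char *s = malloc(len + 1);   /* +1 for the terminator */
    if (s == NULL) {
        return NULL;
    }
    memcpy(s, data, len);
    s[len] = '\0';
    return s;
}
```

After this call the result is a proper C string, so strlen, strcpy and friends are safe to use on it.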
Your supposition is correct: your strlen is returning the correct value out of sheer luck, because there happens to be a zero on the stack right after your improperly terminated string. It probably helps that the string is 3 bytes, and the compiler is likely aligning stuff on the stack to 4-byte boundaries.
You cannot depend on this. C strings need NUL characters (zeroes) at the end to work correctly. C string handling is messy, and error-prone; there are libraries and APIs that help make it less so… but it's still easy to screw up. :)
In this particular case, your string could be initialized as one of these:
A: char s2[4] = { 'a','a','a', 0 }; // good if string MUST be 3 chars long
B: const char *s2 = "aaa"; // if you don't need to modify the string after creation
C: char s2[]="aaa"; // if you DO need to modify the string afterwards
Also note that declarations B and C are 'safer' in the sense that if someone comes along later and changes the string declaration in a way that alters the length, B and C are still correct automatically, whereas A depends on the programmer remembering to change the array size and keeping the explicit null terminator at the end.
What happens is that strlen keeps going, reading memory values until it eventually gets to a null. It then assumes that is the terminator and returns a length that could be massively large. If you're using strlen in an environment that expects C-strings to be used, you could then copy this huge buffer of data into another one that is just not big enough, causing buffer overrun problems, or at best, you could copy a large amount of garbage data into your buffer.
Copying a non-null terminated C string into a std::string will do this. If you then decide that you know this string is only 3 characters long and discard the rest, you will still have a massively long std::string that contains the first 3 good characters and then a load of wastage. That's inefficient.
The moral is, if you're using the CRT functions to operate on C strings, they must be null-terminated. It's no different from any other API: you must follow the rules that API sets down for correct usage.
Of course, there is no reason you cannot use the CRT functions if you always use the specific-length versions (eg strncpy) but you will have to limit yourself to just those, always, and manually keep track of the correct lengths.
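One thing to watch with strncpy in particular: it does not write a terminator when the source is at least as long as the limit. A minimal sketch of the usual workaround, with copy_truncated as a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, truncating if necessary,
   and always leaves dst null-terminated. dst_size must be at least 1. */
void copy_truncated(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1);   /* writes no NUL if src is too long */
    dst[dst_size - 1] = '\0';          /* so terminate explicitly */
}
```

With a 4-byte destination and the source "abcdef", the destination ends up holding "abc".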
Convention states that a char array with a terminating \0 is a null terminated string. This means that all str*() functions expect to find a null-terminator at the end of the char-array. But that's it, it's convention only.
By convention also strings should contain printable characters.
If you create an array like you did char arr[3] = {'a', 'a', 'a'}; you have created a char array. Since it is not terminated by a \0 it is not called a string in C, although its contents can be printed to stdout.
The C standard does not define the term string until the section 7 - Library functions. The definition in C11 7.1.1p1 reads:
A string is a contiguous sequence of characters terminated by and including the first null character.
(emphasis mine)
If the definition of string is a sequence of characters terminated by a null character, a sequence of non-null characters not terminated by a null is not a string, period.
What you have done is undefined behavior.
You are trying to read from a memory location that is not yours.
Change it to
char s2[] = {'a','a','a','\0'};