Null-terminate string: Use '\0' or just 0? - c++

If I need to null-terminate a string, should I use '\0', or is a plain 0 enough?
Is there any difference between using
char a[5];
a[0] = 0;
and
char a[5];
a[0] = '\0';
Or is \0 just preferred to make it clear that I'm null-terminating here, but for the compiler it is the same?

'\0' is an octal escape sequence denoting the character with the value 0, so there is no difference in value between the two.
Side note: if you are dealing with strings, then you should use a std::string.

Use '\0' or just 0?
There is no difference in value.
In C, there is no difference in type: both are int.
In C++ they have different types: char and int. So the edge goes to '\0' as there is no type conversion involved.
Different style guides promote one over the other. '\0' for clarity. 0 for lack of clutter.
Right answer: Use the style based on your group's coding standards/guidelines. If your group does not have such a guideline, make one. Better to use one than have divergent styles.
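The type difference described above can be checked at compile time. This is a minimal sketch for C++11 or later; the helper name both_spellings_terminate is made up for illustration.

```cpp
#include <type_traits>

// In C++, a character literal has type char, while 0 is an int literal.
static_assert(std::is_same<decltype('\0'), char>::value, "'\\0' is a char in C++");
static_assert(std::is_same<decltype(0), int>::value, "0 is an int");

// Both have the value zero, so either one terminates a string.
static_assert('\0' == 0, "same value");

// Hypothetical helper: writes a terminator at index 0 using each spelling.
inline bool both_spellings_terminate() {
    char a[5];
    char b[5];
    a[0] = 0;     // int 0 is converted to char
    b[0] = '\0';  // already a char, no conversion
    return a[0] == b[0];
}
```

Calling both_spellings_terminate() returns true: the two arrays hold the same empty string.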

'\0' has exactly the same value as 0; only the type differs. '\0' is just a representation as a char literal. A char can be initialized from a plain int literal, though.
So it's actually impossible to tell what's better, just keep in mind to use it consistently in your code.

Both will generate the same machine code, as 0 will be converted to the character value 0, and '\0' is just another way of writing the character value 0. The latter is clearly a character, so it will show that you didn't accidentally mean to write '0' instead, but other than that, it's exactly the same thing in the end.
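The '0' versus '\0' slip mentioned above is easy to demonstrate. The sketch below uses a hypothetical helper, length_after_cut, that overwrites the second character of "abc" and then measures the string.

```cpp
#include <cstring>

// Hypothetical helper: length of "abc" after writing `terminator` at index 1.
inline std::size_t length_after_cut(char terminator) {
    char buf[] = "abc";    // 4 bytes: 'a', 'b', 'c', '\0'
    buf[1] = terminator;   // either cuts the string or just replaces 'b'
    return std::strlen(buf);
}
```

length_after_cut('\0') and length_after_cut(0) both give 1, because the string now ends after 'a'. length_after_cut('0') gives 3, because the digit character '0' is ordinary text and the buffer reads "a0c".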

It's the same thing. Look at the ASCII table.
From my point of view it's better to use '\0', because you explicitly say that this is the end of the string.
That helps when you read your code (like using NULL for a pointer instead of 0).

Related

Why doesn't strlen() count the byte of the terminating NUL-character, when the NUL-character is defined to be part of a string?

I know that strlen() does not count the terminating NUL character. I really know that this is a fact. Thus, this question is NOT asking why strlen() might "presumably" not return the right string length, which has already been asked and answered well here on Stack Overflow, e.g. in this thread, or this one.
So let's go ahead to my question:
In ISO/IEC 9899:1990 (E); 7.1.1., is stated:
A string is a contiguous sequence of characters terminated by and including the first null character.
Why does strlen() deviate from this standard definition and not count the terminating NUL character as part of the string?
Why?
Because you would expect this pseudocode's assertion to hold true:
str1 = "foo"
str2 = "bar"
str3 = concatenate(str1, str2)
Assert strlen(str1) + strlen(str2) == strlen(str3)
If the terminating '\0' were counted by strlen, the above assertion would not hold, which would be far more of a headache than the current C string behavior. More importantly, it would in my opinion be quite unintuitive and illogical.
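The pseudocode above can be turned into a small C++ check. This is only a sketch: lengths_add_up is a made-up name, and std::string is used purely as a convenient way to concatenate.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: true when the lengths of the parts sum to the
// length of the concatenation, as the pseudocode expects.
inline bool lengths_add_up(const char* a, const char* b) {
    std::string joined = std::string(a) + b;  // one terminator at the end, not two
    return std::strlen(a) + std::strlen(b) == std::strlen(joined.c_str());
}
```

lengths_add_up("foo", "bar") holds because 3 + 3 == 6. If strlen counted the terminator, the left side would be 4 + 4 = 8 while the right side would be 7.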
Taking your doubt as a reasonable point, we can state that the C string consists of two parts:
1. the string's useful content ("the text");
2. the null terminating character.
The null terminating character is purely a technical measure for determination of the end of the string by the C-originated library functions. Still, if one types a declaration:
char * str = "some string";
they would logically expect its length to be 11, which is the number of characters they can see in this statement. Hence strlen() yields only the length of part 1 of the string.
Not really an answer to your question, but consider this example:
char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));
This prints
sizeof: 7
strlen: 6
So sizeof counts the \0, but strlen doesn't.
Questions like this, that ask why a certain age-old decision was made one way and not another way, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.
There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.
The physical representation, that is, how the string is actually stored in memory or on other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional piece of storage. In order to be a C style string, the null character must be stored.
However the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer is wanting to manipulate.
I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined the character value zero as the NUL character. As one of the lowest values among the various teletype control codes, it seems to be the ASCII character least likely to appear in text. See ASCII Character Codes.
Another nice quality of using a binary zero as the string terminator is that it is the value that represents logical false. Iterating over a string is therefore often a matter of incrementing an array index or a pointer while the current character is logically true, since all characters other than the end-of-string indicator have a non-zero, or logically true, value.
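The "iterate while true" idiom described above looks roughly like this. my_strlen is a hypothetical re-implementation for illustration only; real code should call the library strlen.

```cpp
#include <cstddef>

// Hypothetical re-implementation of strlen, for illustration only.
inline std::size_t my_strlen(const char* s) {
    const char* p = s;
    while (*p) {  // '\0' converts to false, every other character to true
        ++p;
    }
    return static_cast<std::size_t>(p - s);  // terminator is not counted
}
```

my_strlen("string") walks six characters and stops at the terminator without counting it.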
Because the C programming language is so close to the hardware, the programmer needs to be concerned with both representations: the physical representation, which includes the null character, when allocating memory to store a string, and the logical representation, which is the string without the null character.
The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They perform their actions by using the null character as not being part of the text but rather as a special indicator character which indicates the end of the string. However as a part of their operations they need to be aware of the null character and its use as a special symbol. For instance when strcpy() or strcat() are used to copy strings, they must also copy the null character that indicates the end of the string even though it is not part of the actual text of the logical representation.
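A short sketch of how the two representations meet when copying: the buffer is sized for the physical representation (logical length plus one), and strcpy copies the terminator along with the text. The helper name copy_with_terminator is an assumption made for this example.

```cpp
#include <cstring>
#include <vector>

// Hypothetical helper: duplicates a C string into a buffer sized for storage.
inline std::vector<char> copy_with_terminator(const char* src) {
    std::vector<char> dst(std::strlen(src) + 1);  // physical size = logical length + 1
    std::strcpy(dst.data(), src);                 // copies the text and the '\0'
    return dst;
}
```

For "abc" the returned buffer has four elements, the last of which is '\0', while strlen on it still reports 3.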
This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.
C++ is able to provide the std::string because of being object oriented and having the additional facilities of the language that allows for objects to be created and managed. The C programming language, due to its simple syntax and lack of object oriented facilities does not have this convenience.
The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.

how to make a not null-terminated c string?

I am wondering: given char *cs = .....;, what will happen to strlen() and printf("%s",cs) if cs points to a memory block that is huge but has no '\0' in it?
I wrote these lines:
char s2[3] = {'a','a','a'};
printf("str is %s,length is %zu",s2,strlen(s2));
I get the result "aaa" and "3", but I think that is only because a '\0' (a 0 byte) happens to reside at location s2+3.
How do I make a non-null-terminated C string? strlen and the other C string functions rely heavily on the '\0' byte; what if there is no '\0'? I just want to understand this rule more deeply.
PS: my curiosity was aroused by studying the following post on SO.
How to convert a const char * to std::string
and these word in that post :
"This is actually trickier than it looks, because you can't call strlen unless the string is actually nul terminated."
If it's not null-terminated, then it's not a C string, and you can't use functions like strlen - they will march off the end of the array, causing undefined behaviour. You'll need to keep track of the length some other way.
You can still print a non-terminated character array with printf, as long as you give the length:
printf("str is %.3s",s2);
printf("str is %.*s",s2_length,s2);
or, if you have access to the array itself, not a pointer:
printf("str is %.*s", (int)(sizeof s2), s2);
You've also tagged the question C++: in that language, you usually want to avoid all this error-prone malarkey and use std::string instead.
A "C string" is, by definition, null-terminated. The name comes from the C convention of having null-terminated strings. If you want something else, it's not a C string.
So if you have a string that is not null-terminated, you cannot use the C string manipulation routines on it. You can't use strlen, strcpy or strcat. Basically, any function that takes a char* but no separate length is not usable.
Then what can you do? If you have a string that is not null-terminated, you will have the length separately. (If you don't, you're screwed. You need some way to find the length, either by a terminator or by storing it separately.) What you can do is allocate a buffer of the appropriate size, copy the string over, and append a null. Or you can write your own set of string manipulation functions that work with pointer and length. In C++ you can use std::string's constructor that takes a char* and a length; that one doesn't need the terminator.
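As a sketch of the last option, std::string has a constructor taking a pointer and a count, which never looks for a terminator. The wrapper name from_unterminated is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper around the (pointer, count) constructor of std::string.
// It reads exactly `length` characters and does not look for a '\0'.
inline std::string from_unterminated(const char* data, std::size_t length) {
    return std::string(data, length);
}
```

Given char s2[3] = {'a','a','a'};, the call from_unterminated(s2, sizeof s2) yields a three-character string, and its c_str() is properly terminated because std::string adds the terminator itself.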
Your supposition is correct: your strlen is returning the correct value out of sheer luck, because there happens to be a zero on the stack right after your improperly terminated string. It probably helps that the string is 3 bytes, and the compiler is likely aligning stuff on the stack to 4-byte boundaries.
You cannot depend on this. C strings need NUL characters (zeroes) at the end to work correctly. C string handling is messy, and error-prone; there are libraries and APIs that help make it less so… but it's still easy to screw up. :)
In this particular case, your string could be initialized as one of these:
A: char s2[4] = { 'a','a','a', 0 }; // good if string MUST be 3 chars long
B: const char *s2 = "aaa"; // if you don't need to modify the string after creation
C: char s2[]="aaa"; // if you DO need to modify the string afterwards
Also note that declarations B and C are 'safer' in the sense that if someone comes along later and changes the string declaration in a way that alters the length, B and C are still correct automatically, whereas A depends on the programmer remembering to change the array size and keeping the explicit null terminator at the end.
What happens is that strlen keeps going, reading memory values until it eventually gets to a null. It then assumes that is the terminator and returns a length that could be massively large. If you're using strlen in an environment that expects C-strings to be used, you could then copy this huge buffer of data into another one that is just not big enough, causing buffer overrun problems, or at best, you could copy a large amount of garbage data into your buffer.
Copying a non-null-terminated C string into a std::string will do this. If you then decide that you know this string is only 3 characters long and discard the rest, you will still have a massively long std::string that contains the first 3 good characters and then a load of wastage. That's inefficient.
The moral is, if you're using the CRT functions to operate on C strings, they must be null-terminated. It's no different from any other API: you must follow the rules that the API sets down for correct usage.
Of course, there is no reason you cannot use the CRT functions if you always use the specific-length versions (eg strncpy) but you will have to limit yourself to just those, always, and manually keep track of the correct lengths.
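One way to stay within a known capacity is to search for the terminator with memchr, which takes an explicit byte count. This is a sketch under the assumption that the caller knows the buffer capacity; bounded_length is a made-up name, not a library function.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: length of the text in buf, never reading past capacity.
inline std::size_t bounded_length(const char* buf, std::size_t capacity) {
    const void* end = std::memchr(buf, '\0', capacity);
    if (end == nullptr) {
        return capacity;  // no terminator inside the buffer
    }
    return static_cast<std::size_t>(static_cast<const char*>(end) - buf);
}
```

For char s2[3] = {'a','a','a'}; the call bounded_length(s2, sizeof s2) returns 3 without touching memory beyond the array, where plain strlen would keep reading.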
Convention states that a char array with a terminating \0 is a null terminated string. This means that all str*() functions expect to find a null-terminator at the end of the char-array. But that's it, it's convention only.
By convention also strings should contain printable characters.
If you create an array like you did char arr[3] = {'a', 'a', 'a'}; you have created a char array. Since it is not terminated by a \0 it is not called a string in C, although its contents can be printed to stdout.
The C standard does not define the term string until the section 7 - Library functions. The definition in C11 7.1.1p1 reads:
A string is a contiguous sequence of characters terminated by and including the first null character.
(emphasis mine)
If the definition of string is a sequence of characters terminated by a null character, a sequence of non-null characters not terminated by a null is not a string, period.
What you have done is undefined behavior.
You are trying to read from a memory location that is not yours.
Change it to
char s2[] = {'a','a','a','\0'};

How to get a C-string out of a string that contains \0 without losing the \0

I currently have a pretty huge string. I NEED to convert it into a C-string (char*), because the function I want to use only takes a C-string as its parameter.
My problem here is that everything I tried made the final C-string way smaller than the original string, because my string contains many \0. Those \0 are essential, so I can't simply remove them :(...
I tried various ways to do so, but the most common were:
myString.c_str();
myString.data();
Unfortunately the C-string was always only the content of the original string that was before the first \0.
Any help will be greatly appreciated!
You cannot create a C-string which contains '\0' characters, because a C-string is, by definition, a sequence of characters terminated by '\0' (also called a "zero-terminated string"), so the first '\0' in the sequence ends the string.
However, there are interfaces that take a pointer to the first character and the length of the character sequence. These might be able to deal with character sequences including '\0'.
Watch out for myString.data(): before C++11 it returned a pointer to a character sequence that might not be zero-terminated, whereas myString.c_str() always returns a zero-terminated C-string. Since C++11 both return the same zero-terminated buffer.
This is not possible. The null character is the end of a null-terminated string. If you take a look at your character buffer (use &myString[0]), you'll see that the null characters are still there. However, no C functions are going to interpret those null characters correctly, because a null character is not a valid value in the middle of a string in C.
Well, myString has probably been truncated at construction/assignment time. You can try std::basic_string::assign which takes two iterators as arguments or simply use std::vector <char>, the latter being more usual in your use case.
And your API taking that C string must actually support taking a char pointer together with a length.
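A minimal sketch of keeping embedded '\0' bytes intact: build the std::string with an explicit length and hand data() plus size() to an interface that accepts both. count_zero_bytes is a hypothetical stand-in for such an interface, and make_binary_string is likewise made up.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for an API that accepts a pointer plus a length.
inline std::size_t count_zero_bytes(const char* data, std::size_t length) {
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == '\0') {
            ++zeros;
        }
    }
    return zeros;
}

// The (pointer, length) constructor keeps the embedded '\0' bytes;
// constructing from the literal alone would stop at the first one.
inline std::string make_binary_string() {
    return std::string("ab\0cd\0ef", 8);
}
```

The resulting string has size() 8, while strlen on its c_str() reports only 2, which is exactly the truncation described in the question. Passing data() and size() together lets the callee see all eight bytes.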
I'm a bit confused, but:
std::string x("abc");
if (x.c_str()[3] == '\0')
{ std::cout << "there it is" << std::endl; }
This may not meet your needs, you did say 'Those \0 are essential', but how about escaping or replacing the '\0' chars?
Would one of these ideas work?
replace the '\0' chars with a '\t' (tab char, decimal 9).
replace the '\0' with some rarely used char value like decimal 1, or decimal 255.
Create an escape code, say by replacing each '\0' char with a coded substring, (like octal as in "\000"). (Be sure to replace any original '\' with a coded value as well (like "\134")).
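The third idea could look roughly like the sketch below. Both function names are hypothetical, and the scheme only handles the two sequences named above (\000 for a zero byte, \134 for a backslash).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes '\0' as the four characters \000 and a
// literal backslash as \134, so the result has no embedded zero bytes.
inline std::string escape_zeros(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '\0') {
            out += "\\000";
        } else if (c == '\\') {
            out += "\\134";
        } else {
            out += c;
        }
    }
    return out;
}

// Hypothetical inverse: decodes only the two sequences produced above.
inline std::string unescape_zeros(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in.compare(i, 4, "\\000") == 0) {
            out += '\0';
            i += 3;  // skip the rest of the escape
        } else if (in.compare(i, 4, "\\134") == 0) {
            out += '\\';
            i += 3;
        } else {
            out += in[i];
        }
    }
    return out;
}
```

The escaped form can be passed through any C-string interface, at the cost of the receiving side having to decode it again.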

What is '\0' in C++?

I'm trying to translate a huge project from C++ to Delphi and I'm finalizing the translation. One of the things I left is the '\0' monster.
if (*asmcmd=='\0' || *asmcmd==';')
where asmcmd is char*.
I know that \0 marks the end of array type in C++, but I need to know it as a byte. Is it 0?
In other words, would the code below be the equivalent of the C++ line?
if(asmcmd^=0) or (asmcmd^=';') then ...
where asmcmd is PAnsiChar.
You need not know Delphi to answer my question, but tell me \0 as byte. That would work also. :)
'\0' equals 0. It's a relic from C, which doesn't have any string type at all and uses char arrays instead. The null character is used to mark the end of a string; not a very wise decision in retrospect - most other string implementations use a dedicated counter variable somewhere, which makes finding the end of a string O(1) instead of C's O(n).
*asmcmd=='\0' is just a convoluted way of checking length(asmcmd) == 0 or asmcmd.is_empty() in a hypothetical language.
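To make the equivalence concrete, here is the condition from the question wrapped in a function next to an explicit emptiness test. Both function names are made up for this sketch.

```cpp
#include <cstring>

// Hypothetical wrapper around the condition from the question: true when
// the command is empty or starts with a ';' comment marker.
inline bool is_blank_or_comment(const char* asmcmd) {
    return *asmcmd == '\0' || *asmcmd == ';';
}

// Same emptiness test spelled with strlen; O(n) instead of O(1).
inline bool is_empty(const char* s) {
    return std::strlen(s) == 0;
}
```

For any valid C string, *s == '\0' and is_empty(s) agree; the first form simply avoids scanning the whole string.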
Strictly it is an escape sequence for the character with the octal value zero (which is of course also zero in any base).
Although you can write any character code as a backslash followed by one to three octal digits (for example '\040' is a space character in ASCII encoding), you would seldom have cause to do so. '\0' is idiomatic for specifying a NUL character (because you cannot type such a character from the keyboard or display it in your editor).
You could equally specify '\x0', which is a NUL character expressed in hexadecimal.
The NUL character is used in C and C++ to terminate a string stored in a character array. This representation is used for literal string constants and by convention for strings that are manipulated by the <cstring>/<string.h> library. In C++ the std::string class can be used instead.
Note that in C++ a character constant such as '\0' or 'a' has type char. In C, for perhaps obscure reasons, it has type int.
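The different spellings can be compared at compile time. This sketch assumes an ASCII-compatible execution character set for the space check; zero_escapes_agree is a hypothetical helper.

```cpp
// Octal and hexadecimal escapes for the same character compare equal.
static_assert('\0' == '\x0', "octal and hex spellings of NUL agree");
static_assert('\0' == '\000', "one or three octal digits, same value");
static_assert('\0' == 0, "NUL has the integer value zero");

// Octal 40 is decimal 32, which is the space character in ASCII
// (assumption: ASCII-compatible execution character set).
static_assert('\040' == 32, "octal 40 is decimal 32");

// In C++ a character literal is a char, so sizeof('\0') is 1.
static_assert(sizeof('\0') == 1, "character literals have type char in C++");

// Hypothetical runtime check mirroring the assertions above.
inline bool zero_escapes_agree() {
    const char octal = '\0';
    const char hex = '\x0';
    return octal == hex && octal == 0;
}
```

In C the last static assertion would not hold, since a character constant there has type int.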
That is the char for null or char value 0. It is used at the end of the string.

initializing char arrays in a way similar to initializing string literals

Suppose I have the following initialization of a char array:
char charArray[]={'h','e','l','l','o',' ','w','o','r','l','d'};
and I also have the following initialization from a string literal:
char stringLiteral[]="hello world";
The only difference between the contents of the first array and the second is that the second has a null character at its end.
When initializing a char array, is there a macro or something similar that lets us put the initializing text between double quotation marks without the array getting an extra null terminating character?
It just doesn't make sense to me that, when a terminating null character is not needed, we have to use the syntax of the first initialization, wrapping each character of the initializer text in single quotation marks and separating the characters with commas.
I should add that when I want a plain char array, it should be obvious that I don't intend to use it with functions that expect null-terminated strings, and that none of the other features that come with string literals matter to me.
I'm thankful for your answers.
It's allowed in C to declare the array as follows, which will initialize it without copying the terminating '\0'
char c[3] = "foo";
But it's illegal in C++. I'm not aware of a trick that would allow it for C++. The C++ Standard further says
Rationale: When these non-terminated arrays are manipulated by standard string routines, there is potential for major catastrophe.
Effect on original feature: Deletion of semantically well-defined feature.
Difficulty of converting: Semantic transformation. The arrays must be declared one element bigger to contain the string terminating '\0'.
How widely used: Seldom. This style of array initialization is seen as poor coding style.
There is no way of doing what you want. The first way of initializing the array specifies separate initializers for each character, which allows you to explicitly leave off the '\0'. The second initializes a character array from a character string, which in C/C++ is always terminated by a null character.
EDIT: corrected: 'character pointer' --> 'character array'
litb has the technically correct answer.
As for an opinion - I say just live with the 'waste' of the extra '\0'. So many bugs are the result of code expecting a terminating null where one isn't (this advice may seem to go directly against some other advice I gave just a day or two ago about not bothering to zero an entire buffer. I claim there's no contradiction - I still advocated null terminating the string in the buffer).
If you really can't live with the '\0' terminator because of some semantics in the data structure you're dealing with, such as it might be part of some larger packed structure, you can always init the array yourself (which I think should be no less efficient than what the compiler might have done for you):
#define MY_STRING_LITERAL "hello world"
char stringLiteral[sizeof(MY_STRING_LITERAL) - 1];
memcpy( stringLiteral, MY_STRING_LITERAL, sizeof(stringLiteral));
The basic answer is that the vast majority of char arrays are strings - in C, strings are null terminated. C++ inherited that convention. Even when that null isn't needed, most of the time it isn't a problem just to leave it there anyway.
Macros aren't powerful enough to do what you want. Templates would be, except they don't have any compile-time string handling.
Usually, when people want to mix numeric bytes and string literals in the same char-array sequence, they use a string literal but use hex character escapes such as \xFF.
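With a newer standard than the answers above had available, a constexpr function template can copy a literal into a std::array that is one element shorter. The sketch assumes C++17 or later, and without_terminator is a made-up helper name, not a standard facility.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (assumes C++17): copies a string literal into a
// std::array that is one element shorter, dropping the terminating '\0'.
template <std::size_t N>
constexpr std::array<char, N - 1> without_terminator(const char (&text)[N]) {
    std::array<char, N - 1> out{};
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out[i] = text[i];  // copy everything except the last element
    }
    return out;
}

// The initializer text is written between double quotes, yet the array
// holds exactly the 11 visible characters.
constexpr auto charArray = without_terminator("hello world");
static_assert(charArray.size() == 11, "no room for a terminator");
```

Because the result is not null-terminated, it must only be used with interfaces that take an explicit length, such as charArray.data() together with charArray.size().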
I might have found a way to do what I want. It isn't directly what I asked for, but it likely has the same effect.
First consider two following classes:
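// Plain aggregate holding the characters, terminator included.
template <size_t size>
class Cont{
public:
    char charArray[size];
};
// Wrapper that keeps only the first `size` characters of a Cont<size+1>.
template <size_t size>
class ArrayToUse{
public:
    Cont<size> container;
    inline ArrayToUse(const Cont<size+1> & input):container(reinterpret_cast<const Cont<size> &>(input)){}
};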
Before proceeding, you might want to go here and take a look at constant expression constructors and initialization types.
Now look at following code:
const Cont<12> container={"hello world"};
ArrayToUse<11> temp(container);
char (&charArray)[11]=temp.container.charArray;
Finally, the initializer text is written between double quotation marks.