sizeof char array is off by one - c++

I want to use the sizeof function to get the size of a char array. The size that I get is one too much. Example:
#include <stdio.h>
char text[] = "hey";
const int n = sizeof(text);
int main(int argc, char *argv[])
{
    printf("%i\n", n);
    return 0;
}
Outputs 4, instead of the expected 3. I reproduced this behaviour on various online C++ compilers, so I think it is intended (oddly enough, I can't find anything about it on the internet). Most sources that I can find online say that it should be 3 * sizeof(char) (which is 3 on most normal systems).
If I understand everything correctly, there is an extra byte that is used for the array representation in some way. Why does this happen?

String literals are implicitly NUL terminated, so "hey" is actually four characters in size; the three letters you see, plus a \0 (aka NUL).
When you initialize an array without specifying a size, it's sizing it to match the initializer, and the initializer is that four byte quantity including the NUL. char text[] = "hey"; is equivalent to saying char text[] = {'h', 'e', 'y', '\0'};. If it didn't work like this, attempting to work with the contents of the array as a C-style string would run past the buffer into neighboring memory until it found a NUL terminator by coincidence.

Related

Get away with Initialize the char array without putting \0 at the end of string

I am new to the C++ language. Recently I was taught that
we should put '\0' at the end of a char array when initializing it, for example:
char x[6] = "hello"; //OK
However,if you do :
char x[5] = "hello";
Then this would raise the error :
initializer-string for array of chars is too long
Everything went as I expected until I found that the expression below does not raise a compile error:
char x[5] = {'h','e','l','l','o'};
This really confuses me, so I would like to ask two questions:
1. Why doesn't the expression char x[5] = "hello"; raise an error?
2. To my knowledge, the function strlen() stops only when it finds '\0' to determine the length of a char array. In this case, what would strlen(x) return?
Thanks!
The string literal "hello" has six characters, because there's an implied nul terminator. So
char x[] = "hello";
defines an array of six char. That's almost always what you want, because the C-style string functions (strlen, strcpy, strcat, etc.) operate on C-style strings, which are, by definition, nul terminated.
But that doesn't mean that every array of char will be nul terminated.
char x[] = { 'h', 'e', 'l', 'l', 'o' };
This defines an array of five char. Applying C-style string functions to this array will result in undefined behavior, because the array does not have a nul terminator.
You can do character-by-character initialization and create a valid C-style string by explicitly including the nul terminator:
char x[] = { 'h', 'e', 'l', 'l', 'o', '\0' };
This defines an array of six char that holds a C-style string (i.e., a nul terminated sequence of characters).
The key here is to separate in your mind the general notion of an array of char from the more specific notion of an array of char that holds a C-style string. The latter is almost always what you want to do, but that doesn't mean that there is never a use for the former. It's just that the former is uncommon.
As an aside, in C you're allowed to elide the nul terminator:
char x[5] = "hello";
This is legal C, and it creates an array of 5 char with no nul terminator. In C++ that's not legal.
Why doesn't expression char x[5] = "hello"; raise an error?
That premise is not correct: in C++ an error is expected in this case.
To my knowledge, the function strlen() would stop only if it finds '\0' to determine the length of the char array, in this case, what would strlen(x) return?
If you could somehow get the code to run, the program would have undefined behavior, so you could not rely on any particular result. strlen() stops counting only when it finds a null terminator, so it may run past the end of the char array and read memory that does not belong to it; that is where the UB is invoked.

Properties of double quotes

I was wondering what properties double quotes have, especially in relation to initializing char pointers.
char *ptr="hello";
char array[6]={'h','e','l','l','o'};
cout<<ptr<<endl<<array<<endl;
The above code prints hello twice. I know that using double quotes denotes a string, or a char array with a null character at the end. Since ptr is looking for a memory address (char*) to be assigned to, I'm guessing that the "hello" resolves to the memory address of the 'h', despite the fact that you are also filling in char values for the rest of the array? If this is the case, does that mean that in the above code
char *ptr="hello";
the double quotes creates a string somewhere in memory and then ptr is assigned to the first element of that string, whereas
char array[6]={'h','e','l','l','o'};
creates an array somewhere in memory and then assigns values for each index based on the right hand side of the assignment operator?
There are two things to note here of importance.
char array[6]={'h','e','l','l','o'};
This will allocate 6 bytes (on the stack, if it is a local variable) and initialize them to "hello\0"; the sixth element has no initializer, so it is set to zero. The following are equivalent:
char array[] = "hello";
char array[6] = "hello";
However, the below is different.
char *ptr="hello";
This will allocate a pointer on the stack that points to the constant string "hello\0". This is an important distinction: if you alter the characters ptr points to, you cause undefined behavior, because you are altering a constant. (Since C++11 the pointer must be declared const char *; converting a string literal to plain char * is no longer allowed.)
Example:
#include <stdio.h>
void testLocal()
{
    char s[] = "local"; // locally allocated 6 bytes initialized to "local\0"
    s[0] = 'L';
    printf("%s\n", s);
}
void testPointer()
{
    char *s = "pointer"; // locally allocated pointer, pointing to a constant
    // This segfaults on my system, but really the behavior is undefined.
    s[0] = 'P';
    printf("%s\n", s);
}
int main(int argc, char * argv[])
{
    testLocal();
    testPointer();
}
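For strings, there is a special terminating character \0 which is added to the end. This tells whatever reads the string that it has reached the end.
So, if you have a string "hello", it'll keep reading each character: "h", "e", "l", "l", "o", "\0", which tells it to stop.
A character array initialized from a list of characters is similar, but that terminating character is not added automatically. (In the example above the array has six elements and only five initializers, so the sixth is set to zero, which is why printing it works.) Without a terminator, the length of the array itself is what indicates how many characters to read, which won't work for functions that expect a C-style string.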

Difference in the initialisation of character array

I do the following for character array intialisation :
char a[] = "teststring";
char b[]={'a','a','b','b','a'};
While for the first, if I need to get the string length, I must do strlen(a) ....for the other string I should do sizeof(b)/sizeof(b[0]).
why this difference?
EDIT : (I got this)
char name[10]="StudyTonight"; //valid character array initialization
char name[10]={'L','e','s','s','o','n','s','\0'}; //valid initialization
Remember that when you initialize a character array by listing all its characters separately then you must supply the '\0' character explicitly.
I get that with char b we have to add '\0' to make a proper initialisation.
ANOTHER :
Therefore, the array of char elements called myword can be initialized with a null-terminated sequence of characters by either one of these two statements:
char myword[] = { 'H', 'e', 'l', 'l', 'o', '\0' };
char myword[] = "Hello";
A string literal, like "teststring" contains the characters between the double-quotes, plus a terminating char with value zero. So
char a[] = "ab";
has the same effect as;
char a[] = {'a', 'b', '\0'};
strlen() searches for that character with value '\0'. So strlen(a) in this case will return 2.
Conversely, sizeof() gets the actual size of the memory used. Since sizeof(char) is 1, by definition in the standard, this means sizeof(a) gives the value 3: it counts the 'a', the 'b', and the '\0'.
a is a C-style string, i.e, null-terminated char array. The initialization is equivalent to:
char a[] = {'t','e','s','t','s','t','r','i','n','g','\0'};
b, however, is not null-terminated, so it's not a C-style string, you can't use functions like std::strlen() because they are only valid for C-style strings.
String literals are expanded into char arrays, but also include the terminating zero char. So think of
char a[] = "teststring";
as if you had typed this:
char a[] = {'t','e','s','t','s','t','r','i','n','g','\0'};
A rule of thumb
Whenever you will use strlen() on a char array, use a string literal for its initialization. The strlen function can be thought of as a simple scan for the terminating zero char (\0) that counts the iterations needed.
A word about sizeof
Even though it is often written with parentheses, sizeof is an operator, an integral part of the C++ language (inherited from C). In cases like char c[] = "hello";, sizeof(c) yields 6, which is exactly 1 more than strlen(c), and you might be thinking: "let's skip that inefficient scanning for the terminator". But sizeof stops being a useful shortcut as soon as it is applied to a pointer, and arrays are converted to pointers whenever required, for example when passed to a function. In that case sizeof yields the size of the pointer, not of the array. Look at the following example:
#include <iostream>
// naive approach, don't do that
int myarraysize(char s[])
{
    return sizeof(s);
}
int main ()
{
    char c[] = "hello";
    std::cout << sizeof(c) << " vs " << myarraysize(c) << std::endl;
    return 0;
}
You can always write
char b[]={'a','a','b','b','a','\0'};
to overcome "the difference".
Also note
sizeof(b)/sizeof(b[0])
essentially boils down to
sizeof(b)
since sizeof(char) is always 1. Your formula is needed for arrays of any other element type.

pointer to string and char catch 22

I'm studying pointers and I got stuck when I saw char *p[10], because I have misunderstood something. Can someone explain step by step why my logic is wrong, what the mistakes are, and how I should think about it? I want to understand this exactly. Also, what about int *p[10];? Besides, x is a pointer to char, that is, to a single char and not to several chars. So how come char *x = "possible"; works?
I think the one above should be right, but I have also seen char *name[] = { "no month","jan","feb" }; and I am really confused.
Your char *p[10] diagram shows an array where each element points to a character.
You could construct it like this:
char f = 'f';
char i = 'i';
char l1 = 'l';
char l2 = 'l';
char a1 = 'a';
char r1 = 'r';
char r2 = 'r';
char a2 = 'a';
char y = 'y';
char nul = '\0';
char *p[10] = { &f, &i, &l1, &l2, &a1, &r1, &r2, &a2, &y, &nul };
This is very different from the array
char p[10] = {'f', 'i', 'l', 'l', 'a', 'r', 'r', 'a', 'y', '\0'};
or
char p[10] = "fillarray";
which are arrays of characters, not pointers.
A pointer can equally well point to the first element of an array, as you've probably seen in constructions like
const char *p = "fillarray";
where p holds the address of the first element of an array defined by the literal.
This works because an array can decay into a pointer to its first element.
The same thing happens if you make an array of pointers:
/* Each element is a pointer to the first element of the corresponding string in the initialiser. */
const char *name[] = { "no month","jan","feb" };
You would get the same results with
const char* name[3];
name[0] = "no month";
name[1] = "jan";
name[2] = "feb";
char c = 'a';
Here, c is a char, typically a single byte of ASCII encoded data.
char* ptr = &c;
ptr is a char pointer. In C, all it does is point to a memory location; it doesn't make any guarantees about what is at that location. You could use a char* to pass a char to a function to allow the function to make changes to that char (pass by reference).
A common C convention is for a char* to point to a memory location where several characters are stored in sequence followed by the null character \0. This convention is called a C string:
char const* cstr = "hello";
cstr points to a block of memory 6 bytes long, ending with a null character. The data itself cannot be modified, though the pointer can be changed to point to something else.
An array of chars looks similar, but behaves slightly differently.
char arr[] = "hello";
Here arr IS a memory block of 6 chars. Since arr represents the memory itself, it cannot be changed to point to another location. The data can be modified though.
Now,
char const* name[] = { "Jan", " Feb"..., "Dec"};
is an array of pointers to characters.
name is a block of memory whose elements each contain a pointer to a null-terminated string.
In the diagram, I think string* was accidentally used instead of char*. The difference between the left and the right, is not a technical difference really, but a difference in the way a char* is used. On the left each char* points to a single character, whereas in the one on the right, each char* points to a null-terminated block of characters.
Both are right.
A pointer in C or C++ may point either to a single item (a single char) or to the first in an array of items (char[]).
So the elements of a char *p[10]; array may point to 10 single characters or to 10 arrays (i.e. 10 strings).
Let’s go back to basics.
First, char *p is simply a pointer. p contains nothing more than a memory address. That memory address can point to anything, anywhere. By convention, we have always used NULL to mark a pointer that points to nothing (or, I hate this method, assigning it zero – yeah, they are the same “thing”, but NULL has traditionally been used in conjunction with pointers, so when your eyes flit across the code, you see NULL – you think “pointer”).
Anyway, that memory address being pointed to can contain anything. So, to use within the language, we type it, in this case it is a pointer to a character (char *p). This can be overridden by type casting, but that’s for a later time.
Second, we know anytime we see p[10], that we are dealing with an array. Again, the array can be an array of characters, an array of ints, etc. – but it’s still an array.
Your example: char *p[10], is then nothing more than an array of 10 character pointers. Nothing more, nothing less. Your problem comes in because you are trying to force the “string” concept onto this. There ain’t no strings in C. There ain’t no objects in C. The concept of a null-terminated string can most certainly be used. But a “string” in C is nothing more than an array of characters, terminated by a null character, '\0' (or, if you use some of the appropriate functions, you can use a specific number of characters – strncpy instead of strcpy, etc.). But, for all its appearance, and apparent use, there are no strings in C. They are nothing more than arrays of characters, with a few supporting functions that happen to stop going through the array when a null character is encountered.
So – char a[10] – is simply an array of characters that is 10 characters long. You can fill it with any characters you wish. If one of those is the null character, then that terminates what is typically called a “C-style string”. There are functions that support this type of character array (i.e. “string”), but it is still a use of a character array.
Your confusion comes in because you are trying to mix C++ string objects, and forcing that concept onto C arrays of characters. As ugoren noted – your examples are both correct – because you are dealing with arrays of character pointers, NOT strings. Again, putting a null character somewhere in that character array is happily supported by several C functions that give you the ability to work with a “string-like” concept – but they are not truly strings. Unless of course, you want to phrase it that a string is nothing more than one character following another – an array.

C++ sized char array initialized by setting to string literal causes array bounds overflow

I read that when one initializes an array it is possible to use a string literal.
But if the list of initializers is bigger than the size of the array, an error is caught.
#include "stdafx.h"
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    char cAr2[3] = "ABC";
    for (int i = 0; i < 3; i++)
        cout<<cAr2[i]<<endl;
    system("pause");
    return 0;
}
Well, this example is given in my book.
It really ends like this: error C2117: 'cAr2' : array bounds overflow.
Could you tell me what is what here: I can see an array of 3 elements and 3 elements being placed into it. Everything seems Ok. Why error?
The string literal "ABC" gives you an "array of 4 const char". There are 4 characters because the string is terminated with the null character. That is, your initialisation would be equivalent to:
char cAr2[] = {'A', 'B', 'C', '\0'};
The null character is implicitly appended to the end of your string so that algorithms that loop over the contents of the array know when to stop without having a string length explicitly given.
Well, the easy answer is this: if you're going to use an initializer, save yourself some grief and leave out the size.
The longer answer is that strings are null-terminated, which means there's an additional character you do not see at the end of the string. So you will need an array of size n+1 where n is the number of characters you see.
The size 3 is not large enough for the "ABC" string:
char cAr2[3] = "ABC";
You need at least 4 characters to store this string with the null terminator
Even if your compiler auto-corrects that (I am not sure), it is not a good idea to undersize the array.
If you want to initialize using a string literal, I think you'll want to do something like this:
const char *cAr2 = "ABC";
However, if you'd like to keep the same type, do this (note that the resulting array has no null terminator):
char cAr2[3] = { 'A', 'B', 'C' };