How are C++ strings stored? [duplicate]


This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
std::string and its automatic memory resizing
I am just curious: how are strings stored in memory? For example, when I do this:
string testString = "asd";
It allocates 4 bytes, right? a + s + d + \0.
But later, when I want to assign some new text to this string, it works, but I don't understand how. For example, I do this:
testString = "123456789";
Now it should be 10 bytes long. But what if there wasn't space for such a string? Let's say that the fifth and sixth bytes from the beginning of the string are taken by some other 2 chars. How does the CPU handle it? Does it find a completely new position in memory where that string fits?

This is implementation dependent, but the general idea is that the string class will contain a pointer to a region of memory where the actual contents of the string are stored. Two common implementations are storing 3 pointers (begin of the allocated region and data, end of data, end of allocated region) or a pointer (begin of allocated region and data) and two integers (number of characters in the string and number of allocated bytes).
When new data is appended to the string, if it fits the allocated region it will just be written and the size/end of data pointer will be updated accordingly. If the data does not fit in the region a new buffer will be created and the data copied.
Also note that many implementations have optimizations for small strings, where the string class does contain a small buffer. If the contents of the string fit in the buffer, then no memory is dynamically allocated and only the local buffer is used.
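To see the "grow only when the data no longer fits" behaviour from the outside, here is a minimal sketch. The helper name count_reallocations is made up for this illustration, and the exact growth policy depends on your standard library, so the count will differ between implementations.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer): appends `n` characters
// one at a time and counts how often capacity() changes, i.e. how often the
// implementation switched to a larger buffer.
std::size_t count_reallocations(std::size_t n) {
    std::string s;
    std::size_t last_capacity = s.capacity();
    std::size_t changes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');
        if (s.capacity() != last_capacity) {
            last_capacity = s.capacity();
            ++changes;
        }
    }
    return changes;
}
```

On typical implementations the result for a few thousand characters is a small number, because the buffer grows geometrically rather than by one byte per append.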

string is not a simple datatype like char *. It's a class, which has implementation details that aren't necessarily visible.
Among other things, string includes a counter to keep track of how big it really is.
char test[] = "asd"; // allocates exactly 4 bytes
string testString = "asd"; // who knows?
testString = "longer"; // allocates more if necessary
Suggestion: write a simple program and step through it using a debugger. Examine the string, and see how the private members change as the value is changed.

string is an object, not just some memory location. It dynamically allocates memory as needed.
The = operator is overloaded; when you say testString = "123456789"; a member function is called that deals with the const char * you passed in.
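A toy sketch of what such an overloaded operator might do, assuming a simple pointer + size + capacity layout. TinyString and all of its members are invented for illustration; real std::string implementations are considerably more involved (small string optimization, allocators, copy/move support).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Toy illustration only: a string-like class that owns a heap buffer.
class TinyString {
public:
    TinyString() : data_(new char[1]()), size_(0), capacity_(0) {}
    ~TinyString() { delete[] data_; }
    TinyString(const TinyString&) = delete;             // copying omitted for brevity
    TinyString& operator=(const TinyString&) = delete;

    // This is what runs for `t = "123456789";`
    TinyString& operator=(const char* text) {
        assert(text != nullptr);
        std::size_t n = std::strlen(text);
        if (n > capacity_) {               // does not fit: get a bigger buffer
            char* bigger = new char[n + 1];
            delete[] data_;
            data_ = bigger;
            capacity_ = n;
        }
        std::memmove(data_, text, n + 1);  // copy characters plus terminator
        size_ = n;
        return *this;
    }

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }
    const char* c_str() const { return data_; }

private:
    char* data_;            // start of the allocated region
    std::size_t size_;      // characters in use, excluding the terminator
    std::size_t capacity_;  // characters that fit, excluding the terminator
};
```

Assigning a shorter string reuses the existing buffer; only an assignment that exceeds the capacity moves the contents to a new location.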

It's stored with a size. If you store a new string, it will optionally deallocate the existing memory and allocate new memory to cope with the change in size.
And it doesn't necessarily allocate 4 bytes the first time you assign a string of 4 bytes to it. It may allocate more space than that (it won't allocate less).

Related

Is there any way to find the size of dynamically allocated memory, like a sizeof facility?

I am looking for something that gives me the size of the memory that the str character pointer points to.
int main()
{
char * str = (char *) malloc(sizeof(char) * 100);
int size = 0;
size = /* library function or anything use to find size */
printf("Total size of str array - %d\n", size);
}
I want to prove that the given memory is 100 bytes.
Does anyone have any idea about this?

A raw pointer only knows that it points to a single element of its type. If the thing it points to happens to be part of an array, the pointer doesn't know, and there's no way to get that information from it.
Instead, you want to use types that do know their size, for example std::string, std::array or std::vector.
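A minimal sketch of the container approach, with make_buffer being a name invented for this example: the element count travels with the object, so it can simply be asked for later.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: creates a buffer of `n` zero-initialised chars.
// Unlike a raw char*, the returned object remembers how many elements it holds.
std::vector<char> make_buffer(std::size_t n) {
    return std::vector<char>(n);
}
```

Calling make_buffer(100).size() yields 100, which is exactly the information a bare char * cannot provide.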

The C and C++ standards do not provide a way to get, from an address, the amount of memory that was requested in the call to malloc that returned that address.
Some C or C++ implementations provide a way to get the amount of memory that was provided at the given address, such as malloc_size. The amount provided may be greater than the amount that was requested.
If the memory contains a string, which is an array of characters terminated by a null character, then you can determine the length of the string by counting characters up to the null character. This function is provided by the standard strlen function. This length is different from the space allocated unless, of course, the string happens to fill the space.
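The difference between the string length and the allocated space can be sketched as follows. stored_length is a hypothetical helper written for this illustration; it only uses standard malloc, strncpy, strlen and free.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: requests `capacity` bytes, stores (a possibly truncated
// copy of) `text` in them, and returns what strlen reports for that buffer.
std::size_t stored_length(const char* text, std::size_t capacity) {
    if (capacity == 0) return 0;
    char* str = static_cast<char*>(std::malloc(capacity));
    if (str == nullptr) return 0;
    std::strncpy(str, text, capacity - 1);
    str[capacity - 1] = '\0';              // guarantee termination
    std::size_t len = std::strlen(str);    // counts characters, not allocated bytes
    std::free(str);
    return len;
}
```

With a 100-byte request and the text "hello", the result is 5, not 100: strlen only sees the characters up to the terminator.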

There is no (good, standard, portable) way to tell from a pointer value alone whether it's the first element of an array or not, nor how many elements follow it. That information has to be tracked separately.
If you're writing in C++, don't do your own memory management if you can help it. Use a standard container type like std::vector or std::map (or std::string for text). If you must do your own memory management, use the new and delete operators instead of the *alloc and free library functions, and wrap a class around those operations that also keeps track of how many elements have been allocated (which, like std::vector and std::map, is returned via a read-only size() method).
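If you really do have to manage the memory yourself, the wrapper described above might look roughly like this. CharBuffer is an invented name, and copying is simply disabled to keep the sketch short; std::vector<char> already does all of this and more.

```cpp
#include <cstddef>

// Sketch of a class that wraps new[]/delete[] and tracks the element count.
class CharBuffer {
public:
    explicit CharBuffer(std::size_t n) : data_(new char[n]()), size_(n) {}
    ~CharBuffer() { delete[] data_; }
    CharBuffer(const CharBuffer&) = delete;             // no copies in this sketch
    CharBuffer& operator=(const CharBuffer&) = delete;

    std::size_t size() const { return size_; }          // read-only element count
    char* data() { return data_; }
    const char* data() const { return data_; }

private:
    char* data_;
    std::size_t size_;
};
```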

Memory efficiency of C++ arrays

Somewhere in my brainstem a voice whispers:
In C++, an array does not need more memory than the number of elements need.
std::string str = "aabbcc";
std::array<std::string, 3> str_array = {"aa", "bb", "cc"};
Accordingly, both should have the same size, because (unlike in Java), there is no separate size field or similar. But I haven't found a reference.
Is this true? Under which circumstances is it not?

Storing strings in any language is more complicated than you might think. A C++ std::string must provide contiguous storage for its contents. Apart from that, std::string can hold more things, such as a pointer/iterator to the last character, the number of characters in it, etc. std::string::size is required to be O(1), so the string must store more information than just a buffer. Also, most standard library implementations provide SSO (small string optimization). With SSO, the std::string object itself contains a small buffer, which avoids unnecessary dynamic allocations for short strings. You can also reserve more memory than you need. Let's say you need to collect 800-1000 characters in a loop. You can do it like this:
std::string str;
for(...)
str += some_character;
But this will cause unnecessary memory allocations and deallocations. If you can estimate the number of characters you want to store, you should reserve memory.
std::string str;
str.reserve(1000);
for(...)
str.push_back(some_character);
Then, you can always shrink_to_fit, to save memory:
str.shrink_to_fit();
There are also other things you must be aware of:
reserve increases capacity, but size stays the same. This means that std::string must also store (or be able to calculate) how many more characters the buffer's capacity allows.
string literals are null-terminated
std::basic_string::c_str must return a null-terminated array of characters, so the buffer of a std::string generally also holds a null terminator (since C++11, data() and operator[] at index size() expose it as well)
there are more encodings and character sets; ASCII is just one of them. In UTF-8 and UTF-16 encoded strings, a single code point may need several stored elements (code units), but this is more complicated.
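Putting the reserve and shrink_to_fit points together, here is a small sketch. Both function names are made up for this example, and shrink_to_fit is only a non-binding request, so the final capacity is implementation dependent.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a string of `n` copies of `c`, reserving first.
std::string build_reserved(std::size_t n, char c) {
    std::string str;
    str.reserve(n);                 // capacity() >= n, size() is still 0
    for (std::size_t i = 0; i < n; ++i) {
        str.push_back(c);           // fits into the reserved buffer
    }
    str.shrink_to_fit();            // request to drop unused capacity
    return str;
}

// Hypothetical check: after reserve(n), appending n characters should not
// move the buffer, so data() stays the same on the usual implementations.
bool appends_without_reallocating(std::size_t n) {
    std::string str;
    str.reserve(n);
    const char* before = str.data();
    for (std::size_t i = 0; i < n; ++i) {
        str.push_back('x');
    }
    return str.data() == before;
}
```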

In C++, an array does not need more memory than the number of elements need.
This is true. A raw array has a size equal to the size of its element type times the number of elements. So,
int array[10];
has a size of sizeof(int) * std::size(array). std::array is the same but it is allowed to have padding so
std::array<int, 10> array;
has the size of sizeof(int) * std::size(array) + P where P is some integer amount of padding.
Your example though isn't quite the same thing.
A std::string is a container. It has its own size that is separate from what it contains. So sizeof(std::string) will always be the same regardless of how many characters are in the string. So, ignoring short string optimization,
std::string str = "aabbcc";
takes up sizeof(std::string) plus however much memory the string allocated for the underlying C-string. That is not the same value as
std::array<std::string, 3> str_array = {"aa", "bb", "cc"};
Since you now have 3 * sizeof(std::string) plus whatever each string allocated.
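A quick way to convince yourself of this, as a sketch: same_object_size is a name invented here, and the static_assert only states the lower bound described above, since the exact value of sizeof(std::string) differs between standard libraries.

```cpp
#include <array>
#include <string>

// sizeof looks at the object, not at the characters it manages, so a
// 2-character string and a 1000-character string report the same size.
bool same_object_size() {
    std::string short_str = "aa";
    std::string long_str(1000, 'x');
    return sizeof(short_str) == sizeof(long_str);
}

// An array of three strings occupies at least three string objects,
// independent of what each string allocated elsewhere.
static_assert(sizeof(std::array<std::string, 3>) >= 3 * sizeof(std::string),
              "std::array stores its elements inline");
```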

Accordingly, both should have the same size (e.g. 6 bytes),
Not a correct deduction.
The memory used by a std::string, if you want to call its size, consists of at least a pointer and the memory allocated to hold the data.
The memory allocated to hold the data can also include the space required to hold the terminating null character.
Given
std::string s = "aabbcc";
std::string a = "aa";
std::string b = "bb";
std::string c = "cc";
mem(s) != mem(a) + mem(b) + mem(c)

Virtually every string holds the following information:
The size of the string, i.e. the number of characters it contains.
The capacity of the memory holding the string's characters.
The value of the string.
Additionally, it may also hold:
A copy of its allocator and a reference count for the value.

They don’t have the same size. Strings are saved null-terminated, giving you an extra byte for each string.

How to code a strcat function that works with two dynamic arrays

As we know, the strcat function concatenates one C-string onto another to make one big C-string containing both.
My question is how to make a strcat function that works with two dynamically allocated arrays.
The desired strcat function should be able to work for any sized myStr1 and myStr2
//dynamic c-string array 1
char* myStr1 = new char [26];
strcpy(myStr1, "The dog on the farm goes ");
//dynamic c-string array 2
char* myStr2 = new char [6];
strcpy(myStr2, "bark.");
//desired function
strcat(myStr1,myStr2);
cout<<myStr1; //would output 'The dog on the farm goes bark.'
This is as far as I was able to get on my own:
//*& indicates that the dynamic c-string str1 is passed by reference
void strcat(char*& str1, char* str2)
{
int size1 = strlen(str1);
int size2 = strlen(str2);
//unknown code
//str1 = new char [size1+size2]; //Would wipe out str1's original contents
}
Thanks!

You need first to understand better how pointers work. Your code for example:
char* myStr1 = new char [25];
myStr1 = "The dog on the farm goes ";
first allocates 25 characters, then ignores the pointer to that allocated area (the technical term is "leaks it") and sets myStr1 to point to a string literal.
That code should have used strcpy instead to copy from the string literal into the allocated area. Except that the string is 25 characters so you will need to allocate space for at least 26 as one is needed for the ASCII NUL terminator (0x00).
Correct code for that part should have been:
char* myStr1 = new char [26]; // One more than the actual string length
strcpy(myStr1, "The dog on the farm goes ");
To do the concatenation of C strings the algorithm could be:
measure the lengths n1 and n2 of the two strings (with strlen)
allocate n1+n2+1 characters for the destination buffer (+1 is needed for the C string terminator)
strcpy the first string at the start of the buffer
strcat the second string to the buffer (*)
delete[] the memory for the original string buffers if they are not needed (if this is the right thing to do or not depends on who is the "owner" of the strings... this part is tricky as the C string interface doesn't specify that).
(*) This is not the most efficient way. strcat will go through all the characters of the string to find where it ends, but you already know that the first string length is n1 and the concatenation could be done instead with strcpy too by choosing the correct start as buffer+n1. Even better instead of strcpy you could use memcpy everywhere if you know the count as strcpy will have to check each character for being the NUL terminator. Before getting into this kind of optimization however you should understand clearly how things work... only once the string concatenation code is correct and for you totally obvious you are authorized to even start thinking about optimization.
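The steps above, using the memcpy variant from the footnote, could be sketched like this. concat_new and append_in_place are names invented for this illustration, and the sketch assumes that str1 was allocated with new[] and is owned by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns a new[]-allocated buffer holding `a` followed
// by `b`. The caller owns the result and must delete[] it.
char* concat_new(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    std::size_t n1 = std::strlen(a);
    std::size_t n2 = std::strlen(b);
    char* out = new char[n1 + n2 + 1];    // +1 for the terminator
    std::memcpy(out, a, n1);              // first string at the start
    std::memcpy(out + n1, b, n2 + 1);     // second string plus its terminator
    return out;
}

// Variant with the question's pass-by-reference signature: replaces the
// buffer behind str1 and frees the old one (assumes str1 came from new[]).
void append_in_place(char*& str1, const char* str2) {
    char* joined = concat_new(str1, str2);  // temporary pointer keeps str1 intact
    delete[] str1;
    str1 = joined;
}
```

The temporary pointer is what keeps the original contents reachable until they have been copied.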
PS: Once you get all this correct, working and efficient, you will appreciate how much of a simplification it is to use std::string objects instead, where all this convoluted code becomes just s1+s2.

You allocate memory and make your pointers point to that memory. Then you overwrite the pointers, making them point somewhere else. The assignment of e.g. myStr1 causes the variable to point to the string literal instead of the memory you allocated. You need to copy the strings into the memory you have allocated.
Of course, that copying will lead to another problem, as you seem to forget that C-strings need an extra character for the terminator. So a C-string with 5 characters needs space for six characters.
As for your concatenation function, you need to do copying here too. Allocate enough space for both strings plus a single terminator character. Then copy the first string into the beginning of the new memory, and copy the second string into the end.
Also you need a temporary pointer variable for the memory you allocate, as you otherwise "would wipe out str1's original contents" (not strictly true, you just make str1 point somewhere else, losing the original pointer).

Allocate extra memory to the character array C++

I have this problem where I have a string and I pass it to the function as a character pointer.
void test(char * str) {
....
}
where str = "abc". Now I want to add few extra characters to the end of this string without creating a new string. I do not want to use strcat as I do not know how many characters I am adding to the end of the string and what I am adding. I was trying to work with realloc but it does not work as the str is allocated on stack.
Is there any way I can increase the size of the char array dynamically?
UPDATE :
I was asked a question which involved this in my interview. I was asked to do it without using additional space. So if I allocate memory using malloc I am technically using additional space right?
Thanks

No, especially if the string is allocated on the stack. The size of a stack-allocated array is fixed at compile time. You must either allocate more space initially, or allocate a new array with more space and strcpy the contents over.

If you are using C++ - then stick to std::string and forget the whole deal with char *.
However if you wish to use the char * for strings, then allocate a new character array and strcpy() from one string to another. Do not forget to deallocate the original char * memory to avoid memory leaks.
I was asked a question which involved this in my interview. I was asked to do it without using additional space. So if I allocate memory using malloc I am technically using additional space right?
How can you increase the length of the string without adding additional space?

You must delete the old string and allocate a new one with new with the length you want.

Sorry, no. A dynamic variable/array cannot be resized up. The problem is that another variable, or even another call frame could be immediately following the variable in question. These cannot be moved to make space as there may be pointers to these objects elsewhere in the code.

void test(string &str) {
....
str += "wibble";
}
Seems to work for C++

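Rather than using realloc (which must not be used on stack memory) or strcpy into a new buffer (which uses extra space), you may store the new values starting at the byte right after the end of the input string, provided the caller's array already has room for them. In the simple example below, I begin with "abcd" in an array of 8 chars and add three z's at the end in the function fn.
#include <stdio.h>
#include <string.h>
void fn(char *str)
{
size_t len = strlen(str);
memset(str+len, 'z', 3); // the caller must guarantee 3 spare bytes plus room for the terminator
str[len+3] = 0;
}
int main()
{
char s[8] = "abcd"; // 4 characters + 3 appended + terminator
printf("%s\n", s);
fn(s);
printf("%s\n", s);
}
Output:
abcd
abcdzzz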
This way can be extended to adding different strings in front of original one.

C++ Pointer question

I'm new to pointers in C++. I'm not sure why I need pointers like char * something[20] as opposed to just char something[20][100]. I realize that the second method would mean that a block of 100 bytes will be allocated for each element in the array, but wouldn't the first method introduce memory leak issues?
If someone could explain to me how char * something[20] allocates memory, that would be great.
Edit:
My C++ Primer Plus book is doing:
const char * cities[5] = {
"City 1",
"City 2",
"City 3",
"City 4",
"City 5"
};
Isn't this the opposite of what people just said?

You allocate 20 pointers in the memory, then you will need to go through each and every one of them to allocate memory dynamically:
something[0] = new char[100];
something[1] = new char[20]; // they can differ in size
And delete them all separately:
delete [] something[0];
delete [] something[1];
EDIT:
const char* text[] = {"These", "are", "string", "literals"};
Strings specified directly in the source code ("string literals", which are arrays of const char that convert to const char *) are quite different from dynamically allocated char * buffers, mainly because you don't have to worry about allocating or deallocating them. They are also generally handled very differently in memory, but this depends on your compiler's implementation.

You're right.
You'd need to go through each element of that array and allocate a character buffer for each one.
Then, later, you'd need to go through each element of that array and free the memory again.
Why you would want to faff about with this in C++ is anyone's guess.
What's wrong with std::vector<std::string> myStrings(20)?
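For comparison, a minimal sketch of that alternative (make_strings is a made-up name): every element manages its own memory, elements can differ in length, and nothing has to be freed by hand.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: 20 strings, each sized independently.
std::vector<std::string> make_strings() {
    std::vector<std::string> myStrings(20);   // 20 empty strings
    myStrings[0] = "short";
    myStrings[1] = std::string(150, 'x');     // longer than 100 chars is fine
    return myStrings;                         // all memory is released automatically
}
```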

It will allocate space for twenty char-pointers.
They will not be initialized, so typical usage looks like
char * something[20];
for (int i=0; i<20; i++)
something[i] = strdup("something of a content");
and later
for (int i=0; i<20; i++)
if (something[i])
free(something[i]);

You're right - the first method may introduce memory leak issues and the overhead of doing dynamic allocations, plus more reads. I think the second method is usually preferable, unless it wastes too much RAM or you may need the strings to grow longer than 99 chars.
How the first method works:
char* something[20]; // Stores 20 pointers.
something[0] = static_cast<char*>(malloc(100)); // Make something[0] point to a new buffer of 100 bytes (C++ needs the cast; C does not).
sprintf(something[0], "hai"); // Make the new buffer contain "hai", going through the pointer in something[0]
free(something[0]); // Release the buffer.

char* smth[20] does not allocate any memory on the heap. It allocates just enough space on the stack to store 20 pointers. The values of those pointers are indeterminate, so before using them, you have to initialize them, like this:
char* smth[20];
smth[0] = new char[100]; // allocate memory for 100 chars, store the address of the first one in smth[0]
//..some code..
delete[] smth[0];

First of all, this is almost inapplicable in C++. The normal equivalent in C++ would be something like: std::vector<std::string> something;
In C, the primary difference is that you can allocate each string separately from the others. With char something[M][N], you always allocate exactly the same number of strings, and the same space for each string. This will frequently waste space (when the strings are shorter than you've made space for), and won't allow you to deal with any more strings or longer strings than you've made space for initially.
char *something[20] lets you deal with longer/shorter strings more efficiently, but still only makes space for 20 strings.
The next step (if you're feeling adventurous) is to use something like:
char **something;
and allocate the strings individually, and allocate space for the pointers dynamically as well, so if you get more than 20 strings you can deal with that as well.
I'll repeat, however, that for most practical purposes, this is restricted to C. In C++, the standard library already has data structures for situations like these.

C++ has pointers because C has pointers.
Why do we use pointers?
To track dynamically-allocated memory. The memory allocation functions in C (malloc, calloc, realloc) and the new operator in C++ all return pointer values.
To mimic pass-by-reference semantics (C only). In C, all function arguments are passed by value; the formal parameter and the actual parameter are distinct objects, and modifying a formal parameter doesn't affect the actual parameter. We get around this by passing pointers to the function. C++ introduced reference types, which serve the same purpose, but are a bit cleaner and safer than using pointers.
To build dynamic, self-referential data structures. A struct cannot contain an instance of itself, but it can contain a pointer to an instance. For example, the following code
struct node
{
data_t data;
struct node *next;
};
creates a data type for a simple linked-list node; the next member explicitly points to the next element in the list. Note that in C++, the STL containers for stacks and queues and vectors all use pointers under the hood, isolating you from the bookkeeping.
There are literally dozens of other places where pointers come up, but those are the main reasons you use them.
Your array of pointers could be used to store strings of varying length by allocating just enough memory for each, rather than relying on some maximum size (which will eventually be exceeded, leading to a buffer overflow error, and in any case will lead to internal memory fragmentation). Naturally, in C++ you'd use the string data type (which hides all the pointer and memory management behind the class API) instead of pointers to char, but someone has decided to confuse you by starting with low-level details instead of the big picture.
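To make the linked-list point concrete, here is a small sketch that walks such a list through the next pointers. The typedef for data_t and the helper list_length are assumptions added for this example; the answer leaves data_t unspecified.

```cpp
#include <cstddef>

typedef int data_t;   // assumption: any payload type would do

struct node
{
    data_t data;
    struct node *next;   // points to the next element, or nullptr at the end
};

// Hypothetical helper: follows the next pointers and counts the nodes.
std::size_t list_length(const node* head) {
    std::size_t n = 0;
    for (const node* p = head; p != nullptr; p = p->next) {
        ++n;
    }
    return n;
}
```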

I'm not sure why I need pointers like char * something[20] as opposed to just char something[20][100]. I realize that the second method would mean that a block of 100 bytes will be allocated for each element in the array, but wouldn't the first method introduce memory leak issues?
The second method will suffice if you're only referencing your buffer(s) locally.
The problem comes when you pass the array name to another function. When you pass char something[10] to another function, you're actually passing char* something because the array length doesn't go along for the ride.
For multidimensional arrays, you can declare a function that takes in an array of determinate length in all but the first dimension, e.g. foo(char something[][10]).
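A sketch of passing a two-dimensional char array to a function: the row length is part of the parameter type, while the number of rows has to be passed separately because it does not go along for the ride. count_nonempty is a name invented for this illustration.

```cpp
#include <cstddef>

// `rows` is really a pointer to arrays of 100 chars; only the inner
// dimension is known to the function, so the row count comes in as `n`.
std::size_t count_nonempty(const char rows[][100], std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (rows[i][0] != '\0') {
            ++count;
        }
    }
    return count;
}
```

It can be called with a char something[20][100] array and the count 20; an array whose rows are not exactly 100 chars wide would not compile.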
So why use the first form rather than the second? I can think of a few reasons:
You don't want to have the restriction that the entire buffer must reside in continuous memory.
You don't know at compile-time that you'll need each buffer, or that the length of each buffer will need to be the same size, and you want the flexibility to determine that at run-time.
This is a function declaration.

char * something[20]
Assuming this is 32Bit, this allocates 80 bytes of data on the stack.
4 Bytes for each pointer address, 20 pointers total = 4 x 20 = 80 bytes.
The pointers are all uninitialized, so you need to write additional code to allocate/free
the buffers for doing this.
It roughly looks like:
[0] [4 Bytes of Uninitialized data to hold a pointer/memory address...]
[1] [4 Bytes of ... ]
...
[19]
char something[20][100]
Allocates 2000 bytes on the stack.
100 Bytes for each something, 20 somethings total = 100 x 20 = 2000 bytes.
[0] [100 bytes to hold characters]
[1] [100 bytes to hold characters]
...
[19]
The char * approach has a smaller memory overhead, but you have to manage the memory.
The char[][] approach has a bigger memory overhead, but you don't have additional memory management.
With either approach, you have to be careful when writing to the buffer allocated not to exceed/overwrite the memory alloc'd for it.
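The arithmetic above can be checked with sizeof. This is only a sketch with invented function names; on a 64-bit platform the pointer table comes out as 160 bytes rather than 80, because each pointer is 8 bytes.

```cpp
#include <cstddef>

// Bytes taken by the table of 20 pointers (the buffers they will point to
// are not included, since nothing has been allocated yet).
std::size_t pointer_table_bytes() {
    return sizeof(char*[20]);
}

// Bytes taken by 20 fixed rows of 100 chars each.
std::size_t fixed_table_bytes() {
    return sizeof(char[20][100]);
}
```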
