Memory allocation for strings in vectors

If a vector always provides contiguous memory storage, how does the compiler allocate memory to empty std::strings?
I have a vector to which I've pushed a number of class objects, each with a std::string as a private member. I then pass a reference to the vector as an argument to another method.
Is the string's data elsewhere in the heap referenced from the vector's contiguous array?

Allocating memory for std::string is trivial.
Internally, it'll have some sort of pointer that points to a block of memory in which the actual string data will be stored. So, allocating memory for a std::string is simply a matter of allocating space for a pointer, a size_t or something, and maybe a couple more primitives.
If you have a std::vector<std::string> for example, it's easy for the vector to allocate space for the std::string's because they're just k bytes each for some constant k. The string data will not be involved in this allocation.
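To make the "k bytes each" point concrete, here is a small sketch. The helper name elementsEvenlySpaced is made up for illustration; it only checks that neighbouring std::string objects in the vector sit exactly sizeof(std::string) bytes apart, no matter how much text each one holds.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns true when every element sits exactly
// sizeof(std::string) bytes after the previous one, whatever the length
// of the text each string holds.
bool elementsEvenlySpaced(const std::vector<std::string>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        // Both addresses point into the vector's single element array,
        // so taking their byte difference is well defined.
        const char* prev = reinterpret_cast<const char*>(&v[i - 1]);
        const char* curr = reinterpret_cast<const char*>(&v[i]);
        if (curr - prev != static_cast<std::ptrdiff_t>(sizeof(std::string))) {
            return false;
        }
    }
    return true;
}
```

Filling a vector with an empty string, a short one and a 5000-character one and passing it to this helper should return true: the string objects stay fixed-size, and only the character data they refer to varies.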

The details of what happens really in memory in this case are quite dependent on the specific STL implementation you're using.
Having said that, my impression is that in most implementations vector and string are implemented with something like (very simplified):
template<typename T>
class vector
{
    //...
private:
    T* _data;
};
class string
{
private:
    char _smallStringsBuffer[kSmallSize];
    char* _bigStringsBuffer;
};
The vector's data is dynamically allocated on the heap based on the capacity (whose initial value for a default-constructed vector is implementation-defined, and which grows as you add elements to the vector).
The string's data is stored inline, inside the string object itself, for small strings (the threshold for "small" is implementation-dependent), and is dynamically allocated once the string grows past that. This is done for a number of reasons, but mostly to handle small strings more efficiently.
The example you described is something like:
void MyFunction(const vector<string>& myVector)
{
    // ...
}
int main()
{
    vector<string> v = ...;
    // ...
    MyFunction(v);
    // ...
    return 0;
}
In this particular case only the vector object v itself (its few bookkeeping members) will be on the stack, while the array that v._data points to is allocated on the heap. If v has capacity N, that array occupies sizeof(string) * N bytes on the heap, where sizeof(string) is a constant of roughly kSmallSize * sizeof(char) + sizeof(char*) (plus any padding), based on the definition of the string above.
As for contiguous data: only if every string in the vector has fewer than kSmallSize characters will the character data be "almost" contiguous in memory, since each string's inline buffer then sits inside the vector's array. Longer strings keep their characters in separate heap blocks.
This is an important consideration for performance-critical code, but to be honest I don't think that most people would rely on standard STL's vectors and strings for such situations, as the implementation details can change over time and on different platforms and compilers. Furthermore, whenever your code goes out of the "fast" path, you won't notice except with spikes of latency that are going to be hard to keep in check.
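If you want to see which path a given string took on your own implementation, a rough check is to compare the address of the character data with the address range of the string object itself. This is only a sketch: storedInline is a made-up helper, and whether a short string reports true depends entirely on your standard library. What can be relied on is that a string much longer than sizeof(std::string) cannot fit inside the object.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: true when the characters of s live inside the
// std::string object itself (small-string buffer), false when they live
// in a separately allocated block.
bool storedInline(const std::string& s) {
    // Addresses are converted to integers so that pointers into
    // unrelated memory blocks can be compared safely.
    const std::uintptr_t object = reinterpret_cast<std::uintptr_t>(&s);
    const std::uintptr_t chars = reinterpret_cast<std::uintptr_t>(s.data());
    return chars >= object && chars < object + sizeof(std::string);
}
```

For a 1000-character string this should return false on any implementation. For something like "hi" the answer tells you whether your library uses a small-string buffer for that length.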

Related

std::vector internals

How is std::vector implemented, using what data structure? When I write
void f(int n) {
    std::vector<int> v(n);
    ...
}
Is the vector v allocated on stack?

The vector object will be allocated on the stack and will internally contain a pointer to beginning of the elements on the heap.
Keeping the elements on the heap gives the vector class the ability to grow and shrink on demand, while having the vector object itself on the stack means it is destructed automatically when it goes out of scope.
In regards to your [] question, the vector class overloads the [] operator. I would say internally it's basically doing something like this when you do array[n]:
return *(_Myfirst + n);
Pointer arithmetic already scales n by the element size, so no explicit multiplication is needed. The vector keeps track of the start of its internal heap array with _Myfirst.
When your vector fills up, it will allocate more memory for you. A common practice is to grow the capacity by a constant factor (often doubling it) each time.
Note that in the Microsoft standard library, where these identifiers come from, vector inherits from _Vector_val, which contains the following members:
pointer _Myfirst; // pointer to beginning of array
pointer _Mylast; // pointer to current end of sequence
pointer _Myend; // pointer to end of array
_Alty _Alval; // allocator object for values
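You can watch the growth policy of your own implementation by recording the capacity every time it changes. The function below is a sketch with a made-up name (capacitySteps). The exact sequence it returns is implementation-specific, so treat the numbers as something to observe rather than rely on.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: pushes `count` ints into a vector and records the
// capacity each time it changes, i.e. each time the vector reallocated.
std::vector<std::size_t> capacitySteps(std::size_t count) {
    std::vector<int> v;
    std::vector<std::size_t> steps;
    std::size_t last = v.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last) {  // a reallocation happened
            last = v.capacity();
            steps.push_back(last);
        }
    }
    return steps;
}
```

Calling capacitySteps(1000) typically returns a short list of growing capacities, far fewer than 1000 entries, because push_back has to run in amortized constant time.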

Your v is allocated in automatic memory. (commonly known as the stack, yes)
The implementation details are not specified, but most commonly it's implemented using a dynamic array which is resized if you attempt to add more elements than the previous allocation can hold.
The standard only specifies the interface (which member functions it should have), the complexity bounds on each operation, and that the elements are stored contiguously.
Since vector is a template, the implementation is visible, so locate your <vector> file and start inspecting.

void f(int n) {
    std::vector<int> v(n);
    ...
}
The vector above has automatic storage duration and is allocated on the stack. However, the data array that the vector manages is allocated on the heap.
The internals of the vector are implementation-specific, but a typical implementation will contain three pointers: one to the start of the array, one to the end of the elements in use (which gives the size), and one to the end of the allocated storage (which gives the capacity).
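As an illustration of the three-pointer layout, here is a deliberately tiny sketch. TinyIntVector and its member names are invented for this example. It handles only int, doubles its capacity when full (one possible growth policy), and leaves out copying, allocators and exception safety, all of which a real std::vector has to deal with.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical, heavily simplified vector of int built on three pointers.
class TinyIntVector {
public:
    TinyIntVector() = default;
    TinyIntVector(const TinyIntVector&) = delete;             // copying omitted
    TinyIntVector& operator=(const TinyIntVector&) = delete;  // for brevity
    ~TinyIntVector() { delete[] first_; }

    // Size and capacity are derived from the pointers, not stored.
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    std::size_t capacity() const { return static_cast<std::size_t>(end_ - first_); }

    // Pointer arithmetic scales n by sizeof(int) automatically.
    int& operator[](std::size_t n) { return *(first_ + n); }

    void push_back(int value) {
        if (last_ == end_) grow();  // no spare room left
        *last_++ = value;
    }

private:
    void grow() {
        const std::size_t oldSize = size();
        const std::size_t newCap = capacity() == 0 ? 1 : capacity() * 2;
        int* block = new int[newCap];     // new heap array
        std::copy(first_, last_, block);  // move the old elements over
        delete[] first_;
        first_ = block;
        last_ = block + oldSize;
        end_ = block + newCap;
    }

    int* first_ = nullptr;  // start of the heap array
    int* last_ = nullptr;   // one past the last element in use
    int* end_ = nullptr;    // one past the end of the allocated storage
};
```

The object itself is just three pointers wherever it lives. Everything reachable through first_ is on the heap.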

C++: struct and new keyword

I'm a beginner to C++, I've got the following piece of code:
struct Airline {
    string Name;
    int diameter;
    int weight;
};
Airline* myPlane = new Airline;
My question is: when I call new, it allocates memory, if I recall correctly. How does the PC know how much memory to allocate, especially given that there is a string type in there?
Thanks

A std::string object is fixed-size; it contains a pointer to an actual buffer of characters along with its length. std::string's definition looks something like
class string
{
    char *buffer;
    size_t nchars;
public:
    // interface
};
It follows that your Airline objects also have a fixed size.
Now, new does not only allocate; it also initializes your object, including the std::string, which means it probably sets the char pointer to 0 because the string is empty.

You can also get the size of the structure, by using sizeof:
cout << "sizeof(Airline) = " << sizeof(Airline) << endl;
This is because the compiler knows the fields inside the structure, and adds up the sizes of each structure member.
The string object is no different than your structure. It is actually a class in the standard library, and not a special type like int or float that is handled by the compiler. Like your structure, the string class contains fields that the compiler knows the size of, and so it knows the size of your complete structure and uses that when you use new.
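A quick way to convince yourself is to give the name wildly different lengths and look at the size of the object each time. The sketch below reuses the Airline struct from the question. sizeAfterNaming is a made-up helper, not anything standard.

```cpp
#include <cstddef>
#include <string>

struct Airline {
    std::string Name;
    int diameter;
    int weight;
};

// The size is known at compile time and is at least the sum of the members.
static_assert(sizeof(Airline) >= sizeof(std::string) + 2 * sizeof(int),
              "Airline must be able to hold all of its members");

// Hypothetical helper: allocates an Airline, gives it a name of the
// requested length, and reports the size of the Airline object.
std::size_t sizeAfterNaming(std::size_t nameLength) {
    Airline* plane = new Airline();       // allocates sizeof(Airline) bytes
    plane->Name.assign(nameLength, 'x');  // the string manages its own storage
    const std::size_t bytes = sizeof(*plane);
    delete plane;
    return bytes;
}
```

sizeAfterNaming(0) and sizeAfterNaming(100000) return the same number, namely sizeof(Airline). The extra characters live in memory that the string obtains for itself.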

The call to new will allocate sizeof(Airline) which is what is needed to hold an object of type Airline.
As for how strings are managed: the string object holds some internal data used to manage the memory for the actual characters, but not the characters themselves (unless the small-object optimization is in use). The idea is the same as what others have pointed out, namely that the string stores a pointer to the actual character data, but that description is not precise enough: implementations store that pointer plus the extra data needed to answer size() and capacity() (and possibly more, such as a reference count in reference-counting implementations).

The memory for the string may or may not be within class string. Possibly (and probably), class string will manage its own memory, holding only a pointer to the memory used to store the data. Example (pseudo-code showing the string's members in place):
struct Airline {
    string Name {
        char *data; // size = 4
        size_t size; // size = 4
    }
    int diameter; // size = 4
    int weight; // size = 4
}; // size = 16
Note that those are not necessarily actual sizes, they are just for example.
Also note that in C++ (unlike C, for example), for every class T, sizeof(T) is a compile-time constant, meaning that objects can never have dynamic size. In effect this means: as soon as you need data whose size is only known at runtime, there have to be memory areas external to the object. This may imply the use of standard containers like std::string or std::vector, or even manually managed resources.
This in turn means that operator new does not need to know the dynamic size of all members, recursively, but only the size of the outermost class, the one that you allocate. When this outer class needs more memory, it has to manage it itself. Some exemplary pseudo-code:
Airline* myPlane = new Airline {
    Name = {
        data = new char[some-size]
        ...
    }
    ...
}
The inner allocations are done by the constructors of the members that hold them:
Airline::Airline() : Name(), ... {}
string::string() : data(new char[...]), ... {}
operator new does nothing but allocate some fixed-size memory as the "soil" for Airline (see the first pseudo-code), and then "seeds" Airline's constructor, which has to manage its lifetime in that restricted volume of "soil" by invoking the string constructor (implicitly or explicitly), which in turn does another new.

When you allocate an Airline, new will allocate enough space on the heap for the two ints and the string object with its fields.
A string object is always the same size, wherever it lives. Internally, however, the string stores a pointer to a character array that is allocated separately.

Does a vector have to store the size twice?

This is a rather academic question, I realise it matters little regarding optimization, but it is just out of interest.
From what I understand, when you call new[size], additional space is allocated to store the size of the allocated array. This is so when delete [] is called, it is known how much space can be freed.
What I've done is written how I think a vector would roughly be implemented:
#include <cstddef>
template <class T>
class Vector
{
public:
    struct VectorStorage
    {
        std::size_t size;
        T data[];
    };
    Vector(std::size_t size) : storage(new VectorStorage[size])
    {
        storage->size = size;
    }
    std::size_t size()
    {
        return storage->size;
    }
    ~Vector()
    {
        delete[] storage;
    }
private:
    VectorStorage* storage;
};
As far as I can tell, size is stored twice. Once in the VectorStorage object directly (as it needs to be so the size() function can work) but again in a hidden way by the compiler, so delete[] can work.
It seems like size is stored twice. Is this the case and unavoidable, or is there a way to ensure that the size is only stored once?

std::vector does not allocate memory; std::allocator, or whatever allocator you give the vector, is what allocates the memory. The allocator interface is given the number of items to allocate/deallocate, so it is not required to actually store that.
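To show what that interface looks like in use, here is a sketch of a fixed-size array that keeps its element count in exactly one place and hands it back to std::allocator on deallocation. FixedArray is an invented name and this is not how any particular std::vector is written. Note too that the allocator's underlying heap may still keep bookkeeping of its own.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical fixed-size array: the element count is stored once, in
// size_, and passed back to the allocator when the memory is released.
template <class T>
class FixedArray {
public:
    explicit FixedArray(std::size_t n) : size_(n), data_(alloc_.allocate(n)) {
        try {
            // allocate() returns raw memory; construct the elements in it.
            std::uninitialized_fill_n(data_, size_, T());
        } catch (...) {
            alloc_.deallocate(data_, size_);
            throw;
        }
    }
    FixedArray(const FixedArray&) = delete;
    FixedArray& operator=(const FixedArray&) = delete;
    ~FixedArray() {
        for (std::size_t i = 0; i < size_; ++i) {
            data_[i].~T();  // destroy each element explicitly
        }
        alloc_.deallocate(data_, size_);  // the caller supplies the count
    }

    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return data_[i]; }

private:
    std::allocator<T> alloc_;
    std::size_t size_;
    T* data_;
};
```

A FixedArray<int> of four elements starts out holding four zeros. Because deallocate() is told the count, the container does not depend on a size hidden by new[].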

Vectors don't usually store the size. The usual implementation keeps a pointer past the end of the last element and one past the end of the allocated memory, since one can reserve space without actually storing elements in it. However, there is no standard way to access the size stored by new (which may not be stored to begin with), so some duplication is usually needed.

Yes. But that's an implementation detail of the heap allocator that the compiler knows nothing about. It is almost certainly not the same as the capacity of the vector since the vector is only interested in the number of elements, not the number of bytes. And a heap block tends to have extra overhead for its own purposes.

You have size in the wrong place -- you're storing it with every element (by virtue of an array of VectorStorage), which also contains an array (per instance of VectorStorage) of T. Do you really mean to have an array inside an array like that? You're never cleaning up your array of T in your destructor either.

Cast char* to std::vector

Is there a fast (CPU) way to cast a char array into a std::vector<char>?
My function looks like this:
void someFunction(char* data, int size)
{
    // Now here I need to make the std::vector<char>
    // using the data and size.
}

You can't "cast" anything here, but you can easily construct a vector from the char array:
std::vector<char> v(data, data + size);
This will create a copy though.
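A small sketch of what "a copy" means in practice. copyToVector is a made-up wrapper around the range constructor, taking the same data and size parameters as the question.

```cpp
#include <vector>

// Hypothetical wrapper: builds a vector holding a copy of the `size`
// chars starting at `data`, using the iterator-range constructor.
std::vector<char> copyToVector(const char* data, int size) {
    return std::vector<char>(data, data + size);
}
```

The returned vector owns its own storage, so its data() pointer differs from the original buffer, and later changes to the buffer are not visible through the vector.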

The general rule with STL containers is that they make copies of their contents. With C++11, there are special ways of moving elements into a container (for example, the emplace_back() member function of std::vector), but in your example, the elements are char objects, so you are still going to copy each of the size char objects.
Think of a std::vector as a wrapper around a pointer to an array together with the length of the array. The closest equivalent of "casting a char * to a std::vector<char>" would be to swap out the vector's pointer and length for a given pointer and length, however the length is represented internally (two possibilities are a pointer to one past the last element or a size_t value). There is no standard member function of std::vector that you can use to swap its internal data with a given pointer and length.
This is for good reason, though. std::vector implements ownership semantics for every element that it contains. Its underlying array is allocated with some allocator (the second template parameter), which is std::allocator by default. If you were allowed to swap out the internal members, then you would need to ensure that the same set of heap allocation routines were used. Also, your STL implementation would need to fix the method of storing "length" of the vector rather than leaving this detail unspecified. In the world of OOP, specifying more details than necessary is generally frowned upon because it can lead to higher coupling.
But, assume that such a member function exists for your STL implementation. In your example, you simply do not know how data was allocated, so you could inadvertently give a std::vector a pointer to heap memory that was not allocated with the expected allocator. For example, data could have been allocated with malloc whereas the vector could be freeing the memory with delete. Using mismatched allocation and deallocation routines leads to Undefined Behavior. You might require that someFunction() only accept data allocated with a particular allocation routine, but this is specifying more details than necessary again.
Hopefully I have made my case that a std::vector member function that swaps out the internal data members is not a good idea. If you really need a std::vector<char> from data and size, you should construct one with:
std::vector<char> vec(data, data + size);

What's the difference between these two classes?

Below, I'm not declaring my_ints as a pointer. I don't know where the memory will be allocated. Please educate me here!
#include <iostream>
#include <vector>
class FieldStorage
{
private:
    std::vector<int> my_ints;
public:
    FieldStorage()
    {
        my_ints.push_back(1);
        my_ints.push_back(2);
    }
    void displayAll()
    {
        for (int i = 0; i < my_ints.size(); i++)
        {
            std::cout << my_ints[i] << std::endl;
        }
    }
};
And in here, I'm declaring the field my_ints as a pointer:
#include <iostream>
#include <vector>
class FieldStorage
{
private:
    std::vector<int> *my_ints;
public:
    FieldStorage()
    {
        my_ints = new std::vector<int>();
        my_ints->push_back(1);
        my_ints->push_back(2);
    }
    void displayAll()
    {
        for (int i = 0; i < my_ints->size(); i++)
        {
            std::cout << (*my_ints)[i] << std::endl;
        }
    }
    ~FieldStorage()
    {
        delete my_ints;
    }
};
main() function to test:
int main()
{
    FieldStorage obj;
    obj.displayAll();
    return 0;
}
Both of them produce the same result. What's the difference?

In terms of memory management, these two classes are virtually identical. Several other responders have suggested that there is a difference between the two in that one is allocating storage on the stack and other on the heap, but that's not necessarily true, and even in the cases where it is true, it's terribly misleading. In reality, all that's different is where the metadata for the vector is allocated; the actual underlying storage in the vector is allocated from the heap regardless.
It's a little bit tricky to see this because you're using std::vector, so the specific implementation details are hidden. But basically, std::vector is implemented like this:
template <class T>
class vector {
public:
    vector() : mCapacity(0), mSize(0), mData(0) { }
    ~vector() { if (mData) delete[] mData; }
    ...
protected:
    int mCapacity;
    int mSize;
    T *mData;
};
As you can see, the vector class itself only has a few members -- capacity, size and a pointer to a dynamically allocated block of memory that will store the actual contents of the vector.
In your example, the only difference is where the storage for those few fields comes from. In the first example, the storage is allocated from whatever storage you use for your containing class -- if it is heap allocated, so too will be those few bits of the vector. If your container is stack allocated, so too will be those few bits of the vector.
In the second example, those bits of the vector are always heap allocated.
In both examples, the actual meat of the vector -- the contents of it -- are allocated from the heap, and you cannot change that.
Everybody else has pointed out already that you have a memory leak in your second example, and that is also true. Make sure to delete the vector in the destructor of your container class.
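If you want to check this yourself, compare the address of the vector's elements with the address range of the object that contains the vector. The sketch below uses an invented Holder struct (standing in for FieldStorage, with a public member so it can be inspected) and a made-up helper.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a class that holds a vector by value.
struct Holder {
    std::vector<int> values;
};

// Hypothetical helper: true only if the vector's elements were stored
// inside the bytes of the Holder object itself.
bool elementsInsideHolder(const Holder& h) {
    // Addresses are converted to integers so that pointers into
    // unrelated memory blocks can be compared safely.
    const std::uintptr_t object = reinterpret_cast<std::uintptr_t>(&h);
    const std::uintptr_t elements =
        reinterpret_cast<std::uintptr_t>(h.values.data());
    return elements >= object && elements < object + sizeof(Holder);
}
```

For a Holder with at least one element this returns false whether the Holder is a local variable or was created with new: only the vector's few bookkeeping fields follow the containing object around.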

To prevent a memory leak in the second case, you have to release the memory allocated for the vector in the FieldStorage destructor.
FieldStorage::~FieldStorage()
{
    delete my_ints;
}

As Mykola Golubyev pointed out, you need to delete the vector in the second case.
The first will possibly build faster code, as the optimiser knows the full size of FieldStorage including the vector, and could allocate enough memory in one allocation for both.
Your second implementation requires two separate allocations to construct the object.

I think you are really looking for the difference between the Stack and the Heap.
The first one is allocated on the stack while the second one is allocated on the heap.

In the first example, the object is allocated on the stack.
In the second example, the object is allocated on the heap, and a pointer to that memory is stored on the stack.

The difference is that the second allocates the vector dynamically. This has a few consequences:
You have to release the memory occupied by the vector object (the vector object itself, not the objects kept in the vector, which the vector handles correctly). You should either hold the vector in a smart pointer or release it yourself (for example in the destructor):
delete my_ints;
The first one is probably more efficient because the vector object is allocated on the stack.
Access to the vector's methods uses different syntax :)
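For the smart-pointer route, a sketch of the second class using std::unique_ptr (C++11) could look like the following. The count() member is added here only so the result can be checked; it is not part of the original class.

```cpp
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

class FieldStorage {
public:
    FieldStorage() : my_ints(new std::vector<int>()) {
        my_ints->push_back(1);
        my_ints->push_back(2);
    }

    void displayAll() const {
        for (std::size_t i = 0; i < my_ints->size(); ++i) {
            std::cout << (*my_ints)[i] << std::endl;
        }
    }

    // Hypothetical accessor, added only for checking the example.
    std::size_t count() const { return my_ints->size(); }

    // No destructor needed: unique_ptr deletes the vector automatically.

private:
    std::unique_ptr<std::vector<int>> my_ints;
};
```

A side effect of this design is that FieldStorage can no longer be copied by accident, which also avoids the double delete that the raw-pointer version would cause if a copy were made.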

The first version of FieldStorage contains a vector. The size of the FieldStorage class includes enough bytes to hold a vector. When you construct a FieldStorage, the vector is constructed right before the body of FieldStorage's constructor is executed. When a FieldStorage is destructed, so is the vector.
This does not necessarily allocate the vector on the stack; if you heap-allocate a FieldStorage object, then the space for the vector comes from that heap allocation, not the stack. If you define a global FieldStorage object, then the space for the vector comes from neither the stack nor the heap, but rather from space designated for global objects (e.g. the .data or .bss section on some platforms).
Note that the vector performs heap allocations to hold the actual data, so it is likely to only contain a few pointers, or a pointer and couple of lengths, but it may contain whatever your compiler's STL implementation needs it to.
The second version of FieldStorage contains a pointer to a vector. The size of the FieldStorage class includes room for a pointer to a vector, not an actual vector. You are allocating storage for the vector using new in the body of FieldStorage's constructor, and you are leaking that storage when FieldStorage is destructed, because you didn't define a destructor that deletes the vector.

The first way is the preferable one to use: you don't need a pointer to a vector, and you have forgotten to delete it.

In the first case, the std::vector is being put directly into your class (and it is handling any memory allocation and deallocation it needs to grow and shrink the vector and free the memory when your object is destroyed; it is abstracting the memory management from you so that you don't have to deal with it). In the second case, you are explicitly allocating the storage for the std::vector in the heap somewhere, and forgetting to ever delete it, so your program will leak memory.

The size of the object is going to be different. In the second case the std::vector<int>* member only takes up the size of a pointer (4 bytes on 32-bit machines). In the first case, your object will be larger.

One practical difference is that in your second example, you never release either the memory for the vector or its contents (because you don't delete it, and therefore never call its destructor).
But the first example will automatically destroy the vector (and free its contents) upon destruction of your FieldStorage object.
