How is a vector<string> stored in memory? - c++

I am working on a project where I absolutely need my data to be contiguous in memory.
I want to store some strings (100 at most), and I don't know the actual size of each one. So I will create a vector of strings with room for 100 elements.
std::vector<std::string> vect;
vect.reserve(100);
But a string can be of any size. So how does it work? Is my vector reallocated every time I change a string? Or is a std::string simply like a pointer to the first character of the string, like a char* would be for a C string?

Each string will be an instance of class string and that instance will contain a char*.
The string objects in the vector will be in contiguous memory.
The chars of each string will be in contiguous memory
The chars of all the strings taken together will not be in contiguous memory, unless you define a custom std::allocator for the strings
The location in memory of the strings may change when you increase the size of the vector or call shrink_to_fit
The location in memory of the chars of each string may change when you increase the size of the string
The vector will not be reallocated if you modify or remove one of the strings
There is something called Small String Optimization. If that comes into play the chars of each string will be stored within the string instead of another location pointed to by char*
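The points above can be checked with a small experiment. This is only a sketch: GrowResult and grow_element are names made up for the illustration, and whether the character buffer moves depends on the implementation and on Small String Optimization. What the vector guarantees is that growing one element neither moves the string object nor reallocates the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// What happened to a vector element after its string grew.
struct GrowResult {
    bool object_moved;    // did the std::string object change address?
    bool chars_moved;     // did the character buffer change address?
    bool vector_resized;  // did the vector's capacity change?
};

// Appends `extra` characters to element `i` of `v` and reports what moved.
GrowResult grow_element(std::vector<std::string>& v, std::size_t i, std::size_t extra) {
    assert(i < v.size());
    const std::string* object_before = &v[i];
    const char* chars_before = v[i].data();
    const std::size_t capacity_before = v.capacity();

    v[i].append(extra, 'x');  // may reallocate the string's own buffer

    return GrowResult{
        &v[i] != object_before,
        v[i].data() != chars_before,
        v.capacity() != capacity_before,
    };
}
```

Calling grow_element on a vector of 100 short strings should report object_moved and vector_resized as false; chars_moved is usually true once the string outgrows its previous buffer.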

The data in a std::vector is laid out contiguously. However, the std::string implementation does not guarantee that the memory holding the character array is stored locally to the class itself. How could it? Like you said, you don't know how large the string will be.
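A lot of array-like structures have a layout like the following:
class string
{
    char* begin;     // first character
    char* end;       // one past the last character
    char* capacity;  // one past the end of the allocated storage
};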
Which means that your vector of 100 strings will have 100 instances of a class layout that POINTS to the memory where the string is stored.
Now if you need to pack memory allocations as tightly as possible and still want to use std::string you can write a custom allocator.
Alternatively, you can write the string data into a single char array and keep a second container that stores the length of each individual string (plus its NUL terminator).

The implementation of string is implementation-defined and has actually changed between different versions of certain compilers (for example from gcc 4.9 to 5.0). There is absolutely no guarantee that the chars in consecutive strings are contiguous in memory, even if you use a custom allocator.
So if you really need the chars to be contiguous in memory, you must use just a vector<char>.
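A minimal sketch of that idea, assuming NUL-terminated strings and an offset table instead of a std::string per entry. PackedStrings and its member names are hypothetical, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical container: every string lives in one contiguous char buffer,
// each followed by '\0', and offsets_ records where each one starts.
class PackedStrings {
public:
    // Appends a copy of `s` (including its terminator) and returns its index.
    // `s` must not point into this container's own buffer.
    std::size_t add(const char* s) {
        const std::size_t len = std::strlen(s);
        offsets_.push_back(chars_.size());
        chars_.insert(chars_.end(), s, s + len + 1);
        return offsets_.size() - 1;
    }

    // Pointer to string `i`. It is invalidated by the next add(),
    // because the underlying vector may reallocate.
    const char* get(std::size_t i) const {
        assert(i < offsets_.size());
        return chars_.data() + offsets_[i];
    }

    std::size_t count() const { return offsets_.size(); }
    std::size_t total_bytes() const { return chars_.size(); }

private:
    std::vector<char> chars_;           // all characters, back to back
    std::vector<std::size_t> offsets_;  // start of each string in chars_
};
```

Storing offsets rather than pointers keeps the index valid when the buffer reallocates; only the raw pointers returned by get() go stale.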

Related

Do a file containing newline-separated words and a vector of strings holding those words have the same size in C++?

The file is of the following form:
word1
word2
word3
...
And I create vector of strings after reading those words from the file like this:
std::vector<std::string> words;
std::string w;
std::ifstream file("input");
while (std::getline(file, w))
    words.push_back(w);
file.close();
Will the size of physical memory occupied by the vector be the same as the size of the input file? Why?
Will the size of physical memory occupied by the vector be the same as the size of the input file?
It depends on what you mean by "size of physical memory occupied by the vector". The size of the vector object itself is typically the size of 3 pointers (or 1 pointer and 2 numbers), such as 24 bytes on a 64-bit architecture. However, the vector then dynamically allocates space for at least N string objects, where N is the number of lines in the file. Note that if you do not reserve vector space, it will likely allocate more space than is needed for N strings.
Each string object has again some "internal" size (24 bytes with libc++/Clang, 32 bytes with libstdc++/GCC in my experiments).
And then, each string needs to store the text line. It might allocate memory dynamically, or for short strings it might employ small string optimization. With dynamic memory allocations you need to take some padding into account, since dynamically allocated buffers are aligned (to 16 bytes in my environment).
You therefore cannot easily compare memory occupations here. But, generally, there would be a lot of overhead with vector of strings.
If you want to avoid this overhead, simply read whole file content into a single char array (vector, string) and then create an additional array with pointers to where individual lines begin.
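One way to sketch that layout is shown below. LineIndex is a hypothetical helper, it stores offsets rather than raw pointers so the index stays valid if the buffer is moved, and reading the file into the buffer is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: one contiguous buffer plus the offset where each line starts.
struct LineIndex {
    std::string buffer;               // whole file content, lines separated by '\n'
    std::vector<std::size_t> starts;  // starts[i] = offset of line i in buffer

    explicit LineIndex(std::string text) : buffer(std::move(text)) {
        std::size_t pos = 0;
        while (pos < buffer.size()) {
            starts.push_back(pos);
            const std::size_t nl = buffer.find('\n', pos);
            if (nl == std::string::npos) break;  // last line has no newline
            pos = nl + 1;
        }
    }

    std::size_t line_count() const { return starts.size(); }

    // Copies line `i` out of the buffer, without its trailing newline.
    std::string line(std::size_t i) const {
        assert(i < starts.size());
        const std::size_t end = buffer.find('\n', starts[i]);
        const std::size_t len =
            end == std::string::npos ? std::string::npos : end - starts[i];
        return buffer.substr(starts[i], len);
    }
};
```

The memory cost is then roughly the file size plus one std::size_t per line, instead of one std::string object and possibly one heap allocation per line.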
The vector has two memory footprints: sizeof(vector) is the memory used by the vector object itself, on the stack if it is a local variable (usually 24 bytes on a 64-bit platform), and then there is the memory dynamically allocated, first by the vector and second by the string elements.
The vector may well allocate more memory than it actually needs to hold all the strings: if you grow it by push_back (or emplace_back), it grows the dynamically allocated memory geometrically (typically by a factor of 1.5 or 2, depending on the implementation) whenever it runs out of capacity.
The strings, finally, have their own overhead: for short words (shorter than sizeof(string)) unused memory in string is wasted, while for long words the string must allocate dynamic memory and keep a separate pointer (causing memory overhead).
Thus, the answer is: NO, vector<string> takes more space (which may be distributed between stack and different places on the heap).
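The effect of growing by push_back can be observed with a sketch like the following. count_capacity_changes is a name made up for the illustration, and the exact number of capacity changes without reserve() depends on the implementation's growth factor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts how often the vector's capacity changes while `n` strings are pushed.
std::size_t count_capacity_changes(std::size_t n, bool reserve_first) {
    std::vector<std::string> words;
    if (reserve_first) words.reserve(n);  // one allocation up front

    std::size_t changes = 0;
    std::size_t capacity = words.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        words.push_back("word");
        if (words.capacity() != capacity) {  // the vector reallocated
            capacity = words.capacity();
            ++changes;
        }
        assert(words.capacity() >= words.size());
    }
    return changes;
}
```

With reserve_first set to true the count is 0; without it, the count is small compared to n because the capacity grows geometrically, which is also why the final capacity can be noticeably larger than n.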

How does memory allocation work for nested containers?

For example, I have a std::vector<std::string>. How do the allocators for vector and string work together?
Say the allocator for vector allocates a chunk of memory ChunkVec. Does the allocator for string allocate memory inside ChunkVec, so that the memory allocated for each string sums to ChunkVec? Or does the allocator for string allocate memory outside ChunkVec?
Is the answer the same for other nested containers?
And is there a difference between C++ and C++11?
i have std::vector < std::string >
On my Ubuntu 15.04, 64 bit, a std::string is 8 bytes, regardless of contents.
(using std::string s1; I am comparing sizeof(std::string) versus s1.size(). Then append to the string and then print them both again.)
I have not noticed or found a way to specify what allocator to use when the string allocates its data from the heap, therefore, I believe it must use some standard allocator, probably new, but I have never looked into the std::string code. And that standard allocator would know nothing about your vector.
does the allocator for string allocate memory inside ChunkVec so that
the memory allocated for each string sums to ChunkVec?
I believe the part of the string in a vector element is only the 8 byte pointer to where the string 'proper' resides in the heap. So no.
Or the allocator for string allocates memory outside ChunkVec?
Yes, I believe so.
You can confirm this by printing the addresses of the vector elements i and i+1, and the address of some of the chars of element i.
By the way, on my implementation (g++ 4.9.2) , sizeof(std::vector) is 24 bytes, regardless of the number of data elements (vec.size()) and regardless of element size. Note also, that I have read about some implementations where some of a small vector might actually reside in the 24 bytes. Implementation details can be tedious, but helpful. Still, some might be interested in why you want to know this.
Be aware we are talking about implementation details (I think) ... so your exploration might vary from mine.
Is the answer the same for other nested containers?
I have not explored every container (but I have used many "std::vector< std::string >").
Generally, and without much thought, I would guess not.
And is there a difference between C++ and C++11?
Implementation details change for various reasons, including language feature changes. What have you tried?
ChunkVec stores only the pointer to the data allocated by the string (more precisely, it stores a std::string object, which in turn stores the pointer). It's a totally different allocation. A good way to understand it is to look at a tree structure:
struct node
{
int data;
struct node* left;
struct node* right;
};
left and right are different memory allocations than node. You can remove them without removing this very node.
std::string has two things to store--the size of the string and the content. If I allocate one on the stack, the size will be on the stack as well. For short strings, the character data itself will also be on the stack. These two items make up the "control structure". std::string only uses its allocator for long strings that don't fit in its fixed-size control structure.
std::vector allocates memory to store the control structure of the std::string. Any allocation required by std::string to store long strings could be in a completely different area of memory than the vector. Short strings will be entirely managed by the allocator of std::vector.
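To see when the string's own allocator is actually used, one can plug a counting allocator into std::basic_string. This is a sketch: CountingAllocator, CountedString and allocations_for are hypothetical names, and the length at which a string stops fitting into its control structure differs between standard library implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Number of allocate() calls made through any CountingAllocator (illustration only).
static std::size_t g_allocations = 0;

// Minimal C++11 allocator that counts calls and forwards to operator new/delete.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++g_allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

using CountedString =
    std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;

// Returns how many allocator calls it takes to build a string of `length` chars.
std::size_t allocations_for(std::size_t length) {
    const std::size_t before = g_allocations;
    CountedString s(length, 'x');
    assert(s.size() == length);
    return g_allocations - before;
}
```

A 1000-character string needs at least one call to the allocator. On implementations with a short string optimization, a very short string typically needs none, but that is an implementation detail rather than a guarantee.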

C++ string / container allocation

This is probably obvious to a C++ non-noob, but it's stumping me a bit - does a string member of a class allocate a variable amount of space in that class? Or does it just allocate a pointer internally to some other space in memory? E.g. in this example:
class Child {
public:
    string Name;
};
class Parent {
public:
    vector<Child> Children;
};
How is that allocated on the heap if I create a "new Parent()" and add some children with varying length strings? Is Parent 4 bytes, Child 4 bytes (or whatever the pointer size, plus fixed size internal data), and then a random pile of strings somewhere else on the heap? Or is it all bundled together in memory?
I guess in general, are container types always fixed size themselves, and just contain pointers to their variable-sized data, and is that data always on the heap?
Classes in C++ are always fixed size. When there is a variable sized component, e.g., the elements of a vector or the characters in a string, they may be allocated on the heap (for small strings they may also be embedded in the string itself; this is known as the small string optimization). That is, your Parent object would contain a std::vector<Child> where the Child objects are allocated on the heap (the std::vector<...> object itself probably keeps three words to its data but there are several ways things may be laid out). The std::string objects in Child allocate their own memory. That is, there may be quite a few memory allocations.
The C++ 2011 standard thoroughly defines allocators to support passing an allocation mechanism to an object and all its children. Of course, the classes also need to support this mechanism. If your Parent and Child classes had suitable constructors taking an allocator and passed this allocator on to all members doing allocations, it would be propagated through the system. This way, objects that belong together can be allocated in reasonably close proximity.
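As an illustration of allocator propagation, here is a sketch that swaps in the C++17 polymorphic allocators (std::pmr) rather than the C++11 scoped allocator machinery the paragraph refers to, because they are shorter to write down. The helper names points_into and everything_in_buffer are made up for the example, and it requires a standard library that ships <memory_resource>.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory_resource>
#include <string>
#include <vector>

// Returns true if `p` points into [begin, begin + size).
bool points_into(const void* p, const unsigned char* begin, std::size_t size) {
    std::less<const void*> before;  // total order, even for unrelated pointers
    return !before(p, begin) && before(p, begin + size);
}

// Builds a pmr vector of pmr strings on top of one caller-supplied buffer and
// checks that every string object and every character ended up inside it.
bool everything_in_buffer(unsigned char* buffer, std::size_t size) {
    // null_memory_resource() makes the pool throw std::bad_alloc when the
    // buffer is exhausted instead of silently falling back to the heap.
    std::pmr::monotonic_buffer_resource pool(buffer, size,
                                             std::pmr::null_memory_resource());

    // The vector hands its allocator down to each string it constructs.
    std::pmr::vector<std::pmr::string> names(&pool);
    names.emplace_back("a name that is long enough to defeat any small string buffer");
    names.emplace_back("short");
    names.emplace_back("another name that is also far too long to be stored inline");

    assert(names.size() == 3);
    for (const std::pmr::string& s : names) {
        if (!points_into(&s, buffer, size)) return false;        // string object
        if (!points_into(s.data(), buffer, size)) return false;  // its characters
    }
    return true;
}
```

Called with a suitably aligned buffer of a few kilobytes, the function should return true: the string objects and their characters all come from the same region, which is the "close proximity" effect described above.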
Classes in C++ always have a fixed size. Therefore vector and string can only contain pointers to heap-allocated memory* (although they typically contain more data than one pointer, since they also need to store the length). Therefore the object itself always has a fixed size.
*For string this is not entirely correct. Often an optimization technique called short string optimization is used. In that case small strings are embedded inside the object (in the place where otherwise the pointer to heap data would be stored) and heap memory is only allocated if the string is too long.
Yes, using your words: container types are always fixed size themselves, and just contain pointers to their variable-sized data.
If we have vector<int> vi;, the size of vi is always fixed, sizeof(vector<int>) to be exact, irrespective of the number of int's in vi.
does a string member of a class allocate a variable amount of space in that class?
No, it does not.
Or does it just allocate a pointer internally to some other space in memory?
No, it does not.
A std::string occupies whatever sizeof(std::string) is.
Do not confuse
the size of an object
with the size of the resources that an object is responsible for.

Allocate constant strings in container contiguously

Let's say I have a std::vector of const std::strings.
std::vector<const std::string> strs;
Now the default behavior here is that the actual string containers can be allocated anywhere on the heap, which pretty much disables any prefetching of data when iterating over the contained strings.
strs.push_back("Foo"); // allocates char block on heap
strs.push_back("Boo"); // allocates char block on heap
However, since the strings are "const" I would like the char blocks to be allocated contiguously or close to each other (when possible) in order to have the most efficient cache behavior when iterating over the strings.
Is there any way to achieve this behavior?
You need a custom allocator known as a memory region allocator. You can look on Wikipedia or Google for more information, but the basic idea is something akin to the hardware stack: allocate one large chunk and then simply increment a pointer to mark memory as used. It can serve many consecutive requests very quickly but can't free individual allocations; all freeing is done at once.
If it really is that simple (pushing strings that will never change), it is easy to write your own allocator. Allocate a large block of memory and set a pointer free to offset 0 in the block. When you need storage for a new string, copy it to free (including its terminating '\0') and advance free by strlen plus one. Keep track of the end of the memory block and allocate another block when needed.
Not really.
std::string isn't a POD; it doesn't keep its contents "inside of the object". What's more, before C++11 it wasn't even required to store its contents in a single memory block.
Also a std::vector (as all arrays) needs its contents to be of one type (= of equal size), so you can't make a literal "array" of strings of different lengths.
Your best shot is to assume a maximum length N and use std::vector<std::array<char, N>>.
If you need really different lengths, an alternative is just a std::vector<char> for the data plus a std::vector<unsigned> for the indices where consecutive strings start.
Rolling your own allocator for the string is a tempting idea, you could base it on std::vector<char> and then roll up your own std::basic_string on it, then make a collection of those.
Note that you are actually depending much on a specific std::string implementation. Some do have an internal buffer of N chars and only allocate memory externally if the string length is bigger than the buffer. If that's the case on your implementation, you still wouldn't get a contiguous memory for the whole buffer of strings.
On those grounds, I conclude that with std::string you generally won't be able to accomplish what you want (unless you rely on a specific STL implementation), and you need to provide another string implementation to suit your needs.
A custom allocator is great, but why not store all the strings in a single std::vector<char> or std::string, and access the original strings by offset?
Simple and effective.
You can always write a custom allocator (the second template parameter of std::vector) that allocates all the string objects from a contiguous pool. You can also use std::basic_string instead of std::string (which is a special case of std::basic_string); it lets you specify your own allocator for the characters in the same way. Generally I would say it's a case of "premature optimization", but I trust you've measured and seen a performance hit here... I believe the price to pay would be some memory wasted, though.
A vector is guaranteed to be contiguous memory and is interoperable with an array. It is not a singly linked list.
"Contiguity is in fact part of the vector abstraction. It’s so important, in fact, the C++03 standard was amended to explicitly add the guarantee."
Source : http://herbsutter.com/2008/04/07/cringe-not-vectors-are-guaranteed-to-be-contiguous/
Use reserve() to avoid reallocation while the vector grows; the string objects are contiguous either way.
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
using namespace std;
int main()
{
    // create empty vector for strings
    // (vector<const string> does not compile with the default allocator,
    // so the elements are plain strings)
    vector<string> sentence;
    // reserve memory for five elements to avoid reallocation
    sentence.reserve(5);
    // append some elements
    sentence.push_back("Hello,");
    sentence.push_back("how");
    sentence.push_back("are");
    sentence.push_back("you");
    sentence.push_back("?");
    // print elements separated with spaces
    copy(sentence.begin(), sentence.end(),
         ostream_iterator<string>(cout, " "));
    cout << endl;
    return 0;
}

C++: How does string vectors' random access time work?

I know a simple int vector has O(1) random access time, since it is easy to compute the position of the xth element, given that all elements have the same size.
Now what's up with a string vector?
Since the string lengths vary, it can't have O(1) random access time, can it? If it can, what is the logic behind it?
Thanks.
Update:
The answers are very clear and concise, thank you all for the help.
I accepted Joey's answer because it is simple and easy to understand.
The vector does have O(1) access time.
String objects are all the same size (on a given implementation), regardless of the size of the string they represent. Typically the string object contains a pointer to allocated memory which holds the string data.
So, if s is a std::string, then sizeof s is constant and equal to sizeof(std::string), but s.size() depends on the string value. The vector only cares about sizeof(std::string).
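The address computation behind that O(1) access can be written out by hand. This is a sketch for illustration only (element_address is a made-up helper; real code would simply write v[i]), showing that only sizeof(std::string) enters the calculation and never the length of any string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Computes the address of element `i` the way indexing does conceptually:
// base address plus i times the fixed object size.
const std::string* element_address(const std::vector<std::string>& v, std::size_t i) {
    assert(i < v.size());
    const char* base = reinterpret_cast<const char*>(v.data());
    return reinterpret_cast<const std::string*>(base + i * sizeof(std::string));
}
```

For any valid index the result equals &v[i], whether the element holds one character or five hundred.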
The string references are stored in one location. The strings may be stored anywhere in memory. So, you still get O(1) random access time.
---------------------------
| 4000 | 4200 | 5000 | 6300 | <- data
---------------------------
[1000] [1004] [1008] [1012] <- address
[4000] [4200] [5000] [6300] <- starting address
"string1" "string2" "string3" "string4" <- string
Because the string object has a fixed size just like any other type. The difference is that a string object stores its own string on the heap, and it keeps a pointer to the string, which is fixed in size.
The actual string in a std::string is usually just a pointer. The sizeof a string is always the same, even if the length of the string it holds varies.
You've gotten a number of answers (e.g., Steve Jessop's and AraK's) that are mostly correct already. I'll add just one minor detail: many current implementations of std::string use what's called a short string optimization (SSO), which means they allocate a small, fixed, amount of space in the string object itself that can be used to store short strings, and only when/if the length exceeds what's allocated in the string object itself does it actually allocate separate space on the heap to store the data.
As far as a vector of strings goes, this makes no real difference: each string object has a fixed size regardless of the length of the string itself. The difference is that with SSO that fixed size is larger, and in many cases the string object does not have a block allocated on the heap to hold the actual data.