Uses for the capacity value of a string - c++

In the C++ Standard Library, std::string has a public member function capacity() that returns the size of the internally allocated storage, a value greater than or equal to the number of characters in the string (according to the reference documentation). What can this value be used for? Does it have something to do with custom allocators?

You are more likely to use the reserve() member function, which sets the capacity to at least the supplied value.
The capacity() member function itself might be used to avoid allocating memory. For instance, you could recycle used strings through a pool, and put each one in a different size bucket based on its capacity. A client of the pool could then ask for a string that already has some minimum capacity.
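A rough sketch of the pooling idea, in case it helps. StringPool, release and acquire are made-up names for illustration, not anything from the standard library, and the sketch assumes (as common implementations do, though the standard does not promise it) that clear() and moving a string keep the allocated buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical pool: recycled strings are indexed by their capacity().
class StringPool {
public:
    // Hand a used string back to the pool. clear() keeps the buffer on
    // common implementations, which is what makes recycling worthwhile.
    void release(std::string s) {
        s.clear();
        const std::size_t cap = s.capacity();
        free_.emplace(cap, std::move(s));
    }

    // Return an empty string with capacity() >= min_capacity, reusing a
    // pooled buffer when one is large enough.
    std::string acquire(std::size_t min_capacity) {
        std::string s;
        auto it = free_.lower_bound(min_capacity);
        if (it != free_.end()) {
            s = std::move(it->second);
            free_.erase(it);
        }
        s.reserve(min_capacity);  // allocates only if the buffer is too small
        assert(s.capacity() >= min_capacity);
        return s;
    }

    std::size_t available() const { return free_.size(); }

private:
    std::multimap<std::size_t, std::string> free_;
};
```

A client would call pool.acquire(500) instead of constructing a fresh string, and pool.release(std::move(s)) when done with it.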

The string::capacity() function returns the total number of characters the std::string can contain before it has to reallocate memory, which is quite an expensive operation.
A std::vector works in the same way, so I suggest you look up std::vector for a detailed explanation of the difference between allocated and initialized memory.
Update
Hmm, I can see I misread the question. Actually, I don't think I've ever used capacity() myself on either std::string or std::vector; there seldom seems to be any reason to, as you have to call reserve() anyway.

It gives you the number of characters the string could contain without having to re-allocate. I suppose this might be important in a situation where allocation was expensive, and you wanted to avoid it, but I must say this is one string member function I've never used in real code.

It could be used for some performance tuning if you are about to add a lot of characters to the string. Before starting the string manipulation, you can check for the capacity and if it is too small, reserve the desired length in a single step (instead of letting it reallocate successively bigger chunks of memory several times, which would be a performance hog).
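Something along these lines is what I have in mind. append_repeated is just an illustrative helper name, and the "needed" figure is only known here because the number of pieces is given up front.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append `count` copies of `piece` to `out`, reserving the final length
// in a single step before the loop starts.
void append_repeated(std::string& out, const std::string& piece, std::size_t count) {
    const std::size_t needed = out.size() + piece.size() * count;
    if (out.capacity() < needed) {
        out.reserve(needed);  // one allocation instead of several growing ones
    }
    assert(out.capacity() >= needed);
    for (std::size_t i = 0; i < count; ++i) {
        out += piece;  // fits in the reserved storage
    }
}
```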

Strings have a capacity and a size. The capacity indicates how many characters the string can hold before it has to allocate more memory. The size indicates how many characters it currently holds. reserve() can be used to set the minimum capacity of the string (it will allocate memory for at least that number of characters but could allocate more).
This is primarily of importance if you're increasing the size of the string. When you concatenate onto the string with += or append(), the characters from the given string are added to the end of the current one. If the new size does not exceed the capacity, the string just uses the storage that it already has. However, if the new size would exceed the current capacity, the string has to reallocate memory internally and copy its contents into the new memory. If you're going to be doing that a lot, it can get expensive (though appending is amortized constant time), so in such a case you could use reserve() to preallocate enough memory to reduce how often reallocations have to take place.
vector works in basically the same way, with the same member functions.
Personally, while I've dealt with capacity() and reserve() with vector from time to time, I've never seen much need to do so with string - probably because I don't generally do enough string concatenations in my code for it to be worth it. In most cases, a particular string might get a few concatenations but not enough to worry about its capacity. Worrying about capacity is generally something you do when trying to optimize your code.
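If you want to see the effect for yourself, a small counter like the following works. count_capacity_changes is a made-up helper, and the exact number of capacity changes without reserve() depends on your standard library's growth policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count how many times capacity() changes while appending n characters
// one at a time, optionally calling reserve(n) first.
std::size_t count_capacity_changes(std::size_t n, bool reserve_first) {
    std::string s;
    if (reserve_first) {
        s.reserve(n);
    }
    std::size_t changes = 0;
    std::size_t last = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');
        if (s.capacity() != last) {  // the string reallocated
            ++changes;
            last = s.capacity();
        }
    }
    assert(s.size() == n);
    return changes;
}
```

On the implementations I know of, the reserve_first variant reports no changes at all, while the other one reports a handful that grows roughly with the logarithm of n.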

There's hardly any relevant use. It is similar to std::vector::capacity. However, one of the most common uses of strings is assignment. When assigning to a std::string, its .capacity may change. This means that an implementation has the right to ignore the old capacity and allocate precisely enough memory.

It genuinely isn't very useful, and is probably there only for symmetry with vector (under the assumption that both will operate internally in the same way).
The capacity of a vector is guaranteed to affect the behaviour of a resize. Resizing a vector to a value less than or equal to the capacity will not induce a reallocation, and will not therefore invalidate iterators or pointers referring to elements in the vector. This means you can pre-allocate some storage by calling reserve on a vector, then (with care) add elements to it by resizing or pushing back (etc.), safe in the knowledge that the underlying buffer won't move.
There is no such guarantee for string, though. It seems that the capacity is for informational purposes only -- though even that's a stretch, as it doesn't look like there's any useful information to be taken from it anyway. (Worse yet, before C++11 contiguity of string chars wasn't guaranteed either, so the only portable way to get at the string as a linear buffer was c_str() -- which could induce a reallocation. Since C++11 the characters are required to be stored contiguously.)
At a guess, string was presumably originally intended to be implemented as some kind of a special case of vector, but over time the two grew apart...
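To make the vector guarantee mentioned above concrete, here is a minimal check. buffer_stays_put is an illustrative name; the point is only that the address of the first element cannot change while the size stays within the reserved capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill a vector up to its reserved capacity and report whether the
// underlying buffer stayed at the same address the whole time.
bool buffer_stays_put(std::size_t n) {
    assert(n > 0);
    std::vector<int> v;
    v.reserve(n);
    v.push_back(0);
    const int* first = &v[0];  // pointer into the reserved buffer
    for (std::size_t i = 1; i < n; ++i) {
        v.push_back(static_cast<int>(i));  // size never exceeds capacity
    }
    return first == &v[0];
}
```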

Related

How do STL containers keep track of the current size of the container versus its total capacity?

Given
vector<int> a;
If a.push_back() is called, how does the vector know whether to grow by reallocating memory or whether there is already space available? (A vector allocates some extra space when it fills up, to reduce overhead.)
P.S. Does the same technique apply to other types of containers, like stack, queue, etc.?
I think that it does the same thing as "struct" in C.
The method capacity() returns the number of items that can be stored in the vector without a reallocation.
The method size() returns the number of items which are currently stored in the vector.
Prior to inserting another item, it stands to reason that if size() == capacity() then more capacity will need to be made available. This will involve a reallocation to make more capacity available.
Does the same technique apply to other types of containers, like stack, queue, etc.?
stack and queue are built on top of other std containers. These underlying containers (normally vector or deque) employ a similar technique.
I think that it does the same thing as "struct" in C.
No.
In general, a vector (and incidentally a List in C#) will allocate a block of memory. As you add elements to it, it will mark more and more of that memory as consumed. Then, when the block is full, it will allocate a new, larger block, copy the contents into it, and delete the old one. The new block again has free space, which can be filled up in turn. The idea is that a vector always occupies contiguous space, so it can be used in applications where one would consider an array. Because the space is contiguous, the machine instructions for accessing a single element are trivial, and so random access is very fast. List in C# has similar semantics. The implementation-dependent part is how much bigger the new block is. Some implementations grow it by a fixed percentage; others simply double the size.
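The grow-and-copy scheme described above can be sketched with a toy class. IntArray is purely illustrative (int-only, doubling growth, no copy support); it is not how any real std::vector is written.

```cpp
#include <cassert>
#include <cstddef>

// Toy growable array of int: tracks size and capacity separately and
// moves to a block twice as large whenever the current one is full.
class IntArray {
public:
    IntArray() = default;
    IntArray(const IntArray&) = delete;
    IntArray& operator=(const IntArray&) = delete;
    ~IntArray() { delete[] data_; }

    void push_back(int value) {
        if (size_ == capacity_) {
            grow();  // no room left: reallocate first
        }
        data_[size_++] = value;
    }

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }

    int operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    void grow() {
        const std::size_t new_cap = (capacity_ == 0) ? 1 : capacity_ * 2;
        int* bigger = new int[new_cap];
        for (std::size_t i = 0; i < size_; ++i) {
            bigger[i] = data_[i];  // copy the contents into the larger block
        }
        delete[] data_;  // release the old block
        data_ = bigger;
        capacity_ = new_cap;
    }

    int* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
};
```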

Can you predict where in memory a vector might move when growing?

I'm learning about C++ and have a conceptual question. Let's say I have a vector. I know that my vector is stored in contiguous memory, but let's say my vector keeps growing and runs out of room to keep the memory contiguous. How can I predict where in memory the vector will go? I'm excluding the option of using functions that tell the vector where it should be in memory.
If it "runs out of room to keep the memory contiguous", then it simply won't grow. Attempting to add items past the currently allocated size will (typically) result in its throwing an exception (though technically, it's up to the allocator object to decide what to do: it's responsible for memory allocation, and for responding when that's not possible).
Note, however, that this could result from running out of address space (especially on a 32-bit machine) rather than running out of actual memory. A typical virtual memory manager can reallocate physical pages (e.g., 4 KB or 8 KB chunks) and write data to the paging file if necessary to free physical memory if needed--but when/if there's not enough contiguous address space, there's not much that can be done.
The answer depends highly on your allocation strategy, but in general, the answer is no. Most allocators do not provide you with information where the next allocation will occur. If you were writing a custom allocator, then you could potentially make this information accessible, but doing so is not necessarily a good idea unless your use case specifically requires this knowledge.
The realloc function is the only C function which will attempt to grow your memory in place, and it makes no guarantees that it will do so.
Neither new nor malloc provide any information for where the "next" allocation will take place. You could potentially guess, if you knew the exact implementation details for your specific compiler, but this would be very unwise to rely on in a real program. Regarding specifically the std::allocator used for std::vector, it also does not provide details about where future allocations will take place.
Even if you could predict it in a particular situation, it would be extremely fragile - all it takes is one function you call to change to make another call to new or malloc [unless you are using a very specific allocation method - which is different from the "usual" method] to "break" where the next allocation is made.
If you KNOW that you need a certain size, you can use std::vector::resize() to set the size of the vector up front (or std::vector<int> vec(10000); to create one pre-sized to 10000 elements, for example). That is of course not guaranteed to succeed, but it does mean that you never need "enough space to hold 3x the current content", which is what happens when a std::vector grows through push_back on an implementation that doubles the capacity: the old block and the new block, twice as large, both exist while the elements are copied. And if you are REALLY unlucky, your vector ends up with room for 2*n elements while leaving n-1 of them unused, because the size was n, the block was full, and adding ONE more element doubled the capacity.
The internal workings of STL containers are kept private for good reasons. You should never be accessing any container elements through any mechanism other than the appropriate iterators; and it is not possible to acquire one of those on an element that does not yet exist.
You could however, supply an allocator and use that to deterministically place future allocations.
Can you predict where in memory a vector might move when growing?
As others like EJP, Jerry and Mats have said, you cannot determine the location of a "grown" vector until after it grows. There are some corner cases, like the allocator providing a block of memory that's larger than required so that the vector does not actually move after a grow. But it's not something you should depend on.
In general, stacks grow down and heaps grow up. This is an artifact from the old memory days. Your code segment was sandwiched between them, and it ensured your program would overwrite its own code segment and eventually cause an illegal instruction. So you might be able to guess the new vector is going to be higher in memory than the old vector because the vector is probably using heap memory. But it's not really useful information.
If you are devising a strategy for locating elements after a grow, then use an index and not an iterator. Iterators are invalidated after inserts and deletes (including the grow).
For example, suppose you are parsing the vector and you are looking for the data that follows -----BEGIN CERTIFICATE-----. Once you know the offset of the data (index 27 in the vector if the marker sits at the very start, since the marker is 27 characters long), you can always relocate it in constant time with v.begin() + 27. If you only have part of the certificate and later add the tail of the data and the -----END CERTIFICATE----- (and the vector grows), then the data is still located at v.begin() + 27.
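A small sketch of the index approach. offset_after and append are illustrative helpers I made up for this; the offset is computed with std::search rather than hard-coded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return the index just past the first occurrence of `marker` in `buf`,
// or buf.size() if the marker is absent.
std::size_t offset_after(const std::vector<char>& buf, const std::string& marker) {
    auto it = std::search(buf.begin(), buf.end(), marker.begin(), marker.end());
    if (it == buf.end()) {
        return buf.size();
    }
    const std::size_t off = static_cast<std::size_t>(it - buf.begin()) + marker.size();
    assert(off <= buf.size());
    return off;
}

// Append text to the buffer. This may reallocate, which invalidates every
// iterator and pointer into buf, but a stored index stays meaningful.
void append(std::vector<char>& buf, const std::string& text) {
    buf.insert(buf.end(), text.begin(), text.end());
}
```

Store the result of offset_after() once, keep appending, and index with buf[off] (or buf.begin() + off, recomputed after each growth) whenever you need the data again.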
No, in practical terms you can't predict where it will go if it has to move due to resizing. However, it isn't so random that you could use it as a random number generator (;

Does the standard guarantee that the total memory occupied by a std::vector scales as C+N*sizeof(T)?

The C++ standard provides the guarantee that the content of a std::vector is stored contiguously. But does it state that the total occupied memory is:
S = C+N*sizeof(T)
where:
S is the total size on the stack AND on the heap
C is the total size on the stack: C = sizeof(std::vector)
N is the capacity of the vector
T is the type stored
In other words, do I have the guarantee that there is no overhead per element?
And if I have no such guarantee, is there any reason for that?
EDIT: to be clear, if I take the example of a std::list, it generally stores 2 extra pointers per element. So my question is: would such an implementation of a std::vector be standard-compliant?
For there to be any such guarantee, the standard would have to pass the requirement on to the interface of the allocator. It doesn't, so there isn't.
In practice though, as a quality of implementation issue, you expect that memory allocators probably have a constant overhead per allocation but no overhead proportional to the size of the allocation. A counter-example to this would be a memory allocator that always uses a power-of-two-sized block regardless of the size requested. This would be pretty wasteful for large allocations, but not forbidden either as a user-defined allocator or even as the system allocator used by ::operator new[]. It would create an overhead proportional to N on average, assuming that the vector capacities don't happen to fit nicely.
Leaving aside the allocator, I don't believe there's anything in the standard to say that the vector can't allocate (for example) an extra byte per element and use it to store some flags for who-knows-what purpose. As others have remarked, the contiguousness requirement means that those extra bytes cannot lie between the vector elements. They would have to be in a separate allocation or all together at one end of the allocation.
There's at least one good reason that the standard doesn't forbid implementations from "wasting" space by using it to store data used for operations not required by the standard -- doing so would rule out many debugging techniques!
Do I have the guarantee that there is no overhead per element?
Does the standard prohibit it? No.
But would you ever expect to see this in practice? No.
The rule of contiguous data storage and the complexity requirements of vector growth mean that the only possible way for a non-constant-sized data block to be part of the vector would be if it were emplaced directly before the dynamically-allocated element data, or somewhere else entirely. There is no guarantee that this doesn't happen, but, quite simply, no implementation does it because it would be entirely ridiculous and serve no purpose whatsoever.
Does it state that the total occupied memory is:
S = C+N*sizeof(T)
There may be other data members of the vector itself (what you've inaccurately deemed to be "on the stack"), increasing the object's size in constant terms.
The standard gives no guarantee, as far as I can see. But the requirement that the elements be stored contiguously makes it likely that there is no per-element overhead. The whole data must be in a memory area which was allocated in one piece. @aschepler remarked correctly, though, that typical free store implementations have a (constant) overhead per allocation unit, typically a size variable or an end pointer.
Additionally there may be some padding overhead, e.g. an allocation unit will probably span multiples of the natural word size on a machine. And then the OS call will likely reserve a whole memory page to the program, even if you allocate only 1 byte. Whether you consider that as overhead or not is a matter of taste (from the outside yes, from the inside of the program no; and of course subsequent vectors or resize()s dine from the same page).
So at least it's CM + CV + N*sizeof(T), with CV being the overhead in the vector object itself (not necessarily on the stack, as Lighness said) and CM the overhead of the memory management.
No, the implementation characteristics you suggest would not be standard compliant. The STL specifies that a std::vector support appending individual elements in amortized constant time.
In order for the amortized cost of inserting an element to be O(1), the size of the array must increase in at least a geometric progression when it is reallocated (see here). A geometric progression means that if the size of the array was N, the new size after reallocation must be K * N, for some K > 1. The choice of K is implementation dependent.
To find out how much space a std::vector has allocated, call std::vector::capacity(). With regard to overhead per element, in the best case the capacity() == size(). In the worst case capacity() == K * (size() - 1).
If you must ensure that your vector is no larger than it has to be, you can call std::vector::reserve() up front if you know exactly how large your std::vector will be. After you are done adding elements, you may also call std::vector::shrink_to_fit() (C++11) to ask for the unused capacity to be released; this is a non-binding request. Note that std::vector::resize() changes the size, not the capacity, so shrinking with it does not reduce the amount of memory reserved.
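For a rough idea of the numbers involved, something like this can be used. approx_footprint and build_compact are illustrative names, and the figure is only a lower bound, because allocator bookkeeping and padding are not visible from inside the program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lower bound on the bytes held by v: the vector object itself plus its
// element buffer. Allocator overhead is not counted.
template <typename T>
std::size_t approx_footprint(const std::vector<T>& v) {
    return sizeof(v) + v.capacity() * sizeof(T);
}

// Build a vector of n ints with push_back, then ask for the slack that
// geometric growth left behind to be released.
std::vector<int> build_compact(std::size_t n) {
    std::vector<int> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
    }
    v.shrink_to_fit();  // non-binding request; capacity() may stay larger
    assert(v.capacity() >= v.size());
    return v;
}
```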

c++ unordered_map is there a way to pre-allocate memory for elements if max size known in advance

It looks like the reserve/rehash functions only pre-allocate the buckets, not memory for the (key, value) pairs that will be inserted.
Is there a way to pre-allocate memory for the elements as well, so that low-latency apps don't need to waste time on dynamic memory allocation?
One possibility would be to write your own allocator. This can be especially effective if you have at least a fair idea of how many items are likely to go in the table (so you can pre-allocate space for all of them) and don't care about re-using space for items if they're removed from the table (so your bookkeeping is simple).
In such a case, you can basically pre-allocate space for N objects, and simply keep track of the position of the next item to be allocated. Allocating an object then consists of simply returning the current address and incrementing the pointer, as in return next++;
Of course, this doesn't truly eliminate the dynamic allocation; it just makes it cheap enough that you probably don't care about it any more (and since the allocator is supplied as a template parameter, there's a good chance of its being expanded inline, so you don't even get the overhead of a function call in the process).
Even if you can't put up with quite that restrictive of an allocator, a general-purpose allocator for fixed-size objects will still usually be (at least somewhat) faster than one for variable sizes of objects. It still won't eliminate the dynamic allocation, but it may give enough improvement in speed to work quite a bit better for your purpose.
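Here is a minimal sketch of the simple "bump" allocator described above, wired into an unordered_map<int, int>. Arena, BumpAllocator, PooledMap and make_map are all names I made up. The sketch assumes the arena outlives the map, that the arena's buffer is aligned well enough for the node type, and that you are fine with freed nodes never being reused.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical fixed arena: hands out bytes front to back, never reuses them.
struct Arena {
    explicit Arena(std::size_t bytes) : storage(bytes), used(0) {}

    void* take(std::size_t bytes, std::size_t align) {
        // Round the offset up to the alignment (assumes an aligned base address).
        const std::size_t start = (used + align - 1) / align * align;
        if (start + bytes > storage.size()) {
            throw std::bad_alloc();
        }
        used = start + bytes;
        return storage.data() + start;
    }

    std::vector<unsigned char> storage;
    std::size_t used;
};

// Minimal C++11 allocator that forwards to a shared Arena.
template <typename T>
struct BumpAllocator {
    using value_type = T;

    explicit BumpAllocator(Arena& a) noexcept : arena(&a) {}
    template <typename U>
    BumpAllocator(const BumpAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->take(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // reclaimed only with the arena

    Arena* arena;
};

template <typename T, typename U>
bool operator==(const BumpAllocator<T>& a, const BumpAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const BumpAllocator<T>& a, const BumpAllocator<U>& b) noexcept {
    return !(a == b);
}

using PooledMap = std::unordered_map<int, int, std::hash<int>, std::equal_to<int>,
                                     BumpAllocator<std::pair<const int, int>>>;

// Create a map whose buckets and nodes all come out of `arena`.
PooledMap make_map(Arena& arena, std::size_t expected_elements) {
    return PooledMap(expected_elements, std::hash<int>(), std::equal_to<int>(),
                     BumpAllocator<std::pair<const int, int>>(arena));
}
```

Size the arena generously: the bucket array is taken from it as well, and any rehash allocates a new bucket array without giving the old one back.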

std::string and its automatic memory resizing

I'm pretty new to C++, but I know you can't just use memory willy nilly like the std::string class seems to let you do. For instance:
std::string f = "asdf";
f += "fdsa";
How does the string class handle getting larger and smaller? I assume it allocates a default amount of memory and if it needs more, it news a larger block of memory and copies itself over to that. But wouldn't that be pretty inefficient to have to copy the whole string every time it needed to resize? I can't really think of another way it could be done (but obviously somebody did).
And for that matter, how do all the stdlib classes like vector, queue, stack, etc handle growing and shrinking so transparently?
Your analysis is correct — it is inefficient to copy the string every time it needs to resize. That's why common advice discourages that use pattern. Use the string's reserve function to ask it to allocate enough memory for what you intend to store in it. Then further operations will fill that memory. (But if your hint turns out to be too small, the string will still grow automatically, too.)
Containers will also usually try to mitigate the effects of frequent re-allocation by allocating more memory than they need. A common algorithm is that when a string finds that it's out of space, it doubles its buffer size instead of just allocating the minimum required to hold the new value. If the string is being grown one character at a time, this doubling algorithm reduces the total cost of building it to linear time, i.e. amortized constant time per character (instead of quadratic time overall). It also reduces the program's susceptibility to memory fragmentation.
Usually, there's a doubling algorithm. In other words, when it fills the current buffer, it allocates a new buffer that's twice as big, and then copies the current data over. This results in fewer allocate/copy operations than the alternative of growing by a single allocation block.
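To put numbers on that, here is a small simulation that only counts element copies and does not allocate anything. copies_needed is an illustrative helper, and "grow by one slot" stands in for the non-doubling alternative.

```cpp
#include <cassert>
#include <cstddef>

// Total element copies made while appending n items one at a time.
// grow_by_doubling == true:  new capacity = max(1, 2 * old capacity)
// grow_by_doubling == false: new capacity = old capacity + 1
std::size_t copies_needed(std::size_t n, bool grow_by_doubling) {
    std::size_t capacity = 0;
    std::size_t size = 0;
    std::size_t copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {
            copies += size;  // existing elements move to the new buffer
            if (grow_by_doubling) {
                capacity = (capacity == 0) ? 1 : capacity * 2;
            } else {
                capacity += 1;
            }
        }
        ++size;
    }
    assert(size == n);
    return copies;
}
```

For 1024 appends this gives 1023 copies with doubling versus 523776 when growing one slot at a time.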
Although I do not know the exact implementation of std::string, most data structures that need to handle dynamic memory growth do so by doing exactly what you say - allocate a default amount of memory, and if more is needed then create a bigger block and copy yourself over.
The way you get around the obvious inefficiency problem is to allocate more memory than you need. The ratio of used memory to total memory of a vector/string/list/etc is often called the load factor (also used for hash tables with a slightly different meaning). Usually it's a 1:2 ratio, that is, you allocate twice the memory you need. When you run out of space, you allocate a new block twice the size of your current one and use that. This means that over time, if you continue to add things to the vector/string/etc, you need to copy the items over less and less often (as the memory growth is exponential, while your inserting of new items is of course linear), and so the time taken for this method of memory handling is not as large as you might think. By the principles of amortized analysis, you can then see that inserting m items into a vector/string/list using this method is only Big-Theta of m, not m^2.
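For anyone who wants the arithmetic behind that last claim, here is a short derivation. It assumes m >= 1 appends, a buffer that starts empty, and capacities that go 1, 2, 4, ... under doubling; real implementations may use a different starting size or growth factor.

```latex
% Element copies caused by reallocation while appending m items one at a time.
% Doubling: the reallocation away from capacity 2^k copies 2^k items.
\[
  \sum_{k=0}^{\lceil \log_2 m \rceil - 1} 2^k
  \;=\; 2^{\lceil \log_2 m \rceil} - 1 \;<\; 2m
  \quad\Longrightarrow\quad \text{total work } \Theta(m)
\]
% Growing by one slot each time: the append at size i first copies i items.
\[
  \sum_{i=0}^{m-1} i \;=\; \frac{m(m-1)}{2}
  \quad\Longrightarrow\quad \text{total work } \Theta(m^2)
\]
```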