vector implemented with many blocks and no resize copy - c++

I'm wondering if it would be possible to implement an stl-like vector where the storage is done in blocks, and rather than allocate a larger block and copy from the original block, you could keep different blocks in different places, and overload the operator[] and the iterator's operator++ so that the user of the vector wasn't aware that the blocks weren't contiguous.
This could save a copy when moving beyond the existing capacity.

You would be looking for std::deque
See GotW #54 Using Vector and Deque
In Most Cases, Prefer Using deque (Controversial)
Contains benchmarks to demonstrate the behaviours
The latest C++11 standard says:
ยง 23.2.3 Sequence containers
[2] The sequence containers offer the programmer different complexity trade-offs and should be used accordingly.
vector or array is the type of sequence container that should be used by default. list or forward_list
should be used when there are frequent insertions and deletions from the middle of the sequence. deque is
the data structure of choice when most insertions and deletions take place at the beginning or at the end of
the sequence.
FAQ > Prelude's Corner > Vector or Deque? (intermediate) Says:
A vector can only add items to the end efficiently, any attempt to insert an item in the middle of the vector or at the beginning can be and often is very inefficient. A deque can insert items at both the beginning and then end in constant time, O(1), which is very good. Insertions in the middle are still inefficient, but if such functionality is needed a list should be used. A deque's method for inserting at the front is push_front(), the insert() method can also be used, but push_front is more clear.
Just like insertions, erasures at the front of a vector are inefficient, but a deque offers constant time erasure from the front as well.
A deque uses memory more effectively. Consider memory fragmentation, a vector requires N consecutive blocks of memory to hold its items where N is the number of items and a block is the size of a single item. This can be a problem if the vector needs 5 or 10 megabytes of memory, but the available memory is fragmented to the point where there are not 5 or 10 megabytes of consecutive memory. A deque does not have this problem, if there isn't enough consecutive memory, the deque will use a series of smaller blocks.
[...]

Yes it's possible.
do you know rope? it's what you describe, for strings (big string == rope, got the joke?). Rope is not part of the standard, but for practical purposes: it's available on modern compilers. You could use it to represent the complete content of a text editor.
Take a look here: STL Rope - when and where to use
And always remember:
the first rule of (performance) optimizations is: don't do it
the second rule (for experts only): don't do it now.

Related

Collecting objects faster than std::vector

I have a function that stores a lot of small objects (~16 bytes) in a vector, but it doesn't know in advance how many objects will be stored (imagine a recursive descent parser storing tokens for example).
std::vector<SmallObject> getObjects();
This is quite slow because of all the reallocation and copying (and apparently C++ even has to invoke the copy constructors if you don't use an optimised version (see "Object Relocation").
There must be a better way to do things like this where all I am doing to construct the vector is appending things. For example I could have a singly linked list of blocks that are filled, and convert everything to a single vector at the end, so everything only has to be copied once.
Is there anything in Boost or the standard C++ library that would help with this? Or any particularly clever algorithms?
Edit: To be more concrete:
struct SmallObject {
unsigned id;
boost::icl::discrete_interval<unsigned> ival;
};
The question which container is most efficient is always best answered by "it depends" and "measure it!".
Without any more information about your specific situation, there are two 'obvious' possibilities:
Use a linked list
The STL has two linked lists by default: a singly linked list std::forward_list and a doubly linked list std::deque. Moreover there is std::list which is usually the doubly-linked variant. Some quotes from the documentation:
std::forward [...] is implemented as a singly-linked list and essentially does not have any overhead compared to its implementation in C. Compared to std::list this container provides more space efficient storage when bidirectional iteration is not needed.
std::list [...] is usually implemented as a doubly-linked list. Compared to std::forward_list this container provides bidirectional iteration capability while being less space efficient.
std::deque (double-ended queue) [..] insertion and deletion at either end of a deque never invalidates pointers or references to the rest of the elements.
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays
Reserve space in a vector
If there is any way you can estimate an upper bound on the number of objects you will want to store, you can use that to reserve some space in advance.
For example, if you're reading these objects from a file, the number of objects may be at most the file size divided by 16, or the number of lines times two, or some other quick and easy calculation that you can do before constructing these objects.
In that case, if you reserve the capacity, you will allocate too much memory but prevent moves. Even if the upper bound is a bit too low, that's OK: you may still need to double the capacity once or twice but at least you prevent all the small increases (2 -> 4 -> 10 -> 16) at the start of the loop.

Does the std::vector implementation use an internal array or linked list or other?

I've been told that std::vector has a C-style array on the inside implementation, but would that not negate the entire purpose of having a dynamic container?
So is inserting a value in a vector an O(n) operation? Or is it O(1) like in a linked-list?
From the C++11 standard, in the "sequence containers" library section (emphasis mine):
[23.3.6.1 Class template vector overview][vector.overview]
A vector is a sequence container that supports (amortized) constant time insert and erase operations at the
end; insert and erase in the middle take linear time. Storage management is handled automatically, though
hints can be given to improve efficiency.
This does not defeat the purpose of dynamic size -- part of the point of vector is that not only is it very fast to access a single element, but scanning over the vector has very good memory locality because everything is tightly packed together. In practice, having good memory locality is very important because it greatly reduces cache misses, which has a large impact on runtime. This is a major advantage of vector over list in many situations, particularly those where you need to iterate over the entire container more often than you need to add or remove elements.
The memory in a std::vector is required to be contiguous, so it's typically represented as an array.
Your question about the complexity of the operations on a std::vector is a good one - I remember wondering this myself when I first started programming. If you append an element to a std::vector, then it may have to perform a resize operation and copy over all the existing elements to a new array. This will take time O(n) in the worst case. However, the amortized cost of appending an element is O(1). By this, we mean that the total cost of any sequence of n appends to a std::vector is always O(n). The intuition behind this is that the std::vector usually overallocates space in its array, leaving a lot of free slots for elements to be inserted into without a reallocation. As a result, most of the appends will take time O(1) even though every now and then you'll have one that takes time O(n).
That said, the cost of performing an insertion elsewhere in a std::vector will be O(n), because you may have to shift everything down.
You also asked why this is, if it defeats the purpose of having a dynamic array. Even if the std::vector just acted like a managed array, it's still a win over raw arrays. The std::vector knows its size, can do bounds-checking (with at), is an actual object (unlike an array), and doesn't decay to a pointer. These extra features - coupled with the extra logic to make appends work quickly - are almost always worth it.

Vector vs Deque insertion in middle

I know that deque is more efficient than vector when insertions are at front or end and vector is better if we have to do pointer arithmetic. But which one to use when we have to perform insertions in middle.? and Why.?
You might think that a deque would have the advantage, because it stores the data broken up into blocks. However to implement operator[] in constant time requires all those blocks to be the same size. Inserting or deleting an element in the middle still requires shifting all the values on one side or the other, same as a vector. Since the vector is simpler and has better locality for caching, it should come out ahead.
Selection criteria with Standard library containers is, You select a container depending upon:
Type of data you want to store &
The type of operations you want to perform on the data.
If you want to perform large number of insertions in the middle you are much better off using a std::list.
If the choice is just between a std::deque and std::vector then there are a number of factors to consider:
Typically, there is one more indirection in case of deque to access the elements, so element
access and iterator movement of deques are usually a bit slower.
In systems that have size limitations for blocks of memory, a deque might contain more elements because it uses more than one block of memory. Thus, max_size() might be larger for deques.
Deques provide no support to control the capacity and the moment of reallocation. In
particular, any insertion or deletion of elements other than at the beginning or end
invalidates all pointers, references, and iterators that refer to elements of the deque.
However, reallocation may perform better than for vectors, because according to their
typical internal structure, deques don't have to copy all elements on reallocation.
Blocks of memory might get freed when they are no longer used, so the memory size of a
deque might shrink (this is not a condition imposed by standard but most implementations do)
std::deque could perform better for large containers because it is typically implemented as a linked sequence of contiguous data blocks, as opposed to the single block used in an std::vector. So an insertion in the middle would result in less data being copied from one place to another, and potentially less reallocations.
Of course, whether that matters or not depends on the size of the containers and the cost of copying the elements stored. With C++11 move semantics, the cost of the latter is less important. But in the end, the only way of knowing is profiling with a realistic application.
Deque would still be more efficient, as it doesn't have to move half of the array every time you insert an element.
Of course, this will only really matter if you consider large numbers of elements, and even than it is advisable to run a benchmark and see which one works better in your particular case. Remember that premature optimization is the root of all evil.

how c++ vector works

Lets say if I have a vector V, which has 10 elements.
If I erase the first element (at index 0) using v.erase(v.begin()) then how STL vector handle this?
Does it create another new vector and copy elements from the old vector to the new vector and deallocate the old one? Or Does it copy each element starting from index 1 and copy the element to index-1 ?
If I need to have a vector of size 100,000 at once and later I don't use that much space, lets say I only need a vector of size 10 then does it automatically reduce the size? ( I don't think so)
I looked online and there are only APIs and tutorials how to use STL library.
Is there any good references that I can have an idea of the implementation or complexity of STL library?
Actually, the implementation of vector is visible, since it's a template, so you can look into that for details:
iterator erase(const_iterator _Where)
{ // erase element at where
if (_Where._Mycont != this
|| _Where._Myptr < _Myfirst || _Mylast <= _Where._Myptr)
_DEBUG_ERROR("vector erase iterator outside range");
_STDEXT unchecked_copy(_Where._Myptr + 1, _Mylast, _Where._Myptr);
_Destroy(_Mylast - 1, _Mylast);
_Orphan_range(_Where._Myptr, _Mylast);
--_Mylast;
return (iterator(_Where._Myptr, this));
}
Basically, the line
unchecked_copy(_Where._Myptr + 1, _Mylast, _Where._Myptr);
does exactly what you thought - copies the following elements over (or moves them in C++11 as bames53 pointed out).
To answer your second question, no, the capacity cannot decrease on its own.
The complexities of the algorithms in std can be found at http://www.cplusplus.com/reference/stl/ and the implementation, as previously stated, is visible.
Does it copy each element starting from index 1 and copy the element to index-1 ?
Yes (though it actually moves them since C++11).
does it automatically reduce the size?
No, reducing the size would typically invalidate iterators to existing elements, and that's only allowed on certain function calls.
I looked online and there are only APIs and tutorials how to use STL library. Is there any good references that I can have an idea of the implementation or complexity of STL library?
You can read the C++ specification which will tell you exactly what's allowed and what isn't in terms of implementation. You can also go look at your actual implementation.
Vector will copy (move in C++11) the elements to the beginning, that's why you should use deque if you would like to insert and erase from the beginning of a collection. If you want to truly resize the vector's internal buffer you can do this:
vector<Type>(v).swap(v);
This will hopefully make a temporary vector with the correct size, then swaps it's internal buffer with the old one, then the temporary one goes out of scope and the large buffer gets deallocated with it.
As others noted, you may use vector::shrink_to_fit() in C++11.
That's one of my (many) objection to C++. Everybody says "use the standard libraries" ... but even when you have the STL source (which is freely available from many different places. Including, in this case, the header file itself!) ... it's basically an incomprehensible nightmare to dig in to and try to understand.
The (C-only) Linux kernel is a paragon of simplicity and clarity in contrast.
But we digress :)
Here's the 10,000-foot answer to your question:
http://www.cplusplus.com/reference/stl/vector/
Vector containers are implemented as dynamic arrays; Just as regular
arrays, vector containers have their elements stored in contiguous
storage locations, which means that their elements can be accessed not
only using iterators but also using offsets on regular pointers to
elements.
But unlike regular arrays, storage in vectors is handled
automatically, allowing it to be expanded and contracted as needed.
Vectors are good at:
Accessing individual elements by their position index (constant time).
Iterating over the elements in any order (linear time).
Add and remove elements from its end (constant amortized time).
Compared to arrays, they provide almost the same performance for these
tasks, plus they have the ability to be easily resized. Although, they
usually consume more memory than arrays when their capacity is handled
automatically (this is in order to accommodate extra storage space for
future growth).
Compared to the other base standard sequence containers (deques and
lists), vectors are generally the most efficient in time for accessing
elements and to add or remove elements from the end of the sequence.
For operations that involve inserting or removing elements at
positions other than the end, they perform worse than deques and
lists, and have less consistent iterators and references than lists.
...
Reallocations may be a costly operation in terms of performance, since
they generally involve the entire storage space used by the vector to
be copied to a new location. You can use member function
vector::reserve to indicate beforehand a capacity for the vector. This
can help optimize storage space and reduce the number of reallocations
when many enlargements are planned.
...
I only need a vector of size 10 then does it automatically reduce the size?
No it doesn't automatically shrink.
Traditionally you swap the vector with a new empty one: reduce the capacity of an stl vector
But C++x11 includes a std::vector::shrink_to_fit() which it does it directly

STL Containers - difference between vector, list and deque

Should I use deque instead of vector if i'd like to push elements also in the beginning of the container? When should I use list and what's the point of it?
Use deque if you need efficient insertion/removal at the beginning and end of the sequence and random access; use list if you need efficient insertion anywhere, at the sacrifice of random access. Iterators and references to list elements are very stable under almost any mutation of the container, while deque has very peculiar iterator and reference invalidation rules (so check them out carefully).
Also, list is a node-based container, while a deque uses chunks of contiguous memory, so memory locality may have performance effects that cannot be captured by asymptotic complexity estimates.
deque can serve as a replacement for vector almost everywhere and should probably have been considered the "default" container in C++ (on account of its more flexible memory requirements); the only reason to prefer vector is when you must have a guaranteed contiguous memory layout of your sequence.
deque and vector provide random access, list provides only linear accesses. So if you need to be able to do container[i], that rules out list. On the other hand, you can insert and remove items anywhere in a list efficiently, and operations in the middle of vector and deque are slow.
deque and vector are very similar, and are basically interchangeable for most purposes. There are only two differences worth mentioning. First, vector can only efficiently add new items at the end, while deque can add items at either end efficiently. So why would you ever use a vector then? Unlike deque, vector guarantee that all items will be stored in contiguous memory locations, which makes iterating through them faster in some situations.