Being Smart About Vector Memory Allocation

Being Smart About Vector Memory Allocation - c++

Let's say I have to iterate over a potentially very large vector of numbers and copy the even and odd elements into new, separate vectors. (The source vector may have any proportion of evens to odds; it could be all evens, all odds, or somewhere in-between.)
For simplicity, push_back is often used for this sort of thing:
for (std::size_t Index; Index < Source.size(); Index++)
{
if (Source[Index] % 2) Odds.push_back(Source[Index]);
else Evens.push_back(Source[Index]);
}
However, I'm worried that this will be inefficient and be harmful if it's used as part of the implementation for something like a sorting algorithm, where performance is paramount. QuickSort, for example, involves separating elements much like this.
You could use reserve() to allocate memory before-hand so only one allocation is needed, but then you have to iterate over the entire source vector twice - once to count how many elements will need to be sorted out, and once more for the actual copying.
You could, of course, allocate the same amount of space as the source vector's size, since neither new vector will need to hold more than that, but that seems somewhat wasteful.
Is there a better method that I'm missing? Is push_back() usually trusted to manage this sort of thing for the programmer, or can it become burdensome for sensitive algorithms?

I'm going to answer the question I think you really meant to ask, which is "should push_back() be avoided in the inner loops of heavy algorithms?" rather than what others seem to have read into your post, which is "does it matter if I call push_back before doing an unrelated sort on a large vector?" Also, I'm going to answer from my experience rather than spend time chasing down citations and peer-reviewed articles.
Your example is basically doing two things that add up to the total CPU cost: it's reading and operating on elements in the input vector, and then it has to insert the elements into the output vector. You're concerned about the cost of inserting elements because:
push_back() is constant time (instantaneous, really) when a vector has enough space pre-reserved for an additional element, but slow when you've run out of reserved space.
Allocating memory is costly (malloc() is just slow, even when pedants pretend that new is something different)
Copying a vector's data from one region to another after reallocation is also slow: when push_back() finds it hasn't got enough space, it has to go and allocate a bigger vector, then copy all the elements. (In theory, for vectors that are many OS pages in size, a magic implementation of the STL could use the VMM to move them around in the virtual address space without copying — in practice I've never seen one that could.)
Over-allocating the output vectors causes problems: it causes fragmentation, making future allocations slower; it burns data cache, making everything slower; if persistent, it ties up scarce free memory, leading to disk paging on a PC and a crash on embedded platforms.
Under-allocating the output vectors causes problems because reallocating a vector is an O(n) operation, so reallocating it m times is O(m×n). If the STL's default allocator uses exponential reallocation (making the vector's reserve twice its previous size every time you realloc), that makes your linear algorithm O(n + n log m).
Your instinct, therefore, is correct: always pre-reserve space for your vectors where possible, not because push_back is slow, but because it can trigger a reallocation that is slow. Also, if you look at the implementation of shrink_to_fit, you'll see it also does a copy reallocation, temporarily doubling your memory cost and causing further fragmentation.
Your problem here is that you don't always know exactly how much space you'll need for the output vectors; the usual response is to use a heuristic and maybe a custom allocator. Reserve n/2+k of the input size for each of your output vectors by default, where k is some safety margin. That way you'll usually have enough space for the output, so long as your input is reasonably balanced, and push_back can reallocate in the rare cases where it's not. If you find that push_back's exponential behavior is wasting too much memory ( causing you to reserve 2n elements when really you just needed n+2 ), you can give it a custom allocator that expands the vector size in smaller, linear chunks — but of course that will be much slower in cases where the vectors are really unbalanced and you end up doing lots of resizes.
There's no way to always reserve the exact right amount of space without walking the input elements in advance; but if you know what the balance usually looks like, you can use a heuristic to make a good guess at it for a statistical performance gain over many iterations.

You could, of course, allocate the same amount of space as the source
vector's size, since neither new vector will need to hold more than
that, but that seems somewhat wasteful.
Then follow it up with a call to shrink_to_fit
However, I'm worried that this will be inefficient and harm things
like sorting algorithms. ... Is push_back() usually trusted to manage
this sort of thing for the programmer, or can it become burdensome for
sensitive algorithms?
Yes, push_back is trusted. Although honestly I don't understand what your concern is. Presumably, if you're using algorithms on the vector, you've already put the elements into the vector. What kind of algorithm are you talking about where it would matter how the vector elements got there, be it push_back or something else?

How about sorting the original vector with a custom predicate that puts all the evens before all the odds?
bool EvenBeforeOdd(int a, int b)
{
if ((a - b) % 2 == 0) return a < b;
return a % 2 == 0;
}
std::sort(v.begin(), v.end(), EvenBeforeOdd);
Then you just have to find the largest even number, which you can do e.g. with upper_bound for a very large even number or something like that. Once you found that, you can make very cheap copies of the ranges.
Update: As #Blastfurnace commented, it's much more efficient to use std::partition rather than sort, since we don't actually need the elements ordered within each partition:
bool isEven(int a) { return 0 == a % 2; }
std::vector<int>::const_iterator it = std::partition(v.begin(), v.end(), isEven);
std::vector<int> evens, odds;
evens.reserve(std::distance(v.begin(), it);
odds.reserve(std::distance(it, v.end());
std::copy(v.begin(), it, std::back_inserter(evens));
std::copy(it, v.end(), std::back_inserter(odds));

If your objects are created dynamically then the vectors are literally just storing pointers. This makes the vectors considerably more efficient, especially when it comes to internal reallocation. This would also save memory if same objects exist in multiple locations.
std::vector<YourObject*> Evens;
Note: Do not push pointers from context of function as this will cause data corruption outside of that frame. Instead objects would need to be allocated dynamically.
This might not solve your problem, but perhaps it is of use.

If your sub vectors are exactly half (odd / even) then simply allocate 50% of original vector for each. This would avoid wastage and shrink_to_fit.

Related

What is best to insert several values at the end of a std::vector?

To add elements to a std::vector<int> v is it better to do:
// Read and manipulate a, b, c triplet as ints.
// Potentially also: v.reserve(v.size() + 3); or trust vector growth policy?
v.push_back(a);
v.push_back(b);
v.push_back(c);
or
v.insert(v.end(), {a, b, c});
from a performance point of view (assuming we are always going to insert triplets that are different everytime and a large unfixed number of them, say 1 million triplets)? Thanks for tips.

First of all, doing v.reserve(v.size() + 3); in a loop is generally a very bad idea since it will certainly cause a new reallocations for each iteration. For example, both Clang and GCC with the libstdc++ and libc++ actually do a linear number of reallocations (see here, here or even there). Here is a quote from cppreference:
Correctly using reserve() can prevent unnecessary reallocations, but inappropriate uses of reserve() (for instance, calling it before every push_back() call) may actually increase the number of reallocations (by causing the capacity to grow linearly rather than exponentially) and result in increased computational complexity and decreased performance. For example, a function that receives an arbitrary vector by reference and appends elements to it should usually not call reserve() on the vector, since it does not know of the vector's usage characteristics.
When inserting a range, the range version of insert() is generally preferable as it preserves the correct capacity growth behavior, unlike reserve() followed by a series of push_back()s.
reserve() cannot be used to reduce the capacity of the container; to that end shrink_to_fit() is provided.
When it comes to insert VS push_backs, insert should be slightly better than many push_back because the capacity check can be done only once as opposed to multiple push_backs. That being said, the performance difference is very dependent of the standard library implementation.

Efficient linked list in C++?

This document says std::list is inefficient:
std::list is an extremely inefficient class that is rarely useful. It performs a heap allocation for every element inserted into it, thus having an extremely high constant factor, particularly for small data types.
Comment: that is to my surprise. std::list is a doubly linked list, so despite its inefficiency in element construction, it supports insert/delete in O(1) time complexity, but this feature is completely ignored in this quoted paragraph.
My question: Say I need a sequential container for small-sized homogeneous elements, and this container should support element insert/delete in O(1) complexity and does not need random access (though supporting random access is nice, it is not a must here). I also don't want the high constant factor introduced by heap allocation for each element's construction, at least when the number of element is small. Lastly, iterators should be invalidated only when the corresponding element is deleted. Apparently I need a custom container class, which might (or might not) be a variant of doubly linked list. How should I design this container?
If the aforementioned specification cannot be achieved, then perhaps I should have a custom memory allocator, say, bump pointer allocator? I know std::list takes an allocator as its second template argument.
Edit: I know I shouldn't be too concerned with this issue, from an engineering standpoint - fast enough is good enough. It is just a hypothetical question so I don't have a more detailed use case. Feel free to relax some of the requirements!
Edit2: I understand two algorithms of O(1) complexity can have entirely different performance due to the difference in their constant factors.

Your requirements are exactly those of std::list, except that you've decided you don't like the overhead of node-based allocation.
The sane approach is to start at the top and only do as much as you really need:
Just use std::list.
Benchmark it: is the default allocator really too slow for your purposes?
No: you're done.
Yes: goto 2
Use std::list with an existing custom allocator such as the Boost pool allocator
Benchmark it: is the Boost pool allocator really too slow for your purposes?
No: you're done.
Yes: goto 3
Use std::list with a hand-rolled custom allocator finely tuned to your unique needs, based on all the profiling you did in steps 1 and 2
Benchmark as before etc. etc.
Consider doing something more exotic as a last resort.
If you get to this stage, you should have a really well-specified SO question, with lots of detail about exactly what you need (eg. "I need to squeeze n nodes into a cacheline" rather than "this doc said this thing is slow and that sounds bad").
PS. The above makes two assumptions, but both are worth investigation:
as Baum mit Augen points out, it's not sufficient to do simple end-to-end timing, because you need to be sure where your time is going. It could be the allocator itself, or cache misses due to the memory layout, or something else. If something's slow, you still need to be sure why before you know what ought to change.
your requirements are taken as a given, but finding ways to weaken requirements is often the easiest way to make something faster.
do you really need constant-time insertion and deletion everywhere, or only at the front, or the back, or both but not in the middle?
do you really need those iterator invalidation constraints, or can they be relaxed?
are there access patterns you can exploit? If you're frequently removing an element from the front and then replacing it with a new one, could you just update it in-place?

As an alternative, you can use a growable array and handle the links explicitly, as indexes into the array.
Unused array elements are put in a linked list using one of the links. When an element is deleted, it is returned to the free list. When the free list is exhausted, grow the array and use the next element.
For the new free elements, you have two options:
append them to the free list at once,
append them on demand, based on the number of elements in the free list vs. the array size.

The requirement of not invalidating iterators except the one on a node being deleted is forbidding to every container that doesn't allocate individual nodes and is much different from e.g. list or map.
However, I've found that in almost every case when I thought that this was necessary, it turned out with a little discipline I could just as well do without. You might want to verify if you can, you would benefit greatly.
While std::list is indeed the "correct" thing if you need something like a list (for CS class, mostly), the statement that it is almost always the wrong choice is, unluckily, exactly right. While the O(1) claim is entirely true, it's nevertheless abysmal in relation to how actual computer hardware works, which gives it a huge constant factor. Note that not only are the objects that you iterate randomly placed, but the nodes that you maintain are, too (yes, you can somehow work around that with an allocator, but that is not the point). On the average, you have two one guaranteed cache misses for anything you do, plus up to two one dynamic allocations for mutating operations (one for the object, and another one for the node).
Edit: As pointed out by #ratchetfreak below, implementations of std::list commonly collapse the object and node allocation into one memory block as an optimization (akin to what e.g. make_shared does), which makes the average case somewhat less catastrophic (one allocation per mutation and one guaranteed cache miss instead of two).
A new, different consideration in this case might be that doing so may not be entirely trouble-free either. Postfixing the object with two pointers means reversing the direction while dereference which may interfere with auto prefetch.
Prefixing the object with the pointers, on the other hand, means you push the object back by two pointers' size, which will mean as much as 16 bytes on a 64-bit system (that might split a mid-sized object over cache line boundaries every time). Also, there's to consider that std::list cannot afford to break e.g. SSE code solely because it adds a clandestine offset as special surprise (so for example the xor-trick would likely not be applicable for reducing the two-pointer footprint). There would likely have to be some amount of "safe" padding to make sure objects added to a list still work the way they should.
I am unable to tell whether these are actual performance problems or merely distrust and fear from my side, but I believe it's fair to say that there may be more snakes hiding in the grass than one expects.
It's not for no reason that high-profile C++ experts (Stroustrup, notably) recommend using std::vector unless you have a really good reason not to.
Like many people before, I've tried to be smart about using (or inventing) something better than std::vector for one or the other particular, specialized problem where it seems you can do better, but it turns out that simply using std::vector is still almost always the best, or second best option (if std::vector happens to be not-the-best, std::deque is usually what you need instead).
You have way fewer allocations than with any other approach, way less memory fragmentation, way fewer indirections, and a much more favorable memory access pattern. And guess what, it's readily available and just works.
The fact that every now and then inserts require a copy of all elements is (usually) a total non-issue. You think it is, but it's not. It happens rarely and it is a copy of a linear block of memory, which is exactly what processors are good at (as opposed to many double-indirections and random jumps over memory).
If the requirement not to invalidate iterators is really an absolute must, you could for example pair a std::vector of objects with a dynamic bitset or, for lack of something better, a std::vector<bool>. Then use reserve() appropriately so reallocations do not happen. When deleting an element, do not remove it but only mark it as deleted in the bitmap (call the destructor by hand). At appropriate times, when you know that it's OK to invalidate iterators, call a "vacuum cleaner" function that compacts both the bit vector and the object vector. There, all unforeseeable iterator invalidations gone.
Yes, that requires maintaining one extra "element was deleted" bit, which is annoying. But a std::list must maintain two pointers as well, in additon to the actual object, and it must do allocations. With the vector (or two vectors), access is still very efficient, as it happens in a cache-friendly way. Iterating, even when checking for deleted nodes, still means you move linearly or almost-linearly over memory.

std::list is a doubly linked list, so despite its inefficiency in element construction, it supports insert/delete in O(1) time complexity, but this feature is completely ignored in this quoted paragraph.
It's ignored because it's a lie.
The problem of algorithmic complexity is that it generally measures one thing. For example, when we say that insertion in a std::map is O(log N), we mean that it performs O(log N) comparisons. The costs of iterating, fetching cache lines from memory, etc... are not taken into account.
This greatly simplifies analysis, of course, but unfortunately does not necessarily map cleanly to real-world implementation complexities. In particular, one egregious assumption is that memory allocation is constant-time. And that, is a bold-faced lie.
General purpose memory allocators (malloc and co), do not have any guarantee on the worst-case complexity of memory allocations. The worst-case is generally OS-dependent, and in the case of Linux it may involve the OOM killer (sift through the ongoing processes and kill one to reclaim its memory).
Special purpose memory allocators could potentially be made constant time... within a particular range of number of allocations (or maximum allocation size). Since Big-O notation is about the limit at infinity, it cannot be called O(1).
And thus, where the rubber meets the road, the implementation of std::list does NOT in general feature O(1) insertion/deletion, because the implementation relies on a real memory allocator, not an ideal one.
This is pretty depressing, however you need not lose all hopes.
Most notably, if you can figure out an upper-bound to the number of elements and can allocate that much memory up-front, then you can craft a memory allocator which will perform constant-time memory allocation, giving you the illusion of O(1).

Use two std::lists: One "free-list" that's preallocated with a large stash of nodes at startup, and the other "active" list into which you splice nodes from the free-list. This is constant time and doesn't require allocating a node.

The new slot_map proposal claim O(1) for insert and delete.
There is also a link to a video with a proposed implementation and some previous work.
If we knew more about the actual structure of the elements there might be some specialized associative containers that are much better.

I would suggest doing exactly what #Yves Daoust says, except instead of using a linked list for the free list, use a vector. Push and pop the free indices on the back of the vector. This is amortized O(1) insert, lookup, and delete, and doesn't involve any pointer chasing. It also doesn't require any annoying allocator business.

The simplest way I see to fulfill all your requirements:
Constant-time insertion/removal (hope amortized constant-time is okay for insertion).
No heap allocation/deallocation per element.
No iterator invalidation on removal.
... would be something like this, just making use of std::vector:
template <class T>
struct Node
{
// Stores the memory for an instance of 'T'.
// Use placement new to construct the object and
// manually invoke its dtor as necessary.
typename std::aligned_storage<sizeof(T), alignof(T)>::type element;
// Points to the next element or the next free
// element if this node has been removed.
int next;
// Points to the previous element.
int prev;
};
template <class T>
class NodeIterator
{
public:
...
private:
std::vector<Node<T>>* nodes;
int index;
};
template <class T>
class Nodes
{
public:
...
private:
// Stores all the nodes.
std::vector<Node> nodes;
// Points to the first free node or -1 if the free list
// is empty. Initially this starts out as -1.
int free_head;
};
... and hopefully with a better name than Nodes (I'm slightly tipsy and not so good at coming up with names at the moment). I'll leave the implementation up to you but that's the general idea. When you remove an element, just do a doubly-linked list removal using the indices and push it to the free head. The iterator doesn't invalidate since it stores an index to a vector. When you insert, check if the free head is -1. If not, overwrite the node at that position and pop. Otherwise push_back to the vector.
Illustration
Diagram (nodes are stored contiguously inside std::vector, we simply use index links to allow skipping over elements in a branchless way along with constant-time removals and insertions anywhere):
Let's say we want to remove a node. This is your standard doubly-linked list removal, except we use indices instead of pointers and you also push the node to the free list (which just involves manipulating integers):
Removal adjustment of links:
Pushing removed node to free list:
Now let's say you insert to this list. In that case, you pop off the free head and overwrite the node at that position.
After insertion:
Insertion to the middle in constant-time should likewise be easy to figure out. Basically you just insert to the free head or push_back to the vector if the free stack is empty. Then you do your standard double-linked list insertion. Logic for the free list (though I made this diagram for someone else and it involves an SLL, but you should get the idea):
Make sure you properly construct and destroy the elements using placement new and manual calls to the dtor on insertion/removal. If you really want to generalize it, you'll also need to think about exception-safety and we also need a read-only const iterator.
Pros and Cons
The benefit of such a structure is that it does allow very rapid insertions/removals from anywhere in the list (even for a gigantic list), insertion order is preserved for traversal, and it never invalidates the iterators to element which aren't directly removed (though it will invalidate pointers to them; use deque if you don't want pointers to be invalidated). Personally I'd find more use for it than std::list (which I practically never use).
For large enough lists (say, larger than your entire L3 cache as a case where you should definitely expect a huge edge), this should vastly outperform std::vector for removals and insertions to/from the middle and front. Removing elements from vector can be quite fast for small ones, but try removing a million elements from a vector starting from the front and working towards the back. There things will start to crawl while this one will finish in the blink of an eye. std::vector is ever-so-slightly overhyped IMO when people start using its erase method to remove elements from the middle of a vector spanning 10k elements or more, though I suppose this is still preferable over people naively using linked lists everywhere in a way where each node is individually allocated against a general-purpose allocator while causing cache misses galore.
The downside is that it only supports sequential access, requires the overhead of two integers per element, and as you can see in the above diagram, its spatial locality degrades if you constantly remove things sporadically.
Spatial Locality Degradation
The loss of spatial locality as you start removing and inserting a lot from/to the middle will lead to zig-zagging memory access patterns, potentially evicting data from a cache line only to go back and reload it during a single sequential
loop. This is generally inevitable with any data structure that allows removals from the middle in constant-time while likewise allowing that space to be reclaimed while preserving the order of insertion. However, you can restore spatial locality by offering some method or you can copy/swap the list. The copy constructor can copy the list in a way that iterates through the source list and inserts all the elements which gives you back a perfectly contiguous, cache-friendly vector with no holes (though doing this will invalidate iterators).
Alternative: Free List Allocator
An alternative that meets your requirements is implement a free list conforming to std::allocator and use it with std::list. I never liked reaching around data structures and messing around with custom allocators though and that one would double the memory use of the links on 64-bit by using pointers instead of 32-bit indices, so I'd prefer the above solution personally using std::vector as basically your analogical memory allocator and indices instead of pointers (which both reduce size and become a requirement if we use std::vector since pointers would be invalidated when vector reserves a new capacity).
Indexed Linked Lists
I call this kind of thing an "indexed linked list" as the linked list isn't really a container so much as a way of linking together things already stored in an array. And I find these indexed linked lists exponentially more useful since you don't have to get knee-deep in memory pools to avoid heap allocations/deallocations per node and can still maintain reasonable locality of reference (great LOR if you can afford to post-process things here and there to restore spatial locality).
You can also make this singly-linked if you add one more integer to the node iterator to store the previous node index (comes free of memory charge on 64-bit assuming 32-bit alignment requirements for int and 64-bit for pointers). However, you then lose the ability to add a reverse iterator and make all iterators bidirectional.
Benchmark
I whipped up a quick version of the above since you seem interested in 'em: release build, MSVC 2012, no checked iterators or anything like that:
--------------------------------------------
- test_vector_linked
--------------------------------------------
Inserting 200000 elements...
time passed for 'inserting': {0.000015 secs}
Erasing half the list...
time passed for 'erasing': {0.000021 secs}
time passed for 'iterating': {0.000002 secs}
time passed for 'copying': {0.000003 secs}
Results (up to 10 elements displayed):
[ 11 13 15 17 19 21 23 25 27 29 ]
finished test_vector_linked: {0.062000 secs}
--------------------------------------------
- test_vector
--------------------------------------------
Inserting 200000 elements...
time passed for 'inserting': {0.000012 secs}
Erasing half the vector...
time passed for 'erasing': {5.320000 secs}
time passed for 'iterating': {0.000000 secs}
time passed for 'copying': {0.000000 secs}
Results (up to 10 elements displayed):
[ 11 13 15 17 19 21 23 25 27 29 ]
finished test_vector: {5.320000 secs}
Was too lazy to use a high-precision timer but hopefully that gives an idea of why one shouldn't use vector's linear-time erase method in critical paths for non-trivial input sizes with vector above there taking ~86 times longer (and exponentially worse the larger the input size -- I tried with 2 million elements originally but gave up after waiting almost 10 minutes) and why I think vector is ever-so-slightly-overhyped for this kind of use. That said, we can turn removal from the middle into a very fast constant-time operation without shuffling the order of the elements, without invalidating indices and iterators storing them, and while still using vector... All we have to do is simply make it store a linked node with prev/next indices to allow skipping over removed elements.
For removal I used a randomly shuffled source vector of even-numbered indices to determine what elements to remove and in what order. That somewhat mimics a real world use case where you're removing from the middle of these containers through indices/iterators you formerly obtained, like removing the elements the user formerly selected with a marquee tool after he his the delete button (and again, you really shouldn't use scalar vector::erase for this with non-trivial sizes; it'd even be better to build a set of indices to remove and use remove_if -- still better than vector::erase called for one iterator at a time).
Note that iteration does become slightly slower with the linked nodes, and that doesn't have to do with iteration logic so much as the fact that each entry in the vector is larger with the links added (more memory to sequentially process equates to more cache misses and page faults). Nevertheless, if you're doing things like removing elements from very large inputs, the performance skew is so epic for large containers between linear-time and constant-time removal that this tends to be a worthwhile exchange.

I second #Useless' answer, particularly PS item 2 about revising requirements. If you relax the iterator invalidation constraint, then using std::vector<> is Stroustrup's standard suggestion for a small-number-of-items container (for reasons already mentioned in the comments). Related questions on SO.
Starting from C++11 there is also std::forward_list.
Also, if standard heap allocation for elements added to the container is not good enough, then I would say you need to look very carefully at your exact requirements and fine tune for them.

I just wanted to make a small comment about your choice. I'm a huge fan of vector because of it's read speeds, and you can direct access any element, and do sorting if need be. (vector of class/struct for example).
But anyways I digress, there's two nifty tips I wanted to disclose.
With vector inserts can be expensive, so a neat trick, don't insert if you can get away with not doing it. do a normal push_back (put at the end) then swap the element with one you want.
Same with deletes. They are expensive. So swap it with the last element, delete it.

Thanks for all the answers.
This is a simple - though not rigorous - benchmark.
// list.cc
#include <list>
using namespace std;
int main() {
for (size_t k = 0; k < 1e5; k++) {
list<size_t> ln;
for (size_t i = 0; i < 200; i++) {
ln.insert(ln.begin(), i);
if (i != 0 && i % 20 == 0) {
ln.erase(++++++++++ln.begin());
}
}
}
}
and
// vector.cc
#include <vector>
using namespace std;
int main() {
for (size_t k = 0; k < 1e5; k++) {
vector<size_t> vn;
for (size_t i = 0; i < 200; i++) {
vn.insert(vn.begin(), i);
if (i != 0 && i % 20 == 0) {
vn.erase(++++++++++vn.begin());
}
}
}
}
This test aims to test what std::list claims to excel at - O(1) inserting and erasing. And, because of the positions I ask to insert/delete, this race is heavily skewed against std::vector, because it has to shift all the following elements (hence O(n)), while std::list doesn't need to do that.
Now I compile them.
clang++ list.cc -o list
clang++ vector.cc -o vector
And test the runtime. The result is:
time ./list
./list 4.01s user 0.05s system 91% cpu 4.455 total
time ./vector
./vector 1.93s user 0.04s system 78% cpu 2.506 total
std::vector has won.
Compiled with optimization O3, std::vector still wins.
time ./list
./list 2.36s user 0.01s system 91% cpu 2.598 total
time ./vector
./vector 0.58s user 0.00s system 50% cpu 1.168 total
std::list has to call heap allocation for each element, while std::vector can allocate heap memory in batch (though it might be implementation-dependent), hence std::list's insert/delete has a higher constant factor, though it is O(1).
No wonder this document says
std::vector is well loved and respected.
EDIT: std::deque does even better in some cases, at least for this task.
// deque.cc
#include <deque>
using namespace std;
int main() {
for (size_t k = 0; k < 1e5; k++) {
deque<size_t> dn;
for (size_t i = 0; i < 200; i++) {
dn.insert(dn.begin(), i);
if (i != 0 && i % 20 == 0) {
dn.erase(++++++++++dn.begin());
}
}
}
}
Without optimization:
./deque 2.13s user 0.01s system 86% cpu 2.470 total
Optimized with O3:
./deque 0.27s user 0.00s system 50% cpu 0.551 total

why std::vector item deletion does not reduce its capacity?

I am aware that when we insert items to a vector its capacity could be increase by non-linear factor. In gcc its capacity doubles. But I wonder why when I erase items from a vector, the capacity does not reduce. I tried to find out a reason for this. It 'seems' C++ standard does not say any word about this reduction (either to do or not).
For my understand ideally, when vector size comes to 1/4 of its capacity at item deletion, it the vector could be shrunken by 1/2 of its capacity to achieve constant amortized space allocation/de-allocation complexity.
My question is why C++ standard does not specify capacity reduction policy? What are the language design goals to not to specify anything about this? Does anyone has an idea about this?

It 'seems' C++ standard does not say any word about this reduction (either to do or not)
This is not true, because the complexity description for vector::erase specifies exactly what operations will be performed.
From §23.3.6.5/4 [vector.modifiers]
iterator erase(const_iterator position);
iterator erase(const_iterator first, const_iterator last);
Complexity: The destructor of T is called the number of times equal to the number of the elements erased, but the move assignment operator of T is called the number of times equal to the number of elements in the vector after the erased elements.
This precludes implementations from reducing capacity because that would mean reallocation of storage and moving all remaining elements to the new memory.
And if you're asking why the standard itself doesn't specify implementations are allowed to reduce capacity when you erase elements, then one can only guess the reasons.
It was probably considered not important enough from a performance point of view to have the vector spend time reallocating and moving elements when erasing
Reducing capacity would also add an additional possibility of an exception due to a failed memory allocation.
You can attempt to reduce capacity yourself by calling vector::shrink_to_fit, but be aware that this call is non-binding, and implementations are allowed to ignore it.
Another possibility for reducing the capacity would be move the elements into a temporary vector and swap it back into the original.
decltype(vec)(std::make_move_iterator(vec.begin()),
std::make_move_iterator(vec.end())).swap(vec);
But even with the second method, there's nothing stopping an implementation from over allocating storage.

Even more than the performance of moving all elements is the effect on existing iterators and pointers to elements. The behavior of erase is:
Invalidates iterators and references at or after the point of the erase.
If reallocation occurred, then all iterators, pointers, and references would become invalid. In general, keeping iterator validity is a desirable thing.

The algorithm for allocating additional space as the vector grows has "constant amortized complexity" due to the notion that the total complexity (which is O(N) when a vector of N elements is created by a series of push_back() operations) can be "amortized" over the N push_back() calls--that is, the total cost is divided by N.
Even more specifically, using the algorithm that allocates twice as much space each time, the worst case is that the algorithm allocates nearly 4 times as much memory as would need to be allocated if you knew the exact size of the vector in advance. The last allocation is just slightly less than two times the size of the vector after the allocation, and the some of all the previous allocations is slightly less than the size of the last allocation.
The total number of allocations is O(log N), and the number of deallocations (up to that point) is just one less than the number of allocations.
For a large vector, if you know its maximum size in advance, it's more efficient to reserve that space at the beginning (one allocation rather than O(log N) allocations)
before inserting any data.
If you cut the capacity in half each time the size of the vector shrank to 1/4 of the currently-allocated space--that is, if you ran the allocation algorithm in reverse--you would be re-allocating (and then deallocating) nearly as much memory as the maximum capacity of the vector, in addition to deallocating the memory block with the maximum capacity. That's a performance penalty for applications that simply wanted to erase elements of the vector until they were all gone and then delete the vector.
That is, with deallocation as well as allocation, it's better to do it all at once if you can. And with deallocation you (almost) always can.
The only beneficiary of the more complicated deallocation algorithm would be an application that makes a vector, then erases at least 3/4 of it and then keeps the remaining part in memory while proceeding to grow new vectors. And even then there would be no benefit from the complicated algorithm unless the sum of the maximum capacities of the old (but still existing) vectors and the new vectors was so large that the application started to run into limitations of virtual memory.
Why penalize all algorithms that progressively erase their vectors in order to gain this advantage in this special case?

performance of array vs. map

I have to loop over a subset of elements in a large array where each element point to another one (problem coming from the detection of connected component in a large graph).
My algo is going as follows:
1. consider 1st element
2. consider next element as the one pointed by the previous element.
3. loop until no new element is discover
4. consider next element not already consider in 1-3, get back to 1.
Note that the number of elements to consider is much smaller than the total number of elements.
For what I see now, I can either:
//create a map of all element, init all values to 0, set to 1 when consider
map<int,int> is_set; // is_set.size() will be equal to N
or
//create a (too) large array (total size), init to 0 the elements to consider
int* is_set = (int*)malloc(total_size * sizeof(int)); // is_set length will be total_size>>N
I know that accessing keys in map is O(log N) while it's only constant for arrays, but I don't know if malloc is not more costly at the creation while it also requires more memory?

When in doubt, measure the performance of both alternatives. That's the only way to know for sure which approach will be fastest for your application.
That said, a one-time large malloc is generally not terribly expensive. Also, although the map is O(log N), the big-O conceals a relatively large constant factor, at least for the std::map implementation, in my experience. I would not be surprised to find that the array approach is faster in this case, but again the only way to know for sure is to measure.
Keep in mind too that although the map does not have a large up-front memory allocation, it has many small allocations over the lifetime of the object (every time you insert a new element, you get another allocation, and every time you remove an element, you get another free). If you have very many of these, that can fragment your heap, which may negatively impact performance depending on what else your application might be doing at the same time.

If indexed search suits your needs (like provided by regular C-style arrays), probably std::map is not the right class for you. Instead, consider using std::vector if you need dynamic run-time allocation or std::array if your collection is fixed-sized and you just need the fastest bounds-safe alternative to a C-style pointer.
You can find more information on this previous post.

I know that accessing keys in map is O(log N) while it's only constant for arrays, but I don't know if malloc is not more costly at the creation while it also requires more memory?
Each entry in the map is dynamically allocated, so if the dynamic allocation is an issue it will be a bigger issue in the map. As of the data structure, you can use a bitmap rather than a plain array of int's. That will reduce the size of the array by a factor of 32 in architectures with 32bit ints, the extra cost of mapping the index into the array will in most cases be much smaller than the cost of the extra memory, as the structure is more compact and can fit in fewer cache lines.
There are other things to consider, as whether the density of elements in the set is small or not. If there are very few entries (i.e. the graph is sparse) then either option could be fine. As a final option you can manually implement the map by using a vector of pair<int,int> and short them, then use binary search. That will reduce the number of allocations, incur some extra cost in sorting and provide a more compact O(log N) solution than a map. Still, I would try to go for the bitmask.

Which stl container should I use when doing few inserts?

I don't know my exact numbers but i'll try my best. I have a 10000 element deque thats populated right at the start. Than i scan through each element and lets every 20 elements i'll need to insert an new element. The insert would happen at the current position and maybe one element back.
I don't exactly need to remember the position but i also don't exactly need random access either. I'd like fast inserts. Does deque and vector have a heavy price to pay on insert? Should i use list?
My other option is to have a 2nd deque list and as i go through each element insert it to the other deque list unless i need to do the insert i am talking about. This does need to be fast as its a performance intensive app. But I am using a lot of pointers (each element is a pointer) which is upsetting me but there isn't a way around that so i should assume L1 cache will always miss?

I'd start with std::vector in this case, but use a second std::vector for your mass mutations, reserve() appropriately, then swap() the vectors.
Update
It would take this general form:
std:vector<t_object*> source; // << source already holds 10000 elements
std:vector<t_object*> tmp;
// to minimize reallocations and frees to 1 and 1, if possible.
// if you do not swap or have to grow more, reserving can really work against you.
tmp.reserve(aMeaningfulReserveValue);
while (performingMassMutation) {
// "i scan through each element and lets every 20 elements"
for (twentyElements)
tmp.push_back(source[readPos++]);
// "every 20 elements i'll need to insert an new element"
tmp.push_back(newElement);
}
// approximately 500 iterations later…
source.swap(tmp);
Borealid brought up a good point, which is measure -- execution varies dramatically depending on your std library implementations, data sizes, complexity to copy, and so on.
For raw pointers of a collection this size with my configuration, the vector mass mutation and push_back above was 7 times faster than std::list insertion. push_back was faster than vector's range insertion.
As Emile points out below, std::vector::swap() does not need to move or reallocate elements -- it can just swap out internals (provided the allocators are the same type).

First off, the answer to all performance questions is "benchmark it". Always. Now...
If you don't care about the memory overhead, and you don't need random access, but you do care about having constant-time insertions, list is probably right for you.
std::vector will have constant-time insertions at the end when it has sufficient capacity. When the capacity is exceeded, it needs a linear-time copy. deque is better because it links discrete allocations, avoiding a complete copy and letting you do constant-time insertions at the front as well. Random insertions (every 20 elements) will always be linear time.
As for cache locality, a vector is as good as you can get (contiguous memory), but you said you cared about insertions rather than lookups; in my experience, when that's the case you don't care about how hot the cache gets as you scan through to dump, so list's poor behavior doesn't much matter.

Lists are useful when either you frequently want to insert elements in the middle of the collection, or frequently remove them. Lists are, however, slow to read.
Vectors are very fast to read and very fast when you only want to add or remove elements at the end of the collection, but they are very slow when you insert elements in the middle. This is because it has to move all elements after the desired position by one place, to make room for the new element.
Deques are basically doubly linked lists that can be used as vectors.
If you don't need to insert elements in the middle of the collection (you don't care about the order), I suggest you use vector. If you can approximate the number of elements that will be introduced in the vector from the beginning, you should also use std::vector::reserve to allocate memory necessary from the beginning. The value you pass to reserve doesn't need to be exact, just approximate; if it's smaller than needed, the vector will resize automatically, when necessary.

You can go two ways: list is always an option for random place insertions, however as you allocate every element separately this will cause some performance implications too. The other option of inserting in-place in the deque is not good as well - because you will pay linear time for every insertion. Maybe your idea of inserting in new deque is the best here - you pay twice as much memory, but on the other hand you always do insertion either at the end of the second deque, or one element before that - this all gives constant amortized time, and still you have good caching of the container.

The number of copies done for std::vector/deque ::insert etc is proportional to the number of elements between the insert position and the end of container (the number of elements that need to be shifted to make room). The worst-case for a std::vector is O(N) - when you insert at the front of the container. If you're inserting M elements the worst -case is therefore O(M*N) which isn't great.
There could also be a reallocation involved if the containers capacity is exceeded. You could prevent reallocation by ensuring that sufficient space was ::reserve'd up front.
You're other suggestion - copying to a second std::vector/deque container could be better in that it could always be organised to achieve O(N) complexity, but at the cost of temporarily storing two containers.
Using a std::list would allow you to achieve in-place O(1) inserts, but at the cost of additional memory overhead (storing the list pointers etc) and reduced memory locality (list nodes are not allocated contiguously). You could improve the memory locality by using a pool'd memory allocator (Boost pools maybe?).
Overall you'd have to benchmark to really sort out which is "the fastest" approach.
Hope this helps.

If you need fast inserts in the middle, but don't care about random access, vector and deque are definitely not for you: For those, every time you insert something, all elements between that one and the end have to be moved. Of the built-in containers, list is almost certainly your best bet. However a better data structure for your scenario would probably be a VList because it provides better cache locality, however that's not provided by the C++ standard library. The Wikipedia page links to a C++ implementation, however from a quick view on the interface it doesn't seem to completely STL compatible; I don't know if this is an issue for you.
Of course, in the end the only way to be sure which is the optimal solution is to measure the performance.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js