Why is std::multimap slower than std::priority_queue?

I implemented an algorithm where I make use of a priority queue.
I was motivated by this question:
Transform a std::multimap into std::priority_queue
I am going to store up to 10 million elements with their specific priority value.
I then want to iterate until the queue is empty.
Every time an element is retrieved it is also deleted from the queue.
After this I recalculate the element's priority value, which can change as a result of previous iterations.
If the value increased, I insert the element into the queue again.
How often this happens depends on the progress: in the first 25% it does not happen, in the next 50% it does happen, and in the last 25% it happens multiple times.
After retrieving the next element and not reinserting it, I process it. For this I do not need the element's priority value, only its technical ID.
This is why I intuitively chose a std::multimap, using .begin() to get the first element, .insert() to insert one, and .erase() to remove it.
Also, I did not choose std::priority_queue directly, because answers to other questions on this topic said std::priority_queue is most likely meant for single values, not for mapped values.
After reading the link above, I reimplemented it using a priority queue, analogous to the answer to the linked question.
My runtimes do not seem that far apart (about an hour for 10 million elements).
Now I am wondering why std::priority_queue is faster at all.
I would actually have expected std::multimap to be faster because of the many reinsertions.
Maybe the problem is that there are too many reorganizations of the multimap?
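For reference, a minimal sketch of the multimap variant described above, assuming double priorities, int IDs, and hypothetical recalculate()/process() helpers:

#include <map>

double recalculate(int id);   // hypothetical: recompute this element's priority
void process(int id);         // hypothetical: handle the element by its ID

void drain(std::multimap<double, int>& q)    // priority -> element ID
{
    while (!q.empty()) {
        auto top = q.begin();                // element at one extreme end
        auto [priority, id] = *top;
        q.erase(top);                        // delete it from the queue

        double p = recalculate(id);
        if (p > priority)
            q.insert({p, id});               // reinsert with the increased value
        else
            process(id);                     // only the technical ID is needed
    }
}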

To summarize: your runtime profile involves both removing and inserting elements from your abstract priority queue, with you trying to use both a std::priority_queue and a std::multimap as the actual implementation.
Both the insertion into a priority queue and into a multimap have roughly equivalent complexity: logarithmic.
However, there's a big difference when removing the next element from a multimap versus a priority queue. With a priority queue, popping the top swaps the last vector element to the front and sifts it down, a logarithmic number of comparisons, but the memory traffic is trivial: the underlying container is a vector, and shrinking it by one element from the back is mostly a nothing-burger.
But with a multimap you're removing the element from one of the extreme ends of the multimap.
The typical underlying implementation of a multimap is a balanced red/black tree. Repeated element removals from one of the extreme ends of a multimap have a good chance of skewing the tree, requiring frequent rebalancing work up the tree. This is going to be an expensive operation.
This is likely to be the reason why you're seeing a noticeable performance difference.
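For comparison, a minimal sketch of the equivalent loop with std::priority_queue (same hypothetical helpers as in the question's sketch; pairs compare by their first member first, so the largest priority sits on top):

#include <queue>
#include <utility>
#include <vector>

double recalculate(int id);   // hypothetical, as above
void process(int id);

using Entry = std::pair<double, int>;   // (priority, element ID)

void drain(std::priority_queue<Entry>& q)
{
    while (!q.empty()) {
        auto [priority, id] = q.top();   // O(1) access to the maximum
        q.pop();                         // swap-to-back + sift-down over contiguous memory

        double p = recalculate(id);
        if (p > priority)
            q.push({p, id});             // O(log n) heap insertion
        else
            process(id);
    }
}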

I think the main difference comes from two facts:
A priority queue has a weaker constraint on the order of elements: it doesn't have to keep the whole range of keys/priorities sorted, whereas a multimap has to provide that. A priority queue only has to guarantee that the first/top element is the largest.
So, while the theoretical time complexities for the operations on both are the same, O(log(size)), I would argue that erasing from a multimap and rebalancing the RB-tree performs more operations: it simply has to move more elements around. (Note: an RB-tree is not mandatory, but it is very often chosen as the underlying structure for a multimap.)
The underlying container of a priority queue is contiguous in memory (it's a std::vector by default).
I suspect the rebalancing is also slower, because an RB-tree relies on nodes (vs. the contiguous memory of a vector), which makes it prone to cache misses, although one has to remember that operations on a heap are not done in a sequential manner either; they hop through the vector. To be really sure, one would have to profile it.
The above points are true for both insertions and erasures. I would say the difference is in the constant factors lost in the big-O notation. This is intuitive thinking.
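To illustrate the weaker ordering constraint, a max-heap built with std::make_heap guarantees only that the largest element is at the front; the rest of the range stays unsorted:

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v{3, 1, 4, 1, 5, 9, 2, 6};
    std::make_heap(v.begin(), v.end());            // builds a max-heap in O(n)

    for (int x : v) std::cout << x << ' ';         // heap order, NOT sorted order
    std::cout << "\ntop: " << v.front() << '\n';   // only the top (9) is guaranteed
}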

The abstract, high level explanation for map being slower is that it does more. It keeps the entire structure sorted at all times. This feature comes at a cost. You are not paying that cost if you use a data structure that does not keep all elements sorted.
Algorithmic explanation:
To meet the complexity requirements, a map must be implemented as a node based structure, while priority queue can be implemented as a dynamic array. The implementation of std::map is a balanced (typically red-black) tree, while std::priority_queue is a heap with std::vector as the default underlying container.
Heap insertion is usually quite fast. The average complexity of insertion into a heap is O(1), compared to O(log n) for a balanced tree (the worst case is the same, though). Creating a priority queue of n elements has a worst-case complexity of O(n), while creating a balanced tree is O(n log n). See a more in-depth comparison: Heap vs Binary Search Tree (BST)
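As a sketch of that construction difference: the iterator-pair constructor of std::priority_queue heapifies the whole input at once in O(n), while range-constructing a multimap performs n tree insertions, O(n log n) for unsorted input (Entry is an assumed (priority, ID) pair):

#include <map>
#include <queue>
#include <utility>
#include <vector>

using Entry = std::pair<double, int>;

// Heapifies the whole range at once: O(n).
std::priority_queue<Entry> build_heap(const std::vector<Entry>& in)
{
    return std::priority_queue<Entry>(in.begin(), in.end());
}

// n individual balanced-tree insertions: O(n log n) for unsorted input.
std::multimap<double, int> build_tree(const std::vector<Entry>& in)
{
    return std::multimap<double, int>(in.begin(), in.end());
}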
Additional implementation detail:
Arrays usually use the CPU cache much more efficiently than node-based structures such as trees or lists. This is because adjacent elements of an array are adjacent in memory (high memory locality) and therefore may fit within a single cache line. Nodes of a linked structure, however, exist in arbitrary locations (low memory locality) in memory, and usually only one or very few of them fit within a single cache line. Modern CPUs are very, very fast at calculations, but memory speed is a bottleneck. This is why array-based algorithms and data structures tend to be significantly faster than node-based ones.

While I agree with both @eerorika and @luk32, it is worth mentioning that in the real world, when using the default STL allocator, memory management cost easily outweighs a few data structure maintenance operations such as updating pointers to perform a tree rotation. Depending on the implementation, the memory allocation itself could involve tree maintenance operations and potentially trigger a system call, where it becomes even more costly.
With a multimap, there is a memory allocation and deallocation associated with each insert() and erase() respectively, which often contributes an order of magnitude more slowness than the extra steps in the algorithm.
A priority_queue, however, by default uses a vector, which only triggers a memory allocation (a much more expensive one, though, involving a move of all stored objects to the new memory location) once the capacity is exhausted. In your case pretty much all allocation happens in the first iterations for the priority_queue, whereas the multimap keeps paying memory management costs with each insert and erase.
The downside around memory management for map can be mitigated by using a memory-pool based custom allocator. This also gives you a cache hit rate comparable to the priority queue. It might even outperform the priority_queue when your objects are expensive to move or copy.
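As a sketch of that mitigation: with C++17's polymorphic allocators, a pool-backed multimap is nearly a drop-in change, and erased nodes go back to the pool instead of the general-purpose heap:

#include <map>
#include <memory_resource>

int main()
{
    std::pmr::unsynchronized_pool_resource pool;   // recycles fixed-size node blocks
    std::pmr::multimap<double, int> q{&pool};

    q.insert({1.5, 42});
    q.erase(q.begin());   // node memory returns to the pool, not to the OS
}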

Related

fastest non-destructive method to get sorted version of a heap?

I have a priority heap holding an event queue.
I need to dump this out for the user in order, but without rendering the heap unusable.
Obviously if I was willing to destroy it I could simply dequeue events until it was empty, and add them in order to my sorted list. But then of course the heap is gone. Further, Quicksort is so much faster than a heap sort that I don't have confidence that I can build this sorted list faster than I can make a copy of the heap and sort the copy.
One idea I had was to in fact destroy the heap by dequeueing all its items, but then... replacing the now-empty priority queue with the resulting sorted list, which should maintain the heap property (of cell i being a higher priority than cell i * 2+1 and i * 2+2). So I'm also wondering whether such a heap would perform better than a regular heap.
The easiest solution is just to copy the heap array and sort the copy. But some sorts do a bad job when given sorted or somewhat-sorted data, and I'm wondering whether the Standard C++ library's sort (or C's qsort()) can be trusted to sort a heap efficiently?
Conversely, while quicksort is far faster than heapsort in the general case, this isn't the general case. The array is already heapified, which is the first half of the work of heapsort. It'd be plausible that pulling the heap items out in order (the second half) could be faster than quicksort. So I'm wondering whether there's a research result that it is typically faster than quicksort to 1) copy the heapified array, 2) pull items out in order and place them at the end of the copy, and 3) reverse the copy. (Or pull items out in order into a second new array and use that as your return value; I suspect the reversal stage may be better than increasing the number of cache lines needed.)
It looks to me like you're concerned about a performance problem that isn't really a problem. As I understand it, modern C++ implementations of sort use Introsort, which avoids the pathological worst-case times of a naïve Quicksort. And the difference between Quicksort and Heapsort, in the context of generating user output, is not large enough to be a concern.
Just copy the heap and sort it. Or sort the heap directly and output the result, provided of course that doing so doesn't break the heap.
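For example, a non-destructive dump can copy the underlying array and run std::sort_heap on the copy; that is exactly the "pull items out in order" half of heapsort, applied to already-heapified data:

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> events{3, 1, 4, 1, 5, 9, 2, 6};
    std::make_heap(events.begin(), events.end());   // the live event queue

    std::vector<int> out = events;                  // O(n) copy; original heap untouched
    std::sort_heap(out.begin(), out.end());         // O(n log n), ascending result

    for (int e : out) std::cout << e << ' ';        // 1 1 2 3 4 5 6 9
}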
You asked if a sorted heap performs better than a non-sorted heap. No. When adding an item, you still add it as the last node and sift it up. Half of the nodes in a heap are at the leaf level and assuming a uniform distribution of new items, then half of the items you add will end up at the leaf level, requiring no swaps. Worst case is if every item you add ends up being the smallest (in a min-heap), in which case every time you add an item it will take log(n) swaps to move it to the root. Now, if every item added is larger than any other item in the heap, then of course addition is O(1). But that's true regardless of whether the heap was initially created from a sorted array.
Deleting an item from the heap requires that you replace the root item with an item from the leaf level and then sift it down. In a sorted heap, the likelihood that the replacement item will end up back down at the leaf level is very high, which means that adjusting the heap will require the maximum log(n) swaps. A sorted heap almost guarantees that removal will require the maximum number of swaps. In this case, a sorted heap is potentially worse in terms of performance than a heap constructed from a randomly-arranged array.
But all that changes quickly as you begin adding items to and removing items from the heap. The heap becomes "not sorted" fairly quickly.
Over the life of the priority queue, it's highly unlikely that the initial order of items will make any noticeable difference in the performance of your binary heap.
With the usual heap implementation, just sort the heap in place. A sorted list satisfies the heap condition for a min-heap. Alternatively, if you sort the heap descending, you satisfy the heap condition for a max-heap. And you can always sort it one way and traverse it another if that is what you need.
Note that the std::sort_heap documentation warns about breaking the heap condition. Be careful that you know you haven't broken it if you are changing the heap data in place.
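For instance, sorting a default (max-) heap descending leaves every parent >= its children, so the heap invariant survives, and std::is_heap can confirm it:

#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> heap{3, 1, 4, 1, 5, 9, 2, 6};
    std::make_heap(heap.begin(), heap.end());               // default max-heap

    std::sort(heap.begin(), heap.end(), std::greater<>());  // descending order

    // A descending array is still a valid max-heap: parent >= children everywhere.
    std::cout << std::boolalpha
              << std::is_heap(heap.begin(), heap.end()) << '\n';   // true
}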

What container to choose for fast search/insert with huge amounts of data?

So it's a thought experiment. I want to have a huge collection of structures such as:
struct Item
{
    KeyType key;
    ValueType value;
};
And I need fast access by a key and fast insertion of new values.
I would not use std::map because it has too big a memory overhead per structure, and for huge amounts of data that could be drastic. Right?
So next I would consider using a sorted std::vector and binary_search. It's fine for searching, but adding new values to the vector would be too slow. Imagine you need to add a new value near the beginning of the sorted array: you'd have to shift data right a LOT!
What if I use a deque? As far as I know it has O(1) push_back/push_front, but still O(n) insertion (as it would have to move data anyway, though less of it).
The questions are:
1) Is O(n) insertion into a deque much faster in a real situation than O(n) insertion into a vector?
2) What happens when you insert a value into a deque and the bucket it should go into is full?
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
Thanks!
I would not use std::map because it has too big a memory overhead per structure, and for huge amounts of data that could be drastic. Right?
That depends on the size of your structs... the bigger they are the less the overheads are as a proportion of the overall memory use. For example, a std::map implementation might average say 20 bytes of housekeeping data per element (I just made that up - measure on your own system), so if your struct size is in the hundreds of bytes - who cares...? But, if the struct holds 2 ints, it's a big proportion....
So next I would consider using a sorted std::vector and binary_search. It's fine for searching, but adding new values to the vector would be too slow. Imagine you need to add a new value near the beginning of the sorted array: you'd have to shift data right a LOT!
Totally unsuitable....
1) Is O(n) insertion into a deque much faster in a real situation than O(n) insertion into a vector?
As deque is likely implemented as a vector of fixed-sized arrays, insertion implies a shuffling of all elements towards the nearest end of the container. The shuffling's probably a tiny bit less cache efficient, but if inserting nearer the front of the container it would likely still end up faster.
2) What happens when you insert a value into a deque and the bucket it should go into is full?
As above, it'll need to shuffle, overflowing either:
the last element to become the first element of the next "bucket", moving all those elements along and overflowing into the next bucket, etc.
the first element to become the last element of the previous "bucket", moving all those elements along and overflowing into the previous bucket, etc.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
unordered_map, which is implemented as a hash map. If you have small objects (e.g. less than 20 or 30 bytes) or a firm cap on the number of elements, you can normally easily outperform unordered_map with custom code, but it's rarely worth the effort unless table access dominates your application's performance, and that performance is critical.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
Consider using std::unordered_map, which is an implementation of a hash map. Insertion, lookup, and removal are all O(1) in the average case. This assumes that you will only ever look for an item based on its exact key; if your searches can have different constraints then you either need a different structure, or you need multiple maps to map the various keys you will search for to the corresponding object.
This requires that there is an available hash function for KeyType, either as part of the standard library or provided by you.
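A minimal sketch of providing that hash, with the question's KeyType fleshed out hypothetically and a deliberately simplistic hash combiner (real code may want a better mixing function):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct KeyType {                  // hypothetical concrete key
    int id;
    std::string name;
    bool operator==(const KeyType& o) const
    {
        return id == o.id && name == o.name;
    }
};

struct KeyHash {                  // naive combiner: XOR of member hashes
    std::size_t operator()(const KeyType& k) const
    {
        return std::hash<int>{}(k.id)
             ^ (std::hash<std::string>{}(k.name) << 1);
    }
};

using ValueType = int;            // hypothetical
std::unordered_map<KeyType, ValueType, KeyHash> table;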
There's no container that provides the best of all worlds. As you say, you want the best lookup/insertion with the minimum amount of space needed for storing elements.
Below is a list of containers you could consider for your implementation:-
VECTOR :-
Strengths:-
1) Space is allocated only for holding data.
2) Good for random access.
3) Container of choice if insertions/deletions are not in the middle of the container.
Weakness:-
1) poor performance if insertions/deletions are in the middle.
2) reallocations happen if reserve is not used properly.
DEQUE:-
Choose deque over vector in case insertions/deletions are at the beginning as well as end of the container.
MAP:-
Disadvantage over vector:-
1) more space is allocated for holding pointers.
Advantages over vector:-
1) better insertions/deletions/lookup as compared to vector.
If std::unordered_map is used then these dictionary operations would be amortized O(1).
Firstly, in order to directly answer your questions:
1) Is O(n) insertion into a deque much faster in a real situation than O(n) insertion into a vector?
The number of elements that have to be moved is (on average) only half compared to a vector. However, it can actually perform worse, as the data is stored in non-contiguous memory, so copying/moving the same number of elements is much less efficient (it cannot e.g. be implemented in terms of a single memcpy operation).
2) What happens when you insert a value into a deque and the bucket it should go into is full?
At least for the GNU gcc libstdc++ implementation, every bucket except the first and last one is always full. I believe that inserting in the middle means that all elements are moved/copied one slot towards the closer end (front or back), and the effect ripples through all buckets until the first or last one is reached.
In summary, the only scenario where std::deque is consistently better than vector is when you use it as (surprise) a queue (only inserting and removing elements at the front or back), and that's what the implementation is optimized for. It is not optimized for insertions in the middle.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
As already stated by others: A hash table like std::unordered_map is the data structure you are looking for.
From what I've heard, however, std::unordered_map is a slightly suboptimal implementation of it, as it uses buckets to resolve hash collisions, and those buckets are implemented as linked lists (there is a very interesting talk by Chandler Carruth on the general topic of the performance of different data structures). For random access on big data structures, cache locality matters a lot less, so this is probably not such a big issue in your case.
Finally, I'd like to mention that if your value and key types are small PODs, then depending on how big your huge collection is (are we talking about some millions or rather billions of elements) and how often you actually have to insert/remove elements, there might still be cases where a simple std::vector outperforms any other STL container. So as always: if your thought experiment ever becomes reality, try it out and measure.
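For completeness, a minimal sketch of that sorted-vector approach: std::lower_bound finds the insertion point in O(log n), and the O(n) shift is one contiguous move, which is often cheap in practice:

#include <algorithm>
#include <utility>
#include <vector>

// Keeps a vector of (key, value) pairs sorted by key.
template <class K, class V>
void sorted_insert(std::vector<std::pair<K, V>>& v, const K& key, const V& val)
{
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
    v.insert(it, {key, val});   // O(n) shift, but a single contiguous move
}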

Looking for clarification on Hashing and BST functions and Big O notation

So I am trying to understand the data types and Big O notation of some functions for a BST and Hashing.
So first off, how are BSTs and Hashing stored? Are BSTs usually arrays, or are they linked lists because they have to point to their left and right leaves?
What about Hashing? I've had the most trouble finding clear information regarding Hashing in terms of computation-based searching. I understand that Hashing is best implemented with an array of chains. Is this for faster searching or to decrease overhead on creating the allocated data type?
This following question might be just bad interpretation on my part, but what makes a traversal function different from a search function in BSTs, Hashing, and STL containers?
Is traversal Big O(N) for BSTS because you're actually visiting each node/data member, whereas search() can reduce its time by eliminating half the searching field?
And somewhat related, why is it that in the STL, list.insert() and list.erase() have a Big O(1) whereas the vector and deque counterparts are O(N)?
Lastly, why would a vector.push_back() be O(N)? I thought the function could be done something along the lines of this like O(1), but I've come across text saying it is O(N):
vector<int> vic(2,3);
vector<int>::const_iterator IT = vic.end();
//wanna insert 4 to the end using push_back
IT++;
(*IT) = 4;
hopefully this works. I'm a bit tired but I would love any explanations why something similar to that wouldn't be efficient or plausible. Thanks
BSTs (Ordered Binary Trees) are a series of nodes where a parent node points to its two children, which in turn point to their max-two children, etc. They're traversed in O(n) time because traversal visits every node. Lookups take O(log n) time. Inserts take O(log n) time to find the insertion point, but the insertion itself doesn't need to move a bunch of existing nodes; just allocate some memory and re-aim the pointers. :)
Hashes (unordered_map) use a hashing algorithm to assign elements to buckets. Usually buckets contain a linked list so that hash collisions just result in several elements in the same bucket. Traversal will again be O(n), as expected. Lookups and inserts will be amortized O(1). Amortized means that on average, O(1), though an individual insert might result in a rehashing (redistribution of buckets to minimize collisions). But over time the average complexity is O(1). Note, however, that big-O notation doesn't really deal with the "constant" aspect; only order of growth. The constant overhead in the hashing algorithms can be high enough that for some data-sets the O(log n) binary trees outperform the hashes. Nevertheless, the hash's advantage is that its operations are constant time-complexity.
Search functions take advantage (in the case of binary trees) of the notion of "order"; a search through a BST has the same characteristics as a basic binary search over an ordered array. O(log n) growth. Hashes don't really "search". They compute the bucket, and then quickly run through the collisions to find the target. That's why lookups are constant time.
As for insert and erase; in array-based sequence containers, all elements that come after the target have to be bumped over to the right. Move semantics in C++11 can improve upon the performance, but the operation is still O(n). For linked sequence containers (list, forward_list, trees), insertion and erasing just means fiddling with some pointers internally. It's a constant-time process.
push_back() will be O(1) until you exceed the existing allocated capacity of the vector. Once the capacity is exceeded, a new allocation takes place to produce a container that is large enough to accept more elements. All the elements need to then be moved into the larger memory region, which is an O(n) process. I believe Move Semantics can help here as well, but it's still going to be O(n). Vectors and strings are implemented such that as they allocate space for a growing data set, they allocate more than they need, in anticipation of additional growth. This is an efficiency safeguard; it means that the typical push_back() won't trigger a new allocation and move of the entire data set into a larger container. But eventually after enough push_backs, the limit will be reached, and the vector's elements will be copied into a larger container, which again has some extra headroom left over for more efficient push_backs.
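A small illustration of that safeguard: if the final size is known up front (a hypothetical count here), a single reserve() call removes all reallocation from the loop:

#include <vector>

int main()
{
    constexpr int n = 1'000'000;   // hypothetical element count

    std::vector<int> v;
    v.reserve(n);                  // one up-front allocation

    for (int i = 0; i < n; ++i)
        v.push_back(i);            // never reallocates: each call is O(1)
}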
Traversal refers to visiting every node, whereas search is only to find a particular node, so your intuition is spot on there. O(N) complexity because you need to visit N nodes.
std::vector::insert is for inserting in the middle, and it involves copying all subsequent elements over by one slot in order to make room for the element being inserted, hence O(N). A linked list doesn't have this issue, hence O(1). Similar logic applies to erase. deque properties are similar to vector's.
std::vector::push_back is an O(1) operation for the most part; it only deviates when capacity is exceeded and a reallocation + copy is needed.

Which one is better, STL map or unordered_map, for the following cases?

I am trying to compare STL map and STL unordered_map for certain operations. I looked on the net and it only increased my doubts regarding which one is better as a whole. So I would like to compare the two on the basis of the operations they perform.
Which one performs faster in
Insert, Delete, Look-up
Which one takes less memory and less time to clear it from the memory. Any explanations are heartily welcomed !!!
Thanks in advance
Which one performs faster in Insert, Delete, Look-up? Which one takes less memory and less time to clear it from the memory. Any explanations are heartily welcomed !!!
For a specific use, you should try both with your actual data and usage patterns and see which is actually faster... there are enough factors that it's dangerous to assume either will always "win".
implementation and characteristics of unordered maps / hash tables
Academically - as the number of elements increases towards infinity, those operations on an std::unordered_map (which is the C++ library offering for what Computing Science terms a "hash map" or "hash table") will tend to continue to take the same amount of time, O(1) (ignoring memory limits/caching etc.), whereas with a std::map (a balanced binary tree) each time the number of elements doubles it will typically need to do an extra comparison operation, so it gets gradually slower: O(log₂ n).
std::unordered_map implementations necessarily use open hashing: the fundamental expectation is that there'll be a contiguous array of "buckets", each logically a container of any values hashing thereto.
It generally serves to picture the hash table as a vector<list<pair<key,value>>> where getting from the vector elements to a value involves at least one pointer dereference as you follow the list-head-pointer stored in the bucket to the initial list node; the insert/find/delete operations' performance depends on the size of the list, which on average equals the unordered_map's load_factor.
If the max_load_factor is lowered (the default is 1.0), then there will be fewer collisions but more reallocation/rehashing during insertion and more wasted memory (which can hurt performance through increased cache misses).
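These knobs are exposed directly on the container; a small sketch of inspecting and tuning them (the printed bucket counts are implementation-dependent):

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    for (int i = 0; i < 1000; ++i) m[i] = i;

    std::cout << "buckets: " << m.bucket_count()
              << ", load factor: " << m.load_factor() << '\n';

    m.max_load_factor(0.5f);   // fewer collisions per bucket...
    m.rehash(0);               // ...at the cost of more buckets/memory
    std::cout << "buckets now: " << m.bucket_count() << '\n';
}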
The memory usage for this most-obvious of unordered_map implementations involves both the contiguous array of bucket_count() list-head-iterator/pointer-sized buckets and one doubly-linked list node per key/value pair. Typically, bucket_count() + 2 * size() extra pointers of overhead, adjusted for any rounding-up of dynamic memory allocation request sizes the implementation might do. For example, if you ask for 100 bytes you might get 128 or 256 or 512. An implementation's dynamic memory routines might use some memory for tracking the allocated/available regions too.
Still, the C++ Standard leaves room for real-world implementations to make some of their own performance/memory-usage decisions. They could, for example, keep the old contiguous array of buckets around for a while after allocating a new larger array, so rehashing values into the latter can be done gradually to reduce the worst-case performance at the cost of average-case performance as both arrays are consulted during operations.
implementation and characteristics of maps / balanced binary trees
A map is a binary tree, and can be expected to employ pointers linking distinct heap memory regions returned by different calls to new. As well as the key/value data, each node in the tree will need parent, left, and right pointers (see wikipedia's binary tree article if lost).
comparison
So, both unordered_map and map need to allocate nodes for key/value pairs with the former typically having two-pointer/iterator overhead for prev/next-node linkage, and the latter having three for parent/left/right. But, the unordered_map additionally has the single contiguous allocation for bucket_count() buckets (== size() / load_factor()).
For most purposes that's not a dramatic difference in memory usage, and the deallocation time difference for one extra region is unlikely to be noticeable.
another alternative
For those occasions when the container's populated up front then repeatedly searched without further inserts/erases, it can sometimes be fastest to use a sorted vector, searched using Standard algorithms binary_search, equal_range, lower_bound, upper_bound. This has the advantage of a single contiguous memory allocation, which is much more cache friendly. It always outperforms map, but unordered_map may still be faster - measure if you care.
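A minimal sketch of that pattern, with hypothetical int keys and string values: sort once up front, then search repeatedly with std::lower_bound in O(log n) over one contiguous allocation:

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main()
{
    // Populate up front, sort once...
    std::vector<std::pair<int, std::string>> data{
        {3, "three"}, {1, "one"}, {2, "two"}};
    std::sort(data.begin(), data.end());   // pairs compare by key first

    // ...then search repeatedly.
    auto it = std::lower_bound(data.begin(), data.end(), 2,
        [](const std::pair<int, std::string>& p, int key) { return p.first < key; });
    if (it != data.end() && it->first == 2)
        std::cout << it->second << '\n';   // prints "two"
}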
The reason there is both is that neither is better as a whole.
Use either one. Switch if the other proves better for your usage.
std::map provides better space for worse time.
std::unordered_map provides better time for worse space.
The answer to your question is heavily dependent on the particular STL implementation you're using. Really, you should look at your STL implementation's documentation – it'll likely have a good amount of information on performance.
In general, though, according to cppreference.com, maps are usually implemented as red-black trees and support operations with time complexity O(log n), while unordered_maps usually support constant-time operations. cppreference.com offers little insight into memory usage; however, another StackOverflow answer suggests maps will generally use less memory than unordered_maps.
For the STL implementation Microsoft packages with Visual Studio 2012, it looks like map supports these operations in amortized O(log n) time, and unordered_map supports them in amortized constant time. However, the documentation says nothing explicit about memory footprint.
Map:
Insertion:
For the first version (insert(x)), logarithmic.
For the second version (insert(position,x)), logarithmic in general, but amortized constant if x is inserted right after the element pointed to by position.
For the third version (insert(first,last)), N*log(size+N) in general (where N is the distance between first and last, and size the size of the container before the insertion), but linear if the elements between first and last are already sorted according to the same ordering criterion used by the container.
Deletion:
For the first version (erase(position)), amortized constant.
For the second version (erase(x)), logarithmic in container size.
For the last version (erase(first,last)), logarithmic in container size plus linear in the distance between first and last.
Lookup:
Logarithmic in size.
Unordered map:
Insertion:
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
Deletion:
Average case: Linear in the number of elements removed ( constant when you remove just one element )
Worst case: Linear in the container size.
Lookup:
Average case: constant.
Worst case: linear in container size.
Knowing these, you can decide which container to use according to the operations your implementation needs.
Source: www.cplusplus.com

How large does a collection have to be for std::map<k,v> to outpace a sorted std::vector<std::pair<k,v> >?

How large does a collection have to be for std::map<k,v> to outpace a sorted std::vector<std::pair<k,v>>?
I've got a system where I need several thousand associative containers, and std::map seems to carry a lot of overhead in terms of CPU cache. I've heard somewhere that for small collections std::vector can be faster -- but I'm wondering where that line is....
EDIT: I'm talking about 5 items or fewer at a time in a given structure. I'm concerned most with execution time, not storage space. I know that questions like this are inherently platform-specific, but I'm looking for a "rule of thumb" to use.
Billy3
It's not really a question of size, but of usage.
A sorted vector works well when the usage pattern is that you read the data, then you do lookups in the data.
A map works well when the usage pattern involves a more or less arbitrary mixture of modifying the data (adding or deleting items) and doing queries on the data.
The reason for this is fairly simple: a map has higher overhead on an individual lookup (thanks to using linked nodes instead of a monolithic block of storage). An insertion or deletion that maintains order, however, has a complexity of only O(lg N). An insertion or deletion that maintains order in a vector has a complexity of O(N) instead.
There are, of course, various hybrid structures that can be helpful to consider as well. For example, even when data is being updated dynamically, you often start with a big bunch of data, and make a relatively small number of changes at a time to it. In this case, you can load your data into memory into a sorted vector, and keep the (small number of) added objects in a separate vector. Since that second vector is normally quite small, you simply don't bother with sorting it. When/if it gets too big, you sort it and merge it with the main data set.
Edit2: (in response to edit in question). If you're talking about 5 items or fewer, you're probably best off ignoring all of the above. Just leave the data unsorted, and do a linear search. For a collection this small, there's effectively almost no difference between a linear search and a binary search. For a linear search you expect to scan half the items on average, giving ~2.5 comparisons. For a binary search you're talking about log2 N, which (if my math is working this time of the morning) works out to ~2.3 -- too small a difference to care about or notice (in fact, a binary search has enough overhead that it could very easily end up slower).
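In code, the "just scan it" approach for such tiny collections is about as simple as it gets (hypothetical int keys and string values):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tiny "associative container": an unsorted vector plus a linear scan.
std::vector<std::pair<int, std::string>> items;   // holds ~5 entries

const std::string* find_value(int key)
{
    auto it = std::find_if(items.begin(), items.end(),
        [key](const std::pair<int, std::string>& p) { return p.first == key; });
    return it != items.end() ? &it->second : nullptr;
}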
If by "outpace" you mean consuming more space (i.e. memory), then it's very likely that vector will always be more efficient (the underlying implementation is a contiguous memory array with no other data, whereas map is a tree, so every datum implies using more space). This, however, depends on how much extra space the vector reserves for future inserts.
When it is about time (and not space), vector will also always be more effective (doing a dichotomic search). But it will be extremely bad for adding new elements (or removing them).
So: no simple answer! Look up the complexities and think about the uses you are going to make. http://www.cplusplus.com/reference/stl/
The main issue with std::map is an issue of cache, as you pointed out.
The sorted vector is a well-known approach: Loki::AssocVector.
For very small datasets, the AssocVector should crush the map, despite the copy involved during insertion, simply because of cache locality. The AssocVector will also outperform the map for read-only usage. Binary search is more efficient there (fewer pointers to follow).
For all other uses, you'll need to profile...
There is, however, a hybrid alternative that you might wish to consider: using the Allocator parameter of the map to restrict the memory area where the items are allocated, thus minimizing the locality-of-reference issue (the root of cache misses).
There is also a paradigm shift that you might consider: do you need sorted items, or fast look-up?
In C++, the only STL-compliant containers for fast look-up have been implemented in terms of Sorted Associative Containers for years. However, the upcoming C++0x features the long-awaited unordered_map, which could outperform all the above solutions!
EDIT: Seeing as you're talking about 5 items or fewer:
Sorting involves swapping items. When inserting into std::map, that will only involve pointer swaps. Whether a vector or map will be faster depends on how fast it is to swap two elements.
I suggest you profile your application to figure it out.
If you want a simple and general rule, then you're out of luck - you'll need to consider at least the following factors:
Time
How often do you insert new items compared to how often you lookup?
Can you batch inserts of new items?
How expensive is sorting your vector? Vectors of elements that are expensive to swap become very expensive to sort; vectors of pointers take far less.
Memory
How much overhead per allocation does the allocator you're using have? std::map will perform one allocation per item.
How big are your key/value pairs?
How big are your pointers? (32/64 bit)
How fast does your implementation of std::vector grow? (Popular growth factors are 1.5 and 2.)
Past a certain size of container and element, the overhead of allocation and tree pointers will become outweighed by the cost of the unused memory at the end of the vector - but by far the easiest way to find out if and when this occurs is by measuring.
It would have to be in the millions of items. And even there...
I am thinking here more of memory usage and memory accesses. Below hundreds of thousands of items, take whichever you want; there will be no noticeable difference. CPUs are really fast these days, and the bottleneck is memory latency.
But even with millions of items, if your map<> has been built by inserting elements in random order, then when you want to traverse the map (in sorted order) you'll end up jumping around randomly in memory, stalling the CPU while it waits for data to become available, resulting in poor performance.
On the other hand, if your millions of items are in a vector, traversing it is really fast, taking advantage of the CPU's memory access prediction.
As other have written, it depends on your usage.
Edit: I would question the way you organize your thousands of associative containers more than the containers themselves, if they each contain only 5 items.