I'm building a content storage system for my game engine and I'm looking at possible alternatives for storing the data. Since this is a game, it's obvious that performance is important. Especially considering various entities in the engine will be requesting resources from the data structures of the content manager upon their creation. I'd like to be able to search resources by a name instead of an index number, so a dictionary of some sort would be appropriate.
What are the pros and cons to using an std::map and to creating my own dictionary class based on std::vector? Are there any speed differences (if so, where will performance take a hit? I.e. appending vs. accessing) and is there any point in taking the time to writing my own class?
For some background on what needs to happen:
Writing to the data structures occurs only at one time, when the engine loads. So no writing actually occurs during gameplay. When the engine exits, these data structures are to be cleaned up. Reading from them can occur at any time, whenever an entity is created or a map is swapped. There can be as little as one entity being created at a time, or as many as 20, each needing a variable number of resources. Resource size can also vary depending on the size of the file being read in at the start of the engine, images being the smallest and music being the largest depending on the format (.ogg or .midi).
Map: std::map has guaranteed logarithmic lookup complexity. It's usually implemented by experts and will be of high quality (e.g. exception safety). You can use custom allocators for custom memory requirements.
Your solution: It'll be written by you. A vector is for contiguous storage with random access by position, so how will you implement lookup by value? Can you do it with guaranteed logarithmic complexity or better? Do you have specific memory requirements? Are you sure you can implement a the lookup algorithm correctly and efficiently?
3rd option: If you key type is string (or something that's expensive to compare), do also consider std::unordered_map, which has constant-time lookup by value in typical situations (but not quite guaranteed).
If you want the speed guarantee of std::map as well as the low memory usage of std::vector you could put your data in a std::vector, std::sort it and then use std::lower_bound to find the elements.
std::map is written with performance in mind anyway, whilst it does have some overhead as they have attempted to generalize to all circumstances, it will probably end up more efficient than your own implementation anyway. It uses a red-black binary tree, giving all of it's operations O[log n] efficiency (aside from copying and iterating for obvious reasons).
How often will you be reading/writing to the map, and how long will each element be in it? Also, you have to consider how often will you need to resize etc. Each of these questions is crucial to choosing the correct data structure for your implementation.
Overall, one of the std functions will probably be what you want, unless you need functionality which is not in a single one of them, or if you have an idea which could improve on their time complexities.
EDIT: Based on your update, I would agree with Kerrek SB that if you're using C++0x, then std::unordered_map would be a good data structure to use in this case. However, bear in mind that your performance can degrade to linear time complexity if you have conflicting hashes (this cannot happen with std::map), as it will store the two pair's in the same bucket. Whilst this is rare, the probability of it obviously increases with the number of elements. So if you're writing a huge game, it's possible that std::unordered_map could become less optimal than std::map. Just a consideration. :)
Related
The C++ Standard Template Library provides a number of container types which have very obvious implementations as linked structures, such as list and map.
A very basic optimization with highly-linked structures is to use a custom sub-allocator with a private memory pool providing fixed-size allocation. Given the STL's emphasis on performance, I would expect this or similar optimizations to be performed. At the same time, all these containers have an optional Allocator template parameter, and it would seem largely redundant to be able to provide a custom allocator to something that already uses a custom allocator.
So, if I'm looking for maximum-performance linked structures with the STL, do I need to specify a custom allocator, or can I count on the STL to do that work for me?
It depends a lot on your workload.
If you don't iterate through your datastructures a lot, don't even bother optimizing anything. Your time is better spent elsewhere.
If you do iterate, but your payload is large and you do a lot of work per item, it's unlikely that the default implementation is going to be the bottleneck. The iteration inefficiencies are going to be swallowed by the per item work.
If you store small elements (ints, pointers), you do trivial operations and you iterate through the structure a lot, then you'll get better performance out of something like std::vector or boost::flat_map since they allow better pre-fetch operations.
Allocators are mostly useful when you find yourself allocating and deallocating a lot of small bits of memory. This causes memory fragmentation and can have performance implications.
As with all performance advice you need to benchmark your workload on your target machine.
P.S. Make sure the optimizations are turned on (i.e. -O3).
Naturally it can vary from one standard lib implementation to the next, but last time I checked in libs like MSVC, GNU C++, and EASTL, linked structures allocate node and element data in a single allocation.
However, each node is still allocated one-at-a-time against std::allocator which is a fairly general-purpose variable-length allocator (though it can at least assume that all elements being allocated are of a particular data type, but many times I've found it just defaulting to malloc calls in VTune and CodeXL sessions). Sometimes there's even thread-safe memory allocation going on which is a bit wasteful when the data structure itself isn't designed for concurrent modifications or simultaneous reads/writes while using thread-safe, general-purpose memory allocation to allocate one node at a time.
The design makes sense though if you want to allow the client to pass in their own custom allocators as template parameters. In that case you don't want the data structure to pool memory since that would be fighting against what the allocator might want to do. There's a decision to be made with linked structures in particular whether you allocate one node at a time and pass the responsibility of more efficient allocation techniques like free lists to the allocator to make each individual node allocation efficient, or avoid depending on the allocator to make that efficient and effectively do the pooling in the data structure by allocating many nodes at once in a contiguous fashion and pooling them. The standard lib leans towards the former route which can unfortunately makes things like std::list and std::map, when used against default std::allocator, very inefficient.
Personally for linked structures, I use my own handrolled solutions which rely on 32-bit indexes into arrays (ex: std::vector) which effectively serve as the pooled memory and work like an "indexed free list", like so:
... where we might actually store the linked list nodes inside std::vector. The links just become a way to allow us to remove things in constant-time and reclaim these empty spaces in constant-time. The real example is a little more complex than the pseudocode above since that code only works for PODs (real one uses aligned_storage, placement new, and manual dtor invocation like the standard containers), but it's not that much more complex. Similar case here:
... a doubly-linked "index" list using std::vector (or something like std::deque if you don't want pointer invalidation), for example, to store the list nodes. In that case the links allow us to just skip around when traversing the vector's contiguous memory. The whole point of that is to allow constant-time removal and insertion anywhere to the list while preserving insertion order on traversal (something which would be lost with just std::vector if we used swap-to-back-and-pop-back technique for constant-time removal from the middle).
Aside from making everything more contiguous and cache-friendly to traverse as well as faster to allocate and free, it also halves the sizes of the links on 64-bit architectures when we can use 32-bit indices instead into a random-access sequence storing the nodes.
Linked lists have actually accumulated a really bad rep in C++ and I believe it's largely for this reason. The people benchmarking are using std::list against the default allocator and incurring bottlenecks in the form of cache misses galore on traversal and costly, possibly thread-safe memory allocations with the insertion of each individual node and freeing with the removal of each individual node. Similar case with the huge preference these days towards unordered_map and unordered_set over map and set. Hash tables might have always had some edge but that edge is so skewed when map and set just use a general-purpose allocator one node at a time and incur cache misses galore on tree traversal.
So, if I'm looking for maximum-performance linked structures with the
STL, do I need to specify a custom allocator, or can I count on the
STL to do that work for me?
The advice to measure/profile is always wise, but if your needs are genuinely critical (like you're looping over the data repeatedly every single frame and it stores hundreds of thousands of elements or more while repeatedly also inserting and removing elements to/from the middle each frame), then I'd at least reach for a free list before using the likes of std::list or std::map. And linked lists are such trivial data structures that I'd actually recommend rolling your own if you're really hitting hotspots with the linked structures in the standard library instead of having to deal with the combo of both allocator and data structure to achieve an efficient solution (it can be easier to just have a data structure which is very efficient for your exact needs in its default form if it's trivial enough to implement).
I used to fiddle around with allocators a lot, reaching around data structures and trying to make them more efficient by experimenting with allocators (with moderate success, enough to encourage me but not amazing results), but I've found my life becoming so much easier to just make linked structures which pool their memory upfront (which gave me the most amazing results). And funnily enough, just creating these data structures which are more efficient about their allocation strategies upfront took less time than all the time I spent fiddling with allocators (trying out third party ones as well as implementing my own). Here's a quick example I whipped up which uses linked lists for the collision detection with 4 million particles (this is old so it was running on an i3).
It uses singly-linked lists using my own deque-like container to store the nodes like so:
Similar thing here with a spatial index for collision between 500k variable-sized agents (just took 2 hours to implement the whole thing and I didn't even bother to multithread it):
I point this out mainly for those who say linked lists are so inefficient since, as long as you store the nodes in an efficient and relatively contiguous way, they can really be a useful tool to add to your arsenal. I think the C++ community at large dismissed them a bit too hastily as I'd be completely lost without linked lists. Used correctly, they can reduce heap allocations rather than multiply them and improve spatial locality rather than degrade it (ex: consider that grid diagram above if it used a separate instance of std::vector or SmallVector with a fixed SBO for every single cell instead of just storing one 32-bit integer). And it doesn't take long to write, say, a linked list which allocates nodes very efficiently -- I'd be surprised if anyone takes more than a half hour to write both the data structure and unit test. Similar case with, say, an efficient red-black tree which might take a couple of hours but it's not that big of a deal.
These days I just end up storing the linked nodes directly inside things like std::vector, my own chunkier equivalent of std::deque, tbb::concurrent_vector if I need to build a concurrent linked structure, etc. Life becomes a lot easier when efficient allocation is absorbed into the data structure's responsibility rather than having to think about efficient allocation and the data structure as two entirely separate concepts and having to whip up and pass all these different types of allocators around all over the place. The design I favor these days is something like:
// Creates a tree storing elements of type T with 'BlockSize' contiguous
// nodes allocated at a time and pooled.
Tree<T, BlockSize> tree;
... or I just omit that BlockSize parameter and let the nodes be stored in std::vector with amortized reallocations while storing all nodes contiguously. I don't even bother with an allocator template parameter anymore. Once you absorb efficient node allocation responsibilities into the tree structure, there's no longer much benefit to a class template for your allocator since it just becomes like a malloc and free interface and dynamic dispatch becomes trivially cheap at that point when you're only involving it once for, say, every 128 nodes allocated/freed at once contiguously if you still need a custom allocator for some reason.
So, if I'm looking for maximum-performance linked structures with the
STL, do I need to specify a custom allocator, or can I count on the
STL to do that work for me?
So coming back to this question, if you genuinely have a very performance-critical need (either anticipated upfront like large amounts of data you have to process every frame or in hindsight through measurements), you might even consider just rolling some data structures of your own which store nodes in things like std::vector. As counter-productive as that sounds, it can take a lot less time than fiddling around and experimenting with memory allocators all day long, not to mention an "indexed linked list" which allocates nodes into std::vector using 32-bit indices for the links will halve the cost of the links and also probably take less time to implement than an std::allocator-conforming free list, e.g. And hopefully if people do this more often, linked lists can start to become a bit more popular again, since I think they've become too easily dismissed as inefficient when, used in a way that allocates nodes efficiently, they might actually be an excellent data structure for certain problems.
While the standard does not explicitly forbid such optimizations, it would be a poor design choice by the implementer.
First of all, one could imagine a use case where pooling allocation would not be a desirable choice.
It's not too hard to refer a custom allocator in the template parameters to introduce the pooling behavior you want, but disabling that behavior if it was a part of the container would be pretty much impossible.
Also from the OOP point of view, you'd have a template that obviously has more than one responsibility, some consider it a bad sign.
The overall answer seems to be "Yes, you do need a custom allocator" (Boost::pool_alloc?).
Finally, you can write a simple test to check what does your specific implementation do.
I was soving a competitive programming problem with the following requirements:
I had to maintain a list of unqiue 2d points (x,y), the number of unique points would be less than 500.
My idea was to store them in a hash table (C++ unordered set to be specific) and each time a node turned up i would lookup the table and if the node is not already there i would insert it.
I also know for a fact that i wouldn't be doing more than 500 lookups.
So i saw some solutions simply searching through an array (unsorted) and checking if the node was already there before inserting.
My question is is there any reasonable way to guess when should i use a hash table over a manual search over keys without having to benchmark them?
My question is is there any reasonable way to guess when should i use a hash table over a manual search over keys without having to benchmark them?
I am guessing you are familiar with basic algorithmics & time complexity and C++ standard containers and know that with luck hash table access is O(1)
If the hash table code (or some balanced tree code, e.g. using std::map - assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.
Otherwise, you might make some guess taking into account the approximate timing for various operations on a PC. BTW, the entire http:///norvig.com/21-days.html page is worth reading.
Basically, memory accesses are much more slow than everything else in the CPU. The CPU cache is extremely important. A typical memory access with cache fault requiring fetching data from DRAM modules is several hundreds times slower than some elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).
In practice, it does not matter that much, as long as your data is tiny (e.g. less than a thousand elements), since in that case it is likely to sit in L2 cache.
Searching (linearly) in an array is really fast (since very cache friendly), up to several thousand of (small) elements.
IIRC, Herb Sutter mentions in some video that even inserting an element inside a vector is practically -but unintuitively- faster (taking into account the time needed to move slices) than inserting it into some balanced tree (or perhaps some other container, e.g. an hash table), up to a container size of several thousand small elements. This is on typical tablet, desktop or server microprocessor with a multimegabyte cache. YMMV.
If you really care that much, you cannot avoid benchmarking.
Notice that 500 pairs of integers is probably fitting into the L1 cache!
My rule of thumb is to assume the processor can deal with 10^9 operations per second.
In your case there are only 500 entries. An algorithm up to O(N^2) could be safe. By using contiguous data structure like vector you can leverage the fast cache hit. Also hash function sometimes can be costly in terms of constant. However if you have a data size of 10^6, the safe complexity might be only O(N) in total. In this case you might need to consider O(1) hashmap for a single lookup.
You can use Big O Complexity to roughly estimate the performance. For the Hash Table, Searching an element is between O(1) and O(n) in the worst case. That means, that in the best case your access time is independant of the number of elements in your map but in the worst case it is linear dependant on the size of your hash table.
A Binary tree has a guaranteed search complexity of O(nlog(n)). That means, that searching an element always depends on the size of the array, but in the Worst Case its faster than a hash table.
You can look up some Big O Complexities at this handy website here: http://bigocheatsheet.com/
I am currently developing (at Design-Stage) a kinda small compiler which uses a custom IR. The problem I am having is to choose an efficient container data structure to contain all the instructions.
A basic block will contain ~10000 instructions and an instruction will be like ~250 Bytes big.
I thought about using a list because the compiler will have lots of complex transformations (ie. lots of random insertions/removals) so having a container data structure which does not invalidate iterators would be good. As it would keep the transformation algorithm simple and easy to follow.
On the other hand, it would be a lost of performance because of the known problem with cache misses and memory fragmentation. An std::vector would help here but I imagine it would be a pain to implement transformations with a vector.
So the questions is, if there is another data structure which has low memory fragmentation to reduce memory cache misses and does not invalidate iterators.
Or if I should ignore this and keep using a list.
Start with using Container = std::vector<Instruction>. std::vector is a pretty good default container. Once performance becomes an issue profile the program with a couple of different containers. You should be able to swap out the Container without needing much change in the rest of the code. I imagine some kind of array-list would be best, but you should probably check.
I have the following requirements:
I need a datastructure with key,value pairs(keys are integers if that helps).
I need the below operations:-
Iteration(most used)
Insertion (2nd most used)
Searching by key and deletion(least)
I plan to use multiple locks over the structure for concurrent access.
What is the ideal data structure to use?
Map or an unordered map?
I think unordered map makes sense, because i can insert in O(1), delete in O(1). But i am not sure of the iteration. How bad is the performance when compared to map?
Also, i plan to use multiple locks on blocks instead of the whole structure. Any good implementation example of this?
Thanks
The speed of iterator incrementing is O(1) for both containers, although you might get somewhat better cache locality from std::unordered_map.
Apart from the slower O(log N) find/insert/erase functionality of std::map, one other difference is that std::map provides bidirectional iterators, whereas the faster (amortized O(1) element access) std::unordered_map only provides forward iterators.
The excellent book C++ Concurrency in Action: Practical Multithreading by Anthony Williams provides a code example of a multithreaded unordered_map with a lock per entry. This book is highly recommended if you are doing serious multithreaded coding.
Iteration is not a problem in an unordered_map. It is a little less efficient than a vector, but not largely so.
As always, you will need to benchmark for YOUR use-cases, and compare with other container types if it's a critical part of your code.
Not sure what you mean by "multiple locks on blocks instead of the whole structure" - any container updates will need to be locked for the whole container...
Why not simply use an existing concurrent_unordered_map which you can find in both TBB and Concrt.
Have you thought about trying a std::deque The reasoning being as follows:
Iteration is fast - data should be more or less packed close (unlike lists)
Insertion (at either end) should be quick - data in a deque is never resized
Iteration and deletion slow (but uncommon usecase).
If the last two cases are common, a std::list may be used. Also consider testing a std::vector` since it is more cache efficient.
Iteration in an unordered_map may be slow due to iterating over a large number of unused elements in the hashtable. Insertions will be quich until collision levels become intolerable at which point the whole data structure will need to laid out again.
maps have relatively fast iteration except that data elements may be far apart. Insertion can be slow due to the re-balancing of the red-black trees that this requires.
The main usecase for unordered_maps is for fast lookup (O1). normal map have fastish lookup (O log n) but much better iteration performance.
If you have hard real-time requirements, I would recommend map over unordered_map. std::map has guaranteed performance 100% of the time, but the std::unordered_map may do a rehash and completely ruin real-time performance in some critical corner case. In general, I prefer red-black trees (std::map) over hashtables (std::unordered_map) if I need absolute guarantees on worst-case performance.
I am looking a container to support frequent adds/removals. I have not idea how large the container may grow but I don't want to get stalls due to huge reallocations. I need a good balance between performance and consistent behavior.
Initially, I considered std::tr1::unordered_map but since I don't know the upper bound of the data set, collisions could really slow down the unordered_map's performance. It's not a matter of a good hashing function because no matter how good it is, if the occupancy of the map is more than half the bucket count, collisions will likely be a problem.
Now I'm considering std::map because it doesn't suffer from the issue of collisions but it only has log(n) performance.
Is there a way to intelligently handle collisions when you don't know the target size of an unordered_map? Any other ideas for handling this situation, which I imagine is not uncommon?
Thanks
This is a run-time container, right?
Are you adding at the end (as in push_back) or in the front or the middle?
Are you removing at random locations, or what?
How are you referencing information in it?
Randomly, or from the front or back, or what?
If you need random access, something based on array or hash is preferred.
If reallocation is a big problem, you want something more like a tree or list.
Even so, if you are constantly new-ing (and delete-ing) the objects that you're putting in the container, that alone is likely to consume a large fraction of time,
in which case you might find it makes sense to save used objects in junk lists, so you can recycle them.
My suggestion is, rather than agonize over the choice of container, just pick one, write the program, and then tune it, as in this example.
No matter what you choose, you will probably want to change it, maybe more than once.
What I found in that example was that any pre-existing container class is justifying its existence by ease of programming, not by fastest-possible speed.
I know it's counter-intuitive, but
unless some other activity in your program ends up being the dominant cost, and you can't shrink it, your final burst of speed will require hand-coding the data structure.
What kind of access do you need? Sequential, random access, lookup by key? Plus, you can rehash unordered map either manually (rehash method), and set its load factor. In any case the hash will rebuild itself when chains get too long (i.e., when the load factor is exceeded). Additionally, the slow-down point of a hash table is when it is full ~80%, not 50%.
You should really have read the documentation, for example here.