I'm wondering what the most generally efficient tree structure would be for a collection that has the following requirements:
The tree will hold anywhere between 0 and 232 - 1 items.
Each item will be a simple structure, containing one 32-bit unsigned integer (the item's unique ID, which will be used as the tree value) and two pointers.
Items will be inserted and removed from the tree very often; some items in the tree will remain there for the duration of the program, while others will only be in the tree very briefly before being removed.
Once an item is removed, its unique ID (that 32-bit unsigned integer) will be recycled and reused for a new item.
The tree structure needs to support efficient inserts and deletions, as well as quick lookups by the unique ID. Also, finding the first available unused unique ID needs to be a fast operation.
What sort of tree would be best-suited for these requirements?
EDIT: This tree is going to be held only in memory; at no point will it be persisted to disk. I don't need to worry about hitting the disk, or disk caching, or anything of the sort. This is also why I'm not looking into using something like SQLite.
Depending on how fast you need this to be you might just treat the whole thing as a single, in-memory table mmap-ed onto a file. Addressing is by direct computation. You can simply chain the free slots so you always know exactly where the next free one is. Most accesses will have a max of 1 or 2 disk accesses (depending on underlying filesystem requirements). Put a buttload of memory on the machine and you might not hit the disk at all.
I know this sounds pretty brute force, but you'd be amazed how fast it can be.
Update in response to: "I'm not looking for a disk-persistable solution"
Well, if you truly are going to have as many as 2^32 items in this structure (times how big it is) then you either need enough memory on the machine to hold this puppy or the kernel will start to swap things in and out of memory for you. This still translates to hitting the disk. If you let it swap, don't forget to check the size of the swap area, there's a good chance you'll have to bump it. Using mmap (or something similar) is sort of like creating your own private swap area and it will probably have less impact on other processes running on the same system.
I'll note that once this thing exceeds your available physical memory (whether you are using swap space or mmap or B-trees or Black-Red or extensible hashing or whatever) it becomes critical to understand
your access pattern. If you are hopscotching all over the place you're going to be hitting the disk a lot. One of the primary reasons for using a structure like a B-tree (or any one of several similar structures) is that the top level of the tree (containing the index) tends to stay in memory (because most paging algorithms use LRU) and you only eat a disk access when you touch a leaf page.
Bottom line: it either fits in memory or it doesn't. If it doesn't then your 10^-9 sec memory access turns into a 10^-3 disk access. I.e. 1 million times slower.
TANSTAAFL!
Have you considered something like a trie? Lookup is linear in key length, which in your case means essentially constant, and storage can be more compact due to nodes sharing common substrings.
Keep in mind, though, that if your data set is actually filling large amounts of your key space your bigger efficiency concern is likely to be caching and disk access, not lookups.
I would go for a red-black tree, because it balances the tree on insertion to ensure optimal insertion/deletion/retrieval. An AVL tree is an option, but it's slightly slower for insertions because it's more rigid about balancing on insertions.
http://en.wikipedia.org/wiki/Red-black_tree
http://en.wikipedia.org/wiki/AVL_tree
My reflex would tell me to reach for a standard implementation, such as the one in stl. But suppose you have reasons to implement your own I would typically go for either Red-Black Trees, which performs well on all operations. Alternatively I would try splay trees which can be really fast but have amortized complexity, i.e. some individual operations might take a little longer.
Stay away from AVL trees as you need to do a lot of updates. AVL trees are good for when you have a lot of lookups but few updates as the updated can be fairly slow.
Do you expect your tree to really hold 2^32-1 entries? Even half that and I would definitely try this with SQLite. You may be able to fit it all in memory, but if you page once, a database will be faster. Database are meant to handle huge data sets efficiently, especially when the whole set won't fit in memory at once.
I you do intend to do this yourself, look at some database code and use a BTree. A red-black will be faster with smaller datasets but with that much data your bottle neck isn't going to be processor speed but memory and harddrive speed.
All that said I can't imagine a map of pointers that large being useful. You'll be pushing the limits of modern memory just storing the map. You won't have anything left over for the map to point to.
boost::unordered_map has amortized constant time insertions, deletions and lookups. It's the best data structure for what you described.
Its only downside is that it's, well, unordered as the name says.. And also if you're REALLY unlucky it could end up being linear time if every single hash clashes. However that can be easily avoided using boost's default boost::hash function. Additionally hashing integers is trivial; so that worst case scenario will not happen to you.
(Note: it's not a tree but a hash table, and you asked specifically for a "Tree".. Maybe you thought that the most efficient way was some sort of tree (it's not)?)
Why a tree at all?
To me it seems you need a database. If you expect lower count of nodes, Hash Table could be enough.
I'm going to warn you about the memory. If you fill up whole tree (2^32 items) you will need 4 gigabytes for the values themselves another 8GB for the pointers. Consider the database, if this is likely.
Each item is represented by a 32-bit identity, which is its key, and two pointers. Are the pointers associated with the tree, or do they have to do with the identity?
If they're just part of implementing the tree, ditch them. You don't need them. Represent whether a number is there or not as a bit in a really big bitmap. Finding the lowest unused bit isn't fast, but I don't think it can be. It's only about 512M of main memory, which isn't that bad.
If the pointers are meaningful data, use an array. You're going to have to allocate space for four giganodes plus pointers to make up the map anyway, so allocate space for four giganodes plus one indicator each for whether the node is active or not. Use memset() to set the whole thing to zero, and keep a lowest-unused-node pointer. Use that to add a node. When you delete a node, mark it as unused, and use the pointers to maintain a two-way linked free list. You'll have to find the next lower unused node, and that might take a while, but again I don't see how to keep this fast. (If you just need an unused node, not the lowest one, just put the released node on the free list somewhere.)
This is likely to take about 64G or 96G of RAM, but that's less than a map solution.
Related
I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power of 2 sizes, and traditionally resizes at 75% occupation. Since I know I am going to be filling up the hash table, I pre-emptively double it's capacity to keep it sparse enough. This might be less efficient when adding small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my red/black tree implementation before. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested, given a small number of elements (1024) that a simple linear search in a fixed array would be faster. I followed his advice and implemented one for side by side comparison. On my target hardware (roughly first generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependant. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice and try and profile it first.
I am trying to implement a Kd tree to perform the nearest neighbor and approximate nearest neighbor search in C++. So far I came across 2 versions of the most basic Kd tree.
The one, where data is stored in nodes and in leaves, such as here
The one, where data is stored only in leaves, such as here
They seem to be fundamentally the same, having the same asymptotic properties.
My question is: are there some reasons why choose one over another?
I figured two reasons so far:
The tree which stores data in nodes too is shallower by 1 level.
The tree which stores data only in leaves has easier to
implement delete data function
Are there some other reasons I should consider before deciding which one to make?
You can just mark nodes as deleted, and postpone any structural changes to the next tree rebuild. k-d-trees degrade over time, so you'll need to do frequent tree rebuilds. k-d-trees are great for low-dimensional data sets that do not change, or where you can easily afford to rebuild an (approximately) optimal tree.
As for implementing the tree, I recommend using a minimalistic structure. I usually do not use nodes. I use an array of data object references. The axis is defined by the current search depth, no need to store it anywhere. Left and right neighbors are given by the binary search tree of the array. (Otherwise, just add an array of byte, half the size of your dataset, for storing the axes you used). Loading the tree is done by a specialized QuickSort. In theory it's O(n^2) worst-case, but with a good heuristic such as median-of-5 you can get O(n log n) quite reliably and with minimal constant overhead.
While it doesn't hold as much for C/C++, in many other languages you will pay quite a price for managing a lot of objects. A type*[] is the cheapest data structure you'll find, and in particular it does not require a lot of management effort. To mark an element as deleted, you can null it, and search both sides when you encounter a null. For insertions, I'd first collect them in a buffer. And when the modification counter reaches a threshold, rebuild.
And that's the whole point of it: if your tree is really cheap to rebuild (as cheap as resorting an almost pre-sorted array!) then it does not harm to frequently rebuild the tree.
Linear scanning over a short "insertion list" is very CPU cache friendly. Skipping nulls is very cheap, too.
If you want a more dynamic structure, I recommend looking at R*-trees. They are actually desinged to balance on inserts and deletions, and organize the data in a disk-oriented block structure. But even for R-trees, there have been reports that keeping an insertion buffer etc. to postpone structural changes improves performance. And bulk loading in many situations helps a lot, too!
There is a data structure which acts like a growing array. Unknown amount of integers will be inserted into it one by one, if and only if these integers has no dup in this data structure.
Initially I thought a std::set suffices, it will automatically grow as new integers come in and make sure no dups.
But, as the set grows large, the insertion speed goes down. So any other idea to do this job besides hash?
Ps
I wonder any tricks such as xor all the elements or build a Sparse Table (just like for rmq) would apply?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
This site includes all the possible containers and layout their running time for each action ,
so maybe this will be useful :
http://en.cppreference.com/w/cpp/container
Seems like unordered_set as suggested is your best way.
You could try a std::unordered_set, which should be implemented as a hash table (well, I do not understand why you write "besides hash"; std::set normally is implemented as a balanced tree, which should be the reason for insufficient insertion performance).
If there is some range the numbers fall in, then you can create several std::set as buckets.
EDIT- According to the range that you have specified, std::set, should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it slow for your case.
Also you can use Pigeonhole Principle along with sets to reject any possible duplicate, (applicable when set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
Alcott 32 bit integers, with about 10.000.000 elements (ie any 10m out of 2^32)
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a continuous region. The first element stores the number where a region starts, the second the last. This arrangement allows big regions in the hope that you reach you 10M limit with a very small tree height, so cheap search. The data structure with 10m elements would take up 8 bytes per node, plus the links (2x4bytes) maximum two children per node. So that make 80M for all the 10M elements. And of course, if there are usually more elements inserted you can keep track of the once which are not.
Now if you need to be very careful with space and after running simulations and/or statistical checks you find that there are lots of small regions (less than 32 bit in length), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have continuous chunks with only 8 elements, then your memo requirement becuse 4+1 for the node and 4+4 bytes for the children. Hope this helps.
I'm implementing a session store for a web-server. Keys are string
and stored objects are pointers. I tried using map but need something
faster. I will look up an object 5-20 times
as frequent than insert.
I tried using hash-map but failed. I felt like I got more constraints than more free time.
I'm coding c/c++ under Linux.
I don't want to commit to boost, since my web server is going to outlive boost. :)
This is a highly relevant question since the hardware (ssd disk) is
changing rapidly. What was the right solution will not be in 2 years.
I was going to suggest a map, but I see you have already ruled this out.
I tried using map but need something
faster.
These are the std::map performance bounds courtesy of the Wikipedia page:
Searching for an element takes O(log n) time
Inserting a new element takes O(log n) time
Incrementing/decrementing an iterator takes O(log n) time
Iterating through every element of a map takes O(n) time
Removing a single map element takes O(log n) time
Copying an entire map takes O(n log n) time.
How have you measured and determined that a map is not optimised sufficiently for you? It's quite possible that any bottlenecks you are seeing are in other parts of the code, and a map is perfectly adequate.
The above bounds seem like they would fit within all but the most stringent scalability requirements.
The type of data structure that will be used will be determined by the data you want to access. Some questions you should ask:
How many items will be in the session store? 50? 100000? 10000000000?
How large is each item in the store (byte size)?
What kind of string input is used for the key? ASCII-7? UTF-8? UCS2?
...
Hash tables generally perform very well for look ups. You can optimize them heavily for speed by writing them yourself (and yes, you can resize the table). Suggestions to improve performance with hash tables:
Choose a good hash function! this will have preferably even distribution among your hash table and will not be time intensive to compute (this will depend on the format of the key input).
Make sure that if you are using buckets to not exceed a length of 6. If you do exceed 6 buckets then your hash function probably isn't distributing evenly enough. A bucket length of < 3 is preferable.
Watch out for how you allocate your objects. If at all possible, try to allocate them near each other in memory to take advantage of locality of reference. If you need to, write your own sub-allocator/heap manager. Also keep to aligned boundaries for better access speeds (aligned is processor/bus dependent so you'll have to determine if you want to target a particular processor type).
BTrees are also very good and in general perform well. (Someone can insert info about btrees here).
I'd recommend looking at the data you are storing and making sure that the data is as small as possible. use shorts, unsigned char, bit fields as necessary. There are other additional ways to squeeze out improved performance as well such as allocating your string data at the end of your struct at the same time that you allocate the struct. i.e.
struct foo {
int a;
char my_string[0]; // allocate an instance of foo to be
// sizeof(int) + sizeof(your string data) etc
}
You may also find that implementing your own string compare routine can actually boost performance dramatically, however this will depend upon your input data.
It is possible to make your own. But you shouldn't have any problems with boost or std::tr1::unordered_map.
A ternary trie may be faster than a hash map for a smaller number of elements.
I need a fast container with only two operations. Inserting keys on from a very sparse domain (all 32bit integers, and approx. 100 are set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like, 500k, but only 100 different ones).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stackoverflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this will be very simple so the coding and maintenance effort for an assembler implementation shouldn't be unduly hard to maintain.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know any decend C++ implementation of it. I intended to submit one to Boost but never got around to do it.
Maybe a set with a b-tree (instead of binary tree) as internal data structure. I found this article on codeproject which implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? 32-bits take "only" 4GB/8 bytes, which comes to 512MB, not much for a high-end server. That would make your insertions O(1). But that could make the iteration slow. Although skipping all words with only zeroes would optimize away most iterations. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure a naive unordered set of elements packed in a fixed array with a simple swap to front when an insert collisides is too slow? Its a simple experiment that might show you have memory locality issues rather than algorithmic issues.