What hashing method is implemented in standard unordered containers? [closed] - c++

Since language standards rarely mandate implementation methods, I'd like to know the real-world hashing method used by the C++ standard library implementations (libc++, libstdc++ and Dinkumware).
In case it's not clear, I expect the answer to be a method like one of these:
Hashing with chaining
Hashing by Division / Multiplication
Universal hashing
Perfect hashing (static, dynamic)
Hashing with open addressing (linear/quadratic probing or double hashing)
Robin-Hood hashing
Bloom Filters
Cuckoo hashing
Knowing why a particular method was chosen over the others would be a good thing as well.

libstdc++: chaining, prime table sizes, default load threshold for rehashing is 1.0 (configurable via max_load_factor), buckets are all separate allocations. Possibly outdated; I don't know the current state of things.
Rust: Robin Hood, default load threshold for rehashing is 0.9 (too much for open addressing, BTW)
Go: table slots point to "bins" of 5 (7?) slots; I'm not sure what happens if a bin is full, but as far as I recall it grows in a vector/ArrayList manner
Java: chaining, only power-of-two table size, default load threshold is 0.75 (configurable), buckets (called entries) are all separate allocations. In recent versions of Java, above a certain threshold, chains are changed to binary search trees.
C#: chaining, buckets are allocated from a flat array of bucket structures. If this array is full, it is rehashed (with the table, I suppose) in a vector/ArrayList manner.
Python: open addressing, with its own unique collision-resolution scheme (not very fortunate, IMHO), only power-of-two table sizes, load threshold for rehashing is 0.666... (good). However, slot data lives in a separate array of structures (like in C#), i.e. hash table operations touch at least two different random memory locations (in the table and in the array of slot data)
If some points are missing from these descriptions, it doesn't mean those features are absent; it means I don't know or don't remember the details.
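
To make the two families above concrete, here is a minimal sketch (illustrative only, not any library's actual code) contrasting separate chaining, which the standard library implementations use, with linear-probing open addressing, the family that Robin Hood hashing belongs to:

    // Minimal sketch, not real library code: two collision strategies for an int -> int table.
    #include <cstddef>
    #include <forward_list>
    #include <optional>
    #include <utility>
    #include <vector>

    // Separate chaining: each bucket owns a list of entries (what libstdc++/libc++ do,
    // with far more machinery around it).
    struct ChainedTable {
        std::vector<std::forward_list<std::pair<int, int>>> buckets;
        ChainedTable() : buckets(16) {}

        void insert(int key, int value) {
            auto& b = buckets[static_cast<std::size_t>(key) % buckets.size()];
            for (auto& kv : b)
                if (kv.first == key) { kv.second = value; return; }
            b.push_front({key, value});
            // A real implementation tracks the load factor and rehashes here.
        }
    };

    // Open addressing with linear probing: entries live directly in the slot array and
    // a collision walks to the next free slot.
    struct ProbedTable {
        std::vector<std::optional<std::pair<int, int>>> slots;
        ProbedTable() : slots(16) {}                       // size kept a power of two

        void insert(int key, int value) {
            std::size_t i = static_cast<std::size_t>(key) & (slots.size() - 1);
            while (slots[i] && slots[i]->first != key)
                i = (i + 1) & (slots.size() - 1);          // probe the next slot
            slots[i] = std::make_pair(key, value);
            // A real implementation rehashes before the table gets too full.
        }
    };

    int main() {
        ChainedTable c; c.insert(1, 10); c.insert(17, 20); // 1 and 17 share a bucket
        ProbedTable  p; p.insert(1, 10); p.insert(17, 20); // 17 probes to the next slot
    }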

Related

Most efficient way to index true/false values in C++ [closed]

I have a list of unsigned shorts that act as local IDs for a database. I was wondering what the most memory-efficient way to store allowed IDs is. For the lifetime of my project, the allowed ID list will be dynamic, so it may have more allowed or more disallowed IDs as time goes on, ranging anywhere from none allowed to all allowed.
What would be the best method to store these? I've considered the following:
List of allowed IDs
Bool vector/array of true/false for allowed IDs
Byte array that can be iterated through, similar to 2
Let me know which of these would be best, or if another, better method exists.
Thanks
EDIT: If possible, can a vector have a value put at, say, index 1234, without all 1233 previous values, or would this suit a map or similar type more?
I'm looking at using an Arduino with 2 KB of total RAM and using external storage to assist with managing a large block of data, but I'm exploring what my options are.
"Best" is opinion-based, unless you are aiming for memory efficiency at the expense of all other considerations. Is that really what you want?
First of all, I hope we're talking <vector> here, not <list> -- because a std::list< short > would be quite wasteful already.
What is the possible value range of those IDs? Do they use the full range of 0..USHRT_MAX, or is there e.g. a high bit you could use to indicate allowed ones?
If that doesn't work, or you are willing to sacrifice a bit of space (no pun intended) for a somewhat cleaner implementation, go for a vector partitioned into allowed ones first, disallowed second. To check whether a given ID is allowed, find it in the vector and compare its position against the cut-off iterator (which you got from the partitioning). That would be the most memory-efficient standard container solution, and quite close to a memory-optimum solution either way. You would need to re-shuffle and update the cut-off iterator whenever the "allowedness" of an entry changes, though.
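A rough sketch of that partitioned-vector idea (the names are made up for illustration, and on a 2 KB Arduino you would size everything very carefully):

    // Sketch of the partitioned-vector approach: allowed IDs first, disallowed after the cut-off.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct IdSet {
        std::vector<unsigned short> ids;   // every known ID
        std::size_t cutoff = 0;            // number of allowed IDs (the partition point)

        // Re-partition whenever the "allowedness" of entries changes.
        template <class Pred>
        void rebuild(Pred allowed) {
            cutoff = static_cast<std::size_t>(
                std::partition(ids.begin(), ids.end(), allowed) - ids.begin());
        }

        bool is_allowed(unsigned short id) const {
            auto it = std::find(ids.begin(), ids.end(), id);
            return it != ids.end() &&
                   static_cast<std::size_t>(it - ids.begin()) < cutoff;
        }
    };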
One suitable data structure to solve your problem is a trie (string tree) that holds your allowed or disallowed IDs.
You can treat the ID's binary representation as the string. A trie is a compact way to store the IDs (memory-wise), and runtime access to it is bounded by the longest ID length (which in your case is a constant 16).
I'm not familiar with a C++ standard library implementation, but if efficiency is crucial you can find an implementation or implement one yourself.
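A very rough sketch of a binary trie over the 16 bits of an ID might look like this (illustrative only; a pointer-per-node layout like this is not yet memory-tuned):

    // Rough sketch of a binary trie keyed on the 16 bits of an ID.
    #include <cstdint>
    #include <memory>

    struct TrieNode {
        std::unique_ptr<TrieNode> child[2];   // child[bit]
        bool allowed = false;                 // meaningful at depth 16 (the leaf level)
    };

    struct IdTrie {
        TrieNode root;

        void insert(std::uint16_t id) {
            TrieNode* n = &root;
            for (int bit = 15; bit >= 0; --bit) {
                int b = (id >> bit) & 1;
                if (!n->child[b]) n->child[b] = std::make_unique<TrieNode>();
                n = n->child[b].get();
            }
            n->allowed = true;
        }

        bool contains(std::uint16_t id) const {
            const TrieNode* n = &root;
            for (int bit = 15; bit >= 0; --bit) {
                n = n->child[(id >> bit) & 1].get();
                if (!n) return false;
            }
            return n->allowed;
        }
    };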

Should I use linked lists or arrays when sorting 100 million elements? [closed]

I want to implement algorithms like quicksort, mergesort, etc., while working with big data files, like 100 million elements. I can't use std::vector, std::sort, or anything like that, since this is a school assignment to get to know these specific algorithms. I can only use things that I write on my own.
Should I implement the sorting algorithms using linked lists or arrays? Which of these two is more efficient in terms of working with big data? What are the advantages of using one of them?
If the number of elements is large, the better option would be an array (or any type that has contiguous memory storage, e.g. a std::vector, memory allocated with new[], etc.). A linked list usually does not store its nodes in contiguous memory. The contiguous memory aspect leads to better cache friendliness.
In addition to this, a linked list (assuming a doubly-linked list) would need to store next and previous pointers to the next and previous elements for each data item, thus requiring more memory per data item. Even for a singly-linked list, a next pointer has to exist, so even though it has less overhead than a doubly-linked list, it is still more overhead than an array.
Another reason, not related to efficiency, why you want to use an array is ease of implementation of the sorting algorithm. It is more difficult to implement a sorting algorithm for a linked list than for an array, especially an algorithm that works with non-adjacent elements.
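For example, an in-place quicksort over a plain array needs nothing beyond index arithmetic and swaps, which is exactly the random access a linked list cannot provide cheaply (sketch, not tuned for 100 million elements):

    // Sketch of an in-place quicksort over a plain array (Hoare-style partition).
    #include <cstddef>
    #include <utility>

    void quicksort(int* a, std::ptrdiff_t lo, std::ptrdiff_t hi) {
        if (lo >= hi) return;
        int pivot = a[lo + (hi - lo) / 2];
        std::ptrdiff_t i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        quicksort(a, lo, j);               // recurse on both halves
        quicksort(a, i, hi);
    }

    int main() {
        int data[] = {5, 2, 9, 1, 7};
        quicksort(data, 0, 4);             // data is now {1, 2, 5, 7, 9}
    }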
Also, please note that std::sort is an algorithm, it is not a container. Thus it can work with regular arrays, std::array, std::vector, std::deque, etc. So comparing std::sort to an array is not a correct comparison.

sparse vector in C++? [closed]

I have some code, using the class vector, which I want to implement with a vector class that implements sparse vectors (i.e. instead of recording the elements in an array of the vector's length, including 0's, it would only record the non-zero elements in a look-up table).
Is there any sparse vector class in C++ that makes use of the same interface that vector does? (that will make refactoring much easier.)
Brendan is right to observe that logically a vector provides a map from index to value. An std::vector accomplishes this mapping with a simple array. But there are other options.
std::unordered_map has amortized O(1) time operations and seems like a natural alternative.
std::map has O(log n) operations, but the constants are smaller and there are more years of optimizations behind this. In practice it may be faster depending on your application.
SparseHash has an STL-compatible hashtable implementation that claims to be better than a standard unordered_map.
C++ BTree again offers an STL-compatible map, but one that uses B-trees instead of binary trees. They claim significantly improved memory (50-80%) and time.
BitMagic offers an implementation of a sparse bit vector. Think a sparse std::bitset. If this fits your needs it offers really significant improvements over all the other approaches.
Finally, the classical approach to a sparse vector is to use two vectors, one for the indices and one for the values. So you have a std::vector<uint> ind; and a std::vector<value_type> val;.
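A bare-bones version of that two-vector idea (illustrative sketch; a production version would keep ind sorted and binary-search it):

    // Bare-bones sketch of the classical two-vector sparse representation.
    #include <cstddef>
    #include <vector>

    template <class T>
    struct SparseVector {
        std::vector<std::size_t> ind;   // indices of the non-zero elements
        std::vector<T>           val;   // their values, in the same order
        std::size_t logical_size = 0;   // the length the dense vector would have

        void set(std::size_t i, const T& v) {
            for (std::size_t k = 0; k < ind.size(); ++k)
                if (ind[k] == i) { val[k] = v; return; }
            ind.push_back(i);
            val.push_back(v);
            if (i >= logical_size) logical_size = i + 1;
        }

        T get(std::size_t i) const {    // absent entries read as a default-constructed T
            for (std::size_t k = 0; k < ind.size(); ++k)
                if (ind[k] == i) return val[k];
            return T{};
        }
    };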
None of these have exactly the same interface as a std::vector, but the differences are pretty small and you could easily write a small wrapper. For example, for the map classes you would want to keep track of the size and overload size() to return that number instead of the number of non-empty elements. Indeed, Boost's mapped_vector that Brendan links to does exactly this: it wraps a map-like class in a vector-like interface.
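Such a wrapper can be very small; a hypothetical sketch showing just size() and operator[] over std::map:

    // Hypothetical sketch of a vector-like facade over std::map; only two members shown.
    #include <cstddef>
    #include <map>

    template <class T>
    class MapBackedVector {
        std::map<std::size_t, T> data_;
        std::size_t size_ = 0;          // the logical (dense) size, not the element count

    public:
        std::size_t size() const { return size_; }

        T& operator[](std::size_t i) {  // creates a default T for untouched indices
            if (i >= size_) size_ = i + 1;
            return data_[i];
        }
    };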
A drop-in replacement that works in all cases is impossible (because a std::vector is in nearly all cases assumed to degenerate into an array, e.g. &vector[0], and often this is used). Also, most users who are interested in the sparse case are also interested in taking advantage of the sparsity, hence need it exposed. For example, your sparse vector's iterator would have to iterate over all elements, including the empty ones, which is simply wasteful. The whole point of a sparse structure is to skip all that. If your algorithms can't handle that then you leave a lot of potential gains on the table.
Boost has a sparse vector. I don't think one exists in the standard library.
std::unordered_map is probably a better choice in the long run though, unless you're already using Boost. The main annoyance in refactoring will be that size() means something different for a map vs. a sparse array. Range-based for loops should make that easier to deal with, though.

Practical summary/reference of C++11 containers/adapters properties? [closed]

I'm looking for a comprehensive summary/reference of important properties of the various C++11 standard containers and container adapters (optionally also including boost/Qt), but indexed by those properties rather than the usual per container documentation (more on that below).
The properties I have in mind include, off the top of my head:
Insertion capabilities (front / back / arbitrary).
Removal capabilities (front / back / arbitrary).
Access capabilities (front / back / uni/bi-directional traversal / random access).
Complexity of the aforementioned operations, and iterator invalidation rules.
Uniqueness? Ordered? Associative? Contiguous storage? Reservation ahead of time?
I may have forgotten some in which case don't hesitate to comment/edit.
The goal is to use that document as an aid to choose the right container/adapter for the right job, without having to wade through the various individual documentations over and over every time (I have a terrible memory).
Ideally it should be indexed both by property and by container type (e.g. table-like) to allow for decision-making as well as for quick reference of the basic constraints. But really the per-property indexes are the most important for me, since this is the most painful thing to search for in the documentation.
I'd be very surprised if nobody had already produced such a document, but my Search-fu is failing me on this one.
NOTE: I'm not asking you to summarize all this information (I'll do that myself if I really have to, in which case I'll post the result here) but only whether you happen to know of an existing document that fits those requirements. Something like this is a good start but as you can see it still lacks much of the information I'd like to have, since it's restricted to member functions.
Thanks for your attention.
I am not aware of a single document that provides everything you need, but most of it has been catalogued somewhere.
This reference site has a large table with all the member functions of all the containers
This SO question has a large table of the complexity guarantees.
This SO question gives you a decision tree to choose between containers.
The complexity requirements for container member functions are not too hard to memorize, since there are only 4 categories: (amortized) O(1), O(log N), O(N), and O(N log N) (the last one only for the member function std::list::sort(), which really crosses into the algorithms domain of the Standard Library), so if you want you could make a 4-color-coded version of the cppreference container table.
Choosing the right container can be as simple as always using std::vector unless your profiler indicates a bottleneck. After you reach that point, you have to make hard tradeoffs between space / time complexity, data locality, ease of lookup vs ease of insertion / modification, vs extra invariants (sortedness, uniqueness, iterator invalidation rules).
The hardest part is that you have to balance your containers (space requirements) against the algorithms that you are using (time requirements). Containers can maintain invariants (e.g. std::map is sorted on its keys) that other containers can only mimic using algorithms (e.g. std::vector with std::sort, but without the same insertion complexity). So after you finish the container table, make sure to do something similar for the algorithms!
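As a tiny illustration of that container-versus-algorithm point (a sketch, nothing more): std::map maintains sortedness as an invariant, while std::vector only mimics it by running algorithms over it.

    // std::map keeps its keys sorted by construction; std::vector needs std::sort to mimic that.
    #include <algorithm>
    #include <map>
    #include <utility>
    #include <vector>

    int main() {
        std::map<int, int> m;                            // always sorted; O(log N) insert
        m[3] = 30; m[1] = 10; m[2] = 20;

        std::vector<std::pair<int, int>> v{{3, 30}, {1, 10}, {2, 20}};
        std::sort(v.begin(), v.end());                   // sortedness re-established by hand
        bool found = std::binary_search(v.begin(), v.end(), std::make_pair(2, 20));
        (void)found;                                     // true: the sorted vector allows O(log N) lookup
    }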
Finally, no container summary would be complete without mentioning Boost.MultiIndex: because sometimes you don't have to choose!

Super high performance C/C++ hash map (table, dictionary) [closed]

I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
I would recommend you try Google SparseHash (or the C++11 version, Google SparseHash-c11) and see if it suits your needs. They have a memory-efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, and it was the best hash table implementation available in terms of speed (though it has its drawbacks).
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
Check out the LGPL'd Judy arrays. I've never used them myself, but they were advertised to me on a few occasions.
You can also try to benchmark STL containers (std::hash_map, etc). Depending on the platform/implementation and source-code tuning (preallocate as much as you can; dynamic memory management is expensive) they could be performant enough.
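With a C++11 std::unordered_map, that pre-allocation advice looks roughly like this (a sketch; the answer above predates C++11 and names the older std::hash_map):

    // Sketch: pre-allocate buckets up front so the insert/delete churn doesn't keep rehashing.
    #include <unordered_map>

    struct Value { int payload[4]; };

    int main() {
        std::unordered_map<int, Value> m;
        m.max_load_factor(0.5f);   // fewer collisions, at the cost of memory
        m.reserve(4096);           // enough buckets for the expected entry count
        m.emplace(42, Value{});
    }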
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?
Just use boost::unordered_map (or tr1 etc) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest to precisely analyze your requirements to find a faster substitute.
If you have a multithreaded program, you can find some useful hash tables in the Intel Threading Building Blocks library. For example, tbb::concurrent_unordered_map has the same API as std::unordered_map, but its main functions are thread-safe.
Also have a look at Facebook's folly library; it has a high-performance concurrent hash table and a skip list.
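Usage of the TBB map is close to a drop-in replacement for std::unordered_map (a sketch; note that concurrent_unordered_map supports concurrent insertion and traversal but not concurrent erasure):

    // Sketch of tbb::concurrent_unordered_map: insert and find may run from several threads.
    #include <tbb/concurrent_unordered_map.h>
    #include <thread>

    struct Value { int payload; };

    int main() {
        tbb::concurrent_unordered_map<int, Value> m;

        std::thread writer([&] { for (int i = 0; i < 1000; ++i) m.insert({i, Value{i}}); });
        std::thread reader([&] { for (int i = 0; i < 1000; ++i) (void)m.count(i); });

        writer.join();
        reader.join();
    }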
khash is very efficient. There is the author's detailed benchmark: https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it also shows khash beating many other hash libraries.
From the Android sources (thus Apache 2 licensed):
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
Look at hashmap.c and pick up include/cutils/hashmap.h; if you don't need thread safety you can remove the mutex code. A sample implementation is in libcutils/str_parms.c
First check if existing solutions like libmemcache fit your needs.
If not ...
A hash map seems to be the definite answer to your requirement. It provides O(1) lookup based on the keys. Most STL implementations provide some sort of hash map these days, so use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough performance-wise for your needs.
If it is not, you should explore some good, fast hashing algorithms found on the net:
the good old prime-number multiply algorithm (a sketch follows after this list)
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
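The first of those, a multiplicative hash with a prime constant, is only a couple of lines (a sketch; the constant is the well-known Knuth multiplier and the table size is assumed to be a power of two):

    // Sketch of a prime-number multiplicative hash for integer keys.
    #include <cstdint>

    inline std::uint32_t mul_hash(std::uint32_t key, std::uint32_t table_size_pow2) {
        std::uint32_t h = key * 2654435761u;     // prime close to 2^32 / golden ratio (Knuth)
        return h & (table_size_pow2 - 1);        // fold into a power-of-two table
    }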
If this is not good enough, you could roll a hashing module yourself that fixes the problems you saw with the STL containers you tested, using one of the hashing algorithms above. Be sure to post the results somewhere.
Oh, and it's interesting that you have multiple maps... perhaps you can simplify by having your key as a 64-bit number with the high bits used to distinguish which map it belongs to, and add all key-value pairs to one giant hash. I have seen hashes with a hundred thousand or so symbols working perfectly well on the basic prime-number hashing algorithm.
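That key-packing idea is essentially a one-liner (sketch):

    // Fold the map id into the high bits of a 64-bit key so many small maps become one big one.
    #include <cstdint>
    #include <unordered_map>

    inline std::uint64_t pack_key(std::uint32_t map_id, std::uint32_t key) {
        return (static_cast<std::uint64_t>(map_id) << 32) | key;
    }

    int main() {
        std::unordered_map<std::uint64_t, int> giant;
        giant[pack_key(7, 12345)] = 1;   // "map 7", key 12345
        giant[pack_key(8, 12345)] = 2;   // the same key in "map 8" does not collide
    }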
You can check how that solution performs compared to hundreds of maps; I think it could be better from a memory-profiling point of view. Please do post the results somewhere if you do get to do this exercise.
I believe that, more than the hashing algorithm, it could be the constant allocation/deallocation of memory (can it be avoided?) and the CPU cache usage profile that are crucial for the performance of your application.
Good luck.
Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.
I would suggest uthash. Just #include "uthash.h", then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.
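The canonical usage pattern looks roughly like this (a sketch following the uthash documentation; uthash is a C header but compiles as C++ as well):

    // Sketch of typical uthash usage: the hash handle lives inside your own struct.
    #include "uthash.h"

    struct entry {
        int id;                 // the key field
        int payload;
        UT_hash_handle hh;      // makes this struct hashable
    };

    int main() {
        entry* table = nullptr;             // an empty uthash table is just a null head pointer

        entry* e = new entry{42, 7, {}};
        HASH_ADD_INT(table, id, e);         // key is the 'id' field

        int wanted = 42;
        entry* found = nullptr;
        HASH_FIND_INT(table, &wanted, found);

        if (found) { HASH_DEL(table, found); delete found; }
    }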
According to http://incise.org/hash-table-benchmarks.html, GCC has a very, very good implementation. However, mind that it must respect a very bad standard decision:
If a rehash happens, all iterators are invalidated, but references and pointers to individual elements remain valid. If no actual rehash happens, no changes.
(http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/)
This basically means the standard says that the implementation MUST be based on linked lists.
It prevents open addressing, which has better performance.
I think Google's sparse hash uses open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all the competition in memory usage (also, it doesn't have any plateau: a pure straight line with respect to the number of elements).