Efficieny of stl::map of stl::sets

Efficieny of stl::map of stl::sets - c++

I believe I'd like to use boost::icl::interval_map to solve a problem (described here, I'll post a complete answer if interval_maps ultimately work.)
I want to use an interval_map<unsigned long long, set<foo*>>, but the documentation for boost::icl mentions that there are potential efficiency problems (below from).
We are introducing interval_maps using an interval map of sets of strings, because of it's didactic advantages. The party example is used to give an immediate access to the basic ideas of interval maps and aggregate on overlap. For real world applications, an interval_map of sets is not necessarily recommended. It has the same efficiency problems as a std::map of std::sets. There is a big realm though of using interval_maps with numerical and other efficient data types for the associated values.
What are the efficiency issues with std::map of std::sets?
and How can I avoid them?

Both std::map<K, V> and std::set<V> are node based containers linked by pointers. Traversing them has nice complexity guarantees (i.e., each individual operation is at most O(log n)) but you actually need fairly sizable containers for the complexity to matter compared, e.g., to a std::vector<std::pair<K, V>> especially when K and V are fundamental types. The main performance issue with node based containers is that they are laid out more or less randomly in memory while contemporary CPUs like to access data which is clustered in some form.
Of course, as usual you'll need to measure the times obtained between different implementations on fairly realistic data sets to determine which data structure yields the best performance.

Related

Selection of map or unordered_map based on keys's type

A generally asked question is whether we should use unordered_map or map for faster access.
The most common( rather age old ) answer to this question is:
If you want direct access to single elements, use unordered_map but if you want to iterate over elements(most likely in a sorted way) use map.
Shouldn't we consider the data type of key while making such a choice?
As hash algorithm for one dataType(say int) may be more collision prone than other(say string).
If that is the case( the hash algorithm is quite collision prone ), then I would probably use map even for direct access as in that case the O(1) constant time(probably averaged over large no. of inputs) for unordered_map map be more than lg(N) even for fairly large value of N.

You raise a good point... but you are focusing on the wrong part.
The problem is not the type of the key, per se, but on the hash function that is used to derive a hash value for that key.
Lexicographical ordering is easy: if you tell me you want to order a structure according to its 3 fields (and they already support ordering themselves) then I'll just write:
bool operator<(Struct const& left, Struct const& right) {
return boost::tie(left._1, left._2, left._3)
< boost::tie(right._1, right._2, right._3);
}
And I am done!
However writing a hash function is difficult. You need some knowledge about the distribution of your data (statistics), you might need to prevent specially crafted attacks, etc... Honestly, I do not expect many people of being able to craft a good hash function. But the worst part is, composition is difficult too! Given two independent fields, combining their hash value right is hard (hint: boost::hash_combine).
So, indeed, if you have no idea what you are doing and you are treating user-crafted data, just stick to a map. It's maybe slower (not sure), but it's safer.

There isn't really such a thing as collision prone object, because this thing is dependent on the hash function you use. Assuming the objects are not identical - there is some feature that can be utilized to create an informative hash function to be used.
Assuming you have some knowledge on your data - and you know it is likely to have a lot of collision for some hash function h1() - then you should find and use a different hash function h2() which is better suited for this task.
That said, there are other issues as well why to favor tree based data structures over hash bases (such as latency and the size of the set), some are covered by my answer in this thread.

There's no point trying to be too clever about this. As always, profile, compare, optimise if useful. There are many factors involved - quite a few of which aren't specified in the Standard and will vary across compilers. Some things may profile better or worse on specific hardware. If you are interested in this stuff (or paid to pretend to be) you should learn about these things a bit more systematically. You might start with learning a bit about actual hash functions and their characteristics. It's extremely rare to be unable to find a hash function that has - for all practical purposes - no more collision proneness than a random but repeatable value - it's just that sometimes it's slower to approach that point than it is to handle a few extra collisions.

Is using a map where value is std::shared_ptr a good design choice for having multi-indexed lists of classes?

problem is simple:
We have a class that has members a,b,c,d...
We want to be able to quickly search(key being value of one member) and update class list with new value by providing current value for a or b or c ...
I thought about having a bunch of
std::map<decltype(MyClass.a/*b,c,d*/),shared_ptr<MyClass>>.
1) Is that a good idea?
2) Is boost multi index superior to this handcrafted solution in every way?
PS SQL is out of the question for simplicity/perf reasons.

Boost MultiIndex may have a distinct disadvantage that it will attempt to keep all indices up to date after each mutation of the collection.
This may be a large performance penalty if you have a data load phase with many separate writes.
The usage patterns of Boost Multi Index may not fit with the coding style (and taste...) of the project (members). This should be a minor disadvantage, but I thought I'd mention it
As ildjarn mentioned, Boost MI doesn't support move semantics as of yet
Otherwise, I'd consider Boost MultiIndex superior in most occasions, since you'd be unlikely to reach the amount of testing it received.

You want want to consider containing all of your maps in a single class, arbitrarily deciding on one of the containers as the one that stores the "real" objects, and then just use a std::map with a mapped type of raw pointers to elements of the first std::map.
This would be a little more difficult if you ever need to make copies of those maps, however.

C++ (Hashmap style) Data Structure Ideal For This Scenario?

People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting, not deletion. These need to be quick as they will be used several 100,000 times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory-consumption... though it can't be too
greedy!
I am developing for a company that will use the program commercially so any third-party data structures should come with no copyright protection or anything, but if the STL has a data structure that will do the job efficiently then that would be perfect.
I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good?
hash_map and unordered_map are generally implemented with hash tables.
Thus the order is not maintained. unordered_map insert/delete/query
will be O(1) (constant time) where map will be O(log n) where n is the
number of items in the data structure. So unordered_map is faster, and
if you don't care about the order of the items should be preferred
over map. Sometimes you want to maintain order (ordered by the key)
and for that map would be the choice.

Looks like a prefix tree (with element at each node end) also fits in this scenario. It's damn fast, even faster than hash map because no hash value calculation is done and getting a value is purely O(n) where n is the key length. It's a bit memory hungry but common prefix of keys are shared in the same node path.
EDIT: I assume the keys are string, not simple values like integers

As for build-in solutions I'd recommand google::dense_hash_map. They are really fast especially for numeric keys. You'll have to decide on a specific key that will be reserved as "empty_key". Moreover here is a really nice comparison of different hash-map implementations.
An excerpt
Library Linux-intCPU (sec) Linux-strCPU (sec) Linux PeakMem (MB)
glib 3.490 4.720 24.968
ghthash 3.260 3.460 61.232
CC’s hashtable 3.040 4.050 129.020
TR1 1.750 3.300 28.648
STL hash_set 2.070 3.430 25.764
google-sparse 2.560 6.930 5.42/8.54
google-dense 0.550 2.820 24.7/49.3
khash (C++) 1.100 2.900 6.88/13.1
khash (C) 1.140 2.940 6.91/13.1
STL set (RB) 7.840 18.620 29.388
kbtree (C) 4.260 17.620 4.86/9.59
NP’s splaytree 11.180 27.610 19.024
However, when setting a "deleted_key", this map can also perform deletions. So maybe it'll be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash-map should exactly suit your needs (note that "map" is an ordered tree-map and thus slower).

What you need definitely sounds like a hash set, C++ has this as either std::tr1::unordered_set or in Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.

It sounds like std::unordered_set would fit the bill, but without
knowing more about the key, it's difficult to say. I'm curious about
how you can guarantee that there will be no possibility of collisions:
this implies a small (less than the size of the table), finite set of
keys. If this is the case, it may be more efficient to map the keys to
a small int, and use std::vector (with empty slots for the entries not
present).

What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that- also in Boost/TR1/C++0x.

data structure for storing array of strings in a memory

I'm considering of data structure for storing a large array of strings in a memory. Strings will be inserted at the beginning of the programm and will not be added or deleted while programm is running. The crucial point is that search procedure should be as fast as it can be. Saving of memory is not important. I incline to standard structure hash_set from standard library, that allows to search elements in the structure with about constant time. But it's not guaranteed that this time will be short. Will anyone suggest a better standard desicion?
Many thanks!

Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. Compared against a hash table, you could see this question

If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.

There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the
beginning of the programm and will not
be added or deleted while programm is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak -, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
**Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).*

A hash_set with a suitable number of buckets would be ideal, alternatively a vector with the strings in dictionary order, searched used binary search, would be great too.

The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).

Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!

Well, assuming you truly want an array and not an associative contaner as you've mentioned, the allocation strategy mentioned in Raymond Chen's Blog would be efficient.

Using Hash Maps to represent an extremely large data source

I have a very large possible data set that I am trying to visualize at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an id.
I have received a second data source that gives more real-time information for each segment, but the id's do not correspond to the id's I have.
I have a 1:1 mapping of the data id's (9-character strings) to the current id's (long integers). The problem is that there are a lot of id's, and the data that is coming in is in no specific order.
The solution I came up with is to have a hash-map that maps the strings to the road id's. The problem is that I don't know if the hash-map will be efficient enough to have all 166k data entries.
Does anyone have any suggestions and/or hashing algorithms that I can use for this?

If you're only dealing with hundreds of thousands of datapoints, it will likely not be a problem to go with the naive way and just stick with a hash-map.
Even if you have 500,000 9-character strings and an equal number of longs, that still only 16ish bytes per item, or 8,000,000 bytes total. Even if you double that for overhead, 16 MB is hardly too big to have in memory at one time.
Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.

Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."

Since the comments on the question indicate the primary concern may be memory usage:
Use a pooling or other small-object-optimized allocator; assuming you have access to boost you can probably find a drop-in replacement in Pool. Using a better small-object allocator is probably the single biggest memory win you'll find.
If you know your strings are fixed-width, you may want to make sure you're allocating only enough space to store them. For example, a struct wrapped around a fixed-length char[] with a custom comparison operator may work better than a std::string. std::string comes with an additional dynamic allocation (and uses space for the corresponding pointer) and some extra size and capacity tracking overhead. (Generally, try to reduce the number of allocations that stick around; it reduces overhead.)
(Assuming STL) Look at the overhead difference between std::map and std::unordered_map (the latter may or may not be available to you at the moment); an RBtree-based std::map may be close enough to the lookup performance characteristics of your "hashmap" and may (or may not) be more memory efficient depending on your standard library implementation.
What route you take should be influenced by info you can gather -- try to get a picture of number of allocs and alloc size/alignment overhead.
You can either instrument your allocator or insert a few elements and see how you're doing compared to how you think you should be doing in terms of memory usage.

Since your strings are known up front and have a fixed length, theoretically and practically the best solution is a perfect hash. You could use cmph to generate it.
According to Wikipedia, your keys woud take 2.5 bits/key, or about 50KB. That's negligable compared to the 664KB for the values.

Although 166k data entries is rather small IMO you can have a look at google-sparsehash

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js