So, I'm trying to implement a map in C++ that will hold probably around 20,000 string-to-string pairs. Is it worth it to use Google's dense hash map? I will mainly be checking the map for a pair and inserting it if not found. I won't be making any deletions or alterations to the pairs. If I should use dense_hash_map, how do I use it? There is not much information online, but I know I need a hash function.
EDIT: they are string-to-string pairs
If you need the hash map to be faster and have a bit of spare memory, then go for it.
Google's dense hash map is easy to install (aptitude install libsparsehash-dev on Debian) and it is header-only, so you don't even need to link against another library. It is the fastest hash map around, but it has higher memory requirements than other maps.
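Usage is nearly zero-effort. Here's a minimal sketch (the one quirk is the mandatory "empty key"; the default hasher covers std::string in reasonably recent versions of the library, so you don't need to write your own hash function):

```cpp
#include <iostream>
#include <string>
#include <google/dense_hash_map>

int main() {
    // dense_hash_map requires you to designate an "empty key" that will
    // never be used as a real key, before any insertions are done.
    google::dense_hash_map<std::string, std::string> m;
    m.set_empty_key("");  // assumes real keys are never the empty string

    // The check-then-insert pattern from the question:
    if (m.find("tuesday") == m.end())
        m["tuesday"] = "weekday";

    std::cout << m["tuesday"] << '\n';
}
```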
The benchmark at http://incise.org/hash-table-benchmarks.html has a good comparison of performance and memory profiles (here are the raw results of that benchmark from 2013-08-02: http://pastebin.com/jJL3rzWp).
Note that for some workloads a B+-tree can be more cache-friendly than a hash table.
Related
Working on some legacy code, I am running into memory issues, due mainly (I believe) to the extensive use of STL maps (particularly "maps of maps").
I am looking at Boost's flat_map as a possible solution. Does anyone have any firsthand experience with flat_map, in particular with regard to improvements in speed and/or memory usage? I realize of course this can be very dependent on the types of data stored and the manner in which they are stored, but I'm still curious about people's actual experience.
Can anyone point me to some solid examples?
As an example: there are several cases in this code of a map-of-a-map; that is, a map where the value is another map.
By replacing the "inner" map with a pair of vectors, I reduced the memory footprint 10:1 (3 GB to 300 MB). Of course this can slow down searches, but for this particular case it doesn't seem to matter much. And it involved about a day of refactoring and careful testing.
Boost's flat_map sounds like it might be just what I need, but I can't seem to find out much about it other than the class description on the Boost web site. Looking for some firsthand feedback.
Boost's flat_map is an ordered map implementation that, instead of allocating a node-based binary tree, stores its key-value pairs in a single sorted vector and finds them by binary search.
You can basically work out the answers regarding performance (relative to a std::map) yourself based on that fact:
Iterating the map or a large part of it should be super-fast, relatively
Lookup should typically be relatively fast
Adding or removing values is theoretically much slower, but in practice - assuming your key and value types are small and the number of map elements is not very high - probably comparable in speed (or even better on small maps - often no allocation is necessary on insert)
etc.
In your case - maps-of-maps - you're going to lose some of the benefit of "flattening things out", since you'll have an outer map with a pointer to an inner map (if not more levels of indirection); but the flat map would at least help you reduce that. Also, supposing you have two levels of maps, you could arrange it so that you store all of the inner maps contiguously (either by constructing the inner maps appropriately or by instantiating them with your own allocator, a trickier affair); in that case, you could replace pointers to maps with map indices, reducing the amount of space they take up and making life easier for the compiler.
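If it helps, here's a minimal usage sketch (my own, going by the documentation; note reserve(), which node-based maps have no use for):

```cpp
#include <boost/container/flat_map.hpp>
#include <iostream>
#include <string>

int main() {
    boost::container::flat_map<int, std::string> m;
    m.reserve(3);            // one contiguous allocation up front

    m.emplace(2, "two");
    m.emplace(3, "three");
    m.emplace(1, "one");     // elements are kept sorted by key

    // lookup is a binary search over the underlying vector
    auto it = m.find(2);
    if (it != m.end())
        std::cout << it->second << '\n';
}
```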
You might also want to read Boost's documentation of flat_map; and you could also just use the force and read the source (and the source of the underlying flat_tree) - like I have; I don't actually have flat_map experience myself.
I know this is an old question, but this might be of use to someone finding this question.
I found that flat_map was a big improvement in lookup and iteration over large maps. The fact that the map uses contiguous memory also makes inserting faster than you might expect, thanks to the data locality. If you're doing more inserts than lookups in your map, then it might not be for you.
Having said that, repeatedly inserting a random value into a sorted vector is faster than the same operation on a linked list, because of the data locality - despite what Big O might tell you. (Tested in VS2017 and G++ 4.8.)
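Here's a rough sketch of that kind of experiment, if you want to try it yourself (not the exact benchmark I ran; numbers will vary by machine and element size):

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <list>
#include <random>
#include <vector>

// Keep a sorted sequence and insert N random values, comparing
// std::vector (contiguous) against std::list (node-based).
int main() {
    constexpr int N = 20000;
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::vector<int> v;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        int x = dist(rng);
        // O(log n) to find the spot, O(n) to shift - but the shift is
        // sequential memory traffic, which caches love
        v.insert(std::lower_bound(v.begin(), v.end(), x), x);
    }
    auto t1 = std::chrono::steady_clock::now();

    std::list<int> l;
    for (int i = 0; i < N; ++i) {
        int x = dist(rng);
        // O(n) walk to find the spot - every step chases a pointer
        auto it = std::find_if(l.begin(), l.end(),
                               [x](int y) { return y >= x; });
        l.insert(it, x);
    }
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "vector: " << ms(t1 - t0).count() << " ms\n"
              << "list:   " << ms(t2 - t1).count() << " ms\n";
}
```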
I have a map with about 100,000 pairs. Is there any way that I can speed up searching when using find(), given that the keys are in alphabetical order? Also, how should I go about doing it? I know that you can specify a new comparator when you create the map, but will that speed up the find() function at all?
Thanks in advance.
[solved] Thanks a bunch, guys. I have decided to go with a vector and use lower_bound and upper_bound to "snip" some of the searching.
Also, I am new here; is there any way to mark this question as answered, or pick a best answer?
A different comparator will only speed up find if it manages to do the comparison faster (which, for strings, will usually be pretty difficult).
If you're basically inserting all the data in order, then doing the searching, it may be faster to use a std::vector with std::lower_bound or std::upper_bound.
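For example, a sketch of the sorted-vector approach (assuming string keys as in the question):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // key-value pairs, kept sorted by key
    std::vector<std::pair<std::string, std::string>> v = {
        {"apple", "fruit"}, {"beet", "vegetable"}, {"cherry", "fruit"}};

    // binary search over contiguous storage
    auto it = std::lower_bound(
        v.begin(), v.end(), std::string("beet"),
        [](const std::pair<std::string, std::string>& p,
           const std::string& k) { return p.first < k; });

    if (it != v.end() && it->first == "beet")
        std::cout << it->second << '\n';
}
```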
If you don't really care about ordering, and just want to find the data as quickly as possible, you might find that std::unordered_map works better for you.
Edit: Just for the record: the way you "might find" or "may find" those things is normally by profiling. Depending on the situation, it might be enough faster that it's pretty obvious even in simple testing, so profiling isn't really necessary, but if there's (much) doubt, or you want to quantify the effect, a profiler is probably the right way to do it.
std::map is already taking advantage of the fact the keys are in alphabetical order - it guarantees that itself. You aren't going to be able to improve it by changing the comparator (one assumes it's already a reasonably efficient string comparison).
Have you considered using unordered_map (aka hash_map in various pre-C++11 implementations)? It should be able to search in O(1) on average, instead of the O(log n) of std::map.
You could also look into something slightly more exotic, like a trie, but that's not part of the standard library, so you'd either have to find one elsewhere or roll your own; I'd suggest unordered_map as a good place to start.
If you're using std::find to find elements, you should switch to using map::find (you don't really say in your question). map::find uses the fact that the map is ordered to search much faster.
If that's still not good enough, you might look into a hash container such as unordered_map rather than map.
I've put in a vote for unordered_map but I wanted to also make another point.
One of the things that can hurt performance on modern machines is poor use of the cache. A map is going to have nodes allocated all over the place and there won't be much locality of reference. Also since it has to store a bunch of pointers between nodes it will use up more memory.
At the recent GoingNative 2012 conference, Bjarne Stroustrup gave an interesting talk that touched on this topic. He compared vector and list performance at a task involving a lot of random insertions and deletions, where it might seem list ought to have dominated, but because of the memory size and layout issues vector was in fact the fastest by far. Take a look at his slides, starting at slide 43.
unordered_map gives you direct access to the element, so it probably means even less hopping around in memory than trying to stick your data in a vector (and thus better performance than vector). So my comment is simply an admonishment to always keep your memory access pattern in mind for performance.
I'm looking for a binary data structure (tree, list) that enables very fast searching. I'll only add/remove items at the beginning/end of the program, all at once. So it's going to be fixed-size, and thus I don't really care about insertion/deletion speed. Basically what I'm looking for is a structure that provides fast searching and doesn't use much memory.
Thanks
Look at the unordered set in the Boost C++ library. Unlike red-black trees, which are O(log n) for searching, the unordered set is based on a hash and on average gives you O(1) search performance.
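A minimal sketch (std::unordered_set is the equivalent if you're on C++11 or later):

```cpp
#include <boost/unordered_set.hpp>
#include <iostream>
#include <string>

int main() {
    boost::unordered_set<std::string> s;
    s.insert("alpha");
    s.insert("beta");

    // average O(1) membership test
    if (s.find("beta") != s.end())
        std::cout << "found\n";
}
```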
One container not to be overlooked is a sorted std::vector.
It definitely wins on memory consumption, especially if you can reserve() the correct amount up front.
So the key can be a simple type and the value is a smallish structure of five pointers.
With only 50 elements, it starts getting small enough that the theoretical Big-O performance may be overshadowed, or at least measurably affected, by the fixed-time overhead of the algorithm or structure.
For example, an array or a vector with linear search is often the fastest with fewer than ten elements, because of its simple structure and tight memory layout.
I would wrap the container and run real data on it with timing. Start with STL's vector, go to the standard STL map, upgrade to unordered_map, and maybe even try Google's dense_hash_map or sparse_hash_map:
http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html
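Here's a sketch of what that wrap-and-time harness might look like; time_lookups is a hypothetical helper name, and you'd substitute your real data and containers:

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: time a batch of lookups on any map-like
// container that provides find()/end().
template <class Map>
double time_lookups(const Map& m, const std::vector<std::string>& keys) {
    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = 0;
    for (const auto& k : keys)
        hits += (m.find(k) != m.end());
    auto t1 = std::chrono::steady_clock::now();
    static volatile std::size_t sink;
    sink = hits;  // keep the lookup loop from being optimized away
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<std::string> keys = {"a", "b", "z"};  // substitute real data
    std::map<std::string, int> m1 = {{"a", 1}, {"b", 2}};
    std::unordered_map<std::string, int> m2(m1.begin(), m1.end());

    std::cout << "std::map:           " << time_lookups(m1, keys) << " ms\n"
              << "std::unordered_map: " << time_lookups(m2, keys) << " ms\n";
}
```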
One efficient (albeit a teeny bit confusing) data structure is the red-black tree.
Internally, C++ standard library implementations typically use red-black trees to implement std::map - see this question
std::map and a hash map are both good choices. They also have range constructors to ease one-time construction.
A hash map runs the key through a hash function that yields an array (bucket) index. This may be slower than std::map, but only profiling will tell.
My preference would be std::map, as it is usually implemented as a type of binary tree.
The fastest tends to be a trie. I implemented one that was 3 to 15 times faster than std::unordered_map; tries tend to use more RAM unless you have a large number of elements, though.
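For illustration, here is a minimal trie sketch for lowercase ASCII keys (my own illustrative sketch, not the implementation mentioned above; requires C++14 for make_unique):

```cpp
#include <array>
#include <iostream>
#include <memory>
#include <string>

// Each node carries 26 child pointers, which is where the extra RAM goes.
struct Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> child;
        bool terminal = false;  // true if a key ends at this node
    };
    Node root;

    void insert(const std::string& s) {
        Node* n = &root;
        for (char c : s) {
            auto& slot = n->child[c - 'a'];
            if (!slot) slot = std::make_unique<Node>();
            n = slot.get();
        }
        n->terminal = true;
    }

    bool contains(const std::string& s) const {
        const Node* n = &root;
        for (char c : s) {
            n = n->child[c - 'a'].get();
            if (!n) return false;
        }
        return n->terminal;
    }
};

int main() {
    Trie t;
    t.insert("cat");
    t.insert("car");
    std::cout << t.contains("car") << ' ' << t.contains("ca") << '\n';  // 1 0
}
```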
I have a large collection of unique strings (about 500k). Each string is associated with a vector of strings. I'm currently storing this data in a
map<string, vector<string> >
and it's working fine. However I'd like the look-up into the map to be faster than log(n). Under these constrained circumstances how can I create a hashtable that supports O(1) look-up? Seems like this should be possible since I know all the keys ahead of time... and all the keys are unique (so I don't have to account for collisions).
Cheers!
You can create a hashtable with boost::unordered_map, std::tr1::unordered_map or (on C++0x compilers) std::unordered_map. That takes almost zero effort. Google sparsehash may be faster still and tends to take less memory. (Deletion can be a pain, but it seems you won't need that.)
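For example, a sketch of the zero-effort route with std::unordered_map, using the fact that the key count (500k) is known up front:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::unordered_map<std::string, std::vector<std::string>> m;
    m.reserve(500000);  // known size up front: avoids rehashing mid-insert

    m["alpha"].push_back("a1");  // operator[] inserts on first use
    m["alpha"].push_back("a2");

    auto it = m.find("alpha");   // average O(1)
    bool found = (it != m.end());
    (void)found;
}
```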
If the code is still not fast enough, you can exploit prior knowledge of the keys with a minimal perfect hash, as suggested by others, to obtain guaranteed O(1) performance. Whether the code-generation effort that takes is worth it is up to you; putting 500k keys into a tool like gperf may itself call for a code generator.
You may also want to look at CMPH, which generates a perfect hash function at run-time, though through a C API.
I would look into creating a perfect hash function for your table. This guarantees no collisions, which are expensive to resolve. Perfect hash function generators are also available.
What you're looking for is a Perfect Hash. gperf is often used to generate these, but I don't know how well it works with such a large collection of strings.
If you want no collisions for a known collection of keys you're looking for a perfect hash. The CMPH library (my apologies as it is for C rather than C++) is mature and can generate minimal perfect hashes for rather large data sets.
I'm looking for suggestions regarding in-memory key-value store engines or libraries, that have C++ interfaces or that are written in C++.
I'm looking for solutions that can scale without any problems to about 100 million key-value pairs and that are compatible/compilable on Linux and Win32/64.
How about std::map?
http://cplusplus.com/reference/stl/map/
If you really need to store that many pairs in memory, consider Sparse Hash. It has a special implementation that is optimized for low memory consumption.
std::map is fine, given that the size of the keys and values is small and the available memory is large (for about 100 million pairs).
If that's not the case, and you want to run a program over the key-value pairs, consider using a standard MapReduce API. MapReduce is specifically meant to be used on distributed systems and to process large data sets, especially key-value pairs. There are also nice C++ APIs for MapReduce.
http://en.wikipedia.org/wiki/MapReduce
Try Tokyo Cabinet; it supports hash tables and B+-trees:
http://1978th.net/tokyocabinet/
Try FastDB, though you may get more than you ask for. Tokyo Cabinet also seems to support in-memory databases. (Or ones backed by a file mapped with mmap. With modern operating systems, there's not much difference between an "in-RAM" database and something mmap'd, as the OS's caching makes the latter very efficient too.)
A hash map (also called unordered map) is the best bet for that many pairs. You can find an implementation in Boost and TR1.
Edit:
Some people have questioned the size - if he's got, say, a 64-bit server, there's plenty of space for 100 million key-value pairs.
Oracle Berkeley DB is what you need.