C++ in-memory Key-Value stores - c++

I'm looking for suggestions regarding in-memory key-value store engines or libraries, that have C++ interfaces or that are written in C++.
I'm looking for solutions that can scale without any problems to about 100mill key-value pairs and that are compatible/compilable on linux and win32/64

How about std::map?
http://cplusplus.com/reference/stl/map/

If you really need to store such amount of pairs in memory consider this Sparse Hash. It has special implementation which is optimized for low memory consumption.

std::map is fine given that size of key and value is small and the available memory is large ( for about 100million pairs).
If its not the case, and you want to run a program over the key-value pairs, consider using a standard MapReduce API. Map Reduce is specifically meant to be used on distributed systems and process large data specially key-value pairs. Also there are nice C++ APIs for Map Reduce.
http://en.wikipedia.org/wiki/MapReduce

Try Tokyo Cabinet, it supports hashtables and B+trees:
http://1978th.net/tokyocabinet/

Try FastDB, though you may get more than you ask for. Tokyo cabinet also seems to support in-memory databases. (Or, backed by file mapped by mmap. With modern operating systems, there's no much difference between "in-ram" database and something mmap'd as the OS caching makes also the latter very efficient).

A hash map (also called unordered map) is the best bet for that many pairs. You can find an implementation in Boost and TR1.
Edit:
Some people have questioned the size- if he's got, say, a 64bit server, there's plenty of space for 100million kv pairs.

Oracle Berkeley_db is what you need.

Related

Google dense_hash_map C++

So, I'm trying to implement a map that will have probably around 20,000 String, String pairs in C++. Is it worth it to use Google Dense Hash Map? I will be mainly checking the map for the pair and inserting it if not found. I wont be making any deletions or alterations to the pairs. If I should use dense hash map, how do I? There is not much information online but I know I need a hash function.
EDIT: they are String to String pairs
If you need a hash map to be faster and have a bit of spare memory then go for it.
Google dense map is easy to install (aptitude install libsparsehash-dev on Debian) and it is header-only so you don't even need to link to another library. It is a fastest hash map around, but has higher memory requirements than other maps.
http://incise.org/hash-table-benchmarks.html benchmark has a good comparison of performance and memory profiles (here's the raw results of that benchmark from 2013-08-02: http://pastebin.com/jJL3rzWp).
Note, that for some loads a b+tree can me more cache-friendly than a hash.

c++ hashtable where keys are strings and values are vectors of strings

I have a large collection of unique strings (about 500k). Each string is associated with a vector of strings. I'm currently storing this data in a
map<string, vector<string> >
and it's working fine. However I'd like the look-up into the map to be faster than log(n). Under these constrained circumstances how can I create a hashtable that supports O(1) look-up? Seems like this should be possible since I know all the keys ahead of time... and all the keys are unique (so I don't have to account for collisions).
Cheers!
You can create a hashtable with boost::unordered_map, std::tr1::unordered_map or (on C++0x compilers) std::unordered_map. That takes almost zero effort. Google sparsehash may be faster still and tends to take less memory. (Deletion can be a pain, but it seems you won't need that.)
If the code is still not fast enough, you can exploit prior knowledge of the keys with a minimal perfect hash, as suggested by others, to obtain guaranteed O(1) performance. Whether the code generating effort that takes is worth it depends on you; putting 500k keys into a tool like gperf may take a code generator generator.
You may also want to look at CMPH, which generates a perfect hash function at run-time, though through a C API.
I would look into creating a Perfect Hash Function for your table. This will guarantee no collisions which are an expensive operation to resolve. Perfect Hash Function Generators are also available.
What you're looking for is a Perfect Hash. gperf is often used to generate these, but I don't know how well it works with such a large collection of strings.
If you want no collisions for a known collection of keys you're looking for a perfect hash. The CMPH library (my apologies as it is for C rather than C++) is mature and can generate minimal perfect hashes for rather large data sets.

Super high performance C/C++ hash map (table, dictionary) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
I would recommend you to try Google SparseHash (or the C11 version Google SparseHash-c11) and see if it suits your needs. They have a memory efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, it was the best hashtable implementation available in terms of speed (however with drawbacks).
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
Check out the LGPL'd Judy arrays. Never used myself, but was advertised to me on few occasions.
You can also try to benchmark STL containers (std::hash_map, etc). Depending on platform/implementation and source code tuning (preallocate as much as you can dynamic memory management is expensive) they could be performant enough.
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?
Just use boost::unordered_map (or tr1 etc) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest to precisely analyze your requirements to find a faster substitute.
If you have a multithreaded program, you can find some useful hash tables in intel thread building blocks library. For example, tbb::concurrent_unordered_map has the same api as std::unordered_map, but it's main functions are thread safe.
Also have a look at facebook's folly library, it has high performance concurrent hash table and skip list.
khash is very efficient. There is author's detailed benchmark: https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it also shows khash beats many other hash libraries.
from android sources (thus Apache 2 licensed)
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
look at hashmap.c, pick include/cutils/hashmap.h, if you don't need thread safety you can remove mutex code, a sample implementation is in libcutils/str_parms.c
First check if existing solutions like libmemcache fits your need.
If not ...
Hash maps seems to be the definite answer to your requirement. It provides o(1) lookup based on the keys. Most STL libraries provide some sort of hash these days. So use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough performance wise for your needs.
If it is not, you should explore some good fast hashing algorithms found on the net
good old prime number multiply algo
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
If this is not good enough, you could roll a hashing module by yourself, that fixes the problem that you saw with the STL containers you have tested, and one of the hashing algorithms above. Be sure to post the results somewhere.
Oh and its interesting that you have multiple maps ... perhaps you can simplify by having your key as a 64 bit num with the high bits used to distinguish which map it belongs to and add all key value pairs to one giant hash. I have seen hashes that have hundred thousand or so symbols working perfectly well on the basic prime number hashing algorithm quite well.
You can check how that solution performs compared to hundreds of maps .. i think that could be better from a memory profiling point of view ... please do post the results somewhere if you do get to do this exercise
I believe that more than the hashing algorithm it could be the constant add/delete of memory (can it be avoided?) and the cpu cache usage profile that might be more crucial for the performance of your application
good luck
Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.
I would suggest uthash. Just include #include "uthash.h" then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.
http://incise.org/hash-table-benchmarks.html gcc has a very very good implementation. However, mind that it must respect a very bad standard decision :
If a rehash happens, all iterators are invalidated, but references and
pointers to individual elements remain valid. If no actual rehash
happens, no changes.
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
This means basically the standard says that the implementation MUST BE based on linked lists.
It prevents open addressing which has better performance.
I think google sparse is using open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all competition in memory usage. (also it doesn't have any plateau, pure straight line wrt number of elements)

Scalable stl set like container for C++

I need to store large number of integers. There can be
duplicates in the input stream of integers, I just need
to store distinct amongst them.
I was using stl set initially but It went OutOfMem when
input number of integers went too high.
I am looking for some C++ container library which would
allow me to store numbers with the said requirement possibly
backed by file i.e container should not try to keep all numbers in-mem.
I don't need to store this data persistently, I just need to find
unique values amongst it.
Take a look at the STXXL; might be what you're looking for.
Edit: I haven't used it myself, but from the docs - you could use stream::runs_creator to create sorted runs of your data (however much fits in memory), then stream::runs_merger to merge the sorted streams, and finally use stream::unique to filter uniques.
Since you need larger than RAM allows you might look at memcached
Have you considered using DB (maybe SQLite)? Or it would be too slow?
You should seriously at least try a database before concluding it is too slow. All you need is one of the lightweight key-value store ones. In the past I have used Berkeley DB, but here is a list of other ones.

Sqlite for disk backed associative array?

I want to use SQLite as an associative array that is saved to disk.
Is this a good idea? I am worried about having to parse SQL every time I do something like:
database["someindex"] which will have to be translated to something like
select value from db where index = 'someindex' which in turn will have to be translated to the SQL internal language.
If you are worried about SQL overhead and only need a simple associative array, maybe a dbm relative like GDBM or Berkeley DB would be a good choice?
Check out sqlite parameters for an easy way to go variable <-> sql
SQLite should be pretty fast as a disk based associative array. Remember to use prepared statements, which parse and compile your SQL once to be invoked many times; they're also safer against SQL injection attacks. If you do that, you should get pretty good performance out of SQLite.
Another option, for a simple disk-based associative array, is, well, the filesystem; that's a pretty popular disk-based associative array. Create a directory on the filesystem, use one key per entry. If you are going to need more than a couple hundred, then create one directory per two-character prefix of the key, to keep the number of files per directory reasonably small. If your keys aren't safe as filenames, then hash them using SHA-1 or SHA-256 or something.
It really depends on your actual problem. Your problem statement is very generic and highly dependent on the size of your hashtable.
For small hashtables you're only intending to read and write once you might actually prefer a text-file (handy for debugging).
If your hashtable is, say, smaller than 25meg SQLite will probably work well for you