Scalable STL set-like container for C++

I need to store a large number of integers. There can be
duplicates in the input stream of integers; I just need
to store the distinct values among them.
I was using an STL set initially, but it ran out of memory
when the number of input integers grew too large.
I am looking for a C++ container library that would
allow me to store numbers with the above requirement, possibly
backed by a file, i.e. the container should not try to keep all the numbers in memory.
I don't need to store this data persistently; I just need to find
the unique values in it.

Take a look at STXXL; it might be what you're looking for.
Edit: I haven't used it myself, but from the docs you could use stream::runs_creator to create sorted runs of your data (however much fits in memory), then stream::runs_merger to merge the sorted streams, and finally stream::unique to filter out the duplicates.
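If you'd like to see the idea without pulling in STXXL, here is a rough sketch of the same sorted-runs / merge / unique pipeline in plain standard C++ (C++17). The chunk size, temporary file names, and text I/O format are arbitrary choices for illustration, not anything STXXL prescribes:

```cpp
// Sketch: sort RAM-sized chunks, write each as a run file, then do a
// k-way merge of the runs and emit each value only once.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

static std::string writeRun(std::vector<std::int64_t>& chunk, std::size_t runNo) {
    std::sort(chunk.begin(), chunk.end());
    std::string name = "run_" + std::to_string(runNo) + ".tmp";
    std::ofstream out(name);
    for (std::int64_t v : chunk) out << v << '\n';
    chunk.clear();
    return name;
}

int main() {
    const std::size_t kChunkSize = 10'000'000;  // tune to what fits in RAM

    // Phase 1: split the input into sorted runs on disk.
    std::ifstream in("input.txt");
    std::vector<std::string> runs;
    std::vector<std::int64_t> chunk;
    std::int64_t x;
    while (in >> x) {
        chunk.push_back(x);
        if (chunk.size() == kChunkSize) runs.push_back(writeRun(chunk, runs.size()));
    }
    if (!chunk.empty()) runs.push_back(writeRun(chunk, runs.size()));

    // Phase 2: k-way merge of the runs via a min-heap, skipping duplicates.
    std::vector<std::ifstream> streams;
    for (const std::string& r : runs) streams.emplace_back(r);

    using Entry = std::pair<std::int64_t, std::size_t>;  // (value, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < streams.size(); ++i)
        if (streams[i] >> x) heap.push({x, i});

    std::ofstream out("unique.txt");
    bool first = true;
    std::int64_t last = 0;
    while (!heap.empty()) {
        auto [v, i] = heap.top();
        heap.pop();
        if (first || v != last) { out << v << '\n'; last = v; first = false; }
        if (streams[i] >> x) heap.push({x, i});
    }
}
```

The merge only ever holds one value per run in memory, so the peak memory use is bounded by the chunk size you pick.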

Since you need more than RAM allows, you might look at memcached.

Have you considered using a database (maybe SQLite)? Or would that be too slow?

You should at least seriously try a database before concluding it is too slow. All you need is one of the lightweight key-value stores. In the past I have used Berkeley DB, but there are plenty of others to choose from.
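The original problem (keep only distinct integers, backed by disk) maps quite naturally onto such a store. A rough sketch with Berkeley DB's C API, assuming the DB->put / DB_NOOVERWRITE interface; the file name, flags, and lack of error handling are my own simplifications:

```cpp
// Sketch: deduplicate integers through Berkeley DB. DB_NOOVERWRITE makes
// put() refuse keys that already exist (returning DB_KEYEXIST), so the
// database file ends up holding exactly the distinct values.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <db.h>

int main() {
    DB* db = nullptr;
    if (db_create(&db, nullptr, 0) != 0) return 1;
    if (db->open(db, nullptr, "distinct.db", nullptr, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    std::int64_t value;
    while (std::cin >> value) {
        DBT key, data;
        std::memset(&key, 0, sizeof key);
        std::memset(&data, 0, sizeof data);
        key.data = &value;
        key.size = sizeof value;
        // Empty payload; we only care about the set of keys.
        db->put(db, nullptr, &key, &data, DB_NOOVERWRITE);
    }
    db->close(db, 0);
}
```

Link with -ldb; any other key-value store with a "no overwrite" or "put if absent" flag works the same way.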

Related

Is there a way to create a database with a hash table that features constant time lookup?

I'm trying to create a database in my program with a hash table for constant-time lookups. Right now I have the hash table coded, and I have a few values stored in the table (I used an array of structures). But I want to enable the user of my code to manually input a new value and permanently store it in the table. I figured I might need to use a database, as I don't think a text file would allow the constant-time look-ups the hash table provides. I also don't know how I would store an array of structures in a text file, if that would be the better option. Any help?
EDIT: I didn't make this clear enough, but is it possible for me to make a hash table and have the values I input permanently stored in the table for constant-time look-up? Or do I have to manually code in everything?
There are many third party libraries you can use for this. They are mostly C libraries, which can be used in C++.
If you are on a typical Linux platform, you probably already have gdbm installed, so you might as well just use that.
Other options include LMDB, qdbm, and Berkeley DB, to name just a few.
edit: Oops, don't know how I forgot LevelDB, from the big G.
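To make that concrete, here is a rough sketch of what using gdbm from C++ might look like; the file name, key/value encoding, and omitted error handling are my own choices:

```cpp
// Sketch: a persistent on-disk hash table via gdbm. Values stored here
// survive program restarts, and lookups go straight to the disk file.
// Link with -lgdbm. gdbm_fetch() returns a malloc'd buffer you must free().
#include <cstdlib>
#include <iostream>
#include <string>
#include <gdbm.h>

static datum make_datum(const std::string& s) {
    datum d;
    d.dptr = const_cast<char*>(s.c_str());
    d.dsize = static_cast<int>(s.size());
    return d;
}

int main() {
    char name[] = "store.gdbm";
    GDBM_FILE db = gdbm_open(name, 0, GDBM_WRCREAT, 0664, nullptr);
    if (!db) return 1;

    // Store (or overwrite) a value under a key.
    datum key = make_datum("user:42");
    datum val = make_datum("hello");
    gdbm_store(db, key, val, GDBM_REPLACE);

    // Look it up again, straight from the on-disk hash table.
    datum found = gdbm_fetch(db, key);
    if (found.dptr) {
        std::cout << std::string(found.dptr, found.dsize) << '\n';
        std::free(found.dptr);
    }
    gdbm_close(db);
}
```

LMDB, qdbm, Berkeley DB, and LevelDB all follow the same open / put / get pattern, just with different APIs.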

Creating maps of maps from a dictionary file in C++

I have a text file containing a list of words (about 35 MB of data). I wrote an application that works pretty much like a Scrabble helper. I find it inefficient to load the whole file into a set, since it takes about 10 minutes. I am not very experienced in C++, so I want to ask what's a better way to achieve this. In the first version of my application I just binary-searched through the file. So I managed to solve the problem by doing a binary search on the file (without loading it, just moving the file pointer using seekg). But this solution isn't as fast as using maps of maps. When searching for a word, I look up its first letter in a map. Then I retrieve a map of possible second letters and do another search (for the second letter), and so on. That way I can tell whether the word is in the dictionary much faster. How can I achieve this without loading the whole file into the program to build these maps? Can I save the maps in a database and read them from there? Would that be faster?
35 MB of data is tiny. There's no problem with loading it all into memory, and no reason for it to take 10 minutes to load. If it takes that long, I suspect your loading scheme repeatedly copies the maps.
However, instead of fixing this or coming up with your own scheme, perhaps you should try something ready-made.
Your description sounds like you could use a database of nested structures. MongoDB, which has a C++ interface, is one possible solution.
For improved efficiency, you could go a bit fancier with the scheme. Say, for words of up to 5 letters, you could use a multikey index. Beyond that, you could go with a completely nested structure.
Just don't do it yourself. Concentrate on your program logic.
First, I agree with Ami that 35 MB shouldn't in principle take that long to load and store in memory. Could there be a problem with your loading code (for example, accidentally copying maps, causing lots of allocation/deallocation)?
If I understand your intention correctly, you are building a kind of trie structure (a trie, not a tree) using maps of maps, as you described. This can be very nice if it fits in memory, but if you want to load only part of the maps into memory it becomes very difficult (not technically, but in determining which maps to load and which not to). You would then risk reading much more data from disk than actually needed, although there are some implementations of persistent tries around.
If your intent is to have the indexing scheme on disk, I'd rather advise you to use a traditional B-tree data structure, which is designed to optimize the loading of partial indexes. You can write your own, but there are already a couple of implementations around (see this SO question).
You could also use something like SQLite, which is a lightweight DBMS that you can easily embed in your application.
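For reference, the in-memory "maps of maps" layout the question describes is essentially a trie. A minimal sketch (the node layout and names are mine) looks like this; paging parts of it from disk is exactly the hard part discussed above:

```cpp
// Sketch: a trie built from nested std::map nodes. Each node maps a letter
// to the next node; isWord marks the end of a complete dictionary word.
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];      // creates a null slot if missing
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWord = true;
}

bool contains(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;
        node = it->second.get();
    }
    return node->isWord;
}

int main() {
    TrieNode root;
    insert(root, "scrabble");
    insert(root, "scrap");
    std::cout << contains(root, "scrap") << ' ' << contains(root, "scra") << '\n';  // prints: 1 0
}
```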

How can I create an index for fast access on a Clojure hash table?

I wish to store many records in a Clojure hash table. If I wish to get fast access to certain records using a certain field or a range query, what options do I have without resorting to storing the data in a database (which is where the data came from in the first place)?
I guess I'm also wondering whether an STM is the right place for a large indexed data set as well.
Depending on how far you want to push this, you're asking to build an in-memory database. I assume you don't actually want to do that, or you would presumably use one of the many in-memory Java databases that already exist (Derby, H2, etc.).
If you want indexed or range access on multiple attributes of your data, then you need to create all of those indexes in Clojure data structures. Clojure maps will give you O(log32 n) time access to data (worse than constant, but still very bounded). If you need better than that, you can use Java maps like HashMap or ConcurrentHashMap directly, with the caveat that you're outside the Clojure data model. For range access, you'll want some sort of sorted tree data structure; Java has ConcurrentSkipListMap, which is pretty great for what it does. If that's not good enough, you might need your own B-tree implementation.
If you're not changing this data, then Clojure's STM is immaterial. Is this data treated as a cache of a subset of the database? If so, you might consider using a cache library like Ehcache instead (they've recently added support for very large off-heap caches and search capabilities).
Balancing data between in-memory cache and persistent store is a tricky business and one of the most important things to get right in data-heavy apps.
You'll probably want to create separate indexes for each field using a sorted-map so that you can do range queries. Under the hood this uses something like a persistent version of a Java TreeMap.
STM shouldn't be an issue if you are mostly interested in read access. In fact it might even prove better than mutable tables since:
Reads don't require any locking
You can make a consistent snapshot of data and indexes at the same time.

C++ in-memory Key-Value stores

I'm looking for suggestions regarding in-memory key-value store engines or libraries that have C++ interfaces or that are written in C++.
I'm looking for solutions that can scale without any problems to about 100 million key-value pairs and that are compatible/compilable on Linux and Win32/64.
How about std::map?
http://cplusplus.com/reference/stl/map/
If you really need to store that many pairs in memory, consider Google's Sparse Hash. Its implementation is specifically optimized for low memory consumption.
std::map is fine, given that the size of each key and value is small and the available memory is large (for about 100 million pairs).
If that's not the case, and you want to run a program over the key-value pairs, consider using a standard MapReduce API. MapReduce is specifically meant for processing large data sets, especially key-value pairs, on distributed systems, and there are nice C++ APIs for it.
http://en.wikipedia.org/wiki/MapReduce
Try Tokyo Cabinet; it supports hash tables and B+ trees:
http://1978th.net/tokyocabinet/
Try FastDB, though you may get more than you asked for. Tokyo Cabinet also seems to support in-memory databases (or ones backed by a file mapped with mmap; with modern operating systems there's not much difference between an in-RAM database and something mmap'd, since the OS cache makes the latter very efficient too).
A hash map (also called an unordered map) is the best bet for that many pairs. You can find implementations in Boost and TR1.
Edit:
Some people have questioned the size: if he's got, say, a 64-bit server, there's plenty of space for 100 million key-value pairs.
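To give a sense of scale, here is a rough sketch of holding that many pairs in a std::unordered_map, with reserve() called up front to avoid repeated rehashing. The 64-bit integer key/value types and the memory figure in the comment are my own back-of-the-envelope assumptions:

```cpp
// Sketch: ~100 million int64 -> int64 pairs in a std::unordered_map.
// With node and bucket overhead this will very roughly need several GB
// of RAM, which is why the 64-bit-server caveat above matters.
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<std::int64_t, std::int64_t> kv;
    kv.reserve(100'000'000);  // pre-size the bucket array once

    for (std::int64_t i = 0; i < 100'000'000; ++i)
        kv.emplace(i, i * 2);

    std::cout << kv.size() << " pairs, load factor " << kv.load_factor() << '\n';
}
```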
Oracle Berkeley DB is what you need.

Sqlite for disk backed associative array?

I want to use SQLite as an associative array that is saved to disk.
Is this a good idea? I am worried about having to parse SQL every time I do something like
database["someindex"], which will have to be translated to something like
select value from db where index = 'someindex', which in turn will have to be translated into SQLite's internal form.
If you are worried about SQL overhead and only need a simple associative array, maybe a dbm relative like GDBM or Berkeley DB would be a good choice?
Check out SQLite parameters for an easy way to go back and forth between variables and SQL.
SQLite should be pretty fast as a disk-based associative array. Remember to use prepared statements, which parse and compile your SQL once so it can be invoked many times; they're also safer against SQL injection attacks. If you do that, you should get pretty good performance out of SQLite.
Another option for a simple disk-based associative array is, well, the filesystem; that's a pretty popular disk-based associative array. Create a directory on the filesystem and use one file per entry, named after the key. If you are going to need more than a couple hundred entries, create one directory per two-character prefix of the key, to keep the number of files per directory reasonably small. If your keys aren't safe as filenames, hash them using SHA-1 or SHA-256 or something.
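As an illustration of the prepared-statement approach mentioned above (prepare once, then bind / step / reset for every lookup), here is a rough sketch using the SQLite C API; the table schema, column names, and kv.db file name are my own choices:

```cpp
// Sketch: SQLite as a disk-backed associative array. The SELECT is parsed
// and compiled once; each lookup just binds a new key and re-executes it.
// Link with -lsqlite3; error handling is omitted for brevity.
#include <iostream>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("kv.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt* get = nullptr;
    sqlite3_prepare_v2(db, "SELECT value FROM kv WHERE key = ?1", -1, &get, nullptr);

    // Equivalent of database["someindex"]: bind, step, read, reset.
    sqlite3_bind_text(get, 1, "someindex", -1, SQLITE_TRANSIENT);
    if (sqlite3_step(get) == SQLITE_ROW)
        std::cout << reinterpret_cast<const char*>(sqlite3_column_text(get, 0)) << '\n';
    sqlite3_reset(get);
    sqlite3_clear_bindings(get);

    sqlite3_finalize(get);
    sqlite3_close(db);
}
```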
It really depends on your actual problem. Your problem statement is very generic, and the answer depends heavily on the size of your hash table.
For small hash tables that you only intend to read and write once, you might actually prefer a text file (handy for debugging).
If your hash table is, say, smaller than 25 MB, SQLite will probably work well for you.