SQLite for a disk-backed associative array? (C++)

I want to use SQLite as an associative array that is saved to disk.
Is this a good idea? I am worried about having to parse SQL every time I do something like:
database["someindex"], which will have to be translated to something like
SELECT value FROM db WHERE index = 'someindex', which in turn has to be parsed and compiled into SQLite's internal bytecode.

If you are worried about SQL overhead and only need a simple associative array, maybe a dbm relative like GDBM or Berkeley DB would be a good choice?

Check out SQLite bound parameters for an easy way to move values between your variables and SQL.

SQLite should be pretty fast as a disk-based associative array. Remember to use prepared statements, which parse and compile your SQL once so it can be invoked many times; they are also safer against SQL injection attacks. If you do that, you should get pretty good performance out of SQLite.
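For illustration, a minimal sketch of that pattern using the SQLite C API (the database file name and the table kv with its columns are made up for this example):

    #include <sqlite3.h>
    #include <iostream>

    // Minimal sketch: a key lookup via a prepared statement.
    int main() {
        sqlite3 *db = nullptr;
        if (sqlite3_open("store.db", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)",
            nullptr, nullptr, nullptr);

        // Parse and compile the SQL once...
        sqlite3_stmt *stmt = nullptr;
        sqlite3_prepare_v2(db, "SELECT value FROM kv WHERE key = ?1",
                           -1, &stmt, nullptr);

        // ...then bind and run it as many times as needed.
        sqlite3_bind_text(stmt, 1, "someindex", -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            std::cout << sqlite3_column_text(stmt, 0) << "\n";
        sqlite3_reset(stmt);      // ready to rebind for the next lookup

        sqlite3_finalize(stmt);
        sqlite3_close(db);
    }

The win comes from reusing the compiled statement via sqlite3_reset: the SQL is never re-parsed between lookups.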
Another option for a simple disk-based associative array is, well, the filesystem; that's a pretty popular disk-based associative array. Create a directory and use one file per entry, named after the key. If you are going to need more than a couple hundred entries, create one directory per two-character prefix of the key, to keep the number of files per directory reasonably small. If your keys aren't safe as filenames, hash them using SHA-1 or SHA-256 or something.
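A minimal sketch of that filesystem idea using C++17's std::filesystem (the class name and layout are made up; it assumes keys are already safe to use as filenames):

    #include <filesystem>
    #include <fstream>
    #include <iterator>
    #include <string>

    namespace fs = std::filesystem;

    // Minimal sketch of a filesystem-backed associative array:
    // one file per key, grouped into two-character prefix directories.
    class FsMap {
        fs::path root_;

        fs::path path_for(const std::string &key) const {
            fs::path dir = root_ / key.substr(0, 2);   // prefix bucket
            fs::create_directories(dir);
            return dir / key;
        }

    public:
        explicit FsMap(fs::path root) : root_(std::move(root)) {
            fs::create_directories(root_);
        }

        void put(const std::string &key, const std::string &value) {
            std::ofstream(path_for(key), std::ios::binary) << value;
        }

        std::string get(const std::string &key) const {
            std::ifstream in(path_for(key), std::ios::binary);
            return {std::istreambuf_iterator<char>(in), {}};
        }
    };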

It really depends on your actual problem. Your problem statement is very generic, and the answer depends heavily on the size of your hashtable.
For small hashtables that you only intend to read and write once, you might actually prefer a text file (handy for debugging).
If your hashtable is, say, smaller than 25 MB, SQLite will probably work well for you.

Related

Is there a way to create a database with a hash table that features constant time lookup?

I'm trying to create a database in my program with a hash table for constant-time lookups. Right now I have the hash table coded, and I have a few values stored in the table (I used an array of structures). But I wanted to let the user of my code manually input a new value and permanently store it in the table. I figured I might need to use a database, as I don't think a text file would allow the constant-time lookups the hash table provides. I also don't know how to implement an array of structures in a text file, if that would be the better option. Any help?
EDIT: I didn't make this clear enough, but is it possible for me to make a hash table and have the values I input permanently stored in the table for constant-time lookup? Or do I have to manually code in everything?
There are many third party libraries you can use for this. They are mostly C libraries, which can be used in C++.
If you are on a typical Linux platform, you probably already have gdbm installed, so you might as well just use that.
Other options include LMDB, qdbm, and BerkeleyDB, to name just a very few.
edit: Oops, don't know how I forgot LevelDB, from the big G.
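For instance, a minimal sketch with gdbm's C API (the file name and strings are made up; error handling mostly omitted):

    #include <gdbm.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Minimal sketch: gdbm as a disk-backed associative array.
    int main() {
        GDBM_FILE db = gdbm_open("store.gdbm", 0, GDBM_WRCREAT, 0644, nullptr);
        if (!db) return 1;

        datum key   = { const_cast<char *>("someindex"), (int)std::strlen("someindex") };
        datum value = { const_cast<char *>("somevalue"), (int)std::strlen("somevalue") };
        gdbm_store(db, key, value, GDBM_REPLACE);

        datum result = gdbm_fetch(db, key);   // dptr is malloc'd; caller frees it
        if (result.dptr) {
            std::printf("%.*s\n", result.dsize, result.dptr);
            std::free(result.dptr);
        }
        gdbm_close(db);
    }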

Creating maps of maps from a dictionary file in C++

I have a text file containing a list of words (about 35 MB of data). I wrote an application that works pretty much like a Scrabble helper. I find it insufficient to load the whole file into a set, since that takes about 10 minutes. I am not very experienced in C++, so I want to ask: what's a better way to achieve this? In my first version of the application I just binary-searched through the file itself (without loading it, just moving the file pointer using seekg), which solved the problem, but that solution isn't as fast as using maps of maps. When searching for a word, I look up its first letter in a map. Then I retrieve a map of possible second letters and do another search (for the second letter), and so on. That way I can tell much faster whether the word is in the dictionary. How can I achieve this without loading the whole file into the program to build these maps? Could I save them in a database and read them from there? Would that be faster?
35MB of data is tiny. There's no problem with loading it all into memory, and no reason for it to take 10 minutes to load. If it takes so long, I suspect your loading scheme recopies maps.
However, instead of fixing this or coming up with your own scheme, perhaps you should try something ready-made.
Your description sounds like you could use a database of nested structures. MongoDB, which has a C++ interface, is one possible solution.
For improved efficiency, you could get a bit fancy with the scheme. For words of up to, say, five letters, you could use a multikey index; beyond that, you could go with a completely nested structure.
Just don't do it yourself. Concentrate on your program logic.
First, I agree with Ami that 35 MB shouldn't in principle take that long to load and store in memory. Could there be a problem with your loading code (for example, accidentally copying maps, causing lots of allocation/deallocation)?
If I understand your intention correctly, you are building a kind of trie structure (a trie, not a tree) using maps of maps, as you described. This can be very nice if kept in memory, but if you want to load only part of the maps into memory, it becomes very difficult (not technically, but in determining which maps to load and which not to load). You'd then risk reading much more data from disk than you actually need, although there are some implementations of persistent tries around.
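For reference, a minimal in-memory sketch of that maps-of-maps trie (names are made up):

    #include <map>
    #include <string>

    // Each node maps the next letter to a child node; is_word marks
    // nodes where a complete dictionary word ends.
    struct TrieNode {
        bool is_word = false;
        std::map<char, TrieNode> children;
    };

    void insert(TrieNode &root, const std::string &word) {
        TrieNode *node = &root;
        for (char c : word)
            node = &node->children[c];   // creates the child if missing
        node->is_word = true;
    }

    bool contains(const TrieNode &root, const std::string &word) {
        const TrieNode *node = &root;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = &it->second;
        }
        return node->is_word;
    }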
If your intent is to have the indexing scheme on disk, I'd rather advise you to use a traditional B-tree data structure, which is designed to optimize the loading of partial indexes. You can write your own, but there are already a couple of implementations around (see this SO question).
You could also use something like SQLite, which is a lightweight DBMS that you can easily embed in your application.

How efficient is searching in an SQL database?

Hi, I'm trying to write data that I have in a std::map<key, value> into an SQL database. In the map the keys are sorted for easy lookup, but if I created a table of the map's items in SQL, how hard would it be to search the table to get a record by its key?
The searching would be easy; the efficiency would depend on whether or not you're indexing correctly.
About as hard as learning how to use the SQL SELECT statement.
What may well be inefficient is constructing and deconstructing multiple SQL SELECT statements when all you want to change is one value in the WHERE clause.
Stored procedures could well be far more efficient for doing this.
Your question is vague, but it seems like you're comparing apples and oranges. SQL is designed to search scalable data efficiently. Key-value pairs in C++ (or any other language) are limited to RAM, so they are not very scalable. There are overheads in communicating with an RDBMS. There is memory pressure, design efficiency (i.e. your chosen data types and indexes), the algorithm your C++ container implements for the lookup (hash/B-tree), etc.
At the end of the day, the correct question to ask is, "Which tool is best for the job?" and leave it at that.
Assuming that you are inserting the data from a map into a database table, you already have the primary key for your table, i.e. the key you use for the map; and since a map doesn't allow duplicate keys, you have a unique key for each record.
Create an index on the key for the table. Do create an index, or else all your queries will do a full table scan and all the benefits of having unique rows go down the drain. But beware: if your map has only 100 rows or so, creating an index on the table will just add unnecessary overhead.
Creating indexes is more or less the same across databases, but it would be quite difficult to estimate the efficiency without knowing which database you are using and how much data is going to be stored in the table.
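As a concrete illustration with SQLite (table and column names are made up):

    #include <sqlite3.h>

    // Minimal sketch: a keyed table so lookups hit an index instead of
    // doing a full table scan.
    int main() {
        sqlite3 *db = nullptr;
        if (sqlite3_open("map.db", &db) != SQLITE_OK) return 1;

        // A PRIMARY KEY is backed by an index automatically...
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value TEXT)",
            nullptr, nullptr, nullptr);

        // ...any other column you search on needs an explicit index.
        sqlite3_exec(db,
            "CREATE INDEX IF NOT EXISTS idx_items_value ON items(value)",
            nullptr, nullptr, nullptr);

        sqlite3_close(db);
    }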
If the data fits in memory, std::map will be much more efficient than any DB.
I once tested against SQLite3 with an in-memory database, and std::map was faster by an order of magnitude (and SQLite is very fast in this case, faster than any other RDBMS I've seen).
The reason: it provides direct access to the data structure, has no intermediate layer, and allows walking over the tree very fast.
However, if your data is very big or you don't want to use too much memory, an RDBMS is a good solution, and given an index on the key it will be quite fast.

How can I create an index for fast access on a clojure hash table?

I wish to store many records in a Clojure hash table. If I wish to get fast access to certain records using a certain field or a range query, what options do I have, without having to resort to storing the data in a database (which is where the data came from in the first place)?
I guess I'm also wondering whether an STM is the right place for a large indexed data set as well.
Depending on how far you want to push this, you're asking to build an in-memory database. I assume you don't actually want to do that; if you did, you'd presumably use one of the many in-memory Java databases that already exist (Derby, H2, etc.).
If you want indexed or range access on multiple attributes of your data, then you need to build all of those indexes in Clojure data structures. Clojure maps give you O(log32 n) access to data (worse than constant, but still very bounded). If you need better than that, you can use Java maps like HashMap or ConcurrentHashMap directly, with the caveat that you're stepping outside the Clojure data model. For range access, you'll want some sort of sorted tree data structure; Java has ConcurrentSkipListMap, which is pretty great for what it does. If that's not good enough, you might need your own B-tree implementation.
If you're not changing this data, then Clojure's STM is immaterial. Is this data treated as a cache of a subset of the database? If so, you might consider using a cache library like Ehcache instead (they've recently added support for very large off-heap caches and search capabilities).
Balancing data between in-memory cache and persistent store is a tricky business and one of the most important things to get right in data-heavy apps.
You'll probably want to create separate indexes for each field using a sorted-map so that you can do range queries. Under the hood this uses something like a persistent version of a Java TreeMap.
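Clojure's sorted-map is the analogue of a sorted tree map in other languages; for comparison, here is the same range-query idea sketched in C++ with std::map (the data is made up):

    #include <iostream>
    #include <map>
    #include <string>

    // A sorted map (red-black tree) supports efficient range scans
    // via lower_bound/upper_bound.
    int main() {
        std::map<int, std::string> index = {
            {10, "alice"}, {20, "bob"}, {30, "carol"}, {40, "dave"}};

        // All records with 15 <= key <= 35.
        auto lo = index.lower_bound(15), hi = index.upper_bound(35);
        for (auto it = lo; it != hi; ++it)
            std::cout << it->first << " -> " << it->second << '\n';
    }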
STM shouldn't be an issue if you are mostly interested in read access. In fact it might even prove better than mutable tables since:
Reads don't require any locking
You can make a consistent snapshot of data and indexes at the same time.

Scalable STL-set-like container for C++

I need to store a large number of integers. There can be duplicates in the input stream of integers; I just need to store the distinct ones.
I was using std::set initially, but it ran out of memory when the number of input integers grew too high.
I am looking for a C++ container library that would allow me to store numbers with the said requirement, possibly backed by a file, i.e. the container should not try to keep all the numbers in memory.
I don't need to store this data persistently; I just need to find the unique values among it.
Take a look at STXXL; it might be what you're looking for.
Edit: I haven't used it myself, but from the docs, you could use stream::runs_creator to create sorted runs of your data (however much fits in memory), then stream::runs_merger to merge the sorted streams, and finally use stream::unique to filter out duplicates.
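If STXXL is overkill, the same runs-then-merge idea is easy to hand-roll; a minimal sketch (the chunk size and temp file names are made up; integers are read from stdin):

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Sort chunks that fit in memory, spill each to a temp file,
    // then k-way merge the files and emit each value once.
    static std::string spill(std::vector<int64_t> &buf, size_t run_id) {
        std::sort(buf.begin(), buf.end());
        buf.erase(std::unique(buf.begin(), buf.end()), buf.end());
        std::string name = "run" + std::to_string(run_id) + ".tmp";
        std::ofstream out(name);
        for (int64_t v : buf) out << v << '\n';
        buf.clear();
        return name;
    }

    int main() {
        const size_t kChunk = 1000000;   // tune to available memory
        std::vector<std::string> runs;
        std::vector<int64_t> buf;
        int64_t x;
        while (std::cin >> x) {
            buf.push_back(x);
            if (buf.size() >= kChunk) runs.push_back(spill(buf, runs.size()));
        }
        if (!buf.empty()) runs.push_back(spill(buf, runs.size()));

        // Min-heap of (value, run index) for the k-way merge.
        using Item = std::pair<int64_t, size_t>;
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        std::vector<std::ifstream> in(runs.size());
        for (size_t i = 0; i < runs.size(); ++i) {
            in[i].open(runs[i]);
            if (in[i] >> x) heap.emplace(x, i);
        }

        bool first = true;
        int64_t last = 0;
        while (!heap.empty()) {
            auto [v, i] = heap.top();
            heap.pop();
            if (first || v != last) std::cout << v << '\n';  // distinct only
            first = false;
            last = v;
            if (in[i] >> x) heap.emplace(x, i);
        }
    }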
Since you need more than RAM allows, you might look at memcached.
Have you considered using a DB (maybe SQLite)? Or would that be too slow?
You should seriously at least try a database before concluding it is too slow. All you need is one of the lightweight key-value stores. In the past I have used Berkeley DB, but here is a list of other ones.