How efficient is searching in a SQL database? - c++

Hi, I'm trying to write data that I have in a std::map<key, value> into a SQL database. In the map the keys are sorted for easy lookup, but if I create a table of the map items in SQL, how hard would it be to search through the table to get a record by its key?

The searching would be easy; the efficiency would depend on whether or not you're indexing correctly.

About as hard as learning how to use the SQL SELECT statement.

What may well be inefficient is constructing and deconstructing multiple SQL SELECT statements when all you want is a different value in the WHERE clause.
Stored procedures could well be far more efficient for doing this.
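In the same spirit, a prepared statement is an easy win if your engine (like SQLite) has no stored procedures: the SQL is parsed and compiled once and only the key is rebound per lookup. A minimal sketch with the SQLite C API, assuming a hypothetical table kv(key TEXT PRIMARY KEY, value TEXT):

```cpp
#include <sqlite3.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("map.db", &db) != SQLITE_OK) return 1;

    // Parse and compile the statement once...
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT value FROM kv WHERE key = ?1;", -1, &stmt, nullptr);

    // ...then rebind only the key for each lookup (hypothetical keys).
    std::vector<std::string> keys = {"alpha", "beta", "gamma"};
    for (const auto& k : keys) {
        sqlite3_bind_text(stmt, 1, k.c_str(), -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            std::printf("%s -> %s\n", k.c_str(),
                        reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));
        sqlite3_reset(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```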

Your question is vague, but it seems like you're comparing apples and oranges. SQL is designed to search scalable data efficiently. C++ (or any other in-process) key-value containers are limited to RAM, so they are not very scalable. There are overheads in communicating with an RDBMS. There is also memory pressure, design efficiency (i.e. your chosen data types and indexes), the algorithm your C++ container uses for the lookup (hash/B-tree), etc.
At the end of the day, the correct question to ask is, "Which tool is best for the job?" and leave it at that.

Assuming that you are inserting the data from the map into a database table, you already have the primary key for your table, i.e. the key you use for the map, and since a map doesn't allow duplicate keys you have a unique key for each record.
Create an index on that key column. Do create an index, otherwise all your queries will do a full table scan and the benefit of having unique rows goes down the drain. But beware: if your map has only 100 rows or so, an index will just add unnecessary overhead.
Creating indexes is more or less the same across databases, but it is hard to estimate the efficiency without knowing which database you are using and how much data will be stored in the table.
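For illustration, a short sketch of the table setup with SQLite (assumed engine, hypothetical names). Note that in most engines, SQLite included, declaring the map key as the PRIMARY KEY already gives you an index, which is what keeps key lookups off a full table scan:

```cpp
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("map.db", &db) != SQLITE_OK) return 1;

    // The map key becomes the primary key; SQLite backs a TEXT PRIMARY KEY
    // with an automatic index, so key lookups do not scan the whole table.
    const char* ddl =
        "CREATE TABLE IF NOT EXISTS kv ("
        "  key   TEXT PRIMARY KEY,"
        "  value TEXT NOT NULL"
        ");";

    char* err = nullptr;
    if (sqlite3_exec(db, ddl, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "DDL failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
}
```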

If the data fits in memory, std::map will be much more efficient than any DB.
I once tested against SQLite3 with an in-memory database, and std::map was faster by an order of magnitude (and SQLite is very fast in this case, faster than any other RDBMS I've seen).
The reason: std::map gives direct access to the data structure, with no intermediate layer, and walking the tree is very fast.
However, if your data is very big, or you don't want to use too much memory, an RDBMS is a good solution, and given an index on the key it will be quite fast.
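For reference, a rough sketch of the kind of comparison described above: N key lookups against a std::map versus an in-memory SQLite table. The names are hypothetical and the numbers will vary by machine, so treat it as a starting point rather than a rigorous benchmark:

```cpp
#include <sqlite3.h>
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

int main() {
    const int N = 100000;

    // Fill both containers with the same key/value pairs.
    std::map<std::string, std::string> m;
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db, "CREATE TABLE kv(key TEXT PRIMARY KEY, value TEXT);",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO kv VALUES(?1, ?2);", -1, &ins, nullptr);
    for (int i = 0; i < N; ++i) {
        std::string k = "key" + std::to_string(i), v = "value" + std::to_string(i);
        m[k] = v;
        sqlite3_bind_text(ins, 1, k.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(ins, 2, v.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_finalize(ins);

    // Small helper: run fn and return elapsed microseconds.
    auto time_us = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        return std::chrono::duration_cast<std::chrono::microseconds>(
                   std::chrono::steady_clock::now() - t0).count();
    };

    std::size_t hits = 0;

    auto map_us = time_us([&] {
        for (int i = 0; i < N; ++i)
            hits += m.count("key" + std::to_string(i));
    });

    sqlite3_stmt* sel = nullptr;
    sqlite3_prepare_v2(db, "SELECT value FROM kv WHERE key = ?1;", -1, &sel, nullptr);
    auto db_us = time_us([&] {
        for (int i = 0; i < N; ++i) {
            std::string k = "key" + std::to_string(i);
            sqlite3_bind_text(sel, 1, k.c_str(), -1, SQLITE_TRANSIENT);
            if (sqlite3_step(sel) == SQLITE_ROW) ++hits;
            sqlite3_reset(sel);
        }
    });
    sqlite3_finalize(sel);
    sqlite3_close(db);

    std::printf("hits=%zu  std::map: %lld us  sqlite3: %lld us\n",
                hits, static_cast<long long>(map_us), static_cast<long long>(db_us));
}
```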

Related

Is there a way to create a database with a hash table that features constant time lookup?

I'm trying to create a database in my program with a hash table for constant-time lookups. Right now I have the hash table coded, and I have a few values stored in the table (I used an array of structures). But I want to let the user of my code manually input a new value and permanently store it in the table. I figured I might need to use a database, as I don't think a text file would allow the constant-time lookups the hash table provides. I also don't know how I would implement an array of structures in a text file, if that would be the better option. Any help?
EDIT: I didn't make this clear enough, but is it possible for me to make a hash table and have the values I input permanently stored in the table for constant-time lookup? Or do I have to manually code in everything?
There are many third party libraries you can use for this. They are mostly C libraries, which can be used in C++.
If you are on a typical Linux platform, you probably already have gdbm installed, so you might as well just use that.
Other options include LMDB, qdbm, and BerkeleyDB, to name just a very few.
edit: Oops, don't know how I forgot LevelDB, from the big G.
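For illustration, a minimal sketch with gdbm (a C library, callable from C++ as noted above); names are made up and error handling is kept to a bare minimum:

```cpp
#include <gdbm.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // GDBM_WRCREAT opens the file for read/write, creating it if needed,
    // so values stored here survive across program runs.
    GDBM_FILE db = gdbm_open("table.gdbm", 0, GDBM_WRCREAT, 0644, nullptr);
    if (!db) return 1;

    char key_buf[] = "someindex";
    char val_buf[] = "some value";
    datum key{key_buf, static_cast<int>(sizeof(key_buf))};
    datum val{val_buf, static_cast<int>(sizeof(val_buf))};

    gdbm_store(db, key, val, GDBM_REPLACE);   // insert or overwrite

    datum out = gdbm_fetch(db, key);          // hashed lookup on disk
    if (out.dptr) {
        std::printf("%s\n", out.dptr);
        free(out.dptr);                       // gdbm_fetch returns malloc'd memory
    }
    gdbm_close(db);
}
```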

Is RocksDB a good choice for storing homogeneous objects?

I'm looking for an embeddable data storage engine in C++. RocksDB is a key-value store.
My data is very homogeneous. I have a modest number of types (on the order of 20), and I store many instances (on the order of 1 million) of those types.
I imagine that the homogeneity of my data makes RocksDB a poor choice. If I serialise each object individually, surely I'm duplicating the schema metadata? And surely that will result in poor performance?
So my question: Is RocksDB a good choice for storing homogeneous objects? If so, how does one avoid the performance implications of duplicating schema metadata?
Unlike, e.g., SQLite, there is no schema metadata in RocksDB, because there is no schema: it maps a binary key to a binary value. RocksDB has no serialization built into it. If you are storing objects, you will have to serialize them yourself and use, e.g., the key, a key prefix, or column families (~ lightweight DB tables) to distinguish the types.
Typically you would use RocksDB to build some kind of custom database. Someone built, e.g., a cache for protobuf objects on top of it (ProfaneDB). Often I would say it is too low-level, but if you need no structured data or queries, it will work fine, is very fast, and is generally pleasant to work with (the code is readable, and is sometimes the best documentation, since you will be dealing with database internals).
I have used a varint key prefix in a small toy application before, which costs just one byte of overhead for up to 127 types, but column families are probably preferable for a production application. They also have constant overhead, and can be individually tuned, added, dropped, and managed. I wouldn't forsake the additional features you get from them to save a few bytes. That is also roughly representative of the level at which you will deal with problems if you go with RocksDB.
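A minimal sketch (hypothetical names, most error handling elided) of the column-family approach with the RocksDB C++ API: one family per object type, with values stored as opaque serialized bytes:

```cpp
#include <rocksdb/db.h>
#include <cassert>
#include <string>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/objects_db", &db);
    assert(s.ok());

    // One column family per object type ("user" is a made-up type name).
    // Note: when reopening an existing DB, column families must be listed
    // in DB::Open rather than recreated.
    rocksdb::ColumnFamilyHandle* users = nullptr;
    s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "user", &users);
    assert(s.ok());

    // RocksDB only sees bytes; serialize the object however you like.
    s = db->Put(rocksdb::WriteOptions(), users, "user:42", "<serialized user>");
    assert(s.ok());

    std::string value;
    s = db->Get(rocksdb::ReadOptions(), users, "user:42", &value);
    assert(s.ok() && value == "<serialized user>");

    db->DestroyColumnFamilyHandle(users);
    delete db;
}
```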
As I understand it, RocksDB is really a key-value store and not a database at all.
This means you only get the ability to store binary key and value data. Unlike a normal database (e.g. MySQL, SQLite), you don't get tables where you can define columns/types etc.
Therefore it is your program which determines how the data would be stored.
One possibility is to store your data as JSON values, in which case as you say you pay the cost of storing the "schema" (i.e. the JSON field names) in the values.
Another choice might be to have a special key (called, for example, SCHEMA) that contains an AVRO schema for all your object types. Your app can read this on startup, initialise its readers/writers, and then it knows how to process each key+value stored in RocksDB.
Yet another choice might be to hard-code the logic in your app. You could use any number of libraries for this, including AVRO (as mentioned above) or MsgPack and its variants. In this case you do need to be careful if you intend to read RocksDB data written by a previous version of the app after a schema change, so maybe store a version number or something in the DB.

c++ pivot table implementation

Similar to this question Pivot Table in c#, I'm looking to find an implementation of a pivot table in C++. Due to the project requirements speed is fairly critical, and the rest of the performance-critical part of the project is written in C++, so an implementation in C++ or callable from C++ would be highly desirable. Does anyone know of implementations of a pivot table similar to the one found in Excel or OpenOffice?
I'd rather not have to code such a thing from scratch, but if I was to do this how should I go about it? What algorithms and data structures would be good to be aware of? Any links to an algorithm would be greatly appreciated.
I am sure you are not asking for the full feature set of an Excel pivot table. I think you want a simple statistics table based on discrete explanatory variables and a given statistic. If so, I think this is a case where writing it from scratch might be faster than studying other implementations.
Just update a std::map (or similar data structure), keyed by the combination of explanatory variables and holding the given statistic as its value, as the program reads each data point.
Once the reading is done, it's just a matter of organizing the output table from the map, which may be trivial depending on your goal (see the sketch below).
I believe most of the C# examples in the question you linked take this approach anyway.
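A minimal sketch of that map-based aggregation, with made-up record fields and a simple sum as the statistic:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical input record: two explanatory variables and a measure.
struct Record {
    std::string region;
    std::string product;
    double amount;
};

int main() {
    std::vector<Record> data = {
        {"north", "widget", 10.0}, {"north", "widget", 5.0}, {"south", "gadget", 7.5}};

    // Key = combination of explanatory variables, value = running statistic (a sum here).
    std::map<std::pair<std::string, std::string>, double> pivot;
    for (const auto& r : data)
        pivot[{r.region, r.product}] += r.amount;

    // Producing the output table is then just a walk over the (sorted) map.
    for (const auto& [key, sum] : pivot)
        std::printf("%s / %s : %.2f\n", key.first.c_str(), key.second.c_str(), sum);
}
```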
I'm not aware of an existing implementation that would suit your needs, so, assuming you were to write one...
I'd suggest using SQLite to store your data and SQL to compute the aggregates (note: SQL won't do median, so I suggest an abstraction at some stage to allow such behaviour); a sketch of such a query follows below. The benefit of using SQLite is that it's pretty flexible and extremely robust, plus it lets you take advantage of their hard work on storing and manipulating data. Wrapping the interface you expect from your pivot table around this concept seems like a good way to start, and could save you quite a lot of time.
You could then combine this with a model-view-controller architecture for UI components, I anticipate that would work like a charm. I'm a very satisfied user of Qt, so in this regard I'd suggest using Qt's QTableView in combination with QStandardItemModel (if I can get away with it) or QAbstractItemModel (if I have to). Not sure if you wanted this suggestion, but it's there if you want it :).
Hope that gives you a starting point, any questions or additions, don't hesitate to ask.
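A hedged sketch of what the SQLite-backed aggregation could look like, using a hypothetical shipments table: each row of the GROUP BY result is one pivot cell (here, items shipped per week per warehouse):

```cpp
#include <sqlite3.h>
#include <cstdio>

// Print one pivot cell per result row: (week, warehouse, total shipped).
static int print_row(void*, int argc, char** argv, char** col) {
    for (int i = 0; i < argc; ++i)
        std::printf("%s=%s  ", col[i], argv[i] ? argv[i] : "NULL");
    std::printf("\n");
    return 0;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("shipments.db", &db) != SQLITE_OK) return 1;

    const char* sql =
        "SELECT week, warehouse, SUM(items) AS total "
        "FROM shipments "
        "GROUP BY week, warehouse "
        "ORDER BY week, warehouse;";

    char* err = nullptr;
    if (sqlite3_exec(db, sql, print_row, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "query failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
}
```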
I think the reason your question didn't get much attention is that it's not clear what your input data is, nor what options for pivot table you want to support.
A pivot table, in its basic form, is just running through the data and aggregating values into buckets. For example, say you want to see how many items you shipped each week from each warehouse over the last few weeks:
You would create a multi-dimensional array of buckets (rows are weeks, columns are warehouses), then run through the data, deciding which bucket each record belongs to, adding the amount in the record you're looking at, and moving on to the next record.

How can I create an index for fast access on a clojure hash table?

I wish to store many records in a Clojure hash table. If I wish to get fast access to certain records using a certain field or a range query, what options do I have, without having to resort to storing the data in a database (which is where the data came from in the first place)?
I guess I'm also wondering whether an STM is the right place for a large indexed data set as well.
Depending how far you want to push this, you're asking to build an in-memory database. I assume you don't actually want to do that; otherwise you would presumably use one of the many in-memory Java databases that already exist (Derby, H2, etc).
If you want indexed or range access to multiple attributes of your data, then you need to create all of those indexes in Clojure data structures. Clojure maps will give you O(log32 n) access to data (worse than constant, but still very bounded). If you need better than that, you can use Java maps like HashMap or ConcurrentHashMap directly, with the caveat that you're outside the Clojure data model. For range access, you'll want some sort of sorted tree data structure... Java has ConcurrentSkipListMap, which is pretty great for what it does. If that's not good enough, you might need your own B-tree implementation.
If you're not changing this data, then Clojure's STM is immaterial. Is this data treated as a cache of a subset of the database? If so, you might consider using a cache library like Ehcache instead (they've recently added support for very large off-heap caches and search capabilities).
Balancing data between in-memory cache and persistent store is a tricky business and one of the most important things to get right in data-heavy apps.
You'll probably want to create separate indexes for each field using a sorted-map so that you can do range queries. Under the hood this uses something like a persistent version of a Java TreeMap.
STM shouldn't be an issue if you are mostly interested in read access. In fact it might even prove better than mutable tables since:
Reads don't require any locking
You can make a consistent snapshot of data and indexes at the same time.

Sqlite for disk backed associative array?

I want to use SQLite as an associative array that is saved to disk.
Is this a good idea? I am worried about having to parse SQL every time I do something like:
database["someindex"] which will have to be translated to something like
select value from db where index = 'someindex' which in turn will have to be translated to the SQL internal language.
If you are worried about SQL overhead and only need a simple associative array, maybe a dbm relative like GDBM or Berkeley DB would be a good choice?
Check out sqlite parameters for an easy way to go variable <-> sql
SQLite should be pretty fast as a disk based associative array. Remember to use prepared statements, which parse and compile your SQL once to be invoked many times; they're also safer against SQL injection attacks. If you do that, you should get pretty good performance out of SQLite.
Another option, for a simple disk-based associative array, is, well, the filesystem; that's a pretty popular disk-based associative array. Create a directory on the filesystem, use one key per entry. If you are going to need more than a couple hundred, then create one directory per two-character prefix of the key, to keep the number of files per directory reasonably small. If your keys aren't safe as filenames, then hash them using SHA-1 or SHA-256 or something.
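A minimal sketch of the filesystem approach, assuming C++17's <filesystem> and keys that are already safe to use as file names (otherwise hash them first, as suggested above):

```cpp
#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// One file per entry; the two-character prefix keeps any one directory small.
void put(const fs::path& root, const std::string& key, const std::string& value) {
    fs::path dir = root / key.substr(0, 2);
    fs::create_directories(dir);
    std::ofstream(dir / key) << value;
}

std::string get(const fs::path& root, const std::string& key) {
    std::ifstream in(root / key.substr(0, 2) / key);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

int main() {
    put("kvstore", "someindex", "some value");
    std::cout << get("kvstore", "someindex") << "\n";
}
```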
It really depends on your actual problem. Your problem statement is very generic, and the answer depends heavily on the size of your hash table.
For a small table that you only intend to read and write once, you might actually prefer a text file (handy for debugging).
If your hash table is, say, smaller than 25 MB, SQLite will probably work well for you.