Is RocksDB a good choice for storing homogeneous objects? - c++

I'm looking for an embeddable data storage engine for C++, and RocksDB, a key-value store, is one candidate.
My data is very homogeneous. I have a modest number of types (on the order of 20), and I store many instances (on the order of 1 million) of those types.
I imagine that the homogeneity of my data makes RocksDB a poor choice. If I serialise each object individually, surely I'm duplicating the schema metadata? And surely that will result in poor performance?
So my question: Is RocksDB a good choice for storing homogeneous objects? If so, how does one avoid the performance implications of duplicating schema metadata?

Unlike, e.g., SQLite, there is no schema metadata in RocksDB, because there is no schema: it maps a binary key to a binary value. RocksDB has no serialization built into it. If you are storing objects, you will have to serialize them yourself and use, e.g., the key, a key prefix, or column families (roughly: lightweight database tables) to distinguish the types.
Typically you would use RocksDB to build some kind of custom database. Someone built, e.g., a cache for protobuf objects on top of it (ProfaneDB). For many applications I would say it is too low-level, but if you don't need structured data and queries, it works fine, is very fast, and is generally pleasant to work with (the code is readable, and is sometimes the best documentation, because you will be dealing with database internals).
I have used a varint key prefix in a small toy application before, which costs just one byte per key for up to 127 types, but column families are probably preferable for a production application. They also have only constant overhead, and can be individually tuned, added, dropped, and managed. I wouldn't forsake the additional features you get from them just to save a few bytes. That is also roughly representative of the level at which you will be solving problems if you go with RocksDB.
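For a concrete picture, here is a minimal sketch of the column-family approach using RocksDB's C++ API (the path and type name are made up, and error handling is reduced to asserts):

    #include <rocksdb/db.h>
    #include <cassert>
    #include <string>

    int main() {
        rocksdb::Options options;
        options.create_if_missing = true;

        rocksdb::DB* db;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/objstore", &db);
        assert(s.ok());

        // One column family per object type; each can be tuned,
        // compacted, and dropped independently.
        rocksdb::ColumnFamilyHandle* players;
        s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(), "Player", &players);
        assert(s.ok());

        // The value is whatever bytes your serializer of choice produces.
        s = db->Put(rocksdb::WriteOptions(), players, "player:42", "<serialized bytes>");
        assert(s.ok());

        std::string value;
        s = db->Get(rocksdb::ReadOptions(), players, "player:42", &value);
        assert(s.ok());

        db->DestroyColumnFamilyHandle(players);
        delete db;
    }

Note that creating a column family is a one-time operation; on later runs you list the existing families in the DB::Open call instead.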

As I understand it, RocksDB is really a key-value store and not a database at all.
This means you only get the facility to store binary key and value data. Unlike a normal database (e.g. MySQL, SQLite) you don't get tables where you can define columns, types, etc.
Therefore it is your program that determines how the data is stored.
One possibility is to store your data as JSON values, in which case, as you say, you pay the cost of storing the "schema" (i.e. the JSON field names) in every value.
Another choice might be to have a special key (for example, one called SCHEMA) that contains an Avro schema of all your object types. Your app can read this on startup, initialise the readers/writers, and then it knows how to process each key+value stored in RocksDB.
Yet another choice is to hard-code the logic in your app. You could use any number of libraries for this, including Avro (as mentioned above) or MsgPack and its variants. In this case you do need to be careful if you intend to read RocksDB data written by a previous version of the app after making schema changes. So maybe store a version number or something similar in the DB.
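As a sketch of the version-number idea (RocksDB's C++ API again; the reserved key name is just an example):

    #include <rocksdb/db.h>
    #include <string>

    // Reserved key holding the schema version the data was written with.
    const std::string kVersionKey = "__schema_version__";

    bool CheckSchemaVersion(rocksdb::DB* db, const std::string& expected) {
        std::string stored;
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(), kVersionKey, &stored);
        if (s.IsNotFound()) {
            // Fresh database: stamp it with the current version.
            return db->Put(rocksdb::WriteOptions(), kVersionKey, expected).ok();
        }
        return s.ok() && stored == expected;  // on a mismatch, run a migration
    }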

Related

Is there a way to create a database with a hash table that features constant time lookup?

I'm trying to create a database in my program with a hash table for constant-time lookups. Right now I have the hash table coded, and I have a few stored values in the table (I used an array of structures). But I want to enable the user of my code to manually input a new value and permanently store it in the table. I figured I might need to use a database, as I don't think implementing a text file would allow the constant-time look-ups the hash table provides. I also don't know how to implement an array of structures in a text file, if that would be the better option. Any help?
EDIT: I didn't make this clear enough, but is it possible for me to make a hash table and have the values I input in the hash table permanently stored in the table for constant-time look-up? Or do I have to manually code in everything?
There are many third party libraries you can use for this. They are mostly C libraries, which can be used in C++.
If you are on a typical Linux platform, you probably already have gdbm installed, so you might as well just use that.
Other options include LMDB, qdbm, and BerkeleyDB, to name just a very few.
edit: Oops, don't know how I forgot LevelDB, from the big G.
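To give an idea of what these look like, here is a minimal sketch against the gdbm C API (usable from C++ with a recent gdbm; error handling omitted):

    #include <gdbm.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
        // Open (or create) the on-disk hash table.
        GDBM_FILE db = gdbm_open("store.gdbm", 0, GDBM_WRCREAT, 0644, nullptr);

        datum key = { (char*)"answer", (int)std::strlen("answer") };
        datum val = { (char*)"42", (int)std::strlen("42") };
        gdbm_store(db, key, val, GDBM_REPLACE);  // persists across runs

        datum out = gdbm_fetch(db, key);         // hashed, constant-time lookup
        if (out.dptr != nullptr)
            std::free(out.dptr);                 // the caller owns fetched data

        gdbm_close(db);
    }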

Opinions on my data storage problem (database/homebrew solution)

I have very simply structured data which is currently stored in a home-brew file format, but I am wondering whether we should migrate to something more modern. The data is simply a table of doubles, indexed by a double column. The things I need to perform are:
Iterating through the table.
Insertion and deletion of arbitrary records.
Selecting a given number of rows before and after a given key value (where the key might not be in the database).
The requirements are:
The storage must be file-based without a server.
It should not be necessary to read the whole file into memory.
The resulting file should be portable between different architectures (with respect to endianness, etc.).
Must be a very stable project (the data is highly critical).
Must run on Solaris/SPARC and preferably also on Linux/x64.
Access times should be as fast as possible.
Must be available as a C++ library. Bonus points for Fortran and Python bindings :)
A higher-precision number representation than double precision would be a bonus.
Relatively compact storage size would also be a bonus.
From my limited experience, SQLite would be an interesting choice, or perhaps MySQL in a non-server mode if SQLite is not fast enough. But perhaps a full-fledged SQL database is overkill?
What do you suggest?
SQLite meets nearly all of your requirements, and it's not that hard to use. Give it a try!
It's file-based, and the entire database is a single file.
It does not need to read the entire file into memory. Database size might be limited; you should check here if the limits will be a problem in your situation.
The format is cross-platform:
SQLite databases are portable across 32-bit and 64-bit machines and between big-endian and little-endian architectures.
It's been around for a long time and is used in many places, and is generally considered mature and stable.
It's very portable and runs on Solaris/SPARC and Linux/x64.
It's faster than MySQL (take the comparison behind that link with a grain of salt, though) and other such database servers, because only one client needs to be taken into account.
There is a C++ API and a Python binding and a Fortran wrapper.
There is no arbitrary-precision column type, but NUMERIC will be silently converted to text if it cannot be exactly represented:
For conversions between TEXT and REAL storage classes, SQLite considers the conversion to be lossless and reversible if the first 15 significant decimal digits of the number are preserved. If the lossless conversion of TEXT to INTEGER or REAL is not possible then the value is stored using the TEXT storage class.
As for compact storage of the database, I'm not sure. But I've never heard any claims that SQLite is particularly wasteful.
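To make the "rows around a key" requirement concrete, here is a hedged sketch against SQLite's C API, using a prepared statement (the table and column names are invented):

    #include <sqlite3.h>
    #include <cstdio>

    int main() {
        sqlite3* db;
        sqlite3_open("data.db", &db);
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS samples (k REAL PRIMARY KEY, v REAL)",
            nullptr, nullptr, nullptr);

        // N rows at or after a given key; the mirrored query
        // "k < ?1 ORDER BY k DESC LIMIT ?2" gives the rows before it.
        sqlite3_stmt* stmt;
        sqlite3_prepare_v2(db,
            "SELECT k, v FROM samples WHERE k >= ?1 ORDER BY k LIMIT ?2",
            -1, &stmt, nullptr);
        sqlite3_bind_double(stmt, 1, 3.14);
        sqlite3_bind_int(stmt, 2, 5);

        while (sqlite3_step(stmt) == SQLITE_ROW)
            std::printf("%f -> %f\n", sqlite3_column_double(stmt, 0),
                        sqlite3_column_double(stmt, 1));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
    }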

How can I create an index for fast access on a clojure hash table?

I wish to store many records in a Clojure hash table. If I wish to get fast access to certain records using a certain field, or via a range query, what options do I have, without having to resort to storing the data in a database (which is where the data came from in the first place)?
I guess I'm also wondering whether an STM is the right place for a large indexed data set as well.
Depending how far you want to push this, you're asking to build an in-memory database. I assume you don't actually want to do that; otherwise you would presumably just use one of the many in-memory Java databases that already exist (Derby, H2, etc.).
If you want indexed or range access to multiple attributes of your data, then you need to create all of those indexes in Clojure data structures. Clojure maps will give you O(log32 n) time access to data (worse than constant, but still very tightly bounded). If you need better than that, you can use Java maps like HashMap or ConcurrentHashMap directly, with the caveat that you're outside the Clojure data model. For range access, you'll want some sort of sorted tree data structure... Java has ConcurrentSkipListMap, which is pretty great for what it does. If that's not good enough, you might need your own B-tree implementation.
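Sketching the sorted-index idea outside Clojure for a moment, in C++ (std::map is the same kind of balanced ordered tree, so the translation is direct):

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // Secondary index: field value -> record id, kept in a sorted tree.
        std::map<double, std::string> index = {
            {1.0, "rec-a"}, {2.5, "rec-b"}, {4.0, "rec-c"}, {7.5, "rec-d"}};

        // Range query over [2.0, 5.0) in O(log n) plus output size.
        auto first = index.lower_bound(2.0);
        auto last  = index.lower_bound(5.0);
        for (auto it = first; it != last; ++it)
            std::cout << it->first << " -> " << it->second << "\n";
    }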
If you're not changing this data, then Clojure's STM is immaterial. Is this data treated as a cache of a subset of the database? If so, you might consider using a cache library like Ehcache instead (they've recently added support for very large off-heap caches and search capabilities).
Balancing data between in-memory cache and persistent store is a tricky business and one of the most important things to get right in data-heavy apps.
You'll probably want to create separate indexes for each field using a sorted-map so that you can do range queries. Under the hood this uses something like a persistent version of a Java TreeMap.
STM shouldn't be an issue if you are mostly interested in read access. In fact it might even prove better than mutable tables since:
Reads don't require any locking
You can make a consistent snapshot of data and indexes at the same time.

Binary parser or serialization?

I want to store a graph of different objects for a game; their classes may or may not be related, and they may or may not contain vectors of simple structures.
I want the parsing operation to be fast, and the data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean making objects serialize themselves. That is efficient, but I would need to write a different serialization method for each kind of object.
By binary parsing/composing I mean creating a new tree of parsers/composers that holds and reads data for these objects, and passing this around to have my objects push/pull their data.
I could also use JSON, but it can be pretty slow to read, and it is not very size-efficient when it comes to big sets of matrices and numbers.
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one vote for binary. Personally, I'd go with text for everything except images (and other data that is "naturally" binary). Then, store everything in a big zip file (I can think of several games that do this, or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
Check out Protocol Buffers from Google or Thrift from Apache. Although billed as ways to write wire protocols easily, they are basically object serialization mechanisms that can generate bindings in a dozen languages, have an efficient binary representation, support easy versioning, perform well, and are well supported.
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.
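For reference, a minimal sketch of what Boost.Serialization looks like in use (the GameObject type is invented for the example):

    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/serialization/vector.hpp>
    #include <fstream>
    #include <vector>

    struct GameObject {
        int id = 0;
        std::vector<double> transform;

        // The same member handles both saving and loading.
        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/) {
            ar & id;
            ar & transform;
        }
    };

    int main() {
        {
            const GameObject obj{42, {1.0, 0.0, 0.0, 1.0}};
            std::ofstream ofs("save.bin", std::ios::binary);
            boost::archive::binary_oarchive oa(ofs);
            oa << obj;
        }
        GameObject loaded;
        std::ifstream ifs("save.bin", std::ios::binary);
        boost::archive::binary_iarchive ia(ifs);
        ia >> loaded;
    }

The otherwise-unused version parameter is where Boost's versioning support (BOOST_CLASS_VERSION) hooks in, which speaks to the backward-compatibility requirement in the question.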

SQLite for a disk-backed associative array?

I want to use SQLite as an associative array that is saved to disk.
Is this a good idea? I am worried about having to parse SQL every time I do something like:
database["someindex"], which will have to be translated to something like
select value from db where index = 'someindex', which in turn will have to be translated into SQLite's internal representation.
If you are worried about SQL overhead and only need a simple associative array, maybe a dbm relative like GDBM or Berkeley DB would be a good choice?
Check out SQLite parameters for an easy way to move values between program variables and SQL.
SQLite should be pretty fast as a disk based associative array. Remember to use prepared statements, which parse and compile your SQL once to be invoked many times; they're also safer against SQL injection attacks. If you do that, you should get pretty good performance out of SQLite.
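A sketch of the prepare-once, bind-many pattern this describes (SQLite C API; assume stmt was prepared earlier from "SELECT value FROM kv WHERE key = ?1"):

    #include <sqlite3.h>
    #include <string>

    // The statement is parsed and compiled once; each lookup only
    // binds a new key, steps, and resets.
    std::string lookup(sqlite3_stmt* stmt, const std::string& key) {
        sqlite3_bind_text(stmt, 1, key.c_str(), -1, SQLITE_TRANSIENT);
        std::string result;
        if (sqlite3_step(stmt) == SQLITE_ROW)
            result = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
        sqlite3_reset(stmt);  // ready for the next key
        return result;
    }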
Another option for a simple disk-based associative array is, well, the filesystem; it is a pretty popular disk-based associative array, after all. Create a directory on the filesystem and use one file per entry. If you are going to need more than a couple hundred entries, create one directory per two-character prefix of the key, to keep the number of files per directory reasonably small. If your keys aren't safe as filenames, hash them using SHA-1 or SHA-256 or something.
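A C++17 sketch of that layout (assuming keys are filename-safe and at least two characters long):

    #include <filesystem>
    #include <fstream>
    #include <iterator>
    #include <string>

    namespace fs = std::filesystem;

    // One file per key, sharded into directories by two-character prefix.
    void put(const fs::path& root, const std::string& key, const std::string& value) {
        fs::path dir = root / key.substr(0, 2);
        fs::create_directories(dir);
        std::ofstream(dir / key, std::ios::binary) << value;
    }

    std::string get(const fs::path& root, const std::string& key) {
        std::ifstream in(root / key.substr(0, 2) / key, std::ios::binary);
        return {std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>()};
    }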
It really depends on your actual problem. Your problem statement is very generic, and the right choice depends heavily on the size of your hashtable.
For small hashtables that you only intend to read and write once, you might actually prefer a text file (handy for debugging).
If your hashtable is, say, smaller than 25 MB, SQLite will probably work well for you.