Fast embedded database - C++

I am working on an application which will need to store metadata associated with music files (artist, title, play count, etc.), as well as sets of integers (in particular, SHA-1 hashes).
The solution I pick needs to:
Provide "fast" storage & retrieval (when viewing a list of potentially thousands of songs I need to be able to retrieve metadata more or less interactively).
Be cross-platform (to Linux, Windows and OSX).
Provide an interface I can interact with from C++.
Be open-source (or, at the very least, be free as in beer).
Provide fast set operations (union, intersection, difference) - if the solution doesn't provide this, but it will allow me to store binary data, I could implement this myself using a technique like "Fast Set Operations Using Treaps" (see the sketch after this list for a simpler sorted-vector baseline).
Be "embedded" - that is, operate without me having to fork another process, or at least provide an easy interface to do so (like libmysqld).
Solutions I have considered include:
Flat files. This is extremely simple, but doesn't provide any features besides flat data storage.
SQLite. This seems to be a very popular option, but it seems to have some issues regarding performance and concurrency (see KDE's Akonadi for some example issues).
Embedded MySQL/MariaDB. This seems to be a reasonable option, but it also might be a bit heavyweight considering I won't be needing a lot of complicated SQL features.
A hypothetical solution I'm thinking would be perfect would be something like Redis, but which persists data to the disk, and only stores some portion of the data in memory to make retrieval fast. Redis itself might not be a good option because 1) I would need to fork it manually, 2) its Windows port seems less than rock-solid, and 3) storing all of my data in RAM would be less than ideal.
Are there any other solutions for this type of problem, or is one of the solutions I have already listed far better than the others?

In the end, I've decided to use SQLite for metadata. It seems to be as fast as, if not faster than, e.g. libmysqld, and it has a really simple, clean C interface. According to benchmarks, it should be way more than fast enough to suit my needs.
For larger data structures, I'm planning on just storing them in separate binary files (the SQLite website says it can store binary data, but that if your data size exceeds a certain amount, it is faster to store it in flat files instead - see this page).
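For illustration, a minimal sketch of that C interface, following the flat-file advice below by storing only a path to any large binary data; the songs table, its columns, and the file layout are hypothetical:

```cpp
// Minimal SQLite C API usage: open a database, create a metadata table,
// and insert one row. Table and column names are hypothetical.
#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("library.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS songs ("
        "  id INTEGER PRIMARY KEY,"
        "  artist TEXT, title TEXT, play_count INTEGER,"
        "  blob_path TEXT)",  // large binary data lives in a flat file
        nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO songs (artist, title, play_count, blob_path) "
        "VALUES (?, ?, ?, ?)", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "Some Artist", -1, SQLITE_STATIC);
    sqlite3_bind_text(stmt, 2, "Some Title", -1, SQLITE_STATIC);
    sqlite3_bind_int(stmt, 3, 0);
    sqlite3_bind_text(stmt, 4, "blobs/song0001.bin", -1, SQLITE_STATIC);
    if (sqlite3_step(stmt) != SQLITE_DONE)
        std::fprintf(stderr, "insert failed\n");
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```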

Don't store your binary files as BLOBs inside SQLite, unless you want an elephant-sized database. Just store a string with the file's path on the file system. The only downside of SQLite is that it does not allow remote (web) access, but you can embed it inside a small TCP/HTTP server.

Related

Is there a persistence layer for list/queue containers?

Is there some kind of persistence layer that can be used for a regularly modified list/queue container that stores strings?
The data in the list is just strings, nothing fancy. It could be useful, though, to store a key or hash with each string for definite references, so I thought I'd wrap each string in a struct with an extra key field.
The state should be persisted on each modification, more or less, since spontaneous power-offs might happen.
I looked into Boost.Serialization and it seems easy to use, but I guess I'd have to rewrite the whole queue every time it gets modified and close the file to be safe against power-offs, as I see no journaling option there.
I saw SQLite, but it could be over the top as I don't need relations or any sophisticated queries.
And I don't want to reinvent the wheel by doing it manually in some files.
Is there anything available worth looking into?
I have little experience with C++ and the OS beneath it, so I'm unaware of what's available and what's suitable, and I couldn't find anything better myself.
A potentially simpler alternative to a relational database, when you don't need the relations, is a "NoSQL" database. A document-oriented database might be a reasonable choice based on the description.

Django BinaryField - Justification for statement in documentation

In the Django 1.10 documentation for the BinaryField field type, they give a warning about its use:
Abusing BinaryField
Although you might think about storing files in the database, consider that it is bad design in 99% of the cases. This field is not a replacement for proper static files handling.
It does not continue with any justification for this claim. Are there any generalized indicators for what falls in the 99% "bad design" or 1% "not bad design" cases? Does this ring particularly true with Django because it has great static files support?
I consider this premature optimization at best and cargo cult programming at worst.
While it is true that relational database systems aren't optimized for storing large fields (whether binary or text) and some of them treat them specially or at least have some restrictions on their use, most of them handle at least moderately sized binary values (let's say up to a few hundred megabytes) quite well. Storing pictures or PDFs in the database will be less efficient than storing them in the file system, but for 99% of all applications it will be efficient enough.
On the other hand, if you store these files in the file system, you lose several advantages:
Updates will be outside of transactions, so you can't be sure that an update to the file (in the filesystem) and the metadata (in the database) will be atomic.
You lose referential integrity: Your database may refer to files which have been deleted or renamed.
You have two different places where you store your data. This complicates access, backups, etc.
I would try to store together all data that logically belongs together. Usually that means storing everything in the database. If this is not technically possible (e.g. because your files are too big - most RDBMSs have a size limit on blobs) or because tests show that it is too slow or otherwise inconvenient, you can always optimize later.
Django models are an abstraction over relational databases. These excel at storing small amounts of data with a well-defined format and relationships. They are optimised for fixed-length rows and low memory usage.
Is your data fixed-length, smaller than 4 KB, and not meant to be served by a webserver? Then you are probably in the 1%.

Where to store SQL code for a C++ application?

We have a C++ application that utilizes some basic APIs to send raw queries to a MS SQL Server. Scattered through the various translation units in our program, we have simple 1-2 line queries as C++ strings, and every now and then you'll see more complex queries that can be over 20 lines.
I can't help but think that the larger queries, specifically the 20+ line ones, should not be embedded in C++ code as constant strings. I want to propose pulling these out into separate text files that are loaded on-demand by the C++ application, however I'm not sure if this is the best approach.
What design choices are typical for situations like this? I definitely feel there needs to be improvement, I just don't know if moving the SQL queries out into data files (text files) is the best idea.
You could make a DAL (Data Access Layer).
It would be the API that the rest of the program talks to. Then you can mess around and try anything and everything (Stored procedures, caching, etc.) without disturbing the main program.
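A sketch of what such a layer's boundary might look like; the Customer record and CustomerDal interface are invented names, and a real DAL would of course grow error handling and more operations:

```cpp
// Sketch of a Data Access Layer boundary: the rest of the program codes
// against this interface, so the SQL (or stored procedures, or a cache)
// behind it can change freely. All names here are hypothetical.
#include <optional>
#include <string>
#include <vector>

struct Customer {
    int         id;
    std::string name;
};

class CustomerDal {
public:
    virtual ~CustomerDal() = default;
    virtual std::optional<Customer> findById(int id) = 0;
    virtual std::vector<Customer>   findByRegion(const std::string& region) = 0;
    virtual void                    save(const Customer& c) = 0;
};
// A SqlServerCustomerDal implementation would own the raw queries; a
// CachingCustomerDal could wrap it without the callers noticing.
```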
Move them into their own files, or even into their own stored procedures. Queries embedded in the application cannot be changed without a recompile, and depending on your release procedures, that could severely impair your ability to respond to emergencies or deploy hot fixes. You could alter your app to cache the file contents, if you go down that road, and even periodically check the files for updates.
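A minimal sketch of the load-and-cache idea, under the assumption of one query per .sql file named after the query; the QueryStore class and file layout are hypothetical, and periodic re-checking for updates is left out:

```cpp
// Sketch: load named queries from .sql files at runtime and cache them.
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

class QueryStore {
public:
    explicit QueryStore(std::string dir) : dir_(std::move(dir)) {}

    // Returns the text of <dir>/<name>.sql, reading it on first use.
    const std::string& get(const std::string& name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;

        std::ifstream in(dir_ + "/" + name + ".sql");
        if (!in) throw std::runtime_error("missing query file: " + name);
        std::ostringstream text;
        text << in.rdbuf();
        return cache_.emplace(name, text.str()).first->second;
    }

private:
    std::string dir_;
    std::unordered_map<std::string, std::string> cache_;
};
// Usage: QueryStore queries("sql"); const auto& q = queries.get("top_customers");
```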
the best "design choice" - for many different reasons - is to use MSSQL stored procedures whenever/wherever possible.
I've seen code that segregates SQL queries into a common module, but I don't think there's much benefit to a common "queries module" (or a standalone text file) over having the SQL queries spelled out as string literals in the module that's calling them.
Stored procedures, on the other hand, increase modularity, enhance security, and can vastly improve performance.
IMHO...
I would leave the SQL embedded in the C++ functions that use it: it will be easier to read and understand what the code does.
If you have SQL queries scattered around your code I'd say that there is some problem with the overall structure of the classes you are using: you should have some (or even just one) 'low level' classes that handle the interaction with the database, and the rest of the code uses these classes.
I personally don't like using stored procedures: if you have to support a different database server, the porting will be a pain; I never saw that much of a performance improvement; and to understand what the code does, you have to jump back and forth between the stored procedures and the C++.
It really depends, here are some notes:
1) If all your SQL code resides in the application, then your application is pretty much self-contained in terms of logic. This is good, as you have done in the current application. In terms of speed, this can be a little slower, as the SQL will need to be parsed when you run these queries (it also depends on whether you used prepared statements, etc., which can speed things up).
2) The second approach is to put all SQL logic in stored procedures on the server. This is very much the preferred approach, even for small SQL queries, whether one line or not. You just build a DAL layer. In terms of performance this is very good; however, the logic then lives in two different systems: your C++ app and the SQL server. You will quite likely need to build a small utility application that can translate the stored procedures' inputs and outputs into template code (be it C++ or any other language) to make your life easier.
3) A mixed approach with the above two. I would not recommend this route.
You need to think about how these queries are likely to change over time, and compare it to how the related C++ code is likely to change. If the queries are relatively independent of the code, and have a higher likelihood of change, then I would either load them at runtime from separate files, or use stored procedures instead. That approach allows for changing the queries without recompiling the C++ code. On the other hand, if the queries are highly coupled to the C++ code, making a change in one likely to accompany a change in the other, I would keep the queries in the code. This approach makes a change more localized and less error prone.

C++ Boost.serialization vs simple load/save

I am a computational scientist who works with large amounts of simulation data, and often I find myself saving/loading data to/from disk. For simple tasks, like a vector, this is usually as simple as dumping a bunch of numbers into a file, and that's it.
For more complex stuff, like objects and such, I have save/load member functions. Now, I'm not a computer scientist, and thus I often see terminology here on SO that I just do not understand (but I'd love to). One of the things I've come across recently is the subject of serialization and the Boost.Serialization library.
From what I understand, serialization is simply the process of converting your objects into something that can be saved to/loaded from disk or transmitted over a network and such. Considering that at most I need to save/load my objects to/from disk, is there any reason I should switch from my simple load/save functions to Boost.Serialization? What would Boost.Serialization give me beyond what I'm already doing?
That library takes into account many details that may not be very apparent from a purely 'applicative' point of view.
For instance: data portability with respect to big/little numeric endianness, pointed-to data lifetimes, structured containers, versioning, non-intrusive extensions, and more. Moreover, it handles interaction with other std or Boost infrastructure the right way, and dictates a way of structuring code that will reward you with easier maintenance. You will find ready-to-use serializers for many (all?) std and Boost containers.
And consider that if you need to share your data with someone else, chances are that referring to a published, maintained, and debugged schema will make things much easier.
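A minimal sketch of what using the library looks like in the intrusive style; the Snapshot type is invented for illustration:

```cpp
// Minimal Boost.Serialization sketch: an intrusive serialize() member
// plus text archives for save/load. The Snapshot type is hypothetical.
#include <fstream>
#include <string>
#include <vector>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/vector.hpp>

struct Snapshot {
    std::string         label;
    std::vector<double> samples;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & label;    // the same function handles both saving and loading
        ar & samples;  // std containers serialize via the headers above
    }
};

void save(const Snapshot& s, const std::string& path) {
    std::ofstream out(path);
    boost::archive::text_oarchive oa(out);
    oa << s;
}

Snapshot load(const std::string& path) {
    std::ifstream in(path);
    boost::archive::text_iarchive ia(in);
    Snapshot s;
    ia >> s;
    return s;
}
```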

Even lighter than SQLite

I've been looking for a C++ SQL library implementation that is simple to hook in like SQLite, but faster and smaller. My projects are in games development and there's definitely a cutoff point between needing to pass the ACID test and wanting some extreme performance. I'm willing to move away from SQL string style queries, allowing it to be code driven, but I haven't found anything out there that provides SQL-like flexibility while also preferring performance over the ACID test.
I don't want to go re-inventing the wheel, and the idea of implementing an SQL library on my own is quite daunting, even if it's only going to be a simple subset of all the calls you could make.
I need the basic commands (SELECT, MODIFY, DELETE, INSERT, with JOIN, and WHERE), not data operations (like sorting, min, max, count) and don't need the database to be atomic, or even enforce consistency (I can use a real SQL service while I'm testing and debugging).
Are you sure that you have obtained the maximum speed available from SQLite? Out of the box, SQLite is extremely safe, but quite slow. If you know what you are doing, and are willing to risk database corruption on a disk crash, then there are several optimizations you can make that provide spectacular speed improvements.
In particular:
Switch off synchronization
Group writes into transactions
Index tables
Use database in memory
If you have not explored all of these, then you are likely running many times slower than you could be; a rough sketch of all four follows.
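A rough sketch of those four tunings through the SQLite C API; the items table and the loop body are placeholders:

```cpp
// Sketch of the four tunings above via the SQLite C API. Trade-off:
// synchronous = OFF risks corruption on power loss or OS crash.
#include <sqlite3.h>

void tune_and_bulk_insert(sqlite3* db) {
    // 1) Switch off synchronization (fast, but unsafe on crash).
    sqlite3_exec(db, "PRAGMA synchronous = OFF", nullptr, nullptr, nullptr);

    // 3) Index the columns you filter on (table/column names hypothetical).
    sqlite3_exec(db, "CREATE INDEX IF NOT EXISTS idx_items_key ON items(key)",
                 nullptr, nullptr, nullptr);

    // 2) Group writes into one transaction instead of one fsync per row.
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    for (int i = 0; i < 100000; ++i) {
        // ... run prepared INSERT statements here ...
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
}

// 4) For a purely in-memory database, open ":memory:" instead of a file:
//    sqlite3* mem = nullptr; sqlite3_open(":memory:", &mem);
```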
I'm not sure you'll manage to find anything with better performance than SQLite, especially if you want operations like JOINs. Is SQLite's speed really a problem? For simple queries it's usually faster than any full DBMS.
Could it be that you have an indexing problem?
As for size, it's not even 1 MB extra in the binary, so I'm a bit surprised it's a problem.
You could look at Berkeley DB, which is probably the fastest embedded database available, but it's mostly just a key->value database.
If you really need higher speed, consider loading the whole database into memory (using SQLite again).
Take a look at gigabase and its twin fastdb.
You might want to consider Embedded InnoDB. It offers the basic SQL functionality (obviously; see MySQL) but doesn't offer the actual SQL syntax (as that's part of MySQL, not InnoDB). At 838 KB, it's not too heavy.
If you just need those basic operations, you don't really need SQL. Take a look at NoSQL data storage, for example Tokyo Cabinet.
You can try LevelDB; it's a key/value store: http://code.google.com/p/leveldb
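A minimal sketch of basic LevelDB usage, with a hypothetical database path and key:

```cpp
// Sketch of basic LevelDB usage: open a store, put a value, read it back.
#include <cassert>
#include <string>
#include <leveldb/db.h>

int main() {
    leveldb::Options options;
    options.create_if_missing = true;

    leveldb::DB* db = nullptr;
    leveldb::Status status = leveldb::DB::Open(options, "testdb", &db);
    assert(status.ok());

    // sync = true forces the write to stable storage before returning.
    leveldb::WriteOptions wo;
    wo.sync = true;
    db->Put(wo, "song:0001:title", "Some Title");

    std::string value;
    status = db->Get(leveldb::ReadOptions(), "song:0001:title", &value);
    assert(status.ok() && value == "Some Title");

    delete db;
    return 0;
}
```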