Is there a persistency layer for list/queue containers? - c++

Is there some kind of persistency layer that can be used for a regularly-modified list/queue container that stores strings?
The data in the list is just strings, nothing fancy. It could be useful, though, to store a key or hash with each string for definite references, so I thought I'd wrap each string in a struct with an extra key field.
The persistency should be saved on each modification, more or less, as spontaneous power offs might happen.
I looked into Boost::Serialisation and it seems easy to use, but I guess I'd have to write the whole queue everytime it gets modified to close the file and be safe for power offs, as I see no journaling option there.
I saw SQLite, but it could be over the top as I don't need relations or any sophisticated queries.
And I don't want to reinvent the wheel by doing it manually in some files.
Is there anything available worth looking into?
I have few experience with C++ and an OS beneath, so I'm unaware of what's available and what's suitable. And couldn't find any better.

Potentially simpler alternative to relational databases, when you don't need those relations, are "nosql" databases. A document oriented database might be a reasonable choice based on the description.

Related

Fast embedded database

I am working on an application which will need to store metadata associated with music files (artist, title, play count, etc.), as well as sets of integers (in particular, SHA-1 hashes).
The solution I pick needs to:
Provide "fast" storage & retrieval (when viewing a list of potentially thousands of songs I need to be able to retrieve metadata more or less interactively).
Be cross-platform (to Linux, Windows and OSX).
Provide an interface I can interact with from C++.
Be open-source (or, at the very least, be free as in beer).
Provide fast set operations (union, intersection, difference) - if the solution doesn't provide this, but it will allow me to store binary data, I could implement this myself using a technique like "Fast Set Operations Using Treaps".
Be "embedded" - that is, operate without me having to fork another process, or at least provide an easy interface to do so (like libmysqld).
Solutions I have considered include:
Flat files. This is extremely simple, but doesn't provide any features besides flat data storage.
SQlite. This seems to be a very popular option, but it seems to have some issues regarding performance and concurrency (see KDE's Akonadi, for some example issues).
Embedded MySQL/MariaDB. This seems to be a reasonable option, but it also might be a bit heavyweight considering I won't be needing a lot of complicated SQL features.
A hypothetical solution I'm thinking would be perfect would be something like Redis, but which persists data to the disk, and only stores some portion of the data in memory to make retrieval fast. Redis itself might not be a good option because 1) I would need to fork it manually, 2) its Windows port seems less than rock-solid, and 3) storing all of my data in RAM would be less than ideal.
Are there any other solutions for this type of problem, or is one of the solutions I have already listed far better than the others?
In the end, I've decided to use SQlite for metadata. It seems to be as fast if not faster than e.g. libmysqld, and it has a really simple clean C interface. According to benchmarks, it should be way more than fast enough to suit my needs.
For larger data structures, I'm planning on just storing them in separate binary files (the SQlite website says it can store binary data, but that if your data size exceeds a certain amount it is faster to store it in flat files instead - see this page).
Don't store you binary files BLOBS inside SQLite, unless you want an elephant size database. Just store a string with the path file name on the file system. The only downside of SQLite is that it does not allow remote (web) access, but you can embedded it inside a small TCP/HTTP server.

An alternative to hierarchical data model

Problem domain
I'm working on a rather big application, which uses a hierarchical data model. It takes images, extracts images' features and creates analysis objects on top of these. So the basic model is like Object-(1:N)-Image_features-(1:1)-Image. But the same set of images may be used to create multiple analysis objects (with different options).
Then an object and image can have a lot of other connected objects, like the analysis object can be refined with additional data or complex conclusions (solutions) can be based on the analysis object and other data.
Current solution
This is a sketch of the solution. Stacks represent sets of objects, arrows represent pointers (i.e. image features link to their images, but not vice versa). Some parts: images, image features, additional data, may be included in multiple analysis objects (because user wants to make analysis on different sets of object, combined differently).
Images, features, additional data and analysis objects are stored in global storage (god-object). Solutions are stored inside analysis objects by means of composition (and contain solution features in turn).
All the entities (images, image features, analysis objects, solutions, additional data) are instances of corresponding classes (like IImage, ...). Almost all the parts are optional (i.e., we may want to discard images after we have a solution).
Current solution drawbacks
Navigating this structure is painful, when you need connections like the dotted one in the sketch. If you have to display an image with a couple of solutions features on top, you first have to iterate through analysis objects to find which of them are based on this image, and then iterate through the solutions to display them.
If to solve 1. you choose to explicitly store dotted links (i.e. image class will have pointers to solution features, which are related to it), you'll put very much effort maintaining consistency of these pointers and constantly updating the links when something changes.
My idea
I'd like to build a more extensible (2) and flexible (1) data model. The first idea was to use a relational model, separating objects and their relations. And why not use RDBMS here - sqlite seems an appropriate engine to me. So complex relations will be accessible by simple (left)JOIN's on the database: pseudocode "images JOIN images_to_image_features JOIN image_features JOIN image_features_to_objects JOIN objects JOIN solutions JOIN solution_features") and then fetching actual C++ objects for solution features from global storage by ID.
The question
So my primary question is
Is using RDBMS an appropriate solution for problems I described, or it's not worth it and there are better ways to organize information in my app?
If RDBMS is ok, I'd appreciate any advice on using RDBMS and relational approach to store C++ objects' relationships.
You may want to look at Semantic Web technologies, such as RDF, RDFS and OWL that provide an alternative, extensible way of modeling the world. There are some open-source triple stores available, and some of the mainstream RDBMS also have triple store capabilities.
In particular take a look at Manchester Universities Protege/OWL tutorial: http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial/
And if you decide this direction is worth looking at further, I can recommend "SEMANTIC WEB for the WORKING ONTOLOGIST"
Just based on the diagram, I would suggest that an RDBMS solution would indeed work. It has been years since I was a developer on an RDMS (called RDM, of course!), but I was able to renew my knowledge and gain very many valuable insights into data structure and layout very similar to what you describe by reading the fabulous book "The Art of SQL" by Stephane Faroult. His book will go a long way to answer your questions.
I've included a link to it on Amazon, to ensure accuracy: http://www.amazon.com/The-Art-SQL-Stephane-Faroult/dp/0596008945
You will not go wrong by reading it, even if in the end it does not solve your problem fully, because the author does such a great job of breaking down a relation in clear terms and presenting elegant solutions. The book is not a manual for SQL, but an in-depth analysis of how to think about data and how it interrelates. Check it out!
Using an RDBMS to track the links between data can be an efficient way to store and think about the analysis you are seeking, and the links are "soft" -- that is, they go away when the hard objects they link are deleted. This ensures data integrity; and Mssr Fauroult can answer what to do to ensure that remains true.
I don't recommend RDBMS based on your requirement for an extensible and flexible model.
Whenever you change your data model, you will have to change DB schema and that can involve more work than change in code.
Any problems with DB queries are discovered only at runtime. This can make a lot of difference to the cost of maintenance.
I strongly recommend using standard C++ OO programming with STL.
You can make use of encapsulation to ensure any data change is done properly, with updates to related objects and indexes.
You can use STL to build highly efficient indexes on the data
You can create facades to get you the information easily, rather than having to go to multiple objects/collections. This will be one-time work
You can make unit test cases to ensure correctness (much less complicated compared to unit testing with databases)
You can make use of polymorphism to build different kinds of objects, different types of analysis etc
All very basic points, but I reckon your effort would be best utilized if you improve the current solution rather than by look for a DB based solution.
http://www.boost.org/doc/libs/1_51_0/libs/multi_index/doc/index.html
"you'll put very much effort maintaining consistency of these pointers
and constantly updating the links when something changes."
With the help of Boost.MultiIndex you can create almost every kind of index on a "table". I think the quoted problem is not so serious, so the original solution is manageable.

which diagram to use in NoSql (Mcd, Merise, UML)

again, sorry for my silly question, but it seems that what i've learned from Relation Database should be "erased", there is no joins, so how the hell will i draw use Merise and UML in NoSql?
http://en.wikipedia.org/wiki/Class_diagram
this one will not work for NoSql?
How you organize your project is an independent notion of the technology used for persistence; In particular; UML or ERD or any such tool doesn't particularly apply to relational databases any more than it does to document databases.
The idea that NoSQL has "No Joins" is both silly and unhelpful. It's totally correct that (most) document databases do not provide a join operator; but that just means that when you do need a join, you do it in the application code instead of the query language; The basic facts of organizing your project stay the same.
Another difference is that document databases make expressing some things easier, and other things harder. For example, it's often easier to an entity relationship constraint in a relational database, but it's easier to express an inheritance heirarchy in a document database. Both notions can be supported by both technologies, and you will certainly use them when your application needs them; regardless of the technology you end up using.
In short, you should design your application without first choosing a persistence technology. Once you've got a good idea what you want to persist, You may have a better idea of which technology is a better fit. It may be the case that what you really need is both, or you might need something totally different from either.
EDIT: The idea of a foreign key is no more magical than simply saying "This is a name of that kind of thing". It happens that many SQL databases provide some very terse and useful features for dealing with this sort of thing; specifically, constraints (This column references this other relation; so it cannot take a value unless there is a corresponding row in the referant), and cascades, (If the status of the referent changes, make the corresponding change to the reference). This does make it easy to keep data consistent even at the lowest level, since there's no way to tell the database to enter a state where the referent is missing.
The important thing to distinguish though, is that the idea of giving a database entity (A row in a relational database, document in document databases) is distinct from the notion of schema constraints. One of the nice things about document databases is that they can easily combine or reorient where data lives so that you don't always have to have a referant that actually exists; Most document databases use the document class as part of the key, and so you can still be sure that the key is meaningful, even if when the referent doesn't actually exist.
But most of the time, you actually do want the referent to exist; you don't want a blog post to have an author unless that author actually exists in your system. Just how well this is supported depends a lot on the particular document database; Some databases do provide triggers, or other tools, to enforce the integrity of the reference, but many others only provide some kind of transactional capability, which requires that the integrity be enforced in the application.
The point is; for most kinds of database, every value in the database has some kind of identifier; in a relational database, that's a triple of relation:column:key; and in a document database it's usually something like the pair document_class:path. When you need one entity to refer to another, you use whatever sort of key you need to identify that datum for that kind of database. Foreign Key constraints found in RDBMses are just (really useful) syntactic sugar for "if not referant exists then raise ForeignKeyError", which could be implemented with equal power in other ways, if that's helpful for your particular use.

Why are relational databases needed?

Specifically thinking of web apps,
(1) why are relationships(ie:foreign keys) in RDBMS even useful?
The web apps I write have logic built-in that validates user input against required fields. I see no real use for foreign keys and thus no real use for relational databases.
Besides, if I were to put all the required field validation logic in the RDBMS(ie:MySQL) it would simply return a vague error. At least with PHP-based validation I know which field is missing and I can notify the user(though with Javascript-based validation this would almost NEVER happen anyway).
(2) Was there a point in the past where RDBMS were useful for some reason or is there a reason they are useful now that I'm not aware of?
I really need some insight on this topic. I'm simply can't come up with a good answer.
I will come at this from a different angle.
I work at a place where we had a database that had no foreign key constraints, default values, or other data checks whatsoever in their initial records database. The lead engineer's excuse for this was something similar to what you have described above. "The application will ensure the referential integrity".
The problem is, we did not have a standard data layer (like an object relational mapping) over the top of the database. We had multiple programmatic sources that fed into the same tables. It was funny because after a while, you could tell which parts of the code created which rows in the table. Sometimes the links lined up, sometimes they didn't. Sometimes the links were NULL (when they shouldn't be), and sometimes they were 0. We even had a few cyclic records which was fun.
My point is, you never know when you are going to need to write a quick script to batch import records, or write a new subsystem that references the same tables. It behooves us as programmers to program as defensively as possible. We can't assume that those who come after us will know as much (if anything) about how our schema should be used.
I'm not much of an SQL lover, but even I must say that the relational structure has its advantages.
It doesn't only allow validation. By providing the database with metadata describing the relations between the actual pieces information stored, a great number of optimizations are possible.
This makes it possible to quickly retrieve large, complex datasets. It also reduces the number of queries needed to make modifications and keep the data coherent, since most of the "book-keeping" is carried out automatically on the DB side of the connection.
One incredibly useful feature of foreign keys in most relational databases are cascades.
Suppose you have a families table and a persons table. Each family can have multiple people, but a person can only belong in one family (one-to-many relationship). If you have foreign keys and you delete a family row, the database can automatically update all the related people, either by deleting them or setting their foreign keys to null.
If you do not have this constraint, you must handle this situation yourself, in your own code.
RDBMSs are still very useful. Not sure why you wouldn't think so. Foreign key constraints can be used to maintain referential integrity (in other words, to provide a simple way to express 1:1, 1:many and many:many relationships. RDBMSs are also useful because there was a rich theory accompanying practical developments, unlike previous DBMSs. In particular, relational calculus/algebra are nice since they allow for good query optimization, normalization, etc.
Not sure if that really answers your question. Wikipedia might list some advantages of RDBMSs.
(1) why are relationships(ie:foreign keys) in RDBMS even useful?
First off, I think you are talking about foreign key CONSTRAINTS. Foreign keys are just a logical design feature that says that this entity matches up with that one.
The reason foreign key constraints are useful are:
They help you adhere to the DRY (Don't repeat yourself) principle. Sure your app validates the relationship, but does it do it in several places? Are there multiple apps that access the same DB? Do you have to repeat the logic in each app? Hey, you could pull that logic out and use a common DLL for access to that data that enforces that logic.Better yet, what if that was built into the RDMBS so I didn't have to write custom code to do something so routine? Bam. Foreign key constraints.
If your app enforces the foreign key validations, how do you force users who are working directly in the DB to honor your rules? I know, I know. You shouldn't let users into the back-end directly, but you just try telling that to the data analysts when they have a project for corporate and you are the bottleneck.
As to the vague error. Wouldn't your argument be better stated as RDBMS X has vague errors when data fails foreign key constraint checks? The way you have generalized it, you could also argue that we should use paper ledgers instead of computers because the constraint had a vague error.
(2) Was there a point in the past where RDBMS were useful for some reason or is there a reason they are useful now that I'm not aware of?
Yeah, that would be now, yesterday and probably long into the future.
I could go on forever about the reasons, but here is the big one...
It provides a common structured file format that is easy to extend, leverage by other applications. You may be too young to remember when every dang system had it's own proprietary structured file format, but it sucked. Plus, it forced you re-invent the wheel constantly in terms of things like indexing, a query language, locking, etc.
"I see no real use for foreign keys and thus no real use for
relational databases"
Judging by this remark, you seem to be underestimating what a relational database is for. Foreign key constraints aren't a defining feature of relational databases and certainly aren't the only reason for using such databases. The relational database model is a powerful and effective way to represent data and it remains so even if you decide you don't want to implement a foreign key constraint. I will therefore assume the question you really meant to ask is: Why are foreign keys useful in relational databases?
A foreign key constraint is just one kind of data integrity constraint. You can of course implement integrity rules outside the database but the DBMS is designed and optimised to do the job for you and is generally the most efficient place to do it because it is closest to the data structures. If you did it outside the database then you would have at least an extra round trip to retrieve the necessary data. You would also have to replicate the DBMS's locking/concurrency model in your application code.
The database optimiser can take advantage of constraints in the database to improve the performance of queries. It can't do that if the rules only exist in your application code.
If you have many applications sharing the same database then implementing data integrity rules in every application is impractical and expensive to maintain. Centralising the constraint logic makes more sense.
Various CASE tools and DBA tools will take advantage of database constraints, can reverse engineer them and use them to assist development and maintenance tasks.
In practice the meaning and function of a database constraint versus some procedural code that validates data only on entry is very different. If X is implemented in a database constraint then I know it is valid for every piece of data in the database. If X is implemented in the application when data is entered then I only know it applies to future data - I can't be sure it applies to everything already in the database (maybe X was only implemented today and didn't apply to the data entered yesterday).
Because they maintain the integrity of the database. If you have all your business logic in the application then in theory they are not needed, but are still useful as a safeguard against bad data.

Even lighter than SQLite

I've been looking for a C++ SQL library implementation that is simple to hook in like SQLite, but faster and smaller. My projects are in games development and there's definitely a cutoff point between needing to pass the ACID test and wanting some extreme performance. I'm willing to move away from SQL string style queries, allowing it to be code driven, but I haven't found anything out there that provides SQL-like flexibility while also preferring performance over the ACID test.
I don't want to go re-inventing the wheel, and the idea of implementing an SQL library on my own is quite daunting, even if it's only going to be a simple subset of all the calls you could make.
I need the basic commands (SELECT, MODIFY, DELETE, INSERT, with JOIN, and WHERE), not data operations (like sorting, min, max, count) and don't need the database to be atomic, or even enforce consistency (I can use a real SQL service while I'm testing and debugging).
Are you sure that you have obtained the maximum speed available from SQLITE? Out of the box, SQLITE is extremely safe, but quite slow. If you know what you are doing, and are willing to risk db corruption on a disk crash, then there are several optimizations you can do that provide spectacular speed improvements.
In particular:
Switch off synchronization
Group writes into transactions
Index tables
Use database in memory
If you have not explored all of these, then you are likely running many times slower than you might.
I'm not sure you'll manage to find anything with better performances than SQL. Especially if you want operations like JOINs... Is SQLite speed really a problem? For simple requests it's usually faster than any full SGDB.
Don't you have an index problem?
About size, it's not event 1Meg extra in the binary file, so I'm a bit suprised it's a problem.
You can look at Berkeley DB which has to be probably the fastest DB available, but it's mostly only key->value database.
If you really need higher speed consider loading the whole database in memory (using SQLite again).
Take a look at gigabase and its twin fastdb.
You might want to consider Embedded innoDB. It offers the basic SQL functionality (obviously, see MySQL) but doesn't offer the actual SQL syntax (as that's part of MySQL, not innoDB). At 838KB, it's not too heavy.
If you just need those basic operations, you don't really need SQL. Take a look at NoSQL data storage, for example Tokyo Cabinet.
you can try leveldb, it's key/value store
http://code.google.com/p/leveldb