Using SQL statements to query in-memory objects - c++

Suppose I have a collection of C++ objects in memory and would like to query them using an SQL statement. I’m willing to implement some type of interface to expose the objects’ properties like columns of a database row. Is there a library available to accomplish this?
In essence, I’m trying to accomplish something like LINQ without using the .NET platform.

C++ objects are not the same thing as SQL tables.
If you want to use SQL syntax to query the objects, you will first need to map/persist them into a table structure (ORM, object-relational mapping). There are a number of fine ORM solutions out there besides LINQ.
Once you have your objects represented in SQL tables, you should look to the SQL engine to do the heavy lifting. Most SQL platforms can be configured to keep a table mostly or always in memory.
As an alternative, you might consider a system specifically designed to cache objects. On Linux, memcached is a leading choice.

There is a pitfall between the object world and entity-relational (ER) data. It's called the object-relational impedance mismatch. Basically it means you need to "map" object concepts to relational database concepts, which is called object-relational mapping (ORM).
Example for polymorphism: a derived class is not an ER concept, so you need to decide, for example, that all attributes of all objects belonging to the same class will be stored in one table together with all the "parents'" attributes,
OR that a derived class (object) will be stored in the same table as the abstract class from which it is derived.
It would probably be best to use a community-supported ORM, but when the motives are right your company can benefit from having its own ORM solution.
In our company we developed our own ORM solution (started 6 years ago, so it was written in C++ and the modeller was a Windows desktop application). But now we would probably use ADO.NET Entity Framework (LINQ's "big" brother) or another supported ORM.

I have also been searching for something of this sort, but it seems SQLite is the closest I can find.
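To make the SQLite suggestion concrete, here is a minimal sketch of the "expose properties as columns" idea from the question. It is written in Python only because the sqlite3 module ships with the standard library; the same SQL could be issued from C++ through the SQLite C API. The Employee class, the table layout and the helper names are all made up for illustration.

```python
import sqlite3
from dataclasses import dataclass, astuple

# Hypothetical domain class; its fields play the role of table columns.
@dataclass
class Employee:
    id: int
    name: str
    dept: str
    salary: float

def load_into_memory_db(employees):
    """Copy the objects into a throwaway in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee "
        "(id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?)",
        [astuple(e) for e in employees],
    )
    return conn

def query_objects(conn, by_id, where_sql, params=()):
    """Filter in SQL, then hand back the original objects by id.

    where_sql is assumed to be trusted program text; values go in params.
    """
    rows = conn.execute(
        f"SELECT id FROM employee WHERE {where_sql} ORDER BY id", params
    )
    return [by_id[row[0]] for row in rows]

staff = [
    Employee(1, "Ada", "eng", 120.0),
    Employee(2, "Bob", "ops", 90.0),
    Employee(3, "Cy", "eng", 95.0),
]
conn = load_into_memory_db(staff)
by_id = {e.id: e for e in staff}
well_paid_engineers = query_objects(
    conn, by_id, "dept = ? AND salary > ?", ("eng", 100)
)
```

Note that the table is a copy: if the objects change afterwards, the rows have to be refreshed, which is exactly the mapping cost the answers above warn about.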

Related

An alternative to hierarchical data model

Problem domain
I'm working on a rather big application, which uses a hierarchical data model. It takes images, extracts images' features and creates analysis objects on top of these. So the basic model is like Object-(1:N)-Image_features-(1:1)-Image. But the same set of images may be used to create multiple analysis objects (with different options).
Then an object and image can have a lot of other connected objects, like the analysis object can be refined with additional data or complex conclusions (solutions) can be based on the analysis object and other data.
Current solution
This is a sketch of the solution. Stacks represent sets of objects, arrows represent pointers (i.e. image features link to their images, but not vice versa). Some parts (images, image features, additional data) may be included in multiple analysis objects, because the user wants to run analyses on different sets of objects, combined differently.
Images, features, additional data and analysis objects are stored in global storage (god-object). Solutions are stored inside analysis objects by means of composition (and contain solution features in turn).
All the entities (images, image features, analysis objects, solutions, additional data) are instances of corresponding classes (like IImage, ...). Almost all the parts are optional (i.e., we may want to discard images after we have a solution).
Current solution drawbacks
1. Navigating this structure is painful when you need connections like the dotted one in the sketch. If you have to display an image with a couple of solution features on top, you first have to iterate through the analysis objects to find which of them are based on this image, and then iterate through the solutions to display them.
2. If, to solve 1, you choose to explicitly store dotted links (i.e. image class will have pointers to solution features, which are related to it), you'll put very much effort maintaining consistency of these pointers and constantly updating the links when something changes.
My idea
I'd like to build a more extensible (2) and flexible (1) data model. The first idea was to use a relational model, separating objects and their relations. And why not use an RDBMS here? SQLite seems an appropriate engine to me. Complex relations would then be accessible through simple (LEFT) JOINs on the database (pseudocode: "images JOIN images_to_image_features JOIN image_features JOIN image_features_to_objects JOIN objects JOIN solutions JOIN solution_features"), followed by fetching the actual C++ objects for the solution features from global storage by ID.
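A rough sketch of that JOIN chain, using an in-memory SQLite database from Python (chosen only because sqlite3 is in the standard library; the real application would talk to SQLite from C++). The table and column names are simplified guesses at the model in the question, and the storage dictionary is a stand-in for the global storage, holding strings instead of real objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image_feature    (id INTEGER PRIMARY KEY, image_id INTEGER);
CREATE TABLE object_feature   (object_id INTEGER, feature_id INTEGER);
CREATE TABLE solution         (id INTEGER PRIMARY KEY, object_id INTEGER);
CREATE TABLE solution_feature (id INTEGER PRIMARY KEY, solution_id INTEGER);

INSERT INTO image_feature    VALUES (10, 1), (20, 2);
INSERT INTO object_feature   VALUES (100, 10), (200, 10), (200, 20);
INSERT INTO solution         VALUES (1000, 100), (2000, 200);
INSERT INTO solution_feature VALUES (5000, 1000), (5001, 2000), (5002, 2000);
""")

# Stand-in for the global storage: solution feature id -> in-memory object.
storage = {5000: "edge", 5001: "corner", 5002: "blob"}

def solution_features_for_image(image_id):
    """Follow image -> feature -> analysis object -> solution -> solution feature."""
    rows = conn.execute(
        """
        SELECT DISTINCT sf.id
        FROM image_feature f
        JOIN object_feature link ON link.feature_id = f.id
        JOIN solution s          ON s.object_id = link.object_id
        JOIN solution_feature sf ON sf.solution_id = s.id
        WHERE f.image_id = ?
        ORDER BY sf.id
        """,
        (image_id,),
    )
    # The database only returns ids; the actual objects stay in storage.
    return [storage[row[0]] for row in rows]

features_on_image_1 = solution_features_for_image(1)
features_on_image_2 = solution_features_for_image(2)
```

The database here holds nothing but ids and relations, so the "dotted" connection becomes one query instead of nested loops over analysis objects and solutions.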
The question
So my primary question is
Is using an RDBMS an appropriate solution for the problems I described, or is it not worth it and there are better ways to organize information in my app?
If RDBMS is ok, I'd appreciate any advice on using RDBMS and relational approach to store C++ objects' relationships.
You may want to look at Semantic Web technologies, such as RDF, RDFS and OWL that provide an alternative, extensible way of modeling the world. There are some open-source triple stores available, and some of the mainstream RDBMS also have triple store capabilities.
In particular, take a look at Manchester University's Protege/OWL tutorial: http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial/
And if you decide this direction is worth looking at further, I can recommend "SEMANTIC WEB for the WORKING ONTOLOGIST"
Just based on the diagram, I would suggest that an RDBMS solution would indeed work. It has been years since I was a developer on an RDBMS (called RDM, of course!), but I was able to renew my knowledge and gain many valuable insights into data structures and layouts very similar to what you describe by reading the fabulous book "The Art of SQL" by Stephane Faroult. His book will go a long way toward answering your questions.
I've included a link to it on Amazon, to ensure accuracy: http://www.amazon.com/The-Art-SQL-Stephane-Faroult/dp/0596008945
You will not go wrong by reading it, even if in the end it does not solve your problem fully, because the author does such a great job of breaking down a relation in clear terms and presenting elegant solutions. The book is not a manual for SQL, but an in-depth analysis of how to think about data and how it interrelates. Check it out!
Using an RDBMS to track the links between data can be an efficient way to store and think about the analysis you are seeking, and the links are "soft" -- that is, they go away when the hard objects they link are deleted. This ensures data integrity, and M. Faroult can answer what to do to ensure that remains true.
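One way to get links that go away with the objects they connect is a link table with cascading foreign keys. The following is a small sketch with SQLite from Python; the table names are invented, and other engines spell some details differently. In SQLite, foreign keys are only enforced after the PRAGMA shown below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE image (id INTEGER PRIMARY KEY);
CREATE TABLE analysis_object (id INTEGER PRIMARY KEY);
CREATE TABLE object_image (
    object_id INTEGER REFERENCES analysis_object(id) ON DELETE CASCADE,
    image_id  INTEGER REFERENCES image(id) ON DELETE CASCADE
);
INSERT INTO image VALUES (1), (2);
INSERT INTO analysis_object VALUES (100);
INSERT INTO object_image VALUES (100, 1), (100, 2);
""")

# Deleting the "hard" object removes its links as a side effect.
conn.execute("DELETE FROM image WHERE id = 1")
remaining_links = conn.execute(
    "SELECT object_id, image_id FROM object_image ORDER BY image_id"
).fetchall()
```

After the delete, only the link to image 2 is left, without any bookkeeping code in the application.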
I don't recommend RDBMS based on your requirement for an extensible and flexible model.
Whenever you change your data model, you will have to change DB schema and that can involve more work than change in code.
Any problems with DB queries are discovered only at runtime. This can make a lot of difference to the cost of maintenance.
I strongly recommend using standard C++ OO programming with STL.
You can make use of encapsulation to ensure any data change is done properly, with updates to related objects and indexes.
You can use STL to build highly efficient indexes on the data
You can create facades to get you the information easily, rather than having to go to multiple objects/collections. This will be one-time work
You can make unit test cases to ensure correctness (much less complicated compared to unit testing with databases)
You can make use of polymorphism to build different kinds of objects, different types of analysis etc
All very basic points, but I reckon your effort would be best spent improving the current solution rather than looking for a DB-based solution.
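The encapsulation and index points can be sketched roughly as follows. The sketch is in Python for brevity; in C++ the two dictionaries would typically be std::unordered_map or std::map members of one class. AnalysisStore and its method names are hypothetical, not taken from the question's code base.

```python
from collections import defaultdict

class AnalysisStore:
    """Hypothetical facade that owns the links and keeps a reverse index in sync."""

    def __init__(self):
        self._features_by_image = defaultdict(set)  # image id -> solution feature ids
        self._image_by_feature = {}                 # solution feature id -> image id

    def link(self, feature_id, image_id):
        # A feature points at one image at most, so drop any older link first.
        self.unlink(feature_id)
        self._image_by_feature[feature_id] = image_id
        self._features_by_image[image_id].add(feature_id)

    def unlink(self, feature_id):
        image_id = self._image_by_feature.pop(feature_id, None)
        if image_id is not None:
            self._features_by_image[image_id].discard(feature_id)

    def remove_image(self, image_id):
        # Both directions are updated in one place, so they cannot drift apart.
        for feature_id in self._features_by_image.pop(image_id, set()):
            del self._image_by_feature[feature_id]

    def features_for(self, image_id):
        return sorted(self._features_by_image.get(image_id, ()))

store = AnalysisStore()
store.link(5000, 1)
store.link(5001, 1)
store.link(5001, 2)   # re-linking moves feature 5001 from image 1 to image 2
```

Because every change goes through the store, callers never touch the index directly, and the class is easy to cover with plain unit tests.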
http://www.boost.org/doc/libs/1_51_0/libs/multi_index/doc/index.html
"you'll put very much effort maintaining consistency of these pointers
and constantly updating the links when something changes."
With the help of Boost.MultiIndex you can create almost every kind of index on a "table". I think the quoted problem is not so serious, so the original solution is manageable.

When should I use C++ instead of SQL?

I am a C++ programmer who occasionally uses MySQL to work with databases, but my SQL knowledge is rather limited. However I am surely willing to change that.
At the moment I am trying to do analysis(!) on the data I have in my database solely with SQL queries. But I am about to give up, and instead import the data to C++ and do the analysis with C++ code.
I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly for importing (from the existing tables) and exporting (to new tables) data, and a little bit more such as merging data to - e.g. - joined tables.
Can somebody help me draw a line, so I know when to switch to C++? Of course performance is also an issue.
What are the indications that things are getting too complex in SQL? Or maybe I am just taking the wrong approach to designing the queries. Then where can I find tutorials, books, ... to take a better approach?
I hope this is not too vague. I am really a bit lost.
SQL excels at analyzing large sets of relational data.
The place to draw the line is the scale of your analysis.
If you analyze individual records one at a time, do it in your application.
If you analyze large sets of records as a unit, SQL is definitely the best tool for that job.
Row-by-row analysis is not something SQL is designed or optimized for very well. But, if you want to know something about a million-row group of data, do it in the database.
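To illustrate the difference, here is a small sketch that computes the same per-group average twice: once as a set-based query, and once by pulling every row into the application. It uses Python's built-in sqlite3 module and an invented reading table; with MySQL and C++ the shape would be the same, only the client library differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO reading VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)

# Set-based: the database aggregates and returns one row per sensor.
in_db = dict(conn.execute("SELECT sensor, AVG(value) FROM reading GROUP BY sensor"))

# Row-by-row: every row crosses into the application before being reduced.
totals, counts = {}, {}
for sensor, value in conn.execute("SELECT sensor, value FROM reading"):
    totals[sensor] = totals.get(sensor, 0.0) + value
    counts[sensor] = counts.get(sensor, 0) + 1
in_app = {sensor: totals[sensor] / counts[sensor] for sensor in totals}
```

Both give the same answer, but the first version transfers two rows while the second transfers the whole table, and that gap grows with the data.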
I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly for importing (from the existent tables) and exporting (to new tables) data, and a little bit more such as merging data to - e.g. - joined tables.
This is completely arbitrary. Learn SQL. There are a lot of resources available on the web for free.
You can do very complex analysis of data in SQL, provided you know how to use the features that SQL offers.
SQL has features for doing relational operations, like joins and projections. Also for doing set operations like union, intersection, and restriction (subset). Also for doing basic arithmetic on numbers, like the four arithmetic operators, and built in functions like SQRT. Also statistical functions like COUNT, SUM, and AVG that can be combined with projections in very interesting ways. A good DBMS will let you extend the built in functions with your own functions written in C, C++ or maybe PL/SQL.
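As a small illustration of extending the built-in functions, the sketch below registers a user-defined scalar function with SQLite and combines it with COUNT and MAX. It uses Python's sqlite3.Connection.create_function; from C or C++ the corresponding entry point is sqlite3_create_function. The function name hypot2 and the point table are made up for the example, and other DBMSs have their own extension mechanisms.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a two-argument scalar function that SQL can call as hypot2(x, y).
conn.create_function("hypot2", 2, lambda x, y: math.hypot(x, y))

conn.execute("CREATE TABLE point (label TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO point VALUES (?, ?, ?)",
    [("p", 3.0, 4.0), ("q", 6.0, 8.0), ("r", 0.0, 1.0)],
)

# The custom function is usable in a restriction...
far_points = [
    row[0]
    for row in conn.execute(
        "SELECT label FROM point WHERE hypot2(x, y) >= 5 ORDER BY label"
    )
]
# ...and inside the built-in aggregates.
count, longest = conn.execute(
    "SELECT COUNT(*), MAX(hypot2(x, y)) FROM point"
).fetchone()
```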
The power you get from these features depends on how well designed the database is. A well designed database conforms to the relational model, and should be relevant to your intended use of the data.
SQL code can be stored in the database in stored procedures. It can be stored in SQL script files. And, as you already know, it can be embedded in application programs. In addition to SQL, you can use OLAP tools and report generators to do standard things with the data very easily.
The people who advise you to keep all of your processing in C++ sound like they have learned just enough to use a database like a big and stupid file system. A good DBMS is much more than that.
SQL is usually very efficient handling its own database (depends on the server implementation).
You should use queries to analyze the database.
The main reason for that would be the communication overhead.
Even if the server is on the local machine (remote servers would have obvious communication overhead), you'll still have to retrieve the stored information from the SQL server to your c++ program for analysis.
Now if you have tens of thousands of rows in the database, the SQL server would have to read them all and send them to your program, which would probably create a local copy of the data for you to work on.
If you let the SQL server do it with queries, you gain the complex optimizations it applies according to the kind of query you're executing, and in the end you retrieve only a limited amount of data (the part you actually need) over the connection.
You made the right decision to begin data analysis with SQL. Now that you feel your knowledge of SQL limits you, you have two choices: give up and switch back to a familiar but not very efficient toolset (C++), or bring your SQL skills up to the level you need.
It's possible that at some point SQL will become too complex too, but then C++ won't be the answer either - most likely some specialized tools.
In my opinion you should only perform analysis in C++ if the database server provides no equivalent for the analysis function. Database servers are very smart, and it is hard, almost impossible, to beat the algorithmic efficiency of their analysis functions. Bringing raw data into the application for analysis also adds a lot of overhead.
If at some point plain SQL becomes overly complex, the native PL (procedural language) of the server could be a good choice.
I agree with JNK and Jochai, but disagree with Ascanio.
It's better to improve the knowledge in database systems.
SQL comes with it.
So, this is something I've been thinking about and it seems to me that SQL, as just a platform/language for storing/manipulating data, should have no inherent advantage over a C++ or C library. It seems to me that theoretically you could build a C++ library just as efficient, if not more efficient, than SQL at doing this. In doing so, you would be able to build it from the ground up, in terms of how ints, chars, strings, and other data types are stored, and make it easier to interface with your particular application (like web development). You could even make it so that the queries could be done in a language like JavaScript (allowing web developers to focus on just learning one language really well).

Django, polymorphism and N+1 queries problem

I'm writing an app in Django where I'd like to make use of implicit inheritance when using ForeignKeys. As far as I'm concerned the only way to handle this nicely is to use the django_polymorphic library (no single table inheritance in Django, WHY OH WHY??).
I'd like to know about the performance implications of this solution. What kind of joins are performed when doing polymorphic queries? Does it have to hit the database multiple times as compared to regular queries (the infamous N+1 queries problem)? The docs warn that "the type of queries that are performed aren't handled efficiently by the modern RDBMs"? However it doesn't really tell what those queries are. Any statistics, experiences would be really helpful.
EDIT:
Is there any way of retrieving a list of objects, each being an instance of its actual class with a constant number of queries ?? I thought this is what the aforementioned library does, however now I got confused and I'm not that certain anymore.
Django-Typed-Models is an alternative to Django-Polymorphic which takes a simple & clean approach to solving the single table inheritance issue. It works off a 'type' attribute which is added to your model. When you save it, the class is persisted into the 'type' attribute. At query time, the attribute is used to set the class of the resulting object.
It does what you expect query-wise (every object returned from a queryset is the downcasted class) without needing special syntax or the scary volume of code associated with Django-Polymorphic. And no extra database queries.
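The mechanism described above (store the class in a 'type' column, use it at query time to pick the class) can be illustrated without Django at all. The sketch below uses plain sqlite3 and made-up Animal/Dog/Cat classes; it shows the idea only and is not the library's actual implementation.

```python
import sqlite3

class Animal:
    def __init__(self, name):
        self.name = name

class Dog(Animal):
    def speak(self):
        return "woof"

class Cat(Animal):
    def speak(self):
        return "meow"

# Maps the stored type name back to a class when rows are loaded.
REGISTRY = {cls.__name__: cls for cls in (Animal, Dog, Cat)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal (id INTEGER PRIMARY KEY, type TEXT, name TEXT)")

def save(obj):
    # Persist the concrete class name next to the data.
    conn.execute(
        "INSERT INTO animal (type, name) VALUES (?, ?)",
        (type(obj).__name__, obj.name),
    )

def load_all():
    # One query for every subclass; the type column selects the class per row.
    rows = conn.execute("SELECT type, name FROM animal ORDER BY id")
    return [REGISTRY[type_name](name) for type_name, name in rows]

save(Dog("Rex"))
save(Cat("Tom"))
pets = load_all()
```

All subclasses share one table here, which is why a single SELECT is enough regardless of how many subclasses exist.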
In Django, inherited models are internally represented through a OneToOneField. If you use select_related() in a query, Django follows the one-to-one relation forwards and backwards and includes the referenced table with a join, so you don't need to hit the database twice.
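At the SQL level, the difference between the N+1 pattern and the join that select_related() is described as producing looks roughly like this. The sketch uses plain sqlite3 rather than Django, a hand-written parent/child table pair that mimics the one-to-one layout, and a simple counter in place of real query logging; the table names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE restaurant (
    place_id INTEGER PRIMARY KEY REFERENCES place(id),
    cuisine  TEXT
);
INSERT INTO place VALUES (1, 'Luigi'), (2, 'Sakura'), (3, 'Town Hall');
INSERT INTO restaurant VALUES (1, 'italian'), (2, 'japanese');
""")

query_count = 0

def run(sql, params=()):
    """Execute one statement and count it."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

def n_plus_one():
    # One query for the parents, then one more per parent row.
    result = []
    for place_id, name in run("SELECT id, name FROM place ORDER BY id"):
        child = run("SELECT cuisine FROM restaurant WHERE place_id = ?", (place_id,))
        result.append((name, child[0][0] if child else None))
    return result

def single_join():
    # Parent and child rows arrive together in a single query.
    return run(
        "SELECT p.name, r.cuisine "
        "FROM place p LEFT JOIN restaurant r ON r.place_id = p.id "
        "ORDER BY p.id"
    )

query_count = 0
looped = n_plus_one()
looped_queries = query_count

query_count = 0
joined = single_join()
joined_queries = query_count
```

With three parent rows the loop issues four queries and the join issues one; the results are identical.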
OK, I've dug a little bit further and found this nice passage:
https://github.com/bconstantin/django_polymorphic/blob/master/DOCS.rst#performance-considerations
So happily this library does something reasonably sane. That's good to know.

Trying to choose SQL API library

I am just beginning to learn how to write software that accesses an SQL server. It seems that each server implementation (Postgres, MySQL, etc.) offers API libraries for various languages (my code is in C and C++, though solutions for Java and Python would also interest me). I'm a little wary of depending on these libraries, however, because I'd prefer a vendor-neutral solution.
As near as I can tell, Microsoft's ODBC API was meant to solve such problems for C/C++ (and JDBC for Java); unixODBC seems to be one popular implementation. Am I right even so far?
Moreover, do any such libraries provide an object-oriented interface? It would be nice to not simply embed SQL queries into another, more featureful language; I'd like to have a wrapper that mimics the style of the rest of the language, too.
So is there a preferred solution along those lines? Am I asking for something weird?
As near as I can tell, Microsoft's ODBC API was meant to solve such problems for C/C++ (and JDBC for Java); unixODBC seems to be one popular implementation. Am I right even so far?
Yes. The equivalent of ODBC or JDBC for Python is called the DB-API. Perl's equivalent is called DBI.
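For a feel of what the Python DB-API (PEP 249) looks like, here is a small sketch. The helper only uses calls that the specification defines (cursor(), execute(), fetchall(), close()), so it should work with any conforming driver; sqlite3 is used for the demonstration because it is in the standard library. The person table is invented. One caveat: the placeholder style for parameters (for example ? versus %s) still varies between drivers.

```python
import sqlite3

def fetch_names(connection):
    """Works with any DB-API 2.0 connection that has a person(name) table."""
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT name FROM person ORDER BY name")
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()

# Only this part is driver-specific: how the connection is opened.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT)")
conn.executemany("INSERT INTO person VALUES (?)", [("Grace",), ("Alan",)])
names = fetch_names(conn)
```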
Moreover, do any such libraries provide an object-oriented interface? It would be nice to not simply embed SQL queries into another, more featureful language; I'd like to have a wrapper that mimics the style of the rest of the language, too.
Yeah, there are a bunch of things like this for different languages. C# has LINQ, Smalltalk has Roe and GLORP, Python has SQLAlchemy and SQLObject (and Django in Python has quite a bit of query power built into its ORM (see Simon Willison's notes)), Ruby has ActiveRecord, and so on. I don't know what you'd use in C++ but I bet it has to use a lot of ugly template hacking to approach these.
All these choices might seem overwhelming, but chances are your choice of language will be shaped by something other than the convenience of working with relational data. (If not, you should consider Prolog.) That will probably tie you more or less to some ORM you hate just like the rest of us.
Indeed, ODBC/JDBC are libraries that help make the calling interface standard between vendors, but you're right that each respective RDBMS has its own flavor of SQL. ODBC/JDBC doesn't help abstract the SQL syntax.
One solution to move literal SQL out of your application code is to implement queries in stored procedures that reside in each database back-end, and then use ODBC/JDBC to call the stored procedures. You can define stored procedures with similar names and calling interface for each flavor of RDBMS you use. But be aware that the stored procedure language is also variable from one vendor to the next.
Another solution is to use an "object-relational mapping" technology such as Hibernate for Java, or NHibernate for .NET. These technologies can make it feel more "object-oriented" to work with databases, and free you from writing literal SQL in many cases.
But most ORM tools tend to focus on very simple queries. If your query is at all complex (using a GROUP BY or a JOIN for instance), using the ORM tool is harder than using literal SQL.
See also "Good ORM for C++ solutions?"
If SQL troubles you that much, you're probably not going to be happy using an RDBMS at all. Some programmers don't see the value to the Rules of Normalization, for instance. If that's true for you, you might want to look into the emerging technologies for non-relational data stores, including:
BerkeleyDB
Project Voldemort
CouchDB
ODBC/JDBC attempt to abstract away the database interface to provide a consistent programming model. Bear in mind that, by using such a least-common-denominator interface, you cannot take advantage of specific, non-standard features that a given DB may offer.
To get an object oriented interface to your data model, look into Object Relational Mapping (ORM) solutions such as Hibernate. ORM solutions map your objects to their representation in a relational database, generally making data persistence much simpler from an application programming perspective.
Quince is a C++ library that lets you use C++ syntax and C++ types with the feature set of SQL. Currently it supports PostgreSQL and sqlite only, but new backends can always be added. See quince-lib.com. (Full disclosure: I wrote it.)
Take a look at Qt. It is not a library, but a complete framework. It has a very excellent SQL module.
Qt SQL is an essential module which provides support for SQL
databases. Qt SQL's APIs are divided into different layers:
Driver layer
SQL API layer
User interface layer
http://doc.qt.io/qt-5/qtsql-index.html

Dynamic SQL vs Static SQL

In our current codebase we are using MFC db classes to connect to DB2. This is all old code that has been passed onto us by another development team, so we know some of the history but not all.
Most of the code abstracts away the creation of SQL queries through functions such as Update() and Insert() that prepend something like "INSERT INTO SCHEMA.TABLE" onto a string that you supply. This is done through the recordset classes that sit on top of the database class.
The other way to do the SQL queries is to execute them directly on the database class using dbclass.ExecuteSQL(String).
We are wondering what the pros and cons of each approach are. From our perspective it's much easier to do the ExecuteSQL() call, as we don't have to write another class etc., but there must be good reasons to do it the other way. We are just not sure what they are.
Any help would be great!
Thanks Mark
Update----
I think I may have misunderstood Dynamic and Static SQL. I think our code always uses Dynamic, so my question really becomes: should I construct the SQL strings myself and do an ExecuteSQL(), or should this be abstracted away in a class for each table in the database, as the recordset classes from MFC seem to do?
The ATL OLE DB consumer database classes are absolutely the way to go. Beyond the risks of injection (mentioned by Skurmedel), piles of string-concatenated queries will become impossible to maintain very quickly.
While the ATL classes can be initially tedious, they provide the benefit of strong-typed and named columns, result navigation, connection and session management, etc.
I would try to abstract it away if there are many SQL statements. Managing dozens of different SQL queries quickly becomes tedious. Also it's easier to validate input that way.
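The injection and maintenance points can be shown with a small sketch. It is written in Python with sqlite3 rather than MFC or ATL, so none of the names below correspond to those libraries; CustomerTable is a hypothetical per-table wrapper, and find_by_name_unsafe is the string-concatenation style shown only for contrast.

```python
import sqlite3

class CustomerTable:
    """Hypothetical per-table wrapper: the SQL lives here, callers pass values only."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, name TEXT)"
        )

    def insert(self, name):
        cur = self._conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))
        return cur.lastrowid

    def find_by_name(self, name):
        # The value is bound as a parameter and never parsed as SQL.
        rows = self._conn.execute(
            "SELECT id, name FROM customer WHERE name = ?", (name,)
        )
        return rows.fetchall()

def find_by_name_unsafe(conn, name):
    # Anti-pattern: the value becomes part of the SQL text.
    return conn.execute(
        f"SELECT id, name FROM customer WHERE name = '{name}'"
    ).fetchall()

conn = sqlite3.connect(":memory:")
customers = CustomerTable(conn)
customers.insert("alice")
customers.insert("bob")

hostile = "x' OR '1'='1"
safe_rows = customers.find_by_name(hostile)
unsafe_rows = find_by_name_unsafe(conn, hostile)
```

The wrapped, parameterized lookup finds nothing for the hostile input, while the concatenated version returns every row in the table.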