Correct implementation of the Filter (Criteria) Design Pattern - c++

The design pattern is explained here:
I'm working on software very similar to Adobe Lightroom or ACDSee but with a different purpose. The user (a photographer) can import thousands of images from their hard drive (it wouldn't be unusual to have over 100k or 200k images).
We have a side panel where users can create custom "filters" which are expressions like:
Does contain the keyword: "car"
Does not contain the keyword "woods"
Camera model is "Nikon D300s"
Camera model is "Canon 7D Mark II"
Directory is "C:\today_pictures"
You can get the idea from the above example.
We have a SQLite database where all image information is stored. The question is: should we load ALL Photo objects into memory from the database when the program first starts and implement the Criteria/Filter design pattern as explained on the website cited above, so that our Criteria classes filter objects in memory? Or is it better to have the Criteria classes generate an SQL query that is then executed to retrieve only what's needed from the database?
We are developing the program with C++ (Qt).

TL;DR: It's already properly implemented in SQLite3, and look at how long that took. You'll face the same burden.
It'd be a horrible case of data duplication to read the data from the database and store it again in another data structure. Use database queries to implement the query that the user gave you. Let the database execute the query. That's what databases are for.
By reimplementing a search/query system for ~500k records, you'll be rewriting large chunks of a bog-standard database yourself. It'd be a mostly pointless exercise. SQLite3 is very well tested and is essentially foolproof. It'll cost you thousands of hours of work to reimplement even a small fraction of its capabilities and reliability/resiliency. If that doesn't scream "reinventing the wheel", I don't know what does.
The database also allows you to very easily implement lookahead/dropdowns to aid the user in writing the query. For example, as the user types "camera model is", you can offer autocompletion or a dropdown to select one or more models from.
You've paid the "price" of a database; it'd be a shame for it all to go to waste. So, use it. It'll give you lots of leverage, and allow you to implement features two orders of magnitude faster than otherwise.
The pattern you've linked to is just a pattern. It isn't an exact blueprint of how to design your application to perform on real data. You'll eventually be fighting things such as concurrency (a file scanning thread running to update the metadata), indexing, resiliency in the face of crashes, etc. In the end you'll end up with big chunks of SQLite reimplemented for your particular application. 500k metadata records are not much: if you design your query translator well and support it with proper indexes, it'll work perfectly well.
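As a rough illustration of the "query translator" idea, here is a minimal sketch in which each criteria object emits a parameterized SQL fragment instead of filtering objects in memory. It is written in Python with the standard sqlite3 module for brevity; the same structure should carry over to Qt's QSqlQuery with bound values. The table and column names (photo, photo_keyword, camera_model) and the class names are assumptions made up for this example, not the asker's schema.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class HasKeyword:
    """Matches photos tagged with a keyword (or without it, if negate is set)."""
    keyword: str
    negate: bool = False

    def to_sql(self):
        op = "NOT EXISTS" if self.negate else "EXISTS"
        sql = (f"{op} (SELECT 1 FROM photo_keyword pk "
               "WHERE pk.photo_id = photo.id AND pk.keyword = ?)")
        return sql, [self.keyword]


@dataclass
class CameraIs:
    model: str

    def to_sql(self):
        return "photo.camera_model = ?", [self.model]


@dataclass
class AllOf:
    """AND-combination of other criteria."""
    parts: list

    def to_sql(self):
        clauses, params = [], []
        for part in self.parts:
            sql, args = part.to_sql()
            clauses.append(f"({sql})")
            params.extend(args)
        # An empty combination matches everything.
        return (" AND ".join(clauses) or "1"), params


def find_photos(conn, criteria):
    # Only SQL text built by the criteria classes is interpolated;
    # user-supplied values always travel as bound parameters.
    where, params = criteria.to_sql()
    query = f"SELECT path FROM photo WHERE {where} ORDER BY path"
    return [row[0] for row in conn.execute(query, params)]


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photo (id INTEGER PRIMARY KEY, path TEXT, camera_model TEXT);
    CREATE TABLE photo_keyword (photo_id INTEGER, keyword TEXT);
    CREATE INDEX idx_keyword ON photo_keyword (keyword, photo_id);
    INSERT INTO photo VALUES (1, 'a.jpg', 'Nikon D300s'),
                             (2, 'b.jpg', 'Nikon D300s'),
                             (3, 'c.jpg', 'Canon 7D Mark II');
    INSERT INTO photo_keyword VALUES (1, 'car'), (2, 'car'), (2, 'woods'), (3, 'car');
""")

criteria = AllOf([HasKeyword("car"),
                  HasKeyword("woods", negate=True),
                  CameraIs("Nikon D300s")])
matches = find_photos(conn, criteria)
print(matches)
```

The criteria classes keep the shape of the Filter pattern, but the filtering itself is left to the database and its indexes.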


Django/Sqlite Improve Database performance

We are developing an online school diary application using django. The prototype is ready and the project will go live next year with about 500 students.
Initially we used sqlite and hoped that for the initial implementation this would perform well enough.
The data tables are such that to obtain details of a school day (periods, classes, teachers, classrooms), many tables are used, and the database access takes 67ms on a reasonably fast PC.
Most of the data is static once the year starts, with perhaps minor changes to classrooms. I thought of extracting the timetable for each student for each term day so no table joins would be needed. I put this data into a text file for one student; the file is 100K in size. The time taken to read this data and process it for a day's timetable is about 8ms. If I pre-load the data on login and store it in sessions, it takes 7ms at login and 2ms for each query.
With 500 students what would be the impact on the web server using this approach and what other options are there (putting the student text files into a sort of memory cache rather than session for example?)
There will not be a great deal of data entry, students adding notes, teachers likewise, so it will mostly be checking the timetable status and looking to see what events exist for that day or week.
What is your expected response time, and what is your expected number of requests per minute? One twentieth of a second for the database access (which is likely to be the slow part) for a request doesn't sound like a problem to me. SQLite should perform fine in a read-mostly situation like this. So I'm not convinced you even have a performance problem.
If you want faster response you could consider:
First, ensuring that you have the best response time by checking your indexes and profiling individual retrievals to look for performance bottlenecks.
Pre-computing the static parts of the system and storing the HTML. You can put the HTML right back into the database or store it as disk files.
Using the database as a backing store only (to preserve state of the system when the server is down) and reading the entire thing into in-memory structures at system start-up. This eliminates disk access for the data, although it limits you to one physical server.
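For the first suggestion (checking indexes), one way to see whether a retrieval uses an index is SQLite's EXPLAIN QUERY PLAN. The sketch below uses Python's standard sqlite3 module; the lesson table, its columns and the index name are hypothetical stand-ins for the real timetable schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical timetable table, for illustration only.
conn.execute(
    "CREATE TABLE lesson (student_id INTEGER, day TEXT, period INTEGER, room TEXT)"
)

QUERY = "SELECT period, room FROM lesson WHERE student_id = ? AND day = ?"


def query_plan(conn, sql, params):
    """Return SQLite's plan description for the query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn, QUERY, (1, "2024-09-02"))
conn.execute("CREATE INDEX idx_lesson_student_day ON lesson (student_id, day)")
plan_after = query_plan(conn, QUERY, (1, "2024-09-02"))

print(plan_before)  # describes a full scan of the table
print(plan_after)   # describes a search that names idx_lesson_student_day
```

The exact wording of the plan text differs between SQLite versions, so it is best read by a person rather than parsed.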
This sounds like premature optimization. 67ms is scarcely longer than the ~50ms where we humans can observe that there was a delay.
SQLite's representation of your data is going to be more efficient than a text format, and unlike a text file that you have to parse, the operating system can efficiently cache just the portions of your database that you're actually using in RAM.
You can lock down ~50MB of RAM to cache a parsed representation of the data for all the students, but you'll probably get better performance using that RAM for something else, like the OS disk cache.
I agree with some of the other answers which suggest using MySQL or PostgreSQL instead of SQLite. SQLite is not designed to be used as a production db. It is great for storing data for one-user applications such as mobile apps or even a desktop application, but it falls short very quickly in server applications. With Django it is trivial to switch to any other full-fledged database backend.
If you switch to one of those, you should not really have any performance issues, especially if you will do all the necessary joins using select_related and prefetch_related.
If you still need more performance, considering that "most of the data is static", you might want to convert the Django site into a static site (a collection of html files) and then serve those using nginx or something similar. The simplest way I can think of doing that is to write a cron job which loops over all the needed url-configs, requests each page from Django and then saves it as an html file. If you want to go in that direction, you might also want to take a look at Python's static site generators: Hyde and Pelican.
This approach will certainly work much faster than any caching system; however, you will lose any dynamic components of the site. If you need them, then caching seems like the best and fastest solution.
You should use MySQL or PostgreSQL for your production database. sqlite3 isn't a good idea.
You should also avoid pre-loading data on login. Since your records can be inserted in advance, write Django management commands and run the import to your chosen database beforehand, and design your models such that when a user logs in, the user is already able to access and view/edit his or her related data (which is pre-inserted before the application even goes live). Hardcoding data operations at login does not smell right at all from an application design point of view.
The benefit of designing your Django models and using custom management commands to insert the records before your application goes live is that you can use the Django ORM to make the appropriate relationships between users and their records.
I suspect, based on your description of what you need above, that you need to reconsider how you are approaching this application.
With 500 students, we shouldn't even be talking about caching. If you want response speed, you should deal with the following issues in order of priority:
Use a production quality database
Design your application use case correctly and design your application model right
Pre-load any data you need to the production database
front end optimization comes first (css/js compression etc)
use django debug toolbar to figure out if any of your sql is slow and optimize specifically those
implement caching (memcached etc) as needed
As a general guideline.

SQL Query minimizing/caching in a C++ application

I'm writing a project in C++/Qt and it is able to connect to any type of SQL database supported by QtSQL. This includes local servers and external ones.
However, when the database in question is external, the speed of the queries starts to become a problem (slow UI, ...). The reason: Every object that is stored in the database is lazy-loaded and as such will issue a query every time an attribute is needed. On average about 20 of these objects are to be displayed on screen, each of them showing about 5 attributes. This means that for every screen that I show about 100 queries get executed. The queries execute quite fast on the database server itself, but the overhead of the actual query running over the network is considerable (measured in seconds for an entire screen).
I've been thinking about a few ways to solve the issue, the most important approaches seem to be (according to me):
(1) Make fewer queries
(2) Make queries faster
Tackling (1)
I could find some sort of way to delay the actual fetching of the attribute (start a transaction), and then when the programmer writes endTransaction() the database tries to fetch everything in one go (with SQL UNION or a loop...). This would probably require quite a bit of modification to the way the lazy objects work but if people comment that it is a decent solution I think it could be worked out elegantly. If this solution speeds up everything enough then an elaborate caching scheme might not even be necessary, saving a lot of headaches
I could try pre-loading attribute data by fetching it all in one query for all the objects that are requested, effectively making them non-lazy. Of course in that case I will have to worry about stale data. How would I detect stale data without sending at least one query to the external db? (Note: sending a query to check for stale data on every attribute check would give a best-case 0x performance increase and a worst-case 2x performance decrease when the data is actually found to be stale.)
Tackling (2)
Queries could for example be made faster by keeping a local synchronized copy of the database running. However I don't really have a lot of possibilities on the client machines to run for example exactly the same database type as the one on the server. So the local copy would for example be an SQLite database. This would also mean that I couldn't use an db-vendor specific solution. What are my options here? What has worked well for people in these kinds of situations?
My primary worries are:
Stale data: there are plenty of queries imaginable that change the db in such a way that it prohibits an action that would seem possible to a user with stale data.
Maintainability: How loosely can I couple in this new layer? It would obviously be preferable if it didn't have to know everything about my internal lazy object system and about every object and possible query
Final question
What would be a good way to minimize the cost of making a query? Good meaning some sort of combination of: maintainable, easy to implement, not too application specific. If it comes down to pick any 2, then so be it. I'd like to hear people talk about their experiences and what they did to solve it.
As you can see, I've thought of some problems and ways of handling them, but I'm at a loss for what would constitute a sensible approach. Since it will probably involve quite a lot of work and intensive changes to many layers in the program (hopefully as few as possible), I thought about asking all the experts here before making a final decision on the matter. It is also possible I'm just overlooking a very simple solution, in which case a pointer to it would be much appreciated!
Assuming all relevant server-side tuning has been done (for example: MySQL cache, best possible indexes, ...)
Note: I've checked questions from users with similar problems, but they didn't entirely answer mine (for example, "Suggestion on a replication scheme for my use-case?" and "Best practice for a local database cache?").
If any additional information is necessary to provide an answer, please let me know and I will duly update my question. Apologies for any spelling/grammar errors, English is not my native language.
Note about "lazy"
A small example of what my code looks like (simplified of course):
QList<MyObject> myObjects = database->getObjects(20, 40); // fetch and construct objects 20 to 40 from the db
// ...some time later
// screen filling time!
foreach (const MyObject& o, myObjects) {
    o.getInt("status", 0); // == db request
    o.getString("comment", "no comment!"); // == db request
    // about 3 more of these
}
At first glance it looks like you have two conflicting goals: Query speed, but always using up-to-date data. Thus you should probably fall back to your needs to help decide here.
1) Your database is nearly static compared to use of the application. In this case use your option 1b and preload all the data. If there's a slim chance that the data may change underneath, just give the user an option to refresh the cache (fully or for a particular subset of data). This way the slow access is in the hands of the user.
2) The database is changing fairly frequently. In this case "perhaps" an SQL database isn't right for your needs. You may need a higher performance dynamic database that pushes updates rather than requiring a pull. That way your application would get notified when underlying data changed and you would be able to respond quickly. If that doesn't work, however, you want to construct your queries to minimize the number of DB library and I/O calls. For example, if you execute a sequence of select statements, your results should have all the appropriate data in the order you requested it. You just have to keep track of what the corresponding select statements were. Alternatively, if you can use looser query criteria so that a single query returns more than one row, that ought to help performance as well.
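The "fewer, looser queries" advice can be sketched as a single preload that fetches every displayed attribute for a range of objects in one round trip and then serves attribute reads from memory. This is a Python sketch using the standard sqlite3 module as a stand-in for the remote server; the object table, its columns and the preload function are hypothetical, and a Qt version would presumably do the same with one QSqlQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the remote database
# Hypothetical table and data, for illustration only.
conn.execute("CREATE TABLE object (id INTEGER PRIMARY KEY, status INTEGER, comment TEXT)")
conn.executemany(
    "INSERT INTO object VALUES (?, ?, ?)",
    [(i, i % 3, f"comment {i}") for i in range(1, 101)],
)


def preload(conn, first_id, last_id, columns=("status", "comment")):
    """Fetch the displayed attributes for a range of objects in one round trip."""
    # Column names come from a fixed list in this sketch, never from user input.
    sql = f"SELECT id, {', '.join(columns)} FROM object WHERE id BETWEEN ? AND ?"
    cache = {}
    for row in conn.execute(sql, (first_id, last_id)):
        cache[row[0]] = dict(zip(columns, row[1:]))
    return cache


cache = preload(conn, 20, 40)        # one query instead of one per attribute
status = cache[20].get("status", 0)  # served from memory, no database call
print(len(cache), status)
```

A user-triggered refresh, as suggested in option 1), would simply call preload again for the visible range.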

Tools and tips for switching CMS

I work for a university, and in the past year we finally broke away from our static HTML site of several thousand pages and moved to a Drupal site. This obviously entails massive amounts of data entry.
What if you're already using a CMS and are switching to another one that better suits your needs? How do you minimize the mountain of data entry during such a huge change? Are there tools built for this, or some best practices one should follow?
The Migrate module for Drupal would provide a big help. The data migration to Drupal will give you an overview of the process.
The video from the Migration: not just for the birds presentation at Drupalcon DC 2009 is probably somewhat out-of-date, but also gives a good introduction.
Expect to have to both pre-process and post-process your data manually, whatever happens. Accept early on that your data is likely to be in a worse state than you think it is: fields will be misused; record-to-record references (foreign keys) might not be implemented properly, or at all; content is likely to need weeding and occasionally to be just bad or incorrect.
Check your database encoding. Older databases won't be in Unicode encodings, and get grumpy if you have to export data dumps and import them elsewhere. Even then, assume that there'll be some wacky nonprintable characters in your data: programs like Word seem to somehow inject them everywhere, and I've seen... codepoints... you people wouldn't believe. Consider sweeping your data before you even start (or even sweeping a database dump) for these characters. Decide whether or not to junk them or try to convert them in the case of e.g. Word "smart" punctuation characters.
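A sweep of this kind might look like the following Python sketch, which converts a handful of Word "smart" punctuation characters and drops control and invisible format characters while recording what it removed. The mapping table and the decision to drop rather than flag are assumptions for this example; a real migration would tune both to the data.

```python
import unicodedata

# Word "smart" punctuation mapped to plain ASCII equivalents (partial list).
SMART_PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
}
KEEP = {"\n", "\r", "\t"}           # control characters that are legitimate in text


def sweep(text):
    """Convert smart punctuation and drop control/format characters.

    Returns the cleaned text and the list of dropped characters, so the
    dropped ones can be logged and reviewed.
    """
    cleaned, dropped = [], []
    for ch in text:
        if ch in SMART_PUNCTUATION:
            cleaned.append(SMART_PUNCTUATION[ch])
        elif ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf"):
            cleaned.append(ch)
        else:
            dropped.append(ch)
    return "".join(cleaned), dropped


clean, dropped = sweep("\u201cOpen day\u201d \u2013 10\x0bam\u200b")
print(repr(clean), dropped)
```

Running it over a database dump before the import gives an early picture of how much junk there is.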
It's very difficult to create explicit data structures from implied ones. If your incoming data has a separate date field, you can map that to a date field; if it has a date as part of a big lump of HTML, even if that date is in a tag with an id attribute, simple scripting won't work. You could use offline scripting with BeautifulSoup or (if your HTML's a bit nicer) the faster lxml to pre-process your data set, extract those implicit fields, and save them into an explicit format. Consider creating an intermediate database where these revisions are going to go.
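To give a feel for that pre-processing step, here is a small sketch that pulls the text of an element with a known id out of an HTML lump. It uses Python's standard html.parser instead of BeautifulSoup or lxml purely to stay dependency-free; the FieldExtractor class, the pubdate id and the sample markup are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class FieldExtractor(HTMLParser):
    """Collect the text inside elements whose id attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None   # id of the element being captured
        self._depth = 0        # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self._current is not None:
            self._depth += 1
            return
        element_id = dict(attrs).get("id")
        if element_id in self.wanted:
            self._current, self._depth = element_id, 1
            self.fields[element_id] = ""

    def handle_endtag(self, tag):
        if tag in VOID or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


# Invented sample of legacy page content.
html = ('<div class="story"><p>Posted <span id="pubdate">2009-03-04</span></p>'
        '<p>Body text</p></div>')
extractor = FieldExtractor(["pubdate"])
extractor.feed(html)
print(extractor.fields)
```

The extracted values would then be written to the intermediate database as proper columns. Badly broken markup (unclosed tags, for instance) will confuse the depth counting here, which is one reason to prefer the more forgiving libraries mentioned above.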
The Migrate module is excellent, but to get really good data fidelity and play more clever tricks you might need to learn about its hook system (Drupal's terminology for functions following a particular naming scheme) and the basics of writing a module to put these hooks in (a module is broadly just a PHP file where all the functions begin with the same text, the name of the module file.)
All imported content should be flagged for at least a cursory check. You can do this by importing it with status=0 i.e. unpublished, and then create a view with the Views module to go through the content and open it in other tabs for checking. Views Bulk Operations lets you have a set of checkboxes alongside your view items, so you could approve many nodes at once.
Expect to run and re-run and re-run the import, fixing new things every time. Check ten, or twenty items, as early as possible. If there are any problems, check ten or twenty more. Fix and repeat the import.
Gauge how long a single import run is likely to take. Be pessimistic: we had an import we expected to take ten hours encounter exponential slowdown when we introduced the full data set; until we finally fixed some slow queries, it was projected to take two weeks.
If in doubt, or if you think the technical aspects of the above are just going to take more time than the work itself, then just hire temps to do the data. But you still need decent quality controls, as early as possible during their work. Drupal developers are also for hire: try your country's relevant IRC channel, or post a note in a relevant group. They're more expensive than temps but they usually write better PHP...! Consider hiring an agency too: that's a shameless plug, as I work for one, but sometimes it's best to get experts in for these specific jobs.
Really good imports are always hard, harder than you expect. Don't let it get you down!
Migrate + table wizard (and schema + views) is the way to go. With table wizard you can expose any table to Drupal and map fields accordingly using migrate.
Look here for a detailed walkthrough:
You'll want to have access to the existing data from Django. This helps me a lot with migrating: . With correct model definitions you'll have full Django power including the admin. In fact, I'm using Django just as an admin backend for several legacy PHP projects; Django's admin can easily outdo a lot of custom hand-written admin scripts.
Authorization should remain the same. Users should be able to log in with their existing credentials, but it is hard to write a migration script for auth data because the password hashing schemes may be different and there is no way to convert between them without knowing the plain passwords. Django provides a way to support different sources of auth, so you can write a Drupal auth backend:
There is no need to do a full rewrite. If some parts are working fine they can still be powered by Drupal. New code can be written using Django with the same UI. Routing between old and new parts can be performed by web server url rewriting. Both the Django and Drupal parts can be powered by the same DB.

At what point is it worth using a database?

I have a question relating to databases and at what point it is worth diving into one. I am primarily an embedded engineer, but I am writing an application using Qt to interface with our controller.
We are at an odd point where we have enough data that it would be feasible to implement a database (around 700+ items and growing) to manage everything, but I am not sure it is worth the time to deal with right now. I have no problems implementing the GUI with files generated from Excel and parsed in, but it gets tedious and hard to track even with VBA scripts. I have been playing around with converting our data into something more manageable for the application side with Microsoft Access and that seems to be working well. If that works out I am only a step (or several) away from using an SQL database and using the Qt library to access and modify it.
I don't have much experience managing data at this level and am curious what may be the best way to approach this. So what are some of the real benefits of using a database if any in this case? I realize much of this can be very application specific, but some general ideas and suggestions on how to straddle the embedded/application programming line would be helpful.
This is not about putting a database in an embedded project. It is also not a business type application where larger databases are commonly used. I am designing a GUI for a single user on a desktop to interface with a micro-controller for monitoring and configuration purposes.
I decided to go with SQLite. You can do some very interesting things with data that I didn't really consider an option when first starting this project.
A database is worthwhile when:
Your application evolves to some form of data driven execution.
You're spending time designing and developing external data storage.
You're sharing data between applications or organizations (including individuals).
The data is no longer short and simple.
Data is being duplicated.
Evolution to Data Driven Execution
When the data is changing but the execution is not, this is a sign that the program, or parts of it, are data driven. A set of configuration options is a sign of a data driven function, but the whole application may not be data driven. In any case, a database can help manage the data. (The database library or application does not have to be huge like Oracle, but can be lean and mean like SQLite).
Design & Development of External Data Structures
Posting questions to Stack Overflow about serialization or converting trees and lists to use files is a good indication your program has graduated to using a database. Also, if you are spending any amount of time designing algorithms to store data in a file, or designing the layout of the data in a file, it is a good time to research the usage of a database.
Sharing Data
Whether your application is sharing data with another application, another organization or another person, a database can assist. By using a database, data consistency is easier to achieve. One of the big issues in problem investigation is that teams are not using the same data. The customer may use one set of data, the validation team another, and development yet another. A database makes versioning the data easier and allows entities to use the same data.
Complex Data
Programs start out using small tables of hard coded data. This evolves into using dynamic data with maps, trees and lists. Sometimes the data expands from a simple two columns to 8 or more. Database theory and databases can ease the complexity of organizing data. Let the database worry about managing the data and free up your application and your development time. After all, how the data is managed is not as important as the quality of the data and its accessibility.
Data Duplication
Often times, when data grows, there is an ever growing attraction for duplicate data. Databases and database theory can minimize the duplication of data. Databases can be configured to warn against duplications.
Moving to using a database has many factors to be considered. Some include but are not limited to: data complexity, data duplication (including parts of the data), project deadlines, development costs and licensing issues. If your program can run more efficiently with a database, then do so. A database may also save development time (and money). There are other tasks that you and your application can be performing than managing data. Leave data management up to the experts.
What you are describing doesn't sound like a typical business application, and many of the answers already posted here assume that this is the kind of application you are talking about, so let me offer a different perspective.
Whether or not you use a database for 700 items is going to depend greatly on the nature of the data.
I would say that, about 90% of the time at this scale, you will benefit from a light-weight database like SQLite, provided that:
The data may potentially grow substantially larger than what you are describing,
The data may be shared by more than one user,
You may need to run queries against the data (which I don't think you're doing right now), and
The data can easily be described in table form.
The other 10% of the time, your data will be highly structured, hierarchical and object-based, and won't fit neatly into the table model of a database or Excel table. If this is the case, consider using XML files.
I know developers instinctively like to throw databases at problems like this, but if you are currently using Excel data to design user interfaces (or display configuration settings), rather than display a customer record, XML may be a better fit. XML is more expressive than either Excel or database tables, and can be easily manipulated with a simple text editor.
XML parsers and data binders for C++ are easy to find.
I recommend introducing a database into your app: your application will gain flexibility and will be easier to maintain and to improve with new features in the future.
I would start with a lightweight file-based db like SQLite.
With a well designed db you'll have:
Reduced data redundancy
Greater data integrity
Improved data security
Last but not least, using a database will save you from the Excel import/update/export Hell!
Reasons for using a database:
Concurrent writes. It's easy to achieve concurrency in databases
Easy querying. SQL queries tend to be much more concise than procedural code for searching data. UPDATEs and INSERT INTOs can also do lots of stuff with very little code
Integrity. Constraints are very easy to define and are enforced without writing code. If you have a non-null constraint, you can rest assured that the value won't be null, no need to write checks anywhere. If you have a foreign key constraint in place, you won't have "dangling references".
Performance over large datasets. Indexing is very simple to add to an SQL database
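The integrity point can be seen in a few lines. In this Python sketch (standard sqlite3 module; the controller and parameter tables are hypothetical), the database itself rejects a NULL value and a dangling reference, with no checking code in the application. Note that SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
# Hypothetical schema, for illustration only.
conn.executescript("""
    CREATE TABLE controller (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE parameter (
        id INTEGER PRIMARY KEY,
        controller_id INTEGER NOT NULL REFERENCES controller(id),
        name TEXT NOT NULL,
        value REAL NOT NULL
    );
    INSERT INTO controller VALUES (1, 'main');
""")

INSERT = "INSERT INTO parameter (controller_id, name, value) VALUES (?, ?, ?)"


def try_insert(params):
    """Return None on success, or the database's error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT, params)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


ok = try_insert((1, "gain", 0.5))                 # valid row
null_error = try_insert((1, "offset", None))      # violates NOT NULL
dangling_error = try_insert((99, "gain", 1.0))    # no controller with id 99
row_count = conn.execute("SELECT COUNT(*) FROM parameter").fetchone()[0]
print(ok, null_error, dangling_error, row_count)
```

Only the valid row ends up in the table; the two bad inserts fail with an error the application can report.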
Reasons for not using a database:
It tends to be an extra dependency (although there exist very lightweight databases- I like H2 for Java, for instance)
Data not well suited to a relational schema. Things that are basically key/value maps. XML (although databases often support XPath, etc.).
Sometimes files are more convenient. They can be diff'ed, merged, edited with a plain text editor, etc. Sometimes spreadsheets can be more practical (you don't have to build an editor- you can use a spreadsheet program)
Your data is already somewhere else
When you have a lot of data and you're not sure how it will be exploited in the future.
For example, you might want to add an SQLite database to an embedded application that needs to record statistics when you're not yet sure how they will be used. Later you can send the full database to be loaded into a bigger one running on a central server, where the data can easily be exploited using queries.
In fact, if your application's purpose is to "gather data" then having a database is a must.
I see quite a few requirements that are well met by databases:
1). Ad hoc queries. Find me all the {X} that meet criteria Y
2). Data with structure that can benefit from normalisation - factoring out common values into separate "tables". You can save space and reduce the possibility of inconsistency this way. Once you've done this then those ad-hoc queries start to be really useful.
3). Large data volumes. Professional databases are very good at making good use of resources, with clever query optimisations and paging strategies. Trying to write this stuff yourself is a real challenge.
You clearly don't need that last one, but the other two may well apply to you.
Don't forget that the appropriate database can be quite different depending on your requirements (and don't forget that a text file could be used as a database if your requirements are simple enough; for example, config files are just a specific kind of database). Such parameters might be:
number of records
size of data items
does the database need to be shared with other devices? Concurrently?
how complex are the relationships between the various pieces of data
is the database read only (created at build time and not changed, for example)?
does the database need to be updated by multiple entities concurrently?
do you need to support complex queries?
For a database with 700 entries, an in-memory sorted array loaded from a text file might well be appropriate. But I could also see the need for an embedded SQL database or maybe having the controller request data from the database over a network connection depending on what the various requirements (and resource limitations) are.
There isn't a specific point at which a database is worthwhile. Instead I usually ask the following questions:
Is the amount of data the application uses/creates growing?
Is the upper limit of this data growth unknown (or unclear)?
Will the application need to aggregate or filter this data?
Could there be future uses of the data that may not be obvious right now?
Is performance of data retrieval and/or storage important?
Are there (or could there be) multiple users of the application who share data?
If I answer 'Yes' to most of these questions I almost always choose a database (as opposed to other options such as XML/ini/CSV/Excel/text files or the filesystem).
Also, if the application will have many users who could be accessing the data concurrently, I'll lean towards a full database server (MySQL, SQL Server, Oracle etc).
But often in a single user (or small concurrency) situation, a local database such as SQLite cannot be beaten for portability and ease of deployment.
To add a negative: a database is not suitable for real-time processing, due to non-deterministic latency. However, it would be quite ample for looking up and setting operating parameters, for instance during startup. I would not put database accesses on time-critical paths.
You don't need a database if you have a few thousand rows in one or two tables to handle in a single user app (for the embedded point).
If it is for multiple users (concurrent access, locking) or there is a need for transactions, you definitely should consider a database.
Handling complex data structures in normalized tables and maintaining integrity, or a huge amount of data, would be another indication you should use a database.
It sounds like your application is running on a desktop computer and simply communicating to the embedded device.
As such using a database is much more feasible. Using one on an embedded platform is a much more complex issue.
On the desktop front I use a database when there is the need to store new information continuously and the need to extract that information in a relational way. What I don't use databases for is storing static information, information I read once at load and that's it. The exception is when the application has many users and there is the need to store this information on a per-user basis.
It sounds to me like you're collecting information from your embedded device, storing it somehow, then using it later to display via a GUI.
This is a good case for using a database, especially if you can architect the system such that there is a data collection daemon that manages the continuous communication with the embedded device. This app can then just write the data into the database. When the GUI is launched it can extract the data for display.
Using the database will also ease your GUI development if you need to display different views, such as "show me all the entries between 2 dates". With a database you just ask it for the correct values with a proper SQL query and the GUI displays whatever comes back, allowing you to decouple much of the "business logic" code from the GUI.
We are also facing a similar situation. We have sets of data coming from different test setups, and they are currently being dumped into Excel sheets and processed using Perl or VBA.
We found that this method had a lot of problems:
i. Managing data using excel sheets is quite cumbersome. After some time you have a whole lot of excel sheets and there is no easy way to retrieve required data from it.
ii. People start sending the excel sheets to and fro for comments and review through e-mails. E-Mail becomes the primary mode of managing the comments related to the data. These comments are lost at a later point of time and there is no way of retrieving it back.
iii. Multiple copies of the files get created and changes in one copy are not reflected in the other - there is no versioning.
For these reasons we have decided to move to a database-based solution and are currently working on it. Let me summarize what we are trying to do:
i. The database is in a central server accessible by PC in all the test setups.
ii. All the data goes into a temporary location (local hard disk in files) as soon as it is generated. From the files, it is pushed into database by a process running in the background (so even if there is a network problem, data will be present in the local files system).
iii. We have a web-based application which allows users to log in and access data in the format they want. The portal will allow them to add comments, generate different kinds of reports, share them with other users after review, etc. It will also have the ability to export data into an Excel sheet, just in case you need to take it with you.
Let me know if this can be implemented better.
"At what point is it worth using a database?"
If and when you've got data to manage?

Query building in a database agnostic way

In a C++ application that can use just about any relational database, what would be the best way of generating queries that can be easily extended to allow for a database engine's eccentricities?
In other words, the code may need to retrieve data in a way that is not consistent among the various database engines. What's the best way to design the code on the client side to generate queries in a way that will make supporting a new database engine a relatively painless affair.
For example, if I have (MFC) code that looks like this:
CString query = "SELECT id FROM table";
results = dbConnection->Query(query);
and we decide to support some database that uses, um, "AVEC" instead of "FROM". Now whenever the user uses that database engine, this query will fail.
Options so far:
Worst option: have the code making the query check the database type.
Better option: Create query request method on the db connection object that takes a unique query "code" and returns the appropriate query based on the database engine in use.
Betterer option: Create a query builder class that allows the caller to construct queries without using any SQL directly. Once the query is completed, the caller can invoke a "Generate" method which returns a query string appropriate for the active database engine.
Best option: ??
Note: The database engine itself is abstracted away through some thin layers of our own creation. The queries themselves are the only remaining problem.
I've decided to go with the "better" option (query "selector") for two reasons.
Debugging: As mentioned below, debugging is going to be slightly easier with the selector approach since the queries are pre-built and listed out in a readable form in code.
Flexibility: It occurred to me that there are some databases which might have vastly better and completely different ways of solving a particular query. For example, with Access I perform a complicated query on multiple tables each time because I have to, but on SQL Server I'd like to set up a view. Selecting from the view and from several tables are completely different queries (I think), and this query selector would handle it easily.
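A minimal sketch of what such a query selector might look like, assuming a two-engine setup like the Access/SQL Server case described above. All names here (Engine, QueryId, QuerySelector, the table and view names) are hypothetical and only illustrate the lookup-by-code idea; the SQL strings are placeholders, not queries from a real schema.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical identifiers; none of these names come from the original post.
enum class Engine { Access, SqlServer };
enum class QueryId { ActiveCustomerIds };

class QuerySelector {
public:
    explicit QuerySelector(Engine engine) : engine_(engine) {}

    // Returns the hand-written SQL registered for this query code on the
    // active engine, or throws if nothing was registered.
    const std::string& get(QueryId id) const {
        static const std::map<std::pair<QueryId, Engine>, std::string> table = {
            // Access: join the (assumed) tables on every call.
            {{QueryId::ActiveCustomerIds, Engine::Access},
             "SELECT c.id FROM customers AS c "
             "INNER JOIN orders AS o ON o.customer_id = c.id"},
            // SQL Server: read from an (assumed) pre-built view instead.
            {{QueryId::ActiveCustomerIds, Engine::SqlServer},
             "SELECT id FROM active_customers_view"},
        };
        auto it = table.find(std::make_pair(id, engine_));
        if (it == table.end())
            throw std::out_of_range("no query registered for this engine");
        return it->second;
    }

private:
    Engine engine_;
};

// Usage: the caller names the query and never sees engine-specific SQL.
void selectorDemo() {
    QuerySelector selector(Engine::SqlServer);
    assert(selector.get(QueryId::ActiveCustomerIds) ==
           "SELECT id FROM active_customers_view");
}
```

Because every query is a complete, readable string, each one can be pasted into a database console and tested on its own, which is the debugging benefit mentioned above.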
You need your own query-writing object, which can be inherited from by database-specific implementations.
So you would do something like:
DbAgnosticQueryObject* query = new PostgresSQLQuery();
// and so on
CString queryString = query->toString();
It can get pretty complicated in there once you go past simple selects from a single table. There are already ORM packages out there that deal with a lot of these nuances; it may be worth looking at them instead of writing your own.
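As a rough illustration of the inheritance idea, here is a sketch that uses one real dialect difference: PostgreSQL limits rows with a trailing LIMIT n, while SQL Server uses SELECT TOP n. The class names and the "items" table are hypothetical, std::string stands in for CString, and identifiers are not escaped, so this is only a starting point under those assumptions.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class: callers describe the query, subclasses render SQL.
class SelectQuery {
public:
    virtual ~SelectQuery() = default;

    SelectQuery& column(const std::string& c) { column_ = c; return *this; }
    SelectQuery& from(const std::string& t) { table_ = t; return *this; }
    SelectQuery& limit(int n) { limit_ = n; return *this; }

    // Each engine-specific subclass decides how the pieces become SQL text.
    virtual std::string toString() const = 0;

protected:
    std::string column_ = "*";
    std::string table_;
    int limit_ = 0;  // 0 means "no row limit"
};

// PostgreSQL puts the row limit at the end: SELECT ... LIMIT n
class PostgresSelectQuery : public SelectQuery {
public:
    std::string toString() const override {
        std::string sql = "SELECT " + column_ + " FROM " + table_;
        if (limit_ > 0) sql += " LIMIT " + std::to_string(limit_);
        return sql;
    }
};

// SQL Server puts it right after SELECT: SELECT TOP n ...
class SqlServerSelectQuery : public SelectQuery {
public:
    std::string toString() const override {
        std::string sql = "SELECT ";
        if (limit_ > 0) sql += "TOP " + std::to_string(limit_) + " ";
        return sql + column_ + " FROM " + table_;
    }
};

// Usage: the same calls produce different SQL depending on the subclass.
void builderDemo() {
    PostgresSelectQuery query;
    query.column("id").from("items").limit(10);
    assert(query.toString() == "SELECT id FROM items LIMIT 10");
}
```

Even this small case shows why the approach grows quickly: joins, quoting, and data types would each need their own virtual hooks.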
Best option: Pick a database, and code to it.
How often are you going to up and swap out the database on the back end of a production system? And even if you did, you'd have a lot more to worry about than just minor syntax issues. (Major stuff like join syntax, even datatypes can differ widely between databases.)
Now, if you are designing a commercial application where you want the customer to be able to use one of several back-end options when they implement it, then you may have to specify "we support Oracle, MS SQL, or MySQL" and code to those specific options.
All of your options can be reduced to
Worst option: have the code making the query check the database type.
It's just a matter of where you're putting the logic to check the database type.
The option that I've seen work best in practice is
Better option: Create query request method on the db connection object that takes a unique query "code" and returns the appropriate query based on the database engine in use.
In my experience it is much easier to test queries independently from the rest of your code. It gets a lot harder if you have objects that are piecing together queries from bits of syntax, because then you have to test the query-creation code and the query itself.
If you pull all of your SQL out into separate files that are written and maintained by hand, you can have someone who is an expert in SQL write them (you can still automate the testing of these queries). If you try to write query-generating functions you'll essentially have a C++ expert writing SQL.
Choose an ORM, and start mapping.
If you are to support more than one DB, your problem is only going to get worse.
And just think of the databases that are coming: cloud databases with no (or close to no) SQL, and object databases.
Take your queries outside the code - put them in the DB or in a resource file and allow overrides for different database engines.
If you use SPs it's potentially even easier, since the SPs abstract away your database differences.
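One way the resource-file idea could be sketched, assuming a made-up text format where each line is "name = SQL" for the default query and "name@engine = SQL" for an engine-specific override. The format, the function names, and the sample queries are all assumptions for illustration.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parses lines of the hypothetical form "name = SQL" or "name@engine = SQL".
// The split happens at the first " = ", so the SQL itself may contain "=".
std::map<std::string, std::string> loadQueries(std::istream& in) {
    std::map<std::string, std::string> queries;
    const std::string sep = " = ";
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = line.find(sep);
        if (pos == std::string::npos) continue;  // skip blank or malformed lines
        queries[line.substr(0, pos)] = line.substr(pos + sep.size());
    }
    return queries;
}

// Prefers the engine-specific override and falls back to the default text.
// Returns an empty string if the query name is unknown.
std::string lookup(const std::map<std::string, std::string>& queries,
                   const std::string& name, const std::string& engine) {
    auto it = queries.find(name + "@" + engine);
    if (it == queries.end()) it = queries.find(name);
    return it == queries.end() ? std::string() : it->second;
}

// Usage: an in-memory stream stands in for the resource file here.
void overrideDemo() {
    std::istringstream file(
        "first_ids = SELECT id FROM items LIMIT 5\n"
        "first_ids@sqlserver = SELECT TOP 5 id FROM items\n");
    auto queries = loadQueries(file);
    assert(lookup(queries, "first_ids", "sqlserver") ==
           "SELECT TOP 5 id FROM items");
}
```

Supporting a new engine then means adding override lines only for the queries that actually differ, with no change to the C++ code.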
I would think that what you would want to do, if you needed the ability to support multiple databases, would be to create a data provider interface (or abstract class) and associated concrete implementations. The data provider would need to support your standard query operators and other common functionality required to support your query operations (have a look at the IEnumerable extension methods in .NET 3.5). Each concrete provider would then translate these into specific queries based on the target database engine.
Essentially, what you do is create a database abstraction layer and have your code interact with it. If you can find one of these for C++, it would probably be worth buying instead of writing. You may also want to look for Inversion of Control (IoC) containers for C++ that would basically do this and more. I know of several for Java and C#, but I'm not familiar with any for C++.
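A small sketch of the provider idea in C++, under the assumption that callers only need one operation. DataProvider, InMemoryProvider, and Record are hypothetical names; the in-memory class is a stand-in so the example runs without a database, where a real engine-specific provider would issue its own query inside the same method.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Record {
    int id;
    std::string category;
};

// Hypothetical interface: callers ask for data, never for SQL.
class DataProvider {
public:
    virtual ~DataProvider() = default;
    virtual std::vector<int> idsInCategory(const std::string& category) const = 0;
};

// Stand-in implementation backed by a vector. A provider for a specific
// database engine would run its own query here instead.
class InMemoryProvider : public DataProvider {
public:
    explicit InMemoryProvider(std::vector<Record> rows) : rows_(std::move(rows)) {}

    std::vector<int> idsInCategory(const std::string& category) const override {
        std::vector<int> ids;
        for (const Record& r : rows_)
            if (r.category == category) ids.push_back(r.id);
        return ids;
    }

private:
    std::vector<Record> rows_;
};

// Client code depends only on the interface, not on any engine.
std::size_t countInCategory(const DataProvider& provider,
                            const std::string& category) {
    return provider.idsInCategory(category).size();
}

// Usage: swap in another DataProvider without touching countInCategory.
void providerDemo() {
    std::vector<Record> rows = {{1, "a"}, {2, "b"}, {3, "a"}};
    InMemoryProvider provider(rows);
    assert(countInCategory(provider, "a") == 2);
}
```

A side benefit of this shape is that the in-memory provider can double as a test fake for code that sits above the abstraction layer.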